Intro
In our last post, we discussed the fundamental Kubernetes hardening components of Kubernetes Security Posture Management. Configurations that prevent exposing your Kubernetes control plane to the internet, limiting the use of privileged access, and not running containers as root are critical for hardening your clusters and preventing the most common kinds of attacks. In this post, we will go beyond hardening to discuss how KSPM can help in the case of an incident in your cluster.
Tying KSPM to incident response
What happens if there is an incident in your clusters? How would you identify the incident? How could you contain it? And what is the overall level of “responsiveness” in your security posture?
Responsiveness will, in some ways, build upon the basic cyber hygiene we discussed in the last post. For example, in a serious security incident you may need to invoke the “break glass” roles we discussed in the Kubernetes RBAC section. A few other Kubernetes security posture measures will improve your ability to detect and respond to real incidents in your cluster.
Common Kubernetes Misconfigurations
Enable and Use Cluster Logging
Kubernetes logging is not just a security requirement. Logging is also a requirement for effective DevOps, providing the feedback developers need to drive bug-fixes and new releases based on the actual use of their apps and the bugs their users encounter. Logging also supports the work of SRE teams, helping them identify the root cause of an outage and work with developers on a fix. Similarly, when a possible security incident emerges, your security team will need to be able to trace activity through the logs to determine its nature, severity, and extent. Kubernetes by default collects system component and container runtime logs, storing them on the local filesystem and rotating them based on the cluster’s configuration. This is a good start, but ideally you will want to aggregate the logs somehow so that you can more easily monitor and search them.
Crawl: Use the Kubernetes Default: Kubernetes by default collects many important logs on the local file system of your nodes. You can search those logs using kubectl logs commands. That gets you started!
Walk: Deploy a logging agent: A logging agent allows you to (a) collect more logs, (b) sort and filter those logs based on your priorities, and (c) aggregate logs into a common storage archive for searching and analysis.
Run: Aggregate logs to a SIEM: If you are collecting logs into a single storage archive, the next step is to push them into a SIEM solution. This will allow you to organize, index, and search across them more easily.
Have Real-Time Monitoring in Your Cluster
Human based log analysis is necessary when you are retrospectively analyzing a security incident. Detecting that incident in the first place requires a degree of real-time monitoring and correlation. You can do some of that “manually” with well crafted dashboards, queries, and alerts in something like a SIEM solution, but it requires tremendous time and energy to surface the right information. To have a truly responsive security posture, you need something to automate that real-time monitoring and analysis and respond to events as they happen.
Crawl: Deploy a Detection and Response tool: The first step is deploying a tool that is able to analyze activity in your cluster and make real-time judgements on what it sees. Just deploying the tool with its default detection posture is a win.
Walk: Start tuning the detections based on your cluster: Any real-time detection and response tooling has to make judgment calls based on the presence of certain known factors. If your cluster has things like internal apps the tool has never seen before, lift-and-shift legacy software that might not act in the most Kubernetes “normal” way, or other features that could be “unusual”, you’re going to get a lot of false positives. The next step is tuning the detections based on your actual cluster so that you have a better signal-to-noise ratio.
Run: Actively monitor with real-time KSPM: Detection and response tooling is usually configured to generate an alert whenever it finds something it thinks is out of the ordinary. The thing is, someone has to see and respond to that alert. Email is great, but how often have you seen an email exactly when it came in? The next step is to actively monitor those alerts and whatever other telemetry the tooling is giving you so that you reduce your response time as much as possible. Ensure the misconfigurations you are seeing in Kubernetes are tied in real-time to the Kubernetes lifecycle versus polling intervals so you have full historical context.
Container Runtime Security
Use an Admission Controller
Not all KSPM solutions include admission control, but this is a critical security feature in controlling what is deployed via Kubernetes. To understand the most basic policies to set, the OWASP Top 10 for Kubernetes is helpful. An admissions controller allows you to enforce policy on Kubernetes objects at deployment time, rejecting those that do not conform to your requirements. It can be used for a variety of purposes, including preventing containers with root permissions from being deployed, verifying artifact signatures before allowing admission, or blocking certain “known-bad” images. Some admissions controllers can also be used to check existing objects in the cluster for compliance (or can be integrated with a scanning tool that does this), giving you the ability to find and remediate insecure or out-of-compliance cluster resources. The use of an admissions controller provides several pathways to greater responsiveness in your cluster. You can rely on its existing ruleset to automatically block objects that run afoul of your policies, providing some degree of built-in responsiveness to bad actors attempting to deploy known-bad or non-compliant resources into your cluster. You can also modify the rules, giving you a pathway for gradually strengthening the security posture of your running workloads. You could even add new rules in direct response to a security threat or incident, giving you another tool as part of your response toolkit.
Crawl: Just deploy one: Get started by deploying an admissions controller with its default rule set. This will provide some degree of protection right out the gate and give you a chance to understand the tool.
Walk: Add a cluster assessment component: This will let you scan your existing workloads against the admissions controller ruleset. It does not eject those workloads, so it’s not a disruptive measure, but it gives you the ability to identify workloads that wouldn’t meet your admissions controller rules and take manual action to remediate.
Run: Begin to write your own rules and do a dry-run before enforcing: Adapt the existing ruleset of the admissions controller based on your specific security requirements and ensure you and your engineering teams understand the impact of admission control policies before enforcing them. As you get comfortable with this, it’ll be easier to add rules on the fly in response to emerging or detected security threats.
Conclusion
Though KSPM is generally more helpful for hardening your clusters versus responding in an incident, there are still configurations and capabilities that can help prevent and respond to incidents, as well as understand the historical context of what happened. Generally, you will need real-time capabilities to excel here, and you will want as much flexibility as possible when dealing with admission control. In the next post in the series, we will discuss more advanced configurations for a defense-in-depth strategy.
