In a series of articles we are going to explore Kubernetes Security Posture Management (KSPM), one of the basic tenets of Kubernetes security. This first post in the series will focus specifically on fundamental “cyber hygiene” practices aimed at preventing the most common attack vectors against your cluster. Within each post, we’ll also provide suggestions for improving your posture following a “crawl, walk, run” maturity model.
Ever seen those advertisements for posture training devices? They are designed to alert you when you are standing or sitting in a way that puts extra strain on your neck and spine. The concept of “posture” also applies to cyber security. In this case, bad security posture potentially impacts your ability to respond to new and emerging threats because of extra “strain” on your security capabilities caused by misconfigurations, gaps in tooling, or inadequate training.
Security Posture Management can be applied to different areas of your organization’s technical landscape, including the cloud (CSPM) and Kubernetes (KSPM). When we recently discussed the difference between CSPM and KSPM, we noted that:
Cloud Security Posture Management will focus on things like hardening hosts and networks, deploying EDR agents across your whole fleet, and restricting access to particular infrastructure components. In contrast, Kubernetes Security Posture Management will focus on things like API security, controlling access to the control plane, properly provisioning RBAC roles within the cluster, admissions control, container permissions, and real-time workload protection.
Here we will describe some of the most common misconfigurations, specific to Kubernetes, that you will want to avoid.
One of the most important Kubernetes security measures is often missed because many managed Kubernetes services expose the control plane to the internet by default. Removing external access to the control plane and its API immediately raises the bar against exploits targeting vulnerabilities in control plane components. Attackers then need to either chain those vulnerabilities with exploits of exposed running workloads or compromise the account of someone with internal access to the control plane. Both of these attack vectors remain possible, and you'll need additional security measures for each. But taking the control plane off the public internet is like locking your front door: a motivated thief can drill through the deadbolt, sure, but most simply won't bother. Of course, you will still need a way into your cluster once the control plane is private. Here are some options for facilitating that:
Crawl: Use a bastion host, an internet-accessible server in the same private network as your cluster (but not joined as a node), as the gateway to your cluster. Ideally, spin the bastion up when you need it and shut it down when you don't, minimizing its exposure as well.
Walk: Use a cloud provider service to facilitate secure connections. For example, AWS customers can use Systems Manager (SSM) to connect to nodes in the cluster without a public IP, with AWS's IAM service handling authentication and authorization.
Run: The Zero Trust way: use an identity-aware proxy to broker access to the nodes in your network without making them available to the public internet. This can then tie into your existing identity provider and authorization system.
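Whichever option you choose, it is worth verifying that your API endpoint no longer resolves to a publicly routable address. A minimal sketch in Python, using only the standard library (the addresses shown are placeholders, not real endpoints):

```python
import ipaddress
import socket

def api_endpoint_is_private(endpoint: str) -> bool:
    """Return True if every address the API endpoint resolves to is private."""
    infos = socket.getaddrinfo(endpoint, 443, proto=socket.IPPROTO_TCP)
    addresses = {info[4][0] for info in infos}
    return all(ipaddress.ip_address(addr).is_private for addr in addresses)

# A control plane behind an RFC 1918 address is not internet-facing.
print(api_endpoint_is_private("10.0.12.5"))   # True
# A public address means the control plane is still exposed.
print(api_endpoint_is_private("8.8.8.8"))     # False
```

In practice you would point this at your cluster's API server hostname; a private-only result is necessary, though not sufficient, evidence that the control plane is off the public internet.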
Once someone has network connectivity to your cluster, step two is authenticating to the cluster to assume a role. Kubernetes largely outsources authentication; it does not even have an API object for "normal" users. Instead, it assumes you will authenticate users before they reach the cluster. You tell the cluster what type of authentication material to expect and where to expect it from, and the cluster then trusts any assertions found in valid (i.e., properly sourced and signed) authentication material it is given. Further, Kubernetes does not support revoking authentication material, so the material must be issued with an expiration by the provider. This puts the burden of securing authentication entirely on whatever external authentication system you adopt. Once a user is authenticated, authorization is handled natively via Kubernetes RBAC. Kubernetes accepts x509 client certificates and bearer tokens as valid authentication material, which gives you a few ways of generating and securing the necessary materials.
Crawl: Use your cloud provider: if you are using a managed Kubernetes service, your cloud provider may have a way of translating its native authentication protocol to a bearer token for Kubernetes authentication. This moves the problem back one step: now you need to protect authentication to your cloud provider (ideally using SSO from your existing IdP).
Walk: Use your existing IdP: Kubernetes supports OIDC for authentication, so if your IdP is an OIDC provider, you can use it to authenticate directly to the cluster (rather than using it to authenticate to the cloud provider and then using the cloud provider to authenticate to the cluster).
Run: The Zero Trust way: Zero Trust architectures broker access through identity-aware proxies. If you have configured your cluster nodes to only be accessible via Zero Trust, you’ve already established an identity when you connect to those nodes. You can use the same Zero Trust architecture to establish your identity to the cluster itself.
Role-Based Access Control (RBAC) is intended to help enforce the Principle of Least Privilege. A corollary is that roles and groups with expansive privileges should be narrowly restricted in their assignment (and, ideally, used by those assigned them only when necessary). This means restricting who in your cluster has access to powerful roles (e.g., cluster-admin) and groups (system:masters, for example). The system:masters group in particular should be limited to break-glass scenarios in which other ways of accessing or controlling the cluster have been lost.
Crawl: Restrict privileged access to a group: This is the essence of what RBAC requires: privileged access is limited to only those who need it.
Walk: Require elevation: The next step in tightening privileged access is to make it standard practice for members of the privileged group to work from a lower-privileged account, re-authenticating with a more privileged account only when needed. This brings two advantages: first, an added layer of protection for privileged access, and, second, a clearer audit trail for all privileged activities.
Run: Restrict privileged access to break-glass only: This pairs especially nicely with a GitOps deployment and management system (see next item). In essence, don't hand out access to admin or otherwise privileged accounts at all; keep their credentials in a secure place, to be used only in a break-glass scenario.
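Keeping privileged access narrow is easier if you periodically enumerate who is actually bound to powerful roles. A minimal sketch that scans ClusterRoleBinding objects, represented here as plain dicts in the shape returned by the Kubernetes API; the binding data is fabricated for illustration:

```python
def find_privileged_subjects(bindings: list[dict], role: str = "cluster-admin") -> list[str]:
    """Return the subjects bound to the given ClusterRole."""
    subjects = []
    for binding in bindings:
        if binding.get("roleRef", {}).get("name") == role:
            for subject in binding.get("subjects", []):
                subjects.append(f'{subject["kind"]}:{subject["name"]}')
    return subjects

# Fabricated example bindings.
bindings = [
    {"roleRef": {"name": "cluster-admin"},
     "subjects": [{"kind": "Group", "name": "system:masters"},
                  {"kind": "User", "name": "jane"}]},
    {"roleRef": {"name": "view"},
     "subjects": [{"kind": "Group", "name": "developers"}]},
]
print(find_privileged_subjects(bindings))  # ['Group:system:masters', 'User:jane']
```

Run against live data, any unexpected subject in the output is a candidate for demotion to a lower-privileged role or removal entirely.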
The idea behind GitOps is that all changes in your cluster are made through changes (managed by git) to your Configuration as Code (CaC). As a result, there should be no manual changes made in the cluster. This may sound at first like a radical application of the Principle of Least Privilege, perhaps even taking it to its logical extreme. But it turns out GitOps has benefits well beyond security (in fact, the security benefits may be a happy accident). GitOps brings predictability and stability to cluster deployments. It also ensures admins know the state of the cluster (i.e., that there has been no configuration drift) and maintains parity between test and production clusters built from the same codebase. Of course, it also has the security benefit of dramatically reducing the number of users with "write" access in the cluster. That's pretty nice, too.
Crawl: Deploy with a simple CICD job: When your changes are approved and merged to main, run a simple "helm upgrade" job. This is easy to implement, but requires giving your CICD system at least one (possibly more, depending on how your CaC is organized) fairly privileged account in your cluster.
Walk: Use a GitOps operator: Instead of pushing changes out directly from your CICD, this approach pulls changes in using an operator in the cluster that watches your git repos for changes. Now, instead of granting your CI tooling credentials for your cluster, you grant the single operator already running in your cluster read access to the relevant CaC repos.
Run: Make by-hand changes break-glass only: Once your GitOps workflow is going smoothly, there shouldn’t be a need for user roles in your production clusters that can make manual changes. Dev, test and (possibly) staging clusters should probably never reach this level of “maturity,” as part of the point of those is trying things out.
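The "no manual changes" guarantee is enforceable in code: a GitOps operator continually compares the manifests in git against the live cluster state and reconciles any drift. A toy sketch of that comparison over plain dicts (real operators such as Argo CD or Flux do this against the Kubernetes API; the field names here are illustrative):

```python
def detect_drift(desired: dict, live: dict) -> dict:
    """Return the fields where live state differs from the declared state."""
    drift = {}
    for key, want in desired.items():
        have = live.get(key)
        if have != want:
            drift[key] = {"declared": want, "live": have}
    return drift

# Fabricated example: someone bumped the replica count by hand in the cluster.
desired = {"replicas": 3, "image": "registry.example.com/app:1.4.2"}
live = {"replicas": 5, "image": "registry.example.com/app:1.4.2"}
print(detect_drift(desired, live))  # {'replicas': {'declared': 3, 'live': 5}}
```

An operator would either revert such drift automatically or surface it for review, which is exactly what makes by-hand changes visible and, ultimately, unnecessary.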
Underneath it all, Kubernetes workloads run as containers, and those containers run as processes on a host. Those processes have privileges (as does any process running on a host) derived from a combination of the user on the host running the container and the user declared inside the container to run the workload's processes. There are therefore two ways to limit the privileges granted to a workload: first, run the container under a non-privileged (i.e., non-root) user on the host, and second, run the workload inside the container as a non-privileged user.
Ultimately, if an attacker manages to "escape" a container, they inherit the host user's privileges, so the first prong is arguably more important. However, even a less privileged host user retains some significant capabilities (for example, the ability to run other, potentially malicious, containers). It is therefore desirable to minimize opportunities for container escapes of any kind, and you can make escape much harder by using a non-root user for the workload inside the container.
Crawl: Audit your containers: The first step is to know what you have running in a privileged mode. Then you can begin removing privileges from workloads that don’t need it.
Walk: Use an admission controller: Start enforcing restrictions with an admission controller rule that blocks containers configured to run in privileged mode from being admitted at all.
Run: Check privileges during CICD: Evaluate containers for the use of root users during your CICD pipelines so that developers can fix the permissions before attempting a deployment.
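The audit step above can be sketched as a simple check over pod specs, whether run against the live cluster or in a CICD pipeline. The dicts below mirror the shape of the Kubernetes securityContext fields, and the pod data is fabricated for illustration:

```python
def audit_pod(pod_spec: dict) -> list[str]:
    """Return findings for containers that run privileged or may run as root."""
    findings = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        name = container.get("name", "<unnamed>")
        if ctx.get("privileged"):
            findings.append(f"{name}: runs in privileged mode")
        # Absent runAsUser/runAsNonRoot settings default to the image's user,
        # which is often root, so flag those too.
        if ctx.get("runAsUser", 0) == 0 or not ctx.get("runAsNonRoot"):
            findings.append(f"{name}: may run as root")
    return findings

# Fabricated example pod with one risky and one hardened container.
pod = {"containers": [
    {"name": "legacy", "securityContext": {"privileged": True}},
    {"name": "web", "securityContext": {"runAsNonRoot": True, "runAsUser": 1000}},
]}
print(audit_pod(pod))  # ['legacy: runs in privileged mode', 'legacy: may run as root']
```

Failing the pipeline when this list is non-empty pushes the fix back to developers, before anything reaches the cluster.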
The many possible misconfigurations in Kubernetes highlight the importance of KSPM in drastically reducing your attack surface. And since Kubernetes spans both build time and runtime, any discussion of KSPM must include incident response. The next post in this series will show how to set yourself up for incident response using KSPM.