On Establishing a Cloud Security Program

Congratulations! You have been tasked with establishing a cloud security strategy. Now what?

In this post, part of the “Cloud Security Strategies” series, I’m going to walk through actionable advice for establishing a cloud security program aimed at protecting a cloud native, service provider agnostic, container-based offering.

The Goal: a Roadmap for Cloud Security Teams

Security strategies focusing on cloud native solutions are becoming prominent within the industry, but it feels like everyone, due to a lack of shared knowledge, is reinventing the wheel every time.

In fact, there are not many public resources describing how to approach this topic: although different resources cover specific aspects of specific use cases (e.g., how to do container scanning, or how to deploy Open Policy Agent), there is no single holistic view of how to integrate everything together.

In this post, I will start with the foundations, and go through the different milestones (or maturity levels) required to reach a “best in class” solution to support and secure a product that spans multiple service providers (hence the requirement of not being tied to platform-specific solutions), runs on Kubernetes, and must comply with strict regulations (like the ones that apply to fintech companies).

The North Star

Before jumping into the details, I think it is important to define a “North Star” that can be used as a reference point (and driver) for the definition of your strategy.

These are the high-level goals that will then be reflected within the roadmap and mapped to actual controls that can be implemented. For cloud native solutions, I grouped these main pillars by the five functions of the NIST Cybersecurity Framework: Identify, Protect, Detect, Respond, and Recover.

Identify

Area	Goals
Architecture definition	Define and document architecture decisions, such as network architecture diagrams that clearly identify high-risk environments and data flows, and threat model documentation that supports the architecture definition. Define and document a data classification scheme that classifies data according to its sensitivity and is used to ensure the implemented security controls are consistent, sufficient, and proportional.
Immutable infrastructure	Embed Infrastructure as Code (IaC) principles throughout the development, release, and deployment processes, so as to ensure consistency and auditability of the resulting infrastructure. Follow Secure Software Development Life Cycle (SSDLC) practices for IaC, and perform code reviews to validate any change to the infrastructure and confirm that no reduction in security controls is introduced.

Protect

Area	Goals
Known good state	Configure each core component of the infrastructure according to a known and approved secure baseline, based on industry standards such as those from the Center for Internet Security (CIS), Cloud Security Alliance (CSA), and National Institute of Standards and Technology (NIST). Programmatically enforce the known good state by ensuring there are no deviations from the baseline.
Zero Trust model	Treat all hosting environments as hostile, encrypting data at rest and in flight, and retaining control of the associated cryptographic material. Enforce strong account authentication.
Micro blast radius	Contain and respond to potential breaches, segregate networks, and provision accounts following least privilege principles.
Strong authentication	Implement authentication schemes to ensure that principals are strongly authenticated and that the strength of each authentication mechanism increases proportionally with the criticality of the asset it protects. Configure Identity and Access Management (IAM) to enforce strict account segregation and to require Multi-Factor Authentication (MFA) for sensitive operations and privileged accounts. Utilize Role-Based Access Control (RBAC) to manage access to resources and workloads. Continuously validate the known good state through regular scanning of account privileges, to ensure no privilege creep or permission drift arises.
Continuous secure baseline validation	Continuously validate the approved secure baseline with an automated process, integrated within the CI/CD pipeline, that provides an inventory of assets as well as validation of cloud deployments and cluster configurations.
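
To make “programmatically enforce the known good state” a bit more concrete, here is a minimal sketch of a drift check that compares a reported configuration against an approved baseline. The setting names (`encryption_at_rest`, `public_access`, `versioning`) and the `find_drift` helper are hypothetical, chosen only for illustration; a real baseline would come from your own standards (e.g., a CIS benchmark) and the actual values from an inventory or scanning tool.

```python
from typing import Any


def find_drift(baseline: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return the settings whose actual value deviates from the baseline.

    Each drifting key maps to an (expected, actual) pair. Settings missing
    from the actual configuration are reported with actual=None.
    """
    drift = {}
    for key, expected in baseline.items():
        current = actual.get(key)
        if current != expected:
            drift[key] = (expected, current)
    return drift


# Hypothetical approved baseline for a storage bucket (illustrative keys only).
BASELINE = {
    "encryption_at_rest": True,
    "public_access": False,
    "versioning": True,
}

# Hypothetical configuration, as an inventory tool might report it.
actual_config = {
    "encryption_at_rest": True,
    "public_access": True,
}

drift = find_drift(BASELINE, actual_config)
for setting, (expected, current) in drift.items():
    print(f"{setting}: expected {expected!r}, found {current!r}")
```

Running a check like this on a schedule, or as a pipeline step, is one way to turn a written baseline into something that reports deviations automatically.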

Detect

Area	Goals
Assumed breach	Assume that, at any given time, your product, infrastructure, or an account (even an administrative one) could be compromised. Deploy controls to anticipate common Tactics, Techniques, and Procedures (TTPs) of attackers and identify potential Indicators of Compromise (IOCs). Monitor the entire tech stack and thoroughly log events.

Respond

Area	Goals
Containment	Leverage security monitoring to provide actionable events that trigger (semi-)automated containment. After containment is triggered, embed mechanisms for the forensic collection of evidence and recovery from the breach.
Business continuity	Test business continuity and security incident response plans at planned intervals, or upon significant organizational or environmental changes.

Recover

Area	Goals
Strong auditability and accountability	Consistently audit and assure immutable logs and traceability of the entire security solution

Building the Roadmap

As mentioned, these high-level goals provide macro-areas to work against, but they are very general (and open to interpretation). Taking a step further, how can they be applied to a cloud native platform, where multiple cloud service providers and Kubernetes clusters are involved?

Ideally, we would like to use a framework that:

Allows us to embrace an agile approach (with multiple iterations, which enable continuous improvement).
Is transparent to other engineering teams (i.e., security teams should be low friction and not be blockers).
Will ultimately lead to a solution that is compliant with industry regulations (e.g., ISO27001, PCI DSS, etc.) by “default”.

Hence, I took the Cloud Security Alliance (CSA) Cloud Controls Matrix (CCM), performed a gap analysis, and built a RACI matrix to map controls to security teams, selecting the areas directly applicable to a cloud security team (i.e., excluding controls like the physical security of a data center, which are usually not directly applicable to such teams). Then, I enhanced this list by adding cloud-specific controls I consider essential for a comprehensive program (usually also backed by the CNCF) and re-organized them into areas of interest.

In the sections below, I will explain in detail the main areas (Domains), workstreams (Controls), and actionable Tasks that compose the Roadmap: from the definition of high-level security policies, network architecture, IAM, and asset inventory; to monitoring, code provenance, and policy as code; and up to automatic enforcement of security policies, runtime anomaly detection, and business continuity.

Domains

Domains can be considered “macro-areas” that group sets of Controls:

Domain	Description
[1] Policies & Standards	Definition of Security Policies and Standards which provide reference documentation on best practices for cloud security, with a particular focus on cloud providers and containerization solutions.
[2] Architecture	Definition and review of architectural decisions, with particular focus on network architecture, identity and access management, secrets management, and data classification.
[3] Verification	Continuously verify and enforce that all cloud resources abide by the policies and the expected baseline configuration.
[4] Supply Chain Security	Enforce security controls throughout the pipeline. Image/Pod Security: enforcement of hardened base images and linting. Continuous Integration (CI): IaC scanning (Dockerfiles, Kubernetes manifests, Terraform, etc.). Continuous Delivery (CD): protect Supply Chain Integrity. In-Cluster Controls: preventative controls like admission controllers. Cloud Provider-Specific Controls: deploy guardrails (SCPs/Org Policies), restrict access.
[5] Monitoring and Alerting	Implement logging, monitoring, and alerting systems to gain visibility into activities and/or changes affecting the environments.
[6] Incidents and Remediation	Implement processes for containment, forensics, and automatic remediation of security violations.
[7] Business Continuity	Prepare countermeasures for unexpected incidents or disasters.

Controls

These domains can then be fleshed out into a variety of workstreams (or Controls).

Before exploring them in detail, it is worth noting that, generally speaking, a cloud security program can be implemented through a series of maturity levels. The sub-sections below provide an overview of the main initiatives that, for each Domain, could be undertaken at each level of maturity.

Maturity Level 1 - The foundations

Definition of Security Policies: start by defining some overarching policies that will define your overall approach and that the business will have to abide by (e.g., Cloud Security Policy, Vulnerability/Patch Management Standard).
Architecture: review the network architecture and ensure proper segregation of environments (especially production), review the Identity and Access Management Framework, as well as how secrets management is performed.
Verification: start with the so-called “low-hanging fruit” by validating that no obvious misconfigurations are present (both at the CSP and K8s level), and by starting to compile a list of public endpoints.
Supply Chain: deploy container image scanning, and start restricting access to privileged AWS/GCP users.
Monitoring: start defining a security logging strategy (I provided examples for both AWS and GCP).
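
As an illustration of the “list of public endpoints” item above, the sketch below walks a set of already-parsed Kubernetes manifests and reports the Services that may be reachable from outside the cluster. It relies on the fact that a Service without an explicit `spec.type` defaults to `ClusterIP`, while `LoadBalancer` and `NodePort` Services can be exposed externally. The `public_services` helper and the sample manifests are hypothetical, and a real inventory would also need to cover Ingress resources and cloud provider load balancers.

```python
def public_services(manifests: list[dict]) -> list[str]:
    """Return "namespace/name" for Services that may be reachable externally."""
    exposed_types = {"LoadBalancer", "NodePort"}
    found = []
    for doc in manifests:
        if doc.get("kind") != "Service":
            continue
        # A Service with no explicit type defaults to ClusterIP (internal only).
        service_type = doc.get("spec", {}).get("type", "ClusterIP")
        if service_type in exposed_types:
            meta = doc.get("metadata", {})
            namespace = meta.get("namespace", "default")
            name = meta.get("name", "<unnamed>")
            found.append(f"{namespace}/{name}")
    return sorted(found)


# Hypothetical manifests, as they would look after parsing the YAML files.
manifests = [
    {"kind": "Service", "metadata": {"name": "web", "namespace": "prod"},
     "spec": {"type": "LoadBalancer"}},
    {"kind": "Service", "metadata": {"name": "db", "namespace": "prod"},
     "spec": {}},
    {"kind": "Service", "metadata": {"name": "debug"},
     "spec": {"type": "NodePort"}},
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "prod"}},
]

endpoints = public_services(manifests)
print(endpoints)
```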

Maturity Level 2

Definition of Security Standards: continue developing standards covering more “advanced” topics like Key Management/Generation and Data Handling/Labeling.
Architecture: depending on the current state of IAM and Secrets management (found in Level 1), you might want to tackle processes like credentials management and user access provisioning.
Verification: start deploying a solution that can continuously provide an up-to-date asset inventory (for example, see “Mapping Moving Clouds: How to stay on top of your ephemeral environments with Cartography”). Improve the validation of the environments by deploying automation that can continuously report misconfigurations and drift.
Supply Chain: start working on securing the images used (define a list of base images and harden them). Enforce the use of these secure images in the CI/CD pipeline, and add automation able to scan Infrastructure as Code for security issues. Work with your Application Security team to ensure a system to prevent the leaking of secrets through the codebase is integrated into the pipeline.
Monitoring: deploy the security logging solution designed at Level 1, and ensure logs are collected from all environments. Start defining monitoring and alerting rules to act on indicators of compromise and/or known classes of issues.
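
To give an idea of what “preventing the leaking of secrets through the codebase” can look like at its simplest, here is a sketch of a pattern-based scanner. The first rule matches the well-known shape of long-term AWS access key IDs (`AKIA` followed by 16 upper-case alphanumeric characters); the second is a rough heuristic of my own for hard-coded credentials and will produce false positives and negatives. The rule names and the `scan_text` helper are hypothetical, and a dedicated secret scanning tool will do a far better job in a real pipeline.

```python
import re

# Illustrative rules only; real scanners ship many more, plus entropy checks.
RULES = {
    # Long-term AWS access key IDs: "AKIA" followed by 16 characters.
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Rough heuristic: a quoted value of 8+ characters assigned to a
    # variable whose name contains "password", "secret", or "token".
    "hardcoded-credential": re.compile(
        r"(?i)(?:password|secret|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line matching a rule."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


# The key below is the placeholder used in AWS documentation, not a real one.
sample = '''region = "eu-west-1"
aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"
db_password = "correct-horse-battery"
'''

findings = scan_text(sample)
for line_number, rule_name in findings:
    print(f"line {line_number}: {rule_name}")
```

In a pipeline, a non-empty list of findings would typically fail the build so that the secret never reaches the main branch.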

Maturity Level 3

Definition of Security Standards: keep extending standards to cover Identity and Access Management, Encryption, Key Management/Generation, Data Handling/Labeling, Change Management.
Verification: provide continuous identification of deviations from defined Security Policies and compliance frameworks (e.g., via AWS Security Hub and GCP Security Command Center), with a process integrated within the security pipeline (i.e., your SIEM). Start deploying guardrails (e.g., SCPs and Org Policies) to prevent entire classes of misconfigurations.
Supply Chain: ensure the configuration of the Kubernetes clusters and running containers is automatically validated to detect any misconfiguration. Address hardening of the AWS/GCP organizations.
Monitoring: start aggregating and reporting on both logged data and anomalies, and create visualizations/dashboards to facilitate their consumption. Deploy processes and tools to detect cases of credential compromise.
Remediation: employ processes to automate the remediation of (at least) the most common types of misconfigurations.
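
The Supply Chain item at this level calls for automatic validation of cluster and container configuration. As a rough sketch of the kind of check involved, the function below inspects a parsed Pod manifest for three settings: `hostNetwork`, `privileged`, and `allowPrivilegeEscalation`. Treating an unset `allowPrivilegeEscalation` as a violation mirrors the Restricted Pod Security Standard, which requires it to be explicitly `false`. The `pod_violations` helper is hypothetical; in practice, an admission controller or a policy engine would perform checks like these.

```python
def pod_violations(pod: dict) -> list[str]:
    """Return a list of human-readable violations found in a Pod manifest."""
    violations = []
    spec = pod.get("spec", {})
    if spec.get("hostNetwork"):
        violations.append("pod uses hostNetwork")
    # Only regular containers are checked here; a fuller check would also
    # cover initContainers and ephemeralContainers.
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        context = container.get("securityContext") or {}
        if context.get("privileged"):
            violations.append(f"container {name} runs privileged")
        # Unset is treated as a violation: it must be explicitly false.
        if context.get("allowPrivilegeEscalation", True):
            violations.append(f"container {name} allows privilege escalation")
    return violations


# Hypothetical Pod manifest, as it would look after parsing the YAML.
pod = {
    "kind": "Pod",
    "metadata": {"name": "demo"},
    "spec": {
        "hostNetwork": True,
        "containers": [
            {"name": "app", "securityContext": {"privileged": True}},
            {"name": "sidecar",
             "securityContext": {"allowPrivilegeEscalation": False}},
        ],
    },
}

violations = pod_violations(pod)
for violation in violations:
    print(violation)
```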

Maturity Level 4

Business Continuity: start tackling Business Continuity issues (Audit Planning, Business Continuity Planning, Incident Management).
Monitoring: any changes made to production should be logged and eventually alerted upon. In addition, file integrity (host) and network intrusion detection (IDS) tools should be deployed to help facilitate timely detection, investigation by root cause analysis, and response to incidents. In particular, processes and tools shall be put in place to implement a runtime anomaly detection solution, aligned with MITRE ATT&CK for Cloud.
Remediation: start creating playbooks to define detailed processes to follow in case of an incident. Timely de-provisioning of user access to data and systems should be implemented.
Business Continuity: a Disaster Recovery Plan should be outlined to cover the outage or failure of one or more core components of the infrastructure (e.g., failure of an AZ or Region).

Maturity Level 5

Supply Chain: utilize a framework (like TUF, in-toto, providence) to protect the integrity of the Supply Chain.
Monitoring: a solution should be put in place to detect exfiltration of data, by monitoring egress traffic.
Remediation: automated processes should be put in place to automate the containment of (at least) the most common compromise types, and to automate the forensic collection of evidence after the declaration of a security incident.
Business Continuity: tabletop exercises and live tests should be conducted to test the effectiveness of the controls put in place to mitigate a potential failure of one or more core components of the infrastructure.
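
For the egress monitoring item above, a very simple starting point is to aggregate outbound traffic per destination and flag anything outside an allowlist that exceeds a volume threshold. The sketch below assumes flow records have already been reduced to `(destination, bytes_sent)` pairs; the record format, the allowlist, the threshold, and the `flag_egress` helper are all hypothetical, and the addresses come from the ranges reserved for documentation. Real exfiltration detection would also need to account for time windows, protocols, and slow, low-volume transfers.

```python
from collections import defaultdict


def flag_egress(
    flows: list[tuple[str, int]],
    allowed_destinations: set[str],
    max_bytes: int,
) -> dict[str, int]:
    """Return non-allowlisted destinations whose total egress exceeds max_bytes."""
    totals: dict[str, int] = defaultdict(int)
    for destination, bytes_sent in flows:
        if destination not in allowed_destinations:
            totals[destination] += bytes_sent
    return {dest: total for dest, total in totals.items() if total > max_bytes}


# Hypothetical flow records: (destination IP, bytes sent).
flows = [
    ("203.0.113.10", 4_000_000),
    ("198.51.100.7", 900_000),
    ("203.0.113.10", 7_000_000),
    ("192.0.2.20", 50_000_000),  # allowlisted, e.g. a backup target
]

suspicious = flag_egress(flows, allowed_destinations={"192.0.2.20"}, max_bytes=10_000_000)
print(suspicious)
```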

Tasks

At first glance, the list of initiatives outlined above might seem quite dense (and not super-actionable). That’s why I expanded them into a set of Tasks (95 at the time of writing), which can be individually worked upon.

Having almost a hundred Tasks in a blog post wouldn’t be practical, though, so I created a micro-website to host them in a spreadsheet-style format.

Each row represents a Task, and has the following attributes:

Attribute	Description
Domain	The `Domain` the Task belongs to
Control	The `Control` the Task belongs to
Task	The Task name
Description	A description of what the Task involves
Status	To keep track of progress (`NOT STARTED`, `IN PROGRESS`, `BLOCKED`, `DONE`)
Priority	The `Maturity Level` the Task belongs to (`1`-`5`)
Maturity	How mature the deployment/rollout of the Task is, once you have started working on it
Layer	Whether it affects a Cloud Provider, Kubernetes cluster, or both
Epic	Link to Jira/Issue Tracker, to keep track of progress
Deliverable	Type of deliverable for the Task (`Documentation`, `Tooling`, etc.)
Artifact	Link to the final deliverable for the Task
Useful Resources	Some useful resources that can help during the implementation phase
Metrics	Metrics that can be used to track the success of the Task
CSA CCM	Reference to the related entry in the CSA CCM, if any

From there, you’ll have the ability to export it as CSV and tailor it to your needs.

I’d like to stress that you don’t have to follow the tasks in order, but you should use the Priority column to define your own priorities, which can change based on your business priorities and industry.
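
As an example of tailoring the export, the sketch below reads a CSV, keeps the Tasks that are not yet `DONE` up to a chosen maturity level, and sorts them by Priority. It assumes the exported header row uses the attribute names from the table above (`Domain`, `Control`, `Task`, `Status`, `Priority`), which you should check against the actual file; the sample rows and the `next_tasks` helper are made up for illustration.

```python
import csv
import io


def next_tasks(csv_text: str, max_priority: int) -> list[dict[str, str]]:
    """Return unfinished Tasks with Priority <= max_priority, lowest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pending = [
        row for row in reader
        if row["Status"] != "DONE" and int(row["Priority"]) <= max_priority
    ]
    return sorted(pending, key=lambda row: (int(row["Priority"]), row["Domain"]))


# Hypothetical export; column names are assumed to match the attribute table.
sample_csv = """Domain,Control,Task,Status,Priority
Verification,Inventory,Deploy asset inventory,NOT STARTED,2
Architecture,Network,Review network segregation,IN PROGRESS,1
Monitoring and Alerting,Logging,Define logging strategy,DONE,1
Business Continuity,DR,Outline disaster recovery plan,NOT STARTED,4
"""

selected = next_tasks(sample_csv, max_priority=2)
for row in selected:
    print(f'[{row["Priority"]}] {row["Domain"]}: {row["Task"]} ({row["Status"]})')
```

Changing the sort key (for example, to put a specific Domain first) is an easy way to reflect your own business priorities.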

Putting It All Together: The Roadmap

The detailed list of Controls can be found at: roadmap.cloudsecdocs.com

Conclusions

In this post, part of the “Cloud Security Strategies” series, I outlined some actionable advice for establishing a cloud security program aimed at protecting a cloud native, service provider agnostic, container-based offering.

It represents my perspective and reflects my experience, so it definitely won’t be “one size fits all”, but I hope it can serve as a useful baseline.

I hope you found this post valuable and interesting, and I’m keen to get feedback on it! If you find the information shared helpful, if something is missing, or if you have ideas on improving it, please let me know on 🐣 Twitter or at 📢 feedback.marcolancini.it.

Thank you! 🙇‍♂️
