If you had to architect a multi-account security logging strategy, where should you start?

This post, part of the “Continuous Visibility into Ephemeral Cloud Environments” series, will describe a design for a state-of-the-art multi-account security-related logging platform in GCP.

A previous post covered a similar setup for AWS, so I have followed the same structure here. A later post will cover a setup for Kubernetes instead.

This is a living document, which I regularly update as new services/improvements get released.

Problem Statement

One of the usual requirements for Security teams is to improve visibility into (production) environments. In this regard, it is often necessary to design and roll out a strategy around security-related logging. This entails defining the scope for logging (resources, frequency, etc.), as well as providing an integration with existing monitoring and alerting systems.

The end goal is to deploy a security logging and monitoring solution with well-established metrics and integrations with a SIEM of choice (Elasticsearch in this case). In particular, the solution should be able to:

  • Collect security-related logs from all environments.
  • Ingest those logs into a SIEM (e.g., Elasticsearch).
  • Parse those logs and use them to generate dashboards in Kibana.
  • Create alerts on anomalies.

In this regard, this post is composed of two main parts. The first introduces the logging-related services GCP makes available to its customers, along with their main features. The second describes a state-of-the-art design for a security-related logging platform, and provides the high-level architecture and best practices to follow during the implementation phase.

Which Services Can We Leverage?

Similar to AWS, GCP offers multiple services around logging and monitoring. Cloud Operations (formerly known as Stackdriver) is defined as a suite of products to monitor, troubleshoot, and operate services at scale. It now includes Cloud Logging, Cloud Monitoring, Cloud Trace, Cloud Debugger, and Cloud Profiler.

In the remainder of this section I’ll provide a summary of the main services we will need to design our security logging platform.

Cloud Logging

Cloud Logging receives, indexes, and stores log entries from many sources, including GCP, AWS, VM instances running the fluentd agent, and user applications:

Type Description
Agent Logs
  • Application and Host/OS-level logs can be collected via the Cloud Logging Agent, an application based on fluentd that runs on supported VM instances.
  • The Agent is installed by default on VMs running in Google Kubernetes Engine or App Engine.
  • By default, it collects the following logs:
    • For Linux: Syslog, nginx, apache2, apache-error.
    • For Windows: Windows Event Logs.
  • For a list of all the monitored resource types used in the Logging API, refer to the Monitored resources and services page of the GCP documentation.
Cloud Audit Logs
  • GCP services write audit log entries to help answer the questions of "who did what, where, and when?" within Google Cloud resources.
  • Cloud Audit Logs maintains multiple types of audit logs (more on this below):
    1. Admin Activity
    2. System Event
    3. Data Access
    4. Policy Denied
  • For a list of Google Cloud services that write audit logs, see Google services with audit logs.
Access Transparency Logs
  • Logs of actions taken by Google staff when accessing your data.
  • The difference here is that, while Cloud Audit Logs provides logs about actions taken by members within your own organization, Access Transparency provides logs of actions taken by Google staff.
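Concretely, answering the “who did what, where, and when?” question means reading a handful of documented fields out of each audit log entry. The sketch below (with an illustrative, trimmed-down entry) shows where those fields live in the LogEntry/AuditLog shape; the project, instance, and user names are made up:

```python
import json

# A trimmed-down Cloud Audit Log entry, following the documented LogEntry /
# AuditLog shape (protoPayload with authenticationInfo, methodName, and
# resourceName). All values are illustrative.
sample_entry = json.loads("""
{
  "logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Factivity",
  "timestamp": "2021-01-01T12:00:00Z",
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "serviceName": "compute.googleapis.com",
    "methodName": "v1.compute.instances.insert",
    "resourceName": "projects/my-project/zones/us-central1-a/instances/vm-1",
    "authenticationInfo": {"principalEmail": "alice@example.com"}
  }
}
""")

def who_did_what(entry):
    """Extract the "who did what, where, and when" from an audit log entry."""
    payload = entry.get("protoPayload", {})
    return {
        "who": payload.get("authenticationInfo", {}).get("principalEmail"),
        "what": payload.get("methodName"),
        "where": payload.get("resourceName"),
        "when": entry.get("timestamp"),
    }

print(who_did_what(sample_entry))
```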

Audit Logs

As briefly mentioned above, Google Cloud Audit Logs record the who, where, and when for activity within your environment, and ultimately help security teams maintain audit trails in GCP.

With them, it is possible to attain the same level of transparency over administrative activities and accesses to data in GCP as in on-premises environments. Every administrative activity is recorded on a hardened, always-on audit trail, which cannot be disabled by any rogue actor.

Cloud Audit Logs provides the following audit logs for each Project, Folder, and Organization within a resource hierarchy:

Type Description Retention Period
Admin Activity Audit Logs
  • Contain log entries for API calls or other administrative actions that modify the configuration or metadata of resources.
  • For example, these logs record when users create VM instances or change Cloud IAM permissions.
  • To view these logs, you must have the Cloud IAM role Logging/Logs Viewer or Project/Viewer.
  • Admin Activity audit logs are always written; they cannot be configured or disabled.
400 days
System Event Audit Logs
  • Contain log entries for Google Cloud administrative actions that modify the configuration of resources. They are generated by Google systems (they are not driven by direct user action).
  • For example, these logs record when GCE live migrates an instance to another host.
  • To view these logs, you must have the Cloud IAM role Logging/Logs Viewer or Project/Viewer.
  • System Event audit logs are always written; they cannot be configured or disabled.
400 days
Data Access Audit Logs
  • Contain API calls that read the configuration or metadata of resources, as well as user-driven API calls that create, modify, or read user-provided resource data.
  • Data Access audit logs consist of three sub-types:
    1. Admin read: reads of service metadata or configuration data (e.g., listing buckets or nodes within a cluster)
    2. Data read: reads of data within a service (e.g., listing data within a bucket)
    3. Data write: writes of data to a service (e.g., writing data to a bucket)
  • Data Access audit logs do not record the data-access operations on resources that are publicly shared (available to All Users or All Authenticated Users) or that can be accessed without logging into Google Cloud.
  • To view these logs, you must have the Cloud IAM roles Logging/Private Logs Viewer or Project/Owner.
  • Data Access audit logs are disabled by default because they can be quite large.
  • Caveat for GKE: the GKE Admin Activity logs do not include get operations on Secrets by default; to have these logged, you'll have to enable Data Access Logs.
30 days
Policy Denied Audit Logs
  • Cloud Logging records Policy Denied audit logs when a Google Cloud service denies access to a user or service account because of a security policy violation.
  • To view these logs, you must have the IAM role Logging/Logs Viewer or Project/Viewer.
30 days

For more information, see Best practices for Cloud Audit Logs.
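Since Data Access audit logs are disabled by default, they have to be turned on explicitly by adding an auditConfigs stanza to the IAM policy of the relevant project, folder, or organization. Below is a sketch of that documented JSON shape; the service-account email is an illustrative placeholder:

```python
# The "auditConfigs" stanza of an IAM policy (as passed to setIamPolicy) that
# enables Data Access audit logs. "allServices" turns them on for every
# service; exemptedMembers can exclude noisy accounts from DATA_READ logging.
audit_config = {
    "auditConfigs": [
        {
            "service": "allServices",
            "auditLogConfigs": [
                {"logType": "ADMIN_READ"},
                {"logType": "DATA_READ",
                 "exemptedMembers": ["serviceAccount:noisy-sa@my-project.iam.gserviceaccount.com"]},
                {"logType": "DATA_WRITE"},
            ],
        }
    ]
}

def enabled_log_types(policy, service):
    """Return the Data Access log types enabled for a given service."""
    for cfg in policy.get("auditConfigs", []):
        if cfg["service"] in (service, "allServices"):
            return [c["logType"] for c in cfg["auditLogConfigs"]]
    return []

print(enabled_log_types(audit_config, "container.googleapis.com"))
```

Because the stanza above uses "allServices", it also covers GKE (container.googleapis.com), which addresses the caveat about get operations on Secrets mentioned earlier.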

Cloud Monitoring

Cloud Monitoring collects metrics, events, and metadata from GCP, AWS, hosted uptime probes, and application instrumentation. It also provides dashboards, alerts, and uptime checks that can be used to ensure systems are running reliably.

In addition, Cloud Monitoring allows you to create custom alerting policies: whenever events trigger conditions in one of the defined alerting policies, Cloud Monitoring creates and displays an incident in the console. If you set up notifications, Cloud Monitoring can also send them to people or third-party notification services.

Cloud Identity

Cloud Identity is Google’s Identity as a Service (IDaaS) product, which can be used to provision, manage, and authenticate users across GCP environments. Cloud Identity is how people in an organization gain a Google identity, and it’s these identities that are granted access to Google Cloud resources.

In this regard, Cloud Identity logs track events that may have a direct impact on a GCP environment. Relevant logs include:

Type Description
Admin Audit Logs
  • Track actions performed in the Google Admin Console.
  • For example, they show when an administrator added a user or changed a setting.
Login Audit Logs
  • Track when users sign in to the domain.
  • Interesting events are:
    • Failed Login: logged every time a user fails to log in.
    • Suspicious Login: logged when a user signs in under suspicious circumstances, such as from an unfamiliar IP address.
Groups Audit Logs
  • Track changes to group settings and group memberships in Google Groups.
OAuth Token Audit Logs
  • Track third-party application usage and data access requests.
SAML Audit Logs
  • Track successful and failed logins to SAML applications.
  • Only available to G Suite/Cloud Identity Premium customers.

Security Command Center

Security Command Center is defined by Google as a risk dashboard and analytics system for surfacing, understanding, and remediating Google Cloud security and data risks across an organization.

Security Command Center enables the generation of insights that provide a unique view of incoming threats and attacks to Google Cloud resources (called “assets”), by displaying possible security risks (called “findings”) that are associated with each asset. Findings can come from security sources that include Security Command Center’s built-in services, third-party partners (like Cloudflare, CrowdStrike, Prisma Cloud, and Qualys), or even custom sources.

Security Command Center - Courtesy of Google.

Security Command Center currently focuses on asset inventory, discovery, search, and management:

Feature Description
Asset discovery and inventory
  • Cloud Asset Inventory
    • Discover and view assets in near-real time across App Engine, BigQuery, Cloud SQL, Cloud Storage, Compute Engine, Cloud IAM, Google Kubernetes Engine, and more.
    • Review historical discovery scans to identify new, modified, or deleted assets.
Threat prevention
  • Understand the security state of your Google Cloud assets.
  • Security Health Analytics:
    • Provides managed vulnerability assessment scanning that can automatically detect the highest severity vulnerabilities and misconfigurations for Google Cloud assets
  • Web Security Scanner (💰 PREMIUM):
    • Provides managed scans that identify common web application vulnerabilities (such as cross-site scripting or outdated libraries) in web applications running on App Engine, GKE, and Compute Engine
    • The full list of finding types is available on the GCP documentation
Threat detection
  • Event Threat Detection (💰 PREMIUM):
    • Monitors the Cloud Logging stream and consumes logs for one or more projects as they become available
    • It detects threats like:
      • Malware
      • Cryptomining
      • Brute force SSH
      • Outgoing DoS
      • IAM anomalous grant
      • Data exfiltration
    • The full list of Event Threat Detection rules is available on the GCP documentation
  • Container Threat Detection (💰 PREMIUM):
    • Continuously monitors the state of Container-Optimized OS node images (see supported GKE versions)
    • It evaluates all changes and remote access attempts to detect runtime attacks in near-real time
    • It includes several detection capabilities, including suspicious binaries and libraries, and uses natural language processing (NLP) to detect malicious bash scripts
    • The full list of Container Threat Detection detectors is available on the GCP documentation
  • Virtual Machine Threat Detection (💰 PREMIUM):
    • Provides threat detection through hypervisor-level instrumentation
    • Scans enabled Compute Engine projects and VM instances to detect unwanted applications, such as cryptocurrency mining software, running in VMs
  • Sensitive Actions Service (💰 PREMIUM):
    • Detects when actions are taken in your Google Cloud organization, folders, and projects that could be damaging to your business if they are taken by a malicious actor
    • Currently in Pre-GA

Alerts triggered by Security Command Center can be turned into real-time notifications via integrations with Pub/Sub.
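As a sketch of what a consumer of those notifications might do, the snippet below decodes an illustrative, trimmed-down NotificationMessage as it would arrive on a Pub/Sub subscription — the message data is base64-encoded JSON containing the finding. Field names follow the documented NotificationMessage/Finding shape; all values are made up:

```python
import base64
import json

# A minimal Security Command Center NotificationMessage, as delivered on a
# Pub/Sub subscription. The organization/source IDs and category are
# illustrative.
raw = base64.b64encode(json.dumps({
    "notificationConfigName": "organizations/123/notificationConfigs/scc-findings",
    "finding": {
        "name": "organizations/123/sources/456/findings/abc",
        "category": "ANOMALOUS_IAM_GRANT",
        "resourceName": "//cloudresourcemanager.googleapis.com/projects/my-project",
        "state": "ACTIVE",
    },
}).encode())

def parse_notification(data: bytes) -> dict:
    """Decode an SCC Pub/Sub message and pull out the fields worth alerting on."""
    msg = json.loads(base64.b64decode(data))
    finding = msg.get("finding", {})
    return {
        "category": finding.get("category"),
        "resource": finding.get("resourceName"),
        "state": finding.get("state"),
    }

print(parse_notification(raw))
```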

Access Logs

Particular mention has to be made for Access Logs, which are generated by a variety of services:

Service Description
Cloud Storage
  • Data Access logs are not recorded by default; once enabled, they provide information about all of the requests made on a specified bucket, including access requests and changes made by the Object Lifecycle Management feature.
  • Logs are created hourly, when there is activity.
  • Cloud Storage administrative activity, instead, is logged automatically, and includes operations that modify the configuration or metadata of a bucket or object.
VPC Flow Logs
  • VPC Flow Logs capture information about the traffic going to and from a VPC's network interfaces, and are enabled at the subnet level.
  • Flow log data is stored using Cloud Logging and can be exported to BigQuery or Pub/Sub for additional analytics or visualization of network traffic flows.
  • VPC Flow Logs can be useful when organizational legal or security policies require capturing network flow data.
Cloud Load Balancing
  • Logs the details of each request/connection made to the Load Balancer (i.e., HttpRequest log fields), along with information explaining why the load balancer returned the HTTP status that it did.
Cloud CDN
  • Each Cloud CDN request is logged in Cloud Logging.
  • Logs for Cloud CDN are associated with the external HTTP(S) load balancer that the Cloud CDN backends are attached to.
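For instance, VPC Flow Logs are switched on through the logConfig block of a subnetwork resource. The sketch below uses the documented Subnetwork API field names; the sampling and aggregation values are just one possible cost/fidelity trade-off:

```python
# The logConfig stanza of a Compute Engine subnetwork resource, which turns
# on VPC Flow Logs for that subnet. The specific interval/sampling values
# below are illustrative.
subnet_log_config = {
    "logConfig": {
        "enable": True,
        "aggregationInterval": "INTERVAL_5_SEC",
        "flowSampling": 0.5,          # sample half of the flows
        "metadata": "INCLUDE_ALL_METADATA",
    }
}

def flow_logs_enabled(subnet: dict) -> bool:
    """Check whether a subnetwork resource has flow logging turned on."""
    return bool(subnet.get("logConfig", {}).get("enable"))

print(flow_logs_enabled(subnet_log_config))
```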


State of the Art Security Logging Platform in GCP

So how could we design a multi-account security-related logging platform in GCP?

Let’s start with a high-level architecture diagram of a solution with multiple “projects” (or customers), each with production and non-production environments (note how every project/customer will have the same setup). Here I will assume the workloads run predominantly in a Kubernetes cluster (managed GKE), but with some stateful services involved as well (e.g., Cloud SQL).

Architecture Diagram - Security Logging Platform in GCP


Starting from collection, Cloud Logging should be enabled in every GCP project, so as to collect logs from every environment (whether production or not).

In particular, the following information should be collected:

Log Type Description
Agent Logs
  • A fluentd-based agent (that can run on supported VM instances) will collect entries from the GKE clusters and the applications running on them.
Application Event Logs
  • The same fluentd-based agent should be used to capture application event and error logs.
Audit and Access Transparency Logs
  • Both Audit and Access Transparency Logs should be collected, as described in the Audit Logs section.
Access Logs
  • VPC Flow Logs: VPC Flow Logs can be collected to comply with regulatory policies that require capturing network flow data, as they record information about IP traffic going to and from a VPC's network interfaces.
  • Cloud Storage: Cloud Storage Access Logging can be enabled to record requests made to buckets.
  • Cloud Load Balancing: Cloud Load Balancing Access Logging can be enabled to record individual requests made to load balancers.
Kubernetes Logs
  • GKE includes native integration with Cloud Monitoring and Cloud Logging: when a new GKE cluster is set up, system and application logs are enabled by default.
    • A dedicated agent is automatically deployed and managed on the GKE nodes to collect logs (along with metadata about the container, pod, and cluster) and forward them to Cloud Logging. Both system logs and app logs are then ingested and stored in Cloud Logging.
  • Control plane logs: control plane API, audit, controller, authenticator, and scheduler logs are collected by GKE itself and forwarded to Cloud Logging.
  • Worker node logs: collection depends on whether the compute plane is self-managed or GCP-managed. I'll write a follow-up post specifically on this.
  • Task container logs: the application logs. I'll write a follow-up post specifically on this.
DNS Query Logs
  • Cloud DNS logging can track queries that name servers resolve for VPC networks.
  • Queries from an external entity directly to a public zone are not logged because a public name server handles them.
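Most of the log streams above end up in Cloud Logging under well-known log IDs, so a single Logging filter can select them for export or analysis. A small helper to build such a filter (the project ID is illustrative) might look like:

```python
# Build a Cloud Logging filter that selects the security-related log streams
# discussed above. The logName suffixes follow the documented URL-encoded
# log IDs (e.g. cloudaudit.googleapis.com%2Factivity).
def security_log_filter(project: str) -> str:
    log_ids = [
        "cloudaudit.googleapis.com%2Factivity",      # Admin Activity audit logs
        "cloudaudit.googleapis.com%2Fsystem_event",  # System Event audit logs
        "cloudaudit.googleapis.com%2Fdata_access",   # Data Access audit logs
        "compute.googleapis.com%2Fvpc_flows",        # VPC Flow Logs
        "dns.googleapis.com%2Fdns_queries",          # Cloud DNS query logs
    ]
    clauses = [f'logName="projects/{project}/logs/{log_id}"' for log_id in log_ids]
    return " OR ".join(clauses)

print(security_log_filter("my-project"))
```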

In conjunction, Cloud Monitoring should be enabled to ingest events, metrics, and metadata, and to generate insights (through dashboards, charts, and alerts). In addition, Cloud Monitoring should also be used to create and manage custom alerting policies (more on this later).

On top of this, it could be useful to also collect findings coming from Security Command Center. Security Command Center, enabled at the GCP Organization level, ingests findings from Security Health Analytics, as well as Event Threat Detection and Container Threat Detection. Once ingested, Notification Configs can be used to dispatch each finding to a Pub/Sub topic hosted in the relevant GCP project (the one the finding is associated with).

Finally, Cloud Identity Logs (at least the Admin, Login and Groups Audit Logs) should be collected, as described in the Cloud Identity section.


Since the integrity, completeness, and availability of the collected logs are crucial for forensic and auditing purposes, a queueing system like Pub/Sub should be used to receive and buffer all the logs collected.

Since Cloud Logging retains app and audit logs for a limited period of time, export sinks should be configured to store logs for extended periods, both to meet compliance obligations and to allow historical analysis. Pub/Sub can then be configured to receive and buffer all the logs forwarded by Cloud Logging, so that they can be exported to any external monitoring service. In this regard, the “Design patterns for exporting from Logging” guide, together with the “Aggregated Exports” feature (which allows you to set up a sink at the Cloud IAM organization level and export logs from all the projects inside the organization), can be used as a reference for the export strategy.

Not only will this improve the resiliency of the platform, by queueing (without discarding) messages in the event of the failure of a downstream component meant to consume logs, but it will also decouple log ingestion from log consumption.
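At the resource level, an aggregated export of this kind boils down to a single organization-level LogSink with includeChildren set, pointing at the buffering Pub/Sub topic. A sketch of that resource (names, project, and topic are illustrative):

```python
# The shape of an organization-level aggregated export sink (a LogSink
# resource), routing audit logs from every project in the organization to a
# Pub/Sub topic. "includeChildren" is what makes the sink aggregated.
aggregated_sink = {
    "name": "org-security-logs",
    "destination": "pubsub.googleapis.com/projects/logging-project/topics/security-logs",
    "filter": 'logName:"cloudaudit.googleapis.com"',
    "includeChildren": True,
}

def is_aggregated(sink: dict) -> bool:
    """An aggregated sink exports logs from all children of the org/folder."""
    return bool(sink.get("includeChildren"))

print(is_aggregated(aggregated_sink))
```

The equivalent CLI invocation would use `gcloud logging sinks create` with the `--organization` and `--include-children` flags.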

Long-Term Storage and Audit Trail

A dedicated and highly restricted Project (here named Logging Project) should also be created for each project/customer for long-term (immutable) storage of the logs.

In that Project, a Logstash Agent can be used to pull logs directly from Pub/Sub topics and store them into a bucket, where they will be treated as immutable files. This can be achieved via Bucket Retention Policies and Retention Policy Locks (see “Retention policies and retention policy locks”), to ensure that nobody is able to delete the objects during a pre-defined retention period.

In addition, a Data Loss Prevention (DLP) solution could be employed to prevent and detect cases of attempted data exfiltration. It should be noted that, to ensure the integrity of the logs stored in such projects, IAM controls should be put in place to limit access to these buckets (see “Access control guide for Cloud Logging”).

Monitoring and Alerting

Finally, a centralized Account/Project (here called Centralized Monitoring Account, and hosted in another cloud provider) can then be used to aggregate the logs collected from the different Projects.

In this account, another Logstash Agent will have dedicated subscriptions to pull logs from each Pub/Sub topic defined in every Project and forward them to an Elasticsearch instance, used by a Security Operations Center (SOC) team to monitor and respond to threats in (near) real time.
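In practice that mapping lives in Logstash configuration, but as a language-agnostic sketch of the transformation step, the hypothetical normalise() helper below derives a per-project, per-log-type Elasticsearch index name from a raw Pub/Sub message, so the SOC can scope Kibana dashboards per stream (the index naming scheme is an assumption for illustration, not a GCP or Elastic convention):

```python
import json

# Map a raw log message (as pulled from Pub/Sub) onto an Elasticsearch index
# named per project and log type. Both the function and the naming scheme
# are illustrative sketches of the Logstash pipeline's job.
def normalise(message: bytes, project: str) -> tuple[str, dict]:
    entry = json.loads(message)
    log_type = entry.get("logName", "unknown").rsplit("/", 1)[-1]
    # e.g. "cloudaudit.googleapis.com%2Factivity" -> "...com-activity"
    index = f"gcp-{project}-{log_type}".lower().replace("%2f", "-")
    return index, entry

index, doc = normalise(
    b'{"logName": "projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"}',
    "my-project",
)
print(index)
```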

As mentioned previously, Cloud Monitoring can also be used to create and manage alerting policies. This way, whenever events trigger conditions in one of the alerting policies, Cloud Monitoring creates and displays an incident in the Monitoring console. Notifications can be set up so that Cloud Monitoring sends them to the relevant staff members.
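For reference, such an alerting policy is just an AlertPolicy resource in the Monitoring v3 API. The sketch below assumes a user-defined log-based metric (iam_changes) counting IAM policy changes; the metric name and notification channel ID are illustrative placeholders:

```python
# The JSON shape of a Cloud Monitoring alerting policy (AlertPolicy, v3 API)
# that fires as soon as a log-based metric counting IAM changes exceeds
# zero. The metric and channel below are hypothetical examples.
alert_policy = {
    "displayName": "IAM policy changed",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "iam-change-count above zero",
            "conditionThreshold": {
                "filter": 'metric.type="logging.googleapis.com/user/iam_changes"',
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0,
                "duration": "0s",
            },
        }
    ],
    "notificationChannels": ["projects/my-project/notificationChannels/123456"],
}

print(alert_policy["displayName"])
```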


In this blog post, part of the “Continuous Visibility into Ephemeral Cloud Environments” series, I described a possible approach for designing a multi-account security-related logging platform in GCP.

A previous post covered a similar setup for AWS, while a later post will cover Kubernetes instead.

I hope you found this post useful and interesting, and I’m keen to get feedback on it! If you find the information shared was useful, if something is missing, or if you have ideas on how to improve it, please let me know on Twitter.