Google Cloud Architecture Framework

Last reviewed 2024-10-31 UTC

The Google Cloud Architecture Framework provides recommendations to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that's secure, efficient, resilient, high-performing, and cost-effective. The Google Cloud Architecture Framework is our version of a well-architected framework.

A cross-functional team of experts at Google validates the recommendations in the Architecture Framework. The team curates the Architecture Framework to reflect the expanding capabilities of Google Cloud, industry best practices, community knowledge, and feedback from you. For a summary of the significant changes to the Architecture Framework, see What's new.

The Architecture Framework is relevant to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.

Architecture Framework pillars and perspectives

The Google Cloud Architecture Framework is organized into five pillars, as shown in the following diagram. We also provide cross-pillar perspectives that focus on recommendations for selected domains, industries, and technologies like AI and machine learning (ML).

[Diagram: Google Cloud Architecture Framework]

Pillars

Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.

Perspectives

AI and ML
A cross-pillar view of recommendations that are specific to AI and ML workloads.

Core principles

Before you explore the recommendations in each pillar of the Architecture Framework, review the following core principles:

Design for change

No system is static. The needs of its users, the goals of the team that builds the system, and the system itself are constantly changing. With the need for change in mind, build a development and production process that enables teams to regularly deliver small changes and get fast feedback on those changes. Consistently demonstrating the ability to deploy changes helps to build trust with stakeholders, including the teams responsible for the system, and the users of the system. Using DORA's software delivery metrics can help your team monitor the speed, ease, and safety of making changes to the system.
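
As a rough illustration of tracking delivery metrics, the following sketch computes deployment frequency, median lead time, and change failure rate from a list of deployment records. The `Deployment` record, the `delivery_metrics` function, and the sample data are hypothetical and not part of any Google Cloud or DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production
    failed: bool            # whether the change caused a production failure


def delivery_metrics(deployments, period_days):
    """Summarize deployment frequency, lead time, and change failure rate."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": sum(d.failed for d in deployments) / len(deployments),
    }


# Hypothetical sample data: one deployment per day for four days.
start = datetime(2024, 10, 1, 9, 0)
history = [
    Deployment(
        committed_at=start + timedelta(days=i),
        deployed_at=start + timedelta(days=i, hours=2 * (i + 1)),
        failed=(i == 2),
    )
    for i in range(4)
]
metrics = delivery_metrics(history, period_days=4)
print(metrics)
```

In practice, the records would come from your CI/CD system and incident tracker rather than from hand-built sample data.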

Document your architecture

When you start to move your workloads to the cloud or build your applications, lack of documentation about the system can be a major obstacle. Documentation is especially important for correctly visualizing the architecture of your current deployments.

Quality documentation isn't measured by the amount of documentation that you produce, but by how clear the content is, how useful it is, and how well it's maintained as the system changes.

A properly documented cloud architecture establishes a common language and standards, which enable cross-functional teams to communicate and collaborate effectively. The documentation also provides the information that's necessary to identify and guide future design decisions. Documentation should be written with your use cases in mind, to provide context for the design decisions.

Over time, your design decisions will evolve and change. The change history provides the context that your teams require to align initiatives, avoid duplication, and measure performance changes effectively over time. Change logs are particularly valuable when you onboard a new cloud architect who is not yet familiar with your current design, strategy, or history.

Analysis by DORA has found a clear link between documentation quality and organizational performance, which is the organization's ability to meet its performance and profitability goals.

Simplify your design and use fully managed services

Simplicity is crucial for design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

If you're already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you're developing new workloads, then start simple, establish a minimum viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.

Decouple your architecture

Research from DORA shows that architecture is an important predictor for achieving continuous delivery. Decoupling is a technique that's used to separate your applications and service components into smaller components that can operate independently. For example, you might separate a monolithic application stack into individual service components. In a loosely coupled architecture, an application can run its functions independently, regardless of the various dependencies.

A decoupled architecture gives you increased flexibility to do the following:

  • Apply independent upgrades.
  • Enforce specific security controls.
  • Establish reliability goals for each subsystem.
  • Monitor health.
  • Granularly control performance and cost parameters.

You can start the decoupling process early in your design phase or incorporate it as part of your system upgrades as you scale.
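
One common way to decouple components is to place a queue between them, so that the producer doesn't need to know how or when the consumer does its work. The following minimal sketch uses Python's in-process `queue.Queue` as a stand-in for a managed messaging service. The function names and the order data are hypothetical.

```python
import queue

# Stand-in for a managed message queue that sits between two components.
orders = queue.Queue()


def submit_order(order_id):
    """Producer: records the order and returns without waiting for fulfillment."""
    orders.put(order_id)


def process_pending():
    """Consumer: drains whatever is queued, independently of the producer."""
    shipped = []
    while True:
        try:
            order_id = orders.get_nowait()
        except queue.Empty:
            break
        shipped.append(f"shipped:{order_id}")
    return shipped


submit_order("A-100")
submit_order("A-101")
result = process_pending()
print(result)
```

Because the two functions share only the queue, each side can be upgraded, scaled, or monitored on its own schedule.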

Use a stateless architecture

A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as local caching of data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.
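
The following sketch illustrates the idea under simple assumptions: the request handler keeps no local state, so any instance, including one that replaces a restarted instance, can serve the next request. `SharedStore` is a hypothetical in-memory stand-in for a shared storage or cache service.

```python
class SharedStore:
    """In-memory stand-in for a shared external store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


def handle_request(store, session_id):
    """Stateless handler: all progress is read from and written to the store.

    The read-then-write here is not atomic; a real shared store would need
    an atomic increment or a transaction.
    """
    count = store.get(session_id) + 1
    store.set(session_id, count)
    return count


store = SharedStore()
first = handle_request(store, "session-a")   # served by one instance
second = handle_request(store, "session-a")  # could be served by any other instance
print(first, second)
```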

Google Cloud Architecture Framework: Operational excellence

The operational excellence pillar in the Google Cloud Architecture Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads to meet the dynamic and ever-evolving needs in the cloud.

The operational excellence pillar is relevant to the following audiences:

  • Managers and leaders: A framework to establish and maintain operational excellence in the cloud and to ensure that cloud investments deliver value and support business objectives.
  • Cloud operations teams: Guidance to manage incidents and problems, plan capacity, optimize performance, and manage change.
  • Site reliability engineers (SREs): Best practices that help you to achieve high levels of service reliability, including monitoring, incident response, and automation.
  • Cloud architects and engineers: Operational requirements and best practices for the design and implementation phases, to help ensure that solutions are designed for operational efficiency and scalability.
  • DevOps teams: Guidance about automation, CI/CD pipelines, and change management, to help enable faster and more reliable software delivery.

To achieve operational excellence, you should embrace automation, orchestration, and data-driven insights. Automation helps to eliminate toil. It also streamlines and builds guardrails around repetitive tasks. Orchestration helps to coordinate complex processes. Data-driven insights enable evidence-based decision-making. By using these practices, you can optimize cloud operations, reduce costs, improve service availability, and enhance security.

Operational excellence in the cloud goes beyond technical proficiency in cloud operations. It includes a cultural shift that encourages continuous learning and experimentation. Teams must be empowered to innovate, iterate, and adopt a growth mindset. A culture of operational excellence fosters a collaborative environment where individuals are encouraged to share ideas, challenge assumptions, and drive improvement.

The recommendations in the operational excellence pillar of the Architecture Framework are mapped to core principles for automation, orchestration, and data-driven insights. The sections that follow describe these principles.

Ensure operational readiness and performance using CloudOps

This principle in the operational excellence pillar of the Google Cloud Architecture Framework helps you to ensure operational readiness and performance of your cloud workloads. It emphasizes establishing clear expectations and commitments for service performance, implementing robust monitoring and alerting, conducting performance testing, and proactively planning for capacity needs.

Principle overview

Different organizations might interpret operational readiness differently. Operational readiness is how your organization prepares to successfully operate workloads on Google Cloud. Preparing to operate a complex, multilayered cloud workload requires careful planning for both go-live and day-2 operations. These operations are often called CloudOps.

Focus areas of operational readiness

Operational readiness consists of four focus areas. Each focus area consists of a set of activities and components that are necessary to prepare to operate a complex application or environment in Google Cloud. The following table lists the components and activities of each focus area:

Focus area of operational readiness: activities and components

Workforce
  • Defining clear roles and responsibilities for the teams that manage and operate the cloud resources.
  • Ensuring that team members have appropriate skills.
  • Developing a learning program.
  • Establishing a clear team structure.
  • Hiring the required talent.
Processes
  • Observability.
  • Managing service disruptions.
  • Cloud delivery.
  • Core cloud operations.
Tooling
  • Tools that are required to support CloudOps processes.
Governance
  • Service levels and reporting.
  • Cloud financials.
  • Cloud operating model.
  • Architectural review and governance boards.
  • Cloud architecture and compliance.

Recommendations

To ensure operational readiness and performance by using CloudOps, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Define SLOs and SLAs

A core responsibility of the cloud operations team is to define service level objectives (SLOs) and service level agreements (SLAs) for all of the critical workloads. This recommendation is relevant to the governance focus area of operational readiness.

SLOs must be specific, measurable, achievable, relevant, and time-bound (SMART), and they must reflect the level of service and performance that you want.

  • Specific: Clearly articulates the required level of service and performance.
  • Measurable: Quantifiable and trackable.
  • Achievable: Attainable within the limits of your organization's capabilities and resources.
  • Relevant: Aligned with business goals and priorities.
  • Time-bound: Has a defined timeframe for measurement and evaluation.

For example, an SLO for a web application might be "99.9% availability" or "average response time less than 200 ms." Such SLOs clearly define the required level of service and performance for the web application, and the SLOs can be measured and tracked over time.
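
To make an availability target concrete, it can help to convert the percentage into an allowed amount of downtime. The following sketch does that arithmetic. The 30-day measurement window and the function name are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent, window_days):
    """Minutes of downtime allowed in the window while still meeting the SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Assumption: the SLO is evaluated over a rolling 30-day window.
budget = allowed_downtime_minutes(99.9, 30)
print(f"{budget:.1f} minutes of downtime allowed")
```

Under these assumptions, a 99.9% availability SLO leaves roughly 43 minutes of downtime per 30 days.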

SLAs outline the commitments to customers regarding service availability, performance, and support. SLAs must include specific details about the services that are provided, the level of service that can be expected, the responsibilities of both the service provider and the customer, and any penalties or remedies for noncompliance. SLAs serve as a contractual agreement between the two parties, ensuring that both have a clear understanding of the expectations and obligations that are associated with the cloud service.

Google Cloud provides tools like Cloud Monitoring and service level indicators (SLIs) to help you define and track SLOs. Cloud Monitoring provides comprehensive monitoring and observability capabilities that enable your organization to collect and analyze metrics that are related to the availability, performance, and latency of cloud-based applications and services. SLIs are specific metrics that you can use to measure and track SLOs over time. By utilizing these tools, you can effectively monitor and manage cloud services, and ensure that they meet the SLOs and SLAs.
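
The relationship between an SLI and an SLO can be sketched with a request-based availability SLI. The code below is plain arithmetic, not the Cloud Monitoring API. The function names and request counts are hypothetical.

```python
def availability_sli(good_requests, total_requests):
    """Request-based SLI: percentage of requests that were served successfully."""
    if total_requests == 0:
        raise ValueError("total_requests must be greater than zero")
    return 100.0 * good_requests / total_requests


def error_budget_remaining(sli_percent, slo_percent):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = 100.0 - slo_percent
    spent = 100.0 - sli_percent
    return 1.0 - spent / allowed


# Hypothetical counts for one measurement window.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_percent=99.9)
print(f"SLI={sli:.2f}%, error budget remaining={remaining:.0%}")
```

A shrinking remaining budget is a signal to slow down risky changes before the SLO, and any SLA that depends on it, is breached.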

Clearly defining and communicating SLOs and SLAs for all of your critical cloud services helps to ensure reliability and performance of your deployed applications and services.

Implement comprehensive observability

To get real-time visibility into the health and performance of your cloud environment, we recommend that you use a combination of Google Cloud Observability tools and third-party solutions. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

Implementing a combination of observability solutions provides you with a comprehensive observability strategy that covers various aspects of your cloud infrastructure and applications. Google Cloud Observability is a unified platform for collecting, analyzing, and visualizing metrics, logs, and traces from various Google Cloud services, applications, and external sources. By using Cloud Monitoring, you can gain insights into resource utilization, performance characteristics, and overall health of your resources.

To ensure comprehensive monitoring, monitor important metrics that align with system health indicators such as CPU utilization, memory usage, network traffic, disk I/O, and application response times. You must also consider business-specific metrics. By tracking these metrics, you can identify potential bottlenecks, performance issues, and resource constraints. Additionally, you can set up alerts to notify relevant teams proactively about potential issues or anomalies.
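
As a simplified illustration of threshold-based alerting, the following sketch flags a metric only when several consecutive samples exceed its limit, which helps to avoid alerting on a single spike. The thresholds, metric names, and sample values are hypothetical, and the code does not use any monitoring API.

```python
# Hypothetical alert thresholds, expressed as fractions of capacity.
THRESHOLDS = {"cpu_utilization": 0.80, "memory_usage": 0.90}


def breached(samples, thresholds, min_consecutive=3):
    """Return metrics whose last `min_consecutive` samples all exceed the limit."""
    alerts = []
    for name, limit in thresholds.items():
        recent = samples.get(name, [])[-min_consecutive:]
        if len(recent) == min_consecutive and all(v > limit for v in recent):
            alerts.append(name)
    return alerts


samples = {
    "cpu_utilization": [0.55, 0.83, 0.91, 0.88],  # sustained breach
    "memory_usage": [0.70, 0.95, 0.72, 0.93],     # intermittent spikes only
}
alerts = breached(samples, THRESHOLDS)
print(alerts)
```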

To enhance your monitoring capabilities further, you can integrate third-party solutions with Google Cloud Observability. These solutions can provide additional functionality, such as advanced analytics, machine learning-powered anomaly detection, and incident management capabilities. This combination of Google Cloud Observability tools and third-party solutions lets you create a robust and customizable monitoring ecosystem that's tailored to your specific needs. By using this combination approach, you can proactively identify and address issues, optimize resource utilization, and ensure the overall reliability and availability of your cloud applications and services.

Implement performance and load testing

Performing regular performance testing helps you to ensure that your cloud-based applications and infrastructure can handle peak loads and maintain optimal performance. Load testing simulates realistic traffic patterns. Stress testing pushes the system to its limits to identify potential bottlenecks and performance limitations. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

Tools like Cloud Load Balancing and load testing services can help you to simulate real-world traffic patterns and stress-test your applications. These tools provide valuable insights into how your system behaves under various load conditions, and can help you to identify areas that require optimization.
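
The core of a load test is issuing many concurrent requests and summarizing the latency distribution. The following minimal sketch shows that shape by using only the Python standard library. `fake_request` is a hypothetical stand-in for a call to the system under test, and the request count and concurrency are arbitrary.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(_):
    """Stand-in for a call to the system under test (hypothetical)."""
    start = time.perf_counter()
    time.sleep(0.001)  # simulate work done by the remote service
    return time.perf_counter() - start


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_load_test(request_fn, total_requests, concurrency):
    """Issue requests from a thread pool and summarize the latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(request_fn, range(total_requests)))
    return {
        "count": len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
    }


report = run_load_test(fake_request, total_requests=50, concurrency=10)
print(report)
```

A dedicated load testing tool adds features such as ramp-up schedules and distributed load generation, but the reported percentiles are read in the same way.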

Based on the results of performance testing, you can make decisions to optimize your cloud infrastructure and applications for optimal performance and scalability. This optimization might involve adjusting resource allocation, tuning configurations, or implementing caching mechanisms.

For example, if you find that your application is experiencing slowdowns during periods of high traffic, you might need to increase the number of virtual machines or containers that are allocated to the application. Alternatively, you might need to adjust the configuration of your web server or database to improve performance.

By regularly conducting performance testing and implementing the necessary optimizations, you can ensure that your cloud-based applications and infrastructure always run at peak performance, and deliver a seamless and responsive experience for your users. Doing so can help you to maintain a competitive advantage and build trust with your customers.

Plan and manage capacity

Proactively planning for future capacity needs, both organic and inorganic, helps you to ensure the smooth operation and scalability of your cloud-based systems. This recommendation is relevant to the processes focus area of operational readiness.

Planning for future capacity includes understanding and managing quotas for various resources like compute instances, storage, and API requests. By analyzing historical usage patterns, growth projections, and business requirements, you can accurately anticipate future capacity requirements. You can use tools like Cloud Monitoring and BigQuery to collect and analyze usage data, identify trends, and forecast future demand.

Historical usage patterns provide valuable insights into resource utilization over time. By examining metrics like CPU utilization, memory usage, and network traffic, you can identify periods of high demand and potential bottlenecks. Additionally, you can help to estimate future capacity needs by making growth projections based on factors like growth in the user base, new products and features, and marketing campaigns. When you assess capacity needs, you should also consider business requirements like SLAs and performance targets.
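
A simple starting point for a growth projection is a straight-line trend fitted to historical peaks. The following sketch fits a least-squares line and extrapolates it. The monthly figures and the 20% headroom factor are hypothetical, and a real forecast would usually also account for seasonality and planned events.

```python
def linear_forecast(history, periods_ahead):
    """Fit a least-squares line to evenly spaced samples and extrapolate it."""
    n = len(history)
    if n < 2:
        raise ValueError("at least two samples are required")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical peak vCPU usage for the last six months.
monthly_peak_vcpus = [100, 110, 120, 130, 140, 150]
forecast = linear_forecast(monthly_peak_vcpus, periods_ahead=3)
with_headroom = forecast * 1.2  # assumed 20% safety margin
print(forecast, with_headroom)
```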

When you determine the resource sizing for a workload, consider factors that can affect utilization of resources. Seasonal variations like holiday shopping periods or end-of-quarter sales can lead to temporary spikes in demand. Planned events like product launches or marketing campaigns can also significantly increase traffic. To make sure that your primary and disaster recovery (DR) systems can handle unexpected surges in demand, plan for capacity that can support graceful failover during disruptions like natural disasters and cyberattacks.

Autoscaling is an important strategy for dynamically adjusting your cloud resources based on workload fluctuations. By using autoscaling policies, you can automatically scale compute instances, storage, and other resources in response to changing demand. This ensures optimal performance during peak periods while minimizing costs when resource utilization is low. Autoscaling algorithms use metrics like CPU utilization, memory usage, and queue depth to determine when to scale resources.
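
One widely used scaling rule is proportional: scale the instance count by the ratio of the observed metric to its target, then clamp the result to configured limits. The following sketch shows that rule, which has the same shape as the formula used by the Kubernetes Horizontal Pod Autoscaler. It isn't presented as the algorithm of any specific Google Cloud autoscaler, and the limits and metric values are hypothetical.

```python
import math


def desired_replicas(current, observed_pct, target_pct, min_replicas=1, max_replicas=10):
    """Proportional scaling: replicas grow with the ratio of observed to target."""
    raw = math.ceil(current * observed_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


# CPU utilization is given as whole percentages to keep the arithmetic exact.
scale_out = desired_replicas(current=4, observed_pct=90, target_pct=60)
scale_in = desired_replicas(current=4, observed_pct=30, target_pct=60)
capped = desired_replicas(current=8, observed_pct=95, target_pct=50)
print(scale_out, scale_in, capped)
```

The upper limit protects against runaway cost, and the lower limit keeps enough capacity to absorb a sudden increase in demand.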

Continuously monitor and optimize

To manage and optimize cloud workloads, you must establish a process for continuously monitoring and analyzing performance metrics. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

To establish a process for continuous monitoring and analysis, you track, collect, and evaluate data that's related to various aspects of your cloud environment. By using this data, you can proactively identify areas for improvement, optimize resource utilization, and ensure that your cloud infrastructure consistently meets or exceeds your performance expectations.

An important aspect of performance monitoring is regularly reviewing logs and traces. Logs provide valuable insights into system events, errors, and warnings. Traces provide detailed information about the flow of requests through your application. By analyzing logs and traces, you can detect potential issues, determine the root causes of problems, and get a better understanding of how your applications behave under different conditions. Metrics like the round-trip time between services can help you to identify and understand bottlenecks in your workloads.

Further, you can use performance-tuning techniques to significantly enhance application response times and overall efficiency. The following are examples of techniques that you can use:

  • Caching: Store frequently accessed data in memory to reduce the need for repeated database queries or API calls.
  • Database optimization: Use techniques like indexing and query optimization to improve the performance of database operations.
  • Code profiling: Identify areas of your code that consume excessive resources or cause performance issues.

By applying these techniques, you can optimize your applications and ensure that they run efficiently in the cloud.
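
As an example of the caching technique, the following sketch uses Python's standard `functools.lru_cache` to keep recent results in memory. `slow_lookup` is a hypothetical stand-in for a database query or API call, and the call counter exists only to show how many lookups reach it.

```python
from functools import lru_cache

call_count = 0  # counts how often the slow path runs


def slow_lookup(product_id):
    """Stand-in for a database query or API call (hypothetical)."""
    global call_count
    call_count += 1
    return {"id": product_id, "name": f"product-{product_id}"}


@lru_cache(maxsize=128)
def cached_name(product_id):
    """Return the product name, reusing earlier results for the same ID."""
    return slow_lookup(product_id)["name"]


names = [cached_name(7), cached_name(7), cached_name(8)]
print(names, call_count, cached_name.cache_info())
```

Note that `lru_cache` has no expiry, so data that changes needs an explicit invalidation strategy, such as calling `cached_name.cache_clear()`.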

Manage incidents and problems

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you manage incidents and problems related to your cloud workloads. It involves implementing comprehensive monitoring and observability, establishing clear incident response procedures, conducting thorough root cause analysis, and implementing preventive measures. Many of the topics that are discussed in this principle are covered in detail in the Reliability pillar.

Principle overview

Incident management and problem management are important components of a functional operations environment. How you respond to, categorize, and solve incidents of differing severity can significantly affect your operations. You must also proactively and continuously make adjustments to optimize reliability and performance. An efficient process for incident and problem management relies on the following foundational elements:

  • Continuous monitoring: Identify and resolve issues quickly.
  • Automation: Streamline tasks and improve efficiency.
  • Orchestration: Coordinate and manage cloud resources effectively.
  • Data-driven insights: Optimize cloud operations and make informed decisions.

These elements help you to build a resilient cloud environment that can handle a wide range of challenges and disruptions. These elements can also help to reduce the risk of costly incidents and downtime, and they can help you to achieve greater business agility and success. These foundational elements are spread across the four focus areas of operational readiness: Workforce, Processes, Tooling, and Governance.

Recommendations

To manage incidents and problems effectively, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Establish clear incident response procedures

Clear roles and responsibilities are essential to ensure effective and coordinated response to incidents. Additionally, clear communication protocols and escalation paths help to ensure that information is shared promptly and effectively during an incident. This recommendation is relevant to these focus areas of operational readiness: workforce, processes, and tooling.

To establish incident response procedures, you need to define the roles and expectations of each team member, such as incident commanders, investigators, communicators, and technical experts. Establishing communication and escalation paths includes identifying important contacts, setting up communication channels, and defining the process for escalating incidents to higher levels of management when necessary. Regular training and preparation helps to ensure that teams are equipped with the knowledge and skills to respond to incidents effectively.

By documenting incident response procedures in a runbook or playbook, you can provide a standardized reference guide for teams to follow during an incident. The runbook must outline the steps to be taken at each stage of the incident response process, including communication, triage, investigation, and resolution. It must also include information about relevant tools and resources and contact information for important personnel. You must regularly review and update the runbook to ensure that it remains current and effective.
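
Escalation paths are easier to follow and to test when they're written down as data. The following sketch encodes a time-based escalation policy and returns who should have been notified after an incident has gone unacknowledged for a given time. The roles and timings are hypothetical examples, not recommended values.

```python
# Hypothetical policy: (minutes without acknowledgment, role to notify).
ESCALATION_POLICY = [
    (0, "on-call engineer"),
    (15, "incident commander"),
    (45, "engineering manager"),
]


def contacts_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return every role whose escalation threshold has been reached."""
    return [role for threshold, role in policy if minutes_unacknowledged >= threshold]


print(contacts_to_notify(20))
```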

Centralize incident management

For effective tracking and management throughout the incident lifecycle, consider using a centralized incident management system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

A centralized incident management system provides the following advantages:

  • Improved visibility: By consolidating all incident-related data in a single location, you eliminate the need for teams to search in various channels or systems for context. This approach saves time and reduces confusion, and it gives stakeholders a comprehensive view of the incident, including its status, impact, and progress.
  • Better coordination and collaboration: A centralized system provides a unified platform for communication and task management. It promotes seamless collaboration between the different departments and functions that are involved in incident response. This approach ensures that everyone has access to up-to-date information and it reduces the risk of miscommunication and misalignment.
  • Enhanced accountability and ownership: A centralized incident management system enables your organization to allocate tasks to specific individuals or teams and it ensures that responsibilities are clearly defined and tracked. This approach promotes accountability and encourages proactive problem-solving because team members can easily monitor their progress and contributions.

A centralized incident management system must offer robust features for incident tracking, task assignment, and communication management. These features let you customize workflows, set priorities, and integrate with other systems, such as monitoring tools and ticketing systems.

By implementing a centralized incident management system, you can optimize your organization's incident response processes, improve collaboration, and enhance visibility. Doing so leads to faster incident resolution times, reduced downtime, and improved customer satisfaction. It also helps foster a culture of continuous improvement because you can learn from past incidents and identify areas for improvement.

Conduct thorough post-incident reviews

After an incident occurs, you must conduct a detailed post-incident review (PIR), which is also known as a postmortem, to identify the root cause, contributing factors, and lessons learned. This thorough review helps you to prevent similar incidents in the future. This recommendation is relevant to these focus areas of operational readiness: processes and governance.

The PIR process must involve a multidisciplinary team that has expertise in various aspects of the incident. The team must gather all of the relevant information through interviews, documentation review, and site inspections. A timeline of events must be created to establish the sequence of actions that led up to the incident.

After the team gathers the required information, they must conduct a root cause analysis to determine the factors that led to the incident. This analysis must identify both the immediate cause and the systemic issues that contributed to the incident.

Along with identifying the root cause, the PIR team must identify any other contributing factors that might have caused the incident. These factors could include human error, equipment failure, or organizational factors like communication breakdowns and lack of training.

The PIR report must document the findings of the investigation, including the timeline of events, root cause analysis, and recommended actions. The report is a valuable resource for implementing corrective actions and preventing recurrence. The report must be shared with all of the relevant stakeholders and it must be used to develop safety training and procedures.

To ensure a successful PIR process, your organization must foster a blameless culture that focuses on learning and improvement rather than assigning blame. This culture encourages individuals to report incidents without fear of retribution, and it lets you address systemic issues and make meaningful improvements.

By conducting thorough PIRs and implementing corrective measures based on the findings, you can significantly reduce the risk of similar incidents occurring in the future. This proactive approach to incident investigation and prevention helps to create a safer and more efficient work environment for everyone involved.

Maintain a knowledge base

A knowledge base of known issues, solutions, and troubleshooting guides is essential for incident management and resolution. Team members can use the knowledge base to quickly identify and address common problems. Implementing a knowledge base helps to reduce the need for escalation and it improves overall efficiency. This recommendation is relevant to these focus areas of operational readiness: workforce and processes.

A primary benefit of a knowledge base is that it lets teams learn from past experiences and avoid repeating mistakes. By capturing and sharing solutions to known issues, teams can build a collective understanding of how to resolve common problems and best practices for incident management. Use of a knowledge base saves time and effort, and helps to standardize processes and ensure consistency in incident resolution.

Along with helping to improve incident resolution times, a knowledge base promotes knowledge sharing and collaboration across teams. With a central repository of information, teams can easily access and contribute to the knowledge base, which promotes a culture of continuous learning and improvement. This culture encourages teams to share their expertise and experiences, leading to a more comprehensive and valuable knowledge base.

To create and manage a knowledge base effectively, use appropriate tools and technologies. Collaboration platforms like Google Workspace are well-suited for this purpose because they let you easily create, edit, and share documents collaboratively. These tools also support version control and change tracking, which ensures that the knowledge base remains up-to-date and accurate.

Make the knowledge base easily accessible to all relevant teams. You can achieve this by integrating the knowledge base with existing incident management systems or by providing a dedicated portal or intranet site. A knowledge base that's readily available lets teams quickly access the information that they need to resolve incidents efficiently. This availability helps to reduce downtime and minimize the impact on business operations.

Regularly review and update the knowledge base to ensure that it remains relevant and useful. Monitor incident reports, identify common issues and trends, and incorporate new solutions and troubleshooting guides into the knowledge base. An up-to-date knowledge base helps your teams resolve incidents faster and more effectively.

Automate incident response

Automation helps to streamline your incident response and remediation processes. It lets you address security breaches and system failures promptly and efficiently. By using Google Cloud products like Cloud Run functions or Cloud Run, you can automate various tasks that are typically manual and time-consuming. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

Automated incident response provides the following benefits:

  • Reduction in incident detection and resolution times: Automated tools can continuously monitor systems and applications, detect suspicious or anomalous activities in real time, and notify stakeholders or respond without intervention. This automation lets you identify potential threats or issues before they escalate into major incidents. When an incident is detected, automated tools can trigger predefined remediation actions, such as isolating affected systems, quarantining malicious files, or rolling back changes to restore the system to a known good state.
  • Reduced burden on security and operations teams: Automated incident response lets the security and operations teams focus on more strategic tasks. By automating routine and repetitive tasks, such as collecting diagnostic information or triggering alerts, your organization can free up personnel to handle more complex and critical incidents. This automation can lead to improved overall incident response effectiveness and efficiency.
  • Enhanced consistency and accuracy of the remediation process: Automated tools can ensure that remediation actions are applied uniformly across all affected systems, minimizing the risk of human error or inconsistency. This standardization of the remediation process helps to minimize the impact of incidents on users and the business.
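
As a rough illustration of predefined remediation actions, the following minimal Python sketch maps incident kinds to ordered playbooks. Everything in it is a hypothetical stand-in: the incident kinds, the action functions, and the `PLAYBOOKS` table are assumptions made for this example, and the actions only record what they would do instead of calling any Google Cloud API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    kind: str       # hypothetical category, for example "bad_deploy"
    resource: str   # hypothetical name of the affected resource
    actions_taken: List[str] = field(default_factory=list)


def isolate_system(incident: Incident) -> None:
    # Placeholder: a real action would restrict network access to the resource.
    incident.actions_taken.append(f"isolated {incident.resource}")


def roll_back_change(incident: Incident) -> None:
    # Placeholder: a real action would redeploy the last known good version.
    incident.actions_taken.append(f"rolled back {incident.resource}")


def notify_on_call(incident: Incident) -> None:
    # Placeholder: a real action would page the on-call engineer.
    incident.actions_taken.append(f"paged on-call about {incident.resource}")


# Predefined playbooks: each incident kind maps to an ordered list of actions.
PLAYBOOKS: Dict[str, List[Callable[[Incident], None]]] = {
    "suspicious_activity": [isolate_system, notify_on_call],
    "bad_deploy": [roll_back_change, notify_on_call],
}


def respond(incident: Incident) -> Incident:
    """Run the playbook for the incident kind; unknown kinds only page a human."""
    for action in PLAYBOOKS.get(incident.kind, [notify_on_call]):
        action(incident)
    return incident
```

Because every incident of the same kind runs the same ordered actions, the response is applied uniformly, which is the consistency benefit described above. In practice, a function like `respond` might run in a service such as Cloud Run functions and be triggered by an alert.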

Manage and optimize cloud resources

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you manage and optimize the resources that are used by your cloud workloads. It involves right-sizing resources based on actual usage and demand, using autoscaling for dynamic resource allocation, implementing cost optimization strategies, and regularly reviewing resource utilization and costs. Many of the topics that are discussed in this principle are covered in detail in the Cost optimization pillar.

Principle overview

Cloud resource management and optimization play a vital role in controlling cloud spending, resource usage, and infrastructure efficiency. They include various strategies and best practices aimed at maximizing the value and return from your cloud spending.

This principle's focus on optimization extends beyond cost reduction. It emphasizes the following goals:

  • Efficiency: Using automation and data analytics to achieve peak performance and cost savings.
  • Performance: Scaling resources effortlessly to meet fluctuating demands and deliver optimal results.
  • Scalability: Adapting infrastructure and processes to accommodate rapid growth and diverse workloads.

By focusing on these goals, you achieve a balance between cost and functionality. You can make informed decisions regarding resource provisioning, scaling, and migration. Additionally, you gain valuable insights into resource consumption patterns, which lets you proactively identify and address potential issues before they escalate.

Recommendations

To manage and optimize resources, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Right-size resources

Continuously monitoring resource utilization and adjusting resource allocation to match actual demand are essential for efficient cloud resource management. Over-provisioning resources can lead to unnecessary costs, and under-provisioning can cause performance bottlenecks that affect application performance and user experience. To achieve an optimal balance, you must adopt a proactive approach to right-sizing cloud resources. This recommendation is relevant to the governance focus area of operational readiness.

Cloud Monitoring and Recommender can help you to identify opportunities for right-sizing. Cloud Monitoring provides real-time visibility into resource utilization metrics. This visibility lets you track resource usage patterns and identify potential inefficiencies. Recommender analyzes resource utilization data to make intelligent recommendations for optimizing resource allocation. By using these tools, you can gain insights into resource usage and make informed decisions about right-sizing the resources.

In addition to Cloud Monitoring and Recommender, consider using custom metrics to trigger automated right-sizing actions. Custom metrics let you track specific resource utilization metrics that are relevant to your applications and workloads. You can also configure alerts to notify administrators when predefined thresholds are met. The administrators can then take necessary actions to adjust resource allocation. This proactive approach ensures that resources are scaled in a timely manner, which helps to optimize cloud costs and prevent performance issues.
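
To make the threshold idea concrete, here is a minimal Python sketch that classifies a resource from a list of CPU utilization samples. The function name `rightsizing_hint` and the 20% and 80% thresholds are illustrative assumptions, not values that Recommender or Cloud Monitoring use.

```python
from statistics import mean
from typing import Sequence


def rightsizing_hint(cpu_samples: Sequence[float],
                     low: float = 0.20,
                     high: float = 0.80) -> str:
    """Classify a resource from CPU utilization samples in the range 0.0-1.0.

    The default thresholds are illustrative assumptions only.
    """
    if not cpu_samples:
        raise ValueError("at least one utilization sample is required")
    average = mean(cpu_samples)
    peak = max(cpu_samples)
    if peak < low:
        return "over-provisioned"   # even the peak is far below capacity
    if average > high:
        return "under-provisioned"  # sustained load is close to capacity
    return "right-sized"
```

For example, `rightsizing_hint([0.05, 0.10, 0.08])` reports an over-provisioned resource, because even the peak stays under the lower threshold. A production check would typically look at longer time windows and at memory, disk, and network metrics as well.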

Use autoscaling

Autoscaling compute and other resources helps to ensure optimal performance and cost efficiency of your cloud-based applications. Autoscaling lets you dynamically adjust the capacity of your resources based on workload fluctuations, so that you have the resources that you need when you need them and you can avoid over-provisioning and unnecessary costs. This recommendation is relevant to the processes focus area of operational readiness.

To meet the diverse needs of different applications and workloads, Google Cloud offers various autoscaling options, including the following:

  • Compute Engine managed instance groups (MIGs) are groups of VMs that are managed and scaled as a single entity. With MIGs, you can define autoscaling policies that specify the minimum and maximum number of VMs to maintain in the group, and the conditions that trigger autoscaling. For example, you can configure a policy to add VMs in a MIG when the CPU utilization reaches a certain threshold and to remove VMs when the utilization drops below a different threshold.
  • Google Kubernetes Engine (GKE) autoscaling dynamically adjusts your cluster resources to match your application's needs. It offers the following tools:

    • Cluster Autoscaler adds or removes nodes based on Pod resource demands.
    • Horizontal Pod Autoscaler changes the number of Pod replicas based on CPU, memory, or custom metrics.
    • Vertical Pod Autoscaler fine-tunes Pod resource requests and limits based on usage patterns.
    • Node Auto-Provisioning automatically creates optimized node pools for your workloads.

    These tools work together to optimize resource utilization, ensure application performance, and simplify cluster management.

  • Cloud Run is a serverless platform that lets you run code without having to manage infrastructure. Cloud Run offers built-in autoscaling, which automatically adjusts the number of instances based on the incoming traffic. When the volume of traffic increases, Cloud Run scales up the number of instances to handle the load. When traffic decreases, Cloud Run scales down the number of instances to reduce costs.

By using these autoscaling options, you can ensure that your cloud-based applications have the resources that they need to handle varying workloads, while avoiding over-provisioning and unnecessary costs. Using autoscaling can lead to improved performance, cost savings, and more efficient use of cloud resources.
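
The threshold-based policy described for MIGs can be sketched in a few lines of Python. This is a deliberately simplified model and not how the Compute Engine autoscaler is implemented: the function name, the one-VM-per-evaluation step, and the default thresholds and limits are all assumptions for illustration.

```python
def desired_instances(current: int,
                      cpu_utilization: float,
                      scale_out_above: float = 0.75,
                      scale_in_below: float = 0.30,
                      minimum: int = 1,
                      maximum: int = 10) -> int:
    """Return the next instance count for one evaluation of the policy."""
    if cpu_utilization > scale_out_above:
        target = current + 1   # add a VM when utilization is high
    elif cpu_utilization < scale_in_below:
        target = current - 1   # remove a VM when utilization is low
    else:
        target = current       # stay put between the two thresholds
    # Never leave the configured minimum and maximum group size.
    return max(minimum, min(maximum, target))
```

Using two different thresholds leaves a band in which the group size doesn't change, which helps to avoid repeatedly adding and removing VMs when utilization hovers around a single value.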

Leverage cost optimization strategies

Optimizing cloud spending helps you to effectively manage your organization's IT budgets. This recommendation is relevant to the governance focus area of operational readiness.

Google Cloud offers several tools and techniques to help you optimize cloud costs. By using these tools and techniques, you can get the best value from your cloud spending. These tools and techniques help you to identify areas where costs can be reduced, such as identifying underutilized resources or recommending more cost-effective instance types. Google Cloud options to help optimize cloud costs include the following:

Pricing models might change over time, and new features might be introduced that offer better performance or lower cost compared to existing options. Therefore, you should regularly review pricing models and consider alternative features. By staying informed about the latest pricing models and features, you can make informed decisions about your cloud architecture to minimize costs.

Google Cloud's Cost Management tools, such as budgets and alerts, provide valuable insights into cloud spending. You can set a budget and receive alerts when spending exceeds it. These tools help you track your cloud spending and identify areas where costs can be reduced.

Track resource usage and costs

You can use tagging and labeling to track resource usage and costs. By assigning tags and labels that identify projects, departments, or other relevant dimensions to your cloud resources, you can categorize and organize the resources. This lets you monitor and analyze spending patterns for specific resources and identify areas of high usage or potential cost savings. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.
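
As an illustration of why labels matter for cost analysis, the following Python sketch groups cost line items by a label key. The record shape (a `cost` number and a `labels` dictionary) is a hypothetical simplification chosen for this example, not the schema of a Cloud Billing export.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_label(line_items: Iterable[Mapping],
                  label_key: str) -> Dict[str, float]:
    """Sum the cost of each line item under the value of one label key."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        # Resources without the label are grouped so that gaps stay visible.
        value = item.get("labels", {}).get(label_key, "unlabeled")
        totals[value] += item["cost"]
    return dict(totals)
```

Keeping an explicit "unlabeled" bucket makes it easy to see how much spending can't yet be attributed to a team or project.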

Tools like Cloud Billing and Cost Management help you to get a comprehensive understanding of your spending patterns. These tools provide detailed insights into your cloud usage and they let you identify trends, forecast costs, and make informed decisions. By analyzing historical data and current spending patterns, you can identify the focus areas for your cost-optimization efforts.

Custom dashboards and reports help you to visualize cost data and gain deeper insights into spending trends. By customizing dashboards with relevant metrics and dimensions, you can monitor key performance indicators (KPIs) and track progress towards your cost optimization goals. Reports offer deeper analyses of cost data. Reports let you filter the data by specific time periods or resource types to understand the underlying factors that contribute to your cloud spending.

Regularly review and update your tags, labels, and cost analysis tools to ensure that you have the most up-to-date information on your cloud usage and costs. By staying informed and conducting cost postmortems or proactive cost reviews, you can promptly identify any unexpected increases in spending. Doing so lets you make proactive decisions to optimize cloud resources and control costs.

Establish cost allocation and budgeting

Accountability and transparency in cloud cost management are crucial for optimizing resource utilization and ensuring financial control. This recommendation is relevant to the governance focus area of operational readiness.

To ensure accountability and transparency, you need to have clear mechanisms for cost allocation and chargeback. By allocating costs to specific teams, projects, or individuals, your organization can ensure that each of these entities is responsible for its cloud usage. This practice fosters a sense of ownership and encourages responsible resource management. Additionally, chargeback mechanisms enable your organization to recover cloud costs from internal customers, align incentives with performance, and promote fiscal discipline.
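
One possible chargeback rule, shown here only as a sketch, is to split a shared cost across teams in proportion to each team's direct spending. The function name and the proportional rule are assumptions for illustration; your organization might allocate shared costs differently.

```python
from typing import Dict


def allocate_shared_cost(direct_costs: Dict[str, float],
                         shared_cost: float) -> Dict[str, float]:
    """Return each team's direct cost plus its proportional share of a shared cost."""
    total_direct = sum(direct_costs.values())
    if total_direct <= 0:
        raise ValueError("total direct cost must be positive")
    return {
        team: cost + shared_cost * cost / total_direct
        for team, cost in direct_costs.items()
    }
```

For example, with direct costs of 300 and 100 and a shared cost of 200, the first team is charged 450 and the second 150.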

Establishing budgets for different teams or projects is another essential aspect of cloud cost management. Budgets enable your organization to define spending limits and track actual expenses against those limits. This approach lets you make proactive decisions to prevent uncontrolled spending. By setting realistic and achievable budgets, you can ensure that cloud resources are used efficiently and aligned with business objectives. Regular monitoring of actual spending against budgets helps you to identify variances and address potential overruns promptly.

To monitor budgets, you can use tools like Cloud Billing budgets and alerts. These tools provide real-time insights into cloud spending and they notify stakeholders of potential overruns. By using these capabilities, you can track cloud costs and take corrective actions before significant deviations occur. This proactive approach helps to prevent financial surprises and ensures that cloud resources are used responsibly.
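
The alerting logic behind a budget can be sketched as a comparison of spending against a set of threshold fractions. In this minimal Python example, the function name and the 50%, 90%, and 100% thresholds are illustrative assumptions; the sketch doesn't call the Cloud Billing API.

```python
from typing import List, Sequence


def triggered_thresholds(budget: float,
                         spend: float,
                         thresholds: Sequence[float] = (0.5, 0.9, 1.0)) -> List[float]:
    """Return the budget thresholds (as fractions) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in sorted(thresholds) if ratio >= t]
```

Alerting at thresholds below 100% gives stakeholders time to react before the budget is actually exceeded.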

Automate and manage change

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you automate and manage change for your cloud workloads. It involves implementing infrastructure as code (IaC), establishing standard operating procedures, implementing a structured change management process, and using automation and orchestration.

Principle overview

Change management and automation play a crucial role in ensuring smooth and controlled transitions within cloud environments. For effective change management, you need to use strategies and best practices that minimize disruptions and ensure that changes are integrated seamlessly with existing systems.

Effective change management and automation include the following foundational elements:

  • Change governance: Establish clear policies and procedures for change management, including approval processes and communication plans.
  • Risk assessment: Identify potential risks associated with changes and mitigate them through risk management techniques.
  • Testing and validation: Thoroughly test changes to ensure that they meet functional and performance requirements and mitigate potential regressions.
  • Controlled deployment: Implement changes in a controlled manner, so that users transition smoothly to the new environment, and provide mechanisms to roll back quickly if needed.

These foundational elements help to minimize the impact of changes and ensure that changes have a positive effect on business operations. These elements are represented by the processes, tooling, and governance focus areas of operational readiness.

Recommendations

To automate and manage change, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Adopt IaC

Infrastructure as code (IaC) is a transformative approach for managing cloud infrastructure. You can define and manage cloud infrastructure declaratively by using tools like Terraform. IaC helps you achieve consistency, repeatability, and simplified change management. It also enables faster and more reliable deployments. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

The following are the main benefits of adopting the IaC approach for your cloud deployments:

  • Human-readable resource configurations: With the IaC approach, you can declare your cloud infrastructure resources in a human-readable format, like JSON or YAML. Infrastructure administrators and operators can easily understand and modify the infrastructure and collaborate with others.
  • Consistency and repeatability: IaC enables consistency and repeatability in your infrastructure deployments. You can ensure that your infrastructure is provisioned and configured the same way every time, regardless of who is performing the deployment. This approach helps to reduce errors and ensures that your infrastructure is always in a known state.
  • Accountability and simplified troubleshooting: The IaC approach helps to improve accountability and makes it easier to troubleshoot issues. By storing your IaC code in a version control system, you can track changes, and identify when changes were made and by whom. If necessary, you can easily roll back to previous versions.
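
The declarative idea behind IaC can be sketched as a comparison of desired state with actual state. The following Python example is a toy model under stated assumptions: resources are plain dictionaries keyed by hypothetical names, and `plan` only lists the changes that would be needed. It doesn't reproduce how Terraform computes a plan.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """List the (action, resource name) pairs needed to reach the desired state."""
    changes: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))   # declared but not deployed
        elif name not in desired:
            changes.append(("delete", name))   # deployed but no longer declared
        elif desired[name] != actual[name]:
            changes.append(("update", name))   # deployed with different settings
    return changes
```

When the desired and actual states match, the plan is empty, which is what makes repeated deployments of the same configuration safe and predictable.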

Implement version control

A version control system like Git is a key component of the IaC process. It provides robust change management and risk mitigation capabilities, which is why it's widely adopted, whether self-hosted or consumed as a SaaS solution. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.

By tracking changes to IaC code and configurations, version control provides visibility into the evolution of the code, making it easier to understand the impact of changes and identify potential issues. This enhanced visibility fosters collaboration among team members who work on the same IaC project.

Most version control systems let you easily roll back changes if needed. This capability helps to mitigate the risk of unintended consequences or errors. By using tools like Git in your IaC workflow, you can significantly improve change management processes, foster collaboration, and mitigate risks, which leads to a more efficient and reliable IaC implementation.

Build CI/CD pipelines

Continuous integration and continuous delivery (CI/CD) pipelines streamline the process of developing and deploying cloud applications. CI/CD pipelines automate the building, testing, and deployment stages, which enables faster and more frequent releases with improved quality control. This recommendation is relevant to the tooling focus area of operational readiness.

CI/CD pipelines ensure that code changes are continuously integrated into a central repository, typically a version control system like Git. Continuous integration facilitates early detection and resolution of issues, and it reduces the likelihood of bugs or compatibility problems.

To create and manage CI/CD pipelines for cloud applications, you can use tools like Cloud Build and Cloud Deploy.

  • Cloud Build is a fully managed build service that lets developers define and execute build steps in a declarative manner. It integrates seamlessly with popular source-code management platforms and it can be triggered by events like code pushes and pull requests.
  • Cloud Deploy is a serverless deployment service that automates the process of deploying applications to various environments, such as testing, staging, and production. It provides features like blue-green deployments, traffic splitting, and rollback capabilities, making it easier to manage and monitor application deployments.

Integrating CI/CD pipelines with version control systems and testing frameworks helps to ensure the quality and reliability of your cloud applications. By running automated tests as part of the CI/CD process, development teams can quickly identify and fix any issues before the code is deployed to the production environment. This integration helps to improve the overall stability and performance of your cloud applications.
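
The gating behavior of a pipeline, where a failed stage prevents later stages from running, can be sketched as follows. This Python example is a conceptual model only: the stage names and the `run_pipeline` function are hypothetical, and real pipelines in Cloud Build or Cloud Deploy are defined in configuration files instead of code like this.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure."""
    log: List[str] = []
    for name, step in stages:
        if step():
            log.append(f"{name}: passed")
        else:
            log.append(f"{name}: failed")
            break  # later stages, such as deploy, never run
    return log
```

For example, if the test stage fails, the log ends with the failed test stage and the deploy stage is never attempted, so untested code doesn't reach production.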

Use configuration management tools

Tools like Puppet, Chef, Ansible, and VM Manager help you to automate the configuration and management of cloud resources. Using these tools, you can ensure resource consistency and compliance across your cloud environments. This recommendation is relevant to the tooling focus area of operational readiness.

Automating the configuration and management of cloud resources provides the following benefits:

  • Significant reduction in the risk of manual errors: When manual processes are involved, there is a higher likelihood of mistakes due to human error. Configuration management tools reduce this risk by automating processes, so that configurations are applied consistently and accurately across all cloud resources. This automation can lead to improved reliability and stability of the cloud environment.
  • Improvement in operational efficiency: By automating repetitive tasks, your organization can free up IT staff to focus on more strategic initiatives. This automation can lead to increased productivity and cost savings and improved responsiveness to changing business needs.
  • Simplified management of complex cloud infrastructure: As cloud environments grow in size and complexity, managing the resources can become increasingly difficult. Configuration management tools provide a centralized platform for managing cloud resources. The tools make it easier to track configurations, identify issues, and implement changes. Using these tools can lead to improved visibility, control, and security of your cloud environment.
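
A property that configuration management tools commonly aim for is idempotence: applying the same configuration twice changes nothing the second time. The following Python sketch illustrates that property with plain dictionaries. The setting names and the `apply_config` function are hypothetical and don't correspond to the API of Puppet, Chef, Ansible, or VM Manager.

```python
from typing import Dict, List


def apply_config(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Bring `current` in line with `desired` and return the keys that changed."""
    changed: List[str] = []
    for key, value in sorted(desired.items()):
        if key not in current or current[key] != value:
            current[key] = value   # only touch settings that differ
            changed.append(key)
    return changed
```

The returned list of changed keys doubles as a simple drift report: an empty list means that the resource already matches the desired configuration.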

Automate testing

Integrating automated testing into your CI/CD pipelines helps to ensure the quality and reliability of your cloud applications. By validating changes before deployment, you can significantly reduce the risk of errors and regressions, which leads to a more stable and robust software system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

The following are the main benefits of incorporating automated testing into your CI/CD pipelines:

  • Early detection of bugs and defects: Automated testing helps to detect bugs and defects early in the development process, before they can cause major problems in production. This capability saves time and resources by preventing the need for costly rework and bug fixes at later stages in the development process.
  • High quality and standards-based code: Automated testing can help improve the overall quality of your code by ensuring that the code meets certain standards and best practices. This capability leads to more maintainable and reliable applications that are less prone to errors.

You can use various types of testing techniques in CI/CD pipelines. Each test type serves a specific purpose.

  • Unit testing focuses on testing individual units of code, such as functions or methods, to ensure that they work as expected.
  • Integration testing tests the interactions between different components or modules of your application to verify that they work properly together.
  • End-to-end testing is often used along with unit and integration testing. End-to-end testing simulates real-world scenarios to test the application as a whole, and helps to ensure that the application meets the requirements of your end users.

To effectively integrate automated testing into your CI/CD pipelines, you must choose appropriate testing tools and frameworks. There are many different options, each with its own strengths and weaknesses. You must also establish a clear testing strategy that outlines the types of tests to be performed, the frequency of testing, and the criteria for passing or failing a test. By following these recommendations, you can ensure that your automated testing process is efficient and effective. Such a process provides valuable insights into the quality and reliability of your cloud applications.
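
As a small example of the unit-testing level, the following Python sketch uses the standard `unittest` module. The `apply_discount` function is a hypothetical piece of application code that exists only to give the tests something to check, and `run_suite` shows how a pipeline step might turn the test outcome into a pass or fail signal.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical application code under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)


def run_suite() -> bool:
    """Run the tests and report success, as a CI step might before deploying."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A clear pass or fail criterion like the return value of `run_suite` is what lets a pipeline block a deployment automatically when a test fails.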

Continuously improve and innovate

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you continuously optimize cloud operations and drive innovation.

Principle overview

To continuously improve and innovate in the cloud, you need to focus on continuous learning, experimentation, and adaptation. This focus helps you to explore new technologies and optimize existing processes. It also promotes a culture of excellence that enables your organization to achieve and maintain industry leadership.

Through continuous improvement and innovation, you can achieve the following goals:

  • Accelerate innovation: Explore new technologies and services to enhance capabilities and drive differentiation.
  • Reduce costs: Identify and eliminate inefficiencies through process-improvement initiatives.
  • Enhance agility: Adapt rapidly to changing market demands and customer needs.
  • Improve decision making: Gain valuable insights from data and analytics to make data-driven decisions.

Organizations that embrace the continuous improvement and innovation principle can unlock the full potential of the cloud environment and achieve sustainable growth. This principle maps primarily to the workforce focus area of operational readiness. A culture of innovation lets teams experiment with new tools and technologies to expand capabilities and reduce costs.

Recommendations

To continuously improve and innovate your cloud workloads, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Foster a culture of learning

Encourage teams to experiment, share knowledge, and learn continuously. Adopt a blameless culture where failures are viewed as opportunities for growth and improvement. This recommendation is relevant to the workforce focus area of operational readiness.

When you foster a culture of learning, teams can learn from mistakes and iterate quickly. This approach encourages team members to take risks, experiment with new ideas, and expand the boundaries of their work. It also creates a psychologically safe environment where individuals feel comfortable sharing failures and learning from them. Sharing in this way leads to a more open and collaborative environment.

To facilitate knowledge sharing and continuous learning, create opportunities for teams to share knowledge and learn from each other. You can do this through informal and formal learning sessions and conferences.

By fostering a culture of experimentation, knowledge sharing, and continuous learning, you can create an environment where teams are empowered to take risks, innovate, and grow. This environment can lead to increased productivity, improved problem-solving, and a more engaged and motivated workforce. Further, by promoting a blameless culture, you can create a safe space for employees to learn from mistakes and contribute to the collective knowledge of the team. This culture ultimately leads to a more resilient and adaptable workforce that is better equipped to handle challenges and drive success in the long run.

Conduct regular retrospectives

Retrospectives give teams an opportunity to reflect on their experiences, identify what went well, and identify what can be improved. By conducting retrospectives after projects or major incidents, teams can learn from successes and failures, and continuously improve their processes and practices. This recommendation is relevant to these focus areas of operational readiness: processes and governance.

An effective way to structure a retrospective is to use the Start-Stop-Continue model:

  • Start: In the Start phase of the retrospective, team members identify new practices, processes, and behaviors that they believe can enhance their work. They discuss why the changes are needed and how they can be implemented.
  • Stop: In the Stop phase, team members identify and eliminate practices, processes, and behaviors that are no longer effective or that hinder progress. They discuss why these changes are necessary and how they can be implemented.
  • Continue: In the Continue phase, team members identify practices, processes, and behaviors that work well and must be continued. They discuss why these elements are important and how they can be reinforced.

By using a structured format like the Start-Stop-Continue model, teams can ensure that retrospectives are productive and focused. This model helps to facilitate discussion, identify the main takeaways, and identify actionable steps for future enhancements.

Stay up-to-date with cloud technologies

To maximize the potential of Google Cloud services, you must keep up with the latest advancements, features, and best practices. This recommendation is relevant to the workforce focus area of operational readiness.

Participating in relevant conferences, webinars, and training sessions is a valuable way to expand your knowledge. These events provide opportunities to learn from Google Cloud experts, understand new capabilities, and engage with industry peers who might face similar challenges. By attending these sessions, you can gain insights into how to use new features effectively, optimize your cloud operations, and drive innovation within your organization.

To ensure that your team members keep up with cloud technologies, encourage them to obtain certifications and attend training courses. Google Cloud offers a wide range of certifications that validate skills and knowledge in specific cloud domains. Earning these certifications demonstrates commitment to excellence and provides tangible evidence of proficiency in cloud technologies. The training courses that are offered by Google Cloud and our partners delve deeper into specific topics. They provide direct experience and practical skills that can be immediately applied to real-world projects. By investing in the professional development of your team, you can foster a culture of continuous learning and ensure that everyone has the necessary skills to succeed in the cloud.

Actively seek and incorporate feedback

Collect feedback from users, stakeholders, and team members. Use the feedback to identify opportunities to improve your cloud solutions. This recommendation is relevant to the workforce focus area of operational readiness.

The feedback that you collect can help you to understand the evolving needs, issues, and expectations of the users of your solutions. This feedback serves as a valuable input to drive improvements and prioritize future enhancements. You can use various mechanisms to collect feedback:

  • Surveys are an effective way to gather quantitative data from a large number of users and stakeholders.
  • User interviews provide an opportunity for in-depth qualitative data collection. Interviews let you understand the specific challenges and experiences of individual users.
  • Feedback forms that are placed within the cloud solutions offer a convenient way for users to provide immediate feedback on their experience.
  • Regular meetings with team members can facilitate the collection of feedback on technical aspects and implementation challenges.

The feedback that you collect through these mechanisms must be analyzed and synthesized to identify common themes and patterns. This analysis can help you prioritize future enhancements based on the impact and feasibility of the suggested improvements. By addressing the needs and issues that are identified through feedback, you can ensure that your cloud solutions continue to meet the evolving requirements of your users and stakeholders.

Measure and track progress

Key performance indicators (KPIs) and metrics are crucial for tracking progress and measuring the effectiveness of your cloud operations. KPIs are quantifiable measurements that reflect overall performance. Metrics are specific data points that contribute to the calculation of KPIs. Review the metrics regularly and use them to identify opportunities for improvement and measure progress. Doing so helps you to continuously improve and optimize your cloud environment. This recommendation is relevant to these focus areas of operational readiness: governance and processes.

A primary benefit of using KPIs and metrics is that they enable your organization to adopt a data-driven approach to cloud operations. By tracking and analyzing operational data, you can make informed decisions about how to improve the cloud environment. This data-driven approach helps you to identify trends, patterns, and anomalies that might not be visible without the use of systematic metrics.

To collect and analyze operational data, you can use tools like Cloud Monitoring and BigQuery. Cloud Monitoring enables real-time monitoring of cloud resources and services. BigQuery lets you store and analyze the data that you gather through monitoring. Using these tools together, you can create custom dashboards to visualize important metrics and trends.

Operational dashboards can provide a centralized view of the most important metrics, which lets you quickly identify any areas that need attention. For example, a dashboard might include metrics like CPU utilization, memory usage, network traffic, and latency for a particular application or service. By monitoring these metrics, you can quickly identify any potential issues and take steps to resolve them.
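
To illustrate how metrics roll up into KPIs, the following Python sketch derives two example KPIs from raw data points: availability from request counts, and mean time to resolve from incident durations. The choice of these two KPIs and the function names are assumptions for this example. In practice, you might compute such values with queries over data stored in BigQuery.

```python
from statistics import mean
from typing import Sequence


def availability(successful_requests: int, total_requests: int) -> float:
    """KPI: fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def mean_time_to_resolve(resolution_minutes: Sequence[float]) -> float:
    """KPI: average time, in minutes, taken to resolve an incident."""
    if not resolution_minutes:
        raise ValueError("at least one incident is required")
    return mean(resolution_minutes)
```

Tracking values like these over time, for example on a dashboard, shows whether changes to your operations are actually improving the outcomes that you care about.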

Google Cloud Architecture Framework: Security, privacy, and compliance

This pillar of the Google Cloud Architecture Framework shows you how to architect and operate secure services on Google Cloud. You also learn about Google Cloud products and features that support security and compliance.

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework helps you design your Google Cloud deployment so that it matches your business needs.

Moving your workloads into Google Cloud requires an evaluation of your business requirements, risks, compliance obligations, and security controls. This document helps you consider key best practices related to designing a secure solution in Google Cloud.

Google core principles include defense in depth, at scale, and by default. In Google Cloud, data and systems are protected through multiple layered defenses using policies and controls that are configured across IAM, encryption, networking, detection, logging, and monitoring.

Google Cloud comes with many security controls that you can build on, such as the following:

  • Secure options for data in transit, and default encryption for data at rest.
  • Built-in security features for Google Cloud products and services.
  • A global infrastructure that's designed for geo-redundancy, with security controls throughout the information-processing lifecycle.
  • Automation capabilities that use infrastructure as code (IaC) and configuration guardrails.

For more information about the security posture of Google Cloud, see the Google security paper and the Google Infrastructure Security Design Overview. For an example secure-by-default environment, see the Google Cloud enterprise foundations blueprint.

For security principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Security.

In the security pillar of the Architecture Framework, you learn to do the following:

Shared responsibilities and shared fate on Google Cloud

This document describes the differences between the shared responsibility model and shared fate in Google Cloud. It discusses the challenges and nuances of the shared responsibility model. This document describes what shared fate is and how we partner with our customers to address cloud security challenges.

Understanding the shared responsibility model is important when determining how to best protect your data and workloads on Google Cloud. The shared responsibility model describes the tasks that you have when it comes to security in the cloud and how these tasks are different for cloud providers.

Understanding shared responsibility, however, can be challenging. The model requires an in-depth understanding of each service you utilize, the configuration options that each service provides, and what Google Cloud does to secure the service. Every service has a different configuration profile, and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops short of helping cloud customers achieve better security outcomes. Instead of shared responsibility, we believe in shared fate.

Shared fate includes us building and operating a trusted cloud platform for your workloads. We provide best practice guidance and secured, attested infrastructure code that you can use to deploy your workloads in a secure way. We release solutions that combine various Google Cloud services to solve complex security problems and we offer innovative insurance options to help you measure and mitigate the risks that you must accept. Shared fate involves us more closely interacting with you as you secure your resources on Google Cloud.

Shared responsibility

You're the expert in knowing the security and regulatory requirements for your business, and knowing the requirements for protecting your confidential data and resources. When you run your workloads on Google Cloud, you must identify the security controls that you need to configure in Google Cloud to help protect your confidential data and each workload. To decide which security controls to implement, you must consider the following factors:

  • Your regulatory compliance obligations
  • Your organization's security standards and risk management plan
  • Security requirements of your customers and your vendors

Defined by workloads

Traditionally, responsibilities are defined based on the type of workload that you're running and the cloud services that you require. Cloud services include the following categories:

Cloud service: Infrastructure as a service (IaaS)

IaaS services include Compute Engine, Cloud Storage, and networking services such as Cloud VPN, Cloud Load Balancing, and Cloud DNS.

IaaS provides compute, storage, and network services on demand with pay-as-you-go pricing. You can use IaaS if you plan on migrating an existing on-premises workload to the cloud using lift-and-shift, or if you want to run your application on particular VMs, using specific databases or network configurations.

In IaaS, the bulk of the security responsibilities are yours, and our responsibilities are focused on the underlying infrastructure and physical security.

Cloud service: Platform as a service (PaaS)

PaaS services include App Engine, Google Kubernetes Engine (GKE), and BigQuery.

PaaS provides the runtime environment that you can develop and run your applications in. You can use PaaS if you're building an application (such as a website) and want to focus on development rather than on the underlying infrastructure.

In PaaS, we're responsible for more controls than in IaaS. Typically, this will vary by the services and features that you use. You share responsibility with us for application-level controls and IAM management. You remain responsible for your data security and client protection.

Cloud service: Software as a service (SaaS)

SaaS applications include Google Workspace, Google Security Operations, and third-party SaaS applications that are available in Google Cloud Marketplace.

SaaS provides online applications that you can subscribe to or pay for in some way. You can use SaaS applications when your enterprise doesn't have the internal expertise or business requirement to build the application itself, but does require the ability to process workloads.

In SaaS, we own the bulk of the security responsibilities. You remain responsible for your access controls and the data that you choose to store in the application.

Cloud service: Function as a service (FaaS) or serverless

FaaS provides the platform for developers to run small, single-purpose code (called functions) that runs in response to particular events. You would use FaaS when you want particular things to occur based on a particular event. For example, you might create a function that runs whenever data is uploaded to Cloud Storage so that it can be classified.

FaaS has a shared responsibility list similar to that of SaaS. Cloud Run functions is a FaaS application.
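
The Cloud Storage example above can be sketched as a small event handler. This is a plain Python function, not the Cloud Run functions API: the event shape (a dictionary with `bucket` and `name` keys), the `classify_upload` name, and the extension-based rules are all assumptions made for illustration.

```python
import os

# Hypothetical classification rules keyed by file extension.
CATEGORY_BY_EXTENSION = {
    ".csv": "tabular",
    ".json": "structured",
    ".jpg": "image",
    ".png": "image",
}


def classify_upload(event: dict) -> dict:
    """Handle one 'object uploaded' event and return a classification record."""
    name = event["name"]
    extension = os.path.splitext(name)[1].lower()
    return {
        "bucket": event["bucket"],
        "object": name,
        "category": CATEGORY_BY_EXTENSION.get(extension, "unclassified"),
    }
```

The function is single-purpose and keeps no state between events, which is the property that makes this style of code a fit for FaaS.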

The following diagram shows the cloud services and defines how responsibilities are shared between the cloud provider and customer.

Shared security responsibilities

As the diagram shows, the cloud provider always remains responsible for the underlying network and infrastructure, and customers always remain responsible for their access policies and data.

Defined by industry and regulatory framework

Various industries have regulatory frameworks that define the security controls that must be in place. When you move your workloads to the cloud, you must understand the following:

  • Which security controls are your responsibility
  • Which security controls are available as part of the cloud offering
  • Which default security controls are inherited

Inherited security controls (such as our default encryption and infrastructure controls) are controls that you can provide as part of your evidence of your security posture to auditors and regulators. For example, the Payment Card Industry Data Security Standard (PCI DSS) defines regulations for payment processors. When you move your business to the cloud, responsibility for meeting these regulations is shared between you and your cloud service provider (CSP). To understand how PCI DSS responsibilities are shared between you and Google Cloud, see Google Cloud: PCI DSS Shared Responsibility Matrix.

As another example, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) has set standards for handling electronic protected health information (PHI). These responsibilities are also shared between the CSP and you. For more information on how Google Cloud meets our responsibilities under HIPAA, see HIPAA - Compliance.

Other industries (for example, finance or manufacturing) also have regulations that define how data can be gathered, processed, and stored. For more information about shared responsibility related to these, and how Google Cloud meets our responsibilities, see Compliance resource center.

Defined by location

Depending on your business scenario, you might need to consider your responsibilities based on the location of your business offices, your customers, and your data. Different countries and regions have created regulations that inform how you can process and store your customers' data. For example, if your business has customers who reside in the European Union, your business might need to abide by the requirements that are described in the General Data Protection Regulation (GDPR), and you might be obligated to keep your customer data in the EU itself. In this circumstance, you are responsible for ensuring that the data that you collect remains in the Google Cloud regions in the EU. For more information about how we meet our GDPR obligations, see GDPR and Google Cloud.

For information about the requirements related to your region, see Compliance offerings. If your scenario is particularly complicated, we recommend speaking with our sales team or one of our partners to help you evaluate your security responsibilities.

Challenges for shared responsibility

Though shared responsibility helps define the security roles that you or the cloud provider has, relying on shared responsibility can still create challenges. Consider the following scenarios:

  • Most cloud security breaches are the direct result of misconfiguration (listed as number 3 in the Cloud Security Alliance's Pandemic 11 Report) and this trend is expected to increase. Cloud products are constantly changing, and new ones are constantly being launched. Keeping up with constant change can seem overwhelming. Customers need cloud providers to provide them with opinionated best practices to help keep up with the change, starting with best practices by default and having a baseline secure configuration.
  • Though dividing items by cloud services is helpful, many enterprises have workloads that require multiple cloud services types. In this circumstance, you must consider how various security controls for these services interact, including whether they overlap between and across services. For example, you might have an on-premises application that you're migrating to Compute Engine, use Google Workspace for corporate email, and also run BigQuery to analyze data to improve your products.
  • Your business and markets are constantly changing: regulations change, you enter new markets, or you acquire other companies. Your new markets might have different requirements, and your new acquisition might host its workloads on another cloud. To manage the constant changes, you must constantly re-assess your risk profile and be able to implement new controls quickly.
  • How and where to manage your data encryption keys is an important decision that ties in with your responsibilities to protect your data. The option that you choose depends on your regulatory requirements, whether you're running a hybrid cloud environment or still have an on-premises environment, and the sensitivity of the data that you're processing and storing.
  • Incident management is an important, and often overlooked, area where your responsibilities and the cloud provider's responsibilities aren't easily defined. Many incidents require close collaboration and support from the cloud provider to help investigate and mitigate them. Other incidents can result from poorly configured cloud resources or stolen credentials, and ensuring that you meet the best practices for securing your resources and accounts can be quite challenging.
  • Advanced persistent threats (APTs) and new vulnerabilities can impact your workloads in ways that you might not consider when you start your cloud transformation. Staying up to date on the changing landscape, and knowing who is responsible for threat mitigation, is difficult, particularly if your business doesn't have a large security team.

Shared fate

We developed shared fate in Google Cloud to start addressing the challenges that the shared responsibility model doesn't address. Shared fate focuses on how all parties can better interact to continuously improve security. Shared fate builds on the shared responsibility model because it views the relationship between cloud provider and customer as an ongoing partnership to improve security.

Shared fate is about us taking responsibility for making Google Cloud more secure. Shared fate includes helping you get started with a secured landing zone and being clear, opinionated, and transparent about recommended security controls, settings, and associated best practices. It includes helping you better quantify and manage your risk with cyber-insurance, using our Risk Protection Program. Using shared fate, we want to evolve from the standard shared responsibility framework to a better model that helps you secure your business and build trust in Google Cloud.

The following sections describe various components of shared fate.

Help getting started

A key component of shared fate is the resources that we provide to help you get started in a secure configuration in Google Cloud. Starting with a secure configuration helps reduce the issue of misconfigurations, which are the root cause of most security breaches.

Our resources include the following:

  • Enterprise foundations blueprint that discusses top security concerns and our top recommendations.
  • Secure blueprints that let you deploy and maintain secure solutions using infrastructure as code (IaC). Blueprints have our security recommendations enabled by default. Many blueprints are created by Google security teams and managed as products. This support means that they're updated regularly, go through a rigorous testing process, and receive attestations from third-party testing groups. Blueprints include the enterprise foundations blueprint and the secured data warehouse blueprint.

  • Architecture Framework best practices that address the top recommendations for building security into your designs. The Architecture Framework includes a security section and a community zone that you can use to connect with experts and peers.

  • Landing zone navigation guides that step you through the top decisions that you need to make to build a secure foundation for your workloads, including resource hierarchy, identity onboarding, security and key management, and network structure.

Risk Protection Program

Shared fate also includes the Risk Protection Program (currently in preview), which helps you use the power of Google Cloud as a platform to manage risk, rather than just seeing cloud workloads as another source of risk that you need to manage. The Risk Protection Program is a collaboration between Google Cloud and two leading cyber insurance companies, Munich Re and Allianz Global Corporate & Specialty.

The Risk Protection Program includes Risk Manager, which provides data-driven insights that you can use to better understand your cloud security posture. If you're looking for cyber insurance coverage, you can share these insights from Risk Manager directly with our insurance partners to obtain a quote. For more information, see Google Cloud Risk Protection Program now in Preview.

Help with deployment and governance

Shared fate also helps with your continued governance of your environment. For example, we focus efforts on products such as the following:

Putting shared responsibility and shared fate into practice

As part of your planning process, consider the following actions to help you understand and implement appropriate security controls:

  • Create a list of the types of workloads that you will host in Google Cloud, and whether they require IaaS, PaaS, or SaaS services. You can use the shared responsibility diagram as a checklist to ensure that you know the security controls that you need to consider.
  • Create a list of regulatory requirements that you must comply with, and access resources in the Compliance resource center that relate to those requirements.
  • Review the list of available blueprints and architectures in the Architecture Center for the security controls that you require for your particular workloads. The blueprints provide a list of recommended controls and the IaC that you require to deploy that architecture.
  • Use the landing zone documentation and the recommendations in the enterprise foundations guide to design a resource hierarchy and network architecture that meets your requirements. You can use the opinionated workload blueprints, like the secured data warehouse, to accelerate your development process.
  • After you deploy your workloads, verify that you're meeting your security responsibilities using services such as the Risk Manager, Assured Workloads, Policy Intelligence tools, and Security Command Center Premium.

For more information, see the CISO's Guide to Cloud Transformation paper.

Security principles

This document in the Google Cloud Architecture Framework explains core principles for running secure and compliant services on Google Cloud. Many of the security principles that you're familiar with in your on-premises environment apply to cloud environments.

Build a layered security approach

Implement security at each level in your application and infrastructure by applying a defense-in-depth approach. Use the features in each product to limit access and configure encryption where appropriate.

Design for secured decoupled systems

Simplify system design to accommodate flexibility where possible, and document security requirements for each component. Incorporate a robust secured mechanism to account for resiliency and recovery.

Automate deployment of sensitive tasks

Take humans out of the workstream by automating deployment and other admin tasks.

Automate security monitoring

Use automated tools to monitor your application and infrastructure. To scan your infrastructure for vulnerabilities and detect security incidents, use automated scanning in your continuous integration and continuous deployment (CI/CD) pipelines.

Meet the compliance requirements for your regions

Be mindful that you might need to obfuscate or redact personally identifiable information (PII) to meet your regulatory requirements. Where possible, automate your compliance efforts. For example, use Sensitive Data Protection and Dataflow to automate the PII redaction job before new data is stored in the system.
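
To make the idea of an automated redaction step concrete, here is a minimal sketch that masks two kinds of PII before a record is stored. It uses plain regular expressions as a stand-in and does not call Sensitive Data Protection or Dataflow; the patterns, the placeholder format, and the `redact` function name are assumptions made for illustration.

```python
import re

# Simplified patterns for illustration only; real PII detection needs far
# broader coverage than two regular expressions.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace each detected value with a placeholder that names its type."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[US_SSN]", text)
```

Running a step like this before the write, rather than as a later cleanup job, means that unredacted values never reach storage.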

Comply with data residency and sovereignty requirements

You might have internal (or external) requirements that require you to control the locations of data storage and processing. These requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture. Data residency describes where your data is stored. To help comply with data residency requirements, Google Cloud lets you control where data is stored, how data is accessed, and how it's processed.
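
As a hedged illustration of a residency check, the sketch below compares the location of each resource against an allowed set and reports the resources that fall outside it. The resource names, the `residency_violations` function, and the choice of allowed regions are assumptions; the sketch checks only where data is stored, not how it is accessed or processed.

```python
# Assumed policy for this example: data must stay in these Google Cloud
# regions, all of which are located in the EU.
ALLOWED_LOCATIONS = frozenset({"europe-west1", "europe-west3", "europe-west4"})


def residency_violations(
    resources: dict[str, str],
    allowed: frozenset[str] = ALLOWED_LOCATIONS,
) -> list[str]:
    """Return the names of resources whose location is outside the allowed set."""
    return sorted(
        name
        for name, location in resources.items()
        if location.lower() not in allowed
    )
```

For example, `residency_violations({"orders-db": "europe-west1", "logs-bucket": "us-central1"})` reports only `logs-bucket`.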

Shift security left

DevOps and deployment automation let your organization increase the velocity of delivering products. To help ensure that your products remain secure, incorporate security processes from the start of the development process. For example, you can do the following:

  • Test for security issues in code early in the deployment pipeline.
  • Scan container images and the cloud infrastructure on an ongoing basis.
  • Automate detection of misconfiguration and security anti-patterns. For example, use automation to look for secrets that are hard-coded in applications or in configuration.
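
The last item can be sketched as a small scanner that a pipeline step might run on source and configuration files. The two patterns and the `find_secrets` name are assumptions chosen for illustration; dedicated secret-scanning tools cover many more credential formats.

```python
import re

# Illustrative patterns only: a PEM private key header, and a credential-like
# name assigned to a quoted literal of at least eight characters.
PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"(?i)\b(?:password|secret|api_key|token)\b\s*[:=]\s*[\"'][^\"']{8,}[\"']"
    ),
}


def find_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for each suspicious line."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A pipeline could fail the build whenever `find_secrets` returns a non-empty list, which surfaces the problem before the code is deployed.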

Manage risk with controls

This document in the Google Cloud Architecture Framework describes best practices for managing risks in a cloud deployment. Performing a careful analysis of the risks that apply to your organization allows you to determine the security controls that you require. You should complete risk analysis before you deploy workloads on Google Cloud, and regularly afterwards as your business needs, regulatory requirements, and the threats relevant to your organization change.

Identify risks to your organization

Before you create and deploy resources on Google Cloud, complete a risk assessment to determine what security features you need in order to meet your internal security requirements and external regulatory requirements. Your risk assessment provides you with a catalog of risks that are relevant to you, and tells you how capable your organization is in detecting and counteracting security threats.

Your risks in a cloud environment differ from your risks in an on-premises environment due to the shared responsibility arrangement that you enter with your cloud provider. For example, in an on-premises environment you need to mitigate vulnerabilities to the hardware stack. In contrast, in a cloud environment these risks are borne by the cloud provider.

In addition, your risks differ depending on how you plan on using Google Cloud. Are you transferring some of your workloads to Google Cloud, or all of them? Are you using Google Cloud only for disaster recovery purposes? Are you setting up a hybrid cloud environment?

We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). In addition, there are threat models such as OWASP application threat modeling that provide you with a list of potential gaps, and that suggest actions to remediate any gaps that are found. You can check our partner directory for a list of experts in conducting risk assessments for Google Cloud.

To help catalog your risks, consider Risk Manager, which is part of the Risk Protection Program. (This program is currently in preview.) Risk Manager scans your workloads to help you understand your business risks. Its detailed reports provide you with a security baseline. In addition, you can use Risk Manager reports to compare your risks against the risks outlined in the Center for Internet Security (CIS) Benchmark.

After you catalog your risks, you must determine how to address them—that is, whether you want to accept, avoid, transfer, or mitigate them. The following section describes mitigation controls.
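
One way to keep track of these decisions is a simple risk register. The following sketch records a treatment for each cataloged risk and lists the risks that still need a decision. The 1 to 5 likelihood and impact scales, the multiplicative score, and all names are assumptions for illustration, not part of any framework mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Treatment(Enum):
    ACCEPT = "accept"
    AVOID = "avoid"
    TRANSFER = "transfer"
    MITIGATE = "mitigate"


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    treatment: Optional[Treatment] = None

    @property
    def score(self) -> int:
        """Assumed scoring rule: likelihood multiplied by impact."""
        return self.likelihood * self.impact


def undecided(risks: list[Risk]) -> list[Risk]:
    """Risks with no treatment decision yet, highest score first."""
    pending = [r for r in risks if r.treatment is None]
    return sorted(pending, key=lambda r: r.score, reverse=True)
```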

Mitigate your risks

You can mitigate risks using technical controls, contractual protections, and third-party verifications or attestations. The following table lists how you can use these mitigations when you adopt new public cloud services.

Mitigation: Technical controls

Technical controls refer to the features and technologies that you use to protect your environment. These include built-in cloud security controls, such as firewalls and logging. Technical controls can also include using third-party tools to reinforce or support your security strategy.

There are two categories of technical controls:
  • Google Cloud includes various security controls to let you mitigate the risks that apply to you. For example, if you have an on-premises environment, you can use Cloud VPN and Cloud Interconnect to secure the connection between your on-premises and your cloud resources.
  • Google has robust internal controls and auditing to protect against insider access to customer data. Our audit logs provide our customers with near real-time logs of Google administrator access on Google Cloud.

Mitigation: Contractual protections

Contractual protections refer to the legal commitments made by us regarding Google Cloud services.

Google is committed to maintaining and expanding our compliance portfolio. The Cloud Data Processing Addendum (CDPA) document defines our commitment to maintaining our ISO 27001, 27017, and 27018 certifications and to updating our SOC 2 and SOC 3 reports every 12 months.

The CDPA document also outlines the access controls that are in place to limit access by Google support engineers to customers' environments, and it describes our rigorous logging and approval process.

We recommend that you review Google Cloud's contractual controls with your legal and regulatory experts and verify that they meet your requirements. If you need more information, contact your technical account representative.

Mitigation: Third-party verifications or attestations

Third-party verifications or attestations refer to having a third-party vendor audit the cloud provider to ensure that the provider meets compliance requirements. For example, Google was audited by a third party for ISO 27017 compliance.

You can see the current Google Cloud certifications and letters of attestation at the Compliance Resource Center.

Manage your assets

This document in the Google Cloud Architecture Framework provides best practices for managing assets.

Asset management is an important part of your business requirements analysis. You must know what assets you have, and you must understand their value and any critical paths or processes related to them. You must have an accurate asset inventory before you can design any sort of security controls to protect your assets.

To manage security incidents and meet your organization's regulatory requirements, you need an accurate and up-to-date asset inventory that includes a way to analyze historical data. You must be able to track your assets, including how their risk exposure might change over time.

Moving to Google Cloud means that you need to modify your asset management processes to adapt to a cloud environment. For example, one of the benefits of moving to the cloud is that you increase your organization's ability to scale quickly. However, the ability to scale quickly can cause shadow IT issues, in which your employees create cloud resources that aren't properly managed and secured. Therefore, your asset management processes must provide sufficient flexibility for employees to get their work done while also providing for appropriate security controls.

Use cloud asset management tools

Google Cloud asset management tools are tailored specifically to our environment and to top customer use cases.

One of these tools is Cloud Asset Inventory, which provides you with both real-time information on the current state of your resources and a five-week history. By using this service, you can get an organization-wide snapshot of your inventory for a wide variety of Google Cloud resources and policies. Automation tools can then use the snapshot for monitoring or for policy enforcement, or the tools can archive the snapshot for compliance auditing. If you want to analyze changes to the assets, Cloud Asset Inventory also lets you export metadata history.

For more information about Cloud Asset Inventory, see Custom solution to respond to asset changes and Detective controls.
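
To illustrate how an automation tool might use inventory snapshots to detect changes, the sketch below compares two snapshots and reports which assets were added, removed, or changed. The snapshot shape (a dictionary from asset name to a metadata dictionary) and the `diff_snapshots` name are assumptions; this is not the Cloud Asset Inventory export format.

```python
def diff_snapshots(
    old: dict[str, dict], new: dict[str, dict]
) -> dict[str, list[str]]:
    """Compare two inventory snapshots that are keyed by asset name."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

A monitoring job could run a comparison like this on each new snapshot and raise an alert when an asset changes unexpectedly.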

Automate asset management

Automation lets you quickly create and manage assets based on the security requirements that you specify. You can automate aspects of the asset lifecycle in the following ways:

  • Deploy your cloud infrastructure using automation tools such as Terraform. Google Cloud provides the enterprise foundations blueprint, which helps you set up infrastructure resources that meet security best practices. In addition, it configures asset changes and policy compliance notifications in Cloud Asset Inventory.
  • Deploy your applications using automation tools such as Cloud Run and Artifact Registry.

Monitor for deviations from your compliance policies

Deviations from policies can occur during all phases of the asset lifecycle. For example, assets might be created without the proper security controls, or their privileges might be escalated. Similarly, assets might be abandoned without the appropriate end-of-life procedures being followed.

To help avoid these scenarios, we recommend that you monitor assets for deviation from compliance. The set of assets that you monitor depends on the results of your risk assessment and your business requirements. For more information about monitoring assets, see Monitoring asset changes.
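
A deviation check can be sketched as a set of named rules that are evaluated against each asset record. The record shape, the two example policies, and the `deviations` function are assumptions for illustration; your actual policies would follow from your risk assessment.

```python
from typing import Callable

# Hypothetical asset record: {"name": str, "labels": dict, "public": bool}
Asset = dict

# Assumed example policies, each returning True when the asset is compliant.
POLICIES: dict[str, Callable[[Asset], bool]] = {
    "has_owner_label": lambda a: "owner" in a.get("labels", {}),
    "not_public": lambda a: not a.get("public", False),
}


def deviations(assets: list[Asset]) -> list[tuple[str, str]]:
    """Return (asset name, violated policy) pairs."""
    return [
        (asset["name"], policy)
        for asset in assets
        for policy, check in POLICIES.items()
        if not check(asset)
    ]
```

The `has_owner_label` rule is one way to catch the abandoned-asset scenario described above, because an asset without an owner has nobody accountable for its end-of-life procedure.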

Integrate with your existing asset management monitoring systems

If you already use a SIEM system or other monitoring system, integrate your Google Cloud assets with that system. Integration ensures that your organization has a single, comprehensive view into all resources, regardless of environment. For more information, see Export Google Cloud security data to your SIEM system and Scenarios for exporting Cloud Logging data: Splunk.

Use data analysis to enrich your monitoring

You can export your inventory to a BigQuery table or Cloud Storage bucket for additional analysis.

Manage identity and access

This document in the Google Cloud Architecture Framework provides best practices for managing identity and access.

The practice of identity and access management (generally referred to as IAM) helps you ensure that the right people can access the right resources. IAM addresses the following aspects of authentication and authorization:

  • Account management, including provisioning
  • Identity governance
  • Authentication
  • Access control (authorization)
  • Identity federation

Managing IAM can be challenging when you have different environments or you use multiple identity providers. However, it's critical that you set up a system that can meet your business requirements while mitigating risks.

The recommendations in this document help you review your current IAM policies and procedures and determine which of those you might need to modify for your workloads in Google Cloud. For example, you must review the following:

  • Whether you can use existing groups to manage access or whether you need to create new ones.
  • Your authentication requirements (such as multi-factor authentication (MFA) using a token).
  • The impact of service accounts on your current policies.
  • How to maintain appropriate separation of duties if you're using Google Cloud for disaster recovery.

Within Google Cloud, you use Cloud Identity to authenticate your users and resources and Google's Identity and Access Management (IAM) product to dictate resource access. Administrators can restrict access at the organization, folder, project, and resource level. Google IAM policies dictate who can do what on which resources. Correctly configured IAM policies help secure your environment by preventing unauthorized access to resources.

For more information, see Overview of identity and access management.
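
The way access granted at one level of the hierarchy applies to the levels below it can be modeled in a few lines. The sketch below is a simplified model and not the IAM API: the `Node` type and `has_role` function are hypothetical, and the model checks role bindings only, ignoring conditions, deny policies, and the permissions inside each role.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A level in the hierarchy: organization, folder, project, or resource."""
    name: str
    parent: Optional["Node"] = None
    # Bindings set directly on this node: role -> set of principals.
    bindings: dict[str, set[str]] = field(default_factory=dict)


def has_role(node: Optional[Node], principal: str, role: str) -> bool:
    """True if the principal holds the role on this node or on any ancestor."""
    while node is not None:
        if principal in node.bindings.get(role, set()):
            return True
        node = node.parent
    return False
```

Because a binding at the organization level reaches every folder and project beneath it, broad roles are usually best granted as low in the hierarchy as the use case allows.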

Use a single identity provider

Many of our customers have user accounts that are managed and provisioned by identity providers outside of Google Cloud. Google Cloud supports federation with most identity providers and with on-premises directories such as Active Directory.

Most identity providers let you enable single sign-on (SSO) for your users and groups. For applications that you deploy on Google Cloud and that use your external identity provider, you can extend your identity provider to Google Cloud. For more information, see Reference architectures and Patterns for authenticating corporate users in a hybrid environment.

If you don't have an existing identity provider, you can use either Cloud Identity Premium or Google Workspace to manage identities for your employees.

Protect the super admin account

The super admin account (managed by Google Workspace or Cloud Identity) lets you create your Google Cloud organization. This admin account is therefore highly privileged. Best practices for this account include the following:

  • Create a new account for this purpose; don't use an existing user account.
  • Create and protect backup accounts.
  • Enable MFA.

For more information, see Super administrator account best practices.

Plan your use of service accounts

A service account is a Google account that applications can use to call the Google API of a service.

Unlike your user accounts, service accounts are created and managed within Google Cloud. Service accounts also authenticate differently than user accounts:

  • To let an application running on Google Cloud authenticate using a service account, you can attach a service account to the compute resource the application runs on.
  • To let an application running on GKE authenticate using a service account, you can use Workload Identity.
  • To let applications running outside of Google Cloud authenticate using a service account, you can use Workload Identity Federation.

When you use service accounts, you must consider an appropriate separation of duties during your design process. Note the API calls that you must make, and determine the service accounts and associated roles that the API calls require. For example, if you're setting up a BigQuery data warehouse, you probably need identities for at least the following processes and services:

  • Cloud Storage or Pub/Sub, depending on whether you're providing a batch file or creating a streaming service.
  • Dataflow and Sensitive Data Protection to de-identify sensitive data.

For more information, see Best practices for working with service accounts.

Update your identity processes for the cloud

Identity governance lets you track access, risks, and policy violations so that you can support your regulatory requirements. This governance requires that you have processes and policies in place so that you can grant and audit access control roles and permissions to users. Your processes and policies must reflect the requirements of your environments—for example, test, development, and production.

Before you deploy workloads on Google Cloud, review your current identity processes and update them if appropriate. Ensure that you appropriately plan for the types of accounts that your organization needs and that you have a good understanding of their role and access requirements.

To help you audit Google IAM activities, Google Cloud creates audit logs, which include the following:

  • Administrator activity. This logging can't be disabled.
  • Data access activity. You must enable this logging.
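
As a minimal sketch of what enabling Data Access logging can look like, the following Python builds the auditConfigs fragment of an IAM policy. The allServices value and the three log type names follow the IAM policy format; the helper name data_access_audit_config and the example member are hypothetical, and applying the resulting policy to a project or organization is out of scope here.

```python
import json

# Log types that make up Data Access audit logs in an IAM policy.
DATA_ACCESS_LOG_TYPES = ("ADMIN_READ", "DATA_READ", "DATA_WRITE")


def data_access_audit_config(service="allServices", exempted_members=()):
    """Return one auditConfigs entry that enables Data Access logs.

    `exempted_members` is optional (hypothetical example input): principals
    listed there are excluded from logging for each log type.
    """
    log_configs = []
    for log_type in DATA_ACCESS_LOG_TYPES:
        entry = {"logType": log_type}
        if exempted_members:
            entry["exemptedMembers"] = list(exempted_members)
        log_configs.append(entry)
    return {"service": service, "auditLogConfigs": log_configs}


if __name__ == "__main__":
    fragment = {"auditConfigs": [data_access_audit_config()]}
    print(json.dumps(fragment, indent=2))
```

Because Data Access logs can be high volume, you might scope the service argument to specific services rather than enabling every log type everywhere.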

If necessary for compliance purposes, or if you want to set up log analysis (for example, with your SIEM system), you can export the logs. Because logs can increase your storage requirements, they might affect your costs. Ensure that you log only the actions that you require, and set appropriate retention schedules.

Set up SSO and MFA

Your identity provider manages user account authentication. Federated identities can authenticate to Google Cloud using SSO. For privileged accounts, such as super admins, you should configure MFA. Titan Security Keys are physical tokens that you can use for two-factor authentication (2FA) to help prevent phishing attacks.

Cloud Identity supports MFA using various methods. For more information, see Enforce uniform MFA to company-owned resources.

Google Cloud supports authentication for workload identities using the OAuth 2.0 protocol or signed JSON Web Tokens (JWT). For more information about workload authentication, see Authentication overview.

Implement least privilege and separation of duties

You must ensure that the right individuals get access only to the resources and services that they need in order to perform their jobs. That is, you should follow the principle of least privilege. In addition, you must ensure there is an appropriate separation of duties.

Overprovisioning user access can increase the risk of insider threat, misconfigured resources, and non-compliance with audits. Underprovisioning permissions can prevent users from being able to access the resources they need in order to complete their tasks.

One way to avoid overprovisioning is to implement just-in-time privileged access — that is, to provide privileged access only as needed, and to only grant it temporarily.
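
One way to express temporary access is an IAM role binding with a time-based condition. The following sketch only builds the binding as a Python dictionary; the function name temporary_binding, the condition title, and the example member and role are assumptions for illustration, and the approval workflow around the grant is not shown.

```python
from datetime import datetime, timedelta, timezone


def temporary_binding(member, role, granted_at, hours):
    """Return an IAM role binding that stops applying `hours` after `granted_at`.

    `granted_at` must be a timezone-aware datetime.
    """
    expiry = (granted_at + timedelta(hours=hours)).astimezone(timezone.utc)
    stamp = expiry.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "role": role,
        "members": [member],
        "condition": {
            "title": "temporary-access",  # hypothetical title
            "description": f"Expires {stamp}",
            # IAM Conditions expression: the grant is valid only before `stamp`.
            "expression": f'request.time < timestamp("{stamp}")',
        },
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(temporary_binding("user:alice@example.com", "roles/compute.admin", start, 2))
```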

Be aware that when a Google Cloud organization is created, all users in your domain are granted the Billing Account Creator and Project Creator roles by default. Identify the users who will perform these duties, and revoke these roles from other users. For more information, see Creating and managing organizations.
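
The revocation step can be sketched as a transformation of the organization's IAM policy. The example below works on a plain dictionary in the IAM policy shape; the function name restrict_creator_roles and the example principals are hypothetical, and reading or writing the real policy is left out.

```python
import copy

# Roles that are granted to the whole domain when an organization is created.
DEFAULT_CREATOR_ROLES = (
    "roles/billing.creator",
    "roles/resourcemanager.projectCreator",
)


def restrict_creator_roles(policy, domain_member, approved_members):
    """Return a copy of `policy` in which the creator roles are no longer
    granted to `domain_member` and are granted to `approved_members`."""
    updated = copy.deepcopy(policy)
    for binding in updated.get("bindings", []):
        if binding["role"] not in DEFAULT_CREATOR_ROLES:
            continue  # leave unrelated bindings untouched
        members = [m for m in binding["members"] if m != domain_member]
        for member in approved_members:
            if member not in members:
                members.append(member)
        binding["members"] = members
    return updated
```

This sketch grants both roles to the same approved group for brevity; in practice you might assign billing and project creation to different groups to preserve separation of duties.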

For more information about how roles and permissions work in Google Cloud, see Overview and Understanding roles in the IAM documentation. For more information about enforcing least privilege, see Enforce least privilege with role recommendations.

Audit access

To monitor the activities of privileged accounts for deviations from approved conditions, use Cloud Audit Logs. Cloud Audit Logs records the actions that members in your Google Cloud organization have taken in your Google Cloud resources. You can work with various audit log types across Google services. For more information, see Using Cloud Audit Logs to Help Manage Insider Risk (video).

Use IAM recommender to track usage and to adjust permissions where appropriate. The roles that are recommended by IAM recommender can help you determine which roles to grant to a user based on the user's past behavior and on other criteria. For more information, see Best practices for role recommendations.
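
The idea behind role recommendations can be illustrated with a small comparison of granted and used permissions. This is a simplified sketch, not how IAM recommender is implemented: the role names example.reader and example.editor and their permission sets are hypothetical, and the usage data would in practice come from observed activity.

```python
def excess_permissions(granted, used):
    """Return the permissions that were granted but never used, sorted."""
    return sorted(set(granted) - set(used))


def suggest_role(used, candidate_roles):
    """Pick the candidate role with the fewest permissions that still
    covers every permission in `used`, or None if no candidate covers it."""
    needed = set(used)
    covering = {
        role: perms for role, perms in candidate_roles.items()
        if needed <= set(perms)
    }
    if not covering:
        return None
    # Ties are broken by role name so that the result is deterministic.
    return min(covering, key=lambda role: (len(covering[role]), role))


if __name__ == "__main__":
    roles = {  # hypothetical roles for illustration
        "example.reader": ["storage.objects.get", "storage.objects.list"],
        "example.editor": ["storage.objects.get", "storage.objects.list",
                           "storage.objects.create", "storage.objects.delete"],
    }
    print(suggest_role(["storage.objects.get"], roles))
```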

To audit and control access to your resources by Google support and engineering personnel, you can use Access Transparency. Access Transparency records the actions taken by Google personnel. Use Access Approval, which is part of Access Transparency, to grant explicit approval every time customer content is accessed. For more information, see Control cloud administrators' access to your data.

Automate your policy controls

Set access permissions programmatically whenever possible. For best practices, see Organization policy constraints. The Terraform scripts for the enterprise foundations blueprint are in the example foundation repository.

Google Cloud includes Policy Intelligence, which lets you automatically review and update your access permissions. Policy Intelligence includes the Recommender, Policy Troubleshooter, and Policy Analyzer tools, which do the following:

  • Provide recommendations for IAM role assignment.
  • Monitor and help prevent overly permissive IAM policies.
  • Assist with troubleshooting access-control-related issues.

Set restrictions on resources

Google IAM focuses on who, and it lets you authorize who can act on specific resources based on permissions. The Organization Policy Service focuses on what, and it lets you set restrictions on resources to specify how they can be configured. For example, you can use an organization policy to do the following:

In addition to using organization policies for these tasks, you can restrict access to resources using one of the following methods:

  • Use tags to manage access to your resources without defining the access permissions on each resource. Instead, you add the tag and then set the access definition for the tag itself.
  • Use IAM Conditions for conditional, attribute-based control of access to resources.
  • Implement defense-in-depth using VPC Service Controls to further restrict access to resources.

For more information about resource management, see Decide a resource hierarchy for your Google Cloud landing zone.

What's next

Learn more about IAM with the following resources:

Implement compute and container security

Google Cloud includes controls to protect your compute resources and Google Kubernetes Engine (GKE) container resources. This document in the Google Cloud Architecture Framework describes key controls and best practices for using them.

Use hardened and curated VM images

Google Cloud includes Shielded VM, which allows you to harden your VM instances. Shielded VM is designed to prevent malicious code from being loaded during the boot cycle. It provides boot security, monitors integrity, and uses the Virtual Trusted Platform Module (vTPM). Use Shielded VM for sensitive workloads.

In addition to using Shielded VM, you can use Google Cloud partner solutions to further protect your VMs. Many partner solutions offered on Google Cloud integrate with Security Command Center, which provides event threat detection and health monitoring. You can use partners for advanced threat analysis or extra runtime security.

Use Confidential Computing for processing sensitive data

By default, Google Cloud encrypts data at rest and in transit across the network, but data isn't encrypted while it's in use in memory. If your organization handles confidential data, you need to mitigate against threats that undermine the confidentiality and integrity of either the application or the data in system memory. Confidential data includes personally identifiable information (PII), financial data, and health information.

Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment. This type of secure and isolated environment helps prevent unauthorized access or modification of applications and data while that data is in use. A trusted execution environment also increases the security assurances for organizations that manage sensitive and regulated data.

In Google Cloud, you can enable Confidential Computing by running Confidential VMs or Confidential GKE nodes. Turn on Confidential Computing when you're processing confidential workloads, or when you have confidential data (for example, secrets) that must be exposed while it is processed. For more information, see the Confidential Computing Consortium.

Protect VMs and containers

OS Login lets your employees connect to your VMs using Identity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. You therefore don't have to manage SSH keys throughout your organization. OS Login ties an administrator's access to their employee lifecycle, which means that if employees move to another role or leave your organization, their access is revoked with their account. OS Login also supports two-factor authentication, which adds an extra layer of security from account takeover attacks.

Both GKE and App Engine run application instances within containers. To enable a defined risk profile and to restrict employees from making changes to containers, ensure that your containers are stateless and immutable. The principle of immutability means that your employees do not modify the container or access it interactively. If it must be changed, you build a new image and redeploy. Enable SSH access to the underlying containers only in specific debugging scenarios.

Disable external IP addresses unless they're necessary

To disable external IP address allocation (video) for your production VMs and to prevent the use of external load balancers, you can use organization policies. If you require your VMs to reach the internet or your on-premises data center, you can enable a Cloud NAT gateway.
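
As a rough sketch, an organization policy for the compute.vmExternalIpAccess constraint can be expressed as the following policy bodies, built here as Python dictionaries. The helper names and the example resource IDs are hypothetical; how you submit the policy (console, CLI, or infrastructure as code) is not shown.

```python
CONSTRAINT = "compute.vmExternalIpAccess"


def deny_external_ips(parent):
    """Policy body that denies external IP addresses for all VMs under
    `parent`, for example "organizations/ORG_ID" or "folders/FOLDER_ID"."""
    return {
        "name": f"{parent}/policies/{CONSTRAINT}",
        "spec": {"rules": [{"denyAll": True}]},
    }


def allow_external_ips_for(parent, instances):
    """Policy body that allows external IPs only for the listed instances,
    each given as "projects/PROJECT/zones/ZONE/instances/NAME"."""
    return {
        "name": f"{parent}/policies/{CONSTRAINT}",
        "spec": {"rules": [{"values": {"allowedValues": list(instances)}}]},
    }


if __name__ == "__main__":
    print(deny_external_ips("folders/1234567890"))  # hypothetical folder ID
```

The allowlist variant is useful when a small number of VMs, such as bastion hosts, genuinely need an external address.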

You can deploy private clusters in GKE. In a private cluster, nodes have only internal IP addresses, which means that nodes and Pods are isolated from the internet by default. You can also define a network policy to manage Pod-to-Pod communication in the cluster. For more information, see Private access options for services.

Monitor your compute instance and GKE usage

Cloud Audit Logs are automatically enabled for Compute Engine and GKE. Audit logs let you automatically capture all activities in your cluster and monitor for any suspicious activity.

You can integrate GKE with partner products for runtime security. You can integrate these solutions with the Security Command Center to provide you with a single interface for monitoring your applications.

Keep your images and clusters up to date

Google Cloud provides curated OS images that are patched regularly. You can bring custom images and run them on Compute Engine, but if you do, you have to patch them yourself. Google Cloud regularly updates OS images to mitigate new vulnerabilities as described in security bulletins and provides remediation to fix vulnerabilities for existing deployments.

If you're using GKE, we recommend that you enable node auto-upgrade to have Google update your cluster nodes with the latest patches. Google manages GKE control planes, which are automatically updated and patched. In addition, use Google-curated container-optimized images for your deployment. Google regularly patches and updates these images.

Control access to your images and clusters

It's important to know who can create and launch instances. You can control this access using IAM. For information about how to determine what access workloads need, see Plan your workload identities.

In addition, you can use VPC Service Controls to define custom quotas on projects so that you can limit who can launch images. For more information, see the Secure your network section.

To provide infrastructure security for your cluster, GKE lets you use IAM with role-based access control (RBAC) to manage access to your cluster and namespaces.

Isolate containers in a sandbox

Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.

GKE Sandbox was built for applications that have low I/O requirements but that are highly scaled. These containerized workloads need to maintain their speed and performance, but might also involve untrusted code that demands added security. Use gVisor, a container runtime sandbox, to provide additional security isolation between applications and the host kernel. gVisor provides additional integrity checks and limits the scope of access for a service. It's not a container hardening service to protect against external threats. For more information about gVisor, see gVisor: Protecting GKE and serverless users in the real world.

What's next

Learn more about compute and container security with the following resources:

Secure your network

This document in the Google Cloud Architecture Framework provides best practices for securing your network.

Extending your existing network to include cloud environments has many implications for security. Your on-premises approach to multi-layered defenses likely involves a distinct perimeter between the internet and your internal network. You probably protect the perimeter by using mechanisms like physical firewalls, routers, and intrusion detection systems. Because the boundary is clearly defined, you can monitor for intrusions and respond accordingly.

When you move to the cloud (either completely or in a hybrid approach), you move beyond your on-premises perimeter. This document describes ways that you can continue to secure your organization's data and workloads on Google Cloud. As mentioned in Manage risks with controls, how you set up and secure your Google Cloud network depends on your business requirements and risk appetite.

This section assumes that you've already created a basic architecture diagram of your Google Cloud network components. For an example diagram, see Hub-and-spoke.

Deploy zero trust networks

Moving to the cloud means that your network trust model must change. Because your users and your workloads are no longer behind your on-premises perimeter, you can't use perimeter protections in the same way to create a trusted, inner network. The zero trust security model means that no one is trusted by default, whether they are inside or outside of your organization's network. When verifying access requests, the zero trust security model requires you to check both the user's identity and context. Unlike with a VPN-based approach, you shift access controls from the network perimeter to the users and devices.
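
The following toy example illustrates the difference between perimeter-based and zero trust decisions: access depends on identity and device context, and being on the corporate network grants nothing by itself. The AccessRequest fields and the is_allowed function are hypothetical and far simpler than a real access policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    authenticated: bool
    mfa_passed: bool
    device_managed: bool
    on_corporate_network: bool  # recorded, but never sufficient for access


def is_allowed(request, allowed_users):
    """Allow only when both identity and context checks pass."""
    identity_ok = (
        request.authenticated
        and request.mfa_passed
        and request.user in allowed_users
    )
    context_ok = request.device_managed
    # Network location is intentionally not part of the decision.
    return identity_ok and context_ok


if __name__ == "__main__":
    inside = AccessRequest("bob@example.com", True, False, True, True)
    print(is_allowed(inside, {"bob@example.com"}))  # False: MFA is missing
```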

In Google Cloud, you can use Chrome Enterprise Premium as your zero trust solution. Chrome Enterprise Premium provides threat and data protection and additional access controls. For more information about how to set it up, see Getting started with Chrome Enterprise Premium.

In addition to Chrome Enterprise Premium, Google Cloud includes Identity-Aware Proxy (IAP). IAP lets you extend zero trust security to your applications both within Google Cloud and on-premises. IAP uses access control policies to provide authentication and authorization for users who access your applications and resources.

Secure connections to your on-premises or multicloud environments

Many organizations have workloads both in cloud environments and on-premises. In addition, for resiliency, some organizations use multicloud solutions. In these scenarios, it's critical to secure your connectivity between all of your environments.

Google Cloud includes private access methods for VMs that are supported by Cloud VPN or Cloud Interconnect, including the following:

For a comparison between the products, see Choosing a Network Connectivity product.

Disable default networks

When you create a new Google Cloud project, a default Google Cloud VPC network with auto mode IP addresses and pre-populated firewall rules is automatically provisioned. For production deployments, we recommend that you delete the default networks in existing projects, and disable the creation of default networks in new projects.

Virtual Private Cloud networks let you use any internal IP address. To avoid IP address conflicts, we recommend that you first plan your network and IP address allocation across your connected deployments and across your projects. A project allows multiple VPC networks, but it's usually a best practice to limit these networks to one per project in order to enforce access control effectively.
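
A quick way to catch conflicts while planning is to check the proposed ranges for overlap before you create any subnets. The following sketch uses Python's standard ipaddress module; the range names and CIDR blocks are made-up examples, and it assumes that all ranges are IPv4.

```python
import ipaddress
from itertools import combinations


def find_overlaps(planned_ranges):
    """Return (name, name) pairs whose CIDR ranges overlap.

    `planned_ranges` maps a label (VPC subnet, on-premises site, and so on)
    to an IPv4 CIDR string.
    """
    networks = {
        name: ipaddress.ip_network(cidr)
        for name, cidr in planned_ranges.items()
    }
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    plan = {  # hypothetical allocation plan
        "prod": "10.0.0.0/16",
        "dev": "10.1.0.0/16",
        "on-prem": "10.0.128.0/20",
    }
    print(find_overlaps(plan))
```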

Secure your perimeter

In Google Cloud, you can use various methods to segment and secure your cloud perimeter, including firewalls and VPC Service Controls.

Use Shared VPC to build a production deployment that gives you a single shared network and that isolates workloads into individual projects that can be managed by different teams. Shared VPC provides centralized deployment, management, and control of the network and network security resources across multiple projects. Shared VPC consists of host and service projects that perform the following functions:

  • A host project contains the networking and network security-related resources, such as VPC networks, subnets, firewall rules, and hybrid connectivity.
  • A service project attaches to a host project. It lets you isolate workloads and users at the project level by using Identity and Access Management (IAM), while it shares the networking resources from the centrally managed host project.

Define firewall policies and rules at the organization, folder, and VPC network level. You can configure firewall rules to permit or deny traffic to or from VM instances. For examples, see Global and regional network firewall policy examples and Hierarchical firewall policy examples. In addition to defining rules based on IP addresses, protocols, and ports, you can manage traffic and apply firewall rules based on the service account that's used by a VM instance or by using secure tags.
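
To reason about how VPC firewall rules interact, it can help to model their evaluation order: rules with a lower priority number are evaluated first, and ingress traffic that matches no rule is denied by the implied rule. The sketch below is a simplified model that matches only on source range and a single port; the Rule type and the rule names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int      # lower number means higher priority
    action: str        # "allow" or "deny"
    source_range: str  # IPv4 CIDR
    port: int


def evaluate_ingress(rules, source_ip, port):
    """Return (action, rule_name) for the first matching rule."""
    ip = ipaddress.ip_address(source_ip)
    # At equal priority, deny rules are considered before allow rules.
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        if rule.port == port and ip in ipaddress.ip_network(rule.source_range):
            return rule.action, rule.name
    return "deny", "implied-deny-ingress"
```

A model like this is useful for unit-testing an intended rule set before you deploy it, although it ignores protocols, tags, service accounts, and hierarchical policies.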

To control the movement of data in Google services and to set up context-based perimeter security, consider VPC Service Controls. VPC Service Controls provides an extra layer of security for Google Cloud services that's independent of IAM and firewall rules and policies. For example, VPC Service Controls lets you set up perimeters between confidential and non-confidential data so that you can apply controls that help prevent data exfiltration.

Use Google Cloud Armor security policies to allow, deny, or redirect requests to your external Application Load Balancer at the Google Cloud edge, as close as possible to the source of incoming traffic. These policies prevent unwelcome traffic from consuming resources or entering your network.

Use Secure Web Proxy to apply granular access policies to your egress web traffic and to monitor access to untrusted web services.

Inspect your network traffic

You can use Cloud Intrusion Detection System (Cloud IDS) and Packet Mirroring to help you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE).

Use Cloud IDS to get visibility into the traffic moving into and out of your VPC networks. Cloud IDS creates a Google-managed peered network that has mirrored VMs. Palo Alto Networks threat protection technologies mirror and inspect the traffic. For more information, see Cloud IDS overview.

Packet Mirroring clones traffic of specified VM instances in your VPC network and forwards it for collection, retention, and examination. After you configure Packet Mirroring, you can use Cloud IDS or third-party tools to collect and inspect network traffic at scale. Inspecting network traffic in this way helps provide intrusion detection and application performance monitoring.

Use a web application firewall

For external web applications and services, you can enable Google Cloud Armor to provide distributed denial-of-service (DDoS) protection and web application firewall (WAF) capabilities. Google Cloud Armor supports Google Cloud workloads that are exposed using external HTTP(S) load balancing, TCP Proxy load balancing, or SSL Proxy load balancing.

Google Cloud Armor is offered in two service tiers, Standard and Managed Protection Plus. To take full advantage of advanced Google Cloud Armor capabilities, you should invest in Managed Protection Plus for your key workloads.

Automate infrastructure provisioning

Automation lets you create immutable infrastructure, which means that it can't be changed after provisioning. This measure gives your operations team a known good state, fast rollback, and troubleshooting capabilities. For automation, you can use tools such as Terraform, Jenkins, and Cloud Build.

To help you build an environment that uses automation, Google Cloud provides a series of security blueprints that are in turn built on the enterprise foundations blueprint. The security foundations blueprint provides Google's opinionated design for a secure application environment and describes step by step how to configure and deploy your Google Cloud estate. Using the instructions and the scripts that are part of the security foundations blueprint, you can configure an environment that meets our security best practices and guidelines. You can build on that blueprint with additional blueprints or design your own automation.

For more information about automation, see Use a CI/CD pipeline for data-processing workflows.

Monitor your network

Monitor your network and your traffic using telemetry.

VPC Flow Logs and Firewall Rules Logging provide near real-time visibility into the traffic and firewall usage in your Google Cloud environment. For example, Firewall Rules Logging logs traffic to and from Compute Engine VM instances. When you combine these tools with Cloud Logging and Cloud Monitoring, you can track, alert, and visualize traffic and access patterns to improve the operational security of your deployment.

Firewall Insights lets you review which firewall rules matched incoming and outgoing connections and whether the connections were allowed or denied. The shadowed rules feature helps you tune your firewall configuration by showing you which rules are never triggered because another rule is always triggered first.
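
The concept of a shadowed rule can be sketched in a few lines. In this simplified model, a rule is shadowed when a single rule that is evaluated earlier covers the same port and a source range that contains the later rule's range. The Rule tuple and the rule names are hypothetical, all ranges are assumed to be IPv4, and Firewall Insights itself considers more attributes than this.

```python
import ipaddress
from collections import namedtuple

Rule = namedtuple("Rule", "name priority source_range port")


def shadowed_rules(rules):
    """Map each shadowed rule name to the earlier rule that covers it."""
    ordered = sorted(rules, key=lambda r: r.priority)  # lower number first
    shadowed = {}
    for index, rule in enumerate(ordered):
        rule_net = ipaddress.ip_network(rule.source_range)
        for earlier in ordered[:index]:
            earlier_net = ipaddress.ip_network(earlier.source_range)
            if earlier.port == rule.port and rule_net.subnet_of(earlier_net):
                shadowed[rule.name] = earlier.name
                break
    return shadowed


if __name__ == "__main__":
    example = [
        Rule("allow-ssh-internal", 1000, "10.0.0.0/8", 22),
        Rule("allow-ssh-team", 2000, "10.1.0.0/16", 22),
    ]
    print(shadowed_rules(example))
```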

Use Network Intelligence Center to see how your network topology and architecture are performing. You can get detailed insights into network performance and you can then optimize your deployment to eliminate any bottlenecks in your service. Connectivity Tests provide you with insights into the firewall rules and policies that are applied to the network path.

For more information about monitoring, see Implement logging and detective controls.

What's next

Learn more about network security with the following resources:

Implement data security

This document in the Google Cloud Architecture Framework provides best practices for implementing data security.

As part of your deployment architecture, you must consider what data you plan to process and store in Google Cloud, and the sensitivity of the data. Design your controls to help secure the data during its lifecycle, to identify data ownership and classification, and to help protect data from unauthorized use.

For a security blueprint that deploys a BigQuery data warehouse with the security best practices described in this document, see Secure a BigQuery data warehouse that stores confidential data.

Automatically classify your data

Perform data classification as early in the data management lifecycle as possible, ideally when the data is created. Usually, data classification efforts require only a few categories, such as the following:

  • Public: Data that has been approved for public access.
  • Internal: Non-sensitive data that isn't released to the public.
  • Confidential: Sensitive data that's available for general internal distribution.
  • Restricted: Highly sensitive or regulated data that requires restricted distribution.

Use Sensitive Data Protection to discover and classify data across your Google Cloud environment. Sensitive Data Protection has built-in support for scanning and classifying sensitive data in Cloud Storage, BigQuery, and Datastore. It also has a streaming API to support additional data sources and custom workloads.

Sensitive Data Protection can identify sensitive data using built-in infotypes. It can automatically classify, mask, tokenize, and transform sensitive elements (such as PII data) to let you manage the risk of collecting, storing, and using data. In other words, it can integrate with your data lifecycle processes to ensure that data in every stage is protected.
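
To make the detect, classify, and mask flow concrete, here is a toy stand-in that uses regular expressions instead of the Sensitive Data Protection service. The two patterns are deliberately simplistic, the labels are borrowed only as names, and the mapping from a finding to the Restricted category is an assumption for illustration.

```python
import re

# Simplistic detectors; real detection is considerably more robust.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text):
    """Return {label: [matches]} for every detector that finds something."""
    findings = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[label] = matches
    return findings


def mask(text, mask_char="#"):
    """Replace every detected value with mask characters of equal length."""
    for pattern in PATTERNS.values():
        text = pattern.sub(lambda m: mask_char * len(m.group()), text)
    return text


def classify(text):
    """Hypothetical rule: any finding makes the text Restricted."""
    return "Restricted" if find_sensitive(text) else "Internal"
```

Running classification at ingestion time, before data lands in long-term storage, keeps the label and the masked copy in step with the data's lifecycle.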

For more information, see De-identification and re-identification of PII in large-scale datasets using Sensitive Data Protection.

Manage data governance using metadata

Data governance is a combination of processes that ensure that data is secure, private, accurate, available, and usable. Although you are responsible for defining a data governance strategy for your organization, Google Cloud provides tools and technologies to help you put your strategy into practice. Google Cloud also provides a framework for data governance (PDF) in the cloud.

Use Data Catalog to find, curate, and use metadata to describe your data assets in the cloud. You can use Data Catalog to search for data assets, then tag the assets with metadata. To help accelerate your data classification efforts, integrate Data Catalog with Sensitive Data Protection to automatically identify confidential data. After data is tagged, you can use Google Identity and Access Management (IAM) to restrict which data users can query or use through Data Catalog views.

Use Dataproc Metastore or Hive metastore to manage metadata for workloads. Data Catalog has a Hive connector that allows the service to discover metadata that's inside a Hive metastore.

Use Dataprep by Trifacta to define and enforce data quality rules through a console. You can use Dataprep from within Cloud Data Fusion or use Dataprep as a standalone service.

Protect data according to its lifecycle phase and classification

After you define data within the context of its lifecycle and classify it based on its sensitivity and risk, you can assign the right security controls to protect it. You must ensure that your controls deliver adequate protections, meet compliance requirements, and reduce risk. As you move to the cloud, review your current strategy and where you might need to change your current processes.

The following table describes three characteristics of a data security strategy in the cloud.

Characteristic: Description

Identification: Understand the identity of users, resources, and applications as they create, modify, store, use, share, and delete data. Use Cloud Identity and IAM to control access to data. If your identities require certificates, consider Certificate Authority Service. For more information, see Manage identity and access.

Boundary and access: Set up controls for how data is accessed, by whom, and under what circumstances. Access boundaries to data can be managed at these levels:

Visibility: You can audit usage and create reports that demonstrate how data is controlled and accessed. Google Cloud Logging and Access Transparency provide insights into the activities of your own cloud administrators and Google personnel. For more information, see Monitor your data.

Encrypt your data

By default, Google Cloud encrypts customer data stored at rest, with no action required from you. In addition to default encryption, Google Cloud provides options for envelope encryption and encryption key management. For example, Compute Engine persistent disks are automatically encrypted, but you can supply or manage your own keys.

You must identify the solutions that best fit your requirements for key generation, storage, and rotation, whether you're choosing the keys for your storage, for compute, or for big data workloads.

Google Cloud includes the following options for encryption and key management:

  • Customer-managed encryption keys (CMEK). You can generate and manage your encryption keys using Cloud Key Management Service (Cloud KMS). Use this option if you have certain key management requirements, such as the need to rotate encryption keys regularly.
  • Customer-supplied encryption keys (CSEK). You can create and manage your own encryption keys, and then provide them to Google Cloud when necessary. Use this option if you generate your own keys using your on-premises key management system to bring your own key (BYOK). If you provide your own keys using CSEK, Google replicates them and makes them available to your workloads. However, the security and availability of CSEK is your responsibility because customer-supplied keys aren't stored in instance templates or in Google infrastructure. If you lose access to the keys, Google can't help you recover the encrypted data. Think carefully about which keys you want to create and manage yourself. You might use CSEK for only the most sensitive information. Another option is to perform client-side encryption on your data and then store the encrypted data in Google Cloud, where the data is encrypted again by Google.
  • Third-party key management system with Cloud External Key Manager (Cloud EKM). Cloud EKM protects your data at rest by using encryption keys that are stored and managed in a third-party key management system that you control outside of the Google infrastructure. When you use this method, you have high assurance that your data can't be accessed by anyone outside of your organization. Cloud EKM lets you achieve a secure hold-your-own-key (HYOK) model for key management. For compatibility information, see the Cloud EKM enabled services list.

Cloud KMS also lets you encrypt your data with either software-backed encryption keys or FIPS 140-2 Level 3 validated hardware security modules (HSMs). If you're using Cloud KMS, your cryptographic keys are stored in the region where you deploy the resource. Cloud HSM distributes your key management needs across regions, providing redundancy and global availability of keys.

For information on how envelope encryption works, see Encryption at rest in Google Cloud.

Control cloud administrators' access to your data

You can control access by Google support and engineering personnel to your environment on Google Cloud. Access Approval lets you explicitly approve access requests before Google employees can access your data or resources on Google Cloud. This product complements the visibility provided by Access Transparency, which generates logs when Google personnel interact with your data. These logs include the office location and the reason for the access.

Using these products together, you can deny Google the ability to decrypt your data for any reason.

Configure where your data is stored and where users can access it from

You can control the network locations from which users can access data by using VPC Service Controls. This product lets you limit access to users in a specific region. You can enforce this constraint even if the user is authorized according to your Google IAM policy. Using VPC Service Controls, you create a service perimeter that defines the virtual boundaries from which a service can be accessed. The perimeter prevents data from being moved outside those boundaries.

Manage secrets using Secret Manager

Secret Manager lets you store all of your secrets in a centralized place. Secrets are configuration information such as database passwords, API keys, or TLS certificates. You can automatically rotate secrets, and you can configure applications to automatically use the latest version of a secret. Every interaction with Secret Manager generates an audit log, so you can view every access to every secret.
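
As a minimal sketch of how an application might read the latest version of a secret, the following Python example assumes that the google-cloud-secret-manager client library is installed and that credentials are available in the environment. The project ID and secret ID that you pass in are placeholders that you supply; the helper function names are illustrative and aren't part of the library.

```python
def secret_version_name(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the resource name of a secret version.

    The "latest" alias refers to the most recently created version.
    """
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def read_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret payload as text.

    Requires the google-cloud-secret-manager package and valid credentials.
    """
    # Imported here so that the name helper above works without the library.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    name = secret_version_name(project_id, secret_id, version)
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")
```

Requesting the latest alias lets an application pick up a rotated secret without a configuration change. Pinning a specific version number instead makes rollouts more predictable.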

Sensitive Data Protection also has a category of detectors to help you identify credentials and secrets in data that could be protected with Secret Manager.

Monitor your data

To view administrator activity and key use logs, use Cloud Audit Logs. To help secure your data, monitor logs using Cloud Monitoring to ensure proper use of your keys.

Cloud Logging captures Google Cloud events and lets you add additional sources if necessary. You can segment your logs by region, store them in buckets, and integrate custom code for processing logs. For an example, see Custom solution for automated log analysis.

You can also export logs to BigQuery to perform security and access analytics to help identify unauthorized changes and inappropriate access to your organization's data.
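
To illustrate the kind of access analytics described here, the following Python sketch scans audit log entries for IAM policy changes made by principals that aren't on an approved list. It assumes that the entries are available as dictionaries in the JSON form of Cloud Audit Logs entries. The sample entries, email addresses, and the suffix match on the method name are assumptions for illustration. In BigQuery, you would typically express the same logic as a SQL query.

```python
def find_unexpected_iam_changes(entries, approved_admins):
    """Return (timestamp, principal, resource) for IAM policy changes
    made by principals that aren't in approved_admins."""
    findings = []
    for entry in entries:
        payload = entry.get("protoPayload", {})
        # Method names vary by service; matching on the suffix is an assumption.
        if not payload.get("methodName", "").endswith("SetIamPolicy"):
            continue
        principal = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
        if principal not in approved_admins:
            findings.append((entry.get("timestamp"), principal, payload.get("resourceName")))
    return findings


# Hypothetical sample entries, for illustration only.
SAMPLE_ENTRIES = [
    {"timestamp": "2024-01-01T10:00:00Z",
     "protoPayload": {"methodName": "SetIamPolicy",
                      "resourceName": "projects/example-project",
                      "authenticationInfo": {"principalEmail": "admin@example.com"}}},
    {"timestamp": "2024-01-01T11:00:00Z",
     "protoPayload": {"methodName": "SetIamPolicy",
                      "resourceName": "projects/example-project",
                      "authenticationInfo": {"principalEmail": "intern@example.com"}}},
    {"timestamp": "2024-01-01T12:00:00Z",
     "protoPayload": {"methodName": "storage.objects.get",
                      "resourceName": "projects/_/buckets/example-bucket/objects/report.csv",
                      "authenticationInfo": {"principalEmail": "intern@example.com"}}},
]

unexpected = find_unexpected_iam_changes(SAMPLE_ENTRIES, {"admin@example.com"})
```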

Security Command Center can help you identify and resolve insecure-access problems to sensitive organizational data that's stored in the cloud. Through a single management interface, you can scan for a wide variety of security vulnerabilities and risks to your cloud infrastructure. For example, you can monitor for data exfiltration, scan storage systems for confidential data, and detect which Cloud Storage buckets are open to the internet.

Deploy applications securely

This document in the Google Cloud Architecture Framework provides best practices for deploying applications securely.

To deploy secure applications, you must have a well-defined software development lifecycle, with appropriate security checks during the design, development, testing, and deployment stages. When you design an application, we recommend a layered system architecture that uses standardized frameworks for identity, authorization, and access control.

Automate secure releases

Without automated tools, it can be hard to deploy, update, and patch complex application environments to meet consistent security requirements. Therefore, we recommend that you build a CI/CD pipeline for these tasks, which can solve many of these issues. Automated pipelines remove manual errors, provide standardized development feedback loops, and enable fast product iterations. For example, Cloud Build private pools let you deploy a highly secure, managed CI/CD pipeline for highly regulated industries, including finance and healthcare.

You can use automation to scan for security vulnerabilities when artifacts are created. You can also define policies for different environments (development, test, production, and so on) so that only verified artifacts are deployed.

Ensure that application deployments follow approved processes

If an attacker compromises your CI/CD pipeline, your entire stack can be affected. To help secure the pipeline, you should enforce an established approval process before you deploy the code into production.

If you plan to use Google Kubernetes Engine (GKE) or GKE Enterprise, you can establish these checks and balances by using Binary Authorization. Binary Authorization attaches configurable signatures to container images. These signatures (also called attestations) help to validate the image. At deployment, Binary Authorization uses these attestations to determine that a process was completed earlier. For example, you can use Binary Authorization to do the following:

  • Verify that a specific build system or continuous integration (CI) pipeline created a container image.
  • Validate that a container image is compliant with a vulnerability signing policy.
  • Verify that a container image passes criteria for promotion to the next deployment environment, such as from development to QA.
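
The deploy-time check described above can be modeled conceptually as follows. This Python sketch is not the Binary Authorization API. It only illustrates the idea that an image is admitted when every required attestor has signed its digest. The digests and attestor names are hypothetical.

```python
def is_deploy_allowed(image_digest, attestations, required_attestors):
    """Return True if every required attestor has attested to the image digest.

    attestations maps an image digest to the set of attestor names
    that have signed it.
    """
    signed_by = attestations.get(image_digest, set())
    return set(required_attestors) <= signed_by


# Hypothetical attestations, recorded as each pipeline stage completes.
ATTESTATIONS = {
    "sha256:aaa": {"built-by-ci", "vulnerability-scan-passed"},
    "sha256:bbb": {"built-by-ci"},
}
REQUIRED_ATTESTORS = {"built-by-ci", "vulnerability-scan-passed"}
```

In this model, the image with digest sha256:aaa is admitted, and sha256:bbb is rejected because it lacks the vulnerability scan attestation. An image with no recorded attestations is rejected as well.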

Scan for known vulnerabilities before deployment

We recommend that you use automated tools that can continuously perform vulnerability scans on container images before the containers are deployed to production.

Use Artifact Analysis to automatically scan for vulnerabilities in containers that are stored in Artifact Registry. This process includes two tasks: scanning and continuous analysis.

To start, Artifact Analysis scans new images when they're uploaded to Artifact Registry. The scan extracts information about the system packages in the container.

Artifact Analysis then looks for vulnerabilities when you upload the image. After the initial scan, Artifact Analysis continuously monitors the metadata of scanned images in Artifact Registry for new vulnerabilities. When Artifact Analysis receives new and updated vulnerability information from vulnerability sources, it does the following:

  • Updates the metadata of the scanned images to keep them up to date.
  • Creates new vulnerability occurrences for new notes.
  • Deletes vulnerability occurrences that are no longer valid.
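
The continuous analysis steps above can be sketched as a reconciliation between the packages recorded for each scanned image and the current set of vulnerability notes. The following Python example is a simplified model under that assumption, not the Artifact Analysis implementation. The image names, package versions, and note IDs are hypothetical.

```python
def reconcile_occurrences(existing, image_packages, notes):
    """Recompute vulnerability occurrences from current vulnerability notes.

    existing: set of (image, note_id) occurrences recorded so far.
    image_packages: dict of image -> set of (package, version) from the scan.
    notes: dict of note_id -> set of affected (package, version).
    Returns (current, created, deleted) sets of (image, note_id).
    """
    current = {
        (image, note_id)
        for image, packages in image_packages.items()
        for note_id, affected in notes.items()
        if packages & affected
    }
    created = current - existing   # occurrences for new notes
    deleted = existing - current   # occurrences that are no longer valid
    return current, created, deleted


# Hypothetical data: one note was withdrawn and a new one was published.
EXISTING = {("app:1", "NOTE-A")}
IMAGE_PACKAGES = {"app:1": {("openssl", "1.0")}}
NOTES = {"NOTE-B": {("openssl", "1.0")}}

current, created, deleted = reconcile_occurrences(EXISTING, IMAGE_PACKAGES, NOTES)
```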

Monitor your application code for known vulnerabilities

It's a best practice to use automated tools that can constantly monitor your application code for known vulnerabilities such as the OWASP Top 10. For a description of Google Cloud products and features that support OWASP Top 10 mitigation techniques, see OWASP Top 10 mitigation options on Google Cloud.

Use Web Security Scanner to help identify security vulnerabilities in your App Engine, Compute Engine, and Google Kubernetes Engine web applications. The scanner crawls your application, following all links within the scope of your starting URLs, and attempts to exercise as many user inputs and event handlers as possible. It can automatically scan for and detect common vulnerabilities, including cross-site scripting (XSS), Flash injection, mixed content (HTTP in HTTPS), and outdated or insecure libraries. Web Security Scanner gives you early identification of these types of vulnerabilities with low false positive rates.

Control movement of data across perimeters

To control the movement of data across a perimeter, you can configure security perimeters around the resources of your Google-managed services. Use VPC Service Controls to place all components and services in your CI/CD pipeline (for example, Artifact Registry, Artifact Analysis, and Binary Authorization) inside a security perimeter.

VPC Service Controls improves your ability to mitigate the risk of unauthorized copying or transfer of data (data exfiltration) from Google-managed services. With VPC Service Controls, you configure security perimeters around the resources of your Google-managed services to control the movement of data across the perimeter boundary. When a service perimeter is enforced, requests that violate the perimeter policy are denied, such as requests that are made to protected services from outside a perimeter. When a service is protected by an enforced perimeter, VPC Service Controls ensures the following:

  • A service can't transmit data out of the perimeter. Protected services function as normal inside the perimeter, but can't send resources and data out of the perimeter. This restriction helps prevent malicious insiders who might have access to projects in the perimeter from exfiltrating data.
  • Requests that come from outside the perimeter to the protected service are honored only if the requests meet the criteria of access levels that are assigned to the perimeter.
  • A service can be made accessible to projects in other perimeters using perimeter bridges.
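
To make these rules concrete, the following Python sketch models the decision for a single request. It is a conceptual illustration, not the VPC Service Controls API. It assumes that an access level is a list of allowed IP ranges, and it omits perimeter bridges. The project names and IP ranges are hypothetical.

```python
import ipaddress


def allow_request(perimeter_projects, access_level_cidrs,
                  source_project, source_ip, target_project):
    """Decide whether a request is allowed under a simplified perimeter model."""
    source_inside = source_project in perimeter_projects
    target_inside = target_project in perimeter_projects
    if source_inside and target_inside:
        return True   # protected services work normally inside the perimeter
    if source_inside:
        return False  # data can't be sent out of the perimeter
    if target_inside:
        # Requests from outside must satisfy the access level.
        address = ipaddress.ip_address(source_ip)
        return any(address in ipaddress.ip_network(cidr) for cidr in access_level_cidrs)
    return True       # neither side is protected by this perimeter


PERIMETER = {"proj-a", "proj-b"}     # hypothetical projects
ACCESS_LEVEL = ["203.0.113.0/24"]    # hypothetical corporate IP range
```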

Encrypt your container images

In Google Cloud, you can encrypt your container images using customer-managed encryption keys (CMEK). CMEK keys are managed in Cloud Key Management Service (Cloud KMS). When you use CMEK, you can temporarily or permanently disable access to an encrypted container image by disabling or destroying the key.

Manage compliance obligations

This document in the Google Cloud Architecture Framework provides best practices for managing compliance obligations.

Your cloud regulatory requirements depend on a combination of factors, including the following:

  • The laws and regulations that apply to your organization's physical locations.
  • The laws and regulations that apply to your customers' physical locations.
  • Your industry's regulatory requirements.

These requirements shape many of the decisions that you need to make about which security controls to enable for your workloads in Google Cloud.

A typical compliance journey goes through three stages: assessment, gap remediation, and continual monitoring. This section addresses the best practices that you can use during each stage.

Assess your compliance needs

Compliance assessment starts with a thorough review of all of your regulatory obligations and how your business is implementing them. To help you with your assessment of Google Cloud services, use the Compliance resource center. This site provides you with details on the following:

  • Service support for various regulations
  • Google Cloud certifications and attestations

You can ask for an engagement with a Google compliance specialist to better understand the compliance lifecycle at Google and how your requirements can be met.

For more information, see Assuring compliance in the cloud (PDF).

Deploy Assured Workloads

Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations. Assured Workloads lets you do the following:

  • Select your compliance regime. The tool then automatically sets the baseline personnel access controls.
  • Set the location for your data using organization policies so that your data at rest and your resources remain only in that region.
  • Select the key management option (such as the key rotation period) that best fits your security and compliance requirements.
  • For certain regulatory requirements such as FedRAMP Moderate, select the criteria for access by Google support personnel (for example, whether they have completed appropriate background checks).
  • Use Google-owned and Google-managed keys that are FIPS 140-2 compliant and support FedRAMP Moderate compliance. For an added layer of control and separation of duties, you can use customer-managed encryption keys (CMEK). For more information about keys, see Encrypt your data.

Review blueprints for templates and best practices that apply to your compliance regime

Google has published blueprints and solutions guides that describe best practices and that provide Terraform modules to let you provision and configure an environment that helps you achieve compliance. The following table lists a selection of blueprints that address security and alignment with compliance requirements.

Standard    Description
PCI
FedRAMP
HIPAA

Monitor your compliance

Most regulations require you to monitor particular activities, including access controls. To help with your monitoring, you can use the following:

  • Access Transparency, which provides near real-time logs when Google Cloud admins access your content.
  • Firewall Rules Logging to record TCP and UDP connections inside a VPC network for any rules that you create yourself. These logs can be useful for auditing network access or for providing early warning that the network is being used in an unapproved manner.
  • VPC Flow Logs to record network traffic flows that are sent or received by VM instances.
  • Security Command Center Premium to monitor for compliance with various standards.
  • OSSEC (or another open source tool) to log the activity of individuals who have administrator access to your environment.
  • Key Access Justifications to view the reasons for a key access request.

Automate your compliance

To help you remain in compliance with changing regulations, determine if there are ways that you can automate your security policies by incorporating them into your infrastructure as code deployments. For example, consider the following:

  • Use security blueprints to build your security policies into your infrastructure deployments.

  • Configure Security Command Center to alert when non-compliance issues occur. For example, monitor for issues such as users disabling two-step verification or over-privileged service accounts. For more information, see Setting up finding notifications.

  • Set up automatic remediation to particular notifications.
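
One way to build security policies into infrastructure as code is to check planned resources against rules before they are deployed. The following Python sketch shows the idea. The resource dictionary format and the two rules are assumptions made for illustration and don't correspond to a specific Google Cloud tool.

```python
def check_resource(resource):
    """Return a list of policy violations for one planned resource."""
    violations = []
    if resource.get("type") == "bucket" and resource.get("public", False):
        violations.append("bucket is publicly accessible")
    if resource.get("type") == "service_account" and "roles/owner" in resource.get("roles", []):
        violations.append("service account is over-privileged (roles/owner)")
    return violations


def check_plan(resources):
    """Return a dict of resource name -> violations, for resources that fail."""
    report = {}
    for resource in resources:
        violations = check_resource(resource)
        if violations:
            report[resource["name"]] = violations
    return report


# Hypothetical planned resources.
PLAN = [
    {"name": "reports", "type": "bucket", "public": True},
    {"name": "archive", "type": "bucket", "public": False},
    {"name": "deployer", "type": "service_account", "roles": ["roles/owner"]},
]

report = check_plan(PLAN)
```

A pipeline could fail the deployment when the report isn't empty, which keeps non-compliant changes from reaching production.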

For more information about compliance automation, see the Risk and Compliance as Code (RCaC) solution.

Implement data residency and sovereignty requirements

This document in the Google Cloud Architecture Framework provides best practices for implementing data residency and sovereignty requirements.

Data residency and sovereignty requirements are based on your regional and industry-specific regulations, and different organizations might have different data sovereignty requirements. For example, you might have the following requirements:

  • Control over all access to your data by Google Cloud, including what type of personnel can access the data and from which region they can access it.
  • Inspectability of changes to cloud infrastructure and services, which can have an impact on access to your data or the security of your data. Insight into these types of changes helps ensure that Google Cloud is unable to circumvent controls or move your data out of the region.
  • Survivability of your workloads for an extended time when you are unable to receive software updates from Google Cloud.

Manage your data sovereignty

Data sovereignty provides you with a mechanism to prevent Google from accessing your data. You approve access only for provider behaviors that you agree are necessary.

For example, you can manage your data sovereignty in the following ways:

Manage your operational sovereignty

Operational sovereignty provides you with assurances that Google personnel can't compromise your workloads.

For example, you can manage operational sovereignty in the following ways:

Manage software sovereignty

Software sovereignty provides you with assurances that you can control the availability of your workloads and run them wherever you want, without depending on (or being locked in to) a single cloud provider. Software sovereignty includes the ability to survive events that require you to quickly change where your workloads are deployed and what level of outside connection is allowed.

For example, Google Cloud supports hybrid and multicloud deployments. In addition, GKE Enterprise lets you manage and deploy your applications in both cloud environments and on-premises environments.

Control data residency

Data residency describes where your data is stored at rest. Data residency requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and even culture.

Controlling data residency starts with the following:

  • Understanding the type of your data and its location.
  • Determining what risks exist to your data, and what laws and regulations apply.
  • Controlling where data is or where it goes.

To help comply with data residency requirements, Google Cloud lets you control where your data is stored, how it is accessed, and how it's processed. You can use resource location policies to restrict where resources are created and to limit where data is replicated between regions. You can use the location property of a resource to identify where the service deploys and who maintains it.
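
As a simple illustration of auditing data residency, the following Python sketch compares the location property of each resource in an inventory against a list of allowed locations. The inventory format, the resource names, and the allowed locations are hypothetical. In Google Cloud, resource location policies enforce this kind of restriction when resources are created.

```python
# Hypothetical policy: resources may exist only in these regions.
ALLOWED_LOCATIONS = {"europe-west1", "europe-west4"}


def resources_outside_policy(resources, allowed_locations=ALLOWED_LOCATIONS):
    """Return the sorted names of resources whose location isn't allowed.

    resources maps a resource name to its location string.
    """
    allowed = {location.lower() for location in allowed_locations}
    return sorted(
        name for name, location in resources.items()
        if location.lower() not in allowed
    )


# Hypothetical inventory of resources and their locations.
INVENTORY = {
    "orders-db": "europe-west1",
    "logs-bucket": "us-central1",
    "cache": "EUROPE-WEST4",
}

violations = resources_outside_policy(INVENTORY)
```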

For supportability information, see Resource locations supported services.

Implement privacy requirements

This document in the Google Cloud Architecture Framework provides best practices for implementing privacy requirements.

Privacy regulations help define how you can obtain, process, store, and manage your users' data. Many privacy controls (for example, controls for cookies, session management, and obtaining user permission) are your responsibility because you own your data (including the data that you receive from your users).

Google Cloud includes the following controls that promote privacy:

  • Default encryption of all data when it's at rest, when it's in transit, and while it's being processed.
  • Safeguards against insider access.
  • Support for numerous privacy regulations.

For more information, see Google Cloud Privacy Commitments.

Classify your confidential data

You must define what data is confidential and then ensure that the confidential data is properly protected. Confidential data can include credit card numbers, addresses, phone numbers, and other personally identifiable information (PII).

Using Sensitive Data Protection, you can set up appropriate classifications. You can then tag and tokenize your data before you store it in Google Cloud. For more information, see Automatically classify your data.
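
To show what detecting and tokenizing a sensitive value can look like, the following Python sketch finds likely credit card numbers and replaces them with keyed tokens. It is a simplified stand-in, not the Sensitive Data Protection API. The pattern, the Luhn checksum filter, the token format, and the key are assumptions for illustration.

```python
import hashlib
import hmac
import re

# Matches 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def tokenize(value: str, key: bytes) -> str:
    """Replace a value with a deterministic keyed token (HMAC-SHA256)."""
    return "tok_" + hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]


def redact_cards(text: str, key: bytes) -> str:
    """Tokenize substrings that look like valid card numbers."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        return tokenize(digits, key) if luhn_valid(digits) else match.group()
    return CARD_PATTERN.sub(replace, text)


KEY = b"example-key"  # hypothetical; keep real keys in a secret store
redacted = redact_cards("card 4111 1111 1111 1111 on file", KEY)
```

Because the token is derived with a key, the same card number always maps to the same token, which preserves the ability to join records without storing the original value.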

Lock down access to sensitive data

Place sensitive data in its own service perimeter using VPC Service Controls, and set Google Identity and Access Management (IAM) access controls for that data. Configure multi-factor authentication (MFA) for all users who require access to sensitive data.

For more information, see Control movement of data across perimeters and Set up SSO and MFA.

Monitor for phishing attacks

Ensure that your email system is configured to protect against phishing attacks, which are often used for fraud and malware attacks.

If your organization uses Gmail, you can use advanced phishing and malware protection. This collection of settings provides controls to quarantine emails, defends against anomalous attachment types, and helps protect against inbound spoofing emails. Security Sandbox detects malware in attachments. Gmail is continually and automatically updated with the latest security improvements and protections to help keep your organization's email safe.

Extend zero trust security to your hybrid workforce

A zero trust security model means that no one is trusted implicitly, whether they are inside or outside of your organization's network. When your IAM systems verify access requests, a zero trust security posture means that the user's identity and context (for example, their IP address or location) are considered. Unlike a VPN, zero trust security shifts access controls from the network perimeter to users and their devices. Zero trust security allows users to work more securely from any location. For example, users can access your organization's resources from their laptops or mobile devices while at home.

On Google Cloud, you can configure Chrome Enterprise Premium and Identity-Aware Proxy (IAP) to enable zero trust for your Google Cloud resources. If your users use Google Chrome and you enable Chrome Enterprise Premium, you can integrate zero-trust security into your users' browsers.

Implement logging and detective controls

This document in the Google Cloud Architecture Framework provides best practices for implementing logging and detective controls.

Detective controls use telemetry to detect misconfigurations, vulnerabilities, and potentially malicious activity in a cloud environment. Google Cloud lets you create tailored monitoring and detective controls for your environment. This section describes these additional features and recommendations for their use.

Monitor network performance

Network Intelligence Center gives you visibility into how your network topology and architecture are performing. You can get detailed insights into network performance and then use that information to optimize your deployment by eliminating bottlenecks on your services. Connectivity Tests provides you with insights into the firewall rules and policies that are applied to the network path.

Monitor and prevent data exfiltration

Data exfiltration is a key concern for organizations. Typically, it occurs when an authorized person extracts data from a secured system and then shares that data with an unauthorized party or moves it to an insecure system.

Google Cloud provides several features and tools that help you detect and prevent data exfiltration. For more information, see Preventing data exfiltration.

Centralize your monitoring

Security Command Center provides visibility into the resources that you have in Google Cloud and into their security state. Security Command Center helps you prevent, detect, and respond to threats. It provides a centralized dashboard that you can use to help identify security misconfigurations in virtual machines, in networks, in applications, and in storage buckets. You can address these issues before they result in business damage or loss. The built-in capabilities of Security Command Center can reveal suspicious activity in your Cloud Logging security logs or indicate compromised virtual machines.

You can respond to threats by following actionable recommendations or by exporting logs to your SIEM system for further investigation. For information about using a SIEM system with Google Cloud, see Security log analytics in Google Cloud.

Security Command Center also provides multiple detectors that help you analyze the security of your infrastructure. These detectors include the following:

Other Google Cloud services, such as Google Cloud Armor logs, also provide findings for display in Security Command Center.

Enable the services that you need for your workloads, and then only monitor and analyze important data. For more information about enabling logging on services, see the enable logs section in Security log analytics in Google Cloud.

Monitor for threats

Event Threat Detection is an optional managed service of Security Command Center Premium that detects threats in your log stream. By using Event Threat Detection, you can detect high-risk and costly threats such as malware, cryptomining, unauthorized access to Google Cloud resources, DDoS attacks, and brute-force SSH attacks. Using the tool's features to distill volumes of log data, your security teams can quickly identify high-risk incidents and focus on remediation.

To help detect potentially compromised user accounts in your organization, use the Sensitive Actions Cloud Platform logs to identify when sensitive actions are taken and to confirm that valid users took those actions for valid purposes. A sensitive action is an action, such as the addition of a highly privileged role, that could be damaging to your business if a malicious actor took the action. Use Cloud Logging to view, monitor, and query the Sensitive Actions Cloud Platform logs. You can also view the sensitive action log entries with the Sensitive Actions Service, a built-in service of Security Command Center Premium.

Google Security Operations can store and analyze all of your security data centrally. To help you see the entire span of an attack, Google SecOps can map logs into a common model, enrich them, and then link them together into timelines. Furthermore, you can use Google SecOps to create detection rules, set up indicators of compromise (IoC) matching, and perform threat-hunting activities. You write your detection rules in the YARA-L language. For sample threat detection rules in YARA-L, see the Community Security Analytics (CSA) repository. In addition to writing your own rules, you can take advantage of curated detections in Google SecOps. These curated detections are a set of predefined and managed YARA-L rules that can help you identify threats.

Another option for centralizing your logs for security analysis, audit, and investigation is to use BigQuery. In BigQuery, you can monitor common threats or misconfigurations by using SQL queries (such as those in the CSA repository) to analyze permission changes, provisioning activity, workload usage, data access, and network activity. For more information about security log analytics in BigQuery from setup through analysis, see Security log analytics in Google Cloud.

The following diagram shows how to centralize your monitoring by using both the built-in threat detection capabilities of Security Command Center and the threat detection that you do in BigQuery, Google Security Operations, or a third-party SIEM.

How the various security analytics tools and content interact in Google Cloud.

As shown in the diagram, there are a variety of security data sources that you should monitor. These data sources include logs from Cloud Logging, asset changes from Cloud Asset Inventory, Google Workspace logs, and events from a hypervisor or a guest kernel. You can use Security Command Center to monitor these data sources. This monitoring occurs automatically, provided that you've enabled the appropriate features and threat detectors in Security Command Center. You can also monitor for threats by exporting security data and Security Command Center findings to an analytics tool such as BigQuery, Google Security Operations, or a third-party SIEM. In your analytics tool, you can perform further analysis and investigation by using and extending queries and rules like those available in CSA.

Google Cloud Architecture Framework: Reliability

This pillar of the Google Cloud Architecture Framework provides a high-level overview of the design principles that are required to architect and operate reliable services on a cloud platform.

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework aims to help you design your Google Cloud deployment so that it best matches your business needs.

For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability.

To run a reliable service, your architecture must include the following:

  • Measurable reliability goals that you promptly correct whenever deviations occur
  • Design patterns for the following:
    • Scalability
    • High availability
    • Disaster recovery
    • Automated change management
  • Components that self-heal (have the ability to remediate issues without manual interventions)
  • Code that includes instrumentation for observability
  • Hands-free operation, in which the service runs with minimal manual work and cognitive operator load, and with rapid failure detection and mitigation

The entire engineering organization is responsible for the reliability of the service, including development, product management, operations, and site reliability engineering (SRE) teams. Teams must understand their application's reliability targets, risk, and error budgets, and be accountable for meeting these requirements. Conflicts between reliability and product feature development are to be prioritized and escalated accordingly.

Core reliability principles

This section explores the core principles of a reliable service and sets the foundation for the more detailed documents that follow. As you read further about this topic, you'll learn that Google's approach to reliability is based on the following reliability principles.

Reliability is your top feature

Engineering teams sometimes prioritize new product development. While users anticipate new and exciting updates to their favorite applications, product updates are a short-term goal for your users. Your customers always expect service reliability, even if they don't realize it. An expanded set of tools or flashy graphics in your application won't matter if your users can't access your service or your service exhibits poor performance. Poor application performance quickly makes these expanded features irrelevant.

Reliability is defined by the user

In short, your service is reliable when your customers are happy. Users aren't always predictable, and you may overestimate what it takes to satisfy them.

By today's standard, a web page should load in about two seconds. Page abandonment is roughly 53% when load time is delayed by an additional second, and dramatically increases to 87% when load time is delayed by three seconds. However, striving for a site that delivers pages in a second is probably not the best investment. To determine the right level of service reliability for your customers, you need to measure the following:

  • User-facing workload: Measure user experience. For example, measure the success ratio of user requests, not just server metrics like CPU usage.
  • Batch and streaming workloads: Measure key performance indicators (KPIs) for data throughput, such as rows scanned per time window. This approach is more informative than a server metric like disk usage. Throughput KPIs help ensure user-requested processing finishes on time.
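
For a user-facing workload, a success ratio can be computed directly from request outcomes. The following Python sketch assumes that HTTP responses with a 5xx status count as failures. That classification is an assumption, and your own definition of a good request might differ.

```python
def success_ratio(status_codes):
    """Return the fraction of requests that succeeded.

    Assumes that any status code below 500 counts as a success.
    """
    if not status_codes:
        raise ValueError("at least one request is required")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Hypothetical request outcomes in a measurement window.
ratio = success_ratio([200, 200, 503, 404])  # 3 of 4 requests succeeded
```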

100% reliability is the wrong target

This principle is an extension of the previous one. Your systems are reliable enough when users are happy. Typically, users don't need 100% reliability to be happy. Thus, define service level objectives (SLOs) that set the reliability threshold to the percentage needed to make users happy, and then use error budgets to manage the appropriate rate of change.

Apply the design and operational principles in this framework to a product only if the SLO for that product or application justifies the cost.

Reliability and rapid innovation are complementary

Use error budgets to achieve a balance between system stability and developer agility. The following guidance helps you determine when to focus more on stability or on development:

  • When the error budget is diminished, slow down and focus on reliability features.
  • When an adequate error budget is available, you can innovate rapidly and improve the product or add product features.
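
As a worked example of this guidance, an SLO of 99.9% over 1,000,000 requests allows about 1,000 failed requests. The following Python sketch computes the budget, the fraction that remains, and a suggested focus. The 10% threshold and the returned labels are hypothetical policy choices, not values prescribed by the framework.

```python
def error_budget(slo, total_requests):
    """Return the number of requests that can fail without breaching the SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, exclusive")
    return (1 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative if overspent)."""
    return 1 - failed_requests / error_budget(slo, total_requests)


def release_guidance(remaining, threshold=0.1):
    """Suggest a focus based on the remaining budget (threshold is hypothetical)."""
    return "focus on reliability" if remaining < threshold else "ship features"


healthy = release_guidance(budget_remaining(0.999, 1_000_000, 250))
depleted = release_guidance(budget_remaining(0.999, 1_000_000, 950))
```

With 250 failures, about 75% of the budget remains, so the sketch suggests shipping features. With 950 failures, about 5% remains, so it suggests focusing on reliability.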

Design and operational principles

The remaining documents in the reliability pillar of the Architecture Framework provide design and operational principles that help you maximize system reliability. The following sections provide a summary of the design and operational principles that you'll find in each document in this series.

Establish your reliability goals

Remember, user happiness defines reliability and your reliability goals are represented by the SLOs you set. When setting your SLOs, consider the following:

  • Choose appropriate service level indicators (SLIs).
  • Set SLOs based on the user experience.
  • Iteratively improve SLOs.
  • Use strict internal SLOs.
  • Use error budgets to manage development velocity.

For more information, see Components of service level objectives.

Build observability into your infrastructure and applications

Instrument your code to maximize observability. For more information, see Build observability into your infrastructure and applications.

Design for scale and high availability

When it comes to scale and high availability (HA), consider the following principles:

  • Create redundancy for HA
  • Replicate data across regions for disaster recovery (DR)
  • Design a multi-region architecture for resilience to regional outages
  • Degrade service levels gracefully when overloaded
  • Fail safe in a way that preserves system functionality
  • Design API calls and operational commands to be retryable
  • Consider dependencies:
    • Identify and manage system dependencies
    • Minimize critical dependencies
  • Ensure every change can be rolled back

Additionally, the following activities help the reliability of your service:

  • Eliminate scalability bottlenecks
  • Prevent and mitigate traffic spikes
  • Sanitize and validate inputs

For more information, see Design for scale and high availability.

Create reliable tools and operational processes

Build reliability into tools and operations processes by doing the following:

  • Choose logical, self-defining names for applications and services
  • Use canary testing to implement progressive rollouts of procedures
  • Time your promotions and launches so that they spread out traffic and reduce system overload
  • Develop programmatic build, test, and deployment processes
  • Defend against human-caused incidents, intentional or not
  • Develop, test, and document failure response activities
  • Develop and test disaster recovery steps on a regular basis
  • Chaos engineering: Make it a practice of injecting faults into the system to determine your service's fault tolerance and resilience

For more information, see Create reliable operational processes and tools.

Build efficient alerts

When creating your alerts, we recommend that you do the following:

  • Optimize alerts for appropriate delays
  • Alert on symptoms, not causes
  • Alert on outliers, not averages

For more information, see Build efficient alerts in the Architecture Framework reliability pillar.

Build a collaborative incident management process

Incident response and management (IRM) is essential for service recovery and minimizing damage. Effective IRM includes:

  • Ownership: Assign clear service owners.
  • Well-tuned alerts: Improve incident response (IR) and reduce time to detect (TTD) with carefully designed alerts.
  • IRM plans and training: Reduce time to mitigate (TTM) with comprehensive plans, documentation and training.
  • Dashboards: Design dashboard layouts and content to efficiently alert when issues occur to minimize TTM.
  • Documentation: Create and maintain clear, concise content for all aspects of service support including diagnostic procedures and mitigation for outage scenarios.
  • Blameless culture:
    • Cultivate a blameless environment in your organization.
    • Establish a postmortem process that focuses on what, not who.
    • Learn from your outages by investigating properly and identifying areas to improve and prevent recurrences.

For more information, see Build a collaborative incident management process in the Architecture Framework reliability pillar.

What's next

Components of service level objectives

This document in the Google Cloud Architecture Framework defines the key concepts needed to understand and create service level objectives (SLOs).

At their core, SLOs reflect the reliability goals of the service you provide your users. It's important to include input from all critical stakeholders when defining these objectives. Many different groups and management levels have a deep interest in your service. These include business owners, product owners, executives, engineers, support staff, operations, sales, and any other teams associated with your service.

There are as many ways to obtain stakeholder input as there are different reliability objectives to choose. How you ultimately choose your objectives is up to you and your organization based on requirements, stakeholders, and other factors. While this process is out of scope for this guide, a simple approach is to create a shared document that describes your SLOs and how you developed them. Your team can iterate on the document as it implements and continues to improve the SLOs over time.

The following sections define the various components of SLOs.

Service level

A service level is a measurement of how well a service performs its expected work for the user. This metric can be described in terms of user happiness and measured by various methods that depend on the unique characteristics of the service, its user base, and user expectations. In this guide, we associate performance with the system's reliability.

Example service level: Our users expect the service to be available and fast.

Service level indicator

A service level indicator (SLI) is a gauge of user happiness that can be measured quantitatively. An indicator is similar to a line on a graph that changes over time as the service improves or degrades. To evaluate a service level, choose an indicator that represents some aspect of user happiness. Availability is a common SLI.

Example SLI: The number of successful requests in the last 10 minutes divided by the number of all valid requests in the same timeframe.

The SLI in the example is specific and well-defined, and expressed as a numerical value. That value reflects how available the service is. By consistently tracking this SLI over time, a team can determine the overall availability of its service.
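As a minimal sketch of the example SLI, the following Python computes the ratio of successful requests to valid requests in one measurement window. The `Request` record and its two fields are hypothetical names chosen for illustration; they aren't part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    valid: bool       # hypothetical flag: the request counts toward the SLI
    successful: bool  # hypothetical flag: the request was served successfully


def availability_sli(requests):
    """Return successful / valid requests, or None if no valid requests exist."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # no valid traffic in the window, so the SLI is undefined
    good = sum(1 for r in valid if r.successful)
    return good / len(valid)


# One illustrative 10-minute window: 997 good, 3 failed, 5 invalid requests.
window = (
    [Request(True, True)] * 997
    + [Request(True, False)] * 3
    + [Request(False, False)] * 5
)
print(f"Availability SLI: {availability_sli(window):.3%}")
```

Invalid requests are left out of both the numerator and the denominator, which matches the "all valid requests" wording of the example SLI.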

For more information about choosing your SLIs, see Choose your SLIs.

Service level objective

The service level objective (SLO) is the target range that you expect the service to achieve as measured by the SLI. The following example uses response time, or speed of the service, as the SLI.

Example SLO: Service response is faster than 400 milliseconds (ms) for 95% of all valid requests measured over 14 days.

In the example SLO, the SLI is the number of requests faster than 400 ms divided by the number of valid requests. This percentage is tracked over 14 days. The objective is for 95% of all valid requests to be faster than the threshold. That is, if the end result (the percentage of requests that meet the criteria) is at least 95%, you've met your SLO for the service.
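The example SLO can be sketched in a few lines of Python. The function name, its parameters, and the sample data are assumptions made for illustration. The sketch also assumes that every latency in the list belongs to a valid request from the 14-day window, and that hitting the target exactly counts as meeting the objective.

```python
def latency_slo_met(latencies_ms, threshold_ms=400.0, target=0.95):
    """Return (sli, met) for a 'faster than threshold' latency SLO."""
    if not latencies_ms:
        raise ValueError("no valid requests in the measurement window")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    sli = fast / len(latencies_ms)
    # Assumption: an SLI exactly equal to the target meets the objective.
    return sli, sli >= target


# Illustrative data: 96 fast requests and 4 slow ones.
sample = [120.0] * 96 + [900.0] * 4
sli, met = latency_slo_met(sample)
print(f"SLI = {sli:.2%}, SLO met: {met}")
```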

To recap, the SLI is some measurement (such as speed, availability, and success) of your service. The SLO is the expectation that a specific amount of those measurements (the percentage) meets or exceeds some predetermined level or range. Anything below the expected level is bad. You've failed to provide your users with a reliable service in a specific area of performance.

For more information about choosing your SLOs, see Choose your SLOs.

Service level agreement

The service level agreement (SLA) is the contract between you, the service provider, and your customers. It lists the SLOs the customers are promised and ultimately will expect. The SLA also specifies what happens if an SLO is not met. A broken SLO may result in the service provider refunding money or providing discounted services or, for more critical services, in legal action or punitive damages.

SLAs are not heavily discussed in this guide. SLAs are mentioned to augment your understanding of SLO, SLI, and the user.

Error budget

The final value to understand when discussing SLOs is the percentage or number of negative events your service can withstand before violating the SLO. This number, called the error budget, defines the amount of errors your business can expect and tolerate.

To demonstrate, use availability as the SLI (represented by a percentage). Three or more "nines" in the percentage indicate the precision to which you want to measure that SLI. In other words, the number of "9s" expresses the availability percentage.

Consider an SLO of three nines, or 99.9%. Subtracting the SLO value from 100% leaves a 0.1% error budget. When discussing availability, a 0.1% budget is slightly less than nine hours a year during which the service is unavailable. Adding another nine drastically reduces the error budget. An availability of 99.99% (four nines) allows less than an hour of service downtime a year.

That downtime includes failed requests, server downtime caused by faults (crashes or software bugs) or by design (upgrades or testing), human error, accidents, and many other causes.
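The downtime figures for three and four nines follow from simple arithmetic. The sketch below reproduces them, assuming a 365-day year (8,760 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def allowed_downtime_hours(slo_percent, period_hours=HOURS_PER_YEAR):
    """Convert an availability SLO into the downtime its error budget allows."""
    error_budget = (100.0 - slo_percent) / 100.0  # for example, 0.1% -> 0.001
    return error_budget * period_hours


for slo in (99.9, 99.99):
    hours = allowed_downtime_hours(slo)
    print(f"{slo}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Three nines works out to about 8.76 hours a year, and four nines to a little under 53 minutes.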

What's next

Choose your service level objectives (SLOs)

This document in the Google Cloud Architecture Framework defines how the user experience defines reliability and how to choose the appropriate service level objectives to meet that level of reliability. This document builds on the concepts defined in Components of SLOs.

The culture of site reliability engineering (SRE) values reliable services and customer happiness (or customer satisfaction). Without a defined service level and a method to gather metrics, it's difficult (if not impossible) to determine where and how much to invest in improvements.

The overriding metric that you use to measure service level is the service level objective (SLO). An SLO is made up of the following values:

  • A service level indicator (SLI): A metric of a specific aspect of your service as described in Choose your SLIs.
  • Duration: The window over which the SLI is measured. This can be calendar-based or a rolling window.
  • A target: The value (or range of values) that the SLI should meet in the given duration in a healthy service. For example, the percentage of good events to total events that you expect your service to meet, such as 99.9%.

Choosing the right SLOs for your service is a process. You start by defining the user journeys that define reliability and ultimately your SLOs. The SLOs that you choose need to measure the entire system while also balancing the needs of feature development against operational stability. After you've chosen your SLOs, you need to both iteratively improve upon them and manage them by using error budgets.

Define your user journeys

Your SLIs and SLOs are ideally based on critical user journeys (CUJs). CUJs consider user goals and how your service helps users accomplish those goals. You define a CUJ without considering service boundaries. When a CUJ is met, the customer is happy and this is an indication of a successful service.

Customer happiness, or dissatisfaction for that matter, dictates reliability and is the most critical feature of any service.

Therefore, set your SLO just high enough that most users are happy with your service, and no higher. Just as 100% availability is not the right goal, adding more "nines" to your SLOs quickly becomes expensive and might not even matter to the customer.

For uptime and other vital metrics, aim for a target lower than 100%, but close to it. Assess the minimum level of service performance and availability required. Don't set targets based on external contractual levels.

Use CUJs to develop SLOs

Choose your company's most important CUJs, and follow these steps to develop SLOs:

  1. Choose an SLI specification (such as availability or freshness).
  2. Decide how to implement the SLI specification.
  3. Ensure that your plan covers all CUJs.
  4. Set SLOs based on previous performance or business needs.

CUJs should not be constrained to a single service, nor to a single development team or organization. Your service may depend on dozens or more other services. You might also expect those services to operate at 99.5%. However, if end-to-end (entire system) performance is not tracked, running a reliable service is challenging.
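To see why end-to-end tracking matters, consider a rough calculation of how dependency availability compounds. The sketch assumes that a request succeeds only if every dependency succeeds and that dependencies fail independently. Both are simplifications, and the count of 12 dependencies is an illustrative figure, not a measurement.

```python
def serial_availability(dependency_availabilities):
    """Availability of a request path that needs every dependency to succeed.

    Assumes independent failures, which real systems rarely satisfy exactly.
    """
    result = 1.0
    for availability in dependency_availabilities:
        result *= availability
    return result


# Twelve critical dependencies, each operating at 99.5%.
end_to_end = serial_availability([0.995] * 12)
print(f"End-to-end availability: {end_to_end:.2%}")
```

Under these assumptions, twelve services that each meet 99.5% deliver only about 94% end to end, which per-service dashboards alone wouldn't reveal.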

Define target and duration

Defining target and duration (see the previous definition of an SLO) can be difficult. One way to begin the process is to identify your SLIs and chart them over time. Remember, an SLO doesn't have to be perfect from the start. Iterate on your SLO to ensure that it aligns with customer happiness and meets your business needs.

As you track SLO compliance during events such as deployments, outages, and daily traffic patterns, you'll gain insights about the target. These insights will make it more apparent what is good, bad, or tolerable for your targets and durations.

Feature development, code improvements, hardware upgrades, and other maintenance tasks can help make your service more reliable. The ability to make these frequent, small changes helps you deliver features faster and with higher quality. However, the rate at which your service changes also affects reliability. Achievable reliability goals define a pace and scope of change (called feature velocity) that customers can tolerate and benefit from.

If you can't measure the customer experience and define goals around it, you can turn to outside sources and benchmark analysis. If there's no comparable benchmark, measure the customer experience, even if you can't define goals yet. Over time, you can get to a reasonable threshold of customer happiness. This threshold is your SLO.

Understand the entire system

Your service may exist in a long line of services with both upstream and downstream processing. Measuring performance of a distributed system in a piecemeal manner (service by service) doesn't accurately reflect your customer's experience and might cause an overly sensitive interpretation.

Instead, you should measure performance against the SLO at the frontend of the process to understand what users experience. The user is not concerned about a component failure that causes a query to fail if the query is automatically and successfully retried.

If there are shared internal services in place, each service can measure performance separately against the associated SLO, with user-facing services acting as their customers. Handle these SLOs separately.

It's possible to build a highly available service (for example, 99.99%) on top of a less-available service (for example, 99.9%) by using resilience factors such as smart retries, caching, and queueing. Anyone with a working knowledge of statistics should be able to read and understand your SLO without understanding your underlying service or organizational layout as described in Conway's law.
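The effect of retries on availability can be estimated with a short calculation. This sketch assumes that each attempt fails independently with the same probability. That assumption is optimistic, because real failures are often correlated (for example, during an outage), so treat the result as an upper bound rather than a prediction.

```python
def availability_with_retries(single_attempt_availability, max_attempts):
    """Probability that at least one of max_attempts independent tries succeeds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    failure = 1.0 - single_attempt_availability
    return 1.0 - failure ** max_attempts


# A 99.9% backend with a single retry (two attempts in total).
print(f"{availability_with_retries(0.999, 2):.4%}")
```

With independent failures, one retry against a 99.9% backend already exceeds 99.99% from the caller's point of view.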

Choose the correct SLOs

There is a natural tension between product development speed and operational stability. The more you change your system, the more likely it will break. Monitoring and observability tools are critical to operational stability as you increase feature velocity. Such tools are known as application performance management (APM) tools, and can also be used to set SLOs.

When defined correctly, an SLO helps teams make data-driven operational decisions that increase development velocity without sacrificing stability. The SLO can also align development and operations teams around a single agreed upon objective. Sharing a single objective alleviates the natural tension mentioned previously: the development team's goal to create and iterate on products, and the operations team's goal to maintain system integrity.

Use this document and other reliability documents in the Architecture Framework to understand and develop SLOs. Once you have read and understood these articles, move to more detailed information about SLOs (and other SRE practices) in The SRE Book and The SRE Workbook.

Use strict internal SLOs

It's a good practice to have stricter internal SLOs than external SLAs. As SLA violations tend to require issuing a financial credit or customer refunds, you want to address problems before they reach a financial impact.

We recommend using these stricter internal SLOs with a blameless retrospective process and incident review. For more information, see Build a collaborative incident management process.

Iteratively improve SLOs

SLOs shouldn't be set in stone. Revisit SLOs periodically — quarterly, or at least annually — and confirm that they accurately reflect user happiness and correlate with service outages. Ensure they cover current business needs and any new critical user journeys. Revise and augment your SLOs as needed after these reviews.

Use error budgets to manage development velocity

Error budgets show if your service is more or less reliable than is needed for a specific time window. Error budgets are calculated as 100% – SLO over a period of time, such as 30 days.

When you have capacity left in your error budget, you can continue to launch improvements or new features quickly. When the error budget is close to zero, slow down or freeze service changes and invest engineering resources to improve reliability features.
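The following sketch shows one way to turn this guidance into a calculation. The function names and the 10% freeze threshold are assumptions chosen for illustration; your own error budget policy defines the actual thresholds and consequences.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left: 1.0 is untouched, 0 or less is spent."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("a 100% SLO or an empty window leaves no error budget")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def release_policy(remaining, freeze_below=0.10):
    """Hypothetical policy: slow down when less than 10% of the budget is left."""
    if remaining <= freeze_below:
        return "slow down or freeze changes"
    return "continue launching"


# 30-day window, 99.9% SLO, 1,000,000 requests, 600 of them bad.
remaining = error_budget_remaining(0.999, 999_400, 1_000_000)
print(f"{remaining:.0%} of the error budget left: {release_policy(remaining)}")
```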

Google Cloud Observability includes SLO monitoring to minimize the effort of setting up SLOs and error budgets. The operations suite includes a graphical user interface to help you to configure SLOs manually, an API for programmatic setup of SLOs, and built-in dashboards to track the error budget burn rate. For more information, see Creating an SLO.

Summary of SLO recommendations

  • Define and measure customer-centric SLIs, such as the availability or latency of the service.
  • Define a customer-centric error budget that's stricter than your external SLA. Include consequences for violations like production freezes.
  • Set up latency SLIs to capture outlier values, such as 90th or 99th percentile, to detect the slowest responses.
  • Review SLOs at least annually and confirm that they correlate well with user happiness and service outages.

What's next

Choose your service level indicators (SLIs)

This document in the Google Cloud Architecture Framework describes how to choose appropriate service level indicators (SLIs) for your service. This document builds on the concepts defined in Components of SLOs.

Metrics are required to determine if your service level objectives (SLOs) are being met. You define those metrics as SLIs. Each SLI is the measurement of a specific aspect of your service such as response time, availability, or success rate.

SLOs include one or more SLIs, and are ideally based on critical user journeys (CUJs). CUJs refer to a specific set of user interactions or paths that a user takes to accomplish their goal on a website. Consider a customer shopping on an ecommerce service. The customer logs in, searches for a product, adds the item to a cart, navigates to the checkout page, and checks out. CUJs identify the different ways to help users complete tasks as quickly as possible.

When choosing SLIs, you need to consider the metrics that are appropriate to your service, the various metric types that you can use, the quality of the metric, and the correct number of metrics needed.

Choose appropriate SLIs for your service type

There are many service types. The following table lists common service types and provides examples of SLIs for each. Some SLIs are applicable to multiple service types. If an SLI appears more than once in the table, only the first SLI instance provides a definition. Recall that SLIs are often expressed by the number of "nines" in the metric.

Service type Typical SLIs
Serving systems
  • Availability — the percentage of the service that is usable. Availability is defined as the fraction of successful requests divided by the total number of requests, and expressed as a percentage such as 99.9%.
  • Latency — how quickly a certain percentage of requests are fulfilled. For example, 99th percentile at 300 ms.
  • Quality — the extent to which the content in the response to a request deviates from the ideal response content. For example, a scale from 0% to 100%.
Data processing systems
  • Coverage — the amount of data that has been processed, expressed as a fraction. For example, 95%.
  • Correctness — the fraction of output data deemed to be correct. For example, 99.99%.
  • Freshness — The freshness of the source data or aggregated output data. For example, data was refreshed 20 minutes ago.
  • Throughput — The amount of data processed. For example, 500 MiB per sec or 1000 requests per second.
Storage systems
  • Durability — the likelihood that data written to the system is accessed in the future. For example, 99.9999%.
  • Time to first byte (TTFB) — the time it takes to send and get the first byte of a page.
  • Blob availability — the ratio of customer requests returning a non-server error response to the total number of customer requests.
  • Throughput
  • Latency
Request-driven systems
  • Availability
  • Latency
  • Quality
Scheduled execution systems
  • Skew — the proportion of executions that start within an acceptable window of the expected start time.
  • Execution — The time a job takes to complete. For a given execution, a common failure mode is for actual duration to exceed scheduled duration.

Evaluate different metric types

In addition to choosing the appropriate SLI for your service, you need to decide the metric type to use for your SLI. The SLIs listed in the previous section tend to be one of the following types:

  • Counter: This type of metric can increase but not decrease. For example, the number of errors that occurred up to a given point of measurement.
  • Gauge: This type of metric can increase or decrease. For example, the actual value of a measurable part of the system (such as queue length).
  • Distribution (histogram): The number of events that inhabit a particular measurement segment for a given time period. For example, measuring how many requests take 0-10 ms to complete, how many take 11-30 ms, and how many take 31-100 ms. The result is a count for each bucket, such as [0-10: 50], [11-30: 220], and [31-100: 1103].
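As an illustration of the distribution type, the sketch below sorts latency samples into the buckets used in the example (0-10 ms, 11-30 ms, and 31-100 ms). The overflow bucket for values above 100 ms is an addition for completeness, and the function name and sample values are illustrative.

```python
from bisect import bisect_left

# Inclusive upper bounds for the buckets 0-10, 11-30, and 31-100 ms.
UPPER_BOUNDS_MS = [10, 30, 100]


def bucket_counts(latencies_ms, upper_bounds=UPPER_BOUNDS_MS):
    """Count samples per bucket; the final slot holds values above the last bound."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in latencies_ms:
        # bisect_left returns the first bucket whose upper bound is >= value.
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


samples = [3, 10, 11, 30, 31, 100, 250]
print(bucket_counts(samples))
```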

For more information about these types, see the Prometheus project documentation and Value types and metric kinds in Cloud Monitoring.

Consider the quality of the metric

Not every metric is useful. Beyond confirming that a metric can be expressed as a ratio of successful events to total events, you need to determine whether the metric is a good SLI for your needs. To help you make that determination, consider the following characteristics of a good metric:

  • Metrics relate directly to user happiness. Users are unhappy when a service does not behave as expected, such as when the service is slow, inaccurate, or fails completely. Validate any SLO based on these metrics by comparing the SLI to other signals of user happiness. This comparison includes data such as the number of customer complaint tickets, support call volume, and social media sentiment. (To learn more, see Continuous Improvement of SLO Targets).

    If your metric doesn't align with these other indicators of user happiness, it might not be a good SLI.

  • The metric deterioration correlates with outages. Any metric reporting good service results during an outage is clearly the wrong metric for an SLI. Conversely, a metric that looks bad during normal operation is also problematic.

  • The metric provides a good signal-to-noise ratio. Dismiss any metric that results in a large number of false negatives or false positives.

  • The metric scales monotonically and linearly with customer happiness. Simply put, as the metric improves, customer happiness also improves.

Select the correct number of metrics

A single service can have multiple SLIs, especially if the service performs different types of work or serves different types of users. It's best to choose the appropriate metrics for each type.

In contrast, some services perform similar types of work that can be directly compared. For example, users might view different pages on your site (such as the homepage, subcategories, and the top-10 list). Instead of developing a separate SLI for each of these actions, combine them into a single SLI category, such as browse services.

Your users' expectations don't change much between actions of a similar category. Their happiness is quantifiable by the answer to the question: "Did I see a full page of items quickly?"

Use as few SLIs as possible to accurately represent your service tolerances. As a general guide, have two to six SLIs. With too few SLIs, you can miss valuable signals. Too many and your support team has too much data at hand with little added benefit. Your SLIs should simplify your understanding of production health and provide a sense of coverage, not overwhelm (or underwhelm) you.

What's next?

Measure your SLOs

This document in the Google Cloud Architecture Framework builds on the previous discussions of service level objectives (SLOs) by exploring the what and how of measuring with respect to common service workloads. This document builds on the concepts defined in Components of service level objectives.

Decide what to measure

Regardless of your domain, many services share common features and can use generic SLOs. This section discusses generic SLOs for different service types and provides detailed explanations of the SLIs that apply to each SLO.

Each of the following subsections identifies a particular service type and provides a short description of that service. Then, listed under each service type are possible SLIs, a definition of the indicator, and other information related to the SLI.

Request-driven services

This service type receives a request from a client (a user or another service), performs some computation, possibly sends network requests to a backend, and then returns a response to the client.

Availability as an SLI

Availability is the proportion of valid requests that are served successfully. The following list covers information to consider when using availability as an SLI:

  • As a service owner, you decide what is a valid request. Common definitions include requests that are not zero-length or that adhere to a client-server protocol. One method to gauge validity is reviewing HTTP (or RPC) response codes. For example, HTTP 5xx codes are server-related errors that count against an SLO, while 4xx codes are client errors that don't count.
  • Each response code returned by your service must be examined to ensure that the application uses those codes properly and consistently. Is the code an accurate indicator of your users' experience of the service? For example, how does an ecommerce site respond when a user attempts to order an item that is out of stock? Does the request fail and return an error message? Does the site suggest similar products? Error codes must be tied to user expectations.
  • Developers can inadvertently misuse errors. Using the out-of-stock scenario from the previous bullet, a developer might mistakenly return an error. However, the system is working properly and not in error. The code needs to return a success, even though the user couldn't purchase the item. While service owners should be notified about the low inventory, the inability to make a sale isn't an error from the customer's perspective and doesn't count against an SLO.
  • An example of a more complex service is one that handles asynchronous requests or provides a long-running process for customers. While you can expose availability in another way, we recommend representing availability as the proportion of valid requests that are successful. Availability can also be defined as the number of minutes a customer's workload is running (sometimes referred to as the good minutes approach).
  • Consider a service offering virtual machines (VMs). You could measure availability in terms of the number of minutes after an initial request that the VM is available to the user.
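The response-code approach described in the list above can be sketched as follows. The sketch assumes one particular reading of "don't count": 4xx responses are dropped from both the numerator and the denominator, and everything else below 500 is treated as a success. Your own definition of a valid request might differ.

```python
def availability_from_status_codes(status_codes):
    """Availability SLI from HTTP status codes, excluding 4xx client errors."""
    # Assumption: 4xx responses are not valid requests for SLI purposes.
    valid = [code for code in status_codes if not 400 <= code <= 499]
    if not valid:
        return None  # no valid requests, so the SLI is undefined
    good = sum(1 for code in valid if code < 500)  # 5xx counts against the SLO
    return good / len(valid)


codes = [200] * 95 + [404] * 3 + [503] * 5
print(f"{availability_from_status_codes(codes):.1%}")
```

In the out-of-stock scenario, a handler that returns a success code would be counted as good here, which is the behavior the list recommends.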

Latency as an SLI

Latency (or speed) is the proportion of valid requests that are served faster than a threshold. Thus, latency indicates service quickness, and can be measured by calculating the difference between the start and stop times for a given request type. Remember, this is the user's perception of latency, and service owners commonly measure this value too precisely. In reality, users can't distinguish between a 100 millisecond (ms) and a 300 ms refresh, and might even accept responses between 300 ms and 1000 ms as normal. For more information, see the RAIL model.

Develop activity-centric metrics that focus on the user. The following are some examples of such metrics:

  • Interactive: A user waits 1000 ms for a result after clicking an element.
  • Write: A change to an underlying distributed system takes 1500 ms. While this length of time is considered slow, users tend to accept it. We recommend that you explicitly distinguish between writes and reads in your metrics.
  • Background: Actions that are not user-visible, like a periodic refresh of data or other asynchronous requests, take no more than 5000 ms to complete.

Latency is commonly measured as a distribution, as mentioned in Choose your SLIs. Given a distribution, you can measure various percentiles. For example, you might measure the number of requests that are slower than the historical 99th percentile. Events faster than this threshold are considered good; slower requests are considered not so good. You set this threshold based on product requirements. You can even set multiple latency SLOs, for example typical latency versus tail latency.

Don't only use the average (or median) latency as your SLI. If the median latency is too slow, half your users are already unhappy. Also, your service can experience bad latency for days before you discover the problem. Therefore, define your SLO for tail latency (95th percentile) and for median latency (50th percentile).

In the ACM article Metrics That Matter, Benjamin Treynor Sloss writes the following:

  • "A good practical rule of thumb ... is that the 99th-percentile latency should be no more than three to five times the median latency."
  • "We find the 50th-, 95th-, and 99th-percentile latency measures for a service are each individually valuable, and we will ideally set SLOs around each of them."

Determine your latency thresholds based on historical percentiles, then measure how many requests fall into each bucket. This approach is a good model to follow.
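A short sketch can show how the 50th, 95th, and 99th percentiles might be derived from raw latency samples and checked against the quoted rule of thumb. The nearest-rank percentile definition used here is one of several common definitions, and the function names and sample data are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need a non-empty list and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Report p50, p95, and p99, plus the 'p99 within 5x the median' check."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50": p50,
        "p95": percentile(latencies_ms, 95),
        "p99": p99,
        "tail_within_5x_median": p99 <= 5 * p50,
    }


latencies = [100] * 90 + [200] * 5 + [300] * 4 + [2000]
print(latency_summary(latencies))
```

The single 2000 ms outlier doesn't move the 99th percentile in this sample, which is why a maximum or an average alone would give a misleading picture of the tail.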

Quality as an SLI

Quality is the proportion of valid requests that are served without degradation of service. As an example, quality is a helpful SLI for complex services that are designed to fail gracefully. To illustrate, consider a web page that loads its main content from one datastore and loads optional assets from 100 other services and datastores. If one of the optional elements is out of service or too slow, the page is still rendered without the optional elements. In this scenario, you can use SLIs to do the following:

  • By measuring the number of requests that receive a degraded response (a response missing a reply from at least one backend service), you can report the ratio of bad requests.
  • You can track the number of responses that are missing a reply from a single backend, or from multiple backends.
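As a minimal sketch of the two measurements above, the following code takes, for each response, the number of backends that failed to reply. That input format and the function names are assumptions made for illustration.

```python
from collections import Counter


def quality_sli(missing_backends_per_response):
    """Proportion of responses served with no missing backend replies."""
    total = len(missing_backends_per_response)
    if total == 0:
        return None
    undegraded = sum(1 for missing in missing_backends_per_response if missing == 0)
    return undegraded / total


def degradation_breakdown(missing_backends_per_response):
    """Count degraded responses by how many backends were missing."""
    return Counter(m for m in missing_backends_per_response if m > 0)


# 97 complete responses, two missing one backend, one missing three.
observed = [0] * 97 + [1, 1, 3]
print(quality_sli(observed), dict(degradation_breakdown(observed)))
```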

Data processing services

These services consume data from an input, process that data, and generate an output. Service performance at intermediate steps is not as important as the final result. The strongest SLIs are freshness, coverage, correctness, and throughput. Latency and availability are less useful.

Freshness as an SLI

Freshness is the proportion of valid data updated more recently than a threshold. The following list provides some examples of using freshness as an SLI:

  • In a batch system, freshness is measured as the time elapsed since a processing run completed successfully for a given output.
  • In real-time processing or more complex systems, freshness tracks the age of the most-recent record processed in a pipeline.
  • In an online game that generates map tiles in real time, users might not notice how quickly map tiles are created, but they might notice when map data is missing or is not fresh. In this case, freshness tracks missing or stale data.
  • In a service that reads records from a tracking system to generate the message "X items in stock" for an ecommerce website, a freshness SLI could be defined as the percentage of requests that are using recently refreshed (within the last minute) stock information.
  • You can also use a metric for serving non-fresh data to update the SLI for quality.

Coverage as an SLI

Coverage is the proportion of valid data processed successfully.

To define coverage, follow these steps:

  1. Determine whether to accept an input as valid or skip it. For example, if an input record is corrupted or zero-length and cannot be processed, you might treat that record as invalid and exclude it from the metric.
  2. Count the number of valid records. This count can be accomplished with a simple count() method and represents your total record count.
  3. Finally, count the number of records that were processed successfully and compare that number against the total valid record count. This value is your SLI for coverage.
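The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the validity rule in is_valid, the coverage_sli helper, and the demo_process stand-in are all hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, Optional


def is_valid(record: bytes) -> bool:
    """Hypothetical validity rule: zero-length records are skipped."""
    return len(record) > 0


def coverage_sli(records: Iterable[bytes],
                 process: Callable[[bytes], None]) -> Optional[float]:
    """Return (records processed successfully) / (valid records).

    Returns None when there are no valid records, because the ratio
    is undefined in that case.
    """
    valid = [r for r in records if is_valid(r)]  # steps 1 and 2
    if not valid:
        return None
    succeeded = 0
    for record in valid:  # step 3
        try:
            process(record)
            succeeded += 1
        except Exception:
            pass  # a failed record counts against coverage
    return succeeded / len(valid)


def demo_process(record: bytes) -> None:
    """Stand-in processor that fails on one specific payload."""
    if record == b"bad":
        raise ValueError("cannot process record")


sli = coverage_sli([b"a", b"", b"bad", b"c"], demo_process)
print(f"coverage SLI: {sli:.3f}")
```

In this example the empty record is excluded as invalid, so two of the three valid records succeed.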

Correctness as an SLI

Correctness is the proportion of valid data that produced correct output. Consider the following points when using correctness as an SLI:

  • In some cases, the methods to determine the correctness of an output are used to validate the output processing. For example, a system that rotates or colorizes an image should never produce a zero-byte image, or an image with a length or width of zero. It is important to separate this validation logic from the processing logic itself.
  • One method of measuring a correctness SLI is to use known-good test input data. This type of data is data that produces a known correct output. Remember, input data must be representative of user data.
  • In other cases, it's possible to use a mathematical or logical check against the output, as in the preceding example of rotating an image.
  • Lastly, consider a billing system that determines transaction validity by checking the difference between the balance before and after a transaction. If this matches the value of the transaction itself, it's a valid transaction.
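As a rough sketch of the billing check described in the last bullet, the following Python compares the balance change with the transaction amount and reports the proportion of transactions that pass. The tuple layout and the function names are assumptions made for illustration; Decimal is used so that currency arithmetic is exact.

```python
from decimal import Decimal
from typing import Iterable, Optional, Tuple

# Assumed layout: (balance_before, balance_after, amount)
Transaction = Tuple[Decimal, Decimal, Decimal]


def is_correct(before: Decimal, after: Decimal, amount: Decimal) -> bool:
    """A transaction is correct when the balance moved by exactly its amount."""
    return after - before == amount


def correctness_sli(transactions: Iterable[Transaction]) -> Optional[float]:
    """Return the proportion of transactions that pass the balance check."""
    items = list(transactions)
    if not items:
        return None  # no valid events, so the SLI is undefined
    good = sum(1 for before, after, amount in items
               if is_correct(before, after, amount))
    return good / len(items)


ledger = [
    (Decimal("100.00"), Decimal("75.00"), Decimal("-25.00")),
    (Decimal("75.00"), Decimal("80.00"), Decimal("5.00")),
    (Decimal("80.00"), Decimal("80.00"), Decimal("10.00")),  # balance did not move
]
print(f"correctness SLI: {correctness_sli(ledger):.3f}")
```

Note that the check lives in its own function, separate from whatever code applies the transaction, in line with the earlier point about keeping validation logic apart from processing logic.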

Throughput as an SLI

Throughput is the proportion of time where the data processing rate was faster than the threshold. Here are some points to consider when using throughput as an SLI:

  • Throughput in a data processing system is often more representative of user happiness than a single latency measurement for a given operation. If the size of each input varies dramatically, it might not make sense to compare the time each element takes to finish (if all jobs progress at an acceptable rate).
  • Bytes per second is a common way to measure the amount of work it takes to process data regardless of the size of the dataset. But any metric that roughly scales linearly with respect to the cost of processing will work.
  • It might be worthwhile to partition your data processing systems based upon expected throughput rates. Or implement a quality of service system that ensures high-priority inputs are handled, and low-priority inputs are queued. Either way, measuring throughput as defined in this section helps determine if your system is working within the SLO.
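One way to turn this definition into a number is to split time into fixed windows and count the windows in which the processing rate exceeded the threshold. The sketch below assumes you already have a per-window byte count; the function name, window length, and threshold are illustrative choices, not values from this guide.

```python
from typing import Optional, Sequence


def throughput_sli(bytes_per_window: Sequence[int],
                   window_seconds: float,
                   threshold_bytes_per_second: float) -> Optional[float]:
    """Return the proportion of windows whose rate was faster than the threshold."""
    if not bytes_per_window:
        return None  # nothing measured, so the SLI is undefined
    good = sum(1 for processed in bytes_per_window
               if processed / window_seconds > threshold_bytes_per_second)
    return good / len(bytes_per_window)


# Four one-minute windows, with an assumed threshold of 50,000 bytes per second.
windows = [6_000_000, 3_000_000, 9_000_000, 600_000]
print(throughput_sli(windows, window_seconds=60, threshold_bytes_per_second=50_000))
```

The rates here are 100,000, 50,000, 150,000, and 10,000 bytes per second, so only two of the four windows are strictly faster than the threshold.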

Scheduled execution services

For services that need to perform an action at a regular interval (such as Kubernetes cron jobs), measure skew and execution duration. The following is a sample scheduled Kubernetes cron job:

  # Runs at minute zero of every hour. The jobTemplate is omitted for brevity.
  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: hello
  spec:
    schedule: "0 * * * *"

Skew as an SLI

Skew is the proportion of executions that start within an acceptable window of the expected start time. When using skew, consider the following:

  • Skew measures the time difference between when a job is scheduled to start and its actual start time. Consider the preceding Kubernetes cron job example. If it's set to start at minute zero of every hour, but starts at three minutes past the hour, the skew is three minutes. When a job runs early, you have a negative skew.
  • You can measure skew as a distribution over time, with corresponding acceptable ranges that define good skew. To determine the SLI, compare the number of runs that were within a good range against the total number of runs.
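The following Python sketch shows one way to compute skew for each run and the resulting SLI. The acceptable window (up to one minute early, up to two minutes late) and the function names are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

Run = Tuple[datetime, datetime]  # (scheduled start, actual start)


def skew(scheduled: datetime, actual: datetime) -> timedelta:
    """Positive when the job starts late, negative when it starts early."""
    return actual - scheduled


def skew_sli(runs: Sequence[Run],
             earliest: timedelta = timedelta(minutes=-1),
             latest: timedelta = timedelta(minutes=2)) -> Optional[float]:
    """Return the proportion of runs whose skew falls inside [earliest, latest]."""
    if not runs:
        return None
    good = sum(1 for scheduled, actual in runs
               if earliest <= skew(scheduled, actual) <= latest)
    return good / len(runs)


runs = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 3)),       # 3 minutes late
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 0, 30)),   # 30 seconds late
    (datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 11, 59, 30)),  # 30 seconds early
    (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 13, 1)),       # 1 minute late
]
print(skew_sli(runs))
```

With this window, only the run that started three minutes late counts as a bad event.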

Execution duration as an SLI

Execution duration is the proportion of executions that complete within the acceptable duration window. The following covers important concepts related to using execution duration:

  • Execution duration is the time a job takes to complete. For a given execution, a common failure mode is when actual duration exceeds scheduled duration.
  • An interesting case is applying this SLI to a never-ending job. Because these jobs don't finish, record the time spent on a given job instead of waiting for a job to complete. This approach provides an accurate distribution of how long work takes to complete, even in worst-case scenarios.
  • As with skew, you can track execution duration as a distribution and define acceptable upper and lower bounds for good events.

Metrics for other system types

Many other workloads have their own metrics to generate SLIs and SLOs. Consider the following examples:

  • Storage systems: Durability, throughput, time to first byte, blob availability.
  • Media/video: Client playback continuity, time to start playback, transcode graph execution completeness.
  • Gaming: Time to match active players, time to generate a map.

Decide how to measure

The previous section covered what you're measuring. After you have determined what to measure, you can decide how to measure. You can collect SLI metrics in several ways. The following sections identify various measurement methods, provide a brief description of the method, list the method's advantages and disadvantages, and identify appropriate implementation tools for the method.

Server-side logging

The server-side logging method involves processing server-side logs of requests or processed data.

Server-side logging Details
Advantages
  • Reprocess existing logs to backfill historical SLI records.
  • Use cross-service session identifiers to reconstruct complex user journeys across multiple services.
Disadvantages
  • Requests that don't arrive at the server are not recorded.
  • Requests that cause a server to crash might not be recorded.
  • Length of time to process logs can cause stale SLIs, which might be inadequate data for an operational response.
  • Writing code to process logs can be an error-prone, time-consuming task.
Implementation method & tools

Application server metrics

The application server metrics method involves exporting SLI metrics from the code that serves user requests or processes their data.

Application server metrics Details
Advantages
  • Adding new metrics to code is typically fast and inexpensive.
Disadvantages
  • Requests that don't reach the application servers are not recorded.
  • Multi-service requests could be hard to track.
Implementation method & tools

Frontend infrastructure metrics

The frontend infrastructure metrics method uses metrics from the load-balancing infrastructure (such as a global Layer 7 load balancer in Google Cloud).

Frontend infrastructure metrics Details
Advantages
  • Metrics and historical data often already exist, reducing the engineering effort to get started.
  • Measurements are taken at the point nearest the customer yet still within the serving infrastructure.
Disadvantages
  • Isn't viable for data processing SLIs.
  • Can only approximate multi-request user journeys.
Implementation method & tools

Synthetic clients or data

The synthetic clients or data method involves using clients that send fabricated requests at regular intervals and validate the responses.

Synthetic clients or data Details
Advantages
  • Measures all steps of a multi-request user journey.
  • Sending requests from outside your infrastructure captures more of the overall request path in the SLI.
Disadvantages
  • Approximating user experience with synthetic requests might be misleading (producing both false positives and false negatives).
  • Covering all corner cases is hard and can devolve into integration testing.
  • High reliability targets require frequent probing for accurate measurement.
  • Probe traffic can drown out real traffic.
Implementation method & tools

Client instrumentation

The client instrumentation method involves adding observability features to the client that the user interacts with, and logging events back to your serving infrastructure that tracks SLIs.

Client instrumentation Details
Advantages
  • Provides the most accurate measure of user experience.
  • Can quantify reliability of third parties, for example, CDN or payments providers.
Disadvantages
  • Client log ingestion and processing latency make these SLIs unsuitable for triggering an operational response.
  • SLI measurements contain a number of highly variable factors potentially outside of direct control.
  • Building instrumentation into the client can involve lots of engineering work.
Implementation method & tools

Choose a measurement method

After you have decided what and how to measure your SLO, your next step is to choose a measurement method that most closely aligns with your customer's experience of your service, and demands the least effort on your part. To achieve this ideal, you might need to use a combination of the methods in the previous tables. The following are suggested approaches that you can implement over time, listed in order of increasing effort:

  • Use application server exports and infrastructure metrics. Typically, you can access these metrics immediately, and they quickly provide value. Some APM tools include built-in SLO tooling.
  • Use client instrumentation. Because legacy systems typically lack built-in, end-user client instrumentation, setting up instrumentation might require a significant investment. However, if you use an APM suite or frontend framework that provides client instrumentation, you can quickly gain insight into your customer's happiness.
  • Use logs processing. If you can't implement server exports or client instrumentation (previous bullets) but logs do exist, logs processing might be your best approach. Another method is to combine exports and logs processing. Use exports as an immediate source for some SLIs (such as immediate availability) and logs processing for long-term signals (such as the slow-burn alerts discussed in the SLOs and alerts guide).
  • Implement synthetic testing. After you have a basic understanding of how your customers use your service, you can test your service. For example, you can seed test accounts with known-good data and query for it. This approach can help highlight failure modes that aren't readily observed, such as for low-traffic services.

SLOs and alerts

This document in the Google Cloud Architecture Framework: Reliability section provides details about alerting around SLOs.

A mistaken approach to introducing a new observability system like SLOs is to use the system to completely replace an earlier system. Rather, you should see SLOs as a complementary system. For example, instead of deleting your existing alerts, we recommend that you run them in parallel with the SLO alerts introduced here. This approach lets you discover which legacy alerts are predictive of SLO alerts, which alerts fire in parallel with your SLO alerts, and which alerts never fire.

A tenet of SRE is to alert based on symptoms, not on causes. SLOs are, by their very nature, measurements of symptoms. As you adopt SLO alerts, you might find that the symptom alert fires alongside other alerts. If you discover that your legacy, cause-based alerts fire with no corresponding SLO alert or symptoms, these are good candidates to be turned off entirely, turned into ticketing alerts, or logged for later reference.

For more information, see SRE Workbook, Chapter 5.

SLO burn rate

An SLO's burn rate is a measurement of how quickly an outage exposes users to errors and depletes the error budget. By measuring your burn rate, you can determine the time until a service violates its SLO. Alerting based on the SLO burn rate is a valuable approach. Remember that your SLO is based on a duration, which might be quite long (weeks or even months). However, the goal is to quickly detect a condition that results in an SLO violation before that violation actually occurs.

The following table shows the time it takes to exceed an objective if 100% of requests are failing for the given interval, assuming queries per second (QPS) is constant. For example, if you have a 99.9% SLO measured over 30 days, you can withstand 43.2 minutes of full downtime during that 30 days. That downtime can occur all at once or be spread over several incidents.

Objective   90 days       30 days        7 days         1 day
90%         9 days        3 days         16.8 hours     2.4 hours
99%         21.6 hours    7.2 hours      1.7 hours      14.4 minutes
99.9%       2.2 hours     43.2 minutes   10.1 minutes   1.4 minutes
99.99%      13 minutes    4.3 minutes    1 minute       8.6 seconds
99.999%     1.3 minutes   25.9 seconds   6 seconds      0.9 seconds
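Each cell in the table is the measurement window multiplied by the error budget fraction (1 minus the objective). The following Python sketch reproduces the 30-day column under the same constant-QPS assumption; the function name is illustrative.

```python
from datetime import timedelta


def allowed_full_downtime(objective_percent: float, window: timedelta) -> timedelta:
    """Time a 100% outage can last before the objective is exceeded."""
    error_budget_fraction = 1 - objective_percent / 100
    return window * error_budget_fraction


for objective in (90, 99, 99.9, 99.99, 99.999):
    budget = allowed_full_downtime(objective, timedelta(days=30))
    print(f"{objective}% over 30 days: {budget.total_seconds() / 60:.1f} minutes")
```

For the 99.9% objective this works out to about 2,592 seconds, which is the 43.2 minutes cited earlier.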

In practice, you cannot afford any 100%-outage incidents if you want to achieve high-success percentages. However, many distributed systems can partially fail or degrade gracefully. Even in those cases, you still want to know whether a human needs to step in, and SLO alerts give you a way to determine that.

When to alert

An important question is when to act based on your SLO burn rate. As a rule, if you will exhaust your error budget in 24 hours, the time to page someone to fix an issue is now.

Measuring the rate of failure isn't always straightforward. A series of small errors might look terrifying in the moment but turn out to be short-lived and have an inconsequential impact on your SLO. Conversely, if a system is slightly broken for a long time, these errors can add up to an SLO violation.

Ideally, your team will react to these signals so that you spend almost all of your error budget (but not exceed it) for a given time period. If you spend too much, you violate your SLO. If you spend too little, you're not taking enough risk or possibly burning out your on-call team.

You need a way to determine when a system is broken enough that a human should intervene. The following sections discuss some approaches to that question.

Fast burns

One type of SLO burn is a fast SLO burn because it burns through your error budget quickly and demands that you intervene to avoid an SLO violation.

Suppose your service operates normally at 1000 queries per second (QPS), and you want to maintain 99% availability as measured over a seven-day week. Your error budget is about 6 million allowable errors (out of about 600 million requests). If you have 24 hours before your error budget is exhausted, for example, that gives you a limit of about 70 errors per second, or 252,000 errors in one hour. These parameters are based on the general rule that pageable incidents should consume at least 1% of the quarterly error budget.
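The arithmetic in this example can be checked with a short Python sketch. The function name and return shape are illustrative assumptions; the inputs are the figures from the paragraph above (1000 QPS, a 99% objective, a seven-day window, and 24 hours to exhaustion).

```python
from typing import Tuple


def fast_burn_threshold(qps: float, slo: float,
                        window_seconds: float,
                        exhaust_seconds: float) -> Tuple[float, float]:
    """Return (error budget in requests, errors per second that exhaust it)."""
    total_requests = qps * window_seconds
    error_budget = total_requests * (1 - slo)
    return error_budget, error_budget / exhaust_seconds


WEEK = 7 * 24 * 3600
DAY = 24 * 3600

budget, errors_per_second = fast_burn_threshold(
    qps=1000, slo=0.99, window_seconds=WEEK, exhaust_seconds=DAY)

print(f"total requests:    {1000 * WEEK:,}")
print(f"error budget:      {budget:,.0f}")
print(f"errors per second: {errors_per_second:.0f}")
print(f"errors per hour:   {errors_per_second * 3600:,.0f}")
```

Seventy errors per second out of 1000 requests per second is a 7% error rate, which is seven times the 1% rate that the objective allows. That factor matches the ratio of the seven-day window to the 24-hour exhaustion time.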

You can choose to detect this rate of errors before that one hour has elapsed. For example, after observing 15 minutes of a 70-error-per-second rate, you might decide to page the on-call engineer, as the following diagram shows.

[Diagram: paging the on-call engineer after 15 minutes of a 70-error-per-second rate]

Ideally, the problem is solved before you expend one hour of your 24-hour budget. Choosing to detect this rate in a shorter window (for example, one minute) is likely to be too error-prone. If your target time to detect is shorter than 15 minutes, this number can be adjusted.

Slow burns

Another type of burn rate is a slow burn. Suppose you introduce a bug that burns your weekly error budget by day five or six, or your monthly budget by week two. What is the best response?

In this case, you might introduce a slow SLO burn alert that lets you know you're on course to consume your entire error budget before the end of the alerting window. Of course, that alert might return many false positives. For example, there might often be a condition where errors occur briefly but at a rate that would quickly consume your error budget. In these cases, the condition is a false positive because it lasts only a short time and does not threaten your error budget in the long term. Remember, the goal is not to eliminate all sources of error; it is to stay within the acceptable range to not exceed your error budget. You want to avoid alerting a human to intervene for events that are not legitimately threatening your error budget.

We recommend that you notify a ticket queue (as opposed to paging or emailing) for slow-burn events. Slow-burn events are not emergencies but do require human attention before the budget expires. These alerts shouldn't be emails to a team list, which quickly become a nuisance to be ignored. Tickets should be trackable, assignable, and transferable. Teams should develop reports for ticket load, closure rates, actionability, and duplicates. Excessive, unactionable tickets are a great example of toil.

Using SLO alerts skillfully can take time and depends on your team's culture and expectations. Remember that you can fine-tune your SLO alerts over time. You can also have multiple alert methods, with varying alert windows, depending on your needs.

Latency alerts

In addition to availability alerts, you can also have latency alerts. With latency SLOs, you're measuring the percent of requests that are not meeting a latency target. By using this model, you can use the same alerting model that you use to detect fast or slow burns of your error budget.

As noted earlier about median latency SLOs, fully half your requests can be out of SLO. In other words, your users can suffer bad latency for days before you detect the impact on your long-term error budget. Instead, services should define tail latency objectives and typical latency objectives. We suggest using the historical 90th percentile to define typical and the 99th percentile for tail. After you set these targets, you can define SLOs based on the number of requests you expect to land in each latency category and how many are too slow. This approach is the same concept as an error budget and should be treated the same. Thus, you might end up with a statement like "90% of requests will be handled within typical latency and 99.9% within tail latency targets." These targets ensure that most users experience your typical latency and still let you track how many requests are slower than your tail latency targets.

Some services might have highly variant expected runtimes. For example, you might have dramatically different performance expectations for reading from a datastore system versus writing to it. Instead of enumerating every possible expectation, you can introduce runtime performance buckets, as the following tables show. This approach presumes that these types of requests are identifiable and pre-categorized into each bucket. You shouldn't expect to categorize requests on the fly.

User-facing website

Bucket           Expected maximum runtime
Read             1 second
Write / update   3 seconds

Data processing systems

Bucket     Expected maximum runtime
Small      10 seconds
Medium     1 minute
Large      5 minutes
Giant      1 hour
Enormous   8 hours

By measuring the system as it is today, you can understand how long these requests typically take to run. As an example, consider a system for processing video uploads. If the video is very long, the processing time should be expected to take longer. We can use the length of the video in seconds to categorize this work into a bucket, as the following table shows. The table records the number of requests per bucket as well as various percentiles for runtime distribution over the course of a week.

Video length   Requests measured in one week   10th percentile    90th percentile   99.95th percentile
Small          0                               -                  -                 -
Medium         1.9 million                     864 milliseconds   17 seconds        86 seconds
Large          25 million                      1.8 seconds        52 seconds        9.6 minutes
Giant          4.3 million                     2 seconds          43 seconds        23.8 minutes
Enormous       81,000                          36 seconds         1.2 minutes       41 minutes

From such analysis, you can derive a few parameters for alerting:

  • fast_typical: At most, 10% of requests are faster than this time. If too many requests are faster than this time, your targets might be wrong, or something about your system might have changed.
  • slow_typical: At least 90% of requests are faster than this time. This limit drives your main latency SLO. This parameter indicates whether most of the requests are fast enough.
  • slow_tail: At least 99.95% of requests are faster than this time. This limit ensures that there aren't too many slow requests.
  • deadline: The point at which a user RPC or background processing times out and fails (a limit typically already hard-coded into the system). These requests won't actually be slow but will have actually failed with an error and instead count against your availability SLO.

A guideline in defining buckets is to keep a bucket's fast_typical, slow_typical, and slow_tail within an order of magnitude of each other. This guideline ensures that you don't have too broad of a bucket. We recommend that you don't attempt to prevent overlap or gaps between the buckets.

Bucket     fast_typical       slow_typical   slow_tail               deadline
Small      100 milliseconds   1 second       10 seconds              30 seconds
Medium     600 milliseconds   6 seconds      60 seconds (1 minute)   300 seconds
Large      3 seconds          30 seconds     300 seconds (5 minutes) 10 minutes
Giant      30 seconds         6 minutes      60 minutes (1 hour)     3 hours
Enormous   5 minutes          50 minutes     500 minutes (8 hours)   12 hours

This results in a rule like api.method: SMALL => [1s, 10s]. In this case, the SLO tracking system would see a request, determine its bucket (perhaps by analyzing its method name or URI and comparing the name to a lookup table), then update the statistic based on the runtime of that request. If the request took 700 milliseconds, it is within the slow_typical target. If it took 3 seconds, it is within slow_tail. If it took 22 seconds, it is beyond slow_tail, but not yet an error.
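A minimal Python sketch of such a tracking rule follows. The limits for the Small bucket come from the preceding table; the METHOD_BUCKETS lookup table, the classify function, and the category labels are hypothetical names used only for this illustration.

```python
# Limits in seconds, taken from the Small row of the preceding table.
BUCKETS = {
    "SMALL": {"slow_typical": 1.0, "slow_tail": 10.0, "deadline": 30.0},
}

# Hypothetical lookup table that maps a method name to its bucket.
METHOD_BUCKETS = {"api.method": "SMALL"}


def classify(method: str, runtime_seconds: float) -> str:
    """Place one request into a latency category for its bucket."""
    limits = BUCKETS[METHOD_BUCKETS[method]]
    if runtime_seconds <= limits["slow_typical"]:
        return "typical"   # counts as good for the typical latency SLO
    if runtime_seconds <= limits["slow_tail"]:
        return "tail"      # too slow for typical, still good for tail
    if runtime_seconds < limits["deadline"]:
        return "too_slow"  # beyond slow_tail, but not yet an error
    return "error"         # timed out; counts against availability


for runtime in (0.7, 3, 22, 31):
    print(f"{runtime} s -> {classify('api.method', runtime)}")
```

The three runtimes from the paragraph above (700 milliseconds, 3 seconds, and 22 seconds) land in the typical, tail, and too_slow categories respectively.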

In terms of user happiness, you can think of missing tail latency as equivalent to being unavailable. (That is, the response is so slow that it should be considered a failure.) Due to this, we suggest using the same percentage that you use for availability, for example:

99.95% of all requests are satisfied within 10 seconds.

What you consider typical latency is up to you. Some teams within Google consider 90% to be a good target. This is related to your analysis and how you chose durations for slow_typical. For example:

90% of all requests are handled within 1 second.

Suggested alerts

Given these guidelines, the following table includes a suggested baseline set of SLO alerts.

SLOs                         Measurement window   Burn rate                                Action
Availability, fast burn      1-hour window        Less than 24 hours to SLO violation      Page someone
Typical latency
Tail latency

Availability, slow burn      7-day window         Greater than 24 hours to SLO violation   Create a ticket
Typical latency, slow burn
Tail latency, slow burn

SLO alerting is a skill that can take time to develop. The durations in this section are suggestions; you can adjust these according to your own needs and level of precision. Tying your alerts to the measurement window or error budget expenditure might be helpful, or you might add another layer of alerting between fast burns and slow burns.

Build observability into your infrastructure and applications

This document in the Google Cloud Architecture Framework provides best practices to add observability into your services so that you can better understand your service performance and quickly identify issues. Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.

Monitoring is at the base of the service reliability hierarchy in the Google SRE Handbook. Without proper monitoring, you can't tell whether an application works correctly.

Instrument your code to maximize observability

A well-designed system aims to have the right amount of observability that starts in its development phase. Don't wait until an application is in production before you start to observe it. Instrument your code and consider the following guidance:

  • To debug and troubleshoot efficiently, think about what log and trace entries to write out, and what metrics to monitor and export. Prioritize by the most likely or frequent failure modes of the system.
  • Periodically audit and prune your monitoring. Delete unused or useless dashboards, graphs, alerts, tracing, and logging to eliminate clutter.

Google Cloud Observability provides real-time monitoring, hybrid multi-cloud monitoring and logging (such as for AWS and Azure), plus tracing, profiling, and debugging. Google Cloud Observability can also auto-discover and monitor microservices running on App Engine or in a service mesh like Istio.

If you generate lots of application data, you can optimize large-scale ingestion of analytics events logs with BigQuery. BigQuery is also suitable for persisting and analyzing high-cardinality timeseries data from your monitoring framework. This approach is useful because it lets you run arbitrary queries at a lower cost rather than trying to design your monitoring perfectly from the start, and decouples reporting from monitoring. You can create reports from the data using Looker Studio or Looker.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

  • Implement monitoring early, such as before you initiate a migration or before you deploy a new application to a production environment.
  • Disambiguate between application issues and underlying cloud issues. Use the Monitoring API, or other Cloud Monitoring products and the Google Cloud Status Dashboard.
  • Define an observability strategy beyond monitoring that includes tracing, profiling, and debugging.
  • Regularly clean up observability artifacts that you don't use or that don't provide value, such as unactionable alerts.
  • If you generate large amounts of observability data, send application events to a data warehouse system such as BigQuery.

Design for scale and high availability

This document in the Google Cloud Architecture Framework provides design principles to architect your services so that they can tolerate failures and scale in response to customer demand. A reliable service continues to respond to customer requests when there's a high demand on the service or when there's a maintenance event. The following reliability design principles and best practices should be part of your system architecture and deployment plan.

Create redundancy for higher availability

Systems with high reliability needs must have no single points of failure, and their resources must be replicated across multiple failure domains. A failure domain is a pool of resources that can fail independently, such as a VM instance, zone, or region. When you replicate across failure domains, you get a higher aggregate level of availability than individual instances could achieve. For more information, see Regions and zones.

As a specific example of redundancy that might be part of your system architecture, to isolate failures in DNS registration to individual zones, use zonal DNS names for instances on the same network to access each other.

Design a multi-zone architecture with failover for high availability

Make your application resilient to zonal failures by architecting it to use pools of resources distributed across multiple zones, with data replication, load balancing and automated failover between zones. Run zonal replicas of every layer of the application stack, and eliminate all cross-zone dependencies in the architecture.

Replicate data across regions for disaster recovery

Replicate or archive data to a remote region to enable disaster recovery in the event of a regional outage or data loss. When replication is used, recovery is quicker because storage systems in the remote region already have data that is almost up to date, aside from the possible loss of a small amount of data due to replication delay. When you use periodic archiving instead of continuous replication, disaster recovery involves restoring data from backups or archives in a new region. This procedure usually results in longer service downtime than activating a continuously updated database replica and could involve more data loss due to the time gap between consecutive backup operations. Whichever approach is used, the entire application stack must be redeployed and started up in the new region, and the service will be unavailable while this is happening.

For a detailed discussion of disaster recovery concepts and techniques, see Architecting disaster recovery for cloud infrastructure outages.

Design a multi-region architecture for resilience to regional outages

If your service needs to run continuously even in the rare case when an entire region fails, design it to use pools of compute resources distributed across different regions. Run regional replicas of every layer of the application stack.

Use data replication across regions and automatic failover when a region goes down. Some Google Cloud services have multi-regional variants, such as Spanner. To be resilient against regional failures, use these multi-regional services in your design where possible. For more information on regions and service availability, see Google Cloud locations.

Make sure that there are no cross-region dependencies so that the breadth of impact of a region-level failure is limited to that region.

Eliminate regional single points of failure, such as a single-region primary database that might cause a global outage when it is unreachable. Note that multi-region architectures often cost more, so consider the business need versus the cost before you adopt this approach.

For further guidance on implementing redundancy across failure domains, see the survey paper Deployment Archetypes for Cloud Applications (PDF).

Eliminate scalability bottlenecks

Identify system components that can't grow beyond the resource limits of a single VM or a single zone. Some applications scale vertically, where you add more CPU cores, memory, or network bandwidth on a single VM instance to handle the increase in load. These applications have hard limits on their scalability, and you must often manually configure them to handle growth.

If possible, redesign these components to scale horizontally such as with sharding, or partitioning, across VMs or zones. To handle growth in traffic or usage, you add more shards. Use standard VM types that can be added automatically to handle increases in per-shard load. For more information, see Patterns for scalable and resilient apps.

If you can't redesign the application, you can replace components managed by you with fully managed cloud services that are designed to scale horizontally with no user action.

Degrade service levels gracefully when overloaded

Design your services to tolerate overload. Services should detect overload and return lower quality responses to the user or partially drop traffic, not fail completely under overload.

For example, a service can respond to user requests with static web pages and temporarily disable dynamic behavior that's more expensive to process. Or, the service can allow read-only operations and temporarily disable data updates.

Operators should be notified to correct the error condition when a service degrades.

Prevent and mitigate traffic spikes

Don't synchronize requests across clients. Too many clients that send traffic at the same instant cause traffic spikes that might lead to cascading failures.

Implement spike mitigation strategies on the server side such as throttling, queueing, load shedding or circuit breaking, graceful degradation, and prioritizing critical requests.

Mitigation strategies on the client include client-side throttling and exponential backoff with jitter.
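As a minimal sketch of client-side exponential backoff with jitter, the following Python retries a failing call and waits a random time between zero and an exponentially growing ceiling before each retry. The function name, parameters, and defaults are assumptions for illustration. Retrying on every exception is a simplification; a real client would retry only on errors known to be transient and only for operations that are safe to retry.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(operation: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 0.1,
                      max_delay: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # The ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # A random wait spreads out clients that failed at the same instant.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


calls = {"count": 0}


def flaky() -> str:
    """Stand-in operation that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # record waits instead of sleeping
print(result, len(delays))
```

The randomized wait is what keeps retries from different clients from arriving in synchronized waves, which is the spike pattern this section warns about.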

Sanitize and validate inputs

To prevent erroneous, random, or malicious inputs that cause service outages or security breaches, sanitize and validate input parameters for APIs and operational tools. For example, Apigee and Google Cloud Armor can help protect against injection attacks.

Regularly use fuzz testing where a test harness intentionally calls APIs with random, empty, or too-large inputs. Conduct these tests in an isolated test environment.
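The following Python sketch shows the shape of a very small fuzz harness. The parse_quantity function stands in for an API under test and is entirely hypothetical; dedicated fuzzing tools generate far better inputs than this random generator.

```python
import random
import string


def parse_quantity(text: str) -> int:
    """Hypothetical API under test: accept an integer from 1 to 1000."""
    if not isinstance(text, str) or len(text) > 16:
        raise ValueError("input must be a short string")
    value = int(text.strip())  # int() raises ValueError on non-numeric text
    if not 1 <= value <= 1000:
        raise ValueError("quantity out of range")
    return value


def fuzz(target, iterations: int = 200, seed: int = 0) -> list:
    """Call `target` with random, empty, and oversized inputs.

    Returns the inputs that raised anything other than ValueError, which
    this sketch treats as the only acceptable way to reject input.
    """
    rng = random.Random(seed)  # seeded so that failures are reproducible
    crashes = []
    for _ in range(iterations):
        length = rng.choice([0, 1, 5, 16, 17, 5000])
        candidate = "".join(rng.choices(string.printable, k=length))
        try:
            target(candidate)
        except ValueError:
            pass  # clean rejection
        except Exception as err:
            crashes.append((candidate, err))
    return crashes
```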

Operational tools should automatically validate configuration changes before the changes roll out, and should reject changes if validation fails.

Fail safe in a way that preserves function

If there's a failure due to a problem, the system components should fail in a way that allows the overall system to continue to function. These problems might be a software bug, bad input or configuration, an unplanned instance outage, or human error. The kind of data or traffic that your services process helps to determine whether a failing component should be overly permissive or overly simplistic, rather than overly restrictive.

Consider the following example scenarios and how to respond to failure:

  • It's usually better for a firewall component with a bad or empty configuration to fail open and allow unauthorized network traffic to pass through for a short period of time while the operator fixes the error. This behavior keeps the service available, whereas failing closed would block 100% of traffic. The service must rely on authentication and authorization checks deeper in the application stack to protect sensitive areas while all traffic passes through.
  • However, it's better for a permissions server component that controls access to user data to fail closed and block all access. This behavior causes a service outage when the configuration is corrupt, but avoids the risk of a leak of confidential user data if it fails open.

In both cases, the failure should raise a high priority alert so that an operator can fix the error condition. Service components should err on the side of failing open unless it poses extreme risks to the business.
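The two scenarios can be sketched in Python as follows. The function names, the rule and ACL formats, and the alerts list (a stand-in for paging an operator) are hypothetical.

```python
def alert(message: str, alerts: list) -> None:
    alerts.append(message)  # stand-in for a high-priority operator alert


def firewall_allows(packet: dict, rules, alerts: list) -> bool:
    """Fail open: with a bad or empty rule set, pass traffic and alert."""
    if not rules:
        alert("firewall: empty config, failing OPEN", alerts)
        return True
    return any(rule(packet) for rule in rules)


def permission_granted(user: str, resource: str, acl, alerts: list) -> bool:
    """Fail closed: with a corrupt ACL, deny all access and alert."""
    if not isinstance(acl, dict):
        alert("permissions: corrupt ACL, failing CLOSED", alerts)
        return False
    return resource in acl.get(user, set())
```

In both functions the failure path raises an alert, so the degraded state is visible rather than silent.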

Design API calls and operational commands to be retryable

APIs and operational tools must make invocations retry-safe as far as possible. A natural approach to many error conditions is to retry the previous action, but you might not know whether the first try was successful.

Your system architecture should make actions idempotent: if you perform the identical action on an object two or more times in succession, it should produce the same results as a single invocation. Non-idempotent actions require more complex code to avoid a corruption of the system state.
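One common way to make a non-idempotent action retry-safe is to have the client attach a unique request ID and have the server remember the result of each ID. The following Python sketch assumes a hypothetical AccountService with an in-memory result store; a real service would persist the IDs and expire them.

```python
class AccountService:
    """Hypothetical service that deduplicates requests by request ID."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result of the first execution

    def deposit(self, request_id: str, amount: int) -> int:
        # A retry with the same request_id returns the stored result
        # instead of applying the deposit a second time.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

With this design, a client that didn't see a response can resend the same request without knowing whether the first try succeeded.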

Identify and manage service dependencies

Service designers and owners must maintain a complete list of dependencies on other system components. The service design must also include recovery from dependency failures, or graceful degradation if full recovery is not feasible. Take account of dependencies on cloud services used by your system and external dependencies, such as third party service APIs, recognizing that every system dependency has a non-zero failure rate.

When you set reliability targets, recognize that the SLO for a service is mathematically constrained by the SLOs of all its critical dependencies. You can't be more reliable than the lowest SLO of one of the dependencies. For more information, see the calculus of service availability.
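To make the constraint concrete, the following Python sketch estimates an upper bound on availability under two simplifying assumptions that are not stated in this document: every listed dependency is critical, and dependencies fail independently. Under those assumptions the bound is the product of the dependency availabilities, which is lower than the lowest individual SLO.

```python
import math


def max_availability(dependency_slos: list[float]) -> float:
    """Upper bound when all dependencies are critical and fail independently."""
    return math.prod(dependency_slos)


def downtime_minutes_per_30_days(availability: float) -> float:
    """Convert an availability fraction into allowed downtime per 30 days."""
    return (1 - availability) * 30 * 24 * 60


if __name__ == "__main__":
    slos = [0.999, 0.999, 0.9995]  # illustrative dependency SLOs
    bound = max_availability(slos)
    print(f"upper bound: {bound:.5f}")
    print(f"downtime budget: {downtime_minutes_per_30_days(bound):.1f} min")
```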

Startup dependencies

Services behave differently when they start up compared to their steady-state behavior. Startup dependencies can differ significantly from steady-state runtime dependencies.

For example, at startup, a service may need to load user or account information from a user metadata service that it rarely invokes again. When many service replicas restart after a crash or routine maintenance, the replicas can sharply increase load on startup dependencies, especially when caches are empty and need to be repopulated.

Test service startup under load, and provision startup dependencies accordingly. Consider a design to gracefully degrade by saving a copy of the data it retrieves from critical startup dependencies. This behavior allows your service to restart with potentially stale data rather than being unable to start when a critical dependency has an outage. Your service can later load fresh data, when feasible, to revert to normal operation.
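The following Python sketch shows one way to save a copy of startup data and fall back to it. The function name, the JSON file format, and the fetch callable are assumptions for illustration.

```python
import json
import os
import tempfile


def load_startup_data(fetch, cache_path: str) -> tuple[dict, bool]:
    """Return (data, is_stale). Prefer fresh data; fall back to the saved copy.

    If the dependency is down and no saved copy exists, the file error
    propagates and the service can't start.
    """
    try:
        data = fetch()
    except Exception:
        # Dependency outage: restart with the last saved, possibly stale copy.
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f), True
    # Write to a temporary file and rename it, so that a crash during the
    # write can't leave a truncated cache file behind.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(data, f)
    os.replace(tmp_path, cache_path)
    return data, False
```

The is_stale flag lets the service schedule a later refresh so that it can revert to normal operation when the dependency recovers.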

Startup dependencies are also important when you bootstrap a service in a new environment. Design your application stack with a layered architecture, with no cyclic dependencies between layers. Cyclic dependencies may seem tolerable because they don't block incremental changes to a single application. However, cyclic dependencies can make it difficult or impossible to restart after a disaster takes down the entire service stack.
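A dependency list can be checked for cycles automatically. The following Python sketch uses the standard library graphlib module (Python 3.9 and later); the service names and the dictionary format, which maps each service to the services it needs at startup, are illustrative.

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which services can start, dependencies first.

    Raises ValueError when the dependency graph contains a cycle.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # The second element of err.args lists the nodes that form the cycle.
        raise ValueError(f"cyclic dependency: {err.args[1]}") from err


if __name__ == "__main__":
    print(startup_order({
        "frontend": {"api"},
        "api": {"database"},
        "database": set(),
    }))
```

Running a check like this in CI is one way to catch a new cyclic dependency before it reaches production.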

Minimize critical dependencies

Minimize the number of critical dependencies for your service, that is, other components whose failure will inevitably cause outages for your service. To make your service more resilient to failures or slowness in other components it depends on, consider the following example design techniques and principles to convert critical dependencies into non-critical dependencies:

  • Increase the level of redundancy in critical dependencies. Adding more replicas makes it less likely that an entire component will be unavailable.
  • Use asynchronous requests to other services instead of blocking on a response or use publish/subscribe messaging to decouple requests from responses.
  • Cache responses from other services to recover from short-term unavailability of dependencies.

To render failures or slowness in your service less harmful to other components that depend on it, consider the following example design techniques and principles:

  • Use prioritized request queues and give higher priority to requests where a user is waiting for a response.
  • Serve responses out of a cache to reduce latency and load.
  • Fail safe in a way that preserves function.
  • Degrade gracefully when there's a traffic overload.
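As an illustration of the first technique in the preceding list, the following Python sketch implements a prioritized request queue with the standard library heapq module. The two priority levels and the class name are hypothetical.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number is served first


class PriorityRequestQueue:
    """Serve interactive requests before batch requests, FIFO within a level."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority requests stay in order
        # and the request objects themselves are never compared.
        self._counter = itertools.count()

    def put(self, priority: int, request) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```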

Ensure that every change can be rolled back

If there's no well-defined way to undo certain types of changes to a service, change the design of the service to support rollback. Test the rollback processes periodically. APIs for every component or microservice must be versioned, with backward compatibility such that the previous generations of clients continue to work correctly as the API evolves. This design principle is essential to permit progressive rollout of API changes, with rapid rollback when necessary.

Rollback can be costly to implement for mobile applications. Firebase Remote Config is a Google Cloud service to make feature rollback easier.

You can't readily roll back database schema changes, so execute them in multiple phases. Design each phase to allow safe schema read and update requests by the latest version of your application, and the prior version. This design approach lets you safely roll back if there's a problem with the latest version.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

  • Implement exponential backoff with randomization in the error retry logic of client applications.
  • Implement a multi-region architecture with automatic failover for high availability.
  • Use load balancing to distribute user requests across shards and regions.
  • Design the application to degrade gracefully under overload. Serve partial responses or provide limited functionality rather than failing completely.
  • Establish a data-driven process for capacity planning, and use load tests and traffic forecasts to determine when to provision resources.
  • Establish disaster recovery procedures and test them periodically.

Create reliable operational processes and tools

This document in the Google Cloud Architecture Framework provides operational principles to run your service in a reliable manner, such as how to deploy updates, run services in production environments, and test for failures. Architecting for reliability should cover the whole lifecycle of your service, not just software design.

Choose good names for applications and services

Avoid using internal code names in production configuration files, because they can be confusing, particularly to newer employees, potentially increasing time to mitigate (TTM) for outages. As much as possible, choose good names for all of your applications, services, and critical system resources such as VMs, clusters, and database instances, subject to their respective limits on name length. A good name describes the entity's purpose; is accurate, specific, and distinctive; and is meaningful to anybody who sees it. A good name avoids acronyms, code names, abbreviations, and potentially offensive terminology, and would not create a negative public response even if published externally.

Implement progressive rollouts with canary testing

Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn't perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.

Set up a canary testing system that's aware of service changes and does A/B comparison of the metrics of the changed servers with the remaining servers. The system should flag unexpected or anomalous behavior. If the change doesn't perform as you expect, the canary testing system should automatically halt rollouts. Problems can be clear, such as user errors, or subtle, like CPU usage increase or memory bloat.

It's better to stop and roll back at the first hint of trouble and diagnose issues without the time pressure of an outage. After the change passes canary testing, propagate it to larger scopes gradually, such as to a full zone, then to a second zone. Allow time for the changed system to handle progressively larger volumes of user traffic to expose any latent bugs.
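The comparison that a canary testing system performs can be sketched as follows. The metric names, the thresholds, and the three verdicts are assumptions for illustration; real canary analysis typically uses statistical tests over many metrics.

```python
def canary_verdict(baseline: dict, canary: dict,
                   max_error_ratio: float = 1.5,
                   max_cpu_ratio: float = 1.2,
                   min_requests: int = 1000) -> str:
    """Compare canary metrics with the baseline.

    Returns 'wait', 'halt', or 'proceed'.
    """
    if canary["requests"] < min_requests:
        return "wait"  # not enough traffic for a meaningful comparison
    base_err = baseline["errors"] / baseline["requests"]
    canary_err = canary["errors"] / canary["requests"]
    # The absolute floor keeps a zero-error baseline from making any
    # single canary error fatal.
    if canary_err > max(base_err * max_error_ratio, 0.001):
        return "halt"  # clear problem: more user-visible errors
    if canary["cpu"] > baseline["cpu"] * max_cpu_ratio:
        return "halt"  # subtle problem: resource usage regression
    return "proceed"
```

A rollout controller would call this check at each stage and expand the scope only on "proceed".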

Spread out traffic for timed promotions and launches

You might have promotional events, such as sales that start at a precise time and encourage many users to connect to the service simultaneously. If so, design client code to spread the traffic over a few seconds: have each client wait for a random delay before it initiates requests.
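The effect of a random client-side delay can be sketched in Python as follows. The function names and the five-second window are illustrative assumptions.

```python
import random
from collections import Counter


def start_delay(window_seconds: float, rng=random) -> float:
    """Pick a uniformly random delay so clients don't all connect at once."""
    return rng.uniform(0, window_seconds)


def peak_requests_per_second(delays) -> int:
    """Count how many requests land in the busiest one-second bucket."""
    return max(Counter(int(d) for d in delays).values())


if __name__ == "__main__":
    clients = 10_000
    no_spread = [0.0] * clients
    spread = [start_delay(5.0) for _ in range(clients)]
    print("peak without delay:", peak_requests_per_second(no_spread))
    print("peak with delay:   ", peak_requests_per_second(spread))
```

With a five-second window, the busiest second receives roughly one fifth of the clients instead of all of them.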

You can also pre-warm the system. To pre-warm the system, send it the user traffic that you anticipate ahead of time to verify that it performs as you expect. This approach prevents instantaneous traffic spikes that could crash your servers at the scheduled start time.

Automate build, test, and deployment

Eliminate manual effort from your release process with the use of continuous integration and continuous delivery (CI/CD) pipelines. Perform automated integration testing and deployment. For example, create a modern CI/CD process with GKE.

For more information, see continuous integration, continuous delivery, test automation, and deployment automation.

Design for safety

Design your operational tools to reject potentially invalid configurations. Detect and alert when a configuration version is empty, partial or truncated, corrupt, logically incorrect or unexpected, or not received within the expected time. Tools should also reject configuration versions that differ too much from the previous version.
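A configuration check along these lines can be sketched in Python. The required keys, the flat dictionary format, and the 20% change threshold are hypothetical choices.

```python
def validate_config(new: dict, previous: dict,
                    max_change_fraction: float = 0.2) -> list[str]:
    """Return a list of reasons to reject `new`; an empty list means accept."""
    if not new:
        return ["config is empty"]
    problems = []
    required = {"version", "backends"}  # hypothetical required keys
    missing = required - new.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Reject versions that differ too much from the previous version.
    keys = new.keys() | previous.keys()
    changed = sum(1 for k in keys if new.get(k) != previous.get(k))
    if previous and changed / len(keys) > max_change_fraction:
        problems.append(f"{changed} of {len(keys)} keys changed")
    return problems
```

A deployment tool would refuse to roll out a configuration for which this function returns any problems, and alert the operator instead.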

Disallow changes or commands with too broad a scope that are potentially destructive. These broad commands might be to "Revoke permissions for all users", "Restart all VMs in this region", or "Reformat all disks in this zone". Such changes should only be applied if the operator adds emergency override command-line flags or option settings when they deploy the configuration.

Tools must display the breadth of impact of risky commands, such as number of VMs the change impacts, and require explicit operator acknowledgment before the tool proceeds. You can also use features to lock critical resources and prevent their accidental or unauthorized deletion, such as Cloud Storage retention policy locks.
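The following Python sketch combines the two safeguards: it refuses overly broad commands unless an override is set, and it shows the breadth of impact before it asks for acknowledgment. The function signature, the 25% limit, and the confirm callback are assumptions for illustration.

```python
def run_risky_command(targets: list[str], action, *, total: int,
                      confirm, max_fraction: float = 0.25,
                      override: bool = False) -> int:
    """Apply `action` to each target after showing impact and confirming.

    Returns the number of targets changed.
    """
    fraction = len(targets) / total
    if fraction > max_fraction and not override:
        # Too broad: require an explicit emergency override.
        raise PermissionError(
            f"refusing to change {len(targets)} of {total} VMs without override")
    prompt = f"This will affect {len(targets)} of {total} VMs. Proceed?"
    if not confirm(prompt):
        return 0
    for target in targets:
        action(target)
    return len(targets)
```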

Test failure recovery

Regularly test your operational procedures to recover from failures in your service. Without regular tests, your procedures might not work when you need them if there's a real failure. Items to test periodically include regional failover, how to roll back a release, and how to restore data from backups.

Conduct disaster recovery tests

As with failure recovery tests, don't wait for a disaster to strike. Periodically test and verify your disaster recovery procedures and processes.

You might create a system architecture to provide high availability (HA). This architecture doesn't entirely overlap with disaster recovery (DR), but it's often necessary to take HA into account when you think about recovery time objective (RTO) and recovery point objective (RPO) values.

HA helps you to meet or exceed an agreed level of operational performance, such as uptime. When you run production workloads on Google Cloud, you might deploy a passive or active standby instance in a second region. With this architecture, the application continues to provide service from the unaffected region if there's a disaster in the primary region. For more information, see Architecting disaster recovery for cloud outages.

Practice chaos engineering

Consider the use of chaos engineering in your test practices. Introduce actual failures into different components of production systems under load in a safe environment. This approach helps to ensure that there's no overall system impact because your service handles failures correctly at each level.

Failures you inject into the system can include crashing tasks, errors and timeouts on RPCs, or reductions in resource availability. Use random fault injection to test intermittent failures (flapping) in service dependencies. These behaviors are hard to detect and mitigate in production.
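Random fault injection can be as simple as a wrapper around a dependency call. In the following Python sketch, the with_faults helper and the InjectedFault exception are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency error during a chaos test."""


def with_faults(func, failure_rate: float, rng=random):
    """Wrap `func` so that a random fraction of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos test")
        return func(*args, **kwargs)
    return wrapper


if __name__ == "__main__":
    lookup = with_faults(lambda user: {"user": user}, failure_rate=0.3)
    for i in range(5):
        try:
            print(lookup(f"user-{i}"))
        except InjectedFault as err:
            print("fault:", err)
```

Because failures occur on only some calls, this kind of wrapper exercises the intermittent (flapping) case rather than a clean outage.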

Chaos engineering ensures that the fallout from such experiments is minimized and contained. Treat such tests as practice for actual outages and use all of the information collected to improve your outage response.

Build efficient alerts

This document in the Google Cloud Architecture Framework provides operational principles to create alerts that help you run reliable services. The more information you have on how your service performs, the more informed your decisions are when there's an issue. Design your alerts for early and accurate detection of all user-impacting system problems, and minimize false positives.

Optimize the alert delay

There's a balance between alerts that are sent too soon that stress the operations team and alerts that are sent too late and cause long service outages. Tune the alert delay before the monitoring system notifies humans of a problem to minimize time to detect, while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alert configuration.
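The error budget consumption rate, often called the burn rate, can be computed as the observed error ratio divided by the error ratio that the SLO allows. The following Python sketch assumes a 30-day SLO window and a two-window paging rule; the threshold of 14.4 is an illustrative value that corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Budget consumption speed; 1.0 spends the budget exactly over the window."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    return float("inf") if rate <= 0 else window_hours / rate


def should_page(short_window_errors: float, long_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window exceed the threshold.

    The long window filters out brief blips; the short window makes the
    alert stop soon after the problem ends.
    """
    return (burn_rate(short_window_errors, slo) > threshold
            and burn_rate(long_window_errors, slo) > threshold)
```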

Alert on symptoms rather than causes

Trigger alerts based on the direct impact to user experience. Noncompliance with global or per-customer SLOs indicates a direct impact. Don't alert on every possible root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.

Alert on outlier values rather than averages

When monitoring latency, define SLOs and set alerts for two of the following three values: 90th, 95th, or 99th percentile latency. Don't alert on average or 50th percentile latency. Good mean or median latency values can hide unacceptably high values at the 90th percentile or above that cause very bad user experiences. Therefore, apply this principle of alerting on outlier values when monitoring latency for any critical operation, such as a request-response interaction with a webserver, batch completion in a data processing pipeline, or a read or write operation on a storage service.
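The following Python sketch shows how a median can hide slow requests. The sample data is invented for illustration, and the percentile function uses the nearest-rank definition, which is one of several common definitions.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 90 fast requests and 10 very slow ones, in milliseconds.
latencies_ms = [100.0] * 90 + [5000.0] * 10

if __name__ == "__main__":
    print("median:", statistics.median(latencies_ms))
    print("mean:  ", statistics.mean(latencies_ms))
    print("p95:   ", percentile(latencies_ms, 95))
    print("p99:   ", percentile(latencies_ms, 99))
```

In this data set the median looks healthy even though one request in ten takes five seconds, which only the high percentiles reveal.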

Build a collaborative incident management process

This document in the Google Cloud Architecture Framework provides best practices to manage services and define processes to respond to incidents. Incidents occur in all services, so you need a well-documented process to efficiently respond to these issues and mitigate them.

Incident management overview

It's inevitable that your well-designed system eventually fails to meet its SLOs. In the absence of an SLO, your customers loosely define the acceptable service level themselves, based on their past experience. Customers escalate to your technical support or similar group, regardless of what's in your SLA.

To properly serve your customers, establish and regularly test an incident management plan. The plan can be as short as a single-page checklist with ten items. This process helps your team to reduce time to detect (TTD) and time to mitigate (TTM).

TTM is preferred over TTR, where the R (for repair or recovery) is often used to mean a full fix rather than mitigation. TTM emphasizes fast mitigation to quickly end the customer impact of an outage, and then start the often much longer process to fully fix the problem.

A well-designed system where operations are excellent increases the time between failures (TBF). In other words, operational principles for reliability, including good incident management, aim to make failures less frequent.

To run reliable services, apply the following best practices in your incident management process.

Assign clear service ownership

All services and their critical dependencies must have clear owners responsible for adherence to their SLOs. If there are reorganizations or team attrition, engineering leads must ensure that ownership is explicitly handed off to a new team, along with the documentation and training as required. The owners of a service must be easily discoverable by other teams.

Reduce time to detect (TTD) with well-tuned alerts

Before you can reduce TTD, review and implement the recommendations in the "Build observability into your infrastructure and applications" and "Define your reliability goals" sections. For example, disambiguate between application issues and underlying cloud issues.

A well-tuned set of SLIs alerts your team at the right time without alert overload. For more information, see Build efficient alerts and Tune up your SLI metrics: CRE life lessons.

Reduce time to mitigate (TTM) with incident management plans and training

To reduce TTM, define a documented and well-exercised incident management plan. Have readily available data on what's changed in the environment. Make sure that teams know generic mitigations they can quickly apply to minimize TTM. These mitigation techniques include draining, rolling back changes, upsizing resources, and degrading quality of service.

As discussed elsewhere in the Architecture Framework, create reliable operational processes and tools to support the safe and rapid rollback of changes.

Design dashboard layouts and content to minimize TTM

Organize your service dashboard layout and navigation so that an operator can determine in a minute or two if the service and all of its critical dependencies are running. To quickly pinpoint potential causes of problems, operators must be able to scan all charts on the dashboard to rapidly look for graphs that change significantly at the time of the alert.

The following list of example graphs might be on your dashboard to help troubleshoot issues. Incident responders should be able to glance at them in a single view:

  • Service level indicators, such as successful requests divided by total valid requests
  • Configuration and binary rollouts
  • Requests per second to the system
  • Error responses per second from the system
  • Requests per second from the system to its dependencies
  • Error responses per second to the system from its dependencies

Other common graphs to help troubleshoot include latency, saturation, request size, response size, query cost, thread pool utilization, and Java virtual machine (JVM) metrics (where applicable). Saturation refers to fullness by some limit such as quota or system memory size. Thread pool utilization looks for regressions due to pool exhaustion.

Test the placement of these graphs against a few outage scenarios to ensure that the most important graphs are near the top, and that the order of the graphs matches your standard diagnostic workflow. You can also apply machine learning and statistical anomaly detection to surface the right subset of these graphs.

Document diagnostic procedures and mitigation for known outage scenarios

Write playbooks and link to them from alert notifications. If these documents are accessible from the alert notifications, operators can quickly get the information they need to troubleshoot and mitigate problems.

Use blameless postmortems to learn from outages and prevent recurrences

Establish a blameless postmortem culture and an incident review process. Blameless means that your team evaluates and documents what went wrong in an objective manner, without the need to assign blame.

Mistakes are opportunities to learn, not a cause for criticism. Always aim to make the system more resilient so that it can recover quickly from human error, or even better, detect and prevent human error. Extract as much learning as possible from each postmortem and follow up diligently on each postmortem action item in order to make outages less frequent, thereby increasing TBF.

Incident management plan example

Production issues have been detected, such as through an alert or page, or escalated to me:

  • Should I delegate to someone else?
    • Yes, if you and your team can't resolve the issue.
  • Is this issue a privacy or security breach?
    • If yes, delegate to the privacy or security team.
  • Is this issue an emergency or are SLOs at risk?
    • If in doubt, treat it as an emergency.
  • Should I involve more people?
    • Yes, if it impacts more than X% of customers or if it takes more than Y minutes to resolve. If in doubt, always involve more people, especially within business hours.
  • Define a primary communications channel, such as IRC, Hangouts Chat, or Slack.
  • Delegate previously defined roles, such as the following:
    • Incident commander who is responsible for overall coordination.
    • Communications lead who is responsible for internal and external communications.
    • Operations lead who is responsible to mitigate the issue.
  • Define when the incident is over. This decision might require an acknowledgment from a support representative or other similar teams.
  • Collaborate on the blameless postmortem.
  • Attend a postmortem incident review meeting to discuss and staff action items.

Google Cloud Architecture Framework: Product reliability guides

This section in the Architecture Framework has product-specific best practice guidance for reliability, availability, and consistency of some Google Cloud products. The guides also provide recommendations for minimizing and recovering from failures and for scaling your applications well under load.

The product reliability guides are organized under the following areas:

Compute Engine reliability guide

Compute Engine is a customizable compute service that enables users to create and run virtual machines on demand on Google's infrastructure.

Cloud Run reliability guide

Cloud Run is a managed compute platform suitable for deploying containerized applications, and is serverless. Cloud Run abstracts away all infrastructure so users can focus on building applications.

Best practices

  • Cloud Run general tips - how to implement a Cloud Run service, start containers quickly, use global variables, and improve container security.
  • Load testing best practices - how to load test Cloud Run services, including addressing concurrency problems before load testing, managing the maximum number of instances, choosing the best region for load testing, and ensuring services scale with load.
  • Instance scaling - how to scale and limit container instances and minimize response time by keeping some instances idle instead of stopping them.
  • Using minimum instances - specify the least number of container instances ready to serve, and when set appropriately high, minimize average response time by reducing the number of cold starts.
  • Optimizing Java applications for Cloud Run - understand the tradeoffs of some optimizations for Cloud Run services written in Java, and reduce startup time and memory usage.
  • Optimizing Python applications for Cloud Run - optimize the container image by improving efficiency of the WSGI server, and optimize applications by reducing the number of threads and executing startup tasks in parallel.

Cloud Run functions reliability guide

Cloud Run functions is a scalable, event-driven, serverless platform to help build and connect services. Cloud Run functions can be called via HTTP request or triggered based on background events.

Google Kubernetes Engine reliability guide

Google Kubernetes Engine (GKE) is a system for operating containerized applications in the cloud, at scale. GKE deploys, manages, and provisions resources for your containerized applications. The GKE environment consists of Compute Engine instances grouped together to form a cluster.

Best practices

  • Best practices for Google Kubernetes Engine networking - use VPC-native clusters for easier scaling, plan IP addresses, scale cluster connectivity, use Google Cloud Armor to block Distributed Denial-of-Service (DDoS) attacks, implement container-native load balancing for lower latency, use the health check functionality of external Application Load Balancers for graceful failover, and use regional clusters to increase the availability of applications in a cluster.
  • Prepare cloud-based Kubernetes applications - learn the best practices to plan for application capacity, grow applications horizontally or vertically, set resource limits relative to resource requests for memory versus CPU, make containers lean for faster application startup, and limit Pod disruption by setting a Pod Disruption Budget (PDB). Also, understand how to set up liveness probes and readiness probes for graceful application startup, ensure non-disruptive shutdowns, and implement exponential backoff on retried requests to prevent traffic spikes that overwhelm your application.
  • GKE multi-tenancy best practices - how to design a multi-tenant cluster architecture for high availability and reliability, use Google Kubernetes Engine (GKE) usage metering for per-tenant usage metrics, provide tenant-specific logs, and provide tenant-specific monitoring.

Cloud Storage reliability guide

Cloud Storage is a durable and highly available object repository with advanced security and sharing capabilities. This service is used for storing and accessing data on Google Cloud infrastructure.

Best practices

  • Best practices for Cloud Storage - general best practices for Cloud Storage, including tips to maximize availability and minimize latency of your applications, improve the reliability of upload operations, and improve the performance of large-scale data deletions.
  • Request rate and access distribution guidelines - how to minimize latency and error responses on read, write, and delete operations at very high request rates by understanding how Cloud Storage auto-scaling works.

Firestore reliability guide

Firestore is a NoSQL document database that lets you store, sync, and query data for your mobile and web applications, at global scale.

Best practices

  • Firestore best practices - how to select your database location for increased reliability, minimize performance pitfalls in indexing, improve the performance of read and write operations, reduce latency for change notifications, and design for scale.

Bigtable reliability guide

Bigtable is a fully managed, scalable, NoSQL database for large analytical and operational workloads. It is designed as a sparsely populated table that can scale to billions of rows and thousands of columns, and supports high read and write throughput at low latency.

Best practices

  • Understand Bigtable performance - estimating throughput for Bigtable, how to plan Bigtable capacity by looking at throughput and storage use, how enabling replication affects read and write throughput differently, and how Bigtable optimizes data over time.
  • Bigtable schema design - guidance on designing Bigtable schema, including concepts of key/value store, designing row keys based on planned read requests, handling columns and rows, and special use cases.
  • Bigtable replication overview - how to replicate Bigtable across multiple zones or regions, understand performance implications of replication, and how Bigtable resolves conflicts and handles failovers.
  • About Bigtable backups - how to save a copy of a table's schema and data with Bigtable Backups, which can help you recover from application-level data corruption or from operator errors, such as accidentally deleting a table.

Cloud SQL reliability guide

Cloud SQL is a fully managed relational database service for MySQL, PostgreSQL, and SQL Server. Cloud SQL easily integrates with existing applications and Google Cloud services such as Google Kubernetes Engine and BigQuery.

Spanner reliability guide

Spanner is a distributed SQL database management and storage service, with features such as global transactions, highly available horizontal scaling, and transactional consistency.

Best practices

  • Spanner backup and restore - key features of Spanner Backup and Restore, comparison of Backup and Restore with Import and Export, implementation details, and how to control access to Spanner resources.
  • Regional and multi-region configurations - description of the two types of instance configurations that Spanner offers: regional configurations and multi-region configurations. The description includes the differences and trade-offs between each configuration.
  • Autoscaling Spanner - introduction to the Autoscaler tool for Spanner (Autoscaler), an open source tool that you can use as a companion tool to Cloud Spanner. This tool lets you automatically increase or reduce the number of nodes or processing units in one or more Spanner instances based on the utilization metrics of each Spanner instance.
  • About point-in-time recovery (PITR) - description of Spanner point-in-time recovery (PITR), a feature that protects against accidental deletion or writes of Spanner data. For example, an operator inadvertently writes data or an application rollout corrupts the database. With PITR, you can recover your data from a point-in-time in the past (up to a maximum of seven days) seamlessly.
  • Spanner best practices - guidance on bulk loading, using Data Manipulation Language (DML), designing schema to avoid hotspots, and SQL best practices.

Filestore reliability guide

Filestore is a managed file storage service for Google Cloud applications, with a filesystem interface and a shared filesystem for data. Filestore offers petabyte-scale online network attached storage (NAS) for Compute Engine and Google Kubernetes Engine instances.

Best practices

  • Filestore performance - performance settings and Compute Engine machine type recommendations, NFS mount options for best performance on Linux client VM instances, and using the fio tool to test performance. Includes recommendations for improved performance across multiple Google Cloud resources.
  • Filestore backups - description of Filestore backups, common use cases, and best practices for creating and using backups.
  • Filestore snapshots - description of Filestore snapshots, common use cases, and best practices for creating and using snapshots.
  • Filestore networking - networking and IP resource requirements needed to use Filestore.

Memorystore reliability guide

Memorystore is a fully-managed, in-memory store that provides a managed version of two open source caching solutions: Redis and Memcached. Memorystore is scalable, and automates complex tasks such as provisioning, replication, failover, and patching.

Best practices

  • Redis general best practices - guidance on exporting Redis Database (RDB) backups, resource-intensive operations, and operations requiring connection retry. Also covers maintenance, memory management, setting up a Serverless VPC Access connector, the private services access connection mode, and monitoring and alerts.
  • Redis memory management best practices - memory management concepts such as instance capacity and Maxmemory configuration; export, scaling, and version upgrade operations; memory management metrics; and how to resolve an out-of-memory condition.
  • Redis exponential backoff - how exponential backoff works, an example algorithm, and how maximum backoff and maximum number of retries work.
  • Memcached best practices - how to design applications for cache misses, connecting directly to nodes' IP addresses, and the Memcached Auto Discovery service. Also, guidance on configuring the max-item-size parameter, balancing clusters, and using Cloud Monitoring to monitor essential metrics.
  • Memcached memory management best practices - configuring memory for a Memcached instance, Reserved Memory configuration, when to increase Reserved Memory, and metrics for memory usage.
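
The retry pattern that the Redis exponential backoff guide covers can be sketched in a few lines of Python. The following example is a minimal illustration and not the algorithm from that guide: the function names, the default base delay and maximum backoff, and the use of ConnectionError as the retryable error are all assumptions made for this sketch.

```python
import random
import time


def backoff_delays(max_retries, base_delay=1.0, max_backoff=32.0, jitter=random.random):
    """Yield the wait time in seconds before each retry attempt."""
    for attempt in range(max_retries):
        # Double the delay on each attempt, add jitter, and cap at max_backoff.
        yield min(base_delay * (2 ** attempt) + jitter(), max_backoff)


def call_with_retry(operation, max_retries=5, sleep=time.sleep, **backoff_options):
    """Call operation(); on ConnectionError, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, **backoff_options)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # Retries are exhausted; surface the last error.
            sleep(delay)
```

The maximum backoff stops the delay from growing without bound, and the maximum number of retries ensures that a persistent failure is eventually reported to the caller instead of being retried forever.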

Cloud DNS reliability guide

Cloud DNS is a low-latency domain name system that helps register, manage, and serve your domains. Cloud DNS scales to large numbers of DNS zones and records, and millions of DNS records can be created and updated via a user interface.

Best practices

  • Cloud DNS best practices - learn how to manage private zones, configure DNS forwarding, and create DNS server policies. Includes guidance on using Cloud DNS in a hybrid environment.

Cloud Load Balancing reliability guide

Cloud Load Balancing is a fully distributed, software-defined, managed service for all your traffic. Cloud Load Balancing also provides seamless autoscaling, Layer 4 and Layer 7 load balancing, and support for features such as IPv6 global load balancing.

Best practices

  • Performance best practices - how to spread load across application instances to deliver optimal performance. Strategies include backend placement in regions closest to traffic, caching, forwarding rule protocol selection, and configuring session affinity.
  • Using load balancing for highly available applications - how to use Cloud Load Balancing with Compute Engine to provide high availability, even during a zonal outage.

Cloud CDN reliability guide

Cloud CDN (Content Delivery Network) is a service that accelerates internet content delivery by using Google's edge network to bring content as close as possible to the user. Cloud CDN helps reduce latency, cost, and load, making it easier to scale services to users.

Best practices

BigQuery reliability guide

BigQuery is Google Cloud's data warehouse platform for storing and analyzing data at scale.

Best practices

  • Introduction to reliability - reliability best practices and introduction to concepts such as availability, durability, and data consistency.
  • Availability and durability - the types of failure domains that can occur in Google Cloud data centers, how BigQuery provides storage redundancy based on data storage location, and why cross-region datasets enhance disaster recovery.
  • Best practices for multi-tenant workloads on BigQuery - common patterns used in multi-tenant data platforms. These patterns include ensuring reliability and isolation for customers of software as a service (SaaS) vendors, important BigQuery quotas and limits for capacity planning, using BigQuery Data Transfer Service to copy relevant datasets into another region, and more.
  • Use Materialized Views - how to use BigQuery Materialized Views for faster queries at lower cost, including querying materialized views, aligning partitions, and understanding smart-tuning (automatic rewriting of queries).

Dataflow reliability guide

Dataflow is a fully managed data processing service that enables fast, simplified, streaming data pipeline development using open source Apache Beam libraries. Dataflow minimizes latency, processing time, and cost through autoscaling and batch processing.

Best practices

Building production-ready data pipelines using Dataflow - a document series on using Dataflow including planning, developing, deploying, and monitoring Dataflow pipelines.

  • Overview - introduction to Dataflow pipelines.
  • Planning - measuring SLOs, understanding the impact of data sources and sinks on pipeline scalability and performance, and taking high availability, disaster recovery, and network performance into account when specifying regions to run your Dataflow jobs.
  • Developing and testing - setting up deployment environments, preventing data loss by using dead letter queues for error handling, and reducing latency and cost by minimizing expensive per-element operations. Also, using batching to reduce performance overhead without overloading external services, unfusing inappropriately fused steps so that the steps are separated for better performance, and running end-to-end tests in preproduction to ensure that the pipeline continues to meet your SLOs and other production requirements.
  • Deploying - continuous integration (CI) and continuous delivery and deployment (CD), with special considerations for deploying new versions of streaming pipelines. Also, an example CI/CD pipeline, and some features for optimizing resource usage. Finally, a discussion of high availability, geographic redundancy, and best practices for pipeline reliability, including regional isolation, use of snapshots, handling job submission errors, and recovering from errors and outages impacting running pipelines.
  • Monitoring - observe service level indicators (SLIs), which are important indicators of pipeline performance, and define and measure service level objectives (SLOs).

Dataproc reliability guide

Dataproc is a fully managed, scalable service for running Apache Hadoop and Spark jobs. With Dataproc, virtual machines can be customized and scaled up and down as needed. Dataproc integrates tightly with Cloud Storage, BigQuery, Bigtable, and other Google Cloud services.

Best practices

  • Dataproc High Availability mode - compare Hadoop High Availability (HA) mode with the default non-HA mode in terms of instance names, Apache ZooKeeper, Hadoop Distributed File System (HDFS), and Yet Another Resource Negotiator (YARN). Also, how to create a high availability cluster.
  • Autoscaling clusters - when to use Dataproc autoscaling, how to create an autoscaling policy, multi-cluster policy usage, reliability best practices for autoscaling configuration, and metrics and logs.
  • Dataproc Enhanced Flexibility Mode (EFM) - examples of using Enhanced Flexibility Mode to minimize job progress delays, advanced configuration such as partitioning and parallelism, and YARN graceful decommissioning on EFM clusters.
  • Graceful decommissioning - using graceful decommissioning to minimize the impact of removing workers from a cluster, how to use this feature with secondary workers, and command examples for graceful decommissioning.
  • Restartable jobs - by using optional settings, you can set jobs to restart on failure to mitigate common types of job failure, including out-of-memory issues and unexpected Compute Engine virtual machine reboots.

Google Cloud Architecture Framework: Cost optimization

The cost optimization pillar in the Google Cloud Architecture Framework describes principles and recommendations to optimize the cost of your workloads in Google Cloud.

The intended audience includes the following:

  • CTOs, CIOs, CFOs, and other executives who are responsible for strategic cost management.
  • Architects, developers, administrators, and operators who make decisions that affect cost at all the stages of an organization's cloud journey.

The cost models for on-premises and cloud workloads differ significantly. On-premises IT costs include capital expenditure (CapEx) and operational expenditure (OpEx). On-premises hardware and software assets are acquired and the acquisition costs are depreciated over the operating life of the assets. In the cloud, the costs for most cloud resources are treated as OpEx, where costs are incurred when the cloud resources are consumed. This fundamental difference underscores the importance of the following core principles of cost optimization.

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization.

The recommendations in the cost optimization pillar of the Architecture Framework are mapped to the following core principles:

  • Align cloud spending with business value: Ensure that your cloud resources deliver measurable business value by aligning IT spending with business objectives.
  • Foster a culture of cost awareness: Ensure that people across your organization consider the cost impact of their decisions and activities, and ensure that they have access to the cost information required to make informed decisions.
  • Optimize resource usage: Provision only the resources that you need, and pay only for the resources that you consume.
  • Optimize continuously: Continuously monitor your cloud resource usage and costs, and proactively make adjustments as needed to optimize your spending. This approach involves identifying and addressing potential cost inefficiencies before they become significant problems.

These principles are closely aligned with the core tenets of cloud FinOps. FinOps is relevant to any organization, regardless of its size or maturity in the cloud. By adopting these principles and following the related recommendations, you can control and optimize costs throughout your journey in the cloud.

Contributors

Author: Nicolas Pintaux | Customer Engineer, Application Modernization Specialist

Align cloud spending with business value

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to align your use of Google Cloud resources with your organization's business goals.

Principle overview

To effectively manage cloud costs, you need to maximize the business value that the cloud resources provide and minimize the total cost of ownership (TCO). When you evaluate the resource options for your cloud workloads, consider not only the cost of provisioning and using the resources, but also the cost of managing them. For example, virtual machines (VMs) on Compute Engine might be a cost-effective option for hosting applications. However, when you consider the overhead to maintain, patch, and scale the VMs, the TCO can increase. On the other hand, serverless services like Cloud Run can offer greater business value. The lower operational overhead lets your team focus on core activities and helps to increase agility.

To ensure that your cloud resources deliver optimal value, evaluate the following factors:

  • Provisioning and usage costs: The expenses incurred when you purchase, provision, or consume resources.
  • Management costs: The recurring expenses for operating and maintaining resources, including tasks like patching, monitoring, and scaling.
  • Indirect costs: The costs that you might incur to manage issues like downtime, data loss, or security breaches.
  • Business impact: The potential benefits from the resources, like increased revenue, improved customer satisfaction, and faster time to market.
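
To make these factors concrete, the following minimal sketch compares two deployment options by TCO and by net value. The class name, field names, and all monthly amounts are hypothetical and chosen only for illustration; they aren't actual prices for any Google Cloud product.

```python
from dataclasses import dataclass


@dataclass
class ResourceOption:
    """One deployment option, with hypothetical monthly amounts."""
    name: str
    provisioning_and_usage: float  # cost to purchase, provision, or consume
    management: float              # patching, monitoring, and scaling effort
    indirect: float                # expected cost of downtime, data loss, breaches
    business_impact: float         # estimated monthly benefit

    def tco(self) -> float:
        return self.provisioning_and_usage + self.management + self.indirect

    def net_value(self) -> float:
        return self.business_impact - self.tco()


# Hypothetical figures: the VM option is cheaper to run but costlier to manage.
options = [
    ResourceOption("Self-managed VMs", 1000.0, 800.0, 200.0, 5000.0),
    ResourceOption("Serverless", 1300.0, 100.0, 100.0, 5000.0),
]
best = max(options, key=ResourceOption.net_value)
print(best.name, best.tco(), best.net_value())
```

In this made-up example, the option with the higher usage cost still has the lower TCO, because its management and indirect costs are smaller.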

By aligning cloud spending with business value, you get the following benefits:

  • Value-driven decisions: Your teams are encouraged to prioritize solutions that deliver the greatest business value and to consider both short-term and long-term cost implications.
  • Informed resource choice: Your teams have the information and knowledge that they need to assess the business value and TCO of various deployment options, so they choose resources that are cost-effective.
  • Cross-team alignment: Cross-functional collaboration between business, finance, and technical teams ensures that cloud decisions are aligned with the overall objectives of the organization.

Recommendations

To align cloud spending with business objectives, consider the following recommendations.

Prioritize managed services and serverless products

Whenever possible, choose managed services and serverless products to reduce operational overhead and maintenance costs. This choice lets your teams concentrate on their core business activities. They can accelerate the delivery of new features and functionalities, and help drive innovation and value.

The following are examples of how you can implement this recommendation:

  • To run PostgreSQL, MySQL, or Microsoft SQL Server databases, use Cloud SQL instead of deploying those databases on VMs.
  • To run and manage Kubernetes clusters, use Google Kubernetes Engine (GKE) Autopilot instead of deploying containers on VMs.
  • For your Apache Hadoop or Apache Spark processing needs, use Dataproc and Dataproc Serverless. Per-second billing can help to achieve significantly lower TCO when compared to on-premises data lakes.

Balance cost efficiency with business agility

Controlling costs and optimizing resource utilization are important goals. However, you must balance these goals with the need for flexible infrastructure that lets you innovate rapidly, respond quickly to changes, and deliver value faster. The following are examples of how you can achieve this balance:

  • Adopt DORA metrics for software delivery performance. Metrics like change failure rate (CFR), time to detect (TTD), and time to restore (TTR) can help to identify and fix bottlenecks in your development and deployment processes. By reducing downtime and accelerating delivery, you can achieve both operational efficiency and business agility.
  • Follow Site Reliability Engineering (SRE) practices to improve operational reliability. SRE's focus on automation, observability, and incident response can lead to reduced downtime, lower recovery time, and higher customer satisfaction. By minimizing downtime and improving operational reliability, you can prevent revenue loss and avoid the need to overprovision resources as a safety net to handle outages.
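
As an illustration of how two of the metrics mentioned above could be calculated, the following sketch computes change failure rate and mean time to restore from simple records. The record shapes (a dictionary with a failed flag, and pairs of start and restore timestamps) are assumptions for this example and don't correspond to any specific tool's data model.

```python
from datetime import datetime, timedelta


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    failed = sum(1 for deployment in deployments if deployment["failed"])
    return failed / len(deployments)


def mean_time_to_restore(incidents):
    """Average duration between the start of an incident and its restoration."""
    if not incidents:
        return timedelta(0)
    total = sum((restored - started for started, restored in incidents), timedelta(0))
    return total / len(incidents)


deployments = [{"failed": False}, {"failed": True}, {"failed": False}, {"failed": False}]
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 2, 13, 30)),
]
print(change_failure_rate(deployments), mean_time_to_restore(incidents))
```

Tracking these values over time, rather than as one-off numbers, is what helps to reveal bottlenecks in delivery and recovery.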

Enable self-service optimization

Encourage a culture of experimentation and exploration by providing your teams with self-service cost optimization tools, observability tools, and resource management platforms. Enable them to provision, manage, and optimize their cloud resources autonomously. This approach helps to foster a sense of ownership, accelerate innovation, and ensure that teams can respond quickly to changing needs while being mindful of cost efficiency.

Adopt and implement FinOps

Adopt FinOps to establish a collaborative environment where everyone is empowered to make informed decisions that balance cost and value. FinOps fosters financial accountability and promotes effective cost optimization in the cloud.

Promote a value-driven and TCO-aware mindset

Encourage your team members to adopt a holistic attitude toward cloud spending, with an emphasis on TCO and not just upfront costs. Use techniques like value stream mapping to visualize and analyze the flow of value through your software delivery process and to identify areas for improvement. Implement unit costing for your applications and services to gain a granular understanding of cost drivers and discover opportunities for cost optimization. For more information, see Maximize business value with cloud FinOps.
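
Unit costing can be as simple as dividing the cost of a service by the number of business units that it served in the same period. The following sketch assumes that you already have monthly cost and usage totals per service; the service names and figures are hypothetical.

```python
def unit_costs(monthly_cost_by_service, monthly_units_by_service):
    """Return the cost per business unit (for example, per transaction) for each service.

    Services with no recorded units map to None, because a unit cost is undefined.
    """
    result = {}
    for service, cost in monthly_cost_by_service.items():
        units = monthly_units_by_service.get(service, 0)
        result[service] = cost / units if units else None
    return result


# Hypothetical monthly totals.
costs = {"checkout": 1200.0, "search": 300.0}
units = {"checkout": 60000, "search": 0}
print(unit_costs(costs, units))
```

A rising unit cost for a service whose traffic is stable is a signal to look for a new cost driver, even when the total bill still looks acceptable.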

Foster a culture of cost awareness

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to promote cost awareness across your organization and ensure that team members have the cost information that they need to make informed decisions.

Conventionally, the responsibility for cost management might be centralized to a few select stakeholders and primarily focused on initial project architecture decisions. However, team members across all cloud user roles (analyst, architect, developer, or administrator) can help to reduce the cost of your resources in Google Cloud. By sharing cost data appropriately, you can empower team members to make cost-effective decisions throughout their development and deployment processes.

Principle overview

Stakeholders across various roles – product owners, developers, deployment engineers, administrators, and financial analysts – need visibility into relevant cost data and its relationship to business value. When provisioning and managing cloud resources, they need the following data:

  • Projected resource costs: Cost estimates at the time of design and deployment.
  • Real-time resource usage costs: Up-to-date cost data that can be used for ongoing monitoring and budget validation.
  • Costs mapped to business metrics: Insights into how cloud spending affects key performance indicators (KPIs), to enable teams to identify cost-effective strategies.

Every individual might not need access to raw cost data. However, promoting cost awareness across all roles is crucial because individual decisions can affect costs.

By promoting cost visibility and ensuring clear ownership of cost management practices, you ensure that everyone is aware of the financial implications of their choices and everyone actively contributes to the organization's cost optimization goals. Whether through a centralized FinOps team or a distributed model, establishing accountability is crucial for effective cost optimization efforts.

Recommendations

To promote cost awareness and ensure that your team members have the cost information that they need to make informed decisions, consider the following recommendations.

Provide organization-wide cost visibility

To achieve organization-wide cost visibility, the teams that are responsible for cost management can take the following actions:

  • Standardize cost calculation and budgeting: Use a consistent method to determine the full costs of cloud resources, after factoring in discounts and shared costs. Establish clear and standardized budgeting processes that align with your organization's goals and enable proactive cost management.
  • Use standardized cost management and visibility tools: Use appropriate tools that provide real-time insights into cloud spending and generate regular (for example, weekly) cost progression snapshots. These tools enable proactive budgeting, forecasting, and identification of optimization opportunities. The tools could be cloud provider tools (like the Google Cloud Billing dashboard), third-party solutions, or open-source solutions like the Cost Attribution solution.
  • Implement a cost allocation system: Allocate a portion of the overall cloud budget to each team or project. Such an allocation gives the teams a sense of ownership over cloud spending and encourages them to make cost-effective decisions within their allocated budget.
  • Promote transparency: Encourage teams to discuss cost implications during the design and decision-making processes. Create a safe and supportive environment for sharing ideas and concerns related to cost optimization. Some organizations use positive reinforcement mechanisms like leaderboards or recognition programs. If your organization has restrictions on sharing raw cost data due to business concerns, explore alternative approaches for sharing cost information and insights. For example, consider sharing aggregated metrics (like the total cost for an environment or feature) or relative metrics (like the average cost per transaction or user).

Understand how cloud resources are billed

Pricing for Google Cloud resources might vary across regions. Some resources are billed monthly at a fixed price, and others might be billed based on usage. To understand how Google Cloud resources are billed, use the Google Cloud pricing calculator and product-specific pricing information (for example, Google Kubernetes Engine (GKE) pricing).

Understand resource-based cost optimization options

For each type of cloud resource that you plan to use, explore strategies to optimize utilization and efficiency. The strategies include rightsizing, autoscaling, and adopting serverless technologies where appropriate. The following are examples of cost optimization options for a few Google Cloud products:

  • Cloud Run lets you configure always-allocated CPUs to handle predictable traffic loads at a fraction of the price of the default allocation method (that is, CPUs allocated only during request processing).
  • You can purchase BigQuery slot commitments to save money on data analysis.
  • GKE provides detailed metrics to help you understand cost optimization options.
  • Understand how network pricing can affect the cost of data transfers and how you can optimize costs for specific networking services. For example, you can reduce the data transfer costs for external Application Load Balancers by using Cloud CDN or Google Cloud Armor. For more information, see Ways to lower external Application Load Balancer costs.

Understand discount-based cost optimization options

Familiarize yourself with the discount programs that Google Cloud offers, such as the following examples:

  • Committed use discounts (CUDs): CUDs are suitable for resources that have predictable and steady usage. CUDs let you get significant reductions in price in exchange for committing to specific resource usage over a period (typically one to three years). You can also use CUD auto-renewal to avoid having to manually repurchase commitments when they expire.
  • Sustained use discounts: For certain Google Cloud products like Compute Engine and GKE, you can get automatic discount credits after continuous resource usage beyond specific duration thresholds.
  • Spot VMs: For fault-tolerant and flexible workloads, Spot VMs can help to reduce your Compute Engine costs. The cost of Spot VMs is significantly lower than regular VMs. However, Compute Engine might preemptively stop or delete Spot VMs to reclaim capacity. Spot VMs are suitable for batch jobs that can tolerate preemption and don't have high availability requirements.
  • Discounts for specific product options: Some managed services like BigQuery offer discounts when you purchase dedicated or autoscaling query processing capacity.

Evaluate and choose the discount options that align with your workload characteristics and usage patterns.
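
To reason about whether a commitment suits a workload, you can compare on-demand and committed costs for your expected usage. The following sketch uses a deliberately simplified model: you pay for the full commitment at a discounted rate, and any usage above the commitment is billed at the on-demand rate. The hourly rate and the 30% discount are hypothetical values, not published Google Cloud prices.

```python
def on_demand_cost(used_hours, hourly_rate):
    """Cost when all usage is billed at the on-demand rate."""
    return used_hours * hourly_rate


def committed_cost(used_hours, committed_hours, hourly_rate, discount):
    """Cost with a commitment: the commitment is charged even when it's underused."""
    commitment_charge = committed_hours * hourly_rate * (1 - discount)
    overage_hours = max(0, used_hours - committed_hours)
    return commitment_charge + overage_hours * hourly_rate


def break_even_utilization(discount):
    """Fraction of the commitment that you must use for it to pay off."""
    return 1 - discount


# Hypothetical inputs: 730 committed hours, 1.00 per hour, 30% discount.
for used in (400, 730):
    print(used, on_demand_cost(used, 1.0), committed_cost(used, 730, 1.0, 0.30))
```

Under these assumptions, the commitment costs less than on-demand pricing only when utilization stays above 70% of the committed amount, which is why commitments suit predictable and steady usage.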

Incorporate cost estimates into architecture blueprints

Encourage teams to develop architecture blueprints that include cost estimates for different deployment options and configurations. This practice empowers teams to compare costs proactively and make informed decisions that align with both technical and financial objectives.

Use a consistent and standard set of labels for all your resources

You can use labels to track costs and to identify and classify resources. Specifically, you can use labels to allocate costs to different projects, departments, or cost centers. Defining a formal labeling policy that aligns with the needs of the main stakeholders in your organization helps to make costs visible more widely. You can also use labels to filter resource cost and usage data based on target audience.

Use automation tools like Terraform to enforce labeling on every resource that is created. To enhance cost visibility and attribution further, you can use the tools provided by the open-source cost attribution solution.
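
Independent of the tool that you use for enforcement, a labeling policy comes down to a check like the one in the following sketch. The required label keys and the resource names are hypothetical examples of a policy, and the input is assumed to be a plain dictionary of resource names to labels, not the output of any specific API.

```python
# Hypothetical labeling policy: every resource must carry these label keys.
REQUIRED_LABELS = ("team", "cost-center", "environment")


def missing_labels(resource_labels, required=REQUIRED_LABELS):
    """Return the required label keys that are absent or empty."""
    return [key for key in required if not resource_labels.get(key)]


def unlabeled_resources(resources, required=REQUIRED_LABELS):
    """Map each non-compliant resource name to its missing label keys."""
    report = {}
    for name, labels in resources.items():
        missing = missing_labels(labels, required)
        if missing:
            report[name] = missing
    return report


inventory = {
    "vm-frontend": {"team": "web", "cost-center": "cc-101", "environment": "prod"},
    "bucket-logs": {"team": "platform", "environment": ""},
}
print(unlabeled_resources(inventory))
```

Running a check like this in a deployment pipeline lets you reject or flag resources before their costs become unattributable.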

Share cost reports with team members

By sharing cost reports with your team members, you empower them to take ownership of their cloud spending. This practice enables cost-effective decision making, continuous cost optimization, and systematic improvements to your cost allocation model.

Cost reports can be of several types, including the following:

  • Periodic cost reports: Regular reports inform teams about their current cloud spending. Conventionally, these reports might be spreadsheet exports. More effective methods include automated emails and specialized dashboards. To ensure that cost reports provide relevant and actionable information without overwhelming recipients with unnecessary detail, the reports must be tailored to the target audiences. Setting up tailored reports is a foundational step toward more real-time and interactive cost visibility and management.
  • Automated notifications: You can configure cost reports to proactively notify relevant stakeholders (for example, through email or chat) about cost anomalies, budget thresholds, or opportunities for cost optimization. By providing timely information directly to those who can act on it, automated alerts encourage prompt action and foster a proactive approach to cost optimization.
  • Google Cloud dashboards: You can use the built-in billing dashboards in Google Cloud to get insights into cost breakdowns and to identify opportunities for cost optimization. Google Cloud also provides FinOps hub to help you monitor savings and get recommendations for cost optimization. An AI engine powers the FinOps hub to recommend cost optimization opportunities for all the resources that are currently deployed. To control access to these recommendations, you can implement role-based access control (RBAC).
  • Custom dashboards: You can create custom dashboards by exporting cost data to an analytics database, like BigQuery. Use a visualization tool like Looker Studio to connect to the analytics database to build interactive reports and enable fine-grained access control through role-based permissions.
  • Multicloud cost reports: For multicloud deployments, you need a unified view of costs across all the cloud providers to ensure comprehensive analysis, budgeting, and optimization. Use tools like BigQuery to centralize and analyze cost data from multiple cloud providers, and use Looker Studio to build team-specific interactive reports.
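
The logic behind an automated budget notification can be sketched as follows. The threshold values, the function names, and the message format are assumptions for illustration; in practice, the spend figure would come from your billing data, and the message would be sent through your email or chat system.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.9, 1.0)):
    """Return the budget thresholds (as fractions) that the spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [threshold for threshold in thresholds if ratio >= threshold]


def notification_message(team, spend, budget, thresholds=(0.5, 0.9, 1.0)):
    """Build a short alert for the highest crossed threshold, or None if none is crossed."""
    crossed = crossed_thresholds(spend, budget, thresholds)
    if not crossed:
        return None
    highest = max(crossed)
    return f"{team}: spend {spend:.2f} has reached {highest:.0%} of the {budget:.2f} budget"


print(notification_message("payments", 950, 1000))
```

Reporting only the highest crossed threshold keeps the alert short, which helps recipients act on it instead of filtering it out.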

Optimize resource usage

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you plan and provision resources to match the requirements and consumption patterns of your cloud workloads.

Principle overview

To optimize the cost of your cloud resources, you need to thoroughly understand your workloads' resource requirements and load patterns. This understanding is the basis for a well-defined cost model that lets you forecast the total cost of ownership (TCO) and identify cost drivers throughout your cloud adoption journey. By proactively analyzing and forecasting cloud spending, you can make informed choices about resource provisioning, utilization, and cost optimization. This approach lets you control cloud spending, avoid overprovisioning, and ensure that cloud resources are aligned with the dynamic needs of your workloads and environments.

Recommendations

To effectively optimize cloud resource usage, consider the following recommendations.

Choose environment-specific resources

Each deployment environment has different requirements for availability, reliability, and scalability. For example, developers might prefer an environment that lets them rapidly deploy and run applications for short durations, but might not need high availability. On the other hand, a production environment typically needs high availability. To maximize the utilization of your resources, define environment-specific requirements based on your business needs. The following table lists examples of environment-specific requirements.

Environment Requirements
Production
  • High availability
  • Predictable performance
  • Operational stability
  • Security with robust resources
Development and testing
  • Cost efficiency
  • Flexible infrastructure with burstable capacity
  • Ephemeral infrastructure when data persistence is not necessary
Other environments (like staging and QA)
  • Tailored resource allocation based on environment-specific requirements

Choose workload-specific resources

Each of your cloud workloads might have different requirements for availability, scalability, security, and performance. To optimize costs, you need to align resource choices with the specific requirements of each workload. For example, a stateless application might not require the same level of availability or reliability as a stateful backend. The following table lists more examples of workload-specific requirements.

Workload type | Workload requirements | Resource options
Mission-critical | Continuous availability, robust security, and high performance | Premium resources and managed services like Spanner for high availability and global consistency of data.
Non-critical | Cost-efficient and autoscaling infrastructure | Resources with basic features and ephemeral resources like Spot VMs.
Event-driven | Dynamic scaling based on the current demand for capacity and performance | Serverless services like Cloud Run and Cloud Run functions.
Experimental workloads | Low cost and flexible environment for rapid development, iteration, testing, and innovation | Resources with basic features, ephemeral resources like Spot VMs, and sandbox environments with defined spending limits.

A benefit of the cloud is the opportunity to take advantage of the most appropriate computing power for a given workload. Some workloads are developed to take advantage of processor instruction sets, and others might not be designed in this way. Benchmark and profile your workloads accordingly. Categorize your workloads and make workload-specific resource choices (for example, choose appropriate machine families for Compute Engine VMs). This practice helps to optimize costs, enable innovation, and maintain the level of availability and performance that your workloads need.

The following are examples of how you can implement this recommendation:

  • For mission-critical workloads that serve globally distributed users, consider using Spanner. Spanner removes the need for complex database deployments by ensuring reliability and consistency of data in all regions.
  • For workloads with fluctuating load levels, use autoscaling to ensure that you don't incur costs when the load is low and yet maintain sufficient capacity to meet the current load. You can configure autoscaling for many Google Cloud services, including Compute Engine VMs, Google Kubernetes Engine (GKE) clusters, and Cloud Run. When you set up autoscaling, you can configure maximum scaling limits to ensure that costs remain within specified budgets.
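
The effect of a maximum scaling limit on cost can be illustrated with a simple target-tracking calculation. The following sketch is a generic model and not the algorithm of any particular Google Cloud autoscaler; the function names, the utilization target, and the per-replica hourly cost are assumptions.

```python
import math


def desired_replicas(current_replicas, utilization_pct, target_pct, min_replicas, max_replicas):
    """Scale so that utilization approaches the target, within the configured limits."""
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    raw = math.ceil(current_replicas * utilization_pct / target_pct)
    # The maximum limit caps capacity, and therefore cost, even under heavy load.
    return max(min_replicas, min(raw, max_replicas))


def max_hourly_cost(max_replicas, hourly_cost_per_replica):
    """Upper bound on hourly spend implied by the maximum scaling limit."""
    return max_replicas * hourly_cost_per_replica


print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=12))
print(max_hourly_cost(12, 0.25))
```

Because the replica count can never exceed the maximum, the maximum limit multiplied by the per-replica cost gives you a worst-case figure to compare against your budget.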

Select regions based on cost requirements

For your cloud workloads, carefully evaluate the available Google Cloud regions and choose regions that align with your cost objectives. The region with lowest cost might not offer optimal latency or it might not meet your sustainability requirements. Make informed decisions about where to deploy your workloads to achieve the desired balance. You can use the Google Cloud Region Picker to understand the trade-offs between cost, sustainability, latency, and other factors.

Use built-in cost optimization options

Google Cloud products provide built-in features to help you optimize resource usage and control costs. The following table lists examples of cost optimization features that you can use in some Google Cloud products:

Product Cost optimization feature
Compute Engine
  • Automatically add or remove VMs based on the current load by using autoscaling.
  • Avoid overprovisioning by creating and using custom machine types that match your workload's requirements.
  • For non-critical or fault-tolerant workloads, reduce costs by using Spot VMs.
  • In development environments, reduce costs by limiting the run time of VMs or by suspending or stopping VMs when you don't need them.
GKE
  • Automatically adjust the size of GKE clusters based on the current load by using cluster autoscaler.
  • Automatically create and manage node pools based on workload requirements and ensure optimal resource utilization by using node auto-provisioning.
Cloud Storage
  • Automatically transition data to lower-cost storage classes based on the age of data or based on access patterns by using Object Lifecycle Management.
  • Dynamically move data to the most cost-effective storage class based on usage patterns by using Autoclass.
BigQuery
  • Reduce query processing costs for steady-state workloads by using capacity-based pricing.
  • Optimize query performance and costs by using partitioning and clustering techniques.
Google Cloud VMware Engine
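
As an illustration of the Object Lifecycle Management feature that's listed for Cloud Storage in the preceding table, the following sketch builds a lifecycle configuration that moves objects to lower-cost storage classes as they age. The helper function and the 30-day and 90-day thresholds are hypothetical. The rule structure follows the general shape of a Cloud Storage lifecycle configuration, but verify the exact format against the Cloud Storage documentation before you apply it to a bucket.

```python
import json


def build_lifecycle_config(nearline_after_days: int,
                           coldline_after_days: int) -> dict:
    """Build a two-step, age-based storage class transition (hypothetical helper)."""
    if not 0 < nearline_after_days < coldline_after_days:
        raise ValueError("expected 0 < nearline_after_days < coldline_after_days")
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after_days},
            },
        ]
    }


# Example thresholds only; choose values that match your access patterns.
print(json.dumps(build_lifecycle_config(30, 90), indent=2))
```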

Optimize resource sharing

To maximize the utilization of cloud resources, you can deploy multiple applications or services on the same infrastructure, while still meeting the security and other requirements of the applications. For example, in development and testing environments, you can use the same cloud infrastructure to test all the components of an application. For the production environment, you can deploy each component on a separate set of resources to limit the extent of impact in case of incidents.

The following are examples of how you can implement this recommendation:

  • Use a single Cloud SQL instance for multiple non-production environments.
  • Enable multiple development teams to share a GKE cluster by using the fleet team management feature in GKE Enterprise with appropriate access controls.
  • Use GKE Autopilot to take advantage of cost-optimization techniques like bin packing and autoscaling that GKE implements by default.
  • For AI and ML workloads, save GPU costs by using GPU-sharing strategies like multi-instance GPUs, time-sharing GPUs, and NVIDIA MPS.

Develop and maintain reference architectures

Create and maintain a repository of reference architectures that are tailored to meet the requirements of different deployment environments and workload types. To streamline the design and implementation process for individual projects, the blueprints can be centrally managed by a team like a Cloud Center of Excellence (CCoE). Project teams can choose suitable blueprints based on clearly defined criteria, to ensure architectural consistency and adoption of best practices. For requirements that are unique to a project, the project team and the central architecture team should collaborate to design new reference architectures. You can share the reference architectures across the organization to foster knowledge sharing and expand the repository of available solutions. This approach ensures consistency, accelerates development, simplifies decision-making, and promotes efficient resource utilization.

Review the reference architectures provided by Google for various use cases and technologies. These reference architectures incorporate best practices for resource selection, sizing, configuration, and deployment. By using these reference architectures, you can accelerate your development process and achieve cost savings from the start.

Enforce cost discipline by using organization policies

Consider using organization policies to limit the available Google Cloud locations and products that team members can use. These policies help to ensure that teams adhere to cost-effective solutions and provision resources in locations that are aligned with your cost optimization goals.

Estimate realistic budgets and set financial boundaries

Develop detailed budgets for each project, workload, and deployment environment. Make sure that the budgets cover all aspects of cloud operations, including infrastructure costs, software licenses, personnel, and anticipated growth. To prevent overspending and ensure alignment with your financial goals, establish clear spending limits or thresholds for projects, services, or specific resources. Monitor cloud spending regularly against these limits. You can use proactive quota alerts to identify potential cost overruns early and take timely corrective action.

In addition to setting budgets, you can use quotas and limits to help enforce cost discipline and prevent unexpected spikes in spending. You can exercise granular control over resource consumption by setting quotas at various levels, including projects, services, and even specific resource types.

The following are examples of how you can implement this recommendation:

  • Project-level quotas: Set spending limits or resource quotas at the project level to establish overall financial boundaries and control resource consumption across all the services within the project.
  • Service-specific quotas: Configure quotas for specific Google Cloud services like Compute Engine or BigQuery to limit the number of instances, CPUs, or storage capacity that can be provisioned.
  • Resource type-specific quotas: Apply quotas to individual resource types like Compute Engine VMs, Cloud Storage buckets, Cloud Run instances, or GKE nodes to restrict their usage and prevent unexpected cost overruns.
  • Quota alerts: Get notifications when your quota usage (at the project level) reaches a percentage of the maximum value.
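
The logic behind a percentage-based quota alert can be sketched as follows. This example is illustrative only and doesn't call any Google Cloud API. The function name and the default thresholds (50%, 80%, and 100%) are assumptions.

```python
def crossed_thresholds(usage: float, limit: float,
                       thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the alert thresholds (as fractions of the limit) that usage has reached."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = usage / limit
    return [t for t in thresholds if ratio >= t]


# Hypothetical project that uses 85 of 100 allowed CPUs.
print(crossed_thresholds(85, 100))
```

Notifying at several thresholds, rather than only at 100%, leaves time to request a quota increase or reduce consumption before provisioning requests fail.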

By using quotas and limits in conjunction with budgeting and monitoring, you can create a proactive and multi-layered approach to cost control. This approach helps to ensure that your cloud spending remains within defined boundaries and aligns with your business objectives. Remember, these cost controls are not permanent or rigid. To ensure that the cost controls remain aligned with current industry standards and reflect your evolving business needs, you must review the controls regularly and adjust them to include new technologies and best practices.

Optimize continuously

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your cloud deployments based on constantly changing and evolving business goals.

As your business grows and evolves, your cloud workloads need to adapt to changes in resource requirements and usage patterns. To derive maximum value from your cloud spending, you must maintain cost-efficiency while continuing to support business objectives. This requires a proactive and adaptive approach that focuses on continuous improvement and optimization.

Principle overview

To optimize cost continuously, you must proactively monitor and analyze your cloud environment and make suitable adjustments to meet current requirements. Focus your monitoring efforts on key performance indicators (KPIs) that directly affect your end users' experience, align with your business goals, and provide insights for continuous improvement. This approach lets you identify and address inefficiencies, adapt to changing needs, and continuously align cloud spending with strategic business goals. To balance comprehensive observability with cost effectiveness, understand the costs and benefits of monitoring resource usage and use appropriate process-improvement and optimization strategies.

Recommendations

To effectively monitor your Google Cloud environment and optimize cost continuously, consider the following recommendations.

Focus on business-relevant metrics

Effective monitoring starts with identifying the metrics that are most important for your business and customers. These metrics include the following:

  • User experience metrics: Latency, error rates, throughput, and customer satisfaction metrics are useful for understanding your end users' experience when using your applications.
  • Business outcome metrics: Revenue, customer growth, and engagement can be correlated with resource usage to identify opportunities for cost optimization.
  • DevOps Research & Assessment (DORA) metrics: Metrics like deployment frequency, lead time for changes, change failure rate, and time to restore provide insights into the efficiency and reliability of your software delivery process. By improving these metrics, you can increase productivity, reduce downtime, and optimize cost.
  • Site Reliability Engineering (SRE) metrics: Error budgets help teams to quantify and manage the acceptable level of service disruption. By establishing clear expectations for reliability, error budgets empower teams to innovate and deploy changes more confidently, knowing their safety margin. This proactive approach promotes a balance between innovation and stability, helping prevent excessive operational costs associated with major outages or prolonged downtime.
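
To make the error budget concept concrete, the following sketch converts an availability SLO into the downtime that it permits over a time window. The function names, the 99.9% SLO, and the 30-day window are example assumptions, not recommended values.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of downtime that an availability SLO allows within the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(budget_minutes: float, downtime_minutes: float) -> float:
    """Budget that is left after subtracting the downtime observed so far."""
    return budget_minutes - downtime_minutes


budget = error_budget_minutes(99.9, 30)
print(f"Budget: {budget:.1f} minutes")
print(f"Remaining after 10 minutes of downtime: "
      f"{remaining_budget_minutes(budget, 10):.1f} minutes")
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. A team that has used only a small part of that budget has room to deploy changes, and a team that has exhausted it has a clear signal to prioritize stability.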

Use observability for resource optimization

The following are recommendations to use observability to identify resource bottlenecks and underutilized resources in your cloud deployments:

  • Monitor resource utilization: Use resource utilization metrics to identify Google Cloud resources that are underutilized. For example, use metrics like CPU and memory utilization to identify idle VM resources. For Google Kubernetes Engine (GKE), you can view a detailed breakdown of costs and cost-related optimization metrics. For Google Cloud VMware Engine, review resource utilization to optimize CUDs, storage consumption, and ESXi right-sizing.
  • Use cloud recommendations: Active Assist is a portfolio of intelligent tools that help you optimize your cloud operations. These tools provide actionable recommendations to reduce costs, increase performance, improve security, and even make sustainability-focused decisions. For example, VM rightsizing insights can help to optimize resource allocation and avoid unnecessary spending.
  • Correlate resource utilization with performance: Analyze the relationship between resource utilization and application performance to determine whether you can downgrade to less expensive resources without affecting the user experience.
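
As a simple illustration of the first recommendation, the following sketch flags VMs whose average CPU and memory utilization are both below chosen thresholds. The data structure, the VM names, and the 10% and 20% thresholds are hypothetical. In practice, the utilization values would come from your monitoring data.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    name: str
    avg_cpu: float  # average CPU utilization as a fraction (0.0 to 1.0)
    avg_mem: float  # average memory utilization as a fraction (0.0 to 1.0)


def find_underutilized(vms, cpu_threshold=0.10, mem_threshold=0.20):
    """Return the names of VMs that are below both utilization thresholds."""
    return [vm.name for vm in vms
            if vm.avg_cpu < cpu_threshold and vm.avg_mem < mem_threshold]


fleet = [
    VmUsage("web-1", avg_cpu=0.55, avg_mem=0.60),
    VmUsage("batch-old", avg_cpu=0.02, avg_mem=0.05),
    VmUsage("cache-1", avg_cpu=0.04, avg_mem=0.70),
]
print(find_underutilized(fleet))
```

Requiring both metrics to be low avoids flagging VMs like the hypothetical cache-1, which uses little CPU but relies on its memory.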

Balance troubleshooting needs with cost

Detailed observability data can help with diagnosing and troubleshooting issues. However, storing excessive amounts of observability data or exporting unnecessary data to external monitoring tools can lead to unnecessary costs. For efficient troubleshooting, consider the following recommendations:

  • Collect sufficient data for troubleshooting: Ensure that your monitoring solution captures enough data to efficiently diagnose and resolve issues when they arise. This data might include logs, traces, and metrics at various levels of granularity.
  • Use sampling and aggregation: Balance the need for detailed data with cost considerations by using sampling and aggregation techniques. This approach lets you collect representative data without incurring excessive storage costs.
  • Understand the pricing models of your monitoring tools and services: Evaluate different monitoring solutions and choose options that align with your project's specific needs, budget, and usage patterns. Consider factors like data volume, retention requirements, and the required features when making your selection.
  • Regularly review your monitoring configuration: Avoid collecting excessive data by removing unnecessary metrics or logs.
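
The sampling and aggregation techniques that are mentioned in the preceding list can be sketched as follows. This is a generic illustration and not the behavior of a specific monitoring product. The function names are hypothetical, and the hash-based approach is one of several possible sampling strategies.

```python
import hashlib
from collections import defaultdict


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in the range [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


def aggregate_per_minute(points):
    """Reduce (unix_seconds, value) points to one mean value per minute."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in points:
        minute_start = ts - ts % 60
        sums[minute_start][0] += value
        sums[minute_start][1] += 1
    return {minute: total / count
            for minute, (total, count) in sorted(sums.items())}


print(aggregate_per_minute([(0, 10), (30, 20), (60, 40)]))
```

Sampling reduces the number of traces that you store, and aggregation reduces the resolution of stored metrics. Both lower storage costs at the expense of detail, so choose rates and intervals that still support troubleshooting.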

Tailor data collection to roles and set role-specific retention policies

Consider the specific data needs of different roles. For example, developers might primarily need access to traces and application-level logs, whereas IT administrators might focus on system logs and infrastructure metrics. By tailoring data collection, you can reduce unnecessary storage costs and avoid overwhelming users with irrelevant information.

Additionally, you can define retention policies based on the needs of each role and any regulatory requirements. For example, developers might need access to detailed logs for a shorter period, while financial analysts might require longer-term data.

Consider regulatory and compliance requirements

In certain industries, regulatory requirements mandate data retention. To avoid legal and financial risks, you need to ensure that your monitoring and data retention practices help you adhere to relevant regulations. At the same time, you need to maintain cost efficiency. Consider the following recommendations:

  • Determine the specific data retention requirements for your industry or region, and ensure that your monitoring strategy meets those requirements.
  • Implement appropriate data archival and retrieval mechanisms to meet audit and compliance needs while minimizing storage costs.

Implement smart alerting

Alerting helps to detect and resolve issues in a timely manner. However, you need to strike a balance between an approach that keeps you informed and one that overwhelms you with notifications. By designing intelligent alerting systems, you can prioritize critical issues that have higher business impact. Consider the following recommendations:

  • Prioritize issues that affect customers: Design alerts that trigger rapidly for issues that directly affect the customer experience, like website outages, slow response times, or transaction failures.
  • Tune for temporary problems: Use appropriate thresholds and delay mechanisms to avoid unnecessary alerts for temporary problems or self-healing system issues that don't affect customers.
  • Customize alert severity: Ensure that the most urgent issues receive immediate attention by differentiating between critical and noncritical alerts.
  • Use notification channels wisely: Choose appropriate channels for alert notifications (email, SMS, or paging) based on the severity and urgency of the alerts.
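
One way to tune alerts for temporary problems is to require that a threshold is exceeded for several consecutive samples before an alert fires. The following sketch shows that idea in isolation. The function name and the sample values are hypothetical, and real alerting policies typically express the same idea as a duration window.

```python
def should_alert(samples, threshold: float, min_consecutive: int) -> bool:
    """Fire only when the threshold is exceeded for enough samples in a row."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    streak = 0
    for value in samples:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# A brief spike doesn't trigger an alert, but a sustained problem does.
print(should_alert([0.2, 0.9, 0.3, 0.2], threshold=0.8, min_consecutive=3))
print(should_alert([0.9, 0.95, 0.85, 0.9], threshold=0.8, min_consecutive=3))
```

A longer streak requirement reduces noise but delays detection, so use shorter windows for issues that directly affect customers.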

Google Cloud Architecture Framework: Performance optimization

This pillar of the Google Cloud Architecture Framework describes the performance optimization process and best practices to optimize the performance of workloads in Google Cloud.

The information in this document is intended for architects, developers, and administrators who plan, design, deploy, and manage workloads in Google Cloud.

Optimizing the performance of workloads in the cloud can help your organization operate efficiently, improve customer satisfaction, increase revenue, and reduce cost. For example, when the backend processing time of an application decreases, users experience faster response times, which can lead to higher user retention and more revenue.

There might be trade-offs between performance and cost. But sometimes, optimizing performance can help you reduce cost. For example, autoscaling helps provide predictable performance when the load increases by ensuring that the resources aren't overloaded. Autoscaling also helps you reduce cost during periods of low load by removing unused resources.

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization.

In this pillar of the Architecture Framework, you learn to do the following:

Performance optimization process

This document in the Google Cloud Architecture Framework provides an overview of the performance optimization process.

Performance optimization is a continuous process, not a one-time activity. The following diagram shows the stages in the performance optimization process:

Performance optimization process

The following is an overview of the stages in the performance optimization process:

Define performance requirements

Before you start to design and develop the applications that you intend to deploy or migrate to the cloud, determine the performance requirements. Define the requirements as granularly as possible for each layer of the application stack: frontend load balancing, web or application servers, database, and storage. For example, for the storage layer of the stack, decide on the throughput and I/O operations per second (IOPS) that your applications need.

Design and deploy your applications

Design your applications by using elastic and scalable design patterns that can help you meet the performance requirements. Consider the following guidelines for designing applications that are elastic and scalable:

  • Architect the workloads for optimal content placement.
  • Isolate read and write traffic.
  • Isolate static and dynamic traffic.
  • Implement content caching. Use data caches for internal layers.
  • Use managed services and serverless architectures.

Google Cloud provides open source tools that you can use to benchmark the performance of Google Cloud services against that of other cloud platforms.

Monitor and analyze performance

After you deploy your applications, continuously monitor performance by using logs and alerts, analyze the data, and identify performance issues. As your applications grow and evolve, reassess your performance requirements. You might have to redesign some parts of the applications to maintain or improve performance.

Optimize performance

Based on the performance of your applications and changes in requirements, configure the cloud resources to meet the current performance requirements. For example, resize the resources or set up autoscaling. When you configure the resources, evaluate opportunities to use recently released Google Cloud features and services that can help further optimize performance.

The performance optimization process doesn't end at this point. Continue the cycle of monitoring performance, reassessing requirements when necessary, and adjusting the cloud resources to maintain and improve performance.

Monitor and analyze performance

This document in the Google Cloud Architecture Framework describes the services in Google Cloud Observability that you can use to record, monitor, and analyze the performance of your workloads.

Monitor performance metrics

Use Cloud Monitoring to analyze trends of performance metrics, analyze the effects of experiments, define alerts for critical metrics, and perform retrospective analyses.

Log critical data and events

Cloud Logging is an integrated logging service that you can use to store, analyze, monitor, and set alerts for log data and events. Cloud Logging can collect logs from the services of Google Cloud and other cloud providers.

Analyze code performance

Code that performs poorly can increase the latency of your applications and the cost of running them. Cloud Profiler helps you identify and address performance issues by continuously analyzing the performance of CPU-intensive or memory-intensive functions that an application uses.

Collect latency data

In complex application stacks and microservices-based architectures, assessing latency in inter-service communication and identifying performance bottlenecks can be difficult. Cloud Trace and OpenTelemetry tools help you collect latency data from your deployments at scale. These tools also help you analyze the latency data efficiently.

Monitor network performance

The Performance Dashboard of the Network Intelligence Center gives you a comprehensive view of performance metrics for the Google network and the resources in your project. These metrics can help you determine the cause of network-related performance issues. For example, you can identify whether a performance issue is the result of a problem in your project or the Google network.

Optimize compute performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your Compute Engine, Google Kubernetes Engine (GKE), and serverless resources.

Compute Engine

This section provides guidance to help you optimize the performance of your Compute Engine resources.

Autoscale resources

Managed instance groups (MIGs) let you scale your stateless apps deployed on Compute Engine VMs efficiently. Autoscaling helps your apps continue to deliver predictable performance when the load increases. In a MIG, a group of Compute Engine VMs is launched based on a template that you define. In the instance group configuration, you configure an autoscaling policy, which specifies one or more signals that the autoscaler uses to scale the group. The autoscaling signals can be schedule-based, like start time or duration, or based on target metrics such as average CPU utilization. For more information, see Autoscaling groups of instances.

Disable SMT

Each virtual CPU (vCPU) that you allocate to a Compute Engine VM is implemented as a single hardware multithread. By default, two vCPUs share a physical CPU core. This architecture is called simultaneous multi-threading (SMT).

For workloads that are highly parallel or that perform floating point calculations (such as transcoding, Monte Carlo simulations, genetic sequence analysis, and financial risk modeling), you can improve performance by disabling SMT. For more information, see Set the number of threads per core.

Use GPUs

For workloads such as machine learning and visualization, you can add graphics processing units (GPUs) to your VMs. Compute Engine provides NVIDIA GPUs in passthrough mode so that your VMs have direct control over the GPUs and the associated memory. For graphics-intensive workloads such as 3D visualization, you can use NVIDIA RTX virtual workstations. After you deploy the workloads, monitor the GPU usage and review the options for optimizing GPU performance.

Use compute-optimized machine types

Workloads like gaming, media transcoding, and high performance computing (HPC) require consistently high performance per CPU core. Google recommends that you use compute-optimized machine types for the VMs that run such workloads. Compute-optimized VMs are built on an architecture that uses features like non-uniform memory access (NUMA) for optimal and reliable performance.

Tightly coupled HPC workloads have a unique set of requirements for achieving peak efficiency in performance. For more information, see Parallel file systems for HPC workloads.

Choose appropriate storage

Google Cloud offers a wide range of storage options for Compute Engine VMs: Persistent disks, local solid-state drive (SSD) disks, Filestore, and Cloud Storage. For design recommendations and best practices to optimize the performance of each of these storage options, see Optimize storage performance.

Google Kubernetes Engine

This section provides guidance to help you optimize the performance of your Google Kubernetes Engine (GKE) resources.

Autoscale resources

You can automatically resize the node pools in a GKE cluster to match the current load by using the cluster autoscaler feature. Autoscaling helps your apps continue to deliver predictable performance when the load increases. The cluster autoscaler resizes node pools automatically based on the resource requests (rather than actual resource utilization) of the Pods running on the nodes. When you use autoscaling, there can be a trade-off between performance and cost. Review the best practices for configuring cluster autoscaling efficiently.

Use C2D VMs

You can improve the performance of compute-intensive containerized workloads by using C2D machine types. You can add C2D nodes to your GKE clusters by choosing a C2D machine type in your node pools.

Disable SMT

Simultaneous multi-threading (SMT) can increase application throughput significantly for general computing tasks and for workloads that need high I/O. But for workloads in which both the virtual cores are compute-bound, SMT can cause inconsistent performance. To get better and more predictable performance, you can disable SMT for your GKE nodes by setting the number of vCPUs per core to 1.

Use GPUs

For compute-intensive workloads like image recognition and video transcoding, you can accelerate performance by creating node pools that use GPUs. For more information, see Running GPUs.

Use container-native load balancing

Container-native load balancing enables load balancers to distribute traffic directly and evenly to Pods. This approach provides better network performance and improved visibility into network latency between the load balancer and the Pods. Because of these benefits, container-native load balancing is the recommended solution for load balancing through Ingress.

Define a compact placement policy

Tightly coupled batch workloads need low network latency between the nodes in the GKE node pool. You can deploy such workloads to single-zone node pools, and ensure that the nodes are physically close to each other by defining a compact placement policy. For more information, see Define compact placement for GKE nodes.

Serverless compute services

This section provides guidance to help you optimize the performance of your serverless compute services in Google Cloud: Cloud Run and Cloud Run functions. These services provide autoscaling capabilities, where the underlying infrastructure handles scaling automatically. By using these serverless services, you can reduce the effort to scale your microservices and functions, and focus on optimizing performance at the application level.

For more information, see the following documentation:

What's next

Review the best practices for optimizing the performance of your storage, networking, database, and analytics resources:

Optimize storage performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your storage resources in Google Cloud.

Cloud Storage

This section provides best practices to help you optimize the performance of your Cloud Storage operations.

Assess bucket performance

Assess the performance of your Cloud Storage buckets by using the gsutil perfdiag command. This command tests the performance of the specified bucket by sending a series of read and write requests with files of different sizes. You can tune the test to match the usage pattern of your applications. Use the diagnostic report that the command generates to set performance expectations and identify potential bottlenecks.

Cache frequently accessed objects

To improve the read latency for frequently accessed objects that are publicly accessible, you can configure such objects to be cached. Although caching can improve performance, stale content could be served if a cache has the old version of an object.

Scale requests efficiently

As the request rate for a bucket increases, Cloud Storage automatically increases the I/O capacity for the bucket by distributing the request load across multiple servers. To achieve optimal performance when scaling requests, follow the best practices for ramping up request rates and distributing load evenly.

Upload large files as composites

To upload large files, you can use a strategy called parallel composite uploads. With this strategy, the large file is split into chunks, which are uploaded in parallel and then recomposed in the cloud. Parallel composite uploads can be faster than regular upload operations when network bandwidth and disk speed are not limiting factors. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.
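
The split, parallel upload, and recompose steps can be sketched with a local simulation. The following example doesn't call Cloud Storage. The fake_upload function is a hypothetical stand-in for a network upload, and the chunk size and worker count are arbitrary example values.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Split the payload into fixed-size chunks (the last chunk can be smaller)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fake_upload(indexed_chunk):
    """Hypothetical stand-in for uploading one chunk as a temporary object."""
    index, chunk = indexed_chunk
    return index, chunk


def parallel_composite_upload(data: bytes, chunk_size: int, workers: int = 4) -> bytes:
    """Upload chunks in parallel, then recompose them in their original order."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        uploaded = list(pool.map(fake_upload, enumerate(chunks)))
    # Order by chunk index so that the composed object matches the source file.
    uploaded.sort(key=lambda item: item[0])
    return b"".join(chunk for _, chunk in uploaded)


payload = bytes(range(256)) * 10
print(parallel_composite_upload(payload, chunk_size=1000) == payload)
```

The recompose step must preserve the chunk order. In a real upload, the temporary chunk objects also need to be cleaned up after the final object is composed.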

Persistent disks and local SSDs

This section provides best practices to help you optimize the performance of your persistent disks and local SSDs that are attached to Compute Engine VMs.

The performance of persistent disks and local SSDs depends on the disk type and size, VM machine type, and number of vCPUs. Use the following guidelines to manage the performance of your persistent disks and local SSDs:

Filestore

This section provides best practices to help you optimize the performance of your Filestore instances. You can use Filestore to provision fully managed Network File System (NFS) file servers for Compute Engine VMs and GKE clusters.

  • When you provision a Filestore instance, choose a service tier that meets the performance and capacity requirements of your workload.
  • For client VMs that run cache-dependent workloads, use a machine type that helps optimize the network performance of the Filestore instance. For more information, see Recommended client machine type.
  • To optimize the performance of Filestore instances for client VMs that run Linux, Google recommends specific NFS mount settings. For more information, see Linux client mount options.
  • To minimize network latency, provision your Filestore instances in regions and zones that are close to where you plan to use the instances.
  • Monitor the performance of your Filestore instances, and set up alerts by using Cloud Monitoring.

What's next

Review the best practices for optimizing the performance of your compute, networking, database, and analytics resources:

Optimize networking and API performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your networking resources and APIs in Google Cloud.

Network Service Tiers

Network Service Tiers lets you optimize the network cost and performance of your workloads. You can choose from the following tiers:

  • Premium Tier uses Google's highly reliable global backbone to help you achieve minimal packet loss and latency. Traffic enters and leaves the Google network at a global edge point of presence (PoP) that's close to your end user. We recommend using Premium Tier as the default tier for optimal performance. Premium Tier supports both regional and global external IP addresses for VMs and load balancers.
  • Standard Tier is available only for resources that use regional external IP addresses. Traffic enters and leaves the Google network at an edge PoP that's closest to the Google Cloud location where your workload runs. The pricing for Standard Tier is lower than Premium Tier. Standard Tier is suitable for traffic that isn't sensitive to packet loss and that doesn't have low latency requirements.

You can view the network latency for Standard Tier and Premium Tier for each cloud region in the Network Intelligence Center Performance Dashboard.

Jumbo frames

Virtual Private Cloud (VPC) networks have a default maximum transmission unit (MTU) of 1460 bytes. However, you can configure your VPC networks to support an MTU of up to 8896 bytes (jumbo frames).

With a higher MTU, the network needs fewer packets to send the same amount of data, thus reducing the bandwidth used up by TCP/IP headers. This leads to a higher effective bandwidth for the network.
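
The following sketch estimates the effect of header overhead at the two MTU values mentioned earlier. It assumes 40 bytes of headers per packet (a 20-byte IPv4 header and a 20-byte TCP header without options) and ignores lower-layer framing, so treat the results as rough estimates. The function names and the 1 GB transfer size are hypothetical.

```python
HEADER_BYTES = 40  # assumption: IPv4 (20 bytes) + TCP (20 bytes), no options


def payload_efficiency(mtu: int) -> float:
    """Fraction of each full-size packet that carries application data."""
    return (mtu - HEADER_BYTES) / mtu


def packets_needed(total_bytes: int, mtu: int) -> int:
    """Number of packets required to send total_bytes of application data."""
    payload = mtu - HEADER_BYTES
    return -(-total_bytes // payload)  # integer ceiling division


for mtu in (1460, 8896):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.2%} payload, "
          f"{packets_needed(1_000_000_000, mtu)} packets for 1 GB")
```

Under these assumptions, the larger MTU raises the payload share of each packet from about 97.3% to about 99.6% and reduces the packet count for the same transfer by roughly a factor of six.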

For more information about intra-VPC MTU and the maximum MTU of other connections, see the Maximum transmission unit page in the VPC documentation.

VM performance

Compute Engine VMs have a maximum egress bandwidth that in part depends upon the machine type. One aspect of choosing an appropriate machine type is to consider how much traffic you expect the VM to generate.

The Network bandwidth page contains a discussion and table of network bandwidths for Compute Engine machine types.

If your inter-VM bandwidth requirements are very high, consider VMs that support Tier_1 networking.

Cloud Load Balancing

This section provides best practices to help you optimize the performance of your Cloud Load Balancing instances.

Deploy applications close to your users

Provision your application backends close to the location where you expect user traffic to arrive at the load balancer. The closer your users or client applications are to your workload servers, the lower the network latency between the users and the workload. To minimize latency to clients in different parts of the world, you might have to deploy the backends in multiple regions. For more information, see Best practices for Compute Engine regions selection.

Choose an appropriate load balancer type

The type of load balancer that you choose for your application can determine the latency that your users experience. For information about measuring and optimizing application latency for different load balancer types, see Optimizing application latency with load balancing.

Enable caching

To accelerate content serving, enable caching and Cloud CDN as part of your default external HTTP load balancer configuration. Make sure that the backend servers are configured to send the response headers that are necessary for static responses to be cached.

Use HTTP when HTTPS isn't necessary

Google automatically encrypts traffic between proxy load balancers and backends at the packet level. Packet-level encryption makes Layer 7 encryption using HTTPS between the load balancer and the backends redundant for most purposes. Consider using HTTP rather than HTTPS or HTTP/2 for traffic between the load balancer and your backends. By using HTTP, you can also reduce the CPU usage of your backend VMs. However, when the backend is an internet network endpoint group (NEG), use HTTPS or HTTP/2 for traffic between the load balancer and the backend. This helps ensure that your traffic is secure on the public internet. For optimal performance, we recommend benchmarking your application's traffic patterns.

Network Intelligence Center

Google Cloud Network Intelligence Center provides a comprehensive view of the performance of the Google Cloud network across all regions. Network Intelligence Center helps you determine whether latency issues are caused by problems in your project or in the network. You can also use this information to select the regions and zones where you should deploy your workloads to optimize network performance.

Use the following tools provided by Network Intelligence Center to monitor and analyze network performance for your workloads in Google Cloud:

  • Performance Dashboard shows latency between Google Cloud regions and between individual regions and locations on the internet. Performance Dashboard can help you determine where to place workloads for best latency and help determine when an application issue might be due to underlying network issues.

  • Network Topology visualizes your Virtual Private Cloud (VPC) networks, hybrid connectivity with your on-premises networks, and connectivity to Google-managed services. Network Topology provides real-time operational metrics that you can use to analyze and understand network performance and identify unusual traffic patterns.

  • Network Analyzer is an automatic configuration monitoring and diagnostics tool. It verifies VPC network configurations for firewall rules, routes, configuration dependencies, and connectivity for services and applications. It helps you identify network failures, and provides root cause analysis and recommendations. Network Analyzer provides prioritized insights to help you analyze problems with network configuration, such as high utilization of IP addresses in a subnet.

API Gateway and Apigee

This section provides recommendations to help you optimize the performance of the APIs that you deploy in Google Cloud by using API Gateway and Apigee.

API Gateway lets you create and manage APIs for Google Cloud serverless backends, including Cloud Run functions, Cloud Run, and App Engine. These services are managed services, and they scale automatically. But as the applications that are deployed on these services scale, you might need to increase the quotas and rate limits for API Gateway.

Apigee provides the following analytics dashboards to help you monitor the performance of your managed APIs:

If you use Apigee Integration, consider the system-configuration limits when you build and manage your integrations.

What's next

Review the best practices for optimizing the performance of your compute, storage, database, and analytics resources:

Optimize database performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your databases in Google Cloud.

Cloud SQL

The following recommendations help you to optimize the performance of your Cloud SQL instances running SQL Server, MySQL, and PostgreSQL databases.

For more information, see the following documentation:

Bigtable

This section provides recommendations to help you optimize the performance of your Bigtable instances.

Plan capacity based on performance requirements

You can use Bigtable in a broad spectrum of applications, each with a different optimization goal. For example, for batch data-processing jobs, throughput might be more important than latency. For an online service that serves user requests, you might need to prioritize lower latency over throughput. When you plan capacity for your Bigtable clusters, consider the tradeoffs between throughput and latency. For more information, see Plan your Bigtable capacity.

Follow schema-design best practices

Your tables can scale to billions of rows and thousands of columns, enabling you to store petabytes of data. When you design the schema for your Bigtable tables, consider the schema design best practices.

Monitor performance and make adjustments

Monitor the CPU and disk usage for your instances, analyze the performance of each cluster, and review the sizing recommendations that are shown in the monitoring charts.

Spanner

This section provides recommendations to help you optimize the performance of your Spanner instances.

Choose a primary key that prevents a hotspot

A hotspot is a single server that is forced to handle many requests. When you choose the primary key for your database, follow the schema design best practices to prevent a hotspot.
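One technique that is commonly suggested for avoiding hotspots with sequential keys is to reverse the bits of the sequential value, so that consecutive IDs land far apart in the key space. The following Python sketch shows the idea for unsigned 64-bit values. The function name is illustrative, and mapping the result into a signed INT64 range is left out as an assumption of this sketch.

```python
def bit_reverse_64(value: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value must fit in an unsigned 64-bit integer")
    # Format as a zero-padded 64-character binary string, reverse it,
    # and parse the reversed string back into an integer.
    return int(f"{value:064b}"[::-1], 2)


if __name__ == "__main__":
    # Consecutive IDs differ in their highest bits after reversal,
    # so they no longer sort next to each other.
    for sequential_id in (1, 2, 3):
        print(sequential_id, bit_reverse_64(sequential_id))
```

Because the reversal is its own inverse, the original sequential value can be recovered by applying the same function again.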

Follow best practices for SQL coding

The SQL compiler in Spanner converts each declarative SQL statement that you write into an imperative query execution plan. Spanner uses the execution plan to run the SQL statement. When you construct SQL statements, follow SQL best practices to make sure that Spanner uses execution plans that yield optimal performance.

Use query options to manage the SQL query optimizer

Spanner uses a SQL query optimizer to transform SQL statements into efficient query execution plans. The query execution plan that the optimizer produces might change slightly when the query optimizer itself evolves, or when the database statistics are updated. You can minimize the potential for performance regression when the query optimizer or the database statistics change by using query options.

Visualize and tune the structure of query execution plans

To analyze query performance issues, you can visualize and tune the structure of the query execution plans by using the query plan visualizer.

Use operations APIs to manage long-running operations

For certain method calls, Spanner creates long-running operations, which might take a substantial amount of time to complete. For example, when you restore a database, Spanner creates a long-running operation to track restore progress. To help you monitor and manage long-running operations, Spanner provides operations APIs. For more information, see Managing long-running operations.

Follow best practices for bulk loading

Spanner supports several options for loading large amounts of data in bulk. The performance of a bulk-load operation depends on factors such as partitioning, the number of write requests, and the size of each request. To load large amounts of data efficiently, follow bulk-loading best practices.

Monitor and control CPU utilization

The CPU utilization of your Spanner instance can affect request latencies. An overloaded backend server can cause higher request latencies. Spanner provides CPU utilization metrics to help you investigate high CPU utilization. For performance-sensitive applications, you might need to reduce CPU utilization by increasing the compute capacity.

Analyze and solve latency issues

When a client makes a remote procedure call to Spanner, the API request is first prepared by the client libraries. The request then passes through the Google Front End and the Cloud Spanner API frontend before it reaches the Spanner database. To analyze and solve latency issues, you must measure and analyze the latency for each segment of the path that the API request traverses. For more information, see Spanner end-to-end latency guide.

Launch applications after the database reaches the warm state

As your Spanner database grows, it divides the key space of your data into splits. Each split is a range of rows that contains a subset of your table. To balance the overall load on the database, Spanner dynamically moves individual splits independently and assigns them to different servers. When the splits are distributed across multiple servers, the database is considered to be in a warm state. A database that's warm can maximize parallelism and deliver improved performance. Before you launch your applications, we recommend that you warm up your database with test data loads.

What's next

Review the best practices for optimizing the performance of your compute, storage, networking, and analytics resources:

Optimize analytics performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your analytics workloads in Google Cloud.

BigQuery

This section provides recommendations to help you optimize the performance of queries in BigQuery.

Optimize query design

Query performance depends on factors like the number of bytes that your queries read and write, and the volume of data that's passed between slots. To optimize the performance of your queries in BigQuery, apply the best practices that are described in the following documentation:

Define and use materialized views efficiently

To improve the performance of workloads that use common and repeated queries, you can use materialized views. There are limits to the number of materialized views that you can create. Don't create a separate materialized view for every permutation of a query. Instead, define materialized views that you can use for multiple patterns of queries.

Improve JOIN performance

You can use materialized views to reduce the cost and latency of a query that performs aggregation on top of a JOIN. Consider a case where you join a large fact table with a few small dimension tables, and then perform an aggregation on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys. Then, join the result with the dimension tables. Finally, perform a post-aggregation.
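The rewrite described above can be sketched as follows. The example uses Python's built-in sqlite3 module only to show that the two query shapes return the same result; it doesn't demonstrate BigQuery performance. The table names, column names, and data are hypothetical.

```python
import sqlite3

# Aggregation directly on top of the join.
DIRECT = """
SELECT p.category, SUM(s.amount) AS total
FROM sales AS s
JOIN products AS p USING (product_id)
GROUP BY p.category
ORDER BY p.category
"""

# Pre-aggregate the fact table by its foreign key, join the smaller
# result with the dimension table, and then post-aggregate.
PREAGGREGATED = """
WITH pre AS (
  SELECT product_id, SUM(amount) AS amount
  FROM sales
  GROUP BY product_id
)
SELECT p.category, SUM(pre.amount) AS total
FROM pre
JOIN products AS p USING (product_id)
GROUP BY p.category
ORDER BY p.category
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a tiny fact and dimension table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE sales (product_id INTEGER, amount INTEGER);
        INSERT INTO products VALUES (1, 'books'), (2, 'books'), (3, 'games');
        INSERT INTO sales VALUES (1, 10), (1, 5), (2, 7), (3, 20), (3, 1);
    """)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print(conn.execute(DIRECT).fetchall())
    print(conn.execute(PREAGGREGATED).fetchall())
```

In the rewritten form, the join processes one row per product instead of one row per sale, which is where the potential savings come from when the fact table is large.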

Dataflow

This section provides recommendations to help you optimize the performance of your Dataflow pipelines.

When you create and deploy pipelines, you can configure execution parameters, like the Compute Engine machine type that should be used for the Dataflow worker VMs. For more information, see Pipeline options.

After you deploy pipelines, Dataflow manages the Compute Engine and Cloud Storage resources that are necessary to run your jobs. In addition, the following features of Dataflow help optimize the performance of the pipelines:

You can monitor the performance of Dataflow pipelines by using the web-based monitoring interface or the Dataflow gcloud CLI.

Dataproc

This section describes best practices to optimize the performance of your Dataproc clusters.

Autoscale clusters

To ensure that your Dataproc clusters deliver predictable performance, you can enable autoscaling. Dataproc uses Hadoop YARN memory metrics and an autoscaling policy that you define to automatically adjust the number of worker VMs in a cluster. For more information about how to use and configure autoscaling, see Autoscaling clusters.

Provision appropriate storage

Choose an appropriate storage option for your Dataproc cluster based on your performance and cost requirements:

  • If you need a low-cost Hadoop-compatible file system (HCFS) that Hadoop and Spark jobs can read from and write to with minimal changes, use Cloud Storage. The data stored in Cloud Storage is persistent, and can be accessed by other Dataproc clusters and other products such as BigQuery.
  • If you need a low-latency Hadoop Distributed File System (HDFS) for your Dataproc cluster, use Compute Engine persistent disks attached to the worker nodes. The data stored in HDFS storage is transient, and the storage cost is higher than the Cloud Storage option.
  • To get the performance advantage of Compute Engine persistent disks and the cost and durability benefits of Cloud Storage, you can combine both of the storage options. For example, you can store your source and final datasets in Cloud Storage, and provision limited HDFS capacity for the intermediate datasets. When you decide on the size and type of the disks for HDFS storage, consider the recommendations in the Persistent disks and local SSDs section.

Reduce latency when using Cloud Storage

To reduce latency when you access data that's stored in Cloud Storage, we recommend the following:

  • Create your Cloud Storage bucket in the same region as the Dataproc cluster.
  • Disable auto.purge for Apache Hive-managed tables stored in Cloud Storage.
  • When you use Spark SQL, consider creating Dataproc clusters with the latest versions of available images. By using the latest version, you can avoid performance issues that might remain in older versions, such as slow INSERT OVERWRITE performance in Spark 2.x.
  • To minimize the possibility of writing many files with varying or small sizes to Cloud Storage, you can configure the Spark SQL parameters spark.sql.shuffle.partitions and spark.default.parallelism or the Hadoop parameter mapreduce.job.reduces.
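As a starting point for tuning the parameters in the last item, the following Python sketch estimates a partition count from the expected shuffle volume and formats the result as a comma-separated property list. The 128 MiB target size per partition and both function names are assumptions of this sketch, not recommended values. Benchmark with your own data before you settle on a setting.

```python
import math

# Assumption: aim for roughly 128 MiB per shuffle partition.
DEFAULT_TARGET_PARTITION_BYTES = 128 * 1024 ** 2


def estimate_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = DEFAULT_TARGET_PARTITION_BYTES,
) -> int:
    """Estimate a partition count so that each partition is near the target size."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


def spark_properties(partitions: int) -> str:
    """Format the two Spark parameters as a comma-separated key=value list."""
    return ",".join([
        f"spark.sql.shuffle.partitions={partitions}",
        f"spark.default.parallelism={partitions}",
    ])


if __name__ == "__main__":
    partitions = estimate_shuffle_partitions(10 * 1024 ** 3)  # 10 GiB of shuffle data
    print(spark_properties(partitions))
```

Fewer, larger partitions typically produce fewer and more evenly sized output files, at the cost of less parallelism per stage.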

Monitor and adjust storage load and capacity

The persistent disks attached to the worker nodes in a Dataproc cluster hold shuffle data. To perform optimally, the worker nodes need sufficient disk space. If the nodes don't have sufficient disk space, the nodes are marked as UNHEALTHY in the YARN NodeManager log. If this issue occurs, either increase the disk size for the affected nodes, or run fewer jobs concurrently.

Enable EFM

When worker nodes are removed from a running Dataproc cluster, for example because of downscaling or preemption, shuffle data might be lost. To minimize job delays in such scenarios, we recommend that you enable Enhanced Flexibility Mode (EFM) for clusters that use preemptible VMs or that only autoscale the secondary worker group.

What's next

Review the best practices for optimizing the performance of your compute, storage, networking, and database resources:

Design for environmental sustainability

This document in the Google Cloud Architecture Framework summarizes how you can approach environmental sustainability for your workloads in Google Cloud. It includes information about how to minimize your carbon footprint on Google Cloud.

Understand your carbon footprint

To understand the carbon footprint from your Google Cloud usage, use the Carbon Footprint dashboard. The Carbon Footprint dashboard attributes emissions to the Google Cloud projects that you own and the cloud services that you use.

Choose the most suitable cloud regions

One effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions. To help you make this choice, Google publishes carbon data for all Google Cloud regions.

When you choose a region, you might need to balance lowering emissions with other requirements, such as pricing and network latency. To help select a region, use the Google Cloud Region Picker.

Choose the most suitable cloud services

To help reduce your existing carbon footprint, consider migrating your on-premises VM workloads to Compute Engine.

Consider serverless options for workloads that don't need VMs. These managed services often optimize resource usage automatically, reducing costs and carbon footprint.

Minimize idle cloud resources

Idle resources incur unnecessary costs and emissions. Some common causes of idle resources include the following:

  • Unused active cloud resources, such as idle VM instances.
  • Over-provisioned resources, such as VM machine types that are larger than necessary for a workload.
  • Non-optimal architectures, such as lift-and-shift migrations that aren't always optimized for efficiency. Consider making incremental improvements to these architectures.

The following are some general strategies to help minimize wasted cloud resources:

  • Identify idle or overprovisioned resources and either delete them or rightsize them.
  • Refactor your architecture to incorporate a more optimal design.
  • Migrate workloads to managed services.
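The first strategy can be sketched as a simple classification over utilization data that you have already collected. The thresholds (5% and 30% average CPU utilization) and the function name are hypothetical values chosen for illustration; appropriate thresholds depend on your workloads.

```python
def classify_vm(
    avg_cpu_utilization: float,
    idle_below: float = 0.05,        # hypothetical threshold
    oversized_below: float = 0.30,   # hypothetical threshold
) -> str:
    """Classify a VM as 'idle', 'overprovisioned', or 'ok' by average CPU use."""
    if not 0.0 <= avg_cpu_utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    if avg_cpu_utilization < idle_below:
        return "idle"             # candidate for deletion
    if avg_cpu_utilization < oversized_below:
        return "overprovisioned"  # candidate for rightsizing
    return "ok"


if __name__ == "__main__":
    # Hypothetical fleet: VM name -> average CPU utilization over a review period.
    fleet = {"vm-a": 0.01, "vm-b": 0.12, "vm-c": 0.64}
    for name, utilization in fleet.items():
        print(name, classify_vm(utilization))
```

A real review would also consider memory, disk, and network usage, and the time window over which the averages are computed.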

Reduce emissions for batch workloads

Run batch workloads in regions with lower carbon emissions. For further reductions, run workloads at times that coincide with lower grid carbon intensity when possible.
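If you have an hourly forecast of grid carbon intensity for a region, choosing a start time can be sketched as a sliding-window minimum. The following Python example assumes that such a forecast is available as a list of values (for example, in gCO2eq/kWh). The numbers in the example are invented for illustration and aren't real measurements.

```python
def best_start_hour(intensity_by_hour: list[float], job_hours: int) -> int:
    """Return the start index of the contiguous window with the lowest total intensity."""
    if not 1 <= job_hours <= len(intensity_by_hour):
        raise ValueError("job_hours must be between 1 and the forecast length")
    # Total intensity of the first window.
    window = sum(intensity_by_hour[:job_hours])
    best_total, best_start = window, 0
    # Slide the window one hour at a time: add the new hour, drop the old one.
    for start in range(1, len(intensity_by_hour) - job_hours + 1):
        window += intensity_by_hour[start + job_hours - 1] - intensity_by_hour[start - 1]
        if window < best_total:
            best_total, best_start = window, start
    return best_start


if __name__ == "__main__":
    forecast = [500, 400, 200, 100, 300, 600]  # invented values
    print(best_start_hour(forecast, job_hours=2))
```

When several windows tie, this sketch returns the earliest one.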

What's next

Architecture Framework: AI and ML perspective

This document in the Google Cloud Architecture Framework describes principles and recommendations to help you to design, build, and manage AI and ML workloads in Google Cloud that meet your operational, security, reliability, cost, and performance goals.

The target audience for this document includes decision makers, architects, administrators, developers, and operators who design, build, deploy, and maintain AI and ML workloads in Google Cloud.

The following pages describe principles and recommendations that are specific to AI and ML, for each pillar of the Google Cloud Architecture Framework:

Contributors

Authors:

Other contributors:

AI and ML perspective: Operational excellence

This document in the Architecture Framework: AI and ML perspective provides an overview of the principles and recommendations to help you to build and operate robust AI and ML systems on Google Cloud. These recommendations help you to set up foundational elements like observability, automation, and scalability. This document's recommendations align with the operational excellence pillar of the Architecture Framework.

Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the intricate AI and ML systems and pipelines that power your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that operations remain aligned with business goals.

Build a robust foundation for model development

Establish a robust foundation to streamline model development, from problem definition to deployment. Such a foundation ensures that your AI solutions are built on reliable and efficient components and choices. This kind of foundation helps you to release changes and improvements quickly and easily.

Consider the following recommendations:

  • Define the problem that the AI system solves and the outcome that you want.
  • Identify and gather relevant data that's required to train and evaluate your models. Then, clean and preprocess the raw data. Implement data validation checks to ensure data quality and integrity.
  • Choose the appropriate ML approach for the task. When you design the structure and parameters of the model, consider the model's complexity and computational requirements.
  • Adopt a version control system for code, model, and data.

Automate the model-development lifecycle

From data preparation and training to deployment and monitoring, automation helps you to improve the quality and efficiency of your operations. Automation enables seamless, repeatable, and error-free model development and deployment. Automation minimizes manual intervention, speeds up release cycles, and ensures consistency across environments.

Consider the following recommendations:

  • Use a managed pipeline orchestration system to orchestrate and automate the ML workflow. The pipeline must handle the major steps of your development lifecycle: preparation, training, deployment, and evaluation.
  • Implement CI/CD pipelines for the model-development lifecycle. These pipelines should automate the building, testing, and deployment of models. The pipelines should also include continuous training to retrain models on new data as needed.
  • Implement phased release approaches, such as canary deployments or A/B testing, for safe and controlled model releases.
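To make the idea of a canary deployment concrete, the following Python sketch routes a fixed percentage of requests to a new model version by hashing a request identifier. It's a minimal, standalone illustration: the function name and the "canary" and "stable" labels are assumptions, and a managed serving platform would typically provide traffic splitting for you.

```python
import hashlib


def route_model_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route a request to the 'canary' or 'stable' model version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID so that the same request (or user) always gets the same version.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    routed = [route_model_version(f"user-{i}", canary_percent=10) for i in range(1000)]
    print(routed.count("canary"), "of 1000 requests went to the canary")
```

Hash-based routing keeps each caller on one version for the duration of the rollout, which makes it easier to compare metrics between the two versions before you increase the percentage.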

Implement observability

When you implement observability, you can gain deep insights into model performance, data drift, and system health. Implement continuous monitoring, alerting, and logging mechanisms to proactively identify issues, trigger timely responses, and ensure operational continuity.

Consider the following recommendations:

  • Implement permanent and automated performance monitoring for your models. Use metrics and success criteria for ongoing evaluation of the model after deployment.
  • Monitor your deployment endpoints and infrastructure to ensure service availability.
  • Set up custom alerting based on business-specific thresholds and anomalies to ensure that issues are identified and resolved in a timely manner.
  • Use explainable AI techniques to understand and interpret model outputs.

Build a culture of operational excellence

Operational excellence is built on a foundation of people, culture, and professional practices. The success of your team and business depends on how effectively your organization implements methodologies that enable the reliable and rapid development of AI capabilities.

Consider the following recommendations:

  • Champion automation and standardization as core development methodologies. Streamline your workflows and manage the ML lifecycle efficiently by using MLOps techniques. Automate tasks to free up time for innovation, and standardize processes to support consistency and easier troubleshooting.
  • Prioritize continuous learning and improvement. Promote learning opportunities that team members can use to enhance their skills and stay current with AI and ML advancements. Encourage experimentation and conduct regular retrospectives to identify areas for improvement.
  • Cultivate a culture of accountability and ownership. Define clear roles so that everyone understands their contributions. Empower teams to make decisions within boundaries and track progress by using transparent metrics.
  • Embed AI ethics and safety into the culture. Prioritize responsible systems by integrating ethics considerations into every stage of the ML lifecycle. Establish clear ethics principles and foster open discussions about ethics-related challenges.

Design for scalability

Architect your AI solutions to handle growing data volumes and user demands. Use scalable infrastructure so that your models can adapt and perform optimally as your project expands.

Consider the following recommendations:

  • Plan for capacity and quotas. Anticipate future growth, and plan your infrastructure capacity and resource quotas accordingly.
  • Prepare for peak events. Ensure that your system can handle sudden spikes in traffic or workload during peak events.
  • Scale AI applications for production. Design for horizontal scaling to accommodate increases in the workload. Use frameworks like Ray on Vertex AI to parallelize tasks across multiple machines.
  • Use managed services where appropriate. Use services that help you to scale while minimizing the operational overhead and complexity of manual interventions.

Contributors

Authors:

Other contributors:

AI and ML perspective: Security

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Architecture Framework.

Secure deployment of AI and ML workloads is a critical requirement, particularly in enterprise environments. To meet this requirement, you need to adopt a holistic security approach that starts from the initial conceptualization of your AI and ML solutions and extends to development, deployment, and ongoing operations. Google Cloud offers robust tools and services that are designed to help secure your AI and ML workloads.

Define clear goals and requirements

It's easier to integrate the required security and compliance controls early in your design and development process than to add the controls after development. From the start of your design and development process, make decisions that are appropriate for your specific risk environment and your specific business priorities.

Consider the following recommendations:

  • Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.
  • Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.

Keep data secure and prevent loss or mishandling

Data is a valuable and sensitive asset that must be kept secure. Data security helps you to maintain user trust, support your business objectives, and meet your compliance requirements.

Consider the following recommendations:

  • Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data.
  • Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.
  • Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.
  • Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.
  • Guard against data poisoning for your ML training systems.
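The access-level recommendation above can be sketched as a deny-by-default lookup. In the following Python example, the role names, the dataset name, and the assumption that write access implies read access are all hypothetical. In practice, you would enforce these rules with your platform's access-management service rather than in application code.

```python
from enum import IntEnum


class Access(IntEnum):
    NONE = 0
    READ = 1
    WRITE = 2  # assumption of this sketch: write access implies read access


# Hypothetical role-to-dataset grants. Anything not listed is denied.
ROLE_ACCESS: dict[str, dict[str, Access]] = {
    "data_scientist": {"training_data": Access.READ},
    "pipeline_service": {"training_data": Access.WRITE},
}


def is_allowed(role: str, dataset: str, requested: Access) -> bool:
    """Return True only if the role was explicitly granted at least the requested level."""
    granted = ROLE_ACCESS.get(role, {}).get(dataset, Access.NONE)
    return granted >= requested


if __name__ == "__main__":
    print(is_allowed("data_scientist", "training_data", Access.READ))   # granted
    print(is_allowed("data_scientist", "training_data", Access.WRITE))  # denied
    print(is_allowed("intern", "training_data", Access.READ))           # unknown role, denied
```

Because unknown roles and unlisted datasets fall through to Access.NONE, new principals start with no access until a grant is added explicitly.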

Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.

Consider the following recommendations:

  • Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.
  • Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.
  • Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.

Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.

Consider the following recommendations:

  • Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.
  • Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.
  • Prefer using validated prebuilt container images that are specifically designed for AI workloads.

Protect and monitor inputs

AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.

Consider the following recommendations:

  • Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent.
  • Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.
  • Ensure that only the intended users of a deployed system can use it.

Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.

Consider the following recommendations:

  • Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.
  • Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.
  • Implement robust alerting and incident response procedures to address any potential issues.

Contributors

Authors:

Other contributors:

AI and ML perspective: Reliability

This document in the Architecture Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Architecture Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. You need AI and ML systems that are robust, reliable, and adaptable to meet the unique demands of both predictive ML and generative AI. To handle the complexities of MLOps, from development through deployment and continuous improvement, you need to use a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with Site Reliability Engineering (SRE) principles and provides a powerful foundation for reliable AI and ML systems.

Ensure that infrastructure is scalable and highly available

By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. This means that your AI services are still available to users during infrastructure outages and when traffic is very high.

Consider the following recommendations:

  • Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.
  • Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.
  • Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.
  • Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.

Use a modular and loosely coupled architecture

To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.

Consider the following recommendations:

  • Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.
  • Design the modules to be loosely coupled, with well-defined interfaces. This approach minimizes dependencies and lets you update or change a module independently, without affecting the entire system.
  • Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality.
  • Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.
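
The following Python sketch shows one way that the preceding recommendations might fit together: callers depend only on an interface, and a wrapper falls back to a simpler module when the primary module fails. All of the class names and the returned items are hypothetical.

```python
from typing import Protocol


class Recommender(Protocol):
    """Interface that callers depend on; implementations stay hidden."""

    def recommend(self, user_id: str) -> list[str]: ...


class ModelRecommender:
    """Hypothetical primary module that calls an ML model endpoint."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy

    def recommend(self, user_id: str) -> list[str]:
        if not self.healthy:
            raise ConnectionError("model endpoint unavailable")
        return [f"personalized-item-for-{user_id}"]


class PopularItemsRecommender:
    """Hypothetical fallback module that doesn't need a model."""

    def recommend(self, user_id: str) -> list[str]:
        return ["popular-item-1", "popular-item-2"]


class DegradingRecommender:
    """Tries the primary module and falls back when it fails."""

    def __init__(self, primary: Recommender, fallback: Recommender) -> None:
        self.primary = primary
        self.fallback = fallback

    def recommend(self, user_id: str) -> list[str]:
        try:
            return self.primary.recommend(user_id)
        except ConnectionError:
            # Degrade gracefully: serve a simpler but still useful result.
            return self.fallback.recommend(user_id)
```

Because both modules satisfy the same interface, you can replace either one without changing the code that calls `recommend`.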

Build an automated MLOps platform

With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.

Consider the following recommendations:

  • Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.
  • Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.
  • Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.
  • Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.
  • Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.
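
To make the validation of pipeline inputs and outputs more concrete, the following Python sketch checks records against simple type and range rules. The field names, the limits, and the churn-prediction scenario are assumptions for illustration. A production pipeline would typically use a dedicated data-validation tool and route violations to an alerting system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """Allowed type and numeric range for one field (hypothetical schema)."""
    kind: type
    minimum: float
    maximum: float


# Hypothetical rules for the inputs and outputs of a churn-prediction pipeline.
INPUT_RULES = {
    "tenure_months": FieldRule(int, 0, 600),
    "monthly_spend": FieldRule(float, 0.0, 10_000.0),
}
OUTPUT_RULES = {"churn_probability": FieldRule(float, 0.0, 1.0)}


def validate(record: dict, rules: dict[str, FieldRule]) -> list[str]:
    """Return a list of violations; an empty list means the record is valid."""
    problems = []
    for name, rule in rules.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, rule.kind) or isinstance(value, bool):
            problems.append(f"{name}: expected {rule.kind.__name__}")
        elif not rule.minimum <= value <= rule.maximum:
            problems.append(
                f"{name}: {value} outside [{rule.minimum}, {rule.maximum}]")
    return problems
```

For example, `validate({"churn_probability": 1.7}, OUTPUT_RULES)` returns one violation, which a pipeline could use to block the prediction and raise an alert.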

Maintain trust and control through data and model governance

The reliability of AI and ML systems depends on the trust and governance capabilities of your data and models. AI outputs can fail to meet expectations in silent ways. For example, an output might be well-formed and internally consistent, yet incorrect or unwanted. By implementing traceability and strong governance, you can help to ensure that the outputs are reliable and trustworthy.

Consider the following recommendations:

  • Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.
  • Implement strict access controls and audit trails to protect sensitive data and models.
  • Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.
  • Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.
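
As a minimal sketch of automated feature statistics and drift detection, the following Python code compares the mean of a new batch of feature values against a baseline. The mean-shift score and the threshold of 2.0 are deliberately simple assumptions. Real systems often use richer measures, such as statistical distribution tests.

```python
import statistics


def feature_stats(values: list[float]) -> dict[str, float]:
    """Summary statistics that a pipeline could store for each feature."""
    return {
        "count": float(len(values)),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),  # Requires at least two values.
    }


def mean_shift(baseline: dict[str, float], current: dict[str, float]) -> float:
    """Distance between the two means, in baseline standard deviations."""
    difference = abs(current["mean"] - baseline["mean"])
    if baseline["stdev"] == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / baseline["stdev"]


def has_drifted(baseline_values: list[float],
                current_values: list[float],
                threshold: float = 2.0) -> bool:
    """Flag a feature when its mean moves more than `threshold` deviations."""
    score = mean_shift(feature_stats(baseline_values),
                       feature_stats(current_values))
    return score > threshold


if __name__ == "__main__":
    baseline = [10, 12, 11, 9, 10, 11, 10, 9]
    print(has_drifted(baseline, [10, 11, 10, 12]))  # False
    print(has_drifted(baseline, [20, 21, 19, 22]))  # True
```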

Implement holistic AI and ML observability and reliability practices

To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.

Consider the following recommendations:

  • Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.
  • Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.
  • Establish model-specific metrics like accuracy or safety indicators in order to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.
  • Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).
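
To illustrate the SRE approach to SLOs, the following Python sketch computes how much of an error budget a service has consumed. The function and field names are hypothetical, and the request counts are made-up inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    availability: float      # Fraction of requests that succeeded.
    allowed_failures: float  # Failures that the SLO tolerates in the window.
    budget_consumed: float   # 1.0 means that the budget is fully spent.


def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> ErrorBudget:
    """Compare observed failures with the failures that an SLO allows."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("invalid request counts")
    allowed = (1 - slo_target) * total_requests
    return ErrorBudget(
        availability=(total_requests - failed_requests) / total_requests,
        allowed_failures=allowed,
        budget_consumed=failed_requests / allowed,
    )


if __name__ == "__main__":
    # A 99% SLO over 10,000 requests tolerates about 100 failures.
    report = error_budget(0.99, 10_000, 50)
    print(f"{report.availability:.3f} {report.budget_consumed:.0%}")
```

When `budget_consumed` approaches 1.0, a team might slow down releases and prioritize reliability work until the budget recovers.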

AI and ML perspective: Cost optimization

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline. The recommendations in this document align with the cost optimization pillar of the Architecture Framework.

AI and ML systems can help you to unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.

Define and measure costs and returns

To effectively manage your AI and ML costs in Google Cloud, you must define and measure the expenses for cloud resources and the business value of your AI and ML initiatives. Google Cloud provides comprehensive tools for billing and cost management to help you to track expenses granularly. Business value metrics that you can measure include customer satisfaction, revenue, and operational costs. By establishing concrete metrics for both costs and business value, you can make informed decisions about resource allocation and optimization.

Consider the following recommendations:

  • Establish clear business objectives and key performance indicators (KPIs) for your AI and ML projects.
  • Use the billing information provided by Google Cloud to implement cost monitoring and reporting processes that can help you to attribute costs to specific AI and ML activities.
  • Establish dashboards, alerting, and reporting systems to track costs and returns against KPIs.
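
The following Python sketch shows how labeled billing data could be used to attribute costs to ML activities and to derive a unit cost. The rows, the `ml-activity` label key, and the amounts are hypothetical and only loosely resemble a billing export.

```python
from collections import defaultdict

# Hypothetical rows, loosely shaped like a billing export with resource labels.
BILLING_ROWS = [
    {"cost": 120.0, "labels": {"ml-activity": "training"}},
    {"cost": 30.0, "labels": {"ml-activity": "training"}},
    {"cost": 45.5, "labels": {"ml-activity": "serving"}},
    {"cost": 4.5, "labels": {}},  # Unlabeled spend is surfaced, not hidden.
]


def cost_by_label(rows: list[dict], label_key: str) -> dict[str, float]:
    """Sum the cost for each value of the given label."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["labels"].get(label_key, "unlabeled")] += row["cost"]
    return dict(totals)


def cost_per_thousand_predictions(serving_cost: float,
                                  predictions: int) -> float:
    """Unit cost that can be tracked against a business KPI."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return serving_cost * 1000 / predictions


if __name__ == "__main__":
    print(cost_by_label(BILLING_ROWS, "ml-activity"))
    print(cost_per_thousand_predictions(45.5, 91_000))
```

Reporting the `unlabeled` total separately makes gaps in labeling visible, which helps you to improve cost attribution over time.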

Optimize resource allocation

To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. By carefully aligning resource allocation with the needs of your workloads, you can avoid unnecessary expenses and ensure that your AI and ML systems have the resources that they need to perform optimally.

Consider the following recommendations:

  • Use autoscaling to dynamically adjust resources for training and inference.
  • Start with small models and data. Save costs by testing hypotheses at a smaller scale when possible.
  • Discover your compute needs through experimentation. Rightsize the resources that are used for training and serving based on your ML requirements.
  • Adopt MLOps practices to reduce duplication, manual processes, and inefficient resource allocation.

Enforce data management and governance practices

Effective data management and governance practices play a critical role in cost optimization. Well-organized data helps your organization to avoid needless duplication, reduces the effort required to obtain high quality data, and encourages teams to reuse datasets. By proactively managing data, you can reduce storage costs, enhance data quality, and ensure that your ML models are trained and operate on the most relevant and valuable data.

Consider the following recommendations:

  • Establish and adopt a well-defined data governance framework.
  • Apply labels and relevant metadata to datasets at the point of data ingestion.
  • Ensure that datasets are discoverable and accessible across the organization.
  • Make your datasets and features reusable throughout the ML lifecycle wherever possible.

Automate and streamline with MLOps

A primary benefit of adopting MLOps practices is a reduction in costs, both from a technology perspective and in terms of personnel activities. Automation helps you to avoid duplication of ML activities and improve the productivity of data scientists and ML engineers.

Consider the following recommendations:

  • Increase the level of automation and standardization in your data collection and processing technologies to reduce development effort and time.
  • Develop automated training pipelines to reduce the need for manual interventions and increase engineer productivity. Implement mechanisms for the pipelines to reuse existing assets like prepared datasets and trained models.
  • Use the model evaluation and tuning services in Google Cloud to increase model performance with fewer iterations. This enables your AI and ML teams to achieve more objectives in less time.

Use managed services and pre-trained or existing models

There are many approaches to achieving business goals by using AI and ML. Adopt an incremental approach to model selection and model development. This helps you to avoid excessive costs that are associated with starting fresh every time. To control costs, start with a simple approach: use ML frameworks, managed services, and pre-trained models.

Consider the following recommendations:

  • Enable exploratory and quick ML experiments by using notebook environments.
  • Use existing and pre-trained models as a starting point to accelerate your model selection and development process.
  • Use managed services to train or serve your models. Both AutoML and managed custom model training services can help to reduce the cost of model training. Managed services can also help to reduce the cost of your model-serving infrastructure.

Foster a culture of cost awareness and continuous optimization

Cultivate a collaborative environment that encourages communication and regular reviews. This approach helps teams to identify and implement cost-saving opportunities throughout the ML lifecycle.

Consider the following recommendations:

  • Adopt FinOps principles across your ML lifecycle.
  • Ensure that all costs and business benefits of AI and ML projects have assigned owners with clear accountability.

AI and ML perspective: Performance optimization

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Architecture Framework.

AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.

To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, LLMs that are very large can be highly performant on massive training infrastructure, but very large models might not perform well in capacity-constrained environments like mobile devices.

Translate business goals to performance objectives

To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.

Consider the following recommendations:

  • Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.
  • Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and observe any deviations from business objectives.
  • Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.
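
The following Python sketch illustrates how business goals for a churn-prediction application might be expressed as technical objectives and checked automatically. The threshold values are hypothetical, and the percentile calculation uses the nearest-rank method.

```python
# Hypothetical objectives that are derived from a business goal.
OBJECTIVES = {"accuracy": 0.80, "recall": 0.70, "p95_latency_ms": 200.0}


def accuracy_and_recall(y_true: list[int],
                        y_pred: list[int]) -> tuple[float, float]:
    """Compute accuracy and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else 0.0
    return correct / len(pairs), recall


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # Ceiling of 0.95 * n.
    return ordered[rank - 1]


def unmet_objectives(y_true: list[int], y_pred: list[int],
                     latencies_ms: list[float]) -> list[str]:
    """Return the names of the objectives that the system doesn't meet."""
    accuracy, recall = accuracy_and_recall(y_true, y_pred)
    unmet = []
    if accuracy < OBJECTIVES["accuracy"]:
        unmet.append("accuracy")
    if recall < OBJECTIVES["recall"]:
        unmet.append("recall")
    if p95(latencies_ms) > OBJECTIVES["p95_latency_ms"]:
        unmet.append("p95_latency_ms")
    return unmet
```

Running the same check during experimentation, training, and serving gives your teams a consistent, reproducible view of whether the system meets its objectives.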

Run and track frequent experiments

To transform innovation and creativity into performance improvements, you need a culture and a platform that supports experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, you need to separate the experimentation space from your training and serving platforms. A standardized and robust experimentation process is important.

Consider the following recommendations:

  • Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.
  • Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.
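
As a minimal sketch of experiment tracking, the following Python code records the parameters and metrics of each run and selects the best run for a given metric. The class names and the metric values are hypothetical. A managed experiment-tracking service would typically replace this in-memory log.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: int
    params: dict
    metrics: dict


@dataclass
class ExperimentLog:
    """In-memory record of experiment runs (hypothetical)."""
    runs: list[Run] = field(default_factory=list)

    def record(self, params: dict, metrics: dict) -> Run:
        # Copy the inputs so that later changes don't alter the record.
        run = Run(len(self.runs) + 1, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run reports the metric {metric!r}")
        chooser = max if higher_is_better else min
        return chooser(candidates, key=lambda r: r.metrics[metric])


if __name__ == "__main__":
    log = ExperimentLog()
    log.record({"learning_rate": 0.1}, {"recall": 0.71, "latency_ms": 40.0})
    log.record({"learning_rate": 0.01}, {"recall": 0.78, "latency_ms": 55.0})
    print(log.best("recall").params)  # {'learning_rate': 0.01}
    print(log.best("latency_ms", higher_is_better=False).run_id)  # 1
```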

Build and automate training and serving services

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

  • Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
  • Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.
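
One simple form of a deployment and rollback strategy is a canary check, which compares a new model version against the current version on a small share of traffic. The following Python sketch is an illustration only: the function name, the minimum sample size, and the tolerated relative increase in error rate are all assumptions.

```python
def canary_decision(baseline_error_rate: float,
                    canary_error_rate: float,
                    canary_requests: int,
                    min_requests: int = 500,
                    max_relative_increase: float = 0.10) -> str:
    """Return "wait", "rollback", or "promote" for a canary deployment."""
    # Don't decide until the canary has served enough traffic.
    if canary_requests < min_requests:
        return "wait"
    # Roll back when the canary is noticeably worse than the baseline.
    limit = baseline_error_rate * (1 + max_relative_increase)
    if canary_error_rate > limit:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(canary_decision(0.02, 0.05, 100))    # wait
    print(canary_decision(0.02, 0.05, 1000))   # rollback
    print(canary_decision(0.02, 0.021, 1000))  # promote
```

The same pattern can be applied to model-specific metrics, such as prediction quality, in addition to error rates.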

Match design choices to performance requirements

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself.

Consider the following recommendations:

  • Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
  • Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
  • Choose appropriate training strategies: Several pre-trained and foundational models are available, and new models are released frequently. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
  • Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.
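
To make the idea of diminishing returns concrete, the following Python sketch decides whether to keep pursuing an optimization strategy based on the recent gains in a metric. The function name, the `min_gain` threshold, and the `patience` window are hypothetical, and the sketch assumes that higher metric values are better.

```python
def should_continue(metric_history: list[float],
                    min_gain: float,
                    patience: int = 2) -> bool:
    """Continue while any of the last `patience` iterations gained enough."""
    # With too little history, keep going.
    if len(metric_history) <= patience:
        return True
    recent = metric_history[-(patience + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return any(gain >= min_gain for gain in gains)


if __name__ == "__main__":
    # Early iterations still deliver large gains.
    print(should_continue([0.70, 0.78, 0.82], min_gain=0.01))  # True
    # The last two iterations gained less than 0.01 each.
    print(should_continue([0.70, 0.78, 0.82, 0.825, 0.826], min_gain=0.01))  # False
```

In practice, you would set `min_gain` based on the smallest improvement that provides measurable business value.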

Link design choices to performance outcomes

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

  • Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.
  • Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.
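
The following Python sketch shows one way to link model outputs to model versions, and model versions to the data, configuration, and code that produced them. The in-memory store and all of the names are hypothetical. A production system would typically use a managed metadata store.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(obj: object) -> str:
    """Short, stable hash of any JSON-serializable object."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


@dataclass(frozen=True)
class ModelLineage:
    model_version: str
    code_revision: str
    data_fingerprint: str
    config_fingerprint: str


class LineageStore:
    """In-memory stand-in for a metadata store (hypothetical)."""

    def __init__(self) -> None:
        self._models: dict[str, ModelLineage] = {}
        self._predictions: dict[str, dict] = {}

    def register_model(self, version: str, code_revision: str,
                       training_rows: list, config: dict) -> ModelLineage:
        lineage = ModelLineage(version, code_revision,
                               fingerprint(training_rows), fingerprint(config))
        self._models[version] = lineage
        return lineage

    def log_prediction(self, prediction_id: str, model_version: str,
                       features: dict, output: float) -> None:
        if model_version not in self._models:
            raise KeyError(f"unknown model version: {model_version}")
        self._predictions[prediction_id] = {
            "model_version": model_version,
            "input_fingerprint": fingerprint(features),
            "output": output,
        }

    def trace(self, prediction_id: str) -> dict:
        """Return a prediction together with the lineage of its model."""
        prediction = self._predictions[prediction_id]
        lineage = self._models[prediction["model_version"]]
        return {**prediction, "lineage": lineage}
```

With records like these, calling `trace` for an unexpected output identifies the model version, training data, configuration, and code revision to investigate.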
