The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS. By using the Framework you will learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way to consistently measure your architectures against best practices and identify areas for improvement. We believe that having well-architected systems greatly increases the likelihood of business success.

Introduction

When architecting solutions, you make trade-offs between pillars based on your business context. Security and operational excellence are generally not traded off against the other pillars.

Definitions

Operational Excellence

The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.

Security

The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

Reliability

The ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Performance Efficiency

The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.

Cost Optimization

The ability to run systems to deliver business value at the lowest price point.

General Design Principles

  • Stop guessing your capacity needs

  • Test systems at production scale

  • Automate to make architectural experimentation easier

  • Allow for evolutionary architectures

  • Drive architectures using data

  • Improve through game days

Operational Excellence

Design Principles

  • Perform operations as code

  • Annotate documentation (after every build)

  • Make frequent, small, reversible changes

  • Refine operations procedures frequently

  • Anticipate failure

  • Learn from all operational failures

Prepare

Operational Priorities

Your teams need to have a shared understanding of your entire workload, their role in it, and shared business goals in order to set the priorities that will enable business success. You also need to consider external regulatory and compliance requirements that may influence your priorities. Use your priorities to focus your operations improvement efforts where they will have the greatest impact (for example, developing team skills, improving workload performance, automating runbooks, or enhancing monitoring). Update your priorities as needs change.

Key AWS Services
  • AWS Cloud Compliance

  • AWS Trusted Advisor

  • Business Support

  • Enterprise Support

Design for Operations

The design of your workload should include how it will be deployed, updated, and operated. You will want to implement engineering practices that align with defect reduction and quick and safe fixes. To understand what is happening inside your architecture, you will need to enable observation with logging, instrumentation, and insightful business and technical metrics.

In AWS, you can view your entire workload (applications, infrastructure, policy, governance, and operations) as code.
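To make "workload as code" concrete, here is a minimal sketch that builds a small AWS CloudFormation template as a plain Python dictionary and renders it as JSON. The function names `build_template` and `render`, the logical ID `LogBucket`, and the choice of a single versioned S3 bucket are illustrative assumptions, not something these notes prescribe; only the template keys and the `AWS::S3::Bucket` resource type come from CloudFormation itself.

```python
import json


def build_template(bucket_logical_id: str = "LogBucket") -> dict:
    """Return a minimal CloudFormation template as a plain dict.

    The logical ID and the single-bucket layout are hypothetical examples.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one versioned S3 bucket.",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Versioning keeps earlier object versions recoverable.
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialize the template to JSON with stable key order for clean diffs."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_template()))
```

Because the template is ordinary data, it can be kept in version control, reviewed, and tested like any other source file before it is deployed.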

Key AWS Services
  • AWS CloudFormation

  • AWS Developer Tools

  • AWS X-Ray

Operational Readiness

You should use a consistent process (including checklists) to know when you are ready to go live with your workload.
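One way to keep a go-live checklist consistent is to express it as code. The sketch below is only an illustration under assumed names: `Check`, `evaluate`, and the two sample checklist items are hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Check:
    """One checklist item: a label plus a callable that reports pass/fail."""
    name: str
    passed: Callable[[], bool]


def evaluate(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; the workload is ready only if none of them fail."""
    failures = [check.name for check in checks if not check.passed()]
    return (not failures, failures)


if __name__ == "__main__":
    # Hypothetical readiness state; in practice each check would query a
    # real source of truth (a config audit, a monitoring system, and so on).
    state = {"runbooks_reviewed": True, "alarms_configured": False}
    checks = [
        Check("Runbooks reviewed", lambda: state["runbooks_reviewed"]),
        Check("Alarms configured", lambda: state["alarms_configured"]),
    ]
    ready, failures = evaluate(checks)
    print("ready" if ready else f"not ready: {failures}")
```

Running the same list of checks before every launch gives each team the same definition of "ready" and makes the failing items explicit.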

Key AWS Services
  • AWS Config

  • AWS Systems Manager

Operate

Understanding Operational Health

Key AWS Services
  • Amazon CloudWatch Logs

  • Amazon ES

  • Personal Health Dashboard

  • Service Health Dashboard

Responding to Events

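You should anticipate operational events, both planned (for example, sales promotions, deployments, and failure tests) and unplanned (for example, surges in utilization and component failures).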
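Handling planned and unplanned events with predefined responses can be sketched as a lookup from event type to runbook, with escalation as the fallback. Everything named here (the event types, the `respond` function, and the runbook functions) is a hypothetical illustration; a real setup would typically wire monitoring alarms and notifications to automation rather than call functions directly.

```python
from typing import Callable, Dict

Event = Dict[str, str]
Runbook = Callable[[Event], str]


def scale_out(event: Event) -> str:
    """Hypothetical runbook for a surge in utilization."""
    return f"scale out {event['resource']}"


def restart_component(event: Event) -> str:
    """Hypothetical runbook for a failed component."""
    return f"restart {event['resource']}"


def escalate(event: Event) -> str:
    """Fallback when no runbook exists: hand the event to a person."""
    return f"page on-call for {event['type']} on {event['resource']}"


# Known event types map to a predefined response.
RUNBOOKS: Dict[str, Runbook] = {
    "utilization_surge": scale_out,
    "component_failure": restart_component,
}


def respond(event: Event) -> str:
    """Pick the runbook for the event type, escalating if none is defined."""
    handler = RUNBOOKS.get(event["type"], escalate)
    return handler(event)


if __name__ == "__main__":
    print(respond({"type": "utilization_surge", "resource": "web-tier"}))
    print(respond({"type": "disk_full", "resource": "db-1"}))
```

The explicit fallback matters: an event with no predefined response is still routed somewhere instead of being dropped silently.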

Key AWS Services
  • Amazon CloudWatch

  • Amazon CloudWatch Events

  • Amazon SNS

  • Auto Scaling

  • AWS Systems Manager

Evolve

Learning from Experience

Key AWS Services
  • Amazon QuickSight

  • Amazon Athena

  • Amazon S3

Share Learnings

You should share what your teams learn to increase the benefit across your organization. You will want to share information and resources to prevent avoidable errors and ease development efforts. This will allow you to focus on delivering features.

Key AWS Services
  • Amazon SNS

  • AWS CodeCommit

  • AWS Lambda

  • AWS CloudFormation

  • Amazon Machine Images (AMIs)

Conclusion

Operational excellence is an ongoing effort. Every operational event and failure should be treated as an opportunity to improve the operations of your architecture. By understanding the needs of your workloads, predefining runbooks for routine activities and playbooks to guide issue resolution, using the operations-as-code features in AWS, and maintaining situational awareness, your operations will be ready and responsive when events occur. By focusing on incremental improvement based on operational priorities and on lessons learned from event response and retrospective analysis, you will enable the success of your business by increasing the efficiency and effectiveness of your operations.