Cloud Operational Excellence Assessment

Discover how to strengthen your operational excellence with the AWS Well-Architected Framework. Use my online questionnaire to evaluate risks, align with best practices, and visualize results.

29.04.2025

Operational Excellence Cloud Assessment — Photo by Jametlene Reskp on Unsplash

This post was previously published on LinkedIn.com.

If the foundation of a software system is not solid, structural problems can compromise its integrity and functionality. The AWS Well-Architected Framework provides a set of best practices for evaluating architectures, along with a set of questions to assess how well an architecture aligns with them. The framework is based on six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability.

In the upcoming sections, we'll dive deeper into the pillar of operational excellence.

Resources

AWS Well-Architected Framework, AWS

Operational Excellence

Operational excellence is a commitment to build software correctly while consistently delivering a great customer experience. The following are design principles for operational excellence in the cloud:

Organize teams around business outcomes: The operating model uses people, process, and technology capabilities to scale and optimize for productivity. The organization's long-term vision is translated into goals that are communicated across the enterprise to stakeholders and consumers of your cloud services. Goals and operational KPIs are aligned at all levels.
Implement observability for actionable insights: Gain a comprehensive understanding of workload behaviour, performance, reliability, cost, and health. Establish key performance indicators and leverage observability telemetry to make informed decisions and take prompt action when business outcomes are at risk.
Safely automate where possible: Define your workload and its operations as code. Automate your workload’s operations by initiating them in response to events. Employ automation safety by configuring guardrails, including rate control, error thresholds, and approvals.
Make frequent, small, reversible changes: Design workloads that are scalable and loosely coupled to permit components to be updated regularly. Automated deployment techniques together with smaller, incremental changes reduce the blast radius and allow for faster reversal when failures occur.
Refine operations procedures frequently: As you evolve your workloads, evolve your operations appropriately. Hold regular reviews and validate that all procedures are effective and that teams are familiar with them. Where gaps are identified, update procedures accordingly.
Anticipate failure: Maximize operational success by driving failure scenarios to understand the workload’s risk profile and its impact on your business outcomes. Test the effectiveness of your procedures and your team’s response against these simulated failures.
Learn from all operational events and metrics: Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
Use managed services: Reduce operational burden by using managed services where possible. Build operational procedures around interactions with those services.
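
The "Safely automate where possible" principle names three guardrails: rate control, error thresholds, and approvals. The following is a minimal sketch of how such guardrails might wrap an automated operation. It is not taken from AWS; the `Guardrails` class, the `run_automation` function, and all default values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Guardrails:
    """Hypothetical guardrail settings; names and defaults are illustrative."""
    max_actions_per_run: int = 5    # rate control: cap the work done in one run
    max_error_ratio: float = 0.2    # error threshold: halt when exceeded
    require_approval: bool = True   # approval gate before anything runs


def run_automation(targets, action, guardrails, approved=False):
    """Apply `action` to each target, stopping as soon as a guardrail trips.

    Returns a (status, processed_targets) tuple.
    """
    if guardrails.require_approval and not approved:
        return "blocked: approval required", []

    processed, errors = [], 0
    for target in targets:
        # Rate control: stop once the per-run cap is reached.
        if len(processed) >= guardrails.max_actions_per_run:
            return "paused: rate limit reached", processed
        try:
            action(target)
        except Exception:
            errors += 1
        processed.append(target)
        # Error threshold: compare the failure ratio so far with the limit.
        if errors / len(processed) > guardrails.max_error_ratio:
            return "halted: error threshold exceeded", processed
    return "completed", processed
```

In this sketch a single early failure already exceeds the 20% ratio and halts the run. A real automation would likely wait for a minimum sample size before evaluating the threshold.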

There are four best practice areas for operational excellence in the cloud:

Organization: Your organization’s leadership defines business objectives. Your organization must understand requirements and priorities and use these to organize and conduct work to support the achievement of business outcomes.
Prepare: Your workload must emit the information necessary to support it. Implementing services to achieve integration, deployment, and delivery of your workload will create an increased flow of beneficial changes into production by automating repetitive processes.
Operate: There may be risks inherent in the operation of your workload. Understand those risks and make an informed decision to enter production. Your teams must be able to support your workload. Business and operational metrics derived from desired business outcomes will permit you to understand the health of your workload and your operations activities, and to respond to incidents.
Evolve: Your priorities will change as your business needs and business environment change. Use these changes as a feedback loop to continually drive improvement for your organization and the operation of your workload.

The Review Process

The review should be a lightweight, conversational process (measured in hours, not days) to identify issues that could be improved. The outcome is a set of actions to improve the experience of a customer using the workload.

Ownership: Every team takes responsibility for the quality of its architecture. Instead of holding rare, formal review meetings, architects should continually review their architecture. This continuous approach lets teams update their answers and improve the architecture as it evolves.
Checkpoints: Apply reviews at key milestones in the product lifecycle: early on in design (to avoid irreversible “one‑way doors”) and then before go‑live. After launch, treat architecture as an evolving artifact: use regular hygiene processes whenever you make significant changes.
Prioritise by Impact: After a review, you should have a list of issues that can be prioritised by business impact and by their impact on the team's day-to-day work. Addressing issues early frees up time to work on creating business value. As you address issues, update the review to track improvements over time.
Organizational learning: By aggregating multiple reviews, you might identify recurring themes, along with mechanisms and training to address them.
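
To make "Prioritise by Impact" concrete, here is a minimal sketch of ranking open review issues. The `Issue` fields, the 1 to 5 scoring scales, and the simple sum used for ranking are assumptions made for illustration; the framework does not prescribe a scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    question: str           # placeholder label for the framework question
    business_impact: int    # assumed scale: 1 (low) to 5 (high)
    team_impact: int        # assumed scale: 1 (low) to 5 (high)
    resolved: bool = False  # set to True as issues are addressed


def prioritise(issues):
    """Return unresolved issues, highest combined impact first."""
    open_issues = [issue for issue in issues if not issue.resolved]
    return sorted(
        open_issues,
        key=lambda issue: issue.business_impact + issue.team_impact,
        reverse=True,
    )


# Hypothetical review outcome; the labels are placeholders, not real questions.
review = [
    Issue("Question A", business_impact=2, team_impact=3),
    Issue("Question B", business_impact=5, team_impact=4),
    Issue("Question C", business_impact=4, team_impact=1, resolved=True),
]

for issue in prioritise(review):
    print(issue.question, issue.business_impact + issue.team_impact)
```

Marking an issue as resolved and re-running `prioritise` is one simple way to keep the list in step with the review as improvements land.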

To support the review process, I have created an online assessment questionnaire that includes all framework questions and links to the detailed documentation. You can filter questions by risk level, assess each question, and then display the result as a chart. Good luck and have fun improving your operational excellence!
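
As a rough illustration of filtering by risk level and preparing data for a chart, consider the sketch below. It does not reflect how the online assessment is actually implemented; the answer format, the risk labels, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical assessed answers: (best-practice area, risk level).
answers = [
    ("Organization", "high"),
    ("Prepare", "none"),
    ("Prepare", "medium"),
    ("Operate", "high"),
    ("Evolve", "none"),
]


def filter_by_risk(answers, level):
    """Keep only the answers assessed at the given risk level."""
    return [answer for answer in answers if answer[1] == level]


def risk_counts_by_area(answers):
    """Count risk levels per best-practice area, ready to feed into a chart."""
    counts = {}
    for area, risk in answers:
        counts.setdefault(area, Counter())[risk] += 1
    return counts


print(filter_by_risk(answers, "high"))
print(risk_counts_by_area(answers))
```

Grouping by the four best-practice areas is one possible way to slice the results; grouping by individual question would work the same way.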

Resources

Online assessment, Lukas Akermann
The review process, AWS
