AeSOP was born out of the belief that modern software development, especially in DevOps cultures, has far outpaced our traditional OPs silos and solutions. We have increased automation and achieved sustained, better-controlled speed, but only up until deployment. We have failed to address the “day 2” of operations: the continuous nature of “running a service”, and how to ensure the best user experience and “business availability”.

What do we mean by “business availability”?

Broadly speaking, business availability is the number of business-relevant operations that were served within acceptable ranges. With AeSOP, these ranges can be determined automatically from historical data or defined by the user. With this data, we automatically ensure the best trade-off between scaling infrastructure and maximizing business availability.
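To make the definition concrete, here is a minimal Python sketch of how such a figure could be computed. It is not AeSOP's implementation. The `Operation` record, the latency-only notion of an “acceptable range” and the choice of the 95th percentile of historical latencies as the automatic bound are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Operation:
    """One business-relevant operation (hypothetical record shape)."""
    name: str           # e.g. "checkout"
    latency_ms: float   # how long the operation took to serve
    succeeded: bool     # whether it completed without error


def acceptable_latency_from_history(history_ms: list[float]) -> float:
    """Derive an upper latency bound from historical data.

    Assumption: the 95th percentile of past latencies is the bound.
    Needs at least two historical data points.
    """
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    return quantiles(history_ms, n=20)[-1]


def business_availability(ops: list[Operation], max_latency_ms: float) -> tuple[int, float]:
    """Return (count, share) of operations served within the acceptable range."""
    served = sum(
        1 for op in ops if op.succeeded and op.latency_ms <= max_latency_ms
    )
    # With no traffic there is nothing to fail, so report full availability.
    share = served / len(ops) if ops else 1.0
    return served, share
```

The bound can either be passed in directly (user defined) or taken from `acceptable_latency_from_history` (derived from history), which mirrors the two options described above.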

What’s the cost of not tracking availability in business terms?

Traditionally, we have tracked availability as a set of operational metrics with no relation to business outcomes: CPU, RAM, network and storage. These metrics clearly lack context. Some more advanced teams use the four golden signals of SRE (latency, traffic, errors, saturation). Even with these approaches, we see a huge burden on people, as both systemic and business complexity create a high cognitive load. This is exacerbated when SRE/OPs teams are tasked with managing multi-team, multi-component systems.

As a result, even with modern development practices (such as adopting Infrastructure as Code and Cloud), we still see:

• Massive OPs and SRE structures.

• Heavy reliance on the “debug heroes” who can accurately diagnose and “plug the leak”.

• Underserved and rarely updated dashboards for business-relevant metrics.

• Lack of agility in adding, updating and retiring observability.

• Non-contextual metrics, telemetry and logs: what does 50% CPU mean for user experience? Would 75% CPU drive a better user experience and, ultimately, better business?

So how do we solve this?

We monitor existing systems in a non-invasive, declarative way. We can track most of the disturbances that a system is subject to during its operation. These continuous disturbances are both internal (new deployments, config changes) and external (seasonal load, unavailability of third-party APIs). This creates contextual data per deployment, as well as historical expectations. If a “simple config change” is deployed, we track it as a new deployment.
Whether your business-relevant metrics come from an API subset of a monolith or from a fully instrumented Service Mesh, we track them in context. In a multi-workflow, microservice world, this is one of our key differentiators.
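As an illustration of “contextual data per deployment”, the sketch below attributes each observed sample to whichever deployment was active at that moment, and treats a config change as just another deployment. It is a simplified Python sketch, not AeSOP's actual mechanism. The `Deployment` and `Sample` types, the timestamps and the labels are hypothetical, and labels are assumed to be unique.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    started_at: float   # epoch seconds when this deployment became active
    label: str          # e.g. "v1.4.0" or "config-change" (hypothetical)


@dataclass(frozen=True)
class Sample:
    at: float           # epoch seconds when the operation was observed
    within_range: bool  # was the operation served within acceptable ranges?


def availability_per_deployment(
    deployments: list[Deployment], samples: list[Sample]
) -> dict[str, float]:
    """Map each deployment label to the share of acceptable samples
    observed while that deployment was the active one."""
    ordered = sorted(deployments, key=lambda d: d.started_at)
    starts = [d.started_at for d in ordered]
    totals = defaultdict(lambda: [0, 0])  # label -> [acceptable, total]
    for sample in samples:
        # Index of the latest deployment that started at or before the sample.
        idx = bisect_right(starts, sample.at) - 1
        if idx < 0:
            continue  # sample predates the first known deployment
        counts = totals[ordered[idx].label]
        counts[0] += int(sample.within_range)
        counts[1] += 1
    return {label: ok / total for label, (ok, total) in totals.items()}
```

Because every change opens a new window, a drop in availability lines up with the deployment or config change that preceded it, and earlier windows serve as the historical expectation.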

Who is this for?

AeSOP is, first of all, for people: the people who are overwhelmed by the cognitive load of microservices, the people who carry the toil of maintaining monolithic deployments for legacy systems, and the people versed in the “dark magic” of debugging without quality incident descriptions and logs.
AeSOP is “Observability for all”. Legacy, Cloud-Native, Serverless: all are welcome with AeSOP. We believe it’s about time we made healthier, more automated operational decisions.

AeSOP (Autonomic Service Operation) can broadly be characterised as an Observability and AIOps solution. AeSOP tracks user experience and business availability, for each deployment, throughout its active life. All metrics can be used regardless of source. We believe that business availability is a fundamental part of your software and should be declarative, just as you would declare your application MANIFEST in any language. Even if you don’t have a Service Mesh, microservices, or tools such as Prometheus, Nagios, Datadog or New Relic, AeSOP can still track business availability and, most importantly, determine potential culprits, react and maintain availability.

[AHF] António Howcroft Ferreira

t: +351 913 659 039
e: antonio.howcroft@fiercely.pt