AeSOP was born out of the belief that modern software
development, especially with DevOps cultures, has far
outpaced our traditional OPs silos and solutions. We have
increased automation, achieving sustained and better
controlled speed, but only up until deployment. We have
failed to address the “day 2” of operations. We have
failed to address the continuous nature of “running a
service” and how to ensure the best user experience and
“business availability”.
What do we mean by “business availability”?
Broadly speaking, business availability is the number of
business-relevant operations that were served within
acceptable ranges. With AeSOP, these ranges can be
determined automatically from historical data or defined
by the user. With this data, we automatically seek the
best trade-off between scaling infrastructure and
maximizing business availability.
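To make the definition concrete, here is a minimal Python sketch. It is not AeSOP's implementation: the `Operation` record, the choice of latency as the "acceptable range", the 95th-percentile rule for deriving a bound from history, and reporting the result as a ratio rather than a raw count are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Operation:
    """Hypothetical record of one business-relevant operation."""
    name: str          # e.g. "checkout"
    latency_ms: float  # observed response time
    ok: bool           # whether the operation succeeded


def threshold_from_history(latencies_ms, percentile=95):
    """Derive an acceptable latency bound from historical data.

    Assumption: the chosen percentile of past latencies is "acceptable".
    Needs at least two historical data points.
    """
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[percentile - 1]


def business_availability(ops, max_latency_ms):
    """Ratio of operations that succeeded within the acceptable range."""
    if not ops:
        return None  # nothing was requested, so there is nothing to rate
    served = sum(1 for op in ops if op.ok and op.latency_ms <= max_latency_ms)
    return served / len(ops)
```

The bound passed as `max_latency_ms` can come either from `threshold_from_history` or from a user-defined value, mirroring the two options described above.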
What’s the cost of not tracking availability in business
terms?
Traditionally we have tracked availability as a set of
operational metrics with no relation to business outcomes:
CPU, RAM, Network and Storage. These metrics clearly lack
context. Some more advanced teams use the four golden signals
of SRE (latency, traffic, errors, saturation). Even with
these more advanced approaches, we see a huge burden on
people as both systemic and business complexity create a
high cognitive load. This is exacerbated when SRE/OPs
teams are tasked with managing multi-team, multi-component
systems.
As a result, even with modern development practices (such
as adopting Infrastructure as Code and Cloud), we still see:
• Massive OPs and SRE structures.
• Heavy reliance on the “debug heroes” who can accurately
diagnose and “plug the leak”.
• Underserved and rarely updated dashboards for
business-relevant metrics.
• Lack of agility in adding, updating and retiring
observability.
• Non-contextual metrics, telemetry and logs: What does
50% CPU mean for user experience? Would 75% CPU drive
better user experience and, ultimately, better business?
So how do we solve this?
We monitor existing systems in a non-invasive, declarative
way. We can track most of the disturbances that a system
is subject to during its operation. These continuous
disturbances are both internal (new deployments, config
changes) and external (seasonal load, third-party API
unavailability). This creates contextual data per
deployment, but also historical expectations. If a “simple
config change” is deployed, we track this as a new
deployment.
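As an illustration of per-deployment context, the sketch below attributes each metric sample to whichever deployment was live when the sample was taken, treating a config change as a deployment of its own. The function name, the tuple layouts and the deployment identifiers are hypothetical and are not taken from AeSOP.

```python
from bisect import bisect_right
from collections import defaultdict


def group_by_deployment(deployments, samples):
    """Attribute metric samples to the deployment active at sample time.

    deployments: list of (start_timestamp, deployment_id), sorted by start.
                 A config change appears as its own entry.
    samples:     list of (timestamp, value).
    """
    starts = [start for start, _ in deployments]
    grouped = defaultdict(list)
    for ts, value in samples:
        # Index of the last deployment that started at or before ts.
        idx = bisect_right(starts, ts) - 1
        if idx < 0:
            continue  # sample predates the first known deployment
        grouped[deployments[idx][1]].append(value)
    return dict(grouped)
```

Grouping this way keeps the data for each deployment separate, so a regression that appears right after a “simple config change” shows up against that change rather than being averaged into the previous release.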
Whether your business-relevant metrics come from an API
subset of a monolith or a fully instrumented Service
Mesh, we track them in context.
In a multi-workflow, microservice world, this is one of
our key differentiators.
Who is this for?
AeSOP is first and foremost for people: those overwhelmed
by the cognitive load of microservices, those carrying the
toil of maintaining monolithic deployments for legacy
systems, and those versed in the “dark magic” of
debugging without quality incident descriptions or
logs.
AeSOP is “Observability for all”. Legacy, Cloud-Native,
Serverless. All are welcome with AeSOP. We believe it’s
about time we make healthier, more automated operational
decisions.
AeSOP (Autonomic Service Operation) can broadly be
characterised as an Observability and AIOps solution.
AeSOP tracks user experience and business availability,
for each deployment, throughout its active life. All
metrics can be used regardless of source. We believe that
business availability is a fundamental part of your
software, and should be declarative, just as you would
declare your application MANIFEST in any language. Even if
you don’t have a Service Mesh, microservices, or tooling
such as Prometheus, Nagios, Datadog or New Relic, AeSOP
can still track business availability and, most
importantly, determine potential culprits, react and
maintain availability.