For operational excellence, a production workload must emit the information needed to support it. This information is used to quantify Service Level Indicators (SLIs) for reliability, security, performance and so on. Unpredictable behaviour in user-facing or storage systems must be corrected promptly to meet the agreed Service Level Objectives (SLOs). A system that does not expose enough information to determine its health and behaviour is considered unobservable, and is therefore difficult for an operator to support.
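As a rough illustration of how emitted data turns into an SLI that can be compared against an SLO, here is a minimal sketch. The function names and the request counts are assumptions made for this example, and availability is only one of many possible indicators.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests served successfully in the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """True when the measured indicator is at or above the agreed objective."""
    return sli >= slo_target


# Hypothetical window: 10,000 requests, 10 of which failed.
sli = availability_sli(good_requests=9990, total_requests=10000)
print(meets_slo(sli, slo_target=0.999))   # objective of 99.9% availability
print(meets_slo(sli, slo_target=0.9995))  # a stricter objective
```

Without the underlying request counts, neither number can be computed, which is the practical meaning of an unobservable system.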
Application instrumentation as a solution
Observability is the practice of instrumenting systems with tools to gather actionable data. This helps teams not only observe and detect symptoms but also understand the underlying causes of issues. Instrumentation lets a system report on its overall health through telemetry.
Telemetry falls into three major categories: traces, metrics and logs, which are collected at runtime around the different cross-cutting concerns of a system. These cross-cutting concerns are aspects of the system that are identified during system design and managed at runtime with the help of an instrumentation agent, or by the application server hosting the code, such as Tomcat or GlassFish. To enable or improve the observability of a system, all application components, not just critical services, must be instrumented with observability in mind, so that the telemetry tells the entire story.
Aspect Oriented Programming (AOP) for Instrumentation
Frameworks based on Aspect Oriented Programming (AOP) are used to generate the instrumentation and weave it into the application code. The application developer concentrates on the application code and leaves the responsibility for instrumentation to the aspect code. An aspect packages a cross-cutting concern as advice, the code to run, together with pointcuts, the expressions that select the points in the application (join points) where that advice applies. The aspect weaver connects the aspect code with the application code; depending on the framework, weaving happens at compile time, at load time or at runtime. This modular approach keeps application code and aspect code cleanly isolated. It also makes the aspect code reusable, simplifies maintenance and provides much-needed insight into the executing code.
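The separation described above can be approximated without an AOP framework. The sketch below uses a plain Python decorator as a stand-in for "around" advice. It is an analogy rather than a real weaver: the decorator is applied explicitly instead of being selected by a pointcut, and the names `instrumented`, `CALL_RECORDS` and `place_order` are hypothetical.

```python
import functools
import time

# Hypothetical in-memory sink; a real system would export to a telemetry backend.
CALL_RECORDS = []


def instrumented(func):
    """Advice-like wrapper: records the duration and outcome of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            outcome = "error"
            raise  # instrumentation must not swallow application errors
        finally:
            CALL_RECORDS.append({
                "name": func.__name__,
                "outcome": outcome,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


@instrumented
def place_order(item, quantity):
    # Business logic only; no telemetry code appears here.
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"item": item, "quantity": quantity}
```

Calling `place_order("book", 2)` returns the order as usual and appends one record to `CALL_RECORDS`. The application function stays free of measurement code, and the same wrapper can be reused on other functions.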
When a system can externalize its state, monitoring can help operators understand and predict when the system is likely to break and why. An alerting mechanism based on the system’s state enables timely human intervention to determine the real problem and take steps to mitigate it. With instrumentation woven into the application code, white-box monitoring of the system’s internals becomes possible, which helps focus on causes rather than just symptoms. The collected telemetry also supports effective debugging of imminent problems. With full-stack observability and white-box monitoring, a software team can deliver high-quality software at speed, see real-time performance, and build a culture of innovation.
Optimising the instrumentation
Application instrumentation is a non-functional requirement and incurs additional implementation cost. That cost depends on the level of instrumentation needed and on how much of it can be automated.
Most programming languages and logging frameworks support writing logs at different log levels, and the overhead of writing these logs is typically low. Brendan Gregg’s USE method suggests instrumenting each resource to capture three key factors: utilization, saturation and errors. Utilization can be understood as “the average time that the resource was busy servicing work”, saturation can be described as “the degree to which the resource has extra work which it can’t service, and is often queued”, and errors are a count of error events.
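To make the three USE factors concrete, the following sketch summarizes a series of samples taken from a single resource such as a worker pool. The `ResourceSample` fields and the sample values are assumptions for illustration; how busy time and queue length are actually obtained depends on the resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    busy_seconds: float      # time spent servicing work during the interval
    interval_seconds: float  # length of the sampling interval
    queued_items: int        # work waiting when the sample was taken
    errors: int              # error events seen during the interval


def use_summary(samples):
    """Summarize utilization, saturation and errors over all samples."""
    total_interval = sum(s.interval_seconds for s in samples)
    if not samples or total_interval <= 0:
        raise ValueError("need at least one sample with a positive interval")
    return {
        # Share of the observed time in which the resource was busy.
        "utilization": sum(s.busy_seconds for s in samples) / total_interval,
        # Average backlog, used here as a simple proxy for saturation.
        "saturation": sum(s.queued_items for s in samples) / len(samples),
        "errors": sum(s.errors for s in samples),
    }


samples = [
    ResourceSample(busy_seconds=3.0, interval_seconds=10.0, queued_items=0, errors=0),
    ResourceSample(busy_seconds=9.0, interval_seconds=10.0, queued_items=4, errors=1),
]
summary = use_summary(samples)
```

For these two samples the resource was busy for 12 of 20 seconds, carried an average backlog of two items, and reported one error.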
In a complex distributed environment, dozens of services may call one another, each generating metrics, traces and logs. If these calls are not correlated, the collected telemetry ends up in silos and is of little use for root-cause analysis. Distributed tracing addresses this by showing how services connect and how requests flow between them. Each request is assigned a globally unique ID, which is propagated along the request path. Each instrumentation point on the path can add its own metadata before passing the ID to the next service.
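The propagation described above can be sketched in a few lines. This is a simplified, in-process simulation: the services are plain function calls, the header name `X-Trace-Id` is a hypothetical choice for this example, and real deployments would normally rely on a tracing library and a standard header format instead.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch


def handle_request(service, headers, downstream, spans):
    """Record a span for `service`, then forward the same trace ID downstream."""
    # Reuse the caller's trace ID, or start a new trace at the edge of the system.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    # Each instrumentation point adds its own metadata to the shared trace.
    spans.append({"trace_id": trace_id, "service": service, "hop": len(spans)})
    if downstream:
        next_service, remaining = downstream[0], downstream[1:]
        # Pass the ID along in the outgoing headers.
        handle_request(next_service, {TRACE_HEADER: trace_id}, remaining, spans)
    return trace_id


spans = []
trace_id = handle_request("gateway", {}, ["orders", "billing"], spans)
```

After the call, `spans` holds one entry per service, all sharing the same `trace_id`, so telemetry from the three services can be joined on that ID during root-cause analysis.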
Instrumentation using AOP adds execution overhead that grows with the number of measurement points defined by the aspect advice. To keep this overhead low, instrumentation should be applied selectively, where it makes sense and where visibility is needed most.
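One common way to bound this overhead, beyond choosing measurement points carefully, is to measure only a fraction of the calls. The sketch below times every Nth call and lets the rest run with only a counter increment. Sampling is a technique added here for illustration, not something an AOP framework necessarily does, and `sampled_timing`, `TIMINGS` and `lookup` are hypothetical names.

```python
import functools
import itertools
import time


def sampled_timing(every_nth, sink):
    """Decorator factory: time one call out of every `every_nth` calls."""
    if every_nth < 1:
        raise ValueError("every_nth must be at least 1")

    def decorator(func):
        counter = itertools.count()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if next(counter) % every_nth != 0:
                return func(*args, **kwargs)  # fast path: no measurement
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                sink.append((func.__name__, time.perf_counter() - start))
        return wrapper
    return decorator


TIMINGS = []


@sampled_timing(every_nth=10, sink=TIMINGS)
def lookup(key):
    return key.upper()


for i in range(25):
    lookup(f"key-{i}")
```

Of the 25 calls, only calls 0, 10 and 20 are timed, so `TIMINGS` ends up with three entries. The trade-off is coarser data in exchange for less work on the hot path.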
References
Spring docs — Aspect Oriented Programming with Spring
Wikipedia — Distributed AOP
Brendan Gregg’s USE method