OpenTelemetry has announced that it has incorporated continuous profiling as a core telemetry signal, and Elastic has donated its continuous profiling agent to the OpenTelemetry project.
Elastic's donation introduces a production-grade, eBPF-based continuous profiling agent to the OpenTelemetry ecosystem. This agent offers whole-system, always-on profiling capabilities with minimal overhead, addressing many limitations of traditional profiling approaches. This contribution follows the March 2023 merger of a profiling data model OTEP (OpenTelemetry Enhancement Proposal) and ongoing efforts to establish a stable specification and implementation for profiling within OpenTelemetry. Key features include:
- Low performance impact, with approximately 1% CPU usage
- No need for code instrumentation, recompilation, or service restarts
- Support for a wide range of programming languages and runtimes
- Ability to observe third-party libraries and kernel operations
The agent's capabilities include identifying suboptimal code paths and providing comprehensive visibility into application runtime behaviour. This contribution should accelerate the adoption of profiling as the fourth key signal in OpenTelemetry, alongside tracing, metrics, and logs.
Incorporating continuous profiling into an observability system addresses several limitations of traditional profiling methods. It eliminates the need for disruptive service restarts, reduces the performance overhead associated with code instrumentation, and provides visibility into third-party libraries that were previously challenging to profile.
The second announcement provides context on the broader journey of integrating profiling into OpenTelemetry. It details the formation of a dedicated Special Interest Group (SIG) for profiles and the challenges faced in developing a standardised approach to continuous profiling within the OpenTelemetry framework.
The SIG Profiles group had to navigate several essential decisions, including:
- Whether to build upon existing data models or create an entirely new one
- How to balance domain-specific profiling conventions with OpenTelemetry's framework-specific conventions
- Selecting an appropriate existing profiling format as a foundation
Integrating profiling data into the OpenTelemetry Collector follows a similar pattern to other signals. The data is ingested, deconstructed into the Collector's internal "pdata" format, and then processed uniformly alongside other telemetry signals.
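The ingest, deconstruct, and process flow described above can be sketched in a few lines. This is an illustrative model only: the Collector itself is written in Go, and the names `InternalProfile`, `to_internal`, `drop_idle`, `add_environment`, and `run_pipeline`, as well as the payload layout, are hypothetical and do not correspond to the Collector's actual pdata API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InternalProfile:
    """Hypothetical stand-in for the Collector's internal representation."""
    resource: Dict[str, str]
    samples: List[dict] = field(default_factory=list)


def to_internal(raw: dict) -> InternalProfile:
    """Deconstruct a received payload into the internal form."""
    return InternalProfile(
        resource=dict(raw.get("resource", {})),
        samples=[dict(s) for s in raw.get("samples", [])],
    )


def drop_idle(profile: InternalProfile) -> InternalProfile:
    """Example processor: discard samples that recorded no CPU time."""
    kept = [s for s in profile.samples if s.get("cpu_ns", 0) > 0]
    return InternalProfile(profile.resource, kept)


def add_environment(profile: InternalProfile) -> InternalProfile:
    """Example processor: enrich the resource attributes (value is made up)."""
    return InternalProfile({**profile.resource, "env": "staging"}, profile.samples)


Processor = Callable[[InternalProfile], InternalProfile]


def run_pipeline(raw: dict, processors: List[Processor]) -> InternalProfile:
    """Convert once at the edge, then apply every processor in order."""
    profile = to_internal(raw)
    for process in processors:
        profile = process(profile)
    return profile


# Invented payload for illustration.
raw_payload = {
    "resource": {"service.name": "checkout"},
    "samples": [
        {"stack": ["main", "handle"], "cpu_ns": 1_000_000},
        {"stack": ["main", "idle"], "cpu_ns": 0},
    ],
}
result = run_pipeline(raw_payload, [drop_idle, add_environment])
```

The point of the sketch is the shape of the pipeline: once data is in a single internal form, processors do not need to know which wire format it arrived in.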
Continuous profiling enables many use cases beyond traditional performance and cost analysis. These include signal correlation, incident response, and detailed resource consumption analysis. The technology shows promise in identifying issues like CPU spikes, memory problems, mutex contention, and network jitter. Adding continuous profiling to OpenTelemetry will help engineers identify resource-intensive code and increase vendor neutrality by reducing reliance on proprietary APM agents.
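As a rough illustration of how sampled profiles point engineers to resource-intensive code, the sketch below aggregates stack samples into per-function "self" and "total" counts. The sample data and function names are invented; a real profiler would supply the stacks, and the `aggregate` helper is a hypothetical name rather than part of any OpenTelemetry API.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Stack = List[str]  # outermost frame first, leaf frame last


def aggregate(samples: Iterable[Tuple[Stack, int]]) -> Tuple[Counter, Counter]:
    """Return (self_counts, total_counts) keyed by function name."""
    self_counts: Counter = Counter()
    total_counts: Counter = Counter()
    for stack, count in samples:
        if not stack:
            continue
        # "Self" time belongs to the leaf frame that was executing.
        self_counts[stack[-1]] += count
        # "Total" time belongs to every function on the stack; set() avoids
        # double counting a function that appears twice through recursion.
        for function in set(stack):
            total_counts[function] += count
    return self_counts, total_counts


# Invented samples: (stack, number of times that stack was observed).
samples = [
    (["main", "serve", "parse_json"], 60),
    (["main", "serve", "query_db"], 30),
    (["main", "serve"], 10),
]
self_counts, total_counts = aggregate(samples)
hottest = self_counts.most_common(1)[0]
```

Here `hottest` identifies the leaf function observed most often, while `total_counts` shows which callers account for that time.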
Another significant trend is the integration of eBPF technology into profiling solutions such as Elastic's donated agent. eBPF allows for comprehensive system-wide profiling with minimal overhead, although it presents challenges in symbol management and runtime compatibility.
In a discussion about convincing organisations to adopt OpenTelemetry, user SuperQue on Reddit highlights the prior lack of continuous profiling in OpenTelemetry as a weakness:
"Frankly, I haven't seen the benefit of Otel. We've spent months and months working on deploying it, getting all the backend storage setup (clickhouse). But it hasn't actually provided any additional value over the already instrumented with Prometheus libraries we use. I'm actually more looking forward to continuous profiling tools than Otel. The data that Polar Signals and Pyroscope produce look like they will tell you in much better detail on what parts of your code are slow. Much more useful than what tracing seems to provide."
The OpenTelemetry community's adoption of continuous profiling aligns with a growing industry trend. Several startups and major observability vendors have entered this domain recently, recognising the value of profiling data when correlated with other telemetry signals. Other continuous profiling agents are also available in this space, such as Polar Signals' Parca Agent, Grafana Alloy, and Grafana Agent.
A video published on the OpenObservability Talks YouTube channel presents an in-depth discussion on integrating continuous profiling into OpenTelemetry with experts Felix Geisendörfer from Datadog and Ryan Perry from Grafana Labs. They discuss the evolution of profiling from a performance and cost analysis tool to a key observability signal, alongside logs, metrics, and traces, and cover the merging of the OpenTelemetry Enhancement Proposal (OTEP) for profiling, which moves the profiling signal to an experimental stage within OpenTelemetry.
They also explain the decision to adopt an extended version of the pprof format, dubbed "pprof-extended," as the standard for OTel profiling data. They go on to highlight the challenges in balancing performance requirements with OTel's existing conventions and discuss the potential for supporting multiple profiling formats. Finally, they discuss developing reference implementations for various programming languages and runtimes.
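One reason pprof is a compact foundation is its string table: names are stored once and referenced by index, with index 0 reserved for the empty string. The sketch below shows that idea in a deliberately simplified form. In real pprof, samples reference locations, which in turn reference functions whose names are string-table indices; here stacks point at name indices directly, and none of the "pprof-extended" additions are modelled. The `StringTable`, `encode`, and `decode` names are hypothetical.

```python
from typing import Dict, List, Tuple


class StringTable:
    """Interns strings so each distinct string is stored only once."""

    def __init__(self) -> None:
        # pprof reserves index 0 for the empty string.
        self.strings: List[str] = [""]
        self._index: Dict[str, int] = {"": 0}

    def intern(self, value: str) -> int:
        """Return the index for value, adding it on first use."""
        if value not in self._index:
            self._index[value] = len(self.strings)
            self.strings.append(value)
        return self._index[value]


def encode(stacks: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Replace every frame name with its string-table index."""
    table = StringTable()
    encoded = [[table.intern(name) for name in stack] for stack in stacks]
    return table.strings, encoded


def decode(strings: List[str], encoded: List[List[int]]) -> List[List[str]]:
    """Resolve indices back to names."""
    return [[strings[i] for i in stack] for stack in encoded]


# Invented stacks for illustration.
stacks = [
    ["main", "serve", "parse_json"],
    ["main", "serve", "query_db"],
]
strings, encoded = encode(stacks)
```

Because continuous profilers observe the same frames repeatedly, this kind of deduplication keeps payloads small, which relates to the performance concerns mentioned in the discussion.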