New Relic's Chief Technology Strategist, Nic Benders, discusses the evolution of observability from passive monitoring to an AI-driven "intelligence era," aiming for autonomous action. The conversation highlights how AI, particularly LLMs combined with statistical tools, helps surface critical signals from vast datasets, reduces alert fatigue, and moves towards "understandability" of complex systems. The episode also explores the emerging challenge of observing non-deterministic AI systems themselves and the impact of AI on the future of software engineering.
Summarized by Podsumo
New Relic's Observability Evolution: The platform has transitioned from an instrumentation era to a data platform, and now into an intelligence era leveraging AI to move beyond passive monitoring to active, autonomous problem resolution.
AI's Role in Reducing Alert Fatigue: AI, especially LLMs combined with statistical methods, helps filter massive data, identify meaningful anomalies, and reduce alert noise, allowing systems to take action before human intervention.
Observing Non-Deterministic AI Systems: Monitoring AI requires new "golden signals" beyond traditional metrics, focusing on aspects like token usage, sentiment analysis, and quality of responses to detect issues like hallucination.
Shift from Observability to Understandability: The goal is not just to observe data but to understand what's happening in complex systems, providing answers and insights rather than just dashboards and alerts, ultimately linking technical performance to business goals.
Impact on Software Engineering: AI will elevate the level of abstraction for software engineers, shifting focus from low-level coding to higher-level architectural patterns and system design, making humans more productive in tackling more complex problems.
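To make the alert-fatigue takeaway concrete, here is a minimal sketch of the statistical half of that idea: flag only the data points that deviate sharply from recent history, rather than alerting on every threshold crossing. The episode does not describe New Relic's actual method; the z-score test, the function name `interesting_anomalies`, the window size, and the threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def interesting_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing window
    by more than `threshold` standard deviations (hypothetical helper)."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Illustrative latency samples in milliseconds (made-up data).
latencies_ms = [100, 102, 99, 101, 100, 103, 98, 400, 101, 100]
print(interesting_anomalies(latencies_ms))
```

In the workflow described in the episode, a flagged index would then be handed to a further evaluation step (for example an LLM) to decide whether it is actually interesting; that step is not shown here.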
"Dashboarding and alerting as the way you do observability has reached its conclusion. You can't go forwards anymore with a, I'm going to make fancier dashboards or tinier widgets... They want answers. They want to know what's important in their system."
"The real root of this is why do we even have these damn alerts? Like what are alerts for? We create the alerts because we are afraid that something is broken in the system and we don't know about it. Can we get to the root of that issue instead using these techniques and say, well, when I see an anomaly, I'm going to evaluate it and figure out, is this interesting?"
"I believe it's the same model. It's just different triggers that an AI system is no more different to a web system than a database. That they have their own language, they have their own metrics like we need to track tokens. We need to track sentiment, be set up for performing judges against like, I want to know am I getting the right answers?"
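The AI-specific signals mentioned in the last quote (tokens and judge verdicts) can be sketched as a small aggregation. This is only an illustration under stated assumptions: the `LLMCall` record, its field names, and the `summarize` function are hypothetical and do not reflect any New Relic API. A sentiment score could be added as one more field in the same way.

```python
from dataclasses import dataclass

@dataclass
class LLMCall:
    prompt_tokens: int
    completion_tokens: int
    judge_passed: bool  # verdict from a hypothetical evaluator ("judge")

def summarize(calls):
    """Aggregate per-call records into a few AI-specific signals."""
    total = len(calls)
    if total == 0:
        return {"calls": 0, "avg_tokens": 0.0, "judge_pass_rate": 0.0}
    tokens = sum(c.prompt_tokens + c.completion_tokens for c in calls)
    passed = sum(1 for c in calls if c.judge_passed)
    return {
        "calls": total,
        "avg_tokens": tokens / total,
        "judge_pass_rate": passed / total,
    }

# Made-up sample records for illustration.
calls = [
    LLMCall(120, 80, True),
    LLMCall(200, 150, False),
    LLMCall(90, 60, True),
    LLMCall(100, 100, True),
]
print(summarize(calls))
```

A falling `judge_pass_rate` or a sudden jump in `avg_tokens` would play the same role here that error rate or latency plays for a web service.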