
From hours of Kafka troubleshooting to insights in minutes

By Andrew Stevenson, June 23, 2025
In this article:
  • 01. Help, my Kafka consumer is down
  • 02. The hidden time-sink of Kafka operations
  • 03. The poison pill problem
  • 04. Introducing the Lenses SRE Agent
  • 05. Impact: From Kafka troubleshooting investigation to resolution
  • 06. Human-operated vs. AI-operated triage: Understanding the difference
  • 07. The evolution of Kafka operations
  • 08. Looking forward: AI operations

Help, my Kafka consumer is down


You're three hours into debugging a stalled Kafka consumer. The lag is climbing. Customers are complaining. Your logging doesn't show anything useful, and changing the log level requires a deployment approval that won't come until tomorrow morning.

Sound familiar?

If you're operating Apache Kafka at scale, you know that sinking feeling when a consumer group stops progressing and you're left playing detective with insufficient clues. The pressure builds while you dig through offsets, parse through logs, and manually inspect messages, hoping to find the needle in the haystack.

This is exactly the kind of operational burden that drains valuable engineering time and keeps you from building the real-time service features that matter.

The hidden time-sink of Kafka operations


The real challenge with Kafka operations isn't spotting that something broke – monitoring tools are pretty good at that. The problem is figuring out why it broke, when it started, what changed, and who is impacted. 

We recently helped a customer troubleshoot a failing connector where a Single Message Transform (SMT) was silently failing due to an incorrect header value, triggering a heap space OutOfMemoryError. What should have been a 5-minute fix became a multi-hour investigation involving 6+ engineers.

The poison pill problem


One of the most frustrating Kafka issues is the poison pill scenario. A single malformed message can bring an entire consumer group to its knees. These scenarios are surprisingly common in production environments:

  • Schema evolution issues and data quality problems

  • Encoding issues and size violations

  • Business logic failures that pass schema validation


Good schema management helps, but only if you're using a schema registry and enforcing it. Even then, it doesn't guard against semantic errors in your data or application-level processing failures that occur after successful deserialization.

Traditional debugging approaches require manually hunting through logs, inspecting offsets, and examining messages one by one – a process that can take hours for large topics with millions of messages.
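To make that manual hunt concrete, here is a minimal sketch in plain Python. The message payloads, offsets, and the `order_id` check are all invented for illustration; in production you would be pulling real records from the topic, but the shape of the search — walk forward from the stuck offset until deserialization or validation fails — is the same:

```python
import json

def find_poison_pill(messages, start_offset):
    """Scan raw payloads from a stuck offset onward and return the
    offset and reason for the first message that fails processing."""
    for i, raw in enumerate(messages):
        offset = start_offset + i
        try:
            record = json.loads(raw)            # deserialization step
            if "order_id" not in record:        # app-level validation
                return offset, "missing required field 'order_id'"
        except json.JSONDecodeError as err:
            return offset, f"malformed JSON: {err}"
    return None, "no poison pill found in scanned range"

# Simulated payloads around the stuck offset (offset 1042 is truncated)
batch = [
    b'{"order_id": 1, "total": 9.99}',   # offset 1040
    b'{"order_id": 2, "total": 4.50}',   # offset 1041
    b'{"order_id": 3, "total": ',        # offset 1042: poison pill
]
offset, reason = find_poison_pill(batch, 1040)
print(offset, reason)
```

Even this toy version shows why the manual approach scales so badly: with millions of messages and no hint of where the bad record sits, "walk forward and check each one" becomes the multi-hour slog described above.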


Introducing the Lenses SRE Agent

The SRE Agent represents a shift in how we approach Kafka troubleshooting. Instead of manual investigation, it provides intelligent triage to help teams quickly focus their efforts. There are many things that can cause consumers to fail – some having nothing to do with Kafka itself – so the SRE Agent starts by analyzing what it knows best: the streaming data. When a consumer group stalls, the SRE Agent:

  • Analyzes message patterns: Using Lenses' SQL engine, it examines messages on either side of the stuck offset, looking for structural anomalies, data outliers, and pattern breaks that could indicate the poison pill

  • Provides historical context: It compares current data patterns against historical baselines to identify when and what changed, giving you the temporal context needed for effective troubleshooting

  • Identifies schema issues: Even without formal schema enforcement, it can detect structural inconsistencies in your data that might cause consumer failures

This is the first skill we are introducing into the Agent, with more to come, including the ability to analyze the performance of the infrastructure itself and detect configuration changes.
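The structural-analysis idea above can be sketched in plain Python. This is illustrative only, not the agent's implementation (which runs on Lenses' SQL engine over the live topic), and the baseline field set and sample records are assumptions:

```python
def structural_anomalies(messages, baseline_fields):
    """Flag messages whose field set deviates from a historical
    baseline -- a cheap proxy for schema drift or structural breaks."""
    anomalies = []
    for offset, record in messages:
        fields = set(record)
        missing = baseline_fields - fields   # expected but absent
        extra = fields - baseline_fields     # present but unexpected
        if missing or extra:
            anomalies.append((offset, sorted(missing), sorted(extra)))
    return anomalies

# Baseline "schema" learned from earlier healthy traffic (assumed)
baseline = {"order_id", "customer_id", "total"}

# Messages on either side of the stuck offset
window = [
    (1040, {"order_id": 1, "customer_id": 7, "total": 9.99}),
    (1041, {"order_id": 2, "customer_id": 9, "total": 4.50}),
    (1042, {"order_id": 3, "amount": 12.00}),  # drifted structure
]
print(structural_anomalies(window, baseline))
```

Comparing field sets rather than full schemas is deliberately crude, but it mirrors the point made above: structural inconsistencies can be caught even when no schema registry is enforcing anything.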

Impact: From Kafka troubleshooting investigation to resolution

Consider a typical poison pill scenario: Your Kafka consumer group for order processing suddenly stops progressing during peak traffic. With a serious production issue, multiple teams get involved in the traditional debugging approach:

Traditional approach:

  • Detection and team mobilization (20 minutes)

  • Parallel investigation across app, infrastructure, security, and network teams (60-120 minutes)

  • Manual offset analysis and message inspection (135 minutes)

  • Pattern recognition and root cause identification (90 minutes)

  • Resolution (15 minutes)

Total time: 5+ hours across multiple engineers

With the SRE Agent, the same scenario becomes:

  • Detection (5 minutes): Kafka monitoring alerts trigger

  • AI Analysis (2 minutes): Agent analyzes stuck consumer (triggered via Lenses) and identifies potential poison pill

  • Report Generation (1 minute): Detailed analysis showing the problematic message, its location, and recommended actions

  • Resolution (15 minutes): Armed with specific information, quickly resolve the issue

Total time: 23 minutes with a single engineer

Kafka troubleshooting consumer group - traditional approach vs. Lenses SRE Agent

Human-operated vs. AI-operated triage: Understanding the difference


The SRE Agent uses Lenses' built-in diagnostic capabilities, but with several key advantages:

  • Speed: The agent can analyze thousands of messages and correlate patterns in seconds, not hours

  • Context: It automatically correlates message patterns and data anomalies that would require significant manual effort to identify

  • Parallel processing: While a human team might divide responsibilities, the AI agent can simultaneously analyze all aspects of the problem

The evolution of Kafka operations

As organizations scale Kafka deployments across multiple environments, troubleshooting complexity grows exponentially. Managing dozens of clusters across different vendors requires new approaches.

The SRE Agent addresses this by providing focused data analysis capabilities that work across different Kafka environments. Whether you're running Apache Kafka, Amazon MSK, Confluent Cloud, or a mix of all three, the agent applies the same intelligent analysis patterns.

This consistency is particularly valuable for organizations adopting hybrid architectures. 

Looking forward: AI operations

The SRE Agent represents the first step toward intelligent Kafka operations. Instead of 3 AM forensic investigations, you get a clear analysis of what's wrong and how to fix it.

With intelligent agents handling the detective work, your team can focus on building robust, real-time systems that drive business value.

Stay up to date with the latest AI agent developments at Lenses, and register for early access.
