That 3am security call about Apache Kafka...

By Joe Fitzpatrick
May 12, 2020

If you have worn the Platform or Security Engineer badge, or if you have a SecOps role, you may have experienced something like this at some point in your career.

Hopefully not. You receive a call at 3am. It’s your SOC, and something’s not right. Oh sh*t! There’s unidentified traffic on the network from an unknown host, and it’s communicating with a remote server. Sounds like a Level 3 exfiltration. It’s gonna be a long night.

Devin (SecOps Engineer): “Hey Luigi. This isn’t a drill. I think we may have been breached. One of our Tier 1 security analysts spotted anomalous network traffic that might indicate a compromise. The traffic is coming from our ERP system, which is connected to our Kafka cluster, and it’s leaving the network. We’ve initiated our major incident response procedures. We suspect data exfiltration.”

Luigi (Director SecOps): “What's the impact?”

Devin: “We don’t know the full picture yet; we still have to dig into our Kafka. We have shut down the port the traffic was on.”

Luigi: “Why the wait?”

Devin: “We can’t even triage. We have logs but we really need to see payload events in Kafka to understand what’s being exfiltrated and send them to Splunk for the IR team to analyze.”

Luigi: “I understand. How long until we have answers? Once this gets out, you know my phone is going to start ringing off the hook, Devin.”


Anatomy of “the call you don’t want”

What you do know is that people are going to start freaking out. Your mind is racing … it’s not a pretty sight: the company’s reputation could be at stake, customers could churn, and if it’s widespread, the fines for losing PII are staggering. The boss is going to yell at you, customers are going to call, and executives from other business units are going to demand updates. And Kafka? Our Kafka install was fast-tracked … and security ops wasn’t included in the process.

Back to the breach…

Devin: “Unfortunately, this is Kafka, Luigi. This stuff is super complicated. We’re working with the Kafka platform team to inspect the data and to build and deploy an application that sends it to Splunk for analysis. We need to pre-process the data beforehand, too. We’re dependent on a few of our Kafka experts. This will take at least 48 hours, maybe a few days?”

Luigi: “We just can’t wait that long. What are we doing while we wait?”

Devin: “We’re trying to see the data live in Kafka, but it’s in a proprietary format and we don’t understand it. The team is describing it to me as a black box. Unfortunately, all of our Kafka tools are custom, used by a few really experienced engineers, and not designed for this type of scenario.”

Luigi: “Surely there must be a faster way to inspect this data and get it into the SIEM?”

Pause.

If you have #apachekafka in your stack, congrats, you are probably a market leader.

The upside of having real-time data is huge: insights into the supply chain, views into real-time customer experience, data modernization and digital transformation, real-time operational performance and cybersecurity.

However, there are also challenges in operating a modern data platform. Kafka requires highly skilled engineers to build and maintain. Visibility into Kafka data is challenging: the data may be serialized, large, and difficult to search, understand and analyze.
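
To make the visibility problem concrete, here is a minimal sketch of what “inspecting” Kafka data looks like without purpose-built tooling. It assumes the open-source kafka-python client, and the broker address and topic name are hypothetical; the point is that serialized payloads come back as opaque bytes unless you can also track down the producer’s schema.

    # Minimal sketch of "working in the dark": consuming serialized events raw.
    # Broker address and topic name below are hypothetical.
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "erp.events",                               # hypothetical topic
        bootstrap_servers="kafka.example.com:9092",
        auto_offset_reset="earliest",               # start from the oldest retained events
        consumer_timeout_ms=5000,                   # stop iterating if no messages arrive
    )

    for record in consumer:
        # With no deserializer configured, record.value is raw bytes. If the
        # producer used Avro or another binary format, this prints gibberish
        # unless you also have the matching schema.
        print(record.topic, record.offset, record.value[:40])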

On top of this, understanding which applications are connected to it borders on impossible. As is common with Open Source Software, enterprise capabilities like management, user experience and even security are not built into Kafka.

If your Kafka deployment is like most, it’s a unique, one-off project. It’s likely difficult to triage and access the data. As this scenario suggests, one of your most strategic data projects is a black box!

There’s got to be a better option for monitoring and analyzing Kafka. What if you could unlock Kafka quickly and securely when you needed it? What if there was a workspace that allowed you secure, compliant and self-service access to Kafka with a single product and no custom coding?

Devin: “I agree, there ‘should’ be a better option, but that’s how this Kafka thing got started here. We’ve been out of the loop.”

Luigi: “What do you mean, out of the loop? We are security!”

Devin: “I hear ya, I’m just the messenger … and up until about 9 months ago we didn’t have any Kafka deployed.”

Luigi: “Yeah, it’s powerful but it’s Open Source Software and we have a security policy on OSS. At least we can point to that.”

Devin: “An engineering team saw some success with their Kafka on a single cluster. I remember hearing about their POC that went into production. I guess as more and more people heard about the success of the project, they must have wanted in. Pretty soon Kafka was ‘fast-tracked’ and the devs on the ERP team onboarded it without a security audit.”

Luigi: “Someone had a shiny object and the rest was history.”

Devin: “Likely, yes. And here we sit: our ERP was potentially breached, data is potentially being exfiltrated through Kafka, and we’re in the dark.”

“It is often safer to be in chains than to be free.”

How fitting is this quote from Apache Kafka’s namesake, Franz Kafka?

Maybe the “chains” of IT Operations and Security fundamentals (flexible access to data, monitoring and observability, defined incident response procedures, security audits) are preferable to the damage of a breach?


… 72 Hours Later

Devin: “Sorry for the wait. The platform team had their best Kafka people on this, and it still took us 48 hours to build the application to get the data into Splunk. The investigation took us another day to complete; there was a lot of data. At least I have some good news, boss.”

Luigi: “Good news? Not sure I know what that is anymore. I have been preparing an incident summary for the board.”

Devin: “Well, I guess we got lucky. We were breached, but they didn’t get any data we deemed strategic or regulated. They did access our ERP but only touched the Kafka topics for Purchasing and Supply. Fortunately, these were not related to our customers, strategic suppliers or manufacturing sites.”

Luigi: “You're sure?”

Devin: “Yes, we are. We built the application to get the Kafka data into Splunk. The event logs told the complete story: they accessed the ERP and those two Kafka topics, which were used primarily for our Global Facilities Supply Chain. Nothing related to manufacturing or customers. They took that data and sent it to a remote server in Romania.”

Luigi: “Wow, lucky is right. Make sure to look for a backdoor; perhaps this was a distraction for another activity?”

Devin: “Will do. This was a case of compromised credentials. They likely bought the credentials on the dark web, or planted malware that our dev (Sergei Lyon) downloaded, probably something from ThePirateBay; we are checking. At any rate, the developer was working from home due to the Coronavirus policy, using a personal computer without the corporate controls.”

Luigi: “We knew this might happen.”

Devin: “Hard to contingency plan for a pandemic. Aside from that, 72 hours to complete an investigation is not OK. We could have been better prepared.”

Luigi: “Agreed.”

Do you want to be lucky or prepared?

Large, complex systems fail, or worse yet, are compromised, in weird and complex ways. The ability to observe them in real time is the only way to safeguard your most strategic data assets from performance issues and security events. The benefits of Apache Kafka are immense. But so are the stakes when you are dealing with real-time streaming data flowing to your strategic applications.

For most, monitoring data from Kafka is like working in the dark. The data is big, serialized and not designed for monitoring or observability. On top of that, even if you have access to the data, you can only query it with a complicated script or the CLI, which is slow, ineffective and complex. How many operations or security people know Kafka commands?
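
To give a sense of what that “complicated script” actually involves, here is a minimal sketch of the kind of one-off forwarder Devin’s team had to build: consume events from Kafka and ship them to Splunk’s HTTP Event Collector (HEC). It assumes the kafka-python and requests libraries; the broker address, topic name, HEC URL and token are all placeholders; and it pretends the payloads are plain JSON, when in the story they were in a proprietary format, which is exactly the part that eats up days.

    # Hypothetical one-off Kafka-to-Splunk forwarder (kafka-python + requests).
    # Every address, topic and token below is a placeholder.
    import json

    import requests
    from kafka import KafkaConsumer

    SPLUNK_HEC_URL = "https://splunk.example.com:8088/services/collector/event"
    SPLUNK_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder HEC token

    consumer = KafkaConsumer(
        "erp.purchasing",                           # hypothetical topic under investigation
        bootstrap_servers="kafka.example.com:9092",
        auto_offset_reset="earliest",               # replay history for the investigation
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for record in consumer:
        # Pre-process: wrap each event in the envelope Splunk HEC expects.
        payload = {"event": record.value, "source": "kafka", "sourcetype": "_json"}
        resp = requests.post(
            SPLUNK_HEC_URL,
            headers={"Authorization": f"Splunk {SPLUNK_TOKEN}"},
            json=payload,
            timeout=10,
        )
        resp.raise_for_status()  # fail loudly if Splunk rejects the event

Even this toy version glosses over cluster authentication, offset management, batching and error handling, which is why the real thing took the platform team two days.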

Our guide to Kafka monitoring helps you plan ahead and factor in observability across the components you may not have considered: from setting up clear alerts and audits to monitoring metadata and secret management.

So get yourself a copy, send it to Devin, and share it with your platform team to save everyone the stress of another 3am call.

Next time, you may not be so lucky.