Working with Apache Kafka and real-time applications comes with challenges.
Visibility into the deployed applications and their dependency on what we call the “data fabric” is one of them. (For the purposes of this blog, “data fabric” means Kafka and all its state and configuration.)
If you’ve built a multi-tenant real-time data platform with Kafka, where teams are deploying applications outside your jurisdiction, this is where the pain is particularly acute. It goes something like this.
A successful data platform will have product teams deploying applications across different frameworks and deployment pipelines.
Those adopting more advanced DataOps practices will now have teams outside of engineering deploying applications, bypassing engineers entirely. This means they use a different toolset to engineering.
Without care, this creates a data governance nightmare.
A free-for-all of flows deployed with no cataloguing or visibility into what is deployed, by whom, or its state is a recipe for disaster.
Before you know it, the platform will be swamped with flows. You’ll have no idea who owns or has deployed them or how to troubleshoot or govern them. This will inevitably lead to duplicate work, outages, complicated compliance reporting and a loss of confidence from technical and business colleagues.
We’ve also heard of teams struggling to show their management what they are doing with Kafka and losing investment in Kafka because they’ve failed to clearly and visually demonstrate the technology’s value as it has been adopted. In this sense, Kafka has become a victim of its own success.
You may have some level of visibility depending on the tools you use.
Some of your critical applications may be instrumented and monitored with an application performance management (APM) solution. Or, more broadly, your monitoring metrics will report running applications and services.
Through your CI/CD processes you will have some ability to observe what’s deployed across your different pipelines.
Or your service discovery may have a registry that you can interrogate.
All of this may be feeding an asset management or config management DB (CMDB) of some sort.
The problem lies in doing anything at all with that information. Developing, debugging, securing and governing real-time flows with Apache Kafka all require context about the Kafka environment as well as the business context.
This is hard enough for a Kafka expert, let alone the less technical set of users that DataOps practices dictate we open up a data platform to.
Here are a few examples:
As someone in ops managing an alert of poor Kafka performance due to high throughput of a producer, I need to identify the associated business application for the producer, its environment and an owner to contact, whilst at the same time identifying the client id so that a quota can be created.
As a data compliance officer, I need to verify that all applications and microservices for a service are not leaking sensitive information into Kafka, and identify which downstream applications may be consuming this data.
As a platform engineer I need to validate all the Kafka ACLs in accordance with their associated business applications.
To avoid teams being unproductive and needing Kafka experts involved in every process, this requires associating applications and their metadata (owner, version, deployment, environment, etc.) with that data fabric we were talking about before (Kafka topics, ACLs, quotas, partitions, consumer groups, etc.).
This association isn’t something most deployment processes capture day to day, nor something that documentation can realistically keep up with.
And even if it were implemented and documented, it wouldn’t be efficient to ask anyone to constantly swivel their chair between different tools.
Sitting alongside the new Lenses.io data catalog and new Snapshot SQL engine, the Lenses application catalog binds applications and data together to allow anyone of almost any skill level to operate real-time applications on Kafka.
Since this experience is protected with the Lenses security model, the Real-time App Catalog is designed to foster DataOps by allowing the data platform to be opened up to a wider set of users, in a well-governed way, beyond a single development team or expert platform engineering team.
The catalog provides the business context that keeps tenants of the platform cheery and productive. It minimizes duplicate effort (imagine building a new data processing pipeline not knowing another one doing the same thing had already been deployed). It increases platform and data hygiene and compliance and makes those audits that much less time-consuming.
The App Catalog operates in two main parts.
The Applications view provides a tabular list of all deployed applications and their health (through a health check of all application instances) alongside their associated metadata.
Metadata will include human-defined tags that different teams may add. For example, if an application is known to be generating payment data, an operator may choose to tag it “PCI”.
As a data platform engineer, you would want to oversee which teams are deploying applications and ensure they are meeting the necessary data governance controls.
The topology provides a data-centric, Google Maps-like view of the dependencies between different applications and flows. It maps how upstream applications relate to downstream topics and applications. It helps you answer questions about data provenance and data lineage for good governance.
A developer may choose to consume a dataset as part of a new critical service they are developing. With the Real-time App Topology, they can check, for example, whether the upstream applications for the data also have high service levels or produce clean data.
For operations, the topology is often the first port of call in an investigation, as it shows the service dependencies that will be crucial to investigating an incident or planning downtime.
From either the tabular view or the Topology view, an operator can drill down from the application and invoke different workflows, including the following actions:
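To make the lineage idea concrete, here is a minimal sketch of how upstream and downstream dependencies can be derived once each application's input and output topics are known. It is an illustration only, not how Lenses implements its topology; the application names, topic names and helper functions (`build_graph`, `downstream`) are all hypothetical.

```python
from collections import defaultdict, deque


def build_graph(apps):
    """Build directed edges: topic -> consuming app, and app -> produced topic.

    `apps` maps an application name to {"inputs": [...], "outputs": [...]}.
    Nodes are (kind, name) tuples so a topic and an app can share a name.
    """
    edges = defaultdict(set)
    for app, io in apps.items():
        for topic in io.get("inputs", []):
            edges[("topic", topic)].add(("app", app))
        for topic in io.get("outputs", []):
            edges[("app", app)].add(("topic", topic))
    return edges


def downstream(edges, start):
    """Return every node reachable from `start` (breadth-first traversal)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical catalog entries, for illustration only.
apps = {
    "payments-api": {"inputs": [], "outputs": ["payments"]},
    "fraud-scorer": {"inputs": ["payments"], "outputs": ["fraud-alerts"]},
    "alert-mailer": {"inputs": ["fraud-alerts"], "outputs": []},
}

edges = build_graph(apps)
affected = downstream(edges, ("topic", "payments"))
```

Here `affected` contains every application and topic that sits downstream of the `payments` topic, which is the kind of question a compliance officer or an operator planning downtime would ask of the topology.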
Identify associated consumer groups
Explore payload data for associated topics
Find and modify associated quotas and ACLs
View partitioning information for associated topics
View and modify configuration
This feature has different means of discovering or registering applications into the catalog, designed to cater for all types of applications and deployment methods.
If you build stream processing applications with Lenses’ Streaming SQL engine or you configure one of Lenses’ Stream Reactor Kafka Connect connectors, the application will be registered automatically into the Lenses Topology and Application Catalog through our internal Data Application Deployment (DAD) Framework whilst it deploys to Kubernetes or Kafka Connect.
For JVM-based developers, a Topology Client can be included in their code that automatically registers their application instance with Lenses.
With a service account token, any developer or analyst can register (or de-register) their application through an HTTP endpoint to Lenses within their code. It means applications developed in any framework can be registered.
The endpoint allows metadata such as deployment method, tags and version to be included. A health check can be defined for each runner (instance) of the application, which lets the App Catalog ping each runner at a regular interval. Developers using Spring Boot, for example, often expose an Actuator endpoint that can serve this purpose.
Here is an example of a Python script that registers an application with one runner, consuming from two topics and producing to one.
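A minimal sketch of what such a script could look like is shown here. The endpoint path, the token header name, the payload field names, and all application, topic and host names are assumptions made for illustration; check the Lenses API documentation for the exact contract before using it.

```python
import json
import urllib.request

# ASSUMPTIONS (hypothetical, not confirmed by this article): the base URL,
# the endpoint path, the token header name and every payload field name.
LENSES_URL = "http://localhost:9991"
REGISTER_PATH = "/api/v1/apps/external"
TOKEN_HEADER = "X-Kafka-Lenses-Token"


def build_payload(name, version, owner, tags, inputs, outputs, runner_urls):
    """Assemble the registration document for one application."""
    return {
        "name": name,
        "metadata": {"version": version, "owner": owner, "tags": list(tags)},
        "input": [{"name": topic} for topic in inputs],
        "output": [{"name": topic} for topic in outputs],
        # Each runner URL is the health-check address the catalog would ping.
        "runners": [{"url": url} for url in runner_urls],
    }


def build_request(payload, token, base_url=LENSES_URL):
    """Create the HTTP POST request carrying the service account token."""
    return urllib.request.Request(
        base_url + REGISTER_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", TOKEN_HEADER: token},
        method="POST",
    )


def register(payload, token, base_url=LENSES_URL):
    """Send the registration and return the HTTP status code."""
    request = build_request(payload, token, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# One runner, consuming from two topics and producing to one.
payload = build_payload(
    name="payments-enricher",
    version="1.0.0",
    owner="payments-team",
    tags=["PCI"],
    inputs=["payments", "customers"],
    outputs=["payments-enriched"],
    runner_urls=["http://payments-enricher-0:8080/health"],
)
```

Calling `register(payload, token="<service-account-token>")` would then send the request. Keeping payload construction separate from the network call makes the document easy to inspect or unit test before anything is sent.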
Stay tuned as we expand the App Catalog with some really exciting enhancements that will open up far more use cases! There are some really big things in store.
In the meantime, come and try it out on your existing cluster or in a trial Kafka workspace: