Data dump to data catalog for Apache Kafka
Why unearthing the value of metadata should be as simple as Google Search
From data stagnating in warehouses to a growing number of real-time applications, in this article we explain why we need a new class of Data Catalogs: this time for real-time data.
The 2010s brought us organizations “doing big data”. Teams were encouraged to dump it into a data lake and leave it for others to harvest.
But data lakes soon became data swamps. There were no best practices, no visibility into service levels or quality of data, no data contracts, no naming conventions or standardizations.
Almost as quickly as the data arrived, it became impossible to find or trust.
If you were mature you might have deployed an enterprise Data Catalog that discovered data across your data stores. If you were less mature this would have been a manual process of documentation.
Either way, this wouldn’t prepare you for what was to come in the world of data and DataOps.
New streaming data, same problem, bigger stakes
As a developer or data engineer, you still have a problem finding data. Even simple questions are hard to answer: Where do I have customer data? What about surnames, phone numbers or credit cards?
Why is this?
It’s because cataloging data got harder. Data isn’t sitting in data warehouses any longer. It’s streaming. And it is generated by applications run by engineering teams, not business teams.
A lot of applications.
Engineering teams don’t have the time to follow traditional data governance practices, nor can they be expected to.
And yet, if there is no way to know what data there is across different teams and how to find it - it may as well not exist.
Much of DataOps is about self-service to remove friction from delivery. Relying on pure luck in speaking to the right person at the right time, or on endless back-and-forth to understand what data exists and what it looks like, is not a workable process.
This won’t work in 2020.
For real-time data, there is no alternative but to automate the data management processes. This includes the discovery of data entities, data lineage, classification and quality.
Automation will mean teams are free to develop new data-intensive applications without centralized data governance or manual procedures. What data is generated can be immediately socialized across a business for other teams to benefit from.
Commandments of Cataloguing data
Metadata is Queen.
If you can collect it from your different data infrastructure and applications you’re on the right path. Then to make it valuable you need to serve this information in the right measure, and you can start to answer the right questions:
What data exists and its profile?
What is its quality?
What service levels can I expect?
What is its data provenance?
How might other services be impacted?
How compliant is it?
Being able to answer these sorts of questions is fundamentally important to the success of real-time data projects.
“By 2021, organizations that offer a curated catalog of internal and external data to diverse users will realize twice the business value from their data and analytics investments than those that do not”
Source: “Augmented Data Catalogs: Now an Enterprise Must-Have for Data and Analytics Leaders,” Ehtisham Zaidi & Guido de Simoni, Sept. 12, 2019
Enter the Lenses real-time Data Catalog
Lenses.io delivers the first and only Data Catalog for streaming data.
It's an easy, secure and intuitive way to identify your data:
It works in real-time
It continuously and automatically identifies all your streaming data
It works across any data serialization format
It enables your team to mask and protect all sensitive data.
Lenses not only provides a Google Search experience over streaming data, but also a Google Maps experience.
In addition to monitoring your pipelines (Kafka Connect, Flink, Spark Streaming etc.) and your microservices, Lenses will highlight which applications are consuming or producing such “sensitive” data.
Next, we’ll explain the thought process and key principles behind our real-time Data Catalog.
Like Google but for Apache Kafka metadata
Building a real-time Data Catalog was a natural progression for our team. We’ve been giving visibility into Apache Kafka environments and applications that run on Kafka for years.
This was mainly developed to help engineers gain insight into their Kafka streams. It proved very useful for debugging applications, inspecting message payloads with SQL, checking partitioning information, overseeing infrastructure health and viewing consumer lag.
It all starts with SQL
The SQL engine to explore topic data is particularly important.
To understand the data and its structure, we connected to an Avro schema registry or deserialized proprietary message formats. This gave us visibility into the metadata and payload of all data sitting in Kafka.
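To illustrate why a schema is such a useful source of metadata, here is a minimal sketch that flattens an Avro record schema (which is plain JSON) into the dotted field paths a catalog could index. The schema, the `field_paths` helper and all names are hypothetical and are not taken from Lenses; a real system would fetch the schema from a registry rather than a string literal.

```python
import json

# Hypothetical Avro schema for the value of a "customers" topic.
CUSTOMER_SCHEMA = """
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "surname", "type": "string"},
    {"name": "phone", "type": ["null", "string"], "default": null},
    {"name": "address", "type": {
      "type": "record",
      "name": "Address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "postcode", "type": "string"}
      ]
    }}
  ]
}
"""


def field_paths(schema, prefix=""):
    """Return dotted paths for every leaf field of an Avro record schema."""
    paths = []
    for field in schema.get("fields", []):
        path = prefix + field["name"]
        ftype = field["type"]
        # A union is a JSON list; treat a plain type as a one-branch union.
        branches = ftype if isinstance(ftype, list) else [ftype]
        nested = [b for b in branches
                  if isinstance(b, dict) and b.get("type") == "record"]
        if nested:
            # Recurse into nested records so their fields are searchable too.
            for record in nested:
                paths.extend(field_paths(record, path + "."))
        else:
            paths.append(path)
    return paths


customer_fields = field_paths(json.loads(CUSTOMER_SCHEMA))
print(customer_fields)
```

Once field paths like these are indexed per topic, questions such as "where do I have surnames or phone numbers?" become a lookup rather than a conversation.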
Last year we extended the capabilities to explore data in Elasticsearch with the same SQL engine and built a framework to connect to multiple different data stores in the future: Postgres, Redis, Cassandra.
We also register stream processing applications that run on our streaming SQL engine over Kubernetes.
We allow developers to register their external applications either as a REST endpoint or with a client for JVM-based applications.
This gives us an App Catalog and a Topology of all the dependencies between different flows and applications, which in turn allows us to build the data lineage of different data sets.
It also allows us to answer a few important questions:
What applications generated this data?
How much can I trust the quality of the data and at what service levels?
Which downstream applications consume this data, and so would be impacted by a service disruption?
The Topology, App Catalog and SQL Engine therefore give us the ability to maintain a metadata catalog of data flowing across a data platform.
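As a rough sketch of how a topology can answer the lineage questions above, the example below models applications, topics and indexes as nodes in a directed graph and walks it in either direction. The edge list, node names and helper functions are invented for illustration and do not describe how Lenses stores its Topology.

```python
from collections import defaultdict, deque

# Hypothetical topology: each edge reads "data flows from source to target".
EDGES = [
    ("orders-service", "orders"),            # microservice produces to a topic
    ("orders", "fraud-detector"),            # stream processor consumes it
    ("fraud-detector", "flagged-orders"),    # and writes to another topic
    ("flagged-orders", "es-sink-connector"), # connector consumes that topic
    ("es-sink-connector", "flagged-orders-index"),
    ("orders", "billing-service"),
]


def build_graph(edges):
    """Index the edges in both directions for upstream and downstream walks."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start, adjacency):
    """Breadth-first search returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


upstream, downstream = build_graph(EDGES)

# Provenance: everything that contributed to the "flagged-orders" topic.
provenance = reachable("flagged-orders", upstream)

# Impact analysis: everything affected if the "orders" topic is disrupted.
impacted = reachable("orders", downstream)

print(sorted(provenance))
print(sorted(impacted))
```

Walking the same graph in both directions is what lets a single structure serve provenance and impact analysis.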
Most importantly, this data is updated automatically and in real-time.
As engineering teams develop a new product, whether it be a consumer-facing microservice application or data processing pipeline, the data and topology will be discovered automatically, including payload and all metadata.
Or if an application writes to an Elasticsearch index, that too will automatically be picked up.
No need to manually maintain a catalog.
This information can then be presented and found through free-text search, à la Google.
The catalog is protected with the same unified namespace-based security model that protects all data access in Lenses.
It opens up new use cases around how data can be accessed and drastically reduces the time or the duplicate effort compared to current methods of finding data.
Here are two examples.
1. Scoping a new project
A business analyst is able to scope the feasibility of a new innovative stock management application by exploring what data can be used across multiple different lines of business, including service levels, quality and compliance requirements.
The analyst starts typing keywords such as stock* to find all matching metadata (indexes, documentation, field names, streams) and the applications that generate it.
They can drill down to the payload to explore the data, or view it in the context of a topology to understand the upstream and downstream applications connected to it. An analyst can only view data they have been granted access to, and may have certain sensitive fields masked in accordance with compliance requirements.
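The scenario above can be sketched in a few lines: a wildcard search over entity and field names, restricted to the namespaces a user may see, with sensitive fields masked in any payload shown. The catalog entries, namespaces and the `search` and `mask` helpers are hypothetical; this is an illustration of the idea, not the Lenses API or its security model.

```python
from fnmatch import fnmatch

# Hypothetical catalog entries; a real catalog would be populated automatically.
CATALOG = [
    {"name": "stock-levels", "kind": "topic", "namespace": "retail",
     "fields": ["sku", "warehouse", "quantity"]},
    {"name": "orders", "kind": "topic", "namespace": "retail",
     "fields": ["order_id", "stock_item", "card_number"]},
    {"name": "stock-audit", "kind": "index", "namespace": "finance",
     "fields": ["sku", "auditor"]},
    {"name": "customers", "kind": "topic", "namespace": "crm",
     "fields": ["surname", "phone"]},
]


def search(catalog, pattern, allowed_namespaces):
    """Return names of entities whose name or any field matches the pattern."""
    pattern = pattern.lower()
    hits = []
    for entity in catalog:
        # Entities outside the user's namespaces are invisible, not just locked.
        if entity["namespace"] not in allowed_namespaces:
            continue
        terms = [entity["name"], *entity["fields"]]
        if any(fnmatch(term.lower(), pattern) for term in terms):
            hits.append(entity["name"])
    return hits


def mask(record, sensitive_fields):
    """Replace the values of sensitive fields before showing a payload."""
    return {key: ("****" if key in sensitive_fields else value)
            for key, value in record.items()}


# An analyst with access to the "retail" namespace only.
analyst_hits = search(CATALOG, "stock*", {"retail"})

# Example payload with a made-up card number, masked before display.
masked = mask(
    {"order_id": 7, "stock_item": "A-1", "card_number": "4111111111111111"},
    {"card_number"},
)

print(analyst_hits)
print(masked)
```

Note that the "orders" topic is found through its stock_item field even though its name does not match, which is the practical benefit of indexing field-level metadata.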
2. Data access audit
An auditor needs to explore all data entities holding possible password information.
The auditor saves themselves weeks of data gathering and manual reporting by searching pass* to find all entities. They then check the Lenses user group namespaces for these entities to understand which users have access, and use the Topology to understand which applications process this information.
This same process can help meet any number of compliance controls including GDPR, HIPAA, SOX, SEC and PCI.