SQL has long been the universal language for working with data. In fact it’s more relevant today than it was 40 years ago. Many data technologies were born without it and inevitably ended up adopting it later on.
Apache Kafka is one of these data technologies. At Lenses.io, we were the first in the market to develop a SQL layer for Kafka (yes, before KSQL) and integrate it into a few different areas of our product for different workloads.
Below we’ll explain why you should use SQL for stream processing, and the key challenges SQL helps us overcome as data engineers and developers.
SQL has two natural benefits that can make it a critical component of DataOps and ensure fast, consistent adoption of real-time data use cases.
Anyone with a computer science background will immediately be familiar with SQL, which you can’t say for many other languages. Then there are analysts and scientists with non-IT backgrounds who already know how to use SQL.
With SQL you can operate on data in any number of relational (and non-relational) databases. With perhaps a few minor differences, an organization can change technology without necessarily needing to re-architect its applications.
Our experience with customers is that SQL can play a significant role in promoting DataOps practices and driving faster adoption of real-time data.
Stream processing comes with a set of challenges that many engineering teams struggle to overcome.
Developing stream processing applications (using Apache Kafka Streams) isn’t for everyone. We’ve seen many experienced developers struggle to adopt a different way of thinking to solve a problem compared to an RDBMS.
Here are a few examples:
A data stream is an infinite source of data. Instead of acting on data, as we do with a traditional database, we react to it. Think of an account balance and the payments made against it: an event such as "Payment of 5 USD made to StarCoffee" triggers a recalculation of the balance, which drops from 1005 USD to 1000 USD. Rather than writing code that goes to the data, the data now triggers and runs the code.
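That reactive model can be sketched as a continuously-updating Lenses SQL aggregation. The topic and field names below are hypothetical, and the exact grammar may vary between versions:

```sql
-- Each payment event that arrives triggers a recalculation of the
-- account's running total; the data drives the code, not the reverse.
SET defaults.topic.autocreate=true;

INSERT INTO account_balances
SELECT STREAM
    accountId,
    SUM(amount) AS totalSpent
FROM payments
GROUP BY accountId;
```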
Joining two data streams is more involved than in a database; given the infinite nature of the data flow, a new dimension is required: time.
Some joins require lookups and may not require a time window if one of the streams can maintain an ever-updating state.
You need to think about how your data set has been partitioned and keyed. To join data, the topics involved need to have the same number of partitions, and the partitioner for the messages (the logic which decides which partition a message goes to) has to align.
Aggregating data comes with a specific approach too: having to cater for early or late-arriving data.
Scaling the application relies on Kafka’s Consumer Group semantics. Each instance of the application can process your data, but at some point spinning up more instances won’t help because they will sit idle: it all depends on the partition count of the source topics.
Whether it’s for a complex or simple workload, engineering and data management/analytics teams will have to be up-skilled on these intricacies.
Developing real-time applications requires a new level of visibility and control over the infrastructure and surrounding systems. Not just for Devs and Ops, but for a much wider group of people. If you’re using a Schema Registry, you’ll need access to view, create and evolve schemas (in an audited fashion). You’ll need to understand the format of your Kafka messages and have a good grasp of Avro and how a schema registry integrates.
Aggregations and joins rely on the message Key, and the implications of this need to be understood. Re-keying at times cannot be avoided, and it has a direct impact on performance and the cluster, because it means re-copying the data to another topic with a different key. A message’s timestamp drives which time window it gets allocated to. You’ll need access to the data to be able to debug when issues occur in production.
All these challenges, and more, are to be overcome without the usual tooling that DBAs have at their disposal in more traditional environments, as Kafka does not provide them out of the box.
If you’re working with a mixture of legacy and modern/cloud-native applications, the likelihood is that data will be serialized in different formats, anything from String, CSV and JSON at one end of the spectrum to Avro and proprietary formats at the other. You will need to understand how to deserialize the data within your code. But even before you get to that stage, you’ll want to explore the serialized data in a data catalog and query the data so you know what’s available, and what to process.
Avro on its own has a series of intricacies that need to be understood.
We understand the wide adoption of SQL in the industry and its benefits. That’s why since the first release of Lenses.io we’ve embedded SQL within our technology. This primarily covers three use cases:
Exploring data across your different data technologies (including Elasticsearch and Apache Kafka), known as the Snapshot Engine.
Moving data in and out of Kafka via our Stream Reactor Kafka Connect Connectors.
Stream processing data and deploying workloads on Kubernetes or Kafka Connect.
In this blog, we’re primarily talking about the latter: our Streaming SQL Engine underwent a big revamp as part of our 4.0 release.
Firstly, we know that SQL isn’t for everyone or every use case. You’re not going to build a machine learning model with SQL.
Whether you’re an experienced stream processing developer or a data scientist there are four main benefits to using SQL to process, transform or move data to feed your model in a time-sensitive way:
An operator who is used to working with RDBMSes may still need to know some basic concepts of stream processing (such as the difference between a Table and a Stream), but otherwise many of the complexities of keying and partitioning data can be abstracted and dealt with behind the scenes.
Here’s an example:
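The sketch below joins two topics on value fields `a` and `b`. The topic names and the other projected fields are hypothetical, and the exact grammar may vary between versions:

```sql
-- Join on value fields rather than on the record keys; Lenses re-keys
-- both topics behind the scenes so the join can be executed.
SET defaults.topic.autocreate=true;

INSERT INTO joined_topic
SELECT STREAM
    lt.a,
    lt.x,
    rt.y
FROM left_topic AS lt
    INNER JOIN right_topic AS rt
        ON lt.a = rt.b
WITHIN 1h;
```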
Joining two topics is an operation that happens only on the records’ keys; to join on arbitrary fields in the topics’ values, like a and b in the example above, Lenses will automatically re-key the topics to use the specified fields, ensuring that the complexities of the operation are hidden from the user.
Making data processing directly accessible to a data consumer increases innovation by removing the human bottlenecks between those that understand the data and those that have access to the data.
No one enjoys developing applications for simple tasks such as data transformation. Not only is it mundane, but it takes time, adds complexity in deployment and maintenance, and accrues the tech debt mentioned above.
Example: A workload that transforms CSV source topics into an Avro format to be consumed by a downstream application.
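As a sketch, assuming hypothetical topic names and a `STORE ... AS` output-format clause as found in Lenses’ streaming SQL (parsing the raw CSV line into named fields is elided for brevity):

```sql
-- `orders_csv` carries delimited text records; the processor writes them
-- out Avro-encoded for the downstream application to consume.
SET defaults.topic.autocreate=true;

INSERT INTO orders_avro
STORE KEY AS STRING VALUE AS AVRO
SELECT STREAM *
FROM orders_csv;
```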
Saving engineers from these sorts of workloads allows them to focus on more complex tasks that drive higher value.
SQL decoupled from the data infrastructure (in our case Apache Kafka, but soon to be opened up to others) provides a consistent way of building, deploying, operating and monitoring real-time data workloads across different technologies.
A major advantage of stream processing with SQL is how developers can define data processing workloads as configuration.
With the Lenses Streaming SQL engine, we remove the dependencies for the code to be deployed and run. The business logic of your workload and everything needed to run the workload can be declared with SQL via our secure UI, API or CLI and supporting attributes within a configuration file. This workload will be deployed through our Engine onto your existing data infrastructure: any Kubernetes cluster or Kafka Connect cluster can be used.
This also avoids you needing to deploy new distributed data infrastructure just for SQL workloads.
Your applications can run where the rest of your business applications run.
The monitoring of the application (partitions, consumer lag, throughput, etc.) and the exploration of the data it generates (via our Data Catalog and Snapshot SQL Engine) can be done through the same portal in which you created and deployed the application.
This configuration can then be managed in Git and built and deployed through a far simplified CI/CD pipeline whilst ensuring governance.
Not all business requirements can be met with SQL alone. There are scenarios where there is no escape from writing Java, Scala or another JVM language, depending on how complex the requirements are. But there are many tasks SQL can perform that help you move to production in a matter of minutes, avoiding the full software development lifecycle (code, test, build, deploy).
Many in our community wanted a way to convert JSON to Avro automatically. No one wants to spend time on such a mundane task. Ideally, you should avoid schemaless JSON payloads landing in your Kafka topics. But sometimes, given old systems or specific 3rd parties, there is no way around it.
Here’s an example:
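A sketch of such a conversion follows; topic and field names are hypothetical, and the `STORE ... AS AVRO` clause is assumed from Lenses’ streaming SQL and may differ by version. The projected fields define the Avro schema registered for the output topic:

```sql
-- Read schemaless JSON, write Avro for downstream consumers.
SET defaults.topic.autocreate=true;

INSERT INTO events_avro
STORE KEY AS STRING VALUE AS AVRO
SELECT STREAM
    deviceId,
    temperature,
    eventTime
FROM events_json;
```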
Re-keying a topic is another task that has featured high on the list of requirements from our customers. There are many reasons for it, from an upstream system not setting the key, to needing to build state (a KTable) on a new Key value.
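A minimal sketch, assuming hypothetical topic and field names and that the engine’s `_key` facet can be assigned in the projection (syntax may vary by version):

```sql
-- Re-key `orders` by `customerId`, e.g. so a KTable can be built on it
-- downstream. Note this copies the data to a new topic.
SET defaults.topic.autocreate=true;

INSERT INTO orders_by_customer
SELECT STREAM
    customerId AS _key,
    *
FROM orders;
```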
Reshaping the data is another in-demand use case. It’s a mundane task, and with SQL it can be delivered to production in a matter of minutes; the same can’t be said for going through the full software development lifecycle. Changing the way the data looks can target the message Key, the Value, or both.
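A reshaping query might look like the following sketch; the input topic `car_events` and its nested `car` record are hypothetical, and the exact grammar may vary between versions:

```sql
-- Flatten the nested value and rename fields in the output Value.
SET defaults.topic.autocreate=true;

INSERT INTO car_speeds
SELECT STREAM
    car.id AS carId,
    speed AS speedMph,
    eventTime
FROM car_events;
```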
Unwrapping the Key and Value might come in handy if you want to push your data into time-series storage like InfluxDB. The message Key will contain the car’s unique identifier and the Value will contain the speed.
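Sketched below with hypothetical names, assuming the engine allows projecting directly onto the `_key` and `_value` facets (syntax may vary by version):

```sql
-- The output Key becomes the bare car identifier and the Value the bare
-- speed reading, ready for a time-series store such as InfluxDB.
SET defaults.topic.autocreate=true;

INSERT INTO car_speed_points
SELECT STREAM
    car.id AS _key,
    speed AS _value
FROM car_events;
```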
Sometimes you need to trigger a delete on a Kafka compacted topic when certain criteria are met.
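In Kafka, a delete on a compacted topic is expressed as a tombstone: a record with the existing key and a null value. A sketch, with hypothetical names and assuming the `_key`/`_value` facets can be assigned (syntax may vary by version):

```sql
-- When a session closes, emit a null Value (a tombstone) for its key;
-- compaction then removes the entry from the topic.
SET defaults.topic.autocreate=true;

INSERT INTO sessions_by_id
SELECT STREAM
    sessionId AS _key,
    NULL AS _value
FROM session_events
WHERE status = 'CLOSED';
```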
Data is connected in many different ways, and we spend a lot of time and effort joining it this way and that: for reporting, or to add the information that the next business process needs to do its part.
A common scenario is a lookup. The left-side data stream has to be enriched with data which is slow-changing (or static). A call record arrives, and the output needs to contain the customer’s details as well as the call’s.
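A sketch of such a lookup, with hypothetical topic and field names (grammar may vary by version). Because the right side maintains ever-updating state, no time window is needed:

```sql
-- `customers` is a compacted topic acting as an ever-updating table,
-- so the join carries no WITHIN clause.
SET defaults.topic.autocreate=true;

INSERT INTO calls_enriched
SELECT STREAM
    call.callId,
    call.duration,
    customer.name,
    customer.tier
FROM calls AS call
    INNER JOIN customers AS customer
        ON call.customerId = customer.customerId;
```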
Another case of enriching data is joining two data streams which can be correlated via a common identifier over a time window. If you consider two event streams, shipments and orders, you can answer questions like which orders have been shipped.
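A sketch of that stream-to-stream join (hypothetical names; grammar may vary by version):

```sql
-- Match orders to shipments that arrive within 24 hours of the order.
SET defaults.topic.autocreate=true;

INSERT INTO shipped_orders
SELECT STREAM
    o.orderId,
    o.customerId,
    s.shipmentId
FROM orders AS o
    INNER JOIN shipments AS s
        ON o.orderId = s.orderId
WITHIN 24h;
```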
Let’s finish with yet another scenario frequently found in stream processing: measuring your website traffic over a time interval, or tracking each gamer’s highest score over a time window.
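A windowed aggregation can be sketched as follows, with hypothetical names and assuming a `WINDOW BY TUMBLE` clause as in Lenses’ streaming SQL (grammar may vary by version):

```sql
-- Page views per page over tumbling one-hour windows. The same shape,
-- with MAX(score) grouped by gamer, would track highest scores.
SET defaults.topic.autocreate=true;

INSERT INTO page_views_hourly
SELECT STREAM
    page,
    COUNT(*) AS views
FROM web_traffic
WINDOW BY TUMBLE 1h
GROUP BY page;
```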
Ready to get started with building real-time data processing applications with SQL on your existing Apache Kafka and Kubernetes clusters? If you don’t have your own environment, you can download our all-in-one Kafka/Lenses.io Box docker container or access a free cloud environment.