Get your GitOps for real-time apps on Apache Kafka & Kubernetes
Here's how to bring DataOps to your app deployment on Kafka on Kubernetes using GitOps
Infrastructure as code has been an important practice of DevOps for years.
If you’re running an Apache Kafka data infrastructure on Kubernetes, chances are you’ve already nailed defining your infrastructure this way.
If you’re running on Kubernetes, you’re likely using operators as part of your CI/CD toolchain to automate your deployments.
Adopting GitOps is a natural evolution, whereby the state of your landscape is managed in Git (or any code repo) and automation systems ensure the state of a deployment remains consistent with the repo.
Where we’ve seen a particular lack of maturity and best practices is in managing the application landscape of flows and microservices on technologies such as Kafka and Kubernetes.
GitOps is increasingly playing a big part in DataOps. DataOps promotes the accelerated delivery of data products through decentralized data governance, automation, self-service and lowering the skills required to operate data. GitOps brings standardization, governance and automation.
It’s a subject we’ve spoken about at a few events recently, including Kafka Summit this year, and on DevOps.com.
What Kafka with GitOps should look like
When looking to automate real-time applications following DataOps, we need to be thinking about more than automating the app deployment. We need to define the correct data governance controls as part of the software delivery.
This means avoiding blindly releasing apps where governance is an afterthought.
A lack of good governance reduces the accessibility of data, the trust in it and the confidence to use it downstream. Governance should cover the accessibility, availability, quality, integrity and confidentiality of the data.
We want the creator of the app, who best understands the data, to be able to define the governance controls together with it as a single package and then have that package reviewed through a standard workflow.
Example: A data analyst builds their own data-processing application, but they also define the governance and compliance controls required to release it. This puts the app’s performance parameters in their hands - they decide what makes it production-ready and enterprise-grade.
When deploying Kafka on Kubernetes, it may look like this: The analyst defines the application logic in some form (such as SQL) and includes:
The necessary Kafka ACLs
The data policies to mask any sensitive data
The lag and expected data throughput alerting rules
Where those alerts should be sent (Slack, PagerDuty, etc.)
The auditing rules to 3rd parties
And so on...
This can then be pushed to Git, managed through your standard Git workflows and a merge request created, triggering a build pipeline to deploy across your different environments.
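One way to make such a package reviewable is a small check in the build pipeline that refuses to deploy when a governance section is missing. The sketch below is only an illustration: the section names and the package layout are hypothetical and are not a Lenses format.

```python
# Hypothetical package layout: these section names are illustrative only,
# they are not taken from Lenses or any other tool.
REQUIRED_SECTIONS = (
    "app_logic",       # e.g. the SQL that defines the application
    "acls",            # the necessary Kafka ACLs
    "data_policies",   # masking rules for sensitive data
    "alert_rules",     # lag and throughput alerting rules
    "alert_channels",  # where alerts are sent (Slack, PagerDuty, ...)
    "audit_rules",     # auditing rules
)


def missing_sections(package):
    """Return the required sections that are absent or empty in the package."""
    return [name for name in REQUIRED_SECTIONS if not package.get(name)]


# An example package that forgot its audit rules and left the ACLs empty.
package = {
    "app_logic": "SELECT * FROM taxi_trips WHERE distance > 10",
    "acls": [],
    "data_policies": [{"field": "card_number", "action": "mask"}],
    "alert_rules": [{"type": "consumer_lag", "threshold": 1000}],
    "alert_channels": ["slack"],
}

problems = missing_sections(package)
if problems:
    print("Package is not ready for review, missing:", ", ".join(problems))
```

A pipeline step could fail the merge request whenever `missing_sections` returns a non-empty list, so governance is checked before anything is deployed.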
If you’re doing this with Kafka, there are two areas to consider:
How to define the application & governance landscape as config
How to best automate deployment in an infrastructure-agnostic and secure way
Defining the App and Governance as config
Lenses provides a full experience for data practitioners (developers, analysts, even business users) to build real-time applications and deploy them on a Kafka & Kubernetes data platform, ensuring all data is catalogued, accessible and explorable.
The Lenses CLI client allows your “data landscape” to be exported as declarative configuration. Here, for example, is a YAML file representing an Avro schema:
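As a rough illustration of the shape of such a schema, the sketch below builds an Avro record in Python using the `doc` strings from this example. The record name, field names and field types are assumptions made for illustration, and the small helper is hypothetical, not part of the Lenses CLI.

```python
import json

# Only the "doc" strings come from the example; the record name,
# field names and field types are assumed for illustration.
taxi_trip_schema = {
    "type": "record",
    "name": "TaxiTrip",
    "doc": "Dataset with the Lensicab taxi trips containing trip distance",
    "fields": [
        {"name": "trip_id", "type": "string",
         "doc": "The unique ID of this taxi ride/trip"},
        {"name": "trip_date", "type": "string",
         "doc": "The date when the ride/trip happened"},
        {"name": "distance_km", "type": "double",
         "doc": "The distance of the taxi ride/trip in Km"},
    ],
}


def undocumented_fields(schema):
    """Return the names of fields that have no 'doc' description."""
    return [f["name"] for f in schema["fields"] if not f.get("doc")]


# Print the schema as JSON, the format Avro schemas are defined in.
print(json.dumps(taxi_trip_schema, indent=2))
```

Keeping a `doc` on every field is what makes the data explorable in a catalogue, so a check like `undocumented_fields` could run as part of the review workflow.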
"doc": "Dataset with the Lensicab taxi trips containing trip distance",
"doc": "The unique ID of this taxi ride/trip"
"doc": "The date when the ride/trip happened"
"doc": "The distance of the taxi ride/trip in Km"
Or a Topic:
Or a data masking rule (“data policy”)
The configuration would include objects such as:
App Logic (as SQL or Kafka Connect config)
Monitoring alert rules (lag, throughput etc.)
Data masking rules
Deployment definition (for example for Kubernetes or Kafka Connect)
How to apply GitOps to Kafka and best maintain a consistent state
There are a few commonly adopted methods for this.
Pushing from CI/CD is the method we see most organizations adopt. Teams push to Git, which triggers a deploy pipeline in a CI/CD tool (such as Jenkins) that uses the various APIs of Kafka, Kafka Connect, Kubernetes or wherever you’re running your applications. This makes it difficult for Jenkins to monitor the desired state and ensure consistency with Git.
Of course you also need to open up the firewall to allow Jenkins into your data platform.
K8 Operator for Kafka
Alternatively, your CI/CD tool can push the desired state to Kubernetes, where a Kubernetes operator (such as Strimzi’s) monitors the desired state and applies it to Kafka via APIs. Jenkins or your CI/CD tool still needs to reach into the data platform environment, however, which may increase security risk. This method is often used to operate the Kafka infrastructure (scaling up brokers and so on) rather than to operate at the data layer (topics, ACLs, etc.).
Pull methods are still based on an operator pattern, but the operator watches Git rather than Kubernetes.
This allows the pattern to work whether or not you use Kubernetes.
The operator is a Lenses operator, which can run either inside or outside your data platform. The operator speaks with Lenses, which applies the desired state to the data platform, including any real-time applications. For anyone using Lenses Streaming SQL, for example, this would deploy the application over Kubernetes.
The desired state of course would have been created within a separate Lenses environment and exported as config with the CLI before being pushed into Git.
The benefit of this pull deployment approach is a more secure environment, since your CI/CD does not need access to your infrastructure. It also ensures your applications are secured, governed and audited in Git, with the state actively monitored and the ability to roll back at any time.
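The core of any such pull-based operator is a reconciliation step: compare the desired state read from Git with the actual state of the platform and work out what to change. The sketch below shows that comparison only. The resource keys and settings are hypothetical, and it makes no claim about how the Lenses operator is implemented.

```python
def plan_changes(desired, actual):
    """Compare desired state (from Git) with actual state (from the platform).

    Both arguments map a resource key to its configuration. The result lists
    which resources need to be created, updated or deleted.
    """
    desired_keys = desired.keys()
    actual_keys = actual.keys()
    return {
        "create": sorted(desired_keys - actual_keys),
        "update": sorted(k for k in desired_keys & actual_keys
                         if desired[k] != actual[k]),
        "delete": sorted(actual_keys - desired_keys),
    }


# Hypothetical resources: the keys and settings are for illustration only.
desired = {
    "topic/taxi_trips": {"partitions": 3},
    "policy/mask_card": {"redaction": "last-4"},
}
actual = {
    "topic/taxi_trips": {"partitions": 1},
    "topic/old_events": {"partitions": 1},
}

plan = plan_changes(desired, actual)
print(plan)
```

Running the comparison on a schedule, rather than once per pipeline run, is what lets the operator detect and correct drift. A rollback then amounts to reverting the commit in Git and letting the next comparison apply the earlier state.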