Apache Kafka and GDPR Compliance

Antonios Chalkiopoulos

General Data Protection Regulation

Read the March 2021 updated guide Architecting Apache Kafka for GDPR compliance.

GDPR is an important piece of legislation designed to strengthen and unify data protection laws for all individuals within the European Union. The regulation becomes effective and enforceable on 25 May 2018.

Our commitment is to provide the necessary capabilities in data streaming systems to allow your data-driven business to achieve compliance with GDPR before the regulation’s effective date.

What are the key points regarding data activities?

In summary, here is a list of the important changes that come into effect with the GDPR:

  • Right to request a copy of personal data

  • Right to be forgotten

  • Compliance obligations to keep detailed records on data activities

“Under its expanded rights for individuals, the GDPR grants individuals in the European Union, amongst other things, the right to be forgotten and the right to request a copy of any personal data stored in their regard.”

How does GDPR relate to Apache Kafka?

Apache Kafka is one of the most prominent data streaming systems in the modern enterprise, with the majority of data continuously streaming in and out of various systems. Consider a typical Avro-encoded message flowing through Kafka:

{
  "type": "record",
  "name": "CustomerRecord",
  "namespace": "com.acme.system",
  "doc": "Schema for Customer Records",
  "fields": [
    {
      "name": "customerId",
      "type": "long",
      "doc": "The unique customerId"
    }
  ]
}

The Avro schema definition allows attaching additional metadata, which can be used to annotate particular fields. By injecting "gdpr": "customerId" into the Avro schema, we can automate the tracking and collection of customer data:

    {
      "gdpr": "customerId",
      "name": "customerId",
      "type": "long",
      "doc": "The unique customerId"
    }
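
To illustrate, here is a minimal sketch of how such annotated fields could be discovered programmatically. The gdpr key is the convention introduced above, not a standard Avro feature, and the inline schema is just the example record:

    import json

    def gdpr_fields(schema):
        """Return the names of fields annotated with a "gdpr" key."""
        return [f["name"] for f in schema.get("fields", []) if "gdpr" in f]

    # The annotated CustomerRecord schema from above, inlined for simplicity.
    schema = json.loads("""
    {
      "type": "record",
      "name": "CustomerRecord",
      "namespace": "com.acme.system",
      "fields": [
        {"gdpr": "customerId", "name": "customerId", "type": "long"}
      ]
    }
    """)

    print(gdpr_fields(schema))  # ['customerId']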

Now let’s review a streaming topology for a typical IoT use case.

[Figure: Streaming topology]

Customer data is collected via a JDBC source, and additional device events are sourced from an MQTT system. The data is processed (joined, filtered, aggregated) and placed into another topic before being pushed to downstream data stores.

To track the data and keep “detailed records on data activities”, we need to be aware of the full data lineage, even when transformations are in place:

    INSERT INTO data_topic_2 SELECT customerId AS cid, ...

A Kafka SQL processor, for example, can rename a field, so tracking lineage across the topology is the mechanism for identifying which other topics or data stores now contain customer information.
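
As a sketch of that mechanism (the annotation and rename maps here are hypothetical, derived from the SQL statement above), the annotation can be carried over a processor’s field renames so that downstream topics inherit it:

    # Hypothetical sketch: carry gdpr annotations across a SQL processor's
    # field renames, e.g. the "customerId AS cid" statement above.
    def propagate(annotations, renames):
        """Map upstream field annotations onto the processor's output field names."""
        return {renames.get(field, field): tag
                for field, tag in annotations.items()}

    upstream = {"customerId": "gdpr"}    # annotations on the source topic
    renames = {"customerId": "cid"}      # extracted from the SQL statement
    print(propagate(upstream, renames))  # {'cid': 'gdpr'} -> data_topic_2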

GDPR compliance with Apache Kafka

Compliance obligations to keep detailed records on data activities

GDPR awareness requires, as a first step, identifying all the points where data about individuals is processed or stored. In the case of Apache Kafka, you would expect to track i) topics, ii) connectors and iii) processors:

[Figure: Topics with data about individuals]

[Figure: Connectors with data about individuals]

[Figure: Processors with GDPR data]

Tracking and auditing all activities allows us to meet the compliance obligation to keep detailed records on data activities. Auditing means we have to track any processing of the data across the entire lifecycle and timeline of the customer’s data.

Right to request a copy of personal data

Once a request for a copy of personal data is received, the GDPR requires us to be in a position to retrieve all personal data for that particular individual. In the case of IoT, this could include both unaggregated and aggregated device data, which makes it quite a challenge.

With a lineage-aware system that can propagate the gdpr annotation through the streaming topology, we can fully automate the collection of personal information, both from Kafka topics and from target data store systems.

Thus, executing a query:

[Figure: GDPR query]

This should collect all the data related to the particular individual across all topics, and also report that a particular Cassandra table and a particular Elasticsearch index may contain additional information.
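
As a minimal sketch of the topic-side collection, assuming JSON-encoded values and a hypothetical customer_events topic (a real deployment would deserialize Avro via the schema registry and iterate over every annotated topic):

    import json
    from confluent_kafka import Consumer

    # Hypothetical sketch: scan one annotated topic for a subject's records.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "gdpr-export",
        "auto.offset.reset": "earliest",  # read the topic from the beginning
    })
    consumer.subscribe(["customer_events"])  # hypothetical annotated topic

    subject_id = 42
    matches = []
    while True:
        msg = consumer.poll(timeout=5.0)
        if msg is None:        # no more records arrived within the timeout
            break
        if msg.error():
            continue
        record = json.loads(msg.value())
        if record.get("customerId") == subject_id:
            matches.append(record)

    consumer.close()
    print("Found %d records for customer %d" % (len(matches), subject_id))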

Right to be forgotten

The right to be forgotten becomes one of the hardest challenges because of data immutability. Apache Kafka does not support deleting individual records in place, and although eventual deletion is supported, it requires the following (a minimal sketch follows the list):

  • A topic to be compacted

  • A message with the same key and a null value (a tombstone) to be pushed into that topic

  • The compaction process to eventually kick in
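
A minimal sketch of producing such a tombstone with the confluent-kafka Python client, assuming a compacted customer_topic keyed by customerId:

    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    # A null value for an existing key is a tombstone: once log compaction
    # runs on the compacted topic, earlier records with this key are removed.
    producer.produce("customer_topic", key="42", value=None)
    producer.flush()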

As a first step, what we need in terms of topics is a complete list of all topics containing data about individuals, along with the associated retention period and whether the topic is compacted:

[Figure: GDPR topics report]
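
A sketch of how that report could be assembled with Kafka’s admin API; in practice the loop would cover only the topics flagged by the gdpr annotations rather than the whole cluster:

    from confluent_kafka.admin import AdminClient, ConfigResource

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})

    # Placeholder: iterate over every topic; in practice only the topics
    # flagged by the gdpr annotations would be inspected.
    for name in admin.list_topics(timeout=10).topics:
        resource = ConfigResource(ConfigResource.Type.TOPIC, name)
        configs = admin.describe_configs([resource])[resource].result()
        print(name,
              "cleanup.policy =", configs["cleanup.policy"].value,
              "retention.ms =", configs["retention.ms"].value)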

Let’s drill into the right to be forgotten with an example: a customer wishes to be entirely forgotten.

What we would expect to happen is for the customer record to be removed from the source of truth, typically a SQL or NoSQL database. If we are using a Kafka connector with CDC (Change Data Capture) capabilities, the delete will be picked up by the connector, and a tombstone record (null value) will be pushed into the customer topic.

[Figure: Right to be forgotten with Kafka]

If the Cassandra or Elasticsearch sink is CDC-aware, it can automatically remove the relevant records from the target system. If the Kafka sink connector is not CDC-aware, or is writing to an immutable store (e.g. S3 or HDFS), then some additional mechanism will have to kick in to wipe that data.
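
To illustrate the CDC-aware behaviour, here is a sketch of the sink-side logic with an in-memory dictionary standing in for the target store (the real sink connectors implement this against Cassandra or Elasticsearch):

    # In-memory stand-in for a target store such as a Cassandra table.
    store = {}

    def apply_record(key, value):
        """A CDC-aware sink treats a null value (a tombstone) as a delete."""
        if value is None:
            store.pop(key, None)  # remove the customer from the target system
        else:
            store[key] = value    # regular upsert

    apply_record("42", {"customerId": 42, "name": "Jane"})
    apply_record("42", None)      # the tombstone arrives: the record is wiped
    print(store)                  # {}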

Conclusions

As modern data-driven businesses operate increasingly in real time, and Apache Kafka becomes a central component for messaging across systems, it naturally becomes the place to track as well as enforce the new regulations.

As data protection regulations evolve, and with fines of up to €10 million or 2% of the company’s global annual turnover for the previous financial year, or up to €20 million or 4% for the most serious infringements (whichever is higher), integrating data protection ‘by design and by default’ is becoming a high priority across many enterprises that deal with EU-based citizens.

This is the first in a series of articles on how an organization can tackle these regulations at design time and provide a framework to ensure compliance is in place. If you are interested in GDPR compliance with Apache Kafka, feel free to contact us to learn more.

Additional Resources

GDPR - Chapter 3 - Rights of the data subject
