Apache Kafka and GDPR Compliance

General Data Protection Regulation

Read the March 2021 updated guide Architecting Apache Kafka for GDPR compliance.

GDPR is an important piece of legislation designed to strengthen and unify data protection laws for all
individuals within the European Union. The regulation becomes effective and enforceable on 25 May 2018.

Our commitment is to provide the necessary capabilities in data streaming systems to allow your data-driven business
to achieve compliance with GDPR prior to the regulation’s effective date.

What are the key points regarding data activities?

In summary, here is a list of important changes that will come into effect with the upcoming GDPR:

  • Right to request copy of personal data
  • Right to be forgotten
  • Compliance obligations to keep detailed records on data activities

“Under the Expanded rights for individuals, the GDPR provides expanded rights for individuals in the European Union
by granting them, amongst other things, the right to be forgotten and the right to request a copy of any personal data
stored in their regard.”

How does GDPR relate to Apache Kafka?

Apache Kafka is one of the most prominent data streaming systems in the modern enterprise, so a large share of data is
continuously streaming through it, in and out of various systems. Consider the Avro schema of a typical message flowing through Kafka:

{
  "type": "record",
  "name": "CustomerRecord",
  "namespace": "com.acme.system",
  "doc": "Schema for Customer Records",
  "fields": [
    {
      "name": "customerId",
      "type": "long",
      "doc": "The unique customerId"
    }
  ]
}

An Avro schema definition allows attaching additional metadata, which can be used to annotate particular fields.
By injecting "gdpr" : "customerId" into the Avro schema, we can automate the tracking and collection of customer
data.

    {
      "gdpr": "customerId",
      "name": "customerId",
      "type": "long",
      "doc": "The unique customerId"
    }
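As a minimal sketch of how such an annotation could be consumed, the following Python snippet walks an Avro schema and lists the fields that carry a "gdpr" attribute. The helper name `gdpr_fields` and the extra `deviceId` field are hypothetical, and the sketch assumes a flat record schema (nested records are not handled).

```python
import json

# The schema from above, plus one hypothetical non-annotated field for contrast.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "CustomerRecord",
  "namespace": "com.acme.system",
  "doc": "Schema for Customer Records",
  "fields": [
    {"gdpr": "customerId", "name": "customerId", "type": "long",
     "doc": "The unique customerId"},
    {"name": "deviceId", "type": "string", "doc": "Hypothetical device field"}
  ]
}
"""


def gdpr_fields(schema: dict) -> dict:
    """Map each annotated field name to the value of its "gdpr" attribute."""
    return {
        field["name"]: field["gdpr"]
        for field in schema.get("fields", [])
        if "gdpr" in field
    }


schema = json.loads(SCHEMA_JSON)
annotated = gdpr_fields(schema)
print(annotated)
```

Only `customerId` is reported, because it is the only field with the annotation; a tool built on this idea could run the same check against every schema registered for a topic.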

Now let’s review a streaming topology for a typical IoT use case.

[Figure: streaming topology]

Customer data are collected via a JDBC source, and additional device events are sourced from an MQTT system. We
process the data (join, filter, aggregate) and place the results into another topic, before pushing them to downstream data stores.

To track the data and keep “detailed records on data activities”, we need to be aware of the full data lineage,
even when transformations are in place:

    INSERT INTO data_topic_2 SELECT customerId AS cid, ...

A Kafka SQL processor, for example, can rename a field, so tracking the lineage across the topology is the mechanism
for identifying which other topics or data stores now contain customer information.
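To make the idea concrete, here is a minimal sketch of lineage propagation, assuming the topology is available as a list of edges that each record how fields are renamed. The topic names (other than `data_topic_2`), the store name and the `propagate` helper are hypothetical.

```python
from collections import deque

# Hypothetical topology: (source, target, {source_field: target_field}).
EDGES = [
    ("customers", "data_topic_2", {"customerId": "cid"}),  # SELECT customerId AS cid
    ("device_events", "data_topic_2", {"deviceId": "deviceId"}),
    ("data_topic_2", "cassandra.customer_events", {"cid": "cid"}),
]


def propagate(seed, edges):
    """Follow annotated fields downstream, applying renames along the way.

    seed maps a topic to the set of its annotated field names. The result maps
    every topic or store reachable from the seed to the field names that carry
    the annotated data there.
    """
    tracked = {topic: set(fields) for topic, fields in seed.items()}
    queue = deque(tracked)
    while queue:
        source = queue.popleft()
        for src, dst, mapping in edges:
            if src != source:
                continue
            # Fields that survive this hop, under their new names.
            carried = {mapping[f] for f in tracked[source] if f in mapping}
            new = carried - tracked.get(dst, set())
            if new:
                tracked.setdefault(dst, set()).update(new)
                queue.append(dst)
    return tracked


lineage = propagate({"customers": {"customerId"}}, EDGES)
print(lineage)
```

The annotation on `customerId` follows the rename to `cid` and reaches the downstream store, while `device_events` is left out because no annotated field flows through it.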

GDPR compliance with Apache Kafka

Compliance obligations to keep detailed records on data activities

GDPR awareness requires, as a first step, identifying all the points where data about individuals are processed
or stored. In the case of Apache Kafka, this means tracking i) topics, ii) connectors and iii) processors:

[Figure: Topics with data about individuals]

[Figure: Connectors with data about individuals]

[Figure: Processors with GDPR data]

Tracking as well as auditing all activities allows us to respect the compliance obligations to keep detailed records
on data activities. Auditing means that we’ll have to track any processing of the data, across the entire
lifecycle and timeline of the customer’s data.

Right to request copy of personal data

Once a request for a copy of personal data is received, GDPR requires us to be in a position
to retrieve all personal data for the particular individual. In the case of IoT, that could include both unaggregated and
aggregated device data, which makes this quite a challenge.

With a lineage-aware system that can propagate the gdpr annotation through the streaming topology, we can fully
automate the collection of personal information, both from Kafka topics and from target datastore systems.

Thus, executing a query:

[Figure: gdpr query]

should collect all the data related to the particular individual across all topics, and report that
a particular Cassandra table and a particular Elasticsearch index may contain additional information.
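The shape of such a report can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topics are in-memory lists standing in for Kafka, and all topic names, store names, sample records and the `subject_access_report` helper are hypothetical.

```python
# Hypothetical in-memory stand-ins for Kafka topics; a real system would read
# from the brokers instead.
TOPICS = {
    "customers": [
        {"customerId": 42, "name": "Alice"},
        {"customerId": 7, "name": "Bob"},
    ],
    "data_topic_2": [
        {"cid": 42, "avgTemp": 21.5},
        {"cid": 7, "avgTemp": 19.0},
    ],
    "device_events": [{"deviceId": "d-1", "temp": 21.0}],
}

# Field that carries the annotated identifier in each topic, as produced by
# lineage tracking.
GDPR_FIELDS = {"customers": "customerId", "data_topic_2": "cid"}

# Downstream stores fed from annotated topics (hypothetical names).
SINKS = {
    "data_topic_2": ["cassandra.customer_events", "elasticsearch.customer-index"],
}


def subject_access_report(customer_id):
    """Collect matching records per topic and list the stores worth checking."""
    records = {
        topic: [r for r in TOPICS[topic] if r.get(field) == customer_id]
        for topic, field in GDPR_FIELDS.items()
    }
    stores = sorted(s for topic in GDPR_FIELDS for s in SINKS.get(topic, []))
    return {"records": records, "stores_to_check": stores}


report = subject_access_report(42)
print(report)
```

Topics without an annotated field, such as `device_events` here, are never scanned, which is the main benefit of carrying the annotation through the lineage.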

Right to be forgotten

The right to be forgotten becomes one of the hardest challenges because of data immutability. Apache Kafka does not
support deleting individual records on demand, and although some eventual deletion is supported, it requires:

  • A topic to be compacted
  • A message with the same key and a null value to be pushed into that topic
  • The compaction process to eventually kick in
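The effect of these three requirements can be illustrated with a deliberately simplified model of compaction. This is not Kafka's implementation: the real log cleaner runs on its own schedule, and tombstones are themselves retained for a configurable period before they disappear. The `compact` function and the sample log are hypothetical.

```python
def compact(log):
    """Simplified compaction: keep the latest record per key, drop tombstones.

    log is a list of (key, value) pairs in offset order; a value of None plays
    the role of a tombstone.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None  # a key whose latest value is a tombstone vanishes
    )
    return [(key, value) for _, key, value in survivors]


log = [
    (42, {"name": "Alice"}),
    (7, {"name": "Bob"}),
    (42, {"name": "Alice B."}),
    (42, None),  # tombstone: same key, null value
]
compacted = compact(log)
print(compacted)
```

After compaction only the record for key 7 remains. Until compaction runs, every earlier value for key 42 is still present in the log, which is why the deletion is only eventual.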

What we need (as a first step) in terms of topics is a complete list of all the topics with data about individuals, together with
each topic's retention period and whether the topic is compacted:

[Figure: gdpr topics report]

Let’s drill into the right to be forgotten with an example: a customer wishes to be entirely forgotten.

What we would expect to happen is for the customer record to be removed from the source of truth, which would be
a SQL or NoSQL database. If we are using a Kafka Connector with CDC (Change Data Capture) capabilities, the deletion
will be picked up by the connector, and a record with a null value will be pushed into the customer topic.

[Figure: GDPR right to be forgotten with Kafka]

If the Cassandra sink or Elasticsearch sink is CDC aware, it can automatically remove the relevant records from the
target system. If a Kafka sink connector is not CDC aware, or is writing to an immutable store (e.g. S3 or HDFS),
then some additional mechanism will have to kick in to wipe that data.
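What "CDC aware" means for a sink can be sketched as follows, assuming the target system behaves like a key-value store. The `apply_to_store` helper and the sample records are hypothetical and do not reflect any particular connector's API.

```python
def apply_to_store(store, key, value):
    """Upsert on a normal record; delete when the value is null."""
    if value is None:
        store.pop(key, None)  # the key may already be absent; that is fine
    else:
        store[key] = value


# A dict stands in for a Cassandra table or an Elasticsearch index.
store = {}
records = [
    (42, {"name": "Alice"}),
    (7, {"name": "Bob"}),
    (42, None),  # the customer with key 42 asked to be forgotten
]
for key, value in records:
    apply_to_store(store, key, value)

print(store)
```

A sink that only appends, or that ignores null values, would leave the record for key 42 in place, which is the case where an additional clean-up mechanism is needed.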

Conclusions

As modern data-driven businesses run more in real time, and Apache Kafka becomes a central component for
messaging across systems, it naturally becomes the point for tracking as well as enforcing the new regulations.

Data protection regulations keep evolving, and fines can be up to €10 million or 2% of the company’s global annual turnover
of the previous financial year, or up to €20 million or 4% of the company’s global annual turnover of the previous
financial year, whichever is higher. Integrating data protection ‘by design and by default’ is therefore becoming a high priority
across many enterprises that deal with individuals in the EU.
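The "whichever is higher" rule is simple arithmetic, sketched below for the two tiers mentioned above. The function name and the turnover figures are made up for illustration, and the result is the upper bound of a fine, not a prediction of what a regulator would impose.

```python
def max_fine_eur(annual_turnover_eur, higher_tier=True):
    """Upper bound of a fine: the fixed cap or the turnover share, whichever is higher."""
    fixed_cap, percent = (20_000_000, 4) if higher_tier else (10_000_000, 2)
    return max(fixed_cap, annual_turnover_eur * percent // 100)


# Hypothetical company with EUR 1 billion global annual turnover.
large = max_fine_eur(1_000_000_000)
# Hypothetical company with EUR 100 million global annual turnover.
small = max_fine_eur(100_000_000)
print(large, small)
```

For the larger company the 4% share (€40 million) exceeds the fixed cap, while for the smaller one the €20 million cap applies because 4% of its turnover is only €4 million.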

This is the first in a series of articles on how an organization could tackle those regulations at design time and provide a framework to ensure that compliance is in place. If you are interested in learning more about GDPR compliance with Apache Kafka, feel free to contact us.

Additional Resources

GDPR - Chapter 3 - Rights of the data subject