Mike Barlotta, Agile Data Engineer at WalmartLabs, introduces how Kafka Connect and Stream Reactor can be leveraged to bring data from Cassandra into Apache Kafka.
This post will look at how to set up and tune the Cassandra Source connector that is available from Lenses.io. The Cassandra Source connector reads data from a Cassandra table and writes the contents into a Kafka topic using only a configuration file. This makes it easy to turn data that has already been saved into an event stream.
In our example we will be capturing data representing a pack (i.e. a large box) of items being shipped. Each pack is pushed to consumers in JSON format on a Kafka topic.
Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across multiple tables.
Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range starting with the date/time stored in the offset. We will look at how to configure this later; for now we want to focus on the constraints for the table. Since Cassandra does not support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.
In its simplest form a table used by the Cassandra Source connector might look like this:
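A minimal sketch of such a table, based on the columns described below (the column types are assumptions; the pack payload is stored as text):

CREATE TABLE pack_events (
    event_id text,
    event_ts timestamp,
    event_data text,
    PRIMARY KEY (event_id, event_ts)
);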
The event_id field is the partition key. This is used by Cassandra to determine which nodes in the cluster will store the data. The event_ts column is part of the clustering key and determines the order of the data within the partition (see this article for details). It is also the column used by the Cassandra Source connector to manage time ranges. In this example, the event_data column stores the JSON representation of the pack.
This is not the only table structure that will work. The table queried by the Cassandra Source connector can use any number of columns for the partition key and the data. However, the connector requires a single time-based column (either TIMESTAMP or TIMEUUID) in order to work correctly.
This would be an equally valid table for use with the Cassandra Source connector:
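For example, a table along these lines would also work; the names and types here are hypothetical, shown only to illustrate a compound partition key and multiple data columns alongside a single time-based clustering column:

CREATE TABLE pack_events_by_dc (
    dc_id text,
    pack_id text,
    event_ts timestamp,
    item_count int,
    destination text,
    PRIMARY KEY ((dc_id, pack_id), event_ts)
);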
The most efficient way to access data in this table is to query for data with the partition key. This would allow Cassandra to quickly identify the node containing the data we are interested in.
SELECT * FROM pack_events WHERE event_id = '1234';
However, the Cassandra Source connector has no way of knowing the ids of the data that it will need to publish to a Kafka topic. That is why it uses a time range.
The reason we cannot use event_ts as the partition key is that Cassandra does not support the range operators (>, >=, <=, <) on the partition key when querying. Without these we would not be able to query across date/time ranges (see this article for details).
There’s just one more thing. If we tried to run the following query it would fail:
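For example, a time-range query like the following (the date/time bounds are placeholders) restricts only the clustering column, so Cassandra rejects it:

SELECT event_data, event_ts FROM pack_events
WHERE event_ts > '2018-01-22 00:00:00' AND event_ts <= '2018-01-23 00:00:00';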
The connector must append the ALLOW FILTERING option to the end of this query for it to work. This addition allows Cassandra to search all of the nodes in the cluster for the data in the specified time range (see this article for details).
The Lenses.io connectors are configured using Kafka Connect Query Language (KCQL). This provides a concise and consistent way to configure the connectors (at least the ones from Lenses.io). The KCQL and other basic properties are provided via a JSON formatted property file.
For the sake of this post, let's create a file named connect-cassandra-source.json:
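Here is a sketch of what that file might contain. The contact point, port, credentials, consistency level, and keyspace are placeholder values to adapt to your environment; the KCQL and import mode are explained below:

{
  "name": "packs",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector",
    "connect.cassandra.contact.points": "localhost",
    "connect.cassandra.port": "9042",
    "connect.cassandra.username": "cassandra",
    "connect.cassandra.password": "cassandra",
    "connect.cassandra.consistency.level": "LOCAL_ONE",
    "connect.cassandra.key.space": "blog",
    "connect.cassandra.import.mode": "incremental",
    "connect.cassandra.kcql": "INSERT INTO test_topic SELECT event_data, event_ts FROM pack_events IGNORE event_ts PK event_ts WITHUNWRAP INCREMENTALMODE=TIMESTAMP"
  }
}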
The name of the connector needs to be unique across all the connectors installed into Kafka Connect. The connector.class is used to specify which connector is being used:
com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector
The next set of configuration (shown below) is used to specify the information needed to connect to the Cassandra cluster and which keyspace to use.
connect.cassandra.contact.points
connect.cassandra.port
connect.cassandra.username
connect.cassandra.password
connect.cassandra.consistency.level
connect.cassandra.key.space
There are two values for the connect.cassandra.import.mode property: bulk and incremental. The bulk option will query everything in the table every time the Kafka Connect polling occurs. We will set this to incremental.
The interesting part of the configuration is the connect.cassandra.kcql property (shown above). The KCQL statement tells the connector which table in the Cassandra cluster to use, how to use the columns on the table, and where to publish the data.

The first part of the KCQL statement tells the connector the name of the Kafka topic where the data will be published. In our case that is the topic named test_topic.
INSERT INTO test_topic
The next part of the KCQL statement tells the connector how to deal with the table. The SELECT/FROM statement specifies the table to poll with the queries. It also specifies the columns whose values should be retrieved. The column that keeps track of the date/time must be part of the SELECT statement. However, if we don't want that data as part of what we publish to the Kafka topic, we can use the IGNORE clause.
SELECT event_data, event_ts FROM pack_events IGNORE event_ts
The next part of the statement, the PK clause, tells the connector which one of the columns is used to manage the date/time. This is considered the primary key for the connector.
PK event_ts WITHUNWRAP INCREMENTALMODE="TIMESTAMP"
The INCREMENTALMODE option tells the connector what the data type of the PK column is. That is going to be either TIMESTAMP or TIMEUUID.
Finally, the WITHUNWRAP option tells the connector to publish the data to the topic as a String rather than as a JSON object.
For example, if we had the following value in the event_data column:

{ "foo":"bar" }

we would want to publish this value exactly as seen above.
Leaving the WITHUNWRAP option off while using the StringConverter (more on that later) would result in the following value being published to the topic instead:

Struct:{event_data={"foo":"bar"}}

We will need to use the combination of WITHUNWRAP and the StringConverter to get the result we want.
We will explore these in another post. But for now let’s start looking for data in our table with a starting date/time of today. We will also poll every second.
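In the Stream Reactor documentation these are typically controlled by the connect.cassandra.initial.offset (the date/time to start pulling from) and connect.cassandra.import.poll.interval (milliseconds between polls) properties; treat these names and the date format as assumptions to verify against the docs for your connector version. They would be added to the config section of the JSON file, for example:

"connect.cassandra.initial.offset": "2018-01-22 00:00:00.0000000Z",
"connect.cassandra.import.poll.interval": "1000"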
We will be using the following products:
Apache Cassandra 3.11.1
Apache Kafka and Kafka Connect 1.0
Lenses.io Stream Reactor Cassandra Source connector 1.0.0
Installation instructions for Apache Cassandra can be found on the web (link). Once installed and started the cluster can be verified using the following command:
nodetool -h [IP] status
This will generate a response like the following:
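The exact values will differ, but for a healthy single-node cluster the output should look roughly like this (UN meaning Up/Normal):

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID     Rack
UN  127.0.0.1  103.67 KiB  256     100.0%            8d3f2a5b-…  rack1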
Kafka Connect is shipped and installed as part of Apache Kafka. Instructions for these are also available on the web (link).
Download the tar file (link).
Extract the tar file.
This post will not attempt to explain the architecture behind a Kafka cluster. However, a typical installation will have several Kafka brokers and Apache Zookeeper.
To run Kafka, first start Zookeeper, then start the Kafka brokers. The following commands assume a local installation with only one node:
bin/zookeeper-server-start.sh config/zookeeper.properties
and
bin/kafka-server-start.sh config/server.properties
Once we have Kafka installed and running, we need to create four topics. One is used by our application to publish our pack JSON. The other three are required by Kafka Connect. We will continue to assume that most readers are running this on a laptop initially, so we will set the replication factor to 1.
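A sketch of the commands using the kafka-topics.sh tool that ships with Kafka; the partition counts below are arbitrary choices for a laptop setup. First the application topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic test_topic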
and
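then the three internal topics Kafka Connect uses for offsets, configs, and status; the topic names below assume the defaults found in connect-distributed.properties:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic connect-offsets
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic connect-configs
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic connect-status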
In order to verify that the four topics have been created, run the following command:
bin/kafka-topics.sh --list --zookeeper localhost:2181
Lenses.io offers numerous connectors for Kafka Connect. These are all available as open source. The first thing we need to do is download the Cassandra Source connector jar file (link).
kafka-connect-cassandra-1.0.0-1.0.0-all.tar.gz
Unzip the tar file and copy the jar file to the libs folder under the Kafka install directory.

We need to tell Kafka Connect where the Kafka cluster is. In the config folder where Kafka was installed we will find the file connect-distributed.properties. Look for the bootstrap.servers key and update it to point to the cluster.
bootstrap.servers=localhost:9092
We can now start up our distributed Kafka Connect service. For more information on stand-alone vs distributed mode, see the documentation.
bin/connect-distributed.sh config/connect-distributed.properties
If all has gone well, you should see the following on your console:
In case you are wondering, “Data Mountaineer” is a company from the Netherlands that merged with Lenses.io (link).
Kafka Connect has a REST API for interacting with connectors. We need to add the Cassandra Source connector to Kafka Connect. This is done by sending the property file (connect-cassandra-source.json) to Kafka Connect through the REST API.
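A sketch of that request, assuming Kafka Connect is listening on its default REST port 8083 on localhost:

curl -X POST -H "Content-Type: application/json" --data @connect-cassandra-source.json localhost:8083/connectors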
Once we have successfully loaded the connector, we can check to see the installed connectors using this API:
curl localhost:8083/connectors
That should return a list of the connectors by their configured names:
["packs"]
In order to test everything out, we will need to insert some data into our table.
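Using the example table sketched earlier, something like this would do; the id and payload values are arbitrary:

INSERT INTO pack_events (event_id, event_ts, event_data)
VALUES ('1234', toTimestamp(now()), '{ "foo":"bar" }');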
We can check what is being written to the Kafka topic by running the following command:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test_topic
At this point, we might be surprised to see something like this:
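With the default JsonConverter settings in connect-distributed.properties, the string value comes wrapped in a schema/payload envelope, roughly like this (exact details may vary):

{"schema":{"type":"string","optional":false},"payload":"{ \"foo\":\"bar\" }"}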
That is better than what we were getting without WITHUNWRAP but isn't exactly what we were hoping for. To get the JSON value that was written to the table column we need to update the connect-distributed.properties file.
Open this up and look for JsonConverter. Replace those lines with the following:
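Assuming the stock connect-distributed.properties that ships with Kafka, the two converter settings are switched from the JsonConverter to the StringConverter (the related schemas.enable lines only apply to the JsonConverter and can be left alone or removed):

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter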
Restart Kafka Connect. Insert another row into the table. Now we should get what we want.
{ "foo":"bar" }
Happy coding!
Read the second part of this article here - Tuning the Kafka Connect Cassandra Source (part 2)
This originally appeared on TheAgileJedi blog here