Apache Kafka is an open source distributed streaming platform for real-time data pipelines and data integration. It provides an efficient and scalable streaming system for use in a variety of applications, including:
- Real-time analytics
- Stream processing
- Log aggregation
- Distributed messaging
- Event streaming
Objectives
- Install Kafka on a Dataproc HA cluster with ZooKeeper (referred to in this tutorial as a "Dataproc Kafka cluster").
- Create fictitious customer data, then publish the data to a Kafka topic.
- Create Hive parquet and ORC tables in Cloud Storage to receive streamed Kafka topic data.
- Submit a PySpark job to subscribe to and stream the Kafka topic into Cloud Storage in parquet and ORC format.
- Run a query on the streamed Hive table data to count the streamed Kafka messages.
Costs
In this document, you use the following billable components of Google Cloud: Dataproc, Compute Engine, and Cloud Storage.
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
If you haven't already done so, create a Google Cloud project.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataproc, Compute Engine, and Cloud Storage APIs.
Create a Cloud Storage bucket to hold the tutorial's initialization script and table data:
- In the Google Cloud console, go to the Cloud Storage Buckets page.
- Click Create bucket.
- On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
  - For Name your bucket, enter a name that meets the bucket naming requirements.
  - For Choose where to store your data, do the following:
    - Select a Location type option.
    - Select a Location option.
  - For Choose a default storage class for your data, select a storage class.
  - For Choose how to control access to objects, select an Access control option.
  - For Advanced settings (optional), specify an encryption method, a retention policy, or bucket labels.
- Click Create.
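If you prefer to create the bucket from a script instead of the console, the following is a minimal, illustrative sketch that uses the google-cloud-storage Python client library. It assumes the library is installed and that Application Default Credentials are available (for example, in Cloud Shell); the placeholder values are yours to replace.

# create_bucket.py: optional sketch, not a required tutorial step.
from google.cloud import storage

PROJECT_ID = "PROJECT_ID"      # replace with your project ID
BUCKET_NAME = "BUCKET_NAME"    # replace with a globally unique bucket name
LOCATION = "us-central1"       # replace with your preferred location

client = storage.Client(project=PROJECT_ID)
bucket = client.create_bucket(BUCKET_NAME, location=LOCATION)
print("Created bucket {} in {}".format(bucket.name, bucket.location))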
Tutorial steps
Perform the following steps to create a Dataproc Kafka cluster, then read a Kafka topic into Cloud Storage in parquet or ORC format.
Copy the Kafka installation script to Cloud Storage
The kafka.sh initialization action script installs Kafka on a Dataproc cluster.
Copy the kafka.sh script to your Cloud Storage bucket. Open Cloud Shell, then run the following command:
gcloud storage cp gs://goog-dataproc-initialization-actions-REGION/kafka/kafka.sh gs://BUCKET_NAME/scripts/
Make the following replacements:
- REGION: kafka.sh is stored in public regionally-tagged buckets in Cloud Storage. Specify a geographically close Compute Engine region (example: us-central1).
- BUCKET_NAME: The name of your Cloud Storage bucket.
Create a Dataproc Kafka cluster
Open Cloud Shell, then run the following gcloud dataproc clusters create command to create a Dataproc HA cluster that installs the Kafka and ZooKeeper components:

gcloud dataproc clusters create KAFKA_CLUSTER \
    --project=PROJECT_ID \
    --region=REGION \
    --image-version=2.1-debian11 \
    --num-masters=3 \
    --enable-component-gateway \
    --initialization-actions=gs://BUCKET_NAME/scripts/kafka.sh
Notes:
- KAFKA_CLUSTER: The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.
- PROJECT_ID: The project to associate with this cluster.
- REGION: The Compute Engine region where the cluster will be located, such as us-central1. You can add the optional --zone=ZONE flag to specify a zone within the specified region, such as us-central1-a. If you do not specify a zone, the Dataproc autozone placement feature selects a zone within the specified region.
- --image-version: Dataproc image version 2.1-debian11 is recommended for this tutorial. Note: Each image version contains a set of pre-installed components, including the Hive component used in this tutorial (see Supported Dataproc image versions).
- --num-masters: 3 master nodes create an HA cluster. The ZooKeeper component, which is required by Kafka, is pre-installed on an HA cluster.
- --enable-component-gateway: Enables the Dataproc Component Gateway.
- BUCKET_NAME: The name of your Cloud Storage bucket that contains the /scripts/kafka.sh initialization script (see Copy the Kafka installation script to Cloud Storage).
Create a Kafka custdata topic
To create a Kafka topic on the Dataproc Kafka cluster:
Use the SSH utility to open a terminal window on the cluster master VM.
Create a Kafka custdata topic:

/usr/lib/kafka/bin/kafka-topics.sh \
    --bootstrap-server KAFKA_CLUSTER-w-0:9092 \
    --create --topic custdata
Notes:
- KAFKA_CLUSTER: Insert the name of your Kafka cluster. -w-0:9092 signifies the Kafka broker running on port 9092 on the worker-0 node.

You can run the following commands after creating the custdata topic:

# List all topics.
/usr/lib/kafka/bin/kafka-topics.sh \
    --bootstrap-server KAFKA_CLUSTER-w-0:9092 \
    --list

# Consume then display topic data.
/usr/lib/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server KAFKA_CLUSTER-w-0:9092 \
    --topic custdata

# Count the number of messages in the topic.
/usr/lib/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list KAFKA_CLUSTER-w-0:9092 \
    --topic custdata

# Delete the custdata topic.
/usr/lib/kafka/bin/kafka-topics.sh \
    --bootstrap-server KAFKA_CLUSTER-w-0:9092 \
    --delete --topic custdata
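If you prefer to inspect the topic from Python instead of the shell tools above, the following is a minimal, illustrative sketch using the kafka-python client (which this tutorial installs later with pip install kafka-python). It lists topics and counts the messages in custdata by comparing beginning and end offsets; replace KAFKA_CLUSTER with your cluster name.

# topic_info.py: optional sketch; assumes kafka-python is installed and the
# script runs on a cluster VM that can reach the broker at KAFKA_CLUSTER-w-0:9092.
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "KAFKA_CLUSTER-w-0:9092"   # replace KAFKA_CLUSTER with your cluster name

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
print("Topics:", consumer.topics())

# Message count = sum over partitions of (end offset - beginning offset).
partitions = [TopicPartition("custdata", p)
              for p in consumer.partitions_for_topic("custdata")]
end_offsets = consumer.end_offsets(partitions)
begin_offsets = consumer.beginning_offsets(partitions)
print("custdata messages:",
      sum(end_offsets[tp] - begin_offsets[tp] for tp in partitions))

consumer.close()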
Publish content to the Kafka custdata topic
The following script uses the kafka-console-producer.sh Kafka tool to generate fictitious customer data in CSV format.

Copy, then paste the script in the SSH terminal on the master node of your Kafka cluster. Press <return> to run the script.
for i in {1..10000}; do \
  custname="cust name${i}"
  uuid=$(dbus-uuidgen)
  age=$((45 + $RANDOM % 45))
  amount=$(echo "$(( $RANDOM % 99999 )).$(( $RANDOM % 99 ))")
  message="${uuid}:${custname},${age},${amount}"
  echo ${message}
done | /usr/lib/kafka/bin/kafka-console-producer.sh \
  --broker-list KAFKA_CLUSTER-w-0:9092 \
  --topic custdata \
  --property "parse.key=true" \
  --property "key.separator=:"
Notes:
- KAFKA_CLUSTER: The name of your Kafka cluster.
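As an alternative to kafka-console-producer.sh, you could publish equivalent records from Python with the kafka-python client (installed later in this tutorial). The following is a minimal, illustrative sketch, not a required tutorial step; run it instead of, not in addition to, the shell script above if you want the message count check below to remain 10,000. The record layout matches the uuid key and custname,age,amount value used above.

# produce_custdata.py: optional sketch; assumes kafka-python is installed
# (pip install kafka-python). Replace KAFKA_CLUSTER with your cluster name.
import random
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="KAFKA_CLUSTER-w-0:9092",
    key_serializer=str.encode,     # keys and values are sent as UTF-8 bytes
    value_serializer=str.encode,
)

for i in range(1, 10001):
    key = str(uuid.uuid4())                  # message key (uuid)
    value = "cust name{},{},{}.{}".format(   # custname,age,amount
        i,
        random.randint(45, 89),
        random.randint(0, 99998),
        random.randint(0, 98))
    producer.send("custdata", key=key, value=value)

producer.flush()   # block until buffered messages are delivered
producer.close()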
Run the following Kafka command to confirm that the custdata topic contains 10,000 messages:

/usr/lib/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list KAFKA_CLUSTER-w-0:9092 \
    --topic custdata
Notes:
- KAFKA_CLUSTER: The name of your Kafka cluster.
Expected output:
custdata:0:10000
Create Hive tables in Cloud Storage
Create Hive tables to receive streamed Kafka topic data.
Perform the following steps to create cust_parquet (parquet) and cust_orc (ORC) Hive tables in your Cloud Storage bucket.
Insert your BUCKET_NAME in the following script, then copy and paste the script into the SSH terminal on your Kafka cluster master node, and then press <return> to create a ~/hivetables.hql (Hive Query Language) script.

You will run the ~/hivetables.hql script in the next step to create parquet and ORC Hive tables in your Cloud Storage bucket.

cat > ~/hivetables.hql <<EOF
drop table if exists cust_parquet;
create external table if not exists cust_parquet
(uuid string, custname string, age string, amount string)
row format delimited fields terminated by ','
stored as parquet
location "gs://BUCKET_NAME/tables/cust_parquet";

drop table if exists cust_orc;
create external table if not exists cust_orc
(uuid string, custname string, age string, amount string)
row format delimited fields terminated by ','
stored as orc
location "gs://BUCKET_NAME/tables/cust_orc";
EOF
In the SSH terminal on the master node of your Kafka cluster, submit the ~/hivetables.hql Hive job to create the cust_parquet (parquet) and cust_orc (ORC) Hive tables in your Cloud Storage bucket.

gcloud dataproc jobs submit hive \
    --cluster=KAFKA_CLUSTER \
    --region=REGION \
    -f ~/hivetables.hql
Notes:
- The Hive component is pre-installed on the Dataproc Kafka cluster. See 2.1.x release versions for a list of the Hive component versions included in recently released 2.1 images.
- KAFKA_CLUSTER: The name of your Kafka cluster.
- REGION: The region where your Kafka cluster is located.
Stream Kafka custdata to Hive tables
Run the following command in the SSH terminal on the master node of your Kafka cluster to install the kafka-python library. A Kafka client is needed to stream Kafka topic data to Cloud Storage.
pip install kafka-python
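To confirm that the client can reach the Kafka broker before you submit the streaming job, you can optionally run a quick smoke test such as the following sketch, which prints a few custdata messages. The smoke_test.py name and five-message limit are illustrative; replace KAFKA_CLUSTER with your cluster name.

# smoke_test.py: optional sanity check; assumes kafka-python is installed.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "custdata",
    bootstrap_servers="KAFKA_CLUSTER-w-0:9092",  # replace KAFKA_CLUSTER
    auto_offset_reset="earliest",                # start from the oldest messages
    consumer_timeout_ms=10000,                   # stop iterating when no new messages arrive
)

for i, message in enumerate(consumer):
    print(message.key.decode(), message.value.decode())
    if i >= 4:   # print only the first five messages
        break

consumer.close()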
Insert your BUCKET_NAME, then copy and paste the following PySpark code into the SSH terminal on your Kafka cluster master node, and then press <return> to create a streamdata.py file.

The script subscribes to the Kafka custdata topic, then streams the data to your Hive tables in Cloud Storage. The output format, which can be parquet or ORC, is passed into the script as a parameter.

cat > streamdata.py <<EOF
#!/bin/python
import sys
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from kafka import KafkaConsumer

def getNameFn (data): return data.split(",")[0]
def getAgeFn (data): return data.split(",")[1]
def getAmtFn (data): return data.split(",")[2]

def main(cluster, outputfmt):
    spark = SparkSession.builder.appName("APP").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")
    Logger = spark._jvm.org.apache.log4j.Logger
    logger = Logger.getLogger(__name__)

    rows = spark.readStream.format("kafka") \
        .option("kafka.bootstrap.servers", cluster+"-w-0:9092") \
        .option("subscribe", "custdata") \
        .option("startingOffsets", "earliest") \
        .load()

    getNameUDF = udf(getNameFn, StringType())
    getAgeUDF = udf(getAgeFn, StringType())
    getAmtUDF = udf(getAmtFn, StringType())

    logger.warn("Params passed in are cluster name: " + cluster + " output format(sink): " + outputfmt)

    query = rows.select(col("key").cast("string").alias("uuid"),
        getNameUDF(col("value").cast("string")).alias("custname"),
        getAgeUDF(col("value").cast("string")).alias("age"),
        getAmtUDF(col("value").cast("string")).alias("amount"))

    writer = query.writeStream.format(outputfmt) \
        .option("path", "gs://BUCKET_NAME/tables/cust_" + outputfmt) \
        .option("checkpointLocation", "gs://BUCKET_NAME/chkpt/" + outputfmt + "wr") \
        .outputMode("append") \
        .start()

    writer.awaitTermination()

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Invalid number of arguments passed ", len(sys.argv))
        print("Usage: ", sys.argv[0], " cluster format")
        print("e.g.:  ", sys.argv[0], " <cluster_name> orc")
        print("e.g.:  ", sys.argv[0], " <cluster_name> parquet")
        sys.exit(1)
    main(sys.argv[1], sys.argv[2])
EOF
In the SSH terminal on the master node of your Kafka cluster, run spark-submit to stream data to your Hive tables in Cloud Storage.

Insert the name of your KAFKA_CLUSTER and the output FORMAT, then copy and paste the following code into the SSH terminal on the master node of your Kafka cluster, and then press <return> to run the code and stream the Kafka custdata topic to your Hive table in Cloud Storage in the specified format.

spark-submit --packages \
org.apache.spark:spark-streaming-kafka-0-10_2.12:3.1.3,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.3 \
    --conf spark.history.fs.gs.outputstream.type=FLUSHABLE_COMPOSITE \
    --conf spark.driver.memory=4096m \
    --conf spark.executor.cores=2 \
    --conf spark.executor.instances=2 \
    --conf spark.executor.memory=6144m \
    streamdata.py KAFKA_CLUSTER FORMAT
Notes:
- KAFKA_CLUSTER: Insert the name of your Kafka cluster.
- FORMAT: Specify either parquet or orc as the output format. You can run the command successively to stream both formats to the Hive tables: for example, in the first invocation, specify parquet to stream the Kafka custdata topic to the Hive parquet table; then, in the second invocation, specify orc format to stream custdata to the Hive ORC table.
After standard output halts in the SSH terminal, which signifies that all of the custdata has been streamed, press <control-c> in the SSH terminal to stop the process.

List the Hive tables in Cloud Storage:
gcloud storage ls gs://BUCKET_NAME/tables/* --recursive
Notes:
- BUCKET_NAME: Insert the name of the Cloud Storage bucket that contains your Hive tables (see Create Hive tables).
Query streamed data
In the SSH terminal on the master node of your Kafka cluster, run the following hive command to count the streamed Kafka custdata messages in the Hive tables in Cloud Storage.

hive -e "select count(1) from TABLE_NAME"
Notes:
- TABLE_NAME: Specify either cust_parquet or cust_orc as the Hive table name.

Expected output snippet:
...
Status: Running (Executing on YARN cluster with App id application_....)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 9.89 s
----------------------------------------------------------------------------------------------
OK
10000
Time taken: 21.394 seconds, Fetched: 1 row(s)
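As an optional cross-check of the Hive counts, you can read the streamed files directly with PySpark. The following is a minimal sketch (the verify_counts.py filename is illustrative): insert your BUCKET_NAME, save the file on the cluster master node, comment out whichever format you did not stream, and run it with spark-submit verify_counts.py.

# verify_counts.py: optional cross-check that counts rows in the streamed files.
# Replace BUCKET_NAME with your Cloud Storage bucket name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-counts").getOrCreate()

# Each read fails if that format was never streamed, so only read formats you streamed.
parquet_count = spark.read.parquet("gs://BUCKET_NAME/tables/cust_parquet").count()
orc_count = spark.read.orc("gs://BUCKET_NAME/tables/cust_orc").count()
print("parquet rows: {}, orc rows: {}".format(parquet_count, orc_count))

spark.stop()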
Clean up
Delete the project
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
Delete resources
- Delete the bucket:

gcloud storage buckets delete gs://BUCKET_NAME

- Delete your Kafka cluster:

gcloud dataproc clusters delete KAFKA_CLUSTER \
    --region=REGION