This post is one of my Notes to Self. I’m simply going to write down how you can connect to Cassandra from Spark, run “SQL” queries, and perform analysis on Cassandra’s data.
Let’s get started.
(Platform: Spark v1.6.0, Cassandra v2.7, macOS 10.12.1, Scala 2.11.7)
I’m going to use the spark-cassandra-connector package, written by the awesome folks at DataStax.
Assuming you have already configured Cassandra & Spark, it’s time to start writing a small Spark job.
Code with explanation:
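Here’s a minimal sketch of such a job. It assumes Cassandra is running locally on 127.0.0.1 and reads the system_auth.roles table (the keyspace/table are inferred from the output shown further below); adjust the host, keyspace, and table for your setup.

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Assumption: Cassandra is reachable on localhost
conf = (SparkConf()
        .setAppName("CassandraDemo")
        .set("spark.cassandra.connection.host", "127.0.0.1"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read a Cassandra table into a Spark DataFrame via the connector's data source
df_payload = (sqlContext.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="system_auth", table="roles")
              .load())

print("[Spark] Executing query: select * from roles")
df_payload.show()
```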
Spark job Execution:
To run your Spark job, use the command below:
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 myfile.py
(Note: check localhost:4040 in your browser for the Spark UI.)
--packages : This parameter tells Spark to download the external dependencies for the job. In our case, we are using spark-cassandra-connector:
groupId: com.datastax.spark
artifactId: spark-cassandra-connector_2.10
version: 1.5.0-M2
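The Cassandra connection host can also be supplied at submit time via --conf rather than hard-coded in the job (a sketch; 127.0.0.1 is an assumed host):

```shell
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  myfile.py
```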
Output:
<all your logs will be printed here, including Ivy logs>
:
:
[Spark] Executing query: select * from roles
+---------+---------+------------+---------+--------------------+
| role|can_login|is_superuser|member_of| salted_hash|
+---------+---------+------------+---------+--------------------+
|cassandra| true| true| []|$2a$10$pQW3iGSC.m...|
+---------+---------+------------+---------+--------------------+
Here, df_payload is a DataFrame object. You can use all of Spark’s transformations and actions on it. (Check here for more details.)
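For example, a quick sketch of a few transformations and actions on df_payload (assumes the Spark 1.6 SQLContext API and the roles table shown above):

```python
# df_payload is the DataFrame read from Cassandra

# Transformation: keep only superuser rows, project the role column;
# .show() is the action that triggers execution
df_payload.filter(df_payload.is_superuser == True).select("role").show()

# Or register a temp table and query it with SQL (Spark 1.6 API)
df_payload.registerTempTable("roles")
sqlContext.sql("select role from roles where can_login = true").show()
```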
Second Part: Starting a Spark shell with Cassandra connection
The steps for this are part of a separate post.
Useful Links:
I really want to thank the folks at DataStax. They have written and open-sourced so many packages and drivers for Cassandra.
You can contribute to spark-cassandra-connector here.
Link to the spark-cassandra-connector Maven repository.