Writing Apache Spark workers with "Simple Spark Lib"

November 11, 2016

Apache Spark is a great project: it can be plugged into most data sources and databases, e.g. HDFS, Cassandra, MongoDB, Kafka, Postgres, Redshift, etc. I have been using Spark for ad-hoc querying and a bunch of aggregations and segregations over Cassandra for a long time, and I noticed that every time I was writing (or pasting) the same code for configuration and connection. I also knew that when someone else on my team wanted to do similar work, they would have to do the same thing, including learning what that code means and understanding it. Imagine doing that while using Spark for the first time.


I decided to write a wrapper over PySpark that supports Cassandra, Redshift, etc. It provides two main advantages:

  1. I never repeat myself when writing workers.
  2. My team members do not need to figure out Spark-specific code in order to do simple ad-hoc tasks.
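To make the repetition concrete: with plain PySpark and the Cassandra connector, every worker starts by setting the same `spark.cassandra.*` properties. Here is a minimal sketch of that boilerplate, assuming the DataStax spark-cassandra-connector property names; the `cassandra_spark_settings` helper is hypothetical, for illustration only, and not part of the library:

```python
def cassandra_spark_settings(conn):
    """Translate a plain connection dict into the spark-cassandra-connector
    properties that every worker would otherwise set by hand on a SparkConf."""
    return {
        "spark.cassandra.connection.host": conn["host"],
        "spark.cassandra.auth.username": conn["username"],
        "spark.cassandra.auth.password": conn["password"],
    }

settings = cassandra_spark_settings({
    "host": "127.0.0.1",
    "username": "cassandra",
    "password": "cassandra",
})
# Each key/value pair would then be applied with SparkConf().set(key, value).
```

Hiding this mapping (plus the context and session setup around it) is exactly the kind of repetition the wrapper removes.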

I named it “Simple Spark Lib”, and here’s how to use it:

Step 1: Clone the repo from here:

git clone https://github.com/rootcss/simple_spark_lib.git

Step 2: Install the library:

python setup.py install

Step 3: Write the worker:

# First, import the library
from simple_spark_lib import SimpleSparkCassandraWorkflow

# Define connection configuration for Cassandra
cassandra_connection_config = {
  'host':     '',
  'username': 'cassandra',
  'password': 'cassandra'
}

# Define Cassandra schema information
cassandra_config = {
  'cluster': 'rootCSSCluster',
  'tables': {
    'api_events': 'events_production.api_events',
  }
}

# Initiate your workflow
workflow = SimpleSparkCassandraWorkflow(appName="Simple Example Worker")

# Set up the workflow with the configurations
workflow.setup(cassandra_connection_config, cassandra_config)

# Run your favourite query
df = workflow.process(query="SELECT * FROM api_events LIMIT 10")

df.show()

Step 4: Save it and execute the worker:

simple-runner my_spark_worker.py -d cassandra
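Conceptually, a runner like this takes the worker script and a datasource flag, then hands off to `spark-submit` with the matching connector package. The sketch below is a hypothetical reconstruction of that idea, not the library's actual code; the package coordinates are examples of real spark-packages/Maven coordinates from that era, and `build_command` is an invented helper:

```python
CONNECTOR_PACKAGES = {
    # Illustrative connector coordinates; pin whatever matches your Spark version.
    "cassandra": "datastax:spark-cassandra-connector:2.0.0-s_2.11",
    "redshift": "com.databricks:spark-redshift_2.11:3.0.0-preview1",
}

def build_command(worker, datasource):
    """Build the spark-submit argument list for a worker script."""
    return [
        "spark-submit",
        "--packages", CONNECTOR_PACKAGES[datasource],
        worker,
    ]

print(build_command("my_spark_worker.py", "cassandra"))
```

In other words, the `-d cassandra` flag spares you from remembering connector coordinates and `spark-submit` flags for each datasource.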

simple_spark_lib lets you use the capabilities of Spark without writing the actual Spark code. I have made it public, hoping it might be useful to someone else too.

If you are interested, go through other examples in the repo and feel free to contribute. :-)

Tags: Spark Cassandra Data Engineering Big Data
