Writing Apache Spark workers with "Simple Spark Lib"

November 11, 2016


Apache Spark is a great project that plugs into most data sources and databases, e.g. HDFS, Cassandra, MongoDB, Kafka, Postgres, Redshift, etc. I have been using Spark for a long time for ad-hoc querying and a bunch of aggregations and segregations over Cassandra, and I noticed that every time I ended up writing (or pasting) the same code for configuration and connection. I also knew that when someone else on my team wanted to do similar work, they would have to do the same thing, including learning what that code means and understanding it. Now think of someone doing that while using Spark for the first time.


TL;DR

I decided to write a wrapper over PySpark that supports Cassandra, Redshift, etc. It primarily provides the following two advantages:

  1. I no longer repeat myself when writing new workers.
  2. My team members do not need to figure out the Spark-specific code in order to do simple ad-hoc tasks.

I named it "Simple Spark Lib", and here's how to use it:

Step 1: Clone the repo from here:

git clone https://github.com/rootcss/simple_spark_lib.git

Step 2: Install the library:

python setup.py install

Step 3: Write the worker:

# First, import the library
from simple_spark_lib import SimpleSparkCassandraWorkflow

# Define connection configuration for cassandra
cassandra_connection_config = {
  'host':     '192.168.56.101',
  'username': 'cassandra',
  'password': 'cassandra'
}

# Define Cassandra Schema information
cassandra_config = {
  'cluster': 'rootCSSCluster',
  'tables': {
    'api_events': 'events_production.api_events',
  }
}

# Initiate your workflow
workflow = SimpleSparkCassandraWorkflow(appName="Simple Example Worker")

# Setup the workflow with configurations
workflow.setup(cassandra_connection_config, cassandra_config)

# Run your favourite query
df = workflow.process(query="SELECT * FROM api_events LIMIT 10")

# show() prints the rows itself and returns None, so no print is needed
df.show()

Step 4: Save it & Execute the worker:

simple-runner my_spark_worker.py -d cassandra

simple_spark_lib lets you use the capabilities of Spark without writing the actual Spark code. I made it public, hoping it might be useful to someone else too.
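
To give a sense of the kind of plumbing a wrapper like this takes off your hands, here is a minimal sketch in plain Python. It is not the library's actual implementation: the helper names (`build_spark_conf`, `split_table_name`, `table_read_options`) are hypothetical and made up for illustration. The only thing taken from outside this post is the set of property names, which are the Spark Cassandra Connector's standard connection settings. The sketch turns the two config dicts from Step 3 into those settings and into per-table keyspace/table pairs.

```python
# Hypothetical helpers for illustration only; not part of simple_spark_lib.

def build_spark_conf(connection_config):
    """Map the connection dict to Spark Cassandra Connector properties."""
    return {
        'spark.cassandra.connection.host': connection_config['host'],
        'spark.cassandra.auth.username': connection_config['username'],
        'spark.cassandra.auth.password': connection_config['password'],
    }


def split_table_name(qualified_name):
    """Split 'keyspace.table' into a (keyspace, table) tuple."""
    keyspace, table = qualified_name.split('.', 1)
    return keyspace, table


def table_read_options(cassandra_config):
    """Return {alias: {'keyspace': ..., 'table': ...}} for every configured table."""
    options = {}
    for alias, qualified_name in cassandra_config['tables'].items():
        keyspace, table = split_table_name(qualified_name)
        options[alias] = {'keyspace': keyspace, 'table': table}
    return options


# Same configuration as in Step 3
cassandra_connection_config = {
    'host': '192.168.56.101',
    'username': 'cassandra',
    'password': 'cassandra',
}
cassandra_config = {
    'cluster': 'rootCSSCluster',
    'tables': {
        'api_events': 'events_production.api_events',
    },
}

spark_conf = build_spark_conf(cassandra_connection_config)
read_options = table_read_options(cassandra_config)

print(spark_conf)
print(read_options)
```

In a hand-written PySpark job, and assuming the Cassandra connector is on the classpath, settings like these would typically go into the Spark configuration, and each keyspace/table pair would be passed to `spark.read.format("org.apache.spark.sql.cassandra").options(...).load()` before registering the result under its alias so that SQL queries can refer to it by name. That is exactly the part nobody should have to retype for every worker.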

If you are interested, go through other examples in the repo and feel free to contribute. :-)


Tags: Spark Cassandra Data Engineering Big Data

