Apache Spark is a great project that can be plugged into most data sources and databases, e.g., HDFS, Cassandra, MongoDB, Kafka, Postgres, Redshift. I have been using Spark for ad-hoc querying and a bunch of aggregations and segregations over Cassandra for a long time, and I noticed that every time I was writing (or pasting) the same code for configuration and connection. I also knew that when someone else on my team wanted to do similar work, they would have to do the same thing, including learning what that code means and understanding it. Now imagine someone doing that while using Spark for the first time.
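To make that concrete, here is roughly the kind of setup code that kept getting copy-pasted into every worker. This is a minimal sketch using the DataStax spark-cassandra-connector; the connector version, host, keyspace, and table names are placeholder assumptions, not values from the original post.

```python
from pyspark.sql import SparkSession

# The configuration & connection boilerplate repeated in every worker.
# Connector version, host, keyspace, and table are placeholders.
spark = (
    SparkSession.builder
    .appName("adhoc-cassandra-query")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Read a Cassandra table as a DataFrame through the connector.
events = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="events")
    .load()
)

# A typical ad-hoc aggregation.
events.groupBy("event_type").count().show()
```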
I decided to write a wrapper over PySpark that supports Cassandra, Redshift, and similar sources. It primarily provides the following two advantages:

- You no longer write (or paste) the same configuration and connection code for every new worker.
- Teammates who are new to Spark can start querying without first having to learn what that setup code means.
I named it “Simple Spark Lib”, and here’s how to use it:
Step 1: Clone the repo from here:
Step 2: Install the library:
Step 3: Write the worker:
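I can’t show the exact worker the library expects here, but a purely hypothetical sketch of its shape might look like the following. SimpleSparkWorker, add_cassandra_source, and load are illustrative names assumed for this example, not the library’s confirmed API; the repo’s examples define the real interface.

```python
# Hypothetical sketch -- SimpleSparkWorker, add_cassandra_source, and
# load are assumed names, not the library's confirmed API.
from simple_spark_lib import SimpleSparkWorker

worker = SimpleSparkWorker(app_name="events-by-type")

# The wrapper is meant to hide the connector packaging and
# connection settings shown in the plain-PySpark sketch above.
worker.add_cassandra_source(host="127.0.0.1",
                            keyspace="my_keyspace",
                            table="events")

# The same ad-hoc aggregation, with no SparkSession plumbing.
df = worker.load()
df.groupBy("event_type").count().show()
```

Compared with the plain-PySpark sketch earlier, the connector coordinates and connection host live inside the wrapper, so a first-time Spark user only states what to read and what to compute.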
Step 4: Save it and execute the worker:
simple_spark_lib lets you use the capabilities of Spark without writing the actual Spark code yourself. I made it public, hoping it might be useful to someone else too.
If you are interested, go through the other examples in the repo and feel free to contribute. :-)