Apache Spark is a great project that can plug into most data sources and databases, e.g. HDFS, Cassandra, MongoDB, Kafka, Postgres, Redshift, etc. I have been using Spark for ad-hoc querying and a bunch of aggregations and segregations over Cassandra for a long time, and I noticed that every time I would write (or paste) the same code for configuration and connection. I also knew that when someone else on my team wanted to do similar work, they would have to do the same thing, including learning what that code means and understanding it. Now imagine someone doing that while using Spark for the first time.
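To make the repeated boilerplate concrete, this is roughly the configuration and connection code every such job carries. It is a sketch using standard PySpark and the DataStax Spark-Cassandra connector; the app name, host, keyspace, and table names are placeholders, not values from this post:

```python
# The kind of setup code that gets copy-pasted into every job
# (placeholder names throughout).
from pyspark.sql import SparkSession

# Build a session wired to Cassandra via the Spark-Cassandra connector.
spark = (
    SparkSession.builder
    .appName("daily-aggregates")                              # placeholder
    .config("spark.cassandra.connection.host", "127.0.0.1")   # placeholder
    .getOrCreate()
)

# Load a Cassandra table as a DataFrame.
events = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")            # placeholders
    .load()
)

# A typical ad-hoc aggregation.
events.groupBy("event_type").count().show()
```

Every new job repeats the session and connector wiring above before it gets to the one or two lines that actually matter.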
TL;DR
I decided to write a wrapper over PySpark which supports Cassandra, Redshift, etc. It primarily provides two advantages: you no longer write (or paste) the same configuration and connection code for every job, and teammates, even first-time Spark users, can use Spark's capabilities without first learning that boilerplate. I named it “Simple Spark Lib”, and here's how to use it:
Step 1: Clone the repo from here:
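The original post links the repository here; the link did not survive extraction, so the URL below is a placeholder to substitute with the actual repo location:

```shell
# Placeholder URL -- replace with the repository link from the post.
git clone https://github.com/<author>/simple_spark_lib.git
cd simple_spark_lib
```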
Step 2: Install the library:
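The install command below assumes a standard Python package layout with a setup file at the repo root; check the repo's README for the actual instructions:

```shell
# Install from the cloned checkout (assumes a standard setup.py/pyproject).
pip install .
```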
Step 3: Write the worker:
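A sketch of what a worker might look like. Every name below (`SimpleSparkWorker`, its parameters, `read_cassandra`) is hypothetical, since the library's real API is in the repo's examples; the point is that the configuration and connection wiring is hidden behind the wrapper:

```python
# worker.py -- hypothetical sketch; see the repo's examples for the real API.
from simple_spark_lib import SimpleSparkWorker  # hypothetical import

# The wrapper absorbs the session/connector configuration shown earlier.
worker = SimpleSparkWorker(
    app_name="daily-aggregates",   # hypothetical parameter
    cassandra_host="127.0.0.1",    # hypothetical parameter
)

# Read a Cassandra table and run an ad-hoc aggregation.
events = worker.read_cassandra(keyspace="analytics", table="events")  # hypothetical
events.groupBy("event_type").count().show()
```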
Step 4: Save and execute the worker:
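One way to run the worker is via `spark-submit`; the connector package coordinates below are an example and must match your Spark and Scala versions:

```shell
# Example submit; adjust the connector version to your Spark/Scala build.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 \
  worker.py
```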
simple_spark_lib enables you to use the capabilities of Spark without writing the actual Spark code. I made it public, hoping it might be useful to someone else too.
If you are interested, go through the other examples in the repo and feel free to contribute. :-)