Saturday 3 March 2018

SQL with Apache Spark, easy!

Reading about cluster computing developments like Apache Spark and SQL I decided to find out.

What I was after was to see how easy is to write SQL in Spark-SQL. In this micro-post I will show you how easy is to SQL a JSON file.

For my experiment I will use my chrome_history.json file which you can download from your chrome browser using the extension www.JSON-XLS.com. To run the SQL query on PySpark on my laptop I will use the PyCharm IDE. After little bit of configuration on PyCharm, setting up environments (SPARK_HOME), there it is: It only takes 3 lines to be able SQL query a JSON document in Spark-SQL.



(click image to enlarge)


Think of the possibilities with SQL, the 'cluster' partitioning and parallelisation you can achieve


Links:

Apache Spark: https://spark.apache.org/downloads.html

PyCharm: https://www.jetbrains.com/pycharm/