Tuesday 30 June 2015

Apache Phoenix, SQL is getting closer to Big Data



Here is a post about another project in the Big Data world, like Apache Hive from my previous post, enables you to do SQL on Big Data. It is called Apache Phoenix.

Phoenix is a bit different, a bit closer to my heart too, as I read the documentation on Apache Phoenix, the word 'algebra' and 'relational algebra' came across few times, and that mean only one thing, SQL! The use of the word algebra in the docs did give me a lot of confidence. SQL has closure, is based on a database systems model which has it's roots in logic and maths and especially a subset of algebra, The Set Theory.

Apache Phoenix is developed in Salesforce and is now one of the popular projects in Apache. Apache Phoenix is a SQL skin on top of HBase, the columnar (NoSQL) database of the Hadoop ecosystem, capable of storing very large tables and data and query them via 'scans'. HBase is part of the Hadoop ecosystem and the file system it uses is usually HDFS. Apache Phoenix is using JDBC on the client as a driver.

In the race to bring the easiest to use tools for Big Data, I think Apache Phoenix is very close. It is the SQL we know used since the 1970s. The Apache Phoenix team seems to be committed and willing to introduce all of the missing parts of SQL, including transaction processing with different isolation levels.  Making Phoenix a fully operational Relational Database layer on HBase. Have a look in their roadmap. The amount of current and suggested future SQL compatibility is remarkable, and this makes me take them really seriously.
  • Transactions
  • Cost-based Query Optimization! (Wow)
  • Joins
  • OLAP
  • Subqueries
  • Striving for full SQL-92 compliance
In addition to all this, it is also possible to turn an existing HBase table to an Apache Phoenix table using CREATE TABLE or even CREATE VIEW, the DDL statements that we know. How handy is that? Suddenly you can SQL enable your existing HBase database!

How to install and use Phoenix

The SQL skin can be installed to an existing Hadoop HBase installation very quickly. All you need to do is to download and extract the tarball. You can setup a standalone Hadoop environment, look at my previous blog post for that, and then install HBase and install Apache Phoenix

Once the Apache  Phoenix software is installed, then you can start it and query it with SQL like this.

From within the bin/ directory of Phoenix install directory run

$ ./sqlline.py  localhost

That will bring you to the phoenix prompt


0: jdbc:phoenix:localhost> select * from mytable;


Thursday 25 June 2015

Hive (HiveQL) SQL for Hadoop Big Data



In this  post I will share my experience with an Apache Hadoop component called Hive which enables you to do SQL on an Apache Hadoop Big Data cluster.

Being a great fun of SQL and relational databases, this was my opportunity to set up a mechanism where I could transfer some (a lot)  data from a relational database into Hadoop and query it with SQL. Not a very difficult thing to do these days, actually is very easy with Apache Hive!

Having access to a Hadoop cluster which has the Hive module installed on, is all you need. You can provision a Hadoop cluster yourself by downloading and installing it in pseudo mode on your own PC. Or you can run one in the cloud with Amazon AWS EMR in a pay-as-you-go fashion.

There are many ways of doing this, just Google it and you will be surprised how easy it is. It is easier than it sounds. Next find links for installing it on your own PC (Linux).  Just download and install Apache Hadoop and Hive from Apache Hadoop Downloads

You will need to download and install 3 things from the above link.

  • Hadoop (HDFS and Big Data Framework, the cluster)
  • Hive (data warehouse module)
  • Sqoop (data importer)
You will also need to put the connector of the database (Oracle, MySQL...) you want to extract data from in the */lib folder in your Sqoop installation. For example the MySQL JDBC connector can be downloaded from here
Don't expect loads of tinkering installing Apache Hadoop and Hive or Sqoop, just unzipping binary extracts and few line changes on some config files in directories, that's all. Is not a big deal, and is Free. There are tones of tutorials on internet on this, here is one I used from another blogger bogotobogo.


What is Hive?

Hive is Big Data SQL, the Data Warehouse in Hadoop. You can create tables, indexes, partition tables, use external tables, Views like in a relational database Data Warehouse. You can run SQL to do joins and to query the Hive tables in parallel using the MapReduce framework. It is actually quite fun to see your SQL queries translating to MapReduce jobs and run in parallel like parallel SQL queries we do on Oracle EE Data Warehouses and other databases. :0) The syntax looks very much like MySQL's SQL syntax.

Hive is NOT an OLTP transactional database, does not have transactions of INSERT, UPDATE, DELETE like in OLTP and doesn't conform to ANSI SQL and ACID properties of transactions.


Direct insert into Hive with Apache Sqoop:

After you have installed Hadoop and have hive setup and are able to login to it, you can use Sqoop - the data importer of Hadoop - like in the following command and directly import a table from MySQL via JDBC into Hive using MapReduce.

$  sqoop import -connect jdbc:mysql://mydatbasename -username kubilay -P -table mytablename --hive-import --hive-drop-import-delims --hive-database dbadb --num-mappers 16 --split-by id

Sqoop import options explained:
  •  -P will ask for the password
  • --hive-import which makes Sqoop to import data straight into hive table which it creates for you
  • --hive-drop-import-delims Drops \n\r, and \01 from string fields when importing to Hive. 
  • --hive-database tells it which database in Hive to import it to, otherwise it goes to the default database. 
  • --num-mappers number of parallel maps to run, like parallel processes / threads in SQL
  • --split-by  Column of the table used to split work units, like in partitioning key in database partitioning. 
The above command will import any MySQL table you give in place of mytablename into Hive using MapReduce from a MySQL database you specify.

Once you import the table then you can login to hive and run SQL to it like in any relational database. You can login to Hive in a properly configured system just by calling hive from command line like this:

$ hive
hive> 


More Commands to list jobs:

Couple of other commands I found useful when I was experimenting with this:

List running Hadoop jobs

hadoop job -list

Kill running Hadoop jobs

hadoop job -kill job_1234567891011_1234

List particular table directories in HDFS

hadoop fs -ls mytablename


More resources & Links