Thursday, 25 June 2015

Hive (HiveQL) SQL for Hadoop Big Data



In this  post I will share my experience with an Apache Hadoop component called Hive which enables you to do SQL on an Apache Hadoop Big Data cluster.

Being a great fun of SQL and relational databases, this was my opportunity to set up a mechanism where I could transfer some (a lot)  data from a relational database into Hadoop and query it with SQL. Not a very difficult thing to do these days, actually is very easy with Apache Hive!

Having access to a Hadoop cluster which has the Hive module installed on, is all you need. You can provision a Hadoop cluster yourself by downloading and installing it in pseudo mode on your own PC. Or you can run one in the cloud with Amazon AWS EMR in a pay-as-you-go fashion.

There are many ways of doing this, just Google it and you will be surprised how easy it is. It is easier than it sounds. Next find links for installing it on your own PC (Linux).  Just download and install Apache Hadoop and Hive from Apache Hadoop Downloads

You will need to download and install 3 things from the above link.

  • Hadoop (HDFS and Big Data Framework, the cluster)
  • Hive (data warehouse module)
  • Sqoop (data importer)
You will also need to put the connector of the database (Oracle, MySQL...) you want to extract data from in the */lib folder in your Sqoop installation. For example the MySQL JDBC connector can be downloaded from here
Don't expect loads of tinkering installing Apache Hadoop and Hive or Sqoop, just unzipping binary extracts and few line changes on some config files in directories, that's all. Is not a big deal, and is Free. There are tones of tutorials on internet on this, here is one I used from another blogger bogotobogo.


What is Hive?

Hive is Big Data SQL, the Data Warehouse in Hadoop. You can create tables, indexes, partition tables, use external tables, Views like in a relational database Data Warehouse. You can run SQL to do joins and to query the Hive tables in parallel using the MapReduce framework. It is actually quite fun to see your SQL queries translating to MapReduce jobs and run in parallel like parallel SQL queries we do on Oracle EE Data Warehouses and other databases. :0) The syntax looks very much like MySQL's SQL syntax.

Hive is NOT an OLTP transactional database, does not have transactions of INSERT, UPDATE, DELETE like in OLTP and doesn't conform to ANSI SQL and ACID properties of transactions.


Direct insert into Hive with Apache Sqoop:

After you have installed Hadoop and have hive setup and are able to login to it, you can use Sqoop - the data importer of Hadoop - like in the following command and directly import a table from MySQL via JDBC into Hive using MapReduce.

$  sqoop import -connect jdbc:mysql://mydatbasename -username kubilay -P -table mytablename --hive-import --hive-drop-import-delims --hive-database dbadb --num-mappers 16 --split-by id

Sqoop import options explained:
  •  -P will ask for the password
  • --hive-import which makes Sqoop to import data straight into hive table which it creates for you
  • --hive-drop-import-delims Drops \n\r, and \01 from string fields when importing to Hive. 
  • --hive-database tells it which database in Hive to import it to, otherwise it goes to the default database. 
  • --num-mappers number of parallel maps to run, like parallel processes / threads in SQL
  • --split-by  Column of the table used to split work units, like in partitioning key in database partitioning. 
The above command will import any MySQL table you give in place of mytablename into Hive using MapReduce from a MySQL database you specify.

Once you import the table then you can login to hive and run SQL to it like in any relational database. You can login to Hive in a properly configured system just by calling hive from command line like this:

$ hive
hive> 


More Commands to list jobs:

Couple of other commands I found useful when I was experimenting with this:

List running Hadoop jobs

hadoop job -list

Kill running Hadoop jobs

hadoop job -kill job_1234567891011_1234

List particular table directories in HDFS

hadoop fs -ls mytablename


More resources & Links



Wednesday, 4 March 2015

Oracle APEX_WEB_SERVICE REST API call

In this post I will try to show you how I used the Oracle Apex and the APEX_WEB_SERVICE  PL/SQL package to quickly send a request to a public Internet API and how I handled the response. The code below was written during a 'Hackday' and hasn't been extensively tested.

My use case is integrating Oracle Apex with the public Mendeley REST API for Mendeley Catalog Search.

The idea was to build an application in Oracle Apex to query the Mendeley REST API Catalog with a keyword. Mendeley REST API gives JSON response so I used PL/JSON to parse it.  I hear in Oracle 12c JSON is going to be a native data-type. My Oracle Apex host is running Oracle 11g and I had to use PL/JSON for ease.

To cut it short here is how the Mendeley Catalog Search on Oracle Apex application look  like. (Click image to go to app or visit http://apex2.enciva.co.uk/apex/f?p=864:2






To integrate with Mendeley REST API from Oracle Apex, I used one PL/SQL function and one procedure.

I used the function to obtain the Mendeley REST API Client Credentials Authorisation flow token and the procedure to do make the API request to Mendeley Catalog Search and to handle the response.

Here is the MENDELEY_CALL PL/SQL function I created:

This function returns the Client Credentials Authorisation Flow token from the Mendeeley REST API

create or replace function mendeley_call (p_id in varchar2)
return varchar2
is
v_token varchar2(1000);
token varchar2(1000);
jtoken json;
v_grant_type varchar2(400:= 'client_credentials';
v_client_id varchar2(500:= p_id;
v_client_secret varchar2(500:= '<put_your_mendeley_client_secret_here>';
v_scope varchar2(300:= 'all';
begin

/*----------Setting Headers----------------------------------------*/                                      
apex_web_service.g_request_headers(1).name := 'Content-Type';
apex_web_service.g_request_headers(1).Value := 'application/x-www-form-urlencoded; charset=utf-8';
/*-----------------------------------------------------------------*/

token := apex_web_service.make_rest_request
    (
      p_url         => 'https://api.mendeley.com/oauth/token'
    p_http_method => 'POST'
    p_parm_name   => apex_util.string_to_table('grant_type:client_id:client_secret:scope')
    p_parm_value  => apex_util.string_to_table(v_grant_type||':'||v_client_id||':'||v_client_secret||':'

||v_scope)
    p_wallet_path => 'file:/home/app/oracle/product/11.2.0/dbhome_1/owm/wallets/oracle'
    p_wallet_pwd  => '<put_your_oracle_wallet_password_here>'
    );
-- debug
-- dbms_output.put_line(token);
jtoken := json(token);
v_token := json_ext.get_string(jtoken,'access_token');
-- debug
-- dbms_output.put_line(v_token);
return v_token;
EXCEPTION
WHEN OTHERS THEN
   raise_application_error(-20001,'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);
end;



Here is the anonymous procedure which I put into a PL/SQL region on the Oracle Apex page:

This procedure incorporates the function above and makes the request and handles the response from the Mendeley REST API

Note how the procedure calls the function MENDELEY_CALL (above) to load the variable v_token. 

DECLARE
  v_token  VARCHAR2(599:= mendeley_call(put_your_mendeley_client_id_here);
  v_search VARCHAR2(500);
  mendeley_document NCLOB;
  v_status VARCHAR2(100);
  obj json_list;
  v_id VARCHAR2(100);
  v_title NVARCHAR2(1000);
  v_abstract NCLOB;--varchar2(32000);
  v_link     VARCHAR2(1000);
  v_source   VARCHAR2(500);
  v_type     VARCHAR2(100);
  v_pct_hit  VARCHAR2(10);
  v_rows     NUMBER(10);
  v_batch_id NUMBER(10);
BEGIN
  -- Oracle Wallet
  utl_http.set_wallet('file:/home/app/oracle/product/11.2.0/dbhome_1/owm/wallets/oracle'

'my_secret_password');
  -- Set Authorisation headers and utf8
  -- the following lilne is necessary if you need to use languages other than latin and 
  -- you will use APEX_WEB_SERVICE package 
  utl_http.set_body_charset('UTF-8');
  -- build the Authorisation header
  apex_web_service.g_request_headers(1).name  := 'Content-Type';
  apex_web_service.g_request_headers(1).value := 'application/jsonrequest';
  apex_web_service.g_request_headers(1).name  := 'Authorization';
  apex_web_service.g_request_headers(1).value := 'Bearer '||v_token||'';
  
  -- Make the request and load the response into a CLOB 
  mendeley_document := apex_web_service.make_rest_request 
      
        p_url => 'https://api.mendeley.com:443/search/catalog' 
      p_http_method => 'GET' 
      p_parm_name => apex_util.string_to_table('title:limit'
      p_parm_value => apex_util.string_to_table('Mendeley:10'
      );
  -- Load the response to JSON_LIST PL/JSON object
  obj := json_list(mendeley_document);
  -- Start extracting values from the JSON and writhe some HTML
  -- Traverse over JSON_LIST extract elements you like
  FOR IN 1..obj.count
  LOOP
    v_id       := json_ext.get_string(json(obj.get(i)),'id');
    v_title    := json_ext.get_string(json(obj.get(i)),'title');
    v_abstract := json_ext.get_string(json(obj.get(i)),'abstract');
    v_link     := json_ext.get_string(json(obj.get(i)),'link');
    v_source   := json_ext.get_string(json(obj.get(i)),'source');
    v_type     := json_ext.get_string(json(obj.get(i)),'type');
    -- write extracted data
   dbms_output.put_line(v_title||' ==> '||v_abstract);
   END LOOP;
 END;

 END;

This shows how easy is, in this case using one function and one procedure to make a REST API request to an external Web Service from Oracle Apex. 

Thursday, 19 February 2015

Test MySQL on AWS quickly

Using sysbench to performance test AWS RDS MySQL hardware is an easy three step operation.
Sysbench creates synthetic tests and they are done on a 1 mil row 'sbtest' table that sysbench creates in the MySQL database you indicate. The test doesn't intrude with your database schema, and it doesn't use your data, so it is quite safe. The test is an OLTP test trying to simulate event operations in the database as it runs various, SELECT, INSERT, UPDATE and DELETE requests on it's own 'sbtest' table.

The results of the tests are metrics like transactions per second, number of events, elapsed time etc. See man pages for description and Google it, it is a popular testing tool. Other things you can set it up to do are to control how many requests (events) you want it to execute in a given time or you can tell it to keep on executing infinite requests until you stop it, or destruction testing. Is a very flexible testing tool with many options including throtling concurrency.

You can be up and running with 3 commands on a unix system as follows. 
 
Download sysbench tool (doing this on ubuntu)
sudo apt-get install sysbench

Create a table with 1 mil rows

sysbench --test=oltp --oltp-table-size=1000000 --mysql-host={your rds host url} --db-driver=mysql --mysql-user={your rds root user} --mysql-password={password} --mysql-db={your mysql database name} prepare

Test with different parameters

sysbench --test=oltp --oltp-table-size=1000000 --mysql-host={your rds host url} --db-driver=mysql --mysql-user={your rds root user} --mysql-password={password} --mysql-db={your mysql database name} --max-time=60 --num-threads=550 run

Warning: Synthetic tests will just give you the ability of the hardware at a given standard set of requests and DML operations. There are no way an indication of what will happen to your database if the real workload increases beacause of the applications. Application Load Testing is something else, applications are complex! Database Workload is dependent on the application generated workload from real users using the system and is very hard to simulate that in a test. It is not imppossible If you use a database, such as Oracle which has the capability of recording and replaying its production database workload - called Automatic Workload Repository (AWR). In MySQL I couldn't find so far a way to do this. But sysbench synthetic tests gave me the ability to quickly benchmark and baseline a MySQL database capabilities on different AWS Amazon hardware, something is better than nothing I suppose.

Tuesday, 6 January 2015

MySQL Locking

MySQL and Row Level Locking? Or why are you getting the error:

ERROR 1205 (HY000) : Lock wait timeout exceeded; try restarting transaction

You get the error because your allocated time to hold a DML lock in a transaction exceeds the set limit. Usually the default limit to hold a DML row lock, set by innodb_lock_wait_timeout db parameter, is 50 seconds. If your transaction doesn't commit/rollback within 50 seconds you will get this error. We don't want to hold locks for longer than 50 seconds anyway, throughput would be affected.

And yes MySQL in innodb uses row level locking. Since MySQL 5.1+ (time Oracle took over) it does row level locking in its InnoDB tables. That means only the rows which are selected . . . FOR UPDATE . . . are locked and not the  whole table. To see the threads (sessions) which are locking other threads and which queries are locking, use the following INFORMATION_SCHEMA dictionary SQL query as DBA.  You will be able to see blockers and waiters of transactions  waiting on locks. Run it as is using INFORMATION_SCHEMA schema, no modifications.

Use this SQL query to monitor locks and transactions and note that query will return data only when there are locks!


SELECT 
    r.trx_id AS wating_trx_id,
    r.trx_mysql_thread_id AS waiting_thread,
    TIMESTAMPDIFF(SECOND,
        r.trx_wait_started,
        CURRENT_TIMESTAMP) AS wait_time,
    r.trx_query AS waiting_query,
    l.lock_table AS waiting_table_lock,
    b.trx_id AS blocking_trx_id,
    b.trx_mysql_thread_id AS blocking_thread,
    SUBSTRING(p.host,
        1,
        INSTR(p.host, ':') - 1) AS blocking_host,
    SUBSTRING(p.host,
        INSTR(p.host, ':') + 1) AS blocking_port,
    IF(p.command = 'Sleep', p.time, 0) AS idle_in_trx,
    b.trx_query AS blocking_query
FROM
    information_schema.innodb_lock_waits AS w
        INNER JOIN
    information_schema.innodb_trx AS b ON b.trx_id = w.blocking_trx_id
        INNER JOIN
    information_schema.innodb_trx AS r ON r.trx_id = w.requesting_trx_id
        INNER JOIN
    information_schema.innodb_locks AS l ON w.requested_lock_id - l.lock_id
        LEFT JOIN
    information_schema.PROCESSLIST AS p ON p.id = b.trx_mysql_thread_id
ORDER BY wait_time DESC;