Database Systems

Monday, 14 July 2014

MySQL on-premise to Amazon RDS migration tips

Things to watch and do when migrating MySQL databases from ‘on-premise’ to Amazon AWS RDS

Not all versions of databases can be migrated to RDS. Especially if you want to do a 0 downtime migration. Make sure you know which versions are possible, at this writing Amazon announced that it will support any old version of MySQL 5.1 and above.
In a zero downtime migration to Amazon RDS you work with mysqldump or mydumper to import the baseline data and and then you use MySQL Replication and the binary_log position to apply the additional records created during the import, the delta. That is it is possible to create a MySQL slave in the Amazon AWS Clouds!
So when you have confirmed the on-premise MySQL that you have is compatible you can then use mysqldump with the --master-data parameter to export your data including the binlog position coordinates at the time of the export. You can use mydumper if yor database is big to do this with parallel streams. You will use the coordinates and MySQL replication to catch-up with the on-premise master database when creating the MySQL slave in RDS.
Use different database parameters for different databases.
As you load the RDS database using myloader or mysql the operation might take long time depending on the size of your database. If this is the case, disable backups, it stops logging, try using one of the better spec RDS Instance classes and IOPS for the duration of the operation. You can always downsize the RDS instance after you have completed the initial load.
After you have completed the initial load, use Multi AZ which is a synchronous standby (in Oracle parlour) and schedule the backups immediately before you open your applications to the database, as initial backup requires a reboot.
Beware there is no SSH access to RDS, that means you have no access to the file system.
Get the DB Secuirty groups right and make sure your applications can access the RDS instances

Monday, 5 May 2014

My MySQL database impressions

I have been in the data and database world for over a decade now, working with mainly Oracle and data integration projects with Salesforce.

I have also spent time working with MySQL a relational database with open source roots, now part of Oracle. In this post I want to talk about my impressions of MySQL and some tips and tricks I learned working with it.

First and for all, you will have to be ready to get your hands dirty. I couldn't find a package structure for database administration in MySQL - like DBMS libraries of packages and procedures in Oracle. That means you will have to do most of the things on your own. Nevertheless good news is he database starts showing an Oracle banner when you login from version 5.0 onwards and some features like on-line Schema changes, more cost based optimisation and partitioning are added in versions 5.6 - a sign of good things to come.

Some key points

Data Import/Export - You can use the native mysqldump utility to dump data with parameters, but it is slow. You can dump schemas and data. I couldn't get it to dump data fast (in parallel) though that is why I strongly recommend mydumper an open source utility written by guys in Oracle and MySQL to dump data using parallel threads and is very fast. Import can be done in parallel as well and it can give you that boost provided your hardware permits it. Don't try to disable constraints, drop indexes before imports as you will read in posts and suggestions on the net, mysqldump already does that for you.
Hot Backup - mylvmbackup seems like the de-facto script to take hot backups when the database in online. There are tools like XtraBackup from Percona too. It takes a snapshot of the disk where your datafiles and logfiles are. At restore it does a crash recovery using the logs and brings the database transactions forwards to the point of crash. Then if you have the logs after that, you can play them forwards and bring the database to a point in time after the backup.
Parallel processing - Nada, there is none! I couldn't get it do anything in parallel. The only thing I managed to do in parallel was to export and import data with mydumper, that works! So if you have many CPUs you will be watching them being idle most of the time as one thread only will be chugging away. Unless you use mydumper for your import/export operations where you can make those CPUs sweat.
DBMS packages - You fancy automating, do you need scripts to do repetitive tasks? Well there is no DBMS package library to help you administer the database in MySQL. Instead, you can use Percona Toolkit scripts, a consultancy specialising in helping MySQL DBAs to do great work with MySQL databases. They have a variety of scripts from comparing (diff), syncing databases, tables to extracting metadata and GRANTS structures.
Hints, Explain Plan, Performance Tuning. I couldn't see much of Cost Based Optimisation in MySQL, the data dictionary (INFORMATION_SCHEMA) has metadata names but doesn't hold any dynamic statistics about objects, estimates of counts of rows in tables and indexes it holds can be up 50% wrong. The whole thing is based on heuristics, I suppose. The EXPLAIN PLAN is just a row where it says what the optimiser will do, there is no cost analysis or logical tree structure of execution plans yet. I couldn't see much on Join orders either, no Nested Loops, HASH or MERGE joins yet.

MySQL is a popular, relational database. The free version of this database is probably what a small website and a start-up needs. But having said that, many sites outgrow MySQL and still stay with it.

Oracle will probably turn it to a serious database too. Adding partitioning, multi threading to it in the recent releases, is a step forwards in becoming an Enterprise size and scale database. I don't know much about the MySQL Cluster Version and MySQL Replication I know takes a load off from the reads. I want to see it doing more Performance Tuning science.

Top tools with MySQL that I used

MySQL Workbench - SQL IDE.
Mydumper - Fast logical backup and restore.
Mylvmbackup - Hot backup script
Pentaho Kettle - PDI is an all round data integration and middle-ware tool

Friday, 28 June 2013

DataDirect Access your favorite SaaS app with SQL

Database driver company are writing the next generation of data access. They are working on a platform called Datadirect which will enable you to access loads of data sources in the cloud, from the cloud, and some of them with SQL.

Yes you have read correct, SQL!

All you have to do is visit their website and sign up for a trial account and access your favorite SaaS app - i.e Salesforce or any other provided Cloud app with SQL.

Here is how their availability of connectors looks like

To sign up just follow this link.

I very much like this noble idea. SQL is the universal language for data manipulation and access. And things like Big Data, Hadoop in the horizon they are pro-active and seems like they are already working on it.

See some list of data sources they can/will SQL:

Twitter
Facebook
Hive
Eloqua
force.com
Salesforce
SQLAzure
Microsoft Dynamics CRM
and many more...

This is on my hot list of technology to watch. Well done DataDirect.

Tuesday, 4 June 2013

Salesforce.com Real Time integration with Oracle using Informatica PowerCenter 9.5

In this post I will describe how you can integrate your Salesforce.com org with a relational database, like Oracle in real time, or better 'near' real time!

Many times I come across the requirement of quickly propagating changes from cloud platforms like Salesforce.com to on-premise data stores. You can do this with webservices, but that is not middleware and it requires coding.

How about doing with a data integration tool?

+Informatica Corporation's Informatica PowerCenter can achieve this by using the CDC (Change Data Capture) feature of the Informatica PowerCenter Salesforce connector, when Salesforce is the source in a mapping.

The configuration is simple. All you really have to set up is 2 properties in the Mapping Tab of a Session Task in Informatica Workflow Manager.

These are the properties:

Time Limit property to -1
Flush interval property to 60 seconds (minimum 60 seconds)

See a picture from one of my settings

And here is what these two settings mean from the PowerCenter PowerExchange for Salesforce.com User Guide:

CDC Time Limit

Time period (in seconds) that the Integration Service reads changed Salesforce data. When you set the CDC Time Limit to a non-zero value, the Integration Service performs a full initial read of the source data and then captures changes to the Salesforce data for the time period you specify. Set the value to -1 to capture changed data for an infinite period of time. Default is 0.

Flush Interval

Interval (in seconds) at which the Integration Service captures changed Salesforce data. Default is 300. If you set the CDC Time Limit to a non-zero value, the Integration Service captures changed data from the source every 300 seconds. Otherwise, the Integration Service ignores this value.

That's it, you don't have to configure anything else!

Once you set up these properties in the mapping tab of a session, save and restart the task in the workflow, the task will run continuously, non stop. The connector will poll the Salesforce org continuously and propagate any changes you do in Salesforce, downstream to the premise database system, including INSERT, UPDATE and DELETE operations.

Enjoy!

More reading:

SFDC CDC implementation in Informatica PowerCenter

Friday, 3 May 2013

UTC timestamps for Salesforce from Oracle

I came across the requirement the other day to update Salesforce every 5 minutes with data from Oracle.

The data in Oracle was a simple table with few columns A,B,C and a timestamp column T indicating the last modified date/time of the record.

To my surprise whenever I sent data changes from Oracle, mapping the columns and the timestamp to their corresponding Salesforce fields, the Salesforce field T would record the time 1 hour ahead of the wall clock time!

Quickly I realized that Salesforce, no matter where you are in the wold entering data to it, it always saves your standard / custom date/time fields in UTC (think of it as new GMT). The UI takes care of displaying your local time by working out the difference of where you are from the UTC timestamp.

The 1 hour difference I was experiencing was because of Time-Zones and DST - Daylight Saving Time. I live in London, United Kingdom and currently the DST here says BST which is British Summer Time, and that is GMT + 1 hour. So if you modify data in Salesforce at 14:00, according to wall clocks in London, in your custom timestamp field in Salesforce it will actually record 13:00! (UTC time).

So when sending timestamps to Salesforce you have to send them in UTC!

In case of my Oracle table A,B,C,T the date/time column was saying:

01/05/2013 17:07:20

If you send this timestamp to Salesforce as is, it will record this as a UTC value, but then because of the Salesforce UI and because of BST (GMT+1 DST) it will display:

01/05/2013 18:07:20 (1 hour ahead in the future!)

So the solution was to calculate the date/time in Oracle as UTC always, as this is the accepted value for Salesforce. To do exactly that and to provide the date/time field always in UTC no matter of DST I used the Oracle SQL CAST function to obtain the given date/time value at GMT (UTC) like this:

select to_char(cast(my_date as timestamp with local time zone) at time zone 'GMT', 'DD/MM/YYYY HH24:MI:SS') utc_my_date from my_table;

The formatting above will always return a GMT (UTC) date/time no matter which timezone or DST you are in. Then you can safely pass that as a timestamp to Salesforce and be sure that the correct time is going to be saved in Salesforce.

Pages