Monday, 12 November 2018

Chart your SQL direct with Apache Zeppelin Notebook

Do you want a notebook where you can write some SQL queries to a database and see the results either as a chart or table?

As long as you can connect to the database with a username and have the JDBC driver, no need to transfer data into spreadsheets for analysis, just download (or docker) and use Apache Zeppelin notebook and chart your SQL directly!




I was impressed by the ability of Apache Zeppelin notebook to chart SQL queries directly. To configure this open source tool and start charting your SQL queries just point it your database JDBC endpoint and then start writing some SQL in real time.

See below how simple this is, just provide your database credentials and you are ready to go.





The notebook besides JDBC to any database, in my case I used a hosted Oracle cloud account, can also handle interpreters like: angular, Cassandra, neo4j, Python, SAP and many others.

You can download Apache Zeppelin and configure on localhost or you can run it on docker like this

docker run -d -p 8080:8080 -p 8443:8443 -v $PWD/logs:/logs -v $PWD/notebook:/notebook xemuliam/zeppelin

Thursday, 10 May 2018

How much data do you have?

Sometimes you need to ask this most simple question about your database to figure out what the real size of your data is.

Databases store loads of auxiliary data such as indexes, aggregate tables, materialized views and other structures where the original data is repeated. Many times databases repeat the data in these structures for the sake of achieving better performance gains for the applications and reports they serve. The duplicate storage of data, in this case, is legitimate. It is there for a reason.

But should this repetition be measured and included in the database 'data' size?

Probably yes. After all, it is data, right?

To make things worse, many databases due to many updates and deletes, over time create white space in their storage layer. This white space is unused fragmented free space which can not be re-used by new data entries. This is bad. Often it will end up being scanned in full table scan operations unnecessarily, eating up your computing resources. But the most unfortunate fact is that it will appear as if it is data in your database size measurements when it is not!

It is just unused white space, nothing but costly void. Very bad.

There are mechanisms in databases which, when enabled, will automatically remedy the white space and reset and re-organise the storage of data in the background and save you space, time and money. Here is a link which talks about such mechanisms at length https://oracle-base.com/articles/misc/reclaiming-unused-space

One should be diligent when measuring database sizes, be suspicious. There is loads of data which is repeated and some of it is just the blank void due to fragmentation and unused white-space. You will be surprised to see how much database white space exists in your database if you do not reclaim it back during maintenance. If you are curious to find out, there are ways you can measure the whitespace and the real data.

So, how do we measure?
Below is a database size measuring SQL script which can be used with Oracle to show data (excluding the indexes) in tables and partitions. It also tries to estimate real storage (in the actual_gb column) excluding the whitespace by multiplying the number of rows in a table with the average row size. Replace the '<YOURSCHEMA>' in the code with the schema you wish to measure. Provided you have statistics calculated before you measure.

SELECT SUM(actual_gb)  AS actual, 
       SUM(segment_gb) AS allocated 
FROM   ( 
                SELECT   s.owner, 
                         t.table_name, 
                         s.segment_name, 
                         s.segment_type, 
                         t.num_rows, 
                         t.avg_row_len, 
                         t.avg_row_len * t.num_rows / 1024 / 1024 / 1024 actual_gb, 
                         SUM(s.bytes)  / 1024 / 1024 / 1024              segment_gb 
                FROM     dba_segments s, 
                         dba_tables t wheres.owner = '<YOURSCHEMA>' 
                AND      t.table_name = s.segment_nameand segment_type IN ('TABLE', 
                                                                           'TABLE PARTITION', 
                                                                           'TABLE SUBPARTITION') 
                GROUP BY s.owner, 
                         t.table_name, 
                         s.segment_name, 
                         s.segment_type, 
                         t.num_rows, 
                         t.avg_row_len, 
                         t.avg_row_len * t.num_rows / 1024 / 1024 / 1024 );
ACTUAL     ALLOCATED
---------- ----------
18.9987    67.3823

Saturday, 3 March 2018

SQL with Apache Spark, easy!

Reading about cluster computing developments like Apache Spark and SQL I decided to find out.

What I was after was to see how easy is to write SQL in Spark-SQL. In this micro-post I will show you how easy is to SQL a JSON file.

For my experiment I will use my chrome_history.json file which you can download from your chrome browser using the extension www.JSON-XLS.com. To run the SQL query on PySpark on my laptop I will use the PyCharm IDE. After little bit of configuration on PyCharm, setting up environments (SPARK_HOME), there it is: It only takes 3 lines to be able SQL query a JSON document in Spark-SQL.



(click image to enlarge)


Think of the possibilities with SQL, the 'cluster' partitioning and parallelisation you can achieve


Links:

Apache Spark: https://spark.apache.org/downloads.html

PyCharm: https://www.jetbrains.com/pycharm/

Saturday, 4 March 2017

Bad bad bad data




One should feel really sorry about anyone who will rely on filtering and making a decision based on bad, bad data. It is going to be a bad decision.

This is serious stuff. I read the other day a recent study by IBM which shows that "Bad Data" costs US $3.1 trillion per year!

OK, let's say you don't mind the money and have money to burn, how about the implications of using the bad data? As the article hints these could be misinformation, wrong calculations, bad products, weak decisions mind you these will be weak/wrong 'data' driven decisions. Opportunities can be lost here.

So why all this badness, isn't it preventable? Can't we not do something about it?

Here are some options

1) Data Cleansing: This is a reactive solution where you clean, disambiguate, correct, change the bad data and put the right values in the database after you find the bad data. This is something you do when is too late and you have bad data already. Better late than never. A rather expensive and very time consuming solution. Nevermind the chance that you can still get it wrong. There are tools out there which you can buy and can help you do data cleansing. These tools will 'de-dupe' and correct the bad data up to a point. Of course data cleansing tools alone are not enough, you will still need those data engineers and data experts who know your data or who can study your data to guide you. Data cleansing is the damage control option. It is a solution hinted in the article as well.

2) Good Database Design: Use constraints! My favourite option. Constraints are key at the design time of the database, do put native database checks and constraints in the database entities and tables to guarantee the entity integrity and referential integrity of your database schema, validate more! Do not just create plain simple database tables, always think of ways to enforce the integrity of your data. Do not rely only on code. Prevent NULLS or empty strings as much as you can at database design time, put unique indexes and logical check constraints inside the database. Use database tools and features you are already paying for in your database license and already available to you, do not re-invent the wheel, validate your data. This way you will prevent the 'creation' of bad data at the source! Take a proactive approach. In projects don't just skip the database constraints and say I will do it in the app or later. You know you will not, chances are you will forget it in most of the cases. Also apps can change, but databases tend to outlast the apps. Look at a primer on how to do Database Design

My 'modus operandi' is option 2, a Good Database Design and data engineering can save you money, a lot of money, don't rush into projects with neglecting or skipping database tasks, engage the data experts, software engineers with the business find out the requirements, talk about them, ask many questions and do data models. Reverse engineer everything in your database, have a look. Know your data! That's the only way to have good, integral and reliable true data, and it will help you and your customers win.

Friday, 23 December 2016

Primary Storage, Snapshots, Databases, Backup, and Archival.

Data in the enterprise comes in many forms. Simple flat files, transactional databases, scratch files, complex binary blobs, encrypted files, and whole block devices, and filesystem metadata. Simple flat files, such as documents, images, application and operating system files are by far the easiest to manage. These files can simply be scanned for access time to be sorted and managed for backup and archival. Some systems can even transparently symlink these files to other locations for archival purposes. In general, basic files in this category are opened and closed in rapid succession, and actually rarely change. This makes them ideal for backup as they can be copied as they are, and in the distant past, they were all that there was and that was enough.

Then came multitasking. With the introduction of multiple programs running in a virtual memory space, it became possible that files could be opened by two different applications at once. It became also possible that these locked files could be opened and changed in memory without being synchronized back to disk. So elaborate systems were developed to handle file locks, and buffers that flush their changes back to those files on a periodic or triggered basis. Databases in this space were always open, and could not be backed up as they were. Every transaction was logged to a separate set of files, which could be played back to restore the database to functionality. This is still in use today, as reading the entire database may not be possible, or performant in a production system. This is called a transaction log. Mail servers, database management systems, and networked applications all had to develop software programming interfaces to backup to a single string of files. Essentially this format is called Tape Archive (tar.)

Eventually and quite recently actually, these systems became so large and complex as to require another layer of interface with the whole filesystem, there were certain applications and operating system files that simply were never closed for copy. The concept of Copy on Write was born. The entire filesystem was essentially always closed, and any writes were written as an incremental or completely new file, and the old one was marked for deletion. Filesystems in this modern era progressively implemented more pure copy on write transaction based journaling so files could be assured intact on system failure, and could be read for archival, or multiple application access. Keep in mind this is a one paragraph summation of 25 years of filesystem technology, and not specifically applicable to any single filesystem.


Along with journaling, which allowed a system to retain filesystem integrity, there came an idea that the files could intelligently retain the old copies of these files, and the state of the filesystem itself, as something called a snapshot. All of this stems from the microcosm of databases applied to general filesystems. Again databases still need to be backed up and accessed through controlled methods, but slowly the features of databases find their way into operating systems and filesystems. Modern filesystems use shadow copies and snapshotting to allow rollback of file changes, complete system restore, and undeletion of files as long as the free space hasn’t been reallocated.

Which brings us to my next point which is the difference between a backup or archive, and a snapshot. A snapshot is a picture of what a disk used to be. This picture is kept on the same disk, and in the event of a physical media failure or overuse of the disk itself, is in totality useless. There needs to be sufficient free space on the disk to hold the old snapshots, and if the disk fails, all is still lost. As media redundancy is easily managed to virtually preclude failure, space considerations especially in aged or unmanaged filesystems, can easily get out of hand. The effect of a filesystem growing near to capacity is essentially a limitation of usable features. As time moves on, simple file rollback features will lose all effectiveness, and users will have to go to the backup to find replacements.

There are products and systems to automatically compress and move files that are unlikely to be accessed in the near future. These systems usually create a separate filesystem and replace your files with links to that system. This has the net effect of reducing the primary storage footprint, the backup load, and allowing your filesystem to grow effectively forever. In general, this is not such a good thing as it sounds, as the archive storage may still fill up, and you then have an effective filesystem that is larger than the maximum theoretical size, which will have to be forcibly pruned to ever restore properly. Also, your backup system, if the archive system is not integrated, probably will be unaware of the archive system. This would mean that the archived data would be lost in the event of a disaster or catastrophe.

Which brings about another point, whatever your backup vendor supports, you are effectively bound to use those products for the life of the backup system. This may be ten or more years and may impact business flexibility. Enterprise business systems backup products easily can cost dozens of thousands per year, and however flexible your systems need to be, so your must your backup vendor provide.

Long term planning and backup systems go hand in hand. Ideally, you should be shooting for a 7 or 12-year lifespan for these systems. They should be able to scale in features and load for the predicted curve of growth with a very wide margin for error. Conservatively, you should plan on a 25% data growth rate per year minimum. Generally speaking 50 to 100% is far more likely. Highly integrated backup systems truly are a requirement of Information Services, and while costly, failure to effectively plan for disaster or catastrophe will lead to and end of business continuity, and likely the continuity of your employment.


Jason Zhang is the product marketing person for Rocket Software's Backup, Storage, and Cloud solutions.