Saturday, 23 February 2019

Chasm Trap problem in Data Warehouses

A chasm trap in data warehouses occurs when two fact tables share one dimension table. It is a data modelling problem which causes double-counting and bad data when these tables are joined. SQL and relational databases surprise me every day!

Star schemata in data warehouses usually have one fact table and many dimension tables, where the fact table joins to its dimension tables via foreign keys. This is OK.

But what if you wanted to 'share' the dimension across two or more fact tables, use it as a common dimension, slowly starting to create an intertwined galaxy maybe? These multi-fact schemas are also called fact constellations.

For example, think of a CUSTOMERS dimension table with the details of the customers, and two fact tables, SALES and REFUNDS, with order and refund transactions, as in the data model below.





Preparing sample data to reproduce the Chasm Trap

CUSTOMERS



SALES


REFUNDS




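Since the original data screenshots are not reproduced here, below is a minimal, hypothetical version of the three tables (the column names and values are my assumptions, not the original data), which is enough to reproduce the trap: one customer with 2 sales and 2 refunds.

-- Hypothetical sample schema and data for the chasm trap demo
CREATE TABLE customers (
  customer_id   NUMBER PRIMARY KEY,
  customer_name VARCHAR2(100)
);

CREATE TABLE sales (
  sale_id     NUMBER PRIMARY KEY,
  customer_id NUMBER REFERENCES customers (customer_id),
  sale_amount NUMBER(10,2)
);

CREATE TABLE refunds (
  refund_id     NUMBER PRIMARY KEY,
  customer_id   NUMBER REFERENCES customers (customer_id),
  refund_amount NUMBER(10,2)
);

INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO sales     VALUES (10, 1, 100);
INSERT INTO sales     VALUES (11, 1, 200);
INSERT INTO refunds   VALUES (90, 1, 25);
INSERT INTO refunds   VALUES (91, 1, 50);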
BANG! Chasm Trap spotted in the join of the SALES, REFUNDS and CUSTOMERS tables





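Using the hypothetical tables above, a sketch of the join that springs the trap looks like this:

-- 3 tables, 2 joins: both fact tables hang off the shared CUSTOMERS dimension
SELECT c.customer_name,
       s.sale_amount,
       r.refund_amount
FROM   customers c
JOIN   sales     s ON s.customer_id = c.customer_id
JOIN   refunds   r ON r.customer_id = c.customer_id;
-- With 2 sales and 2 refunds for the same customer this returns 2 x 2 = 4 rows,
-- so SUM(sale_amount) and SUM(refund_amount) both come out doubled.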
What happened? Do you see the duplicates? We joined by the n-1 principle, that is 3 tables and 2 join conditions on the keys, yet it seems impossible to query these two fact tables via their common dimension. So why can't we find out how many SALES and REFUNDS a customer did? Strange, why does the query fall into a Cartesian product when you query two different fact tables via a common dimension table? Hint: think of the data modelling.

But before that, let's try to sort out the chasm trap.


A quick SQL solution to the 'Chasm Trap' is a pre-aggregated join of the tables like this:


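The original screenshot is not reproduced here, but a sketch of such a pre-aggregated join, using the hypothetical tables from above, could look like this:

-- Aggregate each fact table down to one row per customer first, then join
WITH sales_agg AS (
  SELECT customer_id, COUNT(*) AS sales_cnt, SUM(sale_amount) AS total_sales
  FROM   sales
  GROUP BY customer_id
),
refunds_agg AS (
  SELECT customer_id, COUNT(*) AS refunds_cnt, SUM(refund_amount) AS total_refunds
  FROM   refunds
  GROUP BY customer_id
)
SELECT    c.customer_name,
          s.sales_cnt,
          s.total_sales,
          r.refunds_cnt,
          r.total_refunds
FROM      customers c
LEFT JOIN sales_agg   s ON s.customer_id = c.customer_id
LEFT JOIN refunds_agg r ON r.customer_id = c.customer_id;
-- Each fact is now at one row per customer, so there is no fan-out and no double-counting.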
An alternative solution is to study the data model again and look for superfluous or wrongly assumed data relationships and dependencies. Maybe the dimension table is not meant to be shared. A review of the data model and data warehouse design might be necessary.

Conclusion

If you are going to use one dimension with two fact tables in your data warehouse, make sure you don't fall into the chasm trap, or your data will be incorrect and untrustworthy. You must first pre-aggregate and then join to get correct results, or revisit your understanding of the model and its relationships. For example, maybe the REFUNDS entity is not directly related to CUSTOMERS but to SALES?

Once again we see how important data modelling, and understanding the problem the data is trying to solve, really is.


Further reading:




Monday, 12 November 2018

Chart your SQL direct with Apache Zeppelin Notebook

Do you want a notebook where you can write some SQL queries to a database and see the results either as a chart or table?

As long as you can connect to the database with a username and have the JDBC driver, there is no need to transfer data into spreadsheets for analysis; just download (or run via Docker) the Apache Zeppelin notebook and chart your SQL directly!




I was impressed by the ability of the Apache Zeppelin notebook to chart SQL queries directly. To configure this open source tool and start charting your SQL queries, just point it at your database JDBC endpoint and then start writing some SQL in real time.

See below how simple this is: just provide your database credentials and you are ready to go.





Besides JDBC to any database (in my case I used a hosted Oracle cloud account), the notebook can also handle interpreters like Angular, Cassandra, Neo4j, Python, SAP and many others.
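For example, once the JDBC interpreter is pointed at your database, a Zeppelin paragraph is just the %jdbc prefix followed by plain SQL; the table and column names below are hypothetical:

%jdbc
-- Toggle between the table, bar, pie and line chart views of this result in Zeppelin
SELECT   region, SUM(amount) AS total_sales
FROM     sales
GROUP BY region
ORDER BY total_sales DESC;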

You can download Apache Zeppelin and configure it on localhost, or you can run it on Docker like this:

docker run -d -p 8080:8080 -p 8443:8443 -v $PWD/logs:/logs -v $PWD/notebook:/notebook xemuliam/zeppelin

Thursday, 10 May 2018

How much data do you have?

Sometimes you need to ask this most simple question about your database to figure out what the real size of your data is.

Databases store loads of auxiliary data such as indexes, aggregate tables, materialized views and other structures where the original data is repeated. Many times databases repeat the data in these structures to achieve better performance for the applications and reports they serve. The duplicate storage of data, in this case, is legitimate. It is there for a reason.

But should this repetition be measured and included in the database 'data' size?

Probably yes. After all, it is data, right?

To make things worse, many databases, due to many updates and deletes, over time create white space in their storage layer. This white space is unused, fragmented free space which cannot be reused by new data entries. This is bad. Often it will end up being scanned unnecessarily in full table scan operations, eating up your computing resources. But the most unfortunate fact is that it will appear as if it is data in your database size measurements when it is not!

It is just unused white space, nothing but costly void. Very bad.

There are mechanisms in databases which, when enabled, will automatically remedy the white space and reset and re-organise the storage of data in the background and save you space, time and money. Here is a link which talks about such mechanisms at length https://oracle-base.com/articles/misc/reclaiming-unused-space
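For example, in Oracle one such remedy is an online segment shrink; a minimal sketch, with a hypothetical table name:

-- Requires row movement; compacts the segment and releases the unused space
ALTER TABLE sales ENABLE ROW MOVEMENT;
ALTER TABLE sales SHRINK SPACE CASCADE;
ALTER TABLE sales DISABLE ROW MOVEMENT;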

One should be diligent, even suspicious, when measuring database sizes. There is loads of data which is repeated, and some of it is just blank void due to fragmentation and unused white space. You will be surprised to see how much white space exists in your database if you do not reclaim it during maintenance. If you are curious to find out, there are ways you can measure the white space and the real data.

So, how do we measure?
Below is a database size measuring SQL script which can be used with Oracle to show data (excluding the indexes) in tables and partitions. It also tries to estimate the real storage (in the actual_gb column), excluding the white space, by multiplying the number of rows in a table with the average row size. Replace '<YOURSCHEMA>' in the code with the schema you wish to measure, and make sure statistics are gathered before you measure, since the estimate relies on them.

-- Compare the estimated real data size (num_rows x avg_row_len from the optimiser
-- statistics) with the space actually allocated to the table segments.
SELECT SUM(actual_gb)  AS actual, 
       SUM(segment_gb) AS allocated 
FROM   ( 
                SELECT   s.owner, 
                         t.table_name, 
                         s.segment_name, 
                         s.segment_type, 
                         t.num_rows, 
                         t.avg_row_len, 
                         t.avg_row_len * t.num_rows / 1024 / 1024 / 1024 actual_gb, 
                         SUM(s.bytes)  / 1024 / 1024 / 1024              segment_gb 
                FROM     dba_segments s, 
                         dba_tables t 
                WHERE    s.owner = '<YOURSCHEMA>' 
                AND      t.table_name = s.segment_name 
                AND      s.segment_type IN ('TABLE', 
                                            'TABLE PARTITION', 
                                            'TABLE SUBPARTITION') 
                GROUP BY s.owner, 
                         t.table_name, 
                         s.segment_name, 
                         s.segment_type, 
                         t.num_rows, 
                         t.avg_row_len, 
                         t.avg_row_len * t.num_rows / 1024 / 1024 / 1024 );
ACTUAL     ALLOCATED
---------- ----------
18.9987    67.3823

Saturday, 3 March 2018

SQL with Apache Spark, easy!

Reading about cluster computing developments like Apache Spark and its SQL support, I decided to find out for myself.

What I was after was to see how easy it is to write SQL in Spark SQL. In this micro-post I will show you how easy it is to run SQL on a JSON file.

For my experiment I will use my chrome_history.json file, which you can download from your Chrome browser using the extension www.JSON-XLS.com. To run the SQL query with PySpark on my laptop I will use the PyCharm IDE. After a little bit of configuration in PyCharm, setting up the environment (SPARK_HOME), there it is: it only takes 3 lines to be able to SQL-query a JSON document in Spark SQL.
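The screenshot of the PySpark code is not reproduced here, but the same idea expressed as plain Spark SQL (for example from the spark-sql shell, or passed to spark.sql() in PySpark) is sketched below; the title and url fields are assumptions about the JSON structure:

-- Expose the JSON file as a temporary view using Spark's JSON data source
CREATE TEMPORARY VIEW chrome_history
USING json
OPTIONS (path 'chrome_history.json');

-- Then query it like any other table
SELECT title, url
FROM   chrome_history
LIMIT  10;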





Think of the possibilities with SQL here, and the 'cluster' partitioning and parallelisation you can achieve.


Links:

Apache Spark: https://spark.apache.org/downloads.html

PyCharm: https://www.jetbrains.com/pycharm/

Saturday, 4 March 2017

Bad bad bad data




One should feel really sorry for anyone who relies on filtering and making decisions based on bad, bad data. It is going to be a bad decision.

This is serious stuff. I read the other day a recent study by IBM which shows that "Bad Data" costs the US $3.1 trillion per year!

OK, let's say you don't mind the money and have money to burn; how about the implications of using the bad data? As the article hints, these could be misinformation, wrong calculations, bad products and weak decisions, and mind you, these will be weak/wrong 'data-driven' decisions. Opportunities can be lost here.

So why all this badness? Isn't it preventable? Can't we do something about it?

Here are some options

1) Data Cleansing: This is a reactive solution where you clean, disambiguate, correct and change the bad data, putting the right values in the database after you find the bad data. This is something you do when it is too late and you already have bad data. Better late than never, but it is a rather expensive and very time-consuming solution, never mind the chance that you can still get it wrong. There are tools out there which you can buy and which can help you do data cleansing. These tools will 'de-dupe' and correct the bad data up to a point. Of course data cleansing tools alone are not enough; you will still need the data engineers and data experts who know your data, or who can study your data, to guide you. Data cleansing is the damage control option. It is the solution hinted at in the article as well.

2) Good Database Design: Use constraints! My favourite option. Constraints are key at database design time. Put native database checks and constraints on the database entities and tables to guarantee the entity integrity and referential integrity of your database schema, and validate more! Do not just create plain simple database tables; always think of ways to enforce the integrity of your data, and do not rely only on code. Prevent NULLs and empty strings as much as you can at database design time, and put unique indexes and logical check constraints inside the database. Use the database tools and features you are already paying for in your database licence and which are already available to you; do not re-invent the wheel, validate your data. This way you will prevent the 'creation' of bad data at the source! Take a proactive approach. In projects, don't just skip the database constraints and say you will do it in the app or later. You know you will not; chances are you will forget it in most cases. Also, apps can change, but databases tend to outlast the apps. Look at a primer on how to do database design, and see the small DDL sketch after this list.
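A minimal DDL sketch of this constraint-driven approach, with made-up table and column names:

-- Hypothetical tables showing NOT NULL, UNIQUE, CHECK, PRIMARY KEY and FOREIGN KEY constraints
CREATE TABLE customers (
  customer_id NUMBER        PRIMARY KEY,
  email       VARCHAR2(255) NOT NULL UNIQUE,
  status      VARCHAR2(10)  NOT NULL CHECK (status IN ('ACTIVE', 'INACTIVE'))
);

CREATE TABLE orders (
  order_id    NUMBER       PRIMARY KEY,
  customer_id NUMBER       NOT NULL REFERENCES customers (customer_id),
  amount      NUMBER(10,2) NOT NULL CHECK (amount > 0),
  order_date  DATE         NOT NULL
);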

My 'modus operandi' is option 2. A good database design and data engineering can save you money, a lot of money. Don't rush into projects neglecting or skipping database tasks; engage the data experts and software engineers with the business, find out the requirements, talk about them, ask many questions and do data models. Reverse engineer everything in your database and have a look. Know your data! That's the only way to have good, integral and reliable true data, and it will help you and your customers win.