Wednesday, 12 January 2022

Is Data Hub the new Staging environment?

"A data hub is an architectural pattern that enables the mediation, sharing, and governance of data flowing from points of production in the enterprise to points of consumption in the enterprise” Ted Friedman, datanami.com

Aren't relational databases, data marts, data warehouses and more recently data lakes not enough? Why is there a need to come up with yet another strategy and paradigm for database management?

To begin answering the above questions, I suggest we start looking at the history of data management and figure out how data architecture developed a new architectural pattern like Data Hub. After all, history is important as a famous quote from Martin Luther King Jr. says "We are not makers of history. We are made by history"


Relational architecture




A few decades ago, businesses began using relational databases and data warehouses to store their interests in a consistent and coherent recordThe relational architecture still keeps the clocks ticking with its well understood architectural structures and relational data models. It is a sound and consistent architectural pattern based on mathematical theory which will continue serving data workloads. The relational architecture serves brilliantly the very specific use case of transactional workloads, where the data semantics are defined in advance before any data is stored in any system. If implemented correctly the relational model can become a hub of information that is centralised and easy to query. It is hard to see that the relational architecture could be the reason to cause a paradigm shift into something like a data hub. Most likely is something else. Could it be cloud computing?


When the cloud came, it changed everything. The Cloud brought along an unfathomable proliferation of apps and an incredible amount of raw and unorganised data. With this outlandish amount of disorganised data in the pipes, the suitability of the relational architecture for data storage had to be re-examined and reviewed.  Faced with a data deluge, the relational architecture couldn't scale quickly and couldn't serve the analytical workloads and the needs of the business in a reasonable time. Put simply, there was no time to understand and model data. The sheer weight of the number of unorganised chunks of data coming from the cloud, structured and unstructured, at high speeds, propelled the engineers to look for a new architectural pattern.


Data Lake 



In a data lake, the structured and unstructured data chunks are stored raw and no questions are asked. Data is not organised and is not kept in well-understood data models anymore and it can be stored infinitely and in abundance. Moreover, very conveniently the process of understanding and creating a data model in a data lake is deferred to the future, which is a process known as schema-on-read. We have to admit, the data lake is the new monolith where data is stored only, a mega data dump yard indeed. This new architectural pattern also brought with it the massively parallel (MPP) processing data platforms, tools and disciplines, such as machine learning, which became the standard methods for extracting business insights from the absurd amounts of data found in a data lake. Unfortunately, the havoc and the unaccounted amounts of unknown data living in a data lake didn't help with understanding data and made the life of engineers difficult. Does a data lake have any redundant data or bad data? Are there complex data silos living in a data lake? These are still hard questions to answer and the chaotic data lakes looked like are missing a mediator. 


Data Hub



Could the mediator be a "data hub"? It is an architectural pattern based on the hub and spoke architecture. A data hub, which itself is another database system, integrates and stores data for mediation, most likely temporarily, from diverse and complex transactional and analytical workloads. Once the data is stored, the data hub becomes the tool to harmonise and enrich data and then radiate data to the AI, Machine Learning and other enterprise insights and reporting systems via its spokes.

What's more, while sharing the data in its spokes, the data hub can also help engineers to govern and catalogue the data landscape of the enterprise. The separation of data via mediation from the source and target database systems inside a data hub also offers engineers the flexibility to operate and govern independently of the source and target systems. But this reminds me of something.

If the data hub paradigm is a mediator presented to understand, organise, correct, enrich and put an order in the data chaos inside data lake monoliths, doesn't the data hub look similar to the data management practice engineers have been doing for decades in data warehouses and we all know as "Staging"? Is data hub the evolved version of staging?

Conclusion

The most difficult thing in anything you do is to persuade yourself that there is some value in doing it. It is the same when adopting a new architectural pattern as a data management solution. You have to understand where the change is coming from and see the value before you embark on using it. The data upsurge brought by the internet and cloud computing forced wobbly changes in data architecture and data storage solutions. The data hub is a new architectural pattern in data management introduced to mediate the chaos of fast-flowing data tsunamis around us and we hope it will help us tally everything up.


Saturday, 17 October 2020

Oracle Apex 20.2 REST API data sources and database tables

Oracle Apex 20.2 is out and has a very interesting new feature, REST Data Source Synchronisation


Why is the REST Data Source Synchronization feature interesting? 


Oracle Apex REST Data Source Synchronisation is exciting because it lets you query REST endpoints on the internet on a schedule or on-demand basis and saves the results automatically in database tables.


I think this feature will suit slow-changing data accessible with REST APIs very well. That is, if a REST endpoint data is known to be changing, say few times a day, why should we call the REST endpoint via HTTP every time we wanted to display data on an Apex page? Why would one want to render a page with data over HTTP if that data changes only once a day? Why should we cause network traffic and keep machines busy for data which is not changing often? Or maybe by requirement, you only needed to query a REST endpoint once a day and store it somewhere for data-warehousing.


Wouldn't it be better to store the data in a database table and render it from there every time a page is viewed?


This is exactly what the REST Data Source Synchronisation does. It queries the REST API endpoint and saves the JSON response as data in a database table on a schedule of your choice or on demand


For my experiment, I used the Public Free London TfL REST API Endpoint from the TfL API which holds data for TfL transportation disruptions and I configured this endpoint to synchronise with my database table every day at 5am.





I even created the Oracle Apex REST Data source inside the apex.oracle.com platform. I used the TfL API Dev platform provided key to make the call from there to the TfL REST endpoint and I managed to sync it once a day on an Oracle Apex Faceted Search page and some charts.


I was able to do all this with zero coding, just pointing the Oracle Apex REST Data Source I created for the TfL API to a table and scheduling the sync to happen once a day at 5am. 


To see the working app, go to this link: https://apex.oracle.com/pls/apex/databasesystems/r/tfl-dashboard/home


Screenshots of the app below






Wednesday, 13 November 2019

Oracle Apex Social Sign in

In this post I want to show you how I used the Oracle Apex Social Sign in feature for my Oracle Apex app. Try it by visiting my web app beachmap.info.




Oracle Apex Social Sign in gives you the ability to use oAuth2 to authenticate and sign in users to your Oracle Apex apps using social media like Google, Facebook and others.

Google and Facebook are the prominent authentication methods currently available, others will probably follow. Social sign in is easy to use, you don't need to code, all you have to do is to obtain project credentials from say Google and then pass them to the Oracle Apex framework and put the sign in button to the page which will require authentication and the flow will kick in. I would say at most a 3 step operation. Step by step instructions are available in the blog posts below.


Further reading:




Saturday, 23 February 2019

Chasm Trap problem in Data Warehouses

Chasm trap  in data warehouses occurs when two fact tables relate into one dimension table. This is a data modelling problem which will cause double-counting and bad data when these tables are joined. SQL and relational databases surprise me every day!

Star schemata in data warehouses usually have one fact table and many dimension tables where the fact table joins to it's dimensions tables via foreign keys. This is OK.

But what if you wanted to 'share' the dimension across two or more fact tables, use it commonly, slowly starting to create an intertwined galaxy maybe! These multi-fact schemas are also called Fact-Constellations.

For example think of a CUSTOMERS dimension table with the details of the customers and two fact tables SALES and REFUNDS with order and refund transactions as in the data model below.





Preparing sample data to reproduce the Chasm Trap

CUSTOMERS



SALES


REFUNDS




BANG! Chasm Trap spotted in the join of the SALES, REFUNDS and CUSTOMERS tables





What happened? Do you see the duplicates? We joined by the n-1 principle. That is 3 tables 2 join statements on the keys. It seems is impossible to query these two fact tables via their common dimension. So why can't we know how many SALES and REFUNDS a customer did? Strange, why is the query falling into a Cartesian Product when you want to query from two different tables via a common dimension table? Hint: Think of data modeling. 

But before let's try to sort out the chasm trap.


A quick SQL solution to 'Chasm Trap' is a pre-aggregated join of the tables like this:


Or an alternative solution can be to study again the data model and look at superfluous or wrongly assumed data relationships and dependencies. Maybe the dimension table is not meant to be shared. Review of the data model and data warehouse design might be necessary.

Conclusion

If you are going to use one dimension on two fact tables in your Data Warehouse, make sure you don't fall into the Chasm Trap. Or your data will be incorrect and untrustworthy. You must first pre-aggregate and then join to get correct results or look to better your understanding of the model and the relationships. For example maybe the REFUNDS entity is not directly related to CUSTOMERS but it is to SALES?

Once a gain we see how important data modeling and the understanding of the problem the data is trying to solve is.


Further reading:




Monday, 12 November 2018

Chart your SQL direct with Apache Zeppelin Notebook

Do you want a notebook where you can write some SQL queries to a database and see the results either as a chart or table?

As long as you can connect to the database with a username and have the JDBC driver, no need to transfer data into spreadsheets for analysis, just download (or docker) and use Apache Zeppelin notebook and chart your SQL directly!




I was impressed by the ability of Apache Zeppelin notebook to chart SQL queries directly. To configure this open source tool and start charting your SQL queries just point it your database JDBC endpoint and then start writing some SQL in real time.

See below how simple this is, just provide your database credentials and you are ready to go.





The notebook besides JDBC to any database, in my case I used a hosted Oracle cloud account, can also handle interpreters like: angular, Cassandra, neo4j, Python, SAP and many others.

You can download Apache Zeppelin and configure on localhost or you can run it on docker like this

docker run -d -p 8080:8080 -p 8443:8443 -v $PWD/logs:/logs -v $PWD/notebook:/notebook xemuliam/zeppelin