Wednesday 12 January 2022

Is Data Hub the new Staging environment?

"A data hub is an architectural pattern that enables the mediation, sharing, and governance of data flowing from points of production in the enterprise to points of consumption in the enterprise." Ted Friedman, datanami.com

Aren't relational databases, data marts, data warehouses and, more recently, data lakes enough? Why is there a need to come up with yet another strategy and paradigm for data management?

To begin answering these questions, I suggest we look at the history of data management and work out how data architecture arrived at a new architectural pattern like the data hub. After all, history is important. As a famous quote from Martin Luther King Jr. says, "We are not makers of history. We are made by history."


Relational architecture




A few decades ago, businesses began using relational databases and data warehouses to keep a consistent and coherent record of their interests. The relational architecture still keeps the clocks ticking with its well-understood architectural structures and relational data models. It is a sound and consistent architectural pattern, based on mathematical theory, which will continue serving data workloads. The relational architecture brilliantly serves the very specific use case of transactional workloads, where the data semantics are defined before any data is stored in any system. If implemented correctly, the relational model can become a hub of information that is centralised and easy to query. It is hard to see how the relational architecture itself could be the cause of a paradigm shift towards something like a data hub. Most likely it is something else. Could it be cloud computing?


When the cloud came, it changed everything. The cloud brought along an unfathomable proliferation of apps and an incredible amount of raw and unorganised data. With this outlandish amount of disorganised data in the pipes, the suitability of the relational architecture for data storage had to be re-examined. Faced with a data deluge, the relational architecture couldn't scale quickly enough and couldn't serve the analytical workloads and the needs of the business in a reasonable time. Put simply, there was no time to understand and model the data. The sheer volume of unorganised data coming from the cloud, structured and unstructured, at high speed, propelled engineers to look for a new architectural pattern.


Data Lake 



In a data lake, structured and unstructured data is stored raw and no questions are asked. Data is no longer organised and kept in well-understood data models, and it can be stored indefinitely and in abundance. Moreover, very conveniently, the work of understanding the data and creating a data model is deferred to the future, an approach known as schema-on-read. We have to admit, the data lake is the new monolith where data is only stored, a mega data dump yard indeed. This new architectural pattern also brought with it massively parallel processing (MPP) data platforms, along with tools and disciplines such as machine learning, which became the standard methods for extracting business insights from the absurd amounts of data found in a data lake. Unfortunately, the unaccounted-for amounts of unknown data living in a data lake didn't help us understand data better and made the life of engineers even more difficult. Does a data lake hold redundant or bad data? Are there complex data silos living in a data lake? These are still hard questions to answer, and the chaotic data lakes look like they are missing a mediator.
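To make schema-on-read a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `raw_lake` records, the field names and the `read_sales` helper are hypothetical and not tied to any particular data lake product. The point is only that the records are stored exactly as they arrive, and a structure is imposed when somebody finally reads them.

```python
import json

# Hypothetical raw records, kept exactly as they arrived.
# Nothing was validated or modelled when they were written.
raw_lake = [
    '{"customer": "Ada", "amount": "19.90", "ts": "2022-01-12"}',
    '{"customer": "Bob", "amount": 5}',
    '{"sensor": "t-1", "reading": 21.4}',
]


def read_sales(lines):
    """Apply a 'sales' schema at read time and skip records that do not fit."""
    sales = []
    for line in lines:
        record = json.loads(line)
        if "customer" in record and "amount" in record:
            sales.append({
                "customer": str(record["customer"]),
                "amount": float(record["amount"]),  # type is fixed only now
                "ts": record.get("ts"),             # may be missing in raw data
            })
    return sales


sales = read_sales(raw_lake)
total = sum(s["amount"] for s in sales)
```

Notice that the sensor record is silently ignored by this particular reader. Another reader with a different schema could pick it up later, which is both the convenience and the danger of deferring the data model.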


Data Hub



Could the mediator be a "data hub"? It is an architectural pattern based on the hub-and-spoke architecture. A data hub, which is itself another database system, integrates and stores critical data and metadata for mediation, drawn from diverse and complex transactional and analytical workloads and data sources. Once the data is stored, the data hub becomes the tool to "harmonise" and "enrich" it and then radiate it, via its spokes, to the AI, machine learning and other enterprise insights and reporting systems.
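As a rough illustration of the hub-and-spoke idea, the toy Python sketch below ingests a record from a source, harmonises its field names, enriches it with lineage metadata and radiates it to every subscribed spoke. The `DataHub` class, its method names and the sample record are all hypothetical. A real data hub is a full database system with governance and security features, which this sketch does not attempt to model.

```python
from collections import defaultdict


class DataHub:
    """Toy hub: stores harmonised records and pushes them out to spokes."""

    def __init__(self):
        self.store = []                  # harmonised records kept in the hub
        self.spokes = defaultdict(list)  # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        """Attach a consuming system (a spoke) to a topic."""
        self.spokes[topic].append(callback)

    def ingest(self, source, topic, record):
        """Harmonise field names, enrich with lineage, then radiate."""
        harmonised = {key.strip().lower(): value for key, value in record.items()}
        harmonised["_source"] = source   # enrichment: where the data came from
        self.store.append(harmonised)
        for callback in self.spokes[topic]:
            callback(harmonised)
        return harmonised


hub = DataHub()
reporting = []     # stands in for a reporting system
ml_features = []   # stands in for a machine learning pipeline
hub.subscribe("customers", reporting.append)
hub.subscribe("customers", ml_features.append)

hub.ingest("crm", "customers", {" Name ": "Ada", "COUNTRY": "UK"})
```

Both consumers receive the same cleaned record without knowing anything about the source system, which is the separation the mediation is meant to provide.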

What's more, while sharing the data in its spokes, the data hub can also help engineers to govern, secure and catalogue the data landscape of the enterprise. The separation of data via mediation from the source and target database systems inside a data hub also offers engineers the flexibility to operate and govern independently of the source and target systems. But this reminds me of something.

If the data hub paradigm is a mediator introduced to understand, organise, correct and enrich data, and to bring order to the chaos of data lake monoliths, doesn't the data hub look similar to the data management practice engineers have been following for decades, the one we all know as "Staging"? Is the data hub the evolved version of staging?

Conclusion

The most difficult thing in anything you do is to persuade yourself that there is some value in doing it. It is the same when adopting a new architectural pattern as a data management solution. You have to understand where the change is coming from and see the value before you embark on using it. The data upsurge brought by the internet and cloud computing caused changes to be made in data architecture and data storage solutions. The data hub is a new architectural pattern in data management, introduced to mediate the chaos of the fast-flowing data tsunamis around us, and we hope it will help us tally everything up.