Database Systems: March 2008

In this post I will demonstrate how a normalized relational data model which is in 3rd Normal Form (3NF) evolves to a dimensional data model (star schema) in a data warehouse.

You can use this post to practice your dimensional modeling and find tips and clues on how to transform the solid Relational Data model of your operational OLTP environment into a dimensional data model suitable for your data warehouse.

There is lots of literature on benefits and uses of dimensional modeling and design in data warehouses. Primary author in this area is Ralph Kimball and his book called The Data Warehouse Toolkit. After I read this book I have been tempted to investigate his approaches hands-on and describe them in a post.

Why do we have to use Dimensional Modeling in the Data Warehouse?

According to Ralph Kimball the necessity for Dimensional Modeling in the warehouse is the complexity and the performance problems of Relational Models. He believes that relational models are complex and hard to understand when presented to business people and the granularity of information in these models is too much. Business people are after slicing and dicing the data and expect quick response times to their ad hoc queries. But business people, in order to be able to pose the query, first they must understand the data. Kimball says that dimensional modeling is able to do that and make business managers better understand data in their organisations.

His second argument is on performance. The relational models are too detailed he says, they contain to many tables related to each other, which make them difficult to join. He argues that the relational models describe in one big picture, business processes which usually do not happen at the same time (ie, sales and delivery). Data normalization and the goal to avoid redundancy and inconsistency in OLTP systems is what drives relational models. But, in the data warehouse the drive is different. Data warehouses are all about providing historical information for general use in reporting and decision making. It is not about capturing and ensuring that the information is consistent. He does not by any means suggest that Data Warehouse data is not consistent. He only makes the point that their driving idea is different. Operational OLTP systems have little data, maybe the last week of operations, whereas Data Warehouses have data for the last 5-10 years. The large amount of data is another factor effecting performance in data warehouses.

All this is fine. Now lets look on how Kimball suggests that we change our Relational Model to a Dimensional model.

When attacking a corporate data model to transform it to a dimensional data model to be used in the data warehouse, the designer must know very well the definitions of the fact table this table is also called cube or multidimensional hypercube and the dimension table. Because these are the only 2 different kinds of tables you have in a dimensional model. Loosely speaking you have to distinguish and blend these 2 kinds of tables from your relational model and at the end you will have a dimensional model!

Definitions of the Fact Table and Dimension Table from Ralph Kimball's book are below.

Fact Table: A fact table is the primary table in a dimensional model where the numeric performance measurements of the business are stored.

Dimensions Table: Dimension tables are integral companions to a fact table. The dimension tables contain the textual descriptors of the business.

Equipped with the above knowledge let's look at our sample relational model of a business:

In redesigning this relational model to a dimensional model, first thing you will have to do is to find the fact table. When looking for the fact table, as Kimball says, you will have to look for the numeric measurements tables, or the tables which count things, the tables which record transactions and the tables which are constantly changing and are very big with lots of rows. Another way to look for them is to find out the many-to-many relationship tables or intersection tables of the relational model. Those tables are the best candidates for the fact table. In our sample model above, best candidates are the ORDERS and ORDER_ITEMS tables as these are the tables which record transactions, constantly changing information and are the largest tables in the model. CUSTOMERS, SALES_CHANNEL and PRODUCTS tables are not measurement tables or tables with constant activity on them. They are more likely to be dimension tables rather than fact tables.

Nothing stops you from introducing new dimensions in a dimensional model, even dimensions which can not translate or do not exist in your relational models. See the TIME_DIMENSION table below which is a table of dates. If you find it appropriate to have a table just to record dates and date attributes, such as holidays, weeks, months, quarters, you can go ahead and create one. It is a standard practice for such dimensions to exist in data warehouses.

So as you can see later in the dimensional model, what happens when re-designing a relation model as a dimensional model is that the primary keys of the dimension tables become the foreign keys in the fact tables. That is, the fact table which is the measurement table is full of foreign keys coming from the dimension tables. Initially if you visualize such a table you might realize that the fact table is not in 3NF, as redundant data will exist in such a table and a non-prime attribute from such a table can depend functionally to another attribute and not directly to the primary key. In many cases the fact tables might not even have a primary key but might have composite primary keys. So redundancy is permitted and normalization laws of data are relaxed in fact tables.

All this for the sake of performance and the ability to analyze data and for a better representation of data to business people, as Kimball points out, Data Warehouses are to serve business people, departmental chiefs and CEOs and Data Warehouse Administration is somewhere between a DBA and an MBA.

In this case our relational model above could become a dimensional model, star schema, like the one below.

Here as you see the ORDERS and ORDER_ITEMS tables, the tables which record day to day measurements for the business are integrated into the SALES_FACTS table, which is the fact table of this dimensional model.

Now the next step after this design would be to find a way to do ETL (Extract, Transform, Load), that is, to extract the information from the OLTP system (relational model) periodically and to insert it in the data warehouse (dimensional model). Then your dimensional model would be ready to be used in your OLAP environment analytic workspace for reporting applications or with your Business Intelligence software for any kind of analytics and performance magic these systems can offer in combination with your dimensional model.

Resources: The Data Warehouse Toolkit - Ralph Kimball, Margy Ross

Have you ever heard industry jargon like Data Warehouse, Data Mart, Data Mining, OLAP and more. Did you ever wanted to know in one sentence what are these words all about? Then keep on reading I have put together some quick definitions from Oracle docs.

Data Warehouse: (An Oracle Docs Definition)

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but can include data from other sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources.

Data Mart: (An Oracle Docs Definition)

A data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, the data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources. Is the categorization of your datawarehouse data to a particular line of your business. For example BIG DATAWAREHOUSE DATA becomes PURCHASING DATA and SALES DATA and INVENTORY DATA. So a Data Mart is, loosely speaking, a piece of the big thing.

Data Mining: (An Oracle Docs Definition)

Too much data and not enough information — this is a problem facing many businesses and industries. Most businesses have an enormous amount of data, with a great deal of information hiding within it, but "hiding" is usually exactly what it is doing: So much data exists that it overwhelms traditional methods of data analysis.

Data mining provides a way to get at the information buried in the data. Data mining creates models to find hidden patterns in large, complex collections of data, patterns that sometimes elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of patterns, or the difficulty in performing the analysis.

OLAP: (An Oracle Docs Definition)

OLAP functionality is characterized by dynamic, multidimensional analysis of historical data, which supports activities such as the following:

Calculating across dimensions and through hierarchies
Analyzing trends
Drilling up and down through hierarchies
Rotating to change the dimensional orientation

OLAP tools can run against a multidimensional database or interact directly with a relational database.

Sources:

Oracle® Database Data Warehousing Guide 10g Release 2 (10.2)
http://download.oracle.com/docs/cd/B19306_01/server.102/b14223/concept.htm#sthref47

Oracle® Data Mining Concepts 10g Release 2 (10.2)
http://download.oracle.com/docs/cd/B19306_01/datamine.102/b14339/1intro.htm#i1023970

Pages

Saturday, 15 March 2008

Relational Model to Dimensional Model

Wednesday, 5 March 2008

Data Warehouse, Data Mart, Data Mining and OLAP Definitions