Thursday 10 May 2018

How much data do you have?

Sometimes you need to ask this simplest of questions about your database to find out how much data it really holds.

Databases store loads of auxiliary data such as indexes, aggregate tables, materialized views and other structures in which the original data is repeated. Databases often repeat data in these structures to give better performance to the applications and reports they serve. The duplicate storage of data, in this case, is legitimate. It is there for a reason.

But should this repetition be measured and included in the database 'data' size?

Probably yes. After all, it is data, right?

To make things worse, many databases, after many updates and deletes, build up white space in their storage layer over time. This white space is unused, fragmented free space which cannot be re-used by new data entries. This is bad. Often it ends up being scanned unnecessarily in full table scan operations, eating up your computing resources. But the most unfortunate fact is that it appears as data in your database size measurements when it is not!

It is just unused white space, nothing but costly void. Very bad.

There are mechanisms in databases which, when enabled, automatically remedy the white space by re-organising the storage of data in the background, saving you space, time and money. Here is a link which talks about such mechanisms at length.
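
As one illustration of reclaiming white space by hand, here is a minimal sketch assuming Oracle and a table stored in a tablespace with automatic segment space management. The table name sales_history is hypothetical, and this manual segment shrink is not necessarily the background mechanism the link describes.

```sql
-- Assumption: sales_history is a hypothetical heap table in an ASSM tablespace.
-- Shrinking relocates rows, so row movement has to be enabled first.
ALTER TABLE sales_history ENABLE ROW MOVEMENT;

-- Compact the rows and release the freed space back to the tablespace.
-- CASCADE also shrinks dependent segments such as the table's indexes.
ALTER TABLE sales_history SHRINK SPACE CASCADE;
```

Because rows are physically moved, an operation like this is usually better tried on a test copy before it is scheduled in a maintenance window.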

One should be diligent and suspicious when measuring database sizes. Loads of data is repeated, and some of the measured size is just blank void due to fragmentation and unused white space. You will be surprised how much white space exists in your database if you do not reclaim it during maintenance. If you are curious to find out, there are ways to measure both the white space and the real data.

So, how do we measure?
Below is a database size measuring SQL script which can be used with Oracle to show the data (excluding indexes) in tables and partitions. It also tries to estimate the real storage excluding the white space (the actual_gb column) by multiplying the number of rows in each table by its average row size. Replace '<YOURSCHEMA>' in the code with the schema you wish to measure. The estimate relies on optimizer statistics, so make sure they have been gathered before you measure.
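
If the statistics are missing or stale, num_rows and avg_row_len will be empty or out of date and the estimate will be off. A minimal sketch of refreshing them for the whole schema, assuming you have the privileges to call DBMS_STATS on it and that default sampling is acceptable:

```sql
-- Gather optimizer statistics for every table in the schema.
-- This populates num_rows and avg_row_len in dba_tables.
BEGIN
  DBMS_STATS.GATHER_SCHEMA_STATS(ownname => '<YOURSCHEMA>');
END;
/
```

On a large schema this can take a while, so it may be worth running outside busy hours.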

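SELECT SUM(actual_gb)  AS actual,
       SUM(segment_gb) AS allocated
FROM   (
         -- One row per table: estimated row data versus allocated segment space
         SELECT   t.owner,
                  t.table_name,
                  t.avg_row_len * t.num_rows / 1024 / 1024 / 1024 AS actual_gb,
                  SUM(s.bytes) / 1024 / 1024 / 1024               AS segment_gb
         FROM     dba_segments s,
                  dba_tables   t
         WHERE    s.owner = '<YOURSCHEMA>'
         AND      t.owner = s.owner   -- avoid matching same-named tables in other schemas
         AND      t.table_name = s.segment_name
         AND      s.segment_type IN ('TABLE',
                                     'TABLE PARTITION',
                                     'TABLE SUBPARTITION')
         -- Group per table so that tables of equal estimated size are not merged
         GROUP BY t.owner,
                  t.table_name,
                  t.avg_row_len * t.num_rows / 1024 / 1024 / 1024 );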

    ACTUAL  ALLOCATED
---------- ----------
   18.9987    67.3823
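
In this sample output, roughly 19 GB of the 67 GB allocated is estimated row data, which is about 28%. Not all of the remaining gap is white space, since the estimate ignores block overhead and space deliberately reserved for future updates, but it is a useful first signal. To see where the gap comes from, the same join can be listed per table. This is only a sketch under the same assumptions as the script above (fresh statistics and access to the DBA views); the ordering by largest gap is my own choice for illustration.

```sql
-- Per-table view of estimated row data versus allocated space,
-- with the largest gap (candidate white space) listed first.
SELECT   t.table_name,
         ROUND(t.avg_row_len * t.num_rows / 1024 / 1024 / 1024, 4) AS actual_gb,
         ROUND(SUM(s.bytes) / 1024 / 1024 / 1024, 4)               AS segment_gb
FROM     dba_segments s
JOIN     dba_tables   t
         ON  t.owner      = s.owner
         AND t.table_name = s.segment_name
WHERE    s.owner = '<YOURSCHEMA>'
AND      s.segment_type IN ('TABLE', 'TABLE PARTITION', 'TABLE SUBPARTITION')
GROUP BY t.table_name, t.avg_row_len, t.num_rows
-- Tables without statistics have NULL estimates and sort to the end
ORDER BY SUM(s.bytes) - t.avg_row_len * t.num_rows DESC NULLS LAST;
```

Tables near the top of this list are the ones most worth a closer look during maintenance.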