I wrote an article lately about a simple model to understand indices in databases. It may help to read the other article first.
In a data processing project I worked on, the data is put into an Oracle database. The data is queried differently based on questions to be answered. Even worth, the queries are not known upfront, so a quick enough access is to be provided for all kind of usage scenarios.
During the design phase some decisions were made based on the information at the time. As the main point of view onto the data was time series based, the decision was made to use partitioned tables with local indices based on the time stamps. In the aftermath, that was a mistake, which was corrected.
Let’s assume, we want to store two years of data and the partition key is based on the year and week. For instance week 50 in 2017 has the partition key 1750. Let’s assume we store data now for 2016 and 2017, we need about 104 partitions with partition keys 1601, 1602, …, 1652, 1701, 1702, …1751 and 1752. For the sake of simplicity for the next calculations, we assume we have in total 100 partitions. Each partition has local indices on different fields.
What is wrong with this scenario now? As long as we can directly access the table with partition keys, there is nothing wrong. We can create selects like:
SELECT * FROM timeserieas WHERE partitionkey = 1635 AND name = 'test';
Oracle knows in which partition to look at and which index to use to find the name. But, what happens in case we want to look up the name for the whole time? The query could be something like:
SELECT * FROM timeserieas WHERE name = 'test';
In this case we get terribly slow. But why?
Let’s calculate the time characteristics in Big O Notation:
Let’s assume we have a total number of N samples in the whole time series. In trivial, unpartitioned case, without any additional index, we have the time characteristics:
t(N) = O(N)
If we put an index onto the table which can be utilized, we get a much better time characteristics of:
t(N) = O(log2(N))
That’s actually, why indices are used. The speedup S is :
S(N) = O(N) / O(log2(N)) ~ N / log2(N)
If N is only 1, there is not speedup at all, but for let’s say 1000 samples, the speedup is already about 100. That’s great.
But, what happens in case of partitioning? Let’s assume we have in total P partitions. If we can use the partition key only without any additional index, Oracle just accesses the one partition and we get a behavior of
t(N) = O(N/P)
with a total speedup of
S(N) = O(N) / O (N/P) ~ P
In our example above, this is a speed up of a factor of 100 resulting in a 1/100 of query time, because we only need to access 1/100th of the data. Not bad at all.
The same characteristics we can calculate with an additional index. In this case we get:
t(N) = O(log(N/P))
What happens now in case we search for name over the whole time series? Because, we need to read the whole table again over all partitions with a local index only, this comes down to:
t(N) = O(P * log(N/P))
The problem is, we have to look P times into each partition and can use the local index only. This is P time utilizing an index. Compared with an index on unpartitioned table or with a global index, the speed up is is much smaller that one. In our example above with 100 partitions, this leads to:
S(N) = O(log(N)) / O(100*log( N/100) ~ 1/ 100
The reason is that the partition number increases linearly, but the gain of the performance due to the index is increasing much slower.
This calculations show, that in cases of random access on partitioned tables, global indices are recommend to not slow down the queries.