I had a lot of discussions in the past about databases and indices. The type of the database is not important, as the general design of indices are always the same. The following model can be applied to relational databases, document stores, big table databases, columnar stores and also to graph databases.

The following model is used to explain what indices are and why they work so well and especially why there are certain restrictions to them:

- Multiple single value indices do not help efficiently in where clauses with AND conditions.
- For combined indices, the where clauses need to have certain AND clauses to get fully utilized. Missing some of the clauses may lead into not using the index.
- The order of values in combined indices are important, when time series or continuous data is involved. It is shown what the correct order is to get good performance.

These three points, I have to discuss over and over again, as most people are not database specialists and the inner works of databases are not well known. The following simple model shall provide some insights.

# Storing Data in General

Storing data itself is normally quite simple, just put the stuff to disk. I know, it is not meaningful, but as starting point, let’s assume it for now. It is also not important whether the data is put onto disk, into a storage like Hadoop or something else.

For the different databases we can assume we have basic data structures like

- Relational Database:
*Row = Map<ColumnName, Value>* - Big Table Database:
*ColumnGroup = Map<Key, Value>* - Document Store:
*Document = <any data structure like JSON>* - Graph Database:
*Vertex, Edge = <any structure>*

Important for now is, we have different kinds of data structures, but it is not important for now, how they look like. We just put them onto the disk as a simple binary file.

If we do this, there is only one thing missing: We can’t distinguish them, that’s why we define a data store with *Object* *ID*s:

*DataStore = Tuple<ID, Object>*

We have an object *ID* which is unique defining a certain *Object*. We can read now the store *Object* by *Object* and can find the correct one by comparing the *ID*.

# Primary Index

With the store above, we can store any data and can retrieve it via the unique *ID*. The issue is, to find the related *Object,* we need to read the whole store. This leads to a time characteristics in Big O Notation with a total number of *N* samples of:

*t(N)=O(N)*

This is a poor characteristics. This can be improved dramatically by adding a primary index. This primary index maps every *ID* to the storage position of the *Object*:

*PrimaryIndex = Map<ID, DataFileAndPosition>*

Do not mistake this primary index with a primary key from a relational database. Here we talk about a technical index to retrieve data from disk efficiently. The primary key in a relational database is logical construct on the database content for data organization and modelling.

As soon as we have a primary index in place, the time characteristics changes to:

*t(N) = O(log(N))*

We assume here, that the index is ordered by the *ID*s and searchable with a binary search.

The speedup *S* is:

*S(N) = O(N) / O(log(N)) ~ N / log(N)*

For *N = 1000* is around *100* and for* N = 1,000,000* it is around *50,000*.

For a simple key values store, this would be good enough. But, what do we do if we look for a specific *Object* without knowing its *ID*? We still have to search the whole data store.

# Single Value Secondary Indices

To find objects with certain attributes, values or properties (the name depends on the type of database), indices are used. For this to work, we need to have the possibility to extract this *Value*s out of the stored *Object*:

*Value = Object(Name)*

The Name in a relational database is for instance the column name and for a graph database a property name. To find the value, we need to have an index like:

*Index = Map<Value, Set<ID>>*

What about the time characteristics? It was

*t(N) = O(N)*

before. Now, we can do an index lookup first, which has a time characteristics like the primary index:

*t(N) = O(log(N))*

This is pretty good and with the additional time characteristics of the primary index, we get an overall time characteristic of:

*t(N) = O(log(N)) + O(log(N))= 2 * O(log(N)) = O(log(N))
*

That’s great, isn’t it? Again, we assume, we can access the *Value* with a binary search.

# Definition of Cardinality and Selectivity

For a better understanding of the next chapters and for the sake of simplicity, we need to explain two terms: Cardinality and Selectivity.

Cardinality *C*, is the number of unique values. That means, *C* <= *N* for a single attribute. In case the Cardinality is known (an index should keep track of it),the number of returned results can be estimated as:

*NumerOfResults = N / C*

The simple idea is, that the number of each unique value is equal. The factor 1 / C can be defined as *Selectivity* like:

*Selectivity = 1 / C*

# Single value indices in AND clauses

With the explanations above in mind, we can explain why multiple single value indices cannot be combined. Let’s assume a SQL query like

1 |
SELECT * FROM table WHERE column1=value1 AND column2 = value2 |

Additionally, we assume to have single value indices on both columns. What can the database do now?

The first implementation would normally find that both columns have a single value index and uses the index with the better *Selectivity* (which filters most and returns least number of results). It takes the *ID*s out of it, reads the* Object*s and filters the results again with the second conditions. In this case we have the same behavior as with one index only.

A second implementation could check both available indices and takes the *ID* sets out of both. From both sets the intersection is calculated and with the result the *Object*s are read. Depending on the amount of effected IDs and the overall number of Objects, this might be less efficient, than the first approach.

A real solution is only a combined index.

# Combined Secondary Indices

A combined index is just an extension of a single value index with multiple values:

*CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, … **Map<AttributeNValue, Set<ID>>…>>*

# Combined Indices and Where Clause

With the definition of the combined index above, we see why it is important to add the correct where clause to a query. Let us define a combined index over three attributes:

*CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, **Map<AttributeNValue, Set<ID>>>>*

As long as the AND clauses in the query are added from attribute1 to attribute3 in the correct order (the order in the query statement itself should not matter), the index can be utilized. For instance, the quere

1 |
SELECT * FROM table WHERE attribute1=value1 AND attrbute2=value2 AND attribute3=value3 |

is optimal. It runs the query directly on the index and finds all relevant rows.

The same holds true for the queries:

1 2 |
SELECT * FROM table WHERE attribute1=value1 AND attrbute2=value2 SELECT * FROM table WHERE attribute1=value1 |

The only difference is, that only the first two or the first map can used to find the set. The remaining maps are scanned. All relevant rows are found with the index and the *Object*s can be read.

But, what about the following queries?

1 2 3 |
SELECT * FROM table WHERE attribute2=value2 AND attrbute3=value3 SELECT * FROM table WHERE attribute2=value2 SELECT * FROM table WHERE attribute3=value3 |

The index cannot be used efficiently, because the first Map needs to be scanned completely. Some benefit may be gained from the other maps, but it is not said, that the database can do it.

That’s why it is important to put some thoughts into the correct definition is combined indices. If the last queries are dominant, the order of the attributes should be changed. Maybe, an additional index might make sense. It heavily depends on the use cases expected.

# Time Series and Continuous Data

For time series data and continuous data, a special look is needed. For this kind of data, a check for equality makes sense only in rare cases, but creating an index make sense anyways.

Continuous data (also time series) is queried with inequality to get defined ranges of the data. In case an index was created, a range scan can take place. With the same binary search algorithm the start of the range can be found and afterwards all values before the end are used to retrieve the *ID*s. We assume here again, that the *Value*s are ordered in the index and a binary search can take place.

# Continuous Data in Combined Secondary Indices

Let’s assume we have a table with two attributes Time and Name. What is the best order for the combined index? The query could look like:

1 |
SELECT * FROM table WHERE name = 'name' AND time >= 2017-05-01 AND time <= 2017-06-30 |

For this example let’s assume the selectivity of name is 10% and also the time range is 10%.

If we create the index in the order

*Index = Map<NameValue, Map<TimeValue, Set<ID>>>*

the database can narrow the search with the name already to 10% of the index and can perform a range scan over the second map. This results into an index read of 1% of the size only.

If we create the index in the other order

*Index = Map<TimeValue, Map<NameValue, Set<ID>>>*

the database can perform a range scan over the first map. But, because time is a continuous value, the name will be most likely unique and for each time value the name needs to be checked again. This leads into a less efficient retrieval of data and a read of 10% of the index.

In case continuous data is to be put into an index, it is to be expected that it is most efficient if it is put at last position.

# Short Summary

The explanations above are maybe not the most accurate, but for me they served quite well to understand some effects I experienced.

Another article about partitioned tables is in preparation. With these some other effects are to be considered.

## One thought on “Simple Model for Understanding Database Indices”