Database Indexing In Depth
Every software engineer working with any Database should have used indexing at some time. It's crucial to understand how indexing works. Indexing done correctly can give substantial performance gains with minimal effort. In this article, I will discuss indexing in relational databases, particularly PostgreSQL.
Indexes are data structures that allow faster retrieval of rows from a table. They are physically stored separately from the table data objects. An index data structure contains a copy of specific data in the table, organized specifically to facilitate faster searches. These Indexes can be stored in memory or on disk.
Primary v/s Secondary Indexes
Primary key, also called clustered indexes, helps organize the table so that order is maintained by this key at the write time. MySQL has to have a primary key, while all other indexes are secondary. Oracle has the concept of Index Organised Table (IOT). This means primary key and non-key columns are stored in the same data structure. It speeds up retrieval for exact matches or fetching a sequential range of data.
Secondary Indexes are the ones that are created after data is stored in the table and they are stored separately from the main data area (also called heap in PostgreSQL).
PostgreSQL only supports secondary indexes, which means that data retrieval occurs in two stages: first, row IDs are obtained from the index, and second, other columns' data are fetched from the heap.
PostgreSQL Indexes
- B-tree index: This is PostgreSQL's most commonly used index type. It is based on the B-tree data structure and can be used to index data of any data type.
- Hash index: This index type uses a hash table to index the data. It is helpful for fast equality comparisons but not for range searches or sorting.
- GiST index: This index type is "Generalized Search Tree." It allows for more flexible indexing, including support for indexing data types that do not have a B-tree implementation.
- SP-GiST index
- GIN index: It stands for "Generalized Inverted Index." This should be used for multi-valued data, such as arrays or JSON documents.
In PostgreSQL, you can create an index like this:
sql
CREATE INDEX CONCURRENTLY index_name ON table_name (column_name);
This will create a B-tree index on the specified column of the table.
You can specify a different index type by using the USING clause. For example:
sql
CREATE INDEX CONCURRENTLY index_name ON table_name USING hash (column_name);
This will create a hash index on the specified column.
Note that I used the keyword CONCURRENTLY when creating the index. This tells the database to create the index in the background without affecting active reads/writes on the table.
Covering Indexes
Covering indexes, also called non-key indexes, are ones when you include certain fields in the index that are required in the query but not a part of an index. This prevents an extra call to the table heap from fetching those fields.
Let's understand this with an example. Consider a student_grades table with columns id, name, and grades having the following index:
sql
CREATE INDEX CONCURRENTLY stu_grade_idx ON student_grades (grades) INCLUDE (name);
Query:
sql
SELECT name, grades from student_grades where grades > 8;
In this query, the DB engine retrieves the results from the index instead of visiting the heap, making it an index-only scan. Note that querying on name field will not use this index. The index is on grades, name field is just included to avoid extra heap fetch.
Note: This must be done carefully, or it can significantly increase index size.
There are exceptions to Index-only scan, the DB does an extra heap fetch. I have explained this here.
Composite Index or Multi-Column Index
As the name suggests, the multi-column index is created on multiple fields in a table and should be used on fields included in the where clause of a query. Note that order of the fields in index while creation is important. DB will use index if left side fields are included in query.
For example,
sql
CREATE INDEX CONCURRENTLY stu_grade_idx on student_grades (grades, name);
Consider the following queries:
sql
// Query 1
SELECT * from student_grades where grades > 8;
// Query 2
SELECT * from student_grades where grades > 8 and name = 'shashank';
// Query 3
SELECT * from student_grades where name = 'shashank';
The above index will be used in Query 1 & 2 while not in Query 3.
Partial Index
When you create an index with a where clause on a table, they are called Partial indexes. This index can be useful if you frequently query a large table but only need a subset of the data, based on filter clauses.
For example, if you want to query all students with grades (not null), you can create a partial index on this table.
sql
CREATE INDEX CONCURRENTLY stu_not_null_grade_idx on student_grades (grades, name) WHERE grades <> NULL;
Unique Index
You can create a unique index on a column or combination of columns. This is a useful technique to maintain data integrity in columns by creating uniqueness on a combination of fields that act as a database constraint.
Let's say, from our above example, we want to maintain a unique combination of name and standard, where each can be duplicated individually but not together.
sql
CREATE UNIQUE INDEX uni_student_idx on student_grades (name, standard);
Note that lookups on unique indexes are much faster.
Conclusion
It's important to note that while indexes can improve the performance of specific queries, they can also add overhead to writes and increase the size of the indexes. Large indexes may not fit in memory and can lead to increased IOPS on the DB system. Thus, carefully select which columns to index based on whether the benefits outweigh the costs.