Database Optimisations with Indexing
I believe the first stage of DB optimization is to use the power of indexes. They can give remarkable results when applied correctly and thoughtfully. I will discuss how query planners analyze and run queries, and best practices for indexing, with examples. All discussions here are with respect to PostgreSQL.
The index (B-Tree) and the table (heap) are two different data structures that are physically maintained separately. The goal should always be to reduce heap fetches as far as possible while querying. To understand how the query planner executes a query, use this:
```sql
EXPLAIN ANALYZE SELECT id, name, grades FROM students WHERE grades > 8;
```
This will give you a detailed plan of how the query planner retrieved the results. It has a few nuances, which I'll cover below.
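For illustration, before any index exists on grades, the plan for the query above would look roughly like this (costs, row counts, and timings here are made-up numbers; yours will differ):

```markdown
Seq Scan on students  (cost=0.00..205.00 rows=4800 width=16) (actual time=0.015..2.310 rows=4750 loops=1)
  Filter: (grades > 8)
  Rows Removed by Filter: 5250
Planning Time: 0.110 ms
Execution Time: 2.750 ms
```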
Parallel Scan / Sequential Scan
This means a full table scan without using any indexes. It is the slowest access path for selective queries and should be avoided where possible. PostgreSQL can distribute a sequential scan across worker processes (shown as Parallel Seq Scan in the plan); not all databases support parallel scans.
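To check whether the planner is even capable of using an index for a given query, you can temporarily discourage sequential scans in your session. This is a debugging aid only, not something to enable in production:

```sql
SET enable_seqscan = off;  -- session-local, for experimentation only
EXPLAIN ANALYZE SELECT id FROM students WHERE grades > 8;
RESET enable_seqscan;
```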
Index Scan vs Index Only Scan
Index only scan: If a query requests only data that already exists in an index, a heap fetch is not required. This is known as an index-only scan. There are exceptions to it, which I will discuss later.
Index scan: If the database must fetch other fields from the heap after finding row references in the index, it is an index scan.
Let's consider a students table which stores grades.
Index on this table:
```sql
CREATE INDEX CONCURRENTLY student_grades ON students (id, grades);
```
This query will result in an index-only scan:
```sql
SELECT id FROM students WHERE grades = 8;
```
The below query will result in an index scan:
```sql
SELECT id, name, grades FROM students WHERE grades = 8;
```
The performance gain in the 1st query is substantial compared to the 2nd. You can make the 2nd query do an index-only scan by changing the index to a covering index (one with non-key INCLUDE columns).
```sql
CREATE INDEX CONCURRENTLY student_grades ON students (id, grades) INCLUDE (name);
```
Here, name will also be stored in the index. The INCLUDE clause was introduced in PostgreSQL 11. Note that querying on the name field will not use this index.
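To sketch the difference (assuming the covering index above is in place and the visibility map is up to date):

```sql
-- Can now be an index-only scan: id, grades, and name all live in the index
SELECT id, name, grades FROM students WHERE grades = 8;

-- Will NOT use student_grades: name is a non-key INCLUDE column,
-- so the index cannot be searched by it
SELECT id FROM students WHERE name = 'shashank';
```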
Bitmap Scan
Based on the estimated number of matching rows, the DB can do an index scan or a bitmap scan, the latter being used when a query matches many rows. A bitmap scan first collects matching row locations from the index into a bitmap, then fetches the heap pages in physical order.
For example, assume there are indexes on grades and name, namely grade_idx and name_idx respectively.
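Assuming these indexes are defined as follows (hypothetical definitions matching the names above):

```sql
CREATE INDEX CONCURRENTLY grade_idx ON students (grades);
CREATE INDEX CONCURRENTLY name_idx ON students (name);
```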
Consider a few queries:
```sql
-- Query 1:
SELECT id FROM students WHERE grades = 7;
-- Query 2:
SELECT id FROM students WHERE grades = 7 AND name = 'shashank';
```
EXPLAIN ANALYZE of the above queries will give results something like this:
```markdown
Query 1:
Bitmap Heap Scan
  -> Bitmap Index Scan (grade_idx)

Query 2:
Bitmap Heap Scan
  -> BitmapAnd
      -> Bitmap Index Scan (grade_idx)
      -> Bitmap Index Scan (name_idx)
```
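To dig deeper into a bitmap scan, you can add the BUFFERS option to EXPLAIN. If the plan reports lossy heap blocks, the bitmap did not fit in work_mem and degraded to tracking whole pages instead of individual rows:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM students WHERE grades = 7 AND name = 'shashank';
```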
How DB optimizers decide to use Indexes
Indexes are not cheap. If an index would return a large number of rows, the database will perform a table scan instead. The database maintains approximate statistics of values across tables and indexes to analyze and determine which query plan to use.
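You can inspect the planner's view of a table yourself; pg_class stores the estimated row and page counts, and the pg_stats view holds per-column distributions:

```sql
SELECT relname, reltuples, relpages FROM pg_class WHERE relname = 'students';
```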
- PostgreSQL follows the MVCC (Multi-Version Concurrency Control) technique for dealing with concurrent read/write transactions. This means that when a row is updated, there can be multiple versions of that tuple, with old ones marked as dead. These dead tuples are removed through a process called vacuuming.
- Imagine you have a table with 100,000 rows and just inserted a massive amount of data, say 5 million rows. The database statistics won't be updated immediately to reflect this new information. So, if you run a query on that table right away, the planner will assume it still has only 100,000 rows and may choose a poor plan, such as a full table scan.
It is best practice to run VACUUM ANALYZE on a table after inserting a large amount of data. VACUUM ANALYZE updates table statistics, which helps the DB query planner make optimized decisions.
- An index-only scan checks the visibility map to ensure the values returned from the index are visible to the current transaction. However, the visibility map may not be up to date, forcing the database to make the heap trip anyway, which increases the number of heap fetches and hurts performance. This can be improved with autovacuum settings or by running VACUUM ANALYZE after a large data insert, as explained above.
- Always check Rows Removed by Filter and Heap Fetches in the query plan. If the number of rows removed is high, the index is matching far more rows than the query needs and should be narrowed down. One solution could be a partial index tailored to the query. A large number of heap fetches is explained in the previous point.
- When creating indexes, be creative and use various combinations based on the query, so that each index points to fewer, more specific rows.
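As a sketch of the partial-index idea mentioned above (top_students_idx is a hypothetical name), an index restricted by a predicate stays small and only covers the rows a specific query cares about:

```sql
CREATE INDEX CONCURRENTLY top_students_idx ON students (grades) WHERE grades > 8;

-- Can use top_students_idx, since its predicate matches the query's filter
SELECT id FROM students WHERE grades > 8;
```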