By Rahul Reddy
Each time we face a production issue, or design something which leads to lowering our BCR (booking conversion rate), or do just about anything which doesn’t have a positive impact on our ecosystem, we do two things for sure:
One 👉 We leave no stone unturned to fix the issue
Two 👉 We write about it so no one else does the same mistakes as us
Recently, we faced one such blog-triggering issue. The use case was simple:
Upon receiving a request, schedule/re-schedule a task in the scheduler and execute the task at the scheduled time. To achieve this, we were using Quartz, an open source library. It was running well until one dreadful night. 😣
We saw response times for the service going up. Initially, we thought the database box was the bottleneck, so we upgraded it. Nothing changed. 🙄
After digging more, we finally found the issue — a particular query used by the Quartz library was taking about 1 second to execute.
You thought it was all sunshine and rainbows? ☀️🌈
The problem here was that the indices were already present for that query. To add to that, when we tried to reproduce the issue by running the query from Postgres CLI, it was using the index and we were back to square one. But this post is not about how we debugged the issue.
Our concern was the fact that Postgres was not using index, which led us to deep dive into cases where this could be happening.
Unsettlingly, we found quite a few cases. 😑
To give a bird’s-eye view: Postgres will not use Index whenever the query planner thinks (Hope it is not actually “thinking”… Remember Skynet? 🤖). The query complexity of using Index scan is more than that of doing a simple sequential scan. Sometimes, it’s actually beneficial to do a sequence scan over Index scan, but this issue arrises when Postgres decides to do a sequential scan at the wrong times.
So without further ado, let’s get to the cases.
Case 1: Table is very small
When the table size is small, Postgres generally goes with the sequential scan as it will be much more economical than an index scan. To understand this, you need to understand how indices work. The basic reasoning is, if we use indexing, we need to first access the blocks corresponding to the index and then the actual blocks of the table. But, with sequence scan, we just need to access the blocks of the table.
Postgres is right to not use index in this case, since index scan takes more time. So, this should not cause any issues in production.
But knowing about this would prevent wasting time, debugging why index is not being used in dev/staging environment. (Still remember doing this on my first day 😭)
Case 2: Query is on one of the fields in multi index
Though a multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, the index is most efficient when there are constraints on the leading (leftmost) columns.
For example: If there’s an index created on (a.b.c)
and have query a=5 and b=6
will use the index, but the query on b=6 and c=7
will not use the index. Think of the index as a sorted list of lists. In our example the entire table is sorted on a
first and all the rows with same values for a
are sorted on b
, and similarly then on c
. So we just make a query on b
or c
, then the entire index has to be accessed. This is more inefficient than doing a direct sequential scan on the table.
Case 3: Query is returning almost all of the table
This is an interesting case. If the query is such that the result would include most rows from the table, then Postgres would prefer a sequence scan to index scan. This is the result of reducing the number of block accesses same as above.
How would Postgres know the number of rows a query can return without actually running the query?
Postgres achieves this by continuously analysing and storing the statistics about the rows of the table in pg_statistic
table.
Note: This is not an actual measure, rather a close guess.
An example of this would be when you’re selecting for the only value of the column. That is, if a column, say a
, has only one value of true
and your query is for select * where a=true
This would mean that all the rows of the table are to be returned, hence using a sequential scan on table is beneficial in this case.
Case 4: When column data type and query data type don’t match
This happens when the application is setting wrong datatype for one of the columns in the query. Not that by default Postgres can do some datatype conversions as referenced by pg_cast
table. So even if the datatype doesn’t match, the query goes through without any errors in these case.
With pg_cast
there are two cases. One, where the data in query can be converted to match the datatype of the table (say, we give bigint
in query and the column is of numeric
type). In this case every thing works as expected, Postgres will do the type casting and use the index.
The problem arrises with the other case, where the values from the table are to be converted to match the datatype of the query (i.e., we give numeric
in query and the column is of bigint
type). As Postgres can only cast a bigint
to numeric
, the only option it has is to convert every row in the table to numeric
and then compare. Thus in this case, the index won’t be used.
If you’re wondering, this is exactly what happened in our case. To add to it, this is quite difficult to debug. If we do an explain analyse on the query in db directly, Postgres will use the index, and in normal queries we don’t specify the datatypes of the columns.
Case 5: When the query contains LIMIT
This is a tricky one. Sometimes, a query like select * from abc where col_1 = 5 limit 1;
doesn’t use the index while select * from abc where col_1 = 5;
does. This is happens when the pg_stats
on that table suggests that the col_1
table has enough random values in the table such that, the total cost of fetching required number if sequential blocks from db, would be less than that of fetching the index blocks first and then the corresponding data blocks.
Note: With the default settings, the planner considers the cost of a random page read (required in an index scan) four times the cost of a sequential page read (required in sequentially scanning the table from beginning).
The worse case for this is when the sequential scan has to be done on the entire table as the query results in 0 rows.
Case 6: Index is not present
This might look straightforward, but does happen in practice. 🤷♂️
Know of any more cases I might’ve have missed? Leave a comment.
Click here to read more stories on how we build our #SuperApp. 💚
Click below to build it with us.