About pagination

What is pagination? What is it good for? Let’s imagine we have a database that holds data about users, and we have a lot of users. If a front-end application wants to display them on a website, it has to fetch them from the database. In a good setup, the front end does not reach the database directly, but instead talks to a backend server (e.g. a REST API backend) that has a connection to the database and fetches the data.

Now the question: when we want to list users on the front end, do we have to fetch all of the data from the database? Let’s say we have 100k users in the database. If the front end fetches all of them, it can cause several problems:

  • Increased network traffic between backend and frontend, which can result in slower page loads
  • Increased memory usage on the backend, because the database has to fetch all of the data

So it puts extra load on the backend while making the front end slower to load, and all without real benefit, because no user wants to read such a huge amount of data at once.

The solution to this problem is called pagination. With this technique, instead of serving all the data at once, the backend serves slices of it, and further data can be fetched later. For example, as the user scrolls down and reaches the end of the fetched slice, the front end can request the next slice.
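The idea can be sketched in a few lines of Python (an illustration with made-up in-memory data, not tied to any framework): the backend returns one slice per request, plus a hint about where the next slice starts.

```python
# Minimal illustration of the pagination idea: the backend returns one
# slice of the data per request, plus the position of the next slice.
users = [f"user-{i}" for i in range(100)]  # stand-in for the database

def get_page(offset, limit=10):
    page = users[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(users) else None
    return page, next_offset

page, next_offset = get_page(0)
print(page[0], page[-1], next_offset)  # first slice of 10; the next slice starts at 10
```

When `next_offset` comes back as `None`, the front end knows there is nothing more to fetch.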

But how can pagination be implemented?

Naive solution

Well, nobody knows everything at the start line. I have made mistakes in the past (and still do), but I try to learn from them, and this is one of them. When I first implemented pagination many years ago, I did something like:

SELECT * FROM users OFFSET 100 LIMIT 10;

What is the problem here (aside from the SELECT *, which we normally avoid; it is just an example)? When I first tried it, the database was small, and I did not feel that this query had any major problem. But as the database grew, the queries became slower, and the major problem showed itself.
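To get a feeling for why this hurts, here is a toy Python model (my own simplification, not what PostgreSQL literally does internally): an offset means the scan still has to walk past every skipped row before the page can be returned, so the work grows linearly with the offset value.

```python
# Toy model: OFFSET must walk past all skipped rows before it can return
# the page, so the work grows linearly with the offset value.
rows = list(range(1_000_000))

def page_with_offset(offset, limit=10):
    touched = 0
    page = []
    for row in rows:           # the scan starts from the beginning every time
        touched += 1
        if touched > offset:
            page.append(row)
            if len(page) == limit:
                break
    return page, touched

_, touched_small = page_with_offset(1_000)
_, touched_big = page_with_offset(100_000)
print(touched_small, touched_big)  # 1010 vs 100010 rows touched for the same 10-row page
```

Note that the touched-row counts match the `rows=1010` and `rows=100010` figures reported by the EXPLAIN outputs below.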

I ran an EXPLAIN statement to check it.

=> EXPLAIN (ANALYZE, TIMING) SELECT * FROM users OFFSET 1000 LIMIT 10;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=31.96..32.28 rows=10 width=141) (actual time=0.461..0.464 rows=10 loops=1)
   ->  Seq Scan on users  (cost=0.00..31964.00 rows=1000000 width=141) (actual time=0.013..0.388 rows=1010 loops=1)
 Planning Time: 0.089 ms
 Execution Time: 0.484 ms
(4 rows)

What do we see? The analysis says that the actual execution time is between 0.461 and 0.464 ms. It also shows that the query does not use any index for reading but does a sequential scan on the table. Although it is still very fast (under 1 ms), the sequential scan and the lack of index usage foreshadow an issue with higher offsets.

To validate this, I ran the query with an offset of 100_000. The execution time increased from ~0.5 ms to ~26 ms. This is a significant increase!

=> EXPLAIN (ANALYZE, TIMING) SELECT * FROM users OFFSET 100000 LIMIT 10;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3196.40..3196.72 rows=10 width=141) (actual time=26.779..26.786 rows=10 loops=1)
   ->  Seq Scan on users  (cost=0.00..31964.00 rows=1000000 width=141) (actual time=0.014..19.625 rows=100010 loops=1)
 Planning Time: 0.087 ms
 Execution Time: 26.808 ms
(4 rows)

Better solution

I don’t claim this is the best solution, because the best solution really depends on the actual use case. In my case, about 90% of the database workload is reads, so I focus on read performance (and therefore only test SELECT). Besides, who knows, maybe I will learn something in the future that makes this solution no longer the best. So I only say this is a better approach (based on the numbers).

What has changed? Instead of an offset, I use a WHERE clause and filter on the id (which is the primary key, so an indexed column). The query looks like:

SELECT * FROM users WHERE id > 100 LIMIT 10;

And let’s see the EXPLAIN result for the higher offset, but with the WHERE clause.

=> EXPLAIN (ANALYZE, TIMING) SELECT * FROM users WHERE id > 100000 LIMIT 10;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.38 rows=10 width=141) (actual time=0.015..0.020 rows=10 loops=1)
   ->  Seq Scan on users  (cost=0.00..34464.00 rows=901068 width=141) (actual time=0.014..0.017 rows=10 loops=1)
         Filter: (id > 100000)
 Planning Time: 0.114 ms
 Execution Time: 0.041 ms
(5 rows)

It can be clearly seen that the planner still chose a sequential scan, but unlike OFFSET, the id > 100000 filter is applied inside the scan node itself: skipped rows are discarded cheaply during the scan instead of being passed up and counted, and the scan stops as soon as the 10 rows of the LIMIT are found (actual rows=10 instead of rows=100010). It is significantly faster than the offset approach, even for a high offset. Its execution time is 0.041 ms; compared with the 26.808 ms of the offset approach, that is more than 600 times faster!
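In application code, the front end (or the API client) keeps the last id it has seen and sends it back as a cursor for the next request. A minimal Python sketch of this loop, with the database call faked by an in-memory list:

```python
# Keyset (a.k.a. cursor) pagination sketch: instead of "give me page N",
# the client asks "give me rows after id X", which an index answers cheaply.
rows = [{"id": i, "name": f"user-{i}"} for i in range(1, 101)]

def fetch_after(last_id, limit=10):
    # stands in for: SELECT * FROM users WHERE id > %s ORDER BY id LIMIT %s
    return [r for r in rows if r["id"] > last_id][:limit]

last_id = 0
pages = []
while True:
    page = fetch_after(last_id)
    if not page:
        break
    pages.append(page)
    last_id = page[-1]["id"]   # the cursor for the next request

print(len(pages))  # 10 pages of 10 users each
```

In the real SQL you would normally also add an ORDER BY id, so the slice order is guaranteed to be deterministic.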

The boring numbers

The engineer in me says that one measurement does not count. Logically, the result makes a lot of sense, but I am curious how it scales, and I also want to analyze the memory footprint. I generated a database with 1 million rows of random data.

The plan for data collection

My plan is to take 50 measurements at every 50_000th offset between 0 and 999_999. Taking 50 samples per point reduces the noise, as I saw in some test measurements. The collector script looks like:

#!/bin/bash

for i in {0..999999..50000}; do
    for j in {0..50}; do
        psql -U test -d dummy -A -t -c "explain (ANALYZE, BUFFERS, TIMING, FORMAT JSON) select * from users where id > $i limit 10;" >/tmp/measure_id_${i}_${j}.json
    done
    echo "Done with $i for id"
done

for i in {0..999999..50000}; do
    for j in {0..50}; do
        psql -U test -d dummy -A -t -c "explain (ANALYZE, BUFFERS, TIMING, FORMAT JSON) select * from users offset $i limit 10;" >/tmp/measure_offset_${i}_${j}.json
    done
    echo "Done with $i for offset"
done

This script generates 4080 files for each query type. A sample output of the above command looks like this:

[
  {
    "Plan": {
      "Node Type": "Limit",
      "Parallel Aware": false,
      "Async Capable": false,
      "Startup Cost": 30365.8,
      "Total Cost": 30366.12,
      "Plan Rows": 10,
      "Plan Width": 141,
      "Actual Startup Time": 120.915,
      "Actual Total Time": 120.917,
      "Actual Rows": 10,
      "Actual Loops": 1,
      "Shared Hit Blocks": 15907,
      "Shared Read Blocks": 4960,
      "Shared Dirtied Blocks": 0,
      "Shared Written Blocks": 0,
      "Local Hit Blocks": 0,
      "Local Read Blocks": 0,
      "Local Dirtied Blocks": 0,
      "Local Written Blocks": 0,
      "Temp Read Blocks": 0,
      "Temp Written Blocks": 0,
      "Plans": [
        {
          "Node Type": "Seq Scan",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "Async Capable": false,
          "Relation Name": "users",
          "Alias": "users",
          "Startup Cost": 0.0,
          "Total Cost": 31964.0,
          "Plan Rows": 1000000,
          "Plan Width": 141,
          "Actual Startup Time": 0.015,
          "Actual Total Time": 85.248,
          "Actual Rows": 950010,
          "Actual Loops": 1,
          "Shared Hit Blocks": 15907,
          "Shared Read Blocks": 4960,
          "Shared Dirtied Blocks": 0,
          "Shared Written Blocks": 0,
          "Local Hit Blocks": 0,
          "Local Read Blocks": 0,
          "Local Dirtied Blocks": 0,
          "Local Written Blocks": 0,
          "Temp Read Blocks": 0,
          "Temp Written Blocks": 0
        }
      ]
    },
    "Planning": {
      "Shared Hit Blocks": 90,
      "Shared Read Blocks": 0,
      "Shared Dirtied Blocks": 0,
      "Shared Written Blocks": 0,
      "Local Hit Blocks": 0,
      "Local Read Blocks": 0,
      "Local Dirtied Blocks": 0,
      "Local Written Blocks": 0,
      "Temp Read Blocks": 0,
      "Temp Written Blocks": 0
    },
    "Planning Time": 0.563,
    "Triggers": [],
    "Execution Time": 120.962
  }
]
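To turn these files into the averages below, I aggregated the JSON fields with a script. My exact script is not shown here, but a sketch of such an aggregator could look like this (it assumes the /tmp/measure_<type>_<offset>_<sample>.json naming of the collector above):

```python
import json
from collections import defaultdict
from glob import glob

# Average the "Execution Time" reported by EXPLAIN (FORMAT JSON) per offset.
# Assumes the /tmp/measure_<type>_<offset>_<sample>.json naming scheme.
def average_execution_times(pattern="/tmp/measure_id_*.json"):
    samples = defaultdict(list)
    for path in glob(pattern):
        offset = int(path.rsplit("_", 2)[1])   # offset is encoded in the file name
        with open(path) as f:
            plan = json.load(f)[0]             # EXPLAIN JSON output is a one-element list
        samples[offset].append(plan["Execution Time"])
    return {off: sum(ts) / len(ts) for off, ts in sorted(samples.items())}
```

The same function works for the offset-based measurements by passing `"/tmp/measure_offset_*.json"`, and other fields such as `"Planning Time"` can be averaged the same way.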

Planning time comparison

The planning time is very similar in both cases. The offset approach is a bit faster, but insignificantly so. The conclusion for the planning time is that the difference is really insignificant; both are very fast (since these are very simple queries).

Offset   | Avg (id) [ms] | Avg (offset) [ms] | Offset   | Avg (id) [ms] | Avg (offset) [ms]
0        | 0.636         | 0.487             | 500_000  | 0.555         | 0.464
50_000   | 0.582         | 0.497             | 550_000  | 0.589         | 0.498
100_000  | 0.581         | 0.497             | 600_000  | 0.618         | 0.470
150_000  | 0.591         | 0.492             | 650_000  | 0.579         | 0.489
200_000  | 0.607         | 0.482             | 700_000  | 0.625         | 0.474
250_000  | 0.563         | 0.490             | 750_000  | 0.628         | 0.470
300_000  | 0.621         | 0.488             | 800_000  | 0.629         | 0.489
350_000  | 0.622         | 0.478             | 850_000  | 0.632         | 0.478
400_000  | 0.603         | 0.463             | 900_000  | 0.586         | 0.498
450_000  | 0.592         | 0.464             | 950_000  | 0.613         | 0.499

planning_time_id

planning_time_offset

Execution time comparison

This is where the real difference can be seen. The data points and the plots clearly show what we saw earlier and what was predictable: filtering on the indexed column first is significantly faster than a plain sequential scan driven by an offset. The id approach is faster by more than a hundred times, and the gap grows with the offset.

Offset   | Avg (id) [ms] | Avg (offset) [ms] | Offset   | Avg (id) [ms] | Avg (offset) [ms]
0        | 0.0596        | 0.059             | 500_000  | 0.0551        | 72.336
50_000   | 0.0590        | 12.486            | 550_000  | 0.0586        | 80.636
100_000  | 0.0581        | 25.136            | 600_000  | 0.0613        | 83.507
150_000  | 0.0590        | 37.032            | 650_000  | 0.0695        | 90.088
200_000  | 0.0605        | 43.822            | 700_000  | 0.0835        | 93.433
250_000  | 0.0564        | 49.412            | 750_000  | 0.0874        | 97.098
300_000  | 0.0624        | 55.462            | 800_000  | 0.0842        | 103.953
350_000  | 0.0637        | 59.697            | 850_000  | 0.0877        | 106.777
400_000  | 0.0601        | 63.397            | 900_000  | 0.0839        | 113.861
450_000  | 0.0593        | 68.085            | 950_000  | 0.0764        | 118.394

execution_time_id

execution_time_offset

Buffer usage

I have checked two values of the shared blocks (the size of one block is 8 kB):

  • Shared Hit Blocks: blocks that were read from memory (RAM, the shared buffer cache) instead of disk.
  • Shared Read Blocks: blocks that had to be read from disk.

I was curious how much extra memory the offset query uses. It makes sense that it uses more, but I wanted to put a number on it. So I computed a cache hit ratio: cache_ratio = shared_hit_blocks / (shared_hit_blocks + shared_read_blocks). In a good scenario it should tend towards 1.
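As a quick sanity check, the ratio is a one-liner; plugging in, for example, the worst-case offset-approach numbers measured further below (15_425 hit and 5_442 read blocks):

```python
# Cache hit ratio: fraction of block reads served from shared buffers (RAM).
def cache_hit_ratio(shared_hit_blocks, shared_read_blocks):
    total = shared_hit_blocks + shared_read_blocks
    return shared_hit_blocks / total if total else 1.0

# e.g. the offset approach at the highest measured offset
print(round(cache_hit_ratio(15_425, 5_442), 3))  # 0.739
```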

Id approach

First, let’s see the id approach data. As expected, thanks to the index, it uses almost no shared read blocks and really just a few shared hit blocks after the 600_000 offset. The ratio sits at 100% and drops only to 99% in the worst case.

Offset   | Avg Shared Hit Blocks | Avg Shared Read Blocks | Cache Hit Ratio
0        | 1.0                   | 0.0                    | 1.0
50_000   | 1.0                   | 0.0                    | 1.0
100_000  | 1.0                   | 0.0                    | 1.0
150_000  | 1.0                   | 0.0                    | 1.0
200_000  | 1.0                   | 0.0                    | 1.0
250_000  | 1.0                   | 0.0                    | 1.0
300_000  | 1.0                   | 0.0                    | 1.0
350_000  | 1.0                   | 0.0                    | 1.0
400_000  | 1.0                   | 0.0                    | 1.0
450_000  | 1.0                   | 0.0                    | 1.0
500_000  | 1.0                   | 0.0                    | 1.0
550_000  | 1.0                   | 0.0                    | 1.0
600_000  | 1.0                   | 0.0                    | 1.0
650_000  | 3.960                 | 0.039                  | 0.990
700_000  | 3.980                 | 0.019                  | 0.995
750_000  | 4.960                 | 0.039                  | 0.992
800_000  | 3.980                 | 0.019                  | 0.995
850_000  | 3.960                 | 0.039                  | 0.990
900_000  | 3.980                 | 0.019                  | 0.995
950_000  | 3.960                 | 0.039                  | 0.990

shared_hit_block_id

shared_read_block_id

cache_hit_ration_id

Offset approach

One thing shades the result: I did not restart the PostgreSQL server after each test, so the offset tests had already loaded some blocks into memory that later tests reused. This is why the increase shows up as a linear line, and why the cache hit ratio does not go below ~73%.

At the last measured point, the average shared hit block count was 15_425 and the average shared read block count was 5_442. The shared hit block count would not have been this high if the server had been restarted after each test. But the total block count is 20_867, which is approximately 20_867 * 8 = 166_936 kB, i.e. ~163 MB. This is the size of the buffers touched at offset 950_000 using the offset approach. Meanwhile, the id approach used 4 shared hit blocks and 0 shared read blocks at the same offset, which is just 4 * 8 = 32 kB of buffer usage.
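The arithmetic behind these sizes, as a quick check (8 kB being the default PostgreSQL block size):

```python
BLOCK_SIZE_KB = 8  # default PostgreSQL block size

def buffer_usage_kb(shared_hit, shared_read):
    return (shared_hit + shared_read) * BLOCK_SIZE_KB

offset_kb = buffer_usage_kb(15_425, 5_442)   # offset approach at offset 950_000
id_kb = buffer_usage_kb(4, 0)                # id approach at the same offset
print(offset_kb, offset_kb // 1024, id_kb)   # 166936 kB, ~163 MB, vs 32 kB
```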

Offset   | Avg Shared Hit Blocks | Avg Shared Read Blocks | Cache Hit Ratio
0        | 1.0                   | 0.0                    | 1.0
50_000   | 755.745               | 343.666                | 0.687
100_000  | 1590.843              | 606.098                | 0.724
150_000  | 2407.411              | 888.333                | 0.730
200_000  | 3228.588              | 1164.686               | 0.734
250_000  | 4051.235              | 1440.647               | 0.737
300_000  | 4848.568              | 1740.980               | 0.735
350_000  | 5664.254              | 2024.078               | 0.736
400_000  | 6486.313              | 2299.568               | 0.738
450_000  | 7285.941              | 2598.666               | 0.737
500_000  | 8135.549              | 2847.705               | 0.740
550_000  | 8926.647              | 3154.196               | 0.738
600_000  | 9727.941              | 3450.490               | 0.738
650_000  | 10548.784             | 3728.313               | 0.738
700_000  | 11350.215             | 4025.705               | 0.738
750_000  | 12192.235             | 4281.254               | 0.740
800_000  | 12978.039             | 4594.058               | 0.738
850_000  | 13805.313             | 4864.333               | 0.739
900_000  | 14618.647             | 5149.921               | 0.739
950_000  | 15425.039             | 5442.137               | 0.739

shared_hit_block_offset

shared_read_block_offset

cache_hit_ration_offset

Final words

What about UUID index or other column types?

Sometimes we use a UUID as the primary key. We can use a UUID that is based on time, such as version 7. It contains some randomness, but it also depends on the current time, so the values are ordered and can be used with the > comparison operator in a similar way to an integer id. A datetime column, if indexed, can also be used for fast pagination.
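The ordering property can be demonstrated without any UUID library. The sketch below builds a hand-made, "UUIDv7-like" key (my own simplified format, not the real RFC bit layout): because the millisecond timestamp comes first, plain string comparison follows creation order.

```python
import secrets
import time

# A simplified time-prefixed key: millisecond timestamp first, random tail after.
# Real UUIDv7 has a specific bit layout; this only mimics the ordering property.
def time_ordered_key(ms=None):
    ms = int(time.time() * 1000) if ms is None else ms
    return f"{ms:012x}-{secrets.token_hex(8)}"

older = time_ordered_key(ms=1_700_000_000_000)
newer = time_ordered_key(ms=1_700_000_000_001)
print(older < newer)  # True: string comparison follows creation time
```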

Of course, this cannot be applied to every kind of query problem. Here I only tested a simple table without any joins or heavier load. I just wanted to show myself some numbers behind it, so I can see and feel the importance of indexes.

👍 My rule of thumb, which I have already learnt: always run some test queries and analysis with test data, and optimize for the actions that will be executed most often. For example, with more complex tables and additional indexes, reads may still be fine, but every index carries a penalty on writes. There is always a path where we can improve ourselves and our programs.