2005-08-23

Result pagination with postgresql

A common problem with webapps is providing an interface to page through result set like what various search engines do.
I have yet to find a website that discuss this in depth. So, here is a summary of solutions I came up with by looking at various pieces in the Web for doing result pagination with Postgresql. I hope this would give the sorely needed encouragement for people to start sharing their findings.
The problem actually has three components:
  • displaying result for a certain page,
  • not causing undue latency in page display, and
  • counting number of results.
Counting the accurate number of results would almost always require the full result set to be counted. Applications would usually cache this number. How exactly is the counting done depends on the approach taken.
And there are two basic approaches:
  • operating on piecemeal result set
  • operating on full result set
Please realise that there is no 'best' approach. Each comes with its own pros and cons.
To illustrate the pros and cons, I am employing two kind of queries: cheap and expensive. I categorise queries according to the effort exacted from pgsql: cheap and expensive. This categorisation is only for simplicity purposes as there are, of course, grey areas, queries that are neither cheap nor expensive; not to mention that cheap and expensive are subjective terms anyway.
dms3_test=> create view cheap as select id from document;
CREATE VIEW
Time: 150.078 ms

dms3_test=> create view expensive as SELECT doc.id FROM
attribute as attr0, attribute_name as an0, document as doc, state,
document_attribute as da0 WHERE state.id = doc.state_id AND state.name
= 'new' AND da0.doc_id = doc.id AND attr0.id = da0.attribute_id AND
an0.id = attr0.name_id AND an0.name = 'ssis_client' AND attr0.value
ILIKE 'client1';
CREATE VIEW
Time: 120.979 ms
Since what is pro and what is con depend very much against the context, I simply list the characteristics of each approach without further labelling.

Operating on Piecemeal Result Set (New Query for Each Page)

In this approach, a new query is run for each page. Each query differs only in OFFSET and LIMIT clauses.
For example, for the first page, the query would be executed with OFFSET 0 and LIMIT 10. The second page would be OFFSET 11 LIMIT 10.
This approach is popular and is found in various web applications. It is simple to implement and has an acceptable performance on cheap queries.
Number of matching results could be counted with a SELECT COUNT(*) in the beginning. This number could be cached as well so as to reduce the load on the server.
The latency in page display is low in the beginning and degrades linearly as user moves deeper.
The problem with this approach is it does not reuse previous effort. This is especially problematic if the query is expensive.
Another problem is each query would potentially see different snapshot of the data. If user is browsing page n and the underlying data changes, refreshing or revisiting page n would show a different data.
cheap query
dms3_test=> abort; 
begin; 
select count(*) from cheap; 
select * from cheap order by id offset 0 limit 10; 
select * from cheap order by id offset 50000 limit 10;

ROLLBACK
Time: 1.640 ms
BEGIN
Time: 6.187 ms
  count  
---------
 1010431
(1 row)

Time: 1094.187 ms
 id 
----
  1
  2
  3
  4
  5
  6
  9
 10
 11
 12
(10 rows)

Time: 57.371 ms
  id   
-------
 50003
 50004
 50005
 50006
 50007
 50008
 50009
 50010
 50011
 50012
(10 rows)

Time: 134.610 ms
expensive query
dms3_test=> abort; 
begin; 
select count(*) from expensive; 
select * from expensive order by id offset 0 limit 10; 
select * from expensive order by id offset 50000 limit 10;

ROLLBACK
Time: 2.698 ms
BEGIN
Time: 4.584 ms
count 
-------
 68276
(1 row)

Time: 18034.510 ms
 id  
-----
   6
  50
  55
  65
  89
 109
 110
 133
 144
 155
(10 rows)

Time: 76.929 ms
   id   
--------
 749659
 749661
 749667
 749685
 749692
 749720
 749732
 749740
 749741
 749778
(10 rows)

Time: 14424.053 ms
Characteristics:
  • simple implementation
  • no setup cost
  • suitable for cheap queries
  • suitable if user is expected to browse through only few pages
  • not isolated from underlying changes
  • latency degrades linearly

Operating on Full Result Set

This approach takes off from the previous one by reusing previous effort. The database takes a hit only on new query criteria, instead of every time the user changes pages.
This approach could be implemented by using either a temporary table or a without hold cursor. Both implementations require the webapp to maintain and reuse the transaction in which the table or cursor is defined.
A common strategy is to maintain a fixed number of connections to the database and assign one connection to the processing of a query in a round-robin way, i.e. map a specific query criteria to a specific connection.
In each connection, a transaction is held open throughout the duration of the webapp. This transaction would hold various temporary tables or cursors. You would want to keep the transaction open as long as possible.
Warning: keeping a transaction open for a long time would have the
following negative side-effects:
  • prevents vacuum from removing all dead tuples.
  • blocks changes to schema
  • may block other transactions if data is modified within the transaction
Moreover, transactions may become invalid at any time due to some unforeseen event like Bob spilling his soda over the ethernet switch.
Therefore, your webapp should be able to re-connect and re-setup the temporary tables or cursors setup if the existing connection or transaction is no longer valid.
Being able to re-setup would also allow the DBA to vacuum thoroughly and/or make schema updates by simply killing and temporarily blocking connections from your webapp during low-traffic hours without having to restart your webapp. This is a big deal if the DBA person is not the sysadmin or have permission to restart your webapp.
Before processing each query, it is recommended to generate a SAVEPOINT so that any error in processing a query would not destroy the transaction.

Using Temporary Tables

The result set could be piped into a temporary table via the CREATE TEMPORARY TABLE foo AS command. It is important to remember to use a temporary table since it is not journalled into the WAL (write-ahead logging) which would have negative impact on performance.
The implementation gives you a free count of matching result when you do the CREATE TEMPORARY TABLE AS. I am not sure why psql does not show the count, but it is accessible from within a stored procedure or your DB driver.
cheap query
dms3_test=> abort;
begin;
create temporary table foo as select * from cheap order by id;
select * from foo order by id offset 0 limit 10; 
select * from foo order by id offset 50000 limit 10;

ROLLBACK
Time: 60.744 ms
BEGIN
Time: 0.686 ms
SELECT
Time: 15125.956 ms
 id 
----
  1
  2
  3
  4
  5
  6
  9
 10
 11
 12
(10 rows)

Time: 4397.762 ms
  id   
-------
 50003
 50004
 50005
 50006
 50007
 50008
 50009
 50010
 50011
 50012
(10 rows)

Time: 4413.789 ms
expensive query
dms3_test=> abort;
begin;
create temporary table foo as select * from expensive order by id;
select * from foo order by id offset 0 limit 10; 
select * from foo order by id offset 50000 limit 10;

ROLLBACK
Time: 52.777 ms
BEGIN
Time: 3.683 ms
SELECT
Time: 18666.615 ms
 id  
-----
   6
  50
  55
  65
  89
 109
 110
 133
 144
 155
(10 rows)

Time: 314.754 ms
   id   
--------
 749659
 749661
 749667
 749685
 749692
 749720
 749732
 749740
 749741
 749778
(10 rows)

Time: 342.207 ms
Characteristics:
  • complex implementation
  • high setup cost
  • free result count as a side-effect
  • suitable for expensive queries
  • suitable if user is expected to comprehensively browse the result set
  • isolated from underlying changes
  • latency still degrades linearly but more gently

Using Without Hold Cursors

Without hold cursors are destroyed at the end of transaction, similar to temporary tables. On the other hand, with hold cursors outlive the creating transaction, although they are still bounded within a session. I recommend using without hold cursors to simplify garbage management.
cheap query
dms3_test=> abort;
begin;
declare cheap_cursor scroll cursor for select * from cheap order by id;
move all from cheap_cursor;
move first from cheap_cursor;
fetch 10 from cheap_cursor;
move absolute 50000 from cheap_cursor;
fetch 10 from cheap_cursor;

ROLLBACK
Time: 4.054 ms
BEGIN
Time: 0.970 ms
DECLARE CURSOR
Time: 1.022 ms
MOVE 1010431
Time: 12434.136 ms
MOVE 1
Time: 4.409 ms
 id 
----
  2
  3
  4
  5
  6
  9
 10
 11
 12
 13
(10 rows)

Time: 4.418 ms
MOVE 1
Time: 30.055 ms
  id   
-------
 50003
 50004
 50005
 50006
 50007
 50008
 50009
 50010
 50011
 50012
(10 rows)

Time: 3.875 ms
expensive query
dms3_test=> abort;
begin;
declare expensive_cursor scroll cursor for select * from expensive order by id;
move all from expensive_cursor;
move first from expensive_cursor;
fetch 10 from expensive_cursor;
move absolute 50000 from expensive_cursor;
fetch 10 from expensive_cursor;

ROLLBACK
Time: 2.044 ms
BEGIN
Time: 0.739 ms
DECLARE CURSOR
Time: 51.912 ms
MOVE 68276
Time: 19036.148 ms
MOVE 1
Time: 1.055 ms
 id  
-----
  50
  55
  65
  89
 109
 110
 133
 144
 155
 186
(10 rows)

Time: 0.911 ms
MOVE 1
Time: 30.226 ms
   id   
--------
 749659
 749661
 749667
 749685
 749692
 749720
 749732
 749740
 749741
 749778
(10 rows)

Time: 1.736 ms
Characteristics:
  • complex implementation
  • high setup cost
  • suitable for expensive queries
  • suitable if user is expected to comprehensively browse the result set
  • isolated from underlying changes
  • barely noticeable latency

Hybrid Approach

One could do a hybrid approach. The implementation would be even more complex, but in some cases, it could combine the no setup cost benefit of the piecemeal approach and the low latency of the full result approach.
The hybrid approach would operate on piecemeal result set until a certain threshold is reached, e.g.: paging past page 7. When that happens, one of the full result set approach is executed, preferably in the background. The webapp could transition to using the full result set when it is ready.

Summary

                              Summary of Implementations                              
Query Type Implementation        Setup(ms) Counting(ms) First Page(ms) 5000th Page(ms)
         
         
         
Cheap    
         
         
         
          
New query per page  
                     
      N/A
         
    1094.187
            
        57.371
              
        134.610
               
Temporary Table     
                     
15125.956
         
         N/A
            
      4397.762
              
       4413.789
               
Cursor              
                     
    1.022
         
   12434.136
            
         8.827
              
         33.930
               
         
         
         
Expensive
         
         
         
          
New query per page  
                     
      N/A
         
   18034.510
            
        76.929
              
      14424.053
               
Temporary Table     
                     
18666.615
         
         N/A
            
       314.754
              
        342.207
               
Cursor              
                     
   51.921
         
   19036.148
            
         1.966
              
         31.962
               

 
 
(originally from http://microjet.ath.cx/WebWiki/ResultPaginationWithPostgresql.html)

No comments: