2005-12-11

Preferring natural to surrogate key

I was reading an article at The Daily WTF today. The OP (original posts) sparked a debate on the use of surrogate vs. natural key.


It is really sad to see Jeff S. et. al fighting the horde of ignorants, but that is a battle that has been fought many times before and is not what I want to write about here.

What I want to show here is the difference in query time you'd expect to see by trimming down on unnecessary usage of surrogate keys.

A schema with unnecessary surrogate keys.
 
-- Done on postgresql 8.1

dms3_test=> \d document
                                         Table "dms4.document"
    Column     |           Type           |                          Modifiers                          
---------------+--------------------------+-------------------------------------------------------------
 id            | integer                  | not null default nextval('document_id_seq'::regclass)
 state         | text                     | not null
 modified_who  | text                     | not null
 modified_when | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
 created_who   | text                     | not null
 created_when  | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
Indexes:
    "document_pkey" PRIMARY KEY, btree (id)
    "document_created_when" btree (created_when)
    "document_created_who" btree (created_who)
    "document_modified_when" btree (modified_when)
    "document_modified_who" btree (modified_who)
    "document_state" btree (state)
Foreign-key constraints:
    "document_created_who_fkey" FOREIGN KEY (created_who) REFERENCES who(name)
    "document_modified_who_fkey" FOREIGN KEY (modified_who) REFERENCES who(name)
    "document_state_fkey" FOREIGN KEY (state) REFERENCES state(name)

dms3_test=> \d attribute_name
                            Table "public.attribute_name"
 Column |  Type   |                            Modifiers                             
--------+---------+------------------------------------------------------------------
 id     | integer | not null default nextval(('attribute_name_seq'::text)::regclass)
 name   | text    | not null
Indexes:
    "attribute_name_pkey" PRIMARY KEY, btree (id)
    "an_name_idx" UNIQUE, btree (name)

dms3_test=> \d attribute
                          Table "public.attribute"
 Column  |  Type   |                       Modifiers                        
---------+---------+--------------------------------------------------------
 id      | integer | not null default nextval('attribute_id_seq'::regclass)
 name_id | integer | not null
 value   | text    | not null
Indexes:
    "attribute_pkey" PRIMARY KEY, btree (id)
    "attr_nv_idx" UNIQUE, btree (name_id, value)
Foreign-key constraints:
    "$1" FOREIGN KEY (name_id) REFERENCES attribute_name(id)


dms3_test=> \d document_attribute
 Table "public.document_attribute"
    Column    |  Type   | Modifiers 
--------------+---------+-----------
 doc_id       | integer | not null
 attribute_id | integer | not null
Indexes:
    "document_attribute_pkey" PRIMARY KEY, btree (doc_id, attribute_id) CLUSTER
    "da_ad_idx" UNIQUE, btree (attribute_id, doc_id)
    "da_attr_idx" btree (attribute_id)
Foreign-key constraints:
    "$1" FOREIGN KEY (doc_id) REFERENCES public.document(id)
    "$2" FOREIGN KEY (attribute_id) REFERENCES attribute(id)


dms3_test=> 

explain analyze SELECT doc.id FROM attribute as attr0, attribute_name
as an0, attribute_name as an2, document_attribute as da2, document as
doc, attribute_name as an1, attribute as attr3, document_attribute as
da1, attribute as attr2, document_attribute as da0, attribute_name as
an3, attribute as attr1, document_attribute as da3

WHERE 
da0.doc_id = doc.id AND attr0.id = da0.attribute_id AND an0.id = attr0.name_id AND 
da1.doc_id = doc.id AND attr1.id = da1.attribute_id AND an1.id = attr1.name_id AND 
da2.doc_id = doc.id AND attr2.id = da2.attribute_id AND an2.id = attr2.name_id AND 
da3.doc_id = doc.id AND attr3.id = da3.attribute_id AND an3.id = attr3.name_id AND 
an0.name = 'ssis_client' AND attr0.value = 'client1' AND
an1.name = 'ssis_group' AND attr1.value = 'group1' AND
an2.name = 'ssis_type' AND attr2.value = 'type1' AND
an3.name = 'ssis_subtype' AND attr3.value = 'subtype1'
;


                                                                                                                         QUERY PLAN                                                                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=655695.94..886471.16 rows=1 width=4) (actual time=188571.542..217327.467 rows=3 loops=1)
   Join Filter: ("outer".id = "inner".name_id)
   ->  Seq Scan on attribute_name an0  (cost=0.00..1.07 rows=1 width=4) (actual time=0.045..0.048 rows=1 loops=1)
         Filter: (name = 'ssis_client'::text)
   ->  Nested Loop  (cost=655695.94..886470.07 rows=1 width=8) (actual time=188571.488..217327.404 rows=3 loops=1)
         Join Filter: ("outer".id = "inner".name_id)
         ->  Seq Scan on attribute_name an5  (cost=0.00..1.07 rows=1 width=4) (actual time=0.008..0.015 rows=1 loops=1)
               Filter: (name = 'ssis_e2xkey'::text)
         ->  Nested Loop  (cost=655695.94..886468.99 rows=1 width=12) (actual time=188571.475..217327.378 rows=3 loops=1)
               Join Filter: ("outer".id = "inner".name_id)
               ->  Seq Scan on attribute_name an4  (cost=0.00..1.07 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                     Filter: (name = 'ssis_comment'::text)
               ->  Nested Loop  (cost=655695.94..886467.90 rows=1 width=16) (actual time=188571.453..217327.346 rows=3 loops=1)
                     ->  Hash Join  (cost=655695.94..886463.56 rows=1 width=40) (actual time=183814.259..217326.874 rows=11 loops=1)
                           Hash Cond: ("outer".name_id = "inner".id)
                           ->  Hash Join  (cost=655694.86..886462.45 rows=4 width=44) (actual time=183814.220..217326.774 rows=11 loops=1)
                                 Hash Cond: (("outer".id = "inner".name_id) AND ("outer".attribute_id = "inner".id))
                                 ->  Hash Join  (cost=655692.89..886452.06 rows=838 width=52) (actual time=183795.378..217307.410 rows=951 loops=1)
                                       Hash Cond: ("outer".attribute_id = "inner".id)
                                       ->  Nested Loop  (cost=655690.92..886280.53 rows=32236 width=52) (actual time=183664.609..217269.300 rows=97785 loops=1)
                                             ->  Hash Join  (cost=655688.96..885633.84 rows=16118 width=44) (actual time=183640.064..217012.129 rows=97785 loops=1)
                                                   Hash Cond: ("outer".doc_id = "inner".id)
                                                   ->  Nested Loop  (cost=0.00..191997.58 rows=7557225 width=12) (actual time=12.250..10810.601 rows=7557225 loops=1)
                                                         ->  Index Scan using attribute_name_pkey on attribute_name an3  (cost=0.00..3.08 rows=1 width=4) (actual time=12.218..12.232 rows=1 loops=1)
                                                               Filter: (name = 'ssis_subtype'::text)
                                                         ->  Seq Scan on document_attribute da1  (cost=0.00..116422.25 rows=7557225 width=8) (actual time=0.021..4547.481 rows=7557225 loops=1)
                                                   ->  Hash  (cost=655683.57..655683.57 rows=2155 width=32) (actual time=183565.310..183565.310 rows=0 loops=1)
                                                         ->  Nested Loop  (cost=462162.26..655683.57 rows=2155 width=32) (actual time=128403.334..183556.998 rows=9257 loops=1)
                                                               ->  Hash Join  (cost=462162.26..650853.43 rows=288 width=24) (actual time=128403.319..183498.454 rows=906 loops=1)
                                                                     Hash Cond: ("outer".attribute_id = "inner".id)
                                                                     ->  Hash Join  (cost=462160.29..650793.18 rows=11080 width=24) (actual time=128337.287..183447.336 rows=90804 loops=1)
                                                                           Hash Cond: ("outer".name_id = "inner".id)
                                                                           ->  Nested Loop  (cost=462159.22..650348.93 rows=66476 width=28) (actual time=128337.257..183307.650 rows=90804 loops=1)
                                                                                 ->  Hash Join  (cost=462159.22..501285.43 rows=8888 width=20) (actual time=128306.517..132366.000 rows=8840 loops=1)
                                                                                       Hash Cond: ("outer".id = "inner".doc_id)
                                                                                       ->  Hash Join  (cost=1.06..25599.40 rows=202087 width=4) (actual time=41.568..2754.244 rows=329853 loops=1)
                                                                                             Hash Cond: ("outer".state_id = "inner".id)
                                                                                             ->  Seq Scan on document doc  (cost=0.00..18525.31 rows=1010431 width=8) (actual time=18.767..2089.253 rows=1010431 loops=1)
                                                                                             ->  Hash  (cost=1.06..1.06 rows=1 width=4) (actual time=14.681..14.681 rows=0 loops=1)
                                                                                                   ->  Seq Scan on state  (cost=0.00..1.06 rows=1 width=4) (actual time=0.031..0.035 rows=1 loops=1)
                                                                                                         Filter: (name = 'associated'::text)
                                                                                       ->  Hash  (cost=461808.06..461808.06 rows=44437 width=16) (actual time=128239.687..128239.687 rows=0 loops=1)
                                                                                             ->  Hash Join  (cost=157528.96..461808.06 rows=44437 width=16) (actual time=15479.239..128139.028 rows=27087 loops=1)
                                                                                                   Hash Cond: ("outer".attribute_id = "inner".id)
                                                                                                   ->  Hash Join  (cost=157526.99..452807.67 rows=1710811 width=16) (actual time=15478.512..127203.588 rows=2313438 loops=1)
                                                                                                         Hash Cond: ("outer".doc_id = "inner".doc_id)
                                                                                                         ->  Seq Scan on document_attribute da2  (cost=0.00..116422.25 rows=7557225 width=8) (actual time=0.019..37889.434 rows=7557225 loops=1)
                                                                                                         ->  Hash  (cost=156173.26..156173.26 rows=196292 width=8) (actual time=14908.286..14908.286 rows=0 loops=1)
                                                                                                               ->  Hash Join  (cost=1.97..156173.26 rows=196292 width=8) (actual time=7.750..14650.188 rows=269366 loops=1)
                                                                                                                     Hash Cond: ("outer".attribute_id = "inner".id)
                                                                                                                     ->  Seq Scan on document_attribute da4  (cost=0.00..116422.25 rows=7557225 width=8) (actual time=7.678..11635.892 rows=7557225 loops=1)
                                                                                                                     ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.049..0.049 rows=0 loops=1)
                                                                                                                           ->  Seq Scan on attribute attr4  (cost=0.00..1.96 rows=2 width=8) (actual time=0.008..0.047 rows=1 loops=1)
                                                                                                                                 Filter: (value = 'comment1'::text)
                                                                                                   ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.046..0.046 rows=0 loops=1)
                                                                                                         ->  Seq Scan on attribute attr2  (cost=0.00..1.96 rows=2 width=8) (actual time=0.007..0.044 rows=1 loops=1)
                                                                                                               Filter: (value = 'type1'::text)
                                                                                 ->  Index Scan using document_attribute_pkey on document_attribute da0  (cost=0.00..16.67 rows=8 width=8) (actual time=5.713..5.748 rows=10 loops=8840)
                                                                                       Index Cond: ("outer".doc_id = da0.doc_id)
                                                                           ->  Hash  (cost=1.07..1.07 rows=1 width=4) (actual time=0.011..0.011 rows=0 loops=1)
                                                                                 ->  Seq Scan on attribute_name an2  (cost=0.00..1.07 rows=1 width=4) (actual time=0.007..0.010 rows=1 loops=1)
                                                                                       Filter: (name = 'ssis_type'::text)
                                                                     ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.052..0.052 rows=0 loops=1)
                                                                           ->  Seq Scan on attribute attr0  (cost=0.00..1.96 rows=2 width=8) (actual time=0.014..0.050 rows=1 loops=1)
                                                                                 Filter: (value = 'client1'::text)
                                                               ->  Index Scan using document_attribute_pkey on document_attribute da3  (cost=0.00..16.67 rows=8 width=8) (actual time=0.005..0.049 rows=10 loops=906)
                                                                     Index Cond: (da3.doc_id = "outer".doc_id)
                                             ->  Materialize  (cost=1.96..1.98 rows=2 width=8) (actual time=0.000..0.001 rows=1 loops=97785)
                                                   ->  Seq Scan on attribute attr5  (cost=0.00..1.96 rows=2 width=8) (actual time=18.498..18.537 rows=1 loops=1)
                                                         Filter: (value = 'e2xkey1'::text)
                                       ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.045..0.045 rows=0 loops=1)
                                             ->  Seq Scan on attribute attr1  (cost=0.00..1.96 rows=2 width=8) (actual time=0.007..0.044 rows=1 loops=1)
                                                   Filter: (value = 'group1'::text)
                                 ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=18.781..18.781 rows=0 loops=1)
                                       ->  Seq Scan on attribute attr3  (cost=0.00..1.96 rows=2 width=8) (actual time=18.739..18.779 rows=1 loops=1)
                                             Filter: (value = 'subtype1'::text)
                           ->  Hash  (cost=1.07..1.07 rows=1 width=4) (actual time=0.018..0.018 rows=0 loops=1)
                                 ->  Seq Scan on attribute_name an1  (cost=0.00..1.07 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                                       Filter: (name = 'ssis_group'::text)
                     ->  Index Scan using document_attribute_pkey on document_attribute da5  (cost=0.00..4.33 rows=1 width=8) (actual time=0.037..0.037 rows=0 loops=11)
                           Index Cond: ((da5.doc_id = "outer".doc_id) AND ("outer".id = da5.attribute_id))
 Total runtime: 217405.692 ms
(82 rows)

Time: 217885.006 ms
The same data but in a schema without trimmed-down surrogate keys.
-- Done on postgresql 8.1

dms3_test=> \d document
                                         Table "dms4.document"
    Column     |           Type           |                          Modifiers                          
---------------+--------------------------+-------------------------------------------------------------
 id            | integer                  | not null default nextval('document_id_seq'::regclass)
 state         | text                     | not null
 modified_who  | text                     | not null
 modified_when | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
 created_who   | text                     | not null
 created_when  | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
Indexes:
    "document_pkey" PRIMARY KEY, btree (id)
    "document_created_when" btree (created_when)
    "document_created_who" btree (created_who)
    "document_modified_when" btree (modified_when)
    "document_modified_who" btree (modified_who)
    "document_state" btree (state)
Foreign-key constraints:
    "document_created_who_fkey" FOREIGN KEY (created_who) REFERENCES who(name)
    "document_modified_who_fkey" FOREIGN KEY (modified_who) REFERENCES who(name)
    "document_state_fkey" FOREIGN KEY (state) REFERENCES state(name)

dms3_test=> \d temp20051206.da
   Table "temp20051206.da"
 Column |  Type   | Modifiers 
--------+---------+-----------
 doc_id | integer | not null
 an     | text    | not null
 av     | text    | not null
Indexes:
    "da_pkey" PRIMARY KEY, btree (doc_id, an, av)
    "da_an_av_idx" UNIQUE, btree (an, av, doc_id)
Foreign-key constraints:
    "da_an_fkey" FOREIGN KEY (an, av) REFERENCES attr(name, value) DEFERRABLE INITIALLY DEFERRED
    "da_doc_id_fkey" FOREIGN KEY (doc_id) REFERENCES public.document(id) DEFERRABLE INITIALLY DEFERRED

dms3_test=> 

explain analyze select doc.id from document as doc, da
da_client, da da_group, da da_type, da da_subtype 

WHERE
doc.id=da_client.doc_id and doc.id=da_group.doc_id and
doc.id=da_type.doc_id and doc.id=da_subtype.doc_id and
da_client.an='ssis_client' and da_client.av='client1' and 
da_group.an='ssis_group' and da_group.av='group1' and
da_type.an='ssis_type' and da_type.av='type1' and 
da_subtype.an='ssis_subtype' and da_subtype.av='subtype1'
;

                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=40335.69..53871.76 rows=1 width=4) (actual time=17319.458..50292.429 rows=140 loops=1)
   ->  Nested Loop  (cost=40335.69..53864.64 rows=2 width=16) (actual time=17319.392..50260.780 rows=1072 loops=1)
         ->  Nested Loop  (cost=40335.69..53330.49 rows=150 width=12) (actual time=17289.476..35508.467 rows=10153 loops=1)
               ->  Merge Join  (cost=40335.69..40494.44 rows=3558 width=8) (actual time=17263.290..17594.400 rows=10153 loops=1)
                     Merge Cond: ("outer".doc_id = "inner".doc_id)
                     ->  Sort  (cost=20817.14..20849.19 rows=12818 width=4) (actual time=10145.947..10314.729 rows=101159 loops=1)
                           Sort Key: da_client.doc_id
                           ->  Bitmap Heap Scan on da da_client  (cost=155.91..19942.58 rows=12818 width=4) (actual time=1165.917..9800.025 rows=101159 loops=1)
                                 Recheck Cond: ((an = 'ssis_client'::text) AND (av = 'client1'::text))
                                 ->  Bitmap Index Scan on da_an_av_idx  (cost=0.00..155.91 rows=12818 width=0) (actual time=1156.362..1156.362 rows=101159 loops=1)
                                       Index Cond: ((an = 'ssis_client'::text) AND (av = 'client1'::text))
                     ->  Sort  (cost=19518.54..19548.09 rows=11817 width=4) (actual time=7117.330..7188.340 rows=101190 loops=1)
                           Sort Key: da_subtype.doc_id
                           ->  Bitmap Heap Scan on da da_subtype  (cost=143.90..18719.21 rows=11817 width=4) (actual time=913.707..6798.627 rows=101190 loops=1)
                                 Recheck Cond: ((an = 'ssis_subtype'::text) AND (av = 'subtype1'::text))
                                 ->  Bitmap Index Scan on da_an_av_idx  (cost=0.00..143.90 rows=11817 width=0) (actual time=904.720..904.720 rows=101190 loops=1)
                                       Index Cond: ((an = 'ssis_subtype'::text) AND (av = 'subtype1'::text))
               ->  Index Scan using document_pkey on document doc  (cost=0.00..3.60 rows=1 width=4) (actual time=1.762..1.763 rows=1 loops=10153)
                     Index Cond: (doc.id = "outer".doc_id)
         ->  Index Scan using da_pkey on da da_type  (cost=0.00..3.55 rows=1 width=4) (actual time=1.451..1.452 rows=0 loops=10153)
               Index Cond: (("outer".id = da_type.doc_id) AND (da_type.an = 'ssis_type'::text) AND (da_type.av = 'type1'::text))
   ->  Index Scan using da_pkey on da da_group  (cost=0.00..3.55 rows=1 width=4) (actual time=0.028..0.028 rows=0 loops=1072)
         Index Cond: (("outer".id = da_group.doc_id) AND (da_group.an = 'ssis_group'::text) AND (da_group.av = 'group1'::text))
 Total runtime: 50296.288 ms
(24 rows)

Time: 50301.853 ms
 
 
(originally from http://microjet.ath.cx/WebWiki/2005.12.11_PreferringNaturalToSurrogateKey.html)

No comments: