2005-12-27

Using symbols for the wrong reason

The concept of symbols have been popular among the lisp community, yet not many people know about it mainly due to the general ignorance.
Then ruby came and made symbols be a commodity programming construct. People who were not aware of symbols now are.

And they are asking about it numerous times. There is not a week in the ruby mailing list that there isn't a question about symbols: what are they and what are they good for?

Many people have tried to answer that, but the answer has been along the lines presented in these two articles: http://glu.ttono.us/articles/2005/08/19/understanding-ruby-symbols and http://zephyrfalcon.org/weblog2/arch_e10_00850.html#e857.

The answer has been putting undue emphasise on the way current ruby VM implements symbols. Ruby string is mutable, and it is not efficiently implemented in current ruby VM. So, use symbols for efficient, immutable, and string-like objects.

It is not wrong and it is correct for current ruby VM. However, I think, that is a misguided answer to the questions. The answer should, on the other hand, put an emphasise on the programmer's intention.
rubyists uses #each() method more frequently than a for loop because it clarifies their intention of iterating over some sequence even though they could have used the more efficient for loop or even the if and goto constructs.

One does not tell another to use if and goto over for loop for iterating a sequence simply because if and goto may be more efficient. No matter how inefficient a compiler/interpreter implements the for loop construct, the possibility of an efficient for loop implementation remains. In fact, by using the for construct, the compiler could have an easier time deducing your intent of looping, and if some conditions are met (e.g., closed looping of certain numbers of times), it could unroll your loop for a better performance if your systems allows it.

In short, any answer that depends on a particular implementation is doomed to be short-lived. What happens if the next ruby VM implements COW (copy-on-write) strings? A COW string would share initial instances. Only f there is a modification to the instance, then the initial instance is copied and the modification is performed on the copy. As long as one does not try to modify COW string instance, it can be as efficient as how the current VM implements symbols. IOW, any answer that rallies around so-called efficiency while abandoning intent would become obsolete and there is a new scramble to get at an updated answer.

Thus, I finally come to say that one should not use symbols just for efficiency gain. Symbols are not meant to be an immutable string-like object. It is really meant to be used to construct user-defined identifiers. The user in this case would be the programmers.

Consider:
foo1 = { 
   :host => 'localhost',
   :port => 80
}
foo2 = {
   'host' => 'localhost',
   'port' => 80

In foo1, symbols are being used to identify the following data. The string 'localhost' is identified as a host, and 80 is not just any number, but rather a port number.

In foo2, it is a bit unclear as to what purpose 'host' and 'port' serves. Is foo2 a macro replacement list? That is, if the program reads the string 'host', would it be replaced to 'localhost'? What is the purpose of 'host' there? Is it an identifier for the string 'localhost'?

The programming world should borrow the real estate's adage of "location, location, location". It should be translated to: "intention, intention, intention". It is the main reason why comments that clarifies the intention of the programmer are so valuable. It is the main reason why there are a variety language constructs. It should also be the main reason for you to decide whether or not to use symbols.

2005.12.28 update: I am joyful that not everyone resorted to dumbing down the concept of symbols. http://onestepback.org/index.cgi/Tech/Ruby/SymbolsAreNotImmutableStrings.red
2006.01.06 update: What an amazing interest on symbol! I don't think I've seen any one topic in ruby that has generated 142 posts in a single thread before this.
http://groups.google.com/group/comp.lang.ruby/browse_frm/thread/164ae5f5cbbac02e?q=differences+between+%3Afoo&hl=en&
2007.07.03 update: a related article: 2007.07.03_WhatAreSymbols

(originally from http://microjet.ath.cx/WebWiki/2005.12.27_UsingSymbolsForTheWrongReason.html)

2005-12-11

Preferring natural to surrogate key

I was reading an article at The Daily WTF today. The OP (original posts) sparked a debate on the use of surrogate vs. natural key.


It is really sad to see Jeff S. et. al fighting the horde of ignorants, but that is a battle that has been fought many times before and is not what I want to write about here.

What I want to show here is the difference in query time you'd expect to see by trimming down on unnecessary usage of surrogate keys.

A schema with unnecessary surrogate keys.
 
-- Done on postgresql 8.1

dms3_test=> \d document
                                         Table "dms4.document"
    Column     |           Type           |                          Modifiers                          
---------------+--------------------------+-------------------------------------------------------------
 id            | integer                  | not null default nextval('document_id_seq'::regclass)
 state         | text                     | not null
 modified_who  | text                     | not null
 modified_when | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
 created_who   | text                     | not null
 created_when  | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
Indexes:
    "document_pkey" PRIMARY KEY, btree (id)
    "document_created_when" btree (created_when)
    "document_created_who" btree (created_who)
    "document_modified_when" btree (modified_when)
    "document_modified_who" btree (modified_who)
    "document_state" btree (state)
Foreign-key constraints:
    "document_created_who_fkey" FOREIGN KEY (created_who) REFERENCES who(name)
    "document_modified_who_fkey" FOREIGN KEY (modified_who) REFERENCES who(name)
    "document_state_fkey" FOREIGN KEY (state) REFERENCES state(name)

dms3_test=> \d attribute_name
                            Table "public.attribute_name"
 Column |  Type   |                            Modifiers                             
--------+---------+------------------------------------------------------------------
 id     | integer | not null default nextval(('attribute_name_seq'::text)::regclass)
 name   | text    | not null
Indexes:
    "attribute_name_pkey" PRIMARY KEY, btree (id)
    "an_name_idx" UNIQUE, btree (name)

dms3_test=> \d attribute
                          Table "public.attribute"
 Column  |  Type   |                       Modifiers                        
---------+---------+--------------------------------------------------------
 id      | integer | not null default nextval('attribute_id_seq'::regclass)
 name_id | integer | not null
 value   | text    | not null
Indexes:
    "attribute_pkey" PRIMARY KEY, btree (id)
    "attr_nv_idx" UNIQUE, btree (name_id, value)
Foreign-key constraints:
    "$1" FOREIGN KEY (name_id) REFERENCES attribute_name(id)


dms3_test=> \d document_attribute
 Table "public.document_attribute"
    Column    |  Type   | Modifiers 
--------------+---------+-----------
 doc_id       | integer | not null
 attribute_id | integer | not null
Indexes:
    "document_attribute_pkey" PRIMARY KEY, btree (doc_id, attribute_id) CLUSTER
    "da_ad_idx" UNIQUE, btree (attribute_id, doc_id)
    "da_attr_idx" btree (attribute_id)
Foreign-key constraints:
    "$1" FOREIGN KEY (doc_id) REFERENCES public.document(id)
    "$2" FOREIGN KEY (attribute_id) REFERENCES attribute(id)


dms3_test=> 

explain analyze SELECT doc.id FROM attribute as attr0, attribute_name
as an0, attribute_name as an2, document_attribute as da2, document as
doc, attribute_name as an1, attribute as attr3, document_attribute as
da1, attribute as attr2, document_attribute as da0, attribute_name as
an3, attribute as attr1, document_attribute as da3

WHERE 
da0.doc_id = doc.id AND attr0.id = da0.attribute_id AND an0.id = attr0.name_id AND 
da1.doc_id = doc.id AND attr1.id = da1.attribute_id AND an1.id = attr1.name_id AND 
da2.doc_id = doc.id AND attr2.id = da2.attribute_id AND an2.id = attr2.name_id AND 
da3.doc_id = doc.id AND attr3.id = da3.attribute_id AND an3.id = attr3.name_id AND 
an0.name = 'ssis_client' AND attr0.value = 'client1' AND
an1.name = 'ssis_group' AND attr1.value = 'group1' AND
an2.name = 'ssis_type' AND attr2.value = 'type1' AND
an3.name = 'ssis_subtype' AND attr3.value = 'subtype1'
;


                                                                                                                         QUERY PLAN                                                                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=655695.94..886471.16 rows=1 width=4) (actual time=188571.542..217327.467 rows=3 loops=1)
   Join Filter: ("outer".id = "inner".name_id)
   ->  Seq Scan on attribute_name an0  (cost=0.00..1.07 rows=1 width=4) (actual time=0.045..0.048 rows=1 loops=1)
         Filter: (name = 'ssis_client'::text)
   ->  Nested Loop  (cost=655695.94..886470.07 rows=1 width=8) (actual time=188571.488..217327.404 rows=3 loops=1)
         Join Filter: ("outer".id = "inner".name_id)
         ->  Seq Scan on attribute_name an5  (cost=0.00..1.07 rows=1 width=4) (actual time=0.008..0.015 rows=1 loops=1)
               Filter: (name = 'ssis_e2xkey'::text)
         ->  Nested Loop  (cost=655695.94..886468.99 rows=1 width=12) (actual time=188571.475..217327.378 rows=3 loops=1)
               Join Filter: ("outer".id = "inner".name_id)
               ->  Seq Scan on attribute_name an4  (cost=0.00..1.07 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                     Filter: (name = 'ssis_comment'::text)
               ->  Nested Loop  (cost=655695.94..886467.90 rows=1 width=16) (actual time=188571.453..217327.346 rows=3 loops=1)
                     ->  Hash Join  (cost=655695.94..886463.56 rows=1 width=40) (actual time=183814.259..217326.874 rows=11 loops=1)
                           Hash Cond: ("outer".name_id = "inner".id)
                           ->  Hash Join  (cost=655694.86..886462.45 rows=4 width=44) (actual time=183814.220..217326.774 rows=11 loops=1)
                                 Hash Cond: (("outer".id = "inner".name_id) AND ("outer".attribute_id = "inner".id))
                                 ->  Hash Join  (cost=655692.89..886452.06 rows=838 width=52) (actual time=183795.378..217307.410 rows=951 loops=1)
                                       Hash Cond: ("outer".attribute_id = "inner".id)
                                       ->  Nested Loop  (cost=655690.92..886280.53 rows=32236 width=52) (actual time=183664.609..217269.300 rows=97785 loops=1)
                                             ->  Hash Join  (cost=655688.96..885633.84 rows=16118 width=44) (actual time=183640.064..217012.129 rows=97785 loops=1)
                                                   Hash Cond: ("outer".doc_id = "inner".id)
                                                   ->  Nested Loop  (cost=0.00..191997.58 rows=7557225 width=12) (actual time=12.250..10810.601 rows=7557225 loops=1)
                                                         ->  Index Scan using attribute_name_pkey on attribute_name an3  (cost=0.00..3.08 rows=1 width=4) (actual time=12.218..12.232 rows=1 loops=1)
                                                               Filter: (name = 'ssis_subtype'::text)
                                                         ->  Seq Scan on document_attribute da1  (cost=0.00..116422.25 rows=7557225 width=8) (actual time=0.021..4547.481 rows=7557225 loops=1)
                                                   ->  Hash  (cost=655683.57..655683.57 rows=2155 width=32) (actual time=183565.310..183565.310 rows=0 loops=1)
                                                         ->  Nested Loop  (cost=462162.26..655683.57 rows=2155 width=32) (actual time=128403.334..183556.998 rows=9257 loops=1)
                                                               ->  Hash Join  (cost=462162.26..650853.43 rows=288 width=24) (actual time=128403.319..183498.454 rows=906 loops=1)
                                                                     Hash Cond: ("outer".attribute_id = "inner".id)
                                                                     ->  Hash Join  (cost=462160.29..650793.18 rows=11080 width=24) (actual time=128337.287..183447.336 rows=90804 loops=1)
                                                                           Hash Cond: ("outer".name_id = "inner".id)
                                                                           ->  Nested Loop  (cost=462159.22..650348.93 rows=66476 width=28) (actual time=128337.257..183307.650 rows=90804 loops=1)
                                                                                 ->  Hash Join  (cost=462159.22..501285.43 rows=8888 width=20) (actual time=128306.517..132366.000 rows=8840 loops=1)
                                                                                       Hash Cond: ("outer".id = "inner".doc_id)
                                                                                       ->  Hash Join  (cost=1.06..25599.40 rows=202087 width=4) (actual time=41.568..2754.244 rows=329853 loops=1)
                                                                                             Hash Cond: ("outer".state_id = "inner".id)
                                                                                             ->  Seq Scan on document doc  (cost=0.00..18525.31 rows=1010431 width=8) (actual time=18.767..2089.253 rows=1010431 loops=1)
                                                                                             ->  Hash  (cost=1.06..1.06 rows=1 width=4) (actual time=14.681..14.681 rows=0 loops=1)
                                                                                                   ->  Seq Scan on state  (cost=0.00..1.06 rows=1 width=4) (actual time=0.031..0.035 rows=1 loops=1)
                                                                                                         Filter: (name = 'associated'::text)
                                                                                       ->  Hash  (cost=461808.06..461808.06 rows=44437 width=16) (actual time=128239.687..128239.687 rows=0 loops=1)
                                                                                             ->  Hash Join  (cost=157528.96..461808.06 rows=44437 width=16) (actual time=15479.239..128139.028 rows=27087 loops=1)
                                                                                                   Hash Cond: ("outer".attribute_id = "inner".id)
                                                                                                   ->  Hash Join  (cost=157526.99..452807.67 rows=1710811 width=16) (actual time=15478.512..127203.588 rows=2313438 loops=1)
                                                                                                         Hash Cond: ("outer".doc_id = "inner".doc_id)
                                                                                                         ->  Seq Scan on document_attribute da2  (cost=0.00..116422.25 rows=7557225 width=8) (actual time=0.019..37889.434 rows=7557225 loops=1)
                                                                                                         ->  Hash  (cost=156173.26..156173.26 rows=196292 width=8) (actual time=14908.286..14908.286 rows=0 loops=1)
                                                                                                               ->  Hash Join  (cost=1.97..156173.26 rows=196292 width=8) (actual time=7.750..14650.188 rows=269366 loops=1)
                                                                                                                     Hash Cond: ("outer".attribute_id = "inner".id)
                                                                                                                     ->  Seq Scan on document_attribute da4  (cost=0.00..116422.25 rows=7557225 width=8) (actual time=7.678..11635.892 rows=7557225 loops=1)
                                                                                                                     ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.049..0.049 rows=0 loops=1)
                                                                                                                           ->  Seq Scan on attribute attr4  (cost=0.00..1.96 rows=2 width=8) (actual time=0.008..0.047 rows=1 loops=1)
                                                                                                                                 Filter: (value = 'comment1'::text)
                                                                                                   ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.046..0.046 rows=0 loops=1)
                                                                                                         ->  Seq Scan on attribute attr2  (cost=0.00..1.96 rows=2 width=8) (actual time=0.007..0.044 rows=1 loops=1)
                                                                                                               Filter: (value = 'type1'::text)
                                                                                 ->  Index Scan using document_attribute_pkey on document_attribute da0  (cost=0.00..16.67 rows=8 width=8) (actual time=5.713..5.748 rows=10 loops=8840)
                                                                                       Index Cond: ("outer".doc_id = da0.doc_id)
                                                                           ->  Hash  (cost=1.07..1.07 rows=1 width=4) (actual time=0.011..0.011 rows=0 loops=1)
                                                                                 ->  Seq Scan on attribute_name an2  (cost=0.00..1.07 rows=1 width=4) (actual time=0.007..0.010 rows=1 loops=1)
                                                                                       Filter: (name = 'ssis_type'::text)
                                                                     ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.052..0.052 rows=0 loops=1)
                                                                           ->  Seq Scan on attribute attr0  (cost=0.00..1.96 rows=2 width=8) (actual time=0.014..0.050 rows=1 loops=1)
                                                                                 Filter: (value = 'client1'::text)
                                                               ->  Index Scan using document_attribute_pkey on document_attribute da3  (cost=0.00..16.67 rows=8 width=8) (actual time=0.005..0.049 rows=10 loops=906)
                                                                     Index Cond: (da3.doc_id = "outer".doc_id)
                                             ->  Materialize  (cost=1.96..1.98 rows=2 width=8) (actual time=0.000..0.001 rows=1 loops=97785)
                                                   ->  Seq Scan on attribute attr5  (cost=0.00..1.96 rows=2 width=8) (actual time=18.498..18.537 rows=1 loops=1)
                                                         Filter: (value = 'e2xkey1'::text)
                                       ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=0.045..0.045 rows=0 loops=1)
                                             ->  Seq Scan on attribute attr1  (cost=0.00..1.96 rows=2 width=8) (actual time=0.007..0.044 rows=1 loops=1)
                                                   Filter: (value = 'group1'::text)
                                 ->  Hash  (cost=1.96..1.96 rows=2 width=8) (actual time=18.781..18.781 rows=0 loops=1)
                                       ->  Seq Scan on attribute attr3  (cost=0.00..1.96 rows=2 width=8) (actual time=18.739..18.779 rows=1 loops=1)
                                             Filter: (value = 'subtype1'::text)
                           ->  Hash  (cost=1.07..1.07 rows=1 width=4) (actual time=0.018..0.018 rows=0 loops=1)
                                 ->  Seq Scan on attribute_name an1  (cost=0.00..1.07 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
                                       Filter: (name = 'ssis_group'::text)
                     ->  Index Scan using document_attribute_pkey on document_attribute da5  (cost=0.00..4.33 rows=1 width=8) (actual time=0.037..0.037 rows=0 loops=11)
                           Index Cond: ((da5.doc_id = "outer".doc_id) AND ("outer".id = da5.attribute_id))
 Total runtime: 217405.692 ms
(82 rows)

Time: 217885.006 ms
The same data but in a schema without trimmed-down surrogate keys.
-- Done on postgresql 8.1

dms3_test=> \d document
                                         Table "dms4.document"
    Column     |           Type           |                          Modifiers                          
---------------+--------------------------+-------------------------------------------------------------
 id            | integer                  | not null default nextval('document_id_seq'::regclass)
 state         | text                     | not null
 modified_who  | text                     | not null
 modified_when | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
 created_who   | text                     | not null
 created_when  | timestamp with time zone | not null default ('now'::text)::timestamp(6) with time zone
Indexes:
    "document_pkey" PRIMARY KEY, btree (id)
    "document_created_when" btree (created_when)
    "document_created_who" btree (created_who)
    "document_modified_when" btree (modified_when)
    "document_modified_who" btree (modified_who)
    "document_state" btree (state)
Foreign-key constraints:
    "document_created_who_fkey" FOREIGN KEY (created_who) REFERENCES who(name)
    "document_modified_who_fkey" FOREIGN KEY (modified_who) REFERENCES who(name)
    "document_state_fkey" FOREIGN KEY (state) REFERENCES state(name)

dms3_test=> \d temp20051206.da
   Table "temp20051206.da"
 Column |  Type   | Modifiers 
--------+---------+-----------
 doc_id | integer | not null
 an     | text    | not null
 av     | text    | not null
Indexes:
    "da_pkey" PRIMARY KEY, btree (doc_id, an, av)
    "da_an_av_idx" UNIQUE, btree (an, av, doc_id)
Foreign-key constraints:
    "da_an_fkey" FOREIGN KEY (an, av) REFERENCES attr(name, value) DEFERRABLE INITIALLY DEFERRED
    "da_doc_id_fkey" FOREIGN KEY (doc_id) REFERENCES public.document(id) DEFERRABLE INITIALLY DEFERRED

dms3_test=> 

explain analyze select doc.id from document as doc, da
da_client, da da_group, da da_type, da da_subtype 

WHERE
doc.id=da_client.doc_id and doc.id=da_group.doc_id and
doc.id=da_type.doc_id and doc.id=da_subtype.doc_id and
da_client.an='ssis_client' and da_client.av='client1' and 
da_group.an='ssis_group' and da_group.av='group1' and
da_type.an='ssis_type' and da_type.av='type1' and 
da_subtype.an='ssis_subtype' and da_subtype.av='subtype1'
;

                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=40335.69..53871.76 rows=1 width=4) (actual time=17319.458..50292.429 rows=140 loops=1)
   ->  Nested Loop  (cost=40335.69..53864.64 rows=2 width=16) (actual time=17319.392..50260.780 rows=1072 loops=1)
         ->  Nested Loop  (cost=40335.69..53330.49 rows=150 width=12) (actual time=17289.476..35508.467 rows=10153 loops=1)
               ->  Merge Join  (cost=40335.69..40494.44 rows=3558 width=8) (actual time=17263.290..17594.400 rows=10153 loops=1)
                     Merge Cond: ("outer".doc_id = "inner".doc_id)
                     ->  Sort  (cost=20817.14..20849.19 rows=12818 width=4) (actual time=10145.947..10314.729 rows=101159 loops=1)
                           Sort Key: da_client.doc_id
                           ->  Bitmap Heap Scan on da da_client  (cost=155.91..19942.58 rows=12818 width=4) (actual time=1165.917..9800.025 rows=101159 loops=1)
                                 Recheck Cond: ((an = 'ssis_client'::text) AND (av = 'client1'::text))
                                 ->  Bitmap Index Scan on da_an_av_idx  (cost=0.00..155.91 rows=12818 width=0) (actual time=1156.362..1156.362 rows=101159 loops=1)
                                       Index Cond: ((an = 'ssis_client'::text) AND (av = 'client1'::text))
                     ->  Sort  (cost=19518.54..19548.09 rows=11817 width=4) (actual time=7117.330..7188.340 rows=101190 loops=1)
                           Sort Key: da_subtype.doc_id
                           ->  Bitmap Heap Scan on da da_subtype  (cost=143.90..18719.21 rows=11817 width=4) (actual time=913.707..6798.627 rows=101190 loops=1)
                                 Recheck Cond: ((an = 'ssis_subtype'::text) AND (av = 'subtype1'::text))
                                 ->  Bitmap Index Scan on da_an_av_idx  (cost=0.00..143.90 rows=11817 width=0) (actual time=904.720..904.720 rows=101190 loops=1)
                                       Index Cond: ((an = 'ssis_subtype'::text) AND (av = 'subtype1'::text))
               ->  Index Scan using document_pkey on document doc  (cost=0.00..3.60 rows=1 width=4) (actual time=1.762..1.763 rows=1 loops=10153)
                     Index Cond: (doc.id = "outer".doc_id)
         ->  Index Scan using da_pkey on da da_type  (cost=0.00..3.55 rows=1 width=4) (actual time=1.451..1.452 rows=0 loops=10153)
               Index Cond: (("outer".id = da_type.doc_id) AND (da_type.an = 'ssis_type'::text) AND (da_type.av = 'type1'::text))
   ->  Index Scan using da_pkey on da da_group  (cost=0.00..3.55 rows=1 width=4) (actual time=0.028..0.028 rows=0 loops=1072)
         Index Cond: (("outer".id = da_group.doc_id) AND (da_group.an = 'ssis_group'::text) AND (da_group.av = 'group1'::text))
 Total runtime: 50296.288 ms
(24 rows)

Time: 50301.853 ms
 
 
(originally from http://microjet.ath.cx/WebWiki/2005.12.11_PreferringNaturalToSurrogateKey.html)

Table partitioning

Postgresql could benefit from an implementation of table partitioning. Also see: http://groups.google.com/group/pgsql.general/tree/browse_frm/thread/e72ead84aa16bc8e/64d61f78425b119b?rnum=1&q=partition&_done=%2Fgroup%2Fpgsql.general%2Fbrowse_frm%2Fthread%2Fe72ead84aa16bc8e%2F121bab4de037ce6a%3Fq%3Dpartition%26rnum%3D7%26#doc_64d61f78425b119b

All in one table

dms3_test=> select count(doc.id) from document as doc, da da_client, da
da_group, da da_type, da da_subtype where doc.id=da_client.doc_id and
doc.id=da_group.doc_id and doc.id=da_type.doc_id and
doc.id=da_subtype.doc_id and da_client.an='ssis_client' and
da_group.an='ssis_group' and da_type.an='ssis_type' and
da_subtype.an='ssis_subtype' and da_client.av='client1' and
da_group.av='group1' and da_type.av='type1' and da_subtype.av='subtype1';

 count 
-------
   140
(1 row)

Time: 60882.964 ms


Separated out
 
 
dms3_test=> select count(doc.id) from document as doc, ssis_client as sc,
ssis_group as sg, ssis_type as st, ssis_subtype as sst where
doc.id=sc.doc_id and doc.id=sg.doc_id and doc.id=st.doc_id and
doc.id=sst.doc_id and sc.value='client1' and sg.value='group1' and
st.value='type1' and sst.value='subtype1';

 count 
-------
   140
(1 row)

Time: 3912.358 ms

Explain analyze of all in one table
 
 
dms3_test=> explain analyze select doc.id from document as doc, da
da_client, da da_group, da da_type, da da_subtype where
doc.id=da_client.doc_id and doc.id=da_group.doc_id and
doc.id=da_type.doc_id and doc.id=da_subtype.doc_id and
da_client.an='ssis_client' and da_group.an='ssis_group' and
da_type.an='ssis_type' and da_subtype.an='ssis_subtype' and
da_client.av='client1' and da_group.av='group1' and da_type.av='type1' and
da_subtype.av='subtype1'

                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=40335.69..53871.76 rows=1 width=4) (actual time=17319.458..50292.429 rows=140 loops=1)
   ->  Nested Loop  (cost=40335.69..53864.64 rows=2 width=16) (actual time=17319.392..50260.780 rows=1072 loops=1)
         ->  Nested Loop  (cost=40335.69..53330.49 rows=150 width=12) (actual time=17289.476..35508.467 rows=10153 loops=1)
               ->  Merge Join  (cost=40335.69..40494.44 rows=3558 width=8) (actual time=17263.290..17594.400 rows=10153 loops=1)
                     Merge Cond: ("outer".doc_id = "inner".doc_id)
                     ->  Sort  (cost=20817.14..20849.19 rows=12818 width=4) (actual time=10145.947..10314.729 rows=101159 loops=1)
                           Sort Key: da_client.doc_id
                           ->  Bitmap Heap Scan on da da_client  (cost=155.91..19942.58 rows=12818 width=4) (actual time=1165.917..9800.025 rows=101159 loops=1)
                                 Recheck Cond: ((an = 'ssis_client'::text) AND (av = 'client1'::text))
                                 ->  Bitmap Index Scan on da_an_av_idx  (cost=0.00..155.91 rows=12818 width=0) (actual time=1156.362..1156.362 rows=101159 loops=1)
                                       Index Cond: ((an = 'ssis_client'::text) AND (av = 'client1'::text))
                     ->  Sort  (cost=19518.54..19548.09 rows=11817 width=4) (actual time=7117.330..7188.340 rows=101190 loops=1)
                           Sort Key: da_subtype.doc_id
                           ->  Bitmap Heap Scan on da da_subtype  (cost=143.90..18719.21 rows=11817 width=4) (actual time=913.707..6798.627 rows=101190 loops=1)
                                 Recheck Cond: ((an = 'ssis_subtype'::text) AND (av = 'subtype1'::text))
                                 ->  Bitmap Index Scan on da_an_av_idx  (cost=0.00..143.90 rows=11817 width=0) (actual time=904.720..904.720 rows=101190 loops=1)
                                       Index Cond: ((an = 'ssis_subtype'::text) AND (av = 'subtype1'::text))
               ->  Index Scan using document_pkey on document doc  (cost=0.00..3.60 rows=1 width=4) (actual time=1.762..1.763 rows=1 loops=10153)
                     Index Cond: (doc.id = "outer".doc_id)
         ->  Index Scan using da_pkey on da da_type  (cost=0.00..3.55 rows=1 width=4) (actual time=1.451..1.452 rows=0 loops=10153)
               Index Cond: (("outer".id = da_type.doc_id) AND (da_type.an = 'ssis_type'::text) AND (da_type.av = 'type1'::text))
   ->  Index Scan using da_pkey on da da_group  (cost=0.00..3.55 rows=1 width=4) (actual time=0.028..0.028 rows=0 loops=1072)
         Index Cond: (("outer".id = da_group.doc_id) AND (da_group.an = 'ssis_group'::text) AND (da_group.av = 'group1'::text))
 Total runtime: 50296.288 ms
(24 rows)

Time: 50301.853 ms

Explain analyze of separated out
 
 
dms3_test=> explain analyze select doc.id from document as doc, ssis_client
as sc, ssis_group as sg, ssis_type as st, ssis_subtype as sst where
doc.id=sc.doc_id and doc.id=sg.doc_id and doc.id=st.doc_id and
doc.id=sst.doc_id and sc.value='client1' and sg.value='group1' and
st.value='type1' and sst.value='subtype1';

                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=48763.53..76529.21 rows=108 width=4) (actual time=3947.647..5401.517 rows=140 loops=1)
   ->  Nested Loop  (cost=48763.53..76139.59 rows=108 width=16) (actual time=3947.592..5377.496 rows=140 loops=1)
         ->  Hash Join  (cost=48763.53..71836.53 rows=1019 width=12) (actual time=3947.454..5027.993 rows=993 loops=1)
               Hash Cond: ("outer".doc_id = "inner".doc_id)
               ->  Seq Scan on ssis_group sg  (cost=0.00..22537.39 rows=105085 width=4) (actual time=169.102..1161.228 rows=100924 loops=1)
                     Filter: (value = 'group1'::text)
               ->  Hash  (cost=48739.02..48739.02 rows=9802 width=8) (actual time=3777.630..3777.630 rows=10153 loops=1)
                     ->  Hash Join  (cost=23168.26..48739.02 rows=9802 width=8) (actual time=2108.108..3770.001 rows=10153 loops=1)
                           Hash Cond: ("outer".doc_id = "inner".doc_id)
                           ->  Seq Scan on ssis_client sc  (cost=0.00..22537.39 rows=100706 width=4) (actual time=201.259..1662.239 rows=101159 loops=1)
                                 Filter: (value = 'client1'::text)
                           ->  Hash  (cost=22537.39..22537.39 rows=98349 width=4) (actual time=1904.210..1904.210 rows=101190 loops=1)
                                 ->  Seq Scan on ssis_subtype sst  (cost=0.00..22537.39 rows=98349 width=4) (actual time=234.704..1828.887 rows=101190 loops=1)
                                       Filter: (value = 'subtype1'::text)
         ->  Index Scan using ssis_type_pkey on ssis_type st  (cost=0.00..4.21 rows=1 width=4) (actual time=0.351..0.351 rows=0 loops=993)
               Index Cond: ("outer".doc_id = st.doc_id)
               Filter: (value = 'type1'::text)
   ->  Index Scan using document_pkey on document doc  (cost=0.00..3.60 rows=1 width=4) (actual time=0.169..0.169 rows=1 loops=140)
         Index Cond: (doc.id = "outer".doc_id)
 Total runtime: 5401.958 ms
(20 rows)

Time: 5406.568 ms 
 
 
(originally from http://microjet.ath.cx/WebWiki/2005.12.11_TablePartitioning.html)

2005-12-09

Implementing application-level timeout

I was tasked to fix an app that had been in production for two years. The app was having trouble communicating with a server of a new customer. At times, it would just stuck in read() call even though it was supposed to have timed out in 2 minutes.
So I took a look at the source:
 
public void createConnection()
  {
    socket = Socket.new(/*....*/);
    s.setSocketTimeout(TWO_MINUTES);
    output = socket.getOutputStream();
    input = socket.getInputStream();
  }

  public void sendPing(Socket s) 
  {
    output.write(PING_MESSAGE);
    /* ... */
    input.read();
    /* ... */
  }


This is the wrong way to implement application-level timeout. For, you see, it turned out this particular server was sending a TCP Keep Alive packet every 1 minutes 15 seconds. I have no idea why they did that (the default value should be around 2 hours), but they did anyway, and they were a valuable customer.
Telling the customer, 'you are incapable of configuring server' just wouldn't do, and it was also not totally their fault.


Our fault was in using socket's timeout for application-level timeout. Java's Socket#setSocketTimeout method corresponds to setting SO_RCVTIMEO and SO_SNDTIMEO socket options in BSD Socket API.
Each packet received, whether it contains application level data or not, reset the time out timer. Those Keep Alive packets were preventing the timeout timer from reaching the two minutes mark.

The easiest solution to this problem was to create a timer service. Each time you want to do a socket operation, you register to the timer service and unregister afterward.

import javautils.fun.VoidToVoid;


public class TimerService extends Thread
{
  /* a singleton */
  public synchronized TimerService getInstance() { /* ... */ }

  public void run()
  {
    while (true) {
     mutex.acquire();
     try {
       interruptTimedOutThread();
     } finally {
       mutex.release();
     }
     sleep(ONE_SECOND);
    }
  }

  public void timeout(int timeout_millisecond, VoidToVoid func)
  {
    try {
      register(Thread.current, timeout_millisecond);
      try {
        func.call();
      } finally {
        unregister(Thread.current);      
      }
    /* 
       Convert exceptions caused by interruption outside
       of the register-unregister block because we don't want
       exception handling to be interrupted
    */
    } catch (ClosedByInterruptException e) {
      throw new TimedOutException("timed out", e);
    } catch (InterruptException e) {
      throw new TimedOutException("timed out", e);
    }
  }
}

public class ConnectionToServer 
{
  public void sendPing(Socket s)
  {
    TimerService timer = TimerService.getInstance();
    try {
      timer.timeout(TWO_SECONDS, new VoidToVoid() {
        public void with() 
        {
          output.write(PING_MESSAGE);
          /* ... */
          input.read();
          /* ... */
        }});
    } catch (TimedOutException e) {
      /* ... */
    }
  }
} 
 
(originally from http://microjet.ath.cx/WebWiki/2005.12.09_Implementing_Application-Level_Timeout.html)