2005-07-05

Escaping URI for CGI

Quick, how do you encode a space in a URI? As '+' or as '%20'?

The answer is the standard 'it depends'.

URIs were originally specified in RFC 1738, which at the time still called them URLs. The specification was revised and renamed to URI in RFC 2396. Because RFC 1738 formalised the syntax before the importance of supporting internationalisation was recognised, RFC 2396 clarifies that, unless communicated otherwise, one may assume a URI to be in the US-ASCII character set.

The RFC acknowledges that a URI may be composed of many components. It uses

<first>/<second>;<third>?<fourth>

as an example of a partial URI that has four components. The "/", ";" and "?" symbols are component separators and are defined by each URI scheme. These particular separators are for example purposes only.
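
To make the four-component shape concrete, here is a small sketch using Python's standard urllib.parse module. The host example.com and the http scheme are assumptions added for illustration; they are not part of the RFC's example.

```python
from urllib.parse import urlparse

# Hypothetical URL built around the <first>/<second>;<third>?<fourth> shape.
parts = urlparse("http://example.com/first/second;third?fourth")

print(parts.path)    # /first/second
print(parts.params)  # third
print(parts.query)   # fourth
```

Here ";" and "?" act as separators because the http scheme treats them that way; another scheme is free to give them no special meaning.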

Most of section 2 of the RFC talks about encoding character data into a URI. It is long and hard to read.
Wouldn't it be simpler if it were presented as bullet points? Anyway, the gist of section 2 is as follows:
 
alpha    = lowalpha | upalpha

lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
           "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
           "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
           "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
           "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
           "8" | "9"

alphanum = alpha | digit

mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
              "$" | ","

unreserved  = alphanum | mark

uric        = reserved | unreserved | escaped

delims      = "<" | ">" | "#" | "%" | <">

  1. The escape syntax is "%" hex hex, e.g. "%20" for a space in the US-ASCII character set.
  2. The characters in delims class MUST be escaped.
  3. The membership of the reserved character class is fluid and depends on the context. "/" could be reserved in one context, and not reserved in another. If you want to use a reserved character in your data, you'd have to escape it.
  4. An implication of the above point is that the membership of the unreserved character class is also fluid. Reserved's losses are unreserved's gains. To continue the example with "/", the "/" character would be added into the unreserved class if "/" is not reserved in a particular context.
  5. All reserved characters must be escaped when they are used as data.
  6. All unreserved characters may be used as-is.
  7. Unreserved characters may also be escaped. For example, "~" (a mark) may be escaped as "%7e".

That's the generic URI spec. It is quite liberal in allowing specific schemes to override the definitions of reserved and unreserved characters.
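
The rules listed above can be sketched as a few lines of Python. This is only an illustration under stated assumptions: the function name escape and its also_safe parameter are made up here, the input is assumed to be US-ASCII, and also_safe stands in for "reserved characters that happen not to be reserved in this context".

```python
import string

# unreserved = alphanum | mark, as in the grammar quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")

def escape(data: str, also_safe: str = "") -> str:
    """Escape data for a URI component (hypothetical helper).

    also_safe lists characters that are not reserved in the current
    context and may therefore be left as-is.
    """
    out = []
    # Assumes US-ASCII input; anything else raises UnicodeEncodeError.
    for byte in data.encode("ascii"):
        ch = chr(byte)
        if ch in UNRESERVED or ch in also_safe:
            out.append(ch)                # may be used as-is
        else:
            out.append("%%%02X" % byte)   # "%" hex hex
    return "".join(out)

print(escape("a b"))                 # a%20b
print(escape("a/b"))                 # a%2Fb
print(escape("a/b", also_safe="/"))  # a/b
```

The also_safe argument is where the fluidity of the reserved class shows up: the same "/" is escaped in one call and left alone in the other.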

This allowance is used by the W3C's application/x-www-form-urlencoded specification to define a different way of encoding values. Specifically, it specifies that the space character must be encoded as '+', and that the original members of the reserved class must be escaped, which effectively removes the fluidity of the reserved class definition.


So, to answer the question of how to encode a space in a URI: in a generic context, as "%20"; when dealing with web forms (and CGI), as "+".
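
For what it's worth, Python's standard library mirrors this split, which makes for a quick way to see the two conventions side by side. The sample strings are arbitrary; quote, quote_plus, unquote and unquote_plus are existing functions in urllib.parse.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Generic URI escaping: a space becomes "%20".
print(quote("a b"))         # a%20b

# Form (application/x-www-form-urlencoded) escaping: a space becomes "+",
# so a literal "+" in the data has to be escaped itself.
print(quote_plus("a b"))    # a+b
print(quote_plus("a+b"))    # a%2Bb

# Decoding has to use the matching convention, or "+" is misread.
print(unquote_plus("a+b"))  # a b
print(unquote("a+b"))       # a+b
```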

Long winded answer.

 (originally from http://microjet.ath.cx/WebWiki/EscapingURIForCgi.html)