CHAPTER 10 Document Interchange
URL Strings
URLs associated with Web Capture content sets must be reduced to a predictable,
canonical form before being used as keys in the URLS name tree. The following
steps describe how to perform this reduction, using terminology from Internet
RFCs 1738, Uniform Resource Locators, and 1808, Relative Uniform Resource
Locators (see the Bibliography). This algorithm is relevant for HTTP, FTP, and file
URLs:
1. If the URL is relative, make it absolute.
2. If the URL contains one or more number sign characters (#), strip the leftmost
number sign and any characters after it.
3. Convert the scheme section to lowercase ASCII.
4. If there is a host section, convert it to lowercase ASCII.
5. If the scheme is file and the host is localhost, strip the host section.
6. If there is a port section and the port is the default port for the given protocol
(80 for HTTP or 21 for FTP), strip the port section.
7. If the path section contains dot (.) or double-dot (..) subsequences, transform
the path as described in section 4 of RFC 1808.
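The seven steps above can be sketched in Python roughly as follows. This is an illustrative sketch, not a reference implementation: the names canonical_url, remove_dot_segments, and DEFAULT_PORTS are hypothetical, the dot-segment handling is a simplified reading of RFC 1808 section 4, and IPv6 host literals are assumed not to occur.

```python
from urllib.parse import urljoin, urlsplit

# Default ports named in step 6 (hypothetical helper table).
DEFAULT_PORTS = {"http": "80", "ftp": "21"}


def remove_dot_segments(path):
    """Simplified take on RFC 1808 section 4: drop "." and "seg/.." pairs."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        at_root = len(out) == 1 and out[0] == ""
        if seg == ".":
            if last:
                out.append("")  # keep the trailing slash
            continue
        if seg == ".." and out and out[-1] != ".." and not at_root:
            out.pop()  # remove the preceding segment
            if last:
                out.append("")
            continue
        out.append(seg)  # ordinary segment, or ".." that cannot be resolved
    return "/".join(out)


def canonical_url(url, base=None):
    if base is not None:
        url = urljoin(base, url)           # step 1: make absolute
    url = url.split("#", 1)[0]             # step 2: strip leftmost '#' and the rest
    parts = urlsplit(url)
    scheme = parts.scheme.lower()          # step 3: lowercase scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, _, port = hostport.partition(":")  # assumes no IPv6 literal
    host = host.lower()                    # step 4: lowercase host
    if scheme == "file" and host == "localhost":
        host, port = "", ""                # step 5: strip localhost
    if port == DEFAULT_PORTS.get(scheme):
        port = ""                          # step 6: strip default port
    path = remove_dot_segments(parts.path)  # step 7: resolve "." and ".."
    netloc = userinfo + at + host + (":" + port if port else "")
    query = "?" + parts.query if parts.query else ""
    return f"{scheme}://{netloc}{path}{query}"


print(canonical_url("HTTP://WWW.Example.COM:80/a/b/../c/./d.html#frag"))
print(canonical_url("file://localhost/etc/hosts"))
```

Under these assumptions the two calls print http://www.example.com/a/c/d.html and file:///etc/hosts. The sketch deliberately performs no percent-encoding or decoding.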
Note: Because the percent character (%) is unsafe according to RFC 1738 and is
also the escape character for encoded characters, it is not possible in general to
distinguish a URL with unencoded characters from one with encoded characters. For
example, it is impossible to decide whether the sequence %00 represents a single
encoded null character or a sequence of three unencoded characters. Hence, no
number of encoding or decoding passes on a URL can ever cause it to reach a stable
state. Empirically, URLs embedded in HTML files have unsafe characters encoded
with one encoding pass, and Web servers perform one decoding pass on received
paths (though CGI scripts can make their own decisions). Canonical URLs are thus
assumed to have undergone one and only one encoding pass. A URL whose initial
encoding state is known can be safely transformed into a URL that has undergone
only one encoding pass.
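The instability described in the note can be observed with Python's standard urllib.parse functions. The snippet below is only an illustration: the sample path is made up, and to_single_pass is a hypothetical helper that assumes the caller already knows how many encoding passes the input has undergone.

```python
from urllib.parse import quote, unquote

raw = "/a%00b"           # ambiguous: encoded null, or literal '%', '0', '0'?
once = quote(raw)        # each pass re-encodes the '%' itself
twice = quote(once)
print(once, twice)       # the string keeps growing; no fixed point is reached


def to_single_pass(path, passes):
    """Bring a path with a KNOWN number of encoding passes to exactly one."""
    while passes > 1:
        path = unquote(path)   # undo surplus encoding passes
        passes -= 1
    if passes == 0:
        path = quote(path)     # apply the single expected pass
    return path


print(to_single_pass("/my file%.txt", 0))
print(to_single_pass(twice, 2) == once)
```

Without the known pass count, nothing in the string itself indicates whether to encode, decode, or leave it unchanged.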
Digital Identifiers
Digital identifiers associated with Web Capture content sets by the IDS name tree
are generated using the MD5 message-digest algorithm (described in Internet