PDF Reference, version 1.7


CHAPTER 10 Document Interchange

URL Strings

URLs associated with Web Capture content sets must be reduced to a predictable, canonical form before being used as keys in the URLS name tree. The following steps describe how to perform this reduction, using terminology from Internet RFCs 1738, Uniform Resource Locators, and 1808, Relative Uniform Resource Locators (see the Bibliography). This algorithm is relevant for HTTP, FTP, and file URLs:

1. If the URL is relative, make it absolute.
2. If the URL contains one or more number sign characters (#), strip the leftmost number sign and any characters after it.
3. Convert the scheme section to lowercase ASCII.
4. If there is a host section, convert it to lowercase ASCII.
5. If the scheme is file and the host is localhost, strip the host section.
6. If there is a port section and the port is the default port for the given protocol (80 for HTTP or 21 for FTP), strip the port section.
7. If the path section contains dot (.) or double-dot (..) subsequences, transform the path as described in section 4 of RFC 1808.

Note: Because the percent character (%) is unsafe according to RFC 1738 and is also the escape character for encoded characters, it is not possible in general to distinguish a URL with unencoded characters from one with encoded characters. For example, it is impossible to decide whether the sequence %00 represents a single encoded null character or a sequence of three unencoded characters. Hence, no number of encoding or decoding passes on a URL can ever cause it to reach a stable state. Empirically, URLs embedded in HTML files have unsafe characters encoded with one encoding pass, and Web servers perform one decoding pass on received paths (though CGI scripts can make their own decisions). Canonical URLs are thus assumed to have undergone one and only one encoding pass. A URL whose initial encoding state is known can be safely transformed into a URL that has undergone only one encoding pass.
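
The seven reduction steps above can be sketched in Python as follows. This is an illustrative sketch, not part of the PDF specification: the names canonicalize_url, remove_dot_segments, and DEFAULT_PORTS are hypothetical, and the example.com URLs are placeholders. Step 1 relies on the standard library's urljoin, which follows the later RFC 3986 resolution rules and may differ from RFC 1808 in edge cases such as unresolvable leading double-dot segments. The sketch does not encode or decode percent escapes; in line with the note above, it assumes the input has already undergone exactly one encoding pass.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Default ports named in step 6 (assumption: only HTTP and FTP are handled).
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def remove_dot_segments(path):
    """Approximate the dot-segment rules of RFC 1808, section 4 (step 7)."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            # Drop "."; a trailing "." leaves a trailing slash.
            if is_last:
                out.append("")
        elif seg == ".." and out and out[-1] != ".." and out != [""]:
            # Remove "<segment>/.." when a preceding segment exists.
            out.pop()
            if is_last:
                out.append("")
        else:
            # Ordinary segment, or a ".." that cannot be resolved (kept).
            out.append(seg)
    return "/".join(out)


def canonicalize_url(url, base=None):
    """Reduce an HTTP, FTP, or file URL to the canonical form sketched above."""
    # Step 1: make a relative URL absolute (needs a caller-supplied base).
    if base is not None:
        url = urljoin(base, url)
    # Step 2: strip the leftmost "#" and everything after it.
    url = url.split("#", 1)[0]
    parts = urlsplit(url)
    # Step 3: lowercase the scheme.
    scheme = parts.scheme.lower()
    # Step 4: lowercase the host, leaving any user info untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, _, port = hostport.partition(":")
    host = host.lower()
    # Step 5: file URLs on localhost lose the host section.
    if scheme == "file" and host == "localhost":
        host = ""
    # Step 6: strip the port if it is the default for the scheme.
    if port.isdigit() and DEFAULT_PORTS.get(scheme) == int(port):
        port = ""
    netloc = userinfo + at + host + (":" + port if port else "")
    # Step 7: resolve "." and ".." path segments.
    path = remove_dot_segments(parts.path)
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

For example, under these assumptions canonicalize_url("HTTP://WWW.Example.COM:80/a/./b/../c.html#sec2") returns "http://www.example.com/a/c.html", and canonicalize_url("file://localhost/tmp/x.pdf") returns "file:///tmp/x.pdf".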

Digital Identifiers

Digital identifiers associated with Web Capture content sets by the IDS name tree are generated using the MD5 message-digest algorithm (described in Internet
