SECTION 10.9
Web Capture
RFC 1321, The MD5 Message-Digest Algorithm; see the Bibliography). The exact
data passed to the algorithm depends on the type of content set and the nature of
the identifier being calculated.
For a page set, the source data is passed to the MD5 algorithm first, followed by
strings representing the digital identifiers of any auxiliary data files (such as
images) referenced in the source data, in the order in which they are first referenced.
(If an auxiliary file is referenced more than once, its identifier is passed only the
first time.) This produces a composite identifier representing the visual
appearance of the pages in the page set. Two HTML source files that are identical, but
for which the referenced images contain different data (for example, if they have
been generated by a script or are pointed to by relative URLs), do not produce the
same identifier.
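The composite calculation described above can be sketched as follows. This is a minimal illustration, not Web Capture's actual implementation: the function name page_set_identifier, the references list, and the aux_files mapping are hypothetical, and the sketch assumes that the "string" representing each auxiliary identifier is the raw 16-byte MD5 digest of that file's data.

```python
import hashlib
from typing import Dict, Sequence


def page_set_identifier(source: bytes,
                        references: Sequence[str],
                        aux_files: Dict[str, bytes]) -> bytes:
    """Hypothetical helper: composite MD5 identifier for a page set.

    source     -- the page set's source data (for example, HTML bytes)
    references -- auxiliary file keys in the order they appear in source
    aux_files  -- maps each key to that auxiliary file's data
    """
    digest = hashlib.md5()
    digest.update(source)          # the source data is passed first

    seen = set()
    for ref in references:
        if ref in seen:            # a repeated reference is passed only once
            continue
        seen.add(ref)
        # Assumption: the identifier string is the raw MD5 digest of the file.
        digest.update(hashlib.md5(aux_files[ref]).digest())

    return digest.digest()
```

With this sketch, two calls that share the same source bytes but supply different data for a referenced image return different identifiers, while repeating a reference later in the list does not change the result.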
Note: When the source data is taken from a PDF file, the identifier is generated
solely from the contents of that file; there is no auxiliary data. (See also implementation
A page set can also have a text identifier, calculated by applying the MD5
algorithm to just the rendered text present in the source data. For an HTML file, for
example, the text identifier is based solely on the text between markup tags; no
images are used in the calculation.
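As a rough illustration of a text identifier for HTML, the sketch below collects the character data between markup tags and hashes it. The names _TextCollector and text_identifier are hypothetical, and the UTF-8 encoding and the absence of any whitespace normalization are assumptions; the text above does not specify how the rendered text is encoded or normalized. For simplicity the sketch also does not exclude the contents of script or style elements.

```python
import hashlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between markup tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_identifier(html: str) -> bytes:
    """Hypothetical helper: MD5 over the text between tags only."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()              # flush any buffered trailing text
    text = "".join(collector.chunks)
    # Assumption: the text is hashed as UTF-8 without normalization.
    return hashlib.md5(text.encode("utf-8")).digest()
```

Because tags and their attributes are ignored, two pages with the same text but different image references yield the same text identifier under this sketch.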
For an image set, the digital identifier is calculated by passing the source data for
the original image to the MD5 algorithm. For example, the identifier for an image
set created from a GIF image is calculated from the contents of the GIF.
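For an image set the calculation reduces to a single MD5 pass over the original image data. A minimal sketch, in which the function name image_set_identifier is hypothetical:

```python
import hashlib


def image_set_identifier(image_data: bytes) -> bytes:
    """Hypothetical helper: MD5 digest of the original image's source data."""
    return hashlib.md5(image_data).digest()
```

The argument is the unmodified contents of the image file (for a GIF, the bytes of the GIF itself), not a decoded or re-encoded form of the image.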
Unique Name Generation
In generating PDF pages from a data source, Web Capture converts items such as
hypertext links and HTML form fields into corresponding named destinations
and interactive form fields. These items must have names that do not conflict
with those of existing items in the file. Also, when updating the file, Web Capture
may need to locate all destinations and fields constructed for a given page set.
Accordingly, each destination or field is given a unique name that is derived from
its original name but constructed so that it avoids conflicts with similarly named
items in other page sets.
Note: As used here, the term name refers to a string, not a name object.