SECTION 10.9 Web Capture
TABLE 10.37 Entries in the Web Capture information dictionary

KEY   TYPE     VALUE
V     number   (Required) The Web Capture version number. For PDF 1.3, the version
               number is 1.0.
               Note: This value is a single real number, not a major and minor
               version number. Thus, for example, a version number of 1.2 would be
               considered greater than 1.15.
C     array    (Optional) An array of indirect references to Web Capture command
               dictionaries (see
               the PDF file. The commands appear in the array in the order in which
               they were executed in building the file.
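
The note on the V entry can be illustrated with a short sketch. It compares two version values as single real numbers, as the note requires, and contrasts that with a major/minor reading, which the note rules out. The function names are hypothetical and exist only for this illustration.

```python
def compare_as_real(a: str, b: str) -> int:
    """Compare two version values as single real numbers (the rule for V).

    Returns 1 if a > b, -1 if a < b, and 0 if they are equal.
    """
    x, y = float(a), float(b)
    return (x > y) - (x < y)


def compare_as_major_minor(a: str, b: str) -> int:
    """Compare as (major, minor) integer pairs.

    This is NOT the rule for V; it is shown only for contrast.
    """
    x = tuple(int(part) for part in a.split("."))
    y = tuple(int(part) for part in b.split("."))
    return (x > y) - (x < y)


# Read as real numbers, 1.2 is greater than 1.15.
print(compare_as_real("1.2", "1.15"))         # 1
# Read as major.minor, 1.15 (minor 15) would outrank 1.2 (minor 2).
print(compare_as_major_minor("1.2", "1.15"))  # -1
```

A reader of the V entry would therefore use the first comparison, not the second.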
10.9.2 Content Database
Web Capture retrieves HTML files from URLs and converts them to PDF. The
resulting PDF file may contain the contents of multiple HTML pages. Conversely,
since HTML pages do not have a fixed size, a single HTML page may give rise to
multiple PDF pages. To keep track of the correspondences, Web Capture maintains
a content database that maps URLs and digital identifiers to PDF objects such as
pages and XObjects. By looking up digital identifiers in the database, Web
Capture can determine whether newly downloaded content is identical to content
already retrieved from a different URL. Thus, it can perform optimizations such
as storing only one copy of an image that is referenced by multiple HTML pages.
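
The lookup described above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the actual Web Capture implementation: the class and method names are hypothetical, the stored bytes stand in for PDF objects, and the digital identifier is assumed to be an MD5 digest of the source bytes, which the text above does not specify.

```python
import hashlib


class ContentDatabase:
    """Toy model: maps URLs and digital identifiers to stored content."""

    def __init__(self):
        self.by_url = {}  # URL -> digital identifier
        self.by_id = {}   # digital identifier -> stored data (stand-in for a PDF object)

    @staticmethod
    def digital_id(data: bytes) -> bytes:
        # Assumption: an MD5 digest of the source bytes serves as the identifier.
        return hashlib.md5(data).digest()

    def add(self, url: str, data: bytes) -> bool:
        """Record that `url` yielded `data`.

        Returns True if a new copy was stored, or False if identical
        content was already present under some other URL.
        """
        ident = self.digital_id(data)
        is_new = ident not in self.by_id
        if is_new:
            self.by_id[ident] = data
        self.by_url[url] = ident
        return is_new


db = ContentDatabase()
image = b"GIF89a placeholder bytes"
print(db.add("http://example.com/a/logo.gif", image))  # True
print(db.add("http://example.com/b/logo.gif", image))  # False
print(len(db.by_id))                                   # 1
```

Both URLs resolve to the same identifier, so only one copy of the image is kept while each URL remains individually addressable.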
Web Capture’s content database is organized into content sets. Each content set
is a dictionary holding information about a group of related PDF objects
generated from the same source data. Content sets are of two subtypes: page sets
and image sets. When Web Capture converts an HTML file to PDF pages, for
example, it creates a page set to hold information about the pages. Similarly,
when it converts a GIF image to one or more image XObjects, it creates an image
set describing those XObjects.
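
As a rough model of the two subtypes, the sketch below represents content sets as Python dataclasses. All class, field, and function names are hypothetical and are not the keys of the actual content set dictionary. Object references are plain strings, and the choice of subtype by media type is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentSet:
    """Group of related objects generated from the same source data."""
    source_urls: List[str] = field(default_factory=list)
    objects: List[str] = field(default_factory=list)  # stand-ins for PDF object references


@dataclass
class PageSet(ContentSet):
    """Content set whose objects are PDF pages."""


@dataclass
class ImageSet(ContentSet):
    """Content set whose objects are image XObjects."""


def make_content_set(media_type: str, url: str, objects: List[str]) -> ContentSet:
    # Assumption: HTML sources yield page sets; any other source is treated as an image.
    cls = PageSet if media_type == "text/html" else ImageSet
    return cls(source_urls=[url], objects=list(objects))


# One HTML page that was split across three PDF pages.
pages = make_content_set("text/html", "http://example.com/index.html",
                         ["page 1", "page 2", "page 3"])
# One GIF image converted to a single image XObject.
logo = make_content_set("image/gif", "http://example.com/logo.gif", ["xobject 1"])
print(type(pages).__name__, len(pages.objects))  # PageSet 3
print(type(logo).__name__, len(logo.objects))    # ImageSet 1
```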
The content set corresponding to a given data source can be accessed in either of
two ways:
• By the URLs from which it was retrieved
• By a digital identifier generated from the source data itself (see “Digital Identi-
The URLS and IDS entries in a PDF document’s name dictionary (see Section 3.6.3,
spectively, to Web Capture content sets. Figure 10.1 shows a simple example. An