SECTION 10.9 Web Capture
TABLE 10.37 Entries in the Web Capture information dictionary
KEY   TYPE     VALUE
V     number   (Required) The Web Capture version number. For PDF 1.3, the
               version number is 1.0.
               Note: This value is a single real number, not a major and minor
               version number. Thus, for example, a version number of 1.2 would
               be considered greater than 1.15.
C     array    (Optional) An array of indirect references to Web Capture command
               dictionaries (see “Command Dictionaries” on page 957) describing
               commands that were used in building the PDF file. The commands
               appear in the array in the order in which they were executed in
               building the file.
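
The note on the V entry can be illustrated with a short Python sketch. The function names below are hypothetical and exist only for this illustration; the second function shows the major/minor reading that the note rules out, purely for contrast.

```python
def is_newer(candidate: float, baseline: float) -> bool:
    """Compare Web Capture versions as plain real numbers, as the note requires."""
    return candidate > baseline


def is_newer_major_minor(candidate: str, baseline: str) -> bool:
    """The major.minor reading that the note rules out; shown only for contrast."""
    def as_pair(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return as_pair(candidate) > as_pair(baseline)


real_result = is_newer(1.2, 1.15)                  # compares 1.2 with 1.15 as reals
pair_result = is_newer_major_minor("1.2", "1.15")  # compares (1, 2) with (1, 15)
```

Under the real-number reading 1.2 is the greater version; under the major/minor reading it would not be, which is why the distinction matters to a consumer that checks V.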
10.9.2 Content Database
Web Capture retrieves HTML files from URLs and converts them to PDF. The
resulting PDF file may contain the contents of multiple HTML pages. Conversely,
since HTML pages do not have a fixed size, a single HTML page may give rise to
multiple PDF pages. To keep track of the correspondences, Web Capture
maintains a content database that maps URLs and digital identifiers to PDF objects
such as pages and XObjects. By looking up digital identifiers in the database, Web
Capture can determine whether newly downloaded content is identical to content
already retrieved from a different URL. Thus, it can perform optimizations such
as storing only one copy of an image that is referenced by multiple HTML pages.
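
The single-copy optimization described above can be sketched in Python as follows. This is a minimal illustration, not Web Capture's implementation: it assumes an MD5 digest of the raw bytes stands in for the digital identifier, and the function name `store_once` and the object names it generates (`Im1`, `Im2`) are hypothetical.

```python
import hashlib


def store_once(images: dict, data: bytes) -> str:
    """Return the object name for `data`, creating a new one only for new bytes."""
    # Assumption for this sketch: the digest of the raw bytes is the identifier.
    digest = hashlib.md5(data).digest()
    if digest not in images:
        images[digest] = f"Im{len(images) + 1}"  # hypothetical object name
    return images[digest]


images = {}
logo = b"GIF89a...same bytes..."
first = store_once(images, logo)    # image fetched while converting one page
second = store_once(images, logo)   # identical bytes fetched from another URL
other = store_once(images, b"GIF89a...different bytes...")
```

Because the lookup key is derived from the content rather than from the URL, the second retrieval reuses the object created by the first.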
Web Capture’s content database is organized into content sets. Each content set is
a dictionary holding information about a group of related PDF objects generated
from the same source data. Content sets are of two subtypes: page sets and image
sets. When Web Capture converts an HTML file to PDF pages, for example, it
creates a page set to hold information about the pages. Similarly, when it converts a
GIF image to one or more image XObjects, it creates an image set describing
those XObjects.
The content set corresponding to a given data source can be accessed in either of
two ways:
• By the URLs from which it was retrieved
• By a digital identifier generated from the source data itself (see “Digital
Identifiers” on page 950)
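
The two access paths listed above can be modeled with a small Python sketch. Everything here is an assumption made for illustration: the `ContentSet` and `ContentDatabase` classes, their field names, the `"page"` subtype label, the example URLs, and the use of an MD5 digest as the digital identifier. Two ordinary dicts stand in for the lookup structures that a PDF file would use.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContentSet:
    subtype: str                 # "page" or "image" (labels chosen for this sketch)
    digital_id: bytes            # identifier derived from the source data
    urls: list = field(default_factory=list)
    objects: list = field(default_factory=list)


class ContentDatabase:
    def __init__(self):
        self.by_url = {}         # URL -> content set
        self.by_id = {}          # digital identifier -> content set

    def add(self, url, data, subtype, objects):
        """Register `data` retrieved from `url`, reusing a set with the same ID."""
        digital_id = hashlib.md5(data).digest()  # assumption: MD5 of raw bytes
        content_set = self.by_id.get(digital_id)
        if content_set is None:
            content_set = ContentSet(subtype, digital_id, objects=list(objects))
            self.by_id[digital_id] = content_set
        if url not in content_set.urls:
            content_set.urls.append(url)
        self.by_url[url] = content_set
        return content_set


db = ContentDatabase()
html = b"<html>...</html>"
db.add("http://example.com/a.html", html, "page", ["page 1", "page 2"])
db.add("http://mirror.example.com/a.html", html, "page", ["page 1", "page 2"])

via_url = db.by_url["http://mirror.example.com/a.html"]
via_id = db.by_id[hashlib.md5(html).digest()]
```

Both lookups return the same `ContentSet` object, which now records both URLs while holding a single list of generated objects.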
The URLS and IDS entries in a PDF document’s name dictionary (see Section 3.6.3,
“Name Dictionary”) contain name trees mapping URLs and digital identifiers,
respectively, to Web Capture content sets. Figure 10.1 shows a simple example. An