SECTION 10.7
Tagged PDF
plications all have their own ideas of what constitutes a word. It is not important
for a Tagged PDF document to identify the words within the text stream
according to a single, unambiguous definition that satisfies all of these clients. What is
important is that there be enough information available for each client to make
that determination for itself.
The consumer of a Tagged PDF document finds words by sequentially examining
the Unicode character stream, perhaps augmented by replacement text specified
with ActualText (see Section 10.8.3, “Replacement Text”). The consumer does not
need to guess about word breaks based on information such as glyph positioning
on the page, font changes, or glyph sizes. The main consideration is to ensure that
the spacing characters that would be present to separate words in a pure text
representation are also present in the Tagged PDF.
Note that the identification of what constitutes a word is unrelated to how the text
happens to be grouped into show strings. The division into show strings has no
semantic significance. In particular, a space or other word-breaking character is
still needed even if a word break happens to fall at the end of a show string.
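As a rough illustration of this point, the following Python sketch treats a run of show strings as plain Python strings and concatenates them before looking for words. The function name `words_from_show_strings` and the sample data are hypothetical, and splitting on whitespace is only one of the word definitions a consumer might choose.

```python
def words_from_show_strings(show_strings):
    """Find words in the character stream formed by a sequence of show strings.

    Assumption: each show string has already been mapped to Unicode text.
    The boundaries between show strings are ignored on purpose, because
    they carry no semantic significance.
    """
    stream = "".join(show_strings)
    # Whitespace splitting stands in for whatever word definition
    # a particular consumer applies.
    return stream.split()


# The word break after "Tagged" falls at the end of a show string.
with_space = ["Tagged ", "PDF docu", "ment"]
without_space = ["Tagged", "PDF docu", "ment"]

print(words_from_show_strings(with_space))     # ['Tagged', 'PDF', 'document']
print(words_from_show_strings(without_space))  # ['TaggedPDF', 'document']
```

In the second case the space was omitted at the end of the first show string, so the two words run together even though they were shown by separate operators.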
Note: Some applications may identify words by simply separating them at every
space character. Others may be slightly more sophisticated and treat punctuation
marks such as hyphens or em dashes as word separators as well. Still other
applications may identify possible line-break opportunities by using an algorithm
similar to the one in Unicode Standard Annex #29, Text Boundaries, available from
the Unicode Consortium (see the Bibliography).
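The first two strategies mentioned in the note might be sketched in Python as follows. Both function names are hypothetical, and the set of separator characters in the second function (space, hyphen, em dash) is an assumption chosen for illustration. The point is that each consumer can apply its own definition to the same character stream.

```python
import re


def split_on_spaces(text):
    """Separate words at every space character, dropping empty pieces."""
    return [word for word in text.split(" ") if word]


def split_on_spaces_and_punctuation(text):
    """Also treat hyphens and em dashes (U+2014) as word separators."""
    return [word for word in re.split(r"[ \-\u2014]+", text) if word]


sample = "A well-known format\u2014PDF"

print(split_on_spaces(sample))
# ['A', 'well-known', 'format\u2014PDF'] (the em dash stays inside the last item)
print(split_on_spaces_and_punctuation(sample))
# ['A', 'well', 'known', 'format', 'PDF']
```

Both results are derived from the same text, which only works because the spacing and punctuation characters are actually present in the stream.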
10.7.2 Basic Layout Model
Tagged PDF’s standard structure types and attributes are interpreted in the
context of a basic layout model that describes the arrangement of structure elements
on the page. This model is designed to capture the general intent of the
document’s underlying structure and does not necessarily correspond to the one
actually used for page layout by the application creating the document. (The PDF
content stream specifies the exact appearance.) The goal is to provide sufficient
information for Tagged PDF consumers to make their own layout decisions while
preserving the authoring application’s intent as closely as their own layout models
allow.
Note: The Tagged PDF layout model resembles the ones used in markup languages
such as HTML, CSS, XSL, and RTF, but does not correspond exactly to any of them.