SECTION 5.9 Extraction of Text Content
the Adobe standard Latin character set and the set of named characters in the
Symbol font (see Appendix D):
1. Map the character code to a character name according to Table D.1 and the
font’s Differences array.
2. Look up the character name in the Adobe Glyph List (see the Bibliography)
to obtain the corresponding Unicode value.
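The two steps above can be sketched as follows. This is a minimal illustration rather than a complete implementation: `BASE_ENCODING` and `GLYPH_LIST` are tiny hypothetical excerpts standing in for the Table D.1 encoding and the Adobe Glyph List, and `differences` is assumed to be the font's Differences array already parsed into a Python sequence of integers and glyph names.

```python
from typing import Dict, Optional, Sequence, Union

# Hypothetical excerpts for illustration only. A real implementation would
# load the full encoding from Table D.1 and the full Adobe Glyph List.
BASE_ENCODING: Dict[int, str] = {0x41: "A", 0x42: "B", 0x61: "a"}
GLYPH_LIST: Dict[str, int] = {
    "A": 0x0041, "B": 0x0042, "a": 0x0061,
    "Euro": 0x20AC, "bullet": 0x2022,
}


def apply_differences(
    base: Dict[int, str],
    differences: Sequence[Union[int, str]],
) -> Dict[int, str]:
    """Return a copy of `base` overridden by a parsed Differences array.

    Each integer sets the current character code; each name that follows is
    assigned to consecutive codes starting from that value.
    """
    encoding = dict(base)
    code: Optional[int] = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must begin with a code")
            encoding[code] = item
            code += 1
    return encoding


def simple_font_to_unicode(
    char_code: int,
    differences: Sequence[Union[int, str]] = (),
) -> Optional[str]:
    """Map a simple-font character code to a Unicode string, or None."""
    encoding = apply_differences(BASE_ENCODING, differences)
    name = encoding.get(char_code)        # step 1: code -> character name
    if name is None:
        return None
    value = GLYPH_LIST.get(name)          # step 2: name -> Unicode value
    return chr(value) if value is not None else None
```

For example, with a Differences array of `[0x80, "Euro", "bullet"]`, code `0x80` resolves to U+20AC and code `0x81` to U+2022, while a code absent from both tables yields `None`.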
• If the font is a composite font that uses one of the predefined CMaps
(except Identity-H and Identity-V) or whose descendant CIDFont uses the
Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
1. Map the character code to a character identifier (CID) according to the
font’s CMap.
2. Obtain the registry and ordering of the character collection used by the
font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
3. Construct a second CMap name by concatenating the registry and ordering
obtained in step 2 in the format registry-ordering-UCS2 (for example,
Adobe-Japan1-UCS2).
4. Obtain the CMap with the name constructed in step 3 (available from the
ASN Web site; see the Bibliography).
5. Map the CID obtained in step 1 according to the CMap obtained in step 4,
producing a Unicode value.
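The five steps above can be sketched as follows. This is only an illustration under simplifying assumptions: both CMaps are modeled as plain dictionaries (real CMaps are defined by code ranges), `cid_system_info` is assumed to be the CIDSystemInfo dictionary already parsed into a Python dict with `Registry` and `Ordering` keys, and `load_cmap` is a hypothetical caller-supplied function that returns the named CMap or `None`.

```python
from typing import Callable, Mapping, Optional


def ucs2_cmap_name(registry: str, ordering: str) -> str:
    """Step 3: build the registry-ordering-UCS2 CMap name."""
    return f"{registry}-{ordering}-UCS2"


def composite_font_to_unicode(
    char_code: int,
    font_cmap: Mapping[int, int],
    cid_system_info: Mapping[str, object],
    load_cmap: Callable[[str], Optional[Mapping[int, int]]],
) -> Optional[str]:
    """Map a composite-font character code to a Unicode string, or None."""
    # Step 1: character code -> CID, using the font's own CMap.
    cid = font_cmap.get(char_code)
    if cid is None:
        return None

    # Step 2: registry and ordering from the CIDSystemInfo dictionary.
    registry = str(cid_system_info["Registry"])
    ordering = str(cid_system_info["Ordering"])

    # Steps 3 and 4: construct the second CMap name and obtain that CMap.
    ucs2_cmap = load_cmap(ucs2_cmap_name(registry, ordering))
    if ucs2_cmap is None:
        return None

    # Step 5: CID -> Unicode value.
    value = ucs2_cmap.get(cid)
    return chr(value) if value is not None else None


# Made-up values for demonstration; these are not real CMap data.
example_info = {"Registry": "Adobe", "Ordering": "Japan1", "Supplement": 4}
example_font_cmap = {0x0001: 100}
example_cmaps = {"Adobe-Japan1-UCS2": {100: 0x3042}}
example_result = composite_font_to_unicode(
    0x0001, example_font_cmap, example_info, example_cmaps.get
)
```

Passing the loader as a function keeps the sketch independent of where the UCS2 CMap files are actually stored.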
Note: Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1,
Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the
CIDSystemInfo dictionary) must have a supplement number corresponding to the
version of PDF supported by the application. See Table 5.16 on page 446 for a list of
the character collections corresponding to a given PDF version. (Other supplements
of these character collections can be used, but if the supplement is higher-numbered
than the one corresponding to the supported PDF version, only the CIDs in the latter
supplement are considered to be standard CIDs.)
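The parenthetical rule in the note can be sketched as a small check. This is a hedged illustration: it assumes that each supplement extends the previous one, so that the CIDs of a supplement form the range from 0 up to a known count, and `cid_count_by_supplement` is a hypothetical table the caller must supply (the counts in the usage example are made up, not real collection sizes).

```python
from typing import Mapping


def is_standard_cid(
    cid: int,
    font_supplement: int,
    supported_supplement: int,
    cid_count_by_supplement: Mapping[int, int],
) -> bool:
    """Return True if `cid` counts as a standard CID for this application.

    If the font's supplement is higher-numbered than the supported one, only
    CIDs within the supported supplement are treated as standard.
    """
    # Assumption: supplements are cumulative, so the lower-numbered of the
    # two supplements bounds the usable CID range.
    effective = min(font_supplement, supported_supplement)
    return 0 <= cid < cid_count_by_supplement[effective]


# Made-up CID counts for demonstration only.
example_counts = {0: 100, 1: 150, 2: 200}
```

With these made-up counts, a font declaring supplement 2 in an application that supports supplement 1 has CID 120 treated as standard and CID 180 treated as nonstandard.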
If these methods fail to produce a Unicode value, there is no way to determine
what the character code represents.