470
CHAPTER 5 Text
The Unicode standard defines a system for numbering all of the common
characters used in a large number of languages. It is a suitable scheme for
representing the information content of text, but not its appearance, since
Unicode values identify characters, not glyphs. For information about Unicode,
see the Unicode Standard by the Unicode Consortium (see the Bibliography).
When extracting character content, a consumer application can easily convert
text to Unicode values if a font’s characters are identified according to a standard
character set that is known to the application. This character identification can
occur if either the font uses a standard named encoding or the characters in the
font are identified by standard character names or CIDs in a well-known
collection. Section 5.9.1, “Mapping Character Codes to Unicode Values,” describes
in detail the overall algorithm for mapping character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown, but the
characters cannot be converted to Unicode values without additional
information:
• This information can be provided as an optional ToUnicode entry in the font
dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a
stream object containing a special kind of CMap file that maps character codes
to Unicode values.
• An ActualText entry for a structure element or marked-content sequence (see
Section 10.8.3, “Replacement Text”) can be used to specify the text content
directly.
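
The lookup described above can be sketched in Python as a rough illustration. This is a minimal sketch under stated assumptions, not the algorithm of Section 5.9.1: the function name code_to_unicode, the two excerpt tables, and the representation of a ToUnicode CMap as an already-parsed dictionary are all hypothetical, and falling back to the encoding when the ToUnicode mapping lacks a code is an assumption made for the example. ActualText is not modeled, because it applies to a structure element or marked-content sequence rather than to an individual character code.

```python
from typing import Dict, Optional

# Hypothetical excerpt of a standard named encoding (character code -> name).
# A real consumer would carry the complete table for each predefined encoding.
WIN_ANSI_EXCERPT: Dict[int, str] = {0x41: "A", 0x80: "Euro", 0xE9: "eacute"}

# Hypothetical excerpt of a standard character name -> Unicode table.
NAME_TO_UNICODE: Dict[str, str] = {
    "A": "\u0041",
    "Euro": "\u20ac",
    "eacute": "\u00e9",
}


def code_to_unicode(
    code: int,
    to_unicode: Optional[Dict[int, str]] = None,
    encoding: Optional[Dict[int, str]] = None,
) -> Optional[str]:
    """Map one character code to Unicode text, or None if it cannot be mapped.

    to_unicode stands in for a parsed ToUnicode CMap (assumed representation);
    encoding stands in for a standard named encoding table.
    """
    # Explicit mapping supplied with the font takes precedence.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # Otherwise go through the standard character name, if one is known.
    if encoding is not None:
        name = encoding.get(code)
        if name is not None:
            return NAME_TO_UNICODE.get(name)
    # The glyph can still be shown, but its character content is unknown.
    return None
```

For example, code_to_unicode(0x80, encoding=WIN_ANSI_EXCERPT) yields the euro sign through the character name Euro, while a code in a font with neither a known encoding nor a ToUnicode mapping yields None.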
5.9.1 Mapping Character Codes to Unicode Values
A consumer application can use the following methods, in the priority given, to
map a character code to a Unicode value. Tagged PDF documents, in particular,
must provide at least one of these methods (see “Unicode Mapping in Tagged
PDF” on page 892):
• If the font dictionary contains a ToUnicode CMap (see Section 5.9.2,
“ToUnicode CMaps”), use that CMap to convert the character code to Unicode.
• If the font is a simple font that uses one of the predefined encodings
MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an
encoding whose Differences array includes only character names taken from