CHAPTER 5
Text
The Unicode standard defines a system for numbering all of the common charac-
ters used in a large number of languages. It is a suitable scheme for representing
the information content of text, but not its appearance, since Unicode values
identify characters, not glyphs. For information about Unicode, see the Unicode
Standard by the Unicode Consortium (see the Bibliography).
When extracting character content, a consumer application can easily convert
text to Unicode values if a font’s characters are identified according to a standard
character set that is known to the application. This character identification can
occur if either the font uses a standard named encoding or the characters in the
font are identified by standard character names or CIDs in a well-known collec-
tion. Section 5.9.1, “Mapping Character Codes to Unicode Values,” describes in
detail the overall algorithm for mapping character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown, but the
characters cannot be converted to Unicode values without additional informa-
tion:
• This information can be provided as an optional ToUnicode entry in the font
dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a
stream object containing a special kind of CMap file that maps character codes
to Unicode values.
• An ActualText entry for a structure element or marked-content sequence (see
rectly.
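To make the ToUnicode option more concrete, here is a minimal Python sketch that reads the `beginbfchar` ... `endbfchar` sections of an already-decoded ToUnicode CMap and builds a table from character codes to Unicode text. It assumes the destination values are hex strings holding UTF-16BE, ignores `bfrange` sections and codespace ranges entirely, and runs on a made-up CMap fragment; the function name `parse_bfchar` is hypothetical.

```python
import re

# Made-up fragment of a ToUnicode CMap, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<3A51> <D840DC3E>
endbfchar
"""

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Map each source character code (int) to its Unicode text (str)."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for src_hex, dst_hex in HEX_PAIR.findall(section):
            # Destination bytes are read as UTF-16BE, so a surrogate pair
            # decodes to a single character outside the BMP.
            mapping[int(src_hex, 16)] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return mapping
```

With the sample fragment, code `0x0003` maps to a space and code `0x3A51` maps to the single character U+2003E. A real consumer would also need to handle `bfrange` sections and stream decoding, which this sketch leaves out.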
5.9.1 Mapping Character Codes to Unicode Values
A consumer application can use the following methods, in the priority given, to
map a character code to a Unicode value. Tagged PDF documents, in particular,
must provide at least one of these methods (see “Unicode Mapping in Tagged
• If the font dictionary contains a ToUnicode CMap (see Section 5.9.2,
• If the font is a simple font that uses one of the predefined encodings
MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an
encoding whose Differences array includes only character names taken from