PDF Reference, version 1.7


The Unicode standard defines a system for numbering all of the common characters used in a large number of languages. It is a suitable scheme for representing the information content of text, but not its appearance, since Unicode values identify characters, not glyphs. For information about Unicode, see the Unicode Standard by the Unicode Consortium (see the Bibliography).

When extracting character content, a consumer application can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the application. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. Section 5.9.1, “Mapping Character Codes to Unicode Values,” describes in detail the overall algorithm for mapping character codes to Unicode values.

If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:

• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a stream object containing a special kind of CMap file that maps character codes to Unicode values.
• An ActualText entry for a structure element or marked-content sequence (see Section 10.8.3, “Replacement Text”) can be used to specify the text content directly.

5.9.1 Mapping Character Codes to Unicode Values

A consumer application can use the following methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, must provide at least one of these methods (see “Unicode Mapping in Tagged PDF” on page 892):

• If the font dictionary contains a ToUnicode CMap (see Section 5.9.2, “ToUnicode CMaps”), use that CMap to convert the character code to Unicode.
• If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from
