SECTION 3.8
Common Data Structures
using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely
to be a meaningful beginning of a word or phrase).
Note: Applications that process PDF files containing Unicode text strings should be
prepared to handle supplementary characters; that is, characters requiring more
than two bytes to represent.
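As an illustration of the note above, the following minimal sketch shows a supplementary character occupying four bytes (a surrogate pair) in UTF-16BE. It assumes the FE FF byte order marker described earlier in this section; the name decode_text_string is hypothetical and is not part of any PDF library.

```python
BOM = b"\xfe\xff"  # assumed marker that opens a Unicode text string


def decode_text_string(raw: bytes) -> str:
    """Decode a Unicode text string (marker + UTF-16BE) into a Python str.

    Hypothetical helper for illustration only.
    """
    if not raw.startswith(BOM):
        raise ValueError("not a Unicode text string: missing FE FF marker")
    # The utf-16-be codec joins each surrogate pair into one code point,
    # so supplementary characters survive the round trip.
    return raw[len(BOM):].decode("utf-16-be")


# U+1D11E (musical symbol G clef) lies outside the Basic Multilingual Plane,
# so UTF-16BE needs two 16-bit code units (four bytes) to represent it.
clef = "\U0001D11E"
encoded = BOM + clef.encode("utf-16-be")
```

A reader that walks the string two bytes at a time and treats each pair as a complete character would split this character in half, which is the case the note warns about.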
An escape sequence may appear anywhere in a Unicode text string to indicate the
language in which subsequent text is written, which is useful when the language
cannot be determined from the character codes used in the text. The escape
sequence consists of the following elements, in order:
1. The Unicode value U+001B (that is, the byte sequence 0 followed by 27).
2. A 2-character ISO 639 language code, for example, en for English or ja for
Japanese. Character in this context means byte (as in ASCII character), not
Unicode character.
3. (Optional) A 2-character ISO 3166 country code, for example, US for the
United States or JP for Japan.
4. The Unicode value U+001B.
The complete list of codes defined by ISO 639 and ISO 3166 can be obtained
from the International Organization for Standardization (see the Bibliography).
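The escape sequence described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names language_escape and tagged_text_string are hypothetical, the FE FF marker is assumed from earlier in this section, and the codes are only checked for length and ASCII content, not against the ISO lists.

```python
ESC = b"\x00\x1b"  # U+001B as two bytes: 0 followed by 27


def language_escape(lang: str, country: str = "") -> bytes:
    """Build the escape: ESC, language code, optional country code, ESC.

    Each code character is written as a single byte, not as a
    two-byte Unicode character.
    """
    if len(lang) != 2 or not lang.isascii():
        raise ValueError("language code must be 2 ASCII characters")
    if country and (len(country) != 2 or not country.isascii()):
        raise ValueError("country code must be 2 ASCII characters")
    return ESC + lang.encode("ascii") + country.encode("ascii") + ESC


def tagged_text_string(text: str, lang: str, country: str = "") -> bytes:
    """Prefix UTF-16BE text with a language escape (illustrative only)."""
    return b"\xfe\xff" + language_escape(lang, country) + text.encode("utf-16-be")


# English (United States) escape, then a Japanese-tagged two-letter string.
en_us = language_escape("en", "US")
sample = tagged_text_string("Hi", "ja", "JP")
```

Because the escape may appear anywhere in the string, a fuller version would let the caller interleave several escapes with runs of text; the sketch places a single escape at the start for brevity.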
PDFDocEncoded String Type
A PDFDocEncoded string is similar to a string object, but it is a character string
in which each character is represented by a single byte using PDFDocEncoding.
Note that PDFDocEncoding does not support all Unicode characters, whereas
UTF-16BE does.
Note: This type is not a true type. Rather, it is a string type that represents data
encoded using a specific convention.
Byte String Type
The byte string type is used for binary data represented as a series of 8-bit bytes,
where each byte can be any value representable in 8 bits. The string may