CHAPTER 10
Document Interchange
Language identifiers can be based on codes defined by the International
Organization for Standardization in ISO 639 and ISO 3166 (see the Bibliography) or
registered with the Internet Assigned Numbers Authority (IANA, whose Web site is
located at <http://iana.org/>), or they can include codes created for private use. A
language identifier consists of a primary code optionally followed by one or more
subcodes (each preceded by a hyphen). The primary code can be any of the
following:
• A 2-character ISO 639 language code, for example, en for English or es for Spanish
• The letter i, designating an IANA-registered identifier
• The letter x, for private use
The first subcode can be a 2-character ISO 3166 country code, as in en-US, or a
3- to 8-character subcode registered with IANA, as in en-cockney or i-cherokee
(except in private identifiers, for which subcodes are not registered). Subcodes
beyond the first can be any that have been registered with IANA.
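The structure described above can be sketched in Python. This is a minimal illustration, not a validator: the names LanguageId and parse_language_id are hypothetical, the check covers only the shape of the primary code and of the first subcode, it assumes subcodes are made of ASCII letters, and it does not consult the ISO 639, ISO 3166, or IANA registries.

```python
import re
from typing import NamedTuple, Tuple


class LanguageId(NamedTuple):
    """Hypothetical container: primary code plus zero or more subcodes."""
    primary: str
    subcodes: Tuple[str, ...]


def parse_language_id(tag: str) -> LanguageId:
    """Split a language identifier and check its shape only.

    No registry lookup is performed, so a well-formed but unregistered
    code is accepted.
    """
    parts = tag.split("-")
    primary, subcodes = parts[0], tuple(parts[1:])
    lowered = primary.lower()

    # Primary code: a 2-character language code, or the letter i or x.
    if lowered not in ("i", "x") and not re.fullmatch(r"[a-z]{2}", lowered):
        raise ValueError(f"bad primary code: {primary!r}")

    # Every hyphen must be followed by a non-empty subcode.
    if any(not sub for sub in subcodes):
        raise ValueError(f"empty subcode in {tag!r}")

    # First subcode: 2 characters (country code) or 3 to 8 characters.
    # Private-use identifiers (x) are exempt because their subcodes are
    # not registered. Letters-only is an assumption of this sketch.
    if subcodes and lowered != "x":
        if not re.fullmatch(r"[A-Za-z]{2,8}", subcodes[0]):
            raise ValueError(f"bad first subcode: {subcodes[0]!r}")

    return LanguageId(primary, subcodes)
```

For instance, parse_language_id("en-US") yields the primary code en with the single subcode US, while parse_language_id("english") raises ValueError because the primary code is neither 2 characters long nor one of the letters i or x.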
Although language codes are commonly represented using lowercase letters and
country codes are commonly represented using uppercase letters, all tags must be
treated as case insensitive.
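One way to honor the case-insensitivity rule is to compare tags after case folding. The sketch below is illustrative only; same_language_tag and display_form are hypothetical helper names, and display_form merely applies the common convention mentioned above (lowercase language code, uppercase 2-character country code) for presentation purposes.

```python
def same_language_tag(a: str, b: str) -> bool:
    """Compare two language identifiers without regard to letter case."""
    return a.casefold() == b.casefold()


def display_form(tag: str) -> str:
    """Return the tag in its customary letter case (cosmetic only)."""
    parts = tag.split("-")
    out = [parts[0].lower()]
    for index, sub in enumerate(parts[1:]):
        # A 2-character first subcode is assumed to be a country code.
        if index == 0 and len(sub) == 2:
            out.append(sub.upper())
        else:
            out.append(sub.lower())
    return "-".join(out)
```

With these helpers, same_language_tag("en-US", "EN-us") is true, and display_form("EN-us") gives en-US.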
Language Specification Hierarchy
The Lang entry in the document catalog specifies the natural language for all text
in the document except where overridden by language specifications for structure
elements or for marked-content sequences that are not in the structure hierarchy
(for example, within an entirely unstructured document). Examples in this section
illustrate the hierarchical manner in which the language for text in a document is
determined.
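The override rule can be pictured as a lookup that falls back to the document catalog. The following sketch is an assumption-laden illustration rather than a description of any particular PDF library: effective_language is a hypothetical function, and the overrides argument is assumed to list the Lang values found on the enclosing structure elements or marked-content sequences, ordered from outermost to innermost, with None where no language is specified.

```python
from typing import Optional, Sequence


def effective_language(
    catalog_lang: Optional[str],
    overrides: Sequence[Optional[str]] = (),
) -> Optional[str]:
    """Return the language that applies to a piece of text.

    The innermost override that specifies a language wins (an assumption
    of this sketch); otherwise the document catalog's Lang value applies.
    """
    for lang in reversed(overrides):
        if lang is not None:
            return lang
    return catalog_lang
```

For text with no override, effective_language("en-US") returns en-US; for text inside a sequence marked es-MX, effective_language("en-US", [None, "es-MX"]) returns es-MX.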
A language specified for the document as a whole could be overridden by one
specified for a marked-content sequence within a page’s content stream, independent
of any logical structure. In this case, the Lang entry in the document catalog (not
shown) has the value en-US, meaning U.S. English, and it is overridden by the Lang
property attached (with the Span tag) to the marked-content sequence Hasta la vista.
The Lang property identifies the language for this marked-content sequence with the
value es-MX, meaning Mexican Spanish.
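To make the override concrete, the sketch below assembles the content-stream operators for such a marked-content sequence using the BDC, Tj, and EMC operators. It is a hedged illustration: pdf_literal and span_with_lang are hypothetical helpers, the output is assumed to be placed inside an existing text object (between BT and ET) with a font already selected, and the text is assumed to be encodable as a simple literal string.

```python
def pdf_literal(text: str) -> str:
    """Write text as a PDF literal string, escaping \\, ( and )."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return f"({escaped})"


def span_with_lang(text: str, lang: str) -> str:
    """Wrap shown text in a Span marked-content sequence carrying Lang."""
    return "\n".join(
        [
            # Begin the sequence with an inline property list.
            f"/Span << /Lang {pdf_literal(lang)} >> BDC",
            # Show the text (a current font is assumed to be set).
            f"{pdf_literal(text)} Tj",
            # End the marked-content sequence.
            "EMC",
        ]
    )


print(span_with_lang("Hasta la vista.", "es-MX"))
```

The printed fragment begins with /Span << /Lang (es-MX) >> BDC, shows (Hasta la vista.) with Tj, and closes with EMC; text outside the sequence continues to take its language from the document catalog.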