Unicode Technical Standard #35

Unicode Locale Data Markup Language (LDML)

Version	22.1
Editors	Mark Davis ([email protected]) and other CLDR committee members
Date	2012-10-26
This Version	http://unicode.org/reports/tr35/tr35-29.html
Previous Version	http://unicode.org/reports/tr35/tr35-27.html
Latest Version	http://unicode.org/reports/tr35/
Corrigenda	http://unicode.org/cldr/corrigenda.html
Latest Proposed Update	http://unicode.org/reports/tr35/proposed.html
Namespace	http://cldr.unicode.org/
DTDs	http://unicode.org/cldr/dtd/22.1/
Revision	29

Summary

This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Introduction
- 1.1 Conformance
2 What is a Locale?
3 Unicode Language and Locale Identifiers
- 3.1 Unknown or Invalid Identifiers
- 3.2 BCP 47 Conformance
  - 3.2.1 -u- and -t- Extensions
  - 3.2.2 BCP 47 Language Tag Conversion
- 3.3 Relation to OpenI18n
- 3.4 Compatibility with Older Identifiers
  - 3.4.1 Legacy Variants
  - 3.4.2 Old Locale Extension Syntax
4 Locale Inheritance
- 4.1 Multiple Inheritance
5 XML Format
- 5.1 Common Elements
  - 5.1.1 Escaping Characters
  - 5.1.2 Text Directionality
- 5.2 Common Attributes
  - 5.2.1 Dates and Date Ranges
- 5.3 Identity Elements
  - 5.3.1 Fallback Elements
- 5.4 Display Name Elements
- 5.5 Layout Elements
- 5.6 Character Elements
- 5.7 Delimiter Elements
- 5.8 Measurement Elements
- 5.9 Date Elements
  - 5.9.1 Calendar Elements
  - 5.9.2 Time Zone Names
- 5.10 Number Elements
  - 5.10.1 Number Symbols
  - 5.10.2 Currencies
- 5.11 Unit Elements
- 5.12 POSIX Elements
- 5.13 Reference Elements
- 5.14 Collation Elements
- 5.15 Segmentations
- 5.16 Transforms
- 5.17 Rule-Based Number Formatting
- 5.18 List Patterns
- 5.19 ContextTransform Elements
- 5.20 Metadata Elements
- 5.21 Alias Elements
Appendix A: Sample Special Elements
- A.1 openoffice.org
Appendix B: Transmitting Locale Information
- B.1 Message Formatting and Exceptions
Appendix C: Supplemental Data
- C.1 Supplemental Currency Data
- C.2 Supplemental Territory Containment
- C.3 Supplemental Language Data
- C.4 Supplemental Territory Information
- C.5 Supplemental Calendar Data
- C.6 Measurement System Data
- C.7 Supplemental Time Zone Data
- C.8 Supplemental Character Fallback Data
- C.9 Supplemental Code Mapping
- C.10 Likely Subtags
- C.11 Language Plural Rules
- C.12 Telephone Code Data
- C.13 Numbering Systems
- C.14 Postal Code Validation
- C.15 Calendar Preference Data
- C.16 BCP 47 Keyword Mapping
- C.17 DayPeriod Rules
- C.18 Language Matching
- C.19 Parent Locales
- C.20 Gender of Lists
Appendix D: Language and Locale IDs
Appendix E: Unicode Sets
Appendix F: Date Format Patterns
Appendix G: Number Format Patterns
Appendix H: Choice Patterns
Appendix I: Inheritance and Validity
Appendix J: Time Zone Display Names
Appendix K: Valid Attribute Values
Appendix L: Canonical Form
Appendix M: Coverage Levels
Appendix N: Transform Rules
Appendix O: Lenient Parsing
Appendix P: Supplemental Metadata
Appendix Q: Unicode BCP 47 Extension Data
Appendix R: Property Data
Appendix S: Keyboards
References
Acknowledgments
Modifications

1. Introduction

Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.

The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.

But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)

Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.

This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.

For more information, see the Common Locale Data Repository project page [LocaleProject].

As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.

1.1 Conformance

There are many ways to use the Unicode LDML format and the data in CLDR, and the Unicode Consortium does not restrict the ways in which the format or data are used. However, an implementation may also claim conformance to LDML or to CLDR, as follows:

UAX35-C1. An implementation that claims conformance to this specification shall:

Identify the sections of the specification that it conforms to.
- For example, an implementation might claim conformance to all LDML features except for transforms and segments.
Interpret the relevant elements and attributes of LDML documents in accordance with the descriptions in those sections.
- For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to the Date Field Symbol Table.
Declare which types of CLDR data it uses.
- For example, an implementation might declare that it only uses language names, and those with a draft status of contributed or approved.

UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:

Specify whether Unicode locale extensions are allowed.
Specify the canonical form used for identifiers in terms of casing and field separator characters.

External specifications may also reference particular components of Unicode locale or language identifiers, such as:

Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.

2. What is a Locale?

Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.

The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries, and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.

Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.

Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.

In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Appendix D: Language and Locale IDs.

We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.

3. Unicode Language and Locale Identifiers

Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.

A Unicode language identifier has the following structure (provided in either EBNF (Perl-based) or ABNF [RFC5234]):

	EBNF	ABNF
unicode_language_id	= "root" | unicode_language_subtag (sep unicode_script_subtag)? (sep unicode_region_subtag)? (sep unicode_variant_subtag)*	= "root" / unicode_language_subtag [sep unicode_script_subtag] [sep unicode_region_subtag] *(sep unicode_variant_subtag)
sep	= "-" | "_"	= "-" / "_"

For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all Unicode language identifiers.
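The grammar above can be sketched as a regular expression. The following is a minimal illustration, not a validating parser: the shapes assumed for the individual subtags (2 to 8 letters for the language, 4 letters for the script, 2 letters or 3 digits for the region, and 5 to 8 alphanumerics or a digit followed by 3 alphanumerics for a variant) are assumptions borrowed from the general [BCP47] syntax, and the function name `parse_language_id` is hypothetical. The sketch checks only syntax; it does not check subtags against any registry.

```python
import re

SEP = r"[-_]"
# Assumed subtag shapes (taken from general BCP 47 syntax, not from this section).
LANG = r"[A-Za-z]{2,8}"
SCRIPT = r"[A-Za-z]{4}"
REGION = r"(?:[A-Za-z]{2}|[0-9]{3})"
VARIANT = r"(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})"

# unicode_language_id = "root" | language (sep script)? (sep region)? (sep variant)*
LANGUAGE_ID = re.compile(
    rf"(?:root|(?P<language>{LANG})"
    rf"(?:{SEP}(?P<script>{SCRIPT}))?"
    rf"(?:{SEP}(?P<region>{REGION}))?"
    rf"(?P<variants>(?:{SEP}{VARIANT})*))"
)


def parse_language_id(identifier):
    """Split a Unicode language identifier into fields, or return None if ill-formed."""
    match = LANGUAGE_ID.fullmatch(identifier)
    if match is None:
        return None
    if match.group("language") is None:
        # Only the literal "root" alternative matched.
        return {"language": "root", "script": None, "region": None, "variants": []}
    variants = [v for v in re.split(SEP, match.group("variants")) if v]
    return {
        "language": match.group("language"),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": variants,
    }
```

Both separators are accepted, so `parse_language_id("en_GB")` and `parse_language_id("en-GB")` yield the same fields.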

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure:

	EBNF	ABNF
unicode_locale_id	= unicode_language_id transformed_extensions? unicode_locale_extensions?	= unicode_language_id [transformed_extensions] [unicode_locale_extensions]
unicode_locale_extensions	= sep "u" ((sep keyword)+ | (sep attribute)+ (sep keyword)*)	= sep "u" (1*(sep keyword) / 1*(sep attribute) *(sep keyword))
transformed_extensions	= sep "t" (("-" tlang ("-" tfield)*) | ("-" tfield)+)	= sep "t" (("-" tlang *("-" tfield)) / 1*("-" tfield))
keyword	= key (sep type)?	= key [sep type]
key	= alphanum{2}	= 2alphanum
type	= alphanum{3,8} (sep alphanum{3,8})*	= 3*8alphanum *(sep 3*8alphanum)
attribute	= alphanum{3,8}	= 3*8alphanum
tlang	= unicode_language_subtag ("-" unicode_script_subtag)? ("-" unicode_region_subtag)? ("-" unicode_variant_subtag)*	= unicode_language_subtag ["-" unicode_script_subtag] ["-" unicode_region_subtag] *("-" unicode_variant_subtag)
tfield	= fsep ("-" alphanum{3,8})+	= fsep 1*("-" 3*8alphanum)
fsep	= [A-Z a-z] [0-9]	= ALPHA DIGIT
alphanum	= [0-9 A-Z a-z]	= ALPHA / DIGIT

For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see Appendix D: Language and Locale IDs.

Although not shown in the syntax above, Unicode locale identifiers may also have [BCP47] extensions (other than "u") and private use subtags; these are not, however, relevant to their use in Unicode.

As for terminology, the term code may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the base language code. For example, the base language code for "en-US" (American English) is "en" (English). The type may also be referred to as a value or key-value.

The Unicode locale identifier is based on [BCP47]. However, it differs in the following ways:

It does not allow for the full syntax of [BCP47]:
- No irregular or BCP47 grandfathered tags are allowed
- No extlang subtags are allowed
It allows for certain additions:
- For field separator characters, the "_" character can be used as well as the "-" used in [BCP47].
- "root" to indicate the generic locale used as the parent of all languages in the CLDR data model.
- Defined semantics of certain private use codes, and some "macrolanguage" codes.

The identifiers can vary in case and in the separator characters. The "-" and "_" separators are treated as equivalent. All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [BCP47], especially when a Unicode locale identifier is used for locale data exchange in software protocols. The recommendation is that: the region subtag is in uppercase, the script subtag is in title case, and all other subtags are in lowercase.
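The casing recommendation above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the hypothetical function `canonical_case` identifies the script and region subtags by shape alone (4 letters and 2 letters respectively, before any single-character extension subtag), and it does not validate the identifier.

```python
import re


def canonical_case(identifier, sep="-"):
    """Apply the recommended casing: region upper, script title, all else lower."""
    subtags = re.split(r"[-_]", identifier)
    result = [subtags[0].lower()]  # the language subtag is always lowercase
    in_extension = False
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A single-character subtag (such as "u" or "t") starts an extension.
            in_extension = True
        is_letters = subtag.isascii() and subtag.isalpha()
        if not in_extension and is_letters and len(subtag) == 4:
            result.append(subtag.title())  # script subtag, for example "Cyrl"
        elif not in_extension and is_letters and len(subtag) == 2:
            result.append(subtag.upper())  # region subtag, for example "US"
        else:
            result.append(subtag.lower())  # variants, digits, extension subtags
    return sep.join(result)
```

For example, `canonical_case("UZ-cyrl-uz")` gives "uz-Cyrl-UZ". The `sep` parameter lets a caller choose "_" instead of "-", since the two separators are treated as equivalent.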

Note: The current version of CLDR uses upper case letters for variant subtags in its file names for backward compatibility reasons. This might be changed in future CLDR releases.

Unicode language and locale identifier field values are provided in the following table. Note that some private-use BCP 47 field values are given specific meanings in CLDR.

Field Allowable Characters Sample values

unicode_language_subtag

(also known as a Unicode base language code)

ASCII letters

[BCP47] subtag values marked as Type: language

ISO 639-3 introduces the notion of "macrolanguages", where certain ISO 639-1 or ISO 639-2 codes are given broad semantics, and additional codes are given for the narrower semantics. For backwards compatibility, Unicode language identifiers retain use of the narrower semantics for these codes. For example:

For	Use	Not
Standard Chinese (Mandarin)	`zh`	`cmn`
Standard Arabic	`ar`	`arb`
Standard Malay	`ms`	`zsm`
Standard Swahili	`sw`	`swh`
Standard Uzbek	`uz`	`uzn`
Standard Konkani	`kok`	`knn`

For a full list, see supplementalMetadata.xml. If a language subtag matches the type attribute of a languageAlias element, then the replacement value is used instead. For example, because "swh" occurs in <languageAlias type="swh" replacement="sw"/>, "sw" must be used instead of "swh". Thus Unicode language identifiers use "ar-EG" for Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW" for Mandarin Chinese (Taiwan), not "cmn-TW".

The private use codes from qfz..qtz will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

The supplementalMetadata.xml provides data for normalizing language/locale codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US". For a summary, see Aliases Chart.

unicode_script_subtag

(also known as a Unicode script code)

ASCII letters

[BCP47] subtag values marked as Type: script

In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of cases where it is used are:

`az_Arab`	Azerbaijani in Arabic script
`az_Cyrl`	Azerbaijani in Cyrillic script
`az_Latn`	Azerbaijani in Latin script
`zh_Hans`	Chinese, in simplified script
`zh_Hant`	Chinese, in traditional script

Unicode identifiers give specific semantics to three Unicode Script values [UAX24]:

`Zyyy`	Common
`Qaai`	Inherited	the preferred form is now Zinh
`Zzzz`	Unknown

The private use subtags from Qaaq..Qabx will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

unicode_region_subtag

(also known as a Unicode region code, or a Unicode territory code)

ASCII letters and digits

[BCP47] subtag values marked as Type: region

Unicode identifiers give specific semantics to the following subtags:

	Name	Comment	ISO 3166-1 status
`QO`	Outlying Oceania	countries in Oceania [009] that do not have a subcontinent.	private use
`QU`	European Union	the preferred form is now EU	private use
`UK`	United Kingdom	the correct form is GB	exceptionally reserved
`ZZ`	Unknown or Invalid Territory	used in APIs or as replacement for invalid code	private use

The private use subtags from XA..XZ will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

The supplementalMetadata.xml provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US". For a summary, see Aliases Chart.

Special Codes:

The territory code 'UK' has a special status in ISO, and is used for the domain name instead of GB. It is thus recognized by CLDR as being an alternate (unnormalized) form of 'GB'.
The territory code '001' (the World) is used to indicate a standardized form, such as "ar-001" for Modern Standard Arabic.

unicode_variant_subtag

(also known as a Unicode language variant code)

ASCII letters

[BCP47] subtag values marked as Type: variant

The supplementalMetadata.xml provides data for normalizing variant codes. For a summary, see Aliases Chart.

attribute ASCII letters and digits Currently not used, reserved for future use.

key

ASCII letters and digits

key/type definitions are discussed below. For information on the process for adding new key/type, see [LocaleProject].

All type values except ones used for key "ka" (colAlternate) and "vt" (variableTop) are represented by a single subtag in the current version of CLDR. If the type is not included, and one of the possible type values is "true", then that value is assumed. Note that the default for a key with a possible "true" value is often "false", but may not always be.

type ASCII letters and digits

Examples:

en
fr_BE
de_DE_u_co_phonebk_cu_ddm

A locale that only has a language subtag (and optionally a script subtag) is called a language locale; one with both language and territory subtags is called a territory locale (or country locale).

The following chart contains a set of key values that are currently available, with a description or sampling of type values. Each category is associated with an XML file in the bcp47 directory. For the complete list of valid keys and types defined for Unicode locale extensions, see Appendix Q: Unicode BCP 47 Extension Data.

The BCP47 form is the canonical form, and recommended. Other aliases are included for backwards compatibility.

Key/Type Definitions
category	key (old key name)	key description	type (old type name)	type description
Calendar bcp47/calendar.xml	"ca" (calendar)	Calendar algorithm (For information on the calendar algorithms associated with the data used with these, see [Calendars].)	"buddhist"	Thai Buddhist calendar (same as Gregorian except for the year)
			"chinese"	Traditional Chinese calendar
			…
			"gregory" (gregorian)	Gregorian calendar
			…
Collation bcp47/collation.xml	"co" (collation)	Collation type	"standard"	The default ordering for each language. For root it is based on a modified version of [UCA] order: see 5.14 Collation Elements. Each other locale is based on that, except for appropriate modifications to certain characters for that language.
			"ducet"	The unmodified [UCA] order (Default Unicode Collation Element Table). (The current version of this tailoring does not fully reverse the CLDR root modifications.)
			"search"	A special collation type dedicated for string search.
			Other keywords provide additional choices for certain locales; they only have effect in certain locales.
			…
			"phonetic"	Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use.
			"pinyin"	Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese)
			"reformed"	Reformed collation (such as in Swedish)
			"searchjl"	Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search of the sort obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant.
			…
	For information on each collation setting parameter, from ka to vt, see Setting Options
Currency bcp47/currency.xml	"cu" (currency)	Currency type	ISO 4217 code, plus others in common use	Codes that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The full list of codes, with descriptions, is available in the common/main/en.xml file for each release of CLDR. The list of countries and time periods associated with each currency value is in the common/supplemental/supplementalData.xml file under the <currencyData> element. The XXX code is given a broader interpretation as Unknown or Invalid Currency. For more information, see C.1 Supplemental Currency Data.
Number bcp47/number.xml	"nu" (numbers)	Numbering system	Unicode script subtag	Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".) For more information, see Section C.13 Numbering Systems.
			"arabext"	Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits)
			"armnlow"	Armenian lowercase numerals
			…
			"roman"	Roman numerals
			"romanlow"	Roman lowercase numerals
			"tamldec"	Modern Tamil decimal digits
Time zone bcp47/timezone.xml	"tz" (timezone)	Time zone	Unicode short time zone IDs	Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the file common/bcp47/timezone.xml. The format of that file is specified in Appendix Q: Unicode BCP 47 Extension Data. The short identifiers use UN [LOCODE] codes where possible. Identifiers of length not equal to 5 are used where there is no corresponding LOCODE, such as "usnavajo" for "America/Shiprock Navajo", or "utcw01" for "Etc/GMT+1". There is a special code "unk" for an Unknown or Invalid Timezone. This can be expressed in TZDB syntax as "Etc/Unknown" although it is not defined in [Olson]. The supplementalMetadata.xml provides data for normalizing timezone codes. For a summary, see Aliases Chart.
Locale variant bcp47/variant.xml	"va"	Common variant type	"posix"	POSIX style locale variant

For more information on the allowed keys and types, see the specific elements below, and Appendix Q: Unicode BCP 47 Extension Data.

Additional keys or types might be added in future versions. Implementations of LDML should be robust to handle any syntactically valid key or type values.

3.1 Unknown or Invalid Identifiers

The following identifiers are used to indicate an unknown or invalid code in Unicode language and locale identifiers. For Unicode identifiers, the region code uses a private use ISO 3166 code, and Time Zone code uses an additional code; the others are defined by the relevant standards. When these codes are used in APIs connected with Unicode identifiers, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.

Code Type	Value	Description in Referenced Standards
Language	`und`	Undetermined language
Script	`Zzzz`	Code for uncoded script, Unknown [UAX24]
Region	`ZZ`	Unknown or Invalid Territory
Currency	`XXX`	The codes assigned for transactions where no currency is involved
Time Zone	`unk`	Unknown or Invalid Time Zone

When only the script or region is known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.

3.1.1 Numeric Codes

For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the codes 900-999 (total: 100), and AAA, QMA-QZZ, XAA-XZZ, and ZZZ (total: 1092). Unicode identifiers supply a standard mapping to these: for the numeric codes, it uses the top of the numeric private use range; for the 3-letter codes it doubles the final letter. These are the resulting mappings for all of the private use region codes:

Region	UN/ISO Numeric	ISO 3-Letter
`AA`	`958`	`AAA`
`QM..QZ`	`959..972`	`QMM..QZZ`
`XA..XZ`	`973..998`	`XAA..XZZ`
`ZZ`	`999`	`ZZZ`
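The region mapping in the table above can be reproduced mechanically. The following sketch, with the hypothetical function name `private_use_region_codes`, lists the private use region codes in table order, assigns numeric codes upward from 958, and doubles the final letter to form the three-letter code. It only restates the table and is not a CLDR API.

```python
from string import ascii_uppercase
from typing import Tuple

# Private use region codes in the order of the table: AA, QM..QZ, XA..XZ, ZZ.
PRIVATE_USE_REGIONS = (
    ["AA"]
    + ["Q" + letter for letter in "MNOPQRSTUVWXYZ"]
    + ["X" + letter for letter in ascii_uppercase]
    + ["ZZ"]
)


def private_use_region_codes(region: str) -> Tuple[str, str]:
    """Return (numeric code, three-letter code) for a private use region code."""
    region = region.upper()
    if region not in PRIVATE_USE_REGIONS:
        raise ValueError(f"{region!r} is not a private use region code")
    # Numeric codes fill the top of the private use range, ending at 999 for ZZ.
    numeric = 958 + PRIVATE_USE_REGIONS.index(region)
    # The three-letter code doubles the final letter.
    return str(numeric), region + region[-1]
```

For example, `private_use_region_codes("XA")` returns `("973", "XAA")`, matching the first entry of the `XA..XZ` row.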

For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):

Script	Numeric
`Qaaa..Qabx`	`900..949`

3.2 BCP 47 Conformance

Unicode language and locale identifiers inherit the design and the repertoire of subtags from [BCP47] Language Tags. There are some extensions and restrictions made for the use of identifiers in CLDR.

3.2.1 -u- and -t- Extensions

[BCP47] Language Tags provides a mechanism for extending language tags for use in various applications by extension subtags. Each extension subtag is identified by a single alphanumeric character subtag assigned by IANA. The Unicode Consortium has registered the character 'u' for Unicode locale extensions, and the character 't' for Transformed Content extensions.

The complete list of Unicode locale extension subtags is defined by Appendix Q: Unicode BCP 47 Extension Data. These subtags are all in lowercase (the canonical casing for these subtags); however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions consist of two to eight alphanumeric characters and meet the rule extension in the [BCP47] specification.

The -u- Extension. The syntax of 'u' extension subtags is defined by the rule unicode_locale_extensions in Unicode locale identifier, except that the subtag separator sep must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag.

A 'u' extension may contain multiple attributes or keywords as defined in Unicode locale identifier. Although the order of attributes or keywords does not matter, this specification defines the canonical form as below:

All attributes are sorted in alphabetical order.
All keywords are sorted by alphabetical order of keys.
All keywords are in lowercase.
All keys and types use the canonical form (from the name attribute; see Appendix Q).

For example, the canonical form of 'u' extension "u-foo-bar-nu-thai-ca-buddhist" is "u-bar-foo-ca-buddhist-nu-thai". The attributes "foo" and "bar" in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification.
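The ordering and casing rules above can be sketched as follows. This is a minimal illustration, assuming the input is a syntactically valid 'u' extension; the function name `canonicalize_u_extension` is hypothetical. It does not map old key or type names to their canonical names, since that requires the data in Appendix Q.

```python
def canonicalize_u_extension(extension):
    """Sort attributes and keywords of a 'u' extension and lowercase all subtags."""
    subtags = extension.lower().replace("_", "-").split("-")
    if subtags[0] != "u":
        raise ValueError("not a 'u' extension: %r" % extension)

    attributes = []  # subtags of 3 to 8 characters before the first key
    keywords = {}    # key (2 characters) -> list of type subtags
    current_key = None
    for subtag in subtags[1:]:
        if len(subtag) == 2:
            # A two-character subtag is a key; a repeated key overwrites
            # the earlier one in this sketch.
            current_key = subtag
            keywords[current_key] = []
        elif current_key is None:
            attributes.append(subtag)
        else:
            keywords[current_key].append(subtag)

    parts = ["u"] + sorted(attributes)
    for key in sorted(keywords):
        parts.append(key)
        parts.extend(keywords[key])  # type subtags keep their original order
    return "-".join(parts)
```

Applied to the example in the text, `canonicalize_u_extension("u-foo-bar-nu-thai-ca-buddhist")` returns "u-bar-foo-ca-buddhist-nu-thai".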

The -t- Extension. The syntax of 't' extension subtags is defined by the rule transformed_extensions in Unicode locale identifier, except that the subtag separator sep must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag. For information about the registration process, meaning, and usage of the 't' extension, see [RFC6497].

3.2.2 BCP 47 Language Tag Conversion

A Unicode language/locale identifier can be converted to a valid [BCP47] language tag by performing the following transformation.

Replace the "_" separators with "-"
Replace the special language identifier "root" with the BCP 47 primary language tag "und"

For example,

en_US → en-US
de_DE_u_co_phonebk → de-DE-u-co-phonebk
root → und
root_u_cu_usd → und-u-cu-usd

A valid [BCP47] language tag can be converted to a valid Unicode language/locale identifier by performing the following transformation.

Canonicalize the language tag (afterwards, there will be no extlang subtag)
Replace the BCP 47 primary language subtag "und" with "root" if no script, region, or variant subtags are present
If the BCP 47 primary language subtag matches the type attribute of a languageAlias element in supplementalMetadata.xml, replace the language subtag with the replacement value.
If the BCP 47 region subtag matches the type attribute of a territoryAlias element in supplementalMetadata.xml, replace the region subtag with the replacement value. (When multiple replacement values are available, use the first one.)

For example,

en-US → en-US (no changes)
und → root
und-US → und-US (no changes, because region subtag is present)
und-u-cu-USD → root-u-cu-usd
cmn-TW → zh-TW (language alias)
sr-CS → sr-RS (territory alias)
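A rough sketch of the reverse conversion is shown below. Several things here are assumptions made for illustration: the function and table names are hypothetical, the input tag is assumed to be already canonicalized (the first step is not implemented), and the two alias tables contain only sample entries mirroring the examples rather than the contents of supplementalMetadata.xml. The second replacement value listed for "CS" is likewise a placeholder, included to show that the first value is used.

```python
# Illustrative alias samples only; real data comes from supplementalMetadata.xml.
LANGUAGE_ALIASES = {"cmn": "zh"}
TERRITORY_ALIASES = {"CS": ["RS", "ME"]}   # first replacement value wins


def is_region(subtag):
    """A region subtag is two letters or three digits."""
    return (len(subtag) == 2 and subtag.isalpha()) or (
        len(subtag) == 3 and subtag.isdigit())


def bcp47_to_unicode(tag):
    """Convert an already-canonicalized BCP 47 tag to a Unicode locale id."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    rest = subtags[1:]

    # Everything from the first singleton onward belongs to extensions.
    ext_start = next((i for i, s in enumerate(rest) if len(s) == 1), len(rest))
    core = rest[:ext_start]                       # script, region, variants
    extensions = [s.lower() for s in rest[ext_start:]]

    # "und" becomes "root" only if no script, region, or variant is present.
    if language == "und" and not core:
        language = "root"
    language = LANGUAGE_ALIASES.get(language, language)

    new_core = []
    for subtag in core:
        if is_region(subtag):
            region = subtag.upper()
            subtag = TERRITORY_ALIASES.get(region, [region])[0]
        new_core.append(subtag)
    return "-".join([language] + new_core + extensions)
```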

Note: In some rare cases, BCP 47 language tags cannot be converted to valid Unicode language/locale identifiers, such as certain [BCP 47] grandfathered tags.

3.3 Relation to OpenI18n

The locale id format generally follows the description in the OpenI18N Locale Naming Guideline [NamingGuideline], with some enhancements. The main differences from those guidelines are that the locale id:

does not include a charset, since the data in LDML format always provides a representation of all Unicode characters (the repository is stored in UTF-8, although that can be transcoded to other encodings as well),
adds the ability to have a variant, as in Java,
adds the ability to discriminate the written language by script (or script variant),
is a superset of [BCP47] codes.

3.4 Compatibility with Older Identifiers

LDML versions before 1.7.2 used a slightly different syntax for variant subtags and locale extensions. Implementations of LDML may provide backward-compatible identifier support as described in the following sections.

3.4.1 Legacy Variants

Older versions of the LDML specification allowed codes other than registered [BCP47] variant subtags to be used in Unicode language and locale identifiers for representing variations of locale data. Unicode locale identifiers that include such variant codes can be converted to the new [BCP47]-compatible identifiers by following the descriptions below:

Legacy Variant Mappings
Variant Code	Description
AALAND	Åland, variant of "sv" Swedish used in Finland. Use "sv_AX" to indicate this.
BOKMAL	Bokmål, variant of "no" Norwegian. Use primary language subtag "nb" to indicate this.
NYNORSK	Nynorsk, variant of "no" Norwegian. Use primary language subtag "nn" to indicate this.
POSIX	POSIX variation of locale data. Use Unicode locale extension "-u-va-posix" to indicate this.
POLYTONI	Polytonic, variant of "el" Greek. Use [BCP47] variant subtag "polyton" to indicate this.
SAAHO	The Saaho variant of Afar. Use primary language subtag "ssy" to indicate this.

3.4.2 Old Locale Extension Syntax

Versions 1.7 and older of the LDML specification used a different syntax for representing Unicode locale extensions. The previous definition of Unicode locale extensions had the following structure:

	EBNF	ABNF
old_unicode_locale_extensions	= "@" old_key "=" old_type (";" old_key "=" old_type)*	= "@" old_key "=" old_type *(";" old_key "=" old_type)

The new specification requires keys to be two alphanumeric characters and types to be three to eight alphanumeric characters. As a result, new codes were assigned to all existing keys and some types. For example, the new key "co" replaced the previous key "collation", and the new type "phonebk" replaced the previous type "phonebook". However, the existing collation type "big5han" already satisfied the new requirement, so no new type code was assigned to it. The chart below shows some example mappings between the new syntax and the old syntax.

Locale Extension Mappings
Old (LDML 1.7 or older)	New
de_DE@collation=phonebook	de_DE_u_co_phonebk
zh_Hant_TW@collation=big5han	zh_Hant_TW_u_co_big5han
th_TH@calendar=gregorian;numbers=thai	th_TH_u_ca_gregory_nu_thai
en_US_POSIX@timezone=America/Los_Angeles	en_US_u_tz_uslax_va_posix

For more information about the key/type definitions and their old code mappings, see Appendix Q: Unicode BCP 47 Extension Data.
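The mappings in the chart can be reproduced with a small converter. This is a sketch under stated assumptions: the function name is hypothetical, the two lookup tables hold only the keys and types that appear in the chart above (the full mappings are in Appendix Q), and the legacy POSIX variant is handled as described in 3.4.1. Keywords are emitted sorted by key, matching the chart.

```python
# Only the mappings visible in the chart; see Appendix Q for the full data.
KEY_MAP = {"collation": "co", "calendar": "ca", "numbers": "nu", "timezone": "tz"}
TYPE_MAP = {
    "phonebook": "phonebk",
    "gregorian": "gregory",
    "America/Los_Angeles": "uslax",
}


def convert_old_locale_id(old_id):
    """Convert an old-style "@key=type;..." locale id to the new syntax."""
    base, _, old_extensions = old_id.partition("@")
    keywords = {}

    # The legacy POSIX variant becomes the keyword "va" with type "posix".
    if base.endswith("_POSIX"):
        base = base[: -len("_POSIX")]
        keywords["va"] = "posix"

    for pair in filter(None, old_extensions.split(";")):
        old_key, _, old_type = pair.lstrip("@").partition("=")
        # Types that already satisfy the new syntax are kept unchanged.
        keywords[KEY_MAP[old_key]] = TYPE_MAP.get(old_type, old_type)

    if not keywords:
        return base
    parts = [base, "u"]
    for key in sorted(keywords):
        parts += [key, keywords[key]]
    return "_".join(parts)
```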

4. Locale Inheritance

The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is based on the Unicode Collation Algorithm order (see 5.14 Collation Elements). Since English language collation has the same ordering, the 'en' locale data does not need to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.

Given a particular locale id "en_IE_someVariant", the search chain for a particular resource is the following.

en_IE_someVariant
en_IE
en
root
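The chain above is obtained by repeatedly dropping the last subtag and finishing with root. A minimal sketch follows, assuming a locale id without keywords and "_" as the separator; the function name search_chain is hypothetical.

```python
def search_chain(locale_id):
    """Return the lookup chain from a locale id up to root, by truncation."""
    if locale_id == "root":
        return ["root"]
    chain = []
    subtags = locale_id.split("_")
    while subtags:
        chain.append("_".join(subtags))
        subtags.pop()          # drop the last subtag to get the parent
    chain.append("root")
    return chain
```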

If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.

Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data, while "en_IE" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance. At the script or region level, the "primary" child locale will be empty, since its parent will contain all of the appropriate resources for it. For more information see Appendix P.3 Default Content.

If a language has more than one script in customary modern use, then the CLDR file structure in common/main uses the following model:

lang
lang_script
lang_script_region
lang_region (aliases to lang_script_region)

There are actually two different kinds of fallback: resource bundle lookup and resource item lookup. For the former, a process is looking to find the first, best resource bundle it can; for the latter, it is fallback within bundles on individual items, like the translated name for the region "CN" in Breton. These are closely related, but distinct, processes. Below "key" stands for zero or more key/type pairs.

Lookup Differences
Lookup Type	Example	Comments
Resource bundle lookup	se-FI → se → default* → root	* default may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting that chain, resulting in: se-FI → se → fi → en-GB → en → root
Resource item lookup	se-FI+key → se+key → root_alias*+key → root+key	* if there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. se-FI+key → se+key → root_alias+key → se-FI+key2 → se+key2 → root_alias+key2 → root+key2

The fallback is a bit different for these two cases; internal aliases and keys are not involved in the bundle lookup, and the default locale is not involved in the item lookup. Moreover, the resource item lookup must remain stable, because the resources are built with a certain fallback in mind; changing the core fallback order can render the bundle structure incoherent. Resource bundle lookup, on the other hand, is more flexible; changes in the view of the "best" match between the input request and the output bundle are more tolerable when they represent overall improvements for users. For more information, see Section 5.3.1 Fallback Elements.

Where the LDML inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.

For a more complete description of how inheritance applies to data, and the use of keywords, see Appendix I: Inheritance and Validity.

The locale data does not contain general character properties that are derived from the Unicode Character Database [UAX44]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.

Warning: If a locale has a different script than its parent (for example, sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.

Empty Override: There is one special value reserved in LDML to indicate that a child locale is to have no value for a path, even if the parent locale has a value for that path. That value is "∅∅∅". For example, if there is no phrase for "two days ago" in a language, that can be indicated with:

<field type="day">
  <relative type="-2">∅∅∅</relative>
</field>
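The effect of the empty override on item lookup can be sketched as below. Everything in this example is hypothetical: the locale ids "xx" and "xx_YY", the simplified path string, and the stored phrase are placeholders, and bundles are modeled as plain dictionaries.

```python
NO_VALUE = "\u2205\u2205\u2205"   # the reserved value "∅∅∅"


def lookup(bundles, chain, path):
    """Return the first value for path along chain, or None if there is none."""
    for locale_id in chain:
        value = bundles.get(locale_id, {}).get(path)
        if value == NO_VALUE:
            # The child explicitly has no value; do not consult the parents.
            return None
        if value is not None:
            return value
    return None


PATH = "field[day]/relative[-2]"          # simplified, illustrative path
BUNDLES = {
    "root": {},
    "xx": {PATH: "(phrase for two days ago)"},
    "xx_YY": {PATH: NO_VALUE},
}
```

With this data, a lookup along ["xx", "root"] finds the phrase, while a lookup along ["xx_YY", "xx", "root"] returns None because the override blocks inheritance.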

4.1 Multiple Inheritance

In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols; the Buddhist calendar inherits from the Gregorian calendar. This only happens where documented in this specification. In these special cases, the inheritance functions as normal, up to the root. If the data is not found along that path, then a second search is made, logically changing the element/attribute to the alternate values.

For example, for the locale "en_US" the month data in <calendar type="buddhist"> inherits first from <calendar type="buddhist"> in "en", then in "root". If not found there, then it inherits from <calendar type="gregorian"> in "en_US", then "en", then in "root".
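This two-pass search can be sketched as follows, assuming bundles are dictionaries keyed by (calendar, item) pairs; the function name, the data layout, and the sample strings are hypothetical.

```python
def lookup_calendar_item(bundles, chain, calendar, item,
                         fallback_calendar="gregorian"):
    """Search the whole chain for calendar first, then for the fallback."""
    for candidate in (calendar, fallback_calendar):
        for locale_id in chain:
            value = bundles.get(locale_id, {}).get((candidate, item))
            if value is not None:
                return value
    return None


BUNDLES = {
    "root": {("gregorian", "months"): "root gregorian months"},
    "en": {("gregorian", "months"): "en gregorian months"},
    "en_US": {},
}
CHAIN = ["en_US", "en", "root"]
```

Here a request for the Buddhist months finds nothing in the first pass and falls back to the Gregorian months in "en". If "root" did contain Buddhist months, they would be returned instead, because the first pass runs all the way to root before the second pass begins.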

5 XML Format

There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.

For example, the language-dependent data for Japanese in CLDR is present in the following files:

common/collation/ja.xml
common/main/ja.xml
common/rbnf/ja.xml
common/segmentations/ja.xml

The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it was in a single file.

Supplemental data relating to Japan or the Japanese writing system can be found in:

common/supplemental/supplementalData.xml
common/transforms/Hiragana-Katakana.xml
common/transforms/Hiragana-Latin.xml
...

The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the DTD, listed at the top of this document; however, the DTD does not describe all the constraints on the structure.

To start with, the root element is <ldml>, with the following DTD entry:

<!ELEMENT ldml (identity, (alias |(fallback*, localeDisplayNames?, layout?, contextTransforms?, characters?, delimiters?, measurement?, dates?, numbers?, units?, listPatterns?, collations?, posix?, segmentations?, rbnf?, metadata?, references?, special*))) >

That element contains the following elements:

<identity>
<fallback>
<localeDisplayNames>
<layout>
<contextTransforms>
<characters>
<delimiters>
<measurement>
<dates>
<numbers>
<units>
<listPatterns>
<collations>
<posix>
<segmentations>
<rbnf>
<metadata>
<references>

The structure of each of these elements and their contents will be described below. The first few elements have little structure, while dates, numbers, and collations are more involved.

The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged. In most cases, an alternate structure is provided for expressing the information.

In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.

There are two kinds of elements in LDML: rule elements and structure elements. For structure elements, there are restrictions to allow for effective inheritance and processing:

There is no "mixed" content: if an element has textual content, then it cannot contain any elements.
The [XPath] leading to the content is unique; no two different pieces of textual content have the same [XPath].

Rule elements do not have this restriction, but also do not inherit, except as an entire block. The rule elements are listed in serialElements in the supplemental metadata. See also Appendix I: Inheritance and Validity. For more technical details, see Updating-DTDs.

Note that the data in examples given below is purely illustrative, and does not match any particular language. For a more detailed example of this format, see [Example]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.

In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is listed as a serialElement, or has a distinguishing attribute, it can only occur once as a subelement of a given element. Thus, for example, two sibling elements with the same name and identical attribute values are illegal even though the DTD allows them.

There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an alt attribute).

In general, LDML data should be in NFC format. However, certain elements may need to contain characters that are not in NFC, including exemplars, transforms, segmentations, and p/s/t/i/pc/sc/tc/ic rules in collation. These elements must not be normalized (either to NFC or NFD), or their meaning may be changed. Thus LDML documents must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).

Lists, such as singleCountries, are space-delimited; that is, their items are separated by one or more XML whitespace characters. This applies to:

singleCountries
preferenceOrdering
references
validSubLocales

5.1 Common Elements

At any level in any element, two special elements are allowed.

The <special> element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute, which specifies the XML namespace of the special data. For example, the following uses the version 1.0 POSIX special element.

<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.0/ldml.dtd" [
    <!ENTITY % posix SYSTEM "http://unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
%posix;
]>
<ldml>
...
<special xmlns:posix="http://www.opengroup.org/regproducts/xu.htm">
        <!-- old abbreviations for pre-GUI days -->
        <posix:messages>
            <posix:yesstr>Yes</posix:yesstr>
            <posix:nostr>No</posix:nostr>
            <posix:yesexpr>^[Yy].*</posix:yesexpr>
            <posix:noexpr>^[Nn].*</posix:noexpr>
        </posix:messages>
    </special>
</ldml>

The alias element used to be a common element. Its usage has since been modified so that it only occurs in root. See 5.21 Alias Elements for more information.

Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.

  <numberFormat>
    <displayName>Prozentformat</displayName>
...
  </numberFormat>

Where present, the display names must be unique; that is, two distinct codes must not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information. The value of the choice attribute must match the value of the type attribute of the selected item.

<timeFormats>
  <default choice="medium" /> 
  <timeFormatLength type="full">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern> 
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="long">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a z</pattern> 
    </timeFormat>
  </timeFormatLength>
  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">h:mm:ss a</pattern> 
    </timeFormat>
  </timeFormatLength>
...

Like all other elements, the <default> element is inherited. Thus, it can also refer to inherited resources. For example, suppose that the above resources are present in fr, and that in fr_BE we have the following:

<timeFormats>
  <default choice="long"/>
</timeFormats>

In that case, the default time format for fr_BE would be the inherited "long" resource from fr. Now suppose that we had in fr_CA:

  <timeFormatLength type="medium">
    <timeFormat type="standard">
      <pattern type="standard">...</pattern> 
    </timeFormat>
  </timeFormatLength>

In this case, the <default> is inherited from fr, and has the value "medium". It thus refers to this new "medium" pattern in this resource bundle.
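The interaction between an inherited <default> and inherited patterns can be sketched with dictionaries standing in for the three bundles. The data layout and function names are hypothetical, and because the fr_CA pattern is elided in the example above, a placeholder string is used for it.

```python
BUNDLES = {
    "root": {},
    "fr": {
        "default": "medium",
        "patterns": {
            "full": "h:mm:ss a z",
            "long": "h:mm:ss a z",
            "medium": "h:mm:ss a",
        },
    },
    "fr_BE": {"default": "long"},
    "fr_CA": {"patterns": {"medium": "(fr_CA medium pattern)"}},
}


def chain_for(locale_id):
    """Locale, then its language (if different), then root."""
    language = locale_id.split("_")[0]
    chain = [locale_id]
    if language != locale_id:
        chain.append(language)
    chain.append("root")
    return chain


def default_time_pattern(locale_id):
    chain = chain_for(locale_id)
    # The choice and the pattern are each resolved along the chain separately.
    choice = next(BUNDLES[loc]["default"] for loc in chain
                  if "default" in BUNDLES[loc])
    return next(BUNDLES[loc]["patterns"][choice] for loc in chain
                if choice in BUNDLES[loc].get("patterns", {}))
```

For fr_BE the choice "long" comes from fr_BE while the pattern comes from fr; for fr_CA the choice "medium" comes from fr while the pattern comes from fr_CA.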

5.1.1 Escaping Characters

Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content. These escapes are only allowed in certain elements, according to the DTD.

Escaping Characters
Code Point	XML Example
`U+0000`	`<cp hex="0">`
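A reader of LDML data might expand such escapes roughly as follows. This sketch uses Python's standard xml.etree.ElementTree module; the wrapper element name "example" and the function name are hypothetical, since which elements permit <cp> is determined by the DTD.

```python
import xml.etree.ElementTree as ET


def text_with_escapes(element):
    """Concatenate element content, expanding <cp hex="..."/> children."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "cp":
            # The hex attribute holds the code point in hexadecimal.
            parts.append(chr(int(child.get("hex"), 16)))
        parts.append(child.tail or "")
    return "".join(parts)


sample = ET.fromstring('<example>a<cp hex="0"/>b</example>')
expanded = text_with_escapes(sample)
```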

5.1.2 Text Directionality

The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded.

For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.

Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.

5.2 Common Attributes

<... type="stroke" ...>

The attribute type is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element. For example:

<ldml>
  ...
  <currencies>
    <currency>...</currency>
    <currency type="preEuro">...</currency>
  </currencies>
</ldml>

<... draft="unconfirmed" ...>

If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value), as per the following:

approved: fully approved by the technical committee (equals the CLDR 1.3 value of false, or an absent draft attribute). This does not mean that the data is guaranteed to be error-free—this is the best judgment of the committee.
contributed: partially approved by the technical committee.
provisional: partially confirmed. Implementations may choose to accept the provisional data, especially if there is no translated alternative.
unconfirmed: no confirmation available.

For more information on precisely how these values are computed for any given release, see Data Submission and Vetting Process on the CLDR website.

Normally draft attributes should only occur on "leaf" elements. For a more formal description of how elements are inherited, and what their draft status is, see Appendix I: Inheritance and Validity.

<... alt="descriptor" ...>

This attribute labels an alternative value for an element. The descriptor indicates what kind of alternative it is, and takes one of the following forms:

variantname meaning that the value is a variant of the normal value, and may be used in its place in certain circumstances. If a variant value is absent for a particular locale, the normal value is used. The variant mechanism should only be used when such a fallback is acceptable.
proposed, optionally followed by a number, indicating that the value is a proposed replacement for an existing value.
variantname-proposed, optionally followed by a number, indicating that the value is a proposed replacement variant value.

"proposed" should only be present if the draft status is not "approved". It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed saying that it should be "Settembro". The new data can be entered, but marked as alt="proposed" until it is vetted.

...
<month type="9">Settembru</month>
<month type="9" draft="unconfirmed" alt="proposed">Settembro</month>
<month type="10">...

Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:

...
<month type="9" draft="unconfirmed" alt="proposed2">Settembre</month>
...
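A consumer that wants to separate the normal value from its proposed alternatives could group the elements by their alt attribute. The sketch below uses Python's standard xml.etree.ElementTree module; the <months> wrapper and the function name are assumptions made so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<months>
  <month type="9">Settembru</month>
  <month type="9" draft="unconfirmed" alt="proposed">Settembro</month>
  <month type="9" draft="unconfirmed" alt="proposed2">Settembre</month>
</months>"""


def values_by_alt(xml_text, month_type):
    """Map each alt value (None for the normal value) to the element text."""
    result = {}
    for month in ET.fromstring(xml_text).iter("month"):
        if month.get("type") == month_type:
            result[month.get("alt")] = month.text
    return result


september = values_by_alt(SAMPLE, "9")
```

The entry under the key None is the current value; the entries under "proposed" and "proposed2" are the candidates awaiting vetting.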

The values for variantname at this time include "variant", "list", "email", "www", "short", and "secondary".

<... validSubLocales="de_AT de_CH de_DE" ...>

The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there is not one. It can be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements would not otherwise be draft.

For a more complete description of how draft applies to data, see Appendix I: Inheritance and Validity.

<... standard="..." ...>

Note: This attribute is deprecated. Instead, use a reference element with the attribute standard="true". See Section 5.13 <references>.

The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this attribute indicates that the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that represents that standard. The strings are separated by commas; leading or trailing spaces on each string are not significant. Examples:

<collation standard="MSA 200:2002"> ... <dateFormatStyle standard="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&amp;ICS1=1&amp;ICS2=140&amp;ICS3=30">

<... references="..." ...>

The value of this attribute is a token representing a reference for the information in the element, including standards that it may conform to. See Section 5.13 <references>. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)

Example:

<territory type="UM" references="R222">USAs yttre öar</territory>

The reference element may be inherited. Thus, for example, R222 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.

<... allow="verbatim" ...> (deprecated)

This attribute was originally intended for use in marking display names whose capitalization differed from what was indicated by the now-deprecated <inText> element (perhaps, for example, because the names included a proper noun). It was never supported in the dtd and is not needed for use with the new <contextTransforms> element.

5.2.1 Date and Date Ranges

When attributes specify date ranges, this is usually done with the attributes from and to. The from attribute specifies the starting point, and the to attribute specifies the end point. The deprecated time attribute was formerly used to specify time with the deprecated weekEndStart and weekEndEnd elements, which were themselves inherently from or to.

The data format is a restricted ISO 8601 format, restricted to the fields year, month, day, hour, minute, and second in that order, with "-" used as a separator between date fields, a space used as the separator between the date and the time fields, and ":" used as a separator between the time fields. If the minute or minute and second are absent, they are interpreted as zero. If the hour is also missing, then it is interpreted based on whether the attribute is from or to.

from defaults to "00:00:00" (midnight at the start of the day).
to defaults to "24:00:00" (midnight at the end of the day).

That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00. Thus when the hour is missing, the from and to are interpreted inclusively: the range includes all of the day mentioned.
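The defaulting rules can be sketched with Python's standard datetime module. The function name is hypothetical, the dates in the usage note are arbitrary, and the sketch assumes that year, month, and day are always all present.

```python
from datetime import datetime, timedelta


def parse_ldml_date(value, attribute):
    """Parse "YYYY-MM-DD[ hh[:mm[:ss]]]" for a from or to attribute."""
    date_part, _, time_part = value.partition(" ")
    year, month, day = (int(field) for field in date_part.split("-"))
    moment = datetime(year, month, day)
    if time_part:
        fields = [int(field) for field in time_part.split(":")]
        fields += [0] * (3 - len(fields))   # absent minute/second mean zero
        hour, minute, second = fields
        return moment + timedelta(hours=hour, minutes=minute, seconds=second)
    if attribute == "to":
        # No time given: "to" means 24:00:00, the end of that day.
        return moment + timedelta(days=1)
    # No time given: "from" means 00:00:00, the start of that day.
    return moment
```

For instance, parse_ldml_date("2008-01-04", "to") and parse_ldml_date("2008-01-05", "from") denote the same instant, which is why a range given only by dates includes all of its last day.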

For example, the following are equivalent:

If the from attribute is missing, the range is assumed to extend as far backwards in time as there is data for; if the to attribute is missing, the range extends from the starting point onwards, with no known end point.

The dates and times are specified in local time, unless otherwise noted. (In particular, the metazone values are in UTC, also known as GMT.)

5.3 Identity Elements

<!ELEMENT identity (alias | (version, generation?, language, script?, territory?, variant?, special*) ) >

The identity element contains information identifying the target locale for this data, and general information about the version of this data.

The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and the last. For example:

<version number="1.1">Various notes and changes in version 1.1</version>
This is not to be confused with the version attribute on the ldml element, which tracks the dtd version.

The generation element contains the last modified date for the data. This can be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).

The language code is the primary part of the specification of the locale id, with values as described above.

The script code may be used in the identification of written languages, with values described above.

The territory code is a common part of the specification of the locale id, with values as described above.

The variant code is the tertiary part of the specification of the locale id, with values as described above.

When combined according to the rules described in Section 3, Unicode Language and Locale Identifiers, the language element, along with any of the optional script, territory, and variant elements, must identify a known, stable locale identifier. Otherwise, it is an error.

5.3.1 Fallback Elements

<!ELEMENT fallback (#PCDATA) >

The fallback element is deprecated. Implementations should instead use the information in C.18 Language Matching for doing language fallback.

5.4 Display Name Elements

<!ELEMENT localeDisplayNames (alias | (localeDisplayPattern?, languages?, scripts?, territories?, variants?, keys?, types?, transformNames?, measurementSystemNames?, codePatterns?, special*)) >

Display names for scripts, languages, countries, currencies, and variants in this locale are supplied by this element. They supply localized names for these items for use in user-interfaces for various purposes such as displaying menu lists, displaying a language name in a dialog, and so on. Capitalization should follow the conventions used in the middle of running text; the <contextTransforms> element may be used to specify the appropriate capitalization for other contexts (see Section 5.19 ContextTransform Elements). Examples are given below.

Note: The "en" locale may contain translated names for deprecated codes for debugging purposes. Translation of deprecated codes into other languages is discouraged.

Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

<!ELEMENT localeDisplayPattern ( alias | (localePattern*, localeSeparator*, localeKeyTypePattern*, special*) ) >

For compound language (locale) IDs such as "pt_BR", which contain additional subtags beyond the initial language code: when the <languages> data does not explicitly specify a display name such as "Brazilian Portuguese" for a given compound language ID, the <localeDisplayPattern> data is used to construct a display name such as "Portuguese (Brazil)" from the display names of the subtags.

It includes three sub-elements:

The <localePattern> element specifies a pattern such as "{0} ({1})" in which {0} is replaced by the display name for the primary language subtag and {1} is replaced by a list of the display names for the remaining subtags.
The <localeSeparator> element specifies the list separator for the display names in {1}.
The <localeKeyTypePattern> element specifies the pattern used to display key-type pairs, such as "{0}: {1}"

For example, for the locale identifier zh_Hant_CN_co_pinyin_cu_USD, the display would be "Chinese (Traditional, China, Pinyin Sort Order, Currency: USD)". The key-type for co_pinyin doesn't use the localeKeyTypePattern because there is a translation for the key-type in English:

<type type="pinyin" key="collation">Pinyin Sort Order</type>
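The way the three patterns combine can be sketched as follows. The function name and its parameter defaults are assumptions: the default pattern strings are the ones quoted above, and ", " is assumed as the separator value. Key-type pairs that have no direct translation are passed as (key name, type name) tuples.

```python
def compose_display_name(language_name, other_names,
                         locale_pattern="{0} ({1})",
                         separator=", ",
                         key_type_pattern="{0}: {1}"):
    """Build a locale display name from already-translated pieces."""
    parts = []
    for name in other_names:
        if isinstance(name, tuple):
            # A (key name, type name) pair is formatted with the key-type pattern.
            key_name, type_name = name
            name = key_type_pattern.replace("{0}", key_name).replace("{1}", type_name)
        parts.append(name)
    if not parts:
        return language_name
    joined = separator.join(parts)
    return locale_pattern.replace("{0}", language_name).replace("{1}", joined)


name = compose_display_name(
    "Chinese",
    ["Traditional", "China", "Pinyin Sort Order", ("Currency", "USD")],
)
```

With these inputs, name matches the display string given in the example above.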

<languages>

This contains a list of elements that provide the user-translated names for language codes, as described in Section 3, Unicode Language and Locale Identifiers.

<language type="ab">Abkhazian</language>
<language type="aa">Afar</language>
<language type="af">Afrikaans</language>
<language type="sq">Albanian</language>

The type can actually be any locale ID as specified above. The set of locale IDs that have their own translations is not fixed, and depends on the locale. For example, in one language one could translate the following locale IDs, and in another, fall back on the normal composition.

type	translation	composition
nl_BE	Flemish	Dutch (Belgium)
zh_Hans	Simplified Chinese	Chinese (Simplified Han)
en_GB	British English	English (United Kingdom)

Thus when a complete locale ID is formed by composition, the longest match in the language type is used, and the remaining fields (if any) added using composition.
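The longest-match rule can be sketched as below. The two tables are small illustrative samples based on the chart above (plus a name for "CN"), the function name is hypothetical, and the fixed "{0} ({1})" pattern and ", " separator stand in for the locale's <localeDisplayPattern> data.

```python
# Illustrative samples only; real names come from the locale's data.
LANGUAGE_NAMES = {
    "nl": "Dutch",
    "nl_BE": "Flemish",
    "en": "English",
    "zh": "Chinese",
    "zh_Hans": "Simplified Chinese",
}
SUBTAG_NAMES = {
    "BE": "Belgium",
    "GB": "United Kingdom",
    "Hans": "Simplified Han",
    "CN": "China",
}


def display_name(locale_id):
    """Use the longest translated prefix, then compose the remaining subtags."""
    subtags = locale_id.split("_")
    for end in range(len(subtags), 0, -1):
        prefix = "_".join(subtags[:end])
        if prefix in LANGUAGE_NAMES:
            name = LANGUAGE_NAMES[prefix]
            remaining = [SUBTAG_NAMES[s] for s in subtags[end:]]
            if remaining:
                return "{0} ({1})".format(name, ", ".join(remaining))
            return name
    raise KeyError(locale_id)
```

With these tables, "nl_BE" is matched as a whole, "en_GB" falls back to composition because only "en" is translated, and "zh_Hans_CN" combines the translated prefix "zh_Hans" with the remaining region.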

<scripts>

This element can contain a number of script elements. Each script element provides the localized name for a script code, as described in Section 3, Unicode Language and Locale Identifiers (see also UAX #24: Script Names [UAX24]). For example, in the language of this locale, the name for the Latin script might be "Romana", and the name for the Cyrillic script might be "Kyrillica". That would be expressed with the following.

<script type="Latn">Romana</script>
<script type="Cyrl">Kyrillica</script>

<territories>

This contains a list of elements that provide the user-translated names for territory codes, as described in Section 3, Unicode Language and Locale Identifiers.

<territory type="AF">Afghanistan</territory>
<territory type="AL">Albania</territory>
<territory type="DZ">Algeria</territory>
<territory type="AD">Andorra</territory>
<territory type="AO">Angola</territory>
<territory type="US">United States</territory>

<variants>

This contains a list of elements that provide the user-translated names for the variant_code values described in Section 3, Unicode Language and Locale Identifiers.

<variant type="nynorsk">Nynorsk</variant>

<keys>

This contains a list of elements that provide the user-translated names for the key values described in Section 3, Unicode Language and Locale Identifiers.

<key type="collation">Sortierung</key>

<types>

This contains a list of elements that provide the user-translated names for the type values described in Section 3, Unicode Language and Locale Identifiers. Since the translation of an option name may depend on the key it is used with, the latter is optionally supplied.

<type type="phonebook" key="collation">Telefonbuch</type>

<measurementSystemNames>

This contains a list of elements that provide the user-translated names for systems of measurement. The types currently supported are "US", "metric", and "UK".

<measurementSystemName type="US">U.S.</measurementSystemName>

Note: In the future, we may need to add display names for the particular measurement units (millimeter versus millimetre versus whatever the Greek, Russian, etc. are), and a message format for positioning those with respect to numbers. For example, "{number} {unitName}" in some languages, but "{unitName} {number}" in others.

<transformNames>

This contains a list of elements that provide the user-translated names for transforms that are not script or locale-based, such as FULLWIDTH.

<transformName type="Numeric">Numeric</transformName>

<codePattern type="language">Language: {0}</codePattern>

5.5 Layout Elements

<!ELEMENT layout ( alias | (orientation*, inList*, inText*, special*) ) >

This top-level element specifies general layout features. It currently only has one possible element (other than <special>, which is always permitted).

The lines and characters attributes of the <orientation> element specify the default general ordering of lines within a page, and characters within a line. The values are:

Orientation Attributes
Vertical	top-to-bottom
	bottom-to-top
Horizontal	left-to-right
	right-to-left

If the lines value is one of the vertical attributes, then the characters value must be one of the horizontal attributes, and vice versa. For example, for English the lines are top-to-bottom, and the characters are left-to-right. For Mongolian (in the Mongolian Script) the lines are right-to-left, and the characters are top-to-bottom. This does not override the ordering behavior of bidirectional text; it does, however, supply the paragraph direction for that text (for more information, see UAX #9: The Bidirectional Algorithm [UAX9]).
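The constraint between the two attributes can be checked mechanically. The following is a small sketch; the function name and the constant names are hypothetical, and only the four attribute values come from the table above.

```python
VERTICAL = {"top-to-bottom", "bottom-to-top"}
HORIZONTAL = {"left-to-right", "right-to-left"}


def is_valid_orientation(lines, characters):
    """True if one attribute is vertical and the other is horizontal."""
    return ((lines in VERTICAL and characters in HORIZONTAL) or
            (lines in HORIZONTAL and characters in VERTICAL))


print(is_valid_orientation("top-to-bottom", "left-to-right"))  # True (as for English)
print(is_valid_orientation("top-to-bottom", "bottom-to-top"))  # False: both are vertical
```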

For dates, times, and other data to appear in the right order, the display for them should be set to the orientation of the locale.

<inList> (deprecated)

The <inList> element is deprecated and has been superseded by the <contextTransforms> element; see Section 5.19 ContextTransform Elements.

This element controls whether display names (language, territory, etc) are title cased in GUI menu lists and the like. It is only used in languages where the normal display is lower case, but title case is used in lists. There are two options:

<inList casing="titlecase-words">

<inList casing="titlecase-firstword">

In both cases, the title case operation is the default title case function defined by Chapter 3 of [Unicode]. In the second case, only the first word (using the word boundaries for that locale) will be title cased. The results can be fine-tuned by using alt="list" on any element where titlecasing as defined by the Unicode Standard will produce the wrong value. For example, suppose that "turc de Crimée" is a value, and the title case should be "Turc de Crimée". Then that can be expressed using the alt="list" value.

<inText> (deprecated)

The <inText> element is deprecated and has been superseded by the <contextTransforms> element; see Section 5.19 ContextTransform Elements.

This element indicates the casing of the data in the category identified by the inText type attribute, when that data is written in text or how it would appear in a dictionary. For example:

<inText type="languages">lowercase-words</inText>

indicates that language names embedded in text are normally written in lower case. The possible values and their meanings are:

titlecase-words: all words in the phrase should be title case
titlecase-firstword: the first word should be title case
lowercase-words: all words in the phrase should be lower case
mixed: a mixture of upper and lower case is permitted. Generally used when the correct value is unknown.

5.6 Character Elements

<!ELEMENT characters (alias | (exemplarCharacters*, ellipsis*, moreInformation*, stopwords*, indexLabels*, mapping*, special*)) >

The <characters> element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking resources or data appropriate for the locale, such as when choosing among character encodings that are typically used to transmit data in the language of the locale. It typically only occurs in a language locale, not in a language/territory locale. The stopwords are an experimental feature, and should not be used.

The basic exemplar character sets (main and auxiliary) contain the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in [UAX44], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included in the main and auxiliary sets. In particular, format characters like CGJ are not included.

There are four sets altogether: main, auxiliary, punctuation, and index. The main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. For example, if Irish newspapers and magazines would commonly have Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, just not in the main exemplar set. Thus English has the following:

For a given language, there are a few factors that help for determining whether a character belongs in the auxiliary set, instead of the main set:

The character is not available on all normal keyboards.
It is acceptable to always use spellings that avoid that character.

For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents. The exemplar character set for fr (French), on the other hand, must contain those characters: [a-z é è ù ç à â ê î ô û æ œ ë ï ÿ]. The main set typically includes those letters commonly regarded as the language's "alphabet".

The punctuation set consists of common punctuation characters that are used with the language (corresponding to main and auxiliary). Symbols may also be included where they are common in plain text, such as ©. It does not include characters with narrow technical usage, such as dictionary punctuation/symbols or copy-edit symbols. For example, English would have something like the following:

- ‐ – —

, ; : ! ? . …

' ‘ ’ " “ ” ′ ″

( ) [ ] { } ⟨ ⟩

+ − ± × ÷ < ≤ = ≅ ≥ > √

When determining the character repertoire needed to support a language, a reasonable initial set would include at least the characters in the main and punctuation exemplar sets, along with the digits and common symbols associated with the numberSystems supported for the locale (see Section C.13 Numbering Systems).

The index characters are a set of characters for use as a UI "index", that is, a list of clickable characters (or character sequences) that allow the user to see a segment of a larger "target" list. Each character corresponds to a bucket in the target list. There are different kinds of index lists: one kind is relatively static, and the other produces roughly equally-sized buckets. While CLDR is mostly focused on the first, there is provision for supporting the second as well.

The index characters need to be used in conjunction with a collation for the locale, which will determine the order of the characters. It will also determine which index characters show up.

The static list would be presented as something like the following (either vertically or horizontally):

… A B C D E F G H CH I J K L M N O P Q R S T U V W X Y Z …

In the "A" bucket, you would find all items that are primary greater than or equal to "A" in collation order, and primary less than "B". The use of the list requires that the target list be sorted according to the locale that is used to create that list. Although we say "character" above, the index character could be a sequence, like "CH" above. The index exemplar characters must always be used with a collation appropriate for the locale. Any characters that do not have primary differences from others in the set should be removed.

Details:

The primary weight (according to the collation) is used to determine which bucket a string is in. There are special buckets for before the first character, between buckets of different scripts, and after the last bucket (and of a different script).
Characters in the index characters do not need to have distinct primary weights. That is, the index characters are adapted to the underlying collation: normally Ё is in the Е bucket for Russian, but if someone used a variant of Russian collation that distinguished them on a primary level, then Ё would show up as its own bucket.
The behavior for index characters that have multiple primary weights (such as "sch" in German) is currently undefined and not supported.
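The bucketing rule above can be sketched as follows. This is an illustration under loud assumptions: primary_key is a stand-in that uses case-folded code point order instead of a real collation primary weight, the limit parameter is a hypothetical marker for where the index script ends, and script-boundary buckets between multiple scripts are not modeled.

```python
from bisect import bisect_right


def primary_key(text):
    # Assumption: case-folded code point order stands in for the primary
    # weight that a real collator for the locale would supply.
    return text.casefold()


def bucket_for(item, index_chars, limit, before="…", after="…"):
    """Return the label of the bucket that item falls into.

    index_chars must already be sorted by primary_key. Items sorting at or
    after limit go to the trailing bucket (hypothetical end-of-script marker).
    """
    keys = [primary_key(c) for c in index_chars]
    key = primary_key(item)
    if key >= primary_key(limit):
        return after                     # after the last bucket
    pos = bisect_right(keys, key) - 1    # last index character <= item
    return before if pos < 0 else index_chars[pos]


INDEX = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
LIMIT = "{"  # first code point after "z" under the stand-in ordering

print(bucket_for("apple", INDEX, LIMIT))             # A
print(bucket_for("Bob", INDEX, LIMIT))               # B
print(bucket_for("42", INDEX, LIMIT, before="<"))    # <
```

With a real collator, the comparison keys would come from the collation's primary strength, so that, for example, accented and unaccented forms of a letter land in the same bucket.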

The … items are special: each is a bucket for everything else, either less or greater. They are inserted at the start and end of the index list, and on script boundaries. These are really script boundaries, not reordering code boundaries. Each script has its own range, except where scripts sort primary-equal (e.g., Hira & Kana). All characters that sort in the low reordering groups (whitespace, punctuation, symbols, currency symbols, digits) are treated as a single script for this purpose. So if you had a collation that reordered Hebrew after Ethiopic, you would still get index boundaries between the following (and in that order):

Ethiopic
Hebrew
Phoenician // included in the Hebrew reordering group
Samaritan // included in the Hebrew reordering group
Devanagari

If you tailor a Greek character into the Cyrillic script, that Greek character will be bucketed (and sorted) among the Cyrillic ones.

In the UI, an index character could also be omitted or grayed out if its bucket is empty. For example, if there is nothing in the bucket for Q, then Q could be omitted. That would be up to the implementation. Additional buckets could be added if other characters are present. For example, we might see something like the following:

Sample Greek Index	Contents
Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω	With only content beginning with Greek letters
… Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …	With some content before or after
… 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …	With numbers, and nothing between 9 and Alpha
… 9 A-Z Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …	With numbers, some Latin

Here is a sample of the XML structure:

The display of the index characters can be modified with the index label elements, discussed in Section 5.6.4.

5.6.1 Exemplar Syntax

In all of the exemplar characters, the list of characters is in the Unicode Set format, which allows boolean combinations of sets of letters, including those specified by Unicode properties.

Sequences of characters that act like a single letter in the language — especially in collation — are included within braces, such as [a-z á é í ó ú ö ü ő ű {cs} {dz} {dzs} {gy} ...]. The characters should be in normalized form (NFC). Where combining marks are used generatively, and apply to a large number of base characters (such as in Indic scripts), the individual combining marks should be included. Where they are used with only a few base characters, the specific combinations should be included. Wherever there is not a precomposed character (for example, single codepoint) for a given combination, that must be included within braces. For example, to include sequences from the Where is my Character? page on the Unicode site, one would write: [{ch} {tʰ} {x̣} {ƛ̓} {ą́} {i̇́} {ト゚}], but for French one would just write [a-z é è ù ...]. When in doubt use braces, since it does no harm to include them around single code points: for example, [a-z {é} {è} {ù} ...].
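To make the brace convention concrete, here is a sketch of a parser for a deliberately small subset of the syntax: single characters, simple ranges such as a-z, and brace-enclosed sequences. It is not a Unicode Set implementation (no nested sets, properties, escapes, or set operations), and the function name is hypothetical.

```python
import unicodedata


def parse_exemplars(pattern):
    """Expand a simplified exemplar set into a Python set of strings."""
    body = unicodedata.normalize("NFC", pattern.strip())
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("exemplar set must be enclosed in [ ]")
    body = body[1:-1]
    result = set()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1                                  # whitespace is insignificant
        elif ch == "{":
            end = body.index("}", i)                # multi-character sequence
            result.add(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            for cp in range(ord(ch), ord(body[i + 2]) + 1):
                result.add(chr(cp))                 # simple range such as a-z
            i += 3
        else:
            result.add(ch)                          # single character
            i += 1
    return result


hungarian = parse_exemplars("[a-z \u00e1 \u00e9 {cs} {dz} {dzs}]")
print(len(hungarian))        # 31: 26 letters, 2 accented letters, 3 sequences
print("dzs" in hungarian)    # True
```

Because braces around a single code point are harmless, {é} and é yield the same element in this sketch.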

If the letter 'z' were only ever used in the combination 'tz', then we might have [a-y {tz}] in the main set. (The language would probably have plain 'z' in the auxiliary set, for use in foreign words.) If combining characters can be used productively in combination with a large number of others (such as say Indic matras), then they are not listed in all the possible combinations, but separately, such as:

[‌ ‍ ॐ ०-९ ऄ-ऋ ॠ ऌ ॡ ऍ-क क़ ख ख़ ग ग़ घ-ज ज़ झ-ड ड़ ढ ढ़ ण-फ फ़ ब-य य़ र-ह ़ ँ-ः ॑-॔ ऽ ् ॽ ा-ॄ ॢ ॣ ॅ-ौ]

The exemplar character set for Han characters is composed somewhat differently. It is even harder to draw a clear line for Han characters, since usage is more like a frequency curve that slowly trails off to the right in terms of decreasing frequency. So for this case, the exemplar characters simply contain a set of reasonably frequent characters for the language.

The ordering of the characters in the set is irrelevant, but for readability in the XML file the characters should be in sorted order according to the locale's conventions. The set should only contain lower case characters (except for the special case of Turkish and similar languages, where the dotted capital I should be included); the upper case letters are to be mechanically added when the set is used. For more information on casing, see the discussion of Special Casing in the Unicode Character Database.

5.6.2 Restrictions

The sets are normally restricted to those letters with a specific Script character property (that is, not the values Common or Inherited) or required Default_Ignorable_Code_Point characters (such as a non-joiner), or combining marks, or the Word_Break properties Katakana, ALetter, or MidLetter.
The auxiliary set should not overlap with the main set. There is one exception to this: Hangul Syllables and CJK Ideographs can overlap between the sets.
Any Default_Ignorable_Code_Points should be in the auxiliary set, or, if they are only needed for currency formatting, in the currency set. These can include characters such as U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK which may be needed in bidirectional text in order for date, currency or other formats to display correctly.

5.6.3 Mapping

The mapping element describes character conversion mapping tables that are commonly used to encode data in the language of this locale for a particular purpose. Each encoding is identified by a name from the specified registry. If more than one encoding is used for a particular purpose, the encodings are listed in the type attribute in order, from most preferred to least. An alt tag is used to indicate the purpose ("email" or "www" being the most frequent); if it is absent, then the encoding(s) may be used for all purposes not explicitly specified.

Each locale may have at most one mapping element tagged with a particular purpose, and at most one general-purpose mapping element. Inheritance is on an element basis; an element in a sub-locale overrides an inherited element with the same purpose.

For email usage (alt="email") the list begins with encodings that should be tried for outgoing mail; these encodings should be tried in order until one is found that can represent the message text. Typically, this section of the encoding list terminates with encoding "utf-8", which can represent any message text. Any encodings listed after "utf-8" may be encountered in incoming messages (along with the encodings in the first section) and should be handled for incoming messages, but should not be used for outgoing messages.
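The outgoing-mail procedure can be sketched as below. The function name and the sample encoding list are hypothetical, and the sketch assumes that every listed IANA name is also recognized as a codec name by Python, which is not guaranteed in general.

```python
def choose_outgoing_encoding(text, encodings):
    """Return the first outgoing encoding that can represent text, or None."""
    names = [e.lower() for e in encodings]
    # Only entries up to and including "utf-8" are candidates for outgoing
    # mail; anything listed after it is for incoming messages only.
    cutoff = names.index("utf-8") + 1 if "utf-8" in names else len(names)
    for name in encodings[:cutoff]:
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    return None


# Hypothetical preference list, most preferred first.
ENCODINGS = ["iso-8859-1", "utf-8", "windows-1252"]

print(choose_outgoing_encoding("caf\u00e9", ENCODINGS))   # iso-8859-1
print(choose_outgoing_encoding("\u20ac5", ENCODINGS))     # utf-8
```

In the second call the euro sign is representable in windows-1252, but that entry follows "utf-8" in the list and is therefore never chosen for outgoing text.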

Currently the only registry that can be used is "iana", which specifies use of an IANA name.

Note: While IANA names are not precise for conversion (see UTS #22: Character Mapping Tables [UTS22]), they are sufficient for this purpose.

5.6.4 Index Labels

<!ELEMENT indexLabels (indexSeparator*, compressedIndexSeparator*, indexRangePattern*, indexLabelBefore*, indexLabelAfter*, indexLabel*) >

<!ELEMENT indexSeparator ( #PCDATA ) >

<!ELEMENT compressedIndexSeparator ( #PCDATA ) >

<!ELEMENT indexRangePattern ( #PCDATA ) >

<!ELEMENT indexLabelBefore ( #PCDATA ) >

<!ELEMENT indexLabelAfter ( #PCDATA ) >

<!ELEMENT indexLabel ( #PCDATA ) >
<!ATTLIST indexLabel indexSource CDATA #IMPLIED >
<!ATTLIST indexLabel priority ( 1 | 2 | 3 ) #IMPLIED >

The index label elements provide information for modifying the index exemplar characters in display. In particular, they are used to indicate how index exemplar characters can be compressed where screen real estate is limited. For example, A B C D E F G H I J K L M N O P Q R S T U V W X Y Z can be represented as A • E • I • N • S • Z.

The indexSeparator can be used to separate the index characters if they occur in free flowing text (instead of, say, on buttons or in cells). The default (root) is a space. Where the index is compressed (by omitting values; see the priority attribute below), the compressedIndexSeparator can be used instead.

The indexRangePattern is used for dynamic configuration. That is, if there are few items in X, Y, and Z, they can be grouped into a single bucket with <indexRangePattern>{0}-{1}</indexRangePattern>, giving "X-Z". The indexLabel can be applied either to a single string from the exemplars or to the result of an indexRangePattern, so the localizer can turn "X-Z" into "XYZ" if desired.
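As a small sketch of how the range pattern and a label override might combine (the function names and the label table are hypothetical, not part of the specification):

```python
def format_range(pattern, first, last):
    """Fill an indexRangePattern such as "{0}-{1}" with the range endpoints."""
    return pattern.replace("{0}", first).replace("{1}", last)


def display_label(label, index_labels):
    """Apply a localizer-supplied label if one exists, else keep the label."""
    return index_labels.get(label, label)


grouped = format_range("{0}-{1}", "X", "Z")
print(grouped)                                   # X-Z
print(display_label(grouped, {"X-Z": "XYZ"}))    # XYZ
print(display_label("A", {"X-Z": "XYZ"}))        # A
```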

The indexLabelBefore and indexLabelAfter elements are used before and after a list. The default (root) value is an ellipsis, as in the example at the top. When displaying index characters with multiple scripts, the main language can be used for all characters from the main script. For other scripts there are two possibilities:

Use the primary characters from the UCA. This has the disadvantage that many very uncommon characters show up.
Use the likely-subtags language for each script. For example, if the main language is French, and Cyrillic characters are present, then the likely subtags language for Cyrillic is "ru" (derived by looking up "und-Cyrl").

The indexLabel is used to display characters (if it is available). That is, when displaying index characters, if there is an indexLabel, use it instead. For example, for Hungarian, we could have A => "A, Á". The priority is used where not all of the index characters can be displayed. In that case, only the higher priorities (lower numbers) would be displayed.

Note that the indexLabels can be used both with contiguous ranges and non-contiguous ranges. For German we might have [A-S Sch Sci St Su T-Z] as the index characters, and the following labels:

<indexLabel indexSource="Sci">S</indexLabel>
<indexLabel indexSource="Su">S</indexLabel>

What that means is that the "S" bucket will include anything in the ranges [S,Sch), [Sci,St), and [Su,T). That is, items are put into the first display bucket that contains them. That allows for the desired behavior in German (and other languages) of:

S (contains Satt, Semel, Szent)
Sch (contains Scherer, Schoen)
St (contains Stumpf, Sturr)
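The German grouping above can be reproduced with a short sketch. As an explicit assumption, case-folded code point order stands in for the German collation's primary ordering, which happens to be adequate for these sample names only; the function and constant names are hypothetical.

```python
from bisect import bisect_right


def primary_key(text):
    # Assumption: stand-in for a real collation primary key.
    return text.casefold()


INDEX_CHARS = ["S", "Sch", "Sci", "St", "Su", "T"]   # sorted by primary_key
INDEX_LABELS = {"Sci": "S", "Su": "S"}                # source bucket -> display label


def display_bucket(item):
    """Find the source bucket for item, then map it to its display label."""
    keys = [primary_key(c) for c in INDEX_CHARS]
    pos = bisect_right(keys, primary_key(item)) - 1
    if pos < 0:
        return None                                   # sorts before the first bucket
    source = INDEX_CHARS[pos]
    return INDEX_LABELS.get(source, source)


for name in ["Satt", "Semel", "Szent", "Scherer", "Schoen", "Stumpf", "Sturr"]:
    print(name, "->", display_bucket(name))
```

The first three names print "S", the next two "Sch", and the last two "St", matching the grouping listed above.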

5.6.5 Ellipsis

The ellipsis element provides patterns for use when truncating strings. There are three versions: initial for removing an initial part of the string (leaving final characters); medial for removing from the center of the string (leaving initial and final characters), and final for removing a final part of the string (leaving initial characters). For example, the following uses the ellipsis character in all three cases (although some languages may have different characters for different positions).

5.6.6 More Information

The moreInformation string is one that can be displayed in an interface to indicate that more information is available. For example:

5.7 Delimiter Elements

<!ELEMENT delimiters (alias | (quotationStart*, quotationEnd*, alternateQuotationStart*, alternateQuotationEnd*, special*)) >

The delimiters supply common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:

He said, “Don’t be absurd!”

When quotations are nested, the quotation marks and alternate marks are used in an alternating fashion:

He said, “Remember what the Mad Hatter said: ‘Not the same thing a bit! Why you might just as well say that “I see what I eat” is the same thing as “I eat what I see”!’”

<quotationStart>“</quotationStart>
<quotationEnd>”</quotationEnd>
<alternateQuotationStart>‘</alternateQuotationStart>
<alternateQuotationEnd>’</alternateQuotationEnd>
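The alternation between the primary and alternate marks by nesting depth can be sketched as follows. The function name and the MARKS table are hypothetical; the mark values are the ones from the example data above.

```python
# (start, end) pairs: primary marks first, alternate marks second.
MARKS = [("“", "”"), ("‘", "’")]


def quote(text, depth=0):
    """Wrap text in the marks for the given nesting depth (0 = outermost)."""
    start, end = MARKS[depth % 2]   # even depths use primary, odd use alternate
    return f"{start}{text}{end}"


inner = quote("I see what I eat", depth=2)
middle = quote(f"you might just as well say that {inner}", depth=1)
print(quote(f"He said: {middle}", depth=0))
```

Depth 2 reuses the primary marks, which is the alternating behavior shown in the Mad Hatter example.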

5.8 Measurement Elements (deprecated)

<!ELEMENT measurement (alias | (measurementSystem?, paperSize?, special*)) >

The measurement element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the measurementData element in the supplemental data file should be used.

5.9 Date Elements

<!ELEMENT dates (alias | (localizedPatternChars*, dateRangePattern*, calendars?, timeZoneNames?, special*)) >

This top-level element contains information regarding the format and parsing of dates and times. The data format is based on the Java/ICU format. Most of these are fairly self-explanatory, except the week elements, localizedPatternChars, and the meaning of the pattern characters. For information on this, and more information on other elements and attributes, see Appendix F: Date Format Patterns.

5.9.1 Calendar Elements

<!ELEMENT calendars (alias | (default*, calendar*, special*)) >
<!ELEMENT calendar (alias | (months?, monthNames?, monthAbbr?, monthPatterns?, days?, dayNames?, dayAbbr?, quarters?, week?, am*, pm*, dayPeriods?, eras?, cyclicNameSets?, dateFormats?, timeFormats?, dateTimeFormats?, fields*, special*))>

This element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar. The month and quarter names are identified numerically, starting at 1. The day (of the week) names are identified with short strings, since there is no universally-accepted numeric designation.

Note: Use of a default element for calendars is deprecated. The default calendar type for a locale is specified by C.15 Calendar Preference Data.

Many calendars will only differ from the Gregorian Calendar in the year and era values. For example, the Japanese calendar will have many more eras (one for each Emperor), and the years will be numbered within that era. All calendar data inherits from the Gregorian calendar in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.

<months>, <days>, <quarters>, <eras>

<!ELEMENT months ( alias | (default*, monthContext*, special*)) >
<!ELEMENT monthContext ( alias | (default*, monthWidth*, special*)) >
<!ATTLIST monthContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT monthWidth ( alias | (month*, special*)) >
<!ATTLIST monthWidth type ( abbreviated| narrow | wide) #REQUIRED >
<!ELEMENT month ( #PCDATA | cp )* >
<!ATTLIST month type ( 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 ) #REQUIRED >
<!ATTLIST month yeartype ( standard | leap ) #IMPLIED >

<!ELEMENT days ( alias | (default*, dayContext*, special*)) >
<!ELEMENT dayContext ( alias | (default*, dayWidth*, special*)) >
<!ATTLIST dayContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT dayWidth ( alias | (day*, special*)) >
<!ATTLIST dayWidth type NMTOKEN #REQUIRED >
<!ELEMENT day ( #PCDATA ) >
<!ATTLIST day type ( sun | mon | tue | wed | thu | fri | sat ) #REQUIRED >

<!ELEMENT quarters ( alias | (default*, quarterContext*, special*)) >
<!ELEMENT quarterContext ( alias | (default*, quarterWidth*, special*)) >
<!ATTLIST quarterContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT quarterWidth ( alias | (quarter*, special*)) >
<!ATTLIST quarterWidth type NMTOKEN #REQUIRED >
<!ELEMENT quarter ( #PCDATA ) >
<!ATTLIST quarter type ( 1 | 2 | 3 | 4 ) #REQUIRED >

<!ELEMENT eras (alias | (eraNames?, eraAbbr?, eraNarrow?, special*)) >
<!ELEMENT eraNames ( alias | (era*, special*) ) >
<!ELEMENT eraAbbr ( alias | (era*, special*) ) >
<!ELEMENT eraNarrow ( alias | (era*, special*) ) >

Month, day, and quarter names may vary along two axes: the width and the context. The context is either format (the default), the form used within a date format string (such as "Saturday, November 12th"), or stand-alone, the form used independently, such as in calendar headers. The width can be wide (the default), abbreviated, or narrow; for days only, the width can also be short, which is ideally between the abbreviated and narrow widths, but must be no longer than abbreviated and no shorter than narrow (if short day names are not explicitly specified, abbreviated day names are used instead). Note that for <monthPatterns>, described in the next section:

There is an additional context type numeric
When the context type is numeric, the width has a special type all

The format values must be distinct for the wide, abbreviated, and short widths. However, values for the narrow width in either format or stand-alone contexts, as well as values for other widths in stand-alone contexts, need not be distinct; they might only be distinguished by context. For example, "S" may be used both for Saturday and for Sunday. The narrow width is typically used in calendar headers; it must be the shortest possible width, no more than one character (or grapheme cluster, or exemplar set element) in stand-alone values, and the shortest possible widths (in terms of grapheme clusters) in format values. The short width (if present) is often the shortest unambiguous form.

Era names should be distinct within each of the widths, including narrow; there is less disambiguating information for them, and they are more likely to be used in a format that requires parsing.

The most important distinction to make between format and stand-alone forms is a grammatical distinction, for languages that require it. For example, many languages require that a month name without an associated day number be in the basic nominative form, while a month name with an associated day number should be in a different grammatical form: genitive, partitive, etc. Another common type of distinction between format and stand-alone involves capitalization; however, this can be controlled separately and more precisely using the <contextTransforms> element as described in Section 5.19 ContextTransform Elements.

Due to aliases in root, the forms inherit "sideways". (See Section 4.1 Multiple Inheritance.) For example, if the abbreviated format data for Gregorian does not exist in a language X (in the chain up to root), then it inherits from the wide format data in that same language X.

<monthContext type="format">
	<default choice="wide"/>
	<monthWidth type="abbreviated">
		<alias source="locale" path="../monthWidth[@type='wide']"/>
	</monthWidth>
	<monthWidth type="narrow">
		<alias source="locale" path="../../monthContext[@type='stand-alone']/monthWidth[@type='narrow']"/>
	</monthWidth>
	<monthWidth type="wide">
		<month type="1">1</month>
		...
		<month type="12">12</month>
	</monthWidth>
</monthContext>
<monthContext type="stand-alone">
	<monthWidth type="abbreviated">
		<alias source="locale" path="../../monthContext[@type='format']/monthWidth[@type='abbreviated']"/>
	</monthWidth>
	<monthWidth type="narrow">
		<month type="1">1</month>
		...
		<month type="12">12</month>
	</monthWidth>
	<monthWidth type="wide">
		<alias source="locale" path="../../monthContext[@type='format']/monthWidth[@type='wide']"/>
	</monthWidth>
</monthContext>
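The effect of these root aliases can be sketched as a lookup that follows alias links until it finds data. This is a simplification under stated assumptions: locale data and root data are merged into one flat dictionary, the function and table names are hypothetical, and the full locale inheritance chain is not modeled.

```python
# Alias links taken from the root structure shown above.
ALIASES = {
    ("format", "abbreviated"): ("format", "wide"),
    ("format", "narrow"): ("stand-alone", "narrow"),
    ("stand-alone", "abbreviated"): ("format", "abbreviated"),
    ("stand-alone", "wide"): ("format", "wide"),
}

NUMERIC = [str(n) for n in range(1, 13)]
ROOT = {("format", "wide"): NUMERIC, ("stand-alone", "narrow"): NUMERIC}


def month_names(locale_data, context, width):
    """Resolve month names, following 'sideways' aliases when data is absent."""
    data = {**ROOT, **locale_data}   # locale values override root values
    key = (context, width)
    seen = set()
    while key not in data:
        if key in seen or key not in ALIASES:
            return None              # alias loop or no data at all
        seen.add(key)
        key = ALIASES[key]
    return data[key]


# Hypothetical language X that only supplies wide format names.
X = {("format", "wide"): ["January", "February"]}   # truncated for brevity

print(month_names(X, "stand-alone", "abbreviated"))  # ['January', 'February']
print(month_names(X, "format", "narrow")[:3])        # ['1', '2', '3']
```

The first lookup goes stand-alone abbreviated, then format abbreviated, then format wide, which is the sideways inheritance described above.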

The older monthNames, dayNames, monthAbbr, and dayAbbr elements are maintained for backwards compatibility. They are equivalent to using the months element with the context type="format" and the width type="wide" (for ...Names) or type="abbreviated" (for ...Abbr), respectively. The minDays, firstDay, weekendStart, and weekendEnd elements are also deprecated; there are new elements in supplemental data for this data.

The yeartype attribute for months is used to distinguish alternate month names that would be displayed for certain calendars during leap years. The practical example of this usage occurs in the Hebrew calendar, where the 7th month "Adar" occurs in non-leap years, with the 6th month being skipped, but in leap years there are two months named "Adar I" and "Adar II". There are currently only two defined year types, standard (the implied default) and leap.
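A lookup that honors the yeartype attribute might look like the following sketch. The function name and the table layout are hypothetical, and the sample entries are illustrative values consistent with the Hebrew example above rather than data quoted from CLDR.

```python
def month_name(names, month, leap_year):
    """Pick the leap-year variant of a month name when one is defined."""
    if leap_year and (month, "leap") in names:
        return names[(month, "leap")]
    return names[(month, "standard")]   # "standard" is the implied default


# Illustrative entries keyed by (month type, yeartype).
HEBREW = {
    (6, "standard"): "Adar I",
    (7, "standard"): "Adar",
    (7, "leap"): "Adar II",
}

print(month_name(HEBREW, 7, leap_year=False))  # Adar
print(month_name(HEBREW, 7, leap_year=True))   # Adar II
print(month_name(HEBREW, 6, leap_year=True))   # Adar I
```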

Example:

  <calendar type="gregorian">
    <months>
      <default type="format"/>
      <monthContext type="format">
         <default type="wide"/>
         <monthWidth type="wide">
            <month type="1">January</month>
            <month type="2">February</month>
...
            <month type="11">November</month>
            <month type="12">December</month>
        </monthWidth>
        <monthWidth type="abbreviated">
            <month type="1">Jan</month>
            <month type="2">Feb</month>
...
            <month type="11">Nov</month>
            <month type="12">Dec</month>
        </monthWidth>
       </monthContext>
       <monthContext type="stand-alone">
         <default type="wide"/>
         <monthWidth type="wide">
            <month type="1">Januaria</month>
            <month type="2">Februaria</month>
...
            <month type="11">Novembria</month>
            <month type="12">Decembria</month>
        </monthWidth>
        <monthWidth type="narrow">
            <month type="1">J</month>
            <month type="2">F</month>
...
            <month type="11">N</month>
            <month type="12">D</month>
        </monthWidth>
       </monthContext>
    </months>

    <days>
      <default type="format"/>
      <dayContext type="format">
         <default type="wide"/>
         <dayWidth type="wide">
            <day type="sun">Sunday</day>
            <day type="mon">Monday</day>
...
            <day type="fri">Friday</day>
            <day type="sat">Saturday</day>
        </dayWidth>
        <dayWidth type="abbreviated">
            <day type="sun">Sun</day>
            <day type="mon">Mon</day>
...
            <day type="fri">Fri</day>
            <day type="sat">Sat</day>
        </dayWidth>
        <dayWidth type="narrow">
            <day type="sun">Su</day>
            <day type="mon">M</day>
...
            <day type="fri">F</day>
            <day type="sat">Sa</day>
        </dayWidth>
      </dayContext>
      <dayContext type="stand-alone">
        <dayWidth type="narrow">
            <day type="sun">S</day>
            <day type="mon">M</day>
...
            <day type="fri">F</day>
            <day type="sat">S</day>
        </dayWidth>
      </dayContext>
    </days>

    <quarters>
      <default type="format"/>
      <quarterContext type="format">
         <default type="abbreviated"/>
         <quarterWidth type="abbreviated">
            <quarter type="1">Q1</quarter>
            <quarter type="2">Q2</quarter>
            <quarter type="3">Q3</quarter>
            <quarter type="4">Q4</quarter>
        </quarterWidth>
        <quarterWidth type="wide">
            <quarter type="1">1st quarter</quarter>
            <quarter type="2">2nd quarter</quarter>
            <quarter type="3">3rd quarter</quarter>
            <quarter type="4">4th quarter</quarter>
        </quarterWidth>
      </quarterContext>
    </quarters>

    <am>AM</am> <!-- deprecated -->
    <pm>PM</pm> <!-- deprecated -->

    <eras>
       <eraAbbr>
        <era type="0">BC</era>
        <era type="1">AD</era>
       </eraAbbr>
       <eraNames>
        <era type="0">Before Christ</era>
        <era type="1">Anno Domini</era>
       </eraNames>
       <eraNarrow>
        <era type="0">B</era>
        <era type="1">A</era>
       </eraNarrow>
    </eras>

<monthPatterns>, <cyclicNameSets>

<!ELEMENT monthPatterns ( alias | (monthPatternContext*, special*)) >
<!ELEMENT monthPatternContext ( alias | (monthPatternWidth*, special*)) >
<!ATTLIST monthPatternContext type ( format | stand-alone | numeric ) #REQUIRED >
<!ELEMENT monthPatternWidth ( alias | (monthPattern*, special*)) >
<!ATTLIST monthPatternWidth type ( abbreviated | narrow | wide | all ) #REQUIRED >
<!ELEMENT monthPattern ( #PCDATA ) >
<!ATTLIST monthPattern type ( leap | standardAfterLeap | combined ) #REQUIRED >

<!ELEMENT cyclicNameSets ( alias | (cyclicNameSet*, special*)) >
<!ELEMENT cyclicNameSet ( alias | (cyclicNameContext*, special*)) >
<!ATTLIST cyclicNameSet type ( years | months | days | dayParts | zodiacs ) #REQUIRED >
<!ELEMENT cyclicNameContext ( alias | (cyclicNameWidth*, special*)) >
<!ATTLIST cyclicNameContext type ( format | stand-alone ) #REQUIRED >
<!ELEMENT cyclicNameWidth ( alias | (cyclicName*, special*)) >
<!ATTLIST cyclicNameWidth type ( abbreviated | narrow | wide ) #REQUIRED >
<!ELEMENT cyclicName ( #PCDATA ) >
<!ATTLIST cyclicName type NMTOKEN #REQUIRED >

The Chinese lunar calendar can insert a leap month after nearly any month of its year; when this happens, the month takes the name of the preceding month plus a special marker. The Hindu lunar calendars can insert a leap month before any one or two months of the year; when this happens, not only does the leap month take the name of the following month plus a special marker, the following month also takes a special marker. Moreover, in the Hindu calendar sometimes a month is skipped, in which case the preceding month takes a special marker plus the names of both months. The <monthPatterns> element structure supports these special kinds of month names. It parallels the <months> element structure, with various contexts and widths, but with some differences:

Since the month markers may be applied to numeric months as well, there is an additional monthPatternContext type "numeric" for this case. When the numeric context is used, there is no need for different widths, so the monthPatternWidth type is "all" for this case.
The monthPattern element itself is a pattern showing how to create the modified month name from the standard month name(s). The three types of possible pattern are for "leap", "standardAfterLeap", and "combined".
The <monthPatterns> element is not present for calendars that do not need it.

The Chinese and Hindu lunar calendars also use a 60-name cycle for designating years. The Chinese lunar calendars can also use that cycle for months and days, and can use 12-name cycles for designating day subdivisions or zodiac names associated with years. The <cyclicNameSets> element structure supports these special kinds of name cycles; a cyclicNameSet can be provided for types "years", "months", "days", "dayParts", or "zodiacs". For each cyclicNameSet, there is a context and width structure similar to that for day names. For a given context and width, a set of cyclicName elements provides the actual names.

Example:

    <monthPatterns>
        <monthPatternContext type="format">
            <monthPatternWidth type="wide">
                <monthPattern type="leap">闰{0}</monthPattern>
            </monthPatternWidth>
        </monthPatternContext>
        <monthPatternContext type="stand-alone">
            <monthPatternWidth type="narrow">
                <monthPattern type="leap">闰{0}</monthPattern>
            </monthPatternWidth>
        </monthPatternContext>
        <monthPatternContext type="numeric">
            <monthPatternWidth type="all">
                <monthPattern type="leap">闰{0}</monthPattern>
            </monthPatternWidth>
        </monthPatternContext>
    </monthPatterns>
    <cyclicNameSets>
        <cyclicNameSet type="years">
            <cyclicNameContext type="format">
                <cyclicNameWidth type="abbreviated">
                    <cyclicName type="1">甲子</cyclicName>
                    <cyclicName type="2">乙丑</cyclicName>
                    ...
                    <cyclicName type="59">壬戌</cyclicName>
                    <cyclicName type="60">癸亥</cyclicName>
                </cyclicNameWidth>
            </cyclicNameContext>
        </cyclicNameSet>
        <cyclicNameSet type="zodiacs">
            <cyclicNameContext type="format">
                <cyclicNameWidth type="abbreviated">
                    <cyclicName type="1">鼠</cyclicName>
                    <cyclicName type="2">牛</cyclicName>
                    ...
                    <cyclicName type="11">狗</cyclicName>
                    <cyclicName type="12">猪</cyclicName>
                </cyclicNameWidth>
            </cyclicNameContext>
        </cyclicNameSet>
    </cyclicNameSets>
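
The following Python sketch illustrates how the data in the example above might be consumed. It is not part of LDML: the function names are hypothetical, the month name "二月" is an illustrative input, and the wrap-around lookup assumes that positions beyond the end of a cycle restart at type 1. Names elided by "..." in the example are represented by placeholders rather than filled in.

```python
def apply_month_pattern(pattern, month_name):
    """Substitute the standard month name (or number) for {0} in a monthPattern."""
    return pattern.replace("{0}", month_name)


def cyclic_name(names, position):
    """Return the cyclicName for a 1-based position, wrapping around the cycle."""
    cycle_length = len(names)
    return names[(position - 1) % cycle_length + 1]


# The 12-name zodiac cycle: only the four names shown in the example are real;
# the elided entries are placeholders.
zodiacs = {i: "<name %d>" % i for i in range(1, 13)}
zodiacs.update({1: "鼠", 2: "牛", 11: "狗", 12: "猪"})

print(apply_month_pattern("闰{0}", "二月"))  # leap form of an illustrative month name
print(cyclic_name(zodiacs, 13))              # position 13 wraps back to type 1
```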

<dayPeriods>

The former am/pm elements have been deprecated, and replaced by the more flexible dayPeriods.

<!ELEMENT dayPeriods ( alias | (dayPeriodContext*) ) >

<!ELEMENT dayPeriodContext (alias | dayPeriodWidth*) >
<!ATTLIST dayPeriodContext type NMTOKEN #REQUIRED >

<!ELEMENT dayPeriodWidth (alias | dayPeriod*) >
<!ATTLIST dayPeriodWidth type NMTOKEN #REQUIRED >

<!ELEMENT dayPeriod ( #PCDATA ) >
<!ATTLIST dayPeriod type NMTOKEN #REQUIRED >

These behave like months, days, and so on in terms of having context and width. Each locale has an associated dayPeriodRuleSet in the supplemental data, rules that specify when the day periods start and end for that locale. Each type in the rules needs to have a translation in a dayPeriod. For more information, see Section C.17 DayPeriod Rules.

The dayPeriod names should be distinct within each of the context/width combinations, including narrow; as with era names, there is less disambiguating information for them, and they are more likely to be used in a format that requires parsing. In some unambiguous cases, it is acceptable for certain overlapping dayPeriods to be the same, such as the names for "am" and "morning", or the names for "pm" and "afternoon".

Example:

    <dayPeriods>
      <dayPeriodContext type="format">
        <dayPeriodWidth type="wide">
          <dayPeriod type="am">AM</dayPeriod>
          <dayPeriod type="noon">noon</dayPeriod>
          <dayPeriod type="pm">PM</dayPeriod>
        </dayPeriodWidth>
      </dayPeriodContext>
    </dayPeriods>

<dateFormats>

Date formats have the following form:

    <dateFormats>
      <default type="medium"/>
      <dateFormatLength type="full">
        <dateFormat>
          <pattern>EEEE, MMMM d, yyyy</pattern>
        </dateFormat>
      </dateFormatLength>
      <dateFormatLength type="medium">
        <default type="DateFormatsKey2"/>
        <dateFormat type="DateFormatsKey2">
          <pattern>MMM d, yyyy</pattern>
        </dateFormat>
        <dateFormat type="DateFormatsKey3">
          <pattern>MMM dd, yyyy</pattern>
        </dateFormat>
      </dateFormatLength>
    </dateFormats>

The patterns for date formats and time formats are defined in Appendix F: Date Format Patterns. These patterns are intended primarily for display of isolated date and time strings in user-interface elements, rather than for date and time strings in the middle of running text, so capitalization and grammatical form should be chosen appropriately.

<timeFormats>

Time formats have the following form:

     <timeFormats>
       <default type="medium"/>
       <timeFormatLength type="full">
         <timeFormat>
           <displayName>DIN 5008 (EN 28601)</displayName>
           <pattern>h:mm:ss a z</pattern>
         </timeFormat>
       </timeFormatLength>
       <timeFormatLength type="medium">
         <timeFormat>
           <pattern>h:mm:ss a</pattern>
         </timeFormat>
       </timeFormatLength>
     </timeFormats>

The preference of 12 hour versus 24 hour for the locale should be derived from the short timeFormat. If the hour symbol is "h" or "K" (of various lengths) then the format is 12 hour; otherwise it is 24 hour.
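
The rule above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the handling of single-quoted literal text follows the quoting convention of date format patterns (letters inside quotes are not pattern symbols).

```python
def unquoted_letters(pattern):
    """Return the pattern letters that are not inside single-quoted literal text."""
    letters = []
    in_quote = False
    for ch in pattern:
        if ch == "'":
            # A doubled quote ('') toggles twice, leaving the state unchanged.
            in_quote = not in_quote
        elif not in_quote and ch.isascii() and ch.isalpha():
            letters.append(ch)
    return "".join(letters)


def uses_12_hour_cycle(short_time_pattern):
    """True if the short timeFormat pattern uses 'h' or 'K' as its hour symbol."""
    return any(ch in "hK" for ch in unquoted_letters(short_time_pattern))


print(uses_12_hour_cycle("h:mm a"))     # True
print(uses_12_hour_cycle("HH:mm"))      # False
print(uses_12_hour_cycle("H 'h' mm"))   # False: the quoted h is literal text
```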

Time formats use the specific non-location format (z or zzzz) for the time zone name. This is the format that should be used when formatting a specific time for presentation. When formatting a time referring to a recurring time (such as a meeting in a calendar), applications should substitute the generic non-location format (v or vvvv) for the time zone in the time format pattern. See Appendix J: Time Zone Display Names for a complete description of available time zone formats and their uses.
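
A minimal Python sketch of this substitution follows. The function name is hypothetical, and the mapping of field lengths is an assumption: a run of four or more z characters becomes vvvv, and any shorter run becomes v. Quoted literal text is left untouched.

```python
import re


def to_generic_zone(pattern):
    """Replace specific non-location zone fields (z, zzzz) with generic ones (v, vvvv)."""
    # Split into quoted literals (odd positions) and unquoted text (even positions).
    parts = re.split(r"('(?:[^']|'')*')", pattern)

    def swap(match):
        return "vvvv" if len(match.group()) >= 4 else "v"

    return "".join(
        part if part.startswith("'") else re.sub(r"z+", swap, part)
        for part in parts
    )


print(to_generic_zone("h:mm:ss a z"))     # h:mm:ss a v
print(to_generic_zone("h:mm:ss a zzzz"))  # h:mm:ss a vvvv
```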

<dateTimeFormats>

Date/Time formats have the following form:

     <dateTimeFormats>
       <default type="medium"/>
       <dateTimeFormatLength type="full">
         <dateTimeFormat>
            <pattern>{0} {1}</pattern>
         </dateTimeFormat>
       </dateTimeFormatLength>
       <availableFormats>
         <dateFormatItem id="Hm">HH:mm</dateFormatItem> 
         <dateFormatItem id="Hms">HH:mm:ss</dateFormatItem> 
         <dateFormatItem id="M">L</dateFormatItem> 
         <dateFormatItem id="MEd">E, M/d</dateFormatItem> 
         <dateFormatItem id="MMM">LLL</dateFormatItem> 
         <dateFormatItem id="MMMEd">E, MMM d</dateFormatItem> 
         <dateFormatItem id="MMMMEd">E, MMMM d</dateFormatItem> 
         <dateFormatItem id="MMMMd">MMMM d</dateFormatItem> 
         <dateFormatItem id="MMMd">MMM d</dateFormatItem> 
         <dateFormatItem id="Md">M/d</dateFormatItem> 
         <dateFormatItem id="d">d</dateFormatItem> 
         <dateFormatItem id="hm">h:mm a</dateFormatItem> 
         <dateFormatItem id="ms">mm:ss</dateFormatItem> 
         <dateFormatItem id="y">yyyy</dateFormatItem> 
         <dateFormatItem id="yM">M/yyyy</dateFormatItem> 
         <dateFormatItem id="yMEd">EEE, M/d/yyyy</dateFormatItem> 
         <dateFormatItem id="yMMM">MMM yyyy</dateFormatItem> 
         <dateFormatItem id="yMMMEd">EEE, MMM d, yyyy</dateFormatItem> 
         <dateFormatItem id="yMMMM">MMMM yyyy</dateFormatItem> 
         <dateFormatItem id="yQ">Q yyyy</dateFormatItem> 
         <dateFormatItem id="yQQQ">QQQ yyyy</dateFormatItem> 
         . . .
       </availableFormats>
       <appendItems>
         <appendItem request="G">{0} {1}</appendItem>
         <appendItem request="w">{0} ({2}: {1})</appendItem>
         . . .
       </appendItems>
     </dateTimeFormats>

  </calendar>

  <calendar type="buddhist">
    <eras>
      <eraAbbr>
        <era type="0">BE</era>
      </eraAbbr>
    </eras>
  </calendar>

These formats allow for date and time formats to be composed in various ways.

<!ELEMENT dateTimeFormats (alias | (default*, dateTimeFormatLength*, availableFormats*, appendItems*, intervalFormats*, special*)) >
<!ELEMENT dateTimeFormatLength (alias | (default*, dateTimeFormat*, special*))>
<!ATTLIST dateTimeFormatLength type ( full | long | medium | short ) #IMPLIED >
<!ELEMENT dateTimeFormat (alias | (pattern*, displayName*, special*))>

The dateTimeFormat element works like the dateFormats and timeFormats, except that the pattern is of the form "{1} {0}", where {0} is replaced by the time format, and {1} is replaced by the date format, with results such as "8/27/06 7:31 AM". Except for the substitution markers {0} and {1}, text in the dateTimeFormat is interpreted as part of a date/time pattern, and is subject to the same rules described in Appendix F: Date Format Patterns. This includes the need to enclose ASCII letters in single quotes if they are intended to represent literal text.
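
As an illustration, the substitution can be performed on the patterns themselves, yielding a single combined date/time pattern. The Python sketch below assumes that approach; the function name and the sample date and time patterns are hypothetical.

```python
import re


def combine_date_time(glue_pattern, date_pattern, time_pattern):
    """Replace {0} with the time pattern and {1} with the date pattern."""
    replacements = {"{0}": time_pattern, "{1}": date_pattern}
    # A single pass avoids re-scanning text that was just substituted.
    return re.sub(r"\{[01]\}", lambda m: replacements[m.group()], glue_pattern)


print(combine_date_time("{1} {0}", "M/d/yy", "h:mm a"))       # M/d/yy h:mm a
print(combine_date_time("{1} 'at' {0}", "M/d/yy", "h:mm a"))  # M/d/yy 'at' h:mm a
```

The second call shows why ASCII letters intended as literal text need single quotes: unquoted, the letters in "at" would be read as pattern symbols.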

<!ELEMENT availableFormats (alias | (dateFormatItem*, special*))>
<!ELEMENT dateFormatItem ( #PCDATA ) >
<!ATTLIST dateFormatItem id CDATA #REQUIRED >

The availableFormats element and its subelements provide a more flexible formatting mechanism than the predefined list of patterns represented by dateFormatLength, timeFormatLength, and dateTimeFormatLength. Instead, there is an open-ended list of patterns (represented by dateFormatItem elements as well as the predefined patterns mentioned above) that can be matched against a requested set of calendar fields and field lengths. Software can look through the list and find the pattern that best matches the original request, based on the desired calendar fields and lengths. For example, the full month and year may be needed for a calendar application; the request is MMMMyyyy, but the best match may be "yyyy MMMM" or even "G yy MMMM", depending on the locale and calendar.

For some calendars, such as Japanese, a displayed year must have an associated era, so for these calendars dateFormatItem patterns with a year field should also include an era field. When matching availableFormats patterns: If a client requests a format string containing a year, and all the availableFormats patterns with a year also contain an era, then include the era as part of the result.

The id attribute is a so-called "skeleton", containing only field information, and in a canonical order. Examples are "yyyyMMMM" for year + full month, or "MMMd" for abbreviated month + day. In particular:

The fields are from the Date Field Symbol Table in Appendix F: Date Format Patterns.
The canonical order is from top to bottom in that table; that is, "yM" not "My".
Only one field of each type is allowed; that is "Hh" is not valid.
The 'a' field is not allowed in the skeleton.
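
The constraints above can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the field table is abbreviated, and its ordering is an assumption chosen so that ids appearing in the example data (such as "yMEd", where the weekday precedes the day) validate. Appendix F remains the authoritative source for the fields and their order; the function name is hypothetical.

```python
import re

# Assumed, abbreviated ordering of field types and their pattern letters.
FIELD_ORDER = [
    ("era", "G"), ("year", "yYuU"), ("quarter", "Qq"), ("month", "ML"),
    ("week", "wW"), ("weekday", "Eec"), ("day", "dDFg"), ("hour", "hHKk"),
    ("minute", "m"), ("second", "sSA"), ("zone", "zZvV"),
]
FIELD_RANK = {
    letter: rank
    for rank, (_, letters) in enumerate(FIELD_ORDER)
    for letter in letters
}


def is_valid_skeleton(skeleton):
    """Check known fields, canonical order, one field per type, and no 'a'."""
    if "a" in skeleton:
        return False
    ranks = []
    for match in re.finditer(r"(.)\1*", skeleton):  # runs of one repeated letter
        rank = FIELD_RANK.get(match.group(1))
        if rank is None:
            return False
        ranks.append(rank)
    # Strictly increasing ranks imply canonical order and no repeated field type.
    return all(a < b for a, b in zip(ranks, ranks[1:]))


print(is_valid_skeleton("yMMMd"))  # True
print(is_valid_skeleton("My"))     # False: not in canonical order
print(is_valid_skeleton("Hh"))     # False: two hour fields
```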

In order to support user overrides of default locale behavior, data should be supplied for both 12-hour-cycle time formats (using h or K) and 24-hour-cycle time formats (using H or k), even if one of those styles is not commonly used; the locale's actual preference for 12-hour or 24-hour time cycle is determined from the hour character used in the locale's standard short time format. Thus skeletons using h or K should have patterns that only use h or K for hours, while skeletons using H or k should have patterns that only use H or k for hours.

<!ELEMENT appendItems (alias | (appendItem*, special*))>
<!ELEMENT appendItem ( #PCDATA ) >
<!ATTLIST appendItem request CDATA #REQUIRED >

In case the best match does not include all the requested calendar fields, the appendItems element describes how to append needed fields to one of the existing formats. Each appendItem element covers a single calendar field. In the pattern, {0} represents the format string, {1} the data content of the field, and {2} the display name of the field (see Calendar Fields).
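
For illustration, the following Python sketch applies an appendItem pattern. The function name and the sample values ("Jan 10", "2", "week") are hypothetical and are not taken from CLDR data.

```python
import re


def apply_append_item(append_pattern, base_format, field_content, field_name):
    """Fill {0} (base format), {1} (field content) and {2} (field display name)."""
    values = {"{0}": base_format, "{1}": field_content, "{2}": field_name}
    return re.sub(r"\{[012]\}", lambda m: values[m.group()], append_pattern)


# Appending a week field to an existing format, using the pattern for request="w".
print(apply_append_item("{0} ({2}: {1})", "Jan 10", "2", "week"))  # Jan 10 (week: 2)
```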

<!ELEMENT intervalFormats (alias | (intervalFormatFallback*, intervalFormatItem*, special*)) >

<!ELEMENT intervalFormatFallback ( #PCDATA ) >

<!ELEMENT intervalFormatItem (alias | (greatestDifference*, special*)) >
<!ATTLIST intervalFormatItem id NMTOKEN #REQUIRED >

<!ELEMENT greatestDifference ( #PCDATA ) >
<!ATTLIST greatestDifference id NMTOKEN #REQUIRED >

Interval formats allow for software to format intervals like "Jan 10-12, 2008" as a shorter and more natural format than "Jan 10, 2008 - Jan 12, 2008". They are designed to take a "skeleton" pattern (like the one used in availableFormats) plus start and end datetime, and use that information to produce a localized format.

The data supplied in CLDR requires the software to determine the calendar field with the greatest difference before using the format pattern. For example, the greatest difference in "Jan 10-12, 2008" is the day field, while the greatest difference in "Jan 10 - Feb 12, 2008" is the month field. This is used to pick the exact pattern. The pattern is then designed to be broken up into two pieces by determining the first repeating field. For example, "MMM d-d, y" would be broken up into "MMM d-" and "d, y". The two parts are formatted with the first and second datetime, as described in more detail below.

In case there is no matching pattern, the intervalFormatFallback defines the fallback pattern. The fallback pattern is of the form "{0} - {1}" or "{1} - {0}", where {0} is replaced by the start datetime, and {1} is replaced by the end datetime. The fallback pattern also determines the default order of the interval patterns: "{0} - {1}" means that the first part of each interval pattern in the current locale is formatted with the start datetime, while "{1} - {0}" means that the first part of each interval pattern in the current locale is formatted with the end datetime.

The id attribute of intervalFormatItem is the "skeleton" pattern (like the one used in availableFormats) on which the format pattern is based. The id attribute of greatestDifference is the calendar field letter, for example 'M', which is the greatest difference between start and end datetime.

Each greatestDifference element defines the specific interval pattern to use for a given "skeleton" and a given greatest-difference field. As stated above, the interval pattern is designed to be broken up into two pieces. Each piece is similar to the pattern defined in a date format. Also, each interval pattern can override the default order defined in the fallback pattern. If an interval pattern starts with "latestFirst:", the first part of this particular interval pattern is formatted with the end datetime. If an interval pattern starts with "earliestFirst:", the first part of this particular interval pattern is formatted with the start datetime. Otherwise, the order is the same as the order defined in intervalFormatFallback.

For example, the English rules that produce "Jan 10–12, 2008", "Jan 10 – Feb 12, 2008", and "Jan 10, 2008 – Feb. 12, 2009" are as follows:

To format a start and end datetime, given a particular "skeleton":

Look for the intervalFormatItem element that matches the "skeleton", starting in the current locale and then following the locale fallback chain up to, but not including root (better results are obtained by following steps 2-6 below with locale- or language- specific data than by using matching intervalFormats from root).
If no match was found from the previous step, check what the closest match is in the fallback locale chain, as in availableFormats. That is, this allows for adjusting the string value field's width, including adjusting between "MMM" and "MMMM", and using different variants of the same field, such as 'v' and 'z'.
If a match is found from previous steps, compute the calendar field with the greatest difference between start and end datetime. If there is no difference among any of the fields in the pattern, format as a single date using availableFormats, and return.
Otherwise, look for the greatestDifference element that matches this particular greatest difference.
If there is a match, use the pieces of the corresponding pattern to format the start and end datetime, as above.
Otherwise, format the start and end datetime using the fallback pattern.
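
Two of the building blocks used in the steps above, finding the greatest difference and splitting an interval pattern at its first repeating field, can be sketched in Python as follows. This is a simplified illustration, not a complete implementation: the function names are hypothetical, only the year, month, day, hour, and minute fields are compared, and a field counts as repeating only when the same pattern letter reappears.

```python
from datetime import datetime


def greatest_difference(start, end):
    """Return the letter of the largest calendar field that differs, or None."""
    fields = [("y", "year"), ("M", "month"), ("d", "day"),
              ("h", "hour"), ("m", "minute")]
    for letter, attribute in fields:
        if getattr(start, attribute) != getattr(end, attribute):
            return letter
    return None


def split_interval_pattern(pattern):
    """Split at the start of the first field whose letter has already appeared."""
    seen = set()
    in_quote = False
    previous = ""
    for index, ch in enumerate(pattern):
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote and ch.isascii() and ch.isalpha():
            # A new field starts when the letter differs from the previous char.
            if ch != previous and ch in seen:
                return pattern[:index], pattern[index:]
            seen.add(ch)
        previous = ch
    return pattern, ""


print(greatest_difference(datetime(2008, 1, 10), datetime(2008, 1, 12)))  # d
print(split_interval_pattern("MMM d-d, y"))  # ('MMM d-', 'd, y')
```

The first piece would then be formatted with one datetime and the second piece with the other, in the order given by the fallback pattern or by an explicit "latestFirst:" or "earliestFirst:" prefix.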

<week>

<!ELEMENT week (alias | (minDays?, firstDay?, weekendStart?, weekendEnd?, special*))>

The week element is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Instead, the similar element in the supplemental data file should be used.

Calendar Fields

<!ELEMENT fields ( alias | (field*, special*)) >
<!ELEMENT field ( alias | (displayName?, relative*, special*)) >

Translations may be supplied for names of calendar fields (elements of a calendar, such as Day, Month, Year, Hour, and so on), and for relative values for those fields (for example, the day with relative value -1 is "Yesterday"). Where there is not a convenient, customary word or phrase in a particular language for a relative value, it should be omitted.

Here are examples for English and German. Notice that the German has more relative values than the English does.

<calendar>
  <fields>
...
   <field type='day'>
    <displayName>Day</displayName>
    <relative type='-1'>Yesterday</relative>
    <relative type='0'>Today</relative>
    <relative type='1'>Tomorrow</relative>
   </field>
...
  </fields>
</calendar>

<calendar>
  <fields>
...
   <field type='day'>
    <displayName>Tag</displayName>
    <relative type='-2'>Vorgestern</relative>
    <relative type='-1'>Gestern</relative>
    <relative type='0'>Heute</relative>
    <relative type='1'>Morgen</relative>
    <relative type='2'>Übermorgen</relative>
   </field>
...
  </fields>
</calendar>

<!ELEMENT dateRangePattern ( #PCDATA ) > (deprecated)

The dateRangePattern allows the specification of a date range, such as "May 7 - Aug. 3". For example, here is the format for English:

<dateRangePattern>{0} - {1}</dateRangePattern>

The dateRangePattern element is deprecated, use intervalFormats instead.

5.9.2 Time Zone Names

<!ELEMENT timeZoneNames (alias | (hourFormat*, hoursFormat*, gmtFormat*, gmtZeroFormat*, regionFormat*, fallbackFormat*, fallbackRegionFormat*, abbreviationFallback*, preferenceOrdering*, singleCountries*, default*, zone*, metazone*, special*)) >
<!ELEMENT zone (alias | ( long*, short*, commonlyUsed*, exemplarCity*, special*)) >

The time zone IDs (tzid) are language-independent, and follow the TZ time zone database [Olson] and naming conventions. However, the display names for those IDs can vary by locale. The generic time is so-called wall-time; what clocks use when they are correctly switched from standard to daylight time at the mandated time of the year.

Unfortunately, the canonical tzids (those in zone.tab) are not stable: they may change in each release of the TZ time zone database. In CLDR, however, stability of identifiers is very important. So the canonical IDs in CLDR are kept stable as described in Appendix L: Canonical Form.

The TZ time zone database can have multiple IDs that refer to the same entity. It does contain information on equivalence relationships between these IDs, such as "Asia/Calcutta" and "Asia/Kolkata". It does not remove IDs (with a few known exceptions), but it may change the "canonical" ID, which is the one in the file zone.tab.

For lookup purposes specifications such as CLDR need a stable canonical ID, one that does not change from release to release. The stable ID is maintained as the first alias item type element in the file bcp47/timezone.xml, such as:

<type name="inccu" alias="Asia/Calcutta Asia/Kolkata"/>

That file also contains the short ID used in keywords. In versions of CLDR previous to 1.8, the alias information (but not the short ID) was in supplementalData.xml under the zoneItem, such as:

<zoneItem type="Asia/Calcutta" territory="IN" aliases="Asia/Kolkata"/>

This element was deprecated after the introduction of bcp47/timezone.xml, because the information became redundant (or was contained in the TZ time zone database).
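
As an illustration of the lookup, the Python sketch below maps every ID listed in an alias attribute to the first ID in that attribute, which is the stable canonical ID. The function name and the wrapping <types> element are hypothetical; only the <type> element is taken from the example above.

```python
import xml.etree.ElementTree as ET

# A hypothetical wrapper around the <type> element shown above.
SAMPLE = """
<types>
  <type name="inccu" alias="Asia/Calcutta Asia/Kolkata"/>
</types>
"""


def build_canonical_map(xml_text):
    """Map each time zone ID in an alias list to the first ID in that list."""
    canonical = {}
    for type_element in ET.fromstring(xml_text).iter("type"):
        aliases = type_element.get("alias", "").split()
        for alias in aliases:
            canonical[alias] = aliases[0]
    return canonical


canonical_ids = build_canonical_map(SAMPLE)
print(canonical_ids["Asia/Kolkata"])  # Asia/Calcutta
```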

The following is an example of time zone data. Although this is an example of possible data, in most cases only the exemplarCity needs translation. And that does not even need to be present, if a country only has a single time zone. As always, the type field for each zone is the identification of that zone. It is not to be translated.

<zone type="America/Los_Angeles">
    <long>
        <generic>Pacific Time</generic>
        <standard>Pacific Standard Time</standard>
        <daylight>Pacific Daylight Time</daylight>
    </long>
    <short>
        <generic>PT</generic>
        <standard>PST</standard>
        <daylight>PDT</daylight>
    </short>
    <exemplarCity>San Francisco</exemplarCity>
</zone>

<zone type="Europe/London">
     <long>
        <generic>British Time</generic>
        <standard>British Standard Time</standard>
        <daylight>British Daylight Time</daylight>
    </long>
    <exemplarCity>York</exemplarCity>
</zone>

In a few cases, some time zone IDs do not designate a city, as in:

<zone type="America/Puerto_Rico">
    ...
</zone>

<zone type="America/Guyana">
    ...
</zone>

<zone type="America/Cayman">
    ...
</zone>

<zone type="America/St_Vincent">
    ...
</zone>

They may designate countries or territories; the name of their actual capital city may be too common or too uncommon. CLDR time zone IDs follow the Olson naming conventions.

Note: CLDR does not allow "GMT", "UT", or "UTC" as translations (short or long) of time zones other than GMT itself.

Note: Transmitting "14:30" with no other context is incomplete unless it contains information about the time zone. Ideally one would transmit neutral-format date/time information, commonly in UTC (GMT), and localize as close to the user as possible. (For more about UTC, see [UTCInfo].)

The conversion from local time into UTC depends on the particular time zone rules, which will vary by location. The standard data used for converting local time (sometimes called wall time) to UTC and back is the TZ Data [Olson], used by Linux, UNIX, Java, ICU, and others. The data includes rules for matching the laws for time changes in different countries. For example, for the US it is:

"During the period commencing at 2 o'clock antemeridian on the first Sunday of April of each year and ending at 2 o'clock antemeridian on the last Sunday of October of each year, the standard time of each zone established by sections 261 to 264 of this title, as modified by section 265 of this title, shall be advanced one hour..." (United States Law - 15 U.S.C. §6(IX)(260-7)).

Each region that has a different time zone or daylight savings time rules, either now or at any time back to 1970, is given a unique internal ID, such as Europe/Paris. (Some IDs are also distinguished on the basis of differences before 1970.) As with currency codes, these are internal codes. A localized string associated with these is provided for users (such as in the Windows Control Panels>Date/Time>Time Zone).

Unfortunately, laws change over time, and will continue to change in the future, both for the boundaries of time zone regions and the rules for daylight savings. Thus the TZ data is continually being augmented. Any two implementations using the same version of the TZ data will get the same results for the same IDs (assuming a correct implementation). However, if implementations use different versions of the data they may get different results. So if precise results are required then both the TZ ID and the TZ data version must be transmitted between the different implementations.

For more information, see [Data Formats].

The following subelements of timeZoneNames are used to control the fallback process described in Appendix J: Time Zone Display Names.

Element Name	Data Examples	Results/Comment
hourFormat	"+HHmm;-HHmm"	"+1200"
hourFormat	"+HHmm;-HHmm"	"-1200"
hoursFormat (deprecated)	"{0}/{1}"	"-0800/-0700"
gmtFormat	"GMT{0}"	"GMT-0800"
gmtFormat	"{0}ВпГ"	"-0800ВпГ"
gmtZeroFormat	"GMT"	Specifies how GMT/UTC with no explicit offset (implied 0 offset) should be represented.
regionFormat	"{0} Time"	"Japan Time"
regionFormat	"Tiempo de {0}"	"Tiempo de Japón"
fallbackFormat	"{1} ({0})"	"Pacific Time (Canada)"
fallbackRegionFormat	"{0} Time ({1})"	"United States Time (New York)"
abbreviationFallback (deprecated)	type="GMT"	causes any "long" match to be skipped in Time Zone fallbacks
preferenceOrdering (deprecated)	type="America/Mexico_City America/Chihuahua America/New_York"	a preference ordering among modern zones
singleCountries	list="America/Godthab America/Santiago America/Guayaquil Europe/Madrid Pacific/Auckland Pacific/Tahiti Europe/Lisbon..."	uses country name alone
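
To illustrate how hourFormat, gmtFormat, and gmtZeroFormat combine, here is a minimal Python sketch using the sample values from the table. The function name is hypothetical. The sketch handles only the HH and mm symbols, and it assumes that gmtZeroFormat is used whenever the offset is zero; Appendix J describes the actual fallback process.

```python
def format_gmt_offset(offset_minutes, hour_format="+HHmm;-HHmm",
                      gmt_format="GMT{0}", gmt_zero_format="GMT"):
    """Format a UTC offset given in minutes, for example -480 -> "GMT-0800"."""
    if offset_minutes == 0:
        return gmt_zero_format
    # hourFormat holds the positive and negative sub-patterns, separated by ';'.
    positive, negative = hour_format.split(";")
    template = positive if offset_minutes > 0 else negative
    hours, minutes = divmod(abs(offset_minutes), 60)
    offset_text = template.replace("HH", "%02d" % hours).replace("mm", "%02d" % minutes)
    return gmt_format.replace("{0}", offset_text)


print(format_gmt_offset(-480))  # GMT-0800
print(format_gmt_offset(720))   # GMT+1200
print(format_gmt_offset(0))     # GMT
```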

When referring to the abbreviated (short) form of the time zone name, there are often situations where the location-based (city or country) time zone designation for a particular language may not be in common usage in a particular territory.

Note: User interfaces for time zone selection can use the "generic location format" for time zone names to obtain the most useful ordering of names in a menu or list; see Appendix J: Time Zone Display Names and the zone section of the Date Field Symbol Table.

5.9.2.1 Metazones

A metazone is a grouping of one or more internal TZIDs that share a common display name in current customary usage, or that have shared a common display name during some particular time period. For example, the zones Europe/Paris, Europe/Andorra, Europe/Tirane, Europe/Vienna, Europe/Sarajevo, Europe/Brussels, Europe/Zurich, Europe/Prague, Europe/Berlin, and so on are often simply designated Central European Time (or translated equivalent).

A metazone's display fields become a secondary fallback if an appropriate data field cannot be found in the explicit time zone data. The usesMetazone field indicates that the target metazone is active for a particular time. This also provides a mechanism to effectively deal with situations where the time zone in use has changed for some reason. For example, consider the TZID "America/Indiana/Knox", which observed Central time (GMT-6:00) prior to October 27, 1991, observed Eastern time (GMT-5:00) between October 27, 1991 and April 2, 2006, and has observed Central time since April 2, 2006. This is denoted as follows (in the supplemental data file metaZones.xml).

<timezone type="America/Indiana/Knox">
  <usesMetazone to="1991-10-27 07:00" mzone="America_Central"/>
  <usesMetazone to="2006-04-02 07:00" from="1991-10-27 07:00" mzone="America_Eastern"/>
  <usesMetazone from="2006-04-02 07:00" mzone="America_Central"/>
</timezone>

Note that the dates and times are specified in UTC, not local time.
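
The lookup of the active metazone for a given moment can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the moment is a naive datetime interpreted as UTC, and the "from" bound is treated as inclusive while the "to" bound is treated as exclusive.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

KNOX = """
<timezone type="America/Indiana/Knox">
  <usesMetazone to="1991-10-27 07:00" mzone="America_Central"/>
  <usesMetazone to="2006-04-02 07:00" from="1991-10-27 07:00" mzone="America_Eastern"/>
  <usesMetazone from="2006-04-02 07:00" mzone="America_Central"/>
</timezone>
"""


def parse_utc(text):
    """Parse a 'from' or 'to' attribute value; a missing attribute means unbounded."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M") if text else None


def metazone_at(timezone_xml, utc_moment):
    """Return the mzone whose [from, to) range contains utc_moment, or None."""
    for entry in ET.fromstring(timezone_xml).iter("usesMetazone"):
        start = parse_utc(entry.get("from"))
        end = parse_utc(entry.get("to"))
        if (start is None or start <= utc_moment) and (end is None or utc_moment < end):
            return entry.get("mzone")
    return None


print(metazone_at(KNOX, datetime(2000, 1, 1)))  # America_Eastern
print(metazone_at(KNOX, datetime(2010, 1, 1)))  # America_Central
```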

The metazones can then have translations in different locale files, such as the following.

<metazone type="America_Central"> 
  <long> 
    <generic>Central Time</generic> 
    <standard>Central Standard Time</standard> 
    <daylight>Central Daylight Time</daylight> 
  </long> 
  <short> 
    <generic>CT</generic> 
    <standard>CST</standard> 
    <daylight>CDT</daylight> 
  </short> 
</metazone> 
<metazone type="America_Eastern"> 
  <long> 
    <generic>Eastern Time</generic> 
    <standard>Eastern Standard Time</standard> 
    <daylight>Eastern Daylight Time</daylight> 
  </long> 
  <short> 
    <generic>ET</generic> 
    <standard>EST</standard> 
    <daylight>EDT</daylight> 
  </short> 
</metazone>

<metazone type="America_Eastern">
  <long>
    <generic>Heure de l’Est</generic>
    <standard>Heure normale de l’Est</standard>
    <daylight>Heure avancée de l’Est</daylight>
  </long>
  <short>
    <generic>HE</generic>
    <standard>HNE</standard>
    <daylight>HAE</daylight>
  </short>
</metazone>

When formatting a date and time value using this data, an application can properly display "Eastern Time" for dates between 1991-10-27 and 2006-04-02, but display "Central Time" for current dates. (See also Section 5.2.1 Dates and Date Ranges).

Metazones are used with the 'z', 'zzzz', 'v', 'vvvv', and 'V' date time pattern characters, and not with the 'Z', 'ZZZZ', 'VVVV' pattern characters. For more information, see Appendix F: Date Format Patterns.

The commonlyUsed element is now deprecated, which effectively makes the semantics of the 'z' and 'V' formatting patterns identical. The CLDR committee has found it nearly impossible to obtain accurate and reliable data regarding which time zone abbreviations may be understood in a given territory, and therefore has changed to a simpler approach. Thus, if the short metazone form is available in a given locale, it is to be used for formatting regardless of the value of commonlyUsed. If a given short metazone form is known NOT to be understood in a given locale and the parent locale has this value such that it would normally be inherited, the inheritance of this value can be explicitly disabled by use of the 'no inheritance marker' as the value, which is three consecutive empty set characters (U+2205).
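
The effect of the no inheritance marker on lookup can be sketched in Python as follows. The function name, the locale identifiers "xx" and "xx_YY", and the data are hypothetical; the sketch assumes a simple parent chain and returns None where an implementation would continue with the next step of its time zone fallback process.

```python
NO_INHERITANCE_MARKER = "\u2205\u2205\u2205"  # three U+2205 EMPTY SET characters


def resolve_short_name(locale_chain, data, metazone):
    """Walk from the requested locale towards its parents, honoring the marker."""
    for locale in locale_chain:
        value = data.get(locale, {}).get(metazone)
        if value == NO_INHERITANCE_MARKER:
            return None  # inheritance explicitly disabled at this level
        if value is not None:
            return value
    return None


# Hypothetical data: the child locale blocks the short name held by its parent.
data = {
    "xx": {"America_Eastern": "ET"},
    "xx_YY": {"America_Eastern": NO_INHERITANCE_MARKER},
}
print(resolve_short_name(["xx_YY", "xx"], data, "America_Eastern"))  # None
print(resolve_short_name(["xx"], data, "America_Eastern"))           # ET
```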

5.10 Number Elements

<!ELEMENT numbers (alias | (defaultNumberingSystem*, otherNumberingSystems*, symbols*, decimalFormats*, scientificFormats*, percentFormats*, currencyFormats*, currencies?, special*)) >

The numbers element supplies information for formatting and parsing numbers and currencies. It has the following sub-elements: <defaultNumberingSystem>, <otherNumberingSystems>, <symbols>, <decimalFormats>, <scientificFormats>, <percentFormats>, <currencyFormats>, and <currencies>. The currency IDs are from [ISO4217] (plus some additional common-use codes). For more information, including the pattern structure, see Appendix G: Number Pattern Format.

<!ELEMENT defaultNumberingSystem ( #PCDATA ) >
This element indicates which numbering system should be used for presentation of numeric quantities in the given locale.

<!ELEMENT otherNumberingSystems ( alias | ( native*, traditional*, finance*)) >

This element defines general categories of numbering systems that are sometimes used in the given locale for formatting numeric quantities. These additional numbering systems are often used in very specific contexts, such as in calendars or for financial purposes. There are currently three defined categories, as follows:

<native> Defines the numbering system used for the native digits, usually defined as a part of the script used to write the language. The native numbering system can only be a numeric positional decimal-digit numbering system, using digits with General_Category=Decimal_Number.

<traditional> Defines the traditional numerals for a locale. This numbering system may be numeric or algorithmic. If the traditional numbering system is not defined, applications should use the native numbering system as a fallback.

<finance> Defines the numbering system used for financial quantities. This numbering system may be numeric or algorithmic. This is often used for ideographic languages such as Chinese, where it would be easy to alter an amount represented in the default numbering system simply by adding additional strokes. If the financial numbering system is not specified, applications should use the default numbering system as a fallback.

The categories defined for other numbering systems can be used in a Unicode locale identifier to select the proper numbering system without having to know the specific numbering system by name. For example:

To select Hindi language using the native digits for numeric formatting, use locale ID: "hi-IN-u-nu-native".
To select Chinese language using the appropriate financial numerals, use locale ID: "zh-u-nu-finance".
To select Tamil language using the traditional Tamil numerals, use locale ID: "ta-u-nu-traditio".
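
The following non-normative sketch shows one way an implementation might resolve the "nu" keyword, including the fallbacks described above for the traditional and finance categories. The table contents are illustrative values rather than data taken from this document, the fallback from native to the default system is an assumption, and all function and variable names are hypothetical.

```python
# Illustrative per-language data (assumed values, not quoted from the text).
NUMBERING = {
    "hi": {"default": "latn", "native": "deva"},
    "zh": {"default": "latn", "native": "hanidec", "finance": "hansfin"},
    "ta": {"default": "latn", "native": "tamldec", "traditional": "taml"},
}

# "traditio" is the keyword value used for the traditional category.
CATEGORY_SUBTAGS = {"native": "native", "traditio": "traditional",
                    "finance": "finance"}

# Fallbacks for traditional and finance follow the text above;
# the fallback for native is an assumption of this sketch.
FALLBACK = {"traditional": "native", "finance": "default", "native": "default"}

def resolve_numbering_system(locale_id):
    subtags = locale_id.lower().split("-")
    systems = NUMBERING[subtags[0]]
    value = "default"
    if "u" in subtags:
        ext = subtags[subtags.index("u") + 1:]
        if "nu" in ext and ext.index("nu") + 1 < len(ext):
            value = ext[ext.index("nu") + 1]
    if value != "default" and value not in CATEGORY_SUBTAGS:
        return value  # a specific numbering system name was requested
    category = CATEGORY_SUBTAGS.get(value, "default")
    while category not in systems:
        category = FALLBACK[category]
    return systems[category]

for locale_id in ("hi-IN-u-nu-native", "zh-u-nu-finance", "ta-u-nu-traditio"):
    print(locale_id, "->", resolve_numbering_system(locale_id))
```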

For more information on numbering systems and their definitions, see Section C.13 Numbering Systems.

5.10.1 Number Symbols

<!ELEMENT symbols (alias | (decimal*, group*, list*, percentSign*, nativeZeroDigit*, patternDigit*, plusSign*, minusSign*, exponential*, perMille*, infinity*, nan*, currencyDecimal*, currencyGroup*, special*)) >

<symbols>
      <decimal>.</decimal>
      <group>,</group>
      <list>;</list>
      <percentSign>%</percentSign>
      <nativeZeroDigit>0</nativeZeroDigit>
      <patternDigit>#</patternDigit>
      <plusSign>+</plusSign>
      <minusSign>-</minusSign>
      <exponential>E</exponential>
      <perMille>‰</perMille>
      <infinity>∞</infinity>
      <nan>☹</nan>
</symbols>

<!ATTLIST symbols numberSystem CDATA #IMPLIED >
The numberSystem attribute is used to specify that the given number formatting symbols are to be used when the given numbering system is active. By default, number formatting symbols without a specific numberSystem attribute are assumed to be used for the "latn" numbering system, which is western (ASCII) digits. Locales that specify a numbering system other than "latn" as the default should also specify number formatting symbols that are appropriate for use within the context of the given numbering system. For example, a locale that uses the Arabic-Indic digits as its default would likely use an Arabic comma for the grouping separator rather than the ASCII comma. The numberSystem attribute can also be applied to the decimalFormats, scientificFormats, currencyFormats, or percentFormats elements below, in order to specify an alternative format to be used when the given numbering system is active. For more information on numbering systems and their definitions, see Section C.13 Numbering Systems.

<!ELEMENT decimalFormats (alias | (default*, decimalFormatLength*, special*))>
<!ELEMENT decimalFormatLength (alias | (default*, decimalFormat*, special*))>
<!ATTLIST decimalFormatLength type ( full | long | medium | short ) #IMPLIED >
<!ELEMENT decimalFormat (alias | (pattern*, special*)) >
(scientificFormats, percentFormats have the same structure)

<decimalFormats>
  <decimalFormatLength type="long">
    <decimalFormat>
      <pattern>#,##0.###</pattern>
    </decimalFormat>
  </decimalFormatLength>
</decimalFormats>

<scientificFormats>
  <default type="long"/>
  <scientificFormatLength type="long">
    <scientificFormat>
      <pattern>0.000###E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
  <scientificFormatLength type="medium">
    <scientificFormat>
      <pattern>0.00##E+00</pattern>
    </scientificFormat>
  </scientificFormatLength>
</scientificFormats>

<percentFormats>
  <percentFormatLength type="long">
    <percentFormat>
      <pattern>#,##0%</pattern>
    </percentFormat>
  </percentFormatLength>
</percentFormats>

<!ELEMENT currencyFormats (alias | (default*, currencySpacing*, currencyFormatLength*, unitPattern*, special*)) >
<!ELEMENT currencySpacing (alias | (beforeCurrency*, afterCurrency*, special*)) >
<!ELEMENT beforeCurrency (alias | (currencyMatch*, surroundingMatch*, insertBetween*)) >
<!ELEMENT afterCurrency (alias | (currencyMatch*, surroundingMatch*, insertBetween*)) >
<!ELEMENT currencyMatch ( #PCDATA ) >
<!ELEMENT surroundingMatch ( #PCDATA ) >
<!ELEMENT insertBetween ( #PCDATA ) >
<!ELEMENT currencyFormatLength (alias | (default*, currencyFormat*, special*)) >
<!ATTLIST currencyFormatLength type ( full | long | medium | short ) #IMPLIED >
<!ELEMENT currencyFormat (alias | (pattern*, special*)) >

<currencyFormats>
  <currencyFormatLength type="long">
    <currencyFormat>
      <pattern>¤ #,##0.00;(¤ #,##0.00)</pattern>
    </currencyFormat>
  </currencyFormatLength>
</currencyFormats>

A pattern type attribute is used for compact number formats, such as the following:

<decimalFormatLength type="long">
	<decimalFormat>
		<pattern type="1000" count="one">0 millier</pattern>
		<pattern type="1000" count="other">0 milliers</pattern>
		<pattern type="10000" count="one">00 mille</pattern>
		<pattern type="10000" count="other">00 mille</pattern>
		<pattern type="100000" count="one">000 mille</pattern>
		<pattern type="100000" count="other">000 mille</pattern>
		<pattern type="1000000" count="one">0 million</pattern>
		<pattern type="1000000" count="other">0 millions</pattern>
		...
	</decimalFormat>
</decimalFormatLength>
<decimalFormatLength type="short">
	<decimalFormat>
		<pattern type="1000" count="one">0 K</pattern>
		<pattern type="1000" count="other">0 K</pattern>
		<pattern type="10000" count="one">00 K</pattern>
		<pattern type="10000" count="other">00 K</pattern>
		<pattern type="100000" count="one">000 K</pattern>
		<pattern type="100000" count="other">000 K</pattern>
		<pattern type="1000000" count="one">0 M</pattern>
		<pattern type="1000000" count="other">0 M</pattern>
		...
	</decimalFormat>
</decimalFormatLength>

To format a number N, the greatest type less than or equal to N is used, with the appropriate plural category. N is then divided by the type after a number of trailing zeros has been removed from it: the number of zeros in the pattern, less 1. APIs supporting this format should provide control over the number of significant or fraction digits.

Thus N=12345 matches <pattern type="10000" count="other">00 K</pattern>. N is divided by 1000 (obtained from 10000 after removing "00" and restoring one "0"). The result is formatted according to the normal decimal pattern. With no fractional digits, that yields "12 K".
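
The selection and division steps can be sketched as follows. This is a non-normative illustration: the function names are hypothetical, the plural rule is a simplified English-like stand-in for the rules in Appendix C.11, values below the smallest type are simply returned unformatted, and the quotient is truncated to an integer, which is only one possible treatment of "no fractional digits".

```python
import re

# The short patterns from the example above, keyed by (type, count).
SHORT_PATTERNS = {
    (1000, "one"): "0 K", (1000, "other"): "0 K",
    (10000, "one"): "00 K", (10000, "other"): "00 K",
    (100000, "one"): "000 K", (100000, "other"): "000 K",
    (1000000, "one"): "0 M", (1000000, "other"): "0 M",
}

def plural_category(value):
    # Assumption: simplified English-like rule.
    return "one" if value == 1 else "other"

def format_compact(n, patterns=SHORT_PATTERNS):
    # Greatest type less than or equal to N.
    types = sorted({t for t, _ in patterns if t <= n})
    if not types:
        return str(n)  # assumption: no compact form below the smallest type
    best = types[-1]
    # Count the zeros in the pattern; remove (zeros - 1) zeros from the type.
    zeros = len(re.search(r"0+", patterns[(best, "other")]).group())
    divisor = best // 10 ** (zeros - 1)
    scaled = n // divisor  # assumption: truncate, no fraction digits
    pattern = patterns.get((best, plural_category(scaled)),
                           patterns[(best, "other")])
    # Substitute the scaled value for the run of zeros in the pattern.
    return re.sub(r"0+", str(scaled), pattern, count=1)

print(format_compact(12345))
```

For N=12345 the selected type is 10000, the divisor is 1000, and the sketch produces "12 K", matching the example in the text.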

The short format is designed for UI environments where space is at a premium, and should ideally result in a formatted string no more than about 6 em wide (with no fractional digits).

5.10.2 Currencies

<!ELEMENT currencies (alias | (default?, currency*, special*)) >
<!ELEMENT currency (alias | (((pattern+, displayName*, symbol*) | (displayName+, symbol*, pattern*) | (symbol+, pattern*))?, decimal*, group*, special*)) >
<!ELEMENT symbol ( #PCDATA ) >
<!ATTLIST symbol choice ( true | false ) #IMPLIED >

Note: The term "pattern" appears twice in the above. The first is for consistency with all other cases of pattern + displayName; the second is for backwards compatibility.

<currencies>
    <currency type="USD">
        <displayName>Dollar</displayName>
        <symbol>$</symbol>
    </currency>
    <currency type="JPY">
        <displayName>Yen</displayName>
        <symbol>¥</symbol>
    </currency>
    <currency type="PTE">
        <displayName>Escudo</displayName>
        <symbol>$</symbol>
    </currency>
</currencies>

In formatting currencies, the currency number format is used with the appropriate symbol from <currencies>, according to the currency code. The <currencies> list can contain codes that are no longer in current use, such as PTE. The choice attribute has been deprecated.

The count attribute distinguishes the different plural forms, such as in the following:

<currencyFormats>
    <unitPattern count="other">{0} {1}</unitPattern>
    ...
<currencies>

<currency type="ZWD">
    <displayName>Zimbabwe Dollar</displayName>
    <displayName count="one">Zimbabwe dollar</displayName>
    <displayName count="other">Zimbabwe dollars</displayName>
    <symbol>Z$</symbol>
</currency>

To format a particular currency value "ZWD" for a particular numeric value n:

First see if there is a count with an explicit number (0 or 1). If so, use that string.
Otherwise, determine the count value that corresponds to n using the rules in Appendix C.11 Language Plural Rules.
Next, get the currency unitPattern.
1. Look for a unitPattern element that matches the count value, starting in the current locale and then following the locale fallback chain up to, but not including root.
2. If no matching unitPattern element was found in the previous step, then look for a unitPattern element that matches count="other", starting in the current locale and then following the locale fallback chain up to root (which has a unitPattern element with count="other" for every unit type).
3. The resulting unitPattern element indicates the appropriate positioning of the numeric value and the currency display name.
Next, get the displayName element for the currency.
1. Look for a displayName element that matches the count value, starting in the current locale and then following the locale fallback chain up to, but not including root.
2. If no matching displayName element was found in the previous step, then look for a displayName element that matches count="other", starting in the current locale and then following the locale fallback chain up to, but not including root.
3. If no matching displayName element was found in the previous step, then look for a displayName element with no count, starting in the current locale and then following the locale fallback chain up to root.
4. If there is no displayName element, use the currency code itself (for example, "ZWD").
The numeric value, formatted according to the locale with the number of decimals appropriate for the currency, is substituted for {0} in the unitPattern, while the currency display name is substituted for the {1}.

While for English this may seem overly complex, for some other languages different plural forms are used for different unit types; the plural forms for certain unit types may not use all of the plural-form tags defined for the language.

For example, if the currency is ZWD and the number is 1234, then the latter maps to count="other" for English. The unit pattern for that is "{0} {1}", and the display name is "Zimbabwe dollars". The final formatted number is then "1,234 Zimbabwe dollars".
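
The lookup steps above can be sketched as follows. This is a non-normative illustration: the data layout, the key tuples, and the function names are hypothetical; the count value is assumed to have been computed by the caller from the plural rules; and the check for an explicit 0 or 1 count is omitted for brevity.

```python
def first_in_chain(chain, data, key, include_root):
    """Return the first value for key along the fallback chain, or None."""
    for locale in chain:
        if locale == "root" and not include_root:
            break
        value = data.get(locale, {}).get(key)
        if value is not None:
            return value
    return None

def format_currency_name(formatted_number, code, count, chain, data):
    # unitPattern: the exact count (not root), then count="other" (with root).
    pattern = (first_in_chain(chain, data, ("unitPattern", count), False)
               or first_in_chain(chain, data, ("unitPattern", "other"), True))
    # displayName: exact count, then "other" (neither including root),
    # then the form with no count (including root), then the code itself.
    name = (first_in_chain(chain, data, ("displayName", code, count), False)
            or first_in_chain(chain, data, ("displayName", code, "other"), False)
            or first_in_chain(chain, data, ("displayName", code, None), True)
            or code)
    return pattern.replace("{0}", formatted_number).replace("{1}", name)

# Data corresponding to the ZWD example above.
DATA = {
    "root": {("unitPattern", "other"): "{0} {1}"},
    "en": {
        ("displayName", "ZWD", None): "Zimbabwe Dollar",
        ("displayName", "ZWD", "one"): "Zimbabwe dollar",
        ("displayName", "ZWD", "other"): "Zimbabwe dollars",
    },
}

print(format_currency_name("1,234", "ZWD", "other", ["en", "root"], DATA))
```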

When the currency symbol is substituted into a pattern, there may be some further modifications, according to the following.

<currencySpacing>
  <beforeCurrency>
    <currencyMatch>[:letter:]</currencyMatch>
    <surroundingMatch>[:digit:]</surroundingMatch>
    <insertBetween>&#x00a0;</insertBetween>
  </beforeCurrency>
  <afterCurrency>
    <currencyMatch>[:letter:]</currencyMatch>
    <surroundingMatch>[:digit:]</surroundingMatch>
    <insertBetween>&#x00a0;</insertBetween>
  </afterCurrency>
</currencySpacing>

This element controls whether additional characters are inserted on the boundary between the symbol and the pattern. For example, with the above currencySpacing, inserting the symbol "US$" into the pattern "#,##0.00¤" would result in an extra no-break space inserted before the symbol, for example, "#,##0.00 US$". The beforeCurrency element governs this case, since we are looking before the "¤" symbol. The currencyMatch is positive, since the "U" in "US$" is at the start of the currency symbol being substituted. The surroundingMatch is positive, since the character just before the "¤" will be a digit. Because these two conditions are true, the insertion is made.

Conversely, look at the pattern "¤#,##0.00" with the symbol "US$". In this case, there is no insertion; the result is simply "US$#,##0.00". The afterCurrency element governs this case, since we are looking after the "¤" symbol. The surroundingMatch is positive, since the character just after the "¤" will be a digit. However, the currencyMatch is not positive, since the "$" in "US$" is at the end of the currency symbol being substituted. So the insertion is not made.
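
The two cases above can be sketched as follows. This is a non-normative illustration that operates on a string in which the number has already been formatted and the "¤" placeholder is still present. The function name is hypothetical, and Python's str.isalpha and str.isdigit are used only as approximations of the [:letter:] and [:digit:] sets.

```python
NBSP = "\u00a0"  # the insertBetween value from the example above

def substitute_symbol(text, symbol, insert_between=NBSP):
    """Replace the ¤ placeholder in text with symbol, applying spacing."""
    pos = text.index("\u00a4")
    before, after = text[:pos], text[pos + 1:]
    result = before
    # beforeCurrency: start of the symbol against the character before ¤.
    if before and symbol[0].isalpha() and before[-1].isdigit():
        result += insert_between
    result += symbol
    # afterCurrency: end of the symbol against the character after ¤.
    if after and symbol[-1].isalpha() and after[0].isdigit():
        result += insert_between
    return result + after

print(repr(substitute_symbol("1,234.50\u00a4", "US$")))  # space inserted
print(repr(substitute_symbol("\u00a41,234.50", "US$")))  # no insertion
```

With a symbol that ends in a letter, such as "USD", the second case would also receive the inserted no-break space.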

For more information on the matching used in the currencyMatch and surroundingMatch elements, see Appendix E: Unicode Sets.

Currencies can also contain optional grouping, decimal data, and pattern elements. This data is inherited from the <symbols> in the same locale data (if not present in the chain up to root), so only the differing data will be present. See Section 4.1 Multiple Inheritance.

Note: Currency values should never be interchanged without a known currency code. You never want the number 3.5 interpreted as $3.5 by one user and ¥3.5 by another. Locale data contains localization information for currencies, not a currency value for a country. A currency amount logically consists of a numeric value, plus an accompanying currency code (or equivalent). The currency code may be implicit in a protocol, such as where USD is implicit. But if the raw numeric value is transmitted without any context, then it has no definitive interpretation.

Notice that the currency code is completely independent of the end-user's language or locale. For example, RUR is the code for Russian Rubles. A currency amount of <RUR, 1.23457×10³> would be localized for a Russian user into "1 234,57р." (using U+0440 (р) cyrillic small letter er). For an English user it would be localized into the string "Rub 1,234.57". The end-user's language is needed for doing this last localization step; but that language is completely orthogonal to the currency code needed in the data. After all, the same English user could be working with dozens of currencies. Notice also that the currency code is independent of whether currency values are inter-converted, which requires more interesting financial processing: the rate of conversion may depend on a variety of factors.

Thus logically speaking, once a currency amount is entered into a system, it should be logically accompanied by a currency code in all processing. This currency code is independent of whatever the user's original locale was. Only in badly-designed software is the currency code (or equivalent) not present, so that the software has to "guess" at the currency code based on the user's locale.

Note: The number of decimal places and the rounding for each currency is not locale-specific data, and is not contained in the Locale Data Markup Language format. Those values override whatever is given in the currency numberFormat. For more information, see Appendix C: Supplemental Data.

For background information on currency names, see [CurrencyInfo].

5.11 Unit Elements

These elements specify the localized way of formatting quantities of units such as years, months, days, hours, minutes and seconds— for example, in English, "1 day" or "3 days". The English rules that produce this example are as follows ({0} indicates the position of the formatted numeric value):

<unit type="day">
	<unitPattern count="one">{0} day</unitPattern>
	<unitPattern count="other">{0} days</unitPattern>
</unit>

To format a particular unit type such as "day" for a particular numeric value n:

First see if there is a count with an explicit number (0 or 1). If so, use that string.
Otherwise, determine the count value that corresponds to n using the rules in Appendix C.11 Language Plural Rules.
Next, for unit type="day", look for a unitPattern element that matches the count value, starting in the current locale and then following the locale fallback chain up to, but not including root.
If no matching unitPattern element was found in the previous step, then look for a unitPattern element that matches count="other" (still for unit type="day"), starting in the current locale and then following the locale fallback chain up to root (which has a unitPattern element with count="other" for every unit type).
The resulting unitPattern element indicates the appropriate form of the unit name and its position with respect to the numeric value.
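
The steps above can be sketched as follows. This is a non-normative illustration: the data layout and function names are hypothetical, the root pattern "{0} d" is a placeholder value, the plural rule is a simplified English-like stand-in for Appendix C.11, and searching the whole fallback chain for an explicit 0 or 1 form is an assumption of the sketch.

```python
def plural_category(n):
    # Assumption: simplified English-like rule.
    return "one" if n == 1 else "other"

def find_unit_pattern(n, unit, chain, data):
    def lookup(count, include_root):
        for locale in chain:
            if locale == "root" and not include_root:
                break
            pattern = data.get(locale, {}).get((unit, count))
            if pattern is not None:
                return pattern
        return None

    # Explicit number (0 or 1), if such a form exists.
    if n in (0, 1):
        explicit = lookup(str(n), True)
        if explicit is not None:
            return explicit
    # Plural category up to but not including root, then "other" up to root.
    return lookup(plural_category(n), False) or lookup("other", True)

def format_unit(n, unit, chain, data):
    return find_unit_pattern(n, unit, chain, data).replace("{0}", str(n))

# Hypothetical data; "en" carries the English rules shown above.
DATA = {
    "root": {("day", "other"): "{0} d"},
    "en": {("day", "one"): "{0} day", ("day", "other"): "{0} days"},
    "en_GB": {},
}

print(format_unit(1, "day", ["en_GB", "en", "root"], DATA))
print(format_unit(3, "day", ["en_GB", "en", "root"], DATA))
```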

The explicit values 0 and 1 are added because even in languages without separate plural categories in Appendix C.11 Language Plural Rules, there are often special forms used with 0 and 1, such as "no books" or "a book" (vs 0 books and 1 book). In some languages, there is less need for these forms with units, even where they are used with other constructions. In those cases, the 0/1 forms can be omitted. Alternatively, where the category forms (such as zero or one) are completely covered by 0/1 (as in Arabic), those category forms may be omitted.

5.12 POSIX Elements

<!ELEMENT posix (alias | (messages*, special*)) >
<!ELEMENT messages (alias | ( yesstr*, nostr*)) >

The following are included for compatibility with POSIX.

The values for yesstr and nostr contain a colon-separated list of strings that would normally be recognized as "yes" and "no" responses. For cased languages, this shall include only the lower case version. POSIX locale generation tools must generate the upper case equivalents, and the abbreviated versions, and add the English words wherever they do not conflict. Examples:
- ja → ja:Ja:j:J:yes:Yes:y:Y
- ja → ja:Ja:j:J:yes:Yes // exclude y:Y if it conflicts with the native "no".
The older elements yesexpr and noexpr are deprecated. They should instead be generated from yesstr and nostr so that they match all the responses.

So for English, the appropriate strings and expressions would be as follows:

yesstr "yes:y"
nostr "no:n"

The generated yesexpr and noexpr would be:

yesexpr "^([yY]([eE][sS])?)"
This would match y,Y,yes,yeS,yEs,yES,Yes,YeS,YEs,YES.

noexpr "^([nN][oO]?)"
This would match n,N,no,nO,No,NO.
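
As a non-normative sketch, an expression of this kind could be generated from yesstr or nostr as shown below. The function name is hypothetical, the sketch assumes simple one-to-one case mappings, and it uses Python regular expressions rather than POSIX ones. The generated expression is equivalent to, but not textually identical with, the expressions shown above.

```python
import re

def build_expr(str_list):
    """Build an expression from a colon-separated list such as "yes:y"."""
    # Longest alternatives first, so that "yes" is tried before "y".
    words = sorted(str_list.split(":"), key=len, reverse=True)
    alternatives = []
    for word in words:
        parts = []
        for ch in word:
            if ch.lower() != ch.upper():
                parts.append("[%s%s]" % (ch.lower(), ch.upper()))
            else:
                parts.append(re.escape(ch))  # uncased character
        alternatives.append("".join(parts))
    return "^(" + "|".join(alternatives) + ")"

yesexpr = build_expr("yes:y")
noexpr = build_expr("no:n")
print(yesexpr)
print(noexpr)
print(bool(re.match(yesexpr, "YeS")), bool(re.match(yesexpr, "no")))
```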

5.13 Reference Element

(Use only in supplemental data; deprecated for ldml.dtd and locale data)

<!ELEMENT references ( reference* ) >
<!ELEMENT reference ( #PCDATA ) >
<!ATTLIST reference type NMTOKEN #REQUIRED>
<!ATTLIST reference standard ( true | false ) #IMPLIED >
<!ATTLIST reference uri CDATA #IMPLIED >

The references section supplies a central location for specifying references and standards. The uri should be supplied if at all possible. If the reference is not online, then an ISBN should be supplied, such as in the following example:

<reference type="R2" uri="http://www.ur.se/nyhetsjournalistik/3lan.html">Landskoder på Internet</reference>
<reference type="R3" uri="URN:ISBN:91-47-04974-X">Svenska skrivregler</reference>

5.14 Collation Elements

<!ELEMENT collations (alias | (default*, collation*, special*)) >

This section contains one or more collation elements, distinguished by type. Each collation contains rules that specify a certain sort-order, as a tailoring of the root order. The root order is based on the UCA default table defined in UTS #10: Unicode Collation Algorithm [UCA]. (For a chart view of the UCA, see Collation Chart [UCAChart].)

CLDR uses modified tables for the root order as described in CollationAuxiliary.html in the UCA data directory. U+FFFE and U+FFFF have special tailorings as well:

U+FFFF: This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent.

U+FFFE: This code point produces a CE with special minimal weights on all levels, regardless of alternate handling. This allows for Merging Sort Keys within code point space. For example, when sorting names in a database, a sortable string can be formed with last_name + '\uFFFE' + first_name. These strings would sort properly, without ever comparing the last part of a last name with the first part of another first name.

In CLDR, so as to maintain the highest and lowest status, U+FFFE..U+FFFF are not further tailorable, and nothing can tailor to them. That is, neither can occur in a collation rule: for example, the following rules are illegal:

& \uFFFF < x

& x <\uFFFF

There is also a separate collation in root that allows access to the original DUCET table order. Using the keyword “ducet”, the locale ID “und-u-co-ducet” allows access to that order. For a description of how inheritance applies to keywords, see Appendix I: Inheritance and Validity.

Special index markers have been added to the CJK collations for stroke, pinyin, zhuyin, and unihan. These markers allow for effective and robust use of indexes for these collations. For example, near the start of the pinyin there is the following:

A
<pc>阿呵𥥩锕𠼞𨉚</pc>

…

These indicate the boundaries of "buckets" that can be used for indexing. They are always two characters starting with U+FDD0, and thus will not occur in normal text. For pinyin the second character is A-Z; for unihan it is one of the radicals; and for stroke it is a character after U+2800 indicating the number of strokes, such as ⠁. For zhuyin the second character is one of the standard Bopomofo characters in the range U+3105 through U+3129.

To allow implementations in reduced memory environments to use CJK sorting, there are also short forms of each of these collation sequences. These provide for the most common characters in common use, and are marked with alt="short".

There are two syntax specifications for specifying collation rules: the basic collation syntax and the XML collation syntax. Both have the same functionality. The LDML files use the XML format, but the basic format is simpler to read, and will often be used in examples. Implementations of LDML, such as [ICUCollation] may choose to use the basic collation syntax as their native syntax.

Notes:

There is an on-line demonstration of collation at [LocaleExplorer] that uses the basic syntax. (Pick the locale and scroll to "Collation Rules", near the end.)
Java uses an early version of the basic collation syntax, but has not been updated recently. It does not support any of the basic syntax marked with [...], and its default table is not the UCA.

5.14.1 Version

The version attribute is used in case a specific version of the UCA is to be specified. It is optional, and is specified if the results are to be identical on different systems. If it is not supplied, then the version is assumed to be the same as the Unicode version for the system as a whole. In general, tailorings should be defined so as to minimize dependence on the underlying UCA version, by explicitly specifying the behavior of all characters used to write the language in question.

Note: For version 3.1.1 of the UCA, the version of Unicode must also be specified with any versioning information; an example would be "3.1.1/3.2" for version 3.1.1 of the UCA, for version 3.2 of Unicode. This was changed by decision of the UTC, so that dual versions were no longer necessary. So for UCA 4.0 and beyond, the version just has a single number.

5.14.2 Collation Element

<!ELEMENT collation (alias | (base?, settings?, suppress_contractions?, optimize?, rules?, special*)) >

The tailoring syntax is designed to be independent of the actual weights used in any particular UCA table. That way the same rules can be applied to UCA versions over time, even if the underlying weights change. The following illustrates the overall structure of a collation with the XML syntax:

<collation>
  <settings caseLevel="on"/>
  <rules>
    <reset>c</reset>
    <p>k</p>
  </rules>
</collation>

The basic syntax corresponding to this would be:

[caseLevel on]
& c < k

The optional base element <base>...</base>, contains an alias element that points to another data source that defines a base collation. If present, it indicates that the settings and rules in the collation are modifications applied on top of the respective elements in the base collation. That is, any successive settings, where present, override what is in the base as described in Setting Options. Any successive rules are concatenated to the end of the rules in the base. The results of multiple rules applying to the same characters is covered in Orderings.
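
The combination of a base collation with a tailoring can be sketched as follows. This is a non-normative illustration: collations are represented as plain dictionaries, rules are written in the basic syntax, and the function name and sample values are hypothetical.

```python
def apply_base(base, tailoring):
    """Combine a base collation with the collation that refers to it."""
    # Settings in the tailoring override those in the base.
    settings = dict(base.get("settings", {}))
    settings.update(tailoring.get("settings", {}))
    # Rules in the tailoring are concatenated to the end of the base rules.
    rules = list(base.get("rules", [])) + list(tailoring.get("rules", []))
    return {"settings": settings, "rules": rules}

base = {"settings": {"caseLevel": "on", "strength": "tertiary"},
        "rules": ["& c < k"]}
tailoring = {"settings": {"strength": "secondary"},
             "rules": ["& a < g"]}

print(apply_base(base, tailoring))
```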

5.14.3 Setting Options

In XML syntax, these are attributes of <settings>. For example, <settings strength="secondary"/> will only compare strings based on their primary and secondary weights. In basic syntax, these are of the form [keyword value].

If the attribute is not present, the CLDR default (or the default for the locale, if there is one) is used. That default is listed in bold italics. Where there is a UCA default that is different, it is listed in bold with (UCA default). Note that the default value for a locale may be different than the default value for the attribute, so the defaults here are not defaults for the corresponding keywords.

The Example cells include an LDML example followed by the same example in basic syntax.

Collation Settings
BCP47 Key	Attribute	BCP47 Value	Options	Example	Description
ks	strength	level1	primary (1)	`strength = "primary" [strength 1]`	Sets the default strength for comparison, as described in the [UCA]. Note that strength setting of greater than 4 may have the same effect as identical, depending on the locale and implementation.
		level2	secondary (2)
		level3	tertiary (3)
		level4	quaternary (4)
		identic	identical (5)
ka	alternate	noignore	non-ignorable	`alternate = "non-ignorable" [alternate non-ignorable]`	Sets alternate handling for variable weights, as described in [UCA], where "shifted" causes certain characters to be ignored in comparison. The default for LDML is different than it is in the UCA. In LDML, the default for alternate handling is non-ignorable, while in UCA it is shifted. In addition, in LDML only whitespace and punctuation are variable.
		shifted	shifted (UCA default)
		n/a	blanked
kb	backwards	true	on	`backwards = "on" [backwards 2]`	Sets the comparison for the second level to be backwards, as described in [UCA].
kb	backwards	false	off	`backwards = "on" [backwards 2]`
kk	normalization	true	on (UCA default)	`normalization = "off" [normalization off]`	If on, then the normal [UCA] algorithm is used. If off, then all strings that are in [FCD] will sort correctly, but others will not necessarily sort correctly. It should therefore only be set to off if the strings to be compared are in FCD. Note that the default for CLDR locales may be different than in the UCA. The rules for particular locales have it set to on: those locales whose exemplar characters (in forms commonly interchanged) would be affected by normalization.
kk	normalization	false	off	`normalization = "off" [normalization off]`
kc	caseLevel	true	on	`caseLevel = "off" [caseLevel on]`	If set to on, a level consisting only of case characteristics will be inserted in front of tertiary level, as a "Level 2.5". To ignore accents but take case into account, set strength to primary and case level to on. For details, see Section 5.14.13, Case Parameters.
kc	caseLevel	false	off	`caseLevel = "off" [caseLevel on]`
kf	caseFirst	upper	upper	`caseFirst = "off" [caseFirst off]`	If set to upper, causes upper case to sort before lower case. If set to lower, causes lower case to sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels. For details, see Section 5.14.13, Case Parameters.
		lower	lower
		false	off
kh	hiraganaQuaternary	true	on	`hiraganaQuaternary = "on" [hiraganaQ on]`	Controls special treatment of Hiragana code points on quaternary level. If turned on, Hiragana code points will get lower values than all the other non-variable code points in shifted. That is, the normal Level 4 value for a regular collation element is FFFF, as described in [UCA], Section 3.6.2, Variable Weighting. This is changed to FFFE for [:script=Hiragana:] characters. The strength must be greater than or equal to quaternary if this attribute is to have any effect.
kh	hiraganaQuaternary	false	off	`hiraganaQuaternary = "on" [hiraganaQ on]`
kn	numeric	true	on	`numeric = "on" [numeric on]`	If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UAX44]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123". The computed primary weights are all at the start of the digit reordering group. Thus with an untailored UCA table, "a$" < "a0" < "a2" < "a12" < "a⓪" < "aa".
kn	numeric	false	off	`numeric = "on" [numeric on]`
vt	variableTop	See Appendix Q: Locale Extension Keys and Types.	uXXuYYYY (the default is set to the highest punctuation, thus including spaces and punctuation, but not symbols)	`variableTop = "uXXuYYYY" & \u00XX\uYYYY < [variable top]`	The Option value is an encoded Unicode string, with code points in hex, leading zeros removed, and 'u' inserted between successive elements. The BCP47 value is described in Appendix Q: Locale Extension Keys and Types. Sets the string value for the variable top. All the code points with primary strengths less than or equal to that string will be considered variable, and thus affected by the alternate handling. Variables are ignorable by default in [UCA], but not in CLDR. See below for more information.
kr	reorder	a sequence of one or more reorder codes: space, punct, symbol, currency, digit, or any BCP47 script ID		`reorder = "Grek digit" [reorder Grek digit]`	Specifies a reordering of scripts or other significant blocks of characters such as symbols, punctuation, and digits. For the precise meaning and usage of the reorder codes, see Section 5.14.12, Collation Reordering.
n/a	match-boundaries	n/a	none	`match-boundaries = "whole-word"` n/a	Defined by Section 8, Searching and Matching of [UCA].
		n/a	whole-character
		n/a	whole-word
n/a	match-style	n/a	minimal	`match-style = "medial"` n/a	Defined by Section 8, Searching and Matching of [UCA].
		n/a	medial
		n/a	maximal

Variable Top (vt) bears more explanation. Users may want to include more or fewer characters as Variable. For example, someone could want to restrict the Variable characters to just include space marks. In that case, variableTop would be set to U+1680 (see UCA Variable chart). Alternatively, someone could want more of the Common characters in them, and include characters up to (but not including) '0', by setting variableTop to be U+20BA (in Unicode 6.2; see UCA Common chart).

The effect of these settings is to customize to ignore different sets of characters when comparing strings. For example, the locale identifier "de-u-ka-shifted-vt-0024" is requesting settings appropriate for German, including German sorting conventions, and that '$' and characters sorting below it are ignored in sorting.
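
As a non-normative sketch, the collation keywords in such an identifier could be extracted and mapped to the attribute names in the Collation Settings table as follows. The function name is hypothetical, the values returned are the BCP47 values rather than the XML option names, and the sketch does not validate the identifier.

```python
# BCP47 key to attribute name, as listed in the Collation Settings table.
BCP47_TO_ATTRIBUTE = {
    "ks": "strength", "ka": "alternate", "kb": "backwards",
    "kk": "normalization", "kc": "caseLevel", "kf": "caseFirst",
    "kh": "hiraganaQuaternary", "kn": "numeric",
    "vt": "variableTop", "kr": "reorder",
}

def collation_keywords(locale_id):
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # start of another extension
        if len(subtag) == 2:
            key = subtag  # two-letter subtags are keys
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)  # longer subtags are values
    return {BCP47_TO_ATTRIBUTE.get(k, k): "-".join(v)
            for k, v in keywords.items()}

settings = collation_keywords("de-u-ka-shifted-vt-0024")
print(settings)
# The vt value is a code point in hex; 0024 is '$'.
print(chr(int(settings["variableTop"], 16)))
```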

5.14.4 Collation Rule Syntax

<!ELEMENT rules (alias | ( reset, ( reset | p | pc | s | sc | t | tc | i | ic | x)* )) >

The goal for the collation rule syntax is to have clearly expressed rules with a concise format that parallels the basic syntax as much as possible. The rule syntax uses abbreviated element names for primary (level 1), secondary (level 2), tertiary (level 3), and identical, to be as short as possible. This is because the tailorings for CJK characters are quite large (tens of thousands of elements), and the extra overhead would have been considerable. Other elements and attributes do not occur as frequently, and have longer names.

Note: The rules are stated in terms of actions that cause characters to change their ordering relative to other characters. This is for stability; assigning characters specific weights would not work, since the exact weight assignment in UCA (or ISO 14651) is not required for conformance — only the relative ordering of the weights. In addition, stating rules in terms of relative order is much less sensitive to changes over time in the UCA itself.

5.14.5 Orderings

The following are the normal ordering actions used for the bulk of characters. Each rule contains a string of ordered characters that starts with an anchor point or a reset value. The reset value is an absolute point in the UCA that determines the order of other characters. For example, the rule & a < g places "g" after "a" in a tailored UCA: the "a" does not change place. Logically, each subsequent rule after a reset indicates a change to the ordering (and comparison strength) of the characters in the UCA. For example, the UCA has the following sequence (abbreviated for illustration):

... a <₃ ａ <₃ ⓐ <₃ A <₃ Ａ <₃ Ⓐ <₃ ª <₂ á <₃ Á <₁ æ <₃ Æ <₁ ɐ <₁ ɑ <₁ ɒ <₁ b <₃ ｂ <₃ ⓑ <₃ B <₃ Ｂ <₃ ℬ ...

Whenever a character is inserted into the UCA sequence, it is inserted at the first point where the strength difference will not disturb the other characters in the UCA. For example, & a < g puts g in the above sequence with a strength of L1. Thus the g must go in after any lower strengths, as follows:

... a <₃ ａ <₃ ⓐ <₃ A <₃ Ａ <₃ Ⓐ <₃ ª <₂ á <₃ Á <₁ g <₁ æ <₃ Æ <₁ ɐ <₁ ɑ <₁ ɒ <₁ b <₃ ｂ <₃ ⓑ <₃ B <₃ Ｂ <₃ ℬ ...

The rule & a << g, which uses a level-2 strength, would produce the following sequence:

... a <₃ ａ <₃ ⓐ <₃ A <₃ Ａ <₃ Ⓐ <₃ ª <₂ g <₂ á <₃ Á <₁ æ <₃ Æ <₁ ɐ <₁ ɑ <₁ ɒ <₁ b <₃ ｂ <₃ ⓑ <₃ B <₃ Ｂ <₃ ℬ ...

And the rule & a <<< g, which uses a level-3 strength, would produce the following sequence:

... a <₃ g <₃ ａ <₃ ⓐ <₃ A <₃ Ａ <₃ Ⓐ <₃ ª <₂ á <₃ Á <₁ æ <₃ Æ <₁ ɐ <₁ ɑ <₁ ɒ <₁ b <₃ ｂ <₃ ⓑ <₃ B <₃ Ｂ <₃ ℬ ...

Since resets always work on the existing state, the rule entries must be in the proper order. A character or sequence may occur multiple times; each subsequent occurrence causes a different change. The following shows the result of serially applying three rules.

	Basic Syntax	Result	Comment
1	& a < g	... a <₁ g ...	Put g after a.
2	& a < h < k	... a <₁ h <₁ k <₁ g ...	Now put h and k after a (inserting before the g).
3	& h << g	... a <₁ h <₂ g <₁ k ...	Now put g after h (inserting before k).

Notice that characters can occur multiple times, and thus override previous rules.
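The insertion behavior described above can be modeled with a short Python sketch. It is a simplified model under stated assumptions, not an implementation of the UCA: the sequence is a list of (strength relative to the previous item, character) pairs, the function name apply_atomic_rule is hypothetical, and contractions and expansions are not handled.

```python
def apply_atomic_rule(sequence, anchor, strength, char):
    """Apply '& anchor <strength> char' to a list of (strength, char) pairs.

    Each pair holds the strength of the difference from the previous item
    (1 = primary, 2 = secondary, 3 = tertiary).
    """
    seq = list(sequence)
    # A character that already occurs is moved, so drop its old entry first.
    for i, (s, c) in enumerate(seq):
        if c == char:
            del seq[i]
            if i < len(seq):
                # The following item keeps the stronger of the two differences.
                next_s, next_c = seq[i]
                seq[i] = (min(s, next_s), next_c)
            break
    pos = next(i for i, (_, c) in enumerate(seq) if c == anchor) + 1
    # Skip items that differ from the anchor only at weaker levels.
    while pos < len(seq) and seq[pos][0] > strength:
        pos += 1
    seq.insert(pos, (strength, char))
    return seq


def order(seq):
    """Return just the characters, in sequence order."""
    return "".join(c for _, c in seq)


# Abbreviated stand-in for the UCA sequence around 'a'.
uca = [(1, "a"), (3, "A"), (2, "á"), (3, "Á"), (1, "æ"), (1, "b")]
print(order(apply_atomic_rule(uca, "a", 1, "g")))  # & a < g
print(order(apply_atomic_rule(uca, "a", 2, "g")))  # & a << g
print(order(apply_atomic_rule(uca, "a", 3, "g")))  # & a <<< g
```

Applying the three rules from the table in sequence, with each multi-step rule first broken into atomic steps, moves g twice: once when h and k are inserted ahead of it, and once when it is explicitly reset after h.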

Except for the case of expansion sequence syntax, every sequence after a reset is equivalent in action to breaking up the sequence into an atomic rule: a reset + relation pair. The tailoring is then equivalent to applying each of the atomic rules to the UCA in order, according to the above description.

Example:

Basic Syntax	Equivalent Atomic Rules
& b < q <<< Q & a < x <<< X << q <<< Q < z	& b < q & q <<< Q & a < x & x <<< X & X << q & q <<< Q & Q < z
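The decomposition into atomic rules shown in the example is mechanical, and can be sketched in Python as follows. The tokenizer is an assumption made for illustration: it handles only resets and the four relations &, <, <<, <<<, and =, with no quoting, escapes, or expansion syntax. The function name to_atomic_rules is hypothetical.

```python
import re

# Longest operators first, so that '<<<' is not read as '<<' followed by '<'.
TOKEN = re.compile(r"&|<<<|<<|<|=|[^\s&<=]+")


def to_atomic_rules(rule):
    """Split a chained rule into reset + relation pairs."""
    tokens = TOKEN.findall(rule)
    atomic = []
    previous = None
    # Tokens come in (operator, operand) pairs.
    for i in range(0, len(tokens), 2):
        op, operand = tokens[i], tokens[i + 1]
        if op == "&":
            previous = operand          # a reset only sets the anchor
        else:
            atomic.append(f"& {previous} {op} {operand}")
            previous = operand          # the next relation is anchored here
    return atomic


rule = "& b < q <<< Q & a < x <<< X << q <<< Q < z"
print(" ".join(to_atomic_rules(rule)))
```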

In the case of expansion sequence syntax, the equivalent atomic sequence can be derived by first transforming the expansion sequence syntax into normal expansion syntax. (See Expansions.)

<!ELEMENT reset ( #PCDATA | cp | ... )* >
<!ELEMENT p ( #PCDATA | cp | last_variable )* >
(Elements pc, s, sc, t, tc, i, and ic have the same structure as p.)

Specifying Collation Ordering
Basic Symbol	Basic Example	XML Symbol	XML Example	Description
`&`	`& Z`	`<reset>`	`<reset>Z</reset>`	Do not change the ordering of Z, but place subsequent characters relative to it.
`<`	`& a < b`	`<p>`	`<reset>a</reset> <p>b</p>`	Make 'b' sort after 'a', as a primary (base-character) difference
`<<`	`& a << ä`	`<s>`	`<reset>a</reset> <s>ä</s>`	Make 'ä' sort after 'a' as a secondary (accent) difference
`<<<`	`& a <<< A`	`<t>`	`<reset>a</reset> <t>A</t>`	Make 'A' sort after 'a' as a tertiary (case/variant) difference
`=`	`& v = w`	`<i>`	`<reset>v</reset> <i>w</i>`	Make 'w' sort identically to 'v'

Resets only need to be at the start of a sequence, to position the characters relative to a character that is in the UCA (or has already occurred in the tailoring). For example: <reset>z</reset><p>a</p><p>b</p><p>c</p><p>d</p>.

Some additional elements are provided to save space with large tailorings. The addition of a 'c' to the element name indicates that each of the characters in the contents of that element are to be handled as if they were separate elements with the corresponding strength. In the basic syntax, these are expressed by adding a * to the operation.

Abbreviating Ordering Specifications
Basic Symbol	Basic Example	Equivalent	XML Symbol	XML Example	Equivalent
`<*`	`& a <* bcd`	`& a < b < c < d`	`<pc>`	`<reset>a</reset><pc>bcd</pc>`	`<reset>a</reset><p>b</p><p>c</p><p>d</p>`
`<<*`	`& a <<* àáâã`	`& a << à << á << â << ã`	`<sc>`	`<reset>a</reset><sc>àáâã</sc>`	`<reset>a</reset><s>à</s><s>á</s><s>â</s><s>ã</s>`
`<<<*`	`& p <<<* PｐＰ`	`& p <<< P <<< ｐ <<< Ｐ`	`<tc>`	`<reset>p</reset><tc>PｐＰ</tc>`	`<reset>p</reset><t>P</t><t>ｐ</t><t>Ｐ</t>`
`=*`	`& v =* VwW`	`& v = V = w = W`	`<ic>`	`<reset>v</reset><ic>VwW</ic>`	`<reset>v</reset><i>V</i><i>w</i><i>W</i>`
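The abbreviated forms in the table expand one character at a time. The following Python sketch shows that expansion for the basic syntax. It assumes that each code point in the operand is a separate element (so a decomposed base-plus-mark sequence would be split) and that the operand contains no whitespace; the function name expand_starred is hypothetical.

```python
import re

# An operator followed by '*', then the run of characters it applies to.
STAR_OP = re.compile(r"(<<<|<<|<|=)\*\s*(\S+)")


def expand_starred(rule):
    """Rewrite '<* bcd' style abbreviations as '< b < c < d'."""
    def expand(match):
        op, chars = match.group(1), match.group(2)
        return " ".join(f"{op} {ch}" for ch in chars)
    return STAR_OP.sub(expand, rule)


print(expand_starred("& a <* bcd"))
print(expand_starred("& v =* VwW"))
```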

5.14.6 Contractions

To sort a sequence as a single item (contraction), just use the sequence, for example,

Specifying Contractions
Basic Example	XML Example	Description
`& k < ch`	`<reset>k</reset> <p>ch</p>`	Make the sequence 'ch' sort after 'k', as a primary (base-character) difference

5.14.7 Expansions

<!ELEMENT x (context?, ( p | pc | s | sc | t | tc | i | ic )*, extend? ) >

There are two ways to handle expansions (where a character sorts as a sequence) with both the basic syntax and the XML syntax. The first method is to reset to the sequence of characters. This is called sequence expansion syntax. The second is to use the extension sequence. This is called normal expansion syntax. Both are equivalent in practice (unless the reset sequence happens to be a contraction).

Specifying Expansions
Basic	XML	Description
`& c << k / h`	`<reset>c</reset> <x><s>k</s> <extend>h</extend></x>`	normal expansion syntax: Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h'.
`& ch << k`	`<reset>ch</reset> <s>k</s>`	sequence expansion syntax: Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h' (unless 'ch' is defined beforehand as a contraction).

If an <extend> element is necessary, it requires the rule to be embedded in an <x> element.

The sequence expansion syntax can be quite tricky, so it should be avoided where possible. In particular:

The expansion is only in effect up to, but not including, the first primary rule. Thus
<reset>ch</reset> <s>x</s> <t>y</t> <p>z</p>
is the same as
<reset>c</reset> <x><s>x</s><extend>h</extend></x> <x><t>y</t><extend>h</extend></x> <p>z</p>
In accordance with the UCA, all strings are interpreted as being in NFD form. In other rules, this has no effect, but with syntax such as <reset>ä</reset>, the ä will be treated as two characters a + ¨, unless the ä has previously been used as a contraction. Thus the ¨ will be used as an expansion for following characters (up to the next primary).

Each extension replaces the one before it; it does not append to it. So

& ab << c
& cd << e

is equivalent to:

& a << c / b << e / d

and produces the following weights (where p(x) is the primary weight and s(x) is the secondary weight of x):

Character	Weights
c	p(a), p(b); s(a)+1, s(b); ...
e	p(a), p(d); s(a)+2, s(d); ...

When expressing rules as atomic rules, the sequences must first be transformed into normal expansion syntax:

Expansion Sequence	Normal Expansion	Equivalent Atomic Rules
& ab << q <<< Q & ad <<< AD < x <<< X	& a << q / b <<< Q / b & a <<< AD / d < x <<< X	& a << q / b & q <<< Q / b & a <<< AD / d & AD < x & x <<< X
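The first transformation in the table, from the sequence form to normal expansion syntax, can be sketched in Python. The sketch assumes that the reset string is not a contraction, that the first code point of the reset is the base and the remainder is the extension, and that the = relation keeps the extension (only a primary relation ends it). The function name to_normal_expansion and the simple tokenizer are hypothetical.

```python
import re

# Longest operators first; operands are runs without spaces or operators.
TOKEN = re.compile(r"&|<<<|<<|<|=|[^\s&<=]+")


def to_normal_expansion(rule):
    """Rewrite '& ab << q' as '& a << q / b'."""
    tokens = TOKEN.findall(rule)
    out = []
    extension = ""
    for i in range(0, len(tokens), 2):
        op, text = tokens[i], tokens[i + 1]
        if op == "&":
            # Reset to the first code point; the rest becomes the extension.
            extension = text[1:]
            out.append(f"& {text[0]}")
        else:
            if op == "<":
                # The expansion ends at the first primary rule.
                extension = ""
            out.append(f"{op} {text} / {extension}" if extension else f"{op} {text}")
    return " ".join(out)


print(to_normal_expansion("& ab << q <<< Q & ad <<< AD < x <<< X"))
```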

5.14.8 Context Before

The context before a character can affect how it is ordered, such as in Japanese. This could be expressed with a combination of contractions and expansions, but is faster using a context. (The actual weights produced are different, but the resulting string comparisons are the same.) If a context element occurs, it must be the first item in the rule, and requires an <x> element.

For example, suppose that "-" is sorted like the previous vowel. Then one could have rules that take "a-", "e-", and so on. However, that means that every time a very common character (a, e, ...) is encountered, a system will slow down as it looks for possible contractions. An alternative is to indicate that when "-" is encountered, and it comes after an 'a', it sorts like an 'a', and so on.

Specifying Previous Context
Basic	XML
`& a <<< a \| - & e <<< e \| - ...`	`<reset>a</reset><x><context>a</context><s>-</s></x> <reset>e</reset><x><context>e</context><s>-</s></x> ...`

Both the context and extend elements can occur in an <x> element. For example, the following are allowed:

<x><context>abc</context><p>def</p><extend>ghi</extend></x>
<x><p>def</p><extend>ghi</extend></x>
<x><context>abc</context><p>def</p></x>

5.14.9 Placing Characters Before Others

There are certain circumstances where characters need to be placed before a given character, rather than after. This is the case with Pinyin, for example, where certain accented letters are positioned before the base letter. That is accomplished with the following syntax.

Placing Characters *Before* Others
Item	Options	Basic Example	XML Example
before	primary secondary tertiary	`& [before 2] a << à`	`<reset before="secondary">a</reset> <s>à</s>`

It is an error if the strength of the before relation is not identical to the relation after the reset. Thus the following are errors, since the value of the before attribute does not agree with the relation <s>.

Basic Example	XML Example	Result
`& [before 1] a << à`	`<reset before="primary">a</reset> <s>à</s>`	`Error`
`& [before 3] a << à`	`<reset before="tertiary">a</reset> <s>à</s>`	`Error`

5.14.10 Logical Reset Positions

The CLDR table (based on UCA) has the following overall structure for weights, going from low to high.

Specifying Logical Positions
Name	Description	UCA Examples
first tertiary ignorable ... last tertiary ignorable	p, s, t = ignore	Control Codes Format Characters Hebrew Points Tibetan Signs ...
first secondary ignorable ... last secondary ignorable	p, s = ignore	None in UCA
first primary ignorable ... last primary ignorable	p = ignore	Most combining marks
first variable ... last variable	if alternate = non-ignorable p != ignore, if alternate = shifted p, s, t = ignore	Whitespace, Punctuation
first non-ignorable ... last non-ignorable	p != ignore	General Symbols Currency Symbols Numbers Latin Greek ...
implicits	p != ignore, assigned automatically	CJK, CJK compatibility (those that are not decomposed) CJK Extension A, B Unassigned
first trailing ... last trailing	p != ignore, used for trailing syllable components	Jamo Trailing Jamo Leading

Each of the above Names (except implicits) can be used with a reset to position characters relative to that logical position. That allows characters to be ordered before or after a logical position rather than a specific character.

Note: The reason for this is so that tailorings can be more stable. A future version of the UCA might add characters at any point in the above list. Suppose that you set character X to be after Y. It could be that you want X to come after Y, no matter what future characters are added; or it could be that you just want X to come after a given logical position, for example, after the last primary ignorable.

Here is an example of the syntax:

Sample Logical Position
Basic	XML
`& [first tertiary ignorable] << à`	`<reset><first_tertiary_ignorable/></reset> <s>à</s>`

For example, to make a character be a secondary ignorable, one can make it be immediately after (at a secondary level) a specific character (like a combining dieresis), or one can make it be immediately after the last secondary ignorable.

The last-variable element indicates the "highest" character that is treated as punctuation with alternate handling. Unlike the other logical positions, it can be reset as well as referenced. For example, it can be reset to be just above spaces if all visible punctuation is to be treated as having distinct primary values.

Specifying Last-Variable
Attribute	Options	Basic Example	XML Example
variableTop	at	`& x = [last variable]`	`<reset>x</reset> <i><last_variable/></i>`
	after	`& x < [last variable]`	`<reset>x</reset> <p><last_variable/></p>`
	before	`& [before 1] x < [last variable]`	`<reset before="primary">x</reset> <p><last_variable/></p>`

The default value for last-variable is the highest punctuation mark, thus below symbols. The value can be further changed by using the variable-top setting. This takes effect, however, after the rules have been built, and does not affect any characters that are reset relative to the last-variable value when the rules are being built. The variable-top setting might also be changed via a runtime parameter. That also does not affect the rules.

The <last_variable/> cannot occur inside an <x> element, nor can there be any element content. Thus there can be no <context> or <extend> or text data in the rule. For example, the following are all disallowed:

<x><context>a</context><last_variable/></x>
<x><last_variable/><extend>a</extend></x>
<last_variable/>a
a<last_variable/>

5.14.11 Special-Purpose Commands

<!ELEMENT import EMPTY >
<!ATTLIST import source CDATA #REQUIRED >
<!ATTLIST import type CDATA #IMPLIED >

The import command imports rules from another collation. This allows for better maintenance and smaller rule sizes. The source is the locale of the source, and the type is the type (if any). If the source is "locale" it is the same locale. The type is defaulted to "standard".

Example:

<import source="de" type="phonebook"/>

Special-Purpose Commands
Basic	XML
`[suppress contractions [Љ-ґ]]`	`<suppress_contractions>[Љ-ґ]</suppress_contractions>`
`[optimize [Ά-ώ]]`	`<optimize>[Ά-ώ]</optimize>`

The suppress contractions tailoring command turns off any existing contractions that begin with those characters. It is typically used to turn off the Cyrillic contractions in the UCA, since they are not used in many languages and have a considerable performance penalty. The argument is a Unicode Set.

The optimize tailoring command is purely for performance. It indicates that those characters are sufficiently common in the target language for the tailoring that their performance should be enhanced.

The reason that these are not settings is so that their contents can be arbitrary characters.

Example:

The following is a simple example that combines portions of different tailorings for illustration. For more complete examples, see the actual locale data: Japanese, Chinese, Swedish, and German (type="phonebook") are particularly illustrative.

<collation>
  <settings caseLevel="on"/>
  <rules>
        <reset>Z</reset>
        <p>æ</p>
        <t>Æ</t>
        <p>å</p>
        <t>Å</t>
        <t>aa</t>
        <t>aA</t>
        <t>Aa</t>
        <t>AA</t>
        <p>ä</p>
        <t>Ä</t>
        <p>ö</p>
        <t>Ö</t>
        <s>ű</s>
        <t>Ű</t>
        <p>ő</p>
        <t>Ő</t>
        <s>ø</s>
        <t>Ø</t>
        <reset>V</reset>
        <tc>wW</tc>
        <reset>Y</reset>
        <tc>üÜ</tc>
        <reset><last_non_ignorable/></reset>
        <!-- following is equivalent to <p>亜</p><p>唖</p><p>娃</p>... -->
        <pc>亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦</pc>
        <pc>鯵梓圧斡扱</pc>
  </rules>
</collation>
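To show how the XML rule elements correspond to the basic syntax, the following Python sketch converts a rules element to a basic-syntax string using the standard xml.etree.ElementTree module. It is a partial illustration only: the function name rules_to_basic is hypothetical, and <x>, <context>, <extend>, <cp>, and the special-purpose commands are not handled.

```python
import xml.etree.ElementTree as ET

RELATIONS = {"p": "<", "s": "<<", "t": "<<<", "i": "=",
             "pc": "<*", "sc": "<<*", "tc": "<<<*", "ic": "=*"}
BEFORE = {"primary": 1, "secondary": 2, "tertiary": 3}


def operand(elem):
    """Return the text of an element, or its logical position in brackets."""
    if len(elem):  # a child such as <last_non_ignorable/>
        return "[" + elem[0].tag.replace("_", " ") + "]"
    return elem.text or ""


def rules_to_basic(xml_text):
    """Convert a <rules> element to a basic-syntax string."""
    parts = []
    for elem in ET.fromstring(xml_text):  # XML comments are skipped by the parser
        if elem.tag == "reset":
            before = elem.get("before")
            prefix = f"[before {BEFORE[before]}] " if before else ""
            parts.append("& " + prefix + operand(elem))
        elif elem.tag in RELATIONS:
            parts.append(RELATIONS[elem.tag] + " " + operand(elem))
        else:
            raise ValueError(f"unsupported element: {elem.tag}")
    return " ".join(parts)


sample = (
    "<rules>"
    "<reset>Z</reset><p>æ</p><t>Æ</t>"
    "<reset before=\"secondary\">a</reset><s>à</s>"
    "<reset><last_non_ignorable/></reset>"
    "<!-- a comment -->"
    "<pc>亜唖</pc>"
    "</rules>"
)
print(rules_to_basic(sample))
```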

5.14.12 Collation Reordering

Collation reordering allows scripts and certain other defined blocks of characters to be moved relative to each other parametrically, without changing the detailed rules for all the characters involved. This reordering is done on top of any specific ordering rules within the script or block currently in effect. Reordering can specify groups to be placed at the start and/or the end of the collation order. For example, to reorder Greek characters before Latin characters, and digits afterwards (but before other scripts), the following can be used:

Basic	XML	Locale Identifier
`[reorder Grek Latn digit]`	`<reorder>Grek Latn digit</reorder>`	`en-u-kr-grek-latn-digit`

In each case, a sequence of reorder_codes is used, separated by spaces for Basic and XML syntax, and by hyphens for locale identifiers.

A reorder_code is any of the following special codes:

space, punct, symbol, currency, digit - core groups of characters below 'a'
any script code from the Recommended Table in UAX 31 except Katakana, Common, and Inherited.
1. Katakana characters are always reordered with Hiragana.
2. Characters in any script not in the Recommended Table are treated as being in the preceding Recommended script, in DUCET order. Thus Phoenician characters are always reordered with Hebrew characters.
others - where all codes not explicitly mentioned should be ordered. The script code Zzzz (Unknown Script) is a synonym for others.

It is an error if a code occurs multiple times.

Interpretation of a reordering list

The reordering list is interpreted as if it were processed in the following way.

If any core code is not present, then it is inserted at the front of the list in the order given above.
If the others code is not present, then it is inserted at the end of the list.
The others code is replaced by the list of all script codes not explicitly mentioned, in DUCET order.
The reordering list is now complete, and used to reorder characters in collation accordingly.
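The interpretation steps above can be sketched in Python as follows. The list of scripts in DUCET order is an input to the function; the five-script list used in the example is an abbreviated, illustrative stand-in and not the real data. The function name complete_reordering is hypothetical, and codes are expected in the casing used by the Basic and XML syntax.

```python
CORE_CODES = ["space", "punct", "symbol", "currency", "digit"]


def complete_reordering(codes, scripts_in_ducet_order):
    """Expand a reorder list into the complete ordering of groups."""
    # Zzzz (Unknown Script) is a synonym for others.
    codes = ["others" if c == "Zzzz" else c for c in codes]
    if len(set(codes)) != len(codes):
        raise ValueError("a reorder code occurs more than once")
    # Step 1: missing core codes go at the front, in their standard order.
    result = [c for c in CORE_CODES if c not in codes] + codes
    # Step 2: others goes at the end if it was not mentioned.
    if "others" not in result:
        result.append("others")
    # Step 3: others stands for every script not explicitly mentioned.
    unmentioned = [s for s in scripts_in_ducet_order if s not in result]
    i = result.index("others")
    return result[:i] + unmentioned + result[i + 1:]


# Illustrative subset only; a real list would cover all Recommended scripts.
SAMPLE_SCRIPTS = ["Latn", "Grek", "Cyrl", "Hebr", "Arab"]
print(complete_reordering(["Grek", "Latn", "digit"], SAMPLE_SCRIPTS))
```

With the sample input, digits land after Latin but before the remaining scripts, because the unmentioned scripts are substituted at the position of the implicit others code at the end of the list.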

The locale data may have a particular ordering. For example, the Czech locale data could put digits after all letters, with [reorder others digit]. Any reordering codes specified on top of that (such as with a BCP 47 locale identifier) completely replace what was there. To specify a version of collation that completely resets any existing reordering to the DUCET ordering, the single code others can be used, as below.

Examples:

Locale Identifier	Effect
`en-u-kr-latn-digit`	Reorder digits after Latin characters (but before other scripts like Cyrillic).
`en-u-kr-others-digit`	Reorder digits after all other characters.
`en-u-kr-arab-cyrl-others-symbol`	Reorder Arabic characters first, then Cyrillic, and put symbols at the end—after all other characters.
`en-u-kr-others`	Remove any locale-specific reordering, and use DUCET order for reordering blocks.

The default reordering groups are defined by the FractionalUCA.txt file, based on the primary weights of associated collation elements. The [top_byte] table contains a mapping from the first (top) byte of primary weights to the associated reordering group. For example:

U+02D0 MODIFIER LETTER TRIANGULAR COLON has a fractional UCA collation weight of [0E 0B, 05, 05]. In the [top_byte] table, the line [top_byte 0E SYMBOL] indicates that 0E maps to SYMBOL.

There are some special cases:

The TRAILING group, the FIELD-SEPARATOR (associated with U+FFFE), and collation elements with only zero primary weights are not reordered.
The IMPLICIT group is currently treated as if it were part of Hani.
The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are never associated with characters.

The default reordering groups follow the allkeys_CLDR.txt ordering; they also may be tailored by implementations to different values. For more information on FractionalUCA.txt and allkeys_CLDR.txt, see Collation Auxiliary.

The DUCET ordering is slightly different from the allkeys_CLDR ordering. The reordering groups for the DUCET are not specified here. However, most reordering groups would start with the same characters as in FractionalUCA.txt.

5.14.13 Case Parameters

The case level is an optional intermediate level ("2.5") between Level 2 and Level 3 (or after Level 1, if there is no Level 2 due to strength settings). The case level is used to support two parametric features: ignoring non-case variants (Level 3 differences) except for case, and giving case differences a higher-level priority than other tertiary differences. Distinctions between small and large Kana characters are also included as case differences, to support Japanese collation.

The case first parameter controls whether to swap the order of upper and lowercase. It can be used with or without the case level.

Importantly, the case parameters have no effect in many instances. For example, they have no effect on the comparison of two non-ignorable characters with different primary weights, or with different secondary weights if the strength = secondary (or higher).

When either the case level or case first parameters are set, the following describes the derivation of the modified collation elements. It assumes the original levels for the code point are [p.s.t] (primary, secondary, tertiary). This derivation may change in future versions of LDML, to track the case characteristics more closely.

Untailored Characters

For untailored characters and strings, that is, for mappings in the root collation, the case value for each collation element is computed from the tertiary weight listed in allkeys_CLDR.txt. This is used to modify the collation element.

If the character is U+FFFE (lowest-weight), set case value = LOWEST.
Otherwise, look up a case value for the tertiary weight x of each collation element:
1. UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}
2. UNCASED otherwise

Compute Modified Collation Elements

From a computed case value, set a weight c according to the following.

If the value is LOWEST, set c = 1
Otherwise if CaseFirst=UpperFirst, set c = UPPER ? 2 : MIXED ? 3 : 4
Otherwise set c = UPPER ? 4 : MIXED ? 3 : 2
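The case value lookup for untailored characters and the mapping to a weight c can be written out as a short Python sketch. The function names are hypothetical, and the tertiary weights passed in the example are placeholder inputs rather than values read from allkeys_CLDR.txt.

```python
# Tertiary weights that indicate an uppercase form: 08-0C, 0E, 11, 12, 1D.
UPPER_TERTIARIES = set(range(0x08, 0x0D)) | {0x0E, 0x11, 0x12, 0x1D}


def untailored_case_value(code_point, tertiary):
    """Case value of one collation element of an untailored mapping."""
    if code_point == 0xFFFE:
        return "LOWEST"
    return "UPPER" if tertiary in UPPER_TERTIARIES else "UNCASED"


def case_weight(case_value, upper_first=False):
    """Weight c for a case value, depending on the case first setting."""
    if case_value == "LOWEST":
        return 1
    if upper_first:
        return {"UPPER": 2, "MIXED": 3}.get(case_value, 4)
    return {"UPPER": 4, "MIXED": 3}.get(case_value, 2)


value = untailored_case_value(0x0041, 0x08)
print(value, case_weight(value), case_weight(value, upper_first=True))
```

The two branches are mirror images: MIXED always receives 3, and the case first setting only decides whether UPPER sorts below or above the remaining values.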

Compute a new collation element according to the following table. The notation xt means that the values are numerically combined into a single level, such that xt < yu whenever x < y. The fourth level (if it exists) is unaffected.

Case Level	Strength	Original CE	Modified CE	Comment
on	primary	`0.s.t`	`0.0`	ignore case level weights of primary-ignorable CEs
	primary	`p.s.t`	`p.c`	ignore case level weights of primary-ignorable CEs
	secondary or higher	`0.0.t`	`0.0.0.t`	ignore case level weights of secondary-ignorable CEs
		`0.s.t`	`0.s.c.t`
		`p.s.t`	`p.s.c.t`
off	any	`0.0.0`	`0.0.00`	ignore case level weights of tertiary-ignorable CEs
		`0.0.t`	`0.0.4t`
		`0.s.t`	`0.s.ct`
		`p.s.t`	`p.s.ct`

Note the special case weights when s = 0. They ensure the construction of well-formed case and tertiary weights. For details, see Section 3.7, Well-Formed Collation Element Tables in [UCA].

Tailored Strings

Characters and strings that are tailored (e.g., via LDML/XML collation syntax or basic collation syntax) have case values computed from their UCD properties. A known limitation of the tailoring is that where the source string is a contraction of cased characters, the case level does not reflect the difference in mixed cases, such as between "dZ" and "Dz".

Form a set of case values by looking up a case value for each character x in the NFKD mapping of the source string, based on UCD properties:
1. UNCASED if x ∈ UncasedExceptions
2. LOWER if x ∈ Lowercase or x ∈ Changes_When_Uppercased or x ∈ LowerExceptions
3. UPPER if x ∈ Uppercase or x ∈ Changes_When_Lowercased or x ∈ UpperExceptions
4. MIXED if x ∈ gc=Lt or both (a) and (b)
5. UNCASED otherwise
Compute a single case value from this set, by first removing UNCASED, then setting:
1. MIXED if not all elements are identical, otherwise
2. UPPER if the set contains UPPER, otherwise
3. LOWER
Apply that case-value to the first collation element in the tailoring, according to "Compute Modified Collation Elements". The case values and weights in an expansion are unaffected.

UncasedExceptions: is the set of letter modifiers

[:General_Category=Lm:]

LowerExceptions: is the set of small letters where script=Hiragana or Katakana, plus other characters lowercase in form. In Unicode 6.2, these are:

[ぁァぃィぅゥぇェぉォゕヵㇰゖヶㇱㇲっッㇳ-ㇺゃャゅュょョㇻ-ㇿゎヮ]
[℩]

UpperExceptions: is the set of non-small letters where script=Hiragana or Katakana, minus the iteration mark, plus other characters uppercase in form. In Unicode 6.2, these are:

[あいうえお-ぢつ-もやゆよ-ろわ-ゔアイウエオ-ヂツ-モヤユヨ-ロワ-ヴヷ-ヺ 𛀀𛀁]
[℘℺⅁-⅄🅐-🅩🅰-🆊]

5.14.14 Visibility

<!ATTLIST collation visibility ( internal | external ) "external" >

Collators have external visibility by default, meaning that they can be displayed in a list of collation options for users to choose from. Collators marked as having internal visibility should not be shown in such a list. Collators are typically internal when they are partial sequences included in other collators.

5.15 Segmentations

The segmentations element provides for segmentation of text into words, lines, or other segments. The structure is based on [UAX29] notation, but adapted to be machine-readable. It uses a list of variables (representing character classes) and a list of rules. Each must have an id attribute.

The rules in root implement the segmentations found in [UAX29] and [UAX14], for grapheme clusters, words, sentences, and lines. They can be overridden by rules in child locales.

Here is an example:

<segmentations>
  <segmentation type="GraphemeClusterBreak">
    <variables>
      <variable id="$CR">\p{Grapheme_Cluster_Break=CR}</variable>
      <variable id="$LF">\p{Grapheme_Cluster_Break=LF}</variable>
      <variable id="$Control">\p{Grapheme_Cluster_Break=Control}</variable>
      <variable id="$Extend">\p{Grapheme_Cluster_Break=Extend}</variable>
      <variable id="$L">\p{Grapheme_Cluster_Break=L}</variable>
      <variable id="$V">\p{Grapheme_Cluster_Break=V}</variable>
      <variable id="$T">\p{Grapheme_Cluster_Break=T}</variable>
      <variable id="$LV">\p{Grapheme_Cluster_Break=LV}</variable>
      <variable id="$LVT">\p{Grapheme_Cluster_Break=LVT}</variable>
    </variables>
    <segmentRules>
      <rule id="3"> $CR × $LF </rule>
      <rule id="4"> ( $Control | $CR | $LF ) ÷ </rule>
      <rule id="5"> ÷ ( $Control | $CR | $LF ) </rule>
      <rule id="6"> $L × ( $L | $V | $LV | $LVT ) </rule>
      <rule id="7"> ( $LV | $V ) × ( $V | $T ) </rule>
      <rule id="8"> ( $LVT | $T) × $T </rule>
      <rule id="9"> × $Extend </rule>
    </segmentRules>
  </segmentation>
...

Variables: All variable ids must start with a $, and otherwise be valid identifiers according to the Unicode definitions in [UAX31]. The contents of a variable is a regular expression using variables and UnicodeSets. The ordering of variables is important; they are evaluated in order from first to last (see Section 5.15.1 Segmentation Inheritance). It is an error to use a variable before it is defined.

Rules: The contents of a rule use the syntax of [UAX29]. The rules are evaluated in numeric id order (which may not be the order in which they appear in the file). The first rule that matches determines the status of a boundary position, that is, whether it breaks or not. Thus ÷ means a break is allowed; × means a break is forbidden. It is an error if the rule does not contain exactly one of these characters (except where a rule has no contents at all), or if the rule uses a variable that has not been defined.

There are some implicit rules:

The implicit initial rules are always "start-of-text ÷" and "÷ end-of-text"; these are not to be included explicitly.
The implicit final rule is always "Any ÷ Any". This is not to be included explicitly.
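The evaluation order described above (numeric id order, first match wins, implicit rules at the edges) can be illustrated with a small Python sketch based on the grapheme cluster example. Several simplifications are assumed: each rule is reduced to a pair of class sets for the characters on either side of the boundary, the character classes are rough approximations built from unicodedata categories and Hangul code point ranges rather than the real Grapheme_Cluster_Break property, and all names are hypothetical.

```python
import unicodedata
from decimal import Decimal


def gcb_class(ch):
    """Rough approximation of Grapheme_Cluster_Break (assumption, not UCD data)."""
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    category = unicodedata.category(ch)
    if category == "Cc":
        return "Control"
    if category in ("Mn", "Me"):
        return "Extend"
    return "Other"


CONTROLS = {"Control", "CR", "LF"}
# (id, classes before the boundary, operator, classes after); None means any.
RULES = [
    ("3", {"CR"}, "×", {"LF"}),
    ("4", CONTROLS, "÷", None),
    ("5", None, "÷", CONTROLS),
    ("6", {"L"}, "×", {"L", "V", "LV", "LVT"}),
    ("7", {"LV", "V"}, "×", {"V", "T"}),
    ("8", {"LVT", "T"}, "×", {"T"}),
    ("9", None, "×", {"Extend"}),
]


def is_break(left, right):
    """The first matching rule, in numeric id order, decides the boundary."""
    for _id, before, op, after in sorted(RULES, key=lambda r: Decimal(r[0])):
        if (before is None or left in before) and (after is None or right in after):
            return op == "÷"
    return True  # implicit final rule: Any ÷ Any


def segments(text):
    """Split text at every boundary where a break is allowed."""
    if not text:
        return []
    out = [text[0]]  # implicit initial rule: start-of-text ÷
    for prev, cur in zip(text, text[1:]):
        if is_break(gcb_class(prev), gcb_class(cur)):
            out.append(cur)
        else:
            out[-1] += cur
    return out


print(segments("a\r\nbe\u0301"))
```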

Note: A rule like X Format* -> X in [UAX29] and [UAX14] is not supported. Instead, this needs to be expressed as normal regular expressions. The normal way to support this is to modify the variables, such as in the following example:
<variable id="$Format">\p{Word_Break=Format}</variable>
<variable id="$Katakana">\p{Word_Break=Katakana}</variable>
...

<variable id="$X">[$Format $Extend]*</variable>
<variable id="$Katakana">($Katakana $X)</variable>
<variable id="$ALetter">($ALetter $X)</variable>
...

5.15.1 Segmentation Inheritance

Variables and rules both inherit from the parent.

Variables: The child's variable list is logically appended to the parent's, and evaluated in that order. For example:

// in parent
<variable id="$AL">[:linebreak=AL:]</variable>
<variable id="$YY">[[:linebreak=XX:]$AL]</variable> // adds $AL

// in child
<variable id="$AL">[$AL && [^a-z]]</variable> // changes $AL, does not affect $YY
<variable id="$ABC">[abc]</variable> // adds new variable

Rules: The rules are also logically appended to the parent's. Because rules are evaluated in numeric id order, inserting a rule in between others just requires using an intermediate number. For example, to insert a rule after id="10.1" and before id="10.2", just use id="10.15". To delete a rule, use empty contents, such as:

<rule id="3"/> // deletes rule 3
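Rule inheritance as described can be sketched in Python as a merge followed by a numeric sort. The sketch assumes that rules are held in dictionaries from id string to rule text, that a child rule with the same id replaces the parent's, and that ids are compared as decimal numbers; the function name merge_rules and the rule bodies in the example are hypothetical.

```python
from decimal import Decimal


def merge_rules(parent, child):
    """Combine parent and child rules and return them in evaluation order."""
    merged = dict(parent)
    merged.update(child)  # a child rule with the same id replaces the parent's
    # A rule with empty contents deletes the rule with that id.
    active = {rid: body for rid, body in merged.items() if body.strip()}
    # Evaluation is in numeric id order, so 10.15 falls between 10.1 and 10.2.
    return sorted(active.items(), key=lambda item: Decimal(item[0]))


parent = {"3": "$CR × $LF", "10.1": "$A × $B", "10.2": "$C ÷ $D"}
child = {"10.15": "$E × $F", "3": ""}
for rule_id, body in merge_rules(parent, child):
    print(rule_id, body)
```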

5.16 Transforms

Transforms provide a set of rules for transforming text via a specialized set of context-sensitive matching rules. They are commonly used for transliterations or transcriptions, but also other transformations such as full-width to half-width (for katakana characters). The rules can be simple one-to-one relationships between characters, or involve more complicated mappings. Here is an example:

<transform source="Greek" target="Latin" variant="UNGEGN" direction="both">
...
  <comment>Useful variables</comment>
  <tRule>$gammaLike = [ΓΚΞΧγκξχϰ] ;</tRule>
  <tRule>$egammaLike = [GKXCgkxc] ;</tRule>
...
  <comment>Rules are predicated on running NFD first, and NFC afterwards</comment>
  <tRule>::NFD (NFC) ;</tRule>
...
  <tRule>λ ↔ l ;</tRule>
  <tRule>Λ ↔ L ;</tRule>
...
  <tRule>γ } $gammaLike ↔ n } $egammaLike ;</tRule>
  <tRule>γ ↔ g ;</tRule>
...
  <tRule>::NFC (NFD) ;</tRule>
...
</transform>

The source and target values are valid locale identifiers, where 'und' means an unspecified language, plus some additional extensions.

The long names of a script according to [UAX24] may be used instead of the short script codes. The script identifier may also omit und; that is, "und_Latn" may be written as just "Latn".
The long English names of languages may also be used instead of the language codes.
The term "Any" may be used instead of a solitary "und".
Other identifiers may be used for special purposes. In CLDR, these include: Accents, Digit, Fullwidth, Halfwidth, Jamo, NumericPinyin, Pinyin, Publishing, Tone. (Other than these values, valid private use locale identifiers should be used, such as "x-Special".)
When presenting localized transform names, the "und_" is normally omitted. Thus for a transliterator with the ID "und_Latn-und_Grek" (or the equivalent "Latin-Greek"), the translated name for Greek would be Λατινικό-Ελληνικό.

Inheritance

The CLDR transforms are built using the following locale inheritance. While this inheritance is not required of LDML implementations, the transforms supplied with CLDR may not otherwise behave as expected without some changes.

For either the source or the target, the fallback starts from the maximized locale ID (using the likely-subtags data). It also uses the country for lookup before the base language is reached, and root is never accessed: instead the script(s) associated with the language are used. Where there are multiple scripts, the maximized script is tried first, and then the other scripts associated with the language (from supplemental data).

For example, here is the fallback chain for az_IR. The Comments column identifies the entries that differ from normal fallback.

	Locale ID	Comments
1	az_Arab_IR	The maximized locale for az_IR
2	az_Arab	Normal fallback
3	az_IR	Inserted country locale
4	az	Normal fallback
5	Arab	Maximized script
6	Cyrl	Other associated script
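
The chain in the table above can be sketched in a few lines of Python. This is only an illustration, not part of LDML: the function name `transform_fallback_chain` is hypothetical, and the maximized script and the other associated scripts are passed in by the caller rather than read from the likely-subtags and supplemental data.

```python
def transform_fallback_chain(language, script, region, other_scripts=()):
    """Build the fallback chain for one side (source or target) of a transform ID.

    `script` is assumed to be the maximized script for the language, and
    `other_scripts` the remaining scripts associated with it; a real
    implementation would take both from CLDR data.
    """
    chain = [
        f"{language}_{script}_{region}",  # maximized locale
        f"{language}_{script}",           # normal fallback
        f"{language}_{region}",           # inserted country locale
        language,                         # normal fallback
        script,                           # maximized script, used instead of root
    ]
    # Other associated scripts follow the maximized script.
    chain.extend(s for s in other_scripts if s != script)
    return chain


print(transform_fallback_chain("az", "Arab", "IR", ["Cyrl"]))
# ['az_Arab_IR', 'az_Arab', 'az_IR', 'az', 'Arab', 'Cyrl']
```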

The source, target, and variant use "laddered" fallback, where the source changes the most quickly (using the above rules), then the target (using the above rules), and finally the variant, if any, is discarded. That is, in pseudo code:

for variant in {variant, ""}
- for target in target-chain
  - for source in source-chain
    - transform = lookup source-target/variant
    - if transform != null return transform

For example, here is the fallback chain for ru_RU-el_GR/BGN.

source		target	variant
ru_RU	-	el_GR	/BGN
ru	-	el_GR	/BGN
Cyrl	-	el_GR	/BGN
ru_RU	-	el	/BGN
ru	-	el	/BGN
Cyrl	-	el	/BGN
ru_RU	-	Grek	/BGN
ru	-	Grek	/BGN
Cyrl	-	Grek	/BGN
ru_RU	-	el_GR
ru	-	el_GR
Cyrl	-	el_GR
ru_RU	-	el
ru	-	el
Cyrl	-	el
ru_RU	-	Grek
ru	-	Grek
Cyrl	-	Grek
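
The laddered order shown above can be reproduced with three nested loops. The following Python sketch assumes that the source and target chains have already been computed, and that transforms are stored in a plain dictionary keyed by ID strings; the names `laddered_candidates`, `lookup_transform`, and `registry` are hypothetical.

```python
def laddered_candidates(source_chain, target_chain, variant):
    """Yield (source, target, variant) triples in laddered fallback order."""
    # The variant changes most slowly: first as given, then discarded.
    variants = [variant, ""] if variant else [""]
    for v in variants:
        for target in target_chain:
            for source in source_chain:  # the source changes most quickly
                yield source, target, v


def lookup_transform(source_chain, target_chain, variant, registry):
    """Return the first transform found in `registry`, or None."""
    for source, target, v in laddered_candidates(source_chain, target_chain, variant):
        transform_id = f"{source}-{target}" + (f"/{v}" if v else "")
        if transform_id in registry:
            return registry[transform_id]
    return None


# The chains used in the ru_RU-el_GR/BGN example above.
sources = ["ru_RU", "ru", "Cyrl"]
targets = ["el_GR", "el", "Grek"]
for source, target, v in laddered_candidates(sources, targets, "BGN"):
    print(source, "-", target, "/" + v if v else "")
```

With a registry that contains only "Cyrl-Grek", the lookup falls through all 17 more specific candidates before returning that transform.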

Variants

Variants used in CLDR include UNGEGN and BGN, both indicating sources for transliterations. There is an additional attribute private="true" which is used to indicate that the transform is meant for internal use, and should not be displayed as a separate choice in a UI.

There are many different systems of transliteration. The goals for the "unqualified" script transliterations are:

1. to be lossless when going to Latin and back
2. to be as lossless as possible when going to other scripts
3. to abide by a common standard as much as possible (possibly supplemented to meet goals 1 and 2).

Language-to-language transliterations, and variant script-to-script transliterations are generally transcriptions, and not expected to be lossless.

Additional transliterations may also be defined, such as customized language-specific transliterations (such as between Russian and French), or those that match a particular transliteration standard, such as the following:

UNGEGN - United Nations Group of Experts on Geographical Names
BGN - United States Board on Geographic Names
ISO9 - ISO/IEC 9
ISO15915 - ISO/IEC 15915
ISCII91 - ISCII 91
KMOCT - South Korean Ministry of Culture & Tourism
USLC - US Library of Congress
UKPCGN - Permanent Committee on Geographical Names for British Official Use
RUGOST - Russian Main Administration of Geodesy and Cartography

The rules for transforms are described in Appendix N: Transform Rules. For more information on Transliteration, see Transliteration Guidelines.

5.17 Rule-Based Number Formatting

<!ELEMENT rbnf ( alias | rulesetGrouping*) >

<!ELEMENT rulesetGrouping ( alias | ruleset*) >
<!ATTLIST rulesetGrouping type NMTOKEN #REQUIRED>

<!ELEMENT ruleset ( alias | rbnfrule*) >
<!ATTLIST ruleset type NMTOKEN #REQUIRED>
<!ATTLIST ruleset access ( public | private ) #IMPLIED >

<!ELEMENT rbnfrule ( #PCDATA ) >
<!ATTLIST rbnfrule value CDATA #REQUIRED >
<!ATTLIST rbnfrule radix CDATA #IMPLIED >
<!ATTLIST rbnfrule decexp CDATA #IMPLIED >

The rule-based number format (RBNF) encapsulates a set of rules for mapping binary numbers to and from a readable representation. They are typically used for spelling out numbers, but can also be used for other number systems like roman numerals, Chinese numerals, or for ordinal numbers (1st, 2nd, 3rd,...). The syntax used in the CLDR representation of rules is intended to be simply a transcription of ICU based RBNF rules into an XML compatible syntax. The rules are fairly sophisticated; for details see Rule-Based Number Formatter [RBNF].

<rulesetGrouping>

Used to group rules into functional sets for use with ICU. Currently, the valid types of rule set groupings are "SpelloutRules", "OrdinalRules", and "NumberingSystemRules".

<ruleset>

This element denotes a specific rule set to the number formatter. The ruleset is assumed to be a public ruleset unless the attribute access="private" is specified.

<rbnfrule>

Contains the actual formatting rule for a particular number or sequence of numbers. The "value" attribute is used to indicate the starting number to which the rule applies. The actual text of the rule is identical to the ICU syntax, with the exception that Unicode left and right arrow characters are used to replace < and > in the rule text, since < and > are reserved characters in XML. The "radix" attribute is used to indicate an alternate radix to be used in calculating the prefix and postfix values for number formatting. Alternate radix values are typically used for formatting year numbers in formal documents, such as "nineteen hundred seventy-six" instead of "one thousand nine hundred seventy-six".
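
As a small illustration of the arrow substitution described above, the following Python sketch converts rule text between the ICU form and the XML-safe form. It assumes that the arrows are U+2190 LEFTWARDS ARROW and U+2192 RIGHTWARDS ARROW, and it handles only the rule text, not the mapping of an ICU rule descriptor to the "value" and "radix" attributes. The function names are hypothetical.

```python
# Translation tables between ICU rule text and the XML-safe LDML form.
_ICU_TO_LDML = str.maketrans({"<": "\u2190", ">": "\u2192"})
_LDML_TO_ICU = str.maketrans({"\u2190": "<", "\u2192": ">"})


def icu_to_ldml_rule_text(rule_text):
    """Replace < and > with left and right arrows."""
    return rule_text.translate(_ICU_TO_LDML)


def ldml_to_icu_rule_text(rule_text):
    """Replace left and right arrows with < and >."""
    return rule_text.translate(_LDML_TO_ICU)


print(icu_to_ldml_rule_text("<< hundred[ >>];"))
```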

5.18 List Patterns

<!ELEMENT listPatterns (alias | (listPattern*, special*)) >

<!ELEMENT listPattern (alias | (listPatternPart*, special*)) >
<!ATTLIST listPattern type (NMTOKEN) #IMPLIED >

<!ELEMENT listPatternPart ( #PCDATA ) >
<!ATTLIST listPatternPart type (start | middle | end | 2 | 3) #REQUIRED >

List patterns can be used to format variable-length lists of things in a locale-sensitive manner, such as "Monday, Tuesday, Friday, and Saturday" (in English) versus "lundi, mardi, vendredi et samedi" (in French). Consider the following example:

<listPatterns>
 <listPattern>
  <listPatternPart type="2">{0} and {1}</listPatternPart>
  <listPatternPart type="start">{0}, {1}</listPatternPart>
  <listPatternPart type="middle">{0}, {1}</listPatternPart>
  <listPatternPart type="end">{0}, and {1}</listPatternPart>
 </listPattern>
</listPatterns>

The data is used as follows: If there is a type that matches exactly the number of elements in the desired list (such as "2" in the above list), then use that pattern. Otherwise,

Format the last two elements with the "end" format.
Then use the "middle" format to add on subsequent elements working towards the front, all but the very first element. That is, {1} is what you've already done, and {0} is the previous element.
Then use the "start" format to add the front element, again with {1} as what you've done so far, and {0} as the first element.

Thus a list (a,b,c,...m, n) is formatted as: start(a,middle(b,middle(c,middle(...end(m, n))...)))
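
The steps above can be sketched in Python as follows. This is a minimal illustration, assuming the patterns are held in a dictionary keyed by the listPatternPart type; the function name `format_list` is hypothetical. The text above does not say what to do for an empty list, a single element, or two elements when no "2" pattern exists, so the handling of those cases here is an assumption.

```python
def format_list(items, patterns):
    """Format `items` using listPatternPart patterns keyed by type."""
    n = len(items)
    if n == 0:
        return ""          # assumption: an empty list formats as nothing
    if n == 1:
        return items[0]    # assumption: a single element is returned as-is
    exact = patterns.get(str(n))
    if exact is not None:
        # A pattern whose type matches the element count exactly wins.
        return exact.format(*items)
    # Format the last two elements with "end".
    result = patterns["end"].format(items[-2], items[-1])
    if n == 2:
        return result      # assumption: fall back to "end" for two elements
    # Work towards the front with "middle", skipping the very first element.
    for item in reversed(items[1:-2]):
        result = patterns["middle"].format(item, result)
    # Add the front element with "start".
    return patterns["start"].format(items[0], result)


english = {
    "2": "{0} and {1}",
    "start": "{0}, {1}",
    "middle": "{0}, {1}",
    "end": "{0}, and {1}",
}
print(format_list(["Monday", "Tuesday", "Friday", "Saturday"], english))
# Monday, Tuesday, Friday, and Saturday
```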

5.19 ContextTransform Elements

<!ELEMENT contextTransforms ( alias | (contextTransformUsage*, special*)) >
<!ELEMENT contextTransformUsage ( alias | (contextTransform*, special*)) >
<!ATTLIST contextTransformUsage type CDATA #REQUIRED >
<!ELEMENT contextTransform ( #PCDATA ) >
<!ATTLIST contextTransform type ( uiListOrMenu | stand-alone ) #REQUIRED >

CLDR locale elements provide data for display names or symbols in many categories. The default capitalization for these elements is intended to be the form used in the middle of running text. In many languages, other capitalization may be required in other contexts, depending on the type of name or symbol.

Each <contextTransformUsage> element’s type attribute specifies a category of data from the table below; the element includes one or more <contextTransform> elements that specify how to perform capitalization of this category of data in different contexts. The <contextTransform> elements are only needed for cases in which the capitalization is other than the default form used in the middle of running text. The only value currently defined for the <contextTransform> element is the transformation "titlecase-firstword", covering the case in which text that is otherwise lowercase needs to have its first word titlecased. No other necessary case transforms have been identified.

Four contexts for capitalization behavior are currently identified. Two need no data, and hence have no corresponding <contextTransform> elements:

In the middle of running text: This is the default form, so no additional data is required.
At the beginning of a complete sentence: The initial word is titlecased; no additional data is required to indicate this.

Two other contexts require <contextTransform> elements if their capitalization behavior is other than the default for running text. The context is identified by the type attribute, as follows:

uiListOrMenu: Capitalization appropriate to a user-interface list or menu.
stand-alone: Capitalization appropriate to an isolated user-interface element (e.g. an isolated name on a calendar page)

Example:

    <contextTransforms>
        <contextTransformUsage type="languages">
             <contextTransform type="uiListOrMenu">titlecase-firstword</contextTransform>
             <contextTransform type="stand-alone">titlecase-firstword</contextTransform>
        </contextTransformUsage>
        <contextTransformUsage type="month-format-except-narrow">
             <contextTransform type="uiListOrMenu">titlecase-firstword</contextTransform>
        </contextTransformUsage>
        <contextTransformUsage type="month-standalone-except-narrow">
             <contextTransform type="uiListOrMenu">titlecase-firstword</contextTransform>
        </contextTransformUsage>
    </contextTransforms>

<contextTransformUsage> type attribute values
type attribute value	Description
all	Special value, indicates that the specified transformation applies to all of the categories below
language	localeDisplayNames language names
script	localeDisplayNames script names
territory	localeDisplayNames territory names
variant	localeDisplayNames variant names
key	localeDisplayNames key names
type	localeDisplayNames type names
month-format-except-narrow	dates/calendars/calendar[type=*]/months format wide and abbreviated month names
month-standalone-except-narrow	dates/calendars/calendar[type=*]/months stand-alone wide and abbreviated month names
month-narrow	dates/calendars/calendar[type=*]/months format and stand-alone narrow month names
day-format-except-narrow	dates/calendars/calendar[type=*]/days format wide and abbreviated day names
day-standalone-except-narrow	dates/calendars/calendar[type=*]/days stand-alone wide and abbreviated day names
day-narrow	dates/calendars/calendar[type=*]/days format and stand-alone narrow day names
era-name	dates/calendars/calendar[type=*]/eras (wide) era names
era-abbr	dates/calendars/calendar[type=*]/eras abbreviated era names
era-narrow	dates/calendars/calendar[type=*]/eras narrow era names
quarter-format-wide	dates/calendars/calendar[type=*]/quarters format wide quarter names
quarter-standalone-wide	dates/calendars/calendar[type=*]/quarters stand-alone wide quarter names
quarter-abbreviated	dates/calendars/calendar[type=*]/quarters format and stand-alone abbreviated quarter names
quarter-narrow	dates/calendars/calendar[type=*]/quarters format and stand-alone narrow quarter names
calendar-field	dates/calendars/calendar[type=*]/fields/field[type=*]/displayName field names (for relative forms see type "tense" below)
zone-exemplarCity	dates/timeZoneNames/zone[type=*]/exemplarCity city names
zone-long	dates/timeZoneNames/zone[type=*]/long zone names
zone-short	dates/timeZoneNames/zone[type=*]/short zone names
metazone-long	dates/timeZoneNames/metazone[type=*]/long metazone names
metazone-short	dates/timeZoneNames/metazone[type=*]/short metazone names
symbol	numbers/currencies/currency[type=*]/symbol symbol names
displayName-count	numbers/currencies/currency[type=*]/displayName[count=*] currency names for use with count
displayName	numbers/currencies/currency[type=*]/displayName currency names
tense	units/unit[type=*-(past|future)]/unitPattern[count=*] relative unit names; dates/calendars/calendar[type=*]/fields/field[type=*]/relative relative field names
unit-pattern	units/unit[type=*]/unitPattern[count=*] unit names
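
The following Python sketch illustrates one way the contextTransform data might be consumed. It is not a definitive implementation: the function names are hypothetical, the fallback to the special usage type "all" is an interpretation of the table above, and "titlecase-firstword" is approximated by titlecasing only the first character of the string.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the example data shown earlier in this section.
CONTEXT_TRANSFORMS_XML = """
<contextTransforms>
    <contextTransformUsage type="languages">
        <contextTransform type="uiListOrMenu">titlecase-firstword</contextTransform>
        <contextTransform type="stand-alone">titlecase-firstword</contextTransform>
    </contextTransformUsage>
    <contextTransformUsage type="month-format-except-narrow">
        <contextTransform type="uiListOrMenu">titlecase-firstword</contextTransform>
    </contextTransformUsage>
</contextTransforms>
"""


def load_context_transforms(xml_text):
    """Return {(usage, context): transform} for every contextTransform element."""
    table = {}
    for usage in ET.fromstring(xml_text).findall("contextTransformUsage"):
        for transform in usage.findall("contextTransform"):
            table[(usage.get("type"), transform.get("type"))] = transform.text.strip()
    return table


def apply_context_transform(text, usage, context, table):
    """Apply the transform for (usage, context), if any; otherwise return text unchanged."""
    transform = table.get((usage, context)) or table.get(("all", context))
    if transform == "titlecase-firstword" and text:
        # Approximation: titlecase the first character, leave the rest alone.
        return text[0].title() + text[1:]
    return text


table = load_context_transforms(CONTEXT_TRANSFORMS_XML)
print(apply_context_transform("leden", "month-format-except-narrow", "uiListOrMenu", table))
```

Contexts that have no data, such as the middle of running text, fall through to the unchanged default form.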

5.20 Metadata Elements

<!ELEMENT metadata (casingData?) >
<!ELEMENT casingData (casingItem*) >
<!ELEMENT casingItem ( #PCDATA ) >
<!ATTLIST casingItem type CDATA #REQUIRED >

The <metadata> element contains metadata about the locale for use by the Survey Tool or other tools in checking locale data; this data is not intended for export as part of the locale itself.

The <casingItem> element specifies the capitalization intended for the majority of the data in a given category within the locale. The purpose is so that warnings can be issued to translators that anything deviating from that capitalization should be carefully reviewed. Its type attribute has one of the values used for the <contextTransformUsage> element above, with the exception of the special value "all"; its value is one of the following:

lowercase
titlecase

5.21 Alias Elements

<!ELEMENT alias (special*) >
<!ATTLIST alias source NMTOKEN #REQUIRED >
<!ATTLIST alias path CDATA #IMPLIED>

The contents of any element in root can be replaced by an alias, which points to the path where the data can be found.

Aliases will only ever appear in root with the form //ldml/.../alias[@source="locale"][@path="..."].

Consider the following example in root:

<calendar type="gregorian">
  <months>
    <default choice="format"/>
    <monthContext type="format">
      <default choice="wide"/>
      <monthWidth type="abbreviated">
        <alias source="locale" path="../monthWidth[@type='wide']"/>
      </monthWidth>

If the locale "de_DE" is being accessed for a month name for format/abbreviated, then a resource bundle at "de_DE" will be searched for a resource element at that path. If not found there, then the resource bundle at "de" will be searched, and so on. When the alias is found in root, then the search is restarted, but searching for the format/wide element instead of format/abbreviated.

If the path attribute is present, then its value is an [XPath] that points to a different node in the tree. For example:

<alias source="locale" path="../monthWidth[@type='wide']"/>

The default value if the path is not present is the same position in the tree. All of the attributes in the [XPath] must be distinguishing elements. For more details, see Appendix I: Inheritance and Validity.

There is a special value for the source attribute, the constant source="locale". This special value is equivalent to the locale being resolved. For example, consider the following example, where locale data for 'de' is being resolved:

Inheritance with source="locale"
Root	de	Resolved
`<x> <a>1</a> <b>2</b> <c>3</c> </x>`	`<x> <a>11</a> <b>12</b> <d>14</d> </x>`	`<x> <a>11</a> <b>12</b> <c>3</c> <d>14</d> </x>`
`<y> <alias source="locale" path="../x"/> </y>`	`<y> <b>22</b> <e>25</e> </y>`	`<y> <a>11</a> <b>22</b> <c>3</c> <d>14</d> <e>25</e> </y>`

The first row shows the inheritance within the <x> element, whereby <c> is inherited from root. The second shows the inheritance within the <y> element, whereby <a>, <c>, and <d> are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.

For more details on data resolution, see Appendix I: Inheritance and Validity.

Aliases must be resolved recursively. An alias may point to another path that results in another alias being found, and so on. For example, looking up Thai buddhist abbreviated months for the locale xx-YY may result in the following chain of aliases being followed:

../../calendar[@type="buddhist"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]

xx-YY → xx → root // finds alias that changes path to:

../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]

xx-YY → xx → root // finds alias that changes path to:

../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="wide"]

xx-YY → xx // finds value here

It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and multiple inheritance) can be followed indefinitely without terminating.
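
The recursive resolution and the circularity check can be sketched as follows. This Python example is a simplified model, not the LDML algorithm itself: locale data is a dictionary of dictionaries, paths are short absolute keys rather than relative [XPath] expressions, the inheritance chain is supplied by the caller, and the names `Alias` and `resolve` are hypothetical.

```python
class Alias:
    """Stands in for <alias source="locale" path="..."/> in this simplified model."""

    def __init__(self, path):
        self.path = path


def resolve(path, chain, data):
    """Look up `path` along `chain` (most specific locale first), following aliases."""
    seen = set()
    while True:
        if path in seen:
            raise ValueError(f"circular chain of aliases at {path!r}")
        seen.add(path)
        for locale in chain:
            item = data.get(locale, {}).get(path)
            if isinstance(item, Alias):
                # Restart the search at the first locale with the new path.
                path = item.path
                break
            if item is not None:
                return item
        else:
            return None  # no value and no alias anywhere in the chain


data = {
    "root": {
        "buddhist/format/abbreviated": Alias("gregorian/format/abbreviated"),
        "gregorian/format/abbreviated": Alias("gregorian/format/wide"),
    },
    "xx": {"gregorian/format/wide": "wide month names from xx"},
}
print(resolve("buddhist/format/abbreviated", ["xx-YY", "xx", "root"], data))
```

The `seen` set turns a circular chain into an error instead of an endless loop.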

Appendix A: Sample Special Elements

The elements in this section are not part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be stored in the Common Locale Data Repository. They may change or be removed in future versions of this document, and are present here more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.)

The examples in this appendix are old versions: consult the documentation for the specific application to see which should be used.

These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "http://unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
    <!ENTITY % openOffice SYSTEM "http://unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd">
%icu;
%openOffice;
]>

Thus to include just the ICU DTD, one uses:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "http://unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
%icu;
]>

Note: A previous version of this document contained a special element for ISO TR 14652 compatibility data. That element has been withdrawn, pending further investigation, since 14652 is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard, despite repeated effort". See the ballot comments on 14652 Comments for details on the 14652 defects. For example, most of these patterns make little provision for substantial changes in format when elements are empty, so are not particularly useful in practice. Compare, for example, the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.

Note: While the CLDR specification guarantees backwards compatibility, the definition of specials is up to other organizations. Any assurance of backwards compatibility is up to those organizations.

A.1 openoffice.org

A number of the elements above can have extra information for openoffice.org, such as the following example:

    <special xmlns:openOffice="http://www.openoffice.org">
        <openOffice:search>
            <openOffice:searchOptions>
                <openOffice:transliterationModules>IGNORE_CASE</openOffice:transliterationModules>
            </openOffice:searchOptions>
        </openOffice:search>
    </special>

Appendix B: Transmitting Locale Information

In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing the so-called JIT localization is made up of two parts:

Store and transmit neutral-format data wherever possible.
- Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely) called binary data, even though it actually could be represented in many different ways, including a textual representation such as in XML.
- Such data should use accepted standards where possible, such as for currency codes.
- Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
Localize that data as "close" to the end-user as possible.

There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.

Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original translated message text is available (which it may not be).

Moreover, the closer we are to end-user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then it can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end user, the less we need to ship all of the user's preferences around to all the places that localization could possibly need to be done.

Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are appropriate for doing the localization. Thus information such as a locale code or time zone needs to be communicated between different components.

B.1 Message Formatting and Exceptions

Windows (FormatMessage, String.Format), Java (MessageFormat) and ICU (MessageFormat, umsg) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues.

There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.

More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (for example, datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT localization.

In addition, exceptions are often caught at a higher level; they do not end up being displayed to any end-user at all. By avoiding the localization at the throw site, the cost of doing formatting is avoided when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.
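
The approach described in this section can be sketched in Python as follows. The exception class, the message catalog, and the message ID are all hypothetical; a real system would use a proper message-formatting facility (with plural handling) rather than plain string substitution.

```python
class LocalizableError(Exception):
    """Carries a language-neutral message ID plus the arguments needed to format it."""

    def __init__(self, message_id, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


# Hypothetical message catalog, keyed by locale and then by message ID.
MESSAGES = {
    "en": {"files_not_found": "{count} files could not be found."},
    "fr": {"files_not_found": "{count} fichiers sont introuvables."},
}


def copy_files():
    # Throw site: no localization is done here, only neutral data is attached.
    raise LocalizableError("files_not_found", count=3)


def localize(error, locale):
    """Format the message close to the end-user, once the locale is known."""
    catalog = MESSAGES.get(locale, MESSAGES["en"])
    return catalog[error.message_id].format(**error.arguments)


try:
    copy_files()
except LocalizableError as error:
    print(localize(error, "fr"))
```

If the exception is caught and handled without ever being shown to a user, `localize` is never called and no formatting cost is incurred.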

Appendix C: Supplemental Data

The following represents the format for supplemental information. This is information that is important for internationalization and proper use of CLDR, but is not contained in the locale hierarchy. It is not localizable, nor is it overridden by locale data. The current CLDR data can be viewed in the Supplemental Charts.

The data in CLDR is split into multiple files: supplementalData.xml, supplementalMetadata.xml, characters.xml, likelySubtags.xml, ordinals.xml, plurals.xml, telephoneCodeData.xml, genderList.xml, plus transforms (see Section 5.16 Transforms and Appendix N: Transform Rules). The split is just for convenience: logically, they are treated as though they were a single file. Future versions of CLDR may split the data in a different fashion.

C.1 Supplemental Currency Data

<!ELEMENT currencyData ( fractions*, region+ ) >
<!ELEMENT fractions ( info+ ) >

<!ELEMENT info EMPTY >
<!ATTLIST info iso4217 NMTOKEN #REQUIRED >
<!ATTLIST info digits NMTOKEN #IMPLIED >
<!ATTLIST info rounding NMTOKEN #IMPLIED >

<!ELEMENT region ( currency* ) >
<!ATTLIST region iso3166 NMTOKEN #REQUIRED >

<!ELEMENT currency ( alternate* ) >
<!ATTLIST currency from NMTOKEN #IMPLIED >
<!ATTLIST currency to NMTOKEN #IMPLIED >
<!ATTLIST currency iso4217 NMTOKEN #REQUIRED >
<!ATTLIST currency tender ( true | false ) #IMPLIED >

Each currencyData element contains one fractions element followed by one or more region elements. Here is an example for illustration.

<supplementalData>
  <currencyData>
    <fractions>
      ...
      <info iso4217="CHF" digits="2" rounding="5"/>
      ...
      <info iso4217="ITL" digits="0"/>
      ...
    </fractions>
    ...
    <region iso3166="IT">
      <currency iso4217="EUR" from="1999-01-01"/>
      <currency iso4217="ITL" from="1862-08-24" to="2002-02-28"/>
    </region>
    ...
    <region iso3166="CS">
      <currency iso4217="EUR" from="2003-02-04"/>
      <currency iso4217="CSD" from="2002-05-15"/>
      <currency iso4217="YUM" from="1994-01-24" to="2002-05-15"/>
    </region>
    ...
  </currencyData>
...
</supplementalData>

The fractions element contains any number of info elements, with the following attributes:

iso4217: the ISO 4217 code for the currency in question. If a particular currency does not occur in the fractions list, then it is given the defaults listed for the next two attributes.
digits: the number of decimal digits normally formatted. The default is 2.
rounding: the rounding increment, in units of 10^-digits. The default is 1. Thus with fraction digits of 2 and rounding increment of 5, numeric values are rounded to the nearest 0.05 units in formatting. With fraction digits of 0 and rounding increment of 50, numeric values are rounded to the nearest 50.
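
The interaction of digits and rounding can be illustrated with Python's decimal module. This is a sketch only: the function name is hypothetical, and the rounding mode (half-even) is an assumption, since the text above does not specify one.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_currency(amount, digits=2, rounding=1):
    """Round `amount` to the nearest multiple of rounding * 10^-digits."""
    increment = Decimal(rounding).scaleb(-digits)   # e.g. 5 and 2 give 0.05
    steps = (Decimal(str(amount)) / increment).quantize(
        Decimal(1), rounding=ROUND_HALF_EVEN        # assumed rounding mode
    )
    # Express the result with exactly `digits` fraction digits.
    return (steps * increment).quantize(Decimal(1).scaleb(-digits))


print(round_currency("1.23", digits=2, rounding=5))    # 1.25
print(round_currency("1234", digits=0, rounding=50))   # 1250
```

The defaults of the function mirror the defaults given above for currencies that do not occur in the fractions list.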

Each region element contains one attribute:

iso3166: the ISO 3166 code for the region in question. The special value XXX can be used to indicate that the region has no valid currency or that the circumstances are unknown (usually used in conjunction with the to attribute, as described below).

A region element can also have any number of ordered currency subelements, each with the attributes listed after the following example.

    <region iso3166="IT"> <!-- Italy -->
      <currency iso4217="EUR" from="2002-01-01"/>
      <currency iso4217="ITL" to="2001-12-31"/>
    </region>

iso4217: the ISO 4217 code for the currency in question. Note that some additional codes that were in widespread usage are included; others, such as GHP, are not included because they were never used.
from: the currency was valid from the datetime indicated by the value. See Section 5.2.1 Dates and Date Ranges.
to: the currency was valid up to the datetime indicated by the value. See Section 5.2.1 Dates and Date Ranges.
tender: indicates whether or not the ISO currency code represents a currency that was or is legal tender in some country. The default is "true". Certain ISO codes represent things like financial instruments or precious metals, and do not represent normally interchanged currencies.

That is, each currency element will list an interval in which it was valid. The ordering of the elements in the list tells us which was the primary currency during any period in time. Here is an example of such an overlap:

<currency iso4217="CSD" to="2002-05-15"/>
<currency iso4217="YUD" from="1994-01-24" to="2002-05-15"/>
<currency iso4217="YUN" from="1994-01-01" to="1994-07-22"/>

The from element is limited by the fact that ISO 4217 does not go very far back in time, so there may be no ISO code for the previous currency.
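
As an illustration of how the validity intervals and the element order might be used, the following Python sketch returns the currencies of a region that are valid on a given day, with the primary currency first. It assumes full yyyy-MM-dd dates, treats the "to" date as inclusive, and ignores the tender attribute; these choices and the function name are assumptions, not requirements of this specification.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The Italy example from earlier in this section.
REGION_XML = """
<region iso3166="IT">
  <currency iso4217="EUR" from="2002-01-01"/>
  <currency iso4217="ITL" to="2001-12-31"/>
</region>
"""


def currencies_on(region_xml, day):
    """Return the ISO 4217 codes valid on `day`, in document (priority) order."""
    valid = []
    for currency in ET.fromstring(region_xml).findall("currency"):
        start, end = currency.get("from"), currency.get("to")
        if start and day < date.fromisoformat(start):
            continue  # not yet introduced
        if end and day > date.fromisoformat(end):
            continue  # already withdrawn (assumes "to" is inclusive)
        valid.append(currency.get("iso4217"))
    return valid


print(currencies_on(REGION_XML, date(2000, 1, 1)))   # ['ITL']
print(currencies_on(REGION_XML, date(2010, 5, 1)))   # ['EUR']
```

When intervals overlap, the first code in the returned list corresponds to the primary currency for that day.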

Currencies change relatively frequently. There are different types of changes:

1. YU=>CS (name change)
2. CS=>RS+ME (split, different names)
3. US=>US+NC (split, same name for one // Northern California secedes)
4. NC+CA=>CX (Union, new name // Northern Calif later joins with Canada to form Canadornia)
5. DE+DD=>DE (Union, reuses one name)

The UN Information is used to determine dates due to country changes.

When a code is no longer in use, it is terminated (see #1, #2, #4, #5)

Example:

<currency iso4217="EUR" from="2003-02-04" to="2006-06-03"/>

When codes split, each of the new codes inherits (see #2, #3) the previous data. However, some modifications can be made if it is clear that currencies were only in use in one of the parts.

When codes merge, the data is copied from the most populous part.

Example. When CS split into RS and ME:

RS & ME copy the former CS, except that the line for EUR is dropped from RS

CS now terminates on Jun 3, 2006 (following the UN info)

C.2 Supplemental Territory Containment

<!ELEMENT territoryContainment ( group* ) >
<!ELEMENT group EMPTY >
<!ATTLIST group type NMTOKEN #REQUIRED >
<!ATTLIST group contains NMTOKENS #IMPLIED >
<!ATTLIST group grouping ( true | false ) #IMPLIED >
<!ATTLIST group status ( deprecated | grouping ) #IMPLIED >

The following data provides information that shows groupings of countries (regions). The data is based on [UNM49]. There is one special code, QO, which is used for outlying areas that are typically uninhabited. The territory containment forms a tree with the following levels:

World

Continent

Subcontinent

Country/Region

For a chart showing the relationships (plus the included timezones), see the Territory Containment Chart. The XML structure has the following form.

<territoryContainment>

<group type="001" contains="002 009 019 142 150"/> <!--World -->
<group type="011" contains="BF BJ CI CV GH GM GN GW LR ML MR NE NG SH SL SN TG"/> <!--Western Africa -->
<group type="013" contains="BZ CR GT HN MX NI PA SV"/> <!--Central America -->
<group type="014" contains="BI DJ ER ET KE KM MG MU MW MZ RE RW SC SO TZ UG YT ZM ZW"/> <!--Eastern Africa -->
<group type="142" contains="030 035 062 145"/> <!--Asia -->
<group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/> <!--Western Asia -->
<group type="015" contains="DZ EG EH LY MA SD TN"/> <!--Northern Africa -->
...

There are groupings that don't follow this regular structure, such as:

<group type="003" contains="013 021 029" grouping="true"/> <!--North America -->

These are marked with the attribute grouping="true".

When groupings have been deprecated but kept around for backwards compatibility, they are marked with the attribute status="deprecated", like this:

<group type="029" contains="AN" status="deprecated"/> <!--Caribbean -->

When the containment relationship itself is a grouping, it is marked with the attribute status="grouping", like this:

<group type="150" contains="EU" status="grouping"/> <!--Europe -->

That is, the type value isn’t a grouping, but if you filter out groupings you can drop this containment. In the example above, EU is a grouping, and contained in 150.
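
The containment data can be walked as a tree. The Python sketch below uses a few of the group elements quoted in this section; the function names are hypothetical, and codes that have no group element in the excerpt (such as 030) are simply treated as leaves. Deprecated entries are not filtered here.

```python
import xml.etree.ElementTree as ET

# A small excerpt built from the group elements shown in this section.
CONTAINMENT_XML = """
<territoryContainment>
<group type="142" contains="030 035 062 145"/>
<group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/>
<group type="013" contains="BZ CR GT HN MX NI PA SV"/>
<group type="003" contains="013 021 029" grouping="true"/>
</territoryContainment>
"""


def load_containment(xml_text, include_groupings=True):
    """Return {code: [contained codes]}, optionally dropping grouping entries."""
    containment = {}
    for group in ET.fromstring(xml_text).findall("group"):
        is_grouping = group.get("grouping") == "true" or group.get("status") == "grouping"
        if is_grouping and not include_groupings:
            continue
        # Several group elements may share a type, so extend rather than overwrite.
        containment.setdefault(group.get("type"), []).extend(group.get("contains", "").split())
    return containment


def leaf_regions(code, containment):
    """Return the codes reachable from `code` that contain nothing themselves."""
    children = containment.get(code)
    if not children:
        return [code]
    leaves = []
    for child in children:
        leaves.extend(leaf_regions(child, containment))
    return leaves


containment = load_containment(CONTAINMENT_XML)
print(leaf_regions("145", containment))
```

Passing include_groupings=False drops entries such as 003 (North America), which corresponds to filtering out groupings as described above.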

C.3 Supplemental Language Data

<!ELEMENT languageData ( language* ) >
<!ELEMENT language EMPTY >
<!ATTLIST language type NMTOKEN #REQUIRED >
<!ATTLIST language scripts NMTOKENS #IMPLIED >
<!ATTLIST language territories NMTOKENS #IMPLIED >
<!ATTLIST language variants NMTOKENS #IMPLIED >
<!ATTLIST language alt NMTOKENS #IMPLIED >

The language data is used for consistency checking and testing. It provides a list of which languages are used with which scripts and in which countries. To a large extent, however, the territory list has been superseded by the territoryInfo data discussed below.

	<languageData>
		<language type="af" scripts="Latn" territories="ZA"/>
		<language type="am" scripts="Ethi" territories="ET"/>
		<language type="ar" scripts="Arab" territories="AE BH DZ EG IN IQ JO KW LB
LY MA OM PS QA SA SD SY TN YE"/>
                ...

If the language is not a modern language, or the script is not a modern script, or the language is not a major language of the territory, then the alt attribute is set to secondary.

		<language type="fr" scripts="Latn" territories="IT US" alt="secondary" />
                ...

C.4 Supplemental Territory Information

<!ELEMENT territory ( languagePopulation* ) >
<!ATTLIST territory type NMTOKEN #REQUIRED >
<!ATTLIST territory gdp NMTOKEN #REQUIRED >
<!ATTLIST territory literacyPercent NMTOKEN #REQUIRED >
<!ATTLIST territory population NMTOKEN #REQUIRED >

<!ELEMENT languagePopulation EMPTY >
<!ATTLIST languagePopulation type NMTOKEN #REQUIRED >
<!ATTLIST languagePopulation writingPercent NMTOKEN #IMPLIED >
<!ATTLIST languagePopulation populationPercent NMTOKEN #REQUIRED >
<!ATTLIST languagePopulation officialStatus (de_facto_official | official | official_regional | official_minority) #IMPLIED >

This data provides testing information for language and territory populations. The main goal is to provide approximate figures for the literate, functional population for each language in each territory: that is, the population that is able to read and write each language, and is comfortable enough to use it with computers.

The GDP and Literacy figures are taken from the World Bank where available, otherwise supplemented by FactBook data and other sources. Much of the per-language data is taken from the Ethnologue, but is supplemented and processed using many other sources, including per-country census data. (The focus of the Ethnologue is native speakers, which includes people who are not literate, and excludes people who are functional second-language users.)

The percentages may add up to more than 100% due to multilingual populations, or may be less than 100% due to illiteracy or because the data has not yet been gathered or processed. Languages with a small population may be omitted.

C.5 Supplemental Calendar Data

<!ELEMENT calendarData ( calendar* ) >
<!ELEMENT calendar ( calendarSystem?, eras? ) >
<!ATTLIST calendar type NMTOKENS #REQUIRED >
<!ATTLIST calendar territories NMTOKENS #IMPLIED > <!-- territories are deprecated.  use ordering attribute in calendarPreference element instead. -->

<!ELEMENT calendarSystem EMPTY >
<!ATTLIST calendarSystem type (solar | lunar | lunisolar | other) #REQUIRED >

<!ELEMENT eras ( era* ) >

<!ELEMENT era EMPTY >
<!ATTLIST era type NMTOKENS #REQUIRED >
<!ATTLIST era start CDATA #IMPLIED >
<!ATTLIST era end CDATA #IMPLIED >

<!ELEMENT weekData ( minDays*, firstDay*, weekendStart*, weekendEnd* ) >

<!ELEMENT minDays EMPTY >
<!ATTLIST minDays count (1 | 2 | 3 | 4 | 5 | 6 | 7) #REQUIRED >
<!ATTLIST minDays territories NMTOKENS #REQUIRED >

<!ELEMENT firstDay EMPTY >
<!ATTLIST firstDay day (sun | mon | tue | wed | thu | fri | sat) #REQUIRED >
<!ATTLIST firstDay territories NMTOKENS #REQUIRED >

<!ELEMENT weekendStart EMPTY >
<!ATTLIST weekendStart day (sun | mon | tue | wed | thu | fri | sat) #REQUIRED >
<!ATTLIST weekendStart territories NMTOKENS #REQUIRED >

<!ELEMENT weekendEnd EMPTY >
<!ATTLIST weekendEnd day (sun | mon | tue | wed | thu | fri | sat) #REQUIRED >
<!ATTLIST weekendEnd territories NMTOKENS #REQUIRED >

The calendar data provides locale-independent data about calendars and usage. Example:

<calendarData>
  <!-- gregorian is assumed, so these are all in addition -->
  <calendar type="japanese" territories="JP"/>
  <calendar type="islamic-civil" territories="AE BH DJ DZ EG EH ER IL IQ JO KM KW
     LB LY MA MR OM PS QA SA SD SY TD TN YE AF IR"/>
  ...

The common values provide a list of the calendars that are in common use, and thus should be shown in UIs that provide choice of calendars. (An 'Other...' button could give access to the other available calendars.)

Note: The territories attribute in the calendar element is deprecated. The calendar types used by each territory are provided by C.15 Calendar Preference Data.

<weekData>
  <minDays count="1" territories="001"/>
  <minDays count="4" territories="AT BE CA CH DE DK FI FR IT LI LT LU MC MT NL NO SE SK"/>
  <minDays count="4" territories="CD" draft="true"/>
  <firstDay day="mon" territories="001"/>
...

These values provide information on how a calendar is used in a particular territory. It may also be used in computing week boundaries for other purposes. The default is provided by the element with territories="001".

The minDays indicates the minimum number of days to count as the first week (of a month or year).

The day indicated by firstDay is the one that should be shown as the first day of the week in a calendar view. This is not necessarily the same as the first day after the weekend (or the first work day of the week), which should be determined from the weekend information. Currently, day-of-week numbering is based on firstDay (that is, day 1 is the day specified by firstDay), but in the future we may add a way to specify this separately.

What is meant by the weekend varies from country to country. It is typically when most non-retail businesses are closed. The time should not be specified unless it is a well-recognized part of the day.

The weekendStart day defaults to "sat", and weekendEnd day defaults to "sun". For more information, see Section 5.2.1 Dates and Date Ranges.
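As an illustration of the territory lookup with the "001" default, the following Python sketch reads a value from weekData. The function name week_value is hypothetical, and the sample is an abbreviated copy of the example above (the draft entry is omitted).

```python
import xml.etree.ElementTree as ET

# Abbreviated sample data copied from the weekData example above.
SAMPLE = """
<weekData>
  <minDays count="1" territories="001"/>
  <minDays count="4" territories="AT BE CA CH DE DK FI FR IT LI LT LU MC MT NL NO SE SK"/>
  <firstDay day="mon" territories="001"/>
</weekData>
"""


def week_value(root, element, attribute, territory):
    """Return the attribute value for a territory, falling back to territories="001"."""
    default = None
    for item in root.iter(element):
        territories = item.get("territories").split()
        if territory in territories:
            return item.get(attribute)
        if "001" in territories:
            default = item.get(attribute)
    return default
```

For example, week_value(ET.fromstring(SAMPLE), "minDays", "count", "DE") finds the explicit entry for DE, while a territory that is not listed receives the value from the "001" element.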

C.6 Measurement System Data

<!ELEMENT measurementData ( measurementSystem*, paperSize* ) >

<!ELEMENT measurementSystem EMPTY >
<!ATTLIST measurementSystem type ( metric | US | UK ) #REQUIRED >
<!ATTLIST measurementSystem territories NMTOKENS #REQUIRED >

<!ELEMENT paperSize EMPTY >
<!ATTLIST paperSize type ( A4 | US-Letter ) #REQUIRED >
<!ATTLIST paperSize territories NMTOKENS #REQUIRED >

The measurement system is the normal measurement system in common everyday use (except for date/time). For example:

<measurementData>
  <measurementSystem type="metric" territories="001"/>
  <measurementSystem type="US" territories="US"/>
  <paperSize type="A4" territories="001"/>
  <paperSize type="US-Letter" territories="US"/>
</measurementData>

The values are "metric", "US", or "UK"; others may be added over time. The "metric" value indicates the use of SI [ISO1000] base or derived units, or non-SI units accepted for use with SI: For example, meters, kilograms, liters, and degrees Celsius. The "US" value indicates the customary system of measurement as used in the United States: feet, inches, pints, quarts, degrees Fahrenheit, and so on. The "UK" value indicates the customary system of measurement as used in the United Kingdom: feet, inches, pints, quarts, and so on. It is also called the Imperial system: the pint, quart, and so on are different sizes than in "US".

The paperSize element gives the height and width of paper used for normal business letters. The values are "A4" and "US-Letter".

For both measurementSystem entries and paperSize entries, later entries for specific territories such as "US" will override the value assigned to that territory by earlier entries for more inclusive territories such as "001".

The measurement information was formerly in the main LDML file, and had a somewhat different format.

C.7 Supplemental Time Zone Data

<!ELEMENT windowsZones (mapTimezones?) >
<!ELEMENT metaZones (metazoneInfo?, mapTimezones?) >

<!ELEMENT metazoneInfo (timezone*) >

<!ELEMENT timezone (usesMetazone*) >
<!ATTLIST timezone type CDATA #REQUIRED >
<!ELEMENT usesMetazone EMPTY >
<!ATTLIST usesMetazone mzone NMTOKEN #REQUIRED >
<!ATTLIST usesMetazone from CDATA #IMPLIED >
<!ATTLIST usesMetazone to CDATA #IMPLIED >

<!ELEMENT mapTimezones ( mapZone* ) >
<!ATTLIST mapTimezones type NMTOKEN #IMPLIED >
<!ATTLIST mapTimezones typeVersion CDATA #IMPLIED >
<!ATTLIST mapTimezones otherVersion CDATA #IMPLIED >
<!ATTLIST mapTimezones references CDATA #IMPLIED >

<!ELEMENT mapZone EMPTY >
<!ATTLIST mapZone type CDATA #REQUIRED >
<!ATTLIST mapZone other CDATA #REQUIRED >
<!ATTLIST mapZone territory CDATA #IMPLIED >
<!ATTLIST mapZone references CDATA #IMPLIED >

The following subelement of <metaZones> (metaZones.xml) provides a mapping from a single Unicode time zone id to metazones. For more information about metazones, see Section 5.9.2 Time Zone Names.

<metazoneInfo>
	<timezone type="Europe/Andorra">
		<usesMetazone mzone="Europe_Central"/>
	</timezone>
	....
	<timezone type="Asia/Yerevan">
		<usesMetazone to="1991-09-22 20:00" mzone="Yerevan"/>
		<usesMetazone from="1991-09-22 20:00" mzone="Armenia"/>
	</timezone>
	....
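A lookup over this data might be sketched as follows in Python. The function name metazone_for is hypothetical, and the sketch assumes that "from" is inclusive and "to" is exclusive; that interpretation is an assumption of this example, not something stated in the sample.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Sample data copied from the metazoneInfo example above.
SAMPLE = """
<metazoneInfo>
  <timezone type="Europe/Andorra">
    <usesMetazone mzone="Europe_Central"/>
  </timezone>
  <timezone type="Asia/Yerevan">
    <usesMetazone to="1991-09-22 20:00" mzone="Yerevan"/>
    <usesMetazone from="1991-09-22 20:00" mzone="Armenia"/>
  </timezone>
</metazoneInfo>
"""

FORMAT = "%Y-%m-%d %H:%M"


def metazone_for(root, tzid, when):
    """Return the metazone used by tzid at the given datetime, or None."""
    for tz in root.iter("timezone"):
        if tz.get("type") != tzid:
            continue
        for use in tz.iter("usesMetazone"):
            start, end = use.get("from"), use.get("to")
            # Assumption: the interval is half-open, [from, to).
            if start and when < datetime.strptime(start, FORMAT):
                continue
            if end and when >= datetime.strptime(end, FORMAT):
                continue
            return use.get("mzone")
    return None
```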

The following subelement of <metaZones> specifies a mapping from a metazone to golden zones for each territory. For more information about golden zones, see Appendix J: Time Zone Display Names.

<mapTimezones type="metazones">
	<mapZone other="Acre" territory="001" type="America/Rio_Branco"/>
	<mapZone other="Afghanistan" territory="001" type="Asia/Kabul"/>
	<mapZone other="Africa_Central" territory="001" type="Africa/Maputo"/>
	<mapZone other="Africa_Central" territory="BI" type="Africa/Bujumbura"/>
	<mapZone other="Africa_Central" territory="BW" type="Africa/Gaborone"/>
	....

The <mapTimezones> element can also be used to provide mappings between Unicode time zone IDs and other time zone IDs. This example specifies a mapping from Windows TZIDs to Unicode time zone IDs (windowsZones.xml).

<mapTimezones otherVersion="07dc0000" typeVersion="2011n">
	....
	<!-- (UTC-08:00) Baja California -->
	<mapZone other="Pacific Standard Time (Mexico)" territory="001" type="America/Santa_Isabel"/>
	<mapZone other="Pacific Standard Time (Mexico)" territory="MX" type="America/Santa_Isabel"/>

	<!-- (UTC-08:00) Pacific Time (US & Canada) -->
	<mapZone other="Pacific Standard Time" territory="001" type="America/Los_Angeles"/>
	<mapZone other="Pacific Standard Time" territory="CA" type="America/Vancouver America/Dawson America/Whitehorse"/>
	<mapZone other="Pacific Standard Time" territory="MX" type="America/Tijuana"/>
	<mapZone other="Pacific Standard Time" territory="US" type="America/Los_Angeles"/>
	<mapZone other="Pacific Standard Time" territory="ZZ" type="PST8PDT"/>
	....

The attributes otherVersion and typeVersion in <mapTimezones> specify the versions of the two systems. In the example above, otherVersion="07dc0000" specifies the version of the Windows time zone data and typeVersion="2011n" specifies the version of the Unicode time zone IDs. The attribute territory="001" in a <mapZone> element indicates that the Unicode time zone ID specified by the type attribute is used as the default mapping for the Windows TZID. For each unique Windows TZID, there must be exactly one <mapZone> element with territory="001". <mapZone> elements with a territory other than "001" specify territory-specific mappings. When multiple Unicode time zone IDs are available for a single territory, the value of the type attribute is a list of Unicode time zone IDs delimited by spaces. In this case, the first entry represents the default mapping for the territory. The territory "ZZ" is used when a Unicode time zone ID is not associated with a specific territory.
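The lookup rules in the preceding paragraph can be sketched in Python as follows. The function name windows_to_unicode is hypothetical, and the sample is a subset of the example above.

```python
import xml.etree.ElementTree as ET

# Subset of the windowsZones.xml example above.
SAMPLE = """
<mapTimezones otherVersion="07dc0000" typeVersion="2011n">
  <mapZone other="Pacific Standard Time" territory="001" type="America/Los_Angeles"/>
  <mapZone other="Pacific Standard Time" territory="CA" type="America/Vancouver America/Dawson America/Whitehorse"/>
  <mapZone other="Pacific Standard Time" territory="MX" type="America/Tijuana"/>
  <mapZone other="Pacific Standard Time" territory="US" type="America/Los_Angeles"/>
  <mapZone other="Pacific Standard Time" territory="ZZ" type="PST8PDT"/>
</mapTimezones>
"""


def windows_to_unicode(root, windows_id, territory=None):
    """Map a Windows TZID to a Unicode time zone ID.

    A territory-specific mapping is preferred; otherwise the territory="001"
    default is returned. For a space-delimited list, the first ID is used.
    """
    default = None
    for zone in root.iter("mapZone"):
        if zone.get("other") != windows_id:
            continue
        ids = zone.get("type").split()
        if territory is not None and zone.get("territory") == territory:
            return ids[0]
        if zone.get("territory") == "001":
            default = ids[0]
    return default
```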

C.8 Supplemental Character Fallback Data

<!ELEMENT characters ( character-fallback*) >

<!ELEMENT character-fallback ( character* ) >
<!ELEMENT character (substitute*) >
<!ATTLIST character value CDATA #REQUIRED >

<!ELEMENT substitute (#PCDATA) >

The characters element provides a way for non-Unicode systems, or systems that only support a subset of Unicode characters, to transform CLDR data. It gives a list of characters with alternative values that can be used if the main value is not available. For example:

<characters>
     <character-fallback>
	<character value = "ß">
		<substitute>ss</substitute>
	</character>
	<character value = "Ø">
		<substitute>Ö</substitute>
		<substitute>O</substitute>
	</character>
	<character value = "₧">
		<substitute>Pts</substitute>
	</character>
	<character value = "₣">
		<substitute>Fr.</substitute>
	</character>
     </character-fallback> 
</characters>

The ordering of the substitute elements indicates the preference among them.

That is, this data provides recommended fallbacks for use when a charset or supported repertoire does not contain a desired character. There is more than one possible fallback: the recommended usage is that when a character value is not in the desired repertoire the following process is used, whereby the first value that is wholly in the desired repertoire is used.

toNFC(value)
other canonically equivalent sequences, if there are any
the explicit substitutes value (in order)
toNFKC(value)
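The process above might be sketched in Python as follows. The function name fallback and the supported predicate are hypothetical. As a simplification, the sketch tries only the NFD form for the "other canonically equivalent sequences" step, which is an assumption of this example rather than a complete enumeration.

```python
import unicodedata

# Substitutes copied from the character-fallback example above, in preference order.
SUBSTITUTES = {"ß": ["ss"], "Ø": ["Ö", "O"], "₧": ["Pts"], "₣": ["Fr."]}


def fallback(value, supported):
    """Return the first candidate wholly inside the repertoire, or None.

    supported is a predicate that takes a single character and reports
    whether the target repertoire contains it.
    """
    nfc = unicodedata.normalize("NFC", value)
    candidates = [nfc]
    # Simplification: NFD stands in for "other canonically equivalent sequences".
    candidates.append(unicodedata.normalize("NFD", value))
    candidates.extend(SUBSTITUTES.get(nfc, []))
    candidates.append(unicodedata.normalize("NFKC", value))
    for candidate in candidates:
        if all(supported(ch) for ch in candidate):
            return candidate
    return None
```

For an ASCII-only repertoire, a predicate such as lambda ch: ord(ch) < 128 can be passed as supported.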

C.9 Supplemental Code Mapping

<!ELEMENT languageCodes EMPTY >
<!ATTLIST languageCodes type NMTOKEN #REQUIRED>
<!ATTLIST languageCodes alpha3 NMTOKEN #REQUIRED>

<!ELEMENT territoryCodes EMPTY >
<!ATTLIST territoryCodes type NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes numeric NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes alpha3 NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes fips10 NMTOKEN #IMPLIED>
<!ATTLIST territoryCodes internet NMTOKENS #IMPLIED>

The code mapping information provides mappings between the subtags used in the CLDR locale IDs (from BCP 47) and other coding systems or related information. The language code mappings are provided only for codes that have two letters in BCP 47, and map them to their ISO three-letter equivalents. The territory codes provide mappings to numeric codes (UN M.49 [UNM49] codes, equivalent to ISO numeric codes), ISO three-letter codes, FIPS 10 codes, and the internet top-level domain codes. The alphabetic codes are only provided where different from the type. For example:

<territoryCodes type="AA" numeric="958" alpha3="AAA"/>
<territoryCodes type="AD" numeric="020" alpha3="AND" fips10="AN"/>
<territoryCodes type="AE" numeric="784" alpha3="ARE"/>
...
<territoryCodes type="GB" numeric="826" alpha3="GBR" fips10="UK" internet="UK GB"/>
...
<territoryCodes type="QU" numeric="967" alpha3="QUU" internet="EU"/>

C.10 Likely Subtags

<!ELEMENT likelySubtag EMPTY >
<!ATTLIST likelySubtag from NMTOKEN #REQUIRED>
<!ATTLIST likelySubtag to NMTOKEN #REQUIRED>

There are a number of situations where it is useful to be able to find the most likely language, script, or region. For example, given the language "zh" and the region "TW", what is the most likely script? Given the script "Thai" what is the most likely language or region? Given the region TW, what is the most likely language and script?

Conversely, given a locale, it is useful to find out which fields (language, script, or region) may be superfluous, in the sense that they contain the likely tags. For example, "en_Latn" can be simplified down to "en" since "Latn" is the likely script for "en"; "ja_Jpan_JP" can be simplified down to "ja".

The likelySubtag supplemental data provides default information for computing these values. This data is based on the default content data, the population data, and the suppress-script data in [BCP47]. It is heuristically derived, and may change over time. To look up data in the table, see if a locale matches one of the from attribute values. If so, fetch the corresponding to attribute value. For example, the Chinese data looks like the following:

<likelySubtag from="zh" to="zh_Hans_CN"/>
<likelySubtag from="zh_HK" to="zh_Hant_HK"/>
<likelySubtag from="zh_Hani" to="zh_Hans_CN"/>
<likelySubtag from="zh_Hant" to="zh_Hant_TW"/>
<likelySubtag from="zh_MO" to="zh_Hant_MO"/>
<likelySubtag from="zh_TW" to="zh_Hant_TW"/>

So looking up "zh_TW" returns "zh_Hant_TW", while looking up "zh" returns "zh_Hans_CN". In the following text, the components of such a result will be designated with language², region², and script².

The data is designed to be used in the following operations. It can also be used with language tags using [BCP47] syntax, with a few changes.

Add Likely Subtags: Given a locale, to fill in the most likely other fields.

This operation is performed in the following way.

Canonicalize.
1. Make sure the input locale is in canonical form: uses the right separator, and has the right casing.
2. Replace any deprecated subtags with their canonical values using the <alias> data in supplemental metadata. Use the first value in the replacement list, if it exists.
3. If the tag is grandfathered (see <variable id="$grandfathered" type="choice"> in the supplemental data), then return it.
4. Remove the script code 'Zzzz' and the region code 'ZZ' if they occur; change an empty language subtag to 'und'.
5. Get the components of the cleaned-up tag (language¹, script¹, and region¹), plus any variants if they exist (including keywords).
Try each of the following in order (where the fields exist). The notation field³ means field¹ if it exists, otherwise field².
1. Lookup language¹ _ script¹ _ region¹. If in the table, return the language² _ script² _ region² + variants.
2. Lookup language¹ _ script¹. If in the table, return language² _ script² _ region³ + variants.
3. Lookup language¹ _ region¹. If in the table, return language² _ script³ _ region² + variants.
4. Lookup language¹. If in the table, return language² _ script³ _ region³ + variants.
If none of these succeed, signal an error.

Example:

Input is ZH-ZZZZ-SG.
Normalize to zh_SG.
Lookup in table. No match.
Remove SG, but remember it. Lookup zh, and get the match (zh_Hans_CN). Substitute SG, and return zh_Hans_SG.

To find the most likely language for a country, or language for a script, use "und" as the language subtag. For example, looking up "und_TW" returns zh_Hant_TW.
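The Add Likely Subtags steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the names parse and add_likely_subtags are hypothetical, alias replacement and grandfathered tags (steps 2 and 3) are not handled, and the "und_TW" table row is an assumed entry implied by the statement above that looking up "und_TW" returns zh_Hant_TW.

```python
# Table copied from the Chinese example above. The "und_TW" row is an
# assumed entry, implied by the text rather than shown in the sample.
LIKELY = {
    "zh": "zh_Hans_CN",
    "zh_HK": "zh_Hant_HK",
    "zh_Hani": "zh_Hans_CN",
    "zh_Hant": "zh_Hant_TW",
    "zh_MO": "zh_Hant_MO",
    "zh_TW": "zh_Hant_TW",
    "und_TW": "zh_Hant_TW",
}


def parse(tag):
    """Canonicalize separators and casing; return (language, script, region, variants)."""
    parts = tag.replace("-", "_").split("_")
    language = parts[0].lower() or "und"
    rest = parts[1:]
    script = region = ""
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0).upper()
    # Step 4: drop the "unknown" script and region codes.
    if script == "Zzzz":
        script = ""
    if region == "ZZ":
        region = ""
    return language, script, region, rest


def add_likely_subtags(tag, table=LIKELY):
    """Fill in the most likely missing fields, or raise LookupError."""
    language, script, region, variants = parse(tag)
    # Each trial: (lookup key fields, keep input script?, keep input region?)
    trials = []
    if script and region:
        trials.append(((language, script, region), False, False))
    if script:
        trials.append(((language, script), False, True))
    if region:
        trials.append(((language, region), True, False))
    trials.append(((language,), True, True))
    for key, keep_script, keep_region in trials:
        match = table.get("_".join(key))
        if match is None:
            continue
        language2, script2, region2 = match.split("_")
        result = [
            language2,
            (script or script2) if keep_script else script2,
            (region or region2) if keep_region else region2,
        ]
        return "_".join(result + variants)
    raise LookupError(f"no likely subtags for {tag!r}")
```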

Remove Likely Subtags: Given a locale, remove any fields that Add Likely Subtags would add.

The reverse operation removes fields that would be added by the first operation.

First get max = AddLikelySubtags(inputLocale). If an error is signaled, return it.
Remove the variants from max.
Then for trial in {language, language _ region, language _ script}
- If AddLikelySubtags(trial) = max, then return trial + variants.
If you do not get a match, return max + variants.

Example:

Input is zh_Hant. Maximize to get zh_Hant_TW.
zh => zh_Hans_CN. No match, so continue.
zh_TW => zh_Hant_TW. Matches, so return zh_TW.

A variant of this favors the script over the region, thus using {language, language_script, language_region} in the above. If that variant is used, then the result in this example would be zh_Hant instead of zh_TW.
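The Remove Likely Subtags operation, including the script-favoring variant, might be sketched in Python as follows. The function name remove_likely_subtags is hypothetical. The maximize argument stands for any implementation of Add Likely Subtags; here a small table of precomputed results (PRECOMPUTED, an assumption of this example) is used in its place so that the sketch is self-contained.

```python
def remove_likely_subtags(tag, maximize, favor_script=False):
    """Remove the fields that maximize (Add Likely Subtags) would add back.

    maximize is a callable that returns language_script_region plus any
    variants, and raises LookupError on failure.
    """
    maximized = maximize(tag)  # an error here is passed on to the caller
    parts = maximized.split("_")
    language, script, region = parts[:3]
    variants = parts[3:]
    core = "_".join(parts[:3])  # max with the variants removed
    trials = [language, f"{language}_{region}", f"{language}_{script}"]
    if favor_script:
        trials[1], trials[2] = trials[2], trials[1]
    for trial in trials:
        try:
            if maximize(trial) == core:
                return "_".join([trial] + variants)
        except LookupError:
            continue
    return "_".join([core] + variants)


# Stand-in for Add Likely Subtags: precomputed results for a few tags,
# consistent with the Chinese data shown earlier in this section.
PRECOMPUTED = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "zh_Hant_TW": "zh_Hant_TW",
}


def lookup(tag):
    return PRECOMPUTED[tag]  # raises KeyError, a LookupError, if absent
```

With this stand-in, remove_likely_subtags("zh_Hant", lookup) follows the worked example above, and passing favor_script=True selects the variant that prefers the script.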

C.11 Language Plural Rules

<!ELEMENT plurals (pluralRules*) >
<!ATTLIST plurals type ( ordinal | cardinal ) #IMPLIED > 

<!ELEMENT pluralRules (pluralRule*) >
<!ATTLIST pluralRules locales NMTOKENS #REQUIRED >

<!ELEMENT pluralRule ( #PCDATA ) >
<!ATTLIST pluralRule count (zero | one | two | few | many) #REQUIRED >

This section defines certain types of plural forms that exist in a language—namely, the cardinal and ordinal plural forms for nouns. Cardinal plural forms express units such as time, currency or distance, used in conjunction with a number expressed in decimal digits (i.e. "2", not "two", and not an indefinite number such as "some" or "many"). Ordinal plural forms denote the order of items in a set and are always integers. For example, English has two forms for cardinals:

form "one": 1 day
form "other": 0 days, 2 days, 10 days, 0.3 days

and four forms for ordinals:

form "one": 1st floor, 21st floor, 101st floor
form "two": 2nd floor, 22nd floor, 102nd floor
form "few": 3rd floor, 23rd floor, 103rd floor
form "other": 4th floor, 11th floor, 96th floor

Other languages may have additional forms or only one form for each type of plural. CLDR provides the following tags for designating the various plural forms of a language; for a given language, only the tags necessary for that language are defined, along with the specific numeric ranges covered by each tag (for example, the plural form "few" may be used for the numeric range 2-4 in one language and 3-9 in another):

zero
one
two
few
many

In addition, an "other" tag is always implicitly defined to cover the forms not explicitly designated by the tags defined for a language. This "other" tag is also used for languages that only have a single form (in which case no plural-form tags are explicitly defined for the language). For a more complex example, consider the cardinal rules for Russian and certain other languages:

<pluralRules locales="hr ru sr uk">
	<pluralRule count="one">n mod 10 is 1 and n mod 100 is not 11</pluralRule>
	<pluralRule count="few">n mod 10 in 2..4 and n mod 100 not in 12..14</pluralRule>
</pluralRules>

These rules specify that Russian has a "one" form (for 1, 21, 31, 41, 51, …), a "few" form (for 2-4, 22-24, 32-34, …), and implicitly an "other" form (for everything else: 0, 5-20, 25-30, 35-40, …, decimals). Russian does not need additional separate forms for zero, two, or many, so these are not defined.

Plural rules syntax

The xml value for each pluralRule is a condition with a boolean result that specifies whether that rule (i.e. that plural form) applies to a given numeric value n, where n can be expressed as a decimal fraction. Conditions have the following syntax:

condition     = and_condition ('or' and_condition)*
and_condition = relation ('and' relation)*
relation      = is_relation | in_relation | within_relation
is_relation   = expr 'is' ('not')? value
in_relation   = expr ('not')? 'in' range_list
within_relation = expr ('not')? 'within' range_list
expr          = 'n' ('mod' value)?
range_list    = (range | value) (',' range_list)*
value         = digit+
digit         = 0|1|2|3|4|5|6|7|8|9
range         = value'..'value

Whitespace (defined as Unicode Pattern_White_Space) can occur between or around any of the above tokens.
In the syntax, 'and' binds more tightly than 'or'. So "X or Y and Z" is interpreted as (X or (Y and Z)).
Each plural rule must be written to be self-contained, and not depend on the ordering. Thus rules must be mutually exclusive; for a given numeric value, only one rule can apply (i.e. the condition can only be true for one of the pluralRule elements).
The in and within relations can take comma-separated lists, such as: n in 3,5,7..15. The difference between in and within is that in only includes integers in the specified range, while within includes all values.
mod (modulus) is a remainder operation as defined in Java; for example, where n = 4.3 the result of n mod 3 is 1.3.
To detect an integer in a rule, use n mod 1 is 0. Conversely, for a fraction use: n mod 1 is not 0.

Examples:

one: n is 1; few: n in 2..4	This defines two rules, for 'one' and 'few'. The condition for 'one' is "n is 1" which means that the number must be equal to 1 for this condition to pass. The condition for 'few' is "n in 2..4" which means that the number must be between 2 and 4 inclusive for this condition to pass. All other numbers are assigned the keyword 'other' by the default rule.
zero: n is 0 or n is not 1 and n mod 100 in 1..19; one: n is 1	Each rule must not overlap with other rules. Also note that a modulus is applied to n in the last rule, thus its condition holds for 119, 219, 319...
one: n is 1; few: n mod 10 in 2..4 and n mod 100 not in 12..14	This illustrates conjunction and negation. The condition for 'few' has two parts, both of which must be met: "n mod 10 in 2..4" and "n mod 100 not in 12..14". The first part applies a modulus to n before the test as in the previous example. The second part applies a different modulus and also uses negation, thus it matches all numbers not in 12, 13, 14, 112, 113, 114, 212, 213, 214...
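The condition syntax can be evaluated with a small recursive-descent parser. The following Python sketch is one possible reading of the grammar above, not a reference implementation: the names matches and plural_category are hypothetical, only lowercase keywords are recognized, and Decimal is used so that "n mod 3" for n = 4.3 yields exactly 1.3.

```python
import re
from decimal import Decimal

TOKEN = re.compile(r"\d+|\.\.|,|[a-z]+")


def matches(condition, n):
    """Return True if the plural rule condition holds for the number n."""
    tokens = TOKEN.findall(condition)
    n = Decimal(str(n))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected!r}, got {token!r}")
        pos += 1
        return token

    def range_list():
        ranges = []
        while True:
            low = high = Decimal(take())
            if peek() == "..":
                take("..")
                high = Decimal(take())
            ranges.append((low, high))
            if peek() != ",":
                return ranges
            take(",")

    def relation():
        take("n")
        operand = n
        if peek() == "mod":
            take("mod")
            operand = n % Decimal(take())
        if peek() == "is":
            take("is")
            negated = peek() == "not"
            if negated:
                take("not")
            result = operand == Decimal(take())
        else:
            negated = peek() == "not"
            if negated:
                take("not")
            keyword = take()
            if keyword not in ("in", "within"):
                raise ValueError(f"unexpected token {keyword!r}")
            result = any(low <= operand <= high for low, high in range_list())
            if keyword == "in":  # 'in' only includes integers
                result = result and operand == operand.to_integral_value()
        return result != negated

    def and_condition():
        result = relation()
        while peek() == "and":
            take("and")
            result = relation() and result  # parse first, then combine
        return result

    result = and_condition()
    while peek() == "or":
        take("or")
        result = and_condition() or result
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


def plural_category(rules, n):
    """Return the count value of the first matching rule, else "other"."""
    for count, condition in rules.items():
        if matches(condition, n):
            return count
    return "other"


# Conditions copied from the Russian example earlier in this section.
RUSSIAN = {
    "one": "n mod 10 is 1 and n mod 100 is not 11",
    "few": "n mod 10 in 2..4 and n mod 100 not in 12..14",
}
```

Because the rules are required to be mutually exclusive, the order in which plural_category tries them does not affect the result.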

Using the cardinal plural rules

Elements such as <currencyFormats>, <currency> and <unit> provide selection among subelements designating various localized cardinal plural forms by tagging each of the relevant subelements with a different count value, or with no count value in some cases. Note that the plural forms for a specific currencyFormat, unit type, or currency type may not use all of the different plural-form tags defined for the language. To format a currency or unit type for a particular numeric value, determine the count value according to the plural rules for the language, then select the appropriate display form for the currency format, currency type or unit type using the rules in those sections:

5.10.1 Number Symbols (for currencyFormats elements)
5.10.2 Currencies (for currency elements)
5.11 Unit Elements

There are two extra values that can be used with count attributes: 0 and 1. These are used for the explicit values, and may or may not be the same as the forms for "zero" and "one".

C.12 Telephone Code Data

<!ELEMENT telephoneCodeData ( codesByTerritory* ) >

<!ELEMENT codesByTerritory ( telephoneCountryCode+ ) >
<!ATTLIST codesByTerritory territory NMTOKEN #REQUIRED >

<!ELEMENT telephoneCountryCode EMPTY >
<!ATTLIST telephoneCountryCode code NMTOKEN #REQUIRED >
<!ATTLIST telephoneCountryCode from NMTOKEN #IMPLIED >
<!ATTLIST telephoneCountryCode to NMTOKEN #IMPLIED >

This data specifies the mapping between ITU telephone country codes [ITUE164] and CLDR-style territory codes (ISO 3166 2-letter codes or non-corresponding UN M.49 [UNM49] 3-digit codes). There are several things to note:

A given telephone country code may map to multiple CLDR territory codes; +1 (North America Numbering Plan) covers the US and Canada, as well as many islands in the Caribbean and some in the Pacific.
Some telephone country codes are for global services (for example, some satellite services), and thus correspond to territory code 001.
The mappings change over time (territories move from one telephone code to another). These changes are usually planned several years in advance, and there may be a period during which either telephone code can be used to reach the territory. While the CLDR telephone code data is not intended to include past changes, it is intended to incorporate known information on planned future changes, using "from" and "to" date attributes to indicate when mappings are valid.

A subset of the telephone code data might look like the following (showing a past mapping change to illustrate the from and to attributes):

<codesByTerritory territory="001">
	<telephoneCountryCode code="800"/> <!-- International Freephone Service -->
	<telephoneCountryCode code="808"/> <!-- International Shared Cost Services (ISCS) -->
	<telephoneCountryCode code="870"/> <!-- Inmarsat Single Number Access Service (SNAC) -->
</codesByTerritory>
<codesByTerritory territory="AS"> <!-- American Samoa -->
	<telephoneCountryCode code="1" from="2004-10-02"/> <!-- +1 684 in North America Numbering Plan -->
	<telephoneCountryCode code="684" to="2005-04-02"/> <!-- +684 now a spare code -->
</codesByTerritory>
<codesByTerritory territory="CA">
	<telephoneCountryCode code="1"/> <!-- North America Numbering Plan -->
</codesByTerritory>

C.13 Numbering Systems

<!ELEMENT numberingSystems ( numberingSystem* ) >
<!ELEMENT numberingSystem EMPTY >
<!ATTLIST numberingSystem id NMTOKEN #REQUIRED >
<!ATTLIST numberingSystem type ( numeric | algorithmic ) #REQUIRED >
<!ATTLIST numberingSystem radix NMTOKEN #IMPLIED >
<!ATTLIST numberingSystem digits CDATA #IMPLIED >
<!ATTLIST numberingSystem rules CDATA #IMPLIED >

Numbering systems information is used to define different representations for numeric values to an end user. Numbering systems are defined in CLDR as one of two different types: algorithmic and numeric. A numeric system is simply a decimal-based system that uses a predefined set of digits to represent numbers. Examples are Western (ASCII) digits, Thai digits, and Devanagari digits. Algorithmic systems are more complex in nature, since the proper formatting and presentation of a numeric quantity is based on some algorithm or set of rules. Examples are Chinese numerals, Hebrew numerals, and Roman numerals. In CLDR, the rules for presentation of numbers in an algorithmic system are defined using the RBNF syntax described in Section 5.17 Rule-Based Number Formatting.

Attributes for the <numberingSystem> element are as follows:

id - Specifies the name of the numbering system that can be used to designate its use in formatting.

type - Specifies whether the numbering system is algorithmic or numeric.

digits - For numeric systems, specifies the digits used to represent numbers, in order, starting from zero.

rules - Specifies the RBNF ruleset to be used for formatting numbers from this numbering system. The rules specifier can contain simply a ruleset name, in which case the ruleset is assumed to be found in the rule set grouping "NumberingSystemRules". Alternatively, the specifier can denote a specific locale, ruleset grouping, and ruleset name, separated by slashes.

Examples:

<numberingSystem id="latn" type="numeric" digits="0123456789"/>
<!-- ASCII digits - A numeric system -->

<numberingSystem id="thai" type="numeric" digits="๐๑๒๓๔๕๖๗๘๙"/>
<!-- A numeric system using Thai digits -->

<numberingSystem id="geor" type="algorithmic" rules="georgian"/>
<!-- An algorithmic system - Georgian numerals, rules found in NumberingSystemRules -->

<numberingSystem id="hant" type="algorithmic" rules="zh_Hant/SpelloutRules/spellout-cardinal"/>
<!-- An algorithmic system. Traditional Chinese Numerals -->

For general information about the numbering system data, including the BCP47 identifiers, see Section Q.1.1 Numbering System Data.
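For a numeric system, the digits attribute is sufficient to render an integer. The following Python sketch illustrates this with the "latn" and "thai" entries from the examples above; the function name format_integer is hypothetical, and algorithmic systems (which require RBNF rules) are not covered.

```python
# Digit strings copied from the numberingSystem examples above,
# keyed by id, in order starting from zero.
NUMERIC_SYSTEMS = {
    "latn": "0123456789",
    "thai": "๐๑๒๓๔๕๖๗๘๙",
}


def format_integer(value, system):
    """Render an integer using the digits of a numeric numbering system."""
    digits = NUMERIC_SYSTEMS[system]
    sign = "-" if value < 0 else ""
    return sign + "".join(digits[int(ch)] for ch in str(abs(value)))
```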

C.14 Postal Code Validation

<!ELEMENT postalCodeData (postCodeRegex*) >
<!ELEMENT postCodeRegex (#PCDATA) >
<!ATTLIST postCodeRegex territoryId NMTOKEN #REQUIRED>

The Postal Code regex information can be used to validate postal codes used in different countries. In some cases, the regex is quite simple, such as for Germany:

<postCodeRegex territoryId="DE" >\d{5}</postCodeRegex>

The US code is slightly more complicated, since there is an optional portion:

<postCodeRegex territoryId="US" >\d{5}([ \-]\d{4})?</postCodeRegex>

The most complicated currently is the UK.
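A validation routine based on this data might look like the following Python sketch. The function name is_valid_postcode is hypothetical. The sketch assumes that the regular expressions are compatible with Python's re module and that the whole input must match the pattern; both are assumptions of this example.

```python
import re

# Patterns copied from the postCodeRegex examples above, keyed by territoryId.
POSTCODE_REGEX = {
    "DE": r"\d{5}",
    "US": r"\d{5}([ \-]\d{4})?",
}


def is_valid_postcode(territory, code):
    """Return True or False, or None if no pattern is known for the territory."""
    pattern = POSTCODE_REGEX.get(territory)
    if pattern is None:
        return None
    # Assumption: the pattern must match the entire postal code.
    return re.fullmatch(pattern, code) is not None
```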

C.15 Calendar Preference Data

<!ELEMENT calendarPreferenceData ( calendarPreference* ) >
<!ELEMENT calendarPreference EMPTY >
<!ATTLIST calendarPreference territories NMTOKENS #REQUIRED >
<!ATTLIST calendarPreference ordering NMTOKENS #REQUIRED >

The calendarPreference element provides a list of commonly used calendar types in a territory. The ordering attribute indicates the list of calendar types in preferred order. The first calendar type in the list is the default calendar type for the territory. For example:

<calendarPreference territories="TH" ordering="buddhist gregorian"/>

The calendarPreference element above indicates that both the Buddhist calendar and the Gregorian calendar are commonly used in Thailand, and that the Buddhist calendar is preferred.

C.16 BCP 47 Keyword Mapping

Note: This data is deprecated and replaced with Appendix Q: Unicode BCP 47 Extension Data. The data might be removed in future CLDR releases.

<!ELEMENT bcp47KeywordMappings ( mapKeys?, mapTypes* ) >
<!ELEMENT mapKeys ( keyMap* ) >
<!ELEMENT keyMap EMPTY >
<!ATTLIST keyMap type NMTOKEN #REQUIRED >
<!ATTLIST keyMap bcp47 NMTOKEN #REQUIRED >
<!ELEMENT mapTypes ( typeMap* ) >
<!ATTLIST mapTypes type NMTOKEN #REQUIRED >
<!ELEMENT typeMap EMPTY >
<!ATTLIST typeMap type CDATA #REQUIRED >
<!ATTLIST typeMap bcp47 NMTOKEN #REQUIRED >

This section defines mappings between old Unicode locale identifier key/type values and their BCP 47 'u' extension subtag representations. The 'u' extension syntax described in Section 3.2.1 -u- and -t- Extensions restricts a key to two ASCII alphanumerics and a type to three to eight ASCII alphanumerics. A key or a type which does not meet that syntax requirement is converted according to the mapping data defined by the mapKeys or mapTypes elements. For example, a keyword "collation=phonebook" is converted to BCP 47 'u' extension subtags "co-phonebk" by the mapping data below:

    <mapKeys>
        ...
        <keyMap type="collation" bcp47="co"/>
        ...
    </mapKeys>
    <mapTypes type="collation">
        ...
        <typeMap type="phonebook" bcp47="phonebk"/>
        ...
    </mapTypes>
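The conversion described above can be sketched in Python as follows. The dictionaries hold only the two mappings shown in the example, and the function name to_bcp47_subtags is hypothetical. Passing unmapped keys and types through unchanged is an assumption of this sketch.

```python
# Hypothetical in-memory form of the mapKeys / mapTypes data shown above.
KEY_MAP = {"collation": "co"}
TYPE_MAP = {"collation": {"phonebook": "phonebk"}}

def to_bcp47_subtags(keyword):
    """Convert an old-style 'key=type' keyword to 'u' extension subtags."""
    key, _, value = keyword.partition("=")
    # Assumption: keys and types without a mapping are passed through as-is.
    bcp_key = KEY_MAP.get(key, key)
    bcp_type = TYPE_MAP.get(key, {}).get(value, value)
    return f"{bcp_key}-{bcp_type}"

result = to_bcp47_subtags("collation=phonebook")
```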

C.17 DayPeriod Rules

<!ELEMENT dayPeriodRuleSet ( dayPeriodRules* ) >

<!ELEMENT dayPeriodRules (dayPeriodRule*) >
<!ATTLIST dayPeriodRules locales NMTOKENS #REQUIRED >

<!ELEMENT dayPeriodRule EMPTY >
<!ATTLIST dayPeriodRule type NMTOKEN #REQUIRED >
<!ATTLIST dayPeriodRule at NMTOKEN #IMPLIED >
<!ATTLIST dayPeriodRule after NMTOKEN #IMPLIED >
<!ATTLIST dayPeriodRule from NMTOKEN #IMPLIED >
<!ATTLIST dayPeriodRule before NMTOKEN #IMPLIED >
<!ATTLIST dayPeriodRule to NMTOKEN #IMPLIED >

Each locale should have a set of day period rules, which determine the periods during a day for use with a 12-hour format. If a locale is not defined in dayPeriods.xml, its day periods fall back to AM/PM. Here are the requirements for a rule set:

"from" and "to" are closed intervals (inclusive).
"after" and "before" are open intervals (exclusive).
"at" means starting time and end time are the same.
There must be exactly one of {at, from, after} and exactly one of {at, to, before} for each dayPeriodRule.
The set of dayPeriodRule elements needs to completely cover the 24 hours in a day (from 0:00 to before 24:00), with no overlaps between rules.
Both hh:mm [period name] and hh [period name] can be parsed uniquely to HH:mm [period name].
1. For example, you can't have <dayPeriodRule type="morning" from="0:00" to="12:00"/> because "12 {morning}" would be ambiguous.
One dayPeriodRule can cross midnight. For example:
1. <dayPeriodRule type="night" from="20:00" before="5:00"/>
2. However, this should be avoided unless the alternative is awkward, because of ambiguities. While the use of the dayPeriods without hours is not recommended, they can be used. And if the user sees "Tuesday night" they may not think that that includes 1:00 am Tuesday.
dayPeriodRule elements with the same type are only allowed if they are not adjacent. Example:
- <dayPeriodRule type="twilight" from="5:00" to="7:00"/>
- <dayPeriodRule type="twilight" from="17:00" to="19:00"/>
24:00 is only allowed in before="24:00". A term for midnight should be avoided in the rules, because of ambiguity problems in most languages.
1. "Tuesday midnight" generally means at the end of the day on Tuesday (24:00)
2. Most software does not format anything for 24:00, only for 00:00. And you don't want 00:00 Tuesday (the start of the day) to be formatted as midnight, meaning the end of the day.
When parsing, if the hour is present the day period is checked for consistency. If there is no hour, the center of the first matching dayPeriodRule is chosen (starting from 0:00).
If rounding is done (including the rounding done by the time format), then it needs to be done before the day period is computed, so that the correct format is shown.
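A minimal Python sketch of how a time of day could be resolved against such rules is given below. The rule set is hypothetical (it is not taken from dayPeriods.xml) and is chosen to exercise the "at", "from", "after", and "before" attributes as well as one rule that crosses midnight. The AM/PM fallback is not modeled.

```python
def minutes(hhmm):
    """Convert 'H:mm' to minutes since 0:00."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)

# Hypothetical rule set; each dict mirrors the attributes of one dayPeriodRule.
RULES = [
    {"type": "night", "from": "20:00", "before": "5:00"},  # crosses midnight
    {"type": "morning", "from": "5:00", "before": "12:00"},
    {"type": "noon", "at": "12:00"},
    {"type": "afternoon", "after": "12:00", "before": "20:00"},
]

def matches(rule, t):
    """Return True if minute-of-day t falls inside the rule's interval."""
    if "at" in rule:
        return t == minutes(rule["at"])
    start_key = "from" if "from" in rule else "after"
    end_key = "to" if "to" in rule else "before"
    start, end = minutes(rule[start_key]), minutes(rule[end_key])
    # "from"/"to" are inclusive; "after"/"before" are exclusive.
    after_start = t >= start if start_key == "from" else t > start
    before_end = t <= end if end_key == "to" else t < end
    if start <= end:
        return after_start and before_end
    return after_start or before_end  # the rule wraps past midnight

def day_period(hhmm):
    """Return the type of the first matching rule, or None if none matches."""
    t = minutes(hhmm)
    for rule in RULES:
        if matches(rule, t):
            return rule["type"]
    return None
```

Any rounding of the time value would be applied before calling day_period, in line with the last requirement above.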

C.18 Language Matching

<!ELEMENT languageMatching ( languageMatches* ) >
<!ELEMENT languageMatches ( languageMatch* ) >
<!ATTLIST languageMatches type NMTOKEN #REQUIRED >
<!ELEMENT languageMatch EMPTY >
<!ATTLIST languageMatch desired CDATA #REQUIRED >
<!ATTLIST languageMatch supported CDATA #REQUIRED >
<!ATTLIST languageMatch percent NMTOKEN #REQUIRED >
<!ATTLIST languageMatch oneway ( true | false ) #IMPLIED >

Implementers are often faced with the issue of how to match the user's requested languages with their product's supported languages. For example, suppose that a product supports {ja-JP, de, zh-TW}. If the user understands written American English, German, French, Swiss German, and Italian, then de would be the best match; if s/he understands only Chinese (zh), then zh-TW would be the best match.

The standard truncation-fallback algorithm does not work well when faced with the complexities of natural language. The language matching data is designed to fill that gap. Stated in those terms, language matching can have the effect of a more complex fallback, such as:

sr-Cyrl-RS
sr-Cyrl
sr-Latn-RS
sr-Latn
sr
hr-Latn
hr

Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and my Italian is basic, then ideally an implementation would allow me to select {gsw, de, fr} as my preferred list of languages, skipping Italian because my comprehension is not good enough for arbitrary content.

Language matching can also be used to get fallback data elements. In many cases, there may not be full data for a particular locale. For example, for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle, but it does not contain a translation for the key "CN" (for the country China). It is best to return "chine", rather than falling back to the value of a default language such as Russian and getting "Кітай". The language matching data can be used to get the closest fallback locales (of those supported) to a given language.

When such fallback is used for resource item lookup, the normal order of inheritance applies, except that before using any data from root, the data for the fallback locales would be used if available. Language matching does not interact with the fallback of resources within the locale-parent chain. For example, suppose that we are looking for the value for a particular path P in nb-NO. In the absence of aliases, normally the following lookup is used.

nb-NO → nb → root

That is, we first look in nb-NO. If there is no value for P there, then we look in nb. If there is no value for P there, we return the value for P in root (or a code value, if there is nothing there). Remember that if there is an alias element along this path, then the lookup may restart with a different path in nb-NO (or another locale).

However, suppose that nb-NO has the fallback values [nn da sv en], derived from language matching. In that case, an implementation may progressively look up each of the listed locales, with the appropriate substitutions, returning the first value that is not found in root. This roughly follows the pseudocode below:

value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;

The locales in the fallback list are not used recursively. For example, for the lookup of a path in nb-NO, if fr were a fallback value for da, it would not matter for the above process. Only the original language matters.
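The pseudocode above could be realized along the following lines in Python. This is a sketch under stated assumptions: the BUNDLES table, the function names, and the sample values are hypothetical, aliases are ignored, and a path missing everywhere is treated as a code value coming from root.

```python
# Hypothetical resource bundles: locale -> {path: value}.
BUNDLES = {
    "root": {"P": "root-value"},
    "nb-NO": {},
    "nb": {},
    "nn": {},
    "da": {"P": "da-value"},
}

def chain(locale):
    """Yield the truncation chain, e.g. nb-NO, nb, root (aliases ignored)."""
    parts = locale.split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()
    yield "root"

def lookup(path, locale):
    """Return (value, location_found) following the inheritance chain."""
    for loc in chain(locale):
        if path in BUNDLES.get(loc, {}):
            return BUNDLES[loc][path], loc
    # Assumption: the code value is reported as if it came from root.
    return path, "root"

def lookup_with_fallbacks(path, locale, fallbacks):
    """Try the locale, then each fallback language with the same suffix."""
    language, _, rest = locale.partition("-")
    candidates = [language] + list(fallbacks)
    value = path
    for lang in candidates:
        substituted = lang + ("-" + rest if rest else "")
        value, found = lookup(path, substituted)
        if found != "root":
            return value
    return value  # the last candidate's value, even if it came from root

result = lookup_with_fallbacks("P", "nb-NO", ["nn", "da", "sv", "en"])
```

Only the original fallback list is consulted; the fallbacks of da, for instance, are never looked at, which matches the non-recursive behavior described above.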

The languageMatching data is interpreted as an ordered list. To find the match between any two languages, use the likely subtags to maximize each language, and perform the following steps.

Remove any trailing fields that are the same.
Traverse the list until a match is found. (If the oneway flag is false, then the match is symmetric.)
Record the match value.
Remove the final field from each, and if any fields are left, repeat these steps.

The end result is the product of the matched values.

There is one special case. Suppose we have the following situation:

desired languages: {und, it}
supported languages: {en, it}
resulting language: en

Part of this is because 'und' has a special function in BCP47; it stands in for 'no supplied base language'. To prevent this from happening, if the desired base language is und, the language matcher should not apply likely subtags to it.

Examples:

For example, suppose that nn-DE and nb-FR are being compared. They are first maximized to nn-Latn-DE and nb-Latn-FR, respectively. The list is searched. The first match is with "*-*-*", for a match of 96%. The languages are truncated to nn-Latn and nb-Latn, then to nn and nb. The first match is also for a value of 96%, so the result is 96% × 96% ≈ 92%.
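The matching steps and the worked example above can be sketched in Python as follows. The MATCHES list is hypothetical: only the two 96% values come from the example, while the "*-*" and "*" entries and their percentages are invented for illustration. The sketch assumes both inputs are already maximized to three fields and uses "-" as the field separator.

```python
# Hypothetical ordered match data: (desired, supported, percent, oneway).
MATCHES = [
    ("nn", "nb", 96, False),
    ("*-*-*", "*-*-*", 96, False),
    ("*-*", "*-*", 50, False),  # invented value
    ("*", "*", 1, False),       # invented value
]

def field_match(pattern, tag):
    """True if tag has the same number of fields and each one matches."""
    p, t = pattern.split("-"), tag.split("-")
    return len(p) == len(t) and all(a == "*" or a == b for a, b in zip(p, t))

def match_value(desired, supported):
    """Percent of the first list entry that matches the pair."""
    for d, s, percent, oneway in MATCHES:
        if field_match(d, desired) and field_match(s, supported):
            return percent
        # A match that is not one-way also applies in the reverse direction.
        if not oneway and field_match(d, supported) and field_match(s, desired):
            return percent
    return 0  # assumption: no entry at all means no match

def match_score(desired, supported):
    """Product of the matched values for two maximized identifiers."""
    d, s = desired.split("-"), supported.split("-")
    score = 1.0
    while d and s:
        # Step 1: remove any trailing fields that are the same.
        while d and s and d[-1] == s[-1]:
            d.pop()
            s.pop()
        if not d or not s:
            break
        # Steps 2 and 3: traverse the list and record the match value.
        score *= match_value("-".join(d), "-".join(s)) / 100
        # Step 4: remove the final field from each and repeat.
        d.pop()
        s.pop()
    return score

score = match_score("nn-Latn-DE", "nb-Latn-FR")
```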

Note that language matching is orthogonal to how closely two languages are related linguistically. For example, Breton is more closely related to Welsh than to French, but French is the better match (because it is more likely that a Breton reader will understand French than Welsh). This also illustrates that the matches are often asymmetric: it is not likely that a French reader will understand Breton.

The "*" acts as a wild card, as shown in the following example:

C.19 Parent Locales

<!ELEMENT parentLocales ( parentLocale* ) >
<!ELEMENT parentLocale EMPTY >
<!ATTLIST parentLocale parent CDATA #REQUIRED >
<!ATTLIST parentLocale locales CDATA #REQUIRED >

In some cases, the normal truncation inheritance does not function well. This happens when:

1. The child locale is of a different script. In this case, mixing elements from the parent into the child data results in a mishmash.
2. A large number of child locales behave similarly, and differently from the truncation parent.

The parentLocale element is used to override the normal inheritance when accessing CLDR data.

For case 1, the children are script locales, and the parent is "root". For example:

 <parentLocale parent="root" locales="az_Cyrl ha_Arab … zh_Hant"/>

For case 2, the children and parent share the same primary language, but the region is changed. For example:

 <parentLocale parent="es_419" locales="es_AR es_BO … es_UY es_VE"/>

Collation data, however, is an exception. Since collation rules do not truly inherit data from the parent, the parentLocale element is not necessary and not used for collation. Thus, for a locale like zh_Hant in the example above, the parentLocale element would dictate the parent as "root" when referring to main locale data, but for collation data, the parent locale would still be "zh", even though the parentLocale element is present for that locale.
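As a rough Python sketch of this behavior, the function below consults an override table before falling back to truncation, and skips the table for collation. The table contains only the locales spelled out in the two examples above (the elided ones are left out), and the function name parent is hypothetical.

```python
# Only the locales written out in the examples; elided entries are omitted.
PARENT_OVERRIDES = {
    "az_Cyrl": "root", "ha_Arab": "root", "zh_Hant": "root",
    "es_AR": "es_419", "es_BO": "es_419", "es_UY": "es_419", "es_VE": "es_419",
}

def parent(locale, collation=False):
    """Return the parent locale, or None for root."""
    if locale == "root":
        return None
    # parentLocale overrides are not used for collation data.
    if not collation and locale in PARENT_OVERRIDES:
        return PARENT_OVERRIDES[locale]
    # Normal truncation inheritance: drop the last field.
    head, sep, _ = locale.rpartition("_")
    return head if sep else "root"
```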

C.20 Gender of Lists

<!ELEMENT gender ( personList+ ) >
<!ELEMENT personList EMPTY >
<!ATTLIST personList type ( neutral | mixedNeutral | maleTaints ) #REQUIRED >
<!ATTLIST personList locales NMTOKENS #REQUIRED >

This can be used to determine the gender of a list of 2 or more persons, such as "Tom and Mary", for use with gender-selection messages. For example,

  
  <supplementalData>
    <gender>
      <!-- neutral: gender(list) = other -->
      <personList type="neutral" locales="af da en..."/>
  
      <!-- mixedNeutral: gender(all male) = male, gender(all female) = female, otherwise gender(list) = other -->
      <personList type="mixedNeutral" locales="el"/> 

      <!-- maleTaints: gender(all female) = female, otherwise gender(list) = male -->
      <personList type="maleTaints" locales="ar ca..."/> 
    </gender>
  </supplementalData>

There are three ways the gender of a list can be formatted:

neutral: A gender-independent "other" form will be used for the list.
mixedNeutral: If the elements of the list are all male, the "male" form is used for the list. If all the elements of the list are female, the "female" form is used. If the list has a mix of male, female and neutral names, the "other" form is used.
maleTaints: If all the elements of the list are female, the "female" form is used; otherwise the "male" form is used.

The data will be in genderList.xml, where the locales belonging to each of these types are enumerated.
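The three list types can be expressed as a small Python function. This is an illustrative sketch: the function name list_gender and the representation of individual genders as the strings "male", "female", and "other" are assumptions, and the list is assumed to contain two or more persons.

```python
def list_gender(list_type, genders):
    """Return the gender category to use for a list of persons.

    genders holds one of "male", "female", or "other" per person.
    """
    if list_type == "neutral":
        return "other"
    if list_type not in ("mixedNeutral", "maleTaints"):
        raise ValueError(f"unknown personList type: {list_type}")
    if all(g == "female" for g in genders):
        return "female"
    if list_type == "mixedNeutral":
        return "male" if all(g == "male" for g in genders) else "other"
    return "male"  # maleTaints: anything other than all-female is male
```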

Appendix D: Unicode Language and Locale IDs

People have very slippery notions of what distinguishes a language code versus a locale code. The problem is that both are somewhat nebulous concepts.

In practice, many people use [BCP47] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [BCP47] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives a [BCP47] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" versus "_" (for example, zh-TW for language code, zh_TW for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "_" as equivalent when interpreting either one on input.

Another reason for the conflation of these codes is that very little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really does not make much sense. If people see the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important — but those are different in kind than other language differences between regions.

As far as we are concerned — as a completely practical matter — two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this is not the principle used in [ISO639], which has the fairly unproductive notion (for data interchange) that only spoken language matters (it is also not completely consistent about this, however).

[BCP47] can express a difference if the use of written languages happens to correspond to region boundaries expressed as [ISO3166] region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [ISO3166] codes. These written languages include simplified and traditional Chinese (both used in Hong Kong S.A.R.); Serbian in Latin script; Azerbaijani in Arabic script, and so on.

Notice also that currency codes are different than currency localizations. The currency localizations should largely be in the language-based resource bundles, not in the territory-based resource bundles. Thus, the resource bundle en contains the localized mappings in English for a range of different currency codes: USD → US$, RUR → Rub, AUD → $A and so on. Of course, some currency symbols are used for more than one currency, and in such cases specializations appear in the territory-based bundles. Continuing the example, en_US would have USD → $, while en_AU would have AUD → $. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's territory to guess at the currency. For some informal discussion of this, see JIT Localization.)

D.1 Written Language

Criteria for what makes a written language should be purely pragmatic; what would copy-editors say? If one gave them text like the following, they would respond that it is far from acceptable English for publication, and ask for it to be redone:

A. "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

So one would change it to either B or C below, depending on which orthographic variant of English was the target for the publication:

B. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
C. "Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."

Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first versus last name sorting in the list, but clearly the first list was not acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for example, is not.

Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing unfamiliar date or number formats on the user as well.

Appendix E: Unicode Sets

A UnicodeSet is a set of Unicode characters (and possibly strings) determined by a pattern, following UTS #18: Unicode Regular Expressions [UTS18], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [ICUUnicodeSet].

Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters that may have ranges indicated by a '-' between two characters, as in "a-z". The sequence specifies the range of all characters from the left to the right, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity, as [a c d-f m] means the same as [acd-fm].

Unicode property sets are specified by any Unicode property and a value of that property, such as [:General_Category=Letter:]. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [UAX44]. The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of "=<value>". For example, you can match letters by using the POSIX-style syntax:

[:General_Category=Letter:]

or by using the Perl-style syntax

\p{General_Category=Letter}.

Property names and values are case-insensitive, and whitespace, "-", and "_" are ignored. The property name can be omitted for the Category and Script properties, but is required for other properties. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Thus [:Letter:] is equivalent to [:General_Category=Letter:], and [:Wh-ite-s pa_ce:] is equivalent to [:Whitespace=true:].
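The loose matching of property names and values can be sketched in Python as a normalization step applied before comparison. The function name normalize_property_token is hypothetical. Resolving an omitted property name (as in [:Letter:]) needs the property alias data and is not attempted here.

```python
import re

def normalize_property_token(token):
    """Fold case and drop whitespace, '-' and '_' for loose comparison."""
    return re.sub(r"[\s\-_]", "", token).lower()

same = (normalize_property_token("Wh-ite-s pa_ce")
        == normalize_property_token("Whitespace"))
```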

The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative", which is a property that excludes all characters of a given kind. For example, [:^Letter:] matches all characters that are not [:Letter:].

	Positive	Negative
POSIX-style Syntax	[:type=value:]	[:^type=value:]
Perl-style Syntax	\p{type=value}	\P{type=value}

These low-level lists or properties can then be freely combined with the normal set operations (union, inverse, difference, and intersection):

To union two sets, simply concatenate them. For example, [[:letter:] [:number:]]
To intersect two sets, use the '&' operator. For example, [[:letter:] & [a-z]]
To take the set-difference of two sets, use the '-' operator. For example, [[:letter:] - [a-z]]
To invert a set, place a '^' immediately after the opening '['. For example, [^a-z]. In any other location, the '^' does not have a special meaning.

The binary operators '&', '-', and the implicit union have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equal to [[[:letter:]-[a-z]]-[\u0100-\u01FF]]. Another example is the set [[ace][bdf] - [abc][def]], which is not the empty set, but instead equal to [[[[ace] [bdf]] - [abc]] [def]], which equals [[[abcdef] - [abc]] [def]], which equals [[def] [def]], which equals [def].
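The equal-precedence, left-to-right binding can be illustrated with ordinary Python sets. This sketch does not parse UnicodeSet patterns; the operands are supplied already split, and the helper name evaluate and the "union" label for the implicit union operator are assumptions made for the example.

```python
def evaluate(first, *operations):
    """Fold (operator, operand) pairs left to right with equal precedence."""
    result = set(first)
    for op, operand in operations:
        operand = set(operand)
        if op == "union":      # implicit union (concatenation)
            result |= operand
        elif op == "&":        # intersection
            result &= operand
        elif op == "-":        # set difference
            result -= operand
        else:
            raise ValueError(f"unknown operator: {op}")
    return result

# [[ace][bdf] - [abc][def]] evaluated strictly left to right.
value = evaluate("ace", ("union", "bdf"), ("-", "abc"), ("union", "def"))
```

The result is the set of 'd', 'e', and 'f', in agreement with the derivation above, rather than the empty set that a grouping of the two unions first would give.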

One caution: the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of upper case letters except for 'A', enclose the 'A' in a set: [[:Lu:]-[A]].

A multi-character string can be in a Unicode set, to represent a tailored grapheme cluster for a particular language. The syntax uses curly braces for that case.

In Unicode Sets, there are two ways to quote syntax characters and whitespace:

E.1 Single Quote

Two single quotes represent a single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for two adjacent single quotes). It is taken as literal text (special characters become non-special).

E.2 Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

\uhhhh	Exactly 4 hex digits; h in [0-9A-Fa-f]
\Uhhhhhhhh	Exactly 8 hex digits
\xhh	1-2 hex digits
\ooo	1-3 octal digits; o in [0-7]
\a	U+0007 (BELL)
\b	U+0008 (BACKSPACE)
\t	U+0009 (HORIZONTAL TAB)
\n	U+000A (LINE FEED)
\v	U+000B (VERTICAL TAB)
\f	U+000C (FORM FEED)
\r	U+000D (CARRIAGE RETURN)
\\	U+005C (BACKSLASH)
\N{name}	The Unicode character named "name".

Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{uppercase} is the set of upper case letters in Unicode.

Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal characters. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary characters in an ASCII source file, and any resulting characters are not tagged as literals.)
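A simplified Python sketch of resolving the backslash escapes in the table above follows. It is an assumption-laden illustration: single-quote handling is omitted, property escapes such as \p{...} are not recognized (the backslash is simply dropped), and the names SIMPLE, ESCAPE, and unescape are invented for this example.

```python
import re
import unicodedata

# Single-letter escapes from the table above.
SIMPLE = {"a": "\a", "b": "\b", "t": "\t", "n": "\n",
          "v": "\v", "f": "\f", "r": "\r", "\\": "\\"}

ESCAPE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})"    # \uhhhh
    r"|U([0-9A-Fa-f]{8})"        # \Uhhhhhhhh
    r"|x([0-9A-Fa-f]{1,2})"      # \xhh
    r"|([0-7]{1,3})"             # \ooo
    r"|N\{([^}]+)\}"             # \N{name}
    r"|(.))",                    # anything else maps to itself
    re.DOTALL,
)

def unescape(text):
    """Replace backslash escapes with the literal characters they denote."""
    def replace(match):
        u4, u8, x, octal, name, other = match.groups()
        if u4 or u8 or x:
            return chr(int(u4 or u8 or x, 16))
        if octal:
            return chr(int(octal, 8))
        if name:
            return unicodedata.lookup(name)
        return SIMPLE.get(other, other)
    return ESCAPE.sub(replace, text)

sample = unescape(r"\u0061\x41\101\t\N{LATIN SMALL LETTER B}\-")
```

A fuller implementation would also mark each resulting character as a literal, so that, for instance, an escaped '-' is not read as a range operator.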

The following table summarizes the syntax that can be used.

Example	Description
[a]	The set containing 'a' alone
[a-z]	The set containing 'a' through 'z' and all letters in between, in Unicode order. Thus it is the same as [\u0061-\u007A].
[^a-z]	The set containing all characters but 'a' through 'z'. Thus it is the same as [\u0000-\u0060 \u007B-\U0010FFFF].
[[pat1][pat2]]	The union of sets specified by pat1 and pat2
[[pat1]&[pat2]]	The intersection of sets specified by pat1 and pat2
[[pat1]-[pat2]]	The asymmetric difference of sets specified by pat1 and pat2
[a {ab} {ac}]	The character 'a' and the multi-character strings "ab" and "ac"
[:Lu:]	The set of characters with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode upper case letters. The long form for this is [:General_Category=Uppercase_Letter:].
[:L:]	The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:General_Category=Letter:].

Appendix F: Date Format Patterns

A date pattern is a string of characters, where specific strings of characters are replaced with date and time data from a calendar when formatting or used to generate data for a calendar when parsing. The following are examples:

Pattern	Result (in a particular locale)
yyyy.MM.dd G 'at' HH:mm:ss zzz	1996.07.10 AD at 15:08:56 PDT
EEE, MMM d, ''yy	Wed, Jul 10, '96
h:mm a	12:08 PM
hh 'o''clock' a, zzzz	12 o'clock PM, Pacific Daylight Time
K:mm a, z	0:00 PM, PST
yyyyy.MMMM.dd GGG hh:mm aaa	01996.July.10 AD 12:08 PM

When parsing using a pattern, a lenient parse should be used; see Lenient Parsing.

The Date Field Symbol Table below contains the characters used in patterns to show the appropriate formats for a given locale, such as yyyy for the year. Characters may be used multiple times. For example, if y is used for the year, 'yy' might produce '99', whereas 'yyyy' produces '1999'. For most numerical fields, the number of characters specifies the field width. For example, if h is the hour, 'h' might produce '5', but 'hh' produces '05'. For some characters, the count specifies whether an abbreviated or full form should be used, but may have other choices, as given below.
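The effect of the letter count on the year field can be sketched in Python as follows. The sketch follows the year padding examples given in the Date Field Symbol Table (two letters keep only the last two digits, while other counts give a minimum width). The function name format_year is hypothetical, and only positive years are handled.

```python
def format_year(year, count):
    """Format a positive year for a run of `count` pattern letters 'y'."""
    digits = str(year)
    if count == 2:
        # Two letters pad to two digits and also truncate to two digits.
        return digits[-2:].zfill(2)
    # Any other count is a minimum width, padded with zeros.
    return digits.zfill(count)
```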

<!ATTLIST pattern numbers CDATA #IMPLIED >

The numbers attribute is used to indicate that numeric quantities in the pattern are to be rendered using a numbering system other than the default numbering system defined for the given locale. The attribute can be in one of two forms. If the alternate numbering system is intended to apply to ALL numeric quantities in the pattern, then simply use the numbering system ID as found in Section C.13 Numbering Systems. To apply the alternate numbering system only to a single field, the syntax "<letter>=<numberingSystem>" can be used one or more times, separated by semicolons.

Examples:
<pattern numbers="hebr">dd/mm/yyyy</pattern>


<pattern numbers="y=hebr">dd/mm/yyyy</pattern>


<pattern numbers="d=thai;m=hans;y=deva">dd/mm/yyyy</pattern>
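The two forms of the numbers attribute can be told apart with a short parser, sketched here in Python. The function name parse_numbers_attribute and the shape of its return value (a numbering system for all fields, plus a per-field dictionary) are assumptions made for this example.

```python
def parse_numbers_attribute(value):
    """Return (system_for_all_fields, per_field_systems)."""
    if "=" not in value:
        # First form: a bare numbering system ID applies to every field.
        return value, {}
    per_field = {}
    # Second form: "<letter>=<numberingSystem>" entries separated by ";".
    for part in value.split(";"):
        letter, _, system = part.partition("=")
        per_field[letter.strip()] = system.strip()
    return None, per_field
```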

In patterns, two single quotes represent a literal single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for two adjacent single quotes). Otherwise all ASCII letters from a to z and A to Z are reserved as syntax characters, and require quoting if they are to represent literal characters. In addition, certain ASCII punctuation characters may become variable in the future (for example, ":" being interpreted as the time separator and '/' as a date separator, and replaced by respective locale-sensitive characters in display).
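The quoting rules above can be sketched as a small pattern tokenizer in Python. This is a minimal illustration: the function name tokenize and the ("field", text) / ("literal", text) token shape are assumptions, and no validation of field letters or counts is performed.

```python
def tokenize(pattern):
    """Split a date pattern into ("field", text) and ("literal", text) tokens."""
    tokens = []
    literal = ""
    in_quote = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "'":
            if pattern[i + 1:i + 2] == "'":
                # Two adjacent quotes are one literal quote, in or out of quotes.
                literal += "'"
                i += 2
            else:
                in_quote = not in_quote
                i += 1
        elif not in_quote and ch.isascii() and ch.isalpha():
            # An unquoted ASCII letter starts a field: a run of the same letter.
            if literal:
                tokens.append(("literal", literal))
                literal = ""
            j = i
            while j < len(pattern) and pattern[j] == ch:
                j += 1
            tokens.append(("field", pattern[i:j]))
            i = j
        else:
            literal += ch
            i += 1
    if literal:
        tokens.append(("literal", literal))
    return tokens

tokens = tokenize("hh 'o''clock' a, zzzz")
```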

Note: the counter-intuitive use of 5 letters for the narrow form of weekdays and months is forced by backwards compatibility.

Date Field Symbol Table

Field	Sym.	No.	Example	Description

era

1..3

Era - Replaced with the Era string for the current date. One to three letters for the abbreviated form, four letters for the long form, five for the narrow form.

Anno Domini

year

1..n

1996

Year. Normally the length specifies the padding, but for two letters it also specifies the maximum length. Example:

Year	y	yy	yyy	yyyy	yyyyy
AD 1	1	01	001	0001	00001
AD 12	12	12	012	0012	00012
AD 123	123	23	123	0123	00123
AD 1234	1234	34	1234	1234	01234
AD 12345	12345	45	12345	12345	12345

1..n

1997

Year (in "Week of Year" based calendars). Normally the length specifies the padding, but for two letters it also specifies the maximum length. This year designation is used in ISO year-week calendar as defined by ISO 8601, but can be used in non-Gregorian based calendar systems where week date processing is desired. May not always be the same value as calendar year.

1..n

4601

Extended year. This is a single number designating the year of this calendar system, encompassing all supra-year fields. For example, for the Julian calendar system, year numbers are positive, with an era of BCE or CE. An extended year value for the Julian calendar system assigns positive values to CE years and negative values to BCE years, with 1 BCE being year 0.

1..3

甲子

Cyclic year name. Calendars such as the Chinese lunar calendar (and related calendars) and the Hindu calendars use 60-year cycles of year names. Use one through three letters for the abbreviated name, four for the full name, or five for the narrow name (currently the data only provides abbreviated names, which will be used for all requested name widths). If the calendar does not provide cyclic year name data, or if the year value to be formatted is out of the range of years for which cyclic name data is provided, then numeric formatting is used (behaves like 'y').

(currently also 甲子)

quarter

1..2

Quarter - Use one or two for the numerical quarter, three for the abbreviation, or four for the full name.

2nd quarter

1..2

Stand-Alone Quarter - Use one or two for the numerical quarter, three for the abbreviation, or four for the full name.

2nd quarter

month

1..2

Month - Use one or two for the numerical month, three for the abbreviation, four for the full name, or five for the narrow name.

Sept

September

1..2

Stand-Alone Month - Use one or two for the numerical month, three for the abbreviation, four for the full name, or five for the narrow name.

Sept

September

(nothing)

This pattern character is deprecated, and should be ignored in patterns. It was originally intended to be used in combination with M to indicate placement of the symbol for leap month in the Chinese calendar. Placement of that marker is now specified using locale-specific <monthPatterns> data, and formatting and parsing of that marker should be handled as part of supporting the regular M and L pattern characters.

week

1..2

Week of Year.

Week of Month

day

1..2

Date - Day of the month

1..3

345

Day of year

Day of Week in Month. The example is for the 2nd Wed in July

1..n

2451334

Modified Julian day. This is different from the conventional Julian day number in two regards. First, it demarcates days at local zone midnight, rather than noon GMT. Second, it is a local number; that is, it depends on the local time zone. It can be thought of as a single number that encompasses all the date-related fields.

week
day

1..3

Tues

Day of week - Use one through three letters for the short day, or four for the full name, five for the narrow name, or six for the short name.

Tuesday

1..2

Local day of week. Same as E except adds a numeric value that will depend on the local starting day of the week, using one or two letters. For this example, Monday is the first day of the week.

Tues

Tuesday

Stand-Alone local day of week - Use one letter for the local numeric value (same as 'e'), three for the short day, four for the full name, five for the narrow name, or six for the short name.

Tues

Tuesday

period

AM or PM

hour

1..2

Hour [1-12]. When used in skeleton data or in a skeleton passed in an API for flexible date pattern generation, it should match the 12-hour-cycle format preferred by the locale (h or K); it should not match a 24-hour-cycle format (H or k). Use hh for zero padding.

1..2

Hour [0-23]. When used in skeleton data or in a skeleton passed in an API for flexible date pattern generation, it should match the 24-hour-cycle format preferred by the locale (H or k); it should not match a 12-hour-cycle format (h or K). Use HH for zero padding.

1..2

Hour [0-11]. When used in a skeleton, only matches K or h, see above. Use KK for zero padding.

1..2

Hour [1-24]. When used in a skeleton, only matches k or H, see above. Use kk for zero padding.

1..2

n/a

This is a special-purpose symbol. It must not occur in pattern or skeleton data. Instead, it is reserved for use in skeletons passed to APIs doing flexible date pattern generation. In such a context, it requests the preferred hour format for the locale (h, H, K, or k), as determined by whether h, H, K, or k is used in the standard short time format for the locale. In the implementation of such an API, 'j' must be replaced by h, H, K, or k before beginning a match against availableFormats data. Note that use of 'j' in a skeleton passed to an API is the only way to have a skeleton request a locale's preferred time cycle type (12-hour or 24-hour).

minute

1..2

Minute. Use one or two for zero padding.

second

1..2

Second. Use one or two for zero padding.

1..n

3456

Fractional Second - truncates (like other time fields) to the count of letters. (example shows display using pattern SSSS for seconds value 12.34567)

1..n

69540000

Milliseconds in day. This field behaves exactly like a composite of all time-related fields, not including the zone fields. As such, it also reflects discontinuities of those fields on DST transition days. On a day of DST onset, it will jump forward. On a day of DST cessation, it will jump backward. This reflects the fact that it must be combined with the offset field to obtain a unique local time value.

zone

1..3

PDT
fallbacks:
HPG-8:00

GMT-08:00

Time Zone - with the specific non-location format. Where that is unavailable, falls back to localized GMT format. Use one to three letters for the short format or four for the full format.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

Pacific Daylight Time
fallbacks: HPG-8:00

GMT-08:00

1..3

-0800

Time Zone - RFC 822 GMT format.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

HPG-8:00

Time Zone - The localized GMT format.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

-08:00

Time Zone - ISO8601 time zone format.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

Time Zone - with the generic non-location format. Where that is unavailable, uses special fallback rules given in Appendix J. Use one letter for short format, four for long format.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

Pacific Time
fallbacks:
Pacific Time (Canada)
Pacific Time (Whitehorse)
United States (Los Angeles) Time
HPG-8:35
GMT-08:35

PST

fallbacks:
HPG-8:00

GMT-08:00

Time Zone - Identical to the format for z. This specifier formerly had slightly different behavior than the z specifier before the deprecation of the commonlyUsed element. Use z instead of V whenever possible in date formats.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

United States (Los Angeles) Time
fallbacks:
HPG-8:35
GMT-08:35

Time Zone - with the generic location format. Where that is unavailable, falls back to the localized GMT format. (Fallback is only necessary with a GMT-style Time Zone ID, like Etc/GMT-830.)
This is especially useful when presenting possible timezone choices for user selection, since the naming is more uniform than the v format.
For more information about timezone formats, see Appendix J: Time Zone Display Names.

All non-letter characters represent themselves in a pattern, except for the single quote. It is used to 'escape' letters. Two single quotes in a row, whether inside or outside a quoted sequence, represent a 'real' single quote.

F.1 Localized Pattern Characters (deprecated)

These are characters that can be used when displaying a date pattern to an end user. This can occur, for example, when a spreadsheet allows users to specify date patterns. Whatever is in the string is substituted one-for-one with the characters "GyMdkHmsSEDFwWahKzYe", with the above meanings. Thus, for example, if "J" is to be used instead of "Y" to mean Year, then the string would be: "GyMdkHmsSEDFwWahKzJe".

This element is deprecated. It is recommended instead that a more sophisticated UI be used for localization, such as using icons to represent the different formats (and lengths) in the Date Field Symbol Table.

F.2 AM / PM

Even for countries where the customary date format only has a 24 hour format, both the am and pm localized strings must be present and must be distinct from one another. Note that as long as the 24 hour format is used, these strings will normally never be used, but for testing and unusual circumstances they must be present.

F.3 Eras

There are only two values for an era in a Gregorian calendar, "BC" and "AD". These values can be translated into other languages, like "a.C." and "d.C." for Spanish, but there are no other eras in the Gregorian calendar. Other calendars have different numbers of eras. Care should be taken when translating the era names for a specific calendar.

F.4 Week of Year

Values calculated for the Week of Year field range from 1 to 53 for the Gregorian calendar (they may have different ranges for other calendars). Week 1 for a year is the first week that contains at least the specified minimum number of days from that year. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed). For example, January 1, 1998 was a Thursday. If the first day of the week is MONDAY and the minimum days in a week is 4 (these are the values reflecting ISO 8601 and many national standards), then week 1 of 1998 starts on December 29, 1997, and ends on January 4, 1998. However, if the first day of the week is SUNDAY, then week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998. The first three days of 1998 are then part of week 53 of 1997.

Values are similarly calculated for the Week of Month.
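The 1998 example above can be checked with a short sketch. This is an illustration only, not a reference implementation: the function names `week1_start` and `week_of_year` are hypothetical, and days of the week use Python's numbering (Monday = 0 through Sunday = 6) rather than the LDML keywords.

```python
from datetime import date, timedelta

def week1_start(year, first_day, min_days):
    """Return the date on which week 1 of `year` starts.

    first_day: first day of the week, Monday=0 ... Sunday=6 (Python numbering).
    min_days:  minimum number of days of `year` that week 1 must contain.
    """
    jan1 = date(year, 1, 1)
    # Number of days from the start of the week containing January 1 to January 1.
    offset = (jan1.weekday() - first_day) % 7
    start = jan1 - timedelta(days=offset)
    # Days of that week that fall inside `year`.
    days_in_year = 7 - offset
    if days_in_year < min_days:
        start += timedelta(days=7)
    return start

def week_of_year(d, first_day, min_days):
    """Return (week_year, week_number) for date d."""
    year = d.year
    if d >= week1_start(year + 1, first_day, min_days):
        year += 1
    elif d < week1_start(year, first_day, min_days):
        year -= 1
    start = week1_start(year, first_day, min_days)
    return year, (d - start).days // 7 + 1

MONDAY, SUNDAY = 0, 6
print(week1_start(1998, MONDAY, 4))                 # 1997-12-29
print(week1_start(1998, SUNDAY, 4))                 # 1998-01-04
print(week_of_year(date(1998, 1, 1), SUNDAY, 4))    # (1997, 53)
```

With SUNDAY as the first day, January 1, 1998 falls in week 53 of 1997, which matches the text above.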

F.5 Week Elements

firstDay: A number indicating which day of the week is considered the 'first' day, for calendar purposes. Because the ordering of days may vary between calendars, keywords are used for this value, such as sun, mon,... These values will be replaced by the localized name when they are actually used.
minDays (Minimal Days in First Week): Minimal days required in the first week of a month or year. For example, if the first week is defined as one that contains at least one day, this value will be 1. If it must contain a full seven days before it counts as the first week, then the value would be 7.
weekendStart, weekendEnd: Indicates the day and time that the weekend starts or ends. As with firstDay, keywords are used instead of numbers.

Appendix G: Number Format Patterns

G.1 Number Patterns

The NumberElements resource affects how these patterns are interpreted in a localized context. Here are some examples, based on the French locale. The "." shows where the decimal point should go. The "," shows where the thousands separator should go. A "0" indicates zero-padding: if the number is too short, a zero (in the locale's numeric set) will go there. A "#" indicates no padding: if the number is too short, nothing goes there. A "¤" shows where the currency sign will go. The following illustrates the effects of different patterns for the French locale, with the number "1234.567". Notice how the pattern characters ',' and '.' are replaced by the characters appropriate for the locale.

Pattern	Currency	Text
#,##0.##	n/a	1 234,57
#,##0.###	n/a	1 234,567
###0.#####	n/a	1234,567
###0.0000#	n/a	1234,5670
00000.0000	n/a	01234,5670
# ##0.00 ¤	EUR	1 234,57 €
	JPY	1 235 ¥

The number of # placeholder characters before the decimal does not matter, since no limit is placed on the maximum number of digits. There should, however, be at least one zero someplace in the pattern. In currency formats, the number of digits after the decimal also does not matter, since the information in the supplemental data (see Appendix C: Supplemental Data) is used to override the number of decimal places (and the rounding) according to the currency that is being formatted. That can be seen in the above chart, with the difference between Yen and Euro formatting.

When parsing using a pattern, a lenient parse should be used; see Lenient Parsing.

G.2 Special Pattern Characters

Many characters in a pattern are taken literally; they are matched during parsing and output unchanged during formatting. Special characters, on the other hand, stand for other characters, strings, or classes of characters. For example, the '#' character is replaced by a localized digit. Often the replacement character is the same as the pattern character; in the U.S. locale, the ',' grouping character is replaced by ','. However, the replacement is still happening, and if the symbols are modified, the grouping character changes. Some special characters affect the behavior of the formatter by their presence; for example, if the percent character is seen, then the value is multiplied by 100 before being displayed.

To insert a special character in a pattern as a literal, that is, without any special meaning, the character must be quoted. There are some exceptions to this which are noted below.

Symbol	Location	Localized?	Meaning
0	Number	Yes	Digit
1-9	Number	Yes	'1' through '9' indicate rounding.
@	Number	No	Significant digit
#	Number	Yes	Digit, zero shows as absent
.	Number	Yes	Decimal separator or monetary decimal separator
-	Number	Yes	Minus sign
,	Number	Yes	Grouping separator
E	Number	Yes	Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
+	Exponent	Yes	Prefix positive exponents with localized plus sign. Need not be quoted in prefix or suffix.
;	Subpattern boundary	Yes	Separates positive and negative subpatterns
%	Prefix or suffix	Yes	Multiply by 100 and show as percentage
‰ (\u2030)	Prefix or suffix	Yes	Multiply by 1000 and show as per mille
¤ (\u00A4)	Prefix or suffix	No	Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If tripled, uses the long form of the decimal symbol. If present in a pattern, the monetary decimal separator and grouping separators (if available) are used instead of the numeric ones.
'	Prefix or suffix	No	Used to quote special characters in a prefix or suffix, for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock".
*	Prefix or suffix boundary	Yes	Pad escape, precedes pad character

A pattern contains a positive and may contain a negative subpattern, for example, "#,##0.00;(#,##0.00)". Each subpattern has a prefix, a numeric part, and a suffix. If there is no explicit negative subpattern, the negative subpattern is the localized minus sign prefixed to the positive subpattern. That is, "0.00" alone is equivalent to "0.00;-0.00". If there is an explicit negative subpattern, it serves only to specify the negative prefix and suffix; the number of digits, minimal digits, and other characteristics are ignored in the negative subpattern. That means that "#,##0.0#;(#)" has precisely the same result as "#,##0.0#;(#,##0.0#)".

Note: The thousands separator and decimal separator in this pattern are always ',' and '.'. They are substituted by the code with the correct local values according to other fields in CLDR.

The prefixes, suffixes, and various symbols used for infinity, digits, thousands separators, decimal separators, and so on may be set to arbitrary values, and they will appear properly during formatting. However, care must be taken that the symbols and strings do not conflict, or parsing will be unreliable. For example, either the positive and negative prefixes or the suffixes must be distinct for any parser using this data to be able to distinguish positive from negative values. Another example is that the decimal separator and thousands separator should be distinct characters, or parsing will be impossible.

The grouping separator is a character that separates clusters of integer digits to make large numbers more legible. It is commonly used for thousands, but in some locales it separates ten-thousands. The grouping size is the number of digits between the grouping separators, such as 3 for "100,000,000" or 4 for "1 0000 0000". There are actually two different grouping sizes: One used for the least significant integer digits, the primary grouping size, and one used for all others, the secondary grouping size. In most locales these are the same, but sometimes they are different. For example, if the primary grouping interval is 3, and the secondary is 2, then this corresponds to the pattern "#,##,##0", and the number 123456789 is formatted as "12,34,56,789". If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".
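As an illustration of the grouping rules just described, the following sketch derives the primary and secondary grouping sizes from a pattern and applies them to a string of digits. It is a simplified sketch under stated assumptions: the function names `grouping_sizes` and `group_digits` are hypothetical, and the pattern is assumed to have no prefix, suffix, or quoted characters.

```python
def grouping_sizes(pattern):
    """Return (primary, secondary) grouping sizes for a number pattern.

    Only the integer part of the positive subpattern is examined.
    Returns (0, 0) if the pattern contains no grouping separator.
    """
    integer_part = pattern.split(";")[0].split(".")[0]
    groups = integer_part.split(",")
    if len(groups) == 1:
        return 0, 0
    # Interval between the last separator and the end of the integer.
    primary = len(groups[-1])
    # Interval between the last two separators, if there are two.
    secondary = len(groups[-2]) if len(groups) > 2 else primary
    return primary, secondary

def group_digits(digits, primary, secondary, sep=","):
    """Insert `sep` into a string of integer digits."""
    if primary <= 0 or len(digits) <= primary:
        return digits
    head, tail = digits[:-primary], digits[-primary:]
    parts = [tail]
    while len(head) > secondary:
        parts.insert(0, head[-secondary:])
        head = head[:-secondary]
    parts.insert(0, head)
    return sep.join(parts)

print(grouping_sizes("#,##,##0"))                    # (3, 2)
print(group_digits("123456789", 3, 2))               # 12,34,56,789
print(group_digits("100000000", 4, 4, sep=" "))      # 1 0000 0000
```

All three patterns "#,##,###,####", "###,###,####", and "##,#,###,####" yield the same sizes (4, 3) under this sketch, since only the last two intervals are consulted.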

For consistency in the CLDR data, the following conventions should be observed so as to have a canonical representation:

All number patterns should be minimal: there should be no leading # marks except to specify the position of the grouping separators (for example, avoid ##,##0.###).
All formats should have one 0 before the decimal point (for example, avoid #,###.##)
Decimal formats should have three hash marks in the fractional position (for example, #,##0.###).
Currency formats should have two zeros in the fractional position (for example, ¤ #,##0.00).
- The exact number of decimals is overridden with the decimal count in supplementary data.
The only time two thousands separators need to be used is when the number of digits varies, such as for Hindi: #,##,##0.

G.3 Formatting

Formatting is guided by several parameters, all of which can be specified either using a pattern or using the API. The following description applies to formats that do not use scientific notation or significant digits.

If the number of actual integer digits exceeds the maximum integer digits, then only the least significant digits are shown. For example, 1997 is formatted as "97" if the maximum integer digits is set to 2.
If the number of actual integer digits is less than the minimum integer digits, then leading zeros are added. For example, 1997 is formatted as "01997" if the minimum integer digits is set to 5.
If the number of actual fraction digits exceeds the maximum fraction digits, then half-even rounding is performed to the maximum fraction digits. For example, 0.125 is formatted as "0.12" if the maximum fraction digits is 2. This behavior can be changed by specifying a rounding increment and a rounding mode.
If the number of actual fraction digits is less than the minimum fraction digits, then trailing zeros are added. For example, 0.125 is formatted as "0.1250" if the minimum fraction digits is set to 4.
Trailing fractional zeros are not displayed if they occur j positions after the decimal, where j is less than the maximum fraction digits. For example, 0.10004 is formatted as "0.1" if the maximum fraction digits is four or less.
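The rules in this list can be sketched with Python's decimal module. This is an illustrative sketch only: the function name `format_fixed` and its parameter names are hypothetical, the input is passed as a string to avoid binary floating-point artifacts, and grouping and localized symbols are left out.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def format_fixed(value, min_int=1, max_int=None, min_frac=0, max_frac=3):
    """Format a decimal string using integer and fraction digit limits."""
    d = Decimal(value)
    # Half-even rounding to the maximum number of fraction digits.
    q = d.quantize(Decimal(10) ** -max_frac, rounding=ROUND_HALF_EVEN)
    int_part, _, frac_part = format(abs(q), "f").partition(".")
    # Drop trailing fractional zeros, then pad back up to the minimum.
    frac_part = frac_part.rstrip("0").ljust(min_frac, "0")
    # Keep only the least significant integer digits if there is a maximum.
    if max_int is not None:
        int_part = int_part[-max_int:] if max_int > 0 else ""
    # Add leading zeros up to the minimum.
    int_part = int_part.zfill(min_int)
    sign = "-" if q < 0 else ""
    return sign + int_part + ("." + frac_part if frac_part else "")

print(format_fixed("1997", max_int=2))                   # 97
print(format_fixed("1997", min_int=5))                   # 01997
print(format_fixed("0.125", max_frac=2))                 # 0.12
print(format_fixed("0.125", min_frac=4, max_frac=4))     # 0.1250
print(format_fixed("0.10004", max_frac=4))               # 0.1
```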

Special Values

NaN is represented as a single character, typically the replacement character (\uFFFD). This character is determined by the localized number symbols. This is the only value for which the prefixes and suffixes are not used.

Infinity is represented as a single character, typically ∞ (\u221E), with the positive or negative prefixes and suffixes applied. The infinity character is determined by the localized number symbols.

G.4 Scientific Notation

Numbers in scientific notation are expressed as the product of a mantissa and a power of ten, for example, 1234 can be expressed as 1.234 x 10³. The mantissa is typically in the half-open interval [1.0, 10.0) or sometimes [0.0, 1.0), but it need not be. In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation. Example: "0.###E0" formats the number 1234 as "1.234E3".

The number of digit characters after the exponent character gives the minimum exponent digit count. There is no maximum. Negative exponents are formatted using the localized minus sign, not the prefix and suffix from the pattern. This allows patterns such as "0.###E0 m/s". To prefix positive exponents with a localized plus sign, specify '+' between the exponent and the digits: "0.###E+0" will produce formats "1E+1", "1E+0", "1E-1", and so on. (In localized patterns, use the localized plus sign rather than '+'.)
The minimum number of integer digits is achieved by adjusting the exponent. Example: 0.00123 formatted with "00.###E0" yields "12.3E-4". This only happens if there is no maximum number of integer digits. If there is a maximum, then the minimum number of integer digits is fixed at one.
The maximum number of integer digits, if present, specifies the exponent grouping. The most common use of this is to generate engineering notation, in which the exponent is a multiple of three, for example, "##0.###E0". The number 12345 is formatted using "##0.####E0" as "12.345E3".
When using scientific notation, the formatter controls the digit counts using significant digits logic. The maximum number of significant digits limits the total number of integer and fraction digits that will be shown in the mantissa; it does not affect parsing. For example, 12345 formatted with "##0.##E0" is "12.3E3". See the section on significant digits for more details.
Exponential patterns may not contain grouping separators.
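The way the exponent is chosen from the integer digit counts can be sketched as follows. This is a simplified illustration, not a full formatter: the function name `scientific` and its parameters are hypothetical, fraction digit limits and significant-digit trimming are ignored, and ASCII symbols stand in for the localized exponent and minus signs.

```python
from decimal import Decimal

def scientific(value, min_int=1, max_int=None, exp_char="E"):
    """Format a decimal string as mantissa + exponent.

    max_int: if given, the exponent is forced to a multiple of max_int
             (max_int=3 gives engineering notation, as in "##0.###E0").
    min_int: used only when there is no maximum; the exponent is adjusted
             so that the mantissa has min_int integer digits.
    """
    d = Decimal(value)
    if max_int is not None:
        exponent = (d.adjusted() // max_int) * max_int
    else:
        exponent = d.adjusted() - (min_int - 1)
    mantissa = d.scaleb(-exponent).normalize()
    return f"{format(mantissa, 'f')}{exp_char}{exponent}"

print(scientific("1234"))                  # 1.234E3    (like "0.###E0")
print(scientific("0.00123", min_int=2))    # 12.3E-4    (like "00.###E0")
print(scientific("12345", max_int=3))      # 12.345E3   (like "##0.####E0")
```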

G.5 Significant Digits

There are two ways of controlling how many digits are shown: (a) significant digits counts, or (b) integer and fraction digit counts. Integer and fraction digit counts are described above. When a formatter is using significant digits counts, the number of integer and fraction digits is not specified directly, and the formatter settings for these counts are ignored. Instead, the formatter uses however many integer and fraction digits are required to display the specified number of significant digits. Examples:

Pattern	Minimum significant digits	Maximum significant digits	Number	Output
@@@	3	3	12345	12300
@@@	3	3	0.12345	0.123
@@##	2	4	3.14159	3.142
@@##	2	4	1.23004	1.23

In order to enable significant digits formatting, use a pattern containing the '@' pattern character. In order to disable significant digits formatting, use a pattern that does not contain the '@' pattern character.
Significant digit counts may be expressed using patterns that specify a minimum and maximum number of significant digits. These are indicated by the '@' and '#' characters. The minimum number of significant digits is the number of '@' characters. The maximum number of significant digits is the number of '@' characters plus the number of '#' characters following on the right. For example, the pattern "@@@" indicates exactly 3 significant digits. The pattern "@##" indicates from 1 to 3 significant digits. Trailing zero digits to the right of the decimal separator are suppressed after the minimum number of significant digits have been shown. For example, the pattern "@##" formats the number 0.1203 as "0.12".
If a pattern uses significant digits, it may not contain a decimal separator, nor the '0' pattern character. Patterns such as "@00" or "@.###" are disallowed.
Any number of '#' characters may be prepended to the left of the leftmost '@' character. These have no effect on the minimum and maximum significant digits counts, but may be used to position grouping separators. For example, "#,#@#" indicates a minimum of one significant digit, a maximum of two significant digits, and a grouping size of three.
The number of significant digits has no effect on parsing.
Significant digits may be used together with exponential notation. Such patterns are equivalent to a normal exponential pattern with a minimum and maximum integer digit count of one, a minimum fraction digit count of Minimum Significant Digits - 1, and a maximum fraction digit count of Maximum Significant Digits - 1. For example, the pattern "@@###E0" is equivalent to "0.0###E0".
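The significant-digit rules above can be illustrated with a small sketch that reproduces the rows of the example table. The function names `sig_counts` and `format_significant` are hypothetical; the sketch assumes nonzero input given as a decimal string, uses half-even rounding, and ignores grouping and localized symbols.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def sig_counts(pattern):
    """Return (minimum, maximum) significant digits for an '@' pattern."""
    minimum = pattern.count("@")
    # Only '#' characters to the right of the last '@' raise the maximum.
    maximum = minimum + pattern[pattern.rindex("@") + 1:].count("#")
    return minimum, maximum

def _round_sig(d, msd, max_sig):
    # Quantize so that max_sig digits remain, counting from position msd.
    unit = Decimal((0, (1,), msd - max_sig + 1))
    return d.quantize(unit, rounding=ROUND_HALF_EVEN)

def format_significant(value, min_sig, max_sig):
    d = Decimal(value)
    msd = d.adjusted()              # position of the most significant digit
    q = _round_sig(d, msd, max_sig)
    if q.adjusted() > msd:          # rounding carried over, e.g. 9.999 -> 10.0
        msd = q.adjusted()
        q = _round_sig(d, msd, max_sig)
    int_part, _, frac_part = format(q, "f").partition(".")
    # Fraction digits needed to reach the minimum significant digit count.
    min_frac = max(0, min_sig - 1 - msd)
    frac_part = frac_part.rstrip("0").ljust(min_frac, "0")
    return int_part + ("." + frac_part if frac_part else "")

for pattern, number in [("@@@", "12345"), ("@@@", "0.12345"),
                        ("@@##", "3.14159"), ("@@##", "1.23004")]:
    lo, hi = sig_counts(pattern)
    print(pattern, number, format_significant(number, lo, hi))
```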

G.6 Padding

Patterns support padding the result to a specific width. In a pattern the pad escape character, followed by a single pad character, causes padding to be parsed and formatted. The pad escape character is '*'. For example, "$*x#,##0.00" formats 123 to "$xx123.00", and 1234 to "$1,234.00".

When padding is in effect, the width of the positive subpattern, including prefix and suffix, determines the format width. For example, in the pattern "* #0 o''clock", the format width is 10.
Some parameters which usually do not matter have meaning when padding is used, because the pattern width is significant with padding. In the pattern "* ##,##,#,##0.##", the format width is 14. The initial characters "##,##," do not affect the grouping size or maximum integer digits, but they do affect the format width.
Padding may be inserted at one of four locations: before the prefix, after the prefix, before the suffix, or after the suffix. No padding can be specified in any other location. If there is no prefix, before the prefix and after the prefix are equivalent, likewise for the suffix.
When specified in a pattern, the code point immediately following the pad escape is the pad character. This may be any character, including a special pattern character. That is, the pad escape escapes the following character. If there is no character after the pad escape, then the pattern is illegal.

Rounding

Patterns support rounding to a specific increment. For example, 1230 rounded to the nearest 50 is 1250. Mathematically, rounding to specific increments is performed by dividing by the increment, rounding to an integer, then multiplying by the increment. To take a more bizarre example, 1.234 rounded to the nearest 0.65 is 1.3, as follows:

Original:	1.234
Divide by increment (0.65):	1.89846...
Round:	2
Multiply by increment (0.65):	1.3

To specify a rounding increment in a pattern, include the increment in the pattern itself. "#,#50" specifies a rounding increment of 50. "#,##0.05" specifies a rounding increment of 0.05.

Rounding only affects the string produced by formatting. It does not affect parsing or change any numerical values.
An implementation may allow the specification of a rounding mode to determine how values are rounded. In the absence of such choices, the default is to round "half-even", as described in IEEE arithmetic. That is, it rounds towards the "nearest neighbor" unless both neighbors are equidistant, in which case it rounds towards the even neighbor. It behaves as round "half-up" if the digit to the left of the discarded fraction is odd, and as round "half-down" if it is even. Note that this is the rounding mode that minimizes cumulative error when applied repeatedly over a sequence of calculations.
Some locales use rounding in their currency formats to reflect the smallest currency denomination.
In a pattern, digits '1' through '9' specify rounding, but otherwise behave identically to digit '0'.
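The divide, round, multiply procedure for rounding increments can be sketched with Python's decimal module, using the default half-even mode described above. The function name `round_to_increment` is hypothetical, and values are passed as strings to keep the arithmetic exact.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_to_increment(value, increment):
    """Round `value` to the nearest multiple of `increment` (half-even)."""
    v, inc = Decimal(value), Decimal(increment)
    # Divide by the increment, round to an integer, multiply back.
    steps = (v / inc).quantize(Decimal(1), rounding=ROUND_HALF_EVEN)
    return steps * inc

print(round_to_increment("1230", "50"))      # 1250
print(round_to_increment("1.234", "0.65"))   # 1.30
print(round_to_increment("0.125", "0.01"))   # 0.12 (tie goes to the even neighbor)
```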


decimalFormats: The normal locale specific way to write a base 10 number.
currencyFormats: Use \u00A4 where the local currency symbol should be. Doubling the currency symbol (\u00A4\u00A4) will output the international currency symbol (a 3-letter code).
percentFormats: Pattern for use with percentage formatting
scientificFormats: Pattern for use with scientific (exponent) formatting.

G.7 Quoting Rules

Single quotes, ('), enclose bits of the pattern that should be treated literally. Inside a quoted string, two single quotes ('') are replaced with a single one ('). For example: 'X '#' Q ' -> X 1939 Q (Literal strings underlined.)

G.8 Number Elements

Localized symbols used in number formatting and parsing.

decimal: - separates the integer and fractional part of the number.
group: - separates clusters of integer digits to make large numbers more legible; commonly used for thousands (grouping size 3, e.g. "100,000,000") or in some locales, ten-thousands (grouping size 4, e.g. "1,0000,0000"). There may be two different grouping sizes: The primary grouping size used for the least significant integer group, and the secondary grouping size used for more significant groups; these are not the same in all locales (e.g. "12,34,56,789"). If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".
list: - separates lists of numbers
percentSign: - symbol used to indicate a percentage (1/100th) amount. (If present, the value is also multiplied by 100 before formatting. That way 1.23 → 123%)
nativeZeroDigit: - Symbol used to indicate a digit in the pattern, or zero if that place would otherwise be empty. For example, with the digit of '0', the pattern "000" would format "34" as "034", but the pattern "0" would format "34" as just "34". As well, the digits 1-9 are expected to follow the code point of this specified 0 value.
patternDigit: - Symbol used to indicate any digit value, typically #. When that digit is zero, then it is not shown.
minusSign: - Symbol used to denote negative value.
plusSign: - Symbol used to denote positive value.
exponential: - Symbol separating the mantissa and exponent values.
perMille: - symbol used to indicate a per-mille (1/1000th) amount. (If present, the value is also multiplied by 1000 before formatting. That way 1.23 → 1230‰)
infinity: - The infinity sign. Corresponds to the IEEE infinity bit pattern.
nan: - The NaN (not a number) sign. Corresponds to the IEEE NaN bit pattern.
currencyDecimal: This is used as the decimal separator in currency formatting/parsing, instead of the DecimalSeparator from the NumberElements list. This item is optional in the CLDR.
currencyGroup: This is used as the grouping separator in currency formatting/parsing, instead of the grouping separator from the NumberElements list. This item is optional in the CLDR.

Appendix H: Choice Patterns

A choice pattern is a string that chooses among a number of strings, based on numeric value. It has the following form:

<choice_pattern> = <choice> ( '|' <choice> )*
<choice> = <number><relation><string>
<number> = ('+' | '-')? ('∞' | [0-9]+ ('.' [0-9]+)?)
<relation> = '<' | '≤'

The interpretation of a choice pattern is that given a number N, the pattern is scanned from right to left, for each choice evaluating <number> <relation> N. The first choice that matches results in the corresponding string. If no match is found, then the first string is used. For example:

Pattern	N	Result
0≤Rf|1≤Ru|1<Re	-∞, -3, -1, -0.000001	Rf (defaulted to first string)
	0, 0.01, 0.9999	Rf
	1	Ru
	1.00001, 5, 99, ∞	Re

Quoting is done using ' characters, as in date or number formats.
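The right-to-left evaluation described above can be sketched as follows. This is an illustration under simplifying assumptions: the function names `parse_choice_pattern` and `choose` are hypothetical, and quoting with ' characters is not handled, so strings must not contain a literal '|'.

```python
import re

_CHOICE = re.compile(r"([+-]?(?:∞|\d+(?:\.\d+)?))([<≤])(.*)", re.S)

def parse_choice_pattern(pattern):
    """Split a choice pattern into (limit, relation, string) triples."""
    choices = []
    for part in pattern.split("|"):
        m = _CHOICE.fullmatch(part)
        if m is None:
            raise ValueError(f"malformed choice: {part!r}")
        number, relation, text = m.groups()
        choices.append((float(number.replace("∞", "inf")), relation, text))
    return choices

def choose(pattern, n):
    """Return the string selected for the number n."""
    choices = parse_choice_pattern(pattern)
    # Scan from right to left; the first matching choice wins.
    for limit, relation, text in reversed(choices):
        if (limit < n) if relation == "<" else (limit <= n):
            return text
    # No match: default to the first string.
    return choices[0][2]

pattern = "0≤Rf|1≤Ru|1<Re"
for n in (-3, 0.01, 1, 1.00001, float("inf")):
    print(n, choose(pattern, n))
```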

Appendix I: Inheritance and Validity

The following describes in more detail how to determine the exact inheritance of elements, and the validity of a given element in LDML.

I.1 Definitions

Blocking elements are those whose subelements do not inherit from parent locales. For example, a <collation> element is a blocking element: everything in a <collation> element is treated as a single lump of data, as far as inheritance is concerned. For more information, see Appendix K: Valid Attribute Values.

Attributes that serve to distinguish multiple elements at the same level are called distinguishing attributes. For example, the type attribute distinguishes different elements in lists of translations, such as:

<language type="aa">Afar</language>
<language type="ab">Abkhazian</language>

Distinguishing attributes affect inheritance; two elements with different distinguishing attributes are treated as different for purposes of inheritance. For more information, see Appendix K: Valid Attribute Values. Other attributes are called nondistinguishing (or informational) attributes. These carry separate information, and do not affect inheritance.

For any element in an XML file, an element chain is a resolved [XPath] leading from the root to an element, with attributes on each element in alphabetical order. So in, say, http://unicode.org/cldr/data/common/main/el.xml we may have:

<ldml>
  <identity>
    <version number="1.1" />
    <generation date="2004-06-04" />
    <language type="el" />
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="ar">Αραβικά</language>
...

Which gives the following element chains (among others):

//ldml/identity/version[@number="1.1"]
//ldml/localeDisplayNames/languages/language[@type="ar"]

An element chain A is an extension of an element chain B if B is equivalent to an initial portion of A. For example, #2 below is an extension of #1. (Equivalent, depending on the tree, may not be "identical to". See below for an example.)

//ldml/localeDisplayNames
//ldml/localeDisplayNames/languages/language[@type="ar"]

An LDML file can be thought of as an ordered list of element pairs: <element chain, data>, where the element chains are all the chains for the end-nodes. (This works because of restrictions on the structure of LDML, including that it does not allow mixed content.) The ordering is the ordering that the element chains are found in the file, and thus determined by the DTD.

For example, some of those pairs would be the following. Notice that the first has the null string as element contents.

<//ldml/identity/version[@number="1.1"], "">
<//ldml/localeDisplayNames/languages/language[@type="ar"], "Αραβικά">

Note: There are two exceptions to this:

Blocking nodes and their contents are treated as a single end node.

In terms of computing inheritance, the element pair consists of the element chain plus all distinguishing attributes; the value consists of the value (if any) plus any nondistinguishing attributes.

Thus instead of the element pair being (a) below, it is (b):

(a)
<//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00'],
"">

(b)
<//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart,
[@day='sun'][@time='00:00']>

Two LDML element chains are equivalent when they would be identical if all attributes and their values were removed — except for distinguishing attributes. Thus the following are equivalent:

//ldml/localeDisplayNames/languages/language[@type="ar"]
//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]
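Equivalence and extension of element chains can be sketched with a little string processing. This is a rough illustration only: the function names are hypothetical, and the set of distinguishing attributes is assumed here to contain just "type"; the actual classification of attributes is given in Appendix K.

```python
import re

# Assumption for this sketch: only "type" is treated as distinguishing.
DISTINGUISHING = {"type"}

# Matches one attribute predicate such as [@type="ar"] or [@day='sun'].
_ATTR = re.compile(r"""\[@(\w+)=(?:"[^"]*"|'[^']*')\]""")

def strip_nondistinguishing(chain):
    """Remove every attribute predicate that is not distinguishing."""
    def keep(match):
        return match.group(0) if match.group(1) in DISTINGUISHING else ""
    return _ATTR.sub(keep, chain)

def equivalent(a, b):
    return strip_nondistinguishing(a) == strip_nondistinguishing(b)

def is_extension(a, b):
    """True if chain b is equivalent to an initial portion of chain a."""
    na, nb = strip_nondistinguishing(a), strip_nondistinguishing(b)
    return na == nb or na.startswith(nb + "/")

c1 = '//ldml/localeDisplayNames/languages/language[@type="ar"]'
c2 = '//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]'
print(equivalent(c1, c2))                                 # True
print(is_extension(c2, "//ldml/localeDisplayNames"))      # True
```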

For any locale ID, a locale chain is an ordered list starting with the root and leading down to the ID. For example:

<root, de, de_DE, de_DE_xxx>

I.2 Resolved Data File

To produce a fully resolved locale data file from CLDR for a locale ID L, you start with L, and successively add unique items from the parent locales until you get up to root. More formally, this can be expressed as the following procedure.

Let Result be initially L.
For each Li in the locale chain for L, starting at L and going up to root:
1. Let Temp be a copy of the pairs in the LDML file for Li
2. Replace each alias in Temp by the resolved list of pairs it points to.
  1. The resolved list of pairs is obtained by recursively applying this procedure.
  2. That alias now blocks any inheritance from the parent. (See Section 5.1 Common Elements for an example.)
3. For each element pair P in Temp:
  1. If P does not contain a blocking element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.

Notes:

When adding an element pair to a result, it has to go in the right order for it to be valid according to the DTD.
The identity element and its children are unaffected by resolution.
The LDML data must be constructed so as to avoid circularity in step 2.2.
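The core of the procedure above can be sketched as follows. This is a deliberately reduced sketch, not the full algorithm: alias replacement (step 2), blocking elements, and DTD ordering are omitted, each file is modeled as a dict from element chain to value with chains assumed to be already reduced to their distinguishing attributes, and the names `locale_chain`, `resolve`, and `files` are hypothetical.

```python
def locale_chain(locale_id):
    """Return the chain from the locale up to root, e.g.
    de_DE_xxx -> [de_DE_xxx, de_DE, de, root]."""
    chain = [locale_id]
    while "_" in chain[-1]:
        chain.append(chain[-1].rsplit("_", 1)[0])
    if chain[-1] != "root":
        chain.append("root")
    return chain

def resolve(locale_id, files):
    """Merge element pairs from the locale and its parents.

    files: dict mapping locale ID -> {element chain: value}.
    A pair from a parent is added only if no pair with the same
    element chain is already present in the result.
    """
    result = {}
    for locale in locale_chain(locale_id):
        for chain, value in files.get(locale, {}).items():
            result.setdefault(chain, value)
    return result

files = {
    "root": {"//ldml/a": "root-a", "//ldml/b": "root-b"},
    "de": {"//ldml/a": "de-a"},
    "de_DE": {"//ldml/c": "de_DE-c"},
}
print(resolve("de_DE", files))
```

In this sketch a missing file (such as de_DE_xxx in the usage above, if requested) simply contributes no pairs, so the result is inherited entirely from the parents.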

I.3 Valid Data

The attribute draft="x" in LDML means that the data has not been approved by the subcommittee. (For more information, see Process). However, some data that is not explicitly marked as draft may be implicitly draft, either because it inherits it from a parent, or from an enclosing element.

Example 2. Suppose that new locale data is added for af (Afrikaans). To indicate that all of the data is unconfirmed, the attribute can be added to the top level.

<ldml version="1.1" draft="unconfirmed">
  <identity>
    <version number="1.1" />
    <generation date="2004-06-04" />
    <language type="af" />
  </identity>
  <characters>...</characters>
  <localeDisplayNames>...</localeDisplayNames>
</ldml>

Any data can be added to that file, and the status will all be draft=unconfirmed. Once an item is vetted (whether it is inherited or explicitly in the file), its status can be changed to approved. This can be done by leaving draft="unconfirmed" on the enclosing element and marking the child with draft="approved", such as:

<ldml version="1.1" draft="unconfirmed">
  <identity>
    <version number="1.1" />
    <generation date="2004-06-04" />
    <language type="af" />
  </identity>
  <characters draft="approved">...</characters>
  <localeDisplayNames>...</localeDisplayNames>
  <dates/>
  <numbers/>
  <collations/>
</ldml>

However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in Appendix L: Canonical Form. If an LDML file does have draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file.
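The push-down of draft attributes can be sketched as follows. This is only an illustration, assuming that a draft value on an element overrides the value inherited from an enclosing element; the function name push_draft_to_leaves and the shortened sample file are hypothetical, and the sketch neither suppresses defaulted values nor treats the identity element specially.

```python
import xml.etree.ElementTree as ET

def push_draft_to_leaves(element, inherited=None):
    """Move draft attributes from enclosing elements down to leaf elements."""
    # A draft value on this element overrides the one inherited from above.
    draft = element.attrib.pop("draft", inherited)
    children = list(element)
    if not children:
        # Leaf node: this is where the draft attribute ends up.
        if draft is not None:
            element.set("draft", draft)
        return
    for child in children:
        push_draft_to_leaves(child, draft)

# Hypothetical, shortened version of the Afrikaans example above.
SAMPLE = """<ldml version="1.1" draft="unconfirmed">
  <characters draft="approved">...</characters>
  <localeDisplayNames>...</localeDisplayNames>
</ldml>"""

root = ET.fromstring(SAMPLE)
push_draft_to_leaves(root)
```

After the call, the top-level element no longer carries a draft attribute, and each leaf carries the value that applied to it.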

The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there is not one. It only has an effect for locales that inherit from the current file where a file is missing, and the elements would not otherwise be draft.

Example 1. Suppose that in a particular LDML tree, there are no region locales for German, for example, there is a de.xml file, but no files for de_AT.xml, de_CH.xml, or de_DE.xml. Then no elements are valid for any of those region locales. If we want to mark one of those files as having valid elements, then we introduce an empty file, such as the following.

<ldml version="1.1">
  <identity>
    <version number="1.1" />
    <generation date="2004-06-04" />
    <language type="de" />
    <territory type="AT" />
  </identity>
</ldml>

With the validSubLocales attribute, instead of adding the empty files for de_AT.xml, de_CH.xml, and de_DE.xml, in the de file we can add to the parent locale a list of the child locales that should behave as if files were present.

<ldml version="1.1" validSubLocales="de_AT de_CH de_DE">
  <identity>
    <version number="1.1" />
    <generation date="2004-06-04" />
    <language type="de" />
  </identity>
  ...
</ldml>

More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Parts 1, 2, and 4 of the procedure (Parent Locale Inheritance, Enclosing Element Inheritance, and Otherwise) are simply formalizations of what is in LDML already. Part 3 (Missing File Inheritance) adds the new element.

I.4 Checking for Draft Status:

Parent Locale Inheritance
1. Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
2. Produce the fully resolved data file D' for D.
3. In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
4. If there is no such E', return true
5. If E' is not equivalent to E, truncate E' to the length of E.
Enclosing Element Inheritance
1. Walk through the elements in E', from back to front.
  1. If you ever encounter draft=x, return x
2. If L' = L, return false
Missing File Inheritance
1. Otherwise, walk again through the elements in E', from back to front.
  1. If you encounter a validSubLocales attribute:
    1. If L is in the attribute value, return false
    2. Otherwise return true
Otherwise
1. Return true

The validSubLocales in the most specific (farthest from root file) locale file "wins" through the full resolution step (data from more specific files replacing data from less specific ones).
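The enclosing-element and missing-file parts of the check can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the caller has already located the data file for L' and found the truncated chain E' in the fully resolved file, and it models E' as a list of (element name, attributes) pairs. The name draft_status and this data model are hypothetical.

```python
def draft_status(chain, locale, file_locale):
    """Return a draft value, or True/False, following the steps above.

    chain       -- E' as a list of (element_name, attributes) pairs, ordered
                   from the outermost element to the innermost, or None if
                   no such E' exists
    locale      -- the requested locale L
    file_locale -- the locale L' whose data file was actually found
    """
    if chain is None:
        return True  # Parent Locale Inheritance, step 4
    # Enclosing Element Inheritance: walk from back to front.
    for _name, attrs in reversed(chain):
        if "draft" in attrs:
            return attrs["draft"]
    if file_locale == locale:
        return False
    # Missing File Inheritance: walk again, from back to front.
    for _name, attrs in reversed(chain):
        if "validSubLocales" in attrs:
            return locale not in attrs["validSubLocales"].split()
    # Otherwise
    return True

# Hypothetical chain taken from the de file in Example 1.
chain_de = [("ldml", {"validSubLocales": "de_AT de_CH de_DE"}), ("characters", {})]
```

With this chain, de_AT is treated as not draft even though it has no file of its own, while a sublocale that is not listed (such as de_LU) is treated as draft.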

I.5 Keyword and Default Resolution

When accessing data based on keywords, the following process is used. Consider the following example:

The locale 'de' has collation types A, B, C, and no <default> element
The locale 'de_CH' has <default type='B'>

Here are the searches for various combinations.

User Input	Lookup in Locale	For	Comment
de_CH no keyword	de_CH	default collation type	finds "B"
	de_CH	collation type=B	not found
	de	collation type=B	found
de no keyword	de	default collation type	not found
	root	default collation type	finds "standard"
	de	collation type=standard	not found
	root	collation type=standard	found
de@collation=A	de	collation type=A	found
de@collation=standard	de	collation type=standard	not found
de@collation=standard	root	collation type=standard	found
de@collation=foobar	de	collation type=foobar	not found
	root	collation type=foobar	not found, starts looking for default
	de	default collation type	not found
	root	default collation type	finds "standard"
	de	collation type=standard	not found
	root	collation type=standard	found
de_DE@collation=ducet	de_DE	collation type=ducet	not found
	de	collation type=ducet	not found
	root	collation type=ducet	found
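The searches in the table can be reproduced with a short sketch. The DATA dictionary below is a hypothetical stand-in for the collation data of the example (root is assumed to contain the types "standard" and "ducet" and the default "standard"), and the parent of a locale is assumed to be found by truncating the identifier at the last underscore.

```python
# Hypothetical model of the example data; de_DE deliberately has no entry.
DATA = {
    "root": {"default": "standard", "types": {"standard", "ducet"}},
    "de": {"default": None, "types": {"A", "B", "C"}},
    "de_CH": {"default": "B", "types": set()},
}

def chain(locale):
    """Yield the locale, then its parents, ending with root."""
    while locale != "root":
        yield locale
        locale = locale.rpartition("_")[0] or "root"
    yield "root"

def find_type(locale, wanted):
    """Return the first locale in the chain that has the wanted type."""
    for loc in chain(locale):
        if wanted in DATA.get(loc, {}).get("types", ()):
            return loc
    return None

def find_default(locale):
    """Return the first default type found along the chain."""
    for loc in chain(locale):
        default = DATA.get(loc, {}).get("default")
        if default is not None:
            return default
    return None

def resolve(locale, keyword=None):
    """Return (locale where the data was found, type that was used)."""
    if keyword is not None:
        found = find_type(locale, keyword)
        if found is not None:
            return found, keyword
    # No keyword, or the keyword value was not found: look for the default.
    default = find_default(locale)
    return find_type(locale, default), default
```

For example, resolve("de_CH") finds the default "B" in de_CH and then the data for "B" in de, matching the first rows of the table.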

Note: It is an invariant that the default in root for a given element must always be a value that exists in root. So you cannot have the following in root:

<someElements>
  <default type='a'/>
  <someElement type='b'>...</someElement>
  <someElement type='c'>...</someElement>
</someElements>

For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever no value is found in root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'.

Appendix J: Time Zone Display Names

There are three main types of formats for zone identifiers: GMT, generic (wall time), and standard/daylight. Standard and daylight are equivalent to a particular offset from GMT, and can be represented by a GMT offset as a fallback. In general, this is not true for the generic format, which is used for picking timezones or for conveying a timezone for specifying a recurring time (such as a meeting in a calendar). For either purpose, a GMT offset would lose information.

Time Zone Format Terminology

The following terminology defines more precisely the formats that are used.

Generic non-location format: Reflects "wall time" (what is on a clock on the wall): used for recurring events, meetings, or anywhere people do not want to be overly specific. For example, "10 am Pacific Time" will be GMT-8 in the winter, and GMT-7 in the summer.

"Pacific Time"
"PT"

Generic partial location format: Reflects "wall time": used as a fallback format when the generic non-location format is not specific enough.

"Pacific Time (Canada)"
"Pacific Time (Whitehorse)"

Generic location format: Reflects "wall time": a primary function of this format type is to represent a time zone in a list or menu for user selection of time zone, since the naming is more uniform than the generic non-location format and zones for the same country will be grouped together (and could be organized hierarchically by country if desired). It is also a fallback format when there is no translation for the generic non-location format.

"United States (Los Angeles) Time"
"Italy Time"

Note: As the final fallback, a generic location format is constructed from the part of the time zone ID that represents an exemplar city name, together with its country. However, there are Unicode time zones which are not associated with any locations, such as "Etc/GMT+5" and "PST8PDT". The date format pattern "vvvv" specifies the generic location format, but it displays localized GMT format for them. Some of these time zones observe daylight saving time, so the result (localized GMT format) may change depending on input date. For generating a list for user selection of time zone with format "vvvv", these non-location zones should be excluded.

Specific non-location format: Reflects a specific standard or daylight time, which may or may not be the wall time. For example, "10 am Pacific Standard Time" will be GMT-8 in the winter and in the summer.

"Pacific Standard Time"
"PST"
"Pacific Daylight Time"
"PDT"

Localized GMT format: A constant, specific offset from GMT (or UTC), which may be in a translated form. There are two styles for this. The first is used when there is an explicit non-zero offset from GMT; this style is specified by the <gmtFormat> element:

"HMG+03.30"
"GMT+03:30"
"Гриинуич+03:30"

Otherwise (when the offset from GMT is zero, referring to GMT itself) the style specified by the <gmtZeroFormat> element is used:

"HMG"
"GMT"
"Гриинуич"

RFC 822 GMT format: A constant, specific offset from GMT (or UTC), which always has the same format.

"-0800"

ISO 8601 time zone format: A constant, specific offset from UTC, which always has the same format except UTC itself ("Z").

"-08:00"
"Z"

Raw Offset - an offset from GMT that does not include any daylight savings behavior. For example, the raw offset for Pacific Time is -8, even though the observed offset may be -8 or -7.

Metazone - a collection of time zones that share the same behavior and same name during some period. They may differ in daylight behavior (whether they have it and when).

For example, the TZID America/Cambridge_Bay is in the following metazones during various periods:

<timezone type="America/Cambridge_Bay">
<usesMetazone to="1999-10-31 08:00" mzone="America_Mountain"/>
<usesMetazone to="2000-10-29 07:00" from="1999-10-31 08:00" mzone="America_Central"/>
<usesMetazone to="2000-11-05 05:00" from="2000-10-29 07:00" mzone="America_Eastern"/>
<usesMetazone to="2001-04-01 09:00" from="2000-11-05 05:00" mzone="America_Central"/>
<usesMetazone from="2001-04-01 09:00" mzone="America_Mountain"/>
</timezone>

Zones may join or leave a metazone over time. The data relating between zones and metazones is in the supplemental information; the locale data is restricted to translations of metazones and zones.

Invariants:

At any given point in time, each zone belongs to exactly one metazone.
Except for daylight savings, at any given time, all zones in a metazone have the same offset at that time.
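Looking up the metazone of a zone at a given time can be sketched from the usesMetazone data above. This is an illustration under stated assumptions: each period is treated as a half-open interval (from inclusive, to exclusive), the input time is assumed to be expressed in the same time scale as the from and to attributes, and the function name metazone_at is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

XML = """<timezone type="America/Cambridge_Bay">
<usesMetazone to="1999-10-31 08:00" mzone="America_Mountain"/>
<usesMetazone to="2000-10-29 07:00" from="1999-10-31 08:00" mzone="America_Central"/>
<usesMetazone to="2000-11-05 05:00" from="2000-10-29 07:00" mzone="America_Eastern"/>
<usesMetazone to="2001-04-01 09:00" from="2000-11-05 05:00" mzone="America_Central"/>
<usesMetazone from="2001-04-01 09:00" mzone="America_Mountain"/>
</timezone>"""

FORMAT = "%Y-%m-%d %H:%M"

def metazone_at(timezone_element, when):
    """Return the mzone whose period contains `when`, or None."""
    for period in timezone_element.iter("usesMetazone"):
        start = period.get("from")  # a missing attribute means open-ended
        end = period.get("to")
        if start is not None and when < datetime.strptime(start, FORMAT):
            continue
        if end is not None and when >= datetime.strptime(end, FORMAT):
            continue
        return period.get("mzone")
    return None

cambridge_bay = ET.fromstring(XML)
```

Because the periods do not overlap, at most one of them matches, which mirrors the invariant that a zone belongs to exactly one metazone at any given time.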

Golden Zone - the TZDB zone that exemplifies a metazone. For example, America/New_York is the golden zone for the metazone America_Eastern:

<mapZone other="America_Eastern" territory="001" type="America/New_York"/>

Invariants:

The golden zones are those in mapZone supplemental data under the territory "001".
Every metazone has exactly one golden zone.
Each zone has at most one metazone for which it is golden.
The golden zone is in that metazone during the entire life of the metazone. (The raw offset of the golden zone may change over time.)
Each other zone must have the same raw offset as the golden zone, for the entire period that it is in the metazone. (It might not have the same offset when daylight savings is in effect.)
A golden zone in mapTimezones (metaZones.xml) must have reverse mapping in metazoneInfo (metaZones.xml)

Preferred Zone - for a given TZID, the "best" zone out of a metazone for a given country or language.

Invariants:

The preferred zones for a given country XX are those in mapZone supplemental data under the territory XX.
Every metazone has at most one preferred zone for a given territory XX.
Each zone has at most one metazone for which it is preferred for a territory XX.
The preferred zone for a given metazone and territory XX is in a metazone M during any time when any other zone in XX is also in M
A preferred zone in mapTimezones (metaZones.xml) must have reverse mapping in metazoneInfo (metaZones.xml)

For example, for America_Pacific the preferred zone for Canada is America/Vancouver, and the preferred zone for Mexico is America/Tijuana. The golden zone is America/Los_Angeles, which is also the preferred zone for any other country.

<mapZone other="America_Pacific" territory="001" type="America/Los_Angeles"/>
<mapZone other="America_Pacific" territory="CA" type="America/Vancouver"/>
<mapZone other="America_Pacific" territory="MX" type="America/Tijuana"/>
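The lookup of a preferred zone, falling back to the golden zone under territory "001", can be sketched as a dictionary lookup. The table below only contains the three mapZone entries shown above; the names MAP_ZONES and preferred_zone are hypothetical.

```python
# (metazone, territory) -> TZID, copied from the mapZone elements above.
MAP_ZONES = {
    ("America_Pacific", "001"): "America/Los_Angeles",
    ("America_Pacific", "CA"): "America/Vancouver",
    ("America_Pacific", "MX"): "America/Tijuana",
}

def preferred_zone(metazone, territory):
    """Return the preferred zone for the territory, else the golden zone."""
    zone = MAP_ZONES.get((metazone, territory))
    if zone is None:
        # No territory-specific entry: the golden zone ("001") applies.
        zone = MAP_ZONES.get((metazone, "001"))
    return zone
```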

fallbackRegionFormat: a formatting string such as "{1} Time ({0})", where {1} is the country and {0} is a city.

fallbackFormat: a formatting string such as "{1} ({0})", where {1} is the metazone, and {0} is the country or city.

regionFormat: a formatting string such as "{0} Time", where {0} is the country.

Goals

The timezones are designed so that:

For any given locale, every time round trips with all patterns Z, ZZZZ, z, zzzz, v, vvvv, V, VVVV (but not necessarily every timezone). That is, given a time and a format pattern with a zone string, you can format, then parse, and get back the same time.
Note that the round-tripping is not just important for parsing; it provides for formatting dates and times in an unambiguous way for users. It is also important for testing.

There are exceptions to the above for transition times.

With generic, during the transition when the local time maps to two possible GMT times.

For example, Java works as follows, favoring standard time:

Source: Sun Nov 04 01:30:00 PDT 2007

=> Formatted: "Sunday, November 4, 2007 1:30:00 AM"

=> Parsed: Sun Nov 04 01:30:00 PST 2007

When the timezone changes offset, say from GMT+4 to GMT+5, there can also be a gap.

The VVVV format will roundtrip not only the time, but the canonical timezone.

When the data for a given format is not available, a fallback format is used. The fallback order is given in the following list.

Specifics
- z - [short form] specific non-location
  - falling back to localized GMT
- zzzz - [long form] specific non-location
  - falling back to localized GMT
- Z - RFC 822 (no fallback necessary)
- ZZZZ - Localized GMT (no fallback necessary)
- ZZZZZ - ISO 8601 (no fallback necessary)
- V - specific non-location
  - falling back to localized GMT
Generics
- v - [short form] generic non-location
  (however, the rules are more complicated, see #5 below)
  - falling back to generic location
  - falling back to localized GMT
- vvvv - [long form] generic non-location
  (however, the rules are more complicated, see #5 below)
  - falling back to generic location
  - falling back to localized GMT
- VVVV - generic location
  - falling back to localized GMT
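The fallback order above can be held in a small table and walked in order. The sketch below is only illustrative: the format names are plain labels chosen here, the caller is assumed to supply whatever display strings are available for the zone and date, and the more complicated rules for the v and vvvv patterns are not modeled.

```python
# Pattern -> formats to try, in order, as listed above.
FALLBACK_ORDER = {
    "z": ["specific non-location (short)", "localized GMT"],
    "zzzz": ["specific non-location (long)", "localized GMT"],
    "Z": ["RFC 822"],
    "ZZZZ": ["localized GMT"],
    "ZZZZZ": ["ISO 8601"],
    "V": ["specific non-location", "localized GMT"],
    "v": ["generic non-location (short)", "generic location", "localized GMT"],
    "vvvv": ["generic non-location (long)", "generic location", "localized GMT"],
    "VVVV": ["generic location", "localized GMT"],
}

def format_with_fallback(pattern, available):
    """Return the first available display string for the pattern.

    `available` maps a format name to a string; a missing key (or None)
    means there is no data for that format.
    """
    for name in FALLBACK_ORDER[pattern]:
        text = available.get(name)
        if text is not None:
            return text
    return None
```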

The following process is used for the particular formats, with the fallback rules as above.

Some of the examples are drawn from real data, while others are for illustration. For illustration the region format is "Hora de {0}". The fallback format in the examples is "{1} ({0})", which is what is in root.

In all cases, first canonicalize the TZ ID according to the <timezoneData> table in supplemental data. Use that canonical TZID in each of the following steps.
- America/Atka → America/Adak
- Australia/ACT → Australia/Sydney
For RFC 822 GMT format ("Z") return the results according to the RFC.
- America/Los_Angeles → "-0800"
Note: The digits in this case are always from the western digits, 0..9.
For the localized GMT format, use the gmtFormat (such as "GMT{0}" or "HMG{0}") with the hourFormat (such as "+HH:mm;-HH:mm" or "+HH.mm;-HH.mm").
- America/Los_Angeles → "GMT-08:00" // standard time
- America/Los_Angeles → "HMG-07:00" // daylight time
- Etc/GMT+3 → "GMT-03.00" // note that TZ tzids have inverse polarity!
Note: The digits should be whatever are appropriate for the locale used to format the time zone, not necessarily from the western digits, 0..9. For example, they might be from ०..९.
For ISO 8601 time zone format ("ZZZZZ") return the results according to ISO 8601.
- America/Los_Angeles → "-08:00"
- Etc/GMT → Z // special case of UTC
Note: The digits in this case are always from the western digits, 0..9.
For the non-location formats (generic or specific):
1. if there is an explicit translation for the TZID in timeZoneNames according to type (generic, standard, or daylight) in the resolved locale, return it.
  1. If the requested type is not available, but another type is, and there is a Type Fallback then return that other type.
    - Examples:
      - America/Los_Angeles → "Heure du Pacifique (ÉUA)" // generic
      - America/Los_Angeles → 太平洋標準時 // standard
      - America/Los_Angeles → Yhdysvaltain Tyynenmeren kesäaika // daylight
      - Europe/Dublin → Am Samhraidh na hÉireann // daylight
      - Note: This translation may not at all be literal: it would be what is most recognizable for people using the target language.
2. Otherwise, get the requested metazone format according to type (generic, standard, daylight).
  1. If the requested type is not available, but another type is, get the format according to Type Fallback.
  2. If there is no format for the type, fall back.
3. Otherwise do the following:
  1. Get the country for the current locale. If there is none, use the most likely country based on the likelySubtags data. If there is none, use “001”.
  2. Get the preferred zone for the metazone for the country; if there is none for the country, use the preferred zone for the metazone for “001”.
  3. If that preferred zone is the same as the requested zone, use the metazone format. For example, "Pacific Time" for Vancouver if the locale is en-CA, or for Los Angeles if locale is en-US.
  4. Otherwise, if the zone is the preferred zone for its country but not for the country of the locale, use the metazone format + country in the fallbackFormat.
  5. Otherwise, use the metazone format + city in the fallbackFormat.
    - Examples:
      - "Pacific Time (Canada)" for the zone Vancouver in the locale en_MX.
      - Mountain Time (Phoenix)
      - Pacific Time (Whitehorse)
For the generic location format:
1. From <timezoneData> get the country code for the zone, and determine whether there is only one timezone in the country. If there is only one timezone or if the zone id is in the singleCountries list, format the country name with the regionFormat, and return it.
 - Examples:
 - Europe/Rome → IT → Italy Time // for English
 - Africa/Monrovia → LR → "Hora de Liberja"
 - America/Havana → CU → "Hora de CU" // if CU is not localized
2. Otherwise format the exemplar city with the regionFormat, and return it.
 1. America/Buenos_Aires → "Buenos Aires Time"

Note: If a language does require grammatical changes when composing strings, then the regionFormat should either use a neutral format such as "Heure: {0}", or put all exceptional cases in explicitly translated strings.
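The three offset-based formats in the process above (RFC 822, localized GMT, and ISO 8601) can be sketched as follows. This is a simplified illustration: the offset is given in minutes east of GMT, the hourFormat is assumed to contain only the fields HH and mm, and only western digits are produced, whereas a real implementation would use the digits of the locale for the localized GMT format. All function names are hypothetical.

```python
def _split(offset_minutes):
    """Return (sign, hours, minutes) for an offset in minutes east of GMT."""
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return sign, hours, minutes

def rfc822(offset_minutes):
    sign, hours, minutes = _split(offset_minutes)
    return f"{sign}{hours:02d}{minutes:02d}"

def iso8601(offset_minutes):
    if offset_minutes == 0:
        return "Z"  # special case of UTC
    sign, hours, minutes = _split(offset_minutes)
    return f"{sign}{hours:02d}:{minutes:02d}"

def localized_gmt(offset_minutes, gmt_format="GMT{0}",
                  hour_format="+HH:mm;-HH:mm", gmt_zero_format="GMT"):
    if offset_minutes == 0:
        return gmt_zero_format
    # hourFormat holds the positive and negative patterns, separated by ";".
    positive, negative = hour_format.split(";")
    pattern = negative if offset_minutes < 0 else positive
    _sign, hours, minutes = _split(offset_minutes)
    body = pattern.replace("HH", f"{hours:02d}").replace("mm", f"{minutes:02d}")
    return gmt_format.replace("{0}", body)
```

For example, an offset of -480 minutes gives "-0800", "-08:00", and "GMT-08:00" respectively, and an offset of 210 minutes with gmtFormat "HMG{0}" and hourFormat "+HH.mm;-HH.mm" gives "HMG+03.30".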

Type Fallback

When a specified type (generic, standard, daylight) does not exist:

If the daylight type does not exist, then the metazone doesn’t require daylight support. For all three types:
1. If the generic type exists, use it.
2. Otherwise if the standard type exists, use it.
Otherwise if the generic type is needed, but not available, and the offset and daylight offset do not change within an interval of +/- 184 days around the exact formatted time, use the standard type.
1. Example: "Mountain Standard Time" for Phoenix
2. Note: 184 is the smallest number that is at least 6 months AND the smallest number that is more than 1/2 year (Gregorian).

Composition

In composing the metazone + city or metazone + country:

Use the fallbackFormat, where:
- {1} will be the metazone
- {0} will be a qualifier (city or country)
- Example:
  - metazone = Pacific Time
  - city = Phoenix
  - → Pacific Time (Phoenix)
If the localized country name is not available, use the code:
- CU (country code)→ "CU" // no localized country name for Cuba
If the localized exemplar city is not available, use as the exemplar city the last field of the raw TZID, stripping off the prefix and turning _ into space.
- America/Los_Angeles → "Los Angeles" // no localized exemplar city

Note: As with the regionFormat, exceptional cases need to be explicitly translated.
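The composition rules can be sketched with ordinary positional string formatting. This is a minimal illustration that assumes the patterns contain only the placeholders {0} and {1} and no characters that need quoting; the function names are hypothetical.

```python
def exemplar_city_from_tzid(tzid):
    """Fallback city name: last field of the TZID, with _ turned into space."""
    return tzid.rsplit("/", 1)[-1].replace("_", " ")

def compose(metazone_name, qualifier, fallback_format="{1} ({0})"):
    """Combine a metazone name ({1}) with a city or country qualifier ({0})."""
    return fallback_format.format(qualifier, metazone_name)

def region_name(country_or_city, region_format="{0} Time"):
    """Format a country or exemplar city with the regionFormat."""
    return region_format.format(country_or_city)
```

For example, compose("Pacific Time", "Phoenix") reproduces the example above, and region_name("CU", "Hora de {0}") gives the string used when the country name is not localized.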

Parsing

In parsing, an implementation will be able to either determine the zone id, or a simple offset from GMT for anything formatted according to the above process.

The following is a sample process for how this might be done. It is only a sample; implementations may use different methods for parsing.

The sample describes the parsing of a zone as if it were an isolated string. In implementations, the zone may be mixed in with other data (like the time), so the parsing actually has to look for the longest match, and then allow the remaining text to be parsed for other content. That requires certain adaptations to the following process.

Start with a string S.
If S matches the RFC 822 GMT format or ISO 8601 time zone format, return it.
- For example, "-0800" (RFC 822), "-08:00" (ISO 8601) => Etc/GMT+8
If S matches the English or localized GMT format, return the corresponding TZID
- Matching should be lenient. Thus allow for the number formats like: 03, 3, 330, 3:30, 33045 or 3:30:45. Allow +, -, or nothing. Allow spaces after GMT, +/-, and before number. Allow non-Latin numbers. Allow UTC or UT (per RFC 788) as synonyms for GMT (GMT, UT, UTC are global formats, always allowed in parsing regardless of locale).
- For example, "GMT+3" or "UT+3" or "HPG+3" => Etc/GMT-3
- When parsing, the absence of a numeric offset should be interpreted as offset 0, whether in localized or global formats. For example, "GMT" or "UT" or "UTC+0" or "HPG" => Etc/GMT
If S matches the fallback format, extract P = {0} [ie, the part in parens in the root format] and N = {1}.
If S does not match, set P = "" and N = S
If N matches the region format, then M = {0} from that format, otherwise M = N.
- For example, "United States (Los Angeles) Time" => N = "United States Time", M = "United States", P = "Los Angeles".
- For example, "United States Time" => N = "United States Time", M = "United States", P = "".
- For example, "United States" => N = M = "United States", P = "".
If P, N, or M is a localized country, set C to that value. If C has only one zone, return it.
- For example, "Italy Time (xxx)" or "xxx (Italy)" => Europe/Rome
- For example, "xxx (Canada)" or "Canada Time (xxx)" => Sets C = CA and continues
If P is a localized TZID (and not metazone), return it.
- For example, "xxxx (Phoenix)" or "Phoenix (xxx)" => America/Phoenix
If N or M is a localized TZID (and not metazone), return it.
- For example, "Pacific Standard Time (xxx)" => "America/Los_Angeles" // this is only if "Pacific Standard Time" is not a metazone localization.
If N or M is a localized metazone
- If it corresponds to only one TZID, return it.
- If C is set, look up the Metazone + Country => TZID mapping, and return that value if it exists
- Get the locale's language, and get the default country from that. Look up the Metazone + DefaultCountry => TZID mapping, and return that value if it exists.
- Otherwise, lookup Metazone + 001 => TZID and return it (that will always exist)
If you get this far, return an error.
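The lenient matching of GMT-style offsets in the first steps of this sample process can be sketched with a regular expression. This is only an illustration of those steps, not of the whole process: the localized prefixes are passed in by the caller, matching is case-insensitive by assumption, and the function names are hypothetical. Note the inverse polarity of the Etc/GMT identifiers.

```python
import re

GLOBAL_PREFIXES = ("GMT", "UTC", "UT")  # always allowed, regardless of locale

def parse_gmt_offset(text, localized_prefixes=()):
    """Return the offset in seconds east of GMT, or None if there is no match."""
    prefixes = sorted(GLOBAL_PREFIXES + tuple(localized_prefixes), key=len, reverse=True)
    prefix_re = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(
        rf"(?:(?P<prefix>{prefix_re})\s*)?"
        r"(?:(?P<sign>[+-])?\s*(?P<h>\d{1,2})(?::?(?P<m>\d{2}))?(?::?(?P<s>\d{2}))?)?",
        re.IGNORECASE,
    )
    match = pattern.fullmatch(text.strip())
    if match is None or (match["prefix"] is None and match["h"] is None):
        return None
    # A missing numeric part is interpreted as offset 0.
    hours = int(match["h"] or 0)
    minutes = int(match["m"] or 0)
    seconds = int(match["s"] or 0)
    sign = -1 if match["sign"] == "-" else 1
    return sign * (hours * 3600 + minutes * 60 + seconds)

def etc_tzid(offset_seconds):
    """Map a whole-hour offset to an Etc/GMT identifier (inverse polarity)."""
    if offset_seconds % 3600:
        return None  # no Etc/GMT identifier for fractional-hour offsets
    hours = offset_seconds // 3600
    if hours == 0:
        return "Etc/GMT"
    return f"Etc/GMT{'-' if hours > 0 else '+'}{abs(hours)}"
```

Since \d and int() in Python accept any Unicode decimal digit, non-Latin digits are handled without extra code in this sketch.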

Note: This CLDR date parsing recommendation does not fully handle all RFC 788 date/time formats, nor is it intended to.

Parsing can be more lenient than the above, allowing for different spacing, punctuation, or other variation. A stricter parse would check for consistency between the xxx portions above and the rest, so "Pacific Standard Time (India)" would give an error.

Using this process, a correct parse will roundtrip the location format (VVVV) back to the canonical zoneid.

Australia/ACT → Australia/Sydney → “Sydney (Australia)” → Australia/Sydney

The GMT formats (Z and ZZZZ) will return an offset, and thus lose the original canonical zone id.

Australia/ACT → Australia/Sydney → "GMT+11:00" → GMT+11

The daylight and standard time formats, and the non-location formats (z, zzzz, v, and vvvv) may either roundtrip back to the original canonical zone id, to a zone in the same metazone at that time, or to just an offset, depending on the available translation data. Thus:

Australia/ACT → Australia/Sydney → "GMT+11:00" → GMT+11
PST8PDT → America/Los_Angeles → “PST” → America/Los_Angeles
America/Vancouver → “Pacific Time (Canada)” → America/Vancouver

Note: The hoursFormat, preferenceOrdering, and abbreviationFallback items used in earlier versions of this appendix are deprecated.

Appendix K: Valid Attribute Values

The valid attribute values, as well as other validity information, are contained in the supplementalMetadata.xml file. (Some, but not all, of this information could have been represented in XML Schema or a DTD.) Most of this is primarily for internal tool use.

The following specify the ordering of elements / attributes in the file:

<elementOrder>ldml alternate attributeOrder attributes blockingItems calendarPreference ...</elementOrder>
<attributeOrder>_q access after aliases allowsParsing alpha3 alternate at attribute ...</attributeOrder>

The suppress elements are those that are suppressed in canonicalization.

The serialElements are those that do not inherit, and may have ordering.

<serialElements>attributeValues base comment extend first_non_ignorable first_primary_ignorable
first_secondary_ignorable first_tertiary_ignorable first_trailing first_variable i ic languagePopulation
last_non_ignorable last_primary_ignorable last_secondary_ignorable last_tertiary_ignorable last_trailing
last_variable optimize p pc reset rules s sc settings suppress_contractions t tRule tc variable x
</serialElements>

The validity elements give the possible attribute values. They are in the format of a series of variables, followed by attributeValues.

<variable id="$calendar" type="choice">
buddhist coptic ethiopic ethiopic-amete-alem chinese gregorian hebrew indian islamic islamic-civil
japanese arabic civil-arabic thai-buddhist persian roc</variable>

The types indicate the style of match:

choice: for a list of possible values
regex: for a regular expression match
notDoneYet: for items without matching criteria
locale: for locale IDs
list: for a space-delimited list of values
path: for a valid [XPath]

If the attribute order="given" is supplied, it indicates the order of elements when canonicalizing (see below).

The variable values are intended for internal testing, and the definition and usage may change between releases. They do not necessarily include all valid elements. For example, for primary language codes, they include the subset that occur in CLDR locale data. They are intended for a particular version of CLDR, and may omit codes that were present in earlier versions, such as deprecated codes.

The <deprecated> element lists elements, attributes, and attribute values that are deprecated. If any deprecatedItems element contains more than one attribute, then only the listed combinations are deprecated. Thus the following means not that the draft attribute is deprecated, but that the true and false values for that attribute are:

<deprecatedItems attributes="draft" values="true false"/>

Similarly, the following means that the type attribute is deprecated, but only for the listed elements:

<deprecatedItems elements="abbreviationFallback default ... preferenceOrdering" attributes="type"/>

<!ELEMENT blockingItems EMPTY >
<!ATTLIST blockingItems elements NMTOKENS #IMPLIED >

The blockingItems indicate which elements (and their child elements) do not inherit. For example, because supplementalData is a blocking item, all paths containing the element supplementalData do not inherit.

<!ELEMENT distinguishingItems EMPTY >
<!ATTLIST distinguishingItems exclude ( true | false ) #IMPLIED >
<!ATTLIST distinguishingItems elements NMTOKENS #IMPLIED >
<!ATTLIST distinguishingItems attributes NMTOKENS #IMPLIED >

The distinguishing items indicate which combinations of elements and attributes (in unblocked environments) are distinguishing in performing inheritance. For example, the attribute type is distinguishing except in combination with certain elements, such as in:

<distinguishingItems
exclude="true"
elements="default measurementSystem mapping abbreviationFallback preferenceOrdering" 
attributes="type"/>

Appendix L: Canonical Form

The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files.

Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element foo:

<foo>
  <pattern>
  <somethingElse>
</foo>

It can never require the reverse order in a different element bar.

<bar>
  <somethingElse>
  <pattern>
</bar>

Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency:

<!ELEMENT currency (alias | (pattern*, displayName?, symbol?, pattern*,
decimal?, group?, special*)) >

XML files can have a wide variation in textual form, while representing precisely the same data. Putting the LDML files in the repository into a canonical form allows us to use the simple diff tools used widely (and in CVS) to detect differences when vetting changes, without those tools being confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily.

L.1 Content

All start elements are on their own line, indented by depth tabs.
All end elements (except for leaf nodes) are on their own line, indented by depth tabs.
Any leaf node with empty content is in the form <foo/>.
There are no blank lines except within comments or content.
Spaces are used within a start element. There are no extra spaces within elements.
- <version number="1.2"/>, not <version number = "1.2" />
- </identity>, not </identity >
All attribute values use double quote ("), not single (').
There are no CDATA sections, and no escapes except those absolutely required.
- no &apos; since it is not necessary
- no &#x61;, it would be just 'a'
All attributes with defaulted values are suppressed. See Appendix L.8, Defaulted Values Table.
The draft and alt="proposed.*" attributes are only on leaf elements.
The tzid are canonicalized in the following way:
1. All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical.
2. After that point, the first time a tzid is introduced, that is the canonical form.
That is, new IDs are added, but existing ones keep the original form. The TZ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. For example, when America/Argentina/Catamarca was introduced as the new name for the previous America/Catamarca, a link was added in the backward file.

Link America/Argentina/Catamarca America/Catamarca

Example:

<ldml draft="unconfirmed">
	<identity>
		<version number="1.2"/>
		<generation date="2004-06-04"/>
		<language type="en"/>
		<territory type="AS"/>
	</identity>
	<numbers>
		<currencyFormats>
			<currencyFormatLength>
				<currencyFormat>
					<pattern>¤#,##0.00;(¤#,##0.00)</pattern>
				</currencyFormat>
			</currencyFormatLength>
		</currencyFormats>
	</numbers>
</ldml>

L.2 Ordering

Element names are ordered by the Element Order Table
Attribute names are ordered by the Attribute Order Table
Attribute value comparison is a bit more complicated, and may depend on the attribute and type. Compare two values by using the following steps:
1. If two values are in the Value Order Table, compare according to the order in the table. Otherwise if just one is, it goes first.
2. If two values are numeric [0-9], compare numerically (2 < 12). Otherwise if just one is numeric, it goes first.
3. Otherwise values are ordered alphabetically
An attribute-value pair is ordered first by attribute name, and then if the attribute names are identical, by the value.
An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs (sorted by #4). For the latter, compare the first pair in each (in sorted order by attribute pair). If not identical, go to the second pair, and so on.
Any future additions to the DTD must be structured so as to allow compatibility with this ordering.
See also Appendix K: Valid Attribute Values
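The comparison of attribute values and attribute-value pairs can be sketched as sort keys. This is an illustration under assumptions: the value order and attribute order are passed in as sequences (in practice they would come from the supplemental metadata), "alphabetically" is approximated by Python's default code point comparison of strings, and the function names are hypothetical.

```python
import re

def value_key(value, value_order=()):
    """Sort key implementing the three comparison steps for attribute values."""
    if value in value_order:
        # Step 1: values in the Value Order Table go first, in table order.
        return (0, value_order.index(value), "")
    if re.fullmatch(r"[0-9]+", value):
        # Step 2: numeric values come next and compare numerically (2 < 12).
        return (1, int(value), "")
    # Step 3: everything else is ordered alphabetically.
    return (2, 0, value)

def pair_key(pair, attribute_order, value_order=()):
    """Sort key for an attribute-value pair: by attribute name, then by value."""
    name, value = pair
    return (attribute_order.index(name), value_key(value, value_order))
```

For example, with a hypothetical value order of ("wide", "abbreviated"), sorting the values b, 12, 2, a, wide by value_key yields wide, 2, 12, a, b.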

L.3 Comments

Comments are of the form <!-- ... -->.
They are logically attached to a node. There are 4 kinds:
1. Inline comments always appear after a leaf node, at the end of the same line. These are a single line.
2. Preblock comments always precede the attachment node, and are indented on the same level.
3. Postblock comments always follow the attachment node, and are indented on the same level.
4. Final comment, after </ldml>
Multiline comments (except the final comment) have each line after the first indented to one deeper level.

Examples:

<eraAbbr>
	<era type="0">BC</era> <!-- might add alternate BDE in the future -->
...
<timeZoneNames>
	<!-- Note: zones that do not use daylight time need further work --> 
	<zone type="America/Los_Angeles">
	...
	<!-- Note: the following is known to be sparse,
		and needs to be improved in the future -->
	<zone type="Asia/Jerusalem">

L.4 Canonicalization

The process of canonicalization is fairly straightforward, except for comments. Inline comments will have any linebreaks replaced by a space. There may be cases where the attachment node is not permitted, such as the following.

		</dayWidth>
		<!-- some comment -->
	</dayContext>
</days>

In those cases, the comment will be made into a block comment on the last previous leaf node, if it is at that level or deeper. (If there is one already, it will be appended, with a line-break between.) If there is no place to attach the node (for example, as a result of processing that removes the attachment node), the comment and its node's [XPath] will be appended to the final comment in the document.

Multiline comments will have leading tabs stripped, so any indentation should be done with spaces.

L.5 Element Order Table

The order of elements is given by the elementOrder table in the supplemental metadata.

L.6 Attribute Order Table

The order of attributes is given by the attributeOrder table in the supplemental metadata.

L.7 Value Order Table

The order of attribute values is given by the order of the values in the attributeValues elements that have the attribute order="given". Numeric values are sorted in numeric order, while tzids are ordered by country, then longitude, then latitude.

L.8 Defaulted Values Table

The defaulted attributes are given by the suppress table in the supplemental metadata. There is one special value, _q, that is used on serial elements internally to preserve ordering.

Appendix M: Coverage Levels

The following describes the coverage levels used for the current version of CLDR. This list will change between releases of CLDR. Each level adds to what is in the lower level.

Level	Name	Description
0	undetermined	Does not meet any of the following levels.
10	core	See http://cldr.unicode.org/index/cldr-spec/minimaldata
20	posix	what is required for POSIX generation; for example, only one country name, only one currency symbol, and so on.
30	minimal	names for the languages, scripts, and territories associated with the language, numbering systems used in those languages, date and number formats, plus a few key values such as the values in Section 3.1 Unknown or Invalid Identifiers.
40	basic	data for most prominent languages and countries.
60	moderate	delimiters, ellipses formats, core currency symbols
80	modern	other fields in normal modern use, including all country names, and currencies in use.
100	comprehensive	complete localizations (or valid inheritance) for every possible field
101	optional	fields that are not typically in use, or are deprecated.

Levels 40 through 80 are based on the definitions and specifications listed in M.1-M.4. However, these principles have been refined, and do not completely reflect the data that is actually used for coverage determination, which is in the coverageLevels.xml file. For a view of the trunk version of this file, see coverageLevels.xml.

<!ELEMENT coverageLevels ( coverageVariable*, coverageLevel* ) >
<!ELEMENT coverageLevel EMPTY >
<!ATTLIST coverageLevel inLanguage CDATA #IMPLIED >
<!ATTLIST coverageLevel inScript CDATA #IMPLIED >
<!ATTLIST coverageLevel inTerritory CDATA #IMPLIED >
<!ATTLIST coverageLevel value CDATA #REQUIRED >
<!ATTLIST coverageLevel match CDATA #REQUIRED >

For example, here is a coverageLevel line.

<coverageLevel
    value="30"
    inLanguage="(de|fi)" 
    match="localeDisplayNames/types/type[@type='phonebook'][@key='collation']"/>

The coverageLevel elements are read in order, and the first match results in a coverage level value. The element matches based on the inLanguage, inScript, inTerritory, and match attribute values, which are regular expressions. In the example above, a match occurs if the language is de or fi, and if the path is a locale display name for collation=phonebook.

The match attribute value logically has "//ldml/" prefixed before it is applied. In addition, the "[@" is automatically quoted. Otherwise standard Perl/Java style regular expression syntax is used.
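
The lookup described above can be sketched in Python. This is a minimal illustration, not the CLDR implementation: `to_regex`, `coverage_level`, and the dictionary representation of a coverageLevel element are hypothetical, and whole-string matching (`re.fullmatch`) is an assumption, since the text does not say whether partial matches count.

```python
import re

def to_regex(match):
    # Prefix "//ldml/" and quote every "[@" so that it is matched literally.
    return "//ldml/" + match.replace("[@", r"\[@")

def coverage_level(rules, path, language="", script="", territory=""):
    """rules: list of dicts with 'value', 'match', and optional in* keys."""
    for rule in rules:  # read in order; the first match wins
        conditions = (
            (rule.get("inLanguage"), language),
            (rule.get("inScript"), script),
            (rule.get("inTerritory"), territory),
        )
        # An absent (#IMPLIED) attribute places no restriction on the locale.
        if any(p is not None and not re.fullmatch(p, s) for p, s in conditions):
            continue
        if re.fullmatch(to_regex(rule["match"]), path):
            return int(rule["value"])
    return None

RULES = [{
    "value": "30",
    "inLanguage": "(de|fi)",
    "match": "localeDisplayNames/types/type[@type='phonebook'][@key='collation']",
}]
PATH = "//ldml/localeDisplayNames/types/type[@type='phonebook'][@key='collation']"
```

With these sample values, `coverage_level(RULES, PATH, language="de")` yields 30, while the same path with `language="en"` yields no level.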

<!ELEMENT coverageVariable EMPTY >
<!ATTLIST coverageVariable key CDATA #REQUIRED >
<!ATTLIST coverageVariable value CDATA #REQUIRED >

The coverageVariable element allows us to create variables for regular expressions that are used frequently in the coverageLevel definitions above. Each coverage variable must contain a key / value pair of attributes; the value can then be substituted into a coverageLevel definition wherever the key appears.

For example, here is a coverageLevel line using coverageVariable substitution.

<coverageVariable key="%dayTypes" value="(sun|mon|tue|wed|thu|fri|sat)"/>

<coverageVariable key="%wideAbbr" value="(wide|abbreviated)"/>

<coverageLevel value="20" match="dates/calendars/calendar[@type='gregorian']/days/dayContext[@type='format']/dayWidth[@type='%wideAbbr']/day[@type='%dayTypes']"/>

In this example, the coverage variables %dayTypes and %wideAbbr are used to substitute their respective values into the match expression. This allows us to reuse the same variable for other coverageLevel matches that use the same regular expression fragment.
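
A minimal Python sketch of the substitution follows. The `expand` helper is hypothetical, and plain textual replacement of each key is an assumption about how the substitution is carried out; the "//ldml/" prefix and "[@" quoting are applied as described for the match attribute.

```python
import re

def expand(match, variables):
    # Substitute longer keys first so that a key which is a prefix of
    # another key cannot clobber it.
    for key in sorted(variables, key=len, reverse=True):
        match = match.replace(key, variables[key])
    return match

VARIABLES = {
    "%dayTypes": "(sun|mon|tue|wed|thu|fri|sat)",
    "%wideAbbr": "(wide|abbreviated)",
}
MATCH = ("dates/calendars/calendar[@type='gregorian']/days/"
         "dayContext[@type='format']/dayWidth[@type='%wideAbbr']/day[@type='%dayTypes']")

# Expand the variables, then prefix "//ldml/" and quote "[@".
pattern = re.compile("//ldml/" + expand(MATCH, VARIABLES).replace("[@", r"\[@"))
```

The compiled pattern then matches the path for any of the seven days in either the wide or abbreviated width, and nothing else.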

M.1 Definitions

Target-Language is the language under consideration.
Target-Territories is the list of territories found by looking up Target-Language in the <languageData> elements in supplementalData.xml
Language-List is Target-Language, plus
- basic: Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, Unknown (de, en, es, fr, it, ja, pt, ru, zh, und)
- moderate: basic + Arabic, Hindi, Korean, Indonesian, Dutch, Bengali, Turkish, Thai, Polish (ar, hi, ko, in, nl, bn, tr, th, pl). If an EU language, add the remaining official EU languages, currently: Danish, Greek, Finnish, Swedish, Czech, Estonian, Latvian, Lithuanian, Hungarian, Maltese, Slovak, Slovene (da, el, fi, sv, cs, et, lv, lt, hu, mt, sk, sl)
- modern: all languages that are official or major commercial languages of modern territories
Target-Scripts is the list of scripts in which Target-Language can be customarily written (found by looking up Target-Language in the <languageData> elements in supplementalData.xml), plus Unknown (Zzzz).
Script-List is the Target-Scripts plus the major scripts used for multiple languages
- Latin, Simplified Chinese, Traditional Chinese, Cyrillic, Arabic (Latn, Hans, Hant, Cyrl, Arab)
Territory-List is the list of territories formed by taking the Target-Territories and adding:
- basic: Brazil, China, France, Germany, India, Italy, Japan, Russia, United Kingdom, United States, Unknown (BR, CN, DE, GB, FR, IN, IT, JP, RU, US, ZZ)
- moderate: basic + Spain, Canada, Korea, Mexico, Australia, Netherlands, Switzerland, Belgium, Sweden, Turkey, Austria, Indonesia, Saudi Arabia, Norway, Denmark, Poland, South Africa, Greece, Finland, Ireland, Portugal, Thailand, Hong Kong SAR China, Taiwan (ES, BE, SE, TR, AT, ID, SA, NO, DK, PL, ZA, GR, FI, IE, PT, TH, HK, TW). If an EU language, add the remaining member EU countries: Luxembourg, Czech Republic, Hungary, Estonia, Lithuania, Latvia, Slovenia, Slovakia, Malta (LU, CZ, HU, EE, LT, LV, SI, SK, MT).
- modern: all current ISO 3166 territories, plus the UN M.49 [UNM49] regions in supplementalData.xml
Currency-List is the list of current official currencies used in any of the territories in Territory-List, found by looking at the region elements in supplementalData.xml, plus Unknown (XXX).
Calendar-List is the set of calendars in customary use in any of Target-Territories, plus Gregorian.
Number-System-List is the set of number systems in customary use in the language.

M.2 Data Requirements

The required data to qualify for the level is then the following.

localeDisplayNames
1. languages: localized names for all languages in Language-List.
2. scripts: localized names for all scripts in Script-List.
3. territories: localized names for all territories in Territory-List.
4. variants, keys, types: localized names for any in use in Target-Territories; for example, a translation for PHONEBOOK in a German locale.
dates: all of the following for each calendar in Calendar-List.
1. calendars: localized names
2. month names, day names, era names, and quarter names
  - context=format and width=narrow, wide, & abbreviated
  - plus context=standAlone and width=narrow, wide, & abbreviated, if the grammatical forms of these are different than for context=format.
3. week: minDays, firstDay, weekendStart, weekendEnd
  - if some of these vary in territories in Territory-List, include territory locales for those that do.
4. am, pm, eraNames, eraAbbr
5. dateFormat, timeFormat: full, long, medium, short
6. intervalFormatFallback
numbers: symbols, decimalFormats, scientificFormats, percentFormats, currencyFormats for each number system in Number-System-List.
currencies: displayNames and symbol for all currencies in Currency-List, for all plural forms
transforms: (moderate and above) transliteration between Latin and each other script in Target-Scripts.

M.3 Default Values

Items should only be included if they are not the same as the default, which is:

what is in root, if there is something defined there.
for timezone IDs: the name computed according to Appendix J: Time Zone Display Names
for collation sequence, the UCA DUCET (Default Unicode Collation Element Table), as modified by CLDR.
- however, in that case the locale must be added to the validSubLocale list in collation/root.xml.
for currency symbol, language, territory, script names, variants, keys, types, the internal code identifiers, for example,
- currencies: EUR, USD, JPY, ...
- languages: en, ja, ru, ...
- territories: GB, JP, FR, ...
- scripts: Latn, Thai, ...
- variants: PHONEBOOK,...

Appendix N: Transform Rules

<!ELEMENT transforms ( transform*) >
<!ELEMENT transform ((comment | tRule)*) >
<!ATTLIST transform source CDATA #IMPLIED >
<!ATTLIST transform target CDATA #IMPLIED >
<!ATTLIST transform variant CDATA #IMPLIED >
<!ATTLIST transform direction ( forward | backward | both ) "both" >
<!ATTLIST transform visibility ( internal | external ) "external" >

<!ELEMENT comment (#PCDATA) >
<!ELEMENT tRule (#PCDATA) >

The transform rules are similar to regular-expression substitutions, but adapted to the specific domain of text transformations. The rules and comments in this discussion will be intermixed, with # marking the comments. In the XML format these are in separate elements: comment and tRule. The simplest rule is a conversion rule, which replaces one string of characters with another. The conversion rule takes the following form:

xy → z ;

This converts any substring "xy" into "z". Rules are executed in order; consider the following rules:

sch → sh ;
ss → z ;

These conversion rules transform "bass school" into "baz shool". The transform walks through the string from start to finish. Thus, given the rules above, "bassch" will convert to "bazch", because the "ss" rule matches earlier in the string than the "sch" rule (later, we'll see a way to override this behavior). If two rules can both apply at a given point in the string, then the transform applies the first rule in the list.
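
The left-to-right walk can be illustrated with a toy Python function. This is a deliberately simplified sketch: `apply_rules` is a hypothetical helper that handles only plain string-to-string conversion rules with non-empty sources, without contexts, variables, or quoting.

```python
def apply_rules(text, rules):
    """rules: ordered list of (source, replacement) pairs."""
    out = []
    i = 0
    while i < len(text):
        # At each position, the first rule in the list that matches wins.
        for source, replacement in rules:
            if text.startswith(source, i):
                out.append(replacement)
                i += len(source)
                break
        else:
            # No rule matches here: copy the character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)

RULES = [("sch", "sh"), ("ss", "z")]
```

Applied to "bassch", the walk reaches the "ss" at index 2 before it could ever see "sch" at index 3, which is why the second rule takes effect there.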

All of the ASCII characters except numbers and letters are reserved for use in the rule syntax, as are the characters →, ←, ↔. Normally, these characters do not need to be converted. However, to convert them use either a pair of single quotes or a backslash. The pair of single quotes can be used to surround a whole string of text. The backslash affects only the character immediately after it. For example, to convert from a U+2190 ( ← ) LEFTWARDS ARROW to the string "arrow sign" (with a space), use one of the following rules:

\←   →  arrow\ sign ;
'←'   →   'arrow sign' ;
'←'   →   arrow' 'sign ;

Spaces may be inserted anywhere without any effect on the rules. Use extra space to separate items out for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put some spaces around it to separate it from the surrounding text. The following is an example:

\u0345 → i ; # an iota-subscript diacritic (U+0345) turns into an i.

For a real space in the rules, place quotes around it. For a real backslash, either double it \\, or quote it '\'. For a real single quote, double it '', or place a backslash before it \'.

Any text that starts with a hash mark and concludes a line is a comment. Comments help document how the rules work. The following shows a comment in a rule:

x → ks ; # change every x into ks

The "\u" notation can be used instead of any letter. For instance, instead of using the Greek π, one could write:

\u03C0 → p ;

One can also define and use variables, such as:

$pi = \u03C0 ;
$pi → p ;

N.1 Dual Rules

Rules can also specify what happens when an inverse transform is formed. To do this, we reverse the direction of the "→" sign. Thus the above example becomes:

$pi ← p ;

With the inverse transform, "p" will convert to the Greek π. These two directions can be combined together into a dual conversion rule by using the "↔" operator, yielding:

$pi ↔ p ;

N.2 Context

Context can be used to have the results of a transformation be different depending on the characters before or after. The following means "Remove hyphens, but only when they follow lower case letters":

[:lowercase letter:] { '-' → '' ;

The context itself ([:lowercase letter:]) is unaffected by the replacement; only the text between the curly braces is changed.
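
The effect of a before-context can be approximated in Python. This is only a sketch: `remove_hyphens_after_lowercase` is a hypothetical helper, `[:lowercase letter:]` is taken to mean General_Category=Ll, and testing the context against the already-produced output is an assumption about how the transform sees preceding text.

```python
import unicodedata

def remove_hyphens_after_lowercase(text):
    out = []
    for ch in text:
        # The before-context is only tested, never rewritten.
        if ch == "-" and out and unicodedata.category(out[-1]) == "Ll":
            continue  # replace the hyphen with the empty string
        out.append(ch)
    return "".join(out)
```

A hyphen after an uppercase letter, or at the very start of the string, is left alone because the context does not match.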

N.3 Revisiting

If the resulting text contains a vertical bar "|", then that means that processing will proceed from that point and that the transform will revisit part of the resulting text. Thus the | marks a "cursor" position. For example, if we have the following, then the string "xa" will convert to "yw".

x → y | z ;
z a → w;

First, "xa" is converted to "yza". Then the processing will continue from after the character "y", pick up the "za", and convert it. Had we not had the "|", the result would have been simply "yza". The '@' character can be used as filler character to place the revisiting point off the start or end of the string. Thus the following causes x to be replaced, and the cursor to be backed up by two characters.

x → |@@y;
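
The cursor behavior can be sketched with a toy Python function. This is an illustration only: `apply_with_cursor` and its triple representation of a rule are hypothetical, the '@' filler is not modeled, and a rule whose completed part is empty could loop forever in this simplified version.

```python
def apply_with_cursor(text, rules):
    """rules: list of (source, completed_result, result_to_revisit) triples."""
    i = 0
    while i < len(text):
        for source, completed, revisit in rules:
            if text.startswith(source, i):
                # Write both parts of the result back into the text...
                text = text[:i] + completed + revisit + text[i + len(source):]
                # ...but advance only past the completed part: the cursor "|"
                # sits just before the part that will be revisited.
                i += len(completed)
                break
        else:
            i += 1
    return text

WITH_CURSOR = [("x", "y", "z"), ("za", "w", "")]     # x → y | z ;  z a → w ;
WITHOUT_CURSOR = [("x", "yz", ""), ("za", "w", "")]  # x → y z ;    z a → w ;
```

With the cursor, the "z" produced by the first rule is seen again and combines with the following "a"; without it, processing resumes after the "z" and the second rule never fires.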

N.4 Example

The following shows how these features are combined together in the Transliterator "Any-Publishing". This transform converts the ASCII typewriter conventions into text more suitable for desktop publishing (in English). It turns straight quotation marks or UNIX style quotation marks into curly quotation marks, fixes multiple spaces, and converts double-hyphens into a dash.

# Variables

$single = \' ;
$space = ' ' ;
$double = \" ;
$back = \` ;
$tab = '\u0009' ;

# the following is for spaces, line ends, (, [, {, ...
$makeRight = [[:separator:][:start punctuation:][:initial punctuation:]] ;

# fix UNIX quotes

$back $back → “ ; # generate right d.q.m. (double quotation mark)
$back → ‘ ;

# fix typewriter quotes, by context

$makeRight { $double ↔ “ ; # convert a double to right d.q.m. after certain chars
^ { $double → “ ; # convert a double at the start of the line.
$double ↔ ” ; # otherwise convert to a left q.m.

$makeRight {$single} ↔ ‘ ; # do the same for s.q.m.s
^ {$single} → ‘ ;
$single ↔ ’;

# fix multiple spaces and hyphens

$space {$space} → ; # collapse multiple spaces
'--' ↔ — ; # convert fake dash into real one

N.5 Rule Syntax

The following describes the full format of the list of rules used to create a transform. Each rule in the list is terminated by a semicolon. The list consists of the following:

an optional filter rule
zero or more transform rules
zero or more variable-definition rules
zero or more conversion rules
an optional inverse filter rule

The filter rule, if present, must appear at the beginning of the list, before any of the other rules. The inverse filter rule, if present, must appear at the end of the list, after all of the other rules. The other rules may occur in any order and be freely intermixed.

The rule list can also generate the inverse of the transform. In that case, the inverse of each of the rules is used, as described below.

N.6 Transform Rules

Each transform rule consists of two colons followed by a transform name, which is of the form source-target. For example:

:: NFD ;
:: und_Latn-und_Greek ;
:: Latin-Greek; # alternate form

If either the source or target is 'und', it can be omitted, thus 'und_NFC' is equivalent to 'NFC'. For compatibility, the English names for scripts can be used instead of the und_Latn locale name, and "Any" can be used instead of "und". Case is not significant.

The following transforms are defined not by rules, but by the operations in the Unicode Standard, and may be used in building any other transform:

Any-NFC, Any-NFD, Any-NFKD, Any-NFKC - the normalization forms defined by [UAX15].

Any-Lower, Any-Upper, Any-Title - full case transformations, defined by [Unicode] Chapter 3.

In addition, the following special cases are defined:

Any-Null - has no effect; that is, each character is left alone.
Any-Remove - maps each character to the empty string; that is, removes each character.

The inverse of a transform rule uses parentheses to indicate what should be done when the inverse transform is used. For example:

:: lower () ; # only executed for the normal
:: (lower) ; # only executed for the inverse
:: lower ; # executed for both the normal and the inverse

N.7 Variable Definition Rules

Each variable definition is of the following form:

$variableName = contents ;

The variable name can contain letters and digits, but must start with a letter. More precisely, the variable names use Unicode identifiers as defined by [UAX31]. The identifier properties allow for the use of foreign letters and numbers.

The contents of a variable definition is any sequence of Unicode sets and characters. For example:

$mac = M [aA] [cC] ;

Variables are only replaced within other variable definition rules and within conversion rules. They have no effect on transform rules.

N.8 Filter Rules

A filter rule consists of two colons followed by a UnicodeSet. This filter is global in that only the characters matching the filter will be affected by any transform rules or conversion rules. The inverse filter rule consists of two colons followed by a UnicodeSet in parentheses. This filter is also global for the inverse transform.

For example, the Hiragana-Latin transform can be implemented by "pivoting" through the Katakana converter, as follows:

:: [:^Katakana:] ; # do not touch any katakana that was in the text!
:: Hiragana-Katakana;
:: Katakana-Latin;
:: ([:^Katakana:]) ; # do not touch any katakana that was in the text
                     # for the inverse either!

The filters keep the transform from mistakenly converting any of the "pivot" characters. Note that this is a case where a rule list contains no conversion rules at all, just transform rules and filters.

N.9 Conversion Rules

Conversion rules can be forward, backward, or double. The complete conversion rule syntax is described below:

N.9.1 Forward

A forward conversion rule is of the following form:

before_context { text_to_replace } after_context → completed_result | result_to_revisit ;

If there is no before_context, then the "{" can be omitted. If there is no after_context, then the "}" can be omitted. If there is no result_to_revisit, then the "|" can be omitted. A forward conversion rule is only executed for the normal transform and is ignored when generating the inverse transform.

N.9.2 Backward

A backward conversion rule is of the following form:

completed_result | result_to_revisit ← before_context { text_to_replace } after_context ;

The same omission rules apply as in the case of forward conversion rules. A backward conversion rule is only executed for the inverse transform and is ignored when generating the normal transform.

N.9.3 Dual

A dual conversion rule combines a forward conversion rule and a backward conversion rule into one, as discussed above. It is of the form:

a { b | c } d ↔ e { f | g } h ;

When generating the normal transform and the inverse, the revisit mark "|" and the before and after contexts are ignored on the sides where they do not belong. Thus, the above is exactly equivalent to the sequence of the following two rules:

a { b c } d  →  f | g  ;
b | c  ←  e { f g } h ;

N.10 Intermixing Transform Rules and Conversion Rules

Transform rules and conversion rules may be freely intermixed. Inserting a transform rule into the middle of a set of conversion rules has an important side effect.

Normally, conversion rules are considered together as a group. The only time their order in the rule set is important is when more than one rule matches at the same point in the string. In that case, the one that occurs earlier in the rule set wins. In all other situations, when multiple rules match overlapping parts of the string, the one that matches earlier wins.

Transform rules apply to the whole string. If you have several transform rules in a row, the first one is applied to the whole string, then the second one is applied to the whole string, and so on. To reconcile this behavior with the behavior of conversion rules, transform rules have the side effect of breaking a surrounding set of conversion rules into two groups: First all of the conversion rules before the transform rule are applied as a group to the whole string in the usual way, then the transform rule is applied to the whole string, and then the conversion rules after the transform rule are applied as a group to the whole string. For example, consider the following rules:

abc → xyz;
xyz → def;
::Upper;

If you apply these rules to “abcxyz”, you get “XYZDEF”. If you move the “::Upper;” to the middle of the rule set and change the cases accordingly, then applying this to “abcxyz” produces “DEFDEF”.

abc → xyz;
::Upper;
XYZ → DEF;

This is because “::Upper;” causes the transliterator to reset to the beginning of the string. The first rule turns the string into “xyzxyz”, the second rule upper cases the whole thing to “XYZXYZ”, and the third rule turns this into “DEFDEF”.
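
The splitting into passes can be sketched in Python. This is a toy model under stated assumptions: `apply_group` and `run_passes` are hypothetical helpers, conversion rules are plain string pairs, and a transform rule such as "::Upper;" is represented by an ordinary function applied to the whole string.

```python
def apply_group(text, rules):
    """Apply one group of (source, replacement) conversion rules in a single walk."""
    out = []
    i = 0
    while i < len(text):
        for source, replacement in rules:
            if text.startswith(source, i):
                out.append(replacement)
                i += len(source)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def run_passes(text, passes):
    """passes: a list whose items are either a rule group (list) or a whole-string function."""
    for step in passes:
        # Each step starts again from the beginning of the string.
        text = step(text) if callable(step) else apply_group(text, step)
    return text

UPPER_LAST = [[("abc", "xyz"), ("xyz", "def")], str.upper]
UPPER_MIDDLE = [[("abc", "xyz")], str.upper, [("XYZ", "DEF")]]
```

Here `UPPER_LAST` models the first rule set and `UPPER_MIDDLE` the second, where the transform rule splits the conversion rules into two groups.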

This can be useful when a transform naturally occurs in multiple “passes.” Consider this rule set:

[:Separator:]* → ' ';
'high school' → 'H.S.';
'middle school' → 'M.S.';
'elementary school' → 'E.S.';

If you apply these rules to “high school”, you get “H.S.”, but if you apply them to “high  school” (with two spaces), you just get “high school” (with one space). To have “high  school” (with two spaces) turn into “H.S.”, you'd either have to have the first rule back up some arbitrary distance (far enough to see “elementary”, if you want all the rules to work), or you'd have to include the whole left-hand side of the first rule in the other rules, which can make them hard to read and maintain:

$space = [:Separator:]*;
high $space school → 'H.S.';
middle $space school → 'M.S.';
elementary $space school → 'E.S.';

Instead, you can simply insert “::Null;” in order to get things to work right:

[:Separator:]* → ' ';
::Null;
'high school' → 'H.S.';
'middle school' → 'M.S.';
'elementary school' → 'E.S.';

The “::Null;” has no effect of its own (the null transform, by definition, does not do anything), but it splits the other rules into two “passes”: The first rule is applied to the whole string, normalizing all runs of white space into single spaces, and then we start over at the beginning of the string to look for the phrases. “high    school” (with four spaces) gets correctly converted to “H.S.”.

This can also sometimes be useful with rules that have overlapping domains. Consider this rule set from before:

sch → sh ;
ss → z ;

Applying these rules to “bassch” results in “bazch” because “ss” matches earlier in the string than “sch”. If you really wanted “bassh” (that is, if you wanted the first rule to win even when the second rule matches earlier in the string), you'd either have to add another rule for this special case...

sch → sh ;
ssch → ssh;
ss → z ;

...or you could use a transform rule to apply the conversions in two passes:

sch → sh ;
::Null;
ss → z ;

N.11 Inverse Summary

The following table shows how the same rule list generates two different transforms, where the inverse is restated in terms of forward rules (this is a contrived example, simply to show the reordering):

Original Rules	Forward	Inverse
`:: [:Uppercase Letter:] ; :: latin-greek ; :: greek-japanese ; x ↔ y ; z → w ; r ← m ; :: upper; a → b ; c ↔ d ; :: any-publishing ; :: ([:Number:]) ;`	`:: [:Uppercase Letter:] ; :: latin-greek ; :: greek-japanese ; x → y ; z → w ; :: upper ; a → b ; c → d ; :: any-publishing ;`	`:: [:Number:] ; :: publishing-any ; d → c ; :: lower ; y → x ; m → r ; :: japanese-greek ; :: greek-latin ;`

Note how the irrelevant rules (the inverse filter rule and the rules containing ←) are omitted (ignored, actually) in the forward direction, and notice how things are reversed: the transform rules are inverted and happen in the opposite order, and the groups of conversion rules are also executed in the opposite relative order (although the rules within each group are executed in the same order).
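
The reordering can be reproduced with a short Python sketch. Everything here is illustrative: the tuple representation of a parsed rule, the helper names, and the `SPECIAL_INVERSES` map (which records only that "upper" inverts to "lower", as the table shows) are assumptions rather than part of the specification.

```python
# Hypothetical parsed form: ("filter", set), ("inverse_filter", set),
# ("transform", name), or ("conv", left, arrow, right).
RULES = [
    ("filter", "[:Uppercase Letter:]"),
    ("transform", "latin-greek"),
    ("transform", "greek-japanese"),
    ("conv", "x", "↔", "y"),
    ("conv", "z", "→", "w"),
    ("conv", "r", "←", "m"),
    ("transform", "upper"),
    ("conv", "a", "→", "b"),
    ("conv", "c", "↔", "d"),
    ("transform", "any-publishing"),
    ("inverse_filter", "[:Number:]"),
]

# Assumption taken from the table: "upper" inverts to "lower".
SPECIAL_INVERSES = {"upper": "lower", "lower": "upper"}

def invert_name(name):
    if "-" in name:
        source, target = name.split("-", 1)
        return f"{target}-{source}"
    return SPECIAL_INVERSES.get(name, name)

def forward(rules):
    out = []
    for rule in rules:
        if rule[0] in ("filter", "transform"):
            out.append(f":: {rule[1]} ;")
        elif rule[0] == "conv" and rule[2] in ("→", "↔"):
            out.append(f"{rule[1]} → {rule[3]} ;")
    return out

def inverse(rules):
    out = [f":: {r[1]} ;" for r in rules if r[0] == "inverse_filter"]
    # Split into items: a transform rule, or a group of adjacent conversion rules.
    items = []
    for rule in rules:
        if rule[0] == "transform":
            items.append(rule)
        elif rule[0] == "conv":
            if not items or not isinstance(items[-1], list):
                items.append([])
            items[-1].append(rule)
    # Items run in the opposite order; rules inside a group keep their order.
    for item in reversed(items):
        if isinstance(item, list):
            out += [f"{r[3]} → {r[1]} ;" for r in item if r[2] in ("←", "↔")]
        else:
            out.append(f":: {invert_name(item[1])} ;")
    return out
```

Calling `forward(RULES)` and `inverse(RULES)` yields the rule lists in the Forward and Inverse columns of the table, one rule per list entry.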

Appendix O: Lenient Parsing

O.1 Motivation

User input is frequently messy. Attempting to parse it by matching it exactly against a pattern is likely to be unsuccessful, even when the meaning of the input is clear to a human being. For example, for a date pattern of "MM/dd/yy", the input "June 1, 2006" will fail.

The goal of lenient parsing is to accept user input whenever it is possible to decipher what the user intended. Doing so requires using patterns as data to guide the parsing process, rather than an exact template that must be matched. This informative section suggests some heuristics that may be useful for lenient parsing of dates, times, and numbers.

O.2 Loose Matching

Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:

Remove "." from currency symbols and other fields used for matching, and also from the input string unless:
- "." is in the decimal set, and
- its position in the input string is immediately before a decimal digit
Ignore all format characters: in particular, ignore the RLM and LRM used to control BIDI formatting.
Ignore all characters in [:Zs:] unless they occur between letters. (In the heuristics below, even those between letters are ignored except to delimit fields)
Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
Use the data in the <character-fallback> element to map equivalent characters (for example, curly to straight apostrophes). Other apostrophe-like characters should also be treated as equivalent, especially if the character actually used in a format may be unavailable on some keyboards. For example:
- U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as U+2018 LEFT SINGLE QUOTATION MARK (‘).
- U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
- U+05F3 HEBREW PUNCTUATION GERESH (‎׳) might be typed instead as U+0027 APOSTROPHE.
Apply mappings particular to the domain (i.e., for dates or for numbers, discussed in more detail below)
Apply case folding (possibly including language-specific mappings such as Turkish i)
Normalize to NFKC; thus no-break space will map to space; half-width katakana will map to full-width.

Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching.
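
A simplified version of this transform can be sketched in Python. It is an approximation under stated assumptions: `loose` is a hypothetical helper, the `[:Dash:]` property is approximated by General_Category=Pd, all `[:Zs:]` characters are dropped (as the heuristics allow), and the <character-fallback> and domain-specific mappings are omitted.

```python
import unicodedata

def loose(text, decimal_set="."):
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Remove "." unless it is in the decimal set and is immediately
        # before a decimal digit in the original input.
        if ch == "." and not (ch in decimal_set and nxt.isdecimal()):
            continue
        category = unicodedata.category(ch)
        if category in ("Cf", "Zs"):  # format characters (RLM, LRM, ...) and spaces
            continue
        if category == "Pd":          # approximation of [:Dash:]
            ch = "-"
        out.append(ch)
    # Case folding, then NFKC normalization.
    return unicodedata.normalize("NFKC", "".join(out).casefold())
```

Note that the "." test looks at the original input rather than the partially transformed text, so the period in "NA f. 1" is dropped because a space, not a digit, follows it.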

O.3 Parsing Numbers

The following elements are relevant to determining the value of a parsed number:

A possible prefix or suffix, indicating sign
A possible currency symbol or code
Decimal digits
A possible decimal separator
A possible exponent
A possible percent or per mille character

Other characters should either be ignored, or indicate the end of input, depending on the application. The key point is to disambiguate the sets of characters that might serve in more than one position, based on context. For example, a period might be either the decimal separator, or part of a currency symbol (for example, "NA f."). Similarly, an "E" could be an exponent indicator, or a currency symbol (the Swaziland Lilangeni uses "E" in the "en" locale). An apostrophe might be the decimal separator, or might be the grouping separator.

Here is a set of heuristic rules that may be helpful:

Any character with the decimal digit property is unambiguous and should be accepted.
Note: In some environments, applications may independently wish to restrict the decimal digit set to prevent security problems. See [UTR36].
The exponent character can only be interpreted as such if it occurs after at least one digit, and if it is followed by at least one digit, with only an optional sign in between. A regular expression may be helpful here.
For the sign, decimal separator, percent, and per mille, use a set of all possible characters that can serve those functions. For example, the decimal separator set could include all of [.,']. (The actual set of characters can be derived from the number symbols in the By-Type charts [ByType], which list all of the values in CLDR.) To disambiguate, the decimal separator for the locale must be removed from the "ignore" set, and the grouping separator for the locale must be removed from the decimal separator set. The same principle applies to all sets and symbols: any symbol must appear in at most one set.
Since there are a wide variety of currency symbols and codes, this should be tried before the less ambiguous elements. It may be helpful to develop a set of characters that can appear in a symbol or code, based on the currency symbols in the locale.
Otherwise, a character should be ignored unless it is in the "stop" set. This includes even characters that are meaningful for formatting, for example, the grouping separator.
If more than one sign, currency symbol, exponent, or percent/per mille occurs in the input, the first found should be used.
A currency symbol in the input should be interpreted as the longest match found in the set of possible currency symbols.
Especially in cases of ambiguity, the user's input should be echoed back, properly formatted according to the locale, before it is actually used for anything.
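
Two of these heuristics, the exponent test and the longest-match rule for currency symbols, can be sketched in Python. The helper names and the sample symbol list are hypothetical, and restricting digits to ASCII [0-9] is a simplification made for the example.

```python
import re

# An exponent character needs a digit before it and, after an optional
# sign, at least one digit after it.
EXPONENT = re.compile(r"(?<=[0-9])[eE]([+-]?[0-9]+)")

def find_exponent(text):
    """Return the exponent value, or None if no valid exponent is present."""
    m = EXPONENT.search(text)
    return int(m.group(1)) if m else None

def longest_currency(text, pos, symbols):
    """Return the longest symbol that matches at pos, or None."""
    matches = [s for s in symbols if s and text.startswith(s, pos)]
    return max(matches, key=len) if matches else None
```

Under this sketch an "E" that starts the input, as in "E100", is not treated as an exponent and remains available for interpretation as a currency symbol.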

O.4 Parsing Dates and Times

Lenient parsing of date and time strings is more complicated, due to the large number of possible fields and formats. The fields fall into two categories: numeric fields (hour, day of month, year, numeric month, and so on) and symbolic fields (era, quarter, month, weekday, AM/PM, time zone). In addition, the user may type in a date or time in a form that is significantly different from the normal format for the locale, and the application must use the locale information to figure out what the user meant. Input may well consist of nothing but a string of numbers with separators, for example, "09/05/02 09:57:33".

The input can be separated into tokens: numbers, symbols, and literal strings. Some care must be taken due to ambiguity, for example, in the Japanese locale the symbol for March is "3 月", which looks like a number followed by a literal. To avoid these problems, symbols should be checked first, and spaces should be ignored (except to delimit the tokens of the input string).

The meaning of symbol fields should be easy to determine; the problem is determining the meaning of the numeric fields. Disambiguation will likely be most successful if it is based on heuristics. Here are some rules that can help:

Always try the format string expected for the input text first. This is the only way to disambiguate 03/07 (March 2007, a credit card expiration date) from 03/07 (March 7, a birthday).
Attempt to match fields and literals against those in the format string, using loose matching of the tokens.
When matching symbols, try the narrow, abbreviated, and full-width forms, including standalone forms if they are unique. You may want to allow prefix matches too, or diacritic-insensitive, again, as long as they are unique. For example, for a month, accept 9, 09, S, Se, Sep, Sept, Sept., and so on.
- Note: While parsing of narrow date values (e.g. month or day names) may be required in order to obtain minimum information from a formatted date (for instance, the only month information may be in a narrow form), the results are not guaranteed; parsing of an ambiguous formatted date string may produce a result that differs from the date originally used to create the formatted string.
When a field or literal is encountered that is not compatible with the pattern:
- Synchronization is not necessary for symbolic fields, since they are self-identifying. Wait until a numeric field or literal is encountered before attempting to resynchronize.
- Ignore whether the input token is symbolic or numeric, if it is compatible with the current field in the pattern.
- Look forward or backward in the current format string for a literal that matches the one most recently encountered. See if you can resynchronize from that point. Use the value of the numeric field to resynchronize as well, if possible (for example, a number larger than the largest month cannot be a month).
- If that fails, use other format strings from the locale (including those in <availableFormats>) to try to match the previous or next symbol or literal (again, using a loose match).
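
The loose symbol matching described above (prefix matches, diacritic-insensitive, accepted only when unique) might look like the following sketch. The names fold and match_month and the English month list are assumptions for illustration; a real parser would take the names from the locale's data and would also try the narrow, abbreviated, and standalone forms.

```python
import unicodedata


def fold(text):
    """Case-fold, strip diacritics, and drop a trailing period for loose comparison."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.rstrip(".")


def match_month(token, month_names):
    """Return the 1-based month for `token`, or None if there is no unique match."""
    if token.isdecimal():
        value = int(token)
        return value if 1 <= value <= len(month_names) else None
    key = fold(token)
    if not key:
        return None
    hits = [i + 1 for i, name in enumerate(month_names) if fold(name).startswith(key)]
    return hits[0] if len(hits) == 1 else None


# Illustrative English names; not taken from CLDR data files.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

for token in ["9", "09", "S", "Sept.", "J"]:
    print(token, match_month(token, MONTHS))
```

"J" is rejected because it is a prefix of January, June, and July, so the match is not unique.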

Appendix P. Supplemental Metadata

The supplemental metadata contains information about the CLDR file itself, used to test validity and provide information for locale inheritance. A number of these elements are described in

Appendix I: Inheritance and Validity
Appendix K: Valid Attribute Values
Appendix L: Canonical Form
Appendix M: Coverage Levels

P.1 Supplemental Alias Information

<!ELEMENT alias ( languageAlias*, scriptAlias*, territoryAlias*, variantAlias*, zoneAlias* ) >

<!ELEMENT languageAlias EMPTY >
<!ATTLIST languageAlias type NMTOKEN #IMPLIED >
<!ATTLIST languageAlias replacement NMTOKEN #IMPLIED >

<!ELEMENT scriptAlias EMPTY >
<!ATTLIST scriptAlias type NMTOKEN #IMPLIED >
<!ATTLIST scriptAlias replacement NMTOKEN #IMPLIED >

<!ELEMENT territoryAlias EMPTY >
<!ATTLIST territoryAlias type NMTOKEN #IMPLIED >
<!ATTLIST territoryAlias replacement NMTOKENS #IMPLIED >

<!ELEMENT variantAlias EMPTY >
<!ATTLIST variantAlias type NMTOKEN #IMPLIED >
<!ATTLIST variantAlias replacement NMTOKEN #IMPLIED >

<!ELEMENT zoneAlias EMPTY >
<!ATTLIST zoneAlias type CDATA #IMPLIED >
<!ATTLIST zoneAlias replacement CDATA #IMPLIED >

This element provides information as to parts of locale IDs that should be substituted when accessing CLDR data. This logical substitution should be done to both the locale id, and to any lookup for display names of languages, territories, and so on. As with the display names, the language type and replacement may be any prefix of a valid locale id, such as "no_NO".

<alias>
  <languageAlias type="in" replacement="id"/>
  <languageAlias type="sh" replacement="sr"/>
  <languageAlias type="sh_YU" replacement="sr_Latn_YU"/>
...
  <territoryAlias type="BU" replacement="MM"/>
...
</alias>
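
As a rough sketch of how the language aliases in the example above could be applied to a locale ID, the following replaces an aliased prefix made of whole subtags. The function name apply_alias and the choice to prefer the longest matching prefix are assumptions made for illustration; territory, script, and variant aliases would need separate handling.

```python
# Language aliases taken from the example above.
ALIASES = {"in": "id", "sh": "sr", "sh_YU": "sr_Latn_YU"}


def apply_alias(locale_id, aliases):
    """Replace the longest aliased prefix of an underscore-separated locale ID."""
    parts = locale_id.split("_")
    for end in range(len(parts), 0, -1):
        prefix = "_".join(parts[:end])
        if prefix in aliases:
            return "_".join([aliases[prefix]] + parts[end:])
    return locale_id


print(apply_alias("sh_YU", ALIASES))
print(apply_alias("in_ID", ALIASES))
```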

P.2 Supplemental Deprecated Information

<!ELEMENT deprecated ( deprecatedItems* ) >
<!ATTLIST deprecated draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED > <!-- true and false are deprecated. -->

<!ELEMENT deprecatedItems EMPTY >
<!ATTLIST deprecatedItems draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED > <!-- true and false are deprecated. -->
<!ATTLIST deprecatedItems type ( standard | supplemental | ldml | supplementalData | ldmlBCP47 ) #IMPLIED > <!-- standard | supplemental are deprecated -->
<!ATTLIST deprecatedItems elements NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems attributes NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems values CDATA #IMPLIED >

The deprecated items can be used to indicate elements, attributes, and attribute values that are deprecated. This means that the items are valid, but that their usage is strongly discouraged. When the same deprecatedItems element contains combinations of elements, attributes, and values, then the "least significant" items are only deprecated if they occur with the "more significant" items. For example:

Deprecated Items
`<deprecatedItems elements="A B">`	A and B are deprecated
`<deprecatedItems attributes="C D">`	C and D are deprecated on all elements
`<deprecatedItems elements="A B" attributes="C D">`	C and D are deprecated, but only if they occur on elements A or B.
`<deprecatedItems elements="A B" attributes="C D" values="E">`	E is deprecated, but only if it is a value of C in an element A or B

In each case, multiple items are space-delimited.
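
The combination rule in the table above can be expressed as a small check. This is only a sketch under the assumption that each deprecatedItems element is represented as a dictionary of its space-delimited attributes; the function name is_deprecated is hypothetical.

```python
def is_deprecated(item, element, attribute=None, value=None):
    """Test one usage against one deprecatedItems entry.

    Every constraint listed in `item` must match the usage; constraints
    that are not listed act as wildcards.
    """
    if not item:
        return False
    usage = (("elements", element), ("attributes", attribute), ("values", value))
    for key, actual in usage:
        listed = item.get(key)
        if listed is not None and actual not in listed.split():
            return False
    return True


item = {"elements": "A B", "attributes": "C D"}
print(is_deprecated(item, "A", "C"))  # C on element A
print(is_deprecated(item, "X", "C"))  # C on some other element
```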

Where particular values are deprecated (such as territory codes like SU for Soviet Union), the names for such codes may be removed from the common/main translated data after some period of time. However, typically supplemental information for deprecated codes is retained, such as containment, likely subtags, older currency codes usage, etc. The English name may also be retained, for debugging purposes.

P.3 Default Content

<!ELEMENT defaultContent EMPTY >
<!ATTLIST defaultContent locales NMTOKENS #IMPLIED >

In CLDR, locales without territory information (or where needed, script information) provide data appropriate for what is called the default content locale. For example, the en locale contains data appropriate for en-US, while the zh locale contains content for zh-Hans-CN, and the zh-Hant locale contains content for zh-Hant-TW. The default content locales themselves thus inherit all of their contents, and are empty.

The choice of content is typically based on the largest literate population of the possible choices. Thus if an implementation only provides the base language (such as en), it will still get a complete and consistent set of data appropriate for a locale which is reasonably likely to be the one meant. Where other information is available, such as independent country information, that information can always be used to pick a different locale (such as en-CA for a website targeted at Canadian users).

If an implementation is to use a different default locale, then the data needs to be pivoted: all of the data from CLDR for the current default locale is pushed out to the locales that inherit from it, and then the new default content locale's data is moved into the base. There are tools in CLDR to perform this operation.

Appendix Q. Unicode BCP 47 Extension Data

The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [RFC6067] and extension 't' for transformed content [RFC6497]. The Unicode BCP 47 extension data defines the complete list of valid Unicode locale key, type and attribute subtags used by the 'u' extension and valid subtags used by 't' extension.

The 'u' extension data is stored in multiple XML files located under the common/bcp47 directory in CLDR. Each file contains the locale extension key/type values and their backward compatibility mappings appropriate for a particular domain. For example, common/bcp47/collation.xml contains key/type values for collation, including optional collation parameters and valid type values for each key.

The 't' extension data is stored in common/bcp47/transform.xml.

<!ELEMENT keyword ( key* )>

<!ELEMENT key ( type* ) >
<!ATTLIST key extension NMTOKEN #IMPLIED>
<!ATTLIST key name NMTOKEN #REQUIRED>
<!ATTLIST key alias NMTOKEN #IMPLIED>
<!ATTLIST key description CDATA #IMPLIED>
<!ATTLIST key since CDATA #IMPLIED>
<!ATTLIST key deprecated ( true | false ) "false">

<!ELEMENT type EMPTY>
<!ATTLIST type name NMTOKEN #REQUIRED>
<!ATTLIST type alias CDATA #IMPLIED>
<!ATTLIST type description CDATA #IMPLIED>
<!ATTLIST type since CDATA #IMPLIED>
<!ATTLIST type deprecated ( true | false ) "false">

<!ELEMENT attribute EMPTY>
<!ATTLIST attribute name NMTOKEN #REQUIRED>
<!ATTLIST attribute description CDATA #IMPLIED>
<!ATTLIST attribute since CDATA #IMPLIED>
<!ATTLIST attribute deprecated ( true | false ) "false">

The extension attribute in <key> element specifies the BCP 47 language tag extension type. The default value of the extension attribute is "u" (Unicode locale extension). The <type> element is only applicable to the enclosing <key>.

Q.1 Unicode Locale Extension Data Files

In the Unicode locale extension 'u' data files, the common attributes for the <key>, <type> and <attribute> elements are as follows:

Note: There are no values defined for the locale extension attribute in the current CLDR release.

name

The key or type name used by Unicode locale extension with 'u' extension syntax. When alias below is absent, this name can be also used with the old style "@key=type" syntax.

The type name "CODEPOINTS" is reserved for a variable representing Unicode code point(s). The syntax is:

	EBNF	ABNF
codepoints	= codepoint (sep codepoint)*	= codepoint *(sep codepoint)
codepoint	= [0-9 A-F a-f]{4,6}	= 4*6HEXDIG

In addition, no codepoint may exceed 10FFFF. For example, "00A0", "300b", "10D40C" and "00C1-00E1" are valid, but "A0", "U060C" and "110000" are not.
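
A validator for this syntax could be sketched as below. It assumes that the separator is the hyphen used in the examples, and the function name parse_codepoints is hypothetical.

```python
import re

CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")


def parse_codepoints(value):
    """Return the string denoted by a CODEPOINTS value, or None if it is invalid."""
    chars = []
    for part in value.split("-"):
        if not CODEPOINT.fullmatch(part):
            return None
        number = int(part, 16)
        if number > 0x10FFFF:
            return None
        chars.append(chr(number))
    return "".join(chars)


for sample in ["00A0", "300b", "10D40C", "00C1-00E1", "A0", "U060C", "110000"]:
    print(sample, parse_codepoints(sample) is not None)
```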

In the current version of CLDR, the type "CODEPOINTS" is only used for the locale extension key "vt" (variableTop). The subtags forming the type for "vt" represent an arbitrary string of characters. There is no formal limit on the number of characters, although in practice anything above 1 will be rare, and anything longer than 4 might be useless. Repetition is allowed; for example, 0061-0061 ("aa") is a valid type value for "vt", since the sequence may be a collating element. Order is vital: 0061-0062 ("ab") is different from 0062-0061 ("ba").

For example,

en-u-vt-0061 : this indicates English, with any characters sorting at or below "a" (at a primary level) considered Variable.

en-u-vt-0061-0065 : this indicates English, with any characters sorting at or below the sequence "ae" (at a primary level) considered Variable.

By default in UCA, variable characters are ignored in sorting at a primary, secondary, and tertiary level. But in CLDR, they are not ignorable by default. For more information, see Section 5.14.3 Setting Options.

The type name "REORDER_CODE" is reserved for reordering block names (e.g. "latn", "digit" and "others") defined by DUCET. The type "REORDER_CODE" is used for locale extension key "kr" (colReorder). The value of type for "kr" is represented by one or more reordering block names such as "latn-digit". For more information, see Section 5.14.12 Collation Reordering.

In the current version of CLDR, all type names except "CODEPOINTS" and "REORDER_CODE" are final and used alone. For example, "gregory" and "japanese" are valid type names for key "ca" (calendar). Both "u-ca-gregory" and "u-ca-japanese" are valid representations of Unicode locale extension, but "u-ca-gregory-japanese" is not.

alias (Not applicable to <attribute>)

The BCP47 form is the canonical form, and recommended. Other aliases are included only for backwards compatibility.

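Example:

<type name="phonebk" alias="phonebook" description="Phonebook style ordering (such as in German)"/>

The preferred term, and the only one to be used in BCP47, is the name: in this example, "phonebk".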

The alias is a key or type name used by Unicode locale extensions with the old "@key=type" syntax. The attribute value for type may contain multiple names delimited by ASCII space characters. Of those aliases, the first name is the preferred value.

description

The description of the key, type or attribute element.

since

The version of CLDR in which this key or type was introduced. Absence of this attribute value implies the key or type was available in CLDR 1.7.2.

deprecated

The deprecation status of the key, type or attribute element. The value "true" indicates the element is deprecated and no longer used in the version of CLDR. The default value is "false".

For example,

<key name="co" alias="collation" description="Collation type key">
  <type name="pinyin" description="Pinyin ordering for Latin and for CJK characters (used in Chinese)"/>
</key>

<key name="ka" alias="colAlternate" description="Collation parameter key for alternate handling">
  <type name="noignore" alias="non-ignorable" description="Variable collation elements are not reset to ignorable"/>
  <type name="shifted" description="Variable collation elements are reset to zero at levels one through three"/>
</key>

<key name="tz" alias="timezone">
  ...
  <type name="aumel" alias="Australia/Melbourne Australia/Victoria" description="Melbourne, Australia"/>
  <type name="aumqi" alias="Antarctica/Macquarie" description="Macquarie Island Station, Macquarie Island" since="1.8.1"/>
  ...
</key>

The data above indicates:

type "pinyin" is valid for key "co", thus "u-co-pinyin" is a valid Unicode locale extension.
type "pinyin" is not valid for key "ka", thus "u-ka-pinyin" is not a valid Unicode locale extension.
type "pinyin" has no alias, so "zh@collation=pinyin" is a valid Unicode locale identifier according to the old syntax.
type "noignore" has an alias attribute, so "en@colAlternate=noignore" is not a valid Unicode locale identifier according to the old syntax.
type "aumel" is valid for key "tz", supported by CLDR 1.7.2 (default value) or later versions.
type "aumqi" is valid for key "tz", supported by CLDR 1.8.1 or later versions.
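
The checks listed above can be reproduced with a short script. This is a sketch only: the XML is an abridged copy of the example data (descriptions omitted), and the helper names load_keys, is_valid_u_extension, and old_syntax are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged from the example above.
DATA = """<keyword>
  <key name="co" alias="collation">
    <type name="pinyin"/>
  </key>
  <key name="ka" alias="colAlternate">
    <type name="noignore" alias="non-ignorable"/>
    <type name="shifted"/>
  </key>
</keyword>"""


def load_keys(xml_text):
    """Map each key name to (key alias, {type name: list of type aliases})."""
    keys = {}
    for key in ET.fromstring(xml_text).iter("key"):
        types = {t.get("name"): (t.get("alias") or "").split() for t in key.iter("type")}
        keys[key.get("name")] = (key.get("alias"), types)
    return keys


def is_valid_u_extension(keys, key, type_name):
    """True if "u-<key>-<type>" uses a type defined for that key."""
    return key in keys and type_name in keys[key][1]


def old_syntax(keys, key, type_name):
    """Return the "key=type" pair for the old syntax, preferring the first alias."""
    key_alias, types = keys[key]
    aliases = types[type_name]
    return f"{key_alias or key}={aliases[0] if aliases else type_name}"


keys = load_keys(DATA)
print(is_valid_u_extension(keys, "co", "pinyin"))
print(old_syntax(keys, "ka", "noignore"))
```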

Q.1.1 Numbering System Data

LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file bcp47/number.xml. For example, for the 'trunk' version of the data see bcp47/number.xml.

Details about those numbering systems are defined in supplemental/numberingSystems.xml. For example, for the 'trunk' version of the data see supplemental/numberingSystems.xml.

LDML makes certain stability guarantees on this data:

Like other BCP47 identifiers, once a numeric identifier is added to bcp47/number.xml or numberingSystems.xml, it will never be removed from either of those files.
If an identifier has type="numeric" in numberingSystems.xml, then
1. It is a decimal, positional numbering system with an attribute digits=X, where X is a string with the 10 digits in order used by the numbering system.
2. The values of the type and digits will never change.
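
Because a numeric numbering system is decimal and positional, its digits string is sufficient to render an integer. The sketch below illustrates this; the function name format_integer is hypothetical, and the digit string is built from the Arabic-Indic digits U+0660 to U+0669 rather than read from numberingSystems.xml.

```python
def format_integer(value, digits):
    """Render a non-negative integer with a string of ten digits in ascending order."""
    if len(digits) != 10:
        raise ValueError("digits must contain exactly 10 characters")
    if value < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    return "".join(digits[int(ch)] for ch in str(value))


# Arabic-Indic digits, used here as an illustrative digits value.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))

print(format_integer(2012, ARABIC_INDIC))
```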

Q.2 Transformed Content Data File

In the transformed content 't' data file, the name attribute in a <key> element defines a valid field separator subtag. The name attribute in an enclosed <type> element defines a valid field subtag for the field separator subtag. For example:

<key extension="t" name="m0"
     description="Transform extension mechanism">
  <type name="ungegn"
        description="United Nations Group of Experts on Geographical Names"
        since="21"/>
</key>

The data above indicates:

"m0" is a valid field separator for the transformed content extension 't'.
field subtag "ungegn" is valid for field separator "m0".
field subtag "ungegn" was introduced in CLDR 21.

The attributes are:

name: The name of the mechanism, limited to 3-8 characters (or sequences of them).
description: A description of the name, with all and only that information necessary to distinguish one name from others with which it might be confused. Descriptions are not intended to provide general background information.
since: Indicates the first version of CLDR where the name appears. (Required for new items.)

alias: Alternative name, not limited in number of characters. Aliases are intended for compatibility, not to provide all possible alternate names or designations. (Optional)

For information about the registration process, meaning, and usage of the 't' extension, see [RFC6497].

Appendix R. Property Data

Some data in CLDR does not use an XML format, but rather a semicolon-delimited format derived from that of the Unicode Character Database. That is because the data is more likely to be parsed by implementations that already parse UCD data. Those files are present in the common/properties directory.

Each file has a header that explains the format and usage of the data.

Appendix S. Keyboards

The CLDR keyboard format provides for the communication of keyboard mapping data between different modules, and the comparison of data across different vendors and platforms. The standardized identifier for keyboards can be used to communicate, internally or externally, a request for a particular keyboard mapping that is to be used to transform either text or keystrokes. The corresponding data can then be used to perform the requested actions.

For example, a web-based virtual keyboard may transform text in the following way. Suppose the user types a key that produces a "W" on a qwerty keyboard. A web-based tool using an azerty virtual keyboard can map that text ("W") to the text that would have resulted from typing a key on an azerty keyboard, by transforming "W" to "Z". Such transforms are in fact performed in existing web applications.
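
The text-to-text transform in this example can be sketched by going through the ISO position that both layouts share. The two tables below are tiny, hand-written excerpts of the shifted layers and are assumptions for illustration, not CLDR data; remap is a hypothetical name.

```python
# Hand-written excerpts (ISO position -> character), for illustration only.
QWERTY = {"D01": "Q", "D02": "W", "C01": "A", "B01": "Z"}
AZERTY = {"D01": "A", "D02": "Z", "C01": "Q", "B01": "W"}


def remap(text, source, target):
    """Map each character to the one produced by the same key on the target layout."""
    position = {char: iso for iso, char in source.items()}
    return "".join(target.get(position.get(ch), ch) for ch in text)


print(remap("W", QWERTY, AZERTY))
```

Characters that are not found in the source table are passed through unchanged.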

The data can also be used in analysis of the capabilities of different keyboards. It also allows better interoperability by making it easier for keyboard designers to see which characters are generally supported on keyboards for given languages.

To illustrate this specification, here is an abridged layout representing the English US 101 keyboard on the Mac OSX operating system (with an inserted long-press example). For more complete examples, and information collected about keyboards, see keyboard data in XML.

<keyboard locale="en-t-k0-osx">
	<version platform="10.4" number="$Revision: 7883 $" />
	<generation date="$Date: 2012-10-25 18:40:08 -0700 (Thu, 25 Oct 2012) $" />
	<names>
		<name value="U.S." />
	</names>
	<keyMap>
		<map iso="E00" to="`" />
		<map iso="E01" to="1" />
		<map iso="D01" to="q" />
		<map iso="D02" to="w" />
		<map iso="D03" to="e" longPress="é è ê ë" />
		…
	</keyMap>
	<keyMap modifiers="caps">
		<map iso="E00" to="`" />
		<map iso="E01" to="1" />
		<map iso="D01" to="Q" />
		<map iso="D02" to="W" />
		…
	</keyMap>
	<keyMap modifiers="opt">
		<map iso="E00" to="`" />
		<map iso="E01" to="¡" /> <!-- key=1 -->
		<map iso="D01" to="œ" /> <!-- key=Q -->
		<map iso="D02" to="∑" /> <!-- key=W -->
		…
	</keyMap>
	<transforms type="simple">
		<transform from="` " to="`" />
		<transform from="`a" to="à" />
		<transform from="`A" to="À" />
		<transform from="´ " to="´" />
		<transform from="´a" to="á" />
		<transform from="´A" to="Á" />
		<transform from="˜ " to="˜" />
		<transform from="˜a" to="ã" />
		<transform from="˜A" to="Ã" />
		…
	</transforms>
</keyboard>

And its associated platform file (which includes the hardware mapping):

<platform id="osx">
	<hardwareMap>
		<map keycode="0" iso="C01" />
		<map keycode="1" iso="C02" />
		<map keycode="6" iso="B01" />
		<map keycode="7" iso="B02" />
		<map keycode="12" iso="D01" />
		<map keycode="13" iso="D02" />
		<map keycode="18" iso="E01" />
		<map keycode="50" iso="E00" />
	</hardwareMap>
</platform>
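
To show how the two files work together, the following sketch resolves a hardware key code to an ISO position through the platform file, then to an output character through the layout file. The XML is abridged from the examples above; output_for is a hypothetical helper, and it compares the modifiers attribute as a plain string rather than interpreting modifier combinations.

```python
import xml.etree.ElementTree as ET

PLATFORM = """<platform id="osx"><hardwareMap>
  <map keycode="12" iso="D01" /><map keycode="13" iso="D02" /><map keycode="18" iso="E01" />
</hardwareMap></platform>"""

KEYBOARD = """<keyboard locale="en-t-k0-osx">
  <keyMap>
    <map iso="E01" to="1" /><map iso="D01" to="q" /><map iso="D02" to="w" />
  </keyMap>
  <keyMap modifiers="caps">
    <map iso="E01" to="1" /><map iso="D01" to="Q" /><map iso="D02" to="W" />
  </keyMap>
</keyboard>"""


def output_for(keycode, modifiers=None):
    """Return the output for a key code under an exact modifiers string, or None."""
    hardware = {m.get("keycode"): m.get("iso") for m in ET.fromstring(PLATFORM).iter("map")}
    iso = hardware.get(str(keycode))
    for key_map in ET.fromstring(KEYBOARD).iter("keyMap"):
        if key_map.get("modifiers") == modifiers:
            for entry in key_map.iter("map"):
                if entry.get("iso") == iso:
                    return entry.get("to")
    return None


print(output_for(13))
print(output_for(13, "caps"))
```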

Goals and Nongoals

Some goals of this format are:

Make the XML as readable as possible.
Represent faithfully keyboard data from major platforms: it should be possible to create a functionally-equivalent data file (such that given any input, it can produce the same output).
Make as much commonality in the data across platforms as possible to make comparison easy.

Some non-goals (outside the scope of the format) currently are:

Display names or symbols for keycaps (eg, the German name for "Return"). If that were added to LDML, it would be in a different structure, outside the scope of this proposal.
Advanced IME features, handwriting recognition, etc.
Roundtrip mappings: the ability to recover precisely the same format as an original platform's representation. In particular, the internal structure may have no relation to the internal structure of external keyboard source data; the only goal is functional equivalence.
More sophisticated transforms, such as for Indic character rearrangement. It is anticipated that these would be added to a future version, after working out a reasonable representation.

Note: During development, it was considered whether to merge RAlt (=AltGr) with Option. In the end, they were kept separate, but for comparison across platforms implementers may choose to identify them.

Definitions

Keyboard: The physical keyboard.

Key: A key on a physical keyboard.

Modifier: A key that is held to change the behavior of a keyboard. For example, the "Shift" key allows access to upper-case characters on a US keyboard. Other modifier keys include, but are not limited to: Ctrl, Alt, Option, Command and Caps Lock.

Key code: The integer code sent to the application on pressing a key.

ISO position: The corresponding position of a key using the ISO layout convention where rows are identified by letters and columns are identified by numbers. For example, "D01" corresponds to the "Q" key on a US keyboard. For the purposes of this document, an ISO layout position is depicted by a one-letter row identifier followed by a two-digit column number (like "B03", "E12" or "C00"). The following diagram depicts a typical US keyboard layout superimposed with the ISO layout indicators. Note that the number of keys and their physical placement relative to each other in this diagram is irrelevant; what matters is their logical placement using the ISO convention.

keyboard layout example showing ISO key numbering

One may also extend the notion of the ISO layout to support keys that don't map directly to the diagram above (such as the Android device - see diagram). Per the ISO standard, the space bar is mapped to "A03", so the period and comma keys are mapped to "A02" and "A04" respectively based on their relative position to the space bar. Also note that the "E" row does not exist on the Android keyboard.

keyboard layout example showing extension of ISO key numbering

If it becomes necessary in the future, the format could extend the ISO layout to support keys that are located to the left of the "00" column by using negative column numbers "-01", "-02" and so on, or 100's complement "99", "98",...

Hardware map: A mapping between key codes and ISO layout positions.

Base character: The character emitted by a particular key when no modifiers are active. In ISO terms, this is group 1, level 1.

Base map: A mapping from the ISO positions to the base characters. There is only one base map per layout. The characters on this map can be output by not using any modifier keys.

Key map: The basic mapping between ISO positions and the output characters for each set of modifier combinations associated with a particular layout. There may be multiple key maps for each layout.

Transform: A transform is simply a combination of key presses that gets transformed into one (or more) final characters. For example, on most Latin keyboards, hitting the "^" dead-key followed by the "e" key produces "ê".

Layout: A layout is the overall keyboard configuration for a particular locale. Within a keyboard layout, there is a single base map, one or more key maps and one or more transforms.

File and Directory Structure

Each platform has its own directory, where a "platform" is a designation for a set of keyboards available from a particular source, such as Windows or Chromeos. This directory name is the platform name (see Table 2 located further in the document). Within this directory there are two types of files:

A single platform file (see XML structure for Platform file). This file includes a mapping of hardware key codes to the ISO layout positions. It is also open to expansion for any configuration elements that are valid across the whole platform and that are not layout specific. This file is simply called _platform.xml.
Multiple layout files named by their locale identifiers. (eg. lt-t-k0-chromeos.xml or ne-t-k0-windows.xml).

Element Hierarchy - Layout File

Element: keyboard

This is the top level element. All other elements defined below are under this element.

Syntax

<keyboard locale="{locale ID}">

{definition of the layout as described by the elements defined below}

</keyboard>

Attribute: locale (required)

This mandatory attribute represents the locale of the keyboard using Unicode locale identifiers (see LDML), for example 'el' for Greek. Sometimes, the locale may not specify the base language. For example, a Devanagari keyboard for many languages could be specified by the BCP 47 code 'und-Deva'. For details, see Keyboard IDs.

Examples (for illustrative purposes only, not indicative of the real data)

<keyboard locale="ka-t-k0-qwerty-windows">
  …
</keyboard>
<keyboard locale="fr-CH-t-k0-android">
  …
</keyboard>

Element: version

Element used to keep track of the source data version.

Syntax

<version platform="{platform source version}" number="{data revision version}" />

Attribute: platform (required)

The platform source version. Specifies what version of the platform the data is from. For example, data from Mac OSX 10.4 would be specified as platform="10.4". For platforms that have unstable version numbers which change frequently (like Linux), this field is set to an integer representing the iteration of the data starting with "1". This number would only increase if there were any significant changes in the keyboard data.

Attribute: number (required)

The data revision version.

Attribute: cldrVersion (fixed by DTD)

The CLDR specification version that is associated with this data file. This value is fixed and is inherited from the DTD file and therefore does not show up directly in the XML file.

Example

<keyboard locale="en-t-k0-osx">

…

<version platform="10.4" number="$Revision: 7883 $" />

…

</keyboard>

Element: generation

Element used to keep track of the generation date of the data.

Syntax

<generation date="{date of generation}" />

Attribute: date (required)

The date the data was generated.

Example

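<keyboard locale="en-t-k0-osx">

…

<generation date="$Date: 2012-10-25 18:40:08 -0700 (Thu, 25 Oct 2012) $" />

…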

</keyboard>

Element: names

Element used to store any names given to the layout by the platform.

Syntax

<names>

{set of name elements}

</names>

Element: name

A single name given to the layout by the platform.

Syntax

<name value="{name of the layout}" />

Attribute: value (required)

The name of the layout.

Example

<keyboard locale="en-t-k0-osx">

…

<names>

<name value="U.S." />

</names>

…

</keyboard>

Element: settings

An element used to keep track of layout specific settings. This element may or may not show up on a layout. These settings reflect the normal practice on the platform. However, an implementation using the data may customize the behavior. For example, for transformFailures the implementation could ignore the setting, or modify the text buffer in some other way (such as by emitting backspaces).

Syntax

<settings [fallback="omit"] [transformFailure="omit"] [transformPartial="hide"] />

Attribute: fallback="omit" (optional)

The presence of this attribute means that when a modifier key combination goes unmatched, no output is produced. The default behavior (when this attribute is not present) is to fall back to the base map when the modifier key combination goes unmatched.

If this attribute is present, it must have a value of omit.

Attribute: transformFailure="omit" (optional)

This attribute describes the behavior of a transform when it is escaped (see the transform element in the Layout file for more information). A transform is escaped when it can no longer continue due to the entry of an invalid key. For example, suppose the following set of transforms are valid:

^e → ê

^a → â

Suppose a user now enters the "^" key. Then "^" is stored in a buffer and may or may not be shown to the user (see the transformPartial attribute).

If a user now enters d, then the transform has failed and there are two options for output.

1. default behavior - "^d"

2. omit - "" (nothing and the buffer is cleared)

The default behavior (when this attribute is not present) is to emit the contents of the buffer upon failure of a transform.

If this attribute is present, it must have a value of omit.
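
The two failure behaviors can be sketched with a small buffer-based loop. This is an illustration under simplifying assumptions: type_keys is a hypothetical name, the transforms are given as a plain dictionary, and the key that causes the failure is not reconsidered as the possible start of a new transform.

```python
TRANSFORMS = {"^e": "ê", "^a": "â"}


def type_keys(keys, transforms, transform_failure=None):
    """Feed `keys` one at a time; return (emitted text, pending buffer)."""
    emitted = []
    buffer = ""
    for key in keys:
        candidate = buffer + key
        if candidate in transforms:
            # The transform is complete: emit its result.
            emitted.append(transforms[candidate])
            buffer = ""
        elif any(source.startswith(candidate) for source in transforms):
            # Still a valid prefix of some transform: keep buffering.
            buffer = candidate
        elif buffer:
            # The transform failed: emit the buffer unless "omit" is set.
            if transform_failure != "omit":
                emitted.append(candidate)
            buffer = ""
        else:
            emitted.append(key)
    return "".join(emitted), buffer


print(type_keys("^d", TRANSFORMS))
print(type_keys("^d", TRANSFORMS, transform_failure="omit"))
```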

Attribute: transformPartial="hide" (optional)

This attribute describes the behavior of the system while in a transform. When this attribute is present, the values of the buffer are not shown as the user is typing a transform (this behavior can be seen on Windows or Linux platforms).

By default (when this attribute is not present), show the values of the buffer as the user is typing a transform (this behavior can be seen on the Mac OSX platform).

If this attribute is present, it must have a value of hide.

Example

<keyboard locale="ka-t-k0-qwerty-windows">

…

<settings fallback="omit" transformPartial="hide" />

…

</keyboard>

Indicates that:

When a modifier combination goes unmatched, do not output anything when a key is pressed.
If a transform is escaped, output the contents of the buffer.
During a transform, hide the contents of the buffer as the user is typing.

Element: keyMap

This element defines the group of mappings for all the keys that use the same set of modifier keys. It contains one or more map elements.

Syntax

<keyMap [modifiers="{set of modifier combinations}"]>

{a set of map elements}

</keyMap>

Attribute: modifiers (optional)

A set of modifier combinations that cause this key map to be "active". Each combination is separated by a space. The interpretation is that there is a match if any of the combinations match, that is, they are ORed. Therefore, the order of the combinations within this attribute does not matter.

A combination is simply a concatenation of modifier names, joined by '+', that represents the simultaneous activation of one or more modifier keys. The order of the modifier keys within a combination does not matter, although don't care cases are generally added to the end of the string for readability (see next paragraph). For example: "cmd+caps" represents the Caps Lock and Command modifier key combination. Some keys have right or left variant keys, specified by a 'R' or 'L' suffix. For example: "ctrlR+caps" would represent the Right-Control and Caps Lock combination. For simplicity, the presence of a modifier without a 'R' or 'L' suffix means that either its left or right variants are valid. So "ctrl+caps" represents the same as "ctrlL+ctrlR?+caps ctrlL?+ctrlR+caps".

A modifier key may be further specified to be in a "don't care" state using the '?' suffix. The "don't care" state simply means that the preceding modifier key may be either ON or OFF. For example "ctrl+shift?" could be expanded into "ctrl ctrl+shift".

Within a combination, the presence of a modifier WITHOUT the '?' suffix indicates this key MUST be on. The converse is also true, the absence of a modifier key means it MUST be off for the combination to be active.
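
One way to make these rules concrete is to expand a modifiers value into the full set of key states it matches. The sketch below does this under the reading given above (an unsuffixed sided modifier means left, right, or both; '?' additionally allows the key to be off). The function names are hypothetical.

```python
from itertools import product

SIDED = ("alt", "ctrl", "shift", "opt")  # modifiers with L and R variants
UNSIDED = ("caps", "cmd")                # modifiers without variants


def expand_combination(combination):
    """Return every concrete set of pressed keys that one combination matches."""
    per_token = []
    for token in combination.split("+"):
        optional = token.endswith("?")
        name = token[:-1] if optional else token
        if name in SIDED:
            # No suffix: the left key, the right key, or both may be on.
            states = [{name + "L"}, {name + "R"}, {name + "L", name + "R"}]
        elif name in UNSIDED or (name[:-1] in SIDED and name[-1:] in ("L", "R")):
            states = [{name}]
        else:
            raise ValueError(f"unknown modifier: {token!r}")
        if optional:
            states.append(set())  # "don't care": the key may also be off
        per_token.append(states)
    return {frozenset().union(*choice) for choice in product(*per_token)}


def expand_modifiers(attribute):
    """Expand a space-separated modifiers value; its combinations are ORed."""
    result = set()
    for combination in attribute.split():
        result |= expand_combination(combination)
    return result


def is_active(attribute, pressed):
    """True if exactly the keys in `pressed` are on and the attribute matches them."""
    return frozenset(pressed) in expand_modifiers(attribute)


print(expand_modifiers("ctrl+shift?") == expand_modifiers("ctrl ctrl+shift"))
print(is_active("cmd+caps", {"caps", "cmd"}))
```

Because a key state is compared as an exact set, any modifier that is absent from a combination must be off for that combination to match.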

Here is an exhaustive list of all possible modifier keys:

Possible Modifier Keys

Modifier Keys		Comments
altL	altR	alt → altL+altR? altL?+altR
ctrlL	ctrlR	ditto for Ctrl
shiftL	shiftR	ditto for Shift
optL	optR	ditto for Opt
caps		Caps Lock
cmd		Command on the Mac

All sets of modifier combinations within a layout are disjoint, with no overlap between the key maps. That is, for every possible modifier combination, there is at most a single match within the layout file. There are thus never multiple matches. If no exact match is available, the match falls back to the base map unless the fallback="omit" attribute in the settings element is set, in which case there would be no output at all.

To illustrate, the following example produces an invalid layout because pressing the "Ctrl" modifier key produces an indeterminate result:

<keyMap modifiers="ctrl+shift?">

…

</keyMap>

<keyMap modifiers="ctrl">

…

</keyMap>

Modifier Examples:

<keyMap modifiers="cmd?+opt+caps?+shift" />

Caps-Lock may be ON or OFF, Option must be ON, Shift must be ON and Command may be ON or OFF.

<keyMap modifiers="shift caps" fallback="true" />

Caps-Lock must be ON OR Shift must be ON. This is also the fallback key map.

If the modifiers attribute is not present on a keyMap then that particular key map is the base map.

Element: map

This element defines a mapping between the base character and the output for a particular set of active modifier keys. This element must have the keyMap element as its parent.

If no map element has been defined for a particular ISO layout position, then pressing that key produces no output.

Syntax

<map
 iso="{the iso position}"
 to="{the output}"
 [longPress="{long press keys}"]
 [transform="no"]
/><!-- {Comment to improve readability (if needed)} -->

Attribute: iso (exactly one of base and iso is required)

The iso attribute represents the ISO layout position of the key (see the definition at the beginning of the document for more information).

Attribute: to (required)

The to attribute contains the output sequence of characters that is emitted when pressing this particular key. Control characters, whitespace (other than the regular space character) and combining marks in this attribute are escaped using the \u{...} notation.

Attribute: longPress (optional)

The longPress attribute contains any characters that can be emitted by "long-pressing" a key, a feature that is prominent on mobile devices. The possible sequences of characters that can be emitted are delimited by whitespace. Control characters, combining marks and whitespace (when whitespace is itself intended to be a long-press option) in this attribute are escaped using the \u{...} notation.
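
As an illustration of the \u{...} notation used in the to and longPress attributes, the following sketch decodes escaped values. It assumes that the braces contain a hexadecimal code point; the function names `unescape` and `long_press_options` are hypothetical and not part of this specification.

```python
import re

# Matches \u{...} where the braces hold hexadecimal digits (an assumption
# based on the examples in this document, such as \u{200e} and \u{30c}).
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")

def unescape(value):
    """Replace each \\u{...} escape with the character it denotes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)

def long_press_options(value):
    """Split a longPress value on whitespace, then decode each option."""
    # Splitting happens before decoding so that an escaped space such as
    # \u{20} survives as a long-press option of its own.
    return [unescape(option) for option in value.split()]
```

Decoding after splitting is the design choice that matters here: it is what allows whitespace to appear as a long-press option at all.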

Attribute: transform="no" (optional)

The transform attribute marks a key that never participates in a transform even though its output appears as part of one. This attribute is necessary because two different keys (or modifier combinations) could output the same characters, while only one of them is intended to be a dead-key and participate in a transform. If present, the attribute value must be no.

For example, suppose there are the following keys, their output and one transform:

E00 outputs `

Option+E00 outputs ` (the dead-version which participates in transforms).

`e → è

Then the first key must be tagged with transform="no" to indicate that it should never participate in a transform.

Comment: US key equivalent, base key, escaped output and escaped longpress

In the generated files, a comment is included to help the readability of the document. This comment simply shows the English key equivalent (with prefix key=), the base character (base=), the escaped output (to=) and escaped long-press keys (long=). These comments have been inserted strategically in places to improve readability. Not all comments include all components, since some of them may be obvious.

Examples

<keyboard locale="fr-BE-t-k0-windows">
	…
	<keyMap modifiers="shift">
		<map iso="D01" to="A" /> <!-- key=Q -->
		<map iso="D02" to="Z" /> <!-- key=W -->
		<map iso="D03" to="E" />
		<map iso="D04" to="R" />
		<map iso="D05" to="T" />
		<map iso="D06" to="Y" />
		…
	</keyMap>
	…
</keyboard>
<keyboard locale="ps-t-k0-windows">
	…
	<keyMap modifiers='altR+caps? ctrl+alt+caps?'>
		<map iso="D04" to="\u{200e}" /> <!-- key=R base=ق -->
		<map iso="D05" to="\u{200f}" /> <!-- key=T base=ف -->
		<map iso="D08" to="\u{670}" /> <!-- key=I base=ه to= ٰ -->
		…
	</keyMap>
	…
</keyboard>

Element: transforms

This element defines a group of one or more transform elements associated with this keyboard layout. This is used to support dead-keys using a straightforward structure that works for all the keyboards tested, and that results in readable source data.

There can be multiple <transforms> elements; at this point the "simple" one is defined.

Syntax

<transforms type="simple">

{a set of transform elements}

</transforms>

Attribute: type (required)

The value is "simple" for the transforms listed below. People have legitimate needs for more complex transforms, and more sophisticated types of transforms may be added over time. (Doing the more sophisticated transforms would take much more time, since it would require a thorough survey of the major keyboard mechanisms that use them, development of a unified mechanism that handles all the requirements, coding to ensure that programmatically mapping those mechanisms into the standard is possible, and so on.)

Element: transform

This element must have the transforms element as its parent. This element represents a single transform that may be performed using the keyboard layout. A transform is simply a combination of key presses that gets transformed into one (or more) final characters. For example, in most French keyboards hitting the "^" dead-key followed by the "e" key produces "ê".

Syntax

<transform from="{combination of characters}" to="{output}" />

Attribute: from (required)

This is the combination of keys that must be pressed in order to activate this transform. Each character in this series of characters must match a character that is located in some chars attribute in the document.

For example, suppose there are the following transforms:

^e → ê

^a → â

^o → ô

If the user types a key that produces "^", the keyboard enters a dead state. When the user then types a key that produces an "e", the transform is invoked, and "ê" is output. Suppose a user presses keys producing "^" then "u". In this case, there is no match for the "^u", and the "^" is output if the failure attribute in the transform element is set to emit. If there is no transform starting with "u", then it is also output (again only if failure is set to emit) and the mechanism leaves the "dead" state.

The UI may show an initial sequence of matching characters with a special format, as is done with dead-keys on the Mac, and modify them as the transform completes. This behavior is specified in the partial attribute in the transform element.

Most transforms in practice have only a couple of characters. But for completeness, the behavior is defined on all strings:

If there could be a longer match if the user were to type additional keys, go into a 'dead' state.
If there could not be a longer match, find the longest actual match, emit the transformed text (if failure is set to emit), and start processing again with the remainder.
If there is no possible match, output the first character, and start processing again with the remainder.

Suppose that there are the following transforms:

ab → x

abc → y

abef → z

bc → m

beq → n

Here's what happens when the user types various sequences of characters:

Input characters	Result	Comments
ab		No output, since there is a longer transform with this as prefix.
abc	y	Complete transform match.
abd	xd	The longest match is "ab", so that is converted and output. The 'd' follows, since it is not the start of any transform.
abeq	xeq	"ab" wins over "beq", since it comes first. That is, there is no longer possible match starting with 'a'.
bc	m
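
The three matching rules and the table above can be reproduced with a short routine. This is a sketch under stated assumptions, not the reference behavior: the name `process` is hypothetical, the failure and partial attributes are ignored, and unmatched characters are always emitted.

```python
def process(typed, transforms):
    """Apply simple transforms to everything typed so far.

    Returns (output, pending), where pending is the text still held
    in the 'dead' state awaiting further keys.
    """
    output = []
    buffer = typed
    while buffer:
        # Rule 1: a longer match is still possible, so stay in the dead state.
        if any(key != buffer and key.startswith(buffer) for key in transforms):
            break
        matches = [key for key in transforms if buffer.startswith(key)]
        if matches:
            # Rule 2: take the longest actual match, then continue with the rest.
            best = max(matches, key=len)
            output.append(transforms[best])
            buffer = buffer[len(best):]
        else:
            # Rule 3: no possible match, so output the first character.
            output.append(buffer[0])
            buffer = buffer[1:]
    return "".join(output), buffer

# The transforms from the example above.
EXAMPLE = {"ab": "x", "abc": "y", "abef": "z", "bc": "m", "beq": "n"}
```

With `EXAMPLE`, the input "abd" yields the output "xd" with nothing pending, while "ab" yields no output and leaves "ab" pending.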

Control characters, combining marks and whitespace in this attribute are escaped using the \u{...} notation.

Attribute: to (required)

This attribute represents the characters that are output from the transform. This may be more than one character, so you could have <transform from="´A" to="Fred"/>.

Control characters, whitespace (other than the regular space character) and combining marks in this attribute are escaped using the \u{...} notation.

Examples

<keyboard locale="fr-CA-t-k0-CSA-osx">
	<transforms type="simple">
		<transform from="´a" to="á" />
		<transform from="´A" to="Á" />
		<transform from="´e" to="é" />
		<transform from="´E" to="É" />
		<transform from="´i" to="í" />
		<transform from="´I" to="Í" />
		<transform from="´o" to="ó" />
		<transform from="´O" to="Ó" />
		<transform from="´u" to="ú" />
		<transform from="´U" to="Ú" />
	</transforms>
	...
</keyboard>
<keyboard locale="nl-BE-t-k0-chromeos">
	<transforms type="simple">
		<transform from="\u{30c}a" to="ǎ" /> <!-- ̌a → ǎ -->
		<transform from="\u{30c}A" to="Ǎ" /> <!-- ̌A → Ǎ -->
		<transform from="\u{30a}a" to="å" /> <!-- ̊a → å -->
		<transform from="\u{30a}A" to="Å" /> <!-- ̊A → Å -->
	</transforms>
	...
</keyboard>

Element Hierarchy - Platform File

There is a separate XML structure for platform-specific configuration elements. The most notable component is a mapping between the hardware key codes to the ISO layout positions for that platform.

Element: platform

This is the top level element. This element contains a set of elements defined below. A document shall only contain a single instance of this element.

Syntax

<platform>

{platform-specific elements}

</platform>

Element: hardwareMap

This element must have a platform element as its parent. This element contains a set of map elements defined below. A document shall only contain a single instance of this element.

Syntax

<platform>
    <hardwareMap>
        {a set of map elements}
    </hardwareMap>
</platform>

Element: map

This element must have a hardwareMap element as its parent. This element maps between a hardware keycode and the corresponding ISO layout position of the key.

Syntax

<map keycode="{hardware key code}" iso="{ISO layout position}" />

Attribute: keycode (required)

The hardware key code value of the key. This value is an integer which is provided by the keyboard driver.

Attribute: iso (required)

The corresponding position of a key using the ISO layout convention where rows are identified by letters and columns are identified by numbers. For example, "D01" corresponds to the "Q" key on a US keyboard. (See the definition at the beginning of the document for a diagram).

Examples

<platform>
	<hardwareMap>
		<map keycode="2" iso="E01" />
		<map keycode="3" iso="E02" />
		<map keycode="4" iso="E03" />
		<map keycode="5" iso="E04" />
		<map keycode="6" iso="E05" />
		<map keycode="7" iso="E06" />
		<map keycode="41" iso="E00" />
	</hardwareMap>
</platform>
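
A consumer of a platform file might load the hardwareMap into a lookup table from key code to ISO position. The sketch below uses Python's standard xml.etree.ElementTree module on an abbreviated copy of the example above; the function name `load_hardware_map` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the platform example.
PLATFORM_XML = """<platform>
	<hardwareMap>
		<map keycode="2" iso="E01" />
		<map keycode="3" iso="E02" />
		<map keycode="41" iso="E00" />
	</hardwareMap>
</platform>"""

def load_hardware_map(xml_text):
    """Return a dict mapping integer key codes to ISO layout positions."""
    platform = ET.fromstring(xml_text)
    return {
        int(entry.get("keycode")): entry.get("iso")
        for entry in platform.iterfind("hardwareMap/map")
    }
```

Here `load_hardware_map(PLATFORM_XML)[41]` gives "E00", the position to the left of the digit row.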

Invariants

Beyond what the DTD imposes, certain other restrictions are imposed on the data.

For a given platform, every map[@iso] value must be in the hardwareMap if there is one (_keycodes.xml)
Every map[@base] value must also be in base[@base] value
No keyMap[@modifiers] value can overlap with another keyMap[@modifiers] value.
- eg you can't have "RAlt Ctrl" in one keyMap, and "Alt Shift" in another (because Alt = RAltLAlt).
Every sequence of characters in a transform[@from] value must be a concatenation of two or more map[@to] values.
- eg with <transform from="xyz" to="q"> there must be some map values to get there, such as <map... to="xy"> & <map... to="z">
There must be either 0 or 1 of (keyMap[@fallback] or baseMap[@fallback]) attributes
If the base and chars values for modifiers="" are all identical, and there are no longpresses, that keyMap must not appear (??)
There will never be overlaps among modifier values.
A modifier set will never have ? (optional) on all values
- eg, you'll never have RCtrl?Caps?LShift?
Every base[@base] value must be unique.
A modifier attribute value will always be minimal, observing the following simplification rules.

Notation	Notes
Lower-case character (e.g. x)	Interpreted as any combination of modifiers (e.g. x = CtrlShiftOption).
Upper-case character (e.g. Y)	Interpreted as a single modifier key, which may or may not have L and R variants (e.g. Y = Ctrl, RY = RCtrl).
Y? ⇔ Y ∨ ∅	E.g. Opt? ⇔ ROpt ∨ LOpt ∨ ROptLOpt ∨ ∅
Y ⇔ LY ∨ RY ∨ LYRY	E.g. Opt ⇔ ROpt ∨ LOpt ∨ ROptLOpt

Axiom	Example
xY ∨ x ⇒ xY?	OptCtrlShift OptCtrl → OptCtrlShift?
xRY ∨ xY? ⇒ xY?; xLY ∨ xY? ⇒ xY?	OptCtrlRShift OptCtrlShift? → OptCtrlShift?
xRY? ∨ xY ⇒ xY?; xLY? ∨ xY ⇒ xY?	OptCtrlRShift? OptCtrlShift → OptCtrlShift?
xRY? ∨ xY? ⇒ xY?; xLY? ∨ xY? ⇒ xY?	OptCtrlRShift? OptCtrlShift? → OptCtrlShift?
xRY ∨ xY ⇒ xY; xLY ∨ xY ⇒ xY	OptCtrlRShift OptCtrlShift → OptCtrlShift
xLY?RY? ⇒ xY?	OptRCtrl?LCtrl? → OptCtrl?
xLY? ∨ xLY ⇒ xLY?
xY? ∨ xY ⇒ xY?
xY? ∨ x ⇒ xY?
xLY? ∨ x ⇒ xLY?
xLY ∨ x ⇒ xLY?
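
Some of the invariants in this section lend themselves to mechanical checking. As one example, the requirement that every transform[@from] value be a concatenation of two or more map[@to] values can be tested with a small dynamic-programming routine. This is a sketch only: the name `is_concatenation` and its `minimum` parameter are hypothetical.

```python
def is_concatenation(sequence, outputs, minimum=2):
    """Check whether sequence can be spelled from at least `minimum` outputs.

    pieces[i] holds the largest number of map outputs that concatenate
    exactly to sequence[:i], or -1 if no such split exists.
    """
    pieces = [-1] * (len(sequence) + 1)
    pieces[0] = 0
    for end in range(1, len(sequence) + 1):
        for output in outputs:
            start = end - len(output)
            if (output and start >= 0 and pieces[start] >= 0
                    and sequence[start:end] == output):
                pieces[end] = max(pieces[end], pieces[start] + 1)
    return pieces[-1] >= minimum
```

With the map values "xy" and "z", the from value "xyz" passes the check. With a single map value "xyz" it fails, because only one piece is involved.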

Data Sources

Here is a list of the data sources used to generate the initial key map layouts:

Platform	Source	Notes
Android	Android 4.0 - Ice Cream Sandwich (http://source.android.com/source/downloading.html)	Parsed layout files located in packages/inputmethods/LatinIME/java/res
ChromeOS	XKB (http://www.x.org/wiki/XKB)	The ChromeOS represents a very small subset of the keyboards available from XKB.
Mac OSX	Ukelele bundled System Keyboards (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=ukelele)	These layouts date from Mac OSX 10.4 and are therefore a bit outdated
Windows	Generated .klc files from the Microsoft Keyboard Layout Creator (http://msdn.microsoft.com/en-us/goglobal/bb964665)	For interactive layouts, see also http://msdn.microsoft.com/en-us/goglobal/bb964651

Keyboard IDs

There is a set of subtags that help identify the keyboards. Each of these is used after the "t-k0" subtags. The first tag appended is a mandatory platform tag, followed by zero or more tags that help differentiate the keyboard from others with the same locale code.

Principles for Keyboard Ids

The following are the design principles for the ids.

BCP47 compliant.
1. Eg, "en-t-k0-extended".
Use the minimal language id based on likelySubtags.
1. Eg, instead of en-US-t-k0-xxx, use en-t-k0-xxx. Because there is <likelySubtag from="en" to="en_Latn_US"/>, en-US → en.
2. The data is in http://unicode.org/repos/cldr/trunk/common/supplemental/likelySubtags.xml
The platform goes first, if it exists. If a keyboard on the platform changes over time, both are dated, eg bg-t-k0-chromeos-2011. When selecting, if there is no date, it means the latest one.
Only keyboards that differ from the "standard for each platform" are tagged. That is, for each language on a platform, there will be a keyboard with no subtags other than the platform.
Subtags with a common semantics across platforms are used, such as '-extended', -phonetic, -qwerty, -qwertz, -azerty, …
In order to get to 8 letters, abbreviations are reused that are already in bcp47 -u/-t extensions and in language-subtag-registry variants, eg for Traditional use "-trad" or "-traditio" (both exist in bcp47).
Multiple languages cannot be indicated, so the predominant target is used.
1. For Finnish + Sami, use fi-t-k0-smi or extended-smi
In some cases, there are multiple subtags, like en-US-t-k0-chromeos-intl-altgr.xml
Otherwise, platform names are used as a guide.
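
To show how these IDs decompose, the following sketch splits a keyboard ID into its locale, platform and remaining subtags. It is an illustration under assumptions, not a defined algorithm: the function name `parse_keyboard_id` is hypothetical, and the set of platform subtags is taken from the examples in this document, with "android" added as a guess based on the Data Sources table.

```python
# Platform subtags seen in this document's examples; "android" is an
# assumption, since no Android keyboard ID is shown in the text.
KNOWN_PLATFORMS = {"android", "chromeos", "osx", "windows"}

def parse_keyboard_id(keyboard_id):
    """Split a keyboard ID into locale, platform and other subtags."""
    locale, marker, rest = keyboard_id.partition("-t-k0-")
    if not marker or not rest:
        raise ValueError(f"not a keyboard ID: {keyboard_id!r}")
    subtags = rest.split("-")
    # The platform is looked up anywhere in the list, because at least one
    # example (fr-CA-t-k0-CSA-osx) does not place it first.
    platform = next((s for s in subtags if s.lower() in KNOWN_PLATFORMS), None)
    if platform is not None:
        subtags.remove(platform)
    return {"locale": locale, "platform": platform, "subtags": subtags}
```

For "bg-t-k0-chromeos-2011" this yields the locale "bg", the platform "chromeos" and the single remaining subtag "2011".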

Platform Behaviors in Edge Cases

Platform	No modifier combination match is available	No map match is available for key position	Transform fails (ie. if ^d is pressed when that transform does not exist)
ChromeOS	Fall back to base	Fall back to character in a keyMap with same "level" of modifier combination. If this character does not exist, fall back to (n-1) level. (This is handled data-generation side). In the spec: No output	No output at all
Mac OSX	Fall back to base (unless combination is some sort of keyboard shortcut, eg. cmd-c)	No output	Both keys are output separately
Windows	No output	No output	Both keys are output separately

References

Ancillary Information	To properly localize, parse, and format data requires ancillary information, which is not expressed in Locale Data Markup Language. Some of the formats for values used in Locale Data Markup Language are constructed according to external specifications. The sources for this data and/or formats include the following:
[Bugs]	CLDR Bug Reporting form http://cldr.unicode.org/index/bug-reports
[Charts]	The online code charts can be found at http://unicode.org/charts/ An index to character names with links to the corresponding chart is found at http://unicode.org/charts/charindex.html
[DUCET]	The Default Unicode Collation Element Table (DUCET) For the base-level collation, of which all the collation tables in this document are tailorings. http://unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
[FAQ]	Unicode Frequently Asked Questions http://unicode.org/faq/ For answers to common questions on technical issues.
[FCD]	As defined in UTN #5 Canonical Equivalences in Applications http://unicode.org/notes/tn5/
[Glossary]	Unicode Glossary http://unicode.org/glossary/ For explanations of terminology used in this and other documents.
[JavaChoice]	Java ChoiceFormat http://docs.oracle.com/javase/1.4.2/docs/api/java/text/ChoiceFormat.html
[Olson]	The TZID Database (aka Olson timezone database) Time zone and daylight savings information. ftp://www.iana.org/time-zones For archived data, see ftp://ftp.iana.org/tz/releases/
[Reports]	Unicode Technical Reports http://unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode]	The Unicode Consortium. The Unicode Standard, Version 6.1.0, (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-02-3) http://www.unicode.org/versions/Unicode6.1.0/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
[XPath]	http://www.w3.org/TR/xpath/
Other Standards	Various standards define codes that are used as keys or values in Locale Data Markup Language. These include:
[BCP47]	http://www.rfc-editor.org/rfc/bcp/bcp47.txt The Registry http://www.iana.org/assignments/language-subtag-registry
[ISO639]	ISO Language Codes http://www.loc.gov/standards/iso639-2/ Actual List http://www.loc.gov/standards/iso639-2/langcodes.html
[ISO1000]	ISO 1000: SI units and recommendations for the use of their multiples and of certain other units, International Organization for Standardization, 1992. http://www.iso.org/iso/catalogue_detail?csnumber=5448
[ISO3166]	ISO Region Codes http://www.iso.org/iso/country_codes Actual List http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm
[ISO4217]	ISO Currency Codes http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm (Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency information available.)
[ISO15924]	ISO Script Codes http://www.unicode.org/iso15924/standard/index.html Actual List http://www.unicode.org/iso15924/codelists.html
[LOCODE]	United Nations Code for Trade and Transport Locations, commonly known as "UN/LOCODE" http://www.unece.org/cefact/locode/ Download at: http://www.unece.org/cefact/codesfortrade/codes_index.htm
[RFC6067]	BCP 47 Extension U http://www.ietf.org/rfc/rfc6067.txt
[RFC6497]	BCP 47 Extension T - Transformed Content http://www.ietf.org/rfc/rfc6497.txt
[UNM49]	UN M.49: UN Statistics Division Country or area & region codes http://unstats.un.org/unsd/methods/m49/m49.htm Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings http://unstats.un.org/unsd/methods/m49/m49regin.htm
[XML Schema]	W3C XML Schema http://www.w3.org/XML/Schema
General	The following are general references from the text:
[ByType]	CLDR Comparison Charts http://www.unicode.org/cldr/comparison_charts.html
[Calendars]	Calendrical Calculations: The Millennium Edition by Edward M. Reingold, Nachum Dershowitz; Cambridge University Press; Book and CD-ROM edition (July 1, 2001); ISBN: 0521777526. Note that the algorithms given in this book are copyrighted.
[Comparisons]	Comparisons between locale data from different sources http://unicode.org/cldr/data/diff/
[CurrencyInfo]	UNECE Currency Data http://www.unece.org/etrades/unedocs/repository/codelists/xml/CurrencyCodeList.xml
[DataFormats]	CLDR Data Formats http://unicode.org/cldr/data_formats.html
[Example]	A sample in Locale Data Markup Language http://unicode.org/cldr/dtd/1.1/ldml-example.xml
[ICUCollation]	ICU rule syntax http://www.icu-project.org/userguide/Collate_Customization.html
[ICUTransforms]	Transforms http://www.icu-project.org/userguide/Transformations.html Transforms Demo http://demo.icu-project.org/icu-bin/translit/
[ICUUnicodeSet]	ICU UnicodeSet http://www.icu-project.org/userguide/unicodeSet.html API http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html
[ITUE164]	International Telecommunication Union: List Of ITU Recommendation E.164 Assigned Country Codes available at http://www.itu.int/opb/publications.aspx?parent=T-SP&view=T-SP2
[LocaleExplorer]	ICU Locale Explorer http://demo.icu-project.org/icu-bin/locexp
[LocaleProject]	Common Locale Data Repository Project http://unicode.org/cldr/
[NamingGuideline]	OpenI18N Locale Naming Guideline formerly at http://www.openi18n.org/docs/text/LocNameGuide-V10.txt
[RBNF]	Rule-Based Number Format http://www.icu-project.org/apiref/icu4c/classRuleBasedNumberFormat.html#_details
[RBBI]	Rule-Based Break Iterator http://www.icu-project.org/userguide/boundaryAnalysis.html
[RFC5234]	RFC5234 Augmented BNF for Syntax Specifications: ABNF http://www.ietf.org/rfc/rfc5234.txt
[UCAChart]	Collation Chart http://unicode.org/charts/collation/
[UTCInfo]	NIST Time and Frequency Division Home Page http://tf.nist.gov/ U.S. Naval Observatory: What is Universal Time? http://aa.usno.navy.mil/faq/docs/UT.php
[WindowsCulture]	Windows Culture Info (with mappings from [BCP47]-style codes to LCIDs) http://msdn2.microsoft.com/en-us/library/system.globalization.cultureinfo(vs.71).aspx

Acknowledgments

Special thanks to the following people for their continuing overall contributions to the CLDR project, and for their specific contributions in the following areas. These descriptions only touch on the many contributions that they have made.

Mark Davis for creating the initial version of LDML, and adding to and maintaining this specification, and for his work on the LDML code and tests, much of the supplemental data and overall structure, and transforms and keyboards.
John Emmons for the POSIX conversion tool and metazones.
Deborah Goldsmith for her contributions to LDML architecture and this specification.
Chris Hansten for coordinating and managing data submissions and vetting.
Erkki Kolehmainen and his team for their work on Finnish.
Steven R. Loomis for development of the survey tool and database management.
Peter Nugent for his contributions to the POSIX tool and from Open Office, and for coordinating and managing data submissions and vetting.
George Rhoten for his work on currencies.
Roozbeh Pournader (روزبه پورنادر) for his work on South Asian countries.
Ram Viswanadha (రఘురామ్ విశ్వనాధ) for all of his work on LDML code and data integration, and for coordinating and managing data submissions and vetting.
Vladimir Weinstein (Владимир Вајнштајн) for his work on collation.
Yoshito Umaoka (馬岡由人) for his work on the timezone architecture.
Rick McGowan for his work gathering language, script and region data.
Xiaomei Ji (吉晓梅) for her work on time intervals and plural formatting.
David Bertoni for his contributions to the conversion tools.
Mike Tardif for reviewing this specification and for coordinating and vetting data submissions.
Peter Edberg for work on this specification, telephone code data, monthPatterns, cyclicNameSets and contextTransforms.
Raymond Wainman and Cibu Johny for their work on keyboards.
Jennifer Chye for her contributions to the conversion tools.

Other contributors to CLDR are listed on the CLDR Project Page.

Modifications

The following summarizes modifications from the previous revision of this document. Some of the modification notes have an associated bug ticket number, which may be used to look up additional information about the modification; for further information, see http://www.unicode.org/cldr/filing_bug_reports.html.

Revision 29

Updated for Version 22.1.
Clarified the mechanism for producing timezone display names, removing fallbackRegionFormat.
Clarified the use and meaning of the index characters.
Revised the table for collation keyword lookup in Section I.5 Keyword and Default Resolution
In Section 5.14, Collation Elements, added information on zhuyin collation index markers. [ticket #5320]
Clarify that text in a dateTimePattern other than {0}{1} is treated as part of a date pattern. [ticket #5398]

Revision 28 being a proposed update, only changes between revisions 27 and 29 are summarized here.

Revision 27

Updated for Version 22.
Added new section Section 5.14.13, Case Parameters.
Modified the table in Section 5.14.3, Setting Options to include information from UCA, and to add clarifications.
Added 6-letter patterns for short weekday names to the Date Field Symbol Table. [ticket #4571]
Updated deprecated status. [ticket #5229]
In Section 5.9.1, Calendar Elements, under <months>, <days>, <quarters>, <eras>, mentioned the new day width short and the new <monthPattern> context numeric and width all. [ticket #5268]
Fixed last example in table Specifying Collation Ordering. [ticket #3108]
Added descriptions of coverageVariable in Appendix M. [ticket #5269]
Documented the attribute value status="grouping" in Appendix C.2. [ticket #5270]
Added Appendix R, for property data.
Added Appendix S, for Keyboards.
Updated reference to [Olson]
Added clarification for removal of deprecated codes to Appendix P: Supplemental Metadata.
Disallowed isolated "n" in plural rules in C.11 Language Plural Rules. Reworded some of the syntax notes for clarity.
Moved collation Key/Type information to Section 5.14.3 Setting Options.
Clarified that the BCP47 key/value identifiers are the canonical (and preferred) identifiers.
Added Section Q.1.1 Numbering System Data.
Added description of deprecated="true" in Appendix Q.

Revision 26 being a proposed update, only changes between revisions 25 and 27 are summarized here.

Revision 25

Updated for CLDR Version 21.0.1.
Fixed typo in 't' extension.
Added note on the special case of 'und' in language matching: #3439
Added explanation of Empty Override: #4012
Fix typo in Annex N: #4110

Revision 24 being a proposed update, only changes between revisions 23 and 25 are summarized here.

Revision 23

Updated for CLDR Version 21.
Note the change in version numbering scheme beginning with this release.
Added information on distinctness requirements for names of eras and dayPeriods. [ticket #3831]
Added descriptions of new <monthPatterns> and <cyclicNameSets> calendar elements and of new 'U' date pattern character for cyclic year names. Deprecated the 'l' (SMALL LETTER L) pattern character for leap month marker. [tickets #4230, #4231, #4232]
Added documentation regarding the deprecation of the commonlyUsed element in formatting short time zone names. [tickets #4052, #4130]
Added documentation for ordinal plural forms. [ticket #4323]
Added documentation for territoryContainment status="deprecated". [ticket #4326]
Added documentation for gender of lists. [tickets #4125, #4357]
Clarified use of hour pattern characters (h, H, K, k) in skeletons and associated patterns, in both Section 5.9.1 Calendar Elements and in the Date Field Symbol Table. [ticket #4061]
Added Section 5.19 ContextTransform Elements and Section 5.20 Metadata Elements. Deprecated the <inList> and <inText> elements. Added guidance for capitalization of display names, and for consideration of grammar vs. capitalization in format vs. stand-alone calendar names. [ticket #4317]
Added clarification of YY as fixed width week of year. [ticket #3862]
Clarified that the parentLocale element does not apply to collation. [ticket #3897]
Moved the information about the alias element into Section 5.21 Alias Elements and removed comments about whole-locale aliasing.
Updated Section C.7 Supplemental Time Zone Data to specify the details of the Windows TZID mapping data extended by ticket #4067 in this release. [ticket #4296]
Added documentation for the new date format pattern "ZZZZZ" (ISO 8601 time zone format) in Appendix F: Date Format Patterns and Appendix J: Time Zone Display Names. [ticket #3995]
Added an explanation of the use of the code 'UK', and pointers to the aliases for normalizing codes. [ticket #4250]
Clarified the use of <variable> elements in checking attribute values. [ticket #4013]
Updated info on deprecated items. [ticket #4360]
Clarified the status of non-decimal numbering systems. [ticket #4177]
Described collation reordering. [ticket #4194]
Changed the section title of Appendix Q from "Locale Extension Key and Type Data" to "BCP 47 Extension Data" and updated the description. [ticket #4361]
In Appendix J: Time Zone Display Names, fold description of Localized GMT-zero format into that of Localized GMT format. [ticket #3695]
Describe the 't' extension [ticket #3976]
In Section 5.10.2 Currencies, deprecated the "choice" attribute for currency symbols. [ticket #3934]
In Appendix Q: BCP 47 Extension Data, clarified valid/invalid use case of type value with multiple subtags. [ticket #4212]
Misc editorial fixes (spelling etc.). [ticket #4378]

Revision 22 being a proposed update, only changes between revisions 21 and 23 are summarized here.

Revision 21

Updated for CLDR 2.0.1.
In the Collation section of the Key/Type Definitions table, added an entry for "searchjl". [ticket #3560]
In the Collation parameters section of the Key/Type Definitions table, corrected misspelled "quarternary" to "quaternary". [ticket #4031]
In Section 5.6 Character Elements, corrected the statements about when punctuation and symbols cannot be included in exemplar sets, and added a note about use of exemplar sets and number systems to determine character repertoire requirements to support a language. [ticket #3498]
In Section 5.9.2 Time Zone Names, added a note recommending use of generic location format in user interfaces for timezone selection, and referring to the Date Field Symbol Table and Appendix G (where this is also discussed). [ticket #914]
Added documentation for territoryContainment grouping="true", for additional deprecatedItems, for coverageLevels, and for parentLocales. [ticket #3938]
Added clarifications of count=0/1 [ticket #3988]
Added descriptions of special index markers added to CJK collations for stroke, pinyin, and unihan, and the alternate alt="short" forms.

Revision 20 being a proposed update, only changes between revisions 19 and 21 are summarized here.

Revision 19

Updated for CLDR 2.0.
In the Collation section of the Key/Type Definitions table, added an entry for "ducet" and corrected the information about which types are available in all locales. [ticket #3399]
Added fallbackRegionFormat, localeKeyTypePattern, stopwords, count=0/1.
Enhanced plural rules to allow for explicit lists: n in 1,3,5..14.
Clarified normalization of LDML files.
Changed the description of coverage levels; it is now data-based.
Clarified the use of commonlyUsed flag for pattern "v" in the Date Field Symbol Table. [ticket #2700]
Added calendar type "iso8601" and number type "tamldec" in the Key/Type Definitions table.
Restricted the use of the <alias> element to only two circumstances: in root, and as whole-locale aliases.

Revision 18 being a proposed update, only changes between revisions 17 and 19 are summarized here.

Revision 17

Updated for CLDR 1.9.
In the Date Field Symbol Table, changed the description of pattern character 'S' to indicate that the corresponding field truncates, rather than rounds. [ticket #2845]
In the description of dateFormats, noted that they are intended primarily for use by themselves in user interface elements. [ticket #3048]
Added (short) descriptions of transformNames, ellipsis, moreInformation, punctuation exemplars. [ticket #3360]
Updated the discussion of canonical TZ IDs [ticket #2899]
Described the use of numberSystem with symbols, decimalFormats, etc. [ticket #3361]
Documented the recommended fallback for transforms [ticket #2240]
Clarified some issues with plural rules [ticket #3061]
Documented the changes for UCA 6.0, and clarified some examples and the use of "basic syntax" [ticket #3060]
Highlighted where CLDR and LDML have different defaults than UCA/DUCET [ticket #2904]
Described the default subtype 'true' for keywords [ticket #2958] Clarified that the defaults are different from the attribute defaults for collation.
Deprecated <fallback> and moved relevant text into C.18 Language Matching [ticket #1988]
Clarified that alt values are not limited to the list in the text. Also added "short". [ticket #1910]
Updated coverage levels [ticket #2591]
Updated the specification to match DTD in various places: Section 5 XML Format header; Section 5.1 Common Elements, for default element use choice instead of deprecated type; Section 5.4 Display Name Elements header; Section 5.6 Character Elements header; Section 5.9 Date Elements header; Section 5.10 Number Elements header. [ticket #1925]
Added a collation type 'search' in the Key/Type Definitions table. Also moved 'standard' to the beginning to match the description. [ticket #3375]

Revision 16

Updated for CLDR 1.8.1. Fix TOC links to sections C.17, C.18. [ticket #2722]
Updated Appendix Q Locale Extension Key and Type Data to provide more information about valid "vt" (variableTop) value and versioning. [tickets #2740, #2741]
Clarified the role of aliases and fallback elements [tickets #2757, #2742, and #2762].

Revision 15

In Section C.5 Supplemental Calendar Data, explained the different interpretations of “first day of the week” and their relationship to CLDR data. [ticket #2663]
Updated Appendix K Valid Attribute Values. [ticket #1504]
Added descriptions of yeartype and numberingSystem attributes [ticket #2712]
Explained syntax and meaning of vt and variableTop.
Added index exemplar sets and index label elements
Added dayPeriod and dayPeriod Rules
Added list patterns
Added language matching
Updated descriptions of currency and timezone codes in the Key/Type table.
Revised the description in Section C.7 Supplemental Time Zone Data to support the new time zone data organization. [ticket #2715]
Updated references to UAX/UTS/UTR documents and to The Unicode Standard. [ticket #2530]
Noted (in Calendar Elements and Lenient Parsing sections) that narrow month and day values need not be distinct. [ticket #1955]
In Appendix O Lenient Parsing, expanded the discussion of equivalences among apostrophe-like characters. [ticket #2629]

Revision 14 being a proposed update, only changes between revisions 13 and 15 are summarized here.

Revision 13

Updated 3. Unicode Language and Locale Identifiers to make the primary locale identifier syntax more BCP47 compatible. [ticket #2457]
Added Appendix Q. Locale Extension Key and Type Data. [ticket #2457]
Note that the dateRangePattern element is deprecated, replaced by intervalFormats. Note that the replacement for the deprecated measurement element is measurementData in supplemental. Be more clear that the hoursFormat, abbreviationFallback, and preferenceOrdering elements are deprecated. [ticket #2369]
In Appendix G.8 Number Elements, updated the description of grouping separators to match Appendix G.2. [ticket #2317]
Fixed some typos. [ticket #2216]

Revision 12

In Section 1 Introduction, clarified that LDML is an interchange format, not a runtime format. [ticket #1971]
In Section 3 Unicode Language and Locale Identifiers:
- Clarified some entries in the Variant Definitions section. [ticket #1878]
- Added calendar types ethiopic-amete-alem, indian, roc. [ticket #1960]
In Section 5.6 Character Elements:
- Noted that LRM and RLM can be included in auxiliary or currency exemplars. [ticket #2049]
- Clarified ordering of encodings in the mapping element for email usage. [ticket #2153]
In Section 5.9.2 Time Zone Names:
- Note that "GMT", "UT", "UTC" are not allowed as translations of non-GMT timezones. [ticket #1949]
- Describe new <gmtZeroFormat> element. [tickets #1949, #1950]
Added 5.17 Rule-Based Number Formatting
Added C.13 Numbering Systems
Added C.14 Postal Code Validation
Added C.15 Calendar Preference Data
Added C.16 BCP 47 Keyword Mapping
- Also modified 3.2 BCP 47 Tag Conversion
In Appendix F Date Format Patterns, clarified the usage of the 'j' format character. [ticket #2098]
In Appendix J Time Zone Display Names:
- Added the fallback format used for generic location when <timezoneData> does not have country data for a zone. [ticket #1962]
- In the Parsing section, clarified parsing of GMT/UT/UTC and localized GMT formats with/without numeric offset, and inability to parse all RFC 822 date/time formats. [ticket #1949]
Added numbers attribute on date patterns, in 5.9 Date Elements and C.15 Calendar Preference Data, and in Key_Type_Definitions. Also added bookmarks for Unicode_language_identifier, Unicode_locale_identifier, Language_Locale_Field_Definitions, Variant_Definitions, Key_Type_Definitions
Added defaultNumberingSystem in 5.10 Number Elements
Added tender attribute in C.1 Supplemental Currency Data.
Misc editing.

Revision 11

Made a number of changes as the result of a copy-edit pass by Julie.
Added clarifications in Unicode Language and Locale Identifiers.

Revision 10

In Section 1.1 Conformance, added UAX35-C2 and information on referencing particular components of Unicode locale or language identifiers. [ticket #1801]
In Section 3 Unicode Language and Locale Identifiers:
- Clarified the syntax and usage of the language and locale identifiers. [ticket #1801]
- Replaced the use of the comma (U+002C COMMA), as an options separator, with the semicolon (U+003B SEMICOLON). This change was also reflected in the examples given. [ticket #1717]
- Re-emphasized that key and type values are limited to ASCII letters and digits, and that they have to be unique within the first 8 letters and digits. [ticket #1772]
In Section 4 Locale Inheritance, explained the different fallback processes for resource bundle lookup and resource item lookup. [ticket #1763]
In Section 5 XML Format, clarified that no element value can start with a combining slash U+0338 (not a combining backslash). [ticket #1223]
In Section 5.2 Common Attributes, updated the draft attribute value descriptions; added "contributed".
In Section 5.3.1 Fallback Elements, clarified examples, and explained how implementations can provide a mechanism for overriding the fallbacks. [ticket #1763]
In Section 5.4 Display Name Elements, re-emphasized the potential uses of these elements [ticket #1665], and added new localeDisplayPattern element [ticket #1448].
In Section 5.9.1 Calendar Elements:
- Clarified handling of availableFormat patterns for calendars that require an era field if a year field is present [ticket #1346].
- Described the format of a date-time format skeleton used with the availableFormats element [ticket #1611].
- Added descriptions of intervalFormats [ticket #1813].
In Section 5.9.2 Time Zone Names, indicated that timezone IDs are not limited to city names, and corrected/augmented the fallbackFormat examples. [ticket #1604]
In Section 5.10.1 Number Symbols, updated to match current DTD.
In 5.10.2 Currencies, updated to match current DTD. Added section explaining use of "count" to format currency values for particular numeric values. [ticket #1550]
Added new Section 5.11 Unit Elements (renumbered the 5.x sections after it) describing <units> element: Support for unit forms used (for example) in formatting durations [tickets #900, #973, #1009, #1807, #1821]
In Appendix C.6 Measurement System Data, clarified the meaning of "metric" and its relation to ISO 1000 [ticket #481], and corrected the values for paperSize [ticket #1712].
Added Appendix C.11 Language Plural Rules. [tickets #1550, #1703]
Added Appendix C.12 Telephone Code Data. [ticket #1542]
In Appendix F: Date Format Patterns, clarified the usage of the YYYY for week of year calendars. [ticket #1605]
In Section G.8 of Appendix G: Number Format Patterns, changed currencySeparator, which is not a valid field, to currencyDecimal. [ticket #997]
In Appendix J: Time Zone Display Names, amended the fallback example and corrected a typo; corrected root fallbackFormat and all discussions and examples that ensue from that change. [ticket #1604]
In the Parsing section of Appendix J: Time Zone Display Names, updated step 3 to allow UTC and UT as synonyms for GMT. [ticket #1582]
Fixed validation errors and broken links [tickets #1606, #1619]. Added contributors [ticket #1835]. In this Modifications section, added bug ticket information [ticket #1630].

Revision 9

Extensive rewrite of Appendix J: Time Zone Display Names, primarily due to refinements to the metazone process. This also caused some changes in Appendix F: Date Format Patterns. [ticket #1508]
Made the date range handling uniform, with new Section 5.2.1 Dates and Date Ranges, and related changes particularly to C.1 and C.5 in Appendix C: Supplemental Data.
Added Appendix C.10 Likely Subtags
Added missing date pattern symbol "l" for Chinese calendar. [ticket #1557]

Revision 8

Reserved 'j' in date formats for distinguishing 12-hour and 24-hour formats.
Added section 5.1.2 Text Directionality.
Added new conformance section: 1.1 Conformance
Revised text on loose matching to include BIDI control characters: Appendix O: Lenient Parsing
Revised text on distinguishing and blocking elements
Added currency exemplar sets
Added dateRangePatterns
Added language fallbacks: 5.3.1 Fallback Elements
Clarified use of transliterator names
Added matching options for collation
Added currency change policy
Added description of character fallbacks, changed ordering of NFC and NFKC.
Added DTD headers for supplemental data
Added supplemental metadata descriptions: Appendix P: Supplemental Metadata
Added mappings to alternate language and country codes
Added substantial data on language and script usage in different countries
Added default content data
Added metazones
Clarified the before and after elements in currency formatting.
Minor edits

Revision 7

Point at bug database instead of Unicode reporting form.
Add "root" as valid locale identifier, and clarify that "locale" in CLDR is really essentially language.
Added the list of private use language & script subtag codes that will not be used by CLDR.
Corrected the dateTimeFormat assignments for {0} and {1}.

Revision 6

Incorporated Corrigendum 1 (see http://unicode.org/cldr/corrigenda.html) into Appendix F: Date Format Patterns and Section 5.4 <localeDisplayNames>
Revamped Appendix J: Time Zone Display Names. Also changed "Fallback" to "Display Names" in the title of the Appendix, and "Olson" to "TZ" in other places in the document.
Yesstr/nostr/yesexpr/noexpr changes in Section 5.12 <posix>.
Added Section 5.15 <segmentations>.
Moved week, measurement data to Appendix C: Supplemental Data
Added coverage levels in Appendix M: Coverage Levels
Added rule-based number formats and transforms, in Section 5.16 Transforms, Appendix N: Transform Rules
Added metadata, replacing the contents of Appendix K: Valid Attribute Values
Added availableFormats, dateFormatItem, and appendItem in <calendars> to support more flexible date/time formatting
Added measurementSystemNames and measurementSystemName in <localeDisplayNames> for localized names of measurement systems
Added quarters, quarterContext, quarterWidth, and quarter to <calendars> for names of calendar quarters
Extended possible values for alt tag
Added ethiopic calendar to allowed calendar values
Clarified usage of quotation marks and alternate marks
Corrected example ISBN
Added eraNarrow to <calendars> for one-character version of era names
Added Appendix O, on lenient parsing
Added Section 3.1 Unknown or Invalid Identifiers
Other editing
Updated descriptions to final DTD and metadata.

Revision 5

The canonical form for variants is upper case
Addition of UN M.49 codes
Addition of persian and coptic calendar IDs
Clarification of alias inheritance
New XML references section
Modified revision and generation field format
Use of language display names for whole initial segments of locale IDs names, such as nl-BE
Addition of the inList element
Clarification of 'narrow'
Additional dateTimeFormat description
Names of calendar fields, and relative times.
New element currencySpacing
Descriptions of POSIX yes/no
New supplemental data elements/attributes (end of Appendix C)
- currency to/from
- languageData
- timezoneData
- territoryContainment
- mapTimezones
- alias
- deprecated
- characters
In dates, extension of era to 1..3
Clarification of year padding
Deprecation of localizedPatternChars
Use of the singleCountries list
Appendix L: canonical form
Misc editing
Revision 1 (2005-06-30): added link to Corrigenda.

Copyright © 2001-2012 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Pattern	Currency	Text
#,##0.##	n/a	1 234,57
#,##0.###	n/a	1 234,567
###0.#####	n/a	1234,567
###0.0000#	n/a	1234,5670
00000.0000	n/a	01234,5670
#,##0.00 ¤	EUR	1 234,57 €
#,##0.00 ¤	JPY	1 235 ¥
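
The non-currency rows of the table above can be reproduced with a short Python sketch. It is illustrative only and rests on several assumptions that the table itself does not state: the input value is 1234.567 (consistent with every output shown), the locale uses a space as grouping separator and a comma as decimal separator, and rounding is half-even. The function name `format_with_pattern` is hypothetical. The sketch handles only the placeholders `#`, `0`, `,` and `.`, with no prefixes, suffixes, quoting, currency signs, or secondary grouping.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_with_pattern(pattern: str, value: str,
                        group_sep: str = " ", decimal_sep: str = ",") -> str:
    """Format a decimal string with a simplified number pattern (sketch only)."""
    integer_part, _, fraction_part = pattern.partition(".")
    min_int = integer_part.count("0")    # each '0' forces a digit to appear
    min_frac = fraction_part.count("0")
    max_frac = len(fraction_part)        # each '0' or '#' allows one digit
    # Grouping size: number of placeholders after the last ',' (0 = no grouping).
    comma = integer_part.rfind(",")
    group_size = len(integer_part) - comma - 1 if comma >= 0 else 0

    # Round to the maximum number of fraction digits (half-even is an assumption).
    number = Decimal(value).quantize(Decimal(1).scaleb(-max_frac),
                                     rounding=ROUND_HALF_EVEN)
    sign = "-" if number < 0 else ""
    int_digits, _, frac_digits = format(abs(number), "f").partition(".")

    # Drop optional trailing zeros, then pad back up to the required minimum.
    frac_digits = frac_digits.rstrip("0").ljust(min_frac, "0")
    int_digits = int_digits.lstrip("0").zfill(min_int)

    if group_size:
        groups = []
        while len(int_digits) > group_size:
            groups.insert(0, int_digits[-group_size:])
            int_digits = int_digits[:-group_size]
        groups.insert(0, int_digits)
        int_digits = group_sep.join(groups)

    return sign + int_digits + (decimal_sep + frac_digits if frac_digits else "")


if __name__ == "__main__":
    for p in ["#,##0.##", "#,##0.###", "###0.#####", "###0.0000#", "00000.0000"]:
        print(p, "->", format_with_pattern(p, "1234.567"))
```

The separators are passed as parameters because in LDML they come from the locale's number symbols rather than from the pattern; the code point actually used for grouping in locale data may differ from the plain space used here.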

Symbol	Location	Localized?	Meaning
0	Number	Yes	Digit
1-9	Number	Yes	'1' through '9' indicate rounding.
@	Number	No	Significant digit
#	Number	Yes	Digit, zero shows as absent
.	Number	Yes	Decimal separator or monetary decimal separator
-	Number	Yes	Minus sign
,	Number	Yes	Grouping separator
E	Number	Yes	Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
+	Exponent	Yes	Prefix positive exponents with localized plus sign. Need not be quoted in prefix or suffix.
;	Subpattern boundary	Yes	Separates positive and negative subpatterns
%	Prefix or suffix	Yes	Multiply by 100 and show as percentage
‰ (\u2030)	Prefix or suffix	Yes	Multiply by 1000 and show as per mille
¤ (\u00A4)	Prefix or suffix	No	Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If tripled, uses the long form of the decimal symbol. If present in a pattern, the monetary decimal separator and grouping separators (if available) are used instead of the numeric ones.
'	Prefix or suffix	No	Used to quote special characters in a prefix or suffix, for example, `"'#'#"` formats 123 to `"#123"`. To create a single quote itself, use two in a row: `"# o''clock"`.
*	Prefix or suffix boundary	Yes	Pad escape, precedes pad character
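
The apostrophe rule in the table above can be sketched as a small helper that turns the quoted text of a prefix or suffix into its literal form. The name `unquote_affix` is hypothetical and not defined by LDML, and the sketch covers only the quoting rule: it expects the affix alone, without the digit placeholders that follow or precede it in a full pattern.

```python
def unquote_affix(affix: str) -> str:
    """Return the literal text of a pattern prefix or suffix (sketch only)."""
    out = []
    in_quote = False
    i = 0
    while i < len(affix):
        ch = affix[i]
        if ch == "'":
            # Two apostrophes in a row stand for one literal apostrophe.
            if i + 1 < len(affix) and affix[i + 1] == "'":
                out.append("'")
                i += 2
                continue
            # A single apostrophe opens or closes a quoted run.
            in_quote = not in_quote
        else:
            out.append(ch)
        i += 1
    if in_quote:
        raise ValueError("unterminated quote in affix")
    return "".join(out)


if __name__ == "__main__":
    print(unquote_affix("'#'"))        # prefix of the pattern '#'#
    print(unquote_affix(" o''clock"))  # suffix of the pattern # o''clock
```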

Pattern	Minimum significant digits	Maximum significant digits	Number	Output
`@@@`	3	3	12345	`12300`
`@@@`	3	3	0.12345	`0.123`
`@@##`	2	4	3.14159	`3.142`
`@@##`	2	4	1.23004	`1.23`
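
The significant-digit rows above can be checked with a minimal Python sketch. It assumes that each `@` contributes one minimum significant digit, that each trailing `#` raises the maximum by one, and that rounding is half-even. The function name `format_significant` is hypothetical, output uses a plain `.` as decimal separator without localization, and zero is left out of scope.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def format_significant(pattern: str, value: str) -> str:
    """Format a non-zero decimal string with an '@'/'#' pattern (sketch only)."""
    min_sig = pattern.count("@")
    max_sig = min_sig + pattern.count("#")
    number = Decimal(value)
    if number.is_zero():
        raise ValueError("zero is outside the scope of this sketch")

    # Round to at most max_sig significant digits.
    quantum = Decimal(1).scaleb(number.adjusted() - max_sig + 1)
    rounded = number.quantize(quantum, rounding=ROUND_HALF_EVEN)

    # Drop trailing zeros, but never show fewer than min_sig significant digits.
    significant = len(rounded.normalize().as_tuple().digits)
    keep = max(min_sig, significant)
    quantum = Decimal(1).scaleb(rounded.adjusted() - keep + 1)
    return format(rounded.quantize(quantum), "f")


if __name__ == "__main__":
    for pattern, value in [("@@@", "12345"), ("@@@", "0.12345"),
                           ("@@##", "3.14159"), ("@@##", "1.23004")]:
        print(pattern, value, "->", format_significant(pattern, value))
```

Using `Decimal` rather than binary floating point keeps the rounding step exact for inputs such as 0.12345, which cannot be represented precisely as a `float`.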
