Collation Service Architecture
Contents
- Overview
- Collator
- Input values for collation
- Collation Elements
- Sort Keys
- Collation Element Iterator
- Collation Attributes
- Performance
- Versioning
- Programming Examples
Overview
This section describes the design principles, architecture and coding conventions of the ICU Collation Service.
Collator
To use the Collation Service, a Collator must first be instantiated. An Collator is a data structure or object that maintains all of the property and state information necessary to define and support the specific collation behavior provided. Examples of properties described in the Collator are the locale, whether normalization is to be performed, and how many levels of collation are to be evaluated. Examples of the state information described in the Collator include the direction of a Collation Element Iterator (forward or backward) and the status of the last API executed.
The Collator is instantiated either by referencing a locale or by defining a custom set of rules (a tailoring).
The Collation Service uses the paradigm:
-
Open a Collator,
-
Use while necessary,
-
Close the Collator.
Collator instances cannot be shared among threads. You should open them instead, and use a different collator for each separate thread. The safe clone function is supported for cloning collators in a thread-safe fashion.
The Collation Service follows the ICU conventions for locale designation when opening collators:
-
NULL means the default locale.
-
The empty locale name (“”) means the root locale. The Collation Service adheres to the ICU conventions described in the “ICU Architectural Design “ section of the users guide. In particular:
-
The standard error code convention is usually followed. (Functions that do not take an error code parameter do so for backward compatibility.)
-
The string length convention is followed: when passing a
UChar *
, the length is required in a separate argument. If -1 is passed for the length, it is assumed that the string is zero terminated.
Collation locale and keyword handling
When a collator is created from a locale, the collation service (like all ICU services) must map the requested locale to the localized collation data available to ICU at the time. It does so using the standard ICU locale fallback mechanism. See the fallback section of the locale chapter for more details.
If you pass a regular locale in, like “en_US”, the collation service first searches with fallback for “collations/default” key. The first such key it finds will have an associated string value; this is the keyword name for the collation that is default for this locale. If the search falls all the way back to the root locale, the collation service will us the “collations/default” key there, which has the value “standard”.
If there is a locale with a keyword, like “de-u-co-phonebk” or “de@collation=phonebook”, the collation service searches with fallback for “collations/phonebook”. If the search is successful, the collation service uses the string value it finds to instantiate a Collator. If the search fails because no such key is present in any of ICU’s locale data (e.g., “de@collation=funky”), the service returns a collator implementing the default tailoring of the locale. If the fallback is all the way to the root locale, then the return UErrorCode
is U_USING_DEFAULT_WARNING
.
Input values for collation
Collation deals with processing strings. ICU generally requires that all the strings should be in UTF-16 format, and that all the required conversion should done before ICU functions are used. In the case of collation, there are APIs that can also take instances of character iterators (UCharIterator
) or UTF-8 directly.
Theoretically, character iterators can iterate strings in any encoding. ICU currently provides character iterator implementations for UTF-8 and UTF-16BE (useful when processing data from a big endian platform on an little endian machine). It should be noted, however, that using iterators for collation APIs has a performance impact. It should be used in situations when it is not desirable to convert whole strings before the operation - such as when using a string compare function.
Collation Elements
As discussed in the introduction, there are many possible orderings for sorted text, depending on language and other factors. Ideally, there is a way to describe each ordering as a set of rules for calculating numeric values for each string of text. The collation process then becomes one of simply comparing these numeric values.
This essentially describes the way the Collation Service works. To implement a particular sort ordering, first the relationship between each character or character sequence is derived. For example, a Spanish ordering defines the letter sequence “CH” to be between the letters “C” and “D”. As also discussed in the introduction, to order strings properly requires that comparison of base letters must be considered separately from comparison of accents. Letter case must also be considered separately from either base letters or accents. Any ordering specification language must provide a way to define the relationships between characters or character sequences on multiple levels. ICU supports this by using “<” to describe a relationship at the primary level, using “«” to describe a relationship at the secondary level, and using “«<” to describe a relationship at the tertiary level. Here are some example usages:
Symbol | Example | Description |
---|---|---|
< | c < ch | Make a primary (base letter) difference between “c” and the character sequence “ch” |
<< | a << ä | Make a secondary (accent) difference between “a” and “ä” |
<<< | a<<<A | Make a tertiary difference between “a” and “A” |
A more complete description of the ordering specification symbols and their meanings is provided in the section on Collation Tailoring.
Once a sort ordering is defined by specifying the desired relationships between characters and character sequences, ICU can convert these relationships to a series of numerical values (one for each level) that satisfy these same relationships.
This series of numeric values, representing the relative weighting of a character or character sequence, is called a Collation Element (CE). One possible encoding of a Collation Element is a 32-bit value consisting of a 16-bit primary weight, a 8-bit secondary weight, 2 case bits, and a 6-bit tertiary weight.
The sort weight of a string is represented by the collation elements of its component characters and character sequences. For example, the sort weight of the string “apple” would consist of its component Collation Elements, as shown here:
“Apple” | “Apple” Collation Elements |
---|---|
a | [1900.05.05] |
p | [3700.05.05] |
p | [3700.05.05] |
l | [2F00.05.05] |
e | [2100.05.05] |
In this example, the letter “a” has a 16-bit primary weight of 1900 (hex), an 8-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of 05 (hex).
String comparison is performed by comparing the collation elements of each string. Each of the primary weights are compared. If a difference is found, that difference determines the relationship between the two strings. If no differences are found, the secondary weights are compared and so forth.
With ICU it is possible to specify how many levels should be compared. For some applications, it can be desirable to compare only primary levels or to compare only primary and secondary levels.
Sort Keys
If a string is to be compared thousands or millions of times, it can be more efficient to use sort keys. Sort keys are useful in situations where a large amount of data is indexed and frequently searched. The sort key is generated once and used in subsequent comparisons, rather than repeatedly generating the string’s Collation Elements. The comparison of sort keys is a very efficient and simple binary compare of strings of unsigned bytes.
An important property of ICU sort keys is that you can obtain the same results by comparing 2 strings as you do by comparing the sort keys of the 2 strings (provided that the same ordering and related collation attributes are used).
An ICU sort key is a pre-processed sequence of bytes generated from a Unicode string. The weights for each comparison level are concatenated, separated by a “0x01” byte between levels. The entire sequence is terminated with a 0x00 byte for convenience in C APIs. (This 0x00 terminator is counted in the sort key length — unlike regular strings where the NUL terminator is excluded from the string length.)
ICU actually compresses the sort keys so that they take the minimum storage in memory and in databases.
Sort key size
One of the more important issues when considering using sort keys is the sort key size. Unfortunately, it is very hard to give a fast exact answer to the following question: “What is the maximum size for sort keys generated for strings of size X”. This problem is twofold:
-
The maximum size of the sort key depends on the size of the collation elements that are used to build it. Size of collation elements vary greatly and depends both on the alphabet in question and on the locale used.
-
Compression is used in building sort keys. Most ‘regular’ sequences of characters produce very compact sort keys.
If one is to assume the worst case and use too-big buffers, a lot of space will be wasted. However, if you use too-small buffers, you will lose performance if generated sort keys are longer than supplied buffers too often (and you have to reallocate for each of those). A good strategy for this problem would be to manually manage a large buffer for storing sortkeys and keep a list of indices to sort keys in this buffer (see the “large buffers” Collation Example for more details).
Here are some rules of a thumb, please do not rely on them. If you are looking at the East Asian locales, you probably want to go with 5 bytes per code point. For Thai, 3 bytes per code point should be sufficient. For all the other locales (mostly Latin and Cyrillic), you should be fine with 2 bytes per code point. These values are based on average lengths of sort keys generated with tertiary strength. If you need quaternary and identical strength (you should not), add 3 bytes per code point to each of these.
Partial sort keys
In some cases, most notably when implementing radix sorting, it is useful to produce only parts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing parts of sort keys (ucol_nextSortKeyPart
API). These sort keys may or may not be compressed; that is, they may or may not be compatible with regular sort keys.
Merging sort keys
Sometimes, it is useful to be able to merge sort keys. One example is having separate sort keys for first and last names. If you need to perform an operation that requires a sort key generated on the whole name, instead of concatenating strings and regenerating sort keys, you should merge the sort keys. The merging is done by merging the corresponding levels while inserting a terminator between merged parts. The reserved sort key byte value for the merge terminator is 0x02. For more details see UCA section 1.6, Merging Sort Keys.
- C API: unicode/ucol.h
ucol_mergeSortkeys()
- Java API:
com.ibm.icu.text.CollationKey merge(CollationKey source)
CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is intended to allow concatenating strings like firstName+\uFFFE+lastName to yield the same results as merging their individual sort keys. This has been fully implemented in ICU since version 53.
Generating bounds for a sort key (prefix matching)
Having sort keys for strings allows for easy creation of bounds - sort keys that are guaranteed to be smaller or larger than any sort key from a give range. For example, if bounds are produced for a sortkey of string “smith”, strings between upper and lower bounds with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds of upper bounds can be generated - the first one will match only strings of equal length, while the second one will match all the strings with the same initial prefix.
CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum primary weight, so that for example the string “smith\uFFFF” can be used as the upper bound rather than modifying the sort key for “smith”.
Collation Element Iterator
The collation element iterator is used for traversing Unicode string collation elements one at a time. It can be used to implement language-sensitive text search algorithms like Boyer-Moore.
For most applications, the two API categories, compare and sort key, are sufficient. Most people do not need to manipulate collation elements directly.
Example:
Consider iterating over “apple” and “äpple”. Here are sequences of collation elements:
String 1 | String 1 Collation Elements |
---|---|
a | [1900.05.05] |
p | [3700.05.05] |
p | [3700.05.05] |
l | [2F00.05.05] |
e | [2100.05.05] |
String 2 | String 2 Collation Elements |
---|---|
a | [1900.05.05] |
\u0308 | [0000.9D.05] |
p | [3700.05.05] |
p | [3700.05.05] |
l | [2F00.05.05] |
e | [2100.05.05] |
The resulting CEs are typically masked according to the desired strength, and zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength) produces the results of NULL secondary and tertiary differences. The collator then ignores the NULL differences and declares a match. For more details see the paper “Efficient text searching in Java™: Finding the right string in any language” by Laura Werner ( http://icu-project.org/docs/papers/efficient_text_searching_in_java.html).
Collation Attributes
The Collation Service has a number of attributes whose values can be changed during run time. These attributes affect both the functionality and the performance of the Collation Service. This section describes these attributes and, where possible, their performance impact. Performance indications are only approximate and timings may vary significantly depending on the CPU, compiler, etc.
Although string comparison by ICU and comparison of each string’s sort key give the same results, attribute settings can impact the execution time of each method differently. To be precise in the discussion of performance, this section refers to the API employed in the measurement. The ucol_strcoll
function is the API for string comparison. The ucol_getSortKey
function is used to create sort keys.
Note There is a special attribute value,
UCOL_DEFAULT
, that can be used to set any attribute to its default value (which is inherited from the UCA and the tailoring).
Attribute Types
Strength level
Collation strength, or the maximum collation level used for comparison, is set by using the UCOL_STRENGTH
attribute. Valid values are:
-
UCOL_PRIMARY
-
UCOL_SECONDARY
-
UCOL_TERTIARY
(default) -
UCOL_QUATERNARY
-
UCOL_IDENTICAL
French collation
The UCOL_FRENCH_COLLATION
attribute determines whether to sort the secondary differences in reverse order. Valid values are:
-
UCOL_OFF
(default): compares secondary differences in the order they appear in the string. -
UCOL_ON
: causes secondary differences to be considered in reverse order, as it is done in the French language.
Normalization mode
The UCOL_NORMALIZATION_MODE
attribute, or its alias UCOL_DECOMPOSITION_MODE
, controls whether text normalization is performed on the input strings. Valid values are:
-
UCOL_OFF
(default): turns off normalization check -
UCOL_ON
: normalization is checked and the collator performs normalization if it is needed.
X | FCD | NFC | NFD |
---|---|---|---|
A-ring | Y | Y | |
Angstrom | Y | ||
A + ring | Y | Y | |
A + grave | Y | Y | |
A-ring + grave | Y | ||
A + cedilla + ring | Y | Y | |
A + ring + cedilla | |||
A-ring + cedilla | Y |
With normalization mode turned on, the ucol_strcoll
function slows down by 10%. In addition, the time to generate a sort key also increases by about 25%.
Alternate handling
This attribute allows shifting of the variable characters (usually spaces and punctuation, in the UCA also most symbols) from the primary to the quaternary strength level. This is set by using the UCOL_ALTERNATE_HANDLING
attribute. For details see UCA: Variable Weighting, LDML: Collation Settings, and “Ignore Punctuation” Options.
-
UCOL_NON_IGNORABLE
(CLDR/ICU default): variable characters are treated as all the other characters -
UCOL_SHIFTED
(UCA default): all the variable characters will be ignored at the primary, secondary and tertiary levels and their primary strengths will be shifted to the quaternary level.
Case Ordering
Some conventions require uppercase letters to sort before lowercase ones, while others require the opposite. This attribute is controlled by the value of the UCOL_CASE_FIRST
. The case difference in the UCA is contained in the tertiary weights along with other appearance characteristics (like circling of letters). The case-first attribute allows for emphasizing of the case property of the letters by reordering the tertiary weights with either upper-first, and/or lowercase-first. This difference gets the most significant bit in the weight. Valid values for this attribute are:
-
UCOL_OFF
(default): leave tertiary weights unaffected -
UCOL_LOWER_FIRST
: causes lowercase letters and uncased characters to sort before uppercase -
UCOL_UPPER_FIRST
: causes uppercase letters to sort first
The case-first attribute does not affect the performance substantially.
Case level
When this attribute is set, an additional level is formed between the secondary and tertiary levels, known as the Case Level. The case level is used to distinguish large and small Japanese Kana characters. Case level could also be used in other situations. for example to distinguish certain Pinyin characters. Case level is controlled by UCOL_CASE_LEVEL
attribute. Valid values for this attribute are
-
UCOL_OFF
(default): no additional case level -
UCOL_ON
: adds a case level
Hiragana Quaternary
This setting is deprecated and ignored in recent versions of ICU.
Hiragana Quaternary can be set to UCOL_ON
, in which case Hiragana code points will sort before everything else on the quaternary level. If set to UCOL_OFF
Hiragana letters are treated the same as all the other code points. This setting can be changed on run-time using the UCOL_HIRAGANA_QUATERNARY_MODE
attribute. You probably won’t need to use it.
Variable Top
Variable Top is a boundary which decides whether the code points will be treated as variable (shifted to quaternary level in the shifted mode) or non-ignorable. Special APIs are used for setting of variable top. It can basically be set either to a codepoint or a primary strength value.
Performance
ICU collation is designed to be fast, small and customizable. Several techniques are used to enhance the performance:
-
Providing optimized processing for Latin characters.
-
Comparing strings incrementally and stopping at the first significant difference.
-
Tuning to eliminate unnecessary file access or memory allocation.
-
Providing efficient preflight functions that allows fast sort key size generation.
-
Using a single, shared copy of UCA in memory for the read-only default sort order. Only small tailoring tables are kept in memory for locale-specific customization.
-
Compressing sort keys efficiently.
-
Making the sort order be data-driven.
In general, the best performance from the Collation Service is expected by doing the following:
-
After opening a collator, keep and reuse it until done. Do not open new collators for the same sort order. (Note the restriction on multi-threading.)
-
Use
ucol_strcoll
etc. when comparing strings. If it is necessary to compare strings thousands or millions of times, create the sort keys first and compare the sort keys instead. Generating the sort keys of two strings is about 5-10 times slower than just comparing them directly. -
Follow the best practice guidelines for generating sort keys. Do not call
ucol_getSortKey
twice to first size the key and then allocate the sort key buffer and repeat the call to the function to fill in the buffer.
Performance and Storage Implications of Attributes
Most people use the default attributes when comparing strings or when creating sort keys. When they do want to customize the ordering, the most common options are the following :
UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED
Used to ignore space and punctuation characters
UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED
and UCOL_STRENGTH == UCOL_QUATERNARY
Used to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences.
UCOL_CASE_FIRST == UCOL_LOWER_FIRST
or UCOL_CASE_FIRST == UCOL_UPPER_FIRST
Used to change the ordering of upper vs. lower case letters (as well as small vs. large kana)
UCOL_CASE_LEVEL == UCOL_ON
and UCOL_STRENGTH == UCOL_PRIMARY
Used to ignore only the accent differences.
UCOL_NORMALIZATION_MODE == UCOL_ON
Force to always check for normalization. This is used if the input text may not be in FCD form.
UCOL_FRENCH_COLLATION == UCOL_OFF
This is only useful for languages like French and Catalan that may turn this attribute on. (It is the default only for Canadian French (“fr-CA”).)
In String Comparison, most of these options have little or no effect on performance. The only noticeable one is normalization, which can cost 10%-40% in performance.
For Sort Keys, most of these options either leave the storage alone or reduce it. Shifting can reduce the storage by about 10%-20%; case level + primary-only can decrease it about 20% to 40%. Using no French accents can reduce the storage by about 38% , but only for languages like French and Catalan that turn it on by default. On the other hand, using Shifted + Quaternary can increase the storage by 10%-15%. (The Identical Level also increases the length, but this option is not recommended).
Note All of the above numbers are based on tests run on a particular machine, with a particular set of data. (The data for each language is a large number of names in that language in the format
, .) The performance and storage may vary, depending on the particular computer, operating system, and data.
Versioning
Sort keys are often stored on disk for later reuse. A common example is the use of keys to build indexes in databases. When comparing keys, it is important to know that both keys were generated by the same algorithms and weightings. Otherwise, identical strings with keys generated on two different dates, for example, might compare as unequal. Sort keys can be affected by new versions of ICU or its data tables, new sort key formats, or changes to the Collator. Starting with release 1.8.1, ICU provides a versioning mechanism to identify the version information of the following (but not limited to),
-
The run-time executable
-
The collation element content
-
The Unicode/UCA database
-
The tailoring table
The version information of Collator is a 32-bit integer. If a new version of ICU has changes affecting the content of collation elements, the version information will be changed. In that case, to use the new version of ICU collator will require regenerating any saved or stored sort keys.
However, it is possible to modify ICU code or data without changing relevant version numbers, so it is safer to regenerate sort keys any time after any part of ICU has been updated.
Since ICU4C 1.8.1. it is possible to build your program so that it uses more than one version of ICU (only in C/C++, not in Java). Therefore, you could use the current version for the features you need and use the older version for collation.
Programming Examples
See the Collation Examples chapter for an example of how to compare and create sort keys with the default locale in C, C++ and Java.