Where is my Character?
If you are trying to find a specific character in the
Unicode Standard, the first place to go is the
code charts.
The code charts are organized into blocks, which are groupings of related characters.
For each character defined in Unicode you will find an assigned
code point: a hexadecimal number that is used to represent
that character in computer data.
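As a small illustration of the link between a character and its code point, the following Python sketch uses the standard `unicodedata` module. The helper name `describe` is made up for this example, and the names it returns depend on the Unicode version bundled with your Python build.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    code_point = ord(ch)                      # integer value of the code point
    name = unicodedata.name(ch, "<unnamed>")  # formal character name, if any
    return f"U+{code_point:04X} {name}"

print(describe("A"))
print(describe("\u203B"))
```

The reverse direction is also possible: `unicodedata.lookup("REFERENCE MARK")` returns the character for a known formal name.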
The very term character is rather
ambiguous, and may be interpreted broadly or narrowly. In this
document, we'll use a very broad sense. For more details, see
UTR #17: Character Encoding Model.
You may not find the character in what you think is
the obvious spot. While the characters in Unicode are grouped into
blocks, this is only a rough grouping because characters can be
categorized many different ways. In particular, punctuation and
symbols are applicable across a very wide range of usages and
scripts (writing systems). Even the notion of a script itself
is not well-defined; text in a given language may make use of
characters from multiple scripts. For example, the digits 0-9 are in
widespread use; the Devanagari danda is used across many
Indic scripts.
Thus you may need to look in several locations to
find your character. If you are using the book, you
may find the printed character index in the back of the standard
helpful. The same data is available online as a plain text file,
Index. Or you can use the web version of the
Unicode Character Name Index.
You can also do a text search in the online
Unicode names list. For example, suppose you were searching for a
"Japanese kome", the character ※. By opening up the
NamesList.txt in your browser, and
searching for "Japanese kome", you would find it under the entry:
203B REFERENCE MARK
= Japanese kome
= Urdu paragraph separator
x (tibetan ku ru kha bzhi mig can - 0FBF)
Documentation regarding the syntax conventions of the online
Unicode names list can be found in
Names List File Format.
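The text search described above can also be scripted. The following is only a rough sketch: it works on the excerpt quoted above in the simplified layout shown there, whereas the real NamesList.txt is tab-delimited and has more line types than this handles. The names `SAMPLE`, `ENTRY`, and `find_by_alias` are invented for the illustration.

```python
import re

# Excerpt in the simplified layout shown above; the real file is tab-delimited.
SAMPLE = """\
203B REFERENCE MARK
= Japanese kome
= Urdu paragraph separator
x (tibetan ku ru kha bzhi mig can - 0FBF)
"""

# An entry line starts at column 0 with a hexadecimal code point.
ENTRY = re.compile(r"^([0-9A-F]{4,6})\s+(\S.*)$")

def find_by_alias(text: str, query: str) -> list[tuple[str, str]]:
    """Return (code point, name) pairs whose name or '=' alias contains query."""
    hits = []
    current = None  # (code point, name) of the entry being read
    needle = query.lower()
    for raw in text.splitlines():
        match = ENTRY.match(raw)
        if match:
            current = (match.group(1), match.group(2).strip())
            candidate = current[1]
        elif raw.strip().startswith("=") and current is not None:
            candidate = raw.strip()[1:]  # alias text after the '=' sign
        else:
            continue  # cross references and other lines are ignored here
        if needle in candidate.lower() and current not in hits:
            hits.append(current)
    return hits

print(find_by_alias(SAMPLE, "Japanese kome"))
```

Matching is case-insensitive, so a search for "kome" or "KOME" finds the same entry.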
For Han characters (Chinese, Japanese, and Korean) you can find
the character you are looking for by using the printed Han
Radical-Stroke Index in the book or by using the online
Unihan Database.
There are auxiliary charts that present the Unicode characters
organized in different ways. You may sometimes find these useful in
finding your character. For example, see the
Collation charts, Script charts, Case Mapping charts, or
Normalization charts. If you know what legacy character encoding
your character is in, you might be able to find it in the
ICU Character Set Mapping Tables.
You may not find a character simply because the
charts do not specify the exact shape; they only provide a
representative shape for identification. For example, a lowercase
Cyrillic p could appear with any of the following character
shapes (also called glyphs). The second is customary for italic in
Russia, and the third is customary for italic in Serbia:
Cyrillic p | Russian Italic | Serbian Italic
[glyph image] | [glyph image] | [glyph image]
Characters may also take on different shapes in
different contexts. So, for example, the Arabic character hah
may have four different basic shapes.
Representative shape in code chart | Possible shapes in context
[glyph image] | [four glyph images]
The character you are looking for may be represented
as a sequence of code points in Unicode. Here are examples of
such characters, and their representation as a sequence of code
points.
Character | Code Points | Linguistic Usage
ch | 0063 0068 | Slovak, traditional Spanish
tʰ | 0074 02B0 | Native American languages
x̣ | 0078 0323 | Native American languages
ƛ̓ | 019B 0313 | Native American languages
ą́ | 00E1 0328 | Lithuanian
i̇́ | 0069 0307 0301 | Lithuanian
ト゚ | 30C8 309A | Ainu in kana transcription
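To make the idea of a sequence concrete, here is a brief Python sketch that builds each string from the code points in the table above and checks that none of them collapses to a single code point, even after composition. The names `SEQUENCES` and `from_hex` are invented for the example; `unicodedata.normalize` is the standard library call.

```python
import unicodedata

# Sequences copied from the table above (hexadecimal code points).
SEQUENCES = [
    "0063 0068", "0074 02B0", "0078 0323", "019B 0313",
    "00E1 0328", "0069 0307 0301", "30C8 309A",
]

def from_hex(sequence: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in sequence.split())

for seq in SEQUENCES:
    text = from_hex(seq)
    composed = unicodedata.normalize("NFC", text)
    # Each of these still needs more than one code point after composition.
    print(seq, "->", text, f"({len(composed)} code points after NFC)")
```

Whether a sequence displays as a single unit depends on the font and rendering system, not on the code points alone.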
Similarly, you won't find the Indic half-forms in the code
charts, since they are formed with a consonant + halant (virama).
For example:
Representative shapes in code chart | Display appearance
[glyph images] | [glyph image]
Other Devanagari ligatures such as ksha are coded with
sequences, as shown in Table 12-4: Sample Devanagari Half-Forms of
the core specification. For example:
Representative shapes in code chart | Display appearance
[glyph images] | [glyph image]
In addition, the joining control characters can be used to
request specific appearances, as in Figure 12-8 of the core specification. For example:
Representative shapes in code chart | Display appearance
[glyph images] | [glyph image]
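The following Python sketch spells out the kind of sequences discussed above. The choice of ka and ssa (which together form ksha) is an assumption made for this illustration, and the variable names are invented. How each sequence actually appears depends on the font and rendering engine.

```python
import unicodedata

KA, SSA = "\u0915", "\u0937"       # Devanagari consonants ka and ssa
VIRAMA = "\u094D"                  # halant (virama)
ZWJ, ZWNJ = "\u200D", "\u200C"     # joining control characters

ksha = KA + VIRAMA + SSA                    # normally shown as the ksha ligature
half_ka = KA + VIRAMA + ZWJ + SSA           # requests the half-form of ka
visible_virama = KA + VIRAMA + ZWNJ + SSA   # requests an explicit virama

def names(text: str) -> list[str]:
    """List the formal character names of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

for label, text in [("ksha", ksha), ("half-form", half_ka),
                    ("explicit virama", visible_virama)]:
    print(label, [f"{ord(ch):04X}" for ch in text], names(text))
```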
Unfortunately, there are not yet such detailed block descriptions
for all Indic scripts, so it may not be clear exactly which
sequences to use. These should be forthcoming. In the
meantime, you may get an answer by asking on the general
Unicode public e-mail list.
In some rare instances, you will find apparently
identical characters. In most cases, if not all, this is to maintain
compatibility with the original source standards for Unicode:
vendor, national, and international character standards in wide
usage in 1990. For example, there are duplicate encodings in the
following case:
Å (00C5) | Capital letter A with ring
Å (212B) | Angstrom sign
There are also particular shapes of characters that
are given separate code points in Unicode, such as the shapes of the
Arabic character hah listed above. These were also added to
Unicode because of pre-existing standards.
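A short Python sketch can show how these compatibility duplicates relate to their ordinary counterparts. The code points used here (00C5 and 212B for the A with ring pair, 062D for hah, and FEA1 through FEA4 for its presentation forms) are supplied from the Unicode Character Database rather than from the text above, and the constant and function names are invented for the example.

```python
import unicodedata

A_RING, ANGSTROM = "\u00C5", "\u212B"
HAH = "\u062D"
HAH_FORMS = ["\uFEA1", "\uFEA2", "\uFEA3", "\uFEA4"]  # separately coded shapes

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

# The two code points differ, but canonical normalization unifies them.
print(A_RING == ANGSTROM, nfc(ANGSTROM) == A_RING)

# Each presentation form maps back to the basic letter under
# compatibility normalization.
for form in HAH_FORMS:
    print(f"U+{ord(form):04X}", unicodedata.name(form),
          "->", f"U+{ord(nfkc(form)):04X}")
```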
For compatibility with pre-existing standards, there are
characters that are equivalently represented either as sequences of
code points or as a single code point called a composite
character. For example, the i with two dots in naïve
could be represented either as i + diaeresis (0069 0308)
or as the composite character i with diaeresis (00EF).
There are other cases where the order of two combining characters
does not matter. For example, the pair of combining characters
acute and dot-below can occur with either one first; both
alternate orders are equivalent. The rules for when order is
significant are precisely spelled out by the Unicode Standard.
Due to the requirements for uniqueness — especially on the
Internet — Unicode provides for a unique format, called Form C.
This format always picks one of the equivalent code points (or
sequences of code points) and not the other. It also picks a
specific order where there are alternatives. For more information,
see UTR #15: Unicode Normalization Forms.
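As a minimal sketch of what Form C does in practice, the Python standard library exposes normalization through `unicodedata.normalize`. The variable names are invented for the example, and the base letter "a" in the second case is an arbitrary choice.

```python
import unicodedata

def nfc(text: str) -> str:
    """Convert text to Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# Two equivalent spellings of the same word.
decomposed = "nai\u0308ve"   # i + combining diaeresis (0069 0308)
composed = "na\u00EFve"      # composite character (00EF)
print(decomposed == composed, nfc(decomposed) == composed)

# Combining acute (0301) and dot below (0323) in either order.
acute_first = "a\u0301\u0323"
dot_first = "a\u0323\u0301"
print(acute_first == dot_first, nfc(acute_first) == nfc(dot_first))
```

Comparing strings only after normalizing both sides is a common way to treat equivalent spellings as equal.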
In a very few cases, Unicode separates glyphs as
distinct characters on the basis of whether they are treated as
letters or not. For example, the following characters are
distinguished on this basis, even though the range of possible
shapes is the same.
ʹ (02B9) | Modifier letter prime. Treated as a letter. Used to transcribe the "soft" sign in Cyrillic.
′ (2032) | Prime. Treated as a punctuation mark or symbol. Used in mathematics, and as a symbol for minutes (fractions of degrees).
In those rare cases where this occurs, to decide
which character to use you should consult the text of the Unicode
Standard.
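The text of the standard remains the authority, but a quick first check is the General Category property, which Python exposes as `unicodedata.category`. The sketch below assumes the two characters are U+02B9 and U+2032, which is what their names correspond to in the Unicode Character Database.

```python
import unicodedata

MODIFIER_PRIME, PRIME = "\u02B9", "\u2032"

for ch in (MODIFIER_PRIME, PRIME):
    # Categories starting with "L" are letters; "P" and "S" are
    # punctuation and symbols.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```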
Simply because a character or sequence of
characters may have a different sorting order does not
qualify it to be given a separate code point in Unicode. For
more information, see
UTR #10: Unicode
Collation Algorithm.
Finally, your character may not yet be encoded in
Unicode. There is a well-defined
submission process for new characters or scripts. This process
verifies that the proposed character is in fact a candidate for
encoding. In some cases, this process may not be straightforward.
Because the Unicode Standard and ISO 10646 are
synchronized in character codes, both organizations need to agree to
the encoding of new characters. This process can require some time
before a new character is accepted into the standard, and some time
beyond that before it is fully supported in products.