Add language code vi-hani for monolingual text and lexemes
Closed, ResolvedPublic

Description

Please add the language code vi-hani and ko-hani to the list of language codes supported for monolingual text values and lexemes.

*The language code: vi-Hani
*Language name in the language itself or English:

  • English: Vietnamese (chữ Nôm)
  • Local name: 㗂越

*The used script, if not obvious: Hani
*Where and when the language was or is used: Mainly in Vietnam before the 20th century
*The Wikidata item id: chữ Nôm (Q875344)
*Sample use of language code: See reply

Usage example: See reply.

Note that according to https://www.wikidata.org/wiki/Help:Monolingual_text_languages , although the policy is not yet finished, a language code does not have to fulfill the requirements of the language proposal policy for new wikis, and in general a code is acceptable as long as the language code is valid.

Event Timeline

C933103 renamed this task from Add monolingual language code vi-hani, ko-hani to Add monolingual language code vi-hani, ko-kore. Nov 13 2017, 3:23 PM
C933103 updated the task description. (Show Details)

@GerardM what's your consideration on this request?

These seem reasonable to me, but I'd like to see if @Amire80 has any comments.

Ping @Amire80 Can you respond to jhsoby?

Adding chữ Nôm would be a nice improvement for Vietnamese-language Wikimedia projects as well as for federated projects. For example, an infobox at the Vietnamese Wikipedia could display the historical chữ Nôm name of a place or person. OpenStreetMap currently has a nonnegligible number of chữ Nôm names, even though the project is supposed to focus on current details rather than historical details. It’s likely that the people who contributed these names would be less likely to add them to OSM if they were able to add them to Wikidata, where historical nuances could be better accounted for (like the distinction between Ho Chi Minh City and Saigon).

Ping @Amire80 Could you please respond?

I'd just like to note that the Suppress-Script value for Korean according to the official subtag registry is in fact Kore (meaning ko-Kore as a code is redundant in the eyes of a number of organizations).
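To make the redundancy point concrete, here is a minimal sketch of how a Suppress-Script value could be applied when normalizing a tag. The table is a small illustrative subset: the `ko` entry comes from the comment above, while the `ja` and `vi` entries are my assumption about the registry, and `drop_redundant_script` is a hypothetical helper name, not part of any existing library.

```python
# Illustrative subset of Suppress-Script values. "ko" is taken from the
# comment above; "ja" and "vi" are assumptions about the subtag registry.
SUPPRESS_SCRIPT = {"ko": "Kore", "ja": "Jpan", "vi": "Latn"}


def drop_redundant_script(tag: str) -> str:
    """Remove the script subtag when it equals the language's Suppress-Script."""
    parts = tag.split("-")
    # A script subtag is four letters and directly follows the language subtag.
    if len(parts) >= 2 and len(parts[1]) == 4 and parts[1].isalpha():
        language, script = parts[0].lower(), parts[1].title()
        if SUPPRESS_SCRIPT.get(language) == script:
            return "-".join([language] + parts[2:])
    return tag
```

Under these assumptions `ko-Kore` collapses to `ko`, whereas `vi-hani` is left alone because Hani is not the suppressed script for Vietnamese.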

I think it would be useful to see some examples of how these would be used/what they would be used for.

I'm not sure that ko-kore (or ko-hani) would be the best way to add text containing hanja because wouldn't we want it to be linked to the corresponding hangul? Years ago, I proposed a "hanja" property for Korean (proposal here) . It was rejected at the time but I still think that would be the best way to do it and perhaps it would be a good idea to revisit it.

I imagine it's a similar situation for vi-hani.

Should the task be split into two?

I'd just like to note that the Suppress-Script value for Korean according to the official subtag registry is in fact Kore (meaning ko-Kore as a code is redundant in the eyes of a number of organizations).

The thing with the "Kore" script tag is that it indicates mixed use of Korean Hangul (phonetic) and Hanja (ideographic) characters, but it doesn't indicate the ratio.
Currently in regular Korean text, almost everything is written in Hangul. But from time to time there can still be a few common shorthand Hanja characters in use in Korean text, in addition to Hanja written out for disambiguation in some situations. So I guess you can say it is a mixed script as well.
On the other hand, what I originally had in mind was this: many Korean terms, especially proper nouns, originated from or were created based on Han characters and can be written entirely in Hanja, but it is not possible to write Hanja for native Korean terms or for terms imported from Western languages in modern times. As a result, there are terms where one part can be written in Hanja but the other parts need to be written in Hangul, like "Seoul Special City" or "Asiana Airlines", which constitutes another form of mixed usage.
I guess one could say it should use the "Hani" script tag instead, since the intention is to show the Hanja characters of the terms, but then the problem is that I don't think Hangul characters are expected under the Hani script value?

I think it would be useful to see some examples of how these would be used/what they would be used for.

I'm not sure that ko-kore (or ko-hani) would be the best way to add text containing hanja because wouldn't we want it to be linked to the corresponding hangul? Years ago, I proposed a "hanja" property for Korean (proposal here) . It was rejected at the time but I still think that would be the best way to do it and perhaps it would be a good idea to revisit it.

I imagine it's a similar situation for vi-hani.

I also proposed a chữ Nôm property for lexemes that was similarly rejected in favor of a semantically distinct property. I remain convinced about the need for the property I proposed, but I haven’t had a chance to press that point. For the time being, there isn’t really a path to bring the Vietnamese Wiktionary’s large body of chữ Nôm data to Wikidata in a semantically correct way.

Anyhow, that proposal only made sense in the context of a lexeme. A “chữ Nôm name” property on an item would have to be qualified by applies to name of item (P5168), with a constraint that it could only apply to a Vietnamese label, and there would have to be separate properties for “chữ Nôm given name”, “chữ Nôm birth name”, “chữ Nôm inscription”, etc. I think a monolingual language code would be cleaner.

Another issue is that the Vietnamese Wikisource currently hosts a number of older Vietnamese works with the Latin and chữ Nôm transcriptions either side by side or on separate pages. I suspect that, in the long term, better supporting this diglossia will require software support for vi-hani as a language code somewhere or other.

Esc3300 changed the task status from Open to Stalled. Jun 11 2021, 1:34 PM
Esc3300 subscribed.

This still lacks samples, despite the request at T180345#6710271 (see Requirements_for_a_new_language_code)

I think we should decline this. If still needed, new requests can be made (ideally one per language).

On the Vietnamese Wikipedia:

https://vi.wikipedia.org/wiki/Chữ_Hán

  • 名 “danh”
  • 書 “thư”
  • 文 “văn”
  • 字 “tự”
  • 漢字 “Hán tự”.

Or more specialized article: https://vi.wikipedia.org/wiki/Chữ_Nôm

Generally, in en.wiktionary, articles in Category:Sino-Vietnamese_words have their vi-Hani version.
https://en.wiktionary.org/wiki/Category:Sino-Vietnamese_words

Wikidata’s Vietnamese-language lexemes are currently using vi-x-Q8201 as the language code for chữ Nôm, as a workaround for this issue:

phở: 𬖾, 頗
râu: 鬍, 𩅺, 𩭶, 𩯁
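For anyone consuming these lexemes, a rough sketch of how such a workaround code could be recognized and split apart is below. The pattern only covers the `<language>-x-Q<number>` shape seen in this thread, and `split_private_use` is a hypothetical helper name of my own, not an existing Wikibase function.

```python
import re

# Hypothetical pattern: matches only tags shaped like "<language>-x-Q<digits>",
# which is the workaround form discussed in this task.
PRIVATE_USE_QID = re.compile(r"^([a-z]{2,3})-x-(Q[1-9][0-9]*)$", re.IGNORECASE)


def split_private_use(tag: str):
    """Return (language, item_id) for a workaround tag, or None for any other tag."""
    match = PRIVATE_USE_QID.match(tag)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).upper()
```

With this, `vi-x-Q8201` yields the language `vi` plus the item ID `Q8201`, while a regular tag such as `vi-hani` is reported as not being a workaround code.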

mxn changed the task status from Stalled to Open. Jan 16 2022, 7:52 AM

Is vi-x-Q8201 a standard? If it's not, why create a new name, please? Chữ Nôm is one variant of Hani characters (called Hán tự (漢字) in Vietnamese).

Hani is in the ISO 15924 standard (https://translatewiki.net/wiki/Category:ISO_15924:Hani), which is used on HTML pages, and the script was still used officially until the middle of the 20th century. It is still used at least by scholars and by some people in Vietnam who like to continue using it.

C933103 renamed this task from Add monolingual language code vi-hani, ko-kore to Add monolingual language code vi-hani. Mar 26 2022, 12:52 AM
C933103 updated the task description. (Show Details)

I have removed the Korean part of the ticket to focus on Vietnamese writing, due to the ambiguity of the "Kore" script tag in ISO 15924, as mentioned in December 2020.

"Hani" in ISO 15924 is Hanzi, Kanji, Hanja. In Vietnamese it's chữ Hán, Hán tự, Hán văn, or chữ Nho.

Chữ Nôm was a script invented and used exclusively in Vietnam. It uses Hán tự/Hanzi as a basis, but introduces a host of characters unique to Nôm. The Buddhist history Cổ Châu Pháp Vân phật bản hạnh ngữ lục (1752) gives the story of early Buddhism in Vietnam both in Hán tự/Hanzi script and in a parallel Nôm translation. The Jesuit Girolamo Maiorica (1605–1656) had also used parallel Hán tự/Hanzi and Nôm texts. By 1174, Hán tự/Hanzi had become the official writing script of the Vietnamese courts, mainly used by administration and literati, and continued to serve this role until the mid-19th century. Nôm gained parallel popularity among the literati for composing creative works.

In other words, "vi-Hani" should refer to Hán tự/Hanzi. A person reading Classical Chinese should be able to discern its meaning and not really notice the source is from Vietnam. But chữ Nôm is a different script altogether, and generally cannot be understood by Classical Chinese readers. Now, there might be some characters which have no Hanzi/Nôm distinction, like thiên (天) "heaven", but that just means Hanzi and Nôm have shared lexicons. If it's a Vietnamese dictionary entry for 天, I'd annotate it as "vi-Hani". But if it's a Nôm-exclusive character like 𡦂 (chữ) "letter, character, word", then unfortunately there is no script code for chữ Nôm. But even though this is how it should be, one also has to note that "Hani" is only understood as "Hanzi, Kanji, Hanja", not "Hanzi, Kanji, Hanja, chữ Nôm/Hán tự". I'm worried we would be sidestepping the ISO, even if we know what Hán tự is.

Japanese has Jpan for Han (Hani) + Hiragana (Hira) + Katakana (Kana).
Korean has Kore for Hangul (Hang) + Han (Hani).
Vietnamese needs its own chữ Nôm entry within ISO 15924. There is already precedent within it for other ancient and historical scripts. As a workaround, I suppose that, similar to using vi-x-Q8201 for Vietnamese written in Hán tự/Hanzi (although that should really just be annotated as vi-Hani), chữ Nôm will have to be vi-x-Q875344.
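The composite codes mentioned here can be modeled as sets of simple script codes, which makes it easy to see why Hani alone does not cover text that mixes in Hangul. This is only a sketch built from the two compositions named in this comment; `COMPOSITE_SCRIPTS`, `component_scripts` and `covers` are names I made up for illustration.

```python
# Composition of the composite script codes named in this comment only;
# this is not a complete copy of the ISO 15924 code list.
COMPOSITE_SCRIPTS = {
    "Jpan": {"Hani", "Hira", "Kana"},
    "Kore": {"Hang", "Hani"},
}


def component_scripts(script: str) -> set:
    """Expand a composite code into its parts; a simple code expands to itself."""
    return set(COMPOSITE_SCRIPTS.get(script, {script}))


def covers(script: str, needed: set) -> bool:
    """Return True if every script in `needed` is allowed under `script`."""
    return needed <= component_scripts(script)
```

For example, `covers("Kore", {"Hang", "Hani"})` holds, while `covers("Hani", {"Hang", "Hani"})` does not, which is the mixed-usage problem raised earlier for names that are only partly writable in Hanja.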

In other words, "vi-Hani" should refer to Hán tự/Hanzi. A person reading Classical Chinese should be able to discern its meaning and not really notice the source is from Vietnam.

According to ISO 639, vi is the code for Vietnamese. vi-Hani cannot represent chữ nho because chữ nho is not a written form of the Vietnamese language per se. But lzh does represent Classical Chinese, including chữ nho.

But even though this is how it should be, one also has to note that "Hani" is only understood as "Hanzi, Kanji, Hanja", not "Hanzi, Kanji, Hanja, chữ Nôm/Hán tự". I'm worried we would be sidestepping the ISO, even if we know what Hán tự is.

Is it technically correct to interpret the English name of Hani in ISO 15924 so literally? After all, chữ Nôm differs from chữ nho in terms of language usage and many specific characters, but the CJKV writing systems have been unified into a single subset of Unicode, including chữ Nôm. On the other hand, ISO 15924 does contain codes for typographic variants of scripts like Nastaliq Arabic (Aran) and Gaelic Latin (Latg).

Ken Lunde (2009) writes that chữ Nôm had a dedicated script code of Cu in ISO 15924:2004 (which has since been superseded by ISO 15924:2022). Does anyone have more information about these two-letter codes or why chữ Nôm didn’t get a four-letter code corresponding to this two-letter code?

Vietnamese needs its own chữ Nôm entry within ISO 15924. There is already precedent within it for other ancient and historical scripts. As a workaround, I suppose that, similar to using vi-x-Q8201 for Vietnamese written in Hán tự/Hanzi (although that should really just be annotated as vi-Hani), chữ Nôm will have to be vi-x-Q875344.

You do have a point that vi-x-Q875344 would be more correct than vi-x-Q8201 (and more specific, anyhow).

"Hani" simply means "Chinese [Han] characters".
"vi-Hani" means "Vietnamese, written in Chinese [Han] characters".
Chu Nho, despite being widely used in Vietnam in ancient times, is written according to Classical Chinese grammar, and as such should be classified as Classical Chinese text, with code "lzh", similar to comparable works from Japan, Korea, and other neighboring regions.
The ISO code Jpan exists for the mixed use of kana together with kanji in Japanese text, which is still the common writing system for Japanese nowadays.
The ISO code Kore exists for the mixed use of Hangul together with Hanja in Korean text. Although Hanja's role in the Korean language has greatly diminished, it is still not unexpected to see Hanja in modern Korean text, hence the code "Kore", which represents Hanja+Hangul, is still the default code for the Korean writing system, at least in South Korea.
On the other hand, I do not think the mixed use of Han characters with other writing systems, say the Latin alphabet, is an expected usage in Vietnam nowadays, hence I don't think it is necessary to apply for a new ISO 15924 code to reflect such mixed use.

As for "characters common between Chinese characters as used in China vs characters that only exist in Chu Nom", note that both Japanese and Korean also have some Han characters uniquely created for their own countries, but they simply treat them as part of the Han characters in their language, in the same way as all other imported Han characters, and they would be tagged with script code "Hani".

But I do note that one thing that separates Vietnamese Chu Nom from those unique Japanese/Korean characters is that there is a large number of them, and they are formed according to certain rules for many indigenous Vietnamese words. They can be treated as Han characters, and they also follow the typical ways of forming Han characters by combining the meanings and sounds of characters, and thus the "Hani" code is applicable. But I think it is also not impossible to apply for another ISO 15924 code, given how Traditional Chinese and Simplified Chinese, which differ much less from each other, still received their individual codes. On the other hand, however, the classification of Hans versus Hant is necessary for rendering text in two different writing systems that are both part of the Chinese language, but that is not really the case for Vietnamese when Chu Nho is coded lzh.

As for other ISO 15924 codes like Latg, I don't think they are comparable, as they represent different characters and different ways to write the language. I guess a more comparable case would be saying that seal scripts should get their own ISO 15924 codes.

Ken Lunde (2009) writes that chữ Nôm had a dedicated script code of Cu in ISO 15924:2004 (which has since been superseded by ISO 15924:2022). Does anyone have more information about these two-letter codes or why chữ Nôm didn’t get a four-letter code corresponding to this two-letter code?

That table is the same as the table in the Japanese version of the book, published in 2002, so it would predate the official publication of ISO 15924. Given that Unicode maintains a list of changes to ISO 15924 since the standard's official publication in 2004, and the list doesn't include Chu Nom, I would assume the table in the book reflects content from a draft of the standard that didn't make it into the final list.

Is it technically correct to interpret the English name of Hani in ISO 15924 so literally? After all, chữ Nôm differs from chữ nho in terms of language usage and many specific characters, but the CJKV writing systems have been unified into a single subset of Unicode, including chữ Nôm. On the other hand, ISO 15924 does contain codes for typographic variants of scripts like Nastaliq Arabic (Aran) and Gaelic Latin (Latg).

I think we can agree that if it were explicit, it would be a clear sign of intent and leave little room for misinterpretation. I just don't like getting ahead of the standards, but this is a minor point in any case. That Latin has variants shows that it is possible to distinguish chữ Nôm as its own variant, but it is equally true that chữ Nôm doesn't have an ISO 15924 code of its own. But yes, with C933103's comment, I kinda see the argument for putting all Han orthography under the "Hani" code. I found this argument in favor of it from 2006 on a Hanja Wikipedia proposal page. While there are constructions/compositions of Han characters into combinations that might only be intelligible to the Vietnamese literati of the time, it still follows the same rules of construction of Classical Chinese script in general. I'm studying the Dictionarium Annamiticum Lusitanum et Latinum, and although Latinized Middle Vietnamese had b-with-flourish ꞗ, apex diacritic, and a few other conventions no longer in use in modern Vietnamese orthography, it doesn't mean it merits its own Latin variation code.

If we look at an example classical text, Chinh Phụ Ngâm Khúc, which is written in Han script, and later translated into Nôm script, if I were to encode this and give a user a choice between Han and Nôm, it would seem to me that Han is 'lzh' (Hani is implied I suppose), and Nôm would be 'vi-Hani'. As long as there is a way to distinguish between the two texts, then I'm happy for it.
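As a rough illustration of that choice, the two versions could be stored as monolingual text values and selected by language code. The placeholder strings stand in for the actual passages, and `VERSIONS` and `pick_version` are hypothetical names for this sketch only.

```python
# Placeholder strings stand in for the actual passages of the two versions.
VERSIONS = [
    {"language": "lzh", "text": "<Han-script original>"},
    {"language": "vi-hani", "text": "<chữ Nôm translation>"},
]


def pick_version(versions, language):
    """Return the text whose language code matches, ignoring case, else None."""
    wanted = language.lower()
    for version in versions:
        if version["language"].lower() == wanted:
            return version["text"]
    return None
```

Asking for `vi-Hani` returns the Nôm version and `lzh` returns the Han original, so the two texts stay distinguishable, which is all this comment asks for.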

But I think it is also not impossible to apply for another ISO 15924 code, given how Traditional Chinese and Simplified Chinese which have much less different from each others still received their individual code.

If it ever happens, then we'll jump on that bandwagon. I think 'vi-Hani' is still a step in the right direction, compared to using 'vi-x-Q875344' for a rather currently pervasive usage among the Wikimedia projects.

Hoi,
We can only support scripts supported in Unicode. So what is the font to be
used?
Thanks,

GerardM

Hoi,
We can only support scripts supported in Unicode.

As mentioned in previous posts, chữ Nôm characters are already supported in Unicode, having been added through the CJK Unified Ideographs extensions.

So what is the font to be
used?

Please see this list of fonts: https://en.m.wikipedia.org/wiki/Template:Vi-nom/fonts.css

By 1174, Hán tự/Hanzi had become the official writing script of the Vietnamese courts, mainly used by administration and literati, and continued to serve this role until the mid-19th century.

As said previously, Hán tự/Hanzi was still used during French colonization: there are documents from the mid-20th century (including on Commons) that still have both chữ Nôm and Latin-script Vietnamese, as well as some French colonial stamps. Some people continue to use it today; there are several sites written in this script, and we can still see some videos using it as subtitles.

So what is the font to be
used?

Please see this list of fonts: https://en.m.wikipedia.org/wiki/Template:Vi-nom/fonts.css

For reference, here are similar font lists used by other WMF wikis:

Some of these fonts are specifically designed for chữ Nôm. The Nôm character repertoire in Unicode has expanded several times. The early favorite of the Vietnamese wikis, HAN NOM A/B, contains most Nôm-specific characters, but many of them are encoded in the Private Use Area because the font predates CJK Unified Ideographs Extension E in Unicode 10. Aside from coverage, selecting a dedicated Nôm font is important because some important characters have distinct forms in each CJKV tradition that nevertheless got unified into a single Unicode codepoint.
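Since text prepared for those older fonts may still contain Private Use Area code points, a small check like the following could flag values that need re-encoding before they are stored under a language code. It is a sketch only; the three ranges are the Unicode private-use ranges, and `private_use_chars` is a hypothetical helper name.

```python
# The three private-use ranges defined by Unicode.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def private_use_chars(text: str) -> list:
    """Return the characters of `text` that fall in a private-use range."""
    return [
        ch
        for ch in text
        if any(low <= ord(ch) <= high for low, high in PRIVATE_USE_RANGES)
    ]
```

A string made only of encoded ideographs such as 天 yields an empty list, while any character that a legacy font placed in the Private Use Area is returned for review.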

By 1174, Hán tự/Hanzi had become the official writing script of the Vietnamese courts, mainly used by administration and literati, and continued to serve this role until the mid-19th century.

As said previously, Hán tự/Hanzi was still used during French colonization: there are documents from the mid-20th century (including on Commons) that still have both chữ Nôm and Latin-script Vietnamese, as well as some French colonial stamps. Some people continue to use it today; there are several sites written in this script, and we can still see some videos using it as subtitles.

This ticket tracks adding a monolingual language code, which would be useful regardless of exactly when chữ nho or chữ Nôm fell out of the mainstream. After all, we’ve already been resorting to workarounds like vi-x-Q875344 in Wikidata lexemes for a while now, but no such workaround is possible for a native label statement on a family name item or inscription statement on a commemorative plaque item, for example.

Wikidata’s Vietnamese-language lexemes are currently using vi-x-Q8201 as the language code for chữ Nôm, as a workaround for this issue:

phở: 𬖾, 頗
râu: 鬍, 𩅺, 𩭶, 𩯁

You do have a point that vi-x-Q875344 would be more correct than vi-x-Q8201 (and more specific, anyhow).

Due to T236593, it became necessary to split out chữ Nôm forms as separate lexemes. A lexeme’s language is stored as a QID, so I’ve used chữ Nôm (Q875344) as the lexeme language instead of chữ Hán (Q8201), which is too general. For consistency, I’ve changed the lemmas and form representations to use vi-x-Q875344 instead of vi-x-Q8201, even though vi-hani would be a suitable monolingual language code to replace it with.
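If vi-hani becomes available, the switch away from the workaround code could look roughly like this. The dictionary shape is a simplified assumption modeled on a language-keyed term map, not an exact copy of the lexeme JSON, and `retag` is a hypothetical helper.

```python
def retag(terms: dict, old: str, new: str) -> dict:
    """Return a copy of a language-keyed term map with code `old` renamed to `new`."""
    result = {}
    for code, term in terms.items():
        if code == old:
            # Rename both the key and the language field stored inside the term.
            result[new] = {**term, "language": new}
        else:
            result[code] = dict(term)
    return result


# Simplified, assumed shape of a lemma map for one chữ Nôm lexeme.
lemmas = {"vi-x-Q875344": {"language": "vi-x-Q875344", "value": "𬖾"}}
migrated = retag(lemmas, "vi-x-Q875344", "vi-hani")
```

The original map is left untouched, so the old and new values can be compared before anything is saved.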

Nikki renamed this task from Add monolingual language code vi-hani to Add language code vi-hani for monolingual text and lexemes. Oct 27 2022, 4:46 PM
Nikki updated the task description. (Show Details)

Change 1009741 had a related patch set uploaded (by Nikki; author: Nikki):

[mediawiki/extensions/cldr@master] Add English names for script variants that have been requested

https://gerrit.wikimedia.org/r/1009741

Question/problem:

  • There are three Han-related writing systems: Chữ Hán, Chữ Nôm, Hán Nôm (Chữ Hán + Chữ Nôm)
    • Currently, they all use the Hani code
    • If we choose Chữ Nôm for 'vi-hani', which codes are we going to use for Hán Nôm and Chữ Hán?

Change #1009741 merged by jenkins-bot:

[mediawiki/extensions/cldr@master] Add English names for script variants that have been requested

https://gerrit.wikimedia.org/r/1009741

Nikki claimed this task.

This is available now.