Wikidata:Lexicographical data/Documentation


This is the main documentation page for lexicographical data on Wikidata. It gives general information about Wikidata lexemes: how they are structured, how they may be edited, and what may be added to enrich them.

Note that while the information on this page may be broadly applicable across most languages, what works for modeling one language will not always work for modeling another language. For information about modeling lexemes for specific languages, visit the documentation pages for them.

More technical documentation may separately be found for the WikibaseLexeme extension for MediaWiki, which provides support for lexemes on Wikidata.

A Glossary of Wikidata Lexicographical terms is available.

Data model

The data model of WikibaseLexeme describes the structure of the data that is handled as "lexemes" in Wikidata. The text below is merely a summary; for more detailed information, see the corresponding WikibaseLexeme documentation page.

A lexeme is a lexical element of a language, such as a word, a phrase, or a prefix. (More information about lexemes in general may be found on Wikipedia.)

Lexemes, like items and properties, are also Wikibase entities; they too have individual identifiers and can be separately accessed and queried.

There are seven components of a lexeme, described in each of the following subsections:

its LID;
its lemmata;
its language;
its lexical category;
its (top-level) statements;
its senses; and
its forms.
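
The seven components listed above can be pictured as a small record type. What follows is only an illustrative sketch in Python: the class and field names are hypothetical and are not taken from WikibaseLexeme, and the lexical category Q1084 is assumed here to be the item for "noun".

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    sense_id: str                                           # e.g. "L3746552-S4"
    glosses: dict[str, str] = field(default_factory=dict)   # language code -> gloss
    statements: list[dict] = field(default_factory=list)


@dataclass
class Form:
    form_id: str                                                    # e.g. "L3746552-F4"
    representations: dict[str, str] = field(default_factory=dict)   # language code -> string
    grammatical_features: list[str] = field(default_factory=list)   # QIDs of items
    statements: list[dict] = field(default_factory=list)


@dataclass
class Lexeme:
    lid: str                       # lexeme ID, e.g. "L3435"
    lemmata: dict[str, str]        # language code -> lemma
    language: str                  # QID of the language item
    lexical_category: str          # QID of the lexical category item
    statements: list[dict] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)
    forms: list[Form] = field(default_factory=list)


# A nearly empty lexeme, as it would look right after creation.
umbrella = Lexeme(
    lid="L3435",
    lemmata={"en": "umbrella"},
    language="Q1860",              # English
    lexical_category="Q1084",      # assumed to be "noun"
)
```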

Lexeme ID

Lexemes have identifiers starting with an "L" followed by a number using the digits 0-9, such as L3746552. These IDs (often called "LIDs", for "lexeme identifiers") are unique within Wikidata and are assigned automatically when a lexeme is created.

The RDF URI for a lexeme is http://www.wikidata.org/entity/ followed by the lexeme ID.
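
As a minimal sketch of the two rules above, the following Python checks that a string has the shape of an LID and builds the corresponding RDF URI. The function names are hypothetical, and the pattern only checks the shape "L" plus digits; it may accept strings that Wikidata would never actually assign.

```python
import re

ENTITY_BASE = "http://www.wikidata.org/entity/"
LID_PATTERN = re.compile(r"L[0-9]+")


def is_lid(text: str) -> bool:
    """Return True if text looks like a lexeme ID ("L" followed by digits)."""
    return LID_PATTERN.fullmatch(text) is not None


def lexeme_uri(lid: str) -> str:
    """Build the RDF URI for a lexeme from its LID."""
    if not is_lid(lid):
        raise ValueError(f"not a lexeme ID: {lid!r}")
    return ENTITY_BASE + lid


print(lexeme_uri("L3746552"))
```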

Lexeme lemmata

Further information: /Lemmata

The lemmata (singular lemma) of a lexeme are primarily used as human-readable representations of the lexeme. Each lemma consists of a string accompanied by a valid IETF language tag. Usually lemmata are the written forms of a word, phrase, or affix that would be found in a dictionary describing them, whether or not they are considered the 'base' or 'stem' forms morphologically.

e.g. the English lexeme Lexeme:L3435 has the lemma 'umbrella' because most English dictionaries provide information about this lexeme under the heading 'umbrella' and not under something like 'umbrellas' or "umbrella's" or "umbrellas'".
e.g. the Italian lexeme Lexeme:L1196965 has the lemma 'volare' because most Italian dictionaries provide information about it under that heading and not under something like 'volo', 'volante', or 'volato'.
e.g. the Korean lexeme Lexeme:L17 has the lemma '먹다' because most Korean dictionaries provide information about it under that form, rather than something like '먹-', '먹어', or even '먹습니다'.

Lexemes can have several lemmata, particularly when there are differences in the writing system or other orthographic conventions within a given language. Different lemmata are indicated with different language tags, and a lexeme may only have one lemma for a given language tag.

e.g. the Hindustani lexeme Lexeme:L641622 has two lemmata, 'चाचा' with code hi and 'چاچا' with code ur, which are representations of the same dictionary form (pronounced /t͡ʃɑː.t͡ʃɑː/) in the Devanagari script (used for Hindi) and the Arabic script (used for Urdu).
e.g. the Hebrew lexeme Lexeme:L63672 has two lemmata, 'אדום' with code he and 'אָדֹם' with code he-x-Q21283070, which reflect differences in how the same word form is spelt depending on whether diacritics are present.
e.g. the Southern Min lexeme Lexeme:L308008 has three lemmata, '城市' with code nan-hani, 'siânn-tshī' with code nan-x-Q56929, and 'siâⁿ-chhī' with code nan-x-Q559173. These represent using either Chinese characters or one of two romanization systems, each corresponding to the same word form.
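
The rule that a lexeme may have only one lemma per language tag behaves like a mapping keyed by the tag. The Python sketch below illustrates this with the Hindustani example; the helper name `add_lemma` is hypothetical, and the sketch compares tags exactly as written rather than applying any normalization.

```python
def add_lemma(lemmata: dict[str, str], language_tag: str, text: str) -> None:
    """Add a lemma, refusing a second lemma for the same language tag."""
    if language_tag in lemmata:
        raise ValueError(f"a lemma with tag {language_tag!r} already exists")
    lemmata[language_tag] = text


lemmata: dict[str, str] = {}
add_lemma(lemmata, "hi", "चाचा")   # Devanagari script
add_lemma(lemmata, "ur", "چاچا")   # Arabic script
```

A second call with the tag `hi` would raise a `ValueError`, mirroring the one-lemma-per-tag restriction.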

Note that some of the language codes above contain an '-x-' in them. There are two main reasons this would be present in a language code:

For languages whose language codes are not yet supported, the last-resort option is to use the base code mis followed by a private-use subtag containing the QID of the Wikidata item for the language.
- e.g. lexemes in Polabian (Q36741), such as Lexeme:L1089491, have a lemma with the code mis-x-Q36741.
- e.g. lexemes in Soyot (Q4426878), such as Lexeme:L1015954, have a lemma with the code mis-x-Q4426878.
- e.g. lexemes in Láadan (Q35757), such as Lexeme:L623039, have a lemma with the code mis-x-Q35757.
If a language has a supported language code, but a particular variety or script of it does not, the private-use subtag may be attached directly to the existing supported code.
- e.g. lexemes in the Varendri (Q48726757) variety of Bengali, such as Lexeme:L672268, have a lemma with the code bn-x-Q48726757 (where 'bn' is the existing supported code).
- e.g. lemmata in Devanagari Sindhi (Q116688933) for lexemes in Sindhi use the language code sd-x-q116688933 (where 'sd' is the existing supported code).
- e.g. lemmata in the Adlam (Q19606346) script for lexemes in Fula use the language code ff-x-q19606346 (where 'ff' is the existing supported code).
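
Both cases above follow the same pattern: a base code, then '-x-', then a QID. A minimal Python sketch of that pattern follows; the function name is hypothetical, and the sketch keeps the letter case of the QID as given, whereas some of the codes quoted above appear with a lowercase 'q'.

```python
import re

QID_PATTERN = re.compile(r"Q[0-9]+")


def private_use_tag(qid: str, base_code: str = "mis") -> str:
    """Build a language code with a private-use subtag holding a QID.

    With the default base code "mis" this covers languages with no
    supported code; pass a supported code to mark a variety of it.
    """
    if QID_PATTERN.fullmatch(qid) is None:
        raise ValueError(f"not a QID: {qid!r}")
    return f"{base_code}-x-{qid}"


print(private_use_tag("Q36741"))           # Polabian
print(private_use_tag("Q48726757", "bn"))  # Varendri, attached to Bengali
```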

Lexeme lemmata are what are displayed when using the {{L}} template to link to a lexeme on Wikidata (including later on this page).

Lexeme language

Further information: /Lexeme languages and /Languages

The language to which a lexeme belongs is a reference to a Wikidata item for a language.

For most languages, this is a straightforward determination: English (Q1860), Thai (Q9217), Manchu (Q33638), and Gun (Q3111668) are just four of the many possibilities, since they have the supported language codes en, th, mnc, and guw, respectively.

For some languages, however, particular language items are now required for their lexemes; see the documentation pages for those languages for more information.

Lexical category

Further information: /Lexical categories

The lexical category to which a lexeme belongs is a reference to a Wikidata item for a particular group of words with specific syntactic behavior in a language. This usually corresponds with the "part of speech" of the lexeme: nouns, verbs, adjectives, adverbs, and so on.

The lexical category of a lexeme should be fairly general, even where a more specific description would also fit, so that it broadly reflects how the lexeme behaves syntactically in its language. More specific items like count noun (Q1520033), separable verb (Q3254028), and relative pronoun (Q1050744), where applicable, should be added as the values of instance of (P31) statements instead.

Different languages may use different lexical categories, but some are frequent enough across languages that a comparison may be made. See the full documentation page on lexical categories for a table comparing such categories across languages.

Lexeme statements

Further information: /Lexeme statements

Lexemes, like items or properties, have statements (claims) that provide information about the lexeme that is not specific to one of its forms or senses. Depending on how a particular language works, and depending on the lexical category of the lexeme, some statements will be more applicable to a given lexeme than others.

Many common properties applicable directly to lexemes are listed in Template:Lexicographical properties.

Lexeme senses

Further information: /Senses

Senses describe the different meanings of a lexeme.

A sense consists of three parts: 1) the sense ID, 2) glosses, and 3) statements.

The sense ID starts with the ID of the lexeme it belongs to, followed by a hyphen ("-") and an "S", followed by a natural number in decimal notation: e.g. L3746552-S4. These IDs are unique within Wikidata; when a new sense is created within a lexeme, an entirely new sense ID is provided for it. Like an LID, a sense ID may be appended to http://www.wikidata.org/entity/ to form a unique URI for the sense.
Glosses define the meaning of the sense using natural language. For a lexeme in a given language X, the gloss in language X should be a more detailed explanation of the meaning of the sense, while the glosses in other languages Y and Z may be less detailed, so long as they make the meaning of the sense clear to speakers of Y and Z.
Like lexemes, items, and properties, senses can have statements further describing the sense and its relations to other senses and to Wikidata items.

Many common properties applicable to lexeme senses are listed in Template:Lexicographical properties.

Lexeme forms

Further information: /Forms

Forms describe the different realizations of a lexeme in speech or writing.

Depending on how a language behaves morphologically, there may be exactly one form of a lexeme or there may be multiple forms. In general, the more isolating or analytic or the more agglutinative or polysynthetic a language is, the more it may benefit from having one form per lexeme. Lexemes in many fusional languages typically have multiple forms for particular combinations of grammatical features.

A form consists of four parts: 1) the form ID, 2) form representations, 3) grammatical features, and 4) statements.

The form ID starts with the ID of the lexeme it belongs to, followed by a hyphen ("-") and an "F", followed by a natural number in decimal notation: e.g. L3746552-F4. These IDs are unique within Wikidata; when a new form is created within a lexeme, an entirely new form ID is provided for it. Like an LID or a sense ID, a form ID may be appended to http://www.wikidata.org/entity/ to form a unique URI for the form.
Form representations are strings, accompanied by language tags, that signify how a particular form is used. As with lemmata, there may be multiple representations on a single form to handle differences in writing system or orthographic variation within a language.
Grammatical features are references to Wikidata items that define the syntactic circumstances in which a given form applies.
Like lexemes, senses, items, and properties, forms can have statements further describing the form and its relations to other forms and to Wikidata items.
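
Form IDs and sense IDs share one shape: the lexeme ID, a hyphen, a letter ("F" or "S"), and a number. The Python sketch below splits such an ID into its parts and builds its URI; the names `SubEntityId`, `parse_sub_entity_id`, and `entity_uri` are hypothetical and used only for illustration.

```python
import re
from typing import NamedTuple

ENTITY_BASE = "http://www.wikidata.org/entity/"
SUB_ENTITY_PATTERN = re.compile(r"(L[0-9]+)-([SF])([0-9]+)")


class SubEntityId(NamedTuple):
    lexeme_id: str   # ID of the lexeme the sense or form belongs to
    kind: str        # "sense" or "form"
    number: int      # the number after "S" or "F"


def parse_sub_entity_id(text: str) -> SubEntityId:
    """Split a sense ID or form ID into lexeme ID, kind, and number."""
    match = SUB_ENTITY_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a sense or form ID: {text!r}")
    kind = "sense" if match.group(2) == "S" else "form"
    return SubEntityId(match.group(1), kind, int(match.group(3)))


def entity_uri(sub_entity_id: str) -> str:
    """Build the URI for a sense or form after checking the ID's shape."""
    parse_sub_entity_id(sub_entity_id)  # raises ValueError if malformed
    return ENTITY_BASE + sub_entity_id


print(parse_sub_entity_id("L3746552-F4"))
print(entity_uri("L3746552-S4"))
```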

Many common properties applicable to lexeme forms are listed in Template:Lexicographical properties.

Lexeme inclusion criteria

In some cases or languages, there may be multiple entities for related words, whereas in other languages there may be just one. The table below provides an overview of how nouns in particular may be linked:

One or several lexemes for nouns?
difference in	1 lexeme		2+ lexemes
sense	add several senses		add applicable sense to lexeme	link other(s) with homograph lexeme	duplicate forms on each
etym.	add etym. to each sense		add etym. to lexeme base	link other(s) with homograph lexeme	duplicate forms on each
gender	add gender to each sense		add gender to lexeme base	link other(s) with homograph lexeme	duplicate forms on each
common/proper	add several senses	use lexical category "noun"	add applicable sense to lexeme	link other(s) with homograph lexeme	duplicate forms on each
caps/lowercase	add several forms	qualify forms to applicable senses	add applicable sense to lexeme	link other(s) with homograph lexeme	add only applicable forms
singular/plural	add several forms	qualify forms to applicable senses	add applicable sense	if possible link other(s) with homograph lexeme	add only applicable forms
pronunciation	add the same form twice	qualify forms to applicable senses, add pronunciation	add applicable sense	if possible link other(s) with homograph lexeme	add form and applicable pronunciation
forms/spelling	add several forms or alternate forms	qualify forms to applicable senses	add applicable sense	if possible link other(s) with homograph lexeme	add only applicable forms

For a given language and criterion (first column), just one of the two approaches (one lexeme or several lexemes) might apply.

Interface

The following section details steps to take in Wikidata's user interface to perform common tasks involving editing lexemes.

Lexemes

Create a new lexeme

Go to Special:NewLexeme.
Under Lemma, enter a lemma (see #Lexeme lemmata for more information).
Under Lexeme's language, enter the language of the lexeme, either by typing the name of the language or its QID (see #Lexeme language for more information).
1. If you are prompted to do so, under Spelling variant of the Lemma, enter the language code of the lemma (see #Lexeme lemmata for more information).
Under Lexical category, enter the lexical category of the lexeme, either by typing its name or its QID (see #Lexical category for more information).
Click "Create" to save your changes.

You have now created a lexeme with the most basic information. Because it is nearly empty, it cannot meaningfully be used until more information is added to it, such as statements, senses, and forms (for which see later in this page).
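
The information entered in the steps above (lemma, spelling variant, language, and lexical category) can also be written down as a JSON document. The Python sketch below builds such a document; it assumes the key names `lemmas`, `language`, and `lexicalCategory` from the WikibaseLexeme JSON layout, which should be checked against the WikibaseLexeme documentation. The function name is hypothetical, Q1084 is assumed to be the item for "noun", and the sketch only builds the data without sending it anywhere.

```python
import json


def new_lexeme_data(lemmata: dict[str, str], language_qid: str,
                    lexical_category_qid: str) -> dict:
    """Build a minimal lexeme document from its most basic information."""
    if not lemmata:
        raise ValueError("a lexeme needs at least one lemma")
    return {
        "type": "lexeme",
        # One entry per spelling variant (language code) of the lemma.
        "lemmas": {
            code: {"language": code, "value": text}
            for code, text in lemmata.items()
        },
        "language": language_qid,
        "lexicalCategory": lexical_category_qid,
    }


data = new_lexeme_data({"en": "umbrella"}, "Q1860", "Q1084")
print(json.dumps(data, ensure_ascii=False, indent=2))
```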

Edit a lexeme's lemmata, language, or lexical category

Next to the lemmata, click the 'edit' button.
Lemmata may be edited as follows:
1. To add a lemma, first select the "+" that appears beside the lemmata.
2. In the new lemma, under Lemma, add the representation of the new lemma.
3. Also in the new lemma, under Spelling variant, add the language code of the new lemma.
4. To remove a particular lemma, simply select the "x" appearing beside Lemma in that lemma.
To change the language of the lexeme, use the search box appearing beside Language to pick an item for a language.
To change the lexical category of the lexeme, use the search box appearing beside Lexical category to pick an item for a lexical category.
Click "publish" to save your changes.

Add, edit or delete a lexeme's statements

Screenshot of the interface to edit a statement

Adding a statement to a lexeme entails the following steps:

Click "add statement"
Enter a property, typing its name in the property field (such as derived from lexeme) and selecting it in the suggester.
Enter a value for the property.
Note (tracked in Phabricator as task T271500): A Wikidata property for lexicographic senses (Q54275340) such as translation (P5972) or synonym (P5973) does not currently support searching for senses, either by lexeme lemmata or by sense glosses. Searching by name will not find lexemes or their senses, so in order to enter a value for such a statement, you need to enter the precise sense ID of the sense you want as a value; searching by that ID returns a publishable result.
If you wish to add qualifiers and references to the statement, feel free to do so.
Save the statement by clicking "publish".
To edit a statement, click "edit".
To delete a statement, click "edit", then click "remove".

Delete a lexeme

To delete a lexeme, you may request its deletion at Wikidata:Requests for deletions, just as is done with items. If you have the Merge gadget enabled, you may submit deletion requests for lexemes using it.

Search for a Lexeme

To look for a lexeme via Special:Search or the search box on any page, you may use its LID, one of its lemmata, or a representation of one of its forms.

The simplest way to do this is to prefix "L:" to one of these, and you will automatically see results in the lexeme namespace for your search. For example, lexeme L301993 has the lemma "হৃদয়" and one of its forms has the representation "হৃদয়েতে". Searching for "L:L301993", "L:হৃদয়", or "L:হৃদয়েতে" will return the same lexeme in the results.

You may alternatively search without the "L:" prefix (e.g. using "L301993", "হৃদয়", or "হৃদয়েতে"), then select the "Lexeme" namespace in the Search in: selector and rerun the search to get the same lexeme returned.

Note that the selector (the drop-down menu that pops up to suggest results) does not support the lexeme namespace yet. Pressing Enter or clicking the search icon after typing your keyword, however, will show you the results.
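
As a rough illustration of the "L:" prefix approach, the Python sketch below builds a link to Special:Search for a given LID, lemma, or form representation. It assumes the usual MediaWiki URL form `index.php?title=Special:Search&search=…` on www.wikidata.org; the function name is hypothetical, and the sketch only constructs the URL without performing a request.

```python
from urllib.parse import urlencode

INDEX_URL = "https://www.wikidata.org/w/index.php"


def lexeme_search_url(term: str) -> str:
    """Build a Special:Search URL that searches for term with the "L:" prefix."""
    query = {"title": "Special:Search", "search": "L:" + term}
    # urlencode percent-encodes the colon and any non-ASCII characters.
    return INDEX_URL + "?" + urlencode(query)


print(lexeme_search_url("L301993"))
```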

Senses

Create a new sense

In the Senses section of a lexeme, click "add Sense".
Under Language, enter a language code for the gloss.
Under Gloss, enter the gloss.
To add new glosses, click "add" and repeat steps 2 and 3.
Click "publish" to save your changes.
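
The language and gloss pairs entered in the steps above can be pictured as a small JSON document. The Python sketch below assumes the `glosses` key of the WikibaseLexeme JSON layout, which should be verified against the WikibaseLexeme documentation; the function name is hypothetical and the gloss text is invented for illustration rather than copied from Wikidata.

```python
def new_sense_data(glosses: dict[str, str]) -> dict:
    """Build a minimal sense document from language code -> gloss pairs."""
    if not glosses:
        raise ValueError("a sense needs at least one gloss")
    return {
        "glosses": {
            code: {"language": code, "value": text}
            for code, text in glosses.items()
        }
    }


# Illustrative gloss only; not the actual gloss stored on Wikidata.
sense = new_sense_data({"en": "handheld canopy used as protection against rain or sun"})
```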

Edit a sense's glosses

Next to the sense glosses, click "edit".
To add a new gloss, do the following:
1. Underneath the existing sense glosses, click the smaller "add" link. (Be careful that you do not accidentally click on the add statement or add Sense links used to add a new statement or sense instead!)
2. Under Language, enter a language code for the new gloss.
3. Under Gloss, enter the new gloss.
4. Repeat these steps for each new gloss you wish to add.
To remove a gloss, click "remove" next to the gloss.
Click "publish" to save your changes.

Remove a sense

Next to the sense glosses, click "edit".
Click "remove".

Forms

Create a new form

In the Forms section of a lexeme, click "add Form".
Under Representation, fill in a representation for the new form.
Under Spelling variant, fill in the language code for that representation.
To add more representations, click the "+" next to the existing representations and repeat steps 2 and 3 for the new representation.
Next to Grammatical features, enter one or several grammatical features, by typing their name and selecting them in the list of items that appears.
Click "publish" to save your changes.
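
The representations and grammatical features entered in the steps above can likewise be pictured as a JSON document. The Python sketch below assumes the `representations` and `grammaticalFeatures` keys of the WikibaseLexeme JSON layout, which should be verified against the WikibaseLexeme documentation. The function name is hypothetical, and Q146786 is assumed to be the item for "plural".

```python
def new_form_data(representations: dict[str, str],
                  grammatical_features: list[str]) -> dict:
    """Build a minimal form document.

    representations maps a language code to the written form;
    grammatical_features is a list of QIDs of Wikidata items.
    """
    if not representations:
        raise ValueError("a form needs at least one representation")
    return {
        "representations": {
            code: {"language": code, "value": text}
            for code, text in representations.items()
        },
        "grammaticalFeatures": list(grammatical_features),
    }


# Plural form of the English lexeme "umbrella"; Q146786 is assumed to be "plural".
form = new_form_data({"en": "umbrellas"}, ["Q146786"])
```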

Edit a form's representations or grammatical features

Next to the form's representations, click "edit".
Representations may be edited as follows:
1. To add a representation, first select the "+" that appears beside the representations.
2. In the new representation, under Representation, add the new representation for the form.
3. Also in the new representation, under Spelling variant, add the language code for that representation.
4. To remove a particular representation, simply select the "x" appearing beside Representation in that representation.
To add a grammatical feature, type its name at the end of the text box and select the appropriate item in the list of items that appears.
To remove a grammatical feature, click the "x" that appears next to it.
Click "publish" to save your changes.

Delete a form

Next to the form's representations, click "edit".
Click "remove".

Features

What is included in the first version

New datatypes: Lexeme, Form
Add, edit, delete Lexemes
Add, edit, delete Forms
Add, edit, delete statements
Add, edit, delete qualifiers
Add, edit, delete references
Linking to an Item from a Lexeme or a Form
- item for this sense (P5137)
Linking to another Lexeme from a Lexeme, a Form or an Item
Search and suggestions when entering a value
Basic internal APIs (used for UI, you should not use them)

What will be added in the future

Ordered from near-term to long-term plans

Search for content with Special:Search (done)
Display the lemma in the history pages, recent changes and watchlist (done)
Add, edit, delete Senses (done)
RDF support and ability to query the data on query.wikidata.org (done)
Better API support
Automatic generation of Forms
Data access on clients (other Wikimedia projects) (done)
Editing data directly from Wiktionary