User talk:Uziel302
Welcome to Wikidata, Uziel302!
Wikidata is a free knowledge base that you can edit! It can be read and edited by humans and machines alike and you can go to any item page now and add to this ever-growing database!
Need some help getting started? Here are some pages you can familiarize yourself with:
- Introduction – An introduction to the project.
- Wikidata tours – Interactive tutorials to show you how Wikidata works.
- Community portal – The portal for community members.
- User options – including the 'Babel' extension, to set your language preferences.
- Contents – The main help page for editing and using the site.
- Project chat – Discussions about the project.
- Tools – A collection of user-developed tools to allow for easier completion of some tasks.
Please remember to sign your messages on talk pages by typing four tildes (~~~~); this will automatically insert your username and the date.
If you have any questions, don't hesitate to ask on Project chat. If you want to try out editing, you can use the sandbox. Once again, welcome, and I hope you quickly feel comfortable here and become an active editor for Wikidata.
Best regards! Cycn
Merging items
Hello Uziel302,
For merging items, you may want to use the merge.js gadget from the help page about merging. It has an option "Request deletion for extra items on RfD" to automatically place a request to delete the emptied page. This way of nominating makes it a lot easier for the admins to process the requests.
With regards, - - (Cycn/talk) 06:34, 2 April 2014 (UTC)
Hebrew Lexemes
Hi, I saw you added many Hebrew words - if you know Python, we could work together on importing data from https://he.wiktionary.org (sadly I don't speak Hebrew, but I wrote Lexicator, which imported over 100,000 words from ru wiktionary). --Yurik (talk) 21:13, 18 September 2019 (UTC)
- Yurik, thanks for the offer. I am willing to participate; I don't know much Python but I can learn. I didn't fully get how you filtered CC0 content from Wiktionary. Anyway, hewiktionary has fewer than 20,000 pages, so the potential there is limited. I would like a simple script where I can put in a list of words with their POS and sense and it will create a lexeme. Currently QuickStatements doesn't support creating lexemes. Uziel302 (talk) 21:29, 18 September 2019 (UTC)
- According to this paper, words and their forms are not copyrightable, but senses are. If you can create a full list of things (words, their forms, all other statements that should be on them), I could upload that data. Note that you should first figure out what information you want to store. Currently I only see you added the words, but made no statements about them (e.g. how to pronounce them, their forms, or any other data). --Yurik (talk) 22:31, 18 September 2019 (UTC)
- Yurik, I found a source for public-domain senses: Wikidata itself. I query the words I want and get relevant senses from the descriptions of their Q items.
SELECT ?item ?itemLabel ?desc ?endesc ?enlabel
WHERE {
  ?item rdfs:label ?itemLabel filter (lang(?itemLabel) = "he").
  VALUES ?itemLabel {
    "אלבום"@he
    "אלבני"@he
    "אלבנית"@he
  } .
  OPTIONAL {?item schema:description ?desc filter (lang(?desc) = "he").}
  OPTIONAL {?item schema:description ?endesc filter (lang(?endesc) = "en").}
  OPTIONAL {?item rdfs:label ?enlabel filter (lang(?enlabel) = "en").}
}
- Of course this requires some manual filtering, since not every concept eligible as a Q item is actually a lexicographical sense; there are many names etc. I started working on it here. The table is of existing Hebrew lexemes that I want to add senses for. Please let me know if uploading the senses to Wikidata is possible with your bot. Thanks, Uziel302 (talk) 10:05, 21 September 2019 (UTC)
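A minimal sketch of running the query above against the Wikidata Query Service from Python, assuming only the requests library; the endpoint is real, but the User-Agent string and the result handling are illustrative:

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
query = """
SELECT ?item ?itemLabel ?desc ?endesc ?enlabel WHERE {
  ?item rdfs:label ?itemLabel filter (lang(?itemLabel) = "he").
  VALUES ?itemLabel { "אלבום"@he "אלבני"@he "אלבנית"@he } .
  OPTIONAL {?item schema:description ?desc filter (lang(?desc) = "he").}
  OPTIONAL {?item schema:description ?endesc filter (lang(?endesc) = "en").}
  OPTIONAL {?item rdfs:label ?enlabel filter (lang(?enlabel) = "en").}
}
"""
# WDQS asks clients to send a descriptive User-Agent.
r = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                 headers={"User-Agent": "sense-import-sketch/0.1"})
for row in r.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row.get("desc", {}).get("value", ""))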
- In theory it might be easier to fix QuickStatements (if I can actually dig up its code) than to write a totally new script for that, since, as you said, you already have the list of words. My bot is mostly geared towards wiktionary parsing (which is the harder part). --Yurik (talk) 15:23, 21 September 2019 (UTC)
- Yurik, thanks for the reply. It would be very useful to expand the abilities of QuickStatements, which so far has only partial support for lexemes. I just thought it might be more complicated to dig into its code. I thought it might be easier for you to create a simple bot to enter senses, based on what you already wrote. You could also use the method of querying Wikidata for adding senses in Russian. As for parsing wiktionaries, I wouldn't invest in Hebrew wiktionary, which is small and dull; I would recommend using your script for the largest Wiktionary - English. Uziel302 (talk) 15:31, 21 September 2019 (UTC)
- It is true that QuickStatements is harder to modify than writing a short script, but the value is much bigger, so we might as well invest in that. --Yurik (talk) 16:27, 21 September 2019 (UTC)
- Yurik, I am currently running the import of senses using LexData, a Python framework for lexemes. Thanks to the developer user:MichaelSchoenitzer. Uziel302 (talk) 07:15, 23 September 2019 (UTC)
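For reference, a minimal sketch of what such a sense import looks like with LexData, following the examples in its documentation; the credentials, lexeme ID, and glosses below are placeholders:

import LexData

# Log in (the username/bot password here are placeholders) and load
# an existing lexeme by its L-id.
repo = LexData.WikidataSession("Uzielbot", "bot-password")
lexeme = LexData.Lexeme(repo, "L1")

# createSense takes a dict of language code -> gloss.
lexeme.createSense({"he": "גלוסה לדוגמה", "en": "an example gloss"})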
User:Yurik, I managed to get a structured list of 500k forms of Hebrew words using Hspell, a Linux tool for Hebrew spell checking and morphological analysis. Will you be able to help me upload it here? The structure is: base word (which may already appear here), form, POS, gender of the form, plural or singular, and tense if it is a verb. Uziel302 (talk) 07:34, 4 October 2019 (UTC)
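A sketch of reading such a list in Python, assuming a hypothetical tab-separated file whose columns follow the structure just described (base word, form, POS, gender, number, tense):

import csv
from collections import defaultdict

# Group the extracted forms by base word so each lexeme is touched once.
# "hspell_forms.tsv" and its column order are assumptions.
forms_by_base = defaultdict(list)
with open("hspell_forms.tsv", encoding="utf-8") as f:
    for base, form, pos, gender, number, tense in csv.reader(f, delimiter="\t"):
        forms_by_base[base].append(
            {"form": form, "pos": pos, "gender": gender,
             "number": number, "tense": tense})

print(len(forms_by_base), "base words")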
- @Uziel302: loading is easy, but I think the he-wiktionary community should figure out 1) how to store those (it took the ru-wikt community over a month to iron out all the nuances), 2) if/how you want to link to those lexemes (see the lexeme link box on the right in the example in ruwikt), and 3) whether there are any license issues with that data. Thanks! --Yurik (talk) 08:37, 4 October 2019 (UTC)
- P.S. See Wikidata_talk:Lexicographical_data -- there were a number of discussions there about ruwikt structures (search for "yurik" there). --Yurik (talk) 08:39, 4 October 2019 (UTC)
- Yurik, thanks for the references. I am currently not focused on using the data on hewiktionary, since they have high standards of inclusion: each word should have a sense and a usage example. I may offer some templates for voluntary use on hewiktionary. I found out how to upload the forms I extracted, but no one has responded to my bot request: Wikidata:Requests for permissions/Bot/Uzielbot. Do you know whom I should talk to? Uziel302 (talk) 18:36, 6 October 2019 (UTC)
Template
Hello.
Can you translate en:Template:Baku landmarks and upload it to the Hebrew Wikipedia? Most of the articles in this list already exist in the Hebrew Wikipedia.
Yours sincerely, Karalainza (talk) 04:41, 17 November 2019 (UTC)
- Karalainza, this is too much work: going link by link and copying the Hebrew version of each. You can do it yourself in a sandbox and I'll go over it and move it to the template namespace. Uziel302 (talk) 06:04, 17 November 2019 (UTC)
- Thank you for the advice. I will work on it. Karalainza (talk) 07:53, 17 November 2019 (UTC)
LexData and multiple change in one history edit
Hi,
I'm starting to look more and more into LexData (see Wikidata talk:Lexicographical data#Adding forms with LexData for a simple example of what I did recently). I would like to use it more, but the number of edits is bugging me a bit. And then I saw Wikidata:Requests for permissions/Bot/Uzielbot. Could you explain to me how to do multiple changes in one edit?
Cheers, VIGNERON (talk) 17:10, 11 March 2020 (UTC)
- VIGNERON, sorry for misleading you, I later found out that I can only edit multiple existing forms by using the wbeditentity API call and sending a relevant JSON with it. As far as I know there is no way to create a new form/sense using this API, and one needs a separate creation edit for each one. Maybe User:Yurik can explain how he managed to send API calls that create multiple forms in one edit. Uziel302 (talk) 17:28, 11 March 2020 (UTC)
- VIGNERON, I explored User:Yurik's code and it turns out he uses wbeditentity to upload JSONs of the full lexemes. wbeditentity does support uploading new claims, forms and senses; I only had a little issue formatting the JSON, so I thought it didn't work. You can read the JSONs of existing items in order to get the JSON right. I just uploaded a lexeme with a few claims in one edit, using this code. Let me know if you need any help. Uzielbot (talk) 15:13, 12 March 2020 (UTC)
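A minimal sketch of the wbeditentity approach described above, assuming a requests session that is already logged in (for example via a bot password); per the Wikibase API, new forms in the JSON are marked with an "add" key, and the lexeme ID, representations, and grammatical features below are placeholders:

import json
import requests

API = "https://www.wikidata.org/w/api.php"
session = requests.Session()  # assumed to be already logged in

# Fetch a CSRF token for the write.
token = session.get(API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

# Add two new forms to an existing lexeme in a single edit.
data = {"forms": [
    {"add": "", "representations": {"he": {"language": "he", "value": "ספר"}},
     "grammaticalFeatures": ["Q110786"], "claims": {}},  # Q110786 = singular
    {"add": "", "representations": {"he": {"language": "he", "value": "ספרים"}},
     "grammaticalFeatures": ["Q146786"], "claims": {}},  # Q146786 = plural
]}
r = session.post(API, data={
    "action": "wbeditentity", "id": "L1", "data": json.dumps(data),
    "token": token, "format": "json",
})
print(r.json())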
Latin Lexemes
Hi,
Your bot imported a lot of Latin Lexemes, which is wonderful. Now I'm taking Latin classes to refresh my Latin and I have a question about your import. country (P17) is very strange; the values are also strange (most are not countries: internationality (Q1072012), Near East (Q48214), Roman Egypt (Q202311) and Africa (Q15) are 4 of the top 5 values...). Could you explain what you mean by that? More generally, did you document somewhere how you did this import? (I have many more small questions, like how homophone lexemes were handled, how the main lemma was extracted - first person instead of infinitive for verbs is a bit unexpected - whether there are plans to add more data or populate the forms, and such.)
Cheers, VIGNERON (talk) 14:45, 25 March 2020 (UTC)
- VIGNERON, I used country because I didn't know of another property to use for the area of the words. As for the import code, it is here. It is basically LexData with a few changes to support uploading a full lexeme in one JSON. To create the JSON I ran Whitaker's Words on the Linux CLI over a list of words I found online and used the morphological analysis of the software to get some details about all those words. I then had to do some search-and-replace with regex to make the format match the Wikidata JSON. I plan to populate the forms I got from Whitaker's Words, but I need to do some additional work to prepare them for upload. Uziel302 (talk) 15:02, 25 March 2020 (UTC)
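A sketch of that pipeline, assuming Whitaker's Words is installed as a words command-line binary and that the word list is a plain text file; the real output still needs the regex cleanup described above:

import subprocess

# Run Whitaker's Words over a word list and keep the raw morphological
# analysis for later post-processing into Wikidata-style lexeme JSON.
with open("latin_words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

analyses = {}
for word in words:
    out = subprocess.run(["words", word], capture_output=True, text=True)
    analyses[word] = out.stdout  # lines end in flag codes like [XXXEO]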
- I'm not sure I understand what "area of the words" is. A word does not really have an area, or do you mean "place of origin"? I'm not sure, but wouldn't it be more appropriate to use properties like location (P276) or location of formation (P740)?
- Thanks, I already looked at the script (not sure I understand everything, but I get the general idea); now I'm more interested in how you generated the JSON. Did you have some algorithm (even the general idea of one) for transforming the original data into data for Lexemes? By the way, what is your source, is it http://archives.nd.edu/whitaker/words.htm ?
- Good to know you will work more on these Lexemes, I'll leave them to you then (I have enough to do otherwise); let me know if I can help in any way. And please, document as much as possible, I would love to replicate your method (especially for the Breton language).
- Cdlt, VIGNERON (talk) 17:33, 25 March 2020 (UTC)
- PS: for homophones, I forgot to give the examples non (L30640)/ (L288383) or (L284747)/si (L289938). Are they on purpose, like tour (L2330)/tour (L2331)/tour (L2332), or are they a mistake?
- VIGNERON, I tried to prevent duplicates and merged the duplicates I found. I just merged the first two examples you brought up. The only duplicates that are there on purpose are the ones with some difference in details.
- The link you sent is probably the same one I used to download the Linux app, but I can't tell for sure.
- Do you have any source for the Breton language? I can check the options for doing automation there. The process of creating the JSON here was simply understanding the JSON format from existing lexemes, and replacing Whitaker's codes, e.g. [XXEEC], with equivalent Wikidata claims. Uziel302 (talk) 21:06, 25 March 2020 (UTC)
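As an illustration of that code translation: Whitaker's five-letter flag codes are positional (age, area, geography, frequency, source), so each position can be mapped to claim values with a lookup table. The sketch below is hypothetical; the QIDs are the ones mentioned in this thread, but the letter meanings are assumptions based on the Whitaker documentation:

# Translate a Whitaker flag code like "[XXEEC]" into a geography claim value.
# The letter-to-QID pairs are illustrative assumptions.
GEO_TO_QID = {
    "A": "Q15",      # Africa
    "E": "Q202311",  # Egypt, mapped to the Roman Egypt item
    "Q": "Q48214",   # Near East
}

def geo_claim_from_code(code):
    """Return the geography QID for a code like '[XXEEC]', if any."""
    age, area, geo, freq, source = code.strip("[]")
    return GEO_TO_QID.get(geo)

print(geo_claim_from_code("[XXEEC]"))  # -> "Q202311" under this mapping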
- Location is more physical ("location of the item, physical object or event is within") and location of formation is not the meaning here; I meant areas of use. Country sounded OK. The best solution would be to create a new property, location of lexeme usage, like we have location of sense usage (P6084). Uziel302 (talk) 21:14, 25 March 2020 (UTC)
- For location, I'm still not sure I understand (was it in the source file? how do they define it?). The best way is probably to start by asking on Wikidata_talk:Lexicographical data and then, if needed, ask for a new property.
- I didn't look at it in detail, but I know there are some dictionaries in Breton out there, see the PanLex list for instance. I'm also working on transcribing dictionaries like fr:s:Lexique étymologique du breton moderne on Wikisource.
- PS: for duplicates, you might want to take a look at this query: https://w.wiki/LGG (with some duplicates missing if there is a slight difference in the lemma - like the macron for long vowels on dō (L42145) - and some false positives).
- Cheers, VIGNERON (talk) 09:31, 26 March 2020 (UTC)
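For anyone replicating this, a duplicate check along these lines can also be scripted; the sketch below groups Latin lexemes by lemma and is only an approximation of the linked query, which may differ:

import requests

# Find Latin (Q397) lexemes that share a lemma, i.e. candidate duplicates.
query = """
SELECT ?lemma (COUNT(?l) AS ?n) WHERE {
  ?l dct:language wd:Q397 ; wikibase:lemma ?lemma .
}
GROUP BY ?lemma
HAVING (COUNT(?l) > 1)
"""
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": query, "format": "json"},
                 headers={"User-Agent": "dup-check-sketch/0.1"})
for b in r.json()["results"]["bindings"]:
    print(b["lemma"]["value"], b["n"]["value"])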
- VIGNERON, 1. I asked for a new property. 2. I exported all the Latin lexemes and merged all the duplicates I found (that also have the same claims). Let me know if you find something I missed. 3. Textual dictionaries are hard to use for automatic upload; I only uploaded using CLI dictionaries (GUI dictionaries are possible to use, but it would take an eternity to collect all the items; on the CLI it takes a few minutes). Uziel302 (talk) 21:41, 29 March 2020 (UTC)
- 1. Perfect. I'll follow the proposal.
- 2. Perfect, I see you merged some into the higher-numbered lexeme but then corrected it; it looks good to me. I'll be using these Lexemes from time to time and I'll let you know if I spot anything.
- 3. Ok, I figured textual dictionaries would be harder. And what about the resources in PanLex? I tried but couldn't figure out how to use it. As the documentation says, « PanLem is not deliberately designed to be difficult to use, but it is a difficult interface. »
- VIGNERON, I started uploading forms based on Whitaker's WORDS. Please check abactus (L254557) and let me know what can be improved. Uziel302 (talk) 00:19, 31 March 2020 (UTC)
- Hi,
- (less indentation to be more readable)
- Globally, abactus (L254557) sounds good, but I have several "small" remarks (not all important or even calling for a correction; I'm just sharing my thoughts and ideas, which may or may not be good :) )
- why put described by source (P1343) = William Whitaker's Words (Q533803) in every form and not only once at the lexeme level?
- the la-x-Q533803 representations are strange; shouldn't they go into hyphenation (P5279), or rather be put into word stem (P5187) at the lexeme level? I'm not sure exactly what "abact.i" is here, but it clearly does not look like a representation of the form.
- grammatical cases: most of the time you used the general case (genitive case (Q146233) and so on); is there a reason to use ablative in Latin (Q4668057) instead of the general ablative case (Q156986)?
- described by source (P1343) = Oxford Latin Dictionary (Q822282) is good, but for this property I would prefer to be more specific and use a specific edition, not the whole work, and give a page (like I did on Lexeme:L69#P1343)
- Cheers, VIGNERON (talk) 13:58, 31 March 2020 (UTC)
- VIGNERON, 1. At the lexeme level I already put described by source according to the sources Whitaker uses for lexemes, e.g. Oxford, Lewis and Short, etc., and also there is a chance people will add forms from other sources.
- 2. In the original Whitaker output there is this notation of a separation between stem and suffix, although he notes this isn't a grammatically accurate separation. I think it is useful at the form level to see the different parts according to the analysis source, and a lexeme may have different stems, e.g. ales and alit in ales (L256931).
- 3. I think it is better to link to more specific items, since the user can then explore them further; hence ablative in Latin is better, because the user can open it and read about it on Wikipedia. Other types had no Wikidata item nor Wikipedia article, so I preferred to use the existing general items and not create new ones.
- 4. I thought about adding a specific link to each "described by source", but on the source I used, https://latin.ucant.org/, all words get the same link... There are other websites with the content, and maybe some allow a direct link to a word, but that won't necessarily be helpful in the long term, as those websites change domains and link structures. In the items about the frequency categories etc. I used a direct link to the documentation.
- Thanks for the feedback, Uziel302 (talk) 16:14, 31 March 2020 (UTC)
- 1. Very true, I withdraw my remark.
- 2. I agree that the information is very useful. I think "representation" is not the best place, though; placing it in a statement would be better. The lexeme level is indeed probably not right, as the stem is not always unique (which is quite frequent, not only in Latin; I'll ask about this point on the property talk page), so maybe at the form level then?
- 3. I'm unsure here. And if we choose to be specific, why not be specific for all cases? It would be more coherent.
- 4. Ok, I see, I withdraw my remark.
- Cheers, VIGNERON (talk) 15:11, 1 April 2020 (UTC)
- VIGNERON, we can be more specific and add items and Wikipedia articles for each case and voice etc. in Latin. Once all are ready, I can run the script again to update them appropriately. Same if you find a better way than representation. Uziel302 (talk) 15:28, 1 April 2020 (UTC)
- I guess it will be best to have specific items.
- I started a discussion with broader examples: Property talk:P5187#Multiple values, let's wait to see the results of this conversation.
- Cdlt, VIGNERON (talk) 15:41, 1 April 2020 (UTC)
Strange lemma for L280927
Hi,
I guess there has been an error in the lemma (and the nominative/vocative singular form) of perduell (L280927). It is probably perduellis, no? Could you check?
Cheers, VIGNERON (talk) 10:01, 17 November 2020 (UTC)
- PS: commaterr (L265355) is strange too; I expected 'commater' with only one r (like mater (L29829)), no?
- PPS: maybe also unicorn (L287099).
- VIGNERON, https://latin.ucant.org/ can show you the original definitions from Whitaker's Words; I just imported them. Uziel302 (talk) 10:42, 17 November 2020 (UTC)
- I tried to use it but it doesn't show anything... Cheers, VIGNERON (talk) 11:15, 17 November 2020 (UTC)
- VIGNERON, I see the issue; use http://archives.nd.edu/words.html instead. Uziel302 (talk) 11:22, 17 November 2020 (UTC)
- Thanks. Can I find more information and explanations somewhere? (For instance, what the abbreviations mean, where the data comes from, how good the quality is, and so on.) For the moment I only found en:William Whitaker's Words, which is a bit poor. Cheers, VIGNERON (talk) 11:49, 17 November 2020 (UTC)
- VIGNERON, I mainly used this source: https://mk270.github.io/whitakers-words/dictionary.html Cheers, Uziel302 (talk) 19:19, 17 November 2020 (UTC)