Apertium identifies words that it cannot translate and can log them. We should consider collecting this information and sending it to the Apertium developers.
Steps:
- Package python-toro (See: T101947)
- Determine the location of missingFreqs.db and how to access it (it is an SQLite database).
- Puppet config.
- Deployment in Beta and Production.
From #apertium on IRC:
aharoni 2. How can Wikipedia help Apertium improve this? Can we report the most frequent missing words, for example?
TinoDidriksen Unhammer and jacobEo, the currently online maintainers of dan-nor; what say you?
aharoni I've been thinking how to report untranslated words from Wikipedia back to Apertium
TinoDidriksen Well, APY keeps a database of untranslated words, with frequency afaik.
aharoni Where is it collected?
TinoDidriksen Some SQLite db on the APY host.
aharoni [ hi kart_ ]
aharoni If we have our own package installed, do we already collect it?
aharoni kart_ handles all the packaging for us, I don't know the technical details.
TinoDidriksen Don't know what version you have packaged, or whether it has that part enabled.
aharoni kart_: do you know?
TinoDidriksen File is called missingFreqs.db in the APY folder.
aharoni OK, let's say that we do have it.
aharoni If we periodically send it to Apertium, will it be useful?
aharoni Will somebody bother to add the translations?
kart_ TinoDidriksen: you mean -apy?
kart_ TinoDidriksen: I think I need to update package then.
kart_ aharoni: ^^
kart_ aharoni: can I have task in Phab? :D
aharoni kart_: ack
aharoni TinoDidriksen: you know, you could just run Apertium over a dump of all Wikipedia articles and collect the most frequent untranslated words :)
kart_ aharoni: how to access is another subject, as we do run it on production service.
aharoni If you haven't already :)
aharoni kart_: How about just copying it once a month and emailing it to an Apertium contact :)
TinoDidriksen Whether anyone will care to look at the most frequent missing words is a whole other story. I guess it's a good incentive because there is a direct feedback loop.
aharoni It shouldn't be too big for email.
TinoDidriksen Ours is 130MB currently.
aharoni TinoDidriksen: If there is somebody who will care and add the translations, I'd gladly provide it.
TinoDidriksen Nobody is even looking at our own, currently...but it also hasn't been advertised to the mailing list. We should do that.
kart_ TinoDidriksen: Please do.
kart_ Even I only came to know today, we should have sent feedback earlier.
kart_ TinoDidriksen: can the location of the db be configured?
TinoDidriksen I don't maintain any of the Python code. Unhammer and sushain handle APY. But I can only assume the answer is yes, 'cause that sounds trivial.
TinoDidriksen Oh, it already is, with -f
TinoDidriksen There was also a cmdline flag to make it keep an in-memory buffer, so that it doesn't hog I/O with SQLite commits: -M1000
Unhammer hi
Unhammer yeah, probably good idea to use -M1000 (or some number like that)
Unhammer and yeah I'd like seeing the wp missingfreqs, it's probably more directly useful
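
To give an idea of what a periodic report to Apertium could contain, here is a minimal sketch that reads the most frequent untranslated words out of missingFreqs.db. It assumes the database has a single table named missingFreqs with the columns pair, token and frequency. That schema is an assumption and should be checked against the packaged APY version before relying on it. The function name top_missing_words and its parameters are hypothetical.

```python
import sqlite3


def top_missing_words(db_path, pair=None, limit=20):
    """Return (pair, token, frequency) rows, most frequent first.

    Assumes a table `missingFreqs(pair, token, frequency)`; this schema
    is an assumption and must be verified against the deployed APY.
    """
    query = "SELECT pair, token, frequency FROM missingFreqs"
    params = []
    if pair is not None:
        # Restrict the report to one language pair, e.g. "dan-nor".
        query += " WHERE pair = ?"
        params.append(pair)
    # Break frequency ties by token so the output order is stable.
    query += " ORDER BY frequency DESC, token ASC LIMIT ?"
    params.append(limit)

    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(query, params).fetchall()
    finally:
        conn.close()
```

Something like top_missing_words("/path/to/missingFreqs.db", pair="dan-nor", limit=100) would then produce a short list that fits easily in an email, instead of shipping the whole database file.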