+    cache = {r_text: "দেখার পর তিনি চ্চিত্র worngly."}
+    eq_(solve(bengali.dictionary.revision.datasources.dict_words,
+              cache=cache),
+        ['দ', 'খ', 'র', 'পর', 'ত', 'ন', 'চ', 'চ', 'ত', 'র'])
That looks wrong.
Status | Assigned | Task
---|---|---
Resolved | Catrope | T170723 Deploy ORES Review Tool & ORES-based RCFilters for Romanian & Albanian Wikipedia
Resolved | awight | T170485 ORES deployment - Mid July, 2017
Resolved | Halfak | T170490 Train reverted model for Bengali Wikipedia
Resolved | Ladsgroup | T162620 Add language support for Bengali
Resolved | Halfak | T164767 Fix bengali tokenization in deltas
I don't know what the problem is. In case it helps, this is how the text breaks down into individual characters: দেখার পর তিনি চ্চিত্র => দ+ে+খ+া+র+ space +প+র+ space +ত+ি+ন+ি+ space +চ+্+চ+ি+ত+্+র
Right. I think I need to account for those sign characters (the vowel signs and the virama that attach to a consonant) for Bengali when doing word tokenization. We had a similar problem with Hindi and Arabic/Persian.
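To illustrate the failure mode described above, here is a minimal sketch in Python. It is not the code from the deltas library; the function names `tokenize`, `letters_only_tokens`, and `mark_aware_tokens` are hypothetical and exist only for this example. It assumes the root cause is that the word pattern accepts Unicode letters (category `L*`) but not combining marks (category `M*`), which is the category of the Bengali vowel signs and the virama.

```python
import unicodedata


def tokenize(text, keep_marks):
    """Split text into word tokens (hypothetical helper, not the deltas API).

    A character counts as part of a word if it is a Unicode letter (L*).
    When keep_marks is True, combining marks (M*) also count, so vowel
    signs and the virama stay attached to their consonants.
    """
    tokens = []
    current = []
    for ch in text:
        category = unicodedata.category(ch)
        is_word_char = category.startswith("L") or (
            keep_marks and category.startswith("M")
        )
        if is_word_char:
            current.append(ch)
        elif current:
            # Any other character (space, punctuation, or a mark when
            # keep_marks is False) ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def letters_only_tokens(text):
    # Mimics the suspected bug: marks act as word boundaries.
    return tokenize(text, keep_marks=False)


def mark_aware_tokens(text):
    # Treats marks as word characters, keeping each word whole.
    return tokenize(text, keep_marks=True)


if __name__ == "__main__":
    sample = "দেখার পর তিনি চ্চিত্র"
    print(letters_only_tokens(sample))
    print(mark_aware_tokens(sample))
```

Under these assumptions, the letters-only variant should break the sample into the same single-consonant fragments seen in the failing test, while the mark-aware variant should return four whole words. The actual fix in deltas may differ in how it defines word characters.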
Fixed in https://github.com/halfak/deltas/commit/e426eb394bc186953fe9bb8b6666f0b33032e23a and released as version 0.4.6.