Fix bengali tokenization in deltas
Closed, ResolvedPublic

Description

+    cache = {r_text: "দেখার পর তিনি চ্চিত্র worngly."}
+    eq_(solve(bengali.dictionary.revision.datasources.dict_words,
+              cache=cache),
+       ['দ', 'খ', 'র', 'পর', 'ত', 'ন', 'চ', 'চ', 'ত', 'র'])

That looks wrong.

Event Timeline

Halfak removed a project: User-Ladsgroup.
Halfak added a subscriber: Ladsgroup.

I don't know what the problem is. Just in case it helps, here is how the text breaks down into characters: দেখার পর তিনি চ্চিত্র => দ+ে+খ+া+র+ space +প+র+ space +ত+ি+ন+ি+ space +চ+্+চ+ি+ত+্+র
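To make the breakdown above easier to inspect, here is a small sketch that lists each code point with its Unicode general category, using only Python's standard `unicodedata` module. The helper name `describe` is made up for illustration and is not part of the project's code.

```python
import unicodedata


def describe(text):
    """Return (char, general category, Unicode name) for each code point."""
    return [
        (ch, unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# Print one line per code point of the first word in the test string.
for ch, category, name in describe("দেখার"):
    print(f"U+{ord(ch):04X} {category} {name}")
```

The consonants come back with a letter category, while the vowel signs and the virama (্) come back with a mark category (one starting with `M`), so anything that treats only letters as word characters will cut a word at every sign.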

Right. I think I need to account for those sign characters (the vowel signs and the virama) for Bengali when doing word tokenization. We had a similar problem with Hindi and Arabic/Persian.
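A minimal sketch of the idea, assuming the fix amounts to treating combining marks as word characters. This is not the project's actual tokenizer; `tokenize_words` and its `include_marks` flag are hypothetical names used only to contrast the two behaviours.

```python
import unicodedata


def tokenize_words(text, include_marks=True):
    """Split text into runs of word characters.

    A character counts as part of a word if it is a letter (category L*)
    or, when include_marks is True, a combining mark (category M*).
    """
    allowed = ("L", "M") if include_marks else ("L",)
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in allowed:
            current.append(ch)
        elif current:
            # Any other character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


text = "দেখার পর তিনি চ্চিত্র worngly."
print(tokenize_words(text, include_marks=False))  # splits at every sign
print(tokenize_words(text, include_marks=True))   # keeps the words whole
```

With `include_marks=False` the Bengali words fall apart into single consonants, which is the same kind of fragmentation seen in the test in the description. With `include_marks=True` each word stays in one piece.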

Qse24h closed this task as a duplicate of T164723: New git repository: <repo name>.