Fix bengali tokenization in deltas
Closed, Resolved
Public
Assigned To

Authored By

	Halfak
	May 8 2017, 4:58 PM

Description

+    cache = {r_text: "দেখার পর তিনি চ্চিত্র worngly."}
+    eq_(solve(bengali.dictionary.revision.datasources.dict_words,
+              cache=cache),
+       ['দ', 'খ', 'র', 'পর', 'ত', 'ন', 'চ', 'চ', 'ত', 'র'])

That looks wrong.

Related Objects

Status	Assigned	Task
Resolved	• Catrope	T170723 Deploy ORES Review Tool & ORES-based RCFilters for Romanian & Albanian Wikipedia
Resolved	awight	T170485 ORES deployment - Mid July, 2017
Resolved	Halfak	T170490 Train reverted model for Bengali Wikipedia
Resolved	Ladsgroup	T162620 Add language support for Bengali
Resolved	Halfak	T164767 Fix bengali tokenization in deltas

Event Timeline

Halfak created this task. May 8 2017, 4:58 PM

Restricted Application added a project: User-Ladsgroup. May 8 2017, 4:58 PM

Halfak claimed this task. May 8 2017, 4:58 PM

Halfak removed a project: User-Ladsgroup.

Halfak added a subscriber: Ladsgroup.

I don't know what the problem is. Just in case you want to know: দেখার পর তিনি চ্চিত্র => দ+ে+খ+া+র+ space +প+র+ space +ত+ি+ন+ি+ space +চ+্+চ+ি+ত+্+র
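For anyone following along, the breakdown above can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch, not code from deltas or revscoring; the variable names are made up for the example. It prints the Unicode general category of each code point in the last word, which shows that the vowel signs and the virama are combining marks (`Mc`/`Mn`) rather than letters (`Lo`).

```python
import unicodedata

# The word "চ্চিত্র" from the test case, written with escapes so that
# every code point is unambiguous.
word = "\u099a\u09cd\u099a\u09bf\u09a4\u09cd\u09b0"

# General category per code point: Lo = letter, Mn/Mc = combining mark.
categories = [unicodedata.category(ch) for ch in word]

for ch, cat in zip(word, categories):
    print(f"U+{ord(ch):04X} {cat} {unicodedata.name(ch)}")
```

A tokenizer that only accepts letter categories will therefore break the word at every sign.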

Right. I think I need to account for those sign characters for Bengali when doing word tokenization. We had a similar problem with Hindi and Arabic/Persian.
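A minimal sketch of the idea, assuming a tokenizer that treats combining marks as word characters. This is not the actual deltas change (see the commit linked in the thread for that); `is_word_char` and `tokenize_words` are hypothetical helpers written for illustration. Python's built-in `\w` is used as the naive baseline because it does not match combining marks.

```python
import re
import unicodedata


def is_word_char(ch):
    # Hypothetical rule: letters (L*), combining marks (M*) and
    # numbers (N*) all belong inside a word.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize_words(text):
    # Group consecutive word characters into tokens; everything else
    # (spaces, punctuation) acts as a separator.
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# "দেখার পর তিনি চ্চিত্র", written with escapes for clarity.
text = (
    "\u09a6\u09c7\u0996\u09be\u09b0 \u09aa\u09b0 "
    "\u09a4\u09bf\u09a8\u09bf \u099a\u09cd\u099a\u09bf\u09a4\u09cd\u09b0"
)

naive = re.findall(r"\w+", text)  # breaks at vowel signs and the virama
fixed = tokenize_words(text)      # keeps each word whole

print(naive)
print(fixed)
```

With this rule the four space-separated words come back intact, whereas the naive pattern yields single consonants. A real tokenizer would also need to decide how to handle characters such as the zero-width joiner, which this sketch treats as a separator.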

Fixed in https://github.com/halfak/deltas/commit/e426eb394bc186953fe9bb8b6666f0b33032e23a and released as version 0.4.6.

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. May 8 2017, 7:12 PM

• Qse24h closed this task as a duplicate of T164723: New git repository: <repo name>. May 9 2017, 10:07 AM
