The Kam4D Linguistic Knowledge Graph: Putting Smurfs, Ducks, Lemurs, and Party Terms to the Service of African Languages


SADiLaR: DH Colloquium – 17 November, 2021

South African Centre for Digital Language Resources


kamusi is Swahili for dictionary
Goal: A complete matrix of human expression across time and space

• As a knowledge resource
• As a data resource
• As a basis for any-to-any translation
In service since 1994 - originally at Yale Council on African Studies
International NGO since 2009
• Registered non-profit in 🇺🇸 and 🇨🇭
Academic Home since 2013:
EPFL - Swiss Federal Institute of Technology in Lausanne
First at LSIR - Distributed Information Systems Laboratory
Now at the Swiss EdTech Collider
ACALAN (Intergovernmental language agency for 55 member states of the African Union):
Platform for African Language Empowerment development partner

• Lemurs and Party Terms
• Smurfs and Ducks
• Costumes and Wardrobes

Lemur =
• Lemma
• Lemmatic form
• Dictionary form
• Canonical form
• Citation form

Party term =
• Multiword Expression
• MWE

• Lemurs and Party Terms
• Smurfs and Ducks
• Costumes and Wardrobes

SMURF = Spelling/Meaning Unit Reference
DUCKS = Data Unified Concept Knowledge Set

1. The problem with linguistic data
2. The Kam4D solution
3. Kamusi Labs projects


Data: Words that have been digitized in a way that can be used within computer processes

This is a word

light

Lemur =
• Lemma
• Lemmatic form
• Dictionary form
• Canonical form
• Citation form

light

light

why multilingual dictionaries were impossible

light

lumineux
léger
léger
allégé

why multilingual dictionaries were impossible
light

give the green light
red light district
come to light
light up the town
see the light
running light
in light of
trip the light fantastic
light year

Party term =
• Multiword Expression
• MWE

light

fr: lumineux sw: -enye mwanga fi: valoisa th: สว่าง
fr: léger sw: -epesi fi: kevyt th: เบา
fr: léger sw: -a kuchekesha fi: tyhjänpäiväinen th: ซึ่งไร้สาระ
fr: allégé sw: pungufu fi: kaloriton th: ที่แคลอรี่ต่ำ

why multilingual dictionaries were impossible

en: light fr: lumineux sw: -enye mwanga fi: valoisa th: สว่าง
en: light fr: léger sw: -epesi fi: kevyt th: เบา
en: light fr: léger sw: -a kuchekesha fi: tyhjänpäiväinen th: ซึ่งไร้สาระ
en: light fr: allégé sw: pungufu fi: kaloriton th: ที่แคลอรี่ต่ำ

why multilingual dictionaries were impossible


1. The problem with linguistic data
2. The Kam4D solution
3. Kamusi Labs projects

1. More problems with linguistic data!
2. The Kam4D solution
3. Kamusi Labs projects

• 2000 languages in Africa
• Very few have any digital resources
• What has been digitized has not been harmonized as interoperable

1. The problem with linguistic data
2. The Kam4D solution
3. Kamusi Labs projects

light

light (not dark)
light (not heavy)
light (not serious)
light (not fattening)

SMURF = Spelling/Meaning Unit Reference

how Kamusi makes a multilingual dictionary possible

light (not dark) fr: lumineux
light (not heavy) fr: léger
light (not serious) fr: léger
light (not fattening) fr: allégé

SMURF = Spelling/Meaning Unit Reference
DUCKS = Data Unified Concept Knowledge Set

how Kamusi makes a multilingual dictionary possible

light (not dark) fr: lumineux sw: -enye mwanga fi: valoisa th: สว่าง
light (not heavy)
light (not serious)
light (not fattening)

DUCKS = Data Unified Concept Knowledge Set

how Kamusi makes a multilingual dictionary possible

light (not dark)
light (not heavy) fr: léger sw: -epesi fi: kevyt th: เบา
light (not serious)
light (not fattening)

DUCKS = Data Unified Concept Knowledge Set

how Kamusi makes a multilingual dictionary possible

light (not dark)
light (not heavy)
light (not serious) fr: léger sw: -a kuchekesha fi: tyhjänpäiväinen th: ซึ่งไร้สาระ
light (not fattening)

DUCKS = Data Unified Concept Knowledge Set

how Kamusi makes a multilingual dictionary possible

light (not dark)
light (not heavy)
light (not serious)
light (not fattening) fr: allégé sw: pungufu fi: kaloriton th: ที่แคลอรี่ต่ำ

DUCKS = Data Unified Concept Knowledge Set

how Kamusi makes a multilingual dictionary possible

light (not dark) fr: lumineux sw: -enye mwanga fi: valoisa th: สว่าง
light (not heavy) fr: léger sw: -epesi fi: kevyt th: เบา
light (not serious) fr: léger sw: -a kuchekesha fi: hölynpöly th: ซึ่งไร้สาระ
light (not fattening) fr: allégé sw: pungufu fi: kaloriton th: ที่แคลอรี่ต่ำ

how Kamusi makes a multilingual dictionary possible
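The rows above can be read as a small data model. What follows is a minimal Python sketch under stated assumptions, not Kamusi's actual schema: the class names `Smurf` and `Duck` follow the slide terminology, while the fields (`lang`, `spelling`, `gloss`, `concept`, `members`) and the helper `translate` are hypothetical names chosen for illustration. The data is limited to terms shown on the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Smurf:
    """One spelling tied to one meaning in one language (hypothetical fields)."""
    lang: str
    spelling: str
    gloss: str


@dataclass
class Duck:
    """A concept that groups equivalent smurfs across languages."""
    concept: str
    members: list = field(default_factory=list)

    def equivalents(self, lang):
        """Spellings of the member smurfs in the requested language."""
        return [s.spelling for s in self.members if s.lang == lang]


def translate(ducks, smurf, target_lang):
    """Collect target-language terms from every duck that contains the smurf."""
    terms = []
    for duck in ducks:
        if smurf in duck.members:
            terms.extend(duck.equivalents(target_lang))
    return terms


# Two of the four senses of English "light" from the slides.
light_dark = Smurf("en", "light", "not dark")
light_heavy = Smurf("en", "light", "not heavy")

ducks = [
    Duck("not dark", [
        light_dark,
        Smurf("fr", "lumineux", "not dark"),
        Smurf("sw", "-enye mwanga", "not dark"),
        Smurf("fi", "valoisa", "not dark"),
    ]),
    Duck("not heavy", [
        light_heavy,
        Smurf("fr", "léger", "not heavy"),
        Smurf("sw", "-epesi", "not heavy"),
        Smurf("fi", "kevyt", "not heavy"),
    ]),
]

print(translate(ducks, light_heavy, "sw"))  # ['-epesi']
print(translate(ducks, light_dark, "fi"))   # ['valoisa']
```

Because the lookup starts from a spelling/meaning pair rather than from the bare spelling "light", the same English word reaches different Swahili or Finnish terms depending on the sense selected.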

fr: léger (without much luggage)

light (not heavy) fr: léger sw: -epesi fi: kevyt th: เบา

fr: léger (low alcohol)

fr: léger (sandy)

how Kamusi makes a multilingual dictionary possible

• 2,099,419 Smurfs
• 122 Languages

• ~138,000 Ducks
• 44 Languages

• 4D = Four Dimensional
• Time is the fourth dimension - capacity to treat language change and historical languages
• Graph database structure for a complete matrix of human expression across time and space
• The structure is realistic; the final goal is an impossible aspiration
• Molecular lexicography design
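One way to picture time as a fourth dimension in a graph structure is to attach a validity period to each edge. The sketch below is an assumption about how that could look, not a description of the Kam4D database: `TimedEdge`, `active_in`, and `neighbors` are hypothetical names, and the years attached to English "nice" are rough placeholders for illustration, not Kamusi data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimedEdge:
    """A graph edge that is only valid during a period (hypothetical model)."""
    source: str
    relation: str
    target: str
    start_year: Optional[int] = None  # None means no known lower bound
    end_year: Optional[int] = None    # None means still in use

    def active_in(self, year):
        """True if the edge holds in the given year."""
        after_start = self.start_year is None or year >= self.start_year
        before_end = self.end_year is None or year <= self.end_year
        return after_start and before_end


def neighbors(edges, node, year):
    """Relations leaving a node, restricted to edges valid in that year."""
    return [(e.relation, e.target)
            for e in edges
            if e.source == node and e.active_in(year)]


# Placeholder years: an older sense and a newer sense of the same spelling.
edges = [
    TimedEdge("en:nice", "means", "foolish", 1300, 1600),
    TimedEdge("en:nice", "means", "pleasant", 1800, None),
]

print(neighbors(edges, "en:nice", 1400))  # [('means', 'foolish')]
print(neighbors(edges, "en:nice", 2000))  # [('means', 'pleasant')]
```

Querying the same node for different years returns different meanings, which is the kind of treatment of language change the bullet on time refers to.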

light

meaning
shape
place
sound
time
relationships


DRIVE

hit a golf ball
operate a car
herd animals
compel someone

meaning
shape
place
sound
time
relationships

• Lemurs and Party Terms
• Smurfs and Ducks
• Costumes and Wardrobes

Costume = A single form (a.k.a. inflection) that might be used by a smurf
Wardrobe = A set of forms (a.k.a. inflections) that might be used by a smurf

DRIVE
drives, drove, driving, driven

meaning
shape
place
sound
time
relationships
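The costume and wardrobe idea for DRIVE can be sketched in a few lines of Python. This is an illustrative sketch only: the classes `Costume` and `Wardrobe` borrow the slide terms, but the fields, the grammatical labels, and the `build_index` helper are assumptions, not part of Kam4D.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Costume:
    """A single inflected form (the 'features' label is a hypothetical field)."""
    form: str
    features: str = ""


@dataclass
class Wardrobe:
    """The set of forms that a smurf with this lemur might wear."""
    lemur: str
    costumes: list = field(default_factory=list)

    def forms(self):
        """All inflected forms, in the order they were listed."""
        return [c.form for c in self.costumes]


def build_index(wardrobes):
    """Map every surface form, including the lemur itself, back to its lemur(s)."""
    index = {}
    for w in wardrobes:
        index.setdefault(w.lemur, set()).add(w.lemur)
        for c in w.costumes:
            index.setdefault(c.form, set()).add(w.lemur)
    return index


drive = Wardrobe("drive", [
    Costume("drives", "third person singular present"),
    Costume("drove", "past"),
    Costume("driving", "present participle"),
    Costume("driven", "past participle"),
])

index = build_index([drive])
print(index["drove"])  # {'drive'}
```

An index like this lets a form met in running text, such as "drove", be traced back to the dictionary form before a sense is chosen.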


DRIVE
drives, drove, driving, driven

hit a golf ball
operate a car
herd animals
compel someone

meaning
shape
place
sound
time
relationships

1. The problem with linguistic data
2. The Kam4D solution
3. Kamusi Labs projects

1. Gathering data for African languages
2. SlowBrew assisted translation
3. PALE: Platform for African Language Empowerment
4. Many more projects…


light (not dark)

light (not heavy)

light (not serious)

light (not fattening)




• Duck Duck Kamus – aligning existing datasets
• Crowdsourcing games for new microdata
• GOLDdigger – engine for expert editors

1. Gathering data for African languages
2. SlowBrew assisted translation
3. PALE: Platform for African Language Empowerment
4. Many more projects…


• User selects their meaning on the source side (predisambiguation)
• Users can suggest missing senses
• SlowBrew suggests Party Terms (MWEs), or users can mark their own
• Party Terms are treated as Smurfs in Kam4D
• Separated expressions easily rejoined (unlike NMT)

• Smurfs and Ducks
• Kam4D – kamu.si/kam4d
• SlowBrew


• User selects their meaning on the source side (predisambiguation)
• Users can suggest missing senses
• SlowBrew suggests Party Terms (MWEs), or users can mark their own
• Party Terms are treated as Smurfs in Kam4D
• Separated expressions easily rejoined (unlike NMT)
• DUCKS finds equivalent term in Language B

• Machine learns from context-specific user selections
• Crowdsourced dataset of spelling/meaning annotations
• AI builds from human intelligence on the source side
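The steps listed above can be sketched as a toy pipeline. This is a hedged illustration and not SlowBrew's implementation: the dictionaries `SMURF_TO_DUCK` and `DUCK_TO_TERMS`, the duck identifiers, and the functions `senses_for`, `find_party_term`, and `translate_choice` are all hypothetical, and the data is restricted to terms that appear on the earlier slides.

```python
# Hypothetical mini dataset: (language, spelling, sense) -> duck id.
SMURF_TO_DUCK = {
    ("en", "light", "not dark"): "duck:not-dark",
    ("en", "light", "not heavy"): "duck:not-heavy",
}
# Hypothetical duck id -> equivalent term per language.
DUCK_TO_TERMS = {
    "duck:not-dark": {"fr": "lumineux", "sw": "-enye mwanga"},
    "duck:not-heavy": {"fr": "léger", "sw": "-epesi"},
}
# Context-specific user choices, kept so a model could learn from them later.
selection_log = []


def senses_for(lang, spelling):
    """Senses offered to the user for predisambiguation."""
    return sorted(g for (l, s, g) in SMURF_TO_DUCK if l == lang and s == spelling)


def find_party_term(tokens, term_tokens):
    """Positions of a party term's words in order, gaps allowed, else None."""
    positions = []
    start = 0
    for word in term_tokens:
        try:
            i = tokens.index(word, start)
        except ValueError:
            return None
        positions.append(i)
        start = i + 1
    return positions


def translate_choice(lang, spelling, chosen_sense, target_lang, context=""):
    """Record the user's choice, then look up the equivalent through its duck."""
    selection_log.append((context, spelling, chosen_sense))
    duck = SMURF_TO_DUCK.get((lang, spelling, chosen_sense))
    if duck is None:
        return None  # a missing sense; the user could suggest it here
    return DUCK_TO_TERMS[duck].get(target_lang)


print(senses_for("en", "light"))  # ['not dark', 'not heavy']
print(translate_choice("en", "light", "not heavy", "sw", "a light suitcase"))  # -epesi
print(find_party_term("they will light the whole town up".split(), ["light", "up"]))  # [2, 6]
```

The gap-tolerant match in `find_party_term` is one simple way to rejoin a separated expression such as "light ... up"; a real system would need to guard against accidental matches across clause boundaries.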

Unanswered Questions:
• Will users take the time to predisambiguate?
• People take time to choose images
• People take time to spellcheck
• Syntax on the target side?
• Outside Kamusi wheelhouse – partners needed
• How to pay for it?

1. Gathering data for African languages
2. SlowBrew assisted translation
3. PALE: Platform for African Language Empowerment
4. Many more projects…

• Kam4D will serve as the linguistic data core for the ACALAN-AU platform
• Integrated with Kamusi systems for gathering and disseminating data
• Initial focus on 20 VCBLs (Vehicular Cross-Border Languages)

1. Gathering data for African languages
2. SlowBrew assisted translation
3. PALE: Platform for African Language Empowerment
4. Many more projects…

• KamuSee 👓 visual dictionary
• Sign languages 🙆 gesture video dictionary
• Logikamusi ⛩ ontological dictionary
• Kamedicine 💊 medical terminology translator
• Kamigrate 👣 refugee and immigrant services translator
• Kamergency 😷🧑🚒 phrasebook for accident and disaster first responders
• Kamuseum 🏛 guides for public spaces
• Box-o-Lex 🧰 field lexicography toolkit
• Talkamusi 🗣💬📖 talking dictionary
• KamHoosi 🦉 named entities
• EdTech Trio 🎶🎓 for learning IN African languages, learning FROM African languages, and learning OF African languages
http://kamu.si/big-picture-playbook