NLP Unit-I-1
Unit-I
• Finding the Structure of Words
• Finding the Structure of Documents
Finding the Structure of Words
• Words and their Components
• Tokens
• Lexemes
• Morphemes
• Typology
• Issues and Challenges
• Irregularity
• Ambiguity
• Productivity
• Morphological Models
• Dictionary Lookup
• Finite-State Morphology
• Unification-Based Morphology
• Functional Morphology
• Morphology Induction
Finding the Structure of Words-Introduction
• Human language is used to express our thoughts, and through
language, we receive information and infer its meaning.
• Linguistic expressions (words, phrases, sentences) show structure of
different kinds and complexity and consist of more elementary
components.
• The co-occurrence of linguistic expressions in context refines the
notions they refer to in isolation and implies further meaningful
relations between them.
Finding the Structure of Words-Introduction
• Entire disciplines study language from different perspectives and at
different levels of detail:
• Morphology: studies the variable forms and functions of words.
• Syntax: concerns the arrangement of words into phrases, clauses, and
sentences.
• Phonology: describes the constraints on word structure that arise from pronunciation.
• Orthography: deals with the conventions for writing a language.
• Etymology and lexicography: study the evolution of words and explain the
semantic, morphological, and other links among them.
Finding the Structure of Words-Morphological Parsing
• Here we discuss:
• How can words of distinct types be identified in human languages?
• How can the internal structure of words be modelled with respect to the
grammatical properties and lexical concepts the words should represent?
• This discovery of word structure is called morphological parsing.
Words and their Components
• Words in most languages are the smallest linguistic units that can
form a complete utterance by themselves.
• Three important terms which are integral parts of a word are:
• Phonemes – the distinctive units of sound in spoken language.
• Graphemes – the smallest unit of a written language which corresponds to a
phoneme.
• Morphemes - the minimal part of a word that delivers aspects of meaning to
the word.
Words and their Components
• Tokens
• Lexemes
• Morphemes
• Typology
Tokens
• Let us look at an example in English:
Will you read the newspaper? Will you read it? I won’t read it.
• Two words deserve attention here: newspaper and won’t.
• In writing, newspaper and its associated concepts are very clear, but
in speech there are a few issues.
• As for won’t, linguists prefer to analyze it as two words, or tokens:
will and not.
• This type of analysis is called tokenization and normalization.
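The tokenization and normalization step above can be sketched in Python. This is a minimal illustration, not a complete tokenizer: the `CONTRACTIONS` table and the `tokenize` function are hypothetical names introduced here, and the table covers only a few English forms.

```python
import re

# Hypothetical normalization table (an assumption for illustration):
# each contraction maps to the tokens it is analyzed into.
CONTRACTIONS = {
    "won't": ["will", "not"],
    "didn't": ["did", "not"],
    "i'm": ["i", "am"],
    "we've": ["we", "have"],
}

def tokenize(text):
    """Split text into lowercase tokens and expand known contractions."""
    # Normalize the typographic apostrophe so table lookups match.
    text = text.replace("\u2019", "'").lower()
    # A word (optionally with an apostrophe part) or a single non-letter symbol.
    raw = re.findall(r"[a-z]+(?:'[a-z]+)?|[^\sa-z]", text)
    tokens = []
    for tok in raw:
        tokens.extend(CONTRACTIONS.get(tok, [tok]))
    return tokens

print(tokenize("I won\u2019t read it."))
```

Here the single written word won’t yields the two tokens will and not, while newspaper stays one token.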
Tokens
• In Arabic or Hebrew, certain tokens are concatenated in writing with
the preceding or the following words, possibly changing their forms.
• Tokens of this type are called clitics (compare English I’m, we’ve).
• In the writing systems of Chinese, Japanese, and Thai, white space is
not used to separate words.
• In Korean, character strings are called eojeol ‘word segment’ and
correspond to speech or cognitive units which are usually larger than
words and smaller than clauses.
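When white space does not mark word boundaries, the tokens have to be recovered by segmentation. The slides do not prescribe a method; the sketch below uses greedy longest-match (maximum matching) against a word list, a common baseline swapped in here for illustration. The function name `max_match` and the toy lexicon are assumptions, and unspaced English stands in for a real unsegmented script.

```python
def max_match(text, lexicon):
    """Greedy left-to-right longest-match segmentation (a simple baseline)."""
    longest = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the lexicon matches at this position.
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

toy_lexicon = {"will", "you", "read", "it"}  # hypothetical word list
print(max_match("willyoureadit", toy_lexicon))
```

Greedy matching can pick the wrong boundary when a longer dictionary word overlaps the correct one, which is one reason practical segmenters use statistical models instead.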
Lexemes
• A given word can appear in many alternative forms.
• The set of such forms is called a lexeme or lexical item.
• They constitute the lexicon of a language.
• Lexemes are divided by their lexical categories such as verb, noun,
adjective, adverb etc.
• The citation form of a lexeme, by which it is identified, is called its lemma.
• In the conversion of singular mouse to plural mice we inflect the
lexeme.
• In the case of receiver and reception we derive the words from the
verb to receive.
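The relation between word forms, lexemes, and lemmas can be sketched as a small lookup table. The `LEXICON` contents and the `lemmatize` helper are hypothetical and hand-built for this illustration; a real lexicon would be far larger.

```python
# Toy lexicon (illustrative assumption): each lexeme is keyed by its
# lemma and lexical category and lists the word forms that belong to it.
LEXICON = {
    ("mouse", "noun"): {"mouse", "mice"},
    ("receive", "verb"): {"receive", "receives", "received", "receiving"},
    ("receiver", "noun"): {"receiver", "receivers"},
    ("reception", "noun"): {"reception", "receptions"},
}

def lemmatize(form):
    """Return every (lemma, category) whose lexeme contains the given form."""
    return sorted(key for key, forms in LEXICON.items() if form in forms)

print(lemmatize("mice"))       # inflection: same lexeme as mouse
print(lemmatize("receivers"))  # derivation: receiver is its own lexeme
```

Inflection (mouse, mice) stays inside one lexeme, whereas derivation (receive, receiver, reception) produces separate lexemes, each with its own lemma.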
Lexemes
• Example: Did you see him? I didn’t see him. I didn’t see anyone.
• Example in Czech
• Example in Telugu:
vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru.
They beautiful garden slowly walking
Morphemes
• The structural components that associate the properties of word
forms are called morphs.
• The morphs that by themselves represent some aspect of the
meaning of a word are called morphemes of some function.
• Example : dis-agree-ment-s where agree is a free lexical morpheme
and other elements are bound grammatical morphemes.
• When morphs interact with each other, they can undergo additional
phonological and orthographic changes.
• The resulting alternative forms of a morpheme are called allomorphs.
• Example: the past tense morphemes, plural morphemes etc.
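As a rough illustration of allomorphs, the sketch below attaches the English plural morpheme and picks one of its spelling variants from the shape of the stem. The function `pluralize` is a hypothetical helper with deliberately simplified rules: it ignores irregular nouns such as mouse and consonant doubling as in quiz.

```python
def pluralize(noun):
    """Attach the plural morpheme, choosing an orthographic allomorph."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"        # allomorph -es after sibilant spellings
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"  # consonant + y: y becomes i before -es
    return noun + "s"             # default allomorph -s

for word in ["disagreement", "box", "city", "boy"]:
    print(word, "->", pluralize(word))
```

One morpheme (plural) surfaces here as three different morphs, selected by the orthographic context.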
Typology
• Morphological typology divides languages into groups. Here we outline the
typology that is based on quantitative relations between words, their
morphemes, and their features:
• Isolating, or analytic, languages include no or relatively few words that
would comprise more than one morpheme (typical members are Chinese,
Vietnamese, and Thai; analytic tendencies are also found in English).
• Synthetic languages can combine more morphemes in one word and are
further divided into agglutinative and fusional languages.
• Agglutinative languages have morphemes associated with only a single function at a
time (as in Korean, Japanese, Finnish, and Tamil, etc.).
• Fusional languages are defined by a feature-per-morpheme ratio higher than one
(as in Arabic, Czech, Latin, Sanskrit, German, etc.).
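The feature-per-morpheme ratio can be computed from an annotated segmentation. The sketch below uses two hand-annotated toy analyses that are assumptions made for illustration: Finnish taloissa ("in houses", talo-i-ssa) and Latin amo ("I love", am-o). Only the grammatical morphs are listed, and the feature inventory is simplified.

```python
# Hand-annotated toy analyses (illustrative assumptions): each grammatical
# morph is paired with the list of features it expresses.
ANALYSES = {
    # Agglutinative: one feature per morph.
    "taloissa": [("i", ["number=plural"]), ("ssa", ["case=inessive"])],
    # Fusional: a single ending carries several features at once.
    "amo": [("o", ["person=1", "number=singular", "tense=present",
                   "mood=indicative", "voice=active"])],
}

def features_per_morpheme(word):
    """Average number of features expressed by each grammatical morph."""
    morphs = ANALYSES[word]
    total_features = sum(len(feats) for _, feats in morphs)
    return total_features / len(morphs)

for word in ANALYSES:
    print(word, features_per_morpheme(word))
```

A ratio of one matches the agglutinative pattern, while a ratio above one matches the fusional pattern.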
Typology
• In accordance with the notions about word formation processes
mentioned earlier, we can also discern:
• Concatenative languages, which link morphs and morphemes one after another.
• Nonlinear languages, which allow structural components to merge
nonsequentially in order to apply tonal morphemes or change the consonantal
or vocalic templates of words.
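Nonlinear word formation can be sketched with a root-and-pattern example. The code below interleaves a consonantal root with a vocalic template, using the well-known Arabic root k-t-b in a simplified transliteration. The helper `interdigitate` and the template notation, where `C` marks a consonant slot, are assumptions made for this illustration.

```python
def interdigitate(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

# Root k-t-b ("write") combined with three templates.
print(interdigitate("ktb", "CaCaCa"))  # kataba, "he wrote"
print(interdigitate("ktb", "CiCaaC"))  # kitaab, "book"
print(interdigitate("ktb", "CaaCiC"))  # kaatib, "writer"
```

Unlike concatenation, neither the root nor the template appears as a contiguous substring of the resulting word.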
Issues and Challenges
• Irregularity
• Ambiguity
• Productivity
Issues and Challenges- Introduction
• Morphological parsing tries to remove unnecessary irregularities and to
limit ambiguity, both of which exist in natural languages.
• Irregularity is all about forms and structures that are not described
appropriately by a prototypical linguistic model.
• Ambiguity is indeterminacy in interpretation of expressions of language.
• Beyond ordinary ambiguity, we also need to deal with syncretism, or
systematic ambiguity (for example, English bet, which serves as base form,
past tense, and past participle).
• In addition, morphological modelling faces the problem of
productivity and creativity in language, by which new but perfectly
meaningful words or senses are coined.
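Ambiguity means that a parser must return every reading of a form rather than a single answer. The sketch below shows this for bet; the `ANALYSES` table and the `analyze` function are hypothetical, and the feature labels are informal.

```python
# Toy analysis table (illustrative assumption):
# word form -> all (lemma, description) readings.
ANALYSES = {
    "bet": [
        ("bet", "verb, base form"),
        ("bet", "verb, past tense"),
        ("bet", "verb, past participle"),
        ("bet", "noun, singular"),
    ],
    "bets": [
        ("bet", "verb, 3rd person singular present"),
        ("bet", "noun, plural"),
    ],
}

def analyze(form):
    """Return every known reading of a word form (empty if unknown)."""
    return ANALYSES.get(form, [])

for lemma, reading in analyze("bet"):
    print(lemma, "|", reading)
```

An unknown form returns an empty list, which is where productivity becomes a problem: a fixed table cannot cover newly coined words.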
Irregularity
• Morphological parsing provides generalization and abstraction in the
world of words.
• Irregular morphology can be seen as enforcing some extended rules
the nature of which is phonological, over the underlying or
prototypical regular word forms.
• In English, the regular past form is made by adding –ed or –t (accepted,
built).
• The irregular verbs in English tend to take different forms in the past
or in the present participle depending on the origin of the word.
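A common way to handle irregularity is to consult a list of exceptions first and apply the general rule only when no exception matches. The sketch below assumes a small hypothetical `IRREGULAR_PAST` table and a simplified regular rule that ignores consonant doubling (stop, stopped) and the y to i change (carry, carried).

```python
# Hypothetical exception list: irregular past forms are stored, not computed.
IRREGULAR_PAST = {"build": "built", "go": "went", "see": "saw", "bet": "bet"}

def past_tense(verb):
    """Look up exceptions first, then fall back to the regular -ed rule."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"   # receive -> received
    return verb + "ed"      # accept -> accepted

for verb in ["accept", "receive", "build", "go"]:
    print(verb, "->", past_tense(verb))
```

The exception table grows with the number of irregular verbs, while the regular rule covers any verb not listed, including newly coined ones.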
Irregularity
• A few examples:
Irregularity
• Example in Arabic:
Irregularity-Telugu
• Roots like telus- in Telugu inflect differently in the past.
• Examples