

Natural Language Processing

Introduction to NLP
Instructor: Moushmi Dasgupta

Some of the material is from Georgia Institute of Technology, Atlanta, GA, USA.
AGENDA
Introduction to NLP and Business Applications
1.1 What is Language?
1.2 Building Blocks of Language
1.3 Why is NLP Challenging?
1.4 Machine Learning, Deep Learning, and NLP: An Overview
1.5 Approaches to NLP in Business Analytics
Prerequisites:
1. Python programming
2. An understanding of Machine Learning
3. Invest in attending classroom sessions (1 or 2 classes per week, 3+ hours each)
4. Invest in yourself with 1 hour of self-study every day
Human Language
Google search reports that there are 7,151 living languages

The system of sounds and writing that human beings use to express their thoughts, ideas and feelings

Language as understood by the machine learning algorithms


Communication With Machines

Timeline: ~1950s-70s, ~1980s, today


Analytics Domain https://www.linkedin.com/in/moushmi1234/
Conversational Agents

Conversational agents contain:


● Speech recognition
● Language analysis
● Dialogue processing
● Information retrieval
● Text to speech

Machine Translation

Natural Language Processing

Applications
• Machine Translation
• Information Retrieval
• Question Answering
• Dialogue Systems
• Information Extraction
• Summarization
• Sentiment Analysis
• ...

Core Technologies
• Language modeling
• Part-of-speech tagging
• Syntactic parsing
• Named-entity recognition
• Word sense disambiguation
• Semantic role labeling
• ...
NLP lies at the intersection of computational linguistics and machine learning.
Natural Language Processing (NLP) Examples

▪ Email filters.
▪ Smart assistants – Siri, Alexa, Google Assistant
▪ Search results
▪ Predictive text Analytics
▪ Language translation
▪ Digital phone calls
▪ Data analysis
▪ Text analytics
Levels of Linguistic Knowledge
Phonetics, Phonology

• Pronunciation modeling

Phonetics - the study of the sounds of human speech

Words

• Language modeling
• Tokenization
• Spelling correction

Morphology

• Morphological analysis
• Tokenization
• Lemmatization

Morphology - the form of words, studied as a branch of linguistics
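
The tokenization and lemmatization steps listed above can be sketched in a few lines of Python. This is only an illustration: the regular expression is one simple way to split text, and LEMMAS is a hypothetical hand-written dictionary, whereas real lemmatizers rely on large lexicons and part-of-speech information.

```python
import re

# Hypothetical lemma dictionary, for illustration only.
LEMMAS = {"ships": "ship", "shipped": "ship", "children": "child"}

def tokenize(text):
    """Split text into lowercase word tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def lemmatize(tokens):
    """Map each token to its dictionary form; unknown tokens stay unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]

tokens = tokenize("The children shipped two ships.")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

Note that both "shipped" and "ships" map to the same lemma, which is the point of lemmatization: different surface forms are grouped under one dictionary entry.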


Part of Speech

• Part-of-speech tagging

Syntax

• Syntactic parsing

Semantics

• Named entity recognition
• Word sense disambiguation
• Semantic role labeling

Discourse

English Lexicon

A lexicon, word-hoard, wordbook, or word-stock is the vocabulary of a person, language, or branch of knowledge (such as nautical or medical). ...

The word "lexicon" derives from the Greek λεξικόν (lexicon), neuter of λεξικός (lexikos) meaning "of or for words."

Train
- Railway Station
- Platform
- Luggage
- Ticket collector
- Passengers
- Coach Number
- Berth
Why is NLP Hard?

1. Ambiguity
2. Scale
3. Sparsity
4. Variation
5. Expressivity
6. Unmodeled Variables
7. Unknown representations

Ambiguity

• Ambiguity at multiple levels
• Word senses: bank (finance or river?)
• Part of speech: chair (noun or verb?)
• Syntactic structure: I can see a man with a telescope
• Multiple: I made her duck
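
The syntactic ambiguity of "I can see a man with a telescope" can be made concrete by counting its parse trees. The sketch below uses a CKY-style chart over a tiny grammar in Chomsky normal form. The grammar, the lexicon, and the category name VPm (a verb phrase headed by a modal) are assumptions made up for this illustration, not a grammar of English.

```python
from collections import defaultdict

# Toy grammar (assumed for illustration): (parent, left child, right child)
BINARY_RULES = [
    ("S", "NP", "VPm"),
    ("VPm", "MD", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP modifies the seeing event
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP modifies the man
    ("PP", "P", "NP"),
]
LEXICON = {
    "i": ["NP"], "can": ["MD"], "see": ["V"], "a": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}

def count_parses(words, root="S"):
    """Count distinct parse trees for the words using a CKY chart."""
    n = len(words)
    # chart[i][j][X] = number of trees with root X spanning words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[i][i + 1][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][root]

ambiguous = count_parses("i can see a man with a telescope".split())
unambiguous = count_parses("i can see a man".split())
print(ambiguous, unambiguous)
```

Under this toy grammar the longer sentence has two analyses (who holds the telescope?), while the shorter one has exactly one.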

Part of Speech Tagging

Part-of-speech tagging converts a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

Sentences with all 8 Parts of Speech
1. Noun – Tom lives in New York.
2. Pronoun – Did she find the book she was looking for?
3. Verb – I reached home.
4. Adverb – The tea is too hot.
5. Adjective – The movie was amazing.
6. Preposition – The candle was kept under the table.
7. Conjunction – I was at home all day, but I am feeling very tired.
8. Interjection – Oh! I forgot to turn off the stove.
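
The conversion described on this slide (sentence, then list of words, then list of (word, tag) tuples) might look like the following minimal sketch. TOY_LEXICON and the tag names are hypothetical and cover only one example sentence; a real tagger is trained on an annotated corpus and uses context to choose between tags.

```python
# Hypothetical one-tag-per-word lexicon, for illustration only.
TOY_LEXICON = {"the": "DET", "movie": "NOUN", "was": "VERB", "amazing": "ADJ"}

def pos_tag(sentence):
    """Turn a sentence into a list of (word, tag) tuples via dictionary lookup."""
    words = sentence.rstrip(".").split()  # sentence -> list of words
    return [(word, TOY_LEXICON.get(word.lower(), "UNK")) for word in words]

tagged = pos_tag("The movie was amazing.")
print(tagged)
```

Words missing from the lexicon receive the placeholder tag UNK, which hints at why lookup alone is not enough for real text.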

Morphology + Syntax

A ship-shipping ship, shipping shipping-ships

Semantics

377 people, equivalent to a jumbo jet crashing, die every day.

Our job is to find this jumbo jet and stop it!

Syntax + Semantics

We saw the woman with the telescope wrapped in paper.

• Who has the telescope?
• Who or what is wrapped in paper?
• An event of perception, or an assault?

Dealing with Ambiguity

How can we model ambiguity?


• Non-probabilistic methods (CKY parsers for syntax) return all possible analyses
• Probabilistic models (HMMs for POS tagging, PCFGs for syntax) and algorithms (Viterbi, probabilistic CKY) return the best possible analysis, i.e., the most probable one
• But the “best” analysis is only good if our probabilities are accurate. Where do they come from?
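
As a sketch of how a probabilistic model picks the most probable analysis, here is the Viterbi algorithm applied to a tiny HMM tagger. All probabilities, the tag set, and the example sentence are invented for illustration; in practice they would be estimated from an annotated corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence and its probability under an HMM."""
    # best[t][tag] = (probability of the best path ending in tag at position t,
    #                 previous tag on that path)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            prob, prev = max((best[-1][p][0] * trans_p[p][tag], p) for p in tags)
            column[tag] = (prob * emit_p[tag].get(word, 0.0), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path)), best[-1][last][0]

# Toy parameters (assumed, not learned from data).
TAGS = ["PRON", "AUX", "VERB", "NOUN"]
START_P = {"PRON": 0.6, "AUX": 0.1, "VERB": 0.1, "NOUN": 0.2}
TRANS_P = {
    "PRON": {"PRON": 0.05, "AUX": 0.5, "VERB": 0.4, "NOUN": 0.05},
    "AUX": {"PRON": 0.1, "AUX": 0.1, "VERB": 0.7, "NOUN": 0.1},
    "VERB": {"PRON": 0.2, "AUX": 0.1, "VERB": 0.1, "NOUN": 0.6},
    "NOUN": {"PRON": 0.1, "AUX": 0.2, "VERB": 0.4, "NOUN": 0.3},
}
EMIT_P = {
    "PRON": {"they": 1.0},
    "AUX": {"can": 1.0},
    "VERB": {"can": 0.3, "fish": 0.7},
    "NOUN": {"can": 0.4, "fish": 0.6},
}

best_tags, best_prob = viterbi(["they", "can", "fish"], TAGS, START_P, TRANS_P, EMIT_P)
print(best_tags, best_prob)
```

With these made-up numbers the tagger prefers the reading "they are able to fish" over "they put fish in cans". Changing the probabilities changes the answer, which is why the quality of the estimates matters. For long sentences, implementations usually sum log probabilities instead of multiplying raw ones to avoid numerical underflow.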

Corpora
• A corpus is a collection of text
• Often annotated in some way
(Sometimes just lots of text)
Examples:
• Penn Treebank: 1M words of parsed WSJ
• Canadian Hansards: 10M+ words of French/English sentences
• Yelp reviews
• The Web!

Rosetta Stone: Demotic, hieroglyphic and Greek.


Statistical NLP

• Like most other parts of AI, NLP is dominated by statistical methods
• Typically more robust than rule-based methods
• Relevant statistics/probabilities are learned from data
• Normally requires lots of data about any particular phenomenon

Sparsity

• Sparse data due to Zipf’s Law
• Example: the frequency of different words in a large text corpus

Sparsity

• Order words by frequency. What is the frequency of the nth ranked word?
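
Zipf's law says that the frequency of the nth ranked word is roughly proportional to 1/n. The sketch below builds a rank/frequency table and compares it with that estimate. The example text is a made-up toy and far too small to show the law convincingly; the exponent parameter s is an assumption that defaults to the classic value of 1.

```python
from collections import Counter

def rank_frequencies(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def zipf_estimate(rank, top_count, s=1.0):
    """Frequency predicted for a rank, given the count of the top-ranked word."""
    return top_count / rank ** s

text = "the cat saw the dog and the dog saw the bird"
table = rank_frequencies(text.split())
for rank, word, count in table:
    print(rank, word, count, zipf_estimate(rank, table[0][2]))
```

Run on a large corpus, the same table shows a handful of very frequent words followed by a long tail of words that occur only once or twice.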

Sparsity

• Regardless of how large our corpus is, there will be a lot of infrequent words
• This means we need to find clever ways to estimate probabilities for things we have rarely or never seen
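
One simple way to give non-zero probability to words that were never seen is add-one (Laplace) smoothing, sketched below for unigram probabilities. This is one technique among many and is named here as an illustration, not as the method the course prescribes. The vocabulary size of 10 is an assumed figure that counts unseen words too.

```python
from collections import Counter

def laplace_unigram(tokens, vocab_size, alpha=1.0):
    """Build a smoothed unigram probability function.

    vocab_size is the assumed number of distinct words in the language model,
    including words that never occur in the training tokens.
    """
    counts = Counter(tokens)
    total = len(tokens)

    def prob(word):
        return (counts[word] + alpha) / (total + alpha * vocab_size)

    return prob

prob = laplace_unigram("the dog saw the cat".split(), vocab_size=10)
p_seen = prob("the")      # occurred twice in the training tokens
p_unseen = prob("zebra")  # never occurred, but still gets some probability
print(p_seen, p_unseen)
```

Without smoothing, the unseen word would get probability zero, and any sentence containing it would be judged impossible.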

Variation

• Suppose we train a part-of-speech tagger or a parser on the Wall Street Journal
• What will happen if we try to use this tagger/parser for social media?
• “ikr smh he asked fir yo last name so he can add u on fb lololol”
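
One way to see the problem is to measure how many tokens in the social media example fall outside the vocabulary a newswire-trained model would know. In this sketch NEWSWIRE_VOCAB is a small hypothetical stand-in for such a vocabulary, not one extracted from the Wall Street Journal.

```python
def oov_rate(tokens, vocabulary):
    """Return the share of out-of-vocabulary tokens and the tokens themselves."""
    unknown = [token for token in tokens if token.lower() not in vocabulary]
    return len(unknown) / len(tokens), unknown

# Hypothetical stand-in for a vocabulary learned from newswire text.
NEWSWIRE_VOCAB = {"he", "asked", "for", "your", "last", "name", "so",
                  "can", "add", "you", "on"}

post = "ikr smh he asked fir yo last name so he can add u on fb lololol"
rate, unknown = oov_rate(post.split(), NEWSWIRE_VOCAB)
print(rate, unknown)
```

Nearly half of the tokens are unknown to the toy vocabulary, so a tagger or parser trained on newswire has little to go on for this kind of text.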

Expressivity

• Not only can one form have different meanings (ambiguity) but the same meaning can be expressed with different forms:
• She gave the book to Tom vs. She gave Tom the book
• Some kids popped by vs. A few children visited
• Is that window still open? vs. Please close the window

Unmodeled Variables

World knowledge
• I dropped the glass on the floor and it broke
• I dropped the hammer on the glass and it broke
Unknown Representations

Very difficult to capture the underlying representation, since we don’t even know how to represent the knowledge a human has/needs:
• What is the “meaning” of a word or sentence?
• How to model context?
• Other general knowledge?

Desiderata for NLP Models

• Sensitivity to a wide range of phenomena and constraints in human language
• Generality across languages, modalities, genres, styles
• Strong formal guarantees (e.g., convergence, statistical efficiency, consistency)
• High accuracy when judged against expert annotations or test data
• Ethical

Symbolic and Probabilistic NLP

Probabilistic and Connectionist NLP

AI – ML – DL - NLP
NLP vs. Machine Learning

• To be successful, a machine learner needs bias/assumptions; for NLP, that might be linguistic theory/representations.
• The underlying representation is not directly observable.
• Symbolic, probabilistic, and connectionist ML have all seen NLP as a source of inspiring applications.

NLP vs. Linguistics

• NLP must contend with NL data as found in the world
• NLP ≈ computational linguistics
• Linguistics has begun to use tools originating in NLP!

Fields with Connections to NLP

• Machine learning
• Linguistics (including psycho-, socio-, descriptive, and theoretical)
• Cognitive science
• Information theory
• Logic
• Data science
• Political science
• Psychology
• Economics
• Education
Today’s Applications

• Conversational agents
• Information extraction and question answering
• Machine translation
• Opinion and sentiment analysis
• Social media analysis
• Visual understanding
• Essay evaluation
• Mining legal, medical, or scholarly literature
Factors Changing NLP Landscape

1. Increases in computing power
2. The rise of the web, then the social web
3. Advances in machine learning
4. Advances in understanding of language in social context

Python Libraries for NLP

1. Natural Language Toolkit (NLTK)
2. GenSim
3. SpaCy
4. CoreNLP
5. TextBlob
6. AllenNLP
7. polyglot
8. scikit-learn

• Core Python
• Numpy
• Pandas
• Scikit Learn (SkLearn)
• Beautiful Soup
What’s Next?

• NLP Pipeline
