Seminar ON: Natural Language Processing


SEMINAR

ON
NATURAL LANGUAGE PROCESSING

Prepared by
NEHA S. SHAH
Roll No. 53
BE-IV (7th Sem) [CO]
Guide: Mita Parikh
Co-Guide: Mayuri Mehta

SARVAJANIK COLLEGE OF ENGINEERING AND TECHNOLOGY, SURAT
VEER NARMAD SOUTH GUJARAT UNIVERSITY
SARVAJANIK COLLEGE OF ENGINEERING AND TECHNOLOGY
Dr. R.K.DESAI ROAD, ATHWALINES,
SURAT-395001

DEPARTMENT OF COMPUTER ENGINEERING

CERTIFICATE

This is to certify that Ms. SHAH NEHA S., student of B.E. IV (CO), Semester VII,
Roll No. 53, has successfully conducted her seminar on NATURAL LANGUAGE
PROCESSING [NLP] in accordance with technical and theoretical specifications
under the guidance of MITA PARIKH for the year 2006-2007.

Signature of Guide: ______________

Signature of Co-Guide: ______________

Signature of DIC (Computer Engg. Deptt.): _____________

Signature of Jury members:

____________ ____________ _____________ _______________


ACKNOWLEDGEMENT

We are very thankful to our guide, MITA PARIKH, and co-guide, MAYURI MEHTA, who
gave us the opportunity to prepare this seminar report. Under their guidance we could
successfully prepare our seminar report on NATURAL LANGUAGE PROCESSING.

We are also thankful to our colleagues for their help.

NEHA S. SHAH
Roll No. 53
BE-IV (7th Sem) [CO]

ABSTRACT

Natural Language Processing (NLP) is the field concerned with designing and building
software that can analyze, understand and generate the languages that humans use
routinely. A Natural Language (NL) exists as a result of evolution as opposed to invention.
Consider French, English or German; these languages have evolved over centuries to
become what they are today. On the other hand, languages such as COBOL, C++ and
SQL were created in a relatively short period of time. Natural languages tend to have
very large lexicons and highly complex grammars. Natural language processing provides
methods for performing useful tasks with natural languages. NLP methodologies and
techniques assume that the patterns in grammar and the conceptual relationships between
words in language can be articulated scientifically. The ultimate goal of NLP is to
determine a system of symbols, relations, and conceptual information that can be used by
computer logic to implement artificial language interpretation. Natural language
processing has its roots in semiotics, the study of signs. Semiotics is broken up into three
branches: 1) syntax, 2) semantics and 3) pragmatics. NLP currently faces some
limitations, among them 1) physical limitations and 2) the shortcomings of current
information retrieval systems. Nevertheless, its future applications are expected to
benefit mankind widely.
INDEX

I. INTRODUCTION
• DEFINITION
• ORIGIN
• GOAL

II. LEVELS OF NATURAL LANGUAGE PROCESSING

III. IMPLEMENTATION
• PARSER
• PROLOG

IV. APPLICATIONS

V. LIMITATIONS

VI. FUTURE

VII. CONCLUSION

VIII. BIBLIOGRAPHY

INTRODUCTION
1. DEFINITION:
Natural Language Processing is a theoretically motivated range of computational
techniques for analyzing and representing naturally occurring texts at one or more levels
of linguistic analysis for the purpose of achieving human-like language processing for a
range of tasks or applications.
'Natural Language Processing' (NLP) is a convenient description for all attempts to use
computers to process Natural Language. NLP includes:
• Speech Synthesis
• Speech Recognition
• Natural Language Understanding
• Natural Language Generation
• Machine Translation (MT)

2. ORIGINS:

Research in natural language processing has been going on for several decades
dating back to the late 1940s. Machine translation (MT) was the first computer-based
application related to natural language.

3. GOAL:

• The goal of NLP, as stated above, is “to accomplish human-like language
processing”.

• The choice of the word ‘processing’ is very deliberate, and it should not be replaced
with ‘understanding’.

• Although the field of NLP was originally referred to as Natural Language
Understanding (NLU) in the early days of AI, it is well agreed today that while the
goal of NLP is true NLU, that goal has not yet been accomplished.

• Practical goals of NLP:

o There are more practical goals for NLP, many related to the particular
application for which it is being utilized.

o For example, an NLP-based IR system has the goal of providing more
precise, complete information in response to a user’s real information
need.

o The goal of the NLP system here is to represent the true meaning and
intent of the user’s query, which can be expressed as naturally in everyday
language as if the user were speaking to a reference librarian.

o Also, the contents of the documents that are being searched will be
represented at all their levels of meaning, so that a true match between
need and response can be found no matter how either is expressed in
its surface form.

LEVELS OF NATURAL LANGUAGE PROCESSING


Figure 1: Levels of NLP

1] PHONOLOGICAL:

This level deals with the interpretation of speech sounds within and across words.
There are, in fact, three types of rules used in phonological analysis:

1) Phonetic rules: for sounds within words.

2) Phonemic rules: for variations of pronunciation when words are spoken together.

3) Prosodic rules: for fluctuation in stress and intonation across a sentence.

In an NLP system that accepts spoken input, the sound waves are analyzed and
encoded into a digitized signal for interpretation by various rules or by comparison to the
particular language model being utilized.
2] MORPHOLOGICAL:

This level concerns how words are constructed from more basic meaning units called
morphemes. A morpheme is the smallest meaningful unit of a language.

• Morphological analysis breaks words into their linguistic components (morphemes).

E.g.

Cars : car + PLU

Giving : give + PROG

Geliyordum : gel + PROG + PAST + 1SG (Turkish: "I was coming")

• Ambiguity: a word may have more than one analysis.

Flies : fly VERB + 3SG

        fly NOUN + PLU

• Morphological analysis is relatively simple for English, but it is more difficult for
some languages such as Hindi and Turkish.
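
The analyses above can be reproduced with a very small suffix-stripping sketch. It is only an illustration: the lexicon, the suffix rules and the tag names (including 3SG for third person singular) are assumptions made for this example, not a real morphology of English.

```python
# Toy morphological analyzer: a sketch, not a full morphology of English.
# LEXICON and SUFFIX_RULES are hypothetical, hand-made tables.
LEXICON = {
    "car": {"NOUN"},
    "give": {"VERB"},
    "fly": {"NOUN", "VERB"},
}

# (suffix, text that restores the stem, category required, feature tag)
SUFFIX_RULES = [
    ("ies", "y", "NOUN", "PLU"),
    ("ies", "y", "VERB", "3SG"),
    ("s", "", "NOUN", "PLU"),
    ("s", "", "VERB", "3SG"),
    ("ing", "e", "VERB", "PROG"),
    ("ing", "", "VERB", "PROG"),
]


def analyze(word):
    """Return every (stem, category, feature) analysis licensed by the tables."""
    word = word.lower()
    analyses = []
    for suffix, replacement, category, feature in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)] + replacement
        if category in LEXICON.get(stem, set()):
            analyses.append((stem, category, feature))
    return analyses
```

Calling analyze("Flies") returns two analyses, one as a noun and one as a verb, which is exactly the kind of ambiguity that later levels of processing have to resolve.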

3] LEXICAL:

A lexical analyzer reads the source program character by character and returns the
tokens of the source program. A token describes a pattern of characters that has the same
meaning in the source program.

Ex: newval := oldval + 12

Tokens:

newval    identifier
:=        assignment operator
oldval    identifier
+         add operator
12        number

Regular expressions are used to describe tokens. A (deterministic) finite state
automaton can be used in the implementation of a lexical analyzer.
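
As a minimal sketch of how regular expressions describe tokens, the following Python fragment tokenizes the example statement. The token names and patterns are assumptions chosen for this example; Python's re module compiles them into a matcher that plays the role of the finite state automaton.

```python
import re

# Each token class is described by a regular expression (illustrative names).
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("assign_op", r":="),
    ("add_op", r"\+"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Scan the source left to right and return (lexeme, token class) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "skip":  # white space separates tokens but is not one
            tokens.append((match.group(), match.lastgroup))
        pos = match.end()
    return tokens
```

For the statement newval := oldval + 12 this yields the five tokens listed above, in order.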

4] SYNTACTIC:

A syntax analyzer creates the syntactic structure (generally a parse tree) of the
given program. A syntax analyzer is also called a parser. A parse tree describes a
syntactic structure. For the statement newval := oldval + 12, the parse tree is
(children are indented under their parent):

assignment statement
    identifier
        newval
    :=
    expression
        expression
            identifier
                oldval
        +
        expression
            number
                12

In a parse tree, all terminals are at the leaves. All inner nodes are non-terminals of a
context free grammar. The syntax of a language is specified by a context free grammar
(CFG). The rules in a CFG are mostly recursive. A syntax analyzer checks whether a
given program satisfies the rules implied by a CFG or not. If it does, the syntax
analyzer creates a parse tree for the given program. The syntax analyzer works on the
smallest meaningful units (tokens) in a source program to recognize meaningful
structures in the programming language.
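
A small recursive-descent sketch can make the idea of checking tokens against a CFG concrete. The grammar below is an assumption made for this example (it uses a right-recursive expression rule so that it can be parsed top-down), and the tokens are assumed to be already separated by spaces.

```python
def parse_assignment(tokens):
    """assignment -> identifier ':=' expression"""
    if len(tokens) < 3 or not tokens[0].isidentifier() or tokens[1] != ":=":
        raise SyntaxError("expected: identifier := expression")
    tree, rest = parse_expression(tokens[2:])
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return ("assignment", ("identifier", tokens[0]), ":=", tree)


def parse_expression(tokens):
    """expression -> operand | operand '+' expression"""
    left, rest = parse_operand(tokens)
    if rest and rest[0] == "+":
        right, rest = parse_expression(rest[1:])
        return ("expression", left, "+", right), rest
    return ("expression", left), rest


def parse_operand(tokens):
    """operand -> identifier | number"""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    head = tokens[0]
    if head.isdigit():
        return ("number", head), tokens[1:]
    if head.isidentifier():
        return ("identifier", head), tokens[1:]
    raise SyntaxError(f"unexpected token {head!r}")
```

parse_assignment("newval := oldval + 12".split()) returns a nested tuple whose leaves are the terminals and whose inner nodes are the non-terminals, while an input such as "newval := + 12" is rejected with a SyntaxError.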
5] SEMANTIC:

A semantic analyzer checks the source program for semantic errors and collects
the type information needed for code generation. Type checking is an important part of
semantic analysis. A context-free grammar, as used in syntax analysis, cannot normally
represent semantic information. The context-free grammars used in syntax analysis are
therefore augmented with attributes (semantic rules); the result is a syntax-directed
translation, or attribute grammar.

Ex: newval := oldval + 12

The type of the identifier newval must match the type of the expression (oldval + 12).
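
The type rule for the example can be sketched as a bottom-up computation of a type attribute. The symbol table, the node shapes and the type names below are hypothetical and exist only for this illustration.

```python
# Hypothetical symbol table: the declared type of each identifier.
SYMBOLS = {"newval": "int", "oldval": "int", "label": "string"}


def type_of(node):
    """Compute the type attribute of an expression node bottom-up."""
    kind = node[0]
    if kind == "number":
        return "int"
    if kind == "identifier":
        name = node[1]
        if name not in SYMBOLS:
            raise TypeError(f"undeclared identifier {name!r}")
        return SYMBOLS[name]
    if kind == "add":
        left, right = type_of(node[1]), type_of(node[2])
        if left != right:
            raise TypeError(f"cannot add {left} and {right}")
        return left
    raise ValueError(f"unknown node kind {kind!r}")


def check_assignment(target, expression):
    """The type of the target must match the type of the expression."""
    target_type = type_of(("identifier", target))
    expression_type = type_of(expression)
    if target_type != expression_type:
        raise TypeError(f"cannot assign {expression_type} to {target} ({target_type})")
    return target_type
```

With these tables, assigning oldval + 12 to newval passes the check, whereas assigning the same expression to the string-typed label raises a TypeError.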

6] DISCOURSE:

The next level of analysis is called "discourse theory". This is about the higher-
level relations that hold among sequences of sentences in a discourse or a narrative. It
merges sometimes with literary theory, but also with pragmatics.

One thing to understand is that different sentences do different kinds of "work" in
a discourse. Some introduce new events or relations; some use those events or relations
to introduce something new.

E.g. A car began rolling down the hill.

It collided with a lamppost.

Here the first sentence introduces the car, and the second uses it (through the pronoun
"It") to introduce a new event.

One important idea in discourse theory is that much language is
performed in the context of some mutual activity.

Sometimes utterances can be understood as if they were steps in the execution of
a plan. For example, if I say "Please pass the salt", this could be thought of as a way to
get the salt, if having salt was part of a plan.

Some people think of sentences like

‘Can you pass the salt?’

as "indirect speech acts" because they look like questions, but aren't really. One
way to think about sentences like this is that the hearer understands that this is probably
not a question, but a conventionalized means of asking for the salt.

Another analysis of this sort of sentence is that you are trying to avoid rejection.
You do this by considering ways that your plan might fail. You don't want this to
happen:

Please pass the salt.

I can't, I'm tied up with ropes.

Oh, sorry.

So you ask about potential problems first, by asking about ability. Then, if there is a
problem, you never have to ask directly and you won't be rejected. It is
sort of like:

Are you doing anything Saturday night?

Yes, I'm feeding my goldfish.

This way you avoid the rejection you would have faced had you actually asked for a date.

7] PRAGMATIC:

This level is concerned with the purposeful use of language in situations and
utilizes context over and above the contents of the text for understanding.

The goal is to explain how extra meaning is read into texts without actually being
encoded in them. This requires much world knowledge, including the understanding of
intentions, plans and goals. Some NLP applications may utilize knowledge bases and
inferencing modules.

For example, the following two sentences require resolution of the anaphoric
term ‘they’, but this resolution requires pragmatic or world knowledge:

1] The city councilors refused the demonstrators a permit because they feared violence.

2] The city councilors refused the demonstrators a permit because they advocated
revolution.
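
How a knowledge base can drive this resolution may be sketched as follows. The two tables are hypothetical stand-ins for real world knowledge, hand-written for these two sentences only; an actual system would need a far richer knowledge base and an inferencing module.

```python
# Hypothetical world knowledge: which kind of actor plausibly does what.
WORLD_KNOWLEDGE = {
    ("feared", "violence"): "authority",
    ("advocated", "revolution"): "protester",
}
# Hypothetical role assigned to each candidate antecedent.
ROLES = {"city councilors": "authority", "demonstrators": "protester"}


def resolve_they(candidates, verb, obj):
    """Pick the candidate whose role matches what world knowledge expects."""
    expected_role = WORLD_KNOWLEDGE.get((verb, obj))
    if expected_role is None:
        return None  # no knowledge about this action, so no decision
    matches = [c for c in candidates if ROLES.get(c) == expected_role]
    return matches[0] if len(matches) == 1 else None
```

Given the candidates "city councilors" and "demonstrators", the sketch picks the councilors for "feared violence" and the demonstrators for "advocated revolution", and it declines to answer when the knowledge base has no matching entry.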

IMPLEMENTATION
1. PARSER

The parser takes the output of the speech recognizer and returns a semantic
frame representation. The speech understanding system is given below:

PARSER-1 Notation & Definition:

• A chart is a data structure representing the intermediate results of the
parsing process.

• Given a sentence of n words, we will create a chart with n + 1 vertices.

• During the parsing process, edges will be added between the vertices of the
chart by the helper functions.

• An edge has the form $[i, j, A \to \alpha \bullet \beta]$, where i and j are the
vertices at which the edge starts and ends, $A \to \alpha\beta$ is a rule of the
grammar, and the marker $\bullet$ separates the symbols found so far from the
symbols still being looked for.

• In general, we will use A to refer to a nonterminal; X will refer to either a
terminal or a nonterminal; and a Greek character denotes a string of terminals
and nonterminals (possibly null).

• A Complete Edge is an edge where the marker occurs at the end of the string of
symbols on the right-hand side. Such an edge has the form
$[i, j, A \to \alpha \bullet]$.

• An Incomplete Edge is an edge where the marker occurs in or before
the string of symbols on the right-hand side. Such an edge has the form
$[i, j, A \to \alpha \bullet X \beta]$.
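
One possible way to represent charts and edges in code is sketched below. The field names and the printed layout are assumptions of this example (a start vertex, an end vertex, a rule, and a marker position split into the symbols before and after it); they are not taken from the figures of this report.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edge:
    start: int                # vertex where the edge begins
    end: int                  # vertex where the edge ends
    lhs: str                  # nonterminal on the left-hand side of the rule
    found: Tuple[str, ...]    # right-hand-side symbols before the marker
    needed: Tuple[str, ...]   # right-hand-side symbols after the marker

    def is_complete(self) -> bool:
        """Complete edge: the marker is at the end of the right-hand side."""
        return not self.needed

    def next_symbol(self) -> Optional[str]:
        """The symbol an incomplete edge is looking for, or None if complete."""
        return self.needed[0] if self.needed else None

    def __str__(self) -> str:
        rhs = " ".join(self.found + ("•",) + self.needed)
        return f"[{self.start}, {self.end}, {self.lhs} -> {rhs}]"


def empty_chart(words):
    """A sentence of n words gives a chart with n + 1 vertices (0..n)."""
    return {"vertices": list(range(len(words) + 1)), "edges": set()}
```

Because edges are immutable and hashable, the chart can store them in a set, so adding the same edge twice has no effect.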

PARSER-2 Helper Function:

• The following four functions are used in both the nondeterministic and
deterministic versions of the chart parsing algorithm.

• Each of the following functions can add edges to the chart.

[1] Initializer:

• The Initializer adds an edge to indicate that the parser is
looking for the start symbol of the grammar, S.

• The Initializer function is only ever applied once to the
input string, and it will always add exactly one edge.

• To begin parsing we add the edge $[0, 0, S' \to \bullet S]$.
S' is a special symbol used by the
parsing algorithm to decide when it is finished.

• Once S' has been obtained, the parse is finished
(assuming that it is a parse spanning the entire input; we don't
want an incomplete parse).

[2] Predictor:

• The Predictor takes an incomplete edge that is looking for
an X and adds new incomplete edges that, if completed,
would build an X in the right place.

• If there is an edge $[i, j, A \to \alpha \bullet X \beta]$, then add
an edge $[j, j, X \to \bullet Y]$ for all Y such that $X \to Y$
is a rule in the grammar.

[3] Completer:

• The Completer takes an incomplete edge that is looking for
an X and ending at vertex j, and a complete edge that begins
at vertex j and has X as the left-hand side, and then combines
them to make a new edge where X has been found.

• If there is an incomplete edge $[i, j, A \to \alpha \bullet X \beta]$ and a complete
edge $[j, k, X \to \gamma \bullet]$, then add a new edge
$[i, k, A \to \alpha X \bullet \beta]$.

[4] Scanner:

• The Scanner takes an incomplete edge that is looking for a
terminal symbol X and ends at vertex j; if the jth word of the
input is an X, then the scanner will add a new edge that
incorporates this word and extends to vertex j + 1.

• Again note, the new edge may or may not be a complete
edge. If there is an incomplete edge $[i, j, A \to \alpha \bullet X \beta]$
and the jth word of the input string is an X, then add a new edge
$[i, j+1, A \to \alpha X \bullet \beta]$.

PARSER-3 Non-Deterministic Chart Parsing Algorithm:

• The nondeterministic chart-parsing algorithm (Figure 2) uses the helper
functions described above.

• The algorithm is nondeterministic in the sense that it "chooses" which helper
function it will use next.

• The predictor function could also be considered nondeterministic, as the
algorithm will still function even if the predictor only generates correct guesses.
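
The four helper functions can be combined into a small runnable sketch. Note the assumptions: instead of "choosing" nondeterministically, this version simply applies every applicable helper using an agenda of pending edges; the toy grammar and lexicon are invented for the example sentence and are not the ones used in the figures; words are numbered from 0, so word j lies between vertices j and j + 1; and the terminal symbols are word categories.

```python
# Hypothetical toy grammar and lexicon (assumptions of this sketch).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Adjective", "Noun")],
    "VP": [("Verb", "Adjective")],
}
LEXICON = {"artificial": "Adjective", "intelligence": "Noun",
           "is": "Verb", "smelly": "Adjective"}


def chart_parse(words, start_symbol="S"):
    """Return the set of edges (start, end, lhs, rhs, marker position)."""
    chart = set()
    agenda = [(0, 0, "S'", (start_symbol,), 0)]           # Initializer
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        i, j, lhs, rhs, dot = edge
        if dot < len(rhs):                                # incomplete edge
            needed = rhs[dot]
            for expansion in GRAMMAR.get(needed, []):     # Predictor
                agenda.append((j, j, needed, expansion, 0))
            if j < len(words) and LEXICON.get(words[j].lower()) == needed:
                agenda.append((i, j + 1, lhs, rhs, dot + 1))   # Scanner
            for (s, e, l2, r2, d2) in list(chart):        # Completer
                if s == j and l2 == needed and d2 == len(r2):
                    agenda.append((i, e, lhs, rhs, dot + 1))
        else:                                             # complete edge
            for (s, e, l2, r2, d2) in list(chart):        # Completer
                if e == i and d2 < len(r2) and r2[d2] == lhs:
                    agenda.append((s, j, l2, r2, d2 + 1))
    return chart


def accepts(words, start_symbol="S"):
    """True if a complete S' edge spans the entire input."""
    goal = (0, len(words), "S'", (start_symbol,), 1)
    return goal in chart_parse(words, start_symbol)
```

Under these assumptions, accepts("Artificial Intelligence is smelly".split()) succeeds because a complete S' edge spans vertices 0 to 4, while a string such as "Artificial is smelly" is rejected because no noun phrase can be completed.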

Figure 2 : Nondeterministic Chart Parsing Algorithm

Figure 3 shows an example parse of the string "Artificial Intelligence is smelly."
The edges in the chart are colored according to the functions by which they
were added: black edges were added by the predictor, green edges by the scanner, and
red edges by the completer. Note that edge 1, added by the initializer, is also in black.

Figure 3: Chart parse of the sentence ‘Artificial Intelligence is Smelly’

Figure 4 presents the trace of this chart parse in tabular format, included to aid in
understanding the example. There is another example presented on page 700 of the.

Figure 4: Trace of chart parse of "Artificial Intelligence is smelly."

PARSER-4 Deterministic Chart Parsing Algorithm:

The deterministic chart parsing algorithm (Figure 5) is a recursive algorithm
which performs the same task as the previous algorithm, but it does so deterministically.

Since this course is only an overview of AI, we have decided not to go into the
details of how this algorithm works. With the pseudo-code provided below and several
hours to play around with it, you should be able to figure out how the algorithm works,
but please remember that this is not critical for Computer Science 533.

Figure 5: Deterministic chart parsing algorithm

This algorithm is a form of left-corner parser. A left-corner parser builds a parse
tree starting with the grammar's start symbol and extending down and to the left-most
word of the sentence.

It only adds edges that will serve to extend this parse tree. One of the
advantages of this algorithm is that it avoids adding some edges that could not
possibly be part of a sentence spanning the input string.

For example, a left-corner parsing algorithm will correctly avoid interpreting "ride
the horse" as a verb phrase (VP), but will still correctly interpret the phrase "the horse
gave" as a relative clause. This can be seen in Figure 6.

Figure 6: Advantage of a left corner parser algorithm

PARSER-5 How to Improve Parsing:

When designing a parsing algorithm there are three things we should keep in
mind. These will help ensure that the algorithm is efficient.

o Don't do twice what you can do once.

Employ dynamic programming techniques, such as those learned
in Computer Science 413. Select a data structure (such as a chart)
that allows reuse of any previously parsed sub-phrase.

o Don't do once what you can avoid altogether.

Avoid considering parses that cannot lead to a complete parse of
the input string. Left-corner parsing is one approach that will
avoid some of this.

o Don't represent distinctions that you don’t need.

Minimize space complexity by generating a packed forest of parse
trees instead of storing every parse tree individually.
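
The first and third pieces of advice can be illustrated with a short dynamic-programming sketch. It counts the parse trees of a sentence without building any of them, and it memoizes the result for every (symbol, span) pair so that each sub-phrase is analyzed only once. The binary grammar and the lexicon are assumptions invented for this example.

```python
from functools import lru_cache

# Hypothetical grammar in binary form: (left child, right child) -> parents.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("P", "NP"): ["PP"],
    ("Det", "N"): ["NP"],
}
LEXICON = {"I": ["NP"], "saw": ["V"], "the": ["Det"], "man": ["N"],
           "with": ["P"], "telescope": ["N"]}


def count_parses(words, symbol="S"):
    """Count the parse trees of the whole input rooted in the given symbol."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def count(sym, i, j):
        # Number of ways sym derives words[i:j]; cached, so solved only once.
        if j - i == 1:
            return 1 if sym in LEXICON.get(words[i], ()) else 0
        total = 0
        for (left, right), parents in BINARY_RULES.items():
            if sym not in parents:
                continue
            for k in range(i + 1, j):  # every split point of the span
                total += count(left, i, k) * count(right, k, j)
        return total

    return count(symbol, 0, len(words))
```

For "I saw the man with the telescope" the count is 2, because the prepositional phrase can attach either to the verb phrase or to the noun phrase; both readings share the cached analyses of "the man" and "with the telescope".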

2. PROLOG:

Prolog is a logic programming language which has been popular for developing
natural language parsers and feature-based grammars, given the inbuilt support for search
and the unification operation which combines two feature structures into one.
Unfortunately Prolog is not easy to use for string processing or input/output, as the
following program code demonstrates:

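main :-
    % Read all of standard input as a list of character codes.
    current_input(InputStream),
    read_stream_to_codes(InputStream, Codes),
    % Split the codes into words.
    codesToWords(Codes, Words),
    maplist(string_to_list, Words, Strings),
    % Keep the words ending in "ing" and print them one per line.
    filter(endsWithIng, Strings, MatchingStrings),
    writeMany(MatchingStrings),
    halt.

% codesToWords(+Codes, -Words): skip white space, then collect one word at a time.
codesToWords([], []).
codesToWords([Head | Tail], Words) :-
    (   char_type(Head, space)
    ->  codesToWords(Tail, Words)
    ;   getWord([Head | Tail], Word, Rest),
        codesToWords(Rest, Words0),
        Words = [Word | Words0]
    ).

% getWord(+Codes, -Word, -Rest): a word ends at white space or punctuation.
getWord([], [], []).
getWord([Head | Tail], Word, Rest) :-
    (   ( char_type(Head, space) ; char_type(Head, punct) )
    ->  Word = [], Tail = Rest
    ;   getWord(Tail, Word0, Rest),
        Word = [Head | Word0]
    ).

% filter(+Predicate, +List0, -List): keep the elements that satisfy Predicate.
filter(Predicate, List0, List) :-
    (   List0 = []
    ->  List = []
    ;   List0 = [Head | Tail],
        (   apply(Predicate, [Head])
        ->  filter(Predicate, Tail, List1),
            List = [Head | List1]
        ;   filter(Predicate, Tail, List)
        )
    ).

endsWithIng(String) :- sub_string(String, _Start, _Len, 0, 'ing').

writeMany([]).
writeMany([Head | Tail]) :- write(Head), nl, writeMany(Tail).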
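
For comparison, the same task (printing the words of the input that end in "ing") can be sketched in a few lines of a language with built-in string handling. This Python version is an illustration only; its word-splitting rule (runs of letters and digits) is an assumption that approximates, but does not exactly match, the white space and punctuation tests of the Prolog program.

```python
import re


def words_ending_in_ing(text):
    """Split text at white space and punctuation; keep words ending in 'ing'."""
    words = re.findall(r"[^\W_]+", text)
    return [word for word in words if word.endswith("ing")]
```

Printing each element of words_ending_in_ing(sys.stdin.read()) on its own line would then reproduce the behaviour of main in the listing above.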
APPLICATION

Natural language applications can be classified in many different ways:

• Natural Language Interface to Database:
A natural language interface to a database lets the user find the information he
wants to search for simply by interacting in natural language.
E.g. To access a book from a library, the user would otherwise need prior
knowledge of that library's books, which may not be possible. A natural language
interface supplies that information through interaction.

• Natural Language Interface with Computer:
UC [UNIX CONSULTANT] assists a new user of the UNIX operating system. It
engages the user in a dialogue and tries to tell him what to do.

• Question Answering System & Conversational System:
Several programs exist for question answering. They can hold a
conversation with us, guide us, etc.
E.g.
www.ask.com,
http://www.lpa.co.uk/pws_exec/pws/proweb.exe?eg=Eliza

• Machine Translation:
E.g. http://babelfish.altavista.com/translate.dvn
It can translate sentences, letters or words into other languages.

LIMITATION

• Physical limitations:

The greatest challenge to NLP is representing a sentence or group of concepts
with absolute precision. The realities of computer software and hardware limitations make
this challenge nearly insurmountable. The realistic amount of data necessary to perform
NLP at the human level requires memory space and processing capacity that are beyond
even the most powerful computer processors.

• No unifying ontology:

NLP suffers from the lack of a unifying ontology that addresses semantic as well
as syntactic representation. The various competing ontologies serve only to slow the
advancement of knowledge management.

• No unifying semantic repository:

NLP lacks an accessible and complete knowledge base that describes the world in
the detail necessary for practical use. The most successful commercial knowledge bases
are limited to licensed use and have little chance of wide adoption. Even those with the
most academic intentions develop at an unacceptable pace.

• Current information retrieval systems:

The performance of most current information retrieval systems is affected
by semantic overload. Web crawlers, limited by their method of indexing, more often
than not return incorrect matches as a result of ambiguous interpretation.

FUTURE

Computer scientists and information professionals have been working on the idea
of natural language processing for as long as there have been computers. In the decades
of research, amazing progress has been made in this field, and there are now many natural
language applications on the market and in general use.

• Some forty years later, however, dynamic research is still taking place in
Natural Language Processing.

There are two main reasons for that:

One, language is so complex that NLP programs can always be
refined and improved.

Two, there are so many different kinds of applications where NLP will be
able to help.

• From translating texts or even websites to transcribing speech for the hearing
impaired, natural language processing can improve information access in many
different ways, alone or in conjunction with other non-NLP technologies.

• Consider robots that understand and follow instructions given by human voice, or
driving by talking to the car as in some science fiction movies. These may all
become real one day.
CONCLUSION

Natural Language Processing plays a very important role in new human-machine
interfaces. It is very difficult to design an NLP system that is 100% accurate. The
problems get more complicated when we think of different people speaking the same
language with different styles. Therefore most research on speech recognition is
concentrated on these areas. Information retrieval can be improved to give very accurate
results for various searches. This will involve intelligence to find and sort all the results.
Such intelligent systems are being experimented with right now, and we will be able to see
improved applications of NLP in the near future.
BIBLIOGRAPHY

1. Natural Language Processing - A Paninian Perspective. By Akshar Bharati & Rajeev Sangal

2. Speech and Language Processing - An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition

3. http://www2.sims.berkeley.edu/courses/is290-2/f04/lectures

4. http://ai-depot.com/ska/paper/node21.html

5. http://sern.ucalgary.ca/courses/cpsc/533/W99/presentations/L2_23A_Curry_Lee/

6. http://www.ec-gis.org/Workshops/7ec-gis/papers/html/rachev2/7thGISRachevPaper.htm

7. http://www.csc.villanova.edu/~nlp/intro.html

8. http://www.cogs.susx.ac.uk/research/nlp/gazdar/nlp-in-prolog/ch01

9. http://www.mind.ilstu.edu/published/Phillips/natlangproc.php#fn3

10. http://debra.dgbt.doc.ca/chat/chat.html

11. http://www.cpsc.ucalgary.ca/Courses/461/notes/Signatures/Signatures1.html