Dissertation
Presented to the Department of Software and Computing
Systems, University of Alicante, in partial fulfillment of
the requirements for the title of
Doctor of Philosophy
Whereas our argument shows that the power and capacity of learning
exists in the soul already, and that just as the eye was unable to turn from
darkness to light without the whole body, so too the instrument of knowledge
can only by the movement of the whole soul be turned from the world of
becoming into that of being, and learn by degrees to endure the sight of
being and of the brightest and best of being, or in other words, of the good.
Plato, The Republic, Book 7, section 7 (The Myth of the Cave)
ACKNOWLEDGEMENTS
First of all, I would like to thank my main advisor, Prof. Dr. Andrés Montoyo, for all
the knowledge and advice he has shared with me over these almost four years. I
would like to thank him, Prof. Dr. Manuel Palomar and Prof. Dr. Patricio Martínez-Barco for all the support they have given me in the past years. I would like to thank
the University of Alicante for the PhD scholarship and all the people at the
Department of Software and Computing Systems, in the Natural Language
Processing Group, for their support and friendship. I would especially like to thank
Ester Boldrini and Elena Lloret, with whom I had a fruitful collaboration on
numerous occasions.
I would like to thank Dr. Ralf Steinberger, my co-advisor, for all the knowledge
and experience he has shared with me, as well as all the advice he has given me
during the past two years. Together with him, I would like to thank the entire
OPTIMA team at the European Commission's Joint Research Centre, led by Erik
van der Goot, for everything they taught me during my traineeship in Ispra (Italy),
for their valuable advice, their friendship and the collaboration in the development
of various resources and methods for sentiment analysis. I would especially like to
thank Dr. Mijail Kabadjov and Dr. Josef Steinberger, with whom I have
collaborated on various research papers in opinion summarization, during my
traineeship and after.
I would like to thank Prof. Dr. Dietrich Klakow for receiving me as an intern at
the Spoken Language Systems Department at the University of Saarbrücken
(Germany) and for all the knowledge he and his group shared with me during my
traineeship there. I would like to thank Dr. Michael Wiegand for his collaboration
on various sentiment analysis projects.
I would like to thank my wonderful family, my mother Doina, my father Paul
and my sister Elvira, for raising me to be curious, ambitious and perseverant, to
enjoy science and challenges, to never be afraid to ask, never be scared to look for
answers and to always aim for "excelsior". I would like to thank all my family and
friends, here, at home and in the world for their support and their advice.
I would like to thank Jesús Hermida, a very special person in my life, for his
support, advice and the knowledge he has shared with me.
Finally, I would like to thank all the people that I have come across at different
events in the past years, who with their suggestions and critiques helped to improve
my research and/or life approach, as well as those who believed in me or doubted
me, making me wiser and stronger.
AGRADECIMIENTOS
First of all, I would like to thank my main advisor, Prof. Dr. Andrés Montoyo, for
all the knowledge and advice he has shared with me over these almost four years. I
would like to thank him, Prof. Dr. Manuel Palomar and Prof. Dr. Patricio Martínez
for all the support they have given me in recent years. Thanks to the University of
Alicante for the PhD scholarship and to all the people at the Department of Software
and Computing Systems, and especially the Natural Language Processing Group, for
their support and friendship. I would like to thank Ester Boldrini and Elena Lloret,
with whom I had a fruitful collaboration on numerous occasions.
I would like to thank Dr. Ralf Steinberger, my co-advisor, for all the knowledge
and experience he has shared with me, as well as all the advice he has given me
during the past two years. Together with him, I would like to thank the entire
OPTIMA team at the European Commission's Joint Research Centre, led by Erik
van der Goot, for everything they taught me during my traineeship in Ispra (Italy),
for their valuable advice, their friendship and the collaboration in the development
of various resources and methods for sentiment analysis. I would especially like to
thank Dr. Mijail Kabadjov and Dr. Josef Steinberger, with whom I have collaborated
on various research works on opinion summarization techniques, during my
traineeship and afterwards.
I would like to thank Prof. Dr. Dietrich Klakow for giving me the opportunity to
carry out a research stay at the Spoken Language Systems Department of the
University of Saarbrücken (Germany), and for all the knowledge that he and his
group shared with me during my traineeship there. I would also like to thank
Dr. Michael Wiegand for his collaboration on several emotion analysis projects,
during my stay in Saarbrücken and afterwards.
I would like to thank my wonderful family, my mother Doina, my father Paul
and my sister Elvira, for supporting my curiosity, ambition and perseverance, for
teaching me to enjoy science and challenges, to dare to ask, not to be afraid to look
for answers and to always strive for excellence in what I do, to reach an "excelsior".
I would like to thank all my family and friends, here, at home and around the world,
for their support and their advice.
I would like to thank Jesús Hermida, a very special person in my life, for his
support, advice and the knowledge he has shared with me.
Finally, I would like to thank all the people I have come across at different
events in recent years, who with their suggestions and critiques helped to improve
my research and my approach to life, as well as those who believed in me or doubted
me, making me wiser and stronger.
ABSTRACT
The present doctoral thesis deals with the issues and challenges involved in the
development of methods and resources for the Natural Language Processing (NLP)
task of sentiment analysis.
Specifically, the first aim is to develop adequate techniques for the automatic
detection and classification of directly, indirectly or implicitly expressed sentiment
in texts of different types (reviews, newspaper articles, dialogues/debates and
blogs) and in different languages. The second aim is to apply the proposed
sentiment analysis methods in the context of, or jointly with, other NLP tasks and to
propose adequate techniques to tackle the issues raised by the peculiarities of affect
expression in these tasks.
In order to achieve the proposed objectives, the work presented has been
structured around answering five research questions. Following is a description of
the questions and a summary of the answers we have given in the present thesis.
1. How can sentiment analysis and, from a broader perspective, opinion mining
be defined in a correct way? What are the main concepts to be treated in
order to create a good definition that can be used to appropriately delimit
the task and subsequently propose correct methods to tackle it?
In Chapter 2, we define the main concepts we will frequently employ throughout
this thesis. We first present an overview of the definitions given in the NLP
literature for the related tasks of subjectivity analysis, sentiment analysis, opinion
mining, appraisal/attitude analysis, and emotion detection. We subsequently present
the definitions of the terms related to these tasks, both in well-established
dictionaries and in the research literature in the field. Finally, we propose an
operational definition that is consistent with the manner in which the different terms
related to sentiment analysis are defined. In Chapter 3, we present the state of the
art in the field and show that depending on the final aim of the application, the tasks
involving sentiment analysis are defined and tackled in a different manner.
The subsequent research questions we address in this thesis are:
2. Can sentiment analysis be performed using the same methods for all text
types? What are the peculiarities of the different text types and how do they
influence the methods used to tackle them? Do we need special resources
for different text types?
TABLE OF CONTENTS
Chapter 1. Introduction ............................................................................................. 1
1.1. Background ............................................................................................... 1
1.2. Motivation ................................................................................................. 5
1.3. Applications ............................................................................................... 7
CHAPTER 1. INTRODUCTION
Motto: "Human behavior flows from three main sources: desire, emotion and knowledge." (Plato)
1.1. BACKGROUND
The era in which we live has been given many names. "Global village",
"technotronic era", "post-industrial society", "information society", "information
age" and "knowledge society" are just a few of the terms that have been used in an
attempt to describe the deep changes that have occurred in the lives of societies and
people worldwide as a result of the fast development of information and
communication technologies (ICT), the access to the Internet and its transformation
into a Social Web. In this new context,
having access to large quantities of information is no longer an issue, as there are
terabytes of new information produced on the Web every day that are available to
any individual with an Internet connection. In contrast to older times, when finding
sources of information was the key problem for companies and individuals, today's
information society challenges companies and individuals to create and employ
mechanisms to search and retrieve relevant data from the huge quantity of
available information and mine it to transform it into knowledge, which they can
use to their advantage. As opposed to the past, when this advantage was a question
of finding sources of information, in today's society, which is flooded by data that
is changing at a rapid pace, the advantage is given by the quality (accuracy,
reliability) of the extracted knowledge and its timeliness. In the era in which we
live, information has become the main object of trade. In this context, having at
hand high-quality and timely information is crucial to all the spheres of human
activity: social, political, and economic, to name just a few.
However, in many cases, the relevant information is not found in structured
sources (i.e. tables or databases), but in unstructured documents, written in human
language. The high quantity of such data requires the use of automatic processing
techniques. The discipline that deals with the automatic treatment of natural
language in text or speech is called Natural Language Processing (NLP). NLP is
part of the research area of Artificial Intelligence (AI), which is defined as "the
science and engineering of making intelligent machines" (McCarthy, 1959), by
simulating the mechanisms of human intelligence. The goal of Artificial
Intelligence, as it was stated in the 1950s, is to create machines that are capable of
passing the Turing Test. Roughly speaking, a computer will have passed the
Turing Test "if it can engage in conversations indistinguishable from that of a
human's" (Lee, 2004). In order to achieve this goal, NLP deals with text
analysis at different levels: phonological (sounds), lexical (words), morphological
(parts of speech), syntactic (the representation of the structure of a sequence of
lexical units based on their dependencies), semantic (the logical structure representing
the meaning expressed) and pragmatic (the influence of the context and of
world knowledge on the overall meaning of the text). NLP comprises many
research areas. Each of these constitutes either a general NLP problem that needs to
be solved in any application area (Word Sense Disambiguation, Co-reference
Resolution), or a task that has been set up in view of a specific end application
(Information Retrieval, Information Extraction, Question Answering, Text
Summarization, Machine Translation).
Traditionally, the application areas of NLP were designed for the treatment of
factual (exact) data. Nowadays, however, factual information is no longer the main
source from which crucial knowledge is extracted.
The present is marked by the growing influence of the Social Web (the web of
interaction and communication) on the lives of people worldwide. More than ever
before, people are willing and happy to share their lives, knowledge,
experience and thoughts with the entire world, through blogs, forums, wikis, review
sites or microblogs. They are actively participating in events, by expressing their
opinions on them and by commenting on the news and the events that take
place in all spheres of society. The large volume of subjective information
present on the Internet, in reviews, forums, blogs, microblogs and social network
communications has produced an important shift in the manner in which people
communicate, share knowledge and emotions and influence the social, political and
economic behavior worldwide. As a consequence, this new reality has led to
important transformations in the manner, extent and speed with which news and
the opinions associated with them circulate, leading to new and challenging social,
economic and psychological phenomena.
In order to study these phenomena and address the issue of extracting the crucial
knowledge that nowadays is contained in opinionated data, new fields of research
were born in Natural Language Processing (NLP), aiming at detecting subjectivity
in text and/or extracting and classifying opinions into different sets (usually
positive, negative and neutral). The main tasks that were tackled in NLP are
subjectivity analysis (dealing with private states (Banfield, 1982), a term that
encompasses sentiments, opinions, emotions, evaluations, beliefs and speculations),
sentiment analysis and opinion mining, although different terminologies have
been used to denote the approaches taken (e.g. review mining, appraisal extraction),
and sentiment analysis and opinion mining have been used interchangeably, as they
are considered by some authors to point to the same task (Pang and Lee, 2008). A
closely related task is emotion detection, dealing with the classification of
texts according to the emotion expressed. All these research areas are part of the
wider field in Artificial Intelligence termed affective computing (Picard,
1995).
This thesis deals with the task of sentiment analysis, in the context of
multilingual documents of different text types. Specifically, the work we will
present throughout the following chapters concentrates on answering the following
research questions:
1. How can sentiment analysis and, from a broader perspective, opinion mining
be defined in a correct way? What are the main concepts to be treated in
order to create a good definition that can be used to appropriately delimit
the task and subsequently propose correct methods to tackle it?
In Chapter 2, we define the main concepts we will frequently employ throughout
this thesis. We first present an overview of the definitions given in the NLP
literature for the related tasks of subjectivity analysis, sentiment analysis, opinion
mining, appraisal/attitude analysis, and emotion detection. We subsequently present
the definitions of the terms related to these tasks, both in well-established
dictionaries and in the research literature in the field. Finally, we propose an
operational definition that is consistent with the manner in which the different terms
related to sentiment analysis are defined. In Chapter 3, we present the state of the
art in the field and show that depending on the final aim of the application, the tasks
involving sentiment analysis are defined and tackled in a different manner.
The subsequent research questions we address in this thesis are:
2. Can sentiment analysis be performed using the same methods for all text
types? What are the peculiarities of the different text types and how do they
influence the methods used to tackle them? Do we need special resources
for different text types?
3. Can the same language resources be used in other languages (through
translation)? How can resources be extended to other languages?
In Chapter 4, we present the peculiarities of different text types (reviews,
newspaper articles, blogs, political debates), analyze them and propose adequate
techniques to address them at the time of performing sentiment analysis. In the
cases where no generally-accepted definition of the sentiment analysis task exists
for a specific textual genre, we propose new definitions and annotate new resources
accordingly. We present different methods and resources we built for the task of
sentiment analysis in different text types, in different languages (English, Spanish,
German). In each of the genres studied, we evaluate our approaches
correspondingly, both in-house and in international competitions. We show
that the techniques employed are robust enough to obtain good results, even in the
case where the original texts are in a language for which we do not have any
resources available and which we process using machine translation engines.
Finally, given the results obtained, we show that our approaches perform at the
level of state-of-the-art systems and in many cases outperform them.
4. How can we deal with opinion in the context of traditional tasks? How can
we adapt traditional tasks (Information Retrieval, Question Answering, Text
Summarization) in the context of opinionated content? What are the new
challenges in this context?
In Chapter 4, we concentrate only on the task of sentiment analysis as a
standalone challenge, omitting the steps required to obtain the texts to which the
sentiment analysis methods are applied and the elimination of redundancy in
the information obtained. However, in a real-world application scenario,
automatically detecting the opinion expressed in a text is often neither the first nor
the last task to be performed. In order to analyze the sentiment found in different
texts, the documents must first be retrieved. Additionally, the results of the
automatic sentiment analysis may still contain a high volume of information, with
much redundancy. Bearing in mind these necessities, in Chapter 5 we study
methods to combine opinion mining with question answering and summarization.
We show that performing traditional tasks in the context of opinionated text has
many challenges and that systems that were designed to work exclusively with
factual data are not able to cope with opinion questions. Thus, we propose new
methods and techniques to adapt question answering and summarization systems to
deal with opinionated content. Additionally, we create and annotate appropriate
resources for the evaluation of the proposed methods. Finally, we evaluate our
approaches, as well as the impact of using different tools and resources in these
tasks. Our evaluations, both in in-house experiments and through participation
in international competitions, show that the proposed methodologies
are appropriate for tackling question answering and summarization in the context of
opinionated texts.
The last research question we address in this thesis is:
5. Can we propose a model to detect emotion (as a component of sentiment)
from text, in the cases where it is expressed implicitly, requiring world
knowledge for its detection?
As we will see throughout this thesis, sentiments can be explicitly or implicitly
present in texts. While in the first case, lexical clues may be found in the text
indicating the presence of sentiment, through sentiment-bearing words, in the
second case, the emotion underlying the sentiment is not explicitly stated through
the use of affective words. In these situations, the emotion is only inferable based
on commonsense knowledge (i.e. emotion is not explicitly, but implicitly expressed
by the author, by presenting situations which most people, based on commonsense
knowledge, associate with an emotion, like going to a party, seeing your child
taking his/her first step etc.). Motivated by the fact that most work in sentiment
analysis has been done only in view of the existence of lexical clues for sentiment
detection and classification, and having seen the limitations of such models, in
Chapter 6 of the thesis, we present our contribution to the issue of automatically
detecting emotion expressed in text in an implicit manner. The initial approach is
based on the idea that emotion is triggered by specific concepts according to their
relevance, seen in relation to basic needs and motivations, grounding our idea in
Relevance Theory. The second approach we propose is based on the
Appraisal Theory models. The general idea behind it is that emotions are most of
the time not explicitly stated in texts, but result from the interpretation (appraisal)
of the actions contained in the situation described, as well as of the properties of their
actors and objects. Thus, we set up a framework for representing situations
described in text as chains of actions (with their corresponding actors and objects),
and their corresponding properties (including the affective ones), according to
commonsense knowledge. We show the manner in which the so-called appraisal
criteria can be automatically detected from text and how additional knowledge on
the properties of the concepts involved in such situations can be imported from
commonsense knowledge bases. Finally, we demonstrate through an extensive
evaluation that such a representation is useful to obtain an accurate label of the
emotion expressed in text, without any linguistic clue being present therein,
increasing the recall of systems performing sentiment analysis from texts.
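Purely as an illustration of the kind of representation meant here, and not of the framework's actual implementation (all class and field names below are hypothetical), such a situation could be encoded as a chain of actions, each carrying its actor, object and commonsense properties:

```python
# Illustrative sketch only: one possible way to encode a situation as a chain of
# actions with actors, objects and commonsense (affective) properties.
# The class and field names are hypothetical, not those of the actual framework.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Action:
    verb: str                                   # the action described in the text
    actor: str                                  # who performs it
    obj: str                                    # what it is performed on
    properties: Dict[str, str] = field(default_factory=dict)  # commonsense properties

@dataclass
class Situation:
    actions: List[Action]                       # the chain of actions in the text

# "I lost the wallet my grandmother gave me."
situation = Situation(actions=[
    Action("give", actor="grandmother", obj="wallet",
           properties={"object_value": "sentimental"}),
    Action("lose", actor="I", obj="wallet",
           properties={"outcome": "undesirable"}),
])
# Appraising the chain (losing a valued object, an undesirable outcome) points
# towards an emotion such as sadness, even though no affective word is present.
```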
1.2. MOTIVATION
The radical shift in the means employed for communication and in the content of this
communication has brought with it new challenges, but also many opportunities.
At the economic level, the globalization of markets, combined with the fact that
people can freely express their opinion on any product or company on forums,
blogs or e-commerce sites, has led to a change in companies' marketing strategies,
to a rise in awareness of client needs and complaints, and to special attention being
paid to brand trust and reputation. Specialists in market analysis, but also in IT fields
such as Natural Language Processing, demonstrated that in the context of these newly
created opinion phenomena, decisions on economic action are driven not only by
factual information, but are also highly affected by rumors and negative opinions. Wright
(2009) claims that for many businesses, online opinion has turned into a kind of
virtual currency that can make or break a product in the marketplace1. Studies
showed that financial information presented in news articles has a high correlation
with social phenomena, on which opinions are expressed in blogs, forums or reviews.
On the other hand, many tasks that used to involve extensive efforts from companies'
marketing departments are now easier to perform. An example is market
research for advertising, business intelligence and competitive vigilance. New
forms of expression on the web made it easier to collect information of interest,
which can help to detect changes in the market attitude, discover new technologies,
machines, or markets where products are needed, and detect threats. On the other hand,
using the opinion information, companies can spot the market segments their
products are best associated with and can enhance their knowledge of the clients
they are addressing and of their competitors. The analysis of the data flow on the web
can lead to the spotting of differences between a company's products and the needs
expressed by clients, and between a company's capacities and those of its
competitors. Last, but not least, the interpretation of the large amounts of data
and the opinions associated with them can give companies the capacity to support
decision-making through the detection of new ideas and new solutions to their technological
or economic problems.
The opinionated data on the web has also produced important changes in the
manner in which communities are able to participate in the elaboration of laws and
policies. Consultations with communities that in the past were made through the use
of questionnaires are now easily made through forums of opinion. Additionally,
such data on the opinions that people have about laws, policies, and administrators
can be extracted from a variety of sources (e.g. microblogs, blogs, social networks).
The advantage and, at the same time, the issue related to these new capabilities is
the large amount of information available and its fast growth rate. A lack of
information on markets and their corresponding social and economic data leads
to wrong or late decisions and, finally, to important financial losses. A lack of
information on the part of policy makers leads to wrong decisions, affecting large
communities.
Although its effects are mostly positive, there are also downsides to the increase in
communication through the use of Web 2.0 technologies. The development of
social networks and of the communication between their members has led to
interesting phenomena, whose effects are both positive and negative and which
are difficult to assess. Within social networks gathered around the most peculiar
topics, people talk about subjects that they would not address in their everyday life
or with their friends or family. Under the cover of a hidden identity on the web, however,
they are free to express their innermost fears and desires. That is why allowing and
supporting free communication has also led to the birth of sites where violence is
preached and encouraged, where people with psychological problems or with
tendencies towards suicide, addictions etc. talk to one another and encourage each
other's negative behaviors. Such sites must be discovered and monitored, in order to
keep under control the social issues that may arise from the potentially conflictive
situations described.

1 www.nytimes.com/2009/08/24/technology/internet/24emotion.html?_r=1&ref=start-ups
As we can see, the present reality is profoundly marked by the opinionated
information present both in traditional and in new textual genres. Given the
proven importance of such data, but also the challenges it raises in terms of
volume and the need for automatic processing, sentiment analysis has become a
highly active research field in NLP in the past years.
In the next section, we will show that this task is not only useful in order to
obtain important knowledge from non-factual data, but that it also contributes to the
improvement of systems dealing with other NLP challenges.
1.3. APPLICATIONS
In the Motivation Section, we explained the main reasons for doing research in the
field of sentiment analysis and the applications it has in real-world scenarios.
Further on, we will present some of the fields and domains in which this task is
useful.
Research has been conducted in the field of opinion mining aimed at improving
different social, economic, political and psychological aspects of everyday
human life. There are many applications of opinion mining systems to real-world
scenarios. Some of these applications are already available online, others are still
under research, and other directions and developments in the field are only just
emerging. There are sites, such as swotty.com, that mine, classify and summarize
opinions from product reviews on e-commerce sites, which people can use for
comparison, advice or recommendation. Other applications of the task, directly
related to the commerce and competition markets of companies, use opinion mining
from the web to obtain direct, sincere and unbiased market feedback, both about their
own products (business intelligence) and about the products of their market
competitors (competitive vigilance). Companies, as well as public figures, use
opinion mining to monitor their public image and reputation (trust). Authors can
benefit from opinion mining to track their literary reputation.
It was demonstrated that fluctuations in public opinion correlate with fluctuations
in stock prices for the targeted companies (Devitt and Ahmad, 2007). Thus, opinion
mining can be used to track opinion across time for market and financial studies, for
early action in predicted crisis situations or for the issuing of alerts.
Recent developments in Social Web technologies and the growing number
of people writing and reading social media (blogs, forums etc.) also allow for the
monitoring and analysis of social phenomena, for the spotting of potentially
dangerous situations and for determining the general mood of the blogosphere.
Examples of sites implementing these concepts are wefeelfine.org or
twends.com.
Yet another application of sentiment analysis is the tracking of political views, to
detect consistency and inconsistency between statements and actions at the
government level. It was recently stated that election results could be better
predicted by following the discussion threads in blogs.
eRulemaking, as a democratic way of consulting the whole targeted population
when a law or policy is to be implemented, can also highly benefit from sentiment
analysis, as a method to spot and classify large quantities of subjective data. This task
is also performed when tracking views on laws in legal blogs (blawgs).
Last, but not least, studying affect-related phenomena is fundamental for Human-Computer Interaction (Picard, 1995), as most reactions and interactions are not only
rationality-based, but rely heavily on emotion.
It was also demonstrated that opinion mining improves other Natural Language
Processing tasks, such as:
- Information Extraction, by separating facts from opinions (Riloff et al., 2005);
- Question Answering (Somasundaran et al., 2007), where the application of opinion mining can improve the answering of definition questions (Lita et al., 2005);
- Multi-Perspective Question Answering (Stoyanov et al., 2005; Yu and Hatzivassiloglou, 2003), where there is not a single, true and correct answer, but a set of answers describing the attitudes of different persons on a given fact;
- Summarization of multi-perspective texts (Ku et al., 2005; Ku et al., 2006), where redundant opinions (opinions of the same polarity, given the same arguments) must be removed;
- Authorship (source) determination (Teufel and Moens, 2000; Piao et al., 2007);
- Word Sense Disambiguation (Wiebe and Mihalcea, 2006).
The next chapters of this thesis present different methods and approaches for
tackling the task of sentiment analysis in different text types, languages and in the
context of a variety of final applications, addressing the five research questions we
described. First of all, however, in order to ensure an understanding of the
terminology we will employ, in Chapter 2 we present an overview of the tasks and
of the definitions of the related concepts. The main motivation for defining the
concepts involved in this task is that the issues related to the study of affective
phenomena have been studied for a long time in disciplines such as Psychology or
Philosophy, and that sentiment analysis in NLP is a recent field, in which the
terminology is not yet fully established.
2 Regard is a term we deliberately use here in order to avoid employing any of the terminology used
so far, which is defined and detailed in this chapter. In this context, it refers to: 1) an assessment based
on a set of criteria (i.e. personal taste, convenience, social approval, moral standards etc.) or 2) the
emotional effect the object has on the person as a result of this assessment.
Given the high dynamics of the field and the vast amount of research done
within its framework in the past few years, the terminology employed in defining
the different tasks dealing with subjectivity, as well as the concepts involved in
them is not yet uniform across the research community.
In order to establish the scope of our research and ensure the understanding of
the approaches presented, we first give an overview of the definitions given to the
concepts involved outside the area of Natural Language Processing. Subsequently,
we present some of the tasks proposed in the research, the concepts involved in
them and the definitions they were given. Finally, we propose a set of definitions for
the concepts we will use throughout this thesis and for the different tasks aiming at the
automatic processing of subjective texts.
2.1. SUBJECTIVITY
In Philosophy, subjectivity refers to "the subject and his or her perspective, feelings,
beliefs, and desires" (Solomon, 2005).
In NLP, the most widely used definition is the one proposed by Wiebe (1994).
The author defines subjectivity as the linguistic expression of somebody's
opinions, sentiments, emotions, evaluations, beliefs and speculations. In her
definition, the author was inspired by the work of the linguist Ann Banfield
(Banfield, 1982), who defines as subjective the sentences that take a character's
point of view (Uspensky, 1973) and that present "private states" (Quirk, 1985) (that
are not open to objective observation or verification) of an experiencer, holding an
attitude, optionally towards an object. Subjectivity is opposed to objectivity, which
is the expression of facts. Wiebe et al. (2005) consider the term "private state",
which is in a pragmatic sense equivalent to subjectivity and which is defined as a
general term that covers opinions, beliefs, thoughts, feelings, emotions, goals,
evaluations, and judgments. According to the definition proposed by Wiebe (1994),
an example of a subjective sentence is "This book is amazing!", whereas an example
of an objective sentence is "This book costs 10 on Amazon.com".
In view of the given definition, the Multi-Perspective Question Answering
corpus (Wiebe et al., 2005) takes into account three different types of elements for
the annotation of subjectivity: explicit mentions of private states (e.g. "The U.S.
fears a spill-over", said Xirao-Nima), speech events expressing private states (e.g.
"The U.S. fears a spill-over," said Xirao-Nima) and expressive subjective elements
(e.g. "The report is full of absurdities").
In the Handbook of Natural Language Processing (2010), Bing Liu defines
subjective versus objective sentences as follows: "An objective sentence expresses
some factual information about the world, while a subjective sentence expresses
some personal feelings or beliefs."
We will further define emotion and relate it to the wider context of affect and
to the term affective computing, used in Artificial Intelligence. Subsequently, we
clarify the connection between all these concepts within the frame of the Appraisal
Theory4.
Affect is a superordinate concept that subsumes particular valenced conditions
such as emotions, moods, feelings and preferences (Ortony et al., 2005), being one
of the four components whose interaction makes the human organism function
effectively in the world (Ortony et al., 2005), along with motivation, cognition and
behaviour.
Affective computing is a branch of Artificial Intelligence (AI) dealing with the
design of systems and devices that can recognize, interpret, and process human
affect. The concept includes interdisciplinary work from computer science, but also
from psychology and cognitive science. The term was introduced in AI research by
Picard (1995), who envisages both the capability of computers to interpret affect
from digital content and their capability to imitate human affect.
Emotion is a complex phenomenon for which no generally accepted definition
has been given. However, a commonly used definition considers emotion
as "an episode of interrelated, synchronized changes in the states of all or most of
the five organismic subsystems (Information processing, Support, Executive,
Action, Monitor) in response to the evaluation of an external or internal stimulus
event as relevant to major concerns of the organism" (Scherer, 1987; Scherer,
2001).
Emotion detection and classification is the task of spotting linguistic
expressions of emotion from text and classifying them in predefined
categories/labels (e.g. anger, fear, sadness, happiness, surprise, disgust etc.).
Emotion is a much more complex phenomenon, whose expression in language may
not always have a subjective form (Ortony, 1997).
Let us consider a few examples, to note the difference between subjectivity,
opinion, emotion and sentiment, in a sense that is consistent with the definitions these
concepts are given outside the NLP world. In the following table, "Y" corresponds
to "yes", "N" corresponds to "no", "C" corresponds to "context" (dependence on
the context), "POS" stands for "positive", "NEG" stands for "negative" and "NEU" for
"neutral". The symbol "---" stands for the lack of the corresponding element.
4 This set of theories has been proposed in Psychology by De Rivera (1977), Frijda (1986), Ortony,
Clore and Collins (1988) and Johnson-Laird and Oatley (1989). It has also been used to define the
Appraisal Framework in Linguistics by Martin and White (2001).
(Table: examples 1-18, each annotated for "Subjective?", "Opinionated?", "Emotion?", "Sentiment?" and, where a sentiment is present, its polarity (POS/NEG/NEU), using the values Y, N, C and --- defined above.)
sentiment in all the cases. Sentiments on different targets can also be conveyed by
presenting arguments that are factual in nature (e.g. "The new phone broke in two
days"), as the underlying emotion that is expressed is not always stated directly.
Moreover, subjective statements that contain an opinion do not necessarily
contain an emotion, and thus have no sentiment correlated to them (e.g. "I believe
in God"). Finally, a text may express an emotion without expressing any opinion
or sentiment (e.g. "I'm finally relaxed!"). This idea can be summarized in the
following schema:
Figure 2.1: The relation between subjectivity, objectivity, opinion, sentiment and emotion
important to take into consideration the context of what is said (who is saying it,
why, what is the intention behind what he/she is saying: is the act of writing purely
to inform the reader about a specific object, is it to produce a specific emotion in the
reader, is it to convince him/her about something, or is it purely to express his/her own
regard on the object he/she is writing about?) and whom it is addressed to
(would a potential reader like what he/she is reading, would he be comfortable with
it, would he resent it, would he think it is good, positive, would he feel happy about
it?).
Following on this idea and based on the definitions we have seen so far, we can
claim that much of the work that has been done under the umbrella of sentiment
analysis or opinion mining is actually concerned with detecting attitudes and
classifying them, since, in fact, the act of communication (in this case, through text)
is intended for a reader. Both the writer (with his/her own views on the world) and
the reader, who is not a mere passive entity, but who actively uses his/her own
knowledge of the world, the context and his/her affect to interpret what is written,
should be taken into consideration when deciphering the sentiments
expressed in text. Additionally, especially in traditional textual genres such as
newspaper articles, where writers are thought to be objective when rendering a
piece of news, expressions of sentiment cannot be direct. Opinions are expressed in
a non-subjective manner, by omitting certain facts and overly repeating others
(Balahur and Steinberger, 2009).
The work in the field of NLP that deals with the computational treatment of
attitude is called attitude analysis or appraisal analysis, relating it to the
Appraisal Theory.
An attitude (Breckler and Wiggins, 1992) is a hypothetical construct that
represents an individual's degree of like or dislike for something. Attitudes are
generally positive or negative views of a person, place, thing, or event; this is
often referred to as the "attitude object". People can also be conflicted or ambivalent
toward an object, meaning that they simultaneously possess both positive and
negative attitudes toward the item in question. Attitudes are judgments. They
develop on the ABC model (affect, behavior, and cognition). The affective response
is an emotional response that expresses an individual's degree of preference for an
entity. The behavioral intention is a verbal indication or typical behavioral tendency
of an individual. The cognitive response is a cognitive evaluation of the entity that
constitutes an individual's beliefs about the object. Most attitudes are the result of
either direct experience or observational learning from the environment.
Work that has concentrated on attitude analysis was done by Taboada and
Grieve (2004), Edmonds and Hirst (2002), Hatzivassiloglou and McKeown (1997)
and Wiebe et al. (2001). However, only the first work considered attitude as
5 http://www.wjh.harvard.edu/~inquirer/
6 http://www.cs.pitt.edu/mpqa/
contrast to the private frames, contain annotations of events that are objective in
nature.
The Opinion Finder lexicon (subjectivity clues) (Wilson et al., 2005) contains
over 8000 words, annotated also at the level of polarity value, and was built starting
with the grouping of the subjectivity clues in (Riloff and Wiebe, 2003) and enriched
with polarity annotated subjective words taken from the General Inquirer and the
lexicon proposed by Hatzivassiloglou and McKeown (1997). It is interesting to
note that the authors found most of the subjective words to have either a positive or a
negative polarity; only a few were both positive and negative, or neutral.
B) SEMI-AUTOMATICALLY AND AUTOMATICALLY CREATED RESOURCES
Another annotation scheme and corpus for subjectivity versus objectivity
classification, as well as polarity determination at sentence level was developed by
Yu and Hatzivassiloglou (2003), in a semi-automatic manner. The authors start
from a set of 1336 seed words, manually annotated by Hatzivassiloglou and
McKeown (1997), which is extended by measuring the co-occurrence between the
known seed words and new words. The hypothesis on which the authors base their
approach is that positive words, and respectively negative words, tend to co-occur
more than expected by chance. As a measure of association, the authors employ the
log-likelihood ratio on a corpus tagged at the part-of-speech level.
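For reference, a standard way to test such co-occurrence (not necessarily the exact variant used by the authors) is the log-likelihood ratio over the 2x2 contingency table of the two words' occurrences:

\[
G^2 = 2 \sum_{i,j} O_{ij} \, \ln\frac{O_{ij}}{E_{ij}}
\]

where the O_{ij} are the observed counts (both words together, each word without the other, neither word) and the E_{ij} are the counts expected if the two words occurred independently; high values of G^2 indicate that the pair co-occurs more often than expected by chance.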
A resource for subjectivity was built semi-automatically on the basis of the
Appraisal Theory (Martin and White, 2005). The Appraisal Theory is a framework
of linguistic resources inscribed in discourse semantics, which describes how
writers and speakers express inter-subjective and ideological positions (the
language of emotion, ethics and aesthetics), elaborating on the notion of
interpersonal meaning, i.e. social relationships are negotiated through evaluations of
the self, the others and artifacts. According to this theory, emotion is achieved through
appraisal, which is composed of attitude (affect, appreciation, judgment),
graduation (force and focus), orientation (positive versus negative) and polarity
(which can be marked or unmarked). A lexicon of appraisal terms was built by
Whitelaw et al. (2005), based on the examples provided by Martin and White
(2005) and Matthiassen (1995) (400 seed terms) and patterns in which filler
candidates were extracted from WordNet (Fellbaum ed., 1999). Term filtering was
done by ranking obtained expressions and manually inspecting terms that were
ranked with high confidence. The resulting lexicon contains 1329 terms.
Another related method was used in the creation of SentiWordNet (Esuli and
Sebastiani, 2005). The idea behind this resource was that terms with similar
glosses in WordNet tend to have similar polarity. Thus, SentiWordNet was built
using a set of seed words whose polarity was known, which was then expanded
using gloss similarity.
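As a toy illustration of this idea only (the glosses below are shortened and a simple word-overlap count stands in for the much richer gloss classifiers actually used to build SentiWordNet), seed polarity can be propagated to a new term through the similarity of its gloss to the glosses of the seeds:

```python
# Toy sketch of gloss-similarity-based polarity propagation (illustrative only).
seed_polarity = {"good": "positive", "bad": "negative"}

glosses = {
    "good": "having desirable or positive qualities",
    "bad": "having undesirable or negative qualities",
    "superb": "of surpassing excellence with desirable qualities",
    "awful": "exceptionally bad or displeasing with undesirable qualities",
}

def gloss_overlap(a, b):
    """Very crude gloss similarity: the number of shared gloss words."""
    return len(set(glosses[a].split()) & set(glosses[b].split()))

for term in ("superb", "awful"):
    nearest_seed = max(seed_polarity, key=lambda seed: gloss_overlap(term, seed))
    print(term, "->", seed_polarity[nearest_seed])
# superb -> positive
# awful -> negative
```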
As mentioned in the subjectivity classification section, in the collection of
appraisal terms in Whitelaw et al. (2005), the terms also have polarity assigned.
MicroWNOp (Cerini et al., 2007), another lexicon containing opinion words
with their associated polarity, was built on the basis of a set of terms (100 terms for
each of the positive, negative and objective categories) extracted from the General
Inquirer lexicon, subsequently extended by adding all the synsets in WordNet in
which these words appear. The criticism brought against such resources is that they do not take into
consideration the context in which the words or expressions appear. Other methods
tried to overcome this critique and built sentiment lexicons using the local context
of words.
Pang et al. (2002) built a lexicon of sentiment words with an associated polarity
value, starting with a set of classified seed adjectives and using conjunctions
("and") and disjunctions ("or", "but") to deduce the orientation of new words in a corpus.
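A minimal sketch of this conjunction/disjunction heuristic is given below; the seed lexicon and the naive token-level matching are invented for the example and are much cruder than the corpus statistics used in the cited work:

```python
# Minimal sketch of the conjunction/disjunction heuristic described above.
# The seed lexicon and the naive token matching are invented for the example.
seed_polarity = {"good": "positive", "excellent": "positive",
                 "bad": "negative", "poor": "negative"}

def propagate_polarity(sentence, lexicon):
    """Assign a tentative polarity to unknown words conjoined with known ones."""
    tokens = sentence.lower().split()
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("and", "or", "but"):
            left, right = tokens[i - 1], tokens[i + 1]
            for known, new in ((left, right), (right, left)):
                if known in lexicon and new not in lexicon:
                    polarity = lexicon[known]
                    if tokens[i] == "but":   # contrast suggests opposite polarity
                        polarity = "negative" if polarity == "positive" else "positive"
                    lexicon[new] = polarity
    return lexicon

lexicon = dict(seed_polarity)
propagate_polarity("the plot was good and engaging but predictable", lexicon)
# lexicon now also maps "engaging" -> "positive" and "predictable" -> "negative"
```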
Turney (2002) classifies words according to their polarity on the basis of the
idea that terms with similar orientation tend to co-occur in documents. Thus, the
author computes the Pointwise Mutual Information score between seed words and
new words on the basis of the number of AltaVista hits returned when querying the
seed word and the word to be classified with the NEAR operator.
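For reference, the PMI between a candidate word w and a seed word s can be estimated from hit counts as follows (shown here as a sketch of the general PMI-IR idea; N stands for the total number of indexed documents):

\[
\mathrm{PMI}(w, s) = \log_2 \frac{p(w \wedge s)}{p(w)\,p(s)} \approx \log_2 \frac{\mathrm{hits}(w\ \mathrm{NEAR}\ s) \cdot N}{\mathrm{hits}(w) \cdot \mathrm{hits}(s)}
\]

The semantic orientation of w is then obtained by comparing its association with positive and negative seeds, e.g. SO(w) = PMI(w, "excellent") - PMI(w, "poor").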
In our work, we compute the polarity of new words using polarity anchors
(words whose polarity is known beforehand) and Normalized Google Distance
(Cilibrasi and Vitanyi, 2006) scores, using as training examples opinion words
extracted from "pros and cons" reviews from the same domain and exploiting the
clue that opinion words appearing in the "pros" section are positive and those
appearing in the "cons" section are negative (Balahur and Montoyo, 2008b; Balahur
and Montoyo, 2008d; Balahur and Montoyo, 2008f). Another approach that uses the
polarity of the local context for computing word polarity is the one presented by
Popescu and Etzioni (2005), who use a weighting function of the words around the
context to be classified.
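For reference, the Normalized Google Distance mentioned above (Cilibrasi and Vitanyi, 2006) between two terms x and y is defined as:

\[
\mathrm{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}
\]

where f(x) is the number of pages containing x, f(x,y) the number of pages containing both terms, and N the (approximate) total number of pages indexed by the search engine. In a nearest-anchor reading of the approach described above, a new word can be assigned the polarity of the anchor word at the smallest distance from it.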
The lexical resources that were created and are freely available for use in the
task of opinion mining are:
- WordNet Affect (Strapparava and Valitutti, 2004);
- SentiWordNet (Esuli and Sebastiani, 2006);
- Emotion triggers (Balahur and Montoyo, 2008a);
- MicroWNOp (Cerini et al., 2007);
- General Inquirer (Stone et al., 1966);
Researchers have proposed different scenarios for mining opinions, as well as
different methods for performing this task. It is important to mention that
although the general idea of classifying sentiment in text is understood as one of
assigning a piece of text (document, sentence, review) a value of "positive" or
"negative" (or "neutral"), other scenarios have been defined in which the "positive"
category refers to liking, arguments brought in favor of an idea ("pros") or support
of a party or political view, and the "negative" class includes expressions of
disliking something, arguments brought against an idea ("cons") or
opposition to an ideology.
Sentiment analysis (opinion mining), which we have defined according to the
understanding given to sentiments and attitudes, is a difficult task due to the high
semantic variability of natural language: it supposes not only the discovery of
directly expressed opinions, but also the extraction of phrases that indirectly or
implicitly evaluate objects, by means of emotions or attitudes.
It is also important to note the fact that sentiment analysis does not necessarily
require as input an opinionated piece of text (Pang and Lee, 2008). Good versus bad
news classification has also been considered as a sentiment classification task,
which was approached in research such as the one proposed by Koppel and
Shtrimberg (2004).
However, it is also very important to note that a clear distinction must be made
when performing sentiment analysis at the document level, namely that, for
example, the content of good versus bad news (which is factual information) should
not influence the judgment of sentiment as far as the facts or the people involved
are concerned. To exemplify, a sentence such as "Great struggles have been made
by the government to tackle the financial crisis, which led many companies to
bankruptcy" must not be seen as negative because it discusses the consequences and
gravity of the financial crisis, but must be seen as positive when sentiment on the
government is analyzed. Certainly, we can see that in this case the sentiment and its
polarity arise from the manner in which the reporting is done.
We thus distinguish between document-level, sentence-level and feature-level
sentiment analysis. The tasks are defined differently at each level and involve
performing extra, more specialized steps. We will further show what those steps
are. According to the survey by Pang and Lee (2008), the general strategies that have
been used in sentiment polarity classification are:
- Classification using the representation of text as feature vectors, where the
entries correspond to terms, either as frequency counts (using tf-idf) or as the
presence or absence of certain opinion words (see the sketch after this list). In this
context, Wilson et al. (2005) have shown that rare words (words that appear very
infrequently in the opinion corpus), also called hapax legomena, have a very
good precision in subjectivity classification.
- Using information related to the part of speech of the sentiment words and
applying specialized machine learning algorithms for the acquisition of such
words (adjectives, verbs, nouns, adverbs). Work on acquiring nouns
with sentiment was proposed by Riloff et al. (2005). Here, the authors
use dependency parsing and consider the dependency relations as features of
machine learning algorithms. In this setting, information about
modifiers or valence shifters can be introduced, as dependency analysis
allows for the identification of the constituents that are modified.
- For the tasks in which sentiment on a certain topic must be extracted, the
features used in machine learning for sentiment classification were
modified to include information on the mentions of the topic or on the Named
Entities mentioned in relation to it.
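The sketch below illustrates the first of these strategies (term-frequency/presence feature vectors fed to a linear classifier); it uses scikit-learn purely as a convenient illustration, and the toy corpus and labels are invented for the example:

```python
# Minimal sketch of the feature-vector strategy described in the list above
# (illustrative only; the toy corpus and labels are invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["great phone , excellent battery", "terrible screen , awful support",
        "amazing camera and great value", "poor build quality , very disappointing"]
labels = ["positive", "negative", "positive", "negative"]

# Each document becomes a vector whose entries correspond to terms (tf-idf weights);
# a binary CountVectorizer would give presence/absence features instead.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

classifier = LinearSVC().fit(X, labels)
print(classifier.predict(vectorizer.transform(["great battery and amazing value"])))
```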
In the subsequent sections, we present the methods that were employed for
sentiment analysis at the different levels considered.
Mullen and Collier (2004) show that classifying sentiment using Support Vector
Machines with features computed on the basis of word polarity, semantic
differentiation computed using synonymy patterns in WordNet, proximity to topic
features and syntactic relations outperforms n-gram classifications.
Another similar approach was taken by Pang and Lee (2003). In this approach,
the authors classify reviews into a larger scale of values (not only positive and
negative), seen as a regression problem, and employ SVM machine learning with
similarity features. They compare the outcome against the number of stars given to
the review.
Chaovalit and Zhou (2005) perform a comparison between different methods of
supervised and unsupervised learning based on n-gram features and semantic
orientation computed by using patterns and dependency parsing.
Goldberg and Zhu (2006) present a graph-based approach to sentiment
classification at a document level. They represent documents as vectors, computed
on the basis of presence of opinion words and then link each document to the k
most similar ones. Finally, they classify documents on the basis of the graph
information using SVM machine learning.
A similar effort is made by Ng et al. (2006), where the goal is also to classify
documents according to their polarity. The authors present an interesting
comparison between dependency-based classification and the use of dependency
relations as features for machine learning, which concludes that dependency parsing
is not truly effective at the time of performing document level sentiment analysis, as
it was previously shown in other research (Kudo and Matsumoto, 2004).
SENTENCE-LEVEL SENTIMENT ANALYSIS
At the sentence level, or at the level of parts of documents, sentiment analysis is done in most
cases in two steps: the first one concerns the selection of subjective sentences and the
second one aims at classifying the sentiment expressed according to its polarity.
The assumption that is made in this case is that each sentence expresses one single
opinion.
Sentiment analysis at the sentence level includes work by Pang and Lee (2004),
where an algorithm based on computing the minimum cut in a graph containing
subjective sentences and their similarity scores is employed.
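A sketch of the kind of cut-based objective used in this line of work (notation ours, not the exact formulation of the cited paper): each sentence i receives individual scores ind_subj(i) and ind_obj(i), each pair a similarity score assoc(i, j), and the set S of sentences labeled subjective is chosen to minimize

\[
\sum_{i \in S} \mathrm{ind}_{\mathrm{obj}}(i) \;+\; \sum_{j \notin S} \mathrm{ind}_{\mathrm{subj}}(j) \;+\; \sum_{i \in S,\; j \notin S} \mathrm{assoc}(i, j),
\]

a cost that can be minimized exactly by solving a minimum-cut (maximum-flow) problem on the corresponding graph.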
Yu and Hatzivassiloglou (2003) use sentence-level sentiment analysis with the
aim of separating facts from opinions in a question answering scenario.
Other authors use subjectivity analysis to detect sentences from which patterns
can be deduced for sentiment analysis, based on a subjectivity lexicon
(Hatzivassiloglou and Wiebe, 2000; Wiebe and Riloff, 2006; Wilson et al., 2004).
Kim and Hovy (2004) try to find, given a certain topic, the positive, negative
and neutral sentiments expressed on it and the source of the opinions (the opinion
holder). After creating sentiment lists using WordNet, the authors select sentences
which both contain the opinion holder and carry opinion statements, and
compute the sentiment of the sentence in a window of different sizes around the
target, as the harmonic and, respectively, the geometric mean of the sentiment scores
assigned to the opinion words.
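As a rough sketch of this aggregation step (the lexicon scores, the window size and the reading of each score as the probability of the word being positive are all invented for the example):

```python
# Rough sketch of aggregating opinion-word scores in a window around the target.
# The lexicon, window size and interpretation of the scores (probability of the
# word being positive) are invented for the example.
from math import prod

positive_probability = {"good": 0.8, "excellent": 0.9, "disappointing": 0.1}

def window_sentiment(tokens, target, size=5):
    """Harmonic and geometric means of the scores of opinion words near the target."""
    idx = tokens.index(target)
    window = tokens[max(0, idx - size): idx + size + 1]
    scores = [positive_probability[t] for t in window if t in positive_probability]
    if not scores:
        return None
    harmonic = len(scores) / sum(1.0 / s for s in scores)
    geometric = prod(scores) ** (1.0 / len(scores))
    return harmonic, geometric

print(window_sentiment("the new camera is excellent and good".split(), "camera"))
# (0.847..., 0.848...): both means suggest a clearly positive context around "camera"
```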
Kudo and Matsumoto (2004) use a subtree-based boosting algorithm with
dependency-tree-based features and show that this approach outperforms the
bag-of-words baseline, although it does not bring a significant improvement over the use
of n-gram features.
FEATURE-LEVEL SENTIMENT ANALYSIS
Sentiment analysis at the feature level, also known as "feature-based opinion
mining" (Hu and Liu, 2004; Liu, 2007), is defined as the task of extracting, given
an object (product, event, person etc.), the features of the object and the opinion
words used in texts in relation to these features, classifying the opinion words and
producing a final summary containing the percentages of positive versus negative
opinions expressed on each of the features. This task had previously been defined
by Dave et al. (2003).
Feature-based opinion mining involves a series of tasks:
Task 1: Identify and extract object features that have been commented on
by an opinion holder (e.g., a reviewer).
Task 2: Determine whether the opinions on the features are positive,
negative or neutral.
Task 3: Group feature synonyms.
Subsequently, once all the groups of words referring to the same feature are gathered
and the polarity of the opinions is computed, the result is presented as a percentage
of positive versus negative opinions on each feature (a feature-based opinion summary
of multiple reviews).
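As a minimal sketch of this final aggregation step (the extracted (feature, polarity) pairs are invented; Tasks 1-3 are assumed to have produced them):

```python
# Minimal sketch of the feature-based summary: report the percentage of positive
# versus negative mentions per feature (the example data is invented).
from collections import Counter, defaultdict

# (feature, polarity) pairs as they might come out of Tasks 1-3
classified_opinions = [("battery", "positive"), ("battery", "positive"),
                       ("battery", "negative"), ("screen", "negative"),
                       ("screen", "positive"), ("screen", "negative")]

counts = defaultdict(Counter)
for feature, polarity in classified_opinions:
    counts[feature][polarity] += 1

for feature, polarities in counts.items():
    total = sum(polarities.values())
    summary = ", ".join(f"{pol}: {100 * n / total:.0f}%" for pol, n in polarities.items())
    print(f"{feature}: {summary}")
# battery: positive: 67%, negative: 33%
# screen: negative: 67%, positive: 33%
```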
A series of techniques and approaches have been used for each of these three
subtasks.
For the identification of features, "pros and cons" reviews were used, label
sequential rules based on training sequences were employed to define extraction
rules (Popescu and Etzioni, 2005), frequent features were mined using sequential
pattern mining (frequent phrases) and patterns for "part-of" relations were defined
(Ding et al., 2008). Infrequent features were discovered using similarity in WordNet.
Polarity classification was done using as a starting point the adjectives "good" and
"bad" and exploring the synonyms and antonyms of these words in WordNet (Hu and
Liu, 2004), using weighting functions depending on surrounding words (Popescu and
Etzioni, 2005), or using local conjunction or disjunction relations with words of
previously known polarity (Ding et al., 2008). The grouping of feature synonyms was
done using relations in WordNet.
An important related research area was explored in Task 18 at SemEval 2010
(Wu and Jin, 2010). In this task, the participants were given a set of contexts in
Chinese, in which 14 dynamic sentiment ambiguous adjectives are selected. They
were: |big, |small, |many, |few, |high, |low, |thick, |thin, |deep,
|shallow, |heavy, |light, |huge, |grave. The task was to
automatically classify the polarity of these adjectives, i.e. to detect whether their
sense in the context is positive or negative. The majority of participants employed opinion mining systems to classify the overall contexts, after which local rules were applied to decide whether the polarity of the adjective to be classified remained the same as the overall polarity of the text or was inverted.
Recently, authors have shown that performing very fine or very coarse-grained
sentiment analysis has drawbacks for the final application, as many times the
sentiment is expressed within a context, by comparing or contrasting with it. This is
what motivated McDonald et al. (2007) to propose an incremental model for
sentiment analysis, starting with the analysis of text at a very fine-grained level and
adding up granularity to the analysis (the inclusion of more context) up to the level
of different consecutive sentences. The authors showed that this approach greatly improved the sentiment analysis performance. The same observation was made by
Balahur and Montoyo (2009) for the task of feature-based opinion mining and
subsequently confirmed by experiments in opinion question answering (Balahur et
al., 2009a; Balahur et al., 2009d; Balahur et al., 2009h; Balahur et al., 2010a;
Balahur et al., 2010c).
Question answering (QA) has a long research tradition; however, it is only recently that researchers have started to focus on the development of Opinion Question Answering (OQA) systems.
Stoyanov et al. (2005) and Pustejovsky and Wiebe (2005) studied the
peculiarities of opinion questions and have found that they require the development
of specific techniques to be tackled, as their answers are longer and the analysis of
the question is not as straightforward as in the case of factoid questions. Cardie et al.
(2004) employed opinion summarization to support a Multi-Perspective QA system,
aiming to identify the opinion-oriented answers for a given set of questions.
Yu and Hatzivassiloglou (2003) separated opinions from facts and summarized
them as answer to opinion questions.
Kim and Hovy (2005) identified opinion holders, which are a key component in
retrieving the correct answers to opinion questions.
Given the recognized importance of blog data, recent years have also marked the beginning of NLP research focused on the development of OQA systems and the organization of international conferences encouraging the creation of effective QA systems both for factual and subjective texts. The TAC 2008 QA track proposed a collection of factoid and opinion queries, called the "rigid list" (factoid) and the "squishy list" (opinion) respectively, to which the traditional QA systems had to be adapted.
In this competition, some participating systems treated opinionated questions as "other" and thus did not employ opinion-specific methods. However, the systems that performed better on the squishy list questions than on the rigid list ones implemented additional components to classify the polarity of the question and of the extracted answer snippet. The Alyssa system (Shen et al., 2007) uses a Support Vector Machines (SVM) classifier trained on the MPQA corpus (Wiebe et al., 2005), English NTCIR data and rules based on the subjectivity lexicon (Wilson et
al., 2005). Varma et al. (2008) performed query analysis to detect the polarity of the
question using defined rules. Furthermore, they filter opinionated from factual retrieved snippets using a classifier based on Naïve Bayes with unigram features, assigning
for each sentence a score that is a linear combination between the opinion and the
polarity scores. The PolyU (Li et al., 2008b) system determines the sentiment
orientation of the sentence using the Kullback-Leibler divergence measure with the
two estimated language models for the positive versus negative categories. The
QUANTA system (Li et al., 2008a) performs opinion question sentiment analysis
by detecting the opinion holder, the object and the polarity of the opinion. It uses a
semantic labeler based on PropBank and manually defined patterns. Regarding the
sentiment classification, they extract and classify the opinion words. Finally, for the
7 http://www.nist.gov/tac/
8 http://research.nii.ac.jp/ntcir/
9 http://verbs.colorado.edu/~mpalmer/projects/ace.html
answer retrieval, they score the retrieved snippets depending on the presence of
topic and opinion words and only choose as answer the top ranking results.
Other related work concerns opinion holder and target detection. NTCIR 7
MOAT organized such a task, in which most participants employed machine
learning approaches using syntactic patterns learned on the MPQA corpus (Wiebe
et al., 2005). Starting from the abovementioned research, the work we proposed
(Balahur et al., 2009a; Balahur et al., 2009d; Balahur et al., 2009h; Balahur et al.,
2010a; Balahur et al., 2010e) employed opinion specific methods focused on
improving the performance of our OQA system. We perform the retrieval at the 1-sentence and 3-sentence level and also determine new elements that we define as crucial for the opinion question answering scenario: the Expected Source (ES), the Expected Target (ET), the Expected Answer Type (EAT) and the Expected Polarity Type (EPT).
the 7 emotions in the ISEAR corpus and represent the examples using measures
computed on the basis of these terms.
In the SemEval 2007 Task No. 14, Affective Text (Strapparava and Mihalcea, 2007), the task was to classify 1000 news headlines depending on their valence and emotion, using Ekman's six basic emotions model (Ekman, 1999). The participating
systems used rule-based or machine learning approaches, employing the polarity
and emotion lexicons existent at that time (SentiWordNet, General Inquirer and
WordNet Affect), the only training set available for emotion detection at that time
(the training data containing 1000 news headlines provided by the task organizers)
or calculating Pointwise Mutual Information scores using search engines.
In this chapter, we presented the main issues that research in sentiment analysis
aims to tackle and the state-of-the-art approaches that have been proposed for them.
The methods used so far, as we have seen, do not make any distinction
according to the type of text that is analyzed. In the next chapter, we present the
methods and the resources that we created in order to tackle the task of sentiment
analysis. In contrast to existing approaches, we will first study the requirements of
the sentiment analysis task in the different textual genres considered, and
subsequently propose adequate techniques and resources.
opinion expressed; however, the positivity or negativity of the news content should
not be mistaken for the polarity of the opinion expressed therein.
In blogs, we face the same difficulties, i.e. having to determine the characteristics of the source, as well as to ensure that the target of the opinions expressed is the required one. Moreover, blogs have a dialogue-like structure and, most of the time, the topic discussed is related to a news item that is taken from a
newspaper article. The same phenomena are also present in forums, microblogs,
social network comments and reviews, but the characteristics of these texts are
different (e.g. shorter documents, different language used, single versus multiple
targets of opinions, different means of referencing targets).
In this chapter, we present the tasks and the methods we have proposed, in a
suitable manner, to tackle sentiment analysis in different text types. For each of the
textual genres considered, we have appropriately defined the task of sentiment
analysis, identified the genre peculiarities and proposed adequate methods to tackle
the issues found.
Where previous approaches fell short of correctly identifying the needs of the specific textual genre, we put forward adequate formulations of the problem and proposed specific methods to tackle them. Additionally, where insufficient
resources were available, we have developed new annotation schemes and new
corpora, for English and other languages (Spanish, German, Chinese).
This chapter is structured as follows: in Section 4.1 we present the methods and
resources we proposed and evaluated in the context of sentiment analysis from
product reviews. Subsequently, in Section 4.2., we discuss the issues involved in
sentiment analysis from newspaper articles, specifically in reported speech
extracted from news (i.e. quotations). In Section 4.3., we present a method to detect
sentiments expressed in political debates, studying the needs of a generic sentiment
analysis system that is able to deal with different topics, in a dialogue framework, in
which different sentiment sources and targets are present. Finally, in Section 4.4.,
we present the methods and resources we have developed for sentiment analysis
from blogs. Summing up all the experience gathered from the analysis of the
previous text types, we design a general method to tackle sentiment analysis, in the
context of this new and complex text type with a dialogue-like structure, in which
formal and informal language styles are mixed and sentiment expressions are highly
diverse.
only the local context in which they appear next to the feature they determine, but
also other adjectives appearing with the feature and their polarity in different
contexts. Popescu and Etzioni (2005) employ a more complex approach for
feature-based summarization of opinions, by computing the web PMI (Pointwise
Mutual Information) statistics for the explicit feature extraction and a technique called relaxation labeling for the assignment of polarity to the opinions. In this
approach, dependency parsing is used together with ten extraction rules that were
developed intuitively.
We propose an initial approach to the issue of feature-based opinion mining
(Balahur and Montoyo, 2008d; Balahur and Montoyo, 2008f), which we
subsequently extended in Balahur and Montoyo (2008c), by introducing product
technical details and comparing two different measures of term relatedness
(Normalized Google Distance (Cilibrasi and Vitanyi, 2006) and Latent Semantic Analysis (Deerwester et al., 1990)). On top of this system, we propose a method to
recommend products based on the scores obtained for the different quantified
features in the opinion mining step. The method was presented in (Balahur and
Montoyo, 2008b).
In the light of the fact that no standard annotation was available for feature-based opinion mining, in Balahur and Montoyo (2009) we proposed an annotation
scheme that aimed at standardizing the labeling of reviews, so that different types of
mentions of features (direct, indirect, implicit) and the different manner of
expressing opinions (through subjective or objective statements) can be correctly
labeled. Subsequently, we studied methods to infer sentiment expressed on different
features using examples of annotations from a review corpus and Textual
Entailment (Balahur and Montoyo, 2009).
In the following sections we describe our approaches and results. The method
we propose is language and customer-review independent. It extracts a set of
general product features, finds product specific features and feature attributes and is
thus applicable to all possible reviews in a product class. The approaches we
present in this section were summarized by Balahur et al. (2010).
[Figure 4.1: Overview of the preprocessing stage. The user query triggers review retrieval; the language identification component splits the retrieved reviews into English and Spanish; the product class is identified; product features are obtained using ConceptNet and WordNet and mapped to Spanish through EuroWordNet, producing the English and Spanish product feature lists.]
A. PREPROCESSING
In our approach, we start from the following scenario: a user enters a query about a product that he/she is interested in buying. The search engine retrieves a series of documents containing the product name, in different languages. Further on, two parallel operations are performed: the first one uses the Lextek language identifier software to filter the documents and obtain two categories, one containing the reviews in English and the other the reviews in Spanish.
The second operation implies a modified version of the system proposed by
Kozareva et al. (2007) for the classification of person names. We use this system in
order to determine the category that the product queried belongs to (e.g. digital
camera, laptop, printer, book). Once the product category is determined, we proceed
to extracting the product specific features and feature attributes. This is
accomplished using WordNet and ConceptNet and the corresponding mapping to
Spanish using EuroWordNet. Apart from the product specific class of features and
feature attributes, we consider a core of features and feature attributes that are
product-independent and whose importance determines their frequent occurrence in
customer reviews. Figure 4.1 describes the components used in the preprocessing
stage.
11 http://www.lextek.com/langid/
12 http://www.illc.uva.nl/EuroWordNet/
where x and y are two words and P(x) stands for the probability of the word x
occurring in the corpus considered. In this manner, we discover bigram features
such as battery life, mode settings and screen resolution.
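For reference, the standard pointwise mutual information score that this explanation refers to (the exact variant used in our system may differ in its probability estimates) can be written as:

PMI(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}

where P(x, y) is the probability of the two words co-occurring; keeping bigrams with a high PMI is one common way to operationalize the selection of multiword features described above.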
B. MAIN PROCESSING
The main processing in our system is done in parallel for English and Spanish. In
the next section, we will briefly describe the steps followed in processing the initial
input containing the customer reviews in the two considered languages and offer as
output the summarized opinions on the features considered. Figure 4.2 presents the
steps included in the processing.
We start from the reviews filtered according to language. For each of the two
languages considered, we used a specialized tool for anaphora resolution: JavaRAP for English and SUPAR (Ferrández et al., 1999) for Spanish. Further on, we
separate the text into sentences and use a Named Entity Recognizer to spot names
of products, brands or shops.
Using the lists of general features and feature attributes, product-specific
features and feature attributes, we extract from the set of sentences contained in the
text only those containing at least one of the terms found in the lists.
[Figure 4.2: Main processing steps. The English and Spanish reviews are passed through anaphora resolution (JavaRAP for English, SUPAR for Spanish), processed with the corresponding linguistic tools, matched against the English and Spanish feature lists, and the results are produced separately for each language.]
Anaphora resolution
In order to solve the anaphoric references to the product features and feature
attributes, we employ two anaphora resolution tools - JavaRAP for English and
SUPAR for Spanish. Using these tools, we replace the anaphoric references with
their corresponding referents and obtain a text in which the terms constituting
product features could be found.
JavaRAP is an implementation of the classic Resolution of Anaphora Procedure
(RAP) given by Lappin and Leass (1994). It resolves third person pronouns, lexical
anaphors, and identifies pleonastic pronouns.
Using JavaRAP, we obtain a version of the text in which pronouns and lexical
references are resolved. For example, the text: I bought this camera about a week
ago, and so far have found it very very simple to use, takes good quality pics for
what I use it for (outings with friends/family, special events). It is great that it
already comes w/ a rechargeable battery that seems to last quite a while..., by
resolving the anaphoric pronominal reference, becomes I bought this camera
about a week ago, and so far have found <this camera> very very simple to use,
takes good quality pics for what I use <this camera> for (outings with friends/family,
special events). It is great that <this camera> already comes w/a rechargeable
battery that seems to last quite a while....
For the anaphora resolution in Spanish, we employ SUPAR (Slot Unification
Parser for Anaphora Resolution). The architecture of SUPAR contains, among
others, a module solving the linguistic problems (pronoun anaphora, element
extraposition, ellipsis, etc.). We use SUPAR in the same manner as JavaRAP, to
solve the anaphora for Spanish. Sentence chunking and NER Further on, we split
the text of the customer review into sentences and identify the named entities in the
text. Splitting the text into sentences prevents us from processing sentences that
have no importance as far as product features that a possible customer could be
interested in are concerned.
Chunking and Named Entity Resolution
LingPipe13 is a suite of Java libraries for the linguistic analysis of human language.
It includes features such as tracking mentions of entities (e.g. people or proteins),
part-of-speech tagging and phrase chunking.
We use LingPipe to split the customer reviews in English into sentences and
identify the named entities referring to products of the same category as the product
queried. In this manner, we can be sure that we identify sentences referring to the
13 http://alias-i.com/lingpipe/
product queried, even when the reference is made using the name of another product. For example, in the text "For a little less, I could have bought the Nikon Coolpix, but it is worth the extra money.", anaphora resolution replaces <it> with <Nikon Coolpix> and this step will further replace it with <camera>.
The FreeLing package consists of a library providing language analysis
services. The package offers many services, among which text tokenization,
sentence splitting, POS-tagging, WordNet based sense annotation and rule-based
dependency parsing. We employ FreeLing in order to split the customer reviews in
Spanish into sentences and identify the named entities referring to products of the
same category as the product queried.
Sentence extraction
Having completed the feature and feature attributes identification phase, we
proceed to extracting for further processing only the sentences that contain the
terms referring to the product, product features or feature attributes. In this manner,
we avoid further processing of text that is of no importance to the task we wish to
accomplish. For example, sentences of the type I work in the home appliances
sector will not be taken into account in further processing. Certainly, at the overall
level of review impact, such a sentence might be of great importance to a reader,
since it proves the expertise of the opinion given in the review. However, for the
problems we wish to solve by using this method, such a sentence is of no
importance.
Sentence parsing
Each of the sentences that are filtered by the previous step are parsed in order to
obtain the sentence structure and component dependencies. In order to accomplish
this, we use Minipar (Lin, 1998) for English and FreeLing for Spanish. This step is
necessary in order to be able to extract the values of the features mentioned based
on the dependency between the attributes identified and the feature they determine.
Feature value extraction
Further on, we extract features and feature attributes from each of the identified
sentences, using the following rules:
1. We introduce the following categories of context polarity shifters (Polanyi
and Zaenen, 2004), in which we split the modifiers and modal operators in
two categories i.e. positive and negative:
- negation: no, not, never, etc.
14 http://nlp.lsi.upc.edu/freeling/
equal with 20. We give as example the classification of the feature attribute tiny,
for the size feature. The set of positive feature attributes considered contains 15
terms (e.g. big, broad, bulky, massive, voluminous, large-scale, etc.) and the set of
negative feature attributes considered is composed of opposite examples, such as small, petite, pocket-sized, little, etc.
We use the anchor words to convert each of the 30 training words to 6-dimensional training vectors defined as v(j,i) = NGD(wi, aj), where aj with j ranging
from 1 to 6 are the anchors and wi, with i from 1 to 30 are the words from the
positive and negative categories.
After obtaining the total 180 values for the vectors, we use SVM SMO to learn
to distinguish the product specific nuances. For each of the new feature attributes
we wish to classify, we calculate a new value of the vector
vNew(j,word)=NGD(word, aj), with j ranging from 1 to 6 and classify it using the
same anchors and trained SVM model.
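A minimal sketch of this procedure (our illustration; the anchor words, seed words and hit counts are hypothetical, scikit-learn's SVC stands in for the SVM SMO implementation used in our system, and in practice the frequencies would be obtained from a search engine):

import math
from sklearn import svm

def ngd(x, y, hits, n_pages=1e10):
    # Normalized Google Distance computed from page-hit counts; `hits` maps a
    # term or a (term, term) pair to its hit count, and n_pages is the assumed
    # total number of indexed pages.
    fx, fy, fxy = hits[x], hits[y], hits[(x, y)]
    return ((max(math.log(fx), math.log(fy)) - math.log(fxy)) /
            (math.log(n_pages) - min(math.log(fx), math.log(fy))))

anchors = ["positive", "negative", "compact", "bulky", "handy", "heavy"]  # hypothetical anchors
positive_words = ["small", "little", "petite"]                            # abbreviated seed sets
negative_words = ["big", "massive", "voluminous"]

def vector(word, hits):
    return [ngd(word, a, hits) for a in anchors]

def classify_new_attribute(new_word, hits):
    X = [vector(w, hits) for w in positive_words + negative_words]
    y = ["positive"] * len(positive_words) + ["negative"] * len(negative_words)
    clf = svm.SVC(kernel="linear").fit(X, y)   # stand-in for SVM SMO
    return clf.predict([vector(new_word, hits)])[0]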
In the example considered, we had the following results (we specify in brackets the word to which the scores refer):
(small)1.52,1.87,0.82,1.75,1.92,1.93,positive
(little)1.44,1.84,0.80,1.64,2.11,1.85,positive
(big)2.27,1.19,0.86,1.55,1.16,1.77,negative
(bulky)1.33,1.17,0.92,1.13,1.12,1.16,negative
This vector was classified by SVM as positive, using the training set specified
above. The precision value in the classifications we made was between 0.72 and
0.80, with a kappa value above 0.45.
For each of the features identified, we compute its polarity depending on the
polarity of the feature attribute that it is determined by and the polarity of the
context modifier the feature attribute is determined by, in case such a modifier
exists. Finally, we statistically summarize the polarity of the feature attributes, as
ratio between the number of positive quantifications and the total number of
quantifications made in the considered reviews to that specific feature and as ratio
between the number of negative quantifications and the total number of
quantifications made in all processed reviews. The formulas can be summarized as:
1. pos(feature) = number of positive quantifications of the feature / total number of quantifications of the feature
2. neg(feature) = number of negative quantifications of the feature / total number of quantifications of the feature
It is difficult to evaluate the performance of such a system, since we must take into
consideration both the accuracy in extracting the features that reviews comment on,
as well as the correct assignment of identified feature attributes to the positive or
negative category.
The formula used for measuring the accuracy of the system is the normalized sum, over the considered features existing in the text, of the ratio between the number of identified positive feature attributes and the number of existing positive attributes and the ratio between the number of identified negative feature attributes and the total number of negative feature attributes.
Secondly, we compute the Feature Identification Precision (P) as the ratio between the number of correctly identified features and the total number of identified features.
Thirdly, we compute the Feature Identification Recall (R) as the ratio between the number of correctly identified features and the number of features present in the text. The results obtained are summarized in Table 4.1.
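For clarity, these two measures correspond to the standard definitions:

P = \frac{|\text{correctly identified features}|}{|\text{identified features}|}, \qquad R = \frac{|\text{correctly identified features}|}{|\text{features present in the reviews}|}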
We show the scores for each of the two languages considered separately and the
combined score when using both systems for assigning polarity to feature attributes
of a product. In the last two columns, we present a baseline, calculated as the average obtained using the same formulas, but taking into consideration, for each feature, only the
feature attributes we considered as training examples for our method.
Feature extraction
performance      English   Spanish   Combined   Baseline EN   Baseline ES
Accuracy         0.82      0.80      0.81       0.21          0.19
Precision        0.80      0.78      0.79       0.20          0.20
Recall           0.79      0.79      0.79       0.40          0.40
Table 4.1: System results on the annotated review corpus
We can notice how the use of NGD helped the system acquire significant new
knowledge about the polarity of feature attributes, in the context.
There are many aspects to be taken into consideration when evaluating a system
identifying features, opinion on features and summarizing the polarity of features.
First of all, customers reviewing products on the web frequently use informal
language, disregard spelling rules and punctuation marks.
At times, phrases are pure enumerations of terms, containing no subject or
predicate. In this case, when there is no detectable dependency structure between
components, an alternative method should be employed, such as verifying if the
terms appearing near the feature within a window of specified size are frequently
used in other contexts with relation to the feature. Secondly, there are many issues
regarding the accuracy of each of the tools and language resources employed and a
certain probability of error in each of the methods used. In this initial research, we
presented a method to extract, for a given product, the features that could be
commented upon in a customer review.
Further, we have shown a method to acquire the feature attributes on which a
customer can comment in a review. Moreover, we presented a method to extract
and assign polarity to these product features and statistically summarize the polarity
they are given in the review texts in English and Spanish. The method for polarity
assignment is largely language independent (it only requires the use of a small
number of training examples) and the entire system can be implemented in any
language for which similar resources and tools as the ones used for the presented
system exist.
The main advantage obtained by using this method is that one is able to extract
and correctly classify the polarity of feature attributes, in a product dependent
manner. Furthermore, the features that are identified in texts are correct and the percentage of identification is high. Last but not least, we employ a measure of word similarity that is itself based on the word-of-mouth on the web. The main
disadvantage consists in the fact that SVM learning and classification is dependent
on the NGD scores obtained with a set of anchors that must previously be
established. This remains a rather subjective matter. Also, the polarity given in the training set determines the polarity given to new terms, such that "large" in the context of "display" will be trained as positive and in the case of "size" as negative.
However, there are many issues that must be addressed in systems identifying
customer opinions on different products on the web. The most important one is that
concerning the informal language style, which makes the identification of words
and dependencies in phrases sometimes impossible.
degree. In our previous approach, in order to assign polarity to each of the identified
feature attributes of a product, we employed SVM SMO machine learning and the
NGD. In this approach, we complete the solution with a classification employing
Latent Semantic Analysis with Support Vector Machines classification. For this, we
build the classes of positive and negative examples for each of the feature attributes
considered. From the list of classified feature attributes in the pros and cons
reviews, we consider all positive and negative terms associated to the considered
attribute features. We then complete the lists of positive and negative terms with
their WordNet synonyms. Since the number of positive and negative examples must
be equal, we will consider from each of the categories a number of elements equal
to the size of the smaller of the two sets, with a size of at least 10 and at most 20.
We give as example the classification of the feature attribute tiny, for the
size feature. The set of positive feature attributes considered contains 15 terms
such as big, broad, bulky, massive, voluminous, large-scale etc. and
the set of negative feature attributes considered is composed as opposed examples,
such as small, petite, pocket-sized, little etc. We use the anchor words to
convert each of the 30 training words to 6-dimensional training vectors defined as
v(j,i) = LSA(wi,aj), where aj with j ranging from 1 to 6 are the anchors and wi, with i
from 1 to 30 are the words from the positive and negative categories. After
obtaining the total 180 values for the vectors, we use SVM SMO to learn to
distinguish the product specific nuances. For each of the new feature attributes we
wish to classify, we calculate a new value of the vector vNew(j,word) = LSA(word,
aj), with j ranging from 1 to 6 and classify it using the same anchors and trained
SVM model. We employed the classification on the corpus available for training in the Infomap software package. The dashes in the table represent words which were not found in the corpus and for which, therefore, an LSA score could not be computed. The results are
presented in Table 4.2. On the other hand, we employed the classification on a
corpus made up of reviews on different electronic products, gathered using the
Google API and a site restriction on amazon.com. In the table below, we show an
example of the scores obtained with LSA on the feature attributes classified for the
feature size. The vector for the feature attribute tiny was classified by SVM as
positive, using the training set specified above. The results are presented in Table
4.3.
Feature attribute   V1     V2     V3    V4     V5     V6     Polarity
small               0.76   0.74   ---   0.71   1      0.71   pos
big                 0.80   0.75   ---   0.74   0.73   0.68   neg
bulky               ---    ---    ---   ---    ---    ---    pos
little              ---    ---    ---   ---    ---    ---    neg
tiny                0.81   0.71   ---   0.80   0.73   0.72   ---
Table 4.2: LSA scores on non-specialized corpus (not only with product reviews)
In Table 4.3, we show an example of the scores obtained with the similarity
given by the LSA scores on a specialized corpus of reviews on products. The vector
for the feature attribute tiny was classified by SVM as positive, using the training
set specified above.
Feature attribute   V1     V2     V3     V4     V5     V6     Polarity
small               0.83   0.77   0.48   0.72   1      0.64   pos
big                 0.79   0.68   0.74   0.73   0.77   0.71   neg
bulky               0.76   0.67   0.71   0.75   0.63   0.78   pos
little              0.82   0.76   0.52   0.71   0.83   0.63   neg
tiny                0.70   0.70   0.65   0.67   0.71   0.71   pos
Table 4.3: LSA scores on a specialized corpus of product reviews
Precision values in classifications we made with NGD and LSA for different
product features for the examples of digital camera reviews and the mobile phones
reviews vary from 0.75 to 0.8 and kappa statistics shows high confidence of
classification (Balahur and Montoyo, 2008c).
The conclusion that can be drawn from the results presented is that the main
advantage in using the first method of polarity assignment is that NGD is language
independent and offers a measure of semantic similarity taking into account the
meaning given to words in all texts indexed by Google from the World Wide Web.
On the other hand, using the whole Web corpus can also add significant noise.
Therefore, we employ Latent Semantic Analysis at a local level, both on a non-specialized corpus and on a corpus containing customer reviews. As we will
show, the classification using LSA on a specialized corpus brings an average of 8%
of improvement in the classification of polarity and a rise of 0.20 in the kappa
measure, leading to an 8% overall improvement in the precision of the
summarization system. However, these results were obtained using a specialized
corpus of opinions, which was previously gathered from the Web. In this respect, it
is important to determine sources (web sites, blogs or forums) specific to each of
the working languages, from which to gather the corpus on which the LSA model
can be built.
Using LSA on a non-specialized corpus improved the classification to the same
degree as the classification on a specialized corpus in the cases where the specific
pairs of words to be classified were found in the corpus. However, in 41% of the
cases, the classification failed due to the fact that the words we tried to classify
were not found in the corpus. Further on, we developed a method for feature
polarity extraction using subjective phrases.
As observed before, some opinions on the product or its features are expressed
indirectly, with subjective phrases containing positive or negative emotions which
are related to the product name, product brand or its features. In order to identify
those phrases, we have constructed a set of rules for extraction, using the emotion
lists from WordNet Affect. For the words present in the joy emotion list, we
consider the phrases extracted as having a positive opinion on the product or the
feature contained. For the words in the anger, sadness and disgust emotion
lists, we consider the phrases extracted as having a negative opinion on the product
or the feature contained. Apart from the emotion words, we have considered a list
of positive words (the "pos list"), containing adverbs such as definitely, totally, very, absolutely and so on, i.e. words that positively stress an idea (Iftene and Balahur-Dobrescu, 2007), which influence the polarity of the emotion expressed and are often found in user reviews.
We present the extraction rules in Table 4.4 (verb emotion, noun emotion and
adj emotion correspond to the verbs, nouns and adjectives, respectively, found in
the emotion lists from WordNet Affect under the emotions joy, sadness,
anger and disgust). In case of surprise, as emotion expressed about a product
and its features, it can have both a positive, as well as negative connotation.
Therefore, we have chosen not to include the terms expressing this emotion in the
extraction patterns.
1. I [pos list*][verb emotion][this||the||my][product name||product feature]
2. I ([am||'m||was||feel||felt])([pos list**])[adj emotion][with||about||by][product name||product feature]
3. I [feel||felt][noun emotion][about||with][product name||product brand]
4. I [pos list*][recommend][this||the][product name||product brand]
5. I ([don't])[think||believe][sentence***]
6. It ['s||is][adj emotion][how||what][product name||product feature][product action]
7. [You||Everybody||Everyone||All||He||She||They][will||would][verb emotion][this||the][product name||brand||feature]
Table 4.4: List of patterns for opinion extraction based on emotion clues
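As an illustration, a rough regular-expression equivalent of rule 1 could look as follows (our sketch; the word lists are abbreviated and hypothetical, whereas in the system they come from the pos list and the WordNet Affect joy/anger/sadness/disgust lists):

import re

POS_WORDS = ["definitely", "totally", "very", "absolutely", "really"]  # abbreviated "pos list"
JOY_VERBS = ["love", "adore", "enjoy", "like"]                         # abbreviated emotion verbs
FEATURES  = ["camera", "battery life", "screen", "zoom"]               # hypothetical feature list

def build_rule_1():
    # Rule 1: I [pos list*][verb emotion][this||the||my][product name||product feature]
    pos, verbs = "|".join(POS_WORDS), "|".join(JOY_VERBS)
    feats = "|".join(re.escape(f) for f in FEATURES)
    return re.compile(rf"\bI\s+(?:(?:{pos})\s+)?(?:{verbs})\s+(?:this|the|my)\s+(?:{feats})",
                      re.IGNORECASE)

rule_1 = build_rule_1()
print(bool(rule_1.search("I absolutely love this camera")))  # True -> positive opinion extracted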
        NGD           LSA           Rules         NGD+Rules     LSA+Rules
        P      R      P      R      P      R      P      R      P      R
        0.80   0.79   0.88   0.87   0.32   0.6    0.89   0.85   0.93   0.93
Table 4.5: System results on the review test set in Balahur and Montoyo (2008c)
In the case of the 5-review corpus proposed by Hu and Liu (2004), the
observation that is important to make is that, as opposed to the annotation made in
the corpus, we have first mapped the features identified to the general feature of the
product (for example "fit" refers to size and "edges" refers to design), as we believe that in real-life situations a user benefits more from a summary on coarser
classes of product features.
        NGD           LSA           Rules         NGD+Rules     LSA+Rules
        P      R      P      R      P      R      P      R      P      R
        0.81   0.80   0.85   0.88   0.28   0.5    0.89   0.85   0.93   0.93
Table 4.6: System results on the corpus employed by Hu and Liu (2004)
Additionally, our approach extracted a set of sentences that were not annotated in the corpus, such as "You'll love this camera", which express a positive opinion on the product. The results
shown in Table 4.6 are compared against the baseline of 0.20 precision and 0.41
recall, which was obtained using only the features determined by Balahur and
Montoyo (Balahur and Montoyo, 2008f) and the feature attributes whose polarity
was computed from the pros and cons-style reviews. As can be seen, the best results are obtained when using the combination of LSA with the rules for subjective phrase extraction. However, gathering the corpus for the LSA model can be a costly process, whereas NGD scores are straightforward to obtain and the classification is less costly in terms of time and resources.
What is interesting to study is the impact of employing LSA for gradual learning
and correction of a system that uses NGD for classifying the polarity of feature
attributes. In such a self-learning scheme, the online classification would be that
of NGD. However, the classification of the new feature attributes can be later
improved offline using the classification given by LSA, which can then be used
as better training for learning the polarity of new feature attributes by the online
NGD classification.
From this subsequent research, we could draw some conclusions on the
advantages and disadvantages of using different scenarios for computing opinion
polarity. The main advantage in using polarity assignment depending on NGD
scores is that this is language independent and offers a measure of semantic
similarity taking into account the meaning given to words in all texts indexed by
Google from the World Wide Web. The main advantage in using LSA on a
specialized corpus, on the other hand, is that it eliminates the noise given by the
multiple senses of words. We completed the opinion extraction on different product
features with rules using the words present in WordNet Affect, as indicative of
indirectly expressed opinions on products. We showed how all the employed
methods led to significant growth in the precision and recall of our opinion mining
and summarization system.
of these adjectives, i.e. to detect whether their sense in the context is positive or
negative.
Our approach is based on three different strategies: a) the evaluation of the
polarity of the whole context using an opinion mining system; b) the assessment of
the polarity of the local context, given by the combinations between the closest
nouns and the adjective to be classified; c) rules aiming at refining the local
semantics through the spotting of modifiers. The final decision for classification is
taken according to the output of the majority of these three approaches. The method
used yielded good results, the OpAL system run achieving approximately 76%
micro accuracy on a Chinese corpus. In the following subsections, we explain more
in detail the individual components employed.
A. THE OPAL OPINION MINING COMPONENT
First, we process each context using Minipar. We compute, for each word in a sentence, a series of features derived from the NTCIR 7 data and the EmotiBlog annotations. These are used to build feature vectors for each of the individual contexts:
- the part of speech (POS);
- opinionatedness/intensity: if the word is annotated as an opinion word, its polarity (1 or -1 if the word is positive or negative, respectively, and 0 if it is not an opinion word) and its intensity (1, 2 or 3, and 0 if it is not a subjective word);
- syntactic relatedness with other opinion words: whether it is directly dependent on an opinion word or modifier (0 or 1), plus the polarity/intensity and emotion of this word (0 for all the components otherwise);
- role in 2-word, 3-word, 4-word and sentence annotations: the opinionatedness, intensity and emotion of the other words contained in the annotation, the direct dependency relations with them if they exist, and 0 otherwise.
We add to the opinion words annotated in EmotiBlog the opinion words found in the Opinion Finder, MicroWordNet Opinion, General Inquirer, WordNet Affect and emotion triggers lexical resources. We train the model using the SVM SMO implementation in Weka.
15 http://webdocs.cs.ualberta.ca/~lindek/minipar.htm
16 http://www.cs.waikato.ac.nz/ml/weka/
System                  Micro Acc. (%)   Macro Acc. (%)
CityUHK2                62.63            60.85
CityUHK1                61.98            67.89
QLK_DSAA_NR             59.72            65.68
Twitter Sentiment       59.00            62.27
Twitter Sentiment_ext   56.77            61.09
Twitter Sentiment_zh    56.46            59.63
Biparty                 51.08            51.26
Table 4.7: Results for the 16 system runs submitted (micro and macro accuracy)
Since the gold standard was not provided, we were not able to perform an
exhaustive analysis of the errors. However, from a random inspection of the system
results, we could see that a large number of errors was due to the translation, through which modifiers are placed far from the words they determine or words are not translated with their best equivalent.
annotation was done by two non-native speakers of English, one with a degree in
Computer Science and the second a Linguistics student.
An example of annotation is presented in Figure 4.7.
<modifier gate:gateId="46" source ="w" target="restaurant"
id="83" feature="" type="direct" polarity="positive"
intensity="medium" POS="" phrase="multiword"
affect="admiration">I was pleasantly surprised</modifier>
to learn that the food is <modifier gate:gateId="47"
source ="w" target="excellent" id="84" feature=""
type="direct" polarity="positive" intensity="medium"
POS="adverb" phrase="word"
affect="admiration">still</modifier> <opinion
gate:gateId="48" source ="w" target="food" id="85"
feature="length" type="direct" polarity="positive"
intensity="high" POS="" phrase="word"
affect="admiration">excellent</opinion>, and the staff
very <opinion gate:gateId="49" source ="w" target="staff"
id="86" feature="length" type="direct" polarity="positive"
intensity="high" POS="adjective" phrase="word"
affect="admiration">professional</opinion> and <opinion
gate:gateId="50" source ="w" target="staff" id="87"
feature="length" type="direct" polarity="positive"
intensity="high" POS="adjective" phrase="word"
affect="admiration"> gracious </opinion>.
for half an hour in the lobby and finally, when they gave us the key,
realized the room had not been cleaned yet.). While the sentence is
purely factual in nature, it contains the phrases let us wait for half an
hour in the lobby , the room had not been cleaned yet, which we
annotated as feature expressions of implicit type.
4. There is an extensive use of conditionals within reviews (e.g. "If you dare, buy it! It was great for two weeks until it broke!"; "If you've got shaky hands, this is the camera for you and if you don't, there's no need to pay the extra $50"). We consider the sentence containing the conditional expression a modifier.
5. There are many rhetoric-related means of expressing opinion (e.g. Are
you for real? This is called a movie?). We annotate these elements as
implicit feature expressions.
At the time of performing our experiments, these phenomena are important and
must be taken into consideration, since 27% of the annotated opinionated phrases in
our corpus are composed of more than one sentence. More generally, these findings
draw our attention upon the context in which opinion mining is done. Most of the
work so far concentrated on sentence or text level, so our findings draw the
attention upon the fact that more intermediate levels should also be considered.
EXPERIMENTS AND EVALUATION
The first experiment we performed aimed at verifying the quality and consistency of the annotation as far as fact versus opinion phrases are concerned and, within the opinionated sentences, the performance obtained when classifying between positive and negative sentences. In the first phase, we lemmatized the annotated sentences using TreeTagger and represented each fact and opinion phrase as a vector of
characteristics, measuring the n-gram similarity (with n ranging from 1 to 4) and
overall similarity with each of the individual corpus annotated sentences, tokens
and phrases. We perform a ten-fold cross validation using the SVM SMO. The
results for fact versus opinion and positive versus negative classifications are
presented in Table 4.8.
            Precision   Recall   Kappa
Fact        0.72        0.6      0.53
Opinion     0.68        0.79     0.53
Positive    0.799       0.53     0.65
Negative    0.72        0.769    0.65
Table 4.8: Evaluation of fact vs. opinion and positive vs. negative classification
18 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
In the second phase, we consider for classification only the phrases containing
two sentences. As in the first phase, we represent each fact and opinion phrase as a
vector of characteristics, measuring the n-gram similarity (with n ranging from 1 to
4) and overall similarity with each of the individual corpus annotated sentences and
with all other phrases containing two sentences, then perform a ten-fold cross
validation using SVM SMO. We summarize the results obtained in Table 4.9.
            Precision   Recall   Kappa
Fact        0.88        0.76     0.43
Opinion     0.74        0.89     0.43
Positive    0.84        0.71     0.68
Negative    0.89        0.92     0.68
Table 4.9: Evaluation of fact versus opinion and positive versus negative classification of 2-sentence phrases
In the third phase, we consider for classification only the phrases containing
three sentences. The fact and opinion phrases are represented as in the first
experiments and a ten-fold cross validation using SVM SMO is done. The results
are shown in Table 4.10.
            Precision   Recall   Kappa
Fact        0.76        0.6      0.80
Opinion     0.78        0.94     0.80
Positive    0.85        0.76     0.68
Negative    0.92        0.96     0.68
Table 4.10: Evaluation of fact versus opinion and positive versus negative classification of 3-sentence phrases
From the results obtained, we can notice that using longer phrases, we obtain an
improved classification performance in both fact versus opinion classification, as
well as positive versus negative classification. The only drop in classification
performance is in the case of longer factual phrases. We explain this by the fact that
in many of the cases, these types of phrases contain descriptions of opinionated
sentences or represent combinations of factual and opinion sentences (e.g. They
said it would be great. They gave their word that it would be the best investment
ever made. It seems they were wrong). The results show that our annotation is consistent and that the labeled elements present similarity among them; this fact can be used to automate the annotation, as well as to use the labeled corpus for the training of an opinion mining system or for its evaluation. The evaluation proved that the
annotation schema and approach are general enough to be employed for labeling of
reviews on any product.
From the annotated corpus, for each of the considered products, we selected the
reviews containing sentences describing opinions on the criteria which users are
also allowed to assess using the stars system. The categories which are punctuated
with two or less stars are considered as negative and those punctuated with four or
five stars are considered as having been viewed positively.
We generated hypotheses of the form "Category is good" and "Category is nice", "Category is not good" and "Category is not nice", e.g. "The price was good.", "The price was not good.", "The food was good.", "The food was not good.", "The view was nice.", "The view was not nice.". In case no entailment was found with such built sentences, we computed the entailment with the annotated sentences
in the review corpus. The results obtained are shown in Table 4.11.
Name of product    Stars Category      Accuracy
Restaurant         Price               62%
                   Service             58%
                   Food                63%
                   View                53%
                   Atmosphere          58%
Digital camera     Ease of use         55%
                   Durability          60%
                   Battery life        80%
                   Photo quality       65%
                   Shutter lag         53%
Washing machine    Ease of Use         60%
                   Durability          72%
                   Ease of Cleaning    67%
                   Style               65%
Table 4.11: Polarity classification accuracy against the number of stars per category
As we can see from the obtained results, textual entailment can be useful at the
time of performing category based opinion mining. However, much remains to be
done at the level of computing semantic similarity between opinionated texts. Such
work may include the discovery of opinion paraphrases or opinion equivalence
classes.
obtain among others the vectors v1 and v2, corresponding to Camera1 4MP and
Camera2 4MP. In this case:
v1=(0.7,0.5,0.6,0.2,0.3,0.6,0.5, 0.5,0.7,0.8,0.7,0.8,0.4,0.3,0.3,0.7,0.6,0.3,0.8,0.4,0.4)
v2 = (0.8,1, 0.7,0.2,0.2,0.5, 0.4,0.4,0.8,0.8,0.8,0.8,0.7,0.7,0.3,0.8,0.6,0.7,0.5,0.3,0.6)
Calculating the cosine similarity between v1 and vperf and v2 and vperf,
respectively, we obtain 0.945 and 0.937. Therefore, we conclude that Camera1 4MP
is better than Camera2 4MP, because it is more similar to the perfect 4-Megapixel
camera model.
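The similarity measure referred to here is the standard cosine similarity between a product vector v and the "perfect product" vector v_perf:

sim(v, v_{perf}) = \frac{\sum_{i} v_i \, v_{perf,i}}{\sqrt{\sum_{i} v_i^{2}} \, \sqrt{\sum_{i} v_{perf,i}^{2}}}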
Another interesting aspect concerns the presence of various possible targets in the
quote, on which antonymic opinions are expressed (e.g. "How can they have a case against Fred, when he didn't sign anything?"). Moreover, they contain a larger range of affective phenomena, which are not easily classifiable when using only the categories of positive and negative: warning (e.g. "Delivering high quality education cannot be left to chance!"), doubt (e.g. "We don't know what we should do at this point"), concern, confidence, justice etc. (where doubt is generally perceived as a
negative sentiment and confidence as a positive one).
The aim we have is to determine the attitude polarity (tonality of the speech),
independent of the type of news, interpreting only the content of the text and not the
effect it has on the reader.
The polarity value of each of the quotes was computed as the sum of the values of
the words identified; a positive score leads to the classification of the quote as
positive, whereas a final negative score leads to the system classifying the quote as
negative. The resources used were: the JRC lists of opinion words, WordNet Affect
(Strapparava and Valitutti, 2004), SentiWordNet (Esuli and Sebastiani, 2005),
MicroWNOp (Cerini et al., 2007). WordNet Affect categories of anger and disgust
were grouped under high negative, fear and sadness were considered negative, joy
was taken as containing positive words and surprise as highly positive;
SentiWordNet and MicroWNOp contained positive and negative scores between 0
and 1 and in their case, we mapped the positive scores lower than 0.5 to the positive
category, the scores higher than 0.5 to the high positive set, the negative scores
lower than 0.5 to the negative category and the ones higher than 0.5 to the high
negative set.
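A small sketch of this mapping (our illustration; the comparison between the positive and negative scores of an entry is an assumption of the sketch, as it is not specified above):

def map_lexicon_entry(pos_score, neg_score):
    # Maps a (positive, negative) score pair in [0, 1], as found in
    # SentiWordNet / MicroWNOp, onto the four categories described above.
    # Comparing pos_score and neg_score first is our assumption.
    if pos_score > neg_score:
        return "high positive" if pos_score > 0.5 else "positive"
    if neg_score > pos_score:
        return "high negative" if neg_score > 0.5 else "negative"
    return None  # equal or zero scores: left unmapped

print(map_lexicon_entry(0.75, 0.0))   # high positive
print(map_lexicon_entry(0.25, 0.0))   # positive
print(map_lexicon_entry(0.0, 0.625))  # high negative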
As a filtering step, we first classified the quotes based on the presence of
subjectivity indicators, using the Opinion Finder lexicon (Wilson et al., 2005).
The subjective versus objective filtering had an accuracy of 0.89, as 2 of the
positive and 5 of the negative quotes were classified as neutral.
We evaluated the approaches both on the whole set of positive and negative
quotes, as well as only the quotes that were classified as subjective by the
subjectivity indicators. Subsequently, we grouped together resources that tended to
over-classify quotes as positive or negative, in an attempt to balance among their
classification. Finally, we grouped together all the words pertaining to the different
classes of positive, negative, high positive and high negative words belonging to all
the evaluated resources. The results are shown in Table 4.12 (-S/O and +S/O
indicate absence and presence, respectively, of the subjectivity filtering):
Resource             S/O filter   Ppos   Pneg   Rpos   Rneg
JRCLists             -S/O         0.77   0.3    0.54   0.55
JRCLists             +S/O         0.81   0.35   0.6    0.625
SentiWN              -S/O         1      0      0.51   0
SentiWN              +S/O         1      0      0.54   0
WNAffect             -S/O         0      1      0      0.51
WNAffect             +S/O         0      1      0      0.54
MicroWN              -S/O         0.62   0.36   0.52   0.48
MicroWN              +S/O         0.73   0.35   0.57   0.53
SentiWN + WNAffect   -S/O         0.22   0.66   0.42   0.45
SentiWN + WNAffect   +S/O         0.24   0.67   0.47   0.41
All                  -S/O         0.68   0.64   0.7    0.62
All                  +S/O         0.73   0.71   0.75   0.69
Table 4.12: Results of the classification using the different opinion and affect lexicons
B. SIMILARITY APPROACH
In this approach we used two existing resources: the ISEAR corpus (Scherer and Wallbott, 1997), consisting of phrases in which people describe a situation when they felt a certain emotion, and EmotiBlog (Boldrini et al., 2009), a corpus of blog posts annotated at different levels of granularity (words, phrases, sentences etc.) according to the polarity of the sentiments and the emotion expressed.
In the first approach, we computed the similarity of the individual quotes with the sentences belonging to each of the emotions in the ISEAR corpus, using Pedersen's Similarity Package, based on the Lesk similarity. Subsequently, we classified
each of the quotes based on the highest-scoring category of emotion. Table 4.13
presents the results:
Class      Joy   Fear   Anger   Shame   Disgust   Guilt   Sadness
Positive   8     7      1       3       3         5       8
Negative   6     7      1       5       8         2       4
Table 4.13: Results of the classification using the similarity scores with the ISEAR corpus
We consider as positive the examples which fell into the joy category and
classify as negative the quotes which were labeled otherwise. The results are
presented in Table 4.14:
Ppos   Pneg   Rpos   Rneg   Accuracy
0.22   0.82   0.58   0.5    0.514
Table 4.14: Results of the positive versus negative classification using the similarity score with the ISEAR corpus
EmotiBlog represents an annotation schema for opinion in blogs and the
annotated corpus of blog posts that resulted when applying the schema. The results
of the labeling were used to create a training model for an SVM classifier that will
subsequently be used for the classification of opinion sentences. The features
considered are the number of n-grams (n ranging from 1 to 4) and similarity scores
with positive and negative annotated phrases, computed with Pedersen's Similarity Package. The approach was previously described in (Balahur et al., 2009b). The evaluation results are presented in Table 4.15:
24 http://www.d.umn.edu/~tpederse/text-similarity.html
25 http://kobesearch.cpan.org/htdocs/WordNet-similarity/WordNet/Similarity/lesk.htm
Class      Precision   Recall   F-measure
Positive   0.667       0.219    0.33
Negative   0.533       0.89     0.667
Table 4.15: Results of the classification using SVM on the EmotiBlog corpus model
From the results obtained, we can infer that the use of some of the resources
leads to better performance when classifying positive or negative quotes
(SentiWordNet versus WordNet Affect), and that the combined resources produce
the best results when a vocabulary-based approach is used. Another conclusion is
that previous subjectivity filtering indeed improves the results.
The chain could theoretically be continued: For instance, if the newspaper were
a known defender of person A (and the corresponding) political party and attitudes,
the whole statement (3) could be interpreted as sarcasm, inverting the negative
sentiment of Person B towards Politician A, and so on. While this is clearly a
constructed example, our low inter-annotator agreement and the clarifying
discussions showed that our initial sentiment annotation instructions were underspecified and left too much leeway for interpretation.
(4) Time called on the War on Drugs?
These are real examples of texts, where we can further notice difficulties. In this
case, the journalist mocks the idea of delaying taking an action against drugs. Or the
following example:
(5) Argentina and Mexico have taken significant steps towards decriminalising
drugs amid a growing Latin American backlash against the US-sponsored war on
drugs.
In this context, US-sponsored is the key expression towards understanding the
negative opinion on the war on drugs.
For these reasons, we re-defined our task and subsequently annotated the whole set of 1592 quotations, after which the inter-annotator agreement was 81%.
PREVIOUS DEFINITIONS
In order to redefine the task, we first start by looking into the definitions that were
given until this point.
Subjectivity analysis is defined by Wiebe (1994) as the linguistic expression of somebody's opinions, sentiments, emotions, evaluations, beliefs and speculations. In her definition, the author was inspired by the work of the linguist Ann Banfield (Banfield, 1982), who defines as subjective the sentences that take a character's point of view (Uspensky, 1973) and that present private states (Quirk, 1985) (i.e. states that are not open to objective observation or verification) of an experiencer, holding an
attitude, optionally towards an object. Subjectivity is opposed to objectivity, which
is the expression of facts. As Kim and Hovy (2004) notice, opinion is subjective,
but may not imply a sentiment. But what about our example of war on drugs?
Can facts express opinions? Is there a difference between interpretations of facts at
sentiment level and the direct expression of sentiments? Should we take them into
consideration? Therefore, in our context, this definition did not help.
Esuli and Sebastiani (2006) define opinion mining as a recent discipline at the
crossroads of information retrieval and computational linguistics which is
84
concerned not with the topic a document is about, but with the opinion it
expresses. This is a very broad definition, which targets opinions expressed at a
document level. As we have shown before, news articles contain mentions of
different persons and events, the topic in itself might involve a negative tonality and
both the author of the text, as well as the facts presented or the interpretation they
are given by the reader may lead to a different categorization of the document. So,
this definition is not specific enough for us to understand what we should be
looking for when annotating pieces of newspaper articles.
Dave et al. (2003) define an opinion mining system as one that is able to
"process a set of search results for a given item, generating a list of product
attributes (quality, features, etc.) and aggregating opinions about each of them
(poor, mixed, good)". Opinion mining, in this context, therefore aims at extracting
and analyzing judgments on various aspects of given products.
A similar paradigm is given by Hu and Liu (2004), which the authors call
"feature-based opinion mining". It is, however, not clear how statements such as "It
broke in two days" or "The night photos are blurry", which are actually factual
information (according to the definition of subjectivity, they are verifiable), could
and should be annotated. Do they fall outside the goal of opinion mining? Since in
our context persons, organizations or events have no definable or inferable lists of
features, this definition of the task does not work for us either.
Kim and Hovy (2005) define an opinion as a quadruple (Topic, Holder, Claim,
Sentiment), in which the Holder believes a Claim about the Topic, and in many
cases associates a Sentiment, such as good or bad, with the belief. The authors
distinguish among opinions with sentiment and opinions without sentiment, and
between directly and indirectly expressed opinions with sentiment. In this context,
it remains unclear how an example such as "Local authorities have provided no
help for the victims of the accident." should be interpreted and why. Some might
even argue that a statement they claim to be opinionated but with no sentiment -
"Gap is likely to go bankrupt" (which would probably be interesting when
assessing favorability in markets) - has a sentiment, and that this sentiment is
negative.
In the SemEval 2007 Task No. 14, Affective Text (Mihalcea and Strapparava,
2007), the systems were supposed to classify 1000 newspaper titles according to
their valence and the emotion contained. A title such as "Scientists proved that men's
perspiration raises women's hormone levels" or "100 killed in bomb attack" was
classified as negative. However, this is factual, verifiable information. Does this
mean that when capturing the media sentiment, we should consider these results as
being negative? Do these statements refer to a fact, and are we interested in the
information conveyed or in the sought effect? If so, which of these aspects would
we include in a system doing sentiment analysis from newspaper articles?
             Quotes   Agreed quotes   Agreed negative quotes   Agreed positive quotes   Agreed objective quotes
Number       1592     1292            234                      193                      865
Agreement             81%             78%                      78%                      83%

Table 4.16: Results of the annotations in terms of agreement per class of sentiment
The result of the annotation guidelines and labeling process was a corpus in
which we agreed on what sentiment was and was not in our case. The number of
agreed sentiment-containing quotes was one third of the total number of agreed
quotes, showing that only clear, expressly stated opinion, which required no
subjective interpretation on the annotators' part, was labeled as such.
The result of our labeling showed that in the case of newspapers, it is mandatory
to distinguish between three different components: the author, the reader and the
text itself (Figure 4.9).
[Figure 4.9: The three components involved in news sentiment analysis - the author, the text (conveying facts, attitude and sentiment) and the readers 1..N, each with their own interpretation 1..N.]
omissions, debate limitations, story framing, selection and use of sources of quotes
and the quote boundaries, for example, conveys a certain sentiment or not. The
sentiment content of the text, finally, is what is expressly stated, and not what is left
to be understood between the lines. Although pragmatics, through speech-act or
other theories, would argue that there is no text without an intended meaning, the
sentiment or factual information conveyed differs from reader to reader and can
thus not be assessed at a general level, as sentiment analysis intends to do. For example,
the text "The results of the match between Juventus Torino and Real Madrid last
night are 3-0." would perhaps be interpreted as something positive, a motive for
pride, in an Italian newspaper; it would be a negative, sad thing if reported by a
Spanish source; it would be bad or good depending on whether an interested
reader were for or against the two teams; and it would constitute just factual news
from the strict point of view of the text. Given these three views one must be aware
of when constructing a sentiment analysis system for news, we can see that
the task becomes much clearer and the agreement at the time of annotating texts,
implementing and evaluating systems is higher.
Should one want to discover the possible interpretations of texts, the sources' and
readers' profiles must be defined and taken into consideration, for a complete
understanding of the possible sentiment effects a text has or is intended to have,
and not just a general, often misunderstood one.
At this moment, having the tasks clearly defined, we have started experimenting
with adequate methods to perform sentiment analysis considering these insights.
called category words). Given that we are faced with the task of classifying opinion
in a general context, we employed a simple, yet efficient approach, presented in
(Balahur et al., 2009f).
At the present moment, there are different lexicons for affect detection and
opinion mining. In order to have a more extensive database of affect-related terms,
in the following experiments we used WordNet Affect (Strapparava and Valitutti,
2004), SentiWordNet (Esuli and Sebastiani, 2006) and MicroWNOp (Cerini et al.,
2007). Additionally, we used an in-house built resource of opinion words with
associated polarity, which we denote by JRC Tonality. Each of the employed
resources was mapped to four categories, which were given different scores:
positive (1), negative (-1), high positive (4) and high negative (-4). The score of
each of the quotes was computed as the sum of the values of the words identified
around the mentions of the entity that was the target of the quote, either directly
(using the name) or by its title (e.g. Gordon Brown can be referred to as "Gordon",
as "Brown" or as "the British prime minister")26. The experiments conducted used
different windows around the mentions of the target, computing a score of the
opinion words identified and eliminating the words that were at the same time
opinion words and category words (e.g. "crisis", "disaster").
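What follows is a minimal, illustrative sketch of this window-based scoring procedure, not the actual implementation: the lexicon and category-word sets are toy stand-ins for the resources described above, and the function and variable names are ours.

# Minimal sketch (not the original implementation) of the window-based scoring
# described above. The lexicon maps words to the four values used in the
# experiments: positive (1), negative (-1), high positive (4), high negative (-4).
def score_quote(tokens, target_mentions, lexicon, category_words, window=6,
                remove_category_words=True):
    """Sum the lexicon values of words found within +/- window tokens of any
    mention of the target entity; return a polarity label."""
    score = 0
    for i, tok in enumerate(tokens):
        if tok.lower() not in target_mentions:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for word in tokens[lo:i] + tokens[i + 1:hi]:
            word = word.lower()
            if remove_category_words and word in category_words:
                continue  # e.g. "crisis", "disaster" used as category/alert words
            score += lexicon.get(word, 0)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Toy example (illustrative lexicon values only):
lexicon = {"good": 1, "disastrous": -4, "failed": -1}
category_words = {"crisis", "disaster"}
tokens = "critics say Brown has failed to handle the crisis".split()
print(score_quote(tokens, {"brown"}, lexicon, category_words, window=6))  # negative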
EVALUATION RESULTS
Table 4.17 presents an overview of the results obtained using different window
sizes and eliminating or not the category words in terms of accuracy (number of
quotes that the system correctly classified as positive, negative or neutral, divided
by the total number of quotes).
As can be seen, the different lexicons available performed very differently, and
the impact of eliminating the alert words was significant for some resources and
null for others, i.e. in those cases where no category words coincided with words
in the respective lexicon.
26 For the full details on how the names and corresponding titles are obtained, please see (Pouliquen and Steinberger, 2009).
Word window                    Alerts       JRC Tonality   MicroWNOp   WNAffect   SentiWN
Whole text                     W Alerts     0.47           0.54        0.21       0.25
Whole text                     W/O Alerts   0.44           0.53        0.2        0.2
3                              W Alerts     0.51           0.53        0.24       0.25
3                              W/O Alerts   0.5            0.5         0.23       0.23
6                              W Alerts     0.63           0.65        0.2        0.23
6                              W/O Alerts   0.58           0.6         0.18       0.15
6 (JRC Tonality + MicroWNOp)   W Alerts     0.82           -           0.2        0.23
6 (JRC Tonality + MicroWNOp)   W/O Alerts   0.79           -           0.18       0.15
10                             W Alerts     0.61           0.64        0.22       0.2
10                             W/O Alerts   0.56           0.64        0.15       0.11

Table 4.17: Accuracy obtained using different lexicons, window sizes and alerts (the second window of 6 corresponds to the run combining the JRC Tonality and MicroWNOp lexicons)
As we can see, computing sentiment around the mentions of the entity in smaller
window sizes performs better than computing the overall sentiment of the texts in
which the entities are mentioned. From our experiments, we could notice that some
resources have a tendency to over-classify quotes as negative (WordNet Affect) and
some have the tendency to over-classify quotes as positive (SentiWordNet). We
performed evaluations using combinations of these four lexicons. The best result we
obtained was with the combination of JRC Tonality and MicroWNOp, on a window
of 6 words; in this case, the accuracy was 82%. As we can see, the majority of the
resources used did not pass the baseline (61%), which shows that large lexicons do
not necessarily translate into an increase in the performance of the systems using them.
ERROR ANALYSIS
Subsequent to the evaluation, we performed an analysis of the cases where
the system fails to correctly classify the sentiment of the phrase or incorrectly
classifies it as neutral.
The largest percentage of failures is represented by quotes which are erroneously
classified as neutral, because no sentiment words are present to account for the
opinion in an explicit manner (e.g. "We have given X enough time", "He was the
one behind all these atomic policies", "These revelations provide, at the very least,
evidence that X has been doing favours for friends", "We have video evidence that
activists of the X are giving out food products to voters") or because idiomatic
expressions are used to express sentiment (e.g. "They have stirred the hornet's nest").
Errors in misclassifying sentences as positive instead of negative or vice-versa
were caused by the use of irony (e.g. X seemed to offer a lot of warm words, but
We have shown that this simple approach produces good results when the task is
clearly defined. The data is available for public use at:
http://langtech.jrc.ec.europa.eu/Resources/2010_JRC_1590-Quotes-annotatedfor-sentiment.zip
Subsequent to these annotation efforts, we continued to work on the
creation of corpora for sentiment analysis in other languages. We have created a
corpus of quotations extracted from newspaper articles in German containing 2387
quotes, based on the same annotation criteria. This resource is also publicly
available upon request to the authors.
persons etc.), whose answers were to be found in a set of blogs. In this scenario,
a general opinion mining system must deal with many topics, ranging from
products to brands, companies, public figures, news topics, etc., which may not be
directly stated in the text as such. Therefore, when pursuing the goal of classifying
opinions, one must first of all have a base system that is able to detect negative and
positive opinion. To this aim, we propose a general opinion mining system (Balahur
et al., 2009e), which is able to track the sentiment expressed on different topics
mentioned in political debates. In order to further generalize the nature of the texts
considered, we have chosen to test our initial methods for general opinion mining
on a corpus of political debates. Taking into consideration the corpus we have at
hand, we study the manner in which opinion can be classified along dialogues,
depending on the intervening speakers. We evaluate two methods to aggregate
opinion scores in order to make a unique classification of the opinions provided by a
given speaker. While this type of classification was previously done by Thomas et
al. (2006), their approach depended on prior training on the same kind
of data; our approach is data-independent.
Last, but not least, we study the possibility of determining the source of the
opinion expressed, taking into consideration its polarity and the affect words used in
expressing the arguments. Since the corpus is already annotated with the party the
speaker belongs to, we perform this classification between the two parties
represented - Democrat and Republican. While this type of classification was
previously approached by Mullen and Malouf (2006), the authors took into
consideration the general vocabulary used, and not the attitude towards the topic per
se and the vocabulary related to it.
4.3.2. BACKGROUND
Although in the State-of-the-art chapter we have specified the directions of research
in sentiment analysis in general, we will briefly comment on work that is related to
this particular effort. Related work includes document-level sentiment analysis and
opinion classification in political texts.
Regarding sentiment analysis at the document level, relevant research was done
by Turney et al. (2002), who first select important sentences based on pre-specified
part-of-speech patterns, then compute the semantic orientation of adjectives and
subsequently sum up this orientation to determine the document polarity.
Other related research was done by Pang et al. (2002), who employ machine
learning techniques to determine the overall sentiment in user reviews. Additionally,
Dave et al. (2003) propose classifying opinion on products based on individual
opinions on product parts. Gamon (2004) studies the problems involved in
machine learning approaches and the role of linguistic analysis for sentiment
classification in customer feedback data. Matsumoto et al. (2005) research on the
4.3.3. EXPERIMENTS
The corpus we use in our experiments is made up of congressional floor debates
and was compiled by Thomas et al. (2006). The corpus is available for download
and research27. It is split into three sets: the development set, the training set and
the test set. The first one contains 702 documents (one document corresponds to
one speech segment) pertaining to the discussion of 5 distinct debate topics, the
training set contains 5660 documents organized on 38 discussion topics and the test
set contains 1759 documents belonging to 10 debates. The corpus contains three
versions of these three sets, the difference consisting in the removal of certain
clues relating to the topic and the speaker referred to in the speech.
27 http://www.cs.cornell.edu/home/llee/data/convote.html
given by Ted Pedersen's Statistics Package29 with the affect, opinion and attitude
lexicons.
The affect lexicon consisted of three different sources: WordNet Affect (with 6
categories of emotion: joy, surprise, anger, fear, sadness, disgust), the ISEAR
corpus (Scherer and Walbott, 1997), which contains the 7 categories of emotion
anger, disgust, fear, guilt, joy, sadness and shame, and from which stopwords are
eliminated, and the emotion triggers database (Balahur and Montoyo, 2008;
Balahur and Montoyo, 2008c; Balahur and Montoyo, 2008e), which contains terms
related to human needs and motivations annotated with the 6 emotion categories of
WordNet Affect.
The opinion lexicon contained words expressing positive and negative values
(such as "good", "bad", "great", "impressive" etc.) obtained from the opinion
mining corpus in (Balahur and Montoyo), to which their corresponding nouns,
verbs and adverbs were added using Roget's Thesaurus.
Finally, the attitude lexicon contains the categories of "accept", "approval",
"confidence", "importance", "competence", "correctness", "justice", "power",
"support", "truth" and "trust", with their corresponding antonymic categories
"criticism", "opposition", "uncertainty", "doubt", "unimportance", "incompetence",
"injustice", "objection", "refusal" and "incorrectness".
After obtaining the similarity scores, we summed up the scores pertaining to the
positive categories of emotion, opinion and attitude and those pertaining to the
negative categories, respectively. The general positive score was thus computed as
the sum of the individual similarity scores for the categories of "joy" and "surprise"
from the affect lexicon, the positive values of the opinion lexicon, and the "accept",
"approval", "competence", "confidence", "correctness", "justice", "power",
"support", "trust" and "truth" categories. The general negative score, on the other
hand, was computed as the sum of the "anger", "fear", "sadness" and "shame" scores
from the affect lexicon, the negative values of the opinion lexicon, and the "criticism",
"opposition", "uncertainty", "doubt", "unimportance", "incompetence", "injustice",
"objection", "refusal" and "incorrectness" categories of the attitude lexicon. The
first classification between negative and positive speaker segments was done by
comparing these two resulting scores and selecting the higher of the two as the final
polarity value. We evaluated this approach on the development, training and test
sets (Classification 1).
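The rule-based comparison of the aggregated scores (Classification 1) can be sketched as follows, assuming the per-category similarity scores for a speech segment have already been computed; the category names follow the lists above and the score values are invented for illustration.

# Sketch of the first (rule-based) classification step, assuming the per-category
# similarity scores for a speech segment are already available as a dictionary.
POSITIVE_CATEGORIES = {"joy", "surprise", "opinion_positive", "accept", "approval",
                       "competence", "confidence", "correctness", "justice", "power",
                       "support", "trust", "truth"}
NEGATIVE_CATEGORIES = {"anger", "fear", "sadness", "shame", "opinion_negative",
                       "criticism", "opposition", "uncertainty", "doubt",
                       "unimportance", "incompetence", "injustice", "objection",
                       "refusal", "incorrectness"}

def classify_segment(category_scores):
    """Compare the summed positive and negative category scores (Classification 1)."""
    pos = sum(v for c, v in category_scores.items() if c in POSITIVE_CATEGORIES)
    neg = sum(v for c, v in category_scores.items() if c in NEGATIVE_CATEGORIES)
    return "positive" if pos >= neg else "negative"

# Illustrative scores for one speech segment:
print(classify_segment({"support": 0.42, "trust": 0.10, "criticism": 0.31, "fear": 0.05}))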
On the other hand, we employed the scores obtained for each of the emotion,
opinion and attitude categories, as well as the combined scores used for classifying
in the first step, as features for training an SVM classifier on the development and
training sets. We then tested this approach only on the test set. Due to the fact that
the affect lexicon contains more negative emotions (4) than positive ones (only 2),
we chose for classification only the two strongest emotions according to the
similarity scores found in the first approach. Those two categories were fear and
anger. The results are presented under Classification 2.
29 http://www.d.umn.edu/~tpederse/text-similarity.html
Further, we parsed the speaker segments using Minipar30, in order to determine
possible dependency paths between the words pertaining to our affect, opinion or
attitude lexicons and the topic under discussion or the mention of another speaker.
Our guess was that many of the speech segments that had been classified as
negative, although the ground-truth annotation assigned them a positive value,
did contain a negative opinion - not on the topic under discussion, but on the
opinion expressed by one of the previous speakers. Therefore, the goal of
our approach was to see whether the false negatives were due to the classification
method or to the fact that the object on which the opinion was given was not
the one we had in mind when classifying. In order to verify our hypothesis, we
extracted the opinion words with similarity higher than 0 from the files in which
they appeared and sought dependency relations between those words and the
mention of a speaker (based on the number assigned) or the words describing
the topic under discussion (marked in the files in which these names appear): the
words "bill", "legislation", "amendment" and "measure". Affect, opinion or attitude
words for which no relation was found to the mentioned topic or to a speaker were
discarded. In this approach, we did not use anaphora resolution, although, theoretically,
it could help improve the results obtained. It would be interesting to study the effect of
applying anaphora resolution on this task.
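A sketch of this dependency-based filtering is given below. It assumes the Minipar analysis has already been converted into a list of (head, dependent) index pairs; the topic-word list follows the description above, while the breadth-first search, the max_hops cut-off and the example edges are simplifications of ours.

from collections import defaultdict, deque

# Sketch of the dependency-based filtering step. The parse is assumed to be available
# as undirected (head, dependent) edges over token indices.
TOPIC_WORDS = {"bill", "legislation", "amendment", "measure"}

def connected_to_topic(tokens, edges, opinion_idx, max_hops=4):
    """Return True if the opinion word at opinion_idx has a dependency path
    (up to max_hops edges) to a topic word."""
    graph = defaultdict(set)
    for head, dep in edges:
        graph[head].add(dep)
        graph[dep].add(head)
    queue, seen = deque([(opinion_idx, 0)]), {opinion_idx}
    while queue:
        node, dist = queue.popleft()
        if node != opinion_idx and tokens[node].lower() in TOPIC_WORDS:
            return True
        if dist < max_hops:
            for nxt in graph[node] - seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return False

# "i rise in strong support of h.r. 3283 ... the bill" with simplified dependency edges:
tokens = ["i", "rise", "in", "strong", "support", "of", "h.r.", "3283", "the", "bill"]
edges = [(1, 0), (1, 2), (2, 4), (4, 3), (4, 5), (5, 6), (6, 7), (6, 9)]
print(connected_to_topic(tokens, edges, opinion_idx=4))  # True: "support" links to "bill"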
The results of the classification are summed up under Classification 3. Figure
4.10 presents an example of the dependency analysis for one of the sentences in
which an attitude word was identified. It can be seen that the word "support",
pertaining to the attitude category, has a dependency path towards the name of the
bill under discussion, "h.r. 3283". Figure 4.11 shows a schematic overview of the
first approach, with the resources, tools and methods employed therein.
30 http://www.cs.ualberta.ca/~lindek/minipar.htm
[Minipar dependency output for the sentence "i rise in strong support of h.r. 3283, the united states trade rights enforcement act.", showing, among other relations, the path linking the attitude word "support" to the bill name "h.r. 3283".]
Figure 4.10: Minipar output for a sentence in topic 421 on act h.r. 3283
[Figure 4.11: WordNet Affect, the emotion triggers, the ISEAR corpus and the opinion and attitude lexicons provide similarity scores over the individual speech segments (Classification 1); the scores are fed to an SVM classifier (Classification 2); Minipar dependency parsing yields Classification 3.]
Figure 4.11: Resources and tools scheme for the first approach
The second approach consisted in aggregating the individual speaker
segments on the same debate topic into single documents, which we denote as speaker
interventions. We then performed, on the one hand, a classification of these
interventions using the sum of the scores obtained for the individual speech
segments and, on the other hand, a classification based on the highest score in each
of the categories. Finally, we employed SVM to classify the speaker interventions
using the aggregated scores of the individual text segments and the highest scores of
the individual speaker segments, respectively. The training was performed on the
development and training sets and the classifications (Classification 4 and
Classification 5, respectively) were evaluated on the test set.
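The aggregation and SVM-based classification of speaker interventions (Classifications 4 and 5) can be sketched as below. scikit-learn's SVC is used here only as a stand-in for whatever SVM implementation was actually employed, and the score vectors and labels are toy values.

import numpy as np
from sklearn.svm import SVC

# Sketch of the speaker-intervention classification, assuming each speech segment
# is already represented by a vector of category scores.
def aggregate_intervention(segment_vectors, mode="sum"):
    """Aggregate the segments of one speaker on one topic into a single feature
    vector, either by summing the scores or by taking the per-category maximum."""
    m = np.vstack(segment_vectors)
    return m.sum(axis=0) if mode == "sum" else m.max(axis=0)

# Toy data: two interventions, each made of a few segment score vectors, labelled
# "y" (supports the bill) / "n" (opposes it).
train_X = [aggregate_intervention([[0.4, 0.1], [0.3, 0.0]], "sum"),
           aggregate_intervention([[0.0, 0.5], [0.1, 0.4]], "sum")]
train_y = ["y", "n"]
clf = SVC(kernel="linear").fit(train_X, train_y)
test_vec = aggregate_intervention([[0.2, 0.05], [0.25, 0.1]], "sum")
print(clf.predict([test_vec]))  # expected: ['y']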
SOURCE CLASSIFICATION
The second experiment we performed was classifying the source of opinions
expressed. In the following experiments, we used the fact that the corpus contained
the name of the party the speaker belonged to coded in the filenames. The goal was
to see whether or not we are able to determine the party a speaker belongs to, by
taking into consideration the words used to express opinion on a given subject, the
arguments (the words used within the argumentation) and the attitude on the subject
in question.
Our hypothesis was that, for example, parties in favor of a certain piece of
legislation will use both a set of words that speak positively of the matter and a set
of arguments related to the topic that highlight its positive side.
In order to perform this task, we clustered the words pertaining to the affect,
opinion and attitude lexicons, as well as the most frequent words appearing in the
individual speaker segments of persons belonging to each of the two parties -
Democrat and Republican.
As mentioned by Mullen and Malouf (2006), two problems arise when trying to
classify affiliation with a political party in a topic debate. The first is the fact that,
when talking about a certain topic, all or most persons participating in the debate
will use the same vocabulary. The second issue is that a certain attitude on a topic
cannot reliably predict the attitude on another topic. Regarding the first problem, we
verify whether or not the attitude towards a topic can be discriminated on the basis
of the arguments given in support of or against that topic, together with the affect,
opinion and attitude lexicons used in connection with the arguments. As far as the
second issue is concerned, we do not aim to classify depending on the topic, but
rather to predict, on the basis of the arguments and affective words used, the party
the speaker belongs to.
4.3.4. EVALUATION
We evaluated the approaches described in terms of precision and recall.
In order to exemplify the manner in which we calculated these scores, we
present confusion matrices for all individual speech segments pertaining to the 5
topics in the development set.
The "yes" category includes the individual speech segments whose ground truth
was "yes". The "no" category includes the individual speech segments whose
ground truth was "no".
The "positive" category includes the individual speech segments that the system
classified as positive. The "negative" category includes the individual speech
segments the system classified as negative.
            yes   no
positive     30    7
negative     22   16

Table 4.18: Confusion matrix for topic 199 from the development set
            yes   no
positive     45    2
negative     14   26

Table 4.19: Confusion matrix for topic 553 from the development set
            yes   no
positive     28    3
negative     15   29

Table 4.20: Confusion matrix for topic 421 from the development set
            yes   no
positive     44    3
negative     26   58

Table 4.21: Confusion matrix for topic 493 from the development set
            yes   no
positive     35    2
negative     24   26

Table 4.22: Confusion matrix for topic 052 from the development set
The following table shows the confusion matrix for the source classification, trained
on the development set and tested using a sample of 100 documents from the test
set, equally distributed among the Democrat and Republican Party.
                D    R
Classified D   29   21
Classified R   10   40

Table 4.23: Results for source classification (100 documents)
We compute the accuracy score as the sum of the number of correct positive
classifications and the number of correct negative classifications, divided by the
total number of documents on a topic.
A(199) = 46/75 = 0.62; A(553) = 0.81; A(421) = 0.76;
A(493) = 0.77; A (052) = 0.70
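The accuracy computation above can be reproduced directly from the confusion matrices; a small sketch for topic 421 (Table 4.20):

# Sketch of the accuracy computation used above, applied to the confusion matrix
# of topic 421 (Table 4.20): rows are the system classes, columns the ground truth.
def accuracy(tp, fp, fn, tn):
    """(correct positive + correct negative) / total documents on the topic."""
    return (tp + tn) / (tp + fp + fn + tn)

# Topic 421: positive row (28 yes, 3 no), negative row (15 yes, 29 no).
print(round(accuracy(tp=28, fp=3, fn=15, tn=29), 2))  # 0.76, matching A(421)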
possibility to annotate the style of writing, the language employed and the
structure of writing and linking (e.g. misspellings, ungrammaticality,
shortening of words and/or repetition of letters and punctuation signs, the
use of colloquial expressions, urban acronyms, smileys).
The annotation scheme that allows for all these elements to be labeled is called
EmotiBlog and it was proposed by Boldrini et al. (2009). The corpus that was
annotated with this scheme contains blog posts in three languages: Spanish, Italian,
and English about three subjects of interest, which in part overlap with the topics
annotated in the MPQA corpus. The aim in choosing similar topics is that we can
subsequently compare the results of the systems depending on the type of texts
considered.
The first one contains blog posts commenting upon the signing of the Kyoto
Protocol against global warming, the second collection consists of blog entries
about the Mugabe government in Zimbabwe, and finally we selected a series of
blog posts discussing issues related to the 2008 USA presidential elections. For
each of the abovementioned topics, we manually selected 100 blog posts,
summing up approximately 30,000 words for each language.
Element                  Description
                         Confidence, comment, source, target.
                         Confidence, comment, level, emotion, phenomenon, polarity, source and target.
                         Confidence, comment, level, emotion, phenomenon, modifier/not, polarity, source and target.
                         Confidence, comment, level, emotion, phenomenon, modifier/not, polarity, source and target.
                         Confidence, comment, level, emotion, phenomenon, polarity, mode, source and target.
                         Confidence, comment, type, source and target.
                         Confidence, comment, level, emotion, phenomenon, modifier/not, polarity, source and target.
                         Confidence, comment, level, emotion, phenomenon, modifier/not, polarity, source and target.
                         Confidence, comment, level, emotion, phenomenon, modifier/not, polarity, and source.
Reader Interpretation    Confidence, comment, level, emotion, phenomenon, polarity, source and target.
Author Interpretation    Confidence, comment, level, emotion, phenomenon, polarity, source and target.
Emotions                 Confidence, comment; accept, anger, anticipation, anxiety, appreciation, bad, bewilderment, comfort, compassion, confidence, consternation, correct, criticism, disappointment, discomfort, disgust, despondency, depression, envy, enmity, excuse, force, fear, grief, guilt, greed, hatred, hope, irony, interesting, important.

Table 4.29: EmotiBlog structure
For each element we label, the annotator has to insert his/her level of
confidence. In this way, each label is assigned a weight that will be taken into
account in future evaluations. Moreover, the annotator has to insert the polarity,
which can be positive or negative, the level (high, medium, and low) and also the
emotion this element is expressing. The phenomenon level describes whether the
element is a saying, a colloquialism or a multi-word phrase.
As suggested by Balahur and Steinberger (2009), even if the writer uses an
apparently objective formulation, he/she intends to transmit an emotion and a
sentiment. For this reason we added two elements: reader and author interpretation.
The first one is the impression/feeling/reaction the reader has when reading the
intervention and what s/he can deduce from the piece of text, and the author
interpretation is what we can understand about the author (political orientation,
preferences). Another innovative element we inserted in the model is co-reference,
but just at a cross-post level. It is necessary because blogs are composed of posts
linked to each other, and thus cross-document co-reference can help the reader
follow the conversations. We also label the unusual usage of capital letters
and repeated punctuation. In fact, it is very common in blogs to find words written
in capital letters or with unconventional usage of punctuation; these features usually
signal shouting or a particular mood of the writer. Using EmotiBlog, we annotate the
single elements, but we also mark sayings or collocations representative of each
language. Finally, we insert for each element the source and topic.
http://www.cs.waikato.ac.nz/ml/weka/

Test corpus   Evaluation type   Precision   Recall
              Polarity          32.13       36.00
              Intensity         36.4        38.7
              Polarity          54.09       53.2
              Intensity         51.00       57.81
SemEval I     Polarity          38.57       51.3
SemEval I     Intensity         37.39       50.9
SemEval II    Polarity          35.8        58.68
SemEval II    Intensity         32.3        50.4

Table 4.30: Results for polarity and intensity classification using models built on EmotiBlog
The results presented in Table 4.30 show a significant improvement over the
results obtained in the SemEval task in 2007. This is explainable, on the one
hand, by the fact that the systems participating in that task did not have at their
disposal the lexical resources for opinion employed in the EmotiBlog II model, and,
on the other hand, by the fact that they did not use machine learning on a corpus
comparable to EmotiBlog (as can be seen from the results obtained when using solely
the EmotiBlog I corpus). Compared to the NTCIR 8 Multilingual Opinion Analysis
Task this year, we obtained significant improvements in precision, with a recall that
is comparable to most of the participating systems.
In the second experiment, we tested the performance of emotion classification
using the two models built with EmotiBlog on three corpora: the JRC quotes, the
SemEval 2007 Task No. 14 test set and the ISEAR corpus. The JRC quotes are
labeled using EmotiBlog; however, the other two are labeled with a small set of
emotions - 6 in the case of the SemEval data (joy, surprise, anger, fear, sadness,
disgust) and 7 in ISEAR (joy, sadness, anger, fear, guilt, shame, disgust). Moreover,
the SemEval data contains more than one emotion per title in the Gold Standard;
therefore we consider as correct any classification containing one of them.
In order to unify the results and obtain comparable evaluations, we assessed the
performance of the system using the alternative dimensional structures defined by
Boldrini et al. (2009). The ones not overlapping with the category of any of the 8
different emotions in SemEval and ISEAR are considered as "Other" and are
included neither in the training nor in the test set. The results of the evaluation are
presented in Table 4.31. Again, the values I and II correspond to the models
EmotiBlog I and II. The Emotions category contains the following emotions: joy,
sadness, anger, fear, guilt, shame, disgust, surprise.
Test corpus     Evaluation type   Precision   Recall
JRC quotes I    Emotions          24.7        15.08
JRC quotes II   Emotions          33.65       18.98
SemEval I       Emotions          29.03       18.89
SemEval II      Emotions          32.98       18.45
ISEAR I         Emotions          22.31       15.01
ISEAR II        Emotions          25.62       17.83

Table 4.31: Results for emotion classification using the models built on EmotiBlog
The best results for emotion detection were obtained for the anger category,
where the precision was around 35 percent, for a recall of 19 percent. The worst
results were obtained for the ISEAR category of shame, where precision was
around 12 percent, with a recall of 15 percent. We believe this is due to the fact that
the latter emotion is a combination of more complex affective states and can easily
be misclassified as other categories of emotion. Moreover, from the analysis
performed on the errors, we realized that many of the affective phenomena
presented were more explicit in the case of texts expressing strong emotions such as
joy and anger, and were mostly related to common-sense interpretations of the
facts presented in the weaker ones.
As can be seen in Table 4.30, the texts pertaining to the news category obtain
better results, most of all the news titles. This is due to the fact that such texts,
although they contain few words, have a more direct and stronger emotional charge
than direct speech (which may be biased by the need to be diplomatic, find the best
suited words etc.). Finally, the error analysis showed that emotion directly reported
by the persons experiencing it is more "hidden", residing in the use of words
carrying special signification or related to general human experience. This fact
makes emotion detection in such texts a harder task. Nevertheless, the results on all
corpora are comparable, showing that the approach is robust enough to handle
different text types. All in all, the results obtained using the fine and coarse-grained
annotations in EmotiBlog increased the performance of emotion detection as
compared to the systems in the SemEval competition.
DISCUSSION ON THE OVERALL RESULTS
From the results obtained, we can see that this approach, combining the features
extracted from the EmotiBlog fine and coarse-grained annotations, helps to balance
the results obtained for precision and recall. The impact of using additional
resources that contain opinion words is that of increasing the recall of the system, at
the cost of a slight drop in precision, which shows that the approach is robust
enough for additional knowledge sources to be added. Although the corpus is
small, the results obtained show that the phenomena captured by the approach are
relevant to the opinion mining task, not only for the blogosphere, but also for other
types of text (newspaper articles, self-reported affect).
Another advantage of EmotiBlog is the fact that it contains texts in three languages:
English, Spanish and Italian. That is why, in the following experiments, we will test
the usability of the resource in a second language, namely, Spanish.
The main aim of this experiment, described by Balahur et al. (2009b), is to obtain
a system able to mine opinion from user input in real time and, according to the
inferred information on the polarity of the sentiment, offer the user corresponding
feedback. The topic of our opinion mining experiment is "recycling": the computer
asks about a person's opinion on recycling and the user answers this question,
generating a sentence with emotion that can be of different intensity levels. The
system reacts to the user input with messages or faces that correspond to the
reactions for the user's feedback.
For the task at hand, we employ annotations from texts on the Kyoto Protocol
pertaining to the EmotiBlog corpus. We use the annotated elements to train our
opinion mining system and then classify new sentences on the topic of recycling
(to which some vocabulary similarity can be found, since both topics refer to
environmental issues). For the task at hand, we manually created a set of 150
sentences on recycling, 50 for each of the positive, negative and neutral categories.
The first experiment carried out aimed at proving that the corpus is a valid resource
and that we can use the annotations for the training of our opinion mining system.
For this assessment, we use the same methodology we will further employ to mine
opinions from user input.
CROSS-FOLD EVALUATION OF THE ANNOTATION
As a result of the annotation, we obtained 1647 subjective phrases and 1336
objective ones. Our agreement was 0.59 for subjective phrases and 0.745 for the
objective ones.
Further on, we consider for our tests only the sentences upon which we agreed
and the phrases whose annotation length was above four tokens of the type noun,
verb, adverb or adjective. For the cross-validation of the corpus, each of the
sentences is POS-tagged and lemmatized using FreeLing32. Further on, we
represent each sentence as a feature vector, whose components are: unigram features
for the positive and, respectively, negative categories of nouns, verbs, adverbs,
adjectives, prepositions and punctuation signs (having 1 in the corresponding
position of the feature vector for the words contained and 0 otherwise); the number
of bigrams, trigrams and 4-grams overlapping with each of the phrases we have
annotated as positive, negative or objective, respectively; and, finally, the overall
similarity given by the number of overlapping words with each of the positive,
negative or objective phrases from the corpus, normalized by the length of the given
phrase. We test our method in two steps: first, the classification of sentences into
subjective and objective, for which the vectors contain as final values "subjective"
or "objective", and second, the classification of subjective sentences into positive
and negative, for which case the vectors contain as final values "positive" or "negative".
32 http://www.lsi.upc.edu/~nlp/freeling/
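A simplified sketch of this feature-vector construction is given below. It assumes the sentences have already been POS-tagged and lemmatized (FreeLing itself is not invoked), and the vocabulary and annotated phrases are toy stand-ins for the EmotiBlog data.

# Simplified sketch of the feature vectors described above.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_vector(tokens, vocab, annotated_phrases):
    """vocab: ordered list of lexicon words (positive/negative nouns, verbs, adverbs,
    adjectives...); annotated_phrases: {"positive": [...], "negative": [...], "objective": [...]}."""
    # 1/0 unigram features for the lexicon words
    vec = [1 if w in tokens else 0 for w in vocab]
    for label, phrases in annotated_phrases.items():
        # number of overlapping 2-, 3- and 4-grams with the annotated phrases of each class
        for n in (2, 3, 4):
            vec.append(sum(len(ngrams(tokens, n) & ngrams(p, n)) for p in phrases))
        # overall word overlap, normalized by the length of the given phrase
        vec.append(sum(len(set(tokens) & set(p)) / len(p) for p in phrases))
    return vec

# Toy example (illustrative data only):
vocab = ["good", "terrible", "support"]
phrases = {"positive": [["recycling", "is", "a", "good", "habit"]],
           "negative": [["recycling", "is", "a", "waste", "of", "time"]],
           "objective": [["the", "bin", "is", "green"]]}
print(build_vector(["i", "think", "recycling", "is", "good"], vocab, phrases))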
              Precision   Recall   Kappa
Subjective    0.977       0.619    0.409
Objective     0.44        0.95     0.409
Positive      0.881       0.769    0.88
Negative      0.92        0.96     0.88

Table 4.33: Classification results using n-grams
As we can notice from the results, using the annotated elements it is easier to
distinguish the subjective sentences, due to the fact that we train on subjective
n-grams. As far as the positive, negative and neutral classification is concerned, the
results are both high and balanced, supporting the validity of our approach.
CLASSIFICATION USING N-GRAMS, N>2.
In this experiment, we test the importance of annotating affect in texts at the token
level. From our blog corpus, we have a large number of nouns, verbs, adverbs and
adjectives annotated as positive or negative and at the emotion level. We used these
words when classifying examples using n-grams, with n ranging from 1 to 4
(in 5.2.1). To test their importance, we removed the vector components accounting
for their presence in the feature vectors and re-classified, both at the objective
versus subjective level and at the positive, negative, neutral level. The table below
shows the results obtained.
              Precision   Recall   Kappa
Subjective    0.93        0.60     0.43
Objective     0.43        0.7      0.43
Positive      0.83        0.64     0.85
Negative      0.90        0.91     0.85
Neutral       0.90        0.96     0.85

Table 4.34: Classification results using n-grams, n>2
As we can see, removing the single words with their associated polarities from the
training data resulted in lower scores. Therefore, fine-grained annotation helps when
training the opinion mining system and is well worth the effort.
peculiarities of this task and identified the weak points of existing research. We
proposed and evaluated different methods to overcome the identified challenges,
among which the most important were the discovery of indirectly mentioned
features and the computation of the polarity of opinions in a feature-dependent
manner. Subsequently, we proposed a unified model for sentiment annotation for
this type of text, able to capture the important phenomena that we had identified:
different types of sentiment expressions, feature mentioning and the span of text
expressing a specific opinion.
Further on, we explored different methods to tackle sentiment analysis from
newspaper articles. After the initial experiments, we analyzed the reasons for the
low performance obtained and redefined the task, taking into account the
peculiarities of this textual genre. We created an annotation model and labeled two
different corpora of newspaper article quotations, in English and German. After
redefining the task and delimiting the scope of the sentiment analysis process to
quotations - small text snippets containing direct speech, whose source and target
are previously known -, the annotation agreement rose significantly. Additionally,
improving the definition of the task made it possible to implement automatic
processing methods that are appropriate for the task and significantly improve the
performance of the sentiment analysis system we had designed. In view of
applying sentiment analysis to different types of texts, in which objective content is
highly mixed with subjective content and where the sources and targets of opinions
are multiple, we proposed different general methods for sentiment analysis, which
we applied to political debates.
The results of this latter experiment motivated us to analyze the requirements of
a general labeling scheme for the task of sentiment analysis, which can be used to
capture all relevant phenomena in sentiment expression.
To this aim, Boldrini et al. (2009) defined EmotiBlog, an annotation scheme that
is able to capture, at a fine-grained level, all linguistic phenomena related to
sentiment expression in text. The subsequent experiments have shown that this
model is appropriate for training machine learning models for the task of
sentiment analysis in different textual genres, in both languages in which
experiments have been carried out using it - English and Spanish.
In this chapter, we have only concentrated on the task of sentiment analysis as a
standalone challenge, omitting the steps required in order to obtain the texts on
which the sentiment analysis methods were applied. In a real scenario, however,
automatically detecting the opinion expressed in a text is not the first task to be
performed. Additionally, in many of the cases, the results obtained after
automatically processing texts to determine the sentiment they contain still pose
many problems in terms of volume. Thus, even if the sentiment is determined
5.1. INTRODUCTION
In the previous chapter, we presented different methods to perform sentiment
analysis from a variety of text types. We have shown what the challenges related to
each of these genres are and how the task of opinion mining can be tackled in each
of them.
Nevertheless, real-world applications of sentiment analysis often require more
than an opinion mining component. On the one hand, an application should allow a
user to query about opinions, in which case the documents in which these opinions
appear have to be retrieved. In more user-friendly applications, the users can be
given the option to formulate the query into a question in natural language.
Therefore, question answering techniques must be applied in order to determine
the information required by the user and subsequently retrieve and analyze it. On
the other hand, opinion mining offers mechanisms to automatically detect and
classify sentiments in texts, overcoming the issue given by the high volume of such
information present on the Internet. However, in many cases, even the result of the
opinion processing by an automatic system still contains large quantities of
information, which are difficult to deal with manually. For example, for
questions such as "Why do people like George Clooney?" we can find thousands of
answers on the Web. Therefore, finding the relevant opinions expressed on George
Clooney, classifying them and filtering only the positive opinions is not helpful
enough for the user. He/she will still have to sift through thousands of text
snippets, containing relevant, but also much redundant information. Moreover,
when following the comments on a topic posted on a blog, for example, finding the
arguments given in favor of and against the given topic might not be sufficient for a
real user. He/she might find the information truly useful only if it is structured and
contains no redundant pieces of information. Therefore, apart from analyzing the
opinion in text, a real-world application for sentiment analysis could also contain a
summarization component.
The aim of the work presented in this chapter is to apply the different opinion
mining resources, tools and approaches to other tasks within NLP. The objective
was, on the one hand, to evaluate the performance of our approaches, and, on the
other, to test the requirements and extra needs of an opinion mining system in the
context of larger applications. In this chapter, we present the research we carried
out in order to test the manner in which opinion mining can be best combined with
information retrieval (IR), question answering (QA) and summarization (SUM), in
order to create a useful, real-life, end-to-end system for opinion analysis in text. Our
initial efforts concentrated on applying already-existing IR, QA and SUM systems
in tandem with our sentiment analysis systems. Subsequently, having realized that
directly applying systems that were designed to deal with factual data in the context
of opinionated text led to low results, we proposed new methods to tackle IR, QA
and SUM in a manner that is appropriate in the context of subjective texts.
In this chapter, we present the methods and improvements achieved in Opinion
Question Answering and Opinion Summarization.
approach to a more general one, which is similar to the one used to determine
opinions from political debates (Balahur et al., 2009e).
The Opinion Summarization Pilot task consisted in generating summaries from
blog snippets that were the answers to a set of opinion questions. The participants
were given a set of blogs from the Blog06 collection and a set of "squishy list"
(opinion) questions from the Question Answering track, and had the task of producing
a summary of the blog snippets that answered these questions. There were 25
targets, and on each of them one or two questions were formulated. All these
questions concerned the attitude held by specified sources on the given targets. For
example, for the target George Clooney, the two questions asked were "Why do
people like George Clooney?" and "Why do people dislike George Clooney?".
Additionally, a set of text snippets was also provided, which contained the answers
to the questions. These snippets were selected from the answers given by systems
participating in the Question Answering track, and opinion summarization systems
could either use them or choose to perform the retrieval of the answers from the
corresponding blogs themselves.
Within our participation in the Opinion Summarization Pilot task, we used two
different methods for opinion mining and summarization. The two approaches
suggested were different only as far as the use of the optional text snippets provided
by the TAC organization was concerned. Our first approach (the Snippet-driven
Approach) used these snippets, whereas the second one (Blog-driven Approach)
found the answers directly in the corresponding blogs.
In the first phase, we processed the questions in order to determine a set of
attributes that would further help us find and filter the answers. The process is
described in Figure 5.1. In order to extract the topic and determine the question
polarity, we define question patterns. These patterns take into consideration the
interrogation formula and extract the opinion words (nouns, verbs, adverbs,
adjectives and their determiners). The opinion words are then classified in order to
determine the polarity of the question, using the WordNet Affect emotion lists, the
emotion triggers resource (Balahur and Montoyo, 2008), a list of four attitudes that
we built - containing the verbs, nouns, adjectives and adverbs for the categories of
criticism, support, admiration and rejection - and two categories of value words
(good and bad) taken from the opinion mining system proposed by Balahur and
Montoyo (2008c).
Examples of rules for the interrogation formula "What reasons" are:
1. What reason(s) (.*?) for (not) (affect verb + ing) (.*?)?
2. What reason(s) (.*?) for (lack of) (affect noun) (.*?)?
3. What reason(s) (.*?) for (affect adjective || positive || negative) opinions (.*?)?
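These patterns can be implemented, for instance, as regular expressions. The sketch below is only an illustration of the idea: the regular expression is a simplified translation of pattern 1 above, and the small gerund-to-polarity dictionary stands in for the affect, attitude and value-word lexicons actually used.

import re

# Illustrative stand-in for the affect/attitude/value lexicons used in the thesis.
AFFECT_VERBS = {"liking": "positive", "loving": "positive", "supporting": "positive",
                "disliking": "negative", "hating": "negative", "criticizing": "negative"}

# Simplified translation of pattern 1: What reason(s) (.*?) for (not) (affect verb + ing) (.*?)?
PATTERN = re.compile(r"^What reasons? (.*?) for (not )?(\w+ing) (.*?)\?$", re.IGNORECASE)

def analyse_question(question):
    m = PATTERN.match(question)
    if not m or m.group(3).lower() not in AFFECT_VERBS:
        return None
    polarity = AFFECT_VERBS[m.group(3).lower()]
    if m.group(2):  # "not" flips the expected polarity
        polarity = "negative" if polarity == "positive" else "positive"
    return {"affect_verb": m.group(3), "expected_polarity": polarity, "target": m.group(4)}

print(analyse_question("What reasons do people give for liking George Clooney?"))
print(analyse_question("What reasons do people give for not liking George Clooney?"))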
Using these extraction patterns, we identified the nouns, verbs, adjectives etc.
that gave an indication of the question polarity33. Further on, these indicators
were classified according to the affect lists mentioned above. The keywords of the
question are determined by eliminating the stop words. At the end of the question
processing stage, we obtain, on the one hand, the reformulation patterns (that are
eventually used to link and give coherence to the final summaries) and, on the other
hand, the question focus, keywords and polarity. Depending on the focus/topic and
polarity identified for each question, a decision on the further processing of the
snippet was made, using the following rules (a simplified sketch of these rules is
given after the list):
1. If there is only one question made on the topic, determining its polarity is
sufficient for making the correspondence between the question and the
snippets retrieved; the retrieved snippet must simply obey the criteria that it
has the same polarity as the question.
2. If there are two questions made on the topic and each of the questions has a
different polarity, the correspondence between the question and the answer
snippets can simply be done by classifying the snippets retrieved according
to their polarity.
3. If there are two questions that have different focus but different polarities,
the correspondence between the questions and the answer snippets is done
using the classification of the answer snippets according to focus and
polarity.
4. If there are two questions that have the same focus and the same polarity,
the correspondence between the questions and the answer snippets is done
using the order of appearance of the entities in focus, both in the question
and in the possible answer snippet retrieved, simultaneously with the
verification that the intended polarity of the answer snippet is the same as
that of the question.
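A sketch of these four correspondence rules is shown below; the dictionary fields (focus, polarity, first_entity) are illustrative names for information that, in the actual system, comes from the question and snippet processing stages.

# Sketch of the four question-snippet correspondence rules above.
def match_snippet(questions, snippet):
    """Return the question (if any) that the snippet answers, following rules 1-4."""
    if len(questions) == 1:                               # rule 1: same polarity is enough
        q = questions[0]
        return q if snippet["polarity"] == q["polarity"] else None
    q1, q2 = questions
    if q1["polarity"] != q2["polarity"]:
        if q1["focus"] == q2["focus"]:                    # rule 2: discriminate by polarity
            return q1 if snippet["polarity"] == q1["polarity"] else q2
        for q in (q1, q2):                                # rule 3: match focus and polarity
            if snippet["focus"] == q["focus"] and snippet["polarity"] == q["polarity"]:
                return q
        return None
    # rule 4: same focus and polarity - fall back to the order of appearance of the
    # entities in focus in the question and in the snippet (simplified here).
    for q in (q1, q2):
        if snippet["polarity"] == q["polarity"] and snippet.get("first_entity") == q.get("first_entity"):
            return q
    return None

questions = [{"focus": "George Clooney", "polarity": "positive"},
             {"focus": "George Clooney", "polarity": "negative"}]
snippet = {"focus": "George Clooney", "polarity": "negative"}
print(match_snippet(questions, snippet))  # the second (negative) question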
The categorization of questions into these four classes is decisive at the time of
making the question - answer snippet correspondence, in the snippet/blog phrase
processing stage. Details on these issues are given in what follows and in Figure
5.1.
33 Later in this chapter, as we will redefine the task of question answering in the context of opinions, we will refer to this concept as the Expected Polarity Type (EPT). Although in this first experiment we called it question polarity, the new EPT term was necessary, as question polarity entails that the question in itself has some sort of orientation, when in fact it is the polarity of the expected answer that is actually computed at this stage.
[Figure 5.1: Question processing - the question patterns and the WordNet Affect, emotion triggers, attitude and value word resources are applied to the questions in order to derive the reformulation patterns, the question keywords, the question focus and the question polarity, which drive the decision on the snippets.]
To the final phrases used in creating the summary we added, for coherence
reasons, the reformulation patterns deduced from the question structure. Taking
into consideration the limitation on the number of characters, we only included in the
summary the phrases with high positive scores and those with high negative scores,
completed with the reformulation patterns, until reaching the imposed character
limit. Thus, the score given by the sentiment analysis system constituted the main
criterion for selecting the relevant information for the summaries, and the F-measure
score reflects the quality of the opinion mining process.
[Figure: Snippet processing - the snippets/keywords and the original blog phrases are assigned a polarity using WordNet Affect, the emotion triggers and ISEAR, the question-snippet correspondence is established per topic, the snippets are sorted by polarity strength and the reformulation patterns are added to produce the final summaries.]
and change them to a neutral style. A set of rules to identify pronouns was created,
and these were also changed to the more general pronoun "they" and its
corresponding forms, to avoid personal opinions.
The second approach had as its starting point determining the focus, keywords,
topic and polarity of each of the given questions. The processing of the questions is
similar to the one performed in the first approach. Starting from the focus,
keywords and topic of the question, we sought sentences in the blog collection
(previously processed as described in the first approach) that could constitute
possible answers to the questions, according to their similarity to the latter. The
similarity score was computed with Pedersen's Text Similarity Package34. The
snippets thus determined underwent dependency parsing with Minipar and only the
sentences which contained a subject and a predicate were kept, thus ensuring the
elimination of some of the noise present (such as section titles, dates, times etc.).
The remaining snippets were classified according to their polarity, using the
similarity score with respect to the described emotion vectors. The direct language
style was changed to indirect speech style. The reformulation patterns deduced from
the questions' structure were added to bind together the snippets and produce the
final summary, concatenating the snippets with the added reformulations. Since the
final length of the summary could easily exceed the imposed limit, we sorted the
snippets by their polarity strength (the higher the polarity score - be it positive or
negative - the higher the rank of the snippet), and included the reformulated snippets
in descending order until the final limit was reached.
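This selection step can be sketched as follows, assuming each snippet already carries its polarity score and has its reformulation pattern prepended; the character limit and example snippets are illustrative.

# Sketch of the final summary assembly: snippets are ranked by the absolute value of
# their polarity score and concatenated until the imposed character limit is reached.
def build_summary(snippets, char_limit):
    """snippets: list of (text, polarity_score) pairs."""
    ranked = sorted(snippets, key=lambda s: abs(s[1]), reverse=True)
    summary, length = [], 0
    for text, _score in ranked:
        if length + len(text) > char_limit:
            break  # stop once the imposed limit would be exceeded
        summary.append(text)
        length += len(text)
    return " ".join(summary)

snippets = [("People like George Clooney because he is a fine actor.", 3.2),
            ("Some dislike George Clooney for his political activism.", -2.1),
            ("George Clooney was born in 1961.", 0.1)]
print(build_summary(snippets, char_limit=120))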
EVALUATION OF THE APPROACH IN THE TAC2008 COMPETITION
45 runs were submitted by 19 teams for evaluation in the TAC 2008 Opinion Pilot
task. Each team was allowed to submit up to three runs, but finally, due to the
difficulty involved in the evaluation of such a task, only the first two runs of each
team were evaluated, leading to 36 runs being evaluated. Table 5.1 shows the final
results obtained by the first two runs we submitted for evaluation in the TAC 2008
Opinion Pilot and the rank they had compared to the other participating systems.
The column numbers stand for the following information:
1. summarizerID (our Run 1 had summarizerID 8 and Run 2 had summarizerID 34)
2. Run type: "manual" or "automatic"
3. Did the run use the answer snippets provided by NIST: "Yes" or "No"
4. Average pyramid F-score (Beta=1), averaged over 22 summaries
5. Average score for Grammaticality
34 http://wn-similarity.sourceforge.net/
1    2           3     4       5       6       7       8       9
8    automatic   Yes   0.357   4.727   5.364   3.409   3.636   5.045
34   automatic   No    0.155   3.545   4.364   3.091   2.636   2.227

Table 5.1: Evaluation results in the TAC 2008 competition
Further on, we present the system's performance with respect to all other
teams, first as an overall classification (Table 5.2) and secondly taking into
consideration whether or not the run used the optional answer snippets provided by
NIST (Table 5.3).
In Table 5.2, the numbers in columns 4, 5, 6, 7, 8 and 9 correspond to the
position within the 36 evaluated submissions. In Table 5.3, the numbers in columns
4, 5, 6, 7, 8 and 9 correspond to the position within the 17 submissions that used the
given optional answer snippets (in the case of Run 1) and to the position within the 19
evaluated submissions that did not use the provided answer snippets (in the case of
Run 2).
1    2           3     4    5    6    7    8    9
8    automatic   Yes   7    8    28   4    16   5
34   automatic   No    23   36   36   13   36   28

Table 5.2: Classification results (overall comparison)
1    2           3     4   5    6    7   8    9
8    automatic   Yes   7   15   14   2   11   5
34   automatic   No    9   19   19   6   19   14

Table 5.3: Classification results (comparison with systems using/not using answer snippets)
As can be noticed from the results tables, our system performed well regarding
Precision and Recall, the first run being classified 7th among the 36 evaluated runs
as far as F-measure is concerned. As far as structure and coherence are concerned,
the results were also good, placing Run 1 in 4th position among the 36 evaluated
runs. Also worth mentioning is the good performance obtained regarding overall
responsiveness, where Run 1 ranked 5th among the 36. When comparing our two
approaches, in both cases they did not perform very well with respect to the
non-redundancy criterion, nor to the grammaticality one. An interesting thing worth
mentioning as far as the results obtained are concerned is that the use of
reformulation patterns, in order to generate sentences for completing the summaries,
has been appropriate, leading to very good rankings according to the
structure/coherence criterion. However, due to the low results obtained for the
redundancy and grammaticality criteria, we decided to test different methods to
overcome these issues.
35 This principle states that the more times a word appears in a document, the more relevant the sentences that contain this word are.
36 The RTE Challenges have been organized between 2005 and 2007 by the Pascal Network of Excellence and, since 2008, by the National Institute of Standards and Technology, within the Text Analysis Conference (TAC).
(OpSum-1 and OpSum-2), in order to analyze whether they have been improved or
not.
System     Precision   Recall   F-measure
OpSum-1    0.592       0.272    0.357
OpSum-2    0.251       0.141    0.155
WF-1       0.705       0.392    0.486
TE+WF-1    0.684       0.630    0.639
WF-2       0.322       0.234    0.241
TE+WF-2    0.292       0.282    0.262

Table 5.4: Comparison of the results obtained in the competition versus the results obtained applying the summarization system proposed by Lloret et al. (2008)
From these results it can be stated that the TE module, in conjunction with the
WF counts, has been very appropriate in selecting the most important information
of a document. Although it could be thought that applying TE can remove some
meaningful sentences containing important information, the results show the
opposite. It benefits the Precision value, because a shorter summary contains a
greater ratio of relevant information. On the other hand, taking into consideration
the F-measure value only, it can be seen that the approach combining TE and WF,
for the sentences in the first approach, significantly improved the best F-measure
result among the participants of TAC 2008, increasing its performance by 20%
(with respect to WF only), and improving by approximately 80% with respect to
our first approach submitted to TAC 2008.
However, a simple generic summarization system like the one we have used
here is not enough to produce opinion-oriented summaries, since the semantic
coherence given by the grouping of positive and negative opinions is not taken into
account. Therefore, simply applying the summarization system does not yield the
desired results in terms of opinionated content quality. Hence, the opinion
classification stage must be added in the same manner as used in the competition
and combined appropriately with a redundancy-removing method.
Motivated by this first set of experiments, in a second approach we wanted to test how much of the redundant information could be removed by using a Textual Entailment system similar to the one proposed by Iftene and Balahur-Dobrescu (2007), without affecting the quality of the remaining data. As input for the TE system, we considered the snippets retrieved from the original blog posts. We applied the entailment verification on each of the possible pairs, taking in turn all snippets as Text and Hypothesis with all other snippets as Hypothesis and Text,
System              F-measure
Best system         0.534
Second-best system  0.490
OpSum-1 + TE        0.530
OpSum-1             0.357
Table 5.5: Comparison of the best scoring systems in TAC 2008 and the DLSIUAES team's improved system
Table 5.5 shows that applying TE before generating the final summary leads to very good results, increasing the F-measure by 48.50% with respect to the original first approach. Moreover, it can be seen from Table 5.5 that our improved approach would have ranked in second place among all the participants as regards F-measure, maintaining the linguistic quality level with which our approach ranked high in the TAC 2008 Opinion Pilot competition. The main problem with this approach is the long processing time. We can apply Textual Entailment in the manner described within the generic summarization system presented, successively testing the relations "Snippet1 entails Snippet2?", "Snippet1+Snippet2 entails Snippet3?" and so on. The problem then becomes that this procedure is arbitrary: different snippets come from different sources, so there is no natural order among them. Furthermore, we have seen that many problems arise from the fact that extracting information from blogs introduces a lot of noise. In many cases, we had examples such as:
"At 4:00 PM John said Starbucks coffee tastes great"
"John said Starbucks coffee tastes great, always get one when reading New York Times."
The important information that should be added to the final summary is "Starbucks coffee tastes great". Our TE system contains a rule specifying that a Named Entity present in the hypothesis but not mentioned in the text leads to a decision of NO entailment. For the example given, both snippets are therefore maintained, although they contain the same data. Another issue to be addressed is the extra information contained in the final summaries that is not scored as a nugget. As we have seen from our data, much of this information is also valid and correctly answers the questions. Therefore, what methods can be employed to automatically give more weight to some snippets and penalize others?
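The pairwise redundancy check described above can be sketched as follows. Since the actual TE system (Iftene and Balahur-Dobrescu, 2007) is not reproduced here, the entailment decision is approximated by lexical overlap and the Named Entity rule is simulated with a naive capitalized-word heuristic; both are simplifying assumptions for illustration only.

def named_entities(snippet):
    # Naive stand-in for NER: capitalized tokens that are not sentence-initial.
    tokens = snippet.split()
    return {t.strip(".,") for t in tokens[1:] if t and t[0].isupper()}

def entails(text, hypothesis, threshold=0.8):
    # Rule from the text: a NE present in the hypothesis but absent from
    # the text leads to a NO entailment decision.
    if named_entities(hypothesis) - named_entities(text):
        return False
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    if not h:
        return False
    # Overlap-based approximation of the entailment decision.
    return len(t & h) / len(h) >= threshold

def remove_redundant(snippets):
    kept = []
    for s in snippets:
        # Discard a snippet if an already kept snippet entails it.
        if not any(entails(k, s) for k in kept):
            kept.append(s)
    return kept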
Match S-P           90%   10%
Noun-det            75%   25%
Upper case          80%   20%
Repeated words      100%  0%
Repeated .          80%   20%
Spelling mistakes   60%   40%
Unpaired /()        100%  0%
Table 5.6: Grammaticality analysis
The greatest problem encountered was that bigrams are not detected and that agreement is not enforced in cases in which the noun does not appear immediately after the determiner. All in all, using this module, the grammaticality of our texts was greatly improved. In our post-competition experiments, we showed that using a generic summarization system we obtain an 80% improvement over the results obtained in the competition, with coherence being maintained by using the same polarity classification mechanisms. Using redundancy removal with TE, as opposed to our initial polarity-strength-based sentence filtering, improved the system performance by almost 50%. Finally, we showed that grammaticality can be checked and improved using an independent solution given by Language Tool37.
Nevertheless, many of the components of this approach still require improvement. First of all, as we have seen, special methods must be applied in order to treat the questions and determine additional elements (expected polarity of the answer, expected source of the opinion, target of the opinion etc.). Additionally, as in the case of the initial QA task at TAC 2008, real-life systems must be able to distinguish between factual questions and questions that require an opinionated answer.
37 http://community.languagetool.org/
Using EmotiBlog, we label each answer adding three elements. The first one is the source, the second one is the target and, finally, we annotate the required polarity. Using these annotations, we will be able to detect the author of the sentence, the target of the sentiment and also the polarity of the expressed opinion.
Furthermore, we also have to find an effective method to check whether an answer is correct. This is difficult for general opinion answers because, in most cases, answers to the same question can be different without being incorrect. In fact, they could express different points of view about a subject. We could say that each of them is partially correct and none of them is totally correct.
After having retrieved the answers, answer validation is needed. We could perform the Answer Validation Exercise (AVE) for objective answers, but for opinion sentences we do not have any similar tool at our disposal. We would need, for example, a consistent collection of paraphrases for opinions but, to the best of our knowledge, such a collection does not exist.
In order to carry out our experiments, we created a small collection of questions in addition to the ones made by Stoyanov et al. (2005) for the OpQA corpus. We decided to use part of this collection because our corpus and the MPQA corpus share the same topic, the Kyoto Protocol; as a consequence, our corpus will probably contain the answers to their queries and our questions will also have answers in their corpus.
We used 8 questions in English, divided into two groups; the first one is composed of factoid questions, which we consider objective queries. They usually require a single fact as an answer and include questions such as: What is the Kyoto Protocol? or When was the Kyoto Protocol ratified? As we can deduce, they ask for a definition and for a date.
In general, factoid queries ask for a name, date, location, time etc. and, as a consequence, the most relevant aspect is that they need a univocal answer. Moreover, answers to factoid questions can be validated using different techniques, such as Textual Entailment, as proposed in the Answer Validation Exercise (AVE)38.
38 http://nlp.uned.es/clef-qa/ave/
The second group is composed of opinion questions. They are different from the ones previously described. They are more complex in nature, as they contain additional elements, such as the required polarity, and the answer is not univocal. By their nature, opinion questions require the retrieval of multiple answers that are equally correct and valid (e.g. What reasons do people have for liking MacDonalds?). We will obtain a wide list of answers, each of which will be correct.
Furthermore, there are opinion questions that could be interpreted as objective, but which actually require the analysis of opinionated content in order to be
answered (e.g. Are the Japanese unanimous in their opinion of Bush's position on the Kyoto Protocol?). Usually, opinion queries can have more than one answer. One of the possible answers found in the corpus has the form "Japanese were unanimous in their opinion about Bush's position on the Kyoto protocol", and it is equally possible that answering this question requires the analysis of many opinions on the same subject. They are evaluative or subjective because they express sentiments, feelings or opinions.
Table 5.7 presents the collection of questions we used in order to carry out our
experiments.
Type     Question
factoid  What is the Kyoto protocol about?
factoid  When was the Kyoto Protocol adopted?
factoid  Who is the president of Kiko Network?
factoid  What is the Kiko Network?
opinion  What is Bush's opinion about the Kyoto protocol?
opinion  What are people's feelings about Bush's decision?
opinion  What is the Japanese reaction to the Kyoto protocol?
opinion  What are people's opinions about Bush?
Table 5.7: The set of questions proposed on the EmotiBlog corpus
As we can see in Table 5.7, we have a total of 8 questions, divided between factoid and opinion. After having created the collection of questions, we labeled the answers in our blog corpus using EmotiBlog, the annotation scheme we built for emotion detection in non-traditional textual genres, such as blogs or forums. The granularity of the annotations made with this scheme can be an advantage if the model is designed in a simple but effective way. Using EmotiBlog, we aim at capturing all the phenomena relevant to the sentiment analysis task, in order to provide the opinion mining system with the maximum amount of information.
In order to propose appropriate methods to tackle opinion questions, we must first determine the associated challenges, in comparison to factual questions. Therefore, we first studied the answers to opinion and factual questions in the OpQA corpus. In the case of factoid questions, we noticed that, in most cases, they can be answered in a straightforward manner, i.e. the information required is found in the corpus expressed in a direct manner. One problem we found was related to temporal expressions. For example, for the question "When was the Kyoto Protocol ratified?", some answers contain "1997", but other answers are "last November" or "in the course of the year". As a consequence, we deduce that, in order to correctly interpret the answer, additional context must be used. Such extra details could be, for example, the date of the document. For the rest of the factoid questions we did not detect other relevant obstacles.
Regarding opinion queries, the analysis revealed a series of interesting issues. The first one is that some opinion questions could be interpreted as factoid. For example: Are the Japanese unanimous in their opinion of Bush's position on the Kyoto Protocol? This query could be answered with yes/no, but the information we really need is an opinion. Thus, the first thing to do should be to build up patterns in order to detect whether a question is about an opinion.
Another problem we detected when analyzing the OpQA answers is the lack of the source of the opinion. If we ask: How is Bush's decision not to ratify the Kyoto Protocol looked upon by Japan and other US allies?, the question asks for two different points of view, but its corresponding answers are not clear. We detected, for example: "We'd like to see further efforts on the part of the U.S." or "dump". As we can notice, we do not know who is expressing each of these opinions. This is another issue that should be solved. Therefore, when annotating the answers with EmotiBlog, we specify the target (the entity about which the opinion is expressed) and the source of the opinion.
In order to improve this task, we could label our collection of questions with the Expected Answer Type (EAT) and our corpus with Named Entities (NEs). On the one hand, the EAT would solve the problem of understanding each query type and, on the other hand, NE labeling could improve entity detection.
It is worth mentioning that opinion QA is a challenging and complex task. The first step we should be able to perform is to discriminate factoid versus opinion questions and, after having solved this first problem, we should try to find a way to expand each of our queries. Finally, we should have a method to automatically verify the correctness of partial answers.
In order to study the task of fact versus opinion classification, we used, on the one hand, the development and test sets of the AVE (Answer Validation Exercise) 2007 and 2008 campaigns (fact questions) and, on the other hand, the development and test sets of the TAC (Text Analysis Conference) 2008 Opinion Pilot, the questions formulated on the OpQA corpus (Stoyanov et al., 2005) and the questions we formulated on the EmotiBlog corpus. In total, we gathered 334 fact and 79 opinion questions. The first classification we performed used the corpus annotations on the Kyoto Protocol blog posts. For each of the questions, we computed its similarity with each of the objective versus subjective phrases annotated in the corpus and built vectors of the obtained scores. We then classified these vectors using ten-fold cross-validation with SVM SMO.
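A compact sketch of this classification step is given below, using scikit-learn rather than the SMO implementation employed in the experiments; the similarity-score feature vectors and labels are placeholder values standing in for the real similarities to the objective and subjective annotated phrases.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Each question is represented by similarity scores against the objective and
# the subjective annotated phrases (placeholder values); 0 = fact, 1 = opinion.
X = np.array([[0.82, 0.10], [0.15, 0.71], [0.77, 0.22], [0.20, 0.66],
              [0.90, 0.05], [0.12, 0.80], [0.70, 0.30], [0.25, 0.60],
              [0.85, 0.15], [0.18, 0.75]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Cross-validated SVM classification (ten folds in the real setting;
# five here, given the size of the toy data).
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy:", scores.mean())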
The results obtained in this first phase are summarized in Table 5.8 (P denotes the Precision score, R the Recall, Acc. the accuracy over the classified examples and Kappa the kappa statistic).

          P      R
Fact      0.891  0.955
Opinion   0.727  0.506
Acc. 0.86    Kappa 0.52
Table 5.8: Fact versus opinion question classification using the corpus annotations

          P      R
Fact      0.923  0.987
Opinion   0.93   0.671
Acc. 0.92    Kappa 0.73
Table 5.9: Fact versus opinion question classification using the interrogation formula
As we can notice from Table 5.9, when using the clue of the interrogation formula, the results improved substantially for the opinion question category. Further on, we performed a third experiment whose aim was to test the influence of the number of learning examples per category on the classification results. As the number of examples of opinion questions was much smaller than the number of examples of fact questions (79 as opposed to 334), we wanted to measure the classification performance when the training sets were balanced for the two categories. We randomly selected 79 fact questions from the 334 initial ones.
          P      R
Fact      0.908  0.747
Opinion   0.785  0.924
Acc. 0.83    Kappa 0.67
Table 5.10: Fact versus opinion question classification using the interrogation formula and an equal number of training examples
In this case, we can see that recall drops significantly for the fact category and increases for the opinion examples. Precision for the fact category remains around the same values in all classification settings and drops for the opinion category when using fewer examples of fact questions. At an overall level, however, the results become more balanced. Therefore, a first conclusion that we can draw is that we need larger collections of opinion questions in order to better classify opinion and fact questions in a mixed setting.
Subsequent to this analysis of methods for fact versus opinion question classification, we annotated, in the EmotiBlog corpus, the answers corresponding to the set of questions we proposed. Table 5.11 presents the questions and the number of answers annotated using the EmotiBlog annotation scheme.
Question                                               Number of answers
What is Bush's opinion about the Kyoto protocol?       4
What are people's feelings about Bush's decision?      1
What is the Japanese reaction to the Kyoto Protocol?   3
What are people's opinions about Bush?                 4
Table 5.11: List of opinion questions and the number of corresponding answers annotated in the corpus
As we can see in Table 5.11, we have labeled different answers for each opinion question. Each of these questions is different in nature from the others and requires a distinct approach in order to be answered. For example, if we consider the question What are people's feelings about Bush's decision?, the annotated answers are: "I am just disappointed that the US never supported it" / "The whole world's perturbed with the greenhouse effect; emission gases destroying the earth and global warming causing terrible climate changes, except, of course President Bush." / "After years of essentially rolling over against the Bush administration on
No.  Question (English) | Question (Spanish)
1. What international organization do people criticize for its policy on carbon emissions? | ¿Cuál fue uno de los primeros países que se preocupó por el problema medioambiental?
2. What motivates people's negative opinions on the Kyoto Protocol? | ¿Cuál es el país con mayor responsabilidad de la contaminación mundial según la opinión pública?
3. What country do people praise for not signing the Kyoto Protocol? | ¿Quién piensa que la reducción de la contaminación se debería apoyar en los consejos de los científicos?
4. What is the nation that brings most criticism to the Kyoto Protocol? | ¿Qué administración actúa totalmente en contra de la lucha contra el cambio climático?
5. What are the reasons for the success of the Kyoto Protocol? | ¿Qué personaje importante está a favor de la colaboración del estado en la lucha contra el calentamiento global?
6. What arguments do people bring for their criticism of media as far as the Kyoto Protocol is concerned? | ¿A qué políticos americanos culpa la gente por la grave situación en la que se encuentra el planeta?
7. Why do people criticize Richard Branson? | ¿A quién reprocha la gente el fracaso del Protocolo de Kyoto?
8. What president is criticized worldwide for his reaction to the Kyoto Protocol? | ¿Quién acusa a China por provocar el mayor daño al medio ambiente?
9. What American politician is thought to have developed bad environmental policies? | ¿Cómo ven los expertos el futuro?
10. What American politician has a positive opinion on the Kyoto protocol? | ¿Cómo se considera el atentado del 11 de septiembre?
11. What negative opinions do people have on Hilary Benn? | ¿Cuál es la opinión sobre EEUU?
12. Why do Americans praise Al Gore's attitude towards the Kyoto protocol and other environmental issues? | ¿De dónde viene la riqueza de EEUU?
13. What country disregards the importance of the Kyoto Protocol? | ¿Por qué la guerra es negativa?
14. What country is thought to have rejected the Kyoto Protocol due to corruption? | ¿Por qué Bush se retiró del Protocolo de Kyoto?
15. What alternative environmental friendly resources do people suggest to use instead of gas in the future? | ¿Cuál fue la posición de EEUU sobre el Protocolo de Kyoto?
16. Is Arnold Schwarzenegger pro or against the reduction of CO2 emissions? | ¿Qué piensa Bush sobre el cambio climático?
17. What American politician supports the reduction of CO2 emissions? | ¿Qué impresión da Bush?
18. What improvements are proposed to the Kyoto Protocol? | ¿Qué piensa China del calentamiento global?
19. What is Bush accused of as far as political measures are concerned? | ¿Cuál es la opinión de Rusia sobre el Protocolo de Kyoto?
20. What initiative of an international body is thought to be a good continuation for the Kyoto Protocol? | ¿Qué cree que es necesario hacer Yvo Boer?
Table 5.12: List of questions in English and Spanish
We created a set of factoid (F) and opinion (O) queries for English and for Spanish, presented in Table 5.12. Some of the questions can be considered to lie between factoid and opinion (F/O), and the system can retrieve multiple answers after having selected, for example, the polarity of the sentences in the corpus.
The first step in our analysis was to evaluate and compare the generic QA system of the University of Alicante (Moreda et al., 2008) and the opinion QA system presented by Balahur et al. (2008), to which Named Entity Recognition with LingPipe39 and FreeLing40 was added, in order to boost the scores of answers containing NEs of the question's Expected Answer Type (EAT).
The open-domain QA system of the University of Alicante (Moreda et al., 2008) deals with factual questions such as location, person, organization, date-time and number, in English and also in Spanish. Its architecture comprises three modules: question analysis, document retrieval and answer extraction.
39 http://alias-i.com/lingpipe/
40 http://garraf.epsevg.upc.es/freeling/
41 http://www.cs.ualberta.ca/~lindek/minipar.htm
42 http://garraf.epsevg.upc.es/freeling/
[Figure: Architecture of the open-domain QA system, with its question analysis (question language, type and keywords), document retrieval and answer extraction (NE scoring, distance, clustering) modules.]
question keywords, whose polarity is the same as the determined question polarity and which contain a NE of the EAT. As the traditional QA system outputs 50 answers, we also take the 50 most similar sentences and extract the NEs they contain. In the future, when training examples become available, we plan to set a threshold for similarity, thus not limiting the number of output answers, but setting a bound on the similarity score (this is related to the observation made by Stoyanov et al. (2005) that opinion questions have a highly variable number of answers). The approach is depicted in Figure 5.4.
Further on, we present the details of our method. In order to extract the topic and determine the question polarity, we define question patterns. These patterns take into consideration the interrogation formula and extract the opinion words (nouns, verbs, adverbs, adjectives and their determiners). The opinion words are then classified in order to determine the polarity of the question, using the WordNet Affect emotion lists, the emotion triggers resource (Balahur and Montoyo, 2008), a list of four attitudes that we built (containing the verbs, nouns, adjectives and adverbs for the categories of criticism, support, admiration and rejection) and a list of positive and negative opinion words taken from the system by Balahur and Montoyo (2008).
[Figure 5.4: The opinion question answering approach: the questions are analyzed (EAT patterns, question polarity, question focus and keywords), the blog sentences are annotated with EmotiBlog and processed with NER (FreeLing), similarity between questions and sentences is computed and the 50 most similar candidate sentences are retrieved.]
The similarity score was computed with Pedersen's Text Similarity Package44 (using this software, the words in the question, as well as the words in the blog sentences, are also stemmed). The condition we subsequently set was that the polarity of the retrieved snippet be the same as that of the question.
The polarity was computed using SVM on the model trained on the annotations in the EmotiBlog corpus, using as features the n-gram similarity (with n ranging from 1 to 4), as well as the overall similarity to the annotated phrases, an approach similar to Balahur et al. (2009). Moreover, in the case of questions with EAT PERSON, ORGANIZATION or LOCATION, we required that a Named Entity of the appropriate type be present in the retrieved snippets and we boosted the score of the snippets fulfilling these conditions to the score of the highest ranking one. In case more than 50 snippets were retrieved, we only considered for evaluation the first 50 in the order of their similarity score, filtered by the polarity and NE presence criteria (which proved to be good indicators of the snippets' importance (Balahur et al., 2008)).
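The ranking and filtering just described can be summarised in the following sketch. The similarity, polarity and NE functions are stand-ins passed as parameters (the real system uses Pedersen's Text Similarity Package, the EmotiBlog-trained SVM classifier and FreeLing NER); only the combination logic reflects the description above.

def rank_snippets(question, snippets, similarity, polarity, has_required_ne,
                  question_polarity, top_n=50):
    # Score every snippet by its similarity to the question.
    scored = [(similarity(question, s), s) for s in snippets]
    # Keep only snippets whose polarity matches the question polarity.
    scored = [(sc, s) for sc, s in scored if polarity(s) == question_polarity]
    if not scored:
        return []
    best = max(sc for sc, _ in scored)
    # Boost snippets containing a NE of the expected answer type to the
    # score of the highest ranking snippet.
    boosted = [(best if has_required_ne(s) else sc, s) for sc, s in scored]
    boosted.sort(key=lambda pair: pair[0], reverse=True)
    # Return at most the top 50 candidates, as in the evaluation setting.
    return [s for _, s in boosted[:top_n]]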
EVALUATION OF THE INITIAL APPROACH FOR OPINION
QUESTION ANSWERING
Table 5.12 presents the results obtained for English and Table 5.13 those for Spanish. We indicate the id of the question (Q), the question type (T) and the number of answers in the Gold Standard (A). We present the number of answers retrieved by the traditional system (TQA) and by the opinion one (OQA), taking into account the first 1, 5, 10 and 50 answers.
[Tables 5.12 and 5.13: for each question (with its type, F or O, and its number of Gold Standard answers), the number of correct answers retrieved by the traditional QA system (TQA) and by the opinion QA system (OQA) among the first 1, 5, 10 and 50 returned answers.]
44 http://www.d.umn.edu/~tpederse/text-similarity.html
of the NE of the sought type within the retrieved snippet; in some cases the name was misspelled in the blog entries, whereas in other cases the NER performed by FreeLing either attributed the wrong category to an entity, failed to annotate it or wrongly annotated words as NEs. Of no less importance is the question duality aspect in question 17: Bush is commented on in more than 600 sentences; therefore, when the polarity is not specified, it is difficult to correctly rank the answers. Finally, the problems of temporal expressions and co-reference also need to be taken into account.
FINAL PROPOSAL FOR AN OPINION QUESTION ANSWERING
SYSTEM
Summarizing our efforts up to this point, subsequent to the research in classifying questions as factual or opinionated, we created a collection of both factual and opinion queries in Spanish and English. We labeled the Gold Standard of the answers in the corpora and subsequently employed two QA systems, one open-domain and one for opinion questions. Our main objective was to compare the performance of these two systems and analyze their errors, proposing solutions for creating an effective QA system for both factoid and opinionated queries. We saw that, even using specialized resources, the task of QA remains challenging. From our preliminary analysis, we could see that opinion QA can benefit from snippet retrieval at paragraph level, since in many cases the answers were not simple parts of sentences, but consisted of two or more consecutive sentences. On the other hand, we have seen cases in which each of three different consecutive sentences was a separate answer to a question.
Therefore, our subsequent efforts (Balahur et al., 2010a; Balahur et al., 2010c) concentrated on analyzing the improvements that can be brought at the different stages of the OQA process: question treatment (identification of the expected polarity, EPT, the expected source, ES, and the expected target, ET), opinion retrieval (at the level of one- and three-sentence-long snippets, using topic-related words or using paraphrases) and opinion analysis (using topic detection and anaphora resolution). This research is motivated by the conclusions drawn by previous studies (Balahur et al., 2009d). Our purpose is to verify whether and how the results of the system are affected by the inclusion of new elements and methods: source and target detection (using semantic role labeling, SRL), topic detection (using Latent Semantic Analysis) and joint topic-sentiment analysis (classification of the opinion expressed only in sentences related to the topic), followed by anaphora resolution (using a system whose performance is not optimal). Our contribution in this respect is the identification of the challenges related to OQA compared to traditional QA. We propose adding the appropriate methods, tools and resources to resolve the identified challenges. With the purpose of testing the effect of each tool, resource and technique, we carry out a separate and a global evaluation. Although previous approaches noted that opinion questions have longer answers than factual ones, the research done in OQA so far, including our own work (Balahur et al., 2009a; Balahur et al., 2009d), has only considered a sentence-level approach. In the following experiments we therefore evaluate the impact of retrieval at the 1- and 3-sentence level, as well as of retrieval based on similarity to query paraphrases enriched with topic-related words. We believe retrieving longer text spans could cause additional problems, such as redundancy, the need to resolve co-reference and temporal expressions, or the need to apply contextual information.
Starting from the research in previous works (Balahur et al., 2009a; Balahur et al., 2009d), our aim is to take a step forward and employ opinion-specific methods focused on improving the performance of our OQA system. We perform the retrieval at the 1-sentence and 3-sentence level and also determine the ES and the ET of the questions, which are fundamental to properly retrieve the correct answer. These two elements are selected employing semantic roles. The expected answer type (EAT) is determined using Machine Learning (ML) with Support Vector Machines (SVM), taking into account the interrogation formula, the subjectivity of the verb and the presence of polarity words in the target semantic role. In the case of expected opinionated answers, we also compute the EPT by applying OM on the affirmative version of the question. These experiments are presented in more detail in the experiment section.
In order to carry out the present research on detecting and solving the complexities of opinion QA systems, we employed two blog post corpora: EmotiBlog (Boldrini et al., 2009a) and the TAC 2008 Opinion Pilot test collection (part of the Blog06 corpus).
The TAC 2008 Opinion Pilot test collection is composed of documents containing the answers to the opinion questions given on 25 targets. EmotiBlog is a collection of blog posts in English extracted from the Web. As a consequence, it represents a genuine example of this textual genre. It consists of a monothematic corpus about the Kyoto Protocol, annotated with the improved version of EmotiBlog (Boldrini et al., 2009b). It is well known that Opinion Mining (OM) is a very complex task due to the high variability of the language under study; thus, our objective was to build an annotation model for an exhaustive detection of subjective speech, able to capture the most important linguistic phenomena that lend subjectivity to the text. Additional criteria employed when choosing the elements to be annotated were effectiveness and noise minimization. Thus, from the first version of the model, the elements that were not statistically relevant were eliminated. The elements that compose the improved version of the annotation model are presented in Table 5.14.
Element                              Description
Obj. speech                          Confidence, comment, source, target.
Subj. speech                         Confidence, comment, level, emotion, phenomenon, polarity, source and target.
Adjectives/Adverbs                   Confidence, comment, level, emotion, phenomenon, modifier/not, polarity, source and target.
Verbs/Nouns                          Confidence, comment, level, emotion, phenomenon, polarity, mode, source and target.
Anaphora                             Confidence, comment, type, source and target.
Capital letter/Punctuation           Confidence, comment, level, emotion, phenomenon, polarity, source and target.
Phenomenon                           Confidence, comment, type, collocation, saying, slang, title, and rhetoric.
Reader/Author interpretation (obj.)  Confidence, comment, level, emotion, phenomenon, polarity, source and target.
Emotions                             Confidence, comment, accept, anger, anticipation, anxiety, appreciation, bad, bewilderment, comfort, compassion
Table 5.14: EmotiBlog structure
The first distinction is between objective and subjective speech. Subsequently, a finer-grained annotation is employed for each of the two types of data. Objective sentences are annotated with source and target (and, when necessary, with the level of confidence of the annotator and a comment). The subjective elements can be annotated at sentence level, but they also have to be labeled at word level. EmotiBlog also contains annotations of anaphora at a cross-document level (to interpret the storyline of the posts) and of the sentence type (simple sentence or title, but also saying or collocation).
Finally, the Reader and the Writer interpretation have to be marked in objective sentences. This element is employed to mark and correctly interpret an apparently objective discourse whose aim is to implicitly express an opinion (e.g. "The camera broke in two days"). The first is useful to extract the interpretation of the reader (for example, if the writer says "The result of their governing was an increase of 3.4% in the unemployment rate" instead of "The result of their governing was a disaster for the unemployment rate") and the second to understand the background of the writer (e.g. "These criminals are not able to govern" instead of saying "the X party is not able to govern"); from this sentence the reader can deduce the political ideas of the writer. The questions whose answers are annotated with EmotiBlog are the subset of opinion questions in English presented in Balahur et al. (2009). The complete list of questions is shown in Table 5.15.
Question Number  Question
2    What motivates people's negative opinions on the Kyoto Protocol?
5    What are the reasons for the success of the Kyoto Protocol?
6    What arguments do people bring for their criticism of media as far as the Kyoto Protocol is concerned?
7    Why do people criticize Richard Branson?
11   What negative opinions do people have on Hilary Benn?
12   Why do Americans praise Al Gore's attitude towards the Kyoto protocol?
15   What alternative environmental friendly resources do people suggest to use instead of gas in the future?
16   Is Arnold Schwarzenegger pro or against the reduction of CO2 emissions?
18   What improvements are proposed to the Kyoto Protocol?
19   What is Bush accused of as far as political measures are concerned?
20   What initiative of an international body is thought to be a good continuation for the Kyoto Protocol?
Table 5.15: Questions over the EmotiBlog corpus
The main difference between the two corpora employed is that EmotiBlog is monothematic (it is composed only of posts about the Kyoto Protocol), while the TAC 2008 corpus contains documents on a multitude of subjects. Therefore, the techniques employed must be adjusted in order to treat each of them.
The expected answer type (EAT) is determined using Machine Learning (ML) with Support Vector Machines (SVM), taking into account the interrogation formula, the subjectivity of the verb and the presence of polarity words in the target semantic role. In the case of expected opinionated answers, we also compute the expected polarity type (EPT) by applying OM on the affirmative version of the question. An example of such a transformation is: given the question "What are the reasons for the success of the Kyoto Protocol?", the affirmative version of the question is "The reasons for the success of the Kyoto Protocol are X".
In the answer retrieval stage, we employ four strategies:
1. Using the JIRS (JAVA Information Retrieval System) IR engine (Gómez et al., 2007) to find relevant snippets. JIRS retrieves passages (of the desired length) by searching for the question structures (n-grams) instead of the keywords, and comparing them.
2. Using the Yahoo search engine to retrieve the first 20 documents that are most related to the query. Subsequently, we apply LSA on the retrieved documents and extract the words that are most related to the topic. Finally, we expand the query using words that are very similar to the topic and retrieve snippets that contain at least one of them and the ET.
3. Generating equivalent expressions for the query, using the DIRT paraphrase collection (Lin and Pantel, 2001), and retrieving candidate snippets of length 1 and 3 (length refers to the number of sentences retrieved) that are similar to each of the newly generated queries and contain the ET. Similarity is computed using the cosine measure. Examples of alternative queries for "People like George Clooney" are "People adore George Clooney", "People enjoy George Clooney", "People prefer George Clooney".
4. Enriching the equivalent expressions for the query in 3 with the topic-related words discovered in 2 using LSA (a simplified sketch of this similarity-based retrieval follows the list).
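The sketch below illustrates the cosine-based retrieval of strategies 3 and 4, using a simple bag-of-words cosine measure; the paraphrased queries, the topic words and the threshold value are illustrative assumptions, whereas the real system obtains them from the DIRT collection and from LSA.

import math
from collections import Counter

def cosine(a, b):
    # Bag-of-words cosine similarity between two short texts.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(va[w] * vb[w] for w in va)
    den = (math.sqrt(sum(v * v for v in va.values()))
           * math.sqrt(sum(v * v for v in vb.values())))
    return num / den if den else 0.0

def retrieve(queries, snippets, expected_target, topic_words=(), threshold=0.2):
    # Keep snippets that contain the expected target (ET), optionally at least
    # one topic-related word, and are similar enough to one of the queries.
    results = []
    for snippet in snippets:
        text = snippet.lower()
        if expected_target.lower() not in text:
            continue
        if topic_words and not any(w in text for w in topic_words):
            continue
        if max(cosine(q, snippet) for q in queries) >= threshold:
            results.append(snippet)
    return results

# "queries" would be the original query plus its paraphrases, e.g.
# ["People like George Clooney", "People adore George Clooney"].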
In order to determine the correct answers from the collection of retrieved snippets, we must keep only the candidates that have the same polarity as the question EPT. For polarity detection, we use a combined system employing SVM machine learning on unigram and bigram features, trained on the NTCIR MOAT 7 data, and an unsupervised lexicon-based system. As features for each of the unigrams and bigrams, we compute their tf-idf scores.
The unsupervised system uses the Opinion Finder lexicon to filter out the subjective sentences that contain more than two subjective words or a subjective word and a valence shifter (obtained from the General Inquirer resource). Subsequently, it accounts for the presence of opinionated words from four different lexicons: Micro WordNet (Cerini et al., 2007), WNAffect (Strapparava and Valitutti, 2004), Emotion Triggers (Balahur and Montoyo, 2008) and General Inquirer (Stone et al., 1966). For the joint topic-polarity analysis, we first employ LSA to determine the words that are strongly associated with the topic. Subsequently, we compute the polarity of the sentences that contain at least one topic word and the question target.
Finally, answers are filtered using the Semrol system for semantic role labeling proposed by Moreda (2008): we keep the snippets that have the required target and source as agent or patient. Semrol receives as input plain text with information about grammar, syntax, word senses, Named Entities and the constituents of each verb. The system output is the given text, in which the semantic role information of each constituent is marked. Ambiguity is resolved depending on the machine learning algorithm employed, which in this case is TiMBL45.
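Once the snippets carry semantic role annotations, the source/target filter reduces to a simple membership check. The sketch below assumes each snippet is represented as a dictionary with pre-extracted "agent" and "patient" strings; this representation is a hypothetical simplification, since Semrol's actual output format is richer.

def filter_by_roles(snippets, expected_source, expected_target):
    # snippets: list of dicts such as
    #   {"text": "...", "agent": "the Japanese", "patient": "the Kyoto Protocol"}
    kept = []
    for s in snippets:
        roles = {s.get("agent", "").lower(), s.get("patient", "").lower()}
        # Keep the snippet only if the expected source and the expected target
        # appear as agent or patient of the opinion expression.
        if expected_source.lower() in roles and expected_target.lower() in roles:
            kept.append(s)
    return kept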
[Table 5.16: Results for questions over EmotiBlog: for each question (with its number of Gold Standard answers), the number of correct answers found among the first 1, 5, 10 and 50 retrieved answers, for the baseline (Balahur et al., 2009d), the 1-phrase + ET + SA setting and the 3-phrase + ET + SA setting.]
The retrieval of the TAC 2008 1-phrase and 3-phrase candidate snippets was done using JIRS. Subsequently, we performed different evaluations, in order to assess the impact of using different resources and tools. Since TAC 2008 imposed a limit of 7,000 characters on the output, in order to compute a comparable F-measure we only considered, at the end of each processing chain, the snippets for the 1-phrase retrieval and for the 3-phrase one until this limit was reached.
1. In the first evaluation, we only apply the sentiment analysis tool and select the snippets that have the same polarity as the question EPT and in which the ET is found (e.g. for "What motivates people's negative opinions on the Kyoto Protocol?", the snippets "The Kyoto Protocol becomes deterrence to economic development and international cooperation" and "Secondly, in terms of administrative aspect, the Kyoto Protocol is difficult to implement." have the same EPT and ET). We also detected cases of the same polarity but without the ET, e.g. "These attempts mean annual expenditures of $700 million in tax credits in order to endorse technologies, $3 billion in developing research and $200 million in settling technology into developing countries" (negative EPT but not the same ET).
2. In the second evaluation, we add the result of the LSA process, filtering the snippets from 1 so that they contain the words related to the topic, starting from the retrieval performed by Yahoo, which extracts the first 20 documents about the topic.
3. In the third evaluation, we filter the results of 2 by applying the Semrol system and setting the condition that the ET and ES be the agent or the patient of the snippet.
4. In the fourth evaluation setting, we replaced the set of snippets retrieved using JIRS with the ones obtained by generating alternative queries using paraphrases. We subsequently filtered these results based on their polarity (so that it corresponds to the EPT) and on the condition that the source and target of the opinion (identified through SRL using Semrol) correspond to the ES and ET.
5. In the fifth evaluation setting, we replaced the set of snippets retrieved using JIRS with the ones obtained by generating alternative queries using
http://wing.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.htm
could retrieve answers to the set of questions in English in any of the other three languages.
In order to evaluate our approaches to opinion question answering, as well as to evaluate the inclusion of filtering techniques based on topic relevance and temporal restrictions, we participated in the first three subtasks in an English monolingual setting, as well as in the cross-lingual challenge, retrieving the answers to the question set in English from the document set in Traditional Chinese. For the participation in this competition, we named our opinion question answering system OpAL.
For the English monolingual subtasks, we submitted three runs of the OpAL system, for the opinionated, relevance and polarity judgment tasks.
TACKLING THE ENGLISH MONOLINGUAL SUBTASKS AT NTCIR MOAT
Judging sentence opinionatedness
The opinionated subtask required systems to assign the value YES or NO (Y/N) to each of the sentences in the document collection provided, depending on whether the sentence contains an opinion (Y) or not (N).
In order to judge the opinionatedness of the sentences, we employed two different approaches (the first one corresponding to system run number 1 and the second to system runs 2 and 3). Both approaches are rule-based, but they differ in the resources employed. We considered as opinionated the sentences that contain at least two opinion words or one opinion word preceded by a modifier. For the first approach, the opinion words were taken from the General Inquirer, Micro WordNet Opinion and Opinion Finder lexicons; in the second approach we only used the first two resources.
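The rule itself can be written down directly, as in the sketch below; the word lists are tiny illustrative samples, whereas the actual runs drew the opinion words from the General Inquirer, Micro WordNet Opinion and (for run 1) Opinion Finder lexicons.

OPINION_WORDS = {"good", "bad", "great", "terrible", "support", "oppose"}  # sample only
MODIFIERS = {"very", "extremely", "really", "quite"}                       # sample only

def is_opinionated(sentence):
    tokens = sentence.lower().split()
    # Rule: at least two opinion words ...
    if sum(1 for t in tokens if t in OPINION_WORDS) >= 2:
        return True
    # ... or one opinion word preceded by a modifier.
    return any(tokens[i] in MODIFIERS and tokens[i + 1] in OPINION_WORDS
               for i in range(len(tokens) - 1))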
Determining sentence relevance
In the sentence relevance judgment task, the systems had to output, for each sentence in the given collection of documents per topic, an assessment of whether or not the sentence is relevant for the given question. For the sentence relevance judgment stage, we employ three strategies (corresponding to the system runs 1, 2 and 3, respectively):
1. Using the JIRS (JAVA Information Retrieval System) IR engine (Gómez et al., 2007) to find relevant snippets. JIRS retrieves passages (of the desired length) by searching for the question structures (n-grams) instead of the keywords, and comparing them.
47 http://infomap-nlp.sourceforge.net/
48 http://webdocs.cs.ualberta.ca/~lindek/minipar.htm
49 http://alias-i.com/lingpipe/
50 http://www.wjh.harvard.edu/~inquirer/
System RunID   P      R      F
OpAL 1         17.99  45.16  25.73
OpAL 2         19.44  44     26.97
OpAL 3         19.44  44     26.97
Table 5.18: Results of system runs for opinionatedness
System RunID   P      R      F
OpAL 1         82.05  47.83  60.43
OpAL 2         82.61  5.16   9.71
OpAL 3         76.32  3.94   7.49
Table 5.19: Results of system runs for relevance
System RunID   P      R      F
OpAL 1         38.13  12.82  19.19
OpAL 2         50.93  12.26  19.76
Table 5.20: Results of system runs for polarity
System RunID   P     R      F
OpAL 1         3.54  56.23  6.34
OpAL 2         3.35  42.75  5.78
OpAL 3         3.42  72.13  6.32
Table 5.21: Results of system runs for the cross-lingual task, agreed measures, Traditional Chinese
System RunID   P      R      F
OpAL 1         14.62  60.47  21.36
OpAL 2         14.64  49.73  19.57
OpAL 3         15.02  77.68  23.55
Table 5.22: Results of system runs for the cross-lingual task, non-agreed measures, Traditional Chinese
DISCUSSION AND CONCLUSIONS
From the results obtained, on the one hand, we can see that although the extensive filtering according to the topic and the temporal restrictions increases the system precision, it causes a dramatic drop in recall. On the other hand, the use of simpler methods in the cross-lingual task yielded better results, the OpAL cross-lingual run 3 obtaining the highest F score for the non-agreed measures and ranking second according to the agreed measures.
From the error analysis performed, we realized that the LSA-based method for determining topic-related words is not enough to perform this task. The terms obtained by employing this method are correct and useful, but they should be expanded using language models, to better account for language variability.
Finally, we have seen that systems performing finer-grained tasks, such as temporal expression resolution, are not mature enough to be employed in such settings. This was confirmed by in-house experiments using anaphora resolution tools such as JavaRAP51, whose use also led to lower system performance and a dramatic loss in recall.
5.2.6. CONCLUSIONS
In this section, our research focused on solving a recent problem that emerged with the massive usage of the Web 2.0: the exponential growth of the opinionated data that needs to be efficiently managed for a wide range of practical applications.
We identified and explored the challenges raised by OQA, as opposed to traditional QA. Moreover, we studied the performance of new sentiment-topic detection methods, analyzed the improvements that can be brought at the different stages of the OQA process and analyzed the contribution of discourse analysis, employing techniques such as co-reference resolution and temporality detection. We also experimented with new retrieval techniques, such as faceted search using Wikipedia with LSA, which proved to improve the performance of the task.
From the results obtained, we can draw the following conclusions. The first one is that, on the one hand, the extensive filtering according to the topic and the temporal restrictions increases the system precision but produces a dramatic drop in recall. As a consequence, the use of simpler methods would be more appropriate in the cross-lingual context: the OpAL cross-lingual run 3 obtained the highest F score for the non-agreed measures and ranked second according to the agreed measures. On the other hand, we can deduce that the LSA-based method for determining topic-related words is not enough to perform this task. The terms obtained by employing this method are correct and useful; however, as future work, our purpose is to use language models to better account for language variability. Finally, we understand that co-reference or temporal resolution systems
51 http://aye.comp.nus.edu.sg/~qiu/NLPTools/JavaRAP.html
Element   Attribute
Polarity  Positive, negative
Level     Low, medium, high
Source    Name
Target    Name
Table 5.24: Elements from the EmotiBlog scheme used for the annotation of blog threads
The blog threads are written in English and have the same structure: the authors create an initial post containing a piece of news and possibly their opinion on it; subsequently, bloggers reply, expressing their opinions about the topic (thus forming a discussion thread). The blog corpus annotation contains the URL from which the thread was extracted, the initial annotated piece of news and the labeled user comments. The topics contained in the corpus are very diverse: economy, science and technology, cooking, society and sports. The data is annotated at document level, with the overall polarity and topic, and at sentence level,
EXPERIMENTAL SETTING
The main objective of our experiments is to design a system that is able to produce opinion summaries for two different types of texts: a) blog threads, in which case we aim at producing summaries of the positive and negative arguments given on the thread topic; and b) reviews, in the context of which we assess the best manner of using opinion summarization in order to determine the overall polarity of the sentiment expressed. In our first opinion summarization experiments, we adopt a standard approach by employing in tandem a sentiment classification system and a text summarizer. The output of the former is used to divide the sentences in the blog threads into three groups: sentences containing positive sentiment, sentences containing negative sentiment, and neutral or objective sentences. Subsequently, the positive and the negative sentences are passed on to the summarizer separately, to produce one summary for the positive posts and another one for the negative ones. Next, we present the sentiment analysis system, followed by a description of the summarization system, both of which serve as a foundation for subsequent sections. The ideas and results presented in this section were initially put forward in Balahur et al. (2009g).
The Sentiment Analysis System
The first step we took in our approach was to determine the opinionated sentences, assigning each of them a polarity (positive or negative) and a numerical value corresponding to the polarity strength (the higher the negative score, the more negative the sentence, and vice versa). Given that we are faced with the task of classifying opinion in a general context, we employed the simple, yet efficient, approach presented in Balahur et al. (2009).
In the following experiments, we used WordNet Affect (Strapparava and Valitutti, 2004), SentiWordNet (Esuli and Sebastiani, 2005) and MicroWNOp (Cerini et al., 2007). Each of the resources we employed was mapped to four categories, which were given different scores: positive (1), negative (-1), high positive (4) and high negative (-4). As we have shown (Balahur et al., 2009f), these values performed better than the usual assignment of only positive (1) and negative (-1) values. First, the score of each of the blog posts was computed as the sum of the values of the words that were identified; a positive score leads to the classification of the post as positive, whereas a negative score leads to the system classifying the post as negative. Subsequently, we performed sentence splitting using LingPipe and classified the sentences thus obtained according to their polarity, by adding the individual scores of the affective words identified.
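A minimal sketch of this scoring scheme is given below. The lexicon entries are placeholders mapped to the four score categories mentioned above, not the actual WordNet Affect, SentiWordNet or MicroWNOp entries.

LEXICON = {
    # placeholder entries mapped to the four categories: 1, -1, 4, -4
    "good": 1, "nice": 1, "bad": -1, "poor": -1,
    "excellent": 4, "wonderful": 4, "horrible": -4, "disastrous": -4,
}

def sentence_score(sentence):
    # Sum the scores of the affective words identified in the sentence.
    return sum(LEXICON.get(w.strip(".,!?"), 0) for w in sentence.lower().split())

def classify(sentence):
    # A positive sum classifies the sentence as positive, a negative sum as negative.
    score = sentence_score(sentence)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"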
Precision  Recall  F1
0.53       0.89    0.67
0.67       0.22    0.33
[Table: for each blog thread, the number of sentences classified as bearing positive sentiment and negative sentiment, together with the number of asserted positive and asserted negative sentences.]
                 R1                R2                RSU4              RL
Sent + Summneg   0.22 (0.18-0.26)  0.09 (0.06-0.11)  0.09 (0.06-0.11)  0.21 (0.17-0.24)
Sent + Summpos   0.21 (0.17-0.26)  0.05 (0.02-0.09)  0.05 (0.02-0.09)  0.19 (0.16-0.23)
SummTAC08        0.348             0.081             0.12
than the standard data sets traditionally used for summarization evaluation. Nevertheless, in our case, the LSA method, being a statistical method, proved to be quite robust to variations in the input data and, most importantly, to the change of domain. We used the F1 score instead of the recall used at TAC, because the lengths of our model summaries vary from one thread to another.
t = (x̄1 - x̄2) / (sp · sqrt(1/n1 + 1/n2)),   with   sp² = ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2),
where
In the setting considered, n is the number of sample points, 1 denotes group one and 2 denotes group two. More specifically, in our case group one is composed of all the comments annotated as salient in our corpus (i.e., gold summary comments) and group two is composed of all the comments that were not annotated (i.e., gold non-summary comments). Furthermore, we further slice the data by polarity (as produced by the sentiment analysis tool), so we have two samples (i.e., group one and group two) for the case of positive comments and two samples for the case of negative comments. For example, out of all the comments that were assigned a positive score by the sentiment analysis tool, those that were also annotated as positive by the annotators constitute group one for the positive polarity case, and those that were not annotated at all constitute group two for the positive polarity case. The same reasoning applies to the negative polarity case.
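Under these definitions, the procedure reduces to a standard two-sample t-test with pooled variance; the sketch below uses scipy on illustrative intensity values (the real groups are the gold summary and gold non-summary comments).

from scipy.stats import ttest_ind

# Sentiment intensities of gold summary comments (group one) and of
# gold non-summary comments (group two); the values are illustrative only.
group_one = [4.0, 3.5, 5.0, 4.2, 3.8]
group_two = [3.9, 4.1, 3.6, 4.4, 3.7, 4.0]

# equal_var=True corresponds to the equal-variances assumption of the test.
t_stat, p_value = ttest_ind(group_one, group_two, equal_var=True)
print(t_stat, p_value)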
EVALUATION OF THE OPINION SUMMARIZATION PROPOSAL
BASED ON SENTIMENT INTENSITY
The performance results of the sentiment analysis are shown in Table 5.28.
System   Precision  Recall  F1
Sentneg  0.98       0.54    0.69
Sentpos  0.07       0.69    0.12
Table 5.28: Performance of the sentiment analysis system
System             R1     R2     RSU4   RL
SISummneg at 15%   0.07   0.03   0.03   0.07
SISummpos at 15%   0.22   0.03   0.03   0.19
SISummneg at 30%   0.17   0.06   0.06   0.16
SISummpos at 30%   0.19   0.03   0.03   0.17
TopSummTAC08              0.111  0.142
BottomSummTAC08           0.069  0.081
Table 5.29: Evaluation of the sentiment summarization system with ROUGE scores
We used the same evaluation metrics as the ones employed in our previous efforts (Balahur et al., 2009g).
There are six rows in Table 5.29: the first (SISummneg at 15%) is the performance of the sentiment-intensity-based summarizer (SISumm) on the negative posts at a 15% compression rate; the second (SISummpos at 15%) presents the performance of SISumm on the positive posts at a 15% compression rate; the third (SISummneg at 30%) is the performance of SISumm on the negative posts at a 30% compression rate; the fourth (SISummpos at 30%) presents the performance of SISumm on the positive posts at a 30% compression rate; and, finally, the fifth and sixth rows correspond to the official scores of the top and bottom performing summarizers at the 2008 Text Analysis Conference Summarization track (TAC08), respectively. The last scores are included to provide some context for the other results. Note that, in order to use the gold polarity alongside the score produced by the sentiment analysis tool as we do, we first had to automatically align all the automatically identified sentences with the annotated comments. The criterion for alignment we used was that at least 70% of the words in an automatically identified sentence must be contained in an annotated comment for it to inherit the gold polarity of that comment (and, by virtue of that, to be considered a gold summary sentence).
DISCUSSION
From Table 5.29 it is evident that the ROUGE scores obtained are low (at least in the context of TAC 2008). This suggests that sentiment intensity alone is not a sufficiently representative feature of the importance of comments for summarization purposes. Thus, using it in combination with other features that have proven useful for summarization, such as the entities mentioned in a given comment (Balahur et al., 2010d), certain cue phrases and surface features, or features capturing the relevance of blog posts to the main topic, is likely to yield better results. In particular, incorporating topic detection features would be crucial, since at the moment off-topic, but very negative or very positive, comments are clearly bad choices for a summary, and currently we employ no means of filtering these out.
There is also an alternative interpretation of the attained results. These results were obtained using a methodology employed in text summarization research, so it is possible that the method is not particularly well suited to the task at hand, that of producing sentiment-rich summaries. Hence, the reason for the low results may be that we addressed the problem in the context of a slightly different task, suggesting that the task of producing content-based summaries and that of producing sentiment-based summaries are two distinct tasks which require a different treatment. In addition to the above results, we performed a statistical hypothesis test. The values of the variables and the resulting t-statistic values are shown in Table 5.30.
Negative polarity: -3.95, -4.04, 1092, 1381, 10.13, 10.5, 0.021; positive polarity: 4.37, 4.26, 48, 1268, 9.3, 28.03, 0.036.
Table 5.30: Values for the variables and the resulting t-statistic for the 2-sample t-test, unequal sample sizes, equal variances
In both cases, negative and positive polarity, the t values obtained are not large enough for us to reject the null hypothesis in favor of the alternative hypothesis. That is, we do not have any empirical evidence to reject the null hypothesis that the sentiment intensity of salient blog comments is no different from the sentiment intensity of non-salient comments, in favor of our alternative hypothesis that, indeed, sentiment intensity in summary blog comments differs from that of non-summary blog comments.
We conclude that, based on our annotated corpus, the hypothesis that very positive or very negative sentences are also good summary sentences does not hold. But, once again, we point out that these results are meaningful in the context of text summarization, that is, the task of producing content-based summaries. Hence, the observation made above, that producing content-based summaries is different from producing sentiment-based summaries and that as such these tasks should be treated differently, also applies in this case. We note, however, that the results on our corpus are not directly comparable with those of TAC08, since the data sets are different and the tasks involved are significantly distinct. Blog posts in our corpus were annotated as important with respect to the main topic of the respective blog threads.
resulting pages. For each of these 5 corpora, we apply LSA, using the Infomap NLP
Software52. Subsequently, we compute the 100 most associated words with two of
the terms that are most associated with each of the 5 topics and the 100 most
associated words with the topic word. For example, for the term "bank", which is
associated with "economy", we obtain (the first 20 terms):
bank: 1.000000; money: 0.799950; pump: 0.683452; switched: 0.682389; interest: 0.674177;
easing: 0.661366; authorised: 0.660222; coaster: 0.656544; roller: 0.656544;
maintained: 0.656216; projected: 0.656026; apf: 0.655364; requirements: 0.650757;
tbills: 0.650515; ordering: 0.648081; eligible: 0.645723;
ferguson's: 0.644950; proportionally: 0.63358; integrate: 0.625096; rates: 0.624235
52 http://infomap-nlp.sourceforge.net/
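As a rough illustration of how such term-term associations can be obtained, the sketch below computes LSA similarities with a truncated SVD and cosine similarity; the Infomap NLP Software implements its own pipeline, so the function and parameter names here are ours:

```python
import numpy as np

def lsa_word_associations(term_doc_matrix, vocab, query_word, k=100, top_n=20):
    """Return the top_n words most associated with query_word in the LSA space.

    term_doc_matrix: |vocab| x |documents| matrix of (weighted) term counts.
    vocab: list of words, aligned with the rows of the matrix.
    """
    # Truncated SVD projects every term into a k-dimensional latent space.
    U, s, _ = np.linalg.svd(term_doc_matrix, full_matrices=False)
    term_vectors = U[:, :k] * s[:k]
    query_vec = term_vectors[vocab.index(query_word)]
    # Cosine similarity between the query term and every term in the vocabulary.
    norms = np.linalg.norm(term_vectors, axis=1) * np.linalg.norm(query_vec)
    sims = term_vectors @ query_vec / np.maximum(norms, 1e-12)
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order[:top_n]]
```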
LSA summarizer on the negative posts (i.e., using only words), the second one,
Sent + Summneg, is the enhanced LSA summarizer exploiting entities and IS-A
relationships as given by the MeSH taxonomy, the third one, Sent + BLSummpos,
presents the performance of the baseline LSA summarizer on the positive posts and
the fourth one, Sent+Summpos, is the enhanced LSA summarizer for the positive
posts.
System | R1 | R2 | RSU4 | RL
Sent+BLSummneg | 0.22 (0.18-0.26) | 0.09 (0.06-0.11) | 0.09 (0.06-0.11) | 0.21 (0.17-0.24)
Sent+Summneg | 0.268 | 0.087 | 0.087 | 0.253
Sent+BLSummpos | 0.21 (0.17-0.26) | 0.05 (0.02-0.09) | 0.05 (0.02-0.09) | 0.19 (0.16-0.23)
Sent+Summpos | 0.275 | 0.076 | 0.076 | 0.249
Table 5.31: Results of the opinion summarization process
Based on Table 5.31 we can say that the results obtained with the enhanced LSA
summarizer are overall better than the baseline summarizer. The numbers in bold
show statistically significant improvement over the baseline system (note they are
outside of the confidence intervals of the baseline system). The one exception,
where there is a slight drop in performance of the enhanced summarizer with
respect to the baseline system, is in the case of the negative posts for the metrics R2
and RSU4; however, the F1 is still within the confidence intervals of the baseline
system, meaning the difference is not statistically significant.
We note that the main improvement in the performance of the enhanced
summarizer comes from better precision and either no loss or minimal loss in recall
with respect to the baseline system. The improved precision can be attributed, on
one hand, to the incorporation of entities and IS-A relationships, but also, on the
other hand, to the use of a better sentiment analyzer than the one used to produce
the results of the baseline system.
We conclude that by using a combined topic-sentiment approach in opinion
mining and exploiting higher-level semantic information, such as entities and IS-A
relationships, in the summarization process, we obtain a tangible improvement for
the opinion-oriented summarization of blogs.
sentences in each review using the summarization system described above and
obtained the 15% most salient sentences in each of these two sets. We then
computed the overall score of the reviews as the sum of the normalized positive score
(total score of the selected positive sentences divided by the number of selected
positive sentences) and the normalized negative score (total score of the selected
negative sentences divided by the number of selected negative sentences).
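A minimal sketch of this scoring step, with our own function name and under the assumption that the summarizer has already selected the salient positive and negative sentences, is given below:

```python
def review_score(selected_pos_scores, selected_neg_scores):
    """Overall review score: normalized positive score plus normalized negative score."""
    def normalized(scores):
        # Average sentence score; an empty selection contributes nothing.
        return sum(scores) / len(scores) if scores else 0.0
    return normalized(selected_pos_scores) + normalized(selected_neg_scores)

# e.g. review_score([1.5, 2.0], [-0.5, -3.0]) yields 1.75 + (-1.75) = 0.0,
# i.e. a review whose positive and negative evidence cancel out.
```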
In the fourth experiment (Summarizer+OM in Table 5.32), we first process each
review with the summarization system and obtain the top 15% most important
sentences. Subsequently, we compute the sentiment score in each of these
sentences, using the opinion mining system.
EVALUATION AND DISCUSSION
The results obtained in these four approaches are presented in Table 5.32. They are
also compared against a random baseline, obtained by an average of 10 random
baselines computed over a set of 80 examples to be classified into positive or
negative (we have 28 negative and 52 positive reviews).
Approach | Accuracy
Document level | 0.62
OM+Top-scoring sentences | 0.76
OM+Summarizer | 0.8
Summarizer+OM | 0.66
Random baseline | 0.47
sentence score. This shortcoming can be overcome, as seen in the results, by using
the summarization system, which adds information on the importance of the
sentences as far as the information content is concerned.
6.1. INTRODUCTION
In the previous chapters, we explored the task of sentiment analysis in different text
types and languages, proposing a variety of methods that were appropriate for
tackling the issues in each particular text type. Most of the time, however, the
approaches we took were limited to discovering only the situations where sentiment
was expressed explicitly (i.e. where linguistic cues could be found in the text to
indicate it contained subjective elements or sentiment).
Nevertheless, in many cases, the emotion underlying the sentiment is not
explicitly present in text, but is inferable based on commonsense knowledge (i.e.
emotion is not explicitly, but implicitly expressed by the author, by presenting
situations which most people, based on commonsense knowledge, associate with an
emotion). In this final chapter, we will present our contribution to the issue of
automatically detecting emotion expressed in text in an implicit manner.
Firstly, we present our initial approach, which is based on the idea that emotion
is triggered by specific concepts, according to their relevance, seen in relation to the
basic needs and motivations (Maslow, 1943; Max-Neef 1990). This idea is based on
the Relevance Theory (Sperber and Wilson, 2000). Subsequently, based on the
Appraisal Theory models (De Rivera, 1977; Frijda, 1986; Ortony, Clore and
Collins, 1988; Johnson-Laird and Oatley, 1989), we abstract on our initial idea and
set up a framework for representing situations described in text as chains of actions
and their appraisal values, in the form of a knowledge base. We show the manner in
which additional knowledge on the properties of the concepts involved in such
situations can be imported from external sources and how such a representation is
useful to obtain an accurate label of the emotion expressed in text, without any
linguistic clue being present therein.
6.1.1. MOTIVATION
Remembering the definition we provided in Chapter 2, sentiment53 suggests a
settled opinion reflective of one's feelings - the conscious subjective experience of
emotion (Van den Bos, 2006). Thus, sentiments cannot be present without an
emotion being expressed in text, either implicitly or explicitly. For this reason,
detecting implicit expressions of emotion can increase the performance of
sentiment analysis systems, making them able to spot sentiment even in the cases
where it is not directly stated in the text, but results from the reader's emotion, as a
consequence of interpreting what is said.
Detecting emotion is a more difficult task than mere sentiment analysis from
text, as it involves classification into a larger number of categories (i.e. emotion
labels), which are not as easily separable or distinguishable as the positive and
negative classes, because of their number (at least 6 basic emotions54) and
characteristics. Although emotion detection is a problem related to sentiment
analysis, we chose to present the work done in this area separately, as we consider
that the former problem is more difficult and requires specific methods and tools to
be tackled.
This chapter is structured as follows: we first give a brief introduction on the
concepts of emotion and the background of the work presented. Subsequently, we
describe and evaluate the emotion trigger method, put forward by Balahur and
Montoyo (2008), in which the main idea is to create a collection of terms that
invoke an emotion based on their relevance to human needs and motivations.
Finally, we present EmotiNet, the framework we built for the detection of emotion
implicitly expressed in text (Balahur et al., 2011a; Balahur et al., 2011b). The
underlying mechanism of this framework is the EmotiNet knowledge base, which
was built on the idea of situation appraisal using commonsense knowledge.
53 http://www.merriam-webster.com/
54 One of the most widely used classifications of emotions is that of Paul Ekman (1972), which
includes 6 basic emotions. Other models, such as the ones proposed by Parrot (2001) or Plutchik
(2001), include a higher number of basic emotions, as well as secondary and tertiary emotions.
6.1.3. BACKGROUND
Understanding the manner in which humans express, sense and react to emotion has
always been a challenge, each period and civilization giving a distinct explanation
and interpretation to the diversity of sentiments (Oatley, 2004). Societies used
emotions for the definition of social norms, for the detection of anomalies, and even
for the explanation of mythical or historical facts (e.g. the anger and wrath of the
Greek Gods, the fear of the unknown and the Inquisition in Middle Age, the
romantic love in Modern Times) (Ratner, 2000). The cultural representations
concerning emotions are generally ambivalent (Goldie, 2000, Evans, 2001, Oatley
et al., 2006). Emotions were praised for their persuasive power (Aristotle defines
pathos the ability to appeal to the audiences emotions as the second component
of the art of rhetorics), but also criticized as a weakness of the human being,
which should ideally be rational.
Different scientific theories of emotion have been developed along the last
century of research in philosophy, psychology, cognitive sciences or neuroscience,
each trying to offer an explanation to the diversity of affect phenomena. There were
different attempts to build systems that automatically detect emotion from text in
the 70s and 80s. However, it was not until 1995, when Rosalind Picard consecrated
the term affect computing in Artificial Intelligence (Picard, 1995) that the interest
computer engineers expressed towards the research in emotion increased
significantly. The need to develop systems that are able to detect and respond to
affect in an automatic manner has become even more obvious in the past two
decades, when a multitude of environments of interaction between humans and
computers has been built e.g. e-learning sites, social media applications and
intelligent robots. On the one hand, if such environments are able to detect emotion,
they can better adapt to the user needs. On the other hand, if they are able to express
emotion, they create a more natural type of interaction. Despite the fact that Picard
(1995) identified three different types of systems dealing with automatic affect
processing (systems detecting emotions, systems expressing what a human would
perceive as emotion and systems feeling an emotion), most of the research in affect
181
computing has so far concentrated solely on the first type of systems (Calvo and
DMello, 2010).
In Natural Language Processing (NLP), the task of detecting emotion expressed
in text has grown in importance in the last decade, together with the development of
the Web technologies supporting social interaction. Although different approaches
to tackle the issue of emotion detection in text have been proposed by NLP
researchers, the complexity of the emotional phenomena and the fact that
approaches most of the time consider only the word level have led to a low
performance of the systems implementing this task, e.g. the ones participating in
the SemEval 2007 Task No. 14 (Strapparava and Mihalcea, 2007). The first
explanation for these results, supported by linguistic studies and psychological
models of emotion, is that expressions of emotion are most of the time not direct,
through the use of specific words (e.g. "I am angry."). In fact, according to a
linguistic study by Pennebaker et al. (2003), only 4% of words carry an affective
content. Most of the time, the affect expressed in text results from the
interpretation of the situation presented therein (Balahur and Montoyo, 2008;
Balahur and Steinberger, 2009), from the properties of the concepts involved and
how they are related within the text. In this sense, the first experiments we
performed aimed at building a lexicon of terms whose presence in text trigger
emotion. Subsequently, we described a framework for detecting and linking
concepts (and not just words) that are used to implicitly express emotion in a text.
are at the top. The basic needs are the general human ones; as we move towards the
top, we find the more individual-dependent ones.
[Figure: Maslow's pyramid of needs (upper levels shown: 3. Friendship, family, intimacy; 4. Self-esteem, confidence; 5. Self-actualization)]
55 http://www.rainforestinfo.org.au/background/maxneef.htm
preserve the intended meaning, by taking the top relevant domain of each word
sense and assigning the corresponding verb or noun in NomLex the sense number
that has the same top relevant domain. If more such senses exist, they are all added.
Using EuroWordNet56, we map the words in the English lexical database of
emotion triggers to their Spanish correspondents, preserving the meaning through
the WordNet sense numbers.
The final step in building the lexical databases consists of adding terms for
real-world situations and culture-dependent contexts to the two lexical databases.
For English, we add the concepts in ConceptNet57 that are linked to the emotion
triggers contained so far in the lexicon through the relations DefinedAs, LocationOf,
CapableOf, PropertyOf and UsedFor. For Spanish, we add the cultural context by
using the Larousse Ideological Dictionary of the Spanish Language.
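As a rough sketch of this expansion step (not the exact implementation; the triple format and function name are our own assumptions), related concepts can be pulled into the trigger lexicon as follows:

```python
# Relations through which contextually related concepts are pulled into the lexicon.
EXPANSION_RELATIONS = {"DefinedAs", "LocationOf", "CapableOf", "PropertyOf", "UsedFor"}

def expand_triggers(triggers, assertions):
    """Add every concept linked to an existing emotion trigger by one of the
    selected relations.

    triggers:   set of emotion-trigger terms collected so far
    assertions: iterable of (concept1, relation, concept2) triples, e.g. read
                from a local dump of ConceptNet assertions (format assumed here)
    """
    expanded = set(triggers)
    for concept1, relation, concept2 in assertions:
        if relation not in EXPANSION_RELATIONS:
            continue
        if concept1 in triggers:
            expanded.add(concept2)
        elif concept2 in triggers:
            expanded.add(concept1)
    return expanded
```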
56 http://en.wikipedia.org/wiki/EuroWordNet
57 http://web.media.mit.edu/~hugo/conceptnet/
triggers built from Maslow's pyramid. In the case of emotion triggers stemming
from Max-Neef's matrix of fundamental human needs, the weights of the valence
shifters are multiplied by the emotion-category association ratio, computed for each
emotion trigger and each of the four existential categories. In order to determine
the importance of the concepts to a specific domain, we employ the association
ratio formula.
The association ratio score quantifies how strongly a word is associated with its
most relevant and common domain. The formula for calculating it is:

$AR(w; D) = \Pr(w, D) \log_2 \frac{\Pr(w, D)}{\Pr(w)\,\Pr(D)}$

where $\Pr(w, D)$ is the probability of word $w$ occurring in domain $D$, $\Pr(w)$ the
probability of $w$ and $\Pr(D)$ the probability of the domain.
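A minimal sketch of this computation from raw counts (the count-based probability estimates are our assumption) could look as follows:

```python
import math

def association_ratio(count_w_d, count_w, count_d, total):
    """AR(w; D) = Pr(w, D) * log2( Pr(w, D) / (Pr(w) * Pr(D)) ).

    count_w_d: occurrences of word w inside domain D
    count_w:   occurrences of w in the whole corpus
    count_d:   total number of word occurrences in domain D
    total:     total number of word occurrences in the corpus
    """
    p_wd = count_w_d / total
    p_w = count_w / total
    p_d = count_d / total
    if p_wd == 0.0:
        return 0.0
    return p_wd * math.log2(p_wd / (p_w * p_d))
```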
[Figure: CI module - Emotion]
The total valence of the text equals the sum of the weighted valences of all emotion
triggers. The obtained value is rounded to the closest of the two possible values: 0
and 1.
Further on, we calculate the emotions present in the text by the following
method:
- for each emotion trigger stemming from Maslow's pyramid, we compute the
emotion-to-level association ratio;
- for each emotion trigger stemming from Max-Neef's matrix, we compute the
emotion-to-category association ratio.
We then apply the Construction Integration model in a manner similar to that
described by Lemaire (2005) and construct a spreading activation network. We
consider the working memory as being composed of the set of emotion triggers and
their emotion association ratio value, which is taken as activation value. The
semantic memory is made up of the modifiers and the top 5 synonyms and antonyms
of the emotion triggers, with their AR values. We set the value of each emotion
trigger to 1. We create a link between all concepts in the semantic memory and all
the emotion triggers, and consider the strength of a link to be the higher of the two
emotion trigger association ratio scores.
The text is processed in the order in which the emotion triggers appear and we
finally obtain the activation value for each emotion trigger.
The output values for the emotions in the text are obtained by multiplying the
activation values by 100 and adding up the scores obtained for the same emotion
from different emotion triggers, where applicable.
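The following is only a rough sketch of such a pass over the network (the real Construction Integration model iterates until the activations stabilize; all names below are ours):

```python
def spread_activation(triggers, semantic_memory, link_strength):
    """One simplified spreading-activation pass over the emotion triggers.

    triggers:        {trigger: emotion} for the triggers found in the text
    semantic_memory: {concept: activation} for modifiers, synonyms and antonyms
    link_strength:   {(concept, trigger): weight}, the higher of the two AR scores
    """
    # Every trigger starts with activation 1, as in the setup described above.
    activation = {trigger: 1.0 for trigger in triggers}
    for concept, act in semantic_memory.items():
        for trigger in triggers:
            # Activation flows from semantic memory into the triggers,
            # scaled by the strength of the link between the two nodes.
            activation[trigger] += act * link_strength.get((concept, trigger), 0.0)
    # Aggregate per emotion, scaling by 100 as in the output step described above.
    emotion_scores = {}
    for trigger, emotion in triggers.items():
        emotion_scores[emotion] = emotion_scores.get(emotion, 0.0) + 100 * activation[trigger]
    return emotion_scores
```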
the state of a physical or emotional object) and the context in which they
take place, using ontological representations. In this way, we abstract from
the treatment of texts as mere sequences of words to a conceptual
representation, able to capture the semantics of the situations described, in
the same manner as psychological models claim that humans do.
2. Design and populate a knowledge base of action chains called EmotiNet,
based on the proposed model. We will show the manner in which using the
EmotiNet ontologies, we can describe the elements of the situation (the
actor, the action, the object etc.) and their properties - corresponding to
appraisal criteria. Moreover, we demonstrate that the resource can be
extended to include all such defined criteria, either by automatic extraction,
extension with knowledge from other sources, such as ConceptNet (Liu and
Singh, 2004) or VerbOcean (Chklovski and Pantel, 2004), inference or by
manual input.
Motivated by the fact that most of the research in psychology has been
done on self-reported affect, the core of the proposed resource is built from
a subset of situations and their corresponding emotion labels taken from the
International Survey on Emotion Antecedents and Reactions (ISEAR)
(Scherer and Wallbott, 1997).
3. Propose and validate a method to detect emotion in text based on EmotiNet
using new examples from ISEAR. We thus evaluate the usability of the
resource and demonstrate the appropriateness of the proposed model.
appraisal criteria proposed in the different theories do converge and cover the same
type of appraisals.
Examples of such criteria are the ones proposed and empirically evaluated by
Lazarus and Smith (1988), organized into four categories:
i. Intrinsic characteristics of objects and events;
ii. Significance of events to individual needs and goals;
iii. Individual's ability to cope with the consequences of the event;
iv. Compatibility of event with social or personal standards, norms and values.
Scherer (1988) proposed five different categories of appraisal (novelty, intrinsic
pleasantness, goal significance, coping potential, compatibility standard),
containing a list of 16 appraisal criteria (suddenness, familiarity, predictability,
intrinsic pleasantness, concern relevance, outcome probability, expectation,
conduciveness, urgency, cause: agent, cause: motive, control, power, adjustment,
external compatibility standards, internal compatibility standards). He later used the
values of these criteria in self-reported affect-eliciting situations to construct the
vectorial model in the expert system GENESIS (Scherer, 1993). The system
maintains a database of 14 emotion vectors (corresponding to 14 emotions), with
each vector component representing the quantitative measure associated with the
value of an appraisal component. The values for new situations are obtained by
asking the subject a series of 15 questions, from which the values for the appraisal
factors considered (components of the vector representing the situation) are
extracted. Subsequently, the label assigned to the emotional experience is computed
by calculating the most similar vector in the database of emotion-eliciting
situations.
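A minimal sketch of this nearest-vector step (GENESIS's actual similarity measure and vector contents are not reproduced here; all names are ours) could be:

```python
import math

def classify_emotion(situation_vector, emotion_vectors):
    """Return the emotion whose appraisal vector is closest to the situation vector.

    emotion_vectors: {label: [v1, ..., vn]}, one vector per modelled emotion,
    each component being the quantitative value of one appraisal criterion.
    """
    def distance(a, b):
        # Euclidean distance between two appraisal vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(emotion_vectors,
               key=lambda label: distance(situation_vector, emotion_vectors[label]))
```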
The appraisal models defined in psychology have also been employed in
linguistics. The Appraisal framework (Martin and White, 2005) is a development of
work in Systemic Functional Linguistics (Halliday, 1994) and is concerned with
interpersonal meaning in text, i.e. the negotiation of social relationships by
communicating emotion, judgement and appreciation.
KNOWLEDGE BASES FOR NLP APPLICATIONS
As far as knowledge bases are concerned, many NLP applications have been
developed using manually created knowledge repositories such as WordNet
(Fellbaum, 1998), CYC58, ConceptNet or SUMO59. Some authors tried to learn
ontologies and relations automatically, using sources that evolve in time, e.g. Yago
(Suchanek et al., 2007), which employs Wikipedia to extract concepts, using rules
and heuristics based on the Wikipedia categories.
58 http://cyc.com/cyc/opencyc/overview
59 http://www.ontologyportal.org/index.html
Other approaches to knowledge base population were proposed by Pantel et al.
(2004) and, for relation learning, by Berland and Charniak (1999). DIPRE (Brin,
1998) and Snowball (Agichtein and Gravano, 2000) label a small set of instances
and use hand-crafted patterns to extract ontology concepts.
depending on the context in which they appear (e.g. (6) "I must go to this party", in
which the obligation aspect completely changes the emotion label of the situation).
As we can see from examples (3) to (6), even systems employing world knowledge
(concepts instead of words) would fail in most of these cases.
Similarly, we can show that while the fuzzy models of emotion perform well for
a series of cases that fit the described patterns, they remain weak at the time of
acquiring, combining and using new information.
The most widely used methods of affect detection in NLP are those based on
machine learning models built from corpora. Even such models, while possibly
strong in determining lexical or syntactic patterns of emotion, are limited as far as
the extraction of the text meaning is concerned, even when deep text understanding
features are used. They are also limited when it comes to, for example, semantically
combining the meanings of different situations that, taken separately, lead to no
affective reaction, but whose interaction does (i.e. when world knowledge is required
to infer the overall emotional meaning of the situation).
Besides the identified shortcomings, which can be overcome by using existing
methods, there are also other issues, which none of the present approaches consider.
Even following only our own intuition, without reference to any scientific model of
emotion, we can say that the development of emotional states is highly dependent
on the current affective state; moreover, the context in which the action takes place,
the characteristics of the agent performing it or of the object of the action, and all
the peculiarities concerning these elements can influence the emotion felt in a
specific situation.
Given the identified pitfalls of the current systems and their inability to take
into account such factors as context and the characteristics of the elements in it, we
propose a new framework for modeling affect that is robust and flexible and that is
based on the most widely used model of emotion, that of the appraisal theories.
The implementation of such models showed promising results (Scherer, 1993).
However, they simply represented in a quantitative manner the appraisal criteria in
a self-reported affective situation, using multiple choice questionnaires. The
problem becomes much more complex, if not impossible, when such factors have to
be automatically extracted from text. If the appraisal criteria for the
actor/action/object of the situation are not presented in the text, they cannot be
extracted from it.
Given all these considerations, our contribution consists in proposing and
implementing a resource for modeling affect based on the appraisal theory, that can
support:
a) The automatic processing of texts to extract:
- The components of the situation presented (which we denote by action chains)
and their relation (temporal, causal etc.);
- The elements on which the appraisal is done in each action of the chain (agent,
action, object);
- The appraisal criteria that can automatically be determined from the text
(modifiers of the action, actor, object in each action chain);
b) The inference on the value of the appraisal criteria, extracted from external
knowledge sources (characteristics of the actor, action, object or their modifiers that
are inferable from text based on common-sense knowledge);
c) The manual input of appraisal criteria for a specific situation.
Situation: "I went to buy a bicycle with my father. When I wanted to pay, my father took his purse and payed."

No. | Question | Can we answer?
1 | Did the situation that elicited your emotion happen very suddenly or abruptly? | YES
2 | Did the situation concern an event or action that had happened in the past, that had just happened or that was expected in the future? | YES
3 | This type of event, independent of your personal evaluation, would it be generally considered as pleasant or unpleasant? | WK
4 | Was the event relevant for your general well-being, for urgent needs you felt, or for specific goals you were pursuing at the time? | WK
5 | Did you expect the event and its consequences before the situation actually happened? | YES
6 | Did the event help you or hinder you in satisfying your needs, in pursuing your plans or in attaining your goals? | YES
7 | Did you feel that action on your part was urgently required to cope with the event and its consequences? | WK
8 | Was the event caused by your own actions - in other words, were you partially or fully responsible for what happened? | YES/I
9 | Was the event caused by one or several other persons - in other words, were other people fully or partially responsible for what happened? | YES
10 | Was the event mainly due to chance? | NO
11 | Can the occurrence and the consequences of this type of event generally be controlled or modified by human action? | NO
12 | Did you feel that you had enough power to cope with the event, i.e. being able to influence what was happening or to modify the consequences? | NO
13 | Did you feel that, after you used all your means of intervention, you could live with the situation and adapt to the consequences? | WK
14 | Would the large majority of people consider what happened to be quite in accordance with social norms and morally acceptable? | WK/FD
15 | If you were personally responsible for what happened, did your action correspond to your self image? | NO
Table 6.3. Analysis of the possibility to extract answers concerning appraisal criteria
from a self-reported affective situation (the questions are reproduced from Scherer, 1993)
As we can see from Table 6.3, the majority of appraisal criteria cannot be
extracted automatically from the text, as there is no information on them directly
mentioned therein. Some criteria can only be inferred from what is said in the text,
others depend on our use of world knowledge, and there are even questions that we
cannot answer, since those details are specific to the person living the reported
situation.
Nonetheless, this analysis can offer us a very good insight into the phenomena
involved in the appraisal process, from which we can extract a simpler
representation. Viewed in this simpler manner, a situation is presented as a chain of
actions, each with an author and an object; the appraisal depends on the temporal
and causal relationship between them, on the characteristics of the actors involved
in the action and on the object of the action.
Given this insight, the general idea behind our approach is to model situations as
chains of actions and their corresponding emotional effect using an ontological
representation. According to the definition provided by Studer et al. (1998), an
ontology captures knowledge shared by a community that can easily be shared
with other communities. These two characteristics are especially relevant if we
want the recall of our approach to be increased. The knowledge managed in our
approach has to be shared by a large community and it also needs to be fed by
heterogeneous sources of common knowledge to avoid uncertainties. However,
specific assertions can be introduced to account for the specificities of individuals
or contexts.
In this manner, we can model the interaction of different events in the context in
which they take place and add inference mechanisms to extract knowledge that is
not explicitly present in the text. We can also include knowledge on the appraisal
criteria relating to different concepts found in other ontologies and knowledge bases
(e.g. "The man killed the mosquito." does not produce the same emotional effect as
"The man killed his wife." or "The man killed the burglar in self-defence.", because
the criteria used to describe them are very different).
At the same time, we can define the properties of emotions and how they
combine. Such an approach can account for the differences in interpretation, as the
specific knowledge on the individual beliefs or preferences can be easily added as
action chains or affective appraisals (properties) of concepts.
60 http://www.cs.waikato.ac.nz/ml/weka/
http://www.unige.ch/fapse/emotion/databanks/isear.html
ONTOLOGY DESIGN
The process of building the core of the EmotiNet knowledge base (KB) of action
chains started with the design of the knowledge core, in our case an ontology.
Specifically, the design process was divided into three stages:
1. Establishing the scope and purpose of the ontology. The ontology we
propose has to be capable of defining the concepts required in a general
manner, which will allow it to be expanded and specialized by external
knowledge sources. Specifically, the EmotiNet ontology needs to capture
and manage knowledge from three domains: kinship membership, emotions
(and their relations) and actions (characteristics and relations between
them).
2. Reusing knowledge from existing ontologies. In a second stage, we
searched for other ontologies on the Web that contained concepts related to
the knowledge cores we needed. At the end of the process, we located two
ontologies that would be the basis of our ontological representation: the
ReiAction ontology62, which represents actions between entities in a general
manner and whose RDF (Resource Description Framework) graph is
depicted in Figure 6.3, and the family relations ontology63, which contained
knowledge about family members and the relations between them.
3. Building our own knowledge core from the ontologies imported. This
third stage involved the design of the last remaining core, i.e. emotion, and
the combination of the different knowledge sources into a single ontology:
EmotiNet. In this case, we designed a new knowledge core from scratch
based on a combination of the models of emotion presented (see Figure
6.4). This knowledge core includes different types of relations between
emotions and a collection of specific instances of emotion (e.g. anger, fear,
joy). In the last step, these three cores were combined using new classes and
relations between the existing members of these ontologies.
ONTOLOGY EXTENSION AND POPULATION
The next process extended EmotiNet with new types of action and instances of
action chains using real examples from the ISEAR corpus. We began the process by
manually selecting a subset of 175 documents from the collection, after applying
the SRL system proposed by Moreda et al. (2007), with expressions related to all
the emotions: anger (25), disgust (25), guilt (25), fear (25), sadness (25), joy (25)
and shame (25). The criteria for selecting this subset were the simplicity of the
sentences and the variety of actions described.
62 www.cs.umbc.edu/~lkagal1/rei/ontologies/ReiAction.owl
63 www.dlsi.ua.es/~jesusmhc/EmotiNet/family.owl
which must end with an instance of Feel. Figure 6.6 shows an example of an RDF
graph, previously simplified, with the action chain of our example.
[Figure 6.6: Simplified RDF graph of the action chain, with instances such as mother_f1, daughter_f1, go_1, party_1, oblige_1, feel_anger_1, sequence_1 and sequence_2, connected through relations such as actor, target, hasChild, implies, emotionFelt, first and second, and the emotion instances anger and disgust]
http://jena.sourceforge.net/
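To give a concrete feel for such a representation, the sketch below builds a comparable action chain with rdflib; the namespace, the exact nesting of the sequence nodes and the assignment of actors and targets are only our reading of the figure labels, not the actual EmotiNet ontology:

```python
from rdflib import Graph, Namespace, RDF

# Hypothetical namespace for illustration; EmotiNet actually combines the
# ReiAction ontology, a family-relations ontology and an emotion core.
EMO = Namespace("http://example.org/emotinet#")

g = Graph()
g.bind("emo", EMO)

# Family relation between the two actors.
g.add((EMO.mother_f1, EMO.hasChild, EMO.daughter_f1))

# Actions of the chain (the reading of actors and targets is illustrative only).
g.add((EMO.go_1, EMO.actor, EMO.daughter_f1))
g.add((EMO.go_1, EMO.target, EMO.party_1))
g.add((EMO.oblige_1, EMO.actor, EMO.mother_f1))
g.add((EMO.oblige_1, EMO.target, EMO.daughter_f1))
g.add((EMO.oblige_1, EMO.implies, EMO.disgust))

# The chain must end with an instance of Feel, carrying the emotion felt.
g.add((EMO.feel_anger_1, RDF.type, EMO.Feel))
g.add((EMO.feel_anger_1, EMO.actor, EMO.mother_f1))
g.add((EMO.feel_anger_1, EMO.emotionFelt, EMO.anger))

# Sequence nodes order the actions of the chain.
g.add((EMO.sequence_1, EMO.first, EMO.go_1))
g.add((EMO.sequence_1, EMO.second, EMO.sequence_2))
g.add((EMO.sequence_2, EMO.first, EMO.oblige_1))
g.add((EMO.sequence_2, EMO.second, EMO.feel_anger_1))

print(g.serialize(format="turtle"))
```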
The more sources of general knowledge we add, the more flexible EmotiNet will be,
thus increasing the possibilities of processing unseen action chains.
not present in the ontology. The experiment was applied to each of the three test
sets (A, B, C), and the corresponding results are marked with A1, B1 and C1,
respectively.
(2). A subsequent approach aimed at surpassing the issues raised by the missing
knowledge in EmotiNet. In a first approximation, we aimed at introducing extra
knowledge from VerbOcean, by adding the verbs that were similar to the ones in the
core examples (represented in VerbOcean through the similar relation).
Subsequently, each of the actions in the examples to be classified that was not
already contained in EmotiNet, was sought in VerbOcean. In case one of the similar
actions was already contained in the KB, the actions were considered equivalent.
Further on, each action was associated with an emotion, using ConceptNet relations
and concepts. Action chains were represented as chains of actions with their
associated emotion. Finally, new examples were matched against chains of actions
containing the same emotions, in the same order. While more complete than the
first approximation, this approach was also affected by lack of knowledge about the
emotional content of actions. To overcome this issue, we proposed two heuristics:
(2a) In the first one, actions on which no affect information was available,
were sought in within the examples already introduced in the EmotiNet and were
assigned the most frequent class of emotion labeling them. The experiment was
applied to each of the three test sets (A, B, C), and the corresponding results are
marked with A2a, B2a and C2a, respectively.
(2b) In the second approximation, we used the most frequent emotion
associated to the known links of a chain, whose individual emotions were
obtained from SentiWordNet. In this case, the core of action chains is not
involved in the process. The experiment was applied to each of the three test sets
(A, B, C), and the corresponding results are marked with A2b, B2b and C2b,
respectively.
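A simplified sketch of the chain-matching step used in this second approach (all data structures and names are ours) is the following:

```python
def match_by_emotion_sequence(new_chain, known_chains):
    """Match a new action chain against stored chains that contain the same
    per-action emotions, in the same order.

    new_chain:    [(action, emotion), ...] for the example to classify
    known_chains: {chain_id: ([(action, emotion), ...], final_emotion)}
    """
    new_emotions = [emotion for _, emotion in new_chain]
    for chain_id, (actions, final_emotion) in known_chains.items():
        if [emotion for _, emotion in actions] == new_emotions:
            return final_emotion
    # The knowledge stored in the KB is not sufficient for this example.
    return None
```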
EVALUATION RESULTS
a) Test set A:
We performed the steps described on the 516 examples (ISEAR phrases
corresponding to the four emotions modelled, from which the examples used as
core of EmotiNet were removed). For the first approach, the queries led to a result
only in the case of 199 examples; for the second approach, approximations (A2a)
and (A2b), in 409. For the remaining ones, the knowledge stored in the KB is not
sufficient for the appropriate action chain to be extracted. Table 6.4 presents the
results of the evaluations on the subset of examples whose corresponding query
returned a result.
Table 6.5 reports on the recall obtained when testing on all examples. The baseline
is random, computed as average of 10 random generations of classes for all
classified examples.
b) Test set B:
We performed the steps described on the 487 examples (ISEAR phrases
corresponding to the four emotions modelled, from which the examples used as
core of EmotiNet were removed). For the first approach, the queries led to a result
only in the case of 90 examples; for the second approach, approximations (B2a), in
165 and (B2b), in 171. For the remaining ones, the knowledge stored in the KB is
not sufficient for the appropriate action chain to be extracted. Table 6.6 presents
the results of the evaluations on the subset of examples whose corresponding query
returned a result. Table 6.7 reports on the recall obtained when testing on all
examples.
The baseline is random, computed as average of 10 random generations of
classes for all classified examples.
c) Test set C:
We performed the steps described on the 895 examples (ISEAR phrases
corresponding to the seven emotions modelled, from which the examples used as
core of EmotiNet were removed). For the first approach, the queries led to a result
only in the case of 571 examples; for the second approach, approximations (C2a),
in 617 and (C2b), in 625. For the remaining ones, the knowledge stored in the KB
is not sufficient for the appropriate action chain to be extracted. Table 6.8 presents
the results of the evaluations on the subset of examples whose corresponding query
returned a result. Table 6.9 reports on the recall obtained when testing on all
examples.
The baseline is random, computed as average of 10 random generations of
classes for all classified examples.
Table 6.10 reproduces the results reported in Danisman and Alpkocak (2008),
which we will mark as DA, on the ISEAR corpus, using ten-fold cross-validation.
We compare them to the results we obtained in the presented experiments. As the
authors only present the mean accuracy obtained for 5 of the emotions in ISEAR
(anger, disgust, fear, joy, sadness) and they perform ten-fold cross-validation on all
examples in the abovementioned corpus, the results are not directly comparable. In
fact, ten-fold cross-validation means that they have used 90% of the cases to train
the classifier and only tested on the remaining 10% of the cases. As a proof of the
significance of the results obtained, we also include the evaluation outcome of the
same system on another corpus (marked as DA SemEval). Again, we cannot
directly compare the results, but we can notice that our system performs much
better in terms of accuracy and recall when tested on new data.
In any case, we believe that such comparisons can give a clearer idea of the task
difficulty and the measure of the success of our approach. On the last line, we
include the results reported for the GENESIS system (Scherer, 1993). However, it
should be noted that this expert system does not directly detect and classify emotion
from text; it only represents answers to a set of questions aimed at determining the
values of the appraisal factors included in the emotional episode, after which it
computes the similarity to previously computed vectors of situations.
Emotion | Correct (A1 / A2a / A2b) | Total (A1 / A2a / A2b) | Accuracy (A1 / A2a / A2b)
disgust | 11 / 25 / 25 | 26 / 59 / 63 | 42.3 / 42.4 / 39.7
anger | 38 / 27 / 26 | 62 / 113 / 113 | 61.3 / 23.9 / 23
fear | 4 / 5 / 7 | 29 / 71 / 73 | 16 / 7.1 / 9.6
guilt | 13 / 30 / 26 | 86 / 166 / 160 | 15.1 / 18.1 / 16.3
Total | 66 / 87 / 84 | 199 / 409 / 409 | 31.2 / 21.3 / 20.5
Baseline | 61 / 84 / 84 | 229 / 409 / 409 | 21.9 / 20.5 / 20.5
Table 6.4. Results of the emotion detection using EmotiNet on classified examples in
test set A
Emotion | Correct (A1 / A2a / A2b) | Total | Recall (A1 / A2a / A2b)
disgust | 11 / 25 / 25 | 76 | 16.3 / 32.9 / 32.9
anger | 38 / 27 / 26 | 148 | 27 / 18.3 / 17.6
fear | 4 / 5 / 7 | 94 | 4.5 / 5.3 / 7.5
guilt | 13 / 30 / 26 | 198 | 7.7 / 15.2 / 13.1
Total | 66 / 87 / 84 | 516 | 14 / 16.9 / 16.2
Baseline | 112 / 112 / 112 | 516 | 21.7 / 21.7 / 21.7
Table 6.5. Results of the emotion detection using EmotiNet on all test examples in
test set A
Emotion | Correct (B1 / B2a / B2b) | Total (B1 / B2a / B2b) | Accuracy (B1 / B2a / B2b)
disgust | 10 / 28 / 29 | 41 / 52 / 67 | 24.39 / 53.85 / 43.28
anger | 16 / 39 / 39 | 102 / 114 / 119 | 15.69 / 34.21 / 32.77
fear | 37 / 43 / 44 | 55 / 74 / 76 | 67.27 / 58.11 / 57.89
guilt | 27 / 55 / 59 | 146 / 157 / 165 | 18.49 / 35.03 / 35.76
Total | 90 / 165 / 171 | 344 / 397 / 427 | 26.16 / 41.56 / 40.05
Table 6.6. Results of the emotion detection using EmotiNet on classified examples in
test set B
Emotion | Correct (B1 / B2a / B2b) | Total | Recall (B1 / B2a / B2b)
disgust | 10 / 28 / 29 | 59 | 16.95 / 47.46 / 49.15
anger | 16 / 39 / 39 | 145 | 11.03 / 26.90 / 26.90
fear | 37 / 43 / 44 | 85 | 43.53 / 50.59 / 51.76
guilt | 27 / 55 / 59 | 198 | 13.64 / 27.78 / 29.80
Total | 90 / 165 / 171 | 487 | 18.48 / 33.88 / 35.11
Baseline | 124 / 124 / 124 | 487 | 0.25 / 0.25 / 0.25
Table 6.7. Results of the emotion detection using EmotiNet on all test examples in
test set B
Emotion | Correct (C1 / C2a / C2b) | Total (C1 / C2a / C2b) | Accuracy (C1 / C2a / C2b)
disgust | 16 / 16 / 21 | 44 / 42 / 40 | 36.36 / 38.09 / 52.50
shame | 25 / 25 / 26 | 70 / 78 / 73 | 35.71 / 32.05 / 35.62
anger | 31 / 47 / 57 | 105 / 115 / 121 | 29.52 / 40.86 / 47.11
fear | 35 / 34 / 37 | 58 / 65 / 60 | 60.34 / 52.30 / 61.67
sadness | 46 / 45 / 41 | 111 / 123 / 125 | 41.44 / 36.58 / 32.80
joy | 13 / 16 / 18 | 25 / 29 / 35 | 52 / 55.17 / 51.43
guilt | 59 / 68 / 64 | 158 / 165 / 171 | 37.34 / 41.21 / 37.43
Total | 225 / 251 / 264 | 571 / 617 / 625 | 39.40 / 40.68 / 42.24
Table 6.8. Results of the emotion detection using EmotiNet on classified examples in
test set C
Emotion | Correct (C1 / C2a / C2b) | Total | Recall (C1 / C2a / C2b)
disgust | 16 / 16 / 21 | 59 | 27.11 / 27.11 / 35.59
shame | 25 / 25 / 26 | 91 | 27.47 / 27.47 / 28.57
anger | 31 / 47 / 57 | 145 | 21.37 / 32.41 / 39.31
fear | 35 / 34 / 37 | 85 | 60.34 / 52.30 / 61.67
sadness | 46 / 45 / 41 | 267 | 17.22 / 16.85 / 15.36
joy | 13 / 16 / 18 | 50 | 26 / 32 / 36.00
guilt | 59 / 68 / 64 | 198 | 29.79 / 34.34 / 32.32
Total | 225 / 251 / 264 | 895 | 25.13 / 28.04 / 29.50
Baseline | 126 / 126 / 126 | 895 | 14.07 / 14.07 / 14.07
Table 6.9. Results of the emotion detection using EmotiNet on all test examples in
test set C
Method | Mean Accuracy
DA: NB (with stemming) | 67.2
DA: NB (without stemming) | 67.4
DA: SVM (with stemming) | 67.4
DA: SVM (without stemming) | 66.9
DA SemEval: NB (with stemming) | F1 = 27.9
DA SemEval: NB (without stemming) | F1 = 28.5
DA SemEval: SVM (with stemming) | F1 = 28.6
DA SemEval: SVM (without stemming) | F1 = 27.8
DA SemEval: Vector Space Model (with stemming) | F1 = 31.5
DA SemEval: Vector Space Model (without stemming) | F1 = 32.2
EmotiNet A1 | 31.2
EmotiNet A2a | 21.3
EmotiNet A2b | 20.5
EmotiNet B1 | 26.16
EmotiNet B2a | 41.56
EmotiNet B2b | 40.05
EmotiNet C1 | 39.4
EmotiNet C2a | 40.68
EmotiNet C2b | 42.24
GENESIS | 77.9
Table 6.10. Comparison of different systems for affect detection using ISEAR or
self-reported affect in general
DISCUSSION AND CONCLUSIONS
From the results in Tables 6.4 to 6.9 we can conclude that the approach is valid,
although much remains to be done to fully exploit the capabilities of EmotiNet.
Given the number of core examples and the results obtained, we can see that the
number of chains corresponding to one emotion in the core does influence the final
result directly. However, the system performs significantly better when an equal
number of core examples is modeled, although when more emotions are evaluated
(the difference between test sets B and C), the noise introduced leads to a drop in
performance.
The comparative results in Table 6.10 show, on the one hand, that the task
of detecting affect in text is very difficult. Thus, even if the appraisal criteria are
directly given to a system (as in the case of GENESIS), its accuracy level only
reaches up to 80%. If a system is trained on 90% of the data in one corpus using
lexical information, its performance reaches around 68%. However, the results
drop significantly when the approach is used on different data, showing that it is
highly dependent on the vocabulary it uses. As opposed to this, the model we
proposed based on appraisal theories proved to be flexible, its level of performance
improving either through a percentage increase, or through the fact that the results for different
emotional categories become more balanced. We showed that introducing new
information can be easily done from existing common-sense knowledge bases and
that the approach is robust in the face of the noise introduced.
From the error analysis we performed, we could determine some of the causes of
error in the system. The first important finding is that extracting only the action,
verb and patient semantic roles is not sufficient. There are other roles, such as the
modifiers, which change the overall emotion in the text (e.g. "I had a fight with my
sister" - sadness, versus "I had a fight with my stupid sister" - anger). Therefore,
such modifiers should be included as attributes of the concepts identified in the
roles and, additionally, added to the tuples, as they can account for other appraisal
criteria. This can also be a method to account for negation. Given that just 3 roles
were extracted, there were also many examples that did not make sense when input
into the system. Further on, we tried to assign an emotion to all the actions contained
in the chains. However, some actions have no emotional effect. Therefore, an
accurate source of knowledge on the affect associated to concepts has to be added.
Another issue we detected was that certain emotions tend to be classified most
of the time as another emotion (e.g. fear is mostly classified as anger). This is due
to the fact that emotions are subjective (one and the same situation can be a cause
of anger for one person and a cause of fear for another, or a mixture of the two); also,
in certain situations, there are very subtle nuances that distinguish one emotion
from the other.
A further source of errors was the lack of knowledge on specific actions. As we
have seen, this knowledge can be imported from external knowledge bases and
integrated in the core. This extension using larger common-sense knowledge bases
may lead to problems related to knowledge consistency and redundancy, which we
have not dealt with yet. VerbOcean extended the knowledge, in the sense that more
examples could be classified. However, the ambiguity of the resource and the fact
that it is not perfectly accurate also introduced many errors.
Finally, other errors were produced by NLP processes and propagated at various
steps of the processing chain (e.g. SRL, co-reference resolution). Some of these
errors cannot be eliminated; however, others can be partially solved by using
alternative NLP tools. A thorough analysis of the errors produced at each of the
stages involved in the application and extension of EmotiNet must be made in order
to obtain a clear idea of the importance/noise of each component.
CHAPTER 7. CONTRIBUTIONS
Motto: "The value of the sentiment is the value of the sacrifice you are prepared to
make for it." (John Galsworthy)
This thesis focused on the resolution of different problems related to the task of
sentiment analysis. Specifically, we concentrated on:
1. Defining the general task and related concepts, by presenting an overview
of the present definition and clarifying the inconsistencies found among the
ones that were previously given in the literature;
2. Proposing and evaluating methods to define and tackle sentiment analysis
from a variety of textual genres, in different languages;
3. Redefining the task and proposing methods to annotate specific corpora for
sentiment analysis in the corresponding text type, in different languages, in
the cases where the task of sentiment analysis was not clearly defined for a
specific textual genre and/or no specific corpus was available for it. These
resources are publicly available for the use of the research community;
4. Applying opinion mining techniques in the context of end-to-end systems
that involve other NLP tasks as well. To this aim, we concentrated on
performing sentiment analysis in the context of question answering and
summarization.
5. Carrying out experiments using existing question answering and
summarization systems, designed to deal with factual data only.
6. Proposing and evaluating a new framework for what we called opinion
question answering and new methods for opinion summarization,
subsequent to experiments showing that systems performing question
answering and summarization over factual texts were not entirely suited in
the context of opinions;
7. Presenting a general method for the detection of implicitly-expressed
emotion from text. First, we presented the method to build a lexicon of
terms that in themselves contain no emotion, but that trigger emotion in a
reader. Subsequently, we abstracted from the analysis of sentiment
expressed in text based on linguistic cues and proposed and evaluated a
method to represent text as action chains. The emotion elicited by the
situation presented in the text was subsequently judged using commonsense
knowledge on the emotional effect of each action in the chain;
8. Evaluating our approaches in international competitions, in order to
compare our approaches to others and validate them.
Further on, we present in detail the contributions we have made to the research
in the field of sentiment analysis throughout this thesis and show how the methods
and resources we proposed filled important gaps in the existing research. The main
contributions answer five research questions:
1. How can sentiment analysis and, in a broader perspective, opinion mining
be defined in a correct way? What are the main concepts to be treated in
order to do that?
In Chapter 2, we presented a variety of definitions that were given to concepts
related to and involved in the task of sentiment analysis: subjectivity, objectivity,
opinion, sentiment, emotion, attitude and appraisal. Our contribution in this chapter
resided in clearly showing that sentiment analysis and opinion mining are not
synonymous, although in the literature they are usually employed interchangeably.
Additionally, we have shown that opinion, as it is defined by the Webster
dictionary, is not synonymous with sentiment. Whereas sentiments are types of
opinions, reflecting feelings (i.e. the conscious part of emotions), not all opinions
are sentiments (i.e. there are types of opinions that are not reflective of emotions).
We have also shown that subjectivity analysis is not as directly linked to sentiment
analysis as is considered by many of the researchers in the field. In other words,
detecting subjective sentences does not directly imply obtaining the sentences that
contain sentiment. The latter, as expressions of evaluations based on emotion, are
not necessarily indicated in subjective sentences, but can also be expressed in
objective sentences. Subjective sentences may or may not contain expressions of
emotion. The idea is summarized in Chapter 2, Figure 2.1.
Finally, we have shown that there is a clear connection between the work done
under the umbrella of sentiment analysis/opinion mining and the one in
appraisal/attitude analysis. Although all these areas are usually considered to refer
to the same type of work, the wider aim of attitude or appraisal analysis can capture
much better the research that has been done in sentiment analysis, including all
classes of evaluation (affective, cognitive, behavioral) and the connection between
author, reader and text meaning. Based on this observation and in view of the
Appraisal Theory, in Chapter 6 we proposed a model of emotion detection based on
commonsense knowledge.
Clearly defining these concepts has also helped in defining in an appropriate
manner the sentiment analysis task in the context of the different textual genres we
employed in our research. Subsequently, the correct definition has made it possible
to define annotation schemes and create resources for sentiment analysis in all the
textual genres we performed research with. All these resources were consequently
employed in the evaluation of the specific methods we created for sentiment
analysis from different textual genres. Both the resources created, through the high
inter-annotator agreement we obtained, and the methods we proposed, through the
performance of the systems implementing them, have shown that our efforts to give
a clear definition were indeed an important contribution to this field.
2. Can sentiment analysis be performed using the same methods, for all text
types? What are the peculiarities of the different text types and how do they
influence the methods to be used to tackle it? Do we need special resources
for different text types?
3. Can the same language resources be used in other languages (through
translation)? How can resources be extended to other languages?
In Chapter 4, we showed the peculiarities of different text types (reviews,
newspaper articles, blogs, political debates), analyzed them and proposed adequate
techniques to address them at the time of performing sentiment analysis. We
evaluated our approaches correspondingly and showed that they perform at the
level of state-of-the-art systems and in many cases outperform them. In this chapter,
we presented different methods and resources we built for the task of sentiment
analysis in different text types. We started by presenting methods to tackle the task
of feature-based opinion mining and summarization, applied to product reviews.
We have analyzed the peculiarities of this task and identified the weak points of
existing research. We proposed and evaluated different methods to overcome the
identified pitfalls, among which the most important were the discovery of indirectly
mentioned features and computing the polarity of opinions in a manner that is
feature-dependent, using the Normalized Google Distance and Latent Semantic
Analysis as measures of term association. Subsequently, we proposed a unified
model for sentiment annotation for this type of text, able to capture the important
phenomena that we had identified: different types of sentiment expressions (direct,
indirect, implicit), feature mentioning and the span of text expressing a specific
opinion. Such a distinction had not been proposed in the literature. Its contribution
is not only given by the annotation process, but also by the fact that considering
larger spans of text as representing one single opinion has led us to research on
opinion question answering (see Chapter 5) using retrieval of three-sentence-long
text snippets, greatly improving the performance of the opinion question answering
system. Following this annotation scheme, we also proposed a method based on
textual entailment to detect and classify the opinion stated on the most important
features of a product, on which stars were given in the review. Apart from this
contribution, the performance obtained applying this method is indicative of the
fact that it is possible to obtain, apart from a 2-way classification of opinions on
product features, a summary of the most important sentences referring to them. The
latter can be employed to offer support snippets to the feature-based opinion mining and
summarization process, which normally only offers a percent-based summary of the
opinions expressed on the product in question.
classification correctly (according to the polarity of the sentiment), the fact that
topic-relatedness is not contemplated leads to the introduction of irrelevant data in
the final summaries. Finally, we have shown that in the case of opinionated text,
relevance is given not only by the information contained, but also by the polarity of
the opinion and its intensity. Although initial results have shown that there is no
correlation between the Gold Standard annotations and the intensity level of
sentences as output by the sentiment analysis system, the fact that this method
obtained high F-measure results at TAC 2008 leads us to believe that more
mechanisms for opinion intensity should be studied, so that a clear connection can
be established between sentence relevance, the opinion it contains and its intensity.
5. Can we propose a model to detect emotion from text, in the cases where it
is expressed implicitly, needing world knowledge?
In the first chapters of this thesis, we explored the task of sentiment analysis in
different text types and languages, proposing a variety of methods that were
appropriate for tackling the issues in each particular text type. Most of the time,
however, the approaches we took were limited to discovering only the situations
where sentiment was expressed explicitly (i.e. where linguistic cues could be found
in the text to indicate it contained subjective elements or sentiment). Nevertheless,
in many cases, the emotion underlying the sentiment is not explicitly present in text,
but is inferable based on commonsense knowledge (i.e. emotion is not explicitly,
but implicitly expressed by the author, by presenting situations which most people,
based on commonsense knowledge, associate with an emotion, like going to a
party, seeing your child taking his/her first step etc.).
In Chapter 6 of the thesis, we presented our contribution to the issue of
automatically detecting emotion expressed in text in an implicit manner. The initial
approach is based on the idea that emotion is triggered by specific concepts,
according to their relevance, seen in relation to the basic needs and motivations,
underpinning our idea on the Relevance Theory. The second approach we propose
is based on the Appraisal Theory models. The general idea behind it is that
emotions are most of the time not explicitly stated in texts, but result from the
interpretation (appraisal) of the actions contained in the situation described, as well
as the properties of their actors and objects. Our contribution in this last part of the
research resides in setting up a framework for representing situations described in
text as chains of actions (with their corresponding actors and objects), and their
corresponding properties (including the affective ones), as commonsense
knowledge. We show the manner in which the so-called appraisal criteria can be
automatically detected from text and how additional knowledge on the properties of
the concepts involved in such situations can be imported from external sources.
of work is the extension of the proposed resources for other languages, either
through translation, in which case the resources should be refined and evaluated on
different types of texts in their original language, or by direct annotation in the
target language. In the latter case, it would be interesting to study the differences
given by the peculiarities of sentiment expression in a manner that is dependent on
the culture and language differences. In direct relation to the work developed in this
thesis, in the different text types and languages, future lines of research could be:
1. In the case of review texts:
a. The automatic extraction of taxonomies for product features
b. The extension of the proposed framework for review annotation
and feature-based opinion mining and summarization for languages
other than English and Spanish
2. In the context of newspaper quotations and, in a more general manner,
newspaper articles:
a. The study of the impact of news source (i.e. in terms of bias,
reputation, trust) on the sentiment analysis process, within sources
from the same country/culture and across countries and cultures
b. The study of the influence that the reader background has on the
manner in which sentiment is perceived from newspaper articles
c. The automatic and semi-automatic extension of the resources we
built to other languages (i.e. in addition to the collection of
quotations we annotated for English and German)
d. The development of a framework for sentiment analysis that takes
into account the three proposed components of text (the author, the
reader and the text itself) and the manner in which they interact in
the negotiation of text meaning, according to the Speech Act and
Appraisal Theories
e. The study of the influence of news content (i.e. what we denoted as
good versus bad news) on the manner in which sentiment is
expressed and subsequently on the performance of the automatic
sentiment analysis task
3. In the context of political debates and general texts:
a. The study of methods to represent the dialogue structure, as well as
of the influence that different methods for discourse analysis and
tools (e.g. for anaphora resolution) have on the proposed sentiment
analysis methods
b. The study and use of topic modeling techniques in order to
accurately detect the target of the expressed sentiment,
independently of whether its nature is known beforehand (see the
sketch below)
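As a pointer to how the last item could be approached, the sketch below (one possible choice, using LDA from scikit-learn on a toy corpus; documents and parameters are illustrative, not data used in this thesis) surfaces clusters of terms that can serve as candidate sentiment targets when the targets are not known beforehand.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy corpus: two product-review-like documents and one political one.
    docs = [
        "The camera battery dies quickly but the screen is great",
        "Battery life is terrible and the screen resolution is only average",
        "The debate focused on taxes, healthcare and immigration reform",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # Print the highest-weighted terms per topic as candidate sentiment targets.
    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_terms = [terms[j] for j in topic.argsort()[-4:]]
        print(f"topic {i}: candidate sentiment targets -> {top_terms}")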
REFERENCES
1. [Agichtein and Gravano, 2000] Agichtein, E. and Gravano, L. (2000).
Snowball: Extracting Relations from Large Plain-Text Collections. In
Proceedings of the 5th ACM International Conference on Digital
Libraries (ACM DL) 2000: 85-94.
2. [Atkinson and Van der Goot, 2009] Atkinson, M. and Van der Goot, E.
(2009). Near real time information mining in multilingual news. In
Proceedings of the 18th International World Wide Web Conference,
2009: 1153-1154.
3. [Austin, 1976] Austin, J. L. (1976). How to do things with words.
Oxford: Oxford University Press.
4. [Balahur and Montoyo, 2008] Balahur, A. and Montoyo, A. (2008). Applying a
Culture Dependent Emotion Triggers Database for Text Valence and
Emotion Classification. In Proceedings of the AISB 2008 Convention
"Communication, Interaction and Social Intelligence", 2008.
5. [Balahur and Montoyo, 2008a] Balahur, A. and Montoyo, A. (2008a).
Applying a culture dependent emotion triggers database for text valence
and emotion classification. In Procesamiento del Lenguaje Natural,
40(40).
6. [Balahur and Montoyo, 2008b] Balahur, A. and Montoyo, A. (2008b).
Building a recommender system using community level social filtering.
In 5th International Workshop on Natural Language and Cognitive
Science (NLPCS), 2008:32-41.
7. [Balahur and Montoyo, 2008c] Balahur, A. and Montoyo, A. (2008c).
Determining the semantic orientation of opinions on products - a
comparative analysis. In Procesamiento del Lenguaje Natural, 41(41).
8. [Balahur and Montoyo, 2008d] Balahur, A. and Montoyo, A. (2008d). A
feature-driven approach to opinion mining and classification. In
Proceedings of the NLPKE 2008.
9. [Balahur and Montoyo, 2008e] Balahur, A. and Montoyo, A. (2008e).
An incremental multilingual approach to forming a culture dependent
emotion triggers lexical database. In Proceedings of the Conference of
Terminology and Knowledge Engineering (TKE 2008).
10. [Balahur and Montoyo, 2008f] Balahur, A. and Montoyo, A. (2008f).
Multilingual feature-driven opinion extraction and summarization from
customer reviews. In Lecture Notes in Computer Science, 5039, NLDB,
2008: 345-346.
11. [Balahur and Montoyo, 2009] Balahur, A. and Montoyo, A. (2009).
Semantic approaches to fine and coarse-grained feature-based opinion
39. [Boldrini et al., 2010] Boldrini, E., Balahur, A., Martínez-Barco, P.,
Montoyo, A. (2010). EmotiBlog: a finer-grained and more precise
learning of subjectivity expression models. In Proceedings of the
4th Linguistic Annotation Workshop (LAW IV), satellite workshop to
ACL 2010, pp. 1-10, Uppsala, Sweden.
40. [Bossard et al., 2008] Bossard, A., Généreux, M., and Poibeau, T.
(2008). Description of the LIPN systems at TAC 2008: Summarizing
information and opinions. In Proceedings of the Text Analysis
Conference of the National Institute for Standards and Technology,
Gaithersburg, Maryland, USA.
41. [Breckler and Wiggins, 1992] Breckler, S. J. and Wiggins, E. C. (1992).
On defining attitude and attitude theory: Once more with feeling. In A.
R. Pratkanis, S. J. Breckler, and A. C. Greenwald (Eds.), Attitude
structure and function. Hillsdale, NJ, Erlbaum. pp. 407-427.
42. [Brin, 1998] Brin, S. (1998). Extracting patterns and relations from the
World-Wide Web. In Proceedings of the 1998 International Workshop
on Web and Databases (WebDB'98), 1998: 172-183.
43. [Calvo and D'Mello, 2010] Calvo, R. A. and D'Mello, S. (2010). Affect
Detection: An Interdisciplinary Review of Models, Methods and Their
Applications. IEEE Transactions on Affective Computing, Vol. 1, No. 1,
Jan.-Jun. 2010, pp. 18-37.
44. [Cambria et al., 2009] Cambria, E., Hussain, A., Havasi, C. and Eckl, C.
(2009). Affective Space: Blending Common Sense and Affective
Knowledge to Perform Emotive Reasoning. In Proceedings of the 1st
Workshop on Opinion Mining and Sentiment Analysis (WOMSA) 2009:
32-41.
45. [Cardie et al., 2004] Cardie, C., Wiebe, J., Wilson, T., Litman, D.
(2004). Low-Level Annotations and Summary Representations of
Opinions for Multiperspective QA. In Mark Maybury (ed), New
Directions in Question Answering, AAAI Press/MIT Press, pp. 17-98.
46. [Cerini et al., 2007] Cerini, S., Compagnoni, V., Demontis, A.,
Formentelli, M., and Gandini, G. (2007). Micro-wnop: A gold standard
for the evaluation of automatically compiled lexical resources for
opinion mining, Milano, IT.
47. [Chaovalit and Zhou, 2005] Chaovalit, P. and Zhou, L. (2005). Movie
Review Mining: a Comparison between Supervised and Unsupervised
Classification Approaches. In Proceedings of HICSS-05, the 38th
Hawaii International Conference on System Sciences, IEEE Computer
Society.
85. [He et al., 2008] He, T., Chen, J., Gui, Z., and Li, F. (2008). CCNU at
TAC 2008: Proceeding on using semantic method for automated
summarization yield. In Proceedings of the Text Analysis Conference of
the National Institute for Standards and Technology.
86. [Hovy, 2005] Hovy, E. H. (2005). Automated text summarization. In
Mitkov, R., editor, The Oxford Handbook of Computational Linguistics,
pages 583-598. Oxford University Press, Oxford, UK.
87. [Hu and Liu, 2004] Hu, M. and Liu, B. (2004). Mining Opinion Features
in Customer Reviews. In Proceedings of the Nineteenth National
Conference on Artificial Intelligence (AAAI-2004), San Jose, USA.
88. [Iftene and Balahur-Dobrescu, 2007] Iftene, A. and Balahur-Dobrescu,
A. (2007). Hypothesis transformation and semantic variability rules used
in recognizing textual entailment. In Proceedings of the ACL-PASCAL
Workshop on Textual Entailment and Paraphrasing 2007.
89. [Jiang and Vidal, 2006] Jiang, H. and Vidal, J. M. (2006). From rational
to emotional agents. In Proceedings of the AAAI Workshop on Cognitive
Modeling and Agent-based Social Simulation.
90. [Jijkoun et al., 2010] Jijkoun, V., de Rijke, M., Weerkamp, W. (2010).
Generating Focused Topic-Specific Sentiment Lexicons. ACL 2010, pp.
585-594.
91. [Johnson-Laird and Oatley, 1989] Johnson-Laird, P. N. and Oatley, K.
(1989). The language of emotions: An analysis of a semantic field.
Cognition and Emotion, 3, 81-123.
92. [Kabadjov et al., 2009a] Kabadjov, M. A., Steinberger, J., Pouliquen, B.,
Steinberger, R., and Poesio, M. (2009). Multilingual statistical news
summarisation: Preliminary experiments with English. In Proceedings of
the Workshop on Intelligent Analysis and Processing of Web News
Content at the IEEE/WIC/ACM International Conferences on Web
Intelligence and Intelligent Agent Technology (WI-IAT), 2009: 519-522.
93. [Kabadjov et al., 2009b] Kabadjov, M., Balahur, A., Boldrini, E. (2009).
Sentiment Intensity: Is It a Good Summary Indicator? In Proceedings of
the 4th Language Technology Conference LTC, 2009: 203-212.
94. [Kim and Hovy, 2004] Kim, S.-M. and Hovy, E. (2004). Determining
the Sentiment of Opinions. In Proceedings of COLING 2004, pp. 1367-1373, Geneva, Switzerland.
95. [Kim and Hovy, 2005] Kim, S.-M. and Hovy, E. (2005). Automatic
detection of opinion bearing words and sentences. In Companion
Volume to the Proceedings of the International Joint Conference on
Natural Language Processing (IJCNLP), Jeju Island, Korea.
96. [Kim and Hovy, 2006] Kim, S.-M. and Hovy, E. (2006). Automatic
identification of pro and con reasons in online reviews. In Proceedings
of the COLING/ACL Main Conference Poster Sessions, pp. 483-490.
97. [Kim et al., 2010] Kim, J., Li, J.-J. and Lee, J.-H. (2010). Evaluating
Multilanguage-Comparability of Subjectivity Analysis Systems. In
Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 595-603, Uppsala, Sweden, 11-16
July 2010.
98. [Kintsch, 1999] Kintsch, W. (1999). Comprehension: A paradigm for
cognition. New York, Cambridge University Press.
99. [Kintsch, 2000] Kintsch, W. (2000). Metaphor comprehension: A
computational theory. Psychonomic Bulletin and Review, 7, 527-566.
100. [Koppel and Shtrimberg, 2004] Koppel, M. and Shtrimberg, I.
(2004). Good or bad news? Let the market decide. In Proceedings of the
AAAI Spring Symposium on Exploring Attitude and Affect in Text:
Theories and Applications.
101. [Kouleykov and Magnini, 2006] Kouleykov, M. and Magnini, B.
(2006). Tree Edit Distance for Recognizing Textual Entailment:
Estimating the Cost of Insertion. In Proceedings of the Second PASCAL
Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
102. [Kozareva et al., 2007] Kozareva, Z., Vázquez, S., and Montoyo, A.
(2007). Discovering the Underlying Meanings and Categories of a
Name through Domain and Semantic Information. In Proceedings of the
Conference on Recent Advances in Natural Language Processing
(RANLP 2007), Borovets, Bulgaria.
103. [Ku et al., 2005] Ku, L.-W., Li, L.-Y., Wu, T.-H., and Chen, H.-H.
(2005). Major topic detection and its application to opinion
summarization. In Proceedings of the ACM Special Interest Group on
Information Retrieval (SIGIR), Salvador, Brasil.
104. [Ku et al., 2006] Ku, L.-W., Liang, Y.-T., and Chen, H.-H. (2006).
Opinion extraction, summarization and tracking in news and blog
corpora. In AAAI Symposium on Computational Approaches to
Analysing Weblogs (AAAI CAAW 2006), AAAI Technical Report SS-06-03, Stanford University, 100-107.
105. [Kudo and Matsumoto, 2004] Kudo, T. and Matsumoto, Y. (2004).
A boosting algorithm for classification of semistructured text. In
Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP) 2004.
106. [Lappin and Leass, 1994] Lappin, S. and Leass, H.J. (1994). An
algorithm for pronominal anaphora resolution. In Journal of
Computational Linguistics, 20(4): 535-561.
107. [Larousse, 1995] Larousse (1995). Diccionario Ideológico de la
Lengua Española. Larousse Editorial, RBA Promociones Editoriales,
S.L.
108. [Laver et al., 2003] Laver, M., Benoit, K., and Garry, J. (2003).
Extracting policy positions from political texts using words as data.
American Political Science Review , 97(2, May): 311-331.
109. [Lazarus and Smith, 1988] Lazarus, R. S. and Smith, C. A. (1988).
Knowledge and appraisal in the cognition-emotion relationship.
Cognition and Emotion, 2, 281-300.
110. [Lee, 2004] Lee, L. (2004). I'm sorry Dave, I'm afraid I can't do
that: Linguistics, Statistics, and Natural Language Processing circa
2001. In Computer Science: Reflections on the Field, Reflections from
the Field (Report of the National Academies' Study on the Fundamentals
of Computer Science), pp. 111-118, 2004.
111. [Li et al., 2008] Li, W., Ouyang, Y., Hu, Y., Wei, F. (2008). PolyU
at TAC 2008. In Proceedings of the Text Analysis Conference 2008,
National Institute for Science and Technology, Gaithersburg, Maryland,
USA.
112. [Li et al., 2008a] Li, F., Zheng, Z., Yang, T., Bu, F., Ge, R., Yan
Zhu, X., Zhang, X., and Huang, M. (2008a). THU QUANTA at TAC
2008 QA and RTE track. In Proceedings of the Text Analysis
Conference (TAC 2008).
113. [Li et al., 2008b] Li, W., Ouyang, Y., Hu, Y., and Wei, F. (2008b).
PolyU at TAC 2008. In Proceedings of the Text Analysis Conference
2008.
114. [Lin and Pantel, 2001] Lin, D. and Pantel, P. (2001). Discovery of
Inference Rules for Question Answering. In Journal of Natural
Language Engineering 7(4):343-360.
115. [Lin et al., 2006] Lin, W., Wilson, T., Wiebe, J., and Hauptman, A.
(2006). Which Side are You on? Identifying Perspectives at the
Document and Sentence Levels. In Proceedings of the Tenth Conference
on Natural Language Learning CoNLL06.
116. [Lin, 1998] Lin, D. (1998). Dependency-based Evaluation of
MINIPAR. In Workshop on the Evaluation of Parsing Systems, Granada,
Spain, May 1998.
117. [Lita et al., 2005] Lita, L., Schlaikjer, A., Hong, W., and Nyberg, E.
(2005). Qualitative dimensions in question answering: Extending the
definitional QA task. In Proceedings of AAAI 2005.
118. [Liu and Maes, 2007] Liu, H. and Maes, P. (2007). Introduction to
the semantics of people and culture, International Journal on Semantic
Web and Information Systems, Special Issue on Semantics of People and
Culture (Eds. H. Liu & P. Maes), 3(1), Hershey, PA: Idea Publishing
Group.
119. [Liu and Singh, 2004] Liu, H. and Singh, P. (2004). ConceptNet: A
Practical Commonsense Reasoning Toolkit. BT Technology
Journal, Volume 22, nr.4, pp.211-226, Kluwer Academic Publishers.
120. [Liu et al., 2003] Liu, H., Lieberman, H. and Selker, T. (2003). A
Model of Textual Affect Sensing Using Real-World Knowledge. In
Proceedings of IUI 2003.
121. [Liu, 2007] Liu, B. (2007). Web Data Mining. Exploring
Hyperlinks, Contents and Usage Data. Springer, first edition.
122. [Liu, 2010] Liu, B. (2010). Sentiment Analysis and
Subjectivity. In Handbook of Natural Language Processing, eds.
N. Indurkhya and F. J. Damerau, 2010.
123. [Lloret et al., 2008] Lloret, E., Ferrández, O., Muñoz, R., and
Palomar, M. (2008). A text summarization approach under the influence
of textual entailment. In Proceedings of the 5th International Workshop
on Natural Language Processing and Cognitive Science (NLPCS 2008).
124. [Lloret et al., 2009] Lloret, E., Balahur, A., Palomar, M., and
Montoyo, A. (2009). Towards building a competitive opinion
summarization system: Challenges and keys. In Proceedings of the
NAACL-HLT 2009 Student Research Workshop, 2009: 72-77.
125. [Macleod et al., 1994] Macleod, C., Grishman, R., and Meyers, A.
(1994). Creating a common syntactic dictionary of English. In
Proceedings of the International Workshop on Sharable Natural
Language Resources.
126. [Macleod et al., 1998] Macleod, C., Grishman, R., Meyers, A.,
Barrett, L., and Reeves, R. (1998). Nomlex: A lexicon of
nominalizations. In Proceedings of EURALEX 1998, Liege, Belgium.
127. [Martin and Vanberg, 2007] Martin, L. W. and Vanberg,G. (2007).
Coalition Government and Political Communication. In Political
Research Quarterly, September 2008, Vol. 61, No. 3, pp. 502-516.
128. [Martin and Vanberg, 2008] Martin, L. W. and Vanberg, G. (2008).
A robust transformation procedure for interpreting political text.
Political Analysis Advance Access, Oxford University Press, 16(1).
139. [Mei Lee et al., 2009] Mei Lee, S. Y., Chen, Y. and Huang, C.-R.
(2009). Cause Event Representations of Happiness and Surprise. In
Proceedings of PACLIC 2009, Hong Kong.
140. [Mihalcea et al., 2007] Mihalcea, R., Banea, C., and Wiebe, J.
(2007). Learning multilingual subjective language via cross-lingual
projections. In Proceedings of the Conference of the Annual Meeting of
the Association for Computational Linguistics 2007, pp.976-983,
Prague, Czech Republic.
141. [Moreda et al., 2007] Moreda, P., Navarro, B. and Palomar, M.
(2007). Corpus-based semantic role approach in information
retrieval. Data & Knowledge Engineering (DKE), 61(3): 467-483.
142. [Moreda et al., 2008a] Moreda, P., Llorens, H., Saquete, E., and
Palomar, M. (2008a). Automatic generalization of a QA answer
extraction module based on semantic roles. In Proceedings of IBERAMIA 2008, vol. 5290, 2008: 233-242.
143. [Moreda et al., 2008b] Moreda, P., Llorens, H., Saquete, E., and
Palomar, M. (2008b). The influence of semantic roles in QA: a
comparative analysis. In Proceedings of the SEPLN 2008, vol. 41, pp. 55-62, Madrid, Spain.
144. [Moreda, 2008] Moreda, P. (2008). Los Roles Semánticos en la
Tecnología del Lenguaje Humano: Anotación y Aplicación. Doctoral
Thesis. University of Alicante.
145. [Mullen and Collier, 2004] Mullen, T. and Collier, M. (2004).
Sentiment Analysis Using Support Vector Machines with Diverse
Information Sources. In Proceedings of EMNLP 2004.
146. [Mullen and Malouf, 2006] Mullen, T. and Malouf, R. (2006). A
preliminary investigation into sentiment analysis of informal political
discourse. In Proceedings of the AAAI Symposium on Computational
Approaches to Analysing Weblogs (AAAI-CAAW), 2006: 159-162.
147. [Nasukawa and Yi, 2003] Nasukawa, T. and Yi, J. (2003).
Sentiment analysis: Capturing favorability using natural language
processing. In Proceedings of the Conference on Knowledge Capture
(K-CAP), 2003: 70-77.
148. [Neviarouskaya et al., 2010] Neviarouskaya, A., Prendinger, H. and
Ishizuka, M. (2010). EmoHeart: Conveying Emotions in Second Life
Based on Affect Sensing from Text. Advances in Human-Computer
Interaction, vol. 2010.
149. [Ng et al., 2006] Ng, V., Dasgupta, S., and Arifin, S.M. N. (2006).
Examining the Role of Linguistic Knowledge Sources in the Automatic
Identification and Classification of Reviews. In Proceedings 40th
208. [Tanev et al., 2009] Tanev, H., Pouliquen, B., Zavarella, V., and
Steinberger, R. (2009). Automatic expansion of a social network using
sentiment analysis. Annals of Information Systems, 2009, Springer
Verlag.
209. [Tanev, 2007] Tanev, H. (2007). Unsupervised learning of social
networks from a multiple-source news corpus. In Proceedings of the
Workshop Multi-source Multilingual Information Extraction and
Summarization (MMIES 2007), held at RANLP 2007, Borovets,
Bulgaria.
210. [Terveen et al., 1997] Terveen, L., Hill, W., Amento, B.,
McDonald, D. and Creter, J. (1997). PHOAKS: A system for sharing
recommendations. Communications of the Association for Computing
Machinery (CACM), 40(3): 59-62, 1997.
211. [Teufel and Moens, 2000] Teufel, S. and Moens, M. (2000). What's
yours and what's mine: Determining intellectual attribution in scientific
texts. In Proceedings of the 2000 Joint SIGDAT conference on
Empirical methods in natural language processing and very large
corpora: held in conjunction with the 38th Annual Meeting of the
Association for Computational Linguistics, Hong Kong.
212. [Thayer, 2001] Thayer, R. E. (2001). Calm Energy: How People
Regulate Mood With Food and Exercise. Oxford University Press, N.Y.
213. [Thomas et al., 2006] Thomas, M., Pang, B., and Lee, L. (2006).
Get out the vote: Determining support or opposition from congressional
floor-debate transcripts. In Proceedings of EMNLP 2006, pp. 327-335.
214. [Tong, 2001] Tong, R.M. (2001). An operational system for
detecting and tracking opinions in on-line discussion. In Proceedings of
the Workshop on Operational Text Classification (OTC), 2001, New
Orleans, USA.
215. [Toprak et al., 2010] Toprak, C., Jakob, N. and Gurevych, I. (2010).
Sentence and Expression Level Annotation of Opinions in User-Generated Discourse. In Proceedings of ACL 2010: 575-584.
216. [Turing, 1950] Turing, A. (1950). Computing machinery and
intelligence. Mind, 1950.
217. [Turney and Littman, 2003] Turney, P. and Littman, M. (2003).
Measuring praise and criticism: Inference of semantic orientation from
association. ACM Transactions on Information Systems, 21(4), 315-346.
218. [Turney, 2002] Turney, P. (2002). Thumbs up or thumbs down?
Semantic orientation applied to unsupervised classification of reviews.
In Proceedings 40th Annual Meeting of the Association for
Computational Linguistics, ACL 2002, 417-424, Philadelphia, USA.
232. [Wiegand et al., 2010] Wiegand, M., Balahur, A., Roth, B.,
Klakow, D., Montoyo, A. (2010). A Survey on the Role of Negation in
Sentiment Analysis. In Proceedings of the Workshop on Negation and
Speculation in Natural Language Processing, NeSp-NLP 2010.
233. [Wilson and Wiebe, 2003] Wilson, T. and Wiebe, J. (2003).
Annotating opinions in the world press. In Proceedings of SIGdial 2003.
234. [Wilson et al., 2004] Wilson, T., Wiebe, J., and Hwa, R. (2004).
Just how mad are you? Finding strong and weak opinion clauses. In
Proceedings of AAAI 2004, 761-769, San Jose, USA.
235. [Wilson et al., 2005] Wilson, T., Wiebe, J., and Hoffmann, P.
(2005). Recognizing contextual polarity in phrase-level sentiment
analysis. In Proceedings of HLT-EMNLP 2005, pp.347-354, Vancouver,
Canada.
236. [Wu and Jin, 2010] Wu, Y., Jin, P. (2010). SemEval-2010 Task 18:
Disambiguating Sentiment Ambiguous Adjectives. In Proceedings of
the 5th International Workshop on Semantic Evaluation, ACL 2010,
pages 81-85, Uppsala, Sweden, 15-16 July 2010.
237. [Yi et al., 2003] Yi, J., Nasukawa, T., Bunescu, R. and Niblack, W.
(2003). Sentiment analyzer: Extracting sentiments about a given topic
using natural language processing techniques. In Proceedings of the
IEEE International Conference on Data Mining (ICDM), 2003: 427-434.
238. [Yu and Hatzivassiloglou, 2003] Yu, H. and Hatzivassiloglou, V.
(2003). Towards answering opinion questions: Separating facts from
opinions and identifying the polarity of opinion sentences. In
Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP), 2003: 129-136, Sapporo, Japan.
JOURNAL ARTICLES:
1. Balahur, A.; Hermida J.M.; Montoyo, A. Detecting Implicit Expressions of
Emotion Using Commonsense Knowledge and Appraisal Models. IEEE
Transactions on Affective Computing. Special Issue on Naturalistic Affect
Resources for System Building and Evaluation, 2011 (accepted).
2. Balahur, A.; Hermida J.M.; Montoyo, A. Building and Exploiting EmotiNet:
a Knowledge Base for Emotion Detection Based on the Appraisal Theory
Model. In Lecture Notes in Computer Science Nr. 6677, 2011 - Proceedings
of the 9th International Symposium on Neural Networks (in press).
3. Balahur, A.; Hermida J.M.; Montoyo, A. EmotiNet: a Knowledge Base for
Emotion Detection in Text Built on the Appraisal Theories. In Lecture Notes
in Computer Science Proceedings of the 16th International Conference on
the Applications of Natural Language to Information Systems, 2011 (in
press).
4. Balahur, A.; Kabadjov, M.; Steinberger, J.; Steinberger, R.; Montoyo, A.
Opinion Summarization for Mass Opinion Estimation. Journal of Intelligent
Information Systems, 2011 (accepted).
5. Kabadjov, M.; Balahur, A.; Boldrini, E. Sentiment Intensity: Is It a Good
Summary Indicator? In Lecture Notes in Artificial Intelligence, Nr. 6562
Selected Papers from the 4th Language Technology Conference 2009 (in
press).
6. Balahur, A.; Kabadjov, M.; Steinberger, J. Exploiting Higher-level Semantic
Information for the Opinion-oriented Summarization of Blogs. International
Journal of Computational Linguistics and Applications, ISSN: 0976-0962,
Vol. 1, No. 1-2, pp. 45-59, 2010.
7. Martín, T.; Balahur, A.; Montoyo, A.; Pons, A. Word Sense Disambiguation
in Opinion Mining: Pros and Cons. In Journal Research in Computing
Science, Special Issue: Natural Language Processing and its Applications,
ISSN: 1870-4069, Vol. 46, pp. 119-130.
8. Balahur, A.; Kozareva, Z.; Montoyo, A. Determining the Polarity and
Source of Opinions Expressed in Political Debates. In Proceedings of the
10th International Conference on Intelligent Text Processing and
Computational Linguistics (CICLing) 2009 (Lecture Notes in Computer
Science No. 5449, 2009).
9. Balahur, A.; Montoyo, A. A Semantic Relatedness Approach to Classifying
Opinion from Web Reviews. In Procesamiento del Lenguaje Natural No. 42,
2009.
10. Balahur, A.; Balahur, P. What Does the World Think About You? Opinion
Mining and Sentiment Analysis in the Social Web. In Analele Stiintifice ale
Universitatii Al. I. Cuza Iasi, Nr. 2, pp. 101-110, 2009, ISSN: 2065-2917.
11. Balahur, A.; Montoyo, A. Definición de disparador de emoción asociado a
la cultura y aplicación a la clasificación de la valencia y de la emoción en
textos. Procesamiento del Lenguaje Natural, Vol. 40, Num. 40, 2008.
12. Balahur, A.; Montoyo, A. Multilingual Feature-Driven Opinion Extraction
and Summarization from Customer Reviews. Journal Lecture Notes in
Computer Science, 5039, pp. 345-346, Proceedings of the 13th International
Conference on Applications of Natural Language to Information Systems
(NLDB 2008).
23. Balahur, A.; Lloret, E.; Ferrández, O.; Montoyo, A.; Palomar, M.; Muñoz,
R. The DLSIUAES Team's Participation in the TAC 2008 Tracks. In
Proceedings of the Text Analysis Conference 2008 Workshop, Washington,
USA, 2008.
24. Iftene, A.; Balahur, A. A Language Independent Approach for Recognizing
Textual Entailment. Research in Computing Science, Vol. 334, 2008.
25. Balahur, A.; Montoyo, A. Applying a Culture Dependent Emotion Triggers
Database for Text Valence and Emotion Classification. In Proceedings of
the AISB 2008 Symposium on Affective Language in Human and Machine,
Aberdeen, Scotland, 2008.
26. Balahur, A.; Montoyo, A. Building a Recommender System Using
Community Level Social Filtering. In Proceedings of the 5th International
Workshop on Natural Language and Cognitive Science, Barcelona, Spain,
2008.
3. PARTICIPATION IN PROJECTS:
1. Title of the project: Living in Surveillance Societies (LiSS) COST
(European Cooperation in the field of Scientific and Technical Research)
Action IS0807 (European Commission FP7)
2. Title of the project: Las tecnologías del lenguaje humano ante los nuevos
retos de la comunicación digital (ACOMP/2010/286), Conselleria de
Educación
3. Title of the project: Procesamiento del lenguaje y sistemas de información
(GPLSI), VIGROB-100, Universidad de Alicante
4. Title of the project: Las Tecnologías del Lenguaje Humano ante los
nuevos retos de la comunicación digital
(TIN2009-13391-C04-01), Ministerio de Ciencia e Innovación (Ministry of Science and Innovation)
5. Title of the project: Desarrollo conjunto de un grupo de herramientas y
recursos de tecnologías del lenguaje humano (Joint Development of Tools
and Resources for Human Language Technologies)
6. Title of the project: Desarrollo y licencia de software para la extracción de
información de documentos notariales (Software Development and
Licensing for the Extraction of Information from Notary Documents)
7. Title of the project: Prometeo/2009/119 - Desarrollo de técnicas inteligentes
e interactivas de minería de textos
I was also a reviewer for the Language Resources and Evaluation journal - Special
Issue on Short Texts, IEEE Transactions on Affective Computing - Special Issue on
Naturalistic Affect Resources for System Building and Evaluation, ACM
Transactions on Intelligent Systems and Technology, RANLP 2009, NLDB 2009,
SEPLN 2009, IceTAL 2010, NLDB 2011.
any area of NLP (for example, Word Sense Disambiguation, Co-reference
Resolution, Temporal Expression Resolution, Semantic Role Labelling), or
depending on a specific final application (for example, Information Retrieval,
Information Extraction, Question Answering, Text Summarization, Machine
Translation).
Traditionally, these NLP application areas were designed for the treatment of
texts describing factual data (facts that can be observed and verified in reality).
Nowadays, however, factual information is no longer the main source from which
the fundamental or most basic knowledge is extracted.
The present is marked by the growing influence of the Social Web (the web of
interaction and communication) on the lives of people all over the world. Now,
more than ever, people are willing and happy to share their lives, knowledge,
experiences and thoughts with the entire world, through blogs, microblogs, forums,
wikis or even e-commerce sites that offer the option of sharing reviews of the
products they sell. People actively participate in the main events taking place in all
spheres of society, expressing their opinions about them and commenting on the
news that appears. The large volume of opinionated data available on the Internet,
in reviews, forums, blogs, microblogs and social networks, has produced an
important change in the way people communicate, share knowledge and emotions
and influence social, political and economic behaviour. Consequently, this new
reality has led to important transformations in the form, extent and speed of
circulation of news and its associated opinions, giving rise to new and challenging
social, economic and psychological phenomena.
In order to study these phenomena and to address the issue of extracting the
fundamental knowledge that is nowadays contained in texts with sentiment
expressions, new research fields have been born within NLP, whose aim is to detect
subjectivity in text and/or to extract and classify the sentiment of the opinions
expressed into different categories (usually positive, negative and neutral). The new
tasks addressed within NLP are mainly subjectivity analysis (which deals with
"private states" (Banfield, 1982), a term covering sentiments, opinions, emotions,
evaluations, beliefs and speculations), sentiment analysis and opinion mining. This
has not been the only way of referring to the approaches taken; other terminologies
have also been used that named the tasks differently, for example review mining or
appraisal extraction. Likewise, the terms sentiment analysis and opinion mining
product, written by a single source, in comparison with blogs or debates, which
have a dialogue structure in which opinions about different objects are expressed
by different sources). Finally, as far as final applications are concerned, sentiment
analysis is neither the first nor the last task that has to be performed. In order to
extract sentiment from texts, a set of relevant documents first needs to be retrieved.
The output of processing a text with a sentiment analysis system may contain a
great deal of redundant information and may not even fully solve the problem, due
to the large amount of existing data.
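The following minimal Python sketch (with hypothetical helper names, not the implementation of any system discussed in this thesis) illustrates the pipeline the paragraph describes: retrieve relevant documents first, classify their sentiment, then condense the redundant output.

    def retrieve(query, collection):
        # Naive relevance filter: keep documents mentioning the query term.
        return [doc for doc in collection if query.lower() in doc.lower()]

    def classify(doc):
        # Placeholder polarity decision; a real system would use a trained model.
        return "negative" if "terrible" in doc.lower() else "positive"

    def condense(labelled_docs):
        # Keep only the first document seen for each polarity to reduce redundancy.
        kept = {}
        for doc, label in labelled_docs:
            kept.setdefault(label, doc)
        return kept

    collection = [
        "The phone battery is terrible.",
        "Battery life on this phone is terrible, avoid it.",
        "Great phone, lovely screen.",
    ]
    relevant = retrieve("phone", collection)
    labelled = [(doc, classify(doc)) for doc in relevant]
    print(condense(labelled))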
The systems implemented for the sentiment analysis task are based on rules, on
bags of words using a lexicon of words with a given sentiment orientation (positive
or negative), on statistical methods or on machine learning.
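A minimal sketch of the simplest family mentioned above, a lexicon-based bag-of-words classifier, is given below; the lexicon and the scoring rule are toy examples for illustration, not one of the systems discussed in this thesis.

    # Toy polarity lexicon (illustrative only).
    POSITIVE = {"good", "great", "excellent", "love"}
    NEGATIVE = {"bad", "poor", "terrible", "hate"}

    def polarity(text):
        """Classify a text as positive, negative or neutral by counting lexicon hits."""
        tokens = [token.strip(".,!?").lower() for token in text.split()]
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(polarity("The screen is great and the price is excellent"))  # positive
    print(polarity("The battery is terrible"))                          # negative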
Analysing the existing systems, we identified the following problems:
- The sentiment analysis task and the related concepts are not defined in a
unique way across the different research works. It is therefore not always
clear whether the different researchers working on sentiment analysis can
compare the performance of their systems, since the texts on which they
evaluate may have different elements annotated.
- The sentiment analysis task is solved in the same manner, regardless of the
type of text being processed and of the objective of the final application.
- There are no annotated resources for the sentiment analysis task for all
textual genres.
- There are no lexicons of sentiment-bearing words for languages other than
English.
- Most systems work at the lexical level, using rules, lexicons, statistical
methods or machine learning. The research done so far does not take into
account other levels of analysis, such as the syntactic or the semantic one.
Therefore, ensuring that the source of the expressed opinion is the required
one, or determining the object about which the opinion is expressed in a
text, are aspects that are not taken into consideration. These aspects can
have a high impact on the performance and usefulness of opinion analysis
systems.
- Most of the research does not distinguish between the different components
of a text, especially the author, the text and the reader. The sentiment
analysis task can have different objectives, depending on the perspective
that needs to be analysed (for example, whether the author has a preference
for a certain object described, whether the text contains
CONTRIBUTION
This section presents in detail the contributions made to research in the field of
sentiment analysis throughout this thesis and shows how the proposed methods and
resources fill important gaps in the existing research. The main contributions
answer five research questions:
1. How can the task of sentiment analysis and, in a broader perspective,
opinion mining be defined in a correct manner? Which are the main
concepts that should be defined before tackling the task?
(The concepts defined in this section are: objective, opinion, sentiment and emotion.)
Finally, we demonstrated the existence of a clear connection between the work
carried out in the framework of sentiment analysis/opinion mining and that of
appraisal/attitude analysis. Although all these areas are generally considered to
refer to the same kind of work, the broader goal of attitude or appraisal analysis can
capture much better the research that has been done in sentiment analysis, including
all the classes of appraisal (affective, cognitive-behavioural) and the relation
between the author, the reader and the meaning of the text. Based on this
observation and in the light of Appraisal Theory, Chapter 6 proposes a model for
emotion detection