Ijcai01 HMM

Appears in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).
Representing Sentence Structure in Hidden Markov Models

for Information Extraction
Soumya Rayy Mark Craveny
[email protected] [email protected]
y
Department of Computer Sciences Department of Biostatistics & Medical Informatics
University of Wisconsin University of Wisconsin
Madison, Wisconsin 53706 Madison, Wisconsin 53706
Abstract Seymore et al., 1999; Freitag and McCallum, 2000; McCal-

lum et al., 2000].
We study the application of Hidden Markov Mod- Previous HMM approaches to information extraction do
els (HMMs) to learning information extractors for not adequately address several key aspects of the problem do-
n-ary relations from free text. We propose an ap- mains on which we are focused. First, the data we are pro-
proach to representing the grammatical structure of cessing is complex natural language text. Whereas previous
sentences in the states of the model. We also in- approaches have represented their data as sequences of to-
vestigate using an objective function during HMM kens, we present an approach in which sentences are first pro-
training which maximizes the ability of the learned cessed by a shallow parser and then represented as sequences
models to identify the phrases of interest. We eval- of typed phrases. Second, the data we are processing include
uate our methods by deriving extractors for two bi- many sentences that are not relevant to the relations of inter-
nary relations in biomedical domains. Our experi- est. Even in relevant sentences, only certain phrases contain
ments indicate that our approach learns more accu- information to be extracted. Whereas previous approaches to
rate models than several baseline approaches. applying HMMs for IE have focused the training process on
maximizing the likelihood of the training sentences, we adopt
a training method that is designed to maximize the probabil-
1 Introduction ity of assigning the correct labels to various parts of the sen-
Information extraction (IE) may be defined as the task of au- tences being processed [Krogh, 1994]. Our approach involves
tomatically extracting instances of specified classes or rela- coupling the algorithm devised by Krogh with the use of null
tions from text. In our research, we are interested in using models which are intended to represent data not directly rele-
machine learning approaches, including hidden Markov mod- vant to the task at hand.
els (HMMs), to extract certain relationships among objects
from biomedical text sources. We present and evaluate two 2 Problem Domain
contributions to the state of the art in learning information
extractors with HMMs. First, we investigate an approach to Our work is focused on extracting instances of specific re-
incorporating information about the grammatical structure of lations of interest from abstracts in the MEDLINE database
sentences into HMM architectures. Second, we investigate an [National Library of Medicine, 2001]. MEDLINE contains
objective function for HMM training whose emphasis is on bibliographic information and abstracts from more than 4,000
maximizing the ability of the learned models to identify the biomedical journals.
phrases of interest rather than simply maximizing the likeli- An example of a binary relation that we consider in our
hood of the training data. Our experiments in two challeng- experiments is the subcellular-localization relation, which
ing real world domains indicate that both contributions lead represents the location of a particular protein within a cell.
to more accurate learned models. We refer to the domains of this relation as PROTEIN and
Automated methods for information extraction have sev- LOCATION. We refer to an instance of a relation as a tuple.
eral valuable applications including populating knowledge Figure 1 provides an illustration of our extraction task. The
bases and databases, summarizing collections of documents, top of the figure shows two sentences in a MEDLINE ab-
and identifying significant but unknown relationships among stract. The bottom of the figure shows the instance of the
objects. Since constructing information extraction systems target relation subcellular-localization that we would like to
manually has proven to be expensive[Riloff, 1996], there extract from the second sentence. This tuple asserts that the
has been much recent interest in using machine learning protein UBC6 is found in the subcellular compartment called
methods to learn information extraction models from la- the endoplasmic reticulum.
beled training data. Hidden Markov models are among the In order to learn models to perform this task, training ex-
more successful approaches considered for learning informa- amples consisting of passages of text, annotated with the tu-
tion extractors [Leek, 1997; Freitag and McCallum, 1999; ples that should be extracted from them, are needed. In our
“. . . stance. Every other sentence was considered to be a negative
Here we report the identification of an integral membrane instance. It is clear that while this process is automatic, it will
ubiquitin-conjugating enzyme. This enzyme, UBC6, local- result in a noisy labeling. A sentence may have the words in a
izes to the endoplasmic reticulum, with the catalytic domain target tuple of the relation while the semantics may not refer
facing the cytosol. to the relation. On the other hand, a tuple in a relation may be
. . .” described by synonymous words which were not in any target
+ tuple; therefore, sentences where tuples exist as synonyms are
labeled incorrectly. We used a random sample of 200 positive
subcellular-localization(UBC6,endoplasmic
and 200 negative sentences to estimate the amount of noise
reticulum)
introduced by the labeling process. We estimate with 95%
confidence that approximately 10% to 15% of the sentences
Figure 1: An example of the information extraction task. The are labeled incorrectly (either falsely labeled or unlabeled
top shows part of a document from which we wish to extract when they should have been) in the subcellular-localization
instances of the subcellular-localization relation. The bot- data set. We believe that the disorder-association data set is
tom shows the extracted tuple. not as noisy as the subcellular-localization data set.
3 Representing Phrase Structure

approach, each training and test instance is an individual sen-
tence. There are several aspects of the data that make this a Hidden Markov Models (HMMs) are the stochastic analogs
difficult information extraction task: (i) it involves free text, of finite state automata. An HMM is defined by a set of states
(ii) the genre of text is, in general, not grammatically simple, and a set of transitions between them. Each state has an as-
(iii) the text includes a lot of technical terminology, (iv) there sociated emission distribution which defines the likelihood of
are many sentences from which nothing should be extracted. a state to emit various tokens. The transitions from a given
In the terminology that has been used in the information state have an associated transition distribution which defines
extraction literature, our task is inherently a multiple slot the likelihood of the next state given the current state.
extraction task. Since we are interested in extracting in- In previous HMM approaches to information extraction,
stances of n-ary relations, we cannot treat each domain of sentences have been represented as sequences of tokens. We
such a relation as a separate unary component to be extracted hypothesize that incorporating sentence structure into the
(also called single slot extraction). Consider the subcellular- models we build results in better extraction accuracy.
localization relation discussed above. A document may men- Our approach is based on using syntactic parses of all sen-
tion many proteins and many locations but this relation holds tences we process. In particular, we use the Sundance sys-
only among certain pairs of these proteins and locations. tem [Riloff, 1998] to obtain a shallow parse of each given
In the experiments reported here, we use two data sets rep- sentence. Our representation does not incorporate all of the
resenting two different binary relations. The subcellular- information provided by a Sundance parse, but instead “flat-
localization data set includes 545 sentences that represent tens” it into a sequence of phrase segments. Each phrase seg-
positive instances (labeled with tuples that should be ex- ment consists of a type describing the grammatical nature of
tracted from them) and 6,700 sentences that represent neg- the phrase, and the words that are part of the phrase.
ative instances (not labeled with any tuples). The 545 posi- In positive training examples, if a segment contains a word
tive instances are labeled with 645 tuples in all; there are 335 or words that belong to a domain in a target tuple, it is anno-
unique tuples. The second data set is for a binary relation that tated with the corresponding domain. We refer to these an-
characterizes associations between genes and genetic disor- notations as labels. Labels are absent in the test instances.
ders. We refer to this relation as disorder-association, and Figure 2a shows a sentence containing an instance of the
the domains of this relation as GENE and DISORDER. This subcellular-localization relation and its annotated segments
data set contains 892 positive instances and 11,487 negative (we shall discuss the other panels of this figure later). The
instances. The positive instances are labeled with 899 tuples second phrase segment in this example is a noun phrase seg-
in all (126 unique). For both data sets, the negative instances ment (NP SEGMENT) that contains the protein name UBC6
are “near misses” in that they come from the same population (hence the PROTEIN label). Note that the types are constants
of abstracts as the positive instances, and in many cases they that are pre-defined by our representation of Sundance parses,
discuss concepts that are associated with the target relation. while the labels are defined with respect to the domains of re-
lation we are trying to extract. Also note that the parsing is
The target tuples for the subcellular-localization re-
not always accurate, for instance, the third segment in fig-
lation were collected from the Yeast Protein Database
ure 2a should really be a VP SEGMENT, but has been typed
(YPD) [Hodges et al., 1998], and the target tuples for the
as an NP SEGMENT by Sundance.
disorder-association relation were collected from the On-
The states in our HMMs represent the annotated segments
line Mendelian Inheritance in Man (OMIM) database [Center
of a sentence. Like a segment, each state in the model is
annotated with a htype,labeli pair1 . A given state can emit
for Medical Genetics , 2001]. Relevant MEDLINE abstracts
were also gathered from entries in these databases. To label
only segments whose type is identical to the state’s type; for
the sentences in these abstracts, we matched the target tuples
to the words in the sentence. A sentence which contained 1
We can think of segments that do not have a label corresponding
words that matched a tuple was taken to be a positive in- a domain of the relation as having an implicit empty label.
“This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol.”
NP SEGMENT this enzyme DET this this

NP SEGMENT:PROTEIN ubc6 UNK enzyme enzyme
NP SEGMENT localizes UNK:PROTEIN ubc6 PROTEIN ubc6
PP SEGMENT to UNK localizes localizes
NP SEGMENT:LOCATION the endoplasmic reticulum PREP to to
PP SEGMENT with ART the the
NP SEGMENT the catalytic domain N:LOCATION endoplasmic LOCATION endoplasmic
VP SEGMENT facing UNK:LOCATION reticulum LOCATION reticulum
NP SEGMENT the cytosol PREP with with
ART the the
N catalytic catalytic
UNK domain domain
V facing facing
ART the the
N cytosol cytosol
(a) (b) (c)
Figure 2: HMM input representations. (a) The Phrase representation: the sentence is segmented into typed phrases. (b) The
POS representation: the sentence is segmented into words typed with part-of-speech tags. (c) The Token representation: the
sentence is segmented into untyped words. For each representation, the labels (PROTEIN, LOCATION) are only present in the
training sentences.
8 0p 1
example, the segment “this enzyme” in figure 2a could currence as follows:
be emitted by any state with type NP SEGMENT, regardless
>
> Y X
(wj jq )A
j ij
of its label. Each state that has a label corresponding to a
domain of the relation plays a direct role in extracting tuples. >
< j=1 M T (qjr)r (i 1);
r
q (i)
Figure 3 is a schematic of the architecture of our phrase- =
>
> if type(q ) = type(pi );
(2)
based hidden Markov models. The top of the figure shows >
: 0;
the positive model, which is trained to represent positive in- otherwise:
stances in the training set. The bottom of the figure shows the
null model, which is trained to represent negative instances Here wj is the j th word in the ith phrase segment pi , and
in the training set. Since our Phrase representation includes type is a function that returns the type of a segment or state
14 phrase types, both models have 14 states without labels, as described above. The two key aspects of this modification
and the positive model also has five to six additional labeled are that (i) the type of a segment has to agree with the type
states (one for each htype,labeli combination that occurs in of state in order for the state to emit it, and (ii) the emission
the training set). We assume a fully connected model, that probability of the words in the segment is computed as the
is, the model may emit a segment of any type at any position product of the emission probabilities of the individual words.
within a sentence. This latter aspect is analogous to having states use a Naı̈ve
To train and test our Phrase models, we have to modify Bayes model for the words in a phrase. Note that this equation
the standard Forward, Backward and Viterbi algorithms [Ra- requires a normalization factor to define a proper distribution
biner, 1989]. The Forward algorithm calculates the proba- over sentences. However, since we use these equations to
bility q (i) of a sentence being in state q of the model after make relative comparisons only, we leave this factor implicit.
having emitted i elements of an instance. When a sentence The modifications to the Viterbi and Backward algorithms are
is represented as a sequence of tokens, the algorithm is based similar to this modification of the Forward algorithm.
on the following recurrence: Given these modifications to the Forward and Backward al-
gorithm, we could train phrase-based models using the Baum-
Welch algorithm [Baum, 1972]. However, for the models we
ST ART (0) = 1
consider here, there is no hidden state for training examples
q (0) ; q 6= START
= 0
X (i.e., there is an unambiguous path through the model for each
q (i) = M (wi jq) T (qjr)r (i 1) (1)
example), and thus there is no need to use Baum-Welch. In-
stead, we assume a fully connected model and obtain tran-
r
sition frequencies by considering how often segments with
various htype, labeli annotations are adjacent to each other.
where M and T represent the emission and transition distri- We smooth these frequencies over the set of possible transi-
butions respectively, wi is the ith element in the instance, and tions for every state using m-estimates [Cestnik, 1990]. In
r ranges over the states that transition to q. a similar manner, we obtain the emission frequencies of the
Our modification involves changing the last part of this re- words in each state by summing over all segments with the
positive 1. Associate segments in the order in which they occur.
NP_SEGMENT
Thus for subcellular-localization, the first segment
NP_SEGMENT
matching a PROTEIN state is associated with the first
START
PROTEIN
.... END
segment matching a LOCATION state, and so on.
2. If there are fewer segments containing an element of
NP_SEGMENT
PP_SEGMENT some domain, use the last match of this domain to con-
LOCATION
struct the remaining tuples. For instance, if we pre-
dicted one PROTEIN phrase P1 and two LOCATION
phrases L1 and L2 , we would create two tuples based
null on hP1 ; L1 i and hP1 ; L2 i.
NP_SEGMENT 3.1 Experiments
START .... END
In the experiments presented in this section, we test our
hypothesis that incorporating phrase-level sentence structure
PP_SEGMENT into our model provides improved extraction performance in
terms of precision and recall. We test this hypothesis by
comparing against several hidden Markov models that repre-
sent less information about the grammatical structure of sen-
Figure 3: The general architecture of our phrase-based tences. Henceforth, we refer to the model described above as
HMMs. The top part of the figure shows the positive model the Phrase Model.
and the bottom part of the figure shows the null model. The first model we compare against, which we call the POS
Model, is based on the representation shown in Figure 2b.
same htype, labeli annotations in our training set. We smooth This model represents some grammatical information, in that
these frequency counts using m-estimates over the entire vo- it associates a type with each token indicating the part-of-
cabulary of words. speech(POS) tag for the word (as determined by Sundance).
However, unlike the Phrase Model, the POS model represents
Once the model has been constructed, we use it to predict
sentences as sequences of tokens, not phrases. This model is
tuples in test sentences. We use the Viterbi algorithm, modi-
comparable in size to the Phrase Model. The positive com-
fied as described above, to determine the most likely path of a
ponent of this model has 17 states without labels and six to
sentence through the positive model. We consider a sentence
ten states with labels (depending on the training set). The null
to represent a tuple of the target relation if and only if two
component of the model has 17 states without labels.
conditions hold:
The other models we consider, which we call the Token
1. The likelihood of emission of the sentence by the posi- Models, are based on the representation shown in Figure 2c.
tive model is greater than the likelihood of emission by This representation treats a sentence simply as a sequence of
the null model: PEND (n) > NEND (n), where P and N words. We investigate two variants that employ this represen-
refer to the positive and null models respectively and the tation. The simpler of the two hidden Markov models based
sentence has n segments. on this representation, which we refer to as Token Model 1,
has three states in its positive model and one state in its null
2. In the Viterbi path for the positive model, there are seg-
ments aligned with states corresponding to all the do- model (not counting the START and END states). None of the
states in this model have types. Two of the states in the posi-
mains of the relation. For example, for the subcellular-
tive model represent the domains of the binary target relation,
localization relation, the Viterbi path for a sentence
while the remaining states have no labels. The role of the lat-
must pass through a state with the PROTEIN label and a
ter set of states is to model all tokens that do not correspond
state with the LOCATION label.
to the domains of the target relation. A more complex version
Note that even after phrases have been identified in this of this model, which is illustrated in Figure 4, has three un-
way, the extraction task is not quite complete, since some of labeled states in its positive model. We define the transitions
the phrases might contain words other than those that belong and train these models in such a way that these three states
in an extracted tuple. Consider the example in Figure 2a. can specialize to (i) tokens that come before any relation in-
The LOCATION phrase contains the word “the” in addition stances, (ii) tokens that are interspersed between the domains
to the location. Therefore, tuple extraction with these models of relation instances, and (iii) tokens that come after relation
must include a post-processing phase in which such extrane- instances.
ous words are stripped away before tuples are returned. We The training algorithm used for the POS Model is identical
do not address this issue here. Instead, we consider a predic- to that used for the Phrase Model. The training algorithm for
tion to be correct if the model correctly identifies the phrases the Token Models is essentially the same, except that there are
containing the target tuple as a subphrase. no type constraints on either the tokens or states.
It is possible to have multiple predicted segments for each Since we consider a prediction made by the Phrase Model
domain of the relation. In this case, we must decide which to be correct if it simply identifies the phrases containing the
combinations of segments constitute tuples. We do this using words of the tuple, we use a similar criterion to decide if the
two simple rules: predictions made by the POS Model and Token Models are
positive
1
(untyped) Phrase Model
PROTEIN POS Model
Token Model 1
0.8 Token Model 2
(untyped) (untyped) (untyped) 0.6
Precision
START END
0.4
(untyped)
LOCATION
0.2
null
0
0 0.2 0.4 0.6 0.8 1
START (untyped) END
Recall
Figure 4: The architecture of Token Model 2. Figure 5: Precision vs. recall for the four models on the
subcellular-localization data set.
correct. We consider POS Model and Token Model predic- 1

tions to be correct if the labeled states of these models iden- Phrase Model
POS Model
tify sequences of tokens that contain the words of the tuple. 0.8
Token Model 1
These models are not penalized for extracting extra adjacent Token Model 2
words along with the actual words of a target tuple.
0.6
Precision
We process the HMM input data (after parsing in cases
where Sundance is used) by stemming words with the Porter
algorithm [Porter, 1980], and replacing words that occur only 0.4
once in a training set with a generic UNKNOWN token. The
statistics for this token are then used by the model while emit- 0.2
ting out-of-vocabulary words encountered during prediction.
Similarly, numbers are mapped to a generic NUMBER token. 0
Positive predictions are ranked by a confidence measure 0 0.2 0.4 0.6 0.8 1
which is computed as the ratio of the likelihood of the Recall
Viterbi path of a sentence through a model to the likelihood

of the model to emit that sentence, i.e. confidence(s) = Figure 6: Precision vs. recall for the four models on the
ÆEND (n)=END (n): Here s is a sentence of n segments, disorder-association data set.
ÆEND (n) is the likelihood of the most probable path of all
n segments threaded through to the END state, and END (n) disorder-association data set. Here, the differences are
is the comparable value calculated by the Forward algorithm. much more pronounced. The Phrase Model achieves signifi-
We construct precision-recall graphs for our models by vary- cantly higher levels of precision than any of the other models,
ing a threshold on the confidence measures. including the POS Model. The recall endpoint for the Phrase
For both data sets we measure precision and recall using Model is also superior to those of the other models. We con-
5-fold cross validation. The data is partitioned such that clude that the experiments presented here support our hypoth-
all of the sentences from a given MEDLINE abstract are in esis that incorporating sentence structure into the models we
the same fold. This procedure ensures that our experiments build results in better extraction accuracy.
model the nature of the real application setting. For training,
we sample the negative instances so that there are an equal 4 Improving Parameter Estimates
number of positive and negative instances per fold. We have
observed that we get better recall consistently by doing this. Standard HMM training algorithms, like Baum-Welch, are
Figure 5 shows the precision-recall curves for the designed to maximize the likelihood of the data given the
model. Specifically, if si is a sentence in the training set,
subcellular-localization data set. The curve for the Phrase
Baum-Welch (and the method we used earlier) tries to find
Model is superior to the curves for both Token Models. At
parameters ^ such that
low levels of recall, the POS Model exhibits slightly higher
Y
precision than the Phrase Model, but the latter is superior
at higher recall levels, and the Phrase Model has a signifi- ^ = arg max Pr( is j ) :

cantly higher recall endpoint. These results suggest that there i
is value in representing grammatical structure in the HMM We hypothesize that more accurate models can be learned by
architectures, but the Phrase Model is not definitively more training with an objective function that aims to maximize the
accurate. likelihood of predicting the correct sequence of labels for a
Figure 6 shows the precision-recall curves for the given sentence (as before, we assume that states and phrases
positive MODEL
Phrase Model trained on the corresponding data sets in our
previous experiments.
START END The methodology for this experiment is the same as before.
Note that for the Combined Model, prediction is simpler than
null MODEL with a separate null model, since it suffices to consider the
Viterbi path of a sentence through the model to extract tuples,
if any. We do not train the Combined Model to convergence
Figure 7: Combined model architecture. The positive and to avoid overfitting. Instead, we set the number of iterations
null models refer to the corresponding models in Figure 3. for which to do gradient descent to a fixed constant value of
100.
without labels have an implicit empty label). Let i be the
known sequence of labels for a sentence si in our training set. 1
Krogh’s Algorithm
We would like to estimate parameters ^ such that Initial Parameter Estimates
Y 0.8
^ = arg max jsi ; )
Pr( i (3)

i
Y Pr( i ; sij)
0.6
Precision
:
s j )
= arg max (4)
Pr( i 0.4
i
This is similar to the task of optimizing parameters to recover 0.2

the sequence of states given a set of observations[McCallum
et al., 2000]. Krogh[1994] has devised an HMM training al- 0
gorithm that tries to optimize this criterion. After transform- 0 0.2 0.4 0.6 0.8 1
ing this objective function into one which aims to minimize Recall
the negative log likelihood of the above equation, the follow-
ing incremental update rule is obtained: Figure 8: Effect of Krogh’s Algorithm on the combined
model for the subcellular-localization data set.
jnew = N (jold + (mij nij )) (5)
Figure 8 shows the precision-recall curves for this exper-
where j is the j th parameter, mij is the expected number of
iment for the subcellular-localization data set. For each
times j is used by the ith sentence on correct paths through precision-recall curve, we also show 95% confidence inter-
the model, nij is the expected number of times j is used by vals. From the figure, we observe that there is some improve-
the ith sentence on all paths through the model, N is a nor- ment in the precision of the model on this data set, while re-
malizing constant, and is the learning rate. The n and m call is held nearly constant. While the improvement is small,
terms can be calculated using the Forward-Backward proce- we have observed it consistently across the various model
dure. Note that the update rule represents an online training architectures we have explored. Figure 9 shows the corre-
procedure. sponding precision-recall curves and confidence intervals for
In our previous experiments, we used a separate null model the experiment on the disorder-association data set. Here,
to represent negative instances. We would like to use Krogh’s the difference between the initial model and the trained model
algorithm with this configuration to observe if it results in is more pronounced. The model trained with Krogh’s algo-
more accurate models. However, the null model as we have
described it is a separate entity which is trained separately.
1
With this architecture, Krogh’s algorithm would be unable Krogh’s Algorithm
to correct false positives in the training set since doing so Initial Parameter Estimates
might require adjusting the parameters of the positive model 0.8
in response to a negative instance. To remedy this problem,

we propose an alternative to having a separate null model, 0.6
Precision
which we refer to as a Combined Model. A Combined Model

consists of two submodels sharing common START and END 0.4
states. A schematic is shown in figure 7. The shared START
and END states allow the training algorithm to update param- 0.2
eters in both parts of the model in response to a given training
sentence.
0
0 0.2 0.4 0.6 0.8 1
4.1 Experiments Recall
To evaluate this algorithm, we train the Combined Model con-
figuration on the subcellular-localization and the disorder- Figure 9: Effect of Krogh’s Algorithm on the combined
association data sets. We compare these models against the model for the disorder-association data set.
rithm has significantly better precision than the initial model, [Krogh, 1994] A. Krogh. Hidden Markov models for la-
while maintaining a similar level of recall. We conclude that beled sequences. In Proceedings of the Twelfth Interna-
this training algorithm is appropriate for our task, and can tional Conference on Pattern Recognition, pages 140–144,
improve accuracy, sometimes significantly. Jerusalem, Israel, 1994. IEEE Computer Society Press.
[Leek, 1997] T. Leek. Information extraction using hidden
5 Conclusion Markov models. Master’s thesis, Department of Com-
We have presented two contributions to learning Hidden puter Science and Engineering, University of California,
Markov Models for information extraction, and evaluated San Diego, CA, 1997.
these contributions on two challenging biomedical domains. [McCallum et al., 2000] A. McCallum, D. Freitag, and
We have presented an approach to representing the grammat- F. Pereira. Maximum entropy Markov models for infor-
ical structure of sentences in an HMM. Comparative exper- mation extraction and segmentation. In Proceedings of the
iments with other models lacking such information shows Seventeenth International Conference on Machine Learn-
that this approach learns extractors that have increased pre- ing, pages 591–598, Stanford, CA, 2000. Morgan Kauf-
cision and recall performance. We have also investigated the mann.
application of a training algorithm developed by Krogh to
[National Library of Medicine, 2001] National Library
our models. This algorithm consistently provides an accu-
racy gain over our original models. We believe that these are of Medicine. The MEDLINE database, 2001.
promising approaches to the task of deriving information ex- http://www.ncbi.nlm.nih.gov/PubMed/.
tractors for free text domains. [Porter, 1980] M. F. Porter. An algorithm for suffix stripping.
Program, 14(3):127–130, 1980.
Acknowledgments [Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov
This research was supported in part by NIH Grant 1R01 models and selected applications in speech recognition.
LM07050-01, and NSF CAREER award IIS-0093016. The Proceedings of the IEEE, 77(2):257–286, 1989.
authors would like to thank Michael Waddell for his work [Riloff, 1996] E. Riloff. An empirical study of automated
on building the disorder-association data set, and Peter An- dictionary construction for information extraction in three
dreae, Joseph Bockhorst, Tina Eliassi-Rad, and Jude Shavlik domains. Artificial Intelligence, 85:101–134, 1996.
for critiquing the initial draft.
[Riloff, 1998] E. Riloff. The Sundance sentence analyzer,
1998. http://www.cs.utah.edu/projects/nlp/.
References
[Seymore et al., 1999] K. Seymore, A. McCallum, and
[Baum, 1972] L. E. Baum. An equality and associated maxi-
R. Rosenfeld. Learning hidden Markov model structure
mization technique in statistical estimation for probabilis- for information extraction. In Working Notes of the AAAI
tic functions of Markov processes. Inequalities, 3:1–8, Workshop on Machine Learning for Information Extrac-
1972. tion, pages 37–42. AAAI Press, 1999.
[Center for Medical Genetics , 2001] Center for Med-
ical Genetics, Johns Hopkins University and Na-
tional Center for Biotechnology Information. Online
Mendelian inheritance in man, OMIM (TM), 2001.
http://www.ncbi.nlm.nih.gov/omim/.
[Cestnik, 1990] B. Cestnik. Estimating probabilities: A
crucial task in machine learning. In Proceedings of
the Ninth European Conference on Artificial Intelligence,
pages 147–150, Stockholm, Sweden, 1990. Pitman.
[Freitag and McCallum, 1999] D. Freitag and A. McCallum.
Information extraction with HMMs and shrinkage. In
Working Notes of the AAAI-99 Workshop on Machine
Learning for Information Extraction, Orlando, FL, 1999.
AAAI Press.
[Freitag and McCallum, 2000] D. Freitag and A. McCallum.
Information extraction with HMM structures learned by
stochastic optimization. In Proceedings of the Seventeenth
National Conference on Artificial Intelligence, Austin,
TX, 2000. AAAI Press.
[Hodges et al., 1998] P. E. Hodges, W. E. Payne, and J. I.
Garrels. Yeast protein database (YPD): A database for the
complete proteome of saccharomyces cerevisiae. Nucleic
Acids Research, 26:68–72, 1998.

Ijcai01 HMM

Uploaded by

Copyright:

Available Formats

Ijcai01 HMM

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ijcai01 HMM

Uploaded by

Copyright:

Available Formats

Appears in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).

Representing Sentence Structure in Hidden Markov Models

Abstract Seymore et al., 1999; Freitag and McCallum, 2000; McCal-

3 Representing Phrase Structure

NP SEGMENT this enzyme DET this this

(untyped) (untyped) (untyped) 0.6

correct. We consider POS Model and Token Model predic- 1

Viterbi path of a sentence through a model to the likelihood

This is similar to the task of optimizing parameters to recover 0.2

in response to a negative instance. To remedy this problem,

which we refer to as a Combined Model. A Combined Model

You might also like