Figure 2: HMM input representations. (a) The Phrase representation: the sentence is segmented into typed phrases. (b) The
POS representation: the sentence is segmented into words typed with part-of-speech tags. (c) The Token representation: the
sentence is segmented into untyped words. For each representation, the labels (PROTEIN, LOCATION) are only present in the
training sentences.
example, the segment "this enzyme" in Figure 2a could be emitted by any state with type NP_SEGMENT, regardless of its label. Each state that has a label corresponding to a domain of the relation plays a direct role in extracting tuples.

Figure 3 is a schematic of the architecture of our phrase-based hidden Markov models. The top of the figure shows the positive model, which is trained to represent positive instances in the training set. The bottom of the figure shows the null model, which is trained to represent negative instances in the training set. Since our Phrase representation includes 14 phrase types, both models have 14 states without labels, and the positive model also has five to six additional labeled states (one for each ⟨type, label⟩ combination that occurs in the training set). We assume a fully connected model; that is, the model may emit a segment of any type at any position within a sentence.
To train and test our Phrase models, we have to modify the standard Forward, Backward and Viterbi algorithms [Rabiner, 1989]. The Forward algorithm calculates the probability \alpha_q(i) of a sentence being in state q of the model after having emitted i elements of an instance. When a sentence is represented as a sequence of tokens, the algorithm is based on the following recurrence:

    \alpha_{START}(0) = 1
    \alpha_q(0) = 0, \qquad q \neq START
    \alpha_q(i) = M(w_i \mid q) \sum_r T(q \mid r) \, \alpha_r(i-1)        (1)

where M and T represent the emission and transition distributions respectively, w_i is the ith element in the instance, and r ranges over the states that transition to q.
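To make the recurrence concrete, the following is a minimal sketch of this token-level Forward pass. The dictionary-based M and T, and the state set, are our own stand-ins for the model's distributions, not the authors' implementation.

```python
def forward(tokens, states, M, T):
    """Token-level Forward pass (Equation 1).

    M[q][w] -- probability that state q emits word w
    T[q][r] -- probability of a transition from state r to state q
    Returns alpha, where alpha[i][q] is the probability of emitting
    the first i tokens and ending in state q.
    """
    alpha = [{q: 0.0 for q in states} for _ in range(len(tokens) + 1)]
    alpha[0]["START"] = 1.0  # base case: alpha_START(0) = 1, all others 0
    for i, w in enumerate(tokens, start=1):
        for q in states:
            # sum over the predecessor states r that transition to q
            total = sum(T[q][r] * alpha[i - 1][r] for r in states)
            alpha[i][q] = M[q].get(w, 0.0) * total
    return alpha
```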
Our modification involves changing the last part of this recurrence as follows:

    \alpha_q(i) =
    \begin{cases}
      \left[ \prod_{j=1}^{|p_i|} M(w_j \mid q) \right] \sum_r T(q \mid r) \, \alpha_r(i-1), & \text{if } type(q) = type(p_i); \\
      0, & \text{otherwise.}
    \end{cases}        (2)

Here w_j is the jth word in the ith phrase segment p_i, and type is a function that returns the type of a segment or state as described above. The two key aspects of this modification are that (i) the type of a segment has to agree with the type of the state in order for the state to emit it, and (ii) the emission probability of the words in the segment is computed as the product of the emission probabilities of the individual words. This latter aspect is analogous to having states use a Naïve Bayes model for the words in a phrase. Note that this equation requires a normalization factor to define a proper distribution over sentences. However, since we use these equations to make relative comparisons only, we leave this factor implicit. The modifications to the Viterbi and Backward algorithms are similar to this modification of the Forward algorithm.
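As a sketch of how Equation 2 changes the computation, the step below checks the type agreement and scores a whole phrase segment as a Naïve-Bayes-style product over its words. The Segment class and the function signature are illustrative, not from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class Segment:
    words: list   # words in the phrase segment, e.g. ["this", "enzyme"]
    type: str     # segment type, e.g. "NP_SEGMENT"

def phrase_forward_step(seg, q, alpha_prev, states, M, T, state_type):
    """One application of Equation 2 for segment seg and state q."""
    if state_type[q] != seg.type:
        return 0.0  # a state can only emit a segment of its own type
    # Naive-Bayes-style emission: product of the word probabilities
    emit = math.prod(M[q].get(w, 0.0) for w in seg.words)
    # the usual sum over predecessor states
    return emit * sum(T[q][r] * alpha_prev[r] for r in states)
```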
Given these modifications to the Forward and Backward algorithms, we could train phrase-based models using the Baum-Welch algorithm [Baum, 1972]. However, for the models we consider here, there is no hidden state for training examples (i.e., there is an unambiguous path through the model for each example), and thus there is no need to use Baum-Welch. Instead, we assume a fully connected model and obtain transition frequencies by considering how often segments with various ⟨type, label⟩ annotations are adjacent to each other. We smooth these frequencies over the set of possible transitions for every state using m-estimates [Cestnik, 1990]. In a similar manner, we obtain the emission frequencies of the words in each state by summing over all segments with the same ⟨type, label⟩ annotations in our training set. We smooth these frequency counts using m-estimates over the entire vocabulary of words.
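For reference, an m-estimate smooths a raw frequency toward a prior; a minimal sketch follows, assuming a uniform prior over the outcomes and a placeholder value of m (the paper does not report its settings).

```python
def m_estimate(count, total, num_outcomes, m=1.0):
    """m-estimate of probability [Cestnik, 1990]:
    (count + m * p) / (total + m), here with uniform prior p = 1/num_outcomes.

    For emissions, num_outcomes is the vocabulary size; for transitions,
    it is the number of possible successor states.
    """
    prior = 1.0 / num_outcomes
    return (count + m * prior) / (total + m)
```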
Figure 3: The general architecture of our phrase-based HMMs. The top part of the figure shows the positive model and the bottom part of the figure shows the null model.

Once the model has been constructed, we use it to predict tuples in test sentences. We use the Viterbi algorithm, modified as described above, to determine the most likely path of a sentence through the positive model. We consider a sentence to represent a tuple of the target relation if and only if two conditions hold (see the sketch after this list):
1. The likelihood of emission of the sentence by the positive model is greater than the likelihood of emission by the null model: \alpha^{P}_{END}(n) > \alpha^{N}_{END}(n), where P and N refer to the positive and null models respectively and the sentence has n segments.

2. In the Viterbi path for the positive model, there are segments aligned with states corresponding to all the domains of the relation. For example, for the subcellular-localization relation, the Viterbi path for a sentence must pass through a state with the PROTEIN label and a state with the LOCATION label.
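Taken together, the two conditions amount to a check like this sketch; the argument names are ours, and the likelihoods and Viterbi path are assumed to come from the modified algorithms described above.

```python
def predicts_tuple(pos_likelihood, null_likelihood, viterbi_labels, domain_labels):
    """Return True iff a sentence is taken to express a tuple.

    pos_likelihood / null_likelihood -- alpha_END(n) of the sentence
        under the positive and null models
    viterbi_labels -- labels of the states on the sentence's Viterbi
        path through the positive model
    domain_labels  -- labels required by the relation, e.g.
        {"PROTEIN", "LOCATION"} for subcellular-localization
    """
    # Condition 1: the positive model must explain the sentence better
    if pos_likelihood <= null_likelihood:
        return False
    # Condition 2: the Viterbi path must visit a state for every domain
    return domain_labels <= set(viterbi_labels)
```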
Note that even after phrases have been identified in this way, the extraction task is not quite complete, since some of the phrases might contain words other than those that belong in an extracted tuple. Consider the example in Figure 2a. The LOCATION phrase contains the word "the" in addition to the location. Therefore, tuple extraction with these models must include a post-processing phase in which such extraneous words are stripped away before tuples are returned. We do not address this issue here. Instead, we consider a prediction to be correct if the model correctly identifies the phrases containing the target tuple as a subphrase.
It is possible to have multiple predicted segments for each domain of the relation. In this case, we must decide which combinations of segments constitute tuples. We do this using two simple rules (see the sketch after this list):

1. Associate segments in the order in which they occur. Thus for subcellular-localization, the first segment matching a PROTEIN state is associated with the first segment matching a LOCATION state, and so on.

2. If there are fewer segments containing an element of some domain, use the last match of this domain to construct the remaining tuples. For instance, if we predicted one PROTEIN phrase P1 and two LOCATION phrases L1 and L2, we would create two tuples based on ⟨P1, L1⟩ and ⟨P1, L2⟩.
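A sketch of these two rules for a binary relation; the function and its signature are ours.

```python
def pair_segments(proteins, locations):
    """Combine predicted PROTEIN and LOCATION segments into tuples.

    Rule 1: associate segments in their order of occurrence.
    Rule 2: if one domain has fewer matches, reuse its last match for
            the remaining tuples (one protein P1 and two locations
            L1, L2 yield (P1, L1) and (P1, L2)).
    """
    if not proteins or not locations:
        return []
    pairs = []
    for i in range(max(len(proteins), len(locations))):
        p = proteins[min(i, len(proteins) - 1)]    # clamp to last match
        l = locations[min(i, len(locations) - 1)]  # (rule 2)
        pairs.append((p, l))
    return pairs
```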
3.1 Experiments

In the experiments presented in this section, we test our hypothesis that incorporating phrase-level sentence structure into our model provides improved extraction performance in terms of precision and recall. We test this hypothesis by comparing against several hidden Markov models that represent less information about the grammatical structure of sentences. Henceforth, we refer to the model described above as the Phrase Model.

The first model we compare against, which we call the POS Model, is based on the representation shown in Figure 2b. This model represents some grammatical information, in that it associates a type with each token indicating the part-of-speech (POS) tag for the word (as determined by Sundance). However, unlike the Phrase Model, the POS Model represents sentences as sequences of tokens, not phrases. This model is comparable in size to the Phrase Model. The positive component of this model has 17 states without labels and six to ten states with labels (depending on the training set). The null component of the model has 17 states without labels.

The other models we consider, which we call the Token Models, are based on the representation shown in Figure 2c. This representation treats a sentence simply as a sequence of words. We investigate two variants that employ this representation. The simpler of the two hidden Markov models based on this representation, which we refer to as Token Model 1, has three states in its positive model and one state in its null model (not counting the START and END states). None of the states in this model have types. Two of the states in the positive model represent the domains of the binary target relation, while the remaining states have no labels. The role of the latter set of states is to model all tokens that do not correspond to the domains of the target relation. A more complex version of this model, which is illustrated in Figure 4, has three unlabeled states in its positive model. We define the transitions and train these models in such a way that these three states can specialize to (i) tokens that come before any relation instances, (ii) tokens that are interspersed between the domains of relation instances, and (iii) tokens that come after relation instances.
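One way to picture the intended specialization is as a transition structure along the following lines; this adjacency sketch is our reading of Figure 4, with hypothetical state names, not a specification from the paper.

```python
# Hypothetical topology for Token Model 2: three unlabeled context
# states ("before", "between", "after") around the two labeled states.
TOKEN_MODEL_2_SUCCESSORS = {
    "START":    ["before", "PROTEIN", "LOCATION"],
    "before":   ["before", "PROTEIN", "LOCATION"],
    "PROTEIN":  ["between", "LOCATION", "after", "END"],
    "LOCATION": ["between", "PROTEIN", "after", "END"],
    "between":  ["between", "PROTEIN", "LOCATION"],
    "after":    ["after", "END"],
}
```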
The training algorithm used for the POS Model is identical to that used for the Phrase Model. The training algorithm for the Token Models is essentially the same, except that there are no type constraints on either the tokens or states.

Since we consider a prediction made by the Phrase Model to be correct if it simply identifies the phrases containing the words of the tuple, we use a similar criterion to decide if the predictions made by the POS Model and Token Models are correct.

Figure 4: The architecture of Token Model 2.

Figure 5: Precision vs. recall for the four models (Phrase Model, POS Model, Token Model 1, Token Model 2) on the subcellular-localization data set.
We process the HMM input data (after parsing, in cases where Sundance is used) by stemming words with the Porter algorithm [Porter, 1980], and replacing words that occur only once in a training set with a generic UNKNOWN token. The statistics for this token are then used by the model when emitting out-of-vocabulary words encountered during prediction. Similarly, numbers are mapped to a generic NUMBER token.
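A minimal sketch of this preprocessing on tokenized training sentences, assuming NLTK's PorterStemmer as the Porter implementation; the helper names are ours.

```python
from collections import Counter
from nltk.stem.porter import PorterStemmer  # assumed Porter implementation

def preprocess(train_sentences):
    """Stem words, map numbers to NUMBER, and map words that occur only
    once in the training set to UNKNOWN (the UNKNOWN statistics are also
    used at prediction time for out-of-vocabulary words)."""
    stemmer = PorterStemmer()
    stemmed = [[stemmer.stem(w) for w in s] for s in train_sentences]
    counts = Counter(w for s in stemmed for w in s)

    def normalize(w):
        if w.replace(".", "", 1).isdigit():
            return "NUMBER"        # numbers -> generic NUMBER token
        if counts[w] <= 1:
            return "UNKNOWN"       # singletons -> generic UNKNOWN token
        return w

    return [[normalize(w) for w in s] for s in stemmed]
```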
Positive predictions are ranked by a confidence measure, which is computed as the ratio of the likelihood of the sentence under the positive model to its likelihood under the null model.
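A sketch of this ranking, scoring each positive prediction by the log-likelihood ratio of the two models; the data layout is our assumption.

```python
import math

def rank_predictions(predictions):
    """Sort positive predictions by confidence, highest first.

    predictions -- list of (extracted_tuple, pos_likelihood, null_likelihood)
    confidence  -- log of the ratio of the sentence likelihood under the
                   positive model to that under the null model
    """
    def confidence(item):
        _, pos_lik, null_lik = item
        return math.log(pos_lik) - math.log(null_lik)

    return sorted(predictions, key=confidence, reverse=True)
```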