Compacting the Penn Treebank Grammar

Krotov, Alexander; Hepple, Mark; Gaizauskas, Robert; Wilks, Yorick


arXiv:cs/9902001 (cs)

[Submitted on 31 Jan 1999]


Authors: Alexander Krotov, Mark Hepple, Robert Gaizauskas, Yorick Wilks (Department of Computer Science, University of Sheffield, UK)


Abstract: Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad-coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm to compact the derived grammar by eliminating redundant rules: rules whose right-hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is significantly smaller than that of the full treebank grammar, is shown to approach a limit. However, such a compacted grammar does not yield very good performance figures. A version of the compaction algorithm taking rule probabilities into account is proposed, which is argued to be more linguistically motivated. Combined with simple thresholding, this method can be used to give a 58% reduction in grammar size without significant change in parsing performance, and can produce a 69% reduction with some gain in recall, but a loss in precision.
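The notion of a redundant rule in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `parses_as` and `compact`, the toy `GRAMMAR`, and the choice to visit rules in sorted order are all assumptions made for illustration, and the sketch covers only the basic (non-probabilistic) compaction, not the probability-aware variant or the thresholding. A rule is dropped when its right-hand side can be analysed as its left-hand side using only the remaining rules.

```python
def parses_as(lhs, seq, rules):
    """Return True if the category sequence `seq` can be analysed as a
    single `lhs` constituent using only `rules` (pairs of (lhs, rhs))."""
    n = len(seq)
    # chart[(i, j)] holds every category known to span seq[i:j].
    chart = {(i, j): set() for i in range(n) for j in range(i + 1, n + 1)}
    for i, sym in enumerate(seq):
        chart[(i, i + 1)].add(sym)

    def covers(rhs, k, i, j):
        # Can rhs[k:] be matched against seq[i:j], one chart entry per symbol?
        if k == len(rhs):
            return i == j
        return any(rhs[k] in chart[(i, m)] and covers(rhs, k + 1, m, j)
                   for m in range(i + 1, j + 1))

    changed = True
    while changed:  # iterate to a fixpoint so unary chains are handled
        changed = False
        for (i, j), cats in chart.items():
            for head, rhs in rules:
                if head not in cats and covers(rhs, 0, i, j):
                    cats.add(head)
                    changed = True
    return lhs in chart[(0, n)]


def compact(grammar):
    """Drop each rule whose right-hand side the other kept rules can parse.
    Rules are visited in sorted order purely to make the sketch deterministic."""
    kept = set(grammar)
    for rule in sorted(grammar):
        lhs, rhs = rule
        if parses_as(lhs, rhs, kept - {rule}):
            kept.discard(rule)
    return kept


# Hypothetical toy grammar in the style of flat treebank rules.
GRAMMAR = {
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("PP", ("IN", "NP")),
    ("NP", ("DT", "NN", "PP")),        # parsable as NP -> NP PP over NP -> DT NN
    ("NP", ("DT", "NN", "IN", "NP")),  # parsable via PP -> IN NP as well
}

if __name__ == "__main__":
    for lhs, rhs in sorted(compact(GRAMMAR)):
        print(lhs, "->", " ".join(rhs))
```

In this toy grammar the two flat NP rules are removed because the three shorter rules already license the same category sequences; this is the kind of redundancy the abstract attributes to underspecified (flat) treebank structures.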

Comments:	5 pages, 2 figures
Subjects:	Computation and Language (cs.CL)
ACM classes:	I.2.7
Cite as:	arXiv:cs/9902001 [cs.CL]
	(or arXiv:cs/9902001v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.cs/9902001
Journal reference:	In Proceedings of COLING-98 (Montreal), pages 699-703

Submission history

From: Alexander Krotov
[v1] Sun, 31 Jan 1999 18:57:45 UTC (24 KB)
