Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

Kaser, Owen; Lemire, Daniel


arXiv:0707.1913 (cs)

[Submitted on 13 Jul 2007 (v1), last revised 22 Aug 2016 (this version, v3)]



Abstract: Collaborative work on unstructured or semi-structured documents, such as literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users or over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
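To make the idea of "frequent occurrences" concrete, here is a minimal sketch of one way such a statistical filter could work. It is not the authors' implementation: the function names `frequent_lines` and `strip_edges`, the `min_docs` threshold, and the sample marker lines are all assumptions chosen for illustration. The sketch counts how many documents each line appears in, then trims leading and trailing lines that are blank or frequent.

```python
from collections import Counter


def frequent_lines(documents, min_docs=2):
    """Return the set of lines that occur in at least `min_docs` documents.

    `min_docs` is a hypothetical threshold picked for this sketch.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each distinct, whitespace-trimmed line once per document.
        unique = {line.strip() for line in text.splitlines() if line.strip()}
        doc_freq.update(unique)
    return {line for line, n in doc_freq.items() if n >= min_docs}


def strip_edges(text, frequent):
    """Drop the leading and trailing runs of blank or frequent lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)

    def is_boilerplate(line):
        stripped = line.strip()
        return not stripped or stripped in frequent

    # Advance past the preamble, then back up past the epilogue.
    while start < end and is_boilerplate(lines[start]):
        start += 1
    while end > start and is_boilerplate(lines[end - 1]):
        end -= 1
    return "\n".join(lines[start:end])


# Illustrative input only; the marker lines are made up for this example.
docs = [
    "*** START OF EBOOK ***\nCall me Ishmael.\n*** END OF EBOOK ***",
    "*** START OF EBOOK ***\n\nIt was the best of times.\n*** END OF EBOOK ***",
]
frequent = frequent_lines(docs)
bodies = [strip_edges(d, frequent) for d in docs]
```

Trimming only at the edges leaves the body untouched even when a common line happens to appear inside it. A filter this simple would miss preamble lines that are unique to one document, such as a title or a typo-ridden manual copy, which is consistent with the abstract's remark that some documents need more than frequency counts.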

Comments:	Short version appeared in CASCON 2007 proceedings, available from this http URL. Source code at this https URL.
Subjects:	Digital Libraries (cs.DL); Computation and Language (cs.CL)
Report number:	Department of CSAS, UNBSJ Technical Report TR-07-001
Cite as:	arXiv:0707.1913 [cs.DL]
	(or arXiv:0707.1913v3 [cs.DL] for this version)
	https://doi.org/10.48550/arXiv.0707.1913

Submission history

From: Daniel Lemire [view email]
[v1] Fri, 13 Jul 2007 02:30:10 UTC (303 KB)
[v2] Fri, 21 Aug 2009 19:57:15 UTC (307 KB)
[v3] Mon, 22 Aug 2016 21:02:58 UTC (307 KB)
