Orthographic errors in web pages: Toward cleaner web corpora

C Ringlstetter, KU Schulz, S Mihov - Computational Linguistics, 2006 - direct.mit.edu


Abstract

Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.

MIT Press


Cited by 75; 13 versions
