Title
Archiving Deferred Representations Using a Two-Tiered Crawling Approach
Language
English
Description (en)
Web resources are increasingly interactive, which makes them increasingly difficult to archive. The difficulty stems from the use of client-side technologies (e.g., JavaScript) to change the client-side state of a representation after it has initially loaded. We refer to these representations as deferred representations. Deferred representations can be archived more completely with tools such as headless browsing clients. We use 10,000 seed Uniform Resource Identifiers (URIs) to explore the impact of incorporating PhantomJS, a headless browsing tool, into the crawling process by comparing the performance of wget (the baseline), PhantomJS, and Heritrix. Heritrix crawled 2.065 URIs per second, 12.15 times faster than PhantomJS and 2.4 times faster than wget. However, PhantomJS discovered 531,484 URIs, 1.75 times more than Heritrix and 4.11 times more than wget. To take advantage of the performance benefits of Heritrix and the URI discovery of PhantomJS, we recommend a tiered crawling strategy in which a classifier predicts whether a representation will be deferred, and only resources with deferred representations are crawled with PhantomJS while all other resources are crawled with Heritrix. We show that this approach is 5.2 times faster than using only PhantomJS and creates a frontier (set of URIs to be crawled) 1.8 times larger than using only Heritrix.
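The tiered strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `TieredCrawler` and the three callables (`is_deferred`, `fast_crawl`, `headless_crawl`) are hypothetical stand-ins for the classifier, a Heritrix-style crawl, and a PhantomJS-style crawl, and none of them corresponds to a real Heritrix or PhantomJS API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class TieredCrawler:
    """Route each URI to a fast or a headless crawler based on a classifier.

    All three callables are hypothetical placeholders (assumptions made for
    illustration), not real Heritrix or PhantomJS interfaces.
    """
    is_deferred: Callable[[str], bool]          # classifier: predicts a deferred representation
    fast_crawl: Callable[[str], List[str]]      # first tier: returns URIs found without running JavaScript
    headless_crawl: Callable[[str], List[str]]  # second tier: returns URIs found after running JavaScript
    frontier: Set[str] = field(default_factory=set)
    counts: Dict[str, int] = field(
        default_factory=lambda: {"fast": 0, "headless": 0}
    )

    def crawl(self, uri: str) -> None:
        # Only URIs predicted to be deferred pay the cost of the slower tier.
        if self.is_deferred(uri):
            discovered = self.headless_crawl(uri)
            self.counts["headless"] += 1
        else:
            discovered = self.fast_crawl(uri)
            self.counts["fast"] += 1
        # The frontier is the set of URIs still to be crawled.
        self.frontier.update(discovered)
```

In this sketch the speed gain comes from keeping the headless tier off the common path, while the larger frontier comes from the URIs that only the headless tier can discover on deferred pages.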
Keywords (en)
Web Architecture, HTTP, Web Archiving, Memento
ISBN
978-0-692-59881-8
Author of the digital object
Justin Brunelle
Michele Weigle
Michael Nelson
Format
application/pdf
Size
1.3 MB
Licence Selected
CC BY 4.0 International
Conferences
Conference 2015
Name of Publication (en)
Proceedings of the 12th International Conference on Digital Preservation
Publisher
School of Information and Library Science, University of North Carolina at Chapel Hill
Object type
PDFDocument
Created
03.03.2016 08:06:52