Four heuristics to guide structured content crawling

J Umbrich, A Harth, A Hogan… - 2008 Eighth International …, 2008 - ieeexplore.ieee.org
2008 Eighth International Conference on Web Engineering, 2008 - ieeexplore.ieee.org
Search engines that focus on particular media types have difficulty discovering suitable URIs on the Web. Since such engines are interested in only a small fraction of the Web, a crawler should use heuristics to concentrate on that fraction. To devise such a heuristic, we postulate four hypotheses, based on RFCs and W3C recommendations, for finding cues that indicate certain content types. Tests on a corpus of 22 million files (793 GB of content) containing 630 million URIs show that for the content types text, image, and application the recommendations are mostly followed, while results for audio and video are much less consistent. Our findings and recommendations can be implemented as heuristics for efficient discovery of structured content on the Web on top of existing crawlers.
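To make the idea of a content-type cue concrete, the following is a minimal sketch of how a crawler might rank its frontier by a guessed top-level media type. The abstract does not say which four cues the authors test, so the file-extension cue used here, the extension table, and the function names `guess_top_level_type` and `prioritise` are all assumptions made for illustration and are not the paper's method.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative extension table (an assumption, not taken from the paper).
# Values are top-level media types as used in the abstract:
# text, image, application, audio, video.
EXTENSION_TO_TYPE = {
    ".html": "text",
    ".txt": "text",
    ".png": "image",
    ".jpg": "image",
    ".pdf": "application",
    ".rdf": "application",
    ".mp3": "audio",
    ".mp4": "video",
}


def guess_top_level_type(uri):
    """Guess the top-level media type from the URI path extension.

    Returns None when the extension is missing or not in the table.
    """
    path = urlparse(uri).path  # drops query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return EXTENSION_TO_TYPE.get(extension)


def prioritise(uris, wanted_type):
    """Return the URIs with those guessed to be of wanted_type first.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(uris, key=lambda u: guess_top_level_type(u) != wanted_type)


if __name__ == "__main__":
    frontier = [
        "http://example.org/index.html",
        "http://example.org/talk.mp3",
        "http://example.org/logo.png",
    ]
    print(prioritise(frontier, "audio"))
```

A guess of this kind only orders the frontier. The actual content type is known only once the resource is fetched, and the abstract reports that cues are much less consistent for audio and video than for text, image, and application.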