21jul201512071432 DAIWAT A VYAS 1-6
Print ISSN: 2393-9907; Online ISSN: 2393-9915; Volume 2, Number 11; April-June, 2015 pp. 1-6
Krishi Sanskriti Publications
http://www.krishisanskriti.org/acsit.html
Dvijesh Bhatt, Daiwat Amit Vyas and Sharnil Pandya

Abstract—With the rapid development and increase of global data on the World Wide Web and the rapid growth in users across the globe, an acute need has arisen to improve and modify, or design, search algorithms that help in effectively and efficiently searching for the specific required data in the huge repository available. Various search engines use different web crawlers for obtaining search results efficiently. Some search engines use a focused web crawler that collects web pages which satisfy some specific property, by effectively prioritizing the crawler frontier and managing the hyperlink exploration process. A focused web crawler analyzes its crawl boundary to locate the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The process of a focused web crawler is to nurture a collection set of web documents that are focused on some topical subspaces. It identifies the next most important and relevant link to follow by relying on probabilistic models for effectively predicting the relevancy of the document. Researchers have proposed various algorithms for improving the efficiency of the focused web crawler. We investigate various types of crawlers with their pros and cons, with the major focus on the focused web crawler. Future directions for improving the efficiency of the focused web crawler are also discussed. This will provide a base reference for anyone who wishes to research or use the concept of a focused web crawler in their research work. The performance of a focused web crawler depends on the richness of links in the specific topic being searched by the user, and it usually relies on a general web search engine for providing starting points for searching.

Keywords: Focused Web Crawler, algorithms, World Wide Web, probabilistic models.

1. INTRODUCTION

Innovations in the field of web technology and data mining have had a significant impact on the way web based technologies are being developed. The Internet has been the most useful technology of modern times and has become the largest knowledge base and data repository. The Internet has various diversified uses, such as communication, research, financial transactions, entertainment, crowdsourcing, and politics, and is responsible for the professional as well as the personal development of individuals, technical and non-technical alike. Every person is so acquainted with online resources that he or she somehow depends on web online resources for day to day activities.

Search engines [6] are the most basic tools used for searching over the Internet. Web search engines are usually equipped with multiple powerful web page search algorithms. But with the explosive growth of the World Wide Web, searching for information on the web is becoming an increasingly difficult task. All this poses unprecedented scaling challenges for general purpose crawlers and search engines. Major challenges, like giving users the fastest possible access to the requested information in the most precise manner and making lighter web interfaces, are being addressed by researchers across the globe.

Web crawlers are one of the main components of web search engines, i.e. systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index and find the web pages that match those queries. Web crawling is the process by which the system gathers pages from web resources, in order to index them and support a search engine that serves the user queries. The primary objective of crawling is to quickly, effectively and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them, and provide the search results to the requesting user. A crawler must possess features like robustness and scalability.

The first generation of crawlers, on which most search engines are based, relies heavily on traditional graph algorithms like breadth-first search and depth-first search to index the web. In the NetCraft Web Server survey, the Web is measured in the number of websites, which grew from a small number in August 1995 to over 1 billion in April 2014. Due to the vast expansion of the Web and the inherently limited resources of a search engine, no single search engine is able to index more than one-third of the entire Web. This is the primary reason for the poor performance of general purpose web crawlers.

The basic purpose of enhancing the search results specific to some keywords can be achieved through a focused web crawler. [3] With the exponential increase in the number of websites, more emphasis is placed on the implementation of the focused web crawler. It is a crawling technique that
dynamically browses the Internet by choosing specific modes that maximize the probability of discovering relevant pages, given a specific search query by the user. Predicting the relevance of a document before seeing its contents, i.e. relying on the parent pages only, is one of the central characteristics of focused crawling, because it can save a significant amount of bandwidth resources. Instead of collecting and indexing all accessible web documents to be able to answer all ad-hoc queries, a focused web crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web, making it more focused on some specific keywords. This brings significant savings in hardware and network resources, and helps keep the crawl more up-to-date.

In this paper, we study a focused web crawler [1, 12], which seeks, acquires, indexes and maintains pages on a specific set of topics that represent a relatively narrow segment of the web.

The flow of topics in the paper is as follows. In the next section, the classification [2] and types of crawlers are described, and all the types of crawlers are discussed in brief. In section 3, various challenges in web crawling are discussed. In section 4, a basic overview of the focused web crawler is included, covering its basic functionality and the principle behind its working. In section 5, the architecture of the focused web crawler is drilled into: the various components of the focused web crawler and their functionality and working are discussed. In section 6, algorithms [5, 18] proposed to improve the efficiency of the focused web crawler are discussed, and some of them are explained in detail. In the last section, the paper is concluded and future work is outlined.

2. RELATED WORK

The World Wide Web is experiencing exponential growth [6], both in size and in the number of users accessing it. The quantity and variety of documentation available pose the problem of discovering information relevant to a specific topic of interest from the user's perspective. The instruments developed to ease information recovery on the Internet suffer from various limitations. Web directories cannot realize exhaustive taxonomies and have a high maintenance cost due to the need for human classification of new documents.

Web crawlers are used by web search engines and a few other sites to update their web content or their indexes of other sites' web content. A web crawler is an Internet bot, or robotic crawler, that systematically browses the World Wide Web, typically for the purpose of web indexing, which helps in faster access of information. A web crawler has many names, like Spider, Robot, Web agent, Wanderer, worm, etc.

The crawlers can be divided into two parts: universal crawlers and preferential crawlers. Further, the preferential crawlers can be divided into two parts: the focused web crawler [2] and the topical crawlers [4, 14]. The topical crawlers are of two types: adaptive topical crawlers and static crawlers. The evolutionary crawlers, reinforcement learning crawlers, etc. are examples of adaptive topical crawlers, and best-first search, PageRank algorithms, etc. are examples of static crawlers.

Fig. 1: Classification of Web Crawlers

The universal crawlers support universal search engines. A universal crawler browses the World Wide Web in a methodical, automated manner and creates an index of the documents that it accesses. It first downloads the first website. It then goes through the HTML, and when it finds a link tag, it retrieves the outside link and adds it to the list of links it plans to crawl. Thus, as the universal crawler crawls all the pages found, a huge crawl cost is incurred over many queries from the users. The universal crawler is comparatively expensive.

The preferential crawlers are topic based crawlers. They are selective in the case of web pages: they are built to retrieve pages within a certain topic. Focused web crawlers and topical crawlers are types of preferential crawlers that search related to a specific topic and only download a predefined subset of web pages from the entire web. The algorithms for the topical and focused web crawlers started with the earlier breadth-first search and depth-first search; now a variety of algorithms exists. There is De Bra's fish search [8], and shark search, a more aggressive variant of De Bra's fish search. Other algorithms using the concept of topical crawlers are link structure analysis, the PageRank algorithm [11, 19] and the HITS algorithm. Several machine learning algorithms are also used in focused web crawlers.

The adaptive topical crawlers [8, 16, 17] and the static crawlers [9] are types of topical crawlers. If a focused web crawler includes learning methods in order to adapt its behavior during the crawl to the particular environment and its relationships with the given input parameters, e.g. the set of retrieved pages and the user-defined topic, the crawler is called adaptive. Static crawlers are simple crawlers that do not adapt to the environment they are provided.
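The universal crawling loop described above (download a page, scan its HTML for link tags, and append each new link to the list of links to crawl) can be sketched in a few lines. This is an illustrative breadth-first sketch, not any particular engine's implementation; the `fetch` parameter is an assumed stand-in for a real HTTP download so the sketch stays self-contained.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_url, fetch, max_pages=100):
    """Universal (breadth-first) crawl: visit every discovered link once.

    `fetch` is a caller-supplied function url -> HTML string, standing in
    for the download step of a real crawler.
    """
    frontier = deque([seed_url])        # the list of links planned to crawl
    seen = {seed_url}
    index = {}                          # url -> page text: the crawl's index
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        index[url] = html               # "index" the downloaded document
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:        # each outside link joins the frontier
                seen.add(link)
                frontier.append(link)
    return index
```

Because such a crawler follows every link it finds, its cost grows with the whole reachable web, which is the expense the text attributes to universal crawlers; a preferential crawler would instead score each link before admitting it to the frontier.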
assigned to the buffer in the Crawler Manager. The URL buffer is a priority queue. Based on the size of the URL buffer, the Crawler Manager dynamically generates instances of the crawlers, which will download the documents. For more efficiency, the Crawler Manager can create a crawler pool. The manager is also responsible for limiting the speed of the crawlers and balancing the load among them. This is achieved by monitoring the crawlers.

Hypertext Analyzer: The Hypertext Analyzer [13] receives the keywords from the Link Extractor and finds the relevancy of the terms to the search keyword by referring to the Taxonomy Hierarchy.

A focused web crawler has the following advantages as compared to the other crawlers:

a) The focused web crawler steadily and easily acquires the relevant pages by focusing on specific keywords, while other crawlers quickly lose their way, even though they start from the same seed set.

b) It can discover valuable web pages that are many links away from the seed set, and on the other hand can acquire millions of web pages that may lie within the same radius. This helps in building high quality collections of web documents on specific topics, i.e. focused on some keywords.

c) It can also identify regions of the Web that are dynamic or growing, as compared to those that are relatively static.
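The URL buffer described earlier, a priority queue from which the Crawler Manager hands the most promising URLs to crawler instances first, could be sketched as follows. This is an illustrative sketch under assumed names (the paper does not give an interface); the relevance scores would in practice come from the crawler's relevance-prediction component.

```python
import heapq
import itertools

class URLBuffer:
    """Priority queue of (url, relevance score) pairs: pop() always returns
    the highest-scored URL queued so far. Names are illustrative only."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores
        self._queued = set()               # avoid queueing the same URL twice

    def push(self, url, score):
        if url not in self._queued:
            self._queued.add(url)
            # heapq is a min-heap, so negate the score to pop highest-first
            heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A manager loop would then watch `len(buffer)` and spawn or retire crawler instances accordingly, which matches the load-balancing role described above.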
than the depth of the parent. When the depth reaches zero, the direction is dropped and none of its children are inserted into the list. The algorithm is helpful in forming the priority table, but its limitation is that there is very little differentiation among the priorities of the URLs: many documents have the same priority. The scoring capability is also problematic, as it is difficult to assign a more precise potential score to documents which have not yet been fetched.

Shark-Search: The Shark-Search [15] is an improved version of the Fish-Search algorithm. Fish-Search did a binary evaluation of the URL to be analyzed, so the actual relevance could not be obtained. In Shark-Search, a similarity engine is called which evaluates the relevance of documents to a given query. Such an engine analyzes two documents dynamically and returns a "fuzzy" score, i.e., a score between 0 and 1 (0 for no similarity whatsoever, 1 for a perfect "conceptual" match) rather than a binary value. The similarity algorithm can be seen as orthogonal to the Fish-Search algorithm. The potential score of an extracted URL can be refined by the meta-information contained in the links to the documents.

Genetic Algorithm: The genetic algorithm [7, 20] is used to improve the quality of search results obtained from focused crawling. It is an adaptive and heuristic method for optimization and for improving search results. It exploits several techniques inspired by biological evolution, such as inheritance, selection, cross-over and mutation. It has four phases. In the first phase, parameters like the population size, generation size, cross-over rate (the probability of cross-over) and mutation rate (the probability of mutation) are fixed. Initial URLs are fetched by the crawler. On the basis of Jaccard's similarity function, a fitness value is assigned to each web page. The higher the fitness value, the more similar the page is to the domain lexicon. Jaccard's function, based on links, is the ratio of the number of intersection links to union links between two web pages; the more common links, the higher the Jaccard score. After the fitness values are calculated, pages with better fitness values are selected by a random number generator: some relevant pages are selected and the rest are discarded. Then, all outbound links are extracted from the surviving pages and a cross-over operation is performed to select the most promising URLs. The cross-over value of a URL is the sum of the fitness of all pages that contain the URL. Based on the cross-over value, the URLs are sorted and put into the crawling queue. The mutation operation is aimed at giving the crawler the ability to explore multiple suitable web communities: random keywords from the lexicon are extracted and run as queries in well-known search engines. Top results from the search engines and results from the cross-over phase are combined to give more optimal results.

Some challenges to deal with while using focused web crawling are mentioned below:

Missing Relevant Pages: One issue with focused web crawlers is that they may miss relevant pages by only crawling pages that are expected to give immediate benefit.

Maintaining Freshness of Database: Many HTML pages consist of information that gets updated on a daily, weekly or monthly basis. The crawler has to download these pages and update them in the database to provide up-to-date information to the users. The crawling process becomes slow and puts pressure on Internet traffic if such pages are large in number. Thus, a major issue is to develop a strategy that manages these pages.

Absence of Particular Context: The focused web crawler uses the best strategy to download the most relevant pages based on some criteria. The crawler focuses on a particular topic, but in the absence of a particular context it may download a large number of irrelevant pages. Thus the challenge is to develop focused crawling techniques that also focus on a particular context.

With some specific mechanisms, like handling the freshness of the database, handling context, making sure relevant pages are retrieved, and supporting indexing algorithms, we can to a very good extent improve the performance of focused web crawling.

7. CONCLUSION

In this paper, we discussed the focused web crawler, the functionality and methodology of its working, its various components, and the various algorithms that can be implemented with the focused web crawler so that its efficiency can be improved. The focused web crawler is a system that learns its specialization from examples and then explores the Web, guided by a relevance and popularity rating mechanism. It filters at the data acquisition level, rather than as a post-processing step. Thus, the focused web crawler proves to have better performance than general web crawlers. We can implement algorithms to improve the results obtained from the focused web crawler, and new algorithms can be developed and applied to the focused web crawler to increase the optimality of the web page results.

REFERENCES

[1] S. Chakrabarti, M.H. van den Berg, and B.E. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," Computer Networks, vol. 31, nos. 11-16, pp. 1623-1640, 1999.

[2] Kamil Alkan and Rifat Ozcan, "Comparing classification methods for link context based focused crawlers," Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey.

[3] G. Pant and P. Srinivasan, "Learning to Crawl: Comparing Classification Schemes," ACM Trans. Information Systems, vol. 23, no. 4, 2005.

[4] G. Pant and P. Srinivasan, "Link contexts in classifier guided topical crawlers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 107-122, Jan. 2006.