Deep Crawling of Web Sites Using Frontier Technique: Samantula Hemalatha
1 (January 2017)
Available online on http://www.rspublication.com/ijeted/ijeted_index.htm ISSN 2249-6149
ABSTRACT
I. INTRODUCTION
The potential power of web mining [8]-[10] is illustrated by one study that used a
computationally expensive technique to extract patterns from the web, and was powerful
enough to find information in individual web pages that the authors would not otherwise have
been aware of. A second illustration is given by the search engine Google, which performs
mathematical calculations on a huge matrix to extract meaning from the link structure of the
web. The development of an effective paradigm for web search is therefore a task of some
importance. A web crawler, robot, or spider is a program, or suite of programs, that iteratively
and automatically downloads web pages, extracts URLs from their HTML, and fetches those
pages in turn. A crawler could, for example, be fed the home page of a site and left to
download the rest of it.
Some of the best-known applications of web search include:
1. Web Page Content Validation Approach
2. Website Structural Analysis Approach
3. Graph Visualization Approach
4. Page Validation Approach
5. Update Notification Approach
For example, a user can navigate from the entry page to a final page through the
paths shown in Figure 1.
Figure 1 shows the sample logical structure of a web site. The site starts with Catalog
as the root node, or root URL, from which the child URLs branch into three categories:
Appliances, Electronics, and Furniture. These sub-URLs in turn lead to individual items or
sub-items, ending at the leaf-node level. If we crawl this site with a conventional web crawler,
we obtain only the total visitor count, or overall page rank, as the traffic measure; we cannot
obtain the traffic of individual pages at any level. In this paper we therefore design a new
smart crawler that crawls each individual page separately, measures how much traffic each
page receives, and reports the per-page traffic using a BFS search process. With this type of
crawling, web administrators can easily and dynamically distinguish the high-traffic pages
from the low-traffic pages on the network.
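To make the hierarchy of Figure 1 concrete, it can be represented as a small adjacency map from each page to its children. The leaf items below (Refrigerator, TV, Sofa) are invented for illustration, since the paper does not list the exact items under each category.

```java
import java.util.List;
import java.util.Map;

// A minimal sketch of the Figure 1 site hierarchy as an adjacency map.
// Leaf-item names are illustrative assumptions, not values from the paper.
public class SiteTree {
    static final Map<String, List<String>> CHILDREN = Map.of(
        "Catalog",     List.of("Appliances", "Electronics", "Furniture"),
        "Appliances",  List.of("Refrigerator"),
        "Electronics", List.of("TV"),
        "Furniture",   List.of("Sofa"));

    // Count every page reachable from the given root (root included).
    static int countPages(String root) {
        int total = 1;
        for (String child : CHILDREN.getOrDefault(root, List.of())) {
            total += countPages(child);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countPages("Catalog")); // 7 pages in this sample tree
    }
}
```

A per-page crawler walks exactly this structure, which is why a traversal order such as BFS matters: it determines when each individual page's traffic is measured.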
In this section we discuss the related work behind our proposed application, namely
web search and its fundamentals.
Nowadays, web mining has become one of the major contributors to data mining
applications. As the deep web grows larger every day, the use of web mining techniques has
also increased considerably. Web mining is divided into three categories: web usage mining,
web structure mining, and web content mining.
Web Usage Mining is an application of data mining techniques used to discover
interesting and important patterns from the deep web [1]. The discovered patterns help
characterize user behavior and serve the needs of web-based applications. The term usage
indicates that it identifies the origin and behavior of users browsing a web site [2]. Web usage
mining can be further classified into several levels based on the kind of usage data, as follows:
At the server level, web usage mining identifies the log information of a web user.
This log information is collected by the web server and includes the IP address, date and time,
accessed page reference, and so on.
At the application-server level, it tracks various kinds of business events and collects
the corresponding log information in the application servers. This level enables e-commerce
applications to be built on top of any web site with very little effort.
At the application level, new kinds of events can be derived, and logging can be turned on
for them simply by generating histories of the derived events. Many applications use this
category to store application-level data.
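As a rough illustration of the server-level record mentioned above (IP address, date and time, accessed page reference), the following sketch parses one log line in the common Apache log format. The sample line and the field layout are assumptions for illustration, not data from the paper.

```java
// A minimal sketch of server-level usage logging: pulling the IP address,
// timestamp, and requested page out of a Common Log Format line.
public class LogEntry {
    final String ip, timestamp, page;

    LogEntry(String ip, String timestamp, String page) {
        this.ip = ip; this.timestamp = timestamp; this.page = page;
    }

    // Expects: <ip> - - [<timestamp>] "GET <page> HTTP/1.1" <status> <bytes>
    static LogEntry parse(String line) {
        String ip = line.substring(0, line.indexOf(' '));
        String ts = line.substring(line.indexOf('[') + 1, line.indexOf(']'));
        String req = line.substring(line.indexOf('"') + 1, line.lastIndexOf('"'));
        String page = req.split(" ")[1];   // "GET /page HTTP/1.1" -> /page
        return new LogEntry(ip, ts, page);
    }

    public static void main(String[] args) {
        LogEntry e = parse(
            "10.0.0.1 - - [05/Jan/2017:10:15:00 +0000] \"GET /catalog HTTP/1.1\" 200 512");
        System.out.println(e.ip + " " + e.timestamp + " " + e.page);
    }
}
```

Aggregating such records per page is one simple way to obtain the per-page traffic counts the paper is concerned with.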
Web Structure Mining discovers structural information from the web [4]. It is generally
divided into two kinds, based on the structure of the data:
1. The first category of web structure mining extracts patterns from the hyperlinks
available on the web.
2. The second category mines the document structure itself, such as the HTML or XML
markup of a page.
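The first category, extracting patterns from hyperlinks, can be illustrated with a minimal sketch that pulls `href` targets out of raw HTML. The regular expression is a deliberate simplification; a production crawler would use a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A rough sketch of hyperlink extraction, the first category of web
// structure mining: collect every href target found in a page's HTML.
public class LinkExtractor {
    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));   // group 1 is the URL inside the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/appliances\">Appliances</a> <a href='/furniture'>Furniture</a>";
        System.out.println(extractLinks(html)); // [/appliances, /furniture]
    }
}
```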
I) WEB GRAPH
A web graph is a directed graph representing the web; it is the component used to
represent the web in graph form.
II) NODE
A node represents an individual web page: every page of the web site is a vertex in
the web graph.
III) EDGE
An edge represents a hyperlink. Every web page contains a set of hyperlinks to other
pages, so each edge in the graph is treated as a hyperlink from one web page to another.
IV) IN DEGREE
This is one of the main components of the graph; it represents the number of links
pointing to a particular web page, or node. The number of pages that link to a given page is
its in-degree.
V) OUT DEGREE
This is also one of the main components of the graph; it represents the number of links
originating from a particular web page, or node. The number of pages linked from a given
page is its out-degree.
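The in-degree and out-degree definitions above amount to counting directed edges by their endpoints. A minimal sketch, with an invented edge list standing in for a real link graph:

```java
import java.util.HashMap;
import java.util.Map;

// Compute in-degree and out-degree for every node of a directed web graph,
// given its edges as (from-page, to-page) pairs.
public class DegreeCounter {
    // Count how many links point INTO each page.
    static Map<String, Integer> inDegrees(String[][] edges) {
        Map<String, Integer> in = new HashMap<>();
        for (String[] e : edges) in.merge(e[1], 1, Integer::sum);
        return in;
    }

    // Count how many links go OUT of each page.
    static Map<String, Integer> outDegrees(String[][] edges) {
        Map<String, Integer> out = new HashMap<>();
        for (String[] e : edges) out.merge(e[0], 1, Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        String[][] edges = {
            {"catalog", "appliances"}, {"catalog", "electronics"}, {"electronics", "tv"}};
        System.out.println(outDegrees(edges).get("catalog")); // 2
        System.out.println(inDegrees(edges).get("tv"));       // 1
    }
}
```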
Web Content Mining extracts useful information and knowledge from web page
content. As the World Wide Web keeps growing with heterogeneous data from many sources,
there is a huge demand for identifying sources on the web using tools such as web crawlers,
meta-crawlers, and many more. Although many tools exist for extracting content from web
pages, they generally do not provide structural information, nor do they categorize, filter, or
interpret documents. These factors have prompted researchers to develop more intelligent
tools for information retrieval, such as intelligent web agents, and to extend database and data
mining techniques to provide a higher level of organization for the semi-structured data
available on the web [5], [6].
From the above figure 1, we can see that web mining is divided into three categories,
each with its own functionality, all of which were discussed in the previous paragraphs. Our
application deals with all three categories together to identify the priority of individual pages
based on page traffic. In our proposed application, we can retrieve the web page hierarchy
based on individual page traffic rather than the overall page rank.
In this paper we implement the Breadth First Search (BFS) algorithm as the search
traversal technique; the sorted URLs are then placed in BFS hierarchy order, so that there is
no chance of a visited URL being repeated among the searched URLs.
Breadth First Search (BFS) is an uninformed search method that aims to expand and
examine all nodes of a graph systematically in search of a solution. In other words, it
exhaustively searches the entire graph without considering the goal until it finds it; it does not
use a heuristic. From the standpoint of the algorithm, all child nodes obtained by expanding a
node are added to a FIFO queue. In typical implementations, nodes that have not yet been
examined for their neighbors are placed in a container (such as a queue or linked list) called
"open", and once examined they are placed in the container "closed". To build a major
search engine or a large repository such as the Internet Archive, high-performance crawlers
start out at a small set of pages and then explore other pages by following links in a breadth-
first-like fashion. In practice, web pages are often not traversed in strict breadth-first order;
a variety of policies are used instead, e.g., for pruning crawls inside a web site, or for crawling
more important pages first.
The next section presents the pseudo code used to implement the BFS algorithm and
discusses its working principle in detail.
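Since the pseudo code figure is not reproduced in this text, the following is a sketch, in Java, of the open/closed BFS procedure described earlier: undiscovered URLs wait in an "open" FIFO queue, and discovered URLs accumulate in a set that also records the visit order. The example graph and URL names are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// A sketch of BFS over a link graph: frontier URLs sit in an open FIFO
// queue; once a URL is seen it enters the discovered set, which doubles
// as the breadth-first visit order.
public class BfsCrawler {
    static List<String> bfs(Map<String, List<String>> graph, String seed) {
        Queue<String> open = new ArrayDeque<>();         // not yet examined
        Set<String> discovered = new LinkedHashSet<>();  // seen, in visit order
        open.add(seed);
        discovered.add(seed);
        while (!open.isEmpty()) {
            String url = open.remove();                  // pop the oldest frontier URL
            for (String child : graph.getOrDefault(url, List.of())) {
                if (discovered.add(child)) {             // skip already-seen URLs
                    open.add(child);
                }
            }
        }
        return List.copyOf(discovered);
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "URL1", List.of("URL2", "URL3"),
            "URL2", List.of("URL4"),
            "URL3", List.of("URL4"));
        System.out.println(bfs(graph, "URL1")); // [URL1, URL2, URL3, URL4]
    }
}
```

The discovered set is what guarantees the uniqueness property claimed earlier: a URL that has already been queued can never re-enter the frontier.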
In this section we use the BFS algorithm to search a web site. A web site consists of
web URLs, or web pages, each with its own functionality. To crawl the web site based on
individual page traffic, we use various classes that are pre-defined in Java [7]; among these,
BFS plays the most important role, since the search of the URLs is carried out with BFS. Let
us discuss this in detail with the example shown in figure 3.
Figure 3. Crawling of a sample web site using the BFS algorithm
From figure 3, we can see that there are seven URLs in the web site, numbered URL1,
URL2, and so on up to URL7. The BFS algorithm takes URL1 as the root, or seed, URL, and
from there it finds the traffic of every web page directly connected to the main URL. Four
states are used while crawling the URLs: Undiscovered, Discovered, Top of Queue, and
Finished. Since BFS uses a queue as its storage medium, all URLs are initially pushed into the
queue; we then take one URL at a time from the queue, search for its traffic, and pop it from
the queue, so that each URL is searched exactly once. Because BFS does not support
backward traversal, a URL, once crawled, is removed from the queue immediately and
marked as visited, which keeps the searched URLs unique. Initially every URL has the status
Undiscovered; once a URL is visited, it is marked Discovered. While visiting the URLs we
identify the one with the highest traffic and move it to the top level, which gives the Top of
Queue state. Finally, once every URL has been crawled a single time, the Finished state
indicates that all URLs of the web site have been crawled.
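The final step described above, surfacing the highest-traffic page once crawling reaches the Finished state, can be sketched as a simple sort of the per-URL traffic table. The traffic counts below are invented for illustration; the paper gives no concrete values.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Once every URL has been crawled exactly once, rank the finished URLs by
// their measured traffic, highest first.
public class TrafficRanker {
    static List<String> rankByTraffic(Map<String, Integer> traffic) {
        return traffic.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Integer> traffic = new LinkedHashMap<>();
        traffic.put("URL1", 120);   // hypothetical visit counts
        traffic.put("URL2", 450);
        traffic.put("URL3", 75);
        System.out.println(rankByTraffic(traffic)); // [URL2, URL1, URL3]
    }
}
```

This is the output a site administrator would consult to separate high-traffic pages from low-traffic ones.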
V. CONCLUSION
In this paper we have, for the first time, implemented a new web-page crawling
technique, a smart crawler, that crawls web pages based on traffic. The crawler uses BFS as
its search criterion for extracting the traffic of a web site at the level of individual pages;
hence we call the proposed crawler a smart crawler tool. Searching is the process of finding
required data or information in a set of unstructured or raw data. Here the raw data is a web
site, which is formed by a collection of web pages connected through hyperlinks. Nowadays,
to obtain the priority or importance of a web site on Google's servers, Google page ranking is
the main source for ranking each web site. This ranking reflects the total number of users who
viewed the web site over a certain period, displayed as a rank value; the value changes
dynamically as the number of viewers changes. The main motivation for this paper, a new
smart crawler tool for extracting page rankings from deep web sites, is that current page
ranking reveals only the total page rank, not the priority of each individual web page. After a
deep analysis of our proposed model, we conclude that this approach is best suited for
crawling web sites without SSL protection, i.e., sites still accessed over the HTTP protocol,
which can easily be crawled dynamically regardless of web site size and complexity.
VI. REFERENCES
[1] Costa, RP and Seco, N. Hyponymy Extraction and Web Search Behavior Analysis Based
On Query Reformulation, 11th Ibero-American Conference on Artificial Intelligence, 2008
October.
[2] "Facts about Google and Competition". Archived from the original on 4 November 2011.
Retrieved 12 July 2014.
[3] Baraglia, R. Silvestri, F. (2007) "Dynamic personalization of web sites without user
intervention", In Communication of the ACM 50(2): 63-67
[4] Cooley, R., Mobasher, B. and Srivastava, J. Data Preparation for Mining World Wide Web
Browsing Patterns, Journal of Knowledge and Information System, Vol.1, Issue. 1, pp. 532,
1999
[5] Cooley, R. Mobasher, B. and Srivastave, J. (1997) Web Mining: Information and Pattern
Discovery on the World Wide Web In Proceedings of the 9th IEEE International Conference on
Tool with Artificial Intelligence.
[6] "Google Press Center: Fun Facts". www.google.com. Archived from the original on 2001-
07-15.
[7] Java Tutorial, Third Edition: A Short Course on the Basics By Mary Campione,
Kathy Walrath, Alison Huml.
[8] Data Mining Techniques By Arun K Pujari.
[9] Gautam Pant, Padmini Srinivasan and Filippo Menczer. Searching the Web.
[10] S. Chakrabarti. Mining the Web. Morgan Kaufmann.