On Unit-4
Subset where Outlook = Sunny:

DAY   TEMPERATURE   HUMIDITY   WIND     PLAY TENNIS
D1    Hot           High       Weak     No
D2    Hot           High       Strong   No
D8    Mild          High       Weak     No
D9    Cool          Normal     Weak     Yes
D11   Mild          Normal     Strong   Yes
Attribute: TEMPERATURE
Values(TEMPERATURE) = Hot, Mild, Cool
First, the entropy of the entire (Sunny) subset:
S_sunny = [2+, 3-]
Entropy(S_sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97
S_hot  -> [0+, 2-] => Entropy = 0
S_mild -> [1+, 1-] => Entropy = 1
S_cool -> [1+, 0-] => Entropy = 0
Gain(S_sunny, Temperature) = 0.97 - (2/5)(0) - (2/5)(1) - (1/5)(0) = 0.57
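The entropy and gain arithmetic above can be checked with a short script. This is a sketch; the `entropy` helper is our own, not part of the notes:

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:                        # 0 * log2(0) is taken as 0
            p = count / total
            e -= p * math.log2(p)
    return e

# Outlook = Sunny subset: [2+, 3-], candidate attribute Temperature
e_sunny = entropy(2, 3)                  # ~0.97, as in the notes
gain = (e_sunny
        - (2/5) * entropy(0, 2)          # Hot:  [0+, 2-]
        - (2/5) * entropy(1, 1)          # Mild: [1+, 1-]
        - (1/5) * entropy(1, 0))         # Cool: [1+, 0-]
```

The same helper can be reused for the Rain subset and any other attribute.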
[Tree diagram: Outlook = Overcast → Yes; the Sunny branch splits on Humidity: High → No, Normal → Yes]
Subset where Outlook = Rain:

DAY   TEMPERATURE   HUMIDITY   WIND     PLAY TENNIS
D4    Mild          High       Weak     Yes
D5    Cool          Normal     Weak     Yes
D6    Cool          Normal     Strong   No
D10   Mild          Normal     Weak     Yes
D14   Mild          High       Strong   No
Attribute: TEMPERATURE
Values(TEMPERATURE) = Hot, Mild, Cool
First, the entropy of the entire (Rain) subset:
S_rain = [3+, 2-]
Entropy(S_rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97
S_hot  -> [0+, 0-] => Entropy = 0
S_mild -> [2+, 1-] => -2/3 log2(2/3) - 1/3 log2(1/3) = 0.9183
S_cool -> [1+, 1-] => Entropy = 1.0
Gain(S_rain, Temperature) = 0.97 - (3/5)(0.9183) - (2/5)(1.0) ≈ 0.02
[Tree diagram: the Rain branch splits on Wind: Strong → No, Weak → Yes]
Ensemble learning:
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms to make more accurate predictions than any single model.
A model comprised of many models is called an ensemble model.
[Diagram: an ensemble built from base models such as kNN and a decision tree]
Many ensemble methods use the same type of learning algorithm for every base model; these are called "homogeneous ensembles".
There are also methods that combine different types of learning algorithms; these are called "heterogeneous ensembles".
Types of ensemble learning:
Bagging (bootstrap aggregation)
Boosting
Stacking
Cascading
Bagging:
Bagging is a general procedure that can be used to reduce the variance of an algorithm that has high variance.
Bagging runs each model independently and then aggregates the outputs at the end, without preference to any one model.
Example: random forest
In bagging, we take different random subsets of the data set and combine their models with the help of bootstrap sampling. In detail, given a training data set containing n training examples, each sample of m training examples is generated by sampling with replacement.
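Bootstrap sampling itself is a one-liner; a minimal sketch, using the 9-example toy set from the diagram below (the function name is ours):

```python
import random

def bootstrap_sample(data, m):
    """Draw m examples from `data` uniformly, with replacement."""
    return [random.choice(data) for _ in range(m)]

random.seed(0)                       # only for a reproducible illustration
data = list("ABCDEFGHI")             # the toy training set A..I
samples = [bootstrap_sample(data, 6) for _ in range(2)]
# Each sample may repeat some examples and omit others entirely.
```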
[Diagram: original data {A, B, C, D, E, F, G, H, I}; bootstrap samples such as {A, H, E, G, F, A} and {E, C, H, F, G, C}; the resulting models' outputs are combined by AVERAGE / VOTING]
Random forests
Random forest is a supervised learning algorithm which uses the ensemble learning method for classification and regression.
Random forest is a bagging technique. The trees in a random forest are run in parallel; there is no interaction between the trees while they are being built.
The basic idea behind random forest is that it combines multiple decision trees and merges their predictions together to get a more accurate and stable prediction.
The steps of the random forest algorithm are as follows:
Step 1: pick k data points at random from the training set
Step 2: build the decision tree associated with these k data points
Step 3: choose the number Ntree of trees you want to build and repeat steps 1 & 2
Step 4: for a new data point, make each of your Ntree trees predict the category to which the data point belongs, and assign the new data point to the category that wins the majority vote.
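The four steps can be sketched in plain Python. To keep the sketch short, the "tree" of Step 2 is only a depth-1 stump on a single numeric feature; the data set and all helper names here are illustrative, not from the notes:

```python
import random
from collections import Counter

def stump_fit(points):
    """Fit a depth-1 stump on (x, label) pairs: pick the threshold on x
    that misclassifies the fewest points (a stand-in for Step 2's tree)."""
    best = None
    for t in sorted({x for x, _ in points}):
        for left, right in (("A", "B"), ("B", "A")):
            errs = sum(1 for x, y in points
                       if (left if x <= t else right) != y)
            if best is None or errs < best[0]:
                best = (errs, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def random_forest_fit(data, n_trees, k):
    forest = []
    for _ in range(n_trees):            # Step 3: repeat for Ntree trees
        # Step 1: pick k data points at random (with replacement)
        sample = [random.choice(data) for _ in range(k)]
        # Step 2: build a tree (here: a stump) on those points
        forest.append(stump_fit(sample))
    return forest

def random_forest_predict(forest, x):
    # Step 4: every tree votes; the majority category wins
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

random.seed(1)
data = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]
forest = random_forest_fit(data, n_trees=11, k=4)
prediction = random_forest_predict(forest, 1)
```

In practice one would use a library implementation (e.g. scikit-learn's `RandomForestClassifier`), which also samples a random subset of features at each split.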
[Diagram: the training set is resampled into training data 1 … n; decision tree 1 … n is built on each; their predictions are combined by voting (averaging) into the final prediction]
Construct a decision tree using the ID3
algorithm - example for practice

Major      Exp           Tie      Hired?
CS         Programming   Pretty   No
CS         Programming   Pretty   No
CS         Management    Pretty   Yes
CS         Management    Ugly     Yes
Business   Programming   Pretty   Yes
Business   Programming   Ugly     Yes
Business   Management    Pretty   No
Business   Management    Pretty   No
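One way to start the practice exercise is to compute the information gain of each attribute at the root; a sketch (the `entropy`/`gain` helpers and data layout are ours, not from the notes):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting `rows` (dicts) on `attr`."""
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

rows = [
    {"Major": "CS",       "Exp": "Programming", "Tie": "Pretty"},
    {"Major": "CS",       "Exp": "Programming", "Tie": "Pretty"},
    {"Major": "CS",       "Exp": "Management",  "Tie": "Pretty"},
    {"Major": "CS",       "Exp": "Management",  "Tie": "Ugly"},
    {"Major": "Business", "Exp": "Programming", "Tie": "Pretty"},
    {"Major": "Business", "Exp": "Programming", "Tie": "Ugly"},
    {"Major": "Business", "Exp": "Management",  "Tie": "Pretty"},
    {"Major": "Business", "Exp": "Management",  "Tie": "Pretty"},
]
hired = ["No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No"]
gains = {a: round(gain(rows, hired, a), 4) for a in ("Major", "Exp", "Tie")}
# "Tie" has the highest gain here, so ID3 would test Tie at the root.
```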