Data Aggregation by Web Scraping Using Python
S. SRAVYA (206Y1A0587)
SANIYA SALWA (206Y1A0589)
T. SRIVANI (206Y1A0597)
A. PRATHIMA (206Y1A05A6)
2023-2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
External Examiner
ACKNOWLEDGEMENT
We wish to take this opportunity to express our deep gratitude to all the people who have
extended their cooperation in various ways during our project work. It is our pleasure to
acknowledge the help of all those individuals.
We would like to thank our project guide Mr. M. Mruthyunjaya, Asst. Prof., Computer
Science and Engineering Department, for his guidance and help throughout the development
of this project work by providing us with the required information. With his guidance,
cooperation and encouragement we learnt many new things during our project tenure.
We would like to thank our project coordinator Mr. V. SRINIVAS, Asst. Prof. Computer
Science and Engineering Department for his continuous coordination throughout the project
tenure.
We specially thank Dr. E. SUDARSHAN, Professor and Head of The Department,
Computer Science and Engineering Department for his continuous encouragement and
valuable guidance in bringing shape to this dissertation.
We specially thank Dr. I. RAJASRI REDDY, Principal, Sumathi Reddy Institute of
Technology for Women for his encouragement and support.
In completing this project successfully, all our faculty members have extended excellent
cooperation by guiding us in every aspect. We also thank our lab faculty and librarians.
S. SRAVYA (206Y1A0587)
SANIYA SALWA (206Y1A0589)
T. SRIVANI (206Y1A0597)
A. PRATHIMA (206Y1A05A6)
ABSTRACT
Web scraping automates the process of extracting and saving large amounts of data from
different websites with ease and in a small amount of time. It is a technique for fetching
data from websites, collecting and categorizing all the required data in one accessible
location. Most of this data is unstructured data in an HTML format, which is then
converted into structured data in a spreadsheet or a database so that it can be used in
various applications. Web scraping finds many uses at both professional and personal
levels: it can be used for Brand Monitoring and Competition Analysis, Machine Learning,
Financial Data Analysis, Social Media Analysis, SEO monitoring, etc.
CONTENTS
S.NO Topics Page
1. INTRODUCTION……………………………………...01
1.1 What is Web scraping?
1.2 Who is using web scraping?
1.3 Why Web Scraping for data science?
1.4 Why Python for Web scraping?
1.5 Different types of Web scraping
2. LITERATURE SURVEY……………………………..08
3. SYSTEM ANALYSIS………………………………....09
3.1 Existing system
3.2 Problem statement
3.3 Proposed system
3.4 Mathematical model
4. METHODOLOGY…………………………………….11
4.1 Inspect your data source
4.2 Scrape HTML content from page
4.3 Parse HTML code with Beautiful Soup
5. DESIGN………………………………………………...26
5.1 System Requirements Specification (SRS)
5.2 UML diagrams
5.3 System Study
6. TECHNOLOGIES LEARNT………………………….31
6.1 Python web scraping tools
6.2 Installation of python packages
6.3 HTTP headers.
7. TESTING………………………………………………37
8. RESULTS……………………………………………...38
9. CONCLUSION AND FUTURE SCOPE………….....39
10. BIBLIOGRAPHY…………………………………......42
LIST OF FIGURES
1.2 WHO IS USING WEB SCRAPING
There are numerous practical uses for accessing and gathering data on the web, a
considerable number of which fall in the domain of data science. The following list outlines
some interesting real-life use cases:
• Many of Google's products have benefited from Google's core business of crawling the
web. Google Translate, for example, uses text stored on the web to train and improve
itself.
• Scraping is applied a great deal in HR and employee analytics. The San Francisco-
based startup hiQ specializes in selling employee analytics by collecting and examining
public profile data, for example from LinkedIn (which was unhappy about this but has
so far been unable to prevent the practice following a legal dispute; see
https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-
data-and-sell-it-to-your-boss).
• Digital marketers and digital artists frequently use data from the web for a wide range
of interesting and creative projects. "We Feel Fine" by Jonathan Harris and Sep
Kamvar, for example, scraped various blog sites for phrases starting with "I feel," the
results of which could then visualize how the world was feeling throughout the day.
• In another study, messages scraped from Twitter, blogs, and other social media were
used to construct a data set for building a predictive model to identify patterns of
depression and suicidal thoughts. This could be an invaluable tool for help providers,
though it obviously also warrants thorough consideration of privacy-related issues (see
https://www.sas.com/en_ca/insights/articles/analytics/using-big-data-to-predict-
suicide-risk-canada.html).
• Emmanuel Sales also scraped Twitter, here with the goal of making sense of his own
social circle and timeline of posts (see https://emsal.me/blog/4). An interesting
observation here is that the author first considered using Twitter's API, but found that
Twitter heavily rate limits such requests: a user's follow list can only be fetched a
limited number of times per time window, which is quite awkward to work with.
• In a paper titled "The Billion Prices Project: Using Online Prices for Measurement and
Research" (see http://www.nber.org/papers/w22111), web scraping was used to gather
a data set of online price information that was used to construct a robust daily price
index for several countries.
• Sociopolitical scientists are scraping social websites to track population sentiment and
political orientation. A famous article called "Dissecting Trump's Most Rabid Online
Following" (see https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-
online-following/) analyzes user discussions on Reddit using semantic analysis to
characterize the online followers and fans of Donald Trump.
• One researcher was able to train a deep learning model on images scraped from
Tinder and Instagram together with their "likes" to predict whether an image would be
considered "attractive" (see http://karpathy.github.io/2015/10/25/selfie/). Smartphone
makers are already incorporating such models in their photo applications to help you
curate your pictures.
• In "The Girl with the Brick Earring," Lucas Woltmann sets out to scrape Lego brick
data from https://www.bricklink.com to determine the best selection of Lego pieces
(see http://lucaswoltmann.de/art'n'images/2017/04/08/the-girl-with-the-brick-
earring.html) to represent an image.
• In "Analyzing 1000+ Greek Wines With Python," Florents Tselai scrapes data on
around 1,000 wine varieties from a Greek wine shop (see https://tselai.com/greek-
wines-analysis.html) to investigate their origin, rating, type, and strength.
• Lyst, a London-based online fashion marketplace, scraped the web for semi-
structured data about fashion products and then applied machine learning to present
this data cleanly and elegantly to consumers on one central site. Other data scientists
have done similar projects to cluster similar fashion products (see
http://talks.lystit.com/dsl-scraping-presentation/).
• One study used web scraping to extract data from job sites, in order to get an idea of
the popularity of different data science and analytics tools in the workplace (spoiler:
Python and R were both rising steadily).
• Another study involved using web scraping to monitor news outlets and web forums
in order to track public sentiment on a given topic.
Regardless of your field of interest, there is almost always a use case to improve or
enrich your practice based on data. "Data is the new oil," as the common saying goes,
and the web has a lot of it.
1.3 WHY WEB SCRAPING FOR DATA SCIENCE
When surfing the web using a normal browser, you have probably encountered many
sites where you considered gathering, storing, and analyzing the data presented on the
site's pages. Especially for data scientists, whose "raw material" is data, the web exposes
a lot of interesting opportunities:
• There might be an interesting table on a Wikipedia page (or pages) that you want to
retrieve to perform some statistical analysis.
• Perhaps you want to get a list of reviews from a movie site to perform text mining,
build a recommendation engine, or build a predictive model to spot fake reviews.
• You may wish to get a listing of properties on a real estate site to build an appealing
geo-visualization.
• You would like to gather additional features to enrich your data set based on
information found on the web, say, weather data to forecast, for example, soft drink
sales.
• You might be thinking about performing social network analysis using profile data
found on a web forum.
• It might be interesting to monitor a news site for trending new stories on a particular
topic of interest.
Web browsers are very good at showing images, displaying animations, and laying out
websites in a way that is visually appealing to humans, but they do not expose a simple
way to export their data, at least not in most cases. Instead of viewing the web page by
page through your browser's window, wouldn't it be nice to be able to automatically
gather a rich data set? This is exactly where web scraping enters the picture. If you know
your way around the web a bit, you will probably be wondering: "Isn't this exactly what
Application Programming Interfaces (APIs) are for?" Indeed, many websites nowadays
provide such an API that offers a way for the outside world to access their data repository
in a structured way, meant to be consumed and accessed by computer programs, not
humans (although the programs are written by humans, of course). Twitter, Facebook,
LinkedIn, and Google, for example, all provide such APIs in order to search and post
tweets, get a list of your friends and their likes, see who you are connected with, and so
on. So why, then, would we still need web scraping? The point is that APIs are great
means of accessing data sources, provided that the site at hand offers one in the first
place and that the API exposes the functionality you need. The general rule of thumb is
to look for an API first and use it if you can, before setting out to build a web scraper to
gather the data.
For example, you can easily use Twitter's API to get a list of recent tweets, rather than
reinventing the wheel yourself. Nevertheless, there are still various reasons why web
scraping may be preferable over the use of an API:
• The site you want to extract data from does not provide an API.
• The API provided is not free (while the website is).
• The API provided is rate limited, meaning you can only access it a specific number of
times per second, per day, …
• The API does not expose all the data you wish to obtain (while the website does).
In these cases, the use of web scraping may come in handy. The fact remains that if you
can view some data in your web browser, you will be able to access and retrieve it
through a program. If you can access it through a program, the data can be stored,
cleaned, and used in any way.
Web scrapers can extract all the data on a particular site or just the specific data that a
user needs. Ideally, you should specify the data you need so that the web scraper only
extracts that data, and does so quickly. For example, you might scrape an Amazon page
for the types of juicers available, but you may only want the data about the models of the
various juicers and not the customer reviews.
So when a web scraper needs to scrape a site, it is first given the URLs of the required
sites. It then loads all the HTML code for those sites, and a more advanced scraper may
even extract all the CSS and JavaScript elements as well. The scraper then obtains the
required data from this HTML code and outputs it in the format specified by the user.
Mostly, this is in the form of an Excel spreadsheet or a CSV file, but the data can also be
saved in other formats, such as a JSON document.
Flexibility: Python provides several great libraries that can be used in different
situations. You can use Requests for making simple HTTP requests and, at the other end,
Selenium for scraping dynamically rendered content.
CHAPTER 2
LITERATURE SURVEY
To understand how the data extraction process has evolved so much, it is important to
understand the techniques involved in this method: web scraping has been around almost
as long as the web itself. The motivation behind commercial web scraping has
consistently been to gain a simple business advantage, and it includes things like
undercutting a competitor's special pricing, stealing leads, hijacking marketing
campaigns, redirecting APIs, and the outright theft of data and information. The first
aggregators and comparison engines appeared on the heels of the e-commerce boom and
operated largely unchallenged until the legal challenges of the mid-2000s. Early scraping
tools were quite basic, essentially copying and pasting anything visible from the site.
Today, however, it is an altogether different story: web scraping is big business, with
impressive tools and services to match.
Extraction and analysis of information are mostly used by digital publishers and
directories, travel, real estate, and e-commerce. Then again, analysis and computation
developed alongside advances in collection components and the emergence of real
databases: data came to be seen and managed as material to be prepared for analysis. A
crucial turning point was the appearance of the RDB (Relational Database) during the
1980s, which enabled users to write SQL (Structured Query Language) queries to
retrieve data from the database. For users, the advantage of RDBs and SQL is the ability
to analyze their data on demand. It made the process of getting data simple and spread
database use. Data warehouses differ from standard relational databases in that they are
usually optimized for response time to queries. The development of data mining was
made possible thanks to database and data warehouse advances, which enable
organizations to store additional information and still analyze it in a reasonable way. A
general business trend emerged in which companies started to "predict" customers'
potential needs based on analysis of their historical purchasing patterns.
CHAPTER 3
SYSTEM ANALYSIS
3.1 Existing System
In the existing system, the manual web data extraction process has two major problems.
Firstly, its costs cannot be controlled efficiently and can escalate very quickly. The data
collection costs increase as more data is collected from each website, and in order to
conduct a manual extraction, businesses need to hire a large number of staff, which
increases the cost of labour significantly. Secondly, manual extraction is known to be
error prone. Further, if a business process is very complex, then cleaning up the data can
get expensive and time consuming. The figure below illustrates the error and data
cleanup problems of the manual method.
2. Linear Regression
Linear regression is one of the simplest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, and so on. The linear regression algorithm models a linear relationship between a
dependent variable (y) and one or more independent variables (x), hence the name linear
regression. Since linear regression shows a linear relationship, it describes how the value
of the dependent variable changes with the value of the independent variable. The linear
regression model gives a sloped straight line representing the relationship between the
variables. Mathematically, a linear regression can be represented as:
y = a0 + a1x + ε
where a0 is the intercept, a1 is the slope coefficient, and ε is the random error term.
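As a small worked illustration of this formula, the sketch below estimates a0 and a1 by
ordinary least squares in plain Python; the sample values are invented purely for
demonstration and are not data used elsewhere in this project.

# sample (x, y) observations -- made-up values for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# least-squares estimates of the slope a1 and the intercept a0
a1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
a0 = y_mean - a1 * x_mean

print(f"fitted model: y = {a0:.3f} + {a1:.3f}x")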
CHAPTER 4
METHODOLOGY
There are mainly 4 steps followed in web scraping, and we use the Python programming
language to implement the scraping process. The following are the 4 main steps that we
follow to implement web scraping:
1) Inspect Your Data Source
2) Scrape HTML Content from a Page
3) Parse HTML Code with Beautiful Soup
4) Generate a CSV from the Data
Let us look at this methodology by taking the example of a web scraper that fetches
Software Developer job listings from the Monster job aggregator site, which is a static
website. The web scraper will parse HTML to pick out the relevant pieces of data and
filter that content for specific words. You can scrape any website on the Internet that you
can look at, but the difficulty of doing so depends on the site.
4.1 INSPECT YOUR DATA SOURCE
The first step is to go to the site you want to scrape using your favourite browser. You
will need to understand the site structure to extract the data you are interested in.
Explore the website
Navigate the site and interact with it just like any ordinary user would. For example, you
could search for Software Developer jobs in Australia using the site's native search
interface:
You can see that there is a list of jobs returned on the left side, and there are more
detailed descriptions of the selected job on the right side. When you click on any of the
jobs on the left, the content on the right changes. You can also see that when you interact
with the website, the URL in your browser's address bar changes as well.
Try changing the search parameters and observe how that influences the URL. Go ahead
and change these values to see how the URL changes. Then, try changing the values
directly in your URL. See what happens when you paste the following URL into your
browser's address bar:
https://www.monster.com/jobs/search/?q=Programmer&where=New-York
You will see that changes in the search box of the site are directly reflected in the URL's
query parameters, and vice versa. If you change either of them, you will see different
results on the website. When you explore URLs, you can get information on how to
retrieve data from the website's server.
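The same query parameters can also be supplied programmatically. A minimal sketch
using the requests library (the parameter names q and where come directly from the URL
above):

import requests

# build the same search URL by passing the query parameters explicitly
response = requests.get(
    "https://www.monster.com/jobs/search/",
    params={"q": "Programmer", "where": "New-York"},
)
print(response.url)          # the full URL that was actually requested
print(response.status_code)  # 200 means the page was fetched successfully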
Fig.4.1.2: The HTML on the right represents the structure of the page you can see on the left
You can think of the content displayed in your browser as the HTML structure of that
page. When you right-click elements on the page, you can select Inspect to zoom to their
location in the DOM. You can also hover over the HTML text on the right and see the
corresponding elements light up on the page. Play around and explore! The more you get
to know the page you are working with, the easier it will be to scrape it. However, don't
get too overwhelmed by all that HTML text. You will use the power of programming to
step through this maze and cherry-pick only the parts that are of interest using Beautiful
Soup.
Static Websites
The website you are scraping in this example serves static HTML content. In this
situation, the server that hosts the site sends back HTML documents that already contain
all the data you will get to see as a user.
When you inspected the page with developer tools earlier, you found that a job posting
consists of the following long and messy-looking HTML:
<section class="card-content" data-jobid="4755ec59-d0db-4ce9-8385-b4df7c1e9f7c"
onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">
<div class="summary">
<header class="card-header">
</div>
<div class="location">
<span class="name"> Woodlands, WA
</span>
</div>
</div>
<div class="meta flex-col">
<time datetime="2017-05-26T12:00">2 days ago</time>
<span class="mux-tooltip applied-only" data-mux="tooltip" title="Applied">
<i aria-hidden="true" class="icon icon-applied"></i>
<span class="sr-only">Applied</span>
It can be hard to wrap your head around a long block of HTML code. To make it easier
to read, you can use an HTML formatter to automatically clean it up a little. Good
readability helps you better understand the structure of any code block. While it may or
may not help improve the formatting of the HTML, it is always worth a try. Keep in
mind that every website will look different. That is why it is necessary to inspect and
understand the structure of the site you are currently working with before moving ahead.
The HTML above definitely has a few confusing parts in it. For example, you can scroll
to the right to see the large number of attributes that the <a> element has.
Luckily, the class names on the elements that you’re interested in are relatively
straightforward:
• class="title": the title of the job posting
• class="company": the company that offers the position
• class="location": the location where you’d be working.
If you ever get lost in a large pile of HTML, remember that you can always go back to
your browser and use the developer tools to further explore the HTML structure
interactively.
By this point, you have harnessed the power and user-friendly design of Python's
requests library. With a few lines of code, you can scrape the static HTML content from
the web and make it available for further processing. However, there are a few more
challenging situations you might encounter when you are scraping websites. Before you
start using Beautiful Soup to pick the relevant data out of the HTML that you just
scraped, take a look at two of these situations.
Hidden Websites
Some pages contain information that’s hidden behind a login. That means you’ll need an
account to be able to see (and scrape) anything from the page. The process to make an
HTTP request from your Python script is different than how you access a page from your
browser. That means that just because you can log in to the page through your browser,
that doesn’t mean you’ll be able to scrape it with your Python script.
However, there are some advanced techniques that you can use with requests to access
the content behind logins. These techniques will allow you to log in to websites while
making the HTTP request from within your script.
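As a rough illustration only (the login URL and form field names below are hypothetical
placeholders, not taken from any site used in this project), a requests session can hold on
to the cookies obtained after logging in:

import requests

# a session reuses cookies from the login on every later request
session = requests.Session()

# hypothetical login endpoint and form fields -- replace with the real ones
login_data = {"username": "my_user", "password": "my_password"}
session.post("https://www.example.com/login", data=login_data)

# subsequent requests in the same session are made as the logged-in user
page = session.get("https://www.example.com/members/jobs")
print(page.status_code)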
Dynamic Websites
Static websites are easier to work with because the server sends you an HTML page that
already contains all the information as a response. You can parse that HTML response
with Beautiful Soup and start to pick out the relevant data.
On the other hand, with a dynamic website the server might not send back any HTML at
all. Instead, you'll receive JavaScript code as a response. To offload work from
the server to the clients' machines, many modern websites avoid crunching numbers on
their servers whenever possible. Instead, they'll send JavaScript code that your browser
will execute locally to produce the desired HTML.
As mentioned before, what happens in the browser is not related to what happens in
your script. Your browser will diligently execute the JavaScript code it receives back
from a server and create the DOM and HTML for you locally. However, making a
request to a dynamic website from your Python script will not give you the HTML page
content.
When you use requests, you only receive what the server sends back. In the case of a
dynamic website, you end up with some JavaScript code, which you will not be able to
parse using Beautiful Soup. The only way to go from the JavaScript code to the content
you are interested in is to execute the code, just like your browser does. The requests
library cannot do that for you, but there are other solutions that can.
For instance, requests-html is a project created by the author of the requests library that
allows you to easily render JavaScript using syntax that is similar to the syntax in
requests. It also includes capabilities for parsing the data by using Beautiful Soup under
the hood.
Another popular choice for scraping dynamic content is Selenium. You can think of
Selenium as a slimmed-down browser that executes the JavaScript code for you before
passing on the rendered HTML response to your script. However, in our case we will
only scrape static websites using Beautiful Soup, starting from the snippet below.
import requests
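from bs4 import BeautifulSoup

# A minimal sketch of fetching the static results page and handing its HTML to
# Beautiful Soup. The exact search URL is an assumption based on the Monster
# search for Software Developer jobs in Australia explored earlier.
URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

# page.content holds the raw HTML returned by the server; 'html.parser' is the
# HTML parser that ships with Python's standard library.
soup = BeautifulSoup(page.content, 'html.parser')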
FIND ELEMENTS BY ID
In an HTML web page, every element can have an id attribute assigned. As the name
already suggests, that id attribute makes the element uniquely identifiable on the page.
You can begin to parse your page by selecting a specific element by its ID.
Switch back to developer tools and identify the HTML object that contains all of the
job postings. Explore by hovering over parts of the page and using right-click to
Inspect.
At the time of this writing, the element you are looking for is a <div> with an id
attribute that has the value "ResultsContainer". It has a couple of other attributes as
well, but below is the gist of what you are looking for:
<div id="ResultsContainer">
<!-- all the job listings -->
</div>
Beautiful Soup allows you to find that specific element easily by its ID:
results = soup.find(id='ResultsContainer')
For easier viewing, you can .prettify() any Beautiful Soup object when you print it out. If
you call this method on the results variable that you just assigned above, then you should
see all the HTML contained within the <div>:
print(results.prettify())
When you use the element's ID, you are able to pick out one element from among the
rest of the HTML. This allows you to work with only this specific part of the page's
HTML. It looks like the soup just got a little thinner! However, it is still quite dense.
FIND ELEMENTS BY HTML CLASS NAME
You have seen that every job posting is wrapped in a <section> element with the class
card-content. Now you can work with your new Beautiful Soup object called results
and select only the job postings. These are, after all, the parts of the HTML that you are
interested in! You can do this in one line of code:
job_elems = results.find_all('section', class_='card-content')
Here, you call .find_all() on a Beautiful Soup object, which returns an iterable containing
all the HTML for all the job postings displayed on that page.
Take a look at all of them:
for job_elem in job_elems:
    print(job_elem, end='\n'*2)
That is already pretty neat, but there is still a lot of HTML! You have seen earlier that
your page has descriptive class names on some elements. Let us pick out only those:
for job_elem in job_elems:
    # You can use the same methods on it as you did before.
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    print()
Great! You are getting closer and closer to the data you are actually interested in.
However, there is a lot going on with all of those HTML tags and attributes floating
around:
<div class="company">
<ul class="list-inline">
</ul>
</div>
<div class="location">
</div>
You can clean this up by adding .text to each of these elements so that only their text
content is returned. Run the amended code snippet and you will see the text content
displayed. However, you will also get a lot of whitespace. Since you are now working
with Python strings, you can .strip() the superfluous whitespace. You can also apply any
other familiar Python string methods to further clean up your text. The web is messy and
you cannot rely on a page structure to be consistent throughout. Therefore, you will often
run into errors while parsing HTML.
When you run the above code, you may encounter an AttributeError:
AttributeError: 'NoneType' object has no attribute 'text'
If that is the case, take a step back and inspect your previous results. Were there any
items with a value of None? You may have noticed that the structure of the page is not
entirely uniform. There could be an advertisement in there that displays in a different
way than the normal job postings, which may return different results. For this example,
you can safely ignore the problematic element and skip over it while parsing the HTML:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()
Go ahead and investigate why one of the elements is returned as None. You can use the
conditional statement you wrote above to print() out and inspect the relevant element in
more detail. What do you think is going on there?
After you complete the above steps, try running your script again. The results finally
look much better:
Python Developer
LanceSoft Inc
Woodlands, WA

Senior Engagement Manager
Zuora
Sydney, NSW
FIND ELEMENTS BY CLASS NAME AND TEXT CONTENT
By now, you have cleaned up the list of jobs that you saw on the website. While that is
already pretty neat, you can make your script more useful. However, not all of the job
postings seem to be developer jobs that you would be interested in as a Python
developer. So instead of printing out all of the jobs from the page, you will first filter
them for certain keywords.
You know that job titles on the page are kept within <h2> elements. To filter only for
specific ones, you can use the string argument:
python_jobs = results.find_all('h2', string='Python Developer')
This code finds all <h2> elements where the contained string matches 'Python
Developer' exactly. Note that you are directly calling the method on your first results
variable. If you go ahead and print() the output of the above code snippet to your
console, then you might be disappointed, because it will probably be empty.
There was definitely a job with that title in the search results, so why is it not showing
up? When you use string= as you did above, your program looks for exactly that string.
Any differences in capitalization or whitespace will prevent the element from matching.
In the following section, you will learn how to make the string more general.
Pass a Function to a Beautiful Soup Method
In addition to strings, you can often pass functions as arguments to Beautiful Soup
methods. You can change the previous line of code to use a function instead, as shown in
the sketch below.
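The exact line of code is not preserved in this report; based on the description that
follows, a lambda filter along these lines would do the job (a sketch reusing the results
variable from the earlier snippets; the extra check on text simply guards against <h2>
elements that contain no text at all):

python_jobs = results.find_all(
    'h2',
    string=lambda text: text and 'python' in text.lower())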
Now you are passing an anonymous function to the string= argument. The lambda
function looks at the text of each <h2> element, converts it to lowercase, and checks
whether the substring 'python' is found anywhere in there. Now you have a match:
>>> print(len(python_jobs))
1
Your program has found a match!
If you still do not get a match, try adjusting your search string. The job offers on this page
are constantly changing, and there might not be a job listed that includes the substring
'python' in its title at the time you are working.
The process of finding specific elements depending on their text content is a powerful
way to filter your HTML response for the information that you are looking for. Beautiful
Soup allows you to use either exact strings or functions as arguments for filtering text in
Beautiful Soup objects.
The filtered results will only show links to job openings that include 'python' in their
title. You can use the same square-bracket notation to extract other HTML attributes as
well. A common use case is to fetch the URL of a link, as shown in the example below.
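A short sketch of what that attribute access could look like. It assumes, as the markup
inspected earlier suggests, that each matching <h2> title wraps an <a> element whose
href attribute holds the job URL:

for h2_elem in python_jobs:
    link = h2_elem.find('a')        # the anchor element inside the title
    if link is not None:
        print(link.text.strip())    # the visible link text (the job title)
        print(link['href'])         # square-bracket notation reads an attribute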
GENERATING A CSV FROM THE DATA
Finally, we would like to save all our data in a CSV file:
import csv

with open('results.csv', 'w', newline='', encoding='utf-8') as f:
    print("Saving the extracted data into a file")
    writer = csv.writer(f)
    writer.writerow(['Title', 'Company', 'Location', 'JobUrl'])
    writer.writerows(records)
Here we create a CSV file called results.csv and save all the job postings in it for any
further use.
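The records list written out above is assumed to have been built up while looping over
the parsed job cards. A minimal sketch of that step (the field names follow the earlier
snippets, and the nested <a> element is the same assumption as before):

records = []
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue  # skip advertisements or malformed cards
    link = title_elem.find('a')
    job_url = link['href'] if link is not None else ''
    records.append([title_elem.text.strip(),
                    company_elem.text.strip(),
                    location_elem.text.strip(),
                    job_url])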
CHAPTER 5
DESIGN
5.1 SYSTEM REQUIREMENTS SPECIFICATION
5.1.1 HARDWARE REQUIREMENTS
The hardware requirements for the project are:
CPU : 2 x 64-bit, 2.8 GHz, 8.00 GT/s CPUs or better.
RAM : at least 2 GB
HARDDISK : at least 20 GB
5.1.3 DEPENDENCIES
Ensure that necessary Python packages like requests, pandas, and other supporting libraries
are installed.
5.2 UML DIAGRAMS
A UML diagram is a diagram based on the UML (Unified Modeling Language), with the
purpose of visually representing a system along with its main actors, roles, actions,
artifacts, or classes, in order to better understand, alter, maintain, or document
information about the system.
5.2.1 CLASS DIAGRAM
5.2.3 ACTIVITY DIAGRAM
3. Data Sources
❖ Identifying and categorising the target websites, considering their:
❖ Content structure.
❖ HTML markup and CSS classes.
❖ Frequency of updates.
❖ Any anti-scraping measures in place.
4. Data Aggregation Process
❖ Define the step-by-step data aggregation process:
❖ Initiate HTTP requests to target websites.
❖ Parse HTML content using Beautiful Soup and/or utilize Selenium for dynamic
content.
❖ Extract relevant data points based on defined criteria.
❖ Transform and clean data for consistency.
❖ Load data into the chosen database.
5. Data Storage
❖ Choose a suitable database (e.g., SQLite, MySQL, MongoDB) and design an
appropriate schema (a minimal SQLite sketch is shown below).
❖ Considerations such as normalization, indexing, and handling large datasets are
taken into account.
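A rough sketch of such storage with SQLite; the table layout simply mirrors the CSV
columns used in Chapter 4 and is only an assumed schema:

import sqlite3

# open (or create) the database file and make sure the jobs table exists
conn = sqlite3.connect('jobs.db')
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                    title TEXT,
                    company TEXT,
                    location TEXT,
                    job_url TEXT)""")

# records is the same list of [title, company, location, url] rows built earlier
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)", records)
conn.commit()
conn.close()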
6. Error Handling
❖ Implementing mechanisms to handle potential issues:
❖ Monitor for changes in website structure.
❖ Address connectivity issues.
❖ Log errors and provide notifications for intervention.
7. Security and Ethical Considerations
❖ Ensure compliance with legal and ethical standards:
❖ Respect website terms of service.
❖ Implement rate limiting to avoid IP blocking.
❖ Encrypt sensitive data if applicable.
8. Monitoring and Logging
❖ Develop a logging system to track:
❖ Successful data extraction.
❖ Errors and exceptions.
❖ Performance metrics.
(A combined sketch covering items 6 to 8 is given below.)
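A minimal sketch of how items 6 to 8 above could be handled together, using Python's
built-in logging module and a simple delay between requests; the log file name and
delay value are arbitrary choices:

import logging
import time
import requests

logging.basicConfig(filename='scraper.log', level=logging.INFO)

def fetch(url, delay=2.0):
    """Fetch a page politely: wait between requests and log any failures."""
    time.sleep(delay)  # simple rate limiting to avoid being blocked
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        logging.info("fetched %s (%d bytes)", url, len(response.content))
        return response.text
    except requests.RequestException as exc:
        logging.error("failed to fetch %s: %s", url, exc)
        return None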
9. User Interface
❖ Consider adding a user interface for:
❖ Configuring scraping parameters.
❖ Monitoring the scraping process.
❖ Viewing aggregated data.
10. Testing
❖ Establish a comprehensive testing plan covering:
❖ Unit testing for individual functions.
❖ Integration testing for the entire scraping pipeline.
❖ Handling edge cases and exceptions.
CHAPTER 6
TECHNOLOGIES LEARNT
6.1 Python web scraping tools
In the Python ecosystem, there are several well-established tools for executing a web
scraping project:
• Scrapy
• Selenium
• BeautifulSoup
In the following, we will go over the advantages and disadvantages of each of these three
tools.
Web scraping with Scrapy
The Python web scraping tool Scrapy uses an HTML parser to extract information from
the HTML source code of a page. This results in the following schema illustrating web
scraping with Scrapy:
URL → HTTP request → HTML → Scrapy
At the core of scraper development with Scrapy are scrapers called web spiders.
These are small programs based on Scrapy. Each spider is programmed to scrape a specific
website and crawls across the web from page to page as a spider is wont to do. Object-
oriented programming is used for this purpose. Each spider is its own Python class.
In addition to the core Python package, the Scrapy installation comes with a command-line
tool. The spiders are controlled using this Scrapy shell. In addition, existing spiders can be
uploaded to the Scrapy cloud. There the spiders can be run on a schedule. As a result, even
large websites can be scraped without having to use your own computer and home internet
connection. Alternatively, you can set up your own web scraping server using the open-
source software Scrapyd.
Scrapy is a sophisticated platform for performing web scraping with Python. The
architecture of the tool is designed to meet the needs of professional projects. For example,
Scrapy contains an integrated pipeline for processing scraped data. Page retrieval in Scrapy
is asynchronous which means that multiple pages can be downloaded at the same time.
This makes Scrapy well suited for scraping projects in which a high volume of pages needs
to be processed.
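As a rough illustration of the spider concept (a sketch only; the CSS selectors are
assumptions based on the card-content markup inspected in Chapter 4, and the live pages
may differ):

import scrapy

class JobsSpider(scrapy.Spider):
    """A minimal spider that yields one item per job card on the results page."""
    name = "jobs"
    start_urls = [
        "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"
    ]

    def parse(self, response):
        for card in response.css("section.card-content"):
            yield {
                "title": card.css("h2.title a::text").get(),
                "company": card.css("div.company span.name::text").get(),
                "location": card.css("div.location span.name::text").get(),
            }

Such a standalone spider could be run with the command scrapy runspider
jobs_spider.py -o jobs.csv, which writes the yielded items to a CSV file.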
Web scraping with Selenium
The free-to-use software Selenium is a framework for automated software testing for web
applications. While it was originally developed to test websites and web apps, the
Selenium WebDriver with Python can also be used to scrape websites. Despite the fact that
Selenium itself is not written in Python, the software’s functions can be accessed using
Python.
Unlike Scrapy or BeautifulSoup, Selenium does not use the page’s HTML source code.
Instead, the page is loaded in a browser without a user interface. The browser interprets the
page’s source code and generates a Document Object Model (DOM). This standardized
interface makes it possible to test user interactions. For example, clicks can be simulated
and forms can be filled out automatically. The resulting changes to the page are reflected
in the DOM. This results in the following schema illustrating web scraping with Selenium:
URL → HTTP request → HTML → Selenium → DOM
Since the DOM is generated dynamically, Selenium also makes it possible to scrape pages
with content created in JavaScript. Being able to access dynamic content is a key
advantage of Selenium. Selenium can also be used in combination with Scrapy or
BeautifulSoup. Selenium delivers the source code, while the second tool parses and
analyzes it. This results in the following schema:
URL → HTTP request → HTML → Selenium → DOM → HTML →
Scrapy/BeautifulSoup.
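A small sketch of that combination, assuming a locally installed Chrome browser and
using the same Monster search URL as earlier (even though Selenium would normally
be reserved for pages that actually need JavaScript rendering):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# run the browser without a visible window
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia")

# hand the rendered DOM's HTML over to Beautiful Soup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(soup.title.text)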
                                     Scrapy    Selenium    BeautifulSoup
Easy to learn                          ++         +            +++
Accesses dynamic content               ++         +++          +
Creates complex applications           +++        +            ++
Able to cope with HTML errors          ++         +            +++
Optimized for scraping performance     +++        +            +
Strong ecosystem                       +++        +            ++
Package     Use
venv        Manage a virtual environment for the project
requests    Request websites
lxml        Use alternative parsers for HTML and XML
csv         Read and write spreadsheet data in CSV format
pandas      Process and analyze data
scrapy      Use Scrapy
selenium    Use Selenium WebDriver
6.3 HTTP HEADERS
HTTP headers fall into the following four categories:
• Request Headers
• Response Headers
• Representation Headers
• Payload Headers
Let us learn about each of them in detail.
1. Request Headers
The headers sent by the client when requesting data from the server are known as Request
Headers. They also help the server recognize the request sender or client using the
information contained in the headers.
Here are some examples of the request headers.
• authority: en.wikipedia.org
• method: GET
• accept-language: en-US, en;q=0.9
• accept-encoding: gzip, deflate, br
• upgrade-insecure-requests: 1
• user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/100.0.4869.91 Safari/537.36
The user agent indicates the type of software or application used to send the request to the
server.
The Accept-Language header tells the server about the desired language for the response.
The Accept-Encoding header is a request header sent by the client that indicates the
content encoding it can understand.
Note: Not all headers in the request can be specified as request headers. For example —
The Content-Type header is not a request header but a representation header.
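When scraping, such request headers can be set explicitly. A small sketch with the
requests library (the header values are examples only, not requirements of any particular
site):

import requests

# send the request with an explicit user agent and language preference
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping",
                        headers=headers)

# inspect what the server sent back (see the Response Headers section below)
print(response.status_code)
print(response.headers.get("content-type"))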
2. Response Headers
The headers sent by the server to the client in response to the request headers from the user
are known as Response Headers. It is not related to the content of the message. It is sent by
the server to convey instructions and information to the client.
Here are some examples of the response headers.
• content-length: 35408
• content-type: text/html
• date: Thu, 13 Apr 2023 14:09:10 GMT
• server: ATS/9.1.4
• cache-control: private, s-maxage=0, max-age=0, must-revalidate
The Date header indicates the date on which the response is sent to the client.
The Server header informs the client from which server the response is returned, and
the Content-Length header indicates the length of the content returned by the server.
Note: The Content-Type header is the representation header.
3. Representation Headers
The headers that communicate information about the representation of resources in the
HTTP response body sent to the client by the server are known as Representation
Headers. The data can be transferred in several formats, such as JSON, XML, HTML, etc.
Here are some examples of the representation headers.
• content-encoding: gzip
• content-length: 35408
• content-type: text/html
The Content-Encoding header informs the client about the encoding of the HTTP response
body.
4. Payload Headers
The headers that contain information about the original resource representation are known
as Payload Headers.
Here are some examples of payload headers.
• content-length: 35408
• content-range: bytes 200–1000/67589
• trailer: Expires
The Content-Range header tells the position of the partial message in the full-body
message.
This completes the discussion of HTTP headers. There are more headers that could be
covered, but doing so would make this section long and deviate from the main topic.
CHAPTER 7
TESTING
The scraper designed with the above code only returns the data that appears on the
website when we search for it manually by visiting the site ourselves. So, each time the
CSV file is created, it contains only the data that we would see by going to the site and
searching for whatever we want. Therefore, every time we need to check the generated
CSV file to confirm that the code works correctly, and whatever input we give, relevant
or not, the extracted output data depends entirely on the website and the data it holds. A
simple check like the one sketched below can be used to verify the generated file.
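A minimal sketch of such a check; it only assumes the results.csv layout produced in
Chapter 4:

import csv

def check_results_file(path='results.csv'):
    """Verify that the generated CSV exists and has the expected header row."""
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))
    assert rows, "the CSV file is empty"
    assert rows[0] == ['Title', 'Company', 'Location', 'JobUrl'], "unexpected header"
    print(f"{len(rows) - 1} job records found in {path}")

check_results_file()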
Case 1: if we give an empty string as input
CHAPTER 8
RESULTS
As noted above, the scraper returns exactly the same data that we would see by visiting
the website manually. So, the results will be as follows:
Case 1
Case 2
CHAPTER 9
CONCLUSION AND FUTURE SCOPE
It is safe to say that web scraping has become an essential skill to acquire in today's
digital world, not just for tech companies and not just for technical positions. On one
side, compiling enormous datasets is central to Big Data analytics, Machine Learning,
and Artificial Intelligence; on the other side, with the explosion of digital data, Big
Data is becoming much easier to access than ever before.
FUTURE SCOPE
With the addition of more and more data to the world of the web, the importance of
web scraping is increasing. Many organizations now offer customized web scraping
tools to their clients, with which they gather data from all over the web and arrange it
into useful and easily understandable information. This reduces the valuable manpower
needed to manually visit every website and collect the data. Web scrapers are designed
and coded for each individual website, while crawlers do broad scraping. If a website
has a complicated structure, more coding is needed to scrape its data compared to a
simple one. The future of web scraping is indeed bright, and it will become more and
more essential for every business with the passage of time.
Web scraping services are considered one of the most widely practised activities among
the IT organizations and e-commerce stores that operate across the globe. A common
question that is often asked is why a company, business, or e-commerce store needs to
extract data from the web. The simple answer is that the Internet is the biggest source of
data on the planet and contains data on every field of life, whether it is data on a specific
product, a price list, a job, or share prices. All of this data can be gathered with the help
of web scraping. Some organizations are still using old, manual techniques for gathering
data from the web, which leaves them behind in this developing field.
One of the most extensive and successful providers of web scraping services is
Information Transformation Services (ITS), working across the globe. ITS has clients
all over the world and operates in every field and niche.
ITS has specialized software and techniques that help organizations get quick results
from the web, and it offers 99% efficient and accurate data. Its services speed up data
collection and reduce the time and resources an organization spends gathering data from
the Internet, and they further eliminate error-prone manual processes, ensuring that the
scraping is free from any manual or automatic error.
The data extracted through web scraping is mostly stored in the e-commerce system.
The system allows customers to easily search for any item and get information about it.
The data could be anything from product names, part numbers, and prices to stock
levels. This is the most popular idea in e-commerce stores and other web-based systems
whose operations are closely tied to such data, which may be connected with or
contained in e-commerce sites and member directories. With the scraping services
offered by ITS, clients can gather results from any business listings, part listings, or any
other website that may contain relevant data.
ADVANTAGES OF WEB SCRAPING
➢ Marketing
In the near future, web scraping will be one of the major tools leading the lead
generation process. A web scraping tool can carry out market research on a specific
product or service and has tremendous benefits to offer in the marketing field.
BIBLIOGRAPHY