Web scraping is a way of collecting data from websites using code. It goes by other names as well, such as data scraping, screen scraping, or web crawling. Whichever term you use, the process is the same: examine a website’s HTML markup, then write code that fetches that HTML and extracts the data you need.
How to choose the right tools
Even though web scraping code can be written in many languages, Python stands out primarily because it has a vast number of open-source libraries that are easy to use and very powerful.
As every website is unique, you will face different challenges while scraping. On top of that, your data collection objectives can vary widely, from a simple one-time scrape of a small website to daily price monitoring of millions of products.
Python has libraries for all of these cases: simple libraries for a quick one-time scrape and full frameworks that can handle millions of pages.
Popular Python Libraries for Web Scraping
1. Requests
Requests is one of the most commonly used libraries in web scraping. Using this library, you can send HTTP requests and receive a response.
Let’s see a quick example. First, install this library using `pip`:
$ pip3 install requests
Then run the following code:
import requests
response = requests.get('https://sandbox.oxylabs.io/')
print(f'Status Code: {response.status_code}')
print(f'HTML Markup: {response.text}')
Once run, this code will print the status code (200 on success) and the HTML markup of the page.
The Requests library also makes it very easy to route your HTTP requests using proxies. See this example:
proxy = {
'http': 'http://username:[email protected]:8888',
'https': 'https://username:[email protected]:8888',
}
response = requests.get('https://sandbox.oxylabs.io/', proxies=proxy)
If you don’t have a proxy provider, here is a list of the best proxy providers.
The limitation of the `Requests` library is that it returns the HTML as plain text, making it difficult to extract specific information. This is where you would need a library that can parse the HTML.
2. Beautiful Soup
Beautiful Soup is another of the most commonly used libraries for web scraping. It provides simple methods that make it easy to extract specific information from any HTML.
Note that Beautiful Soup is a wrapper around parser libraries such as `lxml` and `html5lib`. Using a library like `lxml` directly is tedious, especially because you need to learn XPath. Beautiful Soup hides these complications and even lets you switch the underlying parser if you want to.
A note about installation: Beautiful Soup 3 does not support Python 3 and thus has been retired. Always use Beautiful Soup 4.
$ pip3 install beautifulsoup4
Let’s begin with a simple HTML string:
from bs4 import BeautifulSoup
html_content = '<html><head><title>Test</title></head><body>Hello World</body></html>'
soup = BeautifulSoup(html_content, 'html.parser')
print(soup.title) #prints <title>Test</title>
print(soup.title.string) #prints Test
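As mentioned earlier, switching the underlying parser is just a matter of changing the second argument. For example, to use the faster `lxml` parser instead (assuming the `lxml` package is installed):
soup = BeautifulSoup(html_content, 'lxml')
print(soup.title.string) #prints Test, same result with a different parser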
Typically, you will combine Requests and Beautiful Soup for web scraping as follows:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://sandbox.oxylabs.io/products')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.string) # prints 'E-commerce | Oxylabs Scraping Sandbox'
More commonly, you will use the `find` and `find_all` methods. Let’s see an example.
If you examine the HTML of https://sandbox.oxylabs.io/products, you will notice that all the games are in h4 tags.
You can find all the h4 tags as follows to get the names of the games:
for game in soup.find_all('h4'):
    print(game.string) #prints names of all games on the page
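The `find` method works the same way but returns only the first match instead of a list:
first_game = soup.find('h4')
print(first_game.string) #prints the name of the first game only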
3. lxml
The lxml library is primarily for parsing XML documents. However, it can handle HTML well using the lxml.html module. Note that you need to use XPath to query the elements.
Let’s see an example. First, install this library:
$ pip install lxml
Next, get the HTML using the Requests library and then parse it using lxml:
import requests
from lxml import html
response = requests.get('https://sandbox.oxylabs.io/products')
tree = html.fromstring(response.content)
title = tree.findtext('.//title')
print(title) # prints 'E-commerce | Oxylabs Scraping Sandbox'
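Since lxml supports XPath, you can also extract the game names shown in the Beautiful Soup example with a single XPath expression. Here is a minimal sketch that reuses the `tree` object from above:
games = tree.xpath('//h4/text()') #XPath query for the text of all h4 elements
for game in games:
    print(game)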
4. Selenium
Selenium is made for browser automation: it loads web pages in a real browser. This makes Selenium suitable for scraping dynamic web pages, that is, pages that load content using JavaScript.
You need two components before writing any Python code: the Selenium package and a Selenium driver for your browser.
For Chrome, you can find the driver here.
Selenium can be installed using `pip`:
$ pip3 install selenium
Here is sample code that uses Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
driver.get('https://sandbox.oxylabs.io/products')
games = driver.find_elements(By.TAG_NAME, 'h4')
for game in games:
    print(game.text)
driver.quit()
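On pages where content is injected by JavaScript after the initial load, you may need to wait for elements to appear before reading them. Here is a brief sketch using Selenium's explicit waits, reusing the `driver` object from above before it is quit (the 10-second timeout is an arbitrary choice):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds until at least one h4 element is present
wait = WebDriverWait(driver, 10)
games = wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, 'h4')))
for game in games:
    print(game.text)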
5. Playwright
Playwright is a similar browser automation tool that has recently gained popularity. In addition to Python, Playwright works with multiple languages, such as JavaScript and .NET.
Install Playwright using `pip`:
$ pip3 install playwright
Unlike Selenium, it does not use a separate browser and driver; instead, it downloads and manages its own browser builds, including Chromium:
$ playwright install chromium
Here is sample code that uses proxies. Notice how easy it is to set them up. You can remove the proxy argument from the launch options if you don’t need a proxy; however, we highly recommend using one for the reasons listed at the end of the article.
from playwright.sync_api import sync_playwright
proxy = { 'server': 'http://username:[email protected]:8888' }
with sync_playwright() as p:
    browser = p.chromium.launch(proxy=proxy)
    page = browser.new_page()
    page.goto('https://sandbox.oxylabs.io/products')
    games = page.locator('h4')
    for game in games.all_text_contents():
        print(game)
    browser.close()
If you are looking for a good proxy provider, look at the residential proxy pool.
6. AIOHTTP
AIOHTTP is an asynchronous HTTP client for Python built on the asyncio library. Combining the two lets you write code that sends many HTTP requests concurrently and then parses and queries the responses.
Think of this library as the requests library that can work asynchronously.
This means you still need a parser such as Beautiful Soup. However, because web scraping is I/O bound and most of the time is spent waiting on the network, using this library can still make your code very fast.
Let’s start with installing this library:
$ pip install aiohttp
Here is a simple example that uses AIOHTTP along with Beautiful Soup to parse the game names:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
async def fetch(url, session, proxy):
    async with session.get(url, proxy=proxy) as response:
        return await response.text()

async def main():
    proxy = 'http://username:[email protected]:8888'
    url = 'https://sandbox.oxylabs.io/products'
    async with aiohttp.ClientSession() as session:
        html = await fetch(url, session, proxy)
        soup = BeautifulSoup(html, 'html.parser')
        for game in soup.find_all('h4'):
            print(game.string)
asyncio.run(main())
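The real benefit of AIOHTTP shows when you fetch many pages concurrently. Below is a minimal sketch that downloads several pages at once with `asyncio.gather`; the paginated URLs are only an assumption for illustration, and proxies are omitted for brevity:
import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.text()

async def main():
    # Hypothetical paginated URLs, used only to illustrate concurrency
    urls = [f'https://sandbox.oxylabs.io/products?page={i}' for i in range(1, 4)]
    async with aiohttp.ClientSession() as session:
        # Send all requests concurrently instead of one after another
        pages = await asyncio.gather(*(fetch(url, session) for url in urls))
    for html in pages:
        soup = BeautifulSoup(html, 'html.parser')
        for game in soup.find_all('h4'):
            print(game.string)

asyncio.run(main())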
7. Scrapy
Scrapy is an open-source framework for extracting data. Note the emphasis on the word framework. It can send HTTP requests and parse the response, and you can query the response using CSS or XPath selectors. It even has item processors that can handle cleaning up or processing the data. It is asynchronous by design and can send output to many file formats.
Depending on your platform, you may need a C compiler to install this library. See the installation documentation for more details.
$ pip3 install scrapy
You will notice that it installs many other libraries on which it depends.
See this example Scrapy spider:
import scrapy
class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://sandbox.oxylabs.io/products']

    def parse(self, response):
        for game in response.css('h4::text').getall():
            yield {'game_name': game}
To run this Scrapy spider, you would need to use the scrapy executable:
$ scrapy runspider example.py
If you want to send the output to a JSON file, add `-o output.json` to the above command.
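For example:
$ scrapy runspider example.py -o output.json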
If you are working with dynamic pages, you can integrate Playwright within Scrapy using the scrapy-playwright plugin.
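As a rough sketch, the plugin is typically enabled through the project settings and a per-request flag; check the scrapy-playwright documentation for the exact, current configuration:
# settings.py - enable the Playwright-based download handler (sketch, per the plugin's docs)
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Requests that should be rendered in a browser are then marked with meta={'playwright': True}.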
Using proxies with web scraping
You will encounter a few challenges when working on a web scraping task. The most common one is getting your IP banned, which can be temporary or permanent. The most practical solution to these problems is using a proxy.
You would also see some other issues, such as:
- Geolocation restrictions: Some sites allow access from only a specific country or sometimes only from a particular city.
- Rate limiting: A prevalent problem where websites allow only a small number of requests within a time frame. Exceeding the limit usually results in slow responses or IP bans.
- Captcha challenges: Another common problem you will face. When websites see a lot of traffic from a single IP address, they send a captcha in the response.
Apart from these common challenges, there are many other reasons why you should use proxies:
- Accessing multiple accounts: If you use proxies, you can use multiple credentials to access various accounts simultaneously, making scraping faster.
- Anonymity: Proxies hide your actual IP address, keeping you anonymous and helping you collect unbiased data. You can use a residential proxy pool for better anonymity and higher success rates.
- Quality of data: Proxies also improve data quality by distributing traffic. Requests coming from multiple locations bypass a website’s tendency to serve the same region-specific content, as illustrated in the sketch below.
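As a minimal illustration of distributing traffic, here is a sketch that rotates requests through a small proxy pool with the Requests library; the proxy addresses are placeholders, not real endpoints:
import itertools
import requests

# Placeholder proxy endpoints - replace with your provider's addresses
proxy_pool = itertools.cycle([
    'http://username:password@proxy1.example.com:8888',
    'http://username:password@proxy2.example.com:8888',
])

for _ in range(6):
    proxy = next(proxy_pool) # pick the next proxy in the rotation
    proxies = {'http': proxy, 'https': proxy}
    response = requests.get('https://sandbox.oxylabs.io/products', proxies=proxies)
    print(response.status_code)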
A comparison of libraries
| Library | Ease of use | Type | JavaScript support | Use case | Pros | Cons |
|---|---|---|---|---|---|---|
| Requests | Easy | HTTP client | No | HTTP requests | Simple, powerful, lots of documentation, native JSON support | Can't parse HTML; no JavaScript support, so it can't handle dynamic pages |
| Beautiful Soup | Easy | Parser | No | Parsing and querying HTML or XML | Easy to learn, can change the underlying parser | Slower, non-standard methods, no XPath support |
| lxml | Moderate | Parser | No | When parsing speed is very important | Very fast, supports XPath | Harder to learn; no JavaScript support, so it can't handle dynamic pages |
| Selenium | Moderate | Browser automation | Yes | JavaScript-heavy websites when speed is not a concern | Loads everything in a real browser, mimics human behavior, best for JavaScript-heavy sites | Slow, resource-heavy |
| Playwright | Moderate | Browser automation | Yes | JavaScript-heavy websites that need a lightweight browser | Faster than Selenium, supports async operations | Newer library, less community support |
| AIOHTTP | Advanced | Async HTTP client | No | High-concurrency scraping, APIs | High performance, very fast | Harder to learn, needs expertise; no JavaScript support, so it can't handle dynamic pages |
| Scrapy | Moderate | All-in-one framework | No (yes with plugins) | Large-scale projects, crawling multiple pages/sites | Most powerful framework, many built-in features, asynchronous by design and thus very fast | Complex, harder to learn, not ideal for small projects, needs the Playwright plugin for JavaScript support |
Conclusion
In this article, we have looked at various web scraping libraries, including Requests, Beautiful Soup, lxml, Selenium, Playwright, AIOHTTP, and Scrapy.
All the libraries that send HTTP requests support proxies. We have looked at code samples and summarized why using proxies can speed up your scraping and increase its accuracy.