Popular Python Libraries for Web Scraping

Web scraping is a way of collecting data from websites using code. It goes by other names too, such as data scraping, screen scraping, and web crawling. Whichever term you use, the process is the same: examine a website’s HTML markup, then write code that fetches that HTML and extracts the data you need.

How to choose the right tools

Even though web scraping code can be written in many languages, Python stands out primarily because it has a vast number of open-source libraries that are easy to use and very powerful.

As every website is unique, you will face different challenges while scraping. On top of that, your data collection objectives can vary widely, from a simple one-time scrape of a small website to daily price monitoring of millions of products.

Python has libraries for all of these cases: simple libraries for a quick one-time scrape and full frameworks that can handle millions of pages.

Popular Python Libraries for Web Scraping

1. Requests

Requests is one of the most commonly used libraries in web scraping. Using this library, you can send HTTP requests and receive a response.

Let’s see a quick example. First, install this library using `pip`:

$ pip3 install requests

Then run the following code:

import requests

response = requests.get('https://sandbox.oxylabs.io/')

print(f'Status Code: {response.status_code}')
print(f'HTML Markup: {response.text}')

Once run, this code prints the status code (200 if the request succeeds) and the HTML markup of the page.

The Requests library also makes it very easy to route your HTTP requests using proxies. See this example:

proxy = { 
    'http': 'http://username:[email protected]:8888', 
    'https': 'https://username:[email protected]:8888',
} 

response = requests.get('https://sandbox.oxylabs.io/', proxies=proxy)

If you don’t have a proxy provider, see this list of the best proxy providers.

The limitation of the `Requests` library is that it returns the HTML as plain text, making it difficult to extract specific information. This is where you would need a library that can parse the HTML.

2. Beautiful Soup

Beautiful Soup is another of the most commonly used libraries for web scraping. It provides simple methods that make it easy to extract specific information from any HTML.

Note that Beautiful Soup is a wrapper around other parser libraries such as `lxml` and `html5lib`. Using libraries such as `lxml` directly is tedious, especially if you also have to learn XPath. Beautiful Soup hides these complications and even lets you switch parsers whenever you want (an example of this follows the first snippet below).

A note about installation: Beautiful Soup 3 does not support Python 3 and thus has been retired. Always use Beautiful Soup 4.

$ pip3 install beautifulsoup4

Let’s begin with a simple HTML string:

from bs4 import BeautifulSoup

html_content = '<html><head><title>Test</title></head><body>Hello World</body></html>'

soup = BeautifulSoup(html_content, 'html.parser')

print(soup.title)  # prints <title>Test</title>

print(soup.title.string)  # prints Test
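As mentioned, you can swap the underlying parser simply by changing the second argument. A minimal sketch, reusing the `html_content` string from above and assuming the `lxml` package is installed:

from bs4 import BeautifulSoup

# Built-in parser, no extra dependencies
soup = BeautifulSoup(html_content, 'html.parser')

# Faster lxml parser, requires `pip3 install lxml`
soup = BeautifulSoup(html_content, 'lxml')

The rest of your code stays unchanged regardless of which parser you pick.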

Typically, you will combine Requests and Beautiful Soup for web scraping as follows:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://sandbox.oxylabs.io/products') 

soup = BeautifulSoup(response.text, 'html.parser') 

print(soup.title.string) # prints 'E-commerce | Oxylabs Scraping Sandbox'

More commonly, you will be using the `find` and `find_all` methods. Let’s see an example.


If you examine the HTML of https://sandbox.oxylabs.io/products, you will notice that all the game titles are in `h4` tags.

You can find all the `h4` tags and get the game names as follows:

for game in soup.find_all('h4'):
    print(game.string)  # prints the names of all games on the page
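If you only need the first match, `find` returns a single element instead of a list. Continuing with the same `soup` object:

first_game = soup.find('h4')
print(first_game.string)  # prints the name of the first game only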

3. lxml

The lxml library is primarily meant for parsing XML documents, but it handles HTML well through the lxml.html module. Note that you typically query elements using XPath expressions.

Let’s see an example. First, install the library:

$ pip3 install lxml

Next, get the HTML using the Requests library and then parse it using lxml:

import requests
from lxml import html

response = requests.get('https://sandbox.oxylabs.io/products') 

tree = html.fromstring(response.content) 
title = tree.findtext('.//title') 

print(title) # prints 'E-commerce | Oxylabs Scraping Sandbox'
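Since lxml supports XPath, you can also pull out all the game names with a single XPath expression. Reusing the `tree` object from the snippet above:

# XPath query that returns the text of every <h4> element
for game in tree.xpath('//h4/text()'):
    print(game)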

4. Selenium

Selenium is made for browser automation: it loads web pages in a real browser. This makes Selenium suitable for scraping dynamic web pages, that is, pages that load content using JavaScript.

You need two components before writing any Python code: a Selenium driver for your browser and the Selenium library itself.

For Chrome, the driver you need is ChromeDriver.

Selenium can be installed using `pip`:

$ pip3 install selenium

Here is a code sample that uses Selenium:

from selenium import webdriver 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 

service = Service('/path/to/chromedriver') 
driver = webdriver.Chrome(service=service) 
driver.get('https://sandbox.oxylabs.io/products') 
games = driver.find_elements(By.TAG_NAME, 'h4') 

for game in games: 
    print(game.text) 

driver.quit()
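If the page renders its content with JavaScript after the initial load, reading elements immediately may return nothing. A minimal sketch using Selenium’s `WebDriverWait`, reusing the `driver` created above and placed before the `driver.quit()` call; the 10-second timeout is an arbitrary choice:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one <h4> element to appear
games = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, 'h4'))
)

for game in games:
    print(game.text)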

5. Playwright

Playwright is a similar browser automation tool that has recently gained popularity. In addition to Python, Playwright works with multiple languages, such as JavaScript and .NET.

Install Playwright using `pip`:

$ pip3 install playwright

Unlike Selenium, it does not rely on a separately installed browser and driver; instead, it downloads its own browser builds, such as Chromium:

$ playwright install chromium

Here is a code sample that uses proxies. Notice how easy they are to set up; if you don’t need a proxy, simply remove it from the launch options. However, we highly recommend using proxies for the reasons listed at the end of the article.

from playwright.sync_api import sync_playwright 

proxy = { 'server': 'http://username:[email protected]:8888' } 

with sync_playwright() as p: 
    browser = p.chromium.launch(proxy=proxy) 
    page = browser.new_page() 
    page.goto('https://sandbox.oxylabs.io/products') 
    games = page.locator('h4') 
    for game in games.all_text_contents(): 
        print(game) 
    browser.close()

If you are looking for a good proxy provider, take a look at a residential proxy pool.
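Playwright also ships an asynchronous API, which pairs well with asyncio if you want to drive several pages concurrently. A minimal sketch without a proxy:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://sandbox.oxylabs.io/products')
        # Collect the text of every <h4> element on the page
        for game in await page.locator('h4').all_text_contents():
            print(game)
        await browser.close()

asyncio.run(main())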

6. AIOHTTP

AIOHTTP is an asynchronous HTTP client for Python built on the asyncio library. Combining the two lets you write code that sends many HTTP requests concurrently.

Think of this library as the Requests library that works asynchronously.

This means you still need a parser such as Beautiful Soup. However, because web scraping is I/O bound and most of the time is spent waiting on the network, sending requests asynchronously can make your code much faster.

Let’s start with installing this library:

$ pip3 install aiohttp

Here is a simple example that uses AIOHTTP along with Beautiful Soup to parse the game names:

import aiohttp 
import asyncio 
from bs4 import BeautifulSoup 

async def fetch(url, session, proxy): 
    async with session.get(url, proxy=proxy) as response:
        return await response.text()

async def main(): 
    proxy = 'http://username:[email protected]:8888' 
    url = 'https://sandbox.oxylabs.io/products'
    
    async with aiohttp.ClientSession() as session:
        html = await fetch(url, session, proxy)
        soup = BeautifulSoup(html, 'html.parser')
        
        for game in soup.find_all('h4'):
            print(game.string)

asyncio.run(main())
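The real gain comes from fetching many pages at once. Here is a minimal sketch that requests several pages concurrently with `asyncio.gather`; the paginated URLs are illustrative, so adjust them to the pages you actually need:

import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.text()

async def main():
    # Illustrative URLs - replace with the pages you want to scrape
    urls = [f'https://sandbox.oxylabs.io/products?page={i}' for i in range(1, 4)]

    async with aiohttp.ClientSession() as session:
        # Send all requests concurrently and wait for every response
        pages = await asyncio.gather(*(fetch(url, session) for url in urls))

    for html in pages:
        soup = BeautifulSoup(html, 'html.parser')
        for game in soup.find_all('h4'):
            print(game.string)

asyncio.run(main())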

7. Scrapy

Scrapy is an open-source framework for crawling websites and extracting data. Note the emphasis on the word framework: it can send HTTP requests and parse the responses, which you can query using CSS or XPath selectors. It even has item pipelines for cleaning up or processing the scraped data. It is asynchronous by design and can export output to many file formats.

Depending on your platform, installing this library may require a C compiler. See the installation documentation for more details.

$ pip3 install scrapy

You will notice that it installs many other libraries on which it depends.

See this example Scrapy spider:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://sandbox.oxylabs.io/products']
    
    def parse(self, response):
        for game in response.css('h4::text').getall():
            yield {'game_name': game}

To run this Scrapy spider, you would need to use the scrapy executable:

$ scrapy runspider example.py

If you want to send the output to a JSON file, add `-o output.json` to the above command.
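The item pipelines mentioned earlier let you clean or post-process every scraped item. A minimal sketch; the class name is arbitrary, and the pipeline has to be registered in your project’s `ITEM_PIPELINES` setting before Scrapy will run it:

class CleanGameNamePipeline:
    def process_item(self, item, spider):
        # Strip stray whitespace from the scraped game name
        item['game_name'] = item['game_name'].strip()
        return item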

If you are working with dynamic pages, you can integrate Playwright within Scrapy using the scrapy-playwright plugin.

Using proxies with web scraping

You will encounter a few challenges when working on a web scraping task. The most common one is getting your IP banned, either temporarily or permanently, and the most reliable solution to these problems is using proxies.

You would also see some other issues, such as:

  • Geolocation restrictions: Some sites allow access from only a specific country or sometimes only from a particular city.
  • Rate limitations: This is a prevalent problem where websites allow only a small number of requests within a time frame. Exceeding it usually results in slow responses or IP bans.
  • CAPTCHA challenges: This is another common problem you will face. When websites see a lot of traffic from a single IP address, they send a CAPTCHA in the response.

Apart from these common challenges, there are many other reasons why you should use proxies:

  • Accessing multiple accounts: If you use proxies, you can use multiple credentials to access various accounts simultaneously, making scraping faster.
  • Anonymity: Proxies hide your actual IP address, keeping you anonymous, which can also result in less biased data. You can use a residential proxy pool for better anonymity and higher success rates.
  • Quality of data: Proxies also improve data quality by distributing traffic across locations, bypassing a website’s tendency to serve region-specific content (an example of rotating proxies follows below).
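One common pattern is rotating through a pool of proxies so that consecutive requests come from different IP addresses. A minimal sketch using the Requests library; the proxy endpoints below are placeholders for your own provider’s addresses:

import random
import requests

# Placeholder proxy endpoints - replace with your provider's addresses
PROXY_POOL = [
    'http://username:[email protected]:8888',
    'http://username:[email protected]:8888',
    'http://username:[email protected]:8888',
]

def get_with_rotating_proxy(url):
    # Pick a different proxy for each request
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy})

response = get_with_rotating_proxy('https://sandbox.oxylabs.io/products')
print(response.status_code)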

A comparison of libraries

| Library | Ease of use | Type | JavaScript support | Use case | Pros | Cons |
|---|---|---|---|---|---|---|
| Requests | Easy | HTTP client | No | HTTP requests | Simple, powerful, lots of documentation, native JSON support | Can’t parse HTML; no JavaScript support, so can’t handle dynamic pages |
| Beautiful Soup | Easy | Parser | No | Parsing and querying HTML or XML | Easy to learn, can change the underlying parser | Slower, non-standard methods, no XPath support |
| lxml | Moderate | Parser | No | When parsing speed is very important | Very fast, supports XPath | Harder to learn; no JavaScript support, so can’t handle dynamic pages |
| Selenium | Moderate | Browser automation | Yes | JavaScript-heavy websites when speed is not a concern | Loads everything in a real browser, mimics human behavior, best for JavaScript-heavy sites | Slow, resource-heavy |
| Playwright | Moderate | Browser automation | Yes | JavaScript-heavy websites with a lightweight browser | Faster than Selenium, supports async operations | Newer library, less community support |
| AIOHTTP | Advanced | Async HTTP client | No | High-concurrency scraping, APIs | High performance, very fast | Harder to learn, needs expertise; no JavaScript support, so can’t handle dynamic pages |
| Scrapy | Moderate | All-in-one framework | No (yes with plugins) | Large-scale projects, crawling multiple pages or sites | Most powerful framework, many built-in features, asynchronous by design and thus very fast | Complex, difficult to learn, not ideal for small projects, needs the scrapy-playwright plugin for JavaScript support |

Conclusion

In this article, we have looked at various web scraping libraries, including Requests, Beautiful Soup, lxml, Selenium, Playwright, AIOHTTP, and Scrapy.

All the libraries that send HTTP requests support proxies. We have looked at code samples for each and summarized why using proxies can speed up your scraping and improve its accuracy.