Scrapy Documentation
Release 1.7.3
Scrapy developers
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured
data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated
testing.
Getting help

First steps

2.1 Scrapy at a glance
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide
range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as
Amazon Associates Web Services) or as a general purpose web crawler.
In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the
simplest way to run a spider.
Here's the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]
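The spider's parse callback, described in the paragraphs that follow, loops over the quote elements, yields a dict with the quote text and author, and follows the pagination link. A sketch of that callback (the CSS/XPath expressions follow the quotes.toscrape.com markup and are assumptions here):

    def parse(self, response):
        # extract each quote on the page as a plain dict
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # follow the link to the next page, reusing this callback
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)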
5
Scrapy Documentation, Release 1.7.3
Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:
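scrapy runspider quotes_spider.py -o quotes.json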
When this finishes you will have, in the quotes.json file, a list of the quotes in JSON format, containing text and author, looking like this (reformatted here for better readability):
[{
    "author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
    "author": "Groucho Marx",
    "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
},
{
    "author": "Steve Martin",
    "text": "\u201cA day without sunshine is like, you know, night.\u201d"
},
...]
When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition
inside it and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the
URL for quotes in humor category) and called the default callback method parse, passing the response object as
an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict
with the extracted quote text and author, look for a link to the next page and schedule another request using the same
parse method as callback.
Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn't need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.
Note: This example uses feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a
lot of powerful features for making scraping easy and efficient, such as:
• Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and
XPath expressions, with helper methods to extract using regular expressions.
• An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very
useful when writing or debugging your spiders.
• Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).
• Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding dec-
larations.
• Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined
API (middlewares, extensions, and pipelines).
• Wide range of built-in extensions and middlewares for handling:
– cookies and session handling
– HTTP features like compression, authentication, caching
– user-agent spoofing
– robots.txt
– crawl depth restriction
– and more
• A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug
your crawler
• Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline
for automatically downloading images (or any other media) associated with the scraped items, a caching DNS
resolver, and much more!
The next steps for you are to install Scrapy, follow through the tutorial to learn how to create a full-blown Scrapy
project and join the community. Thanks for your interest!
2.2 Installation guide

Scrapy runs on Python 2.7 and Python 3.4 or above under CPython (the default Python implementation) and PyPy (starting with PyPy 5.9).
If you’re using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-
date packages for Linux, Windows and OS X.
To install Scrapy using conda, run:
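conda install -c conda-forge scrapy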
Alternatively, if you’re already familiar with installation of Python packages, you can install Scrapy and its dependen-
cies from PyPI with:
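pip install Scrapy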
Note that sometimes this may require solving compilation issues for some Scrapy dependencies depending on your
operating system, so be sure to check the Platform specific installation notes.
We strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system
packages.
For more detailed and platform-specific instructions, as well as troubleshooting information, read on.
Scrapy is written in pure Python and depends on a few key Python packages (among others):
• lxml, an efficient XML and HTML parser
• parsel, an HTML/XML data extraction library written on top of lxml,
• w3lib, a multi-purpose helper for dealing with URLs and web page encodings
• twisted, an asynchronous networking framework
• cryptography and pyOpenSSL, to deal with various network-level security needs
The minimal versions which Scrapy is tested against are:
• Twisted 14.0
• lxml 3.4
• pyOpenSSL 0.14
Scrapy may work with older versions of these packages, but it is not guaranteed to continue working because it is not being tested against them.
Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check the platform-specific guides below.
In case of any trouble related to these dependencies, please refer to their respective installation instructions:
• lxml installation
• cryptography installation
Once you have created a virtualenv, you can install scrapy inside it with pip, just like any other Python package. (See
platform-specific guides below for non-Python dependencies that you may need to install beforehand).
Python virtualenvs can be created to use Python 2 by default, or Python 3 by default.
• If you want to install scrapy with Python 3, install scrapy within a Python 3 virtualenv.
• And if you want to install scrapy with Python 2, install scrapy within a Python 2 virtualenv.
Windows
Though it’s possible to install Scrapy on Windows using pip, we recommend you to install Anaconda or Miniconda
and use the package from the conda-forge channel, which will avoid most installation issues.
Once you’ve installed Anaconda or Miniconda, install Scrapy with:
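conda install -c conda-forge scrapy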
Scrapy is currently tested with recent-enough versions of lxml, twisted and pyOpenSSL, and is compatible with recent
Ubuntu distributions. But it should support older versions of Ubuntu too, like Ubuntu 14.04, albeit with potential
issues with TLS connections.
Don't use the python-scrapy package provided by Ubuntu: it is typically too old and slow to catch up with the latest Scrapy.
To install scrapy on Ubuntu (or Ubuntu-based) systems, you need to install these dependencies:
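A typical set of packages (for Python 3; substitute the Python 2 equivalents if needed) is:

sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev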
Inside a virtualenv, you can install Scrapy with pip after that:
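pip install scrapy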
Note: The same non-Python dependencies can be used to install Scrapy in Debian Jessie (8.0) and above.
Mac OS X
Building Scrapy’s dependencies requires the presence of a C compiler and development headers. On OS X this is
typically provided by Apple’s Xcode development tools. To install the Xcode command line tools open a terminal
window and run:
xcode-select --install
There’s a known issue that prevents pip from updating system packages. This has to be addressed to successfully
install Scrapy and its dependencies. Here are some proposed solutions:
• (Recommended) Don’t use system python, install a new, updated version that doesn’t conflict with the rest of
your system. Here’s how to do it using the homebrew package manager:
– Install homebrew following the instructions in https://brew.sh/
– Update your PATH variable to state that homebrew packages should be used before system packages (change .bashrc to .zshrc accordingly if you're using zsh as your default shell):
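For example, assuming Homebrew's default /usr/local prefix:

echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc

Then reload .bashrc so the change takes effect: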
source ~/.bashrc
– Install python:
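brew install python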
– Latest versions of python have pip bundled with them so you won’t need to install it separately. If this is
not the case, upgrade python:
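brew update; brew upgrade python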
PyPy
We recommend using the latest PyPy version. The version tested is 5.9.0. For PyPy3, only Linux installation was
tested.
Most Scrapy dependencies now have binary wheels for CPython, but not for PyPy. This means that these dependencies will be built during installation. On OS X, you are likely to face an issue building the cryptography dependency; the solution to this problem is described here, namely to brew install openssl and then export the flags that this command recommends (only needed when installing Scrapy). Installing on Linux has no special issues besides installing build dependencies. Installing Scrapy with PyPy on Windows is not tested.
You can check that scrapy is installed correctly by running scrapy bench. If this command gives errors such as
TypeError: ... got 2 unexpected keyword arguments, this means that setuptools was unable to
pick up one PyPy-specific dependency. To fix this issue, run pip install 'PyPyDispatcher>=2.1.0'.
2.2.3 Troubleshooting
After you install or upgrade Scrapy, Twisted or pyOpenSSL, you may get an exception with the following traceback:
[...]
File "[...]/site-packages/twisted/protocols/tls.py", line 63, in <module>
from twisted.internet._sslverify import _setAcceptableProtocols
File "[...]/site-packages/twisted/internet/_sslverify.py", line 38, in <module>
TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'
The reason you get this exception is that your system or virtual environment has a version of pyOpenSSL that your
version of Twisted does not support.
To install a version of pyOpenSSL that your version of Twisted supports, reinstall Twisted with the tls extra option:
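pip install twisted[tls]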
2.3 Scrapy Tutorial

In this tutorial, we'll assume that Scrapy is already installed on your system. If that's not the case, see the Installation guide.
We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through these tasks:
1. Creating a new Scrapy project
2. Writing a spider to crawl a site and extract data
3. Exporting the scraped data using the command line
4. Changing the spider to recursively follow links
5. Using spider arguments
Scrapy is written in Python. If you’re new to the language you might want to start by getting an idea of what the
language is like, to get the most out of Scrapy.
If you’re already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good
resource.
If you’re new to programming and want to start with Python, the following books may be useful to you:
• Automate the Boring Stuff With Python
• How To Think Like a Computer Scientist
• Learn Python 3 The Hard Way
You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in
the learnpython-subreddit.
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store
your code and run:
scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).
They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the
pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders
directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # save each downloaded page to a local file, e.g. quotes-1.html
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:
• name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different
Spiders.
• start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator
function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from
these initial requests.
• parse(): a method that will be called to handle the response downloaded for each of the requests made.
The response parameter is an instance of TextResponse that holds the page content and has further helpful
methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new
URLs to follow and creating new requests (Request) from them.
To put our spider to work, go to the project’s top level directory and run:
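scrapy crawl quotes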
This command runs the spider with name quotes that we've just added, which will send some requests for the quotes.toscrape.com domain. You will get an output similar to this:
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html
and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
Note: If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon
receiving a response for each one, it instantiates Response objects and calls the callback method associated with the
request (in this case, the parse method) passing the response as argument.
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs,
you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default
implementation of start_requests() to create the initial requests for your spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    # the parse() method from the previous example stays the same
The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly
told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests
without an explicitly assigned callback.
Extracting data
The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Note: Remember to always enclose urls in quotes when running Scrapy shell from the command line; otherwise urls containing arguments (i.e. the & character) will not work.
On Windows, use double quotes instead:
scrapy shell "http://quotes.toscrape.com/page/1/"
Using the shell, you can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList, which represents a
list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain
the selection or extract the data.
To extract the text from the title above, you can do:
>>> response.css('title::text').getall()
['Quotes to Scrape']
There are two things to note here: one is that we've added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don't specify ::text, we'd get the full title element, including its tags:
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one
result, so we extract them all. When you know you just want the first result, as in this case, you can do:
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
However, using .get() directly on a SelectorList instance avoids an IndexError and returns None when it
doesn’t find any element matching the selection.
There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a
page, so that even if some parts fail to be scraped, you can at least get some data.
Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser's developer tools to inspect the HTML and come up with a selector (see Using your browser's Developer Tools for scraping).
Selector Gadget is also a nice tool to quickly find the CSS selector for visually selected elements; it works in many browsers.
Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted
to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the
shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you're able to select things like: the link that contains the text "Next Page". This makes XPath very well suited to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn
more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to
think in XPath”.
Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the
quotes from the web page.
Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
Let’s open up scrapy shell and play a bit to find out how to extract the data we want:
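scrapy shell 'http://quotes.toscrape.com'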
>>> response.css("div.quote")
Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign
the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:
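>>> quote = response.css("div.quote")[0]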
Now, let’s extract text, author and the tags from that quote using the quote object we just created:
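For example (the selectors follow the markup shown above):

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'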
Given that the tags are a list of strings, we can use the .getall() method to get all of them:
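>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']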
Having figured out how to extract each bit, we can now iterate over all the quotes elements and put them together into
a Python dictionary:
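>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))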
Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a
local file. Let’s integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use
the yield Python keyword in the callback, as you can see below:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
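The parse() callback yields a dict for each quote, matching the shell session above; a sketch:

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }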
If you run this spider, it will output the extracted data with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing all scraped items, serialized in JSON.
For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second time, you'll end up with a broken JSON file.
You can also use other formats, like JSON Lines:
scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it's stream-like: you can easily append new records to it, and it doesn't have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory; there are tools like JQ to help do that at the command line.
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex
things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines has been set up
for you when the project is created, in tutorial/pipelines.py. Though you don’t need to implement any item
pipelines if you just want to store the scraped items.
Let’s say, instead of just scraping the stuff from the first two pages from http://quotes.toscrape.com, you want quotes
from all the pages in the website.
Now that you know how to extract data from pages, let’s see how to follow links from them.
The first thing to do is extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:
<ul class="pager">
<li class="next">
<a href="/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
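Let's try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'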
This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you
select the attribute contents, like this:
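>>> response.css('li.next a::attr(href)').get()
'/page/2/'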
There is also an attrib property available (see Selecting element attributes for more):
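>>> response.css('li.next a').attrib['href']
'/page/2/'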
Let’s see now our spider modified to recursively follow the link to the next page, extracting data from it:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
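A sketch of the parse() callback described below, which extracts the quotes and then follows the next-page link using urljoin():

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # links can be relative, so build an absolute URL first
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)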
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL
using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering
itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy
will schedule that request to be sent and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds
of data depending on the page it’s visiting.
In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for
crawling blogs, forums and other sites with pagination.
As a shortcut for creating Request objects you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
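The parse() callback is the same as before, except that the request for the next page is created with response.follow; a sketch:

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # response.follow accepts relative URLs directly
            yield response.follow(next_page, callback=self.parse)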
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that
response.follow just returns a Request instance; you still have to yield this Request.
You can also pass a selector to response.follow instead of a string; this selector should extract necessary at-
tributes:
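For example, using the attribute selector from above:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)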
For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can
be shortened further:
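for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)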
Here is another spider that illustrates callbacks and following links, this time for scraping author information:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse_author(self, response):
        # helper to extract data from a CSS query and clean it up
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
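The parse() callback, described just below, follows links to author pages and to the pagination; a sketch (the CSS selectors are assumptions based on the quotes.toscrape.com markup):

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)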
This spider will start from the main page; it will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before.
Here we're passing callbacks to response.follow as positional arguments to make the code shorter; this also works for scrapy.Request.
The parse_author callback defines a helper function to extract and cleanup the data from a CSS query and yields
the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t
need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to
URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can
be configured by the setting DUPEFILTER_CLASS.
Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with
Scrapy.
As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class
for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.
Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to
the callbacks.
You can provide command line arguments to your spiders by using the -a option when running them:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the Spider’s __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument will be available via self.tag. You can use this to make
your spider fetch only quotes with a specific tag, building the URL based on the argument:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    # parse() is the same as in the previous examples
If you pass the tag=humor argument to this spider, you’ll notice that it will only visit URLs from the humor tag,
such as http://quotes.toscrape.com/tag/humor.
You can learn more about handling spider arguments here.
This tutorial covered only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the section Basic concepts to know more about the command-line tool, spiders, selectors and
other things the tutorial hasn’t covered like modeling the scraped data. If you prefer to play with an example project,
check the Examples section.
2.4 Examples
The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy
project named quotesbot that you can use to play and learn more about Scrapy. It contains two spiders for http://quotes.toscrape.com, one using CSS selectors and another one using XPath expressions.
The quotesbot project is available at: https://github.com/scrapy/quotesbot. You can find more information about it in
the project’s README.
If you're familiar with git, you can check out the code. Otherwise you can download the project as a zip file by clicking here.
Scrapy at a glance Understand what Scrapy is and how it can help you.
Installation guide Get Scrapy installed on your computer.
Scrapy Tutorial Write your first Scrapy project.
Examples Learn more by playing with a pre-made Scrapy project.
Basic concepts

3.1 Command line tool
Scrapy will look for configuration parameters in ini-style scrapy.cfg files in standard locations:
1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),
2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-
wide) settings, and
3. scrapy.cfg inside a scrapy project’s root (see next section).
Settings from these files are merged in the listed order of preference: user-defined values have higher priority than
system-wide defaults and project-wide settings will override all others, when defined.
Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:
• SCRAPY_SETTINGS_MODULE (see Designating the settings)
• SCRAPY_PROJECT (see Sharing the root directory between projects)
• SCRAPY_PYTHON_SHELL (see Scrapy shell)
Before delving into the command-line tool and its sub-commands, let’s first understand the directory structure of a
Scrapy project.
Though it can be modified, all Scrapy projects have the same file structure by default, similar to this:
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name
of the python module that defines the project settings. Here is an example:
[settings]
default = myproject.settings
A project root directory, the one that contains the scrapy.cfg, may be shared by multiple Scrapy projects, each
with its own settings module.
In that case, you must define one or more aliases for those settings modules under [settings] in your scrapy.
cfg file:
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
By default, the scrapy command-line tool will use the default settings. Use the SCRAPY_PROJECT environment
variable to specify a different project for scrapy to use:
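For example, to use the project2 settings defined above (the spider name here is illustrative):

export SCRAPY_PROJECT=project2
scrapy crawl somespider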
You can start by running the Scrapy tool with no arguments and it will print some usage help and the available
commands:
Usage:
scrapy <command> [options] [args]
Available commands:
crawl Run a spider
fetch Fetch a URL using the Scrapy downloader
[...]
The first line will print the currently active project if you’re inside a Scrapy project. In this example it was run from
outside a project. If run from inside a project it would have printed something like this:
Scrapy X.Y - project: myproject
Usage:
scrapy <command> [options] [args]
[...]
Creating projects
The first thing you typically do with the scrapy tool is create your Scrapy project:
scrapy startproject myproject [project_dir]
That will create a Scrapy project under the project_dir directory. If project_dir wasn’t specified,
project_dir will be the same as myproject.
Next, you go inside the new project directory:
cd project_dir
And you’re ready to use the scrapy command to manage and control your project from there.
Controlling projects
You use the scrapy tool from inside your projects to control and manage them.
For example, to create a new spider:
scrapy genspider mydomain mydomain.com
Some Scrapy commands (like crawl) must be run from inside a Scrapy project. See the commands reference below
for more information on which commands must be run from inside projects, and which not.
Also keep in mind that some commands may have slightly different behaviours when running them from inside
projects. For example, the fetch command will use spider-overridden behaviours (such as the user_agent attribute
to override the user-agent) if the url being fetched is associated with some specific spider. This is intentional, as the
fetch command is meant to be used to check how spiders are downloading pages.
This section contains a list of the available built-in commands with a description and some usage examples. Remember,
you can always get more info about each command by running:
scrapy <command> -h
scrapy -h
There are two kinds of commands: those that only work from inside a Scrapy project (Project-specific commands) and those that also work without an active Scrapy project (Global commands), though they may behave slightly differently when running from inside a project (as they would use the project-overridden settings).
Global commands:
• startproject
• genspider
• settings
• runspider
• shell
• fetch
• view
• version
Project-only commands:
• crawl
• check
• list
• edit
• parse
• bench
startproject
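Syntax: scrapy startproject <project_name> [project_dir]
Creates a new Scrapy project named project_name under the project_dir directory, as described in Creating projects above.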
genspider
Create a new spider in the current folder or in the current project’s spiders folder, if called from inside a project.
The <name> parameter is set as the spider’s name, while <domain> is used to generate the allowed_domains
and start_urls spider’s attributes.
Usage example:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
This is just a convenience shortcut command for creating spiders based on pre-defined templates, but certainly not the
only way to create spiders. You can just create the spider source code files yourself, instead of using this command.
crawl
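Syntax: scrapy crawl <spider>
Starts crawling using a spider. Usage example:

$ scrapy crawl myspider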
check
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
list
$ scrapy list
spider1
spider2
edit
fetch
view
shell
(200, 'http://example.com/')
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
parse
# Requests -----------------------------------------------------------------
[]
settings
runspider
version
bench
You can also add your custom project commands by using the COMMANDS_MODULE setting. See the Scrapy com-
mands in scrapy/commands for examples on how to implement your commands.
COMMANDS_MODULE
COMMANDS_MODULE = 'mybot.commands'
You can also add Scrapy commands from an external library by adding a scrapy.commands section in the entry
points of the library setup.py file.
The following example adds my_command command:
from setuptools import setup

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'my_command=my_scrapy_module.commands:MyCommand',
          ],
      },
      )
3.2 Spiders
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform
the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words,
Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or,
in some cases, a group of sites).
For spiders, the scraping cycle goes through something like this:
1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called
with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each of the URLs specified in start_urls, with the parse method as the callback function for those Requests.
2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item
objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe
the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup,
lxml or whatever mechanism you prefer) and generate items with the parsed data.
4. Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or
written to a file using Feed exports.
Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled
into Scrapy for different purposes. We will talk about those types here.
3.2.1 scrapy.Spider
class scrapy.spiders.Spider
This is the simplest spider, and the one from which every other spider must inherit (including spiders that come
bundled with Scrapy, as well as spiders that you write yourself). It doesn’t provide any special functionality. It
just provides a default start_requests() implementation which sends requests from the start_urls
spider attribute and calls the spider’s method parse for each of the resulting responses.
name
A string which defines the name for this spider. The spider name is how the spider is located (and instan-
tiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one
instance of the same spider. This is the most important spider attribute and it’s required.
If the spider scrapes a single domain, a common practice is to name the spider after the domain, with
or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called
mywebsite.
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs
not belonging to the domain names specified in this list (or their subdomains) won’t be followed if
OffsiteMiddleware is enabled.
Let’s say your target url is https://www.example.com/1.html, then add 'example.com' to
the list.
start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the
first pages downloaded will be those listed here. The subsequent Request will be generated successively
from data contained in the start URLs.
custom_settings
A dictionary of settings that will be overridden from the project wide configuration when running this
spider. It must be defined as a class attribute since the settings are updated before instantiation.
For a list of available built-in settings see: Built-in settings reference.
crawler
This attribute is set by the from_crawler() class method after initializing the class, and links to the
Crawler object to which this spider instance is bound.
Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions,
middlewares, signals managers, etc). See Crawler API to know more about them.
settings
Configuration for running this spider. This is a Settings instance, see the Settings topic for a detailed
introduction on this subject.
logger
Python logger created with the Spider’s name. You can use it to send log messages through it as described
on Logging from Spiders.
from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.
You probably won’t need to override this directly because the default implementation acts as a proxy to
the __init__() method, calling it with the given arguments args and named arguments kwargs.
Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be
accessed later inside the spider’s code.
Parameters
• crawler (Crawler instance) – crawler to which the spider will be bound
• args (list) – arguments passed to the __init__() method
• kwargs (dict) – keyword arguments passed to the __init__() method
start_requests()
This method must return an iterable with the first Requests to crawl for this spider. It is called by
Scrapy when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement
start_requests() as a generator.
The default implementation generates Request(url, dont_filter=True) for each url in
start_urls.
If you want to change the Requests used to start scraping a domain, this is the method to override. For
example, if you need to start by logging in using a POST request, you could do:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
parse(response)
This is the default callback used by Scrapy to process downloaded responses, when their requests don’t
specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs
to follow. Other Requests callbacks have the same requirements as the Spider class.
This method, as well as any other Request callback, must return an iterable of Request and/or dicts or
Item objects.
Parameters response (Response) – the response to parse
Let's see an example:

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:
import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)
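The parse() callback can then yield items built from the response, as well as further requests; a sketch (the MyItem field name is illustrative):

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            # 'title' is an illustrative field of MyItem
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)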
Spiders can receive arguments that modify their behaviour. Some common uses for spider arguments are to define the
start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of
the spider.
Spider arguments are passed through the crawl command using the -a option. For example:
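scrapy crawl myspider -a category=electronics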
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
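An explicit __init__ that stores the argument and builds the start URL from it might look like this sketch:

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]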
The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above
example can also be written as follows:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
Keep in mind that spider arguments are only strings. The spider will not do any parsing on its own. If you were
to set the start_urls attribute from the command line, you would have to parse it on your own into a list using
something like ast.literal_eval or json.loads and then set it as an attribute. Otherwise, you would cause iteration over a
start_urls string (a very common python pitfall) resulting in each character being seen as a separate url.
A valid use case is to set the http auth credentials used by HttpAuthMiddleware or the user agent used by
UserAgentMiddleware:
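scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot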
Spider arguments can also be passed through the Scrapyd schedule.json API. See Scrapyd documentation.
Scrapy comes with some useful generic spiders that you can use to subclass your spiders from. Their aim is to provide
convenient functionality for a few common scraping cases, like following all links on a site based on certain rules,
crawling from Sitemaps, or parsing an XML/CSV feed.
For the examples used in the following spiders, we’ll assume you have a project with a TestItem declared in a
myproject.items module:
import scrapy
class TestItem(scrapy.Item):
id = scrapy.Field()
name = scrapy.Field()
description = scrapy.Field()
CrawlSpider
class scrapy.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for
following links by defining a set of rules. It may not be the best suited for your particular web sites or project,
but it’s generic enough for several cases, so you can start from it and override it as needed for more custom
functionality, or just implement your own spider.
Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:
rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the
site. Rules objects are described below. If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.
This spider also exposes an overrideable method:
parse_start_url(response)
This method is called for the start_urls responses. It allows you to parse the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.
Crawling rules
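Each rule is an instance of the Rule class, whose constructor takes roughly the following parameters (described below):

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page, and callback is a callable, or a string (in which case a method from the spider object with that name will be used), to be called for each link extracted with the specified link extractor.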
Warning: When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses
the parse method itself to implement its logic. So if you override the parse method, the crawl spider will
no longer work.
cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If
callback is None follow defaults to True, otherwise it defaults to False.
process_links is a callable, or a string (in which case a method from the spider object with that name
will be used) which will be called for each list of links extracted from each response using the specified
link_extractor. This is mainly used for filtering purposes.
process_request is a callable (or a string, in which case a method from the spider object with that name
will be used) which will be called for every Request extracted by this rule. This callable should take said
request as first argument and the Response from which the request originated as second argument. It must
return a Request object or None (to filter out the request).
CrawlSpider example
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # the XPath expressions below are illustrative for an item page
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re_first(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        return item
This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter
with the parse_item method. For each item response, some data will be extracted from the HTML using XPath,
and an Item will be filled with it.
XMLFeedSpider
class scrapy.spiders.XMLFeedSpider
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The
iterator can be chosen from: iternodes, xml, and html. It’s recommended to use the iternodes iterator
for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse
it. However, using html as the iterator may be useful when parsing XML with bad markup.
To set the iterator and the tag name, you must define the following class attributes:
iterator
A string which defines the iterator to use. It can be either:
• 'iternodes' - a fast iterator based on regular expressions
• 'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load
all DOM in memory which could be a problem for big feeds
• 'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all
DOM in memory which could be a problem for big feeds
It defaults to: 'iternodes'.
itertag
A string with the name of the node (or element) to iterate in. Example:
itertag = 'product'
namespaces
A list of (prefix, uri) tuples which define the namespaces available in that document that will be
processed with this spider. The prefix and uri will be used to automatically register namespaces using
the register_namespace() method.
You can then specify nodes with namespaces in the itertag attribute.
Example:
class YourSpider(XMLFeedSpider):
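    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]  # illustrative namespace
    itertag = 'n:url'
    # ...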
Apart from these new attributes, this spider has the following overrideable methods too:
adapt_response(response)
A method that receives the response as soon as it arrives from the spider middleware, before the spider
starts parsing it. It can be used to modify the response body before parsing it. This method receives a
response and also returns a response (it could be the same or another one).
parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). It receives the response and a Selector for each node. Overriding this method is mandatory; otherwise, your spider won't work. This method must return either an Item object, a Request object, or an iterable containing any of them.
process_results(response, results)
This method is called for each result (item or request) returned by the spider, and it's intended to perform any final processing required before returning the results to the framework core, for example setting
the item IDs. It receives a list of results and the response which originated those results. It must return a
list of results (Items or Requests).
XMLFeedSpider example
These spiders are pretty easy to use; let's have a look at one example:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!', self.itertag)

        item = TestItem()
        item['id'] = node.xpath('@id').get()
        item['name'] = node.xpath('name').get()
        item['description'] = node.xpath('description').get()
        return item
Basically what we did up there was to create a spider that downloads a feed from the given start_urls, then iterates through each of its item tags, logs them, and stores some data in an Item.
CSVFeedSpider
class scrapy.spiders.CSVFeedSpider
This spider is very similar to the XMLFeedSpider, except that it iterates over rows, instead of nodes. The method
that gets called in each iteration is parse_row().
delimiter
A string with the separator character for each field in the CSV file. Defaults to ',' (comma).
quotechar
A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).
headers
A list of the column names in the CSV file.
parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided (or detected)
header of the CSV file. This spider also gives the opportunity to override adapt_response and
process_results methods for pre- and post-processing purposes.
CSVFeedSpider example
Let’s see an example similar to the previous one, but using a CSVFeedSpider:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem


class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
SitemapSpider
class scrapy.spiders.SitemapSpider
SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps.
It supports nested sitemaps and discovering sitemap urls from robots.txt.
sitemap_urls
A list of urls pointing to the sitemaps whose urls you want to crawl.
You can also point to a robots.txt and it will be parsed to extract sitemap urls from it.
sitemap_rules
A list of tuples (regex, callback) where:
• regex is a regular expression to match urls extracted from sitemaps. regex can be either a str or a
compiled regex object.
• callback is the callback to use for processing the urls that match the regular expression. callback
can be a string (indicating the name of a spider method) or a callable.
For example:
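sitemap_rules = [
    ('/product/', 'parse_product'),
]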
Rules are applied in order, and only the first one that matches will be used.
If you omit this attribute, all urls found in sitemaps will be processed with the parse callback.
sitemap_follow
A list of regexes of sitemap that should be followed. This is only for sites that use Sitemap index files that
point to other sitemap files.
By default, all sitemaps are followed.
sitemap_alternate_links
Specifies if alternate links for one url should be followed. These are links for the same website in another
language passed within the same url block.
For example:
<url>
<loc>http://example.com/</loc>
<xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>
With sitemap_alternate_links set, both URLs would be retrieved. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved. The default is sitemap_alternate_links disabled.
sitemap_filter(entries)
A method that can be overridden to filter sitemap entries based on their attributes. For example, suppose a sitemap entry looks like this:

<url>
    <loc>http://example.com/</loc>
    <lastmod>2005-01-01</lastmod>
</url>
class FilteredSitemapSpider(SitemapSpider):
name = 'filtered_sitemap_spider'
allowed_domains = ['example.com']
sitemap_urls = ['http://example.com/sitemap.xml']
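    # The filtering method is elided in this extract; a minimal sketch of a
    # sitemap_filter override matching the behaviour described below, assuming
    # each sitemap entry carries a 'lastmod' date like the <url> block above:
    def sitemap_filter(self, entries):
        from datetime import datetime
        for entry in entries:
            date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
            if date_time.year >= 2005:
                yield entry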
This would retrieve only entries modified in 2005 and the following years.
Entries are dict objects extracted from the sitemap document. Usually, the key is the tag name and the
value is the text inside it.
It’s important to notice that:
• as the loc attribute is required, entries without this tag are discarded
• alternate links are stored in a list with the key alternate (see sitemap_alternate_links)
• namespaces are removed, so lxml tags named as {namespace}tagname become only tagname
If you omit this method, all entries found in sitemaps will be processed, observing other attributes and
their settings.
SitemapSpider examples
Simplest example: process all urls discovered through sitemaps using the parse callback:
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass  # ... scrape item here ...
Process some urls with a certain callback and other urls with a different callback:
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass  # ... scrape product ...

    def parse_category(self, response):
        pass  # ... scrape category ...
Follow sitemaps defined in the robots.txt file and only follow sitemaps whose url contains /sitemap_shop:
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
Combine SitemapSpider with other sources of urls:
class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass  # ... scrape shop here ...

    def parse_other(self, response):
        pass  # ... scrape other here ...
3.3 Selectors
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source.
There are several libraries available to achieve this, such as:
• BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python
object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one
drawback: it’s slow.
• lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is
not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts
of the HTML document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language
for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
Note: Scrapy Selectors is a thin wrapper around parsel library; the purpose of this wrapper is to provide better
integration with Scrapy Response objects.
parsel is a stand-alone web scraping library which can be used without Scrapy. It uses the lxml library under the hood
and implements an easy API on top of the lxml API. This means Scrapy selectors are very similar in speed and parsing
accuracy to lxml.
Constructing selectors
>>> response.selector.xpath('//span/text()').get()
'good'
Querying responses using XPath and CSS is so common that responses include two more shortcuts: response.
xpath() and response.css():
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
Scrapy selectors are instances of the Selector class, constructed by passing either a TextResponse object or markup
as a unicode string (in the text argument). Usually there is no need to construct Scrapy selectors manually: the response
object is available in Spider callbacks, so in most cases it is more convenient to use response.css() and
response.xpath() shortcuts. By using response.selector or one of these shortcuts you can also ensure
the response body is parsed only once.
But if required, it is possible to use Selector directly. Constructing from text:
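The construction example itself is missing from this extract; a minimal sketch, using an inline markup string:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'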
Selector automatically chooses the best parsing rules (XML vs HTML) based on input type.
Using selectors
To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example
page located in the Scrapy documentation server:
https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
For the sake of completeness, here’s its full HTML code:
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>
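First, let's open the shell on that page (a standard shell invocation; see the Scrapy shell section for details):

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html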
Then, after the shell loads, you’ll have the response available as response shell variable, and its attached selector in
response.selector attribute.
Since we’re dealing with HTML, the selector will automatically use an HTML parser.
So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
To actually extract the textual data, you must call the selector .get() or .getall() methods, as follows:
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
.get() always returns a single result; if there are several matches, the content of the first match is returned; if there are
no matches, None is returned. .getall() returns a list with all results.
Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:
>>> response.css('title::text').get()
'Example website'
As you can see, .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors.
This API can be used for quickly selecting nested data:
>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
If you want to extract only the first matched element, you can call the selector .get() (or its alias .extract_first(),
commonly used in previous Scrapy versions):
>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
Instead of using e.g. '@src' XPath it is possible to query for attributes using .attrib property of a Selector:
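For instance, a quick sketch of looking up the src attribute of every image via .attrib (output derived from the sample page above):

>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']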
As a shortcut, .attrib is also available on SelectorList directly; it returns attributes for the first matching element:
>>> response.css('img').attrib['src']
'image1_thumb.jpg'
This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a
web page:
>>> response.css('base').attrib['href']
'http://example.com/'
Now we’re going to get the base URL and some image links:
>>> response.xpath('//base/@href').get()
'http://example.com/'
>>> response.css('base::attr(href)').get()
'http://example.com/'
>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so
essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
• to select text nodes, use ::text
• to select attribute values, use ::attr(name) where name is the name of the attribute that you want the value
of
Warning: These pseudo-elements are Scrapy-/Parsel-specific. They will most probably not work with other
libraries like lxml or PyQuery.
Examples:
• title::text selects children text nodes of a descendant <title> element:
>>> response.css('title::text').get()
'Example website'
• *::text selects all descendant text nodes of the current selector context:
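For instance, on the sample page above (output abbreviated; the exact whitespace-only text nodes depend on the markup):

>>> response.css('#images *::text').getall()
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 ...]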
• foo::text returns no results if foo element exists, but contains no text (i.e. text is empty):
>>> response.css('img::text').getall()
[]
This means .css('foo::text').get() could return None even if an element exists. Use default=''
if you always want a string:
>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''
Note: You cannot chain these pseudo-elements. But in practice it would not make much sense: text nodes do not
have attributes, and attribute values are string values already and do not have children nodes.
Nesting selectors
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection
methods for those selectors too. Here’s an example:
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
There are several ways to get a value of an attribute. First, one can use XPath syntax:
>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
XPath syntax has a few advantages: it is a standard XPath feature, and @attributes can be used in other parts of
an XPath expression - e.g. it is possible to filter by attribute value.
Scrapy also provides an extension to CSS selectors (::attr(...)) which allows getting attribute values:
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In addition to that, there is a .attrib property of Selector. You can use it if you prefer to look up attributes in Python
code, without using XPaths or CSS extensions:
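For example, a sketch of the same href lookup done through .attrib in a list comprehension:

>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']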
This property is also available on SelectorList; it returns a dictionary with attributes of a first matching element. It
is convenient to use when a selector is expected to give a single result (e.g. when selecting by element ID, or when
selecting an unique element on a page):
>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.css('foo').attrib
{}
Selector also has a .re() method for extracting data using regular expressions. However, unlike using .
xpath() or .css() methods, .re() returns a list of unicode strings. So you can’t construct nested .re()
calls.
Here’s an example used to extract image names from the HTML code above:
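(A sketch; note the captured strings keep the trailing space present in the source text nodes:)

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1 ',
 'My image 2 ',
 'My image 3 ',
 'My image 4 ',
 'My image 5 ']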
There's an additional helper reciprocating .get() (and its alias .extract_first()) for .re(), named .re_first().
Use it to extract just the first matching string:
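(A sketch, continuing the previous example:)

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1 '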
If you’re a long-time Scrapy user, you’re probably familiar with .extract() and .extract_first() selector
methods. Many blog posts and tutorials are using them as well. These methods are still supported by Scrapy, there are
no plans to deprecate them.
However, Scrapy usage docs are now written using the .get() and .getall() methods. We feel that these new
methods result in more concise and readable code.
The following examples show how these methods map to each other.
1. SelectorList.get() is the same as SelectorList.extract_first():
>>> response.css('a::attr(href)').get()
'image1.html'
>>> response.css('a::attr(href)').extract_first()
'image1.html'
2. SelectorList.getall() is the same as SelectorList.extract():
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
3. Selector.get() is the same as Selector.extract():
>>> response.css('a::attr(href)')[0].get()
'image1.html'
>>> response.css('a::attr(href)')[0].extract()
'image1.html'
4. For consistency, there is also Selector.getall(), which returns a list:
>>> response.css('a::attr(href)')[0].getall()
['image1.html']
So, the main difference is that output of .get() and .getall() methods is more predictable: .get() always
returns a single result, .getall() always returns a list of all extracted results. With .extract() method it was
not always obvious if a result is a list or not; to get a single result either .extract() or .extract_first()
should be called.
Here are some tips which may help you use XPath with Scrapy selectors effectively. If you are not yet very familiar
with XPath, you may want to take a look at this XPath tutorial first.
Note: Some of the tips are based on this post from ScrapingHub’s blog.
Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the
document and not relative to the Selector you’re calling it from.
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div>
elements:
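(A minimal sketch of that first step:)

>>> divs = response.xpath('//div')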
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements
from the document, not only those inside <div> elements:
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.get())
This is the proper way to do it (note the dot prefixing the .//p XPath):
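(A sketch of the relative query:)

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside each selected <div>
...     print(p.get())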
For more details about relative XPaths see the Location Paths section in the XPath specification.
Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose:
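(A sketch of that verbose idiom, using a hypothetical someclass class name:)

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]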
If you use @class='someclass' you may end up missing elements that have other classes, and if you just use
contains(@class, 'someclass') to make up for that you may end up with more elements that you want, if
they have a different class name that shares the string someclass.
As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using
CSS and then switch to XPath when needed:
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']
This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions
that will follow.
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.
Example:
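The sample document and the xp() helper used below are missing from this extract; a minimal reconstruction consistent with the outputs shown:

>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()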
This gets all first <li> elements under whatever their parent happens to be:
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
And this gets the first <li> element in the whole document:
>>> xp("(//li)[1]")
['<li>1</li>']
>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']
And this gets the first <li> element under an <ul> parent in the whole document:
>>> xp("(//ul/li)[1]")
['<li>1</li>']
When you need to use the text content as argument to an XPath string function, avoid using .//text() and use just
. instead.
This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-
set is converted to a string, which happens when it is passed as argument to a string function like contains() or
starts-with(), it results in the text for the first element only.
Example:
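The markup this example works on is missing from this extract; a minimal reconstruction, consistent with the output shown below:

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
>>> sel.xpath('//a//text()').getall()  # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall()  # convert it to string
['Click here to go to the ']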
A node converted to a string, however, puts together the text of itself plus of all its descendants:
>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']
So, using the .//text() node-set won’t select anything in this case:
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
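But using . to mean the node itself does work, because the string-value of the node includes the text of all its descendants:

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']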
XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat
similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your
queries with placeholders like ?, which are then substituted with values passed with the query.
Here’s an example to match an element based on its “id” attribute value, without hard-coding it (that was shown
previously):
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value
5 as an integer):
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError:
XPath error: exception). This is done by passing as many named arguments as necessary.
parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.
Removing namespaces
When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with
element names, to write simpler and more convenient XPaths. You can use the Selector.remove_namespaces() method
for that. Let's show an example that illustrates this with the GitHub blog Atom feed. First, we open the shell with the
URL we want to scrape:
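(The command below follows the original example's feed; any Atom feed that uses namespaces would work just as well.)

$ scrapy shell https://github.com/blog.atom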
You can see several namespace declarations including a default “http://www.w3.org/2005/Atom” and another one
using the “gd:” prefix for “http://schemas.google.com/g/2005”.
Once in the shell we can try selecting all <link> objects and see that it doesn’t work (because the Atom XML
namespace is obfuscating those nodes):
>>> response.xpath("//link")
[]
But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their
names:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
<Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
...
If you wonder why the namespace removal procedure isn’t always called by default instead of having to call it manu-
ally, this is because of two reasons, which, in order of relevance, are:
1. Removing namespaces requires to iterate and modify all nodes in the document, which is a reasonably expensive
operation to perform by default for all documents crawled by Scrapy
2. There could be some cases where using namespaces is actually required, in case some element names clash
between namespaces. These cases are very rare though.
Being built atop lxml, Scrapy selectors support some EXSLT extensions and come with these pre-registered namespaces
to use in XPath expressions:
• re: http://exslt.org/regular-expressions
• set: http://exslt.org/sets
Regular expressions
The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are
not sufficient.
Example selecting links in list item with a “class” attribute ending with a digit:
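A sketch of that selection (the sample list markup below is illustrative):

>>> from scrapy import Selector
>>> doc = """
... <ul>
...     <li class="item-0"><a href="link1.html">first item</a></li>
...     <li class="item-1"><a href="link2.html">second item</a></li>
...     <li class="item-inactive"><a href="link3.html">third item</a></li>
...     <li class="item-1"><a href="link4.html">fourth item</a></li>
...     <li class="item-0"><a href="link5.html">fifth item</a></li>
... </ul>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']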
Warning: C library libxslt doesn’t natively support EXSLT regular expressions so lxml’s implementation
uses hooks to Python’s re module. Thus, using regexp functions in your XPath expressions may add a small
performance penalty.
Set operations
These can be handy for excluding parts of a document tree before extracting text elements for example.
Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and
corresponding itemprops:
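The full microdata sample is too long to reproduce here; a minimal sketch of the set:difference pattern that the explanation below refers to, assuming sel wraps a page containing schema.org itemscope/itemprop markup:

>>> for scope in sel.xpath('//div[@itemscope]'):
...     print("current scope:", scope.xpath('@itemtype').getall())
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print("    properties: %s" % props.getall())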
Here we first iterate over itemscope elements, and for each one, we look for all itemprops elements and exclude
those that are themselves inside another itemscope.
Scrapy selectors also provide a sorely missed XPath extension function has-class that returns True for nodes that
have all of the specified HTML classes.
For the following HTML:
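The snippet itself is not included in this extract; a reconstruction consistent with the selector output below:

<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>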
>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First
˓→</p>'>]
Selector objects
Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.:
selector.xpath('//a[href=$url]', url="http://www.example.com")
css(query)
Apply the given CSS selector and return a SelectorList instance.
query is a string containing the CSS selector to apply.
In the background, CSS queries are translated into XPath queries using the cssselect library and run through the
.xpath() method.
get()
Serialize and return the matched nodes in a single unicode string. Percent encoded content is unquoted.
See also: extract() and extract_first()
attrib
Return the attributes dictionary for underlying element.
SelectorList objects
class scrapy.selector.SelectorList
The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.
xpath(xpath, namespaces=None, **kwargs)
Call the .xpath() method for each element in this list and return their results flattened as another
SelectorList.
query is the same argument as the one in Selector.xpath()
namespaces is an optional prefix: namespace-uri mapping (dict) for additional
prefixes to those registered with register_namespace(prefix, uri). Contrary to
register_namespace(), these prefixes are not saved for future calls.
Any additional named arguments can be used to pass values for XPath variables in the XPath expression,
e.g.:
selector.xpath('//a[href=$url]', url="http://www.example.com")
css(query)
Call the .css() method for each element in this list and return their results flattened as another
SelectorList.
query is the same argument as the one in Selector.css()
3.3. Selectors 59
Scrapy Documentation, Release 1.7.3
getall()
Call the .get() method for each element in this list and return their results flattened, as a list of unicode
strings.
See also: extract() and extract_first()
get(default=None)
Return the result of .get() for the first element in this list. If the list is empty, return the default value.
See also: extract() and extract_first()
re(regex, replace_entities=True)
Call the .re() method for each element in this list and return their results flattened, as a list of unicode
strings.
By default, character entity references are replaced by their corresponding character (except for & and <).
Passing replace_entities as False switches off these replacements.
re_first(regex, default=None, replace_entities=True)
Call the .re() method for the first element in this list and return the result as a unicode string. If the
list is empty or the regex doesn’t match anything, return the default value (None if the argument is not
provided).
By default, character entity references are replaced by their corresponding character (except for & and <).
Passing replace_entities as False switches off these replacements.
attrib
Return the attributes dictionary for the first element. If the list is empty, return an empty dict.
See also: Selecting element attributes.
3.3.4 Examples
Here are some Selector examples to illustrate several concepts. In all cases, we assume there is already a
Selector instantiated with a HtmlResponse object like this:
sel = Selector(html_response)
1. Select all <h1> elements from an HTML response body, returning a list of Selector objects (ie. a
SelectorList object):
sel.xpath("//h1")
2. Extract the text of all <h1> elements from an HTML response body, returning a list of unicode strings:
sel.xpath("//h1").getall() # this includes the h1 tag
sel.xpath("//h1/text()").getall() # this excludes the h1 tag
3. Iterate over all <p> tags and print their class attribute:
for node in sel.xpath("//p"):
print(node.attrib['class'])
Here are some examples to illustrate concepts for Selector objects instantiated with an XmlResponse object:
sel = Selector(xml_response)
1. Select all <product> elements from an XML response body, returning a list of Selector objects (ie. a
SelectorList object):
sel.xpath("//product")
2. Extract all prices from a Google Base XML feed which requires registering a namespace:
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()
3.4 Items
The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders
can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to
make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
To define common output data format Scrapy provides the Item class. Item objects are simple containers used to
collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available
fields.
Various Scrapy components use extra information provided by Items: exporters look at declared fields to figure out
columns to export, serialization can be customized using Item fields metadata, trackref tracks Item instances to
help find memory leaks (see Debugging memory leaks with trackref ), etc.
Items are declared using a simple class definition syntax and Field objects. Here is an example:
import scrapy
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field()
stock = scrapy.Field()
tags = scrapy.Field()
last_updated = scrapy.Field(serializer=str)
Note: Those familiar with Django will notice that Scrapy Items are declared similar to Django Models, except that
Scrapy Items are much simpler as there is no concept of different field types.
Field objects are used to specify metadata for each field. For example, the serializer function for the
last_updated field illustrated in the example above.
You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects.
For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects
could be used by a different component, and only those components know about it. You can also define and use any
other Field key in your project too, for your own needs. The main goal of Field objects is to provide a way to
3.4. Items 61
Scrapy Documentation, Release 1.7.3
define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain
field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by
each component.
It’s important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead,
they can be accessed through the Item.fields attribute.
Here are some examples of common tasks performed with items, using the Product item declared above. You will
notice the API is very similar to the dict API.
Creating items
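The creation transcript is missing from this extract; minimally, the product object used below would be created like this:

>>> product = Product(name='Desktop PC', price=1000)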
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product['last_updated']
Traceback (most recent call last):
...
KeyError: 'last_updated'
To access all populated values, just use the typical dict API:
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Copying items
To copy an item, you must first decide whether you want a shallow copy or a deep copy.
If your item contains mutable values like lists or dictionaries, a shallow copy will keep references to the same mutable
values across all different copies.
For example, if you have an item with a list of tags, and you create a shallow copy of that item, both the original item
and the copy have the same list of tags. Adding a tag to the list of one of the items will add the tag to the other item as
well.
If that is not the desired behavior, use a deep copy instead.
See the documentation of the copy module for more information.
To create a shallow copy of an item, you can either call copy() on an existing item (product2 = product.
copy()) or instantiate your item class from an existing item (product2 = Product(product)).
To create a deep copy, call deepcopy() instead (product2 = product.deepcopy()).
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your
original Item.
For example:
class DiscountedProduct(Product):
discount_percent = scrapy.Field(serializer=str)
discount_expiration_date = scrapy.Field()
You can also extend field metadata by using the previous field metadata and appending more values, or changing
existing values, like this:
class SpecificProduct(Product):
name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing meta-
data values.
class scrapy.item.Item([arg ])
Return a new Item optionally initialized from the given argument.
Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items
is:
fields
A dictionary containing all declared fields for this Item, not only those populated. The keys are the field
names and the values are the Field objects used in the Item declaration.
class scrapy.item.Field([arg ])
The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or at-
tributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item
declaration syntax based on class attributes.
3.5 Item Loaders
Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated
using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a
scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating
that container.
Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different
field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
To use an Item Loader, you must first instantiate it. You can either instantiate it with a dict-like object (e.g. Item or
dict) or without one, in which case an Item is automatically instantiated in the Item Loader constructor using the Item
class specified in the ItemLoader.default_item_class attribute.
Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to
the same item field; the Item Loader will know how to “join” those values later using a proper processing function.
Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter:
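The spider snippet itself is missing from this extract; a sketch of the callback (shown as it would appear inside a spider class) that matches the description below, with the Product item assumed to live in myproject.items:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()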
By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in
the page:
1. //div[@class="product_name"]
2. //div[@class="product_title"]
In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method.
This is the data that will be assigned to the name field later.
Afterwards, similar calls are used for price and stock fields (the latter using a CSS selector with the add_css()
method), and finally the last_updated field is populated directly with a literal value (today) using a different
method: add_value().
Finally, when all data is collected, the ItemLoader.load_item() method is called which actually returns
the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and
add_value() calls.
An Item Loader contains one input processor and one output processor for each (item) field. The input processor
processes the extracted data as soon as it’s received (through the add_xpath(), add_css() or add_value()
methods) and the result of the input processor is collected and kept inside the ItemLoader. After collecting all data,
the ItemLoader.load_item() method is called to populate and get the populated Item object. That’s when the
output processor is called with the data previously collected (and processed using the input processor). The result of
the output processor is the final value that gets assigned to the item.
Let’s see an example to illustrate how the input and output processors are called for a particular field (the same applies
for any other field):
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
Note: Both input and output processors must receive an iterator as their first argument. The output of those functions
can be anything. The result of input processors will be appended to an internal list (in the Loader) containing the
collected values (for that field). The result of the output processors is the value that will be finally assigned to the item.
If you want to use a plain function as a processor, make sure it receives self as the first argument:
def lowercase_processor(self, values):
for v in values:
yield v.lower()
class MyItemLoader(ItemLoader):
name_in = lowercase_processor
This is because whenever a function is assigned as a class variable, it becomes a method and would be passed the
instance as the the first argument when being called. See this answer on stackoverflow for more details.
The other thing you need to keep in mind is that the values returned by input processors are collected internally (in
lists) and then passed to output processors to populate the fields.
Last, but not least, Scrapy comes with some commonly used processors built-in for convenience.
Item Loaders are declared like Items, by using a class definition syntax. Here is an example:
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)  # use str.title on Python 3
    name_out = Join()

    price_in = MapCompose(unicode.strip)  # use str.strip on Python 3
    # ...
As you can see, input processors are declared using the _in suffix while output processors are declared us-
ing the _out suffix. And you can also declare default input/output processors using the ItemLoader.
default_input_processor and ItemLoader.default_output_processor attributes.
As seen in the previous section, input and output processors can be declared in the Item Loader definition, and it’s
very common to declare input processors this way. However, there is one more place where you can specify the input
and output processors to use: in the Item Field metadata. Here is an example:
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
def filter_price(value):
if value.isdigit():
return value
class Product(scrapy.Item):
name = scrapy.Field(
input_processor=MapCompose(remove_tags),
output_processor=Join(),
)
price = scrapy.Field(
input_processor=MapCompose(remove_tags, filter_price),
output_processor=TakeFirst(),
)
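Roughly, loading values through these field-level processors would then behave like this (a sketch; the HTML fragments are illustrative):

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', ['Welcome to my', '<strong>website</strong>'])
>>> il.add_value('price', ['&euro;', '<span>1000</span>'])
>>> il.load_item()
{'name': 'Welcome to my website', 'price': '1000'}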
The precedence order, for both input and output processors, is as follows:
1. Item Loader field-specific attributes: field_in and field_out (most precedence)
2. Field metadata (input_processor and output_processor key)
3. Item Loader defaults: ItemLoader.default_input_processor() and ItemLoader.
default_output_processor() (least precedence)
The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in
the Item Loader. It can be passed when declaring, instantiating or using Item Loader. They are used to modify the
behaviour of the input/output processors.
For example, suppose you have a function parse_length which receives a text value and extracts a length from it:
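(A minimal sketch; the parsing itself is deliberately simplified:)

def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # simplified parsing: strip the unit suffix and convert to a float
    return float(text.replace(unit, '').strip())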
By accepting a loader_context argument the function is explicitly telling the Item Loader that it’s able to receive
an Item Loader context, so the Item Loader passes the currently active context when calling it, and the processor
function (parse_length in this case) can thus use them.
There are several ways to modify Item Loader context values:
1. By modifying the currently active Item Loader context (context attribute):
loader = ItemLoader(product)
loader.context['unit'] = 'cm'
2. On Item Loader instantiation (the keyword arguments of Item Loader constructor are stored in the Item Loader
context):
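(A sketch, reusing the product item from above:)

loader = ItemLoader(product, unit='cm')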
3. On Item Loader declaration, for those input/output processors that support instantiating them with an Item
Loader context. MapCompose is one of them:
class ProductLoader(ItemLoader):
length_out = MapCompose(parse_length, unit='cm')
• response (Response object) – The response used to construct the selector using the
default_selector_class, unless the selector argument is given, in which case this
argument is ignored.
The item, selector, response and the remaining keyword arguments are assigned to the Loader context (accessible
through the context attribute).
ItemLoader instances have the following methods:
get_value(value, *processors, **kwargs)
Process the given value by the given processors and keyword arguments.
Available keyword arguments:
Parameters re (str or compiled regex) – a regular expression to use for extracting
data from the given value using extract_regex() method, applied before processors
Examples:
>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'
nested_css(css)
Create a nested loader with a css selector. The supplied selector is applied relative to selector associated
with this ItemLoader. The nested loader shares the Item with the parent ItemLoader so calls to
add_xpath(), add_value(), replace_value(), etc. will behave as expected.
get_collected_values(field_name)
Return the collected values for the given field.
get_output_value(field_name)
Return the collected values parsed using the output processor, for the given field. This method doesn’t
populate or modify the item at all.
get_input_processor(field_name)
Return the input processor for the given field.
get_output_processor(field_name)
Return the output processor for the given field.
ItemLoader instances have the following attributes:
item
The Item object being parsed by this Item Loader.
context
The currently active Context of this Item Loader.
default_item_class
An Item class (or factory), used to instantiate items when not given in the constructor.
default_input_processor
The default input processor to use for those fields which don’t specify one.
default_output_processor
The default output processor to use for those fields which don’t specify one.
default_selector_class
The class used to construct the selector of this ItemLoader, if only a response is given in the
constructor. If a selector is given in the constructor this attribute is ignored. This attribute is sometimes
overridden in subclasses.
selector
The Selector object to extract data from. It’s either the selector given in the constructor or one created
from the response given in the constructor using the default_selector_class. This attribute is
meant to be read-only.
When parsing related values from a subsection of a document, it can be useful to create nested loaders. Imagine you’re
extracting details from a footer of a page that looks something like:
Example:
<footer>
<a class="social" href="https://facebook.com/whatever">Like Us</a>
<a class="social" href="https://twitter.com/whatever">Follow Us</a>
<a class="email" href="mailto:[email protected]">Email Us</a>
</footer>
Without nested loaders, you need to specify the full xpath (or css) for each value that you wish to extract.
Example:
loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath('social', '//footer/a[@class = "social"]/@href')
loader.add_xpath('email', '//footer/a[@class = "email"]/@href')
loader.load_item()
Instead, you can create a nested loader with the footer selector and add values relative to the footer. The functionality
is the same but you avoid repeating the footer selector.
Example:
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
You can nest loaders arbitrarily and they work with either xpath or css selectors. As a general guideline, use nested
loaders when they make your code simpler but do not go overboard with nesting or your parser can become difficult
to read.
As your project grows bigger and acquires more and more spiders, maintenance becomes a fundamental problem,
especially when you have to deal with many different parsing rules for each spider, having a lot of exceptions, but also
wanting to reuse the common processors.
Item Loaders are designed to ease the maintenance burden of parsing rules, without losing flexibility and, at the same
time, providing a convenient mechanism for extending and overriding them. For this reason Item Loaders support
traditional Python class inheritance for dealing with differences of specific spiders (or groups of spiders).
Suppose, for example, that some particular site encloses their product names in three dashes (e.g. ---Plasma
TV---) and you don’t want to end up scraping those dashes in the final product names.
Here’s how you can remove those dashes by reusing and extending the default Product Item Loader
(ProductLoader):
def strip_dashes(x):
return x.strip('-')
class SiteSpecificLoader(ProductLoader):
name_in = MapCompose(strip_dashes, ProductLoader.name_in)
Another case where extending Item Loaders can be very helpful is when you have multiple source formats, for example
XML and HTML. In the XML version you may want to remove CDATA occurrences. Here’s an example of how to do
it:
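A sketch of such an XML-specific loader, assuming a remove_cdata helper defined somewhere in your project (the myproject paths below are illustrative):

from scrapy.loader.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata

class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)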
Even though you can use any callable function as input and output processors, Scrapy provides some commonly
used processors, which are described below. Some of them, like the MapCompose (which is typically used as input
processor) compose the output of several functions executed in order, to produce the final parsed value.
Here is a list of all built-in processors:
class scrapy.loader.processors.Identity
The simplest processor, which doesn’t do anything. It returns the original values unchanged. It doesn’t receive
any constructor arguments, nor does it accept Loader contexts.
Example:
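(The transcript is missing from this extract; a minimal sketch:)

>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']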
class scrapy.loader.processors.TakeFirst
Returns the first non-null/non-empty value from the values received, so it’s typically used as an output processor
to single-valued fields. It doesn’t receive any constructor arguments, nor does it accept Loader contexts.
Example:
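(A minimal sketch:)

>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'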
class scrapy.loader.processors.Join(separator=u' ')
Returns the values joined with the separator given in the constructor, which defaults to u' '. It doesn’t accept
Loader contexts.
When using the default separator, this processor is equivalent to the function: u' '.join
Examples:
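(A minimal sketch, with the default separator and then a custom one:)

>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
'one<br>two<br>three'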
class scrapy.loader.processors.Compose(*functions, **default_loader_context)
A processor which is constructed from the composition of the given functions. Each input value of this processor is
passed to the first function, the result of that function is passed to the second function, and so on, until the last function
returns the output value of this processor.
Each function can optionally receive a loader_context parameter. For those which do, this processor will
pass the currently active Loader context through that parameter.
The keyword arguments passed in the constructor are used as the default Loader context values passed to each
function call. However, the final Loader context values passed to functions are overridden with the currently
active Loader context accessible through the ItemLoader.context() attribute.
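(A minimal sketch of Compose in action:)

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'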
class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)
A processor which is constructed from the composition of the given functions, similar to the Compose pro-
cessor. The difference with this processor is the way internal results are passed among functions, which is as
follows:
The input value of this processor is iterated and the first function is applied to each element. The results of these
function calls (one for each element) are concatenated to construct a new iterable, which is then used to apply
the second function, and so on, until the last function is applied to each value of the list of values collected so
far. The output values of the last function are concatenated together to produce the output of this processor.
Each particular function can return a value or a list of values, which is flattened with the list of values returned
by the same function applied to the other input values. The functions can also return None in which case the
output of that function is ignored for further processing over the chain.
This processor provides a convenient way to compose functions that only work with single values (instead of
iterables). For this reason the MapCompose processor is typically used as input processor, since data is often
extracted using the extract() method of selectors, which returns a list of unicode strings.
The example below should clarify how it works:
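(A sketch; note how the None returned by filter_world drops that value:)

>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO', 'THIS', 'IS', 'SCRAPY']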
As with the Compose processor, functions can receive Loader contexts, and constructor keyword arguments are
used as default context values. See Compose processor for more info.
class scrapy.loader.processors.SelectJmes(json_path)
Queries the value using the json path provided to the constructor and returns the output. Requires jmespath
(https://github.com/jmespath/jmespath.py) to run. This processor takes only one input at a time.
Example:
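(A minimal sketch:)

>>> from scrapy.loader.processors import SelectJmes
>>> proc = SelectJmes("foo")  # for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}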
3.6 Scrapy shell
The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having
to run the spider. It’s meant to be used for testing data extraction code, but you can actually use it for testing any kind
of code as it is also a regular Python shell.
The shell is used for testing XPath or CSS expressions and see how they work and what data they extract from the web
pages you’re trying to scrape. It allows you to interactively test your expressions while you’re writing your spider,
without having to run the spider to test every change.
Once you get familiarized with the Scrapy shell, you’ll see that it’s an invaluable tool for developing and debugging
your spiders.
If you have IPython installed, the Scrapy shell will use it (instead of the standard Python console). The IPython console
is much more powerful and provides smart auto-completion and colorized output, among other things.
We highly recommend you install IPython, especially if you're working on Unix systems (where IPython excels). See
the IPython installation guide for more info.
Scrapy also has support for bpython, and will try to use it where IPython is unavailable.
Through scrapy’s settings you can configure it to use any one of ipython, bpython or the standard python shell,
regardless of which are installed. This is done by setting the SCRAPY_PYTHON_SHELL environment variable; or by
defining it in your scrapy.cfg:
[settings]
shell = bpython
To launch the Scrapy shell you can use the shell command like this:

scrapy shell <url>

where the <url> is the URL you want to scrape.

shell also works for local files. This can be handy if you want to play around with a local copy of a web page:
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html
# File URI
scrapy shell file:///absolute/path/to/file.html
Note: When using relative file paths, be explicit and prepend them with ./ (or ../ when relevant). scrapy
shell index.html will not work as one might expect (and this is by design, not a bug).
Because shell favors HTTP URLs over File URIs, and index.html being syntactically similar to example.
com, shell will treat index.html as a domain name and trigger a DNS lookup error:
shell will not test beforehand if a file called index.html exists in the current directory. Again, be explicit.
The Scrapy shell is just a regular Python console (or IPython console if you have it available) which provides some
additional shortcut functions for convenience.
Available Shortcuts
• shelp() - print a help with the list of available objects and shortcuts
• fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related
objects accordingly. You can optionally ask for HTTP 3xx redirections to not be followed by passing
redirect=False
• fetch(request) - fetch a new response from the given request and update all related objects accordingly.
• view(response) - open the given response in your local web browser, for inspection. This will add a <base>
tag to the response body in order for external links (such as images and style sheets) to display properly. Note,
however, that this will create a temporary file in your computer, which won’t be removed automatically.
The Scrapy shell automatically creates some convenient objects from the downloaded page, like the Response object
and the Selector objects (for both HTML and XML content).
Those objects are:
• crawler - the current Crawler object.
• spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL.
• request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut.
• response - a Response object containing the last fetched page.
• settings - the current Scrapy settings.
Here’s an example of a typical shell session where we start by scraping the https://scrapy.org page, and then proceed
to scrape the https://reddit.com page. Finally, we modify the (Reddit) request method to POST and re-fetch it getting
an error. We end the session by typing Ctrl-D (in Unix systems) or Ctrl-Z in Windows.
Keep in mind that the data extracted here may not be the same when you try it, as those pages are not static and could
have changed by the time you test this. The only purpose of this example is to get you familiarized with how the
Scrapy shell works.
First, we launch the shell:
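(A typical invocation; --nolog just silences the log output:)

scrapy shell 'https://scrapy.org' --nolog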
Then, the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful
shortcuts (you’ll notice that these lines all start with the [s] prefix):
>>>
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>> fetch("https://reddit.com")
>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'
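As described above, the transcript then switches the (Reddit) request to POST before the error below; the intermediate steps look like this:

>>> request = request.replace(method="POST")
>>> fetch(request)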
>>> response.status
404
>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
'Cache-Control': ['max-age=0, must-revalidate'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
'Server': ['snooserv'],
'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
'Vary': ['accept-encoding'],
'Via': ['1.1 varnish'],
'X-Cache': ['MISS'],
'X-Cache-Hits': ['0'],
'X-Content-Type-Options': ['nosniff'],
'X-Frame-Options': ['SAMEORIGIN'],
'X-Moose': ['majestic'],
'X-Served-By': ['cache-cdg8730-CDG'],
'X-Timer': ['S1481214079.394283,VS0,VE159'],
'X-Ua-Compatible': ['IE=edge'],
'X-Xss-Protection': ['1; mode=block']}
>>>
Sometimes you want to inspect the responses that are being processed at a certain point in your spider, if only to check
that the response you expect is getting there.
This can be achieved by using the scrapy.shell.inspect_response function.
Here’s an example of how you would call it from your spider:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)
        # Rest of parsing code.
When you run the spider, you will get something similar to this:
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
>>> response.url
'http://example.org'
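Then, you can check if the extraction code you are testing works (the XPath below is an illustrative guess at a selector that turns out not to match):

>>> response.xpath('//h1[@class="fn"]/text()')
[]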
Nope, it doesn’t. So you can open the response in your web browser and see if it’s the response you were expecting:
>>> view(response)
True
Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Note that you can’t use the fetch shortcut here since the Scrapy engine is blocked by the shell. However, after you
leave the shell, the spider will continue crawling where it stopped, as shown above.
3.7 Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components
that are executed sequentially.
Each item pipeline component (sometimes referred as just “Item Pipeline”) is a Python class that implements a simple
method. They receive an item and perform an action over it, also deciding if the item should continue through the
pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
Each item pipeline component is a Python class that must implement the following method:
process_item(self, item, spider)
This method is called for every item pipeline component. process_item() must either: return a dict with
data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception.
Dropped items are no longer processed by further pipeline components.
Parameters
• item (Item object or a dict) – the item scraped
• spider (Spider object) – the spider which scraped the item
Additionally, they may also implement the following methods:
open_spider(self, spider)
This method is called when the spider is opened.
Parameters spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters spider (Spider object) – the spider which was closed
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance
of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a
way for pipeline to access them and hook its functionality into Scrapy.
Parameters crawler (Crawler object) – crawler that uses this pipeline
Let’s take a look at the following hypothetical pipeline that adjusts the price attribute for those items that do not
include VAT (price_excludes_vat attribute), and drops those items which don’t contain a price:
from scrapy.exceptions import DropItem
class PricePipeline(object):
vat_factor = 1.15
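    # The method below is elided in this extract; a reconstruction matching the
    # description above (adjust prices lacking VAT, drop items without a price):
    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)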
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item
per line serialized in JSON format:
import json
class JsonWriterPipeline(object):
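    # The methods below are elided in this extract; a reconstruction matching the
    # description above (open the file per spider, write one JSON line per item):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item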
Note: The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store
all scraped items into a JSON file you should use the Feed exports.
In this example we’ll write items to MongoDB using pymongo. MongoDB address and database name are specified
in Scrapy settings; MongoDB collection is named after item class.
The main point of this example is to show how to use the from_crawler() method and how to clean up the resources
properly:
import pymongo
class MongoPipeline(object):
collection_name = 'scrapy_items'
@classmethod
def from_crawler(cls, crawler):
return cls(
mongo_uri=crawler.settings.get('MONGO_URI'),
mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
)
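    # The remaining methods are elided in this extract; a reconstruction matching the
    # description above (__init__ receives the values built in from_crawler, the
    # spider hooks manage the connection, and process_item inserts each item):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item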
This example demonstrates how to return a Deferred from the process_item() method. It uses Splash to render a
screenshot of the item URL. The pipeline makes a request to a locally running instance of Splash. After the request is
downloaded and the Deferred callback fires, it saves the screenshot to a file and adds the filename to the item.
import scrapy
import hashlib
from urllib.parse import quote
class ScreenshotPipeline(object):
"""Pipeline that uses Splash to render screenshot of
every Scrapy item."""
SPLASH_URL = "http://localhost:8050/render.png?url={}"
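    # The methods below are elided in this extract; a reconstruction matching the
    # description above (download a screenshot through Splash, then save it and
    # record the filename on the item once the Deferred fires):
    def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item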
Duplicates filter
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have
a unique id, but our spider returns multiple items with the same id:
class DuplicatesPipeline(object):
def __init__(self):
self.ids_seen = set()
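    # The method below is elided in this extract; a reconstruction matching the
    # description above (assumes DropItem imported from scrapy.exceptions, as in
    # the PricePipeline example):
    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item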
To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, like in the following
example:
ITEM_PIPELINES = {
'myproject.pipelines.PricePipeline': 300,
'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer values you assign to classes in this setting determine the order in which they run: items go through from
lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range.
3.8 Feed exports
3.8.1 Serialization formats
For serializing the scraped data, the feed exports use the Item exporters. These formats are supported out of the box:
• JSON
• JSON lines
• CSV
• XML
But you can also extend the supported format through the FEED_EXPORTERS setting.
JSON
• FEED_FORMAT: json
• Exporter used: JsonItemExporter
• See this warning if you’re using JSON with large feeds.
JSON lines
• FEED_FORMAT: jsonlines
• Exporter used: JsonLinesItemExporter
CSV
• FEED_FORMAT: csv
• Exporter used: CsvItemExporter
• To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use
this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
XML
• FEED_FORMAT: xml
• Exporter used: XmlItemExporter
Pickle
• FEED_FORMAT: pickle
• Exporter used: PickleItemExporter
Marshal
• FEED_FORMAT: marshal
• Exporter used: MarshalItemExporter
3.8.2 Storages
When using the feed exports you define where to store the feed using a URI (through the FEED_URI setting). The
feed exports support multiple storage backend types which are defined by the URI scheme.
The storage backends supported out of the box are:
• Local filesystem
• FTP
• S3 (requires botocore or boto)
• Standard output
Some storage backends may be unavailable if the required external libraries are not available. For example, the S3
backend is only available if the botocore or boto library is installed (Scrapy supports boto only on Python 2).
The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:
• %(time)s - gets replaced by a timestamp when the feed is being created
• %(name)s - gets replaced by the spider name
Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s
would get replaced by the spider.site_id attribute the moment the feed is being created.
Here are some examples to illustrate:
• Store in FTP using one directory per spider:
– ftp://user:[email protected]/scraping/feeds/%(name)s/%(time)s.json
• Store in S3 using one directory per spider:
– s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
Local filesystem
FTP
S3
• Required external libraries: botocore (Python 2 and Python 3) or boto (Python 2 only)
The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:
• AWS_ACCESS_KEY_ID
• AWS_SECRET_ACCESS_KEY
You can also define a custom ACL for exported feeds using this setting:
• FEED_STORAGE_S3_ACL
Standard output
The feeds are written to the standard output of the Scrapy process.
• URI scheme: stdout
• Example URI: stdout:
• Required external libraries: none
3.8.5 Settings
These are the settings used for configuring the feed exports:
• FEED_URI (mandatory)
• FEED_FORMAT
• FEED_STORAGES
• FEED_STORAGE_FTP_ACTIVE
• FEED_STORAGE_S3_ACL
• FEED_EXPORTERS
• FEED_STORE_EMPTY
• FEED_EXPORT_ENCODING
• FEED_EXPORT_FIELDS
• FEED_EXPORT_INDENT
FEED_URI
Default: None
The URI of the export feed. See Storage backends for supported URI schemes.
This setting is required for enabling the feed exports.
FEED_FORMAT
The serialization format to be used for the feed. See Serialization formats for possible values.
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed.
If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding
(\uXXXX sequences) for historic reasons.
Use utf-8 if you want UTF-8 for JSON too.
FEED_EXPORT_FIELDS
Default: None
A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].
Use FEED_EXPORT_FIELDS option to define fields to export and their order.
When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses fields defined in dicts or Item subclasses a
spider is yielding.
If an exporter requires a fixed set of fields (this is the case for CSV export format) and FEED_EXPORT_FIELDS is
empty or None, then Scrapy tries to infer field names from the exported data - currently it uses field names from the
first item.
FEED_EXPORT_INDENT
Default: 0
Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT is a non-negative integer, then
array elements and object members will be pretty-printed with that indent level. An indent level of 0 (the default), or
negative, will put each item on a new line. None selects the most compact representation.
Currently implemented only by JsonItemExporter and XmlItemExporter, i.e. when you are exporting to
.json or .xml.
FEED_STORE_EMPTY
Default: False
Whether to export empty feeds (ie. feeds with no items).
FEED_STORAGES
Default: {}
A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the
values are paths to storage classes.
FEED_STORAGE_FTP_ACTIVE
Default: False
Whether to use the active connection mode when exporting feeds to an FTP server (True) or use the passive connec-
tion mode instead (False, default).
For information about FTP connection modes, see What is the difference between active and passive FTP?.
FEED_STORAGE_S3_ACL
FEED_STORAGES_BASE
Default:
{
'': 'scrapy.extensions.feedexport.FileFeedStorage',
'file': 'scrapy.extensions.feedexport.FileFeedStorage',
'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
's3': 'scrapy.extensions.feedexport.S3FeedStorage',
'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by
assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without
replacement), place this in your settings.py:
FEED_STORAGES = {
'ftp': None,
}
FEED_EXPORTERS
Default: {}
A dict containing additional exporters supported by your project. The keys are serialization formats and the values are
paths to Item exporter classes.
FEED_EXPORTERS_BASE
Default:
{
'json': 'scrapy.exporters.JsonItemExporter',
'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
'jl': 'scrapy.exporters.JsonLinesItemExporter',
'csv': 'scrapy.exporters.CsvItemExporter',
'xml': 'scrapy.exporters.XmlItemExporter',
'marshal': 'scrapy.exporters.MarshalItemExporter',
'pickle': 'scrapy.exporters.PickleItemExporter',
}
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assign-
ing None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without
replacement), place this in your settings.py:
FEED_EXPORTERS = {
'csv': None,
}
3.9 Requests and Responses
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader,
which executes the request and returns a Response object which travels back to the spider that issued the request.
Both Request and Response classes have subclasses which add functionality not required in the base classes.
These are described below in Request subclasses and Response subclasses.
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
request_with_cookies = Request(url="http://www.example.com",
cookies=[{'name': 'currency',
'value': 'USD',
'domain': 'example.com',
'path': '/currency'}])
The latter form allows for customizing the domain and path attributes of the cookie. This
is only useful if the cookies are saved for later requests. When some site returns cookies (in
a response) those are stored in the cookies for that domain and will be sent again in future
requests. That’s the typical behaviour of any regular web browser. However, if, for some
reason, you want to avoid merging with existing cookies you can instruct Scrapy not to merge
them by setting the dont_merge_cookies key to True in Request.meta.
Example of request without merging cookies:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually
populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this
dict depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and
can also be accessed, in your spider, from the response.meta attribute.
cb_kwargs
A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request’s
callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get
a Response object as argument.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and
can also be accessed, in your spider, from the response.cb_kwargs attribute.
copy()
Return a new Request which is a copy of this Request. See also: Passing additional data to callback
functions.
replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])
Return a Request object with the same members, except for those members given new values by whichever
keyword arguments are specified. The Request.cb_kwargs and Request.meta attributes are shal-
low copied by default (unless new values are given as arguments). See also Passing additional data to
callback functions.
The callback of a request is a function that will be called when the response of that request is downloaded. The
callback function will be called with the downloaded Response object as its first argument.
Example:
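The code for this example is missing from this copy; a minimal sketch of a request whose callback handles the downloaded response (the URL and method names are illustrative):

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)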
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments
later, in the second callback. The following example shows how to achieve this by using the Request.cb_kwargs
attribute:
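The example itself is missing from this copy; a sketch of passing extra keyword arguments to a callback through cb_kwargs (the URLs and argument names are illustrative):

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )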
Caution: Request.cb_kwargs was introduced in version 1.7. Prior to that, using Request.meta was
recommended for passing information around callbacks. After 1.7, Request.cb_kwargs became the pre-
ferred way for handling user information, leaving Request.meta for communication with components like
middlewares and extensions.
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Twisted Failure instance as first parameter and can be used to track connection establishment timeouts,
DNS errors etc.
Here’s an example spider logging all errors and catching some specific errors if needed:
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy
and its built-in extensions.
Those are:
• dont_redirect
• dont_retry
• handle_httpstatus_list
• handle_httpstatus_all
• dont_merge_cookies
• cookiejar
• dont_cache
• redirect_reasons
• redirect_urls
• bindaddress
• dont_obey_robotstxt
• download_timeout
• download_maxsize
• download_latency
• download_fail_on_dataloss
• proxy
• ftp_user (See FTP_USER for more info)
• ftp_password (See FTP_PASSWORD for more info)
• referrer_policy
• max_retry_times
bindaddress
The outgoing IP address to use for performing the request.
download_timeout
The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.
download_latency
The amount of time spent to fetch the response, since the request has been started, i.e. since the HTTP message was
sent over the network. This meta key only becomes available when the response has been downloaded. While most
other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss
max_retry_times
This meta key is used to set the maximum number of retries per request. When set, the max_retry_times meta key
takes precedence over the RETRY_TIMES setting.
Here is the list of built-in Request subclasses. You can also subclass Request to implement your own custom functionality.
FormRequest objects
The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html
forms to pre-populate form fields with form data from Response objects.
class scrapy.http.FormRequest(url[, formdata, ... ])
The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as
for the Request class and are not documented here.
Parameters formdata (dict or iterable of tuples) – is a dictionary (or iterable of
(key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the
body of the request.
The FormRequest objects support the following class method in addition to the standard Request methods:
classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ... ])
Returns a new FormRequest object with its form field values pre-populated with those found in the HTML
<form> element contained in the given response.
By default, this method simulates a click on the first form control that looks clickable, like a <input
type="submit">. Even though this is quite convenient, and often the desired behaviour,
sometimes it can cause problems which could be hard to debug. For example, when working with forms
that are filled and/or submitted using javascript, the default from_response() behaviour may not be
the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if
you want to change the control clicked (instead of disabling it) you can also use the clickdata argument.
Caution: Using this method with select elements which have leading or trailing whitespace in the
option values will not work due to a bug in lxml, which should be fixed in lxml 3.8 and above.
Parameters
• response (Response object) – the response containing a HTML form which will be
used to pre-populate the form fields
• formname (string) – if given, the form with name attribute set to this value will be
used.
• formid (string) – if given, the form with id attribute set to this value will be used.
• formxpath (string) – if given, the first form that matches the xpath will be used.
• formcss (string) – if given, the first form that matches the css selector will be used.
• formnumber (integer) – the number of the form to use, when the response contains
multiple forms. The first one (and also the default) is 0.
• formdata (dict) – fields to override in the form data. If a field was already present in
the response <form> element, its value is overridden by the one passed in this parameter.
If a value passed in this parameter is None, the field will not be included in the request,
even if it was present in the response <form> element.
• clickdata (dict) – attributes to lookup the control clicked. If it’s not given, the form
data will be submitted simulating a click on the first clickable element. In addition to html
attributes, the control can be identified by its zero-based index relative to other submittable
inputs inside the form, via the nr attribute.
• dont_click (boolean) – If True, the form data will be submitted without clicking in
any element.
The other parameters of this class method are passed directly to the FormRequest constructor.
New in version 0.10.3: The formname parameter.
New in version 0.17: The formxpath parameter.
New in version 1.1.0: The formcss parameter.
New in version 1.1.0: The formid parameter.
If you want to simulate a HTML Form POST in your spider and send a couple of key-value fields, you can return a
FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
formdata={'name': 'John Doe', 'age': '27'},
callback=self.after_post)]
It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such
as session related data or authentication tokens (for login pages). When scraping, you’ll want these fields to be
automatically pre-populated and only override a couple of them, such as the user name and password. You can use the
FormRequest.from_response() method for this job. Here’s an example spider which uses it:
import scrapy
def authentication_failed(response):
# TODO: Check the contents of the response and return True if it failed
# or False if it succeeded.
pass
class LoginSpider(scrapy.Spider):
name = 'example.com'
start_urls = ['http://www.example.com/users/login.php']
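The spider body is truncated in this copy; a sketch of how it could continue, using FormRequest.from_response() to submit the login form (the form field names and the after_login callback are illustrative; authentication_failed() is the helper defined above):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # pre-populate the hidden fields of the login form and
        # override only the credentials
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return
        # continue scraping with an authenticated session...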
JSONRequest
The JSONRequest class extends the base Request class with functionality for dealing with JSON requests.
class scrapy.http.JSONRequest(url[, ... data, dumps_kwargs ])
The JSONRequest class adds two new arguments to the constructor. The remaining arguments are the same as
for the Request class and are not documented here.
Using the JSONRequest will set the Content-Type header to application/json and the Accept header
to application/json, text/javascript, */*; q=0.01.
Parameters
• data (JSON serializable object) – is any JSON serializable object that needs
to be JSON encoded and assigned to the body. If the Request.body argument is provided this
parameter will be ignored. If the Request.body argument is not provided and the data argument
is provided, Request.method will be set to 'POST' automatically.
• dumps_kwargs (dict) – Parameters that will be passed to underlying json.dumps
method which is used to serialize data into JSON format.
data = {
'name1': 'value1',
'name2': 'value2',
}
yield JSONRequest(url='http://www.example.com/post/action', data=data)
response.headers.getlist('Set-Cookie')
body
The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the
unicode version use TextResponse.text (only available in TextResponse and subclasses).
This attribute is read-only. To change the body of a Response use replace().
request
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after
the response and the request have passed through all Downloader Middlewares. In particular, this means
that:
• HTTP redirections will cause the original request (to the URL before redirection) to be assigned to
the redirected response (with the final URL after redirection).
urlparse.urljoin(response.url, url)
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your
own functionality.
TextResponse objects
Parameters encoding (string) – is a string which contains the encoding to use for this re-
sponse. If you create a TextResponse object with a unicode body, it will be encoded using
this encoding (remember the body attribute is always a string). If encoding is None (default
value), the encoding will be looked up in the response headers and body instead.
TextResponse objects support the following attributes in addition to the standard Response ones:
text
Response body, as unicode.
The same as response.body.decode(response.encoding), but the result is cached after the
first call, so you can access response.text multiple times without extra overhead.
Note: unicode(response.body) is not a correct way to convert response body to unicode: you
would be using the system default encoding (typically ascii) instead of the response encoding.
encoding
A string with the encoding of this response. The encoding is resolved by trying the following mechanisms,
in order:
1. the encoding passed in the constructor encoding argument
2. the encoding declared in the Content-Type HTTP header. If this encoding is not valid (ie. unknown),
it is ignored and the next resolution mechanism is tried.
3. the encoding declared in the response body. The TextResponse class doesn’t provide any special
functionality for this. However, the HtmlResponse and XmlResponse classes do.
4. the encoding inferred by looking at the response body. This is the more fragile method but also the
last one tried.
selector
A Selector instance using the response as target. The selector is lazily instantiated on first access.
TextResponse objects support the following methods in addition to the standard Response ones:
xpath(query)
A shortcut to TextResponse.selector.xpath(query):
response.xpath('//p')
css(query)
A shortcut to TextResponse.selector.css(query):
response.css('p')
body_as_unicode()
The same as text, but available as a method. This method is kept for backward compatibility; please
prefer response.text.
HtmlResponse objects
XmlResponse objects
3.10 Link Extractors
Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response
objects) which will be eventually followed.
There is scrapy.linkextractors.LinkExtractor available in Scrapy, but you can create your own custom
Link Extractors to suit your needs by implementing a simple interface.
The only public method that every link extractor has is extract_links, which receives a Response object
and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their
extract_links method called several times with different responses to extract links to follow.
Link extractors are used in the CrawlSpider class (available in Scrapy), through a set of rules, but you can also use
it in your spiders, even if you don’t subclass from CrawlSpider, as its purpose is very simple: to extract links.
Link extractors classes bundled with Scrapy are provided in the scrapy.linkextractors module.
The default link extractor is LinkExtractor, which is the same as LxmlLinkExtractor and can be imported
from the scrapy.linkextractors module.
There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
LxmlLinkExtractor
• canonicalize (boolean) – canonicalize each extracted URL. Defaults to False. Note that
canonicalization can change the URL visible at server side, so the response can be
different for requests with canonicalized and raw URLs. If you’re using LinkExtractor to
follow links it is more robust to keep the default canonicalize=False.
• unique (boolean) – whether duplicate filtering should be applied to extracted links.
• process_value (callable) – a function which receives each value extracted from the
tag and attributes scanned and can modify the value and return a new one, or return None
to ignore the link altogether. If not given, process_value defaults to lambda x: x.
For example, to extract links from anchors whose href value is a javascript:goToPage('...') call, you can use the following function as process_value:
import re

def process_value(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
3.11 Settings
The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions,
pipelines and spiders themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull
configuration values from. The settings can be populated through different mechanisms, which are described below.
The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).
For a list of available built-in settings see: Built-in settings reference.
When you use Scrapy, you have to tell it which settings you’re using. You can do this by using an environment variable,
SCRAPY_SETTINGS_MODULE.
The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings. Note
that the settings module should be on the Python import search path.
Settings can be populated using different mechanisms, each of which has a different precedence. Here is the list of
them in decreasing order of precedence:
1. Command line options (highest precedence)
2. Settings per-spider
3. Project settings module
4. Default settings per-command
5. Default global settings (lowest precedence)
1. Command line options
Arguments provided by the command line take the highest precedence, overriding any other options. You
can explicitly override one (or more) settings using the -s (or --set) command line option.
Example:
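The example itself is missing from this copy; for instance, overriding the log file for a single run might look like this (myspider is an illustrative spider name):

scrapy crawl myspider -s LOG_FILE=scrapy.log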
2. Settings per-spider
Spiders (See the Spiders chapter for reference) can define their own settings that will take precedence and override the
project ones. They can do so by setting their custom_settings attribute:
class MySpider(scrapy.Spider):
name = 'myspider'
custom_settings = {
'SOME_SETTING': 'some value',
}
3. Project settings module
The project settings module is the standard configuration file for your Scrapy project, and it’s where most of your
custom settings will be populated. For a standard Scrapy project, this means you’ll be adding or changing the settings
in the settings.py file created for your project.
4. Default settings per-command
Each Scrapy tool command can have its own default settings, which override the global default settings. Those custom
command settings are specified in the default_settings attribute of the command class.
5. Default global settings
The global defaults are located in the scrapy.settings.default_settings module and documented in the
Built-in settings reference section.
In a spider, the settings are available through self.settings:
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
Note: The settings attribute is set in the base Spider class after the spider is initialized. If you want to
use the settings before the initialization (e.g., in your spider’s __init__() method), you’ll need to override the
from_crawler() method.
Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler that is
passed to from_crawler method in extensions, middlewares and item pipelines:
class MyExtension(object):
def __init__(self, log_is_enabled=False):
if log_is_enabled:
print("log is enabled!")
@classmethod
def from_crawler(cls, crawler):
settings = crawler.settings
return cls(settings.getbool('LOG_ENABLED'))
The settings object can be used like a dict (e.g., settings['LOG_ENABLED']), but it’s usually preferred to extract
the setting in the format you need it to avoid type errors, using one of the methods provided by the Settings API.
Setting names are usually prefixed with the component that they configure. For example, proper setting names for
a fictional robots.txt extension would be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR,
etc.
Here’s a list of all available Scrapy settings, in alphabetical order, along with their default values and the scope where
they apply.
The scope, where available, shows where the setting is being used, if it’s tied to any particular component. In that case
the module of that component will be shown, typically an extension, middleware or pipeline. It also means that the
component must be enabled in order for the setting to have any effect.
AWS_ACCESS_KEY_ID
Default: None
The AWS access key used by code that requires access to Amazon Web services, such as the S3 feed storage backend.
AWS_SECRET_ACCESS_KEY
Default: None
The AWS secret key used by code that requires access to Amazon Web services, such as the S3 feed storage backend.
AWS_ENDPOINT_URL
Default: None
Endpoint URL used for S3-like storage, for example Minio or s3.scality. Only supported with botocore library.
AWS_USE_SSL
Default: None
Use this option if you want to disable SSL connection for communication with S3 or S3-like storage. By default SSL
will be used. Only supported with botocore library.
AWS_VERIFY
Default: None
Verify SSL connection between Scrapy and S3 or S3-like storage. By default SSL verification will occur. Only
supported with botocore library.
AWS_REGION_NAME
Default: None
The name of the region associated with the AWS client. Only supported with botocore library.
BOT_NAME
Default: 'scrapybot'
The name of the bot implemented by this Scrapy project (also known as the project name). This will be used to
construct the User-Agent by default, and also for logging.
It’s automatically populated with your project name when you create your project with the startproject com-
mand.
CONCURRENT_ITEMS
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the
Item Pipeline).
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-
zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words,
concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is
non-zero, download delay is enforced per IP, not per domain.
DEFAULT_ITEM_CLASS
Default: 'scrapy.item.Item'
The default class that will be used for instantiating items in the Scrapy shell.
DEFAULT_REQUEST_HEADERS
Default:
{
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
The default headers used for Scrapy HTTP Requests. They’re populated in the DefaultHeadersMiddleware.
DEPTH_LIMIT
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_PRIORITY
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
An integer that is used to adjust the priority of a Request based on its depth.
The priority of a request is adjusted as follows:
request.priority = request.priority - ( depth * DEPTH_PRIORITY )
As depth increases, positive values of DEPTH_PRIORITY decrease request priority (BFO), while negative values
increase request priority (DFO). See also Does Scrapy crawl in breadth-first or depth-first order?.
Note: This setting adjusts priority in the opposite way compared to other priority settings
REDIRECT_PRIORITY_ADJUST and RETRY_PRIORITY_ADJUST.
DEPTH_STATS_VERBOSE
Default: False
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
Whether to collect verbose depth stats. If this is enabled, the number of requests for each depth is collected in the
stats.
DNSCACHE_ENABLED
Default: True
Whether to enable DNS in-memory cache.
DNSCACHE_SIZE
Default: 10000
DNS in-memory cache size.
DNS_TIMEOUT
Default: 60
Timeout for processing of DNS queries in seconds. Float is supported.
DOWNLOADER
Default: 'scrapy.core.downloader.Downloader'
The downloader to use for crawling.
DOWNLOADER_HTTPCLIENTFACTORY
Default: 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
Defines a Twisted protocol.ClientFactory class to use for HTTP/1.0 connections (for
HTTP10DownloadHandler).
Note: HTTP/1.0 is rarely used nowadays so you can safely ignore this setting, unless you use Twisted<11.1, or if you
really want to use HTTP/1.0 and override DOWNLOAD_HANDLERS_BASE for http(s) scheme accordingly, i.e. to
'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler'.
DOWNLOADER_CLIENTCONTEXTFACTORY
Default: 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
Represents the classpath to the ContextFactory to use.
Here, “ContextFactory” is a Twisted term for SSL/TLS contexts, defining the TLS/SSL protocol version to use,
whether to do certificate verification, or even enable client-side authentication (and various other things).
Note: Scrapy default context factory does NOT perform remote server certificate verification. This is usually fine
for web scraping.
If you do need remote server certificate verification enabled, Scrapy also has another context factory class that you can
set, 'scrapy.core.downloader.contextfactory.BrowserLikeContextFactory', which uses the
platform’s certificates to validate remote endpoints. This is only available if you use Twisted>=14.0.
If you do use a custom ContextFactory, make sure it accepts a method parameter at init (this is the OpenSSL.SSL
method mapping DOWNLOADER_CLIENT_TLS_METHOD).
DOWNLOADER_CLIENT_TLS_METHOD
Default: 'TLS'
Use this setting to customize the TLS/SSL method used by the default HTTP/1.1 downloader.
This setting must be one of these string values:
• 'TLS': maps to OpenSSL’s TLS_method() (a.k.a SSLv23_method()), which allows protocol negotia-
tion, starting from the highest supported by the platform; default, recommended
• 'TLSv1.0': this value forces HTTPS connections to use TLS version 1.0 ; set this if you want the behavior
of Scrapy<1.1
• 'TLSv1.1': forces TLS version 1.1
• 'TLSv1.2': forces TLS version 1.2
• 'SSLv3': forces SSL version 3 (not recommended)
Note: We recommend that you use PyOpenSSL>=0.13 and Twisted>=0.13 or above (Twisted>=14.0 if you can).
DOWNLOADER_MIDDLEWARES
Default: {}
A dict containing the downloader middlewares enabled in your project, and their orders. For more info see Activating
a downloader middleware.
DOWNLOADER_MIDDLEWARES_BASE
Default:
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
A dict containing the downloader middlewares enabled by default in Scrapy. Low orders are closer to the en-
gine, high orders are closer to the downloader. You should never modify this setting in your project, modify
DOWNLOADER_MIDDLEWARES instead. For more info see Activating a downloader middleware.
DOWNLOADER_STATS
Default: True
Whether to enable downloader stats collection.
DOWNLOAD_DELAY
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same
website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are
supported. Example:
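For instance, to wait roughly a quarter of a second between requests to the same site (value in seconds):

DOWNLOAD_DELAY = 0.25    # 250 ms of delay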
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By
default, Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 *
DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
When CONCURRENT_REQUESTS_PER_IP is non-zero, delays are enforced per ip address instead of per domain.
You can also change this setting per spider by setting download_delay spider attribute.
DOWNLOAD_HANDLERS
Default: {}
A dict containing the request downloader handlers enabled in your project. See DOWNLOAD_HANDLERS_BASE for
example format.
DOWNLOAD_HANDLERS_BASE
Default:
{
'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
A dict containing the request download handlers enabled by default in Scrapy. You should never modify this setting in
your project, modify DOWNLOAD_HANDLERS instead.
You can disable any of these download handlers by assigning None to their URI scheme in DOWNLOAD_HANDLERS.
E.g., to disable the built-in FTP handler (without replacement), place this in your settings.py:
DOWNLOAD_HANDLERS = {
'ftp': None,
}
DOWNLOAD_TIMEOUT
Default: 180
The amount of time (in secs) that the downloader will wait before timing out.
Note: This timeout can be set per spider using download_timeout spider attribute and per-request using
download_timeout Request.meta key.
DOWNLOAD_MAXSIZE
Note: This size can be set per spider using download_maxsize spider attribute and per-request using
download_maxsize Request.meta key.
This feature needs Twisted >= 11.1.
DOWNLOAD_WARNSIZE
Note: This size can be set per spider using download_warnsize spider attribute and per-request using
download_warnsize Request.meta key.
This feature needs Twisted >= 11.1.
DOWNLOAD_FAIL_ON_DATALOSS
Default: True
Whether or not to fail on broken responses, that is, responses whose declared Content-Length does not match the
content sent by the server, or chunked responses that were not properly finished. If True, these responses raise a
ResponseFailed([_DataLoss]) error. If False, these responses are passed through and the flag dataloss
is added to the response, i.e.: 'dataloss' in response.flags is True.
Optionally, this can be set on a per-request basis by setting the download_fail_on_dataloss Request.meta key to
False.
Note: A broken response, or data loss error, may happen under several circumstances, from server misconfiguration
to network errors to data corruption. It is up to the user to decide if it makes sense to process broken responses
considering they may contain partial or incomplete content. If RETRY_ENABLED is True and this setting is set to
True, the ResponseFailed([_DataLoss]) failure will be retried as usual.
DUPEFILTER_CLASS
Default: 'scrapy.dupefilters.RFPDupeFilter'
The class used to detect and filter duplicate requests.
The default (RFPDupeFilter) filters based on the request fingerprint using the scrapy.utils.request.
request_fingerprint function. In order to change the way duplicates are checked you could subclass
RFPDupeFilter and override its request_fingerprint method. This method should accept a scrapy
Request object and return its fingerprint (a string).
You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.
BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a
better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
DUPEFILTER_DEBUG
Default: False
By default, RFPDupeFilter only logs the first duplicate request. Setting DUPEFILTER_DEBUG to True will
make it log all duplicate requests.
EDITOR
EXTENSIONS
Default: {}
A dict containing the extensions enabled in your project, and their orders.
EXTENSIONS_BASE
Default:
{
'scrapy.extensions.corestats.CoreStats': 0,
'scrapy.extensions.telnet.TelnetConsole': 0,
'scrapy.extensions.memusage.MemoryUsage': 0,
'scrapy.extensions.memdebug.MemoryDebugger': 0,
'scrapy.extensions.closespider.CloseSpider': 0,
'scrapy.extensions.feedexport.FeedExporter': 0,
'scrapy.extensions.logstats.LogStats': 0,
'scrapy.extensions.spiderstate.SpiderState': 0,
'scrapy.extensions.throttle.AutoThrottle': 0,
}
A dict containing the extensions available by default in Scrapy, and their orders. This setting contains all stable built-in
extensions. Keep in mind that some of them need to be enabled through a setting.
For more information see the extensions user guide and the list of available extensions.
FEED_TEMPDIR
The Feed Temp dir allows you to set a custom folder to save crawler temporary files before uploading with FTP feed
storage and Amazon S3.
FTP_PASSIVE_MODE
Default: True
Whether or not to use passive mode when initiating FTP transfers.
FTP_PASSWORD
Default: "guest"
The password to use for FTP connections when there is no "ftp_password" in Request meta.
Note: Paraphrasing RFC 1635, although it is common to use either the password “guest” or one’s e-mail address
for anonymous FTP, some FTP servers explicitly ask for the user’s e-mail address and will not allow login with the
“guest” password.
FTP_USER
Default: "anonymous"
The username to use for FTP connections when there is no "ftp_user" in Request meta.
ITEM_PIPELINES
Default: {}
A dict containing the item pipelines to use, and their orders. Order values are arbitrary, but it is customary to define
them in the 0-1000 range. Lower orders process before higher orders.
Example:
ITEM_PIPELINES = {
'mybot.pipelines.validate.ValidateMyItem': 300,
'mybot.pipelines.validate.StoreMyItem': 800,
}
ITEM_PIPELINES_BASE
Default: {}
A dict containing the pipelines enabled by default in Scrapy. You should never modify this setting in your project,
modify ITEM_PIPELINES instead.
LOG_ENABLED
Default: True
Whether to enable logging.
LOG_ENCODING
Default: 'utf-8'
The encoding to use for logging.
LOG_FILE
Default: None
File name to use for logging output. If None, standard error will be used.
LOG_FORMAT
LOG_DATEFORMAT
LOG_LEVEL
Default: 'DEBUG'
Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG. For more info see
Logging.
LOG_STDOUT
Default: False
If True, all standard output (and error) of your process will be redirected to the log. For example if you
print('hello') it will appear in the Scrapy log.
LOG_SHORT_NAMES
Default: False
If True, the logs will just contain the root path. If it is set to False then it displays the component responsible for
the log output.
LOGSTATS_INTERVAL
Default: 60.0
The interval (in seconds) between each logging printout of the stats by LogStats.
MEMDEBUG_ENABLED
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty,
otherwise the report will be written to the log.
Example:
MEMDEBUG_NOTIFY = ['[email protected]']
MEMUSAGE_ENABLED
Default: True
Scope: scrapy.extensions.memusage
Whether to enable the memory usage extension. This extension keeps track of the peak memory used by the pro-
cess (it writes it to stats). It can also optionally shut down the Scrapy process when it exceeds a memory limit (see
MEMUSAGE_LIMIT_MB), and notify by email when that happens (see MEMUSAGE_NOTIFY_MAIL).
See Memory usage extension.
MEMUSAGE_LIMIT_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED
is True). If zero, no check will be performed.
See Memory usage extension.
MEMUSAGE_CHECK_INTERVAL_SECONDS
MEMUSAGE_NOTIFY_MAIL
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['[email protected]']
MEMUSAGE_WARNING_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero,
no warning will be produced.
NEWSPIDER_MODULE
Default: ''
Module where to create new spiders using the genspider command.
Example:
NEWSPIDER_MODULE = 'mybot.spiders_dev'
RANDOMIZE_DOWNLOAD_DELAY
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 *
DOWNLOAD_DELAY) while fetching requests from the same website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which
analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same as the one used by the wget --random-wait option.
If DOWNLOAD_DELAY is zero (default) this option has no effect.
REACTOR_THREADPOOL_MAXSIZE
Default: 10
The maximum limit for the Twisted Reactor thread pool size. This is a common multi-purpose thread pool used by
various Scrapy components: the threaded DNS resolver, BlockingFeedStorage and S3FilesStore, to name a few.
Increase this value if you’re experiencing problems with insufficient blocking IO.
REDIRECT_MAX_TIMES
Default: 20
Defines the maximum number of times a request can be redirected. After this maximum, the request’s response is
returned as is. We used the Firefox default value for the same task.
REDIRECT_PRIORITY_ADJUST
Default: +2
Scope: scrapy.downloadermiddlewares.redirect.RedirectMiddleware
Adjust redirect request priority relative to original request:
• a positive priority adjust (default) means higher priority.
• a negative priority adjust means lower priority.
RETRY_PRIORITY_ADJUST
Default: -1
Scope: scrapy.downloadermiddlewares.retry.RetryMiddleware
Adjust retry request priority relative to original request:
• a positive priority adjust means higher priority.
• a negative priority adjust (default) means lower priority.
ROBOTSTXT_OBEY
Default: False
Scope: scrapy.downloadermiddlewares.robotstxt
If enabled, Scrapy will respect robots.txt policies. For more information see RobotsTxtMiddleware.
Note: While the default value is False for historical reasons, this option is enabled by default in settings.py file
generated by scrapy startproject command.
SCHEDULER
Default: 'scrapy.core.scheduler.Scheduler'
The scheduler to use for crawling.
SCHEDULER_DEBUG
Default: False
Setting to True will log debug information about the requests scheduler. This currently logs (only once) if the
requests cannot be serialized to disk. Stats counter (scheduler/unserializable) tracks the number of times
this happens.
Example entry in logs:
SCHEDULER_DISK_QUEUE
Default: 'scrapy.squeues.PickleLifoDiskQueue'
Type of disk queue that will be used by scheduler. Other available types are scrapy.squeues.
PickleFifoDiskQueue, scrapy.squeues.MarshalFifoDiskQueue, scrapy.squeues.
MarshalLifoDiskQueue.
SCHEDULER_MEMORY_QUEUE
Default: 'scrapy.squeues.LifoMemoryQueue'
Type of in-memory queue used by scheduler. Other available type is: scrapy.squeues.FifoMemoryQueue.
SCHEDULER_PRIORITY_QUEUE
Default: 'scrapy.pqueues.ScrapyPriorityQueue'
Type of priority queue used by the scheduler. Another available type is scrapy.pqueues.
DownloaderAwarePriorityQueue. scrapy.pqueues.DownloaderAwarePriorityQueue works
better than scrapy.pqueues.ScrapyPriorityQueue when you crawl many different domains in paral-
lel. But currently scrapy.pqueues.DownloaderAwarePriorityQueue does not work together with
CONCURRENT_REQUESTS_PER_IP.
SPIDER_CONTRACTS
Default: {}
A dict containing the spider contracts enabled in your project, used for testing spiders. For more info see Spiders
Contracts.
SPIDER_CONTRACTS_BASE
Default:
{
'scrapy.contracts.default.UrlContract' : 1,
'scrapy.contracts.default.ReturnsContract': 2,
'scrapy.contracts.default.ScrapesContract': 3,
}
A dict containing the scrapy contracts enabled by default in Scrapy. You should never modify this setting in your
project, modify SPIDER_CONTRACTS instead. For more info see Spiders Contracts.
You can disable any of these contracts by assigning None to their class path in SPIDER_CONTRACTS. E.g., to disable
the built-in ScrapesContract, place this in your settings.py:
SPIDER_CONTRACTS = {
'scrapy.contracts.default.ScrapesContract': None,
}
SPIDER_LOADER_CLASS
Default: 'scrapy.spiderloader.SpiderLoader'
The class that will be used for loading spiders, which must implement the SpiderLoader API.
SPIDER_LOADER_WARN_ONLY
Note: Some scrapy commands run with this setting set to True already (i.e. they will only issue a warning and will not
fail) since they do not actually need to load spider classes to work: scrapy runspider, scrapy settings,
scrapy startproject, scrapy version.
SPIDER_MIDDLEWARES
Default: {}
A dict containing the spider middlewares enabled in your project, and their orders. For more info see Activating a
spider middleware.
SPIDER_MIDDLEWARES_BASE
Default:
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
A dict containing the spider middlewares enabled by default in Scrapy, and their orders. Low orders are closer to the
engine, high orders are closer to the spider. For more info see Activating a spider middleware.
SPIDER_MODULES
Default: []
A list of modules where Scrapy will look for spiders.
Example:
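The example itself is missing from this copy; for instance, to look for spiders in two modules (the module names are illustrative, following the NEWSPIDER_MODULE example above):

SPIDER_MODULES = ['mybot.spiders_prod', 'mybot.spiders_dev']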
STATS_CLASS
Default: 'scrapy.statscollectors.MemoryStatsCollector'
The class to use for collecting stats, which must implement the Stats Collector API.
STATS_DUMP
Default: True
Dump the Scrapy stats (to the Scrapy log) once the spider finishes.
For more info see: Stats Collection.
STATSMAILER_RCPTS
TELNETCONSOLE_ENABLED
Default: True
A boolean which specifies if the telnet console will be enabled (provided its extension is also enabled).
TELNETCONSOLE_PORT
TEMPLATES_DIR
URLLENGTH_LIMIT
Default: 2083
Scope: spidermiddlewares.urllength
The maximum URL length to allow for crawled URLs. For more information about the default value for this setting
see: https://boutell.com/newfaq/misc/urllength.html
USER_AGENT
The following settings are documented elsewhere, please check each specific case to see how to enable and use them.
• AJAXCRAWL_ENABLED
• AUTOTHROTTLE_DEBUG
• AUTOTHROTTLE_ENABLED
• AUTOTHROTTLE_MAX_DELAY
• AUTOTHROTTLE_START_DELAY
• AUTOTHROTTLE_TARGET_CONCURRENCY
• AWS_ACCESS_KEY_ID
• AWS_ENDPOINT_URL
• AWS_REGION_NAME
• AWS_SECRET_ACCESS_KEY
• AWS_USE_SSL
• AWS_VERIFY
• BOT_NAME
• CLOSESPIDER_ERRORCOUNT
• CLOSESPIDER_ITEMCOUNT
• CLOSESPIDER_PAGECOUNT
• CLOSESPIDER_TIMEOUT
• COMMANDS_MODULE
• COMPRESSION_ENABLED
• CONCURRENT_ITEMS
• CONCURRENT_REQUESTS
• CONCURRENT_REQUESTS_PER_DOMAIN
• CONCURRENT_REQUESTS_PER_IP
• COOKIES_DEBUG
• COOKIES_ENABLED
• DEFAULT_ITEM_CLASS
• DEFAULT_REQUEST_HEADERS
• DEPTH_LIMIT
• DEPTH_PRIORITY
• DEPTH_STATS_VERBOSE
• DNSCACHE_ENABLED
• DNSCACHE_SIZE
• DNS_TIMEOUT
• DOWNLOADER
• DOWNLOADER_CLIENTCONTEXTFACTORY
• DOWNLOADER_CLIENT_TLS_METHOD
• DOWNLOADER_HTTPCLIENTFACTORY
• DOWNLOADER_MIDDLEWARES
• DOWNLOADER_MIDDLEWARES_BASE
• DOWNLOADER_STATS
• DOWNLOAD_DELAY
• DOWNLOAD_FAIL_ON_DATALOSS
• DOWNLOAD_HANDLERS
• DOWNLOAD_HANDLERS_BASE
• DOWNLOAD_MAXSIZE
• DOWNLOAD_TIMEOUT
• DOWNLOAD_WARNSIZE
• DUPEFILTER_CLASS
• DUPEFILTER_DEBUG
• EDITOR
• EXTENSIONS
• EXTENSIONS_BASE
• FEED_EXPORTERS
• FEED_EXPORTERS_BASE
• FEED_EXPORT_ENCODING
• FEED_EXPORT_FIELDS
• FEED_EXPORT_INDENT
• FEED_FORMAT
• FEED_STORAGES
• FEED_STORAGES_BASE
• FEED_STORAGE_FTP_ACTIVE
• FEED_STORAGE_S3_ACL
• FEED_STORE_EMPTY
• FEED_TEMPDIR
• FEED_URI
• FILES_EXPIRES
• FILES_RESULT_FIELD
• FILES_STORE
• FILES_STORE_GCS_ACL
• FILES_STORE_S3_ACL
• FILES_URLS_FIELD
• FTP_PASSIVE_MODE
• FTP_PASSWORD
• FTP_USER
• GCS_PROJECT_ID
• HTTPCACHE_ALWAYS_STORE
• HTTPCACHE_DBM_MODULE
• HTTPCACHE_DIR
• HTTPCACHE_ENABLED
• HTTPCACHE_EXPIRATION_SECS
• HTTPCACHE_GZIP
• HTTPCACHE_IGNORE_HTTP_CODES
• HTTPCACHE_IGNORE_MISSING
• HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS
• HTTPCACHE_IGNORE_SCHEMES
• HTTPCACHE_POLICY
• HTTPCACHE_STORAGE
• HTTPERROR_ALLOWED_CODES
• HTTPERROR_ALLOW_ALL
• HTTPPROXY_AUTH_ENCODING
• HTTPPROXY_ENABLED
• IMAGES_EXPIRES
• IMAGES_MIN_HEIGHT
• IMAGES_MIN_WIDTH
• IMAGES_RESULT_FIELD
• IMAGES_STORE
• IMAGES_STORE_GCS_ACL
• IMAGES_STORE_S3_ACL
• IMAGES_THUMBS
• IMAGES_URLS_FIELD
• ITEM_PIPELINES
• ITEM_PIPELINES_BASE
• LOGSTATS_INTERVAL
• LOG_DATEFORMAT
• LOG_ENABLED
• LOG_ENCODING
• LOG_FILE
• LOG_FORMAT
• LOG_LEVEL
• LOG_SHORT_NAMES
• LOG_STDOUT
• MAIL_FROM
• MAIL_HOST
• MAIL_PASS
• MAIL_PORT
• MAIL_SSL
• MAIL_TLS
• MAIL_USER
• MEDIA_ALLOW_REDIRECTS
• MEMDEBUG_ENABLED
• MEMDEBUG_NOTIFY
• MEMUSAGE_CHECK_INTERVAL_SECONDS
• MEMUSAGE_ENABLED
• MEMUSAGE_LIMIT_MB
• MEMUSAGE_NOTIFY_MAIL
• MEMUSAGE_WARNING_MB
• METAREFRESH_ENABLED
• METAREFRESH_IGNORE_TAGS
• METAREFRESH_MAXDELAY
• NEWSPIDER_MODULE
• RANDOMIZE_DOWNLOAD_DELAY
• REACTOR_THREADPOOL_MAXSIZE
• REDIRECT_ENABLED
• REDIRECT_MAX_TIMES
• REDIRECT_PRIORITY_ADJUST
• REFERER_ENABLED
• REFERRER_POLICY
• RETRY_ENABLED
• RETRY_HTTP_CODES
• RETRY_PRIORITY_ADJUST
• RETRY_TIMES
• ROBOTSTXT_OBEY
• SCHEDULER
• SCHEDULER_DEBUG
• SCHEDULER_DISK_QUEUE
• SCHEDULER_MEMORY_QUEUE
• SCHEDULER_PRIORITY_QUEUE
• SPIDER_CONTRACTS
• SPIDER_CONTRACTS_BASE
• SPIDER_LOADER_CLASS
• SPIDER_LOADER_WARN_ONLY
• SPIDER_MIDDLEWARES
• SPIDER_MIDDLEWARES_BASE
• SPIDER_MODULES
• STATSMAILER_RCPTS
• STATS_CLASS
• STATS_DUMP
• TELNETCONSOLE_ENABLED
• TELNETCONSOLE_HOST
• TELNETCONSOLE_PASSWORD
• TELNETCONSOLE_PORT
• TELNETCONSOLE_USERNAME
• TEMPLATES_DIR
• URLLENGTH_LIMIT
• USER_AGENT
3.12 Exceptions
DropItem
exception scrapy.exceptions.DropItem
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item
Pipeline.
CloseSpider
exception scrapy.exceptions.CloseSpider(reason='cancelled')
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported
arguments:
Parameters reason (str) – the reason for closing
For example:
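The example itself is missing from this copy; a minimal sketch of raising it from a spider callback (the condition checked here is illustrative):

def parse_page(self, response):
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')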
DontCloseSpider
exception scrapy.exceptions.DontCloseSpider
This exception can be raised in a spider_idle signal handler to prevent the spider from being closed.
IgnoreRequest
exception scrapy.exceptions.IgnoreRequest
This exception can be raised by the Scheduler or any downloader middleware to indicate that the request should be
ignored.
NotConfigured
exception scrapy.exceptions.NotConfigured
This exception can be raised by some components to indicate that they will remain disabled. Those components
include:
• Extensions
• Item pipelines
• Downloader middlewares
• Spider middlewares
The exception must be raised in the component’s __init__ method.
NotSupported
exception scrapy.exceptions.NotSupported
This exception is raised to indicate an unsupported feature.
Command line tool Learn about the command-line tool used to manage your Scrapy project.
Spiders Write the rules to crawl your websites.
Selectors Extract the data from web pages using XPath.
Scrapy shell Test your extraction code in an interactive environment.
Items Define the data you want to scrape.
Item Loaders Populate your items with the extracted data.
Item Pipeline Post-process and store your scraped data.
Feed exports Output your scraped data using different formats and storages.
Requests and Responses Understand the classes used to represent HTTP requests and responses.
Link Extractors Convenient classes to extract links to follow from pages.
Settings Learn how to configure Scrapy and see all available settings.
Exceptions See all available exceptions and their meaning.
Built-in services
4.1 Logging
Note: scrapy.log has been deprecated alongside its functions in favor of explicit calls to the Python standard
logging. Keep reading to learn more about the new logging system.
Scrapy uses Python’s builtin logging system for event logging. We’ll provide some simple examples to get you started,
but for more advanced use-cases it’s strongly suggested to read its documentation thoroughly.
Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.
Scrapy calls scrapy.utils.log.configure_logging() to set some reasonable defaults and handle those
settings in Logging settings when running commands, so it’s recommended to manually call it if you’re running Scrapy
from scripts as described in Run Scrapy from a script.
Python’s builtin logging defines 5 different levels to indicate the severity of a given log message. Here are the standard
ones, listed in decreasing order:
1. logging.CRITICAL - for critical errors (highest severity)
2. logging.ERROR - for regular errors
3. logging.WARNING - for warning messages
4. logging.INFO - for informational messages
5. logging.DEBUG - for debugging messages (lowest severity)
Here’s a quick example of how to log a message using the logging.WARNING level:
import logging
logging.warning("This is a warning")
There are shortcuts for issuing log messages on any of the standard 5 levels, and there’s also a general logging.log
method which takes a given level as argument. If needed, the last example could be rewritten as:
import logging
logging.log(logging.WARNING, "This is a warning")
On top of that, you can create different “loggers” to encapsulate messages. (For example, a common practice is to
create different loggers for every module). These loggers can be configured independently, and they allow hierarchical
constructions.
The previous examples use the root logger behind the scenes, which is a top level logger where all messages are
propagated to (unless otherwise specified). Using logging helpers is merely a shortcut for getting the root logger
explicitly, so this is also an equivalent of the last snippets:
import logging
logger = logging.getLogger()
logger.warning("This is a warning")
You can use a different logger just by getting its name with the logging.getLogger function:
import logging
logger = logging.getLogger('mycustomlogger')
logger.warning("This is a warning")
Finally, you can ensure having a custom logger for any module you’re working on by using the __name__ variable,
which is populated with current module’s path:
import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")
See also:
Module logging, HowTo Basic Logging Tutorial
Module logging, Loggers Further documentation on loggers
Scrapy provides a logger within each Spider instance, which can be accessed and used like this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
That logger is created using the Spider’s name, but you can use any custom Python logger you want. For example:
import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)
Loggers on their own don’t manage how messages sent through them are displayed. For this task, different “handlers”
can be attached to any logger instance and they will redirect those messages to appropriate destinations, such as the
standard output, files, emails, etc.
By default, Scrapy sets and configures a handler for the root logger, based on the settings below.
Logging settings
Command-line options
There are command-line arguments, available for all commands, that you can use to override some of the Scrapy
settings regarding logging.
• --logfile FILE Overrides LOG_FILE
• --loglevel/-L LEVEL Overrides LOG_LEVEL
• --nolog Sets LOG_ENABLED to False
See also:
Module logging.handlers Further documentation on available handlers
Advanced customization
Because Scrapy uses stdlib logging module, you can customize logging using all features of stdlib logging.
For example, let’s say you’re scraping a website which returns many HTTP 404 and 500 responses, and you want to
hide all messages like this:
import logging
import scrapy

class MySpider(scrapy.Spider):
    # ...
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)
If you run this spider again then INFO messages from scrapy.spidermiddlewares.httperror logger will
be gone.
scrapy.utils.log.configure_logging(settings=None, install_root_handler=True)
Initialize logging defaults for Scrapy.
Parameters
• settings (dict, Settings object or None) – settings used to create and configure a
handler for the root logger (default: None).
• install_root_handler (bool) – whether to install root logging handler (default:
True)
For example, when running Scrapy from a script you can skip installing the root handler and configure logging yourself with logging.basicConfig():
import logging
from scrapy.utils.log import configure_logging
configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)
Refer to Run Scrapy from a script for more details about using Scrapy this way.
Scrapy provides a convenient facility for collecting stats in the form of key/values, where values are often counters.
The facility is called the Stats Collector, and can be accessed through the stats attribute of the Crawler API, as
illustrated by the examples in the Common Stats Collector uses section below.
However, the Stats Collector is always available, so you can always import it in your module and use its API (to
increment or set new stat keys), regardless of whether the stats collection is enabled or not. If it’s disabled, the API
will still work but it won’t collect anything. This is aimed at simplifying the stats collector usage: you should spend
no more than one line of code for collecting stats in your spider, Scrapy extension, or whatever code you’re using the
Stats Collector from.
Another feature of the Stats Collector is that it's very efficient (when enabled) and extremely efficient (almost unnoticeable) when disabled.
The Stats Collector keeps a stats table per open spider which is automatically opened when the spider is opened, and
closed when the spider is closed.
Access the stats collector through the stats attribute. Here is an example of an extension that accesses stats:
class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)
Set stat value:
stats.set_value('hostname', socket.gethostname())
Increment stat value:
stats.inc_value('custom_count')
Set stat value only if greater than previous:
stats.max_value('max_items_scraped', value)
Set stat value only if lower than previous:
stats.min_value('min_free_memory_percent', value)
Get stat value:
>>> stats.get_value('custom_count')
1
Get all stats:
>>> stats.get_stats()
{'custom_count': 1, 'start_time': datetime.datetime(2009, 7, 14, 21, 47, 28, 977139)}
Besides the basic StatsCollector there are other Stats Collectors available in Scrapy which extend the basic Stats
Collector. You can select which Stats Collector to use through the STATS_CLASS setting. The default Stats Collector
used is the MemoryStatsCollector.
MemoryStatsCollector
class scrapy.statscollectors.MemoryStatsCollector
A simple stats collector that keeps the stats of the last scraping run (for each spider) in memory, after they’re
closed. The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain
name.
This is the default Stats Collector used in Scrapy.
spider_stats
A dict of dicts (keyed by spider name) containing the stats of the last scraping run for each spider.
DummyStatsCollector
class scrapy.statscollectors.DummyStatsCollector
A Stats Collector which does nothing but is very efficient (because it does nothing). This stats collector can
be set via the STATS_CLASS setting, to disable stats collection in order to improve performance. However, the
performance penalty of stats collection is usually marginal compared to other Scrapy workloads like parsing
pages.
Although Python makes sending e-mails relatively easy via the smtplib library, Scrapy provides its own facility for
sending e-mails which is very easy to use and it’s implemented using Twisted non-blocking IO, to avoid interfering
with the non-blocking IO of the crawler. It also provides a simple API for sending attachments and it’s very easy to
configure, with a few settings.
There are two ways to instantiate the mail sender. You can instantiate it using the standard constructor:
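from scrapy.mail import MailSender

mailer = MailSender()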
Or you can instantiate it passing a Scrapy settings object, which will respect the settings:
mailer = MailSender.from_settings(settings)
MailSender is the preferred class to use for sending emails from Scrapy, as it uses Twisted non-blocking IO, like the
rest of the framework.
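Here is a minimal sketch of sending an e-mail without attachments (the addresses are placeholders):
mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body",
    cc=["another@example.com"])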
class scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None)
Parameters
• smtphost (str or bytes) – the SMTP host to use for sending the emails. If omitted,
the MAIL_HOST setting will be used.
• mailfrom (str) – the address used to send emails (in the From: header). If omitted, the
MAIL_FROM setting will be used.
• smtpuser – the SMTP user. If omitted, the MAIL_USER setting will be used. If not given,
no SMTP authentication will be performed.
• smtppass (str or bytes) – the SMTP pass for authentication.
• smtpport (int) – the SMTP port to connect to
• smtptls (boolean) – enforce using SMTP STARTTLS
These settings define the default constructor values of the MailSender class, and can be used to configure e-mail
notifications in your project without writing any code (for those extensions and code that uses MailSender).
MAIL_FROM
Default: 'scrapy@localhost'
Sender email to use (From: header) for sending emails.
MAIL_HOST
Default: 'localhost'
SMTP host to use for sending emails.
MAIL_PORT
Default: 25
SMTP port to use for sending emails.
MAIL_USER
Default: None
User to use for SMTP authentication. If disabled no SMTP authentication will be performed.
MAIL_PASS
Default: None
Password to use for SMTP authentication, along with MAIL_USER.
MAIL_TLS
Default: False
Enforce using STARTTLS. STARTTLS is a way to take an existing insecure connection, and upgrade it to a secure
connection using SSL/TLS.
MAIL_SSL
Default: False
Enforce connecting using an SSL encrypted connection
Scrapy comes with a built-in telnet console for inspecting and controlling a Scrapy running process. The telnet console
is just a regular python shell running inside the Scrapy process, so you can do literally anything from it.
The telnet console is a built-in Scrapy extension which comes enabled by default, but you can also disable it if you
want. For more information about the extension itself see Telnet console extension.
Warning: It is not secure to use telnet console via public networks, as telnet doesn’t provide any transport-layer
security. Having username/password authentication doesn’t change that.
Intended usage is connecting to a running Scrapy spider locally (spider process and telnet client are on the same
machine) or over a secure connection (VPN, SSH tunnel). Please avoid using telnet console over insecure connec-
tions, or disable it completely using TELNETCONSOLE_ENABLED option.
The telnet console listens in the TCP port defined in the TELNETCONSOLE_PORT setting, which defaults to 6023.
To access the console you need to type:
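telnet localhost 6023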
By default Username is scrapy and Password is autogenerated. The autogenerated Password can be seen on scrapy
logs like the example below:
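2018-10-16 14:35:21 [scrapy.extensions.telnet] INFO: Telnet Password: 16f92501e8a59326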
Default Username and Password can be overridden by the settings TELNETCONSOLE_USERNAME and
TELNETCONSOLE_PASSWORD.
Warning: Username and password provide only a limited protection, as telnet is not using secure transport - by
default traffic is not encrypted even if username and password are set.
You need the telnet program, which comes installed by default on Windows and most Linux distros.
The telnet console is like a regular Python shell running inside the Scrapy process, so you can do anything from it
including importing new modules, etc.
However, the telnet console comes with some default variables defined for convenience:
Shortcut Description
crawler the Scrapy Crawler (scrapy.crawler.Crawler object)
engine Crawler.engine attribute
spider the active spider
slot the engine slot
extensions the Extension Manager (Crawler.extensions attribute)
stats the Stats Collector (Crawler.stats attribute)
settings the Scrapy settings object (Crawler.settings attribute)
est print a report of the engine status
prefs for memory debugging (see Debugging memory leaks)
p a shortcut to the pprint.pprint function
hpy for memory debugging (see Debugging memory leaks)
Here are some example tasks you can do with the telnet console:
You can use the est() method of the Scrapy engine to quickly show its state using the telnet console:
time()-engine.start_time : 8.62972998619
engine.has_capacity() : False
len(engine.downloader.active) : 16
engine.scraper.is_idle() : False
engine.spider.name : followall
engine.spider_is_idle(engine.spider) : False
engine.slot.closing : False
len(engine.slot.inprogress) : 16
len(engine.slot.scheduler.dqs or []) : 0
len(engine.slot.scheduler.mqs) : 92
To pause:
>>> engine.pause()
To resume:
>>> engine.unpause()
To stop:
>>> engine.stop()
scrapy.extensions.telnet.update_telnet_vars(telnet_vars)
Sent just before the telnet console is opened. You can hook up to this signal to add, remove or update the
variables that will be available in the telnet local namespace. In order to do that, you need to update the
telnet_vars dict in your handler.
Parameters telnet_vars (dict) – the dict of telnet variables
These are the settings that control the telnet console’s behaviour:
TELNETCONSOLE_PORT
Default: [6023, 6073]
The port range to use for the telnet console. If set to None or 0, a dynamically assigned port is used.
TELNETCONSOLE_HOST
Default: '127.0.0.1'
The interface the telnet console should listen on
TELNETCONSOLE_USERNAME
Default: 'scrapy'
The username used for the telnet console
TELNETCONSOLE_PASSWORD
Default: None
The password used for the telnet console; the default behaviour is to have it autogenerated.
BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing
web spiders that crawl web sites and extract data from them.
Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or
lxml) instead, if you feel more comfortable working with them. After all, they’re just parsing libraries which can be
imported and used from any Python code.
In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.
Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You
just have to feed the response’s body into a BeautifulSoup object and extract whatever data you need from it.
Here’s an example spider using BeautifulSoup API, with lxml as the HTML parser:
from bs4 import BeautifulSoup
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml as BeautifulSoup's underlying parser for speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {"url": response.url, "title": soup.h1.string}
Note: BeautifulSoup supports several HTML/XML parsers. See BeautifulSoup’s official documentation on
which ones are available.
Scrapy is supported under Python 2.7 and Python 3.4+ under CPython (default Python implementation) and PyPy
(starting with PyPy 5.9). Python 2.6 support was dropped starting at Scrapy 0.20. Python 3 support was added in
Scrapy 1.1. PyPy support was added in Scrapy 1.4, PyPy3 support was added in Scrapy 1.5.
Note: For Python 3 support on Windows, it is recommended to use Anaconda/Miniconda as outlined in the installa-
tion guide.
Probably, but we don’t like that word. We think Django is a great open source project and an example to follow, so
we’ve used it as an inspiration for Scrapy.
We believe that, if something is already done well, there’s no need to reinvent it. This concept, besides being one of
the foundations for open source and free software, not only applies to software but also to documentation, procedures,
policies, etc. So, instead of going through each problem ourselves, we choose to copy ideas from those projects that
have already solved them properly, and focus on the real problems we need to solve.
We’d be proud if Scrapy serves as an inspiration for other projects. Feel free to steal from us!
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See
HttpProxyMiddleware.
By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order.
This order is more convenient in most cases.
If you do want to crawl in true BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
5.1.13 Why does Scrapy download pages in English instead of my native language?
Try changing the default Accept-Language request header by overriding the DEFAULT_REQUEST_HEADERS setting.
See Examples.
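For example, a minimal sketch for requesting Spanish content (the language code is a placeholder for whatever you need):
DEFAULT_REQUEST_HEADERS = {
    'Accept-Language': 'es',
}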
Yes. You can use the runspider command. For example, if you have a spider written in a my_spider.py file
you can run it with:
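scrapy runspider my_spider.py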
5.1.16 I get “Filtered offsite request” messages. How can I fix them?
Those messages (logged with DEBUG level) don’t necessarily mean there is a problem, so you may not need to fix
them.
Those messages are thrown by the Offsite Spider Middleware, which is a spider middleware (enabled by default)
whose purpose is to filter out requests to domains outside the ones covered by the spider.
For more info see: OffsiteMiddleware.
It’ll depend on how large your output is. See this warning in JsonItemExporter documentation.
Some signals support returning deferreds from their handlers, others don’t. See the Built-in signals reference to know
which ones.
999 is a custom response status code used by Yahoo sites to throttle requests. Try slowing down the crawling speed
by using a download delay of 2 (or higher) in your spider:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'myspider'
    download_delay = 2
    # [ ... rest of the spider code ... ]
Or by setting a global download delay in your project with the DOWNLOAD_DELAY setting.
Yes, but you can also use the Scrapy shell which allows you to quickly analyze (and even modify) the response being
processed by your spider, which is, quite often, more useful than plain old pdb.set_trace().
For more info see Invoking the shell from spiders to inspect responses.
5.1.22 Simplest way to dump all my scraped items into a JSON/CSV/XML file?
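To dump into a JSON file: scrapy crawl myspider -o items.json
To dump into a CSV file: scrapy crawl myspider -o items.csv
To dump into an XML file: scrapy crawl myspider -o items.xml
For more information see Feed exports.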
5.1.23 What’s this huge cryptic __VIEWSTATE parameter used in some forms?
The __VIEWSTATE parameter is used in sites built with ASP.NET/VB.NET. For more info on how it works see this
page. Also, here’s an example spider which scrapes one of these sites.
5.1.24 What’s the best way to parse big XML/CSV data feeds?
Parsing big feeds with XPath selectors can be problematic since they need to build the DOM of the entire feed in
memory, and this can be quite slow and consume a lot of memory.
To avoid parsing the entire feed at once in memory, you can use the functions xmliter and csviter
from the scrapy.utils.iterators module. In fact, this is what the feed spiders (see Spiders) use under the hood.
Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any
regular web browser does.
For more info see Requests and Responses and CookiesMiddleware.
5.1.26 How can I see the cookies being sent and received from Scrapy?
Enable the COOKIES_DEBUG setting.
Raise the CloseSpider exception from a callback. For more info see: CloseSpider.
Both spider arguments and settings can be used to configure your spider. There is no strict rule that mandates using
one or the other, but settings are more suited for parameters that, once set, don't change much, while spider arguments
are meant to change more often, even on each spider run, and are sometimes required for the spider to run at all (for
example, to set the start url of a spider).
To illustrate with an example, suppose you have a spider that needs to log into a site to scrape data, and you only
want to scrape data from a certain section of the site (which varies each time). In that case, the credentials to log in
would be settings, while the url of the section to scrape would be a spider argument.
5.1.30 I’m scraping a XML document and my XPath selector doesn’t return any
items
Item pipelines cannot yield multiple items per input item. Create a spider middleware instead, and use its
process_spider_output() method for this purpose. For example (a sketch; the multiply_by field on the item is an assumption):
from scrapy import Item

class MultiplyItemsMiddleware:

    def process_spider_output(self, response, result, spider):
        for item in result:
            if isinstance(item, (Item, dict)):
                # yield each scraped item 'multiply_by' times
                for _ in range(item['multiply_by']):
                    yield item
            else:
                yield item
This document explains the most common techniques for debugging spiders. Consider the following scrapy spider
below:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )
    def parse(self, response):
        # <extraction code not shown: build item_details_url and a partially
        # populated item, then pass the item along via cb_kwargs>
        yield scrapy.Request(item_details_url, self.parse_details, cb_kwargs={'item': item})
    def parse_details(self, response, item):
        # populate more item fields
        return item
Basically this is a simple spider which parses two pages of items (the start_urls). Items also have a details page with
additional information, so we use the cb_kwargs functionality of Request to pass a partially populated item.
The most basic way of checking the output of your spider is to use the parse command. It allows you to check the
behaviour of different parts of the spider at the method level. It has the advantage of being flexible and simple to use,
but does not allow debugging code inside a method.
In order to see the item scraped from a specific url:
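A sketch of the command (spider and callback names follow the example above; <url> is a placeholder):
$ scrapy parse --spider=myspider -c parse -d 2 <url>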
# Requests -----------------------------------------------------------------
[]
Using the --verbose or -v option we can see the status at each depth level:
# Requests -----------------------------------------------------------------
[<GET item_details_url>]
# Requests -----------------------------------------------------------------
[]
Checking items scraped from a single start_url can also be easily achieved using:
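$ scrapy parse --spider=myspider -d 2 'http://example.com/page1'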
While the parse command is very useful for checking behaviour of a spider, it is of little help to check what hap-
pens inside a callback, besides showing the response received and the output. How to debug the situation when
parse_details sometimes receives no item?
Fortunately, the shell is your bread and butter in this case (see Invoking the shell from spiders to inspect responses):
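A minimal sketch of invoking the shell from the callback (reusing the parse_details signature from the spider above):
from scrapy.shell import inspect_response

def parse_details(self, response, item=None):
    if item:
        # populate more item fields
        return item
    else:
        inspect_response(response, self)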
Sometimes you just want to see how a certain response looks in a browser; you can use the open_in_browser
function for that. Here is an example of how you would use it:
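A minimal sketch (the "item name" check is a placeholder for whatever condition you care about):
from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if "item name" not in response.text:
        open_in_browser(response)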
open_in_browser will open a browser with the response received by Scrapy at that point, adjusting the base tag
so that images and styles are displayed properly.
5.2.4 Logging
Logging is another useful option for getting information about your spider run. Although not as convenient, it comes
with the advantage that the logs will be available in all future runs should they be necessary again:
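A minimal sketch, again reusing the parse_details callback from the example spider:
def parse_details(self, response, item=None):
    if item:
        # populate more item fields
        return item
    else:
        self.logger.warning('No item received for %s', response.url)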
Note: This is a new feature (introduced in Scrapy 0.15) and may be subject to minor functionality/API updates.
Check the release notes to be notified of updates.
Testing spiders can get particularly annoying, and while nothing prevents you from writing unit tests, the task gets
cumbersome quickly. Scrapy offers an integrated way of testing your spiders by the means of contracts.
This allows you to test each callback of your spider by hardcoding a sample url and checking various constraints for
how the callback processes the response. Each contract is prefixed with an @ and included in the docstring. See the
following example:
def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """
This callback is tested using three built-in contracts:
class scrapy.contracts.default.UrlContract
This contract (@url) sets the sample URL used when checking other contract conditions for this spider. This
contract is mandatory. All callbacks lacking this contract are ignored when running the checks:
@url url
class scrapy.contracts.default.ReturnsContract
This contract (@returns) sets lower and upper bounds for the items and requests returned by the spider. The
upper bound is optional:
@returns item(s)|request(s) [min [max]]
class scrapy.contracts.default.ScrapesContract
This contract (@scrapes) checks that all the items returned by the callback have the specified fields:
@scrapes field_1 field_2 ...
If you find you need more power than the built-in scrapy contracts you can create and load your own contracts in the
project by using the SPIDER_CONTRACTS setting:
SPIDER_CONTRACTS = {
    'myproject.contracts.ResponseCheck': 10,
    'myproject.contracts.ItemValidate': 10,
}
Each contract must inherit from Contract and can override three methods:
class scrapy.contracts.Contract(method, *args)
Parameters
• method (function) – callback function to which the contract is associated
• args (list) – list of arguments passed into the docstring (whitespace separated)
adjust_request_args(args)
This receives a dict as an argument containing default arguments for the request object. Request is used
by default, but this can be changed with the request_cls attribute. If multiple contracts in a chain have
this attribute defined, the last one is used.
Must return the same dict or a modified version of it.
pre_process(response)
This allows hooking in various checks on the response received from the sample request, before it is passed
to the callback.
post_process(output)
This allows processing the output of the callback. Iterators are converted to lists before being passed to
this hook.
Raise ContractFail from pre_process or post_process if expectations are not met:
class scrapy.exceptions.ContractFail
Error raised in case of a failing contract
Here is a demo contract which checks the presence of a custom header in the response received:
from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail

class HasHeaderContract(Contract):
    """ Demo contract which checks the presence of a custom header
    @has_header X-CustomHeader
    """
    name = 'has_header'

    def pre_process(self, response):
        for header in self.args:
            if header not in response.headers:
                raise ContractFail('X-CustomHeader not present')
When scrapy check is running, the SCRAPY_CHECK environment variable is set to the true string. You can use
os.environ to perform any change to your spiders or your settings when scrapy check is used:
import os
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self):
        if os.environ.get('SCRAPY_CHECK'):
            pass  # do some scraper adjustments when a check is running
This section documents common practices when using Scrapy. These are things that cover many topics and don’t often
fall into any other specific section.
You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl.
Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the
Twisted reactor.
The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class will start
a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all
Scrapy commands.
Here’s an example showing how to run a single spider with it.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Define settings within a dictionary in CrawlerProcess. Make sure to check the CrawlerProcess documentation to get
acquainted with its usage details.
If you are inside a Scrapy project there are some additional helpers you can use to import those components
within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use
get_project_settings to get a Settings instance with your project settings.
What follows is a working example of how to do that, using the testspiders project as example.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.
CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers,
but it won’t start or interfere with existing reactors in any way.
Using this class the reactor should be explicitly run after scheduling your spiders. It’s recommended you use
CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run
Scrapy in the same reactor.
Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved
by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.
Here’s an example of its usage, along with a callback to manually stop the reactor after MySpider has finished
running.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
See also:
Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running
multiple spiders per process using the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
Same example but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
See also:
Run Scrapy from a script.
Scrapy doesn’t provide any built-in facility for running crawls in a distribute (multi-server) manner. However, there
are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to setup many Scrapyd instances and distribute
spider runs among those.
If you instead want to run a single (big) spider through many machines, what you usually do is partition the urls to
crawl and send them to each separate spider. Here is a concrete example:
First, you prepare the list of urls to crawl and put them into separate files/urls:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with
the number of the partition to crawl:
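For example, assuming Scrapyd is running on three servers (hostnames and project/spider names are placeholders):
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3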
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication.
Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
• rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
• disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
• use download delays (2 or higher). See DOWNLOAD_DELAY setting.
• if possible, use Google cache to fetch pages, instead of hitting the sites directly
• use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source
alternative is scrapoxy, a super proxy that you can attach your own proxies to.
• use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean
pages. One example of such downloaders is Crawlera
If you are still unable to prevent your bot getting banned, consider contacting commercial support.
Scrapy defaults are optimized for crawling specific sites. These sites are often handled by a single Scrapy spider,
although this is not necessary or required (for example, there are generic spiders that handle any given site thrown at
them).
In addition to this “focused crawl”, there is another common type of crawling which covers a large (potentially un-
limited) number of domains, and is only limited by time or other arbitrary constraint, rather than stopping when the
domain was crawled to completion or when there are no more requests to perform. These are called "broad crawls"
and are the typical crawls employed by search engines.
These are some common properties often found in broad crawls:
• they crawl many domains (often, unbounded) instead of a specific set of sites
• they don’t necessarily crawl domains to completion, because it would be impractical (or impossible) to do so,
and instead limit the crawl by time or number of pages crawled
• they are simpler in logic (as opposed to very complex spiders with many extraction rules) because data is often
post-processed in a separate stage
• they crawl many domains concurrently, which allows them to achieve faster crawl speeds by not being limited
by any particular site constraint (each site is crawled slowly to respect politeness, but many sites are crawled in
parallel)
As said above, Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asyn-
chronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things
you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy
settings to tune in order to achieve an efficient broad crawl.
Scrapy's default scheduler priority queue works best for single-domain crawls; it does not work as well when crawling
many different domains in parallel. To apply the recommended priority queue for broad crawls, use:
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Concurrency is the number of requests that are processed in parallel. There is a global
limit (CONCURRENT_REQUESTS) and an additional limit that can be set either per domain
(CONCURRENT_REQUESTS_PER_DOMAIN ) or per IP (CONCURRENT_REQUESTS_PER_IP).
Note: The scheduler priority queue recommended for broad crawls does not support
CONCURRENT_REQUESTS_PER_IP.
The default global concurrency limit in Scrapy is not suitable for crawling many different domains in parallel, so you
will want to increase it. How much to increase it will depend on how much CPU and memory your crawler will have
available.
A good starting point is 100:
CONCURRENT_REQUESTS = 100
But the best way to find out is by doing some trials and identifying at what concurrency your Scrapy process gets CPU
bounded. For optimum performance, you should pick a concurrency where CPU usage is at 80-90%.
Increasing concurrency also increases memory usage. If memory usage is a concern, you might need to lower your
global concurrency limit accordingly.
Currently Scrapy does DNS resolution in a blocking way with the usage of a thread pool. With higher concurrency levels
the crawling could be slow or even fail, hitting DNS resolver timeouts. A possible solution is to increase the number of
threads handling DNS queries. The DNS queue will be processed faster, speeding up the establishing of connections and
crawling overall.
To increase maximum thread pool size use:
REACTOR_THREADPOOL_MAXSIZE = 20
If you have multiple crawling processes and a single central DNS, it can act like a DoS attack on the DNS server, resulting
in a slowdown of the entire network or even blocking your machines. To avoid this, set up your own DNS server with a local
cache and an upstream to some large DNS like OpenDNS or Verizon.
When doing broad crawls you are often only interested in the crawl rates you get and any errors found. These stats are
reported by Scrapy when using the INFO log level. In order to save CPU (and log storage requirements) you should
not use the DEBUG log level when performing large broad crawls in production. Using the DEBUG level when developing
your (broad) crawler may be fine though.
To set the log level use:
LOG_LEVEL = 'INFO'
Disable cookies unless you really need them. Cookies are often not needed when doing broad crawls (search engine crawlers
ignore them), and disabling them improves performance by saving some CPU cycles and reducing the memory footprint of your
Scrapy crawler.
To disable cookies use:
COOKIES_ENABLED = False
Retrying failed HTTP requests can slow down the crawls substantially, especially when sites are very slow (or fail)
to respond, causing timeout errors which get retried many times, unnecessarily, preventing crawler capacity
from being reused for other domains.
To disable retries use:
RETRY_ENABLED = False
Unless you are crawling from a very slow connection (which shouldn’t be the case for broad crawls) reduce the
download timeout so that stuck requests are discarded quickly and free up capacity to process the next ones.
To reduce the download timeout use:
DOWNLOAD_TIMEOUT = 15
Consider disabling redirects, unless you are interested in following them. When doing broad crawls it’s common to
save redirects and resolve them when revisiting the site at a later crawl. This also helps to keep the number of requests
constant per crawl batch; otherwise redirect loops may cause the crawler to dedicate too many resources to any specific
domain.
To disable redirects use:
REDIRECT_ENABLED = False
Some pages (up to 1%, based on empirical data from the year 2013) declare themselves as ajax crawlable. This means
they provide a plain HTML version of content that is usually available only via AJAX. Pages can indicate it in two ways:
1) by using #! in URL - this is the default way;
2) by using a special meta tag - this way is used on “main”, “index” website pages.
Scrapy handles (1) automatically; to handle (2) enable AjaxCrawlMiddleware:
AJAXCRAWL_ENABLED = True
When doing broad crawls it’s common to crawl a lot of “index” web pages; AjaxCrawlMiddleware helps to crawl
them correctly. It is turned OFF by default because it has some performance overhead, and enabling it for focused
crawls doesn’t make much sense.
If your broad crawl shows a high memory usage, in addition to crawling in BFO order and lowering concurrency you
should debug your memory leaks.
Here is a general guide on how to use your browser’s Developer Tools to ease the scraping process. Today almost
all browsers come with built in Developer Tools and although we will use Firefox in this guide, the concepts are
applicable to any other browser.
In this guide we’ll introduce the basic tools to use from a browser’s Developer Tools by scraping quotes.toscrape.com.
Since Developer Tools operate on a live browser DOM, what you’ll actually see when inspecting the page source
is not the original HTML, but a modified one after applying some browser clean up and executing Javascript code.
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify
the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.
Therefore, you should keep in mind the following things:
• Disable Javascript while inspecting the DOM looking for XPaths to be used in Scrapy (in the Developer Tools
settings click Disable JavaScript)
• Never use full XPath paths, use relative and clever ones based on attributes (such as id, class, width, etc)
or any identifying features like contains(@href, 'image').
• Never include <tbody> elements in your XPath expressions unless you really know what you’re doing
By far the most handy feature of the Developer Tools is the Inspector feature, which allows you to inspect the under-
lying HTML code of any webpage. To demonstrate the Inspector, let’s look at the quotes.toscrape.com-site.
On the site we have a total of ten quotes from various authors with specific tags, as well as the Top Ten Tags. Let’s say
we want to extract all the quotes on this page, without any meta-information about authors, tags, etc.
Instead of viewing the whole source code for the page, we can simply right click on a quote and select Inspect
Element (Q), which opens up the Inspector. In it you should see something like this:
If you hover over the first div directly above the span tag highlighted in the screenshot, you’ll see that the corre-
sponding section of the webpage gets highlighted as well. So now we have a section, but we can’t find our quote text
anywhere.
The advantage of the Inspector is that it automatically expands and collapses sections and tags of a webpage, which
greatly improves readability. You can expand and collapse a tag by clicking on the arrow in front of it or by double
clicking directly on the tag. If we expand the span tag with the class= "text" we will see the quote-text we
clicked on. The Inspector lets you copy XPaths to selected elements. Let’s try it out: Right-click on the span tag,
select Copy > XPath and paste it in the scrapy shell like so:
$ scrapy shell "http://quotes.toscrape.com/"
(...)
>>> response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').getall()
['"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."']
Adding text() at the end we are able to extract the first quote with this basic selector. But this XPath is not really
that clever. All it does is go down a desired path in the source code starting from html. So let's see if we can refine
our XPath. If we check the Inspector again, we'll see that directly beneath our expanded div tag we have nine identical
div tags, each with the same attributes as our first one. If we expand any of them, we'll see the same structure as with
our first quote: two span tags and one div tag:
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">(...)</span>
    <span>(...)</span>
    <div class="tags">(...)</div>
</div>
With this knowledge we can refine our XPath: Instead of a path to follow, we’ll simply select all span tags with the
class="text" by using the has-class-extension:
>>> response.xpath('//span[has-class("text")]/text()').getall()
['"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."',
 '"It is our choices, Harry, that show what we truly are, far more than our abilities."',
 '"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."',
 (...)]
And with one simple, cleverer XPath we are able to extract all quotes from the page. We could have constructed a
loop over our first XPath to increase the number of the last div, but this would have been unnecessarily complex and
by simply constructing an XPath with has-class("text") we were able to extract all quotes in one line.
The Inspector has a lot of other helpful features, such as searching in the source code or directly scrolling to an element
you selected. Let’s demonstrate a use case:
Say you want to find the Next button on the page. Type Next into the search bar on the top right of the Inspector.
You should get two results. The first is a li tag with the class="next", the second the text of an a tag. Right click
on the a tag and select Scroll into View. If you hover over the tag, you'll see the button highlighted. From
here we could easily create a Link Extractor to follow the pagination. On a simple site such as this, there may not be
the need to find an element visually but the Scroll into View function can be quite useful on complex sites.
Note that the search bar can also be used to search for and test CSS selectors. For example, you could search for
span.text to find all quote texts. Instead of a full text search, this searches for exactly the span tag with the
class="text" in the page.
While scraping you may come across dynamic webpages where some parts of the page are loaded dynamically through
multiple requests. While this can be quite tricky, the Network-tool in the Developer Tools greatly facilitates this task.
To demonstrate the Network-tool, let’s take a look at the page quotes.toscrape.com/scroll.
The page is quite similar to the basic quotes.toscrape.com-page, but instead of the above-mentioned Next button, the
page automatically loads new quotes when you scroll to the bottom. We could go ahead and try out different XPaths
directly, but instead we’ll check another quite useful command from the scrapy shell:
A browser window should open with the webpage but with one crucial difference: Instead of the quotes we just see a
greenish bar with the word Loading....
The view(response) command let’s us view the response our shell or later our spider receives from the server.
Here we see that some basic template is loaded which includes the title, the login-button and the footer, but the
quotes are missing. This tells us that the quotes are being loaded from a different request than quotes.toscrape/
scroll.
If you click on the Network tab, you will probably only see two entries. The first thing we do is enable persistent logs
by clicking on Persist Logs. If this option is disabled, the log is automatically cleared each time you navigate to
a different page. Enabling this option is a good default, since it gives us control on when to clear the logs.
If we reload the page now, you’ll see the log get populated with six new requests.
Here we see every request that has been made when reloading the page and can inspect each request and its response.
So let’s find out where our quotes are coming from:
First click on the request with the name scroll. On the right you can now inspect the request. In Headers you’ll
find details about the request headers, such as the URL, the method, the IP-address, and so on. We’ll ignore the other
tabs and click directly on Response.
What you should see in the Preview pane is the rendered HTML code, which is exactly what we saw when we called
view(response) in the shell. Accordingly, the type of the request in the log is html. The other requests have
types like css or js, but what interests us is the one request called quotes?page=1 with the type json.
If we click on this request, we see that the request URL is http://quotes.toscrape.com/api/quotes?
page=1 and the response is a JSON-object that contains our quotes. We can also right-click on the request and open
Open in new tab to get a better overview.
With this response we can now easily parse the JSON-object and also request each page to get every quote on the site:
import scrapy
import json

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    page = 1
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {"quote": quote["text"]}
        if data["has_next"]:
            self.page += 1
            url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)
This spider starts at the first page of the quotes-API. With each response, we parse the response.text and
assign it to data. This lets us operate on the JSON-object like on a Python dictionary. We iterate through
the quotes and yield each quote["text"]. If the handy has_next element is true (try loading
quotes.toscrape.com/api/quotes?page=10 in your browser or a page-number greater than 10), we increment the page
attribute and yield a new request, inserting the incremented page-number into our url.
You can see that with a few inspections in the Network-tool we were able to easily replicate the dynamic requests of
the scrolling functionality of the page. Crawling dynamic pages can be quite daunting and pages can be very complex,
but it (mostly) boils down to identifying the correct request and replicating it in your spider.
Some webpages show the desired data when you load them in a web browser. However, when you download them
using Scrapy, you cannot reach the desired data using selectors.
When this happens, the recommended approach is to find the data source and extract the data from it.
If you fail to do that, and you can nonetheless access the desired data through the DOM from your web browser, see
Pre-rendering JavaScript.
To extract the desired data, you must first find its source location.
If the data is in a non-text-based format, such as an image or a PDF document, use the network tool of your web
browser to find the corresponding request, and reproduce it.
If your web browser lets you select the desired data as text, the data may be defined in embedded JavaScript code, or
loaded from an external resource in a text-based format.
In that case, you can use a tool like wgrep to find the URL of that resource.
If the data turns out to come from the original URL itself, you must inspect the source code of the webpage to determine
where the data is located.
If the data comes from a different URL, you will need to reproduce the corresponding request.
Sometimes you need to inspect the source code of a webpage (not the DOM) to determine where some desired data is
located.
Use Scrapy’s fetch command to download the webpage contents as seen by Scrapy:
If the desired data is in embedded JavaScript code within a <script/> element, see Parsing JavaScript code.
If you cannot find the desired data, first make sure it’s not just Scrapy: download the webpage with an HTTP client
like curl or wget and see if the information can be found in the response they get.
If they get a response with the desired data, modify your Scrapy Request to match that of the other HTTP client.
For example, try using the same user-agent string (USER_AGENT) or the same headers.
If they also get a response without the desired data, you’ll need to take steps to make your request more similar to that
of the web browser. See Reproducing requests.
Sometimes we need to reproduce a request the way our web browser performs it.
Use the network tool of your web browser to see how your web browser performs the desired request, and try to
reproduce that request with Scrapy.
It might be enough to yield a Request with the same HTTP method and URL. However, you may also need to
reproduce the body, headers and form parameters (see FormRequest) of that request.
Once you get the expected response, you can extract the desired data from it.
You can reproduce any request with Scrapy. However, some times reproducing all necessary requests may not seem
efficient in developer time. If that is your case, and crawling speed is not a major concern for you, you can alternatively
consider JavaScript pre-rendering.
If you get the expected response sometimes, but not always, the issue is probably not your request, but the target server.
The target server might be buggy, overloaded, or banning some of your requests.
Once you have a response with the desired data, how you extract the desired data from it depends on the type of
response:
• If the response is HTML or XML, use selectors as usual.
• If the response is JSON, use json.loads to load the desired data from response.text:
data = json.loads(response.text)
If the desired data is inside HTML or XML code embedded within JSON data, you can load that HTML or
XML code into a Selector and then use it as usual:
selector = Selector(data['html'])
• If the response is JavaScript, or HTML with a <script/> element containing the desired data, see Parsing
JavaScript code.
• If the response is CSS, use a regular expression to extract the desired data from response.text.
• If the response is an image or another format based on images (e.g. PDF), read the response as bytes from
response.body and use an OCR solution to extract the desired data as text.
For example, you can use pytesseract. To read a table from a PDF, tabula-py may be a better choice.
• If the response is SVG, or HTML with embedded SVG containing the desired data, you may be able to extract
the desired data using selectors, since SVG is based on XML.
Otherwise, you might need to convert the SVG code into a raster image, and handle that raster image.
If the desired data is hardcoded in JavaScript, you first need to get the JavaScript code:
• If the JavaScript code is in a JavaScript file, simply read response.text.
• If the JavaScript code is within a <script/> element of an HTML page, use selectors to extract the text within
that <script/> element.
Once you have a string with the JavaScript code, you can extract the desired data from it:
• You might be able to use a regular expression to extract the desired data in JSON format, which you can then
parse with json.loads.
For example, if the JavaScript code contains a separate line like var data = {"field": "value"};
you can extract that data as follows:
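A minimal sketch (the pattern assumes the data sits on a line of its own inside a <script> element):
>>> import json
>>> pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;'
>>> json_data = response.css('script::text').re_first(pattern)
>>> json.loads(json_data)
{'field': 'value'}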
• Otherwise, use js2xml to convert the JavaScript code into an XML document that you can parse using selectors.
For example, if the JavaScript code contains var data = {field: "value"}; you can extract that
data as follows:
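A sketch using js2xml together with parsel (assuming the same var data = {field: "value"}; snippet):
>>> import js2xml
>>> import lxml.etree
>>> from parsel import Selector
>>> javascript = response.css('script::text').get()
>>> xml = lxml.etree.tostring(js2xml.parse(javascript), encoding='unicode')
>>> selector = Selector(text=xml)
>>> selector.css('var[name="data"]').get()
'<var name="data"><object><property name="field"><string>value</string></property></object></var>'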
On webpages that fetch data from additional requests, reproducing those requests that contain the desired data is the
preferred approach. The effort is often worth the result: structured, complete data with minimum parsing time and
network transfer.
However, sometimes it can be really hard to reproduce certain requests. Or you may need something that no request
can give you, such as a screenshot of a webpage as seen in a web browser.
In these cases use the Splash JavaScript-rendering service, along with scrapy-splash for seamless integration.
Splash returns as HTML the DOM of a webpage, so that you can parse it with selectors. It provides great flexibility
through configuration or scripting.
If you need something beyond what Splash offers, such as interacting with the DOM on-the-fly from Python code
instead of using a previously-written script, or handling multiple web browser windows, you might need to use a
headless browser instead.
A headless browser is a special web browser that provides an API for automation.
The easiest way to use a headless browser with Scrapy is to use Selenium, along with scrapy-selenium for seamless
integration.
In Scrapy, objects such as Requests, Responses and Items have a finite lifetime: they are created, used for a while, and
finally destroyed.
From all those objects, the Request is probably the one with the longest lifetime, as it stays waiting in the Scheduler
queue until it’s time to process it. For more info see Architecture overview.
As these Scrapy objects have a (rather long) lifetime, there is always the risk of accumulating them in memory without
releasing them properly and thus causing what is known as a “memory leak”.
To help debugging memory leaks, Scrapy provides a built-in mechanism for tracking objects references called trackref ,
and you can also use a third-party library called Guppy for more advanced memory debugging (see below for more
info). Both mechanisms must be used from the Telnet Console.
It happens quite often (sometimes by accident, sometimes on purpose) that the Scrapy developer passes objects ref-
erenced in Requests (for example, using the cb_kwargs or meta attributes or the request callback function) and
that effectively bounds the lifetime of those referenced objects to the lifetime of the Request. This is, by far, the most
common cause of memory leaks in Scrapy projects, and a quite difficult one to debug for newcomers.
In big projects, the spiders are typically written by different people and some of those spiders could be “leaking” and
thus affecting the rest of the other (well-written) spiders when they get to run concurrently, which, in turn, affects the
whole crawling process.
The leak could also come from a custom middleware, pipeline or extension that you have written, if you are not
releasing the (previously allocated) resources properly. For example, allocating resources on spider_opened but
not releasing them on spider_closed may cause problems if you’re running multiple spiders per process.
By default Scrapy keeps the request queue in memory; it includes Request objects and all objects referenced in
Request attributes (e.g. in cb_kwargs and meta). While not necessarily a leak, this can take a lot of memory.
Enabling the persistent job queue could help keep memory usage under control.
trackref is a module provided by Scrapy to debug the most common cases of memory leaks. It basically tracks the
references to all live Requests, Responses, Item and Selector objects.
You can enter the telnet console and inspect how many objects (of the classes mentioned above) are currently alive
using the prefs() function which is an alias to the print_live_refs() function:
>>> prefs()
Live References
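ExampleSpider                       1   oldest: 15s ago
HtmlResponse                       10   oldest: 1s ago
Selector                            2   oldest: 0s ago
FormRequest                       878   oldest: 7s ago
(The class names and numbers above are illustrative; your report will reflect the objects alive in your own crawl.)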
As you can see, that report also shows the “age” of the oldest object in each class. If you’re running multiple spiders
per process chances are you can figure out which spider is leaking by looking at the oldest request or response. You
can get the oldest object of each class using the get_oldest() function (from the telnet console).
The objects tracked by trackref are all from these classes (and all their subclasses):
• scrapy.http.Request
• scrapy.http.Response
• scrapy.item.Item
• scrapy.selector.Selector
• scrapy.spiders.Spider
A real example
Let’s see a concrete example of a hypothetical case of memory leaks. Suppose we have some spider with a line similar
to this one:
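For example (the URL is illustrative):
return scrapy.Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
    callback=self.parse, cb_kwargs={'referer': response})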
That line is passing a response reference inside a request, which effectively ties the response's lifetime to the
request's, and that would definitely cause memory leaks.
Let’s see how we can discover the cause (without knowing it a-priori, of course) by using the trackref tool.
After the crawler is running for a few minutes and we notice its memory usage has grown a lot, we can enter its telnet
console and check the live references:
>>> prefs()
Live References
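SomenastySpider                     1   oldest: 15s ago
HtmlResponse                     3890   oldest: 265s ago
Selector                            2   oldest: 0s ago
Request                          3878   oldest: 250s ago
(Again, the numbers are illustrative.)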
The fact that there are so many live responses (and that they're so old) is definitely suspicious, as responses should
have a relatively short lifetime compared to Requests. The number of responses is similar to the number of requests,
so it looks like they are tied in some way. We can now go and check the code of the spider to discover the nasty line
that is generating the leaks (passing response references inside requests).
Sometimes extra information about live objects can be helpful. Let’s check the oldest response:
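For example (the URL shown is illustrative):
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('HtmlResponse')
>>> r.url
'http://www.somenastyspider.com/product.php?pid=123'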
If you want to iterate over all objects, instead of getting the oldest one, you can use the scrapy.utils.
trackref.iter_all() function:
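>>> from scrapy.utils.trackref import iter_all
>>> [r.url for r in iter_all('HtmlResponse')]
['http://www.somenastyspider.com/product.php?pid=123',
 'http://www.somenastyspider.com/product.php?pid=584',
 ...]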
If your project has too many spiders executed in parallel, the output of prefs() can be difficult to read. For this
reason, that function has an ignore argument which can be used to ignore a particular class (and all its subclasses).
For example, this won't show any live references to spiders:
>>> from scrapy.spiders import Spider
>>> prefs(ignore=Spider)
scrapy.utils.trackref module
trackref provides a very convenient mechanism for tracking down memory leaks, but it only keeps track of the
objects that are more likely to cause memory leaks (Requests, Responses, Items, and Selectors). However, there are
other cases where the memory leaks could come from other (more or less obscure) objects. If this is your case, and you
can’t find your leaks using trackref, you still have another resource: the Guppy library. If you’re using Python3,
see Debugging memory leaks with muppy.
If you use pip, you can install Guppy with the following command:
pip install guppy
The telnet console also comes with a built-in shortcut (hpy) for accessing Guppy heap objects. Here’s an example to
view all Python objects available in the heap using Guppy:
>>> x = hpy.heap()
>>> x.bytype
Partition of a set of 297033 objects. Total size = 52587824 bytes.
Index Count % Size % Cumulative % Type
0 22307 8 16423880 31 16423880 31 dict
1 122285 41 12441544 24 28865424 55 str
2 68346 23 5966696 11 34832120 66 tuple
3 227 0 5836528 11 40668648 77 unicode
4 2461 1 2222272 4 42890920 82 type
5 16870 6 2024400 4 44915320 85 function
6 13949 5 1673880 3 46589200 89 types.CodeType
You can see that most space is used by dicts. Then, if you want to see from which attribute those dicts are referenced,
you could do:
>>> x.bytype[0].byvia
Partition of a set of 22307 objects. Total size = 16423880 bytes.
Index Count % Size % Cumulative % Referred Via:
0 10982 49 9416336 57 9416336 57 '.__dict__'
1 1820 8 2681504 16 12097840 74 '.__dict__', '.func_globals'
2 3097 14 1122904 7 13220744 80
3 990 4 277200 2 13497944 82 "['cookies']"
4 987 4 276360 2 13774304 84 "['cache']"
5 985 4 275800 2 14050104 86 "['meta']"
6 897 4 251160 2 14301264 87 '[2]'
7 1 0 196888 1 14498152 88 "['moduleDict']", "['modules']"
8 672 3 188160 1 14686312 89 "['cb_kwargs']"
9 27 0 155016 1 14841328 90 '[1]'
<333 more rows. Type e.g. '_.more' to view.>
As you can see, the Guppy module is very powerful but also requires some deep knowledge about Python internals.
For more info about Guppy, refer to the Guppy documentation.
Here’s an example to view all Python objects available in the heap using muppy:
Sometimes, you may notice that the memory usage of your Scrapy process will only increase, but never decrease.
Unfortunately, this could happen even though neither Scrapy nor your project are leaking memory. This is due to a
(not so well) known problem of Python, which may not return released memory to the operating system in some cases.
For more information on this issue see:
• Python Memory Management
• Python Memory Management Part 2
• Python Memory Management Part 3
The improvements proposed by Evan Jones, which are detailed in this paper, got merged in Python 2.5, but this only
reduces the problem, it doesn’t fix it completely. To quote the paper:
Unfortunately, this patch can only free an arena if there are no more objects allocated in it anymore. This
means that fragmentation is a large issue. An application could have many megabytes of free memory,
scattered throughout all the arenas, but it will be unable to free any of it. This is a problem experienced
by all memory allocators. The only way to solve it is to move to a compacting garbage collector, which is
able to move objects in memory. This would require significant changes to the Python interpreter.
To keep memory consumption reasonable you can split the job into several smaller jobs or enable the persistent job
queue and stop/start the spider from time to time.
5.9 Downloading and processing files and images
Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you
scrape products and also want to download their images locally). These pipelines share a bit of functionality and
structure (we refer to them as media pipelines), but typically you’ll either use the Files Pipeline or the Images Pipeline.
Both pipelines implement these features:
• Avoid re-downloading media that was downloaded recently
• Specify where to store the media (filesystem directory, Amazon S3 bucket, Google Cloud Storage bucket)
The Images Pipeline has a few extra functions for processing images:
• Convert all downloaded images to a common format (JPG) and mode (RGB)
• Thumbnail generation
• Check images width/height to make sure they meet a minimum constraint
The pipelines also keep an internal queue of those media URLs which are currently being scheduled for download,
and connect those responses that arrive containing the same media to that queue. This avoids downloading the same
media more than once when it’s shared by several items.
The typical workflow, when using the FilesPipeline goes like this:
1. In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field.
2. The item is returned from the spider and goes to the item pipeline.
3. When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download
using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares
are reused), but with a higher priority, processing them before other pages are scraped. The item remains
“locked” at that particular pipeline stage until the files have finished downloading (or fail for some reason).
4. When the files are downloaded, another field (files) will be populated with the results. This field will contain
a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url
(taken from the file_urls field), and the file checksum. The files in the list of the files field will retain
the same order as the original file_urls field. If some file failed downloading, an error will be logged and
the file won’t be present in the files field.
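As a sketch of this workflow from the spider side (the spider name, URL and CSS selectors below are hypothetical; only the file_urls / files field names come from the pipeline):
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'                                   # hypothetical spider
    start_urls = ['http://www.example.com/products']    # placeholder URL

    def parse(self, response):
        # Step 1: put the URLs of the desired files into the file_urls field.
        yield {
            'name': response.css('h1::text').get(),
            'file_urls': response.css('a.datasheet::attr(href)').getall(),
            # Step 4: once downloaded, the pipeline adds a 'files' field with
            # path/url/checksum dicts, in the same order as file_urls.
        }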
Using the ImagesPipeline is a lot like using the FilesPipeline, except the default field names used are dif-
ferent: you use image_urls for the image URLs of an item and it will populate an images field for the information
about the downloaded images.
The advantage of using the ImagesPipeline for image files is that you can configure some extra functions like
generating thumbnails and filtering the images based on their size.
The Images Pipeline uses Pillow for thumbnailing and normalizing images to JPEG/RGB format, so you need to install
this library in order to use it. Python Imaging Library (PIL) should also work in most cases, but it is known to cause
trouble in some setups, so we recommend using Pillow instead of PIL.
To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.
For the Images Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
For the Files Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
Note: You can also use both the Files and Images Pipeline at the same time.
Then, configure the target storage setting to a valid value that will be used for storing the downloaded images.
Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For the Files Pipeline, set the FILES_STORE setting:
FILES_STORE = '/path/to/valid/dir'
For the Images Pipeline, set the IMAGES_STORE setting:
IMAGES_STORE = '/path/to/valid/dir'
File system is currently the only officially supported storage, but there is also support for storing files in Amazon S3
and Google Cloud Storage.
The files are stored using a SHA1 hash of their URLs for the file names.
For example, the following image URL:
http://www.example.com/image.jpg
whose SHA1 hash is:
3afec3b4765f8f0a07b78f98c07b83f013567a0a
will be downloaded and stored in the following file:
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
Where:
• <IMAGES_STORE> is the directory defined in IMAGES_STORE setting for the Images Pipeline.
• full is a sub-directory to separate full images from thumbnails (if used). For more info see Thumbnail gener-
ation for images.
Amazon S3 storage
FILES_STORE and IMAGES_STORE can represent an Amazon S3 bucket. Scrapy will automatically upload the
files to the bucket.
For example, this is a valid IMAGES_STORE value:
IMAGES_STORE = 's3://bucket/images'
You can modify the Access Control List (ACL) policy used for the stored files, which is defined by the
FILES_STORE_S3_ACL and IMAGES_STORE_S3_ACL settings. By default, the ACL is set to private. To
make the files publicly available use the public-read policy:
IMAGES_STORE_S3_ACL = 'public-read'
For more information, see canned ACLs in the Amazon S3 Developer Guide.
Because Scrapy uses boto / botocore internally, you can also use other S3-like storages, such as self-hosted
Minio or s3.scality. All you need to do is set the endpoint option in your Scrapy settings:
AWS_ENDPOINT_URL = 'http://minio.example.com:9000'
For self-hosted storage you might also want to disable SSL and skip verification of the SSL connection:
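A minimal sketch; AWS_USE_SSL and AWS_VERIFY are the settings controlling this behaviour:
AWS_USE_SSL = False   # use plain HTTP when talking to the custom endpoint
AWS_VERIFY = False    # do not verify SSL certificates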
Google Cloud Storage
FILES_STORE and IMAGES_STORE can represent a Google Cloud Storage bucket. Scrapy will automatically
upload the files to the bucket (requires the google-cloud-storage package).
For example, these are valid IMAGES_STORE and GCS_PROJECT_ID settings:
IMAGES_STORE = 'gs://bucket/images/'
GCS_PROJECT_ID = 'project_id'
You can modify the Access Control List (ACL) policy used for the stored files through the FILES_STORE_GCS_ACL
and IMAGES_STORE_GCS_ACL settings. For example, to make the files publicly available, use the publicRead
policy:
IMAGES_STORE_GCS_ACL = 'publicRead'
For more information, see Predefined ACLs in the Google Cloud Platform Developer Guide.
Usage example
In order to use a media pipeline, first enable it as described above. Then, declare the URL and result fields on your
item. For the Images Pipeline, for example:
import scrapy

class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
If you want to use another field name for the URLs key or for the results key, it is also possible to override it.
For the Files Pipeline, set FILES_URLS_FIELD and/or FILES_RESULT_FIELD settings:
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
For the Images Pipeline, set IMAGES_URLS_FIELD and/or IMAGES_RESULT_FIELD settings:
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
If you need something more complex and want to override the custom pipeline behaviour, see Extending the Media
Pipelines.
If you have multiple image pipelines inheriting from ImagesPipeline and you want to have different settings in
different pipelines, you can set setting keys prefixed with the uppercase name of your pipeline class. E.g. if your
pipeline is called MyPipeline and you want a custom IMAGES_URLS_FIELD, you define the setting
MYPIPELINE_IMAGES_URLS_FIELD and your custom settings will be used.
File expiration
The Image Pipeline avoids downloading files that were downloaded recently. To adjust this retention delay use the
FILES_EXPIRES setting (or IMAGES_EXPIRES, in case of Images Pipeline), which specifies the delay in number
of days:
# 120 days of delay for files expiration
FILES_EXPIRES = 120
The Images Pipeline can automatically create thumbnails of the downloaded images. In order to use this feature,
you must set IMAGES_THUMBS to a dictionary where the keys are the thumbnail names and the values are their
dimensions.
For example:
IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}
When you use this feature, the Images Pipeline will create thumbnails of each specified size with this format:
<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
Where:
• <size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc)
• <image_id> is the SHA1 hash of the image url
Example of image files stored using small and big thumbnail names:
<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
The first one is the full image, as downloaded from the site.
When using the Images Pipeline, you can drop images which are too small, by specifying the minimum allowed size
in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings.
For example:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
It is possible to set just one size constraint or both. When setting both of them, only images that satisfy both minimum
sizes will be saved. For the above example, images of sizes (105 x 105) or (105 x 200) or (200 x 105) will all be
dropped because at least one dimension is shorter than the constraint.
By default, there are no size constraints, so all images are processed.
Allowing redirections
By default media pipelines ignore redirects, i.e. an HTTP redirection to a media file URL request will mean the media
download is considered failed.
To handle media redirections, set this setting to True:
MEDIA_ALLOW_REDIRECTS = True
Here are the methods that you can override in your custom Files Pipeline:
class scrapy.pipelines.files.FilesPipeline
You can, for instance, override the file_path() method to customize the storage path of each downloaded file:
import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        # Store files under their original file name instead of the SHA1-based default.
        return 'files/' + os.path.basename(urlparse(request.url).path)
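The other method you will typically override is get_media_requests(item, info), which returns the download Requests for an item. A minimal sketch, assuming the default file_urls field:
    def get_media_requests(self, item, info):
        # Yield a Request for each URL listed in the item's file_urls field.
        # (assumes `import scrapy` at the top of the module)
        for file_url in item['file_urls']:
            yield scrapy.Request(file_url)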
Those requests will be processed by the pipeline and, when they have finished downloading, the results
will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain
(success, file_info_or_error) where:
• success is a boolean which is True if the image was downloaded successfully or False if it failed
for some reason
• file_info_or_error is a dict containing the following keys (if success is True) or a Twisted
Failure if there was a problem.
– url - the url where the file was downloaded from. This is the url of the request returned from the
get_media_requests() method.
– path - the path (relative to FILES_STORE) where the file was stored
– checksum - an MD5 hash of the file contents
The list of tuples received by item_completed() is guaranteed to retain the same order as the requests
returned from the get_media_requests() method.
Here’s a typical value of the results argument:
[(True,
{'checksum': '2b00042f7481c7b056c4b410d28f33cf',
'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
'url': 'http://www.example.com/files/product1.pdf'}),
(False,
Failure(...))]
By default the get_media_requests() method returns None which means there are no files to down-
load for the item.
item_completed(results, item, info)
The FilesPipeline.item_completed() method is called when all file requests for a single item
have completed (either finished downloading, or failed for some reason).
The item_completed() method must return the output that will be sent to subsequent item pipeline
stages, so you must return (or drop) the item, as you would in any pipeline.
Here is an example of the item_completed() method where we store the downloaded file paths
(passed in results) in the file_paths item field, and we drop the item if it doesn’t contain any files:
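A minimal sketch of such an item_completed() override (file_paths is just the field name used in this example):
    def item_completed(self, results, item, info):
        # requires: from scrapy.exceptions import DropItem
        # Keep only the paths of files that downloaded successfully.
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item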
The same methods can be overridden in a custom Images Pipeline (ImagesPipeline is an extension of FilesPipeline,
customizing the default field names and adding behaviour for images). For example, to customize where images are
stored:
import os
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        # Store files under their original file name instead of the SHA1-based default.
        return 'files/' + os.path.basename(urlparse(request.url).path)
Here is a full example of the Images Pipeline whose methods are exemplified above:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
5.10 Deploying Spiders
This section describes the different options you have for deploying your Scrapy spiders to run them on a regular basis.
Running Scrapy spiders on your local machine is very convenient for the (early) development stage, but not so much
when you need to execute long-running spiders or move spiders to run in production continuously. This is where the
solutions for deploying Scrapy spiders come in.
Popular choices for deploying Scrapy spiders are:
• Scrapyd (open source)
• Scrapy Cloud (cloud-based)
Scrapyd is an open source application to run Scrapy spiders. It provides a server with HTTP API, capable of running
and monitoring Scrapy spiders.
To deploy spiders to Scrapyd, you can use the scrapyd-deploy tool provided by the scrapyd-client package. Please
refer to the scrapyd-deploy documentation for more information.
Scrapyd is maintained by some of the Scrapy developers.
Scrapy Cloud is a hosted, cloud-based service by Scrapinghub, the company behind Scrapy.
Scrapy Cloud removes the need to set up and monitor servers and provides a nice UI to manage spiders and review
scraped items, logs and stats.
To deploy spiders to Scrapy Cloud you can use the shub command line tool. Please refer to the Scrapy Cloud docu-
mentation for more information.
Scrapy Cloud is compatible with Scrapyd and one can switch between them as needed - the configuration is read from
the scrapy.cfg file just like scrapyd-deploy.
5.11 AutoThrottle extension
This is an extension for automatically throttling crawling speed based on load of both the Scrapy server and the website
you are crawling.
Note: The AutoThrottle extension honours the standard Scrapy settings for concurrency and delay. This means that
it will respect CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options and
never set a download delay lower than DOWNLOAD_DELAY.
In Scrapy, the download latency is measured as the time elapsed between establishing the TCP connection and receiv-
ing the HTTP headers.
Note that these latencies are very hard to measure accurately in a cooperative multitasking environment because Scrapy
may be busy processing a spider callback, for example, and unable to attend downloads. However, these latencies
should still give a reasonable estimate of how busy Scrapy (and ultimately, the server) is, and this extension builds on
that premise.
5.11.4 Settings
AUTOTHROTTLE_ENABLED
Default: False
Enables the AutoThrottle extension.
AUTOTHROTTLE_START_DELAY
Default: 5.0
The initial download delay (in seconds).
AUTOTHROTTLE_MAX_DELAY
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
AUTOTHROTTLE_TARGET_CONCURRENCY
Default: 1.0
The average number of requests Scrapy should be sending in parallel to remote websites.
AUTOTHROTTLE_DEBUG
Default: False
Enable AutoThrottle debug mode which will display stats on every response received, so you can see how the throttling
parameters are being adjusted in real time.
5.12 Benchmarking
Scrapy comes with a simple benchmarking suite that spawns a local HTTP server and crawls it at the maximum possible
speed. To run it, use:
scrapy bench
You should see output listing, among other statistics, how many pages per minute the benchmark spider crawled.
That tells you that Scrapy is able to crawl about 3000 pages per minute on the hardware where you run it. Note that
this is a very simple spider intended to follow links; any custom spider you write will probably do more work, which
results in slower crawl rates. How much slower depends on how much your spider does and how well it’s written.
In the future, more cases will be added to the benchmarking suite to cover other common scenarios.
5.13 Jobs: pausing and resuming crawls
Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.
Scrapy supports this functionality out of the box by providing the following facilities:
• a scheduler that persists scheduled requests on disk
• a duplicates filter that persists visited requests on disk
• an extension that keeps some spider state (key/value pairs) persistent between batches
To enable persistence support you just need to define a job directory through the JOBDIR setting. This directory
will be used for storing all required data to keep the state of a single job (i.e. a spider run). It’s important to note that
this directory must not be shared by different spiders, or even different jobs/runs of the same spider, as it’s meant to be
used for storing the state of a single job.
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing
the same command:
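For example, start (and later resume) the spider with a command along these lines, where the spider name and JOBDIR value are only placeholders:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1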
Sometimes you’ll want to keep some persistent spider state between pause/resume batches. You can use the
spider.state attribute for that, which should be a dict. There’s a built-in extension that takes care of serializing,
storing and loading that attribute from the job directory, when the spider starts and stops.
Here’s an example of a callback that uses the spider state (other spider code is omitted for brevity):
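A minimal sketch of such a callback, using a hypothetical items_count key:
def parse_item(self, response):
    # The state dict is serialized to the job directory between runs.
    self.state['items_count'] = self.state.get('items_count', 0) + 1
    # ... parse and yield the item here ...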
There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:
Cookies expiration
Cookies may expire. So, if you don’t resume your spider quickly the requests scheduled may no longer work. This
won’t be an issue if your spider doesn’t rely on cookies.
Request serialization
Requests must be serializable by the pickle module, in order for persistence to work, so you should make sure that
your requests are serializable.
The most common issue here is to use lambda functions on request callbacks that can’t be persisted.
So, for example, this won’t work:
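A sketch of the problem, with placeholder URLs and callback names; the closure created by the lambda cannot be pickled (this assumes `import scrapy` and methods defined on a Spider class):
def some_callback(self, response):
    somearg = 'test'
    # Broken for persistence: the lambda callback cannot be pickled.
    return scrapy.Request('http://www.example.com',
                          callback=lambda r: self.other_callback(r, somearg))

def other_callback(self, response, somearg):
    self.logger.info('the argument passed is: %s', somearg)
A serializable alternative passes the extra argument through cb_kwargs (available since Scrapy 1.7) and references the callback method directly:
def some_callback(self, response):
    somearg = 'test'
    return scrapy.Request('http://www.example.com',
                          callback=self.other_callback,
                          cb_kwargs={'somearg': somearg})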
If you wish to log the requests that couldn’t be serialized, you can set the SCHEDULER_DEBUG setting to True in
the project’s settings page. It is False by default.
Frequently Asked Questions Get answers to most frequently asked questions.
Debugging Spiders Learn how to debug common problems of your scrapy spider.
Spiders Contracts Learn how to use contracts for testing your spiders.
Common Practices Get familiar with some Scrapy common practices.
Broad Crawls Tune Scrapy for crawling a lot of domains in parallel.
Using your browser’s Developer Tools for scraping Learn how to scrape with your browser’s developer tools.
Selecting dynamically-loaded content Read webpage data that is loaded dynamically.
Debugging memory leaks Learn how to find and get rid of memory leaks in your crawler.
Downloading and processing files and images Download files and/or images associated with your scraped items.
Deploying Spiders Deploy your Scrapy spiders and run them on a remote server.
AutoThrottle extension Adjust crawl rate dynamically based on load.
Benchmarking Check how Scrapy performs on your hardware.
Jobs: pausing and resuming crawls Learn how to pause and resume crawls for large spiders.
Extending Scrapy
6.1 Architecture overview
This document describes the architecture of Scrapy and how its components interact.
6.1.1 Overview
The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data
flow that takes place inside the system (shown by the red arrows). A brief description of the components is included
below with links for more detailed information about them. The data flow is also described below.
6.1.2 Data flow
The data flow in Scrapy is controlled by the execution engine, and goes like this:
1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see
process_request()).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the
Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing
through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine,
passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks
for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.
6.1.3 Components
Scrapy Engine
The engine is responsible for controlling the data flow between all components of the system, and triggering events
when certain actions occur. See the Data Flow section above for more details.
Scheduler
The Scheduler receives requests from the engine and enqueues them for feeding them later (also to the engine) when
the engine requests them.
Downloader
The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to
the spiders.
Spiders
Spiders are custom classes written by Scrapy users to parse responses and extract items (aka scraped items) from them
or additional requests to follow. For more information see Spiders.
Item Pipeline
The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.
Typical tasks include cleansing, validation and persistence (like storing the item in a database). For more information
see Item Pipeline.
Downloader middlewares
Downloader middlewares are specific hooks that sit between the Engine and the Downloader and process requests
when they pass from the Engine to the Downloader, and responses that pass from Downloader to the Engine.
Use a Downloader middleware if you need to do one of the following:
• process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the
website);
• change received response before passing it to a spider;
• send a new Request instead of passing received response to a spider;
• pass response to a spider without fetching a web page;
• silently drop some requests.
For more information see Downloader Middleware.
Spider middlewares
Spider middlewares are specific hooks that sit between the Engine and the Spiders and are able to process spider input
(responses) and output (items and requests).
Use a Spider middleware if you need to
• post-process output of spider callbacks - change/add/remove requests or items;
• post-process start_requests;
• handle spider exceptions;
• call errback instead of callback for some of the requests based on response content.
For more information see Spider Middleware.
Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it’s implemented
using non-blocking (aka asynchronous) code for concurrency.
For more information about asynchronous programming and Twisted see these links:
• Introduction to Deferreds in Twisted
• Twisted - hello, asynchronous programming
• Twisted Introduction - Krondo
6.2 Downloader Middleware
The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level
system for globally altering Scrapy’s requests and responses.
To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting, which is a
dict whose keys are the middleware class paths and their values are the middleware orders. Here’s an example:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
If you want to disable a built-in middleware (the ones defined in DOWNLOADER_MIDDLEWARES_BASE and enabled
by default) you must define it in your project’s DOWNLOADER_MIDDLEWARES setting and assign None as its value.
For example, if you want to disable the user-agent middleware:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware
documentation for more info.
Each downloader middleware is a Python class that defines one or more of the methods defined below.
The main entry point is the from_crawler class method, which receives a Crawler instance. The Crawler
object gives you access, for example, to the settings.
class scrapy.downloadermiddlewares.DownloaderMiddleware
Note: Any of the downloader middleware methods may also return a deferred.
process_request(request, spider)
This method is called for each request that goes through the download middleware.
process_request() should either: return None, return a Response object, return a Request
object, or raise IgnoreRequest.
If it returns None, Scrapy will continue processing this request, executing all other middlewares until,
finally, the appropriate downloader handler is called and the request performed (and its response downloaded).
If it returns a Response object, Scrapy won’t bother calling any other process_request() or
process_exception() methods, or the appropriate download function; it’ll return that response.
The process_response() methods of installed middleware are always called on every response.
If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the
returned request. Once the newly returned request is performed, the appropriate middleware chain will be
called on the downloaded response.
If it raises an IgnoreRequest exception, the process_exception() methods of installed down-
loader middleware will be called. If none of them handle the exception, the errback function of the request
(Request.errback) is called. If no code handles the raised exception, it is ignored and not logged
(unlike other exceptions).
Parameters
• request (Request object) – the request being processed
• spider (Spider object) – the spider for which this request is intended
process_response(request, response, spider)
process_response() should either: return a Response object, return a Request object or raise an
IgnoreRequest exception.
If it returns a Response (it could be the same given response, or a brand-new one), that response will
continue to be processed with the process_response() of the next middleware in the chain.
If it returns a Request object, the middleware chain is halted and the returned request is resched-
uled to be downloaded in the future. This is the same behavior as if a request is returned from
process_request().
If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is
called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
Parameters
• request (Request object) – the request that originated the response
• response (Response object) – the response being processed
• spider (Spider object) – the spider for which this response is intended
process_exception(request, exception, spider)
Scrapy calls process_exception() when a download handler or a process_request() (from a
downloader middleware) raises an exception (including an IgnoreRequest exception).
process_exception() should return: either None, a Response object, or a Request object.
If it returns None, Scrapy will continue processing this exception, executing any other
process_exception() methods of installed middleware, until no middleware is left and the default
exception handling kicks in.
If it returns a Response object, the process_response() method chain of installed middleware is
started, and Scrapy won’t bother calling any other process_exception() methods of middleware.
If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This
stops the execution of process_exception() methods of the middleware the same as returning a
response would.
Parameters
• request (Request object) – the request that generated the exception
• exception (an Exception object) – the raised exception
• spider (Spider object) – the spider for which this request is intended
from_crawler(cls, crawler)
If present, this classmethod is called to create a middleware instance from a Crawler. It must return
a new instance of the middleware. Crawler object provides access to all Scrapy core components like
settings and signals; it is a way for middleware to access them and hook its functionality into Scrapy.
Parameters crawler (Crawler object) – crawler that uses this middleware
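Putting these methods together, a minimal (hypothetical, not built-in) downloader middleware that adds a custom header to every outgoing request could look like the following sketch; CUSTOM_HEADER_VALUE and X-Custom-Header are made up for this example:
class CustomHeaderDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # CUSTOM_HEADER_VALUE is a hypothetical setting used only by this example.
        return cls(crawler.settings.get('CUSTOM_HEADER_VALUE', 'scrapy'))

    def __init__(self, header_value):
        self.header_value = header_value

    def process_request(self, request, spider):
        request.headers.setdefault('X-Custom-Header', self.header_value)
        # Returning None lets Scrapy keep processing the request normally.
        return None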
This page describes all downloader middleware components that come with Scrapy. For information on how to use
them and how to write your own downloader middleware, see the downloader middleware usage guide.
For a list of the components enabled by default (and their orders) see the DOWNLOADER_MIDDLEWARES_BASE
setting.
CookiesMiddleware
class scrapy.downloadermiddlewares.cookies.CookiesMiddleware
This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track
of cookies sent by web servers, and send them back on subsequent requests (from that spider), just like web
browsers do.
There is support for keeping multiple cookie sessions per spider by using the cookiejar Request meta key.
Keep in mind that the cookiejar meta key is not “sticky”. You need to keep passing it along on subsequent requests.
For example:
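A sketch of passing the cookiejar key along (the URL and callback name are placeholders, and `import scrapy` is assumed; earlier requests are assumed to have set meta={'cookiejar': ...}):
def parse_page(self, response):
    # do some processing
    return scrapy.Request('http://www.example.com/otherpage',
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)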
COOKIES_ENABLED
Default: True
Whether to enable the cookies middleware. If disabled, no cookies will be sent to web servers.
Notice that, regardless of the value of the COOKIES_ENABLED setting, if Request.meta['dont_merge_cookies']
evaluates to True the request cookies will not be sent to the web server and cookies received in the Response will not
be merged with the existing cookies.
For more detailed information see the cookies parameter in Request.
COOKIES_DEBUG
Default: False
If enabled, Scrapy will log all cookies sent in requests (ie. Cookie header) and all cookies received in responses (ie.
Set-Cookie header).
Here’s an example of a log with COOKIES_DEBUG enabled:
Cookie: clientlanguage_nl=en_EN
2011-04-06 14:35:14-0300 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 http://www.diningcity.com/netherlands/index.html>
[...]
DefaultHeadersMiddleware
class scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware
This middleware sets all default request headers specified in the DEFAULT_REQUEST_HEADERS setting.
DownloadTimeoutMiddleware
class scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
This middleware sets the download timeout for requests specified in the DOWNLOAD_TIMEOUT setting or
download_timeout spider attribute.
Note: You can also set download timeout per-request using download_timeout Request.meta key; this is sup-
ported even when DownloadTimeoutMiddleware is disabled.
HttpAuthMiddleware
class scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware
This middleware authenticates all requests generated from certain spiders using Basic access authentication
(aka. HTTP auth).
To enable HTTP authentication from certain spiders, set the http_user and http_pass attributes of those
spiders.
Example:
from scrapy.spiders import CrawlSpider

class SomeIntranetSiteSpider(CrawlSpider):

    http_user = 'someuser'
    http_pass = 'somepass'
    name = 'intranet.example.com'

    # .. rest of the spider code omitted ..
HttpCacheMiddleware
class scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware
This middleware provides low-level cache to all HTTP requests and responses. It has to be combined with a
cache storage backend as well as a cache policy.
Scrapy ships with three HTTP cache storage backends:
• Filesystem storage backend (default)
• DBM storage backend
• LevelDB storage backend
You can change the HTTP cache storage backend with the HTTPCACHE_STORAGE setting. It also ships with two
HTTP cache policies: the Dummy policy (default) and the RFC2616 policy.
Dummy policy (default)
This policy has no awareness of any HTTP Cache-Control directives. Every request and its corresponding response are
cached. When the same request is seen again, the response is returned without transferring anything from the Internet.
The Dummy policy is useful for testing spiders faster (without having to wait for downloads every time) and for trying
your spider offline, when an Internet connection is not available. The goal is to be able to “replay” a spider run exactly
as it ran before.
In order to use this policy, set:
• HTTPCACHE_POLICY to scrapy.extensions.httpcache.DummyPolicy
RFC2616 policy
This policy provides a RFC2616 compliant HTTP cache, i.e. with HTTP Cache-Control awareness, aimed at produc-
tion and used in continuous runs to avoid downloading unmodified data (to save bandwidth and speed up crawls).
what is implemented:
• Do not attempt to store responses/requests with no-store cache-control directive set
• Do not serve responses from cache if no-cache cache-control directive is set even for fresh responses
• Compute freshness lifetime from max-age cache-control directive
• Compute freshness lifetime from Expires response header
• Compute freshness lifetime from Last-Modified response header (heuristic used by Firefox)
• Compute current age from Age response header
• Compute current age from Date header
• Revalidate stale responses based on Last-Modified response header
• Revalidate stale responses based on ETag response header
• Set Date header for any received response missing it
• Support max-stale cache-control directive in requests
This allows spiders to be configured with the full RFC2616 cache policy, but avoid revalidation on a request-by-
request basis, while remaining conformant with the HTTP spec.
Example:
Add Cache-Control: max-stale=600 to Request headers to accept responses that have exceeded their
expiration time by no more than 600 seconds.
See also: RFC2616, 14.9.3
what is missing:
• Pragma: no-cache support https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
• Vary header support https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.6
• Invalidation after updates or deletes https://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10
• ...and probably others
In order to use this policy, set:
• HTTPCACHE_POLICY to scrapy.extensions.httpcache.RFC2616Policy
Filesystem storage backend (default)
A file system storage backend is available for the HTTP cache middleware.
In order to use this storage backend, set:
• HTTPCACHE_STORAGE to scrapy.extensions.httpcache.FilesystemCacheStorage
Each request/response pair is stored in a different directory containing the following files:
• request_body - the plain request body
• request_headers - the request headers (in raw HTTP format)
• response_body - the plain response body
• response_headers - the response headers (in raw HTTP format)
• meta - some metadata of this cache resource in Python repr() format (grep-friendly format)
• pickled_meta - the same metadata in meta but pickled for more efficient deserialization
The directory name is made from the request fingerprint (see scrapy.utils.request.fingerprint), and
one level of subdirectories is used to avoid creating too many files into the same directory (which is inefficient in many
file systems). An example directory could be:
/path/to/cache/dir/example.com/72/72811f648e718090f041317756c03adb0ada46c7
You can implement a cache storage backend by creating a Python class that defines the methods described below.
class scrapy.extensions.httpcache.CacheStorage
open_spider(spider)
This method gets called after a spider has been opened for crawling. It handles the open_spider signal.
Parameters spider (Spider object) – the spider which has been opened
close_spider(spider)
This method gets called after a spider has been closed. It handles the close_spider signal.
Parameters spider (Spider object) – the spider which has been closed
retrieve_response(spider, request)
Return response if present in cache, or None otherwise.
Parameters
• spider (Spider object) – the spider which generated the request
• request (Request object) – the request to find cached response for
store_response(spider, request, response)
Store the given response in the cache.
Parameters
• spider (Spider object) – the spider for which the response is intended
• request (Request object) – the corresponding request the spider generated
• response (Response object) – the response to store in the cache
In order to use your storage backend, set:
• HTTPCACHE_STORAGE to the Python import path of your custom storage class.
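For illustration only, a minimal in-memory backend implementing this interface might look like the sketch below (it assumes, like the built-in backends, a constructor that receives the settings object; the class itself is hypothetical):
from scrapy.utils.request import request_fingerprint

class InMemoryCacheStorage:
    """Hypothetical cache storage that keeps responses in a plain dict."""

    def __init__(self, settings):
        self._cache = {}

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass

    def retrieve_response(self, spider, request):
        # Return the cached response, or None on a cache miss.
        return self._cache.get(request_fingerprint(request))

    def store_response(self, spider, request, response):
        self._cache[request_fingerprint(request)] = response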
HTTPCACHE_ENABLED
Default: False
Whether the HTTP cache will be enabled.
HTTPCACHE_EXPIRATION_SECS
Default: 0
Expiration time for cached requests, in seconds.
Cached requests older than this time will be re-downloaded. If zero, cached requests will never expire.
Changed in version 0.11: Before 0.11, zero meant cached requests always expire.
HTTPCACHE_DIR
Default: 'httpcache'
The directory to use for storing the (low-level) HTTP cache. If empty, the HTTP cache will be disabled. If a relative
path is given, is taken relative to the project data dir. For more info see: Default structure of Scrapy projects.
HTTPCACHE_IGNORE_HTTP_CODES
Default: []
Don’t cache responses with these HTTP codes.
HTTPCACHE_IGNORE_MISSING
Default: False
If enabled, requests not found in the cache will be ignored instead of downloaded.
HTTPCACHE_IGNORE_SCHEMES
HTTPCACHE_STORAGE
Default: 'scrapy.extensions.httpcache.FilesystemCacheStorage'
The class which implements the cache storage backend.
HTTPCACHE_DBM_MODULE
HTTPCACHE_POLICY
HTTPCACHE_GZIP
HTTPCACHE_ALWAYS_STORE
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS
HttpCompressionMiddleware
class scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
This middleware allows compressed (gzip, deflate) traffic to be sent/received from web sites.
This middleware also supports decoding brotli-compressed responses, provided brotlipy is installed.
HttpCompressionMiddleware Settings
COMPRESSION_ENABLED
Default: True
Whether the Compression middleware will be enabled.
HttpProxyMiddleware
class scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.
RedirectMiddleware
class scrapy.downloadermiddlewares.redirect.RedirectMiddleware
This middleware handles redirection of requests based on response status.
The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.
meta key. The reason behind each redirect in redirect_urls can be found in the redirect_reasons
Request.meta key. For example: [301, 302, 307, 'meta refresh'].
The format of a reason depends on the middleware that handled the corresponding redirect. For
example, RedirectMiddleware indicates the triggering response status code as an integer, while
MetaRefreshMiddleware always uses the 'meta refresh' string as reason.
The RedirectMiddleware can be configured through the following settings (see the settings documentation for
more info):
• REDIRECT_ENABLED
• REDIRECT_MAX_TIMES
If Request.meta has dont_redirect key set to True, the request will be ignored by this middleware.
If you want to handle some redirect status codes in your spider, you can specify these in the
handle_httpstatus_list spider attribute.
For example, if you want the redirect middleware to ignore 301 and 302 responses (and pass them through to your
spider) you can do this:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [301, 302]
The handle_httpstatus_list key of Request.meta can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key handle_httpstatus_all to True if you want to
allow any response code for a request.
RedirectMiddleware settings
REDIRECT_ENABLED
Default: True
Whether the Redirect middleware will be enabled.
REDIRECT_MAX_TIMES
Default: 20
The maximum number of redirections that will be followed for a single request.
MetaRefreshMiddleware
class scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware
This middleware handles redirection of requests based on meta-refresh html tag.
The MetaRefreshMiddleware can be configured through the following settings (see the settings documentation
for more info):
• METAREFRESH_ENABLED
• METAREFRESH_IGNORE_TAGS
• METAREFRESH_MAXDELAY
This middleware obeys the REDIRECT_MAX_TIMES setting and the dont_redirect, redirect_urls and
redirect_reasons request meta keys, as described for RedirectMiddleware.
MetaRefreshMiddleware settings
METAREFRESH_ENABLED
Default: True
Whether the Meta Refresh middleware will be enabled.
METAREFRESH_IGNORE_TAGS
METAREFRESH_MAXDELAY
Default: 100
The maximum meta-refresh delay (in seconds) to follow the redirection. Some sites use meta-refresh for redirecting
to a session expired page, so we restrict automatic redirection to the maximum delay.
RetryMiddleware
class scrapy.downloadermiddlewares.retry.RetryMiddleware
A middleware to retry failed requests that are potentially caused by temporary problems such as a connection
timeout or HTTP 500 error.
Failed pages are collected during the scraping process and rescheduled at the end, once the spider has finished crawling
all regular (non-failed) pages.
The RetryMiddleware can be configured through the following settings (see the settings documentation for more
info):
• RETRY_ENABLED
• RETRY_TIMES
• RETRY_HTTP_CODES
If Request.meta has dont_retry key set to True, the request will be ignored by this middleware.
RetryMiddleware Settings
RETRY_ENABLED
Default: True
Whether the Retry middleware will be enabled.
RETRY_TIMES
Default: 2
Maximum number of times to retry, in addition to the first download.
The maximum number of retries can also be specified per-request using the max_retry_times attribute of
Request.meta. When initialized, the max_retry_times meta key takes precedence over the RETRY_TIMES
setting.
RETRY_HTTP_CODES
RobotsTxtMiddleware
class scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware
This middleware filters out requests forbidden by the robots.txt exclusion standard.
To make sure Scrapy respects robots.txt make sure the middleware is enabled and the ROBOTSTXT_OBEY
setting is enabled.
If Request.meta has dont_obey_robotstxt key set to True the request will be ignored by this middleware
even if ROBOTSTXT_OBEY is enabled.
DownloaderStats
class scrapy.downloadermiddlewares.stats.DownloaderStats
Middleware that stores stats of all requests, responses and exceptions that pass through it.
To use this middleware you must enable the DOWNLOADER_STATS setting.
UserAgentMiddleware
class scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
Middleware that allows spiders to override the default user agent.
In order for a spider to override the default user agent, its user_agent attribute must be set.
AjaxCrawlMiddleware
class scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware
Middleware that finds ‘AJAX crawlable’ page variants based on meta-fragment html tag. See https://developers.
google.com/webmasters/ajax-crawling/docs/getting-started for more info.
Note: Scrapy finds ‘AJAX crawlable’ pages for URLs like 'http://example.com/!#foo=bar' even
without this middleware. AjaxCrawlMiddleware is necessary when URL doesn’t contain '!#'. This is often a
case for ‘index’ or ‘main’ website pages.
AjaxCrawlMiddleware Settings
AJAXCRAWL_ENABLED
Default: False
Whether the AjaxCrawlMiddleware will be enabled. You may want to enable it for broad crawls.
HttpProxyMiddleware settings
HTTPPROXY_ENABLED
Default: True
Whether or not to enable the HttpProxyMiddleware.
HTTPPROXY_AUTH_ENCODING
Default: "latin-1"
The default encoding for proxy authentication on HttpProxyMiddleware.
6.3 Spider Middleware
The spider middleware is a framework of hooks into Scrapy’s spider processing mechanism where you can plug custom
functionality to process the responses that are sent to Spiders for processing and to process the requests and items that
are generated from spiders.
To activate a spider middleware component, add it to the SPIDER_MIDDLEWARES setting, which is a dict whose
keys are the middleware class path and their values are the middleware orders.
Here’s an example:
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
}
The SPIDER_MIDDLEWARES setting is merged with the SPIDER_MIDDLEWARES_BASE setting defined in Scrapy
(and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the
first middleware is the one closer to the engine and the last is the one closer to the spider. In other words, the
process_spider_input() method of each middleware will be invoked in increasing middleware order (100,
200, 300, . . . ), and the process_spider_output() method of each middleware will be invoked in decreasing
order.
To decide which order to assign to your middleware see the SPIDER_MIDDLEWARES_BASE setting and pick a value
according to where you want to insert the middleware. The order does matter because each middleware performs a
different action and your middleware could depend on some previous (or subsequent) middleware being applied.
If you want to disable a built-in middleware (the ones defined in SPIDER_MIDDLEWARES_BASE, and enabled by
default) you must define it in your project’s SPIDER_MIDDLEWARES setting and assign None as its value. For
example, if you want to disable the off-site middleware:
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware
documentation for more info.
Each spider middleware is a Python class that defines one or more of the methods defined below.
The main entry point is the from_crawler class method, which receives a Crawler instance. The Crawler
object gives you access, for example, to the settings.
class scrapy.spidermiddlewares.SpiderMiddleware
process_spider_input(response, spider)
This method is called for each response that goes through the spider middleware and into the spider, for
processing.
process_spider_input() should return None or raise an exception.
If it returns None, Scrapy will continue processing this response, executing all other middlewares until,
finally, the response is handed to the spider for processing.
If it raises an exception, Scrapy won’t bother calling any other spider middleware
process_spider_input() and will call the request errback if there is one, otherwise it will start
the process_spider_exception() chain. The output of the errback is chained back in the other
direction for process_spider_output() to process it, or process_spider_exception() if
it raised an exception.
Parameters
• response (Response object) – the response being processed
• spider (Spider object) – the spider for which this response is intended
process_spider_output(response, result, spider)
This method is called with the results returned from the Spider, after it has processed the response.
process_spider_output() must return an iterable of Request, dict or Item objects.
Parameters
• response (Response object) – the response which generated this output from the spi-
der
• result (an iterable of Request, dict or Item objects) – the result returned by the
spider
• spider (Spider object) – the spider whose result is being processed
process_spider_exception(response, exception, spider)
This method is called when a spider or process_spider_output() method (from a previous spider
middleware) raises an exception.
process_spider_exception() should return either None or an iterable of Request, dict or
Item objects.
If it returns None, Scrapy will continue processing this exception, executing any other
process_spider_exception() in the following middleware components, until no middleware
components are left and the exception reaches the engine (where it’s logged and discarded).
If it returns an iterable the process_spider_output() pipeline kicks in, starting from the next
spider middleware, and no other process_spider_exception() will be called.
Parameters
• response (Response object) – the response being processed when the exception was
raised
• exception (Exception object) – the exception raised
• spider (Spider object) – the spider which raised the exception
process_start_requests(start_requests, spider)
New in version 0.15.
This method is called with the start requests of the spider, and works similarly to the
process_spider_output() method, except that it doesn’t have a response associated and must
return only requests (not items).
It receives an iterable (in the start_requests parameter) and must return another iterable of
Request objects.
Note: When implementing this method in your spider middleware, you should always return an iterable
(that follows the input one) and not consume the whole start_requests iterator, because it can be very large
(or even unbounded) and cause a memory overflow. The Scrapy engine is designed to pull start requests
while it has capacity to process them, so the start requests iterator can be effectively endless when there
is some other condition for stopping the spider (like a time limit or item/page count).
Parameters
• start_requests (an iterable of Request) – the start requests
• spider (Spider object) – the spider to whom the start requests belong
from_crawler(cls, crawler)
If present, this classmethod is called to create a middleware instance from a Crawler. It must return
a new instance of the middleware. Crawler object provides access to all Scrapy core components like
settings and signals; it is a way for middleware to access them and hook its functionality into Scrapy.
Parameters crawler (Crawler object) – crawler that uses this middleware
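As an illustration (a hypothetical middleware, not one that ships with Scrapy), a spider middleware that drops scraped dict items missing a price field could be written like this:
class DropItemsWithoutPriceMiddleware:
    """Hypothetical example of a spider middleware."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            # Requests pass through untouched; dict items without a
            # 'price' key are silently dropped.
            if isinstance(element, dict) and not element.get('price'):
                continue
            yield element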
This page describes all spider middleware components that come with Scrapy. For information on how to use them
and how to write your own spider middleware, see the spider middleware usage guide.
For a list of the components enabled by default (and their orders) see the SPIDER_MIDDLEWARES_BASE setting.
DepthMiddleware
class scrapy.spidermiddlewares.depth.DepthMiddleware
DepthMiddleware is used for tracking the depth of each Request inside the site being scraped. It works by setting
request.meta['depth'] = 0 whenever there is no value previously set (usually just the first Request)
and incrementing it by 1 otherwise.
It can be used to limit the maximum depth to scrape, control Request priority based on their depth, and things
like that.
The DepthMiddleware can be configured through the following settings (see the settings documentation for
more info):
• DEPTH_LIMIT - The maximum depth that will be allowed to crawl for any site. If zero, no limit will be
imposed.
• DEPTH_STATS_VERBOSE - Whether to collect the number of requests for each depth.
• DEPTH_PRIORITY - Whether to prioritize the requests based on their depth.
HttpErrorMiddleware
class scrapy.spidermiddlewares.httperror.HttpErrorMiddleware
Filter out unsuccessful (erroneous) HTTP responses so that spiders don’t have to deal with them, which (most
of the time) imposes an overhead, consumes more resources, and makes the spider logic more complex.
According to the HTTP standard, successful responses are those whose status codes are in the 200-300 range.
If you still want to process response codes outside that range, you can specify which response codes the spider is able
to handle using the handle_httpstatus_list spider attribute or HTTPERROR_ALLOWED_CODES setting.
For example, if you want your spider to handle 404 responses you can do this:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]
The handle_httpstatus_list key of Request.meta can also be used to specify which response codes to
allow on a per-request basis. You can also set the meta key handle_httpstatus_all to True if you want to
allow any response code for a request.
Keep in mind, however, that it’s usually a bad idea to handle non-200 responses, unless you really know what you’re
doing.
For more information see: HTTP Status Code Definitions.
HttpErrorMiddleware settings
HTTPERROR_ALLOWED_CODES
Default: []
Pass all responses with non-200 status codes contained in this list.
HTTPERROR_ALLOW_ALL
Default: False
Pass all responses, regardless of their status code.
OffsiteMiddleware
class scrapy.spidermiddlewares.offsite.OffsiteMiddleware
Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host name isn’t in the spider’s allowed_domains attribute.
All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also
allow bob.www.example.org but not www2.example.com nor example.com.
When your spider returns a request for a domain not belonging to those covered by the spider, this middleware
will log a debug message similar to this one:
DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
To avoid filling the log with too much noise, it will only print one of these messages for each new domain
filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be
printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first
request filtered).
If the spider doesn’t define an allowed_domains attribute, or the attribute is empty, the offsite middleware
will allow all requests.
If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its
domain is not listed in allowed domains.
RefererMiddleware
class scrapy.spidermiddlewares.referer.RefererMiddleware
Populates Request Referer header, based on the URL of the Response which generated it.
RefererMiddleware settings
REFERER_ENABLED
Default: True
Whether to enable the referer middleware.
REFERRER_POLICY
Default: 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'
Referrer Policy to apply when populating the Request “Referer” header: either the path to a
scrapy.spidermiddlewares.referer.ReferrerPolicy subclass or one of the standard W3C policy names
(for example “same-origin” or “no-referrer”).
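For example, either of the following lines in the project settings (a sketch) selects the “same-origin” policy:
REFERRER_POLICY = 'same-origin'
# or, equivalently, the path to the policy class:
REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.SameOriginPolicy'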
Note: You can also set the Referrer Policy per request, using the special "referrer_policy" Request.meta key,
with the same acceptable values as for the REFERRER_POLICY setting.
class scrapy.spidermiddlewares.referer.DefaultReferrerPolicy
A variant of “no-referrer-when-downgrade”, with the addition that “Referer” is not sent if the parent request was
using file:// or s3:// scheme.
Warning: Scrapy’s default referrer policy — just like “no-referrer-when-downgrade”, the W3C-recommended
value for browsers — will send a non-empty “Referer” header from any http(s):// to any https:// URL,
even if the domain is different.
“same-origin” may be a better choice if you want to remove referrer information for cross-domain requests.
class scrapy.spidermiddlewares.referer.NoReferrerPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-no-referrer
The simplest policy is “no-referrer”, which specifies that no referrer information is to be sent along with requests
made from a particular request client to any origin. The header will be omitted entirely.
class scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-no-referrer-when-downgrade
The “no-referrer-when-downgrade” policy sends a full URL along with requests from a TLS-protected environ-
ment settings object to a potentially trustworthy URL, and requests from clients which are not TLS-protected to
any origin.
Requests from TLS-protected clients to non-potentially trustworthy URLs, on the other hand, will contain no
referrer information. A Referer HTTP header will not be sent.
This is a user agent’s default behavior, if no policy is otherwise specified.
Note: “no-referrer-when-downgrade” policy is the W3C-recommended default, and is used by major web browsers.
However, it is NOT Scrapy’s default referrer policy (see DefaultReferrerPolicy).
class scrapy.spidermiddlewares.referer.SameOriginPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-same-origin
The “same-origin” policy specifies that a full URL, stripped for use as a referrer, is sent as referrer information
when making same-origin requests from a particular request client.
Cross-origin requests, on the other hand, will contain no referrer information. A Referer HTTP header will not
be sent.
class scrapy.spidermiddlewares.referer.OriginPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-origin
The “origin” policy specifies that only the ASCII serialization of the origin of the request client is sent as referrer
information when making both same-origin requests and cross-origin requests from a particular request client.
class scrapy.spidermiddlewares.referer.StrictOriginPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-strict-origin
The “strict-origin” policy sends the ASCII serialization of the origin of the request client when making requests:
• from a TLS-protected environment settings object to a potentially trustworthy URL, and
• from non-TLS-protected environment settings objects to any origin.
Requests from TLS-protected request clients to non-potentially trustworthy URLs, on the other hand, will
contain no referrer information. A Referer HTTP header will not be sent.
class scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-origin-when-cross-origin
The “origin-when-cross-origin” policy specifies that a full URL, stripped for use as a referrer, is sent as referrer
information when making same-origin requests from a particular request client, and only the ASCII serialization
of the origin of the request client is sent as referrer information when making cross-origin requests from a
particular request client.
class scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-strict-origin-when-cross-origin
The “strict-origin-when-cross-origin” policy specifies that a full URL, stripped for use as a referrer, is sent as
referrer information when making same-origin requests from a particular request client, and only the ASCII
serialization of the origin of the request client when making cross-origin requests:
• from a TLS-protected environment settings object to a potentially trustworthy URL, and
• from non-TLS-protected environment settings objects to any origin.
Requests from TLS-protected clients to non-potentially trustworthy URLs, on the other hand, will contain no
referrer information. A Referer HTTP header will not be sent.
class scrapy.spidermiddlewares.referer.UnsafeUrlPolicy
https://www.w3.org/TR/referrer-policy/#referrer-policy-unsafe-url
The “unsafe-url” policy specifies that a full URL, stripped for use as a referrer, is sent along with both cross-
origin requests and same-origin requests made from a particular request client.
Note: The policy’s name doesn’t lie; it is unsafe. This policy will leak origins and paths from TLS-protected
resources to insecure origins. Carefully consider the impact of setting such a policy for potentially sensitive
documents.
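To select one of these policies for a project, the REFERRER_POLICY setting accepts either the import path of a policy
class or the corresponding W3C policy name; a minimal settings.py sketch:

REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.SameOriginPolicy'
# or, equivalently, the W3C name:
# REFERRER_POLICY = 'same-origin'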
UrlLengthMiddleware
class scrapy.spidermiddlewares.urllength.UrlLengthMiddleware
Filters out requests with URLs longer than URLLENGTH_LIMIT.
The UrlLengthMiddleware can be configured through the following settings (see the settings documentation
for more info):
• URLLENGTH_LIMIT - The maximum URL length to allow for crawled URLs.
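For example, to allow longer URLs on a crawl that targets sites with very long query strings, a settings.py entry
could look like this sketch (the value is arbitrary):

URLLENGTH_LIMIT = 5000  # the default (2083) mirrors the classic browser URL-length limit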
6.4 Extensions
The extensions framework provides a mechanism for inserting your own custom functionality into Scrapy.
Extensions are just regular classes that are instantiated at Scrapy startup, when extensions are initialized.
Extensions use the Scrapy settings to manage their settings, just like any other Scrapy code.
It is customary for extensions to prefix their settings with their own name, to avoid collisions with existing (and
future) extensions. For example, a hypothetical extension to handle Google Sitemaps would use settings like
GOOGLESITEMAP_ENABLED, GOOGLESITEMAP_DEPTH, and so on.
Extensions are loaded and activated at startup by instantiating a single instance of the extension class. Therefore, all
the extension initialization code must be performed in the class constructor (__init__ method).
To make an extension available, add it to the EXTENSIONS setting in your Scrapy settings. In EXTENSIONS, each
extension is represented by a string: the full Python path to the extension’s class name. For example:
EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'scrapy.extensions.telnet.TelnetConsole': 500,
}
As you can see, the EXTENSIONS setting is a dict where the keys are the extension paths, and their values are the
orders, which define the extension loading order. The EXTENSIONS setting is merged with the EXTENSIONS_BASE
setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled
extensions.
As extensions typically do not depend on each other, their loading order is irrelevant in most cases. This is why the
EXTENSIONS_BASE setting defines all extensions with the same order (0). However, this feature can be exploited if
you need to add an extension which depends on other extensions already loaded.
Not all available extensions will be enabled. Some of them usually depend on a particular setting. For example, the
HTTP Cache extension is available by default but disabled unless the HTTPCACHE_ENABLED setting is set.
In order to disable an extension that comes enabled by default (i.e. those included in the EXTENSIONS_BASE setting)
you must set its order to None. For example:

EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': None,
}
Each extension is a Python class. The main entry point for a Scrapy extension (this also includes middlewares and
pipelines) is the from_crawler class method which receives a Crawler instance. Through the Crawler object
you can access settings, signals, stats, and also control the crawling behaviour.
Typically, extensions connect to signals and perform tasks triggered by them.
Finally, if the from_crawler method raises the NotConfigured exception, the extension will be disabled. Oth-
erwise, the extension will be enabled.
Sample extension
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension
will log a message every time:
• a spider is opened
• a spider is closed
• a specific number of items are scraped
The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through
the MYEXT_ITEMCOUNT setting.
Here is the code of such an extension:

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class SpiderOpenCloseLogging(object):

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
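        # a sketch of the remainder, following the pattern described above:
        # read the item count from settings (the fallback of 1000 is assumed here),
        # instantiate the extension and connect it to the signals it needs
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)
        ext = cls(item_count)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)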
class scrapy.extensions.logstats.LogStats
Log basic stats like crawled pages and scraped items.
class scrapy.extensions.corestats.CoreStats
Enable the collection of core statistics, provided the stats collection is enabled (see Stats Collection).
class scrapy.extensions.telnet.TelnetConsole
Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can
be very useful for debugging.
The telnet console must be enabled by the TELNETCONSOLE_ENABLED setting, and the server will listen on the port
specified in TELNETCONSOLE_PORT.
class scrapy.extensions.memusage.MemoryUsage
Monitors the memory used by the Scrapy process that runs the spider and:
1. sends a notification e-mail when it exceeds a certain value
2. closes the spider when it exceeds a certain value
The notification e-mails can be triggered when a certain warning value is reached (MEMUSAGE_WARNING_MB) and
when the maximum value is reached (MEMUSAGE_LIMIT_MB) which will also cause the spider to be closed and the
Scrapy process to be terminated.
This extension is enabled by the MEMUSAGE_ENABLED setting and can be configured with the following settings:
• MEMUSAGE_LIMIT_MB
• MEMUSAGE_WARNING_MB
• MEMUSAGE_NOTIFY_MAIL
• MEMUSAGE_CHECK_INTERVAL_SECONDS
class scrapy.extensions.memdebug.MemoryDebugger
An extension for debugging memory usage. It collects information about:
• objects uncollected by the Python garbage collector
• objects left alive that shouldn't be. For more info, see Debugging memory leaks with trackref
To enable this extension, turn on the MEMDEBUG_ENABLED setting. The info will be stored in the stats.
class scrapy.extensions.closespider.CloseSpider
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.
The conditions for closing a spider can be configured through the following settings:
• CLOSESPIDER_TIMEOUT
• CLOSESPIDER_ITEMCOUNT
• CLOSESPIDER_PAGECOUNT
• CLOSESPIDER_ERRORCOUNT
CLOSESPIDER_TIMEOUT
Default: 0
An integer which specifies a number of seconds. If the spider remains open for more than that number of seconds, it
will be automatically closed with the reason closespider_timeout. If zero (or not set), spiders won't be closed
by timeout.
CLOSESPIDER_ITEMCOUNT
Default: 0
An integer which specifies a number of items. If the spider scrapes more than that amount and those items are passed
by the item pipeline, the spider will be closed with the reason closespider_itemcount. Requests which are
currently in the downloader queue (up to CONCURRENT_REQUESTS requests) are still processed. If zero (or not
set), spiders won't be closed by the number of passed items.
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
StatsMailer extension
class scrapy.extensions.statsmailer.StatsMailer
This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including
the Scrapy stats collected. The email will be sent to all recipients specified in the STATSMAILER_RCPTS setting.
Debugging extensions
class scrapy.extensions.debug.StackTraceDump
Dumps information about the running process when a SIGQUIT or SIGUSR2 signal is received. The information
dumped is the following:
1. engine status (using scrapy.utils.engine.get_engine_status())
2. live references (see Debugging memory leaks with trackref )
3. stack trace of all threads
After the stack trace and engine status are dumped, the Scrapy process continues running normally.
This extension only works on POSIX-compliant platforms (i.e. not Windows), because the SIGQUIT and SIGUSR2
signals are not available on Windows.
There are at least two ways to send Scrapy the SIGQUIT signal:
1. By pressing Ctrl-\ while a Scrapy process is running (Linux only)
2. By running this command (assuming <pid> is the process id of the Scrapy process):
kill -QUIT <pid>
Debugger extension
class scrapy.extensions.debug.Debugger
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 signal is received. After the debugger
is exited, the Scrapy process continues running normally.
For more info see Debugging in Python.
This extension only works on POSIX-compliant platforms (i.e. not Windows).
The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class
method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access
them and hook their functionality into Scrapy.
The Extension Manager is responsible for loading and keeping track of installed extensions and it’s configured through
the EXTENSIONS setting which contains a dictionary of all available extensions and their order similar to how you
configure the downloader middlewares.
class scrapy.crawler.Crawler(spidercls, settings)
The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a
scrapy.settings.Settings object.
settings
The settings manager of this crawler.
This is used by extensions & middlewares to access the Scrapy settings of this crawler.
For an introduction on Scrapy settings see Settings.
For the API see Settings class.
signals
The signals manager of this crawler.
This is used by extensions & middlewares to hook themselves into Scrapy functionality.
For an introduction on signals see Signals.
For the API see SignalManager class.
stats
The stats collector of this crawler.
This is used from extensions & middlewares to record stats of their behaviour, or access stats collected by
other extensions.
For an introduction on stats collection see Stats Collection.
For the API see StatsCollector class.
extensions
The extension manager that keeps track of enabled extensions.
Most extensions won’t need to access this attribute.
For an introduction on extensions and a list of available extensions on Scrapy see Extensions.
engine
The execution engine, which coordinates the core crawling logic between the scheduler, downloader and
spiders.
Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler
behaviour, although this is an advanced use and this API is not yet stable.
spider
Spider currently being crawled. This is an instance of the spider class provided while constructing the
crawler, and it is created with the arguments given in the crawl() method.
crawl(*args, **kwargs)
Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while
setting the execution engine in motion.
Returns a deferred that is fired when the crawl is finished.
stop()
Starts a graceful stop of the crawler and returns a deferred that is fired when the crawler is stopped.
class scrapy.crawler.CrawlerRunner(settings=None)
This is a convenient helper class that keeps track of, manages and runs crawlers inside an already set up Twisted
reactor.
The CrawlerRunner object must be instantiated with a Settings object.
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that
manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)
Run a crawler with the provided arguments.
It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this
parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters
• crawler_or_spidercls (Crawler instance, Spider subclass or string) – already
created crawler, or a spider class or spider’s name inside the project to create it
• args (list) – arguments to initialize the spider
• kwargs (dict) – keyword arguments to initialize the spider
crawlers
Set of crawlers started by crawl() and managed by this class.
create_crawler(crawler_or_spidercls)
Return a Crawler object.
• If crawler_or_spidercls is a Crawler, it is returned as-is.
• If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
• If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy
project (using spider loader), then creates a Crawler instance for it.
join()
Returns a deferred that is fired when all managed crawlers have completed their executions.
stop()
Stops all the crawling jobs taking place simultaneously.
Returns a deferred that is fired when they all have ended.
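A rough sketch of driving a single crawl with CrawlerRunner from a plain script, using an existing Twisted reactor
('myspider' is a hypothetical spider name registered in the project):

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()                   # CrawlerRunner does not set up logging for you
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('myspider')          # deferred fired when the crawl finishes
d.addBoth(lambda _: reactor.stop())   # stop the reactor once the crawl is done
reactor.run()                         # blocks until reactor.stop() is called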
class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)
Bases: scrapy.crawler.CrawlerRunner
A class to run multiple Scrapy crawlers in a process simultaneously.
This class extends CrawlerRunner by adding support for starting a Twisted reactor and handling shutdown
signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
This utility should be a better fit than CrawlerRunner if you aren’t running another Twisted reactor within
your application.
The CrawlerProcess object must be instantiated with a Settings object.
Parameters install_root_handler – whether to install root logging handler (default: True)
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that
manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)
Run a crawler with the provided arguments.
It will call the given Crawler’s crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn’t a Crawler instance, this method will try to create one using this
parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters
• crawler_or_spidercls (Crawler instance, Spider subclass or string) – already
created crawler, or a spider class or spider’s name inside the project to create it
• args (list) – arguments to initialize the spider
• kwargs (dict) – keyword arguments to initialize the spider
crawlers
Set of crawlers started by crawl() and managed by this class.
create_crawler(crawler_or_spidercls)
Return a Crawler object.
• If crawler_or_spidercls is a Crawler, it is returned as-is.
• If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
• If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy
project (using spider loader), then creates a Crawler instance for it.
join()
Returns a deferred that is fired when all managed crawlers have completed their executions.
start(stop_after_crawl=True)
This method starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and
installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.
If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using
join().
Parameters stop_after_crawl (boolean) – whether to stop the reactor after all crawlers
have finished
stop()
Stops all the crawling jobs taking place simultaneously.
Returns a deferred that is fired when they all have ended.
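For comparison, a sketch of the same single crawl driven by CrawlerProcess, which starts and stops the Twisted
reactor itself ('myspider' is again a hypothetical spider name registered in the project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('myspider')   # schedule the crawl; does not block
process.start()             # start the reactor and block until all crawls finish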
scrapy.settings.SETTINGS_PRIORITIES
Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Higher
priorities take precedence over lower ones when setting and retrieving values in the Settings class.
SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}
copy()
Make a deep copy of current settings.
This method returns a new instance of the Settings class, populated with the same values and their
priorities.
Modifications to the new object won’t be reflected on the original settings.
copy_to_dict()
Make a copy of current settings and convert to a dict.
This method returns a new dict populated with the same values and their priorities as the current settings.
Modifications to the returned dict won’t be reflected on the original settings.
This method can be useful for example for printing settings in Scrapy shell.
freeze()
Disable further changes to the current settings.
After calling this method, the present state of the settings will become immutable. Trying to change values
through the set() method and its variants will raise an error.
frozencopy()
Return an immutable copy of the current settings.
Alias for a freeze() call in the object returned by copy().
get(name, default=None)
Get a setting value without affecting its original type.
Parameters
• name (string) – the setting name
• default (any) – the value to return if no setting is found
getbool(name, default=False)
Get a setting value as a boolean.
1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.
For example, settings populated through environment variables set to '0' will return False when using
this method.
Parameters
• name (string) – the setting name
• default (any) – the value to return if no setting is found
getdict(name, default=None)
Get a setting value as a dictionary. If the setting's original type is a dictionary, a copy of it will be returned.
If it is a string it will be evaluated as a JSON dictionary. In the case that it is a BaseSettings instance
itself, it will be converted to a dictionary, containing all its current settings values as they would be returned
by get(), and losing all information about priority and mutability.
Parameters
• name (string) – the setting name
• default (any) – the value to return if no setting is found
getfloat(name, default=0.0)
Get a setting value as a float.
Parameters
• name (string) – the setting name
• default (any) – the value to return if no setting is found
update(values, priority='project')
Store key/value pairs with a given priority.
This is a helper function that calls set() for every item of values with the provided priority.
If values is a string, it is assumed to be JSON-encoded and parsed into a dict with json.loads()
first. If it is a BaseSettings instance, the per-key priorities will be used and the priority parameter
ignored. This allows inserting/updating settings with different priorities with a single command.
Parameters
• values (dict or string or BaseSettings) – the settings names and values
• priority (string or int) – the priority of the settings. Should be a key of
SETTINGS_PRIORITIES or an integer
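A short sketch of these methods in action on a standalone Settings object (the setting name and values are only
illustrative):

from scrapy.settings import Settings

settings = Settings()
settings.set('DOWNLOAD_DELAY', '2.5', priority='project')
print(settings.getfloat('DOWNLOAD_DELAY'))   # 2.5, converted from the string

settings.update({'DOWNLOAD_DELAY': 5.0}, priority='cmdline')
print(settings.getfloat('DOWNLOAD_DELAY'))   # 5.0, since 'cmdline' outranks 'project'

frozen = settings.frozencopy()               # immutable copy; further set() calls raise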
class scrapy.spiderloader.SpiderLoader
This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project
setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an
errorless execution.
from_settings(settings)
This class method is used by Scrapy to create an instance of the class. It’s called with the current project
settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.
Parameters settings (Settings instance) – project settings
load(spider_name)
Get the Spider class with the given name. It’ll look into the previously loaded spiders for a spider class
with name spider_name and will raise a KeyError if not found.
Parameters spider_name (str) – spider class name
list()
Get the names of the available spiders in the project.
find_by_request(request)
List the spiders’ names that can handle the given request. Will try to match the request’s url against the
domains of the spiders.
Parameters request (Request instance) – queried request
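A brief sketch of using the spider loader directly from a script inside a project ('myspider' is a hypothetical spider
name):

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

loader = SpiderLoader.from_settings(get_project_settings())
print(loader.list())                  # names of all spiders found in SPIDER_MODULES
spider_cls = loader.load('myspider')  # raises KeyError if no such spider exists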
class scrapy.signalmanager.SignalManager(sender=_Anonymous)
There are several Stats Collectors available under the scrapy.statscollectors module and they all implement
the Stats Collector API defined by the StatsCollector class (which they all inherit from).
class scrapy.statscollectors.StatsCollector
get_value(key, default=None)
Return the value for the given stats key or default if it doesn’t exist.
get_stats()
Get all stats from the currently running spider as a dict.
set_value(key, value)
Set the given value for the given stats key.
set_stats(stats)
Override the current stats with the dict passed in stats argument.
inc_value(key, count=1, start=0)
Increment the value of the given stats key, by the given count, assuming the start value given (when it’s not
set).
max_value(key, value)
Set the given value for the given key only if current value for the same key is lower than value. If there is
no current value for the given key, the value is always set.
min_value(key, value)
Set the given value for the given key only if current value for the same key is greater than value. If there is
no current value for the given key, the value is always set.
clear_stats()
Clear all stats.
The following methods are not part of the stats collection API but are instead used when implementing custom stats
collectors:
open_spider(spider)
Open the given spider for stats collection.
close_spider(spider)
Close the given spider. After this is called, no more specific stats can be accessed or collected.
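For example, an item pipeline (or extension) can record its own stats through the crawler's stats collector; a minimal
sketch, with a hypothetical pipeline name and illustrative stat keys:

class StatsRecordingPipeline(object):
    """Hypothetical pipeline that records a few custom stats."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_item(self, item, spider):
        self.stats.inc_value('custom/items_seen')
        self.stats.max_value('custom/max_price', item.get('price', 0))
        return item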
6.6 Signals
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your
Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not
provided out of the box.
Even though signals provide several arguments, the handlers that catch them don’t need to accept all of them - the
signal dispatching mechanism will only deliver the arguments that the handler receives.
You can connect to signals (or send your own) through the Signals API.
Here is a simple example showing how you can catch signals and perform some action:
from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)
Some signals support returning Twisted deferreds from their handlers, see the Built-in signals reference below to know
which ones.
engine_started
scrapy.signals.engine_started()
Sent when the Scrapy engine has started crawling.
This signal supports returning deferreds from their handlers.
Note: This signal may be fired after the spider_opened signal, depending on how the spider was started. So
don’t rely on this signal getting fired before spider_opened.
engine_stopped
scrapy.signals.engine_stopped()
Sent when the Scrapy engine is stopped (for example, when a crawling process has finished).
This signal supports returning deferreds from their handlers.
item_scraped
item_dropped
item_error
Parameters
• item (dict or Item object) – the item dropped from the Item Pipeline
• response (Response object) – the response being processed when the exception was
raised
• spider (Spider object) – the spider which raised the exception
• failure (Failure object) – the exception raised as a Twisted Failure object
spider_closed
scrapy.signals.spider_closed(spider, reason)
Sent after a spider has been closed. This can be used to release per-spider resources reserved on
spider_opened.
This signal supports returning deferreds from their handlers.
Parameters
• spider (Spider object) – the spider which has been closed
• reason (str) – a string which describes the reason why the spider was closed. If it was
closed because the spider has completed scraping, the reason is 'finished'. Otherwise,
if the spider was manually closed by calling the close_spider engine method, then
the reason is the one passed in the reason argument of that method (which defaults to
'cancelled'). If the engine was shutdown (for example, by hitting Ctrl-C to stop it) the
reason will be 'shutdown'.
spider_opened
scrapy.signals.spider_opened(spider)
Sent after a spider has been opened for crawling. This is typically used to reserve per-spider resources, but can
be used for any task that needs to be performed when a spider is opened.
This signal supports returning deferreds from their handlers.
Parameters spider (Spider object) – the spider which has been opened
spider_idle
scrapy.signals.spider_idle(spider)
Sent when a spider has gone idle, which means the spider has no further:
• requests waiting to be downloaded
• requests scheduled
• items being processed in the item pipeline
If the idle state persists after all handlers of this signal have finished, the engine starts closing the spider. After
the spider has finished closing, the spider_closed signal is sent.
You may raise a DontCloseSpider exception to prevent the spider from being closed.
This signal does not support returning deferreds from their handlers.
Parameters spider (Spider object) – the spider which has gone idle
Note: Scheduling some requests in your spider_idle handler does not guarantee that it can prevent the spider
from being closed, although it sometimes can. That’s because the spider may still remain idle if all the scheduled
requests are rejected by the scheduler (e.g. filtered due to duplication).
spider_error
request_scheduled
scrapy.signals.request_scheduled(request, spider)
Sent when the engine schedules a Request, to be downloaded later.
The signal does not support returning deferreds from their handlers.
Parameters
• request (Request object) – the request that reached the scheduler
• spider (Spider object) – the spider that yielded the request
request_dropped
scrapy.signals.request_dropped(request, spider)
Sent when a Request, scheduled by the engine to be downloaded later, is rejected by the scheduler.
The signal does not support returning deferreds from their handlers.
Parameters
• request (Request object) – the request that reached the scheduler
• spider (Spider object) – the spider that yielded the request
request_reached_downloader
scrapy.signals.request_reached_downloader(request, spider)
Sent when a Request reached the downloader.
The signal does not support returning deferreds from their handlers.
Parameters
• request (Request object) – the request that reached the downloader
• spider (Spider object) – the spider that yielded the request
response_received
response_downloaded
Once you have scraped your items, you often want to persist or export those items, to use the data in some other
application. That is, after all, the whole purpose of the scraping process.
For this purpose Scrapy provides a collection of Item Exporters for different output formats, such as XML, CSV or
JSON.
If you are in a hurry, and just want to use an Item Exporter to output scraped data see the Feed exports. Otherwise, if
you want to know how Item Exporters work or need more custom functionality (not covered by the default exports),
continue reading below.
In order to use an Item Exporter, you must instantiate it with its required args. Each Item Exporter requires different
arguments, so check each exporter documentation to be sure, in Built-in Item Exporters reference. After you have
instantiated your exporter, you have to:
1. call the method start_exporting() in order to signal the beginning of the exporting process
2. call the export_item() method for each item you want to export
3. and finally call the finish_exporting() to signal the end of the exporting process
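Put together, exporting a couple of items with one of the built-in exporters might look like this sketch (the file name
and item values are illustrative; exporters expect a file opened in binary mode):

from scrapy.exporters import JsonItemExporter

with open('items.json', 'wb') as f:      # binary mode is required by the exporters
    exporter = JsonItemExporter(f)
    exporter.start_exporting()
    exporter.export_item({'product': 'Color TV', 'price': '1200'})
    exporter.export_item({'product': 'DVD player', 'price': '200'})
    exporter.finish_exporting()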
Here you can see an Item Pipeline which uses multiple Item Exporters to group scraped items into different files
according to the value of one of their fields:

from scrapy.exporters import XmlItemExporter


class PerYearXmlExportPipeline(object):
    """Distribute items across multiple XML files according to their 'year' field"""
By default, the field values are passed unmodified to the underlying serialization library, and the decision of how to
serialize them is delegated to each particular serialization library.
However, you can customize how each field value is serialized before it is passed to the serialization library.
There are two ways to customize how a field will be serialized, which are described next.
If you use Item you can declare a serializer in the field metadata. The serializer must be a callable which receives a
value and returns its serialized form.
Example:
import scrapy
def serialize_price(value):
return '$ %s' % str(value)
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field(serializer=serialize_price)
You can also override the serialize_field() method to customize how your field value will be exported.
Make sure you call the base class serialize_field() method after your custom code.
Example:

from scrapy.exporters import XmlItemExporter


class ProductXmlExporter(XmlItemExporter):
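    # a sketch: prefix price values with a dollar sign and defer everything
    # else to the base implementation
    def serialize_field(self, field, name, value):
        if name == 'price':
            return '$ %s' % str(value)
        return super(ProductXmlExporter, self).serialize_field(field, name, value)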
Here is a list of the Item Exporters bundled with Scrapy. Some of them contain output examples, which assume you’re
exporting these two items:
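For instance, the two items could be Product objects along these lines (illustrative values matching the output
samples below):

Product(name='Color TV', price='1200')
Product(name='DVD player', price='200')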
BaseItemExporter
finish_exporting()
Signal the end of the exporting process. Some exporters may use this to generate some required footer (for
example, the XmlItemExporter). You must always call this method after you have no more items to
export.
fields_to_export
A list with the name of the fields that will be exported, or None if you want to export all fields. Defaults to
None.
Some exporters (like CsvItemExporter) respect the order of the fields defined in this attribute.
Some exporters may require the fields_to_export list in order to export the data properly when spiders return
dicts (not Item instances).
export_empty_fields
Whether to include empty/unpopulated item fields in the exported data. Defaults to False. Some ex-
porters (like CsvItemExporter) ignore this attribute and always export all empty fields.
This option is ignored for dict items.
encoding
The encoding that will be used to encode unicode values. This only affects unicode values (which are
always serialized to str using this encoding). Other value types are passed unchanged to the specific
serialization library.
indent
Number of spaces used to indent the output on each level. Defaults to 0.
• indent=None selects the most compact representation, all items in the same line with no indenta-
tion
• indent<=0 each item on its own line, no indentation
• indent>0 each item on its own line, indented with the provided numeric value
XmlItemExporter
Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each
value inside a <value> element. This is for convenience, as multi-valued fields are very common.
For example, consider an item whose name field holds the list ['John', 'Doe']:
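A sketch of how XmlItemExporter would serialize it (the exact XML declaration and surrounding whitespace depend
on the exporter options):

<item>
  <name>
    <value>John</value>
    <value>Doe</value>
  </name>
</item>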
CsvItemExporter
product,price
Color TV,1200
DVD player,200
PickleItemExporter
PprintItemExporter
JsonItemExporter
Warning: JSON is a very simple and flexible serialization format, but it doesn't scale well for large amounts
of data, since incremental (aka stream-mode) parsing is not well supported (if at all) among JSON parsers (in
any language), and most of them just parse the entire object in memory. If you want the power and simplicity
of JSON with a more stream-friendly format, consider using JsonLinesItemExporter instead, or
splitting the output into multiple chunks.
JsonLinesItemExporter
Unlike the one produced by JsonItemExporter, the format produced by this exporter is well suited for
serializing large amounts of data.
Architecture overview Understand the Scrapy architecture.
Downloader Middleware Customize how pages get requested and downloaded.
Spider Middleware Customize the input and output of your spiders.
Extensions Extend Scrapy with your custom functionality.
Core API Use it from extensions and middlewares to extend Scrapy functionality.
Signals See all available signals and how to work with them.
Item Exporters Quickly export your scraped items to a file (XML, CSV, etc).
Note: Scrapy 1.x will be the last series supporting Python 2. Scrapy 2.0, planned for Q4 2019 or Q1 2020, will
support Python 3 only.
Enforce lxml 4.3.5 or lower for Python 3.4 (issue 3912, issue 3918).
Note: Make sure you install Scrapy 1.7.1. The Scrapy 1.7.0 package in PyPI is the result of an erroneous commit
tagging and does not include all the changes described below.
Highlights:
• Improvements for crawls targeting multiple domains
Backward-incompatible changes
New features
• Exceptions from ItemLoader input and output processors are now more verbose (issue 3836, issue 3840)
• Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler now fail gracefully if
they receive a Spider subclass instance instead of the subclass itself (issue 2283, issue 3610, issue 3872)
Bug fixes
• process_spider_exception() is now also invoked for generators (issue 220, issue 2061)
• System exceptions like KeyboardInterrupt are no longer caught (issue 3726)
• ItemLoader.load_item() no longer makes later calls to ItemLoader.get_output_value() or
ItemLoader.load_item() return empty data (issue 3804, issue 3819)
• The images pipeline (ImagesPipeline) no longer ignores these Amazon S3 settings:
AWS_ENDPOINT_URL, AWS_REGION_NAME, AWS_USE_SSL, AWS_VERIFY (issue 3625)
• Fixed a memory leak in MediaPipeline affecting, for example, non-200 responses and exceptions from
custom middlewares (issue 3813)
• Requests with private callbacks are now correctly unserialized from disk (issue 3790)
• FormRequest.from_response() now handles invalid methods like major web browsers (issue 3777,
issue 3794)
Documentation
• A new topic, Selecting dynamically-loaded content, covers recommended approaches to read dynamically-
loaded data (issue 3703)
• Broad Crawls now features information about memory usage (issue 1264, issue 3866)
• The documentation of Rule now covers how to access the text of a link when using CrawlSpider (issue
3711, issue 3712)
• A new section, Writing your own storage backend, covers writing a custom cache storage backend for
HttpCacheMiddleware (issue 3683, issue 3692)
• A new FAQ entry, How to split an item into multiple items in an item pipeline?, explains what to do when you
want to split an item into multiple items from an item pipeline (issue 2240, issue 3672)
• Updated the FAQ entry about crawl order to explain why the first few requests rarely follow the desired order
(issue 1739, issue 3621)
• The LOGSTATS_INTERVAL setting (issue 3730), the FilesPipeline.file_path and
ImagesPipeline.file_path methods (issue 2253, issue 3609) and the Crawler.stop() method
(issue 3842) are now documented
• Some parts of the documentation that were confusing or misleading are now clearer (issue 1347, issue 1789,
issue 2289, issue 3069, issue 3615, issue 3626, issue 3668, issue 3670, issue 3673, issue 3728, issue 3762, issue
3861, issue 3882)
• Minor documentation fixes (issue 3648, issue 3649, issue 3662, issue 3674, issue 3676, issue 3694, issue 3724,
issue 3764, issue 3767, issue 3791, issue 3797, issue 3806, issue 3812)
Deprecation removals
• From scrapy.core.downloader.handlers:
– http.HttpDownloadHandler (use http10.HTTP10DownloadHandler)
• scrapy.loader.ItemLoader._get_values (use _get_xpathvalues)
• scrapy.loader.XPathItemLoader (use ItemLoader)
• scrapy.log (see Logging)
• From scrapy.pipelines:
– files.FilesPipeline.file_key (use file_path)
– images.ImagesPipeline.file_key (use file_path)
– images.ImagesPipeline.image_key (use file_path)
– images.ImagesPipeline.thumb_key (use thumb_path)
• From both scrapy.selector and scrapy.selector.lxmlsel:
– HtmlXPathSelector (use Selector)
– XmlXPathSelector (use Selector)
– XPathSelector (use Selector)
– XPathSelectorList (use Selector)
• From scrapy.selector.csstranslator:
– ScrapyGenericTranslator (use parsel.csstranslator.GenericTranslator)
– ScrapyHTMLTranslator (use parsel.csstranslator.HTMLTranslator)
– ScrapyXPathExpr (use parsel.csstranslator.XPathExpr)
• From Selector:
– _root (both the constructor argument and the object property, use root)
– extract_unquoted (use getall)
– select (use xpath)
• From SelectorList:
– extract_unquoted (use getall)
– select (use xpath)
– x (use xpath)
• scrapy.spiders.BaseSpider (use Spider)
• From Spider (and subclasses):
– DOWNLOAD_DELAY (use download_delay)
– set_crawler (use from_crawler())
• scrapy.spiders.spiders (use SpiderLoader)
• scrapy.telnet (use scrapy.extensions.telnet)
• From scrapy.utils.python:
– str_to_unicode (use to_unicode)
– unicode_to_str (use to_bytes)
• scrapy.utils.response.body_or_str
The following deprecated settings have also been removed (issue 3578):
• SPIDER_MANAGER_CLASS (use SPIDER_LOADER_CLASS)
Deprecations
Other changes
• It is now possible to run all tests from the same tox environment in parallel; the documentation now covers this
and other ways to run tests (issue 3707)
• It is now possible to generate an API documentation coverage report (issue 3806, issue 3810, issue 3860)
• The documentation policies now require docstrings (issue 3701) that follow PEP 257 (issue 3748)
• Internal fixes and cleanup (issue 3629, issue 3643, issue 3684, issue 3698, issue 3734, issue 3735, issue 3736,
issue 3737, issue 3809, issue 3821, issue 3825, issue 3827, issue 3833, issue 3857, issue 3877)
Highlights:
• better Windows support;
• Python 3.7 compatibility;
• big documentation improvements, including a switch from .extract_first() + .extract() API to
.get() + .getall() API;
• feed exports, FilePipeline and MediaPipeline improvements;
• better extensibility: item_error and request_reached_downloader signals; from_crawler sup-
port for feed exporters, feed storages and dupefilters.
• scrapy.contracts fixes and new features;
• telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22);
• clean-up of the deprecated code;
• various bug fixes, small new features and usability improvements across the codebase.
While these are not changes in Scrapy itself, but rather in the parsel library which Scrapy uses for XPath/CSS selectors,
they are worth mentioning here. Scrapy now depends on parsel >= 1.5, and the Scrapy documentation has been updated
to follow recent parsel API conventions.
The most visible change is that the .get() and .getall() selector methods are now preferred over
.extract_first() and .extract(). We feel that these new methods result in more concise and readable
code. See extract() and extract_first() for more details.
Note: There are currently no plans to deprecate .extract() and .extract_first() methods.
Another useful new feature is the introduction of Selector.attrib and SelectorList.attrib properties,
which make it easier to get attributes of HTML elements. See Selecting element attributes.
CSS selectors are cached in parsel >= 1.5, which makes them faster when the same CSS path is used many times. This
is very common in case of Scrapy spiders: callbacks are usually called several times, on different pages.
If you’re using custom Selector or SelectorList subclasses, a backward incompatible change in parsel may
affect your code. See parsel changelog for a detailed description, as well as for the full list of improvements.
Telnet console
Backward incompatible: Scrapy’s telnet console now requires username and password. See Telnet Console for more
details. This change fixes a security issue; see Scrapy 1.5.2 (2019-01-22) release notes for details.
• from_crawler support is added to feed exporters and feed storages. This, among other things, allows
accessing Scrapy settings from custom feed storages and exporters (issue 1605, issue 3348).
• from_crawler support is added to dupefilters (issue 2956); this allows accessing, e.g., settings or a spider
from a dupefilter.
• item_error is fired when an error happens in a pipeline (issue 3256);
• request_reached_downloader is fired when Downloader gets a new Request; this signal can be useful
e.g. for custom Schedulers (issue 3393).
• new SitemapSpider sitemap_filter() method which allows selecting sitemap entries based on their
attributes in SitemapSpider subclasses (issue 3512).
• Lazy loading of Downloader Handlers is now optional; this enables better initialization error handling in custom
Downloader Handlers (issue 3394).
scrapy.contracts improvements
Usability improvements
Bug fixes
• fixed issue with extra blank lines in .csv exports under Windows (issue 3039);
• proper handling of pickling errors in Python 3 when serializing objects for disk queues (issue 3082)
• flags are now preserved when copying Requests (issue 3342);
• FormRequest.from_response clickdata shouldn’t ignore elements with input[type=image] (issue 3153).
• FormRequest.from_response should preserve duplicate keys (issue 3247)
Documentation improvements
• Docs are re-written to suggest .get/.getall API instead of .extract/.extract_first. Also, Selectors docs are updated
and re-structured to match latest parsel docs; they now contain more topics, such as Selecting element attributes
or Extensions to CSS Selectors (issue 3390).
• Using your browser’s Developer Tools for scraping is a new tutorial which replaces old Firefox and Firebug
tutorials (issue 3400).
• SCRAPY_PROJECT environment variable is documented (issue 3518);
• troubleshooting section is added to install instructions (issue 3517);
• improved links to beginner resources in the tutorial (issue 3367, issue 3468);
• fixed RETRY_HTTP_CODES default values in docs (issue 3335);
Deprecation removals
Compatibility shims for pre-1.0 Scrapy module names are removed (issue 3318):
• scrapy.command
• scrapy.contrib (with all submodules)
• scrapy.contrib_exp (with all submodules)
• scrapy.dupefilter
• scrapy.linkextractor
• scrapy.project
• scrapy.spider
• scrapy.spidermanager
• scrapy.squeue
• scrapy.stats
• scrapy.statscol
• scrapy.utils.decorator
See Module Relocations for more information, or use suggestions from Scrapy 1.5.x deprecation warnings to update
your code.
Other deprecation removals:
• Deprecated scrapy.interfaces.ISpiderManager is removed; please use scrapy.interfaces.ISpiderLoader.
• Deprecated CrawlerSettings class is removed (issue 3327).
• Deprecated Settings.overrides and Settings.defaults attributes are removed (issue 3327, issue
3359).
• All Scrapy tests now pass on Windows; Scrapy testing suite is executed in a Windows environment on CI (issue
3315).
• Python 3.7 support (issue 3326, issue 3150, issue 3547).
• Testing and CI fixes (issue 3526, issue 3538, issue 3308, issue 3311, issue 3309, issue 3305, issue 3210, issue
3299)
• scrapy.http.cookies.CookieJar.clear accepts “domain”, “path” and “name” optional arguments
(issue 3231).
• additional files are included to sdist (issue 3495);
• code style fixes (issue 3405, issue 3304);
• unneeded .strip() call is removed (issue 3519);
• collections.deque is used to store MiddlewareManager methods instead of a list (issue 3476)
• Security bug fix: the Telnet console extension could be easily exploited by rogue websites POSTing content to
http://localhost:6023. We haven't found a way to exploit it from Scrapy, but it is very easy to trick a browser
into doing so, which elevates the risk for local development environments.
The fix is backward incompatible; it enables telnet user-password authentication by default with a randomly
generated password. If you can't upgrade right away, please consider setting TELNETCONSOLE_PORT to a
non-default value.
See the telnet console documentation for more info.
• Backport CI build failure under GCE environment due to boto import error.
This is a maintenance release with important bug fixes, but no new features:
• O(N^2) gzip decompression issue which affected Python 3 and PyPy is fixed (issue 3281);
• skipping of TLS validation errors is improved (issue 3166);
• Ctrl-C handling is fixed in Python 3.5+ (issue 3096);
• testing fixes (issue 3092, issue 3263);
• documentation improvements (issue 3058, issue 3059, issue 3089, issue 3123, issue 3127, issue 3189, issue
3224, issue 3280, issue 3279, issue 3201, issue 3260, issue 3284, issue 3298, issue 3294).
This release brings small new features and improvements across the codebase. Some highlights:
• Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
• Crawling with proxy servers becomes more efficient, as connections to proxies can be reused now.
• Warning, exception and logging messages are improved to make debugging easier.
• The scrapy parse command now allows setting custom request meta via the --meta argument.
• Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now supported officially, by
running tests on CI.
• Better default handling of HTTP 308, 522 and 524 status codes.
• Documentation is improved, as usual.
• 522 and 524 status codes are added to RETRY_HTTP_CODES (issue 2851)
New features
Bug fixes
Docs
• Added missing bullet point for the AUTOTHROTTLE_TARGET_CONCURRENCY setting. (issue 2756)
• Update Contributing docs, document new support channels (issue 2762, issue 3038)
• Include references to Scrapy subreddit in the docs
• Fix broken links; use https:// for external links (issue 2978, issue 2982, issue 2958)
• Document CloseSpider extension better (issue 2759)
• Use pymongo.collection.Collection.insert_one() in MongoDB example (issue 2781)
• Spelling mistake and typos (issue 2828, issue 2837, issue 2884, issue 2924)
• Clarify CSVFeedSpider.headers documentation (issue 2826)
• Document DontCloseSpider exception and clarify spider_idle (issue 2791)
• Update “Releases” section in README (issue 2764)
• Fix rst syntax in DOWNLOAD_FAIL_ON_DATALOSS docs (issue 2763)
• Small fix in description of startproject arguments (issue 2866)
• Clarify data types in Response.body docs (issue 2922)
• Add a note about request.meta['depth'] to DepthMiddleware docs (issue 2374)
• Add a note about request.meta['dont_merge_cookies'] to CookiesMiddleware docs (issue 2999)
• Up-to-date example of project structure (issue 2964, issue 2976)
• A better example of ItemExporters usage (issue 2989)
• Document from_crawler methods for spider and downloader middlewares (issue 3019)
Scrapy 1.4 does not bring that many breathtaking new features but quite a few handy improvements nonetheless.
Scrapy now supports anonymous FTP sessions with customizable user and password via the new FTP_USER and
FTP_PASSWORD settings. And if you’re using Twisted version 17.1.0 or above, FTP is now available with Python 3.
There’s a new response.follow method for creating requests; it is now a recommended way to create Requests
in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages
over creating scrapy.Request objects directly:
• it handles relative URLs;
• it works properly with non-ascii URLs on non-UTF8 pages;
• in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href
values.
For example, instead of extracting href attributes and building Request objects by hand, you can pass the <a>
selector straight to response.follow; a brief sketch (selectors and callback names are illustrative):
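# inside a spider's parse() method; previously you would write something like:
for href in response.css('li.next a::attr(href)').extract():
    yield scrapy.Request(response.urljoin(href), callback=self.parse)

# with response.follow the same loop becomes:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)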
Link extractors are also improved. They work similarly to what a regular modern browser would do: leading and
trailing whitespace are removed from attributes (think href=" http://example.com") when building Link
objects. This whitespace-stripping also happens for action attributes with FormRequest.
Please also note that link extractors do not canonicalize URLs by default anymore. This was puzzling users every
now and then, and it’s not what browsers do in fact, so we removed that extra transformation on extracted links.
For those of you wanting more control on the Referer: header that Scrapy sends when following links, you can set
your own Referrer Policy. Prior to Scrapy 1.4, the default RefererMiddleware would simply and blindly
set it to the URL of the response that generated the HTTP request (which could leak information on your URL seeds).
By default, Scrapy now behaves much like your regular browser does. And this policy is fully customizable with W3C
standard values (or with something really custom of your own if you wish). See REFERRER_POLICY for details.
To make Scrapy spiders easier to debug, Scrapy logs more stats by default in 1.4: memory usage stats, detailed retry
stats, detailed HTTP error code stats. A similar change is that HTTP cache path is also visible in logs now.
Last but not least, Scrapy now has the option to make JSON and XML items more human-readable, with newlines
between items and even custom indenting offset, using the new FEED_EXPORT_INDENT setting.
Enjoy! (Or read on for the rest of changes in this release.)
New Features
• Warn users when project contains duplicate spider names (fixes issue 2181)
• CaselessDict now accepts Mapping instances and not only dicts (issue 2646)
• Media downloads, with FilesPipelines or ImagesPipelines, can now optionally handle HTTP redi-
rects using the new MEDIA_ALLOW_REDIRECTS setting (issue 2616, fixes issue 2004)
• Accept non-complete responses from websites using a new DOWNLOAD_FAIL_ON_DATALOSS setting (issue
2590, fixes issue 2586)
• Optional pretty-printing of JSON and XML items via FEED_EXPORT_INDENT setting (issue 2456, fixes issue
1327)
• Allow dropping fields in FormRequest.from_response formdata when None value is passed (issue 667)
• Per-request retry times with the new max_retry_times meta key (issue 2642)
• python -m scrapy as a more explicit alternative to scrapy command (issue 2740)
Bug fixes
• LinkExtractor now strips leading and trailing whitespaces from attributes (issue 2547, fixes issue 1614)
• Properly handle whitespaces in action attribute in FormRequest (issue 2548)
• Buffer CONNECT response bytes from proxy until all HTTP headers are received (issue 2495, fixes issue 2491)
• FTP downloader now works on Python 3, provided you use Twisted>=17.1 (issue 2599)
• Use body to choose response type after decompressing content (issue 2393, fixes issue 2145)
• Always decompress Content-Encoding: gzip at HttpCompressionMiddleware stage (issue
2391)
• Respect custom log level in Spider.custom_settings (issue 2581, fixes issue 1612)
• ‘make htmlview’ fix for macOS (issue 2661)
• Remove “commands” from the command list (issue 2695)
• Fix duplicate Content-Length header for POST requests with empty body (issue 2677)
• Properly cancel large downloads, i.e. above DOWNLOAD_MAXSIZE (issue 1616)
• ImagesPipeline: fixed processing of transparent PNG images with palette (issue 2675)
• Tests: remove temp files and folders (issue 2570), fixed ProjectUtilsTest on OS X (issue 2569), use portable
pypy for Linux on Travis CI (issue 2710)
• Separate building request from _requests_to_follow in CrawlSpider (issue 2562)
• Remove “Python 3 progress” badge (issue 2567)
• Add a couple more lines to .gitignore (issue 2557)
• Remove bumpversion prerelease configuration (issue 2159)
• Add codecov.yml file (issue 2750)
• Set context factory implementation based on Twisted version (issue 2577, fixes issue 2560)
• Add omitted self arguments in default project middleware template (issue 2595)
• Remove redundant slot.add_request() call in ExecutionEngine (issue 2617)
Documentation
• Binary mode is required for exporters (issue 2564, fixes issue 2553)
• Mention issue with FormRequest.from_response due to bug in lxml (issue 2572)
• Use single quotes uniformly in templates (issue 2596)
• Document ftp_user and ftp_password meta keys (issue 2587)
• Removed section on deprecated contrib/ (issue 2636)
• Recommend Anaconda when installing Scrapy on Windows (issue 2477, fixes issue 2475)
• FAQ: rewrite note on Python 3 support on Windows (issue 2690)
• Rearrange selector sections (issue 2705)
• Remove __nonzero__ from SelectorList docs (issue 2683)
• Mention how to disable request filtering in documentation of DUPEFILTER_CLASS setting (issue 2714)
• Add sphinx_rtd_theme to docs setup readme (issue 2668)
• Open file in text mode in JSON item writer example (issue 2729)
• Clarify allowed_domains example (issue 2670)
Bug fixes
• Make SpiderLoader raise ImportError again by default for missing dependencies and wrong
SPIDER_MODULES. These exceptions were silenced as warnings since 1.3.0. A new setting is introduced
to toggle between a warning and an exception if needed; see SPIDER_LOADER_WARN_ONLY for details.
Bug fixes
• Preserve request class when converting to/from dicts (utils.reqser) (issue 2510).
• Use consistent selectors for author field in tutorial (issue 2551).
• Fix TLS compatibility in Twisted 17+ (issue 2558)
New features
• Support 'True' and 'False' string values for boolean settings (issue 2519); you can now do something like
scrapy crawl myspider -s REDIRECT_ENABLED=False.
• Support kwargs with response.xpath() to use XPath variables and ad-hoc namespace declarations; this
requires at least Parsel v1.1 (issue 2457).
• Add support for Python 3.6 (issue 2485).
• Run tests on PyPy (warning: some tests still fail, so PyPy is not supported yet).
Bug fixes
Documentation
• Reword Code of Conduct section and upgrade to Contributor Covenant v1.4 (issue 2469).
• Clarify that passing spider arguments converts them to spider attributes (issue 2483).
• Document formid argument on FormRequest.from_response() (issue 2497).
• Add .rst extension to README files (issue 2507).
• Mention LevelDB cache storage backend (issue 2525).
• Use yield in sample callback code (issue 2533).
• Add note about HTML entities decoding with .re()/.re_first() (issue 1704).
• Typos (issue 2512, issue 2534, issue 2531).
Cleanups
This release comes rather soon after 1.2.2 for one main reason: it was found that releases since 0.18 up to 1.2.2
(inclusive) use some backported code from Twisted (scrapy.xlib.tx.*), even if newer Twisted modules are
available. Scrapy now uses twisted.web.client and twisted.internet.endpoints directly. (See also
cleanups below.)
As it is a major change, we wanted to get the bug fix out quickly while not breaking any projects using the 1.2 series.
New Features
• MailSender now accepts single strings as values for to and cc arguments (issue 2272)
• scrapy fetch url, scrapy shell url and fetch(url) inside scrapy shell now follow HTTP redi-
rections by default (issue 2290); See fetch and shell for details.
• HttpErrorMiddleware now logs errors with INFO level instead of DEBUG; this is technically backward
incompatible so please check your log parsers.
• By default, logger names now use a long-form path, e.g. [scrapy.extensions.logstats], instead
of the shorter “top-level” variant of prior releases (e.g. [scrapy]); this is backward incompatible if you
have log parsers expecting the short logger name part. You can switch back to short logger names using
LOG_SHORT_NAMES set to True.
• Scrapy now requires Twisted >= 13.1 which is the case for many Linux distributions already.
• As a consequence, we got rid of scrapy.xlib.tx.* modules, which copied some of Twisted code for users
stuck with an “old” Twisted version
• ChunkedTransferMiddleware is deprecated and removed from the default downloader middlewares.
Bug fixes
Documentation
Other changes
Bug fixes
• Include OpenSSL’s more permissive default ciphers when establishing TLS/SSL connections (issue 2314).
• Fix “Location” HTTP header decoding on non-ASCII URL redirects (issue 2321).
Documentation
Other changes
New Features
• New FEED_EXPORT_ENCODING setting to customize the encoding used when writing items to a file. This
can be used to turn off \uXXXX escapes in JSON output. This is also useful for those wanting something else
than UTF-8 for XML or CSV output (issue 2034).
• startproject command now supports an optional destination directory to override the default one based on
the project name (issue 2005).
• New SCHEDULER_DEBUG setting to log requests serialization failures (issue 1610).
• JSON encoder now supports serialization of set instances (issue 2058).
• Interpret application/json-amazonui-streaming as TextResponse (issue 1503).
• scrapy is imported by default when using shell tools (shell, inspect_response) (issue 2248).
Bug fixes
• DefaultRequestHeaders middleware now runs before UserAgent middleware (issue 2088). Warning: this is
technically backward incompatible, though we consider this a bug fix.
• HTTP cache extension and plugins that use the .scrapy data directory now work outside projects (issue 1581).
Warning: this is technically backward incompatible, though we consider this a bug fix.
• Selector does not allow passing both response and text anymore (issue 2153).
• Fixed logging of wrong callback name with scrapy parse (issue 2169).
• Fix for an odd gzip decompression bug (issue 1606).
• Fix for selected callbacks when using CrawlSpider with scrapy parse (issue 2225).
• Fix for invalid JSON and XML files when spider yields no items (issue 872).
• Implement flush() for StreamLogger, avoiding a warning in logs (issue 2125).
Refactoring
Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously Ubuntu 12.04 Precise. What this means in
practice is that we run continuous integration tests with at least these (main) package versions: Twisted 14.0,
pyOpenSSL 0.14, lxml 3.4.
Scrapy may very well work with older versions of these packages (the code base still has switches for older Twisted
versions for example) but it is not guaranteed (because it’s not tested anymore).
Documentation
Bug fixes
• Class attributes for subclasses of ImagesPipeline and FilesPipeline work as they did before 1.1.1
(issue 2243, fixes issue 2198)
Documentation
• Overview and tutorial rewritten to use http://toscrape.com websites (issue 2236, issue 2249, issue 2252).
Bug fixes
Bug fixes
New features
Documentation
Tests
• Upgrade py.test requirement on Travis CI and pin pytest-cov to 2.2.1 (issue 2095)
This 1.1 release brings a lot of interesting features and bug fixes:
• Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See Beta Python 3 Support for more details
and some limitations.
• Hot new features:
– Item loaders now support nested loaders (issue 1467); see the sketch after this list.
– FormRequest.from_response improvements (issue 1382, issue 1137).
– Added setting AUTOTHROTTLE_TARGET_CONCURRENCY and improved AutoThrottle docs (issue
1324).
– Added response.text to get body as unicode (issue 1730).
– Anonymous S3 connections (issue 1358).
– Deferreds in downloader middlewares (issue 1473). This enables better robots.txt handling (issue 1471).
– HTTP caching now follows RFC2616 more closely, added settings HTTPCACHE_ALWAYS_STORE and
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS (issue 1151).
– Selectors were extracted to the parsel library (issue 1409). This means you can use Scrapy Selectors
without Scrapy and also upgrade the selectors engine without needing to upgrade Scrapy.
– HTTPS downloader now does TLS protocol negotiation by default, instead of forcing TLS 1.0. You can
also set the SSL/TLS method using the new DOWNLOADER_CLIENT_TLS_METHOD.
• These bug fixes may require your attention:
– Don’t retry bad requests (HTTP 400) by default (issue 1289). If you need the old behavior, add 400 to
RETRY_HTTP_CODES.
– Fix shell files argument handling (issue 1710, issue 1550). If you try scrapy shell index.html it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file instead.
– Robots.txt compliance is now enabled by default for newly-created projects (issue 1724). Scrapy will also
wait for robots.txt to be downloaded before proceeding with the crawl (issue 1735). If you want to disable
this behavior, update ROBOTSTXT_OBEY in the settings.py file after creating a new project.
– Exporters now work on unicode, instead of bytes by default (issue 1080). If you use PythonItemExporter, you may want to update your code to disable binary mode, which is now deprecated.
– Accept XML node names containing dots as valid (issue 1533).
– When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now “private” instead of “public”. Warning: backward incompatible! You can use FILES_STORE_S3_ACL to change it.
– We’ve reimplemented canonicalize_url() for more correct output, especially for URLs with non-ASCII characters (issue 1947). This could change link extractor output compared to previous Scrapy versions. It may also invalidate some cache entries you could still have from pre-1.1 runs. Warning: backward incompatible!
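A minimal sketch of the nested loaders feature referenced above; ProductItem, the XPaths and the field names are hypothetical, and response is the usual spider callback argument:

    from scrapy.loader import ItemLoader

    loader = ItemLoader(item=ProductItem(), response=response)
    # XPaths added through the nested loader are relative to //footer
    footer_loader = loader.nested_xpath('//footer')
    footer_loader.add_xpath('social', 'a[@class="social"]/@href')
    footer_loader.add_xpath('email', 'a[@class="email"]/@href')
    item = loader.load_item()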
Keep reading for more details on other improvements and bug fixes.
We have been hard at work to make Scrapy run on Python 3. As a result, now you can run spiders on Python 3.3, 3.4
and 3.5 (Twisted >= 15.5 required). Some features are still missing (and some may never be ported).
Almost all builtin extensions/middlewares are expected to work. However, we are aware of some limitations in Python
3:
• Scrapy does not work on Windows with Python 3
• Sending emails is not supported
• FTP download handler is not supported
• Telnet console is not supported
Relocations
Bugfixes
• Scrapy no longer retries requests that got an HTTP 400 Bad Request response (issue 1289). Warning: backward incompatible!
• Support empty password for http_proxy config (issue 1274).
• FIX: RetryMiddleware is now robust to non-standard HTTP status codes (issue 1857)
• FIX: Filestorage HTTP cache was checking wrong modified time (issue 1875)
• DOC: Support for Sphinx 1.4+ (issue 1893)
• DOC: Consistency in selectors examples (issue 1869)
• FIX: [Backport] Ignore bogus links in LinkExtractors (fixes issue 907, commit 108195e)
• TST: Changed buildbot makefile to use ‘pytest’ (commit 1f3d90a)
• DOC: Fixed typos in tutorial and media-pipeline (commit 808a9ea and commit 803bd87)
• DOC: Add AjaxCrawlMiddleware to DOWNLOADER_MIDDLEWARES_BASE in settings docs (commit
aa94121)
• Twisted 15.3.0 does not raise a PicklingError when serializing lambda functions (commit b04dd7d)
• Minor method name fix (commit 6f85c7f)
• minor: scrapy.Spider grammar and clarity (commit 9c9d2e0)
• Put a blurb about support channels in CONTRIBUTING (commit c63882b)
• Fixed typos (commit a9ae7b0)
• Fix doc reference. (commit 7c8a4fe)
• Unquote request path before passing it to FTPClient; it already escapes paths (commit cc00ad2)
• include tests/ to source distribution in MANIFEST.in (commit eca227e)
• DOC Fix SelectJmes documentation (commit b8567bc)
• DOC Bring Ubuntu and Archlinux outside of Windows subsection (commit 392233f)
• DOC remove version suffix from ubuntu package (commit 5303c66)
• DOC Update release date for 1.0 (commit c89fa29)
You will find a lot of new features and bugfixes in this major release. Make sure to check our updated overview to get a glimpse of some of the changes, along with our brushed-up tutorial.
Declaring and returning Scrapy Items is no longer necessary to collect the scraped data from your spider; you can now return explicit dictionaries instead.
Classic version

    class MyItem(scrapy.Item):
        url = scrapy.Field()

    class MySpider(scrapy.Spider):
        def parse(self, response):
            return MyItem(url=response.url)

New version

    class MySpider(scrapy.Spider):
        def parse(self, response):
            return {'url': response.url}
The last Google Summer of Code project accomplished an important redesign of the mechanism used for populating settings, introducing explicit priorities to override any given setting. As an extension of that goal, we included a new level of priority for settings that act exclusively for a single spider, allowing them to redefine project settings.
Start using it by defining a custom_settings class variable in your spider:
    class MySpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOAD_DELAY": 5.0,
            "RETRY_ENABLED": False,
        }
Python Logging
Scrapy 1.0 has moved away from Twisted logging to Python's built-in logging as the default logging system. We're maintaining backward compatibility for most of the old custom interface for calling logging functions, but you'll get warnings to switch to the Python logging API entirely.
Old version
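(a rough sketch of the pre-1.0 style, using the deprecated scrapy.log module)

    from scrapy import log
    log.msg('MESSAGE', level=log.INFO)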
New version
    import logging
    logging.info('MESSAGE')
Logging with spiders remains the same, but on top of the log() method you’ll have access to a custom logger
created for the spider to issue log events:
    class MySpider(scrapy.Spider):
        def parse(self, response):
            self.logger.info('Response received')
Another milestone for the last Google Summer of Code was a refactoring of the internal API, seeking simpler and easier usage. Check out the new core interface in Core API.
A common situation where you will face these changes is while running Scrapy from scripts. Here’s a quick example
of how to run a Spider manually with the new API:
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()
Bear in mind this feature is still under development and its API may change until it reaches a stable status.
See more examples of running Scrapy from scripts in Common Practices.
Module Relocations
There's been a large rearrangement of modules trying to improve the general structure of Scrapy. The main changes were separating various subpackages into new projects and dissolving both scrapy.contrib and scrapy.contrib_exp into top-level packages. Backward compatibility was kept for internal relocations, and importing deprecated modules triggers warnings indicating their new location.
Outsourced packages
Note: These extensions went through some minor changes, e.g. some setting names were changed. Please check the
documentation in each new repository to get familiar with the new usage.
Class renames
Settings renames
Changelog
• pem file is used by mockserver and required by scrapy bench (commit 5eddc68)
• scrapy bench needs scrapy.tests* (commit d6cb999)
• Use a mutable mapping to proxy deprecated settings.overrides and settings.defaults attribute (commit e5e8133)
• there is not support for python3 yet (commit 3cd6146)
• Update python compatible version set to debian packages (commit fa5d76b)
• Fix deprecated CrawlerSettings and increase backward compatibility with .defaults attribute (commit 8e3f20a)
Enhancements
• Infer exporter’s output format from filename extensions (issue 546, issue 659, issue 760)
• Support case-insensitive domains in url_is_from_any_domain() (issue 693)
• Remove pep8 warnings in project and spider templates (issue 698)
• Tests and docs for request_fingerprint function (issue 597)
• Update SEP-19 for GSoC project per-spider settings (issue 705)
• Set exit code to non-zero when contracts fail (issue 727)
• Add a setting to control what class is instantiated as Downloader component (issue 738)
• Pass response in item_dropped signal (issue 724)
• Improve scrapy check contracts command (issue 733, issue 752)
• Document spider.closed() shortcut (issue 719)
• Document request_scheduled signal (issue 746)
• Add a note about reporting security issues (issue 697)
• Add LevelDB http cache storage backend (issue 626, issue 500)
• Sort spider list output of scrapy list command (issue 742)
• Multiple documentation enhancements and fixes (issue 575, issue 587, issue 590, issue 596, issue 610, issue 617,
issue 618, issue 627, issue 613, issue 643, issue 654, issue 675, issue 663, issue 711, issue 714)
Bugfixes
• Encode unicode URL value when creating Links in RegexLinkExtractor (issue 561)
• Ignore None values in ItemLoader processors (issue 556)
• Fix link text when there is an inner tag in SGMLLinkExtractor and HtmlParserLinkExtractor (issue 485, issue
574)
• Fix wrong checks on subclassing of deprecated classes (issue 581, issue 584)
• Handle errors caused by inspect.stack() failures (issue 582)
• Fix a reference to nonexistent engine attribute (issue 593, issue 594)
• Fix dynamic itemclass example usage of type() (issue 603)
• Use lucasdemarchi/codespell to fix typos (issue 628)
• Fix default value of attrs argument in SgmlLinkExtractor to be tuple (issue 661)
• Fix XXE flaw in sitemap reader (issue 676)
• Fix engine to support filtered start requests (issue 707)
• Fix offsite middleware case on urls with no hostnames (issue 745)
• Testsuite doesn’t require PIL anymore (issue 585)
Enhancements
• [Backward incompatible] Switched HTTPCacheMiddleware backend to filesystem (issue 541). To restore the old backend, set HTTPCACHE_STORAGE to scrapy.contrib.httpcache.DbmCacheStorage
• Proxy https:// urls using CONNECT method (issue 392, issue 397)
• Add a middleware to crawl AJAX crawlable pages as defined by Google (issue 343)
• Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (issue 510, issue 519)
• Selectors register EXSLT namespaces by default (issue 472)
• Unify item loaders similar to selectors renaming (issue 461)
• Make RFPDupeFilter class easily subclassable (issue 533)
• Improve test coverage and forthcoming Python 3 support (issue 525)
• Promote startup info on settings and middleware to INFO level (issue 520)
• Support partials in get_func_args util (issue 506, issue 504)
• Allow running individual tests via tox (issue 503)
• Update extensions ignored by link extractors (issue 498)
• Add middleware methods to get files/images/thumbs paths (issue 490)
• Improve offsite middleware tests (issue 478)
• Add a way to skip default Referer header set by RefererMiddleware (issue 475)
• Do not send x-gzip in default Accept-Encoding header (issue 469)
• Support defining http error handling using settings (issue 466)
• Use modern Python idioms wherever legacy code is found (issue 497)
• Improve and correct documentation (issue 527, issue 524, issue 521, issue 517, issue 512, issue 505, issue 502,
issue 489, issue 465, issue 460, issue 425, issue 536)
Fixes
Enhancements
• New Selector’s API including CSS selectors (issue 395 and issue 426),
• Request/Response url/body attributes are now immutable (modifying them had been deprecated for a long time)
• ITEM_PIPELINES is now defined as a dict (instead of a list)
• Sitemap spider can fetch alternate URLs (issue 360)
• Selector.remove_namespaces() now removes namespaces from element's attributes (issue 416)
• Paved the road for Python 3.3+ (issue 435, issue 436, issue 431, issue 452)
• New item exporter using native python types with nesting support (issue 366)
• Tune HTTP1.1 pool size so it matches concurrency defined by settings (commit b43b5f575)
• scrapy.mail.MailSender now can connect over TLS or upgrade using STARTTLS (issue 327)
• New FilesPipeline with functionality factored out from ImagesPipeline (issue 370, issue 409)
• Recommend Pillow instead of PIL for image handling (issue 317)
• Added debian packages for Ubuntu quantal and raring (commit 86230c0)
• Mock server (used for tests) can listen for HTTPS requests (issue 410)
• Remove multi spider support from multiple core components (issue 422, issue 421, issue 420, issue 419, issue
423, issue 418)
• Travis-CI now tests Scrapy changes against development versions of w3lib and queuelib python packages.
• Add pypy 2.1 to continuous integration tests (commit ecfa7431)
• Pylinted, pep8 and removed old-style exceptions from source (issue 430, issue 432)
• Use importlib for parametric imports (issue 445)
• Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (issue 372)
• Bugfix crawling shutdown on SIGINT (issue 450)
• Do not submit reset type inputs in FormRequest.from_response (commit b326b87)
• Do not silence download errors when request errback raises an exception (commit 684cfc0)
Bugfixes
Other
Thanks
• Backport scrapy check command fixes and backward compatible multi crawler process (issue 339)
• Lots of improvements to the testsuite run using Tox, including a way to test on pypy
• Handle GET parameters for AJAX crawlable urls (commit 3fe2a32)
• Use lxml recover option to parse sitemaps (issue 347)
• Bugfix cookie merging by hostname and not by netloc (issue 352)
• Support disabling HttpCompressionMiddleware using a flag setting (issue 359)
• Support xml namespaces using iternodes parser in XMLFeedSpider (issue 12)
• Support dont_cache request meta flag (issue 19)
• Bugfix scrapy.utils.gz.gunzip broken by changes in python 2.7.4 (commit 4dc76e)
• Bugfix url encoding on SgmlLinkExtractor (issue 24)
• Bugfix TakeFirst processor shouldn’t discard zero (0) value (issue 59)
• Support nested items in xml exporter (issue 66)
• Improve cookies handling performance (issue 77)
• Log dupe filtered requests once (issue 105)
• Split redirection middleware into status and meta based middlewares (issue 78)
• Use HTTP1.1 as default downloader handler (issue 109 and issue 318)
• Support xpath form selection on FormRequest.from_response (issue 185)
• Bugfix unicode decoding error on SgmlLinkExtractor (issue 199)
• Bugfix signal dispatching on pypy interpreter (issue 205)
• Improve request delay and concurrency handling (issue 206)
• Add RFC2616 cache policy to HttpCacheMiddleware (issue 212)
• Allow customization of messages logged by engine (issue 214)
• Multiple improvements to DjangoItem (issue 217, issue 218, issue 221)
• Extend Scrapy commands using setuptools entry points (issue 260)
• Allow spider allowed_domains value to be a set/tuple (issue 261)
• Support settings.getdict (issue 269)
• Simplify internal scrapy.core.scraper slot handling (issue 271)
• Added Item.copy (issue 290)
• Collect idle downloader slots (issue 297)
• Add ftp:// scheme downloader handler (issue 329)
• Added downloader benchmark webserver and spider tools (see Benchmarking)
• Moved persistent (on disk) queues to a separate project (queuelib) which scrapy now depends on
• Add scrapy commands using external libraries (issue 260)
• Added --pdb option to scrapy command line tool
• Added XPathSelector.remove_namespaces() which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in Selectors.
• obey request method when scrapy deploy is redirected to a new endpoint (commit 8c4fcee)
• fix inaccurate downloader middleware documentation. refs #280 (commit 40667cb)
• doc: remove links to diveintopython.org, which is no longer available. closes #246 (commit bd58bfa)
• Find form nodes in invalid html5 documents (commit e3d6945)
• Fix typo labeling attrs type bool instead of list (commit a274276)
• Remove concurrency limitation when using download delays and still ensure inter-request delays are enforced
(commit 487b9b5)
• add error details when image pipeline fails (commit 8232569)
• improve mac os compatibility (commit 8dcf8aa)
• setup.py: use README.rst to populate long_description (commit 7b5310d)
• doc: removed obsolete references to ClientForm (commit 80f9bb6)
• correct docs for default storage backend (commit 2aa491b)
• doc: removed broken proxyhub link from FAQ (commit bdf61c4)
• Fixed docs typo in SpiderOpenCloseLogging example (commit 7184094)
• fixed LogStats extension, which got broken after a wrong merge before the 0.16 release (commit 8c780fd)
• better backward compatibility for scrapy.conf.settings (commit 3403089)
• extended documentation on how to access crawler stats from extensions (commit c4da0b5)
• removed .hgtags (no longer needed now that scrapy uses git) (commit d52c188)
• fix dashes under rst headers (commit fa4f7f9)
• set release date for 0.16.0 in news (commit e292246)
Scrapy changes:
• added Spiders Contracts, a mechanism for testing spiders in a formal/reproducible way
• added options -o and -t to the runspider command
• documented AutoThrottle extension and added to extensions installed by default. You still need to enable it with
AUTOTHROTTLE_ENABLED
• major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals
(stats_spider_opened, etc). Stats are much simpler now, backward compatibility is kept on the Stats
Collector API and signals.
• added process_start_requests() method to spider middlewares
• dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
• dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats
collection documentation for more info.
• documented Core API
• lxml is now the default selectors backend instead of libxml2
• ported FormRequest.from_response() to use lxml instead of ClientForm
• removed modules: scrapy.xlib.BeautifulSoup and scrapy.xlib.ClientForm
• SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content
type (commit 10ed28b)
• StackTraceDump extension: also dump trackref live references (commit fe2ce93)
• nested items now fully supported in JSON and JSONLines exporters
• added cookiejar Request meta key to support multiple cookie sessions per spider (see the sketch after this list)
• decoupled encoding detection code to w3lib.encoding, and ported Scrapy code to use that module
• dropped support for Python 2.5. See https://blog.scrapinghub.com/2012/02/27/scrapy-0-15-dropping-support-for-python-2-5/
• dropped support for Twisted 2.5
• added REFERER_ENABLED setting, to control referer middleware
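A minimal sketch of the cookiejar meta key mentioned above, inside a scrapy.Spider subclass; the URLs and the parse_other callback are hypothetical:

    def start_requests(self):
        # each distinct cookiejar value gets its own cookie session
        for i, url in enumerate(self.start_urls):
            yield scrapy.Request(url, meta={'cookiejar': i})

    def parse(self, response):
        # the cookiejar key is not sticky; pass it along on follow-up requests
        yield scrapy.Request('http://www.example.com/otherpage',
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_other)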
• move buffer pointing to start of file before computing checksum. refs #92 (commit 6a5bef2)
• Compute image checksum before persisting images. closes #92 (commit 9817df1)
• remove leaking references in cached failures (commit 673a120)
• fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) (commit 11133e9)
• fixed struct.error on http compression middleware. closes #87 (commit 1423140)
• ajax crawling wasn’t expanding for unicode urls (commit 0de3fb4)
• Catch start_requests iterator errors. refs #83 (commit 454a21d)
• Speed-up libxml2 XPathSelector (commit 2fbd662)
• updated versioning doc according to recent changes (commit 0a070f5)
• scrapyd: fixed documentation link (commit 2b4e4c3)
• extras/makedeb.py: no longer obtaining version from git (commit caffe0e)
• added -o option to scrapy crawl, a shortcut for dumping scraped items into a file (or standard output using
-)
• Added support for passing custom settings to Scrapyd schedule.json api (r2779, r2783)
• New ChunkedTransferMiddleware (enabled by default) to support chunked transfer encoding (r2769)
• Add boto 2.0 support for S3 downloader handler (r2763)
• Added marshal to formats supported by feed exports (r2744)
• In request errbacks, offending requests are now received in failure.request attribute (r2738)
• Big downloader refactoring to support per domain/ip concurrency limits (r2732)
– CONCURRENT_REQUESTS_PER_SPIDER setting has been deprecated and replaced by:
* CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP
– check the documentation for more details
• Added builtin caching DNS resolver (r2728)
• Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: scaws (https://github.com/scrapinghub/scaws) (r2706, r2714)
• Moved spider queues to scrapyd: scrapy.spiderqueue -> scrapyd.spiderqueue (r2708)
• Moved sqlite utils to scrapyd: scrapy.utils.sqlite -> scrapyd.sqlite (r2781)
• Real support for returning iterators from the start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (r2704)
• Added REDIRECT_ENABLED setting to quickly enable/disable the redirect middleware (r2697)
• Added RETRY_ENABLED setting to quickly enable/disable the retry middleware (r2694)
• Added CloseSpider exception to manually close spiders (r2691)
• Improved encoding detection by adding support for HTML5 meta charset declaration (r2690)
• Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing
the spider (r2688)
• Added SitemapSpider (see documentation in Spiders page) (r2658)
• Added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) (r2657)
• Make handling of gzipped responses more robust (#319, r2643). Now Scrapy will try and decompress as much
as possible from a gzipped response, instead of failing with an IOError.
• Simplified MemoryDebugger extension to use stats for dumping memory debugging info (r2639)
• Added new command to edit spiders: scrapy edit (r2636) and -e flag to genspider command that uses
it (r2653)
• Changed default representation of items to pretty-printed dicts (r2631). This improves default logging by making the log more readable in the default case, for both Scraped and Dropped lines.
• Added spider_error signal (r2628)
• Added COOKIES_ENABLED setting (r2625)
• Stats are now dumped to Scrapy log (default value of STATS_DUMP setting has been changed to True). This
is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
• Added support for dynamically adjusting download delay and maximum concurrent requests (r2599)
• Merged item passed and item scraped concepts, as they have often proved confusing in the past (r2630).
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
• Passed item is now sent in the item argument of the item_passed signal (#273)
• Added verbose option to scrapy version command, useful for bug reports (#298)
• HTTP cache now stored by default in the project data dir (#279)
• Added project data storage directory (#276, #277)
• Documented file structure of Scrapy projects (see command-line tool doc)
• New lxml backend for XPath selectors (#147)
• Per-spider settings (#245)
• Support exit codes to signal errors in Scrapy commands (#248)
• Added -c argument to scrapy shell command
• Made libxml2 optional (#260)
• New deploy command (#261)
• Added CLOSESPIDER_PAGECOUNT setting (#253)
• Added CLOSESPIDER_ERRORCOUNT setting (#254)
Scrapyd changes
Changes to settings
Deprecated/obsoleted functionality
• Deprecated runserver command in favor of server command which starts a Scrapyd server. See also:
Scrapyd changes
• Deprecated queue command in favor of using Scrapyd schedule.json API. See also: Scrapyd changes
• Removed the LxmlItemLoader (experimental contrib which never graduated to main contrib)
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
• New Scrapy service called scrapyd for deploying Scrapy crawlers in production (#218) (documentation avail-
able)
• Simplified Images pipeline usage: it no longer requires subclassing your own images pipeline (#217)
• Scrapy shell now shows the Scrapy log by default (#206)
• Refactored execution queue in a common base code and pluggable backends called “spider queues” (#220)
• New persistent spider queue (based on SQLite) (#198), available by default, which allows starting Scrapy in server mode and then scheduling spiders to run.
• Added documentation for Scrapy command-line tool and all its available sub-commands. (documentation avail-
able)
• Feed exporters with pluggable backends (#197) (documentation available)
• Deferred signals (#193)
• Added two new methods to item pipeline open_spider(), close_spider() with deferred support (#195)
• Support for overriding default request headers per spider (#181)
• Replaced default Spider Manager with one with similar functionality but not depending on Twisted Plugins
(#186)
• Split the Debian package into two packages - the library and the service (#187)
• Scrapy log refactoring (#188)
• New extension for keeping persistent spider contexts among different runs (#203)
• Added dont_redirect request.meta key for avoiding redirects (#233)
• Added dont_retry request.meta key for avoiding retries (#234)
• New scrapy command which replaces the old scrapy-ctl.py (#199) - there is only one global scrapy
command now, instead of one scrapy-ctl.py per project - Added scrapy.bat script for running more
conveniently from Windows
• Added bash completion to command-line tool (#210)
• Renamed command start to runserver (#209)
API changes
• url and body attributes of Request objects are now read-only (#230)
• Request.copy() and Request.replace() now also copy their callback and errback attributes (#231)
• Removed UrlFilterMiddleware from scrapy.contrib (already disabled by default)
• Offsite middleware doesn't filter out any request coming from a spider that doesn't have an allowed_domains attribute (#225)
• Removed Spider Manager load() method. Now spiders are loaded in the constructor itself.
• Changes to Scrapy Manager (now called “Crawler”):
Changes to settings
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
API changes
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New features
Backward-incompatible changes
• Renamed setting: CLOSEDOMAIN_TIMEOUT to CLOSESPIDER_TIMEOUT
• Renamed setting: CLOSEDOMAIN_ITEMCOUNT to CLOSESPIDER_ITEMCOUNT
• Removed deprecated SCRAPYSETTINGS_MODULE environment variable - use
SCRAPY_SETTINGS_MODULE instead (r1840)
• Renamed setting: REQUESTS_PER_DOMAIN to CONCURRENT_REQUESTS_PER_SPIDER (r1830, r1844)
Important: Double check that you are reading the most recent version of this document at https://docs.scrapy.org/en/master/contributing.html
There are many ways to contribute to Scrapy. Here are some of them:
• Blog about Scrapy. Tell the world how you’re using Scrapy. This will help newcomers with more examples and
will help the Scrapy project to increase its visibility.
• Report bugs and request features in the issue tracker, trying to follow the guidelines detailed in Reporting bugs
below.
• Submit patches for new functionalities and/or bug fixes. Please read Writing patches and Submitting patches
below for details on how to write and submit a patch.
• Join the Scrapy subreddit and share your ideas on how to improve Scrapy. We’re always open to suggestions.
• Answer Scrapy questions at Stack Overflow.
Note: Please report security issues only to [email protected]. This is a private list only open to
trusted Scrapy developers, and its archives are not public.
Well-written bug reports are very helpful, so keep in mind the following guidelines when you’re going to report a new
bug.
• check the FAQ first to see if your issue is addressed in a well-known question
• if you have a general question about Scrapy usage, please ask it at Stack Overflow (use the “scrapy” tag).
• check the open issues to see if the issue has already been reported. If it has, don’t dismiss the report, but check
the ticket history and comments. If you have additional useful information, please leave a comment, or consider
sending a pull request with a fix.
• search the scrapy-users list and Scrapy subreddit to see if it has been discussed there, or if you’re not sure if
what you’re seeing is a bug. You can also ask in the #scrapy IRC channel.
• write complete, reproducible, specific bug reports. The smaller the test case, the better. Remember that
other developers won’t have your project to reproduce the bug, so please include all relevant files required to
reproduce it. See for example StackOverflow’s guide on creating a Minimal, Complete, and Verifiable example
exhibiting the issue.
• the most awesome way to provide a complete reproducible example is to send a pull request which adds a failing test case to the Scrapy testing suite (see Submitting patches). This is helpful even if you don't intend to fix the issue yourself.
• include the output of scrapy version -v so developers working on your bug know exactly which version
and platform it occurred on, which is often very helpful for reproducing it, or knowing if it was already fixed.
The better a patch is written, the higher the chances that it’ll get accepted and the sooner it will be merged.
Well-written patches should:
• contain the minimum amount of code required for the specific change. Small patches are easier to review and
merge. So, if you’re doing more than one change (or bug fix), please consider submitting one patch per change.
Do not collapse multiple changes into a single patch. For big changes consider using a patch queue.
• pass all unit-tests. See Running tests below.
• include one (or more) test cases that check the bug fixed or the new functionality added. See Writing tests below.
• if you’re adding or changing a public (documented) API, please include the documentation changes in the same
patch. See Documentation policies below.
• if you’re adding a private API, please add a regular expression to the coverage_ignore_pyobjects
variable of docs/conf.py to exclude the new private API from documentation coverage checks.
To see if your private API is skipped properly, generate a documentation coverage report as follows:
tox -e docs-coverage
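For example, a hypothetical private helper could be excluded with an entry like the following (the module path and pattern are illustrative only):

    # docs/conf.py
    coverage_ignore_pyobjects = [
        # hypothetical private API, excluded from documentation coverage checks
        r'^scrapy\.utils\.misc\._my_private_helper$',
    ]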
The best way to submit a patch is to issue a pull request on GitHub, optionally creating a new issue first.
Remember to explain what was fixed or the new functionality (what it is, why it's needed, etc.). The more info you include, the easier it will be for core developers to understand and accept your patch.
You can also discuss the new functionality (or bug fix) before creating the patch, but it’s always good to have a patch
ready to illustrate your arguments and show that you have put some additional thought into the subject. A good starting
point is to send a pull request on GitHub. It can be simple enough to illustrate your idea, and leave documentation/tests
for later, after the idea has been validated and proven useful. Alternatively, you can start a conversation in the Scrapy
subreddit to discuss your idea first.
Sometimes there is an existing pull request for the problem you'd like to solve, which is stalled for some reason. Often the pull request is in the right direction, but changes are requested by Scrapy maintainers, and the original pull request author hasn't had time to address them. In this case consider picking up this pull request: open a new pull request with all commits from the original pull request, as well as additional changes to address the raised issues. Doing so helps a lot; it is not considered rude as long as the original author is acknowledged by keeping their commits.
You can pull an existing pull request to a local branch by running git fetch upstream pull/
$PR_NUMBER/head:$BRANCH_NAME_TO_CREATE (replace ‘upstream’ with a remote name for scrapy repos-
itory, $PR_NUMBER with an ID of the pull request, and $BRANCH_NAME_TO_CREATE with a name of the
branch you want to create locally). See also: https://help.github.com/articles/checking-out-pull-requests-locally/
#modifying-an-inactive-pull-request-locally.
When writing GitHub pull requests, try to keep titles short but descriptive. E.g. for bug #411 (“Scrapy hangs if an exception raises in start_requests”), prefer “Fix hanging when exception occurs in start_requests (#411)” over “Fix for #411”. Complete titles make it easy to skim through the issue tracker.
Finally, try to keep aesthetic changes (PEP 8 compliance, unused imports removal, etc) in separate commits from
functional changes. This will make pull requests easier to review and more likely to get merged.
Please follow these coding conventions when writing code for inclusion in Scrapy:
• Unless otherwise specified, follow PEP 8.
• It’s OK to use lines longer than 80 chars if it improves the code readability.
• Don't put your name in the code you contribute; git provides enough metadata to identify the author of the code. See https://help.github.com/articles/setting-your-username-in-git/ for setup instructions.
For reference documentation of API members (classes, methods, etc.) use docstrings and make sure that the Sphinx
documentation uses the autodoc extension to pull the docstrings. API reference documentation should follow docstring
conventions (PEP 257) and be IDE-friendly: short, to the point, and it may provide short examples.
Other types of documentation, such as tutorials or topics, should be covered in files within the docs/ directory. This
includes documentation that is specific to an API member, but goes beyond API reference documentation.
In any case, if something is covered in a docstring, use the autodoc extension to pull the docstring into the documen-
tation instead of duplicating the docstring in files within the docs/ directory.
7.2.6 Tests
Tests are implemented using the Twisted unit-testing framework; running tests requires tox.
Running tests
To run the tests on a specific tox environment, use -e <name> with an environment name from tox.ini. For
example, to run the tests with Python 3.6 use:
tox -e py36
You can also specify a comma-separated list of environments, and use tox's parallel mode to run the tests on multiple environments in parallel:
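For example (an illustrative invocation; both environment names must exist in tox.ini and parallel mode requires a reasonably recent tox):
tox -e py35,py36 -p auto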
To pass command-line options to pytest, add them after -- in your call to tox. Using -- overrides the default positional
arguments defined in tox.ini, so you must include those default positional arguments (scrapy tests) after --
as well:
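For example, to stop at the first failing test (using pytest's standard -x option; illustrative):
tox -- scrapy tests -x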
You can also use the pytest-xdist plugin. For example, to run all tests on the Python 3.6 tox environment using all your
CPU cores:
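An illustrative invocation, assuming pytest-xdist is available in that environment (-n auto starts one worker per CPU core):
tox -e py36 -- scrapy tests -n auto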
To see the coverage report, install coverage (pip install coverage) and run:
coverage report
See the output of coverage --help for more options, such as HTML or XML reports.
Writing tests
All functionality (including new features and bug fixes) must include a test case to check that it works as expected, so
please include tests for your patches if you want them to get accepted sooner.
Scrapy uses unit tests, which are located in the tests/ directory. Their module names typically resemble the full path of the module they're testing. For example, the item loaders code is in scrapy.loader and its unit tests are in tests/test_loader.py.
7.3.1 Versioning
Backward-incompatibilities are explicitly mentioned in the release notes, and may require special attention before
upgrading.
Development releases do not follow the three-number versioning scheme and are generally released as dev-suffixed versions, e.g. 1.3dev.
Note: With Scrapy 0.* series, Scrapy used odd-numbered versions for development releases. This is not the case
anymore from Scrapy 1.0 onwards.
Starting with Scrapy 1.0, all releases should be considered production-ready.
For example:
• 1.1.1 is the first bugfix release of the 1.1 series (safe to use in production)
API stability was one of the major goals for the 1.0 release.
Methods or functions that start with a single underscore (_) are private and should never be relied upon as stable.
Also, keep in mind that stable doesn’t mean complete: stable APIs could grow new methods or functionality but the
existing methods should keep working the same way.
Release notes See what has changed in recent Scrapy versions.
Contributing to Scrapy Learn how to contribute to the Scrapy project.
Versioning and API Stability Understand Scrapy versioning and API stability.