Data Scraping

The document discusses scraping data from websites. It begins with an overview of scraping and common tools used, such as Google Docs, Chrome Developer Tools, Python and Scrapy. It then covers strategies for scraping, including defining the desired data, examining the data structure on the site, doing a test scrape, and then fully implementing the scraper. The document provides examples scraping Baseball Prospectus and MLB prospect data as case studies. It discusses common data formats like HTML, JSON, XML and the use of XPath for querying pages.


There’s Always an API

Sometimes they make you work for it
Hi, I’m Matt Dennewitz

• VP Product, Pitchfork; Dir. Engineering, Wired

• I consult on data for baseball writers and MLB clubs

• @mattdennewitz on Twitter, Github


Agenda
• 101

• Your first scrape: Google Docs

• Interlude: HTML, JSON, XML, XPath

• Scaling up: Python

• What happens when the data isn’t on the page?

• Advanced topics (time allowing)


What is scraping?
• Extracting information from a document

• Rows from an HTML table

• Text from a PDF

• Images from Craigslist posts or museum websites

• OCR’ing an image and reading its text

• Spidering a website like Google


Tools

• Google Docs (surprise!)

• Chrome Developer Tools

• Python

• Scrapy
Strategy

1. “What do I want?”

2. Case the joint

3. Rob it just a little bit

4. Move in
“What do I want?”
• Envision the data you want, how you need it

• “How will I scrape this data?” Script? Crawler?

• “Do I need to scrape this more than once?”

• “How do I need to shape the data?”

• “What do I need to do to the data after I have it?” Clean, verify, cross-link with another data set, …?

• “How/to where do I want to output the data?”


Case the joint

• Does the document seem scrape-ready? Does access come with preconditions?

• Preconditions: password-protected? Online-only? Needs a special decoder?

• Look at how the data is presented in the document. Are there external dependencies, or is it self-contained?

• External deps: more information on secondary pages, data in other spreadsheets or workbooks
Rob it just a little bit

• Prototype using a subset of the information

• Estimate how long scraping will take; determine imperative needs like throttling or a specific OS

• Validate your ideas about the data you wish to extract, correct bugs

• Write unit tests
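Those unit tests can be as light as a few asserts. A minimal sketch, where `parse_rank` is a hypothetical helper standing in for whatever your prototype actually extracts:

```python
# Hypothetical helper: pull the leading rank off a capsule heading
# like "1. Alex Reyes". Swap in whatever your scraper really parses.
def parse_rank(text):
    return int(text.split(".", 1)[0])


def test_parse_rank():
    assert parse_rank("1. Alex Reyes") == 1
    assert parse_rank("101. Someone Else") == 101


test_parse_rank()  # with pytest installed, you would just run `pytest`
```

Even a couple of asserts like these will catch a site redesign before it silently corrupts your output.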


Oceans 1101

• You’ve created a stable scraper which emits data in the format you want (CSV, JSON, XML, SQL, …) to the location you want

• You understand its performance characteristics

• Go!
Interlude: formats

• Data is distributed in mercilessly innumerable formats

• The Big Three of web scraping

• HTML

• JSON

• XML
Formats: XML

• eXtensible Markup Language

• Well-structured, self-validating, predictable

• Pedantic, though not without its charms


Formats: XML
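The sample on this slide didn’t survive conversion; in its place, a small hand-made fragment (a hypothetical player record, not any real site’s format) illustrates the shape:

```xml
<!-- Hypothetical player record: nested elements plus attributes -->
<player id="102123">
  <name>Alex Reyes</name>
  <position>RHP</position>
  <team code="SLN">St. Louis Cardinals</team>
</player>
```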
Formats: HTML

• Hypertext something something something

• XML-like, without the upside

• Needs a stronger class of parser to heal broken code

• Less predictable, far more susceptible to changes in the wind
Formats: HTML

<p>
  1.
  <strong><span class="playerdef"><a href="http://www.baseballprospectus.com/card/card.php?id=102123">Alex Reyes</a></span>, RHP, <span class="teamdef"><a href="http://www.baseballprospectus.com/team_audit.php?team=SLN" target="blank">St. Louis Cardinals</a></span></strong><br>
  Scouting Report: <a href="http://www.baseballprospectus.com/article.php?articleid=30958">LINK</a>
</p>
Formats: JSON

• JavaScript Object Notation

• Data objects with simple primitives: int, double, string, boolean, object (key/value pairs), array (untyped), null.

• Requires waaaaaaay less parsing, much easier to serialize

• No schemas, but validation tools exist

• Has taken over for XML in web data transmission


Formats: JSON

{
  "prospect_year": "2017",
  "player_id": 643217,
  "player_first_name": "Andrew",
  "player_last_name": "Benintendi",
  "rank": 1,
  "position": "OF",
  "preseason100": 1,
  "preseason20": 1,
  "team_file_code": "BOS"
}
Bonus: XPath

• XPath is a way to query XML (and HTML)

• It’s got a super goofy syntax

• Very powerful, essential for scraping the web


Bonus: XPath

• XPath: //table/tbody/tr

• HTML (fragment):

<table>
  <thead>
    <tr><th>Name</th><th>HR</th><th>SB</th></tr>
  </thead>
  <tbody>
    <tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
  </tbody>
</table>

• Result: <tr><td>Mike Trout</td><td>40</td><td>40</td></tr>


Bonus: XPath

• XPath: //span[@class="playerdef"]/text()

• HTML:

<p>1. <strong><span class="playerdef">Eloy Jiminez</span></strong>, OF, …</p>

• Result: “Eloy Jiminez”
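Queries like these can be tried from Python right away. A minimal sketch using the standard library’s ElementTree, which supports a subset of XPath (lxml, introduced later in the talk, handles the full //…/text() syntax):

```python
# Try the slide's XPath idea with the standard library's ElementTree,
# which supports a subset of XPath: .// descent and [@attr='value']
# predicates. lxml supports the full //span[...]/text() form.
import xml.etree.ElementTree as ET

fragment = '<p>1. <strong><span class="playerdef">Eloy Jiminez</span></strong>, OF</p>'
doc = ET.fromstring(fragment)

# Equivalent in spirit to //span[@class="playerdef"]/text()
span = doc.find(".//span[@class='playerdef']")
print(span.text)  # Eloy Jiminez
```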


Ok, time to scrape
Google Docs

• Fire up Google Docs, start a new spreadsheet

• IMPORTXML and IMPORTHTML are your friends

• Let’s look at IMPORTHTML


IMPORTHTML

• Allows you to pull in specific list or tabular data from a web page

• Syntax:

=IMPORTHTML(url, <“list” or “table”>, [index])
IMPORTHTML

• ESPN Home Run Tracker

• Syntax:

=importhtml("http://www.hittrackeronline.com/?perpage=1000", "table", 17)

• “Give me the 17th table on the page” (IMPORTHTML’s index starts at 1)
IMPORTHTML
IMPORTHTML

• Brooks Baseball Player Pitch Logs

• Syntax:

=IMPORTHTML("http://www.brooksbaseball.net/pfxVB/tabdel_expanded.php?pitchSel=458584&game=gid_2016_06_27_bosmlb_tbamlb_1/&s_type=&h_size=700&v_size=500", "table")
IMPORTHTML
Google Docs

• Useful for pulling in single tables, or keeping everything in a spreadsheet

• Data doesn’t always exist in a single place

• Spread across several pages

• Spread across several files or APIs

Google Docs

• Useful for pulling in single tables, or keeping everything in a spreadsheet

• Data doesn’t always exist in a single place

• Spread across several pages

• Spread across several files or APIs

• Automate as much as you can


Python time

• Beautiful language. Transcendental even.

• Robust ecosystem for handling data parsing, cleaning, making net requests, etc.

• A+ community

• Runs anywhere
Python time

• I’m going to use two non-standard packages today:

• lxml, for HTML parsing and cleaning

• requests, for HTTP fetching
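Before installing anything, the same fetch-and-parse loop can be sketched with only the standard library; `PlayerDefParser` below is a hypothetical stand-in for what requests + lxml do far more comfortably (urllib.request would handle the fetching):

```python
# Standard-library stand-in for the requests + lxml combo:
# html.parser does a (much clunkier) parse of a capsule fragment.
from html.parser import HTMLParser


class PlayerDefParser(HTMLParser):
    """Collect the text of every <span class="playerdef">."""

    def __init__(self):
        super().__init__()
        self._in_playerdef = False
        self.players = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "playerdef":
            self._in_playerdef = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_playerdef = False

    def handle_data(self, data):
        if self._in_playerdef:
            self.players.append(data)


sample = '<p>1. <span class="playerdef">Alex Reyes</span>, RHP</p>'
parser = PlayerDefParser()
parser.feed(sample)
print(parser.players)  # ['Alex Reyes']
```

Twenty lines of state machine for one attribute lookup is exactly why the talk reaches for lxml instead.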


Strategy (again)

1. “What do I want?”

2. Case the joint

3. Rob it just a little bit

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB, Baseball America

2. Case the joint

3. Rob it just a little bit

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB

2. Case the joint: BP has dirty HTML. MLB loads a JSON file.

3. Rob it just a little bit

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB

2. Case the joint: BP has dirty HTML. MLB loads a JSON file.

3. Rob it just a little bit: Get a feel for BP and BA’s HTML structure, examine MLB’s JSON file.

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB.

2. Case the joint: BP has dirty HTML. MLB loads a JSON file.

3. Rob it just a little bit: Get a feel for BP and BA’s HTML structure, examine MLB’s JSON file.

4. Move in: Write three scripts, one for each.
Strategy (again)

• Fields to export:

• Name

• Rank

• List type (“BP”, “MLB”, …)

• System ID (MLBAM ID, BP player ID, …)


BP

• http://www.baseballprospectus.com/article.php?articleid=31160
BP

• First thing to do is inspect the source

• Is there a pattern in the HTML you can engineer for, or an attribute you can target?

• Let’s head to the console! Right click on one of the capsules, and click “Inspect”
BP
BP

• Yes! Player data is in a paragraph tag, <p>, which contains a <span> with class “playerdef”

• Get used to talking like this

• Using XPath, we can target that <span> and walk up to its parent element, <p>, which gives us access to the whole player capsule
BP

• Beware: the “playerdef” class could be used anywhere. We need to find a reasonable scope for our XPath.

• Luckily for us, player capsules are in a <div> with class “article”, and that structure appears only once per article page across BP.

• XPath: //div[@class="article"]//span[@class="playerdef"]/..

• What else?
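The walk-up-to-the-parent move can be sketched against a simplified stand-in capsule. ElementTree has no parent axis, so a hand-built parent map substitutes for lxml’s /.. step:

```python
# Simplified stand-in for a BP player capsule. lxml's XPath can do the
# /.. parent step directly; ElementTree needs an explicit parent map.
import xml.etree.ElementTree as ET

capsule = """
<div class="article">
  <p><strong><span class="playerdef">Alex Reyes</span></strong>, RHP</p>
</div>
"""

root = ET.fromstring(capsule)

# ElementTree elements don't know their parents, so build the map by hand
parent_of = {child: parent for parent in root.iter() for child in parent}

for span in root.iter("span"):
    if span.get("class") == "playerdef":
        p = parent_of[parent_of[span]]      # span -> strong -> p
        print(span.text, "->", p.tag)       # Alex Reyes -> p
```

With lxml, the single query //div[@class="article"]//span[@class="playerdef"]/.. replaces the loop and the map.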
BP

• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-top-101-2017.py

• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-2017.csv
BP
BP

• What did we do?

• Inspected the page

• Found critical path to data, wrote supporting XPaths

• Scripted collecting and outputting the data


MLB

• http://m.mlb.com/prospects/2017
MLB

• Again, start by inspecting the source

• Try to find “Benintendi” or “Moncada”


MLB

• Again, start by inspecting the source

• Try to find “Benintendi” or “Moncada” in the HTML

• “uhh”
MLB

• Websites love to load data asynchronously.

• LOVE to

• Let’s head to the Inspector’s Network panel to poke around and find the source

• In Chrome: Ctrl+Shift+I (Windows) or Cmd+Opt+I (Mac), then select “Network”
MLB

• Websites love to load data asynchronously.

• LOVE to

• Let’s head to the Inspector’s Network panel to poke around and find the source

• In Chrome: Ctrl+Shift+I (Windows) or Cmd+Opt+I (Mac), then select “Network”

• Let’s start by looking under “XHR”, the typical place to look for dynamically loaded data
MLB
MLB

• “playerProspects.json” looks promising

• We know it’s a JSON file

• The filename is a pretty dead giveaway

• When we open it up, it has a ton of prospect data


MLB

• Here, we have a JSON file

• Let’s inspect the structure to find exactly what attributes we would like to scrape

• Fast-forward: the “prospect_players” key has prospects for all teams! And it has the Top 100 under the “prospects” key.
MLB

{
  "prospect_year": "2017",
  "player_id": 643217,
  "player_first_name": "Andrew",
  "player_last_name": "Benintendi",
  "rank": 1,
  "position": "OF",
  "preseason100": 1,
  "preseason20": 1,
  "team_file_code": "BOS"
}
MLB

• Using Python’s out-of-box JSON parser, we can easily parse this file and extract players

• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-top-100-2017.py

• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-2017.csv
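The parsing step can be sketched in a few lines. The nesting below (“prospect_players” → “prospects”) is inferred from the slides, and the sample record is reconstructed from them, so verify both against the live file:

```python
# Parse a structure shaped like MLB's playerProspects.json and flatten
# each player into (rank, name, list type, system ID) — the export
# fields chosen earlier in the talk.
import json

raw = """
{
  "prospect_players": {
    "prospects": [
      {"player_id": 643217, "player_first_name": "Andrew",
       "player_last_name": "Benintendi", "rank": 1, "position": "OF"}
    ]
  }
}
"""

data = json.loads(raw)
rows = [
    (p["rank"], p["player_first_name"] + " " + p["player_last_name"], "MLB", p["player_id"])
    for p in data["prospect_players"]["prospects"]
]
print(rows)  # [(1, 'Andrew Benintendi', 'MLB', 643217)]
```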
MLB
Recap

• We’ve used the four-step approach to plan for consistent output across disparate systems

• We’ve used tools like the Inspector to probe for data

• We’ve written very simple yet powerful scripts in Python to download prospect lists

• We’ve streamlined the data into a consistent shape

• Our scripts are easily reusable


Next steps
• Since we were clever and included system IDs,
we can tie it all together using a baseball player
ID registry

• Chadwick Register

• Smart Fantasy Baseball

• Crunchtime
Tools

• Hopefully there’s time to talk about this


Tools

• requests: A beautiful HTTP library

• lxml: A beautiful XML and HTML parsing library. Tricky to install on Windows, but binaries are available.

• BeautifulSoup: another A+ HTML parser

• Scrapy: a very robust Python framework for crawling websites
Code

• The code and output from this session is online at: https://github.com/mattdennewitz/2017-sloan-data-scraping
Thanks!

• Questions?

• If we have some time left, we could try a bit of live coding

• If you have very specific scraping questions, find me after and let’s talk
