Data Scraping

The document discusses scraping data from websites. It begins with an overview of scraping and common tools used, such as Google Docs, Chrome Developer Tools, Python and Scrapy. It then covers strategies for scraping, including defining the desired data, examining the data structure on the site, doing a test scrape, and then fully implementing the scraper. The document provides examples scraping Baseball Prospectus and MLB prospect data as case studies. It discusses common data formats like HTML, JSON, XML and the use of XPath for querying pages.


There’s Always an API

Sometimes they make you work for it
Hi, I’m Matt Dennewitz

• VP Product, Pitchfork; Dir. Engineering, Wired

• I consult on data for baseball writers and MLB clubs

• @mattdennewitz on Twitter, Github


Agenda
• 101

• Your first scrape: Google Docs

• Interlude: HTML, JSON, XML, XPath

• Scaling up: Python

• What happens when the data isn’t on the page?

• Advanced topics (time allowing)


What is scraping?
• Extracting information from a document

• Rows from an HTML table

• Text from a PDF

• Images from Craigslist posts or museum websites

• OCR’ing an image and reading its text

• Spidering a website like Google


Tools

• Google Docs (surprise!)

• Chrome Developer Tools

• Python

• Scrapy
Strategy

1. “What do I want?”

2. Case the joint

3. Rob it just a little bit

4. Move in
“What do I want?”
• Envision the data you want, how you need it

• “How will I scrape this data?” Script? Crawler?

• “Do I need to scrape this more than once?”

• “How do I need to shape the data?”

• “What do I need to do to the data after I have it?” Clean, verify, cross-link with another data set, …?

• “How/to where do I want to output the data?”


Case the joint

• Does the document seem scrape-ready? Does access come with preconditions?

• Preconditions: password-protected? Online-only? Needs a special decoder?

• Look at how the data is presented in the document. Are there external dependencies, or is it self-contained?

• External deps: more information on secondary pages, data in other spreadsheets or workbooks
Rob it just a little bit

• Prototype using a subset of the information

• Estimate how long scraping will take; determine imperative needs like throttling or a specific OS

• Validate your ideas about the data you wish to extract, correct bugs

• Write unit tests
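Those unit tests can be as light as a few asserts. A minimal sketch, where `parse_rank` is a hypothetical helper standing in for whatever your prototype actually extracts:

```python
# Hypothetical helper: pull the leading rank off a capsule heading
# like "1. Alex Reyes". Swap in whatever your scraper really parses.
def parse_rank(text):
    return int(text.split(".", 1)[0])


def test_parse_rank():
    assert parse_rank("1. Alex Reyes") == 1
    assert parse_rank("101. Someone Else") == 101


test_parse_rank()  # with pytest installed, you would just run `pytest`
```

Even a couple of asserts like these will catch a site redesign before it silently corrupts your output.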


Oceans 1101

• You’ve created a stable scraper which emits data in the format you want (CSV, JSON, XML, SQL, …) to the location you want

• You understand its performance characteristics

• Go!
Interlude: formats

• Data is distributed in mercilessly innumerable formats

• The Big Three of web scraping

• HTML

• JSON

• XML
Formats: XML

• eXtensible Markup Language

• Well-structured, self-validating, predictable

• Pedantic, though not without its charms


Formats: XML
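The sample on this slide didn’t survive conversion; in its place, a small hand-made fragment (a hypothetical player record, not any real site’s format) illustrates the shape:

```xml
<!-- Hypothetical player record: nested elements plus attributes -->
<player id="102123">
  <name>Alex Reyes</name>
  <position>RHP</position>
  <team code="SLN">St. Louis Cardinals</team>
</player>
```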
Formats: HTML

• Hypertext something something something

• XML-like, without the upside

• Needs a stronger class of parser to heal broken code

• Less predictable, far more susceptible to changes in the wind
Formats: HTML

<p>
  1.
  <strong><span class="playerdef"><a href="http://www.baseballprospectus.com/card/card.php?id=102123">Alex Reyes</a></span>, RHP, <span class="teamdef"><a href="http://www.baseballprospectus.com/team_audit.php?team=SLN" target="blank">St. Louis Cardinals</a></span></strong><br>
  Scouting Report: <a href="http://www.baseballprospectus.com/article.php?articleid=30958">LINK</a>
</p>
Formats: JSON

• JavaScript Object Notation

• Data objects with simple primitives: int, double, string, boolean, object (key/value pairs), array (untyped), null.

• Requires waaaaaaay less parsing, much easier to serialize

• No schemas, but validation tools exist

• Has taken over for XML in web data transmission


Formats: JSON

{
  "prospect_year": "2017",
  "player_id": 643217,
  "player_first_name": "Andrew",
  "player_last_name": "Benintendi",
  "rank": 1,
  "position": "OF",
  "preseason100": 1,
  "preseason20": 1,
  "team_file_code": "BOS"
}
Bonus: XPath

• XPath is a way to query XML (and HTML)

• It’s got a super goofy syntax

• Very powerful, essential for scraping the web


Bonus: XPath

• XPath: //table/tbody/tr

• HTML (fragment):

<table>
  <thead>
    <tr><th>Name</th><th>HR</th><th>SB</th></tr>
  </thead>
  <tbody>
    <tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
  </tbody>
</table>

• Result: <tr><td>Mike Trout</td><td>40</td><td>40</td></tr>


Bonus: XPath

• XPath: //span[@class="playerdef"]/text()

• HTML:

<p>1. <strong><span class="playerdef">Eloy Jiminez</span></strong>, OF, …</p>

• Result: “Eloy Jiminez”
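Queries like these can be tried from Python right away. A minimal sketch using the standard library’s ElementTree, which supports a subset of XPath (lxml, introduced later in the talk, handles the full //…/text() syntax):

```python
# Try the slide's XPath idea with the standard library's ElementTree,
# which supports a subset of XPath: .// descent and [@attr='value']
# predicates. lxml supports the full //span[...]/text() form.
import xml.etree.ElementTree as ET

fragment = '<p>1. <strong><span class="playerdef">Eloy Jiminez</span></strong>, OF</p>'
doc = ET.fromstring(fragment)

# Equivalent in spirit to //span[@class="playerdef"]/text()
span = doc.find(".//span[@class='playerdef']")
print(span.text)  # Eloy Jiminez
```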


Ok, time to scrape
Google Docs

• Fire up Google Docs, start a new spreadsheet

• IMPORTXML and IMPORTHTML are your friends

• Let’s look at IMPORTHTML


IMPORTHTML

• Allows you to pull in specific list or tabular data from a web page

• Syntax:

=IMPORTHTML(url, <“list” or “table”>, [index])
IMPORTHTML

• ESPN Home Run Tracker

• Syntax:

=importhtml("http://www.hittrackeronline.com/?perpage=1000", "table", 17)

• “Give me the 17th table on the page” (IMPORTHTML’s index starts at 1)
IMPORTHTML
IMPORTHTML

• Brooks Baseball Player Pitch Logs

• Syntax:

=IMPORTHTML("http://www.brooksbaseball.net/pfxVB/tabdel_expanded.php?pitchSel=458584&game=gid_2016_06_27_bosmlb_tbamlb_1/&s_type=&h_size=700&v_size=500", "table")
IMPORTHTML
Google Docs

• Useful for pulling in single tables, or keeping everything in a spreadsheet

• Data doesn’t always exist in a single place

• Spread across several pages

• Spread across several files or APIs

Google Docs

• Useful for pulling in single tables, or keeping everything in a spreadsheet

• Data doesn’t always exist in a single place

• Spread across several pages

• Spread across several files or APIs

• Automate as much as you can


Python time

• Beautiful language. Transcendental even.

• Robust ecosystem for handling data parsing, cleaning, making net requests, etc.

• A+ community

• Runs anywhere
Python time

• I’m going to use two non-standard packages today:

• lxml, for HTML parsing and cleaning

• requests, for HTTP fetching
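Before installing anything, the same fetch-and-parse loop can be sketched with only the standard library; `PlayerDefParser` below is a hypothetical stand-in for what requests + lxml do far more comfortably (urllib.request would handle the fetching):

```python
# Standard-library stand-in for the requests + lxml combo:
# html.parser does a (much clunkier) parse of a capsule fragment.
from html.parser import HTMLParser


class PlayerDefParser(HTMLParser):
    """Collect the text of every <span class="playerdef">."""

    def __init__(self):
        super().__init__()
        self._in_playerdef = False
        self.players = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "playerdef":
            self._in_playerdef = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_playerdef = False

    def handle_data(self, data):
        if self._in_playerdef:
            self.players.append(data)


sample = '<p>1. <span class="playerdef">Alex Reyes</span>, RHP</p>'
parser = PlayerDefParser()
parser.feed(sample)
print(parser.players)  # ['Alex Reyes']
```

Twenty lines of state machine for one attribute lookup is exactly why the talk reaches for lxml instead.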


Strategy (again)

1. “What do I want?”

2. Case the joint

3. Rob it just a little bit

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB, Baseball America

2. Case the joint

3. Rob it just a little bit

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB

2. Case the joint: BP has dirty HTML. MLB loads a JSON file.

3. Rob it just a little bit

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB

2. Case the joint: BP has dirty HTML. MLB loads a JSON file.

3. Rob it just a little bit: Get a feel for BP and BA’s HTML structure, examine MLB’s JSON file.

4. Move in
Strategy (again)

1. “What do I want?”: prospect rankings from BP, MLB.

2. Case the joint: BP has dirty HTML. MLB loads a JSON file.

3. Rob it just a little bit: Get a feel for BP and BA’s HTML structure, examine MLB’s JSON file.

4. Move in: Write three scripts, one for each.
Strategy (again)

• Fields to export:

• Name

• Rank

• List type (“BP”, “MLB”, …)

• System ID (MLBAM ID, BP player ID, …)


BP

• http://www.baseballprospectus.com/article.php?articleid=31160
BP

• First thing to do is inspect the source

• Is there a pattern in the HTML you can engineer for, or an attribute you can target?

• Let’s head to the console! Right click on one of the capsules, and click “Inspect”
BP
BP

• Yes! Player data is in a paragraph tag, <p>, which contains a <span> with class “playerdef”

• Get used to talking like this

• Using XPath, we can target that <span> and walk up to its parent element, <p>, which gives us access to the whole player capsule
BP

• Beware: the “playerdef” class could be used anywhere. We need to find a reasonable scope for our XPath.

• Luckily for us, player capsules are in a <div> with class “article”, and that structure appears only once per article page across BP.

• XPath: //div[@class="article"]//span[@class="playerdef"]/..

• What else?
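The walk-up-to-the-parent move can be sketched against a simplified stand-in capsule. ElementTree has no parent axis, so a hand-built parent map substitutes for lxml’s /.. step:

```python
# Simplified stand-in for a BP player capsule. lxml's XPath can do the
# /.. parent step directly; ElementTree needs an explicit parent map.
import xml.etree.ElementTree as ET

capsule = """
<div class="article">
  <p><strong><span class="playerdef">Alex Reyes</span></strong>, RHP</p>
</div>
"""

root = ET.fromstring(capsule)

# ElementTree elements don't know their parents, so build the map by hand
parent_of = {child: parent for parent in root.iter() for child in parent}

for span in root.iter("span"):
    if span.get("class") == "playerdef":
        p = parent_of[parent_of[span]]      # span -> strong -> p
        print(span.text, "->", p.tag)       # Alex Reyes -> p
```

With lxml, the single query //div[@class="article"]//span[@class="playerdef"]/.. replaces the loop and the map.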
BP

• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-top-101-2017.py

• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-2017.csv
BP
BP

• What did we do?

• Inspected the page

• Found critical path to data, wrote supporting XPaths

• Scripted collecting and outputting the data


MLB

• http://m.mlb.com/prospects/2017
MLB

• Again, start by inspecting the source

• Try to find “Benintendi” or “Moncada”


MLB

• Again, start by inspecting the source

• Try to find “Benintendi” or “Moncada” in the HTML

• “uhh”
MLB

• Websites love to load data asynchronously.

• LOVE to

• Let’s head to the Inspector’s Network panel to poke around and find the source

• In Chrome: Ctrl+Shift+I (Windows) or Cmd+Opt+I (Mac), then select “Network”
MLB

• Websites love to load data asynchronously.

• LOVE to

• Let’s head to the Inspector’s Network panel to poke around and find the source

• In Chrome: Ctrl+Shift+I (Windows) or Cmd+Opt+I (Mac), then select “Network”

• Let’s start by looking under “XHR”, the typical place to look for dynamically loaded data
MLB
MLB

• “playerProspects.json” looks promising

• We know it’s a JSON file

• The filename is a pretty dead giveaway

• When we open it up, it has a ton of prospect data


MLB

• Here, we have a JSON file

• Let’s inspect the structure to find exactly what attributes we would like to scrape

• Fast-forward: the “prospect_players” key has prospects for all teams! And it has the Top 100 under the “prospects” key.
MLB

{
  "prospect_year": "2017",
  "player_id": 643217,
  "player_first_name": "Andrew",
  "player_last_name": "Benintendi",
  "rank": 1,
  "position": "OF",
  "preseason100": 1,
  "preseason20": 1,
  "team_file_code": "BOS"
}
MLB

• Using Python’s out-of-box JSON parser, we can easily parse this file and extract players

• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-top-100-2017.py

• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-2017.csv
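The parsing step can be sketched in a few lines. The nesting below (“prospect_players” → “prospects”) is inferred from the slides, and the sample record is reconstructed from them, so verify both against the live file:

```python
# Parse a structure shaped like MLB's playerProspects.json and flatten
# each player into (rank, name, list type, system ID) — the export
# fields chosen earlier in the talk.
import json

raw = """
{
  "prospect_players": {
    "prospects": [
      {"player_id": 643217, "player_first_name": "Andrew",
       "player_last_name": "Benintendi", "rank": 1, "position": "OF"}
    ]
  }
}
"""

data = json.loads(raw)
rows = [
    (p["rank"], p["player_first_name"] + " " + p["player_last_name"], "MLB", p["player_id"])
    for p in data["prospect_players"]["prospects"]
]
print(rows)  # [(1, 'Andrew Benintendi', 'MLB', 643217)]
```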
MLB
Recap

• We’ve used the four-step approach to plan for consistent output across disparate systems

• We’ve used tools like the Inspector to probe for data

• We’ve written very simple yet powerful scripts in Python to download prospect lists

• We’ve streamlined the data into a consistent shape

• Our scripts are easily reusable


Next steps
• Since we were clever and included system IDs,
we can tie it all together using a baseball player
ID registry

• Chadwick Register

• Smart Fantasy Baseball

• Crunchtime
Tools

• Hopefully there’s time to talk about this


Tools

• requests: A beautiful HTTP library

• lxml: A beautiful XML and HTML parsing library. Tricky to install on Windows, but binaries are available.

• BeautifulSoup: another A+ HTML parser

• Scrapy: a very robust Python framework for crawling websites
Code

• The code and output from this session is online at: https://github.com/mattdennewitz/2017-sloan-data-scraping
Thanks!

• Questions?

• If we have some time left, we could try a bit of live coding

• If you have very specific scraping questions, find me after and let’s talk
