Data Scraping
an API
Sometimes they make you work for it
Hi, I’m Matt Dennewitz
• Python
• Scrapy
Strategy
1. “What do I want?”
4. Move in
“What do I want?”
• Envision the data you want and the shape you need it in
• Go!
Interlude: formats
• Data is distributed in mercilessly innumerable formats
• HTML
• JSON
• XML
Formats: XML
Formats: JSON
• JavaScript Object Notation
{
  "prospect_year": "2017",
  "player_id": 643217,
  "player_first_name": "Andrew",
  "player_last_name": "Benintendi",
  "rank": 1,
  "position": "OF",
  "preseason100": 1,
  "preseason20": 1,
  "team_file_code": "BOS"
}
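A minimal sketch of reading a record like this one with Python's standard json module. The embedded string copies only some of the fields shown on the slide, and the variable names are illustrative.

```python
import json

# A trimmed copy of the record from the slide, embedded as a string for the demo.
raw = """
{
    "prospect_year": "2017",
    "player_id": 643217,
    "player_first_name": "Andrew",
    "player_last_name": "Benintendi",
    "rank": 1,
    "position": "OF",
    "team_file_code": "BOS"
}
"""

player = json.loads(raw)  # JSON object -> Python dict

# JSON types map onto Python types: quoted values stay str, bare numbers become int.
full_name = f'{player["player_first_name"]} {player["player_last_name"]}'
print(full_name, player["rank"], player["position"])
```

Note that "prospect_year" is quoted in the source data, so it arrives as a string rather than a number; convert it explicitly if you need to do arithmetic on it.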
Bonus: XPath
• HTML (fragment):
<table>
<thead>
<tr>
<th>Name</th><th>HR</th><th>SB</th>
</tr>
</thead>
<tbody>
<tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
</tbody>
</table>
• XPath: //span[@class="playerdef"]/text()
• HTML:
<p>1. <strong><span class="playerdef">Eloy Jimenez</span></strong>, OF, …</p>
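To try the expression without installing anything, here is a small sketch using the standard library's xml.etree.ElementTree. It assumes the fragment is well-formed XML (real pages often are not), and ElementTree only implements a subset of XPath, so the text() step is replaced by reading the .text attribute.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, with straight quotes so it parses as XML.
fragment = (
    '<p>1. <strong><span class="playerdef">Eloy Jimenez</span></strong>'
    ', OF, …</p>'
)

root = ET.fromstring(fragment)

# ElementTree has no text() step, so select the <span> elements
# and read each element's .text instead.
names = [span.text for span in root.findall(".//span[@class='playerdef']")]
print(names)
```

With Scrapy, response.xpath() accepts the full expression from the slide, including /text(), and copes with messier HTML.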
• Syntax:
=IMPORTHTML(url, <"list" or "table">, [index])
IMPORTHTML
• ESPN Home Run Tracker
• Syntax:
=importhtml("http://www.hittrackeronline.com/?perpage=1000", "table", 17)
• Syntax:
=IMPORTHTML("http://www.brooksbaseball.net/pfxVB/tabdel_expanded.php?pitchSel=458584&game=gid_2016_06_27_bosmlb_tbamlb_1/&s_type=&h_size=700&v_size=500", "table")
IMPORTHTML
Google Docs
• A+ community
• Runs anywhere
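For comparison, the "give me the Nth table on this page" idea behind IMPORTHTML can be sketched in Python with only the standard library. This is an illustration, not how Google implements it: TableCollector and import_table are hypothetical names, the sample page is made up for the demo (the stat row reuses the table fragment from the XPath slide), and the parser assumes well-formed, non-nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in document order
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text chunks of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index=1):
    """Return the index-th table (1-based, as in IMPORTHTML) as rows."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


# Made-up page: a layout table first, then the table we actually want.
page = """
<table><tr><td>navigation</td></tr></table>
<table>
  <thead><tr><th>Name</th><th>HR</th><th>SB</th></tr></thead>
  <tbody><tr><td>Mike Trout</td><td>40</td><td>40</td></tr></tbody>
</table>
"""

rows = import_table(page, index=2)
print(rows)
```

The index argument matters for the same reason it does in the spreadsheet formula: pages often contain layout tables before the one holding the data.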
Python time
Strategy (again)
1. “What do I want?”: prospect rankings from Baseball Prospectus (BP) and MLB.
• Fields to export:
• Name
• Rank
• http://www.baseballprospectus.com/article.php?articleid=31160
BP
• XPath: //div[@class="article"]//span[@class="playerdef"]/..
• What else?
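One way to get from the player spans to both export fields (Name, Rank) is sketched below. It is not the linked script: it uses the standard library instead of Scrapy, the page markup is a hypothetical well-formed stand-in modeled on the fragment from the XPath slide, and "Player Two" is a placeholder. Because the span's direct parent in that fragment is the <strong> tag, this sketch walks the paragraphs rather than using the /.. step.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the article markup.
page = """
<html><body><div class="article">
<p>Intro paragraph with no player in it.</p>
<p>1. <strong><span class="playerdef">Eloy Jimenez</span></strong>, OF</p>
<p>2. <strong><span class="playerdef">Player Two</span></strong>, SS</p>
</div></body></html>
"""

# A ranking paragraph is assumed to start with "<number>."
RANK_RE = re.compile(r"\s*(\d+)\.")


def extract_players(html):
    """Yield (rank, name) for each paragraph that holds a playerdef span."""
    root = ET.fromstring(html.strip())
    for para in root.iterfind(".//div[@class='article']//p"):
        span = para.find(".//span[@class='playerdef']")
        if span is None:
            continue  # not a ranking paragraph
        text = "".join(para.itertext())  # e.g. "1. Eloy Jimenez, OF"
        match = RANK_RE.match(text)
        if match:
            yield int(match.group(1)), span.text.strip()


players = list(extract_players(page))

# Write the two export fields as CSV (to a string here; a file works the same way).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["rank", "name"])
writer.writerows(players)
print(buffer.getvalue())
```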
BP
• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-top-101-2017.py
• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-2017.csv
• http://m.mlb.com/prospects/2017
MLB
• “uhh”
MLB
• Websites love to load data asynchronously.
• LOVE to
{
  "prospect_year": "2017",
  "player_id": 643217,
  "player_first_name": "Andrew",
  "player_last_name": "Benintendi",
  "rank": 1,
  "position": "OF",
  "preseason100": 1,
  "preseason20": 1,
  "team_file_code": "BOS"
}
MLB
• Using Python’s built-in json module, we can easily parse this file and extract the players
• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-top-100-2017.py
• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-2017.csv
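A rough sketch of that parse-and-export step, separate from the linked script. The payload shape is an assumption: the real response may wrap the player objects in other keys, so here it is simply a JSON array holding one record with the field names shown earlier. The to_rows helper is a hypothetical name.

```python
import csv
import io
import json

# Assumed payload shape: a JSON array of player objects.
payload = """
[
  {"player_first_name": "Andrew", "player_last_name": "Benintendi",
   "rank": 1, "position": "OF", "team_file_code": "BOS"}
]
"""


def to_rows(raw_json):
    """Turn the JSON payload into (rank, name) rows sorted by rank."""
    players = json.loads(raw_json)
    rows = [
        (p["rank"], f'{p["player_first_name"]} {p["player_last_name"]}')
        for p in players
    ]
    return sorted(rows)


rows = to_rows(payload)

# Same two columns as the BP export, so both sources line up.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["rank", "name"])
writer.writerows(rows)
print(out.getvalue())
```

Keeping the column names identical across sources is what makes the later "move in" step (joining or comparing the lists) straightforward.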
MLB
Recap
• We’ve used the four-step approach to plan for consistent output across disparate systems
• Chadwick Register
• Crunchtime
Tools
• Questions?