flyscrape is a standalone and scriptable web scraper, combining the speed of Go with the flexibility of JavaScript. — Focus on data extraction rather than request juggling.
Installation · Documentation · Releases
- Highly Configurable: 10 options to fine-tune your scraper.
- Standalone: flyscrape comes as a single binary executable.
- Scriptable: Use JavaScript to write your data extraction logic.
- Simple API: Extract data from HTML pages with a familiar API.
- Fast Iteration: Use the development mode to get quick feedback.
- Request Caching: Re-run scripts on websites you already scraped.
- Zero Dependencies: No need to fill up your disk with npm packages.
export const config = {
url: "https://news.ycombinator.com/",
// urls: [] // Specify additional URLs to start from. (default = none)
// depth: 0, // Specify how deep links should be followed. (default = 0, no follow)
// follow: [], // Speficy the css selectors to follow (default = ["a[href]"])
// allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
// blockedDomains: [], // Specify the blocked domains. (default = none)
// allowedURLs: [], // Specify the allowed URLs as regex. (default = all allowed)
// blockedURLs: [], // Specify the blocked URLs as regex. (default = none)
// rate: 100, // Specify the rate in requests per second. (default = no rate limit)
// proxies: [], // Specify the HTTP(S) proxy URLs. (default = no proxy)
// cache: "file", // Enable file-based request caching. (default = no cache)
}
export default function ({ doc, absoluteURL }) {
const title = doc.find("title");
const posts = doc.find(".athing");
return {
title: title.text(),
posts: posts.map((post) => {
const link = post.find(".titleline > a");
return {
title: link.text(),
url: link.attr("href"),
};
}),
}
}
$ flyscrape run hackernews.js
[
{
"url": "https://news.ycombinator.com/",
"data": {
"title": "Hacker News",
"posts": [
{
"title": "Show HN: flyscrape - An standalone and scriptable web scraper",
"url": "https://flyscrape.com/"
},
...
]
}
}
]
Check out the examples folder for more detailed examples.
flyscrape
is available for MacOS, Linux and Windows as a downloadable binary from the releases page.
To compile flyscrape from source, follow these steps:
-
Install Go: Make sure you have Go installed on your system. If not, you can download it from https://golang.org/.
-
Install flyscrape: Open a terminal and run the following command:
go install github.com/philippta/flyscrape/cmd/flyscrape@latest
Usage:
flyscrape run SCRIPT [config flags]
Examples:
# Run the script.
$ flyscrape run example.js
# Set the URL as argument.
$ flyscrape run example.js --url "http://other.com"
# Enable proxy support.
$ flyscrape run example.js --proxies "http://someproxy:8043"
# Follow paginated links.
$ flyscrape run example.js --depth 5 --follow ".next-button > a"
Below is an example scraping script that showcases the capabilities of flyscrape. For a full documentation of all configuration options, visit the documentation page.
export const config = {
url: "https://example.com/", // Specify the URL to start scraping from.
urls: [ // Specify the URL(s) to start scraping from. If both `url` and `urls`
"https://example.com/foo", // are provided, all of the specified URLs will be scraped.
"https://example.com/bar",
],
depth: 0, // Specify how deep links should be followed. (default = 0, no follow)
follow: [], // Speficy the css selectors to follow (default = ["a[href]"])
allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
blockedDomains: [], // Specify the blocked domains. (default = none)
allowedURLs: [], // Specify the allowed URLs as regex. (default = all allowed)
blockedURLs: [], // Specify the blocked URLs as regex. (default = none)
rate: 100, // Specify the rate in requests per second. (default = no rate limit)
proxies: [], // Specify the HTTP(S) proxy URLs. (default = no proxy)
cache: "file", // Enable file-based request caching. (default = no cache)
headers: { // Specify the HTTP request header. (default = none)
"Authorization": "Basic ZGVtbzpwQDU1dzByZA==",
"User-Agent": "Gecko/1.0",
},
};
export function setup() {
// Optional setup function, called once before scraping starts.
// Can be used for authentication.
}
export default function ({ doc, url, absoluteURL }) {
// doc - Contains the parsed HTML document
// url - Contains the scraped URL
// absoluteURL(...) - Transforms relative URLs into absolute URLs
}
// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text() // "Hey"
el.html() // `<div class="element">Hey</div>`
el.attr("foo") // "bar"
el.hasAttr("foo") // true
el.hasClass("element") // true
// <ul>
// <li class="a">Item 1</li>
// <li>Item 2</li>
// <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
const items = list.find("li")
items.length() // 3
items.first() // <li>Item 1</li>
items.last() // <li>Item 3</li>
items.get(1) // <li>Item 2</li>
items.get(1).prev() // <li>Item 1</li>
items.get(1).next() // <li>Item 3</li>
items.get(1).parent() // <ul>...</ul>
items.get(1).siblings() // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text()) // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a")) // [<li class="a">Item 1</li>]
import { parse } from "flyscrape";
const doc = parse(`<div class="foo">bar</div>`);
const text = doc.find(".foo").text();
import http from "flyscrape/http";
const response = http.get("https://example.com")
const response = http.postForm("https://example.com", {
"username": "foo",
"password": "bar",
})
const response = http.postJSON("https://example.com", {
"username": "foo",
"password": "bar",
})
// Contents of response
{
body: "<html>...</html>",
status: 200,
headers: {
"Content-Type": "text/html",
// ...
},
error": "",
}
import { download } from "flyscrape/http";
download("http://example.com/image.jpg") // downloads as "image.jpg"
download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
download("http://example.com/image.jpg", "dir/") // downloads as "dir/image.jpg"
// If the server offers a filename via the Content-Disposition header and no
// destination filename is provided, Flyscrape will honor the suggested filename.
// E.g. `Content-Disposition: attachment; filename="archive.zip"`
download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"
If you encounter any issues or have suggestions for improvement, please submit an issue.