flyscrape is an elegant scraping tool for efficiently extracting data from websites. Whether you're a developer, data analyst, or researcher, flyscrape empowers you to effortlessly gather information from web pages and transform it into structured data.
- Simple and Intuitive: flyscrape offers an easy-to-use command-line interface that allows you to interact with scraping scripts effortlessly.
- Create New Scripts: The `new` command enables you to generate sample scraping scripts quickly, providing a solid starting point for your scraping endeavors.
- Run Scripts: Execute your scraping script using the `run` command, and watch as flyscrape retrieves and processes data from the specified website.
- Watch for Development: The `watch` command watches your scraping script for changes, letting you iterate quickly during development and find the right data extraction queries.
To install flyscrape, follow these simple steps:
- Install Go: Make sure you have Go installed on your system. If not, you can download it from https://golang.org/.
- Install flyscrape: Open a terminal and run the following command:

  ```bash
  go install github.com/philippta/flyscrape/cmd/flyscrape@latest
  ```
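Note: `go install` places the compiled binary in `$(go env GOPATH)/bin` (or in `$GOBIN`, if set). If your shell cannot find the `flyscrape` command afterwards, add that directory to your `PATH`:

```bash
# go install puts binaries into $(go env GOPATH)/bin by default.
export PATH="$PATH:$(go env GOPATH)/bin"

# Verify that the binary is now reachable.
which flyscrape
```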
flyscrape offers several commands to assist you in your scraping journey:
Use the `new` command to create a new scraping script:

```bash
flyscrape new example.js
```
Execute your scraping script using the `run` command:

```bash
flyscrape run example.js
```
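flyscrape prints the scraped records as JSON to standard output, so you can capture the results with ordinary shell redirection or pipe them into other tools:

```bash
flyscrape run example.js > results.json
```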
The `watch` command watches your scraping script for changes so you can iterate quickly during development:

```bash
flyscrape watch example.js
```
Below is an example scraping script that showcases the capabilities of flyscrape:
```javascript
import { parse } from 'flyscrape';

export const options = {
  url: 'https://news.ycombinator.com/', // Specify the URL to start scraping from.
  depth: 1,           // Specify how deep links should be followed.  (default = 0, no follow)
  allowedDomains: [], // Specify the allowed domains. ['*'] for all. (default = domain from url)
  blockedDomains: [], // Specify the blocked domains.                (default = none)
  allowedURLs: [],    // Specify the allowed URLs as regex.          (default = all allowed)
  blockedURLs: [],    // Specify the blocked URLs as regex.          (default = none blocked)
  proxy: '',          // Specify the HTTP(S) proxy to use.           (default = no proxy)
  rate: 100,          // Specify the rate in requests per second.    (default = 100)
};

export default function ({ html, url }) {
  const $ = parse(html);
  const title = $('title');
  const entries = $('.athing').toArray();

  if (!entries.length) {
    return null; // Omit scraped pages without entries.
  }

  return {
    title: title.text(),              // Extract the page title.
    entries: entries.map((entry) => { // Extract all news entries.
      const link = $(entry).find('.titleline > a');
      const rank = $(entry).find('.rank');
      const points = $(entry).next().find('.score');

      return {
        title: link.text(),                                      // Extract the title text.
        url: link.attr('href'),                                  // Extract the link href.
        rank: parseInt(rank.text().slice(0, -1)),                // Strip the trailing "." from the rank.
        points: parseInt(points.text().replace(' points', '')),  // Strip " points" from the score.
      };
    }),
  };
}
```
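For reference, each scraped page that the script above accepts produces a record of the following shape. The values here are purely illustrative, not real scraping output:

```json
{
  "title": "Hacker News",
  "entries": [
    {
      "title": "Example article title",
      "url": "https://example.com/article",
      "rank": 1,
      "points": 128
    }
  ]
}
```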
We welcome contributions from the community! If you encounter any issues or have suggestions for improvement, please submit an issue.