There is a newer version of the record available.

Published July 19, 2023 | Version 1.1
Dataset Open

SCI-3000: A Novel Dataset for the Task of Figure, Table and Caption Extraction from Scientific PDFs

  • 1. TU Vienna

Contributors

  • 1. TU Vienna

Description

This dataset contains bounding boxes of figures, tables, and captions on 34,791 pages extracted from 3,000 open-access scientific publications from the fields of medicine, chemistry, physics, computer science, and technology. The underlying publications are also included in PDF form.

For more details, refer to the README file.

Notes

V1.1 adds clarification on rasterization techniques (important for reproducibility).

Files

SCI-3000-full.zip

Files (13.6 GB)

Checksum                                Size
md5:e2abc448cc6c529eed243324bc184cb5    6.8 GB
md5:ed7b26bbc59872432d051562f48d89dc    6.8 GB
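
Because each archive is several gigabytes, it can be worth checking a download against the MD5 checksums listed above before unpacking it. The following is a minimal sketch of such a check, not an official tool shipped with the dataset. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and this listing does not say which checksum belongs to which file, so that pairing has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a value such as 'md5:e2ab...'.

    The optional 'md5:' prefix (as shown in the file listing) is stripped,
    and the comparison ignores letter case.
    """
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

As a usage example, `matches_checksum(local_path, "md5:e2abc448cc6c529eed243324bc184cb5")` returns `True` only if the file at `local_path` (a placeholder for wherever the archive was saved) has exactly that digest. A `False` result usually points to an incomplete or corrupted download, or to a checksum paired with the wrong file.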