apricot: Submodular selection for data summarization in Python

Schreiber, Jacob; Bilmes, Jeffrey; Noble, William Stafford

arXiv:1906.03543 (cs)

[Submitted on 8 Jun 2019]

Abstract: We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, using both algorithmic speedups such as the lazy greedy algorithm and code optimizers such as numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of submodular selection, an overview of the features in apricot, and an application to several data sets. The code and tutorial Jupyter notebooks are available at this https URL
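The greedy selection described in the abstract can be illustrated with a minimal pure-Python sketch of lazy greedy maximization of a facility location function. This is not apricot's implementation or API: the function names (`similarity_matrix`, `facility_location`, `lazy_greedy`), the exponential similarity, and the toy data are all assumptions chosen for illustration, and the dense similarity matrix reflects the quadratic memory cost the abstract mentions.

```python
import heapq
import math


def similarity_matrix(points):
    """Pairwise similarities exp(-squared Euclidean distance); an assumed choice."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
            for p in points]


def facility_location(sim, subset):
    """Objective value: each example contributes its best similarity to the subset."""
    return sum(max(row[j] for j in subset) for row in sim)


def marginal_gain(sim, best, j):
    """Increase in the objective from adding j, given the current best similarities."""
    return sum(max(s - b, 0.0) for s, b in zip(sim[j], best))


def lazy_greedy(sim, k):
    """Select k indices greedily, re-evaluating a gain only when it reaches the top."""
    n = len(sim)
    best = [0.0] * n  # best similarity of each example to the selected set
    # Max-heap via negated gains. Initial gains are exact for the empty set and,
    # by submodularity, remain upper bounds as the selected set grows.
    heap = [(-sum(sim[j]), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < k:
        _, j = heapq.heappop(heap)
        gain = marginal_gain(sim, best, j)  # refresh the possibly stale bound
        if not heap or gain >= -heap[0][0]:
            # The refreshed gain still beats every other upper bound: select j.
            selected.append(j)
            best = [max(b, s) for b, s in zip(best, sim[j])]
        else:
            heapq.heappush(heap, (-gain, j))
    return selected


# Toy data (hypothetical): two well-separated clusters of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
sim = similarity_matrix(points)
selected = lazy_greedy(sim, 2)
```

With two selections on this toy data, the sketch picks one point from each cluster, because a second point from an already covered cluster adds almost nothing to the objective. The lazy variant returns a set that plain greedy could also produce, but it skips most gain recomputations by relying on the fact that marginal gains of a submodular function can only shrink as the selected set grows.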

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1906.03543 [cs.LG]
	(or arXiv:1906.03543v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1906.03543

Submission history

From: Jacob Schreiber
[v1] Sat, 8 Jun 2019 23:53:57 UTC (121 KB)
