Anaconda's Guide To Open-Source: Tools and Libraries For Enterprise Data Science and Machine Learning
What’s Inside
Introduction
Machine Learning
Data Visualization
Image Processing
Scalable Computing
Guide to Open-Source Tools and Libraries for Enterprise Data Science and Machine Learning 2
Open-source collaboration has led to some of the most innovative and
advanced technologies of our time. Among them are the data science and machine
learning tools and libraries that equip data scientists in every industry, including
engineering, manufacturing, cybersecurity, medicine, genetics, and astronomy.
Open-source technologies empower organizations to do breakthrough data
science and create differentiating AI and machine learning technologies.
Python is the most commonly used and most recommended language for data
science and machine learning, which is why many of the open-source tools and
libraries are built for Python. It is also growing in popularity among developers:
it is currently the second most popular language on GitHub. As Python
becomes a common language between developers and data scientists, getting
machine learning models and applications into production becomes more
efficient. All of the tools listed in this guide are compatible with Python.
Fundamental Data Science Tools and Libraries
This collection of open-source Python tools and
libraries consists of very popular packages that are
frequently used together to do data science. The
fundamental tools are not only essential and powerful
for individual practitioners, but they are also essential
for doing enterprise data science with Python. Many
other tools and libraries in the Python data science
and ML ecosystem are dependent upon these
fundamental packages.
Jupyter
WHAT IT IS:
Jupyter is an open-source project created to support interactive data science
and scientific computing across programming languages. Jupyter offers a
web-based environment for working with notebooks containing code, data, and
text. Jupyter notebooks are the standard workspace for most Python data scientists.
WHAT IT’S USED FOR:
Jupyter notebooks are used to create and share live code, equations,
visualizations and text. It has become the tool of choice for presenting data
science projects.
PROJECTS:
Jupyter is used by Google, Microsoft, IBM, Bloomberg, NASA, and many other
companies and universities. It is safe to say that if an organization has data
scientists working in Python, they use Jupyter notebooks.
MORE INFORMATION:
jupyter.org

Pandas
WHAT IT IS:
A library for tabular data structures, data analysis, and data modeling tools,
including built-in plotting using Matplotlib.
WHAT IT’S USED FOR:
Data manipulation and indexing, reshaping and pivoting of data sets, label-based
slicing and alignment, high-performance merging and joining of data sets, and
time series data analysis. Pandas includes efficient methods for reading and
writing a wide variety of data, including CSV files, Excel sheets, and SQL queries.
PROJECTS:
Many companies have found that pandas is easy to use across teams and boosts
productivity for data analysis. For example, Appnexus uses pandas across their
engineering, mathematician, and analyst teams. Datadog uses pandas to process
time series data on their production servers. It’s safe to say, if a company is
doing data science, they are using Pandas.
LEARN MORE:
https://pandas.pydata.org/
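
As a rough illustration of the pandas operations listed above (reading CSV data, merging, pivoting, and label-based slicing), here is a minimal sketch. The sales and price tables, their column names, and the values are hypothetical and were invented for this example.

```python
import io

import pandas as pd

# Hypothetical sales records, inlined as CSV text so the sketch is self-contained.
csv_text = """date,region,product,units
2024-01-01,east,widget,10
2024-01-01,west,widget,7
2024-01-02,east,gadget,3
2024-01-02,west,widget,5
"""
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Hypothetical price lookup table to join against.
prices = pd.DataFrame({"product": ["widget", "gadget"], "unit_price": [2.5, 10.0]})

# Merge (join) on the shared "product" column, then derive a revenue column.
merged = sales.merge(prices, on="product", how="left")
merged["revenue"] = merged["units"] * merged["unit_price"]

# Pivot: one row per date, one column per region, summing revenue.
revenue_by_region = merged.pivot_table(
    index="date", columns="region", values="revenue", aggfunc="sum"
)

# Label-based slicing: select a single day from the date index.
first_day = revenue_by_region.loc[pd.Timestamp("2024-01-01")]
print(revenue_by_region)
```

In a real workflow the same calls would typically read from files or databases (for example with `pd.read_csv` on a path) instead of an inline string.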
SciPy
WHAT IT IS:
The SciPy library consists of a specific set of fundamental scientific and
numerical tools for Python that data scientists use to build their own tools and
programs, not to be confused with the SciPy community and the SciPy conference,
which include anyone working on scientific computing with Python.
WHAT IT’S USED FOR:
Routines for numerical integration, interpolation, linear algebra, and statistics.

NumPy
WHAT IT IS:
A core package for scientific computing with Python. NumPy enables array
formation and basic operations with arrays.
WHAT IT’S USED FOR:
NumPy is used for indexing and sorting but can also be used for linear algebra
and other operations. SciPy is more fully featured when it comes to algebra
modules and numerical algorithms. Many other data-science libraries for Python
are built on NumPy internally, including Pandas and SciPy.
LEARN MORE:
https://numpy.org/
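
To make the division of labor between the two libraries concrete, the following sketch uses NumPy for array creation, sorting, and indexing, and SciPy for integration, interpolation, linear algebra, and statistics. The input numbers are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy import integrate, interpolate, linalg, stats

# NumPy: create an array, then sort and index it.
values = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.5])
sorted_values = np.sort(values)
largest_three = sorted_values[-3:]

# SciPy: numerically integrate sin(x) over [0, pi]; the exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# SciPy: linear interpolation between sampled points of y = x**2.
x_samples = np.array([0.0, 1.0, 2.0, 3.0])
interpolator = interpolate.interp1d(x_samples, x_samples**2)
y_mid = float(interpolator(1.5))  # halfway between y=1 and y=4

# SciPy: solve the linear system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
solution = linalg.solve(A, b)

# SciPy: descriptive statistics computed on the NumPy array.
summary = stats.describe(values)
print(area, y_mid, solution, summary.mean)
```

Every SciPy call here accepts and returns NumPy arrays, which is one practical sense in which SciPy is built on NumPy.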
Machine Learning
Machine learning (ML) is a discipline within AI that involves
developing and studying algorithms and models that machines
use to learn and perform tasks without being explicitly
programmed to do so. Deep learning is a subfield of
ML that involves processing with neural networks and
high-performance computing. These are three of the
most popular open-source machine learning
technologies.
TensorFlow
WHAT IT IS:
TensorFlow is an open-source deep learning platform from Google that includes an
ecosystem of tools and libraries that enable the building and deployment of AI
and deep learning applications. Keras is a high-level API used to build and
train deep learning models, originally as a separate library but now included
with TensorFlow.
WHAT IT’S USED FOR:
To efficiently build, train, and deploy deep learning models, such as
convolutional neural networks.
LEARN MORE:
https://www.tensorflow.org/

PyTorch
WHAT IT IS:
An open-source deep learning framework that consists of fundamental tools and
libraries for Python AI and machine learning development.
WHAT IT’S USED FOR:
To build and train deep learning models, such as CNNs and GANs. PyTorch is
supported by a rich ecosystem of libraries.
PROJECTS:
Salesforce, among many others, uses PyTorch for natural language processing.
Scikit-learn
LEARN MORE:
https://scikit-learn.org/stable/
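
The idea from the start of this section, a model that learns a task from data rather than from hand-written rules, can be sketched with scikit-learn. The choice of the bundled iris dataset, logistic regression, and the 75/25 split are assumptions made for this illustration and are not taken from the guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain feature scaling and a classifier into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of held-out flowers classified correctly.
accuracy = model.score(X_test, y_test)
predictions = model.predict(X_test[:5])
print(f"held-out accuracy: {accuracy:.2f}")
```

No classification rule is written by hand here; the mapping from measurements to species is estimated entirely from the training rows.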
Data Visualization
Data visualization is essential to data exploration,
analysis, and communication, allowing data scientists
to understand their data and share that understanding
with others. Python has many visualization tools available
(see pyviz.org/tools.html for a complete list), but
we will highlight a few here.
Bokeh & Plotly
WHAT THEY ARE:
Popular and powerful browser-based visualization libraries that let you create
interactive, JavaScript-based plots from Python.
WHAT THEY ARE USED FOR:
Bokeh and Plotly create not just static plots, but interactive visualizations
with panning, zooming, linking between plots, and other features that let you
work in Python but use the power of modern web technologies to share your
results widely.

Matplotlib
WHAT IT IS:
Matplotlib is the most well-established Python data visualization tool, focusing
primarily on two-dimensional plots (line charts, bar charts, scatter plots,
histograms, and many others). It works with many GUI interfaces and file formats,
but has relatively limited interactive support in web browsers.
WHAT IT’S USED FOR:
Matplotlib is used to analyze, explore, and show relationships between data.
LEARN MORE:
https://matplotlib.org
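
A minimal Matplotlib sketch of two of the plot types mentioned above, a line chart and a histogram, follows. The data is synthetic, and the non-interactive "Agg" backend and in-memory PNG output are choices made here so the example runs without a display or any files on disk.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: a sine curve and 500 normally distributed samples.
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 2.0 * np.pi, 100)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# One figure with two side-by-side axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("value")
ax_line.set_title("Line chart")
ax_line.legend()

counts, bin_edges, _ = ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")

# Render to PNG bytes in memory; pass a file name instead to write to disk.
fig.tight_layout()
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
plt.close(fig)
```

In a Jupyter notebook the backend line is usually unnecessary, since the figure is displayed inline.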
... code, easily transitioning from Jupyter to standalone servers), Voila
(directly serving Jupyter notebooks), Streamlit (apps from Python scripts), Dash
(direct control over HTML/CSS styling, stateless deployment).
PROJECTS:
The best way to see what projects are possible with these tools is to see the
examples at awesome-panel.org, voila-gallery.org, awesome-streamlit.org, and
dash-gallery.plotly.host.

HoloViz
The HoloViz project provides extensive free tutorials showing how to use these
tools for working with billions of data points interactively, for constructing
plots and dashboards from a few lines of Python code, and for working with
streaming, geographic, network, or other more complex types of data.
PROJECTS:
See demos and tutorials for the many types of visualizations possible with
HoloViz at http://holoviews.org/gallery/index.html.
Image Processing
Advances in computing and data-storage hardware
have made it practical to move beyond simple text and
numeric data types into images, sounds, movies, and
live sensors. Image processing tools enable data
scientists and engineers to build and train models for
AI applications such as robotics, to process sounds and images for
predictive maintenance in factories, and to support many other
applications that require image processing from
cameras or image files.
PIL/Pillow
WHAT IT’S USED FOR:
Data preparation for image training and basic image manipulation.
PROJECTS:
Data scientists, analysts, and others in banking, finance and health care
industries have used Pillow for image manipulation.

scikit-image
WHAT IT’S USED FOR:
scikit-image is used for processing large volumes of images, and it is commonly
used for scientific applications ranging from biomedical imaging to astronomy.
LEARN MORE:
https://scikit-image.org
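
The Pillow use case described above, preparing images for model training, might look like the sketch below. The synthetic gradient image stands in for a real photo, and the 32x32 target size is an arbitrary assumption; real pipelines would open files with `Image.open`. The final step converts the image to a NumPy array, which is also the format scikit-image operates on.

```python
import numpy as np
from PIL import Image, ImageOps

# Synthetic stand-in for a photo: a 64x48 RGB image with a horizontal red ramp.
width, height = 64, 48
gradient = np.zeros((height, width, 3), dtype=np.uint8)
gradient[..., 0] = np.linspace(0, 255, width).astype(np.uint8)
gradient[..., 1] = 128
image = Image.fromarray(gradient)

# Basic manipulation: convert to grayscale, resize, and mirror horizontally.
gray = image.convert("L")
thumb = gray.resize((32, 32))
mirrored = ImageOps.mirror(thumb)

# Data preparation: scale pixel values to [0, 1] floats for a model.
pixels = np.asarray(thumb, dtype=np.float32) / 255.0
print(pixels.shape, pixels.min(), pixels.max())
```

Mirroring is a common, cheap form of data augmentation; whether it is appropriate depends on the images being modeled.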
OpenCV
LEARN MORE:
https://opencv.org/
Scalable Computing
Scalable computing, including distributed and parallel
computing, speeds up analysis, model training and
performance. It enables multiple tasks and calculations to
be performed simultaneously across computers or
processors. These packages can be used as boosters for
many Python data science and machine learning tasks.
Numba
WHAT IT IS:
Numba is a high-performance Python compiler. It makes Python faster and
optimizes the performance of NumPy arrays, reaching the speed of FORTRAN and C
without a separate compilation step.
WHAT IT’S USED FOR:
Accelerating Python functions and parallelizing algorithms for GPUs and CPUs,
such as in Datashader.
PROJECTS:
Datashader, a data visualization tool, uses Numba for acceleration. Fortune 100
finance firms have used it for financial modeling, and it is also commonly used
for building simulations. Numba was also used, among other tools, in the
XENON1T experiment to detect dark matter.
LEARN MORE:
http://numba.pydata.org/

Dask
WHAT IT IS:
Dask is a Python package used to scale NumPy workflows with parallel processing
to enable multi-dimensional data analysis, enabling users to store and process
data larger than their computer’s RAM. Dask can scale out to clusters, or scale
down to a single computer. Dask mimics the pandas and NumPy API, making it more
intuitive for Python data scientists than Apache Spark.
WHAT IT’S USED FOR:
Dask is used to accelerate processing in a variety of fields, including research
in Earth science, satellite imagery, and genomics. It is also used in business
and engineering. For example, it is used to increase efficiency in cashflow
model management systems and civic modeling.
PROJECTS:
With implementations of Dask, Capital One reduced model training times by 91%.
Other organizations have used Dask for genome sequencing, cashflow modeling
systems, and satellite imagery processing.
LEARN MORE:
https://stories.dask.org/
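
To show what "mimics the NumPy API" means in practice, this sketch runs the same computation with NumPy and with a chunked Dask array. The array size and chunk size are arbitrary assumptions; on real workloads the Dask array would typically be built from data too large to hold in memory rather than from an existing NumPy array.

```python
import dask.array as da
import numpy as np

# Plain NumPy: the whole array lives in memory and is computed eagerly.
x_np = np.arange(1_000_000, dtype=np.float64)
numpy_mean = float((x_np**2).mean())

# Dask: the same data split into 10 chunks of 100,000 elements each.
x_da = da.from_array(x_np, chunks=100_000)

# Same expression as NumPy, but this only builds a lazy task graph.
lazy_mean = (x_da**2).mean()

# .compute() executes the graph, processing chunks in parallel where possible.
dask_mean = float(lazy_mean.compute())
print(numpy_mean, dask_mean)
```

The expression `(x**2).mean()` is identical in both cases; only the array constructor and the explicit `.compute()` call differ.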
RAPIDS
WHAT IT IS:
RAPIDS is a suite of tools for running Pandas, Scikit-Learn, and NetworkX (graph
analytics library) workloads on GPUs. It also integrates with some deep
learning libraries.
WHAT IT’S USED FOR:
Accelerating data science and analytics pipelines by utilizing GPUs.
PROJECTS:
Capital One uses RAPIDS in conjunction with Dask to speed up their data science
workflows and scale on GPUs. They also find that former SAS users and other data
scientists benefit because they do not have to learn Spark or Java to be effective.
LEARN MORE:
https://rapids.ai/about.html

Apache Spark
WHAT IT IS:
A fault-tolerant cluster computing framework and interface for programming
clusters launched by UC Berkeley. Developed for the Java/Hadoop ecosystem but
with support for Python. PySpark is the Python API for Spark.
WHAT IT’S USED FOR:
Spark is a multi-purpose tool that can be used for data preparation and
processing as well as training ML algorithms. Spark is well suited to managing
data streams in real time and to interactive analytics through interactive queries.
PROJECTS:
Spark is used by a wide variety of companies. eBay uses Apache Spark for log
transaction aggregation and analytics. MyFitnessPal uses Spark to clean up
users’ data and to build recommendation engines for foods and recipes.
LEARN MORE:
https://spark.apache.org/
Data Preparation / ETL
Data preparation is a prerequisite to doing data analysis,
data science and machine learning, and it can also be
the most rigorous and time-consuming part of the
whole process. Most data-science workflows initially
use custom Pandas and other data-manipulation code,
but these data preparation / ETL (extract, transform,
and load) tools help automate the process to make
data preparation more efficient in production for
companies and large organizations.
Apache Airflow
WHAT IT IS:
An open-source workflow automation tool by Apache for creating data workflows,
scheduling tasks and monitoring results. It integrates with multiple cloud
providers, including AWS, Azure, and Google Cloud.
WHAT IT’S USED FOR:
Airflow is used to manage and automate data pipelines for use in data analysis
and machine learning models.
PROJECTS:
Airflow was created by developers from Airbnb for managing big data pipelines
from multiple sources. Currently used for data pipeline management by Airbnb,
Slack, Walmart, Lyft and Hello Fresh among others.
LEARN MORE:
https://airflow.apache.org/

Intake
WHAT IT IS:
A data ingest/loading library for a wide variety of file formats and data
services, with hierarchical cataloguing, searching, and interactivity with
remote storage platforms under a single interface.
WHAT IT’S USED FOR:
Intake lets an organization catalog data of all types, including fitted model
descriptions, images, and unstructured log entries, so Python data scientists
can then focus on their analyses rather than boilerplate I/O code. Catalogs are
text files that can easily be shared with others and reused between projects.
PROJECTS:
Intake is currently used by Zillow, NASA, and USGS to catalog data of many types
for use in Python.
LEARN MORE:
https://intake.readthedocs.io
Natural Language Processing (NLP)
Natural Language Processing (NLP) involves programming
machines to parse and understand human language
and to interact with humans through both written and
spoken language. The field of NLP includes speech
recognition, language generation, document analysis,
and information retrieval.
NLTK
Gensim
LEARN MORE:
https://pypi.org/project/gensim/
spaCy
LEARN MORE:
https://spacy.io/
Looking Ahead: AI Frontiers
As machine learning technologies advance, AI solutions
will become more and more sophisticated. At the core of
this evolution are questions about fairness and
interpretability. As AI uses data to make more impactful
decisions that change people’s lives (such as hiring,
recidivism, and credit approval), humans must ensure that
these decisions are as fair as possible and that they are
explainable to those who are affected. For AI to advance,
portability and interoperability problems must also be solved.
Those who work on AI models need to be able to move
them between platforms with ease instead of having to
rebuild and re-code. Here are a few tools on the cutting
edge of solving these problems.
FairLearn
... and headaches in the process of operationalizing models. It is also commonly used for ...
... companies have published case studies or overviews of their use of the tool. One example ...
LEARN MORE:
https://onnx.ai/
AI Fairness 360
WHAT IT’S USED FOR:
Similar to FairLearn, it’s used for evaluating fairness of AI/ML models and
training data and mitigating bias in current models.
PROJECTS:
AI Fairness 360 has been used to detect bias in credit scoring algorithms and to
mitigate racial bias in healthcare utilization scoring.
LEARN MORE:
https://aif360.mybluemix.net/

InterpretML
WHAT IT’S USED FOR:
InterpretML is used to explain any existing “black box” model (models with means
of making decisions that are incomprehensible to humans), and it can also be
used to train new models that are designed to be interpretable, “glass box”
models (models explainable to humans).
PROJECTS:
InterpretML was started by open-source developers at Microsoft, and it has been
used to make credit fraud, churn, and medical prediction models more interpretable.
LEARN MORE:
https://github.com/interpretml/interpret
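
The "glass box" idea can be illustrated without InterpretML itself. The sketch below uses a deliberately shallow scikit-learn decision tree as a stand-in, which is an assumption made to keep the example small; it does not show InterpretML's API or its models. The point is only that the fitted model's full decision logic can be printed and read by a person.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small example dataset bundled with scikit-learn.
iris = load_iris()

# Limiting the depth keeps the whole model small enough to read at a glance.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The complete decision logic as human-readable if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# How much each input feature contributes to the tree's splits.
importances = dict(zip(iris.feature_names, tree.feature_importances_))
```

A "black box" model, by contrast, offers no such directly readable rule set, which is why separate explanation tools are applied to it after training.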
LIME
LEARN MORE:
https://github.com/marcotcr/lime
How Can I Manage Open Source in the Enterprise?
About Anaconda
With more than 20 million users, Anaconda is the world’s most popular data
science platform and the foundation of modern machine learning. We
pioneered the use of Python for data science, champion its vibrant community,
and continue to steward open-source projects that make tomorrow’s
innovations possible. Our enterprise-grade solutions enable corporate,
research, and academic institutions around the world to harness the power of
open-source for competitive advantage, groundbreaking research, and a better
world.