Machine Learning Operations (MLOps) - Overview, Definition, and Architecture
† This paper does not represent an official IBM statement.
methodologies surfaced in the field of software engineering. Prominent examples include waterfall [37] and the agile manifesto [5]. Those methodologies have similar aims, namely to deliver production-ready software products. A concept called "DevOps" emerged in the years 2008/2009 and aims to reduce issues in software development [9,31]. DevOps is more than a pure methodology and rather represents a paradigm addressing social and technical issues in organizations engaged in software development. It has the goal of eliminating the gap between development and operations and emphasizes collaboration, communication, and knowledge sharing. It ensures automation with continuous integration, continuous delivery, and continuous deployment (CI/CD), thus allowing for fast, frequent, and reliable releases. Moreover, it is designed to ensure continuous testing, quality assurance, continuous monitoring, logging, and feedback loops. Due to the commercialization of DevOps, many DevOps tools are emerging, which can be differentiated into six groups [23,28]: collaboration and knowledge sharing (e.g., Slack, Trello, GitLab wiki), source code management (e.g., GitHub, GitLab), build process (e.g., Maven), continuous integration (e.g., Jenkins, GitLab CI), deployment automation (e.g., Kubernetes, Docker), and monitoring and logging (e.g., Prometheus, Logstash). Cloud environments are increasingly equipped with ready-to-use DevOps tooling that is designed for cloud use, facilitating the efficient generation of value [38]. With this novel shift towards DevOps, developers need to care about what they develop, as they need to operate it as well. As empirical results demonstrate, DevOps ensures better software quality [34]. People in the industry, as well as academics, have gained a wealth of experience in software engineering using DevOps. This experience is now being used to automate and operationalize ML.

3 Methodology

To derive insights from the academic knowledge base while also drawing upon the expertise of practitioners from the field, we apply a mixed-method approach, as depicted in Figure 1. As a first step, we conduct a structured literature review [20,43] to obtain an overview of relevant research. Furthermore, we review relevant tooling support in the field of MLOps to gain a better understanding of the technical components involved. Finally, we conduct semi-structured interviews [33,39] with experts from different domains. On that basis, we conceptualize the term "MLOps" and elaborate on our findings by synthesizing literature and interviews in the next chapter ("Results").

Figure 1. Overview of the methodology: a literature review (27 articles), a tool review (11 tools), and an interview study (8 interviewees) lead to the MLOps results in the form of principles, components, roles, and architecture.

3.1 Literature Review

To ensure that our results are based on scientific knowledge, we conduct a systematic literature review according to the method of Webster and Watson [43] and Kitchenham et al. [20]. After an initial exploratory search, we define our search query as follows: ((("DevOps" OR "CICD" OR "Continuous Integration" OR "Continuous Delivery" OR "Continuous Deployment") AND […]. With this query, we searched the scientific databases of Google Scholar, Web of Science, Science Direct, Scopus, and the Association for Information Systems eLibrary. It should be mentioned that the use of DevOps for ML, MLOps, and continuous practices in combination with ML is a relatively new field in academic literature. Thus, only a few peer-reviewed studies are available at the time of this research. Nevertheless, to gain experience in this area, the search included non-peer-reviewed literature as well. The search was performed in May 2021 and resulted in 1,864 retrieved articles. Of those, we screened 194 papers in detail. From that group, 27 articles were selected based on our inclusion and exclusion criteria (e.g., the term MLOps or DevOps and CI/CD in combination with ML was described in detail, the article was written in English, etc.). All 27 of these articles were peer-reviewed.

3.2 Tool Review

After going through 27 articles and eight interviews, various open-source tools, frameworks, and commercial cloud ML services were identified. These tools, frameworks, and ML services were reviewed to gain an understanding of the technical components of which they consist. An overview of the identified tools is depicted in Table 1 of the Appendix.

3.3 Interview Study

To answer the research questions with insights from practice, we conduct semi-structured expert interviews according to Myers and Newman [33]. One major aspect in the research design of expert interviews is choosing an appropriate sample size [8]. We apply a theoretical sampling approach [12], which allows us to choose experienced interview partners to obtain high-quality data. Such data can provide meaningful insights with a limited number of interviews. To get an adequate sample group and reliable insights, we use LinkedIn—a social network for professionals—to identify experienced ML professionals with profound MLOps knowledge on a global level. To gain insights from various perspectives, we choose interview partners from different organizations and industries, different countries and nationalities, as well as different genders. Interviews are conducted until no new categories and concepts emerge in the analysis of the data. In total, we conduct eight interviews with experts (α–θ), whose details are depicted in Table 2 of the Appendix.
According to Glaser and Strauss [5, p. 61], this stage is called "theoretical saturation." All interviews are conducted between June and August 2021.

With regard to the interview design, we prepare a semi-structured guide with several questions, documented as an interview script [33]. During the interviews, "soft laddering" is used with "how" and "why" questions to probe the interviewees' means-end chain [39]. This methodical approach allowed us to gain additional insight into the experiences of the interviewees when required. All interviews are recorded and then transcribed. To evaluate the interview transcripts, we use an open coding scheme [8].

4 Results

We apply the described methodology and structure our resulting insights into a presentation of important principles, their resulting instantiation as components, the description of necessary roles, as well as a suggestion for the architecture and workflow resulting from the combination of these aspects. Finally, we derive the conceptualization of the term and provide a definition of MLOps.

4.1 Principles

A principle is viewed as a general or basic truth, a value, or a guide for behavior. In the context of MLOps, a principle is a guide to how things should be realized in MLOps and is closely related to the term "best practices" from the professional sector. Based on the outlined methodology, we identified nine principles required to realize MLOps. Figure 2 provides an illustration of these principles and links them to the components with which they are associated.

Figure 2. The nine MLOps principles (P1 CI/CD automation, P2 workflow orchestration, P3 reproducibility, P4 versioning of data, code, and model, P5 collaboration, P6 continuous ML training & evaluation, P7 ML metadata tracking, P8 continuous monitoring, P9 feedback loops) and the components to which they are linked (CI/CD component, source code repository, workflow orchestration component, feature stores, model training infrastructure, model registry, ML metadata stores, model serving component, and monitoring component).

P1 CI/CD automation. CI/CD automation provides continuous integration, continuous delivery, and continuous deployment and thus automates the build, test, and delivery steps (see C1).

P2 Workflow orchestration. Workflow orchestration coordinates the tasks of an ML workflow pipeline according to directed acyclic graphs (DAGs), which define the task execution order by considering relationships and dependencies [14,17,26,32,40,41] [α, β, γ, δ, ζ, η].

P3 Reproducibility. Reproducibility is the ability to reproduce an ML experiment and obtain the exact same results [14,32,40,46] [α, β, δ, ε, η].

P4 Versioning. Versioning ensures the versioning of data, model, and code to enable not only reproducibility, but also traceability (for compliance and auditing reasons) [14,32,40,46] [α, β, δ, ε, η].

P5 Collaboration. Collaboration ensures the possibility to work collaboratively on data, model, and code. Besides the technical aspect, this principle emphasizes a collaborative and communicative work culture aiming to reduce domain silos between different roles [14,26,40] [α, δ, θ].

P6 Continuous ML training & evaluation. Continuous training means periodic retraining of the ML model based on new feature data. Continuous training is enabled through the support of a monitoring component, a feedback loop, and an automated ML workflow pipeline. Continuous training always includes an evaluation run to assess the change in model quality [10,17,19,46] [β, δ, η, θ].

P7 ML metadata tracking/logging. Metadata is tracked and logged for each orchestrated ML workflow task. Metadata tracking and logging is required for each training job iteration (e.g., training date and time, duration, etc.), including the model-specific metadata—e.g., the used parameters, the resulting performance metrics, and the model lineage (the data and code used)—to ensure the full traceability of experiment runs [26,27,29,32,35] [α, β, δ, ε, ζ, η, θ].

P8 Continuous monitoring. Continuous monitoring implies the periodic assessment of data, model, code, infrastructure resources, and model serving performance (e.g., prediction accuracy) to detect potential errors or changes that influence the product quality [4,7,10,27,29,42,46] [α, β, γ, δ, ε, ζ, η].

P9 Feedback loops. Multiple feedback loops are required to integrate insights from the quality assessment step into the development or engineering process (e.g., a feedback loop from the experimental model engineering stage to the previous feature engineering stage). Another feedback loop is required from the monitoring component (e.g., observing the model serving performance) to the scheduler to enable the retraining [4,6,7,17,27,46] [α, β, δ, ζ, η, θ].
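To make principles P3, P4, and P7 more concrete, the following minimal Python sketch logs the parameters, metrics, and data and code lineage of a single training run with MLflow, one of the tools reviewed in Table 1. The file path, parameter values, commit hash, and metric value are illustrative assumptions rather than details taken from the reviewed setups.

```python
import hashlib
from pathlib import Path

import mlflow


def file_sha256(path: str) -> str:
    """Hash the feature data file so the exact data version is traceable (P3, P4)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Illustrative values; in a real pipeline they would come from the workflow configuration.
DATA_PATH = "data/features_v1.parquet"   # hypothetical versioned feature data
PARAMS = {"learning_rate": 0.05, "n_estimators": 200, "random_seed": 42}

mlflow.set_experiment("demo-experiment")
with mlflow.start_run(run_name="training-job-001"):
    # P7: log metadata for this training job iteration.
    mlflow.log_params(PARAMS)
    mlflow.set_tag("code_version", "git:abc1234")  # hypothetical commit hash of the training code
    data_hash = file_sha256(DATA_PATH) if Path(DATA_PATH).exists() else "unknown"
    mlflow.set_tag("data_sha256", data_hash)       # lineage: which data version was used

    # ... train and evaluate the model here (omitted) ...
    mlflow.log_metric("f1_score", 0.87)            # resulting performance metric (placeholder)
```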
4.2 Components

C1 CI/CD Component (P1, P6, P9). The CI/CD component ensures continuous integration, continuous delivery, and continuous deployment. It provides fast feedback to developers regarding the success or failure of certain steps, thus increasing the overall productivity [10,15,17,26,35,46] [α, β, γ, ε, ζ, η]. Examples are Jenkins [17,26] and GitHub Actions [η].

C2 Source Code Repository (P4, P5). The source code repository ensures code storing and versioning. It allows multiple developers to commit and merge their code [17,25,42,44,46] [α, β, γ, ζ, θ]. Examples include Bitbucket [11] [ζ], GitLab [11,17] [ζ], GitHub [25] [ζ, η], and Gitea [46].

C3 Workflow Orchestration Component (P2, P3, P6). The workflow orchestration component offers task orchestration of an ML workflow via directed acyclic graphs (DAGs). These graphs represent the execution order and artifact usage of the single steps of the workflow [26,32,35,40,41,46] [α, β, γ, δ, ε, ζ, η]. Examples include Apache Airflow [α, ζ], Kubeflow Pipelines [ζ], Luigi [ζ], AWS SageMaker Pipelines [β], and Azure Pipelines [ε].
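As a minimal sketch of such DAG-based orchestration, the following Apache Airflow definition wires three placeholder tasks of an ML workflow (data extraction, model training, model export) into a directed acyclic graph. The DAG id, schedule, and task bodies are assumptions for illustration only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features():
    print("pull versioned features from the feature store")


def train_model():
    print("train and validate the model")


def export_model():
    print("export the model and push it to the model registry")


# Hypothetical DAG: the task order and dependencies form a directed acyclic graph (C3, P2).
with DAG(
    dag_id="ml_workflow_pipeline",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,  # triggered by a scheduler or on new data instead of a fixed cadence
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="data_extraction", python_callable=extract_features)
    train = PythonOperator(task_id="model_training", python_callable=train_model)
    export = PythonOperator(task_id="model_export", python_callable=export_model)

    extract >> train >> export  # execution order with explicit dependencies
```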
C4 Feature Store System (P3, P4). A feature store system ensures central storage of commonly used features. It has two databases configured: one database as an offline feature store to serve features with normal latency for experimentation, and one database as an online store to serve features with low latency for predictions in production [10,14] [α, β, ζ, ε, θ]. Examples include Google Feast [ζ], Amazon AWS Feature Store [β, ζ], Tecton.ai, and Hopsworks.ai [ζ]. This is where most of the data for training ML models will come from. Moreover, data can also come directly from any kind of data store.
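The offline/online split can be sketched without committing to a specific product. The following hypothetical wrapper serves the same features either from a parquet file (offline, normal latency, for experimentation) or from an in-memory lookup standing in for a low-latency online store; the entity key and storage choices are invented for illustration.

```python
import pandas as pd


class SimpleFeatureStore:
    """Hypothetical feature store with an offline and an online view of the same features (C4)."""

    def __init__(self, offline_path: str):
        self.offline_path = offline_path           # e.g., a versioned parquet file
        self.online: dict[str, dict] = {}          # stand-in for a low-latency key-value store

    def get_offline_features(self) -> pd.DataFrame:
        # Normal latency: full historical feature data for experimentation and training.
        return pd.read_parquet(self.offline_path)

    def materialize_online(self, df: pd.DataFrame, entity_key: str) -> None:
        # Load the latest feature values into the online store for serving.
        self.online = df.set_index(entity_key).to_dict(orient="index")

    def get_online_features(self, entity_id: str) -> dict:
        # Low latency: single-entity lookup for real-time predictions.
        return self.online.get(entity_id, {})
```

Dedicated feature store products such as those listed above typically add ingestion jobs, feature versioning, and point-in-time correct retrieval on top of this basic offline/online distinction.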
C5 Model Training Infrastructure (P6). The model training infrastructure provides the foundational computation resources, e.g., CPUs, RAM, and GPUs. The provided infrastructure can be either distributed or non-distributed. In general, a scalable and distributed infrastructure is recommended [7,10,24–26,29,40,45,46] [δ, ζ, η, θ]. Examples include local machines (not scalable) or cloud computation [7] [η, θ], as well as non-distributed or distributed computation (several worker nodes) [25,27]. Frameworks supporting computation are Kubernetes [η, θ] and Red Hat OpenShift [γ].

C6 Model Registry (P3, P4). The model registry centrally stores the trained ML models together with their metadata. It has two main functionalities: storing the ML artifact and storing the ML metadata (see C7) [4,6,14,17,26,27] [α, β, γ, ε, ζ, η, θ]. Advanced storage examples include MLflow [α, η, ζ], AWS SageMaker Model Registry [ζ], Microsoft Azure ML Model Registry [ζ], and Neptune.ai [α]. Simple storage examples include Microsoft Azure Storage, Google Cloud Storage, and Amazon AWS S3 [17].

C7 ML Metadata Stores (P4, P7). ML metadata stores allow for the tracking of various kinds of metadata, e.g., for each orchestrated ML workflow pipeline task. Another metadata store can be configured within the model registry for tracking and logging the metadata of each training job (e.g., training date and time, duration, etc.), including the model-specific metadata—e.g., the used parameters, the resulting performance metrics, and the model lineage (the data and code used) [14,25–27,32] [α, β, δ, ζ, θ]. Examples include orchestrators with built-in metadata stores tracking each step of experiment pipelines [α], such as Kubeflow Pipelines [α, ζ], AWS SageMaker Pipelines [α, ζ], Azure ML, and IBM Watson Studio [γ]. MLflow provides an advanced metadata store in combination with the model registry [32,35].

C8 Model Serving Component (P1). The model serving component can be configured for different purposes. Examples are online inference for real-time predictions or batch inference for predictions using large volumes of input data. The serving can be provided, e.g., via a REST API. As a foundational infrastructure layer, a scalable and distributed model serving infrastructure is recommended [7,11,25,40,45,46] [α, β, δ, ζ, η, θ]. One example of a model serving component configuration is the use of Kubernetes and Docker technology to containerize the ML model, leveraging a Python web application framework like Flask [17] with an API for serving [α]. Other Kubernetes-supported frameworks are KFServing of Kubeflow [α], TensorFlow Serving, and Seldon.io serving [40]. Inferencing could also be realized with Apache Spark for batch predictions [θ]. Examples of cloud services include the Microsoft Azure ML REST API [ε], AWS SageMaker Endpoints [α, β], IBM Watson Studio [γ], and the Google Vertex AI prediction service [δ].
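A minimal sketch of the Flask-based online inference variant mentioned above could look as follows; the model artifact name, request schema, and endpoint are assumptions, and in production such an application would typically run containerized on a scalable serving infrastructure.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical model artifact pulled from the model registry during deployment.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                 # e.g., {"features": [[0.3, 1.2, 5.0]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```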
C9 Monitoring Component (P8, P9). The monitoring component takes care of the continuous monitoring of the model serving performance (e.g., prediction accuracy). Additionally, monitoring of the ML infrastructure, CI/CD, and orchestration is required [7,10,17,26,29,36,46] [α, ζ, η, θ]. Examples include Prometheus with Grafana [η, ζ], the ELK stack (Elasticsearch, Logstash, and Kibana) [α, η, ζ], and simply TensorBoard [θ]. Examples with built-in monitoring capabilities are Kubeflow [θ], MLflow [η], and AWS SageMaker Model Monitor or CloudWatch [ζ].
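One simple way to connect such monitoring to the feedback loop (P9) is a periodic check of the serving performance against a threshold which, when violated, triggers the retraining pipeline via the scheduler. The metric, threshold value, and trigger mechanism below are hypothetical.

```python
def rolling_accuracy(predictions: list[int], ground_truth: list[int]) -> float:
    """Prediction accuracy over the most recent labeled requests (placeholder metric)."""
    correct = sum(p == y for p, y in zip(predictions, ground_truth))
    return correct / max(len(ground_truth), 1)


def check_serving_performance(predictions, ground_truth, threshold: float = 0.8) -> bool:
    """Return True if retraining should be triggered via the feedback loop (C9 -> scheduler)."""
    accuracy = rolling_accuracy(predictions, ground_truth)
    if accuracy < threshold:
        # In a real setup this would, e.g., emit an alert or call the workflow
        # orchestration component's API to start the retraining pipeline.
        print(f"accuracy {accuracy:.2f} below threshold {threshold}; triggering retraining")
        return True
    return False


# Illustrative usage with made-up recent predictions and delayed ground-truth labels.
check_serving_performance(predictions=[1, 0, 1, 1], ground_truth=[0, 0, 1, 0])
```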
4.3 Roles

Having described the principles and their resulting instantiation as components, we next identify the mandatory roles required to realize MLOps. MLOps is an interdisciplinary group process, and the interplay of different roles is crucial to design, manage, automate, and operate an ML system in production. In the following, every role, its purpose, and related tasks are briefly described:

R1 Business Stakeholder (similar roles: Product Owner, Project Manager). The business stakeholder defines the business goal to be achieved with ML and takes care of the communication side of the business, e.g., presenting the return on investment (ROI) generated with an ML product [17,24,26] [α, β, δ, θ].

R2 Solution Architect (similar role: IT Architect). The solution architect designs the architecture and defines the technologies to be used, following a thorough evaluation [17,27] [α, ζ].

R3 Data Scientist (similar roles: ML Specialist, ML Developer). The data scientist translates the business problem into an ML problem and takes care of the model engineering, including the selection of the best-performing algorithm and hyperparameters [7,14,26,29] [α, β, γ, δ, ε, ζ, η, θ].

R4 Data Engineer (similar role: DataOps Engineer). The data engineer builds up and manages data and feature engineering pipelines. Moreover, this role ensures proper data ingestion to the databases of the feature store system [14,29,41] [α, β, γ, δ, ε, ζ, η, θ].
R5 Software Engineer. The software engineer applies software design patterns, widely accepted coding guidelines, and best practices to turn the raw ML problem into a well-engineered product [29] [α, γ].

R6 DevOps Engineer. The DevOps engineer bridges the gap between development and operations and ensures proper CI/CD automation, ML workflow orchestration, model deployment to production, and monitoring [14–16,26] [α, β, γ, ε, ζ, η, θ].

R7 ML Engineer/MLOps Engineer. The ML engineer or MLOps engineer combines aspects of several roles and thus has cross-domain knowledge. This role incorporates skills from data scientists, data engineers, software engineers, DevOps engineers, and backend engineers (see Figure 3). This cross-domain role builds up and operates the ML infrastructure, manages the automated ML workflow pipelines and model deployment to production, and monitors both the model and the ML infrastructure [14,17,26,29] [α, β, γ, δ, ε, ζ, η, θ].

Figure 3. The ML Engineer/MLOps Engineer as a cross-domain role, combining, among others, the Data Scientist (ML model development) and the Backend Engineer (ML infrastructure management).

5 Architecture and Workflow

(A) MLOps project initiation. (1) The business stakeholder (R1) analyzes the business and identifies a potential business problem that can be solved using ML. (2) The solution architect (R2) defines the architecture design for the overall ML system and decides on the technologies to be used after a thorough evaluation. (3) The data scientist (R3) derives an ML problem—such as whether regression or classification should be used—from the business goal. (4) The data engineer (R4) and the data scientist (R3) work together in an effort to understand which data is required to solve the problem. (5) Once the answers are clarified, the data engineer (R4) and data scientist (R3) collaborate to locate the raw data sources for the initial data analysis. They check the distribution and quality of the data and perform validation checks. Furthermore, they ensure that the incoming data from the data sources is labeled, meaning that a target attribute is known, as this is a mandatory requirement for supervised ML. In this example, the data sources already had labeled data available, as the labeling step was covered during an upstream process.

(B1) Requirements for feature engineering pipeline. The features are the relevant attributes required for model training. After the initial understanding of the raw data and the initial data analysis, the fundamental requirements for the feature engineering pipeline are defined, as follows: (6) The data engineer (R4)
Figure 4. End-to-end MLOps architecture and workflow with functional components and roles
(10) The data preprocessing begins with data transformation and cleaning tasks. The transformation rule artifact defined in the requirement gathering stage serves as input for this task, and the main aim of this task is to bring the data into a usable format. These transformation rules are continuously improved based on the feedback. (11) The feature engineering task calculates new and more advanced features based on other features. The predefined feature engineering rules serve as input for this task. These feature engineering rules are continuously improved based on the feedback. (12) Lastly, a data ingestion job loads batch or streaming data into the feature store system (C4). The target can either be the offline or online database (or any kind of data store).
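Tasks (10) to (12) can be sketched as three plain functions chained into a small batch job; the column names, cleaning rules, and ingestion target below are illustrative assumptions and not part of the described architecture.

```python
import pandas as pd


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """(10) Data transformation and cleaning: bring the data into a usable format."""
    cleaned = raw.dropna(subset=["customer_id", "amount"])   # hypothetical columns
    cleaned = cleaned.assign(amount=cleaned["amount"].clip(lower=0))
    return cleaned


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """(11) Calculate new, more advanced features based on other features."""
    return (
        df.groupby("customer_id")["amount"]
        .agg(["mean", "sum"])
        .rename(columns={"mean": "amount_mean", "sum": "amount_sum"})
        .reset_index()
    )


def ingest(features: pd.DataFrame, offline_path: str) -> None:
    """(12) Data ingestion job: load the feature batch into the offline feature store."""
    features.to_parquet(offline_path, index=False)


# Illustrative run with made-up raw data.
raw_data = pd.DataFrame(
    {"customer_id": ["a", "a", "b", "b"], "amount": [10.0, None, 5.0, 7.5]}
)
ingest(engineer_features(transform(raw_data)), offline_path="features_batch.parquet")
```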
(C) Experimentation. Most tasks in the experimentation stage are led by the data scientist (R3). The data scientist is supported by the software engineer (R5). (13) The data scientist (R3) connects to the feature store system (C4) for the data analysis.
(Alternatively, the data scientist (R3) can also connect to the raw data for an initial analysis.) In case of any required data adjustments, the data scientist (R3) reports the required changes back to the data engineering zone (feedback loop).

(14) Then the preparation and validation of the data coming from the feature store system is required. This task also includes the train and test split dataset creation. (15) The data scientist (R3) estimates the best-performing algorithm and hyperparameters, and the model training is then triggered with the training data (C5). The software engineer (R5) supports the data scientist (R3) in the creation of well-engineered model training code. (16) Different model parameters are tested and validated interactively during several rounds of model training. Once the performance metrics indicate good results, the iterative training stops. The best-performing model parameters are identified via parameter tuning. The model training task and model validation task are then iteratively repeated; together, these tasks can be called "model engineering." The model engineering aims to identify the best-performing algorithm and hyperparameters for the model. (17) The data scientist (R3) exports the model and commits the code to the repository.
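Steps (14) to (17) amount to a train/test split followed by iterative parameter tuning, validation, and model export. The following compact scikit-learn sketch illustrates this "model engineering" loop; the synthetic dataset, algorithm choice, and parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for versioned feature data pulled from the feature store (illustrative).
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# (14) Data preparation and validation, including the train and test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# (15)/(16) Iterative model training and validation ("model engineering"):
# test different parameters and keep the best-performing combination.
param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("best parameters:", search.best_params_)
print("hold-out score:", best_model.score(X_test, y_test))

# (17) The model would then be exported (e.g., serialized) and the code committed
# to the repository so that the CI/CD and workflow pipelines can pick it up.
```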
As a foundational requirement, either the DevOps engineer (R6) or the ML engineer (R7) defines the code for the automated ML workflow pipeline and commits it to the repository (C2). Once either the data scientist (R3) commits a new ML model or the DevOps engineer (R6) and the ML engineer (R7) commit new ML workflow pipeline code to the repository, the CI/CD component (C1) detects the updated code and automatically triggers the CI/CD pipeline carrying out the build, test, and delivery steps. The build step creates artifacts containing the ML model and the tasks of the ML workflow pipeline. The test step validates the ML model and the ML workflow pipeline code. The delivery step pushes the versioned artifact(s)—such as images—to the artifact store (e.g., an image registry).

(D) Automated ML workflow pipeline. The DevOps engineer (R6) and the ML engineer (R7) take care of the management of the automated ML workflow pipeline. They also manage the underlying model training infrastructure in the form of hardware resources and frameworks supporting computation, such as Kubernetes (C5). The workflow orchestration component (C3) orchestrates the tasks of the automated ML workflow pipeline. For each task, the required artifacts (e.g., images) are pulled from the artifact store (e.g., image registry). Each task can be executed via an isolated environment (e.g., containers). Finally, the workflow orchestration component (C3) gathers metadata for each task in the form of logs, completion time, and so on.

Once the automated ML workflow pipeline is triggered, each of the following tasks is managed automatically: (18) automated pulling of the versioned features from the feature store systems (data extraction). Depending on the use case, features are extracted from either the offline or online database (or any kind of data store). (19) Automated data preparation and validation; in addition, the train and test split is defined automatically. (20) Automated final model training on new unseen data (versioned features). The algorithm and hyperparameters are already predefined based on the settings of the previous experimentation stage. The model is retrained and refined. (21) Automated model evaluation and iterative adjustments of hyperparameters are executed, if required. Once the performance metrics indicate good results, the automated iterative training stops. The automated model training task and the automated model validation task can be iteratively repeated until a good result has been achieved. (22) The trained model is then exported and (23) pushed to the model registry (C6), where it is stored, e.g., as code or containerized together with its associated configuration and environment files.
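For steps (22) and (23), the following sketch exports a trained model and registers it, here using MLflow's model registry as one of the registry options named in C6. It assumes an MLflow tracking server with a registry-capable backend; the model, tags, and registry name are illustrative.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the model produced by the automated training task (illustrative).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

with mlflow.start_run(run_name="automated-pipeline-run"):
    # (22) Export the trained model and (23) push it to the model registry,
    # together with lineage information for the ML metadata store (C7).
    mlflow.set_tag("feature_data_version", "features_v1")   # hypothetical data version
    mlflow.set_tag("training_code_version", "git:abc1234")  # hypothetical commit hash
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="demo-model",  # creates or updates the registry entry
    )
# A later step would switch the registered version's status from staging to
# production once it is validated, handing it over to continuous deployment.
```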
For all training job iterations, the ML metadata store (C7) records metadata such as the parameters used to train the model and the resulting performance metrics. This also includes the tracking and logging of the training job ID, training date and time, duration, and sources of artifacts. Additionally, the model-specific metadata called "model lineage," combining the lineage of data and code, is tracked for each newly registered model. This includes the source and version of the feature data and model training code used to train the model. Also, the model version and status (e.g., staging or production-ready) is recorded.

Once the status of a well-performing model is switched from staging to production, it is automatically handed over to the DevOps engineer or ML engineer for model deployment. From there, the (24) CI/CD component (C1) triggers the continuous deployment pipeline. The production-ready ML model and the model serving code (initially prepared by the software engineer (R5)) are pulled. The continuous deployment pipeline carries out the build and test step of the ML model and serving code and deploys the model for production serving. The (25) model serving component (C8) makes predictions on new, unseen data coming from the feature store system (C4). This component can be designed by the software engineer (R5) as online inference for real-time predictions or as batch inference for predictions concerning large volumes of input data. For real-time predictions, features must come from the online database (low latency), whereas for batch predictions, features can be served from the offline database (normal latency). Model-serving applications are often configured within a container, and prediction requests are handled via a REST API. As a foundational requirement, the ML engineer (R7) manages the model-serving computation infrastructure. The (26) monitoring component (C9) continuously observes the model-serving performance and infrastructure in real time. Once a certain threshold is reached, such as detection of low prediction accuracy, the information is forwarded via the feedback loop. The (27) feedback loop is connected to the monitoring component (C9) and ensures fast and direct feedback, allowing for more robust and improved predictions. It enables continuous training, retraining, and improvement. With the support of the feedback loop, information is transferred from the model monitoring component to several upstream receiver points, such as the experimental stage, the data engineering zone, and the scheduler (trigger). The feedback to the experimental stage is taken forward by the data scientist for further model improvements. The feedback to the data engineering zone allows for the adjustment of the features prepared for the feature store system.
Additionally, the detection of concept drifts as a feedback mechanism can enable (28) continuous training. For instance, once the model-monitoring component (C9) detects a drift in the data [3], the information is forwarded to the scheduler, which then triggers the automated ML workflow pipeline for retraining (continuous training). A change in adequacy of the deployed model can be detected using distribution comparisons to identify drift. Retraining is not only triggered automatically when a statistical threshold is reached; it can also be triggered when new feature data is available, or it can be scheduled periodically.
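As a minimal sketch of such a distribution comparison, the following uses a two-sample Kolmogorov-Smirnov test to compare a feature's live distribution against its training-time reference and to signal the scheduler when the deviation is statistically significant. The feature values, significance level, and trigger action are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test: True if the live distribution drifted."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha


# Illustrative data: the live feature values are shifted relative to the training data.
rng = np.random.default_rng(seed=42)
reference_values = rng.normal(loc=0.0, scale=1.0, size=5_000)   # seen during training
live_values = rng.normal(loc=0.5, scale=1.0, size=1_000)        # observed in serving

if detect_drift(reference_values, live_values):
    # (28) Forward the information to the scheduler, which triggers the automated
    # ML workflow pipeline for retraining (continuous training).
    print("data drift detected - trigger retraining pipeline")
```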
6 Conceptualization

With the findings at hand, we conceptualize the literature and interviews. It becomes obvious that the term MLOps is positioned at the intersection of machine learning, software engineering, DevOps, and data engineering (see Figure 5 in the Appendix). We define MLOps as follows:

MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation; workflow orchestration; reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.

7 Open Challenges

Several challenges for adopting MLOps have been identified after conducting the literature review, tool review, and interview study. These open challenges have been organized into the categories of organizational, ML system, and operational challenges.

Organizational challenges. The mindset and culture of data science practice is a typical challenge in organizational settings [2]. As our insights from literature and interviews show, to successfully develop and run ML products, there needs to be a culture shift away from model-driven machine learning toward a product-oriented discipline [γ]. The recent trend of data-centric AI also addresses this aspect by putting more focus on the data-related aspects taking place prior to the ML model building. Especially the roles associated with these activities should have a product-focused perspective when designing ML products [γ]. A great number of skills and individual roles are required for MLOps [β]. As our identified sources point out, there is a lack of highly skilled experts for these roles—especially with regard to architects, data engineers, ML engineers, and DevOps engineers [29,41,44] [α, ε]. This is related to the necessary education of future professionals—as MLOps is typically not part of data science education [7] [γ]. Posoldova (2020) [35] further stresses this aspect by remarking that students should not only learn about model creation, but must also learn about the technologies and components necessary to build functional ML products.

Data scientists alone cannot achieve the goals of MLOps. A multi-disciplinary team is required [14]; thus, MLOps needs to be a group process [α]. This is often hindered because teams work in silos rather than in cooperative setups [α]. Additionally, different knowledge levels and specialized terminologies make communication difficult. To lay the foundations for more fruitful setups, the respective decision-makers need to be convinced that an increased MLOps maturity and a product-focused mindset will yield clear business improvements [γ].

ML system challenges. A major challenge with regard to MLOps systems is designing for fluctuating demand, especially in relation to the process of ML training [7]. This stems from potentially voluminous and varying data [10], which makes it difficult to precisely estimate the necessary infrastructure resources (CPU, RAM, and GPU) and requires a high level of flexibility in terms of scalability of the infrastructure [7,26] [δ].

Operational challenges. In productive settings, it is challenging to operate ML manually due to the different stacks of software and hardware components and their interplay. Therefore, robust automation is required [7,17]. Also, a constant incoming stream of new data forces retraining capabilities. This is a repetitive task which, again, requires a high level of automation [18] [θ]. These repetitive tasks yield a large number of artifacts that require strong governance [24,29,40], as well as versioning of data, model, and code to ensure robustness and reproducibility [11,27,29].

8 Conclusion

With the increase of data availability and analytical capabilities, coupled with the constant pressure to innovate, more machine learning products than ever are being developed. However, only a small number of these proofs of concept progress into deployment and production. Furthermore, the academic space has focused intensively on machine learning model building and benchmarking, but too little on operating complex machine learning systems in real-world scenarios. In the real world, we observe data scientists still managing ML workflows manually to a great extent. The paradigm of Machine Learning Operations (MLOps) addresses these challenges. In this work, we shed more light on MLOps. By conducting a mixed-method study analyzing existing literature and tools, as well as interviewing eight experts from the field, we uncover four main aspects of MLOps: its principles, components, roles, and architecture. From these aspects, we infer a holistic definition. The results support a common understanding of the term MLOps and its associated concepts, and will hopefully assist researchers and professionals in setting up successful ML projects in the future.
Appendix
Table 1. List of evaluated technologies

Airflow: Airflow is a task and workflow orchestration tool, which can also be used for ML workflow orchestration. It is also used for orchestrating data engineering jobs. Tasks are executed according to directed acyclic graphs (DAGs). [26,40,41] [α, β, ζ, η]

MLflow: MLflow is an ML platform that allows for the management of the ML lifecycle end-to-end. It provides an advanced experiment tracking functionality, a model registry, and a model serving component. [11,32,35] [α, γ, ε, ζ, η, θ]

Commercial examples:

Databricks managed MLflow: The Databricks platform offers managed services based on other cloud providers' infrastructure, e.g., managed MLflow. [26,32,35,40] [α, ζ]

Azure DevOps Pipelines: Azure DevOps Pipelines is a CI/CD automation tool to facilitate the build, test, and delivery steps. It also allows one to schedule and manage the different stages of an ML pipeline. [18,42] [γ, ε]

Azure ML: Microsoft Azure offers, in combination with Azure DevOps Pipelines and Azure ML, an end-to-end ML platform. [6,24,25,35,42] [α, γ, ε, ζ, η, θ]

GCP Vertex AI: GCP offers, along with Vertex AI, a fully managed end-to-end platform. In addition, they offer a managed Kubernetes cluster with Kubeflow as a service. [25,35,40,41] [α, γ, δ, ζ, θ]

IBM Cloud Pak for Data (IBM Watson Studio): IBM Cloud Pak for Data combines a list of software in a package that offers data and ML capabilities. [41] [γ]
Figure 5. MLOps at the intersection of machine learning, software engineering, and data engineering (with DevOps, CI/CD pipeline, code, ML model, data, and CD4ML as labeled elements).