License: CC BY 4.0
arXiv:2401.10603v1 [cs.SE] 19 Jan 2024

ZnTrack: Data as Code

Fabian Zills, Institute for Computational Physics, Allmandring 3, 70569 Stuttgart, Germany, [email protected], ORCID 0000-0002-6936-4692
Moritz Schäfer, Institute for Theoretical Chemistry, Pfaffenwaldring 55, 70569 Stuttgart, Germany, ORCID 0000-0001-8474-5808
Samuel Tovey, Institute for Computational Physics, Allmandring 3, 70569 Stuttgart, Germany, ORCID 0000-0001-9537-8361
Johannes Kästner, Institute for Theoretical Chemistry, Pfaffenwaldring 55, 70569 Stuttgart, Germany, ORCID 0000-0001-6178-7669
Christian Holm, Institute for Computational Physics, Allmandring 3, 70569 Stuttgart, Germany, ORCID 0000-0003-2739-310X

Abstract.

The past decade has seen tremendous breakthroughs in computation and there is no indication that this will slow any time soon. Machine learning, large-scale computing resources, and increased industry focus have resulted in rising investments in computer-driven solutions for data management, simulations, and model generation. However, with this growth in computation has come an even larger expansion of data and with it, complexity in data storage, sharing, and tracking. In this work, we introduce ZnTrack, a Python-driven data versioning tool. ZnTrack builds upon established version control systems to provide a user-friendly and easy-to-use interface for tracking parameters in experiments, designing workflows, and storing and sharing data. From this ability to reduce large datasets to a simple Python script emerges the concept of Data as Code, a core component of the work presented here and an undoubtedly important concept as the age of computation continues to evolve. ZnTrack offers an open-source, FAIR data compatible Python package to enable users to harness these concepts of the future.

Data as Code, Data Version Control, Machine Learning, Data Science, FAIR data, Reproducibility, Collaboration

1. Introduction

Large-scale computing is steadily becoming the norm in many industrial and research settings. With the rise of large-scale compute clusters, the computational sciences have grown equally fast in the size and number of experiments being performed on the newly available resources (Shalf, 2020). This has coincided with extraordinary advances in machine learning, which is now ubiquitous in science and industry applications. Nevertheless, the challenge does not end with the capability to implement extensive simulations or machine learning models. With the growing number of simulations and models comes an increase in parameters and workflows that must be tracked and stored efficiently, not only for reproducibility but also for distribution. Furthermore, once a workflow is applied in production, it is often desirable to submit the results to a publication or, more often, share models with other users in one’s community. The challenges in managing this generated data remain an issue in many fields (Allison et al., 2016; Anderson et al., 2007; Peng, 2011; Stoddart, 2016; Klump et al., 2021). Sharing data in a standardized format can be challenging because there can either be many different formats to choose from or none that fulfill all the requirements of new research questions. New formats can be introduced or generalizations attempted, but naming conventions and standardized data formats have a finite degree of complexity that they can handle before becoming overcrowded and unusable (González et al., 2007; Oliveira et al., 2020). When discussing data, it is often overlooked that, in most cases, software was used to generate it in the first place. That is, a simulation script has been written, a configuration file parsed, or command line interface (CLI) arguments have been used to initiate the generation of the data. In this case, it is reasonable to say that all information required to reproduce the data is contained in the code used in its generation.
Therefore, it is a convenient idea to provide a simple interface for sharing this code, along with its results, as data, i.e., to construct an interface capable of sharing Data as Code (DaC).

In this work, we present ZnTrack, a Python package designed to address these requirements. The use of the Python programming language aligns with its popularity in the scientific community, especially in data science and machine learning (JetBrains, 2022). ZnTrack is built on the idea that once code has been written to generate data, no matter how complex the workflow, this is all that is required to reproduce or share the data. Central to this goal is ZnTrack’s construction on top of the Data Version Control (DVC) framework, which provides a convenient interface to treat versions of code, i.e., experiments, as commits or tags alongside potentially large data files inside a single repository driven by the GIT version control system. ZnTrack allows users to store their data alongside the code used to generate it. The usage of a data remote, together with well-established version control infrastructure such as GitHub, GitLab, or Bitbucket, is automatically managed by utilizing DVC. With ZnTrack, we combine the universal applicability of GIT and DVC with a dynamic and flexible interface driven by the Python programming language.

The purpose of this paper is to introduce the ZnTrack software to the community both on a technical level as well as through practical examples. Initially, we introduce the concepts on which ZnTrack is built, these being computational workflow design on graphs, version control with GIT, and the DaC paradigm. Following this overview of the theoretical aspects of the package, the architecture of ZnTrack is presented and discussed along with special mention of certain key features. Finally, two use cases are explored in order to demonstrate the applicability and strengths of this new technology. We showcase how ZnTrack can be used for purely Python-based workflows as well as how it can be expanded to work with other software.

2. Related Work

2.1. Workflow and Data Management

The concept of workflow and data management has been of interest for years. Research groups and organizations have invested in the field on both theoretical and product-driven levels. The result of this investment has been the emergence of general frameworks such as Apache Airflow (Haines, 2022), kedro (Alam et al., 2023), snakemake (Mölder et al., 2021), luigi (Rieger et al., 2017) and others (Fitschen et al., 2019; Luo et al., 2021; Adorf et al., 2018; George and Saha, 2022; Amstutz et al., 2022; Dask Development Team, 2016) which are widely used today. Most of these approaches target large and complex workflows and are for organizations comprising many researchers. In most cases, a specialized setup process is required to interface with or construct a new form of these services. Additionally, the majority of these pre-existing solutions depend heavily on databases hosted locally or on servers. This introduces a significant degree of overhead for new users and typically requires experts in data structures and software engineering for maintenance. Such problems might help to explain why many academic research groups have not adopted these strategies thus far (da Silva et al., 2021; Alam and Roy, 2022). Several groups have realized these shortcomings and developed tools to fill the needs left behind. These include MLFlow (Chen et al., 2020) or wandb.ai (Biewald, 2020) which are now widely known in the academic community and are even integrated into well-known software packages, for example, MACE (Batatia et al., 2022) and NequIP (Batzner et al., 2022) from the quantum chemistry community. In many cases, modern research software groups go to the trouble of integrating their own data management systems into their code as in pymatgen (Ong et al., 2013), MDSuite (Tovey et al., 2023) or pyiron (Janssen et al., 2019). However, even in these cases, what is missing is a unified infrastructure for the storage and sharing of data. 
To this end, products such as DVC (Castro, 2023) have emerged to combine parameter and workflow tracking with established version control tools like GIT.

DVC enables the use of version control tools like GIT for managing large data files. Additionally, it provides a comprehensive workflow management system while remaining compatible with the entire GIT ecosystem. DVC primarily utilizes a CLI and YAML configuration files, making it universally applicable to all file formats and the tools used to generate them. However, this requires users to adapt their code to interface with DVC, which leads to an overhead and might limit code flexibility. To address some of these limitations, DVC offers a Python API for accessing data, although it does not currently provide a public Python API for constructing or distributing workflows.

2.2. Python Interfaces

For the DaC concept, we first look at existing datasets that can be accessed through Python packages. This method of accessing datasets was already available in the early versions of TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019). An example is the MNIST (Lecun et al., 1998a) dataset, which is made available through torchvision, among other packages, and can be downloaded through a simple API call.

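from torchvision import datasets

# root is the local directory for the files; download=True fetches them if missing
trainset = datasets.MNIST(root="data", download=True)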

This concept is further improved by two of the most prominent tools for sharing training data or machine learning models, namely huggingface (Wolf et al., 2020) and kaggle (Kaggle, 2023), both of which provide a simple interface to download and use datasets.

from datasets import load_dataset

dataset = load_dataset("mnist")

They provide not only a simple-to-use Python interface but also a sophisticated website that enables users to search for datasets hosted along with detailed descriptions of the available data, as well as different versions of the datasets if they are available. Beneficial to these tools is the accessibility of metadata and the searchability provided by the website. However, their benefits are limited to a specific set of data made available through their platforms and they do not provide a general solution to all the challenges described above.

2.3. Code-Driven Developments

Datasets or models made available through code can be more easily integrated into workflows. To fully benefit from the code interface, it is necessary to transition from static configuration files to dynamic script-based interfaces. Due to its low barrier of entry and ubiquity in data sciences, the Python programming language is the ideal tool for this task. One successful example of the transition from static configuration files to dynamic script-based interfaces is Infrastructure as Code (IaC) (Howard, 2022). IaC defines computing resources through code, rather than relying on specialized configuration files. This approach provides greater flexibility and allows for the utilization of programming language features, such as loops and conditionals. The scripts can be version-controlled in the same way as traditional configuration files, while being more transferable, allowing for better scaling, and typically being easier to maintain. By automating parts that would be redundant in static configuration files, IaC also reduces the risk of misconfigurations. Finally, the code can be documented in such a way as to make it easier to understand and maintain. This flexibility is already available in workflow managers such as Airflow but is lacking in the data management tools mentioned above. Bringing these concepts together can be seen as a prerequisite for DaC.
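The contrast between a static configuration file and a script-based definition can be sketched in a few lines of Python. The resource names and fields below (`worker-0`, `cpus`, `gpu`) are hypothetical and do not correspond to any particular IaC tool; the sketch only illustrates how a loop and a conditional replace repeated hand-written entries.

```python
import json


def build_resources(n_workers: int, gpu_workers: int = 0) -> list:
    """Generate resource definitions programmatically instead of writing each by hand."""
    resources = []
    for index in range(n_workers):
        resources.append(
            {
                "name": f"worker-{index}",  # hypothetical naming scheme
                "cpus": 4,
                # A conditional expresses what would need separate entries in a static file.
                "gpu": index < gpu_workers,
            }
        )
    return resources


if __name__ == "__main__":
    # The generated structure can still be written out and version-controlled.
    print(json.dumps(build_resources(n_workers=3, gpu_workers=1), indent=2))
```

Changing the number of workers is then a one-argument change rather than an edit to several near-identical configuration entries.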

3. Theory

With this overview of related work in place, we introduce the DaC paradigm, show how it compares to existing solutions, and highlight key differences, focusing on workflow and data management as well as collaboration.

3.1. Version Control with GIT

Version control systems are essential tools in software development, allowing developers to track changes to source code and collaborate with other team members effectively. GIT is a widely used distributed version control system that has gained popularity due to its efficiency, flexibility, and ease of use. GIT uses a decentralized approach, allowing each developer to maintain their local repository and then merge changes with another repository when ready. This approach provides several benefits, including faster processing times, improved collaboration, and the ability to work offline without requiring a connection to a central server. One of the key features of GIT is its support for branching and merging. Developers can create multiple branches of the same codebase, allowing for experimentation, feature development, and bug fixing without affecting the main branch. GIT is compatible with various operating systems, making it easy for developers to work on different platforms. Changes to the codebase are stored in GIT commits. Each commit is assigned a unique identifier, a hash computed from the committed content and its metadata. Because changes between commits can be inspected line by line, GIT is ideal for human-readable file formats. In contrast, large files are unsuitable for tracking with GIT, especially compressed files in which a small change to the content alters the entire file.
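GIT identifiers are content hashes. As a minimal illustration, the sketch below reproduces how GIT, in its default SHA-1 object format, derives the identifier of a file's content (a blob): it hashes a short header followed by the raw bytes. Commit identifiers are computed analogously over the commit object. The helper name `git_blob_id` is ours.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID that GIT assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known ID in every SHA-1 GIT repository.
    print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    # Changing a single character yields an entirely different identifier.
    print(git_blob_id(b"shift: 1.0\n") != git_blob_id(b"shift: 1.1\n"))  # True
```

Since the identifier depends on every byte, a compressed file that changes throughout after a small edit is stored as an entirely new object.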

3.2. Data Version Control

DVC (Castro, 2023) can be employed to address the challenges posed by large data files in GIT repositories.

It achieves data versioning by computing a hash value for each file, and only this hash value is versioned using GIT. When requested, the hash value can be utilized to retrieve the data associated with a specific commit from the data storage. The storage options include local directories as well as remote storage such as object storage, WebDAV, and others (refer to Figure 1).
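This mechanism can be sketched as a content-addressed store. The example below is a simplified illustration and not DVC's implementation: the helper names, the cache layout, the JSON pointer file and the choice of SHA-256 are assumptions made for the sketch. Only the small pointer file would be committed with GIT, while the data itself lives in the cache or on a remote.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy a file into a hash-addressed cache and write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(data_file, cache_dir / digest)
    # Only this pointer (a few bytes) would be tracked by the version control system.
    pointer = data_file.parent / (data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_file.name, "hash": digest}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file referenced by a pointer from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.parent / meta["path"]
    shutil.copy(cache_dir / meta["hash"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir, cache = Path(tmp) / "repo", Path(tmp) / "cache"
        workdir.mkdir()
        data = workdir / "data.bin"
        data.write_bytes(b"large binary payload")
        pointer = add_to_cache(data, cache)
        data.unlink()  # simulate a fresh clone that contains only the pointer
        print(checkout(pointer, cache).read_bytes())
```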

Using GIT in conjunction with DVC not only enables versioning of data but also promotes better collaboration and reproducibility. These benefits make it evident why this combination is advantageous for research.

Figure 1. Combination of GIT and DVC in a local repository connected to a GIT remote and DVC data remote.

Version controlling both data and code in a GIT repository facilitates easy experimentation. By versioning input parameters, code, and outputs, experiments can be defined and stored as commits. Figure 2 illustrates how experiments can be performed in such a way. The GIT and DVC CLIs enable comparison of experiments. The best experiment can be promoted to a commit on a new branch, while changes to model architecture, training data, and parameters are tracked within a single repository. This streamlined approach enhances the efficiency of experimentation and development processes.

Figure 2. Experiment versioning using GIT. Each Experiment represents a detached commit. The best experiment is committed and new experiments are performed based on this commit.

Experiments can be compared using the DVC CLI as well as graphical user interfaces (GUIs) such as the Visual Studio Code DVC extension. The corresponding data files can be accessed through an fsspec-compliant (python, 2023) Python API that supports local and remote storage.

3.3. Computational Graphs

A core concept of modern computation is the representation of a workflow as a computational graph. Within a computational graph, workflows are defined through nodes $N$ that represent variables or operations involved in a computation. The flow of information and the dependencies between nodes are represented by the computational graph’s edges $E$. Therefore, each connected node acts as a function on its inputs and passes its resulting outputs to its successors, as defined by the edges $E$ (Owhadi, 2022; Tinhofer et al., 2012).

$$
\begin{split}
\text{Let } G &= (N, E) \text{ be a graph with a node set} \\
N &= \{1, 2, 3, 4, 5\} \text{ and an edge set} \\
E &= \{(1,2), (1,4), (2,3), (4,5), (5,3)\}.
\end{split}
\tag{1}
$$

Each tuple in the edge set $E$ is a directed connection from one node to another. An edge $(1,2)$ from node 1 to node 2 indicates that the output of node 1 is used as an input to node 2. In other words, node 2 depends on the output of node 1.

Figure 3. Illustration of a DAG.

In the context of data-driven workflows, computational graphs are a powerful tool for managing dependencies, parameters and results. For the remainder of this work, we focus on directed acyclic graphs (DAGs), as illustrated in Figure 3. Connections within a DAG are directional, i.e., information is only passed from the preceding node to the following one, as indicated by the tuples in the edge set $E$. Furthermore, a DAG must not contain cycles or loops (Thulasiraman and Swamy, 1992).
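The example graph of Eq. (1) can be written down directly in Python. The sketch below uses the standard-library `graphlib` module (Python 3.9 or newer) to derive a valid execution order and to show that an edge closing a cycle is rejected; the function and variable names are ours.

```python
from graphlib import CycleError, TopologicalSorter

# Edge set E from Eq. (1): a tuple (a, b) means node b depends on node a.
EDGES = [(1, 2), (1, 4), (2, 3), (4, 5), (5, 3)]


def predecessors(edges):
    """Map every node to the set of nodes it depends on."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set())
        graph.setdefault(target, set()).add(source)
    return graph


def execution_order(edges):
    """Return one valid order in which all nodes can be evaluated."""
    return list(TopologicalSorter(predecessors(edges)).static_order())


if __name__ == "__main__":
    print(execution_order(EDGES))  # node 1 comes first, node 3 comes last
    try:
        execution_order(EDGES + [(3, 1)])  # the edge (3, 1) closes a cycle
    except CycleError:
        print("not a DAG: cycle detected")
```

Nodes 2 and 4 depend only on node 1, so their relative order is not fixed by the graph. Such freedom is what makes parallel execution of independent branches possible.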

Such graphs are broadly applied in machine learning and data analysis (Lecun et al., 1998b; Schulman et al., 2015). Graphs allow for the identification of regions that are independent of each other and can be embarrassingly parallelized or optimized together (Herlihy et al., 2020; Sabne, 2020). They can be used in the context of automatic differentiation to compute gradients and are the basis of many machine learning frameworks such as TensorFlow (Abadi et al., 2015), PyTorch (Paszke et al., 2019) or JAX (Bradbury et al., 2018).

Computational graphs can be elegantly described in makefile-like formats, where the inputs, outputs, and corresponding functions are explicitly defined for each node. Within the context of DVC, a comprehensive toolset is provided to generate and persist these graphs in the YAML file format. DVC leverages the concept of node-specific hash values, computed based on the inputs and outputs associated with each node. This approach enables DVC to efficiently determine when recomputation of results is necessary or if they can be loaded from a cache based on previous runs.
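The caching decision can be illustrated with a small sketch. This is not DVC's implementation; the function names, the in-memory cache and the use of SHA-256 are assumptions for illustration. A fingerprint is computed from a node's parameters and the contents of its dependencies, and the node is executed only if no result is stored under that fingerprint.

```python
import hashlib
import json


def fingerprint(params: dict, deps: dict) -> str:
    """Hash a node's parameters and the contents of its dependencies."""
    hasher = hashlib.sha256()
    hasher.update(json.dumps(params, sort_keys=True).encode())
    for name in sorted(deps):
        hasher.update(name.encode())
        hasher.update(hashlib.sha256(deps[name]).digest())
    return hasher.hexdigest()


def run_cached(func, params: dict, deps: dict, cache: dict):
    """Return (result, recomputed); execute func only for unseen fingerprints."""
    key = fingerprint(params, deps)
    if key in cache:
        return cache[key], False  # loaded from a previous run
    cache[key] = func(params, deps)
    return cache[key], True


def add_shift(params: dict, deps: dict) -> float:
    """Toy node: read a number from a dependency and add a parameter."""
    return float(deps["data.txt"].decode()) + params["shift"]


if __name__ == "__main__":
    cache = {}
    deps = {"data.txt": b"41.0"}
    print(run_cached(add_shift, {"shift": 1.0}, deps, cache))  # (42.0, True)
    print(run_cached(add_shift, {"shift": 1.0}, deps, cache))  # (42.0, False)
    print(run_cached(add_shift, {"shift": 2.0}, deps, cache))  # (43.0, True)
```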

3.4. Data as Code

A key challenge in cooperating on data-driven projects is the availability, accessibility and documentation of data sources. Even given a data repository that provides easy access to the data, a collaborator is still required to understand the chosen data format. Making changes to the data requires even more in-depth knowledge about its construction, and essential parts of the computational workflow are often not shared alongside the data. In the ideal scenario, the data itself contains all of this information. With DaC, we build on the idea that a single entity is responsible for the generation, storage and interface of data. It should be possible to version and share this entity with collaborators. Such a framework is illustrated in Figure 4.

Figure 4. Illustration of the DaC paradigm.

The idea is to provide access to data by sharing the code that was used to generate it. Metadata then no longer has to be attached in an arbitrary format, because it is an integral component of the code itself. By combining data generation and interfacing within the same entity, we ensure that the data remains readily accessible. Additionally, documentation is consolidated in a single location for both data generation and interface. Compatibility can still be ensured by abstracting the interface and adhering to existing ontologies and code formats, which are altered only if they are insufficient for the given task.

In this code-centered data paradigm, multiple single-responsibility entities can be assembled to construct a computational graph, facilitating workflow management. These entities serve as the nodes on the computational graph. To enhance usability, when a user runs the shared code, cached files are loaded and made available instead of re-executing the data generation process. This approach promotes both workflow management and reproducibility, making them core features of the DaC concept. Drawing inspiration from the transformative impact of version control on software development (Brindescu et al., 2014), DaC provides a viable pathway towards achieving findable, accessible, interoperable, and re-usable (FAIR) data standards.
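A minimal sketch of such an entity in plain Python is shown below. It does not use ZnTrack; the class name, the JSON cache file and the attributes are hypothetical. One class carries the parameter, the generating code and the access interface, and it loads a cached result instead of regenerating it.

```python
import json
import tempfile
from pathlib import Path


class SquaresDataset:
    """Generates, stores and exposes a small dataset through one interface."""

    def __init__(self, n_points: int, cache_dir):
        self.n_points = n_points  # parameter that fully determines the data
        self.cache_file = Path(cache_dir) / f"squares_{n_points}.json"
        self.generated = False  # records whether this instance ran the generation

    def _generate(self) -> list:
        self.generated = True
        return [x**2 for x in range(self.n_points)]

    @property
    def values(self) -> list:
        """Return the data, generating it only if no cached copy exists."""
        if not self.cache_file.exists():
            self.cache_file.write_text(json.dumps(self._generate()))
        return json.loads(self.cache_file.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = SquaresDataset(n_points=4, cache_dir=tmp)
        print(first.values, first.generated)
        # A second user running the same code reuses the cached file.
        second = SquaresDataset(n_points=4, cache_dir=tmp)
        print(second.values, second.generated)
```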

3.5. FAIR Data and Code Style

For successful collaboration in data-driven research, it is important to agree on a set of principles that guide the development. Introduced in 2016, the importance of FAIR data (Wilkinson et al., 2016) is more relevant than ever. With the increasing amount of data that is generated, it is important to make data accessible and reusable. In addition to the application of the DaC principles to achieve FAIR data, good code practices are also essential. Using an object-oriented approach, the SOLID principles are a good starting point for high-quality code design. These principles recommend designing software in a way that is easy to maintain and extend (Martin, 2017). The SOLID principles refer to a set of guidelines, namely single responsibility (Martin, 2003), open-closed (Meyer, 1988), Liskov substitution (Liskov, 1987), interface segregation (Martin, 2003) and dependency inversion (Martin, 2003). In addition to these theoretical code design aspects, the use of opinionated code formatters and linters, along with detailed in-code documentation, can enhance collaboration.

4. Architecture

In this section, we will introduce the architecture of the ZnTrack package. We will highlight the concept of DaC and its implementation in ZnTrack. Furthermore, we will describe the different features of the package in more detail and provide exemplary code snippets.

4.1. ZnTrack

To realize the DaC paradigm, data must be interfaced through code. For this, an interface through a programming language is needed. DVC and GIT, while useful for data and code versioning respectively, provide mostly CLI tools to generate a DaC infrastructure but do not fulfill all requirements to be considered DaC on their own.

ZnTrack builds on the synergy between GIT and DVC to create a cohesive framework for establishing a unified repository where data and code coexist. Embracing the DaC paradigm, ZnTrack empowers developers to effortlessly integrate data interfaces and code components in Python. Through the definition of abstract dependency attributes and outputs, the code facilitates type checking and serves as a comprehensive documentation resource for both data and code.

Additional benefits include the ability to easily share code and data with collaborators through platforms such as GitHub or GitLab. These platforms provide further infrastructure for managing issues, community discussions and code review. Furthermore, whilst most current workflow management systems require a database, using GIT as the underlying framework allows for serverless and distributed workflow management. This simplifies the migration of existing projects to a DaC workflow. Users benefit doubly from this, as setup and maintenance efforts are drastically decreased.

The core components of the ZnTrack package are split according to the components of a computational graph:

Nodes:

The Node base class is the single interface to the code and the data. All data generation and data access are handled through this class. This includes not only the input data but also parameters, metrics, plots and the produced output data. Having a single interface responsible for one task follows the SOLID principles described in Section 3.5. Furthermore, it allows for documentation of the code and data in a single place. Data can be tested alongside the code using common testing frameworks.

Edges:

The Node defines the expected dependency types but the connections between the Nodes are handled by the ZnTrack Project. The Project acts as the interface to the computational graph. It handles the GIT and DVC commands and gives access to experiment handling. Through the Project interface, Nodes can also be organized in groups for a better overview.

To set up a DaC Project with ZnTrack, only the initialization of a repository with GIT and DVC is required. This server-less setup makes it easy to create new projects with minimal overhead. Later on, the projects can be pushed to a server to share the results with collaborators.

Structuring experiments in a very specific way can come with an overhead of time and might not be easily adopted by scientists from different areas. To ensure a user-friendly introduction, ZnTrack provides all necessary tools to create a DaC infrastructure through a Python interface. Minimal knowledge of GIT and DVC is sufficient to track experiments in this way. Additionally, ZnTrack enables the development of new tools that inherently incorporate FAIR data management, parameter tracking, and distributed computing.

4.2. Defining Nodes

As described in Section 3, a Node is a single step in a computational workflow. It is defined by a set of parameters and a function that is executed when the Node is run.

At its core, a Node can be described as a single function at the highest level of the control flow, which we refer to as the computational graph from here on. In the ZnTrack package, Nodes can be defined as Python functions or classes. In this work, our primary focus will be on the class-based approach. The main advantage of using classes, as opposed to functions, is that they are stateful in this context, making it easier to access results.

By using classes to define Nodes, we can store the parameters of the Node as class attributes. This capability allows us to conveniently access these parameters later on and establish a direct connection between the parameters, the data, and the function that is executed.

The Node class features an abstract run method, which is intended to be overridden by the user. This allows the user to define the specific functionality that should be executed when the Node is run.

To facilitate ease of use, parameters, dependencies, and outputs are defined as class attributes. These attributes are leveraged to automatically generate the class constructor. This automation simplifies the process of initializing and configuring Node instances. It also allows for class inheritance and abstract classes, which can be used to define a common interface for a group of Nodes. The inputs and outputs to a Node that are available in this way are summarized in Figure 5. We differentiate between manual data serialisation (MDS) and automatic data serialisation (ADS). Attributes suffixed with _path are MDS attributes, i.e. they are file paths. The way data is stored at these locations is entirely up to the user. In contrast, attributes without the _path suffix are ADS attributes. These attributes are serialized by the ZnTrack package and provide a convenient way to store and access data.
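Generating a constructor from class-level declarations is the same idea that Python's standard-library `dataclasses` module implements. The sketch below uses it purely as an analogy; it is not how ZnTrack is implemented, and the class names, field names and the `role` metadata key are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class BaseStep:
    """Common interface shared by a group of steps."""

    name: str

    def run(self) -> None:
        raise NotImplementedError


@dataclass
class ShiftStep(BaseStep):
    # Declared attributes become keyword arguments of the generated __init__.
    shift: float = 0.0
    # Metadata marks the role of an attribute; outputs are excluded from __init__.
    result: Optional[float] = field(
        default=None, init=False, metadata={"role": "output"}
    )

    def run(self) -> None:
        self.result = 41.0 + self.shift


if __name__ == "__main__":
    step = ShiftStep(name="demo", shift=1.0)
    step.run()
    print(step.result)  # 42.0
    print([f.name for f in fields(step) if f.metadata.get("role") == "output"])
```

Because the declarations are ordinary class attributes, subclasses inherit them, which is what allows a shared base class to define a common interface.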

Figure 5. The inputs and outputs of a node are split into MDS and ADS attributes. The MDS attributes describe file paths whilst the ADS attributes can contain arbitrary data and are managed by ZnTrack.

Managing data from different commits or streaming data from the data remote is built into the ADS attributes, whilst for MDS attributes the user must make use of the Node.state.fs filesystem interface. The Node can then be deserialized based on any commit in the repository. Alongside the Node all data from the specific commit will also be available. With the availability of the code, a Node can also be deserialized anywhere by supplying it with a link to the repository.

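import zntrack


class MyNode(zntrack.Node):
    # MDS input: path to a file whose format is up to the user
    data: str = zntrack.deps_path()
    # ADS output: serialized and stored by ZnTrack
    result: float = zntrack.outs()
    # Parameter tracked alongside the Node
    shift: float = zntrack.params()

    def run(self) -> None:
        # Read a number from the input file and add the shift parameter
        with open(self.data, "r") as f:
            self.result = float(f.read()) + self.shift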

A simple example can be illustrated with the class MyNode. The Node takes data as an input file and adds a parameter called shift to the data value read from the file. Finally, the Node serializes the output variable result of the computation to disk. It is now possible to use MyNode as a normal Python class and test its functionality.

4.3. Defining Graphs

To make use of ZnTrack and DVC features, we need to create a graph and add an instance of MyNode to it. Therefore, we instantiate a project, which manages the graph. Within the project’s context manager, connections between Nodes are defined and parameters are set. This code will only define the graph but not execute it. All this information will be serialized and stored in human-readable configuration files that can be tracked via GIT.

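import zntrack

# MyNode2 stands for a second Node class that takes a dependency named `value`
with zntrack.Project() as project:
    node1 = MyNode(data="data.txt", shift=1.0)
    node2 = MyNode2(value=node1.result)

# Serialize the graph definition; nothing is executed at this point
project.build()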

4.4. Parallelization, Deployment and Interaction

To execute the graph built within zntrack.Project, multiple options are available. All of these options share the common feature that a single command is all that’s needed to assess the entire graph and execute only the necessary parts, i.e. only run the Nodes with changed dependencies that haven’t been executed yet. This design essentially turns each Node in the graph into a checkpoint, facilitating straightforward debugging and rerunning.
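Which Nodes need to be rerun after a change follows from the graph structure: a changed Node invalidates everything downstream of it, while unrelated branches keep their cached results. The sketch below illustrates this on the example graph of Section 3.3. It is a simplified illustration with our own function names, not the logic DVC uses internally.

```python
from collections import deque

# Example edge set: a tuple (a, b) means node b depends on node a.
EDGES = [(1, 2), (1, 4), (2, 3), (4, 5), (5, 3)]


def stale_nodes(edges, changed):
    """Return the changed nodes together with all nodes reachable from them."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
    stale, queue = set(changed), deque(changed)
    while queue:
        node = queue.popleft()
        for child in successors.get(node, ()):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


if __name__ == "__main__":
    print(sorted(stale_nodes(EDGES, changed=[4])))  # [3, 4, 5]
    print(sorted(stale_nodes(EDGES, changed=[1])))  # [1, 2, 3, 4, 5]
```

After a change to node 4, nodes 1 and 2 are untouched and act as checkpoints from which the remaining nodes continue.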

The DVC package only allows for sequential execution of the computational graph. As described in Section 3.3 the computational graph can be used to parallelize the execution of Nodes that are capable of running in parallel. Therefore, we developed an additional graph executor that uses the Dask (Dask Development Team, 2016) package to parallelize the execution of the DVC workflow. Our dask4dvc package enables the graph to be run in parallel and allows us to easily deploy the graph on a cluster, using the distributed package. This allows for efficient utilization of computational resources when executing the graph.
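The scheduling idea can be sketched with the standard library alone. The example below is neither dask4dvc nor Dask; it is a stand-in that uses `graphlib` and a thread pool to start every Node as soon as its dependencies have finished, so that independent branches of the example graph from Section 3.3 can run concurrently. The `work` function is a placeholder.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# node -> set of nodes it depends on
GRAPH = {1: set(), 2: {1}, 4: {1}, 5: {4}, 3: {2, 5}}


def work(node):
    """Placeholder for the computation performed by a node."""
    return node


def run_parallel(graph, max_workers=4):
    """Execute all nodes, starting each one once its dependencies are done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    finished, running = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every node whose dependencies have all completed.
            for node in sorter.get_ready():
                running[pool.submit(work, node)] = node
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                node = running.pop(future)
                future.result()  # re-raise any error from the worker
                finished.append(node)
                sorter.done(node)  # unlocks the successors of this node
    return finished


if __name__ == "__main__":
    print(run_parallel(GRAPH))
```

The completion order may vary between runs for independent nodes, but a node is never started before all of its dependencies are marked as done.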

Parallelization of a single graph is often not the only requirement for a scientific workflow. In many scenarios, the same workflow must be executed with different parameters. The concept of running the graph with different parameters is called an experiment in DVC. In DVC, experiments are run using a queueing system based on Celery and Hydra.

dvc exp run --queue -S MyNode.shift=1,2,3,4
dvc exp run --queue -S MyNode.shift=range(5, 10)

The code above shows how to queue experiments with different parameters. If more flexibility is required, ZnTrack enables experimentation through Python, allowing for greater control and customization in the process.

# my_node refers to a MyNode instance that is part of the project graph
for x in range(5):
    with project.create_experiment() as exp:
        my_node.shift = x

The experiments can be executed from within Python or using one of the aforementioned graph executors. This approach can also be combined with parameter optimization libraries such as Optuna (Akiba et al., 2019) or Ray Tune (Liaw et al., 2018) to automate the process of finding optimal parameters whilst keeping all results and avoiding redundant computations.
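The benefit of keeping all results while avoiding redundant computations can be illustrated with a plain-Python parameter sweep. The sketch below uses neither ZnTrack, Optuna nor Ray Tune; it performs an exhaustive grid search as a stand-in for an optimizer, and the `objective` function and parameter names are hypothetical.

```python
import itertools


def objective(shift: float, scale: float) -> float:
    """Placeholder for an expensive experiment; lower values are better."""
    return (shift - 2.0) ** 2 + (scale - 1.0) ** 2


def sweep(grid: dict, results: dict) -> tuple:
    """Evaluate all parameter combinations, skipping those already in results."""
    names = sorted(grid)
    n_new = 0
    for combo in itertools.product(*(grid[name] for name in names)):
        if combo not in results:
            results[combo] = objective(**dict(zip(names, combo)))
            n_new += 1
    best = min(results, key=results.get)
    return dict(zip(names, best)), n_new


if __name__ == "__main__":
    results = {}  # persists across sweeps, like results stored in a repository
    print(sweep({"shift": [1.0, 2.0, 3.0], "scale": [0.5, 1.0]}, results))
    # Extending the grid only evaluates the two combinations that are new.
    print(sweep({"shift": [1.0, 2.0, 3.0, 4.0], "scale": [0.5, 1.0]}, results))
```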

Another powerful tool that can be brought to different branches of scientific research is the automatic deployment of predefined tasks such as model training or data analysis via continuous integration (CI) setups (Iterative, 2023a). Changes in a parameter file can automatically trigger e.g. training of a new machine learning model using a CI setup. In this way, complex pipelines can technically be initiated with a click of a button even from mobile devices. Another example would be the automatic analysis of experimental results through a CI pipeline. If a screening through many experiments with recurring analysis is required this can be automated and parallelized easily.

Finally, all of these steps are inherently fully reproducible, because all of the data is present in the repository or connected data storage. This makes validating experiments much easier and fulfills the FAIR principles whilst simplifying the process of storing experiment parameters and results in the same way. This can even be achieved on a multitude of computing hardware, utilizing containerization on Nodes.

4.5. Analyzing Results

A key task when running multiple experiments is the analysis and comparison of results. There are multiple ways to achieve this. The foremost is to use the DVC CLI to compare the results of different experiments. This allows us to easily compare predefined metrics and plots.

dvc metrics show
dvc metrics diff

A more convenient way to gather results is to load the full Node instance into a Python kernel. This makes the parameters, together with the results, available through the Node attributes. To avoid loading data into memory that was not requested, all attributes are lazily evaluated: the data is only loaded when the attribute is accessed.

my_node.load()
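The lazy evaluation described above can be illustrated with a small toy class. It is not a `zntrack.Node` and does not reflect how ZnTrack implements lazy loading internally; `LazyNode`, `expensive_read`, and the attribute name `thermo` are chosen here only for the example.

```python
from functools import cached_property

load_count = 0


def expensive_read() -> list[float]:
    """Hypothetical stand-in for reading a large result file from disk."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0]


class LazyNode:
    """Toy class: the data is read on first attribute access only."""

    def __init__(self, shift: float) -> None:
        self.shift = shift  # cheap parameter, always available

    @cached_property
    def thermo(self) -> list[float]:
        return expensive_read()


node = LazyNode(shift=1.0)
before = load_count   # nothing has been read yet
first = node.thermo   # triggers the read
second = node.thermo  # served from the cached value
```

Creating the instance costs nothing beyond storing the parameters; the read happens once, on the first access to `thermo`.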

If the instance is not present or multiple versions should be compared, one can also load the results from a specific revision and remote. Possible revisions are the commit hash or a tag, as well as DVC experiment names.

my_node = zntrack.from_rev(
    "MyNode",
    rev="v2",
    remote="https://github.com/user/repo"
)

With this approach, arbitrary analysis can be performed on the data. Ultimately, an analysis node can be written with multiple nodes - even from different revisions or remotes - as dependencies and the analysis can be performed on the graph level.

4.6. Collaboration

Providing results to the scientific community and collaborating during the development of a project is also facilitated by the DaC paradigm. Although all of the described principles can be used offline, their full potential is only realized when used in a collaborative environment. DaC can be realized by sharing repositories on a platform like GitHub or GitLab. Furthermore, specialized platforms such as Iterative Studio (Iterative, 2023b) or DagsHub (dagshub, 2023) allow more streamlined access to data and provide graphical user interfaces to run and compare results. All of these tools are compatible with ZnTrack.

Additionally, Nodes written with ZnTrack can be shared as Python packages and made available, e.g. through PyPI, building a FAIR data and DaC infrastructure. This allows others to access models and data easily. As data and models are stored together, they are automatically compatible and can easily be used and adapted by others in their own projects.

This becomes even more important when considering that many new software applications will rely on data-driven methods and often require the inclusion of machine learning models. These models can be too large to be effectively managed through GIT version control but still necessitate seamless integration into the codebase. By adhering to DaC principles, one can avoid the need to version control the model and code separately, thus mitigating the risk of incompatibilities.
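One way to keep a large file out of the version-controlled history while still pinning its exact content is to commit only a small pointer that contains a content hash. The sketch below illustrates this general idea; it is not DVC's actual file format or cache layout, and the pointer file name, its JSON fields, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into a hash-addressed cache and write a small pointer file."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / digest).write_bytes(data)
    pointer = path.parent / (path.name + ".pointer.json")  # hypothetical format
    pointer.write_text(json.dumps({"path": path.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> bytes:
    """Look up the data that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    return (cache_dir / meta["sha256"]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    model = workdir / "model.pkl"
    model.write_bytes(b"\x00" * 1024)  # stand-in for a large binary file
    pointer = track(model, workdir / "cache")
    pointer_size = pointer.stat().st_size
    roundtrip = restore(pointer, workdir / "cache") == model.read_bytes()
```

Only the pointer file would be committed alongside the code, so a given commit always refers to exactly one version of the model without storing the binary in the repository itself.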

5. Demonstration

5.1. Atomistic Simulation

The ZnTrack package was developed with a strong focus on atomistic simulations. These often require experimentation and can be computationally expensive. Therefore, keeping track of all the experiments and their results is crucial. In this section, we will demonstrate how ZnTrack can be used to manage and deploy computational graphs for such simulations. Furthermore, we will introduce new ways of sharing simulation data and making results accessible to and reproducible by other researchers.

5.1.1. Molecular Dynamics Simulation

To illustrate the process, a molecular dynamics (MD) simulation for a system of molten sodium chloride is deployed. The example’s workflow consists of generating a 3D structure from SMILES (Weininger, 1988), generating a simulation box, and running the actual simulation. We use the rdkit package (Landrum et al., 2023) together with packmol (Martínez et al., 2009) to generate the structure and LAMMPS (Thompson et al., 2022) to run the simulation.

We run multiple experiments for simulation temperatures from 1000 K to 1800 K. Each simulation contains 1000 ion pairs and is simulated for 100 ps. Interactions are modeled by a Born-Mayer-Huggins-Tosi-Fumi (BMHTF) potential (Pan et al., 2016; Fumi and Tosi, 1964; Huggins and Mayer, 1933) and the simulations are analysed using ASE (Larsen et al., 2017). The experiments are queued using Hydra parameter composition and deployed in parallel using dask4dvc. Using the DVC extension for Visual Studio Code, we can follow the progress of the experiments, e.g. the temperature, in real time.

dvc exp run --queue -S "lammps.yaml:temperature=range(1000,1800,50)"
dask4dvc run --config config.yaml

Figure 6. Parallel execution of multiple experiments.

After an initial structure has been generated, the respective nodes that generate the box are cached and the MD simulations are executed in parallel. Figure 6 visualizes the parts of the experiments that are only executed once and the parts that are executed for each experiment with different parameters.

For further comparison, the results of each experiment can be visualized using the DVC plot feature. A more detailed analysis can be performed using the ZnTrack package. Here, we can easily iterate over the experiments and load the results without making changes to the workspace. Figure 7 shows the density of the systems as well as the radial distribution function (RDF) g(r), in which temperature-dependent structural changes can be seen.

Figure 7. Density and RDF of molten sodium chloride at different temperatures. The RDFs are displayed for the Na-Cl pair.
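For reference, one common definition of the partial RDF between species a and b (here Na and Cl) is given below. The formula is not stated in the text above; the symbols V (volume of the simulation box), N_a and N_b (number of particles of each species), and r_i (particle positions) are introduced here for illustration only.

```latex
g_{ab}(r) = \frac{V}{N_a N_b} \, \frac{1}{4 \pi r^2}
  \left\langle \sum_{i \in a} \sum_{j \in b}
  \delta\!\left( r - \lvert \mathbf{r}_i - \mathbf{r}_j \rvert \right) \right\rangle
```

With this normalization, g_{ab}(r) approaches 1 at large distances in a homogeneous liquid, so deviations from 1 at short range reflect the local structure around each ion.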

Every experiment is referenced by a unique identifier and is associated with a commit. All experiments are made available through the GIT repository and can be listed using dvc exp list <remote>. The ZnTrack package allows us to load the experiments directly. The only requirement for this to work is that the respective package for the Lammps node is installed. This node can then be inspected or connected to another computational graph.

node = zntrack.from_rev("Lammps", rev="exp1")
print(node.thermo)

5.1.2. Structural Database Repository

We have demonstrated how to use ZnTrack to run an atomistic simulation using LAMMPS and a BMHTF potential. In recent years, machine learning has become a popular alternative for modeling interatomic interactions. These models are often trained on data from first-principles simulations. Such datasets are frequently shared in a way that either does not allow versioning or inflates the size of the repository, because large binary files that are not designed to be stored via GIT are edited repeatedly (Eastman et al., 2023; Xiao et al., 2017). Many of these datasets are designed as benchmarks for machine learning models and therefore should not be altered. They are also used to train production models (Ramakrishnan et al., 2014; Chmiela et al., 2017; Smith et al., 2017).

With the structural database repository (SDR), we introduce a GIT repository with a public DVC remote that can be used to access, comment on, and expand or improve first-principles simulation data. All information required to reproduce the data is available through this repository. Data is stored in the H5MD format (Buyl et al., 2014) and can be accessed either directly or through ZnTrack using ASE (Larsen et al., 2017).

import ase
import zntrack

node = zntrack.from_rev(
    "BMIM_BF4_x10",
    remote="https://github.com/user/repo",
    rev="water"
)
node.atoms: list[ase.Atoms]

Furthermore, tools can be built directly on top of ZnTrack to make working with the data even easier. We provide the ZnDraw package that can visualize not only most molecular dynamics trajectory files but also access the SDR directly as shown in Figure 8.

zndraw BMIM_BF4_x10-x10.atoms --remote https://github.com/user/repo --rev BMIM-X

Figure 8. The room temperature ionic liquid BMIM-BF₄ visualized from the SDR using ZnDraw.

5.2. Random Forest Classifier

ZnTrack can also be used for more general machine learning applications. We will showcase this by implementing a machine learning method for a binary classification task. To this end, we will demonstrate the training of a random forest classifier (Breiman, 2001) adapted from the DVC example repository (Iterative, 2023c) in ZnTrack. This example utilizes a dataset consisting of StackOverflow questions and their respective tags.

Figure 9. Computational graph for training a random forest classifier.

The code is organized into distinct steps, as illustrated in Figure 9. The initial step involves extracting data from the raw data file and converting it into a suitable format. Following this, the data undergoes featurization using a bag-of-words approach, and it is subsequently divided into separate training and test sets. The features extracted from the training set are then employed to train a random forest classifier. Lastly, the model’s performance is evaluated using the unseen test set. The code is implemented in Python and uses scikit-learn (Pedregosa et al., 2011), numpy (Harris et al., 2020), pandas (pandas development team, 2020) and matplotlib (Hunter, 2007).

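import sys

import yaml

# Read the parameters of the "train" stage from the shared parameter file.
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

inputs = sys.argv[1]
output = sys.argv[2]
seed = params["seed"]
n_est = params["n_est"]
min_split = params["min_split"]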

We have converted the example code (see above) to adhere to the DaC paradigm by using ZnTrack (see code below). For parts of the workflow with the same functionality, the conversion reduced both the number of lines of code and the McCabe complexity (McCabe, 1976). Adapting code to the DaC paradigm through ZnTrack often only requires defining parameters, dependencies, and outputs as class attributes, whilst the remaining code can be kept almost unchanged.

import zntrack


class Train(zntrack.Node):
    min_split: float = zntrack.params(0.01)
    n_est: int = zntrack.params(50)
    seed: int = zntrack.params(20170428)

    features: str = zntrack.deps_path("data/features")
    model: str = zntrack.outs_path("model.pkl")

This is further supported by the similarity analysis shown in Figure 10. In this analysis, we compute TF-IDF features (Manning et al., 2008) for both the original code and the code written with ZnTrack and then calculate the cosine similarity between them. The results indicate that, despite the adaptations, the new code remains similar to the original. This strengthens the argument that ZnTrack can be used to convert existing code to the DaC paradigm while keeping many of the original code’s characteristics, so that the conversion requires only minimal, non-invasive changes.

Figure 10. Similarity between the original code and the adapted code for DaC using ZnTrack. The similarity is calculated using the cosine similarity between the TF-IDF features (Manning et al., 2008) as implemented in scikit-learn (Pedregosa et al., 2011).
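The similarity measure can be sketched in plain Python. The example below is a simplified stand-in for the scikit-learn implementation referenced in the caption: the smoothed IDF formula follows the default documented for scikit-learn's `TfidfVectorizer`, whereas the regular-expression tokenizer and the two short code snippets are assumptions made for illustration and are not the inputs used for Figure 10.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split source text into lowercase identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """TF-IDF with smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two hypothetical snippets standing in for the original and the adapted code.
original = 'seed = params["seed"]\nn_est = params["n_est"]'
adapted = "seed: int = zntrack.params(20170428)\nn_est: int = zntrack.params(50)"

vec_a, vec_b = tfidf_vectors([original, adapted])
similarity = cosine(vec_a, vec_b)
```

A value of 1 would mean that both snippets use the same tokens in the same proportions, and a value of 0 that they share no tokens at all. The snippets above share the identifiers `seed`, `n_est`, and `params`, so the result lies strictly between the two extremes.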

5.3. Data Availability

The described packages and conducted experiments are publicly available under the Apache-2.0 license.

6. Conclusion

In this work, we introduced the DaC concept, which simplifies access, sharing, and version control of data and workflows. DaC shifts the interface between data and code from end-users to developers and provides it alongside the code. We present the ZnTrack package, which facilitates DaC implementation in Python.

We demonstrated how DaC can be applied in atomistic simulations, allowing for easy parallel execution of multiple simulations and the FAIR sharing and version control of large datasets. Furthermore, we showed how packages can be built to interface with other DaC tools, as briefly demonstrated by the ZnTrack interface of the ZnDraw visualizer. Fully reproducible workflows support entry into data-driven fields, such as machine learning and atomistic simulations.

Leveraging DVC and GIT, ZnTrack offers easy integration of DaC into the research software and data lifecycle. It accommodates both newcomers and experienced researchers, providing minimal overhead for converting existing workflows.

As an open-source package, ZnTrack also serves as a flexible workflow framework. It enables DaC within Python, promoting the close integration of code and associated data. This strengthens FAIR data practices and fosters collaborations within and beyond research groups. Publishing research software, datasets, and experiments on platforms like GitHub or GitLab can contribute to the growth of open science, as seen in open-source research software.

Acknowledgements

C.H., J.K., F.Z., and M.S. acknowledge support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) in the framework of the priority program SPP 2363, “Utilization and Development of Machine Learning for Molecular Applications - Molecular Machine Learning”, Project No. 497249646. S.T. was supported by an LGF stipend of the state of Baden-Württemberg. Further funding through the DFG under Germany’s Excellence Strategy - EXC 2075 - 390740016 and the Stuttgart Center for Simulation Science (SimTech) was provided. All authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/
  • Adorf et al. (2018) Carl S. Adorf, Paul M. Dodd, Vyas Ramasubramani, and Sharon C. Glotzer. 2018. Simple Data and Workflow Management with the Signac Framework. Computational Materials Science 146 (April 2018), 220–229. https://doi.org/10.1016/j.commatsci.2018.01.035
  • Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv:1907.10902 [cs.LG]
  • Alam and Roy (2022) Khairul Alam and Banani Roy. 2022. Challenges of Provenance in Scientific Workflow Management Systems. In 2022 IEEE/ACM Workshop on Workflows in Support of Large-Scale Science (WORKS) (2022-11). IEEE, Dallas, Texas, 10–18. https://doi.org/10.1109/WORKS56498.2022.00007
  • Alam et al. (2023) Sajid Alam, Nok Lam Chan, Yetunde Dada, Ivan Danov, Deepyaman Datta, Tynan DeBold, Jannic Holzer, Stephanie Kaiser, Rashida Kanchwala, Ankita Katiyar, Amanda Koh, Andrew Mackay, Ahdra Merali, Antony Milne, Huong Nguyen, Nero Okwa, Juan Luis Cano Rodríguez, Joel Schwarzmann, Jo Stichbury, and Merel Theisen. 2023. Kedro. kedro-org. https://github.com/kedro-org/kedro
  • Allison et al. (2016) David B. Allison, Andrew W. Brown, Brandon J. George, and Kathryn A. Kaiser. 2016. Reproducibility: A Tragedy of Errors. Nature 530, 7588 (2016), 27–29. Issue 7588. https://doi.org/10.1038/530027a
  • Amstutz et al. (2022) Peter Amstutz, Maxim Mikheev, Michael R. Crusoe, Nebojša Tijanić, Samuel Lampa, et al. 2022. Existing Workflow Systems. Common Workflow Language wiki, GitHub. https://s.apache.org/existing-workflow-systems
  • Anderson et al. (2007) Nicholas R. Anderson, E. Sally Lee, J. Scott Brockenbrough, Mark E. Minie, Sherrilynne Fuller, James Brinkley, and Peter Tarczy-Hornoch. 2007. Issues in Biomedical Research Data Management and Analysis: Needs and Barriers. JAMIA 14, 4 (2007), 478–488. https://doi.org/10.1197/jamia.M2114 arXiv:17460139
  • Batatia et al. (2022) Ilyes Batatia, Dávid Péter Kovács, Gregor N. C. Simm, Christoph Ortner, and Gábor Csányi. 2022. MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. Advances in Neural Information Processing Systems 35 (Dec. 2022), 11423–11436.
  • Batzner et al. (2022) Simon Batzner, Albert Musaelian, Lixin Sun, Mario Geiger, Jonathan P. Mailoa, Mordechai Kornbluth, Nicola Molinari, Tess E. Smidt, and Boris Kozinsky. 2022. E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials. Nature Communications 13, 1 (Dec. 2022), 2453. https://doi.org/10.1038/s41467-022-29939-5 arXiv:2101.03164 [cond-mat, physics:physics]
  • Biewald (2020) Lukas Biewald. 2020. Experiment Tracking with Weights and Biases. https://www.wandb.com/ Software available from wandb.com.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. Google LLC. http://github.com/google/jax
  • Breiman (2001) Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. https://doi.org/10.1023/A:1010933404324
  • Brindescu et al. (2014) Caius Brindescu, Mihai Codoban, Sergii Shmarkatiuk, and Danny Dig. 2014. How Do Centralized and Distributed Version Control Systems Impact Software Changes?. In Proceedings of the 36th International Conference on Software Engineering (2014-05-31) (ICSE 2014). Association for Computing Machinery, New York, NY, USA, 322–333. https://doi.org/10.1145/2568225.2568322
  • Buyl et al. (2014) Pierre Buyl, Peter H. Colberg, and Felix Höfling. 2014. H5MD: A Structured, Efficient, and Portable File Format for Molecular Data. Computer Physics Communications 185, 6 (2014), 1546–1553. https://doi.org/10.1016/j.cpc.2014.01.018
  • Castro (2023) David de la Iglesia Castro. 2023. DVC: Data Version Control - Git for Data & Models. https://doi.org/10.5281/zenodo.7886036
  • Chen et al. (2020) Andrew Chen, Andy Chow, Aaron Davidson, Arjun DCunha, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Clemens Mewald, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Avesh Singh, Fen Xie, Matei Zaharia, Richard Zang, Juntai Zheng, and Corey Zumar. 2020. Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle. In Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning (DEEM’20). Association for Computing Machinery, New York, NY, USA, 1–4. https://doi.org/10.1145/3399579.3399867
  • Chmiela et al. (2017) Stefan Chmiela, Alexandre Tkatchenko, Huziel E. Sauceda, Igor Poltavsky, Kristof T. Schütt, and Klaus-Robert Müller. 2017. Machine learning of accurate energy-conserving molecular force fields. Science Advances 3, 5 (2017), e1603015. https://doi.org/10.1126/sciadv.1603015
  • da Silva et al. (2021) Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Ilkay Altintas, Rosa M Badia, Bartosz Balis, Tainã Coleman, Frederik Coppens, Frank Di Natale, Bjoern Enders, Thomas Fahringer, Rosa Filgueira, Grigori Fursin, Daniel Garijo, Carole Goble, Dorran Howell, Shantenu Jha, Daniel S. Katz, Daniel Laney, Ulf Leser, Maciej Malawski, Kshitij Mehta, Loïc Pottier, Jonathan Ozik, J. Luc Peterson, Lavanya Ramakrishnan, Stian Soiland-Reyes, Douglas Thain, and Matthew Wolf. 2021. A Community Roadmap for Scientific Workflows Research and Development. In 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS) (2021-11). IEEE, St. Louis, Missouri, USA, 81–90. https://doi.org/10.1109/WORKS54523.2021.00016
  • dagshub (2023) dagshub 2023. Open Source Data Science Collaboration - DagsHub. dagshub. https://dagshub.com/
  • Dask Development Team (2016) Dask Development Team. 2016. Dask: Library for Dynamic Task Scheduling. dask.org. https://dask.org
  • Eastman et al. (2023) Peter Eastman, Pavan Kumar Behara, David L. Dotson, Raimondas Galvelis, John E. Herr, Josh T. Horton, Yuezhi Mao, John D. Chodera, Benjamin P. Pritchard, Yuanqing Wang, Gianni De Fabritiis, and Thomas E. Markland. 2023. SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. Scientific Data 10, 1 (2023), 11. https://doi.org/10.1038/s41597-022-01882-6
  • Fitschen et al. (2019) Timm Fitschen, Alexander Schlemmer, Daniel Hornung, Henrik tom Wörden, Ulrich Parlitz, and Stefan Luther. 2019. CaosDB—Research Data Management for Complex, Changing, and Automated Research Workflows. Data 4, 2 (June 2019), 83. https://doi.org/10.3390/data4020083
  • Fumi and Tosi (1964) F.G. Fumi and M.P. Tosi. 1964. Journal of Physics and Chemistry of Solids 25 (1964), 43.
  • George and Saha (2022) Johnu George and Amit Saha. 2022. End-to-End Machine Learning Using Kubeflow. In 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD) (CODS-COMAD 2022). Association for Computing Machinery, New York, NY, USA, 336–338. https://doi.org/10.1145/3493700.3493768
  • González et al. (2007) M. González, F. González, A. Luaces, and J. Cuadrado. 2007. Interoperability and Neutral Data Formats in Multibody System Simulation. Multibody System Dynamics 18, 1 (2007), 59–72. https://doi.org/10.1007/s11044-007-9060-8
  • Haines (2022) Scott Haines. 2022. Workflow Orchestration with Apache Airflow. In Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications, Scott Haines (Ed.). Apress, Berkeley, CA, 255–295. https://doi.org/10.1007/978-1-4842-7452-1_8
  • Harris et al. (2020) Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362. https://doi.org/10.1038/s41586-020-2649-2
  • Herlihy et al. (2020) Maurice Herlihy, Nir Shavit, Victor Luchangco, and Michael Spear. 2020. The Art of Multiprocessor Programming. Newnes. arXiv:7MqcBAAAQBAJ
  • Howard (2022) Michael Howard. 2022. Terraform – Automating Infrastructure as a Service. https://doi.org/10.48550/arXiv.2205.10676 arXiv:2205.10676 [cs]
  • Huggins and Mayer (1933) M.L. Huggins and J.E. Mayer. 1933. Journal of Chemical Physics 1 (1933), 643.
  • Hunter (2007) J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
  • Iterative (2023a) Iterative. 2023a. Continuous Machine Learning (CML) is CI/CD for Machine Learning Projects. https://cml.dev/
  • Iterative (2023b) Iterative. 2023b. Iterative Studio. Iterative. https://studio.iterative.ai
  • Iterative (2023c) Iterative 2023c. Iterative/Example-Get-Started: Get Started DVC Project. https://github.com/iterative/example-get-started
  • Janssen et al. (2019) Jan Janssen, Sudarsan Surendralal, Yury Lysogorskiy, Mira Todorova, Tilmann Hickel, Ralf Drautz, and Jörg Neugebauer. 2019. pyiron: An integrated development environment for computational materials science. Computational Materials Science 163 (2019), 24–36. https://doi.org/10.1016/j.commatsci.2018.07.043
  • JetBrains (2022) JetBrains. 2022. Python Developers Survey 2022. https://lp.jetbrains.com/python-developers-survey-2022
  • Kaggle (2023) Kaggle 2023. Kaggle: Your Machine Learning and Data Science Community. https://www.kaggle.com/
  • Klump et al. (2021) Jens Klump, Lesley Wyborn, Mingfang Wu, Julia Martin, Robert R. Downs, and Ari Asmi. 2021. Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles. Data Science Journal 20, 1 (2021), 12. Issue 1. https://doi.org/10.5334/dsj-2021-012
  • Landrum et al. (2023) Greg Landrum, Paolo Tosco, Brian Kelley, Ric, David Cosgrove, sriniker, gedeck, Riccardo Vianello, NadineSchneider, Eisuke Kawashima, Dan N, Gareth Jones, Andrew Dalke, Brian Cole, Matt Swain, Samo Turk, AlexanderSavelyev, Alain Vaucher, Maciej Wójcikowski, Ichiru Take, Daniel Probst, Kazuya Ujihara, Vincent F. Scalfani, guillaume godin, Juuso Lehtivarjo, Axel Pahl, Rachel Walker, Francois Berenger, jasondbiggs, and strets123. 2023. Rdkit/Rdkit: 2023_03_2 (Q1 2023) Release. Zenodo. https://doi.org/10.5281/zenodo.8053810
  • Larsen et al. (2017) Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist, Ivano E. Castelli, Rune Christensen, Marcin Dułak, Jesper Friis, Michael N. Groves, Bjørk Hammer, Cory Hargus, Eric D. Hermes, Paul C. Jennings, Peter Bjerre Jensen, James Kermode, John R. Kitchin, Esben Leonhard Kolsbjerg, Joseph Kubal, Kristen Kaasbjerg, Steen Lysgaard, Jón Bergmann Maronsson, Tristan Maxson, Thomas Olsen, Lars Pastewka, Andrew Peterson, Carsten Rostgaard, Jakob Schiøtz, Ole Schütt, Mikkel Strange, Kristian S. Thygesen, Tejs Vegge, Lasse Vilhelmsen, Michael Walter, Zhenhua Zeng, and Karsten W. Jacobsen. 2017. The Atomic Simulation Environment—a Python Library for Working with Atoms. Journal of Physics: Condensed Matter 29, 27 (2017), 273002. https://doi.org/10.1088/1361-648X/aa680e
  • Lecun et al. (1998a) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998a. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 86, 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791
  • Lecun et al. (1998b) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998b. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 86, 11 (Nov. 1998), 2278–2324. https://doi.org/10.1109/5.726791
  • Liaw et al. (2018) Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. 2018. Tune: A Research Platform for Distributed Model Selection and Training. arXiv:1807.05118 [cs.LG]
  • Liskov (1987) Barbara Liskov. 1987. Keynote Address - Data Abstraction and Hierarchy. ACM SIGPLAN Notices 23, 5 (1987), 17–34. https://doi.org/10.1145/62139.62141
  • Luo et al. (2021) Zhaojing Luo, Sai Ho Yeung, Meihui Zhang, Kaiping Zheng, Lei Zhu, Gang Chen, Feiyi Fan, Qian Lin, Kee Yuan Ngiam, and Beng Chin Ooi. 2021. MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). 1655–1666. https://doi.org/10.1109/ICDE51399.2021.00146
  • Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
  • Martin (2003) Robert Cecil Martin. 2003. Agile Software Development: Principles, Patterns, and Practices. Prentice Hall PTR.
  • Martin (2017) Robert C. Martin. 2017. Clean Architecture: A Craftsman’s Guide to Software Structure and Design. Prentice Hall.
  • Martínez et al. (2009) L. Martínez, R. Andrade, E. G. Birgin, and J. M. Martínez. 2009. PACKMOL: A Package for Building Initial Configurations for Molecular Dynamics Simulations. Journal of Computational Chemistry 30, 13 (2009), 2157–2164. https://doi.org/10.1002/jcc.21224 arXiv:19229944
  • McCabe (1976) T.J. McCabe. 1976. A Complexity Measure. IEEE Transactions on Software Engineering SE-2, 4 (1976), 308–320. https://doi.org/10.1109/TSE.1976.233837
  • Meyer (1988) Bertrand Meyer. 1988. Object-Oriented Software Construction. Prentice Hall.
  • Mölder et al. (2021) Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B. Hall, Christopher H. Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O. Twardziok, Alexander Kanitz, Andreas Wilm, Manuel Holtgrewe, Sven Rahmann, Sven Nahnsen, and Johannes Köster. 2021. Sustainable Data Analysis with Snakemake. F1000Research 10 (April 2021), 33. https://doi.org/10.12688/f1000research.29032.2
  • Oliveira et al. (2020) Micael J. T. Oliveira, Nick Papior, Yann Pouillon, Volker Blum, Emilio Artacho, Damien Caliste, Fabiano Corsetti, Stefano de Gironcoli, Alin M. Elena, Alberto García, Víctor M. García-Suárez, Luigi Genovese, William P. Huhn, Georg Huhs, Sebastian Kokott, Emine Küçükbenli, Ask H. Larsen, Alfio Lazzaro, Irina V. Lebedeva, Yingzhou Li, David López-Durán, Pablo López-Tarifa, Martin Lüders, Miguel A. L. Marques, Jan Minar, Stephan Mohr, Arash A. Mostofi, Alan O’Cais, Mike C. Payne, Thomas Ruh, Daniel G. A. Smith, José M. Soler, David A. Strubbe, Nicolas Tancogne-Dejean, Dominic Tildesley, Marc Torrent, and Victor Wen-zhe Yu. 2020. The CECAM Electronic Structure Library and the Modular Software Development Paradigm. The Journal of Chemical Physics 153, 2 (2020), 024117. https://doi.org/10.1063/5.0012901
  • Ong et al. (2013) Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. 2013. Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science 68 (2013), 314–319. https://doi.org/10.1016/j.commatsci.2012.10.028
  • Owhadi (2022) Houman Owhadi. 2022. Computational Graph Completion. https://doi.org/10.48550/arXiv.2110.10323 arXiv:2110.10323 [cs, math, stat]
  • Pan et al. (2016) Ge ChuanQi Pan, Jing Ding, Weilong Wang, Jianfeng Lu, Jiang Li, and Xiaolan Wei. 2016. Molecular Simulations of the Thermal and Transport Properties of Alkali Chloride Salts for High-Temperature Thermal Energy Storage. International Journal of Heat and Mass Transfer 103 (Dec. 2016), 417–427. https://doi.org/10.1016/j.ijheatmasstransfer.2016.07.042
  • pandas development team (2020) The pandas development team. 2020. pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. https://doi.org/10.48550/arXiv.1912.01703 arXiv:1912.01703 [cs, stat]
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Peng (2011) Roger D. Peng. 2011. Reproducible Research in Computational Science. Science 334, 6060 (2011), 1226–1227. https://doi.org/10.1126/science.1213847
  • python (2023) python filesystem spec 2023. Filesystem_spec. python filesystem spec. https://github.com/fsspec/filesystem_spec
  • Ramakrishnan et al. (2014) Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Lilienfeld. 2014. Quantum Chemistry Structures and Properties of 134 Kilo Molecules. Scientific Data 1 (2014), 140022. https://doi.org/10.6084/m9.figshare.c.978904.v5
  • Rieger et al. (2017) Marcel Rieger, Martin Erdmann, Benjamin Fischer, and Robert Fischer. 2017. Design and Execution of Make-like, Distributed Analyses Based on Spotify’s Pipelining Package Luigi. https://doi.org/10.48550/arXiv.1706.00955 arXiv:1706.00955 [physics]
  • Sabne (2020) Amit Sabne. 2020. XLA: Compiling Machine Learning for Peak Performance.
  • Schulman et al. (2015) John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient Estimation Using Stochastic Computation Graphs. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/hash/de03beffeed9da5f3639a621bcab5dd4-Abstract.html
  • Shalf (2020) John Shalf. 2020. The Future of Computing beyond Moore’s Law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378, 2166 (Jan. 2020), 20190061. https://doi.org/10.1098/rsta.2019.0061
  • Smith et al. (2017) Justin S. Smith, Olexandr Isayev, and Adrian E. Roitberg. 2017. ANI-1, A Data Set of 20 Million Calculated off-Equilibrium Conformations for Organic Molecules. Scientific Data 4, 1 (2017), 170193. https://doi.org/10.1038/sdata.2017.193
  • Stoddart (2016) Charlotte Stoddart. 2016. Is There a Reproducibility Crisis in Science? Nature (2016). https://doi.org/10.1038/d41586-019-00067-3
  • Thompson et al. (2022) A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton. 2022. LAMMPS - a Flexible Simulation Tool for Particle-Based Materials Modeling at the Atomic, Meso, and Continuum Scales. Comp. Phys. Comm. 271 (2022), 108171. https://doi.org/10.1016/j.cpc.2021.108171
  • Thulasiraman and Swamy (1992) K. Thulasiraman and M. N. S. Swamy. 1992. Graphs: Theory and Algorithms. John Wiley and Sons.
  • Tinhofer et al. (2012) Gottfried Tinhofer, Rudolf Albrecht, Ernst Mayr, Hartmut Noltemeier, and Maciej M. Syslo. 2012. Computational Graph Theory. Springer Science & Business Media.
  • Tovey et al. (2023) Samuel Tovey, Fabian Zills, Francisco Torres-Herrador, Christoph Lohrmann, Marco Brückner, and Christian Holm. 2023. MDSuite: Comprehensive Post-Processing Tool for Particle Simulations. Journal of Cheminformatics 15, 1 (11 Feb. 2023), 19.
  • Weininger (1988) David Weininger. 1988. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 28, 1 (1988), 31–36. https://doi.org/10.1021/ci00057a005
  • Wilkinson et al. (2016) Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A. C. ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 2016. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data 3, 1 (2016), 160018. https://doi.org/10.1038/sdata.2016.18
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Online, 2020-10). Association for Computational Linguistics, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747 [cs.LG]