1. Introduction
The structural arrangement of atoms in matter and the nature of their chemical bonds can be deciphered by employing X-ray crystallography. This experimental technique has been pivotal in structural biology (see, e.g., [1, 2, 3] and references therein) and to date accounts for approximately 85% of the structures released in the Protein Data Bank (calculated using data from [4, 5]). X-ray crystallography requires samples to be present in the form of crystals: periodic repetitions of a unique unit cell, which give rise to Bragg peaks upon exposure to X-rays. These encode part of the information needed to reconstruct the electron density of the sample. In a simplified description, assuming that the kinematic approximation holds, the overall intensity of Bragg peaks increases with (i) the squared number of unit cells in the crystal and its degree of perfection, and (ii) the number of photons interacting with the sample¹. The size and quality of crystals can only be controlled to a limited extent, partly because realistic near-physiological conditions must be maintained when biological compounds are investigated. The key to elucidating increasingly complex macromolecules, e.g., from hemoglobin [7] to the ribosome [8], which has been instrumental in advancing structural biology, has thus been the exploitation of increasingly advanced photon sources. The photon flux generated by modern sources, which are based either on storage rings or on X-ray free electron lasers (XFELs), has enabled the collection of diffraction data at up to ångström resolution from ever smaller crystals, down to sub-micrometer size at XFELs as a result of their exceptional brilliance. The interaction of crystals with intense X-ray beams results in several physical processes that can lead to permanent radiation damage (see, e.g., [9]). Classical crystallography experiments consist of rotation scans that repeatedly expose the same region of the crystal, which consequently accumulates damage during data acquisition. To mitigate this, cryogenic conditions can be employed to hamper the processes of radical formation from photoelectrons [10]. Another strategy consists in collecting data from fresh regions of the sample, either of the same crystal or of different ones, in a serial fashion (see, e.g., [11]). The latter paradigm is adopted at XFELs, as a single X-ray pulse typically deposits enough energy to completely destroy the sample [12]. However, the temporal duration of pulses is short enough – of the order of several femtoseconds – that signals from almost undamaged crystals can be detected, as the timescale of the ion movement is much longer². At XFELs, crystals are continuously replaced, typically using liquid jets or movable fixed-target stages, and intercept the X-ray beam with a certain probability (the so-called hit rate). This technique is termed serial femtosecond crystallography (SFX) [14, 15, 16, 17, 18]. Furthermore, owing to the femtosecond duration of their pulses, XFELs are exceptional tools for time-resolved investigations at resolutions not achievable with photon sources based on storage rings [13, 19, 20, 21, 22, 23].
In parallel to the development of X-ray sources and instrumentation, the computer hardware and software used to process data from raw diffraction images to the final structural model have evolved in several respects, including the development of novel crystallographic methods, the use of parallel processing methods, and the design of graphical interfaces to facilitate and automate data processing. The processing of X-ray crystallography data commences with the reduction of raw detector frames to a unique set of structure factors [24]. This includes finding Bragg peaks, indexing them, integrating pixel intensities in three dimensions, and averaging the symmetry-equivalent reflection observations with proper scaling. These steps have been implemented in popular software packages such as XDS [25], Mosflm [26], or DIALS [27]. Owing to the nature of data collection, serial crystallography requires different algorithms and approaches, and includes a hit-finding step for identifying detector frames containing the signature of diffraction, which is a prerequisite for indexing the reciprocal lattice [17]. For this case too, dedicated software suites have been developed in the last decade. Notably, DIALS has been extended to process serial crystallography data [28], and the CrystFEL suite [29] has been developed. Subsequent data analysis, from structure factors to the final atomic model, requires software for crystallographic phasing, model building from the derived electron density, and model refinement and validation³.
The entire crystallography data processing pipeline consists of the sequential execution of several tasks, often performed by different software programs. As the input and output data formats, as well as the user experience, might differ greatly across tools, several software pipelines have been developed with the aim of abstracting complexity and increasing analysis throughput and the level of automation [30, 31, 32, 33]. In fact, X-ray crystallography beamlines at storage rings are exceptional examples of sophisticated ecosystems including state-of-the-art robotics, information systems and processing tools, allowing for comprehensive automation [30, 34, 35]. Such simplification empowers inexperienced users to focus on scientific questions. It should be pointed out, however, that the need for expert knowledge persists in demanding cases, e.g., when the diffraction signal is particularly weak or extensive parameter optimization is required.
Several challenges are intrinsic to serial crystallography at XFELs, such as pulse-to-pulse jitter of the X-ray beam in space, wavelength, and energy, which are typically reflected in the amount of diagnostics necessary to interpret the outcome of the experiment. Additionally, the rate and amount of data collected to solve the scientific problem under investigation, as well as the often-complicated nature of custom-built detectors, further complicate processing and interpretation [15, 17, 36]. For example, the European XFEL (EuXFEL) [37, 38] generates up to 27,000 X-ray pulses per second. A fraction of these is collected by multi-modular pixelized area detectors, such as the Adaptive-Gain Integrating Pixel Detector (AGIPD, up to 3,520 frames or 14 GiB per second) [39], the Large Pixel Detector (LPD, up to 5,120 frames or 10 GiB per second) [40], and the JUNGFRAU detector (up to 160 frames or 1.9 GiB per second) [41], which are synchronized with X-ray pulses. For technical reasons, the data acquisition system stores each detector module separately. In particular, predefined sequences of data from each module are stored in different HDF5⁴ files in the EuXFEL data format (EXDF), which might be a significant barrier for several users, and which currently cannot be used directly by popular software like CrystFEL. Additionally, the sheer volume of the data to be analysed makes workflows practically unfeasible unless distributed computing on high-performance computing (HPC) clusters is employed, the use of which is an additional burden on scientists. Finally, photon sources like EuXFEL enable investigations of ultrafast processes, which are often performed by utilizing some form of excitation of the sample (the so-called pump) and then probing the induced molecular dynamics with the XFEL beam. The cost of this is tedious bookkeeping of the data frame subsets, given the pump-probe patterns and verification of correct time sampling, and the same applies in general to diagnostic means.
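To put the detector figures quoted above into perspective, the following minimal sketch derives the average size of one stored frame and the fraction of the 27,000 pulses per second that each detector can record. It uses only the (rounded) numbers given in the text; the function and variable names are illustrative and not part of any EuXFEL software.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
PULSES_PER_SECOND = 27_000  # maximum EuXFEL pulse rate quoted in the text

# (frames per second, GiB per second), as quoted in the text
DETECTORS = {
    "AGIPD": (3520, 14.0),
    "LPD": (5120, 10.0),
    "JUNGFRAU": (160, 1.9),
}


def frame_size_mib(frames_per_s, gib_per_s):
    """Average size of a single frame in MiB."""
    return gib_per_s * GIB / frames_per_s / MIB


def recorded_fraction(frames_per_s):
    """Fraction of the delivered pulses that can be recorded."""
    return frames_per_s / PULSES_PER_SECOND


for name, (fps, rate) in DETECTORS.items():
    size = frame_size_mib(fps, rate)
    share = 100 * recorded_fraction(fps)
    print(f"{name}: {size:.1f} MiB per frame, {share:.1f}% of pulses")
```

With these inputs a single AGIPD frame amounts to roughly 4 MiB and a single LPD frame to 2 MiB, which illustrates why even a short run quickly reaches a data volume that calls for distributed processing.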
With the aim of abstracting as much complexity as possible so as to allow scientists to focus on their biological question, we developed EXtra-Xwiz [43]. It allows for a high degree of automation of data analysis workflows through its integration with other services provided at EuXFEL. In this paper, we introduce EXtra-Xwiz and discuss its current status and future goals. In Section 2 we describe the EXtra-Xwiz design and architecture, followed by a step-by-step tutorial with an example of processing SFX data in Section 3. Finally, we give an outlook on planned extensions to the pipeline in Section 4.
3. Data processing with EXtra-Xwiz by example
This section describes an example of processing SFX data from hen egg-white lysozyme (HEWL) microcrystals collected at the SPB/SFX instrument using the AGIPD detector (run 30, proposal 700000). Only basic knowledge of Unix commands and the Unix environment is expected from the reader. Lines starting with a "$" indicate commands which should be executed in a Unix shell. Readers who are not users of the European XFEL can use the Virtual Infrastructure for Scientific Analysis (VISA) [56] service; additional instructions are provided in Section 3.2.
Access to the Maxwell cluster is exclusive to EuXFEL users, and detailed instructions on how to connect to it can be found in the EuXFEL Data Analysis user documentation [57]. In general, a user with an active account can connect to one of the interactive cluster nodes:
$ ssh <user name>@max-exfl-display.desy.de
To start using EXtra-Xwiz, a dedicated module has to be loaded on the cluster with:
$ module load exfel EXtra-xwiz/crystals2023
As mentioned in Section 2.1, EXtra-Xwiz requires a configuration file in TOML format [58] for its operation⁵. It should be named "xwiz_conf.toml", and a template of such a file can be generated by starting the pipeline for the first time in an empty folder with the following command:
$ xwiz-workflow
The configuration file contains parameters for each of the pipeline processing steps, organized into sections such as "[data]", "[geom]", "[unit_cell]", "[indexamajig_run]", and "[merging]". A copy of the configuration file used in this example, along with all other files necessary for the pipeline execution, can be downloaded from [47].
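Before launching the pipeline, it can be useful to check that a hand-edited configuration still parses as TOML and contains the sections named above. The following is a minimal sketch of such a check and is not part of EXtra-Xwiz: the helper missing_sections and the assumption that all five sections are mandatory are illustrative, and the example values are taken from this tutorial.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: third-party backport on older Pythons

# Assumption for illustration: these five sections are treated as mandatory.
REQUIRED_SECTIONS = ("data", "geom", "unit_cell", "indexamajig_run", "merging")

EXAMPLE_CONFIG = """
[data]
proposal = 700000
runs = [30]

[geom]
file_path = "agipd_p700000_r0030.geom"

[unit_cell]
file_path = "hewl.cell"

[indexamajig_run]
peak_method = "peakfinder8"

[merging]
point_group = "422"
"""


def missing_sections(config_text, required=REQUIRED_SECTIONS):
    """Parse TOML text and return the required sections that are absent."""
    config = tomllib.loads(config_text)  # raises an error on invalid TOML
    return [name for name in required if name not in config]


config = tomllib.loads(EXAMPLE_CONFIG)
print(config["data"]["runs"])            # list of runs to process
print(missing_sections(EXAMPLE_CONFIG))  # empty list if nothing is missing
```

For a file on disk, the text can be read with pathlib.Path("xwiz_conf.toml").read_text() and passed to the same helper.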
Data to be processed by the pipeline can be specified with just a proposal number and a list of runs in the "[data]" section of the configuration file:
[data]
proposal = 700000
runs = [30]
It is possible to select a subset of frames from each run with an optional frames_range parameter in the same section, for example:
frames_range = {start = 0, end = 200000, step = 1}
The values of this parameter are organized into a dictionary analogous to the Python range object, except that the end value is inclusive; end = -1 denotes the last frame of the run.
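The range semantics described above can be sketched in a few lines of Python. This is an illustration rather than the EXtra-Xwiz implementation: the function name expand_range, the default values for omitted keys, and the clamping of end to the run length are assumptions.

```python
def expand_range(spec, n_frames):
    """Expand a {start, end, step} dictionary into frame indices.

    Unlike Python's range(), the end value is inclusive, and end = -1
    stands for the last frame of the run.
    """
    start = spec.get("start", 0)  # assumed default
    end = spec.get("end", -1)     # assumed default
    step = spec.get("step", 1)    # assumed default
    if end == -1:
        end = n_frames - 1
    end = min(end, n_frames - 1)  # assumption: never run past the last frame
    return range(start, end + 1, step)


# The end value is included in the selection.
print(list(expand_range({"start": 0, "end": 4, "step": 2}, n_frames=10)))
# end = -1 selects up to the last frame of a (hypothetical) 10-frame run.
print(list(expand_range({"start": 7, "end": -1, "step": 1}, n_frames=10)))
```

Under this reading, the frames_range example above selects 200,001 frames (indices 0 to 200000), one more than the equivalent Python range(0, 200000) would.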
For the purpose of reproducibility of the SFX analysis, EXtra-Xwiz supports several versions of the CrystFEL suite, which can be selected in the "[crystfel]" section:
[crystfel]
version = '0.10.2'
Currently, recent major CrystFEL versions are available, as well as a "maxwell_dev" option which corresponds to the constantly updated installation of the latest CrystFEL version.
Geometry and unit cell parameter files should be provided to the pipeline in the "[geom]" and "[unit_cell]" sections, respectively:
[geom]
file_path = "agipd_p700000_r0030.geom"
[unit_cell]
file_path = "hewl.cell"
A representative detector frame is shown in Figure 2(b), where the detector modules are positioned in the laboratory frame layout using the EXtra-geom library [60].
If the unit cell of the sample is not known prior to the analysis, it can be determined during processing by setting the option file_path = "none", as described in Section 2.1. During the EXtra-Xwiz session, after the indexing step, an interactive cell_explorer session will start. At this point, the user is expected to determine the cell parameters from the histograms of indexing results (as explained in [48]) and save them into a unit cell file, which will be requested by EXtra-Xwiz. This procedure is illustrated in Figure 3.
Parameters for Bragg peak identification and indexing with the indexamajig program have to be specified in the "[indexamajig_run]" configuration block:
[indexamajig_run]
resolution = 4.0
peak_method = "peakfinder8"
peak_threshold = 800
peak_snr = 5
index_method = "mosflm"
integration_radii = "2,3,5"
...
min_peaks = 10
extra_options = "--no-non-hits-in-stream"
Documentation regarding all indexamajig options can be found at [46]. In the current state of EXtra-Xwiz, not all of these options are covered by the default configuration file parameters; any such option required for data processing can be specified in the string of the extra_options parameter. Data processing with indexamajig is the most time-consuming step of the whole pipeline, but the computations are usually performed in parallel on multiple nodes of the Maxwell cluster. The cluster partition, the number of nodes to use in parallel, and the maximum expected duration of the individual jobs should be specified under the "[slurm]" section of the configuration file:
[slurm]
partition = "upex"
n_nodes_all = 20
duration_all = "10:00:00"
For testing the pipeline on a small subset of data (e.g., a hundred frames) without using the Slurm job scheduler, it is advised to select the "local" partition. In this case EXtra-Xwiz will run indexamajig on the same node on which the pipeline is running.
Reflection intensities obtained from the indexing of Bragg peaks are merged and post-refined with the partialator tool; the required parameters have to be specified under the "[merging]" block of the EXtra-Xwiz configuration:
[merging]
point_group = "422"
scaling_model = "unity"
scaling_iterations = 1
max_adu = 100000
Point groups corresponding to the symmetry groups of crystallized samples can be identified with the table in the CrystFEL documentation [61].
In the case of time-resolved SFX experiments, pump on (sample illuminated with the "pump" laser) and pump off (sample in the non-excited state) frames are processed in the same manner and separated only at the merging step with partialator. As already mentioned in Section 2.1, for such separation partialator requires an additional input file labelling each frame of the input data accordingly. EXtra-Xwiz can generate such a file according to the parameters specified in the "[partialator_split]" block. Let us assume that the machine delivers pulses at only 1/8th of the 4.5 MHz maximum repetition rate and the detector is configured to record only these pulses. The sample is illuminated by infrared light at every third delivered pulse (i.e., every 24th pulse assuming 4.5 MHz operation), resulting in the following labels: "pump_on pump_off pump_off pump_on ...". This can be set in the configuration as:
[partialator_split]
execute = true
mode = "by_pulse_id"
[partialator_split.manual_datasets]
pump_on = {start=0, end=-1, step=24}
pump_off = [{start=8, step=24}, {start=16, step=24}]
Any user-defined set of labels can be specified with a corresponding list of inclusive range-like dictionary objects or pulse id values. Usually, in time-resolved experiments a diode is used to record data relative to the state of the pump laser. EXtra-Xwiz can utilize the signal from this diode and automatically generate labels accordingly if the mode parameter is set to either "on_off" or "on_off_numbered":
[partialator_split]
execute = true
mode = "on_off_numbered"
xray_signal = ["SPB_LAS_SYS/ADC/UTC1-1:channel_0.output", "data.rawData"]
laser_signal = ["SPB_LAS_SYS/ADC/UTC1-1:channel_1.output", "data.rawData"]
The difference between the "on_off_numbered" and the "on_off" mode is that in the former case consecutive events of the same kind (e.g., pump off) are identified by an increasing number. For the example given above, the "on_off_numbered" labels are "on_1 off_1 off_2 on_1 ...". Paths to the diode data specified for xray_signal and laser_signal are provided by beamline scientists. As the data used in this tutorial do not originate from a pump-probe experiment, splitting into datasets is not meaningful and should be avoided by either setting execute = false or simply removing the "[partialator_split]" block from the configuration file.
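The two labelling schemes described above can be illustrated with a short Python sketch. It reproduces the labels quoted in the text for the 1/8th repetition rate example, but it is not the EXtra-Xwiz implementation: the function names, the treatment of a missing end key as "no upper bound", and the numbering rule (restart at 1 whenever the state changes) are assumptions.

```python
def in_range(value, spec):
    """True if value belongs to the inclusive {start, end, step} range."""
    start = spec.get("start", 0)
    end = spec.get("end", -1)  # assumption: -1 or missing means no upper bound
    step = spec.get("step", 1)
    if value < start or (end != -1 and value > end):
        return False
    return (value - start) % step == 0


def label_by_pulse_id(pulse_ids, datasets):
    """Assign to each pulse id the first dataset whose ranges contain it."""
    labels = []
    for pulse_id in pulse_ids:
        label = None  # stays None if no dataset matches
        for name, specs in datasets.items():
            if any(in_range(pulse_id, spec) for spec in specs):
                label = name
                break
        labels.append(label)
    return labels


def number_consecutive(states):
    """Turn on/off states into numbered labels such as on_1, off_1, off_2."""
    labels, previous, count = [], None, 0
    for state in states:
        count = count + 1 if state == previous else 1
        labels.append(f"{state}_{count}")
        previous = state
    return labels


# Pulses delivered at 1/8th of 4.5 MHz carry pulse ids 0, 8, 16, 24, ...
datasets = {
    "pump_on": [{"start": 0, "end": -1, "step": 24}],
    "pump_off": [{"start": 8, "step": 24}, {"start": 16, "step": 24}],
}
print(label_by_pulse_id(range(0, 48, 8), datasets))
print(number_consecutive(["on", "off", "off", "on"]))
```

The first call yields the sequence pump_on, pump_off, pump_off repeated twice, and the second yields on_1, off_1, off_2, on_1, matching the labels given in the text.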
After all the configuration parameters have been set, the EXtra-Xwiz pipeline can be executed in automatic mode with:
$ xwiz-workflow -a
Without the "-a" ("--automatic") optional argument, the pipeline will confirm each configuration parameter with the user in an interactive procedure. When the EXtra-Xwiz operation finishes, it will generate a summary file containing information on the processed data statistics and FOMs, for example:
Step #   d_lim   source        N(crystals)   N(frames)   Indexing rate [%]
1        1.6     indexamajig   46899         639616      7.3
...
Crystallographic FOMs:
                    overall   outer shell
Completeness        100.0     100.0
Signal-over-noise   4.224     0.99
CC_1/2              0.8974    0.03244
CC*                 0.9726    0.2507
R_split             27.28     80.6
These results are provided only to demonstrate the capabilities of the pipeline and could be improved, for example by tuning selected parameters and processing more runs of the collected data. The resulting CrystFEL stream file can be found in the same folder, and the output hkl file with the full FOM tables in the "partialator" folder.
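Two of the quantities in the summary above can be cross-checked by hand. The sketch below assumes that the indexing rate is the ratio of indexed crystals to processed frames, which is consistent with the numbers shown, and uses the standard relation CC* = sqrt(2 CC_1/2 / (1 + CC_1/2)); the function names are illustrative.

```python
import math


def indexing_rate_percent(n_crystals, n_frames):
    """Indexing rate in percent, assumed to be crystals per frame."""
    return 100.0 * n_crystals / n_frames


def cc_star(cc_half):
    """Estimate CC* from CC_1/2 via sqrt(2 * CC_1/2 / (1 + CC_1/2))."""
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


print(round(indexing_rate_percent(46899, 639616), 1))  # 7.3
print(round(cc_star(0.8974), 4))                       # 0.9726 (overall)
print(round(cc_star(0.03244), 4))                      # 0.2507 (outer shell)
```

All three values agree with the summary, which is a quick way to confirm that the table has been read correctly.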
3.1. Automatic scan over EXtra-Xwiz configuration parameters
Sometimes it is necessary to run the EXtra-Xwiz pipeline while varying one or multiple configuration parameters over a list of values, for example to assess the sensitivity of the results to a given parameter. For this use case, the xwiz-scan-parameters tool has been developed. Similar to xwiz-workflow, it requires a configuration file, and a template is generated on the first use of the tool in an empty folder:
$ xwiz-scan-parameters
The configuration file "xwiz_scan_conf.toml" for the parameter scan tool consists of four sections: "[settings]", "[xwiz]", "[scan]", and "[output]". The main parameter of the "[settings]" block is xwiz_config, which specifies the path to the initial EXtra-Xwiz configuration file.
The "[scan]" section can contain any number of sub-sections. Each sub-section is treated as the next level of a nested loop; the numbers of iterations of the individual sub-sections are therefore multiplied. The names of the parameters within each scan sub-section are the full names of the parameters in the EXtra-Xwiz configuration file, and their values should contain either a list or an inclusive range-like dictionary of values to scan over. If multiple parameters are listed within one scan, they must contain the same number of values, which are modified simultaneously at each step of the scan, for example:
[scan.SNR]
'indexamajig_run.peak_snr' = {start = 3, end = 7, step = 2}
'indexamajig_run.peak_threshold' = [1000, 800, 700]
Here a scan with 3 iterations is defined: first an SNR value of 3 will be used with a threshold of 1000, next an SNR of 5 will be set with a threshold of 800, and finally an SNR of 7 with a threshold of 700.
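The way scan sub-sections combine can be sketched as follows: parameters within one sub-section advance together, while different sub-sections form nested loops. This is an illustration of the described behaviour rather than the xwiz-scan-parameters code; the function names and the second sub-section "MIN_PEAKS" are hypothetical, and only integer ranges are handled.

```python
from itertools import product


def expand_values(spec):
    """Return scan values from a list or an inclusive range-like dictionary."""
    if isinstance(spec, dict):
        return list(range(spec["start"], spec["end"] + 1, spec.get("step", 1)))
    return list(spec)


def scan_steps(section):
    """Parameters within one sub-section are varied simultaneously."""
    names = list(section)
    columns = [expand_values(section[name]) for name in names]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("parameters in one scan need the same number of values")
    return [dict(zip(names, row)) for row in zip(*columns)]


def all_iterations(scan):
    """Each sub-section adds one level to the nested loop."""
    iterations = []
    for combination in product(*(scan_steps(s) for s in scan.values())):
        merged = {}
        for step in combination:
            merged.update(step)
        iterations.append(merged)
    return iterations


scan = {
    "SNR": {
        "indexamajig_run.peak_snr": {"start": 3, "end": 7, "step": 2},
        "indexamajig_run.peak_threshold": [1000, 800, 700],
    },
    # Hypothetical second sub-section, added to show the nesting.
    "MIN_PEAKS": {"indexamajig_run.min_peaks": [10, 15]},
}
for iteration in all_iterations(scan):
    print(iteration)
```

With the "SNR" sub-section alone this gives the three iterations listed above; adding the hypothetical second sub-section with two values doubles the count to six, which shows how quickly the computing time of a grid scan grows.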
In the "[output]" section, the names of the files in which to store the result tables can be specified. When the configuration file is ready, the parameter scan can be started by running xwiz-scan-parameters in the same folder. It will run EXtra-Xwiz sequentially over all scan iterations and collect data processing statistics and overall FOM values for each scan step into a table similar to this one:
                         index_rate(%)  ...  cc_half  cc_star  r_split
peak_snr peak_threshold
3        1000                    0.080  ...    0.051    0.313   105.40
5        800                     7.332  ...    0.894    0.972    27.44
7        700                     7.219  ...    0.919    0.979    25.81
3.2. Running EXtra-Xwiz tutorial at VISA
VISA is a platform for cloud-based data analysis, developed by the Institut Laue-Langevin and deployed at several European photon and neutron facilities [56]. Its documentation can be found at [62].
In order to perform this tutorial using VISA, an instance containing an EXtra-Xwiz installation can be generated and used in a web browser:
click "Create a new instance";
click "Search for experiments" and select the proposal "p700000 - SFX on Hen egg-white lysozyme, AGIPD detector";
click on the "EXtra-Xwiz_Crystals2023" environment;
choose the virtual hardware;
create the instance.
There is no need to load any additional modules and the tutorial described above applies except for the distribution of computations on the HPC cluster, which is not accessible from VISA. Because of that, the following should be used:
[slurm]
partition = "local"
[indexamajig_run]
n_cores = 1
...
Note that this results in very slow processing of the input data. Therefore, instead of running EXtra-Xwiz on the entire run, a pre-selection of "good frames" can be used by specifying in the "[data]" section:
[data]
...
frames_list_file = "indexed_p700000_r0030.lst"
Please note that the "indexed_p700000_r0030.lst" file has been produced specifically for this VISA example and this option is never used in the actual processing of the experimental data.
4. Discussion and outlook
A challenge that requires expert knowledge and/or iterative processing steps lies in the optimization of parameters such as the minimum signal-to-noise ratio, peak finder thresholds, etc. Technically, as these parameters are passed to the CrystFEL programs through a configuration file, iterative runs require re-editing of the configuration, which can quickly become tedious. Currently, EXtra-Xwiz offers two ways to simplify this. First, the software includes a grid search, as described in Section 3.1, that can scan over some parameter space. This requires some estimate of reasonable parameter ranges, and enough time to compute the necessary grid nodes. Second, peak-finding parameter optimization can be done manually, and with visual feedback, using the CrystFEL GUI. The results of a GUI session can be stored in a project file to be used by EXtra-Xwiz for batch processing of one or more entire runs. Both of these approaches have their limitations, in particular related to the interpretation of their results by inexperienced users. As an alternative to brute-force searches, optimization methods, for example based on artificial intelligence, can be employed. In particular, we developed a method based on Bayesian optimization that optimizes EXtra-Xwiz parameters by maximizing the indexing rate, and reduces the need for expertise in the interpretation of results [63]. This solution is being deployed at EuXFEL and integrated into EXtra-Xwiz. The same approach can also be used to tune the detector geometry representation in laboratory space. Indeed, the overall SFX workflow can greatly benefit from automating this task, which at the moment is largely done manually: the operator typically uses graphical software for the visual centering and quadrant fitting of powder diffraction rings, followed by iterations of indexamajig and geoptimizer [64] to maximize the yield of indexable frames.
As a longer-term outlook, we strive to give users an end-to-end experience, that is, a pipeline covering all the steps from detector images to an atomic structure built into the reconstructed electron density. For this purpose, downstream modules handling the program execution related to the tasks of crystallographic phasing, model building and (preliminary but automated) structure refinement need to be included in the EXtra-Xwiz framework. Furthermore, owing to the modular nature of EXtra-Xwiz, software other than CrystFEL (e.g., DIALS) might be considered in the future. It should be pointed out that challenging problems, in particular with respect to de novo structures, will require user experience for decisions along the way, for instance which phasing method to apply, or which search model to use for molecular replacement. This means that the design of the workflows has to be well thought out, allowing user intervention where required and a good balance between parameters with reasonable defaults and expert input, boiling down to a useful semi-automatic approach. For this, an improved user interface is being designed.