2. Machine Learning in gRNA Design
Machine learning is a branch of artificial intelligence that includes algorithms and mathematical models that allow computers to learn from data without being explicitly programmed for each task. Machine learning pipelines typically follow a sequence of steps: data processing, feature extraction, training, and classification or prediction [47].
The input data are DNA sequences, which must be processed to transform the categorical input into a numerical representation. The two main encodings are one-hot encoding and k-mer word embedding. In one-hot encoding, each category is represented by a vector with a single 1 at the position assigned to that category and 0 elsewhere. Each base in the gRNA and target DNA can therefore be encoded as one of the four one-hot vectors [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], and [0, 0, 0, 1] [48]. One-hot encoding does not capture any information about the relationship between words or the context in which they appear, whereas k-mer word embedding does. However, k-mer word embeddings do not preserve the original sequence of the words and can be sensitive to rare or unseen words.
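As a minimal sketch of the two encodings described above, the following Python snippet one-hot encodes a short sequence and splits it into overlapping k-mers. The base order A, C, G, T, the function names, and the example sequence are assumptions made for illustration; a real embedding would additionally learn a vector for each k-mer index.

```python
# Illustrative sketch: one-hot encoding and k-mer tokenization of a DNA sequence.
# The base order A, C, G, T and the example sequence are assumptions.

BASES = "ACGT"

def one_hot(seq):
    """Return one 4-element vector per base, e.g. 'A' -> [1, 0, 0, 0]."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers ("words")."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_indices(seq, k=3):
    """Map each k-mer to an integer (read as a base-4 number), the usual
    input to an embedding lookup table. Assumes only A, C, G, T occur."""
    return [sum(BASES.index(c) * 4 ** (k - 1 - j) for j, c in enumerate(word))
            for word in kmers(seq, k)]

example = "GACT"  # hypothetical 4-nt fragment
print(one_hot(example))       # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
print(kmers(example, k=3))    # ['GAC', 'ACT']
print(kmer_indices(example))  # [33, 7]
```

The one-hot output keeps the exact position of every base, whereas the k-mer indices summarize local context, which mirrors the trade-off described in the paragraph above.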
After the data are processed, they undergo feature extraction, which selects and transforms the essential characteristics or patterns of the raw data, converting the most relevant information into a format the machine learning model can use. The extracted features depend on whether the model predicts on-target or off-target activity. The same machine learning models can be applied to on-target and off-target prediction (see Table 1 and Table 2); what changes is the interpretation of the data.
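To make the idea of feature extraction concrete, the sketch below computes a few hand-crafted features from a guide sequence. The chosen features (GC content, longest homopolymer run, per-position base identity), the function names, and the example guide are assumptions for illustration and do not reproduce the feature set of any specific tool.

```python
# Illustrative sketch of hand-crafted feature extraction from a guide sequence.
# The features below are common examples, not the feature set of any tool.

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def longest_homopolymer(seq):
    """Length of the longest run of one repeated base (e.g. 'TTTT' -> 4)."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def extract_features(seq):
    """Return a flat dictionary of numeric features for one guide."""
    seq = seq.upper()
    features = {"gc_content": gc_content(seq),
                "longest_homopolymer": longest_homopolymer(seq)}
    for pos, base in enumerate(seq, start=1):
        for b in "ACGT":
            features[f"pos{pos}_{b}"] = int(base == b)
    return features

guide = "GACGTTTTACCGGATCGATA"  # hypothetical 20-nt guide
feats = extract_features(guide)
print(feats["gc_content"], feats["longest_homopolymer"])
```

The resulting dictionary can be turned into a fixed-length numeric vector, which is the form that the classical models discussed in this section expect.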
During training, a machine learning model searches for patterns in the data. This process also requires setting the hyperparameters, which are variables that control the algorithm’s behavior during learning [47], for example, the number of layers, the number of neurons, or the type of optimizer. Tuning these hyperparameters is essential for improving the model’s evaluation metrics.
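One common way to tune hyperparameters is a grid search, sketched below under simplifying assumptions. The hyperparameter names and values are hypothetical, and `validation_score` is a placeholder: in practice it would train a model with the given settings and return a validation metric.

```python
# Illustrative sketch: enumerating hyperparameter combinations for a grid search.
# The hyperparameter names/values and the scoring function are placeholders.
from itertools import product

search_space = {
    "n_layers": [1, 2, 3],
    "n_neurons": [32, 64],
    "optimizer": ["sgd", "adam"],
}

def grid(space):
    """Yield every combination of hyperparameter values as a dictionary."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def validation_score(params):
    """Placeholder: in practice, train a model with `params` and return a
    validation metric such as the Spearman correlation."""
    return -abs(params["n_layers"] - 2) - abs(params["n_neurons"] - 64) / 64

candidates = list(grid(search_space))
best = max(candidates, key=validation_score)
print(len(candidates), best)
```

With three, two, and two candidate values, the grid contains twelve configurations; the cost of an exhaustive search grows multiplicatively with every added hyperparameter.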
Machine learning models can output a numerical value for prediction (regression) tasks or a categorical label for classification tasks. Classification models can be applied to predict whether a specific RNA sequence is a potential CRISPR target. Prediction models can estimate the effectiveness of a particular CRISPR-Cas system on a given target sequence. The main machine learning algorithms for predicting or classifying DNA sequences are linear regression, logistic regression, decision trees, random forest, and support vector machines (SVM) (see Figure 4). Neural networks are another family of algorithms, but due to their complexity, they are explained in the following section.
Linear regression (LR) is a supervised learning algorithm used for prediction tasks. Linear regression models fit a linear function between the dependent variable and the independent variables. Tools using this algorithm include CRISPRscan [49] and CRISPRater [50]. A decision function can be added to a regression model to obtain a logistic regression (LG). In logistic regression, the linear function is passed through a sigmoid function to produce a probability value between 0 and 1, which is then assigned to one of two categories based on a threshold value. Models using this algorithm include Broad GPP [42] and SSC [51].
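The relationship between the two models can be sketched in a few lines: a linear score is computed first, and logistic regression passes it through a sigmoid and a threshold. The weights, bias, feature values, and the 0.5 threshold below are made-up values for illustration, not parameters of any published tool.

```python
# Illustrative sketch: a linear score and its logistic (sigmoid) transformation.
# Weights, bias, features, and the 0.5 threshold are made-up values.
import math

def linear_score(features, weights, bias):
    """Linear regression prediction: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, threshold=0.5):
    """Logistic regression: probability from the sigmoid, label from a threshold."""
    prob = sigmoid(linear_score(features, weights, bias))
    return prob, int(prob >= threshold)

weights = [2.0, -1.0]   # hypothetical weights for two features
bias = 0.5
prob, label = classify([0.5, 1.0], weights, bias)
print(round(prob, 3), label)
```

Here the linear score is 0.5, the sigmoid maps it to a probability of roughly 0.62, and the threshold assigns the guide to the positive class.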
Decision trees (DT) are a supervised learning method used for classification and prediction tasks. The algorithm builds a tree-like model of decisions and their possible consequences, where each node represents a feature of the RNA sequence and each branch represents a possible outcome based on that feature. The algorithm recursively splits the data into subsets based on the most informative features until a stopping criterion is met. Random forest (RF) is an ensemble learning algorithm derived from decision trees that uses multiple trees for classification and prediction tasks. It builds each tree from randomly selected subsets of the data and features and combines the results of these trees to improve the accuracy of the prediction. Examples of tools using these algorithms are CRISTA [52], Elevation [53], and CHANGE-seq [54].
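The two core steps described above, choosing the most informative split and combining several trees, can be sketched as follows. Gini impurity is used here as the measure of informativeness, which is one common choice and an assumption of this example; the toy data and the tree votes are invented.

```python
# Illustrative sketch: choosing the most informative split with Gini impurity
# (the core step of a decision tree) and combining trees by majority vote
# (the core step of a random forest). Data and tree votes are toy values.
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a balanced binary node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best split."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

def majority_vote(predictions):
    """Combine the class votes of several trees into one prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Toy data: columns are [GC content, longest homopolymer]; label 1 = active guide.
rows = [[0.40, 2], [0.55, 5], [0.50, 1], [0.45, 6]]
labels = [1, 0, 1, 0]
print(best_split(rows, labels))   # (1, 2, 0.0)
print(majority_vote([1, 0, 1]))   # 1
```

In this toy dataset the second feature separates the classes perfectly at a threshold of 2, so the weighted impurity after the split is zero. A full random forest would repeat the tree construction on random subsets of rows and features before voting.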
Support vector machines (SVM) are supervised learning algorithms for classification and prediction tasks. An SVM finds the optimal hyperplane that separates the data points, much as linear regression finds the line that best fits the data. However, this algorithm requires feature extraction before training. The performance of the model depends on the relevance of the extracted features and on inherent characteristics of the sequence, such as its length or the presence of specific motifs or secondary structures. Examples of tools using SVM are WU-CRISPR [55], SgRNAScorer [56], Azimuth [57], ge-CRISPR [58], and many others.
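A minimal sketch of the linear SVM decision rule is given below, together with the hinge loss that a linear SVM minimizes during training. The weight vector, bias, and feature vector are made-up values; a real model would learn the hyperplane from labeled guides.

```python
# Illustrative sketch of a linear SVM decision rule and the hinge loss.
# The weight vector and bias are made-up values, not a trained model.

def decision_value(x, w, b):
    """Signed score of a point relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(x, w, b) >= 0 else -1

def hinge_loss(x, y, w, b):
    """Zero when the point lies on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * decision_value(x, w, b))

w, b = [1.0, -0.5], -0.25   # hypothetical hyperplane
x = [1.0, 0.5]              # hypothetical feature vector for one guide
print(predict(x, w, b), hinge_loss(x, +1, w, b))   # 1 0.5
```

The example point is classified correctly, but because it lies inside the margin it still contributes a hinge loss of 0.5, which is what pushes training toward a wider margin.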
4. Reaching Efficiency
Since the first gRNA-design algorithms appeared, scientists have widely used these programs to find the gRNA of interest or gRNAs whose use must be avoided. The efficiency of these programs is of significant importance for research, and it can be measured according to the evaluation metrics previously presented. In 2013, Hsu et al. [35] published their web-based off-target site predictor, CRISPRtool, also known as the MIT CRISPR Design Tool. They generated their own experimental data, from which they obtained hand-crafted features, and implemented a score based on matches and mismatches. In 2014, the CRISPRtool was used to design the best gRNAs targeting two tumor suppressor genes and one oncogene, and then to mutate them directly in a mouse lung cancer model [70]; the transient transfection reached a maximum of 44% indels. Xue et al. [71] obtained almost the same results under the same conditions, but for liver cancer in mice. That year, Tsai et al. [72], using GUIDE-seq whole-genome sequencing, discovered that the CRISPRtool fails to recognize many off-target sites because of the very limited parameters implemented in the algorithm.
In 2014, Doench et al. [42] launched the Broad GPP designer, since relaunched and updated as CRISPick. This on-target prediction engine was the first to use machine learning, specifically logistic regression, and it has continued to receive new features and updates. The tool was used in genome editing research on the parasite Leishmania donovani with CRISPR/Cas9 [73]. There, the GPP CRISPR designer was used to compare gRNAa (designed and named by Wei Zhang et al.) with a set of gRNAs suggested by the tool for the gene of interest, and gRNAa received a low score from the web-based tool. Additionally, the engine helped to create a robust, high-efficiency protocol for genome editing in Caenorhabditis elegans regardless of possible low-efficiency gRNAs, permitting the use of a wider variety of gRNAs [74].
Moreno-Mateos et al. [49] introduced a linear regression-based gRNA design model in 2015 with CRISPRscan. Thyme et al. [75] found that hairpin formation can reduce gRNA efficiency, a critical factor that many web-based tools released before 2016 ignored. CRISPRscan was no exception, but it showed a lower fraction of hairpin-forming gRNAs than its competitors. In 2016, Gunry et al. [76] significantly improved genome modification in hematopoietic stem/progenitor cells (HSPCs). They targeted the CD45 gene in human HL-60 cells with three distinct gRNAs designed with CRISPRscan and obtained high mutagenesis percentages, reaching almost 75% indels, which marks CRISPRscan as a high-fidelity gRNA design tool.
Because no model at the time broadly aggregated and demonstrated, across different genome contexts, the contribution of distinct sequence features to efficiency, Xu et al. [51] launched the linear regression-based Spacer Scoring for CRISPR (SSC) tool in 2015. They aimed to develop an affordable model to design gRNAs for genome-wide functional screens, training it with as many gRNA datasets as were available at that time. Despite the relatively low ROC-AUC of its predictions, Radzisheuskaya et al. [77] used this tool to confirm that employing the correct gRNAs, explicitly designed for functional genome screens, greatly improves efficiency, although other factors also impact efficiency strongly. In other words, for CRISPRi (CRISPR interference), if the gene’s transcription start site (TSS) is targeted and the highest-scored gRNA for that gene is used, the efficiency will increase, yielding better phenotype-based screens.
Plant gRNA candidates and their characteristics partially differ from gRNAs designed for mammalian or bacterial cells. For instance, Liang et al. [78] explain that, unlike in mammals, nucleotide preferences in the recognition sequence are not seen in plants. Alongside the linear regression-based methods introduced by the last two web-based tools, in 2015 WU-CRISPR and SgRNAScorer [55,56] used the support vector machine (SVM) framework for gRNA design. In contrast to WU-CRISPR, the SgRNAScorer algorithm does not consider the presence of contiguous repetitive sequences or the impact of RNA secondary structures formed in the guide sequence, as reflected in the self-folding free energy, which reduces SgRNAScorer efficiency. Moreover, Wong et al. compared their WU-CRISPR tool against SSC, SgRNAScorer, and the GPP CRISPR designer, demonstrating with precision-recall curves that WU-CRISPR designs functional gRNAs better. Mutagenesis experiments in rice and cotton employed SgRNAScorer to target genes of interest. In the cotton experiments [79], SgRNAScorer designed 82 distinct gRNAs to target a GFP gene in a transgenic cotton genome, from which only three gRNAs with significantly different scores were selected. The authors found that the mutagenesis efficiency varied inconsistently, suggesting that SgRNAScorer gRNA candidates lack a robust biological and computational basis. Interestingly, after obtaining these results, they decided to use the WU-CRISPR tool, which returned only 13 gRNAs for their gene. Analogously, in rice experiments, Baysal et al. [80] selected two gRNAs for a gene of interest. Unexpectedly, the high-scored gRNA showed no mutagenesis activity, whereas the lowest-scored one did.
The Broad GPP designer by Doench et al. [42], whose architecture was based on the support vector machine (SVM) with logistic regression, was improved in 2016 with the launch of Azimuth [57]. This tool integrates biochemical and thermodynamic sequence features related to secondary structure formation, a characteristic missing in the first version. In addition, the authors incorporated regression models, specifically gradient-boosted regression trees, which performed considerably better than the first version. Finally, they provided two scores for discriminating potential on-target and off-target sites, Rule Set 2 and the CFD score, respectively, both incorporated in their Azimuth web page. Two years later, Listgarten and colleagues developed the Elevation tool [53], an off-target-prediction-focused algorithm that aims to complement the Azimuth model. It changes the architecture to a two-layer stacked regression model: the first layer learns to predict the effect of single mismatches in the gRNA-target duplexes, and the second layer learns to combine multiple mismatches, yielding a score for potential off-target sites.
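The two-layer idea can be sketched as follows: a first function scores each single mismatch, and a second function combines those scores for a site with several mismatches. Everything numeric here is invented for illustration (the position-dependent penalty, the combination weights, the sequences), and the functions are hypothetical stand-ins, not the learned models of Elevation or the CFD score.

```python
# Illustrative sketch of a two-layer ("stacked") off-target scorer.
# All numbers are invented and are NOT the parameters of Elevation or CFD.
import math

def mismatches(guide, target):
    """Return 1-based positions where guide and target differ."""
    return [i + 1 for i, (g, t) in enumerate(zip(guide, target)) if g != t]

def layer1_single_mismatch(position, guide_length=20):
    """Hypothetical first layer: activity retained despite one mismatch.
    Assumes mismatches at higher (PAM-proximal) positions are less tolerated."""
    return 1.0 - 0.8 * position / guide_length

def layer2_combine(single_scores, weights=(1.0, -0.1)):
    """Hypothetical second layer: a small linear model on summary features of
    the layer-1 outputs (log-product and mismatch count), mapped back with exp."""
    if not single_scores:
        return 1.0  # perfect match: full predicted activity
    log_product = sum(math.log(s) for s in single_scores)
    raw = weights[0] * log_product + weights[1] * len(single_scores)
    return math.exp(raw)

guide = "GACGTTTTACCGGATCGATA"
target = "GACGTATTACCGGATCGTTA"   # hypothetical off-target site, 2 mismatches
positions = mismatches(guide, target)
score = layer2_combine([layer1_single_mismatch(p) for p in positions])
print(positions, round(score, 3))
```

The design point is the separation of concerns: single-mismatch effects can be learned from abundant single-mismatch data, while the second layer only has to learn how those effects combine.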
As explained throughout this section, the off-target predictor CRISPRtool by Hsu et al. [35] suffered from many weaknesses, creating the need for a more powerful tool to predict off-target sites. In 2016, Haeussler et al. [60] launched the off-target predictor CRISPOR, which uses the BWA sequence search algorithm [81] to perform the alignments that locate possible off-target sites. CRISPOR’s predicted gRNAs avoid extremely GC-rich sequences, and the tool handles >4 mismatches much better than the MIT CRISPRtool. These patterns, found by analyzing eight large datasets of off-target sites, improve the fidelity of CRISPOR’s predictions. Mutagenesis and gene knock-out research in the hexaploid Camelina sativa [82] employed the CRISPOR tool to design desired gRNAs and exclude undesired ones for targeting the microsomal oleate desaturase (FAD2) gene, whose knock-out leads to an accumulation of oleic acid in this plant. The authors selected two gRNAs, the second of which harbors sequence features described by CRISPOR as improving mutagenesis efficiency. Returning to the sgRNA Scorer, Chari et al. [56] structured this tool to analyze gRNA sets of high and low activity for two orthologs of the Cas9 protein. For each ortholog, a separate SVM model was created. In 2017, the same team released sgRNA Scorer 2.0 [84], which instead creates a single SVM model for both Cas9 orthologs by merging all gRNAs into high- and low-activity sets. With this, they aimed to design a model that predicts efficient gRNAs for distinct CRISPR systems, knowing that many orthologs exist for different CRISPR systems. Even though this tool was trained with a dataset of gRNAs targeting eukaryotic cell genes, Shen et al. [98] used it to design 81 gRNAs targeting virulent Klebsiella phage genes. As expected, given the different cellular context, sgRNA Scorer did not discriminate correctly between high- and low-activity gRNAs.
The feature analysis carried out for CRISPOR in 2016 is partly inconsistent with the later research by Abadi et al. [52]. In 2017, the latter team launched a new predictor known as CRISTA (CRISPR Target Assessment), based on a regression model using the random forest algorithm. Their primary purpose was not to design a model for exclusively predicting gRNA on-target efficiency or potential off-target sites but to assess the cleavage efficacy of a particular genomic target by a specific gRNA. CRISTA included a treatment for DNA/RNA “bulges” in its algorithm, which can be understood as gaps in the gRNA/target hybridization. The CRISPOR authors noticed these bulges, but their dataset analysis suggested no need to treat these gaps, which leaves CRISPOR at a disadvantage for missing this important feature. CRISTA also accounts for the formation of secondary structures inherent to RNA sequences in its learning model. Furthermore, the DNA enthalpy, the DNA geometry, and the target location (chromosome number and distance from telomere and centromere) are additional features included in the algorithm. In contrast to many other prediction tools, the CRISTA training dataset does not discard uncleaved sites (i.e., targeted sequences with no gRNA activity), helping to avoid the design of identical zero-activity gRNAs.
Table 2.
On-target models
| Name | Model | Year | Parameter | Detail | Reference |
|---|---|---|---|---|---|
| Broad GPP | LG | 2014 | Spearman: 0.87 | 1,831 gRNAs targeting three human genes and six mouse genes were used to generate screening data using one-hot encoding. | [42] |
| WU-CRISPR | SVM | 2015 | AUROC: 0.91. Spearman: 0.70 | | [55] |
| SSC | LG | 2015 | AUROC: 0.711 | Datasets Wang, Koike-Yusa, Shalem, Zhou, Gilbert, Konermann. | [51] |
| Multiple CRISPR models | SVM, LR, GBT, LG, RF | 2015 | Spearman: 0.51. AUROC: 0.75 | One-hot encoding over the datasets Wang ribosomal, Wang non-ribosomal, Koike-Yusa, Doench V1. | [83] |
| CRISPRscan | LR | 2015 | R: 0.45. SD: 0.071 | Includes data from new cell lines. | [49] |
| SgRNAScorer | SVM | 2015 | Spearman: 0.75 | | [56] |
| Azimuth | SVM, LG | 2016 | 0.462 | One-hot encoding. | [57] |
| ge-CRISPR | SVM | 2016 | Accuracy: 0.888. MCC: 0.78 | Includes data from new cell lines. | [58] |
| CRISPRater | LR | 2017 | Spearman: 0.67 | Includes data from new cell lines. | [50] |
| SgRNAScorer 2.0 | SVM | 2017 | Accuracy: 0.737. Precision: 0.728. Recall: 0.758 | | [84] |
| CRISPRpred | SVM | 2017 | AUROC: 0.85. AUPRC: 0.56. MCC: 0.4 | K-mer encoding over Broad GPP. | [85] |
| DeepCRISPR | CNN | 2018 | Spearman: 0.406 | | [62] |
| DeepCpf1 | CNN | 2018 | Spearman: 0.873 | | [86] |
| DeepCas9 | CNN | 2018 | Spearman: 0.351 | | [87] |
| TUSCAN | RF | 2018 | Spearman: 0.55 | | [88] |
| DeepHF | RNN | 2019 | Spearman: 0.867 | Cell lines HCT116, HEK293T, HELA, HL60. | [89] |
| DeepSpCas9 | 1DCNN | 2019 | Spearman: 0.91 | | [90] |
| CRISPRpred(SEQ) | SVM | 2020 | Spearman: 0.829. AUROC: 0.893 | Haeussler and DeepHF datasets. | [91] |
| GNL-Scorer | AdaBoost | 2020 | Spearman: 0.502 | One-hot encoding over 10 public datasets. | [92] |
| C-RNN CRISPR | RNN | 2020 | Spearman: 0.877. AUROC: 0.976 | Includes data from new cell lines. | [93] |
| CNN-SVR CRISPR | CNN-SVR | 2020 | Spearman: 0.807. AUROC: 0.983 | Includes data from new cell lines. | [94] |
| On-target CRISPRon | CNN | 2021 | Spearman: 0.91 | | [95] |
| BoostMEC | GBM | 2022 | 0.704 | Includes data from new cell lines. | [96] |
| CNN-XG | CNN-Tree | 2022 | Spearman: 0.7352. AUROC: 0.992 | | [97] |
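Most entries in the table report a Spearman correlation or an AUROC. As a minimal sketch of what these two metrics measure, the snippet below computes both in plain Python; the predicted and measured values are toy numbers, and in practice a statistics library would normally be used instead of these hand-written functions.

```python
# Illustrative sketch of the two metrics reported most often in the table:
# Spearman correlation (regression-style models) and AUROC (classifiers).
# The predicted and measured values below are toy numbers.

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

predicted = [0.9, 0.2, 0.6, 0.4]
measured = [0.8, 0.1, 0.3, 0.5]      # e.g. observed indel frequency
print(spearman(predicted, measured))
print(auroc(predicted, [1, 0, 1, 0]))
```

Spearman correlation rewards a tool for ranking guides in the right order, whereas AUROC rewards separating active from inactive guides, so values of the two metrics are not directly comparable across rows of the table.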
The CRISPR/Cas9 genome editing system raised enormous expectations in the scientific community, and the promise of flawless gene knock-out, knock-in, or functional screens still had to be fulfilled. In 2018, Chuai et al. [62] introduced deep neural network approaches for predicting and designing gRNAs in their novel tool, DeepCRISPR. Like CRISTA, DeepCRISPR seeks both to predict functional on-target gRNAs and to avoid those with a propensity to cause off-target cuts.
Figure 7.
Timeline from 2018 to 2022. With the launch of DeepCRISPR, deep neural networks began their rise, improving each year with the introduction of RNNs, embedding methods, hybrid models, or additional layers.
To achieve this purpose, Chuai et al. designed an architecture with three networks. The main one is the pre-training network (called the “parent network” by the authors), which learns to recognize various features of gRNAs, using as input ∼0.68 billion gRNA sequences targeting coding and non-coding human genes. The other two networks are CNNs that take the output of the pre-training network. They are trained using well-known, experimentally validated gRNAs with on-target or off-target activity, extracting the distinctive features that characterize these sequences and integrating them into the predictive capacity of the tool. In 2020, because the DeepCRISPR pre-training dataset is based on human exon and intron sequences, the tool was used to predict the off-target activity of gRNAs designed by Mintz et al. [99] to target the PARP1 gene, for its inhibition, in triple-negative breast cancer (TNBC) cells, highlighting the importance of using CRISPR/Cas9 systems in preclinical studies.
In the same year as the DeepCRISPR launch, Lin et al. [48] focused on developing a tool that exclusively predicts off-target sites with a deep neural network framework. They named their tool CNN_std. It represents the ribonucleotide sequence of the gRNA as a 4 × 23 matrix, whose dimensions correspond to the four nucleotides and to the 20-nt recognition sequence plus the 3-nt PAM sequence. This matrix has the correct format for input to the convolutional neural network. Lin et al. also used the CRISPOR dataset for training, validating, testing, and comparing CNN_std against previous off-target prediction tools such as the CFD score and the MIT CRISPR design tool; CNN_std outperformed all of these and other machine learning-based tools, reaching a ROC-AUC of 0.972.
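The matrix layout described above, and the way a convolutional filter scans it, can be sketched as follows. The row order A, C, G, T, the example site, and the filter weights are assumptions made for illustration; this is a single hand-written filter, not the CNN_std architecture.

```python
# Illustrative sketch: a 4 x 23 one-hot matrix (rows = A, C, G, T; columns =
# 20-nt recognition sequence + 3-nt PAM) and one convolution filter sliding
# along it. Sequence and filter weights are invented for illustration.

BASES = "ACGT"

def encode_4xn(seq):
    """Return a matrix with one row per base and one column per position."""
    return [[1 if base == b else 0 for base in seq.upper()] for b in BASES]

def conv1d(matrix, kernel):
    """Valid 1-D convolution (cross-correlation) of a 4 x N input with a
    4 x K kernel, followed by ReLU; returns N - K + 1 activations."""
    n, k = len(matrix[0]), len(kernel[0])
    out = []
    for start in range(n - k + 1):
        total = sum(matrix[r][start + c] * kernel[r][c]
                    for r in range(4) for c in range(k))
        out.append(max(0.0, total))
    return out

site = "GACGTTTTACCGGATCGATA" + "TGG"   # hypothetical 20-nt target + NGG PAM
matrix = encode_4xn(site)

# Hypothetical filter that responds to the dinucleotide "GG".
gg_filter = [[0.0, 0.0],   # A
             [0.0, 0.0],   # C
             [0.5, 0.5],   # G
             [0.0, 0.0]]   # T
activations = conv1d(matrix, gg_filter)
print(len(matrix), len(matrix[0]), len(activations))   # 4 23 22
print(activations.index(max(activations)))             # 11
```

The filter reaches its maximum wherever "GG" occurs, here first at zero-based column 11 and again at the PAM. A trained CNN learns many such filters from data instead of having them specified by hand.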
Undoubtedly, CRISPR/Cas9 gRNA design was greatly refined by the introduction of deep neural networks, specifically CNNs. Unfortunately, DeepCRISPR and CNN_std implemented algorithms and architectures that neglected the biological features underlying feasible gRNA design, thus missing characteristics proven to be crucial for this objective. By comparison, both versions of the sgRNA Scorer by Chari et al. [56,84] used an algorithm trained with datasets obtained from diverse Cas9 orthologs, and were therefore capable of predicting for a more comprehensive array of RGNs (RNA-guided nucleases). In 2019, Wang et al. [89] compared the predictive performance of an RNN and a conventional CNN. They found that the RNN outperforms the CNN and other machine learning algorithms. The dataset used for training, validation, and testing is based on their own experiments in human cells, covering three Cas9 variants: WT-SpCas9 (wild-type Streptococcus pyogenes Cas9), eSpCas9 (enhanced), and SpCas9-HF1 (high fidelity). Furthermore, to remedy the lack of biological features in deep neural network models, this RNN was trained with features such as secondary structure formation and stem-loops, GC content, and the contiguous repetitive sequences first described by Wong et al. [55] and implemented in the WU-CRISPR tool in 2015. Wang and his team released the resulting tool as DeepHF, which addresses all the concerns mentioned earlier. DeepHF was used in experiments designed to knock out an apoptosis-inducing gene in mice, Htra2, whose protein is found in high concentrations in neomycin-treated cochleae, one of the causes of deafness. The team designed three gRNAs targeting the Htra2 gene, obtaining 87.27% indels for the highest-scored gRNA [100].
Notwithstanding the boom of deep learning-based pipelines in gRNA design tools, Muhammad et al. [91] were reluctant to use deep neural networks for gRNA design. Despite the performance obtained with these models (CNN or RNN), their results are hard to interpret. Moreover, it has been shown that conventional, simpler algorithms can perform the same work as deep neural networks [101]. On this basis, Muhammad et al. launched the on-target predictor CRISPRpred(SEQ), whose SVM-based architecture was trained with the same training dataset as DeepCRISPR while incorporating biological gRNA sequence features. In most of the benchmarks, CRISPRpred(SEQ) outperformed DeepCRISPR. CRISPRpred(SEQ) was also compared against DeepHF using the dataset generated by the latter; there, the machine learning-based tool did not surpass DeepHF because it needed more specific tuning against DeepHF.
Another approach to achieving the desired interpretability in deep neural networks is presented by Xiao et al. [69]. First, they categorize the existing deep neural networks according to how the model’s input is treated: methods in the spatial domain, in which the input is transformed into a two-dimensional image that convolutional neural networks can process for sequence feature extraction [48,62]; and methods in the temporal domain, in which the input is treated as text and is well suited to recurrent neural networks [89]. Xiao et al. then proposed an ensemble learning model that uses both the spatial and temporal domains to extract the necessary sequence features, in addition to an “attention mechanism” to provide interpretability. The on-target model, named AttCRISPR, was further enhanced with hand-crafted biological features, ultimately outperforming even the DeepHF tool on its own training dataset.
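The reason an attention mechanism aids interpretability can be sketched with a small dot-product attention example: every sequence position receives a weight, the weights sum to 1, and inspecting them shows which positions drove the prediction. The hidden vectors and the query vector below are toy values, and the functions are a generic illustration that does not reproduce AttCRISPR.

```python
# Illustrative sketch of an attention mechanism over sequence positions.
# The feature vectors and the query vector are toy values.
import math

def softmax(scores):
    """Numerically stable softmax: positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(position_features, query):
    """Dot-product attention: per-position weights and the weighted-sum summary."""
    scores = [sum(q * f for q, f in zip(query, feats))
              for feats in position_features]
    weights = softmax(scores)
    dim = len(position_features[0])
    summary = [sum(w * feats[d] for w, feats in zip(weights, position_features))
               for d in range(dim)]
    return weights, summary

# Three positions, each described by a 2-dimensional hidden vector (toy values).
hidden = [[0.1, 0.0], [2.0, 1.0], [0.3, 0.2]]
query = [1.0, 1.0]          # hypothetical learned query vector
weights, summary = attention_pool(hidden, query)
print([round(w, 3) for w in weights])
```

In this toy case the second position receives most of the weight, so the summary vector, and hence any prediction built on it, is dominated by that position.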
In recent years, almost all gRNA design tools have turned to deep neural networks or hybrid models, and these models keep improving in predictive performance. In 2022, Zhang et al. [38] launched the off-target predictor CRISPR-IP, which includes four layers, each performing a distinct procedure focused on characterizing sequence features: a CNN layer, a bidirectional long short-term memory layer (BiLSTM, an RNN derivative), an attention layer, and finally a dense layer. The model is trained on experimental sequencing-based data (SITE-seq and CIRCLE-seq). Finally, epigenetic information and bulge treatment were added to the model, resulting in high predictive performance. Regarding the most recent on-target prediction tool, Li et al. [97] proposed a hybrid of machine learning and deep learning. They drew inspiration from a purely computational approach published by Ren et al. [102] that seeks accurate, high-performance image classification based on XGBoost (extreme gradient boosting, the machine learning part) and a CNN (the deep learning and feature extraction part). In the hybrid model, named CNN-XG, this computational approach is combined with the biological perspective: a gRNA sequence is taken as input and passed through the CNN layers for feature extraction, and the extracted features are then sent as input to the XGBoost classifier.