Deep Learning for Genomics: Data-driven approaches for genomics applications in life sciences and biotechnology
About this ebook
Deep learning has shown remarkable promise in the field of genomics; however, there is a lack of a skilled deep learning workforce in this discipline. This book will help researchers and data scientists to stand out from the rest of the crowd and solve real-world problems in genomics by developing the necessary skill set. Starting with an introduction to the essential concepts, this book highlights the power of deep learning in handling big data in genomics. First, you’ll learn about conventional genomics analysis, then transition to state-of-the-art machine learning-based genomics applications, and finally dive into deep learning approaches for genomics. The book covers all of the important deep learning algorithms commonly used by the research community and goes into the details of what they are, how they work, and their practical applications in genomics. The book dedicates an entire section to operationalizing deep learning models, which will provide the necessary hands-on tutorials for researchers and any deep learning practitioners to build, tune, interpret, deploy, evaluate, and monitor deep learning models from genomics big data sets.
By the end of this book, you’ll have learned about the challenges, best practices, and pitfalls of deep learning for genomics.
Deep Learning for Genomics - Upendra Kumar Devisetty
BIRMINGHAM—MUMBAI
Deep Learning for Genomics
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Dhruv Jagdish Kataria
Content Development Editor: Priyanka Soam
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Mohamed Huzair
Marketing Coordinators: Shifa Ansari, Abeer Riyaz Dawe
First published: October 2022
Production reference: 1311022
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80461-544-7
www.packt.com
Contributors
About the author
Upendra Kumar Devisetty has a Ph.D. in agriculture and over 12 years of experience working in Next-Generation Sequencing. He has a deep background in genomics and bioinformatics with a specialization in applying predictive analytics across a varied set of genomics problems in life sciences. Dr. Devisetty is currently working as a senior data science manager at Greenlight Biosciences, where he leads a team of bioinformatics scientists and data scientists to support the various bioinformatics and data science projects at Greenlight Biosciences with a mission to create mRNA-based solutions that can provide a cleaner environment and healthier people.
About the reviewer
Urminder Singh is a computer scientist and bioinformatician. His diverse research interests include understanding novel gene evolution, cancer genomics, machine learning in medicine, sociogenomics, and algorithms for big heterogeneous data. You can find him online at urmi-21.github.io.
Table of Contents
Preface
Part 1 – Machine Learning in Genomics
1
Introducing Machine Learning for Genomics
What is machine learning?
Why machine learning for genomics?
Machine learning for genomics in life sciences and biotechnology
Exploring machine learning software
Python programming language
Visualization
Biopython
Scikit-learn
Summary
2
Genomics Data Analysis
Technical requirements
Installing Biopython
Matplotlib
What is a genome?
Genome sequencing
Sanger sequencing of nucleic acids
Evolution of next-generation sequencing
Analysis of genomic data
Steps in genomics data analysis
Introduction to Biopython for genomic data analysis
What is Biopython?
Genomic data analysis use case – Sequence analysis of Covid-19
Calculating GC content
Calculating nucleotide content
Dinucleotide content
Modeling
Motif finder
Summary
3
Machine Learning Methods for Genomic Applications
Technical requirements
Python packages
ML libraries
Genomics big data
Supervised and unsupervised ML
Supervised ML
Unsupervised ML
ML for genomics
The basic workflow of ML in genomics
An ML use case for genomics – Disease prediction
Data collection
Data preprocessing
EDA
Data transformation
Data splitting
Model training
Model evaluation
ML challenges in genomics
Summary
Part 2 – Deep Learning for Genomic Applications
4
Deep Learning for Genomics
Understanding what deep learning is and how it works
Neural network definition
Anatomy of deep neural networks
Key concepts of DNNs
An example of how neural networks work
DNN architectures
DNNs for genomics
Deep learning workflow for genomics
Broad application of DNNs in genomics
Protein structure predictions
Regulatory genomics
Gene regulatory networks
Single-cell RNA sequencing
Introducing deep learning algorithms and Python libraries
General deep learning libraries
Deep learning libraries for genomics
Summary
5
Introducing Convolutional Neural Networks for Genomics
Introduction to CNNs
What are CNNs?
Transfer Learning
CNNs for genomics
Applications of CNNs in genomics
DeepBind
DeepInsight
DeepChrome
DeepVariant
Summary
6
Recurrent Neural Networks in Genomics
What are RNNs?
Introducing RNNs
How do RNNs work?
Different RNN architectures
Bidirectional RNNs (BiLSTM)
LSTMs and GRUs
Different types of RNNs
Applications and use cases of RNNs in genomics
DeepNano
ProLanGo
DanQ
Understanding RNNs through Transcription Factor Binding Site (TFBS) predictions
Summary
7
Unsupervised Deep Learning with Autoencoders
What is unsupervised DL?
Types of unsupervised DL
Clustering
Anomaly detection
Association
What are autoencoders?
Properties of autoencoders
How do autoencoders work?
Architecture of autoencoders
Types of autoencoders
Autoencoders for genomics
Gene expression
Use case – Predicting gene expression from TCGA pan-cancer RNA-Seq data using denoising autoencoders
Summary
8
GANs for Improving Models in Genomics
What are GANs?
Differences between Discriminative and Generative models
Intuition about GANs
How do GANs work?
Challenges working with genomics datasets
What is synthetic data?
How can GANs help improve models?
Practical applications of GANs in genomics
Analysis of ScRNA-Seq data
Generation of DNA
Using GANs for augmenting population-scale genomics data
Summary
Part 3 – Operationalizing models
9
Building and Tuning Deep Learning Models
Technical requirements
DL life cycle
Data processing
Data collection
Data wrangling
Feature engineering
Developing models
Selecting an appropriate algorithm
Model training
Tuning the models
Hyperparameter tuning
Hyperparameter tuning libraries
Classification metrics or performance statistics
Visualizing performance
Regression metrics
Use case – Predicting the binding site location of the JunD TF
Framing the TFBS prediction problem in terms of DL
Processing the data
Model training
Summary
10
Model Interpretability in Genomics
What is model interpretability?
Black-box model interpretability
Unlocking business value from model interpretability
Better business decisions
Building trust
Profitability
Model interpretability methods in genomics
Partial dependence plot
Individual conditional expectation
Permuted feature importance
Global surrogate
LIME
Shapley value
ExSum
Saliency map
Use case – Model interpretability for genomics
Data collection
Feature extraction
Target labels
Train-test split
Creating a CNN architecture
Summary
11
Model Deployment and Monitoring
Technical requirements
Streamlit
Hugging Face
Introducing model deployment
Steps in model deployment
Types of model deployment
Deploying models as services
A use case for deploying a DL model as a web service – building a Streamlit application of the CNN model
Monitoring models using advanced tools
Why monitor models?
Reasons for model degradation
How to monitor DL models
Advanced tools for model monitoring
Addressing drifts
Summary
12
Challenges, Pitfalls, and Best Practices for Deep Learning in Genomics
Deep learning challenges regarding genomics
Lack of flexible tools
Fewer biological samples
Computational resource requirements
Expertise in DL frameworks
Lack of high-quality labeled data
Lack of model interpretability
Common pitfalls for applying deep learning to genomics
Confounding
Data leakage
Imbalanced data
Improper model comparisons
Best practices for applying deep learning to genomics
Understand the problem and know your data better
A simple model for a simple problem
Establish a baseline for your model
Ensure reproducibility
Using pre-existing models for genomics
Do not reinvent the wheel
Tune hyperparameters automatically
Focus on feature engineering
Normalize the data
Always perform model interpretation
Avoid overfitting
Summary
Index
Other Books You May Enjoy
Preface
Deep learning is a subset of machine learning based on artificial neural networks that perform representation learning from vast amounts of data. Machine learning is itself a subfield of artificial intelligence, which covers algorithms that enable machines to mimic human intelligence and perform human tasks automatically. Both deep learning and machine learning detect meaningful patterns in data without explicit programming. Together they have changed the way we live: it is hard to imagine a day without relying on them in some way, whether through email spam filtering, product recommendations, or speech recognition. Both machine learning and, more specifically, deep learning have been adopted by the scientific community in areas such as biology, genomics, bioinformatics, and computational biology. High-throughput sequencing (HTS) technologies, such as next-generation sequencing (NGS), have made it possible to study complex biological phenomena at single-base-pair resolution on an unprecedented scale, ushering in an era of big data genomics. To extract meaningful and novel biological insights from this big data, most current algorithms are based on machine learning and, lately, deep learning methodologies, which provide higher accuracy on specific genomics tasks than state-of-the-art rule-based algorithms. Given the growing interest in and application of machine learning and deep learning in genomics, research professionals, scientists, and managers need a good understanding of this field. That understanding equips them with the tools, technologies, and general guidelines needed to select machine learning and deep learning methods for handling genomics data and to accelerate data-driven decision-making in the life sciences and biotechnology industries.
Throughout this book, we will learn how to apply deep learning approaches to solve real-world problems in genomics, interpret biological insights from deep learning models built from genomic datasets, and finally, operationalize deep learning models using open source tools to enable predictions for end users.
Who is this book for?
This book aims to practically introduce machine learning and deep learning for genomic applications that can transform genomics data into novel biological insights. It provides both the theoretical fundamentals and hands-on sections to give a taste of how machine learning and deep learning can be leveraged in real-world applications in the life sciences and biotech industries. This book covers a range of topics that are not currently available in other textbooks. The book also includes the challenges, pitfalls, and best practices when applying machine learning and deep learning to real-world scenarios. Each chapter of the book has code written in Python with industry-standard machine learning and deep learning libraries and frameworks such as Keras that the audience can reproduce in their working environment. This book is designed to cater to the needs of researchers, bioinformaticians, and data scientists in both academia and industry who want to leverage machine learning and deep learning technologies in genomic applications to extract insights from sets of big data. Managers and leaders who are already established in the life sciences and biotechnology sectors will not only find this book useful but can also adopt these methodologies to identify patterns, come up with predictions, and thereby contribute to data-driven decision-making in their respective companies.
The book is divided into three different parts. The first part introduces the fundamentals of genomic data analysis and machine learning. In this part, we will introduce the basic concept of genomic data analysis and discuss what machine learning is and why it is important for genomics and what value machine learning will bring to the life sciences and biotechnology industries. The second part will transition the readers from machine learning to deep learning and introduce them to the basic concepts of deep learning and diverse deep learning algorithms, using real-world examples to transform raw genomics data into biological insights. The final part will describe how to operationalize deep learning models using open source tools to enable predictions for end users. In this part, you will learn how to build and tune state-of-the-art machine learning models using Python and industry-standard libraries to derive biological insights from large amounts of multimodal genomic datasets and how to deploy these models on several cloud platforms such as AWS and Azure. The last chapter in the final part is fully dedicated to the current challenges for deep learning approaches to genomics and the potential pitfalls and how to avoid them using best practices.
What this book covers
Chapter 1, Introducing Machine Learning for Genomics, provides a brief history of the field of genomics and the practical application of machine learning methods to genomics, in addition to some of the technologies that this book will use.
Chapter 2, Genomics Data Analysis, gives readers a quick primer on data analysis in genomics. Using the Python programming language, readers will be able to make sense of the vast amounts of genomics data available and extract biological insights.
Chapter 3, Machine Learning Methods for Genomic Applications, introduces the reader to the two most important machine learning methods (supervised and unsupervised) and some of the important elements of standard machine learning pipelines. It also includes the practical real-world applications of supervised and unsupervised algorithms for genomics data analysis in the life sciences and biotechnology industries.
Chapter 4, Deep Learning for Genomics, will teach the reader about the fundamental concepts of deep learning, different types of deep learning models, and different deep learning Python libraries.
Chapter 5, Introducing Convolutional Neural Networks for Genomics, gives the reader a taste of Convolutional Neural Networks (CNNs), a type of deep neural network that is primarily used for sequence data, and shows how CNNs have superior performance compared to other deep learning methods.
Chapter 6, Recurrent Neural Networks in Genomics, introduces sequence-modeling architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) and shows how they are currently being applied in several applications.
Chapter 7, Unsupervised Deep Learning with Autoencoders, introduces unsupervised deep learning and its different methods, specifically Autoencoders, and their application in genomics.
Chapter 8, GANs for Improving Models in Genomics, introduces Generative Adversarial Networks (GANs) and how they can be used to improve deep neural networks trained on genomics datasets for predictive modeling.
Chapter 9, Building and Tuning Deep Learning Models, describes how to build and tune machine learning and deep learning models and deploy the final models across various computational systems and several platforms.
Chapter 10, Model Interpretability in Genomics, introduces the reader to how to interpret machine learning and deep learning models. The model interpretability introduced here helps readers to understand a model’s decision and why businesses are interested in model interpretability for creating trust, gaining profitability, and so on.
Chapter 11, Model Deployment and Monitoring, teaches the reader how to take the model they built on Google Colab and deploy it for predictions using open source tools such as Streamlit and Hugging Face. This chapter also describes how to monitor models using advanced tools and why monitoring matters for businesses.
Chapter 12, Challenges, Pitfalls, and Best Practices for Deep Learning in Genomics, informs the reader of the challenges and pitfalls associated with applying machine learning and deep learning methodologies to genomics applications. It also covers the best practices for building end-to-end machine learning and deep learning models and applying them to genomic datasets.
To get the most out of this book
This book aims to be as self-contained as possible. To get the maximum value out of it, a basic to intermediate knowledge of Python programming is recommended, and a background in genomics, statistics, and bioinformatics, along with some knowledge of data science, is a must. In addition, readers are expected to know the basics of machine learning and associated machine learning algorithms, such as regression and classification. The book takes a hands-on approach to implementing deep learning methodologies that will have you up and running and productive in no time. By the end of the book, you will be able to put your knowledge to work with this practical guide.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository. This will ensure you avoid any potential error related to copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Deep-Learning-for-Genomics-. Any updates to the code will be reflected in the GitHub repository. We also have other code bundles from our rich catalog of books and videos available at: https://github.com/PacktPublishing/. Check them out!
Conventions used
There are several text conventions used throughout this book.
Code in text: Indicates code words in the text, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.
A block of code is set as follows:
# covid19_features.py
from Bio import SeqIO
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold: First, import all the relevant libraries:
>>> from Bio import SeqIO
Any command-line input or output is written as follows:
>>> from Bio import SeqIO
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: In the Create the default IAM role pop-up window, select Any S3 bucket.
Tips or important notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Part 1 – Machine Learning in Genomics
This part will describe genomics data analysis and machine learning approaches to genomics. You will use state-of-the-art machine learning methods to transform raw genomics data into insights utilizing real-life examples in the life sciences and biotechnology industries.
This section comprises the following chapters:
Chapter 1, Introducing Machine Learning for Genomics
Chapter 2, Genomics Data Analysis
Chapter 3, Machine Learning Methods for Genomic Applications
1
Introducing Machine Learning for Genomics
Machine learning (ML) is the field of science that deals with developing computer algorithms and models that can perform certain tasks without being explicitly programmed to do so. That is to say, it teaches machines to learn from the input data provided to them rather than following rules specified in advance. The machine can then convert that learning into expertise or knowledge and use it for predictions. ML is an important tool for leveraging technologies around artificial intelligence (AI), a subfield of computer science that aims to automatically perform tasks that we, as humans, are naturally good at. ML has become an important part of modern business and research. The adoption of ML for genomics applications has accelerated recently because of the availability of large genomic datasets, improvements in algorithms, and, most importantly, superior computational power. More and more scientific research organizations and industries are expanding the use of ML across vast volumes of genomic data for predictive diagnostics, as well as to get biological insights at the scale of population health.
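To make the contrast between specifying rules and learning from data concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sequences, the labels, the hard-coded 0.6 cut-off, and the helper names are hypothetical and do not come from the book's code repository. It compares a rule written by hand with a cut-off chosen from labeled examples.

```python
# Toy illustration only: sequences, labels, and thresholds are made up.

def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def rule_based_predict(seq, threshold=0.6):
    """Conventional approach: a person hard-codes the decision rule."""
    return gc_content(seq) >= threshold

def learn_threshold(sequences, labels):
    """Learning approach: choose the cut-off that best fits labeled examples."""
    values = sorted(gc_content(s) for s in sequences)
    # Candidate cut-offs are midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def accuracy(t):
        hits = sum((gc_content(s) >= t) == y for s, y in zip(sequences, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy)

# Hypothetical training data: True marks sequences we call "GC-rich".
train_seqs = ["ATATATGC", "ATTTAAAT", "GCGCGCAT", "GGGCCCGA", "ATGCATAT", "GCGGGCCC"]
train_labels = [False, False, True, True, False, True]

learned = learn_threshold(train_seqs, train_labels)
print(f"learned cut-off: {learned:.3f}")
```

In the rule-based function the cut-off is fixed by whoever wrote the code; in the learned version it is derived from the examples and would shift if the training data changed. Real ML models learn many parameters at once, but the principle is the same.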
Genomics, the study of the genetic constitution of organisms, holds promise for understanding and diagnosing human diseases and for improving agriculture and livestock. The field has seen exponential growth in the last 15 years, mainly due to advances in high-throughput sequencing, also known as next-generation sequencing (NGS), technologies that generate enormous amounts of genomics data. It is estimated that between 100 million and as many as 2 billion human genomes could be sequenced by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195), representing an astounding growth of four to five orders of magnitude in 10 years and far exceeding the growth of many other big data domains. This complexity and the sheer amount of data generated create roadblocks not only for acquisition, storage, and distribution but also for genomic data analysis. The tools currently used in genomic analysis are built on deterministic approaches and rely on encoded rules to perform a particular task. To keep up with data growth, we need new and innovative approaches, such as ML, to enrich our understanding of basic biology and carry it into applied research. In this chapter, we’ll learn what ML is, why ML is essential for genomics, and what value ML brings to life sciences and biotechnology industries that leverage genome data for the development of genomic-based products. By the end of this chapter, you will understand the limitations of the current conventional algorithms for genomic data analysis, how solving problems with ML differs from conventional approaches, and how ML approaches can fill in those gaps and make generating biological insights much easier.
As such, in this chapter, we’re going to cover the following main topics:
What is machine learning?
Why machine learning for genomics?
Machine learning for genomics in life sciences and biotechnology
What is machine learning?
Before we talk about ML, let’s understand what AI is. In the simplest terms, AI is the ability of a machine to mimic human intelligence and iteratively improve itself based on the information it collects. The goal of AI is to build systems to perform actions that are routinely done by humans such as problem-solving, pattern matching, image recognition, knowledge acquisition, and so on. ML, a subset of AI, is the process of training a model to learn and improve from experience. Deep learning (DL), in turn, is a subfield of ML, in which we leverage artificial neural networks (ANNs) to mimic the human brain and find the nonlinear relationships between the input and output to generate predictions (Figure 1.1):
Figure 1.1 – AI versus ML versus DL – how they are related
In ML, a model is built from input data and an underlying algorithm to make useful predictions from real-world data. In a simplified ML setting, features, each representing an individual measurable property of the data, are provided as input, and labels are returned as the predictions. Suppose we want to predict whether a particular DNA sequence has a binding site for a transcription factor (TF) of interest. Using the traditional approach, we would use a position weight matrix (PWM) to scan the sequence and identify potential motifs that are overrepresented. Even though this works, it is extremely difficult, manual, not scalable, and so on. Using an