Deep Learning for Genomics: Data-driven approaches for genomics applications in life sciences and biotechnology
Ebook, 609 pages, 4 hours


About this ebook

Deep learning has shown remarkable promise in the field of genomics; however, there is a lack of a skilled deep learning workforce in this discipline. This book will help researchers and data scientists to stand out from the rest of the crowd and solve real-world problems in genomics by developing the necessary skill set. Starting with an introduction to the essential concepts, this book highlights the power of deep learning in handling big data in genomics. First, you’ll learn about conventional genomics analysis, then transition to state-of-the-art machine learning-based genomics applications, and finally dive into deep learning approaches for genomics. The book covers all of the important deep learning algorithms commonly used by the research community and goes into the details of what they are, how they work, and their practical applications in genomics. The book dedicates an entire section to operationalizing deep learning models, which will provide the necessary hands-on tutorials for researchers and any deep learning practitioners to build, tune, interpret, deploy, evaluate, and monitor deep learning models from genomics big data sets.
By the end of this book, you’ll have learned about the challenges, best practices, and pitfalls of deep learning for genomics.

Language: English
Release date: Nov 11, 2022
ISBN: 9781804613016
    Book preview

    Deep Learning for Genomics - Upendra Kumar Devisetty

    cover.png

    BIRMINGHAM—MUMBAI

    Deep Learning for Genomics

    Copyright © 2022 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Publishing Product Manager: Dhruv Jagdish Kataria

    Content Development Editor: Priyanka Soam

    Technical Editor: Rahul Limbachiya

    Copy Editor: Safis Editing

    Project Coordinator: Farheen Fathima

    Proofreader: Safis Editing

    Indexer: Rekha Nair

    Production Designer: Mohamed Huzair

    Marketing Coordinators: Shifa Ansari, Abeer Riyaz Dawe

    First published: October 2022

    Production reference: 1311022

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-80461-544-7

    www.packt.com

    Contributors

    About the author

    Upendra Kumar Devisetty has a Ph.D. in agriculture and over 12 years of experience working in Next-Generation Sequencing. He has a deep background in genomics and bioinformatics with a specialization in applying predictive analytics across a varied set of genomics problems in life sciences. Dr. Devisetty is currently working as a senior data science manager at Greenlight Biosciences, where he leads a team of bioinformatics scientists and data scientists to support the various bioinformatics and data science projects at Greenlight Biosciences with a mission to create mRNA-based solutions that can provide a cleaner environment and healthier people.

    About the reviewer

    Urminder Singh is a computer scientist and bioinformatician. His diverse research interests include understanding novel gene evolution, cancer genomics, machine learning in medicine, sociogenomics, and algorithms for big heterogeneous data. You can find him online at urmi-21.github.io.

    Table of Contents

    Preface

    Part 1 – Machine Learning in Genomics

    1

    Introducing Machine Learning for Genomics

    What is machine learning?

    Why machine learning for genomics?

    Machine learning for genomics in life sciences and biotechnology

    Exploring machine learning software

    Python programming language

    Visualization

    Biopython

    Scikit-learn

    Summary

    2

    Genomics Data Analysis

    Technical requirements

    Installing Biopython

    Matplotlib

    What is a genome?

    Genome sequencing

    Sanger sequencing of nucleic acids

    Evolution of next-generation sequencing

    Analysis of genomic data

    Steps in genomics data analysis

    Introduction to Biopython for genomic data analysis

    What is Biopython?

    Genomic data analysis use case – Sequence analysis of Covid-19

    Calculating GC content

    Calculating nucleotide content

    Dinucleotide content

    Modeling

    Motif finder

    Summary

    3

    Machine Learning Methods for Genomic Applications

    Technical requirements

    Python packages

    ML libraries

    Genomics big data

    Supervised and unsupervised ML

    Supervised ML

    Unsupervised ML

    ML for genomics

    The basic workflow of ML in genomics

    An ML use case for genomics – Disease prediction

    Data collection

    Data preprocessing

    EDA

    Data transformation

    Data splitting

    Model training

    Model evaluation

    ML challenges in genomics

    Summary

    Part 2 – Deep Learning for Genomic Applications

    4

    Deep Learning for Genomics

    Understanding what deep learning is and how it works

    Neural network definition

    Anatomy of deep neural networks

    Key concepts of DNNs

    An example of how neural networks work

    DNN architectures

    DNNs for genomics

    Deep learning workflow for genomics

    Broad application of DNNs in genomics

    Protein structure predictions

    Regulatory genomics

    Gene regulatory networks

    Single-cell RNA sequencing

    Introducing deep learning algorithms and Python libraries

    General deep learning libraries

    Deep learning libraries for genomics

    Summary

    5

    Introducing Convolutional Neural Networks for Genomics

    Introduction to CNNs

    What are CNNs?

    Transfer Learning

    CNNs for genomics

    Applications of CNNs in genomics

    DeepBind

    DeepInsight

    DeepChrome

    DeepVariant

    Summary

    6

    Recurrent Neural Networks in Genomics

    What are RNNs?

    Introducing RNNs

    How do RNNs work?

    Different RNN architectures

    Bidirectional RNNs (BiLSTM)

    LSTMs and GRUs

    Different types of RNNs

    Applications and use cases of RNNs in genomics

    DeepNano

    ProLanGo

    DanQ

    Understanding RNNs through Transcription Factor Binding Site (TFBS) predictions

    Summary

    7

    Unsupervised Deep Learning with Autoencoders

    What is unsupervised DL?

    Types of unsupervised DL

    Clustering

    Anomaly detection

    Association

    What are autoencoders?

    Properties of autoencoders

    How do autoencoders work?

    Architecture of autoencoders

    Types of autoencoders

    Autoencoders for genomics

    Gene expression

    Use case – Predicting gene expression from TCGA pan-cancer RNA-Seq data using denoising autoencoders

    Summary

    8

    GANs for Improving Models in Genomics

    What are GANs?

    Differences between Discriminative and Generative models

    Intuition about GANs

    How do GANs work?

    Challenges working with genomics datasets

    What is synthetic data?

    How can GANs help improve models?

    Practical applications of GANs in genomics

    Analysis of ScRNA-Seq data

    Generation of DNA

    Using GANs for augmenting population-scale genomics data

    Summary

    Part 3 – Operationalizing models

    9

    Building and Tuning Deep Learning Models

    Technical requirements

    DL life cycle

    Data processing

    Data collection

    Data wrangling

    Feature engineering

    Developing models

    Selecting an appropriate algorithm

    Model training

    Tuning the models

    Hyperparameter tuning

    Hyperparameter tuning libraries

    Classification metrics or performance statistics

    Visualizing performance

    Regression metrics

    Use case – Predicting the binding site location of the JunD TF

    Framing the TFBS prediction problem in terms of DL

    Processing the data

    Model training

    Summary

    10

    Model Interpretability in Genomics

    What is model interpretability?

    Black-box model interpretability

    Unlocking business value from model interpretability

    Better business decisions

    Building trust

    Profitability

    Model interpretability methods in genomics

    Partial dependence plot

    Individual conditional expectation

    Permuted feature importance

    Global surrogate

    LIME

    Shapley value

    ExSum

    Saliency map

    Use case – Model interpretability for genomics

    Data collection

    Feature extraction

    Target labels

    Train-test split

    Creating a CNN architecture

    Summary

    11

    Model Deployment and Monitoring

    Technical requirements

    Streamlit

    Hugging Face

    Introducing model deployment

    Steps in model deployment

    Types of model deployment

    Deploying models as services

    A use case for deploying a DL model as a web service – building a Streamlit application of the CNN model

    Monitoring models using advanced tools

    Why monitor models?

    Reasons for model degradation

    How to monitor DL models

    Advanced tools for model monitoring

    Addressing drifts

    Summary

    12

    Challenges, Pitfalls, and Best Practices for Deep Learning in Genomics

    Deep learning challenges regarding genomics

    Lack of flexible tools

    Fewer biological samples

    Computational resource requirements

    Expertise in DL frameworks

    Lack of high-quality labeled data

    Lack of model interpretability

    Common pitfalls for applying deep learning to genomics

    Confounding

    Data leakage

    Imbalanced data

    Improper model comparisons

    Best practices for applying deep learning to genomics

    Understand the problem and know your data better

    A simple model for a simple problem

    Establish a baseline for your model

    Ensure reproducibility

    Using pre-existing models for genomics

    Do not reinvent the wheel

    Tune hyperparameters automatically

    Focus on feature engineering

    Normalize the data

    Always perform model interpretation

    Avoid overfitting

    Summary

    Index

    Other Books You May Enjoy

    Preface

    Deep learning is a subset of machine learning based on artificial neural networks that perform representation learning on vast amounts of data. Machine learning, in turn, is a subfield of artificial intelligence, which covers sophisticated algorithms that enable machines to mimic human intelligence and perform human tasks automatically. Both deep learning and machine learning help detect meaningful patterns in data automatically, without explicit programming, and both have changed the way we live. We rely on them so much that it is hard to imagine a day without using them in one way or another, whether through email spam filtering, product recommendations, or speech recognition. Machine learning, and deep learning in particular, have also been adopted by the scientific community in areas such as biology, genomics, bioinformatics, and computational biology. High-throughput sequencing (HTS) technologies such as next-generation sequencing (NGS) have made a significant contribution to genomics, making it possible to study complex biological phenomena at single-base-pair resolution on an unprecedented scale and ushering in an era of big data genomics. To extract meaningful and novel biological insights from this big data, most algorithms are now based on machine learning and, more recently, deep learning methodologies, which can provide higher accuracy on specific genomics tasks than state-of-the-art rule-based algorithms. Given the growing adoption of machine learning and deep learning in genomics, research professionals, scientists, and managers need a good understanding of this field. That understanding equips them with the tools, technologies, and general guidelines for selecting machine learning and deep learning methods to handle genomics data and accelerate data-driven decision-making in industries related to life sciences and biotechnology.

    Throughout this book, we will learn how to apply deep learning approaches to solve real-world problems in genomics, interpret biological insights from deep learning models built from genomic datasets, and finally, operationalize deep learning models using open source tools to enable predictions for end users.

    Who is this book for?

    This book aims to practically introduce machine learning and deep learning for genomic applications that can transform genomics data into novel biological insights. It provides both the theoretical fundamentals and hands-on sections to give a taste of how machine learning and deep learning can be leveraged in real-world applications in the life sciences and biotech industries. This book covers a range of topics that are not currently available in other textbooks. The book also includes the challenges, pitfalls, and best practices when applying machine learning and deep learning to real-world scenarios. Each chapter of the book has code written in Python with industry-standard machine learning and deep learning libraries and frameworks such as Keras that the audience can reproduce in their working environment. This book is designed to cater to the needs of researchers, bioinformaticians, and data scientists in both academia and industry who want to leverage machine learning and deep learning technologies in genomic applications to extract insights from sets of big data. Managers and leaders who are already established in the life sciences and biotechnology sectors will not only find this book useful but can also adopt these methodologies to identify patterns, come up with predictions, and thereby contribute to data-driven decision-making in their respective companies.

    The book is divided into three different parts. The first part introduces the fundamentals of genomic data analysis and machine learning. In this part, we will introduce the basic concept of genomic data analysis and discuss what machine learning is and why it is important for genomics and what value machine learning will bring to the life sciences and biotechnology industries. The second part will transition the readers from machine learning to deep learning and introduce them to the basic concepts of deep learning and diverse deep learning algorithms, using real-world examples to transform raw genomics data into biological insights. The final part will describe how to operationalize deep learning models using open source tools to enable predictions for end users. In this part, you will learn how to build and tune state-of-the-art machine learning models using Python and industry-standard libraries to derive biological insights from large amounts of multimodal genomic datasets and how to deploy these models on several cloud platforms such as AWS and Azure. The last chapter in the final part is fully dedicated to the current challenges for deep learning approaches to genomics and the potential pitfalls and how to avoid them using best practices.

    What this book covers

    Chapter 1, Introducing Machine Learning for Genomics, provides a brief history of the field of genomics and the practical application of machine learning methods to genomics, in addition to some of the technologies that this book will use.

    Chapter 2, Genomics Data Analysis, gives readers a quick primer on data analysis in genomics. Using the Python programming language, readers will be able to make sense of the vast amounts of genomics data available and extract biological insights.

    Chapter 3, Machine Learning Methods for Genomic Applications, introduces the reader to the two most important machine learning methods (supervised and unsupervised) and some of the important elements of standard machine learning pipelines. It also includes the practical real-world applications of supervised and unsupervised algorithms for genomics data analysis in the life sciences and biotechnology industries.

    Chapter 4, Deep Learning for Genomics, will teach the reader about the fundamental concepts of deep learning, different types of deep learning models, and different deep learning Python libraries.

    Chapter 5, Introducing Convolutional Neural Networks for Genomics, gives the reader a taste of Convolutional Neural Networks (CNNs), a type of deep neural network that is well suited to grid-like inputs such as images and encoded sequence data, and shows how CNNs can outperform other deep learning methods on several genomics tasks.

    Chapter 6, Recurrent Neural Networks in Genomics, introduces sequence-modeling architectures such as Recurrent Neural Networks (RNNs) and LSTMs and shows how they are currently being applied in several applications.

    Chapter 7, Unsupervised Deep Learning with Autoencoders, introduces unsupervised deep learning, the different methods of unsupervised deep learning (specifically autoencoders), and their applications in genomics.

    Chapter 8, GANs for Improving Models in Genomics, introduces Generative Adversarial Networks (GANs) and how they can be used to improve deep neural networks trained on genomics datasets for predictive modeling.

    Chapter 9, Building and Tuning Deep Learning Models, describes how to build and tune machine learning and deep learning models and deploy the final models across various computational systems and several platforms.

    Chapter 10, Model Interpretability in Genomics, introduces the reader to how to interpret machine learning and deep learning models. The model interpretability introduced here helps readers to understand a model’s decision and why businesses are interested in model interpretability for creating trust, gaining profitability, and so on.

    Chapter 11, Model Deployment and Monitoring, teaches the reader how to take the model they built on Google Colab and deploy it for predictions using open source tools such as Streamlit and Hugging Face. In addition, this chapter also describes how to monitor models using advanced tools and how monitoring is a key metric for businesses.

    Chapter 12, Challenges, Pitfalls, and Best Practices for Deep Learning in Genomics, informs the reader of the challenges and pitfalls associated with applying machine learning and deep learning methodologies to genomics applications. It also covers the best practices for building end-to-end machine learning and deep learning models and applying them to genomic datasets.

    To get the most out of this book

    The book aims to be as self-contained as possible. To extract the maximum value from it, a basic to intermediate knowledge of Python programming is recommended, and a background in genomics, statistics, and bioinformatics, along with some knowledge of data science, is a must. In addition, readers are expected to know the basics of machine learning and associated machine learning algorithms, such as regression and classification. The book takes a hands-on approach to implementing deep learning methodologies that will have you up and running and productive in no time. By the end of the book, you will be able to put your knowledge to work.

    If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository. This will ensure you avoid any potential error related to copying and pasting of code.

    Download the example code files

    You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Deep-Learning-for-Genomics-. Any updates to the code will be reflected in the GitHub repository. We also have other code bundles from our rich catalog of books and videos available at: https://github.com/PacktPublishing/. Check them out!

    Conventions used

    There are several text conventions used throughout this book.

    Code in text: Indicates code words in the text, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.

    A block of code is set as follows:

    # covid19_features.py

    from Bio import SeqIO

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold: First, import all the relevant libraries:

    >>> from Bio import SeqIO

    Any command-line input or output is written as follows:

    >>> from Bio import SeqIO

    Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: In the Create the default IAM role pop-up window, select Any S3 bucket.

    Tips or important notes

    Appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Reviews

    Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

    For more information about Packt, please visit packt.com.

    Share Your Thoughts

    Once you’ve read Deep Learning for Genomics, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below

    https://packt.link/free-ebook/9781804615447

    Submit your proof of purchase

    That’s it! We’ll send your free PDF and other benefits to your email directly

    Part 1 – Machine Learning in Genomics

    This part will describe genomics data analysis and machine learning approaches to genomics. You will use state-of-the-art machine learning methods to transform raw genomics data into insights utilizing real-life examples in the life sciences and biotechnology industries.

    This section comprises the following chapters:

    Chapter 1, Introducing Machine Learning for Genomics

    Chapter 2, Genomics Data Analysis

    Chapter 3, Machine Learning Methods for Genomic Applications

    1

    Introducing Machine Learning for Genomics

    Machine learning (ML) is the field of science that deals with developing computer algorithms and models that can perform certain tasks without being explicitly programmed to do so. That is to say, rather than specifying rules by hand, it teaches machines to learn from the input data provided to them. The machine can then convert that learning into expertise or knowledge and use it for predictions. ML is an important tool for leveraging technologies around artificial intelligence (AI), a subfield of computer science that aims to automatically perform tasks that we, as humans, are naturally good at. ML is an important aspect of many modern businesses and research programs. The adoption of ML for genomics applications has accelerated recently because of the availability of large genomic datasets, improvements in algorithms, and, most importantly, superior computational power. More and more scientific research organizations and industries are expanding the use of ML across vast volumes of genomic data for predictive diagnostics, as well as to gain biological insights at the scale of population health.
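The contrast between specifying a rule and learning one from data can be sketched in a few lines of plain Python. This is only an illustrative toy, not code from the book: the sequences, the labels, the GC-content feature, and every function name below are hypothetical choices made for the example.

```python
# Toy contrast between a hand-written rule and a rule learned from data.
# All sequences, labels, and names are hypothetical and for illustration only.

def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def rule_based_classifier(seq):
    """Explicitly programmed: a human picked the 0.5 cutoff in advance."""
    return 1 if gc_content(seq) > 0.5 else 0


def learn_threshold(sequences, labels):
    """Learn a GC-content cutoff from labeled examples.

    Tries the midpoint between each pair of neighbouring GC values and keeps
    the cutoff that classifies the most training examples correctly.
    """
    values = sorted(set(gc_content(s) for s in sequences))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_cutoff, best_correct = 0.5, -1  # fallback if there are no candidates
    for cutoff in candidates:
        correct = sum(
            (gc_content(s) > cutoff) == bool(y)
            for s, y in zip(sequences, labels)
        )
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


# Hypothetical training set: label 1 marks the class we want to detect.
train_seqs = ["ATATATAC", "ATGCATAT", "ATGCGTAT", "ATGCGCAT", "GCGCGCAT"]
train_labels = [0, 0, 1, 1, 1]

cutoff = learn_threshold(train_seqs, train_labels)


def learned_classifier(seq):
    """Same form as the rule above, but the cutoff came from the data."""
    return 1 if gc_content(seq) > cutoff else 0


print("learned cutoff:", cutoff)
print("rule-based:", [rule_based_classifier(s) for s in train_seqs])
print("learned:   ", [learned_classifier(s) for s in train_seqs])
```

On this toy data the fixed 0.5 cutoff misses two of the positive examples, while the learned cutoff separates the classes. Real ML models learn many parameters at once rather than a single threshold, but the division of labor is the same: the programmer supplies the data and the learning procedure, not the rule.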

    Genomics, the study of the genetic constitution of organisms, holds promise for understanding and diagnosing human diseases and for improving our agriculture and livestock. The field of genomics has seen exponential growth in the last 15 years, mainly due to technological advances in high-throughput sequencing, also known as next-generation sequencing (NGS), which generates ever-larger amounts of genomics data. It is estimated that between 100 million and as many as 2 billion human genomes could be sequenced by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195), representing an astounding growth of four to five orders of magnitude in 10 years and far exceeding the growth of many big data domains. This complexity and the sheer amount of data generated create roadblocks not only for the acquisition, storage, and distribution of genomic data but also for its analysis. The current tools used in genomic analysis are built on deterministic approaches and rely on hand-encoded rules to perform a particular task. To keep up with data growth, we need new and innovative approaches, such as ML, in genomics to enrich our understanding of basic biology and to carry that understanding into applied research. In this chapter, we’ll learn what ML is, why ML is essential for genomics, and what value ML brings to life sciences and biotechnology industries that leverage genome data for the development of genomic-based products. By the end of this chapter, you will understand the limitations of the current conventional algorithms for genomic data analysis, how solving problems with ML differs from conventional approaches, and how ML approaches can fill in those gaps and make generating biological insights much easier.

    As such, in this chapter, we’re going to cover the following main topics:

    What is machine learning?

    Why machine learning for genomics?

    Machine learning for genomics in life sciences and biotechnology

    What is machine learning?

    Before we talk about ML, let’s understand what AI is. In the simplest terms, AI is the ability of a machine to mimic human intelligence and iteratively improve itself based on the information it collects. The goal of AI is to build systems to perform actions that are routinely done by humans such as problem-solving, pattern matching, image recognition, knowledge acquisition, and so on. ML, a subset of AI, is the process of training a model to learn and improve from experience. Deep learning (DL), in turn, is a subfield of ML, in which we leverage artificial neural networks (ANNs) to mimic the human brain and find the nonlinear relationships between the input and output to generate predictions (Figure 1.1):

    Figure 1.1 – AI versus ML versus DL – how they are related
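To make the idea of a nonlinear input-output relationship concrete, here is a minimal sketch of a two-layer network written in plain Python. It is an assumption-laden toy rather than anything from the book: the weights are picked by hand instead of being learned, and the function names are hypothetical. It computes XOR, a relationship that no single linear function of the two inputs can reproduce.

```python
# A tiny neural network with hand-picked (not learned) weights.
# It shows why a hidden layer with a nonlinear activation matters:
# the XOR pattern below cannot be produced by any single linear function.

def relu(x):
    """Rectified linear unit, a common nonlinear activation."""
    return max(0.0, x)


def tiny_network(x1, x2):
    # Hidden layer: two units, each a weighted sum followed by ReLU.
    h1 = relu(x1 + x2)          # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)    # active only when both inputs are on
    # Output layer: a linear combination of the hidden activations.
    return h1 - 2.0 * h2


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", tiny_network(a, b))
```

In a real DL workflow the weights would be fitted to data by gradient-based training; the point here is only that stacking nonlinear units lets the model express relationships that a linear model cannot.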


    In ML, a model is built from input data and an underlying algorithm to make useful predictions from real-world data. In a simplified ML setup, features, each representing an individual measurable property of the data, are provided as input, and labels are returned as the predictions. Suppose we want to predict whether a particular DNA sequence has a binding site for a transcription factor (TF) of interest. Using the traditional approach, we would use a position weight matrix (PWM) to scan the sequence and identify potential motifs that are overrepresented. Even though this works, it is extremely difficult, manual, not scalable, and so on. Using an
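The traditional scan mentioned above can be sketched as follows. This is a minimal illustration of sliding a position weight matrix (PWM) along a sequence and scoring each window with log-odds values. The 4-bp motif, its frequencies, the uniform background, and the function names are all made up for the example and do not describe a real transcription factor or the book's own code.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (one dict per
# motif position). The numbers are invented for illustration only.
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency

# Convert frequencies to log-odds scores against the background.
PWM = [{base: math.log2(p / BACKGROUND) for base, p in col.items()} for col in PFM]


def scan(seq, pwm):
    """Score every window of the sequence; returns (start, score) pairs.

    Only handles A, C, G, and T; any other character raises a KeyError.
    """
    seq = seq.upper()
    width = len(pwm)
    return [
        (i, sum(pwm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


def best_hit(seq, pwm):
    """Return the (start, score) pair of the highest-scoring window."""
    return max(scan(seq, pwm), key=lambda hit: hit[1])


start, score = best_hit("TTAGCTAA", PWM)
print("best window starts at", start, "with score", round(score, 2))
```

Here the matrix itself has to be curated by hand for each factor of interest, which is the manual step an ML model replaces by learning sequence preferences directly from labeled examples.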
