Anne Carpenter Michael Schatz Matt Wood: Broad Institute, @drannecarpenter

#biodata14
Anne Carpenter
Broad Institute, @DrAnneCarpenter
Michael Schatz
Cold Spring Harbor Laboratory, @mike_schatz
Matt Wood
Amazon Web Services, @mza
@JasonWilliamsNY Charla Lambert
Data are interesting, but do not answer
any of the thousands of possible questions:
How does my genome compare to yours?
How does expression or methylation or chromatin change?
What diseases are you at risk for, what pathogens have you
been exposed to, and what medicines should we give you?

Who will answer those questions?

How will they do it?
Who is a Data Scientist?
http://en.wikipedia.org/wiki/Data_science
Biological Data
1 Illumina X-Ten sequences a genome every 30 minutes

~100k whole human genomes sequenced
Worldwide capacity exceeds 25 Pbp/year
How much is a petabyte?
Unit Size
Byte 1
Kilobyte 1,000
Megabyte 1,000,000
Gigabyte 1,000,000,000
Terabyte 1,000,000,000,000
Petabyte 1,000,000,000,000,000
*Technically a kilobyte is 2^10 bytes and a petabyte is 2^50 bytes
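The footnote's distinction between decimal and binary units can be made concrete with a short sketch. The dictionary and function names here are illustrative only, not part of the talk; the code compares the power-of-ten sizes in the table with the power-of-two sizes the footnote mentions.

```python
# Decimal sizes as listed in the table, in bytes.
UNIT_NAMES = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]
DECIMAL = {name: 10 ** (3 * (i + 1)) for i, name in enumerate(UNIT_NAMES)}
# Binary sizes from the footnote: 2^10 for a kilobyte up to 2^50 for a petabyte.
BINARY = {name: 2 ** (10 * (i + 1)) for i, name in enumerate(UNIT_NAMES)}


def binary_excess_percent(unit):
    """How much larger the binary unit is than the decimal one, in percent."""
    return (BINARY[unit] / DECIMAL[unit] - 1) * 100


for unit in UNIT_NAMES:
    print(f"{unit:9s} decimal={DECIMAL[unit]:,d}  binary={BINARY[unit]:,d}  "
          f"(+{binary_excess_percent(unit):.1f}%)")
```

The gap widens with each prefix, which is why the choice of convention matters more at petabyte scale than at kilobyte scale.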
How much is a petabyte?
100 GB / Genome
4.7 GB / DVD
~20 DVDs / Genome
10,000 Genomes = 1 PB Data
= 200,000 DVDs (787 feet of DVDs, ~1/6 of a mile tall)
= 500 2 TB drives ($500k)
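The petabyte figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 1.2 mm disc thickness is not on the slide (it is the standard DVD thickness), and the function name is hypothetical. The genome size, DVDs per genome, and drive size come from the slide.

```python
GB = 10**9
TB = 10**12

GENOME_BYTES = 100 * GB     # from the slide: 100 GB / genome
DVDS_PER_GENOME = 20        # the slide's rounding of 100 GB / 4.7 GB
DRIVE_BYTES = 2 * TB        # from the slide: 2 TB drives
DVD_THICKNESS_MM = 1.2      # assumption: standard disc thickness
MM_PER_FOOT = 304.8
FEET_PER_MILE = 5280


def genomes_in_everyday_units(genomes):
    """Convert a genome count into bytes, DVDs, stack height, and drives."""
    total_bytes = genomes * GENOME_BYTES
    dvds = genomes * DVDS_PER_GENOME
    stack_feet = dvds * DVD_THICKNESS_MM / MM_PER_FOOT
    return {
        "bytes": total_bytes,
        "dvds": dvds,
        "stack_feet": stack_feet,
        "stack_miles": stack_feet / FEET_PER_MILE,
        "drives": total_bytes // DRIVE_BYTES,
    }


summary = genomes_in_everyday_units(10_000)
for key, value in summary.items():
    print(key, value)
```

With the slide's rounded inputs, 10,000 genomes come to exactly one decimal petabyte, 200,000 DVDs, and 500 drives.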
DNA Data Tsunami
Current world-wide sequencing capacity is growing at ~3x per year!
[Chart: sequencing capacity in petabytes per year, 2014 to 2018; ~1 exabyte by 2018]
DNA Data Tsunami
Current world-wide sequencing capacity is growing at ~3x per year!
[Chart: sequencing capacity in exabytes per year, 2014 to 2024; ~1 exabyte by 2018, ~1 zettabyte by 2024]
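The two milestones on the charts (~1 exabyte by 2018, ~1 zettabyte by 2024) can be checked against the stated ~3x per year growth rate. The sketch below assumes smooth exponential growth at exactly 3x per year; the helper names are illustrative, and the implied 2014 value is back-computed from the 2018 milestone rather than read off the chart.

```python
import math

GROWTH_PER_YEAR = 3.0  # "~3x per year" from the slide


def years_to_grow(factor, growth=GROWTH_PER_YEAR):
    """Years of compound growth needed to multiply capacity by `factor`."""
    return math.log(factor) / math.log(growth)


def project(value, from_year, to_year, growth=GROWTH_PER_YEAR):
    """Project a yearly capacity forward (or backward) in time."""
    return value * growth ** (to_year - from_year)


# An exabyte to a zettabyte is a factor of 1000.
years_eb_to_zb = years_to_grow(1000)
zettabyte_year = 2018 + years_eb_to_zb

# Back-compute what 1 EB (1000 PB) in 2018 implies for 2014.
implied_2014_pb = project(1000.0, 2018, 2014)

print(f"years from 1 EB to 1 ZB: {years_eb_to_zb:.2f}")
print(f"zettabyte reached around: {zettabyte_year:.1f}")
print(f"implied 2014 capacity: {implied_2014_pb:.1f} PB/year")
```

Under these assumptions the exabyte-to-zettabyte step takes a little over six years, which is consistent with the 2018 and 2024 labels on the chart.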

How much is a zettabyte?
Unit Size
Byte 1
Kilobyte 1,000
Megabyte 1,000,000
Gigabyte 1,000,000,000
Terabyte 1,000,000,000,000
Petabyte 1,000,000,000,000,000
Exabyte 1,000,000,000,000,000,000
Zettabyte 1,000,000,000,000,000,000,000
How much is a zettabyte?
100 GB / Genome
4.7 GB / DVD
~20 DVDs / Genome
10,000,000,000 Genomes = 1 ZB Data
= 200,000,000,000 DVDs (150,000 miles of DVDs, ~ distance to moon)
Both currently ~100Pb, and growing exponentially
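For scale, the height of the zettabyte DVD stack can be compared with the Earth-Moon distance. This is a rough sketch: the 1.2 mm disc thickness and the 384,400 km mean Earth-Moon distance are assumptions brought in from outside the slides, and the variable names are illustrative.

```python
GENOMES = 10_000_000_000      # from the slide
GENOME_BYTES = 100 * 10**9    # from the slide: 100 GB / genome
DVDS = 200_000_000_000        # from the slide
DVD_THICKNESS_MM = 1.2        # assumption: standard disc thickness
KM_PER_MILE = 1.609344
MEAN_EARTH_MOON_KM = 384_400  # assumption: mean Earth-Moon distance

total_bytes = GENOMES * GENOME_BYTES
stack_km = DVDS * DVD_THICKNESS_MM / 1_000_000  # mm -> km
stack_miles = stack_km / KM_PER_MILE
fraction_of_moon = stack_km / MEAN_EARTH_MOON_KM

print(f"total data: {total_bytes:.1e} bytes")
print(f"stack height: {stack_km:,.0f} km ({stack_miles:,.0f} miles)")
print(f"fraction of mean Earth-Moon distance: {fraction_of_moon:.2f}")
```

Under these assumptions the stack comes to a bit under 150,000 miles, roughly 60% of the way to the Moon, so the slide's comparison holds as an order-of-magnitude estimate.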
Sequencing Centers 2014
Next Generation Genomics: World Map of High-throughput Sequencers

http://omicsmaps.com
Informatics Centers 2014
The DNA Data Deluge
Schatz, MC and Langmead, B (2013) IEEE Spectrum. July 2013.
Biological Data
Much of the capacity is used to sequence genomes (or exomes) of individuals
but biology is much more than just genomes
but biology is much more than just sequences
Soon et al., Molecular Systems Biology, 2013 10

Phil Bourne, Associate Director of Data Science for NIH
http://www.slideshare.net/pebourne/wiki-mania080914
Biological Data Science
Privacy & Security
How?
Integration of multiple data types
Massively scalable
Geographically distributed
Computationally flexible
Tolerate noise, errors, and artifacts
Support data exploration and ambiguity
Reliable, reproducible, and secure

Data Science Technologies
Results
Domain
Knowledge
Machine Learning
classification, modeling, visualization & data integration
Algorithmics
Streaming, Sampling, Indexing, Parallel
Compute Systems
CPU, GPU, Distributed, Clouds, Workflows
IO Systems
Hard drives, Networking, Databases, Compression, LIMS
Sensors & Metadata

Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies
Master Lecture
Homomorphic encryption as a tool to preserve privacy in genomic computation
Friday @ 4:30pm
Kristin Lauter, Ph.D.
Microsoft Research
Schedule Change
Saturday Morning: Human Biology
Mark Gerstein will present first in the session
Plan to break for lunch at 11:40am instead of noon
Eric Perakslis, Ph.D.
Harvard Medical School
Keynote Introduction
Ph.D. in CS from the Univ. of Colorado at Boulder in 1982
Member of the NAS and the American Academy of Arts and Sciences; Fellow of AAAS and AAAI
Research combines mathematics, computer science, and molecular biology
Pioneered the use of HMMs and other machine learning techniques for analyzing biological sequences
Major efforts in the human genome project, and developing the UCSC Genome Browser
Recently focused on understanding and fighting cancer; sharing of data through the Global Alliance for Genomics and Health
David Haussler, Ph.D.
Distinguished Professor of Biomolecular Engineering at UCSC
Investigator, Howard Hughes Medical Institute
Scientific Director, UC Santa Cruz Genomics Institute
Thank you!
@mike_schatz / #biodata14
