Anne Carpenter
Broad Institute, @DrAnneCarpenter
Michael Schatz
Cold Spring Harbor Laboratory, @mike_schatz
Matt Wood
Amazon Web Services, @mza
@JasonWilliamsNY Charla Lambert
Data are interesting, but do not answer
any of the thousands of possible questions:
How does my genome compare to yours?
How does expression or methylation or chromatin change?
What diseases are you at risk for, what pathogens have you
been exposed to, and what medicines should we give you?
http://en.wikipedia.org/wiki/Data_science
Biological Data
100 GB / Genome
4.7GB / DVD
~20 DVDs / Genome
10,000 Genomes
[Figure: petabytes per year, 2014 to 2018; annotation: ~1 exabyte by 2018]
DNA Data Tsunami
Current world-wide sequencing capacity is growing at ~3x per year!
[Figure: projected growth, 2014 to 2024; annotations: ~1 exabyte by 2018, ~1 zettabyte by 2024]
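As a rough check on how the ~3x per year growth rate connects the ~1 exabyte (2018) and ~1 zettabyte (2024) marks, here is a minimal sketch. It assumes a baseline of exactly 1 exabyte in 2018 and a constant 3x multiplier every year; both are simplifications of the slide's approximate figures, and the function name `project_volume` is made up for illustration.

```python
def project_volume(start_volume_eb, start_year, end_year, growth_per_year=3.0):
    """Compound a starting volume (in exabytes) by a fixed yearly factor.

    Assumption: growth is a constant multiplier each year, which is a
    simplification of the "~3x per year" figure on the slide.
    """
    return {
        year: start_volume_eb * growth_per_year ** (year - start_year)
        for year in range(start_year, end_year + 1)
    }


# Assumed baseline: ~1 exabyte in 2018, projected forward to 2024.
projection = project_volume(1.0, 2018, 2024)

for year, exabytes in projection.items():
    print(f"{year}: {exabytes:,.0f} EB")
```

Six years of tripling gives 3^6 = 729 exabytes, which is on the order of the ~1 zettabyte (1,000 exabytes) marked for 2024.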
100 GB / Genome
4.7GB / DVD
~20 DVDs / Genome
10,000,000,000 Genomes
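The per-genome figures can be turned into totals with a few lines of arithmetic. The sketch below assumes decimal (SI) units, so 1 GB = 10^9 bytes, and uses the slide's round numbers of 100 GB per genome and 4.7 GB per DVD; the constant and function names are made up for illustration.

```python
GB = 10**9   # decimal gigabyte (assumed; the slides do not say SI vs binary)
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte

GENOME_BYTES = 100 * GB  # ~100 GB per genome, from the slide
DVD_BYTES = 4.7 * GB     # single-layer DVD capacity, from the slide


def dvds_per_genome():
    """Number of DVDs needed to hold one genome's worth of data."""
    return GENOME_BYTES / DVD_BYTES


def total_bytes(num_genomes):
    """Total storage for num_genomes at 100 GB each."""
    return num_genomes * GENOME_BYTES


print(f"DVDs per genome: {dvds_per_genome():.1f}")
print(f"10,000 genomes: {total_bytes(10_000) / PB:g} PB")
print(f"10,000,000,000 genomes: {total_bytes(10_000_000_000) / ZB:g} ZB")
```

Under these assumptions, one genome fills about 21 DVDs (the slide rounds to ~20), 10,000 genomes come to 1 petabyte, and 10,000,000,000 genomes come to 1 zettabyte.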
Massively scalable
Geographically distributed
Computationally flexible
Results
Domain Knowledge
Machine Learning
Classification, modeling, visualization & data integration
Algorithmics
Streaming, Sampling, Indexing, Parallel
Compute Systems
CPU, GPU, Distributed, Clouds, Workflows
IO Systems
Hard drives, Networking, Databases, Compression, LIMS
Friday @ 4:30pm