Big Data For Good.: This Is The Industry Watch Blog
Permalink: http://www.odbms.org/blog/2012/06/big-data-for-good/
June 5, 2012.
Every day, 2.5 quintillion bytes of data are created. This data comes
from digital pictures, videos, posts to social media sites, intelligent
sensors, purchase transaction records, and cell phone GPS signals, to name
a few. This is Big Data.
There is great interest in Big Data in both the commercial and the
research communities. It has been predicted that analyzing Big
Data will become a key basis of competition, underpinning new waves
of productivity growth, innovation, and consumer surplus, according to
research by MGI and McKinsey's Business Technology Office.
But very few people seem to look at how Big Data can be used for
solving social problems. Most of the work in fact is not in this direction.
Why is this? What can be done in the international research community to
make sure that some of the most brilliant ideas also have an impact
on social issues?
I have invited a panel of distinguished, well-known researchers and
professionals to discuss this issue. The panelists include:
- Roger Barga, Microsoft Research, group lead, eXtreme Computing
Group, USA
- Laura Haas, IBM Fellow and Director, Institute for Massive Data,
Analytics and Modeling, IBM Research, USA
- Alon Halevy, Google Research, Head of the Structured Data
Group, USA
- Paul Miller, Consultant, Cloud of Data, UK
This Q&A panel focuses on exactly this question: is it possible to
conduct research for a corporation and/or a research lab, and at the
same time make sure that the potential output of our research also has
a social impact?
We take Big Data as a key example. Big Data is clearly of interest to
marketers and enterprises alike who wish to offer their customers
better services and better quality products. Ultimately their goal is to sell
their products/services.
This is good, but how about digging into Big Data to help people in
need? Preventing or predicting natural catastrophes, or helping to offer
services targeted at people and structures in social need?
I hope you'll find this interview interesting, as well as eye-opening. RVZ
Q1. In your opinion, would it be possible to exploit some of the
current and future research and development efforts on Big Data
for achieving social capital?
Alon: Yes. Big Data is not just the size of an individual data set, but
rather the collection of data that is available to us online (e.g.,
government data, NGOs, local governments, journalists, etc). By putting
these data together we help tell stories about the data and make them
of interest and of value to the wider public. As one simple example, a
recent Danish Journalism Award was given to a nice visualization of
data about which doctors are being sponsored by the medical industry.
The ability to communicate this data with the public is certainly part of
the Big Data agenda.
Laura: Absolutely. In fact, many of the efforts that we are engaged in
today are exactly in this direction. Much of our Smarter Planet related
research is around utilizing more intelligently the large amounts of data
coming from instrumenting, observing, and capturing the information
about phenomena on planet earth, both natural and man-made.
Paul: First, it's important to recognise that technological advances, new
techniques, and new ways of working often deliver both tangible and
intangible social benefit as a by-product of something else. Robert
Owen and his peers in the late 18th and early 19th centuries might
have had genuine motives for the social welfare and educational
programmes they delivered for workers in their factories, but it was the
commercial success of the factories themselves that paid for the
philanthropy. And better educated children became better integrated
factory workers, so it wasn't completely altruistic.
That said, there is clearly scope for Big Data to deliver direct benefits in
areas that aid society. Google Flu Trends is perhaps the best-known
example: analysis of many millions of searches for flu-related terms
(symptoms, medicines, etc.) enabling Google's non-profit Foundation to
provide early visibility of illness in ways that could/should assist local
healthcare systems. Google's search engine isn't about flu, and its
indices aren't for flu detection or prediction; this piece of societal value
simply emerges from the data exhaust of all those people searching a
single site. Flu Trends isn't alone; Harvard researchers found that
Twitter data could be analysed to track the spread of cholera in
Haiti in a way that proved substantially faster than traditional
techniques. According to Mathew Ingram's write-up of the research,
"What the Harvard and HealthMap study shows is that analyzing the
data from large sets like the tweets around Haiti isn't just good at
tracking patterns or seeing connections after an event has occurred, but
can actually be of use to researchers on the ground while those
events are underway" (my emphasis).
Roger: Absolutely, we have already seen several such examples. One
such example in science is Jim Gray and Alex Szalay's collaboration to
build a virtual observatory for astronomy, which leveraged relational
database technology. The SDSS Sky Server has since supported
hundreds of researchers and resulted in thousands of publications over
the years. Another, more recent example is the language translation
system that researchers in Microsoft Research built for aid relief workers
in Haiti after the 2010 earthquake. They leveraged the same technology
we leverage in our search operations to build a statistical machine
translation engine, translating Haitian Creole to English, from scratch in
4 days, 17 hours, and 30 minutes, and delivered it to aid workers in Haiti.
Q2. If yes, what are the areas where in your opinion Big Data could
have a real impact on social capital?
Alon: Bringing data that is otherwise hidden from view to the eyes of
the interested public. Data activists and journalists world-wide need to
be able to easily discover data sets, merge them in a sensible fashion
and tell stories about them that will grab people's attention. As another
example, helping people in crisis response situations has huge
potential. For instance, people have used Google Fusion Tables
to create maps with critical information for people after the Japan
earthquake in 2011 and before the hurricane in NYC later that year.
Laura: With any big data project, many of the same issues exist. I'll
mention three major categories of issues: those related to the data
itself, those related to the process of deriving insight and benefit from
the data, and finally, those related to management issues such as data
privacy, security, and governance in general. In the data space, we talk
about the 4 Vs of data: Volume (just dealing with the sheer size of it),
Variety (handling the multiplicity of types and sources and formats),
Velocity (reacting to the flood of information in the time required by the
application), and, last and perhaps least understood, Veracity (how can
we cope with uncertainty, imprecision, missing values, and yes,
occasionally, mis-statements or untruths?). The challenges with deriving
insight include capturing data, aligning data from different sources (e.g.,
resolving when two objects are the same), transforming the data into a
form suitable for analysis, modeling it, whether mathematically or
through some form of simulation, etc., and then understanding the
output (visualizing and sharing the results, for example). And
governance includes ensuring that data is used correctly (abiding by its
intended uses and relevant laws), tracking how the data is used,
transformed, derived, etc., and managing its lifecycle. There are
research topics in ALL of these areas!
Paul: Data availability: is there data available, at all? Increasingly,
there is. But coverage and comprehensiveness often remain patchy,
and the rigour with which datasets are compiled may still raise
concerns. A good process will, typically, make bad decisions if based
upon bad data.
Data quality: how good is the data? How broad is the
coverage? How fine is the sampling resolution? How timely are the
readings? How well understood are the sampling biases? What are the
implications in, for example, a tsunami that affects several Pacific Rim
countries? If data is of high quality in one country, and poorer in
another, does the aid response skew unfairly toward the well-surveyed
country or toward the educated guesses being made for the poorly
surveyed one?
Data comprehensiveness: are there areas without
coverage? What are the implications?
Personally Identifiable Information: much of this information is about
people. Can we extract enough information to help people without
extracting so much as to compromise their privacy? Partly, this calls for
effective industrial practices. Partly, it calls for effective oversight by
Government. Partly, perhaps mostly, it requires a realistic
reconsideration of what privacy really means, and an informed,
grown-up debate about the real trade-off between aspects of privacy
lost and benefits gained. Rather than
offering blanket privacy policies, perhaps customers, regulators and
software companies should be moving closer to some form of explicit
data agreement: if you give me access to X, Y, and Z about yourself, I
will use it for purposes A, B, and C and you will gain benefits/services
D, E, and F. The first two parts are increasingly in place, albeit
informally. The final part, the benefits, is far less well expressed.
Data
some of the critical problems facing our world. We may even create
economic value (not a bad thing, either!) while doing so.
_____________________________
Dr. Roger Barga has been with the Microsoft Corporation since 1997,
first working as a researcher in the database research group of
Microsoft Research, then as architect of the Technical Computing
Initiative, followed by architect and engineering group lead in the
eXtreme Computing Group of Microsoft Research. He currently leads a
product group developing an advanced analytics service on Windows
Azure. Roger holds a PhD in Computer Science (database systems),
MS in Computer Science (machine learning), and a BS in Mathematics.
Dr. Alon Halevy heads the Structured Data Group at Google
Research. Prior to that, he was a Professor of Computer Science at the
University of Washington, where he founded the Database Research
Group. From 1993 to 1997 he was a Principal Member of Technical
Staff at AT&T Bell Laboratories (later AT&T Laboratories). He received
his Ph.D. in Computer Science from Stanford University in 1993, and his
Bachelor's degree in Computer Science and Mathematics from the
Hebrew University in Jerusalem in 1988. Dr. Halevy was elected Fellow
of the Association for Computing Machinery in 2006.
Dr. Laura Haas is an IBM Fellow, and Director of IBM Research's new
Institute for Massive Data, Analytics and Modeling; she also serves as a
catalyst for ambitious research across IBM's worldwide research labs.
She was the Director of Computer Science at IBM's Almaden Research
Center from 2005 to 2011. From 2001 to 2005, she led the Information
Integration Solutions architecture and development teams in IBM's
Software Group. Previously, Dr. Haas was a research staff member and
manager at Almaden. She is best known for her work on the Starburst
query processor, from which DB2 LUW was developed, on Garlic, a
system which allowed integration of heterogeneous data sources, and
on Clio, the first semi-automatic tool for heterogeneous schema
mapping. She has received several IBM awards for Outstanding
Innovation and Technical Achievement, an IBM Corporate Award for her
work on information integration technology, and the Anita Borg Institute
Technical Leadership Award. Dr. Haas was Vice President of the VLDB
Endowment Board of Trustees from 2004 to 2009, and is a member of the
National Academy of Engineering and the IBM Academy of Technology,
an ACM Fellow, and Vice Chair of the board of the Computing
Research Association.
Dr. Paul Miller is Founder of the Cloud of Data, a UK-based
consultancy primarily concerned with Cloud Computing, Big Data, and
Semantic Technologies. He works with public and private sector clients
in Europe and North America, and has a Ph.D. in Archaeology