Visual Media: History and Perspectives

Thomas Huang, Vuong Le, Thomas Paine, Pooya Khorrami, and Usman Tariq
University of Illinois at Urbana-Champaign

In the early days of multimedia research, the first image dataset collected consisted of only four still grayscale images captured by a drum scanner. At the time, digital imaging was only available in laboratories, and digital videos barely existed. When more visual data became available, the problem of automatic image understanding emerged. In 1966, Marvin Minsky, the father of artificial intelligence, assigned "computer vision" as a summer project.

Half a century later, the amount of visual data has exploded at an unprecedented rate. Images and videos are now created, stored, and used by the majority of the population. Consequently, image analysis has been transformed into a sophisticated and powerful research field, providing services to all aspects of people's lives.

From the early days to now, a major mission of multimedia research has been providing humans with visual information about the world. This includes capturing the scene's content into a computing system, enhancing the image's appearance, and delivering it to people in the most compelling way. However, sometimes the underlying metadata is arguably even more important than the content itself.1 Visual data understanding research concentrates on either extracting the semantic meaning of the scene useful to the user or assisting the user in interacting with computers.

In this historical overview, we will follow the great journey that visual media research has embarked upon by looking at the fundamental scientific and engineering inventions. Through this lens, we will see that all three aspects of media capturing, delivery, and understanding have developed around the interaction with humans, making visual data processing a particularly human-centric field of computing.2

Early Days of Visual Media: The Analog Era
The first visual media was captured almost two centuries ago, when analog images were generated using cameras that recorded light on papers or plates coated with light-sensitive chemicals and stored on negative films, starting the long history of capturing methods. Figure 1 depicts the milestones during this era.

Together with the initial acquisition devices, delivery techniques also started to emerge. Analog images were enhanced by optical processes and printed on chemically sensitized paper. Analog optical instruments were also used for early image analysis methods such as the frequency-domain representation of images. With its sinusoidal function basis, the Fourier transform offered a new perspective on how to observe and modify a visual signal. Based on space-frequency analysis and the corresponding linear filters, algorithms were developed for applications such as compression, restoration, and edge detection.

Soon after early imaging was born, pioneers in the field realized that discretization of the analog visual signal could preserve most of the perceptible information while making operations much more convenient and efficient. This opened a new era of visual media processing: the digital era.

When Visual Media Became Mainstream: The Digital Era
You could say that the rise of digital visual media began in the 1950s, when the first drum scanners digitized images. These scanners did not directly capture a photograph but instead copied preexisting photos by picking up the different intensities in a picture and saving them as a string of binary bits. Since the drum scanner, other image digitization devices and methods have been created with marked improvements in quality and efficiency, such as charge-coupled device (CCD) scanners and early TV cameras.

In the 1970s, digital color images attracted attention. The famous Lena image was scanned and cropped from the centerfold of the
1070-986X/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.
Efficient Multimedia Storage
Unfortunately, if someone tried to store the multimedia content naively, it would require an extremely large amount of storage space. For instance, a two-hour standard definition (SD) video with a 720 × 480 pixel resolution and 24-bit color depth at 30 frames per second would take approximately 224 Gbytes in its original form.3 The good news is that there is a lot of redundancy in the data, so we can store the same or a similar amount of information in much less storage space. Several influential works have proposed highly efficient compression techniques that make it feasible to process large numbers of images and videos computationally. Some examples include the discrete cosine transform (DCT), discrete wavelet transform (DWT), and motion compensation, which are used to compress JPEG files, JPEG-2000 images, and MPEG videos, respectively.
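The sidebar's raw-size figure can be checked with a few lines of arithmetic; the constants below are taken directly from the sidebar's example:

```python
# Raw (uncompressed) size of the sidebar's example video:
# two hours of 720 x 480 SD video, 24-bit color, 30 frames per second.
width, height = 720, 480
bytes_per_pixel = 3              # 24-bit color depth = 3 bytes per pixel
frames_per_second = 30
duration_seconds = 2 * 60 * 60   # two hours

bytes_per_frame = width * height * bytes_per_pixel   # 1,036,800 bytes
raw_bytes = bytes_per_frame * frames_per_second * duration_seconds

print(raw_bytes)   # 223948800000 bytes, i.e. roughly 224 Gbytes
```

A codec in the MPEG family attacks exactly this redundancy: the DCT removes spatial redundancy within a frame, and motion compensation removes temporal redundancy between frames, typically shrinking the raw figure by two orders of magnitude or more.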
Figure 1. The evolution of multimedia acquisition over time: box cameras, 35 mm film, twin-lens reflex cameras, the drum scanner, commercially available color/flatbed scanners, the first digital cameras with CCD image sensors, consumer digital cameras, and depth cameras (before the 1920s through the 2010s).
November 1972 issue of Playboy magazine. It has since become widely used as a test object for evaluating image processing algorithms.

The next step in acquiring images came in 1990 with the introduction of the first consumer digital cameras. With the advances in image capture, and the ability to compress images so they could be stored, it was suddenly possible for anyone to build photo collections of several hundred images. This is when consumer imaging devices found their way into people's everyday life.

In keeping up with the revolution of visual content capture, multimedia storage has also come a long way, from the magnetic tapes of the early 20th century to Blu-ray discs in the previous decade to holographic storage. Figure 2 gives a timeline of the evolution of multimedia content storage. These advances in multimedia capture and storage were the first steps that would eventually facilitate large-scale multimedia data collection. (See the sidebar for more details.)

April–June 2014

With more data efficiently created and stored, visual data understanding evolved to derive contextual information from visual data. For example, one may want to know if a particular object is present in an image or the identity of a suspect in a given mugshot. Most of these methods required constructing image models defined via machine learning. These models were traditionally obtained in a supervised manner, where the labels of the training samples were given to a classifier, or in an unsupervised manner, using clustering algorithms. Some high-level inference tasks included multimedia retrieval, search, and recommendation, as well as problems in human-computer interaction systems such as gesture control, biometrics-based access control, and facial expression recognition.

Unfortunately, many of these data models worked poorly on raw image data, resulting in the need for more sophisticated data representations. Image features were introduced as ways of extracting distinctive information from the image data and forming a compact vector or descriptor. Oftentimes, the most effective features used information gathered at the low level, such as edges or corners. In particular, scale-invariant feature transform (SIFT) and histogram of oriented gradients (HOG) feature descriptors helped construct summaries of the distribution of edges in images and have led
Visions and Views
Figure 2. The evolution of multimedia storage over time: punched tapes, the phonograph, magnetic drums, bubble memory, the hard disk, 8-inch and 5.25-inch floppies, CD-RW, DVD, SmartMedia cards, memory sticks, and cloud-based multimedia storage and social networks (before the 1920s through the 2010s).
to state-of-the-art performance in object detection and recognition. Slowly but surely, researchers began to learn which information is relevant in image data.

For the groundbreakers of digital visual understanding, model overfitting was a major problem. This issue came from the fact that the more sophisticated and expressive a statistical model became, the more training data it needed to reliably compute the model parameters. Luckily, in the late 2000s, the revolution in multimedia technology increased the ease of access to visual data, leading to an estimated 2.5 billion people around the globe owning digital cameras.4 This number is predicted to soon surpass the world population. These seemingly unlimited sources of naturally captured and annotated data from everyday Internet users offer a potential remedy for data-scarcity issues and lead the way to the Internet era of visual media research.

Ubiquitous Visual Media: The Internet Era
The Internet has made the process of collecting images convenient. Previously, even the most expansive photo sets (such as the US Library of Congress) were physically limited to thousands of images from hundreds of photographers. In contrast, a single social networking site such as Facebook can collect images from a billion active users, resulting in hundreds of billions of images in total. In 2012, Facebook had more than 300 million images uploaded daily, which is equivalent to 821 people taking 1,000 images daily for an entire year. Users upload images to these sites to share them with their friends and family. This builds upon the image capturing advances from decades before, making images easier to deliver. Thus, images are no longer artifacts you keep in your house or carry in your wallet. They are instantly sent and aggregated.

IEEE MultiMedia

Media Understanding: Closing the Semantic Gap
Another benefit of social networking sites is that users often label their data. It is common for users to tag photos of their friends, describe the subject of videos, and curate their photos into albums of related events. This additional semantic information is another major difference between the datasets used by research labs in previous decades and the resources available to media understanding researchers today.

Armed with this massive amount of media data and semantic labels, researchers can try new approaches to media understanding. But how? The secret is more parameters. Before, researchers had to limit the number of parameters in their statistical models due to overfitting. They favored dimensionality reduction and linear models. But as datasets increased in size, overfitting became less of an issue, and bias became the limiting factor.

Now researchers are using increasingly more parameters in their media understanding algorithms, which can leverage larger datasets. A common way to increase the number of parameters is to extract features, learn an overcomplete mid-level representation, and then apply a linear classifier. This includes methods that discover object-part templates or learn sparse dictionaries. These methods improved results in object recognition and object detection. Deep neural networks continue this trend with many layers of parameters that can model image statistics at multiple scales. Having more parameters to learn makes these models flexible, allowing them to fit datasets with millions of images and generalize better to images in the wild.

One specific case of this is the 2012 ImageNet Large Scale Vision Recognition Challenge,5 which resulted in the ImageNet dataset (see Figure 3 for examples from several categories). The dataset contains more than one million images in 1,000 object categories, with highly varied images in each category. Figure 4 shows the …

Figure 3. Examples of the variations among the images of each category in the ImageNet Large-Scale Vision Recognition Challenge dataset (categories shown include proboscis monkey, African hunting dog, Siamese cat, and snowmobile).

Figure 4. Improvement of recognition results for the ImageNet dataset over time and methods. The graph shows the best results per year.

Wearable Gadgets and Moving to the Cloud
The word on the street is that wearable tech is the new chic. Wearable cameras will be a future trend. Some examples include GoPro, camera watches, and Google Glass. GoPro has already positioned itself in adventure sports, for example, where users attach a camera to their headgear and helmets. Camera watches, with a data connection through either a cell phone or a data network, will find applications in video telephony. Google Glass will enhance the user experience by bringing about new possibilities in communication and navigation. Such devices will once again change …
… tracking, and voice commands. Some examples of such systems are the Microsoft Kinect with gesture control, eye tracking for the disabled,7 and Apple's SIRI (www.apple.com/ios/siri/). These systems may work well for a range of commands, but as the intended actions become …

… visual media search involves the social aspect. This draws its intuition from human-in-the-loop architectures. The underlying assumption in such an architecture is that explicitly incorporating human information during learning can lead to algorithms with high-quality results.
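The human-in-the-loop assumption can be made concrete with classic relevance feedback. A minimal sketch follows, using Rocchio's update rule, a standard retrieval technique the article does not name itself; the feature vectors and weights are hypothetical toy values:

```python
# Minimal sketch of one round of human-in-the-loop relevance feedback.
# All vectors and weights here are hypothetical illustrations.

def rocchio_update(query, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query feature vector toward the centroid of images a
    worker marked relevant and away from those marked irrelevant."""
    new_query = [alpha * q for q in query]
    for vectors, weight in ((relevant, beta), (irrelevant, -gamma)):
        if not vectors:
            continue
        for i in range(len(new_query)):
            centroid_i = sum(v[i] for v in vectors) / len(vectors)
            new_query[i] += weight * centroid_i
    return new_query

# Toy 2D feature space: the worker prefers images near (1, 0).
query = [0.5, 0.5]
relevant_feedback = [[1.0, 0.1], [0.9, 0.0]]
irrelevant_feedback = [[0.0, 1.0]]
refined = rocchio_update(query, relevant_feedback, irrelevant_feedback)
print(refined)  # the query shifts toward the relevant cluster
```

Each feedback round repeats the loop: retrieve with the refined query, collect new worker judgments, and update again. The alpha, beta, and gamma weights control how strongly the original query, the positive feedback, and the negative feedback are trusted.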
One of the most common settings for human-in-the-loop techniques is image retrieval, where a user queries a certain topic and the computer returns a list of images it considers relevant. Then a feedback mechanism allows humans, or "workers," to iteratively refine the algorithm's results by choosing which of the retrieved images were correctly associated with a given concept.

With the evolution of social media such as Facebook, we can now model social connections among various users. This development has led to the introduction of a social network component to both the construction and application of multimedia systems, as Figure 6 shows.

Figure 6. Modern human-in-the-loop system architecture (developer, MM system, workers, and social users). Media understanding now incorporates social media connections into multimedia systems.

Upcoming Applications
Visual media will have a tangible impact on various areas such as healthcare, HCI, and security in the upcoming years. In healthcare, we may see its applications in psychology, with screening tools being developed for autism, depression, or attention deficit disorders using multimodal automated affective analysis. We may also see the application of visual media to taking vital signs. One recent success has been CardioCam, which finds a user's heart rate with a webcam by monitoring minute changes in skin color that correlate with blood circulation.9 We may also see applications of visual media understanding in automated nursing. For example, if we are able to build a system that can accurately track body movements, then we could monitor whether the exercises prescribed by a physiotherapist are being followed correctly.

In security, we may find intelligent systems being developed for anomaly detection and for tracking entities across surveillance cameras. The number of surveillance cameras installed in public is ever increasing. For instance, the British Security Industry Authority estimates that there are up to 5.9 million security cameras in the UK alone.10 It is impossible to monitor every camera at every instant, but is it possible to find anomalies automatically? Humans are very good at finding abnormal events in a particular situation. However, event detection depends strongly on context. For instance, running may be normal on a beach but abnormal inside a bank. Automating anomaly detection is an open research problem and is related to semantic understanding. Apart from this, another interesting problem is tracking entities across multiple cameras so that law enforcement agencies can follow suspects or vehicles. We hope that the research community will continue to make strides in these directions while addressing future challenges.

Conclusions
Visual data has evolved tremendously since the first set of pictures was digitized. This evolution has occurred in all aspects of storage, delivery, and understanding. Recent years in particular have witnessed unprecedented growth, and we expect exciting new breakthroughs in the near future. Moreover, these aspects are quickly converging via the ubiquity of devices and algorithms, leading to stronger interactions with humans. In the near future, the human factor will continue to be at the center of the field's development and will be the source of inspiration for media researchers and engineers in the next era. MM

Acknowledgments
Although this article lists only five authors, it was in fact a team effort, written by Thomas Huang and a number of his graduate students, including Le, Paine, Khorrami, and Tariq, in the Image Formation and Processing (IFP) Group. The group has weekly meetings as well as more frequent subgroup meetings, where many things are discussed, so the team knows each other's ideas and views well. In addition to the authors listed on the title page, the following students contributed to this article: Xinqi Chu, Kai-Hsiang (Sean) Lin, Jiangping Wang, Zhaowen Wang, and Yingzhen Yang.
References
1. M. Slaney, "Web-Scale Multimedia Analysis: Does Content Matter?" IEEE MultiMedia, vol. 18, no. 2, 2011, pp. 12–15.
2. A. Jaimes et al., "Guest Editors' Introduction: Human-Centered Computing–Toward a Human Revolution," Computer, vol. 40, no. 5, 2007, pp. 30–34.
3. R.C. Gonzalez and R.E. Woods, Digital Image Processing, 3rd ed., Prentice Hall, 2008.
4. "Samsung Announces Ultra-Connected SMART Camera and Camcorder Line-Up Throughout Range," Samsung, 9 Jan. 2012; www.samsung.com/us/news/20074.
5. J. Deng et al., "ImageNet: A Large-Scale Hierarchical Image Database," Proc. 2009 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
6. A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, P. Bartlett et al., eds., 2012, pp. 1106–1114.
7. J. Davis, "Eye-Tracking Devices Help Disabled Use Computers," Texas Tech Today, 21 Sept. 2011; http://today.ttu.edu/2011/09/eye-tracking-devices-help-disabled-use-computers/.
8. S. Dhar, V. Ordonez, and T.L. Berg, "High Level Describable Attributes for Predicting Aesthetics and Interestingness," Proc. 2011 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1657–1664.
9. M.-Z. Poh, D.J. McDuff, and R.W. Picard, "Advancements in Noncontact, Multiparameter Physiological Measurements Using a Webcam," IEEE Trans. Biomedical Eng., vol. 58, no. 1, 2011, pp. 7–11.
10. D. Barrett, "One Surveillance Camera for Every 11 People in Britain, Says CCTV Survey," The Telegraph, 12 July 2013; www.telegraph.co.uk/technology/10172298/One-surveillance-camera-for-every-11-people-in-Britain-says-CCTV-survey.html.

Thomas Huang is a Swanlund Endowed Chair Professor in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana–Champaign. His professional interests lie in the broad area of information technology, especially the transmission and processing of multidimensional signals. Huang has a ScD in electrical engineering from Massachusetts Institute of Technology. Contact him at [email protected].