
Some statistical and computational challenges, and

opportunities in astronomy
G. Jogesh Babu∗and S. George Djorgovski†
December 21, 2002

Abstract
The data complexity and volume of astronomical findings have increased in recent decades due to
major technological improvements in instrumentation and data collection methods. The contemporary
astronomer is flooded with terabytes of raw data that produce enormous multidimensional catalogs of objects
(stars, galaxies, quasars, etc.) numbering in the billions, with hundreds of measured numbers for each
object. The astronomical community thus faces a key task: to enable efficient and objective scientific
exploitation of enormous multifaceted datasets and of the complex links between the data and astrophysical
theory. In recognition of this, the National Virtual Observatory (NVO) initiative has recently emerged
to federate numerous large digital sky archives and to develop tools to explore and understand these vast
volumes of data. The effective use of such integrated massive datasets presents a variety of new and challenging
statistical and algorithmic problems that require methodological advances. An interdisciplinary team of
statisticians, astronomers and computer scientists from The Pennsylvania State University, the California
Institute of Technology, and Carnegie Mellon University is developing statistical methodology for the
NVO. A brief glimpse into the Virtual Observatory and the work of the Penn State-led team is provided
here.

1 Introduction: Historical relationship


Astronomy was perhaps the most widely studied field of natural science from antiquity until the 18th century.
Observational and deductive astronomy led to the foundations of many important concepts in mathematical
statistics, such as least squares, theory of errors, curve fitting and minimax theory. A brief history of this is
given here.
The estimation of the error using the range of discrepant astronomical observations was perhaps the
most important early encounter with a statistical concept. Hipparchus (2nd century B.C.) realized that his
estimate of the length of a year was not without an error, and used the range in estimating the error. Other
early astronomers, until Tycho Brahe (1546-1601), took the liberty of using the ‘best’ of several discrepant
observations. Brahe’s use of the mean increased the accuracy of his results, assisting Johannes Kepler (1571-
1630) in his rejection of circular models and the discovery of his laws on elliptical planetary orbits (Hald
1990). In his 1632 Dialogue Concerning the Two Chief World Systems, Galileo Galilei included a detailed discussion of
what he called ‘observational’ errors, and a statistical analysis of the ‘new star’ (supernova) of 1572. Galileo
recommended the value requiring ‘the minimum amendments and smallest corrections’ possible to the data,
effectively the median. Thus Galileo’s analysis already contained the rudiments of least absolute deviation estimation.
Adrien Legendre published a volume in 1805 on new methods for determining the orbits of comets, which required the
estimation of a few unknowns from a large system of linear equations. He proposed minimizing the sum of the
squares of the errors. Laplace (1749-1827) and Gauss (1777-1855) contributed to the development of this ‘least
squares method’ and the mathematical theory of errors over the next several decades.
∗ Department of Statistics, The Pennsylvania State University, University Park, PA 16802
† Palomar Observatory, California Institute of Technology, Pasadena, CA 91125

This relationship weakened during the latter half of the 19th century, as astronomers turned principally
towards astrophysics, gaining insight into the physical nature of the universe, and statistics turned to
applications in the social sciences and industry. However, during the last few decades, a resurgence of interest
in statistical methods has emerged among astronomers, though with different emphases than in the past.
One major factor is the flood of data produced by large astronomical surveys at many wavebands. The
confrontation of astronomical data with astrophysical questions is becoming increasingly complex, outpacing
the capabilities of traditional statistical methods.
The summaries above represent only a small sample of the problems in modern observational astronomy
that require sophisticated statistical and data-analytic techniques. Given the weak connections between
the statistical and astronomical communities in recent decades, there is a need for improved communication
of existing statistical methods, and for the concerted development of new methods, for astronomy. It should
be noted, however, that isolated collaborating groups of astronomers and statisticians do exist.
Recent cross-disciplinary efforts in astrostatistics have produced valuable resources. A number of conferences
have been held in Europe (e.g. Rolfe 1983; Jaschek & Murtagh 1990; Subba Rao 1997) and the U.S.
(Feigelson & Babu 1992; Babu & Feigelson 1997; Feigelson & Babu 2003), astrostatistical sessions at large
meetings are being organized, and an introductory monograph on astrostatistics (Babu & Feigelson 1996)
and a monograph on spatial statistics in cosmology (Martinez & Saar 2001) have emerged.
Today, astronomy is becoming one of the most exciting and rapidly developing fields of the physical sciences,
creating new opportunities for collaborative efforts with statistics. A brief introduction to these opportunities
for collaborative work is provided in this paper.

2 Astronomical surveys
Astronomy is the field devoted to the study of physical objects beyond the Earth: our planetary system;
the Sun and stars; collections of stars such as the Milky Way Galaxy; galaxies distributed throughout the
Universe, including their active galactic nuclei such as quasars; the material between these structures, variously
called the interplanetary, interstellar and intergalactic media; and cosmology, the study of the Universe as a
whole. With rare exceptions, astronomical data are derived from observations of the electromagnetic radiation
produced by distant objects, made with telescopes on the Earth or in orbit around it. Some telescopes are placed into
orbit to get above the Earth’s atmosphere, which absorbs or degrades radiation at most wavelengths.
Observational astronomy is ‘Big Science’, with major ground-based telescopes costing hundreds
of millions of dollars and space-based observatories costing billions of dollars.
Astronomical data from a telescope are first reduced to usable forms. These include images (bivariate
integer or real functions representing electromagnetic radiation intensity as a function of location in the
sky), spectra (univariate intensities as a function of wavelength of the electromagnetic radiation), or time series
(univariate intensities as a function of observing time). Data products often mix these forms: e.g.,
a radio interferometer produces a data cube of electromagnetic radiation intensity versus location and
wavelength, and an X-ray telescope produces a 4-dimensional dataset of individual photons as a function
of location, wavelength and time.
An important intermediate data product between the raw telescope data and scientific investigation is the
astronomical survey. Common forms for surveys are an atlas of sky images at a particular wavelength, and a
multivariate database giving properties (columns) for each object (row) observed. Many important surveys
are commonly referred to in the astronomical community by their acronyms. Examples of current and
forthcoming surveys across the electromagnetic spectrum include: FIRST (Faint Images of the Radio Sky at
Twenty cm (wavelength)) and NVSS (New VLA (Very Large Array) Sky Survey) radio surveys (10^6 sources);
MAP (Microwave Anisotropy Probe) and Planck (formerly known as COBRAS/SAMBA – COsmic Back-
ground Radiation Anisotropy Satellite / SAtellite for Measurement of Background Anisotropies) microwave
band all-sky images; IRAS (Infrared Astronomical Satellite) and SIRTF (Space Infrared Telescope Facility)
mid/far-infrared surveys (10^5 objects); 2MASS (2-Micron All-Sky Survey) and DENIS (Deep Near Infrared
Survey (of the Southern sky)) near-infrared surveys (10^8 objects); USNO (United States Naval Observatory),
SDSS (Sloan Digital Sky Survey) and DPOSS (Digital Palomar Observatory Sky Survey) visible band
surveys (10^9 objects); ROSAT (Röntgen Satellite), Chandra and XMM (X-ray Multi-Mirror Satellite) X-ray
surveys (10^5 sources); and CGRO (Compton Gamma-Ray Observatory) and GLAST (Gamma-ray Large
Area Space Telescope) γ-ray surveys (10^3 sources). For details on NASA’s astrophysics data environment
see Bredekamp and Golombek (2002).

3 Major data avalanche


Astronomy has become an immensely data-rich and still growing field. A paradigm shift is underway in the very
nature of observational astronomy. While in the past a single astronomer or a small group might observe a
handful of objects, today large digital sky surveys are becoming the norm. Data are already streaming in
from surveys such as the Two Micron All Sky Survey and the Sloan Digital Sky Survey, which are providing
maps of the sky at infrared and optical wavelengths, respectively. The synoptic sky surveys, e.g., Solar system
patrols such as NEAT (Near Earth Asteroid Tracking) or LONEOS (Lowell Observatory Near Earth Object
Search); GRB patrols such as LOTIS (Livermore Optical Transient Imaging System) or ROTSE (Robotic
Optical Transient Search Experiment); microlensing experiments such as MACHO (Massive Compact Halo
Object search) and OGLE (Optical Gravitational Lensing Experiment); etc., will add another dimension,
time, to the data. Thus, large digital sky surveys are becoming the dominant source of data in astronomy.
Major archives already contain more than 100 terabytes of data, and they are growing rapidly. A typical sky survey
archive has approximately 10 terabytes of image data and a billion detected sources (stars, galaxies, quasars,
etc.), with hundreds of measured attributes per source. These surveys span the full range of wavelengths,
from radio through gamma-ray. Yet these are just a taste of the much larger data sets to come. Yearly
advances in electronics bring new instruments, roughly doubling the amount of data collected each year and leading
to exponential growth of information in astronomy. Thus data sets orders of magnitude larger, more
complex, and more homogeneous than in the past are on the horizon. For comparison, the size of the Human
Genome is about one gigabyte and that of the Library of Congress is about 20 terabytes.
Consequently, the data volumes here are several orders of magnitude larger than what astronomers and
statisticians are used to dealing with. These massive data sets are also much more complex (e.g., tens or
hundreds of measured parameters per source) and higher dimensional than what we are used to.
This great opportunity comes with a commensurate technological challenge: how to optimally store, manage,
combine, analyze and explore these vast amounts of complex information, and how to do it quickly and efficiently?
Some powerful techniques already exist and can be tested in these new astronomical applications; others
will have to be developed, in collaboration, by astronomers, statisticians and computer scientists. The range
of astrostatistical challenges is truly vast. A few of these issues are discussed later in the paper.

4 Pan-chromatic view of the universe


The current and forthcoming data (> 100 terabytes) span the full range of wavelengths, from radio through
X-ray and beyond, and can potentially provide a pan-chromatic and less biased view of the universe. The sky
‘looks’ different at different wavebands. X rays, and other wavebands such as radio, infrared, ultraviolet
and gamma rays, cannot be seen with the human eye, so they do not have any ‘color’ in the usual sense. To
see the invisible wavelengths, detectors especially designed for them, such as the instruments on Chandra,
are needed. Images taken by detectors that see invisible wavelengths are called false
color images. The colors used by astronomers to construct a composite picture are not real but are chosen
to bring out important details.
Observations at different wavelengths carry important information about the nature of celestial
objects. Figure 1 shows X-ray and optical images of two giant galaxy clusters, located 2.5 and 3.1 billion
light years from Earth, respectively. The Chandra data (left) provide a detailed temperature map of the
hot gas and allow astronomers to precisely determine the total masses of the clusters. Most of the mass
is in the form of dark matter. The Hubble data (right) place independent constraints on the masses of
the clusters that confirm the Chandra results. A pan-chromatic approach to the universe reveals a more
complete physical picture. Figure 2 shows views of the Crab Nebula (a supernova remnant and pulsar that
was first sighted by Chinese astronomers in 1054 AD) at X-ray, optical, infrared and radio wavelengths. Chandra’s
X-ray image of the Crab Nebula directly traces the most energetic particles being produced by the pulsar.
This image reveals an unprecedented level of detail about the highly energetic particle winds and
allows astronomers to probe deep into the dynamics of this cosmic powerhouse. As time goes on and the
electrons move outward, they lose energy to radiation. The diffuse optical light comes from intermediate
energy particles produced by the pulsar. The infrared radiation comes from electrons with energies lower
than those producing the optical light. Radio waves come from the lowest energy electrons; they can
travel the greatest distance and define the full extent of the nebula. The composite picture in Figure 3 is
constructed by overlaying the data from the four wavelengths. Astrophysical phenomena generated by these
objects can only be understood by combining data at several wavebands. This requires the federation of different
sky surveys, matching the source objects across different wavelengths.

Figure 1: X-ray/optical images of two giant galaxy clusters, Abell 2390 and MS2137.3-2353, located 2.5 and
3.1 billion light years from Earth, respectively. Chandra’s large scale X-ray images (left) show hot gas filling
the two giant galaxy clusters, while Hubble’s smaller scale optical images (right) show the distribution of
galaxies in the central regions of the same clusters. (Credit: X-ray: NASA/IOTA/S.Allen et al., Optical:
HST).
Figure 2: Multiwavelength views of the Crab Nebula, a supernova remnant and pulsar that was first sighted by
Chinese astronomers in 1054 AD. It is 6000 light years from Earth. (Credit: NASA/CXC/SAO).

Figure 3: A pan-chromatic view of the Crab Nebula seen in Figure 2 above. (Credit: NASA/CXC/SAO).

Figure 4: Results of the first long-duration X-ray survey of the Hubble Deep Field North. Chandra detected
X rays from six of the galaxies in the field. A surprising result that must be studied further is the lack of X
rays from some of the extremely luminous galaxies at huge distances (over 10 billion light years) from Earth.
The Chandra results raise questions about the current theories used to explain the high energy output of
these objects. (Credit: Optical: NASA/HST, X-ray: NASA/PSU).

Another phenomenon revealed by multiwavelength studies is exemplified by Figure 4 (Hornschemeier et
al. 2000). Using NASA’s Chandra X-ray Observatory, astronomers have made the first long-duration X-ray
survey of the Hubble Deep Field North (a small patch of the sky selected for unprecedented deep imaging
by the Hubble Space Telescope in visible light, and then followed by other deep observations at other
wavelengths, including X-ray). They detected X rays from six of the galaxies in the field, and were surprised
by the lack of X rays from some of the most energetic galaxies in the field. The X-ray emitting objects
discovered by the research team are a distant galaxy thought to contain a central giant black hole, three
elliptically shaped galaxies, an extremely red distant galaxy, and a nearby spiral galaxy. However, it was
very surprising to find that none of the X-ray sources lined up with any of the submillimeter-wave sources.
The submillimeter sources are extremely luminous, dusty galaxies that produce large amounts of infrared
radiation. This is an example of an astrophysical process where truncation or censoring is present in one or more
coordinates. There are no counterparts to nonparametric methods such as product-limit estimation in higher
dimensions, and there is very little work in the statistical literature on handling multivariate data where truncation
and/or censoring is present in one or more dimensions. More work needs to be done by statisticians in this
area for effective analysis of current and forthcoming astronomical data.
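To illustrate the univariate building block mentioned above, the sketch below computes the product-limit (Kaplan-Meier) estimate of a survival-type distribution from measurements in which some values are only censored limits. This is a standard textbook construction, not new methodology, and the toy data and variable names are our own; in one dimension the estimator is straightforward, and it is its multivariate generalization, with censoring or truncation in several coordinates at once, that remains an open statistical problem.

```python
import numpy as np

def kaplan_meier(times, observed):
    """Product-limit (Kaplan-Meier) estimate of the survival function S(t).

    times:    measured values (after any sign flip needed so that
              censoring acts as right-censoring)
    observed: True where the value is a real detection, False where it is
              only a censored limit
    Returns the distinct detected values and S evaluated just after each one.
    """
    order = np.argsort(times)
    times, observed = times[order], observed[order]
    n = len(times)
    at_risk = n - np.arange(n)            # number still "at risk" at each ordered value
    surv, s = [], 1.0
    event_values = np.unique(times[observed])
    for t in event_values:
        d = np.sum((times == t) & observed)        # detections exactly at t
        r = at_risk[np.searchsorted(times, t)]     # sources with value >= t
        s *= 1.0 - d / r                           # product-limit update
        surv.append(s)
    return event_values, np.array(surv)

# toy example: 80% detections, 20% censored limits
rng = np.random.default_rng(5)
flux = rng.exponential(1.0, 500)
detected = rng.random(500) < 0.8
values, S = kaplan_meier(flux, detected)
```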
Typically, astronomical data start as digital images of some portion of the sky, at some wavelength.
Instrumental effects are removed from the data, producing a quantifiable image of flux (light energy per square
cm per second) as a function of the two spatial coordinates (projected on the sky); in some rare cases, one produces
data cubes rather than 2-d images, where the third dimension is wavelength or time. A source-finding
algorithm is then run, which identifies individual (discrete) astronomical sources, e.g., stars, galaxies,
quasars, etc., and parametrizes the way their flux is distributed spatially, in wavelength, etc. The number
of independent measured parameters for each source then defines the dimensionality of a parameter space,
and each source can be represented as a point in this parameter space of observed properties. Examples of
parameters include fluxes, flux ratios (also known as colors), sizes, measures of the image shapes, concentrations,
etc. Some of the modern digital sky surveys measure hundreds of attributes for each detected source.
This parameter space representation then in principle contains all the information present in the original data,
in a condensed form associated with the detected sources, ignoring the “empty” pixels. Furthermore, it
transforms the data (which start as panoramic imagery) into a quantitative form suitable for statistical
analysis.
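As a toy illustration of this reduction from imagery to a parameter-space table (not any survey's actual pipeline; the thresholds, names and measured attributes below are invented), the following sketch thresholds a small simulated image, labels connected groups of bright pixels as "sources", and records a few parameters per source, so that each row of the resulting array is one point in the parameter space.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(6)
image = rng.normal(0.0, 1.0, (512, 512))           # background noise
for _ in range(40):                                 # inject a few fake point sources
    y, x = rng.integers(10, 502, 2)
    image[y-2:y+3, x-2:x+3] += rng.uniform(5, 50)

# simple source finding: pixels above a significance threshold, grouped into objects
threshold = 4.0 * image.std()
labels, n_sources = ndimage.label(image > threshold)

catalog = []
for k in range(1, n_sources + 1):
    mask = labels == k
    flux = image[mask].sum()                          # total flux of the source
    y, x = ndimage.center_of_mass(image, labels, k)   # flux-weighted position
    size = mask.sum()                                 # area in pixels, a crude shape measure
    catalog.append((x, y, flux, size))

# each row is one source, i.e. one point in a (here 4-dimensional) parameter space
catalog = np.array(catalog)
print(catalog.shape)
```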
The systematic, pan-chromatic approach would enable new science, in addition to what can be done
with individual surveys. It would make possible meaningful, effective experiments within these vast
parameter spaces.

5 The National Virtual Observatory


Many astronomical observations (especially the more traditional ones) consist of measurements of properties
of individual, preselected sources or samples thereof (for example, flux measurements at some wavelength
for a sample of a hundred nearby spiral galaxies which for some reason the astronomer wants to know).
Such observations, obtained with many different telescopes, instruments, integration times, etc., are very
heterogeneous in their properties (depth, wavelength coverage, spatial resolution, etc.), and as such are
not easily converted into homogeneous data sets suitable for a proper statistical analysis. In the recent
past, observational astronomy involved such pointed heterogeneous observations (∼ Megabyte – Gigabyte)
with small samples of objects (∼ 10 − 1000). The current trend is towards large, homogeneous sky surveys
(multi-terabyte, with 10^6 − 10^9 sources), leading to archives of homogeneous observations. As
mentioned earlier, forthcoming projects and sky surveys are expected to deliver data volumes measured in
petabytes, with repeated, multiple-epoch measurements for billions of sources. For each object, a few to
∼ 100 parameters are measured, most (but not all) with quantifiable errors, and with missing data in one or
more dimensions. Individually, each of these surveys will lead to many advances in our understanding of the
physical processes that drive the formation and evolution of the Universe. As seen in the examples above,
combined they will provide the first digital map of the local and distant Universe across many decades of
wavelength of the electromagnetic spectrum.
The astronomical community thus faces a key task: to enable efficient and objective scientific exploitation
of enormous multifaceted datasets. The National Virtual Observatory (NVO) initiative has recently emerged,
in recognition of this need and in response to a top priority recommendation of the National Academy of
Sciences’ Decadal Report on astronomy for 2000-2010 (Taylor & McKee 2000), to federate numerous large
digital sky archives, both ground based and space based, and to develop tools to explore and understand
these vast volumes of data. The effective use of such integrated massive datasets presents a variety of new and
challenging statistical and algorithmic problems that require methodological advances. A major effort is
needed by cross-disciplinary teams of astronomers, computer scientists and statisticians to bring advances
in these fields into the toolbox of observational astronomy. The concept of a Virtual Observatory (VO) is
now being pursued worldwide, with several major projects under way in Europe and elsewhere.
The concept of the Virtual Observatory, its goals, challenges, and possible approaches are described, e.g.,
in the report of the National Virtual Observatory Science Definition Team, available at http://nvosdt.org, in
the “white paper” available at http://www.arXiv.org/abs/astro-ph/0108115, and in numerous papers in the
volumes edited by Brunner et al. (2001) and Banday et al. (2001). Also see the papers on massive datasets by
Djorgovski et al. (2002), Nichol et al. (2002), Strauss (2002), Szalay and Matsubara (2002) and others in the
proceedings of the conference ‘Statistical Challenges in Modern Astronomy III’ (Feigelson and Babu 2002).
Many papers on the current status of the NVO appear in the proceedings of the ESO/ESA/NASA/NSF
Astronomy Conference (June 2002, Garching, Germany), ‘Toward an international virtual observatory’ (Górski et
al. 2003).
Implementation of the NVO involves significant technical challenges on many fronts. Significant efforts
deal with the applied computing science and information technology aspects of the NVO. But scientific
discovery requires more than effective storage and distribution of information. How to explore datasets
comprising hundreds of millions of objects, each with dozens of attributes? How to identify correlations and
anomalies within the datasets? How to classify the detected sources to isolate subpopulations of astrophysical
interest? How to use the data to constrain astrophysical interpretations, which often involve highly non-linear
parametric functions derived from fields such as physical cosmology, stellar structure or atomic physics?
The challenges posed by the analysis of large and complex data sets expected in the NVO-based research
are driven both by the size and the complexity of the data sets (billions of data vectors in parameter spaces
of tens or hundreds of dimensions), by the heterogeneity of the data and measurement errors, by selection
effects (Figure 4) and censored data, and by the intrinsic clustering properties (functional form, topology)
of the data distribution in the parameter space of observed attributes. The technological challenges for the
NVO include the development of efficient database architectures, query mechanisms, and data standards.
Techniques are needed for systematic exploration of the observable parameter spaces of measured source
attributes from federated sky surveys to search for rare or even new types of objects. This will include
supervised and unsupervised classification and clustering analysis techniques. Scientific questions one may
wish to address include: objective determination of the numbers of object classes present in the data, and
the membership probabilities for each source; searches for unusual, rare, or even new types of objects and
phenomena; discovery of physically interesting (generally multivariate) correlations which may be present in
some of the clusters; etc.
A key challenge for the NVO will be developing ways to simultaneously analyze data from several of the
dozens of astronomical databases available today. Each of those databases is organized differently, which
makes it quite difficult to perform analyses of data from several collections simultaneously. The NVO would
not only link the major astronomical data assets into an integrated, but virtual, system to allow automated
multiwavelength search and discovery among all cataloged astronomical objects, but also would provide
advanced statistical and data analysis methods for the astronomical community.
Enormous opportunities exist for sustained statistical research. The NVO would create data standards and tools
for mining the data, and would provide a link between the exciting astronomical data and academic communities
in many disciplines, including statistics. Most importantly, the NVO would provide access to powerful new
resources to scientists and students everywhere, who could do first-rate observational astronomy regardless
of their access to large ground-based telescopes. The NVO would also facilitate the inclusion of new massive
data sets, and help optimize the design of future surveys and space missions.
The challenges posed by the analysis of massive data sets in astronomy, e.g., in the context of a VO, are
common to many or all information-intensive sciences today, with potential uses in many other modern fields
of endeavor: technology, commerce, national security, etc. Thus, the tools and the methodologies developed
in this context are likely to find useful applications elsewhere, with potentially great interdisciplinary and
societal benefits.

6 Current NVO related statistical and data analytic efforts


Our team, consisting of statisticians, computer scientists and astronomers from Penn State, Carnegie Mellon
and Caltech, is addressing some of the critically important statistical challenges raised by the NVO. A
brief description of our team’s efforts on a few of these issues is presented here. Our aim is not to present
detailed statistical analyses, but to point to work in progress by the collaborating teams.

6.1 Low-storage percentile estimation for streaming massive data


In dealing with massive or streaming datasets, conventional statistical methods for testing hypotheses and
building predictive models may not be viable. When the data are streaming in from telescopes, we do
not have access to the entire data set at once; we only have access to the data points sequentially. Even
the computation of test statistics and estimates of parameters as simple as the median, which requires
sorting the entire dataset, may pose difficulties in such cases. Standard sorting algorithms often require that
the entire dataset be placed into memory. When confronted with databases containing millions or billions
of objects, this may be impossible due to limitations on memory and CPU.

[Figure 5 plot: the vertical axis shows the difference between the estimated median and the sample median (0.0 to 0.05); the horizontal axis shows the number of observations after the initial 50 (0 to 100,000).]

Figure 5: Sequential plot of the difference between the median estimated using the low-storage method and
the sample median for a Cauchy dataset based on 100,000 points.

Liechty et al. (2002) have recently developed a sequential procedure to estimate the p-th quantile (0 <
p < 1). It is a low-storage sequential algorithm that uses estimated ranks and weights to calculate scores, which
determine the most attractive candidate data points to keep as the estimate of the quantile. The method
requires storing only a fixed number of points, say m, in memory for sorting and ranking. Initially, each of these
points is given a weight and a score based on p. When a new point from the dataset arrives, it is placed in the array
and all the points in the existing array above it have their ranks increased by 1. The weights and scores are then
updated for these m + 1 points, the point with the largest score is dropped from the array, and
the process is repeated. See Liechty et al. [15] for the details.
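As a rough illustration of the flavor of such single-pass schemes (not the actual Liechty et al. procedure, whose weight and score functions are more sophisticated), the sketch below keeps only m values in memory, tracks an estimated global rank for each, and discards the stored point whose estimated rank is farthest from the target quantile; the variable names and the scoring rule are our own simplifications.

```python
import numpy as np

def streaming_quantile(stream, p, m=50):
    """Single-pass, low-storage estimate of the p-th quantile.

    A simplified sketch: keep at most m values with estimated global ranks;
    after each arrival drop the stored value whose estimated rank fraction is
    farthest from p.  Not the Liechty et al. (2002) algorithm itself.
    """
    values, ranks = [], []          # stored points and their estimated global ranks
    n = 0                           # number of points seen so far
    for x in stream:
        n += 1
        i = np.searchsorted(values, x)      # position of x among the stored (sorted) values
        for j in range(i, len(values)):     # points above x move up by one in the ordering
            ranks[j] += 1
        if len(values) == 0:                # estimate x's global rank from its neighbours
            r = 1
        elif i == 0:
            r = max(1, ranks[0] - 1)
        elif i == len(values):
            r = ranks[-1] + 1
        else:
            r = 0.5 * (ranks[i - 1] + ranks[i])
        values.insert(i, x)
        ranks.insert(i, r)
        if len(values) > m:
            # score = distance of the estimated rank fraction from the target p
            scores = [abs(rk / n - p) for rk in ranks]
            k = int(np.argmax(scores))
            values.pop(k)
            ranks.pop(k)
    # report the stored value whose estimated rank fraction is closest to p
    k = int(np.argmin([abs(rk / n - p) for rk in ranks]))
    return values[k]

# toy check against the true sample median of a Cauchy sample
rng = np.random.default_rng(0)
data = rng.standard_cauchy(100_000)
print(streaming_quantile(data, p=0.5, m=50), np.median(data))
```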
Figure 5 shows simulation results for estimation of the median with m = 50. In this example a dataset of
size n = 100,000 points is generated from a standard Cauchy distribution. In addition to the estimates of the
median, the sample median based on the points seen up to any given stage is also computed sequentially for
comparison purposes. The figure indicates that the estimates obtained by the method converge very quickly.
The method has been extended to estimate a number of quantiles, including those in the tail region, simultaneously.
These estimates can in turn be used to estimate the probability density function. Density estimation based on
this procedure and a multivariate extension are under investigation. The multivariate extension will help in
clustering analysis when the data are streaming.

6.2 Multivariate classification
Large multivariate astronomical databases frequently contain mixtures of populations which must be distinguished
from each other. The main point here is that astronomers want to know how many distinct classes
of sources the data contain, on the basis of some statistical criterion. Ultimately, one would like a model
which fits the data in some defined way. Typically, the astronomer does not have a specific mathematical
model in mind. Typical scientific questions posed may be:
• How many statistically distinct classes of objects are in this dataset, and which objects are to be
assigned to which classes, along with association probabilities? Are previously unknown classes of
objects present?
• Are there rare outliers, individual objects with a low probability of belonging to any one of the dominant
classes? Discovery of previously unknown types of objects is possible here.
• Are there interesting correlations among the properties of objects in any given class, and what are
the optimal analytical expressions of such correlations? Some of the correlations may be spurious
(e.g., driven by sample selection effects), or simply uninteresting (e.g., objects brighter in one optical
bandpass will tend to be brighter in another optical bandpass).
Several complications may arise. The object classes form multivariate “clouds” in the parameter space;
these clouds may have power-law or exponential tails in some or all of the dimensions, and some may have sharp
cutoffs, etc. The clouds may be well separated in some of the dimensions, but not in others. How can we objectively
decide which dimensions are irrelevant, and which ones are useful? The topology of the clustering may not be
simple: there may be clusters within clusters, holes in the data distribution, multiply-connected clusters,
etc. (Djorgovski et al. 2002; Nichol et al. 2002).
A classical approach to multivariate classification involves maximizing a likelihood, and the EM algorithm
is a widely used method for this purpose. The use of a mixture model of N Gaussians, where N is determined from
the data, to adaptively smooth and parameterize complex, multi-dimensional astronomical datasets is being
addressed. Such non-parametric density estimators are computationally impractical for today’s enormous
databases if implemented naively. The strategy is to use fast multi-resolutional, K-Dimensional (mrKD) tree codes. Figure 6
shows an example of an mrKD-tree, an optimal indexing scheme that utilizes the emerging technology
of Cached Statistics in computer science to store sufficient statistics for the EM calculation at each node in
the tree. For various counting queries, one does not need to traverse the whole tree but can simply use these
stored statistics to rapidly return the necessary count.

Figure 6: A kd-tree. The data are represented as a tree of nodes; each node has two daughter nodes, formed by
splitting its data in two along the axis with the largest extent. On the left are the nodes at the 3rd
level of the tree (the top level has one node, while the second level has two nodes). The right-hand plot
shows level 5 of the tree. The individual data points in this 2-dimensional space are shown as dots, while the
bounding boxes of the nodes are shown as lines. The Cached Statistics, the mean and covariance, are plotted as
a large dot and an ellipse. (Credit: Robert Nichol).
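To make the cached-statistics idea concrete, the sketch below builds a toy kd-tree in which every node stores the count, mean and covariance of the points beneath it, and answers an axis-aligned box-counting query by pruning: nodes entirely inside the box contribute their cached count without being opened, and nodes entirely outside are skipped. This is an illustrative simplification with invented names, not the mrKD-tree code actually used by the team.

```python
import numpy as np

class Node:
    """kd-tree node caching sufficient statistics of the points below it."""
    def __init__(self, points, leaf_size=32):
        self.lo = points.min(axis=0)            # bounding box, lower corner
        self.hi = points.max(axis=0)            # bounding box, upper corner
        self.n = len(points)                    # cached count
        self.mean = points.mean(axis=0)         # cached mean
        self.cov = (np.cov(points, rowvar=False) if self.n > 1
                    else np.zeros((points.shape[1],) * 2))
        if self.n <= leaf_size:
            self.left = self.right = None
            self.points = points
        else:
            d = int(np.argmax(self.hi - self.lo))   # split along the widest dimension
            order = np.argsort(points[:, d])
            half = self.n // 2
            self.left = Node(points[order[:half]], leaf_size)
            self.right = Node(points[order[half:]], leaf_size)
            self.points = None

def box_count(node, lo, hi):
    """Count points inside the axis-aligned box [lo, hi], pruning with cached stats."""
    if np.any(node.hi < lo) or np.any(node.lo > hi):
        return 0                                 # node entirely outside the box
    if np.all(node.lo >= lo) and np.all(node.hi <= hi):
        return node.n                            # node entirely inside: use the cached count
    if node.left is None:                        # leaf: check individual points
        inside = np.all((node.points >= lo) & (node.points <= hi), axis=1)
        return int(inside.sum())
    return box_count(node.left, lo, hi) + box_count(node.right, lo, hi)

# toy example in a 2-dimensional "parameter space"
rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 2))
tree = Node(X)
print(box_count(tree, np.array([-1.0, -1.0]), np.array([1.0, 1.0])))
```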
Genovese and Wasserman (2000) have developed a statistical theory to study the behavior of complex
mixture models. When the number of components of the mixture is allowed to increase with the sample size,
the model is called a mixture sieve. Standard penalized likelihoods, such as the Bayesian Information
Criterion (BIC) or the Akaike Information Criterion (AIC), may not always be suitable for astronomical data. A
general jackknife-type (leave-one-out) likelihood procedure that reduces bias substantially and works better
than AIC and BIC is under development. For high energy astronomy, most detections reside in the Poisson
regime, where Gaussian mixture models may be less appropriate.
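For reference, a standard (non-tree-accelerated) way to fit a Gaussian mixture and let a penalized likelihood choose the number of components N is sketched below with scikit-learn; the jackknife-type criterion mentioned above would replace the BIC/AIC scores in this loop. The dataset and parameter choices are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# toy two-class "catalog": two clouds in a 2-dimensional parameter space
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.5, size=(2000, 2)),
               rng.normal(loc=[3.0, 1.0], scale=0.8, size=(1000, 2))])

best_model, best_bic = None, np.inf
for n_components in range(1, 8):
    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                          random_state=0).fit(X)
    bic = gmm.bic(X)          # penalized likelihood score; gmm.aic(X) is the AIC analogue
    if bic < best_bic:
        best_model, best_bic = gmm, bic

print("chosen number of components:", best_model.n_components)
# membership (association) probabilities for each source
probs = best_model.predict_proba(X)
```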
Most astronomical applications include two types of noise: the random projection of uninteresting astronomical
objects or detector background on top of the signal under study, and measurement errors, which are often correlated
with the signal intensity. The first type can be treated with a smooth component in the mixture model, while
the second requires the incorporation of the measurement errors into the mixture model itself.

6.3 Search for rare objects


The concept of the exploration of the universe through a systematic study of the observable parameter space
was pioneered by Fritz Zwicky in the 1930’s (Zwicky 1957). Some astronomical objects, such as high-redshift
quasars, are rare and hard to locate. How do we systematically search massive databases and classify rare objects?
Figure 7 illustrates an example of discoveries of high-redshift quasars and type-2 quasars (quasars where
the luminous “central engine” and the region close to it, from which the characteristic broad emission lines
originate, are obscured by a dusty disk or torus, leaving only some indirect or subtle observable signatures
that such a luminous object is present in an otherwise inconspicuous galaxy) in the Digital Palomar Observatory
Sky Survey (DPOSS). Density estimation using EM-AIC scoring in color space helps in outlier detection. A
color is defined as the ratio of fluxes (brightnesses) at two different wavelengths; it does not depend on the distance
to the source. These ideas extend to color spaces of arbitrary dimension (spaces whose axes are object colors),
where the ability to visualize the data without projecting down to a lower dimensional subspace is lost. The
quasars in Figure 7 were selected in a fairly simple manner in this color space. Since the distribution is very
clearly non-Gaussian, the best-fit Gaussian for the core of the stellar locus is evaluated using the quartiles.
While quasars are morphologically indistinguishable from ordinary stars, this color parameter space offers a
good discrimination among these types of objects (Djorgovski et al. 2001). For high energy astronomy, most
detections reside in the Poisson regime, where Gaussian mixture models may be less appropriate. Mixture
models based on other profiles, including Poisson, mixtures of Gaussians with very different variances but the
same means (used to model the point spread functions of telescopes), and galaxy profile functions, need to
be examined.
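A minimal sketch of this kind of selection, under our own simplifying assumptions, is shown below: the “core” of the stellar locus in a two-color space is characterized robustly from the quartiles (median as the center, interquartile range scaled to a Gaussian sigma), and objects lying many robust sigmas from that core are flagged as candidate rare objects. The data, thresholds and column meanings are invented for illustration.

```python
import numpy as np

def robust_gaussian_core(colors):
    """Estimate the center and scale of the stellar locus from quartiles.

    colors: array of shape (n_sources, n_colors).  Returns the per-color
    median and a robust sigma, IQR / 1.349 (the IQR of a Gaussian is
    1.349 sigma), which is insensitive to the non-Gaussian tails.
    """
    center = np.median(colors, axis=0)
    q75, q25 = np.percentile(colors, [75, 25], axis=0)
    sigma = (q75 - q25) / 1.349
    return center, sigma

def flag_outliers(colors, threshold=5.0):
    """Flag sources far from the stellar locus in units of robust sigma."""
    center, sigma = robust_gaussian_core(colors)
    # distance in units of the robust per-axis scatter (axes treated as independent)
    dist = np.sqrt(np.sum(((colors - center) / sigma) ** 2, axis=1))
    return dist > threshold

# toy "catalog": a dense stellar locus plus a handful of color outliers (e.g. quasars)
rng = np.random.default_rng(3)
stars = rng.normal(loc=[0.5, 0.3], scale=0.15, size=(50_000, 2))
quasars = rng.normal(loc=[2.5, 1.8], scale=0.2, size=(20, 2))
catalog = np.vstack([stars, quasars])
candidates = np.flatnonzero(flag_outliers(catalog))
print(len(candidates), "candidate rare objects")
```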
Another approach to outlier detection, involving Bayesian networks and mixture modeling, is under investigation.
Rather than using a single joint probability function, which would require a prohibitively large
number of parameters to fit, a Bayesian network factors the representation into smaller conditional probability
representations for subsets of the variables. The factored model has fewer parameters than would be
necessary to model the full density function directly. Bayes nets are most useful when their structure can be
estimated from the data. Estimation occurs at two levels: the outer loop of the algorithm searches for the best
top-level structure, and for each candidate top-level structure, the model for each variable given its parents
must be estimated.
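As a toy illustration of the factoring idea (not the team's actual method), the sketch below models a three-attribute density p(x1, x2, x3) as p(x1) p(x2 | x1) p(x3 | x1), with a Gaussian marginal and linear-Gaussian conditionals fit by least squares; sources with very low log-density under the factored model are treated as outlier candidates. The names, the assumed network structure and the thresholds are all hypothetical.

```python
import numpy as np
from scipy import stats

def fit_linear_gaussian(parent, child):
    """Fit child = a * parent + b + Gaussian noise; return (a, b, noise sigma)."""
    a, b = np.polyfit(parent, child, 1)
    resid = child - (a * parent + b)
    return a, b, resid.std()

def factored_logpdf(X, params):
    """log p(x1) + log p(x2|x1) + log p(x3|x1) for the assumed network structure."""
    (mu1, s1), (a2, b2, s2), (a3, b3, s3) = params
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return (stats.norm.logpdf(x1, mu1, s1)
            + stats.norm.logpdf(x2, a2 * x1 + b2, s2)
            + stats.norm.logpdf(x3, a3 * x1 + b3, s3))

# toy catalog with three correlated attributes
rng = np.random.default_rng(4)
x1 = rng.normal(0.0, 1.0, 20_000)
x2 = 0.8 * x1 + rng.normal(0.0, 0.3, 20_000)
x3 = -0.5 * x1 + rng.normal(0.0, 0.4, 20_000)
X = np.column_stack([x1, x2, x3])

params = ((x1.mean(), x1.std()),
          fit_linear_gaussian(x1, x2),
          fit_linear_gaussian(x1, x3))
logp = factored_logpdf(X, params)
outliers = np.flatnonzero(logp < np.percentile(logp, 0.1))   # lowest 0.1% of density
print(len(outliers), "low-probability sources under the factored model")
```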

Figure 7: An example of a color parameter space selection of astrophysically interesting types of objects
(high-redshift and type-2 quasars) from DPOSS. The dots are normal stars with r ∼ 19 mag. Solid circles
are some of the high-redshift (z > 4; z is the customary notation for the redshift, which is a nonlinear
measure of distance in cosmology) quasars, and open circles are some of the type-2 quasars found in this
survey.

7 Conclusions
Technical and methodological challenges facing the virtual observatory are common to most data-intensive
sciences today (commerce, industry, security). Interdisciplinary exchanges among disciplines such as
astronomy, physics, biology and the earth sciences are highly desirable to avoid wasteful duplication of effort and
cost. The old research methodologies, geared to deal with data sets many orders of magnitude smaller and
simpler, are no longer adequate. The size of the data sets affords us the opportunity to answer many interesting
cosmological questions, but also presents many interesting statistical and computational challenges. For
example, in searching for clusters of galaxies, we must account for background clutter, measurement error
and the presence of unusual shapes such as filaments and sheets. Effective techniques are needed for dealing
with these problems on a large scale. The key issues are methodological; we have to learn to ask new kinds
of questions, enabled by the massive data sets and technology.
Astronomers’ need for advanced statistical methods is reciprocated by statisticians’ need for interaction
with practicing scientists in many fields. Confrontation with astrostatistical challenges nurtures the development
of statistical methodology that will have potential applications in other areas. This is especially true
for methods developed for analyzing large databases; they may find applications in market research,
low-storage sequential signal processing and multimedia traffic flows. Although relatively few statisticians
have seriously engaged in astrostatistical research or consulting to date, a number of leading statisticians
believe that the ground is unusually fertile for growth. Hundreds of studies of methodological interest are
published annually in astronomy, and some of the most critical astrophysical questions of the 21st century
have a major astrostatistical component. Effective visualization of high-dimensional parameter spaces and
multivariate correlations is needed; one’s favorite graphics package is not enough to handle such massive,
high dimensional data. A hybrid, interactive clustering and visualization approach is needed.
Great opportunities exist for collaborations and partnerships between astronomers, applied computer scientists,
and statisticians. Problems and challenges posed by the new astronomy may enrich and stimulate new
developments in computer science and statistics. The NVO will serve as an engine of discovery for astronomy
in the 21st century.

Acknowledgments: This work is supported in part by National Science Foundation grant DMS-0101360.
The authors are very thankful to Eric D. Feigelson of Penn State, Ashish Mahabal and Robert Brunner of
Caltech, and Robert Nichol and Larry Wasserman of Carnegie Mellon University for providing illustrations
and examples. We acknowledge useful discussions with many colleagues on these issues. S. George Djorgovski
acknowledges partial support from the NASA AISRP program.

References
[1] Babu, G. J. and Feigelson, E. D. (1996). Astrostatistics. Chapman & Hall, London.
[2] Babu, G. J. and Feigelson, E. D. (1997). Statistical Challenges in Modern Astronomy II . Springer-Verlag,
New York.
[3] Banday, A. J., Zaroubi, S., Bartelmann, M. L. (editors) (2001). Mining the Sky. ESO Astrophysics
Symposia, Heidelberg: Springer Verlag.
[4] Bredekamp, J. H. and Golombek, D. A. (2002). NASA’s astrophysics data environment. In Statistical
Challenges in Modern Astronomy III , eds. E. D. Feigelson and G. J. Babu, Springer-Verlag, New York,
103-112.
[5] Brunner, R. J., Djorgovski, S. G., and Szalay, A. S. (editors) 2001, Virtual Observatories of the Future.
ASPCS, Vol. 225.
[6] Djorgovski, S. G., Mahabal, A., Brunner, R., Gal, R. R., Castro, S., de Carvalho, R. R., & Odewahn,
S. C. (2001). Searches for rare and new types of objects. In Virtual Observatories of the Future, eds. R.
Brunner, S. G. Djorgovski & A. Szalay, A.S.P. Conf. Ser., 225 52-63.
[7] Djorgovski, S. G., Brunner, R., Mahabal, A., Williams, R., Granat, R., and Stolorz, P. (2002). Chal-
lenges for cluster analysis in virtual observatory. In Statistical Challenges in Modern Astronomy III ,
eds. E. D. Feigelson and G. J. Babu, Springer-Verlag, New York, 127-138.
[8] Feigelson, E. D. and Babu, G. J. (1992). Statistical Challenges in Modern Astronomy. Springer-Verlag,
New York.
[9] Feigelson, E. D. and Babu, G. J. (2003). Statistical Challenges in Astronomy. Springer-Verlag, New
York.
[10] Górski, K. M., et al. (editors) (2003). Toward an international virtual observatory. ESO Astrophysics
Symposia, Heidelberg: Springer Verlag.
[11] Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. Ann.
Statist., 28, 1105-1127.
[12] Hald, A. (1990). A History of Probability and Statistics and Their Applications before 1750. John Wiley
& Sons, New York.
[13] Hornschemeier, A. E., Brandt, W. N., Garmire, G. P., Schneider, D. P., Broos, P. S., Townsley, L. K.,
Burrows, D. N., Feigelson, E. D., Nousek, J. A., Bautz, M. W., Griffiths, R., Lumb, D., and Sargent,
W. L. W. (2000). X-Ray Sources in the Hubble Deep Field Detected by Chandra. The Astrophysical
Journal, 541 Issue 1, pp. 49-53.
[14] Jaschek, C., and Murtagh, F. (eds.) (1990). Errors, Bias and Uncertainties in Astronomy. Cambridge
University Press, Cambridge.

13
[15] Liechty, J. C., Lin, D. K. J., and McDermott, J. P. (2002). Single-pass low-storage arbitrary quantile
estimation for massive datasets. Statistics and Computing, to appear.
[16] Martinez, V. J. and Saar, E. (2001). Statistics of the Galaxy Distribution. Chapman & Hall/CRC,
New York.

[17] Nichol, R. C., Chong, S., Connolly, A. J., Davies, S., Genovese, C., Hopkins, A. M., Miller, C. J., Moore,
A. W., Pelleg, D., Richards, G. T., Schneider, J., Szapudi, I., and Wasserman, L. (2002). Analysing
large data sets in cosmology. In Statistical Challenges in Modern Astronomy III , eds. E. D. Feigelson
and G. J. Babu, Springer-Verlag, New York, 265-276.
[18] Rolfe, E. J. (ed.) (1983). Statistical Methods in Astronomy. ESA SP-201, European Space Agency
Scientific & Technical Publications, Noordwijk, Netherlands.
[19] Strauss, M. A. (2002). Statistical and astronomical challenges in the Sloan Digital Sky Survey. In
Statistical Challenges in Modern Astronomy III , eds. E. D. Feigelson and G. J. Babu, Springer-Verlag,
New York, 113-123.
[20] Szalay, A. S. and Matsubara, T. (2002). Analysing large data sets in cosmology. In Statistical Challenges
in Modern Astronomy III , eds. E. D. Feigelson and G. J. Babu, Springer-Verlag, New York, 161-174.
[21] Taylor, J., and McKee, C. (2000). Astronomy and Astrophysics in the New Millennium. NRC: NAS Press.
[22] Zwicky, F. (1957). Morphological astronomy. Springer Verlag, Berlin.
