ISM 6404 CH 7


Section 2: Definition of Big Data

 Why is Big Data important? What has changed to put it in the center of the analytics world?

o As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. The exponential growth, availability, and use of information, both structured and unstructured, brings Big Data to the center of the analytics world. Pushing the boundaries of data analytics uncovers new insights and opportunities for the use of Big Data.

 How do you define Big Data? Why is it difficult to define?

o Big Data means different things to people with different backgrounds and interests, which is one reason it is hard to define.

 Traditionally, the term "Big Data" has been used to describe the massive volumes of data analyzed by huge organizations such as Google or research science projects at NASA.

 Big Data includes both structured and unstructured data, and it comes from everywhere: data sources include Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, and detailed call records, to name just a few.

 Big Data is not just about volume, but also variety, velocity, veracity, and value proposition.

 Out of the Vs that are used to define Big Data, in your opinion, which one is the most

important? Why?
o Although all of the Vs are important, the value proposition is probably the most important for decision makers: "big" data contains (or has a greater potential to contain) more patterns and interesting anomalies than "small" data.

o By analyzing large and feature-rich data, organizations can gain greater business value that they may not have otherwise. Users can detect the patterns in small data sets using simple statistical and machine-learning methods or ad hoc query and reporting tools, but Big Data means "big" analytics.

o Big analytics means greater insight and better decisions, something that every organization needs nowadays.

 What do you think the future of Big Data will be like? Will it lose its popularity to something

else? If so, what will it be?

o Big Data could evolve at a rapid pace.

o The buzzword "Big Data" might change to something else, but the trend toward

increased computing capabilities, analytics methodologies, and data management of

high volume heterogeneous information will continue.

Section 3: Fundamentals of Big Data Analytics

 What is Big Data analytics?

o Big Data analytics is analytics applied to Big Data architectures.

• This is a new paradigm; in order to keep up with the computational needs of

Big Data, a number of new and innovative analytics computational techniques

and platforms have been developed.

 How does big data analytics differ from regular analytics?


o in order to keep up with the computational needs of Big Data, a number of new and

innovative analytics computational techniques and platforms have been developed.

o These techniques are collectively called high-performance computing, and include in-

memory analytics, in-database analytics, grid computing, and appliances.

o They differ from regular analytics which tend to focus on relational database

technologies.

 What are the critical success factors for Big Data analytics?

o Critical factors include:

• a clear business need,

• strong and committed sponsorship,

• alignment between the business and IT strategies,

• a fact-based decision culture,

• a strong data infrastructure,

• the right analytics tools,

• and personnel with advanced analytic skills.

 What are the big challenges that one should be mindful of when considering implementation

of Big Data analytics?

o Traditional ways of capturing, storing, and analyzing data are not sufficient for Big

Data.

o Major challenges are:

• the vast amount of data volume,

• the need for data integration to combine data of different structures in a cost-effective manner,

• the need to process data quickly,

• data governance issues,

• skill availability,

• and solution costs.

 What are the common business problems addressed by Big Data analytics?

o Process efficiency and cost reduction

o Brand management

o Revenue maximization, cross-selling, and up-selling

o Enhanced customer experience

o Churn identification, customer recruiting

o Improved customer service

o Identifying new products and market opportunities

o Risk management

o Regulatory compliance

o Enhanced security capabilities

Section 4: Big Data Technologies

 What are the common characteristics of emerging Big Data technologies?

o They take advantage of commodity hardware to enable scale-out, parallel processing

techniques; employ nonrelational data storage capabilities in order to process

unstructured and semistructured data; and apply advanced analytics and data

visualization technology to Big Data to convey insights to end users.

 What is MapReduce? What does it do? How does it do it?


o MapReduce is a programming model that allows the processing of large-scale data

analysis problems to be distributed and parallelized.

o The MapReduce technique, popularized by Google, distributes the processing of very

large multi-structured data files across a large cluster of machines.

o High performance is achieved by breaking the processing into small units of work

that can be run in parallel across the hundreds, potentially thousands, of nodes in the

cluster.

o The map function in MapReduce breaks a problem into sub-problems, which can

each be processed by single nodes in parallel. The reduce function merges (sorts,

organizes, aggregates) the results from each of these nodes into the final result.
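
To make the map/reduce split concrete, below is a minimal, single-machine word-count sketch in Python. The sample documents and plain-dictionary aggregation are illustrative assumptions; a real MapReduce framework would distribute these two phases across the nodes of a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: break the problem into sub-problems by emitting (word, 1) pairs."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Reduce: merge (sort, organize, aggregate) the mapped results."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    # Toy "input splits" standing in for blocks of a large file spread across nodes.
    documents = ["big data means big analytics",
                 "big analytics means better decisions"]
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(mapped))   # e.g., {'big': 3, 'data': 1, ...}
```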

 What is Hadoop? How does it work?

o Hadoop is an open source framework for processing, storing, and analyzing massive

amounts of distributed, unstructured data. It is designed to handle petabytes and

exabytes of data distributed over multiple nodes in parallel, typically commodity

machines connected via the Internet.

o It utilizes the MapReduce framework to implement distributed parallelism. The file

organization is implemented in the Hadoop Distributed File System (HDFS), which is

adept at storing large volumes of unstructured and semistructured data. This is an

alternative to the traditional tables/rows/columns structure of a relational database.

Data is replicated across multiple nodes, allowing for fault tolerance in the system.

 What are the main Hadoop components? What functions do they perform?
o Major components of Hadoop are the HDFS, a Job Tracker operating on the master

node, Name Nodes, Secondary Nodes, and Slave Nodes.

 The HDFS is the default storage layer in any given Hadoop cluster.

 A Name Node is a node in a Hadoop cluster that provides the client

information on where in the cluster particular data is stored and if any nodes

fail.

 Secondary nodes are backup name nodes.

 The Job Tracker is the node of a Hadoop cluster that initiates and coordinates

MapReduce jobs or the processing of the data.

 Slave nodes store data and take direction to process it from the Job Tracker.

o Querying for data in the distributed system is accomplished via MapReduce.

o The client query is handled in a Map job, which is submitted to the Job Tracker.

o The Job Tracker refers to the Name Node to determine which data it needs to access

to complete the job and where in the cluster that data is located, then submits the

query to the relevant nodes which operate in parallel.

o A Name Node acts as facilitator, communicating back to the client information such

as which nodes are available, where in the cluster certain data resides, and which

nodes have failed. When each node completes its task, it stores its result.

o The client submits a Reduce job to the Job Tracker, which then collects and

aggregates the results from each of the nodes.
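
As a rough illustration of this flow, the sketch below mimics the coordination on a single machine with Python's multiprocessing module. The NAME_NODE dictionary, the slave_node_map function, and the job_tracker_reduce function are made-up stand-ins for the Hadoop components named above, not actual Hadoop APIs.

```python
from multiprocessing import Pool

# Toy "Name Node" metadata: which block of the file lives on which slave node.
NAME_NODE = {"node1": "big data means big analytics",
             "node2": "hadoop distributes the processing",
             "node3": "slave nodes process data in parallel"}

def slave_node_map(block):
    """Work done on one slave node: count words in its local data block."""
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def job_tracker_reduce(partials):
    """Job-tracker-side reduce: aggregate the partial results from each node."""
    total = {}
    for partial in partials:
        for word, count in partial.items():
            total[word] = total.get(word, 0) + count
    return total

if __name__ == "__main__":
    blocks = list(NAME_NODE.values())          # locations looked up via the "Name Node"
    with Pool(processes=len(blocks)) as pool:  # each process plays a slave node
        partial_results = pool.map(slave_node_map, blocks)
    print(job_tracker_reduce(partial_results))
```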

 What is NoSQL? How does it fit into the Big Data analytics picture?

o NoSQL, also known as "Not Only SQL," is a new style of database for processing

large volumes of multi-structured data.


o Whereas Hadoop is adept at supporting large-scale, batch-style historical analysis,

NoSQL databases are mostly aimed at serving up discrete data stored among large

volumes of multi-structured data to end-user and automated Big Data applications.

o NoSQL databases trade ACID (atomicity, consistency, isolation, durability)

compliance for performance and scalability.
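
A minimal sketch of the "serving up discrete data" use case, assuming a locally running MongoDB server and the third-party pymongo driver (neither is specified in these notes); the database, collection, and field names are hypothetical.

```python
from pymongo import MongoClient  # third-party driver: pip install pymongo

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017/")
events = client["webshop"]["click_events"]

# Documents are schemaless: each click event can carry different fields.
events.insert_one({"user_id": 42, "page": "/catalog", "device": "mobile"})
events.insert_one({"user_id": 42, "page": "/checkout", "coupon": "SPRING10"})

# An index supports the low-latency, discrete lookups NoSQL stores are built for.
events.create_index("user_id")
print(events.find_one({"user_id": 42, "page": "/checkout"}))
```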

Section 5: Big Data and Data Warehousing

 What are the challenges facing data warehousing and Big Data? Are we witnessing the end

of the data warehousing era? Why or why not?

o What has changed the landscape in recent years is the variety and complexity of data,

which made data warehouses incapable of keeping up.

o It is not the volume of the structured data but the variety and the velocity that forced

the world of IT to develop a new paradigm, which we now call "Big Data."

 But this does not mean the end of data warehousing.

 Data warehousing and RDBMS still bring many strengths that make them

relevant for BI and that Big Data techniques do not currently provide.

 What are the use cases for Big Data and Hadoop?

o In terms of its use cases, Hadoop is differentiated two ways: first, as the repository

and refinery of raw data, and second, as an active archive of historical data.

o Hadoop, with its distributed file system and flexibility of data formats (allowing

both structured and unstructured data), is advantageous when working with


information commonly found on the Web, including social media, multimedia, and

text.

o Also, because it can handle such huge volumes of data (and because storage costs are

minimized due to the distributed nature of the file system), historical (archive) data

can be managed easily with this approach.

 What are the use cases for data warehousing and RDBMS?

o Three main use cases for data warehousing are performance, integration, and the

availability of a wide variety of BI tools.

o The relational data warehouse approach is quite mature, and database vendors are

constantly adding new index types, partitioning, statistics, and optimizer features.

 This enables complex queries to be done quickly, a must for any BI

application.

o Data warehousing, and the ETL process, provide a robust mechanism for collecting,

cleaning, and integrating data. And, it is increasingly easy for end users to create

reports, graphs, and visualizations of the data.

 In what scenarios can Hadoop and RDBMS coexist?

o There are several possible scenarios under which using a combination of Hadoop and

relational DBMS-based data warehousing technologies makes sense.

o For example, you can use Hadoop for storing and archiving multi-structured data,

with a connector to a relational DBMS that extracts required data from Hadoop for

analysis by the relational DBMS.


o Hadoop can also be used to filter and transform multi-structured data for transporting to a data warehouse, and to analyze multi-structured data for publishing into the data warehouse environment.

o Combining SQL and MapReduce query functions enables data scientists to analyze

both structured and unstructured data.

o Also, front end query tools are available for both platforms.
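
A toy sketch of the first scenario, using Python's standard sqlite3 module as the relational side. The file name part-r-00000.tsv and the sales_summary table are hypothetical; in practice a dedicated connector (for example, Apache Sqoop) would move data between Hadoop and the relational DBMS.

```python
import csv
import sqlite3

# Hypothetical output of a Hadoop job: tab-separated (category, total_sales) rows.
HADOOP_OUTPUT = "part-r-00000.tsv"

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_summary (category TEXT, total REAL)")

with open(HADOOP_OUTPUT, newline="") as f:
    rows = [(cat, float(total)) for cat, total in csv.reader(f, delimiter="\t")]

conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)
conn.commit()

# The relational side now serves fast, SQL-based BI queries over the refined data.
for row in conn.execute("SELECT category, total FROM sales_summary ORDER BY total DESC"):
    print(row)
conn.close()
```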

Section 6: Big Data Vendors and Platforms

 What is special about the Big Data vendor landscape? Who are the big players?

o The Big Data vendor landscape is developing very rapidly. It is in a special period of

evolution where entrepreneurial startup firms bring innovative solutions to the

marketplace. Cloudera is a market leader in the Hadoop space. MapR and

Hortonworks are two other Hadoop startups. DataStax is an example of a NoSQL

vendor. Informatica, Pervasive Software, Syncsort, and MicroStrategy are also

players. Most of the growth in the industry is with Hadoop and NoSQL distributors

and analytics providers.

o There is still very little in terms of Big Data application vendors.

o Meanwhile, the next-generation data warehouse market has experienced significant

consolidation. Four leading vendors in this space—Netezza, Greenplum, Vertica, and

Aster Data—were acquired by IBM, EMC, HP, and Teradata, respectively. Mega-

vendors Oracle and IBM also play in the Big Data space, connecting and

consolidating their products with Hadoop and NoSQL engines.

 How do you think the Big Data vendor landscape will change in the near future? Why?
o As the field matures, more and more traditional data vendors will incorporate Big

Data into their architectures. We already saw something similar with the

incorporation of XML data types and XPath processing engines in relational database

engines.

o Also, the Big Data market will be increasingly cloud-based, and hosting services will include Big Data storage options, along with the traditional MySQL and SQL Server options.

o Vendors providing Big Data applications and services, for example in the finance

domain or for scientific purposes, will begin to proliferate. (Different students will

have different answers.)

 What is the role of visual analytics in the world of Big Data?

o Visual analytics helps organizations uncover trends, relationships, and anomalies by visually sifting through very large quantities of data.

o Many vendors are developing visual analytics offerings, which have traditionally been applied to structured data warehouse environments (relational and multidimensional), for the Big Data space.

o To be successful, a visual analytics application must allow for the coexistence and

integration of relational and multistructured data.

Section 7: Big Data and Stream Analytics

 What is a stream (in the Big Data world)?

o A stream can be thought of as an unbounded flow or sequence of data elements,

arriving continuously at high velocity. Streams often cannot be efficiently or


effectively stored for subsequent processing; thus Big Data concerns about

Velocity (one of the six Vs) are especially prevalent when dealing with streams.

o Examples of data streams include sensor data, computer network traffic, phone

conversations, ATM transactions, web searches, and financial data.

 What are the motivations for stream analytics?

o In situations where data streams in rapidly and continuously, traditional analytics approaches that work with previously accumulated data (i.e., data at rest) often either arrive at the wrong decisions because of using too much out-of-context data, or arrive at the correct decisions but too late to be of any use to the organization.

o It is no longer feasible to "store everything" and analyze it later.

o Therefore it is critical for a number of business situations to analyze the data soon after it is created and/or as soon as it is streamed into the analytics system.

 What is stream analytics? How does it differ from regular analytics?

o Stream analytics is the process of extracting actionable information from

continuously flowing/streaming data. It is also sometimes called "data in-motion

analytics" or "real-time data analytics."


o It differs from regular analytics in that it deals with high velocity (and transient)

data streams instead of more permanent data stores like databases, files, or web

pages.

 What is critical event processing? How does it relate to stream analytics?

o Critical event processing is a method of capturing, tracking, and analyzing

streams of data to detect events (out of normal happenings) of certain types that

are worthy of the effort. It involves combining data from multiple sources to infer

events or patterns of interest. An event may also be defined generically as a

"change of state," which may be detected as a measurement exceeding a

predefined threshold of time, temperature, or some other value. This applies to

stream analytics because the events are happening in real time.
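
A bare-bones sketch of the "change of state" idea in Python: the sensor readings and the 80-degree threshold below are invented for illustration; the point is that an event is emitted the moment a measurement crosses the predefined limit, as the stream arrives.

```python
THRESHOLD = 80.0  # predefined temperature limit (hypothetical)

def detect_events(readings, threshold=THRESHOLD):
    """Yield an event each time the stream crosses the threshold (a change of state)."""
    above = False
    for timestamp, value in readings:
        if value > threshold and not above:
            above = True
            yield ("THRESHOLD_EXCEEDED", timestamp, value)
        elif value <= threshold and above:
            above = False
            yield ("BACK_TO_NORMAL", timestamp, value)

sensor_stream = [(1, 72.5), (2, 79.9), (3, 83.1), (4, 85.0), (5, 78.2)]
for event in detect_events(sensor_stream):
    print(event)   # events are emitted as the readings come in, not after batch storage
```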

 Define data stream mining. What additional challenges are posed by data stream mining?

o Data stream mining is the process of extracting novel patterns and knowledge

structures from continuous, rapid data records.

o Processing data streams, as opposed to more permanent data storages, is a

challenge. Traditional data mining techniques can process data recursively and

repetitively because the data is permanent. By contrast, a data stream is a

continuous flow of ordered sequence of instances that can only be read once and

must be processed immediately as they come in.
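
The single-pass constraint can be illustrated with Welford's online algorithm, which keeps a running mean and variance without storing past instances. The transaction amounts and the 3-standard-deviation rule are illustrative choices, not something prescribed by the chapter.

```python
import math

class OnlineStats:
    """Welford's one-pass algorithm: each instance is read once and then discarded."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

stats = OnlineStats()
for amount in [20.0, 22.5, 19.8, 21.1, 250.0, 20.4]:   # e.g., ATM withdrawals streaming in
    if stats.n > 3 and stats.std() > 0 and abs(amount - stats.mean) > 3 * stats.std():
        print(f"possible anomaly: {amount}")
    stats.update(amount)
```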

Section 8: Applications of Stream Analytics

 What are the most fruitful industries for stream analytics?

o Many industries can benefit from stream analytics.


o Some prominent examples include e-commerce, telecommunications, law

enforcement, cyber security, the power industry, health sciences, and the government.

 How can stream analytics be used in e-commerce?

o Companies such as Amazon and eBay use stream analytics to analyze customer

behavior in real time. Every page visit, every product looked at, every search

conducted, and every click made is recorded and analyzed to maximize the value

gained from a user's visit.

o Behind the scenes, advanced analytics are crunching the real-time data coming from

our clicks, and the clicks of thousands of others, to "understand" what it is that we are

interested in (in some cases, even we do not know that) and make the most of that

information by creative offerings.

 In addition to what is listed in this section, can you think of other industries and/or

application areas where stream analytics can be used?

o Stream analytics could be of great benefit to any industry that faces an influx of

relevant real-time data and needs to make quick decisions.

o One example is the news industry. By rapidly sifting through data streaming in, a

news organization can recognize "newsworthy" themes (i.e., critical events).

o Another benefit would be for weather tracking in order to better predict tornados or

other natural disasters. (Different students will have different answers.)

 Compared to regular analytics, do you think stream analytics will have more (or less) use

cases in the era of Big Data analytics? Why?

o Stream analytics can be thought of as a subset of analytics in general, just like

"regular" analytics. The question is, what does "regular" mean?


o Regular analytics may refer to traditional data warehousing approaches, which does

constrain the types of data sources and hence the use cases.

o Or, "regular" may mean analytics on any type of permanent stored architecture (as

opposed to transient streams). In this case, you have more use cases for "regular"

(including Big Data) than in the previous definition.

o In either case, there will probably be plenty of times when "regular" use cases will

continue to play a role, even in the era of Big Data analytics. (Different students will

have different answers.)

OPENING VIGNETTE: Analyzing Customer Churn in a Telecom Company Using Big Data Methods

Extra Notes

 ________ bring together hardware and software in a physical unit that is not only fast but

also scalable on an as-needed basis.

o Appliances

 ________ of data provides business value; pulling of data from multiple subject areas and

numerous applications into one repository is the raison d'être for data warehouses.

o Integration

 ________ speeds time to insights and enables better data governance by performing data

integration and analytic functions inside the database.

o In-database analytics

 A job ________ is a node in a Hadoop cluster that initiates and coordinates MapReduce jobs,

or the processing of the data.

o tracker

 A newly popular unit of data in the Big Data era is the petabyte (PB), which is

o 10^15 bytes.

 All of the following statements about MapReduce are true EXCEPT

o MapReduce runs without fault tolerance.

 Allowing Big Data to be processed in memory and distributed across a dedicated set of nodes

can solve complex problems in near-real time with highly accurate insights. What is this

process called?

o in-memory analytics
 As the size and the complexity of analytical systems increase, the need for more ________

analytical systems is also increasing to obtain the best performance.

o efficient

 As volumes of Big Data arrive from multiple sources such as sensors, machines, social

media, and clickstream interactions, the first step is to ________ all the data reliably and cost

effectively.

o capture

 Big Data comes from ________.

o everywhere

 Big Data employs ________ processing techniques and nonrelational data storage

capabilities in order to process unstructured and semistructured data.

o parallel

 Big Data is being driven by the exponential growth, availability, and use of information.

o True

 Big Data simplifies data governance issues, especially for global firms.

o False

 Big Data uses commodity hardware, which is expensive, specialized hardware that is custom

built for a client or application.

o False

 Companies with the largest revenues from Big Data tend to be

o the largest computer and IT services firms.


 Current total storage capacity lags behind the digital information being generated in the world.

o True

 Data flows can be highly inconsistent, with periodic peaks, making data loads hard to

manage. What is this feature of Big Data called?

o variability

 Define MapReduce.

o As described by Dean and Ghemawat (2004), MapReduce is a programming model

and an associated implementation for processing and generating large data sets.

Programs written in this functional style are automatically parallelized and executed

on a large cluster of commodity machines. This allows programmers without any

experience with parallel and distributed systems to easily utilize the resources of a

large distributed system.

 Describe data stream mining and how it is used.

o Data stream mining, as an enabling technology for stream analytics, is the process of

extracting novel patterns and knowledge structures from continuous, rapid data

records. A data stream is a continuous flow of ordered sequence of instances that in

many applications of data stream mining can be read/processed only once or a small

number of times using limited computing and storage capabilities. Examples of data

streams include sensor data, computer network traffic, phone conversations, ATM

transactions, web searches, and financial data. Data stream mining can be considered

a subfield of data mining, machine learning, and knowledge discovery. In many data

stream mining applications, the goal is to predict the class or value of new instances
in the data stream given some knowledge about the class membership or values of

previous instances in the data stream.

 Despite their potential, many current NoSQL tools lack mature management and monitoring

tools.

o True

 For low latency, interactive reports, a data warehouse is preferable to Hadoop.

o True

 Hadoop and MapReduce require each other to work.

o False

 Hadoop is primarily a(n) ________ file system and lacks capabilities we'd associate with a

DBMS, such as indexing, random access to data, and support for SQL.

o distributed

 Hadoop was designed to handle petabytes and exabytes of data distributed over multiple

nodes in parallel.

o True

 HBase is a nonrelational ________ that allows for low-latency, quick lookups in Hadoop.

o database

 HBase, Cassandra, MongoDB, and Accumulo are examples of ________ databases.

o NoSQL

 How does Hadoop work?

o It breaks up Big Data into multiple parts so each part can be processed and analyzed

at the same time on multiple computers.


 If you have many flexible programming languages running in parallel, Hadoop is preferable

to a data warehouse.

o True

 In a Hadoop "stack," what is a slave node?

o a node where data is stored and processed

 In a Hadoop "stack," what node periodically replicates and stores data from the Name Node

should it fail?

o secondary node

 In a network analysis, what connects nodes?

o edges

 In Application Case 7.6, Analyzing Disease Patterns from an Electronic Medical Records

Data Warehouse, it was found that urban individuals have a higher number of diagnosed

disease conditions.

o True

 In most cases, Hadoop is used to replace data warehouses.

o False

 In open-source databases, the most important performance enhancement to date is the cost-

based ________.

o optimizer

 In the Alternative Data for Market Analysis or Forecasts case study, satellite data was NOT

used for

o monitoring individual customer patterns.


 In the Analyzing Disease Patterns from an Electronic Medical Records Data Warehouse case

study, what was the analytic goal?

o determine differences in rates of disease in urban and rural populations

 In the energy industry, ________ grids are one of the most impactful applications of stream

analytics.

o smart

 In the financial services industry, Big Data can be used to improve

o regulatory oversight and decision making

 In the opening vignette, Access Telecom (AT) built a system to better visualize

customers who were unhappy before they canceled their service.

o True

 In the opening vignette, why was the Telecom company so concerned about the loss of

customers, if customer churn is common in that industry?

o The company was concerned about its loss of customers, because the loss was at such

a high rate. The company was losing customers faster than it was gaining them.

Additionally, the company had identified that the loss of these customers could be

traced back to customer service interactions. Because of this, the company felt that

the loss of customers is something that could be analyzed and hopefully controlled.

 In the Salesforce case study, streaming data is used to identify services that customers use

most.

o False

 In the Twitter case study, how did influential users support their tweets?

o objective data
 In the world of Big Data, ________ aids organizations in processing and analyzing large

volumes of multistructured data. Examples include indexing and search, graph analysis, etc.

o MapReduce

 In-motion ________ is often overlooked today in the world of BI and Big Data.

o analytics

 It is important for Big Data and self-service business intelligence to go hand in hand to get

maximum value from analytics.

o True

 List and briefly discuss the three characteristics that define and make the case for data

warehousing.

o 1)Data warehouse performance:

 More advanced forms of indexing such as materialized views, aggregate join

indexes, cube indexes, and sparse join indexes enable numerous performance

gains in data warehouses.

 The most important performance enhancement to date is the cost-based

optimizer, which examines incoming SQL and considers multiple plans for

executing each query as fast as possible.

o 2)Integrating data that provides business value:

 Integrated data is the unique foundation required to answer essential business

questions.

o 3)Interactive BI tools:
 These tools allow business users to have direct access to data warehouse

insights. Users are able to extract business value from the data and supply

valuable strategic information to the executive staff.

 List and describe four of the most critical success factors for Big Data analytics.

o •A clear business need (alignment with the vision and the strategy).

 Business investments ought to be made for the good of the business, not for

the sake of mere technology advancements. Therefore, the main driver for Big

Data analytics should be the needs of the business at any level—strategic,

tactical, and operations.

o •Strong, committed sponsorship (executive champion).

 It is a well-known fact that if you don't have strong, committed executive

sponsorship, it is difficult (if not impossible) to succeed. If the scope is a

single or a few analytical applications, the sponsorship can be at the

departmental level. However, if the target is enterprise-wide organizational

transformation, which is often the case for Big Data initiatives, sponsorship

needs to be at the highest levels and organization-wide.

o •Alignment between the business and IT strategy.

 It is essential to make sure that the analytics work is always supporting the

business strategy, and not the other way around. Analytics should play the

enabling role in successful execution of the business strategy.

o •A fact-based decision making culture.

 In a fact-based decision-making culture, the numbers rather than intuition, gut

feeling, or supposition drive decision making. There is also a culture of


experimentation to see what works and doesn't. To create a fact-based

decision-making culture, senior management needs to do the following:

recognize that some people can't or won't adjust; be a vocal supporter; stress

that outdated methods must be discontinued; ask to see what analytics went

into decisions; link incentives and compensation to desired behaviors.

o •A strong data infrastructure.

 Data warehouses have provided the data infrastructure for analytics. This

infrastructure is changing and being enhanced in the Big Data era with new

technologies. Success requires marrying the old with the new for a holistic

infrastructure that works synergistically.

 List and describe the three main "V"s that characterize Big Data.

o •Volume:

 This is obviously the most common trait of Big Data. Many factors

contributed to the exponential increase in data volume, such as transaction-

based data stored through the years, text data constantly streaming in from

social media, increasing amounts of sensor data being collected, automatically

generated RFID and GPS data, and so forth.

o •Variety:

 Data today comes in all types of formats—ranging from traditional databases

to hierarchical data stores created by the end users and OLAP systems, to text

documents, e-mail, XML, meter-collected, sensor-captured data, to video,

audio, and stock ticker data. By some estimates, 80 to 85 percent of all

organizations' data is in some sort of unstructured or semistructured format.


o •Velocity:

 This refers to both how fast data is being produced and how fast the data must

be processed (i.e., captured, stored, and analyzed) to meet the need or demand.

 RFID tags, automated sensors, GPS devices, and smart meters are driving an

increasing need to deal with torrents of data in near-real time.

 MapReduce can be easily understood by skilled programmers due to its procedural nature.

o True

 Organizations are working with data that meets the three V's-variety, volume, and ________

characterizations.

o velocity

 ________ refers to the conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the

data.

o Veracity

 Satellite data can be used to evaluate the activity at retail locations as a source of alternative

data.

o True

 Social media mentions can be used to chart and predict flu outbreaks.

o True

 The ________ Node in a Hadoop cluster provides client information on where in the cluster

particular data is stored and if any nodes fail.

o Name

 The ________ of Big Data is its potential to contain more useful patterns and interesting

anomalies than "small" data.


o value proposition

 The problem of forecasting economic activity or microclimates based on a variety of data

beyond the usual retail data is a very recent phenomenon and has led to another buzzword —

________.

o alternative data

 The quality and objectivity of information disseminated by influential users of Twitter is

higher than that disseminated by noninfluential users.

o True

 The term "Big Data" is relative as it depends on the size of the using organization.

o True

 There is a clear difference between the type of information support provided by influential

users versus the others on Twitter.

o True

 Traditional data warehouses have not been able to keep up with

o the variety and complexity of data.

 Under which of the following requirements would it be more appropriate to use Hadoop over

a data warehouse?

o unrestricted, ungoverned sandbox explorations

 Using data to understand customers/clients and business operations to sustain and foster

growth and profitability is

o an increasingly challenging task for today's enterprises.

 What are the differences between stream analytics and perpetual analytics? When would you

use one or the other?


o •In many cases they are used synonymously. However, in the context of intelligent

systems, there is a difference.

 Streaming analytics involves applying transaction-level logic to real-time

observations. The rules applied to these observations take into account

previous observations as long as they occurred in the prescribed window;

these windows have some arbitrary size (e.g., last 5 seconds, last 10,000

observations, etc.).

 Perpetual analytics, on the other hand, evaluates every incoming observation

against all prior observations, where there is no window size. Recognizing

how the new observation relates to all prior observations enables the

discovery of real-time insight.

o •When transactional volumes are high and the time-to-decision is too short, favoring

nonpersistence and small window sizes, this translates into using streaming analytics.

o However, when the mission is critical and transaction volumes can be managed in

real time, then perpetual analytics is a better answer.
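
The window-versus-all-history distinction can be sketched as follows; the three-observation window and the running average are arbitrary illustrative choices, not part of the chapter's definitions.

```python
from collections import deque

class StreamingAverage:
    """Streaming analytics: only observations inside a fixed window are considered."""
    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)  # old observations fall out automatically

    def observe(self, x):
        self.window.append(x)
        return sum(self.window) / len(self.window)

class PerpetualAverage:
    """Perpetual analytics: every new observation is evaluated against all prior ones."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def observe(self, x):
        self.count += 1
        self.total += x
        return self.total / self.count

streaming, perpetual = StreamingAverage(window_size=3), PerpetualAverage()
for value in [10, 12, 11, 50, 49, 51]:
    print(value, round(streaming.observe(value), 2), round(perpetual.observe(value), 2))
```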

 What is Big Data's relationship to the cloud?

o Amazon and Google have working Hadoop cloud offerings.

 What is NoSQL as used for Big Data? Describe its major downsides.

o •NoSQL is a new style of database that has emerged to, like Hadoop, process large

volumes of multi-structured data. However, whereas Hadoop is adept at supporting

large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most

part (though there are some important exceptions), at serving up discrete data stored
among large volumes of multi-structured data to end-user and automated Big Data

applications.

 This capability is sorely lacking from relational database technology, which

simply can't maintain needed application performance levels at Big Data

scale.

o •The downside of most NoSQL databases today is that they trade ACID (atomicity,

consistency, isolation, durability) compliance for performance and scalability.

o Many also lack mature management and monitoring tools.

 What is the Hadoop Distributed File System (HDFS) designed to handle?

o unstructured and semistructured non-relational data

 When considering Big Data projects and architecture, list and describe five challenges

designers should be mindful of in order to make the journey to analytics competency less

stressful.

o •Data volume:

 The ability to capture, store, and process the huge volume of data at an

acceptable speed so that the latest information is available to decision makers

when they need it.

o •Data integration:

 The ability to combine data that is not similar in structure or source and to do so quickly and at reasonable cost.

o •Processing capabilities:
 The ability to process the data quickly, as it is captured. The traditional way of

collecting and then processing the data may not work. In many situations data

needs to be analyzed as soon as it is captured to leverage the most value.

o •Data governance:

 The ability to keep up with the security, privacy, ownership, and quality issues

of Big Data. As the volume, variety (format and source), and velocity of data

change, so should the capabilities of governance practices.

o •Skills availability:

 Big Data is being harnessed with new tools and is being looked at in different

ways. There is a shortage of data scientists with the skills to do the job.

o •Solution cost:

 Since Big Data has opened up a world of possible business improvements,

there is a great deal of experimentation and discovery taking place to

determine the patterns that matter and the insights that turn to value. To ensure

a positive ROI on a Big Data project, therefore, it is crucial to reduce the cost

of the solutions used to find that value.

 Which Big Data approach promotes efficiency, lower cost, and better performance by

processing jobs in a shared, centrally managed pool of IT resources?

o grid computing

 Which of the following sources is likely to produce Big Data the fastest?

o RFID tags

 Why are some portions of tape backup workloads being redirected to Hadoop clusters today?
o First, while it may appear inexpensive to store data on tape, the true cost comes with

the difficulty of retrieval.

 Not only is the data stored offline, requiring hours if not days to restore, but

tape cartridges themselves are also prone to degradation over time, making

data loss a reality and forcing companies to factor in those costs. To make

matters worse, tape formats change every couple of years, requiring

organizations to either perform massive data migrations to the newest tape

format or risk the inability to restore data from obsolete tapes.

o •Second, it has been shown that there is value in keeping historical data online and

accessible.

 As in the clickstream example, keeping raw data on a spinning disk for a

longer duration makes it easy for companies to revisit data when the context

changes and new constraints need to be applied. Searching thousands of disks

with Hadoop is dramatically faster and easier than spinning through hundreds

of magnetic tapes. Additionally, as disk densities continue to double every 18

months, it becomes economically feasible for organizations to hold many

years' worth of raw or refined data in HDFS.


 What is CERN, and why is it important to the world of science?

o CERN is the European Organization for Nuclear Research. It plays a leading role in fundamental studies of physics. It has been instrumental in many key global innovations and breakthrough discoveries in theoretical physics, and today it operates the world's largest particle physics laboratory, home to the Large Hadron Collider (LHC). The CERN laboratory sits astride the Franco-Swiss border near Geneva, Switzerland.

 What is a data scientist? What makes them so much in demand?

o Data scientists use a combination of their business and technical skills to investigate Big Data, looking for ways to improve current business analytics practices (from descriptive to predictive and prescriptive) and hence to improve decisions for new business opportunities. One of the biggest differences between a data scientist and a business intelligence user, such as a business analyst, is that a data scientist investigates and looks for new possibilities, while a BI user analyzes existing business situations and operations. Data scientist is an emerging profession, and there is no consensus on where data scientists come from or what educational background a data scientist has to have. But there is a common understanding of what skills and qualities they are expected to possess, which involve a combination of soft and hard skills.

 What are the common characteristics of data scientists? Which one is the most important?

o One of the most sought-after characteristics of a data scientist is expertise in both technical and business application domains. Data scientists are expected to have soft skills such as creativity, curiosity, communication/interpersonal skills, domain expertise, problem definition skills, and managerial skills, as well as sound technical skills such as data manipulation, programming/hacking/scripting, and knowledge of Internet and social media/networking technologies. Data scientists are supposed to be creative and curious, and should be excellent communicators, with the ability to tell compelling stories about their data.

 Where do data scientists come from? What educational backgrounds do they have?

o Data scientist is an emerging profession, and there is no consensus on where data scientists come from or what educational background a data scientist has to have. A Master of Science (or Ph.D.) in Computer Science, MIS, or Industrial Engineering, or a postgraduate analytics degree, are common examples. But many data scientists have advanced degrees in other disciplines, like the physical or social sciences, or more specialized fields like ecology or systems biology.

 What do you think is the path to becoming a great data scientist?

o Becoming a great data scientist requires you to delve deeply into developing quantitative and technical skills, as well as interpersonal and communication skills. In addition, you will need to gain significant domain knowledge (e.g., in business). This effort will most likely require an advanced degree. It also requires a continuous thirst for knowledge and an intense curiosity; you will always be learning in this profession. In addition to meticulous analytical skills, it also requires creativity and imagination. (Students will vary in their answers to this question.)

What has changed the landscape in recent years is the variety and complexity of data, which
made data warehouses incapable of keeping up. It is not the volume of the structured data but the
variety and the velocity that forced the world of IT to develop a new paradigm, which we now
call "Big Data." But this does not mean the end of data warehousing. Data warehousing and
RDBMS still bring many strengths that make them relevant for BI and that Big Data techniques
do not currently provide. What are the challenges facing data warehousing and Big Data?
Are we witnessing the end of the data warehousing era? Why or why not?
In terms of its use cases, Hadoop is differentiated two ways: first, as the repository and refinery
of raw data, and second, as an active archive of historical data. Hadoop, with their distributed file
system and flexibility of data formats (allowing both structured and unstructured data), is
advantageous when working with information commonly found on the Web, including social
media, multimedia, and text. Also, because it can handle such huge volumes of data (and because
storage costs are minimized due to the distributed nature of the file system), historical (archive)
data can be managed easily with this approach. What are the use cases for Big Data and
Hadoop?
Three main use cases for data warehousing are performance, integration, and the availability of a
wide variety of BI tools. The relational data warehouse approach is quite mature, and database
vendors are constantly adding new index types, partitioning, statistics, and optimizer features.
This enables complex queries to be done quickly, a must for any BI application. Data
warehousing, and the ETL process, provide a robust mechanism for collecting, cleaning, and
integrating data. And, it is increasingly easy for end users to create reports, graphs, and
visualizations of the data. What are the use cases for data warehousing and RDBMS?
There are several possible scenarios under which using a combination of Hadoop and relational
DBMS-based data warehousing technologies makes sense. For example, you can use Hadoop for
storing and archiving multi-structured data, with a connector to a relational DBMS that extracts
required data from Hadoop for analysis by the relational DBMS. Hadoop can also be used to
filter and transform multi-structural data for transporting to a data warehouse, and can also be
used to analyze multi-structural data for publishing into the data warehouse environment.
Combining SQL and MapReduce query functions enables data scientists to analyze both
structured and unstructured data. Also, front end query tools are available for both platforms.
In what scenarios can Hadoop and RDBMS coexist?
The Big Data vendor landscape is developing very rapidly. It is in a special period of evolution
where entrepreneurial startup firms bring innovative solutions to the marketplace. Cloudera is a
market leader in the Hadoop space. MapR and Hortonworks are two other Hadoop startups.
DataStax is an example of a NoSQL vendor. Informatica, Pervasive Software, Syncsort, and
MicroStrategy are also players. Most of the growth in the industry is with Hadoop and NoSQL
distributors and analytics providers. There is still very little in terms of Big Data application
vendors. Meanwhile, the next-generation data warehouse market has experienced significant
consolidation. Four leading vendors in this space—Netezza, Greenplum, Vertica, and Aster Data
—were acquired by IBM, EMC, HP, and Teradata, respectively. Mega-vendors Oracle and IBM
also play in the Big Data space, connecting and consolidating their products with Hadoop and
NoSQL engines. What is special about the Big Data vendor landscape? Who are the big
players?
As the field matures, more and more traditional data vendors will incorporate Big Data into their
architectures. We already saw something similar with the incorporation of XML data types and
XPath processing engines in relational database engines. Also, the Big Data market will be
increasingly cloud-based, and hosting services will include Big Data data storage options, along
with the traditional MySql and SqlServer options. Vendors providing Big Data applications and
services, for example in the finance domain or for scientific purposes, will begin to proliferate.
(Different students will have different answers.) How do you think the Big Data vendor
landscape will change in the near future? Why?
Visual analytics help organizations uncover trends, relationships, and anomalies by visually
shifting through very large quantities of data. Many vendors are developing visual analytics
offerings, which have traditionally applied to structured data warehouse environments (relational
and multidimensional), for the Big Data space. To be successful, a visual analytics application
must allow for the coexistence and integration of relational and multistructured data. What
is the role of visual analytics in the world of Big Data?
A stream can be thought of as an unbounded flow or sequence of data elements, arriving
continuously at high velocity. Streams often cannot be efficiently or effectively stored for
subsequent processing; thus Big Data concerns about Velocity (one of the six Vs) are especially
prevalent when dealing with streams. Examples of data streams include sensor data, computer
network traffic, phone conversations, ATM transactions, web searches, and financial data.
What is a stream (in the Big Data world)?
In situations where data streams in rapidly and continuously, traditional analytics approaches that
work with previously accumulated data (i.e., data at arrest) often either arrive at the wrong
decisions because of using too much out-of-context data, or they arrive at the correct decisions
but too late to be of any use to the organization. Therefore it is critical for a number of business
situations to analyze the data soon after it is created and/or as soon as it is streamed into the
analytics system. It is no longer feasible to "store everything." Otherwise, analytics will either
arrive at the wrong decisions because of using too much out-of-context data, or at the correct
decisions but too late to be of any use to the organization. Therefore it is critical for a number of
business situations to analyze the data as soon as it is streamed into the analytics system. What
are the motivations for stream analytics?
Stream analytics is the process of extracting actionable information from continuously
flowing/streaming data. It is also sometimes called "data in-motion analytics" or "real-time data
analytics." It differs from regular analytics in that it deals with high velocity (and transient) data
streams instead of more permanent data stores like databases, files, or web pages. What is stream
analytics? How does it differ from regular analytics?
Critical event processing is a method of capturing, tracking, and analyzing streams of data to
detect events (out of normal happenings) of certain types that are worthy of the effort. It involves
combining data from multiple sources to infer events or patterns of interest. An event may also
be defined generically as a "change of state," which may be detected as a measurement
exceeding a predefined threshold of time, temperature, or some other value. This applies to
stream analytics because the events are happening in real time. What is critical event
processing? How does it relate to stream analytics?
Data stream mining is the process of extracting novel patterns and knowledge structures from
continuous, rapid data records. Processing data streams, as opposed to more permanent data
stores, is a challenge. Traditional data mining techniques can process data recursively and
repetitively because the data is permanent. By contrast, a data stream is a continuous, ordered
sequence of instances that can be read only once and must be processed immediately as the
instances arrive. Define data stream mining. What additional challenges are posed by data stream
mining?
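Because each instance can be read only once, stream-mining algorithms work in a single pass with bounded memory. As an illustrative sketch (a generic technique, not one named in the text), reservoir sampling keeps a uniform random sample of a stream without ever storing the whole stream:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream that is read only once."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)      # replace an existing item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical usage: sample 5 values from a stream of 1,000 sensor readings
print(reservoir_sample(range(1000), k=5))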
Many industries can benefit from stream analytics. Some prominent examples include e-
commerce, telecommunications, law enforcement, cyber security, the power industry, health
sciences, and the government. What are the most fruitful industries for stream analytics?
Companies such as Amazon and eBay use stream analytics to analyze customer behavior in real
time. Every page visit, every product looked at, every search conducted, and every click made is
recorded and analyzed to maximize the value gained from a user's visit. Behind the scenes,
advanced analytics are crunching the real-time data coming from our clicks, and the clicks of
thousands of others, to "understand" what it is that we are interested in (in some cases, even
before we know it ourselves) and make the most of that information with creative offerings. How can
stream analytics be used in e-commerce?
Stream analytics could be of great benefit to any industry that faces an influx of relevant real-
time data and needs to make quick decisions. One example is the news industry. By rapidly
sifting through data streaming in, a news organization can recognize "newsworthy" themes (i.e.,
critical events). Another benefit would be for weather tracking in order to better predict tornados
or other natural disasters. (Different students will have different answers.) In addition to what is
listed in this section, can you think of other industries and/or application areas where stream
analytics can be used?
Stream analytics can be thought of as a subset of analytics in general, just like "regular"
analytics. The question is, what does "regular" mean? Regular analytics may refer to traditional
data warehousing approaches, which constrain the types of data sources and hence the use
cases. Or, "regular" may mean analytics on any type of permanently stored data architecture (as
opposed to transient streams). In this case, you have more use cases for "regular" (including Big
Data) than in the previous definition. In either case, there will probably be plenty of times when
"regular" use cases will continue to play a role, even in the era of Big Data analytics. (Different
students will have different answers.) Compared to regular analytics, do you think stream
analytics will have more (or less) use cases in the era of Big Data analytics? Why?
For Luxottica, Big Data includes everything they can find about their customer interactions (in
the form of transactions, click streams, product reviews, and social media postings). They see
this as constituting a massive source of business intelligence for potential product, marketing,
and sales opportunities. What does "big data" mean to Luxottica?
Because Luxottica outsourced both data storage and promotional campaign development and
management, there was a disconnect between data analytics and marketing execution. Their
competitive posture and strategic growth initiatives were compromised for lack of an
individualized view of their customers and an inability to act decisively and consistently on the
different types of information generated by each retail channel. In the Luxottica case study, the
technique the company uses to gain visibility into its customers is data integration. What were
their main challenges?
Luxottica deployed the Customer Intelligence Appliance (CIA) from IBM Business Partner
Aginity LLC. This product is built on IBM PureData System for Analytics. The solution helps
Luxottica segment customer behavior at a fine level of detail and provides a platform and smart
database for marketing execution systems, such as campaign management, e-mail services, and other
forms of direct marketing. Anticipated benefits include a 10% improvement in marketing effectiveness,
identifying the highest-valued customers, and the ability to target customers based on preference
and history. The solution did not involve continuing to outsource data storage and promotional
campaign development and management, nor did it involve merging with companies in Asia. What were the
proposed solution and the obtained results?
Big Data can potentially handle the high volume, high variability, continuously streaming data
that trading banks need to deal with. Traditional relational databases are often unable to keep up
with the data demand. How can Big Data benefit large-scale trading banks?
The Bank's legacy system was built on relational database technology. As data volumes and
variability increased, the legacy system was not fast enough to respond to growing business
needs and requirements. It was unable to deliver real-time alerts to manage market and
counterparty credit positions in the desired timeframe. Big data offered the scalability to address
this problem. The benefits included a new alert feature, less downtime for maintenance, much
faster capacity to process complex changes, and reduced operations costs.
In the investment bank case study, the major benefit brought about by replacing multiple
databases with the new trade operational store was real-time access to trading data.
The investment bank greatly benefitted after moving from many old disparate systems to a
unified new system.
This case illustrates an excellent example in the banking industry, where disparate data sources
are integrated into a Big Data infrastructure to achieve a single source of the truth. What were the
challenges, the proposed solution, and the obtained results?
eBay is the world's largest online marketplace, and its success requires the ability to turn the
enormous volumes of data it generates into useful insights for customers. Big Data is essential
for this effort. Why did eBay need a Big Data solution?
eBay was experiencing explosive data growth and needed a solution that did not have the typical
bottlenecks, scalability issues, and transactional constraints associated with common relational
database approaches. eBay also needed a solution to perform rapid analysis on a broad
assortment of the structured and unstructured data it captured. The solution did NOT involve
integrating everything into a single Big Data Center infrastructure.
Now that the solution is in place, eBay can more cost effectively process massive amounts of
data at very high speeds. The new architecture serves a wide variety of new use cases, and its
reliability and fault tolerance has been greatly enhanced. The load balancing helped the company
meet its Big Data needs with the extremely fast data handling and application availability
requirement, enabling the buying and selling of practically anything in an online marketplace.
What were the challenge, the proposed solution, and the obtained results?
Big Data and analytics have a lot to offer to modern-day politics. This case study makes it clear
that even though elections are unpredictable, politics and elections are suitable arenas for Big
Data. The main characteristics of Big Data, namely volume, variety, and velocity (the three Vs),
readily apply to the kind of data that is used for political campaigns. Big Data Analytics can help
predict election outcomes as well as target potential voters and donors, and it has become a
critical part of political campaigns.
Know the inputs to the analytic system:
a) market research
b) social media
c) census data
d) election databases
Know the analytic system outputs or goals:
a) voter mobilization
b) movement organization
c) increased number of volunteers
d) increased monetary contributions
What is the role of analytics and Big Data in modern-day politics?
It may well have changed the outcome of the 2008 and 2012 elections. Many agree that the
Democrats clearly had the competitive advantage in utilizing Big Data and Analytics over the
Republicans in the 2008 and 2012 presidential elections. For example, in 2012, Barack Obama
had a data advantage and the depth and breadth of his campaign's digital operation, from political
and demographic data mining to voter sentiment and behavioral analysis, reached beyond
anything politics had ever seen. So there was a significant difference in expertise and comfort
level with modern technology between the two parties. The usage and expertise gap between the
party lines may disappear over time, and this will even the playing field. But new data regarding
voter sympathies may itself change the thinking and ideological directions of the major parties.
This in itself could shift election outcomes. Do you think Big Data analytics could change the
outcome of an election?
For the Dublin case, Big Data Analytics were used primarily to ease traffic problems and better
understand the traffic network. Municipalities could use these technologies for many other
governmental tasks. For example, providing social services is a very complex and difficult
process that can benefit from Big Data technologies. Other areas include tax collection,
sanitation services, environmental management, crime prevention, and management of police
and fire departments. Is there a strong case to make for large cities to use Big Data Analytics
and related information technologies? Identify and discuss examples of what can be done with
analytics beyond what is portrayed in this application case.
They can help get a better sense of the "traffic health" by identifying traffic congestion in its
early stages. By integrating geospatial data from buses and data on bus timetables into a central
geographic information system, you can create a digital map of the city. Then, using the
dashboard screen, operators can drill down to see the number of buses that are on time or delayed
on each route. With big data analytics, users can produce detailed reports on areas of the network
where buses are frequently delayed, and take prompt action to ease congestion. Data and
analytics can also assist with future planning of roads, infrastructure, and public transportation in
order to further ease traffic problems. How can big data analytics help ease the traffic problem in
large cities?
The major problem was the difficulty in getting a good picture of traffic in the city from a high-
level perspective. The proposed solution was to team up with IBM Research, and especially their
Smarter Cities Technologies Centre. Using IBM InfoSphere Streams and mapping software,
IBM researchers created a digital map of the city, overlaid with the real-time positions of
Dublin's 1,000 buses. This gave operators the ability to see the system as a whole instead of just
individual corridors. The new system gave insight to the operators and managers. They could
now start answering questions such as: "Are the bus lane start times correct?" and "Where do we
need to add additional bus lanes and bus-only traffic signals?" For the future, the Dublin City
Council and IBM plan to enhance the system by incorporating meteorological data, under-road
sensors, and bicycle-usage data into their predictive analytics. What were the challenges
Dublin City was facing; what were the proposed solution, initial results, and future plans?
Consumer analytics helps a company's customers make better purchasing and usage decisions.
OG&E is using it to help conserve on energy usage and ultimately to delay their need to build
new fossil fuel generation plants. This helps OG&E (by reducing costs), its customers (by
reducing their energy costs), and the environment as a whole (by reducing pollution from fossil
fuels). OG&E is NOT using it to help increase energy usage to regain company profits and
ultimately to accelerate their need to build new fossil fuel generation plants. Why perform
consumer analytics?
Dynamic segmentation refers to real-time or near-real-time customer segmentation analytics that
enhance the company's understanding of individuals' responses to price signals and identify the
best customers to be targeted with specific marketing campaigns. What is meant by dynamic
segmentation?
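As a hedged sketch of what near-real-time segmentation could look like in code, the example below uses scikit-learn's MiniBatchKMeans to update customer segments incrementally as each batch of usage data arrives; the two features (peak-hour usage and a price-response score) and the three segments are assumptions for illustration, not OG&E's actual model:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical features per customer: [average peak-hour kWh, price-response score]
model = MiniBatchKMeans(n_clusters=3, random_state=0)

def update_segments(batch):
    """Refine the segments with the newest batch and return each customer's segment label."""
    model.partial_fit(batch)       # update cluster centers without refitting all history
    return model.predict(batch)

# Simulated stream of batches (e.g., one batch per smart-meter reporting interval)
for _ in range(3):
    batch = np.random.rand(100, 2)
    labels = update_segments(batch)
    print(np.bincount(labels, minlength=3))  # number of customers per segment in this batch

Each partial_fit call nudges the cluster centers, so the segmentation stays current as new usage and price-response behavior streams in.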
Using geospatial mapping and visual analytics, OG&E views a near-real-time version of data
about its energy-efficient prospects spread over geographic areas and comes up with marketing
initiatives that are most suitable for these customers. Geospatial mapping gives OG&E an easy
way to narrow down to the specific customers in a geographic region based on their meter usage.
In addition, OG&E can find noncommunicating smart meters, track outages, and overlay weather
data on its service areas. It is NOT using geospatial mapping to track down specific
customers in a geographic region based on lapsed payments on their utility bills and assign
bill collectors to the zip codes with the highest delinquency rates. How does geospatial mapping help
OG&E?
OG&E has started working on consumer-oriented efficiency programs to shift the residential
customer's usage out of peak demand cycles. OG&E is targeting customers with its smart hours
plan. This plan encourages customers to choose a variety of rate options sent via phone, text, or
e-mail. These rate options offer attractive summer rates for all other hours apart from the peak
hours of 2 p.m. to 7 p.m. OG&E is making an investment in customers by supplying a
communicating thermostat that will respond to the price signals sent by OG&E and help
customers in managing their utility consumption. OG&E also educates its customers on their
usage habits by providing 5-minute interval data every 15 minutes to the demand-responsive
customers. OG&E has NOT started working on corporate-oriented inefficiency programs to shift
the corporate energy usage into more peak demand cycles. What types of incentives might the
consumers respond to in changing their energy use?
Traditional analytics produce visual maps that are geographically mapped and based on the
traditional location data, usually grouped by the postal codes. The use of postal codes to
represent the data is a somewhat static approach for achieving a higher level view of things.
How does traditional analytics make use of location-based data?
They help the user in understanding "true location-based" impacts, and allow data to be viewed at
a finer granularity than that offered by traditional postal code aggregations. The addition of
location components based on latitudinal and longitudinal attributes to the traditional analytical
techniques enables organizations to add a new dimension of "where" to their traditional business
analyses, which currently answer questions of "who," "what," "when," and "how much." By
integrating information about the location with other critical business data, organizations are now
creating location intelligence (LI). How can geocoded locations assist in better decision
making?
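For example (an illustrative sketch with made-up coordinates), latitude/longitude attributes let an analysis answer a "where" question directly, such as which customers fall within a given radius of a store, using the haversine great-circle distance:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius

# Hypothetical store and customer locations as (latitude, longitude)
store = (40.7580, -73.9855)
customers = {"c1": (40.7484, -73.9857), "c2": (40.6413, -73.7781)}

# The "where" dimension: which customers are within 5 km of the store?
nearby = [cid for cid, loc in customers.items() if haversine_km(*store, *loc) <= 5.0]
print(nearby)  # ['c1']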
Geospatial analysis gives organizations a broader perspective and aids in decision making.
Location intelligence (LI) is enabling organizations to gain critical insights and make better
decisions by optimizing important processes and applications. By incorporating demographic
details into locations, retailers can determine how sales vary by population level and proximity to
other competitors; they can assess the demand and efficiency of supply chain operations.
Consumer product companies can identify the specific needs of the customers and customer
complaint locations, and easily trace them back to the products. Sales reps can better target their
prospects by analyzing their geography. What is the value provided by geospatial analytics?
CabSense provides an interactive map based on current user location obtained from the mobile
phone's GPS locator to find the best street corners for finding an open cab. It also provides a
radar view that automatically points in the direction of the best street corner. The
application also allows users to plan in advance, set up date and time of travel, and view the best
corners for finding a taxi. Furthermore, CabSense distinguishes New York's Yellow Cab services
from the for-hire vehicles and readily prompts the users with relevant details of private service
providers that can be used in case no Yellow Cabs are available. CabSense does NOT give the
exact location of cabs in real time for New Yorkers and visitors wanting to hire a cab. What
are the various options that CabSense provides to users?
Another app that is mentioned in the text is one deployed in Pittsburgh, Pennsylvania, and
developed in collaboration with Carnegie Mellon University. This app, called ParkPGH, includes
predictive capabilities to estimate parking availability. It calculates the number of spaces
available in downtown Pittsburgh parking lots and garages. (Student answers will vary.)
Explore more transportation applications that may employ location-based analytics.
Profiles are often built based on users' behavior. The behavior is tracked by the system based on
user clicks, mobile locations, purchase histories, email connections, social network actions, etc.
In this way, companies can construct a profile of the user's habits, interests, spending patterns,
and favorite locations. Briefly describe how the data are used to create profiles of users.
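A minimal sketch of how such a profile might be assembled from raw behavioral events follows; the event format, categories, and signal weights are hypothetical:

from collections import Counter, defaultdict

# Hypothetical behavioral events: (user_id, event_type, category)
events = [
    ("u1", "click", "running shoes"),
    ("u1", "purchase", "running shoes"),
    ("u1", "click", "headphones"),
    ("u2", "click", "coffee makers"),
]

# Weight stronger signals (purchases) more heavily than passive clicks
WEIGHTS = {"click": 1, "purchase": 5, "checkin": 2}

profiles = defaultdict(Counter)  # user_id -> weighted interest counts by category
for user_id, event_type, category in events:
    profiles[user_id][category] += WEIGHTS.get(event_type, 1)

# The top categories per user form a simple interest profile
for user_id, counts in profiles.items():
    print(user_id, counts.most_common(2))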
Two basic approaches that are employed in the development of recommendation systems are
collaborative filtering and content filtering. List the types of approaches used in
recommendation engines.
In collaborative filtering, the recommendation system is built based on the individual user's past
behavior by keeping track of the previous history of all purchased items. Collaborative filtering
involves aggregating the user-item profiles to generate user ratings matrices, and can take a user-
based approach in which the users take the main role. Collaborative filtering systems require
huge amounts of existing data on user-item preferences to make appropriate recommendations,
and so are subject to the "cold start" problem.
By contrast, content filtering approaches first consider specifications and characteristics of items.
Then content-based individual user profiles are built to store the information about the
characteristics of specific items that the user has rated in the past. Content-based filtering
involves using information tags or keywords in fetching detailed information about item
characteristics and restricts this process to a single user, unlike collaborative filtering, which
looks for similarities between various user profiles.
In short, collaborative filtering makes recommendations based on what other consumers like you
chose, whereas content filtering makes recommendations based on similar items you chose in the
past. How do the two approaches differ?
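To make the collaborative idea concrete, here is a minimal sketch of user-based collaborative filtering on a tiny, made-up ratings matrix: find the most similar user by cosine similarity and recommend the items that neighbor rated highly which the target user has not yet rated. Production recommenders add rating normalization, implicit feedback, and matrix factorization at scale:

import numpy as np

# Rows = users, columns = items; 0 means "not yet rated" (tiny hypothetical matrix)
ratings = np.array([
    [5, 4, 0, 0],   # user 0
    [4, 5, 2, 5],   # user 1 (tastes similar to user 0)
    [1, 0, 5, 4],   # user 2
], dtype=float)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def recommend_for(user, k=1):
    """Recommend items the most similar user rated highly that `user` has not rated yet."""
    others = [o for o in range(len(ratings)) if o != user]
    neighbor = max(others, key=lambda o: cosine(ratings[user], ratings[o]))
    unrated = np.where(ratings[user] == 0)[0]
    ranked = sorted(unrated, key=lambda item: -ratings[neighbor, item])
    return [int(i) for i in ranked[:k]]

print(recommend_for(0))  # -> [3]: the most similar user (user 1) rated item 3 highest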
Amazon, Facebook, and LinkedIn all use collaborative filtering. Pandora uses content-based
filtering. Can you identify specific sites that may use one or the other type of
recommendation system?
Web 2.0 is the popular term for describing advanced Web technologies and applications,
including blogs, wikis, RSS, mashups, user-generated content, and social networks. A major
objective of Web 2.0 is to enhance creativity, information sharing, and collaboration. Define
Web 2.0.
The following are representative characteristics of the Web 2.0 environment:
• The ability to tap into the collective intelligence of users. The more users who contribute, the
more popular and valuable a Web 2.0 site becomes.
• Data is made available in new or never-intended ways. Web 2.0 data can be remixed or
"mashed up," often through Web service interfaces, much the way a dance-club DJ mixes music.
• Web 2.0 relies on user-generated and user-controlled content and data.
• Lightweight programming techniques and tools let nearly anyone act as a Web site developer.
• The virtual elimination of software-upgrade cycles makes everything a perpetual beta or work-
in-progress and allows rapid prototyping, using the Web as an application development platform.
• Users can access applications entirely through a browser.
• An architecture of participation and digital democracy encourages users to add value to the
application as they use it.
• A major emphasis on social networks and computing.
• Strong support of information sharing and collaboration. List the major characteristics of Web
2.0.
New business models stress collaboration with customers, partners, and suppliers, as well as
among internal users. Businesses are also much more data-driven, utilizing metrics on the behaviors
and sentiments of the participants. Agility (perpetual beta) and global reach are other Web 2.0
contributors to the evolving business models. What new business model has emerged from
Web 2.0?
A social network is a place where people create their own space, or homepage, on which they
write blogs (Web logs); post pictures, videos, or music; share ideas; and link to other Web
locations they find interesting. In addition, members of social networks can tag the content they
create and post it with keywords they choose themselves, which makes the content searchable.
Define social network.
Facebook, LinkedIn, Orkut, Google+. List some major social network sites.
Cloud computing offers the possibility of using software, hardware, platform, and infrastructure,
all on a service-subscription basis. Cloud computing enables a more scalable investment on the
part of a user. Like PaaS, SaaS, and IaaS, cloud computing offers organizations the latest technologies
without significant upfront investment.
In some ways, cloud computing is a new name for many previous related trends: utility
computing, application service providers, grid computing, on-demand computing, software as a
service (SaaS), and even older centralized computing with dumb terminals. But the term cloud
computing originates from a reference to the Internet as a "cloud" and represents an evolution of
all previous shared/centralized computing trends. Define cloud computing. How does it relate
to PaaS, SaaS, and IaaS?
Companies offering such services include 1010data, LogiXML, and Lucid Era. These companies
offer extract, transform, and load (ETL) capabilities as well as advanced data analysis tools.
Other companies, such as Elastra and Rightscale, offer dashboard and data management tools
that follow the SaaS and DaaS models. Give examples of companies offering cloud
services.
Cloud-computing-based BI services offer organizations the latest technologies without
significant upfront investment. How does cloud computing affect business intelligence?
The three service models are data-as-a-service (DaaS), information-as-a-service (IaaS), and
analytics-as-a-service (AaaS). What are the three service models that provide the foundation to
service-oriented DSS?
In the DaaS model, the actual platform on which the data resides doesn't matter. Data can reside
in a local computer or in a server at a server farm inside a cloud-computing environment. With
DaaS, any business process can access data wherever it resides. Customers can move quickly
thanks to the simplicity of the data access and the fact that they don't need extensive knowledge
of the underlying data. How does DaaS change the way data is handled?
MaaS stands for models-as-services. This subset of IaaS provides a collection of industry-
specific business processes, reports, dashboards, and other service models for key industries
(e.g., banking, insurance, and financial markets) to accelerate enterprise business initiatives for
business process optimization and multi-channel transformation. What is MaaS? What does it
offer to businesses?
AaaS in the cloud has economies of scale and scope by providing many virtual analytical
applications with better scalability and higher cost savings. The capabilities that a service
orientation (along with cloud computing, pooled resources, and parallel processing) brings to the
analytic world enable cost-effective data/text mining, large-scale optimization, highly complex
multi-criteria decision problems, and distributed simulation models. Why is AaaS cost
effective?
As we learned from Chapter 6, MapReduce is a programming model that allows the processing
of large-scale data analysis problems to be distributed and parallelized. High performance is
achieved by breaking the processing into small units of work that can be run in parallel across
the hundreds, potentially thousands, of nodes in the cluster. This fits very well with the AaaS
model, based on a service orientation of analytics functionality in a cloud platform involving
highly distributed computing. Why is MapReduce mentioned in the context of AaaS?
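As a small illustration of the map/reduce pattern itself (parallelized here across local processes rather than a Hadoop cluster), the classic word-count example splits the input into chunks, maps each chunk to partial counts in parallel, and then reduces the partial counts into one result:

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk):
    """Map: turn one chunk of text into partial (word, count) pairs."""
    return Counter(chunk.lower().split())

def reduce_phase(a, b):
    """Reduce: merge partial counts into a combined result."""
    a.update(b)
    return a

if __name__ == "__main__":
    # Hypothetical input already split into chunks, as a cluster would split a large file
    chunks = ["big data means big analytics",
              "stream analytics handles data in motion",
              "big data big value"]
    with Pool() as pool:
        partials = pool.map(map_phase, chunks)          # map step runs in parallel
    totals = reduce(reduce_phase, partials, Counter())  # reduce step merges the results
    print(totals.most_common(3))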
Analytics can change the manner in which many decisions are made and can consequently
change managers' jobs. They can help managers gain more knowledge, experience, and
expertise, and consequently enhance the quality and speed of their decision making. In
particular, information gathering for decision making is completed much more quickly when
analytics are in use. This affects both strategic planning and control decisions, changing the
decision-making process and even decision-making styles. List the impacts of analytics on
decision making.
• Less expertise (experience) is required for making many decisions.
• Faster decision making is possible because of the availability of information and the automation of some phases in the decision-making process.
• Less reliance on experts and analysts is required to provide support to top executives.
• Power is being redistributed among managers. (The more information and analysis capability they possess, the more power they have.)
• Support for complex decisions allows decisions to be made faster and with better quality.
• Information needed for high-level decision making is expedited or even self-generated.
• Automation of routine decisions or phases in the decision-making process (e.g., for frontline decision making and using ADS) may eliminate some managers, especially middle-level managers.
• Routine and mundane work can be done using an analytic system, freeing up managers and knowledge workers to do more challenging tasks. List the impacts of analytics on other managerial tasks.
One change in organizational structure is the possibility of creating an analytics department, a BI
department, or a knowledge management department in which analytics play a major role. This
special unit can be combined with or replace a quantitative analysis unit, or it can be a
completely new entity. Describe new organizational units that are created because of
analytics.
When a company introduces a data warehouse and BI, the information flows and related business
processes (e.g., order fulfillment) are likely to change because information flows change. For
example, before IBM introduced e-procurement, it restructured all related business processes,
including decision making, searching inventories, reordering, and shipping. How can
analytics affect restructuring of business processes?
Automated decision support (ADS) applications will probably have the following impacts:
• reduction of middle management
• empowerment of customers and business partners
• improved customer service
• increased productivity of help desks and call centers Describe the impacts of ADS
systems.
Analytics can have either positive or negative impacts on job satisfaction. Although many jobs may be substantially enriched by
analytics, other jobs may become more routine and less satisfying. But a study by Davenport and
Harris found that employees using ADS systems, especially those who were empowered, were
more satisfied with their jobs. How can analytics affect job satisfaction?
• What is the value of an expert opinion in court when the expertise is encoded in a computer?
• Who is liable for wrong advice (or information) provided by an intelligent application?
• What happens if a manager enters an incorrect judgment value into an analytic application and
the result is damage or a disaster?
• Who owns the knowledge in a knowledge base?
• Can management force experts to contribute their expertise? List some legal issues of
analytics.
In general, privacy is the right to be left alone and the right to be free from unreasonable personal
intrusions. The Internet, in combination with large-scale databases, has created an entirely new
dimension of accessing and using data. The inherent power in systems that can access vast
amounts of data can be used for the good of society. For example, by matching records with the
aid of a computer, it is possible to eliminate or reduce fraud, crime, government mismanagement,
tax evasion, welfare cheating, family-support filching, employment of illegal aliens, and so on.
The same is true on the corporate level. Private information about employees may aid in better
decision making, but the employees' privacy may be affected. Similar issues are related to
information about customers. Describe privacy concerns in analytics.
The Internet offers a number of opportunities to collect private information about individuals.
Here are some of the ways it can be done:
• By reading an individual's newsgroup postings
• By looking up an individual's name and identity in an Internet directory
• By reading an individual's e-mail
• By wiretapping wireline and wireless communication lines and listening to employees
• By conducting surveillance on employees
• By asking an individual to complete Web site registration
• By recording an individual's actions as he or she navigates the Web with a browser, using
cookies or spyware
The implications for online privacy are significant. The ability of law enforcement agencies to
authorize installation of pen registers and trap-and-trace devices has increased. The U.S.
PATRIOT Act also broadens the government's ability to access student information and personal
financial information without any suspicion of wrongdoing by attesting that the information
likely to be found is pertinent to an ongoing criminal investigation. Explain privacy concerns on
the Web.
Representative ethical issues that could be of interest in MSS implementations include the
following:
• Electronic surveillance
• Ethics in DSS design
• Software piracy
• Invasion of individuals' privacy
• Use of proprietary databases
• Use of intellectual property such as knowledge and expertise
• Exposure of employees to unsafe environments related to computers
• Computer accessibility for workers with disabilities
• Accuracy of data, information, and knowledge
• Protection of the rights of users
• Accessibility to information
• Use of corporate computers for non-work-related purposes
• How much decision making to delegate to computers List ethical issues in analytics.
1. Data Infrastructure Providers
2. Data Warehouse Industry
3. Middleware Industry
4. Data Aggregators/Distributors
5. Analytics-Focused Software Developers
6. Analytics Industry Analysts and Influencers
7. Academic Providers and Certification Agencies
8. Analytics User Organizations
9. Application Developers: Industry Specific or General Identify the nine clusters in the
analytics ecosystem.
Most involve developers to some degree. The data infrastructure providers, data warehouse
industry, middleware industry, and data aggregators all develop technologies that enable
analytics applications. Then there are application developers, especially those focused on
analytics applications. Although the user organizations, academic providers, and influencers will
include many non-developers, even these clusters will also have some developers in them.
Which clusters represent technology developers?
The most prominent one is analytics user organizations. But others will include analysts and
influencers as well as academic providers and certification agencies. Indeed, all stakeholders are
users of some sort. Which clusters represent technology users?
One prominent example is people who move into the analysts and influencers cluster. These will
have been either users or developers or perhaps from other clusters as well. A similar migration
can bring people into the Academic Provider category, where experts in the field from some of
these other clusters become prominent members of this cluster and thereby bring knowledge and
expertise to a wider audience. Give examples of an analytics professional moving from one
cluster to another.
Great Clips depends on a growth strategy that is driven by rapidly opening new stores in the right
locations and markets. Great Clips is NOT using dynamic segmentation. They use geospatial
analysis to help analyze the locations based on the requirements for a potential customer base,
demographic trends, and sales impact on existing franchises in the target location. They use their
Alteryx-based solution to evaluate each new location based on demographics and consumer
behavior data, aligning with existing Great Clips customer profiles and the potential revenue
impact of the new site on the existing sites. How is geospatial analytics employed at Great
Clips?
Major criteria include potential customer base, demographic trends, and sales impact on existing
franchises in the target location. It is NOT examining the types of haircuts most popular in
different geographic locations. What criteria should a company consider in evaluating sites
for future locations?
Geospatial data can be used to help customers find the right location (for example, the closest
Great Clips location). It is certainly relevant for other companies in a variety of industries.
Analyzing customer profiles and applying these to geographic information can assist many
retail firms. Another possibility is utilizing geospatial analysis to find locations for
manufacturing facilities; in this case you would be looking for supplier and raw materials'
locations more than customer locations. In the consumer market, geospatial analysis can help
users in a variety of applications; for example finding the best locations for restaurants or stores
catering to the customer's desires. (Student answers will vary.) Can you think of other
applications where such geospatial data might be useful?
Location-based behavioral targeting can help to narrow the characteristics of users who are most
likely to utilize a retailer's services or products. This sort of analytics would typically target the
tech-savvy and busy consumers of the company in question. Quizno's was NOT targeting
consumers who cut coupons from the local newspaper and redeemed them at Subway. It used
location-based analytics. How can location-based analytics help retailers in targeting
customers?
If a user on a smart phone enters data, the location sensors of the phone can help find others in
that location who are facing similar circumstances, as well as local companies providing services
and products that the consumer desires. The user can thus see what others in their location are
choosing, and the opportunities for meeting his or her needs. Conversely, the user's behaviors
and choices can then contribute information to other consumers in the same location. How
can location-based analytics help individual consumers?
Via smartphones, users can enter data such as gender, age, weight, height, and the location where
they live. From this, an app can create a behavior profile compared with health data from
the CDC. Predictive analytics can calculate life expectancy of the user, who can begin
discovering health opportunities. Also, most smartphones are equipped with accelerometers and
gyroscopes to measure jerk and orientation and to sense motion. Muscle motions may be used to
predict the progression of disorders such as Parkinson's disease, as well as tracking exercise
activities.
In the case study A Life Coach in Your Pocket, it was stated that Kaggle (kaggle.com) is a
platform that hosts competitions and research for predictive modeling and analytics and recently
hosted a competition aimed at identifying muscle motions that may be used to predict the
progression of Parkinson's disease. The objective of the competition is to best identify markers
that can lead to predicting the progression of the disease. This particular application of advanced
technology and analytics is an example of how these two can come together to generate
extremely useful and relevant information.
The app did NOT create the behavior profile and compare it with census data from the Bureau of
Labor. How can smartphone data be used to predict medical conditions?
ParkPGH does more than just report current parking-space availability. It is also capable of
predicting future parking availability. Depending on historical demand and current events, the
app is able to provide information on which lots will have free space by the time the driver gets
to the destination. The app's underlying algorithm uses data on current events around the area—
for example, a basketball game—to predict an increase in demand for parking spaces later that
day, thus saving commuters valuable time searching for parking spaces in the busy city.
ParkPGH did NOT find the best corners for hailing a taxi cab based on the person's location, the
day of the week, and the time of day. How is ParkPGH different from a "parking space-
reporting" app?
Geospatial analytics gives organizations a broader perspective and aids in decision making.
Geospatial data helps companies with managing operations, targeting customers, and deciding on
promotions. It also helps consumers directly, making use of integrated sensor technologies and
global positioning systems installed in their smartphones. Using geospatial data, companies can
identify the specific needs of the customers and customer complaint locations, and easily trace
them back to the products. Another example is in the telecommunications industry, where
geospatial analysis can enable communication companies to capture daily transactions from a
network to identify the geographic areas experiencing a large number of failed connection
attempts of voice, data, text, or Internet. What are the potential benefits of using geospatial
data in analytics? Give examples.
One prominent application is in the emerging area of reality mining, which uses location-enabled
devices for finding nearby services, locating friends and family, navigating, tracking of assets
and pets, dispatching, and engaging in sports, games, and hobbies. Adding shopping cart
knowledge will enhance the application's ability to provide targeted information to a customer;
for example, the app could find prices for similar products in nearby stores. What type of
new applications can emerge from knowing locations of users in real time? What if you also
knew what they have in their shopping cart, for example?
Consumer-oriented analytics based on location information fit into two major categories: (a)
GPS navigation and data analysis, and (b) historic and current location demand analysis.
Consumers benefit from analytics-based applications in many areas, including fun and health, as
well as enhanced personal productivity. How can consumers benefit from using analytics,
especially based on location information?
Students' answers will differ. Privacy threats relate to user-profiling, intrusive use of personal
information, and not being able to control what is being collected. "Location-tracking-based
profiling (reality mining) is powerful but also poses privacy threats." Comment.
In some ways, cloud computing is a new name for many previous related trends: utility
computing, application service providers, grid computing, on-demand computing, software as a
service (SaaS), and even older centralized computing with dumb terminals. But the term cloud
computing originates from a reference to the Internet as a "cloud" and represents an evolution of
all previous shared/centralized computing trends.
Cloud computing offers the possibility of using software, hardware, platform, and infrastructure,
all on a service-subscription basis. Cloud computing enables a more scalable investment on the
part of a user. Like PaaS, SaaS, and IaaS, cloud computing offers organizations the latest technologies
without significant upfront investment. Is cloud computing "just an old wine in a new
bottle"? How is it similar to other initiatives? How is it different?
Mobile social networking enables social networking where members converse and connect with
one another using cell phones or other mobile devices. Discuss the relationship between
mobile devices and social networking.
Students' answers will differ. A common criticism of traditional data-processing systems is their
negative effects on people's individuality. Such systems are criticized as being impersonal: They
may dehumanize and depersonalize activities that have been computerized because they reduce
or eliminate the human element that was present in noncomputerized systems. Some people feel
a loss of identity; they feel like just another number. On the bright side, one of the major
objectives of analytics is to create flexible systems and interfaces that allow individuals to share
their opinions and knowledge and work together with computers. Despite all these efforts, some
people are still afraid of computers, so they are stressed; others are mostly afraid of their
employers watching what they do on the computer. Some say that analytics in general, and ES
in particular, dehumanize managerial activities, and others say they do not. Discuss arguments
for both points of view.
a.) Students' answers will differ. Some reasons are: physicians do not understand and therefore
do not trust the ES; malpractice insurance does not cover recommendations made by MYCIN;
administrators will not invest in it; physicians fear they will be replaced or earn less.
b.) Students' answers will differ. Students should identify a combination of positive (more
money) and negative (calm their fears) motivators.
c.) Probably not, or at least not yet. Diagnosing infections and prescribing pharmaceuticals
are the weak points of many practicing physicians (according to E. H. Shortliffe, one of the
developers of MYCIN). It seems, therefore, that society would be better served if MYCIN (and
other ES) were used extensively, but few physicians use ES. Answer the following questions:
a. Why do you think such systems are little used by physicians?
b. Assume that you are a hospital administrator whose physicians are salaried and report to you.
What would you do to persuade them to use ES?
c. If the potential benefits to society are so great, can society do something that will increase
doctors' use of such analytic systems?
Loss of privacy is a key concern in employing analytics on mobile data. If someone can track the
movement of a cell phone, the privacy of that customer is a big issue. Some of the app
developers claim that they only need to gather aggregate flow information, not individually
identifiable information. But many stories appear in the media that highlight violations of this
general principle. Sometimes, retailers provide information on their customers to the federal
government, in violation of their stated privacy policies.
Legally, the right of privacy is not absolute. The public's right to know is superior to the
individual's right to privacy. For example, the USA PATRIOT Act broadens the government's
ability to access student information and personal financial information without any suspicion of
wrongdoing. Location information from devices has been used to locate victims and criminals,
and so provides a social good. But at what point is the information not the property of the individual?
What are some of the major privacy concerns in employing analytics on mobile data?
Individuals in a technology provider cluster move into a user cluster simply by utilizing the
technology and tools for solving their own business problems or meeting their
consumer/personal needs. In effect, everyone is potentially in the user cluster. How can one
move from a technology provider cluster to a user cluster?