Business Intelligence and Analytics: Systems for Decision Support, 10e (Sharda)

Chapter 13 Big Data and Analytics

1) In the opening vignette, the CERN Data Aggregation System (DAS), built on MongoDB (a
Big Data management infrastructure), used relational database technology.
Answer: FALSE
Diff: 2 Page Ref: 544

2) The term "Big Data" is relative as it depends on the size of the organization using it.
Answer: TRUE
Diff: 2 Page Ref: 546

3) In the Luxottica case study, outsourcing enhanced the ability of the company to gain insights
into its data.
Answer: FALSE
Diff: 2 Page Ref: 550-551

4) Many analytics tools are too complex for the average user, and this is one justification for Big
Data.
Answer: TRUE
Diff: 2 Page Ref: 552

5) In the investment bank case study, the major benefit of replacing multiple databases with the
new trade operational store was real-time access to trading data.
Answer: TRUE
Diff: 2 Page Ref: 555

6) Big Data uses commodity hardware, which is expensive, specialized hardware that is custom
built for a client or application.
Answer: FALSE
Diff: 2 Page Ref: 556

7) MapReduce can be easily understood by skilled programmers due to its procedural nature.
Answer: TRUE
Diff: 2 Page Ref: 558

8) Hadoop was designed to handle petabytes and exabytes of data distributed over multiple
nodes in parallel.
Answer: TRUE
Diff: 2 Page Ref: 558

9) Hadoop and MapReduce require each other to work.
Answer: FALSE
Diff: 2 Page Ref: 562

10) In most cases, Hadoop is used to replace data warehouses.
Answer: FALSE
Diff: 2 Page Ref: 562

11) Despite their potential, many current NoSQL tools lack mature management and monitoring
tools.
Answer: TRUE
Diff: 2 Page Ref: 562

12) The data scientist is a profession for a field that is still largely being defined.
Answer: TRUE
Diff: 2 Page Ref: 565

13) There is a current undersupply of data scientists for the Big Data market.
Answer: TRUE
Diff: 2 Page Ref: 567

14) The Big Data and Analytics in Politics case study makes it clear that the unpredictability of
elections makes politics an unsuitable arena for Big Data.
Answer: FALSE
Diff: 2 Page Ref: 568

15) For low latency, interactive reports, a data warehouse is preferable to Hadoop.
Answer: TRUE
Diff: 2 Page Ref: 573

16) If you have many flexible programming languages running in parallel, Hadoop is preferable
to a data warehouse.
Answer: TRUE
Diff: 2 Page Ref: 573

17) In the Dublin City Council case study, GPS data from the city's buses and CCTV were the
only data sources for the Big Data GIS-based application.
Answer: FALSE
Diff: 2 Page Ref: 575-576

18) It is important for Big Data and self-service business intelligence to go hand in hand to get
maximum value from analytics.
Answer: TRUE
Diff: 1 Page Ref: 579

19) Big Data simplifies data governance issues, especially for global firms.
Answer: FALSE
Diff: 2 Page Ref: 580

20) Current total storage capacity lags behind the digital information being generated in the
world.
Answer: TRUE
Diff: 2 Page Ref: 581

21) Using data to understand customers/clients and business operations to sustain and foster
growth and profitability is
A) easier with the advent of BI and Big Data.
B) essentially the same now as it has always been.
C) an increasingly challenging task for today's enterprises.
D) now completely automated with no human intervention required.
Answer: C
Diff: 2 Page Ref: 546

22) A newly popular unit of data in the Big Data era is the petabyte (PB), which is
A) 10^9 bytes.
B) 10^12 bytes.
C) 10^15 bytes.
D) 10^18 bytes.
Answer: C
Diff: 2 Page Ref: 548

23) Which of the following sources is likely to produce Big Data the fastest?
A) order entry clerks
B) cashiers
C) RFID tags
D) online customers
Answer: C
Diff: 2 Page Ref: 549

24) Data flows can be highly inconsistent, with periodic peaks, making data loads hard to
manage. What is this feature of Big Data called?
A) volatility
B) periodicity
C) inconsistency
D) variability
Answer: D
Diff: 2 Page Ref: 549

25) In the Luxottica case study, what technique did the company use to gain visibility into its
customers?
A) visibility analytics
B) data integration
C) focus on growth
D) customer focus
Answer: B
Diff: 2 Page Ref: 550-551

26) Allowing Big Data to be processed in memory and distributed across a dedicated set of nodes
can solve complex problems in near-real time with highly accurate insights. What is this
process called?
A) in-memory analytics
B) in-database analytics
C) grid computing
D) appliances
Answer: A
Diff: 2 Page Ref: 553

27) Which Big Data approach promotes efficiency, lower cost, and better performance by
processing jobs in a shared, centrally managed pool of IT resources?
A) in-memory analytics
B) in-database analytics
C) grid computing
D) appliances
Answer: C
Diff: 2 Page Ref: 553

28) How does Hadoop work?
A) It integrates Big Data into a whole so large data elements can be processed as a whole on one
computer.
B) It integrates Big Data into a whole so large data elements can be processed as a whole on
multiple computers.
C) It breaks up Big Data into multiple parts so each part can be processed and analyzed at the
same time on one computer.
D) It breaks up Big Data into multiple parts so each part can be processed and analyzed at the
same time on multiple computers.
Answer: D
Diff: 3 Page Ref: 558
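
Illustrative sketch (not from the textbook): the "break the data into parts and process them at the
same time on multiple computers" idea in answer D can be mimicked on a single machine with
Python's multiprocessing module. The chunk size and worker count below are arbitrary stand-ins
for Hadoop's input splits and cluster nodes.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Stand-in for the per-node work: count the records in one split of the data.
        return len(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Break the data up into parts ("splits")...
        chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
        # ...and process each part at the same time on separate worker processes.
        with Pool(processes=4) as pool:
            partial_counts = pool.map(process_chunk, chunks)
        print(sum(partial_counts))  # combine the partial results: 1000000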

29) What is the Hadoop Distributed File System (HDFS) designed to handle?
A) unstructured and semistructured relational data
B) unstructured and semistructured non-relational data
C) structured and semistructured relational data
D) structured and semistructured non-relational data
Answer: B
Diff: 2 Page Ref: 558

30) In a Hadoop "stack," what is a slave node?
A) a node where bits of programs are stored
B) a node where metadata is stored and used to organize data processing
C) a node where data is stored and processed
D) a node responsible for holding all the source programs
Answer: C
Diff: 2 Page Ref: 559

31) In a Hadoop "stack," what node periodically replicates and stores data from the Name Node
should it fail?
A) backup node
B) secondary node
C) substitute node
D) slave node
Answer: B
Diff: 2 Page Ref: 559

32) All of the following statements about MapReduce are true EXCEPT
A) MapReduce is a general-purpose execution engine.
B) MapReduce handles the complexities of network communication.
C) MapReduce handles parallel programming.
D) MapReduce runs without fault tolerance.
Answer: D
Diff: 2 Page Ref: 562

33) In the Big Data and Analytics in Politics case study, which of the following was an input to
the analytic system?
A) census data
B) assessment of sentiment
C) voter mobilization
D) group clustering
Answer: A
Diff: 2 Page Ref: 568

34) In the Big Data and Analytics in Politics case study, what was the analytic system output or
goal?
A) census data
B) assessment of sentiment
C) voter mobilization
D) group clustering
Answer: C
Diff: 2 Page Ref: 568

35) Traditional data warehouses have not been able to keep up with
A) the evolution of the SQL language.
B) the variety and complexity of data.
C) expert systems that run on them.
D) OLAP.
Answer: B
Diff: 2 Page Ref: 570

36) Under which of the following requirements would it be more appropriate to use Hadoop over
a data warehouse?
A) ANSI 2003 SQL compliance is required
B) online archives alternative to tape
C) unrestricted, ungoverned sandbox explorations
D) analysis of provisional data
Answer: C
Diff: 2 Page Ref: 573

37) What is Big Data's relationship to the cloud?
A) Hadoop cannot be deployed effectively in the cloud just yet.
B) Amazon and Google have working Hadoop cloud offerings.
C) IBM's homegrown Hadoop platform is the only option.
D) Only MapReduce works in the cloud; Hadoop does not.
Answer: B
Diff: 2 Page Ref: 575-577

38) Companies with the largest revenues from Big Data tend to be
A) the largest computer and IT services firms.
B) small computer and IT services firms.
C) pure open source Big Data firms.
D) non-U.S. Big Data firms.
Answer: A
Diff: 2 Page Ref: 578

39) In the health sciences, the largest potential source of Big Data comes from
A) accounting systems.
B) human resources.
C) patient monitoring.
D) research administration.
Answer: C
Diff: 2 Page Ref: 587

40) In the Discovery Health insurance case study, the analytics application used available data to
help the company do all of the following EXCEPT
A) predict customer health.
B) detect fraud.
C) lower costs for members.
D) open its own pharmacy.
Answer: D
Diff: 2 Page Ref: 589-591

41) Most Big Data is generated automatically by ________.
Answer: machines
Diff: 2 Page Ref: 546

42) ________ refers to the conformity to facts: accuracy, quality, truthfulness, or trustworthiness
of the data.
Answer: Veracity
Diff: 2 Page Ref: 549

43) In-motion ________ is often overlooked today in the world of BI and Big Data.
Answer: analytics
Diff: 2 Page Ref: 549

44) The ________ of Big Data is its potential to contain more useful patterns and interesting
anomalies than "small" data.
Answer: value proposition
Diff: 2 Page Ref: 549

45) As the size and the complexity of analytical systems increase, the need for more ________
analytical systems is also increasing to obtain the best performance.
Answer: efficient
Diff: 2 Page Ref: 553

46) ________ speeds time to insights and enables better data governance by performing data
integration and analytic functions inside the database.
Answer: In-database analytics
Diff: 2 Page Ref: 553

47) ________ bring together hardware and software in a physical unit that is not only fast but
also scalable on an as-needed basis.
Answer: Appliances
Diff: 2 Page Ref: 553

48) Big Data employs ________ processing techniques and nonrelational data storage
capabilities in order to process unstructured and semistructured data.
Answer: parallel
Diff: 2 Page Ref: 556

49) In the world of Big Data, ________ aids organizations in processing and analyzing large
volumes of multi-structured data. Examples include indexing and search, graph analysis, etc.
Answer: MapReduce
Diff: 2 Page Ref: 558

50) The ________ Node in a Hadoop cluster provides client information on where in the cluster
particular data is stored and whether any nodes fail.
Answer: Name
Diff: 2 Page Ref: 559

51) A job ________ is a node in a Hadoop cluster that initiates and coordinates MapReduce jobs,
or the processing of the data.
Answer: tracker
Diff: 2 Page Ref: 559

52) HBase is a nonrelational ________ that allows for low-latency, quick lookups in Hadoop.
Answer: database
Diff: 2 Page Ref: 560

53) Hadoop is primarily a(n) ________ file system and lacks capabilities we'd associate with a
DBMS, such as indexing, random access to data, and support for SQL.
Answer: distributed
Diff: 2 Page Ref: 561

54) HBase, Cassandra, MongoDB, and Accumulo are examples of ________ databases.
Answer: NoSQL
Diff: 2 Page Ref: 562

55) In the eBay use case study, load ________ helped the company meet its Big Data needs with
the extremely fast data handling and application availability requirements.
Answer: balancing
Diff: 2 Page Ref: 563

56) As volumes of Big Data arrive from multiple sources such as sensors, machines, social
media, and clickstream interactions, the first step is to ________ all the data reliably and cost
effectively.
Answer: capture
Diff: 2 Page Ref: 570

57) In open-source databases, the most important performance enhancement to date is the cost-
based ________.
Answer: optimizer
Diff: 2 Page Ref: 571

58) Data ________, or pulling of data from multiple subject areas and numerous applications into
one repository, is the raison d'être for data warehouses.
Answer: integration
Diff: 2 Page Ref: 572

59) In the energy industry, ________ grids are one of the most impactful applications of stream
analytics.
Answer: smart
Diff: 2 Page Ref: 582

60) In the U.S. telecommunications company case study, the use of analytics via dashboards has
helped to improve the effectiveness of the company's ________ assessments and to make their
systems more secure.
Answer: threat
Diff: 2 Page Ref: 586

61) In the opening vignette, what is the source of the Big Data collected at the European
Organization for Nuclear Research or CERN?
Answer: Forty million times per second, particles collide within the LHC, each collision
generating particles that often decay in complex ways into even more particles. Precise electronic
circuits all around the LHC record the passage of each particle via a detector as a series of
electronic signals, and send the data to the CERN Data Centre (DC) for recording and digital
reconstruction. The digitized summary of the data is recorded as a "collision event." About 15
petabytes of this digitized summary data are produced annually and are processed by physicists to
determine whether the collisions have thrown up any interesting physics.
Diff: 2 Page Ref: 543

62) List and describe the three main "V"s that characterize Big Data.
Answer:
• Volume: This is obviously the most common trait of Big Data. Many factors contributed to
the exponential increase in data volume, such as transaction-based data stored through the years,
text data constantly streaming in from social media, increasing amounts of sensor data being
collected, automatically generated RFID and GPS data, and so forth.
• Variety: Data today comes in all types of formats, ranging from traditional databases to
hierarchical data stores created by end users and OLAP systems, to text documents, e-mail,
XML, meter-collected and sensor-captured data, to video, audio, and stock ticker data. By some
estimates, 80 to 85 percent of all organizations' data is in some sort of unstructured or
semistructured format.
• Velocity: This refers to both how fast data is being produced and how fast the data must be
processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags,
automated sensors, GPS devices, and smart meters are driving an increasing need to deal with
torrents of data in near-real time.
Diff: 2 Page Ref: 547-549

63) List and describe four of the most critical success factors for Big Data analytics.
Answer:
• A clear business need (alignment with the vision and the strategy). Business investments
ought to be made for the good of the business, not for the sake of mere technology
advancements. Therefore, the main driver for Big Data analytics should be the needs of the
business at any level: strategic, tactical, and operational.
• Strong, committed sponsorship (executive champion). It is a well-known fact that if you
don't have strong, committed executive sponsorship, it is difficult (if not impossible) to succeed.
If the scope is a single or a few analytical applications, the sponsorship can be at the
departmental level. However, if the target is enterprise-wide organizational transformation,
which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels and
organization-wide.
• Alignment between the business and IT strategy. It is essential to make sure that the
analytics work is always supporting the business strategy, and not the other way around. Analytics
should play the enabling role in successful execution of the business strategy.
• A fact-based decision-making culture. In a fact-based decision-making culture, the numbers,
rather than intuition, gut feeling, or supposition, drive decision making. There is also a culture of
experimentation to see what works and what doesn't. To create a fact-based decision-making culture,
senior management needs to do the following: recognize that some people can't or won't adjust;
be a vocal supporter; stress that outdated methods must be discontinued; ask to see what
analytics went into decisions; link incentives and compensation to desired behaviors.
• A strong data infrastructure. Data warehouses have provided the data infrastructure for
analytics. This infrastructure is changing and being enhanced in the Big Data era with new
technologies. Success requires marrying the old with the new for a holistic infrastructure that
works synergistically.
Diff: 2 Page Ref: 553

64) When considering Big Data projects and architecture, list and describe five challenges
designers should be mindful of in order to make the journey to analytics competency less
stressful.
Answer:
• Data volume: The ability to capture, store, and process the huge volume of data at an
acceptable speed so that the latest information is available to decision makers when they need it.
• Data integration: The ability to combine data that is not similar in structure or source and to
do so quickly and at reasonable cost.
• Processing capabilities: The ability to process the data quickly, as it is captured. The
traditional way of collecting and then processing the data may not work. In many situations data
needs to be analyzed as soon as it is captured to leverage the most value.
• Data governance: The ability to keep up with the security, privacy, ownership, and quality
issues of Big Data. As the volume, variety (format and source), and velocity of data change, so
should the capabilities of governance practices.
• Skills availability: Big Data is being harnessed with new tools and is being looked at in
different ways. There is a shortage of data scientists with the skills to do the job.
• Solution cost: Since Big Data has opened up a world of possible business improvements,
there is a great deal of experimentation and discovery taking place to determine the patterns that
matter and the insights that turn to value. To ensure a positive ROI on a Big Data project,
therefore, it is crucial to reduce the cost of the solutions used to find that value.
Diff: 3 Page Ref: 554

65) Define MapReduce.
Answer: As described by Dean and Ghemawat (2004), "MapReduce is a programming model
and an associated implementation for processing and generating large data sets. Programs written
in this functional style are automatically parallelized and executed on a large cluster of
commodity machines. This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large distributed system."
Diff: 2 Page Ref: 557-558
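
Illustrative sketch (not from the textbook): the map/reduce programming model described above,
reduced to a word count in plain Python. This shows only the functional style (map, shuffle/group,
reduce); Hadoop's actual implementation distributes these phases across a cluster of commodity
machines.

    from collections import defaultdict

    def map_phase(records):
        # Map: emit a (word, 1) pair for every word in every input record.
        for record in records:
            for word in record.split():
                yield (word, 1)

    def shuffle(pairs):
        # Shuffle: group the intermediate values by key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        # Reduce: aggregate the values for each key.
        return {key: sum(values) for key, values in grouped.items()}

    records = ["big data analytics", "big data platforms"]
    print(reduce_phase(shuffle(map_phase(records))))
    # {'big': 2, 'data': 2, 'analytics': 1, 'platforms': 1}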

66) What is NoSQL as used for Big Data? Describe its major downsides.
Answer:
• NoSQL is a new style of database that has emerged, like Hadoop, to process large volumes of
multi-structured data. However, whereas Hadoop is adept at supporting large-scale, batch-style
historical analysis, NoSQL databases are aimed, for the most part (though there are some
important exceptions), at serving up discrete data stored among large volumes of multi-
structured data to end-user and automated Big Data applications. This capability is sorely lacking
from relational database technology, which simply can't maintain needed application
performance levels at Big Data scale.
• The downside of most NoSQL databases today is that they trade ACID (atomicity,
consistency, isolation, durability) compliance for performance and scalability. Many also lack
mature management and monitoring tools.
Diff: 2 Page Ref: 562

67) What is a data scientist and what does the job involve?
Answer: A data scientist is a role or a job frequently associated with Big Data or data science. In
a very short time it has become one of the most sought-after roles in the marketplace. Currently,
data scientists' most basic, current skill is the ability to write code (in the latest Big Data
languages and platforms). A more enduring skill will be the need for data scientists to
communicate in a language that all their stakeholders understand–and to demonstrate the special
skills involved in storytelling with data, whether verbally, visually, or–ideally–both. Data
scientists use a combination of their business and technical skills to investigate Big Data looking
for ways to improve current business analytics practices (from descriptive to predictive and
prescriptive) and hence to improve decisions for new business opportunities.
Diff: 2 Page Ref: 565

68) Why are some portions of tape backup workloads being redirected to Hadoop clusters today?
Answer:
• First, while it may appear inexpensive to store data on tape, the true cost comes with the
difficulty of retrieval. Not only is the data stored offline, requiring hours if not days to restore,
but tape cartridges themselves are also prone to degradation over time, making data loss a reality
and forcing companies to factor in those costs. To make matters worse, tape formats change
every couple of years, requiring organizations to either perform massive data migrations to the
newest tape format or risk the inability to restore data from obsolete tapes.
• Second, it has been shown that there is value in keeping historical data online and accessible.
As in the clickstream example, keeping raw data on a spinning disk for a longer duration makes
it easy for companies to revisit data when the context changes and new constraints need to be
applied. Searching thousands of disks with Hadoop is dramatically faster and easier than
spinning through hundreds of magnetic tapes. Additionally, as disk densities continue to double
every 18 months, it becomes economically feasible for organizations to hold many years' worth
of raw or refined data in HDFS.
Diff: 2 Page Ref: 571

69) What are the differences between stream analytics and perpetual analytics? When would you
use one or the other?
Answer:
• In many cases they are used synonymously. However, in the context of intelligent systems,
there is a difference. Streaming analytics involves applying transaction-level logic to real-time
observations. The rules applied to these observations take into account previous observations as
long as they occurred in the prescribed window; these windows have some arbitrary size (e.g.,
last 5 seconds, last 10,000 observations, etc.). Perpetual analytics, on the other hand, evaluates
every incoming observation against all prior observations, where there is no window size.
Recognizing how the new observation relates to all prior observations enables the discovery of
real-time insight.
• When transactional volumes are high and the time-to-decision is too short, favoring
nonpersistence and small window sizes, this translates into using streaming analytics. However,
when the mission is critical and transaction volumes can be managed in real time, then perpetual
analytics is a better answer.
Diff: 2 Page Ref: 582
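
Illustrative sketch (not from the textbook): the window distinction in the answer above, in plain
Python. Streaming analytics scores each new observation against only a bounded window of recent
observations; perpetual analytics scores it against every prior observation. The window size, the
input values, and the "far from the mean" rule are arbitrary choices for this example.

    from collections import deque

    def far_from_mean(value, reference, threshold=10.0):
        # Flag a value that deviates from the mean of the reference observations.
        if not reference:
            return False
        mean = sum(reference) / len(reference)
        return abs(value - mean) > threshold

    window = deque(maxlen=5)   # streaming: only the last 5 observations are kept
    history = []               # perpetual: every prior observation is kept

    for observation in [10, 11, 9, 10, 12, 50, 10]:
        stream_flag = far_from_mean(observation, list(window))  # windowed rule
        perpetual_flag = far_from_mean(observation, history)    # all-history rule
        print(observation, "stream:", stream_flag, "perpetual:", perpetual_flag)
        window.append(observation)
        history.append(observation)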

70) Describe data stream mining and how it is used.
Answer: Data stream mining, as an enabling technology for stream analytics, is the process of
extracting novel patterns and knowledge structures from continuous, rapid data records. A data
stream is a continuous, ordered sequence of instances that in many applications of data
stream mining can be read/processed only once or a small number of times using limited
computing and storage capabilities. Examples of data streams include sensor data, computer
network traffic, phone conversations, ATM transactions, web searches, and financial data. Data
stream mining can be considered a subfield of data mining, machine learning, and knowledge
discovery. In many data stream mining applications, the goal is to predict the class or value of
new instances in the data stream given some knowledge about the class membership or values of
previous instances in the data stream.
Diff: 2 Page Ref: 583-584
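
Illustrative sketch (not from the textbook): a one-pass, constant-memory learner over a data
stream, in the spirit described above, where each instance is used once to make a prediction and
is then folded into the model. The majority-class rule and the small transaction stream are
placeholders for a real incremental learner and a real feed.

    from collections import Counter

    class MajorityClassStreamLearner:
        # Predict the most frequent class label seen so far; update once per instance.
        def __init__(self):
            self.class_counts = Counter()

        def predict(self):
            if not self.class_counts:
                return None  # no data seen yet
            return self.class_counts.most_common(1)[0][0]

        def learn_one(self, label):
            self.class_counts[label] += 1

    learner = MajorityClassStreamLearner()
    correct = 0
    for label in ["normal", "normal", "fraud", "normal", "normal"]:
        if learner.predict() == label:  # predict before seeing the true label
            correct += 1
        learner.learn_one(label)        # then update the model with it
    print(correct, "of 5 predicted correctly")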
