Business Intelligence and Analytics: Systems for Decision Support, 10e (Sharda)

Chapter 13 Big Data and Analytics

1) In the opening vignette, the CERN Data Aggregation System (DAS), built on MongoDB (a
Big Data management infrastructure), used relational database technology.
Answer: FALSE
Diff: 2 Page Ref: 544

2) The term "Big Data" is relative as it depends on the size of the organization using it.
Answer: TRUE
Diff: 2 Page Ref: 546

3) In the Luxottica case study, outsourcing enhanced the ability of the company to gain insights
into its data.
Answer: FALSE
Diff: 2 Page Ref: 550-551

4) Many analytics tools are too complex for the average user, and this is one justification for Big
Data.
Answer: TRUE
Diff: 2 Page Ref: 552

5) In the investment bank case study, the major benefit of replacing multiple databases with the
new trade operational store was real-time access to trading data.
Answer: TRUE
Diff: 2 Page Ref: 555

6) Big Data uses commodity hardware, which is expensive, specialized hardware that is custom
built for a client or application.
Answer: FALSE
Diff: 2 Page Ref: 556

7) MapReduce can be easily understood by skilled programmers due to its procedural nature.
Answer: TRUE
Diff: 2 Page Ref: 558

8) Hadoop was designed to handle petabytes and exabytes of data distributed over multiple
nodes in parallel.
Answer: TRUE
Diff: 2 Page Ref: 558

9) Hadoop and MapReduce require each other to work.
Answer: FALSE
Diff: 2 Page Ref: 562

10) In most cases, Hadoop is used to replace data warehouses.
Answer: FALSE
Diff: 2 Page Ref: 562

11) Despite their potential, many current NoSQL tools lack mature management and monitoring
tools.
Answer: TRUE
Diff: 2 Page Ref: 562

12) The data scientist is a profession for a field that is still largely being defined.
Answer: TRUE
Diff: 2 Page Ref: 565

13) There is a current undersupply of data scientists for the Big Data market.
Answer: TRUE
Diff: 2 Page Ref: 567

14) The Big Data and Analytics in Politics case study makes it clear that the unpredictability of
elections makes politics an unsuitable arena for Big Data.
Answer: FALSE
Diff: 2 Page Ref: 568

15) For low latency, interactive reports, a data warehouse is preferable to Hadoop.
Answer: TRUE
Diff: 2 Page Ref: 573

16) If you have many flexible programming languages running in parallel, Hadoop is preferable
to a data warehouse.
Answer: TRUE
Diff: 2 Page Ref: 573

17) In the Dublin City Council case study, GPS data from the city's buses and CCTV were the
only data sources for the Big Data GIS-based application.
Answer: FALSE
Diff: 2 Page Ref: 575-576

18) It is important for Big Data and self-service business intelligence to go hand in hand to get
maximum value from analytics.
Answer: TRUE
Diff: 1 Page Ref: 579

19) Big Data simplifies data governance issues, especially for global firms.
Answer: FALSE
Diff: 2 Page Ref: 580

20) Current total storage capacity lags behind the digital information being generated in the
world.
Answer: TRUE
Diff: 2 Page Ref: 581

21) Using data to understand customers/clients and business operations to sustain and foster
growth and profitability is
A) easier with the advent of BI and Big Data.
B) essentially the same now as it has always been.
C) an increasingly challenging task for today's enterprises.
D) now completely automated with no human intervention required.
Answer: C
Diff: 2 Page Ref: 546

22) A newly popular unit of data in the Big Data era is the petabyte (PB), which is
A) 10^9 bytes.
B) 10^12 bytes.
C) 10^15 bytes.
D) 10^18 bytes.
Answer: C
Diff: 2 Page Ref: 548

23) Which of the following sources is likely to produce Big Data the fastest?
A) order entry clerks
B) cashiers
C) RFID tags
D) online customers
Answer: C
Diff: 2 Page Ref: 549

24) Data flows can be highly inconsistent, with periodic peaks, making data loads hard to
manage. What is this feature of Big Data called?
A) volatility
B) periodicity
C) inconsistency
D) variability
Answer: D
Diff: 2 Page Ref: 549

25) In the Luxottica case study, what technique did the company use to gain visibility into its
customers?
A) visibility analytics
B) data integration
C) focus on growth
D) customer focus
Answer: B
Diff: 2 Page Ref: 550-551

26) Allowing Big Data to be processed in memory and distributed across a dedicated set of nodes
can solve complex problems in near-real time with highly accurate insights. What is this
process called?
A) in-memory analytics
B) in-database analytics
C) grid computing
D) appliances
Answer: A
Diff: 2 Page Ref: 553

27) Which Big Data approach promotes efficiency, lower cost, and better performance by
processing jobs in a shared, centrally managed pool of IT resources?
A) in-memory analytics
B) in-database analytics
C) grid computing
D) appliances
Answer: C
Diff: 2 Page Ref: 553

28) How does Hadoop work?
A) It integrates Big Data into a whole so large data elements can be processed as a whole on one
computer.
B) It integrates Big Data into a whole so large data elements can be processed as a whole on
multiple computers.
C) It breaks up Big Data into multiple parts so each part can be processed and analyzed at the
same time on one computer.
D) It breaks up Big Data into multiple parts so each part can be processed and analyzed at the
same time on multiple computers.
Answer: D
Diff: 3 Page Ref: 558
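
Illustrative sketch (not from the textbook): the "break the data into parts and process them at the
same time on multiple computers" idea in answer D can be mimicked on a single machine with
Python's multiprocessing module. The chunk size and worker count below are arbitrary stand-ins
for Hadoop's input splits and cluster nodes.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Stand-in for the per-node work: count the records in one split of the data.
        return len(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Break the data up into parts ("splits")...
        chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
        # ...and process each part at the same time on separate worker processes.
        with Pool(processes=4) as pool:
            partial_counts = pool.map(process_chunk, chunks)
        print(sum(partial_counts))  # combine the partial results: 1000000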

29) What is the Hadoop Distributed File System (HDFS) designed to handle?
A) unstructured and semistructured relational data
B) unstructured and semistructured non-relational data
C) structured and semistructured relational data
D) structured and semistructured non-relational data
Answer: B
Diff: 2 Page Ref: 558

30) In a Hadoop "stack," what is a slave node?
A) a node where bits of programs are stored
B) a node where metadata is stored and used to organize data processing
C) a node where data is stored and processed
D) a node responsible for holding all the source programs
Answer: C
Diff: 2 Page Ref: 559

31) In a Hadoop "stack," what node periodically replicates and stores data from the Name Node
should it fail?
A) backup node
B) secondary node
C) substitute node
D) slave node
Answer: B
Diff: 2 Page Ref: 559

32) All of the following statements about MapReduce are true EXCEPT
A) MapReduce is a general-purpose execution engine.
B) MapReduce handles the complexities of network communication.
C) MapReduce handles parallel programming.
D) MapReduce runs without fault tolerance.
Answer: D
Diff: 2 Page Ref: 562

33) In the Big Data and Analytics in Politics case study, which of the following was an input to
the analytic system?
A) census data
B) assessment of sentiment
C) voter mobilization
D) group clustering
Answer: A
Diff: 2 Page Ref: 568

34) In the Big Data and Analytics in Politics case study, what was the analytic system output or
goal?
A) census data
B) assessment of sentiment
C) voter mobilization
D) group clustering
Answer: C
Diff: 2 Page Ref: 568

35) Traditional data warehouses have not been able to keep up with
A) the evolution of the SQL language.
B) the variety and complexity of data.
C) expert systems that run on them.
D) OLAP.
Answer: B
Diff: 2 Page Ref: 570

36) Under which of the following requirements would it be more appropriate to use Hadoop over
a data warehouse?
A) ANSI 2003 SQL compliance is required
B) online archives alternative to tape
C) unrestricted, ungoverned sandbox explorations
D) analysis of provisional data
Answer: C
Diff: 2 Page Ref: 573

37) What is Big Data's relationship to the cloud?
A) Hadoop cannot be deployed effectively in the cloud just yet.
B) Amazon and Google have working Hadoop cloud offerings.
C) IBM's homegrown Hadoop platform is the only option.
D) Only MapReduce works in the cloud; Hadoop does not.
Answer: B
Diff: 2 Page Ref: 575-577

38) Companies with the largest revenues from Big Data tend to be
A) the largest computer and IT services firms.
B) small computer and IT services firms.
C) pure open source Big Data firms.
D) non-U.S. Big Data firms.
Answer: A
Diff: 2 Page Ref: 578

39) In the health sciences, the largest potential source of Big Data comes from
A) accounting systems.
B) human resources.
C) patient monitoring.
D) research administration.
Answer: C
Diff: 2 Page Ref: 587

40) In the Discovery Health insurance case study, the analytics application used available data to
help the company do all of the following EXCEPT
A) predict customer health.
B) detect fraud.
C) lower costs for members.
D) open its own pharmacy.
Answer: D
Diff: 2 Page Ref: 589-591

41) Most Big Data is generated automatically by ________.
Answer: machines
Diff: 2 Page Ref: 546

42) ________ refers to the conformity to facts: accuracy, quality, truthfulness, or trustworthiness
of the data.
Answer: Veracity
Diff: 2 Page Ref: 549

43) In-motion ________ is often overlooked today in the world of BI and Big Data.
Answer: analytics
Diff: 2 Page Ref: 549

44) The ________ of Big Data is its potential to contain more useful patterns and interesting
anomalies than "small" data.
Answer: value proposition
Diff: 2 Page Ref: 549

45) As the size and the complexity of analytical systems increase, the need for more ________
analytical systems is also increasing to obtain the best performance.
Answer: efficient
Diff: 2 Page Ref: 553

46) ________ speeds time to insights and enables better data governance by performing data
integration and analytic functions inside the database.
Answer: In-database analytics
Diff: 2 Page Ref: 553

47) ________ bring together hardware and software in a physical unit that is not only fast but
also scalable on an as-needed basis.
Answer: Appliances
Diff: 2 Page Ref: 553

48) Big Data employs ________ processing techniques and nonrelational data storage
capabilities in order to process unstructured and semistructured data.
Answer: parallel
Diff: 2 Page Ref: 556

49) In the world of Big Data, ________ aids organizations in processing and analyzing large
volumes of multi-structured data. Examples include indexing and search, graph analysis, etc.
Answer: MapReduce
Diff: 2 Page Ref: 558

50) The ________ Node in a Hadoop cluster provides client information on where in the cluster
particular data is stored and whether any nodes fail.
Answer: Name
Diff: 2 Page Ref: 559

51) A job ________ is a node in a Hadoop cluster that initiates and coordinates MapReduce jobs,
or the processing of the data.
Answer: tracker
Diff: 2 Page Ref: 559

52) HBase is a nonrelational ________ that allows for low-latency, quick lookups in Hadoop.
Answer: database
Diff: 2 Page Ref: 560

53) Hadoop is primarily a(n) ________ file system and lacks capabilities we'd associate with a
DBMS, such as indexing, random access to data, and support for SQL.
Answer: distributed
Diff: 2 Page Ref: 561

54) HBase, Cassandra, MongoDB, and Accumulo are examples of ________ databases.
Answer: NoSQL
Diff: 2 Page Ref: 562

55) In the eBay use case study, load ________ helped the company meet its Big Data needs with
the extremely fast data handling and application availability requirements.
Answer: balancing
Diff: 2 Page Ref: 563

56) As volumes of Big Data arrive from multiple sources such as sensors, machines, social
media, and clickstream interactions, the first step is to ________ all the data reliably and cost
effectively.
Answer: capture
Diff: 2 Page Ref: 570

57) In open-source databases, the most important performance enhancement to date is the cost-
based ________.
Answer: optimizer
Diff: 2 Page Ref: 571

58) Data ________, or pulling of data from multiple subject areas and numerous applications into
one repository, is the raison d'être for data warehouses.
Answer: integration
Diff: 2 Page Ref: 572

59) In the energy industry, ________ grids are one of the most impactful applications of stream
analytics.
Answer: smart
Diff: 2 Page Ref: 582

60) In the U.S. telecommunications company case study, the use of analytics via dashboards has
helped to improve the effectiveness of the company's ________ assessments and to make their
systems more secure.
Answer: threat
Diff: 2 Page Ref: 586

61) In the opening vignette, what is the source of the Big Data collected at the European
Organization for Nuclear Research or CERN?
Answer: Forty million times per second, particles collide within the LHC, each collision
generating particles that often decay in complex ways into even more particles. Precise electronic
circuits all around the LHC record the passage of each particle via a detector as a series of
electronic signals, and send the data to the CERN Data Centre (DC) for recording and digital
reconstruction. The digitized summary of the data is recorded as a "collision event." About 15
petabytes of this digitized summary data are produced annually and are processed by physicists to
determine whether the collisions have thrown up any interesting physics.
Diff: 2 Page Ref: 543

62) List and describe the three main "V"s that characterize Big Data.
Answer:
• Volume: This is obviously the most common trait of Big Data. Many factors contributed to
the exponential increase in data volume, such as transaction-based data stored through the years,
text data constantly streaming in from social media, increasing amounts of sensor data being
collected, automatically generated RFID and GPS data, and so forth.
• Variety: Data today comes in all types of formats, ranging from traditional databases to
hierarchical data stores created by end users and OLAP systems, to text documents, e-mail,
XML, meter-collected and sensor-captured data, to video, audio, and stock ticker data. By some
estimates, 80 to 85 percent of all organizations' data is in some sort of unstructured or
semistructured format.
• Velocity: This refers to both how fast data is being produced and how fast the data must be
processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags,
automated sensors, GPS devices, and smart meters are driving an increasing need to deal with
torrents of data in near-real time.
Diff: 2 Page Ref: 547-549

63) List and describe four of the most critical success factors for Big Data analytics.
Answer:
• A clear business need (alignment with the vision and the strategy). Business investments
ought to be made for the good of the business, not for the sake of mere technology
advancements. Therefore, the main driver for Big Data analytics should be the needs of the
business at any level: strategic, tactical, and operational.
• Strong, committed sponsorship (executive champion). It is a well-known fact that if you
don't have strong, committed executive sponsorship, it is difficult (if not impossible) to succeed.
If the scope is a single or a few analytical applications, the sponsorship can be at the
departmental level. However, if the target is enterprise-wide organizational transformation,
which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels and
organization-wide.
• Alignment between the business and IT strategy. It is essential to make sure that the
analytics work is always supporting the business strategy, and not the other way around. Analytics
should play the enabling role in successful execution of the business strategy.
• A fact-based decision-making culture. In a fact-based decision-making culture, the numbers,
rather than intuition, gut feeling, or supposition, drive decision making. There is also a culture of
experimentation to see what works and what doesn't. To create a fact-based decision-making culture,
senior management needs to do the following: recognize that some people can't or won't adjust;
be a vocal supporter; stress that outdated methods must be discontinued; ask to see what
analytics went into decisions; link incentives and compensation to desired behaviors.
• A strong data infrastructure. Data warehouses have provided the data infrastructure for
analytics. This infrastructure is changing and being enhanced in the Big Data era with new
technologies. Success requires marrying the old with the new for a holistic infrastructure that
works synergistically.
Diff: 2 Page Ref: 553

64) When considering Big Data projects and architecture, list and describe five challenges
designers should be mindful of in order to make the journey to analytics competency less
stressful.
Answer:
• Data volume: The ability to capture, store, and process the huge volume of data at an
acceptable speed so that the latest information is available to decision makers when they need it.
• Data integration: The ability to combine data that is not similar in structure or source and to
do so quickly and at reasonable cost.
• Processing capabilities: The ability to process the data quickly, as it is captured. The
traditional way of collecting and then processing the data may not work. In many situations data
needs to be analyzed as soon as it is captured to leverage the most value.
• Data governance: The ability to keep up with the security, privacy, ownership, and quality
issues of Big Data. As the volume, variety (format and source), and velocity of data change, so
should the capabilities of governance practices.
• Skills availability: Big Data is being harnessed with new tools and is being looked at in
different ways. There is a shortage of data scientists with the skills to do the job.
• Solution cost: Since Big Data has opened up a world of possible business improvements,
there is a great deal of experimentation and discovery taking place to determine the patterns that
matter and the insights that turn to value. To ensure a positive ROI on a Big Data project,
therefore, it is crucial to reduce the cost of the solutions used to find that value.
Diff: 3 Page Ref: 554

65) Define MapReduce.
Answer: As described by Dean and Ghemawat (2004), "MapReduce is a programming model
and an associated implementation for processing and generating large data sets. Programs written
in this functional style are automatically parallelized and executed on a large cluster of
commodity machines. This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large distributed system."
Diff: 2 Page Ref: 557-558
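
Illustrative sketch (not from the textbook): the map/reduce programming model described above,
reduced to a word count in plain Python. This shows only the functional style (map, shuffle/group,
reduce); Hadoop's actual implementation distributes these phases across a cluster of commodity
machines.

    from collections import defaultdict

    def map_phase(records):
        # Map: emit a (word, 1) pair for every word in every input record.
        for record in records:
            for word in record.split():
                yield (word, 1)

    def shuffle(pairs):
        # Shuffle: group the intermediate values by key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        # Reduce: aggregate the values for each key.
        return {key: sum(values) for key, values in grouped.items()}

    records = ["big data analytics", "big data platforms"]
    print(reduce_phase(shuffle(map_phase(records))))
    # {'big': 2, 'data': 2, 'analytics': 1, 'platforms': 1}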

66) What is NoSQL as used for Big Data? Describe its major downsides.
Answer:
• NoSQL is a new style of database that has emerged, like Hadoop, to process large volumes of
multi-structured data. However, whereas Hadoop is adept at supporting large-scale, batch-style
historical analysis, NoSQL databases are aimed, for the most part (though there are some
important exceptions), at serving up discrete data stored among large volumes of multi-
structured data to end-user and automated Big Data applications. This capability is sorely lacking
from relational database technology, which simply can't maintain needed application
performance levels at Big Data scale.
• The downside of most NoSQL databases today is that they trade ACID (atomicity,
consistency, isolation, durability) compliance for performance and scalability. Many also lack
mature management and monitoring tools.
Diff: 2 Page Ref: 562

67) What is a data scientist and what does the job involve?
Answer: A data scientist is a role or a job frequently associated with Big Data or data science. In
a very short time it has become one of the most sought-after roles in the marketplace. Currently,
data scientists' most basic, current skill is the ability to write code (in the latest Big Data
languages and platforms). A more enduring skill will be the need for data scientists to
communicate in a language that all their stakeholders understand–and to demonstrate the special
skills involved in storytelling with data, whether verbally, visually, or–ideally–both. Data
scientists use a combination of their business and technical skills to investigate Big Data looking
for ways to improve current business analytics practices (from descriptive to predictive and
prescriptive) and hence to improve decisions for new business opportunities.
Diff: 2 Page Ref: 565

68) Why are some portions of tape backup workloads being redirected to Hadoop clusters today?
Answer:
• First, while it may appear inexpensive to store data on tape, the true cost comes with the
difficulty of retrieval. Not only is the data stored offline, requiring hours if not days to restore,
but tape cartridges themselves are also prone to degradation over time, making data loss a reality
and forcing companies to factor in those costs. To make matters worse, tape formats change
every couple of years, requiring organizations to either perform massive data migrations to the
newest tape format or risk the inability to restore data from obsolete tapes.
• Second, it has been shown that there is value in keeping historical data online and accessible.
As in the clickstream example, keeping raw data on a spinning disk for a longer duration makes
it easy for companies to revisit data when the context changes and new constraints need to be
applied. Searching thousands of disks with Hadoop is dramatically faster and easier than
spinning through hundreds of magnetic tapes. Additionally, as disk densities continue to double
every 18 months, it becomes economically feasible for organizations to hold many years' worth
of raw or refined data in HDFS.
Diff: 2 Page Ref: 571

69) What are the differences between stream analytics and perpetual analytics? When would you
use one or the other?
Answer:
• In many cases they are used synonymously. However, in the context of intelligent systems,
there is a difference. Streaming analytics involves applying transaction-level logic to real-time
observations. The rules applied to these observations take into account previous observations as
long as they occurred in the prescribed window; these windows have some arbitrary size (e.g.,
last 5 seconds, last 10,000 observations, etc.). Perpetual analytics, on the other hand, evaluates
every incoming observation against all prior observations, where there is no window size.
Recognizing how the new observation relates to all prior observations enables the discovery of
real-time insight.
• When transactional volumes are high and the time-to-decision is too short, favoring
nonpersistence and small window sizes, this translates into using streaming analytics. However,
when the mission is critical and transaction volumes can be managed in real time, then perpetual
analytics is a better answer.
Diff: 2 Page Ref: 582
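
Illustrative sketch (not from the textbook): the window distinction in the answer above, in plain
Python. Streaming analytics scores each new observation against only a bounded window of recent
observations; perpetual analytics scores it against every prior observation. The window size, the
input values, and the "far from the mean" rule are arbitrary choices for this example.

    from collections import deque

    def far_from_mean(value, reference, threshold=10.0):
        # Flag a value that deviates from the mean of the reference observations.
        if not reference:
            return False
        mean = sum(reference) / len(reference)
        return abs(value - mean) > threshold

    window = deque(maxlen=5)   # streaming: only the last 5 observations are kept
    history = []               # perpetual: every prior observation is kept

    for observation in [10, 11, 9, 10, 12, 50, 10]:
        stream_flag = far_from_mean(observation, list(window))  # windowed rule
        perpetual_flag = far_from_mean(observation, history)    # all-history rule
        print(observation, "stream:", stream_flag, "perpetual:", perpetual_flag)
        window.append(observation)
        history.append(observation)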

70) Describe data stream mining and how it is used.
Answer: Data stream mining, as an enabling technology for stream analytics, is the process of
extracting novel patterns and knowledge structures from continuous, rapid data records. A data
stream is a continuous, ordered sequence of instances that in many applications of data
stream mining can be read/processed only once or a small number of times using limited
computing and storage capabilities. Examples of data streams include sensor data, computer
network traffic, phone conversations, ATM transactions, web searches, and financial data. Data
stream mining can be considered a subfield of data mining, machine learning, and knowledge
discovery. In many data stream mining applications, the goal is to predict the class or value of
new instances in the data stream given some knowledge about the class membership or values of
previous instances in the data stream.
Diff: 2 Page Ref: 583-584
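
Illustrative sketch (not from the textbook): a one-pass, constant-memory learner over a data
stream, in the spirit described above, where each instance is used once to make a prediction and
is then folded into the model. The majority-class rule and the small transaction stream are
placeholders for a real incremental learner and a real feed.

    from collections import Counter

    class MajorityClassStreamLearner:
        # Predict the most frequent class label seen so far; update once per instance.
        def __init__(self):
            self.class_counts = Counter()

        def predict(self):
            if not self.class_counts:
                return None  # no data seen yet
            return self.class_counts.most_common(1)[0][0]

        def learn_one(self, label):
            self.class_counts[label] += 1

    learner = MajorityClassStreamLearner()
    correct = 0
    for label in ["normal", "normal", "fraud", "normal", "normal"]:
        if learner.predict() == label:  # predict before seeing the true label
            correct += 1
        learner.learn_one(label)        # then update the model with it
    print(correct, "of 5 predicted correctly")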
