Big Data Management
by Mike Wessler
These materials are © 2018 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Big Data Management For Dummies®, 2nd Informatica Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2018 by John Wiley & Sons, Inc.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by
any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted
under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the
Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department,
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com, Making
Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons,
Inc. and/or its affiliates in the United States and other countries, and may not be used without written
permission. Informatica and the Informatica logo are registered trademarks of Informatica. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with
any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For Dummies book
for your business or organization, please contact our Business Development Department in the U.S. at
877-409-4177, contact [email protected], or visit www.wiley.com/go/custompub. For information
about licensing the For Dummies brand for products or services, contact
BrandedRights&Licenses@Wiley.com.
ISBN: 978-1-119-45962-0 (pbk); ISBN: 978-1-119-45964-4 (ebk)
Publisher’s Acknowledgments
Some of the people who helped bring this book to market include the following:
Leveraging the Right Tools
Considering Commercial Tools Built atop Open Source Projects
Combining Management with Integration, Governance, and Security
Introduction
Big data is the subject of great energy and excitement, and
for good reason. The prospect of channeling all the data in
the universe (and that is a lot of data) into analytical engines
to understand relationships between entities, identify elusive
patterns, and predict future events is exciting! It is changing our
lives and altering the way businesses see us as consumers. When
used correctly, businesses find that big data unleashes a wealth of
information and insights that translate to higher profits,
reduced costs, and less risk; it is a win!
The downside is, despite all the hype, many big data projects
struggle to deliver on those lofty promises. The fact is that while
technology evolved and data grew at an exponential pace, the
processes to manage big data were left behind. The result was
frustration with many big data projects.
Icons Used in This Book
Throughout this book, you will occasionally see special icons to
bring your attention to a point that needs to be emphasized. I will
keep them brief, and sometimes a little funny, but if you see one,
take note because it’s something you should know.
Tips indicate information that you may find useful. Often, they
relate to an experience I had (or I wish I had at the time), or they
add context to a topic.
If you see this icon, it’s probably something that will help you
later. You won’t find the meaning of life here, but you may find
some advice that will make your life easier.
One good place to visit is the Informatica Big Data Ready web page
at informatica.com/bigdataready.
IN THIS CHAPTER
»» Understanding the evolution of data
Chapter 1
Identifying Big Data
Data has evolved over the years and will continue to evolve.
What was originally a stable stream of well-structured data has
become, with the growth of technology, a flood of varied data from a
myriad of sources. The flood of big data can overwhelm those who
are unprepared, but for those ready for big data, many new business
opportunities await. In this chapter, I explore how data has evolved
into big data and how big data is used in business.
As new technologies emerged, they generated a new type of data:
unstructured data. Unstructured data comes in a variety of data
types, sizes, and formats. Examples of unstructured data include
audio and video files, pictures and images, and unstructured text
streams such as mobile texts or social media posts.
Data is continually evolving and will continue to do so as technol-
ogy grows. Technology scientists, vendors, and businesses have
the challenge of keeping up with this evolution, and the latest
major evolutionary step is big data.
Beyond raw size, big data exceeds the capacity of existing tradi-
tional systems to store and process it; new technologies and pro-
cesses are required to make effective use of big data. Big data is
very often unstructured and not stored within an organization’s
corporate databases; it’s external and doesn’t neatly fit into pre-
defined formats.
»» Volume: The vast size of the data in terms of actual size and
number of data items
»» Velocity: How fast the data is being created and moves
across networks
»» Variety: Variation of data types including factors such as
format, structure, and source
Recently, two additional Vs, veracity and value, have increasingly
been added as big data is better understood and used within business.
It’s a safe statement that big data is defined by its volume, velocity,
variety, and veracity; these are all key factors to consider from
the technical perspective. Including the “value” of big data in its
definition recognizes that data varies in its business impact on an
organization and that value is an important factor in determining
where to store the data and how long to retain it.
Where in the world is all this data coming from? New technologies,
sensors on existing technologies, metadata (that is,
data about data), and data about nearly everything a person or
device does add to the data universe every second. Examples
include the following:
»» More than 7 billion mobile devices are in use today, roughly one
device per person on Earth. Wow! Furthermore, consider
the many apps on each device that generate location,
communication, purchase, picture, video, and social media
data.
»» Online activities for web users, ranging from browsing to
communications and commerce, are other common
examples. Usage patterns and preferences contained in
sessions yield a gold mine of useful data to be harvested.
»» Sensors in the everyday devices you use in your life.
Increasingly, telemetry and location data within cars, home
appliances, and entertainment devices are added as new
features and capabilities are introduced.
»» Sensors within medical, scientific, and manufacturing
devices. As each device becomes more capable and net-
worked, the data generated increases. Everything from
hospital beds tracking a patient’s detailed statistics to
sensitive controllers on the factory floor are examples.
Examples of how businesses can use big data are vast and indus-
try specific, but common use cases include
IN THIS CHAPTER
»» Highlighting the challenges of big data
Chapter 2
Understanding the
Challenges of Big Data
Big data promises incredible opportunities, but unfortunately,
a lot of work and complexity are involved in unlocking
those opportunities. Some challenges are obvious, while
others are more subtle. Beyond technical obstacles, the operational
and management challenges are often the most difficult to
address. Fortunately, the intelligent use of a data management
methodology can solve these challenges. In this chapter, I discuss
the challenges of big data and introduce big data management as
a solution.
»» Different data types, especially with unstructured data
»» Constant generation of new data to the point of system
overload
resource intensive and not repeatable. Or there may be
useful data, but it’s in a silo you aren’t aware of; thus, you
miss a full 360-degree view of the data.
»» Increasing security risks: Data breaches are big news and
legitimately can spell disaster for a company or organization;
no one wants their CIO in the newspapers for a data breach.
Sensitive data is a risk by itself, and aggregating seemingly
non-sensitive data can also become a security issue.
»» Lack of data governance: In the absence of a unified data
governance policy, either too many or too few controls
on data quality and data sharing are enforced, and
there’s no uniform process for accessing and managing
(curating) data. At best, your efforts to access, prepare, and
curate data are inconsistent and inefficient; at worst, you
can’t access data, can’t trust the data, or create a
potential security issue.
»» Too many emerging and changing technologies: Big data
is still evolving with new vendors, technology, and open
source projects. Keeping track of this shifting landscape is
difficult, and standardizing on a big data platform and
methodology is both technically and often politically
complex.
»» Value is difficult to unlock: Data in itself has little value;
finding the important relationships within a data universe to
identify actionable information is the real challenge. IT,
business, and management stakeholders must be equipped
with technology, policy, and the will to find and exploit
opportunities from data before real value is achieved.
Most savvy technology, business, and executive folks understand
that big data will deliver value someday; regrettably, despite their efforts,
that day is not today. Many big data projects have similar issues:
You can see in Figure 2-1 the complex ecosystem of big data proj-
ects with many data inputs, business outputs, cross-system pro-
cesses, and multi-disciplinary stakeholders; no wonder these are
difficult to manage.
The ultimate effect of these problems is a degradation of faith
in big data projects by business leaders. Executives conceptu-
ally understand big data can provide value, but for a multitude
of reasons, they become hesitant to aggressively pursue future
projects. This perception is unfortunate because once expecta-
tions are properly set and managed, coupled with the right data
management methodology, great results are possible. However,
to get past these initial hurdles, it’s necessary to understand why
big data projects experience problems.
not something staff can easily do in addition to their other
duties.
»» Custom, hand-coded solutions: As a by-product of the DIY
approach, inexperienced teams develop custom, hand-coded
solutions that aren’t repeatable for future efforts.
Common in data cleansing and integration processes,
these home-grown solutions are hard to develop initially and
aren’t reusable when new datasets or projects are introduced.
Thus, over time, a myriad of custom solutions are
written at great expense with limited future benefit.
»» Not leveraging appropriate tools and best practices:
Reinventing the wheel time and time again while ignoring
the expertise that more experienced big data practitioners (and
vendors) have developed wastes time and
resources. Taking too narrow a view without
leveraging the greater body of expertise and tools frequently
leads to frustration, delay, and poorer results.
Did you notice these common errors are more about methodol-
ogy than they are about raw technology? And their roots are in
next-generation challenges more than traditional challenges? If
you want to be successful with big data, you need to understand
and embrace big data management.
the focus is on the integration, governance, and security of big
data. However, to really understand big data management, one
must first comprehend the hierarchy of big data architectural
layers.
»» Security: Protecting the data from unauthorized access and
manipulation
IN THIS CHAPTER
»» Exploring the differences between a big
data laboratory and factory
Chapter 3
Building Blocks
of Effective Data
Management
If you’ve been reading this book straight through, at this point
you know what big data is and why big data projects encounter
problems; you are now ready to take a deep dive into big data
management to understand why it is core to a big data strategy.
I cover in detail the three pillars of big data management and what
they do and why they’re critical to your project’s success. Next,
you explore the processes associated with big data management in
detail. Finally, I end the chapter with help on how to empower
your team and make the most of the resources you already have in
place. This chapter provides the foundational knowledge to truly
understand big data management.
Understanding a Big Data Laboratory
versus Factory
Before I delve into the details of big data management, I must
apply context to the environments in which big data is used.
Depending on the environment, the requirements for an effective
implementation differ in several key ways; I explore those ways
and why they matter.
Big data factories take the model provided by the laboratory and
put those solutions into production. In big data analytics
use cases, these environments are managed by a team of IT specialists,
with business analysts reviewing and applying the resulting
insights to the business. However, more common use cases
(for example, next best offers, fraud detection, new data-driven
products, predictive maintenance, and so on) strive to deliver
actionable information directly to the end-user in real time.
This eliminates the need for a business user acting as a middleman
layer between the end-users and data. Big data factories are
focused on data products that provide actionable information
directly to end business users and consumers.
»» Factories implement the solution provided by the laborato-
ries to generate real business value.
»» Without factories, laboratories would have no reason to
exist.
»» Without laboratories, factories would have no solution to
implement in production.
Why does this matter in terms of big data management? Assum-
ing you accept that laboratories and factories are critical to the
big data operations within a company, those two distinct envi-
ronments must be managed appropriately. The needs and dif-
ferences of the environments must be respected and carefully
managed; recognizing these different approaches to integration,
governance, and security is important when evaluating big data
management platforms. Fortunately, big data management can
be architected to be flexible enough to meet the needs of labora-
tories and factories once you understand how it works.
»» Integration
»» Governance
»» Security
In Figure 3-1, you see the pillars of data management.
As you can see in Figure 3-1, there are only three pillars, but each
pillar encompasses multiple processes.
Integration
Integration ingests and processes data to achieve a result; this
processing must be scalable, repeatable, and agile. The longest
delays in big data projects occur during integration; smarter inte-
gration will reduce these time frames, automate processes, and
allow for rapid ingestion of new data. Key components of integra-
tion include
Governance
Governance defines the processes to access and administer data
and ensures that the data is of high quality, properly tagged and
cataloged, and fit for purpose. Essentially, the business and IT
teams must have confidence their data is clean and valid. Key
components of governance include
Security
Security identifies and manages sensitive data with a 360-degree
ring of risk assessment and analysis. Security must occur at the
source, not just at the perimeter. Identifying which data is sensitive
(credit card information, email addresses, postal addresses, Social
Security numbers, and other personally identifiable information)
and which data becomes sensitive when aggregated is a growing
challenge. Key components of security are
»» Data proliferation and risk analysis
»» Masking and encryption for sensitive data
»» Security policy creation and management
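To make the idea of discovering sensitive data a little more concrete, here is a minimal Python sketch. It is only an illustration and not how any particular product works: the pattern names, the regular expressions, and the find_sensitive function are all assumptions I made up for this example, and real discovery tools use far richer rules than three patterns.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text):
    """Map each pattern label to the matching strings found in text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if label == "card_number":
            # Keep only digit runs that also pass the Luhn checksum.
            matches = [m for m in matches if luhn_valid(re.sub(r"\D", "", m))]
        if matches:
            hits[label] = matches
    return hits
```

Calling find_sensitive on a free-text field returns a dictionary of the labels that were found, which you could then feed into a risk assessment or a masking step. The Luhn check is there to cut down on false positives from order numbers and other long digit strings.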
If you have weak governance, data will effectively become “locked up”
because there is no established process to “free” it. Every time you
want data, it’s a battle to gain access.
Security is huge, and many organizations rightfully protect their
data like a grizzly bear protecting her cubs. This can become an
obstacle for data access (ingestion as part of integration), espe-
cially if you can’t prove you have sufficient security and gover-
nance controls in place.
»» Master data: Organize your data into logical domains that
make sense to your business, such as customers, products,
and services. You can also add enrichment data to
paint a clearer picture of your customers, products,
and services and their relationships.
»» Secure data: A mix of governance and security allows you to
establish security rules and then implement those rules.
First, you must determine how you will manage your
sensitive data. Next, you must find and assess the risk of
your sensitive data and implement rules via policy and
technology. This process is very important but prone to be
under-addressed by those inexperienced in big data
management.
»» Explore and analyze data: Implement a data laboratory to
perform experiments with a clear business goal in mind.
Based on your hypotheses, find what data exists and how it
can be analyzed to create a model that delivers results. Then
determine if the results are beneficial to the business;
remember that providing actionable information and
processes is the goal. Develop best practices to enhance
agility and processes before pushing the solution into the
factory.
»» Explore and analyze for business needs: Test out data
products to see if they provide a real value for the business;
often you just need to try something to see if it works. It is
common to use A/B testing to determine if a new data
product adds value to the business. Make iterative improve-
ments over time as you learn what works, what doesn’t
work, and what can be improved.
»» Operationalize the insights: Automate and streamline your
processes to create a steady pipeline of actionable insights to
business users. It’s not enough to have occasional production
runs from the big data factory; the factory must be running
regularly to be truly productive, meet business service-level
agreements (SLAs), and achieve the expected ROI.
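The A/B testing mentioned in the list above can be sketched in a few lines of Python. This is a minimal example, assuming you simply count conversions for the existing experience (A) and the new data product (B) and compare them with a standard two-proportion z-test. The ab_test function and its parameter names are hypothetical, and a real experiment needs more care around sample size and test duration.

```python
from math import erf, sqrt


def ab_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    Returns (lift, z, p_value), where lift is rate_b - rate_a and
    p_value is the two-sided p-value of a two-proportion z-test.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the assumption that A and B do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (everyone or no one converted): nothing to test.
        return rate_b - rate_a, 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed with the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return rate_b - rate_a, z, p_value
```

For example, ab_test(100, 1000, 130, 1000) compares a 10 percent conversion rate against a 13 percent rate. A small p-value suggests the difference is unlikely to be chance alone, but whether the lift is worth operationalizing is still a business decision.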
The system will ingest data from data sources, clean, integrate,
and manage that data, and then pass it to analytic applications for
processing to develop insights and finally to business applications
in the form of actionable information, all while applying big data
management processes. Understanding the processes of big data
management enables you to better manage environments.
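The flow just described (ingest, clean, integrate, analyze, then deliver actionable information) can be sketched as a tiny Python pipeline. Treat this as a toy model under stated assumptions: the function names, the customer_id and amount fields, and the spend threshold are all hypothetical, and a production system would run these stages on a distributed platform rather than in a single script.

```python
from collections import defaultdict


def ingest(sources):
    """Pull raw records from every source into one stream."""
    for source in sources:
        yield from source


def clean(records):
    """Drop records without a customer ID and normalize field formats."""
    for rec in records:
        if not rec.get("customer_id"):
            continue
        yield {
            "customer_id": str(rec["customer_id"]).strip(),
            "amount": float(rec.get("amount", 0)),
        }


def integrate(records):
    """Combine records from all sources into one total per customer."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["customer_id"]] += rec["amount"]
    return dict(totals)


def analyze(totals, threshold):
    """Produce actionable information: customers at or above a spend threshold."""
    return sorted(cid for cid, total in totals.items() if total >= threshold)


def run_pipeline(sources, threshold=100.0):
    """Chain the stages: ingest -> clean -> integrate -> analyze."""
    return analyze(integrate(clean(ingest(sources))), threshold)
```

Because each stage is a separate, reusable function, a new data source only has to be added to the list passed to run_pipeline; nothing downstream needs to be hand-coded again.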
First, understand the role and needs of each team member or cat-
egory of member. There will be a mix of data scientists, modelers,
analysts, stewards, engineers, and business users, all with differ-
ent perspectives, skill levels, and needs. Some will require greater
self-service autonomy (in the laboratory environment), while
others require operational agility (in the factory environment);
your job is to identify their needs within the big data environment.
Next, get help for your team in terms of training, effective technology,
outside experts, and vendor experience. Odds are your
team is already overworked; why make them do things the “hard
way” by denying them the tools and expertise that would increase
their effectiveness? Forcing your team to work in isolation, devoid of the
great work already done with big data, will send the team down a
path of one-off, custom solutions, manual processes, and tedious
work that is not reproducible. That DIY approach results in frustration
for the team and costly lost opportunities for the business.
Finally, consider what you can do with what you already have
by creating repeatable, automated processes and standardized
technologies. Rather than reinventing the wheel and expending
resources for each new project or dataset, seize opportunities
where you can
Taking steps to empower your big data staff isn’t just right for
them as employees; it yields benefits for the company as well.
Your people are an investment, and those in the big data field
know their value. There’s an industry shortage of qualified data
scientists, data engineers, and those who have knowledge and
experience in the big data world, and that shortage is expected
to increase in the near future. You must be willing to develop and
retain your highly skilled big data workforce; otherwise they may
go elsewhere under favorable market conditions.
IN THIS CHAPTER
»» Identifying common use cases and
drivers in big data management projects
Chapter 4
Using Big Data
Management
in the Wild
Understanding the foundational concepts of big data management
is essential, but you must also understand how
the concepts exist in the practical world. Businesses initiate
big data management projects for various purposes and from
multiple perspectives; it’s important to understand the drivers of
those efforts. I explain how to identify key attributes in big data
management tools and how to use those tools effectively within
the principles of big data management in production environments.
Finally, I identify some
useful toolsets and highlight business experiences with those
tools. In this chapter, you gain an understanding of how to merge
the big data management principles with real-world business
operations.
Implementing Big Data Management
in Business
Companies initiate big data management projects for a variety of
reasons. While the industries and circumstances may vary greatly,
most projects originate from two emphases:
»» Business centric
»» IT centric
Business-centric big data management projects are just that:
focused on generating a business benefit. Often, the intent is to
generate new or additional revenue where it is relatively easy
to calculate the ROI. In other cases, more subtle benefits occur,
such as better understanding the preferences or relationships of
prospective customers or improving existing business processes.
Even more subtle benefits include avoiding or detecting specific
conditions such as fraud or claims, or preventing a component
failure via the Internet of Things (IoT). These projects are frequently
initiated by business analysts and executives.
For example, IT may create a new data lake or build a Hadoop
cluster that provides enhanced capability for the organization,
but in isolation these projects don’t generate revenue unless they
support a business process or initiative. These projects are often
generated by IT as a consolidation or modernization initiative or
in response to business requests to explore a new capability.
These use cases are just the tip of the big data iceberg; below are
examples of real companies’ positive experiences:
Putting big data management in the context of business-centric and
IT-centric projects is beneficial in understanding how big data
management can help your business.
»» Master Data Management: Enforce and ensure the
accuracy and accountability of the critical data in an organi-
zation to provide a common point of reference and truth.
»» Data Masking: De-identify, obfuscate, or otherwise obscure
sensitive data, such as credit card numbers, so relational
integrity is maintained, yet key sensitive values aren’t
accessible.
»» Data Security Analytics: Analyze and assess the risk of a
data security breach by identifying the location and proliferation
of sensitive data and tracking its usage.
»» Streaming Analytics: Collect, process, and analyze multi-latency
data (including real-time data) to provide event-based
insights and alerts within the time interval of maximum
business impact (often in real time or near real time).
»» Big Data Analytics: Apply analytical formulas and algo-
rithms to datasets to answer questions based on big data;
these algorithms are employed by data experts to test
hypotheses and validate analytic models used to improve
business outcomes.
»» Data Lakes: Collect and store all types of data as originally
sourced for use as a live archive, for data exploration, and as an
operational data store for pre-processing and preparing data
for big data analytics.
»» Data Warehouses: Collect and store structured data into a
large repository for the purpose of applying analytics and
generating reports.
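The data masking category in the list above says that sensitive values are obscured while relational integrity is maintained. One way to picture that, and only one of several possible techniques, is deterministic tokenization with a keyed hash: the same input always produces the same token, so masked tables can still be joined. The sketch below uses Python's standard hmac and hashlib modules. The SECRET_KEY value, the tok_ prefix, and the mask and mask_column helpers are hypothetical names for this illustration, not the method used by any specific masking product.

```python
import hashlib
import hmac

# Hypothetical key for illustration; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask(value, key=SECRET_KEY):
    """Replace a sensitive value with a deterministic, non-reversible token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep a short prefix of the digest so tokens stay readable in reports.
    return "tok_" + digest[:16]


def mask_column(rows, column):
    """Return copies of the rows with one column masked; other fields are untouched."""
    return [{**row, column: mask(row[column])} for row in rows]
```

If a customers table and an orders table both hold the same card number, masking the column in each table with the same key yields the same token in both, so the join between them still works even though the original number is no longer visible. Note that this sketch does not preserve the original format of the value; format-preserving masking is a separate technique.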
When evaluating tools and software packages, ask “What does this
actually do, and where does it fit within my big data architecture?”
If you can’t find a satisfactory answer about what a product
does or how it complements or replaces an existing technology
within your IT infrastructure, be wary: it may have
limited or no value. The same concept applies to big data management
tools; if the prospective tool doesn’t include functionality
listed in the above categories, there may be more marketing
hype than substance.
Leveraging the Right Tools
It isn’t enough to simply have tools; to be effective you must have
the right tools to complete the task at hand. The challenge is, how
do you know what the right tools are? Specifically, what attributes
make one tool more desirable than another? In an industry
rife with marketing hype and buzzwords, you must know how to
identify real value.
One great way to start is applying the three pillars of big data
management. If the tool relates to one or more processes within
the pillars of integration, governance, or security, then odds are
you are on the right track. Next, as discussed in the preceding section,
determine what function or work the tool actually performs;
it should be clearly defined with a demonstrated purpose or output.
Finally, drill down into each tool to
determine which ones support forward-looking, enterprise-grade
features such as
»» Integration
• High volume multi-latency ingestion
• Optimized for powerful, scalable processing
• Rapid deployment across varied environments
»» Governance
• Collaborative self-service approach
• 360-degree view of data relationships
• Fit-for-purpose data
»» Security
• Complete discovery and view of sensitive data
• Analysis and assessment of security risks
• Risk-centric and policy-based security
The ability to distinguish between okay versus great tools and
needless fluff versus real features is important for any IT pro-
fessional, not just those working in big data. By applying the
methodologies above, you will more accurately identify quality
tools warranting further investigation and discard tools providing
lesser value.
»» A single point of contact for issues, training, and expertise
»» Features and enhancements made possible only by vendors
with extensive expertise and large R&D engineering
departments
»» Comfort level and compliance assurance that a paid vendor
is behind the product
The approach shown in Figure 4-1 is common in the big data
world. It mixes open source tools (Hadoop, Spark) with a vendor
big data management engine (Informatica Blaze) and a universal
metadata catalog (Informatica Live Data Map), applying big data
management processes to technology to deliver integration,
governance, and security.
INTRODUCING INFORMATICA BIG DATA MANAGEMENT PRODUCTS V10
Informatica, a leader in data management technology, has recently
released its v10 family of new and upgraded products. Providing the
three pillars of big data management, these tools merit investigation.
Chapter 5
Ten Essential Tips for Succeeding with Big Data Management
IN THIS CHAPTER
»» Designing use cases with business value
Managing big data is the key to successful big data projects.
Beyond technology, the management techniques deployed make
the difference between success and failure. In this chapter, I
identify tips and techniques to make you more effective at
managing big data in the real world.
A better way is to establish use cases that deliver smaller victories
earlier in the process. Using smaller, agile teams that focus
on rapid, iterative development practices to show value early
has many benefits. First, agile development solves many of
the challenges and mitigates many of the risks found in larger
projects; plus, agile is the favored development methodology in
many organizations. Next, a project that shows value early is easier
to "sell" to management initially and to sustain as the project
continues. Finally, smaller, more realistic goals are easier to achieve
and build the confidence and capability of agile teams. When
designing your use cases and assembling your teams, focus on
quicker wins that show a benefit rather than risking an overly
ambitious project.
analytic tools to explore possibilities. In other cases, business ana-
lysts use reporting and BI tools to make key decisions. Many other
examples exist, and they all have different data requirements.
The power of the data lake is that the same repository of raw data
can be used for different use cases. Data scientists can use the
data lake for their research while business analysts access more
curated and governed datasets for their operational requirements.
Sharing the same data lake for different purposes adds flexibil-
ity to the organization without the overhead cost of redundant,
purpose-specific data marts.
and continuous monitoring of results. The cycle is repeated as
needed, and changes are identified and implemented. Throughout
this cycle, careful collaboration between business and IT stake-
holders occurs.
Tools and processes exist to identify problem data. After data has
been initially ingested, cleansed, and processed, you can begin
applying data scorecards. You must define data profiles for
the data quality scorecards and rules to be applied to the data.
Once the rules are applied, you can address exceptions in the data
both automatically and through alerts that require human
intervention, and monitor scorecard results. Use of data scorecards
will help ensure data quality is maintained.
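The scorecard idea can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not how any particular
product implements scorecards: the Rule class, the apply_scorecard
function, and the two sample rules are all hypothetical names made up
for this example. Each rule either passes, is repaired automatically,
or raises an alert for a person to review, and the score is the share
of records that passed each rule as ingested.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict  # hypothetical record shape: one dict per row


@dataclass
class Rule:
    """A hypothetical data quality rule with an optional automatic fix."""
    name: str
    check: Callable[[Record], bool]                    # True when the record passes
    fix: Optional[Callable[[Record], Record]] = None   # automatic repair, if any


def apply_scorecard(records, rules):
    """Apply every rule to every record.

    Returns (cleaned_records, scores, alerts), where scores maps each rule
    name to the fraction of records that passed before any repair, and
    alerts lists (record_index, rule_name) pairs needing human review.
    """
    passed = {rule.name: 0 for rule in rules}
    alerts = []
    cleaned = []
    for index, record in enumerate(records):
        current = dict(record)  # work on a copy; leave the input untouched
        for rule in rules:
            if rule.check(current):
                passed[rule.name] += 1
                continue
            if rule.fix is not None:
                repaired = rule.fix(current)
                if rule.check(repaired):
                    current = repaired  # exception handled automatically
                    continue
            alerts.append((index, rule.name))  # exception needs a person
        cleaned.append(current)
    total = len(records)
    scores = {name: (count / total if total else 1.0)
              for name, count in passed.items()}
    return cleaned, scores, alerts


# Illustrative rules and data (assumptions, not from the book).
RULES = [
    Rule("country_is_upper",
         check=lambda r: r.get("country", "").isupper(),
         fix=lambda r: {**r, "country": r.get("country", "").upper()}),
    Rule("email_present", check=lambda r: bool(r.get("email"))),
]

records = [
    {"email": "a@example.com", "country": "US"},
    {"email": "", "country": "de"},
]

cleaned, scores, alerts = apply_scorecard(records, RULES)
```

Running the rules again on a schedule and tracking the scores over time
gives a simple form of the continuous monitoring described above.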
give you a much more complete view and, therefore, control of
your data.
Implementing process repeatability is a key to success. First, avoid
custom programming solutions unique to a situation and manual
processes. Just as code reuse is a key to effective programming,
standardizing and reusing processes and logic are highly benefi-
cial with data management. Processes and logic related to data
cleansing and integration are often good candidates to standard-
ize and reuse. Find and document patterns in processes and logic
that your teams can reuse time and time again to speed up delivery
and reduce their workload so they can focus on more meaningful
efforts. Leverage these reusable patterns and logic for tasks such
as data ingestion, web log processing, ELT offloading, address
validation, masking credit card numbers, and so on.
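As one small example of reusable logic, the credit card masking task
mentioned above could be packaged as a single function that every
pipeline calls instead of each team writing its own. The sketch below
is an assumption-laden illustration: the function name
mask_card_numbers and the CARD_PATTERN expression are hypothetical,
and the pattern simply matches 13 to 19 digits with optional spaces or
hyphens. It does not validate card numbers (for example, no Luhn
check), so a production rule would need to be stricter.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens. Deliberately simple; it performs no card validation.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def mask_card_numbers(text, visible=4, mask_char="*"):
    """Mask all but the last `visible` digits of card-like numbers in text.

    Separators (spaces, hyphens) are preserved so the masked value keeps
    the same layout as the original.
    """
    def _mask(match):
        raw = match.group(0)
        total_digits = sum(ch.isdigit() for ch in raw)
        to_hide = max(total_digits - visible, 0)
        out = []
        for ch in raw:
            if ch.isdigit() and to_hide > 0:
                out.append(mask_char)
                to_hide -= 1
            else:
                out.append(ch)
        return "".join(out)

    return CARD_PATTERN.sub(_mask, text)


masked = mask_card_numbers("Card 4111 1111 1111 1111 on file")
# masked == "Card **** **** **** 1111 on file"
```

Because the logic lives in one place, a change in masking policy (say,
showing the last two digits instead of four) is made once and picked up
by every process that reuses the function.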
other factors at play including Fear, Uncertainty, and Doubt (the
FUD factor) or simply not having enough time or resources, which
prevent automation.
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.