Data Engineering Teams
Jesse Anderson
Contents

1 Introduction
   About This Book
   Warnings and Success Stories
   Who Should Read This
   Navigating the Book Chapters
   Conventions Used in This Book

4 Data Engineers
   What Is a Data Engineer?
   What I Look for in Data Engineers
   Qualified Data Engineers
   Not Just Data Warehousing and DBAs
   Ability Gap

5 Productive Data Engineering Teams
   How to Work with a Data Science Team
   How to Work with a Data Warehousing Team
   How to Work with an Analytics and/or Business Intelligence Team
   How I Evaluate Teams
   Equipment and Resources

9 Conclusion
   Best Efforts
   About the Author
CHAPTER 1
Introduction
About This Book
You’re starting to hear, or have been hearing for a while, your engineers say “can’t.”
The can’ts always revolve around processing vast quantities of data. Sometimes,
you’re asking for an analysis that spans years. You’re asking for processing that
spans billions of rows or looks at every customer you have. Other times, you’re
just asking for the day-to-day reports and data products that are the lifeblood of
your business. Every time, there is a technical limitation that’s keeping you from
accomplishing your goal. You find yourself having to compromise every single
time due to the technology.
You’re sitting on a big pile of data. A BIG pile. Buried somewhere in there
are a bunch of really, really valuable insights. The kind of insights that could
fundamentally and deeply change your business. Maybe you could more deeply
understand your customers and reveal every time they’ve purchased from you.
Maybe it has to do with fraud or customer retention.
You, or your boss, have heard about Big Data, this magic bullet that everyone
is talking about. It’s allowing companies to free themselves of their technical
limitations. It’s also allowing your competition to move past you.
Here’s the good news: I can help. In fact, I’ve been helping companies do this for
the past 6+ years.
Here’s the not-so-good news: the process isn’t magic and Big Data only seems
like a magic bullet to the outside observer. In fact, a lot of hard work goes into
making a successful Big Data project. It takes a team with a lot of discipline, both
technical and managerial.
Now back to the good news. I’ve seen, helped, and mentored a lot of companies
as they’ve gone through this process. I know the things that work, and more
importantly, I know the things that don’t. I’m going to tell you the exact skills
that every team needs. If you follow the strategies and plans I outline in this
book, you’ll be way ahead of the 85% of your competitors who started down the
Big Data road and failed.
When your competitors or other companies talk about Big Data technologies
failing or their project failing due to the technologies, they’re not recognizing a
key and fundamental part of the project. They didn’t follow the strategies and
plans I outline in this book. They lay the blame squarely on the technology.
The reality lies several months or a year earlier, when they started the project
off on the wrong foot. I’ll show you how to start it off right and increase your
probability of success.
A cautionary story
These will be cautionary tales from the past. Learn from and avoid those mistakes.

A success story
These will be success stories from the past. Learn from and imitate their success.

A story from the field
Here’s a story that will give you more color on the subject. These will be stories or comments from my experience in the field. They will give you more background or color on a topic.
Figure 2.4: More complex architecture diagram for a Hadoop project with real-time components.
You’ll see me say that Data Engineers need to know 10 to 30 different technologies in order to choose the right tool for the job. Data engineering is hard because we’re taking 10 complex systems, for example, and making them work together at a large scale. There are about 10 shown in Figure 2.4. To make the right decision in choosing, for example, a NoSQL cluster, you’ll need to have learned the pros and cons of five to 10 different NoSQL technologies. From that list, you can narrow it down to two to three for a more in-depth look.

During this period, you might compare, for example, HBase and Cassandra. Is HBase the right choice, or is Cassandra? That comes down to you knowing what you’re doing. Do you need ACID guarantees? There is a plethora of questions you’d need to ask to choose one. Don’t get me started on choosing a real-time system.
I’ve had Software Engineers say they can teach my Big Data technology classes. They’re under the impression Big Data is easy. I’ve had Software Engineers say they’ll be done with a four-day course in half a day. I’ve had DBAs tell me it’s no different than a data warehouse. They think I don’t know what I’m talking about or that I’m overstating the difficulty.

When this happens, I ask them to sit tight. You may be one of the very, very, very few outliers, because outliers do exist. But more than likely you’re right there in the middle of the bell curve, and this will be complex for your team.
Distributed Systems
Programming
This is the skill for someone who actually writes the code. They are tasked
with taking the use case and writing the code that executes it on the Big Data
framework.
The actual code for Big Data frameworks isn’t difficult. Usually the biggest
difficulty is keeping all of the different APIs straight; programmers will need to
know 10 to 30 different technologies.
I also look to the programmers to give the team its engineering fundamentals. They’re the ones expecting continuous integration, unit tests, and engineering best practices.
Analysis
A data engineering team delivers data analysis as a product. This analysis can range
from simple counts and sums all the way to more complex products. The actual
bar or skill level can vary dramatically on data engineering teams; it will depend
entirely on the use case and organization. The quickest way to judge the skill
level needed for the analysis is to look at the complexity of the data products.
Are they equations that most programmers wouldn’t understand or are they
relatively straightforward?
Other times, a data product is a simple report that’s given to another business
unit. This could be done with SQL queries.
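
As a sketch of the simpler end of that spectrum, here is what a count-and-sum data product might look like in Spark’s Java API. This is only an illustration; the dataset path and column names are hypothetical.

    import static org.apache.spark.sql.functions.count;
    import static org.apache.spark.sql.functions.lit;
    import static org.apache.spark.sql.functions.sum;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DailyReport {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("daily-report")
                    .getOrCreate();

            // Hypothetical purchases dataset; the path and columns are illustrative.
            Dataset<Row> purchases = spark.read().parquet("/data/purchases");

            // A simple count-and-sum data product, grouped per customer.
            purchases.groupBy("customerId")
                    .agg(count(lit(1)).alias("orders"),
                         sum("amount").alias("revenue"))
                    .show();

            spark.stop();
        }
    }

A real report would write its output to a table or file for the other business unit rather than printing it.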
Very advanced analysis is often the purview of a Data Scientist. I’ll talk about
Data Scientists and data engineering teams in Chapter 5.
Common titles with this skill: Software Engineer, Data Analyst, Business Intelligence Analyst, DBA
Visual Communication

A data engineering team needs to communicate its data products visually. This is often the best way to show what’s happening with data, especially vast amounts of it, so others can readily use it. You’ll often have to show data over time and with animation. This function combines programming and visualization.
A team member with visual communication skills will help you tell a graphic
story with your data. They can show the data not just in a logical way, but with
the right aesthetics too.
Common titles with this skill: Software Engineer, Business Intelligence Analyst,
UX Engineer, UI Engineer, Graphic Artist
Verbal Communication
The data engineering team is the hub in the wheel where many spokes of the
organization come in. We’ll talk more about that concept later in Chapter 5,
“Productive Data Engineering Teams,” but the manifestation is important. You
need people on the team who can communicate verbally with the other parts of
your organization.
Your verbal communicator is responsible for helping other teams be successful
in using the Big Data platform or data products. They’ll also need to speak to
these teams about what data is available. Often, data engineering teams will operate like internal solutions consultants.
This skill can mean the difference between increasing internal usage of the
cluster and the work going to waste.
Common titles with this skill: Software Architect, Software Engineer, Technical
Manager
Project Veteran
A project veteran is someone who has worked with Big Data and has had their solution in production for a while. This person is ideally someone who has extensive experience in distributed systems or, at the very least, extensive multithreading experience. This person brings a great deal of experience to the team.
Schema
The schema skill is another odd skill for a data engineering team, because
it’s often missing. Members with this skill help teams lay out data. They’re
responsible for creating the data definitions and designing its representation
when it is stored, retrieved, transmitted, or received.
The importance of this skill really manifests as data pipelines mature. I tell
my classes that this is the skill that makes or breaks you as your data pipelines
become more complex. When you have 1 PB of data saved on Hadoop, you can’t
rewrite it any time a new field is added. This skill helps you look at the data you
have and the data you need to define what your data looks like.
Often, teams will choose a small data format like JSON or XML. Having 1 PB of
XML, for example, means that 25% to 50% of the information is just the overhead
of tags. It also means that data has to be serialized and deserialized every time it
needs to be used.
The schema skill goes beyond simple data modeling. Practitioners will understand the difference between saving data as a string and a binary integer. They
also advocate for binary formats like Avro or Protobuf. They know to do this
because data usage grows as other groups in a company hear about its existence
and capabilities. A format like Avro will keep the data engineering team from
having to type-check everyone’s code for correctness.
Common titles with this skill: DBA, Software Architect, Software Engineer
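
To make the schema discussion concrete, here is a minimal sketch of defining a schema with Avro’s SchemaBuilder. The record and field names are hypothetical; the point is that fields are stored as typed binary values, and new fields can carry defaults so the data already on disk never has to be rewritten.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class CustomerSchema {
        public static void main(String[] args) {
            // A hypothetical customer record; names are illustrative only.
            Schema schema = SchemaBuilder.record("Customer")
                    .namespace("com.example.pipeline")
                    .fields()
                    .requiredLong("customerId")   // a binary long, not a string of digits
                    .requiredString("name")
                    // A later addition with a default: old records still deserialize,
                    // so the petabyte already stored never needs rewriting.
                    .name("loyaltyTier").type().stringType().stringDefault("none")
                    .endRecord();

            System.out.println(schema.toString(true));
        }
    }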
Domain Knowledge
Some jobs and companies aren’t technology focused; they’re actually domain expertise focused. These jobs focus 80% of their effort on understanding and implementing the domain. They focus the other 20% on getting the technology working.
Now you’re going to fill out the paper form or spreadsheet. Go through everyone on the team and put an “X” wherever that team member has that particular skill. This requires a very honest look—doing this analysis as honestly as possible could be the difference between success and failure for the team.

Now that you’ve placed the “X”s on everyone’s skills, total them up. This will tell you where the team’s gaps are.
Figure 3.2: A gap analysis for an average DBA or data warehouse team with a simple analysis use case
A very large retailer started their Big Data journey. They assigned the project to their data warehouse team. The team didn’t have anyone with programming or distributed systems skills. They decided to overcome this shortcoming by purchasing an off-the-shelf Big Data solution that said it didn’t require programming skills.

The off-the-shelf solution didn’t solve their need for the distributed systems skill. The team wasn’t able to understand how to create the system and how the data needed to flow through it in order for business value to be created.

The project eventually exhausted its budget and, worse yet, had little to nothing of value to show for the million dollars and months of work.
Let’s take another example. Say we just took our software engineering team
and didn’t make it multidisciplinary. You may have several different low or
absent skills—in this case, distributed systems, analysis, visual communication,
verbal communication, project veteran, and schema (Figure 3.3). This is another
way that Big Data projects fail; they just fail later on when you’re creating an
enterprise-ready solution.
Figure 3.3: A gap analysis for an average enterprise software engineering team
Other times you won’t have any zeros. Congratulations! Or maybe not. I’ve found that teams overestimate their abilities and skills. Take another honest look and make sure you aren’t deluding yourself. If you really have a gap, you need to address it.
Diversity in Teams
Having taught extensively, I can say a diverse team performs much better than a uniform team. Monocultures lead to groupthink, which leads to a less innovative team, because everyone has similar ideas and thoughts.
The data engineering team is multidisciplinary and should be diverse. This diversity manifests in many different ways, from gender and ethnicity to socioeconomic backgrounds. This is how teams become more innovative and performant.
When I’m teaching, I’ve noticed that members of a uniform team often have
the same misunderstanding or inability to understand a subject. They’ll have
the same or similar idea about execution. A diverse team will have some people
who understand and others who don’t. The people who do understand will help
explain the subject to those who don’t. A diverse team also will have more ideas
about execution and they can choose the best one.
Operations
I’m often asked if a data engineering team also should be engaged in operations.
I’ll start out with an indirect answer. With the cloud, we should be asking
ourselves why we’re still in the operations business at all. I think managed
services pay off and we shouldn’t be in the operations game.
That said, the cloud isn’t a possibility for everyone. You have to do operations on
an on-premises cluster.
I’ve seen all different models, but I haven’t seen one work better overall than another. There are:

• Data engineering in one team and operations on a different team
• DevOps models where one team does both the engineering and operations
• Data engineering and final tier support in one team and lower tier operational support in another team
You definitely don’t want a throw-over-the-fence model. This is where the Data Engineering team is so isolated from production issues that they don’t know how their code is acting—or acting up—in production.
Some Big Data technologies aren’t built with operations in mind. Sometimes they are operationally fragile or require a massive amount of knowledge. Some of these technologies require such a massive amount of knowledge of the technology and the code that you need DevOps from the beginning. When I teach those classes, I give operational requirements and programming requirements equal weight. The team is encouraged to use the DevOps model.
Quality Assurance
Another common issue is to figure out how a quality assurance (QA) team works
with a data engineering team.
In general, there is a lack of automated tools for doing QA on Big Data pipelines.
This is a question I’m often asked when teaching QA folks—how do I test this
thing? The answer is complicated. It involves manual scripting and writing unit
tests.
That means that your QA team needs to be made up of Quality Assurance Engineers who write this test code. Most QA teams lack the programming skills to create these
scripts and tests. This is where the data engineering team will have to write
many of these tests to allow the QA team to automate their testing.
My recommendation is that Data Engineers should be writing unit tests no
matter what. Usually the sorts of tests and bindings the QA team needs are
different. This is where some time on the schedule needs to be allocated for the
Data Engineers to work with the QA team.
CHAPTER 4
Data Engineers
What Is a Data Engineer?
A Data Engineer is someone who has specialized their skills in creating software
solutions around data. Their skills are predominantly based around Hadoop,
Spark, and the open source Big Data ecosystem projects.
They usually program in Java, Scala, or Python. They have an in-depth knowledge of creating data pipelines.
Other languages
I’m often asked about other languages. Let’s say a team doesn’t
program in Java, Scala, or Python. What should they do? The
answer is very team-specific and company-specific. Do you
already have a large codebase in that language? How difficult is
it to use the Java bindings for that language? How similar is that
language to Java?
The data pipelines they create are usually reports, analytics, and dashboards.
They create the data endpoints and APIs for others to access data. This could
range from REST endpoints for accessing smaller amounts of data to helping
other teams consume large amounts of data with a Big Data framework. More
advanced examples of data pipelines are fraud analytics or predictive analytics.
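
As a rough illustration of the smaller end of that range, here is a minimal data endpoint sketch using the JDK’s built-in HttpServer. The path and payload are hypothetical; a production endpoint would pull pre-aggregated results from a serving store instead of returning a constant.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class DataEndpoint {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

            // Hypothetical REST endpoint serving a small, pre-aggregated result.
            server.createContext("/api/daily-revenue", exchange -> {
                byte[] body = "{\"date\":\"2021-03-05\",\"revenue\":10423.50}"
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });

            server.start();
        }
    }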
The field of Big Data is constantly changing. A Data Engineer needs to have
in-depth knowledge of at least 10 technologies and know at least 20 more. You
might have read that and thought I was exaggerating or you didn’t believe me.
Based on my interactions with companies, that is exactly what’s needed.
I’m often asked how it’s possible for a general-purpose Software Engineer to
know so many different technologies. It isn’t. It’s only possible for generalist
Ability Gap
As a trainer, I’ve lost count of the number of students and companies I’ve taught.
One thing is common throughout my teaching: there is an ability gap. Some
people simply won’t understand Big Data concepts on their best day.
Industry analysts usually talk about skills gaps when referring to a new
technology. They believe it’s a matter of an individual simply learning and
eventually mastering that technology. I believe that too, except when it comes to
Big Data.
Big Data isn’t your average industry shift. Just like all of the shifts before it, it’s
revolutionary. Unlike the previous shifts, the level of complexity is vastly greater.
This manifests itself in the individuals and students trying to learn Big Data. I’ll talk with people and figure out that they have no chance of understanding the Big Data concept I’m discussing. They’ve simply hit their ability gap. They’ll
keep asking the same question over and over in a vain attempt to understand.
Big Data is hard. Not everyone will be able to be a Data Engineer. They may be
part of the data engineering team, but not be a Data Engineer.
CHAPTER 5
Productive Data Engineering Teams

Figure 5.1: The data engineering team as the hub of data pipeline information for the organization
This hub concept is often foreign or not thought about. The organization doesn’t
realize that their project will grow in scope. As the other parts of the organization
begin to consume the data or use the data pipeline, they’re going to need help.
This help comes as a result of an ability gap, a team with no programming skills,
or another programming team that needs help.
Figure 5.2: How Data Scientists and Data Engineers work together. Based on Paco Nathan’s original diagram. Used with permission.
In Figure 5.2, I show how tasks are distributed between data science and data
engineering teams. I could, and have, talked about this diagram for an hour. The
duties are shown as more of a data science or data engineering duty by how close
they are to the top or bottom of the center panel. Notice that very few duties are
solely a data science or data engineering duty.
There are a few points I want you to take away from this diagram. A data engi-
neering team isn’t just there to write the code. They’ll need to be able to analyze
data, too.
Likewise, Data Scientists aren’t there just to make equations and throw them over the fence to the data engineering team. Data Scientists need to have some level of programming skill. If this becomes the perception or reality, there can be
a great deal of animosity between the teams. In Figure 5.3, I show how there
should be a high bandwidth and significant level of interaction between the two
teams.
Ideally, there’s a great deal of admiration and recognition between the two teams.
Together or separate?
I find the usual reason for data engineering and data science be-
ing together or separate comes down to how they were started.
Usually, the data science team comes out of the analytics or
business intelligence side of the organization, and the data en-
gineering team comes from its engineering side. For the two
teams to be merged, they’d have to cross organizational bound-
aries, and that doesn’t happen very often.
One suggestion that I don’t see used often is to pair-program. I’m not saying
everything should be done as pair programming, but I’m suggesting that teams
do so when it makes sense. A Data Engineer and a Data Scientist would program
together. This way, the resulting code is much better and the knowledge is
transferred immediately. The code would be checked as it’s written for quality
by the Data Engineer and for correctness by the Data Scientist.
Unit Testing
I also check whether the team is unit testing their code. Data pipelines are
really complex. How do you know if the code changes you made will cause
other problems? Without good unit tests, the team’s productivity will grind to a
halt. The team will have a complex system with no guarantees it works. This is
unfortunately one of the most common failures in data engineering teams. A
large portion of the blame goes to the Big Data technologies, which don’t have unit testing as part of the project.
Let me give you an example of why this is important. I evaluate data engineering
teams on how fast they can replicate a bug. If you have a problem in your production data pipelines and your Data Engineer can’t replicate the bug in their IDE in 10 minutes or less, you have two problems: a production problem and a unit testing problem. Data pipelines should be logging out problems or problem
data. A Data Engineer should be able to take that log output, put it into an
already written unit test, and replicate the problem. If your team can’t do that,
you’re going to be forever chasing and fixing issues.
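
Here is a minimal sketch of that workflow, assuming JUnit 5 and Java 16+; the parse step and event type are hypothetical stand-ins for a real pipeline transformation. The Data Engineer pastes the logged record into the test and asserts the expected behavior.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    public class EventParserTest {

        // A hypothetical pipeline step: parse one logged CSV record into a typed event.
        record ParsedEvent(String customerId, String type, double amount) {}

        static ParsedEvent parse(String line) {
            String[] fields = line.split(",");
            return new ParsedEvent(fields[1], fields[2], Double.parseDouble(fields[3]));
        }

        @Test
        public void replicatesProductionBug() {
            // Paste the offending record straight from the pipeline's log output.
            String logLine = "2021-03-05T10:15:30Z,cust-42,PURCHASE,19.99";

            ParsedEvent event = parse(logLine);

            // Assert the expected behavior; a failing run replicates the bug in the IDE.
            assertEquals("cust-42", event.customerId());
            assertEquals(19.99, event.amount(), 0.001);
        }
    }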
Another core reason for unit testing is to have a full stack or integration test.
You’ll recall that data pipelines are made up of 10 to 30 different technologies.
Each one of those technologies has a different API for doing unit tests. If you don’t have a unit test that flows all the way from the very beginning of the data pipeline to the end, you won’t know that the pipeline works as a whole.
Automation
Automation is another multifaceted and key metric for a team. Going from small
data to Big Data represents a significant increase in complexity. The current
team is experiencing and creating that increase in complexity as it happens.
Experiencing these changes in deltas, instead of all at once, makes it easier for
the current team to understand the system.
The major difficulty arises when new people join the team. How long will it
take for them to be productive? Let’s say with your small data system it took
one month. With Big Data, you could create a system that takes six months to
understand—and that’s apart from all the domain knowledge that’s necessary.
This thought should factor into your hiring and overall system design. Are you
creating a system so difficult only its designers will understand it?
How long does it take for a new person on the team to set up their development
environment? Setting up a Big Data environment on a developer’s machine isn’t
an easy task. This can take several days of trial and error. There are many different processes that need to run, and the steps to configure aren’t documented
well. Additionally, this isn’t a developer’s core competency. This is something
that should be scripted, or, in my opinion, done on a virtual machine (VM).
This allows developers to have a VM already set up and running the necessary
processes. More importantly, it makes every developer run the same versions of
the software.
A related issue is how fast you can deploy a fix or new code to production. How many manual steps are there? With small data systems, teams can squeak by
doing things manually. With Big Data, this changes because you’re dealing with
30 different computers or instances. The same tools and techniques just break
down. You’ll need to be able to deploy code quickly—ideally with the push of a
button.
A team that lacks production automation will suffer from having 30 machines all configured in slightly different ways. It will have program versions that are potentially different codebases. It will be soaking up operations time on deploying, instead of fixing the problem. All these are recipes for difficult production operations.
With small data, you can use virtually whichever technology you already use. You
come in with a technology stack and use it for everything. For Big Data, almost
every single engineering decision comes down to the use case. If you don’t know
your use case, you won’t be able to make these decisions. When I teach data
engineering teams, I tell them not to even start looking at technologies until
they’ve clearly identified the use case.
Choosing technologies first and then looking at use cases often leads to failure.
There is a wide spectrum of possible levels of failure. The best-case failure is that
the project takes longer or much longer to complete. The worst-case failure is
that the use case isn’t possible at all with the technology choices. I’ve seen all levels in the spectrum, from teams who’ve lost a million dollars and six months by using a technology incorrectly, to those who lost a month and $150,000 by choosing the wrong technology.
Crawling
The crawling phase is doing the absolute minimum to start using Big Data. This might be as simple as getting your current data into a Big Data technology. For
highly political organizations, this is the phase where you try to get groups to
start working together. Ideally, you’re getting other groups that are opposed to
Big Data to buy into the value.
A good stretch goal for this phase is to automate the movement of data to and
from your data pipeline. You’ll want to continually bring in data.
In this phase, you’ll start on your ETL coding and data normalization. This will set you up for success as you start analyzing the data. This phase has minimal direct business value; the real value comes in the later phases.
Walking
The walking phase builds on the solid foundation you laid while crawling. Everything is ready for you to start gaining value from your data. If you didn’t build a solid foundation, everything will come crashing down.
In this phase, you’re starting to really analyze your data. You’re using the best
tool for the analyses because the cluster is already running the right tools.
You’re also creating data products. These aren’t just entries in a database table;
you are focusing on in-depth analysis that creates business value. At this point,
you should be creating direct and quantifiable business value.
Last, you’re starting to look at data science and machine learning as ways to
improve your analysis or data products.
Running
In the running phase, you’re moving into the advanced parts of Big Data architectures. You’re gaining the maximum amount of value from your data.
You’re also looking at how batch-oriented systems are holding you back, and starting to look at real-time systems.
You’re focusing on the finer points of your data pipeline. This includes looking at
how to optimize your storage and retrieval of data. This might include choosing
a better storage format or working with a NoSQL database. You’re also using
machine learning and data science to their fullest potential.
Iterating
The next question is how to maintain this velocity. Data pipelines aren’t really ever done. They’re constantly changing as you add new datasets and technologies, and other teams start using the data.

The team needs to iterate and repeat the crawl, walk, run cycle again. This time, the crawl may not be as simple or as time consuming; it could be more about validating that the task is possible. The walk and run phases would be mostly the same.
this deficiency. And as noted, data pipelines are solutions that include 10 to 30
complex technologies.
A very common recipe for failure is when the entire data engineering team is
made up of DBAs or Data Warehouse Engineers. The data engineering team
needs to be multidisciplinary, so you will need to check for skills gaps.
Related to the skills gaps is when a team doesn’t program or have Java experience. Big Data technologies require programming and Java is their predominant
language. Not having any programming skills on the team is a big red flag. Not
knowing Java (or Python or Scala) is an issue, but not a showstopper. Program-
mers in one language can learn enough Java to get by. However, you do need to
give the team the resources and time to be successful with a brand-new language.
You’ll also need to verify that any project timelines take into account the team
having to learn and program in a different language.
Other teams lack a clear use case. They’ve been told to go find one, or Big Data is seen as a silver bullet that will solve any problem. Without a clear use
case, technologies cannot be chosen, solutions written, or value created. These
projects are doomed because there is no finish line.
Other teams face an albatross in the form of extensive legacy systems. The
new data pipeline is expected to be backward compatible or support all of the
previous systems. This stymies execution, leading to failure when project plans and managers don’t account for the increase in time. Heavy legacy systems
integrated with data pipelines often take 50% longer to complete.
Perhaps my biggest pet peeve when working with data engineering teams is
when they’re set up for failure. There’s little to no preparation on the project. The
engineering team doesn’t have a use case and hasn’t been given the resources to
figure it out. I hate seeing doomed projects.
Other teams lack qualified Data Engineers or don’t have a project veteran. These
teams may be initially successful, but over the long term they will fail. They’ll
get stuck at the crawl phase. They won’t have the skills or abilities to progress
beyond the basics. A project veteran can help lead the team through the difficult
transitions of the walk and run phases.
Some teams lack members who understand the entire system and the impor-
tance of schema. These failures don’t happen initially; they manifest in the
walk and run phases, when other parts of the organization start using the data.
Sometimes it’s the entire company that causes the failure. The
data team could be productive and provide a data pipeline that
doesn’t get used. A company needs to have a data-driven culture and data-centric attitude. Without this attitude, the
data engineering team’s insights and data won’t be transformed
into business value. Other times, you need to make changes to
the other parts of the organization to fully leverage your data.
No Programming Skills
The most common scenario I’ve seen is a data engineering team with no programming skills. The team is replete with DBAs and SQL skills, but lacks the ability to program complex systems. This team makeup often comes from taking data warehousing teams and calling them data engineering teams.
At a large organization, I’ll look around for data-focused programming teams.
Failing that, I search for programming teams with multithreading skills. If I still
can’t find that team, I look for any Java or Python team, with a preference for
Java.
Each of these teams will need training; it’s just a matter of how much they’ll need. There are also gradually rising odds that these teams will have an ability gap to contend with. Not all people with programming skills can handle the complexity of Big Data.
If you don’t have any programming skills in the company, you’ll need to hire one or more qualified Data Engineers as consultants. You can also hire a company to write the code for you. Here is where I urge caution, as some consulting companies will say they do Big Data solutions but are just as lost as you are.
No Project Veteran
Hiring a project veteran can be tricky. One may not be available, or you may not
be able to afford one. You’ll have to contend with your less experienced team.
When no project veteran is present, I keep a sharp eye out for abuses of the
technology. I also keep an eye out for the team missing the subtle nuances of
the use case and technology. These teams tend to create solutions held together
with more duct tape and hope than you’d like to see.
Absent a veteran, you can get the team mentoring to help teach them over time.
You can get training; in my training classes, we go through at least one use case.
You can also get specific consulting. This could range from a second pair of eyes
to go through your design, all the way to having a long-term relationship with a
consultant.
No Domain Knowledge

Another scenario I’ve seen is where a data engineering team is brand new and hired from outside of the company. A lack of understanding of the domain
can lead to a misunderstanding of the use case. That can spell disaster for the
project.
This scenario happens most often when a data engineering team is replacing a
legacy system. The team that created, supports, and makes their living off the
legacy system feels threatened. I’ve trained at many companies where this is the
case.
I’ve helped them by training both teams at the same time. I’ll have members of
the legacy and data engineering teams attend so that they’re getting the same
knowledge. The Trojan horse is that I’m breaking down the political barriers
between the teams. This may surprise you, but sometimes companies don’t
communicate effectively internally. The training session is a great time to handle
this breakdown.
I suggest you handle this lack of skill politically rather than through business or
technical means. By breaking down the political walls, you score allies from the
legacy team to help you. You’ll learn how and why they engineered the solution
the way they did. You’ll learn some of the pitfalls they hit while creating.
By solving this problem, you’ll gain the legacy team’s domain knowledge for your
team.
Pre-project Steps
Before you start on your Big Data project, you’ll need to take care of some tasks around creating the team.
If you don’t already have a team in place, you will need to create one. Often
companies don’t understand the need to create an actual data engineering team.
I’ve already outlined why this should be done earlier in the book in Chapter 2,
“The Need for Data Engineering,” and Chapter 3, “Data Engineering Teams.”
If you already have a team or are in the process of creating it, you will need to
check for gaps.
The first gaps to check for are skills gaps. Use the skills gap analysis mentioned
earlier in the book. This requires enormous honesty about the team’s skills.
The next gap to check for is an ability gap. You may have people on the team
who will never be able to perform. The significant uptick in complexity puts Big
Data out of their reach and no amount of time and help is going to change that.
It would be unfair to expect people with an ability gap to perform on a team.
If you didn’t perform the skills gap and ability gap checks in
Chapter 3, "Data Engineering Teams," you need to. This exercise
isn’t optional.
Use Case
A major difference between Big Data and small data projects is how deeply
you need to know and understand your use case. Skipping this step leads to
failed projects. Every decision after this is viewed through the lens of the use
case. Furthermore, there may be many different use cases, and you will need to
understand each one.
Questions you should ask about your use case include the following:
• What are you trying to accomplish?
• How fast do you need the results? Do they need to be in real-time, batch,
or both?
• What is the business value of this project?
• What difference will it make in the company?
• In what time frame does the project need to be completed?
• How much technical debt do you have?
• How many different use cases are there?
• How much data will we need to store initially? How much will we need to
grow by on a daily basis?
• Will we need to store a wide variety of data types?
• How big is our user base? If you don’t have users, how many sources of
data or devices will you have?
• How computationally complex are the algorithms you’re going to run?
• How can you break up your system into crawl, walk, run?
• Is the use case changing often?
• How is data being accessed for the use case?
• How secure and encrypted does the data need to be?
• How will you handle the data management and governance?
• Who will own the data and are they encouraging other teams to use it?
Is It Big Data?
Now that you understand your use case, you’ll need to figure out if it requires Big
Data technologies. This is where a qualified Data Engineer is crucial. Given your
knowledge of the use case, the Data Engineer will help you make that decision.
A common word to look for when compiling the use case information is “can’t.”
When you’re talking to the other team members about the use case, they’ll often
talk about how they’re trying to do this now or they’ve tried it before. For both of
these scenarios, they “can’t” because of some technical limitation related to using
small data technologies for Big Data problems. By changing the technologies to
Big Data ones, the answer will change to “can.”
they’re startups that are expecting to grow exponentially. Either way, they are organizations that don’t have Big Data now, but will someday.
They’re faced with the difficult decision of when to start using Big Data technolo-
gies. I’ve seen what happens in both cases.
For startups, I’ve seen them use small data technologies and hit it big. Then the
startup can’t move over to a Big Data technology fast enough and it starts losing
customers and traffic. I’ve also seen startups make the deliberate choice not to
implement Big Data technologies due to the complexity and increased project
times. Those companies ended up going out of business before they needed Big
Data technologies.
For larger organizations, I’ve seen them delay moving to Big Data technologies
due to their legacy systems. They create so much technical debt for themselves
that they spend years moving to a new system. They just have to hope that they
don’t hit a reinforced concrete wall during that time.
After a very honest look at the team, you will need to see how you can set the
team up for success. Failing to set the team up for success is a common way to
make projects fail.
Repeat
In this step, you figure out how to break the project down into small pieces. This way, you take an iterative approach to creating a data pipeline instead of an all-or-nothing one.
Projects that have an all-or-nothing approach often fail due to timelines and
complexity. It is important to break up the overall design or system into phases—
crawl, walk, run. I encourage you to segment your development phases to create
the system gradually rather than all at once. In approaching this, ask these
questions:
• How can you break up your system into crawl, walk, run?
• Has the rest of the management been informed which parts or features of
the system will be in which phase?
• Is the next phase in the project vastly more complex or time consuming
than the last phase?
• Do the phases have a logical progression?
• Is one feature dependent on a feature down the line?
Probability of Success
When I work with teams, I create a probability of success. This probability takes
into account:
• The team’s abilities
• The gaps in the team’s skills
• The complexity of the use case
• The complexity of the technologies you need to use
• The complexity of the resulting code
• The company’s track record on doing complex software
• How well the team is set up for success
• How much external help the team is going to receive (training, mentoring,
consulting)
In my experience, these have ranged from a 1% chance of success to a 99%
chance.
Next, calculate your probability of success. This requires an incredibly honest
look at things. As you looked over the questions and topics in this book:
• Did you have deep reservations about the team succeeding?
• Did they remind you of your team and their skills as they currently stand?
Now, I want you to think of ways to increase that probability:
• Would more qualified Data Engineers increase it?
• Would getting more or better external help increase it?
• Do you need to vastly decrease the scope or complexity of the use case?
• Is there a gap, in skill or ability, that’s making success improbable?
If you have a high probability, congratulations. You’ll likely be successful. However, many teams fool themselves with a dishonest assessment of their probability of success. They significantly overestimate their abilities.
If you have a low probability, you should really think hard about undertaking the
project. This is a time to get extensive external help. At a minimum, your team
will need training and mentorship. If it’s very low, you should honestly consider
having a competent consulting company handle the entire project. Without
doing that, you’ll be setting the team up for failure, which isn’t fair to you or the
team.
CHAPTER 9
Conclusion
Best Efforts
Big Data systems are incredibly complex. Part of your job will be to educate your coworkers about this fact. Failing to do so will make everyone compare your progress to easier projects, like a mobile or web project.
Despite your best efforts and planning, Big Data solutions are still hard. When
you start falling behind in the project or hitting a roadblock, I highly suggest
getting help quickly. I’ve helped companies who had been stuck for several
months. Had they reached out for help sooner, they could have saved months of
time and money. This isn’t admitting failure or an issue with the team; Big Data
is really that hard.
Best of luck on your Big Data journey.
About the Author
Jesse Anderson is a Data Engineer, Creative Engineer
and Managing Director of Big Data Institute.
He trains and mentors companies ranging from startups to Fortune 100 companies on Big Data. This includes training on cutting-edge technologies and Big Data management techniques. He’s mentored hundreds of companies on their Big Data journeys. He has taught thousands of students the skills to become Data Engineers.
He is widely regarded as an expert in the field, known for his novel teaching practices.
Jesse is published by O’Reilly and Pragmatic Programmers. He has been covered in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.