Big Data and IoT
Year 2 Semester 2
richfield.ac.za
Table of Contents

Chapter 1: Big Data Analysis and Extraction Techniques
1.1 Big Data
1.2 Big Data Analysis Techniques
Chapter 2: IoT applications and architectures
2.1 IoTs defined
2.2 How IoTs work
2.3 IoT Applications
2.4 IoT Architectures
Chapter 3: Big Data Storage and Security
3.1 Big Data Storage
3.2 Big Data Security
Chapter 4: Big Data Strategies and Legal Compliance
4.1 How big data can help guide your strategy
4.2 Forming your strategy for big data and data science
4.3 Analytics, algorithms and machine learning
4.4 Governance and legal compliance
4.5 Governance for reporting
Case study – Netflix gets burned despite best intentions
Chapter 5: IoT technologies and Standards
5.1 IoT Protocols Background
5.2 The Best Tools for Internet of Things (IoT) Development
5.3 IoT Development Platforms
5.4 IoT Operating Systems
5.5 IoT Programming Languages
5.6 Open-Source Tools for the Internet of Things
5.7 Best IoT Development Kits
5.8 IoT Security
5.9 IoT Statistics and Forecast
5.10 Types of IoT Connections
5.11 Top Seven IoT Platforms
5.12 Most Popular Internet of Things Protocols, Standards and Communication Technologies
5.13 Standards Bodies
PRESCRIBED OR RECOMMENDED BOOKS
Pearson, by David Stephenson
Chapter 1: Big Data Analysis and Extraction Techniques

LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to meet the learning objectives set out for this chapter.
1.1 Big Data
The new millennium brought with it exponential growth in data volumes, driven by a variety of sources of digitised data. Every individual, company, business and organisation has a digital footprint, as depicted in Figure 1 below.
1.2 Big Data Analysis Techniques

Association rule
Association rule learning is an analysis technique adopted to find patterns in data through correlations between variables in large databases. It was first used by major supermarket chains to discover relations between products, using data from supermarket point-of-sale (POS) systems, and it has since been extended to many other areas.
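As a simple illustration of the idea, the Python sketch below computes the support, confidence and lift of one hypothetical rule ("bread" implies "butter") from a handful of made-up transactions; the transactions, items and rule are invented purely for illustration.

```python
# Minimal association-rule sketch: support, confidence and lift for one rule,
# computed from a small, made-up set of point-of-sale transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread"}, {"butter"}
supp = support(antecedent | consequent)   # P(bread and butter)
conf = supp / support(antecedent)         # P(butter | bread)
lift = conf / support(consequent)         # confidence compared with the baseline

print(f"support={supp:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two products are bought together more often than chance would predict, which is exactly the kind of relation the supermarket chains were looking for.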
Classification tree analysis
This is a type of machine learning algorithm that adopts a structural mapping of binary decisions which lead to a decision about the class of an object. Although sometimes referred to simply as a decision tree, it is more precisely a type of decision tree that leads to categorical decisions. This statistical classification technique is sometimes used to:
• automatically assign documents to categories
• categorize organisms into groupings
• develop profiles of students who take online courses (Stephenson, 2013).
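To make the idea concrete, the hedged sketch below trains a small classification tree with scikit-learn on invented data loosely echoing the student-profile example above; the feature names and values are assumptions for illustration, not data from the guide.

```python
# Toy classification-tree sketch using scikit-learn (illustrative data only).
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical student profiles: [hours_online_per_week, assignments_submitted]
X = [[2, 1], [10, 8], [1, 0], [12, 9], [5, 4], [0, 0]]
y = ["drop-out", "completer", "drop-out", "completer", "completer", "drop-out"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# The fitted tree is a series of binary decisions ending in a categorical class.
print(export_text(clf, feature_names=["hours_online", "assignments"]))
print(clf.predict([[3, 2]]))   # classify a new, unseen student profile
```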
Genetic algorithms
Genetic algorithms are inspired by inheritance, mutation and natural selection. Essentially, these mechanisms are used to “evolve” useful solutions to problems that require optimization, and they are commonly applied across a wide range of optimization and search problems.
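The short, self-contained Python sketch below shows the basic inheritance, mutation and selection loop on an invented toy problem (maximising the number of 1-bits in a string); it only illustrates the mechanism and is not an implementation from the guide.

```python
# Tiny genetic-algorithm sketch: evolve bit-strings towards all 1s (toy problem).
import random

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 40, 0.05

def fitness(genome):            # objective to maximise: number of 1-bits
    return sum(genome)

def mutate(genome):             # random bit-flips model mutation
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

def crossover(a, b):            # single-point crossover models inheritance
    point = random.randrange(1, GENOME_LEN)
    return a[:point] + b[point:]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # natural selection: keep the fitter half of the population as parents
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print("best fitness:", fitness(max(population, key=fitness)))
```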
Sentiment analysis
Sentiment analysis is a type of Natural Language Processing (NLP) technique that automates the process of understanding an opinion about a given subject from written or spoken language. It thus helps researchers determine the sentiments of speakers or writers. Sentiment analysis is being used to help:
• understanding how people from different ethnic groups form ties with outsiders
• finding the importance of a particular individual within a group
• determining the social structure of a customer base’ (Stephenson, 2013).
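The minimal Python sketch below shows the simplest possible flavour of sentiment scoring, using tiny invented word lists; it is purely illustrative, and production systems rely on trained NLP models rather than hand-written lexicons.

```python
# Minimal lexicon-based sentiment sketch (word lists are invented for illustration;
# real systems would use trained NLP models instead).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "angry"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product, the service was excellent"))  # positive
print(sentiment("Terrible delivery and poor support"))              # negative
```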
Data Mining
Data mining extracts patterns from large data sets by combining methods from statistics and
machine learning, within database management. It is also referred to as the process of finding
anomalies, patterns and correlations within large data sets to predict outcomes.
An example would be when customer data is mined to determine which market segments are
most likely to react to an offer.
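One common data-mining technique for the market-segmentation example above is clustering. The hedged sketch below runs k-means from scikit-learn on made-up customer features; the features, values and number of segments are assumptions for illustration only.

```python
# Illustrative market-segmentation sketch: k-means clustering on made-up
# customer features (annual spend, visits per month).
from sklearn.cluster import KMeans

customers = [[200, 1], [220, 2], [1500, 10], [1600, 12], [800, 5], [750, 6]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

# Each label is a discovered segment; an offer could then be targeted per segment.
print(kmeans.labels_)
print(kmeans.cluster_centers_)
```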
Natural Language Processing (NLP)
NLP is a sub-specialty of computer science, artificial intelligence and linguistics which uses algorithms to analyse human (natural) language.
For example, if you have shopped online it is most likely that you were interacting with a
chatbot rather than an actual human being. These AI customer service agents are typically
algorithms that use NLP to be able to understand your query and respond to your questions
adequately, automatically, and in real-time.
In 2016, Mastercard launched its own chatbot that was compatible with Facebook Messenger but, compared to Uber’s bot, the Mastercard bot functions more like a virtual assistant.
Chapter 2: IoT applications and architectures
LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to meet the learning objectives set out for this chapter.
2.1 IoTs defined
An IoT system usually consists of web-enabled smart devices that use embedded processors, sensors and communication hardware to collect, send and act on data they acquire from their environments, as depicted in Figure 3 below.
2.2 How IoTs work
IoT devices share the sensor data they collect by connecting to an IoT gateway, where data is
either sent to the cloud to be analysed or analysed locally. Sometimes, these devices
communicate with other related devices and act on the information they get from one
another. The devices do most of the work without human intervention, although people can
interact with the devices - for instance, to set them up, give them instructions or access the
data.
The connectivity, networking and communication protocols used with these web-enabled devices largely depend on the specific IoT applications deployed.
2.3 IoT Applications
The proper use of IoT technologies in businesses can reduce overall operating costs, help increase business efficiency, and create additional revenue streams through new markets and products, enhancing competitive advantage. Some common applications of IoT are described below.
The leaders in “smart” operations are the retail and supply chain industries, and this is not limited to the use of IoT devices and applications for shopping and supply chain management. Restaurants, hospitality providers and other businesses have also adopted IoT to manage their supplies and gather valuable insights, such as avoiding over-ordering, restricting staff members who abuse their privileges, and better managing logistical and merchandising expenses.
Home automation
Indoor cameras, alarms and other connected devices help you better manage your home:
• The thermostat learns your preferences and automatically adjusts the temperature. In addition to a comfortable environment at home, it helps you save on heating and use your energy more efficiently.
• Cameras, together with smoke and CO2 alarms, make your home a safer place.
• You can monitor and manage all of these devices with your smartphone using a dedicated application.
Multiple wearable devices that monitor heart rate, calorie intake, sleep, activity and other metrics to help us stay healthy have recently flooded the IoT market. These health and fitness devices have been made available by Apple, Samsung, Jawbone and Misfit, to name a few wearables that represent IoT use.
In some cases, these wearable devices can communicate with third-party applications and
share information about the user’s chronic conditions with a healthcare provider.
In addition to the personal use of health wearable devices, there are some advanced smart
appliances, including scales, thermometers, blood pressure monitors, and even hair brushes.
Smart medication dispensers are also widely used for home treatment and elderly care. The
appliance allows you to load the prescribed pills and monitor the intake. The mobile
application paired with the device sends timely alerts to the family members or caregivers to
inform them when the medicine is taken, or skipped. It also provides useful data on the
medication intake and sends notifications when your medication is running low.
Automotive
Most cars today are becoming increasingly connected through the inclusion of smart sensors. These smart solutions are sometimes provided by the car manufacturer itself, while others are offered as third-party solutions to make your car “smart”, such as remote control and monitoring of your car. A mobile application allows you to control functions of your car such as opening/closing the doors, engine metrics, the alarm system, and detecting the car’s location and routes.
The concept for the Smart car started in the early 1970s, when Mercedes-Benz engineer Johann Tomforde began to explore city car concepts and designs. He created the first concept sketch, but it wasn’t until the 1990s that Mercedes assembled a team to start the design process.
Agriculture
Smart farming, although commonly overlooked, has grown to include many innovative products aimed at progressive farmers. Some of these “smart products” include a distributed network
of smart sensors to monitor various natural conditions, such as humidity, air temperature,
and soil quality. Other products are used to automate irrigation systems. An example of an
IoT agriculture device is a smart watering system that uses real-time weather data and
forecasts to create an optimal watering schedule for the agricultural area. A smart Bluetooth-
powered controller and a mobile application are used to control the system, making it easy
to install, setup, and manage, as is shown in Figure 5 below.
Freight, fleet management, and shipping represent another promising area of use for IoT.
With smart tags or sensors attached to the parcels or items being transported, customers
can track their location, speed, and even transportation or storage conditions.
2.4 IoT Architectures
According to Bilal (2018), the term Internet of Things (IoT) can be expressed through a simple formula: IoT = Services + Data + Networks + Sensors. These basic building blocks of an IoT system are illustrated in Figure 7 below, and each of these elements is discussed thereafter.
The gateway provides connectivity between the object and the cloud part of the IoT solution, enables data pre-processing and filtering before moving it to the cloud (to reduce the volume of data for detailed processing and storing), and transmits control commands going from the cloud to things. The objects then execute commands using their actuators. The advantage of adopting a cloud gateway is that it ensures compatibility with various protocols and can communicate with field gateways using different protocols, depending on which protocol is supported by the relevant gateway.
A data lake is used for storing the data generated by connected devices in its original format. This data is generated in “batches” or in “streams” and is large in volume, and is therefore commonly referred to as “big data”. When specific data is needed for analysis, it is extracted from the data lake and loaded into a big data warehouse, where it is filtered, cleaned, structured and matched.
Data analysts use data from the big data warehouse, visualized in schemes, diagrams or infographics, to find trends, decide what actions to implement, and understand the correlations and patterns needed to create more suitable algorithms for control applications.
To create more precise and more efficient models for control applications, machine learning
is often adopted. Models are regularly updated based on historical data accumulated in the
big data warehouse.
Control applications are responsible for sending automatic commands and alerts to actuators, for example:
• Windows of a smart home can receive an automatic command to open or close depending on the forecasts taken from the weather service.
• When sensors show that the soil is dry, watering systems get an automatic command to water plants.
• Sensors help monitor the state of industrial equipment and, in case of a pre-failure situation, an IoT system generates and sends automatic notifications to field engineers.
The commands sent by control applications to actuators can also be stored in a big data warehouse to help investigate problematic cases.
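As a hedged sketch of the irrigation example above, the Python snippet below subscribes to a soil-moisture topic and publishes a watering command when a reading falls below a threshold. The broker address, topic names, message format and threshold are all illustrative assumptions, not details from the guide.

```python
# Sketch of a control application: when a soil-moisture reading falls below a
# threshold, publish a "start watering" command over MQTT.
# Uses the paho-mqtt 1.x client style; newer versions may also require a
# callback API version argument when constructing the client.
import json
import paho.mqtt.client as mqtt

BROKER = "iot-broker.example.com"     # hypothetical broker address
MOISTURE_THRESHOLD = 30               # percent, chosen only for illustration

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)             # e.g. {"sensor": "field-1", "moisture": 22}
    if reading["moisture"] < MOISTURE_THRESHOLD:
        command = {"actuator": "valve-1", "action": "open", "minutes": 10}
        client.publish("farm/actuators/irrigation", json.dumps(command))

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER)
client.subscribe("farm/sensors/moisture")
client.loop_forever()                 # react to sensor messages as they arrive
```

In a full IoT system the same commands would also be written to the big data warehouse, as described above, so that problematic cases can be investigated later.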
User applications are the software component of an IoT system which connects IoT users to the devices and gives them the option to monitor and control their smart objects through a mobile phone or web application (Grizhnevich, 2018).
A simple example of a “smart” outside lighting system, as part of a “smart home”, is described in Figure 7 below.
The perception layer is the physical layer, which has sensors for sensing and gathering
parameters about the environment, or identifies other smart objects in the environment.
The network layer is responsible for connecting to other smart things, network devices, and
servers, and is also used for transmitting and processing sensor data.
The application layer is responsible for delivering application specific services to the user. An
extension of the three-layer architecture is the five-layer architecture, which additionally
includes the processing and business layers as depicted in Figure 9 below.
The transport layer transfers the sensor data from the perception layer to the processing
layer and vice versa through networks such as wireless, 3G, LAN, Bluetooth, RFID, and NFC.
The processing layer stores, analyses, and processes huge amounts of data that come from
the transport layer through various technologies such as databases, cloud computing, and big
data processing. The business layer manages the whole IoT system, including applications,
business and profit models, and users’ privacy.
Another architecture proposed by Ning and Wang in Figure 10 below is inspired by the layers
of processing in the human brain. It is inspired by the intelligence and ability of human beings
to think, feel, remember, make decisions, and react to the physical environment. It is
constituted of three parts. First is the human brain, which is analogous to the processing and
data management unit or the data centre. Second is the spinal cord, which is analogous to
the distributed network of data processing nodes and smart gateways. Third is the network
of nerves, which correspond to the networking components and sensors.
Chapter 3: Big Data Storage and Security

LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to meet the learning objectives set out for this chapter.
3.1 Big Data Storage
Companies apply big data analytics to get greater intelligence from metadata. In most cases,
big data storage uses low-cost hard disk drives, or hybrids mixing disk and flash storage.
Although a specific volume size or capacity is not formally defined, big data storage usually
refers to volumes that grow exponentially to terabyte or petabyte scale. Thus, big data
storage is also able to flexibly scale as required.
A big data storage system clusters a large number of commodity servers attached to high-
capacity disk to support analytic software written to crunch vast quantities of data. The
system relies on massively parallel processing databases to analyse data ingested from a
variety of sources.
The data itself in big data is unstructured because it typically comes from various sources,
which means mostly file-based and object storage is required. The Apache Hadoop
Distributed File System (HDFS) is the most prevalent analytics engine for big data, and is
typically combined with some features of a NoSQL database.
Hadoop is open source software written in the Java programming language. HDFS spreads the
data analytics across hundreds or even thousands of server nodes without impacting on
performance. Through its MapReduce component (as depicted in Figure 11 above), Hadoop
distributes processing thus acting as a safeguard against catastrophic failure. The multiple
nodes serve as a platform for data analysis at a network's edge. When a query arrives,
MapReduce executes processing directly on the storage node on which the data resides. Once
analysis is completed, MapReduce gathers the collective results from each server and
“reduces” them to present a single cohesive response.
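The classic way to illustrate the two phases is a word count. The Python sketch below imitates map and reduce in a single process on invented documents; a real Hadoop job would run the same logic distributed across the storage nodes, as described above.

```python
# In-process imitation of MapReduce's two phases (word count).
# A real Hadoop job runs map() on the nodes that hold the data, then shuffles
# and reduces the intermediate pairs into one combined result.
from collections import defaultdict

documents = ["big data needs big storage", "big data needs parallel processing"]

def map_phase(doc):
    # emit (word, 1) pairs for each word in one document
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # gather counts per key and "reduce" them to a single total per word
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))   # e.g. {'big': 3, 'data': 2, 'needs': 2, ...}
```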
Hadoop has two major layers and two supporting modules, which together form the building blocks of the Hadoop architecture. Each of these building blocks is briefly explained below:
Hadoop Common is a collection of Java libraries and utilities that are required by, and common to, the other Hadoop modules; it contains the necessary Java files and scripts required to start Hadoop.
Hadoop Yet Another Resource Negotiator (YARN) is a resource-management framework used for job scheduling and efficient cluster resource management. It takes responsibility for providing the computational resources (e.g. CPU, memory and storage) required for application execution.
Hadoop Distributed File System (HDFS™) is suitable for applications having large data sets
because it is designed to be deployed on low-cost hardware. HDFS is responsible for providing
permanent, reliable and distributed storage, with unrestricted, high-speed access to the data
application. This is typically used for storing Inputs & Outputs.
Hadoop MapReduce framework is the core and integral part of the Hadoop architecture. It efficiently handles large-volume datasets by breaking them into multiple smaller datasets and assigning them to a cluster of computers to be processed in parallel.
Storage for big data is designed to collect voluminous data produced at variable speeds by
multiple sources and in varied formats. Industry experts describe this process as the three Vs:
the variety, velocity and volume of data.
Variety describes the different sources and types of data to be mined. Sources include audio
files, documents, email, file storage, images, log data, social media posts, streaming video and
user clickstreams.
Velocity pertains to the speed at which storage is able to ingest big data volumes and run analytic operations against them. Volume acknowledges that modern data sets are large and growing larger, outstripping the capabilities of existing legacy storage.
The key requirements of big data storage are that it can handle large volumes of data and keep scaling to keep up with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.
The largest big data practitioners, such as Google, Facebook and Apple, run what are known as hyperscale computing environments.
These comprise vast amounts of commodity servers with direct-attached storage (DAS).
Redundancy is at the level of the entire compute/storage unit, and if a unit suffers an outage
of any component it is replaced wholesale, having already failed over to its mirror.
Statistical big data analysis and modelling is gaining adoption in a cross-section of industries,
including aerospace, environmental science, energy exploration, financial markets, genomics,
healthcare and retailing. A big data platform is built for much greater scale, speed and
performance than traditional enterprise storage. Also, in most cases, big data storage targets
a much more limited set of workloads on which it operates.
It's not uncommon for larger organizations to have multiple SAN and NAS environments that
support discrete workloads. Each enterprise storage silo may contain pieces of data that
pertain to your big data project.
Big data can bring an organization a competitive advantage from large-scale statistical analysis of the data or its metadata. In a big data environment, the analytics mostly operate
on a defined set of data, using a series of data mining-based predictive modelling forecasts to
gauge customer behaviours or the likelihood of future events.
3.2 Big Data Security
Big data security is the process of guarding data and analytics processes, irrespective of where they are housed, from any vulnerabilities that could compromise their confidentiality (Taylor, 2017).
Since Big Data relies heavily on the cloud, many enterprises fear the loss of control over data that comes with utilizing cloud storage providers and third-party data management
and analytics solutions. The impact of this is significant, as many regulations hold enterprises
accountable for the security of data that may not be in their direct control.
Furthermore, third-party applications of unknown lineage can easily introduce risks into enterprise networks when their security measures do not meet the same standards as established enterprise protocols and data governance policies.
Devices introduce yet another layer of big data security concerns, with workers embracing
mobility and taking advantage of the cloud to work anywhere, at any time. With BYOD, a
multitude of devices may be used to connect to the enterprise network and handle data at
any time, so effective big data security for business must address endpoint security with this
in mind.
Additionally, securing privileged user access must be a top priority for enterprises. Certain
users must be given access to highly sensitive data in certain business processes, but avoiding
potential misuse of data can be tricky. Securing privileged user access requires well-defined
security policies and controls that permit access to data and systems required by specific
employee roles while preventing privileged user access to sensitive data where access is not
necessary – a practice commonly referred to as the “principle of least privilege”. It is critically important to provide a system in which encrypted authentication/validation verifies that users are who they say they are, and determines who can see what.
Since Big Data implementations distribute their processing jobs across many systems for faster analysis of their large volumes, there are many more systems where security issues can crop up.
Non-relational data stores like NoSQL, which are used in Big Data systems, usually lack
security.
In Big Data architecture, the data which is usually stored in the cloud, is also typically stored
on multiple tiers, depending on business needs for performance. For example, high-priority
“hot” data will usually be stored on flash media for faster performance. Hence securing
storage will mean creating a tier-conscious security strategy.
Security solutions that draw logs from endpoints will need to validate the authenticity of
those endpoints.
Real-time security tools generate a large amount of information; the key is finding a way to
ignore the false alarms, so human talent can be focused on the true breaches.
Data mining solutions should be secured not just against external threats, but also against insiders who abuse network privileges to obtain sensitive information.
Granular auditing can help determine when missed attacks have occurred, what the
consequences were, and what should be done to improve matters in the future. This in itself
is a lot of data, and must be enabled and protected to be useful in addressing big data security
issues.
Data provenance primarily concerns metadata (data about data), which can be extremely
helpful in determining where data came from, who accessed it, or what was done with it.
Usually, this kind of data should be analysed with exceptional speed to minimize the time in
which a breach is active. Privileged users engaged in this type of activity must be thoroughly
vetted and closely monitored to ensure they do not become their own big data security issues.
Encryption tools need to secure data in transit and data at rest, and this needs to be achieved across massive data volumes. Furthermore, encryption needs to operate on many different types of data, both user- and machine-generated. Encryption tools also need
to work with different analytics toolsets and their output data, and on common big data storage
formats including relational database management systems (RDBMS), non-relational
databases like NoSQL, and specialized file systems such as Hadoop Distributed File System
(HDFS). Encrypted data is useless to external entities, such as hackers, if they do not have the
key to unlock it. Moreover, encrypting data means that both at input and output, information is
completely protected.
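One way to see encryption of data at rest in practice is the small sketch below, which uses the Python `cryptography` package's Fernet recipe. Key handling is deliberately simplified to a single in-memory key; real deployments rely on centralized key management, which is discussed next.

```python
# Symmetric encryption sketch for data at rest using the `cryptography` package.
# The key here lives only in memory; production systems fetch keys from a
# centralized key manager, as discussed below.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, obtained from a key manager
fernet = Fernet(key)

record = b'{"customer_id": 12345, "card_last4": "4242"}'
token = fernet.encrypt(record)       # ciphertext is useless without the key
print(token)
print(fernet.decrypt(token))         # only key holders can recover the record
```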
Centralized key management has been a security best practice for many years, and it applies equally in big data environments, especially those with wide geographical distribution. Best practices include policy-driven automation, logging, on-demand key delivery, and abstracting key management from key usage.
User access control may be the most basic network security tool, but many companies
practice minimal control because the management overhead can be so high. This is dangerous both at the network level and on the big data platform. Strong user access control
requires a policy-based approach that automates access based on user and role-based
settings. Policy driven automation manages complex user control levels, such as multiple
administrator settings that protect the big data platform against inside attack.
Controlling who has root access to Business Intelligence tools and analytics platforms is
another key to protecting your data. By developing a tiered access system, the opportunities
for an attack can be reduced.
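The sketch below illustrates the role-based, least-privilege idea in a few lines of Python; the roles, permissions and the rule that analysts may not read raw PII are invented for illustration, not policies from the guide.

```python
# Minimal role-based access-control sketch (roles and permissions are illustrative).
ROLE_PERMISSIONS = {
    "admin":    {"read_pii", "read_aggregates", "manage_platform"},
    "analyst":  {"read_aggregates"},
    "engineer": {"read_aggregates", "manage_platform"},
}

def is_allowed(role, permission):
    """Least privilege: allow only what the role explicitly grants."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("analyst", "read_pii"))   # False - blocked by policy
print(is_allowed("admin", "read_pii"))     # True  - explicitly granted
```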
Intrusion detection and prevention systems are security pillars that apply equally to the big data platform. Big data’s value and distributed architecture naturally lend themselves to intrusion
attempts. Intrusion Prevention Systems (IPS) enable security administrators to protect the big
data platform from intrusion. However, should an intrusion succeed despite the IPS, Intrusion
Detection Systems (IDS) quarantine the intrusion before it does significant damage.
Physical security must not be ignored. Physical security systems should be deployed when the big data platform in the data centre is being built. If your data centre is cloud-based, carefully perform due diligence on the cloud provider’s data centre security. Physical security systems serve an important role in that they can deny data centre access both to strangers and to staff members who should not have access to sensitive areas. Video surveillance and security logs also serve the same purpose.
Building a strong firewall is another useful big data security tool. Firewalls are effective at filtering traffic that both enters and leaves servers. Organizations can prevent attacks before they happen by creating strong filters that block untrusted third parties or unknown data sources.
Essentially, big data security requires a multi-faceted approach. When it comes to enterprises
handling vast amounts of data, both proprietary and obtained via third-party sources, big data
security risks become a real concern.
A comprehensive, multi-faceted approach to big data security encompasses:
• Visibility of all data access and interactions
• Data classification
• Data event correlation
• Application control
• Device control and encryption
• Web application and cloud storage control
• Trusted network awareness
• Access and privileged user control
As illustrated in Figure 14 below, Taylor (2017) has summarised the essential areas of security required for big data.
Thus big data security environments must operate during three data stages. These are 1) data
ingress (what data is coming in), 2) stored data (what data is stored), and 3) data output (what
data is going out to applications and reports) (Taylor, 2017).
Many enterprises have slowly accumulated a series of point solutions, each addressing a
single component of the full big data security picture. While this approach can address
standalone security concerns, the best approach to big data security integrates these
capabilities into a unified system capable of sharing and correlating security alerts, threat
intelligence, and other activity in real time – an approach not unlike the vast and dynamic
concept of big data itself.
Chapter 4: Big Data Strategies and Legal Compliance
LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to meet the learning objectives set out for this chapter.
4.1 How big data can help guide your strategy
As you evaluate your company’s strategy, perhaps even considering a strategic pivot, you’ll
want to gather and utilize all available data to build a deep understanding of your customers,
your competitors, the external factors that impact you and even your own product. The big
data ecosystem will play an important role in this process, enabling insights and guiding
actions in ways not previously possible.
Your customers
Customer data is one of your most important assets. There is more data available today than
ever before, and it can tell you much more than you’d expect about your customers and
potential customers: who they are, what motivates them, what they prefer and what their
habits are. The more data you collect, the more complete your customer picture will become,
so make it your goal to collect as much data as possible, from as many sources as possible.
Getting the data
Start by identifying all customer interaction points:
For many organizations, the interaction points will consist of physical and web stores, Apple
(iOS) and Android apps, social media channels, and staff servicing customers in stores, call
centres, online chat and social media. I’ll illustrate with a few examples.
Digital
Start with your digital platforms. First the basics (not big data yet).
You’ll probably already have some web analytics tags on your website that record high-level events. Make sure you also record the key moments in the customer journey, such as when the visitor does an onsite search, selects filters, visits specific pages, downloads material, watches your videos or places items in the checkout basket. Record mouse events, such as scrolls and hovers. Make sure these moments are tagged in a way that preserves the details you’ll need later, such as adding the details of product category, price range and product ID to the web tags associated with each item description page. This will allow you to quickly do top-of-mind analysis, such as identifying how often products in certain categories were viewed, or how effective a marketing campaign was in driving a desired event. In the end, you’ll probably have several dozen or even several hundred specific dimensions that you add to your out-of-the-box web analytics data. This isn’t yet big data.
With the additional detailed tags that you’ve implemented, you’ll be able to analyse and
understand many aspects of the customer journey, giving insights into how different types of
customers interact with the products you’ve presented them. We’ll show examples of this
below.
If you haven’t already done so, set up conversion funnels for sequential events that lead to important conversion events, such as purchases. The figure below shows how a basic purchase funnel might look.
Each intermediate goal in the conversion funnel is a micro-conversion, together leading to a
macro-conversion (‘checkout’ in this case). Choose your micro-conversions in a way that
reflects increasing engagement and increased likelihood of a final conversion. The funnels you
set up will enable you to analyse drop-off rates at each stage, allowing you to address
potential problem points and increase the percentage of visitors progressing along each stage
of the funnel, eventually reaching the conversion event at the end of the funnel. Depending
on your product, customer movement down the funnel may span several days, weeks or
months, so you’ll need to decide what to consider ‘drop-off’.
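As a small, purely illustrative sketch of the drop-off analysis described above, the pandas snippet below computes stage-to-stage conversion and drop-off rates; the stages and visitor counts are invented numbers, not figures from the guide.

```python
# Funnel drop-off sketch with pandas: visitor counts per stage are invented.
import pandas as pd

funnel = pd.DataFrame({
    "stage":    ["product view", "add to basket", "start checkout", "purchase"],
    "visitors": [10000, 2400, 1100, 700],
})
funnel["conversion_from_previous"] = funnel["visitors"] / funnel["visitors"].shift(1)
funnel["drop_off_rate"] = 1 - funnel["conversion_from_previous"]
print(funnel)   # highlights the stages where most visitors are lost
```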
For privacy and governance in your website, research and comply with local laws governing the use of web cookies. Make a list of where you are storing the browsing data that identifies the individual user (such as IP address) and how you later use the insights you’ve gathered in customizing your interactions with each user. For example, if you personalize your content and marketing based on the users’ online actions, you’ll need to consider the ethical and legal implications. Remember the example from Target.
Now the big data part. You’ve set up the web analytics to record the details most meaningful for you. Now hook your web page up to a big data system that will record every online event for every web visitor. You’ll need a big data storage system, such as HDFS, and you’ll need to implement code (typically JavaScript) that sends the events to that storage. If you want a minimum-pain solution, use Google Analytics’ premium service (GA360) and activate the BigQuery integration. This will send your web data to Google’s cloud storage, allowing you to analyse it in detail within a few hours. If you need data in real time, you can change the GA JavaScript method sendHitTask and send the same data to both Google and your own storage system.
Customer support
Consider recording and analysing all interactions with sales agents and customer support:
phone calls, online chats, emails and even videos of customers in stores. Most of this data is
easy to review in pieces, but difficult to analyse at scale without advanced tools. As you store
these interactions, your customer support agents should enrich them with additional
information, such as customer ID and time of day’, and label them with meaningful categories,
such as ‘order enquiry’, ‘new purchase’, ‘cancellation’ or ‘complaint.’ You can then save the
entire data file in a big data storage system (such as MongoDB or HDFS). We’ll show valuable
ways to use this data later in this chapter.
Physical movements
You have a choice of several technologies for monitoring how your customers are moving
within your stores. In addition to traditional video cameras and break-beam lasers across
entrances, there are technologies that track the movement of smartphones based on cellular,
Bluetooth or Wi-Fi interactions. Specialized firms such as ShopperTrak and Walkbase work in
these areas. Such monitoring will help you understand the browsing patterns of your
customers, such as what categories are considered by the same customers and how much
time is spent before a purchase decision. It will help you direct your register and support staff
where needed. Again, this data is valuable even if the customer is kept anonymous.
When a customer arrives at the register and makes a purchase, possibly with a card that is
linked to that customer, you will be able to see not only what is being purchased, but also
what other areas of the store were browsed. You might use this information in future marketing, or you might use it to redesign your store layout if you realize that the current layout is hampering cross-sell opportunities.
These are just a few examples. In general, start collecting and storing as much detail as
possible, making sure to consider business value, to respect customer privacy and to comply
with local laws in your collection, storage and use of this data. Be careful not to cross the line
between ‘helpful’ and ‘creepy’. Keep your customers’ best interests in mind and assume any
techniques you use will become public knowledge.
Problem: Customers will not identify themselves (e.g. not logging in).
Possible solutions: Use web cookies and IP addresses to link visits from the same visitors, producing a holistic picture of anonymous customer journeys extended across sessions. Use payment details to link purchases to customers. Smartphones may provide information to installed apps that allow additional linking. Talk to your app developers about this.

Problem: Customers create multiple logins.
Possible solutions: Clean your customer database by finding accounts that share key fields: name, email address, home address, date of birth, or IP address. A graph database such as Neo4J can help in this process, as illustrated in the figure. Work with the business to create logic for which customers to merge and which to associate using a special database field (e.g. ‘spouse of’). Change your account creation process to detect and circumvent creation of duplicate accounts, such as by flagging email addresses from existing accounts.
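The guide suggests a graph database such as Neo4J for this linking step; the sketch below shows the same idea in plain Python with networkx on a few invented account records, where accounts sharing an email or home address become merge candidates.

```python
# Duplicate-account linking sketch: accounts sharing an email or address are
# connected in a graph, and connected components become merge candidates.
# (Illustrative only; a graph database such as Neo4J would be used at scale.)
import networkx as nx

accounts = [
    {"id": 1, "email": "ana@example.com",  "address": "12 Oak St"},
    {"id": 2, "email": "ana@example.com",  "address": "98 Elm Rd"},
    {"id": 3, "email": "ben@example.com",  "address": "98 Elm Rd"},
    {"id": 4, "email": "cara@example.com", "address": "5 Pine Ave"},
]

graph = nx.Graph()
graph.add_nodes_from(acc["id"] for acc in accounts)
for a in accounts:
    for b in accounts:
        if a["id"] < b["id"] and (a["email"] == b["email"] or a["address"] == b["address"]):
            graph.add_edge(a["id"], b["id"])    # a shared key field links the accounts

print(list(nx.connected_components(graph)))     # e.g. [{1, 2, 3}, {4}] -> merge candidates
```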
Price low-to-high? Highest rated item? Newest product? Customers sorting price low-to-high
are probably price conscious. Those sorting price high-to-low or by highest rating are probably
quality conscious. Those sorting newest first may be quality conscious or perhaps
technophiles or early adopters. Those sorting by rating may be late adopters or quality
conscious. All of this will impact how you interact with them. If a customer is quality conscious
but not price conscious, you should present them with high-quality products in search results
and marketing emails, and you should not be presenting them with clearance sales of entry-
level products. You’ll want to interact with the price-conscious customer segments in exactly
the opposite way.
This information will help you decide when to intervene in the shopping process, such as by
offering a discount when a customer is about to leave without making a purchase.
This will help you segment the customer and return the most relevant results for ambiguous
search phrases (such as ‘jaguar’ the car vs ‘jaguar’ the animal or Panama City, Florida vs
Panama City, Panama). You’ll also use this information to guide which items you market to
the customer.
Do they write reviews? If they always read reviews, don’t present them with poorly reviewed
items in your marketing emails, search results, or cross-sell frames. If they often write
reviews, or if they own social media accounts with an unusually high number of followers,
make sure they get ‘golden glove’ customer support.
Do they open newsletters? Do they respond to flash sales? Don’t fill their inbox with
marketing they never respond to. Give them the most relevant media and you’ll increase the
odds they respond rather than clicking ‘unsubscribe’.
If they are active on social media, what are the topics and hashtags they most frequently
mention?
Can you use this knowledge to market more effectively to them? Here again, you are building customer personas with all relevant information. Try to get connected to their social media accounts.

Figure: Searches in Brazil for ‘McDonalds’ (top line) vs ‘Burger King’ (bottom line), Q2 2017 (Google Trends).
• What is the applause rate on your social media? (How many times are your tweets liked or retweeted? How many comments on your Facebook posts?)
• How many people download your material?
• How many people sign up for your newsletter?
Test changes in your products by running A/B tests, which you’ll do in the following way:
1. Propose one small change that you think may improve your offering. Change one frame,
one phrase, or one banner. Check with your development team to make sure it’s an easy
change.
2. Decide what key performance indicators (KPI) you most want to increase: revenue,
purchases, up-sells, time onsite, etc. Monitor the impact on other KPIs.
3. Run the original and the changed version (A and B) simultaneously. For websites, use an
A/B tool such as Optimizely. View the results using the tool or place the test version ID
in web tags and analyse specifics of each version, such as by comparing lengths of path
to conversion.
4. Check if results are statistically significant using a two-sample hypothesis test. Have an
analyst do this or use an on-line calculator such as https://abtestguide.com/calc/.
b. Are there key product or customer segments you should manage differently?
c. Did specific external events influence results?
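Step 4 above calls for a two-sample hypothesis test. The Python sketch below performs the standard two-proportion z-test that an online calculator would run; the conversion counts for versions A and B are invented for illustration.

```python
# Two-sample significance check for an A/B test (conversion counts are invented).
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 480, 10000     # conversions and visitors for version A
conv_b, n_b = 540, 10000     # conversions and visitors for version B

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))             # two-sided test

print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # conventionally, p < 0.05 is treated as significant
```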
Align your assumptions about your product with these new insights. For example:
• Are you trying to compete on price, while most of your revenue is coming from customers who are quality conscious?
• Are you not taking time to curate customer reviews, while most of your customers are actively sorting on and reading those reviews?
If your assumptions about your product don’t align with what you learn about your
customers’ preferences and habits, it may be time for a strategic pivot.
Use modern data and data science (analytics) to get the insights you’ll need to determine and
refine your strategy. Selectively choose the areas in which you should focus your efforts in
(big) data and data science and then determine the necessary tools, teams and processes.
In the next chapter, I’ll talk about how to choose and prioritize your data efforts.
4.2 Forming your strategy for big data and data science
It’s exciting for me to sit with business leaders to explore ways in which data and analytics
can solve their challenges and open new possibilities. From my experience, there are different
paths that lead a company to the point where they are ready to take a significant step forward
in their use of data and analytics.
Companies that have always operated with a minimal use of data may have been suddenly blindsided by a crisis or may be growing exasperated by recurring problems. They end up forced to run damage control in these areas, but are ultimately seeking to improve operations at a fundamental level and lay the groundwork for future growth.
Companies that have been operating with a data-driven mindset may be exploring innovative
ways to grow their use of data and analytics. They are looking for new data sources and
technologies that will give competitive advantages or are exploring ways to quickly scale up
and optimize a proven product by applying advances in parallel computing, artificial
intelligence and machine learning.
Regardless of which description best fits your company, the first step you’ll want to take when re-evaluating your use of data and analytics is to form a strong programme team.
The programme team
Your data initiative programme team should include individuals representing four key areas
of expertise:
1. strategic,
2. business,
3. analytic; and
4. technical.
Strategic expertise
Business expertise
Analytic expertise
Technical expertise
Once you’ve selected your programme team, plan a programme kick-off meeting to lay the strategic
foundation for the analytics initiative, sketching the framework for business applications,
brainstorming ideas, and assigning follow-on steps, which will themselves lead to initial scoping
efforts. The four skill sets represented in the programme team should all be present if possible,
although the business expert may cover the strategic input and it is also possible (but not ideal) to
postpone technology input until the scoping stage.
Also helpful at this stage is to have detailed financial statements at hand. These figures will help focus
the discussion on areas with the most influence on your financials. Bring your standard reports and
dashboards, particularly those that include your key performance indicators (KPIs).
Strategic input: Start the kick-off meeting by reviewing the purpose and principles that govern
your efforts. Continue by reviewing the strategic goals of the company, distinguishing
between the long- and short-term strategic goals. Since some analytics projects will take
significant time to develop and deploy, it’s important to distinguish the time-lines of the
strategic goals. If there is no executive or strategic stakeholder involved in the process, the
team members present should have access to documentation detailing corporate strategy. If
there is no such strategic documentation (as is, sadly, sometimes the case), continue the
brainstorming using an assumed strategy of plucking low-hanging fruit with low initial
investment, low likelihood of internal resistance and relatively high ROI.
Business input: After reviewing the principles and strategy, review the KPIs used within the organization. In addition to the standard financial KPIs, a company may track any number of
metrics. Marketing will track click-through rate, customer lifetime value, conversion rates,
organic traffic, etc. Human resources may track attrition rates, acceptance rates,
absenteeism, tenure, regretted attrition, etc. Finance will typically track financial lead
indicators, often related to traffic (visits, visitors, searches) as well as third-party data.
At this stage, probe more deeply into why certain KPIs are important and highlight the KPIs
that tie in most closely with your strategic and financial goals. Identify which KPIs you should
most focus on improving.
The business experts should then describe known pain points within the organization. These
could come from within any department and could be strategic, such as limited insight into
competition or customer segments; tactical, such as difficulty setting optimal product prices,
integrating data from recent acquisitions or allocating marketing spend; or operational, such
as high fraud rates or slow delivery times.
Ask the business experts to describe where they would like to be in three years. They may be
able to describe this in terms of data and analytics, or they may simply describe this in terms
of envisioned product offerings and business results. A part of this vision should be features
and capabilities of competitors that they would like to see incorporated into their offerings.
Analytics input: By now your business objectives, principles, and strategic goals should be
completely laid out (and ideally written up in common view for discussion). At this point, your
analytics expert should work through the list and identify which of those business objectives
can be matched to standard analytic tools or models that may bring business value in relieving
a pain point, raising a KPI, or providing an innovative improvement. It’s beneficial to have
cross-industry insight into how companies in other industries have benefited from similar
analytic projects.
To illustrate this process, a statistical model may be proposed to solve forecasting inaccuracy,
a graph-based recommendation engine may be proposed to increase conversion rates or
shorten purchase-path length, a natural language processing tool may provide near-real-time
social media analysis to measure sentiment following a major advertising campaign, or a
streaming analytics framework combined with a statistical or machine learning tool may be
used for real-time customer analytics related to fraud prevention, mitigation of cart
abandonment, etc.
Technical input: If IT is represented in your kick-off meeting, they will be contributing
throughout the discussion, highlighting technical limitations and opportunities. They should
be particularly involved during the analytics phase, providing the initial data input and taking
responsibility for eventual deployment of analytics solutions. If your technical experts are not
present during the initial project kick-off, you’ll need a second meeting to verify feasibility
and get their buy-in.
Output of the kick-off
The first output of your programme kick-off should be a document that I refer to as Impact Areas for Analytics, consisting of a table like the one illustrated in the figure. The first column in this table should be business goals written in terminology understandable to everyone. The next column is the corresponding analytic project, along the lines of the applications. The next three columns contain the data, technology and staffing needed to execute the project. If possible, divide the table into the strategic focus areas most relevant to your company.
By the end of your kick-off meeting, you should have filled out the first two columns of this matrix.
The second document you’ll create in the kick-off will be an Analytics Effort document. For
each analytics project listed in the first document, this second document will describe:
1. The development effort required. This should be given in very broad terms (small,
medium, large, XL or XXL, with those terms defined however you’d like).
2. An estimate of the priority and/or ROI.
3. The individuals in the company who:
a. can authorize the project; and
b. can provide the detailed subject-matter expertise needed for implementation.
We are looking here for individuals to speak with, not to carry out the project.
These are the ‘A’ and the ‘C’ in the RASCI model used in some organizations.
Distribute the meeting notes to the programme team members, soliciting and incorporating
their feedback. When this is done, return to the programme sponsor to discuss the Impact
Areas for Analytics document. Work with the programme sponsor to prioritize the projects,
referencing the Analytics Effort document and taking into consideration the company’s
strategic priorities, financial landscape, room for capital expenditure and head-count growth,
risk appetite and the various dynamics that may operate on personal or departmental levels.
Scoping phase
Once the projects have been discussed and prioritized with the programme sponsor, you
should communicate with the corresponding authorizers (from the Analytics Effort
document) to set up short (30–60 min) scoping meetings between the analytics expert and
the subject matter expert(s). The exact methods and lines of communication and
authorization will differ by company and by culture.
During the scoping meetings, speak with the individuals who best understand the data and the business challenge. Your goal at this stage is to develop a detailed understanding of the background and current challenges of the business, as well as the relevant data and systems currently in use.
After each scoping meeting, the analytics expert should update the corresponding project
entry on the Analytics Effort document and add a proposed minimum viable product (MVP)
to the project description.
The MVP is the smallest functional deliverable that can demonstrate the feasibility and
usefulness of the analytics project. It should initially have very limited functionality and
generally will use only a small portion of the available data. Collecting and cleaning your full
data set can be a major undertaking, so focus in your MVP on a set of data that is readily
available and reasonably reliable, such as data over a limited period for one geography or
product.
The description should briefly describe the inputs, methodology and outputs of the MVP, the
criteria for evaluating the MVP, and the resources required to complete the MVP (typically
this is only the staff time required, but it might entail additional computing costs and/or third-
party resources). Utilizing cloud resources should eliminate the need for hardware purchases
for an MVP, and trial software licenses should substitute for licensing costs at this stage.
Feed this MVP into whichever project management framework you use in your company (e.g.
scrum or Kanban). Evaluate the results of the MVP to determine the next steps for that
analytics project. You may move the project through several phases before you finally deploy
it. These phases might include:
It’s very important to keep in mind that analytic applications are often a form of Research &
Development (R&D). Not all good ideas will work. Sometimes this is due to insufficient or
poor-quality data, sometimes there is simply too much noise in the data, or the process that
we are examining does not lend itself to standard models. This is why it’s so important to start
with MVPs, to fail fast, to keep in close contact with business experts and to find projects that produce quick wins. We’ll talk more about this in the next chapter when we talk about agile analytics.
4.4 Governance and legal compliance
You have three primary concerns for securing and governing your data. The last of these, compliance with local data regulations, can be a huge headache for multinationals, particularly in Europe, where the General Data Protection Regulation, effective May 2018, carries with it fines for violations of up to 4 per cent of global turnover or 20 million euros (whichever is larger). The EU will hold accountable even companies headquartered outside of Europe if they collect or process data of sufficient numbers of EU residents.
Regardless of legal risk, you risk reputational damage if society perceives you as handling
personal data inappropriately.
Personal data
When we talk about personal data, we often use the term personally identifiable information
(PII), which, in broad terms, is data that is unique to an individual. A passport or driver’s
license number is PII, but a person’s age, ethnicity or medical condition is not. There is no
clear definition of PII. The IP address of the browser used to visit a website is considered PII
in some but not all legal jurisdictions.
There is increased awareness that identities can be determined from non-PII data using data
science techniques, and hence we speak of ‘quasi-identifiers’, which are not PII but can be
made to function like PII. You’ll need to safeguard these as well, as we’ll see in the Netflix
example below.
Identify all PII and quasi-identifiers that you process and store. Establish internal policies for monitoring and controlling access to them. Your control over this data will facilitate compliance with current and future government regulations, as well as with third-party services that refuse to process PII.
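One common internal control is to pseudonymise direct identifiers before records reach analysts or third-party services. The sketch below replaces a passport number with a salted hash; the field names and salt are invented for illustration, and this is only one possible control, not a requirement from the guide.

```python
# Pseudonymisation sketch: replace a direct identifier with a salted hash before
# the record reaches analysts (field names and the salt are illustrative only).
import hashlib
import hmac

SECRET_SALT = b"rotate-me-and-store-in-a-key-manager"   # hypothetical secret

def pseudonymise(identifier: str) -> str:
    """Deterministic, non-reversible token that still allows joins across tables."""
    return hmac.new(SECRET_SALT, identifier.encode(), hashlib.sha256).hexdigest()

record = {"passport_no": "A1234567", "age": 42, "condition": "asthma"}
record["passport_no"] = pseudonymise(record["passport_no"])
print(record)   # PII replaced; quasi-identifiers (age, condition) still need care
```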
PII becomes sensitive when it is linked to private information. For example, a database with
the names and addresses of town residents is full of PII but is usually public data. A database
of medical conditions (not PII) must be protected when the database can be linked to PII.
Jurisdictions differ in their laws governing what personal data must be protected (health
records, ethnicity, religion, etc.). These laws are often rooted in historic events within each
region.
There are two focus areas for proper use of sensitive personal data: data privacy and data protection.
Data privacy relates to what data you may collect, store and use, such as whether it is
appropriate to place hidden video cameras in public areas or to use web cookies to
track online browsing without user consent.
Data protection relates to the safeguarding and redistribution of data you have legally
collected and stored. It addresses questions such as whether you can store private
data of European residents in data centres outside of Europe.
Privacy laws
If you’re in a large organization, you will have an internal privacy officer who should be on a
first-name basis with your data and analytics leader. If you don’t have a privacy officer, you
should find resources that can advise you on the privacy and data protection laws of the
jurisdictions in which you have customer bases or data centres.
Each country determines its own privacy and data protection laws, with Europe having some
of the most stringent. The EU’s Data Protection Directive of 1995 laid out recommendations
for privacy and data protection within the EU, but, before the activation of the EU-wide
General Data Protection Regulation (GDPR) in May 2018, each country was left to determine
and enforce its own laws. If you have EU customers, you’ll need to become familiar with the
requirements of the GDPR. The rapid rise in the number of Google searches for the term
‘GDPR’ since January 2017 demonstrates that you won’t be alone in this.
The extent to which privacy laws differ by country has proven challenging for multinational
organizations, particularly for data-driven organizations that rely on vast stores of personal
data to better understand and interact with customers. Within Europe over the past years,
certain data that could be collected in one country could not be collected in a neighbouring
country, and the personal data that could be collected within Europe could not be sent
outside of Europe unless the recipient country provided data protection meeting European
standards.
The European Union’s Safe Harbour Decision in 2000 allowed US companies complying with
certain data governance standards to transfer data from the EU to the US. The ability of US
companies to safeguard personal data came into question following the Edward Snowden
affair, so that, on 6 October 2015, the European Court of Justice invalidated the EC’s Safe
Harbour Decision, noting that ‘legislation permitting the public authorities to have access on a
generalized basis to the content of electronic communications must be regarded as
compromising the essence of the fundamental right to respect for private life.’85 A
replacement for Safe Harbour, the EU–US Privacy Shield, was approved by the European
Commission nine months later (July 2016).
Privacy and data protection laws vary by legal jurisdiction, and you may be subject to local
laws even if you don’t have a physical presence there.
In one widely cited study, by analysing the Facebook Likes of users, a model could distinguish
between Caucasians and African Americans with 95 per cent accuracy.88
So we see that two of the most fundamental tools within data science, the creative linking of
data sources and the creation of insight-generating algorithms, both increase the risk of
revealing sensitive personal details within apparently innocuous data. Be aware of such
dangers as you work to comply with privacy laws in a world of analytic tools that are
increasingly able to draw insights from, and identify hidden patterns within, big data.
Data governance
Establish and enforce policies within your organization for how employees access and use the
data in your systems. Designated individuals in your IT department, in collaboration with your
privacy officers and the owners of each data source, will grant and revoke access to restricted
data tables using named or role-based authorization policies and will enforce these policies
with security protocols, often keeping usage logs to verify legitimate data usage. If you are in
a regulated industry, you will be subject to more stringent requirements, where data
scientists working with production systems may need to navigate a half dozen layers of
security to get to the source data. In this case, you’ll want to choose an enterprise big data
product with features developed for high standards of security and compliance.
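As a rough illustration of role-based authorization combined with a usage log, here is a
minimal Python sketch. The role names, table names and the in-code grant mapping are
assumptions made for the example; in practice they would live in your directory service, data
catalogue or the access-control features of your big data product.

import logging
from datetime import datetime, timezone

# Hypothetical role-to-table grants; illustrative only.
ROLE_GRANTS = {
    "data_scientist": {"sales_aggregates", "web_clickstream"},
    "privacy_officer": {"sales_aggregates", "web_clickstream", "customer_pii"},
}

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data_access_audit")

def read_table(user: str, role: str, table: str):
    """Grant or deny access based on role, and record every attempt in the audit log."""
    allowed = table in ROLE_GRANTS.get(role, set())
    audit_log.info("time=%s user=%s role=%s table=%s allowed=%s",
                   datetime.now(timezone.utc).isoformat(), user, role, table, allowed)
    if not allowed:
        raise PermissionError(f"role '{role}' may not read table '{table}'")
    # ... fetch and return the requested data here ...

read_table("thabo", "data_scientist", "sales_aggregates")    # permitted and logged
# read_table("thabo", "data_scientist", "customer_pii")      # would raise PermissionError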
Adding a big data repository to your IT stack may make it more difficult to control access to,
usage of and eventual removal of personal information. In traditional data stores, data is kept
in a structured format and each data point can be assessed for sensitivity and assigned
appropriate access rights. Within big data repositories, data is often kept in unstructured
format (‘schema on read’ rather than ‘schema on write’), so it is not immediately evident
what sensitive data is present.
You may need to comply with right to be forgotten or right to erasure laws, particularly within
Europe, in which case you must delete certain personal data on request. With big data stores,
particularly the prevalent ‘data lakes’ of yet-to-be-processed data, it’s harder to know where
personal data is stored in your systems.
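To make the problem concrete, here is a minimal Python sketch of honouring an erasure
request against a small ‘data lake’ of JSON-lines files, assuming each record carries an email
field identifying the data subject. Real lakes also hold binary and columnar formats that need
their own handling, so this is only an illustration of why you must know where personal data
sits.

import json
from pathlib import Path

def erase_subject(lake_dir: str, subject_email: str) -> int:
    """Rewrite every JSON-lines file in the lake without the subject's records.

    Returns the number of records removed. Assumes one JSON object per line.
    """
    removed = 0
    for path in Path(lake_dir).rglob("*.jsonl"):
        kept = []
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("email") == subject_email:
                removed += 1
            else:
                kept.append(line)
        path.write_text("\n".join(kept) + ("\n" if kept else ""))
    return removed

# erase_subject("/data/lake", "jane.doe@example.com")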
GDPR will limit your use of data from European customers, requiring express consent for
many business applications. This will limit the efforts of your data scientists, and you’ll also
be accountable under ‘right to explanation’ laws for algorithms that impact customers, such
as calculations of insurance risk or credit score. You will likely need to introduce new access
controls and audit trails for data scientists to ensure compliance with GDPR.
A full discussion of GDPR is beyond the scope of this book, and we’ve barely touched on the
myriad other regulations in Europe and around the world. Also (quick disclaimer) I’m not a
lawyer. Connect with privacy experts knowledgeable in the laws of the jurisdictions in which
you operate.
Moving on from the topics of legal compliance and data protection, I’ll briefly touch on an
optional governance framework, which should reduce internal chaos in your organization and
ease the lives of you and your colleagues. You should develop and maintain a tiered
governance model for how internal reports and dashboards are assembled and distributed
within your organization. Most organizations suffer tremendously from not having such a
model. Executives sit at quarter-end staring in dismay at a collection of departmental reports,
each of which defines a key metric in a slightly different way. At other times, a quick analysis
from an intern works its way up an email chain and may be used as input for a key decision in
another department.
From my experience, you’ll spare yourself tremendous agony if you develop a framework for:
One way to do this is to introduce a multi-tiered certification standard for your reports and
dashboards. The first tier would be self-service analysis and reports that are run against a
development environment. Reports at this level should never leave the unit in which they are
created. A tier one report that demonstrates business value can be certified and promoted to
tier two. Such a certification process would require a degree of documentation and
consistency and possibly additional development, signed off by designated staff. Tier-two
reports that take on more mission-critical or expansive roles may be promoted to a third tier,
etc. By the time a report lands on the desk of an executive, the executive can be confident of
its terminology, consistency and accuracy.
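A minimal Python sketch of such a tiered registry appears below; the tier names, the sign-off
rule and the documentation check are assumptions chosen for the illustration rather than a
prescribed standard.

from dataclasses import dataclass, field

TIERS = ("self-service", "certified", "mission-critical")

@dataclass
class Report:
    name: str
    owner_unit: str
    tier: int = 0                       # every report starts as tier one, self-service
    sign_offs: list = field(default_factory=list)

    def promote(self, approver: str, documentation_complete: bool) -> None:
        """Move the report up one tier once a designated approver signs off."""
        if not documentation_complete:
            raise ValueError("documentation must be completed before promotion")
        if self.tier >= len(TIERS) - 1:
            raise ValueError("report is already at the highest tier")
        self.sign_offs.append(approver)
        self.tier += 1

churn = Report("customer_churn_weekly", owner_unit="marketing")
churn.promote(approver="bi_governance_team", documentation_complete=True)
print(churn.name, "is now", TIERS[churn.tier])    # -> certified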
Takeaways
It is important that you identify and govern your use of personally identifiable
information (PII) and quasi-identifiers.
Establish and enforce governance and auditing of internal data usage.
Laws related to privacy and data governance differ greatly by jurisdiction and may
impact your organization even if it does not have a physical presence within that
jurisdiction.
Europe’s GDPR will have a strong impact on any company with customers in the EU.
Linkage attacks and advanced analytic techniques can reveal private information
despite your efforts to protect it.
Creating a tiered system for your internal reports and dashboards can provide
consistency and reliability.
Ask yourself
What measures are you taking to protect personally identifiable information (PII)
within your systems, including protection against linkage attacks? Make sure you are
compliant with regional laws in this area and are not putting your reputation at risk
from privacy infringement, even if legal.
If you have customers in Europe, what additional steps will you need to take to
become compliant with GDPR? Remember that GDPR fines reach 4 per cent of global
revenue.
If your organization does not have a privacy officer, whom can you consult for
questions related to privacy and data protection laws? There are global firms that can
provide advice spanning multiple jurisdictions.
When was the last time you reviewed an important internal report and realized the
terminology used was unclear or the data was inaccurate? What steps did you take to
address the problem? Perhaps you want to initiate an internal reporting governance
programme, such as the one outlined in this chapter.
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
Learning Objectives
A huge ecosystem of connected devices, known as the Internet of Things, has been expanding
across the globe for the last two decades. Now an overwhelming number of the objects around
us can collect, process and send data to other objects, applications or servers. They
span numerous industries and use cases, including manufacturing, medicine, automotive,
security systems, transportation and more.
The IoT system can function and transfer information in the online mode only when devices
are safely connected to a communication network.
What makes such a connection possible? The invisible language allowing physical objects to
“talk” to each other consists of IoT standards and protocols. General protocols used for
personal computers, smartphones or tablets may not suit specific requirements (bandwidth,
range, power consumption) of IoT-based solutions. That is why multiple IoT network
protocols have been developed and new ones are still evolving.
There is a multitude of connectivity options at an engineer’s disposal. This section explains the
complicated abbreviations and helps you make sense of the Internet of Things standards.
The first device connected to the global net appeared in 1982: a Coca-Cola vending machine
that could report its temperature and keep track of the number of bottles in it. The term
“Internet of Things” is generally considered to have been coined in 1999 by Kevin Ashton, an
RFID technology researcher.
In the 1990s, all IoT-related activities came down to theoretical concepts, discussions and
individual ideas. The 2000s and 2010s were a period of rapid development when IoT projects
began to succeed and found certain practical applications. Multiple small and large projects
were created, from intelligent lamps and fitness trackers to self-driving cars and smart cities.
This was made possible because of the emergence of wireless connections that could transfer
information over a long distance and the increased bandwidth of Internet communications.
The IoT grew to a completely “different Internet,” so that not all existing protocols were able
to satisfy its needs and provide seamless connectivity. That’s why it became a vital necessity
to create specialized IoT communication protocols and standards. However, some existing
technologies (e.g. HTTP) are also used by the Internet of Things.
The Internet of Things is penetrating every aspect of our daily lives. The IoT phenomenon is
already around us: it is made up of the ordinary objects we use at home, at work or in the
streets. The difference is that all these objects and devices are computerized. They have
embedded network connectivity, can communicate with phones and other gadgets, get
information and remain under control.
As the IoT trend morphs into an industry, the need for reliable, comprehensive developer
toolkits is increasing. IoT developer toolkits provide teams with the tools they need to access
specific networks, test hardware responses to application changes and manage updates. The
driving force behind the Internet of Things projects is more accessible hardware and more
flexible programming languages.
Tools for the Internet of Things (IoT) Development
IoT development generally requires the management of both an actuator and an endpoint.
The endpoint monitors the connected device, searching for a specific value that triggers the
actuator into action. This may be a connected home environment system that allows the user
to monitor the temperature of the home and adjust the thermostat settings remotely, or it
could be a security system that tracks movements within a building and alerts specified users
of changes.
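As an illustration of that monitor-and-act loop, here is a minimal Python sketch of a
thermostat-style endpoint. The sensor reading and the heater command are hypothetical
stand-ins for real device drivers or gateway API calls, and the target temperature is an
assumed user setting.

import time

TARGET_TEMP_C = 21.0     # assumed user-set thermostat value
TOLERANCE_C = 0.5

def read_temperature() -> float:
    """Stand-in for a real temperature sensor driver or gateway API call."""
    return 19.2

def set_heater(on: bool) -> None:
    """Stand-in for the actuator command sent to the endpoint."""
    print("heater", "ON" if on else "OFF")

def control_loop(poll_seconds: float = 30.0, iterations: int = 3) -> None:
    """Poll the sensor and trigger the actuator when the reading crosses the threshold."""
    for _ in range(iterations):
        current = read_temperature()
        set_heater(current < TARGET_TEMP_C - TOLERANCE_C)
        time.sleep(poll_seconds)

control_loop(poll_seconds=0.1)    # shortened delay so the sketch runs quickly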
Developing applications for connected devices generally requires that the solution provide
the following (a minimal sketch of a few of these items follows the list below):
Endpoint authentication
Session creation
Session destruction and logout
User accounts and management
Individual user billing details as needed
User recent API activity
Individual device data plan details
Individual device details
Device claiming and activation
Device ordering
Incoming and outgoing SMS management
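The sketch below (Python, with hypothetical in-memory stores and field names) illustrates
three of these items in their simplest possible form: device claiming and activation, endpoint
authentication, and session creation and destruction.

import secrets

# Hypothetical in-memory stores standing in for a real device registry and
# session backend; the identifiers and fields are illustrative only.
registered_devices = {"device-001": {"activation_code": "AB12CD", "claimed_by": None}}
sessions = {}

def claim_device(device_id: str, activation_code: str, user: str) -> bool:
    """Device claiming and activation: bind a device to a user account."""
    device = registered_devices.get(device_id)
    if device and device["activation_code"] == activation_code:
        device["claimed_by"] = user
        return True
    return False

def create_session(device_id: str, user: str) -> str:
    """Endpoint authentication and session creation for a claimed device."""
    device = registered_devices.get(device_id)
    if not device or device["claimed_by"] != user:
        raise PermissionError("device has not been claimed by this user")
    token = secrets.token_hex(16)
    sessions[token] = {"device": device_id, "user": user}
    return token

def destroy_session(token: str) -> None:
    """Session destruction / logout."""
    sessions.pop(token, None)

claim_device("device-001", "AB12CD", user="alice")
token = create_session("device-001", user="alice")
destroy_session(token)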
IoT solutions distinguish mainstream technology companies from those on the leading edge.
Even companies that operate primarily outside of the tech sector will see benefits in terms of
marketplace recognition and brand identity when successful smart initiatives are launched.
Connecting with the right talent is necessary for creating valuable solutions that fill this
market need.
IoT development tools help create smart objects. The first thing you need in order
to build and launch connected products is a platform. There are plenty available today. Each
platform can be an ideal fit for some applications, but not for others, due to the different
characteristics that come with it. Below we’ve listed some popular IoT development
platforms.
IBM Watson
https://www.ibm.com/watson
This platform enables connecting all types of devices and easily developing custom IoT
solutions. The advantages of IBM Watson are obvious: quick and secure connectivity; an
ability to control your data flows; online data analysis; the visualization of critical risks and
the automation of operational responses to them.
Azure
https://azure.microsoft.com/en-in/
The Azure IoT development platform by Microsoft has some important capabilities. It enables
you to collect, analyse and visualize data, integrate it with back-office systems, store and
process large data sets and manage devices. Azure is highly scalable, and it supports a great
number of devices and operating systems.
AWS IoT
https://aws.amazon.com/iot/
AWS IoT is a managed cloud-based platform that supports billions of devices around the world.
It provides secure and easy interaction even if the devices are offline. Amazon’s data centers
are equipped with multiple security levels and ensure seamless access and the safety of your
data. The main advantage of this platform is that no hardware infrastructure is needed. AWS
offers low prices without long-term commitments.
When choosing a platform, it’s necessary to decide on the operating system. There are certain
limitations to be considered: low processing power, a smaller amount of RAM and storage.
The most commonly used operating systems for such built-in computers are Linux, Android
and Ubuntu Core. But there is a great number of other IoT OSs available.
Contiki is an open-source operating system for the Internet of Things created by a worldwide
team of developers. It provides powerful low-power Internet communication, supports fully
standard IPv6 and IPv4, along with the recent low-power wireless standards: 6lowpan, RPL
and CoAP. Contiki runs on a range of low-power wireless devices; its applications are written
in standard C; development with Contiki is easy and fast.
ARM mbed OS is an open-source embedded operating system. It includes all the necessary
features to develop a connected product. Mbed OS provides multilayer security and a wide
range of communication options with drivers for Bluetooth Low Energy, Thread, 6LoWPAN,
Ethernet and WiFi. What’s more, necessary libraries are included automatically on your
devices to facilitate code writing.
The ThingBox is a set of software already installed and configured on an SD card. The
ThingBox allows anyone to graphically create an unlimited number of new applications that
interact with connected objects from a simple web browser. This OS is suitable and easy to use
for both technical people and users with no technical background.
Huawei LiteOS is a lightweight, low energy, efficient operating system. It starts up within
milliseconds and responds within microseconds. LiteOS coordinates multiple sensors and
supports long- and short-distance communication.
Raspbian is one of the most widely used platforms for the Internet of Things. It is a free
system optimized for the Raspberry Pi hardware. Raspbian includes basic programs and
utilities to make the hardware run, and it also offers more than 35,000 pre-compiled packages
for easy installation.
Android Things is an operating system from Google. It lets you build professional, mass-
market products on a trusted platform, without previous knowledge of embedded system
design. Android Things lets you leverage existing Android development tools, APIs, resources
and regular security updates. Android Things supports the development
of IoT products at scale.
Nowadays, IoT software uses more general programming languages than it used to. The
choice of language for your smart service depends on its compatibility with the system, the
code size and memory, general requirements and whether your developer is familiar with this
or that language. Some languages are suitable for general-purpose projects (e.g. Java), others
are more specific (e.g. Parasail). Here is a list of the main languages in use:
1. C and C++ are quite universal and familiar to many programmers. Both languages are
designed to be written close to the hardware they run on, which helps produce efficient code
for a specific embedded system.
2. Java is highly portable and able to run on various hardware. This is a real advantage for
IoT.
3. JavaScript is the most widespread language on the Internet. As the greater part of the
Internet already speaks JavaScript, it’s a great option for IoT, too. When the connected
devices and the servers speak the same language, it’s much easier to make them work
together. It’s also possible to reuse the same JavaScript functions for different devices.
4. Python is an interpreted language, so it is flexible and easy to use in the IoT world. Python
is especially good for data-heavy applications.
5. Go, Rust, Forth, Parasail, B# — these languages were not retrofitted for embedded use but
were designed with systems and embedded programming in mind, so they fit the Internet of
Things well.
IoT developers have numerous open-source tools for the Internet of Things at their disposal.
Utilizing the tools we’ve listed below, you’ll be able to develop successful solutions with ease.
Arduino Starter Kit. This is a cloud-based system that offers both software and hardware. It
can be used even by beginner programmers.
Home Assistant. This tool is aimed at the smart home market and is great for interaction with
smart sensors in your home. The downside is that it doesn’t have a cloud component.
Zetta. This is a cloud-based platform built on Node.js. Zetta is perfect for turning devices into
APIs.
Device Hive. This tool functions as an M2M communications framework. Device Hive is quite
popular for the development of smart homes.
ThingSpeak. This is one of the oldest and most effective tools for IoT applications on the
market. ThingSpeak can process huge volumes of data; it is used in web applications and
location-tracking tasks. This tool is able to work with other open-source tools (a minimal
example of pushing a sensor reading to ThingSpeak follows this list).
Node-RED. This is a browser-based tool for wiring the Internet of Things together. It helps
manage data flows and integrates with APIs, services and devices.
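As mentioned in the ThingSpeak entry above, here is a minimal example of pushing a single
sensor reading to a ThingSpeak channel over its HTTP update endpoint, using Python and the
requests library. The write API key is a placeholder you would obtain when creating your own
channel.

import requests

THINGSPEAK_WRITE_KEY = "YOUR_WRITE_API_KEY"    # placeholder; issued with your channel

def push_reading(temperature_c: float) -> int:
    """Send one sensor reading to a ThingSpeak channel field and return the entry id."""
    response = requests.get(
        "https://api.thingspeak.com/update",
        params={"api_key": THINGSPEAK_WRITE_KEY, "field1": temperature_c},
        timeout=10,
    )
    response.raise_for_status()
    return int(response.text)    # ThingSpeak returns 0 when the update is rejected

# push_reading(22.5)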
5.7 Best IoT Development Kits
IoT development is interesting not only for large organizations but for small businesses and
individual developers as well. Here’s a top list of the best tools for the Internet of Things for
hobbyists and start-ups.
https://www.techworld.com/picture-gallery/apps-wearables/-best-iot-starter-kits-for-developers-
3637481/
1. ARM mBed
2. Relayr
3. Microsoft Azure IoT Starter Kits
4. BrickPi
5. VERVE2
6. Kinoma Create
7. Ninja Sphere
8. AWS IoT Starter Kits
9. Helium Development Kit
The Internet of Things makes ordinary physical objects smarter and broadens our horizons.
Together with these amazing possibilities, security problems arise, as all the connected
devices are subject to cyber-attacks and data leaks. That’s why security has to be
integrated at every stage of IoT service development and deployment.
A special organization — The IoT Security Foundation — was launched in 2015 in England.
This is evidence that the world of IoT has become an integral part of modern society and its
safety is on the agenda.
The IoT ecosystem is currently experiencing a period of rapid growth. According to Ericsson,
in 2018, the number of smart sensors and devices will exceed the number of mobile phones
and will become the largest category of connected devices.
Analysts of the company predict that by 2022, there will be about 29 billion connected
devices, and around 16 billion of them will be associated with IoT.
According to the statistics portal Statista, the global smart home market will reach almost $60
billion in 2017.
An IoT system has a three-level architecture: devices, gateways and data systems. The data
moves between these levels via four types of transmission channels.
1. Device to device (D2D) — direct contact between two smart objects when they share
information instantaneously without intermediaries. For example, industrial robots and
sensors are connected to one another directly to coordinate their actions and perform the
assembly of components more efficiently. This type of connection is not very common yet,
because most devices are not able to handle such processes.
2. Device to gateway — data transfer from devices to an intermediate gateway, which
aggregates and pre-processes the data before passing it on to the data systems.
3. Gateway to data systems — data transmission from a gateway to the appropriate data
system. To determine what protocol to use, you should analyse data traffic (frequency of
bursts and congestion, security requirements and how many parallel connections are
needed).
4. Between data systems — information transfer within data centres or clouds. Protocols
for this type of connection should be easy to deploy and integrate with existing apps, have
high availability, capacity and reliable disaster recovery.
Networks are divided into categories based on the distance range they provide.
A nanonetwork — a set of small devices (sized a few micrometres at most) that perform very
simple tasks such as sensing, computing, storing and actuation. Such systems are applied in
biomedical, military and other nanotechnology fields.
BAN (Body Area Network) — a network to connect wearable computing devices that can be
worn either fixed on the body, or near the body in different positions, or embedded inside
the body (implants).
PAN (Personal Area Network) — a net to link up devices within a radius of roughly one or a
couple of rooms.
LAN (Local Area Network) — a network covering the area of one building.
CAN (Campus/Corporate Area Network) — a network that unites smaller local area networks
within a limited geographical area (enterprise, university).
MAN (Metropolitan Area Network) — a big network for a certain metropolitan area powered
by the microwave transmission technology.
WAN (Wide Area Network) — a network that exists over a large-scale geographical area and
unites different smaller networks, including LANs and MANs.
Mesh Networks
Wireless nets can also be categorized according to their topology, i.e. a connectivity
configuration. There may be various combinations of connections between nodes: line, ring,
star, mesh, fully connected, tree, bus.
Mesh networks offer the most benefits compared to other types of networks since they
don’t have a hierarchy or a central hub, and each node is connected to as many other nodes as
possible. Information can be routed more directly and efficiently, which prevents
communication problems. This makes mesh networks an excellent solution for connected
objects.
Businesses are adopting the Internet of Things ever more actively. The data generated by
connected devices can create efficiencies and bring your company to the next level.
Any IoT device should connect to other devices, sensors, apps and data networks to transfer
information. An IoT platform serves as a mediator to unite all of them in one system.
Combining many of the tools, such a platform stores, analyses and manages the plethora of
data generated by the connected assets.
The most popular IoT platforms are still the solutions by the leading vendors such as Amazon,
Microsoft and IBM. But there are lots of other good options on the market. Here, we provide
a review of the best Internet of Things platforms 2019.
Google Cloud IoT
Google’s IoT offering provides:
A device manager — to register devices with the service, monitor and configure them
Protocol bridges (MQTT and HTTP) — to connect devices to Google Cloud Platform
Google Cloud automatically integrates with Internet of Things hardware producers such
as Intel and Microchip.
SAP Leonardo
SAP is the leading German software company. In 2017, it launched Leonardo as a purely IoT
platform. But later, it was relaunched as a “digital innovation system” in order to integrate
more emerging technologies in one place, such as Artificial Intelligence, Machine Learning,
Big Data, advanced analytics and blockchain. Since one technology is not enough to deliver
good outcomes for customers, this integral
approach is really worthwhile. When technologies
are viewed and implemented jointly, it’s easier to
support businesses in any digital aspect and
accelerate time to value.
SAP Leonardo is predicted to be the leading platform for the Internet of Things 2018. The
platform offers accelerator packages. An accelerator is a fixed-price package tailored to
specific industries and functions. It comprises methodologies, the necessary licenses,
development and design services. Accelerators help customers create apps from the initial
prototype to the final solution.
SAP Leonardo offers two services:
Ready-made applications (e.g. SAP Service Ticketing)
Micro services and APIs (e.g. the SAP Streaming Analytics micro service) that can be
integrated into the customer’s own applications
Cisco IoT Cloud Connect is originally an offering for mobile operators. This
mobility-cloud-based software suite is on the list of the best Internet of
Things cloud platforms. It allows customers to fully optimize and utilize
networks, provides real-time visibility and updates every level of the
network.
The German IT company Bosch has become a full-service provider of connectivity and the
Internet of Things with its own open source IoT
platform. Now, it can compete with the big players
such as Amazon and IBM.
The Bosch IoT Suite is a flexible open source Platform as a Service (PaaS). Bosch focuses on
efficiency and safety and provides cloud services for typical IoT projects. Prototype
applications can be quickly set up and deployed within minutes. Software developers can
operate them at high availability.
Salesforce IoT
The advantages of the Salesforce offering are high speed, a simple point-and-click UI and an
easy-to-use, more meaningful user experience. Even non-technical users can easily derive
benefits from digital projects.
Another important feature is MyIoT — a declarative interface for building apps on top of
connected assets data.
Hitachi Lumada
Japanese IT vendor Hitachi with its Lumada is also on the list of the
best IoT platforms 2018. Lumada is a comprehensive service as it
includes the Internet of Things, Artificial Intelligence and Machine
Learning technologies. Therefore, it delivers the most advanced
opportunities to turn data into intelligent action, solving customer
problems before they occur.
Lumada focuses on industrial IoT deployments, which is why it can be run both on-premises
and in the cloud.
GE Predix
Predix, the industrial IoT platform from General Electric (GE), includes:
A catalog of app templates that are used off-the-shelf with your data from connected
devices
A low-code Studio that helps non-technical users build industrial IoT apps
Customers can manage connected devices using Predix as a dashboard and create virtual
models (digital twins) of assets to predict and optimize their performance.
Predix runs on the major public cloud infrastructure providers. For instance, GE created
Predix-ready devices in partnership with Verizon, Cisco and Intel. Recently, the company
partnered with Apple to bring Predix apps to iOS devices.
There is no definite answer to the question of which platform is best, since no one platform is
suitable for every digital project. The choice will always depend on the specific requirements
of your business.
Large enterprises are more likely to turn to giants such as Amazon or Microsoft. Their
offerings are the most established, but also the most expensive. Smaller companies may find
more cost-efficient options that will nevertheless perfectly meet their requirements.
Now, let’s get to the specifics of IoT wireless protocols, standards and
technologies. There are numerous options and alternatives, but we’ll discuss
the most popular ones.
MQTT
MQTT (Message Queue Telemetry Transport) is a lightweight protocol for sending simple data
flows from sensors to applications and middleware.
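As a small illustration, the sketch below publishes one sensor reading over MQTT using the
Eclipse Paho client for Python. The broker address and topic name are assumptions made for
the example; in a real deployment you would point it at your own (secured) broker.

# Requires the Eclipse Paho client:  pip install paho-mqtt
import paho.mqtt.publish as publish

BROKER = "test.mosquitto.org"               # assumed public test broker
TOPIC = "demo/greenhouse/temperature"       # assumed topic name

def publish_reading(value_celsius: float) -> None:
    """Publish one small sensor reading over MQTT (QoS 1 = delivered at least once)."""
    publish.single(TOPIC, payload=str(value_celsius), qos=1, hostname=BROKER, port=1883)

publish_reading(21.7)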
DDS
DDS (Data Distribution Service) is an IoT standard for real-time, scalable and high-
performance machine-to-machine communication. It was developed by the Object
Management Group (OMG).
You can deploy DDS both in low-footprint devices and in the cloud.
AMQP
AMQP (Advanced Message Queuing Protocol) is an open standard for passing business
messages between applications. The processing chain of the protocol includes three
components that follow certain routing rules: an exchange receives messages, bindings define
how they are routed, and queues store them until consumers read them.
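To make that chain concrete, here is a minimal Python sketch using the pika client against a
RabbitMQ-style AMQP broker: it declares an exchange, a queue and a binding, then publishes
one message. The broker address and all of the names are assumptions made for the
illustration.

# Requires a running AMQP broker (e.g. RabbitMQ) and:  pip install pika
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# The three AMQP components: an exchange receives messages, a binding holds the
# routing rule, and a queue stores messages until a consumer reads them.
channel.exchange_declare(exchange="sensors", exchange_type="topic")
channel.queue_declare(queue="temperature_readings")
channel.queue_bind(queue="temperature_readings", exchange="sensors",
                   routing_key="greenhouse.temperature")

channel.basic_publish(exchange="sensors", routing_key="greenhouse.temperature",
                      body="21.7")
connection.close()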
Bluetooth
Bluetooth is well-known to mobile users. But not long ago, the new
significant protocol for IoT apps appeared — Bluetooth Low-Energy (BLE),
or Bluetooth Smart. This technology is a real foundation for the IoT, as it
is scalable and flexible to all market innovations. Moreover, it is designed
to reduce power consumption.
Standard: Bluetooth 4.2
Frequency: 2.4GHz
Range: 50-150m (Smart/BLE)
Data Rates: 1Mbps (Smart/BLE)
Zigbee
ZigBee 3.0 is a low-power, low data-rate wireless network used mostly in industrial settings.
The Zigbee Alliance even created the universal language for the Internet of Things
— Dotdot — which makes it possible for smart objects to work securely on any network and
seamlessly understand each other.
WiFi
Cellular
Cellular technology is the basis of mobile phone networks. But it is also suitable for the IoT
apps that need functioning over longer distances. They can take
advantage of cellular communication capabilities such as GSM,
3G, 4G (and 5G soon).
LoRaWAN
LoRaWAN can provide low-cost mobile and secure bidirectional communication in various
industries.
Standard: LoRaWAN
Frequency: Various
Range: 2-5km (urban area), 15km (suburban area)
Data Rates: 0.3-50 kbps
5.13 Standards Bodies
ANSI / ISA-100.11a-2011 for “Wireless Systems for Industrial Automation: Process Control
and Related Applications” was approved in September 2014 and published as IEC 62734.
It provides a definition of reliable and secure wireless operations including monitoring,
alerting, supervisory control, open-loop control, and closed-loop applications. After
initial approval by ANSI in 2011, compliant device production began in earnest and over
130,000 connected devices had appeared by the end of 2012. ISA / IEC 62443 (formerly
ISA-99) provides a standard for automation system security.
ISO standards relevant to IoT include ISO18185 for RFID and numerous other supply
chain and sensor standards (ranging from device interfaces designed to monitor
conditions to sensor networking and network security frameworks). At the time of
publication, ISO/AWI 18575 was planned to address products and product packages for
IoT in the supply chain. ISO is often seen as providing a valuable resource for reference
architectures, specifications, and testing procedures.
ISO/IEC JTC/SWG 5
This joint technical committee (JTC) of ISO and IEC produced a subcommittee / working
group (SWG) that identifies market requirements and standardization gaps. It documents
standardization activity for IoT from groups internal and external to ISO and IEC. Areas
of collaboration this SWG focuses on include accessibility, user interfaces, software
and systems engineering, IT education, IT sustainability, sensor networking, automatic
identification and data capture, geospatial information, shipping, packaging, and thermal
performance and energy usage.
In early 2015, W3C launched a Web of Things initiative to develop web standards based
on IoT and what it calls “a web of data.” Many previous W3C standards efforts are
fundamental to IoT development including XML, SOAP, WSDL, and REST.
Contiki (http://www.contiki-os.org)
Contiki provides an open source development environment (written in C) used to
connect low-cost and low-power micro-controllers to the Internet (IPv6 and IPv4). The
environment includes simulators and regression tests.
Eclipse (http://iot.eclipse.org)
Eclipse provides frameworks for developing IoT gateways including Kura (Java and
OSGi services) and Mihini (written in Lua scripts). Industry services are bundled in a
SmartHome project consisting of OSGi bundles and an Eclipse SCADA offering. Tools
and libraries are provided for Message Queuing Telemetry Transport (MQTT), the
Constrained Application Protocol (CoAP), and OMA-DM and OMA LWM2M device
management protocols.
openHAB (http://www.openhab.org)
An open source project called openHAB produced software capable of integrating home
automation systems and technologies through a common interface. It can be deployed to
any intelligent device in the home that can run a Java Virtual Machine (JVM). It includes
a rules engine enabled through user control and provides interfaces via popular mobile
devices (Android, iOS) or via the web.
ThingSpeak (http://www.thingspeak.org)
ThingSpeak provides APIs for “channels” enabling applications to store and retrieve data
and for “charts” providing visualization. ThingHTTP enables a device to connect to a web
service using HTTP over a network or the Internet. Links into Twitter are also provided for