3 - Big Data


Big data

This article defines what exactly ‘big data’ is, explains how it can be used to
inform and implement business strategy, and gives examples of how it is
being used by different industries today.

Introduction
Big data is part of the Strategic Business Leader (SBL) syllabus:

D2. Discuss how big data can be used to inform and implement business
strategy.

There are many definitions of the term ‘big data’ but most suggest something like the
following:

'Extremely large collections of data (data sets) that may be analysed to reveal patterns,
trends, and associations, especially relating to human behaviour and interactions.'

In addition, many definitions state that the data sets are so large that conventional
methods of storing and processing the data will not work.

In 2001 Doug Laney, an analyst with Gartner (a large US IT consultancy company),
stated that big data has the following characteristics, known as the 3Vs:

• Volume
• Variety
• Velocity

These characteristics, and sometimes additional ones, have been generally adopted as
the essential qualities of big data.

[Figure: The 3Vs: characteristics of big data]

The most common fourth 'V' to be added is:

• Veracity – is the data true and can its accuracy be relied upon?

Volume
The volume of big data held by large companies such as Walmart (supermarkets),
Apple and eBay is measured in multiple petabytes. What is a petabyte? It’s 10^15 bytes
(characters) of information. A typical disc on a personal computer (PC) holds 10^9 bytes
(a gigabyte), so the big data depositories of these companies hold at least the data that
could typically be held on 1 million PCs, perhaps even 10 to 20 million PCs.
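
The PC comparison is simple arithmetic, and a minimal sketch makes it easy to check. It takes a petabyte as 10^15 bytes and the article's "typical" PC disc as 10^9 bytes; the function name `pcs_needed` is invented for this illustration.

```python
PETABYTE = 10**15  # bytes in one petabyte
PC_DISC = 10**9    # bytes on the article's "typical" PC disc (one gigabyte)


def pcs_needed(petabytes: int) -> int:
    """Return how many 1 GB PC discs are needed to hold this many petabytes."""
    return petabytes * PETABYTE // PC_DISC


for pb in (1, 10, 20):
    print(f"{pb} PB = {pcs_needed(pb):,} PCs")
```

One petabyte comes to 1 million such PCs, and 10 to 20 petabytes to 10 to 20 million, which matches the figures quoted above.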

These numbers probably mean little even when converted into equivalent PCs. It is
more instructive to list some of the types of data that large companies will typically
store.

Retailers
Via loyalty cards being swiped at checkouts: details of all purchases you make, when,
where, how you pay, use of coupons.

Via websites: every product you have ever looked at, every page you have visited,
every product you have ever bought.

Social media (such as Facebook and Twitter)
Friends and contacts, postings made, your location when postings are made,
photographs (that can be scanned for identification), any other data you might choose to
reveal to the universe.

Mobile phone companies
Numbers you ring, texts you send (which can be automatically scanned for key words),
every location your phone has ever been whilst switched on (to an accuracy of a few
metres), your browsing habits and your voice mails.

Internet providers and browser providers
Every site and every page you visit. Information about all downloads and all emails
(again these are routinely scanned to provide insights into your interests). Search terms
which you enter.

Banking systems
Every receipt, payment, credit card information (amount, date, retailer, location),
location of ATMs used.

Variety
Some of the variety of information can be seen from the examples listed above. In
particular, the following types of information are held:

• Browsing activities: sites, pages visited, membership of sites, downloads, searches
• Financial transactions
• Interests
• Buying habits
• Reaction to advertisements on the internet or to advertising emails
• Geographical information
• Information about social and business contacts
• Text
• Numerical information
• Graphical information (such as photographs)
• Oral information (such as voice mails)
• Technical information, such as jet engine vibration and temperature analysis

This data can be both structured and unstructured:


Structured data: this data is stored within defined fields (numerical, text, date etc) often
with defined lengths, within a defined record, in a file of similar records. Structured data
requires a model of the types and format of business data that will be recorded and how
the data will be stored, processed and accessed. This is called a data model. Designing
the model defines and limits the data which can be collected and stored, and the
processing that can be performed on it.

An example of structured data is found in banking systems, which record the receipts
and payments from your current account: date, amount, receipt/payment, short
explanations such as payee or source of the money.

Structured data is easily accessible using well-established database structured query
languages.
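
To make the idea of a data model concrete, here is a minimal sketch of the banking example using Python's built-in `sqlite3` module and SQL. The table name, column names and sample rows are all invented for illustration; a real banking system would be far more elaborate.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The data model: defined fields with defined types in a defined record.
conn.execute(
    """
    CREATE TABLE transactions (
        tx_date     TEXT NOT NULL,
        amount      REAL NOT NULL,
        kind        TEXT NOT NULL CHECK (kind IN ('receipt', 'payment')),
        description TEXT
    )
    """
)

# Hypothetical sample records.
rows = [
    ("2024-01-03", 1500.00, "receipt", "Salary"),
    ("2024-01-05", 42.50, "payment", "Supermarket"),
    ("2024-01-09", 30.00, "payment", "Mobile phone bill"),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)


def total_by_kind(connection):
    """Total the amounts for receipts and for payments with one SQL query."""
    cursor = connection.execute(
        "SELECT kind, SUM(amount) FROM transactions GROUP BY kind ORDER BY kind"
    )
    return dict(cursor.fetchall())


totals = total_by_kind(conn)
print(totals)
```

Because every record has the same fields, a single short query can summarise the whole table. The same design also limits what can be stored: a row whose `kind` is neither 'receipt' nor 'payment' would be rejected.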

Unstructured data: this refers to information that does not have a pre-defined data model.
It comes in all shapes and sizes and it is this variety and irregularity which makes it
difficult to store in a way that will allow it to be analysed, searched or otherwise used.
An often quoted statistic is that 80% of business data is unstructured, residing in word
processor documents, spreadsheets, PowerPoint files, audio, video, social media
interactions and map data.

Here is an example of unstructured data and an example of its use in a retail
environment:

You enter a large store and have your mobile phone with you. That allows your
movement round the store to be tracked. The store might or might not know who you
are (depending on whether it knows your mobile phone number). The store can record
what departments you visit, and how long you spend in each. Security cameras in the
ceiling match up your image with the phone, so now they know what you look like and
would be able to recognise you on future visits. You pass near a particular product and
previous records show that you had looked at that product before, so a text message
can be sent perhaps reminding you about it, or advertising a 10% price reduction.
Perhaps the store has a marketing campaign promising that it will never be undersold,
so when you pass near products you might be making a price comparison, and the store
has to check prices on other stores' websites and message you with a new price. If you
buy the product, the store might have further marketing opportunities for related
products and consumables, and this data also has to be recorded. You pay with an
affinity credit card (a card associated with another organisation, such as a charity
or an airline), so now the store has some insight into your interests. Perhaps you buy
several products and the store will want to discover if these items are generally bought
together.

So just walking round a store can generate a vast quantity of data which will be very
different in size and nature for every individual.
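
One question from the store scenario above, whether items are generally bought together, can be sketched very simply by counting how often each pair of products appears in the same basket. The baskets below are invented for illustration, and real retailers would use far larger data sets and more refined measures than a raw count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is one customer's purchases.
baskets = [
    {"printer", "ink", "paper"},
    {"printer", "ink"},
    {"paper", "pens"},
    {"printer", "ink", "pens"},
]


def pair_counts(all_baskets):
    """Count how many baskets contain each pair of products."""
    counts = Counter()
    for basket in all_baskets:
        # Sorting gives every pair one consistent (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, f"bought together in {n} of {len(baskets)} baskets")
```

Here the pair ('ink', 'printer') appears in three of the four baskets, which is the kind of association a store could use when suggesting related products.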
Velocity
Information must be provided quickly enough to be of use in decision making. For
example, in the above store scenario, there would be little use in obtaining the price-
comparison information and texting customers once they had left the store. If facial
recognition is going to be used by shops and hotels, it has to be more or less instant so
that guests can be welcomed by name.

You will understand that volume and variety conspire against velocity, so
methods have to be found to process huge quantities of non-uniform, awkward data in
real time.

Software for big data


Without getting too technical on this issue, a library of software known as Apache
Hadoop is specifically designed to allow for the distributed processing of large data sets
(ie big data) across clusters of computers using simple programming models. (Clusters
of computers are needed to hold the vast volume of information.) Hadoop is designed
to scale up from single servers to thousands of machines, each offering local
computation and storage.
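
The best-known of Hadoop's programming models is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain Python on a single machine, counting words in a few chunks of text. It does not use the Hadoop API; the function names and sample text are invented, and in a real cluster each chunk would be processed on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all the emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a chunk of data held on a different machine.
chunks = ["big data big", "data sets", "big clusters"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The map step needs to see only its own chunk, which is why the work can be spread across many machines, each using its local computation and storage.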

The processing of big data is generally known as big data analytics and includes:

• Data mining: analysing data to identify patterns and establish relationships such as
associations (where several events are connected), sequences (where one event
leads to another) and correlations.
• Predictive analytics: a type of data mining which aims to predict future events. For
example, the chance of someone being persuaded to upgrade a flight.
• Text analytics: scanning text such as emails and word processing documents to
extract useful information. It could simply be looking for key-words that indicate an
interest in a product or place.
• Voice analytics: as above but with audio.
• Statistical analytics: used to identify trends, correlations and changes in behaviour.
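
The simplest form of text analytics mentioned in the list above, looking for key-words that indicate an interest, can be sketched in a few lines. The key-word list and the sample message are hypothetical, and real text analytics would go well beyond exact word matching.

```python
import re

# Hypothetical key-words suggesting an interest in travel.
KEYWORDS = {"holiday", "flight", "hotel"}


def find_keywords(text, keywords=KEYWORDS):
    """Return, in alphabetical order, the key-words that occur in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words & keywords)


message = "Thinking about a holiday in June. Any cheap flight ideas?"
print(find_keywords(message))
```

For this message the function finds 'flight' and 'holiday', which a retailer might treat as a sign of interest in travel products.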

Google provides website owners with Google Analytics, which tracks many features of
website traffic. For example, the website OpenTuition.com provides free ACCA study
resources. Google Analytics reports statistics such as the following:

[Table: Geographical distribution of users]

[Table: Type of browser used]

[Table: Age of user]

The final table is instructive. OpenTuition.com does not ask for users’ ages, so this data
has been pieced together from other information available to Google. It has been able to
do this for only about 58% of users.

The analytical findings can lead to:

• better marketing
• better customer service and relationship management
• increased customer loyalty
• increased competitive strength
• increased operational efficiency
• the discovery of new sources of revenue.

Other examples of the use of big data


Netflix: this company began as a DVD mailing service and developed algorithms to
help it to predict viewers’ preferences and habits. Now it delivers films over the internet
and can easily collect information about when movies are watched, how often films
might be stopped and restarted, where they might be abandoned, and how users rate
films. This allows Netflix to predict which films will be popular with which customers. The
data is also being used by Netflix to produce its own TV series, with much greater assurance
that these will be hits.

Amazon: the world’s leading e-retailer collects huge amounts of information about
customers’ preferences and habits which allow it to market very accurately to each
customer. For example, it routinely makes recommendations to customers based on
books or DVDs previously purchased.

Airlines: they know where you’ve flown, preferred seats, cabin class, when you fly, how
often you search for a flight before booking, how susceptible you are to price reductions,
probably which airline you might book with instead, whether you are returning with them
but didn’t fly out with them, whether car hire was purchased last time, what class of
hotel you might book through their site, which routes are growing in popularity,
seasonality of routes. They also know the profitability of each customer so that, for
example, if a flight is cancelled they can help the most valuable customers first.

This information allows airlines to design new routes and timings, match routes to
planes and also to make individualised offers to each potential passenger.

Disease epidemic identification: in 2009, Google was able to track the spread of
influenza across the USA faster than the government’s Centers for Disease Control and
Prevention. How? They monitored users entering terms like ‘Flu symptoms’, ‘Flu
remedies’ and ‘High temperature’. This connection was uncovered by web analytics looking
at popular search terms then finding a correlation with other information confirming
influenza infections. Of course, you have to be careful drawing conclusions about
correlations: the association between the use of search terms and the outbreak of flu
might be driven by news articles on the spread of the epidemic rather than the epidemic
itself.
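
A correlation of this kind can be measured with the Pearson correlation coefficient, sketched below. The weekly figures are invented purely for illustration and are not Google's or the government's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented weekly figures, for illustration only.
searches = [120, 150, 310, 480, 400, 220]  # searches for flu-related terms
cases = [10, 14, 30, 52, 41, 19]           # confirmed influenza cases

r = pearson(searches, cases)
print(f"correlation = {r:.2f}")
```

A value close to 1 shows only that the two series rise and fall together. It cannot show whether the searches were prompted by illness or by news coverage, which is exactly the caution raised above.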

Target: Target is the second largest discount retailer in the USA. There is an often
quoted story about their ability to predict when a customer is pregnant – frequently
before the customer has informed her family. By looking at about 25 products it is
claimed that they can create a pregnancy predictor. For example, early pregnancy often
causes morning sickness so consumers would perhaps change to blander food and less
perfumed shower gel. Why would Target be interested in knowing whether a consumer
is pregnant? That person will require different products during the pregnancy and then,
in a few months, the baby will have its own product needs: nappies, baby shampoo and
clothes. Early identification of pregnancy can allow Target to establish the shopping
habits of the mother and perhaps even the preferences of the child.

Dangers of big data


Despite the examples of the use of big data in commerce, particularly for marketing and
customer relationship management, there are some potential dangers and drawbacks.

Cost: It is expensive to establish the hardware and analytical software needed, though
these costs are continually falling.

Regulation: Some countries and cultures worry about the amount of information that is
being collected and have passed laws governing its collection, storage and use.
Breaking a law can have serious reputational and punitive consequences.

Loss and theft of data: Apart from the consequences arising from regulatory breaches
as mentioned above, companies might find themselves open to civil legal action if data
were stolen and individuals suffered as a consequence.

Incorrect data (veracity): If the data held is incorrect or out of date, incorrect
conclusions are likely. Even if the data is correct, some correlations might be spurious,
leading to false positive results.

Employee monitoring: data collection methods allow employees to be monitored in
detail every second of the day. Some companies place sensors in name badges so that
employee movements and interactions at work can be monitored. The badges monitor
to whom each employee talks and in what tone of voice. Stress levels can also be measured
from voice analysis. Obviously, this information could be used to reduce stress
levels and to facilitate better interactions, but you will easily see how it could be
used to put employees under severe pressure.

Adapted from an article originally written by Ken Garrett (a freelance lecturer and
writer)
