LM11 Introduction To Big Data Techniques IFT Notes


LM11 Introduction to Big Data Techniques 2024 Level I Notes

LM11 Introduction to Big Data Techniques

1. Introduction
2. How Is Fintech Used in Quantitative Investment Analysis?
3. Advanced Analytical Tools: Artificial Intelligence and Machine Learning
4. Tackling Big Data with Data Science
Summary

This document should be read in conjunction with the corresponding reading in the 2023 Level I CFA®
Program curriculum. Some of the graphs, charts, tables, examples, and figures are copyright
2022, CFA Institute. Reproduced and republished with permission from CFA Institute. All rights
reserved.

Required disclaimer: CFA Institute does not endorse, promote, or warrant the accuracy or quality of
the products or services offered by IFT. CFA Institute, CFA®, and Chartered Financial Analyst® are
trademarks owned by CFA Institute.

Ver 1.0

© IFT. All rights reserved 1



1. Introduction
This learning module covers:
• What ‘Fintech’ is and how it is used in investment analysis
• A brief explanation of Big Data, artificial intelligence, and machine learning
• Applications of Big Data and Data Science to investment management
2. How Is Fintech Used in Quantitative Investment Analysis?
The term ‘Fintech’ comes from combining ‘Finance’ and ‘Technology’. Fintech refers to
technological innovation in the design and delivery of financial products and services.
Though the term ‘Fintech’ is relatively new, its earlier forms involved data processing and
automation of routine tasks. Fintech later advanced into decision-making applications based
on complex machine learning logic.
The major drivers of fintech have been:
• Rapid growth in data
• Technological advances
While Fintech spans the entire finance space, this learning module focuses on fintech
applications that are more directly relevant to quantitative analysis in the investment
industry:
• Analysis of large datasets
• Analytical tools
Big Data
Big Data refers to the vast amounts of data generated by industry, governments, individuals, and
electronic devices.
Characteristics of big data typically include:
• Volume: Over the last few decades, the amount of data that we are dealing with has
grown exponentially.
• Velocity: It refers to the speed at which data are communicated. In the past we often
worked with batch processing; however, we are now increasingly working with
real-time data.
• Variety: Historically we only dealt with structured data. However, we are now also
dealing with unstructured data such as text, audio, video, etc.

In addition to these three V’s, a fourth V is becoming increasingly important, especially when
using big data for drawing inferences or making predictions.
• Veracity: It refers to the credibility and reliability of different data sources.
Big Data can be structured (can be organized in tables), semi-structured, or unstructured
(cannot be represented in a tabular form).
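As a rough illustration of the three forms, the sketch below shows the same kind of information held as structured, semi-structured, and unstructured data. All tickers, field names, and values are hypothetical and chosen only for illustration.

```python
import json

# Structured: every record has the same fixed columns, so it fits in a table.
structured = [
    {"ticker": "AAA", "date": "2024-01-02", "close": 10.2},
    {"ticker": "AAA", "date": "2024-01-03", "close": 10.4},
]

# Semi-structured: fields are tagged, but records need not share one shape
# and can nest (JSON is a common example).
semi_structured = json.loads(
    '{"ticker": "AAA", "filing": {"type": "annual", "pages": 120}}'
)

# Unstructured: free text with no predefined fields.
unstructured = "Management expects demand to remain strong despite higher costs."

print(structured[0]["close"])
print(semi_structured["filing"]["type"])
print(len(unstructured.split()), "words of free text")
```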
Sources of Big Data
Traditional data sources include annual reports, regulatory filings, trade price and volume,
etc. Alternative data include many other sources and types of data. A simple classification of
alternative data sources is shown in Exhibit 2 of the curriculum.
Individuals                    Business Processes    Sensors
Social media                   Transaction data      Satellites
News, reviews                  Corporate data        Geolocation
Web searches, personal data                          Internet of Things
                                                     Other sensors

Big Data Challenges


While big data can be a huge asset, it also presents several challenges. The quality of the data
may be questionable: the data may have biases, outliers, etc. The volume of data may also be a
problem, since we might be dealing with too much data or too little data. Another concern
is whether the data are appropriate. Working with Big Data therefore usually involves
cleansing and organizing the data before we start analyzing it.

3. Advanced Analytical Tools: Artificial Intelligence and Machine Learning


Artificial intelligence (AI) computer systems perform tasks that have traditionally
required human intelligence. They exhibit cognitive and decision-making ability comparable
or superior to that of human beings. An important term in this context is ‘neural networks’,
which refers to programming based on how the brain learns and processes information.
Examples of AI are all around us, such as chess-playing computer programs and digital
assistants like Apple’s Siri.
Machine learning (ML) refers to computer-based techniques that “extract knowledge from
large amounts of data by ‘learning’ from known examples and then generating structure or
predictions” without relying on any help from a human. ML algorithms aim to “find the
pattern, apply the pattern.”
In ML, the dataset is divided into three distinct subsets:
1. Training dataset: It allows the algorithm to identify relationships between inputs and
outputs based on historical patterns in the data.
2. Validation dataset: It is used to validate and tune the relationships identified
from the training dataset.
3. Test dataset: As the name implies, this dataset is used to test the model’s ability to
predict well on new data.
Once an algorithm has mastered the training and validation datasets, it can be used to
predict outcomes based on other datasets.
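The three-way split described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `split_dataset`, the 60/20/20 proportions, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_dataset(rows, train_frac=0.6, validation_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test subsets."""
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("fractions must leave room for a test dataset")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # whatever remains
    return training, validation, test

observations = list(range(100))  # stand-in for 100 historical observations
training, validation, test = split_dataset(observations)
print(len(training), len(validation), len(test))
```

Each observation lands in exactly one subset, so the test dataset stays unseen until the model is evaluated on it.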
Broadly speaking there are three main approaches to machine learning:
1. Supervised learning: In supervised learning, both inputs and outputs are identified or
labeled. After learning from labeled data, the trained algorithm is used to predict
outcomes for new data sets.
2. Unsupervised learning: In unsupervised learning, the input and output variables are
not labeled. Here we want the ML algorithm to seek relationships on its own.
3. Deep learning: In deep learning (or deep learning nets), computers use neural networks
to perform multistage, non-linear data processing to identify patterns.
Deep learning can use supervised or unsupervised machine learning approaches. Deep
learning algorithms are used for image, pattern, and speech recognition.
With terms like AI and ML one might think that human judgment is not required, but that is
far from the truth. For ML to work well, good human judgment is required. Human
judgment is required for questions such as which data to use, how much data to use, and
which analytical techniques are relevant in the given context. Human judgment may
also be needed to clean and filter the data before it is fed to the ML algorithm.
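The difference between supervised and unsupervised learning described above can be illustrated with a small sketch. The data, the labels "low" and "high", and the two toy techniques (a nearest-centroid classifier and a two-cluster k-means) are assumptions chosen for simplicity; they are not techniques named in the curriculum.

```python
from statistics import mean

# Supervised: each input comes with a label (hypothetical examples).
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
           (8.0, "high"), (8.5, "high"), (9.0, "high")]

def fit_centroids(examples):
    """Learn one average input value (centroid) per label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Assign the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

centroids = fit_centroids(labeled)
print(predict(centroids, 2.4), predict(centroids, 7.1))

# Unsupervised: the same inputs with the labels removed.
unlabeled = [value for value, _ in labeled]

def two_means(values, iterations=10):
    """Group values into two clusters by repeated re-averaging (k-means, k=2)."""
    centers = [min(values), max(values)]       # simple starting guess
    for _ in range(iterations):
        clusters = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

clusters = two_means(unlabeled)
print(clusters)
```

In the supervised half the algorithm is told which group each example belongs to; in the unsupervised half it finds the two groups on its own, without names for them.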
Some challenges associated with machine learning are:
• Over-fitting the data: Sometimes an algorithm may try to be too precise in the way it
interprets data and predicts outcomes. This leads to over-trained models and may
result in data mining bias. We try to mitigate this issue by having a good validation
dataset.
• Black box: ML techniques can be opaque or black box, which means we have
predictions that are not very easy to understand or to explain.
Despite these challenges and weaknesses, the importance of ML in finance and investment
management has been growing substantially.
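To make the over-fitting challenge concrete, the sketch below compares a model that memorizes its training data with a simple straight-line model. The data are hypothetical: the assumed true relationship is y = 2x, and the training observations carry noise of +1 or -1. The memorizing model and the function names are illustrative assumptions.

```python
from statistics import mean

# Hypothetical data: true relationship y = 2x; training values include noise.
train = [(1, 3), (2, 3), (3, 7), (4, 7), (5, 11), (6, 11)]
validation = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8), (4.6, 9.2), (5.4, 10.8)]

def fit_memorizer(data):
    """Over-fitted model: return the y of the closest training x, noise included."""
    def predict(x):
        _, nearest_y = min(data, key=lambda point: abs(point[0] - x))
        return nearest_y
    return predict

def fit_line(data):
    """Simple model: ordinary least-squares straight line."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x, _ in data))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def mean_squared_error(model, data):
    """Average squared gap between predictions and actual values."""
    return mean((model(x) - y) ** 2 for x, y in data)

memorizer, line = fit_memorizer(train), fit_line(train)
for name, model in (("memorizer", memorizer), ("line", line)):
    print(name,
          round(mean_squared_error(model, train), 3),
          round(mean_squared_error(model, validation), 3))
```

The memorizer fits the training data perfectly but does worse than the line on the validation data, which is the pattern a validation dataset is meant to expose.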

4. Tackling Big Data with Data Science


Data science leverages advances in computer science, statistics, and other disciplines for the
purpose of extracting information from Big Data.
Data Processing Methods
Data processing methods include:
• Capture: Refers to how data is collected from various sources and transformed into a
format that can be used by the analytical process.
• Curation: Refers to the process of ensuring data quality and accuracy through data
cleaning.
• Storage: Refers to how data will be recorded, archived, and accessed. It also refers to
the underlying database design. An important consideration here is whether the data
is structured, unstructured, or both. We also need to consider whether the
analytical tools need real-time access to the data.
• Search: Refers to how we can find what we want from the vast amount of data.
• Transfer: Refers to how data will move from the underlying source to the analytical
tools that are being used.
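As a small illustration of the curation step, the sketch below cleans a handful of captured records. The record layout, the tickers, and the cleaning rules (drop missing prices, non-positive prices, and exact duplicates) are assumptions for the example; real curation rules depend on the data source.

```python
# Hypothetical raw records captured from a price feed: (ticker, price).
raw_records = [
    ("AAA", 10.2),
    ("BBB", None),     # missing value
    ("AAA", 10.2),     # exact duplicate of the first record
    ("CCC", 10.4),
    ("DDD", -5.0),     # impossible negative price
    ("EEE", 9.9),
]

def curate(records):
    """Drop records with missing or non-positive prices and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        _, price = record
        if price is None or price <= 0:
            continue                   # unusable value
        if record in seen:
            continue                   # already kept an identical record
        seen.add(record)
        cleaned.append(record)
    return cleaned

cleaned = curate(raw_records)
print(cleaned)
```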
Data Visualization
Another aspect of data science is data visualization. This refers to how the data will
ultimately be presented to the analyst/user. Historically, data visualization happened
through graphs, charts, etc. However, in more recent times tools such as heat maps, tree
diagrams, and tag clouds are also being used.

An example of a heat map is a map of a city where routes with high traffic congestion are
shown in red. A tag cloud is a technique applicable to textual data. Words that appear more
often are shown in a larger font, whereas words that appear less often are shown with a
smaller font. This helps us to quickly evaluate how consumers/users are talking about a
given product.
Exhibit 3 from the curriculum shows an example of a ‘tag cloud’.
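The idea behind a tag cloud, where more frequent words get a larger font, can be sketched in a few lines. The function name `tag_cloud_sizes`, the font-size range of 10 to 40, and the sample text are assumptions for illustration; drawing the cloud itself is left out.

```python
import re
from collections import Counter

def tag_cloud_sizes(text, min_size=10, max_size=40, stop_words=frozenset()):
    """Map each word to a font size that grows with how often it appears."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in stop_words]
    counts = Counter(words)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low or 1     # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - low) * (max_size - min_size) // span
        for word, count in counts.items()
    }

reviews = "great battery great screen poor battery great price"
sizes = tag_cloud_sizes(reviews)
print(sizes)
```

Here "great" appears most often and receives the largest size, while words that appear once receive the smallest.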

Text Analytics and Natural Language Processing


Text analytics refers to the use of computer programs to derive meaning from large,
unstructured text- or voice-based data. For example, text analytics can be used to gauge the
consumer sentiment about a new product by analyzing what is being said about the product
on blogs, forums, YouTube, etc. Based on this analysis, we can determine if the sentiment is
very positive, positive, neutral, or negative.
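One very simple way to score sentiment is to count positive and negative words, as sketched below. This is only an illustration under stated assumptions: the two word lists are hypothetical and deliberately tiny, and the score thresholds are arbitrary. Practical text analytics relies on far richer methods.

```python
import re

# Hypothetical word lists for illustration only.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"poor", "hate", "terrible", "bad"}

def sentiment(text):
    """Classify text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score >= 2:
        return "very positive"
    if score == 1:
        return "positive"
    if score == 0:
        return "neutral"
    return "negative"

for comment in ("I love it, excellent screen",
                "Good value",
                "It arrived on Tuesday",
                "Terrible battery"):
    print(comment, "->", sentiment(comment))
```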
Natural language processing (NLP) is an application of text analytics whereby computers
analyze and interpret human language. For example, NLP can be used to analyze
communications from policy makers such as the US Federal Reserve. Officials at these
institutions may send subtle messages through their choice of words and inferred tone, and
NLP analysis can provide insights into these messages. Such processing is possible because
of access to Big Data and processing power.

Summary
LO: Describe aspects of “fintech” that are directly relevant for the gathering and
analyzing of financial data.
Fintech refers to the technological innovation in the design and delivery of financial products
and services.
LO: Describe Big Data, artificial intelligence, and machine learning.
Big Data refers to vast amounts of data generated by industry, governments, individuals, and
electronic devices.
Artificial intelligence (AI) computer systems perform tasks that have traditionally required
human intelligence. They exhibit cognitive and decision-making ability comparable or
superior to that of human beings.
Machine learning (ML) refers to computer-based techniques that “extract knowledge from
large amounts of data by ‘learning’ from known examples and then generating structure or
predictions” without relying on any help from a human. In ML, the dataset is divided into
three distinct subsets: a training dataset, a validation dataset, and a test dataset. There are
three main approaches to machine learning: supervised learning, unsupervised learning, and
deep learning.
LO: Describe applications of Big Data and Data Science to investment management.
Text analytics refers to the use of computer programs to derive meaning from large,
unstructured text- or voice-based data.
Natural language processing (NLP) is an application of text analytics whereby computers
analyze and interpret human language.
