Machine Learning - Lec1


Machine Learning
Arthur Samuel, an early American leader in the field of computer gaming and
artificial intelligence, coined the term “Machine Learning” in 1959 while at
IBM. He defined machine learning as

“the field of study that gives computers the ability to learn without being
explicitly programmed”.
● Machine learning is programming computers to optimize a performance
criterion using example data or past experience.
● A model is defined up to some parameters, and learning is the execution
of a computer program to optimize the parameters of the model using
the training data or past experience.
● The model may be predictive to make predictions in the future, or
descriptive to gain knowledge from data.
Definition by Tom Mitchell (1998): Machine Learning is the study of
algorithms that

• improve their performance P
• at some task T
• with experience E.

A well-defined learning task is given by <P, T, E>.

● Machine learning is a subfield of artificial intelligence (AI) that focuses on
the development of algorithms and statistical models that enable
computers to improve their performance on a specific task through
learning from data.

When do we use Machine Learning?
ML is used when:

• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)


Components of ML Solution
Components of ML algorithm
Brief Review of data
Examples
Natural Language Processing:
● Sentiment Analysis: Analyzing text data to determine the sentiment (positive,
negative, neutral) behind a piece of text, which is useful for social media monitoring
and customer feedback analysis.
● Language Translation: Tools like Google Translate use machine learning to translate
text from one language to another.
● Chatbots: NLP models power chatbots and virtual assistants to provide automated
customer support or answer questions.
● Image Classification: Machine learning models can classify images into different
categories. For instance, they can identify whether an image contains a cat or a dog,
helping with applications like content moderation and image search.
Recommendation Systems:
● Online retailers use machine learning to suggest products to customers based on
their browsing and purchase history.
● Streaming services like Netflix use recommendation algorithms to suggest movies
and TV shows.
Speech Recognition:
● Voice assistants like Siri and Alexa use machine learning to convert spoken
language into text and execute commands.
● Speech-to-text applications convert audio recordings into transcribed text.
Healthcare:
● Machine learning can be used for medical image analysis, such as detecting tumors
in medical scans.
● Predictive analytics can help identify potential health risks for patients based on
their medical history.
Financial Services:
● Credit scoring models use machine learning to assess the creditworthiness of
individuals and businesses.
● Fraud detection algorithms analyze transaction data to identify potentially
fraudulent activities.
Autonomous Vehicles:
● Self-driving cars use machine learning to process sensor data, make decisions, and
navigate the vehicle safely.
Manufacturing and Quality Control:
● Machine learning models can be used to monitor and predict
equipment maintenance needs, reducing downtime in
manufacturing.
● Quality control systems can identify defects in products on the
assembly line.
Agriculture:
● Crop yield prediction models can help farmers optimize planting and
harvesting schedules.
● Image analysis can be used to identify crop diseases or pests.
Energy Efficiency:
● Machine learning can optimize energy consumption in buildings, reducing
energy costs and environmental impact.
● Predictive maintenance for power plants and grid infrastructure can prevent
outages and reduce downtime.
Gaming:
● Game developers use machine learning for non-player character (NPC)
behavior, game testing, and procedural content generation.
Text Generation and Summarization:
● Machine learning models can generate human-like text or automatically
summarize long documents.
Types of Machine Learning Algorithms

Supervised (inductive) learning: training data includes desired outputs

Unsupervised learning: training data does not include desired outputs

Semi-supervised learning: training data includes a few desired outputs

Reinforcement learning: rewards from a sequence of actions

Machine Learning Models

Supervised Learning (task driven, pre-categorized/labeled data): predictions
and predictive models
● Regression (e.g., Linear Regression): divide the ties by length
● Classification (e.g., Logistic Regression, Decision Tree): divide the socks
by color

Unsupervised Learning (data driven, unlabeled data): pattern/structure
recognition
● Clustering: divide by similarity
● Association: identify sequences
● Dimensionality Reduction: compress data based on correlated features
Types of learning
Supervised Learning: In supervised learning, the algorithm is trained on a
labeled dataset, where both the input data and the correct output are
provided. The goal is to learn a mapping from inputs to outputs, enabling
the algorithm to make predictions on new, unseen data.

Unsupervised Learning: Unsupervised learning involves training


algorithms on unlabeled data, with the aim of discovering hidden
patterns, structures, or groupings within the data. Common techniques
include clustering and dimensionality reduction.
Semi-Supervised Learning: This approach combines elements of both
supervised and unsupervised learning. It leverages a small amount of
labeled data and a larger amount of unlabeled data to make predictions
or learn patterns.
Reinforcement Learning: In reinforcement learning, an agent learns to
interact with an environment in order to achieve a goal by taking actions
and receiving feedback in the form of rewards or punishments. The
agent's objective is to learn a policy that maximizes the cumulative
reward.
Supervised Machine Learning
● Algorithm learns from a labeled dataset, making it possible to make
predictions or decisions without being explicitly programmed.
● In supervised learning, the algorithm is trained on input-output pairs,
where the input data is associated with the correct output or target.
● The primary goal of supervised learning is to learn a mapping from input
data to output data, allowing the model to generalize and make
predictions on new, unseen data.
Key Components of SML
● Labeled Data: The training dataset consists of pairs of input data and
corresponding output labels. For example, in a spam email classifier, the input data
might be email text, and the output labels would be "spam" or "not spam."

● Training Phase: During the training phase, the algorithm learns the relationship
between the input data and the output labels. It adjusts its internal parameters or
model to minimize the error or difference between its predictions and the actual
labels in the training data.
● Testing and Evaluation: Once the model is trained, it is evaluated on a
separate dataset, known as the testing or validation set. The model's
performance is assessed by comparing its predictions to the true labels in
the testing set.
● Prediction: After successful training and evaluation, the model can be
used to make predictions on new, unseen data. These predictions are
based on the learned patterns from the training data.
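The labeled-data, training, evaluation, and prediction steps above can be sketched end to end with a toy one-feature classifier; all numbers and the threshold rule are invented for illustration:

```python
# Toy supervised workflow: train on labeled pairs, evaluate on held-out data.
# The "model" is a single learned threshold on a 1-D feature (illustrative only).

def train(data):
    """Learn a threshold midway between the two class means."""
    mean0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    mean1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    return (mean0 + mean1) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

def accuracy(threshold, data):
    return sum(predict(threshold, x) == y for x, y in data) / len(data)

# Labeled data: (feature, label) pairs; a separate set is held out for testing.
train_set = [(1.0, 0), (1.5, 0), (2.0, 0), (5.0, 1), (5.5, 1), (6.0, 1)]
test_set = [(1.2, 0), (5.8, 1), (2.1, 0), (4.9, 1)]

threshold = train(train_set)          # training phase
print(accuracy(threshold, test_set))  # testing/evaluation phase -> 1.0
print(predict(threshold, 4.2))        # prediction on new, unseen data -> 1
```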
Types of SML
● Classification: In classification tasks, the goal is to predict a discrete label or
category. Common examples include spam detection, image classification (e.g.,
identifying objects in images), and sentiment analysis (categorizing text as positive,
negative, or neutral).
● Regression: Regression tasks involve predicting a continuous numerical value.
Examples include predicting house prices based on features like square footage and
location, forecasting stock prices, and estimating a person's age based on various
attributes.
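As a concrete regression sketch, a one-feature linear model can be fitted with the closed-form least-squares solution; the house-size and price numbers below are invented:

```python
# Simple (one-feature) linear regression via the closed-form least-squares fit.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

sizes  = [50, 80, 100, 120]    # square metres (hypothetical)
prices = [150, 240, 300, 360]  # price in $1000s (exactly 3 * size here)

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)        # 3.0 0.0
print(slope * 90 + intercept)  # predicted price for a 90 m^2 house: 270.0
```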
Applications of Supervised Learning
● Recognizing patterns: facial identities or facial expressions, handwritten
or spoken words, medical images
● Generating patterns: generating images or motion sequences
● Recognizing anomalies: unusual credit card transactions, unusual
patterns of sensor readings in a nuclear power plant
● Prediction: future stock prices or currency exchange rates
Algorithms used in SML
● Linear Regression: For regression tasks where the relationship between input
features and the target is assumed to be linear.
● Logistic Regression: For binary or multi-class classification tasks.
● Decision Trees: Tree-based models for both classification and regression tasks.
● Random Forest: An ensemble method that combines multiple decision trees for
improved performance.
● Support Vector Machines (SVM): Used for classification tasks and finding a
hyperplane that best separates classes.
● Neural Networks: Deep learning models with multiple layers of interconnected nodes,
suitable for various tasks, including image recognition and natural language processing.
● K-Nearest Neighbors (K-NN): A simple classification and regression algorithm based on the
similarity of data points.

The choice of algorithm depends on the specific problem, the characteristics of the data, and
the desired performance metrics. In supervised machine learning, the quality and quantity of
labeled data are critical, as they directly impact the model's ability to make accurate
predictions.
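A minimal sketch of one algorithm from the list, k-nearest neighbors, using made-up 2-D points and labels:

```python
# Minimal k-nearest-neighbors classifier: predict the majority label
# among the k training points closest to the query.
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    # Sort training points by Euclidean distance to the query.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.5, 1.5)))  # "a"
print(knn_predict(points, labels, (8.5, 8.5)))  # "b"
```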
Unsupervised Machine Learning
● Unsupervised machine learning is a type of machine learning where the
algorithm is trained on unlabeled data, and its goal is to discover patterns,
structures, or relationships within the data without specific guidance or
labeled output.
● Unlike supervised learning, where the algorithm is given labeled examples
to learn from, unsupervised learning is used when you want the algorithm
to explore and find inherent structures or insights within the data itself.
Clustering:
A clustering problem is one in which you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
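The grouping idea can be sketched with a bare-bones k-means loop; the 1-D data and initial centers are invented:

```python
# Bare-bones k-means: alternate between assigning points to their nearest
# center and moving each center to the mean of its assigned points.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]   # two obvious groups
centers, clusters = kmeans(points, centers=[0.0, 5.0])
print(centers)   # roughly [1.0, 9.5]
```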

Association:
An association rule learning problem is one in which you want to discover rules
that describe large portions of your data, such as “people who buy X also tend
to buy Y.”
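A rule such as “people who buy X also buy Y” is usually scored by its support and confidence; here is a minimal sketch over invented shopping transactions:

```python
# Support and confidence for a candidate association rule "lhs -> rhs"
# over a list of transactions (itemsets).

def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

# How often do bread-buyers also buy butter?
print(support(transactions, {"bread", "butter"}))       # 0.5
print(confidence(transactions, {"bread"}, {"butter"}))  # about 0.667
```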
Dimensionality Reduction:
● In machine learning classification problems, there are often too many factors
on the basis of which the final classification is done.
● These factors are variables called features. The higher the number of
features, the harder it becomes to visualize the training set and work with it.
● Often, many of these features are correlated and hence redundant. This is
where dimensionality reduction algorithms come into play.
● Dimensionality reduction is the process of reducing the number of random
variables under consideration, by obtaining a set of principal variables.
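The core computation behind extracting a principal variable can be sketched with power iteration on the covariance matrix (the idea at the heart of PCA); the 2-D points below are invented:

```python
# Find the direction of maximum variance in 2-D data: build the covariance
# matrix, then use power iteration to approximate its leading eigenvector
# (the first principal component).
import math

def principal_component(points, iters=50):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance matrix entries [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    # Power iteration: repeatedly apply the matrix and renormalize;
    # the vector converges to the leading eigenvector.
    vx, vy = 1.0, 0.0
    for _ in range(iters):
        vx, vy = a * vx + b * vy, b * vx + c * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    return vx, vy

# Points lying almost exactly on the line y = x:
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
vx, vy = principal_component(data)
print(vx, vy)   # both components positive and nearly equal: the y = x direction
```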
Key Concepts of USML
● Unlabeled Data: In unsupervised learning, the training data consists of raw input
data without associated output labels. This data could be in the form of text,
images, numerical features, or any other data type.

● Clustering: Clustering is a common unsupervised learning technique where the
algorithm groups similar data points together based on some similarity measure.
Common clustering algorithms include k-means, hierarchical clustering, and
DBSCAN.
● Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the
number of features or variables in the data while preserving the most important
information. Principal Component Analysis (PCA) and t-distributed Stochastic
Neighbor Embedding (t-SNE) are examples of dimensionality reduction methods.

● Anomaly Detection: Unsupervised learning can be used to identify rare or
anomalous data points in a dataset, which can be valuable for fraud detection,
network security, and quality control.
● Association Rule Learning: This technique discovers interesting associations or
relationships between different variables in the data. Apriori and FP-growth are
common algorithms used for association rule learning.
● Autoencoders: Autoencoders are neural network architectures used for
dimensionality reduction and feature learning. They consist of an encoder and a
decoder that work together to represent data in a lower-dimensional space and
then reconstruct it.
● Density Estimation: Density estimation techniques model the underlying probability
distribution of the data. Gaussian Mixture Models (GMMs) are an example of a
density estimation method.
Applications
● Customer Segmentation: Clustering customers based on their behavior or
characteristics to improve marketing strategies.
● Image Compression: Using dimensionality reduction techniques to reduce
the size of images while preserving essential details.
● Topic Modeling: Discovering topics within a collection of documents
without predefined categories, which is useful in text analysis and
document categorization.
● Anomaly Detection: Identifying unusual patterns or events in data, such as fraud
detection in financial transactions or equipment failures in manufacturing.
● Recommendation Systems: Grouping users or items with similar preferences,
helping to make personalized recommendations for products, content, or services.
● Reducing Noise in Data: In data preprocessing, unsupervised learning can help
reduce noise and redundancy in the data, making it cleaner and more manageable
for subsequent analysis.

Unsupervised machine learning is particularly valuable for exploring and understanding


complex datasets, finding hidden patterns, and preparing data for further analysis or
supervised learning tasks. It is a versatile tool in data analysis and pattern recognition.
Semi-Supervised Machine Learning
● Semi-supervised machine learning is a learning paradigm that combines
elements of both supervised and unsupervised learning.
● In semi-supervised learning, the algorithm is trained on a dataset that
contains a mixture of labeled and unlabeled data.
● This approach is particularly useful when acquiring a large amount of
labeled data is expensive or time-consuming but a smaller labeled dataset
is available.
Key Characteristics of Semi-Supervised Learning
● Labeled Data: A small portion of the dataset is labeled, meaning that some data
points have known output labels or target values, while the majority of the data
remains unlabeled.
● Unlabeled Data: The majority of the dataset consists of unlabeled data points,
which do not have associated output labels.
● Semi-Supervised Learning Algorithms: Semi-supervised learning algorithms use the
combination of labeled and unlabeled data to learn patterns and make predictions.
These algorithms typically leverage both supervised and unsupervised techniques
to improve model performance.
Benefits & Use Cases
● Reduced Labeling Effort: Semi-supervised learning can significantly reduce the cost
and effort associated with labeling data since only a small portion of the data
needs to be labeled.
● Improved Model Generalization: Combining labeled and unlabeled data can often
lead to better generalization and model performance, as the model has access to a
larger dataset.
● Limited Availability of Labeled Data: In cases where acquiring labeled data is
difficult, such as in medical imaging or rare event detection, semi-supervised
learning can be very beneficial.
● Active Learning: Semi-supervised learning can be used in conjunction with active
learning, where the algorithm selects the most informative data points for labeling,
optimizing the use of available labeled data.
● Weakly Supervised Learning: In some cases, semi-supervised learning is a step
toward weakly supervised learning, where data is only partially labeled or labeled
at a higher level of granularity.
● Semi-supervised learning algorithms may include variations of traditional
supervised algorithms adapted to incorporate unlabeled data, or they can involve
more complex techniques that take advantage of both labeled and unlabeled
information. Popular methods include self-training, co-training, and multi-view
learning.
● Semi-supervised learning can be particularly useful in scenarios where acquiring
fully labeled datasets is impractical or costly, such as in medical diagnostics, natural
language processing, and computer vision. It offers a middle ground between the
data efficiency of supervised learning and the exploratory capabilities of
unsupervised learning.
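The self-training method mentioned above can be sketched with a toy nearest-class-mean model: fit on the labeled points, then repeatedly pseudo-label the unlabeled points the model is most confident about. The data, the distance threshold, and the model itself are all invented for illustration:

```python
# Self-training sketch on 1-D data with a nearest-class-mean "model".

def class_means(labeled):
    """Mean feature value per class label."""
    means = {}
    for label in set(y for _, y in labeled):
        xs = [x for x, y in labeled if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def self_train(labeled, unlabeled, threshold=1.0, rounds=5):
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        means = class_means(labeled)
        confident = []
        for x in unlabeled:
            # Assign the nearest class; treat small distance as confidence.
            label, dist = min(((l, abs(x - m)) for l, m in means.items()),
                              key=lambda t: t[1])
            if dist < threshold:
                confident.append((x, label))   # accept the pseudo-label
        if not confident:
            break
        labeled += confident
        taken = {p for p, _ in confident}
        unlabeled = [x for x in unlabeled if x not in taken]
    return class_means(labeled)

labeled = [(0.0, "low"), (10.0, "high")]   # only two labeled examples
unlabeled = [0.5, 1.2, 9.0, 9.8, 5.0]      # mostly unlabeled data
print(self_train(labeled, unlabeled))      # means shift as pseudo-labels accrue
```

Note that the ambiguous point 5.0 is never pseudo-labeled: it stays outside the confidence threshold, which is exactly the caution that keeps self-training from reinforcing its own mistakes.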
ML in Practice
Data Processing
Data processing is the task of converting data from a given form to a much
more usable and desired form, i.e., making it more meaningful and informative.
Using machine learning algorithms, mathematical modeling, and statistical
knowledge, this entire process can be automated. The output of this complete
process can be in any desired form, such as graphs, videos, charts, tables, or
images, depending on the task we are performing and the requirements of the
machine.
1. Data collection: This is the process of gathering data from various sources,
such as sensors, databases, or other systems. The data may be structured or
unstructured, and may come in various formats such as text, images, or audio.
2. Data preprocessing: This step involves cleaning, filtering, and transforming
the data to make it suitable for further analysis. This may include removing
missing values, scaling or normalizing the data, or converting it to a different
format.
3. Data analysis: In this step, the data is analyzed using various techniques
such as statistical analysis, machine learning algorithms, or data visualization.
The goal of this step is to derive insights or knowledge from the data.
4. Data interpretation: This step involves interpreting the results of the data
analysis and drawing conclusions based on the insights gained. It may also
involve presenting the findings in a clear and concise manner, such as through
reports, dashboards, or other visualizations.
5. Data storage and management: Once the data has been processed and
analyzed, it must be stored and managed in a way that is secure and easily
accessible. This may involve storing the data in a database, cloud storage, or
other systems, and implementing backup and recovery strategies to protect
against data loss.
6. Data visualization and reporting: Finally, the results of the data analysis are
presented to stakeholders in a format that is easily understandable and actionable.
This may involve creating visualizations, reports, or dashboards that highlight key
findings and trends in the data.
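Step 2 above (preprocessing) can be sketched as a minimal cleaning and scaling pass; the sensor readings, and the use of None to represent missing values, are assumptions for illustration:

```python
# Minimal preprocessing pass: drop missing entries, then min-max scale
# the remaining values into the [0, 1] range.

def clean(values):
    """Remove missing entries (represented here as None)."""
    return [v for v in values if v is not None]

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [12.0, None, 18.0, 15.0, None, 30.0]  # hypothetical sensor readings
cleaned = clean(raw)
scaled = min_max_scale(cleaned)
print(scaled)   # [0.0, ~0.333, ~0.167, 1.0]
```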
There are many tools and libraries available for data processing in ML, including
pandas for Python, and the Data Transformation and Cleansing tool in RapidMiner.
The choice of tools will depend on the specific requirements of the project, including
the size and complexity of the data and the desired outcome.
Advantages of Data Processing in ML
1. Improved model performance: Data processing helps improve the performance of the
ML model by cleaning and transforming the data into a format that is suitable for
modeling.
2. Better representation of the data: Data processing allows the data to be transformed
into a format that better represents the underlying relationships and patterns in the
data, making it easier for the ML model to learn from the data.
3. Increased accuracy: Data processing helps ensure that the data is accurate,
consistent, and free of errors, which can help improve the accuracy of the ML model.
Disadvantages of Data Processing in ML
1. Time-consuming: Data processing can be a time-consuming task, especially for large
and complex datasets.
2. Error-prone: Data processing can be error-prone, as it involves transforming and
cleaning the data, which can result in the loss of important information or the
introduction of new errors.
3. Limited understanding of the data: Data processing can lead to a limited
understanding of the data, as the transformed data may not be representative of the
underlying relationships and patterns in the data.