ML Unit 1
1. **Deletion**: Remove rows (or columns) that contain missing values.
```python
import pandas as pd

# Example dataset with missing values
data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

# Drop any row that contains a missing value
df_dropped = df.dropna()
```
2. **Simple Imputation**: Fill in missing values with a summary statistic such
as the column mean, median, or mode.
```python
from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```
3. **Model-Based Imputation**: Predict each missing value from the other
features with a machine learning model (here a random forest, wired into
scikit-learn's IterativeImputer).
```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Iteratively predict missing entries from the other columns
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=0))
df_model_imputed = pd.DataFrame(rf_imputer.fit_transform(df), columns=df.columns)
```
Each approach has its own advantages and limitations, and the choice depends
on the nature of the data, the amount of missing data, and the specific
requirements of the machine learning task.
1. **Data Preprocessing**:
- This stage involves preparing the dataset for model training by cleaning,
transforming, and scaling the data as necessary.
- Example: Suppose you have a dataset of house prices with features like
number of bedrooms, square footage, and location. In the preprocessing stage,
you might handle missing values, encode categorical variables, and scale
numerical features to a common range.
2. **Model Selection and Training**:
- This stage involves selecting an appropriate machine learning algorithm,
training the model on the prepared dataset, and tuning hyperparameters to
optimize performance.
- Example: After preprocessing the house price dataset, you might choose to
use a regression algorithm such as linear regression, decision tree regression,
or a more sophisticated model like random forest regression. You then train the
chosen model on the preprocessed data and adjust hyperparameters through
techniques like cross-validation to improve performance.
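As a concrete (and deliberately tiny) sketch of these two stages, the snippet
below uses scikit-learn on an invented house-price table; the column names,
values, and the choice of a random forest are illustrative assumptions, not a
prescribed setup.
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy house-price data (values invented for illustration)
houses = pd.DataFrame({
    'bedrooms': [3, 2, None, 4, 3],
    'sqft':     [1500, 900, 1200, 2000, None],
    'location': ['urban', 'rural', 'urban', 'suburban', 'rural'],
    'price':    [300000, 150000, 220000, 450000, 180000],
})
X, y = houses.drop(columns='price'), houses['price']

# Stage 1: preprocessing - impute and scale the numeric features,
# one-hot encode the categorical feature
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), ['bedrooms', 'sqft']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['location']),
])

# Stage 2: model selection and training - score with cross-validation,
# then fit the chosen pipeline on all the data
model = Pipeline([('prep', preprocess),
                  ('rf', RandomForestRegressor(random_state=0))])
scores = cross_val_score(model, X, y, cv=2, scoring='neg_mean_absolute_error')
model.fit(X, y)
```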
These stages are iterative and may involve revisiting previous steps based on
the evaluation results or changes in requirements. Additionally, it's crucial to
continually monitor and update the model as new data becomes available or as
the underlying problem domain evolves.
4. What are the differences between machine learning and deep learning?
Ans-

| S.No. | Machine Learning | Deep Learning |
|-------|------------------|---------------|
| 1 | Its models take less time to train because they are comparatively small. | Training takes a huge amount of time because of the very large number of data points. |
| 2 | The results of an ML model are easy to explain. | The results of deep learning are difficult to explain. |
1. **Spam Filters**:
- Supervised learning algorithms, such as Naive Bayes or Support Vector
Machines, are employed by email clients to differentiate between spam and
non-spam emails (see the sketch after this list).
- The algorithms are trained on a labeled dataset consisting of examples of
both spam and legitimate emails.
- During training, the algorithms learn patterns and features characteristic of
spam and use this knowledge to classify incoming emails as either spam or
non-spam.
2. **Fraud Detection**:
- Financial institutions utilize supervised learning algorithms to detect
fraudulent transactions in real-time.
- These algorithms are trained on a dataset containing labeled examples of
fraudulent and non-fraudulent transactions.
- By analyzing transaction features such as amount, location, and frequency,
the algorithms learn to identify anomalous patterns indicative of fraudulent
activity.
3. **Recommendation Systems**:
- Online platforms like Netflix and Amazon leverage supervised learning
algorithms to provide personalized recommendations to users.
- These algorithms learn from historical user interactions, such as movies
watched or products purchased, in a labeled dataset.
- Using techniques like collaborative filtering or matrix factorization, the
algorithms predict user preferences and suggest similar items that users might
be interested in.
4. **Speech Recognition**:
- Voice assistants like Siri and Alexa rely on supervised learning algorithms to
understand and respond to spoken commands.
- The algorithms are trained on a dataset containing transcribed speech
paired with the corresponding text labels.
- By analyzing the acoustic features of speech signals and their corresponding
textual representations, the algorithms learn to recognize and interpret spoken
commands accurately.
5. **Image Classification**:
- Image recognition systems, such as those employed by social media
platforms, use supervised learning algorithms to classify images based on their
content.
- These algorithms are trained on a labeled dataset comprising images
annotated with their corresponding categories (e.g., cats, dogs).
- Through techniques like convolutional neural networks (CNNs), the
algorithms learn hierarchical representations of image features and can
accurately classify new images into predefined categories.
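To make the spam-filter example concrete, here is a minimal sketch of training
a Naive Bayes classifier on a tiny labeled corpus; the example messages and
labels are invented purely for illustration.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus (invented): 1 = spam, 0 = not spam
emails = ["win a free prize now", "meeting at 10am tomorrow",
          "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]

# Bag-of-words features feeding a Naive Bayes classifier
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# Classify a new message; likely [1] (spam) given the learned word patterns
print(spam_filter.predict(["free prize waiting for you"]))
```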
1. **Interpretability**:
- If interpretability is crucial, simpler models like logistic regression or
decision trees may be preferred as they provide easily interpretable results.
- More complex models like random forests or neural networks may offer
higher accuracy but are often considered black-box models, making
interpretation more challenging.
2. **Scalability**:
- Consider the scalability of the algorithm with respect to the size of the
dataset. Some algorithms, like linear models, are highly scalable and suitable
for large datasets, while others, like k-nearest neighbors, may be less efficient.
3. **Performance Metrics**:
- Consider the evaluation metrics relevant to your problem (e.g., accuracy,
precision, recall, F1-score) and choose algorithms that optimize those metrics
effectively.
- For imbalanced datasets, algorithms that handle class imbalance well, such
as those with class weights or algorithms specifically designed for imbalanced
data, may be preferred.
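As one illustration of the last point, many scikit-learn classifiers can
re-weight classes for imbalanced data; the sketch below uses a synthetic
imbalanced dataset (an assumption for the example) and logistic regression.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# class_weight='balanced' up-weights the minority class during training
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# F1-score is a more informative metric than accuracy here
print(f1_score(y_test, clf.predict(X_test)))
```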
1. Classification Metrics: accuracy, precision, recall, F1-score, and ROC-AUC.
2. Regression Metrics: mean absolute error (MAE), mean squared error (MSE),
root mean squared error (RMSE), and R².
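Both families of metrics are available in sklearn.metrics; a minimal sketch
with made-up labels and predictions:
```python
from sklearn.metrics import (accuracy_score, f1_score,
                             mean_squared_error, r2_score)

# Made-up classification labels and predictions
y_true_cls, y_pred_cls = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))  # fraction of correct predictions
print(f1_score(y_true_cls, y_pred_cls))        # harmonic mean of precision and recall

# Made-up regression targets and predictions
y_true_reg, y_pred_reg = [2.0, 3.5, 4.0], [2.2, 3.0, 4.1]
print(mean_squared_error(y_true_reg, y_pred_reg))  # average squared error
print(r2_score(y_true_reg, y_pred_reg))            # proportion of variance explained
```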
1. Train-Test Split:
The simplest method is to split your data into a training set and a testing
set. Train each candidate model on the training set and evaluate their
performance on the testing set. Choose the model with the best
performance on the testing set.
2. K-Fold Cross-Validation:
A more robust method is to split the data into k folds, train each
candidate model on k-1 folds, and validate on the held-out fold, rotating
so every fold serves as the validation set once; the k scores are then
averaged. With either method, the overall selection procedure is (see the
sketch after these steps):
1. Split your dataset into training and validation sets (or use cross-
validation).
2. Train each candidate model on the training set (or k-1 folds in
cross-validation).
3. Evaluate each model’s performance on the validation set (or the
kth fold in cross-validation) using appropriate evaluation
metrics.
4. Compare the models’ performance and select the best one for
your problem.
5. Train the chosen model on the entire dataset and use it to make
predictions on new data.
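A minimal sketch of this procedure, comparing two candidate models with 5-fold
cross-validation; the candidates and the iris dataset are chosen only for
illustration.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
candidates = {
    'logreg': LogisticRegression(max_iter=1000),
    'forest': RandomForestClassifier(random_state=0),
}

# Steps 1-3: evaluate each candidate with 5-fold cross-validation
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

# Steps 4-5: pick the best performer and refit it on all the data
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(best_name, scores)
```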
**Steps**:
1. **Split Data**: Split the dataset into training (70%), validation (15%), and
test (15%) sets.
2. **Train Models**: Train each candidate model (e.g., random forest, gradient
boosting) on the training set using default hyperparameters.
3. **Select Best Model**: Choose the model with the highest F1-score on the
validation set. Let's say the GBM model performs the best.
4. **Final Evaluation**: Evaluate the selected GBM model on the test set to get
an unbiased estimate of its performance. If the F1-score on the test set is
satisfactory and the model generalizes well, proceed to deployment (a minimal
sketch of these steps follows below).
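A minimal sketch of these steps on synthetic data; the candidate models are
invented for illustration, with "GBM" standing in for scikit-learn's
GradientBoostingClassifier.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 70% train, 15% validation, 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=0)

# Train candidates with default hyperparameters; select by validation F1
candidates = {'rf': RandomForestClassifier(random_state=0),
              'gbm': GradientBoostingClassifier(random_state=0)}
val_f1 = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_f1[name] = f1_score(y_val, model.predict(X_val))

best = candidates[max(val_f1, key=val_f1.get)]

# Unbiased final estimate on the held-out test set
print(f1_score(y_test, best.predict(X_test)))
```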
Throughout this process, it's essential to document each step, including the
evaluation metrics, hyperparameters, and model performance, to ensure
reproducibility and transparency.