Final Project Report


Topic:- Applying Machine Learning Algorithms for Analyzing and Predicting Agricultural (Crop) Performance with Different Types of Fertilizer, Temperature, Humidity, and Rainfall

Submitted By:-Saswata Banerjee

Submitted To:-Partha Koley

Course Name:-Machine Learning With Python

Euphoria GenX

----------------------------X--------------------------------
Index:-

1. Abstract
2. Acknowledgement
3. SDK (Kits)
4. Model
5. Machine Learning
6. Supervised and Unsupervised Learning
7. Python
8. Workflow Project
9. The Elbow Method
10. Distribution of Agricultural Conditions
11. Prediction of Crops
12. Confusion Matrix using Logistic Regression and K-Means
13. Classification Report for Logistic Regression
14. Source Code and Output
15. Conclusion
16. Future Scope
17. Bibliography

------------------------------------------------------------------------X-----------------------------------------------------------------------------------

1.Abstract:-

Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lie the tools and methods that drive it, from processing the massive volumes of data generated each day to learning from them and taking useful action. In this report we first introduce the characteristics and features of the Python programming language. Python is one of the most preferred languages for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries. This report offers insight into the field of machine learning with Python, taking a tour through the important topics and Python libraries that make developing a machine learning model an easy process. We then look at the different types of machine learning and various machine learning algorithms. Finally, we look at one of the most widely used models, Linear Regression.

Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task: it predicts the value of one variable based on the value of another. The variable you want to predict is called the dependent variable, and the variable you use to make the prediction is called the independent variable.

Hypothesis function for linear regression:-

y = mx + c

where m is the slope of the line and c is the intercept. Finally, this report walks through a linear regression model for an ice-cream selling company that predicts the sales made by the business at different temperatures.

Keywords:- Python; Machine Learning; Artificial Intelligence; Regression; Linear

Regression.

2.Acknowledgement:-

SDT (Software Development Tools):- Machine learning involves using algorithms to allow computer software programs to 'learn' different tasks from the available data. ML programs become more accurate the more relevant data they are trained on.

There are many types of software development tools. Some of those used in this project are:-

1.numpy:-

Work:-In Python we have lists that serve the purpose of arrays, but they are slow to process.

NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray; it provides many supporting functions that make working with ndarray very easy.

Arrays are very frequently used in data science, where speed and resources are very important.

Languages:-Python, C, and C++.
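As a minimal illustration (assuming NumPy is installed), an ndarray supports fast vectorized operations without explicit Python loops; the values below are made up:

import numpy as np
# Create an ndarray from a Python list
yields = np.array([2.5, 3.1, 2.8, 3.6])   # e.g., tonnes per hectare
# Vectorized arithmetic: no explicit Python loop is needed
scaled = yields * 1000                    # convert to kg per hectare
print(scaled.mean(), scaled.max())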

2.pandas:-

Work:-Pandas allows us to analyze big data and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Language:-Python.
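A small illustrative sketch of cleaning a messy dataset with pandas; the values and column names here are made up:

import pandas as pd
# Hypothetical messy data with a missing value and a duplicate row
df = pd.DataFrame({
    "temperature": [25.6, None, 31.2, 31.2],
    "rainfall":    [102.4, 88.9, 61.3, 61.3],
})
df = df.drop_duplicates()   # remove duplicate rows
df = df.dropna()            # drop rows with missing values
print(df.describe())        # quick statistical summary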

3.Matplotlib:-

Work:-Matplotlib is an easy-to-use and powerful visualization library in Python. It is built on NumPy arrays, designed to work with the broader SciPy stack, and provides several plot types such as line, bar, scatter, and histogram.

Language:-Python.

4.Pyplot:-

Work:-pyplot is a collection of command-style functions that make Matplotlib work

like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a

figure, creates a plotting area in a figure, plots some lines in a plotting area,

decorates the plot with labels, etc.

Language:-Python.
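A minimal pyplot sketch in which each call modifies the current figure; the data is made up for illustration:

import matplotlib.pyplot as plt
temperature = [20, 25, 30, 35]
sales = [120, 180, 260, 310]
plt.figure()                              # create a figure
plt.plot(temperature, sales, marker="o")  # plot a line in the plotting area
plt.xlabel("Temperature (°C)")            # decorate the plot with labels
plt.ylabel("Sales")
plt.title("Sales vs Temperature")
plt.show()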

5.Seaborn:-

Work:-Seaborn is a library for making statistical graphics in Python. It builds on top

of matplotlib and integrates closely with pandas data structures. Seaborn helps you

explore and understand your data.

Language:-Python.
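A short sketch of a statistical plot with Seaborn built on a pandas DataFrame; the columns and values are hypothetical:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "rainfall": [60, 80, 100, 120, 140],
    "yield":    [2.1, 2.6, 3.0, 3.3, 3.4],
})
sns.scatterplot(data=df, x="rainfall", y="yield")  # Seaborn works directly with DataFrames
plt.show()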

3.SDK:-This section gives an overview of Google Colab and its software development kit (SDK).

Google Colab is a cloud-based interactive computing environment that provides free

access to a Jupyter Notebook environment along with computational resources,

including CPU, GPU, and TPU. It allows users to write, run, and share Python code in a

collaborative and interactive manner. Colab is hosted on Google Drive, and notebooks

can be created, shared, and saved directly in Google Drive.

While Google Colab does not have an official SDK, it provides a Python library called

"google.colab" that allows developers to interact with the Colab environment

programmatically. The "google.colab" library provides functionality for tasks such as


importing and exporting files, installing Python packages, managing Colab sessions,
and connecting to external services like Google Drive and Google Sheets.

Some of the common tasks that can be performed using the "google.colab" library

include:-

Importing and exporting files:-The library allows you to upload and download files to and

from the Colab environment. For example, you can use the "files.upload()" function to

upload files from your local machine to Colab, and the "files.download()" function to

download files from Colab to your local machine.

Installing Python packages:-The library provides a way to install Python packages

directly from within the Colab environment using the "!pip install" command.

Managing Colab sessions:-The library allows you to manage the lifecycle of a Colab session. You can use functions like "drive.mount()" to mount your Google Drive, "drive.flush_and_unmount()" to flush and unmount the Google Drive, and "os.kill()" to terminate the current session.

Connecting to external services:-The library provides functionality to connect to external services like Google Drive and Google Sheets, allowing you to read and write data to these services from within a Colab notebook.

Interacting with Colab UI:- The library allows you to interact with the Colab user

interface programmatically, for example, by using the "IPython.display" module to

display images, videos, and other media in the output of a Colab cell.

Overall, while Google Colab does not have a standalone SDK, the "google.colab" library

provides a convenient way to interact with the Colab environment programmatically and

automate various tasks within Colab notebooks. You can import the "google.colab"

library in your Python code and use its functions to perform operations within the Colab

environment.
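For illustration, a hedged sketch of some of the calls mentioned above; it runs only inside a Colab notebook, and the file names are placeholders:

# Runs only inside Google Colab
from google.colab import files, drive
uploaded = files.upload()        # upload files from the local machine
drive.mount("/content/drive")    # mount Google Drive into the notebook
# ... work with the uploaded files or files on Drive ...
files.download("results.csv")    # download a (placeholder) file back to the local machine
drive.flush_and_unmount()        # flush pending writes and unmount Drive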

4.Model:-
SDLC Model:-

Waterfall Model:-

The waterfall is a universally accepted SDLC model. In this method, the whole process of

software development is divided into various phases.

The waterfall model is a sequential software development model in which development is seen as flowing steadily downwards (like a waterfall) through the phases of requirements analysis, design, implementation, testing (validation), integration, and maintenance.

Linear ordering of activities has some significant consequences. First, to identify the end of a phase and the beginning of the next, some certification technique has to be employed at the end of each phase. This usually takes the form of verification and validation, which ensure that the output of a phase is consistent with its input (the output of the previous phase) and with the overall requirements of the system.

RAD Model:-

RAD, or Rapid Application Development, is an adaptation of the waterfall model; it targets developing software in a short period. The RAD model is based on the concept that a better system can be developed in less time by using focus groups to gather system requirements. It consists of the following phases:

o Business Modeling
o Data Modeling
o Process Modeling
o Application Generation
o Testing and Turnover

Spiral Model:-

The spiral model is a risk-driven process model. This SDLC model helps the team adopt elements of one or more process models such as waterfall, incremental, or prototyping. The spiral technique is a combination of rapid prototyping and concurrency in design and development activities.

Each cycle in the spiral begins with the identification of objectives for that cycle, the

different alternatives that are possible for achieving the goals, and the constraints that exist.

This is the first quadrant of the cycle (upper-left quadrant).

The next step in the cycle is to evaluate these different alternatives based on the objectives

and constraints. The focus of evaluation in this step is based on the risk perception for the

project.

The next step is to develop strategies that resolve the uncertainties and risks. This step may

involve activities such as benchmarking, simulation, and prototyping.

V-Model:-

In this type of SDLC model, testing and development are planned in parallel: there are verification phases on one side of the V and validation phases on the other, and the two branches are joined by the coding phase.

Incremental Model:-

The incremental model is not a separate model. It is essentially a series of waterfall cycles.

The requirements are divided into groups at the start of the project. For each group, the

SDLC model is followed to develop software. The SDLC process is repeated, with each

release adding more functionality until all requirements are met. In this method, each cycle acts as the maintenance phase for the previous software release. A modification of the incremental model allows development cycles to overlap, so that a subsequent cycle may begin before the previous cycle is complete.

Agile Model:-

Agile methodology is a practice that promotes continuous interaction between development and

testing during the SDLC process of any project. In the Agile method, the entire project is

divided into small incremental builds. All of these builds are provided in iterations, and each

iteration lasts from one to three weeks.

Any agile software phase is characterized in a manner that addresses several key

assumptions about the bulk of software projects:

1. It is difficult to predict in advance which software requirements will persist and which

will change. It is equally difficult to predict how user priorities will change as the project

proceeds.

2. For many types of software, design and development are interleaved. That is, both

activities should be performed in tandem so that design models are proven as they are

created. It is difficult to determine how much design is necessary before construction is used to prove the design.

3. Analysis, design, development, and testing are not as predictable (from a planning

point of view) as we might like.

Iterative Model:-

It is a particular implementation of a software development life cycle that focuses on an

initial, simplified implementation, which then progressively gains more complexity and a

broader feature set until the final system is complete. In short, iterative development is a

way of breaking down the software development of a large application into smaller pieces.

Big bang model:-

The big bang model focuses all available resources on software development and coding, with little or no planning. The requirements are understood and implemented as they come.

This model works best for small projects with small development teams working together. It is also useful for academic software development projects. It is an ideal model when requirements are either unknown or no final release date is given.

Prototype Model:-

The prototyping model starts with the requirements gathering. The developer and the user

meet and define the purpose of the software, identify the needs, etc.

A 'quick design' is then created. This design focuses on those aspects of the software that

will be visible to the user. It then leads to the development of a prototype. The customer then

checks the prototype, and any modifications or changes that are needed are made to the

prototype.

Looping takes place in this step, and better versions of the prototype are created. These are

continuously shown to the user so that any new changes can be updated in the prototype.

This process continues until the customer is satisfied with the system. Once the user is satisfied, the prototype is converted into the actual system, with all considerations for quality and security.

Machine Learning Life Cycle:-



1. Gathering Data:-
Data Gathering is the first step of the machine learning life cycle. The goal of this step is to

identify and obtain all the data related to the problem.

In this step, we need to identify the different data sources, as data can be collected from

various sources such as files, database, internet, or mobile devices. It is one of the most

important steps of the life cycle. The quantity and quality of the collected data will

determine the efficiency of the output: the more data there is, the more accurate the prediction will be.

This step includes the below tasks:

o Identify various data sources


o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will

be used in further steps.

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step

where we put our data into a suitable place and prepare it to use in our machine learning

training.

In this step, first, we put all data together, and then randomize the ordering of data.

This step can be further divided into two processes:



o Data exploration:-
It is used to understand the nature of data that we have to work with. We need to

understand the characteristics, format, and quality of data.

A better understanding of data leads to an effective outcome. In this, we find

Correlations, general trends, and outliers.

o Data pre-processing:-
Now the next step is preprocessing of data for its analysis.

3. Data Wrangling:-
Data wrangling is the process of cleaning and converting raw data into a usable format. It is

the process of cleaning the data, selecting the variable to use, and transforming the data in a

proper format to make it more suitable for analysis in the next step. It is one of the most

important steps of the complete process. Cleaning of data is required to address the quality

issues.

The data we have collected is not always directly usable, as some of it may not be useful. In real-world applications, collected data may have various issues, including:

o Missing Values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data; a small example is sketched below. It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
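A hedged sketch of the cleaning steps listed above using pandas; the file and column names are assumptions, not the project's actual dataset:

import pandas as pd
df = pd.read_csv("crop_data.csv")                      # hypothetical raw dataset
df = df.drop_duplicates()                              # duplicate data
df = df.dropna(subset=["temperature", "rainfall"])     # missing values
df = df[df["rainfall"] >= 0]                           # invalid (negative) rainfall entries
df["humidity"] = df["humidity"].clip(0, 100)           # clamp noisy humidity readings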

4. Data Analysis:-
Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques
o Building models
o Review the result

The aim of this step is to build a machine learning model to analyze the data using various

analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, or association; we then build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the model.

5. Train Model:-
Now the next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.

6. Test Model:-
Once our machine learning model has been trained on a given dataset, then we test the

model. In this step, we check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement

of project or problem.
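A minimal sketch of steps 5 and 6 (training and testing) with scikit-learn; the dataset and model choice here are illustrative stand-ins, not the project's data:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)         # step 5: train the model
y_pred = model.predict(X_test)      # step 6: test the model
print("Accuracy:", accuracy_score(y_test, y_pred))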

7. Deployment:-
The last step of machine learning life cycle is deployment, where we deploy the model in the

real-world system.

If the above-prepared model is producing an accurate result as per our requirement with

acceptable speed, then we deploy the model in the real system. But before deploying the

project, we will check whether it is improving its performance using available data or not.

The deployment phase is similar to making the final report for a project.

5.Machine Learning:-

ML-based deep learning can simplify the task of crop breeding. Algorithms simply collect

field data on plant behavior and use that data to develop a probabilistic model.

Crop yield prediction is another instance of machine learning in the agriculture sector.

The technology informs decisions on what crop species to grow and what activities to

perform during the growing season. Tech-wise, crop yield is used as a dependent

variable when making predictions. The major factors include temperature, soil type,

rainfall, and actual crop information. Based on these inputs, ML algorithms like neural

networks and multiple linear regression produce forecasts.

The goal of this research is to present a comparison between different

clustering and segmentation techniques, both supervised and unsupervised,

to detect plant and crop rows. Aerial images, taken by an Unmanned Aerial

Vehicle (UAV), of a corn field at various stages of growth were acquired in

RGB format through the Agronomy Department at the Kansas State

University. Several segmentation and clustering approaches were applied to

these images, namely K-Means clustering, Excessive Green (ExG) Index

algorithm, Support Vector Machines (SVM), Gaussian Mixture Models (GMM),

and a deep learning approach based on Fully Convolutional Networks (FCN),

to detect the plants present in the images. A Hough Transform (HT) approach

was used to detect the orientation of the crop rows and rotate the images so

that the rows became parallel to the x-axis. The result of applying different
segmentation methods to the images was then used in estimating the

location of crop rows in the images by using a template creation method

based on Green Pixel Accumulation (GPA) that calculates the intensity

profile of green pixels present in the images. Connected component analysis

was then applied to find the centroids of the detected plants. Each centroid

was associated with a crop row, and centroids lying outside the row

templates were discarded as being weeds. A comparison between the

various segmentation algorithms based on the Dice similarity index and

average run-times is presented at the end of the work.
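As an illustration of one of the simpler techniques named above, a hedged sketch of Excess Green (ExG) segmentation on an RGB image; the threshold value is an assumption chosen for illustration:

import numpy as np

def excess_green_mask(rgb, threshold=0.1):
    """Segment vegetation using the Excess Green index ExG = 2g - r - b."""
    rgb = rgb.astype(float)
    total = rgb.sum(axis=2) + 1e-8                      # avoid division by zero
    r, g, b = (rgb[..., i] / total for i in range(3))   # normalized chromatic coordinates
    exg = 2 * g - r - b
    return exg > threshold                              # boolean mask of likely plant pixels

# Usage: mask = excess_green_mask(image), where image is an HxWx3 RGB array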

Python is also being used for developing IoT devices. AI is assisting IoT in enabling real-time data analytics that help farmers make informed decisions. Precision agriculture, or smart agriculture, relies on emerging technologies such as AI, ML, and data analytics to revolutionize farming practices.

8. Workflow Project:-

Workflow Management for Crop Prediction (Agricultural System)


Workflow management in agricultural systems for crop prediction involves the efficient

coordination and automation of tasks and processes related to crop cultivation, monitoring,
and prediction of yields. Here's a general outline of a typical workflow management system

for crop prediction in an agricultural setting:-

Data Collection:- Data related to various factors that influence crop growth and yield, such as

weather conditions, soil characteristics, historical crop data, and satellite imagery, are

collected and integrated into the workflow management system. This data can be collected

through various sensors, drones, and other data sources.

Data Preprocessing:- The collected data is preprocessed to clean and transform it into a

format suitable for analysis. This may involve data cleaning, normalization, aggregation, and

feature extraction to reduce noise and ensure data quality.

Data Analysis:- The preprocessed data is analyzed using various statistical and machine

learning techniques to identify patterns, trends, and correlations between different variables.

For example, machine learning algorithms such as decision trees, random forests, and neural

networks can be used to predict crop yields based on historical data and environmental

factors.
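A hedged sketch of such a yield model with a random forest; the file name and feature columns are assumptions for illustration, not the project's actual data:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("field_history.csv")                  # hypothetical historical data
X = df[["temperature", "rainfall", "humidity", "soil_ph"]]
y = df["yield"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))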

Crop Prediction:- Based on the analysis results, the workflow management system can

generate crop prediction models that can forecast crop yields for different crops and regions.

These models can be continuously updated with new data to improve their accuracy over

time.

Decision Support:- The workflow management system can provide decision support to

farmers by presenting them with insights and recommendations based on the crop prediction

models. For example, it can suggest optimal planting times, irrigation schedules, and

fertilization plans based on the predicted crop yields and current weather conditions.

Task Automation:-The workflow management system can automate various tasks related to

crop cultivation, such as scheduling irrigation, applying fertilizers, and monitoring pest

control, based on the predicted crop yields and environmental conditions. This can help

farmers optimize their operations, reduce costs, and increase productivity.


Monitoring and Feedback:-The workflow management system can continuously monitor the

actual crop growth and yield data and compare it with the predicted results. This feedback

loop allows for ongoing validation and refinement of the prediction models, and helps farmers

make informed decisions about their crop management practices.

Reporting and Visualization:- The workflow management system can generate reports and

visualizations to provide farmers and other stakeholders with a clear understanding of the

crop prediction results, trends, and performance metrics. This can help farmers evaluate the

effectiveness of their crop management strategies and make data-driven decisions for future

seasons.

Integration with Crop Management Tools:- The workflow management system can be

integrated with other crop management tools, such as farm management software, precision

agriculture equipment, and agricultural drones, to enable seamless coordination and

execution of tasks based on crop prediction results.

Continuous Improvement:-The workflow management system can be continuously improved

by incorporating new data sources, updating prediction models, and refining decision support

algorithms based on feedback from farmers and other stakeholders. This iterative process

helps ensure that the system remains accurate, reliable, and relevant over time.

Overall, an effective workflow management system for crop prediction in agricultural

systems involves the integration of data collection, preprocessing, analysis, prediction,

decision support, task automation, monitoring, reporting, and continuous improvement

components to enable efficient and data-driven crop management practices.

9.Elbow Method:-

Elbow Method for Crop Prediction (Agriculture)

The Elbow Method is a commonly used technique in data science and machine learning to

determine the optimal number of clusters or groups in a dataset. It can also be applied in

agriculture for crop prediction, specifically in crop classification or clustering tasks.

For each candidate value of k, run the clustering algorithm and compute the sum of squared distances (SSE) of each data point to the centroid of its cluster. Plot the SSE values against the corresponding values of k in a line chart; the point where the curve bends like an elbow indicates a suitable number of clusters.
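A minimal sketch of this procedure with scikit-learn's KMeans; the feature matrix X is assumed to already hold the numeric agricultural-condition columns:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, max_k=10):
    sse = []
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        km.fit(X)
        sse.append(km.inertia_)    # sum of squared distances to the centroids
    plt.plot(range(1, max_k + 1), sse, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("SSE (inertia)")
    plt.show()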

The Elbow Method can help in optimizing the clustering process and improving the accuracy

of crop prediction models by identifying the appropriate number of clusters or groups in the

dataset. It can also aid in making informed decisions related to crop management, resource

allocation, and agricultural planning.

10.Distribution Of Agricultural Conditions:-

The distribution of agricultural conditions can vary greatly depending on various factors such

as climate, soil type, topography, water availability, and human intervention. Here are some

general patterns of agricultural conditions distribution:-

Climate:-Climate plays a crucial role in determining agricultural conditions. Crops have

specific requirements for temperature, precipitation, and sunlight. In general, agricultural


areas tend to be concentrated in regions with favorable climates for crop growth. For

example, areas with moderate temperatures, adequate rainfall, and ample sunlight are often

conducive to agriculture. Regions with harsh climates such as deserts, extreme cold, or

excessive rainfall may have limited agricultural potential.

Soil type:- Soil type is another critical factor that influences agricultural conditions. Different

crops require different types of soils for optimal growth. For example, crops like rice and

cranberries thrive in acidic soils, while crops like wheat and corn prefer well-drained loamy

soils. Agricultural areas are often found in regions with fertile soils that provide essential

nutrients and support healthy crop growth.

Topography:-Topography, or the physical characteristics of the land, can also affect

agricultural conditions. Flat or gently sloping lands are generally more suitable for

agriculture as they allow for easier irrigation and cultivation. Steep slopes or rugged terrains

may pose challenges in terms of soil erosion, water runoff, and accessibility, which can

impact agricultural productivity.

Water availability:-Access to water is critical for agriculture. Regions with ample water

resources such as rivers, lakes, or groundwater reserves are often conducive to agriculture.

Irrigation systems are often developed in areas with limited rainfall to support crop growth.

In contrast, areas with limited water resources may face challenges in agricultural

production.

Human intervention:-Human intervention, including agricultural practices and infrastructure

development, can greatly influence agricultural conditions. Agricultural technologies, such

as irrigation systems, fertilizers, and crop management practices, can enhance agricultural

productivity and expand the potential for agriculture in regions with suboptimal conditions.

Human settlements and infrastructure, such as roads and markets, also play a role in

determining the distribution of agricultural conditions.

Overall, the distribution of agricultural conditions is influenced by a complex interplay of


factors including climate, soil type, topography, water availability, and human intervention.

Understanding these factors is crucial for planning and managing agricultural activities and

ensuring sustainable food production.

11.Predictions of Crops:-
This section outlines potential trends and factors that may impact crop production in the future. However, it's important to note that crop

predictions are subject to various factors, including weather conditions, technological

advancements, economic factors, and policy changes, which can all influence crop

production. Additionally, unforeseen events or disruptions, such as natural disasters or

disease outbreaks, can also significantly impact crop yields. With these considerations in

mind, here are some potential predictions for crops:

Climate-resilient crops:-With the increasing impacts of climate change, there may be a

growing demand for climate-resilient crops that are adapted to changing weather patterns,

such as drought-tolerant or heat-tolerant varieties. Advances in biotechnology and genetic

engineering may lead to the development of genetically modified crops that are better able

to withstand extreme weather conditions, helping to ensure stable crop production in the

face of climate challenges.

Vertical farming:- Vertical farming, which involves growing crops indoors in stacked layers
using artificial lighting, may become more widespread due to its potential for year-round

production in urban environments and reduced reliance on traditional agricultural land.

Advances in LED lighting technology, automation, and data analytics may drive increased

adoption of vertical farming, allowing for the cultivation of a wide variety of crops in

controlled environments with optimized resource use.

Organic and regenerative agriculture: There may be a growing demand for organic and

regenerative agricultural practices that prioritize soil health, biodiversity, and ecosystem

sustainability. Consumers' increasing focus on health and environmental sustainability may

drive demand for crops grown using organic or regenerative practices, which can promote

soil fertility, reduce chemical inputs, and enhance overall ecosystem resilience.

Precision agriculture:- Precision agriculture, which involves using technologies such as

drones, sensors, and data analytics to optimize crop management, may continue to gain

momentum. Advancements in remote sensing, data analytics, and artificial intelligence may

enable farmers to make data-driven decisions about planting, irrigation, nutrient

management, and pest control, resulting in improved crop yields, reduced input use, and

enhanced sustainability.

Alternative protein crops:-As global demand for protein-rich foods continues to rise, there

may be an increasing focus on alternative protein crops, such as legumes, insects, and algae.

These crops are rich in protein, require fewer resources to produce compared to traditional

animal agriculture, and may be more sustainable and environmentally friendly.

Resurgence of traditional and indigenous crops:-There may be a renewed interest in

traditional and indigenous crops that are well adapted to local climates and have genetic

diversity. These crops may be seen as more resilient to changing environmental conditions

and may offer unique nutritional and cultural benefits.

Increased adoption of genetically modified crops:-Advances in genetic engineering may lead

to increased adoption of genetically modified crops with enhanced traits, such as resistance
to pests, diseases, or environmental stress. However, the adoption of genetically modified

crops may continue to be a topic of debate, with concerns about safety, environmental

impacts, and consumer acceptance.

It's important to note that these predictions are speculative and may be subject to change as

new technologies, policies, and environmental factors emerge. The future of crop production

will likely be shaped by a complex interplay of various factors, and careful monitoring and

adaptive management will be necessary to ensure sustainable and resilient crop production

systems.


12.Confusion Matrix:-


A confusion matrix, also known as an error matrix, is a performance evaluation tool used in

machine learning and statistics to assess the accuracy of a classification model. It is a table

that displays the true positive (TP), true negative (TN), false positive (FP), and false negative

(FN) values for a set of predictions compared to the actual ground truth.

Here is an example of a confusion matrix:-


Actual \ Predicted | Positive | Negative
-------------------|----------|----------
Positive           | TP       | FN
Negative           | FP       | TN
Each cell in the confusion matrix represents the count or percentage of instances that fall

into a specific category based on the model's predictions and the actual ground truth. The

key terms used in a confusion matrix are:


True Positive (TP):-The number of instances that are actually positive and are correctly

predicted as positive by the model.

True Negative (TN):-The number of instances that are actually negative and are correctly

predicted as negative by the model.

False Positive (FP):-The number of instances that are actually negative but are incorrectly

predicted as positive by the model.

False Negative (FN):-The number of instances that are actually positive but are incorrectly

predicted as negative by the model.

The confusion matrix provides valuable insights into the performance of a classification

model, allowing for the calculation of various performance metrics such as accuracy,

precision, recall, F1 score, and specificity, which help in understanding the model's strengths

and weaknesses. It is a useful tool for evaluating and fine-tuning machine learning models to

improve their classification accuracy.

Confusion Matrix using Logistic Regression:-

A confusion matrix is a commonly used tool to evaluate the performance of a classification

model, such as logistic regression. It is a matrix that shows the number of true positives (TP),

false positives (FP), true negatives (TN), and false negatives (FN) for a given set of

predictions compared to the actual ground truth.

Here's an example of how you can create a confusion matrix using logistic regression in
Python:-

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Load your dataset
# X is the feature matrix, y is the target variable
X, y = load_your_dataset()
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the logistic regression model
logreg = LogisticRegression()
# Train the model
logreg.fit(X_train, y_train)
# Make predictions on the test set
y_pred = logreg.predict(X_test)
# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Extract values from the confusion matrix
tn, fp, fn, tp = cm.ravel()
# Print the confusion matrix
print("Confusion Matrix:")
print("True Negatives (TN):", tn)
print("False Positives (FP):", fp)
print("False Negatives (FN):", fn)
print("True Positives (TP):", tp)
# You can also visualize the confusion matrix using a heatmap
import seaborn as sns
import matplotlib.pyplot as plt
# Create a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
In the example above, we first load our dataset and split it into train and test sets using

train_test_split from Scikit-learn. Then we initialize a logistic regression model, fit it to the

training data, and make predictions on the test data. We create a confusion matrix using

confusion_matrix from Scikit-learn, and then extract the values for TN, FP, FN, and TP from

the confusion matrix. Finally, we print the values and visualize the confusion matrix using a

heatmap with Seaborn and Matplotlib.

Confusion Matrix using Kmeans:-


A confusion matrix, also known as an error matrix, is a commonly used evaluation metric in

machine learning and data mining to assess the performance of a classification model. K-

means, however, is an unsupervised clustering algorithm that does not inherently provide

labels or ground truth for classification. Therefore, using a confusion matrix directly with K-

means is not applicable.

However, if you are interested in evaluating the performance of a classification model that is

trained using K-means clustering as a feature extraction step, you can follow these steps to

generate a confusion matrix:-


Perform K-means clustering:-Use K-means algorithm to cluster your data into K groups. The

clusters obtained from K-means can be treated as pseudo-labels for your data.

Train a classifier:-Use the cluster assignments obtained from K-means as features and train

a classification model, such as logistic regression, decision tree, or support vector machine

(SVM), using a labeled dataset. The labeled dataset should have true class labels for each

data point that are used for training the classifier.

Make predictions:-Use the trained classifier to make predictions on a test dataset. The

predicted class labels can be obtained from the output of the classifier.

Create a confusion matrix:-Compare the predicted class labels with the true class labels

from the test dataset to create a confusion matrix. The confusion matrix will have rows

representing the true class labels and columns representing the predicted class labels. The

diagonal elements of the confusion matrix represent the number of correct predictions, while

the off-diagonal elements represent the misclassifications.

Calculate performance metrics:-Use the values in the confusion matrix to calculate various

performance metrics such as accuracy, precision, recall, and F1 score, which provide

insights into the classification performance of the model.

Here's an example of how you can create a confusion matrix using K-means clustering as a

feature extraction step in Python:-

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Step 1: Perform K-means clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_train)  # X_train is your training data
# Step 2: Train a classifier on the cluster-distance features
X_train_kmeans = kmeans.transform(X_train)
X_test_kmeans = kmeans.transform(X_test)  # X_test is your test data
clf = LogisticRegression()
clf.fit(X_train_kmeans, y_train)  # y_train is your true class labels for training data
# Step 3: Make predictions
y_pred = clf.predict(X_test_kmeans)
# Step 4: Create a confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)  # y_test: true labels, y_pred: predicted labels
# Step 5: Calculate performance metrics (these formulas assume a binary 2x2 matrix)
accuracy = (confusion_mat[0, 0] + confusion_mat[1, 1]) / np.sum(confusion_mat)
precision = confusion_mat[1, 1] / (confusion_mat[1, 1] + confusion_mat[0, 1])
recall = confusion_mat[1, 1] / (confusion_mat[1, 1] + confusion_mat[1, 0])
f1_score = 2 * (precision * recall) / (precision + recall)
print("Confusion Matrix:\n", confusion_mat)
print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 Score: {:.2f}".format(f1_score))
13.Classification Report using Logistic Regression:-
Here's an example of how you can generate a classification report using logistic regression
in Python, utilizing the sklearn library.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load your dataset
# Replace X and y with your own features and target variable
X, y = load_your_dataset()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit the logistic regression model
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = clf.predict(X_test)
# Generate the classification report
report = classification_report(y_test, y_pred)
# Print the classification report
print(report)

The classification_report() function from sklearn.metrics generates a report that includes

metrics such as precision, recall, F1-score, and support for each class in a classification

problem. You can interpret the report to assess the performance of your logistic regression

model.

Here's an example of how you can generate a classification report for agriculture and

crop production using logistic regression. Please note that this is a hypothetical

example and the data and results are not based on actual data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load the dataset (example data)
data = pd.read_csv('agriculture_dataset.csv')
# Split the data into features and target variable
X = data.drop('Crop_Type', axis=1) # Features
y = data['Crop_Type'] # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict on the testing data
y_pred = model.predict(X_test)
# Generate the classification report
report = classification_report(y_test, y_pred)
# Print the classification report
print(report)

The classification_report function from scikit-learn is used to generate the classification

report, which provides metrics such as precision, recall, F1-score, and support for each

class in the target variable (Crop_Type in this case). The report gives an overview of the

performance of the logistic regression model in predicting the crop type based on the

features provided in the dataset.

14.Source Code and Output:-

Source Code:-

from google.colab import files


uploaded = files.upload()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact
data=pd.read_csv("data (1) (1).csv")
print(data)
print(data.isnull().sum())
sns.heatmap(data.isnull())
plt.show()
Output:-

print("Avg nitrogen {0:.2f}".format(data["N"].mean()))


print("Avg phosphorus {0:.2f}".format(data["P"].mean()))
print("Avg Potassium {0:.2f}".format(data["K"].mean()))
print("Avg temperature {0:.2f}".format(data["temperature"].mean()))
print("Avg humidity {0:.2f}".format(data["humidity"].mean()))
print("Avg ph {0:.2f}".format(data["ph"].mean()))
print("Avg rainfall {0:.2f}".format(data["rainfall"].mean()))
Output:-
Avg nitrogen 50.55
Avg phosphorus 53.36
Avg Potassium 48.15
Avg temperature 25.62
Avg humidity 71.48
Avg ph 6.47
Avg rainfall 103.46
@interact
def summary(crops=list(data["label"].value_counts().index)):
    x = data[data['label'] == crops]
    print(x['label'])
    print("Min nitrogen required", x["N"].min())
    print("Avg nitrogen required", x["N"].mean())
    print("Max nitrogen required", x["N"].max())
    print("Min phosphorus required", x["P"].min())
    print("Avg phosphorus required", x["P"].mean())
    print("Max phosphorus required", x["P"].max())
    print("Min Potassium required", x["K"].min())
    print("Avg Potassium required", x["K"].mean())
    print("Max Potassium required", x["K"].max())
    print("Min temperature required", x["temperature"].min())
    print("Avg temperature required", x["temperature"].mean())
    print("Max temperature required", x["temperature"].max())
    print("Min ph required", x["ph"].min())
    print("Avg ph required", x["ph"].mean())
    print("Max ph required", x["ph"].max())
    print("Min humidity required", x["humidity"].min())
    print("Avg humidity required", x["humidity"].mean())
    print("Max humidity required", x["humidity"].max())
    print("Min rainfall required", x["rainfall"].min())
    print("Avg rainfall required", x["rainfall"].mean())
    print("Max rainfall required", x["rainfall"].max())
Output:-

plt.subplot(2, 4, 1)
sns.histplot(data['N'], color="green")
plt.xlabel("Nitrogen")
plt.grid()
plt.subplot(2, 4, 2)
sns.histplot(data['P'], color="red")
plt.xlabel("P")
plt.grid()
plt.subplot(2, 4, 3)
sns.histplot(data['K'], color="yellow")
plt.xlabel("K")
plt.grid()
plt.subplot(2, 4, 4)
sns.histplot(data['ph'], color="blue")
plt.xlabel("PH")
plt.grid()
plt.subplot(2, 4, 5)
sns.histplot(data['temperature'], color="yellow")
plt.xlabel("temperature")
plt.grid()
plt.subplot(2, 4, 6)
sns.histplot(data['humidity'], color="green")
plt.xlabel("humidity")
plt.grid()
plt.subplot(2, 4, 7)
sns.histplot(data['rainfall'], color="blue")
plt.xlabel("rainfall")
plt.grid()
plt.show()
"""**Elbow method**"""
from pandas.core.common import random_state
from sklearn.cluster import KMeans
x=data.drop(['label'],axis=1)
x=x.values
wcss=[]
for i in range(1,11):
km=KMeans(n_clusters=i,init="k-means++", max_iter=2000,n_init=10,random_state=0)
km.fit(x)
wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.show()
km=KMeans(n_clusters=4,init="k-means++", max_iter=2000,n_init=10,random_state=0)
y_means=km.fit_predict(x)
a=data['label']
y_means=pd.DataFrame(y_means)
z=pd.concat([y_means,a],axis=1)
z=z.rename(columns={0:'cluster'})
29
print("Cluster 1",z[z['cluster']==0]['label'].unique())
print("Cluster 2",z[z['cluster']==1]['label'].unique())
print("Cluster 3",z[z['cluster']==2]['label'].unique())
print("Cluster 4",z[z['cluster']==3]['label'].unique())
y=data['label']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=0)
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train,y_train)
y_pred=model.predict(np.array([[40,40,40,40,100,7,200]]))
print(y_pred)
y_pred=model.predict(x_test)
from sklearn.metrics import classification_report
cr=classification_report(y_test,y_pred)
print(cr)
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True)
print(cm)
Output:-

from sklearn.cluster import KMeans

x = data.drop(['label'], axis=1)
x = x.values
plt.rcParams['figure.figsize'] = (10, 4)
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=2000, n_init=10, random_state=0)
    km.fit(x)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss)
plt.xlabel("No of clusters")
plt.ylabel("wcss")
plt.show()
Output:-

15.Conclusion:-

In conclusion, machine learning has emerged as a promising tool for predicting crop yields
and improving agricultural practices. By leveraging large datasets and sophisticated

algorithms, machine learning models can analyze various factors such as weather patterns,

soil conditions, historical crop data, and management practices to make accurate

predictions about crop yields.

One key benefit of crop prediction using machine learning is its potential to optimize

agricultural practices. Farmers can use these predictions to make informed decisions about

planting schedules, irrigation, fertilization, and pest management, leading to more efficient

resource allocation and higher yields. Additionally, machine learning can help farmers

identify early warning signs of crop stress or disease outbreaks, allowing for timely

interventions and reducing crop losses.

Machine learning in crop prediction also has the potential to contribute to sustainable

agriculture by optimizing resource use. For example, by predicting crop water requirements,

farmers can implement targeted irrigation strategies, minimizing water waste and conserving

this precious resource. Similarly, by predicting crop nutrient needs, farmers can apply

fertilizers more judiciously, reducing the risk of nutrient runoff and environmental pollution.

However, it's important to note that machine learning models for crop prediction are not

without limitations. Accurate predictions depend on the availability of reliable data, and in

many regions, data may be sparse or inconsistent. Additionally, machine learning models are

not immune to biases and may suffer from limitations in generalization, especially when

applied to different regions or crop varieties. Therefore, it's crucial to continue refining and

validating these models using field data and expert knowledge.

In conclusion, machine learning has the potential to revolutionize crop prediction and

agricultural practices, leading to improved crop yields, resource optimization, and

sustainable agriculture. However, ongoing research, data collection, and model validation

are necessary to ensure their reliability and effectiveness in real-world farming scenarios.
16.Future scope:-

The future scope of machine learning in crop prediction is promising and holds significant

potential for revolutionizing agriculture and improving crop production. Here are some key

areas where machine learning can play a significant role in the future:-

Machine learning algorithms can analyze a vast amount of data,

including soil quality, weather patterns, pest and disease prevalence, and plant growth rates

to provide farmers with precise recommendations on planting, fertilization, irrigation, and

pest control. This can optimize resource usage, reduce input costs, and increase crop yields.

Machine learning can be used to analyze historical data on

crop diseases and pests and create predictive models that can help farmers anticipate

disease outbreaks and pest infestations. This can enable early intervention and prevent crop

losses, reducing the reliance on chemical pesticides and minimizing environmental impact.

As climate change continues to impact agriculture, machine

learning can help farmers adapt by providing predictive models that take into account

changing weather patterns, temperature fluctuations, and rainfall variability. This can enable

farmers to make informed decisions about crop selection, planting times, and irrigation

strategies.

Machine learning algorithms can analyze data on crop growth,

historical yield data, weather patterns, and other factors to create accurate crop yield

forecasts. This can help farmers with crop planning, marketing, and financial decision-

making.

Machine learning can aid in crop breeding programs

by analyzing genetic data and identifying optimal combinations of traits for crop

improvement. This can accelerate the development of new crop varieties with improved yield,

resistance to diseases and pests, and other desirable traits.


Machine learning can analyze remote sensing data,

including satellite imagery, to monitor crop health, detect stressors such as nutrient

deficiencies, water stress, and disease outbreaks. This can help farmers make data-driven

decisions about crop management and optimize inputs.

Machine learning can power decision support systems that

provide farmers with real-time recommendations and insights for crop management. These

systems can integrate data from various sources and provide personalized recommendations

based on the specific needs of each farm.

In conclusion, machine learning has a bright future in crop prediction and agriculture, and it

has the potential to significantly improve crop production, optimize resource usage, and

contribute to sustainable farming practices. Continued advancements in machine learning

algorithms, data collection, and analytics are expected to drive further innovation in this

field in the future.

17.Bibliography:-

1. Application of Machine Learning in Agriculture, written by Mohammad Ayoub Khan, Rijwan Khan, and Mohammad Aslam Ansari.

2. Data Visualization: Storytelling Using Data, written by Sharada Sringeswara, Purvi Tiwari, and U. Dinesh Kumar.

3. Python Data Analytics (with pandas, NumPy, and Matplotlib), written by Fabio Nelli.

4. Data Analytics with Python, written by Dr. Bhaves Devra, Dr. Dilip Kumar, Dr. Shajahan Basheer, and Dr. Proloy Ghosh.
