Weather Prediction 2
MAJOR PROJECT
Submitted by ASHWIN.J
(18MCA038)
Under the Guidance of
Dr. B. UMA MAHESWARI M.Sc., MCA., M.Phil., PhD
Assistant Professor,
CERTIFICATE
This is to certify that this Project work entitled “WEATHER
PREDICTION” is a bona fide record of work done by ASHWIN.J
(18MCA038) in partial fulfillment of the requirements for the award of Degree
of Master of Computer Applications of Bharathiar University.
Dr.B.UMA MAHESWARI M.Sc.,MCA.,M.Phil.,PhD Dr.R.SUDHA MCA.,M.Phil.,Ph.D.,
This Project work has not been submitted by me for the award of any
other Degree/Diploma/Associateship/Fellowship or any other similar degree
to any other university.
Place:Coimbatore ASHWIN J
With great gratitude, I would like to acknowledge the help of those who
contributed with their valuable suggestions and timely assistance to complete
this work.
This is to certify that Mr. J. ASHWIN (Reg. No. 18MCA038), a final-year MCA student
at PSG COLLEGE OF ARTS & SCIENCE, COIMBATORE, has successfully
completed the PROJECT entitled “Weather Prediction” in the department of “Machine
Learning” in our organization during the period of JANUARY 2021 to MARCH
2021.
Ashwini Kanniyappan
Manager – Human Resources
7 Testing
Testing and implementation
8 Future enhancement
9 Conclusion
Bibliography
Appendices
1.INTRODUCTION:
1.1.PROJECT OVERVIEW
Weather is an important aspect of daily life: knowing when it will rain and when it will
be sunny helps people plan. Weather forecasting is the attempt by meteorologists to
predict the weather conditions that may be expected at some future time. Climatic
conditions are described by parameters such as temperature, pressure, humidity,
dew point, rainfall, precipitation and wind speed, together with the size of the dataset.
Here, only the parameters temperature, pressure, humidity, dew point, precipitation and
rainfall are considered for experimental analysis.
2.SYSTEM SPECIFICATION:
3. SYSTEM ANALYSIS:
3.2.Proposed system
To forecast the weather, a massive amount of data is fed into an algorithm that uses
deep learning techniques to learn from past data and then make predictions. The trained
ML model takes a physics-free approach to forecasting: it is designed to learn from
atmospheric examples alone, without any prior knowledge of atmospheric physics being
built into the system. The underlying convolutional neural network (CNN), a 'U-Net',
comprises a sequence of layers, each a set of mathematical operations, that takes
satellite imagery as input and transforms it into output images. The layers of a
convolutional neural network are usually arranged in an encoding phase, which decreases
the resolution of the images passing through it. In the Google AI model, a separate
decoding phase has been added to expand the low-resolution images again.
To start with, the engineering team trained the model on historical data collected in
the US from 2017 to 2019, and then compared it against three baseline models:
High-Resolution Rapid Refresh (HRRR), an Optical Flow (OF) algorithm, and a persistence
model. According to the researchers, the Google model outperformed all three traditional
methods when compared using precision and recall graphs. The model treats weather
prediction as an image-to-image translation problem and leverages a state-of-the-art
CNN. Moving forward, the aim is to find the mechanism that gives the best results so
that the ML model produces accurate forecasts.
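The comparison against a persistence baseline using precision and recall can be illustrated on a small scale. The following is only a minimal sketch, assuming that each frame is a boolean rain/no-rain mask and that the persistence model simply repeats the previous frame; the function names and the sample frames are hypothetical and are not taken from the Google study.

```python
import numpy as np

def persistence_forecast(frames):
    """Persistence baseline: forecast each frame as a copy of the one before it."""
    return frames[:-1]

def precision_recall(predicted, observed):
    """Precision and recall for boolean rain / no-rain masks."""
    true_pos = np.sum(predicted & observed)
    false_pos = np.sum(predicted & ~observed)
    false_neg = np.sum(~predicted & observed)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return float(precision), float(recall)

# Hypothetical sequence of three 2x2 rain masks (True = rain observed).
frames = np.array([
    [[True, False], [False, False]],
    [[True, True], [False, False]],
    [[False, True], [False, True]],
])

predicted = persistence_forecast(frames)  # frames 0 and 1 act as the forecasts
observed = frames[1:]                     # frames 1 and 2 are what actually happened
precision, recall = precision_recall(predicted, observed)
print(precision, recall)
```

Any other forecasting model can be scored with the same `precision_recall` function, which is what makes the persistence model a convenient reference point.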
4.SYSTEM DESIGN AND DEVELOPMENT:
The model was designed in Python 3 with the help of several libraries, including
NumPy, Matplotlib, seaborn, pandas and scikit-learn.
NumPy is the fundamental package for scientific computing in Python. It is a Python
library that provides a multidimensional array object, various derived objects (such as
masked arrays and matrices), and an assortment of routines for fast operations on arrays,
including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete
Fourier transforms, basic linear algebra, basic statistical operations, random simulation
and much more.
At the core of the NumPy package, is the ndarray object. This encapsulates n-
dimensional arrays of homogeneous data types, with many operations being performed
in compiled code for performance. There are several important differences between
NumPy arrays and the standard Python sequences:
• NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
• The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory. The exception: one can have arrays of (Python,
including NumPy) objects, thereby allowing for arrays of different sized elements.
• NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with less
code than is possible using Python’s built-in sequences.
• A growing plethora of scientific and mathematical Python-based packages are using
NumPy arrays; though these typically support Python-sequence input, they convert such
input to NumPy arrays prior to processing, and they often output NumPy arrays. In other
words, in order to efficiently use much (perhaps even most) of today’s
scientific/mathematical Python-based software, just knowing how to use Python’s
built-in sequence types is insufficient; one also needs to know how to use NumPy arrays.
The points about sequence size and speed are particularly important in scientific
computing. As a simple example, consider the case of multiplying each element in a 1-D
sequence with the corresponding element in another sequence of the same length. If the
data are stored in two Python lists, a and b, we could iterate over each element:
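The element-by-element loop described above can be sketched as follows, next to the equivalent single NumPy expression. The values in `a` and `b` are illustrative assumptions.

```python
import numpy as np

# Two sequences of the same length (illustrative values).
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Pure-Python approach: iterate over each element and multiply pairwise.
c = []
for i in range(len(a)):
    c.append(a[i] * b[i])

# NumPy approach: the multiplication is applied element-wise in compiled code.
c_vectorized = np.array(a) * np.array(b)

print(c)
print(c_vectorized)
```

Both approaches produce the same values; the NumPy version needs less code and avoids the per-element overhead of the Python loop.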
seaborn is a Python data visualization library based on matplotlib. It provides a high-
level interface for drawing attractive and informative statistical graphics.
4.1. System flow diagram
Dataset columns: Precip Type, Temperature (C), Apparent Temperature (C), Humidity,
Wind Speed (km/h), Wind Bearing (degrees), Visibility (km), Loud Cover,
Pressure (millibars), Daily Summary
meantempm
maxpressurem_1 -0.519699
maxpressurem_2 -0.425666
maxpressurem_3 -0.408902
meanpressurem_1 -0.365682
meanpressurem_2 -0.269896
meanpressurem_3 -0.263008
minpressurem_1 -0.201003
minhumidity_1 -0.148602
minhumidity_2 -0.143211
minhumidity_3 -0.118564
minpressurem_2 -0.104455
minpressurem_3 -0.102955
precipm_2 0.084394
precipm_1 0.086617
precipm_3 0.098684
maxhumidity_1 0.132466
maxhumidity_2 0.151358
maxhumidity_3 0.167035
maxdewptm_3 0.829230
maxtempm_3 0.832974
mindewptm_3 0.833546
meandewptm_3 0.834251
mintempm_3 0.836340
maxdewptm_2 0.839893
meandewptm_2 0.848907
mindewptm_2 0.852760
mintempm_2 0.854320
meantempm_3 0.855662
maxtempm_2 0.863906
meantempm_2 0.881221
maxdewptm_1 0.887235
meandewptm_1 0.896681
mindewptm_1 0.899000
mintempm_1 0.905423
maxtempm_1 0.923787
meantempm_1 0.937563
mintempm 0.973122
maxtempm 0.976328
meantempm 1.000000
5. MODULES :
5.3. Normalization
The data collected from the internet should first be filtered and then normalized.
Normalization refers to rescaling real-valued numeric attributes into the range
0 to 1.
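Rescaling to the range 0 to 1 is commonly done with min-max normalization. The following is a minimal sketch using pandas, assuming every column is numeric; the helper name `min_max_normalize` and the sample values are hypothetical.

```python
import pandas as pd

def min_max_normalize(frame):
    """Rescale every numeric column of frame into the range 0 to 1."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # A constant column has a range of zero; divide by 1 instead to avoid NaN.
    col_range = col_range.replace(0, 1)
    return (frame - col_min) / col_range

# Illustrative values only.
sample = pd.DataFrame({
    "meantempm": [10.0, 20.0, 30.0],
    "meanpressurem": [1000.0, 1010.0, 1020.0],
})
normalized = min_max_normalize(sample)
print(normalized)
```

After this step the smallest value in each column maps to 0 and the largest to 1, so attributes measured in different units become comparable.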
To assess the correlation in this data I will call the corr() method of the Pandas
DataFrame object. Chained to this corr() method call I can then select the column of
interest ("meantempm") and again chain another method call sort_values() on the
resulting Pandas Series object. This will output the correlation values from most
negatively correlated to the most positively correlated.
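The chained call described above can be sketched on a small DataFrame. The column names follow the ones used in this project, but the values are made up purely for illustration.

```python
import pandas as pd

# Illustrative values only; the real DataFrame holds the collected weather data.
sample = pd.DataFrame({
    "meantempm": [10.0, 12.0, 15.0, 11.0, 18.0],
    "meantempm_1": [9.0, 10.0, 12.0, 15.0, 11.0],
    "maxpressurem_1": [1025.0, 1022.0, 1015.0, 1024.0, 1010.0],
})

# corr() returns the correlation matrix; selecting "meantempm" gives a Series,
# and sort_values() orders it from most negative to most positive.
correlations = sample.corr()["meantempm"].sort_values()
print(correlations)
```

The column of interest always appears last in the sorted output because its correlation with itself is 1.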
6. IMPLEMENTATION OF ALGORITHM:
The Linear Regression model has the form:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1} + \epsilon$$
where:
ŷ is the predicted outcome variable (dependent variable)
xj are the predictor variables (independent variables) for j = 1,2,..., p-1 parameters
β0 is the intercept or the value of ŷ when each xj equals zero
βj is the change in ŷ based on a one unit change in one of the corresponding xj
ε is a random error term associated with the difference between the predicted ŷi value
and the actual yi value
That last term in the equation for the Linear Regression is a very important one. The
most basic form of building a Linear Regression model relies on an algorithm known as
Ordinary Least Squares, which finds the combination of βj values that minimizes
the sum of the squared ε terms.
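Ordinary Least Squares as described above can be sketched with NumPy's least-squares solver. This is only an illustration under stated assumptions: the data are hypothetical and generated without noise from known coefficients (intercept 2, slopes 3 and -1), so the fit should recover them; it is not the model fitted in this project.

```python
import numpy as np

# Hypothetical predictors x1, x2 and a noise-free outcome y = 2 + 3*x1 - 1*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of the intercept.
design = np.column_stack([np.ones(len(X)), X])

# lstsq returns the coefficients that minimize the sum of squared residuals.
beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)

y_hat = design @ beta      # predicted values
residuals = y - y_hat      # the error term for each observation
print(beta)
```

With real, noisy measurements the residuals would not be zero, and their squared sum is the quantity that the fit minimizes.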
7. TESTING:
Looking at the histogram of the values for maxhumidity the data exhibits quite a bit of
negative skew. I will want to keep this in mind when selecting prediction models and
evaluating the strength of impact of max humidities. Many of the underlying statistical
methods assume that the data is normally distributed. For now I think I will leave them
alone but it will be good to keep this in mind and have a certain amount of skepticism of
it.
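The skew seen in the histogram can also be checked numerically. The following is a small sketch using the pandas `Series.skew()` method on hypothetical humidity readings (most values near the upper bound and one low reading); the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical maximum-humidity readings in percent.
maxhumidity = pd.Series([100, 98, 97, 100, 95, 99, 100, 60, 96, 100], dtype=float)

# A negative value indicates a longer tail on the low side (negative skew).
skewness = maxhumidity.skew()
print(skewness)
```

A value well below zero supports the visual impression of negative skew and is a reminder to treat normality assumptions with caution.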
This plot exhibits another interesting feature. From this plot, the data is multimodal,
which leads me to believe that there are two very different sets of environmental
circumstances apparent in this data. I am hesitant to remove these values since I know
that the temperature swings in this area of the country can be quite extreme, especially
between seasons of the year. I am worried that these low values have some explanatory
usefulness that would be lost by removing them but, once again, I will be skeptical
about them at the same time.
8. CONCLUSION:
BIBLIOGRAPHY
Reference books
Katarya, Rahul, and Polipireddy Srinivas. "Predicting Heart Disease at Early Stages using
Machine Learning: A Survey." 2020 International Conference on Electronics and
Sustainable Communication Systems (ICESC). IEEE, 2020.
Gavhane, Aditi, et al. "Prediction of heart disease using machine learning." 2018
Second International Conference on Electronics, Communication and Aerospace
Technology (ICECA). IEEE, 2018.
Kohli, Pahulpreet Singh, and Shriya Arora. "Application of machine learning in disease
prediction." 2018 4th International conference on computing communication and
automation (ICCCA). IEEE, 2018.
Atallah, Rahma, and Amjed Al-Mousa. "Heart Disease Detection Using Machine
Learning Majority Voting Ensemble Method." 2019 2nd International Conference on new
Trends in Computing Sciences (ICTCS). IEEE, 2019.
Reference website
• www.kaggle.com
• www.tutorialpoint.com
• ieeexplore.ieee.org
• semanticscholar.org
APPENDICES:
Screenshots
Collections of dataset
Output data
Sample coding
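from collections import namedtuple
from datetime import timedelta
import time

import pandas as pd
import requests

API_KEY = '7052ad35e3c73564'
BASE_URL = "http://api.wunderground.com/api/{}/history_{}/q/NE/Lincoln.json"

# Fields collected for each day; they match the keyword arguments used below.
features = ["date", "meantempm", "meandewptm", "meanpressurem", "maxhumidity",
            "minhumidity", "maxtempm", "mintempm", "maxdewptm", "mindewptm",
            "maxpressurem", "minpressurem", "precipm"]
DailySummary = namedtuple("DailySummary", features)

# NOTE: the function header and the line that builds the request were lost from
# this listing; the name, signature and date format below are assumed.
def extract_weather_data(url, api_key, target_date, days):
    """Fetch one daily summary per day, starting at target_date."""
    records = []
    for _ in range(days):
        request = url.format(api_key, target_date.strftime('%Y%m%d'))
        response = requests.get(request)
        if response.status_code == 200:
            data = response.json()['history']['dailysummary'][0]
            records.append(DailySummary(
                date=target_date,
                meantempm=data['meantempm'],
                meandewptm=data['meandewptm'],
                meanpressurem=data['meanpressurem'],
                maxhumidity=data['maxhumidity'],
                minhumidity=data['minhumidity'],
                maxtempm=data['maxtempm'],
                mintempm=data['mintempm'],
                maxdewptm=data['maxdewptm'],
                mindewptm=data['mindewptm'],
                maxpressurem=data['maxpressurem'],
                minpressurem=data['minpressurem'],
                precipm=data['precipm']))
        time.sleep(6)  # pause between requests to stay within the API rate limit
        target_date += timedelta(days=1)
    return records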
# if you closed our terminal or Jupyter Notebook, reinitialize your imports and
df = pd.DataFrame(records, columns=features).set_index('date')
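# Try the idea on a small sample first. NOTE: the definition of tmp was lost from
# this listing; a ten-row copy of the mean temperature column is assumed here.
tmp = df[['meantempm']].head(10).copy()

# 1 day prior
N = 1
feature = 'meantempm'
# total number of rows
rows = tmp.shape[0]
# Nth prior measurements of feature, padded with N leading None values so the
# list keeps the same length as the DataFrame
nth_prior_measurements = [None] * N + [tmp[feature].iloc[i - N] for i in range(N, rows)]
col_name = "{}_{}".format(feature, N)
tmp[col_name] = nth_prior_measurements
tmp

def derive_nth_day_feature(df, feature, N):
    """Add a column holding the value of feature from N days earlier."""
    rows = df.shape[0]
    nth_prior_measurements = [None] * N + [df[feature].iloc[i - N] for i in range(N, rows)]
    col_name = "{}_{}".format(feature, N)
    df[col_name] = nth_prior_measurements

# derive the 1-, 2- and 3-day prior columns for every feature except the date
for feature in features:
    if feature != 'date':
        for N in range(1, 4):
            derive_nth_day_feature(df, feature, N)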
df.columns
'meanpressurem_1', 'meanpressurem_2', 'meanpressurem_3',
'precipm_2', 'precipm_3'],
dtype='object')
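# Original (same-day) features to drop. NOTE: this statement was truncated in the
# listing; keeping meantempm, mintempm and maxtempm is inferred from the
# correlation table, which lists those three columns without a day suffix.
to_remove = [feature for feature in features
             if feature not in ['meantempm', 'mintempm', 'maxtempm']]
to_keep = [col for col in df.columns if col not in to_remove]
df = df[to_keep]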
df.columns
'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3', 'maxdewptm_1',
'precipm_2', 'precipm_3'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
minhumidity_1 999 non-null object
dtypes: object(39)
memory usage: 312.5+ KB
df = df.apply(pd.to_numeric, errors='coerce')
df.info()
<class 'pandas.core.frame.DataFrame'>
maxtempm_2 998 non-null float64
spread = df.describe().T
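# precalculate interquartile range for ease of use in next calculation
IQR = spread['75%'] - spread['25%']
# create an outliers column which is either 3 IQRs below the first quartile or
# 3 IQRs above the third quartile
spread['outliers'] = (spread['min'] < (spread['25%'] - 3 * IQR)) | (spread['max'] > (spread['75%'] + 3 * IQR))
# display only the features with extreme outliers (.ix was removed from pandas)
spread.loc[spread.outliers]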
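import matplotlib.pyplot as plt

# Jupyter magic; omit this line when running outside a notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]

# histogram of the previous day's maximum humidity
df.maxhumidity_1.hist()
plt.title('Distribution of maxhumidity_1')
plt.xlabel('maxhumidity_1')
plt.show()

# histogram of the previous day's minimum pressure
df.minpressurem_1.hist()
plt.title('Distribution of minpressurem_1')
plt.xlabel('minpressurem_1')
plt.show()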
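# Treat missing precipitation readings as zero. NOTE: the loop header was lost
# from this listing; iterating over the three derived precipm columns is assumed.
for precip_col in ['precipm_1', 'precipm_2', 'precipm_3']:
    missing_vals = pd.isnull(df[precip_col])
    df.loc[missing_vals, precip_col] = 0  # .loc avoids chained assignment
# drop any rows that still contain missing values
df = df.dropna()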