
“WEATHER PREDICTION”

MAJOR PROJECT

Submitted by ASHWIN.J
(18MCA038)
Under the Guidance of
Dr. B. UMA MAHESWARI M.Sc., MCA., M.Phil., Ph.D.
Assistant Professor,

Department of Computer Applications.

In partial fulfillment of the requirements for the award of the degree of


MASTER OF COMPUTER APPLICATIONS
of Bharathiar University

DEPARTMENT OF COMPUTER APPLICATIONS


PSG COLLEGE OF ARTS & SCIENCE
An Autonomous College, Affiliated to Bharathiar University
Accredited with ‘A’ Grade by NAAC (3rd Cycle)
College with Potential for Excellence (Status Awarded by the UGC)
Star College Status Awarded by DBT-MST
An ISO 9001:2015 Certified Institution
Coimbatore - 641 014
APRIL 2021
DEPARTMENT OF COMPUTER APPLICATIONS
PSG COLLEGE OF ARTS & SCIENCE
An Autonomous College, Affiliated to Bharathiar University
Accredited with ‘A’ Grade by NAAC (3rd Cycle)
College with Potential for Excellence (Status Awarded by the UGC)
Star College Status Awarded by DBT-MST
An ISO 9001:2015 Certified Institution
Civil Aerodrome Post
Coimbatore - 641 014
APRIL 2021

CERTIFICATE
This is to certify that this Project work entitled “WEATHER
PREDICTION” is a bonafide record of work done by ASHWIN J
(18MCA038) in partial fulfillment of the requirements for the award of the Degree
of Master of Computer Applications of Bharathiar University.

Dr. B. UMA MAHESWARI M.Sc., MCA., M.Phil., Ph.D.          Dr. R. SUDHA MCA., M.Phil., Ph.D.
Faculty Guide                                             Head of the Department

Submitted for Viva-Voce Examination held on 09.04.2021

Dr. B. UMA MAHESWARI M.Sc., MCA., M.Phil., Ph.D.

Internal Examiner                                         External Examiner


DECLARATION

I, ASHWIN J (18MCA038), hereby declare that this Project work


entitled “WEATHER PREDICTION” is submitted to PSG College of Arts
and Science (Autonomous), Coimbatore, in partial fulfillment of the requirements for the award of
Master of Computer Applications, and is a record of original work done by me
under the supervision and guidance of Dr. B. UMA MAHESWARI M.Sc.,
MCA., M.Phil., Ph.D., Assistant Professor, Department of Computer
Applications, PSG College of Arts and Science, Coimbatore.

This Project work has not been submitted by me for the award of any
other Degree/Diploma/Associateship/Fellowship or any other similar degree
to any other university.

Place: Coimbatore                                         ASHWIN J
Date: 08.04.2021                                          (18MCA038)


ACKNOWLEDGEMENT

With great gratitude, I would like to acknowledge the help of those who
contributed with their valuable suggestions and timely assistance to complete
this work.

First and foremost, I would like to extend my heartfelt gratitude and


place my sincere thanks to Thiru. L. GOPALAKRISHNAN, Trustee, PSG &
SONS Charities, Coimbatore, for providing all sorts of support and the necessary
facilities throughout the course.

I express my deep sense of gratitude to the Secretary, Dr. T. KANNAIAN
M.Sc., M.Tech., Ph.D., for permitting me to undertake this work.

I thank our Principal, Dr. D. BRINDHA M.Sc., M.Phil., Ph.D.,
M.A (Yoga)., for her support and constant inspiration throughout the
course of the project, and I would also like to thank our Vice Principal,
Dr. A. ANGURAJ M.Sc., M.Phil., Ph.D., for his support.

I owe my deepest gratitude to Dr. R. SUDHA MCA., M.Phil., Ph.D.,
Head of the Department, for her counsel and for encouraging me to pursue new
goals and ideas.

My sincere thanks to Dr. B. UMA MAHESWARI M.Sc., MCA.,
M.Phil., Ph.D., for her valuable suggestions, support and guidance as my
internal guide, without which my work would not have reached its present
form.

I thank Mr. R. Ramkumar MCA, for his valuable suggestions,
support and guidance as my external guide, without which my work would not
have reached its present form.
Last but not least, I am greatly indebted to my parents and friends for their kind
co-operation in each and every step I took in this project.

DATE: 02.04.2021
CHENNAI

Mr. J. ASHWIN (3rd MCA)


REG.No. 18MCA038
PSG College of Arts and Science
Coimbatore.

TO WHOMSOEVER IT MAY CONCERN

This is to certify that Mr. J. ASHWIN (REG NO: 18MCA038), a final-year MCA student
at PSG COLLEGE OF ARTS & SCIENCE, COIMBATORE, has successfully
completed the PROJECT entitled “WEATHER PREDICTION” in the “Machine
Learning” department of our organization during the period JANUARY 2021 to MARCH
2021.

WE WISH HIM ALL THE BEST FOR HIS FUTURE

For Shiash Info Solutions Private Limited

Ashwini Kanniyappan
Manager – Human Resources

Shiash Info Solutions Private Limited


#51, Level 4, Tower A, Rattha TEK Meadows, Old Mahabalipuram
Road, Sholinganallur, Chennai – 600 119, Tamil Nadu
India | +914466255681 | [email protected]
SYNOPSIS:

Weather forecasting has gained the attention of many researchers from various
research communities due to its effect on human life across the globe. The emergence of
machine learning techniques in the last decade, coupled with the wide
availability of massive weather observation data and the advent of information
and computer technology, has motivated many researchers to explore hidden
hierarchical patterns in large volumes of weather data for weather
forecasting. This study investigates machine learning techniques for weather
forecasting. Using machine learning algorithms, we can evaluate the prediction
accuracy achieved on these datasets.
TABLE OF CONTENTS

1. Introduction
   1.1 Project Overview
2. System Specification
   2.1 Software Requirement
   2.2 Hardware Requirement
3. System Analysis
   3.1 Existing System
   3.2 Proposed System
4. System Design and Development
   4.1 System Flow Diagram
   4.2 Data Collection
5. Modules
   5.1 Dataset Selection
   5.2 Feature Selection
   5.3 Normalization
   5.4 Machine Learning
   5.5 Data Preprocessing
   5.6 Data Visualization
6. Implementation of Algorithm
   6.1 Linear Regression
7. Testing
   7.1 Testing and Implementation
8. Future Enhancement
9. Conclusion
Bibliography
Appendices
1. INTRODUCTION:

1.1. PROJECT OVERVIEW

Weather is an important aspect of a person’s life, as it can tell us when it will rain
and when it will be sunny. Weather forecasting is the attempt by meteorologists to predict
the weather conditions that may be expected at some future time. The climatic
parameters include temperature, pressure, humidity, dew point, rainfall, precipitation,
wind speed, and the size of the dataset. Here, only the parameters temperature, pressure,
humidity, dew point, precipitation and rainfall are considered for the experimental
analysis.
2. SYSTEM SPECIFICATION:

2.1. Software Requirement

The software used in our project is:

Python 3.7: Python is an interpreted, high-level, general-purpose programming language. Its
formatting is visually uncluttered, and it often uses English keywords where other
languages use punctuation. It provides a vast collection of libraries for data mining and
prediction.

Jupyter Notebook / Spyder / PyCharm: Spyder is an open-source, cross-platform integrated
development environment (IDE) for scientific programming in the Python language.
It integrates with a number of prominent packages as well as other open-source software.

NumPy: NumPy was used for numerical computation on the arrays of weather data.

Pandas: Pandas was used for data preprocessing and statistical analysis of the data.

Matplotlib: Matplotlib was used for the graphical representation of our predictions.

2.2. Hardware Requirement

Operating System : Windows OS
Processor        : i3 or higher
RAM              : 8 GB or higher
IDE              : Anaconda
3. SYSTEM ANALYSIS:

3.1. Existing System

It was not until the invention of the electric telegraph in 1835 that the modern age of
weather forecasting began. Before that, the fastest that distant weather reports could
travel was around 160 kilometres per day (100 mi/day), though 60–120 kilometres per day
(40–75 mi/day) was more typical, whether by land or by sea. By the late 1840s, the
telegraph allowed reports of weather conditions from a wide area to be received almost
instantaneously, allowing forecasts to be made from knowledge of weather conditions
further upwind.

The two men credited with the birth of forecasting as a science were Francis Beaufort and
Robert FitzRoy, officers of the Royal Navy. Both were influential men in British naval and
governmental circles, and though ridiculed in the press at the time, their work gained
scientific credence, was accepted by the Royal Navy, and formed the basis for all of
today's weather forecasting knowledge.

Beaufort developed the Wind Force Scale and Weather Notation coding, which he was
to use in his journals for the remainder of his life. He also promoted the development of
reliable tide tables around British shores and, with his friend FitzRoy, expanded weather
record-keeping at 200 British coastguard stations.

Robert FitzRoy was appointed in 1854 as chief of a new department within the Board of
Trade to deal with the collection of weather data at sea as a service to mariners. This was
the forerunner of the modern Meteorological Office. All ship captains were tasked with
collating data on the weather and computing it, with the use of tested instruments that
were loaned for this purpose.

3.2. Proposed System

To predict the weather, a massive amount of data is fed into an algorithm that uses deep
learning techniques to learn from it and then make predictions based on the past data.
The trained ML model works with a physics-free approach to the forecasting process: it
has been designed to learn from atmospheric examples daily, without any prior physical
model fed into the system. The underlying convolutional neural network (CNN), ‘U-Net’,
which comprises a sequence of layers (sets of mathematical operations), takes the input
satellite imagery and transforms it into output images. The layers of the network are
arranged in an encoding phase, which progressively decreases the resolution of the
images; a separate decoding phase has been added to expand the low-resolution images
back out. To start with, the engineering team trained the model by feeding it historical
data collected in the US from 2017 to 2019, and then compared it to three baseline
models: High-Resolution Rapid Refresh (HRRR), an Optical Flow (OF) algorithm, and a
persistence model. According to the researchers, Google's artificial intelligence
outperformed all the traditional methods when compared using precision and recall
graphs. The model treats weather prediction as an image-to-image translation problem,
leveraging state-of-the-art CNNs. Moving forward, this mechanism and its ML model can
be refined to produce more accurate forecasts.
4. SYSTEM DESIGN AND DEVELOPMENT:

The model was designed using Python 3 with the help of open-source repositories. We use
NumPy, Matplotlib, Seaborn, pandas, scikit-learn, etc.
NumPy is the fundamental package for scientific computing in Python. It is a Python
library that provides a multidimensional array object, various derived objects (such as
masked arrays and matrices), and an assortment of routines for fast operations on arrays,
including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete
Fourier transforms, basic linear algebra, basic statistical operations, random simulation
and much more.
At the core of the NumPy package is the ndarray object. This encapsulates n-
dimensional arrays of homogeneous data types, with many operations performed
in compiled code for performance. There are several important differences between
NumPy arrays and the standard Python sequences:
NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory. The exception: one can have arrays of (Python,
including NumPy) objects, thereby allowing for arrays of different sized elements.
NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with less
code than is possible using Python’s built-in sequences.
A growing plethora of scientific and mathematical Python-based packages are using
NumPy arrays; though these typically support Python-sequence input, they convert such
input to NumPy arrays prior to processing, and they often output NumPy arrays. In other
words, in order to efficiently use much (perhaps even most) of today’s
scientific/mathematical Python-based software, just knowing how to use Python’s built-
in sequence types is insufficient - one also needs to know how to use NumPy arrays.
The points about sequence size and speed are particularly important in scientific
computing. As a simple example, consider the case of multiplying each element in a 1-D
sequence with the corresponding element in another sequence of the same length. If the
data are stored in two Python lists, a and b, we could iterate over each element:
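A minimal sketch of that comparison (adapted from the example in the NumPy documentation; the lists here are illustrative):

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# pure-Python element-wise multiplication: loop over every index
c = [a[i] * b[i] for i in range(len(a))]

# the equivalent NumPy operation, executed in compiled code
import numpy as np
c = np.array(a) * np.array(b)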
seaborn is a Python data visualization library based on matplotlib. It provides a high-
level interface for drawing attractive and informative statistical graphics.

4.1. System flow diagram

4.2. Data collection

The dataset contains the following attributes: Precip Type, Temperature (C), Apparent
Temperature (C), Humidity, Wind Speed (km/h), Wind Bearing (degrees), Visibility (km),
Loud Cover, Pressure (millibars), and Daily Summary.
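As an illustrative sketch of loading such a dataset with pandas (the filename 'weatherHistory.csv' is an assumption, not fixed by this report):

import pandas as pd

# load the downloaded CSV and inspect its shape and attribute names
df = pd.read_csv('weatherHistory.csv')
print(df.shape)
print(df.columns)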

Feature             Correlation with meantempm

maxpressurem_1      -0.519699
maxpressurem_2      -0.425666
maxpressurem_3      -0.408902
meanpressurem_1     -0.365682
meanpressurem_2     -0.269896
meanpressurem_3     -0.263008
minpressurem_1      -0.201003
minhumidity_1       -0.148602
minhumidity_2       -0.143211
minhumidity_3       -0.118564
minpressurem_2      -0.104455
minpressurem_3      -0.102955
precipm_2            0.084394
precipm_1            0.086617
precipm_3            0.098684
maxhumidity_1        0.132466
maxhumidity_2        0.151358
maxhumidity_3        0.167035
maxdewptm_3          0.829230
maxtempm_3           0.832974
mindewptm_3          0.833546
meandewptm_3         0.834251
mintempm_3           0.836340
maxdewptm_2          0.839893
meandewptm_2         0.848907
mindewptm_2          0.852760
mintempm_2           0.854320
meantempm_3          0.855662
maxtempm_2           0.863906
meantempm_2          0.881221
maxdewptm_1          0.887235
meandewptm_1         0.896681
mindewptm_1          0.899000
mintempm_1           0.905423
maxtempm_1           0.923787
meantempm_1          0.937563
mintempm             0.973122
maxtempm             0.976328
meantempm            1.000000
5. MODULES:

The steps involved in preprocessing are:

5.1. Dataset Selection

In dataset selection, we choose a dataset suited to the algorithm. Such datasets are
usually published by companies or prepared for research purposes, are mostly open
source, and can be downloaded from websites such as Google Dataset Search, Kaggle, etc.

5.2. Feature Selection

The data we have collected contain many unwanted attributes that are not
needed in our project. Hence, we use only the attributes we need.
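For illustration only (the column names below are hypothetical, not prescribed by this report), keeping just the needed attributes in pandas might look like:

# keep only the attributes required for the analysis
needed = ['Temperature (C)', 'Humidity', 'Pressure (millibars)']
df = df[needed]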

5.3. Normalization

The data collected from the internet should first be normalized. Normalization
refers to rescaling real-valued numeric attributes into the range 0 to 1. After the data
are filtered, they are normalized.
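A minimal sketch of such rescaling, assuming scikit-learn's MinMaxScaler and hypothetical column names:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical numeric columns from a weather dataset
df = pd.DataFrame({'temperature': [21.0, 35.5, 28.2], 'humidity': [45.0, 90.0, 60.0]})
scaler = MinMaxScaler()  # rescales each column into the range [0, 1]
df[['temperature', 'humidity']] = scaler.fit_transform(df[['temperature', 'humidity']])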

5.4. Machine Learning

Training a model is the process of iteratively improving your prediction equation
by looping through the dataset multiple times, each time updating the weight and bias
values in the direction indicated by the slope of the cost function (the gradient). Training
is complete when we reach an acceptable error threshold, or when subsequent training
iterations fail to reduce the cost.
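A minimal sketch of this loop for a single-feature linear model (illustrative only; not the exact training routine used in this project):

import numpy as np

def train(x, y, lr=0.01, epochs=1000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = w * x + b
        dw = (2.0 / n) * np.sum((y_pred - y) * x)  # slope of the cost w.r.t. the weight
        db = (2.0 / n) * np.sum(y_pred - y)        # slope of the cost w.r.t. the bias
        w -= lr * dw                               # step against the gradient
        b -= lr * db
    return w, b

w, b = train(np.array([1.0, 2.0, 3.0, 4.0]), np.array([3.0, 5.0, 7.0, 9.0]))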

5.5. Data Preprocessing

When we talk about data, we usually think of large datasets with a huge
number of rows and columns. While that is a likely scenario, it is not always the case:
data can come in many different forms, such as structured tables, images, audio files,
and videos.
Machines don’t understand free text, image, or video data as it is; they understand 1s and
0s. So it won’t be good enough to put on a slideshow of all our images and expect our
machine learning model to be trained just by that!
In any machine learning process, data preprocessing is the step in which the data get
transformed, or encoded, to bring them to a state that the machine can easily parse. In
other words, the features of the data can then be easily interpreted by the algorithm.
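As a small illustration of such encoding (the column and categories are hypothetical), pandas can one-hot encode a categorical attribute so the algorithm sees only numbers:

import pandas as pd

df = pd.DataFrame({'precip_type': ['rain', 'snow', 'rain', 'none']})
encoded = pd.get_dummies(df, columns=['precip_type'])  # one indicator column per category
print(encoded)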

5.6. Data visualization


Data visualization is the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to
analyze massive amounts of information and make data-driven decisions. Our eyes are
drawn to colors and patterns. We can quickly identify red from blue, square from circle.
Our culture is visual, including everything from art and advertisements to TV and
movies.

Correlation Value Interpretation

0.8 - 1.0 Very Strong

0.6 - 0.8 Strong

0.4 - 0.6 Moderate

0.2 - 0.4 Weak

0.0 - 0.2 Very Weak

To assess the correlation in this data I will call the corr() method of the Pandas
DataFrame object. Chained to this corr() method call I can then select the column of
interest ("meantempm") and again chain another method call sort_values() on the
resulting Pandas Series object. This will output the correlation values from most
negatively correlated to the most positively correlated.
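A sketch of that call chain, assuming df is the DataFrame of derived features built in the appendix:

# correlation of every column with meantempm, sorted ascending
df.corr()['meantempm'].sort_values()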

6. IMPLEMENTATION OF ALGORITHM:

6.1. Linear Regression


Regression is a method of modelling a target value based on independent
predictors. This method is mostly used for forecasting and for finding cause-and-effect
relationships between variables. Regression techniques differ mainly in the
number of independent variables and the type of relationship between the independent
and dependent variables.
Linear regression applies a set of assumptions, primarily regarding linear
relationships, and numerical techniques to predict an outcome (Y, a.k.a. the dependent
variable) based on one or more predictors (the X's, or independent variables), with the
end goal of establishing a model (a mathematical formula) that predicts outcomes given
only the predictor values, with some amount of uncertainty.
The generalized formula for a Linear Regression model is:

ŷ = β0 + β1x1 + β2x2 + … + βp-1xp-1 + Ε

where:
ŷ is the predicted outcome variable (the dependent variable)
xj are the predictor variables (independent variables) for j = 1, 2, ..., p-1
β0 is the intercept, or the value of ŷ when each xj equals zero
βj is the change in ŷ for a one-unit change in the corresponding xj
Ε is a random error term associated with the difference between the predicted ŷi value
and the actual yi value
That last term in the equation is a very important one. The
most basic form of building a Linear Regression model relies on an algorithm known as
Ordinary Least Squares, which finds the combination of βj values that minimizes
the Ε term.
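As an illustration of Ordinary Least Squares itself (a toy example, separate from the project's model), NumPy can solve for the β values directly:

import numpy as np

# hypothetical design matrix X (leading column of ones for the intercept) and target y
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([2.1, 3.9, 6.2, 7.8])

beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
print(beta)  # [intercept, slope] minimizing the squared error term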

7. TESTING:

7.1. TESTING AND IMPLEMENTATION

Looking at the histogram of the values for maxhumidity, the data exhibit quite a bit of
negative skew. I will want to keep this in mind when selecting prediction models and
evaluating the strength of the impact of max humidities. Many of the underlying
statistical methods assume that the data are normally distributed. For now I will leave
these values alone, but it will be good to keep this in mind and retain a certain amount of
skepticism about them.

This plot exhibits another interesting feature. From this plot, the data are multimodal,
which leads me to believe that there are two very different sets of environmental
circumstances apparent in the data. I am hesitant to remove these values since I know
that the temperature swings in this area of the country can be quite extreme, especially
between seasons of the year. I am also aware that removing these low values might
discard some explanatory usefulness, so once again I will remain skeptical about them.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=2/12, random_state=0)

# fit the linear model on the training subset, then predict for the test subset
# so that y_pred is defined before scoring
regressor = LinearRegression().fit(x_train, y_train)
y_pred = regressor.predict(x_test)

r_squared = r2_score(y_test, y_pred) * 100  # R^2 expressed as a percentage

8. CONCLUSION:

In this project, I demonstrated how to use the Linear Regression machine
learning algorithm to predict future mean weather temperatures based on the data
collected in the prior project. I demonstrated how to use the statsmodels library to select
statistically significant predictors based on sound statistical methods. I then utilized
this information to fit a prediction model on a training subset using Scikit-
Learn's LinearRegression class. Using this fitted model I could then predict the expected
values from the inputs of a testing subset and evaluate the accuracy of the prediction,
which turned out to be reasonable.

BIBLIOGRAPHY

Reference website

• www.kaggle.com
• www.tutorialpoint.com
• ieeexplore.ieee.org
• semanticscholar.org

APPENDICES:

Screenshots

Collection of datasets

Implementation of the program in the portal

Importing of weather forecasting and prediction

Output data
Sample coding

from datetime import datetime, timedelta

import time

from collections import namedtuple

import pandas as pd

import requests

import matplotlib.pyplot as plt

API_KEY = '7052ad35e3c73564'

BASE_URL = "http://api.wunderground.com/api/{}/history_{}/q/NE/Lincoln.json"

target_date = datetime(2016, 5, 16)

features = ["date", "meantempm", "meandewptm", "meanpressurem", "maxhumidity", "minhumidity",
            "maxtempm", "mintempm", "maxdewptm", "mindewptm", "maxpressurem", "minpressurem", "precipm"]

DailySummary = namedtuple("DailySummary", features)

def extract_weather_data(url, api_key, target_date, days):
    records = []
    for _ in range(days):
        request = url.format(api_key, target_date.strftime('%Y%m%d'))
        response = requests.get(request)
        if response.status_code == 200:
            data = response.json()['history']['dailysummary'][0]
            records.append(DailySummary(
                date=target_date,
                meantempm=data['meantempm'],
                meandewptm=data['meandewptm'],
                meanpressurem=data['meanpressurem'],
                maxhumidity=data['maxhumidity'],
                minhumidity=data['minhumidity'],
                maxtempm=data['maxtempm'],
                mintempm=data['mintempm'],
                maxdewptm=data['maxdewptm'],
                mindewptm=data['mindewptm'],
                maxpressurem=data['maxpressurem'],
                minpressurem=data['minpressurem'],
                precipm=data['precipm']))
        # pause between requests to respect the API rate limit
        time.sleep(6)
        target_date += timedelta(days=1)
    return records

records = extract_weather_data(BASE_URL, API_KEY, target_date, 500)

# if you closed your terminal or Jupyter Notebook, reinitialize your imports and

# variables first and remember to set your target_date to datetime(2016, 5, 16)

records += extract_weather_data(BASE_URL, API_KEY, target_date, 500)

df = pd.DataFrame(records, columns=features).set_index('date')

tmp = df[['meantempm', 'meandewptm']].head(10)

# 1 day prior

N=1

# target measurement of mean temperature

feature = 'meantempm'

# total number of rows

rows = tmp.shape[0]

# a list representing the Nth prior measurements of feature

# notice that the front of the list needs to be padded with N

# None values to maintain a consistent row count for each N

nth_prior_measurements = [None]*N + [tmp[feature][i-N] for i in range(N, rows)]

# make a new column name of feature_N and add to DataFrame

col_name = "{}_{}".format(feature, N)

tmp[col_name] = nth_prior_measurements

tmp

def derive_nth_day_feature(df, feature, N):
    rows = df.shape[0]
    nth_prior_measurements = [None]*N + [df[feature][i-N] for i in range(N, rows)]
    col_name = "{}_{}".format(feature, N)
    df[col_name] = nth_prior_measurements

for feature in features:
    if feature != 'date':
        for N in range(1, 4):
            derive_nth_day_feature(df, feature, N)

df.columns

Index(['meantempm', 'meandewptm', 'meanpressurem', 'maxhumidity',
       'minhumidity', 'maxtempm', 'mintempm', 'maxdewptm', 'mindewptm',
       'maxpressurem', 'minpressurem', 'precipm', 'meantempm_1', 'meantempm_2',
       'meantempm_3', 'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
       'meanpressurem_1', 'meanpressurem_2', 'meanpressurem_3',
       'maxhumidity_1', 'maxhumidity_2', 'maxhumidity_3', 'minhumidity_1',
       'minhumidity_2', 'minhumidity_3', 'maxtempm_1', 'maxtempm_2',
       'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3', 'maxdewptm_1',
       'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1', 'mindewptm_2',
       'mindewptm_3', 'maxpressurem_1', 'maxpressurem_2', 'maxpressurem_3',
       'minpressurem_1', 'minpressurem_2', 'minpressurem_3', 'precipm_1',
       'precipm_2', 'precipm_3'],
      dtype='object')

# make a list of original features without meantempm, mintempm, and maxtempm
to_remove = [feature
             for feature in features
             if feature not in ['meantempm', 'mintempm', 'maxtempm']]

# make a list of columns to keep
to_keep = [col for col in df.columns if col not in to_remove]

# select only the columns in to_keep and assign to df
df = df[to_keep]

df.columns

Index(['meantempm', 'maxtempm', 'mintempm', 'meantempm_1', 'meantempm_2',
       'meantempm_3', 'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
       'meanpressurem_1', 'meanpressurem_2', 'meanpressurem_3',
       'maxhumidity_1', 'maxhumidity_2', 'maxhumidity_3', 'minhumidity_1',
       'minhumidity_2', 'minhumidity_3', 'maxtempm_1', 'maxtempm_2',
       'maxtempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3', 'maxdewptm_1',
       'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1', 'mindewptm_2',
       'mindewptm_3', 'maxpressurem_1', 'maxpressurem_2', 'maxpressurem_3',
       'minpressurem_1', 'minpressurem_2', 'minpressurem_3', 'precipm_1',
       'precipm_2', 'precipm_3'],
      dtype='object')

df.info()

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27

Data columns (total 39 columns):

meantempm 1000 non-null object

maxtempm 1000 non-null object

mintempm 1000 non-null object

meantempm_1 999 non-null object

meantempm_2 998 non-null object

meantempm_3 997 non-null object

meandewptm_1 999 non-null object

meandewptm_2 998 non-null object

meandewptm_3 997 non-null object

meanpressurem_1 999 non-null object

meanpressurem_2 998 non-null object

meanpressurem_3 997 non-null object

maxhumidity_1 999 non-null object

maxhumidity_2 998 non-null object

maxhumidity_3 997 non-null object

minhumidity_1 999 non-null object

minhumidity_2 998 non-null object

minhumidity_3 997 non-null object

maxtempm_1 999 non-null object

maxtempm_2 998 non-null object

maxtempm_3 997 non-null object

mintempm_1 999 non-null object

mintempm_2 998 non-null object

mintempm_3 997 non-null object

maxdewptm_1 999 non-null object

maxdewptm_2 998 non-null object

maxdewptm_3 997 non-null object

mindewptm_1 999 non-null object

mindewptm_2 998 non-null object

mindewptm_3 997 non-null object

maxpressurem_1 999 non-null object

maxpressurem_2 998 non-null object

maxpressurem_3 997 non-null object

minpressurem_1 999 non-null object

minpressurem_2 998 non-null object

minpressurem_3 997 non-null object

precipm_1 999 non-null object

precipm_2 998 non-null object

precipm_3 997 non-null object

dtypes: object(39)

memory usage: 312.5+ KB

df = df.apply(pd.to_numeric, errors='coerce')

df.info()

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27

Data columns (total 39 columns):

meantempm 1000 non-null int64

maxtempm 1000 non-null int64

mintempm 1000 non-null int64

meantempm_1 999 non-null float64

meantempm_2 998 non-null float64

meantempm_3 997 non-null float64

meandewptm_1 999 non-null float64

meandewptm_2 998 non-null float64

meandewptm_3 997 non-null float64

meanpressurem_1 999 non-null float64

meanpressurem_2 998 non-null float64

meanpressurem_3 997 non-null float64

maxhumidity_1 999 non-null float64

maxhumidity_2 998 non-null float64

maxhumidity_3 997 non-null float64

minhumidity_1 999 non-null float64

minhumidity_2 998 non-null float64

minhumidity_3 997 non-null float64

maxtempm_1 999 non-null float64

maxtempm_2 998 non-null float64

maxtempm_3 997 non-null float64

mintempm_1 999 non-null float64

mintempm_2 998 non-null float64

mintempm_3 997 non-null float64

maxdewptm_1 999 non-null float64

maxdewptm_2 998 non-null float64

maxdewptm_3 997 non-null float64

mindewptm_1 999 non-null float64

mindewptm_2 998 non-null float64

mindewptm_3 997 non-null float64

maxpressurem_1 999 non-null float64

maxpressurem_2 998 non-null float64

maxpressurem_3 997 non-null float64

minpressurem_1 999 non-null float64

minpressurem_2 998 non-null float64

minpressurem_3 997 non-null float64

precipm_1 889 non-null float64

precipm_2 889 non-null float64

precipm_3 888 non-null float64

dtypes: float64(36), int64(3)

memory usage: 312.5 KB

# Call describe on df and transpose it due to the large number of columns

spread = df.describe().T

# precalculate interquartile range for ease of use in next calculation

IQR = spread['75%'] - spread['25%']

# create an outliers column which is either 3 IQRs below the first quartile or

# 3 IQRs above the third quartile

spread['outliers'] = (spread['min']<(spread['25%']-(3*IQR)))|(spread['max'] > (spread['75%']+3*IQR))

# just display the features containing extreme outliers

spread.loc[spread.outliers, :]  # .loc replaces the deprecated .ix indexer

%matplotlib inline

plt.rcParams['figure.figsize'] = [14, 8]

df.maxhumidity_1.hist()

plt.title('Distribution of maxhumidity_1')

plt.xlabel('maxhumidity_1')

plt.show()

df.minpressurem_1.hist()

plt.title('Distribution of minpressurem_1')

plt.xlabel('minpressurem_1')

plt.show()

# iterate over the precip columns
for precip_col in ['precipm_1', 'precipm_2', 'precipm_3']:
    # create a boolean array of values representing nans
    missing_vals = pd.isnull(df[precip_col])
    # fill missing precipitation values with 0, using .loc to avoid chained assignment
    df.loc[missing_vals, precip_col] = 0

df = df.dropna()
