
Stock Market Analysis using Supervised Machine Learning

Aashay Pawar
Dept. Electronics & Telecom.
Pune Institute of Computer Technology, Pune
[email protected]

Abstract:
The stock market, or share market, is one of the most complicated and sophisticated arenas for any kind of business. Small owners, brokerage corporations, the banking sector and many others depend on this very body to make revenue and lower their risks. This paper proposes to use machine learning algorithms to predict future stock prices for exchange, using open-source libraries and pre-existing algorithms, to help make this unpredictable form of business a little more predictable. The paper presents a simple implementation that gives acceptable results. The outcome depends heavily on the amount of data available and assumes a number of axioms that may or may not hold in the real world at the time of prediction.

I. INTRODUCTION:

The stock market is one of the oldest venues where a person or a business organization can trade stocks, make investments and earn money from companies that sell a part of themselves on this platform. It can prove to be a rewarding investment scheme if approached wisely and with a pre-determined strategy. However, the prices and the liquidity of stock markets are highly unpredictable, and this is where we bring in technology to help us out. Machine learning is one such tool that helps us achieve what we want.

The stock market is a very important trading platform which can affect anyone at an individual or national level. The principle is quite simple. Companies list portions of their ownership as small units called stocks. They do so in order to raise money for the organization. A company lists its stock at a price called the Initial Public Offering, or simply IPO. This is the offer price at which the company can sell the stock to any individual and raise money. These stocks then become the property of their owner, who may sell them at any price to a buyer on any stock exchange. Traders and buyers can resell these shares at their own price. If this happens repeatedly with profitable exchanges, the stock value increases. However, if the company issues more stock at a lower IPO, the market price for exchange goes down and traders suffer a loss. This, in a nutshell, is why people fear investing in stock markets, and the reason stock prices rise and fall.

Now, if we had previous data about the rise and fall of a particular stock, we could think of generating a graph from it. Thanks to the internet, accurate datasets are readily available, so we can generate our graphs. A computer can very easily simulate such an example with a more scientific and mathematical approach. In statistics, we look at the values and attributes of an equation, try to identify the dependent and independent variables, and establish a relation between them. This technique is known as linear regression. It is very commonly used in statistics due to its simple and effective approach. In machine learning we adapt the same algorithm: we use the features to train a classifier, which then predicts the value of the label with a certain accuracy that can be checked while training and testing the classifier. For a classifier to be accurate we must select the right features and have enough data to train it. The accuracy of our classifier is directly proportional to the amount of data provided and the attributes selected; in short, more data means more accuracy.

II. PREDICTION MODEL:

A. Analysing the data:

The data that our program requires is taken from www.quandl.com, a premier dataset-providing platform. Here, we shall look at the raw data available and study it in order to identify suitable attributes for the prediction of our selected label.
The dataset taken is for Apple, Inc. from the WIKI database and can be extracted from Quandl using the token "WIKI/AAPL". Here "AAPL" is the ticker symbol of the Apple, Inc. stock on the NASDAQ exchange. We have extracted and used approximately all the data available to date.
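The paper does not reproduce the extraction code; a minimal sketch of how it might look with the quandl Python package (the API key shown is a placeholder) is:

    import quandl

    # Placeholder key: Quandl requires an API key for unrestricted access.
    quandl.ApiConfig.api_key = "YOUR_API_KEY"

    # Pull the full WIKI/AAPL daily history as a pandas DataFrame
    # indexed by date.
    df = quandl.get("WIKI/AAPL")
    print(df.tail())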

The attributes of the dataset include:


- Open (Opening price of the stock)
- High (Highest price at an instance of time)
- Low (Lowest price at an instance of time)
- Close (Closing price of the stock)
- Volume (Total number of shares traded during the day)
- Split ratio
- Adj. Open
- Adj. High
- Adj. Low
- Adj. Close
- Adj. Volume

The variable we shall be predicting is "Close", which will be our label, and we use "Adj. Open, Adj. High, Adj. Low, Adj. Close and Adj. Volume" to extract the features that will help us predict the outcome better. It should be noted that we use adjusted values over raw ones, as these values are readily available, already processed and free from errors. We use the above attributes to plot the graph. Such graphs are called OHLCV graphs and are very informative about the rise or fall of stocks. We then use the same plotting parameters to decide the features for the classifier.
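The paper does not include its plotting code. One hedged sketch, assuming the mplfinance library (not named by the author), could render the OHLCV chart as follows:

    import mplfinance as mpf

    # Candlestick chart with a volume panel for the last 100 trading days;
    # mplfinance expects Open/High/Low/Close/Volume columns and a DatetimeIndex.
    ohlcv = df[["Open", "High", "Low", "Close", "Volume"]].tail(100)
    mpf.plot(ohlcv, type="candle", volume=True, title="AAPL OHLCV")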

Let’s define the set of parameters which we shall be using:


- Adj. Close: This is an important source of information, as it decides the market opening price for the next day and the volume expectancy for the day.

- HL_PCT: This is a derived feature, defined by:

HL_PCT = (Adj. High - Adj. Low) / Adj. Close x 100

Using the percentage change reduces the number of features but retains the net information involved. The high-low spread is a relevant feature, as it helps us formulate the shape of the desired OHLCV graph.

- PCT_Change: This is also a derived feature, defined by:

PCT_Change = (Adj. Close - Adj. Open) / Adj. Open x 100

We apply the same idea to Open and Close as we did to High and Low, since both are very relevant in our prediction model, and this helps us reduce the number of redundant and unwanted features as well.

- Adj. Volume: This is a very important parameter, as the volume traded has a more direct impact on the future stock price than any other feature. Therefore, we shall use it unchanged in our case.
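Following the definitions above, a minimal pandas sketch of the feature derivation (column names follow the WIKI dataset; the exact code is an assumption) might be:

    # Keep only the adjusted columns used to build features.
    df = df[["Adj. Open", "Adj. High", "Adj. Low",
             "Adj. Close", "Adj. Volume"]].copy()

    # HL_PCT: high-low spread as a percentage of the adjusted close.
    df["HL_PCT"] = (df["Adj. High"] - df["Adj. Low"]) / df["Adj. Close"] * 100.0

    # PCT_Change: open-to-close percentage change for the day.
    df["PCT_Change"] = (df["Adj. Close"] - df["Adj. Open"]) / df["Adj. Open"] * 100.0

    # Final feature set for the classifier.
    df = df[["Adj. Close", "HL_PCT", "PCT_Change", "Adj. Volume"]]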

We have now analysed the data and extracted the useful information that we shall require for the classifier. This is a very important step and shall be treated with utmost care. Missing information, or a small error in deriving useful information, will lead to a failed prediction model and result in a very inefficient classifier.

Also, the features extracted are specific to the subject used and will definitely vary from subject to subject. Generalization is possible only if the data of the other subject is collected with the same coherence as that of the earlier subject.

B. Training and Testing:

At this stage we shall take the features that we extracted from our data and implement them in our machine learning model. We will be using the SciPy, Scikit-learn and Matplotlib libraries in Python to program our model. We will then train the model with the features and label which we extracted, and test it on a held-out portion of the same data.

First, we pre-process the data to make it usable, which includes the following steps:
- The values of the label attribute are shifted forward by the fraction of the dataset we want to predict.
- The data frame format is converted to NumPy array format.
- All NaN values are removed before feeding the data to the classifier.
- The data is scaled to zero mean and unit variance, so that any value X becomes

X' = (X - μ) / σ

where μ and σ are the mean and standard deviation of the corresponding feature.
- The data is split into training and testing sets of features and labels.
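A hedged sketch of this pre-processing, assuming the forecast horizon is 1% of the dataset length (the paper does not state the exact fraction):

    import math
    import numpy as np
    from sklearn import preprocessing
    from sklearn.model_selection import train_test_split

    # Assumed horizon: predict ~1% of the dataset length into the future.
    forecast_out = int(math.ceil(0.01 * len(df)))

    # The label is the adjusted close shifted forecast_out rows into the future.
    df["label"] = df["Adj. Close"].shift(-forecast_out)
    df.dropna(inplace=True)  # remove rows whose future label is NaN

    X = np.array(df.drop(columns=["label"]))  # feature matrix
    y = np.array(df["label"])                 # label vector

    # Scale each feature to zero mean and unit variance.
    X = preprocessing.scale(X)

    # Hold out 20% of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)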

Now the data is ready and can be fed as input to the classifier. We will use the simplest classifier, i.e. Linear Regression, which is defined in the linear_model module of the Scikit-learn package. We choose this classifier because it serves our purpose just right. Linear regression is a very commonly used technique for data analysis and prediction. It uses the key features to learn relations between variables based on their dependencies on other features. This type of prediction is known as supervised machine learning.

Supervised learning is a method where the features are paired with their labels. Here we train the classifier such that it learns which combinations of features result in which label.
The classifier perceives the features, maps them to their label and remembers the pairing. It remembers each combination of features and its respective label, which in our case is the stock price a few days later. It then proceeds to learn what pattern the features follow to produce their respective label. This is how supervised machine learning works.
For testing purposes, we input some combinations of features into the trained classifier and cross-check the output of the classifier against the actual label. This helps us determine the accuracy of the classifier, which is very crucial for our model. A classifier with an accuracy of less than 95% is practically useless, as a lot of money is involved and even 5% of it can be a huge loss.
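A minimal sketch of training and testing the classifier with scikit-learn, continuing from the pre-processing above:

    from sklearn.linear_model import LinearRegression

    # Fit ordinary least-squares linear regression on the training split.
    clf = LinearRegression()
    clf.fit(X_train, y_train)

    # score() returns the R^2 coefficient on the held-out test data,
    # which serves here as the accuracy measure discussed above.
    accuracy = clf.score(X_test, y_test)
    print("Test R^2:", accuracy)

    # Predictions for the test features, used for cross-checking.
    y_pred = clf.predict(X_test)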
Accuracy is a very important factor in a machine learning model. We must understand what accuracy means and how to increase it, which we cover in the next subtopic.

C. Results:

Once the model is ready, we can use it to obtain the desired results in any form we want. For this purpose, we shall be plotting a graph of our results as per the requirements which we discussed earlier in this paper.

The key component of every result is the accuracy it delivers. It should be as high as possible; as stated earlier, a model with an accuracy of less than 95% is practically useless. Some standard methods to measure accuracy in machine learning are:
- R² value of the model
- Adjusted R² value
- RMSE value
- Confusion matrix (for classification problems)
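A short sketch of how the first three measures could be computed with scikit-learn (the confusion matrix applies to classification tasks, not to this regression model):

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    # Adjusted R^2 penalizes R^2 by the number of features p
    # relative to the number of samples n.
    n, p = X_test.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(f"R^2: {r2:.4f}  Adjusted R^2: {adj_r2:.4f}  RMSE: {rmse:.4f}")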

Accuracy is the component which every machine learning model is committed to improving. After the model is developed, a huge amount of effort goes into optimizing it to get more and more accurate results. A few simple ways to boost the efficiency of a model have been discussed above.

However, let us look at some of the standard ways to optimize a machine learning algorithm:
- Unconstrained Optimization
- Newton's Method
- Gradient Descent
- Batch Learning
- Stochastic Gradient Descent
- Constrained Optimization
- SVM in Primal and Dual Forms
- Lagrange Duality
- Constrained Methods
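As one illustration from this list, stochastic gradient descent can be tried on the same data via scikit-learn's SGDRegressor; this is a sketch of an alternative the paper itself does not implement:

    from sklearn.linear_model import SGDRegressor

    # Linear model fitted by stochastic gradient descent; squared-error loss
    # makes the result comparable to ordinary linear regression.
    sgd = SGDRegressor(loss="squared_error", max_iter=1000, tol=1e-3)
    sgd.fit(X_train, y_train)
    print("SGD test R^2:", sgd.score(X_test, y_test))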

III. GOOD TO KNOW FACTS:

A. Requirements:

As the very first stage, you must become thoroughly versed in the problem requirements and the machine and throughput specifications. Do not rush this step, as it is very crucial in deciding the overall plan for the development of the program. Study the case carefully, do a little background research, collect ample knowledge of the subject at hand, identify what you actually want, and set that as your goal.

B. Function Analysis:

You must be very careful while retrieving the features from the data, as they play a direct role in the prediction model. They must all make proper and direct sense in conjunction with the labels. Minimizing the set of features, subject to the requirement constraints, as much as possible is highly recommended.

C. Implementation:

You must select an appropriate model in which to implement your mathematics to obtain results. The model selected or designed must be in keeping with the input data type. A model designed or selected for inappropriate data, or vice versa, will be completely useless. You should look for a compatible SVM or other available methods to process the data. Implementing different models simultaneously to check which works most effectively is also good practice. Furthermore, implementation is the simplest step and should take the least amount of time, so as to save time from the total time budget that could be utilized in other important steps.

D. Training and Testing:

Training a model is very undemanding. You only need to make sure that the data is consistent, coherent and available in great abundance. A large training set contributes to a stronger and more accurate classifier, which ultimately increases the overall accuracy.

Testing, on the other hand, is also a very straightforward process. Make sure your test data is at least 0.2, or 20%, of the size of your training data. It is crucial to understand that testing measures the classifier's accuracy but does not influence it; the accuracy of the classifier has no dependency on, or correlation with, the act of testing it.

E. Optimization:

It is almost impossible to create an adaptable classifier in a single attempt. Therefore, we must always continue to optimize it; there is always some room for improvement. When optimizing, keep in mind the standard methodologies and basic requirements. Shifting to an SVM, trying and testing different models, looking for new and enhanced features, and reshaping the data to suit the model are some very fundamental ways to optimize your classifier.

IV. MISTAKES TO AVOID:

The common mistakes made by practitioners in this field that one can avoid are as follows:
- Bad annotation of training and testing datasets
- Poor understanding of algorithms’ assumptions
- Poor understanding of algorithms’ parameters
- Failure to understand objective
- Not understanding the data
- Leakage of features or information
- Not enough data to train the classifier
- Using machine learning where it is not necessary

V. CONCLUSIONS:

Machine learning is a very powerful tool and it has some great applications. Machine learning is very much dependent upon data; thus, it is important to understand that data is quite invaluable and, as simple as it may sound, data analysis is really a tedious task and should be attempted with utmost care.

Machine learning has found tremendous application and has evolved further into deep learning and neural networks, but the core idea is more or less the same for all of them.

This paper delivers a smooth insight into how to implement machine learning to predict future data. There are various ways and techniques available to handle and solve the various problems that arise in different situations. This paper is limited to supervised machine learning only, and tries to explain only the fundamentals of this complex process.

