
2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS)

A SURVEY OF MACHINE LEARNING ALGORITHMS FOR BIG DATA ANALYTICS

Athmaja S., Dept. of Computer Science, Bangalore University, Bengaluru, India
Hanumanthappa M., Dept. of Computer Science, Bangalore University, Bengaluru, India
Vasantha Kavitha, Dept. of Computer Science, Maharani Lakshmi Ammanni College for Women, Bengaluru, India

Abstract: Big data analytics is a fast-growing research area in computer science and in many industries worldwide. It has been applied successfully in sectors as varied as social media, economics, finance, healthcare and agriculture. Several intelligent machine learning techniques have been designed and used to provide predictive analytics solutions for big data. This paper presents a literature survey of these machine learning techniques, together with a study of the machine learning algorithms commonly used for big data analytics.

Keywords: big data, analytics, machine learning algorithms, technique, prediction, model

I. INTRODUCTION

In this data-rich era it is essential to apply sophisticated analytics techniques to huge, diverse big data sets in order to produce useful knowledge and information. Big data analytics is an emerging research area that deals with the collection, storage and analysis of immense data sets to uncover unknown patterns and other key information. It helps us recognize the data that are integral to future business decisions. Big data analytics is widely used in domains such as banking and insurance, healthcare, education, social media and entertainment, bioinformatics, geospatial applications and agriculture. Handling big data with conventional data processing applications is a herculean task. Thus, to discover hidden data patterns, trends and associations, intelligent machine learning methods can be adopted. The objective of this paper is to discuss various machine learning algorithms used by data scientists for analyzing and modeling big data.

II. BIG DATA ANALYTICS

The term big data, which describes extremely large data sets, is widely used by researchers all over the world. Traditional relational databases are not capable of handling big data. Enormous quantities of data arrive from several sources such as sensors, transactional applications, the web and social media. The big data phenomenon can be understood clearly through the different V's associated with it: Volume, Velocity, Variety, Veracity and Value.

• Volume: This denotes the huge amount of data produced every second, ranging from terabytes to zettabytes. These big data sets can be maintained using distributed systems.
• Velocity: This term represents the rate at which data is produced and processed to meet demand.
• Variety: This indicates the diverse range of data that we can use.
• Veracity: This refers to data quality. That is, it indicates the biases, noise, abnormality, etc. in the data.
• Value: This points to the valuable knowledge revealed from the data.

Data scientists use many well-established analytics techniques. Text analytics, predictive analytics, natural language processing and machine learning are a few of the approaches used to uncover hidden insights and to make better and faster decisions on big data sets.

III. MACHINE LEARNING

Machine learning is an interdisciplinary research area which combines ideas from several branches of science, namely artificial intelligence, statistics, information theory, mathematics, etc. The prime focus of machine learning research is on the development of fast and efficient learning algorithms which can make predictions on data. In data analytics, machine learning is an approach used to create models for prediction. Machine learning tasks are mainly grouped into three categories: supervised, unsupervised and reinforcement learning. Supervised machine learning requires training with labeled data. Each labeled training example consists of an input value and a desired target output value. The supervised learning algorithm analyzes the training data and produces an inferred function, which may be used for mapping new values. In unsupervised machine learning, hidden insights are drawn from unlabelled data sets, for example through cluster analysis. The third category, reinforcement learning, allows a machine to learn its behavior from the feedback received through interactions with an external environment [3]. From a data processing point of view, both supervised and unsupervised learning techniques are preferred for data analysis, and reinforcement techniques are preferred for decision making problems [7].
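
To make the contrast between supervised and unsupervised learning concrete, the following is a minimal sketch in plain Python. It is not taken from any of the surveyed papers. The toy data, the choice of a 1-nearest-neighbour classifier as the "inferred function", the use of k-means for cluster analysis, and the names fit_nearest_neighbour and k_means are all illustrative assumptions. Reinforcement learning is not shown.

```python
from math import dist

# Supervised learning: each training example pairs an input with a target label.
# (Toy data, assumed purely for illustration.)
labeled = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
           ((8.0, 8.0), "high"), ((7.8, 8.3), "high")]


def fit_nearest_neighbour(examples):
    """Return an inferred function that maps a new input to a label."""
    def predict(x):
        # Copy the label of the closest training input (1-nearest neighbour).
        _, label = min(examples, key=lambda ex: dist(ex[0], x))
        return label
    return predict


def k_means(points, centroids, iterations=10):
    """Unsupervised learning: group unlabelled points around the given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Supervised: learn from labels, then map new inputs to labels.
predict = fit_nearest_neighbour(labeled)

# Unsupervised: the same kind of points, but with the labels thrown away.
unlabelled = [x for x, _ in labeled] + [(1.1, 1.1), (8.1, 7.9)]
centroids, clusters = k_means(unlabelled, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Here predict plays the role of the inferred function learned from labeled examples, while k_means receives no labels at all and recovers the two groups from the geometry of the points alone. Both routines keep the whole data set in memory, so they illustrate the concepts rather than a big data implementation.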

Most of the traditional machine learning algorithms were implemented for data sets which could fit completely into memory [15]. As the data keeps getting bigger day by day, many intelligent learning methods are being implemented to provide solutions to several big data predictive analytics problems. A study of several commonly used machine learning techniques for big data analytics is provided in the following section.

IV. LITERATURE SURVEY

J. Qiu et al. presented different machine learning algorithms for big data processing [7]. The first one is representation learning or feature learning, which deals with learning data representations that make the data analysis process easier. The performance of machine learning algorithms is strongly influenced by the selection of data representation (or features) [16]. This learning scheme plays a crucial role in dimensionality reduction tasks. The important steps under representation learning are feature selection, feature extraction and distance metric learning [14]. Feature selection (variable selection) techniques are used to find those features of the data which are most relevant for use in model construction. Feature extraction techniques transform high dimensional data into a low dimensional space. In distance metric learning, a distance function is constructed to calculate the distance between the points of a data set.

The authors also discussed another popular learning technique, deep learning. Most older machine learning approaches follow a shallow-structured learning architecture containing a single layer of nonlinear feature transformations. Examples of such learning techniques are Gaussian mixture models (GMMs), hidden Markov models (HMMs), support vector machines (SVMs), logistic regression, kernel regression, etc. [9]. In contrast to the shallow-structured learning architecture, deep learning techniques make use of supervised and unsupervised strategies in deep architectures. Learning systems with a deep architecture are composed of several levels of nonlinear processing stages, in which each lower layer's output is given as the input of the immediately higher layer. Examples are deep neural networks, convolutional neural networks, deep belief networks and recurrent neural networks. Because of their high performance, deep learning algorithms are well suited for big data analytics applications.

Scalability is a challenging issue with traditional machine learning algorithms. The traditional schemes cannot process huge data chunks within a stipulated time, as they require all the data to be in the same database. A new field of machine learning called distributed learning has evolved to solve this problem. In this scheme, learning is carried out on data sets distributed among several workstations to scale up the learning process [4]. Examples of distributed machine learning algorithms are decision rules, stacked generalization, meta-learning and distributed boosting. Parallel machine learning is another popular learning scheme, in which the learning process is executed in multi-processor environments or on multi-threaded machines [1].

Transfer learning is another machine learning approach mentioned in their paper. In the conventional machine learning process, both the training data and the test data are usually taken from the same domain. That is, the input feature space and data distribution are identical [8]. But there are certain scenarios in which getting training and test data from the same domain is a difficult and expensive task. The transfer learning technique is used to solve this issue. In this scheme a high performance learner is created for a target domain by training on a related source domain. Transfer learning techniques are widely used in many real-world data processing applications.

The authors discussed another learning scheme called active learning. In some cases the data is available without labels, which becomes a challenge. Manually labeling a large data collection is an expensive and strenuous task. Also, learning from unlabelled data is very difficult. Active learning addresses this issue by selecting a subset of the most important instances for labeling [17]. Another scheme, kernel-based learning, has been widely used in many engineering applications to design efficient, powerful and high performance nonlinear algorithms [5]. Some of the algorithms capable of operating with kernels are support vector machines (SVM), principal component analysis (PCA), kernel perceptron, etc.

J. L. Berral-Garcia presented a paper describing the frequently used machine learning algorithms for big data analytics [6]. Several algorithms are used for performing modeling, prediction and clustering tasks. Decision tree algorithms (like CART, Recursive Partition Trees or M5), K-Nearest Neighbors algorithms, Bayesian algorithms (using Bayes' theorem), support vector machines (SVM), artificial neural networks, K-means, DBSCAN, etc. are presented in this paper. Several execution frameworks, namely MapReduce frameworks (Apache Hadoop and Spark), Google's TensorFlow and Microsoft's Azure-ML, were also mentioned. Implementations of the previously discussed algorithms are made available to the public through different tools, platforms and libraries such as CRAN for R, scikit-learn for Python, Weka, MOA, Elasticsearch, Kibana, etc.

M. U. Bokhari et al. presented a three-layered architecture model for storing and analyzing big data [11]. The three layers are the data gathering layer, the data storing layer and the data analysis and report generation layer. In order to gather and handle the huge volume of big data coming from high speed sources such as sensors or social media, a cluster of high speed nodes or servers is kept in the data gathering layer. The data storage layer is responsible for storing the big data. The Hadoop
Distributed File System (HDFS) can be used for data storage. In the data analysis layer, machine learning techniques such as ANN, Naive Bayes, SVM and Principal Component Analysis are used to extract knowledge from the huge, complex data chunks.

P. Y. Wu et al. provided case studies to show how big data analytics is useful in precision medicine to provide the most appropriate treatment to each patient [12]. Principal Component Analysis, Singular Value Decomposition and tensor-based approaches are useful for feature extraction, and filter-based and wrapper-based methods are helpful for feature selection. All these are dimensionality reduction techniques. The authors compared different techniques for performing data mining tasks. Logistic regression, Cox regression and local regression techniques are simple to interpret, but are sensitive to outliers. Logistic regression with LASSO regularization reduces the feature space, but overfitting is a problem. Other models such as hidden Markov models, conditional random fields, relational subgroup discovery, episode rule mining, etc. are also useful for performing data mining tasks. The authors also discussed useful platforms for big data analytics. Apache Hadoop, IBM InfoSphere Platform, Apache Spark Streaming, Tableau, QlikView, TIBCO Spotfire and other visual analytics tools are highly impactful platforms for providing big data analytics solutions. Two real-world case studies, namely integrative -omic data for the improved understanding of cancer mechanisms, and the incorporation of genomic knowledge into the EHR system for improved patient diagnosis and care, were presented to discuss the usefulness of biomedical big data analytics for precision medicine. Multi-omic TCGA [13] data and EHR data [2] were used to conduct this study.

M. R. Bendre et al. [10] conducted research on the usage of big data in precision agriculture. The authors mentioned that big data provides a broad range of functions to uncover new insights to address several farming problems. The designed model uses the MapReduce technique for big data processing and the linear regression method for data prediction. The data collected from the Krishi Vidyapeeth Rahuri (KVR) station, Ahmednagar, India, are used to test the model. The results forecasted using this model are very useful for effective decision making in the agriculture domain.

The following table summarizes the literature survey presented in this paper.

TABLE I. LITERATURE SURVEY SUMMARY

Sl. No. | Author(s) name (year) | Algorithms / Techniques | Summary
1. | J. L. Berral-Garcia (2016) | Decision tree algorithms, K-Nearest Neighbor algorithms, Bayesian algorithms, SVM, ANN, K-means, DBSCAN | A survey was done on the various machine learning algorithms for classification, prediction and modeling.
2. | J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng (2016) | Gaussian mixture models, hidden Markov models, SVM, logistic regression, kernel regression, deep neural networks, deep belief networks, PCA, kernel perceptron | A survey was done on the various traditional as well as advanced machine learning algorithms used for big data processing.
3. | M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui (2016) | ANN, SVM, PCA, Naive Bayes | Presented a three-layered architecture model for storing and analyzing big data. Data storage can be done using the Hadoop Distributed File System (HDFS) and data analysis can be done using techniques like ANN, SVM, Naive Bayes and PCA.
4. | P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang (2016) | Logistic regression, PCA, HMM, local regression, Cox regression | Discussed several machine learning algorithms and platforms like Hadoop, IBM InfoSphere, Tableau, QlikView, Spark, etc. for providing big data solutions. Case studies were done using -omic data from TCGA and EHR data to show the usefulness of biomedical big data analytics for precision medicine.
5. | M. R. Bendre, R. C. Thool and V. R. Thool | MapReduce, linear regression | A model was built using MapReduce and linear regression techniques. A case study was carried out to predict the rainfall and temperature values for the year 2013 using the historical weather data collected from KVR, Ahmednagar. The main objective was to improve the accuracy of rainfall forecasting.

V. CONCLUSION

With the advances in big data technology, it has become difficult to handle complex big data using traditional learning algorithms. Therefore several advanced, efficient and intelligent learning algorithms are required to handle huge chunks of heterogeneous data. The results obtained through these analytics techniques provide more effective solutions to many real-world problems in various domains such as healthcare, agriculture, social media, banking, etc. Various research papers were surveyed to gather information about advanced learning techniques. This paper gives an overall idea about the advanced machine learning algorithms
and techniques used to provide solutions to big data analytics problems.

REFERENCES

[1] “Parallel machine learning toolbox”, retrieved from http://www.research.ibm.com/haifa/projects/verification/ml_toolbox/.
[2] C. A. Caligtan and P. C. Dykes, “Electronic health records and personal health records”, Semin Oncol Nurs, vol. 27, pp. 218-228, 2011.
[3] C. M. Bishop, Pattern recognition and machine learning, Springer, New York, 2006.
[4] D. Peteiro-Barral and B. Guijarro-Berdinas, “A survey of methods for distributed machine learning”, Progress in Artificial Intelligence, Springer, vol. 2, issue 1, pp. 1-11, 2013. DOI: 10.1007/s13748-012-0035-5.
[5] G. Ding, Q. Wu, Y. D. Yao, J. Wang and Y. Chen, “Kernel-Based Learning for Statistical Signal Processing in Cognitive Radio Networks: Theoretical Foundations, Example Applications, and Future Directions”, IEEE Signal Processing Magazine, vol. 30, issue 4, pp. 126-136, 2013. DOI: 10.1109/MSP.2013.2251071.
[6] J. L. Berral-Garcia, “A quick view on current techniques and machine learning algorithms for big data analytics”, 18th International Conf. on Transparent Optical Networks, pp. 1-4, 2016. DOI: 10.1109/ICTON.2016.7550517.
[7] J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng, “A survey of machine learning for big data processing”, EURASIP Journal on Advances in Signal Processing, Springer, vol. 2016:67, pp. 1-16, 2016. DOI: 10.1186/s13634-016-0355-x.
[8] K. Weiss, T. Khoshgoftaar and D. Wang, “A survey of transfer learning”, Journal of Big Data, Springer, vol. 3, issue 9, pp. 1-40, 2016. DOI: 10.1186/s40537-016-0043-6.
[9] Li Deng, “A tutorial survey of architectures, algorithms and applications for deep learning”, APSIPA Transactions on Signal and Information Processing, vol. 3, pp. 1-29, 2014. DOI: 10.1017/atsip.2013.9.
[10] M. R. Bendre, R. C. Thool and V. R. Thool, “Big data in precision agriculture: Weather forecasting for future farming”, 1st International Conf. on Next Generation Computing Technologies, pp. 744-750, 2015. DOI: 10.1109/NGCT.2015.7375220.
[11] M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui, “An effective model for big data analytics”, 3rd International Conference on Computing for Sustainable Global Development, pp. 3980-3982, 2016.
[12] P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang, “–Omic and Electronic Health Record Big Data Analytics for Precision Medicine”, IEEE Transactions on Biomedical Engineering, vol. 64, issue 2, pp. 263-273, 2017. DOI: 10.1109/TBME.2016.2573285.
[13] T. C. G. Atlas. Available: http://cancergenome.nih.gov/.
[14] W. Tu and S. Sun, “Cross-domain representation-learning framework with combination of class-separate and domain-merge objectives”, Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining, ACM, pp. 18-25, 2012. DOI: 10.1145/2351333.2351336.
[15] X. W. Chen and X. Lin, “Big Data Deep Learning: Challenges and Perspectives”, IEEE Access, vol. 2, pp. 514-525, 2014. DOI: 10.1109/ACCESS.2014.2325029.
[16] Y. Bengio, A. Courville and P. Vincent, “Representation Learning: A Review and New Perspectives”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, issue 8, pp. 1798-1828, 2013. DOI: 10.1109/TPAMI.2013.50.
[17] Y. Fu, B. Li, X. Zhu and C. Zhang, “Active Learning without Knowing Individual Instance Labels: A Pairwise Label Homogeneity Query Approach”, IEEE Transactions on Knowledge and Data Engineering, vol. 26, issue 4, pp. 808-822, 2014. DOI: 10.1109/TKDE.2013.165.
