Himanshu Sharma - Detection of Financial Statement Fraud Using Decision Tree Classifiers - 2013


Detection of Financial Statement Fraud Using Decision Tree Classifiers

Submitted by: Himanshu Sharma, 3rd Year Undergraduate, Dept. of Mathematics, IIT Delhi
Under the guidance of: Dr. V. Ravi, Associate Professor, IDRBT, Hyderabad
Acknowledgment

I wish to express my profound gratitude to my Guide Dr. V. Ravi for giving me an

opportunity to do this project in the Institute for Development and Research in

Banking Technology. The supervision and support that he gave truly helped the

progression and smoothness of the internship program.

This project would have been nothing without the enthusiasm and imagination

from our teaching assistants at IDRBT Mr Datta, Mr Pradeep and Mr Mayank

Pandey.

Last but not least, I would like to thank my colleagues at IDRBT for nurturing my

confidence.
Certificate

This is to certify that Mr Himanshu Sharma, pursuing the Integrated M.Tech degree

at the Indian Institute of Technology, Delhi in Mathematics and Computing, undertook

a project as an intern at IDRBT, Hyderabad from May 14, 2013 to

July 16, 2013. He completed the project on “Detection of financial statement

fraud using decision tree classifiers” under my guidance.

I wish him all the best for all his future endeavours.

(Dr. V. Ravi)
Project Guide
Assistant Professor
IDRBT, Hyderabad
Contents

Abstract

Introduction

Literature Review

Methodology

Results and Discussion

References
1. Abstract

The project explores the effectiveness of data mining techniques in detecting firms that issue
fraudulent financial statements and deals with the identification of the associated factors. To
this end, a number of experiments were conducted using decision tree classifiers, which
were trained using 10-fold cross-validation on a dataset of Chinese firms. This study indicates
that decision trees can be successfully used in the identification of fraudulent financial
statements and underlines the importance of financial ratios. The goal is to create a model
that predicts the value of a target variable based on several input variables. Each branch
node of the tree represents a choice between a number of alternatives, and each leaf node
represents a decision (or class). Decision trees are commonly used for extracting information
for the purpose of decision making. A decision tree starts with a root node; from this node the
algorithm splits each node recursively until a stopping criterion is met. The final result is a
tree in which each branch represents a possible decision scenario and its outcome.
Another classification algorithm, firefly minor, is also applied to the dataset.
2. Introduction

Financial fraud is a serious problem worldwide and more so in fast growing countries like
China. Traditionally, auditors are responsible for detecting financial statement fraud. With
the appearance of an increasing number of companies that resort to those unfair practices,
auditors have become overburdened with the task of detection of fraud. Hence, various
techniques of data mining are being used to reduce the workload of the auditors.
Financial Statement: Financial statements are records that outline the financial activities of a
business, an individual or any other entity. Financial statements are meant to present the
financial information of the entity in question as clearly and concisely as possible for both the
entity and for readers. It is a standard practice for businesses to present financial statements
that adhere to generally accepted accounting principles (GAAP), to maintain continuity of
information and presentation across international borders. As well, financial statements are
often audited by government agencies, accountants, firms, etc. to ensure accuracy and for
tax, financing or investing purposes. Financial statements are integral to ensuring accurate
and honest accounting for businesses and individuals alike.
Financial statements for businesses usually include: income statements, balance sheets,
statements of retained earnings and cash flows, as well as other possible statements.
 Income Statements: A financial statement that measures a company's financial performance
over a specific accounting period. Financial performance is assessed by giving a summary of
how the business incurs its revenues and expenses through both operating and non-
operating activities. It also shows the net profit or loss incurred over a specific accounting
period, typically over a fiscal quarter or year. Also known as the "profit and loss statement" or
"statement of revenue and expense".
 Balance Sheet: A financial statement that summarizes a company’s assets, liabilities and
shareholders’ equity at a particular point in time. These three balance sheet segments give
investors an idea as to what the company owns and owes, as well as the amount invested by
the shareholders. The balance sheet must follow the following formula:
Assets = Liabilities + shareholders’ equity
It's called a balance sheet because the two sides balance out. A company has to pay for all
the things it has (assets) by either borrowing money (liabilities) or getting it from
shareholders (shareholders' equity).
 Statement of retained earnings: The statement of retained earnings explains changes in a company’s
retained earnings over the reporting period. It breaks down changes affecting the account,
such as profits and losses from operations, dividends paid, and any other items charged or
credited to retained earnings.
 Cash Flow Statements: An item on the cash flow statement that reports the aggregate
change in a company's cash position resulting from any gains (or losses) from investments in
the financial markets and operating subsidiaries, and changes resulting from amounts spent
on investments in capital assets.
Next, we discuss some key financial ratios obtained from the financial statements
that help in detecting financial statement fraud.
Financial Ratios: Financial ratios are a valuable and easy way to interpret the numbers found
in financial statements. They can help to answer critical questions such as whether the
business is carrying excess debt or inventory, whether the customers are paying according to
terms, whether the operating expenses are too high, and whether the company assets are
being used properly to generate income. Following are some of the financial ratios of utmost
importance:
 Liquidity: Liquidity measures a company’s capacity to pay its liabilities in the short term. There
are two ratios for evaluating liquidity. They are:
1. Current Ratio: Total current assets/Total current liabilities
2. Quick Ratio: (Cash + accounts receivable + Other quick assets)/Current liabilities
The higher the ratios the stronger is the company’s ability to pay its liabilities as they become
due, and lower is the risk of default.
 Safety: Safety indicates a company’s vulnerability to the risk of debt. There are three ratios for
evaluating safety. They are:
1. Debt to equity: Total liabilities/Net worth
2. EBIT/Interest: Earnings before interest and taxes/Interest charges
3. Cash-flow to current maturity of long-term debt: (Net profit + Non-cash
expenses)/Current portion of long-term debt
 Profitability: Profitability ratios measure the company’s ability to generate a return on its
resources. There are four ratios to evaluate a company’s profitability. They include:
1. Gross profit margin: Gross profit/Total sales
2. Net profit margin: Net profit/ Total sales
3. Return on assets: Net profit before taxes/Total assets
4. Return on equity: Net profit before taxes/Net worth
 Efficiency: Efficiency ratios evaluate how the company manages its assets. There are four
ratios to evaluate the efficiency of asset management:
1. Accounts receivable turnover = Total net sales/ Accounts receivable
2. Accounts payable turnover = Cost of goods sold/Accounts payable
3. Inventory turnover = Cost of goods sold/Inventory
4. Sales to total assets = Total sales/Total assets
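The ratio definitions above can be written as small functions, as a concrete illustration. The figures used below are invented for demonstration and are not drawn from the study's dataset:

```python
# Each ratio defined above as a small function; the figures in `example`
# are hypothetical and serve only to illustrate the formulas.
def current_ratio(current_assets, current_liabilities):
    return current_assets / current_liabilities

def quick_ratio(cash, receivables, other_quick_assets, current_liabilities):
    return (cash + receivables + other_quick_assets) / current_liabilities

def debt_to_equity(total_liabilities, net_worth):
    return total_liabilities / net_worth

def net_profit_margin(net_profit, total_sales):
    return net_profit / total_sales

def inventory_turnover(cost_of_goods_sold, inventory):
    return cost_of_goods_sold / inventory

# A hypothetical firm (all figures in the same currency unit):
example = {
    "current": current_ratio(500.0, 250.0),                  # 2.0
    "quick": quick_ratio(100.0, 150.0, 50.0, 250.0),         # 1.2
    "debt_to_equity": debt_to_equity(300.0, 600.0),          # 0.5
    "net_margin": net_profit_margin(80.0, 1000.0),           # 0.08
    "inventory_turnover": inventory_turnover(400.0, 100.0),  # 4.0
}
```

A current ratio of 2.0, as here, indicates current assets twice the size of current liabilities, i.e., a relatively low risk of default in the short term.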
Financial statement fraud may be perpetrated to increase stock prices or get loans from
banks. It may be done to distribute lesser dividends to shareholders. Another probable
reason may be to avoid payment of taxes. Nowadays an increasing number of companies are
making use of fraudulent financial statements in order to cover up their true financial status
and make selfish gains at the expense of stockholders. The fraud triangle describes the
probability of financial reporting fraud which depends on three factors: incentives/pressures,
opportunities, and attitudes/rationalization. The fraud triangle is also known as Cressey’s
triangle, or Cressey’s Fraud Triangle. It is depicted in Fig.1 and discussed below.

Figure 1: Components of Cressey’s Triangle

The Opportunity to commit fraud is possible when employees have access to assets and
information that allows them to both commit and conceal fraud. Employees are given access
to records and valuables in the ordinary course of their jobs. Unfortunately, that access
allows people to commit fraud. Over the years, managers have become responsible for a
wider range of employees and functions. This has led to more access for them, as well as
more control over functional areas of companies. Access must be limited to only those
systems, information, and assets that are truly necessary for an employee to complete his or
her job. Opportunity is created by weak internal controls, poor management oversight,
and/or through use of one’s position and authority. Failure to establish adequate procedures
to detect fraudulent activity also increases the opportunities for fraud to occur.
Pressure, also referred to as incentive, is another aspect of the fraud triangle. It might be a
real financial or other type of need, such as high medical bills or debts. Or it could be a
perceived financial need, such as a person who has a desire for material goods but not the
means to get them. Motivators can also be non-financial. There may be high pressure for
good results at work or a need to cover up someone’s poor performance. Addictions such as
gambling and drugs may also motivate someone to commit fraud. Lastly, employees may
rationalize this behaviour by determining that committing fraud is right for a variety of
reasons. Rationalization is a crucial component in most frauds. Rationalization involves a
person reconciling his/her behaviour with the commonly accepted notions of decency and
trust. For those who are generally dishonest, it is probably easier to rationalize a fraud. For
those with higher moral standards, it is probably not so easy. They have to convince
themselves that fraud is acceptable by making excuses for their behaviour. Common rationalizations
include making up for being underpaid or replacing a bonus that was deserved but not
received. A thief may convince himself that he is just borrowing money from the company
and will pay it back one day. Some embezzlers tell themselves that the company doesn’t
need the money or won’t miss the assets. Others believe that the company deserves to have
money stolen because of bad acts against employees.
Data Mining: Data Mining is an iterative process within which progress is defined by
discovery, either through automatic or manual methods. Data mining is most useful in an
exploratory analysis scenario in which there are no predetermined notions about what will
constitute an interesting outcome. The application of Data Mining techniques for financial
classification is a fertile research area. However, as opposed to other well-examined fields
like bankruptcy prediction or financial distress, research on the application of Data Mining
techniques for the purpose of financial fraud detection has been rather minimal.
In this study, we carry out an in-depth examination of data from the financial statements of
various Chinese firms in order to detect fraudulent financial statement by using Data Mining
classification methods. The goal of this research is to identify the financial factors to be used
by auditors in assessing the likelihood of fraudulent financial statements. One main objective
is to introduce, apply, and evaluate the use of Data Mining methods in differentiating
between fraud and non-fraud observations.
In this study, different decision trees are tested for their applicability in management of fraud
detection. All methods have the theoretical advantage that they do not impose arbitrary
assumptions on the input variables. The different decision tree classifiers are compared in
terms of their predictive accuracy. The input data consists mainly of financial ratios derived
from financial statements, i.e., balance sheets and income statements. The sample contains
data from 202 Chinese firms, 101 of which are fraudulent and the other 101 non-fraudulent.
Relationships between input variables and classification outcomes are captured by the
models and revealed.
The paper proceeds as follows: Section 3 reviews relevant prior research. Section 4 provides
an insight into the research methodology used. Section 5 describes the developed models
and analyses the results. Finally, Section 6 lists the references.

3. Literature Review

As Watts and Zimmerman [1] argue, the financial statement audit is a monitoring mechanism
that helps reduce information asymmetry and protect the interests of the principals,
specifically, stockholders and potential stockholders, by providing reasonable assurance that
management’s financial statements are free from material misstatements. However, in real
life, detecting management fraud is a difficult task when using normal audit procedures [2]
since there is a shortage of knowledge concerning the characteristics of management fraud.
Additionally, given its infrequency, most auditors lack the experience necessary to detect it.
Sometimes, managers may deliberately try to deceive auditors [3].
Green and Choi [4] developed a Neural Network fraud classification model. The model used
five ratios and three accounts as input. The results showed that Neural Networks have
significant capabilities when used as a fraud detection tool. Nieschwietz et al. [5] provide a
comprehensive review of empirical studies related to external auditors’ detection of
fraudulent financial reporting while Albrecht et al. [6] review the fraud detection aspects of
current auditing standards and the empirical research conducted on fraud detection.
Bell and Carcello [7] developed and tested a logistic regression to estimate the likelihood of
fraudulent financial reporting using a sample of 77 fraud and 305 non-fraud engagements,
based on the incidence of red flags as explanatory variables. They found that the significant
red flags that effectively discriminated between fraud and non-fraud engagements were:
management lied to the auditor; a weak internal control environment; an unduly aggressive
management attitude; undue management emphasis on meeting earning projections; and
significant difficult-to-audit transactions.
Beasley et al. (2000) compare the company governance mechanisms of known fraud cases
with “no-fraud” industry benchmarks; they found that companies who exhibited fraud had
fewer audit committees, fewer independent audit committees, fewer audit committee
meetings, less frequent internal audit support and fewer independent board members.
Church et al. (2001) provide further evidence that internal auditors are sensitive to factors
that affect the possibility of fraudulent financial reporting. Specifically, they show that in a
situation where operating income is greater than expected, an earnings-based bonus plan is
used, and debt covenants are restrictive, internal auditors assigned a higher likelihood of
fraud. Ansah et al. [8] investigate the relative influence of the size of audit firms, auditor’s
position tenure and auditor’s years of experience in auditing on the likelihood of detecting
fraud in the stock and warehouse cycle. They conclude that such factors are statistically
significant predictors of the likelihood of detecting fraud, and increase the likelihood of fraud
detection.
Bell and Carcello (2000) developed and tested a logistic regression model that estimates the
likelihood of fraudulent financial reporting for an audit client, conditioned on the presence or
absence of several fraud-risk factors. The significant risk factors included in the final model
were: weak internal control environment, rapid company growth, inadequate or inconsistent
relative profitability, management that places undue emphasis on meeting earnings
projections, management that lies to the auditors or is overly evasive, ownership status
(public vs. private) on the entity, and interaction term between a weak control environment
and an aggressive management attitude towards financial reporting. Spathis (2002)
constructed a model to detect falsified financial statements. He employed the statistical
method of logistic regression. Two alternative input vectors containing financial ratios were
used. The reported accuracy rate exceeded 84%. The results suggest that there is potential in
detecting fraudulent financial statement through the analysis of published financial
statements. Spathis et al., (2002) used the UTADIS method to develop a falsified financial
statement detection model. The method operates on the basis of a non-parametric
regression-based framework. They also used the discriminant analysis and logit regression
methods as benchmarks. Their results indicate that the UTADIS method performs better than
the other statistical methods as regards the training and validation sample. The results also
showed that the ratios “total debt / total assets” and “inventory / sales” are explanatory
factors associated with fraudulent financial statement.
Aamodt and Plaza [9] and Kotsiantis et al. [10] used case-based reasoning to identify the
fraudulent companies. Further, Deshmukh and Talluru [11] demonstrated the construction of
a rule-based fuzzy reasoning system to assess the risk of management fraud and proposed an
early warning system by finding out 15 rules related to the probability of management fraud.
Pacheco et al. [12] developed a hybrid intelligent system consisting of NN and a fuzzy expert
system to diagnose financial problems. Further, Magnusson et al. [13] used text mining and
demonstrated that the language of quarterly reports provided an indication of the changes in
the company’s financial status. A rule-based system that consisted of too many if-then
statements made it difficult for marketing researchers to understand key drivers of consumer
behaviour [14]. Variable selection was used in order to choose a subset of the original
predictive variables by eliminating variables that were either redundant or possessed little
predictive information.

4. Methodology

The dataset used in this research was obtained from 202 companies that were listed in
various Chinese stock exchanges, of which 101 were fraudulent and 101 were non-fraudulent
companies. The data contains 35 financial items for each of these companies. Table 1 [15] lists
all these financial items. Of these, 28 are financial ratios that reflect the liquidity, safety,
profitability and efficiency of the companies. During the data pre-processing stage, each
independent variable of the original 10-fold cross-validation dataset was normalized to
improve the reliability of the results. We then analysed the dataset using the following methods:

 Decision Trees
1. Random forest
2. Naïve-Bayesian Tree
3. C-4.5
4. RIPPER
5. CART
6. TreeNet
7. Quantile Regression Tree
 Firefly Minor Algorithm
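The pre-processing and evaluation procedure described above, per-variable normalization followed by 10-fold cross-validation, can be sketched as follows. The data here is synthetic, since the 202-firm dataset is not reproduced in this report, and scikit-learn is used only as an illustrative stand-in for the actual tools:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 202 firms x 35 financial items, 101 fraud / 101 non-fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(202, 35))
y = np.array([0] * 101 + [1] * 101)

# Normalize each independent variable, then run 10-fold cross-validation
# of a decision tree on the normalized data.
pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
scores = cross_val_score(
    pipe, X, y, cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
)
mean_accuracy = scores.mean()
```

Putting the scaler inside the pipeline ensures it is re-fitted on each training fold, so no information from the held-out fold leaks into the normalization.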
Table 1: Items from financial statements of companies that are used for detection of financial
statement fraud
Number Financial Item
1 Debt
2 Total assets
3 Gross Profit
4 Net Profit
5 Primary business income
6 Cash and deposits
7 Accounts receivable
8 Inventory/Primary business income
9 Inventory/Total assets
10 Gross profit/Total assets
11 Net profit/Total assets
12 Current assets/Total assets
13 Net profit/Primary business income
14 Accounts receivable/Primary business income
15 Primary business income/Total assets
16 Current assets/Current liabilities
17 Primary business income/Fixed assets
18 Cash/Total assets
19 Inventory/Current liabilities
20 Total debt/Total equity
21 Long term debt/Total assets
22 Net profit/Gross profit
23 Total debt/Total assets
24 Total assets/Capital and reserves
25 Long term debt/Total capital and reserves
26 Fixed assets/Total assets
27 Deposits and cash/Current assets
28 Capitals and reserves/total debts
29 Accounts receivable/total assets
30 Gross profit/Primary business profit
31 Undistributed profit/Net profit
32 Primary business profit/Primary business profit of last year
33 Primary business income/Primary business income of last year
34 Account receivable/Accounts receivable of last year
35 Total assets/Total assets of last year
We chose Decision Tree Classifiers because of the following advantages that they have over
other classifiers like MLFF, SVM etc.
 Simple to understand and interpret. People are able to understand decision tree models after
a brief explanation
 Requires little data preparation. Other techniques often require data normalisation, creation
of dummy variables and removal of blank values
 Able to handle both numerical and categorical data. Other techniques are usually specialised
in analysing datasets that have only one type of variable. (For example, relation rules can be
used only with nominal variables while neural networks can be used only with numerical
variables)
 Uses a white box model. If a given situation is observable in a model the explanation for the
condition is easily explained by Boolean logic. (An example of a black box model is an artificial
neural network since the explanation for the results is difficult to understand)
 Possible to validate a model using statistical tests. That makes it possible to account for the
reliability of the model
 Performs well even if its assumptions are somewhat violated by the true model from which
the data were generated
 Performs well with large datasets. Large amounts of data can be analysed using standard
computing resources in reasonable time
The block diagram in figure 2 represents the data flow of the different classifiers used for
classification of the dataset using 10-fold cross-validation. The decision tree models generate
‘if-then’ rules, which are easily understandable. The meta-heuristic algorithm firefly minor,
which is inspired by the flashing behaviour of fireflies, also generates such ‘if-then’ rule
files. In this way, we ensured that the problem at hand is analysed by disparate models that
exhibit varying degrees of performance on different data mining problems.

Figure 2: Architecture of different classifiers


Table 2: Top 18 items selected from financial statements of companies by t-statistics based
feature selection.
Number Financial Item
1 Net profit
2 Gross profit
3 Primary business income
4 Primary business income/Total assets
5 Gross profit/Total assets
6 Net profit/Total assets
7 Inventory/Total assets
8 Inventory/Current liabilities
9 Net profit/Primary business income
10 Primary business income/Fixed assets
11 Primary business profit/Primary business profit of last year
12 Primary business income/Last year’s primary business income
13 Fixed assets/Total assets
14 Current assets/Current liabilities
15 Capitals and reserves/Total debt
16 Long term debt/Total capital and reserves
17 Cash and deposits
18 Inventory/Primary business income

Figure 3: Architecture of different classifiers after feature selection

It is also observed that some of the independent variables turned out to be much more
important for prediction, whereas some contributed negatively to the classification
accuracies of the different classifiers. So, a simple statistical technique based on the
t-statistic was used to perform feature selection on the dataset, identifying the financial
items most capable of detecting the presence of financial statement fraud. Features with
larger absolute t-statistic values are considered more significant.
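A t-statistic-based feature ranking of this kind can be sketched as follows. The data is hypothetical and the study's exact selection procedure may differ in detail:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 101 fraudulent and 101 non-fraudulent firms, 35 items.
rng = np.random.default_rng(1)
X_fraud = rng.normal(loc=0.5, size=(101, 35))
X_clean = rng.normal(loc=0.0, size=(101, 35))

# Two-sample t-statistic per financial item; a larger |t| means the item
# separates the fraud and non-fraud groups more strongly.
t_stat, _ = stats.ttest_ind(X_fraud, X_clean, axis=0)
ranking = np.argsort(-np.abs(t_stat))

top18 = ranking[:18]  # fed to the classifiers first
top10 = ranking[:10]  # used for the further detailed analysis
```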
Using feature selection, the top 18 feature set was first fed as input into the different
classifiers. To conduct a further detailed analysis, the top 10 feature set was also
fed as input to the different classifiers. Table 2 lists the features in decreasing order of
t-statistic value, so the top 18 and top 10 features are selected in sequence from the table.
The block diagram for all these combinations is shown in figure 3. Brief details of the
different classifiers used in this research are provided below:
 Classification & Regression Tree (CART): CART is a non-parametric decision tree
learning technique that produces either classification or regression trees, depending on
whether the dependent variable is categorical or numeric, respectively. Decision trees are
formed by a collection of rules based on variables in the modelling data set:
a) Rules based on variables' values are selected to get the best split to differentiate
observations based on the dependent variable
b) Once a rule is selected and splits a node into two, the same process is applied to each
"child" node (i.e. it is a recursive procedure)
c) Splitting stops when CART detects no further gain can be made, or some pre-set
stopping rules are met. (Alternatively, the data are split as much as possible and then
the tree is later pruned)
Each branch of the tree ends in a terminal node. Each observation falls into one and exactly
one terminal node, and each terminal node is uniquely defined by a set of rules.
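A minimal sketch of this procedure, using scikit-learn's DecisionTreeClassifier (which implements a CART-style algorithm) on toy data; this is an illustration, not the exact configuration used in the study:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy example: two numeric features, two classes. The tree learns threshold
# splits, partitioning the space into hyper-rectangular terminal nodes.
X = [[0.1, 2.0], [0.2, 1.8], [0.9, 0.3], [1.0, 0.1]]
y = [0, 0, 1, 1]

cart = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Each new observation falls into exactly one terminal node and receives
# that node's class: here class 0, then class 1.
pred = cart.predict([[0.15, 1.9], [0.95, 0.2]])
```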

 C-4.5: C4.5 decision tree algorithm was developed by Ross Quinlan [16]. C4.5 is an extension
of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for
classification, and for this reason, C4.5 is often referred to as a statistical classifier. At each
node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of
samples into subsets enriched in one class or the other. The splitting criterion is the
normalized information gain (difference in entropy). The attribute with the highest
normalized information gain is chosen to make the decision. The C4.5 algorithm then
recurses on the smaller sub lists. This algorithm has a few base cases:
 When all the samples in the list belong to the same class, it simply creates a leaf node
for the decision tree saying to choose that class
 When none of the features provide any information gain, C4.5 creates a decision
node higher up the tree using the expected value of the class
 When an instance of a previously-unseen class is encountered, C4.5 creates a decision
node higher up the tree using the expected value
In pseudo code, the general algorithm for building decision trees is [17]:
a) Check for base cases
b) For each attribute a, find the normalized information gain from splitting on a
c) Let a_best be the attribute with the highest normalized information gain
d) Create a decision node that splits on a_best
e) Recurse on the sub-lists obtained by splitting on a_best, and add those nodes as
children of node
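The splitting criterion named above, information gain normalized by the split information (the gain ratio), can be computed as follows for a single categorical attribute; the data is a toy example for illustration only:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a label list.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    # Information gain: entropy before the split minus the weighted
    # entropy of each attribute value's subset.
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(
        len(s) / n * entropy(s) for s in subsets.values()
    )
    # Split information normalizes the gain, penalizing many-valued splits.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Toy example: the attribute perfectly predicts the class, so the
# normalized information gain is maximal (1.0).
vals = ["a", "a", "b", "b"]
ys = [0, 0, 1, 1]
```

C4.5 would evaluate `gain_ratio` for every candidate attribute and split on `a_best`, the attribute with the highest value.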

 Random Forest: Random forests are an ensemble learning method for classification
(and regression) that operate by constructing a multitude of decision trees at training time
and outputting the class that is the mode of the classes output by individual trees. The
algorithm for inducing a random forest was developed by Leo Breiman [18] and Adele Cutler,
and "Random Forests" is their trademark. The term came from random decision forests that
was first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's
"bagging" idea and the random selection of features, introduced independently by
Ho[19][20] and Amit and Geman [21] in order to construct a collection of decision trees with
controlled variation. The selection of a random subset of features is an example of
the random subspace method, which, in Ho's formulation, is a way to implement stochastic
discrimination [22] proposed by Eugene Kleinberg. It is better to think of random forests as a
framework rather than as a particular model. The framework consists of several
interchangeable parts which can be mixed and matched to create a large number of
particular models, all built around the same central theme. Constructing a model in this
framework requires making several choices:
 The shape of the decision to use in each node
 The type of predictor to use in each leaf
 The splitting objective to optimize in each node
 The method for injecting randomness into the trees.
The simplest type of decision to make at each node is to apply a threshold to a single
dimension of the input. This is a very common choice and leads to trees that partition the
space into hyper-rectangular regions. However, other decision shapes, such as splitting a
node using linear or quadratic decisions are also possible.
Leaf predictors determine how a prediction is made for a point, given that it falls in a
particular cell of the space partition defined by the tree. Simple and common choices here
include using a histogram for categorical data, or constant predictors for real valued outputs.
In principle there is no restriction on the type of predictor that can be used, for example one
could fit a Support Vector Machine or a spline in each leaf; however, in practice this is
uncommon. If the tree is large then each leaf may have very few points making it difficult to
fit complex models; also, the tree growing procedure itself may be complicated if it is difficult
to compute the splitting objective based on a complex leaf model. The splitting objective is a
function which is used to rank candidate splits of a leaf as the tree is being grown. This is
commonly based on an impurity measure, such as the information or the Gini gain.
The method for injecting randomness into each tree is the component of the random forests
framework which affords the most freedom to model designers. Breiman's original algorithm
achieves this in two ways:
a) Each tree is trained on a bootstrapped sample of the original data set.
b) Each time a leaf is split, only a randomly chosen subset of the dimensions are
considered for splitting.
For prediction a new sample is pushed down the tree. It is assigned the label of the training
sample in the terminal node it ends up in. This procedure is iterated over all trees in the
ensemble, and the mode vote of all trees is reported as the random forest prediction.
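The prediction procedure above, pushing a sample down every tree and reporting the mode vote, can be sketched with scikit-learn's RandomForestClassifier on toy data; again an illustration, not the study's configuration:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: the class is fully determined by the first feature.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 0, 1, 1] * 5

# 25 trees, each trained on a bootstrap sample of the data with a random
# subset of features considered at every split (Breiman's two sources of
# randomness); the forest prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = forest.predict([[0, 0], [1, 1]])
```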
 RIPPER: RIPPER stands for Repeated Incremental Pruning to Produce Error Reduction. This
classification algorithm was proposed by William W Cohen. It is based on association rules
with reduced error pruning (REP), a very common and effective technique found in decision
tree algorithms. In REP for rules algorithms, the training data is split into a growing set and a
pruning set. First, an initial rule set is formed from the growing set using some heuristic
method. The overlarge rule set is then repeatedly simplified by applying one of a set of pruning
operators; typical pruning operators delete any single condition or any single
rule. At each stage of simplification, the pruning operator chosen is the one that yields the
greatest reduction of error on the pruning set. Simplification ends when applying any pruning
operator would increase error on the pruning set [23].
The algorithm is as follows:
 Initialize RS = {}, and for each class from the less prevalent one to the more frequent one,
DO:
 Building stage: Repeat Grow phase and Prune phase until the description length(DL) of
the rule set and examples is 64 bits greater than the smallest DL met so far, or there are
no positive examples, or the error rate >= 50%.
 Grow phase: Grow one rule by greedily adding antecedents (or conditions) to the rule
until the rule is perfect (i.e. 100% accurate). The procedure tries every possible value of
each attribute and selects the condition with highest information gain: p(log(p/t)-
log(P/T)).
 Prune phase: Incrementally prune each rule and allow the pruning of any final sequences
of the antecedents.
 Optimization stage: After generating the initial rule set {Ri}, generate and prune two
variants of each rule Ri from randomized data using procedure Grow phase and Prune
phase. But one variant is generated from an empty rule while the other is generated by
greedily adding antecedents to the original rule. Then the smallest possible DL for each
variant and the original rule is computed. The variant with the minimal DL is selected as
the final representative of Ri in the rule set. After all the rules in {Ri} have been examined
and if there are still residual positives, more rules are generated based on the residual
positives using Building Stage again. Delete the rules from the rule set that would
increase the DL of the whole rule set if it were in it and add resultant rule set to RS.ENDDO
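The Grow-phase criterion can be written out directly; a small sketch, where the coverage counts in the example are hypothetical:

```python
from math import log2

def grow_gain(p, t, P, T):
    """Information gain of adding a condition: p * (log2(p/t) - log2(P/T)),
    where p of t covered examples are positive after the condition is added
    and P of T were positive before it."""
    return p * (log2(p / t) - log2(P / T))

# a condition that narrows coverage from 50/100 to 30/35 positives/total
print(round(grow_gain(30, 35, 50, 100), 2))  # -> 23.33
```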
 Naïve-Bayesian Tree: A naive Bayes classifier is a simple probabilistic classifier based on
applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive
term for the underlying probability model would be "independent feature model". In simple
terms, a naive Bayes classifier assumes that the presence or absence of a particular feature
is unrelated to the presence or absence of any other feature, given the class variable. For
example, a fruit may be considered to be an apple if it is red, round, and about 3" in
diameter. A naive Bayes classifier considers each of these features to contribute
independently to the probability that this fruit is an apple, regardless of the presence or
absence of the other features. For some types of probability models, naive Bayes classifiers
can be trained very efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum likelihood; in
other words, one can work with the naive Bayes model without accepting Bayesian
probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers
have worked quite well in many complex real-world situations. In 2004, an analysis of the
Bayesian classification problem showed that there are sound theoretical reasons for the
apparently implausible efficacy of naive Bayes classifiers [24]. Still, a comprehensive
comparison with other classification algorithms in 2006 showed that naive Bayes
classification is outperformed by other approaches, such as boosted trees or random
forests [25].
An advantage of Naive Bayes is that it only requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because the variables are assumed to be independent, only the variances of the variables for
each class need to be determined and not the entire covariance matrix.
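The point that only per-class means and variances are needed can be made concrete; a minimal sketch with hypothetical parameters, not the KNIME NB tree implementation:

```python
from math import exp, pi, sqrt

def gaussian(x, mean, var):
    # class-conditional density of one feature (independence assumption)
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def predict(x, params, priors):
    # params[c] holds per-feature (mean, var) pairs; no covariance matrix needed
    scores = {}
    for c, feats in params.items():
        score = priors[c]
        for xi, (mean, var) in zip(x, feats):
            score *= gaussian(xi, mean, var)
        scores[c] = score
    return max(scores, key=scores.get)

# hypothetical two-class, two-feature model
params = {"fraud": [(0.8, 0.05), (0.2, 0.05)],
          "clean": [(0.2, 0.05), (0.8, 0.05)]}
priors = {"fraud": 0.5, "clean": 0.5}
print(predict([0.9, 0.1], params, priors))  # -> fraud
```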
 Quantile Regression Tree: A nonparametric regression method that blends key features of
piecewise polynomial quantile regression and tree-structured regression based on adaptive
recursive partitioning of the covariate space is investigated [26]. Unlike least squares
regression trees, which concentrate on modelling the relationship between the response and
the covariates at the centre of the response distribution, quantile regression trees can
provide insight into the nature of that relationship at the centre as well as the tails of the
response distribution. These nonparametric regression quantiles have piecewise polynomial
forms, where each piece is obtained by fitting a polynomial quantile regression model to the
data in a terminal node of a binary decision tree. The decision tree is constructed by
recursively partitioning the data based on repeated analyses of the residuals obtained after
model fitting with quantile regression. One advantage of the tree structure is that it provides
a simple summary of the interactions among the covariates.
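The asymmetry that lets quantile regression target the tails of the response distribution comes from the quantile (pinball) loss; a minimal sketch, using tau = 0.9 to match the quantile reported in the results section:

```python
def pinball_loss(y_true, y_pred, tau):
    # minimised in expectation at the tau-th conditional quantile:
    # tau * r for under-prediction (r >= 0), (tau - 1) * r otherwise
    r = y_true - y_pred
    return tau * r if r >= 0 else (tau - 1) * r

# at tau = 0.9, under-predicting by 2 costs nine times more than over-predicting by 2
print(pinball_loss(10, 8, 0.9), pinball_loss(8, 10, 0.9))
```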
 Firefly Miner: The firefly algorithm was proposed by Yang [27, 28] based on the natural
luminous insects known as fireflies that reside in tropical and temperate regions. The two
fundamental functions of their night-time flashes are to attract mating partners and to
attract prey. The firefly miner [29] is a firefly-based rule extraction algorithm for
classification. It uses a sequential covering approach to generate a single rule in each run;
the miner runs the firefly algorithm at least as many times as there are classes in the
dataset. Initially, the rule list is empty. A population of candidate rules is generated, each
consisting of all attributes, where each attribute value lies between its upper and lower
bounds in the dataset. The fitness function is computed for each firefly, and one run of the
firefly algorithm outputs the rule with the highest quality. The correctly classified samples
are then removed from the training set and the firefly miner runs again for the same class
until the number of uncovered samples for that class falls below a pre-specified threshold.
This process is repeated for the other classes. Finally, a set of rules is generated which can
be used to classify new data. The pseudo-code of the firefly miner is presented in Figure 4:
RS = {}  /* initially, Rule Set is empty */
FOR EACH class C
    TS = {All training examples}
    WHILE (Number of uncovered training examples belonging to class C >
           Maximum Uncovered Examples Per Class)
        Run the Firefly algorithm to discover the best rule predicting class C
        and return the best discovered rule BestRule
        RS = RS U BestRule
        TS = TS - {training examples correctly covered by the discovered rule}
    END WHILE
END FOR

Figure 4: Pseudo-code of the Firefly miner
The rule representation is as follows:

If attribute 1 < or > value and attribute 2 < or > value ... and attribute n < or > value then class.
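The outer loop of Figure 4 is ordinary sequential covering; a minimal Python sketch, where the fixed threshold rule discoverer is a hypothetical stand-in for one run of the firefly algorithm:

```python
def sequential_covering(examples, cls, discover_rule, max_uncovered):
    # repeatedly discover the best rule for one class and drop covered examples
    rules, remaining = [], list(examples)
    while sum(e["label"] == cls for e in remaining) > max_uncovered:
        rule = discover_rule(remaining, cls)   # stand-in for one firefly run
        rules.append(rule)
        remaining = [e for e in remaining if not rule(e)]
    return rules

examples = [{"x": v, "label": "A" if v > 0.5 else "B"}
            for v in (0.1, 0.2, 0.6, 0.9)]

def discover_rule(data, cls):
    # trivial stand-in: a fixed threshold rule covering the class-A examples
    return lambda e: e["x"] > 0.5

rules = sequential_covering(examples, "A", discover_rule, 0)
print(len(rules))  # -> 1 (one rule covers both class-A examples)
```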
5. Results and discussion

The dataset analysed consists of 35 financial items for 202 companies, of which 101 were
fraudulent and 101 were non-fraudulent. We normalized the dataset before feeding it into
any classifier. We employed CART and TreeNet in Salford Predictive Modeler (SPM). For the
quantile regression tree we used Generalized, Unbiased, Interaction Detection and
Estimation (GUIDE). We used KNIME to implement C4.5, the Naïve-Bayesian tree, random
forest and RIPPER. The firefly miner code used was developed in Java. For C4.5 we used
gain ratio as the quality measure; no pruning was applied and it was ensured that each
node contains at least two records. We used the debug mode to extract additional
information to the console. The best accuracy of 62% was obtained with the top 10
feature set. Random forest was also analysed in debug mode. All settings, such as the
maximum depth of the tree and the number of features to consider, were left at their
default (unbounded) values. The best accuracy of 67% was attained with the top 10
feature set. For RIPPER we set the number of folds for reduced error pruning (REP) to 5
and the number of optimization runs to 7; all other parameters, such as the seed and
error rate, were left at their default values. The best accuracy of 67.5% was attained with
the top 10 feature set. The NB tree was analysed with default parameter values. The best
accuracy of 72% was obtained with the top 18 feature set. The firefly miner was tested on
both the top 10 and top 18 feature sets with population size 100. The maximum accuracy
of 61.4% was recorded with the top 10 feature set. CART and TreeNet were tested in
Salford Predictive Modeler with default settings. CART outperformed TreeNet with the
top 10 feature set, attaining a maximum accuracy of 70.4%. All the results are listed in
the tables below.

Table 3: Average accuracy results (%) of the dataset with the full, top 10 and top 18 feature
sets using 10-fold cross-validation
Classifier Top 10 feature set Top 18 feature set Full feature set
Ripper 67.5 67 57
Naïve-Bayesian Tree 68 72 62
Random Forest 67 63 58
C-4.5 62 58 57
CART 70.4 63.8 65
TreeNet 69.3 66.3 63
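The 10-fold protocol behind Table 3 can be sketched as follows; an index-splitting illustration for the 202-row dataset, not the internals of the KNIME or SPM tools actually used:

```python
def kfold(indices, k=10):
    # partition the indices into k interleaved folds; each fold serves once
    # as the test set while the remaining folds form the training set
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold(list(range(202))))
print(len(splits), sum(len(test) for _, test in splits))  # 10 folds cover all 202 rows
```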

With the whole feature set, CART achieved a maximum accuracy of 65%. With the top 18
feature set, the Naïve-Bayesian tree outperformed every other classifier, producing an
accuracy of 72%; with only the top 10 features, the results did not show much improvement
in the case of the Naïve-Bayesian tree, but the accuracy of the other classifiers did increase
by a small amount. In this case CART again outperformed the other classifiers with an
accuracy of 70.4%.
The 10-fold CV dataset was also trained using the firefly miner algorithm with population
sizes 100 and 30. If-then rules for data classification were generated, and the testing phase
resulted in an accuracy of 61.4% with the top 10 feature set and 59.9% with the top 18
feature set. The results are tabulated in the following table.

Table 4: Average accuracy results (%) for the dataset using the firefly miner algorithm with
the top 10 and top 18 feature sets using 10-fold cross-validation
Top 10 feature set Top 18 feature set

61.4 59.9
The quantile regression tree was trained and tested on the 10-fold cross-validation dataset
with varying quantiles. Results of t-statistics were used to apply the classifier to the top 10,
top 18 and full feature sets. The following results were recorded with a quantile of 0.9.

Table 5: Average accuracy results (%) for the dataset after applying the quantile regression
tree using 10-fold cross-validation.
Top 10 feature set Top 18 feature set Full feature set
70 68.7 63

Results on the test folds of the 10-fold cross-validation were ensembled in order to look for
the best achievable classification accuracy. The maximum accuracy attained was 70% with
the top 10 feature set after ensembling the results of CART, Ripper, Naïve-Bayesian tree,
random forest and C4.5 (all implemented in KNIME). The following results were achieved
with the top 10 and top 18 feature sets.

Table 6: Average accuracy results (%) of ensembling the outputs of CART, Ripper, NB-Tree,
random forest and C4.5 using 10-fold cross-validation
Top 10 feature set Top 18 feature set

70 67.5
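The ensembling used for Table 6 amounts to a per-sample majority vote over the base classifiers' outputs; a minimal sketch, where the predictions shown are hypothetical:

```python
from collections import Counter

def ensemble_vote(preds_per_clf):
    # preds_per_clf: one prediction list per base classifier;
    # each test sample gets the majority label across classifiers
    return [Counter(col).most_common(1)[0][0] for col in zip(*preds_per_clf)]

# hypothetical outputs of CART, Ripper, NB-Tree, random forest and C4.5
preds = [["fraud", "clean"],
         ["fraud", "clean"],
         ["clean", "clean"],
         ["fraud", "fraud"],
         ["clean", "clean"]]
print(ensemble_vote(preds))  # -> ['fraud', 'clean']
```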

Overall, we obtained the best accuracy of 72% with the Naïve-Bayesian tree when applied to
the top 18 feature set using 10-fold cross-validation. It should be noted that while all the
decision tree techniques have equal cost, the technique that is preferred and recommended
is dictated entirely by the dataset at hand. Since accuracy is a major concern for financial
analysts, we should select the technique that yields fewer misclassifications and consumes
less time, because the performance of all of these techniques depends on the dataset on
which they are used. If everything else (i.e., accuracy, sensitivity, specificity, etc.) is equal,
we should select the technique that is less cumbersome, easier to understand and easier
to implement.

6. References

1. Watts, R. L., and J. L. Zimmerman, 1986, Positive Accounting Theory, Prentice-Hall


2. Coderre G. D. (1999) Fraud Detection. Using Data Analysis Techniques to Detect Fraud,
Global Audit Publications
3. Fanning K. and Cogger K., (1998), Neural Network Detection of Management Fraud Using
Published Financial Data, International Journal of Intelligent Systems in Accounting,
Finance & Management, Vol. 7, No. 1, pp. 21-24
4. Green B.P. and Choi J.H., (1997), Assessing the risk of management fraud through neural
network technology, Auditing: A Journal of Practice and Theory, Vol. 16(1), pp.14-28
5. Nieschwietz, R.J., Schultz, J.J. Jr and Zimbelman, M.F. (2000), Empirical research on
external auditors’ detection of financial statement fraud, Journal of Accounting
Literature, Vol. 19, pp. 190-246
6. Albrecht, C.C., Albrecht, W.S. and Dunn, J.G. (2001), Can auditors detect fraud: a review
of the research evidence, Journal of Forensic Accounting, Vol. 2 No. 1, pp. 1-12
7. Bell T. and Carcello J. (2000) A decision aid for assessing the likelihood of fraudulent
financial reporting, Auditing: A Journal of Practice & Theory, Vol. 9 (1), pp. 169-178
8. Ansah, S.O., Moyes, G.D., Oyelere, P.B. and Hay, D. (2002), An empirical analysis of the
likelihood of detecting fraud in New Zealand, Managerial Auditing Journal, Vol. 17 No. 4,
pp. 192-204
9. A. Aamodt, E. Plaza, Case-based reasoning: foundational issues, methodological
variations, and system approaches, Artificial Intelligence Communications 7(1) (1994) 39-
59
10. S. Kotsiantis, E. Koumanakos, D. Tzelepis, V. Tampakas, Forecasting fraudulent financial
statements using data mining, International Journal of Computational Intelligence 3(2)
(2006) 1014-110
11. A. Deshmukh, L. Talluru, A rule-based fuzzy reasoning system for assessing the risk of
management fraud, International Journal of Intelligent Systems in Accounting, Finance &
Management 7(4) (1998) 223-241
12. R. Pacheco, A. Martins, R.M. Barcia, S. Khator, A hybrid intelligent system applied to
financial statement analysis, Proceedings of the 5th IEEE conference on Fuzzy Systems,
vol. 2, 1996, pp. 1007-1012, New Orleans, LA, USA.
13. C. Magnusson, A. Arppe, T. Eklund, B. Back, H. Vanharanta, A.Visa, The language of
quarterly reports as an indicator of change in the company’s financial status, Information
& Management 42(4) (2005) 561-574
14. Y. Kim, Toward a successful CRM: variable selection, sampling, and ensemble, Decision
Support Systems 41(2) (2006) 542-553
15. P. Ravisankar, V. Ravi, G. Raghava Rao, I. Bose, Detection of financial statement fraud and
feature selection using data mining techniques, Decision support systems, vol 50 issue 2,
January 2011, pp 491-500
16. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
17. S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques,
Informatica 31(2007) 249-268, 2007
18. Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): pp 5–32
19. Ho, Tin Kam (1995). "Random Decision Forest". Proceedings of the 3rd International
Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995.
pp. 278–282
20. Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision
Forests". IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8): 832–844
21. Amit, Yali; Geman, Donald (1997). "Shape quantization and recognition with randomized
trees". Neural Computation 9 (7): 1545–1588
22. Kleinberg, Eugene (1996). "An Overtraining-Resistant Stochastic Modeling Method for
Pattern Recognition". Annals of Statistics 24 (6): 2319–2349
23. Cohen, W., “Fast effective rule induction”. Proceedings of International Conference on
machine Learning 1995, pp. 1-10
24. Zhang, Harry. "The Optimality of Naive Bayes". FLAIRS2004 conference.
25. Caruana, R.; Niculescu-Mizil, A. (2006). "An empirical comparison of supervised learning
algorithms". Proceedings of the 23rd international conference on Machine learning
26. Chaudhuri, P. and Loh, W.-Y. (2002), Nonparametric estimation of conditional quantiles
using quantile regression trees, Bernoulli, vol. 8, 561-576
27. X-S. Yang, Firefly algorithms for multimodal optimization, In proceedings of the 5th
international conference on Stochastic algorithms: foundations and applications
(SAGA’09), Osamu Watanabe and Thomas Zeugmann (Eds.). Springer-Verlag, Berlin,
Heidelberg, 169-178, 2009
28. X-S. Yang, Firefly algorithm, Stochastic test functions and design optimization,
International Journal of Bio-Inspired Computation, Vol.2, No. 2, pp. 78-84, 2010
29. Nekuri Naveen, V. Ravi, C. Raghavendra Rao, K.N.V.D. Sarath, Rule extraction using firefly
optimization: Application to banking, in proceedings of IEEE International Conference on
Industrial Engineering and Engineering Management, Hong Kong, 12/2012
