
Research Article

Prediction of failures in the project management knowledge areas using a machine learning approach for software companies
Gizatie Desalegn Taye1 · Yibelital Alemu Feleke2

Received: 15 December 2021 / Accepted: 25 April 2022

© The Author(s) 2022  OPEN

Abstract
In this paper, we propose a machine-learning model that predicts failures in the project management knowledge areas for software companies, using the ten knowledge areas of project management and attributes selected solely on the criteria of unambiguity, measurability, consistency, and practicability. Many software projects fail because software project managers are unfamiliar with the Project Management Knowledge Areas (PMKAs) or apply them without considering the company's conditions or project contexts. Using an experimental methodology, we collected data from software businesses by distributing questionnaires under a snowball sampling method. We trained machine learning techniques on the data from failed software projects, including Support Vector Machines (92.13%), Decision Trees (90%), K-Nearest Neighbors (87.64%), Logistic Regression (76.4%), and Naive Bayes (66%). The results show that the Support Vector Machine outperforms the other four methods: it handles high-dimensional, categorical data efficiently and copes with nonlinear relationships. The study's purpose is to improve project quality and decrease software project failure. Finally, we recommend collecting more failed-project datasets from software businesses and comparing them with our findings to predict knowledge-area failure.

Article highlights

• Design a machine learning model to predict knowledge area failure in project management.
• Compare and contrast the machine learning models' performance.
• Evaluate the suggested machine learning model.

Keywords Project management · Project Management Knowledge Areas · Project failure · Machine learning

1 Introduction

An established software company's goal is to sell software products and profit from them. A project is a short-term undertaking that results in a unique deliverable [1]. The objectives of project management include initiating, planning, executing, regulating, and closing projects, as well as controlling the operations of the project team within the defined time, scope, budget, and quality standards to achieve all agreed goals; software project management refers to the scheduling, planning, resource allocation, and execution of such projects [2]. There are ten software Project Management Knowledge Areas (PMKAs): Project Integration Management (PIM), Project Scope Management (PSM), Project Time Management (PTM), Project Cost Management (PCM), Project Quality Management

* Gizatie Desalegn Taye, [email protected] | 1Department of Computer Science, Faculty of Technology, Debre Tabor University,
Debre Tabor, Ethiopia. 2Department of Database Administration, Amhara National Regional State Labour and Training Bureau, Bahir Dar,
Ethiopia.

SN Applied Sciences (2022) 4:165 | https://doi.org/10.1007/s42452-022-05051-7


Content courtesy of Springer Nature, terms of use apply. Rights reserved.



(PQM), Project Human Resource Management (PHRM), Project Risk Management (PRM), Project Procurements Management (PPM), Project Communications Management (PCCM), and Project Stakeholders Management (PSTM) [1]. The problems that cause software project failures include poor planning, lack of leadership, problems with people, vague or changing requirements, life-cycle problems, inefficient communication, inadequate funding, little attention to stakeholder approval, lack of a schedule, missed deadlines, and the hiring of unqualified project managers. The research's goal, therefore, is to forecast failures in the project management knowledge areas for software firms. We develop a machine-learning model that helps software project managers predict the failed knowledge areas that best fit the current situation (problem domain (failure motives), company characteristics, project size, the indispensable nature of the project, the nature of the opportunities, and the methodology followed). Improving the efficiency and maintaining the sustainability of a software project are obstacles that project managers face. The probability of project failure is generally due to a lack of knowledge, skills, resources, and technology during project implementation [3, 4]. The study answers the following research questions.

1. How do we design a machine learning model that predicts project management knowledge area failure?
2. Which machine learning techniques are the most effective for predicting project management knowledge area failures?
3. How well does our model predict project management failure in terms of knowledge areas?

The study would reduce the time, effort, and money that project managers and software companies spend to predict the failure of the knowledge areas. However, every software project is different and unique [5]. According to [6], a software company faces different challenges, from funding and team building to ideation and attracting talent at a very early stage. Starting from this idea, the study focuses on identifying the reasons behind wariness and uncertainty in organizations. The authors of [7] identify and categorize the software engineering Project Management Knowledge Areas (PMKAs) used in software companies, mapping the state of the art through a systematic literature-mapping method with snowball sampling to evaluate how the Software Engineering Body of Knowledge (SWEBOK) characterizes the content of the software engineering discipline and promotes a consistent view of software engineering. Our work makes predictions, not only statistics. The study presented by the Project Management Institute (PMI) identifies domains of knowledge, each containing a process to be followed for effective project management; project managers must have knowledge and skills in each of these areas or have specialists who can assist in them, as some large projects have dedicated schedule coordinators, risk managers, communication specialists, or procurement contract officers. The authors of [1] described how a competent and knowledgeable project manager is vital to project success. The researchers evaluate the ten project management knowledge areas in service industries and manufacturing using the Analytic Hierarchy Process (AHP) and the Absolute Degree Grey Incidence Analysis (ADGIA) model. Both models find that project quality management is the most important knowledge area, most strongly related to project communication management and least strongly related to project integration management, but the literature has a gap.

The authors of [8] focus on behavioral advertisement analysis, such as an individual's preferences, buying habits, or hobbies, and employ machine-learning approaches to identify and execute targeted advertising using data that reflects the user's retail activity, building a framework that uses a classification model over streaming technologies and produces a multi-class classifier for sector-based classification. To improve the accuracy of the prediction task, the method uses a structured approach and multiple ensemble techniques. To forecast failure, we employed a multiclass classifier in our research. The authors of [9] provided a framework for value realization: universities must assess learning analytics' (LA's) strategic role and invest carefully according to criteria such as high-quality data, analytical tools, knowledgeable people who are up to date on technology, and data-driven prospects for learning improvement. In our research, we used four such criteria to select attributes for prediction. The authors of [10] investigated an efficient algorithm for predicting software reliability using a hybrid approach known as a Neuro-Fuzzy Inference System, applied to test data for software reliability prediction using complexity, changeability, and portability parameters in software development as inputs to the Fuzzy Inference System. After testing and training on real-time data, they forecast reliability in terms of mean relative error and mean absolute relative error. The study's findings are verified by comparing them to other state-of-the-art soft computing techniques.

The related work above has the following gaps in general. To begin with, the majority of the research does not focus on making predictions. Second, the above-mentioned related works are carried out in the automotive supply sector, manufacturing, and non-governmental




organizations (NGOs). Third, they employed different methods than we did in our research. As a result, we focused our investigation on software companies. In Ethiopia, most software firms have inexperienced, unsuccessful, and less skilled project managers compared to other experienced corporate projects. Moreover, when to add or reduce the criteria that influence the project management knowledge areas is not self-evident, so we added more factors to the mix. Finally, the datasets associated with these works are quite modest, so their outputs are hurried; we therefore prepared as large a dataset as feasible.

The introduction section closes with this paragraph. In Sect. 2, we look at the methodology, covering everything from the datasets used to predict failed project management to the design of the suggested model, data preparation, and the confusion matrices for calculating performance measures. The results, validation of the model, and a discussion of the performance metrics are presented in Sect. 3, and the paper concludes with possible future extensions of this work.

2 Methodology

The research is based on experiments. Experimental research is a collection of research designs that employ manipulation and controlled testing to gain a better understanding of the processes that predict outcomes depending on certain criteria. As a result, the following methods and techniques are employed to complete this study.

2.1 The designed proposed prediction model

The general description of the failure prediction model for project management knowledge areas in software companies is given in Fig. 1. The model has five major phases. The first phase is the collection of failed-project data from software development companies. The second phase is data pre-processing, which covers data cleansing, feature selection, data transformation, and data reduction. The third phase implements the selected algorithms: Support Vector Machine (SVM), Decision Trees (DT), Naïve Bayes (NB), Logistic Regression (LR), and K-Nearest Neighbors (KNN). The fourth phase performs data analysis and evaluation, measuring the efficiency of the proposed models by the accuracy, precision, F1-score, and recall of each algorithm. The fifth and final phase concludes our work by analyzing and drawing conclusions from the graphical and aggregated experimental results. In addition, we can see in Fig. 1 that each component in the model is interconnected and sequential.

2.2 Data collection and dataset preparation

We used a questionnaire to gather data from the target software companies for this study, and we produced the data with project managers working for software companies in Ethiopia. The dataset included eighteen attributes classified into three groups (project manager, project context, and business situation) that influence the prediction of knowledge-area failure in project management; they were collected and prepared based on the criteria of unambiguity, consistency, practicability, and measurability [11].

There are ten knowledge areas or output classes, as indicated in Table 1, namely PCCM, PCM, PHRM, PIM, PPM, PQM, PRM, PSTM, PSM, and PTM; the failure counts for each class are 48, 76, 45, 82, 40, 21, 27, 36, 42, and 26 out of 443 total records. For prediction, we employed multiclass methods.

Raw failed-project data are produced from the questionnaires distributed to software companies. Processing raw failed-project data: the gathered raw data should be processed for three reasons: missing values should be fixed, data should be standardized, and variable sets should be optimized.

2.2.1 Analyzing attributes

2.2.1.1 Unambiguity  Each attribute should have its own meaning and be subject to one and only one interpretation. The possible values are yes (Y) and no (N). Ambiguous attributes were not selected.

2.2.1.2 Consistency  Each attribute should be independent of the others. There are three possible values: high (H), medium (M), and low (L). The attributes with the highest consistency values were chosen.

2.2.1.3 Measurability  Each attribute should be assignable a value based on a metric. There are three possible values: high (H), medium (M), and low (L). Attributes with higher ease of measurability were chosen.

2.2.1.4 Practicability  Each attribute should be feasible in the sense of a particular (sudden) project. There are three possible values: high (H), medium (M), and low (L). Attributes with higher feasibility or practicability were chosen.

There are three possible values in Table 2: high (H), medium (M), and low (L). An attribute with a higher level of practicability was included in the final list of criteria. A characteristic may be added to or removed from the final list of influential attributes based on the aforementioned criteria [11]. As a result, nine attributes were chosen as the input for machine learning from the preliminary list of 18 attributes. Of these nine attributes, the project manager accounts for four, while three




are related to the project's context and the remaining two to the nature of the company's situation. Table 2 shows the list of attributes and their results ("P" denotes selected attributes that made it into the final list, while "F" denotes attributes that did not).

Fig. 1  The proposed model

Table 1  Project management knowledge areas failures after annotating the data

Knowledge areas                                   No of annotated data
Project Communications Management (PCCM)          48
Project Cost Management (PCM)                     76
Project Human Resource Management (PHRM)          45
Project Integration Management (PIM)              82
Project Procurements Management (PPM)             40
Project Quality Management (PQM)                  21
Project Risk Management (PRM)                     27
Project Stakeholders Management (PSTM)            36
Project Scope Management (PSM)                    42
Project Time Management (PTM)                     26
Total                                             443

2.3 Data preprocessing

The information on failed projects was gathered from software companies. Accordingly, data preprocessing was completed, including data cleansing, duplicate-value removal, null-value detection, rectification, and balancing. Because we collect data from a variety of sources, data integration is a crucial part of the process. We need to make a condensed version of the dataset that



Table 2  Identified attributes with their descriptions and selection results using the four criteria (criteria values are listed in order of unambiguity, consistency, measurability, and practicability; "P" = selected, "F" = not selected; some descriptions, shifted by one row in the published layout, have been realigned with their attributes)

Factor: Project manager
Education Level | YHHH | P | [11] | The highest grade obtained, or whether the individual has a secondary school (high school) diploma or equivalency certificate
Experience | YLMM | F | [12] | The ability or experience acquired from doing something (such as a particular job); the amount of time spent on a particular task
Knowhow of the PMKAs | YMMM | P | [13] | Practical knowledge of how to do something
Decision maker | YMLL | F | [13] | Someone who makes decisions, especially at a high level in an organization; a strategic decision-maker may be in charge of acquisitions, company growth, or capital investment
Relevant job | YMMM | P | [14] | Whether the job held matches the required skills or knowledge
Education background | YHHH | P | [11] | The major studied and the degree received

Factor: Context of the project
Complexity | NMML | F | [11] | A factor that plays a role in a complex process or circumstance
Size of the project | YMML | F | [11] | A broad term used to define the project's overall scope
Budget | YLMM | F | [15] | A forecast of sales and expenditures for a future period
Reasons for failure | YMMM | P | [1] | Answers why the project failed
Development model followed | YMMM | P | [1] | The development life-cycle steps the project followed
Requirement elicitation technique followed | YMMM | P | [1] | Obtaining information about a system's specifications from users, consumers, and other stakeholders
Number of functionalities | NLHM | F | [13] | The number of functions or features the project delivers

Factor: Company situation
Profitability | YMMM | P | [13] | The condition of making a profit or gaining money
Proper positioning | YLMM | F | [13] | The place a brand occupies in the minds of its target customers
Uniqueness | NLMM | F | [1] | Being one of a kind; a desirable attribute
Credibility | NMML | F | [13] | The quality of being persuasive and trustworthy
Market situations | YMMH | P | [13] | The marketing plans in place and expectations for the future
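The criteria-based filter summarized in Table 2 can be sketched as follows. This is a minimal illustration, not the authors' code; the selection rule below is an assumption inferred from the table (an attribute passes if it is unambiguous and none of the remaining criteria is low), and it happens to reproduce the published P/F column.

```python
# Hypothetical sketch of the four-criteria attribute filter from Table 2.
# Each attribute maps to its criteria string in the order
# (unambiguity, consistency, measurability, practicability).
criteria = {
    "Education Level": "YHHH", "Experience": "YLMM",
    "Knowhow of the PMKAs": "YMMM", "Decision maker": "YMLL",
    "Relevant job": "YMMM", "Education background": "YHHH",
    "Complexity": "NMML", "Size of the project": "YMML",
    "Budget": "YLMM", "Reasons for failure": "YMMM",
    "Development model followed": "YMMM",
    "Requirement elicitation technique followed": "YMMM",
    "Number of functionalities": "NLHM", "Profitability": "YMMM",
    "Proper positioning": "YLMM", "Uniqueness": "NLMM",
    "Credibility": "NMML", "Market situations": "YMMH",
}

def selected(code: str) -> bool:
    """An attribute passes if it is unambiguous (Y) and none of
    consistency, measurability, or practicability is low (L)."""
    return code[0] == "Y" and "L" not in code[1:]

final_attributes = [name for name, code in criteria.items() if selected(code)]
print(len(final_attributes))  # 9 attributes survive the filter
```

Applied to the 18 preliminary attributes, this rule keeps exactly the nine attributes marked "P" in Table 2.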

is smaller in size but retains the original's integrity. Data preparation is the process of transforming data into a format suitable for data modeling, such as converting character values to binary values.

The train-test split technique is used to measure the performance of machine learning algorithms that make predictions on data that was not used to train the model.

• A training dataset is the set of data used to fit a machine learning model.
• A test dataset is used to assess the machine learning model's fit.

The purpose of splitting the dataset is to assess the machine learning model's performance on new data that has not been used to train it. This is how we hope to use the model in practice: fit it to existing data with known inputs and outputs, and then make predictions about future cases where we do not have the expected output or target values.

2.3.1 Experimental methods

The experimental methods are mainly aimed at identifying and visualizing the factors that contribute to project failure and at building a prediction model that indicates, based on the model's performance, which project management knowledge areas of a project failed.

2.3.2 Model evaluation

This activity describes the evaluation parameters of the designed model and its results. The comparison was made between the data categorized by the proposed model and the manually labeled (categorized) data. Classification accuracy (CA), a common performance appraisal metric for classification, is used as the final proof of performance.

2.3.2.1 Confusion matrix  The confusion matrix assesses the performance of a classification model on a test dataset. Our target was multiclass, meaning a classification task with more than two class labels; since our target has ten labels, the confusion matrix is a 10×10 array.

The performance of a classification model is defined by a confusion matrix:
True positives (TP): cases where the classifier predicted true and the correct class was true.
True negatives (TN): cases where the classifier predicted false and the correct class was false.
False positives (FP, type I error): cases where the classifier predicted true but the correct class was false.
False negatives (FN, type II error): cases where the classifier predicted false but the correct class was true.

2.3.3 Accuracy

Accuracy is the number of correctly classified samples divided by the total number of samples in the dataset. Accuracy has a best value of one and a worst value of zero.
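The confusion-matrix evaluation described above can be sketched with scikit-learn. This is an illustrative example on toy labels, not the paper's dataset; the label vectors below are placeholders.

```python
# Illustrative sketch (toy labels, not the paper's data): computing the
# confusion matrix and the averaged metrics used in this section.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Toy ground-truth and predicted labels for a small multiclass problem.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 1, 2, 2, 0, 2, 0]

cm = confusion_matrix(y_true, y_pred)   # KxK array, rows = true class
acc = accuracy_score(y_true, y_pred)    # correctly classified / total samples
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")  # averaged, weighted by class support

print(cm)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

With ten output classes, as in this study, the same calls yield the 10×10 matrix and the weighted-average scores reported in Tables 3, 4, 5, 6 and 7.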



Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

2.3.4 Precision

Precision (P) is the fraction of identified or retrieved instances that the classification algorithm considers relevant. High precision means that most items labeled, for example, as "positive" actually belong to the class "positive". Precision is defined as the number of true positives divided by the sum of true positives and false positives.

Precision = TP / (TP + FP)    (2)

2.3.5 Recall

Recall is considered a measure of completeness: the proportion of positive examples that are marked as positive. Recall is defined as the number of true positives divided by the total number of elements that belong to the positive class.

Recall = TP / (TP + FN)    (3)

2.3.6 F1 score

The F-measure (F1 score) is defined as the harmonic mean of precision and recall, joining recall and precision into a single measure of performance. The relative contributions of precision and recall to the F1-score are equal.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)    (4)

3 Results and discussion

Experimentation requires the preparation of a dataset for training and testing, as there is no free, ready-to-use dataset available on the Internet. We gathered data from 19 software companies in this study and organized the dataset into three categories (project manager, project context, and company situation) covering nine attributes. The collection has 443 records with 9 attributes; 80% was used to train the proposed model, and the remaining 20% to test it.

3.1 Experimental results and analysis

After importing the necessary Python modules and libraries, the next task is to read the processed data frame (df) in Python and check the imported rows: the ID, project manager name, education level, experience, relevant job, company name, knowledge of the project management knowledge areas (PMKAs), development model followed, requirement elicitation technique followed, market situations, profitability of the company, reasons for failure, and class. Of these, the ID, project manager name, and project name are not required for the study and were removed; the remaining unique values were retained.

Feature engineering: the main goal of feature engineering is to add features that are likely to have an impact on the failed-project dataset. The fundamental step is to split the training and test datasets. Out of the 443 rows in the dataset, we used 354 rows for training and 89 rows for testing. Because our dataset is small, we kept the training proportion high, as a high training share and a low test share are recommended for small datasets to obtain good accuracy.

3.1.1 Results of each prediction algorithm

We employed five machine learning algorithms to predict the failure of the project management knowledge areas in our experiment: K-Nearest Neighbors (KNN), Decision Trees (DT), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machines (SVM).

3.1.1.1 K-nearest neighbors (KNN) prediction algorithm results and analysis  We built a K-Nearest Neighbors model to predict knowledge-area failures in software companies after finalizing the data transformation and the train-test split. The model result is presented in Table 3: we obtained a weighted-average F1-score with an accuracy of 87.64%. The values listed in the Support column are the test data classified into 10 classes.

Table 3  K-Nearest Neighbors (KNN) result using the confusion matrix

No                 Precision  Recall  F1-score  Support
0                  0.75       0.86    0.80      7
1                  0.90       1.00    0.95      18
2                  0.92       0.79    0.85      14
3                  0.92       0.92    0.92      12
4                  0.78       0.78    0.78      9
5                  1.00       1.00    1.00      3
6                  1.00       0.50    0.67      4
7                  0.88       0.88    0.88      8
8                  0.78       1.00    0.88      7
9                  1.00       0.86    0.92      7
Accuracy                              0.88      89
Macro average      0.89       0.86    0.86      89
Weighted average   0.88       0.88    0.87      89

3.1.1.2 Decision trees prediction algorithm results and analysis  As we can see from the confusion-matrix report in Table 4, the decision tree algorithm achieved a 90% weighted-average accuracy and F1-score.

Table 4  Decision Tree (DT) result using the confusion matrix

No                 Precision  Recall  F1-score  Support
0                  1.00       0.71    0.83      7
1                  0.95       1.00    0.97      18
2                  0.85       0.79    0.81      14
3                  0.86       1.00    0.92      12
4                  0.73       0.89    0.80      9
5                  0.75       1.00    0.86      3
6                  1.00       0.50    0.67      4
7                  1.00       1.00    1.00      8
8                  1.00       1.00    1.00      7
9                  1.00       0.86    0.92      7
Accuracy                              0.90      89
Macro average      0.91       0.87    0.88      89
Weighted average   0.91       0.90    0.90      89

3.1.1.3 Logistic regression prediction algorithm results and analysis  The performance measures obtained for Logistic Regression on the testing set are given in Table 5. Here, we achieve a 76.40% weighted-average F1-score.

Table 5  Logistic Regression (LR) result using the confusion matrix

No                 Precision  Recall  F1-score  Support
0                  0.67       0.57    0.62      7
1                  0.79       0.83    0.81      18
2                  0.85       0.79    0.81      14
3                  0.69       0.75    0.72      12
4                  0.78       0.78    0.78      9
5                  0.50       0.33    0.40      3
6                  1.00       0.50    0.67      4
7                  0.73       1.00    0.84      8
8                  0.71       0.71    0.71      7
9                  0.86       0.86    0.86      7
Accuracy                              0.76      89
Macro average      0.76       0.71    0.72      89
Weighted average   0.77       0.76    0.76      89

3.1.1.4 Results and analysis of the naïve Bayes prediction algorithm  The performance measures obtained for Naïve Bayes on the testing set are given in Table 6. Here, we achieve a 66% weighted-average F1-score.

Table 6  Naïve Bayes (NB) result using the confusion matrix

No                 Precision  Recall  F1-score  Support
0                  0.35       1.00    0.52      7
1                  0.95       1.00    0.97      18
2                  1.00       0.29    0.44      14
3                  0.83       0.83    0.83      12
4                  1.00       0.33    0.50      9
5                  0.18       1.00    0.30      3
6                  1.00       0.50    0.67      4
7                  0.75       0.38    0.50      8
8                  1.00       0.29    0.44      7
9                  1.00       0.86    0.92      7
Accuracy                              0.65      89
Macro average      0.81       0.65    0.61      89
Weighted average   0.87       0.65    0.66      89

3.1.1.5 Support vector machine prediction algorithm results and analysis  The performance of the Support Vector Machine (SVM) model was also evaluated using the testing set; the obtained performance measures are given in Table 7. From the performance report, we can see that the SVM model achieves a 92.13% weighted-average F1-Score.

Table 7  Support Vector Machine (SVM) result using the confusion matrix

No                 Precision  Recall  F1-score  Support
0                  0.86       0.86    0.86      7
1                  0.90       1.00    0.95      18
2                  0.86       0.86    0.86      14
3                  1.00       0.92    0.96      12
4                  0.89       0.89    0.89      9
5                  0.75       1.00    0.86      3
6                  1.00       1.00    1.00      4
7                  1.00       1.00    1.00      8
8                  1.00       1.00    1.00      7
9                  1.00       0.71    0.83      7
Accuracy                              0.92      89
Macro average      0.93       0.92    0.92      89
Weighted average   0.93       0.92    0.92      89
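The five-model comparison above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the synthetic dataset stands in for the 443-row failed-project dataset, and all hyperparameters are scikit-learn defaults rather than the settings used in the paper.

```python
# Hypothetical sketch of the five-classifier comparison; synthetic data
# stands in for the 443-row dataset (9 encoded attributes, 10 classes).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=443, n_features=9, n_informative=9,
                           n_redundant=0, n_classes=10,
                           n_clusters_per_class=1, random_state=0)
# 80/20 split, as in the paper (354 training rows, 89 test rows).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2%}")
```

On the real questionnaire data, the same loop (with the paper's preprocessing in place) produces the accuracies reported in Tables 3, 4, 5, 6 and 7.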



SN Applied Sciences (2022) 4:165 | https://doi.org/10.1007/s42452-022-05051-7 Research Article

3.2 Validation of the model epoch. The validation loss is constantly reduced through-
out the training procedures, as given in Fig. 3, indicating
Validation ensures the model does not overfit or underfit that there is no overfitting.
during the training process. To prevent the model from
learning too much or too little from the training set, a
dropout layer or early stopping can be added. When a 3.3 Discussion of the results
model learns too much on the training set, it performs
well in the training phase but fails miserably in the testing Table 8 shows, that the Support Vector Machine has stood
phase. In data it has never seen before, it performs poorly. out due to its prediction accuracy.
The accuracy of training is high, but the accuracy of test- First experiment: In the findings of the confusion matrix
ing is extremely low. Here is the validation for our model. of the test data for the K-Nearest Neighbors (KNN) predic-
Visualizing the training vs. validation accuracy over a tion model, which is presented in Table 8, 78 of them were
number of epochs is an excellent approach to see if the correctly identified and the remaining 11 were mistakenly
model has been properly trained. This is necessary to classified. Finally, K-Nearest Neighbors (KNN) was shown
ensure that the model is not undertrained or overtrained to be 87.64% accurate.
to the point that it begins to memorize the training data, reducing its capacity to predict effectively. We employed early stopping with epochs = 100 in our model in Fig. 2, with nine attributes as the input layer, two hidden layers, and ten classes as the output layer. Early stopping entails tracking the loss on both the training and validation datasets (a subset of the training set not used to fit the model), so that training can be interrupted as soon as the validation loss begins to exhibit evidence of overfitting. We increased the number of epochs, confident that training would end as soon as the model began to overfit. From the plot of accuracy, given in Fig. 2, we can see that the model could probably be trained a little more, as the accuracy trend on both datasets is still rising over the last few epochs. We can also see that the model has not yet over-learned the training dataset, showing comparable skill on both datasets.

From the plot of loss, we can see that the model performs comparably on both the training and validation datasets (labeled test). If these parallel curves start to diverge consistently, it is a sign to stop training at an earlier epoch.

Second experiment: In the confusion matrix of the test data for the Decision Tree (DT) prediction model, presented in Table 8, 80 instances were correctly identified and the remaining 9 were misclassified. The Decision Tree (DT) thus reached an accuracy of 90%.

Third experiment: In the confusion matrix of the test data for the Logistic Regression (LR) prediction model, illustrated in Table 8, 68 instances were correctly identified and the remaining 21 were misclassified, giving Logistic Regression (LR) an accuracy of 76.4%.

Fourth experiment: In the confusion matrix for the Naïve Bayes (NB) prediction model, illustrated in Table 8, 58 test instances were correctly identified, while the remaining 31 were misclassified, giving Naïve Bayes (NB) an accuracy of 66%.

Fifth experiment: In the confusion matrix of the test data for the Support Vector Machine (SVM) prediction model, 82 instances were correctly identified, while the remaining 7 were misclassified.

Fig. 2  Validation of model accuracy
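Each accuracy reported in the experiments is simply the share of correctly classified instances among the 89 test records (the correct and incorrect counts in every row of Table 8 sum to 89). The figures can be reproduced with a few lines of Python; note that the DT and NB rows appear to be rounded to the nearest percent (89.89% and 65.17%, respectively):

```python
# Correct / incorrect test-set counts for each model, from Table 8.
results = {
    "KNN": (78, 11),
    "LR":  (68, 21),
    "SVM": (82, 7),
    "DT":  (80, 9),
    "NB":  (58, 31),
}

def accuracy_pct(correct, incorrect):
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100 * correct / (correct + incorrect), 2)

for model, (ok, bad) in results.items():
    print(f"{model}: {accuracy_pct(ok, bad)}%")
```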


Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Research Article SN Applied Sciences (2022) 4:165 | https://doi.org/10.1007/s42452-022-05051-7

Fig. 3  Validation of model loss
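The early-stopping rule described above (track the validation loss and halt once it stops improving for a few consecutive epochs) can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation; the loss curve and patience value are made up:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch at which training should stop: the first epoch
    after which validation loss has failed to improve for `patience`
    consecutive epochs (a sign of overfitting)."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs: stop
    return len(val_losses) - 1  # never triggered: run all epochs

# Synthetic validation-loss curve: improves, bottoms out, then rises
# as the model starts to memorize the training data.
curve = [1.00, 0.80, 0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.50, 0.52]
stop = early_stopping_epoch(curve, patience=3)  # stops at epoch 7
```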

As shown in Table 8, the Support Vector Machine (SVM) attained an accuracy of 92.13%.

The following are some of the reasons why the Naive Bayes (NB) prediction performed poorly in our experiment. First, if the test dataset contains a category of a categorical variable that was not present in the training dataset, the Naive Bayes (NB) model assigns it zero probability, a problem known as 'zero frequency' [16]; to tackle it, we applied a smoothing technique. Second, the Naive Bayes (NB) algorithm is well known to be a poor estimator [16], so its probability outputs should not be taken too seriously. Third, the Naïve Bayes (NB) algorithm assumes that all features are independent [17].

In our experiment, Logistic Regression (LR) achieved the second-lowest performance, after Naïve Bayes (NB), for the following reasons. First, the assumption of linearity between the dependent and independent variables is a key constraint of Logistic Regression (LR) [17]. Second, Logistic Regression requires little or no multicollinearity between independent variables [16]. Third, non-linear problems cannot be solved with Logistic Regression, since it has a linear decision surface [18]; linearly separable data is unusual in real-world situations, so non-linear characteristics must be transformed, which can be accomplished by increasing the number of features until the data become linearly separable in higher dimensions. Fourth, only the most critical and relevant features should be employed when creating a model; otherwise the model's probabilistic predictions become incorrect and its predictive value may degrade [18]. Fifth, each training instance must be independent of the rest of the instances in the dataset [17]; if they are related in some way, the model overweights those specific training instances. As a result, matched data or repeated measurements should not be used as training data. Some scientific study procedures, for example, rely on several observations of the same individual; in such conditions, this method is ineffective.

Table 8  Comparison of models on test data

Algorithms                     Correctly predicted   Incorrectly predicted   Accuracy (%)
K-nearest neighbors (KNN)      78                    11                      87.64
Logistic regression (LR)       68                    21                      76.4
Support vector machine (SVM)   82                    7                       92.13
Decision trees (DT)            80                    9                       90
Naïve Bayes (NB)               58                    31                      66

In our experiment, the K-Nearest Neighbors (KNN) prediction also achieved lower performance, together with Logistic Regression (LR) and Naive Bayes (NB), for the following reasons. First, K-Nearest Neighbors (KNN) can suffer from biased class distributions: if a certain class is very frequent in the training set, it tends to dominate the majority vote for new instances (large number = more common) [17]. In our data, for example, if the project integration management class is the most frequent, KNN tends to predict project integration management for new records.
Naïve Bayes (NB) 58 31 66
of the integration class projects is more frequent, the


Second, the accuracy of K-Nearest Neighbors (KNN) can be severely degraded on high-dimensional data [19], because there is then little difference between the nearest and the farthest neighbor; this is why KNN is not well suited to high-dimensional data. Third, the algorithm gets significantly slower as the number of features increases [17]. Fourth, KNN needs a large number of samples to achieve good accuracy [20], and our dataset does not contain a large number of samples. Fifth, the algorithm struggles with categorical features [16], and our data consist of categorical features.

In our experiment, the Decision Tree (DT) predictions also trailed the best model, for the following reasons. First, Decision Trees (DT) suffer from overfitting [17], which is their main problem: the tree keeps creating new nodes to fit the inputs (even noisy data) until it becomes too complex to interpret and loses its ability to generalize, performing very well on the training data but making many mistakes on unseen data. Second, high variance [16]: as noted above, an overfitted tree attains near-zero bias at the cost of significant variance, which produces many inaccuracies in the final estimates. Third, instability [21]: adding a new data point can trigger regeneration of the overall tree, with all nodes recalculated and recreated. Fourth, sensitivity to noise [17]: a little noise can make the tree unstable and lead to wrong predictions.

The Support Vector Machine (SVM) prediction achieved the best performance for the following reasons. First, it works effectively on categorical data [21], and our dataset is categorical. Second, it works relatively well even on smaller datasets, because the algorithm does not rely on the complete data [20]. Third, it works effectively on high-dimensional datasets, because the complexity of the trained model does not depend on the dimensionality of the dataset [18]. Fourth, a Support Vector Machine (SVM) is extremely useful when we have no prior knowledge of the data [17].

Using traditional machine learning methods rather than deep learning techniques has several advantages here. The Support Vector Machine outperformed the other techniques, and traditional methods are better suited to small datasets with outliers and to non-parametric models, as our results showed. Deep learning, on the other hand, is preferred when performance keeps growing as the number of training samples grows, when the large datasets it requires to function well are available, when a complicated structure necessitates learning multi-layered features, and when high expertise is required; it is used in a variety of industries, from autonomous driving to medical devices. Finally, since our dataset is limited, we applied standard machine learning algorithms to achieve the best results.

4 Conclusions

Due to its profitability, the development of software-based systems and the founding of software companies have increased in recent years. However, in any business, and especially in a software company, some projects fail. One way to avoid software project failure is to fill the skill gaps of software project managers by strengthening their command of the project management knowledge areas, which are the key issues in software project management. In our country, Ethiopia, software projects are often not led by professionals, and the functionality, schedule, budget, and risk of software projects are not managed properly due to a lack of knowledge about Project Management Knowledge Areas (PMKAs).

The machine learning model developed in this work is intended to assist project managers in predicting the failure of Project Management Knowledge Areas (PMKAs) for a specific project. A literature review was conducted to identify candidate features, which were then evaluated against unambiguity, consistency, measurability, and practicability criteria to discover the attributes most important for predicting failed knowledge areas. On this basis, a machine learning model was developed to predict failed Project Management Knowledge Areas (PMKAs). The model covered three factors: project manager context, project context, and company context. This research used a total of 443 records and 9 attributes to predict the failure of the Project Management Knowledge Areas (PMKAs). Noise removal and handling of missing values were performed to prepare the dataset for the experiments. To build the model, we used machine learning algorithms such as Decision Trees (DT), Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). Accuracy, precision, and recall were used to evaluate the performance of the developed model, comparing its predictions with the actual data at hand, which hold the values of the nine attributes and the ten project management knowledge areas. The results demonstrated that the Support Vector Machine (SVM) technique is more efficient than the other candidate algorithms at predicting failed Project Management Knowledge Areas (PMKAs).




In terms of accuracy, the significance of the produced model lies in improving how failed areas of project management expertise are anticipated.

5 Future works

In terms of future research, we recommend the following:

1. Conduct various types of empirical research on predicting and reporting the effectiveness of project management knowledge areas to assist project managers, and predict project management knowledge area failure by compiling multiple failed-project datasets, applying deep learning approaches, and comparing the results with ours.
2. Test the effect of attribute reduction on the performance of the selected algorithms, or of other machine learning algorithms, by adding more features and criteria.

Funding The authors have not disclosed any funding.

Data availability The datasets and source code analyzed during the current study are publicly available at this link: https://colab.research.google.com/drive/1k84ZYMIXW4gjpKn1BQDjiJEfgPUzT4C3#scrollTo=34hjffZL2Oj9.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal ties that may have influenced this work.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Javed SA, Liu S (2017) Evaluation of project management knowledge areas using grey incidence model and AHP. In: 2017 international conference on grey systems and intelligent services (GSIS), pp 120–120. https://doi.org/10.1109/GSIS.2017.8077684
2. Houston SM (2017) Project knowledge areas. In: The project manager's guide to health information technology implementation, 2nd Edition, p 16. https://doi.org/10.1201/b22038-4/project-knowledge-areas-susan-houston. Accessed 6 Apr 2022
3. Oun TA, Blackburn TD, Olson BA, Blessner P (2016) An enterprise-wide knowledge management approach to project management. EMJ 28(3):179–192. https://doi.org/10.1080/10429247.2016.1203715
4. Saleem N (2019) Empirical analysis of critical success factors for project management in global software development. In: 2019 ACM/IEEE 14th international conference on global software engineering (ICGSE), pp 68–71. https://doi.org/10.1109/ICGSE.2019.00025
5. Lehtinen TOA, Mäntylä MV, Vanhanen J, Itkonen J, Lassenius C (2014) Perceived causes of software project failures—an analysis of their relationships. Inf Softw Technol 56(6):623–643. https://doi.org/10.1016/j.infsof.2014.01.015
6. Klotins E, Unterkalmsteiner M, Gorschek T (2019) Software engineering in start-up companies: an analysis of 88 experience reports. Empir Softw Eng 24(1):68–102. https://doi.org/10.1007/s10664-018-9620-y
7. Knodel J, Manikas K (2015) Towards a typification of software ecosystems. In: Fernandes J, Machado R, Wnuk K (eds) Software Business. ICSOB 2015. Lecture Notes in Business Information Processing, vol 210. Springer, Cham. https://doi.org/10.1007/978-3-319-19593-3_5
8. Alojail M, Bhatia S (2020) A novel technique for behavioral analytics using ensemble learning algorithms in E-commerce. IEEE Access 8:150072–150080. https://doi.org/10.1109/ACCESS.2020.3016419
9. Sheikh RA, Bhatia S, Metre SG, Faqihi AYA (2022) Strategic value realization framework from learning analytics: a practical approach. J Appl Res High Educ 14(2):693–713. https://doi.org/10.1108/JARHE-10-2020-0379
10. Gandhi P, Khan MZ, Sharma RK, Alhazmi OH, Bhatia S, Chakraborty C (2022) Software reliability assessment using hybrid neuro-fuzzy model. Comput Syst Sci Eng 41(3):891–902. https://doi.org/10.32604/csse.2022.019943
11. Ramadan N, Abdelaziz A, Salah A (2016) A hybrid machine learning model for selecting suitable requirements elicitation techniques. Int J Comput Sci Inf Secur 14(6):1–12
12. Komi-Sirviö S (2004) Development and evaluation of software process improvement methods. VTT
13. Jain R, Suman U (2018) A project management framework for global software development. ACM SIGSOFT Softw Eng Notes 43(1):1–10. https://doi.org/10.1145/3178315.3178329
14. Wanberg CR, Ali AA, Csillag B (2020) Job seeking: the process and experience of looking for a job. Annu Rev Org Psychol Org Behav 7:315–337. https://doi.org/10.1146/annurev-orgpsych-012119-044939
15. Eastham J, Tucker DJ, Varma S, Sutton SM (2014) PLM software selection model for project management using hierarchical decision modeling with criteria from PMBOK® knowledge areas. EMJ 26(3):13–24. https://doi.org/10.1080/10429247.2014.11432016
16. Dey A (2022) Machine learning algorithms: a review. https://ijcsit.com/docs/Volume%207/vol7issue3/ijcsit2016070332.pdf. Accessed 6 Apr 2022
17. Hassanat AB, Abbadi MA, Altarawneh GA, Alhasanat AA (2014) Solving the problem of the K parameter in the KNN classifier using an ensemble learning approach. http://sites.google.com/site/ijcsis/
18. Osisanwo FY, Akinsola JE, Awodele O, Hinmikaiye JO, Olakanmi O, Akinjobi J (2017) Supervised machine learning algorithms: classification and comparison. Int J Comput Trends Technol 48(3):128–138. https://doi.org/10.14445/22312803/IJCTT-V48P126
19. Bhatia N, Vandana A (2010) Survey of nearest neighbor techniques. Int J Comput Sci Inf Secur. https://doi.org/10.48550/arXiv.1007.0085



20. Taneja S, Gupta C, Goyal K, Gureja D (2014) An enhanced K-nearest neighbor algorithm using information gain and clustering. In: International conference on advanced computing and communication technologies, ACCT, pp 325–329. https://doi.org/10.1109/ACCT.2014.22
21. Mece EK, Binjaku K, Paci H (2020) The application of machine learning in test case prioritization—a review. Eur J Electr Eng Comput Sci. https://doi.org/10.24018/ejece.2020.4.1.128

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


