Machine Learning Models For Estimating Preliminary Factory Construction Cost: Case Study in Southern Vietnam


International Journal of Construction Management

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/tjcm20

Machine learning models for estimating preliminary factory construction cost: case study in Southern Vietnam

Nguyen Dang-Trinh, Pham Duc-Thang, Tran Nguyen-Ngoc Cuong & Tran Duc-Hoc

To cite this article: Nguyen Dang-Trinh, Pham Duc-Thang, Tran Nguyen-Ngoc Cuong & Tran
Duc-Hoc (2022): Machine learning models for estimating preliminary factory construction
cost: case study in Southern Vietnam, International Journal of Construction Management, DOI:
10.1080/15623599.2022.2106043

To link to this article: https://doi.org/10.1080/15623599.2022.2106043

Published online: 05 Aug 2022.


Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=tjcm20

Machine learning models for estimating preliminary factory construction cost: case study in Southern Vietnam

Nguyen Dang-Trinh (a, b), Pham Duc-Thang (a, b), Tran Nguyen-Ngoc Cuong (c) and Tran Duc-Hoc (a, b)

(a) Faculty of Civil Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
(b) Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
(c) Faculty of International Business and Economics, University of Economics & Business (VNU-UEB), Vietnam National University, Ho Chi Minh City, Vietnam

ABSTRACT
Construction of industrial enterprises has become more necessary in recent years. It is critical for project managers to estimate the entire cost of a building project at an early stage. Existing approaches rely on estimator experience expressed as a mathematical formula. Initial estimates are inaccurate because few data points are available, which leads to overruns in project costs. This research uses several machine learning techniques to predict preliminary factory construction cost. Five popular numeric predictive techniques, namely support vector machine (SVM), artificial neural network (ANN), generalized linear regression (GENLIN), classification and regression trees (CART), and exhaustive chi-squared automatic interaction detection (CHAID), are used for baseline and ensemble models. A deep learning neural network (DLNN) is also used in this study. The machine learning models are trained and tested on actual data gathered in the southern part of Vietnam. Deep learning outperforms all other machine learning algorithms in this comparison, while the ensemble model of artificial neural networks and generalised linear regression also fared well. With access to a variety of estimation methods, cost estimators can quickly pick the best model for projecting the preliminary construction cost of a factory.

KEYWORDS
Deep learning; ensemble model; industrial construction; machine learning; preliminary cost

Introduction

In recent years, Vietnam has transitioned from a nation with a predominantly agrarian economy to one of Asia's leading industrial hubs. The country has 335 industrial parks, of which 260 are established, with a total area of approximately 68.7 thousand hectares, and 75 are under construction, with a total area of approximately 29.2 thousand hectares, according to a report by CBRE, Commercial Real Estate Services (Vietnam 2020). The occupancy rate of industrial parks in operation is about 75.7%. As of 2019, the southern region had about 380,500 m2 of ready-built factories, an increase of 18.9% over the same period of 2018, while the northern region had 321,420 m2, an increase of 25.2% over the same period of 2018. Currently, foreign investors are promoting cooperation with domestic industrial developers to maximize the potential of the industrial market in Vietnam. Therefore, the need to build industrial factories will increase in the coming years.

Cost estimation is one of the key standards for project managers to determine the total project budget at the initial stages (Murat Günaydın and Zeynep Dogan 2004). Cost control also plays an important role in staying competitive while retaining superior quality (Enshassi et al. 2013). Decisions made at the planning stage influence the building cost. This influence diminishes throughout the building project stages, while the committed costs rise (Al-Tawal et al. 2021). Erroneous cost estimates lead to project budget overruns. As a result, projects are rarely completed as initially planned, and the expected profits may turn into losses (Cheng et al. 2010). Owing to the lack of primary information, additional expenditure and excessive spending occur during the project life cycle (Elmousalami 2021). Project managers seldom handle the growing impact of cost overruns and escalation efficiently (Sayed et al. 2020). Consequently, numerous cost estimation methods have been applied in the planning phases to obtain precise approximate calculations and overcome the cost overrun problem.

Estimators have used various approximate cost techniques at various project development stages using available data, including detailed work breakdown structure cost estimation (Pettang et al. 1997), cost function calculation, activity-based cost approximation, the weighted proportion cost approach, and computer programs (Mohamed and Celik 2002; Ganorkar et al. 2017). The conventional cost estimation process uses quantity take-offs based on information from blueprints and specifications. By contrast, the comparative cost estimation method depends on building parameters such as type, size, and capacity. The comparative method presumes a linear dependency between essential design variables and the final project cost. Nevertheless, the assumption of a linear relationship may not hold in practice.

Advancements in computing and software technologies have generated new methodologies for estimating project cost (Elmousalami 2020, 2021; Pham et al. 2021). The application of machine learning (ML) techniques to project cost estimation has been investigated because of their ability to deal with multiple and non-linear relationships (Maya et al. 2021; Shartooh Sharqi and Bhattarai 2021). Several researchers have shown that ML techniques are able to deal with cost estimation efficiently with

CONTACT Tran Duc-Hoc [email protected]


© 2022 Informa UK Limited, trading as Taylor & Francis Group

low errors in predicting final project costs. An ensemble model combines machine learning techniques, building on the strengths of single learners to achieve better prediction performance than a single model (Chou et al. 2022). Deep learning can learn complicated relations and high-level data features (Ning et al. 2020). An estimation model generated from a deep neural network can efficiently handle the aforementioned problems and enhance prediction accuracy. Therefore, emerging technologies are able to generate practical and precise outcomes under actual circumstances.

Economic growth has led to investment in factory building in southern Vietnam. Domestic and foreign investment capital in factories has increased substantially in recent years. Several capital sources, such as budget capital, non-budget capital and social capital, have been invested. Cost estimation is a crucial criterion for decision makers when investing in construction, especially at the idea formation stage. Estimators often use the conventional cost estimation process, which may cause large errors due to the lack of necessary information. Therefore, this research aims at finding the factors that influence preliminary factory construction cost.

This study also applies various machine learning techniques to seek the most suitable predictive models for estimating preliminary factory construction cost. Five popular numeric predictive algorithms, namely support vector machine (SVM), artificial neural network (ANN), generalized linear regression (GENLIN), classification and regression trees (CART), and exhaustive chi-squared automatic interaction detection (CHAID), as well as ensemble models, are compared for preliminary cost estimation. Moreover, a deep learning neural network is introduced to possibly improve cost estimation. A k-fold cross-validation approach is further used to avoid randomness in selecting the testing fold.

Related works on cost estimation

Many extensive studies of construction cost estimation using artificial intelligence and machine learning models have been discussed (Elmousalami 2021). Previous studies can be classified into six groups: artificial neural network (ANN), fuzzy logic (FL), regression, support vector machine (SVM), case-based reasoning (CBR), and hybrid models (Elmousalami 2020). Ambrule and Bhirud (2017) applied ANN to estimate the preliminary cost of building projects in order to overcome errors at the initial phases of construction. Juszczyk et al. (2018) investigated the application of ANNs in calculating the overall cost of building activities for playing grounds. Maya et al. (2021) designed a neural-network-based model for estimating future project performance.

Fuzzy system techniques have been used to estimate construction project costs for years. Yang and Xu (2010) proposed a fuzzy technique with four inputs and one output to estimate building projects with a maximum error of 3.2%. Zhai et al. (2013) used fuzzy c-means to establish a fuzzy system for predicting cost. Karatas and Ince (2016) modeled an expert tool based on fuzzy logic for satellite cost prediction. The above-mentioned fuzzy methods used experts' opinions to generate fuzzy rules. Hence, the hybridization of fuzzy and other techniques for cost estimation is a new evolving trend. Cheng and Roy (2010) hybridized a fuzzy approach, an evolutionary algorithm, and a neural network model to estimate conceptual cost. Fuzzy logic was applied to the input and output data. Zhu et al. (2010) built a model based on fuzzy logic and genetic neural networks for the estimation of project cost.

Several researchers have applied linear and multiple regression models to predict project cost at the initial phases because of their simplicity and ease of use in software systems. Lowe et al. (2006) confirmed that the log of cost backward model is the best regression model for estimating building construction cost out of six proposed linear regression models. Stoy et al. (2008) applied regression analysis to estimate residential building construction costs. Nasrazadani et al. (2017) created a modeling framework that uses Bayesian regression for retrofit cost prediction. The regression model has become popular in cost estimation due to its simplicity. Nevertheless, SVM and CBR have superior performance in dealing with nonlinear data compared to the regression model.

An et al. (2007) assessed conceptual cost estimates using the support vector machine technique. SVM outperformed the discriminant analysis technique in the estimation results. HongWei (2009) integrated SVM with rough set theory to enhance the prediction accuracy of building construction cost. Son et al. (2012) hybridized principal component analysis and SVM to accurately predict project performance in the preparation phase. Petruseva et al. (2016) demonstrated the superiority of SVM over regression models in the estimation precision of bidding prices. The CBR method, which works by progressively searching for similar past situations, is a promising technique for cost estimation (Kwon et al. 2017; Hyung et al. 2019). An et al. (2007) introduced a case-based reasoning model for predicting construction cost in which experience is incorporated through the analytic hierarchy process.

Among cost estimation techniques, hybrid models are the current trend because these methods are able to yield high accuracy in predicting outcomes. Moreover, hybrid models can eliminate the drawbacks of a single model. Cheng et al. (2013) proposed a hybrid model that uses an evolutionary algorithm to optimize the parameters of least squares SVM to predict the construction cost index. Arabzadeh et al. (2018) showed that hybrid models achieved higher accuracy in cost estimation than the single model. Shoar et al. (2022) used a hybrid model based on random forest regression to predict the cost increase of high-rise residential building projects. Das et al. (2022) hybridized seasonal regression and an artificial neural network to forecast the wind energy production cost. Hybrid models have proved to be appropriate techniques for estimating construction cost with stable and highly accurate results.

Ensemble models and deep neural networks have recently been introduced for cost estimation with high accuracy. Williams and Gong (2014) proposed a stacking ensemble and text mining method to predict cost overruns based on project contract documentation. Cao et al. (2018) developed a powerful ensemble method for estimating unit price bids. The proposed model outperformed any of the constituent learning algorithms and the baseline models. Meharie et al. (2021) demonstrated the use of the stacking ensemble-learning method for estimating construction project costs with high accuracy. Ning et al. (2020) applied a convolutional neural network technique to estimate manufacturing cost. Bodendorf et al. (2021) investigated the use of a deep learning neural network based on image processing, auto encoding, and regression to calculate the manufacturing cost of motherboards.

Several researchers have successfully identified many key cost drivers in construction projects (Elmousalami 2020). ElSawy et al. (2011) determined the ten most important

cost drivers out of 52 factors via a questionnaire survey based on experts' judgment. Kim (2013) used a questionnaire survey and factor analysis to identify and rank the factors affecting a set of guidelines for infrastructure projects. Marzouk and Elkadi (2016) relied on experts' questionnaire responses to choose the causal factors for water purifying systems. El-Sawah and Moselhi (2014) collected a data set from 35 low-rise structural steel buildings for preliminary cost estimating. Lotfy and Mohamed (2002) used 480 real projects as input data for their proposed model.

An extensive review shows that there are few studies on estimating the early construction cost of a factory building. In particular, there is no research on predicting preliminary factory construction cost in Vietnam. Cong and Minh (2020) used ANN to estimate school construction costs in Ho Chi Minh City, Vietnam. The cost estimation model used real data from 27 school projects and yielded an accuracy of over 90%. Truong and Soo-Yong (2009) used a neural network model to predict apartment construction cost in Vietnam. Data from 14 samples were used for training and the last five cases were used for testing. The current study aims at determining the most effective and precise estimating methods for preliminary factory construction cost. Moreover, this research collects actual data on 35 industrial park projects for model implementation.

Machine learning models

Baseline predicting approaches

Artificial neural networks (ANNs)

ANNs are powerful tools that imitate the human neural system for effective prediction. This technique can learn from past data to predict results. A basic ANN structure is composed of three layers: input, hidden, and output. The input layer holds the input data as variables, which are passed through one or more hidden layers for calculation, and the output layer yields the predicted preliminary factory construction cost (Chou et al. 2022). More specifically, the ANN begins its computation by feeding an array of numbers X to the input layer. Next, each input $x_i$ is fed forward through a transfer function to the n neurons in the hidden layer. The weights w connect the neurons to the output. Equation (1) is applied to calculate the outputs of each layer:

$$Y_{k}^{n} = f\left(W_{n,m,k} \cdot X_{m} + b_{i,k}\right) \tag{1}$$

where $f(\cdot)$ is the activation function, k denotes the layer number, n denotes the number of neurons, m represents the number of weights for each transferring neuron, and i denotes the number of bias nodes.

Support vector machine (SVM)

SVM is a supervised machine learning method introduced by Vapnik (1995). It is useful in multivariate regression and classification problems. When the data are continuous, support vector regression (SVR), one of the SVM variants, is used. The general model of SVR for prediction problems is given in Eq. (2):

$$f(x) = \sum_{i=1}^{n} \omega_{i}\, g_{i}(x) + b \tag{2}$$

where $\omega_i$ denotes a weight, $g_i(x)$ represents a group of nonlinear transformations, and b denotes a bias term.

The generalized linear model (GENLIN)

GENLIN is a statistical analysis that uses historical cases for regression analysis (Nelder and Wedderburn 1972). Equation (3) sets the relationship between the input independent variables (X) and the output dependent variable (Y) under a data distribution assumption:

$$\eta = g\left(E(Y)\right) = \sum_{i} X_{i}\beta_{i} + O, \qquad Y \sim F \tag{3}$$

where $\eta$ is the linear predictor, O is an offset variable, $\beta_i$ denotes the slope coefficients, $X_i$ denotes the independent inputs, and F is the distribution of Y. The generalized linear model consists of three constituents: (1) an output variable Y that follows a particular random distribution with expected value $\mu$ and variance $\sigma^2$ ($E(Y) = \mu$); (2) a link function $g(\cdot)$ that connects the expected value $\mu$ of Y to the linear predictor $\eta$ [$\eta = g(\mu)$]; and (3) a structural model.

Classification and regression trees (CART)

CART is a basic machine learning algorithm that can deal with regression and classification problems (Breiman et al. 1984). The variables in CART can be numerical or categorical. A set of learning data is used to optimize a learning tree. The optimization process ensures robustness while keeping the model simple. Various impurity measures are used as the criterion for splitting nodes in CART. For example, Gini is usually picked for symbolic target fields. For continuous targets, the least-squared deviation is applied automatically, without any choice among measures. The Gini index g(t) can be formulated as Eq. (4):

$$g(t) = 1 - p(t)^{2} - \left(1 - p(t)\right)^{2} \tag{4}$$

where p(t) is the relative frequency of the first class in the node. The Gini index equals zero when only one class appears at a node.

Chi-squared automatic interaction detector (CHAID)

CHAID is a decision tree technique developed by Kass (1980) for both regression and classification tasks. The chi-square test is used to assess the gain in purity from node splitting. Specifically, the predictor with the highest correlation with the target variable at each node is chosen for splitting the node. When the tested predictor has no statistical significance, no split is performed and the algorithm stops.

A CHAID tree starts with the whole dataset and classifies subsets of the space into several offspring nodes. To determine the best separation at any node, any pair of allowed categories of the predictor variables is merged until there is no statistically significant difference within the pair for the target variable. The CHAID method natively handles the interactions between the independent variables, which are available directly from examination of the tree. The final nodes distinguish subsets defined by different groups of independent variables (Sut and Simsek 2011).

Deep neural networks (DNN)

DNNs are advanced versions of ANNs with additional depth, that is, an increased number of hidden layers between the input and the output layers. DNNs have a considerable advantage in extracting features at various abstraction levels and thereby learning more complicated data sets. DNNs have found many applications in various industrial and commercial problems (Asghari et al. 2021). Layers of nodes and neurons interact via activation functions to build a nonlinear connection between the input and output variables:

$$l_{i} = \sigma\left(W_{i}\, l_{i-1} + b_{i}\right) \tag{5}$$

where $l_i$ and $b_i$ are the output vector and bias of layer i, $W_i$ denotes the node weights, and $\sigma$ is the activation function of each layer.
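The layer-by-layer recursion of Eq. (5) can be illustrated with a minimal sketch in plain Python. It assumes a sigmoid activation for the hidden layer and an identity output; the toy weights, the layer sizes and the helper names (`dense_layer`, `forward`) are hypothetical and are not taken from the study, which built its networks in RapidMiner.

```python
import math


def sigmoid(z):
    """Logistic activation, used here as an assumed example of sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Linear output activation, a common choice for regression targets."""
    return z


def dense_layer(weights, bias, inputs, activation):
    """Compute l_i = activation(W_i * l_{i-1} + b_i) for one layer.

    weights: one row of coefficients per neuron in this layer
    bias:    one bias value per neuron
    inputs:  output vector of the previous layer (l_{i-1})
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def forward(layers, x):
    """Apply Eq. (5) repeatedly; layers is a list of (W, b, activation)."""
    out = x
    for weights, bias, activation in layers:
        out = dense_layer(weights, bias, out, activation)
    return out


# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 1 output.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], sigmoid),
    ([[1.0, -1.0]], [0.5], identity),
]
prediction = forward(toy_layers, [1.0, 2.0, 3.0])
```

Stacking more `(W, b, activation)` tuples in `toy_layers` is what "additional depth" means in practice; training the weights is outside the scope of this sketch.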

Ensemble model

Ensemble methods are powerful machine learning methods that integrate the best-performing models to improve overall performance. The ensemble approach can be expressed mathematically as a function $g: \mathbb{R}^{d} \rightarrow \mathbb{R}$ that maps a d-dimensional vector of input data to a one-dimensional output Y. An estimated function $g(\cdot)$ is obtained by a particular algorithm in each process. The linear combination in Eq. (6) is used to obtain an ensemble-based function $g_{en}(\cdot)$ as follows:

$$g_{en}(\cdot) = \sum_{j=1}^{N} c_{j}\, g_{j}(\cdot) \tag{6}$$

where $g_j(\cdot)$ is the function estimated by the j-th algorithm and $c_j$ are the linear combination coefficients, which are defined based on average values of weights.

Model construction and evaluation methods

Data collection

Data collection consisted of two phases. (1) A questionnaire was conducted to acquire preliminary data, so that the main factors affecting preliminary factory construction cost could be clearly identified. The scope of work focuses on factory construction cost in the southern region of Vietnam. The questionnaire method makes it possible to gather information from a large audience in a short period in a standardized way. (2) A real data set was collected from 35 industrial park projects in Vietnam for machine learning model evaluation.

In the first phase, a preliminary questionnaire was created using expert judgements and relevant suggestions from the literature. The variables comprised twenty factors collected from the literature review. A pilot study was conducted to determine the final questionnaire form. The preliminary survey was conducted with fifteen (15) professionals with at least three years of experience in bidding and factory construction in Vietnam, through face-to-face and online interviews. Because of the difficulty in contacting experts mainly involved in factory projects, as well as limited time, only eight responses were returned, from the bidding department (four respondents) and the construction department (four respondents). These eight specialists contributed to a pilot test and suggested minor changes in the design stage of the questionnaire. The approved questionnaire was ready for use in the field after those changes were made.

The approved questionnaire consists of sixteen main questions to identify the influential factors. The field survey applied two methods, interview and survey, to reduce reluctance to respond. The main questionnaire comprises three main parts. The first part presents the survey purposes and fundamental knowledge about preliminary factory construction cost to the targeted respondents. In the second part, the respondents are asked to evaluate the level of influence of each factor on a five-point Likert scale, where 1 denotes almost no influence and 5 represents extreme influence. The final part gathers the respondent's information, including organization, designation, and years of experience. Unanswered questionnaires and those with identical responses to all questions were eliminated from the data. Outliers were removed via the boxplot method (Schwertman et al. 2004).

The purpose of the above process was to determine the most crucial factors that impact preliminary factory construction cost. The study distributed a total of 200 questionnaires to respondents in the southern region of Vietnam. A total of 178 valid responses were received, representing a response rate of 89%. Inadequate data were removed before conducting the statistical analysis in SPSS V23 (Landau 2017). The critical factors were identified by mean value and Cronbach's α coefficient analysis (Hair et al. 2013). The variance inflation factor (VIF) was applied to examine multicollinearity (Hair et al. 2019; Nguyen et al. 2022). A VIF value equal to or greater than 5 indicates that the regression model under development has a high probability of exhibiting multicollinearity, and vice versa. The inner VIF values are in an adequate range (1 to 1.923, below 5); therefore, multicollinearity is ruled out. Table 1 lists the ten input variables that were used for estimating preliminary factory construction cost.

In the second phase, a data set was collected from the industrial park projects. Table 2 provides the complete input and output data sets.

K-fold cross-validation is a resampling technique for assessing the performance of a machine learning model. This method tends to have a lower bias than random sampling methods. This study applied fivefold validation testing to provide a reasonable computing time and minor variation, following the suggestion of Kohavi (1995) and considering the size of the sample data. The general process of stratified fivefold cross-validation comprises the following critical tasks: (1) dividing the sample data into five subsets, (2) selecting a separate subset for testing and the remaining subsets for training, and (3) repeating model training and testing five times. The model performance is evaluated on the testing subset, as shown in Figure 1. The average of the results obtained over the five testing rounds expresses the accuracy of the model under consideration.

Model construction and criteria

RapidMiner (Minerswa et al. 2001) is used to implement the predictive models for cost estimation. RapidMiner offers an easy-to-use human-computer interface for executing an analytical process. The user can use simple button clicks to input data, set algorithm parameters, and build models. Figure 2 depicts five steps of model construction, which are explained below.

Table 1. The critical influencing factors.

| ID | Critical variable | Unit | Type | Categories or range |
|---|---|---|---|---|
| X1 | Building location | NA | Set | Long An, Binh Duong, Dong Nai, Ho Chi Minh City, Quang Nam, Quang Ngai, Vung Tau, Binh Phuoc, Da Nang |
| X2 | Purpose of use | NA | Set | Factory, factory and office, factory and ground |
| X3 | Area | m2 | Range | [min-max] = [1272-15525] |
| X4 | Building height | m | Range | [min-max] = [8-20.4] |
| X5 | Column spacing | m | Range | [min-max] = [6-10] |
| X6 | Number of stories | Story | Discrete | [min-max] = [1-3] |
| X7 | Crane load | Ton | Discrete | [min-max] = [0-9] |
| X8 | Foundation type | NA | Set | Isolated footing, pile foundation |
| X9 | Wall type | NA | Set | Brick, colored sheet metal, panel, fireproof panel |
| X10 | Roof type | NA | Set | Colored sheet metal, panel, fireproof panel |
| Y | Preliminary factory construction cost | Million VND | Range | [min-max] = [4937-51474] |

Table 2. Data set.

| X1 | X2 | X3 (m2) | X4 (m) | X5 (m) | X6 | X7 (ton) | X8 | X9 | X10 | Y (million VND) |
|---|---|---|---|---|---|---|---|---|---|---|
| Long An (5) | Factory and office (2) | 1926 | 13 | 7.5 | 1 | 0 | Pile foundation (1) | Brick (1) | Panel (1) | 20279 |
| Binh Duong (1) | Factory (1) | 7300 | 13.5 | 8.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 20430 |
| Binh Duong (1) | Factory (1) | 4446 | 9.9 | 6.0 | 1 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 15741 |
| Vung Tau (9) | Factory (1) | 2880 | 15.7 | 7.5 | 1 | 0 | Pile foundation (1) | Panel (2) | Panel (1) | 9842 |
| Vung Tau (9) | Factory (1) | 1272 | 13.7 | 7.5 | 1 | 0 | Pile foundation (1) | Panel (2) | Panel (1) | 4938 |
| Dong Nai (4) | Factory and office (2) | 7800 | 11.3 | 10.0 | 1 | 0 | Pile foundation (1) | Fireproof panel (3) | Fireproof panel (2) | 34885 |
| Binh Phuoc (2) | Factory and ground (3) | 3825 | 13.3 | 7.0 | 1 | 3 | Isolated footing (2) | Panel (2) | Panel (1) | 11340 |
| Vung Tau (9) | Factory and ground (3) | 4000 | 13.3 | 8.3 | 1 | 6 | Pile foundation (1) | Panel (2) | Panel (1) | 15845 |
| Vung Tau (9) | Factory and ground (3) | 8760 | 13.3 | 8.3 | 1 | 0 | Pile foundation (1) | Panel (2) | Panel (1) | 31340 |
| Quang Ngai (7) | Factory and office (2) | 15552 | 11.4 | 8.0 | 1 | 0 | Pile foundation (1) | Colored sheet metal (4) | Colored sheet metal (3) | 34788 |
| Binh Duong (1) | Factory and office (2) | 8550 | 14.7 | 8.4 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 27174 |
| Binh Duong (1) | Factory (1) | 2850 | 8.0 | 7.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 5545 |
| Binh Duong (1) | Factory and office (2) | 7443 | 11.0 | 9.5 | 1 | 2 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 24437 |
| Dong Nai (4) | Factory (1) | 1800 | 10.3 | 8.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 3505 |
| Ho Chi Minh City (8) | Factory (1) | 2765 | 14.1 | 7.0 | 1 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 7128 |
| Binh Duong (1) | Factory (1) | 8880 | 15.0 | 5.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 33125 |
| Vung Tau (9) | Factory and office (2) | 9350 | 14.0 | 7.5 | 1 | 0 | Pile foundation (1) | Panel (2) | Panel (1) | 27249 |
| Binh Duong (1) | Factory (1) | 6715 | 10.4 | 8.0 | 1 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 18927 |
| Quang Nam (6) | Factory (1) | 7200 | 13.3 | 8.0 | 2 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 46466 |
| Da Nang (3) | Factory (1) | 2249 | 10.4 | 1.0 | 1 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 6705 |
| Dong Nai (4) | Factory (1) | 3150 | 18.7 | 7.5 | 3 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 28300 |
| Binh Duong (1) | Factory and ground (3) | 9750 | 19.2 | 7.0 | 1 | 9 | Pile foundation (1) | Colored sheet metal (4) | Colored sheet metal (3) | 51475 |
| Binh Duong (1) | Factory (1) | 3000 | 11.4 | 7.5 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 7016 |
| Binh Duong (1) | Factory and office (2) | 4380 | 9.1 | 7.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 11243 |
| Binh Duong (1) | Factory and office (2) | 3060 | 14.3 | 8.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 7714 |
| Dong Nai (4) | Factory (1) | 3324 | 15.4 | 7.5 | 2 | 0 | Pile foundation (1) | Colored sheet metal (4) | Colored sheet metal (3) | 13938 |
| Dong Nai (4) | Factory and office (2) | 3854 | 20.4 | 9.2 | 3 | 0 | Pile foundation (1) | Colored sheet metal (4) | Colored sheet metal (3) | 33786 |
| Binh Duong (1) | Factory (1) | 5134 | 13.3 | 7.2 | 2 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 28358 |
| Vung Tau (9) | Factory (1) | 5120 | 11.8 | 8.0 | 1 | 9 | Isolated footing (2) | Panel (2) | Colored sheet metal (3) | 13363 |
| Vung Tau (9) | Factory (1) | 4400 | 11.8 | 8.0 | 1 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 8557 |
| Vung Tau (9) | Factory and ground (3) | 2000 | 11.8 | 8.0 | 1 | 2 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 7215 |
| Dong Nai (4) | Factory and ground (3) | 4635 | 19.2 | 8.0 | 2 | 0 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 28550 |
| Vung Tau (9) | Factory (1) | 5880 | 9.7 | 7.0 | 1 | 0 | Isolated footing (2) | Panel (2) | Panel (1) | 12901 |
| Binh Duong (1) | Factory and ground (3) | 4995 | 13.3 | 7.0 | 1 | 3 | Isolated footing (2) | Colored sheet metal (4) | Colored sheet metal (3) | 18433 |
| Long An (5) | Factory (1) | 2800 | 11.4 | 8.0 | 1 | 0 | Pile foundation (1) | Panel (2) | Panel (1) | 9269 |

Note: the number inside the parentheses denotes the numerical input.
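The integer coding noted under Table 2 could be reproduced along the following lines. This is only a sketch: the code values are copied from the parenthesised numbers in Table 2, while the dictionary names, the record field names and the `encode_row` helper are hypothetical and do not come from the study.

```python
# Integer codes as they appear in parentheses in Table 2.
LOCATION = {
    "Binh Duong": 1, "Binh Phuoc": 2, "Da Nang": 3, "Dong Nai": 4,
    "Long An": 5, "Quang Nam": 6, "Quang Ngai": 7,
    "Ho Chi Minh City": 8, "Vung Tau": 9,
}
PURPOSE = {"factory": 1, "factory and office": 2, "factory and ground": 3}
FOUNDATION = {"pile foundation": 1, "isolated footing": 2}
WALL = {"brick": 1, "panel": 2, "fireproof panel": 3, "colored sheet metal": 4}
ROOF = {"panel": 1, "fireproof panel": 2, "colored sheet metal": 3}


def encode_row(row):
    """Turn one project record (a dict) into the numeric vector X1..X10."""
    return [
        LOCATION[row["location"]],
        PURPOSE[row["purpose"].lower()],
        row["area_m2"],
        row["height_m"],
        row["column_spacing_m"],
        row["stories"],
        row["crane_load_ton"],
        FOUNDATION[row["foundation"].lower()],
        WALL[row["wall"].lower()],
        ROOF[row["roof"].lower()],
    ]


# First row of Table 2, written with the hypothetical field names above.
first_project = {
    "location": "Long An", "purpose": "factory and office",
    "area_m2": 1926, "height_m": 13, "column_spacing_m": 7.5,
    "stories": 1, "crane_load_ton": 0,
    "foundation": "Pile foundation", "wall": "Brick", "roof": "Panel",
}
features = encode_row(first_project)
```

Integer codes impose an arbitrary ordering on categories such as location; whether the study used them directly or re-encoded them inside RapidMiner is not stated in the text.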

Figure 1. Fivefold cross-validation method for resampling data.
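The fivefold procedure can be sketched in plain Python as follows. Several assumptions apply: the folds are formed by a simple shuffled split rather than the stratified split mentioned in the text, the model is a one-variable least-squares line (cost against floor area) standing in for the study's learners, and only the first ten (area, cost) pairs of Table 2 are used to keep the example short. All function names are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat k times."""
    rmse_scores, mae_scores = [], []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        errors = [slope * xs[i] + intercept - ys[i] for i in test_idx]
        rmse_scores.append(math.sqrt(sum(e * e for e in errors) / len(errors)))
        mae_scores.append(sum(abs(e) for e in errors) / len(errors))
    return rmse_scores, mae_scores


def mean_and_std(scores):
    """Average and population standard deviation over the folds."""
    mean = sum(scores) / len(scores)
    return mean, math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))


# First ten (area in m2, cost in million VND) pairs from Table 2.
areas = [1926, 7300, 4446, 2880, 1272, 7800, 3825, 4000, 8760, 15552]
costs = [20279, 20430, 15741, 9842, 4938, 34885, 11340, 15845, 31340, 34788]

rmse_per_fold, mae_per_fold = cross_validate(areas, costs, k=5)
rmse_avg, rmse_std = mean_and_std(rmse_per_fold)
mae_avg, mae_std = mean_and_std(mae_per_fold)
```

Reporting both the average and the spread of the per-fold errors shows how much the result depends on which projects happen to land in the testing fold.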

• First step (loading data): This retrieve button is used to access data in the repository and load them into the process.
• Second step (select attributes): This mechanism uses various filter types to select attributes. The subsequent process operates only on the selected attributes.
• Third step (set role): This node defines the function of the selected attributes. The operator also specifies the input and target variables.
• Fourth step (cross validation): This task uses a k-fold cross-validation method to evaluate the statistical model performance.
• Fifth step (building model): The model is built and tested in this phase.

Figure 2. Predictive models construction using RapidMiner.

The above five steps are also used to implement the deep neural network and the ensemble model.
The machine learning model performance was measured via four statistical indicators: the correlation coefficient (R), mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE). The R value evaluates the correlation between two variables; the higher the absolute value of R, the stronger the relationship. The MAPE expresses accuracy as a percentage and is based on absolute values. The RMSE is the sample standard deviation of the differences between estimated and actual values. The MAE presents the absolute errors between the estimated and actual values. Lower values of MAPE, RMSE, and MAE indicate stronger confidence in predictive accuracy. The mathematical formulas of these indicators are given in Eqs. (7)–(10).

$$R = \frac{n\sum y_a y_p - \left(\sum y_a\right)\left(\sum y_p\right)}{\sqrt{n\sum y_a^2 - \left(\sum y_a\right)^2}\,\sqrt{n\sum y_p^2 - \left(\sum y_p\right)^2}} \tag{7}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_p - y_a}{y_a}\right| \tag{8}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_p - y_a\right)^2} \tag{9}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_p - y_a\right| \tag{10}$$

where $y_a$ represents an actual value, $y_p$ denotes a predicted value, and $n$ is the number of samples.

Experimental results and discussion
Results and analysis

Table 3 reports the statistical performance measurements of the designed ensemble model, the DNN, and the other single predictive approaches (ANN, SVM, CART, GENLIN, and CHAID). The average and standard deviation of each indicator summarize the cross-fold modeling performance over the testing folds of the considered models. Among the commonly used predictive models, the ANN obtained the best predictive accuracy. The deep neural network predicted the preliminary factory construction cost more effectively than all baseline models. Notably, the ensemble of the two best single models (ANN + GENLIN) was superior to the baseline models in all cases. The ensemble model achieved an RMSE of 4798.831 MilVND, an MAE of 3915.893 MilVND, a MAPE of 20.16%, and an R of 0.932. This MAE is acceptable for a project with a large total investment, but it is a considerable error for a project with a small total investment. The correlation coefficient (R) of 0.932 is relatively high, indicating a good linear correlation between the actual and estimated values.

Table 3. Comparisons of machine learning approaches.
Model            RMSE (MilVND)          MAE (MilVND)           MAPE (%)        R
                 Avg.         Std.      Avg.         Std.      Avg.    Std.    Avg.    Std.
ANN              5661.965     479.25    4153.246     425.31    22.49   5.74    0.910   0.125
SVM              12,162.688   981.53    10,538.749   7831.26   88.78   8.76    0.763   0.167
CART             7272.326     717.18    5174.612     625.15    32.16   7.89    0.861   0.159
GENLIN           6559.556     596.21    4890.241     515.38    35.54   7.17    0.849   0.148
CHAID            7364.911     754.29    5352.942     629.31    31.97   6.98    0.894   0.157
DNN              4911.216     415.79    4019.731     376.84    21.70   4.98    0.921   0.109
Ensemble model   4798.831     402.87    3915.893     381.29    20.16   4.25    0.932   0.093
Note: Avg. = average; Std. = standard deviation.

Figure 3 presents the actual values and the values predicted by the best machine-learning model (Ensemble). The highest and lowest absolute deviations between actual and predicted values are 5988.5 MilVND and 92.4 MilVND, respectively. The horizontal axis denotes the index of instances in the testing data of all folds; the vertical axis represents the preliminary factory construction cost.
Computational time is another important indicator that should be considered for model evaluation. For a fair comparison, all considered models were run on the same hardware platform and measured in CPU time. The computational times of the single models in estimating the preliminary factory construction cost were as follows: the ANN needed the least computation time, with an average of 5 seconds, followed by the SVM (6 sec), CART (7 sec), GENLIN (8 sec), CHAID (8 sec), and DNN (7 sec). These models were built in RapidMiner Studio with the given optimal parameters. Thus, these models provide a good basis for developing a cost estimation system.

Figure 3. The actual values and predicted values obtained by the best model.
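One way to picture an ensemble of two regressors such as ANN and GENLIN is a weighted average of their predictions. The sketch below is an assumption made for illustration: this section does not state how the study combined the two models, and the function name `ensemble_predict`, the equal weights, and the prediction values are all hypothetical.

```python
def ensemble_predict(pred_a, pred_b, weight_a=0.5):
    """Blend two models' predictions by weighted averaging.

    Averaging is an assumed combination rule for illustration only;
    weight_a is the share given to the first model.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(pred_a) != len(pred_b):
        raise ValueError("both models must predict the same samples")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical per-project cost predictions in MilVND.
ann_pred = [9000.0, 14000.0, 27000.0]
genlin_pred = [8000.0, 12000.0, 30000.0]

print(ensemble_predict(ann_pred, genlin_pred))  # [8500.0, 13000.0, 28500.0]
```

When the two models err in opposite directions on a project, the average lands closer to the actual cost than either prediction alone, which is one reason a combined model can outperform its constituents.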

Discussion

The baseline models require users only to pick the models and parameters in the software package, which makes them easy to use and implement. Nevertheless, their predictive precision is lower than that of the DNN or the ensemble model. The baseline models are therefore especially appropriate for beginning users of machine learning. Users with more knowledge of machine learning methodology will choose the deep neural network and ensemble techniques because of their outstanding predictive ability.
The machine learning models are efficient in calculation speed. Given the input parameters, the models generate precise predictions in a few seconds. This strength saves time, facilities, and labor in predicting the preliminary factory construction cost and gives the models practical significance.
Among the individual models, the DNN was superior in predicting the preliminary factory construction cost, as shown by its best values in all indicators. However, the designed ensemble model surpasses all single models. The designed ensemble learning system therefore demonstrated its capacity to handle preliminary factory construction cost estimation through ease of operation, comprehensibility, and fast estimation.
The problem of predicting preliminary factory construction cost is complex. All models are able to deal with the complicated and nonlinear relationships in the provided data. The DNN proved capable of coping with exponential-type distributions in the data. The ensemble model of two single machine-learning techniques (GENLIN + ANN) generated better predictions than either constituent model.
The outcomes of this study demonstrated that machine learning and ensemble models can estimate preliminary factory construction cost with reliable and stable results. These models can therefore assist cost estimators in calculating building project cost in the early stages quickly and accurately.

Conclusion and further study

This research executed five artificial intelligence algorithms (namely ANN, SVM, CART, GENLIN, and CHAID) and the newer DNN in their original form to estimate preliminary factory construction cost. A real dataset was collected from 35 industrial park projects in Vietnam to train and test the predictive models. The five-fold cross-validation method minimizes the bias in comparing results. The findings and statistical analysis showed that the DNN is the best single model for preliminary factory construction cost estimation. In particular, the ensemble model combining the two single models GENLIN and ANN outperforms its constituent models. Furthermore, the ensemble model runs with simple computation algorithms. The ensemble model therefore proved to be an efficient and promising alternative for preliminary factory construction cost estimation. The deep neural network and the designed ensemble model obtained precise predictions of the preliminary factory construction cost in only a few seconds. Accordingly, these models establish a good starting platform for estimating the preliminary factory construction cost in further work. The designed models are easy to use and implement because they have few tuning parameters in comparison to the simulation platform. Based on these outcomes, the complexity and challenge of estimating preliminary factory construction cost could be alleviated via data-mining techniques, i.e., the ensemble model.
This paper contributes to knowledge as follows. (1) Ensemble, DNN, and various single machine-learning models were developed for preliminary factory construction cost estimation. (2) The benefits of the designed ensemble model were emphasized for construction cost management at early stages. (3) A real dataset of factory construction costs was collected not only for model building and testing but also for further consideration. (4) The predictions can be generated in only a few seconds. (5) The research findings help reduce the cost estimator's workload and therefore improve the cost estimation process.
While providing several benefits, the proposed machine-learning models have some main limitations. First, only a 35-sample dataset was used for building and evaluating the models; more data should be collected to enhance generalizability. Second, the single and ensemble models used the default parameter settings suggested by the software; further work needs to use optimization tools to set the tuning parameters of the predictive models and increase performance. Third, the current study concentrates only on evaluating the performance of the machine-learning models and neglects the analysis of cost performance response. Despite that, the collected real dataset is very useful for further purposes.
Disclosure statement

No potential competing interest was reported by the authors.

Funding

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number DS2022-20-01.