The document describes a proposed methodology for using principal component analysis (PCA) and artificial neural networks (ANN) to estimate software project effort and duration (COCOMO model). It involves the following key steps:
1. Collecting size and cost factor metrics for past projects and normalizing the data.
2. Using PCA to reduce the dimensionality of the metrics and derive uncorrelated "domain metrics".
3. Training an ANN model on past project data to predict effort and duration based on the domain metrics.
4. Evaluating the accuracy of the ANN model by applying it to test data and calculating performance metrics like mean absolute relative error.
The goal is to develop an automated effort and duration estimation approach.
Problem Statement: COCOMO Estimation Using PCA and ANN
Working of the Basic Generic Model:
This generic model is based on both algorithmic and non-algorithmic methods.

Step 1: Sizing specifications, source code, and test cases. The first step in any software estimate is to predict the sizes of the deliverables that must be constructed. Sizing must cover all major deliverables, such as specifications, source code, manuals, documents, and test cases. A variety of sizing methods are available, including: a. Sizing based on function point metrics. b. Sizing based on Source Lines of Code (SLOC) metrics.

Step 2: Specify the implementation attributes and cost drivers covering the product's functional and operational characteristics.
2.1 Product factors: rate the required reliability and reusability, product complexity, etc.
2.2 Computer factors: rate the execution time constraint, main storage constraint, etc.
2.3 Personnel factors: rate analyst capability, language and tool experience, etc.
2.4 Project factors: rate the required development schedule, use of software tools, etc.
Each factor is rated very low, low, nominal, high, or very high. From these ratings we can calculate the complexity of the project and how many person-months are required to develop it.

Step 3: Effort and time. Use the COCOMO model to calculate effort and duration.

Step 4: Cost estimation: Cost = Effort * Average salary per unit time.

Step 5: Recording of estimated data. The next most important objective of software estimation and measurement practice is to record the estimated data. In order to save time and bring efficiency and maturity to the software cost estimation process, the organization should record estimation data for comparison and analysis in future projects.
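Steps 3 and 4 above can be sketched as code. This is a minimal sketch using the published basic-COCOMO coefficients for an "organic" mode project; the average monthly salary is a made-up example value, not from this document.

```python
# Sketch of steps 3-4: basic COCOMO effort/duration, then cost.
# Coefficients a, b, c, d are the standard basic-COCOMO values for
# an organic-mode project; the salary figure is hypothetical.

def cocomo_basic(kloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Return (effort in person-months, duration in months) for a
    project of `kloc` thousand source lines of code."""
    effort = a * kloc ** b        # person-months
    duration = c * effort ** d    # months
    return effort, duration

effort, duration = cocomo_basic(32)   # a 32 KLOC organic project
cost = effort * 5000                  # Step 4: effort * avg salary/month
print(round(effort, 1), round(duration, 1), round(cost))
```

For an intermediate-COCOMO variant, the effort would additionally be multiplied by the product of the cost-driver ratings from Step 2.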
Working of the Proposed Generic Model: Schematic Illustration for PCA + ANN
The entire data set of N samples is quality-filtered (1), and its dimensionality is then reduced by PCA from the original M expression values to 10 PCA projections (2). Next, the N2 test experiments are set aside and the N1 training experiments are randomly partitioned into three groups (3). One of these groups is reserved for validation and the remaining two are used for calibration (4). ANN models are then calibrated using the 10 PCA values of each sample as input and the phenotype category as output (5). For each model, the calibration is optimized over a number of iterative cycles (epochs). This is repeated using each of the three groups for validation (6). The samples are again randomly partitioned and the entire training process is repeated (7). For each selection of validation group one model is calibrated, resulting in a total of 3 x K trained models. Once the models are calibrated, they are used to rank the genes according to their importance for the classification (8). The entire process (steps 2-7) is repeated using only the top-ranked genes (9). The N2 test experiments are subsequently classified using all of the calibrated models.
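The repeated three-group partitioning described above can be sketched as follows. `three_fold_partitions` is a hypothetical helper name; the actual ANN calibration routine would be called once per (calibration, validation) split it yields.

```python
import random

# Sketch of the calibration scheme: the N1 training samples are
# randomly split into three groups; each group serves once as the
# validation set while the other two calibrate the model. Repeating
# the random partition K times yields 3 * K trained models.

def three_fold_partitions(samples, k, seed=0):
    rng = random.Random(seed)
    runs = []
    for _ in range(k):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        groups = [shuffled[i::3] for i in range(3)]
        for v in range(3):
            validation = groups[v]
            calibration = [s for g in groups[:v] + groups[v + 1:] for s in g]
            runs.append((calibration, validation))
    return runs

runs = three_fold_partitions(list(range(30)), k=5)
print(len(runs))  # -> 15, i.e. 3 * K calibration/validation splits
```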
Schematic Illustration of the Proposed System:
Sample data set -> Size (2 parameters) and Cost Factors (15 parameters) -> PCA (domain matrix for each parameter) -> COCOMO -> Effort (output) and Duration (output)
Description. We used the following methodology in this study:
1. The input metrics were normalized using min-max normalization, which performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range 0 to 1 using the formula:

   v' = (v - min_A) / (max_A - min_A)
2. Perform principal component analysis on the normalized metrics to produce domain metrics.
3. Divide the data into training, test, and validation sets in a 3:1:1 ratio.
4. Develop the ANN model from the training and test data sets.
5. Apply the ANN model to the validation data set in order to evaluate the accuracy of the model.
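Step 1 of the methodology is straightforward to sketch in code; `min_max_normalize` below is an illustrative helper applying the formula above to one metric column at a time.

```python
# Min-max normalization of one input metric to the range [0, 1],
# as in step 1 of the methodology.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant column: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 40, 50]))  # -> [0.0, 0.25, 0.75, 1.0]
```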
A. Principal Component (P.C.) Analysis. Cost/effort metrics are highly correlated with each other. P.C. analysis transforms the raw metrics into variables that are uncorrelated with each other; when the original data are cost/effort metrics, we call the new P.C. variables domain metrics. P.C. analysis maximizes the sum of squared loadings of each factor extracted in turn. It constructs new variables Pi, called principal components, out of a given set of variables Xj (j = 1, 2, ..., k):

   Pi = bi1*X1 + bi2*X2 + ... + bik*Xk
The bij's, called loadings, are worked out in such a way that the extracted P.C.s satisfy two conditions: (i) the P.C.s are uncorrelated (orthogonal), and (ii) the first P.C. (P1) has the highest variance, the second P.C. the next highest variance, and so on. The variables with high loadings help identify the dimension a P.C. is capturing, but this usually requires some degree of interpretation. In order to identify these variables and interpret the P.C.s, we consider the rotated components. As the dimensions are independent, orthogonal rotation is used. There are various strategies for performing such a rotation; we used varimax rotation, the most frequently used strategy in the literature. An eigenvalue (or latent root) is associated with each P.C.: the sum of the squared loadings relating to a dimension is referred to as its eigenvalue. The eigenvalue indicates the relative importance of each dimension for the particular set of variables being analyzed. The P.C.s with eigenvalue greater than 1 are taken for interpretation. Given an n by m matrix of multivariate data, P.C. analysis can reduce the number of columns. In our study, n represents the number of classes for which cost/effort metrics have been collected. Using P.C. analysis, the n by m matrix is reduced to an n by p matrix (where p < m).
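The extraction just described can be sketched with NumPy: eigendecomposition of the correlation matrix of the standardized metrics, components ordered by variance, and only those with eigenvalue greater than 1 retained. The data below are random stand-ins, not the study's metrics, and the sketch omits the varimax rotation step.

```python
import numpy as np

# P.C. extraction sketch: standardize, eigendecompose the correlation
# matrix, sort by eigenvalue, keep components with eigenvalue > 1.

rng = np.random.default_rng(42)
raw = rng.normal(size=(50, 6))                     # 50 samples, m = 6 metrics
raw[:, 3] = raw[:, 0] + 0.1 * rng.normal(size=50)  # make some columns
raw[:, 4] = raw[:, 1] + 0.1 * rng.normal(size=50)  # strongly correlated

z = (raw - raw.mean(axis=0)) / raw.std(axis=0)     # standardize
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]                  # descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1.0                               # eigenvalue-one criterion
domain_metrics = z @ eigvecs[:, keep]              # n x p matrix, p < m
print(domain_metrics.shape)
```

Because two column pairs are nearly collinear, their shared variance concentrates in a few components and the eigenvalue-one rule discards the rest, giving p < m as stated above.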
B. ANN Modeling. The network used in this work belongs to the family of multilayer feed-forward networks and is referred to as an M-H-Q network, with M source nodes, H nodes in the hidden layer, and Q nodes in the output layer. The input nodes are connected to every node of the hidden layer but are not directly connected to the output nodes, so the network has no lateral or shortcut connections. The ANN repeatedly adjusts its weights so that the difference between the desired output and the actual output of the network is minimized; it learns by finding a vector of connection weights that minimizes the sum of squared errors on the training data set. A summary of the ANN used in this study is shown in the table below.
Architecture:
  Layers: 3
  Input units: 17
  Hidden units: 170
  Output units: 2
Training:
  Feature selection: PCA
  Algorithm: Back propagation
The ANN was trained with the standard error back-propagation algorithm at a learning rate of 0.005, with a minimum squared error as the training stopping criterion.
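A minimal NumPy sketch of such a network and its back-propagation loop is shown below. The layer sizes are reduced from the 17-170-2 network in the table so the example runs quickly, and the data are random stand-ins for the PCA domain metrics and the (effort, duration) targets.

```python
import numpy as np

# Sketch of an M-H-Q feed-forward network trained by error back
# propagation at learning rate 0.005, stopping on a small MSE.
# Sizes and data are illustrative, not the study's configuration.

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 5))              # M = 5 input domain metrics
y = X @ rng.uniform(size=(5, 2))           # Q = 2 targets (effort, duration)
y /= y.max(axis=0)                         # normalized outputs

M, H, Q, lr = 5, 8, 2, 0.005
W1 = rng.normal(0, 0.5, (M, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, Q)); b2 = np.zeros(Q)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

first_mse = None
for epoch in range(20000):
    h = sigmoid(X @ W1 + b1)               # hidden layer activations
    out = h @ W2 + b2                      # linear output layer
    err = out - y
    mse = float((err ** 2).mean())
    if first_mse is None:
        first_mse = mse
    if mse < 1e-3:                         # stopping criterion
        break
    # back-propagate the error signal and descend the gradient
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * h * (1 - h)        # sigmoid derivative term
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print(round(first_mse, 4), round(mse, 4))
```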
C. Performance Evaluation. The main measure used for evaluating model performance in this system is the Mean Absolute Relative Error (MARE), the preferred error measure of software measurement researchers. It is calculated as follows:

   MARE = (1/n) * Σ |estimate - actual| / actual

where estimate is the network output for each observation, actual is the observed value, and n is the number of observations. To determine whether models are biased and tend to over- or under-estimate, the Mean Relative Error (MRE) is calculated as follows:

   MRE = (1/n) * Σ (estimate - actual) / actual
A large positive MRE suggests that the model over-estimates the number of lines changed per class, whereas a large negative value indicates the reverse.
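Both measures are easy to compute directly from their definitions; the values below are a hypothetical hold-out set, not the study's data.

```python
# MARE and MRE over paired actual and estimated values.

def mare(actual, estimate):
    """Mean Absolute Relative Error: magnitude of error, sign ignored."""
    return sum(abs(e - a) / a for a, e in zip(actual, estimate)) / len(actual)

def mre(actual, estimate):
    """Mean Relative Error: positive means over-estimation on average."""
    return sum((e - a) / a for a, e in zip(actual, estimate)) / len(actual)

actual = [100.0, 200.0, 400.0]
estimate = [110.0, 180.0, 440.0]
print(round(mare(actual, estimate), 3))  # -> 0.1
print(round(mre(actual, estimate), 3))   # -> 0.033
```

Note how MARE reports a 10% error while MRE nearly cancels the over- and under-estimates, which is exactly why MRE is used to detect bias rather than accuracy.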