6.4 The Principal Components of Measurement

In most linear modeling applications that map software metrics onto software faults, such as regression analysis and discriminant analysis, each of the independent variables, or metrics, is assumed to represent some distinct aspect of variability not clearly present in the other measures. That is, each metric is assumed to measure a distinct and unique software attribute. This is clearly not the case. The clearest example is the pair of metrics LOC and Exec. These two measures are closely related, are highly correlated, and measure almost the same thing. Knowledge of LOC reveals a great deal about the number of executable statements in a program.

The metrics we collect for most software development applications will be highly correlated; that is, they will have a high degree of linear relationship, or covariance, with one another. We will use the term multicollinearity to describe this linear relationship of shared variance. Where there is a high degree of multicollinearity, it is almost impossible to establish the unique contribution of each metric to the model. One distinct consequence is that regression models built on independent variables with a high degree of multicollinearity have highly unstable regression coefficients. To circumvent this problem we will employ a statistical procedure called principal components analysis (PCA) to map the metrics into orthogonal attribute domains. [4] Each principal component extracted by this procedure may be seen to represent an underlying common attribute domain. A more extensive discussion of the foundations of PCA can be found in Appendix 1.
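To see why multicollinearity destabilizes regression coefficients, consider the following sketch. The data are synthetic and the setup is ours, not drawn from any study in this chapter: LOC is generated as almost exactly twice Exec, and the same fault model is refit on three resamples of the same data set.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two nearly collinear "metrics": Exec, and LOC ~ 2*Exec plus small noise.
n = 200
exec_stmts = rng.normal(100, 20, n)
loc = 2.0 * exec_stmts + rng.normal(0, 1.0, n)
faults = 0.05 * exec_stmts + rng.normal(0, 1.0, n)   # synthetic fault counts

X = np.column_stack([np.ones(n), exec_stmts, loc])

for trial in range(3):
    idx = rng.integers(0, n, n)                      # resample with replacement
    beta, *_ = np.linalg.lstsq(X[idx], faults[idx], rcond=None)
    print(f"trial {trial}: b_exec = {beta[1]:+.3f}, b_loc = {beta[2]:+.3f}")
```

The individual coefficients swing from one resample to the next even though their predictive combination stays nearly constant, which is precisely why such models cannot be used to interpret each metric's unique contribution.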

The principle behind PCA is quite straightforward. Assume that we are given a set of n metrics M = (m1,...,mn) having a multivariate distribution with mean vector μ = (μ1,...,μn) and covariance matrix Σ. The covariance matrix, S, for a sample drawn from this multivariate distribution is a real symmetric matrix and thus, by the spectral decomposition theorem, it can be decomposed as follows:

S = PLP'

where P is a matrix whose columns are the eigenvectors of S and L is a diagonal matrix whose entries are the eigenvalues, or latent roots, of S. Alternatively, we can rewrite this equation as:

P'SP = L

While there are many solutions to this problem, the principal components factoring technique extracts the eigenvalues from the largest to the smallest, each eigenvalue measuring the variance explained by its associated component. Each principal component represents an underlying metric domain onto which each of the raw metrics will be mapped. The relationship between the original set of metrics, M, and the new orthogonal set can be seen clearly by computing the product moment correlation of each raw metric with each new principal component as follows:

r(mi, dj) = pij √lj / si

where pij is the ith element of the jth eigenvector, lj is the eigenvalue of the jth principal component, and si is the standard deviation of the ith metric.
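A concrete sketch of this computation follows, on synthetic data; the variable names and the four-metric example are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw metric matrix: 500 modules by 4 correlated metrics.
M = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

S = np.cov(M, rowvar=False)              # sample covariance matrix
l, P = np.linalg.eigh(S)                 # eigh: for real symmetric matrices
order = np.argsort(l)[::-1]              # largest eigenvalue first
l, P = l[order], P[:, order]

# Spectral decomposition check: S = P L P'
assert np.allclose(S, P @ np.diag(l) @ P.T)

# Correlation of raw metric i with principal component j:
#   r(m_i, d_j) = p_ij * sqrt(l_j) / s_i
s = np.sqrt(np.diag(S))                  # standard deviations of the metrics
loadings = P * np.sqrt(l) / s[:, None]
print(np.round(loadings, 2))
```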

We have two objectives in mind when we employ PCA. First, we wish to transform our n highly correlated metrics, M = (m1,...,mn), into a new uncorrelated set D = (d1,...,dn) that we will call domain metrics to distinguish them from the raw metrics. Second, we would like to reduce the dimensionality of the problem. That is, we understand intuitively that there are not really n distinct sources of variation in the original set of metrics M. There is probably a smaller number p, p < n, of distinct sources of variation within the set of n original raw metrics. We feel certain, for example, that LOC, Exec, N1, and N2 all represent a single aspect of the size of a program.
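Obtaining the domain metrics themselves is just a projection: standardize the raw metrics and multiply by the eigenvector matrix. A minimal sketch, continuing the synthetic example above, this time factoring the correlation matrix so that every metric enters on an equal footing:

```python
# Continuing the previous sketch: map raw metrics to domain metrics.
Z = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)   # standardized raw metrics

R = np.corrcoef(M, rowvar=False)         # correlation matrix of the metrics
l, P = np.linalg.eigh(R)
order = np.argsort(l)[::-1]
l, P = l[order], P[:, order]

D = Z @ P                                # uncorrelated domain metric scores
# Off-diagonal covariances of D are (numerically) zero:
print(np.round(np.cov(D, rowvar=False), 6))
```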

The problem that we would now like to solve is to determine just how many usable sources of variation can be identified in our original set of metrics M. We know that the PCA technique will extract the eigenvalues and the corresponding eigenvectors from the largest value to the smallest value. From a statistical perspective, each eigenvalue, li, measures the amount of variance accounted for by its principal component. For example, the variance accounted for by the first principal component is l1. Also observe that, when the components are extracted from the correlation matrix of the n standardized metrics,

l1 + l2 + ... + ln = n

Thus, the first principal component will account for l1/n × 100 percent of the variation of all n metrics. As the process of extracting each new principal component progresses, more and more of the variation in the original set of metrics is accounted for.
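These quantities are exactly what the Eigenvalue, % Variance, and Cumulative variance rows of Exhibit 5 report. A short sketch using the eigenvalues from that exhibit; the printed percentages differ slightly from the exhibit because the published eigenvalues are rounded to one decimal place:

```python
import numpy as np

# Eigenvalues from Exhibit 5 (PASS metrics, correlation matrix, n = 12).
l = np.array([8.5, 1.6, 0.5, 0.5, 0.4, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0])

pct = 100 * l / l.size          # percent variance per component
cum = np.cumsum(pct)            # cumulative percent variance
print(np.round(pct, 1))         # ~[70.8 13.3  4.2 ...]
print(np.round(cum, 1))
```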

The complete mapping between a set of 12 metrics, M, for the modules of the Space Shuttle Primary Avionics Software System (PASS) and the 12 orthogonal domain metrics, D, is shown in Exhibit 5. To the right of each of the original 12 metrics in this table are the correlation coefficients of that metric with the 12 new orthogonal domain metrics. We can see that beyond domains 1 and 2, the correlation coefficients are very small.

Exhibit 5: Principal Components of PASS Metrics

            Domain
Metric        1      2      3      4      5      6      7      8      9     10     11     12
η1          0.73   0.18  -0.61  -0.09  -0.19   0.01  -0.03  -0.11  -0.01   0.00   0.00   0.00
η2          0.88   0.33  -0.21   0.01   0.08   0.01   0.02  -0.28   0.02   0.01   0.00   0.00
N1          0.87   0.43   0.14   0.09   0.09   0.10   0.01   0.11   0.02  -0.06   0.00   0.00
N2          0.87   0.43   0.08   0.10   0.13   0.10   0.02   0.05   0.10  -0.05   0.00   0.00
Exec        0.90   0.38   0.12   0.06   0.07   0.06   0.02   0.01  -0.16  -0.02   0.00   0.00
LOC         0.86   0.28   0.16   0.00  -0.11  -0.38  -0.08   0.02   0.01   0.00   0.00   0.00
Nodes       0.90  -0.25   0.16  -0.08  -0.29   0.09  -0.06  -0.03   0.01  -0.01  -0.01   0.03
Edges       0.89  -0.27   0.17  -0.06  -0.28   0.13  -0.07  -0.04   0.00   0.01   0.01  -0.03
Paths       0.72  -0.24   0.02  -0.58   0.31   0.00  -0.05   0.02   0.00   0.00   0.00   0.00
Cycles      0.67  -0.55  -0.11   0.37   0.26   0.00  -0.19   0.01  -0.01   0.00   0.00   0.00
Maxpath     0.89  -0.41  -0.04   0.05   0.02  -0.06   0.16   0.03   0.00   0.00  -0.04  -0.01
Avepath     0.87  -0.43  -0.01   0.09   0.02  -0.07   0.19   0.00   0.00   0.00   0.03   0.01

Eigenvalue   8.5    1.6    0.5    0.5    0.4    0.2    0.1    0.1    0.0    0.0    0.0    0.0
% Variance  70.7   13.1    4.6    4.3    3.4    1.7    1.0    0.9    0.3    0.1    0.0    0.0
Cumulative  70.7   83.8   88.4   92.7   96.1   97.8   98.8   99.6   99.9  100.0  100.0  100.0
% variance

The Eigenvalue row of Exhibit 5 shows the eigenvalue associated with each new orthogonal domain, and the % Variance row shows the proportion of variation contributed by each domain. Again, we can see that this proportion drops off rapidly beyond the first two new metric domains. This can be seen graphically in Exhibit 6, a plot of the eigenvalues.

Exhibit 6: Plot of Eigenvalues

[Figure: the 12 eigenvalues of Exhibit 5 plotted in decreasing order.]
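A plot such as Exhibit 6 can be reproduced directly from the eigenvalues. A minimal matplotlib sketch, with a dashed line at 1.0 anticipating the stopping rule discussed next:

```python
import matplotlib.pyplot as plt
import numpy as np

# Eigenvalues from Exhibit 5, largest to smallest.
l = [8.5, 1.6, 0.5, 0.5, 0.4, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0]

plt.plot(np.arange(1, 13), l, marker="o")
plt.axhline(1.0, linestyle="--")          # stopping rule: keep l >= 1.0
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Eigenvalues of the 12 PASS metrics")
plt.show()
```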

The PCA technique is an iterative technique that extracts eigenvalues (and eigenvectors) from the largest eigenvalue to the smallest, as can be seen in Exhibit 6. Clearly, there is a point of diminishing returns in this extraction process. What is needed is a stopping rule that terminates the extraction when a sufficient number of orthogonal domains has been extracted. There are many such stopping rules, and a number of them are discussed in Appendix 1. For our purposes, we will observe that if we were to perform PCA on the correlation matrix of completely uncorrelated data, all of the eigenvalues would be 1.0; that is, li = 1.0, i = 1, 2,...,n. Thus, a very common stopping rule is to retain all principal components whose eigenvalue is 1.0 or greater. For convenience alone, we will employ this stopping rule in our subsequent analysis. This yields the new factor pattern shown in Exhibit 7.
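A sketch of this stopping rule, often called the Kaiser criterion, again on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # hypothetical metrics

R = np.corrcoef(M, rowvar=False)         # correlation, not covariance
l, P = np.linalg.eigh(R)
order = np.argsort(l)[::-1]
l, P = l[order], P[:, order]

keep = l >= 1.0                          # Kaiser criterion
print(f"retain {keep.sum()} of {l.size} components:", np.round(l, 2))

# Factor pattern for the retained components; with a correlation
# matrix, s_i = 1, so each loading is just p_ij * sqrt(l_j).
loadings = P[:, keep] * np.sqrt(l[keep])
```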

Exhibit 7: Two Orthogonal Metric Domains

Metric      Domain 1   Domain 2
η1            0.73       0.18
η2            0.88       0.33
N1            0.87       0.43
N2            0.87       0.43
Exec          0.90       0.38
LOC           0.86       0.28
Nodes         0.90      -0.25
Edges         0.89      -0.27
Paths         0.72      -0.24
Cycles        0.67      -0.55
Maxpath       0.89      -0.41
Avepath       0.87      -0.43

Eigenvalue    8.5        1.6
% Variance   70.7       13.1

When we examine the contents of Exhibit 7, we can see that all 12 metrics are highly correlated with the new Domain 1 and not well correlated with Domain 2. We have identified two new sources of variation in the 12 metrics, but it is difficult to understand what we have found. This is an artifact of the PCA technique: it extracts the new domains along the principal axes of variation through an orthogonal rotation of the original data axes. To clarify the nature of the two sources of variation, we now conduct another orthogonal rotation of the two new domain axes, this time in an attempt to redistribute the variance accounted for more equally between the two domains. There are many rotation techniques that can be employed; some of these are discussed in Appendix 1. The technique of choice in this context is the varimax orthogonal rotation; a sketch of the rotation itself follows. The results of the PCA with varimax rotation for the 1115 modules of the PASS GNC system are shown in Exhibit 8. This table shows the rotated factor pattern loadings for the two principal components whose eigenvalues are greater than 1.0.
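Varimax is a standard routine in most statistical packages. For completeness, here is a minimal sketch of the classic SVD-based iteration (our own illustrative implementation, not the code used in the PASS analysis), applied to the two retained columns from Exhibit 7:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a factor loading matrix to maximize
    the variance of the squared loadings within each column."""
    p, k = loadings.shape
    R = np.eye(k)                        # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion; its SVD gives the
        # next orthogonal rotation.
        grad = L**3 - L @ np.diag((L**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ grad)
        R = u @ vt
        if s.sum() < d * (1 + tol):      # relative improvement below tol
            break
        d = s.sum()
    return loadings @ R

# Usage: rotate the two retained components from Exhibit 7.
L2 = np.array([[0.73, 0.18], [0.88, 0.33], [0.87, 0.43], [0.87, 0.43],
               [0.90, 0.38], [0.86, 0.28], [0.90, -0.25], [0.89, -0.27],
               [0.72, -0.24], [0.67, -0.55], [0.89, -0.41], [0.87, -0.43]])
# Compare with the Exhibit 8 pattern; column order and signs may differ,
# since a rotated solution is determined only up to those choices.
print(np.round(varimax(L2), 2))
```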

Exhibit 8: PCA of the 12 PASS Metrics

Metric        Size    Control
η1            0.66     0.37
η2            0.87     0.36
N1            0.93     0.28
N2            0.93     0.27
Exec          0.92     0.33
LOC           0.82     0.38
Nodes         0.49     0.80
Edges         0.47     0.80
Paths         0.37     0.66
Cycles        0.12     0.86
Maxpath       0.37     0.91
Avepath       0.35     0.91

% Variance   44.1     39.7

From Exhibit 8 we can see that most of the information represented by the original 12 metrics can be captured in two distinct metric domains. The raw complexity metrics cluster into two distinct groups, or orthogonal metric domains. In some cases it is useful to associate names with the domains. The metrics most closely associated with the first domain share the common characteristic of measuring variation in program size. Similarly, the metrics associated with the second domain share a common element of control variation: they measure aspects of the program module flowgraph, which represents the control structure of a program.

If we carefully examine the set of 12 measures used above, we can observe that some conceptual areas of software attributes are not represented in this set of metrics. Data structure complexity is one example: clearly, no measures in the set capture this software attribute. The problem is that we do not know just how many distinct, measurable attributes a software system might have; we do know that these 12 metrics measure only two distinct, uncorrelated attributes.

From the % Variance row of Exhibit 8 we can see that the size metrics account for approximately 44 percent of the variation in the set of 12 metrics, and the control metrics account for about 40 percent of the variation in the same set. Together, the two new orthogonal domains account for roughly 84 percent of the variation in the original set.

The objective for the next research effort is to begin to build and extend a model for software attributes. This model will contain a set of orthogonal attribute domains. Once we have such a model in place, we would then like to identify and select from the attribute domain model those attributes that are correlated with a software quality measure, such as number of faults. Each of the orthogonal attributes will have an associated metric value that is uncorrelated with any other attribute metrics. Each of these attributes may potentially serve to describe some aspect of variability in the behavior of the software faults in a program module.

[4] Coulter, N.S., "Software Science and Cognitive Psychology," IEEE Transactions on Software Engineering, 9(2), 166-171, 1983.


