More Information

Principal Component Analysis (PCA)

The primary objective of this multivariate data analysis method is data reduction.


Introduction

Studying a data set with 100 variables and 100 observations (data objects) means examining 100 means, 100 variances, and (100 * 100 - 100)/2 = 4950 covariances, for a total of 5150 statistics that together represent the underlying multivariate normal population sampled. PCA is a method of simplifying this task. The key to the problem is that much of the variability in the data set is not independent; that is, there is a lot of covariation between the variables. If we could extract from all the variables under consideration two new variables that captured most of the independent variability in the entire data set, a simple bivariate scatter diagram would reveal most of the information in the data. Accordingly, the primary objective is to extract a few uncorrelated variables that capture most of the variability in the data set. These new variables, the principal components, serve as mutually orthogonal reference axes. The 1st principal component captures the maximum variation in the data set, the 2nd principal component captures the most of the remaining variation, and so on.
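The arithmetic and the variance ordering described above can be checked with a short sketch. It is an illustration only, not Aabel's implementation: the function name count_statistics, the synthetic data, and the choice of NumPy are all assumptions made for this example.

```python
import numpy as np

def count_statistics(p: int) -> dict:
    """Count the means, variances, and distinct covariances for p variables."""
    covariances = (p * p - p) // 2
    return {"means": p, "variances": p, "covariances": covariances,
            "total": 2 * p + covariances}

print(count_statistics(100))  # total is 100 + 100 + 4950 = 5150

# Synthetic data (an assumption for illustration): 100 observations of
# 6 variables driven by 2 hidden factors plus a little noise, so the
# variables covary strongly.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + 0.1 * rng.normal(size=(100, 6))

# Eigenvalues of the covariance matrix are the variances of the
# principal components; sort them from largest to smallest.
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 3))  # share of total variance per component
```

Because the synthetic variables are built from only two factors, the first two components should account for nearly all of the variance, which is the situation in which a single bivariate scatter diagram is informative.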

The Principal Component Analysis Output

The PCA output includes the PC scores (placed in the source worksheets) and a matrix plot, as well as the PC loadings and the correlation or covariance matrix data (placed in auto-generated worksheets and displayed graphically using Aabel charts and/or the table editor). See the example below; source of data: Fisher (1936), reproduced by Andrews and Herzberg (1985).

PC scores

PC loadings

Correlation or Covariance Matrix
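As a rough sketch of how the three outputs named above relate to one another, the following computes them by eigendecomposition. It is not Aabel's implementation. The function name pca, the use_correlation flag, and the random input data are assumptions for illustration. "Loadings" here means the unit-length eigenvectors, which is one common convention; some packages instead scale each eigenvector by the square root of its eigenvalue.

```python
import numpy as np

def pca(X, use_correlation=True):
    """Return (scores, loadings, eigenvalues, matrix) for data matrix X.

    Rows of X are observations and columns are variables. The analysis is
    based on the correlation matrix when use_correlation is True, otherwise
    on the covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if use_correlation:
        # Dividing by the sample standard deviation makes the covariance
        # matrix of the result equal to the correlation matrix of X.
        centered = centered / X.std(axis=0, ddof=1)
    matrix = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    order = np.argsort(eigenvalues)[::-1]   # largest variance first
    eigenvalues = eigenvalues[order]
    loadings = eigenvectors[:, order]       # one column per component
    scores = centered @ loadings            # observations on the new axes
    return scores, loadings, eigenvalues, matrix

# Random correlated data, used only to exercise the function.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

scores, loadings, eigenvalues, matrix = pca(X, use_correlation=True)
print(scores.shape, loadings.shape)
```

The covariance matrix of the scores is diagonal, with the eigenvalues on the diagonal, which is the sense in which the principal components are uncorrelated.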

Pre-Processing the Data

PCA and factor analysis methods allow optional pre-processing of the data prior to the main analysis. Examples of data transformations that can be used (as part of PCA or factor analysis) are:

  • Standardizing
  • Normalizing
  • Taking logarithms
  • Log centering
  • Mean centering
  • Taking square root
  • Ranking variables individually
  • Ranking variables jointly
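The transformations listed above might be sketched as follows. The exact definitions used by the software are not given here, so several choices are assumptions: "normalizing" is taken to mean min-max scaling to the range 0 to 1, "log centering" to mean taking natural logarithms and then subtracting each variable's mean, and ranking to break ties by position rather than by averaging. The names preprocess and ordinal_ranks, and the small data matrix, are hypothetical.

```python
import numpy as np

def ordinal_ranks(values):
    """Ranks 1..n of a 1-D array; ties are broken by position."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def preprocess(X, method):
    """Apply one transformation to X (rows: observations, columns: variables)."""
    X = np.asarray(X, dtype=float)
    if method == "standardize":        # zero mean, unit sample std per variable
        return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    if method == "normalize":          # assumed: min-max scaling to [0, 1]
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    if method == "log":                # requires strictly positive values
        return np.log(X)
    if method == "log_center":         # assumed: log, then subtract column means
        logged = np.log(X)
        return logged - logged.mean(axis=0)
    if method == "mean_center":
        return X - X.mean(axis=0)
    if method == "sqrt":               # requires non-negative values
        return np.sqrt(X)
    if method == "rank_individually":  # rank within each variable
        return np.column_stack([ordinal_ranks(col) for col in X.T])
    if method == "rank_jointly":       # rank all values across variables together
        return ordinal_ranks(X.ravel()).reshape(X.shape)
    raise ValueError(f"unknown method: {method}")

# Made-up values, chosen only to make the two rankings easy to compare.
X = np.array([[1.0, 10.0],
              [2.0, 40.0],
              [4.0, 20.0]])
print(preprocess(X, "rank_individually"))  # [[1 1] [2 3] [3 2]]
print(preprocess(X, "rank_jointly"))       # [[1 4] [2 6] [3 5]]
```

Ranking individually discards differences in scale between variables, whereas ranking jointly preserves the fact that every value in the second column exceeds every value in the first.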