---
title: "Illustration of PCA"
author: "Pierre Nyquist"
date: "September 20, 2016"
output: html_document
---

```{r echo=FALSE}
# Clear workspace and close any open graphics devices
rm(list = ls())
graphics.off()
# Load MASS for mvrnorm()
library(MASS)
```

This is an illustration of the method known as Principal Component Analysis (PCA). You are encouraged to play around with the parameters, particularly the sample size $N$ and the correlation $\rho$, to see what effect any changes have on the outcome.

As our data we draw a sample from the two-dimensional normal distribution with mean vector $\mu$

```{r}
mu <- c(2, 4)
mu
```

and covariance matrix $\Sigma$

```{r}
rho <- 0.8
sigma <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2)
rho
sigma
```

where $\rho$ determines the correlation between the two components. We take $N$ independent copies of such a two-dimensional random variable $X = (X_1, X_2)$; the following figure shows a scatterplot of such a sample.

```{r}
N <- 10000
X <- mvrnorm(N, mu, sigma)

# Scatterplot of X = (X_1, X_2)
plot(X[,1], X[,2], xlab = "", ylab = "", main = "Correlation > 0")
```

Here $\rho > 0$ and we observe an ellipsoidal shape whose main axes are not parallel to the coordinate axes. For comparison, we also plot a sample where $\rho = 0$.

```{r}
rho <- 0
sigma2 <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2)
X2 <- mvrnorm(N, mu, sigma2)
plot(X2[,1], X2[,2], xlab = "", ylab = "", main = "No correlation")
```

## PCA

We now extract the principal components associated with our two data sets $\mathbf{X}_1$ (correlation $> 0$) and $\mathbf{X}_2$ (no correlation). First, we estimate the mean vector (both samples have the same population mean, so we use the estimate based on $\mathbf{X}_1$ for both) and the two covariance matrices:

```{r}
muEst <- colMeans(X)
covEst <- cov(X)
covEst2 <- cov(X2)

# Display estimates
muEst
covEst
covEst2
```

Second, we find the eigenvectors of the two estimated covariance matrices.

```{r}
# Eigendecomposition of the covariance matrix; the columns of $vectors are the eigenvectors
eigen <- eigen(covEst)
# o1 and o2 are the (orthogonal) eigenvectors for the correlated case
o1 <- eigen$vectors[,1]
o2 <- eigen$vectors[,2]

eigen2 <- eigen(covEst2)
# o1.nocorr and o2.nocorr are the eigenvectors for the uncorrelated case
o1.nocorr <- eigen2$vectors[,1]
o2.nocorr <- eigen2$vectors[,2]
```

For the case where $\rho > 0$ we have

```{r}
o1
o2
```

and for $\rho = 0$ the eigenvectors are

```{r}
o1.nocorr
o2.nocorr
```

The next two figures show the eigenvectors for the two cases:

```{r echo=FALSE}
o <- rbind(o1, o2)
plot(o, xlim = c(-1, 1), ylim = c(-1, 1), main = "Principal components, correlation > 0")
segments(0, 0, o1[1], o1[2])
segments(0, 0, o2[1], o2[2])
```

```{r echo=FALSE}
o.nocorr <- rbind(o1.nocorr, o2.nocorr)
plot(o.nocorr, xlim = c(-1, 1), ylim = c(-1, 1), main = "Principal components, correlation = 0")
segments(0, 0, o1.nocorr[1], o1.nocorr[2])
segments(0, 0, o2.nocorr[1], o2.nocorr[2])
```

We now plot the original data and the transformed data, obtained by centering the data around $0$ and then multiplying by the orthogonal matrix of eigenvectors, in the same scatterplot.
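In matrix form, with $O = (o_1 \; o_2)$ the orthogonal matrix whose columns are the eigenvectors computed above and $\hat{\mu}$ subtracted from every row of the data, the transformed data are

$$
\mathbf{X}^{*} = O^{\top} \left( \mathbf{X} - \hat{\mu} \right)^{\top},
$$

which is exactly what the (hidden) code behind the next two figures computes.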
```{r echo=FALSE}
# Center the point cloud around 0
X[,1] <- X[,1] - muEst[1]
X[,2] <- X[,2] - muEst[2]

# Scatterplot of the centered data
plot(X[,1], X[,2], xlab = "X1", ylab = "X2", main = "Correlation > 0")

# Transformed data: rotate the centered data onto the eigenvector basis
Xstar <- t(eigen$vectors) %*% t(X)
points(t(Xstar), col = "red")
```

```{r echo=FALSE}
# Center the point cloud around 0 (both samples share the same population mean,
# so the mean estimate from the first sample is reused here)
X2[,1] <- X2[,1] - muEst[1]
X2[,2] <- X2[,2] - muEst[2]

# Scatterplot of the centered data
plot(X2[,1], X2[,2], xlab = "X1", ylab = "X2", main = "No correlation")

# Transformed data for the uncorrelated case
Xstar2 <- t(eigen2$vectors) %*% t(X2)
points(t(Xstar2), col = "red")
```

Comparing the two, we clearly see the effect the transformation has (or does not have) on the shape of the original data.

Lastly, we can compute the ratio between the individual eigenvalues and the sum of the eigenvalues in both cases. For $\rho > 0$,

```{r}
eigen$values / sum(eigen$values)
```

and for $\rho = 0$,

```{r}
eigen2$values / sum(eigen2$values)
```

This further illustrates how, in the first case, PCA is able to determine directions of importance, whereas in the second case the analysis does not provide any new, useful information.
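As an optional cross-check, not part of the analysis above, the same quantities can be obtained from R's built-in `prcomp` function: its `rotation` matrix should agree with the eigenvectors `o1` and `o2` up to sign (eigenvectors are only defined up to sign), and the squared standard deviations are the eigenvalues, so the proportions of variance explained should match the ratios just computed. A minimal sketch:

```{r}
# Sanity check with prcomp(), which centers the data internally
pca <- prcomp(X)
pca$rotation                   # compare with cbind(o1, o2); signs may differ
pca$sdev^2 / sum(pca$sdev^2)   # compare with eigen$values / sum(eigen$values)
```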