
Multivariate Statistics Assignment 4

The fourth assignment of Multivariate Statistics. The assignment is written in R Markdown, a format supported by RStudio that makes it easy to typeset formulas, visualize plots, and run embedded code.

Most recommended: click here for the HTML version of the assignment, where you can see the code as well as the plots.

You may also find the PDF version of this assignment on GitHub. Or, if you can get past the firewall, just read on below:

1

# covariance matrix and its eigen decomposition
mp <- matrix(c(5, 2, 2, 2), 2, 2, byrow = TRUE)
eigen(mp)

2

a

# convert the covariance matrix to a correlation matrix
cov2cor(mp)

b

The principal components of Z obtained from the eigenvectors of the correlation matrix $\rho$ of X are different from those calculated from the covariance matrix $\Sigma$, because the eigen pairs derived from $\Sigma$ are, in general, not the same as those derived from $\rho$.
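
As a quick numerical check (a minimal sketch, reusing mp from Problem 1 as $\Sigma$), the two decompositions give different eigenvectors:

# eigenvectors of Sigma vs. eigenvectors of rho
eigen(mp)$vectors
eigen(cov2cor(mp))$vectors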

c

The correlations between $Y_j$ and $Z_i$ are $\rho_{Y_j, Z_i} = e_{ij}\sqrt{\lambda_j}$, where $(\lambda_j, e_j)$ are the eigen pairs of the correlation matrix $\rho$.
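
These can be computed in R (a small sketch; entry $(i, j)$ of the result is the correlation of $Y_j$ with $Z_i$):

# scale each eigenvector e_j by sqrt(lambda_j)
ev <- eigen(cov2cor(mp))
sweep(ev$vectors, 2, sqrt(ev$values), "*")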

3

a

# read the data
setwd('~/Desktop/三春/3多元统计分析/作业/作业4/')
dat <- read.csv("table8.4.csv")
X1 <- dat$x1
X2 <- dat$x2
X3 <- dat$x3
X4 <- dat$x4
X5 <- dat$x5
# sample covariance matrix
covar <- cov(dat)
covar
# eigen decomposition, principal components, and variance summary
eigen(cov(dat))
prcomp(dat)
summary(prcomp(dat))

b

From the summary above, the proportion of the total sample variance explained by the first three principal components is 89.881%. This means that the first three components explain almost all of the variance.
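
These proportions can also be computed by hand from the eigenvalues of covar (a small sketch): each $\lambda_i / \sum_k \lambda_k$ is one component's share of the total variance.

# cumulative share of total sample variance, from the eigenvalues
ev <- eigen(covar)$values
cumsum(ev) / sum(ev)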

c

From (8-33), the simultaneous confidence intervals for the $m$ eigenvalues $\lambda_i$ are:

$$\frac{\hat{\lambda}_i}{1 + z(\alpha/2m)\sqrt{2/n}} \le \lambda_i \le \frac{\hat{\lambda}_i}{1 - z(\alpha/2m)\sqrt{2/n}}$$

# critical value for the simultaneous intervals
z <- qnorm(1 - 1/6)
# interval of the form lambda_hat / (1 +/- z * sqrt(1/n)), with n = 103
cical <- function(lambda) {
  c(lambda / (1 + z * sqrt(1/103)), lambda / (1 - z * sqrt(1/103)))
}
cical(0.0013676780)
cical(0.0007011596)
cical(0.0002538024)

The CIs are: [0.001248653, 0.001511786], [0.0006401396, 0.0007750385], and [0.0002317147, 0.0002805447].

d

plot(c(0.52926, 0.80059, 0.89881, 0.95399, 1.00000),
     ylab = "Cumulative proportion", xlab = "Component number", type = 'b')

From the cumulative proportion plot, it seems that the first three principal components are enough.
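
Instead of hard-coding the proportions, they can also be read off the summary object (a sketch; summary.prcomp stores them in its importance matrix):

# extract and plot the cumulative proportions programmatically
p <- summary(prcomp(dat))
plot(p$importance["Cumulative Proportion", ],
     ylab = "Cumulative proportion", xlab = "Component number", type = 'b')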

4

a

# scor: 88 students' scores on five exams (bootstrap package)
library(bootstrap)
data(scor)
plot(scor)

b

# sample correlation matrix of the scores
cor(scor)

c

# principal components of the scor data
prcomp(scor)
summary(prcomp(scor))

d

plot(c(0.6191, 0.8013, 0.8948, 0.97102, 1.00000),
     ylab = "Cumulative proportion", xlab = "Component number", type = 'b')

I will choose the first two, since these two PCs account for almost 80% of the total variance.

e

PC1 may stand for an overall indicator of scores across all subjects. PC2 has a more straightforward meaning: it separates the closed-book exams from the open-book ones.
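
This reading can be checked against the loadings (a small sketch): the first column should have loadings of the same sign across all five exams, while the second should contrast the closed-book exams with the open-book ones.

# loadings of the first two principal components
round(prcomp(scor)$rotation[, 1:2], 3)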

f

# PCA biplot of the scor data
library('ggfortify')
autoplot(prcomp(scor, scale. = TRUE), colour = 'green', label = TRUE)

g

The cutoff is $\chi^2_2(0.05) = 5.99$. I use Python to check for the outliers (how the data is loaded below is an assumption; see the comments in the code):

import numpy as np
from sklearn.decomposition import PCA

# parse a space-separated string of numbers into a column vector
def convert(strr):
    return np.array(strr.split(' ')).astype('float').reshape(-1, 1)

# 'data' should be the 88 x 5 scor matrix; loading it from a CSV exported
# from R (e.g. write.csv(scor, 'scor.csv', row.names=FALSE)) is an assumption
data = np.loadtxt('scor.csv', delimiter=',', skiprows=1)

# project onto the first two principal components
pca = PCA(n_components=2, svd_solver='full')
dat = pca.fit_transform(data)

def ellipse(i):
    # flag observation i if it falls outside the 95% control ellipse
    x, y = dat[i, 0], dat[i, 1]
    a = (x / 26.2105)**2 + (y / 14.2166)**2
    if a >= 5.99:
        print(i, a)

for i in range(data.shape[0]):
    ellipse(i)

And we find eight outliers (0-based indices): 1, 2, 23, 28, 66, 76, 81, 87.
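
The same check can also be done directly in R (a minimal sketch using the prcomp scores and the same $\chi^2_2(0.05)$ cutoff; R indexes observations from 1):

# flag observations outside the 95% control ellipse of (PC1, PC2)
pc <- prcomp(scor)
d2 <- (pc$x[, 1] / pc$sdev[1])^2 + (pc$x[, 2] / pc$sdev[2])^2
which(d2 >= qchisq(0.95, df = 2))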

----- The End ----- Thanks for Reading -----