Main function example: model selection
Here we generate 50 samples of 200 dimensional data from a factor model with 3 factors. The model is of size 3, where the first 3 covariate coefficients are 5 and the rest zero. The factors and loadings are generated from a normal distribution. The errors are geenrated from a \(t\) distribution with \(2.5\) degrees of freedom.
library(FarmSelect)
set.seed(100)
P = 200 #dimension
N = 50 #samples
K = 3 #nfactors
Q = 3 #model size
Lambda = matrix(rnorm(P*K, 0,1), P,K)
F = matrix(rnorm(N*K, 0,1), N,K)
U = matrix(rnorm(P*N, 0,1), P,N)
X = Lambda%*%t(F)+U
X = t(X)
beta_1 = rep(5, Q)
beta = c(beta_1, rep(0,P-Q))
eps = rt(N, 2.5)
Y = X%*%beta+eps
output = farm.select(X,Y) #robust, no cross-validation
## calculating tuning parameters...
## calculating covariance matrix...
## fitting model...
output
##
## Factor Adjusted Robust Model Selection
## loss function used: scad
##
## p = 200, n = 50
## factors found: 3
## size of model selected:
## 3
names(output)
## [1] "model.size" "beta.chosen" "coef.chosen" "nfactors" "X.residual"
## [6] "p" "n" "robust" "loss"
output$beta.chosen
## [1] 1 2 3
output$coef.chosen
## [1] 5.155005 5.085859 5.088709
The values X.residual is the residual covariate matrix after adjusting for factors.
Now we use a different loss function for the model selection step, along with changing the number of factors.
farm.select(X,Y, loss = "mcp", K.factors = 10, verbose=FALSE)
##
## Factor Adjusted Robust Model Selection
## loss function used: mcp
##
## p = 200, n = 50
## factors found: 10
## size of model selected:
## 10
The robustness is controlled by the parameter of the Huber loss function. This can be chosen by cross-validation which takes a long time, but gives good results. Alternatively, we use the parameter \(tau*sd*rate\) where tau is a constant, rate is the optimal rate for the tuning parameter (see Fan et al.(2017) https://arxiv.org/abs/1612.08490). sd is the standard deviation of the data at hand. The value of \(tau\) can be supplied by the user and takes a default value of 2.
##examples of other robustification options
output = farm.select(X,Y,robust = FALSE, verbose=FALSE) #non-robust
output = farm.select(X,Y, tau = 3, verbose=FALSE) #robust, no cross-validation, specified tau
#output = farm.select(X,Y, cv = TRUE) #robust, cross-validation, LONG RUNNING!!
The function farm.res
adjusts the dataset for latent factors. The number of factors is estimated internally by using the method in (Ahn and Horenstein 2013).
output = farm.res(X, verbose=FALSE)
names(output)
## [1] "X.res" "nfactors" "factors" "loadings"
If known, we can provide this function (or the main function) the number of latent factors. Providing too large a number results in a warning message. The maximum number of factors possible is \(\max(n,p)\) but a much smaller number is recommended.
output = farm.res(X, K.factors = 30, verbose=FALSE)
## Warning in farm.res(X, K.factors = 30, verbose = FALSE):
## Warning: Number of factors supplied is > min(n,p)/2. May cause numerical inconsistencies
We see a warning telling us that it is not a good idea to calculate 30 eigenvalues from a dataset that has only 50 samples.
For a logistic regression, we prefer a larger sample size: Here we generate 200 samples of 300 dimensional data from a factor model with 3 factors. The model is of size 3, where the first 3 covariate coefficients are 5 and the rest zero. The factors, loadings, errors are all generated from a normal distribution.
set.seed(100)
P = 400 #dimension
N = 300 #samples
K = 3 #nfactors
Q = 3 #model size
Lambda = matrix(rnorm(P*K, 0,1), P,K)
F = matrix(rnorm(N*K, 0,1), N,K)
U = matrix(rnorm(P*N, 0,1), P,N)
X = Lambda%*%t(F)+U
X = t(X)
beta_1 = rep(5, Q)
beta = c(beta_1, rep(0,P-Q))
eps = rnorm(N)
Prob = 1/(1+exp(-X%*%beta))
Y = rbinom(N, 1, Prob)
output = farm.select(X,Y, lin.reg=FALSE, eps=1e-3)
## calculating tuning parameters...
## calculating covariance matrix...
## fitting model...