The package implements a collection of Generalized Elastic Net (GELnet) solvers, as outlined in the following publication: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004790

This vignette covers the basic usage of the gelnet package. The interface to the solvers is designed to be flexible, allowing the user to specify a large number of parameters. At the same time, nearly all of the parameters have reasonable default values, making the interface easy to use when the additional flexibility is not required.
Let \(X\) be a samples-by-features data matrix and \(y\) be a column vector of labels. Typically, \(X\) and \(y\) are determined by the prediction task at hand, but for the purposes of this tutorial, we are going to generate them randomly:
X <- matrix( rnorm( 1000 ), 20, 50 )   ## 20 samples by 50 features
y <- rnorm( 20 )                       ## one response value per sample
Let's further assume that we are given a feature-feature relationship matrix \(A\). When working with genomic features, \(A\) might be the adjacency matrix of a gene-gene interaction network. Again, we generate a random matrix for the purposes of this tutorial:
A <- matrix( sample( 0:1, 50*50, replace=TRUE ), 50, 50 )   ## random 0/1 adjacency
A <- A & t(A) ## Make the matrix symmetric
We are now going to use the GELnet toolkit to learn a linear regression model in which the model weights are more similar for features that share an interaction in \(A\). As discussed in the manuscript, this can be achieved by formulating a feature-feature penalty matrix from either the graph Laplacian or \((I-D)\), where \(D\) is the graph diffusion matrix and \(I\) is the identity matrix. The gelnet package provides a function to compute the graph Laplacian from the adjacency matrix. Here, we use the normalized Laplacian to keep the penalty term on the same scale as traditional ridge regression:
library( gelnet )
L <- adj2nlapl(A)
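For reference, the normalized graph Laplacian can also be computed directly from the adjacency matrix as \(I - D^{-1/2} A D^{-1/2}\), where \(D\) is the diagonal matrix of node degrees. The sketch below illustrates this standard definition using base R only; it is not the package's implementation, and adj2nlapl may differ in details such as the treatment of self-loops or zero-degree nodes.

## Illustration of the standard definition, I - D^{-1/2} A D^{-1/2};
## adj2nlapl may handle edge cases (self-loops, isolated nodes) differently.
Araw <- A * 1                      ## coerce the logical adjacency to 0/1
diag(Araw) <- 0                    ## drop self-loops
deg <- rowSums(Araw)               ## node degrees
Dihalf <- diag( 1 / sqrt(deg) )    ## D^{-1/2} (assumes no isolated nodes)
Lmanual <- diag(50) - Dihalf %*% Araw %*% Dihalf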
The model can now be learned via
model <- gelnet( X, y, 0.1, 1, P = L )
## Training a linear regression model
## Running linear regression optimization with L1 = 0.100000, L2 = 1.000000
## f = 0.573674 after iteration 6
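For reference, the objective minimized by this call combines the squared-error loss with the generalized elastic net penalty. With the default sample and feature weights it is roughly of the form

\[
\frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \mathbf{w}^T \mathbf{x}_i - b \right)^2 + \lambda_1 \sum_{j} |w_j| + \frac{\lambda_2}{2} \mathbf{w}^T P \mathbf{w},
\]

where \(P\) is the feature-feature penalty matrix (here, the normalized Laplacian \(L\)); see the publication above and the package documentation for the exact formulation.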
In the call above, the L1-norm and L2-norm penalty coefficients \(\lambda_1\) and \(\lambda_2\) are set to 0.1 and 1, respectively. The response for new samples is computed as the dot product with the model weights, plus the bias term:
Xnew <- matrix( rnorm( 500 ), 10, 50 )
Xnew %*% model$w + model$b
## [,1]
## [1,] -0.5834195
## [2,] 0.3800813
## [3,] 0.3359393
## [4,] -0.7191435
## [5,] 0.4884036
## [6,] -0.6455145
## [7,] -0.2912131
## [8,] 0.3078447
## [9,] 0.1756701
## [10,] -0.5090319
Linear regression is one of the three types of prediction problems supported by the package. The other two are binary logistic regression and one-class logistic regression. The latter is outlined in the following paper: http://psb.stanford.edu/psb-online/proceedings/psb16/sokolov.pdf
The package recognizes the problem type based on the class of the \(y\) argument. To train a binary predictor, we have to provide \(y\) as a two-level factor, where the first level is treated as the positive class.
y <- factor( y > 0, levels=c(TRUE,FALSE) )   ## TRUE is listed first, so it becomes the positive class
model2 <- gelnet( X, y, 0.1, 1, P=L )
## Training a logistic regression model
## Treating TRUE as the positive class
## Running logistic regression optimization with L1 = 0.100000, L2 = 1.000000
## Iteration 1: f = 0.693147
## Iteration 2: f = 0.637397
## Iteration 3: f = 0.637259
If we score the training data with this model, we observe that the positive samples receive higher scores than the negative ones:
data.frame( scores= X %*% model2$w + model2$b, labels= y )
## scores labels
## 1 -0.042026026 TRUE
## 2 -0.701904500 FALSE
## 3 -0.489787529 FALSE
## 4 -0.412480247 FALSE
## 5 -0.038749079 TRUE
## 6 -0.966533903 FALSE
## 7 -1.037170252 FALSE
## 8 -0.137472300 TRUE
## 9 -0.626894423 FALSE
## 10 -0.178248528 FALSE
## 11 -0.598004172 FALSE
## 12 -0.738103050 FALSE
## 13 -0.691199114 FALSE
## 14 -0.087613986 TRUE
## 15 -0.667663491 FALSE
## 16 -0.788829036 FALSE
## 17 0.023713731 TRUE
## 18 -0.135189723 TRUE
## 19 -0.006356284 TRUE
## 20 -0.033262442 TRUE
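The values above are the raw linear scores \(\mathbf{w}^T \mathbf{x} + b\). If class probabilities are desired, the scores can be passed through the logistic function; this is a standard transformation that uses base R only and is not part of the gelnet interface:

s <- X %*% model2$w + model2$b   ## linear scores for the training samples
p <- 1 / (1 + exp(-s))           ## P( y = TRUE | x ) under the logistic model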
However, if there is class imbalance, the scores will tend to be skewed towards the class with more samples. This can be addressed by providing the balanced flag when training the model:
model2bal <- gelnet( X, y, 0.1, 1, P=L, balanced=TRUE )
## Training a logistic regression model
## Treating TRUE as the positive class
## Running logistic regression optimization with L1 = 0.100000, L2 = 1.000000
## Iteration 1: f = 0.693147
## Iteration 2: f = 0.651713
## Iteration 3: f = 0.651705
data.frame( scores= X %*% model2bal$w + model2bal$b, labels= y )
## scores labels
## 1 0.3326137 TRUE
## 2 -0.3705584 FALSE
## 3 -0.1420000 FALSE
## 4 -0.0503658 FALSE
## 5 0.3411158 TRUE
## 6 -0.6465725 FALSE
## 7 -0.7369836 FALSE
## 8 0.2422113 TRUE
## 9 -0.2938887 FALSE
## 10 0.1936097 FALSE
## 11 -0.2651647 FALSE
## 12 -0.4211663 FALSE
## 13 -0.3562102 FALSE
## 14 0.2875076 TRUE
## 15 -0.3439592 FALSE
## 16 -0.4651769 FALSE
## 17 0.4052899 TRUE
## 18 0.2380984 TRUE
## 19 0.3671628 TRUE
## 20 0.3539453 TRUE
Traditionally, the loss function for logistic regression is averaged over the \(n\) samples. Every sample then makes the same contribution to the loss, which is what skews the scores towards the larger class. With the balanced flag, the problem is reformulated slightly: the loss is averaged over the positive and the negative samples separately, and the mean of the two averages is used as the overall loss.
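Concretely, if \(\ell_i\) denotes the logistic loss incurred on sample \(i\), and \(\mathcal{P}\) and \(\mathcal{N}\) are the index sets of the \(n_+\) positive and \(n_-\) negative samples, the standard data term \(\frac{1}{n} \sum_{i=1}^{n} \ell_i\) is replaced with

\[
\frac{1}{2} \left( \frac{1}{n_+} \sum_{i \in \mathcal{P}} \ell_i + \frac{1}{n_-} \sum_{i \in \mathcal{N}} \ell_i \right),
\]

so that each class contributes equally to the objective, regardless of its size.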
Finally, we can build a one-class logistic regression model using just the positive samples. To train a one-class model we simply provide NULL for the \(y\) argument:
j <- which( y == TRUE )   ## indices of the positive samples
model1 <- gelnet( X[j,], NULL, 0.1, 1, P=L )
## Training a one-class model
## Iteration 1 : f = 0.6931472
## Iteration 2 : f = 0.5880876
## Iteration 3 : f = 0.5878902
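In the one-class setting there are no negative labels: each training sample \(\mathbf{x}_i\) is modeled as belonging to the positive class with probability \(\sigma(\mathbf{w}^T \mathbf{x}_i)\), where \(\sigma\) is the logistic function, and the solver maximizes the corresponding log-likelihood subject to the same penalties. Roughly (see the PSB paper above for the precise formulation),

\[
\max_{\mathbf{w}} \;\; \frac{1}{n} \sum_{i=1}^{n} \log \sigma\!\left( \mathbf{w}^T \mathbf{x}_i \right) - \lambda_1 \sum_{j} |w_j| - \frac{\lambda_2}{2} \mathbf{w}^T P \mathbf{w} .
\]

Note that the one-class model has no separate bias term, so predictions are computed from the weight vector alone.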
The model can now be used as a detector that recognizes the positive samples. Scoring all of the samples with the one-class weights, we expect the positives to rank above the negatives:
data.frame( scores= X %*% model1$w, labels= y )
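As a simple sanity check of the detector, one can estimate the probability that a randomly chosen positive sample receives a higher one-class score than a randomly chosen negative one, i.e. the AUC of the scores. The sketch below uses base R only and is not part of the gelnet interface:

s1 <- drop( X %*% model1$w )   ## one-class scores for all 20 samples
## fraction of positive-negative pairs ranked correctly (the AUC, ignoring ties)
mean( outer( s1[ y == TRUE ], s1[ y == FALSE ], ">" ) )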