The package implements a collection of Generalized Elastic Net (GELnet) solvers, as outlined in the following publication: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004790

This vignette covers the basic usage of the gelnet package. The interface to the solvers is designed to be flexible, allowing the user to specify a large number of parameters. At the same time, nearly all of the parameters have reasonable default values, making the interface easy to use when the additional flexibility is not required.
Let \(X\) be a samples-by-features data matrix and \(y\) be a column vector of labels. Typically, \(X\) and \(y\) are determined by the prediction task at hand, but for the purposes of this tutorial, we are going to generate them randomly:
X <- matrix( rnorm( 1000 ), 20, 50 )   ## 20 samples by 50 features
y <- rnorm( 20 )                       ## one response value per sample
Let's further assume that we are given a feature-feature relationship matrix \(A\). When working with genomic features, \(A\) might be the adjacency matrix of a gene-gene interaction network. Again, we generate a random matrix for the purposes of this tutorial:
A <- matrix( sample( 0:1, 50*50, replace=TRUE ), 50, 50 )   ## random 0/1 adjacency
A <- A & t(A) ## Make the matrix symmetric
We are now going to use the GELnet toolkit to learn a linear regression model in which the model weights are more similar for features that share an interaction in \(A\). As discussed in the manuscript, this can be achieved by formulating a feature-feature penalty matrix from either the graph Laplacian or \((I-D)\), where \(D\) is the graph diffusion matrix and \(I\) is the identity matrix. The gelnet package provides a function to compute the graph Laplacian from the adjacency matrix. Here, we use the normalized Laplacian to keep the penalty term on the same scale as traditional ridge regression:
library( gelnet )
L <- adj2nlapl(A)
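For reference, the normalized graph Laplacian can also be computed directly from the adjacency matrix as \(I - D^{-1/2} A D^{-1/2}\), where \(D\) is the diagonal matrix of node degrees. The sketch below illustrates this standard definition using base R only; it is not the package's implementation, and adj2nlapl may differ in details such as the treatment of self-loops or zero-degree nodes.

## Illustration of the standard definition, I - D^{-1/2} A D^{-1/2};
## adj2nlapl may handle edge cases (self-loops, isolated nodes) differently.
Araw <- A * 1                      ## coerce the logical adjacency to 0/1
diag(Araw) <- 0                    ## drop self-loops
deg <- rowSums(Araw)               ## node degrees
Dihalf <- diag( 1 / sqrt(deg) )    ## D^{-1/2} (assumes no isolated nodes)
Lmanual <- diag(50) - Dihalf %*% Araw %*% Dihalf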
The model can now be learned via
model <- gelnet( X, y, 0.1, 1, P = L )
## Training a linear regression model
## Running linear regression optimization with L1 = 0.100000, L2 = 1.000000
## f = 0.573674 after iteration 6
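For reference, the objective minimized by this call combines the squared-error loss with the generalized elastic net penalty. With the default sample and feature weights it is roughly of the form

\[
\frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \mathbf{w}^T \mathbf{x}_i - b \right)^2 + \lambda_1 \sum_{j} |w_j| + \frac{\lambda_2}{2} \mathbf{w}^T P \mathbf{w},
\]

where \(P\) is the feature-feature penalty matrix (here, the normalized Laplacian \(L\)); see the publication above and the package documentation for the exact formulation.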
In the call above, the L1-norm and L2-norm penalty coefficients \(\lambda_1\) and \(\lambda_2\) are set to 0.1 and 1, respectively. The response for new samples is computed as the dot product with the model weights, plus the bias term:
Xnew <- matrix( rnorm( 500 ), 10, 50 )
Xnew %*% model$w + model$b
## [,1]
## [1,] -0.5834195
## [2,] 0.3800813
## [3,] 0.3359393
## [4,] -0.7191435
## [5,] 0.4884036
## [6,] -0.6455145
## [7,] -0.2912131
## [8,] 0.3078447
## [9,] 0.1756701
## [10,] -0.5090319
Linear regression is one of the three types of prediction problems supported by the package. The other two are binary logistic regression and one-class logistic regression. The latter is outlined in the following paper: http://psb.stanford.edu/psb-online/proceedings/psb16/sokolov.pdf
The package recognizes the problem type based on the class of the \(y\) argument. To train a binary predictor, we have to provide \(y\) as a two-level factor, where the first level is treated as the positive class.
y <- factor( y > 0, levels=c(TRUE,FALSE) )   ## TRUE is listed first, so it becomes the positive class
model2 <- gelnet( X, y, 0.1, 1, P=L )
## Training a logistic regression model
## Treating TRUE as the positive class
## Running logistic regression optimization with L1 = 0.100000, L2 = 1.000000
## Iteration 1: f = 0.693147
## Iteration 2: f = 0.637397
## Iteration 3: f = 0.637259
If we score the training data with this model, we observe that the positive samples receive higher scores than the negative ones:
data.frame( scores= X %*% model2$w + model2$b, labels= y )
## scores labels
## 1 -0.042026026 TRUE
## 2 -0.701904500 FALSE
## 3 -0.489787529 FALSE
## 4 -0.412480247 FALSE
## 5 -0.038749079 TRUE
## 6 -0.966533903 FALSE
## 7 -1.037170252 FALSE
## 8 -0.137472300 TRUE
## 9 -0.626894423 FALSE
## 10 -0.178248528 FALSE
## 11 -0.598004172 FALSE
## 12 -0.738103050 FALSE
## 13 -0.691199114 FALSE
## 14 -0.087613986 TRUE
## 15 -0.667663491 FALSE
## 16 -0.788829036 FALSE
## 17 0.023713731 TRUE
## 18 -0.135189723 TRUE
## 19 -0.006356284 TRUE
## 20 -0.033262442 TRUE
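The values above are the raw linear scores \(\mathbf{w}^T \mathbf{x} + b\). If class probabilities are desired, the scores can be passed through the logistic function; this is a standard transformation that uses base R only and is not part of the gelnet interface:

s <- X %*% model2$w + model2$b   ## linear scores for the training samples
p <- 1 / (1 + exp(-s))           ## P( y = TRUE | x ) under the logistic model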
However, if there is class imbalance, the scores will tend to be skewed towards the class with more samples. This can be addressed by providing the balanced flag when training the model:
model2bal <- gelnet( X, y, 0.1, 1, P=L, balanced=TRUE )
## Training a logistic regression model
## Treating TRUE as the positive class
## Running logistic regression optimization with L1 = 0.100000, L2 = 1.000000
## Iteration 1: f = 0.693147
## Iteration 2: f = 0.651713
## Iteration 3: f = 0.651705
data.frame( scores= X %*% model2bal$w + model2bal$b, labels= y )
## scores labels
## 1 0.3326137 TRUE
## 2 -0.3705584 FALSE
## 3 -0.1420000 FALSE
## 4 -0.0503658 FALSE
## 5 0.3411158 TRUE
## 6 -0.6465725 FALSE
## 7 -0.7369836 FALSE
## 8 0.2422113 TRUE
## 9 -0.2938887 FALSE
## 10 0.1936097 FALSE
## 11 -0.2651647 FALSE
## 12 -0.4211663 FALSE
## 13 -0.3562102 FALSE
## 14 0.2875076 TRUE
## 15 -0.3439592 FALSE
## 16 -0.4651769 FALSE
## 17 0.4052899 TRUE
## 18 0.2380984 TRUE
## 19 0.3671628 TRUE
## 20 0.3539453 TRUE
Traditionally, the loss function for logistic regression is averaged over the \(n\) samples. Every sample then makes the same contribution to the loss, which is what skews the scores towards the larger class. With the balanced flag, the problem is reformulated slightly: the loss is averaged over the positive and the negative samples separately, and the mean of the two averages is used as the overall loss.
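Concretely, if \(\ell_i\) denotes the logistic loss incurred on sample \(i\), and \(\mathcal{P}\) and \(\mathcal{N}\) are the index sets of the \(n_+\) positive and \(n_-\) negative samples, the standard data term \(\frac{1}{n} \sum_{i=1}^{n} \ell_i\) is replaced with

\[
\frac{1}{2} \left( \frac{1}{n_+} \sum_{i \in \mathcal{P}} \ell_i + \frac{1}{n_-} \sum_{i \in \mathcal{N}} \ell_i \right),
\]

so that each class contributes equally to the objective, regardless of its size.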
Finally, we can build a one-class logistic regression model using just the positive samples. To train a one-class model we simply provide NULL for the \(y\) argument:
j <- which( y == TRUE )   ## indices of the positive samples
model1 <- gelnet( X[j,], NULL, 0.1, 1, P=L )
## Training a one-class model
## Iteration 1 : f = 0.6931472
## Iteration 2 : f = 0.5880876
## Iteration 3 : f = 0.5878902
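In the one-class setting there are no negative labels: each training sample \(\mathbf{x}_i\) is modeled as belonging to the positive class with probability \(\sigma(\mathbf{w}^T \mathbf{x}_i)\), where \(\sigma\) is the logistic function, and the solver maximizes the corresponding log-likelihood subject to the same penalties. Roughly (see the PSB paper above for the precise formulation),

\[
\max_{\mathbf{w}} \;\; \frac{1}{n} \sum_{i=1}^{n} \log \sigma\!\left( \mathbf{w}^T \mathbf{x}_i \right) - \lambda_1 \sum_{j} |w_j| - \frac{\lambda_2}{2} \mathbf{w}^T P \mathbf{w} .
\]

Note that the one-class model has no separate bias term, so predictions are computed from the weight vector alone.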
The model can now be used as a detector that recognizes the positive samples. Scoring all of the samples with the one-class weights, we expect the positives to rank above the negatives:
data.frame( scores= X %*% model1$w, labels= y )
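As a simple sanity check of the detector, one can estimate the probability that a randomly chosen positive sample receives a higher one-class score than a randomly chosen negative one, i.e. the AUC of the scores. The sketch below uses base R only and is not part of the gelnet interface:

s1 <- drop( X %*% model1$w )   ## one-class scores for all 20 samples
## fraction of positive-negative pairs ranked correctly (the AUC, ignoring ties)
mean( outer( s1[ y == TRUE ], s1[ y == FALSE ], ">" ) )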