The mixl package presents a new approach to estimating complex choice models which is both many times faster than existing approaches, and simplier to use. The interface remains in R, while the loglikelihood function is compiled into to high performance C++ code in the background. Users need no knowledge of C++, and can specify thier utility functions and random variables as they would on paper. Where the appropriate parallel computing resources are available (on linux machines), the loglikelihood estimation can be easily sped up by simply specifying the number of cores available.
##Installation Under the hood, the mixl package compiles a loglikelihood function based on your model specification. This requires a c++ compiler. Somewhat complicatedly, the correct compiler (for C++ with openMP support) is only available by defauly on linux machines. On windows and OSX machines, mixl will installed successfully, but attempts to specify a model will fail unless the following steps are taken.
On windows machines, the easiest way is to install Rtools, an
official extension for R. As of writing, the Version 4.0 the most
recent. It is available here: https://cran.r-project.org/bin/windows/Rtools/.
Download and run the installer. The default location
(C:/Rtools
) is fine.
On Mac, things are a little more complicated. OSX has a compiler
pre-installed. However, the compiler (clang) doesn’t come with openMP
enabled. Hence, the specify_model
function will fail. The
easist way to install a suitable compiler is as follows. Note, these
commands require using the Terminal (command line). This can be accessed
from the search functionality in OSX, or directly in R Studio itself.
The following code assumes that you don’t have a Makevars for your R
environment set up already.
brew install gcc@9
mkdir -p ~/.R
curl -L -o ~/.R/Makevars https://git.io/JtbVr
The development of this package was motivated by the need at the Institute for Transport Planning and Systems (IVT) at ETH Zürich to estimate complicated mixed multinomial logit models of time use and travel behaviour. Other approaches, such as biogeme and other R implementations were found to be very memory intensive when a large number of random draws were needed.
Additionally, we found that the learning curve for some implementations was very steep, and changes to model specifications involved understanding of the underlying code base, on top of random utility and choice modeling fundamentals.
In R, until now there has been two approaches for specifying utility functions. The first, popularised by the mlogit package uses the R formula notation to allow succint yet expressive definition of a utility function. This approach falls short however, when the user wants to specify multiple utility functions, which is standard practice for more complicated models. This approach also doesn’t handle complicated mixed models well. The alternative involves specifying the utility functions as operations on datatable columns, which is much more flexible, but inefficient to run, and error prone to write.
From a technical perspective, the utility functions can be seen as the ‘inner loop’ of the Maximum likelihood Estimation (mle) operation. If there are p parameters, then for every iteration of the mle the utility functions are calculated at least 2p+1 times. It follows that any attempt at high performance mnl estimation should focus here.
This is where approaches in R fall short. The speedups possible by coding inner loops using C++ with the Rcpp package are already well widely documented, however, every change to the utility function would require changes to code in C++, not something your trained choice modeller should be spending thier time on.
Hence, we invisaged a package where one could write utility functions as they would on paper, which could be parsed into C++ and complied ‘behind the scenes’ into a high performant loglikelihood function accessible from R.
The second goal was to optimise the memory consumption. The straightforward approach is to merge the dataset with the random draw dataset, duplicating every data row by the number of requested random draws. However, this leads to an polynomial increase in the memory required. And is particular inefficient for large panel datasets. However, if the data and random draw datasets are kept separate, then the memory usage is only linear in the number of draws.
As such, this package, called mixl, allows for advanced choice modelling through an intuitive interface. The number of random draws is no longer limited by the computer memory available. Indeed, we have been able to estimate models with dozens of parameters and 10,000 random draws. Importantly, this performance does not require advanced programming skills from the modeller. models can be speficied as they would be on paper. The code is also parallelised, see section ________ for more details.
To both simplify the transformation of the utility functions into C++ code and aid the readability of the scripts, a small amount of syntax has been introduced:
Variables from the dataset must be prefixed with a $,
coefficients with a @
Every line must end with a ;
Intermediate variables that are calculated don’t get prefixed by anything
The utility functions are prefixed by U_. “U_bus” or “U_4” are valid, but U_geer.erg is not.
Draws are prefixed by “draw”. If you pass nDraws into the
estimate function, a suitable set of halton draws will be generated
automatically. If you pass in a draws matrix, it needs to be large
enough. there is a helper function to do this:
mixl::create_halton_draws(Nindividuals, Ndraws, draw_dimensions)
For powers, you need to use pow (x,y) instead of x^y
The availabilities must to be provided as a matrix of \(n*k\) where \(n\) is the number of rows in the data, and \(k\) is the number of utility functions. The helper functions and can help here
Below are two sample model definitions to get you started. The package automatically detects mixing, and generates the loglikelihood function appropriately. A simple mnl model, using the Train dataset could look like the following:
U_A = @B_price * $price_A / 1000 + @B_timeA * $time_A / 60 + @B_change * $change_A;
U_B = @ASC_B + @B_price * $price_B / 1000 + @B_timeB * $time_B / 60;
Here, the coefficients ASC_B, B_price, B_time, B_timeB and B_change would be estimated.
If we extend this to a simple mixed model, the additional random coefficients SIGMA_B and SIGMA_B2 are estimated. the declaration of ASC_B_RND helps the readability of the utility function:
ASC_B_RND = @ASC_B + draw_1 * @SIGMA_B + draw_2 * @SIGMA_B2;
U_A = @B_price * $price_A / 1000 + @B_timeA * $time_A / 60 + @B_change * $change_A;
U_B = ASC_B_RND + @B_price * $price_B / 1000 + @B_timeB * $time_B / 60;
As clean as this interface is, there are some limits to what the package can handle. The modeller needs to make sure that their dataset has two columns, ID and CHOICE.
U_x = 1
will give a utility of zero.#M maximum simulated likelihood estimation A set choice model
generated through this package is estimated using the
maximumLikelihood
function. It wraps the maxLik procedure
from the maxLik package, which then accepts the optimised likelihood
function generated from a utility script. The maxLik package is stable
and well tested. To improve the covariance estimates, the Hessian
function from the numDeriv package, rather than the default from the
maxLik package is used. The BFGS (Broyden–Fletcher–Goldfarb–Shanno
algorithm) method is used by deafault. This has found to converge faster
and more reliably than other methods such as Newton-Rhapson.
Even for simple MNL models, the numerical gradient of the loglikelihood function is used, as it is envisaged that this package will primarily be used for estimating mixed models, where simulation is necessary due to the use of random draws.
Parallelisation is provided through the openMP. Where this functionality is available (primarily on linux, but it can be enabled on OSX with some effort) the loglikelihood function can be run in parallel with near linear speedups. The parallel performance is largely due to the ‘embarassingly parallel’ nature of the problem. For each observation, only \(k\) utilities are calculated, and each observation is independent. The input data, namely the variables, coefficients and random draws are modified during the calculation. This means that the only variable jointly written by each processing core is the output matrix of probabilities. Since the update operation is only an (associative) addition, this operation can be performed atomically, requiring no locking sections of code.
Probit models are currently not supported, though that is planned for the near future.
The package also supports hybrid choice models. Declared variables
with the prefix P_indic_
will be considered as probability
indicators for each observation. The compiler detects these
automatically, and automatically includes the code to include the
product of the probability indicators in the loglikelihood. The model
estimation will return both the choice loglikelihood and the model
loglikelihood. In psuedo code, the inclusion of the indicators looks
like this:
p_choice = log(chosen_utility / sum(utilities) );
p_indic_total = P_indic_1 * P_indic_2 * .... P_indic_k;
p_choice = p_choice + (1/num_obs)*log(p_indic_total);
One extra column is required in the dataset to enable hybrid choice,
namely a count
column with the total number of observations
for the individual making the choice.
To give maximum flexibility to the user, there are no restrictions on
which probability distributions can be used in the probability
indicators. In R, it is common to use functions such as
rnorm
or dnorm
from the stats package. These
can still be used, but the syntax is a little different, as they must be
called from the C++ code. hence, where in R one would call the
following
dnorm(x, mean, sd)
in the utility script, one must write
R::dnorm(x, mean, sd, 0);
For a list of distributions which can be used, please refer to the Rcpp documentation