Design behind sparsevctrs

The sparsevctrs package produces 3 things; ALTREP classes, matrix/data.frame converting functions, helper functions. This document outlines the rationale behind each of these and the decisions behind them.

The primary objective of this package is to provide tools to work with sparse data in data.frames/tibbles. The next highest priority is execution speed. This means that algorithms and methods in this package are written to minimize memory allocations whenever possible, once that is done, running the code as fast as we can. These choices are made because this package was written to deal with tasks that were otherwise not possible due to memory constraints.

Altrep Functions

The functions sparse_double() and its relatives are used to construct sparse vectors of the noted type. To work they all need 4 pieces of information:

values
positions
length
default (defaults to 0)

The values need to match the type of the function name or be easily coerced into the type (double -> integer). The positions should be integers or doubles that can losslessly be turned into integers. The length should be a single non-negative integer-like value.

Values and positions are paired, and will thus be expected to be the same length, furthermore, positions are expected to be sorted in increasing order with no duplicates. The ordering is done to let the various extraction methods work as efficiently as possible.

These functions have quite strict input checking by choice, to allow the inner workings to be as efficient as possible.

The input of these functions mirrors the values stored in the ALTREP class that they produce.

Converting Functions

3 functions fall into this category:

coerce_to_sparse_data_frame()
coerce_to_sparse_tibble()
coerce_to_sparse_matrix()

the first two take a sparse matrix from the Matrix package and produce a data.frame/tibble with sparse columns. The last one takes a data.frame/tibble with sparse columns and produces a sparse matrix using the Matrix package.

These functions are expected to be inverse of each other, such that coerce_to_sparse_matrix(coerce_to_sparse_data_frame(x)) returns x back. They are made to be highly performant both in terms of speed and memory consumption, Meaning that sparsity is applied when appropriate.

These functions have quite strict input checking by choice, to allow the inner workings to be as efficient as possible. It is in part why data.frames with sparse vectors with different can’t be used with coerce_to_sparse_matrix() yet.

Helper Functions

There are 3 types of helper functions. First, we have the is_* family of functions. The specific is_sparse_double() and more general is_sparse_vector() can be used as a way to determine whether a vector is an ALTREP sparse vector. This is otherwise hard to tell as as.numeric() can’t tell the difference.

Secondly, we have the extraction functions. They are sparse_values() and sparse_positions(). These extract the values and positions respectively, without materializing the whole dense vector. These functions are made to work with non-sparse vectors as well to make them more ergonomic for the user. Internally they call is_sparse_vector(), so the choice to return something useful as the alternative wasn’t hard. There is no sparse_length() function as length() works with these types of

The last type of helper function is less clearly defined and is expanded as needed. The functions provide alternatives to functions that don’t have ALTREP support. Such as mean(). Calling mean() on a sparse vector will force materialization, and then calculate the mean. This is memory inefficient as it could have been calculated like so.

sum(sparse_values(x)) / length(x)

These functions, all starting with the name prefix sparse_*, are made to work with non-sparse vectors for the same reasons listed above regarding ergonomic use.

FAQ

Why aren’t the results returned as {vctrs} classes?

As it stands right now, it is viewed to be beneficial to have the users not be alerted to these vectors as they are expected to be used internally in packages and rarely by the end user. Furthermore having these sparse vectors produce the same result as dense vectors with class() is a big plus.

Will this package try to replace the {Matrix} package?

Not at all. The sparse vector types provided in this package mimic those created with Matrix::sparseVector(). They work with different types and allow for different defaults. None of the matrix operations will be reimplemented here.