The Crunch web app and the crunch
package are both built
on a REST API. Users can
interact with very large datasets because most of the heavy computation
is done on the Crunch servers and not on the users computer. In most
cases, you don’t need to know how the API works—the R package handles
the HTTP requests and responses and presents meaningful objects and
methods to you. To go beyond the basics, though, it can be useful to
understand how the API works so that you can make more complex or more
efficient (read: faster) operations.
When you open a dataset in the Crunch web app, what happens is that the app sends an HTTP request to the Crunch server for information about that dataset. It gets the dataset description and a list of variables contained within that dataset, but the actual data is stored on the server and not sent to the app. The app only loads information about variables that are actually going to be displayed on the screen, which is why very large datasets load so quickly. You might have millions of rows in your dataset, but the web app only asks the server for the summary statistics it needs to display the variable card.
This kind of minimally necessary information are stored in objects
called catalogs. There is a VariableCatalog
, which has
information about a dataset’s variables; a DatasetCatalog
,
which has information about your datasets, and many others. You can
think of catalogs just like the Sears Catalog: they are lists of things
and descriptions about those things, but they are not the things
themselves. You might flip through a catalog looking for the couch, but
you need to make a special order if you actually want the couch
delivered.
The R package relates to the Crunch API in more or less the same manner as the web app. When you load a Crunch dataset from the server you are not typically loading the actual data, but instead you are loading a catalog representation of that dataset which is stored in a list. This object includes things like the dataset identifier, the variable names and types, and the dataset dimensions, but the actual data stays on the server. When calculations are performed on that data, an HTTP request is sent to the server, the server calculates the answer, and the results are sent back to the R session. This is what allows you to calculate statistics on objects which don’t fit in memory: the calculations are done remotely and only the result of that calculation is sent back to you.
Many Crunch functions both retrieve information from the server and allow the user to set information. For instance if you want to retrieve a list of datasets associated with a project you could call the [datasets()] function like this:
proj <- projects()[["Project name"]]
datasets(proj)
What happens under the hood when you run this code is that R sends an
HTTP request to the server asking for the datasets associated with a
particular project. This is the “getter” side of the
datasets()
function. However, the function also allows the
user to change the datasets associated with that function using the
assignment operator.
ds <- loadDataset(mydatasets[["Dataset name"]])
datasets(proj) <- ds
Internally there is actually a second method
datasets<-
that takes the value on the right hand side
of the <-
operator and posts that value to the datasets
attribute of the project catalog. The projects catalog will then update
to reflect that a dataset belongs to a particular catalog, and that will
be reflected in the web app. Similar patterns happen when you get and
set attributes on objects, like “names”.
Crunch cubes are generally used to cross tabulate different variables
in a crunch dataset. For instance if you cross tabulate two variables
with crtabs(~cyl + gear, data = ds)
, you get a 2
dimensional cube which looks just like a matrix. Cross tabbing three
variables leads to a 3d cube, etc. This simple case gets quite a bit
more complicated when you add array variables to the cube:
Multiple response
Multiple response variables are themselves 2d arrays with the items or responses on one dimension and the selection status on the second dimension. The result is that a categorical-by-MR cube ends up with three dimensions: the categorical variable categories, the MR items/responses, and the MR selection status.
Categorical array Categorical arrays are also represented cubes as 2d arrays. With the items/responses (the subvariable labels) on one dimension, and their categories on another dimension.
The main feature of representing these counts as high dimensional
arrays is that it makes a number of computational tasks a lot easier,
but the problem is that you can quickly end up with cubes which are
difficult for humans to conceptualize. For example if you created a cube
with one MR variable which had 5 categories, one MR variable with 6
categories, and a categorical variable with 10 categories it would have
5 dimensions with 2700 (5*3 * 6*3 * 10)
entries. This is
hard to understand and to print, and as a result we try to figure out
the parts of the cube which the user is interested in and print that
sub-cube. For the purposes of this document we need three terms for
cubes:
Real cube: The cube with all of the underlying data User cube: The cube the user thinks they’re interacting with. This is the same as the real cube without MR selection dimensions Printed cube The cube which prints to the console. The same as the the user cube but with missing categories suppressed.
The user cube differs from the real cube in that it doesn’t include
MR selection dimensions. We assume that when the user is including an MR
in a cube, they really only care about the parts of the cube where the
MR is selected. So the 5d real cube listed above would become a 3d user
cube with 300 entries (5 * 6 * 10
) because we are selecting
the slices of the cube where each MR selection dimension is equal to
“Selected” and are able to drop the selections dimensions
The printed cube differs from the user cube in that it doesn’t
include missing categories. We assume that the user mostly wants to look
at categories which are not-missing, and that they don’t care that much
about the parts of the cube which represent missing categories. If the
categorical variable above had two missing categories: “No Data” and
“Not Answered” it would reduce the number of entries to 240
(5 * 6 * 8
) . This behavior can be changed by setting the
cube@useNA
slot to “always”
. The function
setCubeNa()
handles setting this slot on the cube.
This structure gets a little tricky when we subset the cube. For
instance the cube we’ve been working with has 5 dimensions, but to the
user appears to only have 3, so should they subset it as a 5 dimensional
array like cube[1, 1, 2:3, 1, 5]
or a 3 dimensional array
cube[1, 2:3, 5]
. Similarly, if a dimension is missing,
should that be included in the subset. For example if a cube has three
entries A, B, C, along one dimension, and B is hidden, should
cube[ , 1:2]
return A, A & B, or A &C? Similarly
should dim(cube)
show the dimensions of the real cube or
the printed cube?
We resolve this problem by allowing the user to select whether they
want the cube to represent the user cube or the printed cube, and make
all of the behavior of those cubes consistent. The
cube@useNA
controls whether the user wanted to interact
with the printed cube or the user cube and they can
showMissing()
and hideMissing()
functions to
control which cube they are interacting with
> dim(cube)
[1] 5 6 8
> dim(showMissing(cube))
[1] 5 6 10
This makes subsetting a bit easier because we can enforce the dimensions of the printed cube. For instance
cube[5, 6, 10]
Errors because the user supplied the wrong number of dimensions while
showMissing(cube)[cube[5, 6, 10]
Returns the correct data they were looking for.
Subsetting the printed cube
There are two cases for dropping dimensions from the printed and user cubes which need special handling.
Dropping MR variable
MR dimensions are always dropped together. Consider a cube composed
of a multiple response variable and a categorical variable. The real
cube has three dimensions, and the printed cube has two, the first of
which represents the multiple response variable. If the user asks for
cube[1, ]
both MR dimensions are dropped and a one
dimensional cube is returned.
Subsetting dimensions with missing values
When the user is interacting with the the printed cube, they are
relating to a cube without missing categories. As a result when they
subset this cube, missing categories should not be carried forward. For
instance if a cube has two categorical dimensions with one missing
category each, dim(cube)
will return c(2, 2)
dim(showMissing(cube))
, will return c(3, 3)
.
If the user subsets either of these dimensions, for instance with cube[
, 2] the missing categories of the second dimension will be dropped
(because the user is explicitly not selecting that category), and that
dimension will be dropped. If the user wants to subset a cube to
preserve missing categories, they need to subset the user cube with
showMissing(cube)[, c(2, 4)]