Crunch supports more types of variables than many of the data formats
from which you can create datasets. Plain-text .csv files, for example,
can’t express that some columns are actually indicator matrices of
multiple selections (Multiple Response Variables). Most SPSS .sav files
do not indicate Categorical Arrays (otherwise known as “grids”) as being
part of a group—they are simply several categorical variables. The same
is true for an R data.frame.
You can use crunch to “bind” categorical variables into
Multiple Response and Categorical Array variables.
One of the reasons to use R with Crunch is to leverage the power of
scripting for tasks that would be repetitive in a GUI. Many crunch
functions operating on Crunch datasets have an optional pattern
argument that lets you use regular expressions for
these “bulk” operations.
In our sample Economist dataset, we have a set of variables prefixed with “imiss”:
## [1] "imiss_b" "imiss_c" "imiss_d" "imiss_f" "imiss_g" "imiss_h" "imiss_j" "imiss_m" "imiss_p" "imiss_q" "imiss_r"
## [12] "imiss_s" "imiss_t"
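A listing like the one above can be produced by filtering variable names with a regular expression. The sketch below uses a small hypothetical character vector, var_names, standing in for names(ds) so that it runs without a Crunch session; the non-imiss names in it are made up for illustration.

```r
# Hypothetical stand-in for names(ds); only the imiss_* names come from the text
var_names <- c("age", "imiss_b", "imiss_c", "imiss_d", "boap_2", "gender")

# "^imiss_" anchors the match at the start of the name;
# value = TRUE returns the matching names rather than their positions
imiss_names <- grep("^imiss_", var_names, value = TRUE)
print(imiss_names)
```

With a real dataset, the same grep call applied to names(ds) should return the variable names to bind.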
These correspond to a survey grid question about how important respondents consider a set of issues. Examining one of them, we see:
## imiss_b (categorical)
##
##                     Count
## Very Important        196
## Somewhat Important     44
## Not very Important      8
## Unimportant             1
All of these “imiss” categorical variables have the same structure.
We can combine them into a categorical array variable with makeArray:
## Issue importance (categorical_array)
## Subvariables:
## $`imiss_b`
## $`imiss_c`
## $`imiss_d`
## $`imiss_f`
## $`imiss_g`
## $`imiss_h`
## $`imiss_j`
## $`imiss_m`
## $`imiss_p`
## $`imiss_q`
## $`imiss_r`
## $`imiss_s`
## $`imiss_t`
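The call that produced this output does not appear above. A plausible sketch, assuming ds is the Crunch dataset already loaded in an authenticated session and mirroring the makeMR call shown later in this document, is:

```r
library(crunch)

# Assumes `ds` is a Crunch dataset loaded in a logged-in session (not shown here).
# Select the imiss_* variables by name and bind them into one categorical array.
ds$imiss <- makeArray(ds[grep("^imiss_", names(ds))],
    name = "Issue importance")
ds$imiss
```

The name "Issue importance" is taken from the printed output; the variable selection by grep is an assumption about how the subset was built.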
The set of “important issue” variables has gone from thirteen separate categorical variable cards to just one, in which the subvariables are shown as rows and the categories common to all of them are shown as columns.
In our example dataset, the categorical variables imiss_* are no longer
visible in the dataset directly, but we
can access them as “subvariables” of the array we just created.
## Subvariables:
## $`imiss_b`
## $`imiss_c`
## $`imiss_d`
## $`imiss_f`
## $`imiss_g`
## $`imiss_h`
## $`imiss_j`
## $`imiss_m`
## $`imiss_p`
## $`imiss_q`
## $`imiss_r`
## $`imiss_s`
## $`imiss_t`
We can also step into the subvariables and access the underlying categorical variables:
## imiss_b (categorical)
##
##                     Count
## Very Important        196
## Somewhat Important     44
## Not very Important      8
## Unimportant             1
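The two listings above could come from calls like the following. This is a sketch that assumes ds$imiss is the array created earlier; the $ accessor on the array is inferred from the $`imiss_b` labels in the printed output.

```r
library(crunch)

# Assumes `ds$imiss` is the categorical array created earlier.
subvariables(ds$imiss)   # list the array's subvariables
ds$imiss$imiss_b         # step into one subvariable by name
```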
The names imiss_b through imiss_t are
unsatisfying from a human-readability perspective: you can’t tell which
political issues correspond to the variables. Unfortunately, we didn’t
have additional metadata on these survey questions in the data.frame
we imported initially. However, we can rectify this.
Subvariables have methods similar to those for categories. They have a names attribute that we can get:
## [1] "imiss_b" "imiss_c" "imiss_d" "imiss_f" "imiss_g" "imiss_h" "imiss_j" "imiss_m" "imiss_p" "imiss_q" "imiss_r"
## [12] "imiss_s" "imiss_t"
We can set it, too:
names(subvariables(ds$imiss)) <- c("The economy", "Immigration",
    "The environment", "Terrorism", "Gay rights", "Education",
    "Health care", "Social security", "The budget deficit",
    "The war in Afghanistan", "Taxes", "Medicare", "Abortion")
subvariables(ds$imiss)
## Subvariables:
## $`The economy`
## $`Immigration`
## $`The environment`
## $`Terrorism`
## $`Gay rights`
## $`Education`
## $`Health care`
## $`Social security`
## $`The budget deficit`
## $`The war in Afghanistan`
## $`Taxes`
## $`Medicare`
## $`Abortion`
Another useful thing we can do with array subvariables is reorder them. Let’s alphabetize the subvariables:
sorting <- order(names(subvariables(ds$imiss)))
subvariables(ds$imiss) <- subvariables(ds$imiss)[sorting]
subvariables(ds$imiss)
## Subvariables:
## $`Abortion`
## $`Education`
## $`Gay rights`
## $`Health care`
## $`Immigration`
## $`Medicare`
## $`Social security`
## $`Taxes`
## $`Terrorism`
## $`The budget deficit`
## $`The economy`
## $`The environment`
## $`The war in Afghanistan`
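The reordering above relies on base R's order, which returns the index permutation that sorts its input; indexing with that permutation rearranges the elements. A minimal illustration on a plain character vector (a shortened subset of the names, chosen here for brevity, so no Crunch session is needed):

```r
issue_names <- c("The economy", "Immigration", "Terrorism", "Abortion")

# order() gives positions in sorted order, not the sorted values themselves
sorting <- order(issue_names)
print(sorting)

# Indexing by the permutation yields the alphabetized vector
sorted_names <- issue_names[sorting]
print(sorted_names)
```

Any other permutation of indices can be used the same way, for example to move a single subvariable to the front.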
Just as we created a categorical array, we can create a multiple response variable. Like categorical arrays, multiple responses contain a set of subvariables: categorical variables with a common list of categories. However, the subvariables in a multiple response are treated as dichotomous indicators: we specify one or more categories that indicate “selected”, and all other categories count as “not selected”. Hence, when a multiple response appears in the web app, it looks like a single categorical variable, with each subvariable shown like a category. Unlike the categories of a categorical variable, though, the responses are not mutually exclusive, so tabulations with them may not sum to 100 percent.
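To see why multiple response percentages need not sum to 100, consider a toy indicator matrix in base R. The data below are invented for illustration and are unrelated to the Economist dataset.

```r
# Rows are respondents, columns are subvariables; TRUE means "selected"
responses <- matrix(
    c(TRUE,  TRUE,  FALSE,
      TRUE,  FALSE, FALSE,
      TRUE,  TRUE,  TRUE,
      FALSE, TRUE,  FALSE),
    ncol = 3, byrow = TRUE,
    dimnames = list(NULL, c("Economy", "Taxes", "Medicare"))
)

# Share of respondents selecting each item
pct_selected <- 100 * colMeans(responses)
print(pct_selected)

# Respondents can select several items, so the shares can exceed 100 in total
total_pct <- sum(pct_selected)
print(total_pct)
```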
In the Economist dataset, we have another set of parallel categorical variables, “boap”, which indicate approval of President Obama on a range of issues.
## boap_4 (categorical)
##
##                     Count
## Somewhat approve       68
## Strongly disapprove    67
## No opinion             46
## Somewhat disapprove    43
## Strongly approve       26
In the questionnaire that collected this data, “boap” appeared as a grid question, just as “imiss” did. But, for illustration purposes—and to show how you can convert between categorical array and multiple response—let’s treat this as multiple response.
The function makeMR works like makeArray
but with an additional argument, “selections”, in which you name the
category or categories that should count as “selected” in the
dichotomous indicator.
ds$boap <- makeMR(ds[grep("^boap_[0-9]+", names(ds))],
    name="Approval of Obama on issues",
    selections=c("Strongly approve", "Somewhat approve"))
ds$boap
## Approval of Obama on issues (multiple_response)
## Subvariables:
## $`boap_2`
## $`boap_3`
## $`boap_4`
## $`boap_5`
## $`boap_6`
## $`boap_7`
## $`boap_8`
## $`boap_9`
## $`boap_10`
## $`boap_11`
## $`boap_12`
## $`boap_13`
## $`boap_14`
Multiple response variables can be thought of as categorical arrays that have extra metadata indicating which categories are “selected”. This metadata can be manipulated, and we can thus transform categorical arrays into multiple response and vice versa.
The function undichotomize removes the dichotomization metadata:
## Approval of Obama on issues (categorical_array)
## Subvariables:
## $`boap_2`
## $`boap_3`
## $`boap_4`
## $`boap_5`
## $`boap_6`
## $`boap_7`
## $`boap_8`
## $`boap_9`
## $`boap_10`
## $`boap_11`
## $`boap_12`
## $`boap_13`
## $`boap_14`
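The call behind this output is not shown. A sketch, assuming ds$boap is the multiple response variable created above and that the result is assigned back so the local object reflects the change:

```r
library(crunch)

# Assumes `ds$boap` is the multiple response variable created with makeMR above.
# Dropping the "selected" metadata turns it back into a categorical array.
ds$boap <- undichotomize(ds$boap)
ds$boap
```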
We can add that information with dichotomize. Taking our
categorical array “boap,” let’s make it into a multiple response, but
this time, let’s mark only the “Strongly approve” category as selected:
## Approval of Obama on issues (multiple_response)
## Subvariables:
## $`boap_2`
## $`boap_3`
## $`boap_4`
## $`boap_5`
## $`boap_6`
## $`boap_7`
## $`boap_8`
## $`boap_9`
## $`boap_10`
## $`boap_11`
## $`boap_12`
## $`boap_13`
## $`boap_14`
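A call along the following lines would produce this result. It is a sketch that assumes ds$boap is currently a categorical array, as it is after undichotomize, and that the category name matches the one printed for boap_4 earlier.

```r
library(crunch)

# Assumes `ds$boap` is currently a categorical array.
# Mark only "Strongly approve" as the selected category.
ds$boap <- dichotomize(ds$boap, "Strongly approve")
ds$boap
```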
As noted above, when we make an array, its subvariables no longer appear in the dataset outside of the array.
## [1] "boap"
We can access the subvariables and do things with them directly via
the subvariables method, but the case may arise in which we
want to undo our binding of these subvariables into the array. The
function unbind deletes the array variable and restores the
subvariables as top-level variables.
Note the use of refresh after calling unbind. Most functions that modify
objects on the server refresh their local copies in our R session
automatically; however, because unbind doesn’t assign back
into ds, the local dataset object doesn’t get updated with
the change, so we need to refresh it manually.
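The unbind and refresh steps described here might look like the following. This is a sketch assuming ds$boap is the array variable bound above and ds is the dataset object in the current session.

```r
library(crunch)

# Assumes `ds$boap` is the array variable bound above.
unbind(ds$boap)      # deletes the array; subvariables return to the top level
ds <- refresh(ds)    # re-fetch the dataset so the local object reflects the change
```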
Now, if we check the names of ds, we see our full set of
boap_* former subvariables:
## [1] "boap_2" "boap_3" "boap_4" "boap_5" "boap_6" "boap_7" "boap_8" "boap_9" "boap_10" "boap_11" "boap_12"
## [12] "boap_13" "boap_14"