Data preparation is a common task in research, which usually takes the most amount of time in the analytical process. sjmisc is a package with special focus on transformation of variables that fits into the workflow and design-philosophy of the so-called “tidyverse”.
Basically, this package complements the dplyr package in that sjmisc takes over data transformation tasks on variables, like recoding, dichotomizing or grouping variables, setting and replacing missing values, etc. A distinctive feature of sjmisc is the support for labelled data, which is especially useful for users who often work with data sets from other statistical software packages like SPSS or Stata.
This vignette demonstrate some of the important recoding-functions in
sjmisc. The examples are based on data from the
EUROFAMCARE project, a survey on the situation of family carers of older
people in Europe. The sample data set efc
is part of this
package.
To show the results after recoding variables, the frq()
function is used to print frequency tables.
dicho()
dichotomizes variables into “dummy” variables
(with 0/1 coding). Dichotomization is either done by median, mean or a
specific value (see argument dich.by
).
Like all recoding-functions in sjmisc,
dicho()
returns the complete data frame including
the recoded variables, if the first argument is a
data.frame
. If the first argument is a vector, only the
recoded variable is returned. See this
vignette for details about the function-design.
If dicho()
returns a data frame, the recoded variables
have the same name as the original variable, including a suffix
_d
.
# age, ranged from 65 to 104, in this output
# grouped to get a shorter table
frq(efc, e17age, auto.grp = 5)
#> elder' age (e17age) <numeric>
#> # total N=908 valid N=891 mean=79.12 sd=8.09
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> 1 | 65-72 | 212 | 23.35 | 23.79 | 23.79
#> 2 | 73-80 | 277 | 30.51 | 31.09 | 54.88
#> 3 | 81-88 | 270 | 29.74 | 30.30 | 85.19
#> 4 | 89-96 | 124 | 13.66 | 13.92 | 99.10
#> 5 | 97-104 | 8 | 0.88 | 0.90 | 100.00
#> <NA> | <NA> | 17 | 1.87 | <NA> | <NA>
# splitting is done at the median by default:
median(efc$e17age, na.rm = TRUE)
#> [1] 79
# the recoded variable is now named "e17age_d"
efc <- dicho(efc, e17age)
frq(efc, e17age_d)
#> elder' age (e17age_d) <categorical>
#> # total N=908 valid N=891 mean=0.49 sd=0.50
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 0 | 455 | 50.11 | 51.07 | 51.07
#> 1 | 436 | 48.02 | 48.93 | 100.00
#> <NA> | 17 | 1.87 | <NA> | <NA>
As dicho()
, like all recoding-functions, supports labelled data,
the variable preserves it variable label (but not the value labels). You
can directly define value labels inside the function:
x <- dicho(efc$e17age, val.labels = c("young age", "old age"))
frq(x)
#> elder' age (x) <categorical>
#> # total N=908 valid N=891 mean=0.49 sd=0.50
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> --------------------------------------------------
#> 0 | young age | 455 | 50.11 | 51.07 | 51.07
#> 1 | old age | 436 | 48.02 | 48.93 | 100.00
#> <NA> | <NA> | 17 | 1.87 | <NA> | <NA>
To split a variable at a different value, use the
dich.by
-argument. The value specified in
dich.by
is inclusive, i.e. all values from lowest
to and including dich.by
are recoded into the lower
category, while all values above dich.by
are
recoded into the higher category.
# split at upper quartile
x <- dicho(
efc$e17age,
dich.by = quantile(efc$e17age, probs = .75, na.rm = TRUE),
val.labels = c("younger three quarters", "oldest quarter")
)
frq(x)
#> elder' age (x) <categorical>
#> # total N=908 valid N=891 mean=0.24 sd=0.43
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> ---------------------------------------------------------------
#> 0 | younger three quarters | 678 | 74.67 | 76.09 | 76.09
#> 1 | oldest quarter | 213 | 23.46 | 23.91 | 100.00
#> <NA> | <NA> | 17 | 1.87 | <NA> | <NA>
Since the distribution of values in a dataset may differ for
different subgroups, all recoding-functions also work on grouped data
frames. In the following example, first, the age-variable
e17age
is dichotomized at the median. Then, the data is
grouped by gender (c161sex
) and the dichotomization is done
for each subgroup, i.e. it once relates to the median age in the
subgroup of female, and once to the median age in the subgroup of male
family carers.
data(efc)
x1 <- dicho(efc$e17age)
x2 <- efc %>%
dplyr::group_by(c161sex) %>%
dicho(e17age) %>%
dplyr::pull(e17age_d)
# median age of total sample
frq(x1)
#> elder' age (x) <categorical>
#> # total N=908 valid N=891 mean=0.49 sd=0.50
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 0 | 455 | 50.11 | 51.07 | 51.07
#> 1 | 436 | 48.02 | 48.93 | 100.00
#> <NA> | 17 | 1.87 | <NA> | <NA>
# median age of total sample, with median-split applied
# to distribution of age by subgroups of gender
frq(x2)
#> elder' age (x) <numeric>
#> # total N=908 valid N=891 mean=1.50 sd=0.50
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 1 | 449 | 49.45 | 50.39 | 50.39
#> 2 | 442 | 48.68 | 49.61 | 100.00
#> <NA> | 17 | 1.87 | <NA> | <NA>
split_var()
recodes numeric variables into equal sized
groups, i.e. a variable is cut into a smaller number of groups at
specific cut points. The amount of groups depends on the
n
-argument and cuts a variable into n
quantiles.
Similar to dicho()
, if the first argument in
split_var()
is a data frame, the complete data frame
including the new recoded variable(s), with suffix _g
, is
returned.
x <- split_var(efc$e17age, n = 3)
frq(x)
#> elder' age (x) <categorical>
#> # total N=908 valid N=891 mean=2.05 sd=0.82
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 1 | 274 | 30.18 | 30.75 | 30.75
#> 2 | 294 | 32.38 | 33.00 | 63.75
#> 3 | 323 | 35.57 | 36.25 | 100.00
#> <NA> | 17 | 1.87 | <NA> | <NA>
Unlike dplyr’s ntile()
, split_var()
never
splits a value into two different categories, i.e. you always get a
“clean” separation of original categories. In other words: cases that
have identical values in a variable will always be recoded into the same
group. The following example demonstrates the differences:
x <- dplyr::ntile(efc$neg_c_7, n = 3)
# for some cases, value "10" is recoded into category "1",
# for other cases into category "2". Same is true for value "13"
table(efc$neg_c_7, x)
#> x
#> 1 2 3
#> 7 75 0 0
#> 8 99 0 0
#> 9 106 0 0
#> 10 18 102 0
#> 11 0 96 0
#> 12 0 85 0
#> 13 0 14 50
#> 14 0 0 54
#> 15 0 0 45
#> 16 0 0 30
#> 17 0 0 35
#> 18 0 0 26
#> 19 0 0 16
#> 20 0 0 16
#> 21 0 0 2
#> 22 0 0 7
#> 23 0 0 4
#> 24 0 0 3
#> 25 0 0 6
#> 27 0 0 1
#> 28 0 0 2
x <- split_var(efc$neg_c_7, n = 3)
# no separation of cases with identical values.
table(efc$neg_c_7, x)
#> x
#> 1 2 3
#> 7 75 0 0
#> 8 99 0 0
#> 9 106 0 0
#> 10 0 120 0
#> 11 0 96 0
#> 12 0 85 0
#> 13 0 0 64
#> 14 0 0 54
#> 15 0 0 45
#> 16 0 0 30
#> 17 0 0 35
#> 18 0 0 26
#> 19 0 0 16
#> 20 0 0 16
#> 21 0 0 2
#> 22 0 0 7
#> 23 0 0 4
#> 24 0 0 3
#> 25 0 0 6
#> 27 0 0 1
#> 28 0 0 2
split_var()
, unlike ntile()
, does therefor
not always return exactly equal-sized groups:
x <- dplyr::ntile(efc$neg_c_7, n = 3)
frq(x)
#> x <integer>
#> # total N=908 valid N=892 mean=2.00 sd=0.82
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 1 | 298 | 32.82 | 33.41 | 33.41
#> 2 | 297 | 32.71 | 33.30 | 66.70
#> 3 | 297 | 32.71 | 33.30 | 100.00
#> <NA> | 16 | 1.76 | <NA> | <NA>
x <- split_var(efc$neg_c_7, n = 3)
frq(x)
#> Negative impact with 7 items (x) <categorical>
#> # total N=908 valid N=892 mean=2.03 sd=0.81
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 1 | 280 | 30.84 | 31.39 | 31.39
#> 2 | 301 | 33.15 | 33.74 | 65.13
#> 3 | 311 | 34.25 | 34.87 | 100.00
#> <NA> | 16 | 1.76 | <NA> | <NA>
With group_var()
, variables can be grouped into equal
ranged categories, i.e. a variable is cut into a smaller number of
groups, where each group has the same value range.
group_labels()
creates the related value labels.
The range of the groups is defined in the size
-argument.
At the same time, the size
-argument also defines the
lower bound of one of the groups.
For instance, if the lowest value of a variable is 1 and the maximum
is 10, and size = 5
, then
This means, that an equal-ranged grouping will define groups from 0 to 4, 5 to 9 and 10-14. Each of these groups has a range of 5, and one of the groups starts with the value 5.
The group assignment becomes clearer, when
group_labels()
is used in parallel:
set.seed(123)
x <- round(runif(n = 150, 1, 10))
frq(x)
#> x <numeric>
#> # total N=150 valid N=150 mean=5.52 sd=2.63
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -------------------------------------
#> 1 | 6 | 4.00 | 4.00 | 4.00
#> 2 | 19 | 12.67 | 12.67 | 16.67
#> 3 | 16 | 10.67 | 10.67 | 27.33
#> 4 | 17 | 11.33 | 11.33 | 38.67
#> 5 | 20 | 13.33 | 13.33 | 52.00
#> 6 | 12 | 8.00 | 8.00 | 60.00
#> 7 | 19 | 12.67 | 12.67 | 72.67
#> 8 | 16 | 10.67 | 10.67 | 83.33
#> 9 | 15 | 10.00 | 10.00 | 93.33
#> 10 | 10 | 6.67 | 6.67 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(group_var(x, size = 5))
#> x <numeric>
#> # total N=150 valid N=150 mean=1.68 sd=0.59
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -------------------------------------
#> 1 | 58 | 38.67 | 38.67 | 38.67
#> 2 | 82 | 54.67 | 54.67 | 93.33
#> 3 | 10 | 6.67 | 6.67 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
group_labels(x, size = 5)
#> [1] "0-4" "5-9" "10-14"
dummy <- group_var(x, size = 5, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 5)
frq(dummy)
#> x <categorical>
#> # total N=150 valid N=150 mean=1.68 sd=0.59
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -------------------------------------
#> 0-4 | 58 | 38.67 | 38.67 | 38.67
#> 5-9 | 82 | 54.67 | 54.67 | 93.33
#> 10-14 | 10 | 6.67 | 6.67 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
dummy <- group_var(x, size = 3, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 3)
frq(dummy)
#> x <categorical>
#> # total N=150 valid N=150 mean=2.48 sd=0.96
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -------------------------------------
#> 0-2 | 25 | 16.67 | 16.67 | 16.67
#> 3-5 | 53 | 35.33 | 35.33 | 52.00
#> 6-8 | 47 | 31.33 | 31.33 | 83.33
#> 9-11 | 25 | 16.67 | 16.67 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
The argument right.interval
can be used when
size
should indicate the upper bound of a
group-range.
dummy <- group_var(x, size = 4, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 4)
frq(dummy)
#> x <categorical>
#> # total N=150 valid N=150 mean=2.00 sd=0.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -------------------------------------
#> 0-3 | 41 | 27.33 | 27.33 | 27.33
#> 4-7 | 68 | 45.33 | 45.33 | 72.67
#> 8-11 | 41 | 27.33 | 27.33 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
dummy <- group_var(x, size = 4, as.num = FALSE, right.interval = TRUE)
levels(dummy) <- group_labels(x, size = 4, right.interval = TRUE)
frq(dummy)
#> x <categorical>
#> # total N=150 valid N=150 mean=1.78 sd=0.71
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -------------------------------------
#> 1-4 | 58 | 38.67 | 38.67 | 38.67
#> 5-8 | 67 | 44.67 | 44.67 | 83.33
#> 9-12 | 25 | 16.67 | 16.67 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
rec()
recodes old values of variables into new values,
and can be considered as a “classical” recode-function. The
recode-pattern, i.e. which new values should replace the old values, is
defined in the rec
-argument. This argument has a specific
“syntax”:
recode pairs: Each recode pair has to be
separated by a ;, e.g. rec = "1=1; 2=4; 3=2; 4=3"
multiple values: Multiple old values that should
be recoded into a new single value may be separated with comma,
e.g. rec = "1,2=1; 3,4=2"
value range: A value range is indicated by a
colon, e.g. rec = "1:4=1; 5:8=2"
(recodes all values from 1
to 4 into 1, and from 5 to 8 into 2)
value range for doubles: For double vectors
(with fractional part), all values within the specified range are
recoded; e.g. rec = "1:2.5=1;2.6:3=2"
recodes 1 to 2.5 into
1 and 2.6 to 3 into 2, but 2.55 would not be recoded (since it’s not
included in any of the specified ranges)
“min” and “max”: Minimum and maximum values are
indicates by min
(or lo
) and max
(or hi
), e.g. rec = "min:4=1; 5:max=2"
(recodes all values from minimum values of x to 4 into 1, and from 5 to
maximum values of x into 2) You can also use min
or
max
to recode a value into the minimum or maximum value of
a variable, e.g. rec = "min:4=1; 5:7=max"
(recodes all
values from minimum values of x to 4 into 1, and from 5 to 7 into the
maximum value of x).
“else”: All other values, which have not been
specified yet, are indicated by else,
e.g. rec = "3=1; 1=2; else=3"
(recodes 3 into 1, 1 into 2
and all other values into 3)
“copy”: The "else"
-token can be
combined with "copy"
, indicating that all remaining, not
yet recoded values should stay the same (are copied from the original
value), e.g. rec = "3=1; 1=2; else=copy"
(recodes 3 into 1,
1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded,
but copied.
NA’s: NA
values are allowed both as
old and new value, e.g. rec = "NA=1; 3:5=NA"
(recodes all
NA
into 1, and all values from 3 to 5 into NA in the new
variable)
“rev”: "rev"
is a special token
that reverses the value order.
direct value labelling: Value labels for new
values can be assigned inside the recode pattern by writing the value
label in square brackets after defining the new value in a recode pair,
e.g. rec = "15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]"
non-captured values: Non-matching values will be
set to NA
, unless captured by the "else"
- or
"copy"
-token.
Here are some examples:
frq(efc$e42dep)
#> elder's dependency (x) <numeric>
#> # total N=908 valid N=901 mean=2.94 sd=0.94
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -------------------------------------------------------------
#> 1 | independent | 66 | 7.27 | 7.33 | 7.33
#> 2 | slightly dependent | 225 | 24.78 | 24.97 | 32.30
#> 3 | moderately dependent | 306 | 33.70 | 33.96 | 66.26
#> 4 | severely dependent | 304 | 33.48 | 33.74 | 100.00
#> <NA> | <NA> | 7 | 0.77 | <NA> | <NA>
# replace NA with 5
frq(rec(efc$e42dep, rec = "NA=5;else=copy"))
#> elder's dependency (x) <numeric>
#> # total N=908 valid N=908 mean=2.96 sd=0.95
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -------------------------------------------------------------
#> 1 | independent | 66 | 7.27 | 7.27 | 7.27
#> 2 | slightly dependent | 225 | 24.78 | 24.78 | 32.05
#> 3 | moderately dependent | 306 | 33.70 | 33.70 | 65.75
#> 4 | severely dependent | 304 | 33.48 | 33.48 | 99.23
#> 5 | 5 | 7 | 0.77 | 0.77 | 100.00
#> <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
# recode 1 to 2 into 1 and 3 to 4 into 2
frq(rec(efc$e42dep, rec = "1,2=1; 3,4=2"))
#> elder's dependency (x) <numeric>
#> # total N=908 valid N=901 mean=1.68 sd=0.47
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 1 | 291 | 32.05 | 32.30 | 32.30
#> 2 | 610 | 67.18 | 67.70 | 100.00
#> <NA> | 7 | 0.77 | <NA> | <NA>
# recode 1 to 3 into 4 into 2
frq(rec(efc$e42dep, rec = "min:3=1; 4=2"))
#> elder's dependency (x) <numeric>
#> # total N=908 valid N=901 mean=1.34 sd=0.47
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 1 | 597 | 65.75 | 66.26 | 66.26
#> 2 | 304 | 33.48 | 33.74 | 100.00
#> <NA> | 7 | 0.77 | <NA> | <NA>
# recode numeric to character, and remaining values
# into the highest value (="hi") of e42dep
frq(rec(efc$e42dep, rec = "1=first;2=2nd;else=hi"))
#> elder's dependency (x) <character>
#> # total N=908 valid N=901 mean=2.43 sd=0.86
#>
#> Value | N | Raw % | Valid % | Cum. %
#> --------------------------------------
#> 2nd | 225 | 24.78 | 24.97 | 24.97
#> first | 66 | 7.27 | 7.33 | 32.30
#> hi | 610 | 67.18 | 67.70 | 100.00
#> <NA> | 7 | 0.77 | <NA> | <NA>
data(iris)
frq(rec(iris, Species, rec = "setosa=huhu; else=copy", append = FALSE))
#> Species_r <categorical>
#> # total N=150 valid N=150 mean=2.00 sd=0.82
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ------------------------------------------
#> huhu | 50 | 33.33 | 33.33 | 33.33
#> versicolor | 50 | 33.33 | 33.33 | 66.67
#> virginica | 50 | 33.33 | 33.33 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
# works with mutate
efc %>%
dplyr::select(e42dep, e17age) %>%
dplyr::mutate(dependency_rev = rec(e42dep, rec = "rev")) %>%
head()
#> e42dep e17age dependency_rev
#> 1 3 83 2
#> 2 3 88 2
#> 3 3 82 2
#> 4 4 67 1
#> 5 4 84 1
#> 6 4 85 1
# recode multiple variables and set value labels via recode-syntax
dummy <- rec(
efc, c160age, e17age,
rec = "15:30=1 [young]; 31:55=2 [middle]; 56:max=3 [old]",
append = FALSE
)
frq(dummy)
#> carer' age (c160age_r) <numeric>
#> # total N=908 valid N=901 mean=2.40 sd=0.59
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> 1 | young | 48 | 5.29 | 5.33 | 5.33
#> 2 | middle | 442 | 48.68 | 49.06 | 54.38
#> 3 | old | 411 | 45.26 | 45.62 | 100.00
#> <NA> | <NA> | 7 | 0.77 | <NA> | <NA>
#>
#> elder' age (e17age_r) <numeric>
#> # total N=908 valid N=891 mean=3.00 sd=0.00
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> 1 | young | 0 | 0.00 | 0 | 0
#> 2 | middle | 0 | 0.00 | 0 | 0
#> 3 | old | 891 | 98.13 | 100 | 100
#> <NA> | <NA> | 17 | 1.87 | <NA> | <NA>
Where applicable, the recoding-functions in sjmisc
have “scoped” versions as well, e.g. dicho_if()
or
split_var_if()
, where transformation will be applied only
to those variables that match the logical condition of
predicate
.