Introduction to working with code lists

Jan van der Laan

The codelist package ships with an example code list and a data set that uses codes from that code list. We will start by demonstrating how the package works using these examples.

Let’s load the example code list:

> library(codelist)
> data(objectcodes)
> objectcodes
   code                  label parent locale missing
1     A                   Toys   <NA>     EN       0
2     B                  Tools   <NA>     EN       0
3   A01             Teddy Bear      A     EN       0
4   A02                Toy Car      A     EN       0
5   A03                Marbles      A     EN       0
6   B01                 Hammer      B     EN       0
7   B02         Electric Drill      B     EN       0
8     A              Speelgoed   <NA>     NL       0
9     B            Gereedschap   <NA>     NL       0
10  A01              Teddybeer      A     NL       0
11  A02          Speelgoedauto      A     NL       0
12  A03               Knikkers      A     NL       0
13  B01                  Hamer      B     NL       0
14  B02            Boormachine      B     NL       0
15    X         Unknown object   <NA>     EN       1
16    X Onbekend type voorwerp   <NA>     NL       1

We see that the code list contains codes for encoding various types of objects. A code list contains at minimum a ‘code’ and a ‘label’ column. The ‘code’ column can be of any type; the ‘label’ column should be a character column. With the ‘parent’ column it is possible to define simple hierarchies. This column should contain codes from the ‘code’ column; a missing value indicates a top-level code. With the ‘locale’ column it is possible to have different versions of the ‘label’ and ‘description’ columns (the ‘description’ column is absent in this example). It can be used for different translations, as here, but could also be used for different versions of the labels and descriptions. The ‘missing’ column indicates whether or not the code should be treated as a missing value. This column should be interpretable as a logical column.
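The structural rules above can be checked with a few lines of base R. The following is a minimal sketch and not part of the codelist package: the data frame `mycodes` and the helper `check_codelist` are hypothetical names introduced only for illustration.

```r
# A small hand-made code list with the columns described above.
mycodes <- data.frame(
  code    = c("A", "A01", "A02", "X"),
  label   = c("Toys", "Teddy Bear", "Toy Car", "Unknown object"),
  parent  = c(NA, "A", "A", NA),
  locale  = "EN",
  missing = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns TRUE when the basic rules hold and
# stops with an error otherwise.
check_codelist <- function(cl) {
  stopifnot(
    all(c("code", "label") %in% names(cl)),
    is.character(cl$label)
  )
  if ("parent" %in% names(cl)) {
    # Every non-missing parent must itself be a code.
    stopifnot(all(is.na(cl$parent) | cl$parent %in% cl$code))
  }
  if ("missing" %in% names(cl)) {
    # The column must be interpretable as logical.
    stopifnot(!anyNA(as.logical(cl$missing)))
  }
  TRUE
}

check_codelist(mycodes)
```

A code list whose ‘parent’ column refers to a code that does not exist would make `check_codelist` stop with an error.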

We will also load an example data set that uses the codes we loaded above:

> data(objectsales)
> objectsales |> head()
  product unitprice quantity totalprice
1     B01     70.65       67    4733.55
2     B01     76.93       76    5846.68
3     B01     43.49      100    4349.00
4     A03      3.08       26      80.08
5     A01     18.51       89    1647.39
6     A03      3.35       71     237.85

This is a data set containing the prices and sales of various products. The ‘product’ column uses codes from the objectcodes code list:

> objectsales$product |> head(10)
 [1] "B01" "B01" "B01" "A03" "A01" "A03" "A03" "B01" "A03" "A01"

One of the things we can do is convert the codes to their corresponding labels:

> to_labels(objectsales$product, objectcodes) |> head(10)
 [1] Hammer     Hammer     Hammer     Marbles    Teddy Bear Marbles   
 [7] Marbles    Hammer     Marbles    Teddy Bear
Levels: Toys Tools Teddy Bear Toy Car Marbles Hammer Electric Drill

The to_labels function accepts a vector with codes and a code list for this vector. It can get a bit tiresome to keep having to pass in the code list. If it is missing, the function looks for a ‘codelist’ attribute on the vector:

> attr(objectsales$product, "codelist") <- objectcodes
> to_labels(objectsales$product) |> head(10)
 [1] Hammer     Hammer     Hammer     Marbles    Teddy Bear Marbles   
 [7] Marbles    Hammer     Marbles    Teddy Bear
Levels: Toys Tools Teddy Bear Toy Car Marbles Hammer Electric Drill
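The fallback to an attribute can be illustrated in base R. This is a sketch of the idea only, not the package's implementation: `labels_for` is a hypothetical helper, and it assumes a code list with a single locale.

```r
# Hypothetical helper: look up labels for codes, taking the code list
# from the argument or, when it is not supplied, from the "codelist"
# attribute of the vector.
labels_for <- function(x, codelist = attr(x, "codelist")) {
  if (is.null(codelist)) stop("No code list supplied or attached to x.")
  codelist$label[match(x, codelist$code)]
}

tools <- data.frame(
  code  = c("B01", "B02"),
  label = c("Hammer", "Electric Drill"),
  stringsAsFactors = FALSE
)
x <- c("B02", "B01", "B01")

labels_for(x, tools)           # code list passed explicitly
attr(x, "codelist") <- tools
labels_for(x)                  # code list taken from the attribute
```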

The codelist package also has a code type. Converting to a code object adds the code class. This will result in some formatting and later on we will see that this also ensures that we cannot assign invalid codes to the vector:

> objectsales$product <- code(objectsales$product, objectcodes)
> objectsales$product |> head(10)
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 A01
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)
> to_labels(objectsales$product) |> head(10)
 [1] Hammer     Hammer     Hammer     Marbles    Teddy Bear Marbles   
 [7] Marbles    Hammer     Marbles    Teddy Bear
Levels: Toys Tools Teddy Bear Toy Car Marbles Hammer Electric Drill

For code objects there is also the labels method:

> labels(objectsales$product) |> head(10)

The labels method and the to_labels function can be used to get readable output from various R-functions:

> table(labels(objectsales$product), useNA = "ifany")

          Toys          Tools     Teddy Bear        Toy Car        Marbles 
             0              0             29             14             16 
        Hammer Electric Drill           <NA> 
            30              2              9 
> tapply(objectsales$unitprice, labels(objectsales$product), mean)
          Toys          Tools     Teddy Bear        Toy Car        Marbles 
            NA             NA      19.761034      12.432857       2.480625 
        Hammer Electric Drill 
     45.303000     205.350000 
> lm(unitprice ~ 0+labels(product), data = objectsales) 

Call:
lm(formula = unitprice ~ 0 + labels(product), data = objectsales)

Coefficients:
    labels(product)Teddy Bear         labels(product)Toy Car  
                       19.761                         12.433  
       labels(product)Marbles          labels(product)Hammer  
                        2.481                         45.303  
labels(product)Electric Drill  
                      205.350  

By default codes that are considered missing are converted to NA when converting to labels. This can be prevented by setting the missing argument to FALSE:

> table(labels(objectsales$product, FALSE), useNA = "ifany")

          Toys          Tools     Teddy Bear        Toy Car        Marbles 
             0              0             29             14             16 
        Hammer Electric Drill Unknown object           <NA> 
            30              2              5              4 
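The effect of the ‘missing’ column can be imitated with a base R factor. The sketch below is an illustration under assumptions, not the package's code: `labels_sketch` and its `missing_to_na` argument are hypothetical names.

```r
shop <- data.frame(
  code    = c("A01", "B01", "X"),
  label   = c("Teddy Bear", "Hammer", "Unknown object"),
  missing = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
x <- c("A01", "X", "B01", NA)

# Hypothetical helper: convert codes to a factor of labels. Codes that
# are flagged as missing become NA unless missing_to_na = FALSE.
labels_sketch <- function(x, codelist, missing_to_na = TRUE) {
  keep <- if (missing_to_na) !codelist$missing else rep(TRUE, nrow(codelist))
  factor(x, levels = codelist$code[keep], labels = codelist$label[keep])
}

table(labels_sketch(x, shop), useNA = "ifany")
table(labels_sketch(x, shop, missing_to_na = FALSE), useNA = "ifany")
```

In the first table the code X is counted together with the real NA; in the second it appears under its own label, which mirrors the difference between the two tables above.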

The droplevels argument removes unused codes from the levels of the generated factor vector:

> table(labels(objectsales$product, droplevels = TRUE), useNA = "ifany")

    Teddy Bear        Toy Car        Marbles         Hammer Electric Drill 
            29             14             16             30              2 
          <NA> 
             9 

Locale

Using the ‘locale’ column of the code list it is possible to specify different versions of the labels and descriptions. This can be used to specify different translations, as in this example, but can also be used to specify different versions, for example long and short labels. By default all methods use the first locale in the code list as the default locale; this is the locale returned by the cl_locale function:

> cl_locale(objectcodes)
[1] "EN"

Most methods also have a locale argument with which it is possible to specify the preferred locale (the default is used when the preferred locale is not present). For example:

> labels(objectsales$product, locale = "NL") |> head()
[1] Hamer     Hamer     Hamer     Knikkers  Teddybeer Knikkers 
7 Levels: Speelgoed Gereedschap Teddybeer Speelgoedauto Knikkers ... Boormachine

It can become tedious having to specify the locale for each function call. The cl_locale function will look at the CLLOCALE option, when present, to get the preferred locale. Therefore, to set a default preferred locale:

> op <- options(CLLOCALE = "NL")
> cl_locale(objectcodes)
[1] "NL"
> tapply(objectsales$unitprice, labels(objectsales$product), mean)
    Speelgoed   Gereedschap     Teddybeer Speelgoedauto      Knikkers 
           NA            NA     19.761034     12.432857      2.480625 
        Hamer   Boormachine 
    45.303000    205.350000 
> # Set the locale back to the original value (unset)
> options(op)
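The lookup order described in this section can be sketched in base R with getOption. This is an assumption-laden illustration and not the code of cl_locale: `pick_locale` is a hypothetical helper, and the rule that an explicit argument takes precedence over the option is assumed here.

```r
# Hypothetical helper: use the preferred locale (explicit argument,
# else the CLLOCALE option) when it occurs in the code list, and fall
# back to the first locale in the code list otherwise.
pick_locale <- function(codelist, preferred = getOption("CLLOCALE")) {
  available <- unique(codelist$locale)
  if (!is.null(preferred) && preferred %in% available) {
    preferred
  } else {
    available[1]
  }
}

toys <- data.frame(
  code   = c("A", "A"),
  label  = c("Toys", "Speelgoed"),
  locale = c("EN", "NL"),
  stringsAsFactors = FALSE
)

pick_locale(toys)                # first locale when CLLOCALE is unset
op <- options(CLLOCALE = "NL")
pick_locale(toys)                # now the option is used
pick_locale(toys, "FR")          # not present, so the first locale
options(op)                      # restore the previous setting
```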

Looking up codes based on label

Using the codes function it is possible to look up the codes based on a set of labels. For example, below we look up the code for ‘Hammer’:

> codes("Hammer", objectcodes)
[1] "B01"

or, getting the code list from the relevant variable itself using the cl method, which returns the code list of the variable:

> codes("Hammer", cl(objectsales$product))
[1] "B01"

This could be used to make selections. For example, instead of

> subset(objectsales, product == "B02")
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05

one can do

> subset(objectsales, product == codes("Electric Drill", cl(product)))
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05

In general the latter is more readable and makes the intent of the code much clearer (unless one can assume that the people reading the code will know most of the product codes).

When comparing a code object to labels, it is also possible to use the as.label function. This will add the class “label” to the character vector. The comparison operator will then first call the codes function on the label:

> subset(objectsales, product == as.label("Electric Drill"))
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05

This only works for the equal-to and not-equal-to operators.

Selecting this way has an advantage over selecting records based on character vectors or factor vectors. For example we could also have done the following:

> subset(objectsales, labels(product) == "Electric Drill")
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05

However, a small, difficult to spot, spelling mistake would have resulted in:

> subset(objectsales, labels(product) == "Electric drll")
[1] product    unitprice  quantity   totalprice
<0 rows> (or 0-length row.names)

And we could have believed that no electric drills were sold. The codes function also checks whether the provided labels are valid and, if not, generates an error (the try is there to make sure we don’t actually throw an error).

> try({
+   subset(objectsales, product == codes("Electric drill", cl(product)))
+ })
Error in codes.default("Electric drill", cl(product)) : 
  Labels not present in codelist in current locale.
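The difference between a silent empty result and a loud error comes down to validating the label before using it. A base R sketch of such a validated lookup follows; `codes_sketch` is a hypothetical helper and not the package's codes function.

```r
# Hypothetical helper: return the codes for a set of labels and stop
# when a label does not occur in the code list.
codes_sketch <- function(labels, codelist) {
  idx <- match(labels, codelist$label)
  if (anyNA(idx)) {
    stop("Labels not present in code list: ",
         paste(labels[is.na(idx)], collapse = ", "))
  }
  codelist$code[idx]
}

tools <- data.frame(
  code  = c("B01", "B02"),
  label = c("Hammer", "Electric Drill"),
  stringsAsFactors = FALSE
)

codes_sketch("Electric Drill", tools)
# A misspelled label is reported instead of silently matching nothing.
res <- try(codes_sketch("Electric drill", tools), silent = TRUE)
inherits(res, "try-error")
```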

Since selecting on labels is a common operation, there is also the in_labels function that will return a logical vector indicating whether or not a code has a label in the given set:

> subset(objectsales, in_labels(product, "Electric Drill"))
         product unitprice quantity totalprice
33 B02[Electri…]    284.85       52   14812.20
73 B02[Electri…]    125.85       73    9187.05

This function will of course also generate an error in case of invalid labels.

> try({
+   subset(objectsales, in_labels(product, "Electric drill"))
+ })
Error in codes.default(labels, codelist) : 
  Labels not present in codelist in current locale.

In the examples above we used the base function subset, but this will of course also work within data.tables and the filter methods from dplyr.

Assignment of codes

When the vector with codes is transformed to a code object, it can of course also be assigned to:

> objectsales$product[10] <- "A01"
> objectsales$product[1:10] 
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 A01
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)

Here the codes function can also be of use (again, an invalid label will result in an error so this is a safe operation):

> objectsales$product[10] <- codes("Teddy Bear", objectcodes)
> objectsales$product[1:10] 
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 A01
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)

Another option is to use the as.label function, which marks a character vector as containing labels:

> objectsales$product[10] <- as.label("Electric Drill")
> objectsales$product[1:10] 
 [1] B01 B01 B01 A03 A01 A03 A03 B01 A03 B02
8 Codelist: A(=Toys) B(=Tools) A01(=Teddy Bear) A02(=Toy Car) ...X(=Unknown object)

Hierarchies

Each code can have a parent code. With this a simple hierarchy can be defined. At the top of the hierarchy are the codes without a parent (NA). This is level 0. Codes with a parent in level 0 are in level 1, and so on. Note that level 0 is a higher level than level 1. The example code list objectcodes has two levels:

> cl_nlevels(objectcodes)
[1] 2
> cl_levels(objectcodes)
 [1] 0 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0
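The definition of a level, the number of ancestors of a code, can be worked out by hand in base R. The sketch below is an illustration only and not how cl_levels is implemented: `code_levels` is a hypothetical helper, and it assumes the hierarchy contains no cycles.

```r
# Hypothetical helper: the level of each code is the number of steps
# needed to reach a top-level code (a code whose parent is NA).
code_levels <- function(code, parent) {
  level <- integer(length(code))
  current <- parent
  # Walk up the hierarchy one step at a time until every chain has
  # reached the top.
  while (any(!is.na(current))) {
    level <- level + !is.na(current)
    current <- parent[match(current, code)]
  }
  level
}

code   <- c("A", "B", "A01", "A02", "B01", "A01a")
parent <- c(NA,  NA,  "A",   "A",   "B",   "A01")

lev <- code_levels(code, parent)
lev             # level per code
max(lev) + 1    # number of levels in this hierarchy
```

Here "A01a" is a made-up code added to show a third level; it is not part of objectcodes.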

These levels can be used to ‘cast’ the codes to a higher level:

> objectsales$group <- levelcast(objectsales$product, 0)
> head(objectsales)
        product unitprice quantity totalprice    group
1 B01[Hammer]       70.65       67    4733.55 B[Tools]
2 B01[Hammer]       76.93       76    5846.68 B[Tools]
3 B01[Hammer]       43.49      100    4349.00 B[Tools]
4 A03[Marbles]       3.08       26      80.08 A[Toys] 
5 A01[Teddy B…]     18.51       89    1647.39 A[Toys] 
6 A03[Marbles]       3.35       71     237.85 A[Toys] 

This is, for example, useful to create aggregates at higher levels. For example, we can calculate the total number of toys and tools sold:

> aggregate(objectsales[c("quantity", "totalprice")], 
+   objectsales[c("group")], sum)
        group quantity totalprice
1 A[Toys]         3274   43918.09
2 B[Tools]        1829  103011.65
3 X[Unknown…]      308   18184.42
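What casting to level 0 amounts to can be shown with plain character vectors. This is a base R sketch under assumptions, not the levelcast function: `cast_to_top` is a hypothetical helper that only handles the top level and assumes a hierarchy without cycles; the four records are taken from the head of objectsales shown earlier.

```r
code   <- c("A", "B", "A01", "A03", "B01")
parent <- c(NA,  NA,  "A",   "A",   "B")

# Hypothetical helper: replace each value by its top-level ancestor by
# repeatedly stepping to the parent until no value has a parent left.
cast_to_top <- function(x, code, parent) {
  repeat {
    p <- parent[match(x, code)]
    if (all(is.na(p))) break
    x[!is.na(p)] <- p[!is.na(p)]
  }
  x
}

product  <- c("B01", "A03", "A01", "A03")
quantity <- c(67, 26, 89, 71)

group <- cast_to_top(product, code, parent)
group
# Aggregate at the higher level.
tapply(quantity, group, sum)
```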

Note that by default the code list of the vector returned by levelcast will be modified to only contain the codes in the higher hierarchy (this can be suppressed using the filter_codelist = FALSE argument):

> cl(objectsales$group)
   code                  label parent locale missing
1     A                   Toys   <NA>     EN   FALSE
2     B                  Tools   <NA>     EN   FALSE
8     A              Speelgoed   <NA>     NL   FALSE
9     B            Gereedschap   <NA>     NL   FALSE
15    X         Unknown object   <NA>     EN    TRUE
16    X Onbekend type voorwerp   <NA>     NL    TRUE

Also, when the data contains codes from different levels, trying to cast to a level lower than that of some of the codes in the vector will by default result in an error. This can be controlled with the over_level argument.

Safety

Using a code vector also has the advantage that values assigned to it are validated against the code list, generating an error when one tries to assign an invalid code:

> try({
+   objectsales$product[10] <- "Q"
+ })
Error in `[<-.code`(`*tmp*`, 10, value = "Q") : 
  Invalid codes used in value.

This makes a code object safer to work with than, for example, a character or numeric vector with codes (a factor vector will also generate a warning for invalid factor levels).
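The mechanism behind such validated assignment is an S3 method for `[<-`. The following base R sketch shows the general technique with a hypothetical class "checked"; it is not the code class of the package and leaves out labels, locales and everything else a code list provides.

```r
# Hypothetical class "checked": a character vector that remembers its
# set of valid values.
checked <- function(x, valid) {
  stopifnot(all(is.na(x) | x %in% valid))
  structure(x, valid = valid, class = "checked")
}

# Assignment method: validate the new values before storing them.
`[<-.checked` <- function(x, i, value) {
  valid <- attr(x, "valid")
  if (!all(is.na(value) | value %in% valid)) {
    stop("Invalid codes used in value.")
  }
  # Do the assignment on the underlying vector, then restore the
  # class and the attribute.
  y <- unclass(x)
  y[i] <- value
  structure(y, valid = valid, class = "checked")
}

x <- checked(c("B01", "A03"), valid = c("A01", "A03", "B01"))
x[2] <- "A01"                            # valid, so this succeeds
res <- try(x[2] <- "Q", silent = TRUE)   # invalid, so this is an error
x[1] <- NA                               # NA is always allowed
```

After the failed assignment the vector is unchanged, because the error is raised before anything is stored.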

The codes function and the as.label function (which calls the codes function) will also generate an error:

> try({
+   objectsales$product[10] <- as.label("Teddy bear")
+ })
Error in codes.default(value, codelist) : 
  Labels not present in codelist in current locale.

Assigning NA will of course still work:

> objectsales$product[10] <- NA

A code object is safer to work with than a factor vector. For example:

> x <- factor(letters[1:3])
> y <- code(1:3, data.frame(code = 1:3, label = letters[1:3]))

Comparing on invalid codes works with a factor while it will generate an error for code objects:

> try({ x == 4 })
[1] FALSE FALSE FALSE
> try({ y == 4 })
Error in Ops.code(y, 4) : Invalid codes used in RHS

The same holds when comparing on labels:

> try({ x == "foobar" })
[1] FALSE FALSE FALSE

A code object cannot be compared directly with a label; this generates an error even when the label is valid:

> try({ y == "a" })
Error in Ops.code(y, "a") : 
  RHS not of the same class as the used codes of the LHS.

One should use either the codes or as.label function for that:

> try({ y == as.label("a") })
[1]  TRUE FALSE FALSE
> try({ y == as.label("foobar") })
Error in codes.default(e2, cl(e1)) : 
  Labels not present in codelist in current locale.