vtree is a flexible tool for calculating and displaying variable trees — diagrams that show information about nested subsets of a data frame. vtree can be used to:
- explore a data set interactively
- produce customized figures for reports and publications.
Note, however, that vtree is not designed to build or display decision trees.
Given a data frame and simple specifications, vtree will produce a variable tree and automatically label it with counts, percentages, and other summaries.
The sections below introduce variable trees and provide an overview of the features of vtree. Or you can skip ahead and start using the vtree function.
Subsets play an important role in almost any data analysis.
Imagine a data set of countries that includes variables named population, continent, and landlocked. Suppose we wish to examine subsets of the data set based on the continent variable. Within each of these subsets, we could examine nested subsets based on the population variable, for example, countries with populations under 30 million and over 30 million. We might continue to a third nesting based on the landlocked variable.
Nested subsets are at the heart of questions like the following: Among African countries with a population over 30 million, what percentage are landlocked? The variable tree below provides the answer:
By default, vtree uses the colorful display above (to help distinguish variables and values), but if you prefer a more sedate version, you can specify a single fill color (or simply white):
Even in simple situations like this, it can be a chore to keep track of nested subsets and calculate the corresponding percentages. The denominator used to calculate percentages may also depend on whether the variables have any missing values, as discussed later. Finally, as the number of variables increases, the magnitude of the task balloons, because the number of nested subsets grows exponentially. vtree provides a general solution to the problem of calculating nested subsets and displaying information about them.
Nested subsets arise in all kinds of situations. Consider, for example, flow diagrams for clinical studies, such as the following CONSORT-style diagram, produced by vtree.
Both the structure of this variable tree and the numbers shown were automatically determined. When manual calculation and transcription are instead used to populate diagrams like this, mistakes are likely. And although the errors that make it into published articles are often minor, they can sometimes be disastrous. One motivation for developing vtree was to make flow diagrams reproducible. The ability to reproducibly generate variable trees also means that when a data set is updated, a revised tree can be automatically produced.
At the end of this vignette, there is a collection of examples of variable trees using R datasets that you can try.
The examples that follow use a data set called FakeData which represents 46 fictitious patients. We’ll start by using just two variables, although variable trees are especially useful with three or more variables. The variable tree below depicts subsets defined by Sex (M or F) nested within subsets defined by disease Severity (Mild, Moderate, Severe, or NA).
A variable tree consists of nodes connected by arrows. At the top of the diagram above, the root node of the tree contains all 46 patients. The rest of the nodes are arranged in successive layers, where each layer corresponds to a specific variable. Note that this highlights one difference between variable trees and some other kinds of trees: each layer of a variable tree corresponds to just one variable. (In a decision tree, by contrast, different branches can have different sequences of variable splits.)
Continuing with the variable tree above, the nodes immediately below the root represent values of Severity and are referred to as the children of the root node. In this case, Severity was missing (NA) for 6 patients, and there is a node for these patients. Inside each of the nodes, the number of patients is displayed and, except in the missing-value node, the corresponding percentage is also shown. Note that, by default, vtree displays “valid” percentages, i.e. the denominator used to calculate the percentage is the total number of non-missing values, 40.
The final layer of the tree corresponds to values of Sex. These nodes represent males and females within subsets defined by each value of Severity. In each of these nodes the percentage is calculated in terms of the number of patients in its parent node.
Like any node, a missing-value node can have children. For example, of the 6 patients for whom Severity is missing, 3 are female and 3 are male. By default, vtree displays the full missing-value structure of the specified variables.
Also by default, vtree automatically assigns a color palette to the nodes of each variable. Severity has been assigned red hues (lightest for Mild, darkest for Severe), while Sex has been assigned blue hues (light blue for females, dark blue for males). The node representing missing values of Severity is colored white to draw attention to it.
A tree with two variables is similar to a two-way contingency table. In the example above, Sex is shown within levels of Severity. This corresponds to the following contingency table, where the percentages within each column add to 100%. These are called column percentages.
Sex | Mild | Moderate | Severe | NA |
---|---|---|---|---|
F | 11 (58%) | 11 (69%) | 2 (40%) | 3 (50%) |
M | 8 (42%) | 5 (31%) | 3 (60%) | 3 (50%) |
Likewise, a tree with Severity shown within levels of Sex corresponds to a contingency table with row percentages.
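The correspondence between the two-variable tree and the contingency table can be checked with a short base R sketch. It does not use vtree; it simply re-enters the counts from the table above and computes column and row percentages with prop.table. The object names (counts, col_pct, row_pct) are chosen here for illustration.

```r
# Counts copied from the contingency table above (Sex within Severity).
# matrix() fills column by column, so each pair below is one Severity column.
counts <- matrix(
  c(11, 8,   # Mild:     F, M
    11, 5,   # Moderate: F, M
     2, 3,   # Severe:   F, M
     3, 3),  # NA:       F, M
  nrow = 2,
  dimnames = list(Sex = c("F", "M"),
                  Severity = c("Mild", "Moderate", "Severe", "NA"))
)

# Column percentages: within each Severity column the proportions sum to 1.
col_pct <- round(100 * prop.table(counts, margin = 2))

# Row percentages: within each Sex row the proportions sum to 1.
row_pct <- round(100 * prop.table(counts, margin = 1))

print(col_pct)
print(row_pct)
```

The column percentages reproduce the figures shown in the table (for example 58% and 42% for Mild), while the row percentages are what a tree with Severity nested within Sex would display.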
While the contingency table above is more compact than the corresponding variable tree, some people find the variable tree more intuitive. When three or more variables are of interest, multi-way contingency tables are often used. These are typically displayed using several two-way tables, but as the number of variables increases, these become increasingly difficult to interpret. Variable trees, on the other hand, have the same simple structure regardless of the number of variables.
Note that contingency tables are not always more compact than variable trees. When most cells of a large contingency table are empty (in which case the table is said to be sparse), the corresponding variable tree may be more compact since empty nodes are not shown.
vtree is designed to be quick and easy to use, so that it is convenient for data exploration, but also flexible enough that it can be used to prepare publication-ready figures. To generate a basic variable tree, it is only necessary to specify a data frame and some variable names. However, extra features extend this basic functionality to provide:
- control over labeling, colors, legends, line wrapping, and text formatting;
- flexible pruning to remove parts of the tree that are of lesser interest, which is particularly useful when a tree gets large;
- display of information about other variables in each node, including a variety of summary statistics;
- special displays for indicator variables, patterns of values, and missingness;
- support for checkbox variables from REDCap databases;
- features for dichotomizing variables and checking for outliers;
- automatic generation of PNG image files and embedding in R Markdown documents; and
- interactive panning and zooming using the svtree function to launch a Shiny app.
In many cases, you may wish to generate several different variable trees to investigate a collection of variables in a data frame. For example, it is often useful to change the order of variables, prune parts of the tree, etc.
vtree is built on open-source software: in particular Richard Iannone’s DiagrammeR package, which provides an interface to the Graphviz software using the htmlwidgets framework. Additionally, vtree makes use of the Shiny package, and the svg-pan-zoom JavaScript library.
A formal description of variable trees follows.
The root node of the variable tree represents the entire data frame. The root node has a child for each observed value of the first variable that was specified. Each of these child nodes represents a subset of the data frame with a specific value of the variable, and is labeled with the number of observations in the subset and the corresponding percentage of the number of observations in the entire data frame. The nth layer below the root of the variable tree corresponds to the nth variable specified. Apart from the root node, each node in the variable tree represents the subset of its parent defined by a specific observed value of the variable at that layer of the tree, and is labeled with the number of observations in that subset and the corresponding percentage of the number of observations in its parent node.
Note that a node always represents at least one observation. And unlike a contingency table, which can have empty cells, a variable tree has no empty nodes.
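The formal description can be made concrete with a small base R sketch. This is not how vtree is implemented; it is only an illustration of the definition, using toy data that reproduces the non-missing Severity-by-Sex counts quoted earlier. The helper layer_nodes is a hypothetical name introduced here.

```r
# Toy data reproducing the non-missing Severity-by-Sex counts quoted earlier.
df <- data.frame(
  Severity = rep(c("Mild", "Moderate", "Severe"), times = c(19, 16, 5)),
  Sex = c(rep(c("F", "M"), times = c(11, 8)),
          rep(c("F", "M"), times = c(11, 5)),
          rep(c("F", "M"), times = c(2, 3)))
)

# Hypothetical helper: returns one data frame of node counts per layer.
# Layer n holds the observed combinations of the first n variables.
layer_nodes <- function(data, vars) {
  lapply(seq_along(vars), function(n) {
    nodes <- as.data.frame(table(data[vars[seq_len(n)]]),
                           responseName = "n", stringsAsFactors = FALSE)
    # Drop empty combinations: a variable tree has no empty nodes.
    nodes[nodes$n > 0, , drop = FALSE]
  })
}

layers <- layer_nodes(df, c("Severity", "Sex"))
layers[[1]]  # nodes in the first layer (Severity)
layers[[2]]  # nodes in the second layer (Sex within Severity)
```

Each row of a layer corresponds to one node, and the counts in a layer's rows add up to the counts of their parent nodes in the layer above.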
vtree function
Consider a data frame named df, which includes discrete variables v1 and v2. Suppose we wish to produce a variable tree showing subsets based on values of v1 as well as subsets of those subsets based on values of v2. The variable tree can be displayed using the following command:
vtree(df,"v1 v2")
Alternatively, you may wish to assign the output of vtree to an object:
simple_tree <- vtree(df,"v1 v2")
Then it can be displayed later using:
simple_tree
Suppose vtree is called without a list of variables:
vtree(df)
In this case, only the root node is shown, representing the entire data frame. Although a tree with just one node might not seem very useful, we’ll see later that summary information about the whole data frame can be displayed there.
The vtree function has numerous optional parameters. For example, by default vtree produces a horizontal tree (that is, a tree that grows from left to right). To generate a vertical tree, specify horiz=FALSE.
This section introduces some basic features of the vtree function.
To display a variable tree for a single variable, say Severity, use the following command:
vtree(FakeData,"Severity")
By default, next to each layer of the tree, a variable name is shown. In the example above, “Severity” is shown below the corresponding nodes. (For a vertical tree, “Severity” would be shown to the left of the nodes.) If you specify showvarnames=FALSE, no variable names will be shown.
vtree can also be used with dplyr. For example, to rename the Severity variable as HowBad, we can pipe the data frame into the rename function in dplyr, and then pipe the result into vtree:
library(dplyr)
FakeData %>% rename("HowBad"=Severity) %>% vtree("HowBad")
Note that vtree also has a built-in way of renaming variables, which is an alternative to using dplyr.
Large variable trees can be difficult to display in a readable way. One approach that helps is to display the count and percentage on the same line in each node. For example, in the tree above, the label for the Moderate node is on two lines, like this:
Moderate
16
(40%)
Specifying sameline=TRUE results in single-line labels, like this:
Moderate, 16 (40%)
By default, vtree shows “valid percentages”, i.e. percentages calculated using the total number of non-missing values as denominator. In the case of Severity, there are 6 missing values, so the denominator is 46 - 6, or 40. There are 19 Mild cases, and 19/40 = 0.475 so the percentage shown is 48%. No percentage is shown in the NA node since missing values are not included in the denominator.
If you prefer the denominator to represent the complete set of observations (including any missing values), specify vp=FALSE. A percentage will be shown in each of the nodes, including any NA nodes.
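The difference between the two denominators can be worked through in base R. The sketch below does not call vtree; it rebuilds the Severity counts stated in the text (19 Mild, 16 Moderate, 5 Severe, 6 missing) and computes both kinds of percentage. The object names are illustrative, and vtree's own rounding may differ from round() in borderline cases.

```r
# Severity values rebuilt from the counts given in the text.
severity <- c(rep("Mild", 19), rep("Moderate", 16),
              rep("Severe", 5), rep(NA, 6))

# Counts including the missing-value category.
counts <- table(severity, useNA = "ifany")

# "Valid" percentages: the denominator excludes missing values (46 - 6 = 40).
valid_pct <- round(100 * table(severity) / sum(!is.na(severity)))

# Percentages out of all 46 observations, including missing values,
# analogous to what the text describes for vp=FALSE.
all_pct <- round(100 * counts / length(severity))

valid_pct
all_pct
```

With the valid denominator, Mild comes out as 48% and Moderate as 40%, matching the figures quoted above; with the full denominator every category, including NA, receives a percentage.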
If you don’t wish to see percentages, specify showpct=FALSE, and if you don’t wish to see counts, specify showcount=FALSE.
To display a legend, specify showlegend=TRUE. Next to each variable name are “legend nodes” representing the values of that variable and colored accordingly. For each variable, the legend nodes are grouped within a light gray box. Each legend node also contains a count (with a percentage) for the value represented by that node in the whole data frame. This is known as the marginal count (and percentage).
When the legend is shown, labels in the nodes of the variable tree are redundant, since the colors of the nodes identify the values of the variables (although the labels may aid readability). If you prefer, you can hide the node labels, by specifying shownodelabels=FALSE:
vtree(FakeData,"Severity Sex",showlegend=TRUE,shownodelabels=FALSE)
Since Severity is the first variable in the tree, it is not nested within another variable. Therefore the marginal counts and percentages for Severity shown in the legend nodes are identical to those displayed in the nodes of the variable tree. In contrast, for Sex, the marginal counts and percentages are different from what is shown in the nodes of the variable tree for Sex since they are nested within levels of Severity.
By default, vtree wraps text onto the next line whenever a space occurs after at least 20 characters. This can be adjusted, for example, to 15 characters, by specifying splitwidth=15. To disable line splitting, specify splitwidth=Inf (Inf means infinity, i.e. “do not split”).
The vsplitwidth parameter is similarly used to control text wrapping in variable names. This is helpful with long variable names, which may be truncated unless wrapping is used. In this case text wrapping occurs not only at spaces, but also at any of the following characters:
. - + _ = / (
For example if vsplitwidth=5, a variable name like First_Emergency_Visit would be split into
First_
Emergency_
Visit
This concludes the mini-tutorial. vtree has many more features, described in the following sections.
This section shows how to remove branches from a variable tree.
When a variable tree gets too big, or you are only interested in certain parts of the tree, it may be useful to remove some nodes along with their descendants. This is known as pruning. For convenience, there are several different ways to prune a tree, described below.
prune parameter
Here’s a variable tree we’ve already seen in various forms:
vtree(FakeData,"Severity Sex")
Suppose you don’t want the tree to show branches for individuals whose disease is Mild or Moderate. Specifying prune=list(Severity=c("Mild","Moderate")) removes those nodes, and all of their descendants:
vtree(FakeData,"Severity Sex",prune=list(Severity=c("Mild","Moderate")))
In general, the argument of the prune parameter is a list with an element named for each variable you wish to prune. In the example above, the list has a single element, named Severity. In turn, that element is a vector c("Mild","Moderate") indicating the values of Severity to prune.
Caution: Once a variable tree has been pruned, it is no longer complete. This can sometimes be confusing since not all observations are represented at certain layers of the tree. For example in the tree above, only 11 observations are shown in the Severity nodes and their children.
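The figure of 11 can be verified with a few lines of base R. This sketch does not call vtree; it rebuilds Severity from the counts given earlier (19 Mild, 16 Moderate, 5 Severe, 6 missing) and drops the pruned values, which mimics what remains visible after pruning.

```r
# Severity values rebuilt from the counts given earlier in the text.
severity <- c(rep("Mild", 19), rep("Moderate", 16),
              rep("Severe", 5), rep(NA, 6))

# Values removed by prune=list(Severity=c("Mild","Moderate")).
pruned_values <- c("Mild", "Moderate")

# %in% returns FALSE for NA, so missing values are retained here.
remaining <- severity[!(severity %in% pruned_values)]

length(remaining)                  # observations still represented
table(remaining, useNA = "ifany")  # Severe and NA
```

The 11 remaining observations are the 5 Severe cases plus the 6 with missing Severity, which is why the counts in the pruned layer no longer add up to the 46 in the root node.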
keep parameter
Sometimes it is more convenient to specify which nodes should be retained rather than which ones should be discarded. The keep parameter is used for this purpose, and can thus be considered the complement of the prune parameter. For example, to retain the Moderate Severity node:
vtree(FakeData,"Severity Sex",keep=list(Severity="Moderate"))
Note: In addition to the Moderate node, the missing value node has also been retained. In general, whenever valid percentages are used (which is the default), missing value nodes are retained when keep is used. This is because valid percentages are difficult to interpret without knowing the denominator, which requires knowing the number of missing values.
On the other hand, here’s what happens when vp=FALSE:
vtree(FakeData,"Severity Sex",keep=list(Severity="Moderate"),vp=FALSE)
prunebelow parameter
As seen above, a disadvantage of pruning is that in the resulting tree, the counts shown in child nodes may not add up to the counts shown in their parent node.
An alternative is to prune below the specified nodes (i.e. to prune their descendants), so that the counts always add up. In the present example, this means that the Mild and Moderate nodes will be shown, but not their descendants. The prunebelow parameter is used to do this:
vtree(FakeData,"Severity Sex",prunebelow=list(Severity=c("Mild","Moderate")))
follow parameter
The complement of prunebelow is follow. Instead of specifying which nodes should be pruned below, this allows you to specify which nodes should be “followed” (that is, not pruned below).
This section describes a more flexible way to prune variable trees.
To explain this, first note that the prune, keep, prunebelow, and follow parameters specify pruning across all branches of the tree. For example, if you were pruning Severity nested within levels of Sex, the pruning would take place in both the M and F branches.
Sometimes, however, it is preferable to perform pruning only in specified branches of the tree. This is called targeted pruning, and the parameters tprune, tkeep, tprunebelow, and tfollow provide this functionality. However, their arguments have a more complex form than those of the corresponding prune, keep, prunebelow, and follow parameters because they specify the full path from the root of the tree all the way to the nodes to be pruned. For example, to remove every Severity node except Moderate, but only for males, the following command can be used:
vtree(FakeData,"Sex Severity",tkeep=list(list(Sex="M",Severity="Moderate")))
Note that the argument of tkeep is a list of lists, one for each path through the tree. To keep both Moderate and Severe, specify tkeep=list(list(Sex="M",Severity=c("Moderate","Severe"))). Now suppose that, in addition to this, within females, you want to keep just Mild. Use the following specification to do this:
tkeep=list(list(Sex="M",Severity=c("Moderate","Severe")),list(Sex="F",Severity="Mild"))
prunesmaller parameter
As a variable tree grows, it can become difficult to see the forest for the trees. For example, the following tree is hard to read, even when sameline=TRUE has been specified:
vtree(FakeData,"Severity Sex Age Category",sameline=TRUE)
One solution is to prune nodes that contain small numbers of observations. For example if you want to only see nodes with at least 3 observations, you can specify prunesmaller=3, as in this example:
vtree(FakeData,"Severity Sex Age Category",sameline=TRUE,prunesmaller=3)
As with the keep parameter, when valid percentages are used (vp=TRUE, which is the default), nodes representing missing values will not be pruned. (As noted previously, this is because percentages are confusing when missing values are not shown.) On the other hand, when vp=FALSE, missing-value nodes will be pruned (if they are small enough).
This section shows how to relabel variables and nodes.
By default, vtree labels variables and nodes exactly as they appear in the data frame. But it is often useful to change these labels.
labelvar parameter
Suppose Severity in fact represents initial severity. To label it that way in the variable tree, specify labelvar=c(Severity="Initial severity"):
vtree(FakeData,"Severity Sex",horiz=FALSE,labelvar=c(Severity="Initial severity"))
labelnode parameter
By default, vtree labels nodes (except for the root node) using the values of the variable in question. Sometimes it is convenient to instead specify custom labels for nodes. The labelnode argument can be used to relabel the values. For example, you might want to use “Male” and “Female” instead of “M” and “F”.
vtree(FakeData,"Group Sex",horiz=FALSE,labelnode=list(Sex=c(Male="M",Female="F")))
The argument of the labelnode parameter is specified as a list whose element names are variable names. To substitute a new label for an old label, the syntax is: "New label"="Old label". Thus the full specification, as used above, is: labelnode=list(Sex=c(Male="M",Female="F")).
tlabelnode parameter
Suppose in the example above that Group A represents children and Group B represents adults. In Group A, we would like to use the labels “girl” and “boy”, while in Group B we would like to use “woman” and “man”. The labelnode parameter cannot handle this situation because the values of Sex need to be labeled differently in different branches of the tree. The tlabelnode parameter allows “targeted” node labels.
vtree(FakeData,"Group Sex",horiz=FALSE,
labelnode=list(Group=c(Child="A",Adult="B")),
tlabelnode=list(
c(Group="A",Sex="F",label="girl"),
c(Group="A",Sex="M",label="boy"),
c(Group="B",Sex="F",label="woman"),
c(Group="B",Sex="M",label="man")))
This section shows how to add bold, italics, and other text formatting.
Graphviz, the open source graph visualization software that vtree is built on, supports a variety of text formatting (including bold, colors, etc.). This is used in vtree to control formatting of text such as node labels.
By default, the vtree package uses markdown-style codes for text formatting. In the tables below, ... represents arbitrary text.
code | result |
---|---|
\n | insert a line break |
\n*l | make the preceding line left-justified and insert a line break |
*...* | display text in italics |
**...** | display text in bold |
^...^ | display text in superscript (using 10 point font) |
~...~ | display text in subscript (using 10 point font) |
%%red ...%% | display text in red (or whichever color is specified) |
As an alternative, if you specify HTMLtext=TRUE you can use “HTML-like labels” (implemented in Graphviz), including:
code | result |
---|---|
<BR/> | insert a line break |
<BR ALIGN='LEFT'/> | make the preceding line left-justified and insert a line break |
<I> ... </I> | display text in italics |
<B> ... </B> | display text in bold |
<SUP> ... </SUP> | display text in superscript (using 10 point font) |
<SUB> ... </SUB> | display text in subscript (using 10 point font) |
<FONT POINT-SIZE='10'> ... </FONT> | set font to 10 point |
<FONT FACE='Times-Roman'> ... </FONT> | set font to Times-Roman |
<FONT COLOR='red'> ... </FONT> | set font to red |
See https://www.graphviz.org/doc/info/shapes.html#html for more details.
text parameter
Suppose you wish to add the italicized text “Excluding new diagnoses” to any Mild nodes in the tree. The parameter text is used to add text to nodes. It is specified as a list with an element named for each variable. In the example below the list has one element, named Severity. That element in turn is a vector c(Mild="\n*Excluding\nnew diagnoses*") indicating that the Mild node should include additional text using Markdown-style formatting (i.e. \n indicates a linebreak and the asterisks around the text indicate that it should be displayed in italics):
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
text=list(Severity=c(Mild="\n*Excluding\nnew diagnoses*")))
ttext parameter
In the example above, suppose that new diagnoses are only excluded from Mild cases in Group B. But the text parameter adds text to all Mild nodes. Thus, in situations like this, the text parameter is not sufficient. Instead, you can use the ttext parameter to target exactly which nodes should have the specified text.
The ttext parameter requires that you specify the full path from the root of the tree to the node in question, along with the text in question. The ttext parameter is specified as a list so that multiple targeted text strings can be specified at once. For example:
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
ttext=list(
c(Group="B",Severity="Mild",text="\n*Excluding\nnew diagnoses*"),
c(Group="A",text="\nSweden"),
c(Group="B",text="\nNorway")))
This section shows how to control how variables appear in a variable tree.
Sometimes it is desirable to modify a variable for use in a variable tree. For example, suppose you wish to determine how many values of Score are missing. This is easy to do with dplyr:
library(dplyr)
FakeData %>% mutate(missingScore=is.na(Score)) %>% vtree("missingScore")
But vtree also offers built-in tools for variable specification. Although limited, they can be very convenient.
is.na:
If an individual variable name is preceded by is.na:, that variable will be replaced by a missing value indicator in the variable tree. (This differs from the check.is.na parameter, which is used to replace all of the specified variables with missing value indicators.) For example:
vtree(FakeData,"is.na:Score")
#
Specifying Ind# matches all variable names that start with Ind and end with one or more numeric digits, namely Ind1, Ind2, Ind3, and Ind4. This wildcard can also be used within a variable name. For example, visit#duration would match visit1duration, visit2duration, etc.
*
Specifying Ind* matches all variable names that start with Ind and end with any other characters (or no other characters). In FakeData this matches Ind1, Ind2, Ind3, and Ind4 (just like Ind# does). But if FakeData contained variables named Ind and Index, they would also be matched by Ind*. As with the # wildcard, the * wildcard can be used within a variable name.
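The matching rules for the two wildcards can be approximated with regular expressions in base R. This is only an analogy, not vtree's implementation: the patterns below are one plausible translation of the rules described above, and the vector of variable names is hypothetical (it includes the Ind and Index names mentioned in the text).

```r
# Hypothetical variable names, including the extra names mentioned above.
var_names <- c("id", "Ind1", "Ind2", "Ind3", "Ind4", "Ind", "Index")

# "Ind#": starts with Ind and ends with one or more numeric digits.
hash_match <- grep("^Ind[0-9]+$", var_names, value = TRUE)

# "Ind*": starts with Ind, followed by any characters (or none).
star_match <- grep("^Ind.*$", var_names, value = TRUE)

hash_match
star_match
```

Here the digit pattern picks out only Ind1 to Ind4, whereas the broader pattern also picks up Ind and Index, mirroring the distinction drawn in the text.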
i:
“Intersections” between multiple variables can be generated using the prefix i:. For example, i:Ind* generates a variable representing the observed combinations of values of Ind1, Ind2, Ind3, and Ind4. (If at least one of the variables is missing, the combination will be missing.)
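The idea of an intersection variable can be sketched in base R. The example below is an illustration only: the data values are hypothetical, only two indicator variables are used for brevity, and pasting the values together is an assumed way of labelling combinations (the labels vtree actually produces may look different).

```r
# Hypothetical indicator variables with one missing value.
ind <- data.frame(Ind1 = c(1, 0, 1, NA),
                  Ind2 = c(0, 0, 1, 1))

# Combine the values in each row into a single pattern string.
pattern <- do.call(paste0, unname(as.list(ind)))

# If any variable in a row is missing, the combination is missing.
pattern[!complete.cases(ind)] <- NA

pattern
table(pattern, useNA = "ifany")
```

Tabulating the pattern variable gives one count per observed combination, which is the information an intersection layer in a variable tree would display.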
r: (for REDCap)
vtree includes special support for REDCap data sets. The prefix r: is used to indicate REDCap checkbox variables, and can be combined with other prefixes. This is described in the section on REDCap checkboxes later in this vignette.
any:
Sometimes a group of variables contains responses to a list of checkbox options (often with instructions to “check all that apply”). For example, suppose you have a data frame of shops, including whether they are open on Saturday (openSaturday) or Sunday (openSunday). Suppose no other variables start with open. Then open* will match both openSaturday and openSunday.
In general for a group of checkbox variables, it is often useful to know if any of the options were selected (i.e. checked). In the case above, we might want to know which shops are open at all on the weekend (either Saturday or Sunday).
A specification like any:open* is used to generate a variable that is
- TRUE if any of the matching variables has a “checked” value
- FALSE if none of the matching variables have “checked” values.
The parameters checked and unchecked specify which values are considered checked or unchecked respectively, and have the following defaults:
parameter | default value |
---|---|
checked | c("1","TRUE","Yes","yes") |
unchecked | c("0","FALSE","No","no") |
Values not listed in checked or unchecked are treated as missing values.
An alternative prefix, anyx:, is used to specify that missing values will be removed when performing the calculation. This matches the behavior of the R function any when na.rm=TRUE is specified.
none:
The logical complement (negation) of the any: prefix. An alternative prefix, nonex:, is used to specify that missing values will be removed when performing the calculation.
all:
A specification like all:open* generates a variable which is TRUE if all of the matching variables have a “checked” value.
An alternative prefix, allx:, is used to specify that missing values will be removed when performing the calculation. This matches the behavior of the R function all when na.rm=TRUE is specified.
notall:
The logical complement (negation) of the all: prefix. An alternative prefix, notallx:, is used to specify that missing values will be removed when performing the calculation.
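The effect of removing missing values can be seen by applying base R's any and all to a small table of checkbox responses. This sketch does not use vtree. The shop data are hypothetical, the helper to_logical is a name introduced here, and the treatment of missing values for the plain any: and all: prefixes is assumed to follow any() and all() without na.rm, by analogy with the statements above about anyx: and allx:.

```r
# Default checked and unchecked values, as listed in the table above.
checked   <- c("1", "TRUE", "Yes", "yes")
unchecked <- c("0", "FALSE", "No", "no")

# Hypothetical helper: map raw responses to TRUE / FALSE / NA.
to_logical <- function(x) {
  x <- as.character(x)
  ifelse(x %in% checked, TRUE, ifelse(x %in% unchecked, FALSE, NA))
}

# Hypothetical shop data with some missing responses.
shops <- data.frame(openSaturday = c("Yes", "No", "No", NA),
                    openSunday   = c("No",  "No", NA,   NA))

flags <- sapply(shops, to_logical)  # logical matrix, one row per shop

any_open  <- apply(flags, 1, any)                # missing values propagate
anyx_open <- apply(flags, 1, any, na.rm = TRUE)  # missing values removed
all_open  <- apply(flags, 1, all)
none_open <- !any_open                           # logical complement of any

data.frame(shops, any_open, anyx_open, all_open, none_open)
```

The third shop illustrates the difference: with one "No" and one missing response, any() returns NA, whereas removing the missing value first gives FALSE.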
tri:
The tri: prefix is useful for identifying values of a numeric variable that are extreme compared to the other values in a node. Note: Unlike other variable specifications, which take effect at the level of the entire data frame, the tri: prefix takes effect within each node.
The effect of this variable specification is to trichotomize the values of a numeric variable, i.e. to divide them into three groups:
- “mid”: values within plus or minus 1.5×IQR of the median,
- “high”: values more than 1.5×IQR above the median,
- “low”: values more than 1.5×IQR below the median.
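The three groups can be reproduced for a single vector of values with a short base R function. This is a sketch of the rule as stated above, not vtree's code: the function name trichotomize and the example scores are hypothetical, IQR() uses R's default quantile method (which vtree may or may not share), and the vector stands in for the values within one node.

```r
# Hypothetical helper: classify values relative to median +/- 1.5 x IQR.
trichotomize <- function(x) {
  centre <- median(x, na.rm = TRUE)
  spread <- 1.5 * IQR(x, na.rm = TRUE)
  group <- rep(NA_character_, length(x))  # missing values stay missing
  group[which(x >= centre - spread & x <= centre + spread)] <- "mid"
  group[which(x > centre + spread)] <- "high"
  group[which(x < centre - spread)] <- "low"
  group
}

# Hypothetical values for one node; 100 is far above the rest.
scores <- c(1, 2, 3, 4, 5, 100)
trichotomize(scores)
```

For these values the median is 3.5 and 1.5×IQR is 3.75, so only the value 100 falls outside the middle band and is classified as high.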
variable=value
When a variable takes on a large number of different values, the resulting variable tree will be very large. One solution is to prune the tree, for example by keeping just the node corresponding to one value of a particular variable. An alternative is to specify the value of the variable that is of primary interest and vtree will dichotomize the variable at that value. For example if Severity=Mild is specified, the Severity variable will be dichotomized between Mild and Not Mild.
variable<value, variable>value
These two specifications are used to dichotomize a numeric variable, splitting above and below a specified value. This can be useful for identifying subsets with extreme values.
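Both kinds of dichotomization have simple base R analogues. The sketch below is illustrative only: the data values are hypothetical, the cutoff of 10 is arbitrary, and the labels vtree attaches to the two sides of a numeric split are not reproduced here.

```r
# Hypothetical values for five patients.
severity <- c("Mild", "Severe", "Moderate", "Mild", NA)
score    <- c(3, 12, 9.5, 20, NA)

# Analogue of Severity=Mild: the value of interest versus everything else.
severity_mild <- ifelse(severity == "Mild", "Mild", "Not Mild")

# Analogue of Score<10: split a numeric variable at a cutoff.
score_below_10 <- score < 10

data.frame(severity, severity_mild, score, score_below_10)
```

In both cases a missing input stays missing in the dichotomized variable, so it can still be shown as its own node.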
This section shows how to display information about other variables in the nodes.
It is often useful to display information about other variables (apart from those that define the tree) in the nodes of a variable tree. This is particularly useful for numeric variables, which usually would not be used to build the tree since they have too many distinct values. The summary parameter allows you to show information (for example, a mean) about a specified variable within a subset of the data frame.
Suppose you are interested in summary information for the Score variable for all of the observations in the data frame (i.e. in the root node). In that case you don’t need to specify any variables for the tree itself:
vtree(FakeData,summary="Score")
When the name of a numeric variable (in this case "Score") is specified as the argument of the summary parameter, a default set of summary statistics (as shown above) appears: the variable name, the number of missing values, the mean and standard deviation, the median and interquartile range (IQR), and the range.
(Note, however, that if there are three or fewer observations, instead of showing the above summary statistics, the observations are simply listed.)
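The quantities in the default numeric summary can be computed directly in base R for any subset of the data. The sketch below uses a hypothetical vector of scores rather than FakeData, and it relies on R's default quantile method, which is an assumption about how the quartiles are obtained.

```r
# Hypothetical scores for the observations in one node (one value missing).
score <- c(2, 4, 4, 5, 7, 9, NA)

stats <- list(
  missing   = sum(is.na(score)),                   # number of missing values
  mean      = mean(score, na.rm = TRUE),
  sd        = sd(score, na.rm = TRUE),
  median    = median(score, na.rm = TRUE),
  quartiles = unname(quantile(score, c(0.25, 0.75), na.rm = TRUE)),
  range     = range(score, na.rm = TRUE)
)

str(stats)
```

Applying the same calculation to each subset defined by a variable tree gives the per-node summaries that the summary parameter displays.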
Suppose we’re building a variable tree based on Severity. We can display these summaries for Score in each node:
vtree(FakeData,"Severity",summary="Score",horiz=FALSE)
Sometimes it is helpful to extract summary information as text. For example, we might wish to access the summary information contained in the Mild node. This is explained later on, but here’s a brief example:
vSeverity <- vtree(FakeData,"Severity",summary="Score",horiz=FALSE)
info <- attributes(vSeverity)$info
cat(info$Severity$Mild$.text)
##
## Score
## missing 1
## mean 12.1 SD 14.6
## med 5.5 IQR 3.2, 9.8
## range 1.0, 45.0
There are also default summaries for factor variables and for indicator variables. For example, Category is a factor variable:
vtree(FakeData,summary="Category")
Indicator variables have two levels such as 0 / 1, or TRUE / FALSE. For example, Event is an indicator variable:
vtree(FakeData,summary="Event")
Variables in the summary argument can also be specified in a way that is similar to the specification of variables for structuring a variable tree. For example, if we wish to know the proportion of patients in each node whose Category is single, we specify Category=single in the summary argument:
vtree(FakeData,"Severity",summary="Category=single",horiz=FALSE)
Summaries can be obtained for a collection of variables using pattern-matching, for example:
vtree(FakeData,"Severity",summary="Ind*",sameline=TRUE,horiz=FALSE,just="l")
Incidentally, note that just="l" specifies that all text should be left-justified, which conveniently lines up all of the rows of the summary.
The summary argument can also use the prefixes i:, any:, none:, all:, notall: (as well as anyx:, nonex:, allx:, and notallx:) and wildcards # and * (similar to variable specifications). Additionally, specifications for REDCap checkboxes can be used.
%noroot%, %leafonly%, %var=v%, and %node=n%
By default, summary information is shown in all nodes. However, it may be convenient to show it only in specific nodes. To control this, special codes that begin and end with % can be specified. The following control codes are available:
code | summary information restricted to: |
---|---|
%noroot% | all nodes except the root |
%leafonly% | leaf nodes |
%var=v% | nodes of variable v |
%node=n% | nodes named n |
The control codes can be specified by adding them to the end of the summary string, separated with a space. For example, to only show summary information for nodes of the Category variable with the value single:
vtree(FakeData,"Severity Category",summary="Score<10 %var=Category%%node=single%",
sameline=TRUE, showlegend=TRUE, showlegendsum=TRUE)
Here showlegend=TRUE was specified, and additionally showlegendsum=TRUE, which indicates that summaries should also be shown in legend nodes.
The summary parameter also allows for customized summaries. For example, we might wish to display only the mean Score in each node of the tree. The %mean% code is used to represent the mean of the specified variable (preceded here by a line break, \n).
vtree(FakeData,"Severity",summary="Score \nmean score\n%mean%",sameline=TRUE,horiz=FALSE)
In addition to the %mean% code, numerous other summary codes are supported, as listed in the table below. When such a code is present, the default summary is not shown. Instead, any text that is provided (in this case \nmean score\n) is shown, together with the requested summary information. If there are any missing values in a node, the number of missing values is shown using the abbreviation mv. To see summaries without any decimals, specify cdigits=0.
summary code | result |
---|---|
%mean% | mean (variant: %meanx% does not report missing values*) |
%SD% | standard deviation (variant: %SDx% does not report missing values*) |
%sum% | sum (variant: %sumx% does not report missing values*) |
%min% | minimum (variant: %minx% does not report missing values*) |
%max% | maximum (variant: %maxx% does not report missing values*) |
%range% | range (variant: %rangex% does not report missing values*) |
%median% | median, i.e. p50 (variant: %medianx% does not report missing values*) |
%IQR% | IQR, i.e. p25, p75 (variant: %IQRx% does not report missing values*) |
%freqpct% | frequency and percentage of values of a variable (variant: %freqpct_% shows each value on a separate line) |
%freq% | frequency of values of a variable (variant: %freq_% shows each value on a separate line) |
%pY% | Yth percentile (e.g. p50 means the 50th percentile) |
%npct% | frequency and percentage of a logical variable. By default “valid percentages” are used. Any missing values are also reported. |
%pct% | same as %npct% but percentage only (with no parentheses). |
%list% | list of individual values, separated by commas (variant: %list_% shows each value on a separate line) |
%mv% | the number of missing values |
%nonmv% | the number of non-missing values |
%v% | the name of the variable |
*Caution is recommended when suppressing missing values.
The summary argument can include any number of these codes, mixed with text and formatting codes.
%trunc% code
It is sometimes convenient to see individual values of a variable in each node. A good example is ID numbers. To do this, use the %list% code. When a value occurs more than once in the subset, it will be followed by a count of the number of repetitions in parentheses.
When there are many individual values, it is often convenient to truncate the output. If you specify %trunc=N%, summary information will be truncated after N characters, and followed by “…”.
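The effect of truncation can be pictured with a short base R sketch. The ID numbers below are hypothetical, and the code is only an illustration of cutting a comma-separated list after N characters; vtree's own implementation may differ in detail (for example, in the exact ellipsis it appends).

```r
ids <- c(103, 116, 126, 140)            # hypothetical ID numbers
listed <- paste(ids, collapse = ", ")   # the kind of text a list of values produces

n <- 15                                 # keep at most 15 characters
truncated <- if (nchar(listed) > n) {
  paste0(substr(listed, 1, n), "...")   # cut after n characters, then mark the cut
} else {
  listed
}
print(truncated)
```

Because the cut is made by character count rather than by value, a truncation point can fall in the middle of a value, so N is best chosen with the typical width of the values in mind.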
Rather than starting the summary argument with a variable name, an R expression involving variables in the data frame can be given, as long as it does not contain any spaces.
vtree(FakeData,"Severity Category",
summary="(Post-Pre)/Pre \nmean = %mean%",sameline=TRUE,horiz=FALSE,cdigits=1)
Expressions involving functions can also be used; for example, sqrt(abs(Post/Pre)).
Sometimes it is useful to display summary information for more than one variable. To do this, specify summary as a vector of character strings. For example:
vtree(FakeData,"Severity",horiz=FALSE,showvarnames=FALSE,splitwidth=Inf,sameline=TRUE,
summary=c("Score \nScore: mean (SD) %meanx% (%SD%)","Pre \nPre: range %range%"))
Sometimes you only want to show a summary in a particular node. Targeted summaries are specified with the tsummary parameter as a list of character-string vectors. The initial elements of each character string vector point to a specific node. The final element of each character string vector is a summary string, with the same structure as summary.
vtree(FakeData,"Age Sex",tsummary=list(list(Age="5",Sex="M","id \n%list%")),horiz=FALSE)
This section shows how to display all the combinations of values in a set of variables.
Each node in a variable tree provides the frequency of a particular combination of values of the variables. The leaf nodes represent the observed combinations of values of all of the variables. For example, in a variable tree for Severity and Sex, the leaf nodes correspond to Mild F, Mild M, Moderate F, Moderate M, etc. These combinations, or “patterns”, can be treated as an additional variable. And if this new pattern variable is used as the first variable in a tree, then the branches of the tree will be simplified: each branch will represent a unique pattern, with no sub-branches. A “pattern tree” can be easily produced by specifying pattern=TRUE. For example:
vtree(FakeData,"Severity Sex")
vtree(FakeData,"Severity Sex",pattern=TRUE)
Pattern trees are simpler to read than ordinary variable trees, but they involve a considerable loss of information, since they only represent the nth-degree subsets (where n is the number of variables).
Note that by default, when pattern=TRUE is specified, the root node is not shown (in order to simplify the display). A disadvantage of this is that the total sample size is not shown. You can override this behavior by specifying showroot=TRUE.
A pattern tree has two other special characteristics. First, note that after the first layer (representing pattern), counts and percentages are not shown, since they are not informative: by definition, all nodes within a branch have the same count. Second, note that in place of arrows, undirected line segments are shown. This is because, unlike in a regular variable tree, the order of variables is irrelevant in a pattern tree. Sometimes, however, the variables do have a natural ordering, as in the case of longitudinal variables. To show arrows, specify seq=TRUE instead of pattern=TRUE, and a “sequence” (i.e. an ordered pattern) will be shown.
Summaries can be shown in pattern trees (using the summary parameter), but they only appear in the pattern node (or the sequence node if seq=TRUE).
A pattern tree has the same structure as a table. Indeed, it may be more convenient to produce a table rather than a pattern tree. A data frame containing the information from the pattern tree can be exported by specifying ptable=TRUE:
vtree(FakeData,"Severity Sex",ptable=TRUE)
## n pct Severity Sex
## 1 2 4 Severe F
## 2 3 7 <NA> F
## 3 3 7 <NA> M
## 4 3 7 Severe M
## 5 5 11 Moderate M
## 6 8 17 Mild M
## 7 11 24 Mild F
## 8 11 24 Moderate F
The pattern table includes a column for the counts from the pattern nodes, and a column for percentages. Compared to a variable tree, this table is much more compact, and may be more suitable for use in a manuscript.
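For intuition, a table with the same columns can be approximated in base R by cross-tabulating the variables and keeping only the combinations that actually occur. This is a sketch under stated assumptions: the data frame d is hypothetical (not FakeData), and the code shows the idea of a pattern table rather than vtree's implementation.

```r
# Hypothetical data (not FakeData); NA marks a missing Severity
d <- data.frame(
  Severity = c("Mild", "Mild", "Moderate", "Severe", NA, "Mild"),
  Sex      = c("F", "F", "M", "M", "F", "M")
)

# Cross-tabulate, keeping missing values as their own category
tab <- as.data.frame(table(d, useNA = "ifany"), responseName = "n")
tab <- tab[tab$n > 0, ]                      # keep only observed patterns
tab$pct <- round(100 * tab$n / sum(tab$n))   # percentage of all rows
tab <- tab[order(tab$n), c("n", "pct", "Severity", "Sex")]
rownames(tab) <- NULL
print(tab)
```

Each row is one observed pattern, sorted from least to most frequent, which mirrors the layout of the pattern table shown above.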
Pattern trees can be very useful for indicator variables, i.e. variables that take values like 0/1, no/yes, FALSE/TRUE, etc. For convenience in this section, we’ll refer to 0 (or no, FALSE, etc.) as a negative and 1 (or yes, TRUE, etc.) as an affirmative.
The variables Ind1 through Ind4 in FakeData are 0/1 indicator variables. If these variables are interpreted as representing set membership (0 = non-member, 1 = member), then a pattern tree is an alternative representation of a Venn diagram. If you specify Venn=TRUE, the nodes (except for the pattern nodes) will be blank, with only their shade indicating their value (dark = 1, light = 0, white = missing).
vtree(FakeData,"Ind1 Ind2 Ind3 Ind4",Venn=TRUE,pattern=TRUE)
Big pattern trees can be overwhelming, so it may be useful to prune patterns that occur fewer than, say, 3 times, by specifying prunesmaller=3.
A pattern tree for indicator variables provides all the information that a Venn diagram represents, but unlike a Venn diagram, missing values are also represented. This can also be shown as a pattern table. For example:
vtree(FakeData,"Ind1 Ind2",ptable=TRUE)
## n pct Ind1 Ind2
## 1 1 2 <NA> 0
## 2 10 22 1 0
## 3 11 24 0 1
## 4 12 26 0 0
## 5 12 26 1 1
VennTable function
For indicator variables, there is an extra function, VennTable, which converts the pattern table to a matrix of character strings and adds some additional totals.
VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE))
## n pct Ind2 Ind1
## "1" "2" "0" NA
## "10" "22" "0" "10"
## "11" "24" "11" "0"
## "12" "26" "0" "0"
## "12" "26" "12" "12"
## Total "46" "100" "" ""
## N "" "" "23" "22"
## pct "" "" "50" "48"
By default in R, when a matrix of character strings is printed, quotation marks are displayed around each element. Unfortunately the result is unattractive. Instead it’s helpful to call the print function and specify quote=FALSE:
print(VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE)),quote=FALSE)
## n pct Ind2 Ind1
## 1 2 0 <NA>
## 10 22 0 10
## 11 24 11 0
## 12 26 0 0
## 12 26 12 12
## Total 46 100
## N 23 22
## pct 50 48
Without all those quotation marks, it’s easier to see what VennTable adds:
the total sample size (46) and percentage (100), and
the total number (N) of affirmatives for each variable, together with a percentage.
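The added totals can be reproduced by hand from the pattern table printed earlier. The base R sketch below re-enters that table (the n, Ind1, and Ind2 columns) and recomputes the totals; it is an illustration of the arithmetic, not the code VennTable itself uses.

```r
# The pattern table for Ind1 and Ind2 shown above, typed in by hand
pt <- data.frame(
  n    = c(1, 10, 11, 12, 12),
  Ind1 = c(NA, 1, 0, 0, 1),
  Ind2 = c(0, 0, 1, 0, 1)
)

total <- sum(pt$n)   # total sample size

# For each indicator, add up the counts of the patterns where it equals 1;
# %in% treats a missing value as "not affirmative"
N <- sapply(pt[c("Ind1", "Ind2")], function(v) sum(pt$n[v %in% 1]))
pct <- round(100 * N / total)

print(total)
print(N)
print(pct)
```

The results agree with the N and pct rows of the printed table: 22 (48%) for Ind1 and 23 (50%) for Ind2, out of 46.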
The VennTable function can also be used in an R Markdown document. Specifying markdown=TRUE generates a pandoc markdown pipe table, with several formatting tweaks:
the rows and columns of the table are transposed
affirmatives are represented by checkmarks
negatives are represented by spaces
missing values are represented by dashes (which can be changed with the NAcode parameter).
To display the table in R Markdown, use this inline call:
`r VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE),markdown=TRUE)`
| | | | | | | Total | N | % |
|---|---|---|---|---|---|---|---|---|
| n | 1 | 10 | 11 | 12 | 12 | 46 | | |
| % | 2 | 22 | 24 | 26 | 26 | 100 | | |
| Ind2 | | | ✔ | | ✔ | | 23 | 50 |
| Ind1 | - | ✔ | | | ✔ | | 22 | 48 |
VennTable has some additional parameters. The checked parameter is used to specify values that should be interpreted as affirmative. By default, it is set to c("1","TRUE","Yes","yes","N/A"). Similarly, the unchecked parameter is used to specify values that should be interpreted as negative, with default c("0","FALSE","No","no","not N/A").
summary parameter in pattern tables
The summary parameter can also be used in pattern tables. If a single summary is requested, it appears in the summary_1 variable in the data frame. Additional summaries appear as summary_2, summary_3, etc.
vtree(FakeData,"Severity Sex",summary=c("Score %mean%","Pre %mean%"),ptable=TRUE)
## n pct Severity Sex summary_1 summary_2
## 1 2 4 Severe F 28.0 -0.4
## 2 3 7 <NA> F 6.3 -0.1
## 3 3 7 <NA> M 23.7 -0.9
## 4 3 7 Severe M 44.0 -0.3
## 5 5 11 Moderate M 8.2 -0.7 mv=1
## 6 8 17 Mild M 6.3 mv=1 0.2
## 7 11 24 Mild F 15.7 -0.4 mv=2
## 8 11 24 Moderate F 21.5 mv=1 0.0
check.is.na parameter
If check.is.na=TRUE is specified, each variable is replaced by an indicator of whether or not it is missing, and pattern=TRUE is automatically set. As when Venn=TRUE is specified, all nodes except for the pattern node are blank, and only their shade indicates missing (dark) or not (light). Whereas the variables used to build a variable tree are normally categorical, in this situation non-categorical variables can be used, because their missingness is represented instead of their actual values.
vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE)
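The idea of replacing each variable by a missingness indicator can be sketched in base R. The data frame d below is hypothetical (not FakeData), and the MISSING_ prefix is used only to echo the column names that appear in the pattern tables of this section; the code is not vtree's implementation.

```r
# Hypothetical data with some missing values (not FakeData)
d <- data.frame(
  Age  = c(5, NA, 7, NA),
  Pre  = c(1.2, NA, NA, 0.4),
  Post = c(2.0, 1.1, NA, 0.9)
)

# Replace every variable by TRUE (missing) / FALSE (not missing)
miss <- as.data.frame(is.na(d))
names(miss) <- paste0("MISSING_", names(d))

# Count each observed pattern of missingness
pattern <- as.data.frame(table(miss), responseName = "n")
pattern <- pattern[pattern$n > 0, ]

# Number of missing values per variable
per_variable <- colSums(miss)

print(pattern)
print(per_variable)
```

Because only missingness is tabulated, numeric variables such as Age can be used directly without first being grouped into categories.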
Specifying ptable=TRUE produces this information in a data frame, and calling VennTable shows additional information. To display the table in R Markdown, use this inline call:
`r VennTable(vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE,ptable=TRUE),
markdown=TRUE)`
| | | | | | | | | | Total | N | % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n | 1 | 1 | 1 | 1 | 2 | 4 | 4 | 32 | 46 | | |
| % | 2 | 2 | 2 | 2 | 4 | 9 | 9 | 70 | 100 | | |
| MISSING_Age | ✔ | | | | ✔ | | ✔ | | | 7 | 15 |
| MISSING_Severity | | | | | ✔ | ✔ | | | | 6 | 13 |
| MISSING_Pre | ✔ | ✔ | ✔ | | | | | | | 3 | 7 |
| MISSING_Post | | ✔ | | ✔ | | | | | | 2 | 4 |
The rows n and % represent the frequency and percentage of the total number of cases for each pattern of missingness, and the columns N and % on the right-hand side represent the frequency and percentage of missingness for each variable.
It may be useful to identify the ID numbers for these patterns. Here the results are truncated to 15 characters:
vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE,summary="id %list%%trunc=15%",
ptable=TRUE)
## n pct MISSING_Severity MISSING_Age MISSING_Pre MISSING_Post summary_1
## 1 1 2 not N/A N/A N/A not N/A 124
## 2 1 2 not N/A not N/A N/A N/A 118
## 3 1 2 not N/A not N/A N/A not N/A 108
## 4 1 2 not N/A not N/A not N/A N/A 104
## 5 2 4 N/A N/A not N/A not N/A 112, 135
## 6 4 9 N/A not N/A not N/A not N/A 103, 116, 126, ...
## 7 4 9 not N/A N/A not N/A not N/A 105, 119, 128, ...
## 8 32 70 not N/A not N/A not N/A not N/A 101, 102, 106, ...
This section explains how colors and color palettes can be used.
By default, vtree assigns colors to nodes of each successive variable using color palettes from RColorBrewer. The sequence of palettes (identified by short names) is as follows:
1 | Reds | 6 | YlGn | 11 | BuPu | 16 | RdPu |
2 | Blues | 7 | PuBu | 12 | YlOrRd | 17 | BuGn |
3 | Greens | 8 | PuRd | 13 | RdYlGn | 18 | OrRd |
4 | Oranges | 9 | YlOrBr | 14 | GnBu | | |
5 | Purples | 10 | PuBuGn | 15 | YlGnBu | | |
If you prefer to change the color assignments, you can use the palette parameter. For example, by default a variable tree for Sex and Severity will assign shades of red to nodes of Sex and shades of blue to nodes of Severity. To switch to shades of, say, green and orange instead, use:
vtree(FakeData,"Sex Severity",palette=c(3,4))
Sometimes it may be useful to reverse the order of a gradient. To reverse the order of all gradients, specify revgradient=TRUE. The gradient for selected variables can be reversed as in the example below:
vtree(FakeData,"Sex Group Severity",revgradient=c(Sex=TRUE,Severity=TRUE))
Other color-related parameters include:
sortfill | Specifying sortfill=TRUE fills nodes with gradient colors in sorted order according to the node count. |
NAfillcolor | By default, missing value nodes are colored white. For a different color (say gray), specify NAfillcolor="gray". To instead use a color from the current palette, specify NAfillcolor=NULL. |
rootfillcolor | The color of the root node can be changed (say to yellow) by specifying rootfillcolor="yellow". |
fillcolor | To set all nodes of the tree (except for missing value nodes and the root node) to be the same color (say palegreen), specify fillcolor="palegreen". |
plain | A simple color scheme is produced by specifying plain=TRUE. (Additionally, this increases the spaces between nodes.) |
This section details support for checkbox variables from REDCap.
In datasets exported from REDCap, checkboxes (i.e. select-all-that-apply boxes) are represented in a special way. For each item in a checklist, a separate variable is created. Suppose survey respondents were asked to select which flavors of ice cream (Chocolate, Vanilla, Strawberry) they like. Within REDCap, the variable name for this list of checkboxes is IceCream, but when the dataset is exported, individual variables IceCream___1 (representing Chocolate), IceCream___2 (Vanilla), and IceCream___3 (Strawberry) are created. When the dataset is read into R, the names of the flavors are embedded in the attributes of these variables.
For illustrative purposes, let’s build a data frame like this using the build.data.frame function (for an explanation of this function, see the section of this vignette on generating a data frame by specifying subset sizes):
dessert <- build.data.frame(
c( "group","IceCream___1","IceCream___2","IceCream___3"),
list("A", 1, 0, 0, 7),
list("A", 1, 0, 1, 2),
list("A", 0, 0, 0, 1),
list("A", 1, 1, 1, 1),
list("B", 1, 0, 1, 1),
list("B", 1, 0, 0, 2),
list("B", 0, 1, 1, 1),
list("B", 0, 0, 0, 1))
attr(dessert$IceCream___1,"label") <- "Ice cream (choice=Chocolate)"
attr(dessert$IceCream___2,"label") <- "Ice cream (choice=Vanilla)"
attr(dessert$IceCream___3,"label") <- "Ice cream (choice=Strawberry)"
r:
The prefix r: identifies a REDCap checklist variable, and extracts a label from the variable attribute. For example, the following call automatically displays “Chocolate”:
vtree(dessert,"r:IceCream___1")
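To see where such a label could come from, the sketch below pulls the choice name out of label strings of the form used above. It is a base R illustration only: the regular expression is an assumption about the label format shown in this section, not the parsing that vtree actually performs.

```r
# A stand-in for one exported checkbox variable, with a label attribute
# in the same format as the dessert example above
IceCream___1 <- c(1, 0, 1)
attr(IceCream___1, "label") <- "Ice cream (choice=Chocolate)"

labels <- c(
  attr(IceCream___1, "label"),
  "Ice cream (choice=Vanilla)",
  "Ice cream (choice=Strawberry)"
)

# Keep only the text between "(choice=" and the closing parenthesis
choices <- sub("^.*\\(choice=(.*)\\)$", "\\1", labels)
print(choices)
```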
@
The suffix @ matches REDCap checklist variables based on the naming scheme used by REDCap for checklist variables. For example, the following call automatically displays Chocolate, Vanilla, and Strawberry:
vtree(dessert,"r:IceCream@")
rany:, rnone:, rall:, and rnotall:
The variable prefixes any:, none:, all:, and notall: can be combined with the r: prefix to form rany:, rnone:, rall:, and rnotall:. For example, to determine whether anyone did not like any of the flavors (Chocolate, Vanilla, or Strawberry):
vtree(dessert,"rnone:IceCream@")
ri:
“Intersections” of REDCap variables may be obtained by combining the r: prefix with the i: prefix:
vtree(dessert,"ri:IceCream@")
stem: and rc:
To examine the pattern of ice-cream flavor choices, the following can be used:
vtree(dessert,"IceCream___1 IceCream___2 IceCream___3",pattern=TRUE)
One problem is that this doesn’t assign the appropriate labels to IceCream___1 (Chocolate), IceCream___2 (Vanilla), and IceCream___3 (Strawberry).
Instead, try the following more compact call, which also assigns labels automatically:
vtree(dessert,"stem:IceCream",pattern=TRUE)
The summary parameter also supports a stem: prefix:
vtree(dessert,summary="stem:IceCream",splitwidth=Inf,just="l")
If you wish to examine only specific REDCap checkbox items, the rc: prefix can be used. For example, to examine results for just Chocolate and Strawberry:
vtree(dessert,"rc:IceCream___1 rc:IceCream___3",pattern=TRUE)
vtree
This section shows how to obtain the DOT script that displays a variable tree.
Specifying getscript=TRUE lets you capture the DOT script representing a variable tree. (DOT is a graph description language used by Graphviz, which is used by DiagrammeR, which is used by vtree!) Here is an example:
dotscript <- vtree(FakeData,"Severity",getscript=TRUE)
cat(dotscript)
digraph vtree {
graph [nodesep=0.1, ranksep=0.5, tooltip=" "]
node [fontname = "Arial", fontcolor = black,shape = rectangle, color = black, tooltip=" ",margin=0.1]
rankdir=LR;
Node_L0_0 [style=invisible]
Node_L1_0[label=<<FONT POINT-SIZE="24"><FONT COLOR="#DE2D26">Severity</FONT></FONT>> shape=none margin=0]
Node_L0_0 -> Node_L1_0 [style=invisible arrowhead=none]
edge[style=solid]
Node_1->Node_2 Node_1->Node_3 Node_1->Node_4 Node_1->Node_5
Node_1[label=<46> fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_2[label=<Mild<BR/>19 (48%)> fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#FEE0D2> ]
Node_1[label=<46> fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_3[label=<Moderate<BR/>16 (40%)> fontcolor=<#ffffff> color=black style="rounded,filled" fillcolor=<#FC9272> ]
Node_1[label=<46> fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_4[label=<Severe<BR/>5 (12%)> fontcolor=<#ffffff> color=black style="rounded,filled" fillcolor=<#DE2D26> ]
Node_1[label=<46> fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_5[label=<NA<BR/>6> fontcolor=<#000000> color=black style="rounded,filled" fillcolor=<white> ]
}
If you wish to edit this code directly, it can be pasted into an online Graphviz editor.
This section explains how to obtain all of the counts and percentages of a variable tree.
Sometimes it is useful to extract counts, percentages, and summary information from a variable tree. The object returned by vtree has an attribute info containing structured information about the counts and percentages in each node. Here is an example:
v <- vtree(FakeData,"Group Viral",horiz=FALSE)
v
attributes(v)$info
## $.n
## [1] 46
##
## $.pct
## [1] 100
##
## $Group
## $Group$A
## $Group$A$.n
## [1] 24
##
## $Group$A$.pct
## [1] 52
##
## $Group$A$Viral
## $Group$A$Viral$`FALSE`
## $Group$A$Viral$`FALSE`$.n
## [1] 13
##
## $Group$A$Viral$`FALSE`$.pct
## [1] 65
##
##
## $Group$A$Viral$`TRUE`
## $Group$A$Viral$`TRUE`$.n
## [1] 7
##
## $Group$A$Viral$`TRUE`$.pct
## [1] 35
##
##
## $Group$A$Viral$`NA`
## $Group$A$Viral$`NA`$.n
## [1] 4
##
## $Group$A$Viral$`NA`$.pct
## [1] NA
##
##
##
##
## $Group$B
## $Group$B$.n
## [1] 22
##
## $Group$B$.pct
## [1] 48
##
## $Group$B$Viral
## $Group$B$Viral$`FALSE`
## $Group$B$Viral$`FALSE`$.n
## [1] 12
##
## $Group$B$Viral$`FALSE`$.pct
## [1] 67
##
##
## $Group$B$Viral$`TRUE`
## $Group$B$Viral$`TRUE`$.n
## [1] 6
##
## $Group$B$Viral$`TRUE`$.pct
## [1] 33
##
##
## $Group$B$Viral$`NA`
## $Group$B$Viral$`NA`$.n
## [1] 4
##
## $Group$B$Viral$`NA`$.pct
## [1] NA
The list contains the counts (.n), percentages (.pct), and summary text (.text) that appear in the tree.
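Since the info attribute is an ordinary nested list, individual values can be reached with the usual $ operator. The sketch below re-enters part of the structure printed above by hand (so that it runs without vtree) and shows how to pull out a single count, collect the counts of sibling nodes, and confirm that the percentages are "valid" percentages that exclude missing values.

```r
# Part of the structure shown above, typed in by hand
info <- list(
  .n = 46, .pct = 100,
  Group = list(
    A = list(
      .n = 24, .pct = 52,
      Viral = list(
        `FALSE` = list(.n = 13, .pct = 65),
        `TRUE`  = list(.n = 7,  .pct = 35),
        `NA`    = list(.n = 4,  .pct = NA)
      )
    )
  )
)

viralA <- info$Group$A$Viral

# A single count: Group A, Viral = TRUE
n_true <- viralA$`TRUE`$.n

# Counts for all Viral nodes within Group A
counts <- sapply(viralA, function(node) node$.n)

# The percentage uses only non-missing values as the denominator
valid_pct <- round(100 * n_true / (viralA$`TRUE`$.n + viralA$`FALSE`$.n))

print(counts)
print(valid_pct)
```

Node names that are not syntactic R names, such as TRUE, FALSE, and NA, need backticks when used with $.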
vtree behaves differently depending on the context in which it is called.
If vtree is called interactively in RStudio, it displays the variable tree in the Viewer window.
If vtree is called interactively from the RGui console (i.e. from R outside of RStudio), it displays the variable tree in a browser window.
When vtree is called from knitr, it generates
a PNG file if the output format is Markdown
a PDF file if the output format is LaTeX.
Here’s how it does that. vtree uses the DiagrammeR package, which automatically generates an htmlwidget object for display in HTML, using the htmlwidgets framework. Then vtree converts the htmlwidget object into an SVG image, and finally into a PNG or PDF file.
PNG files are useful because they allow you to display variable trees in Microsoft Word documents, and also because HTML files that use htmlwidgets can get large, and if they contain several widgets they can be slow to load.
If vtree is called while an R Markdown file is being knitted, it generates a PNG file and automatically embeds it into the knitted document. The resolution of the PNG file in pixels is determined by the parameters pxwidth and pxheight. If neither is specified, pxwidth is automatically set to 2000, which provides good resolution for a printed page. The height of the image in the R Markdown output document can be specified using the imageheight parameter, for example imageheight="4in" for a 4-inch image. There is also an imagewidth parameter. If neither is specified, imageheight is automatically set to 3 inches.
Note: You may notice a warning in the R Markdown rendering (in RStudio, the R Markdown pane) like this:
<unknown>:1919791: Invalid asm.js: Function definition doesn't match use
Although distracting, this message is irrelevant.
The PNG or PDF file is stored in the folder specified by the folder parameter, or if not specified, a temporary folder will be used. Successive PNG files are named vtree001.png, vtree002.png, and so forth and are stored in the folder. (Similarly, PDF files are named vtree001.pdf, etc.) During knitting, vtree uses the options function in base R to store a variable called vtcount to count the PNG files, and a variable called vtfolder to identify the folder where they will be stored.
To call vtree in R Markdown, you can use inline code:
`r vtree(FakeData,"Sex Severity")`
Or you can use a code chunk:
```{r}
vtree(FakeData,"Sex Severity")
```
One advantage of code chunks is that they can also be run interactively (for example within RStudio, by clicking on the green arrow at the top right of a code chunk).
Specifying imageFileOnly=TRUE instructs vtree to generate an image file but not display it.
When knitting to an HTML document, htmlwidgets can be used rather than embedding a PNG file. To use htmlwidgets instead of a PNG file, simply specify pngknit=FALSE.
svtree: Using vtree in Shiny
Thanks to Shiny and the svg-pan-zoom JavaScript library, interactive panning and zooming of a variable tree is possible with the svtree function. The syntax of svtree is the same as that of vtree, but instead of generating a static variable tree, it launches a Shiny app. The mousewheel allows you to zoom in or out. The variable tree can also be dragged to a different position.
Thanks to the panning and zooming functionality in svtree, it is possible to examine larger variable trees than with vtree. In large variable trees it is often useful to show the variable name in each node, since the variable labels (which are shown at the bottom or left-hand margin) may not be visible after zooming. To show the variable name in each node, specify showvarinnode=TRUE.
vtree is designed to generate a variable tree based on a data frame. However, sometimes the sizes of subsets are known but no data frame is available.
The build.data.frame function allows you to build a data frame by specifying the size of subsets. Here’s an example involving pets:
build.data.frame(
c("pet","breed","size"),
list("dog","golden retriever","large",5),
list("cat","tabby","small",2))
## pet breed size
## 1 dog golden retriever large
## 2 dog golden retriever large
## 3 dog golden retriever large
## 4 dog golden retriever large
## 5 dog golden retriever large
## 6 cat tabby small
## 7 cat tabby small
In this case there are five large golden retrievers and two small tabby cats. Although a data frame like this could easily be created without using build.data.frame, it’s a different situation when the counts are large. For example:
build.data.frame(
c("pet","breed","size"),
list("dog","golden retriever","large",5),
list("cat","tabby","small",2),
list("dog","Dalmatian","various",101),
list("cat","Abyssinian","small",5),
list("cat","Abyssinian","large",22),
list("cat","tabby","large",86))
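The underlying idea, one row per case with each combination repeated as many times as its count, can be sketched in base R with rep. The spec data frame and its count column are hypothetical names chosen for this illustration; this is not the implementation of build.data.frame.

```r
# One row per combination, with the subset size in a separate column
spec <- data.frame(
  pet   = c("dog", "cat"),
  breed = c("golden retriever", "tabby"),
  size  = c("large", "small"),
  count = c(5, 2)
)

# Repeat each row index as many times as its count, then drop the count column
cases <- spec[rep(seq_len(nrow(spec)), spec$count), c("pet", "breed", "size")]
rownames(cases) <- NULL
print(cases)
```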
Consider the following fictitious data about a randomized controlled trial (RCT):
FakeRCT
## id eligible randomized group followup analyzed
## 1 001 Eligible Randomized B Followed up Analyzed
## 2 002 Eligible Not randomized <NA> <NA> <NA>
## 3 003 Eligible Randomized A Not followed up <NA>
## 4 004 Eligible Randomized B Followed up Analyzed
## 5 005 Eligible Randomized A Followed up Analyzed
## 6 006 Ineligible <NA> <NA> <NA> <NA>
## 7 007 Eligible Randomized A Followed up Analyzed
## 8 008 Ineligible <NA> <NA> <NA> <NA>
## 9 009 Eligible Randomized A Followed up Analyzed
## 10 0010 Ineligible <NA> <NA> <NA> <NA>
## 11 0011 Eligible Randomized B Followed up Analyzed
## 12 0012 Ineligible <NA> <NA> <NA> <NA>
The CONSORT diagram (http://www.consort-statement.org/) shows the flow of patients through the study, starting with those who meet eligibility criteria, then those who are randomized, etc. It is easy to produce a rudimentary version of a CONSORT diagram in vtree. The key step is to prune branches for those who are not eligible, not randomized, etc. This can be done using the keep parameter:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
keep=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
Note that this does not include all of the additional information for a full CONSORT diagram (exclusion reasons and counts, as well as numbers of patients who received their allocated interventions, who discontinued intervention, and who were excluded from analysis). It does, however, provide the main flow information.
Additional information can be obtained by viewing the nodes for patients in the pruned branches (but not their descendants). The follow parameter makes that easy:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
Finally, it may be useful to see the ID numbers in each node. This can be done using the summary parameter with the %list% code. Since IDs are less useful in the root node, the %noroot% code is also specified here:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
summary="id \nid: %list% %noroot%")
The datasets package is loaded in R by default. In the following section, vtree is applied to several of these data sets for illustrative purposes. Note that the variable trees generated by the commands below are not shown. The reader can try these commands to see what the variable trees look like, and experiment with many other possibilities.
The esoph data set (data from a case-control study of esophageal cancer in Ille-et-Vilaine, France) has 88 different combinations of age group, alcohol consumption, and tobacco consumption. Let’s examine the total number of cases and the total number of controls among patients aged 75 and older compared to the rest of the patients:
# Relabel agegp 75+ to 75plus because vtree tries to parse the +
ESOPH <- esoph
levels(ESOPH$agegp)[levels(ESOPH$agegp)=="75+"] <- "75plus"
vtree(ESOPH,"agegp=75plus",sameline=TRUE,cdigits=0,
summary=c("ncases \ncases=%sum%%leafonly%","ncontrols controls=%sum%%leafonly%"))
The HairEyeColor data set is an array representing a contingency table (also called a crosstab or crosstabulation). Before vtree can be applied to this data set, it is necessary to convert the table of crosstabulated frequencies to a data frame of cases. For convenience, the vtree package includes a helper function to do this, called crosstabToCases. It is adapted from a function listed on the Cookbook for R website.
hec <- crosstabToCases(HairEyeColor)
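For readers curious about what such a conversion involves, here is a base R sketch of the general technique: turn the table into a data frame of counts, then repeat each row according to its frequency. It uses only base R and the built-in HairEyeColor data, and it is an approximation of the idea rather than the code of crosstabToCases itself.

```r
# One row per cell of the contingency table, with its frequency in Freq
counts <- as.data.frame(HairEyeColor)   # columns: Hair, Eye, Sex, Freq

# Repeat each cell's row Freq times, giving one row per individual
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye", "Sex")]
rownames(cases) <- NULL

# The number of cases equals the sum of all cell frequencies
nrow(cases) == sum(HairEyeColor)
```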
There are a lot of combinations, but let’s say we are especially interested in green eyes (as compared to non-green eyes). We can use the variable specification Eye=Green to do this:
vtree(hec,"Hair Eye=Green Sex",sameline=TRUE)
The Titanic dataset is a 4-dimensional array of counts. First, let’s convert it to a data frame of individuals:
titanic <- crosstabToCases(Titanic)
We’ll specify sameline=TRUE so that the variable tree is a bit more compact:
vtree(titanic,"Class Sex Age",summary="Survived=Yes \n%pct% survived",sameline=TRUE)
The mtcars data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
The rownames of the data set contain the names of the cars. Let’s move that information into a column. To do that, we’ll make a slightly altered version of the data frame which we’ll call mt:
mt <- mtcars
mt$name <- rownames(mt)
rownames(mt) <- NULL
Now let’s look at the mean and standard deviation of horsepower (HP) by number of carburetors, nested within number of gears, and in turn nested within number of cylinders:
vtree(mt,"cyl gear carb",summary="hp \nmean (SD) HP %mean% (%SD%)")
The above shows the mean and SD of horsepower by (1) number of cylinders; (2) number of gears (within number of cylinders); and (3) number of carburetors (within number of gears nested within number of cylinders). That’s a lot of information. Suppose instead that we are only interested in number 3 above, i.e. all combinations of number of cylinders, number of gears, and number of carburetors.
In that case, we can specify ptable=TRUE. To make the
table a little easier to read, we also set the number of digits for the mean and
SD to zero and relabel the variables.
vtree(mt,"cyl gear carb",summary="hp mean (SD) HP %mean% (%SD%)",
cdigits=0,labelvar=c(cyl="# cylinders",gear="# gears",carb="# carburetors"),
ptable=TRUE)
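For comparison, a similar table of means and standard deviations for every observed combination can be sketched with base R’s aggregate function. This is an independent cross-check rather than a description of how vtree computes its pattern table; the name hp_summary is illustrative.

```r
# Mean and SD of horsepower for each observed cyl x gear x carb combination.
hp_summary <- aggregate(
  hp ~ cyl + gear + carb, data = mtcars,
  FUN = function(x) c(mean = mean(x), SD = sd(x))
)

# Sort so that nesting order (cyl, then gear, then carb) is easy to read.
hp_summary <- hp_summary[order(hp_summary$cyl, hp_summary$gear, hp_summary$carb), ]
hp_summary
```

The SD is NA for any combination that contains a single car, since a standard deviation cannot be estimated from one value.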
We might also like to list the names of cars by number of carburetors nested within number of gears:
vtree(mt,"gear carb",summary="name \n%list%%noroot%",splitwidth=50,sameline=TRUE,
labelvar=c(gear="# gears",carb="# carburetors"))
The UCBAdmissions data set consists of aggregate data on
applicants to graduate school at Berkeley for the six largest
departments in 1973 classified by admission and sex. According to the
data set Details, “This data set is frequently used for illustrating
Simpson’s paradox, see Bickel et al. (1975). At issue is whether the
data show evidence of sex bias in admission practices. There were 2691
male applicants, of whom 1198 (44.5%) were admitted, compared with 1835
female applicants of whom 557 (30.4%) were admitted.” Furthermore, “the
apparent association between admission and sex stems from differences in
the tendency of males and females to apply to the individual departments
(females used to apply more to departments with higher rejection
rates).”
First, we’ll convert the crosstab data to a data frame of cases, ucb:
ucb <- crosstabToCases(UCBAdmissions)
Next, let’s look at admission rates by Gender, nested within department:
vtree(ucb,"Dept Gender",summary="Admit=Admitted \n%pct% admitted",sameline=TRUE)
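The contrast behind Simpson’s paradox can also be seen numerically. The base R sketch below computes admission percentages first ignoring department and then within each department; the overall figures should agree with the 44.5% and 30.4% quoted above. Object names are illustrative.

```r
# Overall admission rate by gender, ignoring department.
overall <- apply(UCBAdmissions, c(1, 2), sum)   # Admit x Gender counts
overall_pct <- 100 * prop.table(overall, margin = 2)["Admitted", ]
round(overall_pct, 1)

# Admission rate by gender within each department (Gender x Dept matrix).
by_dept_pct <- 100 * prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(by_dept_pct, 1)
```

Comparing the two results shows how the pooled rates can differ from the pattern seen department by department.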
The ChickWeight data set is from an experiment on the
effect of diet on early growth of chicks. Let’s look at the mean weight
of chicks at birth (0 days of age) and 4 days of age, nested within type
of diet. A simple variable tree can be produced like this:
vtree(ChickWeight,"Diet Time",
keep=list(Time=c("0","4")),summary="weight \nmean weight %mean%g")
To make the display a little easier to read, relabel the nodes and
the Time variable:
vtree(ChickWeight,"Diet Time",keep=list(Time=c("0","4")),
labelnode=list(
Diet=c("Diet 1"="1","Diet 2"="2","Diet 3"="3","Diet 4"="4"),
Time=c("0 days"="0","4 days"="4")),
labelvar=c(Time="Days since birth"),summary="weight \nmean weight %mean%g")
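The mean weights shown in the tree can be checked against a plain base R summary. This sketch restricts the data to days 0 and 4 and averages weight within each diet and time; the names cw and mean_weight are illustrative.

```r
cw <- as.data.frame(ChickWeight)     # drop the grouped-data classes
cw <- cw[cw$Time %in% c(0, 4), ]     # keep only day 0 and day 4

# Mean weight (in grams) for each Diet x Time combination
mean_weight <- aggregate(weight ~ Diet + Time, data = cw, FUN = mean)
mean_weight
```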
The InsectSprays data set contains counts of insects in
agricultural experimental units treated with different insecticides.
Let’s look at those counts by insecticide.
vtree(InsectSprays,"spray",splitwidth=80,sameline=TRUE,
summary="count \ncounts: %list%%noroot%",cdigits=0)
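The same per-insecticide listing can be sketched in base R by splitting the counts by spray. This is only a cross-check of the values that appear in each node; the name counts_by_spray is illustrative.

```r
# One vector of counts per insecticide
counts_by_spray <- split(InsectSprays$count, InsectSprays$spray)

# Collapse each vector into a single space-separated string for display
vapply(counts_by_spray, paste, character(1), collapse = " ")
```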
The ToothGrowth data set contains the length of
odontoblasts (cells responsible for tooth growth) in 60 guinea pigs.
Each animal received one of three dose levels of vitamin C (0.5, 1, and
2 mg/day) by one of two delivery methods, orange juice or ascorbic acid
(a form of vitamin C and coded as VC).
Let’s examine the percentage with length > 20 by dose nested within delivery method:
vtree(ToothGrowth,"supp dose",summary="len>20 \n%pct% length > 20")
To make the display a little easier to read, relabel the variables and
the nodes of the supp variable:
vtree(ToothGrowth,"supp dose",summary="len>20 \n%pct% length > 20",
labelvar=c("supp"="Supplement type","dose"="Dose (mg/day)"),
labelnode=list(supp=c("Vitamin C"="VC","Orange Juice"="OJ")))
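As a cross-check, the percentage with length > 20 in each innermost subset can be computed in base R, since the mean of a logical vector is the proportion of TRUE values. The name pct_long is illustrative.

```r
# Percentage of animals with len > 20, by delivery method (rows) and dose (columns)
pct_long <- with(ToothGrowth,
                 100 * tapply(len > 20, list(supp = supp, dose = dose), mean))
round(pct_long)
```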