---
title: "Introduction to vtree"
subtitle: "*Exploring subsets of data using variable trees*"
author: '`r paste0("Nick Barrowman, ",strftime(Sys.time(),format="%d-%b-%Y"),", Version ",packageVersion("vtree"))`'
output:
  rmarkdown::html_vignette:
    css: vtreeVignette.css
    toc: true
    toc_depth: '2'
vignette: >
  %\VignetteIndexEntry{Introduction to vtree}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE}
suppressMessages(library(ggplot2))
library(vtree)
#source("../source.R")
options(width=90)
options(rmarkdown.html_vignette.check_title = FALSE)
```
```{r, echo=FALSE}
spaces <- function (n) {
paste(rep(" ", n), collapse = "")
}
```
# Introduction
vtree is a flexible tool for calculating and displaying *variable trees* —
diagrams that show information about nested subsets of a data frame.
vtree can be used to:
1. explore a data set interactively
2. produce customized figures for reports and publications.
Note, however, that vtree is *not* designed to build or display decision trees.
Given a data frame and simple specifications,
vtree will produce a variable tree and automatically label it with counts,
percentages, and other summaries.
The sections below introduce variable trees and provide an
overview of the features of vtree.
Or you can [skip ahead and start using the `vtree` function](#vtreeFunction).
## Two examples
*Subsets* play an important role in almost any data analysis.
Imagine a data set of countries
that includes variables named `population`, `continent`, and `landlocked`.
Suppose we wish to examine subsets of the data set based on the `continent` variable.
Within each of these subsets,
we could examine *nested* subsets based on the `population` variable,
for example, countries with populations under 30 million and over 30 million.
We might continue to a third nesting based on the `landlocked` variable.
Nested subsets are at the heart of questions like the following:
*Among African countries with a population over 30 million, what percentage are landlocked?* The variable tree below provides the answer:
```{r, echo=FALSE}
df <- build.data.frame(
c("continent","population","landlocked"),
list("Africa","Over 30 million","landlocked",2),
list("Africa","Over 30 million","not landlocked",12),
list("Africa","Under 30 million","landlocked",14),
list("Africa","Under 30 million","not landlocked",26))
```
`r spaces(30)`
`r vtree(df,"continent population landlocked",showroot=FALSE,pxwidth=800,
imageheight="3in")`
By default, vtree uses the colorful display above (to help distinguish variables and values),
but if you prefer a more sedate version,
you can specify a single fill color (or simply white):
`r spaces(30)`
`r vtree(df,"continent population landlocked",showroot=FALSE,fillcolor="aliceblue",
pxwidth=800,imageheight="3in")`
Even in simple situations like this,
it can be a chore to keep track of nested subsets and
calculate the corresponding percentages.
The denominator used to calculate percentages may also depend
on whether the variables have any missing values, as discussed later.
Finally, as the number of variables increases,
the magnitude of the task balloons,
because the number of nested subsets grows exponentially.
vtree provides a general solution to the problem of calculating nested subsets
and displaying information about them.
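As a rough illustration of the bookkeeping involved, the following base R sketch answers the question posed earlier by hand, using the counts from the countries example. The object names `countries`, `big`, and `pct_landlocked` are made up for this illustration.

```{r, eval=FALSE}
# Counts for African countries, taken from the example above
countries <- data.frame(
  population = c("Over 30 million", "Over 30 million",
                 "Under 30 million", "Under 30 million"),
  landlocked = c("landlocked", "not landlocked",
                 "landlocked", "not landlocked"),
  n = c(2, 12, 14, 26))

# Nested subset: African countries with a population over 30 million
big <- countries[countries$population == "Over 30 million", ]

# Percentage of that subset that is landlocked: 2 / (2 + 12)
pct_landlocked <- 100 * big$n[big$landlocked == "landlocked"] / sum(big$n)
round(pct_landlocked)  # 14
```

Each additional variable adds another layer of subsetting like this one, which is the work that vtree automates.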
Nested subsets arise in all kinds of situations.
Consider, for example, flow diagrams for clinical studies,
such as the following
[CONSORT](http://www.consort-statement.org)-style diagram,
produced by vtree.
`r spaces(50)`
`r
vtree(FakeRCT,"eligible randomized group followup",plain=TRUE,
keep=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
pxwidth=500,imageheight="4.5in")
`
Both the structure of this variable tree
and the numbers shown were automatically determined.
When manual calculation and transcription are instead used to
populate diagrams like this, mistakes are likely.
And although the errors that make it into published articles are often minor,
they can sometimes be disastrous.
One motivation for developing vtree was to make flow diagrams
[reproducible](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3383002/).
The ability to reproducibly generate variable trees also means that
when a data set is updated, a revised tree can be automatically produced.
At the end of this vignette, there is a collection of
[examples of variable trees using R datasets](#RdatasetExamples) that you can try.
## Basic features of a variable tree
The examples that follow use a data set called `FakeData` which represents
`r nrow(FakeData)` fictitious patients.
We'll start by using just two variables,
although variable trees are especially useful with three or more variables.
The variable tree below depicts subsets defined by `Sex` (M or F)
nested within subsets defined by disease `Severity`
(Mild, Moderate, Severe, or NA).
`r vtree(FakeData,"Severity Sex",showlegend=FALSE,horiz=FALSE,
pxwidth=1000,imageheight="2.2in")`
A variable tree consists of *nodes* connected by arrows.
At the top of the diagram above, the *root* node of the tree contains all 46 patients.
The rest of the nodes are arranged in successive layers,
where each layer corresponds to a specific variable.
Note that this highlights one difference between variable trees
and some other kinds of trees:
each layer of a variable tree corresponds to just one variable.
(In a *decision tree*, by contrast, different branches can have different
sequences of variable splits.)
Continuing with the variable tree above,
the nodes immediately below the root represent values of `Severity` and
are referred to as the *children* of the root node.
In this case, `Severity` was missing (NA) for 6 patients,
and there is a node for these patients.
Inside each of the nodes, the number of patients is displayed
and, except in the missing-value node, the corresponding percentage is also shown.
Note that, by default, `vtree` displays "valid" percentages,
i.e. the denominator used to calculate the percentage is the
total number of *non-missing* values,
`r sum(!is.na(FakeData$Severity))`.
The final layer of the tree corresponds to values of `Sex`.
These nodes represent males and females *within subsets* defined by each value of `Severity`.
In each of these nodes the percentage is calculated in terms of
the number of patients in its parent node.
Like any node, a missing-value node can have children.
For example, of the 6 patients for whom `Severity` is missing, 3 are female and 3 are male.
By default, `vtree` displays the full missing-value structure of the specified variables.
Also by default, `vtree` automatically assigns a color palette to the nodes of each variable.
`Severity` has been assigned red hues (lightest for Mild, darkest for Severe),
while `Sex` has been assigned blue hues (light blue for females, dark blue for males).
The node representing missing values of `Severity` is colored white to draw attention to it.
## Variable trees compared to contingency tables
A tree with two variables is similar to a two-way contingency table.
In the example above, `Sex` is shown within levels of `Severity`.
This corresponds to the following contingency table,
where the percentages within each column add to 100%.
These are called *column percentages*.
|       | Mild     | Moderate | Severe  | NA      |
|-------|----------|----------|---------|---------|
| **F** | 11 (58%) | 11 (69%) | 2 (40%) | 3 (50%) |
| **M** | 8 (42%)  | 5 (31%)  | 3 (60%) | 3 (50%) |
Likewise, a tree with `Severity` shown within levels of `Sex` corresponds to
a contingency table with *row percentages*.
While the contingency table above is more compact than the corresponding variable tree,
some people find the variable tree more intuitive.
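For readers who want to check the correspondence, here is a small base R sketch that recomputes column and row percentages from the counts in the table above. The object names are illustrative only.

```{r, eval=FALSE}
# Counts copied from the contingency table above (rows: Sex, columns: Severity)
counts <- matrix(c(11, 8, 11, 5, 2, 3, 3, 3), nrow = 2,
                 dimnames = list(Sex = c("F", "M"),
                                 Severity = c("Mild", "Moderate", "Severe", "NA")))

# Column percentages: Sex within each level of Severity
col_pct <- round(100 * prop.table(counts, margin = 2))

# Row percentages: Severity within each level of Sex
row_pct <- 100 * prop.table(counts, margin = 1)

col_pct["F", "Mild"]  # 58
```

The column percentages match the `Sex` nodes of the tree, while the row percentages are what a tree specified as `"Sex Severity"` would be based on.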
When three or more variables are of interest,
multi-way contingency tables are often used.
These are typically displayed using several two-way tables,
but as the number of variables increases,
these become increasingly difficult to interpret.
Variable trees, on the other hand,
have the same simple structure regardless of the number of variables.
Note that contingency tables are not *always* more compact than variable trees.
When most cells of a large contingency table are empty
(in which case the table is said to be *sparse*),
the corresponding variable tree may be more compact since empty nodes are not shown.
## Features of vtree
vtree is designed to be quick and easy to use,
so that it is convenient for data exploration,
but also flexible enough that it can be used to prepare publication-ready figures.
To generate a basic variable tree,
it is only necessary to specify a data frame and some variable names.
However, extra features extend this basic functionality to provide:
* control over [labeling](#labeling), [colors](#colors), [legends](#legends), [line wrapping](#wrapping),
and [text formatting](#textFormatting);
* flexible [pruning](#pruning) to remove parts of the tree that are of lesser interest,
which is particularly useful when a tree gets large;
* [display of information about other variables in each node](#summary),
including a variety of summary statistics;
* special displays for [indicator variables](#Venn),
[patterns](#patterns) of values, and
[missingness](#missingValues);
* support for [checkbox variables](#REDCapCheckboxes) from [REDCap](https://www.project-redcap.org) databases;
* features for [dichotomizing variables](#dichotomizing) and [checking for outliers](#detectingOutliers);
* automatic generation of PNG image files and [embedding in R Markdown](#embeddingInKnitrRmarkdown) documents; and
* interactive panning and zooming using the `svtree` function to launch a [Shiny app](#svtree).
In many cases, you may wish to generate several different variable trees to
investigate a collection of variables in a data frame.
For example, it is often useful to change the order of variables,
prune parts of the tree, etc.
## Technical overview
vtree is built on open-source software:
in particular Richard Iannone's
[DiagrammeR](http://rich-iannone.github.io/DiagrammeR/) package,
which provides an interface to the
[Graphviz](https://www.graphviz.org/) software using the
[htmlwidgets](https://www.htmlwidgets.org/) framework.
Additionally,
vtree makes use of the
[Shiny](https://www.rstudio.com/products/shiny/) package,
and the
[svg-pan-zoom](https://github.com/bumbu/svg-pan-zoom) JavaScript library.
A formal description of variable trees follows.
The root node of the variable tree represents the entire data frame.
The root node has a child for each observed value of the first variable that was specified.
Each of these child nodes represents a subset of the data frame with a specific value of the variable, and is labeled with the number of observations in the subset and the corresponding percentage of the number of observations in the entire data frame.
The *n*^th^ layer below the root of the variable tree corresponds to the *n*^th^ variable specified.
Apart from the root node,
each node in the variable tree represents the subset of its parent defined by a specific observed value of the variable at that layer of the tree,
and is labeled with the number of observations in that subset and the corresponding percentage of the number of observations in its parent node.
Note that a node always represents at least one observation.
And unlike a contingency table,
which can have empty cells,
a variable tree has no empty nodes.
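The formal description can be turned into a short recursive sketch in base R. This is not how vtree is implemented; it is a minimal illustration, using a hypothetical data frame `toy` and a hypothetical function `tree_nodes`, and it ignores missing values for simplicity.

```{r, eval=FALSE}
# Hypothetical toy data with two discrete variables
toy <- data.frame(v1 = c("a", "a", "a", "b"),
                  v2 = c("x", "y", "y", "x"))

# Return one text line per node: the path to the node, its count, and
# (below the root) its percentage of the parent node.
tree_nodes <- function(data, vars, path = "root", parent_n = NULL) {
  n <- nrow(data)
  if (is.null(parent_n)) {
    lines <- sprintf("%s: %d", path, n)
  } else {
    lines <- sprintf("%s: %d (%.0f%%)", path, n, 100 * n / parent_n)
  }
  if (length(vars) > 0) {
    column <- as.character(data[[vars[1]]])
    # Only observed values get a child, so no node is ever empty
    for (value in sort(unique(column))) {
      child <- data[column %in% value, , drop = FALSE]
      child_path <- paste0(path, " > ", vars[1], "=", value)
      lines <- c(lines, tree_nodes(child, vars[-1], child_path, n))
    }
  }
  lines
}

tree_nodes(toy, c("v1", "v2"))
```

In this toy example there is no node for `v1=b > v2=y`, because that combination is never observed, whereas a contingency table would show it as an empty cell.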
# The `vtree` function {#vtreeFunction}
Consider a data frame named `df`,
which includes discrete variables `v1` and `v2`.
Suppose we wish to produce a variable tree showing
subsets based on values of `v1` as well as
subsets of those subsets based on values of `v2`.
The variable tree can be displayed using the following command:
```{r, eval=FALSE}
vtree(df,"v1 v2")
```
Alternatively, you may wish to assign the output of `vtree` to an object:
```{r, eval=FALSE}
simple_tree <- vtree(df,"v1 v2")
```
Then it can be displayed later using:
```{r, eval=FALSE}
simple_tree
```
Suppose `vtree` is called without a list of variables:
```{r, eval=FALSE}
vtree(df)
```
In this case, only the root node is shown, representing the entire data frame.
Although a tree with just one node might not seem very useful,
we'll see later that
[summary information](#summary) about the whole data frame
can be displayed there.
The `vtree` function has numerous optional parameters.
For example, by default `vtree` produces a horizontal tree
(that is, a tree that grows from left to right).
To generate a vertical tree, specify `horiz=FALSE`.
## Mini tutorial
*This section introduces some basic features of the `vtree` function.*
To display a variable tree for a single variable, say `Severity`, use the following command:
```{r,eval=FALSE, results="asis"}
vtree(FakeData,"Severity")
```
`r spaces(45)`
`r vtree(FakeData,"Severity",width=250,height=250,pxwidth=300,imageheight="2.5in")`
By default, next to each layer of the tree, a variable name is shown.
In the example above, "Severity" is shown below the corresponding nodes.
(For a vertical tree, "Severity" would be shown to the left of the nodes.)
If you specify `showvarnames=FALSE`, no variable names will be shown.
`vtree` can also be used with dplyr.
For example, to rename the `Severity` variable as `HowBad`,
we can pipe the data frame into the `rename` function in dplyr,
and then pipe the result into `vtree`:
```{r,eval=FALSE}
library(dplyr)
FakeData %>% rename("HowBad"=Severity) %>% vtree("HowBad")
```
Note that `vtree` also has a [built-in way of renaming variables](#labeling),
which is an alternative to using dplyr.
Large variable trees can be difficult to display in a readable way.
One approach that helps is to display the count and percentage on the same line
in each node.
For example, in the tree above,
the label for the Moderate node is on two lines, like this:
`r spaces(65)`**Moderate** \
`r spaces(65)`**16 (40%)**
Specifying `sameline=TRUE` results in single-line labels, like this:
`r spaces(65)`**Moderate, 16 (40%)**
### Percentages
By default, vtree shows "valid percentages",
i.e. percentages calculated using
the total number of *non-missing* values as denominator.
In the case of `Severity`, there are `r sum(is.na(FakeData$Severity))` missing values,
so the denominator is `r nrow(FakeData)` - `r sum(is.na(FakeData$Severity))`,
or `r nrow(FakeData) - sum(is.na(FakeData$Severity))`.
There are `r sum(FakeData$Severity %in% "Mild")` Mild cases,
and `r sum(FakeData$Severity %in% "Mild")`/`r nrow(FakeData) - sum(is.na(FakeData$Severity))` =
`r sum(FakeData$Severity %in% "Mild")/(nrow(FakeData) - sum(is.na(FakeData$Severity)))` so the percentage shown is 48%.
No percentage is shown in the NA node since missing values are not included in the denominator.
If you prefer the denominator to represent the complete set of observations
(*including* any missing values),
specify `vp=FALSE`.
A percentage will be shown in each of the nodes,
including any NA nodes.
If you don't wish to see percentages, specify `showpct=FALSE`,
and if you don't wish to see counts, specify `showcount=FALSE`.
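The two denominators can be compared with a few lines of base R. This sketch rebuilds the `Severity` values from the counts shown in the tree above rather than using `FakeData` itself, and the object names are illustrative.

```{r, eval=FALSE}
# Severity values reconstructed from the counts shown in the tree above
severity <- c(rep("Mild", 19), rep("Moderate", 16), rep("Severe", 5), rep(NA, 6))

counts <- table(severity)               # NA is excluded by default
n_valid <- sum(!is.na(severity))        # 40 non-missing values
n_all <- length(severity)               # 46 observations in total

valid_pct <- 100 * counts / n_valid     # denominator used when vp=TRUE
all_pct <- 100 * counts / n_all         # denominator used when vp=FALSE

valid_pct[["Mild"]]  # 47.5, displayed as 48% in the tree
```

With the larger denominator, every percentage shrinks a little, and the NA node gets a percentage of its own (6 out of 46).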
### Displaying a legend and hiding node labels {#legends}
To display a legend, specify `showlegend=TRUE`.
Next to each variable name are "legend nodes" representing the values of that variable
and colored accordingly.
For each variable, the legend nodes are grouped within a light gray box.
Each legend node also contains a count (with a percentage)
for the value represented by that node in the whole data frame.
This is known as the *marginal* count (and percentage).
When the legend is shown, labels in the nodes of the variable tree are redundant,
since the colors of the nodes identify the values of the variables
(although the labels may aid readability).
If you prefer, you can hide the node labels,
by specifying `shownodelabels=FALSE`:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex",showlegend=TRUE,shownodelabels=FALSE)
```
`r spaces(45)`
`r vtree(FakeData,"Severity Sex",showlegend=TRUE,shownodelabels=FALSE,
pxwidth=800,imageheight="4in")`
Since `Severity` is the first variable in the tree, it is not nested within another variable.
Therefore the marginal counts and percentages for `Severity`
shown in the legend nodes are identical to those displayed in the nodes of the variable tree.
In contrast, for `Sex`, the marginal counts and percentages are different
from what is shown in the nodes of the variable tree for `Sex`, since those nodes are nested within levels of `Severity`.
### Text wrapping {#wrapping}
By default,
`vtree` wraps text onto the next line whenever a space occurs after at least 20 characters.
This can be adjusted, for example, to 15 characters,
by specifying `splitwidth=15`.
To disable line splitting, specify `splitwidth=Inf`
(`Inf` means infinity, i.e. "do not split").
The `vsplitwidth` parameter is similarly used to control text wrapping in variable names.
This is helpful with long variable names,
which may be truncated unless wrapping is used.
In this case text wrapping occurs not only at spaces,
but also at any of the following characters:
```{eval=FALSE}
. - + _ = / (
```
For example, if `vsplitwidth=5`, a variable name like `First_Emergency_Visit`
would be split into
`r spaces(65)``First_`\
`r spaces(65)``Emergency_`\
`r spaces(65)``Visit`
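To make the rule concrete, here is a hypothetical base R function, `wrap_name`, that reproduces the split shown above. It is only a sketch of the idea, not vtree's own code: it assumes a break is made immediately after a break character once the current piece has reached the requested width.

```{r, eval=FALSE}
# Split a name after a break character once the current piece is long enough.
# Assumption: the break character stays at the end of the piece it closes.
wrap_name <- function(name, width,
                      breaks = c(" ", ".", "-", "+", "_", "=", "/", "(")) {
  chars <- strsplit(name, "")[[1]]
  pieces <- character(0)
  current <- ""
  for (ch in chars) {
    current <- paste0(current, ch)
    if (ch %in% breaks && nchar(current) >= width) {
      pieces <- c(pieces, current)
      current <- ""
    }
  }
  # Keep whatever is left after the last break
  if (nzchar(current)) pieces <- c(pieces, current)
  pieces
}

wrap_name("First_Emergency_Visit", 5)  # "First_" "Emergency_" "Visit"
```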
*This concludes the mini-tutorial. vtree has many more features, described in the following sections.*
## Pruning {#pruning}
*This section shows how to remove branches from a variable tree.*
When a variable tree gets too big,
or you are only interested in certain parts of the tree,
it may be useful to remove some nodes along with their descendants.
This is known as *pruning*.
For convenience, there are several different ways to prune a tree,
described below.
### The `prune` parameter
Here's a variable tree we've already seen in various forms:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex")
```
`r spaces(40)`
`r vtree(FakeData,"Severity Sex", pxwidth=800,imageheight="4.2in")`
Suppose you don't want the tree to show branches for
individuals whose disease is Mild or Moderate.
Specifying `prune=list(Severity=c("Mild","Moderate"))`
removes those nodes, and all of their descendants:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex",prune=list(Severity=c("Mild","Moderate")))
```
`r spaces(40)`
`r vtree(FakeData,"Severity Sex",prune=list(Severity=c("Mild","Moderate")),
pxwidth=500,imageheight="2.5in")`
In general,
the argument of the `prune` parameter is a *list*
with an element named for each variable you wish to prune.
In the example above, the list has a single element, named `Severity`.
In turn, that element is a vector `c("Mild","Moderate")`
indicating the values of `Severity` to prune.
**Caution**: Once a variable tree has been pruned,
it is no longer complete.
This can sometimes be confusing since not all observations
are represented at certain layers of the tree.
For example in the tree above, only 11 observations are shown in the `Severity` nodes
and their children.
### The `keep` parameter
Sometimes it is more convenient to specify which nodes should be *retained*
rather than which ones should be discarded.
The `keep` parameter is used for this purpose,
and can thus be considered the complement of the `prune` parameter.
For example, to retain the Moderate `Severity` node:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex",keep=list(Severity="Moderate"))
```
`r spaces(40)`
`r vtree(FakeData,"Severity Sex",keep=list(Severity="Moderate"),
pxwidth=500,imageheight="1.7in")`
**Note**: In addition to the Moderate node,
the missing value node has also been retained.
In general, whenever valid percentages are used (which is the default),
missing value nodes are retained when `keep` is used.
This is because valid percentages are difficult to interpret without
knowing the denominator, which requires knowing the number of missing values.
On the other hand, here's what happens when `vp=FALSE`:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex",keep=list(Severity="Moderate"),vp=FALSE)
```
`r spaces(40)`
`r vtree(FakeData,"Severity Sex",keep=list(Severity="Moderate"),vp=FALSE,
pxwidth=400,imageheight="1.5in")`
### The `prunebelow` parameter
As seen above, a disadvantage of pruning is that in the resulting tree,
the counts shown in child nodes may not add up to the counts shown in their parent node.
An alternative is to prune *below* the specified nodes
(i.e. to prune their descendants), so that the counts always add up.
In the present example, this means that the Mild and Moderate nodes will be shown,
but not their descendants.
The `prunebelow` parameter is used to do this:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex",prunebelow=list(Severity=c("Mild","Moderate")))
```
`r spaces(40)`
`r vtree(FakeData,"Severity Sex",prunebelow=list(Severity=c("Mild","Moderate")),
pxwidth=400,imageheight="3in")`
### The `follow` parameter
The complement of `prunebelow` is `follow`.
Instead of specifying which nodes should be pruned below,
this allows you to specify which nodes should be "followed" (that is, *not* pruned below).
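As a sketch, the call below follows only the Severe node, so that Mild and Moderate are shown without their `Sex` children. It assumes that `follow` takes the same kind of list as `prunebelow`; the `requireNamespace` guard is only there so that the snippet can be run on a machine where vtree might not be installed.

```{r, eval=FALSE}
# Follow only the Severe node. The list form is assumed to match prunebelow:
# one element per variable, naming the values whose descendants are kept.
follow_spec <- list(Severity = "Severe")

if (requireNamespace("vtree", quietly = TRUE)) {
  library(vtree)
  vtree(FakeData, "Severity Sex", follow = follow_spec)
}
```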
### Targeted pruning
This section describes a more flexible way to prune variable trees.
To explain this,
first note that the `prune`, `keep`, `prunebelow`, and `follow` parameters
specify pruning across all branches of the tree.
For example, if you were pruning `Severity` nested within levels of `Sex`,
the pruning would take place in both the M and F branches.
Sometimes, however, it is preferable to perform pruning only in specified
branches of the tree.
This is called *targeted* pruning, and the parameters
`tprune`, `tkeep`, `tprunebelow`, and `tfollow` provide this functionality.
However, their arguments have a more complex form than those of the corresponding
`prune`, `keep`, `prunebelow`, and `follow` parameters
because they specify the *full path* from the
root of the tree all the way to the nodes to be pruned.
For example, to remove every `Severity` node except Moderate,
but only for males, the following command can be used:
```{r,eval=FALSE}
vtree(FakeData,"Sex Severity",tkeep=list(list(Sex="M",Severity="Moderate")))
```
`r spaces(40)`
`r vtree(FakeData,"Sex Severity",tkeep=list(list(Sex="M",Severity="Moderate")),
pxwidth=400,imageheight="3in")`
Note that the argument of `tkeep` is a list of lists,
one for each path through the tree.
To keep both Moderate and Severe, specify
`tkeep=list(list(Sex="M",Severity=c("Moderate","Severe")))`.
Now suppose that, in addition to this,
within females, you want to keep just Mild.
Use the following specification to do this:
```{r, eval=FALSE}
tkeep=list(list(Sex="M",Severity=c("Moderate","Severe")),list(Sex="F",Severity="Mild"))
```
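Put together as a complete call, this might look like the following sketch. The name `tkeep_spec` is just a convenience for this example, and the `requireNamespace` guard is only there so that the snippet can be run where vtree might not be installed.

```{r, eval=FALSE}
# One inner list per path: Moderate and Severe for males, Mild for females
tkeep_spec <- list(
  list(Sex = "M", Severity = c("Moderate", "Severe")),
  list(Sex = "F", Severity = "Mild"))

if (requireNamespace("vtree", quietly = TRUE)) {
  library(vtree)
  vtree(FakeData, "Sex Severity", tkeep = tkeep_spec)
}
```

Building the specification as a separate object first can make long lists of paths easier to read and to reuse across several calls.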
### The `prunesmaller` parameter
As a variable tree grows,
it can become difficult to see the forest for the tree.
For example, the following tree is hard to read,
even when `sameline=TRUE` has been specified:
```{r, eval=FALSE}
vtree(FakeData,"Severity Sex Age Category",sameline=TRUE)
```
`r spaces(50)`
`r vtree(FakeData,"Severity Sex Age Category",sameline=TRUE,imageheight="5.5in",pxwidth=500)`
One solution is to prune nodes that contain small numbers of observations.
For example, if you want to see only nodes with at least 3 observations,
you can specify `prunesmaller=3`, as in this example:
```{r, eval=FALSE}
vtree(FakeData,"Severity Sex Age Category",sameline=TRUE,prunesmaller=3)
```
`r spaces(35)`
`r vtree(FakeData,"Severity Sex Age Category",sameline=TRUE,prunesmaller=3,
imageheight="4in",pxwidth=800)`
As with the `keep` parameter,
when valid percentages are used (`vp=TRUE`, which is the default),
nodes that represent missing values will not be pruned.
(As noted previously,
this is because percentages are confusing when missing values are not shown.)
On the other hand,
when `vp=FALSE`, missing nodes will be pruned (if they are small enough).
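The size rule itself is simple, as the following base R sketch shows. It uses the `Sex` within `Severity` counts from the contingency table earlier in this vignette (leaving out the missing-value nodes); the names `node_counts`, `threshold`, `kept`, and `pruned` are illustrative.

```{r, eval=FALSE}
# Node counts for Sex within Severity (missing-value nodes omitted)
node_counts <- c("Mild > F" = 11, "Mild > M" = 8,
                 "Moderate > F" = 11, "Moderate > M" = 5,
                 "Severe > F" = 2, "Severe > M" = 3)

# Keep nodes with at least `threshold` observations, as with prunesmaller=3
threshold <- 3
kept <- node_counts[node_counts >= threshold]
pruned <- names(node_counts)[node_counts < threshold]

pruned  # "Severe > F"
```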
## Labels for variables and nodes {#labeling}
*This section shows how to relabel variables and nodes.*
By default, `vtree` labels variables and nodes exactly as they appear in the data frame.
But it is often useful to change these labels.
### Changing variable labels with the `labelvar` parameter
Suppose `Severity` in fact represents initial severity.
To label it that way in the variable tree,
specify `labelvar=c(Severity="Initial severity")`:
```{r,eval=FALSE}
vtree(FakeData,"Severity Sex",horiz=FALSE,labelvar=c(Severity="Initial severity"))
```
`r spaces(30)`
`r vtree(FakeData,"Severity Sex",horiz=FALSE,
labelvar=c(Severity="Initial severity"),
pxwidth=1000,imageheight="2in")`
### Changing node labels with the `labelnode` parameter
By default, `vtree` labels nodes (except for the root node)
using the values of the variable in question.
Sometimes it is convenient to instead specify custom labels for nodes.
The `labelnode` argument can be used to relabel the values.
For example, you might want to use "Male" and "Female" instead of "M" and "F".
```{r,eval=FALSE}
vtree(FakeData,"Group Sex",horiz=FALSE,labelnode=list(Sex=c(Male="M",Female="F")))
```
`r spaces(30)`
`r vtree(FakeData,"Group Sex",horiz=FALSE,labelnode=list(Sex=c(Male="M",Female="F")),
pxwidth=600,imageheight="1.8in")`
The argument of the `labelnode` parameter is specified as
a list whose element names are variable names.
To substitute a new label for an old label,
the syntax is: `"New label"="Old label"`.
Thus the full specification, as used above, is: `labelnode=list(Sex=c(Male="M",Female="F"))`.
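The direction of the mapping (new label as the name, old label as the value) is easy to get backwards. The base R sketch below, which uses a made-up vector `sex`, shows how such a named vector can be read as a lookup from old values to new labels. It illustrates the syntax only and is not vtree's internal code.

```{r, eval=FALSE}
# The labelnode element for Sex: names are new labels, values are old labels
relabel <- c(Male = "M", Female = "F")

sex <- c("M", "F", "F", "M")   # hypothetical values of Sex

# Look up each old value and return the name (new label) attached to it
new_labels <- names(relabel)[match(sex, relabel)]
new_labels  # "Male" "Female" "Female" "Male"
```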
### Targeted node labels using the `tlabelnode` parameter
Suppose in the example above that `Group` A represents children and
`Group` B represents adults.
In `Group` A, we would like to use the labels "girl" and "boy",
while in `Group` B we would like to use "woman" and "man".
The `labelnode` parameter cannot handle this situation because the values of
`Sex` need to be labeled differently in different branches of the tree.
The `tlabelnode` parameter allows "targeted" node labels.
```{r,eval=FALSE}
vtree(FakeData,"Group Sex",horiz=FALSE,
labelnode=list(Group=c(Child="A",Adult="B")),
tlabelnode=list(
c(Group="A",Sex="F",label="girl"),
c(Group="A",Sex="M",label="boy"),
c(Group="B",Sex="F",label="woman"),
c(Group="B",Sex="M",label="man")))
```
`r spaces(40)`
`r vtree(FakeData,"Group Sex",horiz=FALSE,
labelnode=list(Group=c(Child="A",Adult="B")),
tlabelnode=list(
c(Group="A",Sex="F",label="girl"),
c(Group="A",Sex="M",label="boy"),
c(Group="B",Sex="F",label="woman"),
c(Group="B",Sex="M",label="man")),
pxwidth=600,imageheight="2.2in")`
## Text and text formatting {#textFormatting}
*This section shows how to add bold, italics, and other text formatting.*
Graphviz,
the open source graph visualization software that vtree is built on,
supports a variety of text formatting (including bold, colors, etc.).
This is used in vtree to control formatting of text such as node labels.
### Markdown-style codes for text formatting
By default, the `vtree` package uses markdown-style codes for text formatting.
In the tables below, `...` represents arbitrary text.
------------- -----------------------------------------------------------------
`\n`          insert a line break
`\n*l`        make the preceding line left-justified and insert a line break
`*...*`       display text in italics
`**...**`     display text in bold
`^...^`       display text in superscript (using 10 point font)
`~...~`       display text in subscript (using 10 point font)
`%%red ...%%` display text in red (or whichever color is specified)
------------- -----------------------------------------------------------------
### HTML-like codes for text formatting
As an alternative,
if you specify `HTMLtext=TRUE` you can use "HTML-like labels"
(implemented in Graphviz), including:
---------------------------------------- -----------------------------------------------------------------
`<BR/>`                                  insert a line break
`<BR ALIGN='LEFT'/>`                     make the preceding line left-justified and insert a line break
`<I> ... </I>`                           display text in italics
`<B> ... </B>`                           display text in bold
`<SUP> ... </SUP>`                       display text in superscript (using 10 point font)
`<SUB> ... </SUB>`                       display text in subscript (using 10 point font)
`<FONT POINT-SIZE='10'> ... </FONT>`     set font to 10 point
`<FONT FACE='Times-Roman'> ... </FONT>`  set font to Times-Roman
`<FONT COLOR='red'> ... </FONT>`         set font to red
---------------------------------------- -----------------------------------------------------------------
See the Graphviz documentation on HTML-like labels for more details.
### Adding text to nodes using the `text` parameter
Suppose you wish to add the italicized text "*Excluding new diagnoses*"
to any Mild nodes in the tree.
The parameter `text` is used to add text to nodes.
It is specified as a list with an element named for each variable.
In the example below the list has one element, named `Severity`.
That element in turn is a vector `c(Mild="\n*Excluding\nnew diagnoses*")`
indicating that the Mild node should include additional text using Markdown-style formatting
(i.e. `\n` indicates a linebreak and the asterisks around the text indicate that it should be displayed in italics):
```{r,eval=FALSE}
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
text=list(Severity=c(Mild="\n*Excluding\nnew diagnoses*")))
```
`r spaces(15)`
`r vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
text=list(Severity=c(Mild="\n*Excluding\nnew diagnoses*")),
pxwidth=1000,imageheight="2.5in")`
### Targeted text using the `ttext` parameter
In the example above,
suppose that new diagnoses are only excluded from Mild cases in `Group` B.
But the `text` parameter adds text to *all* Mild nodes.
Thus, in situations like this, the `text` parameter is not sufficient.
Instead, you can use the `ttext` parameter to target
exactly which nodes should have the specified text.
The `ttext` parameter requires that you specify the full path from the root of the tree to the node in question,
along with the text in question.
The `ttext` parameter is specified as a list so that multiple targeted text strings can be specified at once.
For example:
```{r,eval=FALSE}
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
ttext=list(
c(Group="B",Severity="Mild",text="\n*Excluding\nnew diagnoses*"),
c(Group="A",text="\nSweden"),
c(Group="B",text="\nNorway")))
```
`r spaces(10)`
`r vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
ttext=list(c(Group="B",Severity="Mild",text="\n*Excluding\nnew diagnoses*"),
c(Group="A",text="\nSweden"),c(Group="B",text="\nNorway")),
pxwidth=1000,imageheight="2.5in")`
## Specification of variables {#VariableSpecification}
*This section shows how to control how variables appear in a variable tree.*
Sometimes it is desirable to modify a variable for use in a variable tree.
For example,
suppose you wish to determine how many values of `Score` are missing.
This is easy to do with dplyr:
```{r,eval=FALSE}
library(dplyr)
FakeData %>% mutate(missingScore=is.na(Score)) %>% vtree("missingScore")
```
But vtree also offers built-in tools for variable specification.
Although limited, they can be very convenient.
### prefix `is.na:`
If an individual variable name is preceded by `is.na:`,
that variable will be replaced by a missing value indicator in the variable tree.
(This differs from the [`check.is.na` parameter](#missingValues),
which is used to replace *all* of the specified variables with missing value indicators.)
For example:
```{r,eval=FALSE}
vtree(FakeData,"is.na:Score")
```
`r spaces(55)`
`r vtree(FakeData,"is.na:Score",
pxwidth=500,imageheight="1.5in")`
### wildcard `#`
Specifying `Ind#` matches all variable names that start with `Ind`
and end with one or more numeric digits. In `FakeData`, these are `Ind1`, `Ind2`, `Ind3`, and `Ind4`. This wildcard can also be used within a variable name. For example, `visit#duration` would match `visit1duration`, `visit2duration`, etc.
### wildcard `*`
Specifying `Ind*` matches all variable names that start with `Ind`
and end with any other characters (or no other characters).
In `FakeData` this matches `Ind1`, `Ind2`, `Ind3`, and `Ind4` (just like `Ind#` does).
But if `FakeData` contained variables named `Ind` and `Index`,
they would also be matched by `Ind*`.
As with the `#` wildcard, the `*` wildcard can be used within a variable name.
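One way to think about the two wildcards is in terms of regular expressions. The base R sketch below uses approximate regular-expression equivalents, which are an assumption made for illustration rather than a description of how vtree does the matching. `Ind`, `Index`, and the `visit` variables are hypothetical names.

```{r, eval=FALSE}
# Ind1..Ind4 are in FakeData; Ind, Index and the visit variables are hypothetical
var_names <- c("Ind1", "Ind2", "Ind3", "Ind4", "Ind", "Index",
               "visit1duration", "visit2duration")

# Approximate regular-expression equivalents of the wildcard specifications
hash_matches <- grep("^Ind[0-9]+$", var_names, value = TRUE)    # Ind#
star_matches <- grep("^Ind.*$", var_names, value = TRUE)        # Ind*
visit_matches <- grep("^visit[0-9]+duration$", var_names,
                      value = TRUE)                             # visit#duration

hash_matches  # "Ind1" "Ind2" "Ind3" "Ind4"
star_matches  # "Ind1" "Ind2" "Ind3" "Ind4" "Ind" "Index"
```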
### prefix `i:`
"Intersections" between multiple variables can be generated using the prefix `i:`.
For example, `i:Ind*` generates a variable representing the observed
combinations of values of `Ind1`, `Ind2`, `Ind3`, and `Ind4`.
(If at least one of the variables is missing, the combination will be missing.)
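The idea of an intersection variable can be sketched in base R.
The data frame `d` is hypothetical, and the labels produced here are an assumption;
vtree may label the combinations differently.

```{r,eval=FALSE}
# Hypothetical indicator data, for illustration only
d <- data.frame(Ind1 = c(1, 0, NA, 1), Ind2 = c(0, 0, 1, 0))

# One label per row describing the combination of values;
# rows with any missing value get a missing combination
combo <- ifelse(complete.cases(d), paste(d$Ind1, d$Ind2), NA)
combo   # "1 0" "0 0" NA "1 0"

# Frequency of each observed combination
table(combo, useNA = "ifany")
```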
### prefix `r:` (for REDCap)
vtree includes special support for [REDCap](https://www.project-redcap.org/) data sets.
The prefix `r:` is used to indicate REDCap checkbox variables,
and can be combined with other prefixes.
This is described in the section on [REDCap checkboxes](#REDCapCheckboxes)
later in this vignette.
### prefix `any:`
Sometimes a group of variables contains responses to
a list of checkbox options (often with instructions to "check all that apply").
For example, suppose you have a data frame of shops,
including whether they are open on Saturday (`openSaturday`)
or Sunday (`openSunday`).
Suppose no other variables start with `open`.
Then `open*` will match both `openSaturday` and `openSunday`.
In general for a group of checkbox variables,
it is often useful to know if *any* of the options were selected (i.e. checked).
In the case above, we might want to know which shops are open at all on
the weekend (either Saturday or Sunday).
A specification like `any:open*`
is used to generate a variable that is
* `TRUE` if *any* of the matching variables has a "checked" value
* `FALSE` if none of the matching variables have "checked" values.
The parameters `checked` and `unchecked` specify which values
are considered checked or unchecked respectively,
and have the following defaults:
|parameter | default value
|:-------------|:-----------------------------------------------------------
|`checked` | `c("1","TRUE","Yes","yes")`
|`unchecked` | `c("0","FALSE","No","no")`
Values not listed in `checked` or `unchecked` are treated as missing values.
An alternative prefix, `anyx:`,
is used to specify that missing values will be removed when
performing the calculation.
This matches the behavior of the R function `any` when `na.rm=TRUE` is specified.
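The difference between `any:` and `anyx:` can be illustrated in base R using the default
`checked` and `unchecked` values listed above.
This is a sketch of the logic only: the `shops` data frame and the `to_logical` helper are
hypothetical and are not part of vtree.

```{r,eval=FALSE}
checked   <- c("1", "TRUE", "Yes", "yes")
unchecked <- c("0", "FALSE", "No", "no")

# Hypothetical helper: checked -> TRUE, unchecked -> FALSE, anything else -> NA
to_logical <- function(x) {
  x <- as.character(x)
  ifelse(x %in% checked, TRUE, ifelse(x %in% unchecked, FALSE, NA))
}

# Hypothetical data frame of shops
shops <- data.frame(openSaturday = c("Yes", "No", "No", NA),
                    openSunday   = c("No", "No", "Yes", "No"),
                    stringsAsFactors = FALSE)

flags <- sapply(shops, to_logical)   # logical matrix, one column per variable

any_open  <- apply(flags, 1, any)                 # TRUE FALSE TRUE NA
anyx_open <- apply(flags, 1, any, na.rm = TRUE)   # TRUE FALSE TRUE FALSE
```

In the last row, Saturday is missing and Sunday is unchecked,
so the result is missing unless missing values are removed first.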
### prefix `none:`
The `none:` prefix is the logical complement (negation) of the `any:` prefix.
An alternative prefix, `nonex:`,
is used to specify that missing values will be removed when
performing the calculation.
### prefix `all:`
A specification like `all:open*` generates a variable that is `TRUE` if *all* of the matching variables have a "checked" value.
An alternative prefix, `allx:`,
is used to specify that missing values will be removed when
performing the calculation.
This matches the behavior of the R function `all` when `na.rm=TRUE` is specified.
### prefix `notall:`
The `notall:` prefix is the logical complement (negation) of the `all:` prefix.
An alternative prefix, `notallx:`,
is used to specify that missing values will be removed when
performing the calculation.
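How these prefixes relate to one another can be sketched with base R logical functions.
The matrix `flags` is hypothetical; it stands for checkbox values that have already been
converted to `TRUE` (checked), `FALSE` (unchecked), or `NA` (missing).

```{r,eval=FALSE}
# Each row is one observation; each column is one checkbox
flags <- rbind(c(TRUE, TRUE),
               c(TRUE, FALSE),
               c(TRUE, NA))

all_checked  <- apply(flags, 1, all)                 # TRUE FALSE NA
allx_checked <- apply(flags, 1, all, na.rm = TRUE)   # TRUE FALSE TRUE
notall       <- !all_checked                         # FALSE TRUE NA
none         <- !apply(flags, 1, any)                # FALSE FALSE FALSE
```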
### prefix `tri:` {#detectingOutliers}
The `tri:` prefix is useful for identifying values of a numeric variable
that are *extreme* compared to the other values in a node.
**Note:** Unlike other variable specifications,
which take effect at the level of the entire data frame,
the `tri:` prefix takes effect within each node.
The effect of this variable specification
is to *trichotomize* the values of a numeric variable,
i.e. to divide them into three groups:
* "mid": values within plus or minus 1.5×IQR of the median,
* "high": values more than 1.5×IQR above the median,
* "low": values more than 1.5×IQR below the median.
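The grouping rule above can be written out in base R.
The function `trichotomize` is a hypothetical helper, not part of vtree,
and it assumes the default quantile method used by `IQR`;
within a variable tree the same rule would be applied separately in each node.

```{r,eval=FALSE}
# Hypothetical helper implementing the three groups described above
trichotomize <- function(x) {
  center <- median(x, na.rm = TRUE)
  spread <- 1.5 * IQR(x, na.rm = TRUE)
  out <- rep("mid", length(x))
  out[which(x > center + spread)] <- "high"
  out[which(x < center - spread)] <- "low"
  out[is.na(x)] <- NA
  out
}

groups <- trichotomize(c(1, 2, 3, 4, 5, 100, NA))
groups   # "mid" "mid" "mid" "mid" "mid" "high" NA
```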
### specification `variable=value` {#dichotomizing}
When a variable takes on a large number of different values,
the resulting variable tree will be very large.
One solution is to prune the tree,
for example by keeping just the node corresponding to one value of a particular variable.
An alternative is to specify the value of the variable that is of primary interest and
`vtree` will dichotomize the variable at that value.
For example if `Severity=Mild` is specified,
the `Severity` variable will be dichotomized between `Mild` and `Not Mild`.
### specifications `variable<value` and `variable>value`
These two specifications are used to dichotomize a *numeric* variable,
splitting above and below a specified value.
This can be useful for identifying subsets with extreme values.
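Dichotomizing amounts to replacing a variable by a two-level version of itself.
The base R sketch below shows the idea for a categorical variable (as with `Severity=Mild`)
and for a numeric variable split at a cut-off (as with a specification such as `Score<10`).
The values are hypothetical, and vtree's node labels and handling of missing values may differ.

```{r,eval=FALSE}
# Hypothetical values, for illustration only
severity <- c("Mild", "Moderate", "Severe", "Mild", NA)
score    <- c(4, 12, 9, 15, NA)

# Like Severity=Mild: the value of interest versus everything else
severity_mild <- ifelse(severity == "Mild", "Mild", "Not Mild")
severity_mild   # "Mild" "Not Mild" "Not Mild" "Mild" NA

# Like Score<10: below the cut-off versus not below it
score_lt10 <- score < 10
score_lt10      # TRUE FALSE TRUE FALSE NA
```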
## Displaying summary statistics in nodes {#summary}
*This section shows how to display information about other variables in the nodes.*
It is often useful to display information about *other* variables
(apart from those that define the tree) in the nodes of a variable tree.
This is particularly useful for numeric variables,
which usually would not be used to build the tree since they have too many distinct values.
The `summary` parameter allows you to show information (for example, a mean)
about a specified variable within a subset of the data frame.
### Default summaries
Suppose you are interested in summary information for the `Score`
variable for all of the observations in the data frame (i.e. in the root node).
In that case you don't need to specify any variables for the tree itself:
```{r,eval=FALSE}
vtree(FakeData,summary="Score")
```
`r spaces(68)`
`r vtree(FakeData,summary="Score",
pxwidth=500,imageheight="1.2in")`
When the name of a numeric variable (in this case `"Score"`) is specified
as the argument of the `summary` parameter,
a default set of summary statistics (as shown above) appears:
the variable name, the number of missing values,
the mean and standard deviation, the median and interquartile range (IQR),
and the range.
(Note, however, that if there are three or fewer observations,
instead of showing the above summary statistics,
the observations are simply listed.)
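For reference, the quantities in the default summary can be computed by hand in base R.
This is a sketch with hypothetical values; it assumes R's default quantile method,
and vtree's rounding and formatting are not reproduced.

```{r,eval=FALSE}
# Hypothetical scores, for illustration only
score <- c(2, 4, 4, 5, 7, 9, NA)

score_summary <- c(
  missing = sum(is.na(score)),
  mean    = mean(score, na.rm = TRUE),
  SD      = sd(score, na.rm = TRUE),
  median  = median(score, na.rm = TRUE),
  p25     = unname(quantile(score, 0.25, na.rm = TRUE)),
  p75     = unname(quantile(score, 0.75, na.rm = TRUE)),
  min     = min(score, na.rm = TRUE),
  max     = max(score, na.rm = TRUE))
score_summary
```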
Suppose we're building a variable tree based on `Severity`.
We can display these summaries for `Score` in each node:
```{r,eval=FALSE}
vtree(FakeData,"Severity",summary="Score",horiz=FALSE)
```
`r spaces(68)`
`r vtree(FakeData,"Severity",summary="Score",horiz=FALSE,
pxwidth=1000,imageheight="2.7in")`
Sometimes it is helpful to extract summary information as text.
For example, we might wish to access the summary information contained in the Mild node.
This is explained [later on](#extracting), but here's a brief example:
```{r attributes}
vSeverity <- vtree(FakeData,"Severity",summary="Score",horiz=FALSE)
info <- attributes(vSeverity)$info
cat(info$Severity$Mild$.text)
```
There are also default summaries for factor variables and for indicator variables.
For example, `Category` is a factor variable:
```{r,eval=FALSE}
vtree(FakeData,summary="Category")
```
`r spaces(68)`
`r vtree(FakeData,summary="Category",
pxwidth=300,imageheight="1in")`
Indicator variables have two levels such as 0 / 1, or `TRUE` / `FALSE`.
For example, `Event` is an indicator variable:
```{r,eval=FALSE}
vtree(FakeData,summary="Event")
```
`r spaces(68)`
`r vtree(FakeData,summary="Event",
pxwidth=300,imageheight="0.5in")`
### Specification of variables in the summary argument
Variables in the `summary` argument can also be specified in a way that is
similar to the [specification of variables](#VariableSpecification)
for structuring a variable tree.
For example, if we wish to know the proportion of patients
in each node whose `Category` is single,
we specify `Category=single` in the `summary` argument:
```{r,eval=FALSE}
vtree(FakeData,"Severity",summary="Category=single",horiz=FALSE)
```
`r vtree(FakeData,"Severity",summary="Category=single",horiz=FALSE,
pxwidth=1000,imageheight="1.4in")`
Summaries can be obtained for a collection of variables using pattern-matching,
for example:
```{r summary-pattern,eval=FALSE}
vtree(FakeData,"Severity",summary="Ind*",sameline=TRUE,horiz=FALSE,just="l")
```
`r vtree(FakeData,"Severity",summary="Ind*",sameline=TRUE,horiz=FALSE,just="l",
pxwidth=1000,imageheight="2.6in",margin=0.25)`
Incidentally, note that `just="l"` specifies that all text should be left-justified,
which conveniently lines up all of the rows of the summary.
The `summary` argument can also use the prefixes `i:`, `any:`, `none:`, `all:`, `notall:` (as well as `anyx:`, `nonex:`, `allx:`, and `notallx:`)
and wildcards `#` and `*`
(similar to [variable specifications](#VariableSpecification)).
Additionally, specifications for [REDCap checkboxes](#REDCapCheckboxes) can be used.
### Control codes: `%noroot%`, `%leafonly%`, `%var=`*v*`%`, and `%node=`*n*`%`
By default, summary information is shown in all nodes.
However, it may also be convenient to only show it in specific nodes.
To control this, special codes that begin and end with `%` can be specified.
The following control codes are available:
|code | summary information restricted to:
|:----------------|----------------------------------------
|`%noroot%` | all nodes *except* the root
|`%leafonly%` | leaf nodes
|`%var=`*v*`%` | nodes of variable *v*
|`%node=`*n*`%` | nodes named *n*
The control codes can be specified by adding them to the end of the summary
string, separated with a space.
For example, to only show summary information for nodes of the `Category` variable
with the value `single`:
```{r, eval=FALSE}
vtree(FakeData,"Severity Category",summary="Score<10 %var=Category%%node=single%",
sameline=TRUE, showlegend=TRUE, showlegendsum=TRUE)
```
`r spaces(35)`
`r vtree(FakeData,"Severity Category",summary="Score<10 %var=Category%%node=single%",
sameline=TRUE, showlegend=TRUE, showlegendsum=TRUE,
pxwidth=1500,imageheight="4.5in")`
Here `showlegend=TRUE` was specified,
and additionally `showlegendsum=TRUE`,
which indicates that summaries should also be shown in legend nodes.
### Customized summaries
The `summary` parameter also allows for customized summaries.
For example, we might wish to display only the mean `Score`
in each node of the tree.
The `%mean%` code is used to represent the mean of the specified variable
(preceded here by a line break, `\n`).
```{r,eval=FALSE}
vtree(FakeData,"Severity",summary="Score \nmean score\n%mean%",sameline=TRUE,horiz=FALSE)
```
`r spaces(30)`
`r vtree(FakeData,"Severity",summary="Score \nmean score\n%mean%",
sameline=TRUE,horiz=FALSE,
pxwidth=800,imageheight="1.5in")`
In addition to the `%mean%` code, numerous other summary codes are supported,
as listed in the table below.
When such a code is present, the default summary is not shown.
Instead,
any text that is provided---in this case `\nmean score\n`---is shown,
together with the requested summary information.
If there are any missing values in a node,
the number of missing values is shown using the abbreviation `mv`.
To see summaries without any decimals, specify `cdigits=0`.
summary code | result
:---------------|:-------------------------------------------------------------------
`%mean%` | mean (variant: `%meanx%` does not report missing values\*)
`%SD%` | standard deviation (variant: `%SDx%` does not report missing values\*)
`%sum%` | sum (variant: `%sumx%` does not report missing values\*)
`%min%` | minimum (variant: `%minx%` does not report missing values\*)
`%max%` | maximum (variant: `%maxx%` does not report missing values\*)
`%range%` | range (variant: `%rangex%` does not report missing values\*)
`%median%` | median, i.e. p50 (variant: `%medianx%` does not report missing values\*)
`%IQR%` | IQR, i.e. p25, p75 (variant: `%IQRx%` does not report missing values\*)
`%freqpct%` | frequency and percentage of values of a variable (variant: `%freqpct_%` shows each value on a separate line)
`%freq%` | frequency of values of a variable (variant: `%freq_%` shows each value on a separate line)
`%pY%` | *Y*th percentile (e.g. `p50` means the 50th percentile)
`%npct%` | frequency and percentage of a logical variable. By default "valid percentages" are used. Any missing values are also reported.
`%pct%` | same as `%npct%` but percentage only (with no parentheses).
`%list%` | list of individual values, separated by commas (variant: `%list_%` shows each value on a separate line)
`%mv%` | the number of missing values
`%nonmv%` | the number of non-missing values
`%v%` | the name of the variable

\* *Caution is recommended when suppressing missing values.*
The `summary` argument can include any number of these codes,
mixed with text and formatting codes.
### The `%trunc%` code
It is sometimes convenient to see individual values of a variable in each node.
A good example is ID numbers.
To do this, use the `%list%` code.
When a value occurs more than once in the subset,
it will be followed by a count of the number of repetitions in parentheses.
When there are many individual values,
it is often convenient to truncate the output.
If you specify `%trunc=`*N*`%`,
summary information will be truncated after *N* characters, and followed by "...".
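The listing and truncation described above can be sketched in base R.
The ID values are hypothetical, and the exact layout of vtree's output
(for example, spacing around the repetition count) is an assumption.

```{r,eval=FALSE}
# Hypothetical ID numbers within one node
id <- c(101, 102, 102, 103)

counts <- table(id)
values <- names(counts)
n      <- as.vector(counts)

# Append the repetition count to values that occur more than once
listed <- paste(ifelse(n > 1, paste0(values, " (", n, ")"), values), collapse = ", ")
listed      # "101, 102 (2), 103"

# Keep the first N characters and mark the truncation
N <- 10
truncated <- if (nchar(listed) > N) paste0(substr(listed, 1, N), "...") else listed
truncated   # "101, 102 (..."
```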
### R expressions in the summary argument
Rather than starting the `summary` argument with a variable name,
an R expression involving variables in the data frame can be given,
as long as it does not contain any spaces.
```{r,eval=FALSE}
vtree(FakeData,"Severity Category",
summary="(Post-Pre)/Pre \nmean = %mean%",sameline=TRUE,horiz=FALSE,cdigits=1)
```
`r vtree(FakeData,"Severity Category",
summary="(Post-Pre)/Pre \nmean = %mean%",sameline=TRUE,horiz=FALSE,cdigits=1,
pxwidth=1000,imageheight="2in",margin=0.25)`
Expressions involving functions can also be used; for example `sqrt(abs(Post/Pre))`.
### More than one variable
Sometimes it is useful to display summary information for more than one variable.
To do this, specify `summary` as a *vector* of character strings.
For example:
```{r,eval=FALSE}
vtree(FakeData,"Severity",horiz=FALSE,showvarnames=FALSE,splitwidth=Inf,sameline=TRUE,
summary=c("Score \nScore: mean (SD) %meanx% (%SD%)","Pre \nPre: range %range%"))
```
`r vtree(FakeData,"Severity",horiz=FALSE,showvarnames=FALSE,splitwidth=Inf,sameline=TRUE,
summary=c(
"Score \nScore: mean (SD) %meanx% (%SD%)",
"Pre \nPre: range %range%"),
pxwidth=1400,imageheight="1.4in")`
### Targeted summaries
Sometimes you only want to show a summary in a particular node.
Targeted summaries are specified with the `tsummary` parameter
as a list of character-string vectors.
The initial elements of each character string vector point to a specific node.
The final element of each character string vector is a summary string,
with the same structure as `summary`.
```{r,eval=FALSE}
vtree(FakeData,"Age Sex",tsummary=list(list(Age="5",Sex="M","id \n%list%")),horiz=FALSE)
```
`r vtree(FakeData,"Age Sex",tsummary=list(list(Age="5",Sex="M","id \n%list%")),horiz=FALSE,
pxwidth=1400,imageheight="2.3in")`
## Pattern trees and pattern tables {#patterns}
*This section shows how to display all the combinations of values in a set of variables.*
Each node in a variable tree provides the frequency of a particular combination
of values of the variables.
The leaf nodes represent the observed combinations of values of *all* of the variables.
For example, in a variable tree for `Severity` and `Sex`,
the leaf nodes correspond to Mild F, Mild M, Moderate F, Moderate M, etc.
These combinations, or "patterns", can be treated as an additional variable.
And if this new pattern variable is used as the first variable in a tree,
then the branches of the tree will be simplified:
each branch will represent a unique pattern, with no sub-branches.
A "pattern tree" can be easily produced by specifying `pattern=TRUE`.
For example:
```{r, eval=FALSE}
vtree(FakeData,"Severity Sex")
vtree(FakeData,"Severity Sex",pattern=TRUE)
```
`r spaces(10)`
`r vtree(FakeData,"Severity Sex",
pxwidth=500,imageheight="4in")`
`r spaces(15)`
`r vtree(FakeData,"Severity Sex",pattern=TRUE,
pxwidth=600,imageheight="4in")`
Pattern trees are simpler to read than ordinary variable trees,
but they involve a considerable loss of information,
since they only represent the *n*th-degree subsets
(where *n* is the number of variables).
Note that by default, when `pattern=TRUE` is specified,
the root node is not shown (in order to simplify the display).
A disadvantage of this is that the total sample size is not shown.
You can override this behavior by specifying `showroot=TRUE`.
A pattern tree has two other special characteristics.
First, note that after the first layer (representing `pattern`),
counts and percentages are not shown,
since they are not informative:
by definition, all nodes within a branch have the same count.
Second, note that in place of arrows, undirected line segments are shown.
This is because, unlike in a regular variable tree,
the order of variables is irrelevant in a pattern tree.
Sometimes, however, the variables do have a natural ordering,
as in the case of longitudinal variables.
To show arrows, specify `seq=TRUE` instead of `pattern=TRUE`,
and a "sequence" (i.e. an ordered pattern) will be shown.
Summaries can be shown in pattern trees
(using the `summary` parameter), but they only appear in the pattern node
(or the sequence node if `seq=TRUE`).
### Pattern tables
A pattern tree has the same structure as a table.
Indeed, it may be more convenient to produce a table rather than a pattern tree.
A data frame containing the information from the pattern tree
can be exported by specifying `ptable=TRUE`:
```{r}
vtree(FakeData,"Severity Sex",ptable=TRUE)
```
The pattern table includes a column for the counts from the pattern nodes,
and a column for percentages.
Compared to a variable tree, this table is much more compact,
and may be more suitable for use in a manuscript.
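The structure of a pattern table can be approximated in base R by counting each observed
combination of values.
This is a rough sketch with hypothetical data, not vtree's implementation;
in particular, the formula interface of `aggregate` drops rows with missing values,
so missing-value patterns are not represented here.

```{r,eval=FALSE}
# Hypothetical data, for illustration only
d <- data.frame(Severity = c("Mild", "Mild", "Severe", "Mild"),
                Sex      = c("F", "F", "M", "M"))

# Count each observed combination of values
pt <- aggregate(n ~ Severity + Sex, data = cbind(d, n = 1), FUN = sum)
pt$pct <- round(100 * pt$n / sum(pt$n))

# Most frequent patterns first
pt <- pt[order(-pt$n), ]
pt
```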
### Indicator variables
Pattern trees can be very useful for *indicator variables*,
i.e. variables that take values like 0/1, no/yes, FALSE/TRUE, etc.
For convenience in this section,
we'll refer to 0 (or no, FALSE, etc.) as a *negative*
and 1 (or yes, TRUE, etc.) as an *affirmative*.
The variables `Ind1` through `Ind4` in `FakeData` are 0/1 indicator variables.
If these variables are interpreted as representing set membership
(0 = non-member, 1 = member),
then a pattern tree is an alternative representation of a Venn diagram.
If you specify `Venn=TRUE`,
the nodes (except for the pattern nodes) will be blank,
with only their shade indicating their value
(dark = 1, light = 0, white = missing).
```{r, eval=FALSE}
vtree(FakeData,"Ind1 Ind2 Ind3 Ind4",Venn=TRUE,pattern=TRUE)
```
`r spaces(40)`
`r vtree(FakeData,"Ind1 Ind2 Ind3 Ind4",Venn=TRUE,pattern=TRUE,
pxwidth=500,imageheight="5in")`
Big pattern trees can be overwhelming,
so it may be useful to prune patterns that occur fewer than, say, 3 times,
by specifying `prunesmaller=3`.
A pattern tree for indicator variables provides all the information
that a Venn diagram represents,
but unlike a Venn diagram, missing values are also represented.
This can also be shown as a pattern table.
For example:
```{r}
vtree(FakeData,"Ind1 Ind2",ptable=TRUE)
```
#### The `VennTable` function
For indicator variables, there is an extra function, `VennTable`,
which converts the pattern table to a matrix of character strings
and adds some additional totals.
```{r}
VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE))
```
By default in R, when a matrix of character strings is printed,
quotation marks are displayed around each element.
Unfortunately the result is unattractive.
Instead it's helpful to call the `print` function and specify `quote=FALSE`:
```{r}
print(VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE)),quote=FALSE)
```
Without all those quotation marks, it's easier to see what `VennTable` adds:
* the total sample size (`r nrow(FakeData)`) and percentage (100), and
* the total number (N) of affirmatives for each variable, together with a percentage.
The `VennTable` function can also be used in an R Markdown document.
Specifying `markdown=TRUE` generates a pandoc markdown pipe table,
with several formatting tweaks:
* the rows and columns of the table are transposed
* affirmatives are represented by checkmarks
* negatives are represented by spaces
* missing values are represented by dashes (which can be changed with the `NAcode` parameter).
To display the table in R Markdown, use this inline call:
```{r, eval=FALSE}
`r VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE),markdown=TRUE)`
```
`r VennTable(vtree(FakeData,"Ind1 Ind2",ptable=TRUE),markdown=TRUE)`
`VennTable` has some additional parameters.
The `checked` parameter is used to specify values that should be interpreted as affirmative.
By default, it is set to `c("1","TRUE","Yes","yes","N/A")`.
Similarly, the `unchecked` parameter is used to specify values that should be interpreted as negative,
with default `c("0","FALSE","No","no","not N/A")`.
#### Using the `summary` parameter in pattern tables
The `summary` parameter can also be used in pattern tables.
If a single summary is requested,
it appears in the `summary_1` variable in the data frame.
Additional summaries appear as `summary_2`, `summary_3`, etc.
```{r}
vtree(FakeData,"Severity Sex",summary=c("Score %mean%","Pre %mean%"),ptable=TRUE)
```
### Checking for missing values with the `check.is.na` parameter {#missingValues}
If `check.is.na=TRUE` is specified,
each variable is replaced by an indicator of whether or not it is missing,
and `pattern=TRUE` is automatically set.
As when `Venn=TRUE` is specified, all nodes except for the pattern node are blank,
and only their shade indicates missing (dark) or not (light).
Whereas the variables used to build a variable tree are normally categorical,
in this situation non-categorical variables can be used,
because their missingness is represented instead of their actual values.
```{r,eval=FALSE}
vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE)
```
`r spaces(40)`
`r vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE,
pxwidth=600,imageheight="3in")`
Specifying `ptable=TRUE` produces this information in a data frame,
and calling `VennTable` shows additional information.
To display the table in R Markdown, use this inline call:
```{r, eval=FALSE}
`r VennTable(vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE,ptable=TRUE),
markdown=TRUE)`
```
`r VennTable(vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE,ptable=TRUE),markdown=TRUE)`
The rows `n` and `pct` represent the frequency and percentage
of the total number of cases for each pattern of missingness,
and the columns `N` and `pct` on the right-hand side represent the frequency and percentage
of missingness for each variable.
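The two kinds of totals (per pattern and per variable) can be sketched in base R.
The data frame `d` is hypothetical, and the pattern labels are an assumption made for
readability; they are not the labels used by vtree or `VennTable`.

```{r,eval=FALSE}
# Hypothetical data with missing values, for illustration only
d <- data.frame(Age = c(5, NA, 7, NA), Pre = c(1.2, 3.4, NA, 2.2))

miss <- is.na(d)   # logical matrix: TRUE = missing

# Frequency of each pattern of missingness across the variables
pattern <- apply(miss, 1, function(r) paste(ifelse(r, "NA", "ok"), collapse = " "))
table(pattern)

# Frequency and percentage of missingness for each variable
N   <- colSums(miss)
pct <- round(100 * N / nrow(d))
rbind(N, pct)
```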
It may be useful to identify the ID numbers for these patterns.
Here the results are truncated to 15 characters:
```{r}
vtree(FakeData,"Severity Age Pre Post",check.is.na=TRUE,summary="id %list%%trunc=15%",
ptable=TRUE)
```
## Colors {#colors}
*This section explains how colors and color palettes can be used.*
By default, `vtree` assigns colors to nodes of each successive variable
using color palettes from [RColorBrewer](https://cran.r-project.org/package=RColorBrewer).
The sequence of palettes (identified by short names) is as follows:
--  -------  --  ------  --  ------  --  ----
1   Reds     6   YlGn    11  BuPu    16  RdPu
2   Blues    7   PuBu    12  YlOrRd  17  BuGn
3   Greens   8   PuRd    13  RdYlGn  18  OrRd
4   Oranges  9   YlOrBr  14  GnBu
5   Purples  10  PuBuGn  15  YlGnBu
--  -------  --  ------  --  ------  --  ----
If you prefer to change the color assignments, you can use the `palette` parameter.
For example, by default a variable tree for `Sex` and `Severity` will assign shades of
red to nodes of `Sex` and shades of blue to nodes of `Severity`.
To switch to shades of, say, green and orange instead, use:
```{r, eval=FALSE}
vtree(FakeData,"Sex Severity",palette=c(3,4))
```
Sometimes it may be useful to reverse the order of a gradient.
To reverse the order of all gradients, specify `revgradient=TRUE`.
The gradient for selected variables can be reversed as in the example below:
```{r, eval=FALSE}
vtree(FakeData,"Sex Group Severity",revgradient=c(Sex=TRUE,Severity=TRUE))
```
Other color-related parameters include:
---------------- ---------------------------------------------------------
`sortfill`       Specifying `sortfill=TRUE` fills nodes with gradient colors in sorted order according to the node count.
`NAfillcolor`    By default, missing value nodes are colored white. For a different color (say gray), specify `NAfillcolor="gray"`. To instead use a color from the current palette, specify `NAfillcolor=NULL`.
`rootfillcolor`  The color of the root node can be changed (say to yellow) by specifying `rootfillcolor="yellow"`.
`fillcolor`      To set all nodes of the tree (except for missing value nodes and the root node) to be the same color (say palegreen), specify `fillcolor="palegreen"`.
`plain`          A simple color scheme is produced by specifying `plain=TRUE`. (Additionally, this increases the spaces between nodes.)
---------------- ---------------------------------------------------------
## REDCap checkboxes {#REDCapCheckboxes}
*This section details support for checkbox variables from REDCap.*
In datasets exported from [REDCap](https://www.project-redcap.org/),
checkboxes (i.e. select-all-that-apply boxes)
are represented in a special way.
For each item in a checklist, a separate variable is created.
Suppose survey respondents were asked to select which flavors of ice cream (Chocolate, Vanilla, Strawberry) they like.
Within REDCap,
the variable name for this list of checkboxes is `IceCream`,
but when the dataset is exported,
individual variables `IceCream___1` (representing Chocolate),
`IceCream___2` (Vanilla), and `IceCream___3` (Strawberry) are created.
When the dataset is read into R,
the names of the flavors are embedded in the `attributes` of these variables.
For illustrative purposes, let's build a data frame like this using the
`build.data.frame` function (for an explanation of this function, see the section
of this vignette on [generating a data frame by specifying subset sizes](#GeneratingDataFrames)):
```{r}
dessert <- build.data.frame(
c( "group","IceCream___1","IceCream___2","IceCream___3"),
list("A", 1, 0, 0, 7),
list("A", 1, 0, 1, 2),
list("A", 0, 0, 0, 1),
list("A", 1, 1, 1, 1),
list("B", 1, 0, 1, 1),
list("B", 1, 0, 0, 2),
list("B", 0, 1, 1, 1),
list("B", 0, 0, 0, 1))
attr(dessert$IceCream___1,"label") <- "Ice cream (choice=Chocolate)"
attr(dessert$IceCream___2,"label") <- "Ice cream (choice=Vanilla)"
attr(dessert$IceCream___3,"label") <- "Ice cream (choice=Strawberry)"
```
### prefix `r:`
The prefix `r:` identifies a REDCap checklist variable,
and extracts a label from the variable attribute.
For example, the following call automatically displays "Chocolate":
```{r eval=FALSE}
vtree(dessert,"r:IceCream___1")
```
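To make the label extraction concrete, here is a base R sketch of how the choice name
could be pulled out of a REDCap-style label.
The regular expression is an assumption about the label format shown in the `dessert`
example, not a description of vtree's internal code.

```{r,eval=FALSE}
# A REDCap-style label, as in the dessert example
label <- "Ice cream (choice=Chocolate)"

# Pull out the text following "choice=" inside the parentheses
choice <- sub("^.*\\(choice=(.*)\\)$", "\\1", label)
choice   # "Chocolate"
```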
### suffix `@`
The suffix `@` matches REDCap checklist variables based on the naming
scheme used by REDCap for checklist variables.
For example, the following call automatically displays Chocolate,
Vanilla, and Strawberry:
```{r eval=FALSE}
vtree(dessert,"r:IceCream@")
```
### variable prefixes `rany:`, `rnone:`, `rall:`, and `rnotall:`
The variable prefixes `any:`, `none:`, `all:`, and `notall:`
can be combined with the `r:` prefix
to form `rany:`, `rnone:`, `rall:`, and `rnotall:`.
For example, to determine whether anyone did not like *any* of the
flavors (Chocolate, Vanilla, or Strawberry):
```{r eval=FALSE}
vtree(dessert,"rnone:IceCream@")
```
### variable prefix `ri:`
"Intersections" of REDCap variables may be obtained by
combining the `r:` prefix with the `i:` prefix:
```{r eval=FALSE}
vtree(dessert,"ri:IceCream@")
```
### Deprecated: variable prefixes `stem:` and `rc:`
To examine the pattern of ice-cream flavor choices,
the following can be used:
```{r eval=FALSE}
vtree(dessert,"IceCream___1 IceCream___2 IceCream___3",pattern=TRUE)
```
One problem is that this doesn't assign the appropriate labels to
`IceCream___1` (Chocolate), `IceCream___2` (Vanilla), and `IceCream___3` (Strawberry).
Instead, try the following more compact call,
which also assigns labels automatically.
```{r eval=FALSE}
vtree(dessert,"stem:IceCream",pattern=TRUE)
```
The `summary` parameter also supports a `stem:` prefix:
```{r, eval=FALSE}
vtree(dessert,summary="stem:IceCream",splitwidth=Inf,just="l")
```
`r spaces(63)`
`r vtree(dessert,summary="stem:IceCream",splitwidth=Inf,just="l",
pxwidth=1000,imageheight="1in")`
If you wish to only examine specific REDCap checkbox items,
the `rc:` prefix can be used.
For example to examine results for just Chocolate and Strawberry:
```{r, eval=FALSE}
vtree(dessert,"rc:IceCream___1 rc:IceCream___3",pattern=TRUE)
```
## The DOT script generated by `vtree`
*This section shows how to obtain the DOT script that displays a variable tree.*
Specifying `getscript=TRUE` lets you capture the DOT script representing a variable tree.
(DOT is a graph description language used by Graphviz, which is used by DiagrammeR, which is used by vtree!)
Here is an example:
```{r, comment=""}
dotscript <- vtree(FakeData,"Severity",getscript=TRUE)
cat(dotscript)
```
If you wish to directly edit this code,
it can be pasted into an online Graphviz editor,
for example:
* https://dreampuf.github.io/GraphvizOnline/
* http://magjac.com/graphviz-visual-editor/
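Since the captured script can be printed with `cat`, it can also be saved to a file for
editing outside R.
The sketch below uses a short placeholder string in place of a real script,
and the file name is arbitrary.

```{r,eval=FALSE}
# Placeholder standing in for a script captured with getscript=TRUE
dotscript <- "digraph vtree {\n  Node_1 -> Node_2\n}"

# Write the script to a temporary .dot file
dotfile <- tempfile(fileext = ".dot")
writeLines(dotscript, dotfile)

# Read it back to confirm what was written
saved <- readLines(dotfile)
saved
```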
## Extracting a list of information about a variable tree {#extracting}
*This section explains how to obtain all of the counts and percentages of a variable tree.*
Sometimes it is useful to extract counts, percentages, and
summary information from a variable tree.
The object returned by `vtree` has an attribute `info` containing
structured information about the counts and percentages in each node.
Here is an example:
```{r,echo=TRUE,message=FALSE,eval=FALSE}
v <- vtree(FakeData,"Group Viral",horiz=FALSE)
v
```
`r spaces(30)`
```{r,echo=FALSE,message=FALSE}
v <- vtree(FakeData,"Group Viral",pxwidth=800,imageheight="2in",horiz=FALSE)
v
```
```{r,echo=TRUE,message=FALSE}
attributes(v)$info
```
The list contains the counts (`.n`), percentages (`.pct`), and summary
text (`.text`) that appear in the tree.
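Because `info` is an ordinary nested list, its elements can be collected with base R tools.
The list below is a hypothetical stand-in with made-up numbers;
its layout is assumed from the `info$Severity$Mild$.text` example shown earlier in this vignette.

```{r,eval=FALSE}
# Hypothetical stand-in for attributes(v)$info, with made-up numbers
info <- list(
  Severity = list(
    Mild     = list(.n = 6, .pct = 60, .text = ""),
    Moderate = list(.n = 4, .pct = 40, .text = "")))

# Counts for every node of the Severity variable
counts <- sapply(info$Severity, function(node) node$.n)
counts   # Mild = 6, Moderate = 4
```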
# Ways to call vtree
`vtree` behaves differently depending on the context in which it is called.
## Calling vtree interactively
* If `vtree` is called interactively in RStudio, it displays the variable tree in the Viewer window.
* If `vtree` is called interactively from the RGui console
(i.e. from R outside of RStudio),
it displays the variable tree in a browser window.
## Calling vtree from knitr and R Markdown {#embeddingInKnitrRmarkdown}
When `vtree` is called from knitr, it generates
* A PNG file if the output format is Markdown
* A PDF file if the output format is LaTeX.
Here's how it does that. `vtree` uses the `DiagrammeR` package,
which automatically generates an `htmlwidget` object for display in HTML,
using the [htmlwidgets](https://www.htmlwidgets.org/) framework.
Then `vtree` converts the `htmlwidget` object into an SVG image,
and finally into a PNG or PDF file.
### Generating PNG files
PNG files are useful because they allow you to display variable trees in Microsoft Word documents,
and also because HTML files that use htmlwidgets can get large,
and if they contain several widgets they can be slow to load.
If `vtree` is called while an R Markdown file is being knitted,
it generates a PNG file and automatically embeds it into the knitted document.
The resolution of the PNG file in pixels is determined by parameters `pxwidth` and `pxheight`.
If neither is specified, `pxwidth` is automatically set to 2000,
which provides good resolution for a printed page.
The height of the image in the R Markdown output document can be specified
using the `imageheight` parameter,
for example `imageheight="4in"` for a 4-inch image.
There is also an `imagewidth` parameter.
If neither is specified, `imageheight` is automatically set to 3 inches.
*Note*: You may notice a warning in the R Markdown rendering
(in RStudio, the R Markdown pane) like this:
```{r, eval=FALSE,echo=TRUE}
:1919791: Invalid asm.js: Function definition doesn't match use
```
Although distracting, this message is harmless and can be ignored.
### Embedding image files
The PNG or PDF file is stored in the folder specified by the `folder` parameter,
or, if `folder` is not specified, in a temporary folder.
Successive PNG files are named `vtree001.png`, `vtree002.png`,
and so forth and are stored in the folder.
(Similarly PDF files are named `vtree001.pdf`, etc.)
During knitting, `vtree` uses the `options` function in base R to store
a variable called `vtcount` to count the PNG files,
and a variable called `vtfolder` to identify the folder where they will be stored.
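The naming scheme can be illustrated with a small base R sketch.
This is not vtree's actual implementation:
the helper `next_image_name` and the option name `demo_count` are hypothetical,
chosen only to show how a counter kept with `options` can yield names such as `vtree001.png`.

```{r, eval=FALSE}
# Hypothetical helper (not part of vtree): increment a counter stored
# via options() and build a zero-padded file name from it.
next_image_name <- function(ext = "png") {
  count <- getOption("demo_count", 0) + 1
  options(demo_count = count)
  sprintf("vtree%03d.%s", count, ext)
}

options(demo_count = NULL)   # start with no counter set
first  <- next_image_name()
second <- next_image_name()
c(first, second)
```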
To call `vtree` in R Markdown, you can use inline code:
```{r, eval=FALSE}
`r vtree(FakeData,"Sex Severity")`
```
Or you can use a code chunk:
````
`r ''````{r}
vtree(FakeData,"Sex Severity")
```
````
One advantage of code chunks is that they can also be run interactively
(for example within RStudio, by clicking on the green arrow at the top right of a code chunk).
### Generating an image file but not displaying it
Specifying `imageFileOnly=TRUE` instructs vtree to generate an image file but not display it.
### Generating an htmlwidget in an HTML document
When knitting to an HTML document,
htmlwidgets can be used rather than embedding a PNG file.
To do so, simply specify `pngknit=FALSE`.
## `svtree`: Using vtree in Shiny {#svtree}
Thanks to Shiny and the svg-pan-zoom JavaScript library,
interactive panning and zooming of a variable tree is possible
with the `svtree` function.
The syntax of `svtree` is the same as that of `vtree`,
but instead of generating a static variable tree, it launches a Shiny app.
The mousewheel allows you to zoom in or out.
The variable tree can also be dragged to a different position.
Thanks to the panning and zooming functionality in `svtree`,
it is possible to examine larger variable trees than with `vtree`.
In large variable trees it is often useful to show the variable name in each node,
since the variable labels (which are shown at the bottom or left-hand margin)
may not be visible after zooming.
To show the variable name in each node, specify `showvarinnode=TRUE`.
# Generating a data frame by specifying subset sizes {#GeneratingDataFrames}
`vtree` is designed to generate a variable tree based on a data frame.
However, sometimes the sizes of subsets are known but no data frame is available.
The `build.data.frame` function allows you to build a data frame by specifying the size of subsets.
Here's an example involving pets:
```{r}
build.data.frame(
c("pet","breed","size"),
list("dog","golden retriever","large",5),
list("cat","tabby","small",2))
```
In this case there are five large golden retrievers and two small tabby cats.
Although a data frame like this could easily be created without using `build.data.frame`,
doing so by hand becomes tedious when there are more subsets or the counts are large.
For example:
```{r, eval=FALSE}
build.data.frame(
c("pet","breed","size"),
list("dog","golden retriever","large",5),
list("cat","tabby","small",2),
list("dog","Dalmatian","various",101),
list("cat","Abyssinian","small",5),
list("cat","Abyssinian","large",22),
list("cat","tabby","large",86))
```
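For comparison, here is a rough base R equivalent of the call above.
It is a sketch of the underlying idea (repeat each row as many times as its count),
not a description of how `build.data.frame` is implemented;
the names `spec` and `pets` are illustrative.

```{r, eval=FALSE}
# One row per subset, with its size in column n
spec <- data.frame(
  pet   = c("dog", "cat", "dog", "cat", "cat", "cat"),
  breed = c("golden retriever", "tabby", "Dalmatian",
            "Abyssinian", "Abyssinian", "tabby"),
  size  = c("large", "small", "various", "small", "large", "large"),
  n     = c(5, 2, 101, 5, 22, 86),
  stringsAsFactors = FALSE
)

# Repeat each row n times, then drop the count column
pets <- spec[rep(seq_len(nrow(spec)), times = spec$n),
             c("pet", "breed", "size")]
rownames(pets) <- NULL
table(pets$pet)
```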
# Examples
## Rudimentary CONSORT diagrams
Consider the following fictitious data about a randomized controlled trial (RCT):
```{r}
FakeRCT
```
The [CONSORT diagram](http://www.consort-statement.org/) shows the flow of patients through the study,
starting with those who meet eligibility criteria,
then those who are randomized, etc.
It is easy to produce a rudimentary version of a CONSORT diagram in `vtree`.
The key step is to prune branches for those who are *not* eligible, *not* randomized, etc.
This can be done using the `keep` parameter:
```{r,eval=FALSE}
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
keep=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
```
`r spaces(55)`
`r vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
keep=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
width=230,height=500,pxwidth=300,imageheight="5in")`
Note that this does not include all of the additional information for a full CONSORT diagram
(exclusion reasons and counts,
as well as numbers of patients who received their allocated interventions,
who discontinued intervention, and who were excluded from analysis).
It does, however, provide the main flow information.
Additional information can be obtained by viewing the nodes for patients in the
pruned branches (but not their descendants).
The `follow` parameter makes that easy:
```{r,eval=FALSE}
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
```
`r spaces(30)`
`r vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
width=400,height=500,pxwidth=500,imageheight="5in")`
Finally, it may be useful to see the ID numbers in each node.
This can be done using the `summary` parameter with the `%list%` code.
Since IDs are less useful in the root node,
the `%noroot%` code is also specified here:
```{r, eval=FALSE}
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
summary="id \nid: %list% %noroot%")
```
`r spaces(30)`
`r vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
summary="id \nid: %list% %noroot%",
width=500,height=600,pxwidth=600,imageheight="5in")`
## Examples using R datasets {#RdatasetExamples}
The `datasets` package is loaded in R by default.
In the following section,
`vtree` is applied to several of these data sets for illustrative purposes.
Note that the variable trees generated by the commands below are not shown.
The reader can try these commands to see what the variable trees look like,
and experiment with many other possibilities.
### Esophageal cancer
The `esoph` data set
(data from a case-control study of esophageal cancer in Ille-et-Vilaine, France),
has 88 different combinations of age group, alcohol consumption, and tobacco consumption.
Let's examine the total number of cases and the total number of controls
among patients aged 75 and older compared to the rest of the patients:
```{r, eval=FALSE}
# Relabel agegp 75+ to 75plus because vtree tries to parse the +
ESOPH <- esoph
levels(ESOPH$agegp)[levels(ESOPH$agegp)=="75+"] <- "75plus"
vtree(ESOPH,"agegp=75plus",sameline=TRUE,cdigits=0,
summary=c("ncases \ncases=%sum%%leafonly%","ncontrols controls=%sum%%leafonly%"))
```
### Hair and eye color
The `HairEyeColor` data set is an array representing a contingency table
(also called a crosstab or crosstabulation).
Before `vtree` can be applied to this data set,
it is necessary to convert the table of crosstabulated frequencies to a data frame of cases.
For convenience, the `vtree` package includes a helper function to do this,
called `crosstabToCases`.
It is adapted from a function listed on the [Cookbook for R website](http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/#countstocases-function).
```{r, eval=FALSE}
hec <- crosstabToCases(HairEyeColor)
```
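To see what this conversion involves, here is a minimal base R sketch of the counts-to-cases idea.
It is an approximation for illustration, and `crosstabToCases` may differ in its details;
the names `counts` and `cases` are illustrative.

```{r, eval=FALSE}
# Flatten the array into one row per cell, with the cell count in Freq
counts <- as.data.frame(HairEyeColor)

# Repeat each row Freq times to get one row per individual
cases <- counts[rep(seq_len(nrow(counts)), times = counts$Freq),
                c("Hair", "Eye", "Sex")]
rownames(cases) <- NULL

# The number of cases equals the total of the original counts
nrow(cases) == sum(HairEyeColor)
```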
There are a lot of combinations but let's say we are especially interested
in green eyes (as compared to non-green eyes).
We can use the variable specification `Eye=Green` to do this:
```{r, eval=FALSE}
vtree(hec,"Hair Eye=Green Sex",sameline=TRUE)
```
### Titanic
The `Titanic` dataset is a 4-dimensional array of counts.
First, let's convert it to a dataframe of individuals:
```{r, eval=FALSE}
titanic <- crosstabToCases(Titanic)
```
We'll specify `sameline=TRUE` so that the variable tree is a bit more compact:
```{r, eval=FALSE}
vtree(titanic,"Class Sex Age",summary="Survived=Yes \n%pct% survived",sameline=TRUE)
```
### mtcars
The `mtcars` data set was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).
The rownames of the data set contain the names of the cars.
Let's move that information into a column.
To do that, we'll make a slightly altered version of the data frame which we'll call `mt`:
```{r, eval=FALSE}
mt <- mtcars
mt$name <- rownames(mt)
rownames(mt) <- NULL
```
Now let's look at the mean and standard deviation of horsepower (HP)
by number of carburetors, nested within number of gears,
and in turn nested within number of cylinders:
```{r, eval=FALSE}
vtree(mt,"cyl gear carb",summary="hp \nmean (SD) HP %mean% (%SD%)")
```
The above shows the mean and SD of horsepower by
(1) number of cylinders;
(2) number of gears (within number of cylinders);
and
(3) number of carburetors (within number of gears nested within number of cylinders).
That's a lot of information.
Suppose instead that we are only interested in number 3 above,
i.e. all combinations of number of cylinders, number of gears, and number of carburetors.
In that case, we can specify `ptable=TRUE`.
To make the table a little easier to read,
we'll set the number of digits for the mean and SD to zero
and relabel the variables:
```{r, eval=FALSE}
vtree(mt,"cyl gear carb",summary="hp mean (SD) HP %mean% (%SD%)",
cdigits=0,labelvar=c(cyl="# cylinders",gear="# gears",carb="# carburetors"),
ptable=TRUE)
```
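As a cross-check on the pattern table, the same per-combination summaries can be computed in base R.
This sketch uses `aggregate` on the original `mtcars` data;
the name `hp_summary` is illustrative,
and in this base R version the SD is `NA` for any combination that contains a single car.

```{r, eval=FALSE}
# Count, mean and SD of horsepower for each observed combination
# of cylinders, gears and carburetors
hp_summary <- aggregate(
  hp ~ cyl + gear + carb, data = mtcars,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# Sort so that rows are nested: cylinders, then gears, then carburetors
hp_summary <- hp_summary[order(hp_summary$cyl, hp_summary$gear, hp_summary$carb), ]
hp_summary
```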
We might also like to list the names of cars by number of carburetors
nested within number of gears:
```{r, eval=FALSE}
vtree(mt,"gear carb",summary="name \n%list%%noroot%",splitwidth=50,sameline=TRUE,
labelvar=c(gear="# gears",carb="# carburetors"))
```
### UCBAdmissions
The `UCBAdmissions` data set consists of aggregate data on
applicants to graduate school at Berkeley for the six largest departments in 1973
classified by admission and sex.
According to the data set Details,
"This data set is frequently used for illustrating Simpson's paradox,
see Bickel et al. (1975).
At issue is whether the data show evidence of sex bias in admission practices.
There were 2691 male applicants, of whom 1198 (44.5%) were admitted,
compared with 1835 female applicants of whom 557 (30.4%) were admitted."
Furthermore,
"the apparent association between admission and sex stems from
differences in the tendency of males and females to apply to the individual departments
(females used to apply more to departments with higher rejection rates)."
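The aggregate figures in this quotation can be reproduced directly from the table in base R by summing over departments.
The names `by_sex` and `admit_rate` are illustrative.

```{r, eval=FALSE}
# Sum over departments, leaving an Admit x Gender table
by_sex <- margin.table(UCBAdmissions, c(1, 2))
by_sex

# Column proportions: admission rate within each sex, as percentages
admit_rate <- round(100 * prop.table(by_sex, 2), 1)
admit_rate["Admitted", ]
```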
First, we'll convert the crosstab data to a data frame of cases, `ucb`:
```{r, eval=FALSE}
ucb <- crosstabToCases(UCBAdmissions)
```
Next, let's look at admission rates by Gender, nested within department:
```{r, eval=FALSE}
vtree(ucb,"Dept Gender",summary="Admit=Admitted \n%pct% admitted",sameline=TRUE)
```
### ChickWeight
The `ChickWeight` data set is from an experiment on the effect of diet on early growth of chicks.
Let's look at the mean weight of chicks at birth (0 days of age) and 4 days of age,
nested within type of diet.
A simple variable tree can be produced like this:
```{r, eval=FALSE}
vtree(ChickWeight,"Diet Time",
keep=list(Time=c("0","4")),summary="weight \nmean weight %mean%g")
```
To make the display a little easier to read, relabel the nodes
and the `Time` variable:
```{r, eval=FALSE}
vtree(ChickWeight,"Diet Time",keep=list(Time=c("0","4")),
labelnode=list(
Diet=c("Diet 1"="1","Diet 2"="2","Diet 3"="3","Diet 4"="4"),
Time=c("0 days"="0","4 days"="4")),
labelvar=c(Time="Days since birth"),summary="weight \nmean weight %mean%g")
```
### InsectSprays
The `InsectSprays` data set contains
counts of insects in agricultural experimental units treated with different insecticides.
Let's look at those counts by insecticide.
```{r, eval=FALSE}
vtree(InsectSprays,"spray",splitwidth=80,sameline=TRUE,
summary="count \ncounts: %list%%noroot%",cdigits=0)
```
### ToothGrowth
The `ToothGrowth` data set contains
the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs.
Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day)
by one of two delivery methods,
orange juice or ascorbic acid (a form of vitamin C and coded as VC).
Let's examine the percentage with length > 20 by dose nested within delivery method:
```{r, eval=FALSE}
vtree(ToothGrowth,"supp dose",summary="len>20 \n%pct% length > 20")
```
To make the display a little easier to read, relabel the variables
and the nodes of the `supp` variable:
```{r, eval=FALSE}
vtree(ToothGrowth,"supp dose",summary="len>20 \n%pct% length > 20",
labelvar=c("supp"="Supplement type","dose"="Dose (mg/day)"),
labelnode=list(supp=c("Vitamin C"="VC","Orange Juice"="OJ")))
```