With its XY() function, lessR provides many versions of a scatterplot
for one or two variables, with an option to provide a separate
scatterplot for each level of one or two categorical variables. Access
all scatterplots with the same simple syntax. The first variable listed
without a parameter name, the x parameter, is plotted along the x-axis.
Any second variable listed without a parameter name, the y parameter,
is plotted along the y-axis. Each parameter may be represented by a
continuous or categorical variable, and by a single variable or a vector
of variables.
XY() also plots time series data when the x-axis
variable is a Date variable. See the Time
vignette for those examples.
Illustrate with the Employee data included as part of lessR.
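As a minimal sketch, the following reads the Employee data into the default data frame name, d. It assumes that lessR is installed and that the built-in data set is accessed by the name "Employee" with the lessR Read() function.

```r
library(lessR)

# Read the Employee data included with lessR into the default data frame d
d <- Read("Employee")

# Inspect the first few rows with base R
head(d)
```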
As an option, lessR also supports variable labels.
The labels are displayed in both the text and visualization output. Each
displayed label consists of the variable name juxtaposed with the
corresponding label. Create the label table with two columns. The
first column is the variable name and the second column is the
corresponding variable label. Not all variables need to be entered into
the table. The table can be stored as either a csv file or
an Excel file.
Read the variable label file into the l data frame, currently the only permissible name for the data frame of labels.
Display the available labels.
## label
## Years Time of Company Employment
## Gender Man or Woman
## Dept Department Employed
## Salary Annual Salary (USD)
## JobSat Satisfaction with Work Environment
## Plan 1=GoodHealth, 2=GetWell, 3=BestCare
## Pre Test score on legal issues before instruction
## Post Test score on legal issues after instruction
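The two steps above, reading and then displaying the labels, can be sketched as follows. The file name "Employee_lbl" is an assumption: it is the name under which lessR is understood to ship the label file for the Employee data. For your own labels, substitute the path to your csv or Excel file.

```r
library(lessR)

# Read the variable labels into the data frame l
# ("Employee_lbl" is assumed to be the built-in label file for Employee)
l <- rd("Employee_lbl")

# Display the available labels
l
```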
A typical scatterplot visualizes the relationship of two continuous
variables, here Years worked at a company, and annual
Salary. Following is the function call to XY() for
the default visualization.
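A sketch of the default call, assuming the Employee data have been read into d as described above.

```r
library(lessR)
d <- Read("Employee")

# Default scatterplot: Years on the x-axis, Salary on the y-axis
XY(Years, Salary)
```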
Because d is the default name of the data frame that
contains the variables for analysis, the data parameter
that names the input data frame need not be specified. That is, there is
no need to specify data=d, though this parameter can be explicitly
included in the function call if desired.
Enhance the default scatterplot with parameter enhance.
The visualization includes the mean of each variable indicated by the
respective line through the scatterplot, the 95% confidence ellipse,
labeled outliers, least-squares regression line with 95% confidence
interval, and the corresponding regression line with the outliers
removed.
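A sketch of the enhanced plot. The text names the parameter enhance. The value TRUE is an assumption about how the enhancement is switched on.

```r
library(lessR)
d <- Read("Employee")

# Enhanced scatterplot: means, 95% confidence ellipse, labeled outliers,
# and regression lines with and without the outliers
# (enhance=TRUE is assumed to be the value that turns on the enhancement)
XY(Years, Salary, enhance=TRUE)
```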
The default for formatting both axis labels is to round numeric
values of thousands, such as 100000 to 100K. Change this default
of "K" with the parameter axis_fmt. Specify "," to insert commas in
large numbers that use a decimal point, "." to insert periods, or
"" to turn off formatting. The value of "K" can also be combined
with "," or "." by forming a vector of values, such as
c("K", ",").
Axis labels can also be formatted by adding a prefix to a numeric
value with the parameters axis_x_prefix and
axis_y_prefix, such as $ or €.
The specified value can be multiple characters, such as for the
Brazilian currency, R$.
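A sketch that combines the two kinds of axis formatting described above, using only the parameter names and values given in the text. The choice of commas and a dollar sign is for illustration.

```r
library(lessR)
d <- Read("Employee")

# Commas in large axis values, with a dollar sign in front of each Salary value
XY(Years, Salary, axis_fmt=",", axis_y_prefix="$")

# A multi-character prefix, such as for the Brazilian currency
XY(Years, Salary, axis_fmt=",", axis_y_prefix="R$")
```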
A variety of fit lines can be plotted with the fit parameter. The
available values: "loess" for general non-linear fit, "lm" for
linear least squares, "null" for the null (flat line)
model, "exp" for exponential growth and decay,
"quad" for the quadratic model, and "power" for
the general power beyond 2. Setting fit to
TRUE plots the "loess" line. With the value of
"power", specify the value of the root with parameter
fit_power.
Here, plot the general non-linear fit. For emphasis set
plot_errors to TRUE to plot the residuals from
the line. The sum of the squared errors is displayed to facilitate the
comparison of different models.
Next, plot the exponential fit and show the residuals from the exponential curve. These data are approximately linear so the exponential curve does not vary far from a straight line. The function displays the corresponding sum of squared errors to assist in comparing various models to each other.
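The two fits described above can be sketched as follows, with the fit values and the plot_errors parameter taken from the text.

```r
library(lessR)
d <- Read("Employee")

# General non-linear (loess) fit with the residuals from the line plotted
XY(Years, Salary, fit="loess", plot_errors=TRUE)

# Exponential fit with the residuals from the curve plotted
XY(Years, Salary, fit="exp", plot_errors=TRUE)
```

Compare the sum of squared errors reported for each call to judge which model better describes these data.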
The fit_power parameter transforms the y variable to the specified
power from the default of 1 before doing the regression
analysis. The availability of this parameter provides for a wide range
of modifications to the underlying functional form of the fit curve.
Map a continuous variable, such as Pre, to the size of the plotted
points with the pt_size parameter to create a bubble plot.
Indicate multiple variables to plot along either axis with a vector
defined according to the base R function c(). Plot the
linear model for each variable according to the fit
parameter set to "lm". By default, when multiple lines are
plotted on the same panel, the confidence interval is turned off by
internally setting the parameter fit_se to
0. Explicitly override this parameter value as needed.
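A sketch with two x-variables, Pre and Post, plotted against Salary. The choice of these variables is for illustration. The value 0.95 for fit_se assumes that the parameter takes the confidence level as a proportion.

```r
library(lessR)
d <- Read("Employee")

# Two x-variables on the same panel, each with its own linear fit line;
# the confidence intervals are off by default with multiple lines
XY(c(Pre, Post), Salary, fit="lm")

# Explicitly request the confidence intervals
# (0.95 is assumed to specify the 95% level)
XY(c(Pre, Post), Salary, fit="lm", fit_se=0.95)
```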
Read the data and convert the values of numerically valued categorical variables to meaningful labels.
d$Airbags <- factor(d$Airbags, levels=0:2, labels=c("none", "driver", "drv+pas"))
d$DriveTrain <- factor(d$DriveTrain, levels=0:2, labels=c("rear", "front", "all"))
d$Manual <- factor(d$Manual, levels=0:1, labels=c("Not_Avail", "Available"))
Visualize the scatterplot of MPGhiway and HP, stratified against three categorical variables: Airbags plotted in different colors for each scatterplot, and separate scatterplots for all six combinations of the levels of DriveTrain and Manual.
To plot a scatterplot matrix, specify multiple variables for the
first parameter value, x, repeated for the second
parameter, y. Define these multiple variables as a vector,
such as defined by c(). Request the non-linear fit line and
corresponding confidence interval by specifying TRUE or
"loess" for the fit parameter. Request a linear
fit line with the value of "lm".
Smoothing and binning are two procedures for visualizing a relationship with many data values.
To obtain a larger data set, in this example generate random data
with base R rnorm(), then plot. XY() first
checks the presence of the specified variables in the global environment
(workspace). If not there, then from a data frame, of which the default
value is d. Here, randomly generate values from normal
populations for x and y in the workspace.
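A sketch of the data generation. The sample size of 4000 and the seed are arbitrary choices for illustration.

```r
library(lessR)

# Generate random values from normal populations in the workspace
set.seed(123)          # arbitrary seed so the example is reproducible
x <- rnorm(4000)
y <- rnorm(4000)

# XY() finds x and y in the workspace, so no data frame is needed
XY(x, y)
```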
With large data sets, even for continuous variables there can be much
over-plotting of points. One strategy to address this issue smooths the
scatterplot by setting the type parameter to
"smooth". The individual points superimposed on the smoothed
plot are potential outliers. The default number of plotted outliers is
100. Turn off the plotting of outliers completely by setting parameter
smooth_points to 0. Show the linear trend with
fit set to "lm".
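A sketch of the smoothed scatterplot on randomly generated data. The parameter names follow the text, and the value "smooth" for type is assumed to be given as a character string.

```r
library(lessR)

set.seed(123)          # arbitrary seed for reproducibility
x <- rnorm(4000)
y <- rnorm(4000)

# Smoothed scatterplot with the linear trend and the default outlier points
XY(x, y, type="smooth", fit="lm")

# The same plot without any individually plotted points
XY(x, y, type="smooth", fit="lm", smooth_points=0)
```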
Another strategy for alleviating over-plotting makes the fill color
mostly transparent with the transparency parameter, or turns the fill
off completely by setting fill to "off". The
closer the value of transparency is to 1, the more transparent is
the fill.
Contour plots are another effective way to visualize scatter plots
with much data. By default, the parameter contours_n is set
at 10. XY() provides a threshold for deleting points from
consideration when plotting the contour curves. Otherwise, if there are
extreme outliers, the axes extend to their maximum and minimum values,
typically resulting in much white space that surrounds the visible
contour plot. The extreme values of outlier points with low density
round down to zero on the color scale. The parameter
contours_pad, with a default value of 0, can adjust the
white space to pad the resulting contour curve. Increase the parameter
value to add more padding to the plot.
Another way to visualize a relationship when there are many data
points is to bin the x-axis. Specify the number of bins with
parameter n_bins. XY() then computes the mean of y
for each bin and connects the means by line segments. By default, this
procedure plots the conditional means without any assumption of form
such as linearity. Specify median for the stat parameter
to compute the median of y for each bin instead. The
standard XY() parameters fill,
color, pt_size and segments also
apply.
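A sketch of binning on randomly generated data. The number of bins is arbitrary, and the value "median" for stat is assumed to be given as a character string.

```r
library(lessR)

set.seed(123)          # arbitrary seed for reproducibility
x <- rnorm(4000)
y <- rnorm(4000)

# Mean of y within each of five bins of x, connected by line segments
XY(x, y, n_bins=5)

# Median of y within each bin instead of the mean
XY(x, y, n_bins=5, stat="median")
```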
Create a Cleveland dot plot when one of the variables has unique (ID)
values. In this example, for a single variable, row names are on the
y-axis. By default, the plot sorts the rows by the plotted value in
ascending order, according to the default value of "+" for the
parameter sort. Set sort to "-" for a descending plot and to
"0" for no sorting.
The standard scatterplot version of a Cleveland dot plot follows, with no sorting and no line segments.
This Cleveland dot plot has two x-variables, indicated as a standard
R vector with the c() function. In this situation, the two
points on each row are connected with a line segment. By default the
rows are sorted by distance between the successive points.
A mixture of categorical and continuous variables can be plotted a variety of ways, as illustrated below.
Plot a scatterplot of two continuous variables for each level of a
categorical variable on the same panel with the by
parameter. Here, plot Years and Salary each for the
two levels of Gender in the data. Colors and geometric plot
shapes can distinguish between the plots. For all variables except an
ordered factor, the default plots according to the default qualitative
color palette, "hues", with the geometric shape of a
point.
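A sketch of the grouped scatterplot with the default colors and shape, using the by parameter as described above.

```r
library(lessR)
d <- Read("Employee")

# One scatterplot panel, with the points for each level of Gender
# plotted in a different color from the default qualitative palette
XY(Years, Salary, by=Gender)
```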
Change the plot colors with the fill (interior) and
color (exterior or edge) parameters. Because there are two
levels of the by variable, specify two fill colors and two
edge colors each with an R vector defined by the c()
function. Also, include the regression line for each group with the
fit parameter and increase the size of the plotted points
with the size parameter.
XY(Years, Salary, by=Gender, size=2, fit="lm",
fill=c(M="olivedrab3", W="gold1"),
color=c(M="darkgreen", W="gold4")
)
Change the plotted shapes with the shape parameter. The
default value is "circle" with both an exterior color and
filled interior, specified with "color" and
"fill". Other possible values, with fillable interiors, are
"circle", "square", "diamond",
"triup" (triangle up), and "tridown" (triangle
down). Other possible values include all uppercase and lowercase
letters, all digits, and most punctuation characters. The numbers 0
through 25 defined by the R points() function also apply.
If plotting levels according to by, then list one shape for
each level to be plotted.
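A sketch that assigns one shape to each of the two levels of Gender. The particular shapes are arbitrary choices from the values listed above.

```r
library(lessR)
d <- Read("Employee")

# One shape per level of the by variable, in the order of the levels
XY(Years, Salary, by=Gender, shape=c("diamond", "square"))

# Letters also serve as plotting symbols, here the level names themselves
XY(Years, Salary, by=Gender, shape=c("M", "W"))
```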
Or, request default shapes across the different by
groups by setting parameter shape to
"vary".
A Trellis (facet) plot creates a separate panel for the plot of each
level of the categorical variable. Generate Trellis plots with the
facet parameter. In this example, plot the best-fit linear
model for the data in each panel according to the fit
parameter. By default, the 95% confidence interval for each line is also
displayed.
Turn off the confidence interval by setting the parameter
fit_se to 0 for the value of the confidence level.
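A sketch of the Trellis plot with and without the confidence interval. The choice of Gender as the facet variable is an assumption for illustration.

```r
library(lessR)
d <- Read("Employee")

# Separate panel for each level of Gender, each with its own linear fit
# and, by default, the 95% confidence interval about the line
XY(Years, Salary, facet=Gender, fit="lm")

# The same Trellis plot with the confidence interval turned off
XY(Years, Salary, facet=Gender, fit="lm", fit_se=0)
```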
A categorical variable plotted with a continuous variable results in a traditional scatterplot, though the scatter is confined to the straight lines that represent the levels (values) of the categorical variable.
The first two parameters of XY() are x and
y. In this example, the categorical variable,
Dept, listed second, specifies the y variable, as
in y=Dept. The function call makes no distinction between
two continuous variables and one continuous and one categorical variable.
The XY() function evaluates each variable for continuity and
responds appropriately.
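A sketch of the call with the categorical variable listed second. The choice of Salary as the continuous variable is an assumption for illustration.

```r
library(lessR)
d <- Read("Employee")

# Continuous x, categorical y: the points for each department
# fall along a horizontal line
XY(Salary, Dept)

# Equivalent call with the parameter names given explicitly
XY(x=Salary, y=Dept)
```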
To avoid point overlap, if there is at least one duplicated value of
continuous y for any level of categorical x,
by default some horizontal jitter is added to each plotted point, which
was not needed in this example. Manually adjust the jitter with either
parameter jitter_x or, if x is continuous and
y categorical, the jitter_y parameter. In
addition, if the categorical variable is an R factor or a
variable of type character, by default the mean of the
continuous variable is displayed at each level of the categorical
variable, as well as in the text output. If the categorical variable is
numeric, it is better to convert the variable to a factor to have just the
categories on the axis and not a continuous scale. For example,
d$Gender <- factor(d$Gender).
Another helpful technique for large data sets is to add some fill
transparency with the transparency parameter, with values
such as 0.8 and 0.9. The combination of jitter and transparency allows
for plotting many thousands of points.
Show the different distributions of the continuous variable across the levels of the categorical variable with a scatterplot. Here, show the distribution of Salary for Males and Females across the various departments.
To illustrate, first, the data. Use the Cars93 data set that is installed with lessR, which describes characteristics of 1993 cars.
Two of the categorical variables are integer coded 0 and 1, so recode them to R factors to obtain more descriptive labels. For clarity and consistency, convert the relevant categorical variables to factors, including Cylinders, the number of cylinders for a car.
XY() can display the relationships for up to five
variables. The two primary variables, x and y, that
form the basis of the scatter plot, are continuous. Usually these two
variables are listed first in the function call and so do not need their
parameter names specified. Indicate two categorical variables that form
the Trellis panels with parameter facet. Call these two
variables the Trellis variables, which define a Trellis panel for each
combination of their values. Finally, there can be a categorical
grouping variable, the by variable, which plots different
groups within each Trellis panel.
Plot MPGcity according to Weight. Specify the
number of Cylinders and whether a Manual transmission is available as
Trellis conditioning variables to form the Trellis plot. Specify the
Source of the vehicle, Foreign or Domestic, as
a grouping variable to plot with separate colors on each panel. Use the
parameter value n_axis_x_skip=2 to display only every third
axis tick label, because there is not enough room to show every label
without overlap.
Several patterns emerge from the visualization. As Weight increases, city MPG decreases. Domestic cars tend to weigh more. Foreign cars tend to have fewer cylinders, which also leads to better fuel mileage.
To avoid over-plotting, the plot of two categorical variables results in a bubble plot of their joint frequencies.
The parameter radius scales the size of the bubbles
according to the size of the largest displayed bubble in inches. The
power parameter sets the relative size of the bubbles. The
default power value of 0.5 scales the bubbles so that the
area of each bubble is the value of the corresponding sizing variable. A
value of 1 scales so the radius of each bubble is the value of the
sizing variable, increasing the discrepancy in size among the
bubbles.
In this example, increase the absolute size of the bubbles as well as
the relative discrepancy in their sizes. If the bubbles become too
large, so that the largest bubbles become truncated, increase the
spacing of the respective axes with the pad_x and/or
pad_y parameters.
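A sketch of the bubble plot of joint frequencies. The choice of Dept and Gender, and the specific values of radius, power, pad_x and pad_y, are assumptions for illustration only.

```r
library(lessR)
d <- Read("Employee")

# Default bubble plot of the joint frequencies of two categorical variables
XY(Dept, Gender)

# Larger bubbles (radius) with more discrepancy in their sizes (power),
# plus extra axis padding so that the largest bubbles are not truncated
XY(Dept, Gender, radius=0.6, power=0.8, pad_x=0.05, pad_y=0.05)
```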
An interactive visualization lets the user change parameter values in
real time to change characteristics of the visualization. To
create an interactive two-variable scatterplot of continuous variables
with the Employee data that displays the corresponding parameters, run
the function interact() with "ScatterPlot"
specified.
interact("ScatterPlot")
To create an interactive Trellis plot as a combined violin, box, and
scatter plot with the five values of Dept from the Employee data set
that displays the corresponding parameters, run the function
interact() with "Trellis" specified.
interact("Trellis")
The functions are not run here because interactivity requires running them directly from the R console.
Use the base R help() function to view the full manual
for XY(). Simply enter a question mark followed by the name
of the function.
?XY
More on Scatterplots, Time Series plots, and other visualizations from lessR and other packages such as ggplot2 at:
Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.