Since mlrCPO is a package with some depth to it, it comes with a few
vignettes that each explain different aspects of its operation. These
are the current document (“First Steps”), offering a short introduction
and information on where to get started; “mlrCPO Core”, describing all
the functions and tools offered by mlrCPO that are independent of
specific CPOs; “CPOs Built Into mlrCPO”, listing all CPOs included in
the mlrCPO package; and “Building Custom CPOs”, describing the process
of creating new CPOs that offer new functionality.
All vignettes also have a “compact version” with the R output suppressed for readability. They are linked in the navigation section at the top.
All vignettes assume that mlrCPO (and therefore its requirement mlr) is
installed successfully and loaded using library("mlrCPO"). Help with
installation is provided on the project’s GitHub page.
“Composable Preprocessing Operators” (“CPOs”) are an extension for the
mlr (“Machine Learning in R”) project that presents preprocessing
operations in the form of R objects. These CPO objects can be composed
to form complex operations, they can be applied to data sets, and they
can be attached to mlr Learner objects to generate machine learning
pipelines that combine preprocessing and model fitting.
“Preprocessing”, as understood by mlrCPO, is any manipulation of data
used in a machine learning process to get it from its form as found in
the wild into a form more fitting for the machine learning algorithm
(“Learner”) used for model fitting. It is important that the exact
method of preprocessing is kept track of, so that the same method can be
performed when the resulting model is used to make predictions on new
data. It is also important, when evaluating preprocessing methods (e.g.
using resampling), that the parameters of these methods are independent
of the validation data set and depend only on the training data set.
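The requirement that preprocessing parameters depend only on the training data can be illustrated with a small base-R sketch. It does not use mlrCPO at all; the row indices for the training and validation split are hypothetical and chosen only for illustration.

```r
# Base-R sketch (no mlrCPO needed): centering and scaling parameters are
# estimated on the training rows only and then reused for validation rows.
train.idx = c(1, 2, 3, 51, 52, 102)  # hypothetical training split
valid.idx = c(103, 104)              # hypothetical validation split
features = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

train.x = as.matrix(iris[train.idx, features])
valid.x = as.matrix(iris[valid.idx, features])

# "training" the preprocessing: estimate parameters from training data
center = colMeans(train.x)
spread = apply(train.x, 2, sd)

# apply the *same* parameters to both parts
train.scaled = scale(train.x, center = center, scale = spread)
valid.scaled = scale(valid.x, center = center, scale = spread)

# the training part has column means of (numerically) zero; the
# validation part generally does not, since it was not used for estimation
colMeans(train.scaled)
colMeans(valid.scaled)
```

Estimating `center` and `spread` on all rows instead would let information from the validation rows leak into the preprocessing.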
mlrCPO tries to support the user in all these aspects of preprocessing:
- It provides a large set of CPOs that can perform many different
  operations. Operations that go beyond the provided toolset can be
  implemented in custom CPOs.
- It provides “CPOTrained” objects that represent the preprocessing
  done on training data that should, in that way, be re-applied to new
  prediction data.
- CPOs can be attached to mlr “Learner” objects that represent the
  entire machine learning pipeline to be tuned and evaluated.
At the centre of mlrCPO are “CPO” objects. To get a CPO object, it is
necessary to call a CPO Constructor. A CPO Constructor sets up the
parameters of a CPO and provides further options for its behaviour.
Internally, CPO Constructors are functions that have a common interface
and a friendly printer method.
cpoScale  # a cpo constructor
cpoAddCols
cpoScale(center = FALSE)  # create a CPO object that scales, but does not center, data
cpoAddCols(Sepal.Area = Sepal.Length * Sepal.Width)  # this would add a column
CPOs exist first to be applied to data. Every CPO represents a certain
data transformation, and this transformation is performed when the CPO
is applied. This can be done using the applyCPO function, or the %>>%
operator. CPOs can be applied to data.frame objects, and to mlr “Task”
objects.
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
tail(iris.demo %>>% cpoQuantileBinNumerics())  # bin the data in below & above median
A useful feature of CPOs is that they can be concatenated to form new
operations. Two CPOs can be combined using the composeCPO function or,
as before, the %>>% operator. When two CPOs are combined, the product is
a new CPO that can itself be composed or applied. The result of a
composition represents the operation of first applying the first CPO and
then the second CPO. Therefore, data %>>% (cpo1 %>>% cpo2) is the same
as (data %>>% cpo1) %>>% cpo2.
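The equivalence of these two ways of writing a pipeline can be checked directly. The following is a minimal sketch; the choice of cpoScale and cpoPca is arbitrary, and the argument order of composeCPO (first CPO, second CPO) and applyCPO (CPO, data) is an assumption that should be verified against the package help pages.

```r
library("mlrCPO")
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

cpo1 = cpoScale()
cpo2 = cpoPca()

# operator form and function form of composition
# (argument order of composeCPO is assumed to be: first CPO, second CPO)
piped = cpo1 %>>% cpo2
composed = composeCPO(cpo1, cpo2)

# applying the compound CPO vs. applying the two CPOs one after the other
res.compound = iris.demo %>>% piped
res.stepwise = (iris.demo %>>% cpo1) %>>% cpo2
# (argument order of applyCPO is assumed to be: CPO, data)
res.function = applyCPO(composed, iris.demo)

# the attached retrafo attributes are ignored; only the data is compared
all.equal(res.compound, res.stepwise, check.attributes = FALSE)
all.equal(res.compound, res.function, check.attributes = FALSE)
```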
# first create three quantile bins, then as.numeric() all columns to
# get 1, 2 or 3 as the bin number
quantilenum = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric()
iris.demo %>>% quantilenum
The last example shows that it is sometimes not a good idea to have a
CPO affect the whole dataset. Therefore, when a CPO is created, it is
possible to choose which columns the CPO should affect. The CPO
Constructor has a variety of parameters, with names starting with
affect., that can be used to choose which columns the CPO operates on.
To prevent cpoAsNumeric from influencing the Species column, we can thus
do
quantilenum.restricted = cpoQuantileBinNumerics(numsplits = 3) %>>%
  cpoAsNumeric(affect.names = "Species", affect.invert = TRUE)
iris.demo %>>% quantilenum.restricted
A more convenient method in this case, however, is to use an mlr “Task”,
which keeps track of the target column. “Feature Operation” CPOs (like
all the ones shown so far) do not influence the target column.
demo.task = makeClassifTask(data = iris.demo, target = "Species")
result = demo.task %>>% quantilenum
getTaskData(result)
When performing preprocessing, it is sometimes necessary to change a
small aspect of a long preprocessing pipeline. Instead of having to
re-construct the whole pipeline, mlrCPO offers the possibility to change
hyperparameters of a CPO. This makes it easy, for example, to tune
preprocessing in combination with a machine learning algorithm.
Hyperparameters of CPOs can be manipulated in the same way as they are
manipulated for Learners in mlr, using getParamSet (to list the
parameters), getHyperPars (to list the parameter values), and
setHyperPars (to change these values). To get the parameter set of a
CPO, it is also possible to use verbose printing using the !
(exclamation mark) operator.
cpo = cpoScale()
cpo
getHyperPars(cpo) # list of parameter names and values
getParamSet(cpo) # more detailed view of parameters and their type / range
!cpo # equivalent to print(cpo, verbose = TRUE)
CPOs use copy semantics, therefore setHyperPars creates a copy of a CPO
that has the changed hyperparameters.
cpo2 = setHyperPars(cpo, scale.scale = FALSE)
cpo2
iris.demo %>>% cpo  # scales and centers
iris.demo %>>% cpo2  # only centers
When chaining many CPOs, it is possible for the many hyperparameters to
lead to very cluttered ParamSets, or even for hyperparameter names to
clash. mlrCPO has two remedies for that.
First, any CPO also has an id that is always prepended to the
hyperparameter names. It can be set during construction, using the id
parameter, or changed later using setCPOId. The latter one only works on
primitive, i.e. not compound, CPOs. Set the id to NULL to use the CPO’s
hyperparameters without a prefix.
cpo = cpoScale(id = "a") %>>% cpoScale(id = "b")  # not very useful example
getHyperPars(cpo)
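The effect of the id on hyperparameter names can be inspected with a short sketch like the following. It assumes that setCPOId takes the CPO as its first and the new id as its second argument; the id "first" and the variable names are made up for illustration.

```r
library("mlrCPO")

# default id: hyperparameter names carry the CPO's default prefix
names(getHyperPars(cpoScale()))

# id = NULL: hyperparameter names without a prefix
noprefix = cpoScale(id = NULL)
names(getHyperPars(noprefix))

# changing the id of an existing primitive CPO
# (assumed signature: setCPOId(cpo, id))
renamed = setCPOId(cpoScale(), "first")
names(getHyperPars(renamed))
```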
The second remedy against hyperparameter clashes is different “exports”
of hyperparameters: The hyperparameters that can be changed using
setHyperPars, i.e. that are exported by a CPO, are a subset of the
parameters of the CPOConstructor. For each kind of CPO, there is a
standard set of parameters that are exported, but during construction,
it is possible to influence the parameters that actually get exported
via the export parameter. export can be one of a set of standard export
settings (among them “export.all” and “export.none”) or a character
vector of the parameters to export.
cpo = cpoPca(export = c("center", "rank"))
getParamSet(cpo)
Manipulating data for preprocessing itself is relatively easy. A
challenge comes when one wants to integrate preprocessing into a
machine-learning pipeline: The same preprocessing steps that are
performed on the training data need to be performed on the new
prediction data. However, the transformation performed for prediction
often needs information from the training step. For example, if training
entails performing PCA, then for prediction, the data must not undergo
another PCA; instead, it needs to be rotated by the rotation matrix
found by the training PCA. The process of obtaining the rotation matrix
will be called “training” the CPO, and the object that contains the
trained information is called CPOTrained. For preprocessing operations
that operate only on features of a task (as opposed to the target
column), the CPOTrained will always be applied to new incoming data, and
hence be of class CPORetrafo and called a “retrafo” object. To obtain
this retrafo object, one can use retrafo(). Retrafo objects can be
applied to data just as CPOs can, by using the %>>% operator.
transformed = iris.demo %>>% cpoPca(rank = 3)
transformed
ret = retrafo(transformed)
ret
To show that ret actually represents the exact same preprocessing
operation, we can feed the first line of iris.demo back to it, to verify
that the transformation is the same.
iris.demo[1, ] %>>% ret
We obviously would not have gotten there by feeding the first line to
cpoPca directly:
iris.demo[1, ] %>>% cpoPca(rank = 3)
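The idea of "train once, re-apply later" can also be sketched in base R, using prcomp() in place of cpoPca. This is only an illustration of the principle and not a description of how cpoPca is implemented internally.

```r
# Base-R sketch of training a PCA and re-applying it to a single row
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
num = as.matrix(iris.demo[, 1:4])

# "training": estimate the centering vector and the rotation matrix
pca = prcomp(num, center = TRUE, scale. = FALSE)

# "retrafo": reuse the stored centre and rotation on a single row,
# without estimating anything from that row
first.row = num[1, , drop = FALSE]
reapplied = sweep(first.row, 2, pca$center) %*% pca$rotation

# this reproduces the first row of the training result
all.equal(unname(reapplied[1, ]), unname(pca$x[1, ]))
```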
CPOTrained objects associated with an object are automatically chained
when another CPO is applied. To prevent this from happening, it is
necessary to “clear” the retrafos and inverters associated with the
object using clearRI().
t2 = transformed %>>% cpoScale()
retrafo(t2)
t3 = clearRI(transformed) %>>% cpoScale()
retrafo(t3)
Note that clearRI has no influence on the CPO operations themselves, and
the resulting data is the same:
all.equal(t2, t3, check.attributes = FALSE)
It is also possible to chain CPOTrained objects using composeCPO() or
%>>%. This can be useful if the trafo chain loses access to the retrafo
attribute for some reason. In general, it is only recommended to compose
CPOTrained objects that were created in the same process and in the
correct order, since they are usually closely associated with the
training data in a particular place within the preprocessing chain.
retrafo(transformed) %>>% retrafo(t3) # is the same as retrafo(t2) above.
So far, only CPOs that change the feature columns of a Task were
introduced (“Feature Operation CPOs”, FOCPOs). There is another class of
CPOs, “Target Operation CPOs” or TOCPOs, that can change a Task’s target
columns.
This comes at the cost of some complexity when performing prediction:
Since the training data that was ultimately fed into a Learner had a
transformed target column, the predictions made by the resulting model
will not be directly comparable to the original target values. Consider
cpoLogTrafoRegr, a CPO that log-transforms the target variable of a
regression Task. The predictions made with a Learner on a
log-transformed target variable will be in log-space and need to be
exponentiated (or otherwise re-transformed). This inversion operation is
represented by an “inverter” object that is attached to a transformation
result similarly to a retrafo object, and can be obtained using the
inverter() function. It is of class CPOInverter, a subclass of
CPOTrained.
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")
iris.logd = iris.regr %>>% cpoLogTrafoRegr()

getTaskData(iris.logd)  # log-transformed target 'Petal.Width'
inv = inverter(iris.logd)  # inverter object
inv
The inverter object is used by the invert() function that inverts the
prediction made by a model trained on the transformed task, and
re-transforms this prediction to fit the space of the original target
data. The inverter object caches the “truth” of the data being inverted
(iris.logd, in the example), so invert can give information on the truth
of the inverted data.
logmodel = train("regr.lm", iris.logd)
pred = predict(logmodel, iris.logd)  # prediction on the task itself
pred
invert(inv, pred)
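What the inversion amounts to for a log-transformed target can be sketched in base R with lm(). This is an illustration of the principle only; the two predictor columns are picked arbitrarily and the sketch does not go through mlrCPO.

```r
# Base-R sketch: fit in log-space, predict, then invert by exponentiating
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

# "trafo": the model is fit on the log-transformed target
fit = lm(log(Petal.Width) ~ Sepal.Length + Sepal.Width, data = iris.demo)

# the prediction is made in log-space ...
pred.log = predict(fit, newdata = iris.demo)

# ... and "inverted" by exponentiating, which makes it comparable
# to the original target values
pred.orig = exp(pred.log)
data.frame(truth = iris.demo$Petal.Width, response = pred.orig)
```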
This procedure can also be done with new incoming data. In general,
more than just the cpoLogTrafoRegr operation could be done on the
iris.regr task in the example, so to perform the complete preprocessing
and inversion, one needs to use the retrafo object as well. When
applying the retrafo object, a new inverter object is generated, which
is specific to the exact new data that was being retransformed:
newdata = makeRegrTask("newiris", iris[7:9, ], target = "Petal.Width",
  fixup.data = "no", check.data = FALSE)
# the retrafo does the same transformation(s) on newdata that were
# done on the training data of the model, iris.logd. In general, this
# could be more than just the target log transformation.
newdata.transformed = newdata %>>% retrafo(iris.logd)
getTaskData(newdata.transformed)
pred = predict(logmodel, newdata.transformed)
pred
# the inverter of the newly transformed data contains information specific
# to the newly transformed data. In the current case, that is just the
# new "truth" column for the new data.
inv.newdata = inverter(newdata.transformed)
invert(inv.newdata, pred)
The cpoLogTrafoRegr is a special case of TOCPO in that its inversion
operation is constant: It does not depend on the new incoming data, so
in theory it is not necessary to get a new inverter object for every
piece of data that is being transformed. Therefore, it is possible to
use the retrafo object for inversion in this case. However, the “truth”
column will not be available in this case:
invert(retrafo(iris.logd), pred)
Whether a retrafo object is capable of performing inversion can be
checked with the getCPOTrainedCapability() function. It returns a vector
with named elements "retrafo" and "invert", indicating whether a
CPOTrained is capable of performing retrafo or inversion. A 1 indicates
that the object can perform the action and has an effect, a 0 indicates
that the action would have no effect (but also throws no error), and a
-1 means that the object is not capable of performing the action.
getCPOTrainedCapability(retrafo(iris.logd)) # can do both retrafo and inversion
getCPOTrainedCapability(inv) # a pure inverter, can not be used for retrafo
As an example of a CPO that does not have a constant inverter, consider
cpoRegrResiduals, which fits a regression model on training data and
returns the residuals of this fit. When performing prediction, the
invert action is to add predictions by the CPO’s model to the incoming
predictions made by a model trained on the residuals.
set.seed(123)  # for reproducibility
iris.resid = iris.regr %>>% cpoRegrResiduals("regr.lm")
getTaskData(iris.resid)
model.resid = train("regr.randomForest", iris.resid)

newdata.resid = newdata %>>% retrafo(iris.resid)
getTaskData(newdata.resid)  # Petal.Width are now the residuals of lm model predictions
pred = predict(model.resid, newdata.resid)
pred
# transforming this prediction back to compare
# it to the original 'Petal.Width'
inv.newdata = inverter(newdata.resid)
invert(inv.newdata, pred)
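Why this inversion depends on the new data can be seen in a base-R sketch of the same idea. Both models here are plain lm() fits with arbitrarily chosen predictors, standing in for the "regr.lm" CPO model and the random forest used above; none of this reflects mlrCPO internals.

```r
# Base-R sketch of residual fitting and its data-dependent inversion
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
new.rows = iris[7:9, ]

# step 1: a first model is fit on the training data; its residuals
# become the new target
base.fit = lm(Petal.Width ~ Sepal.Length, data = iris.demo)
train.resid = iris.demo
train.resid$Petal.Width = residuals(base.fit)

# step 2: a second model is trained on the residuals
resid.fit = lm(Petal.Width ~ Petal.Length, data = train.resid)

# step 3: on new data, the inversion adds the first model's predictions
# for exactly these rows to the second model's residual predictions
pred.resid = predict(resid.fit, newdata = new.rows)
pred.final = predict(base.fit, newdata = new.rows) + pred.resid
pred.final
```

Unlike exponentiating a log-space prediction, the term added in step 3 changes with every new data set, which is why a fresh inverter is needed each time.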
Besides FOCPOs and TOCPOs, there are also “Retrafoless” CPOs (ROCPOs).
These only perform operations in the training part of a machine learning
pipeline, but in turn are the only CPOs that may change the number of
rows in a dataset. The goal of ROCPOs is to change the number of data
samples, but not to transform the data or target values themselves.
Examples of ROCPOs are cpoUndersample, cpoSmote, and cpoSample.
sampled = iris %>>% cpoSample(size = 3)
sampled
There is no retrafo or inverter associated with the result. Instead, both of them are NULLCPO.
retrafo(sampled)
inverter(sampled)
Until now, the CPOs have been invoked explicitly to manipulate data and
get retrafo and inverter objects. It is good to be aware of the data
flows in a machine learning process involving preprocessing, but mlrCPO
makes it very easy to automate this. It is possible to attach a CPO to a
Learner using attachCPO or the %>>% operator. When a CPO is attached to
a Learner, a CPOLearner is created. The CPOLearner performs the
preprocessing operation dictated by the CPO before training the
underlying model, and stores and uses the retrafo and inverter objects
necessary during prediction. It is possible to attach compound CPOs, and
it is possible to attach further CPOs to a CPOLearner to extend the
preprocessing pipeline. Exported hyperparameters of a CPO are also
present in a CPOLearner and can be changed using setHyperPars, as usual
with other Learner objects.
Recreating the pipeline from “General Inverters” with a CPOLearner looks
like the following. Note that the prediction pred made in the end is
identical to the one made above.
set.seed(123)  # for reproducibility
lrn = cpoRegrResiduals("regr.lm") %>>% makeLearner("regr.randomForest")
lrn
model = train(lrn, iris.regr)

pred = predict(model, newdata)
pred
It is possible to get the retrafo object from a model trained with a
CPOLearner using the retrafo() function. In this example, it is
identical to the retrafo(iris.resid) obtained in the example in
“General Inverters”.
retrafo(model)
Since the hyperparameters of a CPO are present in a CPOLearner, it is
possible to tune hyperparameters of preprocessing operations. It can be
done using mlr’s tuneParams() function and works identically to tuning
common Learner parameters.
icalrn = cpoIca() %>>% makeLearner("classif.logreg")

getParamSet(icalrn)
ps = makeParamSet(
  makeIntegerParam("ica.n.comp", lower = 1, upper = 8),
  makeDiscreteParam("ica.alg.typ", values = c("parallel", "deflation")))
# shorter version using pSS:
# ps = pSS(ica.n.comp: integer[1, 8], ica.alg.typ: discrete[parallel, deflation])
tuneParams(icalrn, pid.task, cv5, par.set = ps,
control = makeTuneControlGrid(),
show.info = FALSE)
Besides the %>>% operator, there are a few related operators which are
short forms of operations that otherwise take more typing.
- %<<% is similar to %>>% but works in the other direction. a %>>% b is
  the same as b %<<% a.
- %<>>% and %<<<% are the %>>% or %<<% operators, combined with
  assignment. a %<>>% b is the same as a = a %>>% b. These operators
  perform the operations on their right before they do the assignment,
  so it is not necessary to use parentheses when writing
  a = a %>>% b %>>% c as a %<>>% b %>>% c.
- %>|% and %|<% feed data into a CPO and get the retrafo(). data %>|% a
  is the same as retrafo(data %>>% a). The %>|% operator performs the
  operation on its right before getting the retrafo, so it is not
  necessary to use parentheses when writing retrafo(data %>>% a %>>% b)
  as data %>|% a %>>% b.
As described before, it is possible to compose CPOs to create relatively
complex preprocessing pipelines. It is therefore necessary to have tools
to inspect a CPO pipeline or related objects.
The first line of attack when inspecting a CPO is always the print
function. print(x, verbose = TRUE) will often print more information
about a CPO than the ordinary print function. A shorthand alias for this
is the exclamation point “!”. When verbosely printing a CPOConstructor,
the transformation functions are shown. When verbosely printing a CPO,
the constituent elements are separately printed, each showing their
parameter sets.
cpoAsNumeric  # plain print
!cpoAsNumeric  # verbose print
cpoScale() %>>% cpoIca() # plain print
!cpoScale() %>>% cpoIca() # verbose print
When working with compound CPOs, it is sometimes necessary to manipulate
a CPO inside a compound CPO pipeline. For this purpose, the as.list()
generic is implemented for both CPO and CPOTrained for splitting a
pipeline into a list of the primitive elements. The inverse is
pipeCPO(), which takes a list of CPO or CPOTrained and concatenates them
using composeCPO().
as.list(cpoScale() %>>% cpoIca())
pipeCPO(list(cpoScale(), cpoIca()))
CPOTrained objects contain information about the retrafo or inversion to
be performed for a CPO. It is possible to access this information using
getCPOTrainedState(). The “state” of a CPOTrained object often contains
a $data slot with information about the expected input and output format
(“ShapeInfo”) of incoming data, a slot for each of its hyperparameters,
and a $control slot that is specific to the CPO in question. The cpoPca
state, for example, contains the PCA rotation matrix and a vector for
scaling and centering. The contents of a state’s $control object are
described in a CPO’s help page.
repca = retrafo(iris.demo %>>% cpoPca())
state = getCPOTrainedState(repca)
state
It is even possible to change the “state” of a CPOTrained and construct
a new CPOTrained using makeCPOTrainedFromState(). This is fairly
advanced usage and only recommended for users familiar with the inner
workings of the particular CPO. If we get familiar with the cpoPca CPO
using the !-print (i.e. !cpoPca) to look at the retrafo function, we
notice that the control$center and control$scale values are given to a
call of scale(). If we want to create a new CPOTrained that does not
perform centering or scaling before applying the rotation matrix, we can
change these values.
state$control$center = FALSE
state$control$scale = FALSE
nosc.repca = makeCPOTrainedFromState(cpoPca, state)
Comparing this to the original “repca” retrafo shows that the result of
applying repca has generally smaller values because of the centering.
iris.demo %>>% repca
iris.demo %>>% nosc.repca
There is a large and growing variety of CPOs that perform many different
operations. It is advisable to browse through “CPOs Built Into mlrCPO”
for an overview. To get a list of all built-in CPOs, use listCPO(). A
few important or “meta” CPOs that can be used to influence the behaviour
of other CPOs are described here.
The value associated with “no operation” is the NULLCPO value. It is the
neutral element of the %>>% operations, and the value of retrafo() and
inverter() when there are otherwise no associated retrafo or inverter
values.
NULLCPO
all.equal(iris %>>% NULLCPO, iris)
cpoPca() %>>% NULLCPO
The multiplexer makes it possible to combine many CPOs into one, with
an extra selected.cpo parameter that chooses between them. This makes it
possible to tune over many different preprocessing configurations at
once.
cpm = cpoMultiplex(list(cpoIca, cpoPca(export = "export.all")))
!cpm
iris.demo %>>% setHyperPars(cpm, selected.cpo = "ica", ica.n.comp = 3)
iris.demo %>>% setHyperPars(cpm, selected.cpo = "pca", pca.rank = 3)
cpoWrap is a simple CPO with one parameter; the value of this parameter is itself a CPO and gets applied to the data. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument’s parameters to the outside.
cpa = cpoWrap()
!cpa
iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoScale())
iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoPca())
Attaching the CPO wrapper to a learner gives this learner a “wrap.cpo” hyperparameter that can be set to any CPO.
getParamSet(cpoWrap() %>>% makeLearner("classif.logreg"))
cpoCbind applies other CPOs and cbinds their results. The cbinder makes
it possible to build DAGs of CPOs that perform different operations on
data and paste the results next to each other. It is often useful to
combine cpoCbind with cpoSelect to filter out columns that would
otherwise be duplicated.
scale = cpoSelect(pattern = "Sepal", id = "first") %>>% cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scale, scale.pca, cpoSelect(pattern = "Petal", id = "second"))
cpoCbind recognises that "scale" happens before "pca", but is also fed
to the result directly. The verbose print draws a (crude) ascii-art
graph.
!cbinder
iris.demo %>>% cbinder
Even though CPOs are very flexible and can be combined in many ways, it
may be necessary to create completely custom CPOs. Custom CPOs can be
created using the makeCPO() and related functions. “Building Custom
CPOs” is a wide topic which has its own vignette.
- CPOs are built using CPOConstructors by calling them like functions.
- CPOConstructors can be found by using listCPO() or consulting the
  relevant vignette.
- Verbose printing of CPOs and many related objects is available using
  the ! (exclamation mark) operator.
- CPOs export hyperparameters that are accessible using getParamSet()
  and getHyperPars(), and mutable using setHyperPars(). Which parameters
  are exported can be controlled using the export parameter during
  construction.
- CPOs can be composed (composeCPO()), applied to data (applyCPO()) and
  attached to Learners (attachCPO()) using special functions for each of
  these operations, or using the general %>>% operator.
- There are three kinds of CPO: FOCPO (Feature Operation CPOs), TOCPO
  (Target Operation CPOs) and ROCPO (Retrafoless CPOs). The first may
  only change feature columns, the second only target columns. While the
  last one may change both feature and target values and even the number
  of rows of a dataset, it does so with the understanding that new
  “prediction” data will not be transformed by it and is thus mainly
  useful for subsampling.
- Data transformed by a CPO has a retrafo CPOTrained object associated
  with it that can be retrieved using retrafo() and used to transform
  new prediction data in a similar way as the original training data.
- CPOTrained objects can themselves be composed using composeCPO or
  %>>%, although it is only recommended to compose CPOTrained objects in
  the same order as they were created, and only if they were created in
  the same preprocessing pipeline.
- CPOTrained objects can be inspected using getCPOTrainedState(), and
  re-built with changed state using makeCPOTrainedFromState().
- Data transformed by a TOCPO also has an inverter object associated
  with it that can be retrieved using inverter(). An inverter is also
  created during application of a retrafo CPOTrained.
- Retrafo CPOTrained are created during training and used on every
  prediction data set; inverter CPOTrained are created anew during each
  CPO and retrafo CPOTrained application and are closely associated with
  the data that they were created with.
- CPOTrained objects associated with data are stored in their
  “attributes” and are automatically chained when more CPOs are applied.
  clearRI() is used to remove the associated CPOTrained objects and
  prevent this chaining.
- CPOs can be attached to Learners to get CPOLearners which
  automatically transform training and prediction data and perform
  prediction inversion.
- CPOLearners have the Learner’s and the CPO’s hyperparameters and can
  thus be manipulated using setHyperPars(), and can be tuned using
  tuneParams().
- Special CPOs are NULLCPO (the neutral element of %>>%), cpoMultiplex,
  cpoWrap, and cpoCbind.
- It is possible to create custom CPOs using makeCPO and similar
  functions. These are described in their own vignette.