In this document we give a high-level, relatively non-technical introduction to the functionality available in the cops
package for fitting multidimensional scaling (MDS; Borg & Groenen 2005) models that have an emphasis on providing a clustered configuration. We start with a short introduction to COPS and the models that we have available. We then explain fitting of these models with the cops
package and show how to fit those. For illustration we use the smacof::kinshipdelta
data set (Rosenberg, S. & Kim, M. P., 1975) which lists percentages of how often 15 kinship terms were not grouped together by college students.
For proximity scaling (PS) or multidimensional scaling (MDS) the input is typically an \(N\times N\) matrix \(\Delta^*=f(\Delta)\), a matrix of proximities with elements \(\delta^*_{ij}\), that is a function of a matrix of observed non-negative dissimilarities \(\Delta\) with elements \(\delta_{ij}\). \(\Delta^*\) usually is symmetric (but does not need to be). The main diagonal of \(\Delta\) is 0. We call a \(f: \delta_{ij} \mapsto \delta^*_{ij}\) a proximity transformation function. In the MDS literature these \(\delta_{ij}^*\) are often called dhats or disparities. The problem that proximity scaling solves is to locate an \(N \times M\) matrix \(X\) (the configuration) with row vectors \(x_i, i=1,\ldots,N\) in low-dimensional space \((\mathbb{R}^M, M \leq N)\) in such a way that transformations \(g(d_{ij}(X))\) of the fitted distances \(d_{ij}(X)=d(x_i,x_j)\)—i.e., the distance between different \(x_i, x_j\)—approximate the \(\delta^*_{ij}\) as closely as possible. We call a \(g: d_{ij}(X) \mapsto d_{ij}^*(X)\) a distance transformation function. In other words, proximity scaling means finding \(X\) so that \(d^*_{ij}(X)=g(d_{ij}(X))\approx\delta^*_{ij}=f(\delta_{ij})\).
This approximation \(D^*(X)\) to the matrix \(\Delta^*\) is found by defining a badness-of-fit criterion (loss function), \(\sigma_{MDS}(X)=L(\Delta^*,D^*(X);\Gamma(X))\), that is used to measure how closely \(D^*(X)\) approximates \(\Delta^*\), optionally subject to an additional criterion of the appearance of \(X\), \(\Gamma(X)\). The smaller the badness-of-fit, the better the fit is.
The loss function used is then minimized to find the vectors \(x_1,\dots,x_N\), i.e., \[\begin{equation} \label{eq:optim} \arg \min_{X}\ \sigma_{MDS}(X). \end{equation}\] There are a number of optimization techniques one can use to solve this optimization problem.
Usually, we use the quadratic loss function. A general formulation of a loss function based on a quadratic loss is known as stress (Kruskall 1964) and is \[\begin{equation} \label{eq:stress} \sigma_{MDS}(X)=\sum^N_{i=1}\sum^N_{j=1} z_{ij} w_{ij}\left[d^*_{ij}(X)-\delta^*_{ij}\right]^2=\sum^N_{i=1}\sum^N_{j=1} z_{ij}w_{ij}\left[g\left(d_{ij}(X)\right)-f(\delta_{ij})\right]^2 \end{equation}\] where we use some type of Minkowski distance (\(p > 0\)) as the distance fitted to the points in the configuration, \[\begin{equation} \label{eq:dist} d_{ij}(X) = ||x_{i}-x_{j}||_p=\left( \sum_{m=1}^M |x_{im}-x_{jm}|^p \right)^{1/p} \ i,j = 1, \dots, N. \end{equation}\] Typically, the norm used is the Euclidean norm, so \(p=2\). In standard MDS \(g(\cdot)=f(\cdot)=I(\cdot)\), the identity function. The \(w_{ij}\) and \(z_{ij}\) are finite weights, e.g., with \(z_{ij}=0\) if the entry is missing and \(z_{ij}=1\) otherwise.
This formulation enables one to express a large number of popular MDS methods with cops
. Generally, we allow to use specific choices for \(f(\cdot)\) and \(g(\cdot)\) from the family of power transformations so one can fit the following stress models:
For all of these models one can use the function powerStressMin
which uses majorization to find the solution (de Leeuw, 2014). The function allows to specify a kappa
, lambda
and nu
argument as well as a weightmat
(the \(w_{ij}\)), by setting the respective argument. For some models (those without transformations for the \(d_{ij}\)) one can use smacof::mds
.
The object returned from a call to powerStressMin
is of class smacofP
which extends the smacof
classes (de Leeuw & Mair, 2009) to allow for the power transformations. Apart from that the objects are made so that they have maximum compatibility to methods from smacof
. Accordingly, the following S3 methods are available:
Method | Description |
---|---|
Prints the object | |
summary | A summary of the object |
plot | 2D Plots of the object |
plot3d | Dynamic 3D configuration plot |
plot3dstatic | Static 3D configuration plot |
residuals | Residuals |
coef | Model Coefficients |
Let us illustrate the usage
stress
)res1<-powerStressMin(dis,kappa=1,lambda=1)
res1
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 1, lambda = 1)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.264
#> Number of iterations: 5352
sammon
mappingres2<-powerStressMin(dis,kappa=1,lambda=1,nu=-1,weightmat=dis)
res2
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 1, lambda = 1, nu = -1, weightmat = dis)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.29
#> Number of iterations: 50000
Alternatively, one can use the faster sammon
function from MASS
(Venables & Ripley, 2002) for which we provide a wrapper that adds class attributes and methods (and overloads the function).
res2a<-sammon(dis)
#> Initial stress : 0.17053
#> stress after 3 iters: 0.10649
res2a
#>
#> Call: sammon(d = dis)
#>
#> Model: Sammon Scaling
#> Number of objects: 15
#> Stress: 0.1064925
elastic
scalingres3<-powerStressMin(dis,kappa=1,lambda=1,nu=-2,weightmat=dis)
res3
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 1, lambda = 1, nu = -2, weightmat = dis)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.334
#> Number of iterations: 50000
sstress
modelres4<-powerStressMin(dis,kappa=2,lambda=2)
res4
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 2, lambda = 2)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.346
#> Number of iterations: 47130
rstress
model (with \(r=1\) as \(r=\kappa/2\))res5<-powerStressMin(dis,kappa=2,lambda=1)
res5
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 2, lambda = 1)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.404
#> Number of iterations: 21201
powermds
modelres6<-powerStressMin(dis,kappa=2,lambda=1.5)
res6
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 2, lambda = 1.5)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.367
#> Number of iterations: 50000
powersammon
modelres7<-powerStressMin(dis,kappa=2,lambda=1.5,nu=-1,weightmat=dis)
res7
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 2, lambda = 1.5, nu = -1,
#> weightmat = dis)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.436
#> Number of iterations: 50000
powerelastic
scalingres8<-powerStressMin(dis,kappa=2,lambda=1.5,nu=-2,weightmat=dis)
res8
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 2, lambda = 1.5, nu = -2,
#> weightmat = dis)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.526
#> Number of iterations: 50000
restricted powerstress
modelres9<-powerStressMin(dis,kappa=1.5,lambda=1.5,nu=-1.5,weightmat=2*1-diag(nrow(dis)))
res9
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 1.5, lambda = 1.5, nu = -1.5,
#> weightmat = 2 * 1 - diag(nrow(dis)))
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.322
#> Number of iterations: 8649
powerstress
modelres10<-powerStressMin(dis,kappa=2,lambda=1.5,nu=-1.5,weightmat=2*1-diag(nrow(dis)))
res10
#>
#> Call:
#> powerStressMin(delta = dis, kappa = 2, lambda = 1.5, nu = -1.5,
#> weightmat = 2 * 1 - diag(nrow(dis)))
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.367
#> Number of iterations: 50000
disms<-log(dis)
diag(disms)<-0
res11<-powerStressMin(disms,kappa=0.0001,lambda=1)
res11
#>
#> Call:
#> powerStressMin(delta = disms, kappa = 1e-04, lambda = 1)
#>
#> Model: Power Stress SMACOF
#> Number of objects: 15
#> Stress-1 value: 0.076
#> Number of iterations: 194
Different ways to plot results are
Another popular type of MDS supported by cops
is based on the loss function type . Here the \(\Delta^*\) are a transformation of the \(\Delta\), \(\Delta^*= f (\Delta)\) so that \(f(\cdot)=-(h\circ l)(\cdot)\) where \(l\) is any function and \(h(\cdot)\) is a double centering operation, \(h(\Delta)=\Delta-\Delta_{i.}-\Delta_{.j}+\Delta_{..}\) where \(\Delta_{i.}, \Delta_{.j}, \Delta_{..}\) are matrices consisting of the row, column and grand marginal means respectively. These then get approximated by (functions of) the inner product matrices of \(X\) \[\begin{equation}
\label{eq:dist2}
d_{ij}(X) = \langle x_{i},x_{j} \rangle
\end{equation}\] We can thus express classical scaling as a special case of the general PS loss with \(d_{ij}(X)\) as an inner product, \(g(\cdot) = I(\cdot)\) and \(f(\cdot)=-(h \circ I)(\cdot)\).
If we again allow power transformations for \(g(\cdot)\) and \(f(\cdot)\) one can fit the following strain models with cops
In stops
we have a wrapper to cmdscale
(overloading the base
function) which extend functionality by offering an object that matches smacofP
objects with corresponding methods.
Let us illustrate the usage. A powerstrain
model is rather easy to fit with simply subjecting the dissimilarity matrix to some power. Here we use \(\lambda=3\).
resc<-cmdscale(kinshipdelta^3)
resc
#>
#> Call: cmdscale(d = kinshipdelta^3)
#>
#> Model: Torgerson-Gower Scaling
#> Number of objects: 15
#> GOF: 0.4257747 0.6281985
The models listed above are also available as dedicted wrapper functions with a cop_
prefix
Function | Model |
---|---|
cop_cmdscale |
Strain/Powerstrain |
cop_smacofSym |
Stress |
cop_smacofSphere |
Smacof on a sphere |
cop_sammon ,cop_sammon2 |
Sammon scaling |
cop_elastic |
Elastic scaling |
cop_sstress |
S-stress |
cop_rstress |
r-stress |
cop_powermds |
Powermds |
cop_powersammon |
Sammon scaling with powers |
cop_powerelastic |
Elastic scaling with powers |
cop_apstress |
Approximate power stress |
cop_powerstress |
Powerstress |
cop_rpowerstress |
Restricted Powerstress |
The main contribution of the cops
package is not in solely fitting the powerstress or powerstrain models and their variants from above, but to augment the badness-of-fit function to achieve a “structured” MDS result automatically (in the sense of clusters or discrete structures). This can be useful mainly for exploring or generating discrete structures or to preserve clusters.
For this an MDS loss function is augmented to include a penalty. This combination of an MDS loss with a clusteredness penalty is what we call “cluster optimized stress” (copstress) and the resulting MDS is coined “Cluster Optimized Proximity Scaling” (or COPS). This is a multi-objective optimization problem as we want to simultaneously minimize badness-of-fit and maximize clusteredness. The computational problem is solved by combining the two, but interpretation should happen individually with the badness-of-fit and clusteredness values respectively.
We allow two ways of how copstress can be used: In one variant (COPS-C) one looks for an optimal configuration \(X^*\) directly, given the transformation parameters. This yields a configuation that has a more clustered appearance than the standard MDS with the same tranformation parameters. In the other (P-COPS) we automatically select optimal transformation parameters and then solve the respective transformed MDS so that the clusteredness appearance of the configuation is improved.
Here we combine a normalized stress function \(\sigma'_{\text{stress}}(X|\theta)\) given stress hyperparameter vector \(\theta\) and a measure of clusteredness, the OPTICS cordillera (\(\text{OC}'(X)\)) to the following objective \[\begin{equation} \label{eq:spstressv1} \sigma_{\text{PS}}(X|\theta)=\text{copstress}(X|\theta) = v_1 \cdot \sigma'_{\text{stress}}(X|\theta) - v_2 \cdot \text{OC}_\gamma'(X), \end{equation}\] with scalarization weights \(v_1,v_2 \in \mathbb{R}_+\), which is minimized over all possible \(X\).
In COPS-C the parameters \(\theta, v_1, v_2\) and \(\gamma\) are all treated as given. Minimizing copstress in this variant jitters the configuration towards a more clustered arrangement, the strength of which is governed by the values of \(v_1, v_2\). We recommend to use the convex combination \(v_2=1-v_1\) with \(0 \leq v_1 \leq 1\). For a given \(\theta\), if \(v_2=0\) the result of the above equation is the same as solving the respective stress problem.
COPS-C can be used with many different transformations including ratio, non-metric (ordinal), interval and power transformations (see below). If the \(\sigma'_{\text{stress}}(X|\theta)\) allows for different transformation of dissimilarities and distances (e.g., powerstress), we expect researchers and practictioners to start from identic transformations. If need arises, e.g., to avoid a problem of near-indifferentiation, one can exploit the flexibility of employing different transformations. For that case we point out that the configuration may then represent a relation that is somewhat further apart of the main aim in MDS of faithfully reproducing the dissimilarities by distances in a comparable space but may allow some desired aspects to be revealed in a graphical representation.
COPS-C can be used either for improving c-clusteredness for a given initial MDS configuration (which may then be only locally optimal) or for looking for the globally near-optimal COPS-C configuration (with different starting configurations, see below).
COPS-C with copstressMin
needs the mandatory argument delta
which is the dissimilarity matrix and some optional additional arguments which we descirbe below.
The default COPS-C (ratio MDS) can already be fit as
cc.res1<-copstressMin(kinshipdelta)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
cc.res1
#>
#> Call: copstressMin(delta = kinshipdelta)
#>
#> Model: ratio COPS-C with parameter vector = 1 1 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.2689719
#> OPTICS Cordillera: Raw 4.450995 Normed 0.2771449
#> Cluster optimized loss (copstress): 0.255319
#> Stress weight: 0.975 OPTICS Cordillera weight: 0.025
#> Number of iterations of hjk-Newuoa optimization: 5057
A number of plots are available.
plot(cc.res1,"confplot")
plot(cc.res1,"Shepard")
plot(cc.res1,"transplot")
plot(cc.res1,"reachplot")
The print function outputs information about the COPS-C model. In this case we fitted a ratio COPS-C that uses the standard MDS stress (all transformation parameters 1). There were 15 objects and the square root of stress of the configuration is 0.268 (compared to 0.267 for standard MDS, see above). The normed OPTICS cordillera value is 0.245, compared to 0.13 for standard MDS (with 0 being no clusteredness and 1 perfect clusteredness, see below). We also get information on \(v_1\) and \(V_2\) which were 0.975 and 0.025 respectively (the default values). The copstress value is 0.255, but we stress that this isn’t particulalry important for interpretation.
The values that we should interpret are the stress and the cordillera. We see that the badness-of-fit for the COPS-C configuration is a bit higher (which is to be expected due to the penalization) and also that clusteredness increased by quite a bit. This is also evident in the Procrustes plot (grey is standard MDS, coral is COPS-C).
cc.res0<-copstressMin(kinshipdelta,stressweight=1,cordweight=0)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
Specifically, the clusters of “Sister, Daughter, Mother” and “Son, Brother, Father” as well as “Grandson, Grandfather” are a bit more compact for the COPS-C result as compared to the standard MDS, and “Cousin” has been moved slightl towards “Uncle, Nephew”. At the same time, the fit is almost equal (0.268 for COPS-C vs. 267 for MDS).
We can also look at the clusteredness situation with the OPTICS reachability plot, which shows more clusteredness for COPS-C (a stronger up and down of the black line over the reachabilities). Next to the more compact clusters (deeper valleys) the main difference for COPS-C is that with the default minimum points that must form a cluster being 3 (default) in this case means that cousin is now also part of a three object cluster (with low density) and not a noise point as in standard MDS.
par(mfrow=c(1,2))
plot(cc.res0,plot.type="reachplot",main="standard MDS")
plot(cc.res1,plot.type="reachplot",main="COPS-C")
In the above example, we used the default setup, but there are many ways to customize the COPS-C model.
itmax
itmax
argument (defaults to 5000). If it is low, a warning is returned but that should usually be rather inconsequential. Let’s set the iterations to 20000 (where the warning no longer appears but the copstress value is only a slightly lower). If one values accuracy over computation time, then a higher value is preferable.
cc.res1.20000<-copstressMin(kinshipdelta,itmax=20000)
cc.res1.20000
#>
#> Call: copstressMin(delta = kinshipdelta, itmax = 20000)
#>
#> Model: ratio COPS-C with parameter vector = 1 1 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.2688849
#> OPTICS Cordillera: Raw 4.158948 Normed 0.2589604
#> Cluster optimized loss (copstress): 0.2556887
#> Stress weight: 0.975 OPTICS Cordillera weight: 0.025
#> Number of iterations of hjk-Newuoa optimization: 7510
ndim
ndim
argument, where ndim=N
. Default is a 2D space, so ndim=2
. Let’s do a COPS-C in a 3D target space.
cc.res1a<-copstressMin(kinshipdelta,ndim=3)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
cc.res1a$conf
#> D1 D2 D3
#> Aunt -0.48874303 1.1551002 -0.49427552
#> Brother 0.15515569 -0.9667926 1.43724500
#> Cousin -0.01435047 1.2417592 1.20566988
#> Daughter -1.40187543 -1.0030576 0.12641798
#> Father 0.55751471 -1.6206786 0.37593158
#> Granddaughter -0.90888198 -0.4174884 -1.28916497
#> Grandfather 1.08507157 -0.7168600 -1.15601642
#> Grandmother -0.28646751 -0.4538134 -1.65900934
#> Grandson 1.18489829 -0.9077834 -0.53749148
#> Mother -1.22874495 -1.3310337 -0.31962302
#> Nephew 0.95613455 0.4974767 0.74413737
#> Niece -0.90885703 0.9817580 0.06437843
#> Sister -1.48684371 -0.5806758 0.77347845
#> Son 0.16048108 -1.5474696 0.86244788
#> Uncle 0.96129163 0.8282016 0.16845902
minpts
minpts
. Default is ndim+1
, which is typically 3 but should really be selected based on substantive considerations (and must be \(\geq 2\)). It can also be varied in different runs to explore the clusteredness structure. If we set minpts=2
, we see that the two object clusters are pushed more towards each other.
cc.res2<-copstressMin(kinshipdelta,minpts=2)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
plot(cc.res2)
stressweight
and cordweight
stressweight
and cordweight
. They encode how strong stress and cordillera should respectively be weighted for the scalarization. The higher stressweight
is in relation to cordweight
the more weight is put on stress (so a more faithful representation to the MDS result). The default values are stressweight=0.975
and cordweight=0.025
. We suggest to put much more weight on stress to not create an articifical configuration. Let’s look at the effect of changing it to stressweight=0.8
, cordweight=0.2
— we see we have much more clusteredness now (0.73) but badness-of-fit has also ramped up a lot to 0.33 and the representation may no longer be very faithful to the real dissimilarities.
cc.res3<-copstressMin(kinshipdelta,stressweight=0.8,cordweight=0.2)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
cc.res3
#>
#> Call: copstressMin(delta = kinshipdelta, stressweight = 0.8, cordweight = 0.2)
#>
#> Model: ratio COPS-C with parameter vector = 1 1 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.3263199
#> OPTICS Cordillera: Raw 11.78173 Normed 0.7335991
#> Cluster optimized loss (copstress): 0.1143361
#> Stress weight: 0.8 OPTICS Cordillera weight: 0.2
#> Number of iterations of hjk-Newuoa optimization: 5057
weightmat
weightmat
. This must be a matrix of the same dimensions as the dissimilarity argument delta
. Let’s say we found out that there was a study error where comparing cousins with Aunt was messed up, so we want ot ignore that dissimilarity .
weights<-1-diag(nrow(kinshipdelta)) #set up the weights
row.names(weights)<-colnames(weights)<-row.names(kinshipdelta) #label the rows/columns
weights[c("Cousin","Aunt"),c("Aunt","Cousin")]<-0 #set the Aunt/Cousin dissimilarity to 0
cc.res3a<-copstressMin(kinshipdelta,weightmat=weights)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
cc.res3a
#>
#> Call: copstressMin(delta = kinshipdelta, weightmat = weights)
#>
#> Model: ratio COPS-C with parameter vector = 1 1 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.2645676
#> OPTICS Cordillera: Raw 4.212205 Normed 0.2622765
#> Cluster optimized loss (copstress): 0.2513965
#> Stress weight: 0.975 OPTICS Cordillera weight: 0.025
#> Number of iterations of hjk-Newuoa optimization: 5054
kappa
, lambda
, nu
and theta
kappa
, lambda
, nu
and theta
all allow to fit power transformations (if a theta
is given it overrides the other values), with kappa
being the distance power transformation, lambda
the proximity power transformation and nu
the power transformation for the weights. theta
is a vector collecting c(kappa,lambda,nu)
. Let’s fit an s-stress COPS-C.
cc.res4<-copstressMin(kinshipdelta,kappa=2,lambda=2)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
cc.res4
#>
#> Call: copstressMin(delta = kinshipdelta, kappa = 2, lambda = 2)
#>
#> Model: ratio COPS-C with parameter vector = 2 2 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.3467388
#> OPTICS Cordillera: Raw 7.414909 Normed 0.3675183
#> Cluster optimized loss (copstress): 0.3288824
#> Stress weight: 0.975 OPTICS Cordillera weight: 0.025
#> Number of iterations of hjk-Newuoa optimization: 5057
cc.res4a<-copstressMin(kinshipdelta,theta=c(2,2,1))
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : maxfun < 10 * length(par)^2 is not recommended.
cc.res4a
#>
#> Call: copstressMin(delta = kinshipdelta, theta = c(2, 2, 1))
#>
#> Model: ratio COPS-C with parameter vector = 2 2 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.346754
#> OPTICS Cordillera: Raw 7.422733 Normed 0.367906
#> Cluster optimized loss (copstress): 0.3288875
#> Stress weight: 0.975 OPTICS Cordillera weight: 0.025
#> Number of iterations of hjk-Newuoa optimization: 5057
dis
is the observed dissimilarity matrix)
type="ratio"
and kappa=1
, lambda=1
(default model).type="ordinal"
with different handling of ties
("primary"
, "secondary"
, "tertiary"
. See ?smacof::mds
) .type="interval"
.type="ratio"
andkappa=2
and lambda=2
.type="ratio"
and kappa
and lambda
to the desired values.weightmat=dis
, nu=-1
(for all types).weightmat=dis
, nu=-2
(for all types).kappa
close to zero (say kappa=0.0001
) and manually transforming disms<-log(dis); diag(dism)<-0
and then set the argument delta=disms
.Some options can also be combined. Note that it is currently not possible to use transformation parameters with interval and non-metric MDS.
Let’s fit a non-metric elastic scaling COPS-C model with secondary handling of ties.
cc.res5<-copstressMin(kinshipdelta,type="ordinal",ties="secondary",weightmat=as.matrix(kinshipdelta),nu=-2)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
cc.res5
#>
#> Call: copstressMin(delta = kinshipdelta, nu = -2, type = "ordinal",
#> ties = "secondary", weightmat = as.matrix(kinshipdelta))
#>
#> Model: ordinal (secondary) COPS-C with parameter vector = 1 1 1
#>
#> Number of objects: 15
#> Stress of configuration (default normalization): 0.1953469
#> OPTICS Cordillera: Raw 3.729241 Normed 0.2322043
#> Cluster optimized loss (copstress): 0.1846582
#> Stress weight: 0.975 OPTICS Cordillera weight: 0.025
#> Number of iterations of hjk-Newuoa optimization: 5055
minpts
is the most important one and we already discussed that.
Additional parameters include q
which is the parameter for the \(L_p\)-norm of the OPTICS Cordillera and is typically 1 (default) or 2. A higher value of q
can be thought of as pronouncing the ups and downs relatively stronger.
The parameter epsilon
relates to the maximum neighbourhood radius around a point to look for possible other points in a cluster and also relates to the density that a cluster must have. It influences the number of points that are classified as noise by OPTICS and improves runtime of OPTICS the smaller it is. It is not a praticularly intuitive parameter but for most MDS application it should suffice to just set it “sufficiently large” so all points are considered as possible neighbours of each other. It should only be changed to a lower value if the concept of “noise points” is useful for a data set (e.g., objects that are not supposed to be in a cluster anyway).
Finally dmax
and rang
relate to the normalization and winsorization distance for the cordillera, essentially as the maximum distance between points that we still take into account. This can be used for make the index more robust to outliers in the configuration so that the algorithm doesn’t just achieve a higher index by placing some points very far away from the rest. If dmax
is NULL, the normalization is set to (0,1.5 x the maximum reachability distance for the torgerson model). If it is set too low, the normed cordillera value may be too high. Similarly, rang
alows to set the whole normalization interval and is (0,dmax) by default. If max(rang)
and dmax
do not agree a warning is printed and rang
takes precedence. These parameters can be used to explor different winsorization limits for robustness checks.
Let’s look at their effects. First we set q=2
and see that the effect of clusteredness is a bit more pronounced as compared to q=1
(because larger ups and downs in the cordillera are weighted a higher)
cc.res6<-copstressMin(kinshipdelta,q=2)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
plot(cc.res6,plot.type="reachplot")
Let’s lower epsilon
to 0.6 and minpts
to 2 which means that points that are beyond that distance are no longer possible to be considered as cluster members of each other which allows COPS to have “Sister” and “Brother” pushed out of their respective clusters of “Daughter, Mother” and “Father, Son” and have all those two object clusters really tightly packed. The single objects would now be noise points.
cc.res6a<-copstressMin(kinshipdelta,epsilon=0.6,minpts=2)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
plot(cc.res6a,plot.type="reachplot")
plot(cc.res6a)
Let’s also change dmax
to 1 to make the index more robust. The effect is that “Cousin” is now less far away from the rest.
cc.res6b<-copstressMin(kinshipdelta,dmax=1)
#> Warning in dfoptim::hjk(xold, function(par) copsf(par, delta = delta, disobj =
#> disobj, : Function evaluation limit exceeded -- may not converge.
#> Warning in commonArgs(par + 0, fn, control, environment()): maxfun < 10 *
#> length(par)^2 is not recommended.
plot(cc.res6b)
scale
scale
which influences the scale of the axis. In COPS we’re only interested in the relatvie placement of the objects rather than the scale, so the scale is somewhat aribtrary. It can be set to be sd
(divided by the largest standard deviation of any column; default), none
where no scaling is applied, proc
which deos procrustes adjustment of the final configuration to the starting configuration, rmsq
(configuration divided by the maximum root mean square of the columns) and std
which standardizes all columns (NOTE: this does not preserve the relative distances of the optimal configuration and should probably never have been implemented in the first place).
There are some more arguments which are described in ?copstressMin
.
cops
. The default is hjk-Newuoa
and will typically work quite well. Another good optimizer is CMA-ES
but that has a tendency to fail. See ?copstressMin
for the available solvers for the argument optimmethod
and the supplement to the original article for an empirical comparison.
The second variant of COPS uses the copstress to select the transformation parameters, so that when fitted as powerstress or any of the other badness-of-fit functions, the corresponding configuration has higher clusteredness than a standard MDS (there’s also a chance that the standard MDS will be selected). This can be thought of as a profile method as we use the copstress not for direct minimzation of the objective but as criterion for parameter selection and the minimization to obtain the configuration happens only with the unpenalized badness-of-fit function.
Let us write \(X(\theta)=\arg\min_X \sigma_{MDS}(X,\theta)\) for the optimal configuration for given transformation parameter vector \(\theta\). The objective function for parameter selection is again , and is again the weighted combination of the \(\theta-\)parametrized loss function, \(\sigma_{MDS}\left(X(\theta),\theta\right)\), and the c-clusteredness measure, the OPTICS cordillera or \(OC(X(\theta);\epsilon,k,q)\) but this time to be optimized as a function of \(\theta\) or \[\begin{equation} \label{eq:spstress} \text{coploss}(\theta) = v_1 \cdot \sigma_{MDS}\left(X(\theta),\theta \right) - v_2 \cdot \text{OC}\left(X(\theta);\epsilon,k,q\right) \end{equation}\] with \(v_1,v_2 \in \mathbb{R}\) controlling how much weight should be given to the badness-of-fit measure and c-clusteredness. In general \(v_1,v_2\) are either determined values that make sense for the application or may be used to trade-off fit and c-clusteredness in a way for them to be commensurable. In the latter case we suggest taking the fit function value as it is (\(v_1=1\)) and fixing the scale such that \(\text{copstress}=0\) for the scaling result with the identity transformation (\(\theta=\theta_0\)), i.e., \[\begin{equation} \label{eq:spconstant0} v^{0}_{1}=1, \quad v^{0}_2=\frac{\sigma_{MDS}\left(X(\theta_0),\theta_0\right)}{\text{OC}\left(X(\theta_0);\epsilon,k,q\right)}, \end{equation}\] with \(\theta_0=(1,1,1)^\top\) in case of loss functions with power transformations. Thus an increase of 1 in the MDS loss measure can be compensated by an increase of \(v^0_1/v^0_2\) in c-clusteredness. Selecting \(v_1=1,v_2=v^{0}_2\) this way is in line with looking for a parameter combination that would lead to a configuration that has a more clustered appearance relative to the standard MDS.
The optimization problem in P-COPS is then to find
\[\begin{equation}
\label{eq:soemdsopt2}
\arg\min_{\theta} \text{coploss}(\theta)
\end{equation}\] by evaluating \[\begin{equation}
\label{eq:soemdsopt}
v_1 \cdot \sigma_{MDS}\left(X(\theta),\theta\right) - v_2 \cdot \text{OC}\left(X(\theta);\epsilon,k,q\right) \rightarrow \min_\theta!
\end{equation}\] For a given \(\theta\), if \(v_2=0\) then the result of optimizing the above is the same as solving the respective original MDS problem. Letting \(\theta\) vary, \(v_2=0\) will minimize the loss over configurations obtained from different \(\theta\).
The dedicated function for P-COPS is called pcops. The two main arguments are again the dissimilarity matrix and the MDS model that should be used (loss). Then pcops optimizes over \(\theta\), with the values given in theta used as starting parameters (if not given, they are all 1).
pcops(diss,loss,...)
For the example we can use a P-COPS model for classical scaling with power transformations of the dissimilarities (the strain or powerstrain loss):
set.seed(666)
resc<-pcops(kinshipdelta,loss="strain",minpts=2)
resc
#>
#> Call: pcops(dis = kinshipdelta, loss = "strain", minpts = 2)
#>
#> Model: P-COPS with strain loss function and theta parameter vector = 1 1.497828 1
#>
#> Number of objects: 15
#> MDS loss value: 0.4524693 0.5844686
#> OPTICS Cordillera: Raw 3.624134 Normed 0.3980467
#> Cluster optimized loss (copstress): -0.01919072
#> MDS loss weight: 1 OPTICS Cordillera weight: 0.3124777
#> Number of iterations of ALJ optimization: 49
The transformation parameter selected for the dissimilarities is 1.498 (in strain/powerstrain only the dissimilarities are subjected to a power transformation). The resulting badness-of-fit value is 0.45 (this is not a stress value; see cmdscale for its interpretation) and the normed c-clusteredness value is 0.398.
A number of plots are available. The loss functions currently supported in pcops are:

- stress, smacofSym: Kruskal's stress; workhorse: smacofSym; optimization over \(\lambda\)
- smacofSphere: Kruskal's stress for projection onto a sphere; workhorse: smacofSphere; optimization over \(\lambda\)
- strain, powerstrain: classical scaling; workhorse: cmdscale; optimization over \(\lambda\)
- sammon, sammon2: Sammon scaling; workhorse: sammon or smacofSym; optimization over \(\lambda\)
- elastic: elastic scaling; workhorse: smacofSym; optimization over \(\lambda\)
- sstress: S-stress; workhorse: powerStressMin; optimization over \(\lambda\)
- rstress: r-stress; workhorse: powerStressMin; optimization over \(\kappa\)
- powermds: MDS with powers; workhorse: powerStressMin; optimization over \(\kappa\), \(\lambda\)
- powersammon: Sammon scaling with powers; workhorse: powerStressMin; optimization over \(\kappa\), \(\lambda\)
- powerelastic: elastic scaling with powers; workhorse: powerStressMin; optimization over \(\kappa\), \(\lambda\)
- apstress: approximate power stress model; workhorse: smacofSym; optimization over \(\lambda\), \(\nu\)
- rpowerstress: restricted power stress model; workhorse: powerStressMin; optimization over \(\kappa\) and \(\lambda\) together (restricted to be equal), and \(\nu\)
- powerstress: power stress model (POST-MDS); workhorse: powerStressMin; optimization over \(\kappa\), \(\lambda\), and \(\nu\)

Note: anything that uses powerStressMin as workhorse is a bit slow.
It is also possible to use the pcops function for finding the loss-optimal transformation in the non-augmented models specified in loss, by setting cordweight, the weight of the OPTICS cordillera, to 0. Then the function optimizes the transformation parameters based on the MDS loss function only.
set.seed(666)
resca<-pcops(kinshipdelta,cordweight=0,loss="strain",minpts=2)
resca
#>
#> Call: pcops(dis = kinshipdelta, loss = "strain", cordweight = 0, minpts = 2)
#>
#> Model: P-COPS with strain loss function and theta parameter vector = 1 1.583679 1
#>
#> Number of objects: 15
#> MDS loss value: 0.4531307 0.5920835
#> OPTICS Cordillera: Raw 3.59912 Normed 0.3952994
#> Cluster optimized loss (copstress): 0.1047585
#> MDS loss weight: 1 OPTICS Cordillera weight: 0
#> Number of iterations of ALJ optimization: 55
Here the result is similar to the one obtained with the standard cordweight suggestion. We can give more weight to the c-clusteredness, though:
set.seed(666)
rescb<-pcops(kinshipdelta,cordweight=20,loss="strain",minpts=2)
rescb
#>
#> Call: pcops(dis = kinshipdelta, loss = "strain", cordweight = 20, minpts = 2)
#>
#> Model: P-COPS with strain loss function and theta parameter vector = 1 1 1
#>
#> Number of objects: 15
#> MDS loss value: 0.4394143 0.5199867
#> OPTICS Cordillera: Raw 3.781425 Normed 0.4153223
#> Cluster optimized loss (copstress): -8.176666
#> MDS loss weight: 1 OPTICS Cordillera weight: 20
#> Number of iterations of ALJ optimization: 61
This result has more c-clusteredness but less goodness-of-fit. The higher c-clusteredness is discernible in the Grandfather/Brother and Grandmother/Sister clusters (we used a minimum of 2 observations to make up a cluster, minpts=2).
As in COPS-C we have a number of parameters to guide and change the behaviour of P-COPS. Many are the same as those explained in the COPS-C section, including minpts, weightmat, ndim, init, stressweight, cordweight, q, epsilon, rang, and scale; see the description there.
lower and upper

The lower and upper arguments are the boundaries of the search space in which to look for the parameters. They need to be of the same length as the theta argument. Naturally, the larger the search space, the longer it can take to find the optimal parameters. Default values are lower = c(1, 1, 0.5) and upper = c(5, 5, 2). Note that this can also be used to set a quasi-restriction on parameters if there is no canned loss function that does that: we just set the boundaries very close together. Say we would like to use power stress and search for optimal \(\kappa\) and \(\lambda\) but fix \(\nu\) at approximately \(-2.5\); we can then set lower = c(0,0,-2.5001) and upper = c(5,5,-2.4990) so \(\nu\) will be searched for only in the narrow band \((-2.5001,-2.4990)\). Let’s change the search space to include values between \(0.1\) and \(1.6\) (in the above example \(1.5\) was the optimal parameter).
set.seed(666)
rescc<-pcops(kinshipdelta,loss="strain",minpts=2,lower=c(0.1,0.1,0.1),upper=c(1.6,1.6,1.6))
rescc
#>
#> Call: pcops(dis = kinshipdelta, loss = "strain", minpts = 2, lower = c(0.1,
#> 0.1, 0.1), upper = c(1.6, 1.6, 1.6))
#>
#> Model: P-COPS with strain loss function and theta parameter vector = 1 1.497882 1
#>
#> Number of objects: 15
#> MDS loss value: 0.4524697 0.5844735
#> OPTICS Cordillera: Raw 3.624118 Normed 0.3980449
#> Cluster optimized loss (copstress): -0.01919073
#> MDS loss weight: 1 OPTICS Cordillera weight: 0.3124777
#> Number of iterations of ALJ optimization: 43
The optimal \(\lambda\) found is again around \(1.498\), resulting in a badness-of-fit of \(0.45\) and a c-clusteredness of \(0.398\), so extending the search space found no better transformation.
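The quasi-restriction mechanism described earlier can be illustrated with a toy bounded search in base R. The objective and all numbers here are stand-ins, not an MDS loss; the point is that clamping candidates to a very narrow band on the third parameter effectively pins it:

```r
# Toy objective standing in for a theta-parametrized loss
f <- function(th) sum((th - c(2, 1.5, -2.5))^2)

# Narrow band on the third parameter, as in the lower/upper example above
lower <- c(0, 0, -2.5001)
upper <- c(5, 5, -2.4990)

set.seed(1)
best <- runif(3, lower, upper)
for (i in 1:500) {
  # propose a move, then clamp it into the search box
  cand <- pmin(pmax(best + rnorm(3, sd = 0.2), lower), upper)
  if (f(cand) < f(best)) best <- cand
}
best[3]  # stays inside (-2.5001, -2.4990) by construction
```

Any bounded optimizer behaves the same way here: the third coordinate can never leave the band, so it is quasi-fixed at \(-2.5\).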
itmaxi and itmaxo

The number of iterations can be controlled with the itmaxi and itmaxo arguments. itmaxi (default 10000) refers to the maximum number of iterations for the inner part (the MDS optimization) and itmaxo (default 200) refers to the maximum number of iterations for the outer search that tries to find the optimal \(\theta\). The higher itmaxi is, the closer the configuration that is evaluated for copstress is to a local optimum; the higher itmaxo is, the more values for the transformation parameters will be tried (which also depends on the optimizer). Time-wise there is a trade-off between how deep we want to go (itmaxi) and how broad (itmaxo). In our experience itmaxi does not need to be very high and it is better to have a higher itmaxo, which is probably why, in one of life’s great mysteries, we set the default values exactly the other way round.
Let’s look at that in action; it does not change much compared to the default values (the optimal parameter is now \(1.499\)).
set.seed(666)
rescd<-pcops(kinshipdelta,loss="strain",minpts=2,itmaxi=2000,itmaxo=1000)
rescd
#>
#> Call: pcops(dis = kinshipdelta, loss = "strain", minpts = 2, itmaxo = 1000,
#> itmaxi = 2000)
#>
#> Model: P-COPS with strain loss function and theta parameter vector = 1 1.498609 1
#>
#> Number of objects: 15
#> MDS loss value: 0.4524749 0.58454
#> OPTICS Cordillera: Raw 3.623904 Normed 0.3980214
#> Cluster optimized loss (copstress): -0.01919076
#> MDS loss weight: 1 OPTICS Cordillera weight: 0.3124777
#> Number of iterations of ALJ optimization: 111
In pcops we use a nested algorithm that internally first solves for \(X\) given \(\theta\), \(\arg\min_X \sigma_{MDS}\left(X,\theta\right)\), and then optimizes over \(\theta\) with a metaheuristic. The metaheuristic can be chosen with the optimmethod argument. Implemented are simulated annealing (optimmethod="SANN"), particle swarm optimization (optimmethod="pso"), DIRECT (optimmethod="DIRECT"), DIRECT-L (optimmethod="DIRECTL"), mesh-adaptive direct search (optimmethod="MADS"), stochastic global optimization (optimmethod="stogo"), Hooke-Jeeves pattern search (optimmethod="hjk") and a variant of the Luus-Jaakola procedure (optimmethod="ALJ"). The default is "ALJ", which usually converges in fewer than 200 iterations to an acceptable solution.
We have listed the ways the behavior of COPS models can be changed in this document. It might have occurred to you that there are a lot of options to choose from. We believe that more options and flexibility are generally better, especially in an exploratory setting, but that puts the user on the spot of making their own decisions, which not everyone likes (many seem to prefer the apparent security of not needing to make them). So, we want to share what has emerged as best practice in our experience.
Think carefully about the minimum number of points that should comprise a cluster (the minpts
argument). If there are 5000 objects, a minimum number of points of \(2\) will likely not be very illuminating. This decision depends on substantive considerations.
The scalarization weights trade off badness-of-fit against clusteredness. Since we typically want a representation that is faithful, we recommend starting out with a stressweight that is much larger than cordweight (say \(v_1/v_2>7\)). Then one can successively increase the relative cordweight to about \(v_1/v_2>3\) if necessary. For typical use cases we would not recommend going below this ratio.
When using power transformations, it is best to start out with equal powers for both distances and dissimilarities and allow for different ones only when necessary. In COPS-C that would be set manually and in P-COPS the rpowerstress
loss can be used.
In a standard use case without much idea about the range of distances, we’d set epsilon for the OC high and dmax to about \(1\)-\(1.5\) times the largest reachability value that is smaller than the dmax that results from applying the OC to a standard MDS configuration for the same data, e.g.,
nullmod<-smacof::mds(kinshipdelta) #standard MDS
c0<-cordillera(nullmod$conf,q=1,epsilon=10,minpts=2) #the reachabilities are in $reachplot
dm<-1.5*max(c0$reachplot[c0$reachplot<c0$dmax]) #1.5 times the maximum reachability that is smaller than dmax
dm #the dmax to be used for COPS
#> [1] 1.584628
Staying true to the exploratory nature, trying out different setups and comparing them is a good idea, especially with respect to the cordillera parameters and scalarization weights.
We envisioned, tested, and applied the functions in the cops package for small to moderate data sizes (up to around 200 objects). The more objects there are, the more difficult the problem becomes, both with respect to finding the optima and the time it takes to find them. It can also happen that the COPS result is not very illuminating with a large number of objects. This is ongoing research, so use at your own risk. We are always interested in hearing about experiences, though, especially if something goes awry.
Borg I, Groenen PJ (2005). Modern multidimensional scaling: Theory and applications. 2nd edition. Springer, New York.
Buja A, Swayne DF, Littman ML, Dean N, Hofmann H, Chen L (2008). Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, 17 (2), 444-472.
Chen L, Buja A (2013). Stress functions for nonlinear dimension reduction, proximity analysis, and graph drawing. Journal of Machine Learning Research, 14, 1145-1173.
de Leeuw J (2014). Minimizing r-stress using nested majorization. Technical Report, UCLA, Statistics Preprint Series.
de Leeuw J, Mair P (2009). Multidimensional Scaling Using Majorization: SMACOF in R. Journal of Statistical Software, 31 (3), 1-30.
Kruskal JB (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29 (1), 1-27.
Luus R, Jaakola T (1973). Optimization by direct search and systematic reduction of the size of search region. American Institute of Chemical Engineers Journal (AIChE), 19 (4), 760-766.
McGee VE (1966). The multidimensional analysis of ‘elastic’ distances. British Journal of Mathematical and Statistical Psychology, 19 (2), 181-196.
Rosenberg S, Kim MP (1975). The method of sorting as a data gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489-502.
Rusch T, Mair P, Hornik K (2015a). COPS: Cluster Optimized Proximity Scaling. Discussion Paper Series / Center for Empirical Research Methods, 2015/1. WU Vienna University of Economics and Business, Vienna.
Sammon JW (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18 (5), 401-409.
Takane Y, Young F, de Leeuw J (1977). Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika, 42 (1), 7-67.
Torgerson WS (1958). Theory and methods of scaling. Wiley.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S. Fourth edition. Springer, New York.