| Encoding: | UTF-8 | 
| Type: | Package | 
| Title: | Cluster Analysis with Missing Values by Multiple Imputation | 
| Version: | 1.5 | 
| Date: | 2025-02-24 | 
| Description: | Allows clustering of incomplete observations by addressing missing values using multiple imputation. For achieving this goal, the methodology consists in three steps, following Audigier and Niang 2022 <doi:10.1007/s11634-022-00519-1>. I) Missing data imputation using dedicated models. Four multiple imputation methods are proposed, two are based on joint modelling and two are fully sequential methods, as discussed in Audigier et al. (2021) <doi:10.48550/arXiv.2106.04424>. II) cluster analysis of imputed data sets. Six clustering methods are available (distances-based or model-based), but custom methods can also be easily used. III) Partition pooling. The set of partitions is aggregated using Non-negative Matrix Factorization based method. An associated instability measure is computed by bootstrap (see Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>). Among applications, this instability measure can be used to choose a number of clusters with missing values. The package also proposes several diagnostic tools to tune the number of imputed data sets, to tune the number of iterations in fully sequential imputation, to check the fit of imputation models, etc. | 
| License: | GPL-2 | GPL-3 | 
| Depends: | R (≥ 3.5.0) | 
| Imports: | stats, graphics, parallel, mice, micemd, mclust, mix, fpc, knockoff, withr, glmnet, ClusterR, FactoMineR, diceR, NPBayesImputeCat, e1071, Rfast, cat, utils, ggplot2, gridExtra, reshape2, methods, Rcpp | 
| RoxygenNote: | 7.3.2 | 
| Suggests: | knitr, rmarkdown, stargazer, VIM, missMDA, clustrd, clusterCrit, bookdown, R.rsp | 
| VignetteBuilder: | R.rsp | 
| LazyData: | true | 
| LinkingTo: | Rcpp, RcppArmadillo | 
| RcppModules: | IO_module | 
| NeedsCompilation: | yes | 
| Packaged: | 2025-02-24 17:02:47 UTC; vince | 
| Author: | Vincent Audigier [aut, cre] (CNAM MSDMA team), Hang Joon Kim [ctb] (University of Cincinnati) | 
| Maintainer: | Vincent Audigier <vincent.audigier@cnam.fr> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-02-24 17:20:08 UTC | 
clusterMI: Cluster Analysis with Missing Values by Multiple Imputation
Description
clusterMI is an R package to perform clustering with missing values. For achieving this goal, multiple imputation is used.
The package offers various multiple imputation methods dedicated to clustered individuals, as discussed in Audigier et al. (2021) <arXiv:2106.04424>.
In addition, it allows pooling results both in terms of partition and instability, as proposed in Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1>. Among applications, this instability measure can be used to choose a number of clusters with missing values.
References
Audigier, V. and Niang, N., Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>, 2022.
Audigier, V., Niang, N., & Resche-Rigon, M. (2021). Clustering with missing data: which imputation model for which cluster analysis method?. arXiv preprint <arXiv:2106.04424>.
Audigier, V. (2024). Clustering on incomplete data using clusterMI. 10emes Rencontres R, Vannes, France. hal preprint <hal:04550305>.
Bar-Hen, A. and Audigier, V., An ensemble learning method for variable selection: application to high dimensional data and missing values, Journal of Statistical Computation and Simulation, <doi:10.1080/00949655.2022.2070621>, 2022.
Fang, Y. and Wang, J., Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477 <doi:10.1016/j.csda.2011.09.003> 2012.
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
# imputation
m <- 5 # number of imputed data sets. Should be larger in practice
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
# cluster analysis by kmeans and pooling
nnodes <- 2 # Number of CPU cores for parallel computing
res.pool <- clusterMI(res.imp, nnodes = nnodes)
res.pool$instability
table(ref, res.pool$part)
# choice of nb.clust
res.nbclust <- choosenbclust(res.pool)
res.nbclust$nb.clust
Class "Rcpp_modelobject"
Description
A class required when using the imputedata function with argument method = "JM-DP"
Extends
Class "C++Object", directly.
All reference classes extend and inherit methods from "envRefClass".
Fields
- .Mu:
- Object of class - activeBindingFunction~~
- .Z_vec:
- Object of class - activeBindingFunction~~
- .a_Phi:
- Object of class - activeBindingFunction~~
- .a_alpha:
- Object of class - activeBindingFunction~~
- .alpha:
- Object of class - activeBindingFunction~~
- .b_Phi:
- Object of class - activeBindingFunction~~
- .b_alpha:
- Object of class - activeBindingFunction~~
- .cube_UT_cholSigma:
- Object of class - activeBindingFunction~~
- .h_0:
- Object of class - activeBindingFunction~~
- .logpi:
- Object of class - activeBindingFunction~~
- .missing_flag_mat:
- Object of class - activeBindingFunction~~
- .vec_Phi:
- Object of class - activeBindingFunction~~
- .where_we_are:
- Object of class - activeBindingFunction~~
- Y_mat:
- Object of class - activeBindingFunction~~
- max_Y_obs:
- Object of class - activeBindingFunction~~
- min_Y_obs:
- Object of class - activeBindingFunction~~
- msg_level:
- Object of class - activeBindingFunction~~
- vec_HyperParameters:
- Object of class - activeBindingFunction~~
Methods
- initialize(...):
- ~~ 
- finalize():
- ~~ 
- Run(...):
- ~~ 
- Iterate():
- ~~ 
Note
From the R package DPImputeCont https://github.com/hang-j-kim/DPImputeCont.
Author(s)
Kim, Hang Joon. University of Cincinnati.
Examples
showClass("Rcpp_modelobject")
Compute Silhouette index
Description
Compute Silhouette index
Usage
Silhouette.intern(d, cl)
Arguments
| d | a distance matrix | 
| cl | a vector of integer defining the partition | 
Value
a real
Diagnostic plot for the number of iterations used in the varselbest function
Description
chooseB plots the proportion of times an explanatory variable is selected according to the number of iterations (B).
Usage
chooseB(
  res.varselbest,
  plotvar = NULL,
  linewidth = 1,
  linetype = "dotdash",
  xlab = "B",
  ylab = "Proportion",
  nrow = 2,
  ncol = 2,
  graph = TRUE
)
Arguments
| res.varselbest | an output from the varselbest function | 
| plotvar | index of variables for which a curve is ploted | 
| linewidth | a numerical value setting the widths of lines | 
| linetype | what type of plot should be drawn | 
| xlab | a title for the x axis | 
| ylab | a title for the y axis | 
| nrow | argument of gtable. Default value is 2. | 
| ncol | argument of gtable. Default value is 2. | 
| graph | a boolean. If FALSE, no graphics are ploted. Default value is TRUE | 
Details
varselbest performs variable selection on random subsets of variables and, then, combines them to recover which explanatory variables are related to the response, following Bar-Hen and Audigier (2022) <doi:10.1080/00949655.2022.2070621>.
More precisely, the outline of the algorithm are as follows: let consider a random subset of sizeblock among p variables.
Then,  any selection variables scheme can be applied.
By resampling B times, a sample of size sizeblock among the p variables, we may count how many times a variable is considered as significantly related to the response and how many times it is not.
The number of iterations B should be large so that the proportion of times a variable is selected becomes stable. chooseB plots the values of proportion according to the number of iterations.
Value
a list of matrices where each row corresponds to the vector of proportions (for all explanatory variables) obtained for a given value of B
References
Bar-Hen, A. and Audigier, V., An ensemble learning method for variable selection: application to high dimensional data and missing values, Journal of Statistical Computation and Simulation, <doi:10.1080/00949655.2022.2070621>, 2022.
See Also
Examples
data(wine, package = "clusterMI")
require(parallel)
ref <- wine$cult
nb.clust <- 3
wine.na<-wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
nnodes <- 2 # Number of CPU cores for parallel computing
B <- 80 # Number of iterations for variable selection
# variable selection
res.varsel <- varselbest(data.na = wine.na,
                        listvar = "alco",
                        B = B,
                        nnodes = nnodes,
                        nb.clust = nb.clust,
                        graph = FALSE)
# convergence
res.chooseB <- chooseB(res.varsel)
Tune the number of iterations for variable selection using varselbest
Description
Tune the number of iterations for variable selection using varselbest
Usage
chooseB.intern(
  res.varselbest,
  linewidth = 1,
  linetype = "dotdash",
  xlab = "B",
  ylab = "Proportion",
  graph = TRUE,
  title = NULL
)
Arguments
| res.varselbest | an output from the varselbest function | 
| linewidth | a numerical value setting the widths of lines | 
| linetype | what type of plot should be drawn | 
| xlab | a title for the x axis | 
| ylab | a title for the y axis | 
| graph | a boolean. If FALSE, no graphics are ploted. Default value is TRUE | 
| title | the main title | 
References
Bar-Hen, A. and Audigier, V. An ensemble learning method for variable selection: application to high dimensional data and missing values. ArXiv e-prints <arXiv:1808.06952>
Graphical investigation for the number of datasets generated by multiple imputation
Description
For an object generated by the function clusterMI, the choosem function browses the sequence of the contributory partitions and computes the consensus partition at each step. Then, the rand index between successive consensus partitions is plotted.
Usage
choosem(output, graph = TRUE, nnodes = 1)
Arguments
| output | an output from the clusterMI function | 
| graph | a boolean indicating if a graphic is plotted | 
| nnodes | number of CPU cores for parallel computing. By default  | 
Details
The number of imputed datasets (m) should be sufficiently large to improve the partition accuracy.
The choosem function can be used to check if this number is suitable.
This function computes the consensus partition by considering only the first imputed datasets.
By this way, a sequence of m consensus partitions is obtained.
Then, the rand index between successive partitions is computed and reported in a graph.
The rand index measures the proximity between two partitions.
If the rand index between the last consensus partitions of the sequence reaches its maximum values (1),
then it means last imputed dataset does not modify the consensus partition.
Consequently, the number of imputed datasets can be considered as sufficiently large.
Value
A list of two objects
| part | 
 | 
| rand | a  | 
References
Audigier, V. and Niang, N., Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>, 2022.
See Also
Examples
data(wine, package = "clusterMI")
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
#imputation
m <- 5 # number of imputed data sets. Should be larger in practice
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
#pooling
nnodes <- 2 # number of CPU cores for parallel computing
res.pool <- clusterMI(res.imp, instability = FALSE, nnodes = nnodes)
res.choosem <- choosem(res.pool)
Diagnostic plot for the number of iterations used in sequential imputation methods
Description
The choosemaxit function plots the within and between variance for each variable (specified in plotvars) against the iteration number for each of the replications (specified in plotm).
Usage
choosemaxit(
  output,
  plotvars = NULL,
  plotm = 1:5,
  size = 0.5,
  linewidth = 1,
  linetype = "dotdash",
  xlab = "iterations",
  ylab = "var",
  title = "Within and between variance plots",
  nvar_by_row = 5
)
Arguments
| output | an outpout from the imputedata function | 
| plotvars | index of variables for which a curve is plotted | 
| plotm | a vector indicating which imputed datasets must be plotted | 
| size | size of points | 
| linewidth | a numerical value setting the widths of lines | 
| linetype | what type of plot should be drawn | 
| xlab | a title for the x axis | 
| ylab | a title for the y axis | 
| title | the main title | 
| nvar_by_row | the number of variables that are plotted per window. Default value is 5. | 
Value
No return value
Examples
data(wine, package = "clusterMI")
set.seed(123456)
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
nb.clust <- 3 # number of clusters
m <- 3 # number of imputed data sets
maxit <- 50 # number of iterations for FCS imputation
res.imp <- imputedata(data.na = wine.na, method = "FCS-homo",
                      nb.clust = nb.clust, m = m, maxit = maxit)
choosemaxit(res.imp)
Tune the number of clusters according to the partition instability
Description
choosenbclust reports the cluster instability according to the number of clusters chosen.
Usage
choosenbclust(output, grid = 2:5, graph = TRUE, verbose = TRUE, nnodes = NULL)
Arguments
| output | an output from the clusterMI function | 
| grid | a vector indicating the grid of values tested for nb.clust. By default 2:5 | 
| graph | a boolean indicating if a graphic is plotted | 
| verbose | if TRUE, choosenbclust will print messages on console | 
| nnodes | number of CPU cores for parallel computing. By default, the value used in the call to the clusterMI function | 
Details
The choosenbclust function browses a grid of values for the number of clusters and for each one imputes the data and computes the instability.
Value
a list of two objects
| nb.clust | the number of clusters in  | 
| crit | a vector indicating the instability for each value in the grid | 
References
Audigier, V. and Niang, N., Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>, 2022.
See Also
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
# imputation
res.imp <- imputedata(data.na=wine.na, nb.clust = nb.clust, m = 5)
# pooling
nnodes <- 2 # number of CPU cores for parallel computing
res.pool <- clusterMI(res.imp, nnodes = nnodes, instability = FALSE)
# choice of nb.clust
choosenbclust(res.pool)
Kfold cross-validation for specifying threshold r
Description
chooser returns a list specifying the optimal threshold r for each outcome as well as the associated set of explanatory variables selected, and the cross-validation errror for each value of the grid
Usage
chooser(
  res.varsel,
  K = 10,
  seed = 12345,
  listvar = NULL,
  grid.r = seq(0, 1, 1/1000),
  graph = TRUE,
  printflag = TRUE,
  nb.clust = NULL,
  nnodes = NULL,
  sizeblock = NULL,
  method.select = NULL,
  B = NULL,
  modelNames = NULL,
  nbvarused = NULL,
  path.outfile = NULL
)
Arguments
| res.varsel | an output from the varselbest function | 
| K | an integer given the number of folds | 
| seed | a integer | 
| listvar | a vector of characters specifiying variables (outcomes) for which cross-validation should be done. By default, all variables that have been considered for varselbest are used. | 
| grid.r | a grid for the tuning parameter r | 
| graph | a boolean. If TRUE, cross-validation results are printed | 
| printflag | a boolean. If TRUE, messages are printed | 
| nb.clust | number of clusters. By default, the same as the one used in varselbest | 
| nnodes | an integer specifying the number of nodes for parallel computing. By default, the same as the one used in varselbest | 
| sizeblock | number of sampled variables at each iteration. By default, the same as the one used in varselbest | 
| method.select | variable selection method used. By default, the same as the one used in varselbest | 
| B | number of iterations. By default, the same as the one used in varselbest | 
| modelNames | mixture model specification for imputation of subsets. By default, the same as the one used in varselbest | 
| nbvarused | a maximal number of selected variables (can be required for a dataset with a large number of variables) | 
| path.outfile | a path for message redirection | 
Details
varselbest performs variable selection on random subsets of variables and, then, combines them to recover which explanatory variables are related to the response.
More precisely, the outline of the algorithm are as follows: let consider a random subset of sizeblock among p variables.
By choosing sizeblock small, this subset is low dimensional, allowing treatment of missing values by standard imputation method for clustered individuals.
Then, any selection variable scheme can be applied (lasso, stepwise and knockoff are proposed by tuning the method.select argument).
By resampling B times, a sample of size sizeblock among the p variables, we may count how many times, a variable is considered as significantly related to the response and how many times it is not.
We need to define a threshold (r) to conclude if a given variable is significantly related to the response. chooser aims at finding the optimal value for the threshold r using Kfold cross-validation.
Value
A list where each object refers to an outcome variable called in the listvar argument. Each element is composed of three objects
| r | the optimal value for the threshold | 
| error | the cross-validation error for each value in  | 
| selection | the subset of selected variables for the optimal threshold | 
References
Bar-Hen, A. and Audigier, V., An ensemble learning method for variable selection: application to high dimensional data and missing values, Journal of Statistical Computation and Simulation, <doi:10.1080/00949655.2022.2070621>, 2022.
Schafer, J. L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
nnodes <- 2 # parallel::detectCores()
B <- 100 #  Number of iterations
m <- 5 # Number of imputed data sets
# variables selection for incomplete variable "alco"
listvar <- "alco"
res.varsel <- varselbest(data.na = wine.na,
                         nb.clust = nb.clust,
                         listvar = listvar,
                         B = B,
                         nnodes = nnodes)
# frequency of selection
propselect <- res.varsel$proportion[listvar, ]
#predictormatrix with the default threshold value                         
predictmat <- res.varsel$predictormatrix
# r optimal and associated predictor matrix 
res.chooser <- chooser(res.varsel = res.varsel)
thresh <- res.chooser[[listvar]]$r
is.selected <- propselect>=thresh
predictmat[listvar, names(is.selected)] <- as.numeric(is.selected)
# imputation
res.imp.select <- imputedata(data.na = wine.na, method = "FCS-homo",
                     nb.clust = nb.clust, predictmat = predictmat, m = m)
Apply clustering method after multiple imputation
Description
From a list of imputed datasets clusterMI performs cluster analysis on each imputed data set.
Usage
cluster.intern(
  res.imp,
  method.clustering = "kmeans",
  scaling = TRUE,
  nb.clust = NULL,
  method.hclust = "average",
  method.dist = "euclidean",
  modelNames = NULL,
  modelName.hc = "VVV",
  nstart.kmeans = 100,
  iter.max.kmeans = 10,
  m.cmeans = 2,
  samples.clara = 500,
  verbose = FALSE
)
Arguments
| res.imp | a list of imputed data sets | 
| method.clustering | a single string specifying the clustering algorithm used ("kmeans", "pam", "clara", "hclust" or "mixture","cmeans") | 
| scaling | boolean. If TRUE, variables are scaled. Default value is TRUE | 
| nb.clust | an integer specifying the number of clusters | 
| method.hclust | character string defining the clustering method for hierarchical clustering (required only if method.clustering = "hclust") | 
| method.dist | character string defining the method use for computing dissimilarity matrices in hierarchical clustering (required only if method.clustering = "hclust") | 
| modelNames | character string indicating the models to be fitted in the EM phase of clustering (required only if method.clustering = "mixture"). By default modelNames = NULL. | 
| modelName.hc | A character string indicating the model to be used in model-based agglomerative hierarchical clustering.(required only if method.clustering = "mixture"). By default modelNames.hc = "VVV". | 
| nstart.kmeans | how many random sets should be chosen for kmeans initalization. Default value is 100 (required only if method.clustering = "kmeans") | 
| iter.max.kmeans | how many iterations should be chosen for kmeans. Default value is 10 (required only if method.clustering = "kmeans") | 
| m.cmeans | degree of fuzzification in cmeans clustering. By default m.cmeans = 2 | 
| samples.clara | number of samples to be drawn from the dataset when performing clustering using clara algorithm. Default value is 500. | 
| verbose | logical | 
Details
Performs cluster analysis (according to the method.clustering argument). For achieving this goal, the function uses as an input an output from the imputedata function and applies the cluster analysis method on each imputed data set
Step 1 can be tuned by specifying the cluster analysis method used (method.clustering argument).
If method.clustering = "kmeans" or "pam", then the number of clusters can be specified by tuning the nb.clust argument. By default, the same number as the one used for imputation is used.
The number of random initializations can also be tuned through the nstart.kmeans argument.
If method.clustering = "hclust" (hierarchical clustering), the method used can be specified (see hclust). By default "average" is used. Furthermore, the number of clusters can be specified, but it can also be automatically chosen if nb.clust < 0.
If method.clustering = "mixture" (model-based clustering using gaussian mixture models), the model to be fitted can be tuned by modifying the modelNames argument (see Mclust).
If method.clustering = "cmeans" (clustering using the fuzzy c-means algorithm), then the fuzziness parameter can be modfied by tuning them.cmeans argument. By default, m.cmeans = 2.
Can be performed in parallel by specifying the number of CPU cores (nnodes argument).
Value
A list with clustering results
See Also
hclust, Mclust, imputedata, cmeans,dist
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
m <- 5 # number of imputed data sets. Should be larger in practice
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
#imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
#analysis by kmeans and pooling
nnodes <- 2 # parallel::detectCores()
res.pool <- clusterMI(res.imp, nnodes = nnodes)
res.pool$instability
table(ref, res.pool$part)
Cluster analysis and pooling after multiple imputation
Description
From a list of imputed datasets clusterMI performs cluster analysis on each imputed data set, estimates the instability of each partition using bootstrap (following Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>) and pools results as proposed in Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1>.
Usage
clusterMI(
  output,
  method.clustering = "kmeans",
  method.consensus = "NMF",
  scaling = TRUE,
  nb.clust = NULL,
  Cboot = 50,
  method.hclust = "average",
  method.dist = "euclidean",
  modelNames = NULL,
  modelName.hc = "VVV",
  nstart.kmeans = 100,
  iter.max.kmeans = 10,
  m.cmeans = 2,
  samples.clara = 500,
  nnodes = 1,
  instability = TRUE,
  verbose = TRUE,
  parameter.nmf = list(method.init = c("BOK", "kmeans"), threshold = 10^(-5), printflag =
    FALSE, parameter.kmeans = list(nstart = 100, iter.max = 50, algorithm =
    c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE),
    parameter.minibatchkmeans = list(batch_size = 10, num_init = 1, max_iters = 50,
    init_fraction = 1, initializer = "kmeans++", early_stop_iter = 10, verbose = FALSE,
    CENTROIDS = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 1))
)
Arguments
| output | an output from the imputedata function | 
| method.clustering | a single string specifying the clustering algorithm used ("kmeans", "pam", "clara", "hclust" or "mixture","cmeans") | 
| method.consensus | a single string specifying the consensus method used to pool the contributory partitions ("NMF" or "CSPA") | 
| scaling | boolean. If TRUE, variables are scaled. Default value is TRUE | 
| nb.clust | an integer specifying the number of clusters. If two distinct values are considered for analysis and pooling steps, then a 2 length vector can be provided to specify the two numbers respectively. | 
| Cboot | an integer specifying the number of bootstrap replications. Default value is 50 | 
| method.hclust | character string defining the clustering method for hierarchical clustering (required only if method.clustering = "hclust") | 
| method.dist | character string defining the method use for computing dissimilarity matrices in hierarchical clustering (required only if method.clustering = "hclust") | 
| modelNames | character string indicating the models to be fitted in the EM phase of clustering (required only if method.clustering = "mixture"). By default modelNames = NULL. | 
| modelName.hc | A character string indicating the model to be used in model-based agglomerative hierarchical clustering.(required only if method.clustering = "mixture"). By default modelNames.hc = "VVV". | 
| nstart.kmeans | how many random sets should be chosen for kmeans initalization. Default value is 100 (required only if method.clustering = "kmeans") | 
| iter.max.kmeans | how many iterations should be chosen for kmeans. Default value is 10 (required only if method.clustering = "kmeans") | 
| m.cmeans | degree of fuzzification in cmeans clustering. By default m.cmeans = 2 | 
| samples.clara | number of samples to be drawn from the dataset when performing clustering using clara algorithm. Default value is 500. | 
| nnodes | number of CPU cores for parallel computing. By default, nnodes = 1 | 
| instability | a boolean indicating if cluster instability must be computed. Default value is TRUE | 
| verbose | a boolean. If TRUE, a message is printed at each step. Default value is TRUE | 
| parameter.nmf | a list of arguments for consensus by NMF. See the fastnmf help page. | 
Details
clusterMI performs cluster analysis (according to the method.clustering argument) and pooling after multiple imputation. For achieving this goal, the clusterMI function uses as an input an output from the imputedata function and then
- applies the cluster analysis method on each imputed data set 
- pools contributory partitions using non-negative matrix factorization 
- computes the instability of each partition by bootstrap 
- computes the total instability 
Step 1 can be tuned by specifying the cluster analysis method used (method.clustering argument).
If method.clustering = "kmeans" or "pam", then the number of clusters can be specified by tuning the nb.clust argument. By default, the same number as the one used for imputation is used.
The number of random initializations can also be tuned through the nstart.kmeans argument.
If method.clustering = "hclust" (hierarchical clustering), the method used can be specified (see hclust). By default "average" is used. Furthermore, the number of clusters can be specified, but it can also be automatically chosen if nb.clust < 0.
If method.clustering = "mixture" (model-based clustering using gaussian mixture models), the model to be fitted can be tuned by modifying the modelNames argument (see Mclust).
If method.clustering = "cmeans" (clustering using the fuzzy c-means algorithm), then the fuzziness parameter can be modfied by tuning them.cmeans argument. By default, m.cmeans = 2.
Step 2 performs consensus clustering by Non-Negative Matrix Factorization, following Li and Ding (2007) <doi:10.1109/ICDM.2007.98>.
Step 3 applies the nselectboot function on each imputed data set and returns the instability of each cluster obtained at step 1. The method is based on bootstrap sampling, followong Fang, Y. and Wang, J. (2012) <doi:10.1016/j.csda.2011.09.003>. The number of iterations can be tuned using the Cboot argument.
Step 4 averages the previous instability measures given a within instability (Ubar), computes a between instability (B) and a total instability (T = B + Ubar). See Audigier and Niang (2022) <doi:10.1007/s11634-022-00519-1> for details.
All steps can be performed in parallel by specifying the number of CPU cores (nnodes argument). Steps 3 and 4 are more time consuming. To compute only steps 1 and 2 use instability = FALSE.
Value
A list with three objects
| part | the consensus partition | 
| instability | a list of four objects:  | 
| call | the matching call | 
References
Audigier, V. and Niang, N. (2022) Clustering with missing data: which equivalent for Rubin's rules? Advances in Data Analysis and Classification <doi:10.1007/s11634-022-00519-1>
Fang, Y. and Wang, J. (2012) Selection of the number of clusters via the bootstrap method. Computational Statistics and Data Analysis, 56, 468-477. <doi:10.1016/j.csda.2011.09.003>
T. Li, C. Ding, and M. I. Jordan (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, page 577-582, USA. IEEE Computer Society. <doi:10.1109/ICDM.2007.98>
See Also
hclust, nselectboot, Mclust, imputedata, cmeans
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
m <- 5 # number of imputed data sets. Should be larger in practice
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
#imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
#analysis by kmeans and pooling
nnodes <- 2 # parallel::detectCores()
res.pool <- clusterMI(res.imp, nnodes = nnodes)
res.pool$instability
table(ref, res.pool$part)
Consensus clustering using non-negative matrix factorization
Description
From a list of partitions fastnmf pools partition as proposed in Li and Ding (2007) <doi:10.1109/ICDM.2007.98>.
Usage
fastnmf(
  listpart,
  nb.clust,
  method.init = c("BOK", "kmeans"),
  threshold = 10^(-5),
  printflag = TRUE,
  parameter.kmeans = list(nstart = 100, iter.max = 50, algorithm = c("Hartigan-Wong",
    "Lloyd", "Forgy", "MacQueen"), trace = FALSE),
  parameter.minibatchkmeans = list(batch_size = 10, num_init = 1, max_iters = 50,
    init_fraction = 1, initializer = "kmeans++", early_stop_iter = 10, verbose = FALSE,
    CENTROIDS = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 1)
)
Arguments
| listpart | a list of partitions | 
| nb.clust | an integer specifying the number of clusters | 
| method.init | a vector giving initialisation methods used among "BOK", "kmeans", "minibatchkmeans" "sample". See details. | 
| threshold | a real specifying when the NMF algorithm is stoped. Default value is 10^(-5) | 
| printflag | a boolean. If TRUE, nmf will print messages on console. Default value is TRUE | 
| parameter.kmeans | a list of arguments for kmeans function. See keans help page. | 
| parameter.minibatchkmeans | list of arguments for MiniBatchKmeans function. See MiniBatchKmeans help page. | 
Details
fastnmf performs consensus clustering using non-negative matrix factorization following Li and Ding (2007) <doi:10.1109/ICDM.2007.98>. The set of partitions that are aggregated needs to be given as a list where each element is a vector of numeric values. Note that the number of classes for each partition can vary. The number of classes for the consensus partition should be given using the nb.clust argument. The NMF algorithm is iterative and required an initial partition. This latter is specified by method.init. method.init = "BOK" means the partition considered is a partition from listpart which minimizes the NMF criterion. Alternative methods are "kmeans", "minibathckmeans" or "sample". If method.init = "kmeans" (or "minibatchkmeans"), then clustering on the average of connectivity matrices is performed by kmeans (or "minibatchkmeans"). Mini Batch Kmeans could be faster than kmeans if the number of individuals is large. If method.init = "sample", then a random partition is drawn. If method.init is a vector of several characters, then several initialization methods are considered and the best method is returned. By default, method.init = c("BOK", "kmeans").
Value
For each initialisation method, a list of 5 objets is returned
| Htilde | A fuzzy disjunctive table | 
| S | A positive matrix | 
| Mtilde | The average of connectivity matrices | 
| crit | A vector with the optimized criterion at each iteration | 
| cluster | the consensus partition in nb.clust classes | 
In addition, the best initialisation method is returned
References
T. Li, C. Ding, and M. I. Jordan (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, page 577-582, USA. IEEE Computer Society. <doi:10.1109/ICDM.2007.98>
See Also
Examples
data(wine, package = "clusterMI")
require(clustrd)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
m <- 3 # number of imputed data sets. Should be larger in practice
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
#imputation
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
#analysis using reduced kmeans
## apply the cluspca function on each imputed data set
res.ana.rkm <- lapply(res.imp$res.imp,
                      FUN = cluspca,
                      nclus = nb.clust,
                      ndim = 2,
                      method= "RKM")
## extract the set of partitions (under "list" format)
res.ana.rkm <-lapply(res.ana.rkm,"[[","cluster")
# pooling by NMF
res.pool.rkm <- fastnmf(res.ana.rkm, nb.clust = nb.clust)
## extract the partition corresponding to the best initialisation
part <- res.pool.rkm$best$clust
Multiple imputation methods for cluster analysis
Description
imputedata returns a list of imputed datasets by using imputation methods dedicated to individuals clustered in (unknown) groups
Usage
imputedata(
  data.na,
  method = "JM-GL",
  nb.clust = NULL,
  m = 20,
  maxit = 50,
  Lstart = 100,
  L = 20,
  method.mice = NULL,
  predictmat = NULL,
  verbose = TRUE,
  seed = 1234,
  bootstrap = FALSE
)
Arguments
| data.na | an incomplete dataframe | 
| method | a single string specifying the imputation method used among "FCS-homo","FCS-hetero","JM-DP","JM-GL". By default method = "JM-GL". See the details section | 
| nb.clust | number of clusters | 
| m | number of imputed datasets. By default, m = 20. | 
| maxit | number of iterations for FCS methods (only used for method = FCS-homo or method = FCS-hetero) | 
| Lstart | number of iterations for the burn-in period (only used if method ="JM-DP" or "JM-GL") | 
| L | number of skipped iterations to keep one imputed data set after the burn-in period (only used if method ="JM-DP" or "JM-GL") | 
| method.mice | a vector of strings (or a single string) giving the imputation method for each variable (only used for method = FCS-homo or method = FCS-hetero). Default value is "pmm" (predictive mean matching) for FCS-homo and "mice.impute.2l.jomo" for FCS-hetero | 
| predictmat | predictor matrix used for FCS imputation (only used for method = FCS-homo or method = FCS-hetero) | 
| verbose | a boolean. If TRUE, a message is printed at each iteration. Use verbose = FALSE for silent imputation | 
| seed | a positive integer initializing the random generator | 
| bootstrap | a boolean. Use bootstrap = TRUE for proper imputation with FCS methods (Mclust sometimes fails with multiple points) | 
Details
The imputedata offers various multiple imputation methods dedicated to clustered individuals.
In particular, two fully conditional imputation methods are proposed (FCS-homo and FCS-hetero) which essentially differ by the assumption about the covariance in each cluster (constant or not respectively).
The imputation requires a pre-specified number of clusters (nb.clust). See choosenbclust if this number is unknown.
The imputedata function alternates clustering and imputation given the partition of individuals.
When the clustering is performed, the function calls the mice function from the mice R package to perform imputation.
The mice package proposes various methods for imputation which can be specified by tuning the method.mice argument.
Note that two other joint modelling methods are also available: JM-GL from the R package mix and JM-DP from the R package DPImputeCont https://github.com/hang-j-kim/DPImputeCont
Value
a list of 3 objets
| res.imp | a list with the several imputed datasets | 
| res.conv | for FCS methods, an array given the between (and within) inertia of each imputed variable at each iteration and for each imputed dataset. For JM methods, a matrix given the between inertia for each variable and each imputed dataset. | 
| call | the matching call | 
References
Kim, H. J., Reiter, J. P., Wang, Q., Cox, L. H. and Karr, A. F. (2014), Multiple imputation of missing or faulty values under linear constraints, Journal of Business and Economics Statistics, 32, 375-386 <doi:10.1080/07350015.2014.885435>
Schafer, J. L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, Chapter 9.
Audigier, V., Niang, N., & Resche-Rigon, M. (2021). Clustering with missing data: which imputation model for which cluster analysis method?. arXiv preprint <arXiv:2106.04424>.
See Also
mice choosenbclust choosemaxit varselbest imp.mix
Examples
data(wine, package = "clusterMI")
set.seed(123456)
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
nb.clust <- 3 # number of clusters
m <- 20 # number of imputed data sets
res.imp <- imputedata(data.na = wine.na, nb.clust = nb.clust, m = m)
lapply(res.imp$res.imp, summary)
initialize fastnmf
Description
initialize fastnmf
Usage
initfastnmf.intern(
  method,
  Mtilde,
  listpart = NULL,
  nb.clust = NULL,
  parameter.kmeans = list(nstart = 100, iter.max = 50, algorithm = c("Hartigan-Wong",
    "Lloyd", "Forgy", "MacQueen"), trace = FALSE),
  parameter.minibatchkmeans = list(batch_size = 10, num_init = 1, max_iters = 50,
    init_fraction = 1, initializer = "kmeans++", early_stop_iter = 10, verbose = FALSE,
    CENTROIDS = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 1)
)
Arguments
| method | a vector giving initialisation methods among "BOK", "kmeans", "minibatchkmeans" "sample" | 
| Mtilde | a fuzzy connectivity matrix | 
| listpart | a list of partitions (required for kmeans and minibatchkmeans method) | 
| parameter.kmeans | list of arguments for kmeans function | 
| parameter.minibatchkmeans | list of arguments for MiniBatchKmeans function | 
MclustBootstrap with nboot = 1 and the same output as Mclust
Description
MclustBootstrap with nboot = 1 and the same output as Mclust
Usage
mclustboot.intern(don, G = NULL, modelNames = NULL, verbose = FALSE)
Arguments
| don | matrix, or data frame | 
| G | An integer vector specifying the numbers of mixture components | 
| modelNames | A vector of character strings indicating the models to be fitted | 
| verbose | A logical controlling if a text progress bar is displayed | 
internal function
Description
prend en entrée un jeu de données, ajoute une variable classe, et renvoie les parametre du general location model. Une des particularité est l'initialisation du EM par un modèle de mélange. En effet em.mix a sa propre façon d'initialiser qui parfois renvoie des paramètres identiques pour plusieurs groupes
Usage
myem.mix(dataset, K, boot = FALSE, silent = TRUE, nbquali = 0, eps = 1e-04)
one fold cross-validation for specifying threshold r
Description
one fold cross-validation for specifying threshold r
Usage
onefold.chooser(
  data.train.k,
  data.test.k,
  jj,
  grid.r,
  nb.clust,
  nnodes,
  sizeblock,
  method.select,
  B,
  modelNames,
  K,
  path.outfile,
  nbvarused
)
Arguments
| data.train.k | a train set | 
| data.test.k | a test set | 
| jj | name of the outcome used for variable selection | 
| grid.r | a grid for the tuning parameter r | 
| nb.clust | number of clusters | 
| sizeblock | number of sampled variables at each iteration | 
| method.select | variable selection method | 
| B | number of iterations | 
| modelNames | mixture model specification for imputation of subsets | 
| K | number of fold | 
| path.outfile | a path for message redirection | 
| nbvarused | a maximal number of selected variables (can be required with a large number of variables) | 
Overimputation diagnostic plot
Description
overimpute assesses the fit of the predictive distribution after performing multiple imputation with the imputedata function
Usage
overimpute(
  res.imputedata,
  plotvars = NULL,
  plotinds = NULL,
  nnodes = 2,
  path.outfile = NULL,
  alpha = 0.1,
  mfrow = NULL,
  mar = c(5, 4, 4, 2) - 1.9
)
Arguments
| res.imputedata | an output from the imputedata function | 
| plotvars | column index of the variables overimputed | 
| plotinds | row index of the individuals overimputed | 
| nnodes | an integer indicating the number of nodes for parallel calculation. Default value is 2 | 
| path.outfile | a vector of strings indicating the path for redirection of print messages. Default value is NULL, meaning that silent imputation is performed. Otherwise, print messages are saved in the files path.outfile/output.txt. One file per node is generated. | 
| alpha | alpha level for prediction intervals | 
| mfrow | a vector of the form c(nr, nc) | 
| mar | a numerical vector of the form c(bottom, left, top, right) | 
Details
This function imputes each observed value from each conditional imputation model obtained from the imputedata function. The comparison between the "overimputed" values and the observed values is made by building a confidence interval for each observed value using the quantiles of the overimputed values (see Blackwell et al. (2015) <doi:10.1177/0049124115585360>). Note that confidence intervals built with quantiles require a large number of imputations. If the model fits well the data, then the 90% confidence interval should contain the observed value in 90% of the cases. The function overimpute takes as an input an output of the imputedata function (res.imputedata argument), the indices of the incomplete continuous variables that are plotted (plotvars), the indices of individuals (can be useful for time consuming imputation methods), the number of CPU cores for parallel computation, and the path for exporting print message generated during the parallel process (path.outfile).
Value
A list of two matrices
| res.plot | 7-columns matrix that contains (1) the variable which is overimputed, (2) the observed value of the observation, (3) the mean of the overimputations, (4) the lower bound of the confidence interval of the overimputations, (5) the upper bound of the confidence interval of the overimputations, (6) the proportion of the other variables that were missing for that observation in the original data, and (7) the color for graphical representation | 
| res.values | a matrix with overimputed values for each cell. The number of columns corresponds to the number of values generated (i.e. the number of imputed datasets) | 
References
Blackwell, M., Honaker, J. and King. G. 2015. A Unified Approach to Measurement Error and Missing Data: Overview and Applications. Sociological Methods and Research, 1-39. <doi:10.1177/0049124115585360>
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
nnodes <- 2 # Number of CPU cores used for parallel computation
# Multiple imputation using m = 100 (should be larger in practice)
res.imp.over <- imputedata(data.na = wine.na,
                           nb.clust = nb.clust,
                           m = 100)
# Overimputation
## overimputed variable
plotvars <- "alco" 
## selection of 20 complete individuals on variable "alco"
plotinds <- sample(which(!is.na(wine.na[, plotvars])),
                    size = 20)
## overimputation                   
res.over <- overimpute(res.imp.over,
                       nnodes = nnodes,
                       plotvars = plotvars,
                       plotinds = plotinds,
                       )
Introduce missing values using a missing completely at random mechanism
Description
prodna generates an incomplete dataset by removing a proportion of observed values
Usage
prodna(X, pct = 0.3)
Arguments
| X | a data frame (or matrix). | 
| pct | proportion of missing values. By default pct = 0.3. | 
Value
an incomplete data frame
Examples
n <- 1000
p <- 5
X <- matrix(runif(n*p), nrow = n, ncol = p)
summary(X)
X.na <- prodna(X)
colMeans(is.na(X.na))
Variable selection for specifying conditional imputation models
Description
varselbest performs variable selection from an incomplete dataset (see Bar-Hen and Audigier (2022) <doi:10.1080/00949655.2022.2070621>) in order to specify the imputation models to use for FCS imputation methods
Usage
varselbest(
  data.na = NULL,
  res.imputedata = NULL,
  listvar = NULL,
  nb.clust = NULL,
  nnodes = 1,
  sizeblock = 5,
  method.select = "knockoff",
  B = 200,
  r = 0.3,
  graph = TRUE,
  printflag = TRUE,
  path.outfile = NULL,
  mar = c(2, 4, 2, 0.5) + 0.1,
  cex.names = 0.7,
  modelNames = NULL
)
Arguments
| data.na | a dataframe with only numeric variables | 
| res.imputedata | an output from  | 
| listvar | a character vector indicating for which subset of incomplete variables variable selection must be performed. By default all column names. | 
| nb.clust | the number of clusters used for imputation | 
| nnodes | number of CPU cores for parallel computing. By default, nnodes = 1 | 
| sizeblock | an integer indicating the number of variables sampled at each iteration | 
| method.select | a single string indicating the variable selection method applied on each subset of variables | 
| B | number of iterations, by default B = 200 | 
| r | a numerical vector (or a single real number) indicating the threshold used for each variable in listvar. Each value of r should be between 0 and 1. See details. | 
| graph | a boolean. If TRUE two graphics are plotted per variable in  | 
| printflag | a boolean. If TRUE, a message is printed at each iteration. Use printflag = FALSE for silent selection. | 
| path.outfile | a vector of strings indicating the path for redirection of print messages. Default value is NULL, meaning that silent imputation is performed. Otherwise, print messages are saved in the files path.outfile/output.txt. One file per node is generated. | 
| mar | a numerical vector of the form c(bottom, left, top, right). Only used if graph = TRUE | 
| cex.names | expansion factor for axis names (bar labels) (only used if graph = TRUE) | 
| modelNames | a vector of character strings indicating the models to be fitted in the EM phase of clustering | 
Details
varselbest performs variable selection on random subsets of variables and, then, combines them to recover which explanatory variables are related to the response.
More precisely, the outline of the algorithm are as follows: let consider a random subset of sizeblock among p variables.
By choosing sizeblock small, this subset is low dimensional, allowing treatment of missing values by standard imputation method for clustered individuals.
Then, any selection variable scheme can be applied (lasso, stepwise and knockoff are proposed by tuning the method.select argument).
By resampling B times, a sample of size sizeblock among the p variables, we may count how many times, a variable is considered as significantly related to the response and how many times it is not.
We need to define a threshold (r) to conclude if a given variable is significantly related to the response.
Value
a list of four objects
| predictormatrix | a numeric matrix containing 0 and 1 specifying on each line the set of predictors to be used for each target column of the incomplete dataset. | 
| res.varsel | a list given details on the variable selection procedure (only required for checking convergence by the  | 
| proportion | a numeric matrix of proportion indicating on each line the variable importance of each predictor | 
| call | the matching call | 
References
Bar-Hen, A. and Audigier, V., An ensemble learning method for variable selection: application to high dimensional data and missing values, Journal of Statistical Computation and Simulation, <doi:10.1080/00949655.2022.2070621>, 2022.
See Also
mice,clusterMI, imputedata,knockoff,glmnet,imp.mix
Examples
data(wine, package = "clusterMI")
require(parallel)
set.seed(123456)
ref <- wine$cult
nb.clust <- 3
wine.na <- wine
wine.na$cult <- NULL
wine.na <- prodna(wine.na)
nnodes <- 2 # parallel::detectCores()
B <- 150 #  Number of iterations
m <- 5 # Number of imputed data sets
# variable selection
res.varsel <- varselbest(data.na = wine.na,
                         nb.clust = nb.clust,
                         listvar = c("alco","malic"),
                         B = B,
                         nnodes = nnodes)
predictmat <- res.varsel$predictormatrix
# imputation
res.imp.select <- imputedata(data.na = wine.na, method = "FCS-homo",
                     nb.clust = nb.clust, predictmat = predictmat, m = m)
Chemical analysis of wines from three different cultivars
Description
Data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines
Usage
data(wine)
Format
A data frame with 178 rows and 14 variables:
- cult
- Cultivar 
- alco
- Alcohol 
- malic
- Malic acid 
- ash
- Ash 
- alca
- Alcalinity of ash 
- mg
- Magnesium 
- phe
- Total phenols 
- fla
- Flavanoids 
- nfla
- Nonflavanoid phenols 
- pro
- Proanthocyanins 
- col
- Color intensity 
- hue
- Hue 
- ratio
- OD280/OD315 of diluted wines 
- prol
- Proline 
Source
https://archive.ics.uci.edu/ml/datasets/wine
References
M. Forina, C. Armanino, M. Castino and M. Ubigli. Vitis, 25:189-201 (1986)