In this vignette I’ll illustrate how to increase the accuracy on the MNIST (to approx. 98.4%) and CIFAR-10 data (to approx. 58.3%) using the KernelKnn package and HOG (histogram of oriented gradients).
The MNIST data set of handwritten digits used here contains 70,000 examples, and each row of the matrix corresponds to a flattened 28 x 28 image (784 columns). The unique values of the response variable y range from 0 to 9. More information about the data can be found in the DataSets repository (the folder also includes an Rmarkdown file).
# using system('wget..') on a linux OS
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
                    quote = "\"", sep = ",")
X = mnist[, -ncol(mnist)]
dim(X)
## [1] 70000 784
# the KernelKnn function requires that the labels are numeric and start from 1 : Inf
y = mnist[, ncol(mnist)] + 1
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958
K-nearest-neighbors does not perform well in high dimensions due to the curse of dimensionality: the k observations that are nearest to a given test observation x1 may be very far away from x1 in p-dimensional space when p is large, leading to a very poor k-nearest-neighbors fit [An Introduction to Statistical Learning, James/Witten/Hastie/Tibshirani, pages 108-109]. One option to overcome this problem is to use truncated svd (irlba package) to reduce the dimensions of the data. A second option, which is appropriate in the case of images, is to use image descriptors. In this vignette, I’ll compare those two approaches.
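The distance effect described above can be illustrated with a small base R simulation. This is only a sketch on uniformly distributed toy points (not on the MNIST data); the helper name `nearest_to_farthest`, the seed and the sample size are arbitrary assumptions. It shows how the nearest neighbor of a query point becomes almost as far away as the farthest point when p grows.

```r
# Sketch: nearest / farthest distance ratio for increasing dimension p.
set.seed(1)                        # arbitrary seed, for reproducibility

# hypothetical helper: ratio of the nearest to the farthest Euclidean distance
# from one random query point to n uniformly distributed points in p dimensions
nearest_to_farthest <- function(p, n = 500) {
  pts   <- matrix(runif(n * p), nrow = n, ncol = p)
  query <- runif(p)
  d     <- sqrt(rowSums(sweep(pts, 2, query)^2))   # sweep() subtracts query from each row
  min(d) / max(d)
}

dims   <- c(2, 10, 100, 1000)
ratios <- sapply(dims, nearest_to_farthest)
print(data.frame(p = dims, nearest_over_farthest = round(ratios, 3)))
```

A ratio close to 1 means that "nearest" carries little information, which is the motivation for reducing the dimensionality before fitting a k-nearest-neighbors model.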
I experimented with different settings and the following parameters gave the best results:
irlba_singular_vectors | k | method | kernel |
---|---|---|---|
40 | 8 | braycurtis | biweight_tricube_MULT |
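For reference, the Bray-Curtis dissimilarity named in the table is commonly defined as the sum of absolute differences divided by the sum of absolute sums. The helper below is a hypothetical illustration of that common definition, written in base R; it is an assumption about the formula and not code taken from the KernelKnn package.

```r
# hypothetical helper (not part of KernelKnn): common Bray-Curtis definition
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(abs(a + b))

a <- c(1, 2, 3)
b <- c(2, 2, 5)
bray_curtis(a, b)   # (1 + 0 + 2) / (3 + 4 + 8) = 0.2
bray_curtis(a, a)   # identical vectors give 0
```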
library(irlba)
svd_irlb = irlba(as.matrix(X), nv = 40, nu = 40, verbose = F)     # irlba truncated svd
new_x = as.matrix(X) %*% svd_irlb$v     # new_x using the 40 right singular vectors
library(KernelKnn)
fit = KernelKnnCV(as.matrix(new_x), y, k = 8, folds = 4, method = 'braycurtis',
                  weights_function = 'biweight_tricube_MULT', regression = F,
                  threads = 6, Levels = sort(unique(y)))
# str(fit)
# evaluation metric
acc = function (y_true, preds) {
  # confusion table of true labels vs. the class with the highest predicted probability
  out = table(y_true, max.col(preds, ties.method = "random"))
  acc = sum(diag(out))/sum(out)
  acc
}
acc_fit = unlist(lapply(1:length(fit$preds),
                        function(x) acc(y[fit$folds[[x]]],
                                        fit$preds[[x]])))
acc_fit
## [1] 0.9742857 0.9749143 0.9761143 0.9741143
cat('mean accuracy using cross-validation :', mean(acc_fit), '\n')
## mean accuracy using cross-validation : 0.9748571
Using truncated svd, a 4-fold cross-validation KernelKnn model gives an accuracy of 97.48%.
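The projection step `X %*% svd_irlb$v` can be sketched on toy data. The example below uses base R's `svd()` as a stand-in for `irlba()` (an assumption made so that the block runs without extra packages; irlba computes only the leading singular vectors and is far faster on large matrices). The names `X_toy`, `new_x_toy` and the choice of 5 vectors are hypothetical.

```r
set.seed(42)                                             # arbitrary seed
X_toy <- matrix(rnorm(200 * 30), nrow = 200, ncol = 30)  # toy data, 200 rows x 30 columns

k <- 5                                  # number of singular vectors to keep
s <- svd(X_toy, nu = k, nv = k)         # base R svd, returning only k left/right vectors

# project the rows onto the k leading right singular vectors
new_x_toy <- X_toy %*% s$v
dim(new_x_toy)                          # 200 rows, 5 columns

# the projected data equal U_k %*% D_k, i.e. the left singular vectors
# scaled by the corresponding singular values
max(abs(new_x_toy - s$u %*% diag(s$d[1:k])))   # numerically close to 0
```

The reduced matrix keeps one row per observation, so it can be passed to a nearest-neighbor model in place of the original columns.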
In this chunk of code, besides KernelKnnCV I’ll also use HOG. The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy (Wikipedia).
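To make the idea of counting gradient orientations per cell concrete, here is a deliberately simplified HOG-like descriptor in base R. It is a sketch only: `simple_hog` is a hypothetical helper, it is not the OpenImageR implementation, and it skips the overlapping block normalization mentioned above (a single global L2 normalization is used instead). With 6 x 6 cells and 9 orientation bins it returns 6 * 6 * 9 = 324 values per image.

```r
# simplified HOG-like descriptor (hypothetical helper, NOT OpenImageR's HOG)
simple_hog <- function(img, cells = 6, orientations = 9) {
  nr <- nrow(img); nc <- ncol(img)
  # central-difference gradients, left at zero on the image border
  gx <- matrix(0, nr, nc); gy <- matrix(0, nr, nc)
  gx[, 2:(nc - 1)] <- img[, 3:nc] - img[, 1:(nc - 2)]
  gy[2:(nr - 1), ] <- img[3:nr, ] - img[1:(nr - 2), ]
  mag <- sqrt(gx^2 + gy^2)                           # gradient magnitude
  ang <- (atan2(gy, gx) * 180 / pi) %% 180           # unsigned orientation in [0, 180)
  bin <- pmin(floor(ang / (180 / orientations)) + 1, orientations)
  # assign every pixel to one cell of a cells x cells grid
  row_cell <- cut(seq_len(nr), breaks = cells, labels = FALSE)
  col_cell <- cut(seq_len(nc), breaks = cells, labels = FALSE)
  cell_id  <- outer(row_cell, col_cell, function(rr, cc) (rr - 1) * cells + cc)
  # magnitude-weighted orientation histogram for each cell
  hist_mat <- matrix(0, cells * cells, orientations)
  for (i in seq_along(img)) {
    hist_mat[cell_id[i], bin[i]] <- hist_mat[cell_id[i], bin[i]] + mag[i]
  }
  feat <- as.vector(t(hist_mat))                     # concatenate the cell histograms
  feat / sqrt(sum(feat^2) + 1e-12)                   # global L2 normalization
}

set.seed(1)
img  <- matrix(runif(28 * 28), 28, 28)   # random stand-in for one 28 x 28 digit
feat <- simple_hog(img)
length(feat)                             # 6 * 6 * 9 = 324
```

The descriptor length depends only on the number of cells and orientation bins, not on the image size.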
library(OpenImageR)
hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6)
##
## time to complete : 1.802997 secs
dim(hog)
## [1] 70000 324
fit_hog = KernelKnnCV(hog, y, k = 20, folds = 4, method = 'braycurtis',
                      weights_function = 'biweight_tricube_MULT', regression = F,
                      threads = 6, Levels = sort(unique(y)))
#str(fit_hog)
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
                            function(x) acc(y[fit_hog$folds[[x]]],
                                            fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.9833714 0.9840571 0.9846857 0.9838857
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.984
By changing from the simple svd-features to HOG-features, the accuracy of a 4-fold cross-validation model increased from 97.48% to 98.4% (a difference of approx. 0.9 percentage points).
CIFAR-10 is an established computer-vision dataset used for object recognition. The data I’ll use in this example is a subset of the 80 million tiny images dataset and consists of 60,000 32x32 color images, each containing one of 10 object classes (6,000 images per class). Furthermore, the data were converted from RGB to gray, normalized and rounded to 2 decimal places (to reduce the storage size). More information about the data can be found in my DataSets repository (I included an Rmarkdown file).
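A preprocessing step of this kind could look roughly as follows. This is a sketch under assumptions: the exact conversion used for the stored data is not stated here, so the luminance weights (the ITU-R BT.601 values), the scaling to [0, 1] and the helper name `rgb_to_gray` are illustrative choices, not the author's confirmed procedure.

```r
# hypothetical helper: the weights and the [0, 1] scaling are assumptions
rgb_to_gray <- function(r, g, b, max_value = 255) {
  gray <- 0.299 * r + 0.587 * g + 0.114 * b   # ITU-R BT.601 luma weights
  round(gray / max_value, 2)                  # scale to [0, 1], keep 2 decimals
}

# four toy pixels: pure red, pure green, pure blue, white
r <- c(255,   0,   0, 255)
g <- c(  0, 255,   0, 255)
b <- c(  0,   0, 255, 255)
rgb_to_gray(r, g, b)   # 0.30 0.59 0.11 1.00
```

Rounding to two decimals limits each pixel to at most 101 distinct values, which is what keeps the stored csv file small.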
I’ll build the kernel k-nearest-neighbors models in the same way I’ve done for the mnist data set and then I’ll compare the results.
# using system('wget..') on a linux OS
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/cifar_10.zip")
cifar_10 <- read.table(unz("cifar_10.zip", "cifar_10.csv"), nrows = 60000, header = T,
                       quote = "\"", sep = ",")
X = cifar_10[, -ncol(cifar_10)]
dim(X)
## [1] 60000 1024
# the KernelKnn function requires that the labels are numeric and start from 1 : Inf
# (the cifar_10 labels already start from 1, so no offset is added)
y = cifar_10[, ncol(cifar_10)]
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
The parameter settings are the same as those for the mnist data:
irlba_singular_vectors | k | method | kernel |
---|---|---|---|
40 | 8 | braycurtis | biweight_tricube_MULT |
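The kernel name in the table suggests a product of a biweight and a tricube kernel. The sketch below shows how such kernel weights can change a nearest-neighbor vote. It uses the textbook kernel formulas and a simple distance scaling; both the scaling and the helper `weighted_vote` are assumptions for illustration and may differ from what KernelKnn does internally.

```r
# textbook kernel definitions for |u| <= 1
biweight <- function(u) (15 / 16) * (1 - u^2)^2
tricube  <- function(u) (70 / 81) * (1 - abs(u)^3)^3

# hypothetical helper: kernel-weighted class probabilities from k neighbors
weighted_vote <- function(dists, labels, n_classes) {
  u <- dists / max(dists)                  # assumed scaling of distances to [0, 1]
  w <- biweight(u) * tricube(u)            # 'MULT': product of the two kernels
  scores <- sapply(seq_len(n_classes), function(cl) sum(w[labels == cl]))
  scores / sum(scores)                     # normalize to probabilities
}

dists  <- c(0.10, 0.20, 0.25, 0.60, 0.80)  # toy distances to 5 neighbors
labels <- c(2, 1, 2, 1, 1)                 # toy neighbor labels (starting from 1)
probs  <- weighted_vote(dists, labels, n_classes = 3)
round(probs, 3)
which.max(probs)   # class 2 wins, although class 1 has 3 of the 5 neighbors
```

In this toy case an unweighted majority vote would pick class 1, while the kernel weights favor the two closest neighbors.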
svd_irlb = irlba(as.matrix(X), nv = 40, nu = 40, verbose = F)     # irlba truncated svd
new_x = as.matrix(X) %*% svd_irlb$v     # new_x using the 40 right singular vectors
fit = KernelKnnCV(as.matrix(new_x), y, k = 8, folds = 4, method = 'braycurtis',
                  weights_function = 'biweight_tricube_MULT', regression = F,
                  threads = 6, Levels = sort(unique(y)))
# str(fit)
acc_fit = unlist(lapply(1:length(fit$preds),
                        function(x) acc(y[fit$folds[[x]]],
                                        fit$preds[[x]])))
acc_fit
## [1] 0.4080667 0.4097333 0.4040000 0.4102667
cat('mean accuracy using cross-validation :', mean(acc_fit), '\n')
## mean accuracy using cross-validation : 0.4080167
The accuracy of a 4-fold cross-validation model using truncated svd is 40.8%.
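When accuracy is this low, it is worth noting that an accuracy computed from the diagonal of a confusion table is only valid if the table is square, that is, if every class appears among both the true and the predicted labels. The sketch below, on synthetic class scores (all names and numbers are made up for illustration), shows two equivalent ways to compute accuracy that do not depend on that condition.

```r
set.seed(7)                                              # arbitrary seed
n_classes <- 10
y_true <- sample(1:n_classes, 200, replace = TRUE)       # toy labels starting from 1
preds  <- matrix(runif(200 * n_classes), nrow = 200)     # toy class scores in [0, 1]

# force the true class to win in 120 of the 200 rows (arbitrary choice)
boost <- sample(200, 120)
preds[cbind(boost, y_true[boost])] <- 2

pred_class <- max.col(preds, ties.method = "first")      # predicted class per row
acc_simple <- mean(pred_class == y_true)                 # share of correct predictions

# confusion table with fixed factor levels, so it is always n_classes x n_classes
conf_mat  <- table(factor(y_true, levels = 1:n_classes),
                   factor(pred_class, levels = 1:n_classes))
acc_table <- sum(diag(conf_mat)) / sum(conf_mat)

c(acc_simple = acc_simple, acc_table = acc_table)        # both values agree
```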
Next, I’ll run the KernelKnnCV using the HOG-descriptors,
hog = HOG_apply(X, cells = 6, orientations = 9, rows = 32,
                columns = 32, threads = 6)
##
## time to complete : 3.394621 secs
dim(hog)
## [1] 60000 324
fit_hog = KernelKnnCV(hog, y, k = 20, folds = 4, method = 'braycurtis',
                      weights_function = 'biweight_tricube_MULT', regression = F,
                      threads = 6, Levels = sort(unique(y)))
# str(fit_hog)
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
                            function(x) acc(y[fit_hog$folds[[x]]],
                                            fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.5807333 0.5884000 0.5777333 0.5861333
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.58325
By using hog-descriptors in a 4-fold cross-validation model, the accuracy on the cifar-10 data increases from 40.8% to 58.3% (a difference of approx. 17.5 percentage points).
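The reported means and gains can be recomputed directly from the per-fold accuracies printed in this vignette. The snippet below only redoes that arithmetic; the list and variable names are chosen for illustration.

```r
# per-fold accuracies copied from the outputs in this vignette
folds_acc <- list(
  mnist_svd = c(0.9742857, 0.9749143, 0.9761143, 0.9741143),
  mnist_hog = c(0.9833714, 0.9840571, 0.9846857, 0.9838857),
  cifar_svd = c(0.4080667, 0.4097333, 0.4040000, 0.4102667),
  cifar_hog = c(0.5807333, 0.5884000, 0.5777333, 0.5861333)
)

mean_acc <- sapply(folds_acc, mean)        # mean accuracy per data set and feature type
round(100 * mean_acc, 2)

# gain from HOG features, in percentage points
gain_mnist <- unname(100 * (mean_acc["mnist_hog"] - mean_acc["mnist_svd"]))
gain_cifar <- unname(100 * (mean_acc["cifar_hog"] - mean_acc["cifar_svd"]))
round(c(mnist = gain_mnist, cifar = gain_cifar), 2)   # 0.91 17.52
```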