biganalytics provides a few functions for analysis: linear regression, generalized linear models, and clustering. In this post, I would like to focus on clustering, namely the bigkmeans function. There are several k-means algorithms, for example the Hartigan-Wong, Lloyd, Forgy, and MacQueen methods; bigkmeans implements the last. The authors say in the manual that bigkmeans also works for ordinary matrix objects. Where does bigkmeans outperform the ordinary kmeans? I decided to run an experiment.
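As a small illustration of the algorithm names above, the base R kmeans function lets you pick one through its algorithm argument (the R documentation notes that "Lloyd" and "Forgy" are alternative names for one algorithm). The sketch below uses synthetic data that I made up for this purpose, not Gisette, and only compares the total within-cluster sum of squares.

```r
# Synthetic data (an assumption for illustration): two Gaussian blobs in 2-D.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))

# The four algorithm names accepted by stats::kmeans.
algorithms <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

# Fit k = 2 with each algorithm and keep the total within-cluster sum of squares.
tot.withinss <- sapply(algorithms, function(a) {
  set.seed(1)  # same random initial centers for every algorithm
  kmeans(x, centers = 2, iter.max = 50, nstart = 1, algorithm = a)$tot.withinss
})
print(tot.withinss)
```

On data this easy the algorithms should agree closely; differences tend to show up on harder data and in running time.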
The "Gisette Data Set" was used. It is a well-known handwritten digit recognition problem and one of the datasets of the NIPS 2003 feature selection challenge. It contains 13,500 records and 5,000 features.
The experiments were conducted under two conditions:
1. kmeans with data.frame
2. bigkmeans with matrix
Here is the source code:
First of all, load the biganalytics package and set the parameters for running the k-means algorithms.
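A minimal sketch of such a setup could look like the following. The variable names (size, nsize, centers, iter.max, nstart, algorithm) are the ones used by the timing code in this post, but every value here is my assumption for illustration; only the four subset sizes are taken from the results reported at the end.

```r
# Attach biganalytics if it is installed (it builds on bigmemory).
if (requireNamespace("biganalytics", quietly = TRUE)) {
  library(biganalytics)
}

# ASSUMED parameter values; the values actually used are not stated in the post.
size      <- c(5000, 7500, 10000, 11000)  # subset sizes named in the results
nsize     <- length(size)                 # number of subset sizes to time
centers   <- 2                            # assumed number of clusters
iter.max  <- 10                           # assumed maximum number of iterations
nstart    <- 1                            # assumed number of random starts
algorithm <- "MacQueen"                   # same algorithm that bigkmeans implements
```

Setting algorithm to "MacQueen" for kmeans keeps the two conditions on the same algorithm, so the timing difference is not caused by comparing different methods.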
Second, read the dataset as a data.frame object, and also convert it to a matrix object for use in bigkmeans. Note that the data file was generated by combining "gisette_train.data", "gisette_test.data", and "gisette_valid.data".
# read data
gisette.km <- read.table("../data/gisette_all.data", sep="",
                         header=FALSE)
gisette.bkm <- as.matrix(gisette.km)
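The combining step mentioned above can be done in R as well. The sketch below is only an assumption about how one might do it: it writes three tiny stand-in files (4 columns instead of 5,000) to a temporary directory under the same file names, stacks them row-wise, and writes the combined file.

```r
# Tiny stand-ins for the three Gisette files (hypothetical contents),
# written to a temporary directory so the example is self-contained.
tmp.dir <- tempfile("gisette_")
dir.create(tmp.dir)
parts <- c("gisette_train.data", "gisette_test.data", "gisette_valid.data")
set.seed(1)
for (f in parts) {
  m <- matrix(sample(0:999, 3 * 4, replace = TRUE), nrow = 3)
  write.table(m, file.path(tmp.dir, f), row.names = FALSE, col.names = FALSE)
}

# Read each part as whitespace-separated values and stack the rows in order.
gisette.all <- do.call(rbind, lapply(file.path(tmp.dir, parts), read.table,
                                     sep = "", header = FALSE))

# Write the combined data in the same whitespace-separated format.
write.table(gisette.all, file.path(tmp.dir, "gisette_all.data"),
            row.names = FALSE, col.names = FALSE)
print(dim(gisette.all))
```

With the real files, tmp.dir would be replaced by the directory that holds the downloaded data, and the file-writing loop would be dropped.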
Third, create the object for storing the calculation times, and measure the calculation time in the two cases while varying the size of the dataset.
# create the object for storing calculation times
# (one row per subset size, one column per condition)
calc.time <- matrix(NA, nrow=nsize, ncol=2,
                    dimnames=list(size, c("kmeans with data.frame",
                                          "bigkmeans with matrix")
                    )
)
# measure calculation time
for (i in 1:nsize) {
  size.i <- size[i]
  gisette.km.i <- gisette.km[1:size.i, ]
  gisette.bkm.i <- gisette.bkm[1:size.i, ]
  # 1.kmeans with data.frame
  cat("1.kmeans with data.frame", "\n")
  # element 3 of system.time() is the elapsed time in seconds
  calc.time[i, 1] <- system.time(
    kmeans(gisette.km.i, centers=centers, iter.max=iter.max,
           nstart=nstart, algorithm=algorithm)
  )[3]
  rm(gisette.km.i)
  gc()
  # 2.bigkmeans with matrix
  cat("2.bigkmeans with matrix", "\n")
  calc.time[i, 2] <- system.time(
    bigkmeans(gisette.bkm.i, centers=centers, iter.max=iter.max,
              nstart=nstart)
  )[3]
  rm(gisette.bkm.i)
  gc()
}
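Before running the loop on the full dataset, it may help to try the same timing pattern on something small. The sketch below is a dry run under my own assumptions: it uses synthetic Gaussian data and times base kmeans on a data.frame and on a matrix, so it shows only how much the input type alone matters, not the kmeans versus bigkmeans difference. All names starting with toy are hypothetical.

```r
set.seed(1)
# Synthetic stand-in for the data: 2,000 rows x 50 columns.
toy.mat <- matrix(rnorm(2000 * 50), nrow = 2000)
toy.df  <- as.data.frame(toy.mat)

toy.size <- c(500, 1000, 2000)
toy.time <- matrix(NA_real_, nrow = length(toy.size), ncol = 2,
                   dimnames = list(toy.size, c("data.frame", "matrix")))

for (i in seq_along(toy.size)) {
  n <- toy.size[i]
  # suppressWarnings() hides possible "did not converge" warnings,
  # which do not matter for a timing dry run.
  toy.time[i, 1] <- system.time(suppressWarnings(
    kmeans(toy.df[1:n, ], centers = 2, iter.max = 10,
           nstart = 1, algorithm = "MacQueen")
  ))["elapsed"]
  toy.time[i, 2] <- system.time(suppressWarnings(
    kmeans(toy.mat[1:n, ], centers = 2, iter.max = 10,
           nstart = 1, algorithm = "MacQueen")
  ))["elapsed"]
}
print(toy.time)
```

On data this small the elapsed times are close to the timer resolution, so the numbers are only a check that the loop and the result matrix are wired up correctly.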
Finally, plot the result.
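One way to draw such a plot with base graphics is sketched below. The helper name plot_calc_time, the axis labels, and the title are my assumptions, not necessarily what was used for the figure in this post; the function expects the sizes and a timing matrix with one column per condition.

```r
# Hypothetical helper: draw one line per column of the timing matrix.
plot_calc_time <- function(size, calc.time) {
  k <- ncol(calc.time)
  matplot(size, calc.time, type = "b", pch = seq_len(k), lty = 1,
          col = seq_len(k),
          xlab = "Number of records (N)",
          ylab = "Elapsed time (sec)",
          main = "kmeans vs. bigkmeans")
  # The legend uses the column names of the timing matrix as labels.
  legend("topleft", legend = colnames(calc.time),
         pch = seq_len(k), lty = 1, col = seq_len(k))
  invisible(NULL)
}

# Usage, once size and calc.time exist:
# plot_calc_time(size, calc.time)
```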
The result is shown below:
The plot shows that bigkmeans is faster than kmeans even for an ordinary matrix object: by a factor of 1.26 at N=5000, 1.39 at N=7500, 1.83 at N=10000, and almost 2 at N=11000.
I plan to try datasets with fewer features in the near future.
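Assuming the factors quoted here are the ratio of the kmeans elapsed time to the bigkmeans elapsed time, they can be read straight off the timing matrix. The helper name speedup is hypothetical.

```r
# Hypothetical helper: ratio of column 1 (kmeans) to column 2 (bigkmeans).
# A value of 2 means kmeans took twice as long as bigkmeans at that size.
speedup <- function(calc.time) {
  stopifnot(is.matrix(calc.time), ncol(calc.time) >= 2)
  calc.time[, 1] / calc.time[, 2]
}

# Usage, once calc.time has been filled in by the timing loop:
# round(speedup(calc.time), 2)
```

The result is a vector named by the row names of the matrix, that is, by subset size.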
LINKS:
The Bigmemory Project (vignette)
Big data analysis in R (sorry, in Japanese)