In the typical nonparametric approach to classification in instance-based learning (IBL) and data mining, random data (the training set of patterns) are collected and used to design a decision rule (classifier). IBL algorithms attempt to classify a new, unseen instance (test data) by some "proximal neighbour" rule, i.e. by taking a majority vote among the class labels of its proximal instances in the training or reference dataset. One of the best-known such rules is the k-nearest neighbour (k-NN) decision rule (also known as lazy learning), in which an unknown pattern is classified into the majority class among its k nearest neighbours in the training set. This rule gives low error rates when the training set is large. Geometric proximity graphs, especially the Gabriel graph (a subgraph of the well-known Delaunay triangulation), provide an elegant algorithmic alternative to k-NN-based IBL algorithms [15]. The main reason is that the Gabriel graph preserves the original nearest-neighbour decision boundary between data points of different classes very well. This brings up the problem of efficiently computing the Gabriel graph of large training datasets in high-dimensional space.
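For concreteness, the following is a minimal, unoptimised Python sketch (not the algorithm of [15]) of the two ingredients just described: the k-NN majority-vote rule, and the brute-force Gabriel graph, in which two points are neighbours exactly when no third point lies inside the ball having the segment between them as its diameter. The function names and the O(n^3 d) construction are illustrative assumptions only.

    from collections import Counter
    import numpy as np

    def knn_classify(X_train, y_train, x, k=3):
        """Classical k-NN rule: majority vote among the k nearest training points."""
        d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to the query
        nearest = np.argsort(d2)[:k]              # indices of the k closest points
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

    def gabriel_graph(X):
        """Brute-force Gabriel graph: (i, j) is an edge iff no other point k lies
        inside the closed ball with diameter X[i]X[j], i.e.
        ||X[i]-X[k]||^2 + ||X[j]-X[k]||^2 >= ||X[i]-X[j]||^2 for all k != i, j.
        O(n^3 d) time -- for illustration only, not for large datasets."""
        n = len(X)
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                dij2 = np.sum((X[i] - X[j]) ** 2)
                if all(np.sum((X[i] - X[k]) ** 2) + np.sum((X[j] - X[k]) ** 2) >= dij2
                       for k in range(n) if k not in (i, j)):
                    edges.append((i, j))
        return edges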
In his M.Sc. thesis, Mukherjee [47] considered extending the idea of approximate k-NN search to approximate Gabriel-neighbour search, and designed an algorithm to compute the latter. It is now possible to compute the approximate Gabriel neighbours of a query point in dimension … in … time. We thin (reduce) the original reference dataset by computing its approximate Gabriel graph, and use the thinned set both as a classifier on its own and as input to the IBL algorithms. We achieve excellent empirical results in terms of classification accuracy, reference-set storage reduction and, consequently, query response time. The Gabriel thinning algorithm coupled with the IBL algorithms consistently outperforms the IBL algorithms used alone with respect to storage requirements, while maintaining similar accuracy levels. These results are reported in [15,16].
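As a rough illustration of the thinning step (a simplified sketch only, reusing gabriel_graph and knn_classify from the sketch above; the reported algorithms work with the approximate Gabriel graph in order to handle large, high-dimensional reference sets), one can keep exactly those reference points that have a Gabriel neighbour of a different class, since these are the points that shape the nearest-neighbour decision boundary.

    def gabriel_thin(X, y):
        """Keep only points with at least one Gabriel neighbour of a different
        class; the discarded points do not affect the nearest-neighbour
        decision boundary (simplified, exact-graph version of the idea)."""
        keep = set()
        for i, j in gabriel_graph(X):     # sketch defined earlier
            if y[i] != y[j]:
                keep.update((i, j))
        idx = sorted(keep)
        return X[idx], y[idx]

    # The thinned set can then be used directly as a 1-NN classifier, or
    # handed to any IBL algorithm in place of the full reference set:
    # X_thin, y_thin = gabriel_thin(X_train, y_train)
    # label = knn_classify(X_thin, y_thin, x_query, k=1)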
We are currently working on extending this idea further to areas such as pattern classification with Support Vector Machines (SVMs) and subspace clustering.