In the typical nonparametric approach to classification in instance-based learning (IBL) and data mining, random data (the training set of patterns) are collected and used to design a decision rule (classifier). IBL algorithms attempt to classify a new, unseen instance (test data) by some "proximal neighbour" rule, i.e. by taking a majority vote among the class labels of its proximal instances in the training or reference dataset. One of the best-known such rules is the k-nearest neighbour (k-NN) decision rule (also known as lazy learning), in which an unknown pattern is classified into the majority class among its k nearest neighbours in the training set. This rule gives low error rates when the training set is large. Geometric proximity graphs, especially the Gabriel graph (a subgraph of the well-known Delaunay triangulation), provide an elegant algorithmic alternative to k-NN-based IBL algorithms [15]. The main reason is that the Gabriel graph preserves the original nearest-neighbour decision boundary between data points of different classes very well. This brings up the problem of efficiently computing the Gabriel graph of large training datasets in high-dimensional space.
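For concreteness, the following is a minimal, unoptimised Python sketch (not the algorithm of [15]) of the two ingredients just described: the k-NN majority-vote rule, and the brute-force Gabriel graph, in which two points are neighbours exactly when no third point lies inside the ball having the segment between them as its diameter. The function names and the O(n^3 d) construction are illustrative assumptions only.

    from collections import Counter
    import numpy as np

    def knn_classify(X_train, y_train, x, k=3):
        """Classical k-NN rule: majority vote among the k nearest training points."""
        d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to the query
        nearest = np.argsort(d2)[:k]              # indices of the k closest points
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

    def gabriel_graph(X):
        """Brute-force Gabriel graph: (i, j) is an edge iff no other point k lies
        inside the closed ball with diameter X[i]X[j], i.e.
        ||X[i]-X[k]||^2 + ||X[j]-X[k]||^2 >= ||X[i]-X[j]||^2 for all k != i, j.
        O(n^3 d) time -- for illustration only, not for large datasets."""
        n = len(X)
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                dij2 = np.sum((X[i] - X[j]) ** 2)
                if all(np.sum((X[i] - X[k]) ** 2) + np.sum((X[j] - X[k]) ** 2) >= dij2
                       for k in range(n) if k not in (i, j)):
                    edges.append((i, j))
        return edges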
In his M.Sc. thesis, Mukherjee [47] considered extending the idea of approximate k-NN search to approximate Gabriel-neighbour search, and designed an algorithm to compute the latter. It is now possible to compute the approximate Gabriel neighbours of a query point in dimension … in … time. We thin (reduce) the original reference dataset by computing its approximate Gabriel graph, and use the thinned set both as a classifier on its own and as input to the IBL algorithms. We achieve excellent empirical results in terms of classification accuracy, reference-set storage reduction and, consequently, query response time. The Gabriel thinning algorithm coupled with the IBL algorithms consistently outperforms the IBL algorithms used alone with respect to storage requirements, while maintaining similar accuracy levels. These results are reported in [15,16].
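As a rough illustration of the thinning step (a simplified sketch only, reusing gabriel_graph and knn_classify from the sketch above; the reported algorithms work with the approximate Gabriel graph in order to handle large, high-dimensional reference sets), one can keep exactly those reference points that have a Gabriel neighbour of a different class, since these are the points that shape the nearest-neighbour decision boundary.

    def gabriel_thin(X, y):
        """Keep only points with at least one Gabriel neighbour of a different
        class; the discarded points do not affect the nearest-neighbour
        decision boundary (simplified, exact-graph version of the idea)."""
        keep = set()
        for i, j in gabriel_graph(X):     # sketch defined earlier
            if y[i] != y[j]:
                keep.update((i, j))
        idx = sorted(keep)
        return X[idx], y[idx]

    # The thinned set can then be used directly as a 1-NN classifier, or
    # handed to any IBL algorithm in place of the full reference set:
    # X_thin, y_thin = gabriel_thin(X_train, y_train)
    # label = knn_classify(X_thin, y_thin, x_query, k=1)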
We are currently working on extending this idea further to areas such as pattern classification with Support Vector Machines (SVMs) and subspace clustering.