The Vision and Media Lab, Simon Fraser University

We consider the problem of describing the actions being performed by human figures in still images. We attack this problem using an unsupervised learning approach, attempting to discover the set of action classes present in a large collection of training images.

Our approach consists of three stages. First, the coarse shape of the human figures is used to match pairs of images using shape contexts. Second, a linear programming relaxation technique computes a detailed distance between a pair of images. Because this computation is expensive, we employ a fast pruning method so that it can be applied to a large collection of images. Third, spectral clustering is performed on the resulting distances.
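The prune-then-refine idea can be illustrated with a minimal sketch. This is not the implementation used in this work: the function names, the k-nearest shortlist rule, and both distance functions are assumptions made for illustration. A cheap comparison stands in for the coarse shape matching, and a second function stands in for the expensive linear programming based distance, which is evaluated only for shortlisted pairs.

```python
import math


def coarse_distance(a, b):
    """Hypothetical cheap comparison: Euclidean distance between descriptors."""
    return math.dist(a, b)


def detailed_distance(a, b):
    """Hypothetical stand-in for the expensive detailed distance (here: L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pruned_distances(descriptors, k,
                     coarse=coarse_distance, detailed=detailed_distance):
    """Compute the detailed distance only for coarse near-neighbour pairs.

    For each image, the k closest images under the coarse distance are
    shortlisted (an assumed pruning rule), and the detailed distance is
    evaluated once per shortlisted pair.
    Returns a dict mapping (i, j) with i < j to the detailed distance.
    """
    n = len(descriptors)
    distances = {}
    for i in range(n):
        # Rank all other images by the cheap coarse distance.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: coarse(descriptors[i], descriptors[j]),
        )
        for j in ranked[:k]:
            pair = (min(i, j), max(i, j))
            if pair not in distances:  # avoid recomputing a symmetric pair
                distances[pair] = detailed(descriptors[i], descriptors[j])
    return distances


if __name__ == "__main__":
    # Toy 2-D descriptors forming two well-separated groups.
    toy = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(pruned_distances(toy, k=1))
```

With k much smaller than the number of images, the expensive distance is evaluated for at most n*k pairs rather than all n(n-1)/2. How the pairs left out by pruning are treated in the clustering stage is a separate design choice that this sketch does not cover.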

We present clustering and image labelling results on three datasets: baseball, figure skating, and basketball. The datasets are available here.

For more information, contact Yang Wang.