CMPT 726: Machine Learning Project
The intent of the course project is to give you some practice at
doing research. If you are a new graduate student, this could be your
first time doing research. The important thing to learn is the
correct methodology for doing research. I am open to your own projects
and ideas, as long as you use machine learning in a meaningful way.
If you would like some feedback in advance, I suggest that you come to
my office hours or send me a brief description (1 or 2 paragraphs).
Methodology
The key components, and those on which you will be graded, are:
- Choosing the right problem. Ideally you will have a problem
from your current/potential research area which could benefit from the
use of machine learning techniques. Please feel free to use this
problem for your project. However, you must not submit work you have
done before this course as your project.
If you haven't decided on a research area, or would like to work on
something different, that is fine too. A great resource for datasets
to work on is the UCI
repository.
Don't choose something that is too hard or too simple. If in
doubt, please come to my office hours and ask about your topic. A
rough guideline for grad projects is that they should be approximately
2 times as much work as one assignment.
- What has been done before? A month in the lab can save you
a day in the library. This is a course project, and not a
peer-reviewed paper, but you should be aware of the most closely
related work. In fact, a perfectly good project is to implement a
previous paper (of non-trivial complexity). I expect roughly 3-5
citations to other work as part of your project report.
You must also maintain high standards of academic integrity.
Standing on the shoulders of giants is highly recommended, just make
it clear who these giants are. If you use someone else's code, you
must provide a citation. If you use text/equations from someone
else's paper, you must cite and quote it. If you use figures from
another paper, you must clearly state such.
- Comparative experiments. You must compare what you have
done to at least one other method to know if anything interesting has
been achieved. Proper experiments should only change one component at
a time (e.g. different classifier, same features). You should also
study different parameters of algorithms to ascertain sensitivity
(e.g. regularization parameter values). If you are using a standard
dataset, you can compare your results (one method) to others'. Just
make sure the experiments are comparable (e.g. same training/test data).
You will not be graded on the quality of your results, but
on the quality of your experimental methodology.
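To make the "change one component at a time" idea concrete, here is a minimal sketch of a sensitivity study: a ridge-regression regularization sweep on an identical train/test split. The synthetic data, closed-form solver, and parameter grid are only illustrative placeholders; substitute your own dataset and models.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression data (illustrative; use your own dataset in practice).
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# Same train/test split for every run, so only the parameter varies.
X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Sensitivity study: sweep the regularization parameter on identical data.
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    mse = np.mean((X_te @ w - y_te) ** 2)
    print(f"lambda={lam:>6}: test MSE = {mse:.3f}")
```

Because everything except the regularization parameter is held fixed, any change in test error can be attributed to that parameter alone.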
- Quality of exposition. If you write a paper and nobody can
read it, does it make a contribution? Clearly state the problem
you worked on, the methods you used, who has done what before,
what was the intent of your project, which datasets, and what parameters
you used. Use a spell-checker, create figures with legible fonts and
labelled axes, and provide figures visualizing your results.
A standard project report has four sections:
- Introduction (includes citations to closely related work)
- Approach
- Experiments
- Conclusion
Types of Topics
- Applications to specific problems. I expect this to be the most
common format.
You could apply an existing machine learning algorithm to a problem of
interest to you. There is also value in implementing modifications of
existing algorithms where your application requires them.
- A survey or synthesis of a few related papers on a topic of interest to you. For example, you
could summarize Bayesian approaches to curve fitting or explore new topics like Gaussian processes.
- A theoretical research project. This might look at mathematical
questions, e.g. proving performance guarantees for machine learning
algorithms, or deriving methods from assumptions.
- Implementing Algorithms.
You may implement an already
existing algorithm, for example for future use in the course. For
instance, Weka and Matlab don't seem to have general Bayes net structure
learning algorithms for continuous variables.
- Other listings of course projects from other universities. These
might give you some ideas. Doing a web search of your own is fine.
- CMU 1998
- CMU 2007. This one contains datasets as well.
- The Kaggle competition has
real-world challenge problems and data sets. You can compare your system
with others. If you enter this year's competition, you could win big
bucks! Just managing to post a reasonable entry would be enough for a
course project; you don't have to win.
- The KDD Cup provides clean real-world datasets. These have been analyzed many times so it should be easy for you to find comparison points.
- I would be interested in projects to do with recommendation systems.
A course project could implement one or more statistical algorithms,
together with an evaluation on one or more datasets. A Matlab implementation
would be especially nice for future use in the course. I have some Java
code; there are probably Matlab routines available as well. Here are
some more specific suggestions.
- The most successful type of machine learning technique in recommender systems is based on latent variable models, especially Matrix Factorization; see also this presentation.
You could implement a basic matrix factorization algorithm, together with a state-of-the-art one, and compare.
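As a sketch of the basic algorithm (not the state-of-the-art one), here is matrix factorization trained by stochastic gradient descent on the observed entries of a rating matrix. The toy matrix, latent dimension, and hyperparameters are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # 0 marks an unobserved rating
mask = R > 0

k, lam, lr = 2, 0.05, 0.02                  # latent dim, L2 penalty, step size
U = 0.1 * rng.normal(size=(R.shape[0], k))  # user factors
V = 0.1 * rng.normal(size=(R.shape[1], k))  # item factors

# SGD over observed entries: minimize (R_ij - U_i . V_j)^2 + L2 regularization.
for epoch in range(2000):
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * U[i] - lam * V[j])

rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print(f"training RMSE on observed entries: {rmse:.3f}")
```

Unobserved entries of U @ V.T then serve as predicted ratings; a real comparison would evaluate RMSE on held-out ratings rather than the training entries.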
- The EM algorithm can be applied to learn latent variable models for
recommendation systems (learn "types" of items and "types" of users).
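One hedged sketch of the "types of users" idea: EM for a two-type Bernoulli mixture over a toy binary like/dislike matrix, where each user's hidden type determines which items they tend to like. The data, number of types, and iteration count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy like/dislike matrix: first three users favor items 0-1, last three items 2-3.
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]], dtype=float)
n, d, K = X.shape[0], X.shape[1], 2

pi = np.full(K, 1.0 / K)                 # mixing weights over user types
theta = rng.uniform(0.25, 0.75, (K, d))  # P(like item j | type k)

for _ in range(50):
    # E-step: posterior responsibility of each type for each user.
    log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
    log_p -= log_p.max(axis=1, keepdims=True)
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights and per-type item probabilities.
    pi = r.mean(axis=0)
    theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)

types = r.argmax(axis=1)
print("inferred user types:", types)
```

The same E-step/M-step pattern extends to rating-valued data by swapping the Bernoulli likelihood for a multinomial or Gaussian one.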
- These issues and algorithms can also be addressed in a collaborative
filtering setting where you know links among users, e.g. friendship in a
social network.
- I'm also working on machine learning for relational and network data. Let me know if you are interested in projects related to that.
- You could consider a project around structure learning for Bayes nets (learning the edges). For example:
- Using Bayes net learning for feature selection: before applying a classifier, learn a Bayes net, and remove features (nodes) that are not in the Markov blanket of the class node.
- Efficient Bayes net structure learning, for example by using ADTrees for caching sufficient statistics (data counts).
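The Markov-blanket step in the feature-selection suggestion is easy to sketch once a structure is available; learning the structure itself is the hard part and is assumed done here. The DAG below is hypothetical.

```python
# Markov blanket of a node in a DAG = its parents, its children, and its
# children's other parents. The DAG is given as {node: set-of-parents}.
def markov_blanket(parents, node):
    children = {c for c, ps in parents.items() if node in ps}
    coparents = set().union(*(parents[c] for c in children)) - {node}
    return parents.get(node, set()) | children | coparents

# Hypothetical learned structure over a class node C and features F1..F5.
dag = {"C": {"F1"}, "F2": {"C"}, "F3": {"F4", "C"},
       "F1": set(), "F4": set(), "F5": set()}

# Keep F1 (parent), F2 and F3 (children), F4 (co-parent); drop F5.
keep = markov_blanket(dag, "C")
print("features to keep:", sorted(keep))
```

Features outside the blanket are conditionally independent of the class given the blanket, which is what justifies dropping them before training the classifier.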
Grading Criteria for Final Project
- 30% Presentation. Clarity, conciseness, spelling---quality of exposition.
- 40% Originality. To what extent were you creative in developing your own ideas?
- 30% Evaluation, methodology.
Handing in the Project
The project report should be prepared using the NIPS style
files. The page limit for the project report is 5 pages in this
format. You must submit the report electronically on the submission server in PDF
format. This is the server we have been using throughout the course.
The report is due at 11:59pm on Sunday December 9. This deadline
will be horribly strict.
Back to CMPT 726.