CMPT 726: Machine Learning Project
The intent of the course project is to give you some practice at
doing research. If you are a new graduate student, this could be your
first time doing research. The important thing to learn is the
correct methodology for doing research. I am open to your own projects
and ideas, as long as you use machine learning in a meaningful way.
If you would like some feedback in advance, I suggest that you come to
my office hours or send me a brief description (1 or 2 paragraphs).
Methodology
The key components, and those on which you will be graded, are:
- Choosing the right problem. Ideally you will have a problem
from your current/potential research area which could benefit from the
use of machine learning techniques. Please feel free to use this
problem for your project. However, you must not submit work you have
done before this course as your project.
If you haven't decided on a research area, or would like to work on
something different, that is fine too. A great resource for datasets
to work on is the UCI
repository.
Don't choose something that is too hard or too simple. If in
doubt, please come to my office hours and ask about your topic. A
rough guideline for grad projects is that they should be approximately
2 times as much work as one assignment.
- What has been done before? A month in the lab can save you
a day in the library. This is a course project, and not a
peer-reviewed paper, but you should be aware of the most closely
related work. In fact, a perfectly good project is to implement a
previous paper (of non-trivial complexity). I expect roughly 3-5
citations to other work as part of your project report.
You must also maintain high standards of academic integrity.
Standing on the shoulders of giants is highly recommended, just make
it clear who these giants are. If you use someone else's code, you
must provide a citation. If you use text/equations from someone
else's paper, you must cite and quote it. If you use figures from
another paper, you must clearly state such.
- Comparative experiments. You must compare what you have
done to at least one other method to know if anything interesting has
been achieved. Proper experiments should only change one component at
a time (e.g. different classifier, same features). You should also
study different parameters of algorithms to ascertain sensitivity
(e.g. regularization parameter values). If you are using a standard
dataset, you can compare your results (one method) to others'. Just
make sure the experiments are comparable (e.g. same training/test data).
You will not be graded on the quality of your results, but
on the quality of your experimental methodology.
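To make the "change one component at a time" idea concrete, here is a minimal sketch of a sensitivity study: a ridge-regression regularization sweep on an identical train/test split. The synthetic data, closed-form solver, and parameter grid are only illustrative placeholders; substitute your own dataset and models.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression data (illustrative; use your own dataset in practice).
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# Same train/test split for every run, so only the parameter varies.
X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Sensitivity study: sweep the regularization parameter on identical data.
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    mse = np.mean((X_te @ w - y_te) ** 2)
    print(f"lambda={lam:>6}: test MSE = {mse:.3f}")
```

Because everything except the regularization parameter is held fixed, any change in test error can be attributed to that parameter alone.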
- Quality of exposition. If you write a paper and nobody can
read it, does it make a contribution? Clearly state the problem
you worked on, the methods you used, who has done what before,
what was the intent of your project, which datasets, and what parameters
you used. Use a spell-checker, create figures with legible fonts and
labelled axes, and provide figures visualizing your results.
A standard project report has four sections:
- Introduction (includes citations to closely related work)
- Approach
- Experiments
- Conclusion
Types of Topics
- Applications to specific problems. I expect this to be the most
common format.
You could apply an existing machine learning algorithm to a problem of
interest to you. There is also value in implementing modifications of
existing algorithms where your application requires them.
- A survey or synthesis of a few related papers on a topic of interest to you. For example, you
could summarize Bayesian approaches to curve fitting or explore new topics like Gaussian processes.
- A theoretical research project. This might look at mathematical
questions, e.g. proving performance guarantees for machine learning
algorithms, or deriving methods from assumptions.
- Implementing Algorithms.
You may implement an already
existing algorithm, for example for future use in the course. For
instance, Weka and Matlab don't seem to have general Bayes net structure
learning algorithms for continuous variables.
- Other listings of course projects from other universities. These
might give you some ideas. Doing a web search of your own is fine.
- CMU 1998
- CMU 2007. This one contains datasets as well.
- The Kaggle competition has
real-world challenge problems and data sets. You can compare your system
with others. If you enter this year's competition, you could win big
bucks! Just managing to post a reasonable entry would be enough for a
course project; you don't have to win.
- The KDD Cup provides clean real-world datasets. These have been analyzed many times so it should be easy for you to find comparison points.
- I would be interested in projects to do with recommendation systems.
A course project could implement one or more statistical algorithms,
together with an evaluation on one or more datasets. A Matlab implementation
would be especially nice for future use in the course. I have some Java
code; there are probably Matlab routines available as well. Here are
some more specific suggestions.
- The most successful type of machine learning technique in recommender systems is based on latent variable models, especially Matrix Factorization; see also this presentation.
You could implement a basic matrix factorization algorithm, together with a state-of-the-art one, and compare.
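As a sketch of the basic algorithm (not the state-of-the-art one), here is matrix factorization trained by stochastic gradient descent on the observed entries of a rating matrix. The toy matrix, latent dimension, and hyperparameters are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)   # 0 marks an unobserved rating
mask = R > 0

k, lam, lr = 2, 0.05, 0.02                  # latent dim, L2 penalty, step size
U = 0.1 * rng.normal(size=(R.shape[0], k))  # user factors
V = 0.1 * rng.normal(size=(R.shape[1], k))  # item factors

# SGD over observed entries: minimize (R_ij - U_i . V_j)^2 + L2 regularization.
for epoch in range(2000):
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * U[i] - lam * V[j])

rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print(f"training RMSE on observed entries: {rmse:.3f}")
```

Unobserved entries of U @ V.T then serve as predicted ratings; a real comparison would evaluate RMSE on held-out ratings rather than the training entries.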
- The EM algorithm can be applied to learn latent variable models for
recommendation systems (learn "types" of items and "types" of users).
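One hedged sketch of the "types of users" idea: EM for a two-type Bernoulli mixture over a toy binary like/dislike matrix, where each user's hidden type determines which items they tend to like. The data, number of types, and iteration count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy like/dislike matrix: first three users favor items 0-1, last three items 2-3.
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]], dtype=float)
n, d, K = X.shape[0], X.shape[1], 2

pi = np.full(K, 1.0 / K)                 # mixing weights over user types
theta = rng.uniform(0.25, 0.75, (K, d))  # P(like item j | type k)

for _ in range(50):
    # E-step: posterior responsibility of each type for each user.
    log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
    log_p -= log_p.max(axis=1, keepdims=True)
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights and per-type item probabilities.
    pi = r.mean(axis=0)
    theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)

types = r.argmax(axis=1)
print("inferred user types:", types)
```

The same E-step/M-step pattern extends to rating-valued data by swapping the Bernoulli likelihood for a multinomial or Gaussian one.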
- These issues and algorithms can also be addressed in a collaborative
filtering setting where you know links among users, e.g. friendship in a
social network.
- I'm also working on machine learning for relational and network data. Let me know if you are interested in projects related to that.
- You could consider a project around structure learning for Bayes nets (learning the edges). For example:
- Using Bayes net learning for feature selection: before applying a classifier, learn a Bayes net, and remove features (nodes) that are not in the Markov blanket of the class node.
- Efficient Bayes net structure learning, for example by using ADTrees for caching sufficient statistics (data counts).
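The Markov-blanket step in the feature-selection suggestion is easy to sketch once a structure is available; learning the structure itself is the hard part and is assumed done here. The DAG below is hypothetical.

```python
# Markov blanket of a node in a DAG = its parents, its children, and its
# children's other parents. The DAG is given as {node: set-of-parents}.
def markov_blanket(parents, node):
    children = {c for c, ps in parents.items() if node in ps}
    coparents = set().union(*(parents[c] for c in children)) - {node}
    return parents.get(node, set()) | children | coparents

# Hypothetical learned structure over a class node C and features F1..F5.
dag = {"C": {"F1"}, "F2": {"C"}, "F3": {"F4", "C"},
       "F1": set(), "F4": set(), "F5": set()}

# Keep F1 (parent), F2 and F3 (children), F4 (co-parent); drop F5.
keep = markov_blanket(dag, "C")
print("features to keep:", sorted(keep))
```

Features outside the blanket are conditionally independent of the class given the blanket, which is what justifies dropping them before training the classifier.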
Grading Criteria for Final Project
- 30% Presentation. Clarity, conciseness, spelling---quality of exposition.
- 40% Originality. To what extent were you creative in developing your own ideas?
- 30% Evaluation, methodology.
Handing in the Project
The project report should be prepared using the NIPS style
files. The page limit for the project report is 5 pages in this
format. You must submit the report electronically on the submission server in PDF
format. This is the server we have been using throughout the course.
The report is due at 11:59pm on Sunday December 9. This deadline
will be horribly strict.
Back to CMPT 726.