CMPT-459 Course Project
Important dates
October 29, 2000 -- Deadline to determine project option
November 5, 2000 -- Deadline to determine sub-option for those
take algorithm implementation option
December 11, 2000 (time to be determined) -- Deadline to
submit the project
You have two options to finish the project. One student can only take one
option. You must send peijian@sfu.ca an
email before October 29, 2000 to indicate which
option you take.
In order to finish the project, you need to access some confidential
information. Thus,
you are required to sign a non-disclosure agreement (NDA)
(Postscript file).
Please do that as follows.
- Print and sign the NDA, check related box in the
agreement. Hand it to Dr. Han in person.
- After that, please send an email to han@cs.sfu.ca
and cc peijian@sfu.ca.
- If you would like to have a look at NASA ASRS database before you make the
decision on project options, you can sign the NDA and get the dataset. Only
those working on implementation can get the FP-growth/CLOSET object code.
They will be available on November 19, 2000.
Option 1: Mining NASA ASRS (Aviation Safety Reporting System) Database
This NASA ASRS (Aviation Safety Reporting System) database records more than 80,000 aviation incidents. It is our sincere hope that through the study and analysis of this historical data, together we can help reduce the number and severity of accidents in the future. This data represents the information collected through the Aviation Safety Reporting System’s voluntary incident reports. We have translated the data into readable
form and merged together. There are over 70 searchable fields.
You are asked to analyze and mine a subset of this dataset, which containing
only records of Boeing plans, using the techniques you learn
from the course. You can use any method you would like.
As an example, you can compare the incidents in B727/B737/B747/B757/B767,
find if some model(s) have some part(s) with distinct trouble rate, which
distinguish the model from others.
Data set
The database is recorded in a comma delimited text
file. The file is zipped by WinZip. Before you can get that file,
you are required to sign a non-disclosure agreement
(NDA).
Submission
- A report describing your mining process and the interesting knowledge you
found. You should show the knowledge is non-trivial, interesting and
potentially useful.
- Related programs and data files.
Evaluation Criteria
Your submission will be evaluated based on the rationale of your mining
process and the freshness, interestingness and potential usage of the
knowledge you discovered.
Option 2. Efficient implementation of FP-growth or CLOSET
Please read the following two research papers. ( Both papers are downloadable
from Dr. Han's
and Mr. Pei's home pages.)
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. ACM SIGMOD 2000.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining
frequent closed itemsets. ACM SIGMOD DMKD 2000.
Implement algorithm FP-growth or CLOSET. You must send peijian@sfu.ca
an email before November 5, 2000 to indicate which
sub-option, FP-growth or CLOSET you will implement. The executable file of
FP-growth/CLOSET will be available on November 19, 2000 for your performance
study. Before you get the program, you need to sign a non-disclosure agreement (NDA).
The input & output of your program should be as follows.
Input:
- Transaction file name
- Number of transaction
- Number of items
- Maximal length of transactions
- Support threshold
Output:
- Number of frequent (closed) itemsets
- A file pattern.txt containing all frequent (closed) itemsets
- Runtime
About transaction data file
- A transaction is recorded in form of "#_items_in_this_transaction
item_1 item_2 ...". Each item is represented as an integer. All items
within a transaction are listed lexicographically.
- Two data sets for FP-growth testing are downloadable here (gzipped format D1,
D2,
UNIX compressed format D1,
D2 ),
with 10,000/100,000 transactions and 10,000 items (item number from 1 to 10,000).
The maximal length of transactions is 100. You should use gunzip in UNIX to
unzip the data files.
- The Connect-4 test dataset for CLOSET can be downloaded here.
You should use WinZip to unzip the data file.
Submission
- One executable file. File name should be FP_user-id.exe or CL_user-id.exe.
- Source code. The place to upload it will be announced before deadline.
- A report. Please address following issues.
- Description of your implementation, including key techniques. Please
analyze what are the smartness and efficient ways in your implementation.
- What are the major costs in your implementation of FP-growth or CLOSET?
- According to your experience, what in FP-growth or CLOSET could be
improved?
- (Optional for FP-growth) What are the effects of size of virtual
memory on the runtime of FP-growth algorithm? You can use the Connect-4
dataset to test your program. Is the larger the better, or can you
propose some ratio?
- (Optional for CLOSET) Max-pattern is a frequent itemset such that no
super-itemset is frequent. How to mine max-patterns using CLOSET?
Outline of the algorithm is enough. NO implementation is required.
- (For FP-growth) Report the number of frequent itemsets your program finds on the test
datasets with support threshold 2%, 1.5%, 1% and 0.8%, and related
runtime. Draw related runtime-support threshold curves.
- (For CLOSET) Report the number of frequent closed itemsets your program finds on
the Connect-4 dataset with support threshold 98%, 95%, 85% and 75%, and
related runtime. Draw a runtime-support threshold curve.
- Compare the efficiency of your program and those from the authors
(will be available soon, NDA required).
Evaluation Criteria
- Correctness (40 points). If your program finds the correct set of patterns
you get 40 points. Otherwise, you get portion of 30 points depending on the
error rates.
- Efficiency (30 points).
- Report (30points).