CMPT-459 Course Project


Important dates

October 29, 2000 -- Deadline to determine project option

November 5, 2000 -- Deadline to determine sub-option for those take algorithm implementation option

December 11, 2000 (time to be determined) -- Deadline to submit the project


You have two options to finish the project. One student can only take one option. You must send peijian@sfu.ca an email before October 29, 2000 to indicate which option you take.

In order to finish the project, you need to access some confidential information. Thus, you are required to sign a non-disclosure agreement (NDA) (Postscript file). Please do that as follows.

  1. Print and sign the NDA, check related box in the agreement. Hand it to Dr. Han in person. 
  2. After that, please send an email to han@cs.sfu.ca and cc peijian@sfu.ca.
  3. If you would like to have a look at NASA ASRS database before you make the decision on project options, you can sign the NDA and get the dataset. Only those working on implementation can get the FP-growth/CLOSET object code. They will be available on November 19, 2000.

Option 1: Mining NASA ASRS (Aviation Safety Reporting System) Database

This NASA ASRS (Aviation Safety Reporting System) database records more than 80,000 aviation incidents. It is our sincere hope that through the study and analysis of this historical data, together we can help reduce the number and severity of accidents in the future. This data represents the information collected through the Aviation Safety Reporting System’s voluntary incident reports. We have translated the data into readable form and merged together. There are over 70 searchable fields.

You are asked to analyze and mine a subset of this dataset, which containing only records of Boeing plans, using the techniques you learn from the course. You can use any method you would like.

As an example, you can compare the incidents in B727/B737/B747/B757/B767, find if some model(s) have some part(s) with distinct trouble rate, which distinguish the model from others.

Data set

The database is recorded in a comma delimited text file. The file is zipped by WinZip. Before you can get that file, you are required to sign a non-disclosure agreement (NDA).

Submission

Evaluation Criteria

Your submission will be evaluated based on the rationale of your mining process and the freshness,  interestingness and potential usage of the knowledge you discovered.

Option 2. Efficient implementation of FP-growth or CLOSET

Please read the following two research papers. ( Both papers are downloadable from Dr. Han's and Mr. Pei's home pages.)

Implement algorithm FP-growth or CLOSET. You must send peijian@sfu.ca an email before November 5, 2000 to indicate which sub-option, FP-growth or CLOSET you will implement. The executable file of FP-growth/CLOSET will be available on November 19, 2000 for your performance study. Before you get the program, you need to sign a non-disclosure agreement (NDA).

The input & output of your program should be as follows.

Input:

Output:

About transaction data file

  1. A transaction is recorded in form of "#_items_in_this_transaction item_1 item_2 ...". Each item is represented as an integer. All items within a transaction are listed lexicographically.
  2. Two data sets for FP-growth testing are downloadable here (gzipped format D1, D2, UNIX compressed format D1, D2 ), with 10,000/100,000 transactions and 10,000 items (item number from 1 to 10,000). The maximal length of transactions is 100. You should use gunzip in UNIX to unzip the data files.
  3. The Connect-4 test dataset for CLOSET can be downloaded here. You should use WinZip to unzip the data file.

Submission

Evaluation Criteria

  1. Correctness (40 points). If your program finds the correct set of patterns you get 40 points. Otherwise, you get portion of 30 points depending on the error rates.
  2. Efficiency (30 points).
  3. Report (30points).