Course Title: Data Mining


Number of Units: 4

Schedule: Three hours of lecture and one hour of discussion per week.

Prerequisites: Basic concepts and algorithms from probability and statistics

Catalog Description:
Knowledge discovery is the process of discovering useful regularities in large and complex data sets. The field encompasses techniques from artificial intelligence (representation and search), statistics (inference), and databases (data storage and access). When integrated into useful systems, these techniques can help human analysts make sense of vast stores of digital information. This course presents the fundamental principles of the field and familiarizes students with the technical details of representative algorithms.

Expanded Description:
  1. Data pre-processing 
    • Data cleaning
    • Data transformation
    • Data reduction
    • Discretization
  2. Association rules and sequential patterns
    • Basic concepts
    • Apriori Algorithm
    • Mining association rules with multiple minimum supports
    • Mining class association rules
    • Sequential pattern mining
  3. Supervised learning (Classification) 
    • Basic concepts
    • Decision trees
    • Classifier evaluation
    • Rule induction
    • Classification based on association rules
    • Naive-Bayesian learning
    • Naive-Bayesian learning for text classification
    • Support vector machines
    • K-nearest neighbor
  4. Unsupervised learning (Clustering)
    • Basic concepts
    • K-means algorithm
    • Representation of clusters
    • Hierarchical clustering
    • Distance functions
    • Data standardization
    • Handling mixed attributes
    • Which clustering algorithm to use?
    • Cluster evaluation
    • Discovering holes and data regions
  5. Post-processing 
    • Objective interestingness
    • Subjective interestingness
  6. Information retrieval and Web search
    • Basic text processing and representation
    • Cosine similarity
    • Relevance feedback and Rocchio algorithm
  7. Partially supervised learning
    • Semi-supervised learning
      • Learning from labeled and unlabeled examples using EM
      • Learning from labeled and unlabeled examples using co-training
    • Learning from positive and unlabeled examples
  8. Link analysis 
    • Social network analysis
    • Citation analysis: co-citation and bibliographic coupling
    • The PageRank algorithm (of Google)
    • The HITS algorithm: authorities and hubs
    • Mining communities on the Web
  9. Data extraction and information integration 

Course Objectives & Role in the Program:
This course has three objectives. First, to provide students with a sound basis in data mining tasks and techniques. Second, to ensure that students are able to read and critically evaluate data mining research papers. Third, to ensure that students are able to implement and use some of the important data mining and text mining algorithms.

Method of Evaluation:
  1. Midterm: 25%
  2. Final Exam: 40%
  3. Projects: 
    • Project 1: Algorithm implementation (15%)
    • Project 2: Research project (including implementation) (20%)

Required Books:

Textbooks:
  1. Building an Intelligent Web: Theory & Practice, R. Akerkar & P. Lingras; Jones & Bartlett, 2007. 
  2. Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 1-55860-489-8.

Reference books:
  1. Principles of Data Mining, by David Hand, Heikki Mannila, Padhraic Smyth, The MIT Press, ISBN 0-262-08290-X.
  2. Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley, ISBN 0-321-32136-7.
  3. Data mining resource site: KDnuggets Directory
