ISE-535 Data Mining

Course Description

Data mining is the discipline of extracting useful insights from large quantities of data. As such, this class focuses on inference rather than prediction (which is the focus of ISE-529).

This course is organized into two broad sections:

  • Exploratory data analysis and statistical data analysis techniques for finding useful information in data.
  • Algorithm-based data mining techniques and statistical (machine) learning-based techniques for data classification, clustering analysis, association pattern mining, data stream mining, and outlier analysis.

Wherever possible, this course teaches the concepts through case studies that use actual, or simulated but realistic, business data.

Learning Objectives and Outcomes

  • Develop an advanced level of proficiency with the preprocessing, visualization, and statistical analysis of data as well as several of the primary data mining algorithmic techniques.
  • Review and reinforce basic statistical concepts that are important in the field of data science.
  • Apply unsupervised modeling techniques (including clustering and association rule mining) to analyze and obtain insights from data.
  • By the end of the semester, take raw data and perform all of the steps necessary to generate a professional data analysis report.

Textbooks

The theoretical material in the course is drawn from the following texts:

  • Pang-Ning Tan, et al., Introduction to Data Mining, 2nd ed., 2019, ISBN 978-0-13-312890-1
  • Charu C. Aggarwal, Data Mining, Springer, 2015, ISBN 978-3-319-14141-1
  • Bruce, et al., Practical Statistics for Data Scientists, O’Reilly, 2020 (PSDS)

Course Content and Schedule

  • Module 1 – Introduction to Data Mining
    • Data/feature types
    • Data structuring and cleansing
  • Module 2 – Foundations of Exploratory Data Analysis
    • Introduction to EDA and descriptive statistics
    • Data visualization techniques
    • Handling missing and anomalous data
    • Generating comprehensive EDA reports
    • Feature distributions and relationships
  • Module 3 – Statistical Foundations for Analytics
    • Statistical inference and hypothesis testing
    • Computational statistics (resampling, bootstrapping, permutation testing)
    • Probability distributions for analytics
    • Statistical assumptions and model validity
    • Avoiding false discoveries and understanding p-hacking
  • Module 4 – Inference and Interpretability in Supervised Learning
    • Overview of model transparency
    • Global interpretation techniques
    • Local interpretation techniques
    • Model-agnostic vs. model-specific transparency techniques
  • Module 5 – Finding Structure in Data: Unsupervised Learning Methods
    • Clustering techniques
    • Dimensionality reduction
    • Anomaly detection
  • Module 6 – Pattern Discovery with Association Rules
    • Key metrics for association rules
    • Algorithms for mining association rules
    • Applications and use cases
  • Module 7 – Mining Data Streams