ISE-535 Data Mining – Bruce Wilcox

Course Description

Data mining is the discipline of extracting useful insights from large quantities of data. As such, the focus in this class is on inference and not on prediction (which is the focus of ISE-529).

This course is organized into two broad sections:

Exploratory data analysis and statistical data analysis techniques to find useful information from data.
Algorithm-based data mining techniques and statistical (machine) learning-based techniques for data classification, cluster analysis, association pattern mining, data stream mining, and outlier analysis.

To the maximum extent possible, this course teaches concepts through case studies that use actual or realistically simulated business data.

Learning Objectives and Outcomes

Develop advanced proficiency in the preprocessing, visualization, and statistical analysis of data, as well as in several of the primary data mining algorithmic techniques.
Review and reinforce basic statistical concepts that are important in the field of data science.
Apply unsupervised modeling techniques (including clustering and association rule mining) to analyze and obtain insights from data.
By the end of the semester, be able to take raw data and perform all of the steps necessary to generate a professional data analysis report.

Textbooks

The theoretical material in the course is drawn from the following texts:

Pang-Ning Tan et al., Introduction to Data Mining, 2nd ed., 2019, ISBN 978-0-13-312890-1
Charu C. Aggarwal, Data Mining, Springer, 2015, ISBN 9783319141411
Bruce et al., Practical Statistics for Data Scientists, O’Reilly, 2020 (PSDS)

Course Content and Schedule

Module 1 – Introduction to Data Mining
- Data/feature types
- Data structuring and cleansing
Module 2 – Foundations of Exploratory Data Analysis
- Introduction to EDA and descriptive statistics
- Data visualization techniques
- Handling missing and anomalous data
- Generating comprehensive EDA reports
- Feature distributions and relationships
Module 3 – Statistical Foundations for Analytics
- Statistical inference and hypothesis testing
- Computational statistics (resampling, bootstrapping, permutation testing)
- Probability distributions for analytics
- Statistical assumptions and model validity
- Avoiding false discoveries and understanding p-hacking
Module 4 – Inference and Interpretability in Supervised Learning
- Overview of model transparency
- Global interpretation techniques
- Local interpretation techniques
- Model-agnostic vs. model-specific transparency techniques
Module 5 – Finding Structure in Data: Unsupervised Learning Methods
- Clustering techniques
- Dimensionality reduction
- Anomaly detection
Module 6 – Pattern Discovery with Association Rules
- Key metrics for association rules
- Algorithms for mining association rules
- Applications and use cases
Module 7 – Mining Data Streams