Orchestrated Data Pipelines Research Project

The objective of this project is to research data pipeline architectures, technologies, and best practices and to develop a realistic orchestrated data pipeline.

Past research projects have developed a realistic set of source data for a hypothetical large professional services (consulting) firm consisting of:

  • A relational database of client projects/engagements
  • Spreadsheets containing information on indirect costs and non-billable time charges by consulting staff
  • A JSON file containing data from client surveys
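
As an illustration of working with the survey file, the JSON records could be parsed and summarized in Python. This is a minimal sketch: the field names (`client_id`, `satisfaction`, etc.) are hypothetical assumptions, not the actual survey schema developed in past projects.

```python
import json

# Hypothetical sample of client survey records; the real JSON file's
# schema may differ.
sample = """
[
  {"client_id": 101, "project_id": "P-2024-001", "satisfaction": 4},
  {"client_id": 102, "project_id": "P-2024-002", "satisfaction": 5},
  {"client_id": 101, "project_id": "P-2024-003", "satisfaction": 3}
]
"""

def average_satisfaction(raw_json: str) -> float:
    """Parse survey records and compute the mean satisfaction score."""
    records = json.loads(raw_json)
    return sum(r["satisfaction"] for r in records) / len(records)

print(average_satisfaction(sample))  # 4.0
```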

Also developed is the schema for a data warehouse for this firm consisting of the following three data marts:

  • Project management star schema
  • Financial analysis star schema
  • Client feedback star schema
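
Each of these marts pairs a central fact table with dimension tables joined on surrogate keys. A minimal sqlite3 sketch of a project-management-style mart follows; the table and column names here are illustrative assumptions, not the actual warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One fact table plus two dimensions (names/columns are assumptions)
cur.executescript("""
CREATE TABLE dim_client (client_key INTEGER PRIMARY KEY, client_name TEXT);
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_project_hours (
    client_key INTEGER REFERENCES dim_client(client_key),
    date_key   INTEGER REFERENCES dim_date(date_key),
    billable_hours REAL
);
""")

cur.execute("INSERT INTO dim_client VALUES (1, 'Acme Corp'), (2, 'Globex')")
cur.execute("INSERT INTO dim_date VALUES (202501, '2025-01')")
cur.executemany(
    "INSERT INTO fact_project_hours VALUES (?, ?, ?)",
    [(1, 202501, 120.0), (2, 202501, 80.0), (1, 202501, 40.0)],
)

# A typical star-schema query: billable hours rolled up by client
rows = cur.execute("""
    SELECT c.client_name, SUM(f.billable_hours)
    FROM fact_project_hours f
    JOIN dim_client c ON c.client_key = f.client_key
    GROUP BY c.client_name
    ORDER BY c.client_name
""").fetchall()
print(rows)  # [('Acme Corp', 160.0), ('Globex', 80.0)]
```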

The objectives for the spring 2025 semester are as follows:

  • Create a GitHub repository for the data sources and the code that generates and populates them
  • Validate the data sources by running test queries against them
  • Create incremental versions of the data sources (to simulate monthly refresh cycles)
  • Research and select the architecture for the data pipeline, including a platform (Snowflake or Google Cloud) and an orchestration tool (Airflow, Kubeflow, or Snowflake Tasks)
  • Implement and test the orchestrated data pipeline

The specific products that we hope to produce during the spring semester are summarized below:

  • A survey paper on data pipeline architectures, technologies, and best practices
  • A working data pipeline for ultimate use in future ISE-558 class projects
  • A PowerPoint presentation to accompany the working data pipeline for use in ISE-558 lectures

We have openings for 3-5 students for this project. Required background: completion of ISE-558, strong SQL and Python skills, and an interest in Data Engineering.