Orchestrated Data Pipelines Research Project

The objective of this project is to research data pipeline architectures, technologies, and best practices and to develop a realistic orchestrated data pipeline.

Past research projects have developed a realistic set of source data for a hypothetical large professional services (consulting) firm consisting of:

  • A relational database of client projects/engagements
  • Spreadsheets containing information on indirect costs and non-billable time charges by consulting staff
  • A JSON file containing data from client surveys
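
As an illustration of working with the survey file, the JSON records could be parsed and summarized in Python. This is a minimal sketch: the field names (`client_id`, `satisfaction`, etc.) are hypothetical assumptions, not the actual survey schema developed in past projects.

```python
import json

# Hypothetical sample of client survey records; the real JSON file's
# schema may differ.
sample = """
[
  {"client_id": 101, "project_id": "P-2024-001", "satisfaction": 4},
  {"client_id": 102, "project_id": "P-2024-002", "satisfaction": 5},
  {"client_id": 101, "project_id": "P-2024-003", "satisfaction": 3}
]
"""

def average_satisfaction(raw_json: str) -> float:
    """Parse survey records and compute the mean satisfaction score."""
    records = json.loads(raw_json)
    return sum(r["satisfaction"] for r in records) / len(records)

print(average_satisfaction(sample))  # 4.0
```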

Also developed is the schema for a data warehouse for this firm consisting of the following three data marts:

  • Project management star schema
  • Financial analysis star schema
  • Client feedback star schema
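
Each of these marts pairs a central fact table with dimension tables joined on surrogate keys. A minimal sqlite3 sketch of a project-management-style mart follows; the table and column names here are illustrative assumptions, not the actual warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One fact table plus two dimensions (names/columns are assumptions)
cur.executescript("""
CREATE TABLE dim_client (client_key INTEGER PRIMARY KEY, client_name TEXT);
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_project_hours (
    client_key INTEGER REFERENCES dim_client(client_key),
    date_key   INTEGER REFERENCES dim_date(date_key),
    billable_hours REAL
);
""")

cur.execute("INSERT INTO dim_client VALUES (1, 'Acme Corp'), (2, 'Globex')")
cur.execute("INSERT INTO dim_date VALUES (202501, '2025-01')")
cur.executemany(
    "INSERT INTO fact_project_hours VALUES (?, ?, ?)",
    [(1, 202501, 120.0), (2, 202501, 80.0), (1, 202501, 40.0)],
)

# A typical star-schema query: billable hours rolled up by client
rows = cur.execute("""
    SELECT c.client_name, SUM(f.billable_hours)
    FROM fact_project_hours f
    JOIN dim_client c ON c.client_key = f.client_key
    GROUP BY c.client_name
    ORDER BY c.client_name
""").fetchall()
print(rows)  # [('Acme Corp', 160.0), ('Globex', 80.0)]
```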

The objectives for the spring 2025 semester are as follows:

  • Create a GitHub repository for the data sources and the code that generates and populates them
  • Validate the data sources by running test queries against them
  • Create incremental versions of the data sources (to simulate monthly refresh cycles)
  • Research and select the architecture for the data pipeline, including a platform (Snowflake or Google Cloud) and an orchestration tool (Airflow, Kubeflow, or Snowflake Tasks)
  • Implement and test the orchestrated data pipeline

The specific products that we hope to produce during the spring semester are summarized below:

  • A survey paper on data pipeline architectures, technologies, and best practices
  • A working data pipeline for ultimate use in future ISE-558 class projects
  • A PowerPoint presentation to accompany the working data pipeline for use in ISE-558 lectures

We have openings for 3-5 students for this project. Required background: completion of ISE-558, strong SQL and Python skills, and an interest in Data Engineering.