The objective of this project is to research data pipeline architectures, technologies, and best practices and to develop a realistic orchestrated data pipeline.
Past research projects have developed a realistic set of source data for a hypothetical large professional services (consulting) firm consisting of:
- A relational database of client projects/engagements
- Spreadsheets containing information on indirect costs and non-billable time charges by consulting staff
- A JSON file containing client survey data (a sample record is sketched below)
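As a purely illustrative sketch, one record in the survey JSON file might look like the following. Every field name here is hypothetical; the actual survey schema was defined in the earlier projects.

```python
import json

# Hypothetical shape of one client-survey record; the real field names
# and value ranges come from the prior semesters' work and may differ.
sample_survey = {
    "survey_id": 1042,
    "engagement_id": "ENG-2024-0137",  # would link back to the projects database
    "client_name": "Acme Manufacturing",
    "survey_date": "2024-11-15",
    "overall_satisfaction": 4,         # e.g., a 1-5 Likert scale
    "would_recommend": True,
    "comments": "Responsive team; deliverables arrived on schedule.",
}

# Round-trip through json to confirm the record is well-formed JSON.
print(json.dumps(sample_survey, indent=2))
```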
Also developed is the schema for a data warehouse for this firm, consisting of the following three data marts (a minimal star schema sketch follows the list):
- Project management star schema
- Financial analysis star schema
- Client feedback star schema
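To make the star schema idea concrete, the sketch below declares a stripped-down version of what the project management mart could look like: one fact table keyed to surrounding dimension tables. The table and column names are hypothetical placeholders; the project's actual warehouse schema, defined in prior semesters, is richer than this.

```python
import sqlite3

# Minimal, hypothetical star schema sketch: a fact table referencing
# dimension tables. SQLite stands in for the eventual warehouse platform.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_client (
    client_key   INTEGER PRIMARY KEY,
    client_name  TEXT
);
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,  -- e.g., 20250131
    month        INTEGER,
    year         INTEGER
);
CREATE TABLE fact_project (
    project_key    INTEGER PRIMARY KEY,
    client_key     INTEGER REFERENCES dim_client(client_key),
    start_date_key INTEGER REFERENCES dim_date(date_key),
    budget_usd     REAL,
    hours_billed   REAL
);
""")
conn.close()
```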
The objectives for the spring 2025 semester are as follows:
- Create a GitHub repository for the data sources and for the code that generates and populates them
- Validate the data sources by running test queries against them (see the validation sketch after this list)
- Create incremental versions of the data sources (to simulate monthly refresh cycles)
- Research and select the architecture for the data pipeline, including a platform (Snowflake or Google Cloud) and an orchestration tool (Airflow, Kubeflow, or Snowflake Tasks)
- Implement and test the orchestrated data pipeline (a minimal Airflow sketch follows this list)
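For the validation objective, one kind of test query checks referential integrity between fact and dimension tables. The sketch below reuses the hypothetical table names from the star schema example above and assumes a populated SQLite file at `warehouse.db`; both are placeholders for whatever the project actually builds.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical path to a populated warehouse

# One possible validation: every fact row must reference an existing client.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM fact_project f
    LEFT JOIN dim_client c ON f.client_key = c.client_key
    WHERE c.client_key IS NULL
""").fetchone()[0]

assert orphans == 0, f"{orphans} fact rows reference a missing client"
conn.close()
```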
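If Airflow were the orchestrator selected, the pipeline could be wired along the lines below. This is a sketch under that assumption, not the project's final design: the DAG id, task ids, and callables are hypothetical placeholders, and a Snowflake Tasks or Kubeflow implementation would look quite different.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull from the projects database, the cost/time
    # spreadsheets, and the client-survey JSON file.
    pass


def transform():
    # Placeholder: reshape the extracted data into the three star schemas.
    pass


def load():
    # Placeholder: load the transformed tables into the warehouse.
    pass


with DAG(
    dag_id="consulting_firm_monthly_refresh",  # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@monthly",  # mirrors the simulated monthly refresh cycle
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_sources", python_callable=extract)
    transform_task = PythonOperator(task_id="build_star_schemas", python_callable=transform)
    load_task = PythonOperator(task_id="load_warehouse", python_callable=load)

    extract_task >> transform_task >> load_task
```

The `@monthly` schedule matches the monthly refresh cycles that the incremental versions of the data sources are meant to simulate.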
The specific products that we hope to produce during the spring semester are summarized below:
- A survey paper on data pipeline architectures, technologies, and best practices
- A working data pipeline for eventual use in future ISE-558 class projects
- A PowerPoint presentation to accompany the working data pipeline for use in ISE-558 lectures
We have openings for 3-5 students on this project. Required background: completion of ISE-558, strong SQL and Python skills, and an interest in data engineering.