In Fall 2019, CKIDS hosted six DataFest projects, and fourteen additional projects in collaboration with the GRIDS data science student association. These projects were proposed by USC faculty and researchers through an open call for project proposals. Below is a short overview of all twenty projects.
Selected DataFest Fall 2019 Projects
1. Understanding Internet Communities through Videogames
PROJECT DESCRIPTION: Online multiplayer games provide a wealth of data that can be used to study human behaviors. Many questions can be investigated with rich datasets of online game player actions, interactions, and targeted survey questions. We have a wide range of ongoing student projects that use this data to study a variety of human behaviors.
SKILLS NEEDED: R
WHAT STUDENTS WILL LEARN: Social network analysis, text analysis, pattern discovery from online communication
ADVISORS:
- Dmitri Williams, Annenberg School for Communication and Journalism
- Fred Morstatter, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Junchu Zhang, M.Sc. student in Business Analytics, Marshall School of Business
- Himan Kriplani, M.Sc. student in Computer Science, Viterbi School of Engineering
- Shatad Purohit, Ph.D. student in Astronautical Engineering, Viterbi School of Engineering
- Jinney Guo, M.Sc. student in Business Analytics, Marshall School of Business
2. Measuring population-level nutrition and dietary habits from Instagram
PROJECT DESCRIPTION: This project will investigate the quality of Instagram textual posts as a source of data for measurements of dietary patterns and nutrition quality, focusing on spatial and textual features of posts linked to food outlets. Using an Instagram dataset of all geo-located posts at food outlets in Los Angeles for 3 months in 2014, this project will investigate whether Instagram posts, despite implicit biases (and to the extent possible, accounting for these biases), can provide a representative health signal, informative of the quality of population nutrition and dietary patterns at a highly-resolved (e.g. census tract level) spatial scale.
POINTERS: Dataset described and published in: “#FoodPorn: Obesity Patterns in Culinary Interactions” (this project will focus on the subset of those posts from Los Angeles).
SKILLS NEEDED: R or Matlab preferably, optional: text analytics, social network analysis, and statistical modeling
WHAT STUDENTS WILL LEARN: Tracking real-life health signals from social media data; machine learning / pattern mining to cluster behaviors; spatial statistical analysis using big data, combined from various sources (social media data, official public health statistics)
ADVISORS:
- Abigail Horn, Keck School of Medicine
- Kayla de la Haye, Keck School of Medicine
- Andres Abeliuk, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Divyatmika Lnu, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Shagun Gupta, M.Sc. student in Computer Science, Viterbi School of Engineering
- Nina Thiebaut, M.Sc. student in Analytics, Viterbi School of Engineering
- Nisha Tiwari, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Ian Choi, M.Sc. student in Applied Data Science, Viterbi School of Engineering
3. Tracking Coastal Change at Catalina Island
PROJECT DESCRIPTION: Since 1992, the USC Wrigley Institute for Environmental Studies ‘Catalina Conservation Divers’ (CCD) have been collecting underwater biological and environmental data from coastal ocean sites around Catalina Island, California. In cooperation with the USC Wrigley Institute, the CCD team (made up of community scientists and volunteer SCUBA divers) conducts quarterly surveys of marine species and benthic water temperatures at various depths and locations. The Wrigley Institute has been collecting and archiving this data for years, but the data has not been holistically studied to date. We need assistance in analyzing the data for trends across location, ocean depth, and time.
POINTERS: Project details at https://dornsife.usc.edu/wrigley/wies-ccd/ (FYI: data not currently online)
SKILLS NEEDED: Basic statistics and basic programming
WHAT STUDENTS WILL LEARN: This project will help students apply data science skills toward understanding and disseminating an exciting long-term novel dataset about environmental change in our local environment; and generating information to inform marine researchers and the natural resource management community. [Day trip to the USC Wrigley Marine Science Center on Catalina also very likely!]
ADVISORS:
- Jessica Dutton, Dornsife College of Letters, Arts, and Sciences
- Diane Kim, Dornsife College of Letters, Arts, and Sciences
- Deborah Khider, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Wei-Fan Chen, M.Sc. student in Applied Data Science, Viterbi School of Engineering
- Sameeksha Mahajan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Zijing Zhang, M.Sc. student in Analytics, Viterbi School of Engineering
- Pratheek Athreya, M.Sc. student in Applied Data Science, Viterbi School of Engineering
4. Data Mining Past Climates
PROJECT DESCRIPTION: Estimates of climate variations over the past 1,000 years play an increasing role in climate assessments. A key quantity to derive from them is the transient climate response (TCR), which quantifies the warming expected from slowly rising CO2 concentrations. TCR helps constrain the climate models used to predict the future evolution of Earth’s climate. In this project, you will help design an efficient workflow to estimate TCR from existing paleoclimate datasets and emerging statistical methods.
DATASETS/CONTEXT: https://www.nature.com/articles/sdata201788; http://pastglobalchanges.org/science/wg/2k-network/nature-geosc-2k-july-19
SKILLS NEEDED: Python
WHAT STUDENTS WILL LEARN: Statistical modeling, how to design and execute a data analytic pipeline
ADVISORS:
- Julien Emile-Geay, Dornsife College of Letters, Arts and Sciences
- Deborah Khider, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Yuxuan Ji, M.Sc. student in Computer Science, Viterbi School of Engineering
- Zhifeng Liu, M.Sc. student in Applied Data Science, Viterbi School of Engineering
- Ashka Patel, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Aditi Choudhary, M.Sc. student in Computer Science, Viterbi School of Engineering
- Shelly Mehta, M.Sc. student in Computer Science, Viterbi School of Engineering
5. Understanding human environmental perceptions using multi-biometric signals in the built environment
PROJECT DESCRIPTION: Building occupants are always surrounded by several indoor environmental quality (IEQ) elements, such as thermal, visual, air, and acoustic conditions. As a result, occupants’ environmental comfort and work productivity are significantly affected by IEQ conditions, especially in residential, office, and educational facilities. This research investigates the relationships between users’ IEQ comfort perceptions, IEQ conditions, and their biometric signals, in order to identify individual IEQ perception as a function of changes in single or combined bio-signals. The study outcome has the potential to be integrated with existing building mechanical/electrical control systems to improve users’ IEQ conditions while contributing to their comfort and well-being in the built environment.
POINTERS: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1707068&HistoricalAwards=false
SKILLS NEEDED: R, SPSS, Weka or similar data mining tools, Python (secondary)
WHAT STUDENTS WILL LEARN: physiological signal analysis, feature analysis, multi-variable correlation analysis, bio-signal synchrony analysis
ADVISORS:
- Joon-Ho Choi, School of Architecture
- Seon Kim, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Manoj Muralidhara, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Shubham Banka, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Gaurav Gupta, M.Sc. student in Computer Science, Viterbi School of Engineering
6. Mining Side Effects in Cancer Treatment
PROJECT DESCRIPTION: SideEffects is a resource that gives cancer patients access to treatment and side-effect information tailored to their treatment and disease history. The app sources content from clinical data, the National Cancer Institute, social media, and user input from an app.
RESPONSIBILITIES: Gather and organize existing data from different sources about cancer, its treatment, and care management.
ADVISORS:
- Thuy Thanh Truong, Keck School of Medicine
- Ken Nguyen
- Fred Morstatter, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Justin Ho, M.Sc. student in Computer Science (Scientists & Engineers), Viterbi School of Engineering
- Sankareswari Govindarajan, M.Sc. student in Computer Science, Viterbi School of Engineering
- Kevin Tran, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Yi-Hsin Chung, M.Sc. student in Business Analytics, Marshall School of Business
- Sangeeth Koratten, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Selected CKIDS Fall 2019 Projects
7. Modeling the career trajectory of music artists
PROJECT DESCRIPTION: Many musicians, from up-and-comers to established artists, rely heavily on performing live to promote and disseminate their music. To advertise live shows, artists often use concert discovery platforms that make it easier for their fans to track tour dates. In this project, we ask whether digital traces of musical performances generated on those platforms can be used to understand career trajectories of artists. We have constructed a dataset by cross-referencing data from such platforms (Songkick and Discogs). In this project, you will identify and explore patterns that can be used to identify successful musicians.
SKILLS NEEDED: Python
WHAT STUDENTS WILL LEARN: Predictive modeling, how to design and execute a machine learning workflow
ADVISOR:
- Fred Morstatter, Viterbi School of Engineering
8. Automated generation of paper authors
PROJECT DESCRIPTION: This project will result in an open-source software tool that will have general applicability for scientific publications. Papers with hundreds of authors are not uncommon in science, and it often takes many weeks to compile an author list in the desired order with proper affiliations and acknowledgments. We have implemented an algorithm that generates the author information for a paper based on the type of contribution of each author within the ENIGMA neuroscience consortium. This project would extend this software to read in compiled spreadsheets or forms and extract information about universities and other institutions from structured web sources, to interoperate with widely-used frameworks such as Wikidata.
POINTERS: http://enigma.ini.usc.edu/, https://doi.org/10.1101/399402
SKILLS NEEDED: Python (or similar), optionally RDF or databasing tools, and web interfaces
WHAT STUDENTS WILL LEARN: Deployment of open source tools that are interoperable with existing science infrastructure.
ADVISOR:
- Neda Jahanshad, Keck School of Medicine
STUDENT PARTICIPANTS:
- Ruiyu Zhao, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
- Xingyu Wei, M.Sc. student in Computer Science (Scientists & Engineers), Viterbi School of Engineering
- Tieming Sun, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
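The author-list task in project 8 can be illustrated with a minimal Python sketch. This is not the ENIGMA algorithm. The contribution categories, their ordering, the function name `build_author_list`, and the sample names are all hypothetical and chosen only to show the idea of ordering authors by contribution type and numbering affiliations in order of first appearance.

```python
from collections import OrderedDict

# Hypothetical contribution ranks: a lower rank appears earlier in the byline.
CONTRIBUTION_RANK = {"lead": 0, "analysis": 1, "data": 2, "senior": 3}


def build_author_list(authors):
    """Order authors by contribution type, then name, and number affiliations.

    `authors` is a list of dicts with keys "name", "contribution", "affiliations".
    Returns (byline, affiliation_lines).
    """
    ordered = sorted(
        authors,
        key=lambda a: (CONTRIBUTION_RANK[a["contribution"]], a["name"]),
    )
    # Each distinct affiliation gets a number in order of first appearance.
    numbers = OrderedDict()
    entries = []
    for author in ordered:
        marks = []
        for aff in author["affiliations"]:
            if aff not in numbers:
                numbers[aff] = len(numbers) + 1
            marks.append(str(numbers[aff]))
        entries.append(f"{author['name']} ({','.join(marks)})")
    byline = ", ".join(entries)
    affiliation_lines = [f"{n}. {aff}" for aff, n in numbers.items()]
    return byline, affiliation_lines


# Made-up sample input, e.g. rows read from a compiled spreadsheet.
sample_authors = [
    {"name": "B. Senior", "contribution": "senior", "affiliations": ["University Y"]},
    {"name": "A. Lead", "contribution": "lead", "affiliations": ["University X", "University Y"]},
    {"name": "C. Data", "contribution": "data", "affiliations": ["University X"]},
]
byline, affiliation_lines = build_author_list(sample_authors)
print(byline)
print("\n".join(affiliation_lines))
```

In a fuller tool, the affiliation strings would presumably be resolved against a structured source such as Wikidata before numbering, so that spelling variants of the same institution share one entry.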
9. A visual analytic toolkit for cultural biases
PROJECT DESCRIPTION: This project will result in a visual analytics toolkit that will enable social scientists to understand the cultural groups and biases at play in a social dataset. News, books, and social media all contain biases that stem from the cultural background of the author(s). We have developed algorithms to identify the cultural groups at play in an arbitrary dataset, as well as natural language processing approaches that can discover the biases of each group. This project would help put these tools into the hands of social scientists by displaying the output of these algorithms in novel visualizations.
SKILLS NEEDED: Python, JavaScript
WHAT STUDENTS WILL LEARN: Interfaces for communicating biases to social scientists
ADVISOR:
- Fred Morstatter, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Mansi Ganatra, M.Sc. student in Applied Data Science, Viterbi School of Engineering
- Saatvik Tikoo, M.Sc. student in Computer Science, Viterbi School of Engineering
10. Modeling uncertainty in drought data
PROJECT DESCRIPTION: Droughts can have a substantial impact on agricultural systems and human livelihood. A Python package to calculate various drought indices is being developed. In this project, you will expand on this package and develop methods to test the sensitivity of the models to various input datasets and parameters.
POINTERS: Drought products will be generated from national weather products available here.
SKILLS NEEDED: Python
WHAT STUDENTS WILL LEARN: Uncertainty modeling, designing and implementing workflows.
ADVISOR:
- Deborah Khider, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Shravya Manety, M.Sc. student in Computer Science, Viterbi School of Engineering
- Abhilash Pandurangan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
11. Automated time series analysis
PROJECT DESCRIPTION: This project will result in a Python package for automated time series analysis. Based on the characteristics of the data, you will design functions that (1) perform essential tasks in data cleaning and select appropriate methodologies, (2) implement various algorithms currently not supported through pandas and scikit-learn, and (3) create appropriate visualizations.
POINTERS: Example datasets available here. Software is available here: https://github.com/LinkedEarth/Pyleoclim_util
SKILLS NEEDED: Python, familiarity with PyPI
WHAT STUDENTS WILL LEARN: Time series analysis, deployment of open source tools, designing and implementing workflows.
ADVISOR:
- Deborah Khider, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Feng Zhu, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Myron Kwan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Shilpa Thomas, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
- Nikhil Dhara Venkata, M.Sc. student in Computer Science, Viterbi School of Engineering
- Deepanshu Madan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
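As a rough illustration of the data-cleaning step named in project 11, the sketch below drops missing values, sorts by time, resamples onto an even grid by linear interpolation, and removes a linear trend. It uses only the Python standard library and does not reflect the Pyleoclim API. The function names and the sample data are hypothetical, and the sketch assumes at least two valid points with distinct times.

```python
import math


def clean_series(time, values):
    """Drop missing values and sort the remaining points by time."""
    points = [(t, v) for t, v in zip(time, values) if not math.isnan(v)]
    points.sort()
    return [t for t, _ in points], [v for _, v in points]


def resample_even(time, values, step):
    """Linearly interpolate onto an evenly spaced grid spanning the data."""
    n_steps = int(math.floor((time[-1] - time[0]) / step + 1e-9))
    grid = [time[0] + i * step for i in range(n_steps + 1)]
    out, j = [], 0
    for g in grid:
        # Advance to the segment [time[j], time[j + 1]] that contains g.
        while j < len(time) - 2 and time[j + 1] < g:
            j += 1
        t0, t1 = time[j], time[j + 1]
        frac = (g - t0) / (t1 - t0)
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return grid, out


def linear_trend(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up, unevenly sampled series with one missing value and one out-of-order point.
raw_time = [0.0, 1.0, 2.0, 4.0, 3.0, 6.0]
raw_values = [1.0, 3.0, float("nan"), 9.0, 7.0, 13.0]

t, v = clean_series(raw_time, raw_values)
grid, resampled = resample_even(t, v, step=1.0)
slope, intercept = linear_trend(grid, resampled)
detrended = [y - (slope * x + intercept) for x, y in zip(grid, resampled)]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

An automated package would likely choose the interpolation and detrending methods from the characteristics of the data; the fixed choices here are only for illustration.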
12. Creating and visualizing a linked knowledge base of crime data
PROJECT DESCRIPTION: A lot of data is available on the web in tabular form, but it is difficult to manipulate and visualize without significant effort. In this project, we aim to test a novel framework created at ISI to build and visualize knowledge bases. The objective is to create a knowledge base that extends other resources on the web, such as Wikidata or Wikipedia, and to visualize the results using interactive maps and plots.
SKILLS NEEDED: Knowledge representation, RDF, Python (basic)
WHAT STUDENTS WILL LEARN: Interfaces, knowledge base construction, population and querying
ADVISOR:
- Daniel Garijo, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Vedant Diwanji, M.Sc. student in Computer Science, Viterbi School of Engineering
- Yi-Li Chen, M.Sc. student in Analytics, Viterbi School of Engineering
- Andrew Zhao, M.Sc. student in Business Analytics, Marshall School of Business
- Haripriya Dharmala, M.Sc. student in Computer Science, Viterbi School of Engineering
13. Building an open catalog of integrated datasets for Los Angeles
PROJECT DESCRIPTION: While many open data efforts have successfully exposed public data on the web, it is often complicated to determine how these records can be integrated with each other (due to heterogeneous IDs, uncertainty about how to place them on a map, etc.). In this project, the student will leverage novel techniques for integrating, registering, and connecting datasets with overlapping elements. The results will be visualized by the student using interactive maps.
SKILLS NEEDED: Python
WHAT STUDENTS WILL LEARN: REST APIs, querying knowledge bases, data augmentation and visualization
ADVISOR:
- Daniel Garijo, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Anjana Niranjan, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
- Kushagra Singh Sachan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Chi Sheng Yang, M.Sc. student in Computer Science, Viterbi School of Engineering
14. Building Sports Data Knowledge Graphs
PROJECT DESCRIPTION: Public sports data is often spread across many differing sources, creating issues of entity resolution and record linkage. Knowledge graphs are a popular conceptual technology for storing, fusing and querying information from such disparate sources. This project will focus on building a sports data knowledge graph, from various open data and asset (e.g. video) sources/API, based on a Wikidata infrastructure.
POINTERS: https://github.com/statsbomb/open-data, http://collegefootballdata.com, http://espn.com, https://github.com/maksimhorowitz/nflscrapR, etc.
SKILLS NEEDED: Python, Unix system skills, SPARQL
WHAT STUDENTS WILL LEARN: Data acquisition and normalization, knowledge graph construction, modeling and analysis
ADVISOR:
- Jeremy Abramson (abramson@isi.edu)
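The record-linkage problem mentioned in project 14 can be sketched in a few lines of Python. This is a minimal illustration that assumes two sources list the same teams under slightly different spellings. It uses the standard-library `difflib.SequenceMatcher` for string similarity, which is a stand-in and not a method named by the project. The record IDs, team names, threshold, and function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and strip punctuation so spelling variants compare more easily."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def link_records(source_a, source_b, threshold=0.85):
    """Return (id_a, id_b, score) for the best match of each record in source_a.

    Records whose best score falls below `threshold` are left unlinked.
    """
    links = []
    for id_a, name_a in source_a.items():
        best_id, best_score = None, 0.0
        for id_b, name_b in source_b.items():
            score = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
            if score > best_score:
                best_id, best_score = id_b, score
        if best_score >= threshold:
            links.append((id_a, best_id, round(best_score, 2)))
    return links


# Made-up records from two hypothetical sources.
source_a = {"a1": "Los Angeles Rams", "a2": "USC Trojans", "a3": "Chelsea FC"}
source_b = {"b7": "los angeles rams", "b9": "U.S.C. Trojans", "b3": "Seattle Seahawks"}

links = link_records(source_a, source_b)
for id_a, id_b, score in links:
    print(id_a, "->", id_b, score)
```

In a knowledge graph built on a Wikidata infrastructure, linked records would typically be merged under one shared entity identifier, and unlinked records (such as `a3` here) would become new entities or be queued for manual review.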
15. Tell us where it hurts
PROJECT DESCRIPTION: LA Care has a mission to “provide access to quality health care for Los Angeles County’s vulnerable and low-income communities and residents and to support the safety net required to achieve that purpose.” In the many coordinated activities LA Care conducts to provide a comprehensive health insurance safety net, it collects massive amounts of healthcare data. With advances in analytics enabled by AI approaches (e.g. predictive modeling, machine learning, model refinement and validation), the organization is looking for ways to mine and analyze its data to drive optimization and improvement of product development, marketing techniques and business strategies. Students will work with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions. The ability to identify and address “pain points” will depend on the skills that students bring to the project.
SKILLS NEEDED: A flexible combination from basic to advanced knowledge of Python, R, JavaScript, machine learning techniques, statistical and data mining techniques.
WHAT STUDENTS WILL LEARN: How to work with different functional teams to identify and prioritize problems and then develop, implement, and monitor models to address challenges and improve business processes that will lead to improved health outcomes among the members of LA Care’s covered community.
ADVISORS:
- George Tolomiczenko, Keck School of Medicine
- Phil McAbee, LA Care
STUDENT PARTICIPANTS:
- Yihang Chen, M.Sc. student in Healthcare Data Science, Viterbi School of Engineering and Keck School of Medicine
- Lisa Meng, M.Sc. student in Healthcare Data Science, Viterbi School of Engineering and Keck School of Medicine
- Chiaofeng Yang, M.Sc. student in Healthcare Data Science, Viterbi School of Engineering and Keck School of Medicine
- Nelson Lam, M.Sc. student in Biostatistics, Keck School of Medicine
16. Capturing provenance of data analyses
PROJECT DESCRIPTION: Documenting how a result was obtained from data analysis involves recording the software, software settings, and datasets used to obtain that result so it can be explained properly. This project will design and develop a user interface for specifying provenance records using W3C standards. The interface will enable users to document the provenance of data analysis no matter what infrastructure they used (R scripts, scikit-learn, etc.).
POINTERS: This project will extend the ASSET workflow sketching interface (http://asset-project.info/sketching.html) to capture provenance sketches.
SKILLS NEEDED: Firebase, JavaScript.
WHAT STUDENTS WILL LEARN: Interfaces for capturing data analysis steps, provenance standards, data analysis workflow representations.
ADVISOR:
- Yolanda Gil, Viterbi School of Engineering
STUDENT PARTICIPANT:
- Rahul Jeswani, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
17. Data infrastructure for USC
PROJECT DESCRIPTION: This project will develop software and data resources for USC students. Software resources include tools to process and analyze specific types of data (e.g., social networks, images, text), data preparation tools, or machine learning libraries. Data resources include thematic data repositories, such as urban LA data, environmental LA data, entertainment LA data, etc.
SKILLS NEEDED: Open source software development, data services.
WHAT STUDENTS WILL LEARN: Development of enterprise infrastructure.
ADVISOR:
- Yolanda Gil, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Yang Dai, M.Sc. student in Spatial Data Science, Viterbi School of Engineering and Dornsife College of Letters, Arts, and Sciences
- Zixuan Zhang, B.Sc. student in Electrical Engineering, Viterbi School of Engineering
- Shenoy Pratik Gurudatt, M.Sc. student in Computer Science, Viterbi School of Engineering
- Parul Gupta, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
- Yu Wang, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
- Sanjiv Soni, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
18. Detecting deep fakes
PROJECT DESCRIPTION: The spread of misinformation has become a significant problem, raising the importance of relevant detection methods. While there are different manifestations of misinformation, this project will focus on detecting face manipulations in videos. We exploit the temporal dynamics of videos with recurrent networks.
SKILLS NEEDED: Prior experience in image processing, programming.
WHAT STUDENTS WILL LEARN: Deep learning, image and video processing.
ADVISOR:
- Wael Abd-Almageed, Viterbi School of Engineering
STUDENT PARTICIPANTS:
- Shenoy Pratik Gurudatt, M.Sc. student in Computer Science, Viterbi School of Engineering
19. Measuring Pollution Benefits from Congestion Pricing Initiatives
PROJECT DESCRIPTION: Using real-time big data from Los Angeles freeways on traffic and Aclima data on pollution measurements, this project will estimate the links between speed and pollution. Estimating this relationship properly is crucial for knowing the benefits that congestion pricing may generate in terms of pollution reduction. Computer Science methods will be used to guide the choice of policy intervention and guide prediction.
DESIRED SKILLS: R, Machine Learning
WHAT STUDENTS WILL LEARN: Traffic modeling, pollution modeling, econometric tools to estimate the causal effects of policy interventions, methods for integrating CS tools into public policy solutions-oriented research
ADVISOR:
- Antonio Bento, Price School of Public Policy
20. Learning to Connect: Modeling Social Network Dynamics and Evolution by Imitation Learning
PROJECT DESCRIPTION: In this research, we aim to model how human players make connection decisions in an online game where players are free to add or delete a friend, as well as join a clan.
ADVISORS:
- Emilio Ferrara, Viterbi School of Engineering
- Dmitri Williams, Annenberg School for Communication and Journalism
STUDENT PARTICIPANTS:
- Yiley Zeng, Ph.D. student in Computer Science, Viterbi School of Engineering