CKIDS DataFest Fall 2019 Project Descriptions – Center for Knowledge-Powered Interdisciplinary Data Science (CKIDS)

In Fall 2019, CKIDS hosted six DataFest projects, and fourteen additional projects in collaboration with the GRIDS data science student association. These projects were proposed by USC faculty and researchers through an open call for project proposals. Below is a short overview of all twenty projects.

Selected DataFest Fall 2019 Projects

1. Understanding Internet Communities through Videogames

PROJECT DESCRIPTION: Online multiplayer games provide a wealth of data that can be used to study human behaviors. Many questions can be investigated with rich datasets of online game player actions, interactions, and targeted survey questions. We have many ongoing student projects that use this data to study a range of human behaviors.

SKILLS NEEDED: R

WHAT STUDENTS WILL LEARN: Social network analysis, text analysis, pattern discovery from online communication

ADVISORS:

Dmitri Williams, Annenberg School for Communication and Journalism
Fred Morstatter, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Junchu Zhang, M.Sc. student in Business Analytics, Marshall School of Business
Himan Kriplani, M.Sc. student in Computer Science, Viterbi School of Engineering
Shatad Purohit, Ph.D. student in Astronautical Engineering, Viterbi School of Engineering
Jinney Guo, M.Sc. student in Business Analytics, Marshall School of Business

2. Measuring population-level nutrition and dietary habits from Instagram

PROJECT DESCRIPTION: This project will investigate the quality of Instagram textual posts as a source of data for measuring dietary patterns and nutrition quality, focusing on spatial and textual features of posts linked to food outlets. Using an Instagram dataset of all geo-located posts at food outlets in Los Angeles for 3 months in 2014, this project will investigate whether Instagram posts, despite implicit biases (accounted for to the extent possible), can provide a representative health signal that is informative of the quality of population nutrition and dietary patterns at a highly resolved (e.g. census tract level) spatial scale.

POINTERS: Dataset described and published in: “#FoodPorn: Obesity Patterns in Culinary Interactions” (this project will focus on the subset of those posts from Los Angeles).

SKILLS NEEDED: R or MATLAB preferred; optional: text analytics, social network analysis, and statistical modeling

WHAT STUDENTS WILL LEARN: Tracking real-life health signals from social media data; machine learning / pattern mining to cluster behaviors; spatial statistical analysis using big data, combined from various sources (social media data, official public health statistics)

ADVISORS:

Abigail Horn, Keck School of Medicine
Kayla de la Haye, Keck School of Medicine
Andres Abeliuk, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Divyatmika Lnu, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Shagun Gupta, M.Sc. student in Computer Science, Viterbi School of Engineering
Nina Thiebaut, M.Sc. student in Analytics, Viterbi School of Engineering
Nisha Tiwari, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Ian Choi, M.Sc. student in Applied Data Science, Viterbi School of Engineering

3. Tracking Coastal Change at Catalina Island

PROJECT DESCRIPTION: Since 1992, the USC Wrigley Institute for Environmental Studies ‘Catalina Conservation Divers’ (CCD) have been collecting underwater biological and environmental data from coastal ocean sites around Catalina Island, California. In cooperation with the USC Wrigley Institute, the CCD team (made up of community scientists and volunteer SCUBA divers) conducts quarterly surveys of marine species and benthic water temperatures at various depths and locations. The Wrigley Institute has been collecting and archiving this data for years, but the data has not been studied holistically to date. We need assistance in analyzing the data for trends across location, ocean depth, and time.

POINTERS: Project details at https://dornsife.usc.edu/wrigley/wies-ccd/ (note: the data is not currently online)

SKILLS NEEDED: Basic statistics and basic programming

WHAT STUDENTS WILL LEARN: This project will help students apply data science skills toward understanding and disseminating an exciting, novel long-term dataset about change in our local environment, and toward generating information for marine researchers and the natural resource management community. [A day trip to the USC Wrigley Marine Science Center on Catalina is also very likely!]

ADVISORS:

Jessica Dutton, Dornsife College of Letters, Arts, and Sciences
Diane Kim, Dornsife College of Letters, Arts, and Sciences
Deborah Khider, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Wei-Fan Chen, M.Sc. student in Applied Data Science, Viterbi School of Engineering
Sameeksha Mahajan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Zijing Zhang, M.Sc. student in Analytics, Viterbi School of Engineering
Pratheek Athreya, M.Sc. student in Applied Data Science, Viterbi School of Engineering

4. Data Mining Past Climates

PROJECT DESCRIPTION: Estimates of climate variations over the past 1,000 years play an increasing role in climate assessments. A key quantity to derive from them is the transient climate response (TCR), which quantifies the warming expected from slowly rising CO2 concentrations. TCR helps constrain the climate models used to predict the future evolution of Earth’s climate. In this project, you will help design an efficient workflow to estimate TCR from existing paleoclimate datasets and emerging statistical methods.

DATASETS/CONTEXT: https://www.nature.com/articles/sdata201788; http://pastglobalchanges.org/science/wg/2k-network/nature-geosc-2k-july-19

SKILLS NEEDED: Python

WHAT STUDENTS WILL LEARN: Statistical modeling, how to design and execute a data analytic pipeline

ADVISORS:

Julien Emile-Geay, Dornsife College of Letters, Arts and Sciences
Deborah Khider, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Yuxuan Ji, M.Sc. student in Computer Science, Viterbi School of Engineering
Zhifeng Liu, M.Sc. student in Applied Data Science, Viterbi School of Engineering
Ashka Patel, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Aditi Choudhary, M.Sc. student in Computer Science, Viterbi School of Engineering
Shelly Mehta, M.Sc. student in Computer Science, Viterbi School of Engineering

5. Understanding human environmental perceptions using multi-biometric signals in the built environment

PROJECT DESCRIPTION: Building occupants are always surrounded by several indoor environmental quality (IEQ) elements, such as thermal, visual, air, and acoustic conditions. These IEQ conditions significantly affect occupants’ environmental comfort and work productivity, especially in residential, office, and educational facilities. This research investigates the relationships between users’ IEQ comfort perceptions, IEQ conditions, and their biometric signals, to understand how to identify individual IEQ perception as a function of single or combined bio-signals (changes). The study outcome has the potential to be integrated with existing building mechanical/electrical control systems to enhance users’ IEQ conditions while contributing to their comfort and well-being in the built environment.

POINTERS: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1707068&HistoricalAwards=false

SKILLS NEEDED: R, SPSS, Weka or similar data mining tools, Python (secondary)

WHAT STUDENTS WILL LEARN: Physiological signal analysis, feature analysis, multi-variable correlation analysis, bio-signal synchrony analysis

ADVISORS:

Joon-Ho Choi, School of Architecture
Seon Kim, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Manoj Muralidhara, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Shubham Banka, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Gaurav Gupta, M.Sc. student in Computer Science, Viterbi School of Engineering

6. Mining Side Effects in Cancer Treatment

PROJECT DESCRIPTION: SideEffects is a resource for cancer patients to access treatment and side-effect information tailored to the patient’s treatment and disease history. The app sources content from clinical data, the National Cancer Institute, social media, and user input from the app.

RESPONSIBILITIES: Gather and organize existing data from different sources about cancer, its treatment, and care management.

ADVISORS:

Thuy Thanh Truong, Keck School of Medicine
Ken Nguyen
Fred Morstatter, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Justin Ho, M.Sc. student in Computer Science (Scientists & Engineers), Viterbi School of Engineering
Sankareswari Govindarajan, M.Sc. student in Computer Science, Viterbi School of Engineering
Kevin Tran, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Yi-Hsin Chung, M.Sc. student in Business Analytics, Marshall School of Business
Sangeeth Koratten, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering

Selected CKIDS Fall 2019 Projects

7. Modeling the career trajectory of music artists

PROJECT DESCRIPTION: Many musicians, from up-and-comers to established artists, rely heavily on performing live to promote and disseminate their music. To advertise live shows, artists often use concert discovery platforms that make it easier for their fans to track tour dates. In this project, we ask whether digital traces of musical performances generated on those platforms can be used to understand career trajectories of artists. We have constructed a dataset by cross-referencing data from such platforms (Songkick and Discogs). In this project, you will explore patterns that can be used to identify successful musicians.

SKILLS NEEDED: Python

WHAT STUDENTS WILL LEARN: Predictive modeling, how to design and execute a machine learning workflow

ADVISOR:

Fred Morstatter, Viterbi School of Engineering

8. Automated generation of paper authors

PROJECT DESCRIPTION: This project will result in an open-source software tool that will have general applicability for scientific publications. Papers with hundreds of authors are not uncommon in science, and it often takes many weeks to compile an author list in the desired order with proper affiliations and acknowledgments. We have implemented an algorithm that generates the author information for a paper based on the type of contribution of each author within the ENIGMA neuroscience consortium. This project would extend this software to read in compiled spreadsheets or forms and extract information about universities and other institutions from structured web sources, to interoperate with widely-used frameworks such as Wikidata.

POINTERS: http://enigma.ini.usc.edu/, https://doi.org/10.1101/399402

SKILLS NEEDED: Python (or similar), optionally RDF or database tools, and web interfaces

WHAT STUDENTS WILL LEARN: Deployment of open source tools that are interoperable with existing science infrastructure.

ADVISOR:

Neda Jahanshad, Keck School of Medicine

STUDENT PARTICIPANTS:

Ruiyu Zhao, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
Xingyu Wei, M.Sc. student in Computer Science (Scientists & Engineers), Viterbi School of Engineering
Tieming Sun, M.Sc. student in Electrical Engineering, Viterbi School of Engineering

9. A visual analytic toolkit for cultural biases

PROJECT DESCRIPTION: This project will result in a visual analytics toolkit that will enable social scientists to understand the cultural groups and biases at play in a social dataset. News, books, and social media all contain biases that stem from the cultural background of the author(s). We have developed algorithms to identify the cultural groups at play in an arbitrary dataset, as well as natural language processing approaches that can discover the biases of each group. This project would help put these tools into the hands of social scientists by displaying the output of these algorithms in novel visualizations.

SKILLS NEEDED: Python, JavaScript

WHAT STUDENTS WILL LEARN: Interfaces for communicating biases to social scientists

ADVISOR:

Fred Morstatter, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Mansi Ganatra, M.Sc. student in Applied Data Science, Viterbi School of Engineering
Saatvik Tikoo, M.Sc. student in Computer Science, Viterbi School of Engineering

10. Modeling uncertainty in drought data

PROJECT DESCRIPTION: Droughts can have a substantial impact on agricultural systems and human livelihood. A Python package to calculate various drought indices is being developed. In this project, you will expand on this package and develop methods to test the sensitivity of the models to various input datasets and parameters.

POINTERS: Drought products will be generated from national weather products available here.

SKILLS NEEDED: Python

WHAT STUDENTS WILL LEARN: Uncertainty modeling, designing and implementing workflows.

ADVISOR:

Deborah Khider, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Shravya Manety, M.Sc. student in Computer Science, Viterbi School of Engineering
Abhilash Pandurangan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering

11. Automated time series analysis

PROJECT DESCRIPTION: This project will result in a Python package for automated time series analysis. Based on the characteristics of the data, you will design functions that (1) perform essential tasks in data cleaning and select appropriate methodologies, (2) implement various algorithms currently not supported through pandas and scikit-learn, and (3) create appropriate visualizations.

POINTERS: Example datasets available here. Software is available here: https://github.com/LinkedEarth/Pyleoclim_util

SKILLS NEEDED: Python, familiarity with PyPI

WHAT STUDENTS WILL LEARN: Time series analysis, deployment of open source tools, designing and implementing workflows.

ADVISOR:

Deborah Khider, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Feng Zhu, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Myron Kwan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Shilpa Thomas, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
Nikhil Dhara Venkata, M.Sc. student in Computer Science, Viterbi School of Engineering
Deepanshu Madan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering

12. Creating and visualizing a linked knowledge base of crime data

PROJECT DESCRIPTION: A lot of data is available on the web in tabular form, but it is difficult to manipulate and visualize without significant effort. In this project, we aim to test a novel framework created at ISI to build and visualize knowledge bases. The objective is to create a knowledge base that extends other resources on the web, such as Wikidata or Wikipedia, and to visualize the results using interactive maps and plots.

SKILLS NEEDED: Knowledge representation, RDF, Python (basic)

WHAT STUDENTS WILL LEARN: Interfaces, knowledge base construction, population and querying

ADVISOR:

Daniel Garijo, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Vedant Diwanji, M.Sc. student in Computer Science, Viterbi School of Engineering
Yi-Li Chen, M.Sc. student in Analytics, Viterbi School of Engineering
Andrew Zhao, M.Sc. student in Business Analytics, Marshall School of Business
Haripriya Dharmala, M.Sc. student in Computer Science, Viterbi School of Engineering

13. Building an open catalog of integrated datasets for Los Angeles

PROJECT DESCRIPTION: While many open data efforts have successfully exposed public data on the web, it is often complicated to determine how these records can be integrated with each other (due to heterogeneous IDs, uncertainty about how to place them on a map, etc.). In this project, students will leverage novel techniques for integrating, registering, and connecting datasets with overlapping elements. The results will be visualized using interactive maps.

SKILLS NEEDED: Python

WHAT STUDENTS WILL LEARN: REST APIs, querying knowledge bases, data augmentation and visualization

ADVISOR:

Daniel Garijo, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Anjana Niranjan, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
Kushagra Singh Sachan, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Chi Sheng Yang, M.Sc. student in Computer Science, Viterbi School of Engineering

14. Building Sports Data Knowledge Graphs

PROJECT DESCRIPTION: Public sports data is often spread across many differing sources, creating issues of entity resolution and record linkage. Knowledge graphs are a popular conceptual technology for storing, fusing, and querying information from such disparate sources. This project will focus on building a sports data knowledge graph from various open data and asset (e.g. video) sources and APIs, based on a Wikidata infrastructure.

POINTERS: https://github.com/statsbomb/open-data, http://collegefootballdata.com, http://espn.com, https://github.com/maksimhorowitz/nflscrapR, etc.

SKILLS NEEDED: Python, Unix system skills, SPARQL

WHAT STUDENTS WILL LEARN: Data acquisition and normalization, knowledge graph construction, modeling and analysis

ADVISOR:

Jeremy Abramson, abramson@isi.edu

15. Tell us where it hurts

PROJECT DESCRIPTION: LA Care has a mission to “provide access to quality health care for Los Angeles County’s vulnerable and low-income communities and residents and to support the safety net required to achieve that purpose.” In the many coordinated activities LA Care conducts to provide a comprehensive health insurance safety net, it collects massive amounts of healthcare data. With advances in analytics enabled by AI approaches (e.g. predictive modeling, machine learning, model refinement and validation), the organization is looking for ways to mine and analyze its data to drive optimization and improvement of product development, marketing techniques and business strategies. Students will work with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions. The ability to identify and address “pain points” will depend on the skills that students bring to the project.

SKILLS NEEDED: A flexible combination of basic to advanced knowledge of Python, R, JavaScript, machine learning, and statistical and data mining techniques.

WHAT STUDENTS WILL LEARN: How to work with different functional teams to identify and prioritize problems and then develop, implement, and monitor models to address challenges and improve business processes that will lead to improved health outcomes among the members of LA Care’s covered community.

ADVISORS:

George Tolomiczenko, Keck School of Medicine
Phil McAbee, LA Care

STUDENT PARTICIPANTS:

Yihang Chen, M.Sc. student in Healthcare Data Science, Viterbi School of Engineering and Keck School of Medicine
Lisa Meng, M.Sc. student in Healthcare Data Science, Viterbi School of Engineering and Keck School of Medicine
Chiaofeng Yang, M.Sc. student in Healthcare Data Science, Viterbi School of Engineering and Keck School of Medicine
Nelson Lam, M.Sc. student in Biostatistics, Keck School of Medicine

16. Capturing provenance of data analyses

PROJECT DESCRIPTION: Documenting how a result was obtained from data analysis involves documenting the software, software settings, and datasets used to obtain that result so it can be explained properly. This project will design and develop a user interface for specifying provenance records using W3C standards. The interface will enable users to document the provenance of data analysis no matter what infrastructure they used (R scripts, scikit-learn, etc.).

POINTERS: This project will extend the ASSET workflow sketching interface (http://asset-project.info/sketching.html) to capture provenance sketches.

SKILLS NEEDED: Firebase, JavaScript.

WHAT STUDENTS WILL LEARN: Interfaces for capturing data analysis steps, provenance standards, data analysis workflow representations.

ADVISOR:

Yolanda Gil, Viterbi School of Engineering

STUDENT PARTICIPANT:

Rahul Jeswani, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering

17. Data infrastructure for USC

PROJECT DESCRIPTION: This project will develop software and data resources for USC students. Software resources include tools to process and analyze specific types of data (e.g. social networks, images, text, etc.), data preparation tools, or machine learning libraries. Data resources include thematic data repositories, such as urban LA data, environmental LA data, entertainment LA data, etc.

SKILLS NEEDED: Open source software development, data services.

WHAT STUDENTS WILL LEARN: Development of enterprise infrastructure.

ADVISOR:

Yolanda Gil, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Yang Dai, M.Sc. student in Spatial Data Science, Viterbi School of Engineering and Dornsife College of Letters, Arts, and Sciences
Zixuan Zhang, B.Sc. student in Electrical Engineering, Viterbi School of Engineering
Shenoy Pratik Gurudatt, M.Sc. student in Computer Science, Viterbi School of Engineering
Parul Gupta, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering
Yu Wang, M.Sc. student in Electrical Engineering, Viterbi School of Engineering
Sanjiv Soni, M.Sc. student in Computer Science (Data Science), Viterbi School of Engineering

18. Detecting deep fakes

PROJECT DESCRIPTION: The spread of misinformation has become a significant problem, raising the importance of relevant detection methods. While there are different manifestations of misinformation, this project will focus on detecting face manipulations in videos. We exploit the temporal dynamics of videos with recurrent networks.

SKILLS NEEDED: Prior experience in image processing, programming.

WHAT STUDENTS WILL LEARN: Deep learning, image and video processing.

ADVISOR:

Wael Abd-Almageed, Viterbi School of Engineering

STUDENT PARTICIPANT:

Shenoy Pratik Gurudatt, M.Sc. student in Computer Science, Viterbi School of Engineering

19. Measuring Pollution Benefits from Congestion Pricing Initiatives

PROJECT DESCRIPTION: Using real-time big data from Los Angeles freeways on traffic and Aclima data on pollution measurements, this project will estimate the links between speed and pollution. Estimating this relationship properly is crucial for knowing the benefits that congestion pricing may generate in terms of pollution reduction. Computer Science methods will be used to guide the choice of policy intervention and guide prediction.

DESIRED SKILLS: R, Machine Learning

WHAT STUDENTS WILL LEARN: Traffic modeling, pollution modeling, econometric tools to estimate the causal effects of policy interventions, methods for integrating CS tools into solution-oriented public policy research

ADVISOR:

Antonio Bento, Price School of Public Policy

20. Learning to Connect: Modeling Social Network Dynamics and Evolution by Imitation Learning

PROJECT DESCRIPTION: In this research, we aim to model how human players make connection decisions in an online game where players are free to add or delete a friend, as well as join a clan.

ADVISORS:

Emilio Ferrara, Viterbi School of Engineering
Dmitri Williams, Annenberg School for Communication and Journalism

STUDENT PARTICIPANT:

Yiley Zeng, Ph.D. student in Computer Science, Viterbi School of Engineering