DataFest is a collaborative CKIDS data science project done with GRIDS (Graduates Rising in Information and Data Science). Projects were proposed by USC faculty and researchers through an open call for proposals. The following are the descriptions of the projects that were selected for Fall 2021:
Selected DataFest Fall 2021 Projects
1. Decoding How Humans Encode Memories
Description: Advancements in closed-loop deep brain stimulation (DBS) have enabled more intelligent autonomy for therapeutic intervention across a wide range of neurologic and psychiatric disorders. The predominant approach relies on control-theoretic approximations of the brain’s complex functional relationships with the external environment: in particular, a mapping between targeted stimulation and naturalistic responses of different regions of the brain. However, existing approaches fail to capture the environmental context of neuronal biomarkers. Thus, we leverage a set of IoT sensors to capture the human experience and environmental context, i.e., a subset of human sensory channels, in order to estimate the state of the human brain and provide the foundation for smarter, context-dependent DBS. We explore neural-symbolic approaches that integrate the powerful perception capabilities of deep learning with human logic to reason about the complex dependencies across a heterogeneous set of sensors.
Skills Needed: Python, Basic to intermediate Deep Learning
What you will learn: The students will learn how to reason about complex spatiotemporal data across a heterogeneous set of IoT sensors. In particular, they will explore the limitations of state-of-the-art deep learning approaches in terms of reasoning about complex events, e.g., reasoning about the audio, video, and inertial measurement data to detect when a person “walks through a doorway.” The students will also work with real-world patient data.
Advisors:
- Luis Garcia, Information Sciences Institute
Student participants:
- Yishan Li, M.S. student in Applied Data Science, Viterbi School of Engineering
- Navyada Koshatwar, M.S. student in Applied Data Science, Viterbi School of Engineering
- Gayathri Shrikanth, M.S. student in Computer Science, Viterbi School of Engineering
- Rushi Shah, M.S. student in Applied Data Science, Viterbi School of Engineering
- Pranjali Tushar Tembhurnikar, M.S. student in Computer Science, Viterbi School of Engineering
- Manuel Amaya, M.S. student in Applied Data Science, Viterbi School of Engineering
2. NVISION: Network Visualization Interventions Supporting Interpretation of Objective News
Description: “The current fractured media landscape allows individuals to choose confirming over credible information, and information spreads quicker online than interventions like fact-checking. Misinformation can be debiased by identifying gaps in mental representations of the world (mental models) and by prompting alerts to be vigilant about assessing information (Lewandowsky et al., 2012).
We aim to develop interventions to make media balance salient to users to mitigate the spread of misinformation. Social sampling theory describes that our misperceptions of others are explained by the sample of people we encounter (Galesic, Olsson, and Rieskamp, 2012), and we are more likely to link to similar people online (Kossinets & Watts, 2009). Our interventions address limitations in individual views of the media landscape. We aim to attach visualizations of the sharing network to news articles in real time to make these biases explicit.”
Skills Needed: Twitter data collection, Network data analysis & visualization
What you will learn: Students will learn how to automate the process to pull and summarize data from Twitter. Students will learn various methods of visualizing social networks.
Advisors:
- Daniel Benjamin, Dornsife College of Letters, Arts and Sciences
- Fred Morstatter, Information Sciences Institute
Student participants:
- Yanan Zhou, M.S. student in Analytics, Viterbi School of Engineering
- Megan Josep, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
- Priya Mane, M.S. student in Computer Science, Viterbi School of Engineering
- Minrui Chen, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
- Abhinav Rao, M.S. student in Applied Data Science, Viterbi School of Engineering
- Samip Kalyani, M.S. student in Computer Science, Viterbi School of Engineering
3. Discovering and Measuring Biases in Commonsense Knowledge Bases
Description: Common sense knowledge bases are used widely in research, spanning many areas in artificial intelligence, including natural language understanding, computer vision, and planning. However, these resources may contain human biases, which will ultimately be embedded in the resulting AI solution and potentially have negative societal impacts. The extent to which these biases exist is unclear. In this project, you will define several well-motivated biases (location, gender, ethnicity) and measure the extent to which they are represented in ConceptNet.
Skills Needed: Python, Statistics, Data Analysis, Clustering, Machine Learning, Language Models
What you will learn: How to apply data science tools and methods to measure bias in real-world knowledge bases
Advisors:
- Filip Ilievski, Information Sciences Institute
- Fred Morstatter, Information Sciences Institute
Student participants:
- Linglan Zhang, M.S. student in Analytics, Viterbi School of Engineering
- Yu Zhang, M.S. student in Analytics, Viterbi School of Engineering
- Sara Melotte, M.S. student in Computer Science, Viterbi School of Engineering
- Aditya Uday Malte, M.S. student in Computer Science, Viterbi School of Engineering
- Namita Santosh Mutha, M.S. student in Computer Science, Viterbi School of Engineering
4. Looking at White Hat (?) Hacker Social Networks on Github
Description: “Open-source intelligence (“OSINT”) is a rapidly growing area of cybersecurity. This project seeks to explore OSINT information available on GitHub. Specifically, we will build and analyze a dataset comprised of users on GitHub who show a specific interest in GitHub repos related to hacking artifacts. This dataset and social network analysis could help us determine what attributes lead to “black hat” — or malicious — cyber actors.”
Skills Needed: At least one of Python, APIs, GraphQL, databases, OSINT, cybersecurity
What you will learn: OSINT/cybersecurity, data integration, cyber “attribution”, social network analysis
Advisors:
- Jeremy Abramson, Information Sciences Institute
Student participants:
- Jonathan Lal, M.S. student in Computer Science, Viterbi School of Engineering
- Aditya Ramani, M.S. student in Computer Science, Viterbi School of Engineering
- Sanket Bhilare, M.S. student in Computer Science, Viterbi School of Engineering
- Keerthana Prakash, M.S. student in Computer Science, Viterbi School of Engineering
- Himani Amrute, M.S. student in Computer Science, Viterbi School of Engineering
5. Detecting Biases in College Football Recruiting
Description: College football recruiting is big business. This project aims to build and analyze a comprehensive college football recruiting dataset, to help determine if there are biases in who and how college football coaches recruit players. This data set will combine college football recruiting data from the web with census and other socioeconomic data, to search for patterns in where and how college football coaches recruit players.
Skills Needed: Python, web scraping, API access, database design (SQL/NoSQL)
What you will learn: Data integration, web app building, record linkage, database design, bias algorithms
Advisors:
- Jeremy Abramson, Information Sciences Institute
Student participants:
- Hassaan Hasan, M.S. student in Computer Science, Viterbi School of Engineering
- Aditya Dave, M.S. student in Computer Science, Viterbi School of Engineering
- Yin He, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
- Kartik Balodi, M.S. student in Applied and Computational Mathematics, Viterbi School of Engineering
- Akshat Jetli, M.S. student in Computer Science, Viterbi School of Engineering
- Manav Jain, M.S. student in Computer Science, Viterbi School of Engineering
6. Automatically segmenting and describing the human corpus callosum from brain MRIs
Description: The human corpus callosum is the largest pathway connecting the left and right hemispheres of the brain. The shape of the corpus callosum (CC) changes throughout the course of human development, and it can also be altered with respect to disease onset. We can explore the variation in CC shape along the middle of the brain, but we need to extract it reliably first. The lab currently has two methods for extracting the CC, one using only image processing techniques, and another using deep learning (UNet) but these methods do not always extract the CC accurately. The accuracy results often depend on the MRI scanner that was used, or the abnormalities present in the scan. Can we improve the performance of our deep learning model with additional training data? Can we change some processing steps to improve the model? Once we do have an accurate segmentation, then what shape metrics of the CC as a whole, or in parts, are most telling of the underlying biology, such as age and risk for disease?
Skills Needed: Bash, Python or R recommended. Familiarity with deep learning will be helpful for some group members, but not required. Neuroanatomy not required!
What you will learn: Students will learn how to train and test various segmentation models using real 3D brain imaging data. They will learn how real data variation and familiarity with the raw data itself can help drive informed decisions for improved image segmentation models and improved metrics for biological studies.
Advisors:
- Neda Jahanshad, Keck School of Medicine
Student participants:
- Kathy Wang, M.S. student in Healthcare Data Science, Viterbi School of Engineering
- Shayan Javid, M.S. student in Applied Data Science, Viterbi School of Engineering
- Abhinaav Ramesh, M.S. student in Healthcare Data Science, Viterbi School of Engineering
- Vineet Agarwal, M.S. student in Computer Science, Viterbi School of Engineering
- Jiahui Lu, M.S. student in Applied Data Science, Viterbi School of Engineering
7. COVID-19 misinformation
Description: This new project attempts to understand the interaction between anti-vaxxers (anti-vaccination groups) and alt-right groups on platforms such as Facebook. The goal of this project is to understand how these two types of fringe groups have interacted over the years, and how their interactions and discourse evolved during the COVID-19 pandemic. It would be interesting to explore the longitudinal patterns of network/discourse co-evolution and how such patterns may change in times of dramatic events. In terms of data, I have access to Facebook’s historical data archive and I have collected a dataset that contains a list of anti-vaxxer (n=158) and alt-right groups’ (n=183) Facebook posts over 10 years (2010-2021). The dataset can be further expanded with additional help.
Skills Needed: Social network analysis, natural language processing
What you will learn: The ability to synthesize insights from multiple methods
Advisors:
- Aimei Yang, Annenberg School for Communication and Journalism
Student participants:
- Yilin Qi, B.A. student in Linguistics & Data Science, Dornsife College of Letters, Arts and Sciences & Viterbi School of Engineering
- Revanth Madamala, M.S. student in Computer Science, Viterbi School of Engineering
- Luer Lyu, M.S. student in Computer Science, Viterbi School of Engineering
- Chidambaram Veerappan, M.S. student in Applied Data Science, Viterbi School of Engineering
8. Comparing Clinical Trials to Improve Cancer Treatments
Description: The goal of this project is to assist clinicians to find the best course of treatment for a cancer patient based on the latest and most appropriate clinical trials. Because new drugs are appearing increasingly fast, it is hard to keep track of the outcomes of all clinical trials and determine the best treatment. In collaboration with biomedical researchers, we have been developing algorithms to extract information about clinical trials from government websites, to structure the information, and to find the clinical trials that are most relevant for a given patient. We want to improve the algorithms to structure this information, and to develop similarity metrics that will help us retrieve and rank clinical trials.
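As a rough illustration of what a similarity metric for retrieving and ranking clinical trials might look like, the sketch below scores each trial by the Jaccard overlap between its eligibility terms and a patient's profile. This is an assumption for illustration only, not the project's actual metric; the trial identifiers, the term sets, and the function names `jaccard` and `rank_trials` are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty sets share nothing
    return len(a & b) / len(a | b)


def rank_trials(patient_terms, trials):
    """Return (score, trial_id) pairs, highest score first, ties by id."""
    scored = [
        (jaccard(patient_terms, terms), trial_id)
        for trial_id, terms in trials.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Made-up structured records standing in for extracted trial information.
trials = {
    "trial_a": {"breast cancer", "her2 positive", "stage ii"},
    "trial_b": {"lung cancer", "egfr mutation"},
    "trial_c": {"breast cancer", "stage ii", "postmenopausal"},
}
patient = {"breast cancer", "her2 positive", "stage ii"}

ranking = rank_trials(patient, trials)
for score, trial_id in ranking:
    print(f"{trial_id}: {score:.2f}")
```

A set-overlap score treats every term as equally important; a practical metric would likely weight terms or use structured fields, which is part of what the project proposes to develop.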
Skills Needed: Python, algorithms
What you will learn: Knowledge graph technologies, algorithms for similarity metrics
Advisors:
- Yolanda Gil, Information Sciences Institute
Student participants:
- Lili Zhou, M.S. student in Healthcare Data Science, Viterbi School of Engineering
- Nikita Goel, M.S. student in Computer Science, Viterbi School of Engineering
- Chanda, M.S. student in Computer Science, Viterbi School of Engineering
- Wenjia Dou, M.S. student in Computer Science, Viterbi School of Engineering
- Sanjeev Kadagathur Vadiraj, M.S. student in Computer Science, Viterbi School of Engineering
- Audrey Lin, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
9. Transfer learning for adversarial machine translation
Description: Neural Machine Translation (NMT) is the process of mapping a segment of words from a source language to a target language using neural networks. However, NMT systems rely on large datasets for the source and target languages, and perform poorly on low-resource languages where there is insufficient parallel data. An effective method for improving NMT on low-resource languages is to employ transfer learning, where a model trained on a high-resource language pair is used to initialize training for the low-resource language pair. In this work, we will study the effect of employing transfer learning methods on adversarial machine translation models based on Long Short-Term Memory Recurrent Neural Networks (LSTM).
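The initialization step described above can be sketched without any deep learning framework. In the minimal example below, a "model" is just a dictionary from parameter names to weight matrices; the parent (high-resource) parameters are copied into the child (low-resource) model wherever the name and shape match, and everything else keeps its fresh random initialization. The parameter names, shapes, and helper functions are hypothetical, and this is not the project's implementation; frameworks such as PyTorch expose saved parameters in a comparable name-to-tensor mapping.

```python
import random


def init_params(shapes, seed):
    """Randomly initialize one matrix (list of rows) per named parameter."""
    rng = random.Random(seed)
    return {
        name: [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        for name, (rows, cols) in shapes.items()
    }


def shape_of(matrix):
    return (len(matrix), len(matrix[0]))


def transfer(parent, child):
    """Copy parent matrices into the child where name and shape both match.

    Returns the names that were transferred; other child parameters are
    left at their fresh initialization.
    """
    copied = []
    for name, matrix in parent.items():
        if name in child and shape_of(child[name]) == shape_of(matrix):
            child[name] = [row[:] for row in matrix]  # copy, do not alias
            copied.append(name)
    return copied


# Hypothetical shapes: the two language pairs share a target vocabulary
# but have different source vocabularies (8 vs. 5 source tokens).
parent_shapes = {"src_embed": (8, 4), "lstm": (4, 16), "tgt_embed": (10, 4)}
child_shapes = {"src_embed": (5, 4), "lstm": (4, 16), "tgt_embed": (10, 4)}

parent = init_params(parent_shapes, seed=0)  # stands in for a trained parent
child = init_params(child_shapes, seed=1)
copied = transfer(parent, child)
print(copied)
```

After this step the child would be trained on the low-resource pair as usual; only its starting point differs.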
Skills Needed: Python, Basic Machine Learning Concepts, Deep Learning, Deep Learning Software (PyTorch)
What you will learn: Deep learning, NLP, Recurrent Neural Networks
Advisors:
- Mohammad Reza Rajati, Viterbi School of Engineering
Student participants:
- Manoj Yadav, M.S. student in Computer Science, Viterbi School of Engineering
- Frederick Norman, B.S. student in Computer Science, Viterbi School of Engineering
- Vikrame Vasudev Krishnan, M.S. student in Computer Science, Viterbi School of Engineering
- Hanyu He, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
- Shaochen Tan, M.S. student in Applied Data Science, Viterbi School of Engineering
- Amit Singh, M.S. student in Computer Science, Viterbi School of Engineering
10. Object detection and classification APIs for urban street image analysis
Description: Developing models to analyze images is a demanding task that requires significant time, resources, and effort. Recently, companies such as Amazon and Google have begun providing services that make the modeling process easier, so even users with little machine learning expertise can enjoy deep learning technologies. Based on our prior work in object detection and classification for smart city applications, we would like to compare and evaluate the process and performance of commercial services using our training datasets. This project will be good practice for understanding the image machine learning modeling process and the advantages/limitations of commercial services for customized learning.
Advisors:
- Seon Kim, Viterbi School of Engineering
Student participants:
- Vibhav Chitalia, M.S. student in Electrical Engineering, Viterbi School of Engineering
- Utkarsh Baranwal, M.S. student in Computer Science, Viterbi School of Engineering
- Carlos Zamora, B.S. student in Electrical and Computer Engineering, Viterbi School of Engineering
- Wonjun Lee, B.S. student in Computer Science, Viterbi School of Engineering
11. Investigate the healthy indoor air quality under Covid-19 in Los Angeles based on machine learning
Description: So far we do not know how much ventilation is needed to effectively prevent infection with COVID-19, and no sensor can measure the coronavirus directly. However, PM2.5 and CO2 are good indicators for estimating COVID-19 virus concentration. Moreover, bio-signals can be used to assess people’s state of health. In my experiment, I will recruit participants and collect indoor environmental data, outdoor environmental data, and human bio-signals. The data will then be analyzed with machine learning to find the correlation between indoor and outdoor environmental factors and the correlation between indoor environmental factors and human factors. I will also determine the appropriate range of each indoor air quality factor when people are in a healthy state, so that the window can be controlled to keep indoor CO2 and PM2.5 within that range and keep people in a healthy state.
Skills Needed: Python, statistics
What you will learn: Experience analyzing data in a real project
Advisors:
- Minghuan Gong
Student participants:
- Yixiao Li, B.A. & B.S. student in Cognitive Science & Computer Science, Dornsife College of Letters, Arts and Sciences & Viterbi School of Engineering
- Zhaohong Feng, M.S. student in Analytics & Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
12. Impacts of Smart Windows on Human’s Bio-Signals
Description: This research will investigate the relationship between humans’ bio-signals and electrochromic windows, which is useful for creating a possible mechanism that uses bio-signals to control the windows. Using wearable and remote sensors, subjects’ bio-signals (such as heart rate, skin temperature, and pupil size) and indoor environmental quality (such as temperature and humidity) can be monitored and analyzed. Finally, machine learning and data analysis will be used to analyze the impacts on humans’ bio-signals.
Skills Needed: Machine Learning, Data Analysis
What you will learn: Machine Learning, Data Analysis
Advisors:
- Zihan Wang
Student participants:
- Wenjia Dou, M.S. student in Computer Science, Viterbi School of Engineering
- Jingping Yu, M.S. student in Environmental Data Science, Viterbi School of Engineering
Continuing Projects from DataFest Fall 2020
1. Tracking health and nutrition signals from social media data (begun Spring 2020)
Description: Food environments (the physical spaces where people acquire and consume food) can profoundly impact diet and related diseases. Effective, robust measures of food environment nutritional quality are required by researchers and policymakers investigating their effects on individual dietary behavior and designing targeted public health interventions. The most commonly used indicators of food environment nutritional quality are limited to measuring the binary presence or absence of entire categories of food outlet type, such as ‘fast-food’ outlets, which can range from burger joints to salad chains. There would be great value in a summarizing indicator of restaurant nutritional quality that exists along a continuum, and which can be applied at the scale of large food environments, for example across Los Angeles County, to make distinctions between diverse restaurants within and across categories of food outlets.
This project will explore the ability to track real-life health and nutrition signals from social media data, focusing on data from Foursquare and Yelp. We will investigate the ability to access menu information from the APIs of these social media platforms, and develop measures to assess the nutritional content of these menus. Multiple aims will be investigated in this project, including scraping data from social media; NLP of menu text, tag, and comment data; developing predictive models of obesity; and more. “Ground truth” data on dietary patterns of LA residents will be available, enabling validation of dietary measures and predictive models built from menu data.
Skills needed: Python or R, NLP, spatial data, basic statistical modeling
What students will learn: Tracking real-life health signals from social media data; evaluating its quality and representativeness from a health perspective; spatial statistical analysis using big data, combined from various sources (social media data, official public health statistics); building predictive models for public health; possible experience to participate in writing conference abstracts and journal papers
Advisors:
- Abigail Horn, Department of Population and Public Health Sciences
- Dr. Andrés Abeliuk, University of Chile (formerly Information Sciences Institute, USC)
Student participants:
- Iris C. Liu, M.S. student in Computer Science, Viterbi School of Engineering
2. Investigating disparities in the COVID-19 epidemic in Los Angeles County through fine-grained epidemic modeling
Description: Fine-grained epidemiological modeling of the spread of COVID-19 can inform public health policy that accounts for disparities in the risk of exposure, infection, and death across different locations and different demographic groups. In Los Angeles County, disparities in COVID-19 infection rates by neighborhood have been tremendous. Throughout the current large outbreak wave, infection incidence rates in low-income, predominantly Hispanic neighborhoods of East LA have consistently been 10-15 times higher than in wealthier, predominantly white neighborhoods in West LA. Many well-informed hypotheses exist to explain the cause of these disparities in infection, including employment sectors that require leaving homes to work, household density, and behavioral differences across cultures and age groups. But for Los Angeles County, these hypotheses have not been evaluated quantitatively in the context of an epidemic modeling framework.
To explain the disproportionate impact of the virus on disadvantaged demographic groups in Los Angeles County, we are developing a networked multiple-population epidemic model to investigate how epidemic dynamics and infection outcomes differ across fine-grained neighborhoods. Specifically, we will extend an already-developed stochastic SEIR+ disease model that includes healthcare, death, and vaccination compartments into the networked multiple-population framework, which will model movements, contacts, and infection pathways within and between neighborhoods. A key feature of this modeling framework will be the use of dynamic mobility data, derived from US cell phone data, to inform changes in the daily movements of people within and between neighborhoods. This data will provide the basis of a weighted infection-transmissible contact network between neighborhoods. The SEIR disease model is run on top of this contact network, determining infection dynamics across the neighborhoods. The model will allow obtaining estimates of key epidemic quantities including transmission rates (and the time-varying reproductive number, R(t)) and infection fatality rates for each neighborhood, and identifying the neighborhoods driving epidemic spread (through contacts within and across neighborhoods). Furthermore, hierarchical modeling techniques will be used to obtain estimates of infection and fatality rates for substrata representing combinations of ethnicity/race, age, and sex within each neighborhood.
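To make the idea of running an SEIR model on top of a weighted contact network concrete, here is a minimal sketch. It is a deliberately simplified, deterministic, discrete-time version: the project's model is stochastic and includes healthcare, death, and vaccination compartments, none of which appear here. The two neighborhoods, the contact weights, and the rates `beta`, `sigma`, and `gamma` are made-up values for illustration, not estimates from the project.

```python
def step_seir(state, contact, beta, sigma, gamma):
    """Advance every neighborhood by one day.

    state:   list of (S, E, I, R) tuples, one per neighborhood
    contact: contact[i][j] is the share of contacts that residents of
             neighborhood i have with neighborhood j (rows sum to 1)
    """
    n = len(state)
    # Infectious fraction of each neighborhood's population.
    prevalence = [inf / (s + e + inf + r) for (s, e, inf, r) in state]
    new_state = []
    for i, (s, e, inf, r) in enumerate(state):
        # Force of infection: contact-weighted prevalence seen by residents of i.
        force = beta * sum(contact[i][j] * prevalence[j] for j in range(n))
        new_exposed = force * s        # S -> E
        new_infectious = sigma * e     # E -> I
        new_recovered = gamma * inf    # I -> R
        new_state.append((
            s - new_exposed,
            e + new_exposed - new_infectious,
            inf + new_infectious - new_recovered,
            r + new_recovered,
        ))
    return new_state


def simulate(state, contact, beta=0.3, sigma=0.2, gamma=0.1, days=100):
    """Return the list of daily states, starting with the initial one."""
    history = [state]
    for _ in range(days):
        state = step_seir(state, contact, beta, sigma, gamma)
        history.append(state)
    return history


# Two hypothetical neighborhoods; only the first starts with infections.
initial = [(9990.0, 0.0, 10.0, 0.0), (5000.0, 0.0, 0.0, 0.0)]
contact = [[0.9, 0.1],
           [0.2, 0.8]]

history = simulate(initial, contact, days=60)
for name, (s, e, inf, r) in zip(["A", "B"], history[-1]):
    print(f"neighborhood {name}: S={s:.0f} E={e:.0f} I={inf:.0f} R={r:.0f}")
```

In this sketch the second neighborhood becomes infected only through the off-diagonal contact weights, which is the mechanism mobility data would inform in the full model.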
CKIDS PROJECT TASKS
While the overarching goal of this project is to develop a multiple-population epidemic model for Los Angeles County (LAC) across a network of connected neighborhoods, it is also necessary to maintain a single-population model for LAC as a whole that estimates the epidemic parameters for this larger spatial level. Such a single-population model has been maintained since May 2020 by the USC Biostatistics COVID modeling team. This model serves two important purposes. First, since May 2020 it has supported the LAC Department of Public Health, which has requested updates on key epidemic predictions on a weekly basis. Second, the parameters estimated from the single-population model will serve as prior distributions in the Bayesian parameter estimation framework used in the networked-neighborhood model.
The first task for the CKIDS student will be to re-implement the parameter estimation framework for the existing LAC-level model, such that parameters are estimated each week and fixed for future estimates forward in time. This can be done either through modification of the existing code and parameter estimation framework, written in R and using Approximate Bayesian Computation (ABC), or through a full reimplementation of the modeling code. The second task will be to maintain the model estimation and the website that displays it, with weekly updates using data that comes directly from the LAC Department of Public Health. A third possible task, depending on the interest of the CKIDS student, will be to apply the modeling to California as a whole and to other counties in California (so far it has only been applied to LAC data).
Skills needed: Programming in the R language, familiarity with statistical methods for parameter estimation, familiarity with computational simulation, willingness to study the code of an existing model, a time commitment of 10 hours/week
What students will learn: Epidemic (SEIR) modeling and stochastic epidemic modeling, parameter estimation in a dynamic system, working with real-world infection data, supporting Los Angeles County Department of Public Health
Advisors:
- Abigail Horn, Department of Population and Public Health Sciences
Student participants:
- Tao Huang, M.S. student in Biostatistics, Keck School of Medicine
- Jianing (Julia) Chen, M.S. student in Applied Data Science, Viterbi School of Engineering