CKIDS DataFest Spring 2021 Project Descriptions – Center for Knowledge-Powered Interdisciplinary Data Science (CKIDS)

CKIDS is hosting DataFest Spring 2021 in collaboration with the GRIDS data science student association. These projects were proposed by USC faculty and researchers through an open call for project proposals.

Below is a short overview of all the projects.

Students can learn more about the projects and sign up to participate during the DataFest Spring 2021 kickoff event.

New DataFest Spring 2021 Projects

1. Microtelcos and the Digital Divide in CA

The COVID-19 pandemic has reinvigorated calls to close the digital divide in the US and elsewhere. Without adequate Internet access, households are at a disadvantage in education, jobs, health, and other key dimensions of wellbeing. While local broadband markets are increasingly concentrated, there is also increased interest in exploring the role that small local operators (“microtelcos”) can play in serving low-income and rural communities. These microtelcos range from small wireless cooperatives to mom-and-pop private ISPs to municipal-backed operators. This project seeks to a) map and identify the characteristics of communities where microtelcos are present in CA, and b) examine whether microtelco presence affects broadband service quality and adoption by businesses and households in the community. The project will use broadband deployment data collected by the CPUC (California Public Utilities Commission) and socioeconomic data from the Census Bureau.

SKILLS NEEDED: Basic to intermediate statistics or econometrics; GIS experience desirable

WHAT STUDENTS WILL LEARN: Students will become familiar with the application of statistics and GIS for policy-oriented analysis. They will also gain a deeper understanding of the broadband access market and the new last-mile technologies deployed by ISPs. Finally, they will be able to contribute to the write-up of results.

ADVISORS:

Hernan Galperin, Annenberg School for Communication and Journalism

STUDENT PARTICIPANTS:

Asjad Asif Jah, M.S. student in Spatial Data Science, Viterbi School of Engineering
Jonathan Gonzalez, M.S. student in Computer Science, Viterbi School of Engineering

2. Investigating disparities in the COVID-19 epidemic in Los Angeles County through fine-grained epidemic modeling

Fine-grained epidemiological modeling of the spread of COVID-19 can inform public health policy that accounts for disparities in the risk of exposure, infection, and death across different locations and different demographic groups. In Los Angeles County, disparities in COVID-19 infection rates by neighborhood have been tremendous. Throughout the current large outbreak wave, infection incidence rates in low-income, predominantly Hispanic neighborhoods of East LA have consistently been 10-15 times higher than in wealthier, predominantly white neighborhoods in West LA. Many well-informed hypotheses exist to explain the cause of these disparities in infection, including employment sectors that require leaving homes to work, household density, and behavioral differences across cultures and age groups. But for Los Angeles County, these hypotheses have not been evaluated quantitatively in the context of an epidemic modeling framework.

To explain the disproportionate impact of the virus on disadvantaged demographic groups in Los Angeles County, we are developing a networked multiple-population epidemic model to investigate how epidemic dynamics and infection outcomes differ across fine-grained neighborhoods. Specifically, we will extend an already-developed stochastic SEIR+ disease model that includes healthcare, death, and vaccination compartments into the networked multiple-population framework, which will model movements, contacts, and infection pathways within and between neighborhoods. A key feature of this modeling framework will be the use of dynamic mobility data, derived from US cell phone data, to inform changes in the daily movements of people within and between neighborhoods. This data will provide the basis of a weighted infection-transmissible contact network between neighborhoods. The SEIR disease model is run on top of this contact network, determining infection dynamics across the neighborhoods. The model will allow obtaining estimates of key epidemic quantities including transmission rates (and the time-varying reproductive number, R(t)) and infection fatality rates for each neighborhood, and identifying the neighborhoods driving epidemic spread (through contacts within and across neighborhoods). Furthermore, hierarchical modeling techniques will be used to obtain estimates of infection and fatality rates for substrata representing combinations of ethnicity/race, age, and sex within each neighborhood.
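The idea of running an SEIR model on top of a weighted contact network between neighborhoods can be illustrated with a minimal sketch. This is not the project's model: it omits the healthcare, death, and vaccination compartments, and the function names, the `contact` matrix (a hypothetical stand-in for mobility-derived weights), and all parameter values are assumptions made for illustration only.

```python
import math
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p) using plain Bernoulli trials (fine for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_seir_network(pop, contact, beta, sigma, gamma, seed_infected, days, seed=0):
    """Discrete-time stochastic SEIR over several connected neighborhoods.

    pop[i]         population of neighborhood i
    contact[i][j]  share of neighborhood i's contacts made with residents of j
                   (hypothetical mobility-derived weights; each row sums to 1)
    beta[i]        daily transmission rate in neighborhood i
    sigma, gamma   daily rates of leaving the E and I compartments
    """
    rng = random.Random(seed)
    n = len(pop)
    S = [pop[i] - seed_infected[i] for i in range(n)]
    E = [0] * n
    I = list(seed_infected)
    R = [0] * n
    history = [(S[:], E[:], I[:], R[:])]
    for _ in range(days):
        prevalence = [I[j] / pop[j] for j in range(n)]
        new_E, new_I, new_R = [], [], []
        for i in range(n):
            # Force of infection: local transmission rate times the
            # contact-weighted prevalence across all neighborhoods.
            force = beta[i] * sum(contact[i][j] * prevalence[j] for j in range(n))
            new_E.append(binomial(rng, S[i], 1 - math.exp(-force)))
            new_I.append(binomial(rng, E[i], 1 - math.exp(-sigma)))
            new_R.append(binomial(rng, I[i], 1 - math.exp(-gamma)))
        for i in range(n):
            S[i] -= new_E[i]
            E[i] += new_E[i] - new_I[i]
            I[i] += new_I[i] - new_R[i]
            R[i] += new_R[i]
        history.append((S[:], E[:], I[:], R[:]))
    return history


# Illustrative values only: three neighborhoods, infection seeded in the first.
populations = [1000, 800, 1200]
contacts = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
run = simulate_seir_network(populations, contacts, beta=[0.6, 0.3, 0.3],
                            sigma=0.2, gamma=0.1, seed_infected=[5, 0, 0], days=60)
final_S = run[-1][0]
print("Cumulative infections per neighborhood:",
      [populations[i] - final_S[i] for i in range(3)])
```

Even though only the first neighborhood is seeded, the off-diagonal contact weights carry infection into the other two, which is the mechanism a networked model uses to identify neighborhoods that drive spread.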

CKIDS PROJECT TASKS

While the overarching goal of this project is to develop a multiple-population epidemic model for Los Angeles County (LAC) across a network of connected neighborhoods, it is also necessary to maintain a single-population model for LAC as a whole that estimates the epidemic parameters for this larger spatial level. Such a single-population model has been maintained since May 2020 by the USC Biostatistics COVID modeling team. This model serves two important purposes. First, since May 2020 it has supported the LAC Department of Public Health, which has requested updates on key epidemic predictions on a weekly basis. Second, the parameters estimated from the single-population model will serve as prior distributions in the Bayesian parameter estimation framework used in the networked-neighborhood model.

The first task for the CKIDS student will be to re-implement the parameter estimation framework for the existing LAC-level model, so that parameters are estimated each week and then held fixed in subsequent estimates. This can be done either through modification to the existing code and parameter estimation framework, written in R and using Approximate Bayesian Computation (ABC), or through a full reimplementation of the modeling code. The second task will be to keep the model estimates and the website that displays them up to date through weekly updates, using data that comes directly from the LAC Department of Public Health. A third possible task, depending on the interest of the CKIDS student, will be to apply the model to statewide California data and to other California counties (so far it has only been applied to LAC data).
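For readers unfamiliar with Approximate Bayesian Computation, the basic rejection variant can be sketched in a few lines. The existing framework is written in R and uses a stochastic model. The sketch below is in Python, uses a deterministic single-population SEIR simulator for brevity, and estimates only the transmission rate. The prior range, the simulator, the distance measure, and the rule of keeping the closest 5% of draws are all illustrative assumptions, not the team's method.

```python
import random


def simulate_daily_infections(beta, days=60, pop=10000, sigma=0.2, gamma=0.1, i0=10):
    """Deterministic single-population SEIR with daily steps.

    Returns the number of new infections (S -> E) on each day.
    """
    S, E, I = pop - i0, 0.0, float(i0)
    new_infections = []
    for _ in range(days):
        infections = beta * S * I / pop
        onsets = sigma * E
        recoveries = gamma * I
        S -= infections
        E += infections - onsets
        I += onsets - recoveries
        new_infections.append(infections)
    return new_infections


def distance(a, b):
    """Euclidean distance between two daily-infection series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def abc_rejection(observed, n_draws=2000, keep_fraction=0.05, prior=(0.1, 1.0), seed=0):
    """ABC rejection: draw beta from a uniform prior, simulate, keep the closest draws."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_draws):
        beta = rng.uniform(*prior)
        simulated = simulate_daily_infections(beta, days=len(observed))
        scored.append((distance(simulated, observed), beta))
    scored.sort()
    return [beta for _, beta in scored[: int(n_draws * keep_fraction)]]


# Synthetic "observed" data generated with a known beta, so recovery can be checked.
observed = simulate_daily_infections(0.4)
posterior = abc_rejection(observed)
print("Approximate posterior mean of beta:", sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior for the transmission rate. In a weekly workflow like the one described above, a posterior obtained from earlier weeks could play the role of the prior for the next estimate.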

SKILLS NEEDED: Programming in the R language, familiarity with statistical methods for parameter estimation, familiarity with computational simulation, willingness to study the code of an existing model, a time commitment of 10 hours/week

WHAT STUDENTS WILL LEARN: Epidemic (SEIR) modeling and stochastic epidemic modeling, parameter estimation in a dynamic system, working with real-world infection data, supporting Los Angeles County Department of Public Health

ADVISORS:

Abigail Horn, Keck School of Medicine

STUDENT PARTICIPANTS:

Tao Huang, M.S. student in Biostatistics, Keck School of Medicine
Jianing (Julia) Chen, M.S. student in Applied Data Science, Keck School of Medicine

3. Mapping the Ethical Concerns Surrounding AI Research

With the recent enthusiasm about algorithmic fairness and responsible AI, many conferences are encouraging or requiring a broader impact section to assess societal harms and benefits of the AI research being presented. In this project, we will analyze the themes of these sections, with a particular focus on the ethical issues being addressed and acknowledged. We will develop tools and methods to evaluate the harms and benefits of the presented research. The goal is to see how the community is helping AI research become less harmful and more beneficial for society. For more background on work in this area, please review this workshop.

SKILLS NEEDED: Python, an interest in AI ethics

WHAT STUDENTS WILL LEARN: NLP, statistics, basic data visualization

ADVISORS:

Fred Morstatter, Information Sciences Institute

STUDENT PARTICIPANTS:

Chaitali Joshi, M.S. student in Applied Data Science, Viterbi School of Engineering
Param Bole, M.S. student in Computer Science, Viterbi School of Engineering
Muhammad Oneeb Ul Haq Khan, M.S. student in Computer Science, Viterbi School of Engineering
Madeleine Thompson, M.S. student in Applied Data Science, Viterbi School of Engineering
Aparna Nair, M.S. student in Applied Data Science, Viterbi School of Engineering

4. Object Detection and Classification for Street Cleanliness

In collaboration with the Sanitation Department of LA, IMSC has been developing a framework to automatically detect the cleanliness of streets as well as any special objects in need of removal. The framework makes use of machine learning technology trained on images/videos collected by the city and/or taken by citizens. The images taken by mobile cameras (e.g., LA City’s garbage collection trucks and/or citizens’ smartphones using our own MediaQ App) are transferred to the MediaQ server, where they can be automatically classified based on predefined cleanliness indexes and object types (such as bulky items and illegal dumping). In this project, we will focus on the detection and classification of homeless encampments on LA streets. Recorded images/videos with GPS location data will be processed, and the classification results will be displayed on a map to understand the distribution of homeless people in LA, which is essential data for studying the homeless issue.

SKILLS NEEDED: image machine learning, Python, data visualization

WHAT STUDENTS WILL LEARN: Deep learning in practical smart city applications, spatial-visual data analysis

ADVISORS:

Seon Ho Kim, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Trisha Sinha, M.S. student in Electrical Engineering, Viterbi School of Engineering
Harsh Jaykumar Jalan, M.S. student in Computer Science, Viterbi School of Engineering
Nai Cih Liou, M.S. student in Applied Data Science, Viterbi School of Engineering
Ashwin Sakhare, Ph.D. student in Biomedical Engineering, Viterbi School of Engineering
Divya Manjunath, M.S. student in Applied Data Science, Viterbi School of Engineering

5. Drought prediction in Southern California using deep learning

Seasonal drought predictions are important for the management of water resources for agriculture, urban consumption… Seasonal forecasts have traditionally been done using a physics-based model. In this project, we will use a deep learning approach for drought forecasting in CA.

SKILLS NEEDED: Python, deep learning, statistics, PyTorch

WHAT STUDENTS WILL LEARN: Deep learning (namely CNN/LSTM), data exploration

ADVISORS:

Deborah Khider, Information Sciences Institute

STUDENT PARTICIPANTS:

Shubhashree Dash, M.S. student in Computer Science, Viterbi School of Engineering
Katie Chak, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism

6. Omics and Aging in Killifish

Students will analyze aging in the African turquoise killifish, a species with the shortest lifespan of all vertebrates. By analyzing multi-omic data over the lifetime of many individuals, we can begin to understand the cellular changes that reflect aging.

SKILLS NEEDED: Python or R. Optional: machine learning

WHAT STUDENTS WILL LEARN: Data integration and machine learning skills for a challenging practical problem.

ADVISORS:

Berenice Benayoun, Leonard Davis School of Gerontology
Jose Luis Ambite, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Deven Panchac, M.S. student in Applied Data Science, Viterbi School of Engineering
Michael Mathew, M.S. student in Applied Data Science, Viterbi School of Engineering
Akansha Das, M.S. student in Computer Science, Viterbi School of Engineering
Suchetha Bhat, M.S. student in Computer Science, Viterbi School of Engineering
Shruti Krishna Kumar, M.S. student in Applied Data Science, Viterbi School of Engineering

7. Studying the Effects of Genes and Environment in Aging

Students will analyze genomic and environmental data collected through the lifetime of individuals to investigate which genes and external conditions could be associated with aging. The goal of the project will be to reproduce an existing published paper and improve on its results.

SKILLS NEEDED: Python or R. Optional: machine learning

WHAT STUDENTS WILL LEARN: Data integration and machine learning skills for a challenging practical problem.

ADVISORS:

T. Em Arpawong, Leonard Davis School of Gerontology
Yolanda Gil, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Meera Patel, B.A. student in Data Science, Viterbi School of Engineering
Boya Li, M.S. student in Chemical Engineering, Viterbi School of Engineering
Haoyang Chen, M.S. student in Computer Science, Viterbi School of Engineering
Qinming Zhang, B.A. & B.S. student in Data Science & Economics, Viterbi School of Engineering & Marshall School of Business
Ming Yan, M.S. student in Healthcare Data Science, Viterbi School of Engineering & Keck School of Medicine

8. Turning Cyber Data into Language

Cyber ontologies such as STIX and ATT&CK can represent complex relationships between cyber threat actors, attacks, and infrastructure. While such representations are easily processed by computers, cyber analysts often prefer dealing with written text. Natural language ontologies like FrameNet represent language in a structured manner as well, but frame specifications are often not specific enough for a given domain (like cybersecurity). In this project, students will learn about cybersecurity threat ontologies and build a GUI web app tool that annotates provided cyber threat documents. No previous knowledge of cybersecurity necessary!

SKILLS NEEDED: Python (Flask, Streamlit, etc. helpful but not required)

WHAT STUDENTS WILL LEARN: Web application development, cybersecurity domain knowledge, natural language generation

ADVISORS:

Jeremy Abramson, Information Sciences Institute

STUDENT PARTICIPANTS:

Ruoyu Li, M.S. student in Computer Science, Viterbi School of Engineering
Carol Varkey, M.S. student in Cybersecurity Engineering, Viterbi School of Engineering
Rengapriya Aravindan, M.S. student in Computer Science, Viterbi School of Engineering
Chuqi Liu, M.S. student in Applied Data Science, Viterbi School of Engineering
Ziheng Gong, M.S. student in Applied Data Science, Viterbi School of Engineering

9. Looking at White Hat (?) Hacker Social Networks on GitHub

Open-source intelligence (“OSINT”) is a rapidly growing area of cybersecurity. This project seeks to explore OSINT information available on GitHub. Specifically, we will build and analyze a dataset composed of users on GitHub who show a specific interest in GitHub repos related to hacking artifacts. This dataset and social network analysis could help us determine what attributes lead to “black hat” (malicious) cyber actors.

SKILLS NEEDED: At least one of Python, APIs, GraphQL, databases

WHAT STUDENTS WILL LEARN: OSINT/cybersecurity, data integration, cyber “attribution”, social network analysis

ADVISORS:

Jeremy Abramson, Information Sciences Institute

STUDENT PARTICIPANTS:

Dat Nguyen, M.S. student in Computer Science, Viterbi School of Engineering
Wenwen Zheng, M.S. student in Communication Data Science, Viterbi School of Engineering & Annenberg School for Communication and Journalism
Nai Cih Liou, M.S. student in Applied Data Science, Viterbi School of Engineering
Xinran Liang, M.S. student in Applied Data Science, Viterbi School of Engineering

Continuing Projects from DataFest Fall 2020

1. Mapping the Uncanny Valley

While many stories involve the friendly and familiar, scary stories across cultures, from Hamlet to Yotsuya Kaidan to Siren Head, involve beings that are almost, but not quite, human. Can these stories give us insight into the “nearly-human” uncanny valley? Initial results from our group say yes! While some research has explored the uncanny valley for images, that research is limited, and the uncanny valley remains unexplored in text. If we can extract human emotions surrounding text descriptions, we can exploit an enormous array of data. Our goal this semester is to analyze our objective definitions of “fear” or “creepiness” in a story and test how the similarity of words to “human” makes them more or less creepy. Moreover, we will explore what features of objects make them more or less scary. These findings relate directly to AI and robotics, where our goal is always to make human-computer and human-robot interactions more pleasant and affable. The students will build on initial work to apply NLP methods to these texts and improve upon existing initial results.

SKILLS NEEDED: Python and an interest in NLP (deep knowledge of NLP is not needed – the tasks can be learned on the go). Optional skills are knowledge of the nltk, keras, and gensim libraries.

WHAT STUDENTS WILL LEARN: Students will combine fields of computational social science, NLP, and other subfields of AI to analyze large text datasets. They will explore a range of possible tasks including classification and sentiment analysis (e.g., what separates “creepy” and not-so-creepy scenarios?), text embedding, and clustering elements of a story, e.g., finding the arc of creepy stories. The broader goal will be to better understand a deep psychological problem, the problem of the uncanny valley, that risks inhibiting the goals of human-computer interaction. When students better understand what makes something creepy, they can explore how AI can avoid the uncanny valley and appear familiar, friendly, and safe to the public at large.

ADVISORS:

Keith Burghardt, Information Sciences Institute

STUDENT PARTICIPANTS:

Saurabh Jain, M.S. student in Applied Data Science, Viterbi School of Engineering
Athashree Vartak, M.S. student in Computer Science, Viterbi School of Engineering
Olivia Fryt, M.S. student in Applied Data Science, Viterbi School of Engineering
Yilin Qi, B.A. student in Data Science & Linguistics, Viterbi School of Engineering
Jiashu Xu, B.S. student in Computer Science & Applied Math, Viterbi School of Engineering

2. Machine Learning to Analyze Rock Microstructures

Students will analyze images from optical microscopes that reveal features of materials and microstructures using machine learning techniques. These images have been collected by geologists, who use them to study the rock samples that they collect in the field and determine their properties and origins. We have a baseline system already implemented, and the goal is to improve it with new machine learning techniques, guided by the insights of our collaborating geologists.

SKILLS NEEDED: Machine learning for image analysis in Python

WHAT STUDENTS WILL LEARN: Deep learning skills in a challenging practical application for image analysis.

ADVISORS:

Yolanda Gil, Viterbi School of Engineering
Wael Abd-Almageed, Viterbi School of Engineering

STUDENT PARTICIPANTS:

Xiaoyu Wang, M.S. student in Applied Data Science, Viterbi School of Engineering
Stephen Iota, M.S. student in Computer Science, Viterbi School of Engineering
Bolong Pan, M.S. student in Applied Data Science, Viterbi School of Engineering
Junyi Liu, M.S. student in Computer Science, Viterbi School of Engineering
Ming Lyu, M.S. student in Computer Science, Viterbi School of Engineering

3. Detecting Biases in College Football Recruiting

College football recruiting is big business. This project aims to build and analyze a comprehensive college football recruiting dataset, to help determine if there are biases in who and how college football coaches recruit players. This data set will combine college football recruiting data from the web with census and other socioeconomic data, to search for patterns in where and how college football coaches recruit players.

SKILLS NEEDED: Python, web scraping, API access, database design (SQL/NoSQL)

WHAT STUDENTS WILL LEARN: Data integration, record linkage, database design, bias algorithms

ADVISORS:

Jeremy Abramson, Information Sciences Institute

STUDENT PARTICIPANTS:

Manasi Godse, M.S. student in Computer Science, Viterbi School of Engineering
Jackie Fan, B.S. student in Computer Science & Business Administration, Viterbi School of Engineering & Marshall School of Business
Yash Gupta, M.S. student in Computer Science, Viterbi School of Engineering
Rehan Ahmed, M.S. student in Applied Data Science, Viterbi School of Engineering
Jiahang Song, M.S. student in Applied Data Science, Viterbi School of Engineering
