Workshop Schedule

December 10, 2025, 2:00 PM HST. Join us for an afternoon of talks and sessions focused on advancing child speech AI technologies with fairness, precision, and interactive learning.

Virtual Attendance

Remote participants can join the live workshop via Zoom. Please mute your microphone on entry and rename yourself to include your affiliation for roll call.

Opening Remarks
14:00–14:05 · CC308

Welcome and overview of goals and logistics.

Advances and Challenges of Child ASR
14:05–14:35 · CC308

Abeer Alwan

UCLA

Child speech is characterized by larger inter- and intra-speaker variability than adult speech, partly due to vocal tract changes as children grow. In addition, there is a lack of large, publicly available datasets that can adequately train machine learning algorithms for various recognition tasks. As a result, automatic speech recognition (ASR) systems perform worse on child speech than on adult speech. In this talk, I will summarize various efforts in collecting data, developing data augmentation techniques, and benchmarking children's speech recognition with supervised and self-supervised speech foundation models. Our studies point to the need to account for several factors when designing child speech processing systems: age (an ASR system that works well for a 9-year-old child would not necessarily work well for a 6-year-old), style (read versus spontaneous speech), dialect (differences not only in pronunciation but also in word usage and grammar), and reading and/or language impairment. Moreover, for language assessments, a transliteration is sometimes more valuable to the teacher than a corrected transcription. As a result, data diversity, and not just quantity, is especially critical when designing child ASR systems. While significant progress has been made in child speech processing, several challenges remain.

Speech as a modality for the characterization and adaptation of neurodiversity
14:35–15:05 · CC308

Mark Hasegawa-Johnson

UIUC

Parents without medical expertise may seek help to integrate their children into home life, school, and society. Neuromotor conditions such as cerebral palsy (CP) and Down syndrome (DS) are typically diagnosed prenatally or at birth but may generate challenges later; conditions such as apnea, anxiety, autism, and developmental language delay may remain undetected until their behavioral correlates have caused problems. Artificial intelligence has the potential to characterize neurodiversity early in life and to adapt its behavior in order to help the child and her parents learn together. Wearables such as LittleBeats (TM) have been shown to discriminate sleep from waking states, and monologue from dialogue infant vocalizations; with these abilities, an infant wearable has the potential to detect behavioral disorders early in life and to help parents find accommodations. Accurate tests of developmental language delay exist for children 3-5 years of age, and professional speech and language treatments have been shown to improve outcomes: automatic speech recognition (ASR) for young children has the potential to make these treatments available to all children who need them. Thanks to the Speech Accessibility Project, ASR error rates for adults with Parkinson's disease halved this year, and there is reason to believe that similarly large improvements for adults with CP and DS could help grant them better access to economic and social opportunities. In these and other ways, artificially intelligent speaking agents have the potential to bridge gaps in society and improve inter-human interaction.

Coffee & Networking Break
15:35–15:50 · Lobby

Refreshments and networking.

Developing Robust Speaker Diarization for Child-Adult Dyadic Interaction
16:20–16:50 · CC308

Tiantian Feng

USC

Automating child speech analysis is increasingly critical for applications such as neurocognitive assessment and developmental evaluation. Speaker diarization, which identifies “who spoke when”, is an essential component of automated analysis for applications involving child-adult dyadic interactions. In this talk, I will introduce our recent efforts to build speaker diarization models for child-adult conversations. First, I will describe our work on developing child-adult speaker diarization benchmarks using speech foundation models, showing that exemplary foundation models achieve a 39.5% relative reduction in Diarization Error Rate compared to previous speaker diarization methods. Moreover, I will present a data-efficient solution that creates simulated child-adult conversations from AudioSet to address the data sparsity of child-adult dyadic interaction and further improve speaker diarization performance. Finally, I will show that our diarization model not only achieves strong diarization performance but also produces behavioral features that correlate highly with human-annotated labels.
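
For context on the metric cited above: Diarization Error Rate (DER) is the standard diarization evaluation measure, conventionally computed as the fraction of scored speech time that is misattributed. A minimal sketch of the conventional definition (the symbol names here are ours, not the speaker's):

DER = (T_miss + T_fa + T_conf) / T_total

where T_miss is missed speech time, T_fa is false-alarm (non-speech labeled as speech) time, T_conf is speaker-confusion time, and T_total is the total scored speech time. A relative reduction compares two systems' scores, e.g., (DER_baseline - DER_new) / DER_baseline.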

Closing Remarks
18:20–18:30 · CC308

Wrap-up and closing.