Sandeep Chinchali (UT Austin): “Why We Need Multimodal Generative AI for Time Series (and Video)”
Abstract:
“Synthesize a realistic electrocardiogram (ECG) from a patient’s medical record to stress-test a disease classifier while preserving privacy.”
“Forecast home energy demand given location, EV usage, and an incoming winter freeze.”
“Retrieve all instances where a self-driving car executed an unprotected left near a truck and an accident followed within ten minutes.”
These tasks require multimodal generation, forecasting, and long-horizon retrieval over structured dynamical data. While foundation models excel in language and vision, they remain limited in maintaining conditional fidelity, enforcing hard physical constraints, and reasoning compositionally over temporal structure. In this talk, I present advances toward structure-aware and constraint-aware multimodal generative AI.
- Multimodal Time Series Generation and Forecasting (ICML ‘24 Spotlight, NeurIPS ‘25)
First, I introduce Time Weaver, a multimodal diffusion framework for conditional time-series synthesis under rich metadata, together with a conditional-fidelity metric that evaluates alignment between generated and empirical conditional distributions; Time Weaver achieves strong gains across healthcare, radar, energy, and traffic tasks. I then present Constrained Posterior Sampling, an inference-time method that embeds hard physical and domain constraints directly into diffusion sampling, enabling physically consistent generation without retraining. Variants of these models are deployed with Samsung, Algo8, GoodRx, Ōura, and the U.S. Army.
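To make the idea of constraint-aware inference concrete, the following is a minimal sketch, not the actual Time Weaver or Constrained Posterior Sampling implementation: a generic DDPM-style reverse loop in which each denoising step is followed by a projection onto a hard constraint set. The toy denoiser and the box constraint are placeholder assumptions introduced only for illustration.

```python
# Illustrative sketch only: a toy DDPM-style reverse loop where each denoising
# step is followed by a projection onto a hard constraint set (here, a simple
# box constraint on the signal's range). The denoiser is a stand-in, not the
# learned model from the talk.
import numpy as np

def toy_denoiser(x_t, t):
    """Stand-in for a learned noise predictor eps_theta(x_t, t)."""
    return 0.1 * x_t  # placeholder prediction

def project_to_constraints(x, lo=-1.0, hi=1.0):
    """Project a sample onto a box constraint [lo, hi] per element.
    Real physical constraints (non-negativity, conservation laws, etc.)
    would use their own projection or feasibility step here."""
    return np.clip(x, lo, hi)

def constrained_reverse_sampling(shape=(128,), num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)               # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, t)
        # Standard DDPM posterior mean estimate
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
        # Key idea: enforce hard constraints at inference time, at every step,
        # so the final sample is feasible without retraining the model.
        x = project_to_constraints(x)
    return x

sample = constrained_reverse_sampling()
print(sample.min(), sample.max())  # stays within the box constraint
```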
- Data-Efficient Multimodal Retrieval for Structured Temporal Data (ICLR ‘25, ECCV ‘24, CVPR ‘25)
Next, I address multimodal retrieval in low-data regimes. Canonical Similarity Analysis approximates multimodal encoders using unimodal models coupled via structured matrix decomposition, reducing paired data requirements by orders of magnitude relative to CLIP-style training while preserving retrieval accuracy. Finally, our DARPA-funded Neuro-Symbolic Video Search framework integrates vision-language models with formal temporal logic to enable scalable, interpretable, long-horizon video search and generation.
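For intuition about how frozen unimodal encoders can be coupled with limited paired data, here is a small sketch that uses classical canonical correlation analysis as a stand-in for the structured matrix decomposition; the actual Canonical Similarity Analysis method may differ, and the function names, dimensions, and toy data below are illustrative assumptions only.

```python
# Illustrative sketch only: align two frozen unimodal embedding spaces using a
# few hundred paired examples via classical CCA, then retrieve across
# modalities in the shared canonical space. Not the paper's actual algorithm.
import numpy as np

def fit_cca(X, Y, k=16, reg=1e-3):
    """Fit rank-k CCA projections from paired embeddings X (n x dx), Y (n x dy)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx = np.linalg.cholesky(Cxx)   # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance; its top singular directions give the canonical pairs.
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, _, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U[:, :k])       # maps X-embeddings to the shared space
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])    # maps Y-embeddings to the shared space
    return A, B, mx, my

def retrieve(query_x, gallery_y, A, B, mx, my):
    """Rank gallery items (modality Y) for a query (modality X) by cosine
    similarity in the shared canonical space."""
    qx = (query_x - mx) @ A
    gy = (gallery_y - my) @ B
    qx = qx / np.linalg.norm(qx)
    gy = gy / np.linalg.norm(gy, axis=1, keepdims=True)
    return np.argsort(-(gy @ qx))

# Toy usage: random "frozen" unimodal embeddings sharing a latent structure,
# with only a few hundred paired examples.
rng = np.random.default_rng(0)
shared = rng.standard_normal((300, 8))
X = shared @ rng.standard_normal((8, 64)) + 0.1 * rng.standard_normal((300, 64))
Y = shared @ rng.standard_normal((8, 48)) + 0.1 * rng.standard_normal((300, 48))
A, B, mx, my = fit_cca(X, Y, k=8)
print(retrieve(X[0], Y, A, B, mx, my)[:5])  # index 0 should typically rank near the top
```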
Across generation and retrieval, a common principle emerges: foundation models for real-world dynamical systems require explicit structured priors, constraint-aware inference, and data-efficient multimodal alignment.
Biography:
Sandeep Chinchali is an assistant professor in UT Austin's ECE department. He completed his PhD in computer science at Stanford and his undergraduate degree at Caltech. Previously, he was the first principal data scientist at Uhana (acquired by VMware). Sandeep's research has been recognized with the Outstanding Paper Award at MLSys 2022, as a finalist for the Best Systems Paper Award at Robotics: Science and Systems 2019, and with the Best Student Paper Award at SPIE 2025. He regularly works with Silicon Valley startups and companies such as Lockheed Martin, Honda, Collins, Intel, Viavi, and Cisco.
Zoom: https://upenn.zoom.us/j/97160689874

