Research Interest
I study how to make synthetic data more reliable and effective for scaling AI training pipelines. My work challenges the assumption that well-structured synthetic data is always beneficial, showing that without careful integration and diagnostics it can degrade performance or distort learning dynamics. To make synthetic data more trustworthy, I develop methods for structured data augmentation, failure analysis, and algorithmic repair. Recent projects include empirical studies exposing structural flaws in model-based RL pipelines built on synthetic rollouts, time-symmetric data augmentation for sequential decision-making problems, and ongoing development of diffusion-model-based diagnostic tools for out-of-distribution detection, aimed at identifying when synthetic data distributions diverge from trusted real-world contexts.