I explore how machines can understand and interpret the visual world through video analysis, object interactions, and language-grounded perception. My research spans from fine-grained video segmentation to interaction-aware generation, aiming to build AI systems that truly comprehend visual scenes.
Mask Track Alignment for interaction-aware video generation. Includes MATRIX-11K, interaction-dominant layer analysis, and custom evaluation metrics. (Repo link — coming soon)
Interaction-aware referring VOS with explicit instance-level grounding and language-conditioned track selection. Project Page ↗
Language Aligned Track Selection for referring VOS. Improves grounding via alignment between language and instance tracks (see the sketch below). Project Page ↗
Multi-Granularity VOS framework and benchmark for robust segmentation across temporal scales. Project Page ↗
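To give a flavor of what language-conditioned track selection means in the referring-VOS projects above, here is a minimal conceptual sketch: the referring expression and each candidate instance track are embedded, and the best-aligned track is chosen by cosine similarity. The function name, embedding shapes, and NumPy setup are illustrative assumptions for this page, not the released implementations.

```python
# Conceptual sketch only: pick the instance track best aligned with a referring
# expression. Embeddings here are random stand-ins for real model outputs.
import numpy as np

def select_track(track_embeddings: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the track whose embedding best matches the expression.

    track_embeddings: (num_tracks, dim) array, one embedding per candidate track.
    text_embedding:   (dim,) embedding of the referring expression.
    """
    tracks = track_embeddings / np.linalg.norm(track_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    scores = tracks @ text          # cosine similarity per track
    return int(np.argmax(scores))   # highest-scoring track is the grounded instance

# Toy usage with random embeddings in place of a real video/text encoder.
rng = np.random.default_rng(0)
print(select_track(rng.normal(size=(5, 256)), rng.normal(size=256)))
```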
For collaboration or internship opportunities, reach out via LinkedIn or DM on X. Scholar profile: Google Scholar.