I explore how machines can understand and interpret the visual world through video analysis, object interactions, and language-grounded perception. My research spans from fine-grained video segmentation to interaction-aware generation, aiming to build AI systems that truly comprehend visual scenes.
Mask Track Alignment for interaction-aware video generation. Includes MATRIX-11K, interaction-dominant layer analysis, and custom evaluation metrics. (Repo link — coming soon)
Interaction-aware referring VOS with explicit instance-level grounding and language-conditioned track selection. Project Page ↗
Language Aligned Track Selection for referring VOS. Improves grounding via alignment between language and instance tracks (see the sketch below). Project Page ↗
Multi-Granularity VOS framework and benchmark for robust segmentation across temporal scales. Project Page ↗
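To give a flavor of what language-conditioned track selection means in the referring-VOS projects above, here is a minimal conceptual sketch: the referring expression and each candidate instance track are embedded, and the best-aligned track is chosen by cosine similarity. The function name, embedding shapes, and NumPy setup are illustrative assumptions for this page, not the released implementations.

```python
# Conceptual sketch only: pick the instance track best aligned with a referring
# expression. Embeddings here are random stand-ins for real model outputs.
import numpy as np

def select_track(track_embeddings: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the track whose embedding best matches the expression.

    track_embeddings: (num_tracks, dim) array, one embedding per candidate track.
    text_embedding:   (dim,) embedding of the referring expression.
    """
    tracks = track_embeddings / np.linalg.norm(track_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    scores = tracks @ text          # cosine similarity per track
    return int(np.argmax(scores))   # highest-scoring track is the grounded instance

# Toy usage with random embeddings in place of a real video/text encoder.
rng = np.random.default_rng(0)
print(select_track(rng.normal(size=(5, 256)), rng.normal(size=256)))
```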
For collaboration or internship opportunities, reach out via LinkedIn or DM on X. Scholar profile: Google Scholar.