Abstract: Video captioning requires effective extraction and fusion of multimodal features, including visual, semantic, and textual information, to generate accurate natural language descriptions. To ...
OpenWorldSAM extends SAM2 to open-vocabulary segmentation with flexible language prompts. [2026-1-4]: Demo release: we’ve added simple demos to run OpenWorldSAM on images ...