Detectron2 Tutorial - Search News

A Transformer-based Multimodal Feature Fusion Model for Video Captioning

Abstract: Video Captioning requires effective extraction and fusion of multimodal features, including visual, semantic, and textual information, to generate accurate natural language descriptions. To ...

GitHub

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts [NeurIPS 2025]

OpenWorldSAM pushes the boundaries of SAM2 by enabling open-vocabulary segmentation with flexible language prompts. [2026-1-4]: Demo release: we’ve added simple demos to run OpenWorldSAM on images ...
