Abstract: Video Captioning requires effective extraction and fusion of multimodal features, including visual, semantic, and textual information, to generate accurate natural language descriptions. To ...