Abstract: We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We ...