Abstract: Image-text matching as a fundamental cross-modal understanding task presents unique challenges in weakly-aligned scenarios. Such data typically feature highly abstract textual captions with ...