Learning Visual Representations From Cross-Modal Correspondence