Connecting vision and language via image retrieval and captioning