Research on vision-language models combining image and text understanding for applications including ad comprehension, content tagging, visual question answering, and multimodal content generation.