Cross-modal Transfer Between Vision and Language for Protest Detection
Multimodal data (data combining two or more modalities such as text, images, or audio) has attracted growing attention in recent years. One example is OpenAI's image-generation model DALL·E 2, which combines the image and text modalities. Yet even though multimodal models have shown great potential, most of today's systems for socio-political event detection remain text-based.
In this presentation, we discuss how the growing amount of multimodal data can be used to reduce the need for annotation, as presented in our paper "Cross-modal Transfer Between Vision and Language for Protest Detection." We propose a method that leverages existing annotated data in one modality to perform zero-shot event detection in another. Specifically, we focus on protest detection in text and images and show that a pretrained vision-and-language alignment model (CLIP) can be leveraged to this end. In particular, our results suggest that annotated protest text data can serve as a supplementary resource for detecting protests in images, while significant transfer is also demonstrated in the opposite direction.
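The abstract does not spell out implementation details, but one way to realize this kind of cross-modal transfer is to fit a simple classifier on CLIP text embeddings of annotated protest sentences and then apply it, unchanged, to CLIP image embeddings. The sketch below assumes Hugging Face's CLIP implementation; the model name, example sentences, image path, and the choice of a logistic-regression probe are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the paper's exact pipeline): train a linear probe on CLIP
# text embeddings of protest / non-protest sentences, then apply it directly to
# CLIP image embeddings, exploiting the shared vision-language embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Annotated *text* data only (hypothetical examples).
texts = [
    "Thousands marched downtown demanding electoral reform.",   # protest
    "The city council approved the annual budget on Tuesday.",  # not protest
]
labels = [1, 0]

with torch.no_grad():
    text_inputs = processor(text=texts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# The classifier is fit entirely in the shared embedding space, using text only.
clf = LogisticRegression(max_iter=1000).fit(text_emb.numpy(), labels)

# Zero-shot transfer: score an unseen *image* with the text-trained classifier.
image = Image.open("street_scene.jpg")  # placeholder path
with torch.no_grad():
    img_inputs = processor(images=image, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

protest_prob = clf.predict_proba(img_emb.numpy())[0, 1]
print(f"Estimated probability of protest content: {protest_prob:.2f}")
```

The same recipe runs in the opposite direction by fitting the probe on image embeddings and scoring text, which mirrors the bidirectional transfer described in the abstract.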
Ria has a background in automation, mechatronics, and machine learning. She works as a software engineer on the Threat Intelligence team at Recorded Future. Her interest in AI began with natural language processing, and today she calls herself an enthusiast, both professionally and personally. She hopes to be part of using ML and AI to make our world a safer place!