Cross-modal Transfer Between Vision and Language for Protest Detection
Multimodal data, i.e. data that combines two or more modalities such as text, images, or audio, has attracted growing attention over the past year. One example is the image-generating model DALL·E 2, released by OpenAI, which combines the image and text modalities. Yet even though multimodal models have shown great potential, most of today's systems for socio-political event detection remain text-based.
In this presentation, we discuss an approach that uses the growing amount of multimodal data to reduce the need for annotation, as presented in our paper "Cross-modal Transfer Between Vision and Language for Protest Detection." We propose a method that leverages existing annotated unimodal data to perform zero-shot event detection in another modality. Specifically, we focus on protest detection in text and images and show that a pretrained vision-and-language alignment model (CLIP) can be leveraged for this purpose. In particular, our results suggest that annotated protest text data can serve as a supplement for detecting protests in images, and significant transfer is also demonstrated in the opposite direction.
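To give a flavour of how such cross-modal transfer can be realized, the sketch below trains a simple classifier on CLIP text embeddings of annotated protest text and applies it directly to CLIP image embeddings, exploiting the shared embedding space. This is an illustrative assumption rather than the exact setup from the paper: the example texts, labels, image path, and the choice of logistic regression are all placeholders.

```python
# Minimal sketch: zero-shot protest detection in images using a classifier
# trained only on annotated text, via CLIP's shared text-image embedding space.
# Texts, labels, and image path are hypothetical; the paper's setup may differ.
import torch
import clip
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical annotated text data (1 = protest, 0 = not protest).
texts = [
    "Thousands march downtown demanding electoral reform",
    "Local bakery wins regional pastry award",
]
labels = [1, 0]

# Embed the labeled texts with CLIP's text encoder and L2-normalize.
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(texts).to(device))
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Train a lightweight classifier on text embeddings only.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(text_features.cpu().numpy(), labels)

# Apply the text-trained classifier to an image embedding zero-shot:
# since CLIP maps both modalities into the same space, the decision
# boundary learned from text can be reused for images.
image = preprocess(Image.open("street_scene.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
print(classifier.predict(image_features.cpu().numpy()))
```

The same pipeline can be run in the opposite direction by training on image embeddings and classifying text embeddings, which is the reverse transfer direction mentioned above.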
Kajsa is a software engineer on the Text Analytics team at Recorded Future, with a strong passion for natural language processing, artificial intelligence, and how they can be used to make our world a better and safer place. She recently graduated from the master's programme in Complex Adaptive Systems at Chalmers, and in her day-to-day work she loves implementing state-of-the-art methods as efficiently as possible.