Abstract
Monocl collects data from a variety of public sources within the life sciences, spanning publications, clinical trials, meeting presentations, and research grants. We extract and disambiguate all the names from these sources, allowing us to figure out who has done what and making these people searchable on our platform. This name disambiguation is both complicated and resource-intensive, so we use Apache Spark to distribute the load across a cluster.
From the start, we managed our very own Spark clusters using Ansible and Bash, running Spark in standalone mode. But as the data team, the weekly commitment, and the number of sources have grown, we needed a new, more scalable solution. After evaluating the alternatives, we decided to go with Spark on Kubernetes and to use Argo Workflows as our workflow orchestration tool. This talk is about why those choices were made and how it all turned out.
Henrik Alburg
Data Science Manager @ Monocl
Henrik Alburg leads the data science and engineering teams at Monocl. He works on everything from the infrastructure for running all the data pipelines to model optimization for all of Monocl’s sources. Before that, Henrik worked as a consultant in search and data science. He holds a Master’s in Computer Science from Chalmers University of Technology.