Moving Monocl’s Spark workflows to Kubernetes
Monocl collects data from a range of public sources within the life sciences, spanning publications, clinical trials, meeting presentations and research grants. From these sources we extract and disambiguate names, allowing us to figure out who has done what, and we make the results searchable in our platform. Name disambiguation can be both complicated and resource-intensive, which is why we use Apache Spark to distribute the load across a cluster. From the start we managed our own Spark clusters using Ansible and Bash, running Spark in standalone mode. But as the data team, the number of sources and our weekly commitments have grown, a new, more scalable solution was needed. After looking at the different alternatives out there, we decided to go with Spark on Kubernetes, using Argo Workflows as our workflow orchestration tool. This talk is about why those choices were made and how it all turned out.
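As a rough sketch of what "Spark on Kubernetes" means in practice, a Spark driver can be pointed at a cluster's API server instead of a standalone master. The endpoint, namespace, image and paths below are illustrative placeholders, not Monocl's actual configuration.

```python
# Minimal sketch: running a Spark job with Kubernetes as the cluster manager
# instead of a standalone master. All names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("name-disambiguation")                      # hypothetical job name
    .master("k8s://https://kubernetes.default.svc:443")  # Kubernetes API server as cluster manager
    .config("spark.kubernetes.namespace", "data-pipelines")           # placeholder namespace
    .config("spark.kubernetes.container.image", "example/spark:3.4")  # placeholder executor image
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# From here the job looks like any other Spark job; executors are
# scheduled as pods by Kubernetes rather than by a standalone master.
df = spark.read.parquet("s3a://example-bucket/publications/")  # placeholder input path
print(df.count())
spark.stop()
```

In a setup like this, an orchestration tool such as Argo Workflows would typically be responsible for launching the driver (for example via spark-submit) and chaining jobs together, rather than the job configuring itself as in this standalone sketch.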
Henrik Alburg leads the data science and engineering teams at Monocl, working on everything from the infrastructure that runs the data pipelines to model optimization for all of Monocl's sources. Before that, Henrik worked as a consultant within search and data science. He holds a Master's in Computer Science from Chalmers University of Technology.