19 May 14:30 – 19 May 15:00 in Talks

(Memorable) Lessons of running an always-on Dask Cluster

Tatiana Statsenko, Alexander Hirner

Audience level:
Intermediate

Description

In this talk, we describe the lessons we learned from running modern computer-vision tasks on an always-available Dask cluster.

Abstract

At Moonvision, we have built a free image annotation tool, and we perform image analysis for visual quality assurance and self-checkouts. Since 2017, we have relied heavily on Dask for our processing chores within a Kubernetes cluster. Some tasks, such as object detection or cropping, need low latency, while others, such as extracting images from video, run in the background. Dask can cover such use cases better than traditional workflow engines. However, keeping workers around permanently leads to particular challenges in memory use and stability. In this talk, we'd like to share how we've overcome most of them.

The first part focuses on out-of-memory errors during large map-reduce tasks. Afterwards, we explain how to use Queues for caching computer-vision models to achieve inference times under 0.5 s. Eventually, this led us to a dedicated scheduler for high-priority tasks. Lastly, we describe the caveats of tracking tasks in the presence of errors, retries, and task purity.
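
To make the out-of-memory problem in map-reduce tasks more concrete, here is a minimal sketch of one generic mitigation: reducing results batch by batch instead of gathering every mapped result first. It is not the speakers' implementation and it deliberately uses only the Python standard library rather than Dask; the names `batched`, `map_reduce_in_batches`, and `pixel_count` are hypothetical and exist only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from itertools import islice
from operator import add


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def map_reduce_in_batches(map_fn, reduce_fn, items, batch_size, initial):
    """Map and reduce one batch at a time.

    Only the mapped results of the current batch are held in memory,
    instead of the mapped results of the whole input.
    """
    total = initial
    with ThreadPoolExecutor() as pool:
        for batch in batched(items, batch_size):
            mapped = list(pool.map(map_fn, batch))
            # Fold the batch into the running total, then drop it.
            total = reduce(reduce_fn, mapped, total)
    return total


def pixel_count(shape):
    # Hypothetical stand-in for a per-image computation.
    height, width = shape
    return height * width


shapes = [(10, 10)] * 1000
result = map_reduce_in_batches(pixel_count, add, shapes, batch_size=64, initial=0)
```

The batch size trades peak memory against scheduling overhead: smaller batches keep fewer intermediate results alive, while larger batches give the pool more work to run in parallel.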