class: theme
layout: true

---
class: center, middle

.footnote[.left[[Image Credit: Morten]]]

## Maestro
### Xuyang, Tony & Saurabh @ OUS AMG
26/06/2024 & 03/07/2024

---
class: middle

# What will we talk about?

- reasons for Maestro
- how far we have come
- is someone using this, and how?
- internal mechanisms
- challenges and discussion

???
- background

---
class: middle

# Why Maestro?

???
- Complicated and unclear process from sequencing to interpretation
- Many ad hoc scripts with unclear dependencies
- Only partially automated: a lot of manual debugging and searching for files
- Difficult to adapt to new methods and the increasing scope of samples

---
class: middle

# Solution?

???
- Everything automated, including error handling
- Full tracking of metadata and files; file paths are no longer significant
- Storage of status, results and history
- Modular: new methods can be developed independently of each other and easily plugged in

---
class: middle

# What's Maestro?

- lightweight, easy-to-use, modular workflow management system
- library written in Python, open source
- built-in UI to monitor and operate on workflows/jobs
- built with bioinformatics pipelines in mind
- tries to be smart and efficient

???
- helps automate and orchestrate at all levels
- helps with monitoring of status and the system
- helps keep pipeline tasks running independently and with fewer complications
- runs processes asynchronously for efficiency
- can run on different infrastructures with database sync

---
class: middle

## Maestro basics

- Workflows, workflow steps, jobs, and orders
- Artefacts
- Watchers (File, Message, Artefact)
- Classes (`FileWatcher`, `MessageWatcher`, `BaseArtefactModel`)
- NATS & Neo4j
- `maestro_id`

???
- a job is a single unit of work/process that can be executed
- jobs can be ordered one at a time or together as part of a workflow
- a workflow is essentially a collection of jobs, their outputs and future connections
- a workflow step represents and contains information about the step's process, parameters and other metadata
- workflow steps are connected via inputs and outputs
- Maestro automatically creates connections between workflow steps when the output of one step is used as input to another
- we achieve this using watchers
- watchers are core to Maestro: it reacts to the file changes, artefacts and messages it is watching
- NATS is a key ingredient for Maestro to work
- the Maestro ID acts as a unique key to identify the correct resources of each job

---
class: middle

## Maestro workflow lifecycle

- User defines classes derived from classes provided by Maestro
- define a schema for all artefacts, using `BaseArtefactModel` as the base
- define a `process` method for each class where work needs to be done
- also specify the input class

???
- a user places an order for a workflow consisting of one or more jobs; this message with its payload is communicated via NATS
- Maestro creates the job entities in the database
- almost all workflows are started when a file is posted to a directory watched by Maestro
- this directory is user defined
- as soon as the file is found, a FileArtefact is created and a message is sent via NATS
- the artefact watcher receives this message and reacts by checking pending jobs in the database, to see whether a job's dependencies are satisfied and it is ready to run
- if a job is ready to run, we add it to the worker queue, where the job's process is executed when resources are available
- both input and output artefact entities, along with other relevant entities, are created and connected (a user-side sketch follows below)
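- a minimal sketch, from the user's side, of the pieces described above; import path, field names, the `input_model` attribute and the `order_workflow` signature are assumptions for illustration, only `BaseArtefactModel`, `FileWatcher`, `MessageWatcher`, `async def process` and the `order_workflow` utility are named in these slides

```python
# Hedged sketch only -- not Maestro's actual API. Import path, field names and
# call signatures are assumptions; the class/utility names come from the slides.
from maestro import BaseArtefactModel, MessageWatcher, order_workflow  # assumed import path


class FastqArtefact(BaseArtefactModel):
    """Input artefact: schema for a delivered sequencing file (pydantic-style fields assumed)."""
    sample_id: str
    path: str


class AlignmentArtefact(BaseArtefactModel):
    """Output artefact produced by the alignment step."""
    sample_id: str
    bam_path: str


class AlignStep(MessageWatcher):
    """One workflow step: declares its expected input and does its work in process()."""
    input_model = FastqArtefact  # assumed way of declaring the expected input class

    async def process(self, artefact: FastqArtefact) -> AlignmentArtefact:
        # ... do the real work here (submit to the cluster, run a pipeline, etc.) ...
        return AlignmentArtefact(sample_id=artefact.sample_id,
                                 bam_path=f"/results/{artefact.sample_id}.bam")


# Place an order: the order message is published via NATS, Maestro creates the
# job entities in the database, and the step runs once its input artefact appears.
order_workflow(name="align-sample", steps=[AlignStep])  # illustrative call only
```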
---
class: middle

## Review

???
- Maestro provides base classes that need to be used to run workflows: `FileWatcher` and `MessageWatcher`
- you define the expected input and the work you want to do inside `async def process`, and return the output at the end
- for `FileWatcher` you don't need to provide a `process` method; Maestro processes the file and creates a FileArtefact
- all input and output entities need to be modelled as subclasses of `BaseArtefactModel`
- entities can be nested and can contain lists of nested artefacts
- once everything is defined, you create the workflow and job objects
- use the `order_workflow` utility to place the workflow order

---
class: middle

## Interesting internals

- Invalidation
- Reruns
- Database sync (specific to TSD-NSC)

???
- we want the option to invalidate artefacts that we think are incorrect and should not be used
- marking them invalidated in the database is easy enough, but the consequences for completed and pending jobs remain
- what should we allow to be invalidated? input, output, step, the entire workflow?
- what can propagate through the graph? are we messing up any job state, running into weirdness?
- similar to invalidation: what should we allow to be rerun? is a rerun the same as executing the same job entity again with fixed code?
- should we create new entities in the graph? how should we connect them? should reruns also propagate?
- we will run steps on both TSD and NSC; different jobs/orders will be executed on TSD or NSC, but we need to keep the databases in sync
- previously we used the watcher system (Neo4jWatcher), where the payload was broadcast via NATS and was supposed to be picked up by a watcher on each side
- this was not ideal: if one of the systems goes down together with NATS, the databases end up in different states / out of sync
- now we use Neo4j transactions: when a transaction is committed, one infrastructure broadcasts the message (sketched below)
- each infrastructure has a database identifier, and transactions are executed in order to keep the databases in sync
- on each infrastructure we also have primary and secondary Neo4j instances for managing unexpected downtime, backups, etc.
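- a rough illustration of the commit-then-broadcast pattern, assuming the `neo4j` async driver and the `nats-py` client; the subject name, payload shape and sequence handling are invented for the example and do not show Maestro's actual implementation

```python
# Hedged sketch of the commit-then-broadcast sync pattern described above.
# Not Maestro's implementation: subject name, payload shape and the sequence
# field are illustrative. Uses the neo4j 5.x async driver and nats-py.
import asyncio
import json

import nats                               # nats-py client
from neo4j import AsyncGraphDatabase      # neo4j async driver

INFRA_ID = "TSD"                          # each infrastructure has its own identifier
SUBJECT = "maestro.db.tx"                 # assumed NATS subject


async def _run(tx, query, params):
    result = await tx.run(query, **params)
    await result.consume()


async def write_and_broadcast(driver, nc, seq, query, params):
    """Commit the write locally first; only broadcast after the commit succeeds,
    so a NATS or remote outage cannot silently diverge the local database."""
    async with driver.session() as session:
        await session.execute_write(_run, query, params)
    payload = {"infra": INFRA_ID, "seq": seq, "query": query, "params": params}
    await nc.publish(SUBJECT, json.dumps(payload).encode())


async def main():
    driver = AsyncGraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
    nc = await nats.connect("nats://localhost:4222")

    async def on_remote_tx(msg):
        # Replay transactions broadcast by the other infrastructure.
        # (Ordering by `seq` and primary/secondary failover are omitted here.)
        payload = json.loads(msg.data)
        if payload["infra"] == INFRA_ID:
            return                        # ignore our own broadcasts
        async with driver.session() as session:
            await session.execute_write(_run, payload["query"], payload["params"])

    await nc.subscribe(SUBJECT, cb=on_remote_tx)
    await write_and_broadcast(
        driver, nc, seq=1,
        query="MERGE (j:Job {maestro_id: $id}) SET j.status = $status",
        params={"id": "job-123", "status": "completed"},
    )
    await nc.drain()
    await driver.close()


asyncio.run(main())
```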
---
class: middle

## DEMO

- ordering and running workflows
- current (internal) API
- UI, and maybe a look at pretty Neo4j graphs
- invalidating and rerunning jobs

???

---
class: middle

## Steps status & future

- vcpipe base, trio, annopipe ✅
- vcpipe 8.0 deployed and ready to be used in steps ✅
- TSD deployment 🚧
- testing 🚧
- ELLA delivery 📅
- data packing and transfer 📅
- NovaseqX 📅

---
class: middle

# Challenges

- Lower-level process management
- Job-based custom resource allocation
- On-demand scaling
- Keeping the library as generic as possible, at least for other units

???
- as we get close to the production stage, there seems to be a need to manage Maestro processes at a lower level
- this is especially true for long-running or hung processes that might need abrupt termination, or processes that are unexpectedly terminated
- on a similar note, for certain workflows or jobs we might need to adjust compute resource allocation
- this again goes into low-level memory management, which quickly adds up and increases the maintenance load on developers
- scaling is user dependent, but also something we might want to think about for cases of sudden increases in samples from upstream
- last but not least, as we are trying to cater to Steps, we have to keep in mind that Maestro should remain as generic as possible, so that other units here can use Maestro
- we would like more people to use Maestro, contribute to it, and find bugs

---
class: middle

# Thank You, questions?

---

# References

[1] https://gitlab.com/DPIPE/labautomation/maestro
[2] https://gitlab.com/DPIPE/labautomation/steps