class: theme
layout: true
---
class: center, middle

.footnote[.left[[Image Credit: Unsplash](https://unsplash.com/photos/numbers-on-metal-deposit-boxes-in-a-bank-Uf-c4u1usFQ)]]

## DRS, HTSGET, MAESTRO!
### Saurabh B @ OUS AMG 14/03/2024
---
class: middle

# Is this talk for you?
- or should you switch off my audio/video and take a 15 min walk?

???

--
- you want a quick overview of the Data Repository Service (DRS) and htsget by GA4GH
- you care about data access standardization
- you are interested in faster, more efficient access to streamed genomic data
- last and most importantly, you like the sound of mechanical keys
---
class: middle

# Who is involved?
- data producers
- data publishing tools and platforms
- data repositories
- data access tools and platforms
- data consumers

???

---
class: middle

# The data access problem
- there is potentially a lot of (useful) data that can be shared
- everyone ends up following their own "best practices" to share data
- there can be mediators and in-between layers that add complexity
- both data producers and consumers end up building and maintaining systems around this environment, which is avoidable work

???

---
class: middle

# The data access problem

.footnote[.left[[Image Credit: GA4GH](https://ga4gh.github.io/data-repository-service-schemas/public/img/figure1.png)]]

???

---
class: middle

# The data access solution?

.footnote[.left[[Image Credit: GA4GH](https://ga4gh.github.io/data-repository-service-schemas/public/img/figure2.png)]]

???

---
# DRS or Data Repository Service
- The proposed solution is the DRS API spec (yellow box in the previous figure)

--
- The specification standardizes (data) access methods, operations, payloads, and responses
- It provides principles and guidance that align with industry standards (OpenAPI, RESTful, etc.)

???

---
# htsget: Why?
- DRS might (hopefully) solve the data repository access issue, but

--
- final data delivery is usually as (large) files over FTP/HTTPS
- downloading, indexing, and managing all these files is a nightmare
- there are proprietary protocols and tools that pull us away from standardization and security

???

---
# htsget: What?
- Open standard protocol for secure, real-time streaming of genomic data
- allows retrieval of only the data overlapping a specific genomic interval
- supports streaming of existing community standard formats such as SAM, BAM, CRAM, VCF, BCF
- Secure access using OAuth 2.0 (industry standard)

???
-

---
# htsget: Implementation and interoperability [1]
- Each server loaded the 1000 Genomes/HapMap CEU trio in both BAM (NCBI37) and CRAM format (GRCh38), as well as RNASeq and ChIP-seq data.
- ran a mixture of random and edge-case queries for a given client/server combination, checking the integrity of the returned data against a local file.
- ran an extensive suite of tests on a total of 25 different client-server combinations
- implementations available in Python, Go, Rust (more implementations and stable releases in the works)

???

---
# DEMO
- boot up the DRS starter kit
- access the API endpoint
- quick scan of the schema

???

---
# maestro?
- maestro is a lightweight library to automate and monitor tasks in the pipeline
- maestro waits on (depends on) incoming files to initiate tasks; identifying these files is challenging (naming directories/files is difficult)
- maestro has an API

???

---
# Conclusion
## The Good
- Data access solutions by GA4GH look quite promising
- They offer a path to a standardized dreamland
- Both specifications aim to be SOA

## The ?
- Need to address typical privacy concerns
- How well does this work on internal infra?
- Will need rigorous testing and validation
- More?
---
# Thank You!
---
# References
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6298043/

[2] http://samtools.github.io/hts-specs/htsget.html

[3] https://www.ga4gh.org/product/data-repository-service-drs/

[4] https://github.com/ga4gh/data-repository-service-schemas
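---
# Backup: htsget request sketch

A minimal sketch of the "data overlapping a specific genomic interval" idea from the htsget slides, based on my reading of the spec [2]. The server `https://htsget.example.org`, the read ID `NA12878`, and the sample ticket are made-up assumptions for illustration, and nothing here touches the network. The client asks `/reads/{id}` for an interval (0-based, half-open) and gets back a JSON "ticket" listing URLs whose payloads are concatenated in order.

```python
import base64
import json
from urllib.parse import urlencode

BASE_URL = "https://htsget.example.org"  # hypothetical server, not a real endpoint


def reads_url(read_id, reference_name, start, end, fmt="BAM"):
    """Build an htsget reads request for a 0-based, half-open interval."""
    query = urlencode(
        {"format": fmt, "referenceName": reference_name, "start": start, "end": end}
    )
    return f"{BASE_URL}/reads/{read_id}?{query}"


def inline_blocks(ticket_json):
    """Return the ticket format and the decoded inline (data: URI) blocks.

    Remote URLs are skipped here; a real client would fetch them over HTTPS,
    sending any per-URL headers (e.g. an OAuth 2.0 bearer token).
    """
    ticket = json.loads(ticket_json)["htsget"]
    blocks = []
    for entry in ticket["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            blocks.append(base64.b64decode(url.split(",", 1)[1]))
    return ticket["format"], blocks


# Made-up ticket in the shape the spec describes: one inline block, one remote block.
SAMPLE_TICKET = json.dumps({
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "data:application/vnd.ga4gh.bam;base64,QkFNAQ=="},
            {"url": "https://htsget.example.org/data/1",
             "headers": {"Authorization": "Bearer xxxx"}},
        ],
    }
})

url = reads_url("NA12878", "chr1", 10_000, 20_000)
fmt, blocks = inline_blocks(SAMPLE_TICKET)
print(url)
print(fmt, blocks)
```

The point of the two-step design is that the ticket server only hands out pointers, so the bulk data can live on any storage backend that serves HTTPS.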