Our software development teams use a multi-institutional Agile Scrum approach to create HuBMAP technologies deployed using microservices in a hybrid cloud. We run daily distributed stand-ups and two-week sprint cycles. This enables continuous deployment of new features and enhancements under permissive open source licenses.
The HuBMAP Portal principally utilizes the following core technologies, frameworks, and languages: Globus (identity federation, data flow), Python (APIs), JavaScript (UI), Neo4j (graph databases), Docker (one container per microservice), and Airflow (workflows), among others. Core storage and other high-performance services run locally at the Pittsburgh Supercomputing Center, while high-availability services run on Amazon Web Services.
Software issues, enhancements, and feature requests are tracked on a GitHub issues board populated directly by developers and by user feedback via the help desk.
We maintain dev, test, and production instances of most HuBMAP systems. In some areas we use continuous integration with Travis CI or GitHub Actions.
HuBMAP Data Ingest
HuBMAP HIVE is responsible for producing and managing data ingest processes and associated software in collaboration with the Data Providers. HuBMAP Data Providers are responsible for producing data and metadata in collaboration with the HIVE. These processes are rapidly evolving into scalable pipelines.
HuBMAP metadata is ingested into a Dockerized Neo4j graph database for provenance, as well as into various function-specific relational and NoSQL databases.
Data providers submit data using a combination of web registration forms, the tools noted above, and registration of experimental and sample protocols at Protocols.io. Metadata is submitted through the ingest process as tab-separated value (TSV) files containing sample, assay, antibody, and contributor metadata that meets HuBMAP specifications.
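The shape of such a TSV check can be sketched as follows; the required column names here are illustrative stand-ins, not the actual HuBMAP schema fields:

```python
import csv
import io

# Hypothetical required columns; the real fields are defined per assay
# type in the HuBMAP metadata schemas.
REQUIRED_COLUMNS = {"sample_id", "assay_type", "protocols_io_doi"}

def check_metadata_tsv(tsv_text):
    """Return a list of problems found in a metadata TSV (empty = OK)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = []
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        for col in REQUIRED_COLUMNS:
            if not row[col].strip():
                problems.append(f"row {i}: empty value for '{col}'")
    return problems

good = "sample_id\tassay_type\tprotocols_io_doi\nS1\tCODEX\t10.17504/x\n"
bad = "sample_id\tassay_type\nS1\t\n"
print(check_metadata_tsv(good))  # []
print(check_metadata_tsv(bad))   # ["missing columns: ['protocols_io_doi']"]
```

A check like this is what allows providers to catch header and empty-field errors before the formal ingest validation described below.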
The UUID API forms the basis of ID generation. Data providers use the Tissue & donor registration tool to generate donor, organ, tissue sample (including spatial data), and dataset-specific identifiers that are interlinked and displayed on the Portal.
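The interlinking of donor, organ, sample, and dataset identifiers can be sketched with a local stand-in for the uuid-api service; the registry structure and field names here are illustrative, not the service's actual schema:

```python
import uuid

# Minimal sketch of interlinked entity IDs, using a local dict in place
# of the real uuid-api REST service; field names are illustrative.
registry = {}

def register_entity(entity_type, parent_id=None):
    """Create an ID for a new entity and link it to its parent, if any."""
    if parent_id is not None and parent_id not in registry:
        raise ValueError(f"unknown parent: {parent_id}")
    new_id = uuid.uuid4().hex
    registry[new_id] = {"type": entity_type, "parent": parent_id}
    return new_id

def lineage(entity_id):
    """Walk parent links back to the donor, as Portal provenance does."""
    chain = []
    while entity_id is not None:
        chain.append(registry[entity_id]["type"])
        entity_id = registry[entity_id]["parent"]
    return chain

donor = register_entity("donor")
organ = register_entity("organ", donor)
sample = register_entity("sample", organ)
dataset = register_entity("dataset", sample)
print(lineage(dataset))  # ['dataset', 'sample', 'organ', 'donor']
```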
We accept donor data on a HIPAA-conformant Globus site and de-identify it using professional de-identification services, via manual abstraction of data from organ procurement organizations, DICOM imaging, electronic health records, and other tabular sources, as available.
Our antibody validation database and query system (pending release) includes antibody validations indexed by RRID, assay, and organ. For individual datasets, data contributors include the RRID (and related information) for each imaging channel in antibody TSV files, enabling linkage of submitted antibodies to their validation reports.
Each HuBMAP collection, ASCT+B table, and reference object receives its own Digital Object Identifier (DOI) using HuBMAP’s DOI registration service; DOIs for individual datasets are forthcoming. We produce protocol DOIs via protocols.io and standard publication DOIs via HuBMAP Publications.
The CCF RUI (Registration User Interface) is a tool that supports the registration of a three-dimensional (3D) tissue block within a 3D reference organ. The registration data is used in current versions of the Common Coordinate Framework (CCF, see CCF RUI SOP, CCF RUI GitHub repository, RUI Demo) and the CCF Exploration User Interface (EUI) developed within HuBMAP. The RUI, written in TypeScript using libraries such as Angular 11, Deck.gl, NGXS, Angular Material, and N3.js, currently supports 11 organs.
HuBMAP Data Validation is a continuously improving process that starts with defining QA/QC standards and establishing definitions for donor, sample, and assay metadata. Standards, definitions, metadata schemas, and data directory schemas are created by teams under the Data Coordination Working Group. Metadata schemas are available here, along with Excel templates with dropdowns for data entry.
Data providers format their data and metadata files according to the metadata and data directory schema specifications for each assay type. Required formats for metadata field input are described in the GitHub page for each assay-specific metadata schema. Data providers also include the required QA/QC assessments of their data as components of the submission.
HuBMAP validation tools, written in Python, ensure that data submissions conform to HuBMAP standards; the tools are shared and documented so that data providers can run many of HuBMAP’s checks themselves prior to submission. Other services include metadata submission conversion, ingest validation, and base checks (checksum, file type, etc.), as well as assay-specific checks.
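Two of the base checks mentioned above can be sketched as follows; the allowed suffix list is a hypothetical example, not HuBMAP's actual accepted-format list:

```python
import hashlib

# Illustrative base checks: a SHA-256 checksum and an allowed-extension
# check. The real validation suite adds many assay-specific rules.
ALLOWED_SUFFIXES = {".tsv", ".ome.tiff", ".fastq.gz"}  # hypothetical list

def sha256_hex(data: bytes) -> str:
    """Checksum used to verify file integrity after transfer."""
    return hashlib.sha256(data).hexdigest()

def has_allowed_suffix(filename: str) -> bool:
    """Simple file-type check based on the filename suffix."""
    name = filename.lower()
    return any(name.endswith(sfx) for sfx in ALLOWED_SUFFIXES)

payload = b"example bytes"
checksum = sha256_hex(payload)
print(len(checksum))                        # 64 hex characters
print(has_allowed_suffix("scan.ome.tiff"))  # True
print(has_allowed_suffix("notes.docx"))     # False
```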
HuBMAP staff conduct 178 (and growing) automated and manual QA/QC checks as part of the data submission & publication process. Manual validation steps are being automated as development capacity allows.
Prior to publication, each dataset is formally approved by the data-providing institution and one or more HIVE members. Data providers must also confirm the quality of spatial and semantic metadata using the CCF EUI.
Pipelines are Dockerized by the HIVE or data providers, verified by the HIVE, and integrated with the other portal components, including these general pipeline tools: Data ingest pipeline, Mixed datatype pipeline tools, OME.TIFF Pyramid, Pipeline visualization (CWL), and Pipeline deployment. These are run by the HIVE in the process of generating datasets for publication.
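The staged structure of such a pipeline can be sketched in plain Python; the stage names and flags here are illustrative stand-ins for the real Dockerized, CWL-described steps:

```python
# Minimal sketch of chaining pipeline stages; in production these are
# Dockerized steps orchestrated by Airflow, not in-process functions.
def validate(dataset):
    """Stand-in for ingest validation of a submitted dataset."""
    dataset["validated"] = True
    return dataset

def convert_to_pyramid(dataset):
    """Stand-in for the OME.TIFF pyramid-generation step."""
    dataset["pyramid"] = True
    return dataset

def run_pipeline(dataset, stages):
    """Run each stage in order, passing the dataset record along."""
    for stage in stages:
        dataset = stage(dataset)
    return dataset

def publishable(dataset):
    """A dataset is ready for publication once all stages succeeded."""
    return bool(dataset.get("validated") and dataset.get("pyramid"))

result = run_pipeline({"id": "DS1"}, [validate, convert_to_pyramid])
print(publishable(result))  # True
```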
Each pipeline returns data and metadata to the ingest services, enabling management of publication status and controlled access to metadata and datasets. Once approved, datasets are promoted to published, public status using custom code that sets upstream provenance entities (e.g., samples, donors) to public and handles downstream files (e.g., moving data to Globus public-access endpoints when it is not protected sequence data).
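The upstream status propagation can be sketched as a walk over parent links; this uses a simple in-memory graph, whereas the real service operates on the Neo4j provenance store:

```python
# Sketch of promoting a dataset to public along with every upstream
# provenance entity; entity records and statuses are illustrative.
entities = {
    "donor1":   {"type": "donor",   "status": "private",  "parent": None},
    "sample1":  {"type": "sample",  "status": "private",  "parent": "donor1"},
    "dataset1": {"type": "dataset", "status": "approved", "parent": "sample1"},
}

def publish(entity_id):
    """Mark an approved dataset public, along with all its ancestors."""
    while entity_id is not None:
        entities[entity_id]["status"] = "public"
        entity_id = entities[entity_id]["parent"]

publish("dataset1")
print([e["status"] for e in entities.values()])
# ['public', 'public', 'public']
```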
We currently manually capture dataset submission and publication efforts, including active datasets’ status, target month of publication, and future datasets. We comprehensively track donor, sample, dataset, spatial, pipeline, visualization, antibody, security (identifiable sequencing), protocol, documentation, metadata, and QA/QC standards compliance, as well as data contributors.
Internally, we regularly update a spreadsheet and use our Sankey diagram tool to view HuBMAP’s current and planned state of dataset publication (Figure).
HuBMAP Data Portal
The HuBMAP Data Portal UI is principally a Flask app, using React on the front end and primarily Elasticsearch on the back end, wrapped in a Docker container for deployment using Docker Compose. It is deployed at portal.hubmapconsortium.org. Scientists access summary data, visualizations, and data downloads by dataset on the Portal. Globus facilitates file transfer for local use of data.
The HuBMAP Portal Style Guide is used for the Data Portal and other HuBMAP sites.
While HuBMAP published datasets are openly accessible, HuBMAP consortium-level access is managed via the HuBMAP profile system and uses Globus authentication for credential checking.
The Vitessce Viewer is a visual integration tool for exploration of spatial single-cell experiments. Its modular design is optimized for scalable, linked visualizations that support the spatial and non-spatial representation of tissue-, cell-, and molecule-level data. Vitessce integrates the Viv library to visualize highly multiplexed, high-resolution, high-bit-depth image data directly from OME-TIFF files and Bio-Formats-compatible Zarr stores.
Multiple mechanisms support querying the data: general search (Elasticsearch), query tools and facets (integrated in the UI), and semantic query (not yet available to Portal users), including by gene, cell, spatial, and multidimensional criteria. The CCF EUI provides a detailed look at different parts of the human body, including the heart, kidney, and spleen, and supports spatial data query.
HuBMAP’s APIs support registration and loading of data that complies with HuBMAP data standards and ingest formats, as well as core functions underpinning the Portal UI itself. Data search: the Search API is a thin wrapper around Elasticsearch that handles data indexing and reindexing into the backend Elasticsearch instance. Identity system: the uuid-api service is a RESTful web service used to create and query UUIDs used across HuBMAP.
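The kind of query body a thin search wrapper might forward to Elasticsearch can be sketched as follows; the `all_text` field and facet names are hypothetical, not the Search API's actual index mapping:

```python
# Hypothetical sketch of building an Elasticsearch bool query from a
# free-text term plus facet filters; field names are illustrative.
def build_search_body(free_text=None, facets=None, size=10):
    """Assemble an Elasticsearch query body from search parameters."""
    must = []
    if free_text:
        must.append({"match": {"all_text": free_text}})
    for field, value in (facets or {}).items():
        must.append({"term": {field: value}})
    return {"size": size, "query": {"bool": {"must": must}}}

body = build_search_body("kidney", {"assay_type": "CODEX"})
print(body["query"]["bool"]["must"])
# [{'match': {'all_text': 'kidney'}}, {'term': {'assay_type': 'CODEX'}}]
```

A wrapper like this keeps the Elasticsearch query DSL behind a stable, simpler interface for the Portal UI.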
The HuBMAP Portal provides access to cutting-edge tools to help analyze the data, such as the ASCT+B Reporter, which includes a partonomy tree presenting relationships between anatomical structures and substructures, combined with their respective cell types and biomarkers via a bimodal network; Azimuth, a Shiny app demonstrating a query-reference mapping algorithm for single-cell data; and the Cells API (backend, js client, py client). Other tools are coming, such as the Knowledge Graph and its associated Schema for Ontology ingest and API services, plus application and biomedical ontologies.
The HIVE monitors HuBMAP Portal activity, including usage, downloads, and limited demographic factors, using monitoring services; see the Current State FAIRness Assessment.
Consortium-level access is driven by an integrated user registration tool that collects and associates credentials across Members’ institutions, the Globus file transfer service, GitHub code repositories, Google Drive document storage, and other services presented via the WordPress-based HuBMAP consortium website.
Any identifiable sequencing data is made accessible via dbGaP within 6 months of initial publication on the HuBMAP Portal to ensure secure access to this sensitive data; for details, see the Sequencing data dbGaP submission tool.
Data providers and the HIVE are responsible for secure loading and storage of identifiable sequencing data. Generally, data providers manage administrative interaction with dbGaP, while the HIVE (IEC) manages technical interaction and data loading of identifiable sequencing datasets.