CNAG is one of the partners of the collaborative action DATOS-CAT, which aims to enhance the visibility and scientific impact of population-based cohorts created in Catalonia. Additionally, it seeks to enrich the procedures used in these cohorts, promoting their applicability in other similar contexts. To achieve this, it focuses on improving data interoperability, thereby facilitating their exploitation and use in scientific research.

 

In the context of personalized and precision medicine, long-term data collection allows researchers to track the evolution of diseases over time, identify patterns of environmental and genetic risk, and evaluate the impact of different treatment strategies. However, there is currently no standardized system that lets researchers and hospitals collect, store, and share their data simply and securely while keeping those data interoperable. DATOS-CAT was created in response: it is a collaborative action that aims to increase the visibility and scientific impact of GCAT|Genomes for Life, the population cohort that is a strategic project of the Germans Trias i Pujol Research Institute (IGTP), and of its COVID-19 subcohort, COVICAT-CONTENT. It also aims to contribute to the development of procedures applicable to other cohorts, improving the interoperability of their data under the FAIR principles (Findable, Accessible, Interoperable, Reusable) to facilitate their exploitation and scientific use. Specifically, the project will focus on the Catalan population database that has comprehensively monitored nearly 20,000 individuals since 2012, collecting clinical, lifestyle, and environmental data, among others, and from which a population genetic database has been generated.

 

DATOS-CAT is an ambitious project of the Complementary Plan for Biotechnology Applied to Health that brings together seven entities in Catalonia: the Barcelona Supercomputing Center (BSC) as scientific coordinator, the Institute of Bioengineering of Catalonia (IBEC), the Germans Trias i Pujol Research Institute (IGTP), the Centre for Genomic Regulation (CRG), the Centro Nacional de Análisis Genómico (CNAG), the Barcelona Institute for Global Health (ISGlobal, a center promoted by the "la Caixa" Foundation), and the Hospital Clínic of Barcelona. Each contributes its own expertise and resources, making their joint work an achievement in itself. IGTP contributes its knowledge of cohort data; ISGlobal, the development of tools related to DataSHIELD; BSC, its experience in standardizing data to the Observational Medical Outcomes Partnership (OMOP) model; Hospital Clínic, the development of OntoBridge; CRG, its experience with the European Genome-phenome Archive (EGA) data repository; and CNAG, its experience in the integration and analysis of phenotypic and genomic data. The success achieved so far in the DATOS-CAT project is the result of solid and coordinated collaboration among all participating entities.

 

Progress of the DATOS-CAT Project

 

The DATOS-CAT project has successfully reached its interim milestone by developing and publishing the tools needed for data characterization and standardization. A set of software tools has been implemented, falling into three major groups: (i) a data catalog, (ii) data transformation, and (iii) other development tools. This milestone represents significant progress both for the project itself and for the current situation of the GCAT cohort. The completion of the first software prototype provides a solid foundation for the continued development of the DATOS-CAT project, allowing the implementation of tools and systems to catalog data, standardize them to a common data model, and facilitate federated analysis. This prototype is a fundamental component that facilitates data transfer and interoperability between systems, thus contributing to the consolidation of the Catalan population database.

 

These tools have been openly published under free licenses at https://github.com/DATOS-CAT. The repository contains the tools used for cataloging, for which the selected platform is MICA. Two other tools have also been developed and published in the repository. They allow federated analysis of data while preserving privacy, through two mechanisms widely recognized by the scientific community: Beacon (https://beacon-project.io/) and DataSHIELD (https://www.datashield.org/).
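
The idea behind federated, privacy-preserving analysis can be illustrated with a minimal sketch. It is not the DataSHIELD or Beacon API, and it is not taken from the DATOS-CAT repository: the names `SiteSummary`, `summarize_site`, and `pooled_mean`, and the threshold of 5, are hypothetical. Each site shares only an aggregate (a count and a sum), and withholds it when too few participants are involved, so that individual records never leave the site.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence

# Hypothetical disclosure threshold: a site with fewer participants shares nothing.
MIN_COUNT = 5


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate that a site agrees to share; it contains no individual values."""
    count: int
    total: float


def summarize_site(values: Sequence[float],
                   min_count: int = MIN_COUNT) -> Optional[SiteSummary]:
    """Runs inside each site: return an aggregate, or None if the group is too small."""
    if len(values) < min_count:
        return None
    return SiteSummary(count=len(values), total=float(sum(values)))


def pooled_mean(summaries: Iterable[Optional[SiteSummary]]) -> float:
    """Runs at the coordinator: combine the shared aggregates into a pooled mean."""
    usable = [s for s in summaries if s is not None]
    n = sum(s.count for s in usable)
    if n == 0:
        raise ValueError("no site shared enough data to compute a mean")
    return sum(s.total for s in usable) / n
```

In this sketch the coordinator only ever sees one count and one sum per site; production systems add further safeguards that are outside the scope of this illustration.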

 

The repository also contains the tools for transforming data to the OMOP (Observational Medical Outcomes Partnership) common data model which, driven by OHDSI (Observational Health Data Sciences and Informatics), is among the most widely used semantically interoperable models for storing and exploiting longitudinal health records worldwide. Within the three groups mentioned above, two tools stand out: the dsOMOP package, developed by ISGlobal, and OntoBridge, a tool developed by Hospital Clínic that addresses the problem of data transformation in a novel way by combining semantic technologies with more traditional approaches. OntoBridge is a flexible and scalable tool that provides an integrated and simplified workflow for the adoption of common data models (CDMs) such as OMOP. Its ontology-based architecture allows earlier mapping work to be reused, both to consolidate different data sources and to convert to different CDMs, which makes the process simpler and more reusable. Unlike existing tools on the market and in the scientific literature, it does not focus on a single part of the process or on a specific CDM, which represents a significant advance in the current landscape of secondary use of biomedical data. The improvements made to OntoBridge and its publication in an open repository will streamline the conversion of data based on local models to OMOP. This will help achieve the project objectives by avoiding complex and repetitive ETL (extract, transform, load) processes that require multiple tools.
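
To give a sense of what converting local data to OMOP involves, here is a minimal sketch of a single mapping step. It does not show how OntoBridge or dsOMOP work. The local field names (`participant_id`, `sex`, `birth_date`) and the helper `to_person_row` are hypothetical, and only a few columns of the OMOP PERSON table are represented. The gender concept IDs 8507 and 8532 are the standard OMOP concepts for male and female.

```python
from dataclasses import dataclass

# Standard OMOP gender concepts; 0 is the OMOP convention for "no matching concept".
GENDER_CONCEPTS = {"male": 8507, "female": 8532}


@dataclass(frozen=True)
class PersonRow:
    """A small subset of the columns of the OMOP PERSON table."""
    person_id: int
    gender_concept_id: int
    year_of_birth: int
    gender_source_value: str


def to_person_row(record: dict) -> PersonRow:
    """Map one local cohort record (hypothetical field names) to a PERSON row."""
    sex = str(record["sex"]).strip()
    return PersonRow(
        person_id=int(record["participant_id"]),
        gender_concept_id=GENDER_CONCEPTS.get(sex.lower(), 0),
        # Assumes the local birth date is an ISO 8601 string (YYYY-MM-DD).
        year_of_birth=int(str(record["birth_date"])[:4]),
        # The original local value is kept alongside the standard concept.
        gender_source_value=sex,
    )
```

A real conversion repeats this kind of mapping for every table and vocabulary, which is why hand-written ETL quickly becomes complex and repetitive.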

 

Next Steps

 

The overall objective of DATOS-CAT is to contribute to the development of procedures applicable to other cohorts, improving the interoperability of their data within the European ecosystem of biomedical data. To reach it, the next step is for each institution to use the tools developed to standardize its data to the proposed common model and to publish them through the federated mechanisms mentioned above, among others. By the end of the project, the aim is to make available to researchers in Catalonia and the rest of the world a unique tool for addressing questions related to health and disease treatment, allowing a better understanding of the genetic and environmental risks associated with diseases. Once the data become generally accessible, from mid-2025 onwards, the first results of their exploitation are expected almost immediately.