The Integrated Data Repository:
Ontology Mapping and Data Discovery for the Translational Investigator
Rob Wynden1, BSCS, Russ J. Cucina1, MD, MS, Maggie Massary2, Davera Gabriel3, RN, Marco Casale4, MS, Ketty Mobed1, PhD, MSPH, Mark G. Weiner2, MD, Prakash Lakshminarayanan1, MBA, Hillari Allen1, Michael Kamerick1, BSCS
1University of California, San Francisco, CA; 2University of Pennsylvania, Philadelphia, PA;
3University of California, Davis, CA; 4University of Rochester, Rochester, NY
An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure. However, most available data repositories follow a standard data warehouse architecture built on a predefined data model, which does not accommodate many types of health research. In response to these shortcomings we have designed a schema and associated components that facilitate the creation of an IDR by directly addressing the urgent need for terminology and ontology mapping in the biomedical and translational sciences, giving biomedical researchers the tools required to streamline and optimize their research. Purpose-built user interfaces will allow for standardized, simpler data collection in the biomedical setting. The proposed system will dramatically lower the barrier to IDR development at biomedical research institutions and will furthermore promote inter-institutional data sharing and research collaboration.
An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure. Such a repository can provide a rich platform for a wide variety of biomedical research efforts, including correlative studies seeking to link clinical observations with molecular data, data mining to discover unexpected relationships, and support for clinical trial development through hypothesis testing, cohort scanning, and recruitment. Significant challenges to the successful construction of a repository remain, among them the ability to gain regular access to source clinical systems and the preservation of semantics across systems during aggregation.
Most repositories are designed using standard data warehouse architecture, with a predefined data model incorporated into the database schema. The traditional approach to data warehouse construction is to heavily reorganize, and frequently to modify, source data in an attempt to represent that information within a single database schema. This information technology perspective on data warehouse design is not well suited to the construction of data warehouses supporting translational biomedical science. The purpose of this paper is to discuss components that would facilitate the creation of an IDR by directly addressing the need for terminology and ontology mapping in the biomedical and translational sciences, and to present discovery interfaces through which the biomedical researcher can effectively access the information residing in the IDR.
IDR projects geared toward biomedical research pose several challenges that do not apply to most commercial warehouse implementations: 1) integrity of source data - a clear requirement in the construction of an IDR is that source data may never be altered, nor may their interpretation be altered. Records may be updated, but strict version control is required so that the data available at any given point in time can be reconstructed. Regulatory requirements and researchers alike demand clear visibility into the source data in its native format to verify that it has not been altered; 2) high variability in source schema designs - IDRs import data from a very large set of unique software environments, from multiple institutions, each with its own schema; 3) limited resources for the data governance of standardization - widespread agreement on the interpretation, mapping, and standardization of source data encoded with many different ontologies over a long period may be infeasible. In some cases the owners of the data may not be available to work on standardization projects at all, particularly in the case of historical data; 4) limited availability of software engineering staff with specialized skill sets - interpreting source data during the import process requires a large, highly skilled technical staff with domain expertise, talent often unavailable or available only at considerable expense; and 5) valid yet contradictory representations of data - there are valid but contradictory interpretations of source data depending on the researcher's domain of discourse. For example, two organizations may interpret the same privacy code differently, researchers within the same specialty may not use the same ontology, and clinical and research databases often encode race and ethnicity in differing ways.
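The version-control requirement in point 1, reconstructing the data that were available at a given point in time, can be sketched as an append-only store in which updates add versions and nothing is overwritten. The class and method names below are hypothetical illustrations, not part of the proposed schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class SourceRecord:
    """An immutable, versioned copy of a source-system record."""
    record_id: str
    version: int
    payload: dict          # source data, kept in its native representation
    loaded_at: datetime    # when this version entered the repository

class VersionedStore:
    """Append-only store: updates add versions, nothing is overwritten."""
    def __init__(self):
        self._versions: dict = {}

    def append(self, record_id: str, payload: dict, loaded_at: datetime) -> None:
        versions = self._versions.setdefault(record_id, [])
        versions.append(SourceRecord(record_id, len(versions) + 1, payload, loaded_at))

    def as_of(self, record_id: str, when: datetime) -> Optional[SourceRecord]:
        """Reconstruct the record as it existed at a given point in time."""
        candidates = [v for v in self._versions.get(record_id, []) if v.loaded_at <= when]
        return candidates[-1] if candidates else None
```

Because earlier versions are never discarded, an auditor can verify that the native-format source data were not altered, while researchers still see updated records.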
We have developed an alternative approach that incorporates the use of expert systems technologies to provide researchers with data models based on their own preferences, including the ability to select a preferred coding/terminology standard if so desired. We believe that such an approach will be more consistent with typical research methodologies, and that it will allow investigators to handle the raw data of the repository with the degrees of freedom to which they are accustomed.
An ontology mapping component is essential for successful, cost-effective data integration for two main reasons: 1) to streamline the data acquisition and identification process by a) delivering data to researchers in a just-in-time fashion, rather than requiring that all data be transmitted to the IDR in a single common format or stored within a single centralized database schema, b) providing a data discovery and data request user interface that allows researchers' data requests to be semi-automated, subject to appropriate IRB permissions, and c) potentially facilitating the emergence of a commercial ontology mapping service market. Once established, these commercial entities could offer mapping services for large, established ontologies and for datasets derived from large, established software environments; and 2) to develop a standards-based technical infrastructure by a) providing the software infrastructure with which an IDR can deliver data to researchers, organized by a hierarchical terminology appropriate to each researcher's domain of expertise, so that the same data sets serve various purposes irrespective of specialty or type of study, b) providing a reusable, fully functional, and documented software component that can be installed at any biomedical research site, c) providing a knowledge management system and ontology mapping tools that enable less technical users to translate the complex array of data fields needed to fulfill data requests, and d) facilitating inter-institutional data sharing by translating data definitions among one or more site-specific terminologies or ontologies and shareable aggregated data sets.
We propose an ontology mapping software service that runs inside an IDR. This service will provide the capability to map data encoded with different ontologies into a format appropriate for a single area of specialty, without preempting further mapping of the same data for other purposes. This approach would represent a fundamental shift both in the representation of data within the IDR and in how resources are allocated for servicing translational biomedical informatics environments. Instead of relying on an inflexible, pre-specified data governance process and data model, the proposed architecture shifts resources to handling user requests for data access via dynamically constructed views of data (Fig. 1). Data interpretation therefore happens as a result of an investigator's specific request, and only as required.
Figure 1. Complex data governance (top) can be exchanged for rules encoding (bottom)
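The exchange of up-front governance for rules encoding can be sketched as follows: mapping rules are applied only when an investigator requests a view, and the stored source rows are never modified. The source systems, local codes, and rule table below are invented for illustration; only the LOINC target (1558-6, fasting glucose) is a real code.

```python
# Hypothetical rule table: (source_system, source_code) -> (scheme, target_code).
# Two sites encode the same fasting-glucose test differently; both map to LOINC.
RULES = {
    ("lab_sys_a", "GLU-F"): ("LOINC", "1558-6"),
    ("lab_sys_b", "FBS"):   ("LOINC", "1558-6"),
}

def map_on_request(rows, rules):
    """Build a mapped view on demand. Each output row is a new dict, so the
    source rows stored in the IDR remain untouched and can be remapped later
    for a different specialty or study."""
    for row in rows:
        key = (row["system"], row["code"])
        if key in rules:
            scheme, code = rules[key]
            yield {**row, "mapped_scheme": scheme, "mapped_code": code}
        else:
            # Unmapped data stays visible rather than being dropped or altered.
            yield {**row, "mapped_scheme": None, "mapped_code": None}
```

A row that no rule covers simply passes through unmapped, which mirrors the paper's point that interpretation is deferred until a request requires it.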
User interaction with an IDR that implements the proposed tools will differ from that of a traditional data warehouse in two important respects: 1) Data Discovery - in models where up-front data governance is applied, the governance and standardization process generates a large volume of documentation to describe the source data, raising a barrier to researcher utilization. In the proposed model, the knowledge required of the researcher is significantly reduced: the researcher needs only enough information about the available data to formulate a specific request for access. 2) Translation - the translation of data from its source ontology into the ontology required by the researcher is not completed during the extract, transform, and load (ETL) phase; the ontology mapping is completed after the source data has already been imported into the IDR.
To support these distinctions, we are developing two technologies that make this approach practical: 1) Inference-Based Ontology Mapping - the source data must be translated into the ontology that the biomedical researcher requires for a particular domain of expertise. The IDR will use a rules-based system to map the source data format to the researcher's ontology of choice. 2) A Discovery Interface - because not all source data will be analyzed in detail during the initial ETL process that brings data into the warehouse, a mechanism is required to conceptualize the IDR contents. A web browser-based interface for data discovery and concept mapping will describe the contents of the IDR so that the researcher can learn what types of data are available before requesting institutional review board (IRB) approval for access. These self-service user interfaces (UIs) are illustrated below (Fig. 2-5).
Figure 2. Data Discovery UI
Figure 3. IDR Dashboard UI
Figure 4. Data Request UI
Figure 5. Mapping UI
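The discovery workflow behind these UIs can be reduced to a small sketch: researchers search a catalog of concept-level summaries, never row-level data, and use the hits to formulate an access request for IRB review. The catalog entries and function name below are invented for illustration.

```python
# Hypothetical concept catalog: names and row counts only, no patient data,
# so browsing it requires no IRB approval.
CATALOG = {
    "hemoglobin A1c": {"source": "clinical labs", "rows": 120_000},
    "smoking status": {"source": "intake forms", "rows": 45_000},
}

def discover(term: str) -> list:
    """Case-insensitive substring search over catalog concept names."""
    return sorted(k for k in CATALOG if term.lower() in k.lower())
```

The key design point, taken from the text above, is that discovery exposes only enough information to formulate a specific request for access.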
The logical data model includes work recently developed by the caBIG community for terminology metadata, as well as modeling derived from work by Noy et al.1, Brinkley et al.2, Gennari et al.3, and Advani et al.4 At the center of these structures are Metadata, Provenance, and System tables that address high-level administrative and data ownership information requirements. These include: 1) provenance metadata together with institutional metadata; 2) locally and globally unique, human-readable object identifiers for all objects and actors, including the entities responsible for the mapping (e.g., creators); 3) individuals contributing to or performing the activity (e.g., contributors); and 4) those with primary responsibility such as oversight or review (e.g., curators). Each mapping intrinsically has a source and a target instance, and every instance requires a robust set of attributes to uniquely identify the map both locally and globally. These data elements also provide information on map derivation and on the nature of the transformation activity.
The maps, relationships, and data transform structures are represented by the Ontology Map and mapping tables. Relationships or associations (including collections) will have their own metadata, such as unambiguous descriptions, directionality, and cardinality. Although the diagram shows data elements with enumerated value domains, those listed are suggestions for development in this early model. Maps will carry identifiers not only for themselves but also for their relationship to a Harvest table (Fig. 6). MapRules are textual data containing an XML-encoded mapping rule.
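Since the paper does not publish the MapRule schema, the fragment below assumes a plausible XML shape to show how a Mapping Interpreter might read one rule; the element and attribute names are illustrative. The source and target codes are real (ICD-9 250.00 and SNOMED CT 44054006, type 2 diabetes mellitus).

```python
import xml.etree.ElementTree as ET

# Hypothetical MapRule payload; element/attribute names are assumptions.
RULE_XML = """
<mapRule id="r1">
  <source scheme="ICD9" code="250.00"/>
  <target scheme="SNOMEDCT" code="44054006"/>
  <relationship>exactMatch</relationship>
</mapRule>
"""

def parse_map_rule(xml_text: str) -> dict:
    """Parse one XML-encoded mapping rule into a plain dict that a
    rules interpreter could apply to source data."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "source": (root.find("source").get("scheme"), root.find("source").get("code")),
        "target": (root.find("target").get("scheme"), root.find("target").get("code")),
        "relationship": root.findtext("relationship"),
    }
```

Storing rules as text in a MapRules table, as described above, lets curators add or revise mappings without redeploying the interpreter.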
This system would consist of only two runtime components: an Ontology Mapper Discovery Interface, which accepts and tracks user requests, and an Ontology Mapping Service with its associated Mapping Interpreter. The service would run as a background task and process data according to a preconfigured schedule.
Figure 6. Ontology maps and association with Harvest tables
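The interaction between the two runtime components can be sketched as a request queue fed by the discovery interface and a background worker that drains it on its preconfigured schedule. The function names and the polling design are assumptions for illustration, not the published architecture.

```python
import queue

# Requests queued by the Discovery Interface, awaiting the mapping service.
requests: queue.Queue = queue.Queue()

def submit_request(request_id: str) -> None:
    """Called by the discovery interface when a researcher's request is accepted."""
    requests.put(request_id)

def drain_pending() -> list:
    """One scheduled pass of the mapping service: process everything queued.
    A scheduler (e.g., cron or a timer thread) would call this on the
    preconfigured interval; mapping work would replace the no-op here."""
    processed = []
    while True:
        try:
            processed.append(requests.get_nowait())
        except queue.Empty:
            return processed
```

Decoupling request intake from mapping in this way is what allows interpretation to run as a background task rather than inline with each user interaction.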
IDRs pose special challenges in information security and regulatory compliance. While attempting to balance privacy and compliance with the need for convenient data access by authorized personnel, our design implements high security standards, including a) encrypting all protected health information as defined by the regulations of the Health Insurance Portability and Accountability Act (HIPAA), b) implementing FISMA security guidelines for access to any data obtained via an agreement with a Veterans Administration health facility, c) restricting researcher access to data in accordance with the regulations of the Office of Human Research Protection and associated IRB approval processes, d) maintaining detailed audit reports of end user activity, e) implementing a password policy for all administrative level access, f) encrypting all data transports via SSL using HTTPS and SFTP, and g) requiring all users to receive appropriate training in information security practices and sign a security practices agreement.
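The detailed audit reports in item d can be made tamper-evident by chaining each entry to a hash of its predecessor, so that retroactive edits are detectable. This is a generic sketch under that assumption, not the project's actual audit design.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(user: str, action: str, resource: str, prev_hash: str) -> dict:
    """Build one audit record whose hash covers its own fields plus the
    previous entry's hash, forming a verifiable chain."""
    entry = {
        "user": user,
        "action": action,
        "resource": resource,
        "at": datetime.now(timezone.utc).isoformat(),
        "prev": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return entry
```

Verification recomputes each entry's hash from its fields and the prior hash; any altered or deleted record breaks the chain from that point forward.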
Our proposed design is intended to greatly facilitate biomedical research by minimizing the initial investment typically required to resolve the semantic incongruities that arise when merging data from disparate sources. Through the use of a rules-based system, the translation of data into the domain of a specific researcher can be accomplished more quickly and efficiently than with a traditional data warehouse design. The proposed system will dramatically lower the barrier to IDR development at biomedical research institutions and will promote inter-institutional data sharing and research collaboration.
1. Noy NF, Crubézy M, Fergerson RW, Knublauch H, Samson WT, Vendetti J, Musen M. Protégé-2000: an open-source ontology-development and knowledge acquisition environment. Proc AMIA Symp. 2003:953.
2. Brinkley JF, Suciu D, Detwiler LT, Gennari JH, Rosse C. A framework for using reference ontologies as a foundation for the semantic web. Proc AMIA Symp. 2006:96-100.
3. Gennari JH, Musen MA, Fergerson RW, Grosso WE, Crubézy M, Eriksson H, Noy NF, Tu SW. The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human Computer Studies. 2003;58(1):89-123.
4. Advani A, Tu S, O'Connor M, Coleman R, Goldstein MK, Musen M. Integrating a modern knowledge-based system architecture with a legacy VA database: the ATHENA and EON projects at Stanford. Proc AMIA Symp. 1999:653-7.