Health Ontology Mapper

The Integrated Data Repository:

A Non-Traditional Data Warehousing Approach for the Translational Investigator

 

Marco J. Casale, MS3, Russ J. Cucina, MD, MS1, Mark G. Weiner, MD2, David A. Krusch, MD3, Prakash Lakshminarayanan, MBA1, Aaron Abend, Bill Grant, Rob Wynden, BSCS1

 

1 University of California, San Francisco, CA; 2 University of Pennsylvania, Philadelphia, PA;

3 University of Rochester, Rochester, NY

 


Abstract

An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure. However, most available data repositories are designed using a standard data warehouse architecture with a predefined data model, which does not facilitate many types of health research. In response to these shortcomings we have designed a schema and associated components that facilitate the creation of an IDR by directly addressing the urgent need for terminology and ontology mapping in the biomedical and translational sciences and by giving biomedical researchers the tools required to streamline and optimize their research. The proposed system will dramatically lower the barrier to IDR development at biomedical research institutions in support of biomedical and translational research, and will furthermore promote inter-institutional data sharing and research collaboration.

 

Introduction

An integrated data repository (IDR) containing aggregations of clinical, biomedical, economic, administrative, and public health data is a key component of an overall translational research infrastructure.  Such a repository can provide a rich platform for a wide variety of biomedical research efforts.  Examples include correlative studies seeking to link clinical observations with molecular data, data mining to discover unexpected relationships, and support for clinical trial development through hypothesis testing, cohort scanning, and recruitment. Significant challenges to the successful construction of such a repository include, among others, gaining regular access to source clinical systems and preserving semantics across systems during the aggregation process.

 

Most repositories are designed using a standard data warehouse architecture, with a predefined data model incorporated into the database schema. The traditional approach to data warehouse construction is to heavily reorganize, and frequently to modify, source data in an attempt to represent that information within a single database schema.  This information technology perspective on data warehouse design is not well suited to the construction of data warehouses that support translational biomedical science. The purpose of this paper is to discuss the components that facilitate the creation of an IDR by directly addressing the need for terminology and ontology mapping in the biomedical and translational sciences, and to describe a novel approach to data warehouse design.

 

Background

The IDR database design is a departure from the traditional data warehouse design proposed by Inmon. The data warehouse architecture model shown in Figure 1 depicts the process of transforming operational data into information for the purpose of generating knowledge.  The diagram displays data flowing from left to right in accordance with the corporate information factory (CIF) approach (Inmon et al.). According to Inmon, the data enters the CIF as raw data collected by operational applications. The data is transformed through extract, transform, and load (ETL) processes and stored in either the data warehouse or the operational data store (ODS).

 


 

Figure 1. Inmon’s Corporate Information Factory Data Warehousing Model

 

“Often up to 80 percent of the work in building a data warehouse is devoted to the extraction, transformation, and load (ETL) process: locating the data; writing programs to extract, filter, and cleanse the data; transforming it into a common encoding scheme; and loading it into the data warehouse.” (Hobbs, Hillson & Lawande) The data model is typically designed based on the structure of the source data. Figure 2 depicts a source system clinical data model.

 


 

Figure 2. University of Rochester simplified clinical data repository OLTP data model

 

Furthermore, after the data has been prepared, it is loaded into the de-normalized schema of the data warehouse or data marts, where it resides at a fine-grained level of detail.  The logical design of a data warehouse is usually based on the star schema.  “A star schema is a simple database design (particularly suited to ad-hoc queries) in which dimensional data (describing how data are commonly aggregated) are separated from fact or event data (describing individual business transactions).” (Hoffer, Prescott & McFadden)

However, an IDR has to incorporate heterogeneous data. A common data modeling approach in biomedical informatics is the entity-attribute-value (EAV) model. An EAV design conceptually involves a table with three columns: one for the entity/object identifier (ID), one for the attribute/parameter (or an attribute ID pointing to an attribute descriptions table), and one for the value of the attribute. The table has one row for each attribute-value pair. The IDR is built on the combination of the star schema data model and the EAV approach. An example of this combination can be seen in the NIH-supported i2b2 data warehouse design at Partners Healthcare.


Figure 3. Partners Healthcare i2b2 star schema EAV data model

 

EAV design is potentially attractive for databases with complex and constantly changing schemas that reflect rapidly evolving knowledge in scientific domains. Here, when a conventional design is used, the tables (and the code that manipulates them, plus the user interface) need continuous redesign with each schema revision.
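
As a minimal sketch of the combined star schema and EAV pattern described above (the table and column names below are simplified illustrations, not the exact i2b2 DDL), a single fact table stores one row per entity-attribute-value triple and joins to the surrounding dimension tables:

  -- Simplified, illustrative DDL only; not the actual i2b2 schema.
  CREATE TABLE PATIENT_DIMENSION (
      PATIENT_NUM   INTEGER PRIMARY KEY,
      BIRTH_DATE    DATE,
      SEX_CD        VARCHAR(10)
  );

  CREATE TABLE CONCEPT_DIMENSION (
      CONCEPT_CD    VARCHAR(50) PRIMARY KEY,
      CONCEPT_PATH  VARCHAR(700),
      NAME_CHAR     VARCHAR(200)
  );

  -- EAV-style fact table: one row per observation
  -- (entity = patient, attribute = concept, value = numeric or character).
  CREATE TABLE OBSERVATION_FACT (
      PATIENT_NUM   INTEGER      REFERENCES PATIENT_DIMENSION (PATIENT_NUM),
      CONCEPT_CD    VARCHAR(50)  REFERENCES CONCEPT_DIMENSION (CONCEPT_CD),
      START_DATE    DATE,
      NVAL_NUM      NUMERIC,       -- numeric value, e.g. a laboratory result
      TVAL_CHAR     VARCHAR(255),  -- character value, e.g. a coded finding
      PRIMARY KEY (PATIENT_NUM, CONCEPT_CD, START_DATE)
  );

New kinds of observations are added by inserting rows with new concept codes rather than by altering the table definition, which is what makes the EAV layout attractive for the rapidly evolving schemas described above.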

 

Federated vs. Centralized Approach

 

The debate between a federated data warehouse design and an ‘all in one’ centralized data warehouse design has been a key point of contention between the two main data warehousing schools of thought, led by Bill Inmon and Ralph Kimball. The following example outlines the differences between the two approaches and explains why the centralized approach is the favorable choice in biomedical informatics.

 

Many institutions have electronic clinical data that is decentralized across different departments. That infrastructure can be used to create an integrated data set with information that spans the source systems.  However, the decentralization creates the need for redundant steps and lengthy iterative processes to refine the information, and requires that more people have access to protected health information in order to satisfy the research information needs.  To illustrate these issues, the following describes the workflow needed to define a cohort of people known to have cardiovascular disease and laboratory evidence of significant renal disease, defined by an elevated serum creatinine.

 

In the decentralized system, where should the investigator start?  He can begin by going to the billing system that stores diagnoses and obtaining a list, including PHI (protected health information), of people with a history of a heart attack.  He can then transport that list of identifiers to the people who work in the laboratory, request the serum creatinine levels on that set of patients, and limit the list to those with an elevation.  The lab must validate the patient matches generated by the billing system by comparing PHI, a step redundant with the billing system.  Furthermore, many of the subjects associated with heart attack may not have an elevated creatinine, so, in retrospect, the PHI of these people should not have been available to the people running the query in the lab.   Perhaps the cohort that was generated was not as large as expected, and the investigator decides to expand the cohort to patients with a diagnosis of peripheral vascular disease or stroke.  He then has to iterate back to the billing system to draw additional people with these additional diagnoses, and bring the new list of patient identifiers to the lab to explore their creatinine levels.

 

The centralized warehouse as proposed conducts the matching of patient identifiers behind the scenes.  The information system matches patients across the different components of the database, so identifiers do not have to be manually transported and manipulated by the distinct database managers at each location.  The result is that a centralized warehouse is radically more secure than a decentralized warehouse because of the reduced exposure of PHI.  Further, if the query produces unsatisfactory results, the cycle of re-querying the data with new criteria is faster and user controlled.
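
To make the contrast concrete, the following is a hypothetical sketch of the cohort query above run against a centralized repository; the table names, column names, and creatinine threshold are illustrative assumptions rather than part of the proposed schema:

  -- Cohort: patients with a cardiovascular diagnosis AND an elevated serum
  -- creatinine, resolved in a single query with no manual transport of PHI.
  SELECT DISTINCT DX.PATIENT_NUM
  FROM   DIAGNOSIS_FACT DX
  JOIN   LAB_FACT LAB ON LAB.PATIENT_NUM = DX.PATIENT_NUM
  WHERE  DX.DIAGNOSIS_GROUP = 'CARDIOVASCULAR_DISEASE'
  AND    LAB.TEST_NAME = 'SERUM_CREATININE'
  AND    LAB.RESULT_VALUE > 1.5;   -- illustrative threshold only

Broadening the cohort, for example by adding peripheral vascular disease or stroke, becomes a change to the WHERE clause rather than another round trip between the billing office and the laboratory.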

 

Discussion

There are several challenges posed by IDR projects geared toward biomedical research which do not apply to the construction of most commercial warehouse implementations:

1) Integrity of source data. A clear requirement in the construction of an IDR is that source data may never be altered, nor may their interpretation be altered. Records may be updated, but strict version control is required to enable reconstruction of the data that was available at a given point in time.  Regulatory requirements and researchers demand clear visibility into the source data in its native format to verify that it has not been altered.

2) High variability in source schema designs. IDRs import data from a very large set of unique software environments, from multiple institutions, each with its own unique schema.

3) Limited resources for the data governance of standardization. Widespread agreement on the interpretation, mapping, and standardization of source data that has been encoded using many different ontologies over a long period of time may be infeasible.  In some cases the owners of the data may not even be available to work on data standardization projects, particularly in the case of historical data.

4) Limited availability of software engineering staff with specialized skill sets. Interpretation of source data during the data import process requires a large and highly skilled technical staff with domain expertise, talent often not available or available only at considerable expense.

5) Valid yet contradictory representations of data. There are valid, yet contradictory, interpretations of source data depending on the domain of discourse of the researcher.  For example, two organizations may interpret the same privacy code differently, researchers within the same specialty may not use the same ontology, and clinical and research databases often encode race and ethnicity in differing ways.

We have developed an alternative approach that incorporates the use of expert systems technologies to provide researchers with data models based on their own preferences, including the ability to select a preferred coding/terminology standard if so desired.  We believe that such an approach is more consistent with typical research methodologies, and that it allows investigators to handle the raw data of the repository with the degrees of freedom to which they are accustomed.

 

An ontology mapping component streamlines the data acquisition and identification process by delivering data to researchers in a just-in-time fashion, rather than requiring that all data be transmitted to the IDR in a single common format or stored within a single centralized database schema; by providing a knowledge management system and ontology mapping tools that enable less technical users to translate the complex array of data fields needed to fulfill data requests; and by facilitating inter-institutional data sharing through the translation of data definitions among one or more site-specific terminologies or ontologies and shareable aggregated data sets.

 

We propose an ontology mapping software service that runs inside an IDR. This service will provide the capability to map data encoded with different ontologies into a format appropriate for a single area of specialty, without preempting further mapping of that same data for other purposes. This approach represents a fundamental shift both in the representation of data within the IDR and in how resources are allocated for servicing translational biomedical informatics environments.  Instead of relying on an inflexible, pre-specified data governance process and data model, the proposed architecture shifts resources to handling user requests for data access via dynamically constructed views of the data (Figure 4).  Data interpretation therefore happens as a result of an investigator’s specific request, and only as required.


Figure 4. Complex data governance (top) can be exchanged for rules encoding (bottom)

 

Translation of data from its source ontology into the ontology required by the researcher is not completed during the extract, transform, and load (ETL) phase.  The ontology mapping is completed after the source data has already been imported into the IDR.

 

To support the translation of data, we are developing an approach called Inference Based Ontology Mapping, in which source data is translated into the ontology that the biomedical researcher requires for a particular domain of expertise.  The IDR uses a rules-based system to perform this mapping from the source data format to the researcher’s ontology of choice.
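
As a minimal sketch of how such a rules-based translation might be expressed in SQL (the rule table, code systems, and simplified OBSERVATION_FACT columns shown here are hypothetical illustrations, not the Ontology Mapper's actual rule format):

  -- Hypothetical crosswalk of mapping rules from a source encoding to the
  -- researcher's target encoding (e.g. a local laboratory dictionary to LOINC).
  CREATE TABLE MAPPING_RULE (
      SOURCE_ENCODING_CD  VARCHAR(50),
      SOURCE_CONCEPT_CD   VARCHAR(50),
      TARGET_ENCODING_CD  VARCHAR(50),
      TARGET_CONCEPT_CD   VARCHAR(50)
  );

  -- Re-express source-encoded facts in the researcher's ontology of choice
  -- at read time, leaving the original facts untouched.
  SELECT OBS.PATIENT_NUM,
         MR.TARGET_CONCEPT_CD,
         OBS.START_DATE,
         OBS.NVAL_NUM
  FROM   OBSERVATION_FACT OBS
  JOIN   MAPPING_RULE MR
         ON  MR.SOURCE_CONCEPT_CD  = OBS.CONCEPT_CD
         AND MR.TARGET_ENCODING_CD = 'LOINC';   -- researcher's chosen encoding

Because the translation is applied when the data are read rather than when they are loaded, the same source facts remain available for mapping into other target encodings later.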

 

Ontology Mapper Data Model – Extending i2b2

 

To support the alternate data governance model shown in Figure 4, a further evolution of the database model provided by i2b2 was required.  In order to support this refinement of the data management process, the ability to map data (instance mapping) after ETL must be added to the data warehousing model.  The proposed Ontology Mapper functionality, which enables the creation of maps for the transformation of data from source encoding to target encoding, required several changes to the original i2b2 schema.

 

In order to maintain data integrity, the design minimizes alteration of the core i2b2 star schema. Minimal changes were made to the observation_fact table to lessen any impact on the primary key. These changes support the relationship between the core fact and the instance map that was used in support of that fact. In general, the primary key is necessary for providing a unique instance of an observation. In addition, all fields in a record should have significance.

 

For instance, if an aggregation of records is generated across a particular value, the primary key must continue to uniquely identify the instance of the aggregated record. Aggregation requirements posed a unique challenge related to the partial existence of facts. A separate fact table was designed to accommodate aggregate facts. The development of aggregate fact tables is a common practice in data warehousing, typically accomplished through a materialized view against the detailed fact table. Aggregation also provides a valuable data set to clinical researchers who are interested only in de-identified data sets or in patient cohort identification.
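
A minimal sketch of such an aggregate fact, built as a materialized view over the detailed fact table, is shown below; Oracle materialized-view syntax is assumed, and the grouping columns (and whether the production map_aggr_fact table is populated exactly this way) are illustrative:

  -- Illustrative aggregate fact: de-identified counts per concept per month,
  -- refreshed on demand from the detailed fact table.
  CREATE MATERIALIZED VIEW MAP_AGGR_FACT
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  AS
  SELECT CONCEPT_CD,
         TRUNC(START_DATE, 'MM')     AS OBSERVATION_MONTH,
         COUNT(DISTINCT PATIENT_NUM) AS PATIENT_COUNT,
         COUNT(*)                    AS OBSERVATION_COUNT
  FROM   OBSERVATION_FACT
  GROUP  BY CONCEPT_CD, TRUNC(START_DATE, 'MM');

Because only counts survive the aggregation, such a table can be exposed to researchers who need de-identified data sets or preliminary cohort sizes without granting access to the underlying detailed facts.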

 

Furthermore, minimizing alterations to the star schema helps to reduce the changes incurred by subsequent releases of the i2b2 software. In addition to the map_aggr_fact table previously mentioned, the following minor modifications must also be applied (and re-applied as each database upgrade is installed); a SQL sketch of these changes follows the list.

 

1) Encoding_Dimension: a new table created for the purpose of storing the various encodings used in the Mapper system.

2) Map_Dimension: a new table created to store information on the instance-mapper XML files uploaded to the system.

3) Map_Data_Fact: a new table designed to store the records created as a result of Ontology Mapper execution.  All transformed records in target encodings are housed in this table.

4) Changes to the i2b2 Observation_Fact table:

a) Addition of the CONCEPT_PATH column, which is referenced from the CONCEPT_DIMENSION table and stores the concept path pertaining to the concept code used in this table.  This is required because the original i2b2 design supports only the CONCEPT_CD (which can be the same for multiple concept paths), making it difficult to resolve the concept path from the code alone.

b) Addition of the ENCODING_CD column, which is referenced from the ENCODING_DIMENSION table and denotes the encoding type with which the record is encoded.

c) Addition of the OBSERVATION_FACT_ID column, which stores a running sequence number that uniquely identifies each record in the table.
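
The modifications listed above might be sketched in SQL roughly as follows; the table and column names come from the descriptions in the text, while the data types, lengths, and Oracle-style syntax are assumptions made for illustration:

  -- New dimension of supported encodings (e.g. ICD-9-CM, LOINC, SNOMED CT).
  CREATE TABLE ENCODING_DIMENSION (
      ENCODING_CD   VARCHAR(50) PRIMARY KEY,
      NAME_CHAR     VARCHAR(200)
  );

  -- One row per instance-mapper XML file uploaded to the system.
  CREATE TABLE MAP_DIMENSION (
      MAP_ID        INTEGER PRIMARY KEY,
      MAP_NAME      VARCHAR(200),
      MAP_XML       CLOB
  );

  -- Records produced by Ontology Mapper execution, in the target encoding.
  CREATE TABLE MAP_DATA_FACT (
      OBSERVATION_FACT_ID  INTEGER,                 -- source fact that was mapped
      MAP_ID               INTEGER REFERENCES MAP_DIMENSION (MAP_ID),
      CONCEPT_CD           VARCHAR(50),             -- concept in the target encoding
      ENCODING_CD          VARCHAR(50) REFERENCES ENCODING_DIMENSION (ENCODING_CD),
      NVAL_NUM             NUMERIC,
      TVAL_CHAR            VARCHAR(255)
  );

  -- Minimal additions to the existing i2b2 OBSERVATION_FACT table.
  ALTER TABLE OBSERVATION_FACT ADD (
      CONCEPT_PATH         VARCHAR(700),            -- resolved from CONCEPT_DIMENSION
      ENCODING_CD          VARCHAR(50),             -- references ENCODING_DIMENSION
      OBSERVATION_FACT_ID  INTEGER                  -- running sequence number
  );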

 

Figure 5 depicts the Ontology Mapper Data Model.

Figure 5. Ontology Mapper Star Schema – Extension of i2b2 Star Schema

 

 

Ontology Mapper Database Design: Generating Data Marts

 

This final set of modifications to i2b2 was required in order to provide compatibility with the existing i2b2 application server code.  These modifications also offer fundamental improvements in i2b2 security, storage requirements, and performance.

Figure 6. i2b2 Workbench and related tables

 

In the current i2b2 workbench software the concept dimension is used to populate the list of concept paths displayed on the left of the screen.  The user generates a query via a drag-and-drop of these concept paths into the query boxes at the top right.  The query results are populated from the observation fact table using the box on the lower right.

The SQL query generated by this UI is stored in the request table.  The actual data returned is stored within the response table as an XML BLOB.  Using an i2b2 feature called the Export Data Feature, the user is then granted access to the data contained within that response data BLOB.

This architecture has several disadvantages:

1)      Data stored within the observation fact table is replicated into the response data BLOB.  For users who require a snapshot of data this format may be acceptable.  However, database BLOB fields have size limitations, and we expect researcher queries to require access to very large sets of IDR data.  Replication of such large datasets has distinct disadvantages regarding the efficient use of storage space.

Also not all users want to have access to information as snapshots.  The Ontology Mapper will continuously instance map incoming data which is presented to the system during regular ETL processing.  Some users will request access to this newly instance mapped data as soon as it becomes available.  An algorithm which is based on

2)      This architecture requires the maintenance of a duplicate security paradigm outside of the host database environment.  The BLOBs which are created and exported to the researcher must be permissioned for use solely by the researcher’s study staff under an IRB (Institutional Review Board) protocol.  Since this mechanism does not use native database security for data delivery, that security mechanism must be replicated within the application server layer.

Security models which are dependent on application server security are inherently less secure than models which leverage native database security.  In models which leverage database security, if the application server layer is hacked then the would-be hacker would still not obtain access to the data in question.

To address these concerns, and to provide an easier method of integrating the Ontology Mapper into the i2b2 framework, we have made the following additional modifications to the i2b2 schema.

Figure 10. Extensions to i2b2 to support the generation of Study Specific Views

 

Please note that in this model, instead of providing access to end users via the XML BLOB of response data (referred to as the i2b2 CRC, or Clinical Research Chart), the user is automatically granted access to a database view.  That database view can then be either manipulated directly or materialized into a data mart when necessary.

The original workbench software would also be modified so that the data shown in the lower right portion of the screen is derived from an Observation Fact View, which combines data from the Observation Fact table and either the Map_Data_Fact table or the Map_Aggr_Fact table.

The Observation Fact View is created as follows:

CREATE VIEW OBSERVATION_FACT_VIEW AS

  SELECT OBS.*

  FROM OBSERVATION_FACT OBS, MAP_DATA_FACT MDF

  WHERE OBS.OBSERVATION_FACT_ID = MDF.OBSERVATION_FACT_ID

The same SELECT can also be used in a Create Table As Select (CTAS) statement, a common way of generating derived tables from base tables, when the view needs to be materialized into a physical data mart.
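
Where a physical data mart is preferred over a live view, such a CTAS statement can be combined with native database grants so that access control stays inside the database rather than in the application server; the data mart and role names below are hypothetical:

  -- Materialize the study-specific view into a data mart and grant access
  -- to the study's database role only (names are illustrative).
  CREATE TABLE STUDY_1234_DATAMART AS
    SELECT * FROM OBSERVATION_FACT_VIEW;

  GRANT SELECT ON STUDY_1234_DATAMART TO STUDY_1234_ROLE;

Because the grant is enforced by the database itself, a compromise of the application server layer alone does not expose the underlying data, which is the security property argued for above.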

 

Conclusion

 

Our proposed design is intended to greatly facilitate biomedical research by minimizing the initial investment that is typically required to resolve semantic incongruities that arise when merging data from disparate sources.  Through the use of a rules-based system, the translation of data into the domain of a specific researcher can be accomplished more quickly and efficiently than with a traditional data warehouse design. The proposed system will dramatically lower the barrier to IDR development at biomedical research institutions in support of biomedical and translational research, and will promote inter-institutional data sharing and research collaboration.

 

 


References

 

1. Noy NF, Crubézy M, Fergerson RW, Knublauch H, Samson WT, Vendetti J, Musen M. Protégé-2000: an open-source ontology-development and knowledge acquisition environment. Proc AMIA Symp. 2003:953.

2. Brinkley JF, Suciu D, Detwiler LT, Gennari JH, Rosse C. A framework for using reference ontologies as a foundation for the semantic web. Proc AMIA Symp. 2006:96-100.

3. Gennari JH, Musen MA, Fergerson RW, Grosso WE, Crubézy M, Eriksson H, Noy NF, Tu SW. The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies 2003;58(1):89-123.

4. Advani A, Tu S, O’Connor M, Coleman R, Goldstein MK, Musen M. Integrating a modern knowledge-based system architecture with a legacy VA database: the ATHENA and EON projects at Stanford. Proc AMIA Symp. 1999:653-7.

5. Hobbs L, Hillson S, Lawande S. Oracle9iR2 Data Warehousing. Burlington: Digital Press; 2003. p. 6.

6. Inmon WH, Imhoff C, Sousa R. Introducing the Corporate Information Factory. 2nd ed. New York: John Wiley & Sons, Inc.; 2001.

7. Hoffer JA, Prescott MB, McFadden FR. Modern Database Management. Upper Saddle River: Prentice Hall; 2002. p. 421.

8. Nadkarni PM, et al. Organization of heterogeneous scientific data using the EAV/CR representation. Journal of the American Medical Informatics Association 1999.