Health Ontology Mapper
Space shortcuts
Space Tools

Questions for Dr. Shawn Murphy regarding i2b2 design

1.           What is the best approach for linking our database design to the i2b2 schema for the following areas:

a.           Rules Mapping Schema  (store both concept_cd / concept_path)

b.           Data Discovery Schema

and…

c.           Data Request Schema

(please review the current design of the key relationships used in the Provider table as an example)

2.           Design Approach (Rob) – Should we keep our Ontology Mapper design separate from the CRC design?

Yes, our ontology mapper design can exist as separate tables in the database that link to the i2b2 schema design. In addition, our Ontology Mapper can read / write rows into the observation_fact table.

The study specific CRCs would be built based on the rows which OntoMapper adds to the main enterprise-wide CRC.

3.           Rules Mapping Design (Prakash) – What date fields in the Concept_Dimension table should be used when a concept is updated or a new concept is added?

Update_Date – This is the date that the source system thinks that something was updated

Download_Date – This is the date that the concept was downloaded into the staging area (ETL process). This date is important for clarifying synchronization issues.

Import_Date – This is the date that the concept was actually loaded from the staging area into the Concept_Dimension

So the dates may not be directly related to our mapping rules: instead the mapping_table should store both the concept_cd and concept_path based on the answer to question 4.

4.           Concept_Dimension to Observation_Fact relationship. (Marco) – The concept_cd Foreign key in the Observation_Fact table does not link to the Concept_Dimension primary key therefore allowing one concept code to be associated with multiple concept paths. Is this correct? Is this essentially how the many to many relationship between the two tables is handled rather than creating an associative entity or bridge entity?

Yes, the concept code from the Observation_Fact table may be related to many concept_paths in the concept dimension table. A single record from the Concept_Dimension table must be obtained by linking the concept_cd values and selecting a specific concept_path value.

5.           Patient_Dimension (Marco) – Is this table a mini MPI?   Does this table need to be regenerated to handle derived data such as ‘Age’?   Are any data elements aggregated or rolled up?

Yes, in a sense it is a mini MPI. The Patient Dimension table Three columns are required: Patient_Num, Birth_date, Death_date.  The rest of the fields are derived or aggregated. The method or algorithm for aggregation is either to take the latest value or to take the value that has the maximum occurrence. For instance, the value for gender = ‘Male’ 9 times and ‘Female’ 1 time from the different source systems for the same patient. The algorithm would store ‘Male’ in the Patient Dimension.

Also we may, as a group, want to review the paper written by Steve Johnson on the CDR EAV Model.

6.           Do you have any suggestions for indexes on the Dimension tables besides the primary key?

The AUG slides provide some details on how to index the observation_fact table.

Dimension table indexes:

Concept:  Primary Key on concept_path

Secondary index on concept_cd

Provider: Primary key on Provider_ID and Provider_Path

Secondary index on provider_id

Patient Dimension: Primary key on Patient_num

Indexes on every other field

Visit Dimension: Primary key on Encounter_num

Indexes on every other field

7.           Will there be a time_dimension table in the new release? Will there be an SQL script for loading the time_dimension or is it up to each installation?

It would be nice to have one but unfortunately a time dimension would be difficult to implement. However, it may be possible to implement a Date dimension.  Shawn mentioned that we may be able to use a lookup table algorithm to implement this (see 12 below).

8.           Is it possible to export the CRC to a denormalized extract file or snapshot of data?

This is coming in the next release. I2b2 plans on release updates starting in June and every 15 days afterwards through August.  Please note that these extracted datasets are a Physical extract.  Different DB products use different terminologies for this but we should be able to support this on all of the platforms.

We may have an issue here: CRC extracts are limited to a max size of 1 gig.  This appears to be an accidental limitation imposed by the physical design.  Also there appears to be a work-around for this which involves the use of an SFTP server for transfer.  This has no direct bearing on the Ontomapper project but may become a serious issue related to our individual i2b2 installations.  For one thing, FTPs is Much faster than SFTP and we need to get that additional server deployed into our individual architectures.

9.           How do you deploy/implement multiple CRCs within the hive? Do you need to create multiple instances of the i2b2demodata star schema?

Yes, A CRC has a direction relationship with the i2b2demodata schema. One approach is to create one large CRC that contains everything and then extract observation_fact records into a separate schema for a data mart, a particular protocol or study, along with a copy of all of the dimensions.

Shawn offered to send us the new documentation for this feature … we may want to take him up on that offer.

Also Shawn seemed to indicate that Partners didn’t always choose to store translated data back into the enterprise-wide main CRC.  Sometimes they chose to only store translated data in one of the study specific CRCs instead.  This approach may be in contradiction to Shawn’s answer to our question #2 above, but since the suggested approach in #2 above does meet our minimum system requirements, we could add this option to OntoMapper in a later release.

10. Dr. Mark Weiner had a question regarding the NLP cell and how it leverages the UMLS.

We ran out of time for this question but Shawn did explain that  i2b2 can run on multiple machines rather than one single server. The cells are related to one another in the Project Management cell. For instance, the NLP cell could run on a Windows server while the rest of the i2b2 Hive runs on a Linux server.

I2b2 (the new release) has a Service Oriented Architecture and the integration of non-core modules can be distributed.

11. How is a new ontology loaded into the i2b2 database.

Shawn described the necessary tables in the i2b2metadata schema that are used by the Ontology Cell for loading new ontologies.

The TABLE_ACCESS table contains a record for each ontology table that is loaded. In the Demo database, there is only one Ontology table called: i2b2. New ontologies could be added as new tables with names such as: ICD-9, SNOMED, LOINC and referenced in the TABLE_ACCESS.C_TABLE_NAME field.

The ROW IN THE ontology table is moved into the CONCEPT_DIMENSION table by extracting three columns:

C_FULLNAME, C_NAME, C_BASECODE

The C_FULLNAME maps to the CONCEPT_PATH column

The C_NAME maps to the NAME_CHAR column

The C_BASECODE maps to the CONCEPT_CD column

We need to determine if the following statement is true:

Only those rows in the Ontology table, which have data encoded for them in the fact table, are actually present in the CONCEPT_DIMENSION table.  If there is a row in an ontology, which does Not have any data associated with it, then that row is Not listed in the CONCEPT_DIMENSION table.

12. Upcoming additions:

I2b2 will be adding a LOOKUP table which does not qualify as a DIMENSION because it doesn’t strictly support the fact table. However, there are several codes in the observation fact table such as VALTYPE_CD which require lookup values. The LOOKUP table is considered metadata rather than a dimension table.

I2b2 will be adding 2 additional tables: patient_mapping and encounter_mapping. These tables will allow multiple patient identifiers from different source systems to map to one internal patient_num value. The encounter_mapping table will provide similar mappings between multiple visit numbers and a single internal encounter_num.