[CORE-49] i2b2DemoData.I2B2 Appears to store Trees as Paths and multiple times - i2b2 JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Working As Designed
Affects Version/s: 1.6.00-RC2
Fix Version/s: None
Component/s: Data
Labels:
None
Environment:
Windows/RHE (both)

Description

I am currently reviewing an active instance of i2b2 at the UW (Seattle, Wa). While reviewing this instance I took a close look at the table I2B2 in DB i2b2MetaData. There are 2 concerns I have: 1) I think best practice for storing ontology trees is not to store all possible paths (for reasons of combinatoric explosion and the basic limits on how wide and usable text field should be) but also 2) in this table (and I ran a query to verify this) these paths are stored 3 times.

For example: Let's say I have the following path(s).

/Economy/Jobs/Boeing
/Economy/Jobs/Microsoft

In the current version of i2b2, these would be stored as shown above. Assuming deep trees, this gets much worse at a faster rate of complexity growth.

However, you could store a concept edge list instead

Economy -> Jobs
Jobs -> Boeing
Jobs -> Microsoft

If you first translate to an integer key from a look up table, this version of the ontology tree will be more compact, faster to search, and less likely to be prone to update anomalies.

For a small ontology or concept tree, this wouldn't be a big deal. For situations involving health care data this seems unsustainable as a strategy. Otherwise, you are forced to artificially splice the tree to fit the paths into these 3 fields -- which seems error prone and an update nightmare. Is anyone putting any thought into re-thinking how these concept/ontology trees are stored?

Here is the SQL I ran:

/****** Script for SelectTopNRows command from SSMS ******/
SELECT TOP 1000 [C_HLEVEL]
      ,[C_FULLNAME]
      ,[C_NAME]
      ,[C_SYNONYM_CD]
      ,[C_VISUALATTRIBUTES]
      ,[C_TOTALNUM]
      ,[C_BASECODE]
      ,[C_METADATAXML]
      ,[C_FACTTABLECOLUMN]
      ,[C_TABLENAME]
      ,[C_COLUMNNAME]
      ,[C_COLUMNDATATYPE]
      ,[C_OPERATOR]
      ,[C_DIMCODE]
      ,[C_COMMENT]
      ,[C_TOOLTIP]
      ,[UPDATE_DATE]
      ,[DOWNLOAD_DATE]
      ,[IMPORT_DATE]
      ,[SOURCESYSTEM_CD]
      ,[VALUETYPE_CD]
  FROM [i2b2metadata].[i2b2metadata].[I2B2]
  where
[C_FULLNAME] <> [C_TOOLTIP]
and
[C_FULLNAME] <> [C_DIMCODE]
and
[C_DIMCODE] <> [C_TOOLTIP]

Attachments

Activity

Ascending order - Click to sort in descending order

Daniel Sullivan added a comment - 04/May/11 10:59 AM

Also, my bad, the db is i2b2MetaData... sorry..

Daniel Sullivan added a comment - 04/May/11 10:59 AM Also, my bad, the db is i2b2MetaData... sorry..

Daniel Sullivan added a comment - 04/May/11 11:09 AM

When I did a distinct, using the following query, I got 54K unique Paths out of 84K records. Here is the query:

SELECT distinct
[C_FULLNAME],
[C_TOOLTIP],
[C_DIMCODE]
FROM [i2b2metadata].[i2b2metadata].[I2B2]

Daniel Sullivan added a comment - 04/May/11 11:09 AM When I did a distinct, using the following query, I got 54K unique Paths out of 84K records. Here is the query: SELECT distinct [C_FULLNAME], [C_TOOLTIP], [C_DIMCODE] FROM [i2b2metadata].[i2b2metadata].[I2B2]

Janice Donahoe added a comment - 28/May/14 3:25 PM

The demo data is provided as an example. Sites have the liberty to setup their metadata however it meets their needs.

Janice Donahoe added a comment - 28/May/14 3:25 PM The demo data is provided as an example. Sites have the liberty to setup their metadata however it meets their needs.

People

Assignee:: Mike Mendis

Reporter:: Daniel Sullivan

Participant/s:: Daniel Sullivan, Janice Donahoe, Mike Mendis

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 04/May/11 10:57 AM

Updated:: 05/Jun/14 5:41 PM

Resolved:: 28/May/14 3:25 PM

i2b2 Core Software