[CORE-49] i2b2DemoData.I2B2 Appears to store Trees as Paths and multiple times Created: 04/May/11  Updated: 05/Jun/14  Resolved: 28/May/14

Status: Closed
Project: i2b2 Core Software
Component/s: Data
Affects Version/s: 1.6.00-RC2
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Daniel Sullivan Assignee: Mike Mendis
Resolution: Working As Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows/RHE (both)

Rank: 0|i000zz:
Participant/s:

 Description   
I am currently reviewing an active instance of i2b2 at the UW (Seattle, Wa). While reviewing this instance I took a close look at the table I2B2 in DB i2b2MetaData. There are 2 concerns I have: 1) I think best practice for storing ontology trees is not to store all possible paths (for reasons of combinatoric explosion and the basic limits on how wide and usable text field should be) but also 2) in this table (and I ran a query to verify this) these paths are stored 3 times.

For example: Let's say I have the following path(s).

/Economy/Jobs/Boeing
/Economy/Jobs/Microsoft

In the current version of i2b2, these would be stored as shown above. Assuming deep trees, this gets much worse at a faster rate of complexity growth.

However, you could store a concept edge list instead

Economy -> Jobs
Jobs -> Boeing
Jobs -> Microsoft

If you first translate to an integer key from a look up table, this version of the ontology tree will be more compact, faster to search, and less likely to be prone to update anomalies.

For a small ontology or concept tree, this wouldn't be a big deal. For situations involving health care data this seems unsustainable as a strategy. Otherwise, you are forced to artificially splice the tree to fit the paths into these 3 fields -- which seems error prone and an update nightmare. Is anyone putting any thought into re-thinking how these concept/ontology trees are stored?

Here is the SQL I ran:

/****** Script for SelectTopNRows command from SSMS ******/
SELECT TOP 1000 [C_HLEVEL]
      ,[C_FULLNAME]
      ,[C_NAME]
      ,[C_SYNONYM_CD]
      ,[C_VISUALATTRIBUTES]
      ,[C_TOTALNUM]
      ,[C_BASECODE]
      ,[C_METADATAXML]
      ,[C_FACTTABLECOLUMN]
      ,[C_TABLENAME]
      ,[C_COLUMNNAME]
      ,[C_COLUMNDATATYPE]
      ,[C_OPERATOR]
      ,[C_DIMCODE]
      ,[C_COMMENT]
      ,[C_TOOLTIP]
      ,[UPDATE_DATE]
      ,[DOWNLOAD_DATE]
      ,[IMPORT_DATE]
      ,[SOURCESYSTEM_CD]
      ,[VALUETYPE_CD]
  FROM [i2b2metadata].[i2b2metadata].[I2B2]
  where
[C_FULLNAME] <> [C_TOOLTIP]
and
[C_FULLNAME] <> [C_DIMCODE]
and
[C_DIMCODE] <> [C_TOOLTIP]

 Comments   
Comment by Daniel Sullivan [ 04/May/11 ]
Also, my bad, the db is i2b2MetaData... sorry..
Comment by Daniel Sullivan [ 04/May/11 ]
When I did a distinct, using the following query, I got 54K unique Paths out of 84K records. Here is the query:

SELECT distinct
[C_FULLNAME],
[C_TOOLTIP],
[C_DIMCODE]
FROM [i2b2metadata].[i2b2metadata].[I2B2]
Comment by Janice Donahoe [ 28/May/14 ]
The demo data is provided as an example. Sites have the liberty to setup their metadata however it meets their needs.
Generated at Sat Apr 20 08:23:10 UTC 2024 using Jira 8.20.11#820011-sha1:0629dd8d260e3954ece49053e565d01dabe11609.