Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Working As Designed
-
1.6.00-RC2
-
None
-
None
-
Windows/RHE (both)
Description
I am currently reviewing an active instance of i2b2 at the UW (Seattle, Wa). While reviewing this instance I took a close look at the table I2B2 in DB i2b2MetaData. There are 2 concerns I have: 1) I think best practice for storing ontology trees is not to store all possible paths (for reasons of combinatoric explosion and the basic limits on how wide and usable text field should be) but also 2) in this table (and I ran a query to verify this) these paths are stored 3 times.
For example: Let's say I have the following path(s).
/Economy/Jobs/Boeing
/Economy/Jobs/Microsoft
In the current version of i2b2, these would be stored as shown above. Assuming deep trees, this gets much worse at a faster rate of complexity growth.
However, you could store a concept edge list instead
Economy -> Jobs
Jobs -> Boeing
Jobs -> Microsoft
If you first translate to an integer key from a look up table, this version of the ontology tree will be more compact, faster to search, and less likely to be prone to update anomalies.
For a small ontology or concept tree, this wouldn't be a big deal. For situations involving health care data this seems unsustainable as a strategy. Otherwise, you are forced to artificially splice the tree to fit the paths into these 3 fields -- which seems error prone and an update nightmare. Is anyone putting any thought into re-thinking how these concept/ontology trees are stored?
Here is the SQL I ran:
/****** Script for SelectTopNRows command from SSMS ******/
SELECT TOP 1000 [C_HLEVEL]
,[C_FULLNAME]
,[C_NAME]
,[C_SYNONYM_CD]
,[C_VISUALATTRIBUTES]
,[C_TOTALNUM]
,[C_BASECODE]
,[C_METADATAXML]
,[C_FACTTABLECOLUMN]
,[C_TABLENAME]
,[C_COLUMNNAME]
,[C_COLUMNDATATYPE]
,[C_OPERATOR]
,[C_DIMCODE]
,[C_COMMENT]
,[C_TOOLTIP]
,[UPDATE_DATE]
,[DOWNLOAD_DATE]
,[IMPORT_DATE]
,[SOURCESYSTEM_CD]
,[VALUETYPE_CD]
FROM [i2b2metadata].[i2b2metadata].[I2B2]
where
[C_FULLNAME] <> [C_TOOLTIP]
and
[C_FULLNAME] <> [C_DIMCODE]
and
[C_DIMCODE] <> [C_TOOLTIP]
For example: Let's say I have the following path(s).
/Economy/Jobs/Boeing
/Economy/Jobs/Microsoft
In the current version of i2b2, these would be stored as shown above. Assuming deep trees, this gets much worse at a faster rate of complexity growth.
However, you could store a concept edge list instead
Economy -> Jobs
Jobs -> Boeing
Jobs -> Microsoft
If you first translate to an integer key from a look up table, this version of the ontology tree will be more compact, faster to search, and less likely to be prone to update anomalies.
For a small ontology or concept tree, this wouldn't be a big deal. For situations involving health care data this seems unsustainable as a strategy. Otherwise, you are forced to artificially splice the tree to fit the paths into these 3 fields -- which seems error prone and an update nightmare. Is anyone putting any thought into re-thinking how these concept/ontology trees are stored?
Here is the SQL I ran:
/****** Script for SelectTopNRows command from SSMS ******/
SELECT TOP 1000 [C_HLEVEL]
,[C_FULLNAME]
,[C_NAME]
,[C_SYNONYM_CD]
,[C_VISUALATTRIBUTES]
,[C_TOTALNUM]
,[C_BASECODE]
,[C_METADATAXML]
,[C_FACTTABLECOLUMN]
,[C_TABLENAME]
,[C_COLUMNNAME]
,[C_COLUMNDATATYPE]
,[C_OPERATOR]
,[C_DIMCODE]
,[C_COMMENT]
,[C_TOOLTIP]
,[UPDATE_DATE]
,[DOWNLOAD_DATE]
,[IMPORT_DATE]
,[SOURCESYSTEM_CD]
,[VALUETYPE_CD]
FROM [i2b2metadata].[i2b2metadata].[I2B2]
where
[C_FULLNAME] <> [C_TOOLTIP]
and
[C_FULLNAME] <> [C_DIMCODE]
and
[C_DIMCODE] <> [C_TOOLTIP]