[CORE-49] i2b2DemoData.I2B2 Appears to store Trees as Paths and multiple times Created: 04/May/11 Updated: 05/Jun/14 Resolved: 28/May/14 |
|
Status: | Closed |
Project: | i2b2 Core Software |
Component/s: | Data |
Affects Version/s: | 1.6.00-RC2 |
Fix Version/s: | None |
Type: | Improvement | Priority: | Major |
Reporter: | Daniel Sullivan | Assignee: | Mike Mendis |
Resolution: | Working As Designed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: | Windows/RHE (both) |
Rank: | 0|i000zz: |
Participant/s: |
Description |
I am currently reviewing an active instance of i2b2 at the UW (Seattle, Wa). While reviewing this instance I took a close look at the table I2B2 in DB i2b2MetaData. There are 2 concerns I have: 1) I think best practice for storing ontology trees is not to store all possible paths (for reasons of combinatoric explosion and the basic limits on how wide and usable text field should be) but also 2) in this table (and I ran a query to verify this) these paths are stored 3 times. For example: Let's say I have the following path(s). /Economy/Jobs/Boeing /Economy/Jobs/Microsoft In the current version of i2b2, these would be stored as shown above. Assuming deep trees, this gets much worse at a faster rate of complexity growth. However, you could store a concept edge list instead Economy -> Jobs Jobs -> Boeing Jobs -> Microsoft If you first translate to an integer key from a look up table, this version of the ontology tree will be more compact, faster to search, and less likely to be prone to update anomalies. For a small ontology or concept tree, this wouldn't be a big deal. For situations involving health care data this seems unsustainable as a strategy. Otherwise, you are forced to artificially splice the tree to fit the paths into these 3 fields -- which seems error prone and an update nightmare. Is anyone putting any thought into re-thinking how these concept/ontology trees are stored? Here is the SQL I ran: /****** Script for SelectTopNRows command from SSMS ******/ SELECT TOP 1000 [C_HLEVEL] ,[C_FULLNAME] ,[C_NAME] ,[C_SYNONYM_CD] ,[C_VISUALATTRIBUTES] ,[C_TOTALNUM] ,[C_BASECODE] ,[C_METADATAXML] ,[C_FACTTABLECOLUMN] ,[C_TABLENAME] ,[C_COLUMNNAME] ,[C_COLUMNDATATYPE] ,[C_OPERATOR] ,[C_DIMCODE] ,[C_COMMENT] ,[C_TOOLTIP] ,[UPDATE_DATE] ,[DOWNLOAD_DATE] ,[IMPORT_DATE] ,[SOURCESYSTEM_CD] ,[VALUETYPE_CD] FROM [i2b2metadata].[i2b2metadata].[I2B2] where [C_FULLNAME] <> [C_TOOLTIP] and [C_FULLNAME] <> [C_DIMCODE] and [C_DIMCODE] <> [C_TOOLTIP] |
Comments |
Comment by Daniel Sullivan [ 04/May/11 ] |
Also, my bad, the db is i2b2MetaData... sorry.. |
Comment by Daniel Sullivan [ 04/May/11 ] |
When I did a distinct, using the following query, I got 54K unique Paths out of 84K records. Here is the query: SELECT distinct [C_FULLNAME], [C_TOOLTIP], [C_DIMCODE] FROM [i2b2metadata].[i2b2metadata].[I2B2] |
Comment by Janice Donahoe [ 28/May/14 ] |
The demo data is provided as an example. Sites have the liberty to setup their metadata however it meets their needs. |