Uploaded image for project: 'i2b2 Core Software'
  1. i2b2 Core Software
  2. CORE-49

i2b2DemoData.I2B2 Appears to store Trees as Paths and multiple times

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Working As Designed
    • Affects Version/s: 1.6.00-RC2
    • Fix Version/s: None
    • Component/s: Data
    • Labels:
      None
    • Environment:
      Windows/RHE (both)
    • Rank:
      0|i000zz:

      Description

      I am currently reviewing an active instance of i2b2 at the UW (Seattle, Wa). While reviewing this instance I took a close look at the table I2B2 in DB i2b2MetaData. There are 2 concerns I have: 1) I think best practice for storing ontology trees is not to store all possible paths (for reasons of combinatoric explosion and the basic limits on how wide and usable text field should be) but also 2) in this table (and I ran a query to verify this) these paths are stored 3 times.

      For example: Let's say I have the following path(s).

      /Economy/Jobs/Boeing
      /Economy/Jobs/Microsoft

      In the current version of i2b2, these would be stored as shown above. Assuming deep trees, this gets much worse at a faster rate of complexity growth.

      However, you could store a concept edge list instead

      Economy -> Jobs
      Jobs -> Boeing
      Jobs -> Microsoft

      If you first translate to an integer key from a look up table, this version of the ontology tree will be more compact, faster to search, and less likely to be prone to update anomalies.

      For a small ontology or concept tree, this wouldn't be a big deal. For situations involving health care data this seems unsustainable as a strategy. Otherwise, you are forced to artificially splice the tree to fit the paths into these 3 fields -- which seems error prone and an update nightmare. Is anyone putting any thought into re-thinking how these concept/ontology trees are stored?

      Here is the SQL I ran:

      /****** Script for SelectTopNRows command from SSMS ******/
      SELECT TOP 1000 [C_HLEVEL]
            ,[C_FULLNAME]
            ,[C_NAME]
            ,[C_SYNONYM_CD]
            ,[C_VISUALATTRIBUTES]
            ,[C_TOTALNUM]
            ,[C_BASECODE]
            ,[C_METADATAXML]
            ,[C_FACTTABLECOLUMN]
            ,[C_TABLENAME]
            ,[C_COLUMNNAME]
            ,[C_COLUMNDATATYPE]
            ,[C_OPERATOR]
            ,[C_DIMCODE]
            ,[C_COMMENT]
            ,[C_TOOLTIP]
            ,[UPDATE_DATE]
            ,[DOWNLOAD_DATE]
            ,[IMPORT_DATE]
            ,[SOURCESYSTEM_CD]
            ,[VALUETYPE_CD]
        FROM [i2b2metadata].[i2b2metadata].[I2B2]
        where
      [C_FULLNAME] <> [C_TOOLTIP]
      and
      [C_FULLNAME] <> [C_DIMCODE]
      and
      [C_DIMCODE] <> [C_TOOLTIP]

        Attachments

          Activity

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Git Source Code