IDRT - Integrated Data Repository Toolkit
Space shortcuts
Space Tools

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Optimization of i2b2 query strategy

i2b2 queries often contain multiple attributes which are combined with AND clauses (e.g. "diagnosis = diabetes AND age <= 18 years"). The i2b2 CRC cell (Clinical Research Chart) queries these attributes consecutively from the OBSERVATION_FACT table, using the result sets of each query as a base for the next query. Thus the amount of patients or encounters that need needs to be evaluated decrease over each step of the query plan. In order to narrow down the search space quickly, it would be ideal to first query for attributes occurring rarely in the overall dataset. While this optimization is supported by i2b2, the platform needs to know about the frequency of each concept in the overall dataset to make use of it.

The ontology table "I2B2" contains the column "C_TOTALNUM" for this purpose. It provides the count of distinct patients for each concept in the overall dataset. The CRC cell utilizes this column to sort the order of query steps in the overall query plan in order to first query for attributes with low occurenceoccurrence. However, the C_TOTALNUM column is not routinely filled during loading of the OBSERVATION_FACT table, which leads to this optimization not being applied in many cases.

The following SQL Script can be exected executed at the end of an ETL (extraction, transformation, loading) pathway to calculate all value for the C_TOTALNUM column automatically. The script was developed on an Oracle 11g platform but should be easily portable to other relational databases.

Info
titleCoverage

Please note that the script will calculate the patient counts only over facts stored in the OBSERVATION_FACT table. If you installation contains fact data in other tables (e.g. PATIENT_DIMENSION) the script will need to be adapted to your individual configuration.

Code Block
languagesql
collapsetrue
-- SQL Script to calculate the column i2b2.c_totalnum

-- 1. Backup the i2b2 table
CREATE TABLE t_i2b2 AS (SELECT * FROM i2b2);

-- 2a. Calculate count of distinct patients for all branches and leaves of concept hierarchy (regular concepts
CREATE TABLE t_totalnum AS
SELECT i2b.c_tablename,
       i2b.c_fullname,
       COUNT(DISTINCT obs.patient_num) AS c_totalnum
  FROM observation_fact  obs
  JOIN concept_dimension con ON obs.concept_cd   =    con.concept_cd
  JOIN i2b2              i2b ON con.concept_path LIKE i2b.c_fullname || '%'
 GROUP BY i2b.c_tablename,i2b.c_fullname;

-- 2b. (Optionally) Calculate count of distinct patients for all branches and leaves of concept hierarchy (modifier concepts
INSERT INTO t_totalnum (c_tablename, c_fullname, c_totalnum)
SELECT i2b.c_tablename,
       i2b.c_fullname,
       COUNT(DISTINCT obs.patient_num) AS c_totalnum
  FROM observation_fact   obs
  JOIN modifier_dimension con ON obs.modifier_cd   =    con.modifier_cd
  JOIN i2b2               i2b ON con.modifier_path LIKE i2b.c_fullname || '%'
 GROUP BY i2b.c_tablename,i2b.c_fullname;

-- 3. Truncate i2b2 table
TRUNCATE TABLE i2b2;

-- 4. Repopulate the i2b2 table from its backup and the generated counts
INSERT INTO i2b2 (c_hlevel, c_fullname, c_name, c_synonym_cd,
c_visualattributes, c_totalnum, c_basecode, c_metadataxml,
c_facttablecolumn, c_tablename, c_columnname, c_columndatatype,
c_operator, c_dimcode, c_comment, c_tooltip, m_applied_path,
update_date, download_date, import_date, sourcesystem_cd, valuetype_cd,
m_exclusion_cd, c_path, c_symbol)
SELECT i2b.c_hlevel,
i2b.c_fullname, i2b.c_name, i2b.c_synonym_cd, i2b.c_visualattributes,
NVL(tnm.c_totalnum, 0), i2b.c_basecode, i2b.c_metadataxml,
i2b.c_facttablecolumn, i2b.c_tablename, i2b.c_columnname,
i2b.c_columndatatype, i2b.c_operator, i2b.c_dimcode, i2b.c_comment,
i2b.c_tooltip, i2b.m_applied_path, i2b.update_date, i2b.download_date,
i2b.import_date, i2b.sourcesystem_cd, i2b.valuetype_cd,
i2b.m_exclusion_cd, i2b.c_path, i2b.c_symbol
  FROM      t_i2b2     i2b
  LEFT JOIN t_totalnum tnm ON i2b.c_tablename = tnm.c_tablename AND i2b.c_fullname = tnm.c_fullname;

-- 5. Drop temporary tables
DROP TABLE t_totalnum;
DROP TABLE t_i2b2;

-- 6. Commit changes
COMMIT;

...

As an additional advantage, the i2b2 web client displays values of C_TOTALNUM in the concept hierarchy. Please note that i2b2 currently does not show counts for modifier concepts and to our knowledge also does not utilize the patient counts for modifier concepts for query optimization. Step 2b of the above SQL script (which is runtime intensive) can thus be commented out without negative effects.

Calculation of database table statistics

Many relational databases keep internal statistics about the composition of data in all tables of the system. These provide information about the total number of values in each column as well as their distribution, among other details. The database query optimizer leverages these statistics to construct the potentially best query strategy, including the order in which to query joined tables or the order in which to apply selection criteria. Depending on database configuration, these statistics may be updated automatically, e.g. after table content is updated, periodically (e.g. each night) or manually. Out-of-date statistics, which do no longer reflect the current content of a table, may lead to notable performance impacts because of inappropriate query strategies proposed by the optimizer. Thus, unless the database system is generally configured to update table statistics immediately after loading, it is recommended to include a procedure to explicitly update table statistics at the end of an i2b2 ETL pathway.

...

Code Block
languagesql
collapsetrue
BEGIN
DBMS_STATS.GATHER_SCHEMA_STATS([SCHEMA]);
END;

Index optimization of the OBSERVATION_FACT table

Column indexes are a long established method to improve query performance by speeding up the selection of relevant rows as well as join operations. In order to achieve this effect, the distinct values of selected table columns are stored separately, together with pointers to table rows containing each value. The database can leverage the often smaller size of these index structures to more quickly locate relevant data rows, instead of having to scan the full original table. Indexes are often stored in specialized structures (e.g. binary trees) which are optimized for quick access. Indexes can be constructed on single columns as well as multiple columns (composite indexes), which are concatenated to form the actual index value. Composite indexes provide an additional performance benefit when multiple columns are queried that are all part of the composite index, as only a single index structure needs to be examined, instead of several. However, composite indexes can only be applied if the provided search criteria are matching the order of columns in the index (e.g. if a composite index is concatenated out of columns A, B and C, it cannot be applied if column A is not part of the query).

Whether the database actually uses indexes or not is decided by the query optimizer at runtime, depending on table statistics and cached performance data from previous queries, and explicit "optimizer hints" that can be inluded inludced in SQL queries, among others. The query strategy actually chosen can be examined at runtime (e.g. with the Oracle EXPLAIN PLAN statement) or by inspecting query logs.

...

Related information: Partitioning of tables and indexes can provide further performance improvements. The following section contains a modified script to create partitioned indexes. Also, the ETL process can be sped up by dropping indexes before loading and recreating them afterwards. 

Partitioning of the OBSERVATION_FACT Table

Some relational databases provide a partitioning feature, which can be applied to break up large tables into separate segments (partitioning, however, often is a separately licensed "enterprise-level" feature). The goal is to group data elements which are usually queried together in distinct partitions. While it would also be possible to explicitly place such groups into separate individual tables, the advantage of partitioning is that the overall table can still be accessed transparently under a single name, thus obviating the need to change program code accessing the table. 

...

In i2b2, the OBSERVATION_FACT table and its related indexes are obvious candidates for partitioning. As mentioned in the previous section, the volume of data types can be very heterogeneous, and partitioning can be leveraged e.g. to skip large volumes of laboratory findings for queries that do not adress address them. The CONCEPT_CD column can be used as a partitioning key: it is usually prefixed with a schema code that relates to the type of data referenced (e.g. ICD for diagnoses, LAB for lab values). Due to the high volume of distinct concept codes, however, list-based partitioning cannot be applied, but rather range-based partitioning that include all concepts within a schema key or group of schema keys. The definition of the partitions and key ranges needs to be adapted to each local i2b2 installation, as it reflects the schema keys and data volumes locally used.

...

Code Block
languagesql
collapsetrue

DROP TABLE "OBSERVATION_FACT" cascade constraints;
CREATE TABLE "OBSERVATION_FACT"
    (           "ENCOUNTER_NUM" NUMBER(38,0) NOT NULL ENABLE,
                "PATIENT_NUM" NUMBER(38,0) NOT NULL ENABLE,
                "CONCEPT_CD" VARCHAR2(50 BYTE) NOT NULL ENABLE,
                "PROVIDER_ID" VARCHAR2(50 BYTE) NOT NULL ENABLE,
                "START_DATE" DATE NOT NULL ENABLE,
                "MODIFIER_CD" VARCHAR2(100 BYTE) NOT NULL ENABLE,
                "INSTANCE_NUM" NUMBER(18,0) NOT NULL ENABLE,
                "VALTYPE_CD" VARCHAR2(50 BYTE),
                "TVAL_CHAR" VARCHAR2(255 CHAR),
                "NVAL_NUM" NUMBER(18,5),
                "VALUEFLAG_CD" VARCHAR2(50 BYTE),
                "QUANTITY_NUM" NUMBER(18,5),
                "UNITS_CD" VARCHAR2(50 BYTE),
                "END_DATE" DATE,
                "LOCATION_CD" VARCHAR2(50 BYTE),
                "OBSERVATION_BLOB" CLOB,
                "CONFIDENCE_NUM" NUMBER(18,5),
                "UPDATE_DATE" DATE,
                "DOWNLOAD_DATE" DATE,
                "IMPORT_DATE" DATE,
                "SOURCESYSTEM_CD" VARCHAR2(50 BYTE),
                "UPLOAD_ID" NUMBER(38,0),
                CONSTRAINT "OBSERVATION_FACT_PK" PRIMARY KEY ("ENCOUNTER_NUM", "CONCEPT_CD", "PROVIDER_ID", "START_DATE", "MODIFIER_CD", "INSTANCE_NUM") USING INDEX LOCAL ENABLE
   )
LOB ("OBSERVATION_BLOB") STORE AS BASICFILE (ENABLE STORAGE IN ROW CHUNK 8192 RETENTION)
PARTITION BY RANGE ("CONCEPT_CD")
   (
   PARTITION BIO VALUES LESS THAN ('DEM'), -- "BIO" partition for biomaterials
   PARTITION DEM VALUES LESS THAN ('ENC'), -- "DEM" partition for demographics
   PARTITION ENC VALUES LESS THAN ('ICD'), -- "ENC" partition for encounter data
   PARTITION ICD VALUES LESS THAN ('LAB'), -- "ICD" partition for diagnoses
   PARTITION LAB VALUES LESS THAN ('OPS'), -- "LAB" partition for lab findings
   PARTITION OPS VALUES LESS THAN ('ORG'), -- "OPS" partition for procedures
   PARTITION ORG VALUES LESS THAN ('PAT'), -- "ORG" partition for organizational data
   PARTITION PAT VALUES LESS THAN ('RAD'), -- "PAT" partition for pathology findings
   PARTITION RAD VALUES LESS THAN ('THE'), -- "RAD" partition for radiology findings
   PARTITION THE VALUES LESS THAN ('ZZZ')  -- "THE" partition for therapy data
   );
CREATE INDEX of_ctx_blob                 ON observation_fact (observation_blob) indextype is ctxsys.context  parameters ('sync (on commit)');
CREATE INDEX fact_nolob                  ON observation_fact (patient_num, start_date, concept_cd, encounter_num, instance_num, nval_num, tval_char, valtype_cd, modifier_cd, valueflag_cd, provider_id, quantity_num, units_cd, end_date, location_cd, confidence_num, update_date, download_date, import_date, sourcesystem_cd, upload_id) LOCAL;
CREATE INDEX fact_patcon_date_prvd_idx   ON observation_fact (patient_num, concept_cd, start_date, end_date, encounter_num, instance_num, provider_id, nval_num, valtype_cd) LOCAL;
CREATE INDEX idrt_fact_cnpt_pat_enct_idx ON observation_fact (concept_cd, instance_num, patient_num, encounter_num) LOCAL;
CREATE INDEX idrt_fact_mdf_pat_enct_idx  ON observation_fact (modifier_cd, instance_num, patient_num, encounter_num) LOCAL;

...