32. Query Optimization

Optimization of i2b2 query strategy

i2b2 queries often contain multiple attributes which are combined with AND clauses (e.g. "diagnosis = diabetes AND age <= 18 years"). The i2b2 CRC cell queries these attributes consecutively from the OBSERVATION_FACT table, using the result sets of each query as a base for the next query. Thus the amount of patients or encounters that need to be evaluated decrease over each step of the query plan. In order to narrow down the search space quickly, it would be ideal to first query for attributes occurring rarely in the overall dataset. While this optimization is supported by i2b2, the platform needs to know about the frequency of each concept in the overall dataset to make use of it.

The ontology table "I2B2" contains the column "C_TOTALNUM" for this purpose. It provides the count of distinct patients for each concept in the overall dataset. The CRC utilizes this column to sort the order of query steps in the overall query plan in order to first query for attributes with low occurence. However, the C_TOTALNUM column is not routinely filled during loading of the OBSERVATION_FACT table, which leads to this optimization not being applied in many cases.

The following SQL Script can be exected at the end of an ETL (extraction, transformation, loading) pathway to calculate all value for the C_TOTALNUM column automatically. The script was developed on an Oracle 11g platform but should be easily portable to other relational databases.

-- SQL Script to calculate the column i2b2.c_totalnum

-- 1. Backup the i2b2 table
CREATE TABLE t_i2b2 AS (SELECT * FROM i2b2);

-- 2a. Calculate count of distinct patients for all branches and leaves of concept hierarchy (regular concepts
CREATE TABLE t_totalnum AS
SELECT i2b.c_tablename,
       i2b.c_fullname,
       COUNT(DISTINCT obs.patient_num) AS c_totalnum
  FROM observation_fact  obs
  JOIN concept_dimension con ON obs.concept_cd   =    con.concept_cd
  JOIN i2b2              i2b ON con.concept_path LIKE i2b.c_fullname || '%'
 GROUP BY i2b.c_tablename,i2b.c_fullname;

-- 2b. (Optionally) Calculate count of distinct patients for all branches and leaves of concept hierarchy (modifier concepts
INSERT INTO t_totalnum (c_tablename, c_fullname, c_totalnum)
SELECT i2b.c_tablename,
       i2b.c_fullname,
       COUNT(DISTINCT obs.patient_num) AS c_totalnum
  FROM observation_fact   obs
  JOIN modifier_dimension con ON obs.modifier_cd   =    con.modifier_cd
  JOIN i2b2               i2b ON con.modifier_path LIKE i2b.c_fullname || '%'
 GROUP BY i2b.c_tablename,i2b.c_fullname;

-- 3. Truncate i2b2 table
TRUNCATE TABLE i2b2;

-- 4. Repopulate the i2b2 table from its backup and the generated counts
INSERT INTO i2b2 (c_hlevel, c_fullname, c_name, c_synonym_cd,
c_visualattributes, c_totalnum, c_basecode, c_metadataxml,
c_facttablecolumn, c_tablename, c_columnname, c_columndatatype,
c_operator, c_dimcode, c_comment, c_tooltip, m_applied_path,
update_date, download_date, import_date, sourcesystem_cd, valuetype_cd,
m_exclusion_cd, c_path, c_symbol)
SELECT i2b.c_hlevel,
i2b.c_fullname, i2b.c_name, i2b.c_synonym_cd, i2b.c_visualattributes,
NVL(tnm.c_totalnum, 0), i2b.c_basecode, i2b.c_metadataxml,
i2b.c_facttablecolumn, i2b.c_tablename, i2b.c_columnname,
i2b.c_columndatatype, i2b.c_operator, i2b.c_dimcode, i2b.c_comment,
i2b.c_tooltip, i2b.m_applied_path, i2b.update_date, i2b.download_date,
i2b.import_date, i2b.sourcesystem_cd, i2b.valuetype_cd,
i2b.m_exclusion_cd, i2b.c_path, i2b.c_symbol
  FROM      t_i2b2     i2b
  LEFT JOIN t_totalnum tnm ON i2b.c_tablename = tnm.c_tablename AND i2b.c_fullname = tnm.c_fullname;

-- 5. Drop temporary tables
DROP TABLE t_totalnum;
DROP TABLE t_i2b2;

-- 6. Commit changes
COMMIT;

Notes

The script temporarily truncates the i2b2 table and repopulates it from a backup. This should be kept in mind if local modifications of i2b2 installations include triggers on this table.

The script creates and drops 2 temporary tables during execution.

The script should be invoked with the user credentials of the schema containing the i2b2 ontology table.

As an additional advantage, the i2b2 web client displays values of C_TOTALNUM in the concept hierarchy. Please note that i2b2 currently does not show counts for modifier concepts and to our knowledge also does not utilize the patient counts for modifier concepts for query optimization. Step 2b of the above SQL script (which is runtime intensive) can thus be commented out without negative effects.

Calculation of database table statistics

Many relational databases keep internal statistics about the composition of data in all tables of the system. These provide information about the total number of values in each column as well as their distribution, among other details. The database query optimizer leverages these statistics to construct the potentially best query strategy, including the order in which to query joined tables or the order in which to apply selection criteria. Depending on database configuration, these statistics may be updated automatically, e.g. after table content is updated, periodically (e.g. each night) or manually. Out-of-date statistics, which do no longer reflect the current content of a table, may lead to notable performance impacts because of inappropriate query strategies proposed by the optimizer. Thus, unless the database system is generally configured to update table statistics immediately after loading, it is recommended to include a procedure to explicitly update table statistics at the end of an i2b2 ETL pathway.

The following Oracle PL/SQL script invokes the function to gather the statistics of the i2b2 schema. The placeholder [SCHEMA] needs to be replaced with the name of the schema containing the i2b2 fact and/or dimension tables.

BEGIN
DBMS_STATS.GATHER_SCHEMA_STATS([SCHEMA]);
END;

Optimization of i2b2 query strategy

Calculation of database table statistics

Index optimization of the OBSERVATION_FACT table

Pages

Recently updated