IDRT - Integrated Data Repository Toolkit
Space shortcuts
Space Tools

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Optimization of i2b2 query strategy

i2b2 queries often contain multiple attributes which are combined with AND clauses (e.g. "diagnosis = diabetes AND age <= 18 years"). The i2b2 CRC cell (Clinical Research Chart) queries these attributes consecutively from the OBSERVATION_FACT table, using the result sets of each query as a base for the next query. Thus the amount of patients or encounters that need needs to be evaluated decrease over each step of the query plan. In order to narrow down the search space quickly, it would be ideal to first query for attributes occurring rarely in the overall dataset. While this optimization is supported by i2b2, the platform needs to know about the frequency of each concept in the overall dataset to make use of it.

The ontology table "I2B2" contains the column "C_TOTALNUM" for this purpose. It provides the count of distinct patients for each concept in the overall dataset. The CRC cell utilizes this column to sort the order of query steps in the overall query plan in order to first query for attributes with low occurenceoccurrence. However, the C_TOTALNUM column is not routinely filled during loading of the OBSERVATION_FACT table, which leads to this optimization not being applied in many cases.

The following SQL Script can be exected executed at the end of an ETL (extraction, transformation, loading) pathway to calculate all value for the C_TOTALNUM column automatically. The script was developed on an Oracle 11g platform but should be easily portable to other relational databases.

...

As an additional advantage, the i2b2 web client displays values of C_TOTALNUM in the concept hierarchy. Please note that i2b2 currently does not show counts for modifier concepts and to our knowledge also does not utilize the patient counts for modifier concepts for query optimization. Step 2b of the above SQL script (which is runtime intensive) can thus be commented out without negative effects.

Calculation of database table statistics

Many relational databases keep internal statistics about the composition of data in all tables of the system. These provide information about the total number of values in each column as well as their distribution, among other details. The database query optimizer leverages these statistics to construct the potentially best query strategy, including the order in which to query joined tables or the order in which to apply selection criteria. Depending on database configuration, these statistics may be updated automatically, e.g. after table content is updated, periodically (e.g. each night) or manually. Out-of-date statistics, which do no longer reflect the current content of a table, may lead to notable performance impacts because of inappropriate query strategies proposed by the optimizer. Thus, unless the database system is generally configured to update table statistics immediately after loading, it is recommended to include a procedure to explicitly update table statistics at the end of an i2b2 ETL pathway.

...

Code Block
languagesql
collapsetrue
BEGIN
DBMS_STATS.GATHER_SCHEMA_STATS([SCHEMA]);
END;

Index optimization of the OBSERVATION_FACT table

Column indexes are a long established method to improve query performance by speeding up the selection of relevant rows as well as join operations. In order to achieve this effect, the distinct values of selected table columns are stored separately, together with pointers to table rows containing each value. The database can leverage the often smaller size of these index structures to more quickly locate relevant data rows, instead of having to scan the full original table. Indexes are often stored in specialized structures (e.g. binary trees) which are optimized for quick access. Indexes can be constructed on single columns as well as multiple columns (composite indexes), which are concatenated to form the actual index value. Composite indexes provide an additional performance benefit when multiple columns are queried that are all part of the composite index, as only a single index structure needs to be examined, instead of several. However, composite indexes can only be applied if the provided search criteria are matching the order of columns in the index (e.g. if a composite index is concatenated out of columns A, B and C, it cannot be applied if column A is not part of the query).

Whether the database actually uses indexes or not is decided by the query optimizer at runtime, depending on table statistics and cached performance data from previous queries, and explicit "optimizer hints" that can be inluded inludced in SQL queries, among others. The query strategy actually chosen can be examined at runtime (e.g. with the Oracle EXPLAIN PLAN statement) or by inspecting query logs.

...

Related information: Partitioning of tables and indexes can provide further performance improvements. The following section contains a modified script to create partitioned indexes. Also, the ETL process can be sped up by dropping indexes before loading and recreating them afterwards. 

Partitioning of the OBSERVATION_FACT Table

Some relational databases provide a partitioning feature, which can be applied to break up large tables into separate segments (partitioning, however, often is a separately licensed "enterprise-level" feature). The goal is to group data elements which are usually queried together in distinct partitions. While it would also be possible to explicitly place such groups into separate individual tables, the advantage of partitioning is that the overall table can still be accessed transparently under a single name, thus obviating the need to change program code accessing the table. 

...

In i2b2, the OBSERVATION_FACT table and its related indexes are obvious candidates for partitioning. As mentioned in the previous section, the volume of data types can be very heterogeneous, and partitioning can be leveraged e.g. to skip large volumes of laboratory findings for queries that do not adress address them. The CONCEPT_CD column can be used as a partitioning key: it is usually prefixed with a schema code that relates to the type of data referenced (e.g. ICD for diagnoses, LAB for lab values). Due to the high volume of distinct concept codes, however, list-based partitioning cannot be applied, but rather range-based partitioning that include all concepts within a schema key or group of schema keys. The definition of the partitions and key ranges needs to be adapted to each local i2b2 installation, as it reflects the schema keys and data volumes locally used.

...