IDRT - Integrated Data Repository Toolkit
Space shortcuts
Space Tools

Versions Compared


  • This line was added.
  • This line was removed.
  • Formatting was changed.

Optimization of i2b2 query strategy

i2b2 queries often contain multiple attributes which are combined with AND clauses (e.g. "diagnosis = diabetes AND age <= 18 years"). The i2b2 CRC cell (Clinical Research Chart) queries these attributes consecutively from the OBSERVATION_FACT table, using the result sets of each query as a base for the next query. Thus the amount of patients or encounters that needs to be evaluated decrease over each step of the query plan. In order to narrow down the search space quickly, it would be ideal to first query for attributes occurring rarely in the overall dataset. While this optimization is supported by i2b2, the platform needs to know about the frequency of each concept in the overall dataset to make use of it.


As an additional advantage, the i2b2 web client displays values of C_TOTALNUM in the concept hierarchy. Please note that i2b2 currently does not show counts for modifier concepts and to our knowledge also does not utilize the patient counts for modifier concepts for query optimization. Step 2b of the above SQL script (which is runtime intensive) can thus be commented out without negative effects.

Calculation of database table statistics

Many relational databases keep internal statistics about the composition of data in all tables of the system. These provide information about the total number of values in each column as well as their distribution, among other details. The database query optimizer leverages these statistics to construct the potentially best query strategy, including the order in which to query joined tables or the order in which to apply selection criteria. Depending on database configuration, these statistics may be updated automatically, e.g. after table content is updated, periodically (e.g. each night) or manually. Out-of-date statistics, which do no longer reflect the current content of a table, may lead to notable performance impacts because of inappropriate query strategies proposed by the optimizer. Thus, unless the database system is generally configured to update table statistics immediately after loading, it is recommended to include a procedure to explicitly update table statistics at the end of an i2b2 ETL pathway.


Code Block

Index optimization of the OBSERVATION_FACT table

Column indexes are a long established method to improve query performance by speeding up the selection of relevant rows as well as join operations. In order to achieve this effect, the distinct values of selected table columns are stored separately, together with pointers to table rows containing each value. The database can leverage the often smaller size of these index structures to more quickly locate relevant data rows, instead of having to scan the full original table. Indexes are often stored in specialized structures (e.g. binary trees) which are optimized for quick access. Indexes can be constructed on single columns as well as multiple columns (composite indexes), which are concatenated to form the actual index value. Composite indexes provide an additional performance benefit when multiple columns are queried that are all part of the composite index, as only a single index structure needs to be examined, instead of several. However, composite indexes can only be applied if the provided search criteria are matching the order of columns in the index (e.g. if a composite index is concatenated out of columns A, B and C, it cannot be applied if column A is not part of the query).


Related information: Partitioning of tables and indexes can provide further performance improvements. The following section contains a modified script to create partitioned indexes. Also, the ETL process can be sped up by dropping indexes before loading and recreating them afterwards. 

Partitioning of the OBSERVATION_FACT Table

Some relational databases provide a partitioning feature, which can be applied to break up large tables into separate segments (partitioning, however, often is a separately licensed "enterprise-level" feature). The goal is to group data elements which are usually queried together in distinct partitions. While it would also be possible to explicitly place such groups into separate individual tables, the advantage of partitioning is that the overall table can still be accessed transparently under a single name, thus obviating the need to change program code accessing the table.