IDRT - Integrated Data Repository Toolkit
Space shortcuts
Space Tools

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Related information: Partitioning of tables and indexes can provide further performance improvements. The following section contains a modified script to create partitioned indexes. Also, the ETL process can be sped up by dropping indexes before loading and recreating them afterwards. 

Partitioning of the OBSERVATION_FACT Table

Some relational databases provide a partitioning feature, which can be applied to break up large tables into separate segments (partitioning, however, often is a separately licensed "enterprise-level" feature). The goal is to group data elements which are usually queried together in distinct partitions. While it would also be possible to explicitly place such groups into separate individual tables, the advantage of partitioning is that the overall table can still be accessed transparently under a single name, thus obviating the need to change program code accessing the table. 

A prerequisite for partitioning is a "partitioning key" column that can be used to unequivocally assign each record to a single partition. Partitions can be defined by ranges over the partition key (e.g. years from 2001-2005, 2006-2010, ...) or by mapping individual values to partitions (list-based partitioning), and other methods. At runtime, the query optimizer can identify search-criteria over the partitioning key to preselect relevant partitions to access (e.g. only the 2001-2005 partition for a query limited to 2004), allowing it to skip over all other (irrelevant) partitions. Besides this kind of implicit partition selection (by including the partition key in query criteria), partitions can also be explicitly referenced in FROM clauses. Partitioning can be applied not only to table data, but also to related indexes, thus speeding up access to indexes as well. Partitioned indexes are also called "local" indexes.

Partitioning can also provide additional benefits during the ETL phase, as it simplifies the isolated truncating and loading of individual segments of data, without having to reload all data or needing to use slower DELETE statements for removing data to be reloaded.

In i2b2, the OBSERVATION_FACT table and its related indexes are obvious candidates for partitioning. As mentioned in the previous section, the volume of data types can be very heterogeneous, and partitioning can be leveraged e.g. to skip large volumes of laboratory findings for queries that do not adress them. The CONCEPT_CD column can be used as a partitioning key: it is usually prefixed with a schema code that relates to the type of data referenced (e.g. ICD for diagnoses, LAB for lab values). Due to the high volume of distinct concept codes, however, list-based partitioning cannot be applied, but rather range-based partitioning that include all concepts within a schema key or group of schema keys. The definition of the partitions and key ranges needs to be adapted to each local i2b2 installation, as it reflects the schema keys and data volumes locally used.

Warning
titleWarning

Please note that existing tables cannot be partitioned. It is necessary to drop the OBSERVATION_FACT table, re-create it with appropriate partitions and reload the data (e.g. through the regular ETL process or from a prior copy of the table)

The following Oracle SQL script drops the OBSERVATION_FACT table, re-creates it with a sample partitioning scheme and re-creates all necessary indexes, including an option to also partition the indexes. NB: the indexes creation statement