33. ETL Optimization

The ETL pathway (extraction, transformation & loading) can take up a notable amount of time for large datasets, which can sometimes lead to bottlenecks in the sense that e.g. a nightly loading of data is no longer possible. There are several possible optimizations which are not specific to i2b2, but have long been established in the classic data warehouse field.

Disabling Index Maintenance during Loading

As mentioned in the query optimization page, the i2b2 database schema contains several indexes to speed up query performance. When data is being loaded into indexed tables, the databases has to simultaneously update or extend both the actual table row data as well as the related index structures, in order to maintain index consistency at all times. Depending on the number of indexes this can incur a notable performance impact due to concurrent write operations across different areas of the physical hard drives. This effect becomes especially large when tables are not incrementally extended (like in a transactional setting), but truncated and fully loaded (like in a typical data warehouse setting).

It is therefore standard practice in data warehousing to disable or drop indexes before truncating and (re-)loading data into large tables. Even though the database will spend additional time rebuilding the indexes after loading, performance is improved as it is carried out sequentially and not simultaneously to table row data loading.

Exception

This rule does not apply if a table is being maintained incrementally in the sense of "delta updates". In this case indexes may in fact be beneficial to quickly locate rows to be updated

The following Oracle SQL script contains statements to drop indexes from the OBSERVATION_FACT table (part 1 - to be inserted into an ETL pathway before fact loading) and to recreate the same indexes (part 2 - to be inserted into an ETL pathway after fact loading).

DROP TABLE "OBSERVATION_FACT" cascade constraints;
CREATE TABLE "OBSERVATION_FACT"
( note} "ENCOUNTER_NUM" NUMBER(38,0) NOT NULL ENABLE,

"PATIENT_NUM" NUMBER(38,0) NOT NULL ENABLE,
"CONCEPT_CD" VARCHAR2(50 BYTE) NOT NULL ENABLE,
"PROVIDER_ID" VARCHAR2(50 BYTE) NOT NULL ENABLE,
"START_DATE" DATE NOT NULL ENABLE,
"MODIFIER_CD" VARCHAR2(100 BYTE) NOT NULL ENABLE,
"INSTANCE_NUM" NUMBER(18,0) NOT NULL ENABLE,
"VALTYPE_CD" VARCHAR2(50 BYTE),
"TVAL_CHAR" VARCHAR2(255 CHAR),
"NVAL_NUM" NUMBER(18,5),
"VALUEFLAG_CD" VARCHAR2(50 BYTE),
"QUANTITY_NUM" NUMBER(18,5),
"UNITS_CD" VARCHAR2(50 BYTE),
"END_DATE" DATE,
"LOCATION_CD" VARCHAR2(50 BYTE),
"OBSERVATION_BLOB" CLOB,
"CONFIDENCE_NUM" NUMBER(18,5),
"UPDATE_DATE" DATE,
"DOWNLOAD_DATE" DATE,
"IMPORT_DATE" DATE,
"SOURCESYSTEM_CD" VARCHAR2(50 BYTE),
"UPLOAD_ID" NUMBER(38,0),
CONSTRAINT "OBSERVATION_FACT_PK" PRIMARY KEY ("ENCOUNTER_NUM", "CONCEPT_CD", "PROVIDER_ID", "START_DATE", "MODIFIER_CD", "INSTANCE_NUM") USING INDEX LOCAL ENABLE
)
LOB ("OBSERVATION_BLOB") STORE AS BASICFILE (ENABLE STORAGE IN ROW CHUNK 8192 RETENTION)
PARTITION BY RANGE ("CONCEPT_CD")
(
PARTITION BIO VALUES LESS THAN ('DEM'),
PARTITION DEM VALUES LESS THAN ('FAL'),
PARTITION FAL VALUES LESS THAN ('ICD'),
PARTITION ICD VALUES LESS THAN ('LAB'),
PARTITION LAB VALUES LESS THAN ('OPS'),
PARTITION OPS VALUES LESS THAN ('ORG'),
PARTITION ORG VALUES LESS THAN ('PAT'),
PARTITION PAT VALUES LESS THAN ('RAD'),
PARTITION RAD VALUES LESS THAN ('THE'),
PARTITION THE VALUES LESS THAN ('ZZZ')
);
CREATE INDEX of_ctx_blob ON observation_fact (observation_blob) indextype is ctxsys.context parameters ('sync (on commit)');
CREATE INDEX fact_nolob ON observation_fact (patient_num, start_date, concept_cd, encounter_num, instance_num, nval_num, tval_char, valtype_cd, modifier_cd, valueflag_cd, provider_id, quantity_num, units_cd, end_date, location_cd, confidence_num, update_date, download_date, import_date, sourcesystem_cd, upload_id) LOCAL;
CREATE INDEX fact_patcon_date_prvd_idx ON observation_fact (patient_num, concept_cd, start_date, end_date, encounter_num, instance_num, provider_id, nval_num, valtype_cd) LOCAL;
CREATE INDEX idrt_fact_cnpt_pat_enct_idx ON observation_fact (concept_cd, instance_num, patient_num, encounter_num) LOCAL;
CREATE INDEX idrt_fact_mdf_pat_enct_idx ON observation_fact (modifier_cd, instance_num, patient_num, encounter_num) LOCAL;

-- 1. Drop OBSERVATION_FACT indexes
DROP INDEX of_ctx_blob;
DROP INDEX fact_nolob;
DROP INDEX fact_patcon_date_prvd_idx;
DROP INDEX idrt_fact_cnpt_pat_enct_idx;
DROP INDEX idrt_fact_mdf_pat_enct_idx;

-- 2. Recreate OBSERVATION_FACT indexes
CREATE INDEX of_ctx_blob                 ON observation_fact (observation_blob) indextype is ctxsys.context  parameters ('sync (on commit)');
CREATE INDEX fact_nolob                  ON observation_fact (patient_num, start_date, concept_cd, encounter_num, instance_num, nval_num, tval_char, valtype_cd, modifier_cd, valueflag_cd, provider_id, quantity_num, units_cd, end_date, location_cd, confidence_num, update_date, download_date, import_date, sourcesystem_cd, upload_id) LOCAL;
CREATE INDEX fact_patcon_date_prvd_idx   ON observation_fact (patient_num, concept_cd, start_date, end_date, encounter_num, instance_num, provider_id, nval_num, valtype_cd) LOCAL;
CREATE INDEX idrt_fact_cnpt_pat_enct_idx ON observation_fact (concept_cd, instance_num, patient_num, encounter_num) LOCAL;
CREATE INDEX idrt_fact_mdf_pat_enct_idx  ON observation_fact (modifier_cd, instance_num, patient_num, encounter_num) LOCAL;

The script should be executed with the credentials of the schema containing the OBSERVATION_FACT table

Please note that the script already contains both the optimized indexes as well as the LOCAL option for partitioning, which may need to be removed depending on availability and use of the partitioning feature.

Disabling Index Maintenance during Loading

Pages

Recently updated