[CORE-379] Postgres totalnum counter very slow Created: 29/Oct/19 Updated: 08/Apr/20 Resolved: 27/Feb/20 |
|
Status: | Closed |
Project: | i2b2 Core Software |
Component/s: | None |
Affects Version/s: | 1.7.12 |
Fix Version/s: | 1.7.12 |
Type: | Bug | Priority: | Minor |
Reporter: | mike mendis | Assignee: | Jeffrey Klann |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Issue Links: |
|
||||||||
Rank: | 0|s0008w: | ||||||||
Affects Database/s: |
PostgreSQL
|
||||||||
Participant/s: |
Description |
The postgres totalnum script takes a very long time to run because a self-join on an ontology table with a LIKE clause on c_fullname takes ages. The same query on MSSQL with the same tables and same indices takes <5s.
select distinct p1.c_fullname, p2.c_basecode from dimCountOnt p1 inner join dimCountOnt p2 on p2.c_fullname like p1.c_fullname||'%' escape '#' where p2.c_fullname like '\i2b2\Diagnoses\%' escape '#' and p1.c_fullname like '\\i2b2\\Diagnoses\\%' Using specialized postgres indices did not help: create index dimCountOntA on dimCountOnt using spgist (c_fullname); CREATE INDEX dimCountOntB ON dimCountOnt(c_fullname text_pattern_ops); CREATE INDEX dimCountOntBjk ON dimCountOnt USING btree(c_fullname varchar_pattern_ops); |
Comments |
Comment by Jeffrey Klann [ 04/Nov/19 ] |
Jeff Green provided this optimization:
"Jeff, Looking at your code and Mike's code, it appears that for each node in the Diagnoses branch of an i2b2 ontology tree you are trying to find all of the concept codes that are associated with the node and all of its children. Worst case, without (or with incorrect) optimization by the server a join using the like operator can behave like a cross join between two tables. Since your table has approximately 100,000 rows, you could get potentially get up to 10 billion rows returned (if you could wait long enough for the join to finish) depending on how your ontology tree was built. Instead of taking each node and trying to identify and separate all of its children in the table from the nodes that are not its children, I found it more efficient to turn the problem upside down. I took the single row for each node and exploded it into multiple rows, one row for each of its parents as defined by the path stored in the c_fullname column. Instead of using a join to find the children I used a union all to gather up all of the parent rows exploded from the children. The PostgreSQL code below shows how I accomplished this without the need for any temporary tables or indexes, returning 83,787 rows. This code runs in about 18 seconds on my laptop against the i2b2 demo data in a PostgreSQL server running within a VirtualBox Linux machine. With minor dialect changes it ran in about 11 seconds using the i2b2 demo data in SQL Server running directly on my laptop. I hope this helps! Jeff Green Prognosis Data Corp with recursive concepts (c_fullname, c_hlevel, c_basecode) as ( select c_fullname, c_hlevel, c_basecode from i2b2metadata.i2b2 where lower(c_facttablecolumn) = 'concept_cd' and lower(c_tablename) = 'concept_dimension' and lower(c_columnname) = 'concept_path' and m_applied_path = '@' and coalesce(c_fullname,'') <> '' and coalesce(c_basecode,'') <> '' and c_fullname like '\\i2b2\\%' union all select cast( left(c_fullname, length(c_fullname)-position('\' in right(reverse(c_fullname), length(c_fullname)-1))) as varchar(700) ) c_fullname, c_hlevel-1 c_hlevel, c_basecode from concepts where concepts.c_hlevel>0 ) select distinct c_fullname, c_basecode from concepts where c_fullname like '\\i2b2\\Diagnoses\\%' order by c_fullname, c_basecode " |
Comment by Jeffrey Klann [ 04/Nov/19 ] |
Speed improvement added to code, now runs in 2 minutes on demodata. |
Comment by Jeffrey Klann [ 27/Feb/20 ] |
This was fixed in December 2019 |