[CORE-379] Postgres totalnum counter very slow Created: 29/Oct/19  Updated: 08/Apr/20  Resolved: 27/Feb/20

Status: Closed
Project: i2b2 Core Software
Component/s: None
Affects Version/s: 1.7.12
Fix Version/s: 1.7.12

Type: Bug Priority: Minor
Reporter: mike mendis Assignee: Jeffrey Klann
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
related to CORE-343 Provide totalnum counting scripts Closed
Rank: 0|s0008w:
Affects Database/s:
PostgreSQL
Participant/s:

 Description   
The postgres totalnum script takes a very long time to run because a self-join on an ontology table with a LIKE clause on c_fullname takes ages. The same query on MSSQL with the same tables and same indices takes <5s.

select distinct p1.c_fullname, p2.c_basecode
from dimCountOnt p1
inner join dimCountOnt p2
on p2.c_fullname like p1.c_fullname||'%' escape '#'
 where p2.c_fullname like '\i2b2\Diagnoses\%' escape '#'
and p1.c_fullname like '\\i2b2\\Diagnoses\\%'

Using specialized postgres indices did not help:

create index dimCountOntA on dimCountOnt using spgist (c_fullname);
CREATE INDEX dimCountOntB ON dimCountOnt(c_fullname text_pattern_ops);
CREATE INDEX dimCountOntBjk ON dimCountOnt USING btree(c_fullname varchar_pattern_ops);

 Comments   
Comment by Jeffrey Klann [ 04/Nov/19 ]
Jeff Green provided this optimization:

"Jeff,

     Looking at your code and Mike's code, it appears that for each node in the Diagnoses branch of an i2b2 ontology tree you are trying to find all of the concept codes that are associated with the node and all of its children. Worst case, without (or with incorrect) optimization by the server a join using the like operator can behave like a cross join between two tables. Since your table has approximately 100,000 rows, you could get potentially get up to 10 billion rows returned (if you could wait long enough for the join to finish) depending on how your ontology tree was built.

     Instead of taking each node and trying to identify and separate all of its children in the table from the nodes that are not its children, I found it more efficient to turn the problem upside down. I took the single row for each node and exploded it into multiple rows, one row for each of its parents as defined by the path stored in the c_fullname column. Instead of using a join to find the children I used a union all to gather up all of the parent rows exploded from the children. The PostgreSQL code below shows how I accomplished this without the need for any temporary tables or indexes, returning 83,787 rows. This code runs in about 18 seconds on my laptop against the i2b2 demo data in a PostgreSQL server running within a VirtualBox Linux machine. With minor dialect changes it ran in about 11 seconds using the i2b2 demo data in SQL Server running directly on my laptop.

     I hope this helps!

Jeff Green
Prognosis Data Corp

with recursive concepts (c_fullname, c_hlevel, c_basecode) as
(
select c_fullname, c_hlevel, c_basecode
from i2b2metadata.i2b2
where lower(c_facttablecolumn) = 'concept_cd'
and lower(c_tablename) = 'concept_dimension' and lower(c_columnname) = 'concept_path'
and m_applied_path = '@'
and coalesce(c_fullname,'') <> '' and coalesce(c_basecode,'') <> ''
and c_fullname like '\\i2b2\\%'
union all
select cast(
left(c_fullname, length(c_fullname)-position('\' in right(reverse(c_fullname), length(c_fullname)-1)))
as varchar(700)
) c_fullname,
c_hlevel-1 c_hlevel, c_basecode
from concepts
where concepts.c_hlevel>0
)
select distinct c_fullname, c_basecode
from concepts
where c_fullname like '\\i2b2\\Diagnoses\\%'
order by c_fullname, c_basecode "
Comment by Jeffrey Klann [ 04/Nov/19 ]
Speed improvement added to code, now runs in 2 minutes on demodata.
Comment by Jeffrey Klann [ 27/Feb/20 ]
This was fixed in December 2019
Generated at Fri Apr 26 20:32:38 UTC 2024 using Jira 8.20.11#820011-sha1:0629dd8d260e3954ece49053e565d01dabe11609.