Universal vs project-focused
One of the first considerations when starting to design an i2b2 project ontology is the primary target audience that is expected to formulate the queries. Basically, there are two types of users:
Things are different for external researchers or when data from different sources are combined in a research database. Here, a universal ontology has to be created. That means "translating" project-specific labels into generic ones and moving the data accordingly. Events can be classified into screening, baseline, intervention, and follow-up. Data should be classified into unambiguous groups like demography, diagnosis, laboratory, and medications. Wherever possible, mapping data to medical standards or terminologies should be considered.
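The "translation" step can be pictured as a simple lookup table. The following is only a minimal sketch: the visit labels and the names `LABEL_TO_EVENT` and `generic_event` are hypothetical and would have to be replaced by the labels of the actual source project.

```python
# Generic event classes named in the text above.
EVENT_CLASSES = ("screening", "baseline", "intervention", "follow-up")

# Hypothetical project-specific visit labels mapped to generic event classes.
LABEL_TO_EVENT = {
    "V0 Eligibility check": "screening",
    "V1 Enrolment": "baseline",
    "V2 Treatment cycle 1": "intervention",
    "V3 Treatment cycle 2": "intervention",
    "V4 6-month visit": "follow-up",
}


def generic_event(label):
    """Translate a project-specific label into a generic event class."""
    if label not in LABEL_TO_EVENT:
        # Fail loudly so that unmapped labels are noticed during the import.
        raise KeyError(f"no generic event class defined for {label!r}")
    return LABEL_TO_EVENT[label]


print(generic_event("V3 Treatment cycle 2"))
```

Raising an error for unmapped labels, rather than falling back to a default class, keeps ambiguous data out of the universal ontology until someone has decided where it belongs.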
Depth of the navigation
In most cases, it is advisable to just stick to the original depth of navigation. But depending on the source format the data came from, this can lead to very deep hierarchies. For example, after an ODM import, one will find seven levels of hierarchy for a clinical trial:
The first level "Ontology" can also be renamed to something more expressive.
Another idea is to abandon hierarchy levels that were useful for data collection, but not for presentation. On the other hand, no more than 250 concepts should be listed under a single i2b2 folder, for reasons of usability.
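When levels are removed, it can help to check that no folder ends up with too many direct children. The sketch below assumes that every concept and folder is available as a full, backslash-separated path; the function name `oversized_folders` and the example paths are hypothetical.

```python
from collections import Counter

MAX_CHILDREN = 250  # usability limit mentioned in the text


def oversized_folders(paths, limit=MAX_CHILDREN, sep="\\"):
    """Return {parent_path: child_count} for folders with more than `limit` direct children."""
    children = Counter()
    for path in set(paths):  # ignore duplicate paths
        parent, _, leaf = path.rstrip(sep).rpartition(sep)
        if leaf:
            children[parent] += 1
    return {parent: n for parent, n in children.items() if n > limit}


# Hypothetical example: 251 laboratory concepts directly below one folder.
example = [f"\\Lab\\Test {i}" for i in range(251)] + ["\\Demography\\Age"]
print(oversized_folders(example))
```

Any folder reported by such a check is a candidate for an additional intermediate level.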
Splitting large value sets without natural hierarchy
A concept might have a large number of possible values without a normative hierarchy. Examples are code systems like zip codes, genetic information, or costs for billing. In such cases, it is not feasible to list every possible value directly below the concept: 1 Euro, 2 Euro, 3 Euro, ...
The basic idea is to find an artificial, yet reasonable substitute ontology. A possible solution is shown in the Boston Demodata: not every plausible patient age is coded directly below age; instead, there is an intermediate level for every decade (0-9 years, 10-19 years, ...). This way, one can select a whole range of ages with one click.
Splitting ages into decades might be suboptimal for some use cases in clinical research. For instance, when recruiting participants for clinical trials, inclusion criteria hardly ever match decades. In most scenarios, it would be better to have categories like 0-17 years (stricter regulations for trials with minors), 18-80 years, and 81-130 years (special screening for the elderly). Categories should have subcategories where appropriate: 18-80 years might be further split into 18-49, 50-65, and 66-80 years.
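The nested age categories described above could be generated along the following lines. This is a sketch only: the structure `AGE_TREE`, the function `age_path`, and the path prefix `\Demography\Age` are illustrative assumptions, not part of i2b2 or the Boston Demodata.

```python
# Hypothetical age hierarchy: (label, lowest age, highest age, subcategories).
# Bounds are inclusive and taken from the categories suggested in the text.
AGE_TREE = [
    ("0-17 years", 0, 17, []),
    ("18-80 years", 18, 80, [
        ("18-49 years", 18, 49, []),
        ("50-65 years", 50, 65, []),
        ("66-80 years", 66, 80, []),
    ]),
    ("81-130 years", 81, 130, []),
]


def age_path(age, tree=AGE_TREE, prefix="\\Demography\\Age"):
    """Return the folder path under which a single age (in years) is placed."""
    for label, low, high, children in tree:
        if low <= age <= high:
            path = "\\".join([prefix, label])
            if children:
                # Descend into the subcategories of this age range.
                return age_path(age, children, path)
            return "\\".join([path, f"{age} years"])
    raise ValueError(f"age {age} is outside the defined categories")


print(age_path(55))
```

With this layout, selecting the folder "18-80 years" covers all adult ages in one click, while the subfolders still allow a narrower selection.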
Costs could have categories at tens of thousands, thousands, hundreds, and so on. Alphanumeric codes could have categories defined by the first character (0-9, A-Z). Postal codes could have states and counties as classifying attributes.
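Both ideas can be sketched in a few lines. The example assumes whole-number, non-negative costs in Euro and codes that start with a digit or letter; the function names `cost_folders` and `first_char_folder` are hypothetical.

```python
def cost_folders(amount):
    """Return the nested folder labels for a cost, from coarse to fine."""
    if amount < 0:
        raise ValueError("cost must not be negative")
    labels = []
    for step in (10_000, 1_000, 100):
        low = (amount // step) * step  # lower bound of the bucket at this level
        labels.append(f"{low}-{low + step - 1} Euro")
    labels.append(f"{amount} Euro")
    return labels


def first_char_folder(code):
    """Return the folder (0-9, A-Z) for an alphanumeric code, based on its first character."""
    first = code.strip().upper()[:1]
    if not first.isalnum():
        raise ValueError(f"code {code!r} does not start with a digit or letter")
    return first


print(cost_folders(12345))
print(first_char_folder("k52.9"))
```

For a cost of 12345 Euro, this yields the folders 10000-19999 Euro, 12000-12999 Euro, and 12300-12399 Euro, with the value itself at the lowest level.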