Update file I/O for Harmonica #91

@twhetzel

Description

TASKS

  • add script to pivot conditions from column headers to row values
  • try to include conditions in SchemaAutomator so that they are part of the model

Some "conditions" may be included within a subjects file, e.g. https://github.com/linkml/dm-bip/blob/main/toy_data/raw_data/subject.tsv. In that layout there is only one line of data per participant/subject: the column headers are the condition names, and each value is yes or no to indicate whether the participant/subject has that condition.
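As a starting point for the pivot task, here is a minimal sketch of turning the one-line-per-participant layout into one row per participant and condition. It uses only the Python standard library. The column names (`participant_id`, `age`, `asthma`, `diabetes`) and the sample rows are made up for illustration and are not taken from the toy data file.

```python
import csv
import io


def pivot_conditions(tsv_text, id_column, condition_columns):
    """Turn yes/no condition columns into one row per (participant, condition)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        for condition in condition_columns:
            rows.append(
                {
                    "participant_id": record[id_column],
                    "condition": condition,  # former column header
                    "value": record[condition],  # yes/no cell value
                }
            )
    return rows


# Hypothetical sample; the real subject.tsv columns may differ.
SAMPLE = (
    "participant_id\tage\tasthma\tdiabetes\n"
    "P1\t34\tyes\tno\n"
    "P2\t51\tno\tno\n"
)

long_rows = pivot_conditions(SAMPLE, "participant_id", ["asthma", "diabetes"])
for row in long_rows:
    print(row)
```

Non-condition columns such as `age` are left out of the pivoted rows here; whether they should be carried along is an open design choice.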

However, some information about conditions is more complicated, e.g. https://github.com/linkml/dm-bip/blob/main/toy_data/raw_data_conditions/conditions_complex-questions.tsv. There, the conditions come from survey questions: each participant/subject is asked the full set of questions, answers may or may not overlap across participants/subjects, and the file has multiple lines per participant/subject, one for each survey question/answer.
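To make the survey layout concrete, the sketch below reads a multi-line-per-participant file, groups answers by participant, and lists the distinct question/answer pairs. The column names (`participant_id`, `question`, `answer`) and the sample questions are assumptions for illustration; the actual conditions_complex-questions.tsv may be structured differently.

```python
import csv
import io
from collections import defaultdict


def read_survey(tsv_text):
    """Parse the survey TSV into a list of dict records."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def answers_by_participant(records):
    """Map each participant to a {question: answer} dict."""
    grouped = defaultdict(dict)
    for record in records:
        grouped[record["participant_id"]][record["question"]] = record["answer"]
    return dict(grouped)


def distinct_question_answers(records):
    """Return the sorted set of (question, answer) pairs across participants."""
    return sorted({(r["question"], r["answer"]) for r in records})


# Hypothetical sample rows, not copied from the toy data.
SURVEY = (
    "participant_id\tquestion\tanswer\n"
    "P1\tEver diagnosed with asthma?\tyes\n"
    "P1\tEver diagnosed with diabetes?\tno\n"
    "P2\tEver diagnosed with asthma?\tyes\n"
)

records = read_survey(SURVEY)
grouped = answers_by_participant(records)
pairs = distinct_question_answers(records)
print(grouped)
print(pairs)
```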

Currently, for the INCLUDE project, the Data Intake team reformats the "conditions" data. Taking the non-survey conditions as the first example, the information is reformatted so that each unique conditions value appears once in the file. That file is annotated with Harmonica, and the resulting annotations file includes additional columns for the ontology CURIE and label for each ontology the data is annotated with. However, the pre- and post-annotation steps used to create this file are not known.
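Since the actual pre- and post-annotation steps are not known, the following is only a guess at what they could look like: collect the unique condition values, write them to a one-column file for annotation, then join the CURIE and label columns back onto the pivoted rows. The annotation step itself is replaced by a hand-written dictionary; no Harmonica call is shown, and the `EX:` CURIEs, labels, and column names are placeholders.

```python
import csv
import io


def unique_condition_values(long_rows):
    """Return distinct condition values in first-seen order."""
    return list(dict.fromkeys(row["condition"] for row in long_rows))


def write_annotation_input(conditions):
    """Serialize the unique values as a one-column TSV to hand to the annotator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["condition"])
    for condition in conditions:
        writer.writerow([condition])
    return buffer.getvalue()


def merge_annotations(long_rows, annotations):
    """Add curie/label columns to each row; unannotated values get empty strings."""
    merged = []
    for row in long_rows:
        curie, label = annotations.get(row["condition"], ("", ""))
        merged.append({**row, "curie": curie, "label": label})
    return merged


# Hypothetical pivoted rows (one per participant and condition).
long_rows = [
    {"participant_id": "P1", "condition": "asthma", "value": "yes"},
    {"participant_id": "P1", "condition": "diabetes", "value": "no"},
    {"participant_id": "P2", "condition": "asthma", "value": "no"},
]

unique_values = unique_condition_values(long_rows)
annotation_input = write_annotation_input(unique_values)

# Stand-in for the annotator output: placeholder CURIEs, not real ontology IDs.
annotations = {"asthma": ("EX:0000001", "asthma (placeholder label)")}

merged = merge_annotations(long_rows, annotations)
print(annotation_input)
print(merged)
```

Annotating each unique value once and joining afterwards keeps the annotation file small, at the cost of an extra merge step.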

For the data ingest pipeline, the initial discussion was that reformatting the data to create the conditions file would happen as manual, customized steps. More recently, there have been discussions about automating this more fully within the pipeline and changing the file I/O for Harmonica. One option is to have Harmonica annotate the conditions from the participant/subject file and add the annotations back into the participant/subject file. Another option is to include the annotation as a step within LinkMLMap.

For the first option (annotating the participant/subject file and adding the annotations back), @amc-corey-cox, do you want Harmonica to take a config file, for example, so that it knows which columns are conditions to annotate, or do you want that handled by a separate pre-processing script? There are a couple of ways to extract the conditions and then combine the annotated conditions information with the subjects, so I would like to align the plans with you.
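To make the config file idea easier to discuss, here is a rough sketch of what such a config could contain and how a script might use it to pick out the condition columns. This is not Harmonica's current interface: the JSON keys `id_column` and `condition_columns`, the function names, and the sample data are all hypothetical.

```python
import csv
import io
import json

# Hypothetical config; the key names are assumptions for this sketch.
CONFIG_TEXT = """
{
  "id_column": "participant_id",
  "condition_columns": ["asthma", "diabetes"]
}
"""


def load_config(text):
    """Parse the config and check that the expected keys are present."""
    config = json.loads(text)
    missing = {"id_column", "condition_columns"} - config.keys()
    if missing:
        raise ValueError(f"config is missing keys: {sorted(missing)}")
    return config


def select_condition_columns(tsv_text, config):
    """Keep only the ID column and the configured condition columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    wanted = [config["id_column"], *config["condition_columns"]]
    absent = [name for name in wanted if name not in (reader.fieldnames or [])]
    if absent:
        raise ValueError(f"columns not found in file: {absent}")
    return [{name: record[name] for name in wanted} for record in reader]


# Hypothetical subjects file content.
SUBJECTS = (
    "participant_id\tage\tasthma\tdiabetes\n"
    "P1\t34\tyes\tno\n"
    "P2\t51\tno\tno\n"
)

config = load_config(CONFIG_TEXT)
selected = select_condition_columns(SUBJECTS, config)
print(selected)
```

The same selection logic would work as a separate pre-processing script; the difference is mainly whether the config is read by Harmonica itself or by a step that runs before it.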
