Batch alignment #258

Waschina · 2025-04-02T14:57:38Z

TODOs before merge

What's new?

This PR mainly includes a re-implementation of parts in gapseq find and gapseq find-transport. Major changes:

Run time is greatly improved by performing only one large multiple sequence alignment rather than many smaller ones.
Users can now choose between three different sequence alignment algorithms: blast, diamond, mmseqs2. The users can choose the algorithm using the option -A <algorithm> in gapseq find/gapseq find-transport.
A few bug fixes (see below where new gapseq pathway predictions were compared to old predictions).
The output table <query>-Pathways.tbl now includes additional columns that fully document how the completion percent was calculated and why the pathways were predicted to be present or absent. Also, a FAQ and its answer concerning completeness calculations was added to the documentation.
When a genomic nucleotide FASTA file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required.
gapseq automatically selects the appropriate codon translation table by running pyrodigal with three options:
- Table 4: "Mycoplasma/Spiroplasma (Mollicutes)"
- Table 11: "Bacterial, Archaeal, and Plant Plastid Code" (default for most prokaryotic tools)
- Table 25: "Candidate Division SR1 and Gracilibacteria"
The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted:
- In Table 11, UGA is a stop codon.
- In Table 4, UGA codes for Tryptophan.
- In Table 25, UGA codes for Glycine.
Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected.

Admittedly, this approach relies on heuristic thresholds, but it works well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.
There are fewer dependencies on other software libraries. Specifically, the dependencies on 'exonerate', 'barrnap', 'bedtools', 'perl', and 'parallel' were dropped.

Comparison of new to old gapseq pathway predictions

I compared the pathway prediction from gapseq version 1.4 (commit 11c3011) to the new prediction made with batch alignments. Tests were made with genome toy/myb71.faa.gz. A few important things I noticed.

Discrepancy: Old prediction FALSE, new prediction TRUE, same completeness

Example case:

gapseq find -p PWY-5973 myb71.faa.gz

In this case, the old prediction is FALSE in he old version, because a key reaction stated in the file dat/meta_pwy.tbl is not part of the reaction IDs listed for this pathway in the same table. Thus, gapseq tested if the key reaction was predicted, but since it was not tested, it was always FALSE. When checking all pathways, this is relevant for a total of 26 pathways:

library(data.table)
mpw <- fread("dat/meta_pwy.tbl")
mpw <- mpw[keyRea != ""]
mpw[, keyReaRE := gsub(",","|",keyRea)]
mpw[, keyReaRE := gsub("\\.","\\\\.",keyReaRE)]
mpw[, mk := grepl(keyReaRE, reaId), by = id]
mpw[mk == FALSE, .(id,reaId, keyRea)]

Result:

              id                                                                                                                                                                               reaId                            keyRea
          <char>                                                                                                                                                                              <char>                            <char>
 1:   |PWY-1042|                    6PFRUCTPHOS-RXN,2.7.1.90-RXN,F16ALDOLASE-RXN,TRIOSEPISOMERIZATION-RXN,GAPOXNPHOSPHN-RXN,PHOSGLYPHOS-RXN,3PGAREARR-RXN,2PGADEHYDRAT-RXN,PEPDEPHOS-RXN,1.2.1.9-RXN                                OR
 2:  |PWY2B4Q-4|                                                                                                                   RXN2B4Q-48,RXN2B4Q-49,RXN2B4Q-50,RXN2B4Q-51,RXN2B4Q-52,RXN2B4Q-53                         RXN-22424
 3: |PWY2DNV-11|                                                                                                                                           H2PTEROATESYNTH-RXN,RXN2DNV-42,RXN2DNV-43                          RXN-8850
 4:   |PWY-4202| RXN-7002,RXN-7001,RXN-21019,2.1.1.138-RXN,RXN-22325,RXN-22332,RXN-22333,RXN-22334,RXN-22336,RXN-22335,RXN-22329,RXN-22337,1.20.4.2-RXN,TRANS-RXN0-551,RXN-22324,RXN-22323,RXN-22403 RXN-21028,2.1.1.137-RXN,RXN-22330
 5:   |PWY-5115|                                                                                                                                                                            RXN-7772                          RXN-7771
 6:   |PWY-5278|                                                                                                                                       SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217     ADENYLYLSULFATE-REDUCTASE-RXN
 7:   |PWY-5288|                                                                                           RXN-8184,RXN-8185,RXN-8186,RXN-8187,RXN-8189,RXN-8190,RXN-8026,RXN-8025,RXN-8214,RXN-8215                                OR
 8:   |PWY-5306|                                                                                                          RXN-8202,SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217,RXN-17803,RXN-17804     ADENYLYLSULFATE-REDUCTASE-RXN
 9:   |PWY-5308|                                                                                           R164-RXN,RXN-8202,SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217,SULFITE-DEHYDROGENASE-RXN     ADENYLYLSULFATE-REDUCTASE-RXN
10:   |PWY-5690|                         FUMHYDR-RXN,SUCCCOASYN-RXN,2OXOGLUTARATEDEH-RXN,ACONITATEHYDR-RXN,CITSYN-RXN,ACONITATEDEHYDR-RXN,ISOCITRATE-DEHYDROGENASE-NAD+-RXN,MALATE-DEH-RXN,RXN-14971 ISOCITRATE-DEHYDROGENASE-NAD+-RXN
11:   |PWY-5973|                                                                                                                                   RXN-9555,2.3.1.179-RXN,RXN-9556,RXN-9557,RXN-9558                          RXN-8389
12:   |PWY-6676|                                               R17-RXN,RXN-11942,RXN-11941,RXN-11940,RXN-16434,RXN-16435,RXN-17803,RXN-21586,SULFATE-ADENYLYLTRANS-RXN,RXN-23218,RXN-23217,RXN-15349     ADENYLYLSULFATE-REDUCTASE-RXN
13:   |PWY-6893|                                                                                                                                                             THI-P-KIN-RXN,RXN-12610                         RXN-12609
14:   |PWY-7663|                                                                                                                                             RXN-16629,RXN-16630,RXN-16631,RXN-16632                         RXN-16614
15:   |PWY-7966|                                                                                                                                                       RXN-19305,RXN-19301,RXN-19322                         RXN-17124
16:   |PWY-7967|                                                                                                                                                       RXN-19323,RXN-19304,RXN-19300                         RXN-17123
17:   |PWY-7969|                                                                                                                                                       RXN-19320,RXN-19303,RXN-19299                         RXN-17122
18:   |PWY-8195|                                                                                                                   RXN-21844,RXN-21843,2.7.8.6-RXN,RXN-21851,TRANS-RXN-433,RXN-22033                          RXN-9160
19:   |PWY-8198|                                                                                                                   RXN-21844,RXN-21843,2.7.8.6-RXN,RXN-21854,TRANS-RXN-430,RXN-22030                          RXN-9159
20:   |PWY-8246|                                                                                                                                                                 RXN-22248,RXN-22249               RXN-22246,RXN-22245
21:   |PWY-8247|                                                                                                                         RXN-22250,RXN-22251,RXN-22252,RXN-22253,RXN-22262,RXN-22273               RXN-22245,RXN-22246
22:   |PWY-8266|                                                                                                               RXN-21019,RXN-22332,RXN-22333,RXN-22334,RXN-22336,RXN-22335,RXN-22329 RXN-21028,2.1.1.137-RXN,RXN-22330
23:   |PWY-8321|                                                                                                                                                  ISOCHORSYN-RXN,RXN-22684,RXN-22683                          RXN-1981
24:   |PWY-8352|                                                                    NICONUCADENYLYLTRAN-RXN,NAD-SYNTH-GLN-RXN,NAD-SYNTH-NH3-RXN,QUINOPRIBOTRANS-RXN,QUINOLINATE-SYNTHA-RXN,RXN-22981                      1.4.1.21-RXN
25:   |PWY-8391|                                                                                                                                                 URONATE-DEHYDROGENASE-RXN,RXN-23374                         RXN-11407
26:   |PWY-8393|                                                                                                                                                                 RXN-23378,RXN-23376          D-XYLULOSE-REDUCTASE-RXN
              id                                                                                                                                                                               reaId                            keyRea

The new gapseq version ignores defined key reactions, that are not part of the respective pathway.

Discrepancy: Old prediction TRUE, new prediction FALSE, different completeness

Example run:

gapseq find -p SERDEG-PWY myb71.faa.gz

These discrepancies were in all cases I analysed in depth due to a small bug in the previous version when counting found subunits. The responsible code line in the previous version:

https://github.com/jotech/gapseq/blob/11c30119b8aa1407b5c83c1c11b92e983c1efefc/src/gapseq_find.sh#L917C17-L917C84

The problem is that in cases where an undefined subunit was found, it also adds 1 to the counter in the variable $subnits_found. However, these should not be considered in the specific code line, as the $subunit_fraction variable is later compared to the $subunit_cutoff (which is by default 0.5). Thus in cases of two real subunits (with no hits) and an hit to an undefined subunit it predicts the reaction to be present because it calculates 1/2 (1 undef subunit / 2 total real subunits), which is 0.5. This value is equal to $subunit_cutoff , and the undefined hit causes the complex prediction to be TRUE/present. The actual code line should have been:

subunit_fraction=$(echo "100*($subunits_found-$subunits_undefined_found)/$subunits_count" | bc)

This bug is fixed in the new version (within the R code of file analyse_alignments.R).

This problem causes discrepancies in > 25 cases for myb71.faa.gz.

Discrepancy: Less transporters identified in new gapseq version

While I am still investigating this, I noticed a potential bug in the previous gapseq version. E.g. when a hit to a transport with the TC 2.A.1.1.4 is found, it connects this hit also to substrates of transporters with a TC that start with 2.A.1.1.4, e.g. 2.A.1.1.403. The responsible line in the previous gapseq code (src/transporter.sh) was:

substrates=$(grep -F $tc tcdb_all | cut -f2)

With tc=2.A.1.1.4, this line yielded the hits:

2.A.1.1.45 CHEBI:5418;glucose 
2.A.1.1.41 CHEBI:10085;xylose 
2.A.1.1.4 CHEBI:5418;glucose 
2.A.1.1.49 CHEBI:9840;UDP-sugar 
2.A.1.1.40 CHEBI:10085;xylose 
2.A.1.1.43 CHEBI:5584;hydron|CHEBI:5418;glucose|CHEBI:5256;galactose|CHEBI:14575;mannose|CHEBI:5172;fructose 2.A.1.1.47 CHEBI:9885;7,9-dihydro-1H-purine-2,6,8(3H)-trione 
2.A.1.1.404 CHEBI:5418;glucose 2.A.1.1.403 CHEBI:23062;cellodextrin|CHEBI:3522;cellobiose|CHEBI:6353;alpha-lactose|CHEBI:3528;cellotriose 
2.A.1.1.46 CHEBI:5418;glucose 
2.A.1.1.48 CHEBI:9840;UDP-sugar 
2.A.1.1.44 CHEBI:5709;hexose 
2.A.1.1.42 CHEBI:5584;hydron|CHEBI:4167;D-glucopyranose

The new gapseq version is more strict and reports only the substrates of transporters with the exact TC.

Other small changes in the new gapseq version

Complex detection

In the old and the new gapseq version, complexed are detected by analysing the fasta sequence headers for key terms such as "chain" or "subunit". In rare cases, where there were several sequences but only very few that indicated a subunit association, gapseq always needed hits to those sequences in order to say that the complex is there. However, in most organisms this enzyme might not be a complex/heteromer.

New approach: If 20% or less of the sequences are predicted to be a specific subunit, the reaction is not tested as a complex; i.e., no subunit hits are required for the reaction prediction to be TRUE. This is implemented in src/complex_prediction.R

Gram prediction

Gram prediction is used to decide which biomass reaction is added to a bacterial metabolic model. In the previous version, the prediction was made within the gapseq draft, where the biomass reaction was also added to the model. Now, the Gram-staining prediction is moved to gapseq find. The rationale behind this decision is that gapseq find has the genome sequence already as input; performing the HMM-based Gram prediction here makes sense, as this also needs the genome. The predicted Gram-staining is added as info to the headers of the output tables "...-Reactions.tbl" and "...-Pathways.tbl".

Bug fix (#252)

- allows to use annotations from eggnog-mapper to be used in gapseq - only creates reaction table and no pathway table at the moment

Get latest commits from master branch

Next step, perform the actual batch blast!

updated toy data

Not done yet...

- still need to be tested - comparison between aligners - comparison with previous tables still built with sequentional alignments - is the new output compatible with gapseq draft?

Still some errors...

Also: Search for custom EC number(s) was tested and should now work. Also for cases were a single EC does results in no blast or no sequences. Multiple ECs(comma-sep.) were also tested.

Refers to #216 and #219

When a **genomic nucleotide FASTA** file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required. gapseq automatically selects the appropriate codon translation table by running pyrodigal with three options: - Table 4: "Mycoplasma/Spiroplasma (Mollicutes)" - Table 11: "Bacterial, Archaeal, and Plant Plastid Code" (default for most prokaryotic tools) - Table 25: "Candidate Division SR1 and Gracilibacteria" The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted: - In Table 11, UGA is a stop codon. - In Table 4, UGA codes for Tryptophan. - In Table 25, UGA codes for Glycine. Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected. Admittedly, this approach relies on heuristic thresholds, but it seems to work well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.

And keeping bedtools and barrnap as dependencies. For now.

In general, I would recommend using full metabolite names in models and not abbreviations.

I have run some basic tests, which all worked well. More tests, with special input, are still required. Also, the output works with the draft reconstruction script.

This breaks the previous syntax of 'doall' but makes it more coherent with the other procedures and how options are passed to the subscripts. Also it gives more flexibility to customize the doall workflow.

The prompt output (logs) is formatted to match previous gapseq versions.

These are: mmseqs diamond pyrodigal

- Details for optional dependencies - Details on the automatic determination of the codon table

- Metabolite and Reactions attributes were wrong or missing in the output model files.

The data is now stored within the output model files in RDS format (i.e., `...-draft.RDS`)

- old example of how to use usage gapseq fill was not working

- no output files could be written if quick test is enabled

The column "complex" when reading the alignment stats in gapseq draft should be NA and not an empty string if the reaction is not catalyzed by a protein complex.

Bug: Sometimes, the indexing in the mclapply function of the vectors 'rea' and 'reaName' did not work properly.

Waschina and others added 24 commits March 3, 2025 08:41

getDBhit() function implemented in R

975e37f

getDBhit implementation in R

e1fe68c

First steps towards reference sequence collate before batch alignment

dee5da2

Merge pull request #253 from jotech/master

dabdebf

Bug fix (#252)

More steps towards batch alignments

052426c

initial eggnog-mapper import scripts

9f15d05

- allows to use annotations from eggnog-mapper to be used in gapseq - only creates reaction table and no pathway table at the moment

getting req seq files

1fffb42

Merge pull request #255 from jotech/master

c612763

Get latest commits from master branch

Finished creating full fasta with all required reference seqs

cd5124b

Next step, perform the actual batch blast!

first batch blast test

e9cb4e5

Merge pull request #256 from jotech/master

3be0cba

updated toy data

Supporting blast, diamond, and mmseq2 as aligners

45dcb32

Analysing batch alignment statistics

c13d470

Not done yet...

First reaction output table with new batch alignments

c7ae8c8

- still need to be tested - comparison between aligners - comparison with previous tables still built with sequentional alignments - is the new output compatible with gapseq draft?

Pathway prediction output table with new batch alignments

f781473

Testing "-e <EC>" input

77fa1ac

Still some errors...

Corrected error in pathway presence/absence prediction

f4744f0

Also: Search for custom EC number(s) was tested and should now work. Also for cases were a single EC does results in no blast or no sequences. Multiple ECs(comma-sep.) were also tested.

Finalized pathway presence/absence inference from alignments

eabf8ef

Small correction to threshold handling

37545de

use 'strictCandidate' option

1dc03b9

Speed improvements via 'parallel'

5a24d29

Use vagueCutoff parameter in pathway completeness calculations

cc7cd53

Refers to #216 and #219

Update Complex Detection and Vague Reaction Handling

5108f08

Waschina requested a review from jotech April 2, 2025 15:01

Waschina added 5 commits April 2, 2025 17:26

Update installation instructions and conda env definition

da3efef

Add pathway prediction status to [...]-Reaction.tbl table

0b9dd55

New output table comment header

98892b4

And keeping bedtools and barrnap as dependencies. For now.

Allow different aligners in transporter prediction

8435792

getDBhit() function implemented in R

111199c

Waschina and others added 30 commits April 7, 2025 12:04

Adjustments in draft generation – bug fix in transporter.sh

f584ca6

Update tutorial

a918010

Draft script to process gapseq find-transport alignment results

6c68a99

Adjust transport alignment analysis

6cadcb2

Renamed Neu5Ac to N-acetylneuraminate

7f45ecf

In general, I would recommend using full metabolite names in models and not abbreviations.

Optimized transporter script finished

23c42e1

I have run some basic tests, which all worked well. More tests, with special input, are still required. Also, the output works with the draft reconstruction script.

Added ammonium channel to custom seed transporter table

680c06d

Update gapseq start script and moved 'doall' to separate script file

62b2ee9

This breaks the previous syntax of 'doall' but makes it more coherent with the other procedures and how options are passed to the subscripts. Also it gives more flexibility to customize the doall workflow.

Smaller changes in docs and doall

2c14f31

Small clean up

6cbeab0

Allow custom aligner options to be proided to alignment tool

5b7476c

Only list pathways matching key with option "-o" in 'gapseq find'

1c6cad7

The prompt output (logs) is formatted to match previous gapseq versions.

Respect option "-x" (no aligmnents) in 'gapseq find'

8f2736a

'gapseq test' looks for optional dependencies

fffc635

These are: mmseqs diamond pyrodigal

Modify installation instruction with opt dependencies info

6450e4d

Update documentation

b852169

- Details for optional dependencies - Details on the automatic determination of the codon table

Bug fix in pan draft (metabolite and reaction attributes)

1c45bca

- Metabolite and Reactions attributes were wrong or missing in the output model files.

Added pyrodigal to conda environment definition

7e4ed47

Bug fix when reading gff files

f170756

File clean up

893eac1

Cleaning up options in find(-transport)

9227a56

Omit -rxnWeights.RDS and -rxnXgenes.RDS output files

d9387be

The data is now stored within the output model files in RDS format (i.e., `...-draft.RDS`)

Bug fix (when input file ends with other suffix than ".fna")

5607729

updated toy model files

2477543

updated toy data to work with usage example

87cabee

- old example of how to use usage gapseq fill was not working

correct error with writing output files and quick test

acc8c02

- no output files could be written if quick test is enabled

Bug fix

a24c7f9

The column "complex" when reading the alignment stats in gapseq draft should be NA and not an empty string if the reaction is not catalyzed by a protein complex.

Clean up and bug fix

31dfe42

Bug: Sometimes, the indexing in the mclapply function of the vectors 'rea' and 'reaName' did not work properly.

updated toy files

641509d

updated gapseq test to check for cobrar version

ac87dd2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Batch alignment #258

Batch alignment #258

Uh oh!

Waschina commented Apr 2, 2025 •

edited

Loading

Uh oh!

Uh oh!

Batch alignment #258

Are you sure you want to change the base?

Batch alignment #258

Uh oh!

Conversation

Waschina commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TODOs before merge

What's new?

Comparison of new to old gapseq pathway predictions

Discrepancy: Old prediction FALSE, new prediction TRUE, same completeness

Discrepancy: Old prediction TRUE, new prediction FALSE, different completeness

Discrepancy: Less transporters identified in new gapseq version

Other small changes in the new gapseq version

Complex detection

Gram prediction

Uh oh!

Uh oh!

Waschina commented Apr 2, 2025 •

edited

Loading