-
Notifications
You must be signed in to change notification settings - Fork 37
Batch alignment #258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
Waschina
wants to merge
88
commits into
master
Choose a base branch
from
batch_alignment_tests
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Batch alignment #258
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Bug fix (#252)
- allows to use annotations from eggnog-mapper to be used in gapseq - only creates reaction table and no pathway table at the moment
Get latest commits from master branch
Next step, perform the actual batch blast!
updated toy data
Not done yet...
- still need to be tested - comparison between aligners - comparison with previous tables still built with sequentional alignments - is the new output compatible with gapseq draft?
Still some errors...
Also: Search for custom EC number(s) was tested and should now work. Also for cases were a single EC does results in no blast or no sequences. Multiple ECs(comma-sep.) were also tested.
When a **genomic nucleotide FASTA** file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required. gapseq automatically selects the appropriate codon translation table by running pyrodigal with three options: - Table 4: "Mycoplasma/Spiroplasma (Mollicutes)" - Table 11: "Bacterial, Archaeal, and Plant Plastid Code" (default for most prokaryotic tools) - Table 25: "Candidate Division SR1 and Gracilibacteria" The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted: - In Table 11, UGA is a stop codon. - In Table 4, UGA codes for Tryptophan. - In Table 25, UGA codes for Glycine. Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected. Admittedly, this approach relies on heuristic thresholds, but it seems to work well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.
And keeping bedtools and barrnap as dependencies. For now.
In general, I would recommend using full metabolite names in models and not abbreviations.
I have run some basic tests, which all worked well. More tests, with special input, are still required. Also, the output works with the draft reconstruction script.
This breaks the previous syntax of 'doall' but makes it more coherent with the other procedures and how options are passed to the subscripts. Also it gives more flexibility to customize the doall workflow.
The prompt output (logs) is formatted to match previous gapseq versions.
These are: mmseqs diamond pyrodigal
- Details for optional dependencies - Details on the automatic determination of the codon table
- Metabolite and Reactions attributes were wrong or missing in the output model files.
The data is now stored within the output model files in RDS format (i.e., `...-draft.RDS`)
- old example of how to use usage gapseq fill was not working
- no output files could be written if quick test is enabled
Bug: Sometimes, the indexing in the mclapply function of the vectors 'rea' and 'reaName' did not work properly.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TODOs before merge
gapseq test
to look for optional dependencies.gapseq find-transport
to allow other aligners and to improve speed.doall
still works.gapseq adapt
(i.e., use R implementation of getDBhit function) instead of bash script-g
Exhaustive search is still required, and if yes, how should the alignment command be adjusted?gapseq find
only if-p (all)
?2.0.0-beta
)src/gapseq_quickstart.sh
- I think the functionality is covered bygapseq doall
src/graveyard/
src/coverage.R
—The script won't work anymore since gapseq now performs the alignments on protein fasta genomes. If a nucleotide fasta is provided, it is translated to proteins before the alignments. Also, the coverage (in terms of the percentage of ORFs associated with metabolic reactions) is always calculated ingapseq find
and reported in the header of the output...-Reactions.tbl
and-Pathways.tbl
tables.src/download.sh
- Script is not used anywhere.src/seed_pwy.R
src/seedcorrect.py
What's new?
This PR mainly includes a re-implementation of parts in
gapseq find
andgapseq find-transport
. Major changes:Run time is greatly improved by performing only one large multiple sequence alignment rather than many smaller ones.
Users can now choose between three different sequence alignment algorithms: blast, diamond, mmseqs2. The users can choose the algorithm using the option
-A <algorithm>
ingapseq find
/gapseq find-transport
.A few bug fixes (see below where new gapseq pathway predictions were compared to old predictions).
The output table
<query>-Pathways.tbl
now includes additional columns that fully document how the completion percent was calculated and why the pathways were predicted to be present or absent. Also, a FAQ and its answer concerning completeness calculations was added to the documentation.When a genomic nucleotide FASTA file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required.
gapseq automatically selects the appropriate codon translation table by running pyrodigal with three options:
The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted:
Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected.
Admittedly, this approach relies on heuristic thresholds, but it works well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.
There are fewer dependencies on other software libraries. Specifically, the dependencies on 'exonerate', 'barrnap', 'bedtools', 'perl', and 'parallel' were dropped.
Comparison of new to old gapseq pathway predictions
I compared the pathway prediction from gapseq version 1.4 (commit 11c3011) to the new prediction made with batch alignments. Tests were made with genome
toy/myb71.faa.gz
. A few important things I noticed.Discrepancy: Old prediction FALSE, new prediction TRUE, same completeness
Example case:
In this case, the old prediction is FALSE in he old version, because a key reaction stated in the file
dat/meta_pwy.tbl
is not part of the reaction IDs listed for this pathway in the same table. Thus, gapseq tested if the key reaction was predicted, but since it was not tested, it was always FALSE. When checking all pathways, this is relevant for a total of 26 pathways:Result:
The new gapseq version ignores defined key reactions, that are not part of the respective pathway.
Discrepancy: Old prediction TRUE, new prediction FALSE, different completeness
Example run:
These discrepancies were in all cases I analysed in depth due to a small bug in the previous version when counting found subunits. The responsible code line in the previous version:
https://github.com/jotech/gapseq/blob/11c30119b8aa1407b5c83c1c11b92e983c1efefc/src/gapseq_find.sh#L917C17-L917C84
The problem is that in cases where an undefined subunit was found, it also adds 1 to the counter in the variable
$subnits_found
. However, these should not be considered in the specific code line, as the$subunit_fraction
variable is later compared to the$subunit_cutoff
(which is by default 0.5). Thus in cases of two real subunits (with no hits) and an hit to an undefined subunit it predicts the reaction to be present because it calculates 1/2 (1 undef subunit / 2 total real subunits), which is 0.5. This value is equal to$subunit_cutoff
, and the undefined hit causes the complex prediction to be TRUE/present. The actual code line should have been:subunit_fraction=$(echo "100*($subunits_found-$subunits_undefined_found)/$subunits_count" | bc)
This bug is fixed in the new version (within the R code of file
analyse_alignments.R
).This problem causes discrepancies in > 25 cases for
myb71.faa.gz
.Discrepancy: Less transporters identified in new gapseq version
While I am still investigating this, I noticed a potential bug in the previous gapseq version. E.g. when a hit to a transport with the TC 2.A.1.1.4 is found, it connects this hit also to substrates of transporters with a TC that start with 2.A.1.1.4, e.g. 2.A.1.1.403. The responsible line in the previous gapseq code (
src/transporter.sh
) was:substrates=$(grep -F $tc tcdb_all | cut -f2)
With tc=2.A.1.1.4, this line yielded the hits:
The new gapseq version is more strict and reports only the substrates of transporters with the exact TC.
Other small changes in the new gapseq version
Complex detection
In the old and the new gapseq version, complexed are detected by analysing the fasta sequence headers for key terms such as "chain" or "subunit". In rare cases, where there were several sequences but only very few that indicated a subunit association, gapseq always needed hits to those sequences in order to say that the complex is there. However, in most organisms this enzyme might not be a complex/heteromer.
New approach: If 20% or less of the sequences are predicted to be a specific subunit, the reaction is not tested as a complex; i.e., no subunit hits are required for the reaction prediction to be TRUE. This is implemented in
src/complex_prediction.R
Gram prediction
Gram prediction is used to decide which biomass reaction is added to a bacterial metabolic model. In the previous version, the prediction was made within the
gapseq draft
, where the biomass reaction was also added to the model. Now, the Gram-staining prediction is moved togapseq find
. The rationale behind this decision is thatgapseq find
has the genome sequence already as input; performing the HMM-based Gram prediction here makes sense, as this also needs the genome. The predicted Gram-staining is added as info to the headers of the output tables "...-Reactions.tbl" and "...-Pathways.tbl".