Commandline Tutorial

If you run ab12phylo-cmd without parameters or with -h, it will show you the usage help:

$ ab12phylo-cmd

AB12PHYLO commandline version 0.5.12-beta built on 31 May 2021

usage: ab12phylo-cmd [-h] [-p1 | -p2 | -px] [-viz] [-view] [-dir DIR]
                     [-g GENES [GENES ...]] [-abi ABI_DIR] [-abiset ABI_SET]
                     [-sampleset SAMPLE_SET] [-csv CSV_DIR] [-r2 REGEX_CSV]
                     [-r1 REGEX_ABI | -r3 REGEX_3 REGEX_3 REGEX_3]
                     [-r4 REGEX_REV] [-rf REF [REF ...] | -rd REF_DIR]
                     [-qal MIN_PHRED] [-bad BAD_STRETCH] [-end END_RATIO]
                     [-local | -none | -remote | -xml BLAST_XML [BLAST_XML ...]]
                     [-db DB] [-dbpath DBPATH] [-remotedb REMOTE_DB]
                     [-algo {clustalo,mafft,muscle,t_coffee}]
                     [-gbl {skip,relaxed,balanced,default,strict}]
                     [-tool {raxml-ng,iqtree2}] [-st START_TREES]
                     [-bst BOOTSTRAP] [-uf] [-evomodel EVOMODEL | -findmodel]
                     [-s SEED] [-metric {TBE,FBP}]
                     [-msa_viz [{pdf,png} [{pdf,png} ...]]]
                     [-threshold THRESHOLD]
                     [-out_fmt {pdf,png,svg} [{pdf,png,svg} ...]]
                     [-md MIN_DIST] [-mpd MIN_PLOT_DIST]
                     [-drop DROP_NODES [DROP_NODES ...]]
                     [-replace REPLACE_NODES [REPLACE_NODES ...]] [-root ROOT]
                     [-supp] [-gap GAP_SHARE] [-unk UNKNOWN_SHARE] [-poly]
                     [-i | -v] [-c CONFIG] [-nt MAX_THREADS] [-version]
                     [-test] [-q] [-init]
                     [result_dir]

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            show some more information in console output.
  -v, --verbose         show all runtime information in console.
  -c CONFIG, --config CONFIG
                        path to .yaml config file with defaults; command line
                        arguments will override them.
  -nt MAX_THREADS, --max_threads MAX_THREADS
                        Limit the number of CPUs to use for AB12PHYLO.
  -version, --version   print version information and exit.
  -test, --test         Test run.
  -q, --headless        do not start a CGI server nor display in browser. For
                        remote use.
  -init, --initialize   re-initialize ab12phylo-cmd: Search for existing
                        BLAST+, RAxML-NG and IQ-Tree installations, or re-run
                        these.

RUN MODES:
  -p1, --prepare        run first part of ab12phylo-cmd, including BLAST but
                        excluding RAxML-NG/IQ-Tree.
  -p2, --finish         run second part of ab12phylo-cmd, beginning with
                        RAxML-NG/IQ-Tree
  -px, --add_xml        after -p1 run; only read BLAST results. Pass file via
                        -xml.
  -viz, --visualize     invoke ab12phylo-visualize by appending ab12phylo-cmd
                        command.
  -view, --view         invoke ab12phylo-view by appending ab12phylo-cmd
                        command.
[...]

As the full output is a lot to take in, here is a more gradual introduction:

Basic options

Trace data

First, make sure all the ABI trace files for your tree inference project are in a single directory. It's ok if they are distributed across several subdirectories, but there should not be any files ending in .ab1 that you do not want in the tree. (In fact, you can also use a file with an exclusive list of filenames for your analysis, but that's already an advanced feature.)
When you have identified the root directory of all your ABI trace data, let's start building a command:

ab12phylo-cmd -abi <seq_dir>

where <seq_dir> is the directory. Leave out the <> of course, they are meant to signify a placeholder. Also, -abi is actually the short option of --abi_dir: You can use either one, but this tutorial will predominantly use the short options.
The path of your directory might then look like this:
/home/me/mystuff/phylogenetics/all_the_ABI_data
If there are any spaces in the path, make sure to surround it with "", so it looks like
"phylogenetics and popgen/all the ABI data".

Wellsplates

You might have wellsplate mappings, meaning one or more files in .csv format like this: wellsplates/box_2.csv

CS322,CS313,CS084,...
CS079,CS327,...
...

Follow the same approach as for the directory with ABI trace files:

ab12phylo-cmd -abi <seq_dir> -csv <wellsplates_dir>

Genes / Loci

Wellsplate mappings use the filename, so there is probably a file in your dataset that looks a bit like FN_box2_ITS1F_C01.ab1, and maps to CS084 because it comes from the well C01 on plate/box number 2. You can see how that was extracted from the filename, but there is additional information: First, the sampled gene ITS1F. Tell ab12phylo-cmd about all the genes in your analysis:

ab12phylo-cmd -abi <seq_dir> \
    -csv <wellsplates_dir> \
    -g <barcode_gene>

You would pass -g ITS1F in our case. The backslashes \ are only there to tell the terminal that this command continues on the next line. They can help to make it more readable, but you don't have to use them. Furthermore, every commandline option explained here can also be set in a configuration file, see the relevant section below.
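To make the wellsplate lookup described above more concrete, here is a minimal Python sketch. It is not AB12PHYLO's own code. The layout is only inferred from the example (well C01 maps to the third entry on the first line of box_2.csv), so the well letter is assumed to select the column and the well number the line. The names BOX_2 and lookup are made up for illustration.

```python
import csv
import io

# Excerpt of wellsplates/box_2.csv, limited to the IDs shown in the example above.
BOX_2 = "CS322,CS313,CS084\nCS079,CS327\n"


def lookup(plate_csv, well):
    """Return the sample ID stored for a well such as 'C01'.

    Assumed layout (inferred from the example, not confirmed):
    letter -> column (A is the first), number -> line (01 is the first).
    """
    rows = list(csv.reader(io.StringIO(plate_csv)))
    column = ord(well[0].upper()) - ord("A")
    line = int(well[1:]) - 1
    return rows[line][column]


print(lookup(BOX_2, "C01"))  # CS084
```

If your own mappings are laid out the other way round, swap the roles of line and column in the sketch.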

Filename Parsing

As you have seen, the plate number, gene name, and the sequencer's isolate coordinates are parsed from the .ab1 filename with Regular Expressions. You will probably have to adjust the default RegEx to your sequencer's filename pattern; examples can be found here or there. To list all ABI trace filenames, try find . -name '*.ab1' on Linux, where . is the current directory; its subdirectories are searched as well.

⚠️ Make sure the file extension .ab1 is not part of your regular expressions

When you are done refining a RegEx with three capturing groups, pass it via -r1 if they occur in the order plate_number--gene--well_ID. If you are using a bash shell or zsh, make sure to enclose the expression with double quotes "".

ab12phylo-cmd -abi <seq_dir> \
    -csv <wellsplates_dir> \
    -g <barcode_gene> \
    -r1 <"one_regular_expression">
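Before passing an expression via -r1, it can help to try it out in Python first. The sketch below is only an illustration: the pattern is a hypothetical one written for the example filename FN_box2_ITS1F_C01.ab1, and whether AB12PHYLO applies the expression with match or search internally is not something this sketch knows. Note that the .ab1 extension is removed before matching, in line with the warning above.

```python
import re
from pathlib import Path

# Hypothetical pattern for names like FN_box2_ITS1F_C01.ab1; adapt to your data.
# Three capturing groups in the order plate number, gene, well ID.
REGEX_ABI = re.compile(r"^[A-Za-z]+_box(\d+)_([A-Za-z0-9]+)_([A-H]\d{2})$")


def parse_trace_name(path):
    """Return (plate, gene, well) for one trace file path."""
    stem = Path(path).stem  # drops the .ab1 extension before matching
    match = REGEX_ABI.match(stem)
    if match is None:
        raise ValueError(f"{path} does not match the pattern")
    return match.groups()


print(parse_trace_name("traces/FN_box2_ITS1F_C01.ab1"))  # ('2', 'ITS1F', 'C01')
```

Running this over the output of find . -name '*.ab1' is a quick way to spot filenames that your expression misses.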

If plate number, gene and well ID do not occur in that order in your filenames, specify three separate expressions, each with a single capturing group:

ab12phylo-cmd ...
    -r3 <"rx1" "rx2" "rx3">
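The same kind of dry run works for three separate expressions. This sketch assumes they are given in the order plate number, gene, well ID, and that each is searched for anywhere in the filename. Both the expressions and the filename FN_ITS1F_box2_C01 are hypothetical.

```python
import re

# Hypothetical expressions, each with exactly one capturing group,
# assumed to be ordered as plate number, gene, well ID.
REGEX_3 = (r"box(\d+)", r"_([A-Za-z0-9]+)_box", r"_([A-H]\d{2})$")


def parse_with_three(stem, patterns=REGEX_3):
    """Apply each expression on its own and collect the captured values."""
    values = []
    for pattern in patterns:
        match = re.search(pattern, stem)
        values.append(match.group(1) if match else None)
    return tuple(values)


# Gene comes before the plate number here, so one combined -r1 pattern would not fit.
print(parse_with_three("FN_ITS1F_box2_C01"))  # ('2', 'ITS1F', 'C01')
```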

⚠️ Make sure to use three capturing groups / regular expressions even if you do not use wellsplates. In that case, the first one must match nothing '', a space ' ', a minus '-' or an underscore '_'. These are accordingly not valid as plate IDs.

If you define separate regular expressions in the config.yaml, make sure the r3 line looks like a list as shown below, and that r1 is commented out:

regex_csv: (\d)+[^\d_]*.csv           # r2
#regex_abi: '^(...)(...)(...)'        # r1
regex_3: ['rx(1)', '(r)x2', 'r(x)3']  # r3
regex_rev: '(Rev)|(rev)'              # r4

The -r2 option is used for extracting the plate number from wellsplate .csv filenames.
The last RegEx option is -r4, which is intended to identify reverse reads. ab12phylo-cmd will then add the reverse complement of these records to the dataset instead of the record itself.
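To illustrate what happens to reverse reads, here is a small sketch rather than the package's actual implementation. It uses the regex_rev value from the config excerpt above. The helper names and the example filenames are made up.

```python
import re

REGEX_REV = re.compile(r"(Rev)|(rev)")  # regex_rev from the config excerpt
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def oriented(filename, seq):
    """Return the reverse complement if the filename marks a reverse read."""
    return reverse_complement(seq) if REGEX_REV.search(filename) else seq


print(oriented("FN_box2_ITS1F_Rev_C01", "AAGC"))  # GCTT
print(oriented("FN_box2_ITS1F_C01", "AAGC"))      # AAGC
```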

Reference taxa

To include reference taxa, provide a single file in FASTA format for each sequenced locus. Ideally the sequences within would be GenBank records like this. As with regular samples, a reference must have a sequence for each gene to be included in the tree.
To understand how reference taxa are matched across genes, consider the following examples:

>AF347033.1 Alternaria arborescens strain EGS 39-128 18S ribosomal RNA gene, partial sequence; (...)
AGGGATCATTACACAAATATGAAGGCGGGCTGGAACCTCTCGG (...)
>AY563277.1 Stemphylium vesicarium major allergen alt a1 (alt a 1) gene, partial cds (...)
CGCTCTCTTCGCCGCTGCCGGCCTCGCTGCCGCCGCTCCCTTC (...)
>ref. ICMP_19454
TCGACGGGTGAGTTCGAGGCCCTGGAGATGCGCGATGGTGGCA (...)

Accession numbers are unique to a single sequence, so they cannot be used to match a reference taxon across genes. Instead, AB12PHYLO splits the description at each ' ' and extracts the two space-separated elements that follow the word strain (as the strain) or, failing that, the two that follow the accession number (as the species). It would therefore identify the examples above as strain EGS 39-128 and species Stemphylium vesicarium. Mind that using >ICMP_19454 would not work, but the last example given above is parsed as ICMP_19454.
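The parsing rule can be sketched in a few lines of Python. This is a reading of the description above and not the package's code: the function name reference_id is made up, and the choice to look for the word strain first is an assumption based on the first example.

```python
def reference_id(header):
    """Extract the ID used to match a reference across genes (sketch only)."""
    fields = header.lstrip(">").split(" ")
    if "strain" in fields:
        start = fields.index("strain") + 1  # the two elements after 'strain'
    else:
        start = 1  # the two elements after the accession number
    return " ".join(fields[start:start + 2])


print(reference_id(">AF347033.1 Alternaria arborescens strain EGS 39-128 18S ribosomal RNA gene"))
print(reference_id(">AY563277.1 Stemphylium vesicarium major allergen alt a1"))
print(reference_id(">ref. ICMP_19454"))
print(repr(reference_id(">ICMP_19454")))  # empty: nothing follows the first element
```

The last call shows why >ICMP_19454 on its own does not work: the ID sits where the accession number is expected, so there is nothing left to extract.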

ab12phylo-cmd ...
    -rf <ref.fasta>  <other_gene_ref.fasta>

-rf means reference file(s); they have to be in the same order as the genes!

Alternatively, use -rd (meaning reference directory), but then the FASTA files have to be named accordingly, like <ref_dir>/ITS1F.fasta or <ref_dir>/<other_gene>.fasta.

ab12phylo-cmd ...
    -rd <ref_dir>

Output directory

Before we skip ahead a bit, -dir specifies the output directory where ab12phylo-cmd will store the results of your run:

ab12phylo-cmd ...
    -dir <results>

Test run

At this stage, it is probably easiest to run a quick analysis on a dummy data set; conveniently ab12phylo-cmd comes with its own test data set. Please run:

ab12phylo-cmd -test

If this is the very first time you run ab12phylo-cmd, it might try installing some non-python dependencies. With -test, AB12PHYLO will read options from an auxiliary (or backup) config file at <ab12phylo_installation_root>/ab12phylo_cmd/config/test_config.yaml and run on these. The test run is set to --verbose (meaning a lot of small, often unimportant checkpoints will be printed to your terminal) and will run --no_remote BLAST search.

Config files

As just mentioned, the commandline AB12PHYLO uses files with detailed, static configurations. They somewhat shadow the commandline options, and are indeed intended to make your life a bit easier, and your commandline calls shorter. It is important to note that the commandline supersedes the config, meaning that passing e.g. --max_threads 4 will override the line:

max_threads: 64  # CPU limit
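The precedence rule can be pictured as a simple dictionary merge. The sketch below is a simplified model and not how ab12phylo-cmd is actually implemented: the function name is made up, and plain dictionaries stand in for the parsed YAML file and the parsed commandline.

```python
def effective_settings(config, cli):
    """Config values act as defaults; options set on the commandline win."""
    merged = dict(config)
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


config = {"max_threads": 64, "regex_rev": "(Rev)|(rev)"}  # as read from the .yaml
cli = {"max_threads": 4, "regex_rev": None}               # only --max_threads 4 was passed

print(effective_settings(config, cli))
# {'max_threads': 4, 'regex_rev': '(Rev)|(rev)'}
```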

Every time ab12phylo-cmd saves the results from a run, it also creates a used_config.yaml. With this file, and the first few lines of your ab12phylo.log (where you can find the call arguments, and the random seed for this run if you did not set one), you will be able to exactly re-create the run and its results.

You can find the default config here on GitHub or at /ab12phylo_cmd/config/config.yaml inside your installation directory. To find it, run:

$ pip show ab12phylo
Name: ab12phylo
...
Location: /home/<user>/anaconda3/envs/ab1/lib/python3.6/site-packages

More detailed settings

ab12phylo-cmd has a lot of defaults, but still allows fine-grained access:

ab12phylo-cmd -rf <ref.fasta> \
    -db <your_own> \
    -dbpath <your_dir> \
    -abiset <whitelist> \
    -r3 <"rx1" "rx2" "rx3"> \
    -algo <mafft-clustalo-muscle-tcoffee> \
    -gbl balanced \
    -local \
    -i \
    -p1

default: AB12PHYLO will search for .ab1 and .csv files in or below the current working directory (-abi, -csv)
default: use the ./results subdirectory (-dir)
default: consider all genes identified in the ABI trace files part of the analysis (-g)
default: use the GTR+Γ evolutionary model of DNA substitution (-evomodel)
use <your_own> BLAST+ database; in <your_dir>
only trace files listed in the <whitelist> will be read
plate number, gene name and well will be parsed from the .ab1 filename using these three RegEx
default: all other RegEx will be read from the config.yaml
-algo will generate the MSA(s): mafft, clustalo, muscle or t_coffee
-gbl sets Gblocks MSA trimming mode: skip, relaxed, balanced, default or strict
-local skips online BLAST for sequences not found in the local BLAST+ db; see the BLAST API section below for why
-i or --info shows some more run details in the console
-p1 run only part one, up until BLAST; also generate an HTML visualization of the MSA for finding funny sequences

ab12phylo-cmd -p2 \
    -bst 1000 \
    -st [32,16]  \
    -s 4 \
    -v 
    
ab12phylo-cmd -p2 \
    --ml_tool iqtree2 \
    --findmodel \
    --ultrafast \
    -bst 1000 \
    -st [32,16]  \
    -nt 8 \
    -s 4 \
    -v \
    -out_fmt pdf png svg

These invocations do essentially the same thing, so it does not make much sense to run both.
The first one:

-p2 run part two of ab12phylo-cmd, starting with RAxML-NG or IQ-Tree 2
default: build a tree with RAxML-NG instead of IQ-Tree 2
-st start ML tree searches from 32 random and 16 parsimony-based starting trees
-s fixes the random --seed to 4 for reproducibility
-v or --verbose shows all logged events in the console

Alternatively:

--ml_tool now defines that IQ-Tree 2 will be used to re-construct a tree
--findmodel instead of using a pre-selected one, use IQ-Tree to infer the best-fit evolutionary model
--ultrafast IQ-Tree offers another mode of Bootstrapping next to the usual, non-parametric one. It will then always run 1000 or more iterations!
-st IQ-Tree will start 32+16=48 tree searches from random start trees
-nt will limit the number of CPUs/hardware threads used to 8
-out_fmt save the resulting tree in each of the three listed file formats

Run modes

As you might have seen above, you can run ab12phylo-cmd in separate parts:

-p1 or --prepare runs the first stage

Read in all the input files in the specified directory, map them to their original IDs if wellsplate mappings were provided, and run the sequencing quality control / ABI trace trimming. Also add the reference data, and save the compiled dataset in a .FASTA file, in a separate directory for each gene in the analysis. Write a .CSV table containing this first-stage metadata.
Then, build a Multiple Sequence Alignment from each of the gene-wise .FASTA files. If the selected algorithm (either MAFFT, Clustal Omega, MUSCLE or T-Coffee) is installed, this computation will be run on your machine; otherwise, the EMBL service for MSA construction is used.
The MSAs are trimmed to conserved sites using Gblocks, with one of five different --gblocks settings for ab12phylo-cmd: skip, relaxed, balanced, default or strict. After trimming, the single-gene MSAs are concatenated into one multi-gene MSA.
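The concatenation step can be pictured as follows. This is a simplified sketch with made-up sequences, not the package's implementation. It assumes each trimmed single-gene MSA is available as a dictionary of sample ID to aligned sequence, and it keeps only samples present for every gene, as described under Genes and References below.

```python
def concatenate(msas):
    """Join per-gene alignments into one multi-gene alignment.

    msas: list of {sample_id: aligned_sequence} dicts, one per gene, in gene order.
    """
    shared = set.intersection(*(set(msa) for msa in msas))
    return {sample: "".join(msa[sample] for msa in msas) for sample in sorted(shared)}


# Hypothetical trimmed alignments for two genes
its1f = {"CS084": "ACG-T", "CS322": "ACGTT"}
opa10 = {"CS084": "GG-A", "CS322": "GGCA", "CS313": "GGCA"}

print(concatenate([its1f, opa10]))
# {'CS084': 'ACG-TGG-A', 'CS322': 'ACGTTGGCA'}
```

CS313 is missing from the first gene and therefore does not make it into the concatenated alignment.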

In a separate, de-synced thread, a BLAST search can be run for the sequence data from the first gene passed after --genes. However, downloading one of the large NCBI databases (see -db) or waiting until the public BLAST API has finished responding to your -remote BLAST queries might take a long time. To ameliorate this, there is:

-px or --add_xml reads in BLAST results

If you have run -p1 --no_BLAST and uploaded a <gene>/<gene>.fasta to web BLAST, you can parse your XML results via:

-px -xml part1.xml part2.xml

Strictly speaking, -xml = --BLAST_xml is defined as a BLAST+ mode. If you are repeating a run and re-using XMLs from earlier, you can "skip" BLAST by passing --BLAST_xml result.xml, but --no_BLAST --BLAST_xml result.xml will not work, as you can only set a single BLAST mode.

-p2 or --finish constructs a tree

In this central part of the pipeline, a maximum-likelihood tree is inferred from the concatenated MSA using either RAxML-NG or IQ-Tree.

For RAxML-NG (--ml_tool raxml-ng), a user-defined number of ML tree searches starting from random and/or parsimony starting trees is split among several threads. Once this is finished, the maximum-likelihood tree is identified and confidence in this tree is estimated using non-parametric bootstrapping. This is again split among several instances of RAxML-NG (coarse-grained parallelization). After creating bootstrap trees, support values are mapped onto the ML tree.

For IQ-Tree 2 (--ml_tool iqtree2), there is an optional but distinct first step of inferring the best-fit evolutionary model, triggered by passing --findmodel. By contrast, the evolutionary model has to be user-defined for RAxML-NG. Furthermore, IQ-Tree can use ultrafast (--ultrafast) instead of standard non-parametric bootstrapping. However, the number of iterations --bootstrap will be set to 1000 if a lower number was passed.
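The effect of --ultrafast on the number of bootstrap iterations boils down to a lower bound of 1000. As a tiny sketch of that rule (the function name is made up, and this is not code from the package):

```python
def effective_bootstrap(bootstrap, ultrafast):
    """Ultrafast bootstrapping never runs fewer than 1000 iterations."""
    return max(bootstrap, 1000) if ultrafast else bootstrap


print(effective_bootstrap(100, ultrafast=True))    # 1000
print(effective_bootstrap(2000, ultrafast=True))   # 2000
print(effective_bootstrap(100, ultrafast=False))   # 100
```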

Both RAxML-NG and IQ-Tree 2 will produce two ML trees, with TBE and FBP support values.
When ML inference is finished, ab12phylo-cmd will proceed to plotting.

-viz or --visualize plots the tree

This stage will be automatically run for --finish, but can be called separately and is intended for tree modifications. Please have a look at the 'VISUALIZATION' section of the in-line help (ab12phylo-cmd -h), and the matching section below.

-view or --view enables tree searching

Besides trees in Newick format and graphics, an important part of the ab12phylo-cmd results is the result.html page. Based on a CGI server backend, this page allows you to select taxa in the tree and calculate diversity statistics. The -view flag re-activates the CGI server that provides this functionality; it is also implicitly part of -p2 and -viz.

BLAST

If none of the smaller BLAST+ databases are sufficient for your search, you are better off with a web BLAST. Collate your data by running -p1 --no_BLAST, then upload the <gene>/<gene>.fasta for the gene you wish to use for species annotation. Pass the result via --BLAST_xml. You can pass more than one file; AB12PHYLO will use the hit with the highest percent identity with the sample, or the most frequent one, or annotate a sample with multiple equally well-fitting species.
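One possible reading of this annotation rule is sketched below. The order of the tie-breakers is an assumption (best percent identity first, then frequency among the best hits, then all remaining candidates). The function name, the input format and the numbers are made up for illustration.

```python
from collections import Counter


def annotate(hits):
    """Pick a species annotation from (species, percent_identity) hits of one sample."""
    if not hits:
        return None
    best = max(identity for _, identity in hits)
    top = [species for species, identity in hits if identity == best]
    counts = Counter(top)
    most = max(counts.values())
    winners = sorted(species for species, n in counts.items() if n == most)
    return " / ".join(winners)  # several names if they fit equally well


print(annotate([("Alternaria arborescens", 99.5), ("Stemphylium vesicarium", 98.0)]))
print(annotate([("Alternaria arborescens", 99.0), ("Alternaria doliconidium", 99.0)]))
```

The first call returns a single species, the second one both names because neither hit is better or more frequent than the other.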

BLAST API

AB12PHYLO is perfectly capable of running online BLAST searches without a BLAST+ installation. However, this is not a suitable main BLAST strategy as BLAST API queries are de-prioritised after just a few attempts. Accordingly, if several runs are attempted on the same data set, passing the --no_remote flag will use data from an earlier run by default. You can also leave out species annotation by skipping BLAST entirely with --no_BLAST.

BLAST+ Database

Sometimes, this pipeline might run headless on a server. To keep it from running head-first into a firewall when it attempts to update its BLAST+ database via FTP, please pre-supply an unzipped, ready-to-use BLAST+ database via -dbpath (and -db name). Find databases on the NCBI website, but web BLAST and --BLAST_xml might be both easier and faster.

Results + Motif Search

You can import results from ab12phylo-cmd into the graphical ab12phylo. It is faster, easier and more capable when it comes to viewing and modifying trees: Open a new window/project, press Ctrl+i or Import from commandline version in the upper right hamburger menu and select the folder with your commandline results.

Once ML tree inference, bootstrapping and BLAST have finished from the commandline, the pipeline will display a result.html in your web browser. This page contains a form that allows Motif search across node attributes and calculates diversity metrics for the matching subset/subtree. Entering a space ' ' should match all samples, and entering multiple motifs separated by commas , will select all leaves that match at least one motif.
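The matching behaviour can be approximated like this. It is a sketch under assumptions and not the CGI script itself: motifs are treated as plain substrings, and the leaf labels are hypothetical strings in which sample ID and species are separated by a space, which is why a single space matches all of them.

```python
def match_leaves(query, leaves):
    """Return all leaves that contain at least one of the comma-separated motifs."""
    motifs = [motif for motif in query.split(",") if motif]
    return [leaf for leaf in leaves if any(motif in leaf for motif in motifs)]


# Hypothetical leaf labels
leaves = [
    "CS084 Alternaria arborescens",
    "CS322 Stemphylium vesicarium",
    "CS313 Alternaria doliconidium",
]

print(match_leaves(" ", leaves))                  # all three leaves
print(match_leaves("arborescens,CS322", leaves))  # first and second leaf
```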

To select a subtree, enter a motif that matches several leaves or at least two separate specific leaf motifs, with one of them as far left as possible. Entering a single, specific motif for a subtree search can be used to find the index of a node for tree modifications like rooting.

Once you hit MATCH or SUBTREE, the CGI script in the package will highlight the selection in the tree, as well as compute and display diversity statistics for this 'population'. It will also write and link a file with all selected sample IDs or the paths of the original .ab1 files in the /query subdirectory, which can be used as a whitelist file for a subsequent subset analysis run. Pass it via -sampleset if using sample IDs, or via -abiset if using file paths. File paths are recommended to reliably exclude outlier versions with identical base ID.

If results are moved or sent, motif search will be possible by using ab12phylo-view or starting a CGI server in the directory via python3 -m http.server --cgi <port>. Find the port on the intro tab.

Plotting

This is easier in the graphical ab12phylo, and you can import commandline results to the graphical version. If you would like to stick to the commandline:

ab12phylo-visualize + ab12phylo-view

ab12phylo-visualize will re-plot phylogenies and render a new result.html. You can use this to switch support values or plot an MSA visualization with -msa_viz (which will take some rendering time for wider alignments). ab12phylo-view shows results of a previous run in a browser, with motif search enabled. Both commands accept a path to the results directory or default to ., and are equivalent to appending to the original ab12phylo-cmd call.

ab12phylo-view <result_folder>
# or
cd <result_folder>
ab12phylo-view
# or 
ab12phylo-cmd -c <my-config.yaml> -bst 1000 (...) -view

Support Values

For visualization, you can pick either Felsenstein Bootstrap Proportions FBP or Transfer Bootstrap Expectation TBE support values with -metric. Newick tree files for both types are generated anyway, so switching the support value metric only requires re-running ab12phylo-visualize.

MSA visualization

AB12PHYLO can plot an additional rectangular tree with an MSA visualization next to it by passing -msa_viz. This is single-threaded and takes some extra time for larger MSAs. If you run -p1, you can visually inspect the MView HTML of the alignment for outliers already without having to wait for the entire pipeline.

Tree modifications

You can -drop_nodes, -replace_nodes or subtrees with a placeholder, or -root the tree with an outgroup. These options accept node indices you can get from a motif search. Here is an example:

screenshot of the results page showing a small tree

To root the tree above at the sample labelled as Alternaria doliconidium, enter a query that matches the label and press SUBTREE. This will return the index of the node in the tree, here 18.

screenshot of the results page showing a small tree

Then re-running ab12phylo-cmd with -viz -root 18 will give us:

screenshot of the results page showing a small tree

Also see the VISUALIZATION section of --help.

Troubleshooting / Odds and Ends

Remote runs

tmux and --headless are highly recommended for remote runs, as well as pre-supplying a BLAST+ db, adapting or replacing the YAML config file and setting a fixed seed for reproducibility.

Genes and References

If data for several genes is supplied, AB12PHYLO will restrict the analysis to samples that are present for all genes. If this causes a lot of samples to be dropped, it might be worth leaving out a gene entirely by setting -g = --genes manually. If you somehow get the message "No samples shared across all genes", there might be some trace files that seemingly belong to another gene.
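To find out which gene is responsible for most of the dropped samples, a quick check along these lines can help. The sketch assumes you have collected the sample IDs per gene yourself, for example from the gene-wise .FASTA files. The function name and the sample sets are made up.

```python
def dropout_report(samples_by_gene):
    """Return the shared samples and, per gene, the samples that gene is missing."""
    all_samples = set.union(*samples_by_gene.values())
    shared = set.intersection(*samples_by_gene.values())
    missing = {
        gene: sorted(all_samples - present)
        for gene, present in samples_by_gene.items()
    }
    return sorted(shared), missing


samples_by_gene = {
    "ITS1F": {"CS084", "CS322", "CS313"},
    "OPA10": {"CS084", "CS322"},
}
shared, missing = dropout_report(samples_by_gene)
print(shared)   # ['CS084', 'CS322']
print(missing)  # {'ITS1F': [], 'OPA10': ['CS313']}
```

A gene with a long list of missing samples is a candidate for leaving out via -g.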

Passing references is closely inter-linked: If a directory of reference files is supplied via -rd = --ref_dir, the package will try to match the .FASTA files inside to genes by their filename. For example, ITS1F.fasta will be matched to trace data from the ITS1F gene. Alternatively, an ordered list of reference files can be passed via -rf = --ref, and file names will be ignored.

If you are feeling neat and precise and set both the genes and individual references, be careful: The pipeline will then deliberately match references and genes by order. Therefore, this will make a mess:

-g ITS1F OPA10 -rf ../opa10.fasta ITS1F.phy
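Why this goes wrong is easy to see when the two matching strategies are written out. The sketch below is an illustration and not the package's code. Matching by filename is shown case-insensitively, which is an assumption, and both function and variable names are made up.

```python
from pathlib import Path

genes = ["ITS1F", "OPA10"]
refs = ["../opa10.fasta", "ITS1F.phy"]

# -rf: references are paired with genes purely by position
by_order = dict(zip(genes, refs))
print(by_order)  # {'ITS1F': '../opa10.fasta', 'OPA10': 'ITS1F.phy'}


def by_filename(genes, files):
    """-rd style: pair each gene with the file whose name (without extension) matches."""
    stems = {Path(name).stem.lower(): name for name in files}
    return {gene: stems.get(gene.lower()) for gene in genes}


print(by_filename(genes, refs))  # {'ITS1F': 'ITS1F.phy', 'OPA10': '../opa10.fasta'}
```

With -rf, the OPA10 references end up attached to ITS1F and vice versa, so either fix the order or use -rd with properly named files.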

Log File

If you are having trouble, look at the log file! It will be in your results directory and named like ab12phylo[|-p1|-p2][-view|-viz]?.log. Alternatively, you can set the --verbose flag and get the same information in real-time to your commandline. Your choice! For individual support, write an email to [email protected] or raise an issue on GitHub.
