Skip to content
This repository was archived by the owner on Oct 2, 2020. It is now read-only.

Tutorial DRAFT2

Andrey Kartashov edited this page Dec 6, 2015 · 9 revisions

A simple three steps workflow

Lets learn CWL based on a simple three steps workflow. The workflow will create a coverage map (e.g. for ChIP/RNA-Seq) based on results of alignment. A bam file produced by an aligner (bowtie/STAR/bwa/etc) will be used as an input and a BigWig file will be produced as an output. The BigWig file will show coverage by reads or fragments for the particular genome (mm9/mm10/hg19/xenTro7/etc). First, I will use the bedtools genomecov subcommand to create a coverage map in bedGraph format then I have to sort the resulting file and finally to convert to bigwig using bedGraphToBigWig from the UCSC userApps package. I would like to have a workflow that I can use for ChIP-Seq and RNA-Seq data.

It is good to wrap each tool in separate .cwl file to avoid possible ID conflicts

I'm going to create a separate wrapper for each individual tool.

bedtools

I know a few things about bedtools - base command and subcommand can I write a CWL? Yes! I'm ready to write a few lines of CWL. I have to declare class CommandLineTool and specify required objects: baseCommand, inputs, outputs. Lets create a first CWL file bedtools-genomecov.cwl. Don't forget to make it executable (chmod +x bedtools-genomecov.cwl). Below is the CWL code:

#!/usr/bin/env cwl-runner
class: CommandLineTool
description: |
  Tool:    bedtools genomecov (aka genomeCoverageBed)
  Sources: https://github.com/arq5x/bedtools2
  Summary: Compute the coverage of a feature file among a genome.
  Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>
inputs: []
outputs: []
baseCommand: ["bedtools", "genomecov"]

Now I can execute this tool directly or run cwltool bedtools-genomecov.cwl. It will show the help from bedtools genomecov.

Tool:    bedtools genomecov (aka genomeCoverageBed)
Version: v2.25.0
Summary: Compute the coverage of a feature file among a genome.

Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>

Options: 
        -ibam           The input file is in BAM format.
                        Note: BAM _must_ be sorted by position
        ...
        -bg		Report depth in BedGraph format. For details, see:
			genome.ucsc.edu/goldenPath/help/bedgraph.html
        ...

To create a coverage file I will use bedtools genomecov with following args:

  • -g GenomeFile - chromosome sizes for the genome (required)
  • -ratio normalization coefficient (e.g 1/(millions of reads/fragments)) (optional)
  • -split To show splicing for RNA-Seq data (A small note: ChIP-seq data do not have splices but may have a gap in a read if bowtie2 for instance was used for alignment, so this parameter should not affect ChIP-seq data too much I'm going to use it always enabled, but of course we can make it optional)
  • -bg To output data in BedGraph format also always on
  • -fs To use fragment instead of reads for coverage (optional)
  • -strand, -pc - other optional args

These args can be used to describe different behaviors for different data sets that I need to process (e.g RNA/ChIP single/pair stranded/unstranded).

Now I can add all the inputs that I'm going to use:

#!/usr/bin/env cwl-runner

class: CommandLineTool

description: |
  Tool:    bedtools genomecov (aka genomeCoverageBed)
  Sources: https://github.com/arq5x/bedtools2
  Summary: Compute the coverage of a feature file among a genome.
  Usage: bedtools genomecov [OPTIONS] -i <bed/gff/vcf> -g <genome>

inputs:
  - id: "#input"
    type: File
    description: |
      The input file can be in BAM format
          (Note: BAM _must_ be sorted by position)
      or <bed/gff/vcf>
    inputBinding:
      position: 1
      prefix: "-ibam"

  - id: "#genomeFile"
    type: File
    description: 'Input genome file.'
    inputBinding:
      position: 2
      prefix: "-g"

  - id: "#dept"
    type: boolean
    default: true
    inputBinding:
      position: 4
      prefix: -bg

  - id: "#scale"
    type: ["null",float ]
    description: |
      Scale the coverage by a constant factor.
      Each coverage value is multiplied by this factor before being reported.
      Useful for normalizing coverage by, e.g., reads per million (RPM).
      - Default is 1.0; i.e., unscaled.
      - (FLOAT)
    inputBinding:
      position: 4
      prefix: -scale

  - id: "#split"
    type: ["null",boolean]
    description: |
      reat "split" BAM or BED12 entries as distinct BED intervals.
      when computing coverage.
      For BAM files, this uses the CIGAR "N" and "D" operations
      to infer the blocks for computing coverage.
      For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds
      fields (i.e., columns 10,11,12).
    inputBinding:
      position: 4
      prefix: "-split"

  - id: "#strand"
    type: ["null", string]
    description: |
      Calculate coverage of intervals from a specific strand.
      With BED files, requires at least 6 columns (strand is column 6).
      - (STRING): can be + or -
    inputBinding:
      position: 4
      prefix: "-strand"

  - id: "#pairchip"
    type: ["null", boolean]
    description: "pair-end chip seq experiment"
    inputBinding:
      position: 4
      prefix: "-pc"

  - id: "#fragmentsize"
    type: ["null", int]
    description: "fixed fragment size"
    inputBinding:
      position: 4
      prefix: "-fs"

outputs: []
baseCommand: ["bedtools", "genomecov"]

To describe the bedtools coverage arguments I'll use a YAML list of command input parameter objects in inputs. For each parameter I have to provide at least an id and type. The id has to be unique. To make an argument optional I have to add "null" type to the type's list, for instance, type: ["null", int] optional argument of type int. inputBinding makes the parameter appear in the final command line. To add a prefix before parameter to form an argument use prefix. To bind to a specific position within command line use position. If you need a parameter for the preprocessing in CWL and not for the final command line do not use the inputBinding object. For the bedtools genomecov actual order of arguments is not important so I don't have to use position.

I'm ready to run my first job. Job is a json file with all required parameters for cwl:

{
    "input": {
        "class": "File",
        "path": "./test-files/rna.SRR948778.bam"
    },
    "genomeFile": {
        "class": "File",
        "path": "./test-files/mm10-chrNameLength.txt"
    },
    "scale":1,
    "split": true,
    "dept":true
}

To run the job run cwltool bedtools-genomecov.cwl bedtools-genomecov-job.json. Cwltool checks that all required parameters and types, as well as files exist and, if everything is consistent, cwltool will execute the command. And I am getting the bedGraph output to stdout.

Although I can run the tool as described above, I prefer to have cwl that can use all the arguments of original tool. From the bedtools description I know that for genomecov I can use either -d, -bg or -bga but not together. It also requires -i if the input file is in bed/gff/vcf file formats or -ibam if it is bam format and I need have to have index file together with bam.

To achieve proper processing of -d, -bg or -bga I can use enum type.

  - id: "#dept"
    type:
      name: "JustDepts"
      type: enum
      symbols: ["-bg","-bga","-d"]
    inputBinding:
      position: 4

And I have to replace the "dept": true with "dept":"-bg" in the job file.

I assume that gff file has .gff extension and bam file has .bam although it might be different in other environments. In CWL draft2 it's probably the only way to check the file type. So, I'm going to change the parameter 'input' (id:"#input") to check file type and add either -i or -ibam prefix to the provided file. To do that I use JavaScript and nodejs engine. In the inputBinding in valueFrom object I add engine: node-engine.cwl and a script. Script executes a code as it is a function we have to return a value. To access inputs IDes in JavaScript, CWL provides $job object with keys as IDes, for example: $job['input'] holds object with id "#input". To access file path I can use $job['input'].path then check the extension. My function returns an array [prefix,file_path] (example: ['-ibam','my.bam']) so in the final command line they appear in this order.

var prefix = ((/.*\.bam$/i).test($job['input'].path))?'-ibam':'-i'; //either -ibam or -i
return [prefix,$job['input'].path]; //use the original path and add the prefix

The final CWL block:

  - id: "#input"
    type: File
    description: |
      The input file can be in BAM format
          (Note: BAM _must_ be sorted by position)
      or <bed/gff/vcf>
    inputBinding:
      position: 1
      valueFrom:
        engine: node-engine.cwl
        script: |
          {
            var prefix = ((/.*\.bam$/i).test($job['input'].path))?'-ibam':'-i';
            return [prefix,$job['input'].path];
          }

I can also assume that if it is a bam file the corresponding index is in the same directory and after .bam it has .bai, example: "my.bam.bai". In terms of CWL it is a secondary file and it will go into secondaryFiles object, but why do we need it? The secondary files do not form the command line, but it is the instruction to CWL platform to check their existence, map them into a docker image or copy them to a location in a cloud. So it is just a best practice :). I add almost the same code to the secondaryFiles object except I return empty array [] if it is not a bam file. And return an object of type File if it is .bam {"path":"my.bam.bai", "class": "File"} now it is not a string.

  - id: "#input"
    type: File
    description: |
      The input file can be in BAM format
          (Note: BAM _must_ be sorted by position)
      or <bed/gff/vcf>
    inputBinding:
      position: 1
      secondaryFiles:
        - engine: node-engine.cwl
          script: |
           {
            if ((/.*\.bam$/i).test($job['input'].path))
               return {"path": $job['input'].path+".bai", "class": "File"};
            return [];
           }
      valueFrom:
        engine: node-engine.cwl
        script: |
          {
            var prefix = ((/.*\.bam$/i).test($job['input'].path))?'-ibam':'-i';
            return [prefix,$job['input'].path];
          }

And the final thing is to redirect stdout to a file. It is good to provide future file name to save stdout. To do that we need a new parameter in inputs it holds file name type string. I would reference this name in outputs and stdout objects of CWL. To reference input I use special case engine cwl:JsonPointer and in the script string I provide /job/#id in my case it is /job/genomecoverage. I remember that I have to add this parameter into the job file "genomecoverageout": "./rna.SRR948778.bedGraph"

  - id: "#genomecoverageout"
    type: string

outputs:
  - id: "#genomecoverage"
    type: File
    description: "The file containing the genome coverage"
    outputBinding:
      glob: 
        engine: cwl:JsonPointer
        script: /job/genomecoverageout
stdout: 
  engine: cwl:JsonPointer
  script: /job/genomecoverageout

The complete version of CWL description is here: bedtools-genomecov.cwl

linux sort

To pass my data to the bedGraphToBigWig tool. I have to sort it by first column and then by second column within same first key. To do that from command line I use sort -k1,1 -k2,2n filename. Argument -k can be repeated as many times as needed. For arguments that can be repeated I use CWL type array:

  - id: "#key"
    type: 
      type: array
      items: string 
      inputBinding:
        prefix: "-k"
    inputBinding:
      position: 1
    description: |
      -k, --key=POS1[,POS2]
      start a key at POS1, end it at POS2 (origin 1)

There are two inputBinding directives one is within type array which makes each element of the array be prefixed by prefix and the other is for the id #key which positions the set.

However, if I'd like to add more logic for each array element I do not add both inputBinding and going to use this array to form the command line in arguments.

arguments:
  - valueFrom:
      engine: node-engine.cwl
      script: $job['key'].map(function(i) {return "-k"+i;})

For example, I can add regular expression to check each parameter /\d+,\d+n?/g. The complete version of CWL description: linux-sort.cwl

bedGraphToBigWig

To make a wrapper for this one I can use what I've learned above - no new tricks. The complete version of CWL description: ucsc-bedGraphToBigWig.cwl

Workflow

Now I have all three tools wrapped in CWL. I know tools input parameters and output data. I'm ready to create my workflow. Workflow has it's own input and output lists it might be confused, but actually not, remember that from the workflow you need a final result but not an intermediate. In my bam-to-bigwig workflow, I need to provide a bam file, a normalization coefficient, split for RNA-seq, fragment size for single-end ChIP-Seq, force to use actual fragment for paired-end ChIP parameters and get a bigWig file as the output.

#!/usr/bin/env cwl-runner

class: Workflow
inputs:
  - id: "#input"
    type: File

  - id: "#genomeFile"
    type: File

  - id: "#scale"
    type: float

  - id: "#pairchip"
    type: ["null",boolean]

  - id: "#fragmentsize"
    type: ["null",int]

  - id: "#strand"
    type: ["null",string]

  - id: "#bigWig"
    type: string

outputs:
  - id: "#outfile"
    type: File
    source: "#bigwig.bigWigOut"
Clone this wiki locally