Nextflow

Scalable, Sharable and Reproducible Computational Workflows across Clouds and Clusters

Rad Suchecki (CSIRO)

The challenge

  • Large analysis workflows are fragile ecosystems of software tools, scripts and dependencies.

  • This complexity commonly makes these workflows not only irreproducible but sometimes impossible even to re-run outside their original development environment.

  • Even small workflows are affected

Let others and your future self(!)

  • reliably re-run your analyses
  • trace back origins of results

Re-running pipelines

  • new data (e.g. additional samples)
  • updated software
  • different compute environment (cloud?)
  • errors found
  • new ideas
  • and any combination of the above

Push-button workflow wish-list


  • version controlled
  • container-backed
  • seamless execution across different environments (if computationally feasible)
    • laptop/server/cluster/cloud
  • sharable
    • minimal effort required for someone else to use it

Nextflow


  • Reactive workflow framework
  • Domain specific programming language
  • Aimed at bioinformaticians familiar with programming


  • Designed for seamless scalability of existing tools and scripts
    • Implicitly parallelized, asynchronous data streams
  • Separation of pipeline logic from the definitions of
    • software environment (on $PATH, modules, binaries, conda, containers)
    • execution environment (laptop, server, cluster, cloud)

https://www.nature.com/articles/nbt.3820/

Credit: Evan Floden

Nextflow building blocks

Processes (1-to-many tasks)

  • safe and lock-free parallelization
  • executed in separate work directories
  • easy clean-up, no issue of partial results following an error

Channels

  • facilitate data flow between processes by linking their outputs/inputs
  • a suite of operators applied to channels to shape the data flow
    • filtering, transforming, forking, combining…

https://www.nature.com/articles/nbt.3820/

Parallelisation

Credit: Evan Floden

  • Independent tasks will run in parallel (Ts & Cs apply)
  • Reduced overallocation of resources

Getting started

Required

  • POSIX compatible system (Linux, Solaris, OS X, etc)
  • Bash 3.2 (or later)
  • Java 8 (or later)

Install

curl -s https://get.nextflow.io | bash

Software you want to run

  • Available on PATH or under bin/
  • via Docker
  • via Singularity
  • via Conda
  • via Modules

Hello world syntax

#!/usr/bin/env nextflow

cheers = Channel.from 'Bonjour', 'Ciao', 'Hello', 'Hola'

process sayHello {
  echo true  //print each task's stdout to the terminal
  input:
    val x from cheers
  script:
    """
    echo '$x world!'
    """
}

Hello world run

nextflow run nextflow-io/hello

N E X T F L O W  ~  version 19.01.0
Pulling nextflow-io/hello ...
 downloaded from https://github.com/nextflow-io/hello.git
Launching `nextflow-io/hello` [thirsty_lorenz] - revision: a9012339ce [master]
[warm up] executor > local
[e2/c8f03c] Submitted process > sayHello (2)
[64/590321] Submitted process > sayHello (1)
[4b/eb8576] Submitted process > sayHello (4)
Bonjour world!
Ciao world!
[50/0c1e69] Submitted process > sayHello (3)
Hello world!
Hola world!

Hello world how?

nextflow info nextflow-io/hello

 project name: nextflow-io/hello
 repository  : https://github.com/nextflow-io/hello
 local path  : /home/rad/.nextflow/assets/nextflow-io/hello
 main script : main.nf
 revisions   : 
 * master (default)
   mybranch
   testing
   v1.1 [t]
   v1.2 [t]

Alternatives

Hello world shared pipelines

  • Run a specific revision (commit SHA hash, branch or tag)
nextflow run nextflow-io/hello -r v1.1
N E X T F L O W  ~  version 19.01.0
Launching `nextflow-io/hello` [nasty_thompson] - revision: baba3959d7 [v1.1]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [3b355db864]
[warm up] executor > local
[a6/1e9862] Submitted process > sayHello (1)
[d3/1e2167] Submitted process > sayHello (2)
[d1/de9861] Submitted process > sayHello (3)
Bojour world! (version 1.1)
Ciao world! (version 1.1)
[e0/5fd598] Submitted process > sayHello (4)
Hola world! (version 1.1)
Hello world! (version 1.1)

Command line syntax basics

  • Single dash (-) for Nextflow options, e.g. -resume

-resume prevents re-running of tasks whose relevant inputs and scripts are unchanged

  • Double dash (--) for pipeline params (defined by you), e.g. --input sample.fastq.gz

The filename sample.fastq.gz will then be available to NF under params.input.
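
How -resume decides what to skip can be pictured with a toy model. The sketch below is a deliberate simplification and not Nextflow's actual algorithm: it derives a key from the script text plus the content of one input file, and treats a matching key as "already done". Nextflow's real task hash covers more than this, and by default it relies on file metadata rather than file content. The function name task_key and the file sample.fa are hypothetical.

```shell
# Toy cache key: checksum and byte count of (script text + input file content).
# Illustration only; this is NOT how Nextflow computes its task hash.
task_key() {
  script_text="$1"
  shift
  { printf '%s\n' "$script_text"; cat "$@"; } | cksum | awk '{ print $1 "-" $2 }'
}

tmp="$(mktemp -d)"
printf 'ACGT\n' > "$tmp/sample.fa"

key_first=$(task_key 'wc -c sample.fa' "$tmp/sample.fa")
key_rerun=$(task_key 'wc -c sample.fa' "$tmp/sample.fa")    # nothing changed

printf 'ACGTT\n' > "$tmp/sample.fa"                          # input modified
key_changed=$(task_key 'wc -c sample.fa' "$tmp/sample.fa")

if [ "$key_first" = "$key_rerun" ]; then
  echo "same script and input: previous result could be reused"
fi
if [ "$key_first" != "$key_changed" ]; then
  echo "input changed: task would have to run again"
fi
```

The same reasoning explains why editing a process script, or swapping an input file, causes only that task and the tasks downstream of it to be re-executed.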

Flowchart

The logic (and input data) of this example workflow is adapted from the EMBL-ABR Snakemake webinar by Nathan Watson-Haigh

Example workflow

#!/usr/bin/env nextflow

//Build link to reference
referenceLink = params.ref.base_url + params.ref.chr + ".fsa.zip"

//Take accessions defined in nextflow.config.
//Use --take N to process first N accessions or --take all to process all
accessionsChannel = Channel.from(params.accessions).take( params.take == 'all' ? -1 : params.take )

//fetch adapters file - either local or remote
adaptersChannel = Channel.fromPath(params.adapters)

process download_chromosome {
  tag { params.ref.chr }

  //Prevent re-downloading of large files
  storeDir { "${params.outdir}/downloaded" }  //use with care, caching will not work as normal so changes to input may not take effect
  scratch false //must be false otherwise storeDir ignored

  input:
    val referenceLink

  output:
    file('*') into references

  script:
  """
  wget ${referenceLink}
  """
}

process bgzip_chromosome {
  cpus 2 //consider defining in conf/requirements.config based on process name or label
  tag { ref }

  input:
    file ref from references

  output:
    file('*') into chromosomesChannel

  script:
  """
  unzip -p ${ref} \
    | bgzip --threads ${task.cpus} \
    > ${ref}.gz
  """
}

process bgzip_chromosome_subregion {
  input:
    file chr from chromosomesChannel

  output:
    file('subregion') into subregionsChannel

  script:
  """
  samtools faidx ${chr} ${params.ref.chr}:${params.ref.start}-${params.ref.end} \
    | bgzip --threads ${task.cpus} \
    > subregion
  """
}

process extract_reads {
  tag { accession }
  storeDir { "${params.outdir}/downloaded_reads" }  //use with care, caching will not work as normal so changes to input may not take effect

  input:
    val accession from accessionsChannel
    //e.g. ACBarrie

  output:
    set val(accession), file('*.fastq.gz') into (extractedReadsChannelA, extractedReadsChannelB)
    //e.g. ACBarrie, [ACBarrie_R1.fastq.gz, ACBarrie_R2.fastq.gz]

  script:
  """
  samtools view -hu "${params.bam.base_url}/${params.bam.chr}/${accession}.realigned.bam" \
    ${params.bam.chr}:${params.bam.start}-${params.bam.end} \
  | samtools collate -uO - \
  | samtools fastq -F 0x900 -1 ${accession}_R1.fastq.gz -2 ${accession}_R2.fastq.gz \
    -s /dev/null -0 /dev/null - \
  && zcat ${accession}_R1.fastq.gz | head | awk 'END{exit(NR<4)}' \
  && zcat ${accession}_R2.fastq.gz | head | awk 'END{exit(NR<4)}'
  """
}

process fastqc_raw {
  tag { accession }

  input:
    set val(accession), file('*') from extractedReadsChannelA

  output:
    file('*') into fastqcRawResultsChannel

  script:
  """
  fastqc  --quiet --threads ${task.cpus} *
  """
}

process multiqc_raw {
  input:
    file('*') from fastqcRawResultsChannel.collect()

  output:
    file('*') into multiqcRawResultsChannel

  script:
  """
  multiqc .
  """
}

process trimmomatic_pe {
  echo true
  tag { accession }

  input:
    set file(adapters), val(accession), file('*') from adaptersChannel.combine(extractedReadsChannelB)

  output:
    set val(accession), file('*.paired.fastq.gz') into (trimmedReadsChannelA, trimmedReadsChannelB)

  script:
  """
  trimmomatic PE \
    *.fastq.gz \
    ${accession}_R1.paired.fastq.gz \
    ${accession}_R1.unpaired.fastq.gz \
    ${accession}_R2.paired.fastq.gz \
    ${accession}_R2.unpaired.fastq.gz \
    ILLUMINACLIP:${adapters}:2:30:10:3:true \
    LEADING:2 \
    TRAILING:2 \
    SLIDINGWINDOW:4:15 \
    MINLEN:36
  """
}

process fastqc_trimmed {
  tag { accession }

  input:
    set val(accession), file('*') from trimmedReadsChannelB

  output:
    file('*') into fastqcTrimmedResultsChannel

  script:
  """
  fastqc --quiet --threads ${task.cpus} *
  """
}

process multiqc_trimmed {
  input:
    file('*') from fastqcTrimmedResultsChannel.collect()

  output:
    file('*') into multiqcTrimmedResultsChannel

  script:
  """
  multiqc .
  """
}

process bwa_index {
  input:
    file(ref) from subregionsChannel

  output:
    set val(ref.name), file("*") into indexChannel //also valid: set val("${ref}"), file("*") into indexChannel

  script:
  """
  bwa index -a bwtsw ${ref}
  """
}


process bwa_mem {
  tag { accession }

  input:
    set val(ref), file('*'), val(accession), file(reads) from indexChannel.combine(trimmedReadsChannelA)

  output:
    file('*.bam') into alignedReadsChannel

  script:
  """
  bwa mem -t ${task.cpus} -R '@RG\\tID:${accession}\\tSM:${accession}' ${ref} ${reads} | samtools view -b > ${accession}.bam
  """
}
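
The last two lines of the extract_reads script use a small shell idiom that is easy to miss: each FASTQ file is decompressed and awk exits non-zero unless at least four lines (one FASTQ record) were seen, which presumably makes the task fail when no reads were extracted. Below is a standalone sketch of the same check, using gzip -dc in place of zcat for portability. The function name has_fastq_record and the two test files are made up for illustration.

```shell
# Succeed only if the gzipped file holds at least one 4-line FASTQ record.
has_fastq_record() {
  gzip -dc "$1" | head -n 4 | awk 'END { exit (NR < 4) }'
}

# Two throwaway files: one with a single read, one empty.
tmp="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/ok.fastq.gz"
printf '' | gzip -c > "$tmp/empty.fastq.gz"

for f in "$tmp/ok.fastq.gz" "$tmp/empty.fastq.gz"; do
  if has_fastq_record "$f"; then
    echo "$(basename "$f"): contains reads"
  else
    echo "$(basename "$f"): no reads"
  fi
done
```

Chaining such a check with && after the extraction command means an empty result is treated as a failure rather than silently passed downstream.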

Example workflow

N E X T F L O W  ~  version 19.01.0
Launching `../main.nf` [kickass_sinoussi] - revision: f79dd55637
[warm up] executor > local
[skipping] Stored process > download_chromosome (chr4A)
[skipping] Stored process > extract_reads (ACBarrie)
[47/c2c9c0] Submitted process > bgzip_chromosome (iwgsc_refseqv1.0_chr4A.fsa.zip)
[22/37d522] Submitted process > fastqc_raw (ACBarrie)
[3d/6ef4c4] Submitted process > trimmomatic_pe (ACBarrie)
[50/ef233d] Submitted process > bgzip_chromosome_subregion
[a3/cc388b] Submitted process > fastqc_trimmed (ACBarrie)
[ca/dc1f95] Submitted process > multiqc_raw
[47/b5335b] Submitted process > bwa_index
[8a/6fd29b] Submitted process > multiqc_trimmed
[42/9808e9] Submitted process > bwa_mem (ACBarrie)

The work directory (1/2)

work
├── 22
│   └── 37d522a1fe1388da08e24e62bbb7d5
├── 3d
│   └── 6ef4c43e33e0d2b48b0edbfd917125
├── 42
│   └── 9808e951494f48a2168ae74de55e68
├── 47
│   ├── b5335b39a7632ddefce1048261c09d
│   └── c2c9c0defe7425db495e84baa41494
├── 50
│   └── ef233dbe547cd45d8c4e09a14b5653
├── 8a
│   └── 6fd29bba8ac619614ec8027b52566e
├── a3
│   └── cc388bb77491624da6653b95ddad21
└── ca
    └── dc1f9530e07c432fe40ad2ea50ab9b

17 directories, 0 files

The work directory (2/2)

work
├── [4.0K]  22
│   └── [4.0K]  37d522a1fe1388da08e24e62bbb7d5
│       ├── [698K]  ACBarrie_R1_fastqc.html
│       ├── [460K]  ACBarrie_R1_fastqc.zip
│       ├── [  92]  ACBarrie_R1.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R1.fastq.gz
│       ├── [703K]  ACBarrie_R2_fastqc.html
│       ├── [472K]  ACBarrie_R2_fastqc.zip
│       ├── [  92]  ACBarrie_R2.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R2.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [2.6K]  .command.run
│       ├── [  46]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 191]  .command.trace
│       └── [   1]  .exitcode
├── [4.0K]  3d
│   └── [4.0K]  6ef4c43e33e0d2b48b0edbfd917125
│       ├── [  92]  ACBarrie_R1.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R1.fastq.gz
│       ├── [154K]  ACBarrie_R1.paired.fastq.gz
│       ├── [1.0K]  ACBarrie_R1.unpaired.fastq.gz
│       ├── [  92]  ACBarrie_R2.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded_reads/ACBarrie_R2.fastq.gz
│       ├── [156K]  ACBarrie_R2.paired.fastq.gz
│       ├── [ 580]  ACBarrie_R2.unpaired.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [ 882]  .command.err
│       ├── [ 882]  .command.log
│       ├── [   0]  .command.out
│       ├── [2.8K]  .command.run
│       ├── [ 290]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 178]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [4.0K]  tmp
│       │   └── [4.0K]  63
│       │       └── [4.0K]  dc95d6f752fa23093302a7915c2813
│       │           └── [  93]  TruSeq3-PE.fa
│       └── [ 137]  TruSeq3-PE.fa -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/3d/6ef4c43e33e0d2b48b0edbfd917125/tmp/63/dc95d6f752fa23093302a7915c2813/TruSeq3-PE.fa
├── [4.0K]  42
│   └── [4.0K]  9808e951494f48a2168ae74de55e68
│       ├── [432K]  ACBarrie.bam
│       ├── [ 113]  ACBarrie_R1.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/3d/6ef4c43e33e0d2b48b0edbfd917125/ACBarrie_R1.paired.fastq.gz
│       ├── [ 113]  ACBarrie_R2.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/3d/6ef4c43e33e0d2b48b0edbfd917125/ACBarrie_R2.paired.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [1.2K]  .command.err
│       ├── [1.2K]  .command.log
│       ├── [   0]  .command.out
│       ├── [3.4K]  .command.run
│       ├── [ 164]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 157]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [  99]  subregion.amb -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/47/b5335b39a7632ddefce1048261c09d/subregion.amb
│       ├── [  99]  subregion.ann -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/47/b5335b39a7632ddefce1048261c09d/subregion.ann
│       ├── [  99]  subregion.bwt -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/47/b5335b39a7632ddefce1048261c09d/subregion.bwt
│       ├── [  99]  subregion.pac -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/47/b5335b39a7632ddefce1048261c09d/subregion.pac
│       └── [  98]  subregion.sa -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/47/b5335b39a7632ddefce1048261c09d/subregion.sa
├── [4.0K]  47
│   ├── [4.0K]  b5335b39a7632ddefce1048261c09d
│   │   ├── [   0]  .command.begin
│   │   ├── [ 480]  .command.err
│   │   ├── [ 480]  .command.log
│   │   ├── [   0]  .command.out
│   │   ├── [2.5K]  .command.run
│   │   ├── [  45]  .command.sh
│   │   ├── [3.6K]  .command.stub
│   │   ├── [ 134]  .command.trace
│   │   ├── [   1]  .exitcode
│   │   ├── [  95]  subregion -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/50/ef233dbe547cd45d8c4e09a14b5653/subregion
│   │   ├── [  33]  subregion.amb
│   │   ├── [  56]  subregion.ann
│   │   ├── [ 57K]  subregion.bwt
│   │   ├── [ 14K]  subregion.pac
│   │   └── [ 28K]  subregion.sa
│   └── [4.0K]  c2c9c0defe7425db495e84baa41494
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [2.5K]  .command.run
│       ├── [ 120]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 187]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [  96]  iwgsc_refseqv1.0_chr4A.fsa.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/results/downloaded/iwgsc_refseqv1.0_chr4A.fsa.zip
│       └── [216M]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz
├── [4.0K]  50
│   └── [4.0K]  ef233dbe547cd45d8c4e09a14b5653
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [2.6K]  .command.run
│       ├── [ 131]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 165]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [ 119]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/47/c2c9c0defe7425db495e84baa41494/iwgsc_refseqv1.0_chr4A.fsa.zip.gz
│       ├── [  24]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz.fai
│       ├── [181K]  iwgsc_refseqv1.0_chr4A.fsa.zip.gz.gzi
│       └── [ 17K]  subregion
├── [4.0K]  8a
│   └── [4.0K]  6fd29bba8ac619614ec8027b52566e
│       ├── [ 116]  ACBarrie_R1.paired_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/a3/cc388bb77491624da6653b95ddad21/ACBarrie_R1.paired_fastqc.html
│       ├── [ 115]  ACBarrie_R1.paired_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/a3/cc388bb77491624da6653b95ddad21/ACBarrie_R1.paired_fastqc.zip
│       ├── [ 116]  ACBarrie_R2.paired_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/a3/cc388bb77491624da6653b95ddad21/ACBarrie_R2.paired_fastqc.html
│       ├── [ 115]  ACBarrie_R2.paired_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/a3/cc388bb77491624da6653b95ddad21/ACBarrie_R2.paired_fastqc.zip
│       ├── [   0]  .command.begin
│       ├── [ 397]  .command.err
│       ├── [ 397]  .command.log
│       ├── [   0]  .command.out
│       ├── [3.1K]  .command.run
│       ├── [  26]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 179]  .command.trace
│       ├── [   1]  .exitcode
│       ├── [4.0K]  multiqc_data
│       │   ├── [119K]  multiqc_data.json
│       │   ├── [ 835]  multiqc_fastqc.txt
│       │   ├── [ 413]  multiqc_general_stats.txt
│       │   ├── [ 12K]  multiqc.log
│       │   └── [ 344]  multiqc_sources.txt
│       └── [1.1M]  multiqc_report.html
├── [4.0K]  a3
│   └── [4.0K]  cc388bb77491624da6653b95ddad21
│       ├── [699K]  ACBarrie_R1.paired_fastqc.html
│       ├── [465K]  ACBarrie_R1.paired_fastqc.zip
│       ├── [ 113]  ACBarrie_R1.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/3d/6ef4c43e33e0d2b48b0edbfd917125/ACBarrie_R1.paired.fastq.gz
│       ├── [707K]  ACBarrie_R2.paired_fastqc.html
│       ├── [468K]  ACBarrie_R2.paired_fastqc.zip
│       ├── [ 113]  ACBarrie_R2.paired.fastq.gz -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/3d/6ef4c43e33e0d2b48b0edbfd917125/ACBarrie_R2.paired.fastq.gz
│       ├── [   0]  .command.begin
│       ├── [   0]  .command.err
│       ├── [   0]  .command.log
│       ├── [   0]  .command.out
│       ├── [2.7K]  .command.run
│       ├── [  45]  .command.sh
│       ├── [3.6K]  .command.stub
│       ├── [ 187]  .command.trace
│       └── [   1]  .exitcode
└── [4.0K]  ca
    └── [4.0K]  dc1f9530e07c432fe40ad2ea50ab9b
        ├── [ 109]  ACBarrie_R1_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/22/37d522a1fe1388da08e24e62bbb7d5/ACBarrie_R1_fastqc.html
        ├── [ 108]  ACBarrie_R1_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/22/37d522a1fe1388da08e24e62bbb7d5/ACBarrie_R1_fastqc.zip
        ├── [ 109]  ACBarrie_R2_fastqc.html -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/22/37d522a1fe1388da08e24e62bbb7d5/ACBarrie_R2_fastqc.html
        ├── [ 108]  ACBarrie_R2_fastqc.zip -> /home/rad/repos/nextflow-embl-abr-webinar/docs/work/22/37d522a1fe1388da08e24e62bbb7d5/ACBarrie_R2_fastqc.zip
        ├── [   0]  .command.begin
        ├── [ 397]  .command.err
        ├── [ 397]  .command.log
        ├── [   0]  .command.out
        ├── [3.0K]  .command.run
        ├── [  26]  .command.sh
        ├── [3.6K]  .command.stub
        ├── [ 178]  .command.trace
        ├── [   1]  .exitcode
        ├── [4.0K]  multiqc_data
        │   ├── [126K]  multiqc_data.json
        │   ├── [ 807]  multiqc_fastqc.txt
        │   ├── [ 399]  multiqc_general_stats.txt
        │   ├── [ 12K]  multiqc.log
        │   └── [ 316]  multiqc_sources.txt
        └── [1.1M]  multiqc_report.html

22 directories, 141 files
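
Each task directory keeps the script that was run (.command.sh) and its exit status (.exitcode), which helps when tracing a failure. Below is a minimal sketch of how failed tasks could be located from the shell. It runs against a small mock of the layout above, with made-up directory names and contents, rather than a real Nextflow work directory; failed_tasks is a hypothetical helper.

```shell
# Mock two task directories (illustrative content only).
work="$(mktemp -d)/work"
mkdir -p "$work/22/37d522" "$work/3d/6ef4c4"
printf '0' > "$work/22/37d522/.exitcode"
printf '1' > "$work/3d/6ef4c4/.exitcode"
printf 'echo ok\n' > "$work/22/37d522/.command.sh"
printf 'exit 1\n'  > "$work/3d/6ef4c4/.command.sh"

# Print every task directory whose recorded exit status is not 0.
failed_tasks() {
  find "$1" -name .exitcode | while read -r f; do
    [ "$(cat "$f")" = "0" ] || dirname "$f"
  done
}

failed=$(failed_tasks "$work")
echo "failed task directory: $failed"

# The exact command that was executed sits next to the exit code.
cat "$failed/.command.sh"
```

In a real run, .command.err and .command.log in the same directory hold the captured error output.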

Example workflow run

Recorded terminal session (asciicast): https://asciinema.org/a/233197

Configuration file(s)

  • nextflow.config
  • $HOME/.nextflow/config
  • But also
    • Include further files from within a config: includeConfig 'conf/publish.config'
    • Extend the config by passing an additional file at run time: -c additional.config
    • Ignore the default config files and use only a custom one passed at run time: -C custom.config
  • Config scopes e.g. env, params, process, docker
  • Config profiles (!)

Recall: separation of workflow logic from compute, software envs

  • Much about the software and execution environments can be defined in the directive declarations at the top of a process body, but here it is largely kept in nextflow.config, e.g.
params {
  take = 1 //can be overwritten at run-time e.g. --take 2 to just process first two accessions or --take all to process all
  accessions = [
    "ACBarrie",
    "Alsen",
    "Baxter",
    "Chara",
    "Drysdale",
    "Excalibur",
    "Gladius",
    "H45",
    "Kukri",
    "Pastor",
    "RAC875",
    "Volcanii",
    "Westonia",
    "Wyalkatchem",
    "Xiaoyan",
    "Yitpi"
  ]

  adapters = "https://raw.githubusercontent.com/timflutre/trimmomatic/master/adapters/TruSeq3-PE.fa"

  ref {
    base_url = "https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Assemblies/v1.0/iwgsc_refseqv1.0_"
    chr      = "chr4A"
    start    = "688055092"
    end      = "688113092"
  }
  bam {
    base_url = "http://crobiad.agwine.adelaide.edu.au/dawn/jbrowse-prod/data/wheat_parts/minimap2_defaults/whole_genome/PE/BPA"
    chr      = "chr4A_part2"
    start    = "235500000"
    end      = "235558000"
  }
  outdir = "./results" //can be overwritten at run-time e.g. --outdir dirname
  infodir = "./flowinfo" //can be overwritten at run-time e.g. --infodir dirname
}

process {
  cache = 'lenient'
}

profiles {
  //SOFTWARE
  conda {
    process {
      conda = "$baseDir/conf/conda.yaml"
    }
  }
  condamodule {
    process.module = 'miniconda3/4.3.24'
  }
  docker {
    process.container = 'rsuchecki/nextflow-embl-abr-webinar'
    docker {
      enabled = true
      fixOwnership = true
    }
  }
  singularity {
    process {
      container = 'shub://csiro-crop-informatics/nextflow-embl-abr-webinar' //Singularity hub
      // container = 'rsuchecki/nextflow-embl-abr-webinar' //pulled from Docker hub - would suffice but Singularity container is re-built from docker image so not ideal for reproducibility
      //scratch = true //This is a hack needed for singularity versions approx after 2.5 and before 3.1.1 as a workaround for https://github.com/sylabs/singularity/issues/1469#issuecomment-469129088
    }
    singularity {
      enabled = true
      autoMounts = true
      cacheDir = "singularity-images"  //when distributing the pipeline probably should point under $workDir
    }
  }
  singularitymodule {
    process.module = 'singularity/3.2.1' //Specific to our cluster - update as required
  }
  //EXECUTORS
  awsbatch {
    aws.region = 'ap-southeast-2'
    process {
      executor = 'awsbatch'
      queue = 'flowq'
      container = 'rsuchecki/nextflow-embl-abr-webinar'
    }
    executor {
      awscli = '/home/ec2-user/miniconda/bin/aws'
    }
  }
  slurm {
    process {
      executor = 'slurm'
    }
  }
}

//PUBLISH RESULTS
params.publishmode = "copy"
includeConfig 'conf/publish.config'

//GENERATE REPORT https://www.nextflow.io/docs/latest/tracing.html#trace-report
report {
    enabled = true
    file = "${params.infodir}/report.html"
}

//GENERATE TIMELINE https://www.nextflow.io/docs/latest/tracing.html#timeline-report
timeline {
    enabled = true
    file = "${params.infodir}/timeline.html"
}

//GENERATE PIPELINE TRACE https://www.nextflow.io/docs/latest/tracing.html#trace-report
trace {
    enabled = true
    file = "${params.infodir}/trace.txt"
}

//GENERATE GRAPH REPRESENTATION OF THE PIPELINE FLOW
dag {
    enabled = true
    file = "${params.infodir}/flowchart.dot"
    // file = "${params.infodir}/flowchart.png" //requires graphviz for rendering
}

Configuration profiles

Setting up software environment(s)

  • Global software env for the workflow
  • Separate software envs for individual processes or subsets
  • A bit of both
  • Our example workflow:
    • global Conda env -> Docker -> Singularity

Software environment (Conda)

  • can be
    • used directly
    • to build a container
    • slow…
name: tutorial
channels:
 - bioconda
 - conda-forge
 - defaults
dependencies:
 - fastqc=0.11.8
 - multiqc=1.7
 - trimmomatic=0.36
 - pigz=2.3.4
 - bwa=0.7.17
 - samtools=1.9
 - htslib=1.9
 - unzip=6.0
 - tabix=0.2.6
 - gnu-wget=1.18

profiles {
  conda {
    process {
      conda = "$baseDir/conf/conda.yaml"
    }
  }
}

Software environment (Docker)

  • Container image (automated build) on Docker Hub
  • Need not be conda-based
  • Dockerfile
FROM rsuchecki/miniconda3:4.5.12

ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

LABEL maintainer="Rad Suchecki <rad.suchecki@csiro.au>"
SHELL ["/bin/bash", "-c"]

COPY conf/conda.yaml /
RUN conda env create -f /conda.yaml && conda clean -a
ENV PATH /opt/conda/envs/tutorial/bin:$PATH

  docker {
    process.container = 'rsuchecki/nextflow-embl-abr-webinar'
    docker {
      enabled = true
      fixOwnership = true
    }
  }
  • NF pulls the container image from Docker Hub when our pipeline is run with -profile docker (or -profile awsbatch)

Software environment (Singularity)

  • Singularity can pull from Docker Hub and convert to its format
  • Dedicated build on Singularity Hub
    • ensures the same image is used (reproducibility!)
    • Need not be Docker based
    • Build automation less flexible than on Docker Hub
    • Singularity recipe
Bootstrap:docker
From:rsuchecki/nextflow-embl-abr-webinar:latest

  singularity {
    process.container = 'shub://csiro-crop-informatics/nextflow-embl-abr-webinar' //Singularity hub
    singularity {
      enabled = true
      autoMounts = true
    }
  }
  • NF pulls the container image from Singularity Hub when our pipeline is run with -profile singularity

Pipeline outputs:

  • The publishDir directive
    • define end products of a pipeline
    • make them easily accessible
process {
  withName: multiqc_raw {
    publishDir {
      path = "${params.outdir}/qc_raw"
      mode = "${params.publishmode}"
    }
  }
  withName: multiqc_trimmed {
    publishDir {
      path = "${params.outdir}/qc_trimmed"
      mode = "${params.publishmode}"
    }
  }
  withName: bwa_mem {
    publishDir {
      path = "${params.outdir}/bwa"
      mode = "${params.publishmode}"
    }
  }
  //Currently not applied, add:
  //label 'stats'
  //at the top of a process definition to store declared outputs as follows
  withLabel: stats {
    publishDir {
      path = "${params.outdir}/stats"
      mode = "${params.publishmode}"
    }
  }
}

Pipeline outputs

results
├── bwa
│   └── ACBarrie.bam
├── downloaded
│   └── iwgsc_refseqv1.0_chr4A.fsa.zip
├── downloaded_reads
│   ├── ACBarrie_R1.fastq.gz
│   └── ACBarrie_R2.fastq.gz
├── flowinfo
│   ├── report.html
│   ├── timeline.html
│   ├── trace.txt
│   └── trace.txt.1
├── qc_raw
│   ├── multiqc_data
│   └── multiqc_report.html
└── qc_trimmed
    ├── multiqc_data
    └── multiqc_report.html

8 directories, 10 files

AWS Batch execution

Recorded terminal session (asciicast): https://asciinema.org/a/233421

Workflow introspection

NF Resources

Acknowledgments
