Manual

# Get help in the command line
nextflow run fmalmeida/mpgap --help

Tip

All these parameters are configurable through a configuration file. We encourage users to use the configuration file since it will keep your execution cleaner and more readable. See a config example.
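As a minimal sketch of that approach, the snippet below writes a small configuration file and prints the command that would load it through Nextflow's -c option. The file name mpgap.config and every value in it are placeholders chosen for illustration; the parameter names are the ones documented on this page.

```shell
# Hypothetical config file name; every value below is a placeholder.
config_file="mpgap.config"

# Parameters go inside a `params` scope, without the leading `--`.
cat > "$config_file" <<'EOF'
params {
  input           = 'samplesheet.yml'
  output          = 'results'
  hybrid_strategy = 'both'
  skip_spades     = true
}
EOF

# Dry run: print the command for inspection; launch it with: eval "$cmd"
cmd="nextflow run fmalmeida/mpgap -c $config_file"
echo "$cmd"
```

Keeping the settings in a file like this means the command line stays short and the exact parameters of a run can be versioned alongside the results.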

Input description

  • path to fastq files containing sequencing reads (Illumina, Nanopore or Pacbio)
  • path to Pacbio subreads.bam file containing raw data (Optional)
  • path to Nanopore FAST5 files containing raw data (Optional)

The input data must be provided via a samplesheet in YAML format given via the --input parameter. Please read the samplesheet reference page to understand how to properly create one.

Tip

A samplesheet template can be downloaded with: nextflow run fmalmeida/mpgap --get_samplesheet
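The two steps described here (fetch the template, then run the pipeline with the edited samplesheet) might look like the sketch below. The names samplesheet.yml and results are placeholders, and the commands are only printed so they can be checked before anything is launched.

```shell
# Step 1: fetch the samplesheet template (flag taken from the tip above).
get_template="nextflow run fmalmeida/mpgap --get_samplesheet"

# Step 2: after editing the template, pass it with --input.
# `samplesheet.yml` and `results` are placeholder names.
run_pipeline="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Dry run: print both commands; launch one with, e.g., eval "$get_template"
printf '%s\n' "$get_template" "$run_pipeline"
```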

Assembly possibilities

The pipeline is capable of assembling Illumina, ONT and Pacbio reads in three main ways:

  1. Short reads only assemblies
    • Unicycler
    • SPAdes
    • Megahit
    • Shovill (for paired reads only).

Note

Shovill can use different assemblers as its core. The pipeline runs Shovill with each of SPAdes, SKESA and Megahit, so users can compare the results.

  2. Long reads only assemblies
    • Unicycler
    • Canu
    • Flye
    • Raven
    • Shasta
    • wtdbg2
    • hifiasm

Note

Hifiasm expects high-quality reads as input. The pipeline therefore only runs this assembler when either --corrected_longreads or --high_quality_longreads is set (true). In that case, users can further control it with --skip_hifiasm and --hifiasm_additional_parameters, as with all the other assemblers.

  3. Hybrid assemblies (using both short and long reads)
    • Unicycler
    • SPAdes
    • Haslr
    • Use short reads to correct errors (polish) in long reads assemblies.

Hybrid assembly strategies

Hybrid assemblies can be produced with the two strategies described below. To choose which strategies are used, set the hybrid_strategy parameter either inside the YAML samplesheet, where it overrides any other value for that sample (see the samplesheet reference page), or with the --hybrid_strategy parameter, which sets a new default value for all samples.

Valid options are: 1, 2 or both.

Strategy 1

By using Unicycler, Haslr and/or SPAdes specialized hybrid assembly modules.

Note

It is achieved when using --hybrid_strategy 1 or --hybrid_strategy both

Strategy 2

By polishing (correcting errors) a long reads only assembly with Illumina reads. This will tell the pipeline to produce a long reads only assembly (with canu, wtdbg2, shasta, raven, flye or unicycler) and polish it with Pilon and Polypolish. By default, it runs 4 rounds of polishing for Pilon (params.pilon_polish_rounds).

Note

It is achieved when using --hybrid_strategy 2 or --hybrid_strategy both

Additionally, these long reads only assemblies can also be polished with Nanopolish or Racon+Medaka tools for nanopore reads and gcpp for Pacbio reads, before polishing with short reads. For that, users must properly set the samplesheet parameters (medaka_model, nanopolish_fast5 and/or pacbio_bam).
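As an illustrative sketch, a run that requests both strategies so their outputs can be compared could be assembled as below. The input and output names are placeholders, and passing --pilon_polish_rounds on the command line is an assumption based on the parameter being documented as params.pilon_polish_rounds.

```shell
# Base command; `samplesheet.yml` and `results` are placeholder names.
cmd="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Run strategy 1 and strategy 2 for every sample that does not override it.
cmd="$cmd --hybrid_strategy both"

# Assumption: params.pilon_polish_rounds can be set from the command line
# like the other parameters (the documented default is 4 rounds).
cmd="$cmd --pilon_polish_rounds 2"

# Dry run: print the command for inspection; launch it with: eval "$cmd"
echo "$cmd"
```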

Parameters documentation

Please note that, on the command line, boolean parameters (true or false) do not expect a value. They must be used by themselves, for example: --skip_spades --skip_flye.
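In a full command, that rule could look like the following sketch, where the boolean switches are simply appended with nothing after them. The input and output names are placeholders.

```shell
# Base command; `samplesheet.yml` and `results` are placeholder names.
cmd="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Boolean switches stand alone: no true/false value follows them.
cmd="$cmd --skip_spades --skip_flye"

# Dry run: print the command for inspection; launch it with: eval "$cmd"
echo "$cmd"
```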

Input and Output options

| Parameter | Default | Description |
|---|---|---|
| --input | NA | Path to input samplesheet in YAML format |
| --output | NA | Directory to store output files |
| --organism | bacteria | Organism type of the inputs. Options are: bacteria, eukaryote, fungus. It affects the datasets selected for the Quast and BUSCO QC modules. |

BUSCO lineage options

| Parameter | Default | Description |
|---|---|---|
| --busco_lineage | bacteria_odb10 or eukaryota_odb10 or fungi_odb10 | Select a BUSCO lineage for the pipeline to use. Depends on --organism. |

Note

If blank, bacteria_odb10 will be used. If unsure, set the parameter to auto, which tells BUSCO to automatically select the most appropriate lineage (this takes a little more time and space).

Start/Max resources on job request

| Parameter | Default | Description |
|---|---|---|
| --start_asm_cpus | 6 | Number of CPUs an assembly job requests on its very first attempt. This matters for bigger genomes: it avoids failing the first attempt due to lack of memory and then rerunning (automatically) with the maximum values allowed by the max_cpus parameter. |
| --start_asm_mem | 20.GB | Amount of memory an assembly job requests on its very first attempt. This matters for bigger genomes: it avoids failing the first attempt due to lack of memory and then rerunning (automatically) with the maximum values allowed by the max_memory parameter. |
| --max_cpus | 10 | Maximum number of threads a job can use across attempts. After one failed attempt this is maxed out. |
| --max_memory | 40.GB | Maximum amount of memory a job can use across attempts. After one failed attempt this is maxed out. |
| --max_time | 40.h | Maximum amount of time a job can take to run |
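The relationship between the start and max values can be pictured with a short sketch. It only illustrates the behaviour described in the table (start value on the first attempt, maximum afterwards) and is not the pipeline's actual code; the function name cpus_for_attempt is hypothetical.

```shell
# Defaults taken from the table above.
start_asm_cpus=6
max_cpus=10

# Sketch of the documented retry behaviour (not the pipeline's actual code):
# the first attempt requests the start value, any later attempt the maximum.
cpus_for_attempt() {
  attempt="$1"
  if [ "$attempt" -le 1 ]; then
    echo "$start_asm_cpus"
  else
    echo "$max_cpus"
  fi
}

cpus_for_attempt 1   # prints 6
cpus_for_attempt 2   # prints 10
```

Raising --start_asm_cpus and --start_asm_mem for large genomes therefore saves the time otherwise lost on a first attempt that is likely to fail.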

Assemblies configuration

All the parameters listed below (genome size, assembly strategy, long reads characteristics and long reads polishers) set values globally, for all samples, when given on the command line or in the NF config file.

However, they can also be set in a sample-specific manner. If a sample has a value for one of these parameters in the samplesheet, it will overwrite the "global/default" value for that specific sample and use the one provided inside the YAML.
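The precedence rule can be summarised in a few lines. This is only a sketch of the behaviour described above, not the pipeline's actual code, and the function name resolve_param is hypothetical.

```shell
# Sketch of the precedence rule (not the pipeline's actual code).
#   $1 = value found in the samplesheet for this sample (empty if absent)
#   $2 = global value from the command line or config file
resolve_param() {
  printf '%s\n' "${1:-$2}"
}

resolve_param ""   1   # sample sets nothing  -> prints 1 (global value)
resolve_param both 1   # sample sets "both"   -> prints both (sample wins)
```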

Genome size

| Parameter | Default | Description |
|---|---|---|
| --genome_size | NA | Sets the expected genome size for the canu, wtdbg2 and haslr assemblers, which require this value. Values are estimates with common suffixes, for example: 3.7m, 2.8g, etc. |

Hybrid assembly strategies

| Parameter | Default | Description |
|---|---|---|
| --hybrid_strategy | 1 | Tells the pipeline which hybrid assembly strategy to adopt. Options are: 1, 2 or both. Please read the description of the hybrid assembly strategies to choose the right one. |

Long reads characteristics

| Parameter | Default | Description |
|---|---|---|
| --wtdbg2_technology | ont/sq | Tells the pipeline which technology produced the long reads, which is required for wtdbg2. Options are: ont for Nanopore reads, rs for PacBio RSII, sq for PacBio Sequel, ccs for PacBio CCS reads. If wtdbg2 is not wanted, consider using --skip_wtdbg2 |
| --shasta_config | Nanopore-Oct2021 | Tells the pipeline which shasta pre-set configuration to use when assembling nanopore reads. Please read the shasta configuration manual page to see the available models |
| --corrected_longreads | False | Tells the pipeline to interpret the input long reads as "corrected". This activates (if available) the options for corrected reads in the assemblers. For example: -corrected (in canu), --pacbio-corr or --nano-corr (in flye), etc. Be cautious when using this parameter. If your reads are not corrected and you use it, you will probably not generate any contigs |
| --high_quality_longreads | False | Tells the pipeline to interpret the input long reads as "high quality" (hifi). This activates (if available) the options for high quality (hifi) reads in the assemblers. For example: -corrected (in canu), --pacbio-hifi or --nano-hq (in flye), etc. Be cautious when using this parameter. If your reads are not corrected and you use it, you will probably not generate any contigs |

Long reads polishers

| Parameter | Default | Description |
|---|---|---|
| --medaka_model | r941_min_high_g360 | Tells the pipeline which available medaka model to use to polish nanopore long reads assemblies. Please read the medaka manual to see the available models |
| --nanopolish_max_haplotypes | 1000 | Sets the maximum number of haplotypes to be considered by Nanopolish. Sometimes the pipeline may crash because too much variation was found, exceeding the limit |

Note

For assembly polishing with medaka models, the assembly is first polished once with racon using -m 8 -x -6 -g -8 -w 500, since Medaka was trained on drafts polished with racon using these parameters. Therefore, medaka polishing in this pipeline means Racon 1X + Medaka.

Advanced customization options

Note

Additional parameters must be set inside double quotes separated by blank spaces.

| Parameter | Default | Description |
|---|---|---|
| --quast_additional_parameters | NA | Gives additional parameters to Quast while assessing assembly metrics. Must be given as shown in Quast's manual. E.g. " --large --eukaryote " |
| --skip_raw_assemblies_polishing | False | Makes the pipeline not polish raw assemblies in hybrid strategy 2. For example, if a sample is assembled with flye and polished with medaka, by default both assemblies are passed to pilon and polypolish so you can compare them. If you don't need this comparison and don't want to polish the raw assembly, use this parameter |
| --skip_canu | False | Skip the execution of Canu |
| --canu_additional_parameters | False | Passes additional parameters to the Canu assembler. E.g. " correctedErrorRate=0.075 corOutCoverage=200 ". Must be given as shown in Canu's manual |
| --skip_flye | False | Skip the execution of Flye |
| --flye_additional_parameters | False | Passes additional parameters to the Flye assembler. E.g. " --meta --iterations 4 ". Must be given as shown in Flye's manual |
| --skip_raven | False | Skip the execution of Raven |
| --raven_additional_parameters | False | Passes additional parameters to the Raven assembler. E.g. " --polishing-rounds 4 ". Must be given as shown in Raven's manual |
| --skip_shasta | False | Skip the execution of Shasta |
| --shasta_additional_parameters | False | Passes additional parameters to the Shasta assembler. E.g. " --Assembly.detangleMethod 1 ". Must be given as shown in Shasta's manual |
| --skip_wtdbg2 | False | Skip the execution of wtdbg2 |
| --wtdbg2_additional_parameters | False | Passes additional parameters to the wtdbg2 assembler. E.g. " -k 250 ". Must be given as shown in wtdbg2's manual. Remember that the script called for wtdbg2 is wtdbg2.pl, so you must give the parameters it accepts |
| --skip_hifiasm | False | Skip the execution of hifiasm |
| --hifiasm_additional_parameters | False | Passes additional parameters to the hifiasm assembler. E.g. " --ul ul.fq.gz ". Must be given as shown in hifiasm's manual |
| --skip_unicycler | False | Skip the execution of Unicycler |
| --unicycler_additional_parameters | False | Passes additional parameters to the Unicycler assembler. E.g. " --mode conservative --no_correct ". Must be given as shown in Unicycler's manual |
| --skip_spades | False | Skip the execution of SPAdes |
| --spades_additional_parameters | False | Passes additional parameters to the SPAdes assembler. E.g. " --meta --plasmids ". Must be given as shown in SPAdes' manual |
| --skip_haslr | False | Skip the execution of Haslr |
| --haslr_additional_parameters | False | Passes additional parameters to the Haslr assembler. E.g. " --cov-lr 30 ". Must be given as shown in Haslr's manual |
| --skip_shovill | False | Skip the execution of Shovill |
| --shovill_additional_parameters | False | Passes additional parameters to the Shovill assembler. E.g. " --depth 15 ". Must be given as shown in Shovill's manual. The pipeline already runs shovill with spades, skesa and megahit, so please do not use it with shovill's --assembler parameter |
| --skip_megahit | False | Skip the execution of Megahit |
| --megahit_additional_parameters | False | Passes additional parameters to the Megahit assembler. E.g. " --presets meta-large ". Must be given as shown in Megahit's manual |
| --skip_pilon | False | Skip the Pilon polisher when performing hybrid assembly strategy 2 |
| --skip_polypolish | False | Skip the Polypolish polisher when performing hybrid assembly strategy 2 |
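To illustrate the quoting rule for the *_additional_parameters options, the sketch below combines a skip switch with extra Flye options taken from the example in the table. The input and output names are placeholders; the command is printed first so the quoting can be checked before launching.

```shell
# Base command; `samplesheet.yml` and `results` are placeholder names.
cmd="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Skip one assembler entirely.
cmd="$cmd --skip_canu"

# The extra Flye options travel as ONE double-quoted argument, padded with
# blank spaces exactly as in the examples of the table above.
cmd="$cmd --flye_additional_parameters \" --meta --iterations 4 \""

# Dry run: print the command for inspection; launch it with: eval "$cmd"
echo "$cmd"
```

Because the inner double quotes are stored in the string, eval passes " --meta --iterations 4 " to the pipeline as a single value instead of splitting it into separate arguments.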