Manual

# Get help in the command line
nextflow run fmalmeida/mpgap --help

Tip

All these parameters are configurable through a configuration file. We encourage users to use the configuration file since it will keep your execution cleaner and more readable. See a config example.

Input description

path to fastq files containing sequencing reads (Illumina, Nanopore or Pacbio)
path to Pacbio subreads.bam file containing raw data (Optional)
path to Nanopore FAST5 files containing raw data (Optional)

The input data must be provided via a samplesheet in YAML format given via the --input parameter. Please read the samplesheet reference page to understand how to properly create one.

Tip

A samplesheet template can be downloaded with: nextflow run fmalmeida/mpgap --get_samplesheet
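
A minimal sketch of a first run, assuming the template has been filled in and saved as `samplesheet.yml` and that results should go to a directory called `results` (both names are placeholders, not pipeline defaults). The command is printed rather than launched so it can be checked first.

```shell
# Placeholder names (assumptions): adjust to your project.
SAMPLESHEET="samplesheet.yml"
OUTDIR="results"

# --input and --output are the two options that have no default value.
CMD="nextflow run fmalmeida/mpgap --input $SAMPLESHEET --output $OUTDIR"

# Print the command for inspection; launch it with: eval "$CMD"
echo "$CMD"
```

The later sketches on this page reuse the same placeholder names.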

Assembly possibilities

The pipeline is capable of assembling Illumina, ONT and Pacbio reads in three main ways:

Short reads only assemblies
- Unicycler
- SPAdes
- Megahit
- Shovill (for paired reads only).

Note

Shovill is a tool that can use different assemblers as its core. The pipeline runs Shovill with SPAdes, SKESA and Megahit, so that users can compare the results.

Long reads only assemblies
- Unicycler
- Canu
- Flye
- Raven
- Shasta
- wtdbg2
- hifiasm

Note

Hifiasm expects high-quality reads as input. The pipeline therefore runs this assembler only when either --corrected_longreads or --high_quality_longreads is set (true). In that case, users can further control it with --skip_hifiasm and --hifiasm_additional_parameters, as with all the other assemblers.
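
The rule above could look like this on the command line. This is only a sketch: the samplesheet and output names are placeholders, and the commands are printed rather than executed.

```shell
# Placeholder samplesheet and output names (assumptions).
BASE="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Declaring the long reads as high quality makes hifiasm eligible to run.
WITH_HIFIASM="$BASE --high_quality_longreads"

# Same declaration, but hifiasm is explicitly skipped.
WITHOUT_HIFIASM="$BASE --high_quality_longreads --skip_hifiasm"

echo "$WITH_HIFIASM"
echo "$WITHOUT_HIFIASM"
```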

Hybrid assemblies (using both short and long reads)
- Unicycler
- SPAdes
- Haslr
- Use short reads to correct errors (polish) in long reads assemblies.

Hybrid assembly strategies

Hybrid assemblies can be produced with the two strategies described below. To choose between them, set the hybrid_strategy parameter either inside the YAML samplesheet, which overrides any other value for that sample (see the samplesheet reference page), or with the --hybrid_strategy parameter, which sets a new default value for all samples.

Valid options are: 1, 2 or both.

Strategy 1

By using Unicycler, Haslr and/or SPAdes specialized hybrid assembly modules.

Note

It is achieved when using --hybrid_strategy 1 or --hybrid_strategy both

Strategy 2

By polishing (correcting errors) a long reads only assembly with Illumina reads. This will tell the pipeline to produce a long reads only assembly (with canu, wtdbg2, shasta, raven, flye or unicycler) and polish it with Pilon and Polypolish. By default, it runs 4 rounds of polishing for Pilon (params.pilon_polish_rounds).

Note

It is achieved when using --hybrid_strategy 2 or --hybrid_strategy both

Additionally, these long reads only assemblies can also be polished with Nanopolish or Racon+Medaka tools for nanopore reads and gcpp for Pacbio reads, before polishing with short reads. For that, users must properly set the samplesheet parameters (medaka_model, nanopolish_fast5 and/or pacbio_bam).
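
As an illustration of choosing a strategy globally, the sketch below prints two possible commands. File names are placeholders, and the `--pilon_polish_rounds` option is an assumption based on Nextflow's usual mapping of `params.<name>` to `--<name>` on the command line.

```shell
# Placeholder samplesheet and output names (assumptions).
BASE="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Run both hybrid strategies for every sample that does not set
# hybrid_strategy in the samplesheet.
BOTH="$BASE --hybrid_strategy both"

# Strategy 2 only, with fewer Pilon rounds than the default of 4.
# Assumption: params.pilon_polish_rounds is overridden via --pilon_polish_rounds.
POLISH_ONLY="$BASE --hybrid_strategy 2 --pilon_polish_rounds 2"

echo "$BOTH"
echo "$POLISH_ONLY"
```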

Parameters documentation

Please note that, on the command line, boolean parameters (true or false) do not expect a value. Each must be given by itself, for example: --skip_spades --skip_flye.

Input and Output options

Parameter	Default	Description
`--input`	NA	Path to input samplesheet in YAML format
`--output`	NA	Directory to store output files
`--organism`	bacteria	Organism type of inputs, options are: bacteria, eukaryote, fungus. It impacts the datasets selected for the Quast and BUSCO QC modules.

BUSCO lineage options

Parameter	Required	Default	Description
`--busco_lineage`		bacteria_odb10 or eukaryota_odb10 or fungi_odb10	Select a BUSCO lineage for the pipeline to use. Depends on `--organism`.

Note

If blank, bacteria_odb10 will be used. If unsure, set the parameter to auto, which tells BUSCO to select the most appropriate lineage automatically (this takes a little more time and space).
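
A hedged sketch of the two ways to pick a lineage, using placeholder file names. The first command names the lineage explicitly for a fungal sample set, the second leaves the choice to BUSCO.

```shell
# Placeholder samplesheet and output names (assumptions).
BASE="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Explicit choice: organism type plus a matching BUSCO lineage.
EXPLICIT="$BASE --organism fungus --busco_lineage fungi_odb10"

# Let BUSCO pick the lineage (slower, and it needs more space).
AUTO="$BASE --busco_lineage auto"

echo "$EXPLICIT"
echo "$AUTO"
```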

Start/Max resources on job request

Parameter	Default	Description
`--start_asm_cpus`	6	How many CPUs an assembly job requests on its very first attempt. This matters for bigger genomes: it avoids a first attempt that fails for lack of resources and is then rerun (automatically) with the maximum allowed by the `--max_cpus` parameter.
`--start_asm_mem`	20.GB	How much memory an assembly job requests on its very first attempt. This matters for bigger genomes: it avoids a first attempt that fails for lack of memory and is then rerun (automatically) with the maximum allowed by the `--max_memory` parameter.
`--max_cpus`	10	Max number of threads a job can use across attempts. After one failed attempt this is maxed out.
`--max_memory`	40.GB	Max amount of memory a job can use across attempts. After one failed attempt this is maxed out.
`--max_time`	40.h	Max amount of time a job can take to run
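
For a larger genome, the starting and maximum requests might be raised together, as in the sketch below. The numbers and file names are illustrative assumptions only; the value formats follow the defaults in the table above.

```shell
# Placeholder samplesheet and output names (assumptions).
BASE="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# First attempt starts at 8 CPUs / 30 GB; a retry may go up to 16 CPUs / 64 GB.
# All values here are examples, not recommendations.
CMD="$BASE --start_asm_cpus 8 --start_asm_mem 30.GB --max_cpus 16 --max_memory 64.GB --max_time 48.h"

echo "$CMD"
```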

Assemblies configuration

All the parameters listed below (for genome size, assembly strategy, long reads characteristics and long reads polishers) set values globally, for all samples, when given on the command line or in the Nextflow config file.

However, they can also be set in a sample-specific manner. If a sample has a value for one of these parameters in the samplesheet, it will overwrite the "global/default" value for that specific sample and use the one provided inside the YAML.

Genome size

Parameter	Required	Default	Description
`--genome_size`		NA	Sets the expected genome size for the canu, wtdbg2 and haslr assemblers, which require this value. Values are estimates written with common suffixes, for example: `3.7m`, `2.8g`, etc.

Hybrid assembly strategies

Parameter	Required	Default	Description
`--hybrid_strategy`		1	It tells the pipeline which hybrid assembly strategy to adopt. Options are: `1`, `2` or `both`. Please read the description of the hybrid assembly strategies to better choose the right strategy.

Long reads characteristics

Parameter	Default	Description
`--wtdbg2_technology`	ont/sq	It tells the pipeline which technology the long reads come from, which is required for wtdbg2. Options are: `ont` for Nanopore reads, `rs` for PacBio RSII, `sq` for PacBio Sequel, `ccs` for PacBio CCS reads. If wtdbg2 is not wanted, consider using `--skip_wtdbg2`
`--shasta_config`	Nanopore-Oct2021	It tells the pipeline which shasta pre-set configuration to use when assembling nanopore reads. Please read the shasta configuration manual page to know the available models
`--corrected_longreads`	False	It tells the pipeline to interpret the input long reads as "corrected". This will activate (if available) the options for corrected reads in the assemblers. For example: `-corrected` (in canu), `--pacbio-corr|--nano-corr` (in flye), etc. Be cautious when using this parameter. If your reads are not corrected and you use this parameter, you will probably not generate any contigs
`--high_quality_longreads`	False	It tells the pipeline to interpret the input long reads as "high quality" (hifi). This will activate (if available) the options for high quality (hifi) reads in the assemblers. For example: `-corrected` (in canu), `--pacbio-hifi|--nano-hq` (in flye), etc. Be cautious when using this parameter. If your reads are not high quality and you use this parameter, you will probably not generate any contigs
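
The genome size and long reads options above can be combined into one global setting, as sketched here for a hypothetical Nanopore dataset with an expected genome size of about 3.7 Mb. File names are placeholders, and any value set for a sample in the samplesheet would still take precedence for that sample.

```shell
# Placeholder samplesheet and output names (assumptions).
BASE="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Global defaults for all samples: genome size estimate, wtdbg2 read
# technology and the shasta pre-set (here the documented default).
CMD="$BASE --genome_size 3.7m --wtdbg2_technology ont --shasta_config Nanopore-Oct2021"

echo "$CMD"
```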

Long reads polishers

Parameter	Required	Default	Description
`--medaka_model`		r941_min_high_g360	It tells the pipeline which available medaka model to use to polish nanopore long reads assemblies. Please read medaka manual to see available models
`--nanopolish_max_haplotypes`		1000	It sets the maximum number of haplotypes to be considered by Nanopolish. Sometimes the pipeline may crash because too much variation was found, exceeding the limit

Note

For assembly polishing with medaka models, the assembly is first polished once with racon using the parameters -m 8 -x -6 -g -8 -w 500, as these are the settings used for the data on which Medaka was trained. Therefore, medaka polishing in this pipeline means Racon 1X + Medaka.

Advanced customization options

Note

Additional parameters must be set inside double quotes separated by blank spaces.

Parameter	Default	Description
`--quast_additional_parameters`	NA	Give additional parameters to Quast while assessing assembly metrics. Must be given as shown in Quast manual. E.g. `" --large --eukaryote "`
`--skip_raw_assemblies_polishing`	False	This will make the pipeline not polish raw assemblies on hybrid strategy 2. For example, if a sample is assembled with flye and polished with medaka, by default, both assemblies will be passed to pilon and polypolish so you can compare them. If you don't need this comparison and don't want to polish the raw assembly, use this parameter
`--skip_canu`	False	Skip the execution of Canu
`--canu_additional_parameters`	False	Passes additional parameters for Canu assembler. E.g. `" correctedErrorRate=0.075 corOutCoverage=200 "`. Must be given as shown in Canu's manual
`--skip_flye`	False	Skip the execution of Flye
`--flye_additional_parameters`	False	Passes additional parameters for Flye assembler. E.g. `" --meta --iterations 4 "`. Must be given as shown in Flye's manual
`--skip_raven`	False	Skip the execution of Raven
`--raven_additional_parameters`	False	Passes additional parameters for Raven assembler. E.g. `" --polishing-rounds 4 "`. Must be given as shown in Raven's manual
`--skip_shasta`	False	Skip the execution of Shasta
`--shasta_additional_parameters`	False	Passes additional parameters for Shasta assembler. E.g. `" --Assembly.detangleMethod 1 "`. Must be given as shown in Shasta's manual
`--skip_wtdbg2`	False	Skip the execution of wtdbg2
`--wtdbg2_additional_parameters`	False	Passes additional parameters for wtdbg2 assembler. E.g. `" -k 250 "`. Must be given as shown in wtdbg2's manual. Remember, the script called for wtdbg2 is `wtdbg2.pl` thus you must give the parameters used by it
`--skip_hifiasm`	False	Skip the execution of hifiasm
`--hifiasm_additional_parameters`	False	Passes additional parameters for hifiasm assembler. E.g. `" --ul ul.fq.gz "`. Must be given as shown in hifiasm's manual
`--skip_unicycler`	False	Skip the execution of Unicycler
`--unicycler_additional_parameters`	False	Passes additional parameters for Unicycler assembler. E.g. `" --mode conservative --no_correct "`. Must be given as shown in Unicycler's manual
`--skip_spades`	False	Skip the execution of SPAdes
`--spades_additional_parameters`	False	Passes additional parameters for SPAdes assembler. E.g. `" --meta --plasmids "`. Must be given as shown in SPAdes' manual
`--skip_haslr`	False	Skip the execution of Haslr
`--haslr_additional_parameters`	False	Passes additional parameters for Haslr assembler. E.g. `" --cov-lr 30 "`. Must be given as shown in Haslr's manual
`--skip_shovill`	False	Skip the execution of Shovill
`--shovill_additional_parameters`	False	Passes additional parameters for Shovill assembler. E.g. `" --depth 15 "`. Must be given as shown in Shovill's manual. The pipeline already executes shovill with spades, skesa and megahit, so please, do not use it with shovill's `--assembler` parameter
`--skip_megahit`	False	Skip the execution of Megahit
`--megahit_additional_parameters`	False	Passes additional parameters for Megahit assembler. E.g. `" --presets meta-large "`. Must be given as shown in Megahit's manual
`--skip_pilon`	False	Skip pilon polisher when performing hybrid assembly strategy 2
`--skip_polypolish`	False	Skip the Polypolish polisher when performing hybrid assembly strategy 2
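
To show how the quoting rule plays out in practice, the sketch below skips one assembler and passes extra options to two others, reusing the example values from the table. File names are placeholders. Because the command is stored in a shell variable here, the inner double quotes are escaped; when typing the command directly in a terminal they are written as plain double quotes.

```shell
# Placeholder samplesheet and output names (assumptions).
BASE="nextflow run fmalmeida/mpgap --input samplesheet.yml --output results"

# Each *_additional_parameters value is one double-quoted string whose
# options are separated by blank spaces.
CMD="$BASE --skip_canu --flye_additional_parameters \" --meta --iterations 4 \" --unicycler_additional_parameters \" --mode conservative --no_correct \""

# Print the command for inspection; launch it with: eval "$CMD"
echo "$CMD"
```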