Samplesheet

The samplesheet is a required YAML document that describes the input samples and, if desired, their "sample-specific" configuration. It is passed to the pipeline with the --input parameter.

Tip

A samplesheet template can be downloaded with:

nextflow run fmalmeida/mpgap --get_samplesheet

A guide on how to properly configure it is shown below:

Samplesheet header

The first line of the file must be the header, and the entries nested under it must be indented by two white spaces:

samplesheet:
  - ...:

Sample identification

Each sample must be identified by the tag id in the YAML file, followed by the sample's input tags (YAML keys) that shall be used by the pipeline:

Warning

The id value will be used to create sub-directories in the output directory, so it must not contain white spaces.

samplesheet:
  - id: sample_1
    ...:
    ...:
  - id: sample_2
    ...:
    ...:
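The id rule above can be checked before launching the pipeline. The following is a minimal sketch, not part of MpGAP: it assumes the samplesheet has already been loaded into plain Python dicts and lists by a YAML parser, and the function name `invalid_sample_ids` is hypothetical.

```python
import re


def invalid_sample_ids(samplesheet):
    """Return the ids that are missing, empty, or contain white space.

    `samplesheet` is assumed to be the parsed YAML document, i.e. a dict
    with a top-level "samplesheet" key holding a list of sample entries.
    """
    bad = []
    for entry in samplesheet.get("samplesheet", []):
        sample_id = entry.get("id")
        text = "" if sample_id is None else str(sample_id)
        if not text or re.search(r"\s", text):
            bad.append(sample_id)
    return bad


# Hypothetical parsed samplesheet with one valid and one invalid id
parsed = {"samplesheet": [{"id": "sample_1"}, {"id": "sample 2"}]}
print(invalid_sample_ids(parsed))  # prints ['sample 2']
```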

YAML keys related to input files

These are the tags used to set the input files for each sample. The available tags are:

Input tags	Description
`illumina`	Sets the path to Illumina raw reads (paired, unpaired or both)
`pacbio`	Sets the path to PacBio raw reads (mutually exclusive with `nanopore`)
`pacbio_bam`	Sets the path to a PacBio BAM file (used in conjunction with `pacbio` for long-read assembly polishing with gcpp)
`nanopore`	Sets the path to Nanopore raw reads (mutually exclusive with `pacbio`)
`nanopolish_fast5`	Sets the path to Nanopore raw FAST5 data (used in conjunction with `nanopore` for long-read assembly polishing with Nanopolish)
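The relationships between these tags can be expressed as a small check. This is an illustrative sketch only: it assumes each sample is a dict parsed from the samplesheet, the helper name `check_input_tags` is hypothetical, and the pipeline's own validation may behave differently.

```python
def check_input_tags(sample):
    """Return a list of problems found in one sample's input tags."""
    problems = []
    # pacbio and nanopore are described as mutually exclusive
    if "pacbio" in sample and "nanopore" in sample:
        problems.append("pacbio and nanopore are mutually exclusive")
    # pacbio_bam is described as used in conjunction with pacbio
    if "pacbio_bam" in sample and "pacbio" not in sample:
        problems.append("pacbio_bam is used in conjunction with pacbio")
    # nanopolish_fast5 is described as used in conjunction with nanopore
    if "nanopolish_fast5" in sample and "nanopore" not in sample:
        problems.append("nanopolish_fast5 is used in conjunction with nanopore")
    return problems


sample = {"id": "sample_5", "pacbio_bam": "dataset/pacbio_reads.subreads.bam"}
print(check_input_tags(sample))
```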

Note for the illumina tag/key

When using both paired and unpaired reads, the paired reads must be given first, in the order: pair 1, pair 2, unpaired.
If using only paired reads, they must be given in the order: pair 1, pair 2.
If using only unpaired reads, only one entry is expected. Samples 1, 2 and 3 in the complete samplesheet example illustrate these three cases.
The illumina tag is the only one that must be set in indented newlines:
- two white spaces relative to the illumina key
- one line per read, as shown in the complete samplesheet example.
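The ordering rules above imply that the role of each entry follows from its position and from the number of entries. A minimal sketch of that interpretation, assuming the illumina value has been parsed into a Python list (the function name `split_illumina_reads` is hypothetical and not part of the pipeline):

```python
def split_illumina_reads(reads):
    """Interpret an illumina list by position, following the rules above.

    1 entry   -> unpaired reads only
    2 entries -> pair 1, pair 2
    3 entries -> pair 1, pair 2, unpaired
    """
    if len(reads) == 1:
        return {"paired": None, "unpaired": reads[0]}
    if len(reads) == 2:
        return {"paired": (reads[0], reads[1]), "unpaired": None}
    if len(reads) == 3:
        return {"paired": (reads[0], reads[1]), "unpaired": reads[2]}
    raise ValueError(f"expected 1, 2 or 3 illumina entries, got {len(reads)}")


print(split_illumina_reads([
    "dataset/reads_1.fastq.gz",
    "dataset/reads_2.fastq.gz",
    "dataset/pairs_merged.fastq.gz",
]))
```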

Warning

All the other input tags must be set on the same line, right after the separator (":"), without quotation marks, white spaces or signs.

YAML keys related to configuration

These are the tags used to set the "sample-specific" assembly configuration for each sample.

If a tag is not set inside the samplesheet, the pipeline uses the configuration set via the "nextflow config file" or via the command line. If it is set inside the samplesheet, it overrides the pipeline's configuration for that specific sample.

Please see the manual reference page for the global/default configurations.
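The override behaviour can be pictured as a per-sample merge on top of the global settings. The sketch below only illustrates that precedence; it is not the pipeline's implementation, and the names `CONFIG_KEYS` and `effective_config` are hypothetical. The set of keys mirrors the configuration tags documented on this page.

```python
# Configuration tags documented on this page
CONFIG_KEYS = {
    "hybrid_strategy",
    "corrected_long_reads",
    "nanopolish_max_haplotypes",
    "medaka_model",
    "shasta_config",
    "genome_size",
    "wtdbg2_technology",
}


def effective_config(global_config, sample):
    """Return the global configuration with sample-specific values on top."""
    merged = dict(global_config)  # copy, so the global settings stay untouched
    merged.update({k: v for k, v in sample.items() if k in CONFIG_KEYS})
    return merged


# Hypothetical global settings (config file or command line) and one sample
global_config = {"genome_size": "4m", "hybrid_strategy": 1}
sample = {"id": "sample_4", "nanopore": "dataset/ont_reads.fastq.gz", "genome_size": "5.5m"}
print(effective_config(global_config, sample))
# prints {'genome_size': '5.5m', 'hybrid_strategy': 1}
```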

The available tags are:

Input tags	Description
`hybrid_strategy`	Sets which strategy to run when performing hybrid assemblies. Please read the manual reference page to understand the adopted strategies. Options are: `1`, `2`, or `both`
`corrected_long_reads`	Tells whether the long reads used are corrected or not. Options: `true`, `false`
`nanopolish_max_haplotypes`	Sets the maximum number of haplotypes to be considered by Nanopolish. The pipeline may sometimes crash because too much variation was found, exceeding the limit. Options: any integer value
`medaka_model`	Used to polish a long-reads-only assembly with Medaka. It selects a Medaka ONT sequencing model for polishing. Please read the Medaka manual to know the available models
`shasta_config`	Selects the Shasta configuration file to be used when assembling reads. It has been mandatory for Shasta since its v0.8 release. Please read the Shasta configuration manual page to know the available configurations
`genome_size`	Sets the expected genome size for the canu, wtdbg2 and haslr assemblers, which require this value. Options are estimates with common suffixes, for example: `3.7m`, `2.8g`, etc
`wtdbg2_technology`	Sets the technology of the input reads. It is required by wtdbg2. Options are: `ont` for Nanopore reads, `rs` for PacBio RSII, `sq` for PacBio Sequel, `ccs` for PacBio CCS reads
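The option lists in the table lend themselves to a quick sanity check. The sketch below is an illustration under stated assumptions, not the pipeline's own validation: the names `ALLOWED_VALUES`, `invalid_options` and `genome_size_to_bases` are hypothetical, and the suffix meanings (`k`, `m`, `g` as thousand, million and billion bases) are an assumption based on common assembler conventions rather than something this page states.

```python
import re
from decimal import Decimal

# Options copied from the table above
ALLOWED_VALUES = {
    "hybrid_strategy": {"1", "2", "both"},
    "corrected_long_reads": {"true", "false"},
    "wtdbg2_technology": {"ont", "rs", "sq", "ccs"},
}

# Assumed suffix meanings; not confirmed by this page
SUFFIX_FACTORS = {"k": 10**3, "m": 10**6, "g": 10**9}


def invalid_options(sample):
    """Return the configuration tags whose value is not a listed option."""
    bad = {}
    for key, allowed in ALLOWED_VALUES.items():
        # YAML parsers may yield int or bool values, so compare as lowercase text
        if key in sample and str(sample[key]).lower() not in allowed:
            bad[key] = sample[key]
    return bad


def genome_size_to_bases(value):
    """Convert a size such as '3.7m' into a number of bases."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmg])", str(value).strip().lower())
    if match is None:
        raise ValueError(f"unrecognised genome size: {value!r}")
    number, suffix = match.groups()
    return int(Decimal(number) * SUFFIX_FACTORS[suffix])


print(invalid_options({"hybrid_strategy": "both", "wtdbg2_technology": "pacbio"}))
print(genome_size_to_bases("5.5m"))  # prints 5500000
```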

Complete samplesheet example

# This is an example of the YAML syntax accepted by the MpGAP pipeline
#
# The file must contain a header-line with the key "samplesheet:". All the
# input samples must be given nested to this key. Obs: The header-line and
# the nest indentation (two white spaces) are REQUIRED.
#
# Each sample must start with the key "id:" listed with a "-". This value
# is what will be used as prefix for outputs. The input keys must be given
# right after this id key, with the same indent without the "-".
#
# Please read the https://mpgap.readthedocs.io/en/latest/samplesheet.html documentation page
# in order to understand how to properly set it up and which are the available/expected keys
#
# A template (with the correct fields, syntax and indentation) is given below:

samplesheet:
  - id: sample_1
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
  - id: sample_2
    illumina:
      - dataset/reads_unpaired.fastq.gz
  - id: sample_3
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
      - dataset/pairs_merged.fastq.gz
  - id: sample_4
    nanopore: dataset/ont_reads.fastq.gz
    corrected_long_reads: true
    nanopolish_fast5: dataset/kleb/fast5_pass
    genome_size: 5.5m
  - id: sample_5
    pacbio: dataset/pacbio_reads.fastq.gz
    pacbio_bam: dataset/pacbio_reads.subreads.bam
    wtdbg2_technology: rs
  - id: sample_6
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
    nanopore: dataset/ont_reads.fastq.gz
    hybrid_strategy: both
