Samplesheet

The samplesheet is a required YAML document that describes the input samples and, if desired, their "sample-specific" configuration. It is passed to the pipeline with the --input parameter.

Tip

A samplesheet template can be downloaded with:

nextflow run fmalmeida/mpgap --get_samplesheet

A guide on how to properly configure it is shown below:

Samplesheet header

The first line of the file must be the header, and everything nested under it must be indented by two white spaces:

samplesheet:
  - ...:

Sample identification

Each sample must be identified by the tag id in the YAML file, followed by the sample's input tags (YAML keys) that shall be used by the pipeline:

Warning

This value will be used to create sub-directories in the output directory. Thus, do not use white spaces in it.

samplesheet:
  - id: sample_1
    ...:
    ...:
  - id: sample_2
    ...:
    ...:

These are the tags that are used to represent/set the input files that shall be used for each sample. The available tags are:

| Input tags | Description |
|---|---|
| illumina | Used to set path to illumina raw reads (paired, unpaired or both) |
| pacbio | Used to set path to pacbio raw reads (mutually exclusive with nanopore) |
| pacbio_bam | Used to set path to pacbio bam file (used in conjunction with pacbio for long reads assembly polishing with gcpp) |
| nanopore | Used to set path to nanopore raw reads (mutually exclusive with pacbio) |
| nanopolish_fast5 | Used to set path to nanopore raw FAST5 data (used in conjunction with nanopore for long reads assembly polishing with Nanopolish) |

Note for the illumina tag/key

  • When using both paired and unpaired reads, the paired reads must be given first, in the order: pair 1, pair 2, unpaired.
  • Otherwise, if using only paired reads, they must be given in the order: pair 1, pair 2.
  • If using only unpaired reads, only one entry is expected. Check samples 1, 4 and 5 in the template to understand it.
  • The illumina tag is the only one that must be set in indented newlines:
    • two white spaces relative to the illumina key;
    • one line per read file, as shown in the complete samplesheet example.
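
The ordering rules above can be sketched as follows. The sample ids and file paths are placeholders chosen for illustration; they are not files shipped with the pipeline.

```yaml
samplesheet:
  # paired reads only: pair 1, then pair 2
  - id: paired_only
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
  # unpaired reads only: a single entry
  - id: unpaired_only
    illumina:
      - dataset/reads_unpaired.fastq.gz
  # paired and unpaired reads: pair 1, pair 2, then unpaired
  - id: paired_and_unpaired
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
      - dataset/reads_unpaired.fastq.gz
```

Note that each read file sits on its own line, prefixed with "-" and indented two white spaces deeper than the illumina key.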

Warning

All the other input tags must be set on the same line as the key, right after the separator (":"), and the value must not contain quotation marks, white spaces or signs.

These are the tags that are used to represent/set the "sample-specific" assembly configuration that shall be used for each sample.

By default, if a tag is not set inside the samplesheet, the pipeline uses the configuration set via the "nextflow config file" or via the command line. If the tag is set inside the samplesheet, it overrides the pipeline's configuration for that specific sample only.
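
A minimal sketch of this precedence, assuming a global genome size has already been set in the nextflow config file or on the command line. The sample ids and paths are illustrative placeholders.

```yaml
samplesheet:
  # genome_size is not set here, so this sample falls back to the
  # global value from the nextflow config file or the command line
  - id: sample_a
    nanopore: dataset/ont_reads.fastq.gz
  # genome_size is set here, so it replaces the global value
  # for this sample only
  - id: sample_b
    nanopore: dataset/ont_reads.fastq.gz
    genome_size: 5.5m
```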

Please see the manual reference page for the global/default configurations.

The available tags are:

| Configuration tags | Description |
|---|---|
| hybrid_strategy | Sets which strategy to run when performing hybrid assemblies. Please read the manual reference page to understand the adopted strategies. Options are: 1, 2, or both |
| corrected_long_reads | Tells whether the long reads used are corrected or not. Options: true, false |
| nanopolish_max_haplotypes | Sets the maximum number of haplotypes to be considered by Nanopolish. Sometimes the pipeline may crash because too much variation was found, exceeding the limit. Options: any integer value |
| medaka_model | Used to polish a long-reads-only assembly with Medaka. It selects a Medaka ONT sequencing model for polishing. Please read the Medaka manual to know the available models |
| shasta_config | Selects the shasta configuration file to be used when assembling reads. It is mandatory for shasta since its v0.8 release. Please read the shasta configuration manual page to know the available configurations |
| genome_size | Sets the expected genome size for the canu, wtdbg2 and haslr assemblers, which require this value. Options are estimates with common suffixes, for example: 3.7m, 2.8g, etc |
| wtdbg2_technology | Sets the technology of the input reads. It is required by wtdbg2. Options are: ont for Nanopore reads, rs for PacBio RSII, sq for PacBio Sequel, ccs for PacBio CCS reads |
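
As a rough sketch, several of these tags could be combined for a single Nanopore sample as shown below. The sample id and the specific values are assumptions made for illustration. In particular, the medaka_model and shasta_config names should be checked against the Medaka and shasta manuals before use.

```yaml
samplesheet:
  - id: ont_sample
    nanopore: dataset/ont_reads.fastq.gz
    nanopolish_fast5: dataset/kleb/fast5_pass
    nanopolish_max_haplotypes: 2000    # any integer value (illustrative)
    medaka_model: r941_min_high_g360   # assumed model name, see the Medaka manual
    shasta_config: Nanopore-Oct2021    # assumed config name, see the shasta manual
    genome_size: 3.7m                  # estimate with a common suffix
    wtdbg2_technology: ont             # ont = Nanopore reads
```

Every tag here is written on a single line right after its key, as required for all tags other than illumina.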

Complete samplesheet example

# This is an example of the YAML syntax accepted by the MpGAP pipeline
#
# The file must contain a header-line with the key "samplesheet:". All the
# input samples must be given nested to this key. Obs: The header-line and
# the nest indentation (two white spaces) are REQUIRED.
#
# Each sample must initiate with the key "id:" listed with a "-". This value
# is what will be used as prefix for outputs. The input keys must be given 
# right after this id key, with the same indent without the "-".
#
# Please read the https://mpgap.readthedocs.io/en/latest/samplesheet.html documentation page
# in order to understand how to properly set it up and which are the available/expected keys
#
# A template (with the correct fields, syntax and indentation) is given below:

samplesheet:
  - id: sample_1
    illumina: 
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
  - id: sample_2
    illumina:
      - dataset/reads_unpaired.fastq.gz
  - id: sample_3
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
      - dataset/pairs_merged.fastq.gz
  - id: sample_4
    nanopore: dataset/ont_reads.fastq.gz
    corrected_long_reads: true
    nanopolish_fast5: dataset/kleb/fast5_pass
    genome_size: 5.5m
  - id: sample_5
    pacbio: dataset/pacbio_reads.fastq.gz
    pacbio_bam: dataset/pacbio_reads.subreads.bam
    wtdbg2_technology: rs  
  - id: sample_6
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
    nanopore: dataset/ont_reads.fastq.gz
    hybrid_strategy: both