Samplesheet
The samplesheet is a required YAML document that describes the input samples and, if desired, their "sample-specific" configuration. The input samplesheet is given using the --input parameter.
A guide on how to properly configure it is shown below:
Samplesheet header
The first line of the file must be the header key samplesheet: and all the samples must be nested under it, indented by two white spaces.
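As a minimal sketch of the header layout described above (the sample name is a hypothetical placeholder, not a required value), the header sits on the first line and the sample list starts two white spaces in:

```yaml
samplesheet:            # required header line
  - id: my_sample       # hypothetical sample name, nested two white spaces under the header
```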
Sample identification
Each sample must be identified by the tag id in the YAML file, followed by the sample's input tags (YAML keys) that shall be used by the pipeline.
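A short sketch of a single sample entry, reusing a path from the complete samplesheet example: the id key opens the entry with a "-", and the input keys follow at the same indent without the "-".

```yaml
samplesheet:
  - id: sample_4                            # used as the prefix / sub-directory for outputs
    nanopore: dataset/ont_reads.fastq.gz    # input key, same indent as "id" but without "-"
```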
Warning
This value will be used to create sub-directories in the output directory. Thus, do not use white spaces in it.
YAML keys related to input files
These are the tags that are used to represent/set the input files that shall be used for each sample. The available tags are:
Input tags | Description |
---|---|
illumina | Used to set the path to Illumina raw reads (paired, unpaired or both) |
pacbio | Used to set the path to PacBio raw reads (mutually exclusive with nanopore) |
pacbio_bam | Used to set the path to a PacBio BAM file (used in conjunction with pacbio for long-read assembly polishing with gcpp) |
nanopore | Used to set the path to Nanopore raw reads (mutually exclusive with pacbio) |
nanopolish_fast5 | Used to set the path to Nanopore raw FAST5 data (used in conjunction with nanopore for long-read assembly polishing with Nanopolish) |
Note for the illumina tag/key
- When using both paired and unpaired reads, the paired reads must be given first, in the order: pair 1, pair 2, unpaired.
- Otherwise, if using only paired reads, they must be given in the order: pair 1, pair 2.
- If using only unpaired reads, only one entry is expected. Check samples 1, 2 and 3 in the complete samplesheet example to understand it.
- The illumina tag is the only one whose values must be set on indented new lines: two white spaces relative to the tag, one line per read, as shown in the complete samplesheet example.
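The ordering rules above can be sketched as follows. The sample ids are hypothetical; the file paths are borrowed from the complete samplesheet example.

```yaml
samplesheet:
  - id: paired_and_unpaired
    illumina:
      - dataset/reads_1.fastq.gz         # pair 1
      - dataset/reads_2.fastq.gz         # pair 2
      - dataset/pairs_merged.fastq.gz    # unpaired, always given last
  - id: unpaired_only
    illumina:
      - dataset/reads_unpaired.fastq.gz  # a single entry is expected
```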
Warning
All the other input tags must be set on the same line, right after the separator (":"), without quotations, white spaces or signs.
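For contrast with the illumina tag, a sketch of the single-line form used by every other input tag, with paths taken from the complete samplesheet example:

```yaml
samplesheet:
  - id: sample_5
    pacbio: dataset/pacbio_reads.fastq.gz            # value on the same line as the key
    pacbio_bam: dataset/pacbio_reads.subreads.bam    # no quotes, no extra white spaces
```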
YAML keys related to configuration
These are the tags that are used to represent/set the "sample-specific" assembly configuration that shall be used for each sample.
By default, if not set inside the samplesheet, the pipeline will use the configuration set via the "nextflow config file" or via the command line. Otherwise, if set inside the samplesheet, the values will override the pipeline's configuration for that specific sample.
Please see the manual reference page for the global/default configurations.
The available tags are:
Input tags | Description |
---|---|
hybrid_strategy | Sets which strategy to run when performing hybrid assemblies. Please read the manual reference page to understand the adopted strategies. Options are: 1, 2, or both |
corrected_long_reads | Tells whether the long reads used are corrected or not. Options: true, false |
nanopolish_max_haplotypes | Sets the maximum number of haplotypes to be considered by Nanopolish. Sometimes the pipeline may crash because too much variation was found, exceeding the limit. Options: any integer value |
medaka_model | Used to polish a long-reads-only assembly with Medaka. It selects a Medaka ONT sequencing model for polishing. Please read the Medaka manual to know the available models |
shasta_config | Selects the shasta configuration file to be used when assembling reads. It is mandatory for shasta since its v0.8 release. Please read the shasta configuration manual page to know the available configurations |
genome_size | Sets the expected genome size for the canu, wtdbg2 and haslr assemblers, which require this value. Options are estimates with common suffixes, for example: 3.7m, 2.8g, etc |
wtdbg2_technology | Sets the technology of the input reads. It is required by wtdbg2. Options are: ont for Nanopore reads, rs for PacBio RSII, sq for PacBio Sequel, ccs for PacBio CCS reads |
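To illustrate how sample-specific configuration keys sit next to the input keys, here is a hedged sketch with two hypothetical samples. The first sets its own values. The second sets none, so it would fall back to whatever is defined in the nextflow config file or on the command line.

```yaml
samplesheet:
  - id: sample_a                            # hypothetical sample name
    nanopore: dataset/ont_reads.fastq.gz
    genome_size: 5.5m                       # applies to this sample only
    wtdbg2_technology: ont                  # Nanopore reads
  - id: sample_b                            # hypothetical sample name
    nanopore: dataset/ont_reads.fastq.gz    # no configuration keys: pipeline-wide values are used
```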
Complete samplesheet example
# This is an example of the YAML syntax accepted by the MpGAP pipeline
#
# The file must contain a header-line with the key "samplesheet:". All the
# input samples must be given nested to this key. Obs: The header-line and
# the nest indentation (two white spaces) are REQUIRED.
#
# Each sample must initiate with the key "id:" listed with a "-". This value
# is what will be used as prefix for outputs. The input keys must be given
# right after this id key, with the same indent without the "-".
#
# Please read the https://mpgap.readthedocs.io/en/latest/samplesheet.html documentation page
# in order to understand how to properly set it up and which are the available/expected keys
#
# A template (with the correct fields, syntax and indentation) is given below:
samplesheet:
  - id: sample_1
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
  - id: sample_2
    illumina:
      - dataset/reads_unpaired.fastq.gz
  - id: sample_3
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
      - dataset/pairs_merged.fastq.gz
  - id: sample_4
    nanopore: dataset/ont_reads.fastq.gz
    corrected_long_reads: true
    nanopolish_fast5: dataset/kleb/fast5_pass
    genome_size: 5.5m
  - id: sample_5
    pacbio: dataset/pacbio_reads.fastq.gz
    pacbio_bam: dataset/pacbio_reads.subreads.bam
    wtdbg2_technology: rs
  - id: sample_6
    illumina:
      - dataset/reads_1.fastq.gz
      - dataset/reads_2.fastq.gz
    nanopore: dataset/ont_reads.fastq.gz
    hybrid_strategy: both