Example use case of a non-bacterial dataset

During the peer review of the paper describing this pipeline, the reviewers asked for an example use case of both the preprocessing pipeline (ngs-preprocess) and this one on a non-bacterial dataset, to demonstrate that the two pipelines are not specific to bacterial genomes.

This page provides that example. The dataset used here is the same one that was preprocessed in our ngs-preprocess example. You can check it out here. The preprocessed data is reused on this page to show how the two pipelines connect.

Expected data location

In the ngs-preprocess example case, all outputs were saved in an output directory called ./preprocessed_reads. All the examples here therefore use paths relative to that directory, so make sure to adjust them to match the layout on your machine.

Get the data

After preprocessing the data as shown in the ngs-preprocess example case, we can use the generated data to perform genome assembly. Make sure the file paths are correct for your machine.
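Before starting the assembly, it can help to confirm that the preprocessed read file is where you expect it and that it is not truncated. The snippet below is only a minimal sketch: `count_reads` is a hypothetical helper (not part of either pipeline) that assumes the standard four-lines-per-record FASTQ layout, and the path is the one used in this example, so adjust it to your machine.

```shell
# Hypothetical helper: count FASTQ records in a gzip-compressed file.
# Assumes the standard 4-lines-per-record FASTQ layout.
count_reads() {
    gzip -t "$1" || return 1                   # fail early if the file is missing or corrupt
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))   # total lines divided by 4 = number of records
}

# Path used in this example; adjust it to your machine.
reads=./preprocessed_reads/final_output/nanopore/SRR23337893.filtered.fq.gz

if [ -f "$reads" ]; then
    echo "reads found: $(count_reads "$reads") records"
else
    echo "reads not found: $reads" >&2
fi
```

If the file is reported as missing, check the output directory you used when running ngs-preprocess.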

Input data samplesheet

As described in this pipeline's manual, an input data samplesheet is required. In our example, we called it input.yml, and it looks like this:

samplesheet:
  - id: aspergillus_fumigatus
    nanopore: ./preprocessed_reads/final_output/nanopore/SRR23337893.filtered.fq.gz  # remember to give the right path on your machine
    genome_size: 30m
    wtdbg2_technology: ont
    shasta_config: Nanopore-Oct2021
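Since a wrong path in the samplesheet is an easy mistake to make, one possible sanity check is to pull the `nanopore:` entries out of input.yml and test that each file exists. This is a rough sketch under stated assumptions: `samplesheet_paths` is a hypothetical helper, it uses plain text matching rather than a real YAML parser, and it does not handle paths that contain spaces or `#` characters.

```shell
# Hypothetical helper: print every path given under a `nanopore:` key.
# Plain text matching only; trailing "# comments" are stripped.
samplesheet_paths() {
    sed -n 's/^[[:space:]]*nanopore:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1"
}

# Report any listed read file that does not exist on this machine.
if [ -f input.yml ]; then
    samplesheet_paths input.yml | while read -r path; do
        [ -e "$path" ] || echo "missing: $path" >&2
    done
fi
```

Run it from the directory where you will launch the pipeline, so that relative paths are checked the same way they will be used.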

Assembling the data

Outputs will be written to ./genome_assembly.

nextflow run fmalmeida/mpgap \
    -profile docker \
    --output ./genome_assembly \
    --tracedir ./genome_assembly/pipeline_info \
    --input input.yml \
    --skip_unicycler \
    --skip_hifiasm \
    --flye_additional_parameters ' --keep-haplotypes ' \
    --organism 'fungus' \
    --max_cpus 20 \
    --max_memory '40.GB' \
    --start_asm_mem '30.GB'

Note

Unicycler was skipped since it is known not to work well with long reads alone. In addition, a parameter was passed down to QUAST so that it uses eukaryote tools and databases for gene prediction and completeness QC. Otherwise, it uses the bacterial ones.

Afterwards

The outputs generated by this run are shown in the outputs page, as an example of what the pipeline produces.

Under normal circumstances, I would recommend giving our bacterial genome annotation pipeline, bacannot, a try. However, since this is not a bacterial genome, feel free to use the examples in any way you'd like.