Build Genome reference

Requirements

Docker

Please refer to Section 1 (Docker) or the official Docker documentation for installation instructions.

We are now ready to run the first step of the Vulture pipeline, mkref (genome reference building). Execute the commands below in your preferred terminal or PowerShell and wait for them to finish.

Download the input data for the mkref stage

The input data required for the mkref stage are available from the download links below. You can save them into your own S3 bucket folder:

The human genome: hg38.fa
The human genome annotation: hg38.unique_gene_names.gtf
The list of human-host microbe genomes: prokaryotes.csv
The list of human-host virus genomes: viruSITE_human_host.txt

Alternatively, you can generate the files yourself by following the instructions below:

viruSITE (viruSITE human host): click "Format: CSV"

NCBI Prokaryotes: Filters -> Host (Homo sapiens) -> Assembly level (Complete) -> RefSeq category (representative) -> Download [prokaryotes.csv]

Note that you can edit the list of viruses or prokaryotes in viruSITE_human_host.txt and prokaryotes.csv with any text editor to further customize the genome. The folder holding your reference input should then contain the following files:

# The input path, e.g. /home/user/data/refinput, should contain:
hg38.fa  hg38.unique_gene_names.gtf  prokaryotes.csv  viruSITE_human_host.txt
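
Before editing the configuration, it can help to confirm that all four inputs are actually present. The following is a minimal sketch, not part of Vulture: `check_ref_inputs` is a hypothetical helper name, and the file names are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Vulture): report which of the four
# expected mkref input files are missing or empty in the given folder.
check_ref_inputs() {
    local dir="$1"
    local missing=0
    local f
    for f in hg38.fa hg38.unique_gene_names.gtf prokaryotes.csv viruSITE_human_host.txt; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call; replace the path with your own reference input folder:
# check_ref_inputs /home/user/data/refinput && echo "inputs look complete"
```

The function returns a non-zero status if any file is absent, so it can gate a later command with `&&`.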

Next, edit the mkref profile in Vulture/nextflow/nextflow.config:

...
    mkref {
        process.container = 'public.ecr.aws/b6a4h2a6/scvh_mkref:latest'
        docker.enabled = true
        params.ref = '[The full path of your reference genome input folder, e.g. /home/user/data/refinput]'
        params.humanfa = 'hg38.fa'
        params.humagtf = 'hg38.unique_gene_names.gtf'
        params.viruSITE = 'viruSITE_human_host.txt'
        params.prokaryotes = 'prokaryotes.csv'
        params.outdir = '[The full path of your reference genome output folder, e.g. /home/user/data/genome]'
        docker.fixOwnership = true
        docker.containerOptions = "--user root"
    }  
...

Afterwards, run the following command to make your own reference genome:

nextflow run scvh_mkref.nf -profile mkref -with-report mkref_$(date +%s).html -bg &>> mkref_$(date +%s).log;
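
In the command above, `$(date +%s)` is evaluated twice, so the report and the log can end up with different suffixes if the clock ticks between the two substitutions. A minimal sketch of a variant that captures the timestamp once is shown here; `run_mkref` and `stamp` are illustrative names, not part of Vulture, and the Nextflow options are the same ones used above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: launch mkref with one shared timestamp so that the
# HTML report and the log file carry the same suffix.
run_mkref() {
    local stamp
    stamp="$(date +%s)"   # seconds since the Unix epoch, captured once
    # ">> file 2>&1" appends both stdout and stderr, like "&>>" in Bash
    nextflow run scvh_mkref.nf -profile mkref \
        -with-report "mkref_${stamp}.html" -bg >> "mkref_${stamp}.log" 2>&1
}

# Run from the folder that contains scvh_mkref.nf:
# run_mkref
```

Because `-bg` sends Nextflow to the background, the function returns immediately; progress can be followed with `tail -f` on the log file.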

The output files should be in the folder you specified in the nextflow.config file, e.g. /home/user/data/genome. There will be a subfolder named "newref" in the output folder, which contains the following files:

3M-february-2018.txt
737K-august-2016.txt
human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.fa
human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses_reference_set
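
Once the run has finished, the listing above can be checked with a short script. This is a sketch under assumptions: `check_newref` is a hypothetical helper, it only tests that each entry exists (without assuming whether it is a file or a directory), and it counts records in the combined FASTA as a rough sanity check.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Vulture): confirm that the expected
# entries exist under <outdir>/newref and print the number of FASTA records.
check_newref() {
    local newref="$1/newref"
    local prefix="human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38"
    local f
    for f in 3M-february-2018.txt 737K-august-2016.txt \
             "$prefix.fa" "$prefix.removed_amb_viral_exon.gtf" \
             human_host_viruses_reference_set; do
        if [ ! -e "$newref/$f" ]; then
            echo "missing: $newref/$f" >&2
            return 1
        fi
    done
    # Every FASTA record starts with '>', so this counts the sequences
    # (human, viral and microbial) in the combined reference.
    grep -c '^>' "$newref/$prefix.fa"
}

# Example call; replace the path with your own output folder:
# check_newref /home/user/data/genome
```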

These files serve as the input for the next step of the Vulture pipeline. They are also available from the download links below:

Hg38 human genome with viruses and microbes: human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.fa
Hg38 human genome with viruses: human_host_viruses.viruSITE.with_hg38.fa
Hg38 human genome annotation with viruses and microbes: human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.removed_amb_viral_exon.gtf
Hg38 human genome annotation with viruses: human_host_viruses.viruSITE.with_hg38.removed_amb_viral_exon.gtf