Nextflow and scRNA-Seq processing
Setup the AWS CLI
Install the AWS CLI, and obtain your access key and secret access key in advance by following the instructions at Obtaining AWS Credentials.
pip install awscli
aws configure
# you will be prompted for your access key, secret access key, default region, and output format
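If you prefer to configure credentials non-interactively (for example in a setup script), the same values can be set with aws configure set. A minimal sketch; the key values are placeholders for your own credentials, and us-east-2 matches the region used by the mkref profile later in this guide:
# non-interactive alternative; replace the placeholder values with your own credentials
aws configure set aws_access_key_id AKIAXXXXXXXXXXXXXXXX
aws configure set aws_secret_access_key your-secret-access-key
aws configure set default.region us-east-2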
Clone Vulture source code
Clone the Vulture source code onto your local machine:
git clone https://github.com/holab-hku/Vulture.git
# change into the nextflow directory
cd Vulture/nextflow
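Running the pipeline also requires Nextflow itself. If it is not installed yet, the official installer can fetch it (a recent Java runtime is assumed to be available):
# install Nextflow into the current directory (requires Java), then put it on your PATH
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/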
Create S3 Bucket to store results
# specify bucket names (S3 bucket names must be globally unique, so change these if they are already taken) and save them as bash environment variables
export BUCKET_NAME_TEMP=vulture-temp
export BUCKET_NAME_RESULTS=vulture-results
echo "BUCKET_NAME_TEMP=${BUCKET_NAME_TEMP}" | tee -a ~/.bashrc
echo "BUCKET_NAME_RESULTS=${BUCKET_NAME_RESULTS}" | tee -a ~/.bashrc
# set the region to create the buckets in (us-east-2 matches the mkref profile used later; adjust if needed)
export AWS_REGION=us-east-2
# create S3 buckets with the specified names
aws --region ${AWS_REGION} s3 mb s3://${BUCKET_NAME_TEMP}
aws --region ${AWS_REGION} s3 mb s3://${BUCKET_NAME_RESULTS}
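To confirm both buckets were created, a quick sanity check:
# list your buckets and filter for the two just created
aws s3 ls | grep vulture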
Run Vulture pipeline - 1. Build Genome reference
Now we are ready to run the first step of the Vulture pipeline, mkref (genome reference building). Execute the command below in your favourite terminal or PowerShell and wait for it to finish.
# capture one timestamp so the report and the log share the same suffix
TS=$(date +%s)
nextflow run scvh_mkref.nf -profile mkref -bucket-dir s3://${BUCKET_NAME_TEMP} --outdir=s3://${BUCKET_NAME_RESULTS}/batchA -with-report mkref_${TS}.html -bg &>> mkref_${TS}.log
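Because the job is submitted in the background with -bg, progress is written to the log file rather than to the terminal. One way to watch it (the glob assumes the log naming from the command above):
# follow the most recent mkref log
tail -f "$(ls -t mkref_*.log | head -n 1)"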
The input data required for the mkref stage are available at the links below; you can save them into your own S3 bucket:
hg38.fa
hg38.unique_gene_names.gtf
prokaryotes.csv
viruSITE_human_host.txt
Alternatively, you can generate the files yourself by following the instructions below:
viruSITE (viruSITE human host): click "Format: CSV"
NCBI Prokaryotes: Filters -> Host (Homo sapiens) -> Assembly level (Complete) -> RefSeq category (representative) -> Download [prokaryotes.csv]
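However you obtain the four input files, they must live in an S3 location the pipeline can read. A minimal sketch for uploading them, assuming they sit in the current directory; the references/ prefix is an arbitrary example, not a path the pipeline requires:
# upload the mkref inputs to your own bucket (the prefix is only an example)
aws s3 cp hg38.fa s3://${BUCKET_NAME_RESULTS}/references/
aws s3 cp hg38.unique_gene_names.gtf s3://${BUCKET_NAME_RESULTS}/references/
aws s3 cp prokaryotes.csv s3://${BUCKET_NAME_RESULTS}/references/
aws s3 cp viruSITE_human_host.txt s3://${BUCKET_NAME_RESULTS}/references/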
After the mkref job is done, edit the "params.ref" line in the "nextflow/nextflow.config" file to point to the S3 path that holds your output reference genome files, i.e. "s3://${BUCKET_NAME_RESULTS}/batchA". Alternatively, you can download the prebuilt files from the links below and store them in your own S3 bucket:
human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses_microbes.viruSITE.NCBIprokaryotes.with_hg38.fa
human_host_viruses.viruSITE.with_hg38.removed_amb_viral_exon.gtf
human_host_viruses.viruSITE.with_hg38.fa
...
mkref {
    aws.region = 'us-east-2'
    process.container = 'public.ecr.aws/b6a4h2a6/scvh_mkref:latest'
    process.executor = 'awsbatch'
    process.queue = 'vulture-stdq'
    // this line needs to be changed to your reference path
    params.ref = 's3://vulture-reference/humangenome/'
}
...
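If you go with the prebuilt reference files rather than the mkref output, one way to stage them is to download the four files locally and sync them to the path that params.ref will point to (./prebuilt_ref/ is a hypothetical local directory holding the downloads):
# copy the downloaded reference files into the S3 path referenced by params.ref
aws s3 sync ./prebuilt_ref/ s3://${BUCKET_NAME_RESULTS}/batchA/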
Run Vulture pipeline - 2. Start main analysis
Before we start the analysis, we need to edit the "nextflow/params.yaml" file to include the reads of interest. Here is a snippet of what the "params.yaml" file looks like:
...
soloStrand: "Forward"
alignment: "STAR"
technology: "10XV2"
virus_database: "viruSITE"
soloMultiMappers: "EM"
soloFeatures: "GeneFull"
inputformat: "bam"
soloInputSAMattrBarcodeSeq: "CB UB"
barcodes_whitelist: "None"
reads:
  - "SRR6885502"
  - "SRR6885503"
  - "SRR6885504"
  - "SRR6885505"
  - "SRR6885506"
  - "SRR6885507"
  - "SRR6885508"
...
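Before submitting, it can be worth verifying that the edited file is still valid YAML. A quick check, assuming Python with the PyYAML package is available:
# parse params.yaml and fail loudly on a syntax error
python -c "import yaml; yaml.safe_load(open('params.yaml')); print('params.yaml OK')"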
Execute the command below to start the main analysis of Vulture.
# capture one timestamp so the report and the log share the same suffix
TS=$(date +%s)
nextflow run scvh_full.nf -profile batchfull -params-file params.yaml -bucket-dir s3://${BUCKET_NAME_TEMP} --outdir=s3://${BUCKET_NAME_RESULTS}/batchD -with-report report_bam_${TS}.html -bg &>> submitnf_bam_${TS}.log
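As with mkref, the run detaches into the background. You can follow the newest log and, once the run completes, list the outputs in the results bucket:
# follow the most recent submission log
tail -f "$(ls -t submitnf_bam_*.log | head -n 1)"
# after completion, inspect the outputs
aws s3 ls s3://${BUCKET_NAME_RESULTS}/batchD/ --recursive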