Quality Control
Learning objectives
- Understanding the FastQ format
- Interpret FastQC reports
- Create high quality reads by trimmign and filtering with FastP and BBduk
FastQ format¶
First let's take a look at the data
zcat /data/reads/sample_0.fq.gz | head
The output looks like this
@gi|1184849861|gb|KY629563.1|-34525/1
CTGATCAACGTGGGAATGAACATGGAGCAGAGCGGCCAGTGTCTCAACCTGTGCTCCGCCTTCATGCACTCTCGCGGCGCCTTCAAGCCCGGCGACATCGACGTGGAAATTTTCACCATGCCCACCAAGTTCGGGCGCGTCAGCACCACG
+
DDDGGEG*DIIGH?KKHJGHGKKKJKKJJ9JDK=JKFKKGJHIKHEIJJJKKGIHKGDE1DKA>KEGBEEEBGI?BBEGHEEEFEEE9ED;FCB3BEEDCBEE5EEC'CA$E9EEEAEEEFEE?FFCEEE;$EE)DBCDEEDDECAAE'$
@gi|1184849861|gb|KY629563.1|-34523/1
GCCTTTGTCGGGTAAGGGGTGTGGCCCTCCTCCCGACAAGGCGGGCCACGGTTCGCCAGCGAACTAGGTCGGGGAGCTGGGAAGGAGCCGGAATCGGGTGGCCCCAATTTCGGGGAGAGGTTTGGGCGTCAGCCGCCCGGAAGCTCGTCG
+
DDDE2GGGIIIIIKKJKHJKEDJ=AIHKKHKJKKKKJHBIE$KJJJCEKJB=JH@JKEHKGEEEEDAJECDGC?EIFEBEDDF6FGEEDEE$FBDCEEA@EE$4EE$E??CDE?ED;CCDE1DEBE;ECC$?$$AEDA;EDD@$EEDE=F
@gi|1184849861|gb|KY629563.1|-34521/1
AGACCGAGCCCTTCCTTATAGTGGATTTCTCCGGTTCCGTCAACCAAATTTGCAGTAATGCGGGAGCGACGCTTCCACGAAATGGTCACTGTCCCGTCCACCTCGGAGAGCTTTACGCTTTCCGGTGTGTAGGGCTTGAGATCGATCGAT
There is a pattern that is repeated every 4 lines. This is the FASTQ format. It follows this structure:
Line | Description |
---|---|
1 | Always begins with '@' and then information about the read |
2 | The actual DNA sequence |
3 | Always begins with a '+' and sometimes the same info in line 1 |
4 | Has a string of characters which represent the quality scores; must have same number of characters as line 2 |
The fourth line shows the quality of the read. To make the sequence and the qaility align well, the numerical score is converted into a code where each individual character represents the numerical quality of an individual nucleotide, following this scheme:
FASTQ quality encoding
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
Quality score: 01........11........21........31........41
NextSeq and NovaSeq data often contains poly-g tails. Let's check if this also is the case for our simulated input
$ zcat /data/reads/sample_0.fq.gz | grep -E "GGGGGGGGGGGGGGGGGG$"
This search does not retun any hits. But often the end of the sequences contain stretches of poly-g. Below is an example how poly-g tails look like in Novaseq data.
Sequence data with poly-g tails
GGCCAGGACCACGCGGTGGAGCAGCGCGCCGCGGCGCCGGGCGCGCTTGACGACCAGTCTCTTATAAACATATCCCAGCCCACGAGACCACCAGTTGCATCTCGTATTCCGTCTTATGCTTGTATATTGGGGGGGGGGGGGGGGGGGGGGG
GAGTCGATCGAGGAGATGAAGCACGCGGAGAAGGTCATTCACCGCATCCTCTACTTCGATGCTGTCTCTTATACACATCCCGAGCCCACGAGACCACCTGTTGCATCTCGTATGCCGTCTTCTGCTTGAAAAGGGGGGGGGGGGGGGGGGG
GGCCCGTGCAGTTCGAGATCATCTCCGAGCCCACGAGACGACCGGGTGCTTGGCGGGAGCGGCGGGGTGGTTTTATCTTCGTGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGCGGCAACGCCGACACCTACCAGCTACAGCAGACGGTGCGCCGCACCATCGGCCGCTGGGTCGGCGGCCGGCTGCGCCGCCGTCCCAAGATAATCCCGGTGGTGGGGGGGGGTGGTTGGGGGGGTTGTGGGGGGGGGGGGGGGGGGGGG
GGTAGGGCCGCTCGAGAAGCTCGCACAGCATGCGGCCGAATTCGCGGTACATGCATACGTTGACGTCGGCGGGGGGGGGGGGGGGGGGAGAGCACCGGGGGGCGCACAGGGCGGCCGGGTGGGGGTCGTGGTGGGGGGGGGGGGGGGGGGG
GGGGAGGAGGGAGACGCGCCCGCGGGCGTGCCCGCCGCCGCGGCGATGCCCTTGAAGGCCGCTGGTGTGTCCTTCAGCGCCTGCGCCGCGAGTTCGCCGAACTGCTGCGTCAACGCGCCCCACCACTGCTGCGGGGGGGGGGGGGGGGGGG
GTACCCGAGCCGCTCCGCGTGCCGCCGGACCTCCGACGCGTGAGGCCCGGAGCCGGTCGCTGACCAGAGGGCCAGGCCGAAGCGATGGGGAGTGGTCGCGACGAACCCGCAGCGCTGTCCGGCGGCGGGGGGGGGGGGGGGGGGGGGGGGG
GAAGGTGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGCGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GCGCCACGCCGAGCACCGACGGCATCATCGGCACCCACACGCCGTGTGTGAACCTGTCTCTTATACACATCTCCGAGCCACGAGACCACCTGTTGCATCTCGTATGCCGTCTTCTGCTTGAAAATGGGGGGGGGGGGGGGGGGGGGGGGGG
CTCTTCATCCGTTCCGGCGCCTGCATCCATTCCCGCGGCGCCGGTTGGCGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
FastQC reports¶
Let's check the quality of the data with FastQC before quality filtering:
mkdir fastqc_untrimmed_reads
fastqc --threads 8 /data/reads/sample_0.fq.gz -o fastqc_untrimmed_reads/
Question
Open the FastqQC report fastqc_untrimmed_reads//sample_0_fastqc.html
Quality trimming and filtering¶
Because NovaSeq and NextSeq from Illumina contain often poly-g tails, it is good to use a trimming tool that can detect poly-g tails. Fastp can do that. We keep the defaults like they are, but specify we have interleaved input data. In case you have R1 and R2 file for each sample, you need to enable adapter detection with the --detect_adapter_for_pe
flag. Execute fastp with this command:
fastp -i /data/reads/sample_0.fq.gz \
--stdout \
--interleaved_in \
-q 25 \
--cut_front \
--cut_tail \
--cut_mean_quality 25 \
-l 51 \
--thread 16 \
--trim_poly_g > sample_0.trim.fastq
Now make a FastQC report again, to see the results of the quality filtering.
mkdir fastqc_trimmed_reads
fastqc --threads 8 sample_0.trim.fastq -o fastqc_trimmed_reads/
Exercise
- Are there any adapter sequences detected?
- Take a look at the html report
Alternative trimming with bbduk¶
Alternative trimming with bbduk. Compare poly-g tail filtering, adapter trimming.
bbduk.sh in=/data/reads/sample_0.fq.gz \
out=sample_0.trim.bbduk.fastq.gz \
interleaved=true \
trimpolygright=1 \
qtrim=w trimq=20 \
minlength=51 \
ref=/data/databases/nextera.fa.gz ktrim=r \
stats=bbduk.stats \
t=16
Question
- Create also an FastQC report for the trimming with bbduk
- What are differences between filtering with fastp and bbduk?
Remove PhiX sequences¶
bbmap.sh ref=/data/databases/phix174_ill.ref.fa.gz \
in=sample_0.trim.fastq \
interleaved=true \
outu=sample_0.nophix.fastq.gz \
outm=sample_0.phix.fastq.gz \
t=4
Exercise
- How many PhiX sequences are detected?
- From which samples?
- Can you confirm the sequences are PhiX?
- The input data was simulated without adding any PhiX. How could there still PhiX sequences being detected?
From one sample to many¶
From one sample to many
Now do a QC for all samples. You can use a for loop for that. For example, fastp can be run like:
for file in /data/reads/*.gz; do \
sample=$(basename ${file} .fq.gz);
fastp -i $file --stdout --interleaved_in -q 25 --cut_front --cut_tail --cut_mean_quality 25 -l 51 --thread 16 --trim_poly_g > $sample.trim.fastq;
bbmap.sh ref=/data/databases/phix174_ill.ref.fa.gz in=$sample.trim.fastq interleaved=true outu=$sample.nophix.fastq.gz outm=$sample.phix.fastq.gz t=4;
done
Quickly check what is in this metagenome¶
Sendsketch¶
sendsketch.sh --in=sample_0.nophix.fastq.gz threads=4 address=refseq
Sourmash¶
Note: The sourmash database is not included in the VM because it can't be downloaded at the moment. There is an overview of all prepared databases The 51 kmer set of representative genomes would be a good one to use when available
Now try it locally, using sourmash. First create a signature for a sigle sample.
sourmash sketch dna -p scaled=10000,k=51 sample_0.nophix.fastq.gz -o sample_0.sig
Find what is in this metagenome using the gather
command:
sourmash gather sample_0.sig /data/databases/gtdb-rs207.genomic-reps.dna.k51.lca.json.gz
Kraken¶
Compare with kraken
First setup the database
mkdir kraken_db
tar zxvf /data/databases/k2_standard_08gb_20220926.tar.gz -C kraken_db
Now run kraken2
kraken2 --db kraken_db/ --threads 16 --output sample_0.kraken --report sample_0.kraken.report --gzip-compressed --use-names sample_0.nophix.fastq.gz
Question
Take a look at the sample_0.kraken.report
. What are the differences in classification compared to sendsketch and sourmash?