Sequencing data will typically be provided to you in fastq format (.fq or .fastq) or as a compressed gzipped fastq (.fq.gz) in order to save space. We can view a gzipped file with the zless command, let’s take a look:
cd ChIP-seq_workshop/fastq # Move into the fastq directory (if not already)
zless Input_R1.fq.gz | head -n 12
Fastq files contain 4 lines per sequenced read:
Next we want to assess the quality of our sequencing data and check for any biases and contamination.
When running a sequencing pipeline it is useful to know that your sequencing runs contain the types of sequence they’re supposed to. FastQ Screen allows you to set up a standard set of libraries against which all of your sequences can be searched. Your search libraries might contain the genomes of all of the organisms you work on, along with PhiX, Vectors or other contaminants commonly seen in sequencing experiments. We will run a screen of our sequences against human, mouse, rat, e.coli and s.cerevisiae (defaults):
cd .. #Move up a directory again
fastq_screen --conf /homes/genomes/tool_configs/fastq_screen/fastq_screen.conf fastq/*fq.gz --outdir fastq
# * is a wild card character
Once complete take a look at the output images in your browser via your public html folder. This shows that most of your reads align to the yeast genome and that no reads align uniquely to other organisms:
FastQC provides simple quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before proceeding.
fastqc fastq/*.fq.gz
FastQC will create report files for each of your datasets which we can view in the browser. We will go through each of the images during the workshop. For future reference, specific guidance on how to interpret the output of each module is provided in the fastqc help pages.
An example of poor quality sequencing at the end of short reads:
The software gives a pass, fail or warning flag for each test based on what we would expect from a regular DNA-sequencing run. It is important to realise that FastQC does not understand the origin of your data and that different datasets will have different characteristics. For instance RNA sequencing often involves the use of random hexamer primers that are not as random as you might expect. The profile below in the first ~15 bases is perfectly normal for these samples but will be flagged as an error by FastQC:
Bisulfite treatment (used to investigate DNA methylation) converts most Cs in the genome to Ts, as we can see below. FastQC will not be happy with this profile, the point is, understand what you have sequenced and what you expect to see rather than blindly trusting the FastQC flag system!
Visit the QCFail website for more examples and advice on quality control for NGS datasets.
We can view summaries of multiple reports at once by using multiqc:
multiqc -o fastq fastq
MultiQC searches for report files in a directory and compiles them into a single report. Open the multiqc report via a web browser to see how the raw datasets compare. Here we have the output of FastQ_screen and FastQC, but MultiQC works with the outputs of many tools other tools which we’ll see later.
If we look at the adapter content and over represented sequences sections we can see a small amount of contamination particularly in the second replicates.
Up until now we have run command line tools on each one of our datasets in serial, this means they run one after the other. In this tutorial we only have a few small datasets and the tools run relatively quickly, but this approach won’t scale well to multiple large datasets. A more efficient approach is to run all of our datasets in parallel, later we will create a script.
Unix has a program called parallel which allows you to run tools on multiple datasets at the same time. The following command would list all of your gzipped fastq files and pipe them into parallel.
ls fastq/*fq.gz | parallel -j 4 fastqc {} &
ps f
First, let’s create a file that lists our sample names so we can feed this into parallel. We could just type this manually, but here fastq files are ‘piped’ into parallel as above but we use regular expression within ‘sed’ to remove the name ending, this can now be used to name all files.
ls fastq/*fq.gz | parallel basename | sed s/.fq.gz// > samples.txt
See if you can adapt the QC portion of the script to use parallel.
Hint:cat samples.txt
will print the names of the samples.
cat samples.txt | parallel -j 4 fastq_screen --conf /homes/genomes/tool_configs/fastq_screen/fastq_screen.conf fastq/{}.fq.gz --outdir fastq
cat samples.txt | parallel -j 4 fastqc fastq/{}.fq.gz
From the FastQC report we can see that the overall quality of our sequencing is good, however it is good practice to perform some pre-processing and filtering of reads. Poor quality sequencing can make a read less alignable so it is good practice to quality trim the ends of reads until we get to the high quality portion. Trimming is not always neccessary as some mapping programs will trim the reads for you or perform soft clipping where only part of a read is required to align but studies have shown that pre-processing generally improves alignment rate if done correctly.
Sequencing libraries are normally constructed by ligating adapters to fragments of DNA or RNA. If your read length is longer than your fragment then sequenced reads will contain the adapter sequence. Adapter removal is also a necessary consideration for your QC workflow, especially if adapters are detected by FastQC.
An example of adapter contamination at the end of reads:
Once reads have been trimmed they will vary in length. You may want to filter out reads that are now too short to be uniquely mapped. Normally a cutoff of 20-30bp is standard.
Trim with caution and think about the consequences of having different length reads later on in your pipeline. In fact, it is possible to overtrim your reads and aggressively remove valid data.
Cutadapt finds and removes unwanted sequences from your high-throughput sequencing reads. Cutadapt can perform quality trimming, adapter removal and read filtering as well as many other operations to prepare your reads for optimal alignment. We will run cutadapt with the following parameters:
We will also run FastQC on the trimmed dataset.
cat samples.txt | parallel -j 4 "cutadapt -a AGATCGGAAGAG -q 20 --minimum-length 36 -o fastq/{}.trim.fq.gz fastq/{}.fq.gz > fastq/{}.trim.cutadapt_report.txt"
cat samples.txt | parallel -j 4 fastqc fastq/{}.trim.fq.gz
To view a cutadapt report:
less fastq/Reb1_R1.trim.cutadapt_report.txt
hit ‘q’ to exit less.
Let’s compare the fastqc reports using multiqc. As you have run it already you need to use the force (-f) flag to get it to overwrite the current report.
multiqc -f -o fastq fastq
Open the multiqc report via a web browser to see how the raw and trimmed datasets compare.