Visualisation

8. Creating Tracks

Converting BAM files to bigWig

Visually inspecting data via a genome browser is often the first step in any analysis. Has the sequencing worked as expected? Are there noticeable differences between my samples by eye? BAM files are typically large and contain detailed information for every read, thus they are slow to access and view on a genome browser. If we are only interested in read profiles we can convert our BAM files to graphs of sequencing depth per base. The wiggle file format has three columns to represent a genomic interval and a 4th as a score which will represent the number of reads overlapping that region. We will create a compressed version of this format called bigWig for use with genome browsers. To do this we are going to use the deepTools package which has a lot of useful tools particularly for ChIP-seq analysis.

The bamCoverage tool in deepTools has many different options for normalising, smoothing and filtering your BAM files before converting to bigWig and the documentation is worth a read.

cat samples.txt | parallel -j 4 "bamCoverage -bs 1 --normalizeUsing BPM -e 200 -b bwa_out/{}/{}.uniq.bam --outFileName bwa_out/{}/{}.uniq.bw"

-bs is the bin size. We set this to 1, but we could smooth our read coverage into larger bins across the genome.
–normalizeUsing BPM is a normalisation strategy akin to TPM in RNA-seq and stands for “bins per million”. There are several strategies to choose or we can ignore normalisation.
-e 200 extends our reads to 200b as this is the average fragment length. With paired end data the reads would be extended to match the ends of both reads but with single end we must supply a value.

9. Scripting

Putting all of your commands into a script is good practice for keeping track of your analyses and for reproducibility. We want to write scripts so they can be used again on different datasets and avoid hardcoding the names of files.

We now have our alignments (BAM) and visualisation files (bigWig) and this is normally a branching point for downstream analyses that quantify and annotate the data.

Here is a pipeline containing everything we have run so far; pipeline.sh

Looking at this pipeline script you’ll notice we are not running parallel every time we run a command, instead we launch the script using parallel and the sample names are passed using the special ‘$1’ bash variable. This is more efficient as the samples will run through each command independently.

Using this we can re-run everything from start to end in one go.

mkdir CS_workshop_tmp #Create a temporary directory
cd CS_workshop_tmp #Move into that directory
cp /homes/library/training/ChIP-seq_workshop/pipeline.sh . # Copy the pipeline into the new directory
cp ../samples.txt . #Copy the samples file
nohup cat samples.txt | parallel -j 4 bash pipeline.sh {} > pipeline.log & #Run the shell script (See Below) 
cd .. #Move back to the main directory

The nohup command stands for no hangup and keeps the process running even if you log out of your command line session.
We use the tool bash and the name of our script to run our commands through the shell (‘sh’ will also work).
We then redirect > the output to a log file to keep track of any errors.
We use the & to run everything in the background to continue using the terminal.

You can keep track of your pipeline by using ps or looking at the log file.

ps f
tail CS_workshop_tmp/pipeline.log # Shows the end of a file

Note any commands run on multiple samples will need be to run separately (after pipeline finished)

multiqc CS_workshop_tmp/fastq -o CS_workshop_tmp/fastq

Tidy Up!

Files are large, disk space is expensive, remove any unwanted or temporary files from your folder. We should always keep the raw data (fastq) and our final processed datasets (BAM, bigWig etc) and the script we used to generate them. SAM files are large and should be removed once converted to BAM.

rm fastq/*trim*fq.gz #Remove trimmed fastq temp files
rm bwa_out/*/*.sam #Remove sam files

Integrative Genomics Viewer (IGV)

IGV (Integrative genomics viewer) is a powerful genome browser from the Broad institute. It is perfect for viewing your sequencing files as loading data is easy, common formats are understood, views are customisable and navigation is quick.

First let’s put all of our visualisation files in one folder.

mkdir visualisation
ln -s $PWD/bwa_out/*/*bw visualisation
ln -s $PWD/bwa_out/*/*uniq.bam* visualisation

It is best to download the desktop app for use on your own machine but if you are using a university managed computer you can access the web app at https://igv.org/app/.

Open IGV and set the genome to sacCer3, then find your visualisation folder online. In the desktop version you can drag and drop files into IGV. If you are using the webapp you will need to download the files you require and open them using the tracks menu.

Open a BAM file and a corresponding bigWig file into IGV. Make sure the .bai index file is in the same folder as the BAM.

First, let’s navigate the chromosome:

Use the + / - zoom bar at the top right
Drag and highlight an area to zoom on the genome ruler
Type a position or gene identifier in to the search box
If we zoom in close enough our BAM file will load and we can investigate alignments of individual reads

Now let’s customise our view:

DESKTOP APP

Select Tracks->Fit Data To Window To automatically set track heights to fill the view
Right click on the Refseq genes track and switch between display modes Collapsed, Expanded and Squished. (Use the horizontal panel divider to adjust height)
Use the right click menu on the bigwig track name to customise the display. Try the following:
Rename the track
Change the track colour
Change the graph type
Change the windowing function
Adjust the data range
Set the scale to Autoscale. This automatically sets the Y-axis to the height of the tallest peak in your view.

WEB APP

Use the cog on the Ensembl genes track to switch between display modes Collapsed, Expanded and Squished.
Customise the bigwig track by trying the following:
Rename the track
Change the track colour
Change the track height
Set a minimum and maximum Y-axis
Set the scale to Autoscale. This automatically sets the Y-axis to the height of the tallest peak in your view.

Challenge:

One BAM vs bigWig

Why are the coverage profiles different?
Which format is faster to view?
Can you identify mismatches in read alignment?
What other information do the BAM files give us?

Now let’s load all our bigwig files

Can you identify peaks in the ChIP samples
Do you see any peaks that are also present in the Input?

Other genome browsers

Ensembl and UCSC websites have online genome browsers where you can view your data alongside published experimental data and genomic annotations in their databases.

Key Aims:

Create tracks from our data
Visualise alignments on a genome browser