### Life 708
***
# Identifying Variation
***
### Will Rowe
View the presentation online at:
[will-rowe.github.io/identifying-variation-2017](https://will-rowe.github.io/identifying-variation-2017)
---
## Aims of these slides
***
* introduction to genome variation
* short read alignment
* interpreting alignments
* calling variants
----
## Introduction
***
**genome variation** = differences in DNA content / structure between two organisms
---
### Why look at variation?
***
* allows us to study evolution
- phylogenetics
- population genomics
* let's us make genotype / phenotype associations
- genome wide association studies (GWAS)
- human disease, agriculture, genetic engineering
---
### What is variation?
***
* Single Nucleotide Polymorphisms (SNPs)
- a genetic "typo" of one nucleotide (e.g. A > G)
* INsertion / DELetions (INDELs)
- a string of one or more nucleotides that has been added/removed from a location in a genome (typically 1-100bp)
* Structural Variants (SVs)
- a region of DNA that has been inverted / translocated / duplicated (typically >100bp)
* Mobile Genetic Elements (MGEs)
- insertion / replication of retrotransposons, transposons, integrons etc.
---
### How do we identify variation?
***
* Assembly and comparative genomics (eg. BLAST / ACT) good for identifying SVs and MGEs
* Comparison of sample DNA to a reference allows identification of variation at a given genomic position
* Sequencing reads from a sample are **aligned** against a reference genome
* A **pileup** of the reads is essentially a sample of all the alleles present in the sampled cell population
- greater sequencing coverage = bigger sample size = greater certainty in variant calls
- considerations must be made for ploidy, sequencing quality / errors etc.
---
![IGV](https://image.slidesharecdn.com/bbr-1-140522075509-phpapp01/95/genome-browsing-genomic-data-mining-and-genome-data-visualization-with-ensembl-biomart-and-igv-uebuat-bioinformatics-course-session-13-vhir-barcelona-72-638.jpg?cb=1402537040)
----
## Short Read Alignment
***
* To call variants in a sample, we need to align sequencing reads to a reference genome
* As discussed in previous lectures, we may have millions of short reads and a large reference genome (millions / billions of nucleotides)
* How do you align reads on this scale, in a reasonable amount of time / memory whilst allowing for inexact matching / errors?
* Other considerations also include: appropriate reference genomes, orientation / pairing of reads, multiple matching
* Short read alignment is hard as the shorter the read, the less likely a unique match to the reference
---
Computationally, exact matching is relatively easy:
```
READ 1: ATTCTTTACGACGAGTG
REFERENCE: AGCGTGACATTCTTTACGACGAGTGATATTCTGTAATAT
MATCH: *****************
```
***
Inexact (fuzzy) matching is much harder:
```
READ 2: ATTCTGTAAGAC
REFERENCE: AGCGTGACATTCTTTACGACGAGTGATATTCTGTAATAT
MATCH: *****G**A*** *********G*C
```
---
### Short Read Alignment
***
* There is an important distinction between read **mapping** and read **alignment**
- mapping is to quickly determine the best locations in the genome that a read could align
- alignment is to determine the best per-base alignment of a read to each possible mapping location in the genome
* Short read alignment is **global** with respect to the read and **local** with respect to the reference
- ideally the whole read matches one location in the reference genome
- in reality, repeats, variation and sequencing error add complexity
---
### Approaches to short read alignment
***
* A brute force approach would involve looking at every position in a reference for a possible query match
- this is not efficient
- indexing the reference so we know where possible matches occur improves the efficiency
* Hash-based indexing
- hashing of either the sequencing reads or the reference genome
* Indexing of trie structures
- utilising a reference index that is derived from a trie data structure
---
### Hash-based alignment
***
* Reads or reference sequence split into k-mers
* A hash function is used to convert k-mer strings to integers
* The integers are used as an array index for fast searching of the query sequence(s)
![hash-table](http://everythingcomputerscience.com/images/phone_book_HashTable.jpg)
---
### Hash-based alignment
***
* Once the index has been built, the query is split into k-mers
- lookup query k-mer locations in the reference via the index
- for each query, select the reference location with most k-mer hits and perform Smith-Waterman alignment of read/reference (dynamic programming)
- this is the **seed** and **extend** principle
* The MAQ and SHRiMP short read aligners are based on this approach
- hash index the reads, scan using the reference sequence and use mapping quality scores to determine best alignments
- allowing seed mismatches and a spaced-seed approach is used to improve seed-matching sensitivity
---
![hash-table-2](https://image.slidesharecdn.com/20101210ngscourse-101208080158-phpapp01/95/20110114-next-generation-sequencing-course-56-728.jpg?cb=1303919918)
---
### Hash-based alignment
***
* Hash-based approaches have some downsides
- static table means resizing can be costly - bad for a dynamic reference
- search performance reduces when reaching table capacity
* Majority of modern short-read aligners are **suffix/prefix trie**-based aligners
- main advantage over hash-based is that alignment of identical sequences is only done once
- memory footprint can also be much lower than hash-based approach
---
### Indexing of trie structures
***
* A trie is a computational data structure
- advantages of this structure include: fast lookup, no need for hash functions, no key collisions
* For short read alignment, the trie structure holds all the suffixes of a reference sequence as an index
- one trie node per common suffix
- this enables fast string matching (query look up)
---
![trie-structure](http://marknelson.us/attachments/1996/suffix-trees/FIGURE1.gif)
* Example suffix trie for the text BANANAS
- all suffixes of BANANAS are found in the trie
- search for substrings by starting at the root and following matches down the tree until exhausted
- the paths of this trie can be compressed by removing nodes with single descendants
---
### Algorithms based on suffix/prefix tries
***
* these algorithms find exact matches and build inexact alignments (supported by the matches)
* to identify exact matches, the algorithms use a representation of the suffix/prefix trie
- E.G. suffix tree, suffix array, FM-index
- a special terminal character ($) is used to denote the end of the suffix
- FM-index is widely used due to it's small memory footprint
---
![langmead-tries](http://www.langmead-lab.org/wp-content/uploads/2014/01/Screen-Shot-2014-02-03-at-12.11.23-AM.png)
Suffix Trie ----- Suffix Tree ----- Suffix Array ----- FM Index
---
### The FM-index
***
* The FM-index is used by short read aligners such as Bowtie and BWA
* To understand how the FM-index is queried during read alignment, let's first look at the **Burrows-Wheeler Transform**
* The Burrows-Wheeler Transform (BWT) was originally a method to improve algorithmic efficiency of text compression but is widely used in bioinformatics software
---
### The Burrows-Wheeler Transform
***
* for a given reference sequence, T
* get all rotations of T
* form the Burrows-Wheeler Matrix (BWM)
* sort the BWM lexicographically
* the last column of the sorted matrix is the BWT of the reference, i.e. BWT(T)
---
![BWM-1](http://homolog.us/Tutorials/Tut-Img/Set6/BWT3.png)
* get all rotations of the string "homolog.us"
---
![BWM-2](http://homolog.us/Tutorials/Tut-Img/Set6/BWT4.png)
* sort the BWM lexicographically and look at the last column
---
### The Burrows-Wheeler Transform (BWT)
***
* The BWT of a string allows for efficient compression
- runs of the same character can be condensed
- this is what computing tools such as bzip2 are based on
- the BWM is not stored - only the BWT is used
* The BWM resembles the suffix array
- sorted order is the same, regardless if using rotations or suffixes
- this means, rather than finding all rotations, we can generate BWT(T) by:
```python
if SA[i] > 0:
BWT[i] = T[SA[i] - 1]
if SA[i] == 0:
BWT[i] = $
```
---
![langmead-tries](http://www.langmead-lab.org/wp-content/uploads/2014/01/Screen-Shot-2014-02-03-at-12.11.23-AM.png)
BWT = the characters one to the left of the suffixes in suffix array
---
### The Burrows-Wheeler Transform (BWT)
***
* The BWT is reversible - you can get the original reference sequence back
* For this we utilise the **Last-to-Front** (LF) mapping property
* This property means that the order of characters in the first column is the same as that of the last column
* Let's look at the homolog.us BWM example with all O's coloured by occurrence
---
![BWM-3](http://homolog.us/Tutorials/Tut-Img/Set6/BWT6.png)
---
![BWT-1](https://www.researchgate.net/profile/Cole_Trapnell/publication/24177230/figure/fig3/AS:267670977249315@1440829142907/Figure-1-Burrows-Wheeler-transform.png)
---
### The Burrows-Wheeler Transform (BWT)
***
* Now that we have seen:
- how to perform the BWT
- how BWT can be used to compress a string
- how BWT can be reversed
* How is it used as an index in short read alignment?
- the **FM Index** allows for this and is based on BWT
---
### The FM Index
***
* Stands for Full-text Minute-space
* FM index combines BWT with a few additional data structures
- checkpoints and a suffix array sample
* Querying the FM index is based on the LF property
- the LF function matches character and rank
- this can be used to reconstruct T from BWT(T) or to query the reference
- the FM index stores extra information to facilitate this calculation
---
![](https://willrowe.net/_slides/slide-data/identifying-variation/LF-lang.png)
---
### The Burrows-Wheeler Transform (BWT)
***
* We have seen how the FM index can be used to find exact sequence matches
- requires small amount of memory
- short read alignment needs allowances for mismatches and read quality etc.
- solutions such as backtracking quality-aware searching facilitate this
* For more information on BWT, [this](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf) lecture series by Ben Langmead is highly recommended
* For how it is implemented in short read aligners, [this](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943993/#!po=25.0000) review and [this](https://arxiv.org/abs/1303.3997) paper are good resources
----
## Alignments
***
* Now that we know the basic principles behind short read aligners, let's go over the alignment results
* There is a single unified file format for storing read alignments to a reference genome called **SAM** format
* SAM stands for Sequence Alignment Map
- there is a binary version of this file format: **BAM**
- BAM allows for fast processing/indexing of the alignment
- there is also **CRAM** format (not discussed further) that offers reference-based compression
* We will explore the SAM format during the practical and use **samtools** to get basic alignment statistcs, as well as manipulate and convert alignment files
---
### Alignment Post-processing
***
* Sorting, filtering and indexing the alignment
- speeds up downstream analysis and reduces file size
- can be done using piped commands (holding alignment in memory)
* Local realignment
- InDels can cause reads not to align correctly
- multiple sequence alignment is performed in suspect alignment regions
* Duplicate removal
- duplicate reads may be PCR artifacts and these need to be marked/removed so that variant call is not biased (reduces false positives)
* Base quality score recalibration
- per-base recalibration uses reported quality score, read position and dinucleotide context
---
### Variant Calling
***
* To call variants from a pileup of reads, we generally use a probabilistic method (e.g. Bayesian model)
- tools such as Freebayes, GATK HaplotypeCaller, Samtools/BCFtools and Varscan are commonly used
* These programs calculate **genotype likelihoods** at each base position
- variants can be called relative to multiple samples
* Mapping quality, base quality and read pair information etc. is utilised in making the calls
* Variant calls are contained in a Variant Call Format (VCF) file
---
### Variant Filtering
***
* Once we have a VCF file, we apply filtering to reduce false positives
* We can filter using criteria such as depth, variant frequency, strand balance
* We could also use information on known variants to recalibrate our variant quality scores
* Once filtered, we should interpret our results and then see if we need to reapply the filters
- check SNP density, transition/transversion ratio etc.
---
![GATK](https://us.v-cdn.net/5019796/uploads/FileUpload/eb/44f317f8850ba74b64ba47b02d1bae.png)
----
## References
***
* [IGV](http://everythingcomputerscience.com/images/phone_book_HashTable.jpg)
* [suffix tries](http://marknelson.us/1996/08/01/suffix-trees/)
* [langmead lab](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf)
* [homolog.us](http://homolog.us/Tutorials)
* GATK
* A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010 Sep;11(5):473-83. doi: 10.1093/bib/bbq015. Epub 2010 May 11.
* Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25 Article in Genome biology 10(3):R25 · April 2009