### Life 708

***

# Identifying Variation

***

### Will Rowe
View the presentation online at: [will-rowe.github.io/identifying-variation-2017](https://will-rowe.github.io/identifying-variation-2017)

---

## Aims of these slides

***

* introduction to genome variation
* short read alignment
* interpreting alignments
* calling variants

----

## Introduction

***

**genome variation** = differences in DNA content / structure between two organisms

---

### Why look at variation?

***

* allows us to study evolution
  - phylogenetics
  - population genomics
* lets us make genotype / phenotype associations
  - genome wide association studies (GWAS)
  - human disease, agriculture, genetic engineering

---

### What is variation?

***

* Single Nucleotide Polymorphisms (SNPs)
  - a genetic "typo" of one nucleotide (e.g. A > G)
* INsertions / DELetions (INDELs)
  - a string of one or more nucleotides that has been added/removed from a location in a genome (typically 1-100bp)
* Structural Variants (SVs)
  - a region of DNA that has been inverted / translocated / duplicated (typically >100bp)
* Mobile Genetic Elements (MGEs)
  - insertion / replication of retrotransposons, transposons, integrons etc.

---

### How do we identify variation?

***

* Assembly and comparative genomics (e.g. BLAST / ACT)
  - good for identifying SVs and MGEs
* Comparison of sample DNA to a reference allows identification of variation at a given genomic position
* Sequencing reads from a sample are **aligned** against a reference genome
* A **pileup** of the reads is essentially a sample of all the alleles present in the sampled cell population
  - greater sequencing coverage = bigger sample size = greater certainty in variant calls
  - considerations must be made for ploidy, sequencing quality / errors etc.
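
The pileup-as-a-sample-of-alleles idea can be sketched as a toy consensus caller (the function name and thresholds here are illustrative, not taken from any real variant caller):

```python
from collections import Counter

def call_site(pileup, min_depth=10, min_freq=0.8):
    """Naively call the consensus base at one position from a read pileup.

    pileup -- the bases observed at this position, one per aligned read
    Returns the consensus base, or None when the call is uncertain.
    """
    depth = len(pileup)
    if depth < min_depth:
        return None  # too few reads sampled to call with confidence
    base, count = Counter(pileup).most_common(1)[0]
    if count / depth >= min_freq:
        return base
    return None  # mixed signal: heterozygosity, ploidy or sequencing error

# 12 reads cover this position and 11 of them support G;
# if the reference base were A, this would be reported as an A>G SNP
print(call_site("GGGGGGGGGGAG"))  # G
```

Real callers replace the fixed frequency cut-off with genotype likelihoods (covered later in these slides), but the coverage intuition is the same: more reads, more confident call.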
---

![IGV](https://image.slidesharecdn.com/bbr-1-140522075509-phpapp01/95/genome-browsing-genomic-data-mining-and-genome-data-visualization-with-ensembl-biomart-and-igv-uebuat-bioinformatics-course-session-13-vhir-barcelona-72-638.jpg?cb=1402537040)

----

## Short Read Alignment

***

* To call variants in a sample, we need to align sequencing reads to a reference genome
* As discussed in previous lectures, we may have millions of short reads and a large reference genome (millions / billions of nucleotides)
* How do we align reads at this scale, in a reasonable amount of time / memory, whilst allowing for inexact matching / errors?
* Other considerations include: appropriate reference genomes, orientation / pairing of reads, multiple matching
* Short read alignment is hard: the shorter the read, the less likely it is to match the reference uniquely

---

Computationally, exact matching is relatively easy:

```
READ 1:             ATTCTTTACGACGAGTG
REFERENCE:  AGCGTGACATTCTTTACGACGAGTGATATTCTGTAATAT
MATCH:              *****************
```

***

Inexact (fuzzy) matching is much harder; here READ 2 has no exact match, only two candidate locations with mismatches:

```
READ 2:             ATTCTGTAAGAC       ATTCTGTAAGAC
REFERENCE:  AGCGTGACATTCTTTACGACGAGTGATATTCTGTAATAT
MATCH:              *****G**A***       *********G*C
```

---

### Short Read Alignment

***

* There is an important distinction between read **mapping** and read **alignment**
  - mapping quickly determines the best locations in the genome at which a read could align
  - alignment determines the best per-base alignment of a read to each possible mapping location in the genome
* Short read alignment is **global** with respect to the read and **local** with respect to the reference
  - ideally the whole read matches one location in the reference genome
  - in reality, repeats, variation and sequencing error add complexity

---

### Approaches to short read alignment

***

* A brute force approach would involve looking at every position in a reference for a possible query match
  - this is not efficient
  - indexing the reference so we know where possible matches occur
improves the efficiency
* Hash-based indexing
  - hashing of either the sequencing reads or the reference genome
* Indexing of trie structures
  - utilising a reference index that is derived from a trie data structure

---

### Hash-based alignment

***

* Reads or reference sequence are split into k-mers
* A hash function is used to convert k-mer strings to integers
* The integers are used as an array index for fast searching of the query sequence(s)

![hash-table](http://everythingcomputerscience.com/images/phone_book_HashTable.jpg)

---

### Hash-based alignment

***

* Once the index has been built, the query is split into k-mers
  - look up query k-mer locations in the reference via the index
  - for each query, select the reference location with the most k-mer hits and perform Smith-Waterman alignment of read/reference (dynamic programming)
  - this is the **seed** and **extend** principle
* The MAQ and SHRiMP short read aligners are based on this approach
  - hash index the reads, scan using the reference sequence and use mapping quality scores to determine the best alignments
  - seed mismatches are allowed and a spaced-seed approach is used to improve seed-matching sensitivity

---

![hash-table-2](https://image.slidesharecdn.com/20101210ngscourse-101208080158-phpapp01/95/20110114-next-generation-sequencing-course-56-728.jpg?cb=1303919918)

---

### Hash-based alignment

***

* Hash-based approaches have some downsides
  - a static table means resizing can be costly, which is bad for a dynamic reference
  - search performance degrades as the table approaches capacity
* The majority of modern short-read aligners are **suffix/prefix trie**-based aligners
  - the main advantage over hash-based approaches is that alignment of identical sequences is only done once
  - the memory footprint can also be much lower than a hash-based approach

---

### Indexing of trie structures

***

* A trie is a computational data structure
  - advantages of this structure include: fast lookup, no need for hash functions, no key collisions
* For short read alignment, the trie
structure holds all the suffixes of a reference sequence as an index
  - one trie node per common suffix
  - this enables fast string matching (query look-up)

---

![trie-structure](http://marknelson.us/attachments/1996/suffix-trees/FIGURE1.gif)

* Example suffix trie for the text BANANAS
  - all suffixes of BANANAS are found in the trie
  - search for substrings by starting at the root and following matches down the tree until exhausted
  - the paths of this trie can be compressed by removing nodes with single descendants

---

### Algorithms based on suffix/prefix tries

***

* these algorithms find exact matches and build inexact alignments (supported by the matches)
* to identify exact matches, the algorithms use a representation of the suffix/prefix trie
  - e.g. suffix tree, suffix array, FM-index
  - a special terminal character ($) is used to denote the end of the text
  - the FM-index is widely used due to its small memory footprint

---

![langmead-tries](http://www.langmead-lab.org/wp-content/uploads/2014/01/Screen-Shot-2014-02-03-at-12.11.23-AM.png)

Suffix Trie ----- Suffix Tree ----- Suffix Array ----- FM Index

---

### The FM-index

***

* The FM-index is used by short read aligners such as Bowtie and BWA
* To understand how the FM-index is queried during read alignment, let's first look at the **Burrows-Wheeler Transform**
* The Burrows-Wheeler Transform (BWT) was originally a method to improve the efficiency of text compression, but is now widely used in bioinformatics software

---

### The Burrows-Wheeler Transform

***

* for a given reference sequence, T
* get all rotations of T
* form the Burrows-Wheeler Matrix (BWM)
* sort the BWM lexicographically
* the last column of the sorted matrix is the BWT of the reference, i.e.
BWT(T)

---

![BWM-1](http://homolog.us/Tutorials/Tut-Img/Set6/BWT3.png)

* get all rotations of the string "homolog.us"

---

![BWM-2](http://homolog.us/Tutorials/Tut-Img/Set6/BWT4.png)

* sort the BWM lexicographically and look at the last column

---

### The Burrows-Wheeler Transform (BWT)

***

* The BWT of a string allows for efficient compression
  - runs of the same character can be condensed
  - this is what compression tools such as bzip2 are based on
  - the BWM is not stored: only the BWT is used
* The BWM resembles the suffix array
  - the sorted order is the same, regardless of whether rotations or suffixes are used
  - this means that, rather than finding all rotations, we can generate BWT(T) from the suffix array SA:

```python
for i in range(len(SA)):
    if SA[i] > 0:
        BWT[i] = T[SA[i] - 1]
    else:
        BWT[i] = '$'
```

---

![langmead-tries](http://www.langmead-lab.org/wp-content/uploads/2014/01/Screen-Shot-2014-02-03-at-12.11.23-AM.png)

BWT = the characters one position to the left of the suffixes in the suffix array

---

### The Burrows-Wheeler Transform (BWT)

***

* The BWT is reversible: you can get the original reference sequence back
* For this we utilise the **Last-to-First** (LF) mapping property
* This property means that the occurrences of any given character appear in the same relative order in the first and last columns
* Let's look at the homolog.us BWM example with all O's coloured by occurrence

---

![BWM-3](http://homolog.us/Tutorials/Tut-Img/Set6/BWT6.png)

---

![BWT-1](https://www.researchgate.net/profile/Cole_Trapnell/publication/24177230/figure/fig3/AS:267670977249315@1440829142907/Figure-1-Burrows-Wheeler-transform.png)

---

### The Burrows-Wheeler Transform (BWT)

***

* Now that we have seen:
  - how to perform the BWT
  - how the BWT can be used to compress a string
  - how the BWT can be reversed
* How is it used as an index in short read alignment?
  - the **FM Index** allows for this and is based on the BWT

---

### The FM Index

***

* Stands for Full-text index in Minute space
* The FM index combines the BWT with a few additional data structures
  - checkpoints and a suffix array sample
* Querying the FM index is based on the LF property
  - the LF function matches character and rank
  - this can be used to reconstruct T from BWT(T), or to query the reference
  - the FM index stores extra information to facilitate this calculation

---

![](https://willrowe.net/_slides/slide-data/identifying-variation/LF-lang.png)

---

### The FM Index

***

* We have seen how the FM index can be used to find exact sequence matches
  - it requires a small amount of memory
  - short read alignment needs allowances for mismatches, read quality etc.
  - solutions such as backtracking and quality-aware searching facilitate this
* For more information on the BWT, [this](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf) lecture series by Ben Langmead is highly recommended
* For how it is implemented in short read aligners, [this](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943993/#!po=25.0000) review and [this](https://arxiv.org/abs/1303.3997) paper are good resources

----

## Alignments

***

* Now that we know the basic principles behind short read aligners, let's go over the alignment results
* There is a single unified file format for storing read alignments to a reference genome: the **SAM** format
* SAM stands for Sequence Alignment Map
  - there is a binary version of this file format: **BAM**
  - BAM allows for fast processing/indexing of the alignment
  - there is also the **CRAM** format (not discussed further), which offers reference-based compression
* We will explore the SAM format during the practical and use **samtools** to get basic alignment statistics, as well as manipulate and convert alignment files

---

### Alignment Post-processing

***

* Sorting, filtering and indexing the alignment
  - speeds up downstream analysis and reduces
file size
  - can be done using piped commands (holding the alignment in memory)
* Local realignment
  - INDELs can cause reads not to align correctly
  - multiple sequence alignment is performed in suspect alignment regions
* Duplicate removal
  - duplicate reads may be PCR artifacts; these need to be marked/removed so that variant calling is not biased (reduces false positives)
* Base quality score recalibration
  - per-base recalibration uses the reported quality score, read position and dinucleotide context

---

### Variant Calling

***

* To call variants from a pileup of reads, we generally use a probabilistic method (e.g. a Bayesian model)
  - tools such as FreeBayes, GATK HaplotypeCaller, SAMtools/BCFtools and VarScan are commonly used
* These programs calculate **genotype likelihoods** at each base position
  - variants can be called relative to multiple samples
* Mapping quality, base quality, read pair information etc. are utilised in making the calls
* Variant calls are written to a Variant Call Format (VCF) file

---

### Variant Filtering

***

* Once we have a VCF file, we apply filtering to reduce false positives
* We can filter using criteria such as depth, variant frequency and strand balance
* We could also use information on known variants to recalibrate our variant quality scores
* Once filtered, we should interpret our results and then see if we need to reapply the filters
  - check SNP density, transition/transversion ratio etc.

---

![GATK](https://us.v-cdn.net/5019796/uploads/FileUpload/eb/44f317f8850ba74b64ba47b02d1bae.png)

----

## References

***

* [hash table image](http://everythingcomputerscience.com/images/phone_book_HashTable.jpg)
* [suffix tries](http://marknelson.us/1996/08/01/suffix-trees/)
* [langmead lab](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf)
* [homolog.us](http://homolog.us/Tutorials)
* GATK
* A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010 Sep;11(5):473-83. doi: 10.1093/bib/bbq015. Epub 2010 May 11.
* Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
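
---

### Appendix: the BWT in code

***

As a backup to the BWT slides: a minimal Python sketch of the transform (sorted rotations, take the last column) and its reversal via the Last-to-First mapping. This is for illustration only; real aligners never materialise the rotation matrix and instead query the FM index.

```python
def bwt(t):
    """Burrows-Wheeler Transform: last column of the sorted rotation matrix."""
    t = t + "$"  # terminal character, lexicographically smallest
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last):
    """Recover the original text using the Last-to-First (LF) mapping."""
    counts, rank = {}, []          # rank = occurrence number of each char in the last column
    for c in last:
        counts[c] = counts.get(c, 0) + 1
        rank.append(counts[c] - 1)
    first = sorted(last)           # the first column of the BWM
    first_index = {}               # first row at which each character appears in the first column
    for i, c in enumerate(first):
        first_index.setdefault(c, i)
    row, out = 0, []               # row 0 of the sorted BWM starts with '$'
    for _ in range(len(last) - 1):
        c = last[row]              # character immediately to the left in the text
        out.append(c)
        row = first_index[c] + rank[row]  # LF mapping: jump to the matching first-column row
    return "".join(reversed(out))

print(bwt("banana"))           # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

Note how runs of identical characters cluster in the output ("ann...aa"), which is what makes the BWT compressible.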