### Life 708
***
# Genome Assembly
***
### Will Rowe
View the presentation online at:
[will-rowe.github.io/genome-assembly-2017](https://will-rowe.github.io/genome-assembly-2017)
---
## Aims of these slides
***
* introduction to genome assembly
* understand how some assembly algorithms work
* how to assess assembly quality
* walkthrough a typical genome assembly workflow
----
## Introduction
***
* no technology yet exists to sequence **full-length** DNA
* we have to sequence short fragments and piece them together computationally
* assembly is the process of reconstructing a genome sequence from raw sequencings reads
* there are two types of genome assembly
- *de novo*
- re-sequencing (reference-guided)
* we'll focus on de novo assembly of bacterial genomes
---
### Why assemble genomes?
***
* we need a reference genome (we might not already have one)
* we want to look at genome structure / put features into context
* to make comparisons to other genomes
---
* there are many different sequencing platforms available
- you've covered this in LIFE708 already
- they each have advantages / disadvantages
* platform choice is dictated by experimental design
- there are also cost/practical considerations
* there are different assembly approaches
- e.g. for long / short read platforms
* sequencing bias and sequencing error are factors that may impact our assembly quality
---
![seq technologies](https://flxlexblog.files.wordpress.com/2016/07/developments_in_high_throughput_sequencing.jpg)
---
### *De novo* assembly
***
* the DNA read by sequencers has been randomly fragmented and amplified as part of library preparation\*
* sequencing reads are therefore unordered fragments of a genome
* the aim of *de novo* assembly is to piece the reads together so that they give the whole genome sequence
* assembly quality is dictated by several factors
- read length and **coverage**
- sequence data quality
- genome complexity
---
* sequencing **coverage** is the average number of reads that align to known reference bases
* low sequencing coverage can result in genomic regions with no sequencing reads
* appropriate read length depends on genome size - i.e. reads need to be long enough to contain unique features
- genomic repeats make assembly hard
- hard to resolve if reads are shorter than repeat length
* paired-end reads with a long insert size can address read length limitations (i.e. resolve repeats)
---
### Let's make sure we understand some terminology
***
|term|definition|
|-------|-------|
|assembly|reconstructing a genome sequence from raw sequencings reads|
|read|fragments of our genome generated by a sequencer|
|coverage|the average number of reads that align to known reference bases|
|contig|a *contigious sequence* built from overlapping reads and representing a consensus region of DNA|
|scaffold|sets of non-overlapping contigs separated by gaps of known length|
|algorithm|a set of rules to perform a task|
|graph|represents relationships using nodes and edges|
---
![genome assembly overview](https://www.k.u-tokyo.ac.jp/pros-e/person/shinichi_morishita/genome-assembly.jpg)
----
## Aims of these slides
***
* introduction to genome assembly ✓
* understand how some assembly algorithms work
* how to assess assembly quality
* walkthrough a typical genome assembly workflow
----
## Assembly algorithms
***
* it's important to know how genome assemblers work
- very important for experimental design
- helps us to choose the right tool for assembling data
- allows us to interpret results properly!
- we can also troubleshoot if our assembly isn't as expected
---
### Greedy algorithm
***
* performs pairwise comparison of reads
* combines any read pairs with sufficient overlap between edges
* assembly stops when no more overlaps are found
* greedy algorithms aren't good with repeats - leading to misassemblies
* they also can't easily utilise global information (e.g. read pairs)
---
### Greedy algorithm
***
![greedy algorithm](https://willrowe.net/_slides/slide-data/genome-assembly/greedy-algorithm.jpg)
---
### Graph-based assembly
***
* represent relationships between sequenced reads
* graphs consist of edges and nodes
- graph can be directed or undirected (edge directionality)
- nodes can have multiple edges
- assemblers generally use directed multigraphs
* most modern genome assemblers use these graphs
- build graph from reads and traverse it to derive genome
* several graph-based assembly algorithms
- **overlap layout consensus (OLC)** - reads remain intact
- **de Bruijn graph** - breaks down reads to k-mers and finds exact overlaps
- **string graph** - variant of OLC approach
- **hybrid** - combines multiple approaches / sequence data
---
### Graph theory: the Bridges of Konigsberg
***
![bridges](https://upload.wikimedia.org/wikipedia/commons/5/5d/Konigsberg_bridges.png)
* can we visit each part of the city by crossing each bridge once?
---
### Graph theory: the Bridges of Konigsberg
***
![bridges and graph](https://willrowe.net/_slides/slide-data/genome-assembly/7-bridges-konigsberg.png)
* represent each land mass as a node and each bridge as an edge
* Eulerian path = visit every edge of a graph exactly once
* it's not possible in this case!
---
### Graph theory: Eulerian paths
***
![eulerian paths](https://www3.cs.stonybrook.edu/~skiena/combinatorica/animations/anim/euler.gif)
* graphs with Eulerian paths are termed Eulerian
* traversing a Eulerian path takes linear time
---
### Graph theory: Hamiltonian paths
***
![hamiltonian paths](http://www3.cs.stonybrook.edu/~algorith/files/hamiltonian-cycle-L.gif)
* Hamiltonian path = visit every node of a graph exactly once
* traversing a Hamiltonian path is hard to implement an algorithm for (NP-complete)
---
### Graph theory: Hamiltonian paths
***
![hamiltonian paths](http://www3.cs.stonybrook.edu/~algorith/files/hamiltonian-cycle-R.gif)
* visit every node once
* traversing a Hamiltonian path is hard to implement an algorithm for (NP-complete)
---
### Overlap-Layout-Consensus (OLC) algorithm
***
* based on the greedy algorithm but finds all possible overlaps
* finds **overlaps** (aligning the reads) & constructs overlap graph
* **layout** reads according to alignment (find Hamiltonian path)
* gets the **consensus** by traversing Hamiltonian path
* E.G. assemblers: Celera, Newbler, Canu
---
### Overlap-Layout-Consensus (OLC) algorithm
***
![OLC algorithm](https://willrowe.net/_slides/slide-data/genome-assembly/OLC-algorithm.jpg)
---
### De Bruijn graph-based algorithm
***
* reads are broken down to k-mers (substrings of length k)
- why do we use k-mers and not reads?
* a de Bruijn graph is constructed from the k-mers
- k-mers are connected if they have k-1 shared bases
* the genome is derived using the Eulerian path through the graph
* assuming no sequencing error, sequencing reads will match the genome perfectly
- therefore the de Bruijn graph constructed from reads will be the same as that constructed from the underlying genome
* E.G. assemblers: SPAdes, Velvet, ABySS
---
![de Bruijn eg](http://www.homolog.us/Tutorials/Tut-Img/Set1/fig5.png)
* edges connect nodes that have overlaps of 6 (=k-1) bases
* in this example, no k-mers appear multiple times
---
![graph eg](https://willrowe.net/_slides/slide-data/genome-assembly/graph-eg.jpg)
---
### De Bruijn graph-based algorithm: considerations
***
* we have seen that repetitive regions lead to multiple edges
* DNA strandedness will also increase number of node edges (think reverse complementing)
* how are sequencing errors and artifacts handled?
* what k-mer size should we use?
---
### De Bruijn graph-based algorithm: graph features
***
![graph features](https://willrowe.net/_slides/slide-data/genome-assembly/graph-features.jpg)
* use k-mer frequency to resolve these graph features
- remove low depth kmers
- clip tips, merge bubbles, remove links
- resolve small repeats using long k-mers
---
### De Bruijn graph-based algorithm: k-mer size
***
* avoid using an even numbered k-mer size
- they can lead to reverse complementing
- affects the strand specificity of the graph
- palindromic k-mers are avoided with an odd k
* increasing k-mer size can resolve ambiguities
- higher k-mer size can < number of edges and < possible paths
- however, higher k-mer size also more sensitive to sequencing errors
- higher k-mer size means more RAM needed
* try several k-mer sizes to get the best assembly!
---
### De Bruijn graph-based algorithm: recommended resources
***
good websites and papers:
[coding4medicine](https://www.coding4medicine.com/Members/Materials/assembly/files/chapter1.html), [Langmead lab teaching materials](http://www.langmead-lab.org/teaching-materials/), [Compeau et al. 2011](http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html), [EBI course](https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/part-1)
great explanation video:
---
### Recap:
![assembly strategies](https://willrowe.net/_slides/slide-data/genome-assembly/assembly-strategies.gif)
---
## Practical assembly example
***
Theory is one thing but let's look at implementing an assembly algorithm ourselves
Example found here: [bits of bioinformatics blog](https://pmelsted.wordpress.com/2013/11/23/naive-python-implementation-of-a-de-bruijn-graph/)
---
----
## Aims of these slides
***
* introduction to genome assembly ✓
* understand how some assembly algorithms work ✓
* how to assess assembly quality
* walkthrough a typical genome assembly workflow
----
## Assembly quality
***
We assess quality by looking at assembly **contiguity**, **completeness** and **correctness**
---
### Contiguity
***
* Ideally, we want very long contigs and few of them
* We measure contiguity using:
- contig number
- contig length (average, median and maximum)
- N statistics (e.g. N50)
* N50 is a statistical measure of the average length of a set of contigs
- 50% of the entire assembly is contained in contigs => to the N50 value
---
### Completeness and correctness
***
* completeness is the proportion of the sequenced genome represented by the assembled contigs
- completeness = assembled genome size / estimated genome size
* correctness is a measure of the number of errors in the assembly
- feature compressions (i.e. repeats)
- improper contig scaffolding
- introduced SNPs/InDels
---
### Assessing assembly quality: checklist
***
* check the three C's - contiguity, completeness, correctness
* look at all the statistics we are given by assemblers
* map the original reads to the assemblies to find inconsistencies
* check / visualise against other assemblies / references
----
## Aims of these slides
***
* introduction to genome assembly ✓
* understand how some assembly algorithms work ✓
* how to assess assembly quality ✓
* walkthrough a typical genome assembly workflow
----
## Typical genome assembly workflow
***
* experimental design
- genome size? repetitive regions? plasmids?
* DNA extraction
- how much DNA? quality?
* library preparation & sequencing
- library design (paired-end/mate-pair, insert size)? how much data needed?
* quality assessment of sequence libraries
- how to we assess quality before/after sequencing?
* genome assembly
- how to validate / compare / improve / annotate?
---
### DNA extraction
***
* we want high molecular weight DNA with no contamination
* use absorption ratios to estimate quality and quantity (Qubit, Nanodrop etc.)
* run the samples on a gel to check for degradation, protein / RNA contamination etc.
---
![QC gel image](https://genomics.ed.ac.uk/sites/default/files/page_images/Image%201.png "QC gel")
---
### Library preparation and sequencing
***
You've already covered technologies - which one to choose?
How many reads do I need to get a good assembly?
---
### Library preparation and sequencing
***
* for bacterial genome assembly, we usually want at least 30X sequencing coverage
* **sequencing coverage** is the average number of reads that align to known reference bases
* the Lander/Waterman equation is a method to compute expected coverage
## coverage = (no. reads * length of read) / genome size
---
### Example calculation
***
* an Illumina MiSeq flowcell can generate 25 million reads
* organism genome size: 4.5 Mbp (Megabase pair)
* required coverage: x30
* read length: 150 bp
* number of reads needed (N):
N = (30 x 4.5e+6) / 150
N = 9e+5 reads
---
### Quality assessment
***
* once we have sequence data, we want to check the quality
- we'll be covering how to check sequence read quality tomorrow
* example FASTQC report [found here](https://willrowe.net/_slides/slide-data/genome-assembly/QC/multiqc_report.html)
---
### Assemble!
***
In the practical tomorrow we will use a real-world *Shigella flexneri* sequencing dataset
We will be retrieving the data, quality checking it and then running some assembly algorithms. We'll then assess our assemblies, compare them to a reference genome and annotate our draft assemblies.
---
### Assembly annotation
***
* once we have assembled, we want to extract information from the genome
* covering annotation would be a lecture in itself
* the basic levels are: nucleotide, protein and process
- identify known features using alignment
- prediction of new features
* annotation is an important part of comparative genomics and was covered in LIFE721
---
### Closing the genome
***
* we have described how to assemble a **draft** genome
* draft assemblies contain errors
- sequence errors
- misassemblies
- contamination
* manual curation, re-sequencing and other approaches are often needed to close a genome
----
## Aims of these slides
***
* introduction to genome assembly ✓
* understand how some assembly algorithms work ✓
* how to assess assembly quality ✓
* walkthrough a typical genome assembly workflow ✓
----
## References used in this slide deck
***
https://flxlexblog.files.wordpress.com/
http://math.nmsu.edu/
https://link.springer.com/article/10.1007/s12575-009-9004-1
http://www.homolog.us/Tutorials
http://cbsuss05.tc.cornell.edu/doc/
http://debruijn.herokuapp.com/
https://pmelsted.wordpress.com/
https://www.slideshare.net/torstenseemann/de-novo-genome-assembly-tseemann-imb-winter-school-2016-brisbane-au-4-july-2016
https://genomics.ed.ac.uk/