Genome Assembly Lecture






### Life 708

***

# Genome Assembly

***

### Will Rowe


View the presentation online at:

[will-rowe.github.io/genome-assembly-2017](https://will-rowe.github.io/genome-assembly-2017)

---

## Aims of these slides

***

* introduction to genome assembly

* understand how some assembly algorithms work

* how to assess assembly quality

* walkthrough a typical genome assembly workflow

----

## Introduction

***

* no technology yet exists to sequence **full-length** DNA

* we have to sequence short fragments and piece them together computationally

* assembly is the process of reconstructing a genome sequence from raw sequencings reads

* there are two types of genome assembly
  - *de novo*
  - re-sequencing (reference-guided)

* we'll focus on de novo assembly of bacterial genomes

---

### Why assemble genomes?

***

* we need a reference genome (we might not already have one)

* we want to look at genome structure / put features into context

* to make comparisons to other genomes

---

* there are many different sequencing platforms available
 - you've covered this in LIFE708 already
 - they each have advantages / disadvantages

* platform choice is dictated by experimental design
 - there are also cost/practical considerations

* there are different assembly approaches
 - e.g. for long / short read platforms

* sequencing bias and sequencing error are factors that may impact our assembly quality

---

![seq technologies](https://flxlexblog.files.wordpress.com/2016/07/developments_in_high_throughput_sequencing.jpg)

---

### *De novo* assembly

***

* the DNA read by sequencers has been randomly fragmented and amplified as part of library preparation\*

* sequencing reads are therefore unordered fragments of a genome

* the aim of *de novo* assembly is to piece the reads together so that they give the whole genome sequence

* assembly quality is dictated by several factors
 - read length and **coverage**
 - sequence data quality
 - genome complexity

---

* sequencing **coverage** is the average number of reads that align to known reference bases

* low sequencing coverage can result in genomic regions with no sequencing reads

* appropriate read length depends on genome size - i.e. reads need to be long enough to contain unique features
  - genomic repeats make assembly hard
  - hard to resolve if reads are shorter than repeat length

* paired-end reads with a long insert size can address read length limitations (i.e. resolve repeats)

---

### Let's make sure we understand some terminology

***

|term|definition|
|-------|-------|
|assembly|reconstructing a genome sequence from raw sequencings reads|
|read|fragments of our genome generated by a sequencer|
|coverage|the average number of reads that align to known reference bases|
|contig|a *contigious sequence* built from overlapping reads and representing a consensus region of DNA|
|scaffold|sets of non-overlapping contigs separated by gaps of known length|
|algorithm|a set of rules to perform a task|
|graph|represents relationships using nodes and edges|

---

![genome assembly overview](https://www.k.u-tokyo.ac.jp/pros-e/person/shinichi_morishita/genome-assembly.jpg)


----

## Aims of these slides

***

* introduction to genome assembly ✓

* understand how some assembly algorithms work

* how to assess assembly quality

* walkthrough a typical genome assembly workflow

----

## Assembly algorithms

***

* it's important to know how genome assemblers work

  - very important for experimental design

  - helps us to choose the right tool for assembling data

  - allows us to interpret results properly!

  - we can also troubleshoot if our assembly isn't as expected

---

### Greedy algorithm

***

* performs pairwise comparison of reads

* combines any read pairs with sufficient overlap between edges

* assembly stops when no more overlaps are found

* greedy algorithms aren't good with repeats - leading to misassemblies

* they also can't easily utilise global information (e.g. read pairs)

---

### Greedy algorithm

***

![greedy algorithm](https://willrowe.net/_slides/slide-data/genome-assembly/greedy-algorithm.jpg)

---

### Graph-based assembly

***

* represent relationships between sequenced reads

* graphs consist of edges and nodes
 - graph can be directed or undirected (edge directionality)
 - nodes can have multiple edges
 - assemblers generally use directed multigraphs

* most modern genome assemblers use these graphs
 - build graph from reads and traverse it to derive genome

* several graph-based assembly algorithms
 - **overlap layout consensus (OLC)** - reads remain intact
 - **de Bruijn graph** - breaks down reads to k-mers and finds exact overlaps
 - **string graph** - variant of OLC approach
 - **hybrid** - combines multiple approaches / sequence data

---

### Graph theory: the Bridges of Konigsberg

***

![bridges](https://upload.wikimedia.org/wikipedia/commons/5/5d/Konigsberg_bridges.png)

* can we visit each part of the city by crossing each bridge once?

---

### Graph theory: the Bridges of Konigsberg

***

![bridges and graph](https://willrowe.net/_slides/slide-data/genome-assembly/7-bridges-konigsberg.png)

* represent each land mass as a node and each bridge as an edge

* Eulerian path = visit every edge of a graph exactly once

* it's not possible in this case!

---

### Graph theory: Eulerian paths

***

![eulerian paths](https://www3.cs.stonybrook.edu/~skiena/combinatorica/animations/anim/euler.gif)

* graphs with Eulerian paths are termed Eulerian

* traversing a Eulerian path takes linear time

---

### Graph theory: Hamiltonian paths

***

![hamiltonian paths](http://www3.cs.stonybrook.edu/~algorith/files/hamiltonian-cycle-L.gif)

* Hamiltonian path = visit every node of a graph exactly once

* traversing a Hamiltonian path is hard to implement an algorithm for (NP-complete)

---

### Graph theory: Hamiltonian paths

***

![hamiltonian paths](http://www3.cs.stonybrook.edu/~algorith/files/hamiltonian-cycle-R.gif)

* visit every node once

* traversing a Hamiltonian path is hard to implement an algorithm for (NP-complete)

---

### Overlap-Layout-Consensus (OLC) algorithm

***

* based on the greedy algorithm but finds all possible overlaps

* finds **overlaps** (aligning the reads) & constructs overlap graph

* **layout** reads according to alignment (find Hamiltonian path)

* gets the **consensus** by traversing Hamiltonian path

* E.G. assemblers: Celera, Newbler, Canu

---

### Overlap-Layout-Consensus (OLC) algorithm

***

![OLC algorithm](https://willrowe.net/_slides/slide-data/genome-assembly/OLC-algorithm.jpg)

---

### De Bruijn graph-based algorithm

***

* reads are broken down to k-mers (substrings of length k)
 - why do we use k-mers and not reads?

* a de Bruijn graph is constructed from the k-mers
 - k-mers are connected if they have k-1 shared bases

* the genome is derived using the Eulerian path through the graph

* assuming no sequencing error, sequencing reads will match the genome perfectly
 - therefore the de Bruijn graph constructed from reads will be the same as that constructed from the underlying genome

* E.G. assemblers: SPAdes, Velvet, ABySS

---

![de Bruijn eg](http://www.homolog.us/Tutorials/Tut-Img/Set1/fig5.png)

* edges connect nodes that have overlaps of 6 (=k-1) bases

* in this example, no k-mers appear multiple times

---

![graph eg](https://willrowe.net/_slides/slide-data/genome-assembly/graph-eg.jpg)

---

### De Bruijn graph-based algorithm: considerations

***

* we have seen that repetitive regions lead to multiple edges

* DNA strandedness will also increase number of node edges (think reverse complementing)

* how are sequencing errors and artifacts handled?

* what k-mer size should we use?

---

### De Bruijn graph-based algorithm: graph features

***

![graph features](https://willrowe.net/_slides/slide-data/genome-assembly/graph-features.jpg)

* use k-mer frequency to resolve these graph features
 - remove low depth kmers
 - clip tips, merge bubbles, remove links
 - resolve small repeats using long k-mers

---

### De Bruijn graph-based algorithm: k-mer size

***

* avoid using an even numbered k-mer size
 - they can lead to reverse complementing
 - affects the strand specificity of the graph
 - palindromic k-mers are avoided with an odd k

* increasing k-mer size can resolve ambiguities
 - higher k-mer size can < number of edges and < possible paths
 - however, higher k-mer size also more sensitive to sequencing errors
 - higher k-mer size means more RAM needed

* try several k-mer sizes to get the best assembly!

---

### De Bruijn graph-based algorithm: recommended resources

***

good websites and papers:

[coding4medicine](https://www.coding4medicine.com/Members/Materials/assembly/files/chapter1.html), [Langmead lab teaching materials](http://www.langmead-lab.org/teaching-materials/), [Compeau et al. 2011](http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html), [EBI course](https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/part-1)

great explanation video:



---

### Recap:

![assembly strategies](https://willrowe.net/_slides/slide-data/genome-assembly/assembly-strategies.gif)

---

## Practical assembly example

***

Theory is one thing but let's look at implementing an assembly algorithm ourselves

Example found here: [bits of bioinformatics blog](https://pmelsted.wordpress.com/2013/11/23/naive-python-implementation-of-a-de-bruijn-graph/)

---



----

## Aims of these slides

***

* introduction to genome assembly ✓

* understand how some assembly algorithms work ✓

* how to assess assembly quality

* walkthrough a typical genome assembly workflow

----

## Assembly quality

***

We assess quality by looking at assembly **contiguity**, **completeness** and **correctness**

---

### Contiguity

***

* Ideally, we want very long contigs and few of them

* We measure contiguity using:
 - contig number
 - contig length (average, median and maximum)
 - N statistics (e.g. N50)

* N50 is a statistical measure of the average length of a set of contigs
 -  50% of the entire assembly is contained in contigs => to the N50 value

---

### Completeness and correctness

***

* completeness is the proportion of the sequenced genome represented by the assembled contigs
 - completeness = assembled genome size / estimated genome size

* correctness is a measure of the number of errors in the assembly
 - feature compressions (i.e. repeats)
 - improper contig scaffolding
 - introduced SNPs/InDels

---

### Assessing assembly quality: checklist

***

* check the three C's - contiguity, completeness, correctness

* look at all the statistics we are given by assemblers

* map the original reads to the assemblies to find inconsistencies

* check / visualise against other assemblies / references

----

## Aims of these slides

***

* introduction to genome assembly ✓

* understand how some assembly algorithms work ✓

* how to assess assembly quality ✓

* walkthrough a typical genome assembly workflow

----

## Typical genome assembly workflow

***

* experimental design
  - genome size? repetitive regions? plasmids?

* DNA extraction
  - how much DNA? quality?

* library preparation & sequencing
  - library design (paired-end/mate-pair, insert size)? how much data needed?

* quality assessment of sequence libraries
  - how to we assess quality before/after sequencing?

* genome assembly
  - how to validate / compare / improve / annotate?

---

### DNA extraction

***

* we want high molecular weight DNA with no contamination

* use absorption ratios to estimate quality and quantity (Qubit, Nanodrop etc.)

* run the samples on a gel to check for degradation, protein / RNA contamination etc.

---

![QC gel image](https://genomics.ed.ac.uk/sites/default/files/page_images/Image%201.png "QC gel")

---

### Library preparation and sequencing

***

You've already covered technologies - which one to choose?

How many reads do I need to get a good assembly?

---

### Library preparation and sequencing

***

* for bacterial genome assembly, we usually want at least 30X sequencing coverage

* **sequencing coverage** is the average number of reads that align to known reference bases

* the Lander/Waterman equation is a method to compute expected coverage

## coverage = (no. reads * length of read) / genome size

---

### Example calculation

***

* an Illumina MiSeq flowcell can generate 25 million reads

* organism genome size: 4.5 Mbp (Megabase pair)

* required coverage: x30

* read length: 150 bp

* number of reads needed (N):

N = (30 x 4.5e+6) / 150

N = 9e+5 reads

---

### Quality assessment

***

* once we have sequence data, we want to check the quality
 - we'll be covering how to check sequence read quality tomorrow

* example FASTQC report [found here](https://willrowe.net/_slides/slide-data/genome-assembly/QC/multiqc_report.html)

---

### Assemble!

***

In the practical tomorrow we will use a real-world *Shigella flexneri* sequencing dataset

We will be retrieving the data, quality checking it and then running some assembly algorithms. We'll then assess our assemblies, compare them to a reference genome and annotate our draft assemblies.

---

### Assembly annotation

***

* once we have assembled, we want to extract information from the genome

* covering annotation would be a lecture in itself

* the basic levels are: nucleotide, protein and process
 - identify known features using alignment
 - prediction of new features

* annotation is an important part of comparative genomics and was covered in LIFE721

---

### Closing the genome

***

* we have described how to assemble a **draft** genome

* draft assemblies contain errors
 - sequence errors
 - misassemblies
 - contamination

* manual curation, re-sequencing and other approaches are often needed to close a genome

----

## Aims of these slides

***

* introduction to genome assembly ✓

* understand how some assembly algorithms work ✓

* how to assess assembly quality ✓

* walkthrough a typical genome assembly workflow ✓

----

## References used in this slide deck

***

https://flxlexblog.files.wordpress.com/
http://math.nmsu.edu/
https://link.springer.com/article/10.1007/s12575-009-9004-1
http://www.homolog.us/Tutorials
http://cbsuss05.tc.cornell.edu/doc/
http://debruijn.herokuapp.com/
https://pmelsted.wordpress.com/
https://www.slideshare.net/torstenseemann/de-novo-genome-assembly-tseemann-imb-winter-school-2016-brisbane-au-4-july-2016
https://genomics.ed.ac.uk/