
The following

content is provided under a Creative

Commons license. Your support will help MIT

OpenCourseWare continue to offer high quality

educational resources for free. To make a donation or

view additional materials from hundreds of MIT courses,

visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: OK, so welcome

back to computational systems biology, I’m David Gifford. I’m delighted to be with you here today. And today we’re going to be

talking about a topic that is central to modern high

throughput biology, which is understanding how to do

short read alignment, sometimes called read mapping. Now it’s very important

to me that you understand what I’m about to

say today, and so I’m hopeful that you’ll be

uninhibited to raise your hand and ask questions about the

fine points in today’s lecture if you have any, because I’d

be totally delighted to answer any questions and we

have enough time today that we can spend time looking

at one aspect of this problem and understand it thoroughly. An associated topic is the

question of library complexity. How many people have heard of

sequencing libraries before? Let’s see a show of hands. OK, how many people have heard

of read alignment before, read mapping? OK, great, fantastic. Let’s start with

what we’re going to be talking about today. We’re going to first begin

talking about what a sequencing library is and what we

mean by library complexity. We’ll then turn to what has been

called a full-text minute-space index, sometimes called a Burrows-Wheeler transform index, a BWT index,

an FM index, but this is at the center of most

modern computational biology algorithms for processing high

throughput sequencing data. And then we’ll turn to how

to use that type of index for read alignment. So let’s start now with what

a sequencing library is. Let’s just say that

you have a DNA sample, we’ll be talking about various

ways of producing said samples throughout the term. But we’re going

to assume that we have a bunch of

different DNA molecules. And I’ll illustrate the

different molecules here in different colors. And we have three different

types of molecules here. Some molecules are duplicated,

because as you know, typically, we’re preparing DNA

from an experiment where there are many cells and we can get

copies of DNA from those cells or the DNA could

be amplified using PCR or some other technique. So we have this

collection of molecules, and to make a library,

we’re going to process it. And one of the

things that we’ll do when we process the

library is we’ll put sequencing adapters on. These are short DNA

sequences that we put on to the end

of the molecules to enable them to

have defined sequences at the ends which

permits sequencing. Now, if somebody hands you

a tube of DNA like this, there are a couple

questions you could ask. You could check the DNA

concentration to find out how much DNA is there,

you could run a gel to look at the size

of the fragments that you’re sequencing. We’ll be returning

to that later, but these are typically

called the insert sizes of the library that

you’re sequencing, the total length of the

DNA excluding the adapters. But we could also ask questions

about how complex this library is, because it’s possible to run

experiments where you produce libraries that are

not very complex, where they don’t have very many

different types of molecules. Now that typically is a

failure of the experiment. So an important part

of quality control is characterizing the

library complexity, where we want to figure out how many distinct molecules there are. Here, the complexity is equal to 3. There are three different

types of molecules. And we sample these molecules. And when we sample them, we get

a bunch of DNA sequence reads. And typically the number

of reads that we get is larger than the

complexity of the library. Here we have a total

of 12 different reads. And when we sequence a library,

we’re sampling from it. And so the probability that we

get any one particular molecule is going to be roughly

speaking equal to 1 over c, which is the complexity. And thus, we could use

the binomial distribution to figure out the

likelihood that we had exactly four of

these type one molecules. However, as n, the number of sequencing reads, grows to be very large, typical

numbers are a hundred million different reads,

the binomial becomes cumbersome to work with. And so we typically are

going to characterize this kind of selection

process with a different kind of distribution. So one idea is to

use a Poisson, where we say that the

rate of sequencing is going to be n over c. And we can see that

here shown on the slide above is the same

process where we have the ligation

of the adapters. We have a library and we have

reads coming from the library. We have a characterized

library complexity here, there are four different

types of molecules. And the modeling

approach is that assuming that we have c different

unique molecules, the probability that

we’ll get any one of them when we’re doing the

sequencing is 1 over c. And if we do n

sequencing reads, we can find out the

probability that we’ll get a certain number of

each type of molecule. Let’s just stick with the

first one to start, OK? Now part of the challenge

in analyzing sequencing data is that you don’t see

what you don’t sequence. So things that actually occur

0 times in your sequencing data still may be present

in the library. And what we would like to do

is from the observed sequencing data, estimate the

library complexity. So we have all of

the sequencing data, we just don’t know how

many different molecules there are over here. So one way to do

with this is to say that let us suppose that we

make a histogram of the number of times we see

distinct molecules and we’re going to say that

we can observe molecules that are sequenced or appear l

times up through r times. So we actually can

create a version of the distribution

that characterizes just a part of

what we’re seeing. So if we do this, we can

build a Poisson model and we can estimate lambda

from what we can observe. We don’t get to observe

things we don’t see. So for sure, we know

we can’t observe the things that are

sequenced 0 times. But for the things that are

sequenced at least one time, we can build an

estimate of lambda. And from that

estimate of lambda, we can build an estimate of C. So one way to look

at this is that if we look at the total number

of unique molecules that we sequence, which is

equal to m, then the probability that we observe between l

and r occurrences of a given individual sequence

times c is going to be equal to the total

number of unique molecules that we observe. Another way to look at this

is the very bottom equation where we note that if we

look at the total complexity of the library and we multiply

it by 1 minus the probability that we don’t observe

certain molecules, that will give an estimate

of the total number of unique molecules

that we do see. And thus we can

manipulate that to come up with an estimate

of the complexity. Are there any questions about

the details of this so far? OK, so this is a
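The estimator just described can be sketched in a few lines of Python. This is an illustrative sketch, not code from the course: it fits lambda to the molecules we did observe, treating them as a zero-truncated Poisson sample, and then inverts m = C(1 - e^-lambda) to estimate the complexity C.

```python
import math

def estimate_complexity_poisson(read_counts, iters=200):
    """Estimate library complexity C from per-molecule read counts.

    read_counts: one count (>= 1) per distinct molecule observed;
    molecules sequenced zero times cannot appear here. We fit lambda
    for a zero-truncated Poisson, whose mean is lambda/(1 - e^-lambda),
    then invert m = C * (1 - e^-lambda) to estimate C.
    Requires the mean observed count to exceed 1.
    """
    m = len(read_counts)                 # distinct molecules observed
    mean_obs = sum(read_counts) / m      # mean of the truncated sample
    lam = mean_obs
    for _ in range(iters):               # fixed point: lam = mean*(1 - e^-lam)
        lam = mean_obs * (1.0 - math.exp(-lam))
    return m / (1.0 - math.exp(-lam))    # estimated complexity C-hat
```

For example, counts of [1, 1, 2, 4, 4] (five distinct molecules seen across twelve reads) give an estimate a bit under 6, because at this depth a sixth molecule could plausibly have been missed entirely.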

OK, so this is a very simple model for estimating the complexity of a library, based upon looking at the distribution of reads that we actually observe, for quality control purposes. And let us suppose that we apply this to 1000 Genomes data, which is

public data on human. And suppose we want to test

whether this model works or not, so what

we’re going to do is we’re going to estimate

the library complexity from 10 percent of the sequencing

reads, so we’ll pick 10 percent of the reads

of an individual at random, we’ll estimate the

complexity of the library, and then we’ll also take all

of the reads of the individual and estimate the complexity. And if our estimator

is pretty good, we should get about

the same number, from 10 percent of the reads

and from all of the reads. Will people go along with that? Think that seems reasonable? OK, so we do that. And this is what we get. And it’s hard to see

the diagonal line here, but there’s a big oops here. And the big oops is that if we

estimate the library complexity from just 10 percent

of the reads, it’s grossly underestimating

the number of unique molecules we actually have. In fact, it’s off by typically

a factor of two or more. So for some reason,

even though we’re examining millions of

reads in this subsample, we’re not getting

a good estimate of the complexity

of the library. Does anybody have any idea

what could be going wrong here? Why is it that this

very simple model that is attempting to estimate how

many different molecules we have here based upon what

we observe is broken? Any ideas at all? And please say your name first. AUDIENCE: I’m Chris. PROFESSOR: Hi Chris. AUDIENCE: Is it because

repeated sequences, so there could be a short

sequence at the end of one molecule that’s the beginning

of another one, middle of another one, so [INAUDIBLE]. PROFESSOR: Chris, you’re

on the right track, OK? Because what we have

assumed at the outset was that all of these molecules

occur with equal probability. Right? What would happen

if in fact there are four copies

of this purple one and only two copies of

the other molecules? Then the probability

of sampling this one is going to be twice as high

as the probability of sampling one of these. If there’s non-uniformity

in the original population, that’s going to mess

up our model big time. And that could happen

from repeated sequences or other kinds of

duplicated things, or it could be that there’s

unequal amplification. It might be that

PCR really loves a particular molecule, right,

and amplifies that one a lot, and doesn’t amplify another one

that’s difficult to amplify. So somewhere in our

experimental protocol pipeline, it could be that

there’s non-uniformity and thus we’re getting a

skewness to our distribution here in our library. So the other thing

that’s true is in a Poisson, lambda,

which is equal to the mean, is also equal to the variance. And so our Poisson’s

only one knob we could turn to fit

the distribution. So coming back to this,

we talked about the idea that the library complexity

still may be four but then there may

be different numbers of molecules of each type. And here’s an idea

for you, right? The idea is this. Imagine that the

top distributions are the number of each type

of molecule that are present. And it might be that our

original assumption was that it was like the very

top, that typically there are two copies of each molecule

in the original sequencing library, and that’s a

fairly tight distribution. But it could be, in fact,

that the number of molecules of each type is very dispersed. And so if we look at

each one of those plots at the top, the

first four, those are going to be our guesses

about the distribution of the number of

copies of a molecule in the original library. And we don’t know

what that is, right? That’s something we can’t

directly observe, but imagine that we took that

distribution and used it for lambda in our

Poisson distribution down below for sampling. So we have one distribution

over the number of each type of

molecule we have and we have the Poisson for

sampling from that, and we put those two together. And when we do that, we have

the Poisson distribution at the top, the

gamma distribution is what we’ll use

for representing the number of different

species over here and their relative copy number. And when we actually put

those together as shown, we wind up with what’s

called the negative binomial distribution, which is a

more flexible distribution, it has two parameters. And that negative

binomial distribution can be used, once again,

to estimate our library complexity. And when we do so, we

have lambda be the same, but k is a new parameter. It measures sort of the

variance or dispersion of this original

sequencing library. And then when we fit this

negative binomial distribution to that 1,000 genomes data, it’s

going to be hopefully better. Let’s start with

a smaller example. If we have a library that’s

artificial, with a known million unique molecules, and we subsample it to give 100,000 reads, you can see that with different dispersions here on the left, k

with different values from 0.1 to 20, the Poisson begins to grossly underestimate the complexity of the library as the dispersion gets larger, whereas the negative binomial, otherwise known as the GP or gamma-Poisson, does a much better job.

And furthermore, when we look at this in the context of the 1000 Genomes data, you can see when we fit this

how much better we are doing. Almost all those points

are almost exactly on the line, which means you

can take a small sampling run and figure out from

that sampling run how complex your library is. And that allows us to tell

something very important, which is what is the marginal

value of extra sequencing. So for example, if somebody

comes to you and they say, well, I ran my experiment

and all I could afford was 50 million reads. Do you think I

should sequence more? Is there more information in my

experimental DNA preparation? It’s easy to tell now, right? Because you can actually

analyze the distribution of the reads that they got

and you can go back and you could estimate

the marginal value of additional sequencing. And the way you do that is you

go back to the distribution that you fit this

negative binomial and ask if you

have r more reads, how many more unique molecules

are you going to get? And the answer is

that you can see that if you imagine that

this is artificial data, but if you imagine that

you had a complexity of 10 to the 6 molecules, the

number sequencing regions is on the x-axis, the number of

observed distinct molecules is on the y-axis, and as you

increase the sequencing depth, you get more and more

back to the library. However, the important thing to note is that the more skewed the library is, the less benefit you get, right? So if you look at the various values of k, as k gets larger, the skewness of the library increases, and you can see that you get fewer unique molecules as you increase the sequencing depth.
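The marginal-value calculation can be sketched directly from the two models. In the sketch below (illustrative; the gamma shape is written k here, and the lecture's slides may parameterize the dispersion differently, even in the opposite direction), the Poisson case is the limiting case of the gamma-Poisson:

```python
import math

def expected_distinct(c, n, k=None):
    """Expected number of distinct molecules observed after n reads
    from a library of complexity c.

    k=None: uniform library (Poisson), P(unseen) = e^(-n/c).
    finite k: gamma-Poisson with shape k, P(unseen) = (k/(k + n/c))^k;
    in this parameterization, smaller k means a more skewed library.
    """
    mu = n / c                           # mean reads per molecule
    p_zero = math.exp(-mu) if k is None else (k / (k + mu)) ** k
    return c * (1.0 - p_zero)
```

So at a depth of one read per molecule, a uniform library shows you about 63 percent of its distinct molecules, while a skewed library with k = 0.5 shows you only about 42 percent, and each extra batch of reads buys less and less.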

Now I mention this to you because it’s important to think in a principled way about analyzing sequencing data. If somebody drops 200 million

reads on your desk and says, can you help me with

these, it’s good to start with some fundamental

questions, like just how complex the original library is, and whether you think these data are really good or not, OK? Furthermore, this is an

introduction to the idea that certain kinds of

very simplistic models, like Poisson models

of sequencing data can be wrong because

they’re not adequately taking into account

the overdispersion of the original

sequencing count data. OK, so that’s all there is

about library complexity. Let’s move on now to questions

of how to deal with these reads once we have them. So the fundamental

challenge is this. I hand you a genome, like human: 3 times 10 to the ninth bases. This will be in FASTA format, let’s say. I hand you reads: we’ll have, say, 200 base pairs times 2 times 10 to the eighth different reads, and these will be in FASTQ format. The Q means that it’s like FASTA, except that a quality score is associated with each particular base position. And the PHRED score, which is typically used for these sorts of qualities, is minus 10 times log base 10 of the probability of an error. Right, so a PHRED score of 10 means that there’s a 1 in 10 chance that the base is an error, a PHRED score of 20 means 1 in 100, and so forth.
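The PHRED relationship is a one-liner in each direction. A minimal sketch (function names are illustrative; the +33 ASCII offset is the common Sanger/Illumina 1.8+ FASTQ convention):

```python
import math

def phred_to_error_prob(q):
    """Q = -10 * log10(p)  =>  p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Inverse of the above: p -> Q."""
    return -10.0 * math.log10(p)

def decode_fastq_qualities(qual_line, offset=33):
    """Decode a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in qual_line]
```

For example, a base whose quality character is 'I' has Q = 40, that is, a 1-in-10,000 chance of being wrong.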

And then the goal is: today, if I give you these data on a hard drive, your job would be to produce a SAM

file, a Sequence Alignment and Mapping file,

which tells us where all these reads

map in the genome. And more pictorially,

the idea is that there are many

different reasons why we want to do this mapping. So one might be

to do genotyping. You and I differ in our

genomes by about one base in a thousand. So if I sequence your genome

and I map it back, or align it, to the human reference genome, I’m going to find differences

between your genome and the human reference genome. And you can see how this is

done at the very top where we have the aligned

reads and there’s a G, let’s say, in the

sample DNA, and there’s a C in the reference. But in order to figure out

where the differences are, we have to take

those short reads and align them to the genome. Another kind of

experimental protocol uses DNA fragments

that are representative of some kind of

biological process. So here the DNA being

produced are mapped back to the genome to look

for areas of enrichment or what are sometimes

called peaks. And there we want to actually

do exactly the same process, but the post processing once

the alignment is complete is different. So both of these share the goal

of taking hundreds of millions of short reads and aligning

them to a very large genome. And you heard about Smith-Waterman from Professor Berg, and as you can tell, that

really isn’t going to work, because its time

complexity is not going to be admissible for

hundreds of millions of reads. So we need to come up

with a different way of approaching this problem. So finding this alignment is

really a performance bottleneck for many computational

biology problems today. And we have to talk a

little bit about what we mean by a good alignment,

because we’re going to assume, of course, fewer

mismatches are better. And we’re going to try and

align to high quality bases as opposed to low quality

bases and note that all we have in our input data are

quality scores for the reads. So we begin with an assumption

that the genome is the truth and when we are

aligning, we are going to be more permissive

of mismatches in read locations that have

higher likelihood of being wrong. So is everybody OK

with the set up so far? You understand what

the problem is? Yes, all the way in the back

row, my back row consultants, you’re good on that? See, the back row

is always the people I call on for consulting

advice, right? So yeah. You’re all good back there? Good, I like that, good,

that’s good, I like that, OK. All right. So now I’m going to

talk to you about one of the most amazing transforms I have seen. It’s called the Burrows-Wheeler Transform. And it is a transform

that we will do to the original

genome that allows us to do this look up

very, very quickly. And it’s worth understanding. So here’s the basic idea behind

the Burrows Wheeler Transform. We take the original string that

we want to use as our target that we’re going to

look things up in, OK? So this is going to be the dictionary we’re looking things up in, and it’s going to be the genome sequence. And you can see the sequence on the left-hand side, ACAACG, and the dollar sign represents

the end of string terminator. OK. Now here’s what

we’re going to do. We take all possible

rotations of this string, OK? And we’re going to sort them. And the result of

sorting all the rotations is shown in the next

block of characters. And you can see that the

end of string character has the shortest sorting

order, followed by A, C, and G and that all the strings

are ordered lexically by all of their letters. So once again, we take

the original input string, we do all of the

possible rotations of it, and then we sort them and wind

up with this Burrows Wheeler Matrix as it’s called

in this slide, OK? And we take the last column of that matrix, and that is the Burrows-Wheeler Transform.
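The rotate, sort, take-the-last-column recipe is only a few lines of code. A naive sketch (O(n² log n), fine for a blackboard example; real aligners build the transform via suffix arrays):

```python
def bwt(t):
    """Burrows-Wheeler Transform of t via sorted rotations.
    '$' marks end-of-string and must sort before A, C, G, T
    (in ASCII, it does)."""
    t = t + "$"
    # All rotations of t, sorted lexically.
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    # The transform is the last character of each sorted rotation.
    return "".join(rot[-1] for rot in rotations)
```

For the string on the board, bwt("ACAACG") works out to "GC$AAAC": the last column of the sorted rotation matrix.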

Now you might say, what on earth is going on here? Why would you want to take a string, or even an entire genome? We actually do this

on entire genomes, OK? Consider all the

rotations of it, sort them, and then take the

last column of that matrix. What could that be doing, OK? Here’s a bit of

intuition for you. The intuition is that that

Burrows Wheeler Matrix is representing all

of the suffixes of t. OK, so all the red things are

suffixes of t in the matrix. And when we are going

to be matching a read, we’re going to match it from its end, going towards the beginning

of it, so we’ll be matching suffixes of it. And I’m going to show you a

very neat way of using this transform to do matching

very efficiently. But before I do that,

I want you to observe that it’s not complicated. All we do is we take all

the possible rotations and we sort them and we

come up with this transform. Yes. AUDIENCE: What are you

sorting them based on? PROFESSOR: OK, what

was your name again? AUDIENCE: I’m Samona. PROFESSOR: Samona. What are we sorting

them based upon? We’re just sorting

them alphabetically. AUDIENCE: OK. PROFESSOR: So you can see that

if dollar sign is the lowest alphabetical character,

that if you consider each one a word, that they’re

sorted alphabetically, OK? So we have seven

characters in each row and we sort them alphabetically. Or, lexically. Good question. Any other questions like that? This is a great time

to ask questions, because what’s

going to happen is that in about the

next three minutes if you lose your attention

span of about 10 seconds, you’re going to look up and

you’ll say, what just happened? Yes. AUDIENCE: Could you

explain the suffixes of t? PROFESSOR: The suffixes of t? Sure. Let’s talk about

the suffixes of t. They’re all of the endings of t. So a suffix of t would be

G, or CG, or ACG, or AACG, or CAACG, or the

entire string t. Those are all of

the endings of t. And if you look

over on the right, you can see all of

those suffixes in red. So one way to

think about this is that it’s sorting all of the

suffixes of t in that matrix. Because the rotations are

exposing the suffixes, right, is what’s going on. Does that make sense to you? Now keep me honest here in a

minute, OK, you’ll help me out? Yes. Your name first? AUDIENCE: [INAUDIBLE]. [INAUDIBLE] PROFESSOR: [INAUDIBLE]. AUDIENCE: What is dollar sign? PROFESSOR: Dollar sign is

the end of string character which has the lowest

lexical sorting order. So it’s marking the end of t. That’s how we know that

we’re at the end of t. Good question. Yes. AUDIENCE: Can you sort them

non-alphabetically, just different ways to sort

them [INAUDIBLE] algorithm? PROFESSOR: The

question is, can you sort them non alphabetically. You can sort them any way as

long as it’s consistent, OK. But let’s stick with

alphabetical lexical order today. It’s really simple

and it’s all you need. Yes. AUDIENCE: Is what’s in red the suffixes in the last colored group on the right? PROFESSOR: No, no. AUDIENCE: What’s in red? PROFESSOR: What’s in

red are all the suffixes of T on the very far left. OK? AUDIENCE: On the right,

last column group? PROFESSOR: The right

last column group. That last column in red, that is

the Burrows-Wheeler Transform, read from top to bottom. OK? And I know you’re

looking at that and saying, how could

that possibly be useful? We’ve taken our genome. We’ve shifted it all around. We’ve sorted it, we

take this last thing. It looks like junk to me, right? But you’re going to

find out that all of the information in

the genome is contained in that last string

in a very handy way. Hard to believe but true. Hard to believe but true. Yes. Prepare to be amazed, all right? These are all great questions. Any other questions

of this sort? OK. So, I’m going to make a very

important observation here that is going to be crucial

for your understanding. So I have reproduced the

matrix down on this blackboard. What? That’s usually there under

that board, you know that. You guys haven’t checked this

classroom before, have you? No. It’s always there. It’s such a handy transform. So this is the same

matrix as the matrix you see on the right. I’m going to make a very, very

important assertion right now. OK? The very important assertion

is that if you consider that this is the first a

in the last column that is the same textual occurrence

in the string as the first a in the first column. And this is the second a

in the last column, that’s the same as the second

a in the first column. And you’re going to say,

what does he mean by that? OK, do the following

thought experiment. Look at the matrix. OK? And in your mind, shift it

left and put all the characters on the right hand side. OK? When you do that,

what will happen is that these

things will be used to sort the occurrences

of a on the right-hand side. Once again, if you shift this

whole thing left and these pop over to the right, then the

occurrence of these a’s will be sorted by these

rows from here over. But these are alphabetical. And therefore they’re going

to certain alphabetical order. And therefore these a’s

will sort in the same order here as they are over there. So that means that when

we do this rotation, that this textual

occurrence of a will have the same rank in the first

column and in the last column. And you can see I’ve annotated

the various bases here with their ranks. This is the first g, the first

c, the first end of line, end of string character. First a, second a,

third a, second c. And correspondingly I

have the same annotations over here and thus

the third a here is the same lexical occurrence

as the third a on the left in the string, same

text occurrence. Now I’m going to let that

sink in for a second, and then when somebody

asks a question, I’m going to explain it again

because it’s a little bit counterintuitive. But the very important

thing is if we think about textual recurrences

of characters in that string t and we put them

in this framework, that the rank allows

us to identify identical textual

recurrences of a character. Would somebody like

to ask a question? Yes. Say your name and

the question, please. AUDIENCE: Dan. PROFESSOR: Dan. AUDIENCE: So in

your original string though those won’t

correspond to the same order in the transformed string. So like the a’s in the

original string in their order, they don’t correspond

numerically to the transformed string. PROFESSOR: That’s correct. Is that OK? The comment was that the

order in BWT, the transform is not the same as the order

in the original string. And all I’m saying is that in

this particular matrix form, that the order on

the last column is the same as the order

in the first column for a particular character. And furthermore, that these are

matching textual occurrences, right? Now if I look at

a2 here, we know that c comes after it, then

a, then a, and c and g, right? Right. OK, so did that

answer your question that they’re not

exactly the same? AUDIENCE: Yes. I don’t understand how

they’re useful yet. PROFESSOR: You don’t

understand how it’s useful yet. OK. Well, maybe we better

get to the useful part and then you can– OK. So let us suppose that

we want to, from this, reconstruct the original string. Does anybody have any

ideas about how to do that? OK. Let me ask a different question. If we look at this g1, right? And then this is the same

textual occurrence, right? And we know that

this g1 comes right before the end of character,

in end of string terminator, right? So if we look at the

first row, we always know what the last character

was in the original string. The last character is g1, right? Fair enough? OK. Where does g1 occur over here? Right over here, right? What’s the character before g1? c2. Where is c2 over here? What’s the character before c2? a3. What’s the character before a3? a1. Uh oh. Let me just cheat a little

bit here. a1 a3 c2 g1 $. So we’re at a1, right? What’s the character before a1? c1, right? What’s the character before c1? a2. And what’s the

character before a2? That’s the end of string. Is that the original

string that we had? Magic. OK? Yes. AUDIENCE: Wouldn’t it
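The backward walk just performed on the board mechanizes into a short inversion routine. A sketch (helper names are illustrative), using the board's example, whose transform works out to GC$AAAC:

```python
def inverse_bwt(bwt_str):
    """Reconstruct the original string from its BWT using the
    LF property: the i-th occurrence of a character in the last
    column is the same textual occurrence as the i-th occurrence
    of that character in the first column."""
    first_col = sorted(bwt_str)
    # occ[c]: index where character c's block begins in the first column.
    occ = {}
    for i, ch in enumerate(first_col):
        occ.setdefault(ch, i)
    # rank[i]: how many times bwt_str[i] appeared above position i.
    rank, seen = [], {}
    for ch in bwt_str:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row 0 starts with '$', so its last-column character is the
    # final character of the original string; walk backwards from it.
    row, out = 0, []
    for _ in range(len(bwt_str) - 1):
        ch = bwt_str[row]
        out.append(ch)
        row = occ[ch] + rank[row]      # the LF step
    return "".join(reversed(out))
```

Running it on "GC$AAAC" recovers "ACAACG", exactly the walk done at the blackboard.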

AUDIENCE: Wouldn’t it be simpler to look at– to just remove the dollar sign [INAUDIBLE] or do you mean reconstruct

from only the transformed? PROFESSOR: We’re

only using this. This is all we have. Because I actually didn’t

use any of these characters. I was only doing the matching

so we would go to the right row. Right? I didn’t use any of this. And so, but do people

understand what’s going on here? If anybody has any questions,

now is a great time to raise your hand

and say– here we go. We have a customer. Say your name and

the question, please. AUDIENCE: My name is Eric. PROFESSOR: Thanks, Eric. AUDIENCE: Can you illustrate

how you would do this without using any

of the elements to the left of the box? PROFESSOR: Absolutely, Eric. I’m so glad you

asked that question. That’s the next thing

we’re going to talk about. OK, but before I get there, I want to make sure, are you comfortable doing

it with all the stuff on the left-hand side? You’re happy about that? OK. If anyone was unhappy about

that, now would be the time to say, I’m unhappy, help me. How about the details? Everybody’s happy? Yes. AUDIENCE: So, you have

your original string in the first place,

though, so why do you want to create another

string of the same length? Like, how does this help

you match your read? PROFESSOR: How does this

help you match your read? How does this help you match

your read, was the question. What was your name? AUDIENCE: Dan. PROFESSOR: That’s right, Dan. That’s a great question. I’m so glad you asked it. First we’ll get

to Eric’s question and then we’ll get to yours. Because I know if I don’t give

you a good answer that you’re going to be very mad, right? OK? Let’s talk about

the question of how to do this without

the other things. So we’re going to create

something called the last to first function that maps

a character in the last row, column, I should say,

to the first column. And there is the

function right there. It’s called LF. You give it a row number. The rows are zero origined. And you give it a

character and it tells you what the corresponding place

is in the first column. And it has two components. The first is Occ, which tells

you how many characters are smaller than that

character lexically. So tells you where, for example,

the a’s start, the c’s start, or the g’s start. So in this case, for

example Occ of c is 4. That is the c’s start at

0, 1,2, 3, the fourth row. OK? And then count tells you

the rank of c minus 1. So it’s going to

essentially count how many times c occurs

before the c at the row you’re pointing at. In this case, the

answer is 1 and you add 1 and that gets you

to 5, which is this row. So this c2 maps here to c2

as we already discussed. So this LF function is a way to map from the last column to the first column. And we need to have

two components. So we need to know Occ, which

is very trivial to compute. There are only five

elements, one for each base and one for the end-of-string

terminator, which is actually zero. So it will only have

integers, and count, which is going to tell us the rank in the BWT, and we’ll talk about how

to do that presently. OK.
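The LF function itself is a short sketch (variable names are illustrative): Occ gives the start of each character's block in the first column, and rank counts earlier occurrences of the same character in the last column. On the board's transform, GC$AAAC, it yields exactly the values 6 4 0 1 2 3 5 that the professor works out at the blackboard.

```python
def lf_tables(bwt_str):
    """Precompute the two ingredients of the LF function:
    occ[c], where c's block begins in the first (sorted) column, and
    rank[i], how many times bwt_str[i] occurs above row i in the
    last column."""
    occ = {}
    for i, ch in enumerate(sorted(bwt_str)):
        occ.setdefault(ch, i)
    rank, seen = [], {}
    for ch in bwt_str:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return occ, rank

def lf(row, bwt_str, occ, rank):
    """Row in the first column holding the same textual occurrence
    of the character found at bwt_str[row] in the last column."""
    return occ[bwt_str[row]] + rank[row]
```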

So did that answer your question, how to do this without the rest of the matrix? Eric? AUDIENCE: Could you show us

step by step on the blackboard how you would reconstruct it? PROFESSOR: How do

we reconstruct it? AUDIENCE: Yeah. PROFESSOR: You mean

something like this? Is this what you’re suggesting? AUDIENCE: Somehow

I get a feeling that the first column

doesn’t help us in understanding how

the algorithm work only using the last column. PROFESSOR: OK. Your comment, Eric,

is that you feel like the first column

doesn’t help us understand how the algorithm works, only

using the last column, right? OK. AUDIENCE: [INAUDIBLE] going back

to the first column of data. PROFESSOR: OK. Well let’s compute the LF

function of the character and the row for each

one of these things, OK? And that might help

you, all right? Because that’s the central part

of being able to reverse this transform. So this is, to be more clear,

I’ll make it more explicit. This is LF of I and BWT of

i, where i goes from 0 to 6. So what is that value

for this one right here? Anybody know? Well it would be Occ of

g, which is 6, right? Plus count of 0 and g,

which is going to be 0. Or I can look right

over here and see that in fact it’s 6, right? Because this occurrence

of g1 is right here. So this LF value is 6; c1 is in 4,

$ is in 0, a1 is in 1, a2 is in 2, a3 is in 3, c2 is in 5. So this is the LF

function: 6 4 0 1 2 3 5. And I don't need any

of this to compute it. Because it simply is equal

to, going back one slide, it’s equal to Occ

of c plus count. So it's going to be equal to

where that particular character starts on the left hand

side and its rank minus 1. And so these are

the values for LF. This is what I need to be

able to take this string and recompute the

original string. If I can compute this, I

don’t need any of that. And to compute this,

I need two things. I need Occ and I need count. All right? Now, I can tell you’re not

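The LF machinery just described can be written in a few lines of Python. This is a minimal sketch, not the lecture's actual code; it assumes the worked example is the string acaacg$, whose BWT is gc$aaac, consistent with the Occ values quoted here (a's at 1, c's at 4, g's at 6):

```python
def make_occ_and_rank(bwt):
    # ranks[i] = how many times bwt[i] occurs before row i (rank minus 1).
    counts, ranks = {}, []
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    # occ[ch] = row where ch's block starts in the sorted first column
    # ('$' sorts before a, c, g, t).
    occ, total = {}, 0
    for ch in sorted(counts):
        occ[ch] = total
        total += counts[ch]
    return occ, ranks

def lf(i, bwt, occ, ranks):
    # Map row i of the last column to the matching row in the first column.
    return occ[bwt[i]] + ranks[i]

bwt = "gc$aaac"  # BWT of acaacg$
occ, ranks = make_occ_and_rank(bwt)
print([lf(i, bwt, occ, ranks) for i in range(len(bwt))])  # [6, 4, 0, 1, 2, 3, 5]
```

The printed values match the 6 4 0 1 2 3 5 worked out on the board.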
quite completely satisfied yet. So maybe you can ask

me another question and it would be

very helpful to me. AUDIENCE: How did you get

G1’s last and first functions score being 6? PROFESSOR: OK. Let’s take that apart. We want to know what LF of

6 and where was that G1? G1 is 1 and 0, right? Sorry. LF of 1 and g is

equal to, right? Is that g and 1 or 0? Oop, sorry it’s in 0. So this is what you like

me to compute, right? OK what’s Occ of g? It’s how many characters

are less than g in the original string? I’ll give you a clue. It’s 1, 2, 3, 4, 5, 6. AUDIENCE: [INAUDIBLE]. PROFESSOR: No, it’s how many

characters are less than g in the original string. How many things are going

to sort in front of it? Where do the g's begin

in the sorted version? The g's begin in row 6. OK? So Occ of g is 6. Is that– are you getting

hung up on that point? AUDIENCE: Yes. How do you know that without

ever referencing back to the first 5 columns? PROFESSOR: Because when we

build the index we remember. AUDIENCE: Oh, OK. PROFESSOR: So we have to,

I haven’t told you this, but I need to compute,

I need to remember ways to compute Occ and

count really quickly. Occ is represented

by four values only. They’re where the a’s start,

the c’s start, the t’s start, and the g’s start. That’s all I need to

know, four integers. OK? Are you happy with that? Occ, the a’s start at 1. The c’s start at 4 and

the g’s start at 6. OK? That’s all I need to remember. But I precompute that. OK? Remember it. Are you happy with that no? AUDIENCE: Yes. PROFESSOR: OK. And then this business over

here of count of zero and g. Right? Which is how many

g’s, what’s the rank of this g in the

right hand side? 1. Minus 1 is 0. So that’s 0. That’s how we computed it. OK? These are great

questions because I think they’re foundational. Yeah. AUDIENCE: Occ and count

are both precomputed as you’re building on this

last column [INAUDIBLE]. PROFESSOR: They are

precomputed and well, I have not told you, see,

you’re sort of ahead of things a little bit in that I’d

hoped to suspend disbelief and that I could actually

build these very efficiently. But yes, they’re

built at the same time as the index is built. OK? But yes. OK and if I have Occ and

count and I have the string, then I can get this. But I can still need to

get to Dan’s question because he has been

very patient over there. He wants to know

how I can use this to find sequence

region of genome. And he’s just being

so good over there. I really appreciate that, Dan. Thanks so much for that. Are you happier? OK you're good. All right. So this is what we just did. The walk left algorithm

actually inverts the BWT by using this

function to walk left and using that

Python code up there, you can actually reconstruct

the original string you started with. So it’s just very simple. And we went through

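That walk-left inversion can be sketched as follows, again assuming the acaacg$ example; this is illustrative, not the slide's exact code:

```python
def invert_bwt(bwt):
    # Precompute Occ and per-row ranks, the two components of LF.
    counts, ranks = {}, []
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    occ, total = {}, 0
    for ch in sorted(counts):
        occ[ch] = total
        total += counts[ch]
    # Row 0 of the sorted matrix begins with '$', so its last-column
    # character is the final character of the text. Walk left from there.
    text, row = "$", 0
    while bwt[row] != "$":
        text = bwt[row] + text            # prepend: we move right to left
        row = occ[bwt[row]] + ranks[row]  # one LF step
    return text

print(invert_bwt("gc$aaac"))  # acaacg$
```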
it on the board. Are there any questions

about it at all? AUDIENCE: Yes. Can you actually

do this any column? Like, why are you using the

last column instead of like, why can’t you just change

it like the [INAUDIBLE], the equation and

make it work for– PROFESSOR: Because a

very important thing is that this is actually a

very important property right? Which is all the

suffixes are sorted here. And if we didn’t

do that though, I couldn’t answer Dan’s question. And he’d be very upset. So please be respectful

of his interest here. Now, the thing is,

the beauty of this is, is that I have all

these suffixes sorted and what you’re about to see is

the most amazing thing, which is that we’re going

to snap our fingers and, bang, we can

map 200 million reads in no time at all. You like that? You're laughing. Oh, oh,

that's not a good sign. Let's press ahead

fearlessly, OK? And talk about how we’re

going to use this to map reads. So we're going to

figure out how to use this index and this

transform to rapidly align reads to a reference genome. And we're not talking about

one read or 10 reads or 1 million reads. We’re talking about hundreds

of millions of reads. So it has to be very

efficient indeed, OK? So here’s the essential idea. There’s the core

algorithm on the slide. Which is that what we do is we

take the original query that we have the read that

we’re trying to match and we’re going to

process it backwards from the end of

the read forwards. And we begin by considering all

possible suffixes from row zero to, in this case, it

would be row seven. Which is the length of

the entire transform. And we iterate and

in each iteration we consider suffixes

that match the query. So in the first

step, right here, let me see if I can get my

pointer working, there we are. So in the first step

here, we're matching this c. OK? And we compute the

LF of the top, which is this row and of the bottom,

which is down here pointing off the end, and that takes

us to the first c here and to this point. Here are the two

c’s that could be possible matches to our

query, which ends in a c. We then say, oh, the next

character we have to match is an a. So we look here at the

a we need to match, and starting from this

row, which is row four, and this row, which is row six,

we compute the LF of each one of these to figure out what

rows in a precedes these c’s. And the way we compute the LF

is that we use the character a to be able to figure out which

rows have the a preceding the c. You can see, when we

compute those LF functions, what we wind up with

are these rows where we have a followed by c. So we’re beginning to

match the end of our read, as we go from right to left. We then compute the

same thing once again, considering the

first a and ask what rows are going to allow us to

put this a in front of the ac to form our entire read. And we compute the LF once

again of these things. And you can see that

here it takes us to this specific row aac. So that row represents

a suffix that is matching our query exactly. So we iterate this loop

to be able to match a read against the index. And we’re using the

LF function to do it. And it’s a really

beautiful algorithm. And remember, we only

have the transform. We don’t have the

rest of this matrix. So before I press ahead and

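In the spirit of the code on the slide, here is a sketch of that backward-search loop; the count function here is deliberately naive, since a real FM index answers it from checkpoints, as comes up shortly:

```python
def backward_search(bwt, query):
    # Occ: where each character's block starts in the sorted first column.
    occ, total = {}, 0
    for ch in sorted(set(bwt)):
        occ[ch] = total
        total += bwt.count(ch)

    def count(ch, i):
        # Occurrences of ch in bwt[:i]. Naive scan, for illustration only.
        return bwt[:i].count(ch)

    # top/bottom bracket the rows whose suffixes match the query so far.
    top, bottom = 0, len(bwt)
    for ch in reversed(query):          # process the read right to left
        top = occ.get(ch, 0) + count(ch, top)
        bottom = occ.get(ch, 0) + count(ch, bottom)
        if top == bottom:               # failed lookup: no such suffix
            break
    return top, bottom

print(backward_search("gc$aaac", "aac"))  # (1, 2): exactly one matching row
```

Matching aac against the acaacg$ example narrows the range from [4, 6) to [2, 4) to [1, 2), mirroring the three steps just described.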
talk about other details, I think it’s important to

observe a couple of things that are a little bit

counterintuitive about this. One counterintuitive

aspect of it is, that when I’m over here

for example, and for example when I’m computing the LF here,

I’m computing the LF of row two with respect to a. But there’s a dollar sign there. Right? So I’m using this

, the LF function, to tell me where a suffix

would be that actually follows my constraint of having to

have an a be the prefix of ac, where I am right now. This code is actually

not fake code. It’s the actual code

that’s in a matcher, for matching a read

against the index. Now let me just stop

right here for a second and see if there

any other questions. Dan is getting now his answer

to his question, right? About how you actually use

this for matching reads. You do this once for every read. And it is linear time. Right? The length

of the read itself is all the time it takes

to match in a huge genome. So once we’ve built the

index of the genome, in fact, most of

the time when you’re doing this sort of mapping,

you don’t build the index. You download the index

off of a website. And so you don’t have to pay for

the time to build this index. You just download the index

and you take your reads and the time to match

all of your sequencing reads against a certain build

of whatever genome you’re using is simply linear in the

number of bases you have. Questions? Yes. And say your name and

the question, please. AUDIENCE: How did

[INAUDIBLE] come up with intuition [INAUDIBLE]? It seems like they just

pulled it out of a hat. PROFESSOR: You know, I asked

Mike that the other day. I saw him in a meeting

and he was sort of surprised at how this has taken off. And he told me some

other interesting facts about this, which you

probably could deduce. Which is that if you

only want to match reads that are

four long, you only have to sort this matrix by

the first four characters. But there are other little

tricks you can play here. Any other questions? Yes. AUDIENCE: I’m Deborah. What is the FM index? PROFESSOR: What is the FM index. Well, the guys who

thought this up have the last

initials of F and M, but that’s not

what it stands for, contrary to popular opinion. It stands for full

text, minute space. That's what they claim. So if you hear people talking

about full-text minute-space indices, or FM

indices, the FM index is actually the part that was

being asked over here, the Occ part and the LF part, how you

actually compute those quickly. That was what F and M contributed

to this but, generically when we’re talking about

this style of indexing, it’s called FM indexing or you

might hear, I’m using a BWT. Some people will say that. But that’s what FM stands for. Does that answer your question? Excellent. OK. All right. Any– these are all

great questions. Yes. AUDIENCE: [INAUDIBLE]. PROFESSOR: Oh, you

don’t know that a and c are there,

except that remember, if you look at the way

that this is working, is that you’re not actually

reconstructing strings, you’re only trying to find them. Right? And so at the end,

top and bottom are going to point to the row

that contains the suffix where your original read was. And now your next

question is going to be, where is that in the genome? This doesn’t do me any good. I mean, the number 1

doesn’t help me out here, doesn’t mean anything. Not good, right? So where is it in the

genome is the next question. So we’ll get to

that in a second. What happens if you

give me a read that doesn’t match anywhere

in this index? Well if you give me a read

that doesn’t match anywhere in this index, what happens

is the top and bottom become the same. So when top and bottom become the

same, it's a failed lookup. All right? And that's because the suffix

doesn’t exist in the index. And once top and

bottom become the same, they remain the same

throughout that loop. Yes. AUDIENCE: I’m Sally. My main question is

that this doesn’t provide any leeway for errors. You kind of have to be able

to present all of your reads. PROFESSOR: Sally, you're

absolutely correct. I’m so glad you

asked that question. Your observation is

it does not provide any leeway for mismatches. And so unlike all the

other algorithms we study, which had these very

nice matrices and ability to assign weights

to mismatches, this is only doing exact matching. And so what you need

help understanding is, how we can deal

with mismatches in the presence of this. And I’ll get to that in

less than 10 minutes. And it won’t be quite as elegant

as what you saw from Professor Burge, but it's what

everybody does. So that’s my only

excuse for it, OK? Yes. AUDIENCE: What is the

bottom set of arrows doing? What’s its significance? PROFESSOR: The significance

of top and bottom, that’s a great question. What the significance

of top and bottom? Top and bottom bracket

in that original matrix, the suffixes that are

matching the original query. OK? And so between top

and bottom minus 1 are all of the rows

that have a suffix that match our original query. And if top equals bottom,

there are no matching suffixes. But assuming there

are matching suffixes, those are rows that

contain a matching suffix. And as we progress along,

top and bottom change as the suffixes change

as we expand the suffix to contain more and more bases. OK? OK. Any other questions? OK. So now back to the question

over here, which is that OK, I know that we’ve matched, and

I know that we have this hit. The question is, where is it

in the Genome because the fact that it matched row

one of my BWT matrix does me absolutely

no good at all. Right? Anybody have any ideas about

how we could figure out where it is the genome? Given what I’ve told you so far,

which is that you have the BWT, you have Occ, and

you have count, and you can compute

the LF function. Any ideas? Doesn’t matter how slow it is. OK well how could

I figure out what came before aac in the genome? Yes. AUDIENCE: So, for,

like at the beginning, we rebuilt this whole

string, starting at the end. You could rebuild the

whole string starting at, rebuild the whole genome

starting at the 1 and see– PROFESSOR: You could rebuild

the entire genome that prepends or occurs

before aac, right? AUDIENCE: Exactly. PROFESSOR: Exactly. So that’s what we can do. We could actually do

our walk left algorithm. We can walk left

from there, find out that we go two steps

until we hit dollar sign, and therefore,

the offset is two, where it occurs in the genome. So we can give a match

position by walking left. Does everybody see

that, that we could walk left to figure where it is? It’s not fast, but it works. Yes. AUDIENCE: I’m Ted. PROFESSOR: Hi, Ted. AUDIENCE: So now our function

first has to take the read and it has to align

it and the same, where the position is built

into the end of the function, the speed of the function is now

dependent on position as well. Is that right? Because the longer it takes, PROFESSOR: I was

being a little bit glib: finding if it matches

or not is linear time. Now you're saying,

hey, wait a minute. I want to know where

it is in the genome. That’s a big bonus, right? And so you’d like to

know where that is? Yes. But I still can do

that in linear time. And we’ll show you how

to do that in a second. This is not linear time. This actually

needs to walk back on the order of the length of the genome

for every single query. That's not good. All right? Well, what we could do is, we could

store what’s called a suffix array with each row and

say, where in the genome that position is. Where that row starts. And then it's a simple look-up: when you actually have

a hit in row one, ah, OK, you start at position

two of the genome. But then the

problem with that is that it actually

takes a lot of space. And we want to have

compact indices. So the trick is what we

do is, instead of storing that entire suffix

array, we store every so many rows,

like every 25 rows. And all we do is, we

walk left until we hit a row that

actually has the value, and then we add how many times

we walked left, plus the value and we know where we

are in the genome. So we can sample

the suffix array, and by sampling

the suffix array, we cut down our storage

hugely and it’s still pretty efficient. Because what we

can do is, we just walk left until we hit a

sample suffix array location and then add the two

numbers together. All right? So that’s how it’s done. OK? So that’s how we

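That sampled suffix array lookup can be sketched as follows, hypothetically keeping the suffix-array values only for the even-numbered rows of the acaacg$ example; the sampled rows and spacing here are made up for illustration:

```python
def locate(row, bwt, sampled_sa):
    # LF components, as before.
    counts, ranks = {}, []
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    occ, total = {}, 0
    for ch in sorted(counts):
        occ[ch] = total
        total += counts[ch]
    # Walk left until we land on a row whose suffix-array value was kept,
    # then add the number of steps taken to that stored value.
    steps = 0
    while row not in sampled_sa:
        row = occ[bwt[row]] + ranks[row]  # one LF step = one base left
        steps += 1
    return sampled_sa[row] + steps

sampled = {0: 6, 2: 0, 4: 1, 6: 5}  # even rows of the full suffix array
print(locate(1, "gc$aaac", sampled))  # 2: aac starts at offset 2
```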
actually do the alignment and figure out where things are. The one thing I

haven’t told you about is how to compute

count efficiently. Now remember what count does. Count is a function–

but this is putting it all together where we’re

matching this query, we do the steps. We get the match. Then we do walk

left once and then we look at the suffix to

figure out where we are, right? The business about count

is that what we need to do is to figure out the

rank of a particular base in a position in the transform. And one way to do that is

to go backwards to the whole transform, counting how many

g's occur before this one, and that's very expensive,

to compute the rank of this particular g. Remember the rank is simply

the number of g’s that occur before this

one in the BWT. Very simple metric. So instead of

doing that, what we can do is, we can build a

data structure that every once in awhile, counts how

many a’s, c’s, g’s, and t’s have occurred

before now in the BWT. And so we’re going to sample

this with these checkpoints, and then when you want to

compute count at any point, you can go to the nearest

checkpoint, wherever that is, and make an

adjustment by counting the number of characters

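A toy version of that checkpointing idea, with a made-up spacing of 4 (real indexes space checkpoints much further apart):

```python
SPACING = 4  # checkpoint every 4 positions; illustrative only

def build_checkpoints(bwt):
    # Cumulative character counts, snapshotted every SPACING positions.
    cps, running = [], {}
    for i, ch in enumerate(bwt):
        if i % SPACING == 0:
            cps.append(dict(running))
        running[ch] = running.get(ch, 0) + 1
    return cps

def count(bwt, cps, ch, i):
    # Occurrences of ch in bwt[:i]: take the nearest checkpoint at or
    # below i, then a short scan adjusts from the checkpoint up to i.
    cp = i // SPACING
    return cps[cp].get(ch, 0) + bwt[cp * SPACING : i].count(ch)

bwt = "gc$aaac"
cps = build_checkpoints(bwt)
print(count(bwt, cps, "a", 6))  # 3 a's occur before position 6
```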
between you and that checkpoint. Very straightforward. All right. So, coming back to

the question about time that was asked: you need to

build this checkpointing mechanism at the same time you

build the index, as well as the sampling of

the suffix array. So a full index consists

of the transform itself, which is the genome

transformed into its BWT. And they literally take the

entire genome and do this. Typically they’ll put dollar

signs between the chromosomes. So they’ll transform

the whole thing. It takes a sampling

of the suffix array we just saw and it takes

the checkpointing of the LF function to make

a constant time. And that’s what is

inside of an FM index. OK? Now it’s small, which is

one of the nice things, compared to things

like suffix trees, suffix arrays, or even other

kinds of hash structures for looking for seeds, it

really is not even twice the size of the genome. So it’s a very compact index

that is very, very efficient. And so it’s a wonderful

data structure for doing what we’re

doing, except we have not dealt with mismatches

yet, right? And so once again,

I want to put a plug in for BWA which is really

a marvelous aligner. And we’ll talk about

it tomorrow in recitation if you want to know all of the

details of what it actually takes to make this

work in practice. Now, it finds exact

matches quickly, but it doesn’t really have

any allowances for mismatches. And the way that Bowtie and

other aligners deal with this, and they’re all

pretty consistent, is in the following way, which

is that they do backtracking. Here’s the idea. You try and match

something or match a read and you get to a particular

point in the read, and you can’t go any further. Top is equal to bottom. So you know that

there’s no suffix in the genome that

matches your query. So what do you do? Well, what you can do is you can

try all of the different bases at that position

besides the one you tried to see whether

it matches or not. I can see the horror

coming over people. Oh, no, not

backtracking, not that. But sometimes it actually works. And just to give you order of

magnitude idea about how this works in practice,

when reads don’t match, they limit backtracking to about

125 times in these aligners. So they try pretty hard

to actually match things. And yes, it is true that

even with this backtracking, it’s still a great approach. And sometimes the first

thing you try doesn’t work, and you have to backtrack,

trying multiple bases at that location until

you get one that matches. And then you can proceed. OK. And you eventually

wind up at the alignment you see in the lower right

hand corner, where you’re substituting a g for an a,

an a for a g, excuse me, to make it go forward. Do people understand

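A bare-bones sketch of backward search with backtracking: at each position, try the read's own base first at cost zero and any substituted base at cost one, pruning once the mismatch budget is spent. This is illustrative only; real aligners such as Bowtie order the tries by base quality and cap the total number of backtracks:

```python
def fm_search_mm(bwt, query, max_mm):
    alphabet = [ch for ch in sorted(set(bwt)) if ch != "$"]
    occ, total = {}, 0
    for ch in sorted(set(bwt)):
        occ[ch] = total
        total += bwt.count(ch)

    def count(ch, i):
        return bwt[:i].count(ch)  # naive; checkpointed in a real index

    hits = set()

    def go(pos, top, bottom, mm):
        if top == bottom:   # range emptied: dead end, backtrack
            return
        if pos < 0:         # consumed the whole read: record the rows
            hits.update(range(top, bottom))
            return
        # Query base first (cost 0), then substitutions (cost 1 each).
        for ch in [query[pos]] + [c for c in alphabet if c != query[pos]]:
            if ch not in occ:
                continue
            cost = 0 if ch == query[pos] else 1
            if mm + cost <= max_mm:
                go(pos - 1, occ[ch] + count(ch, top),
                   occ[ch] + count(ch, bottom), mm + cost)

    go(len(query) - 1, 0, len(bwt), 0)
    return hits

print(sorted(fm_search_mm("gc$aaac", "agc", 1)))  # [1]
```

With one mismatch allowed, the query agc reaches row 1 of the acaacg$ example by substituting its middle base to match aac.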
the essential idea of this idea of backtracking? Does anybody have any comments

or questions about it? Like ew, or ideas? Yes. AUDIENCE: What about gaps? PROFESSOR: What about gaps? BWA, I believe, processes gaps. But gaps are much, much less

likely than mismatched bases. The other thing is that if

you’re doing a sequencing library, and you have a read

that actually has a gap in it, it’s probably the case you

have another read that doesn’t. For the same sequence. So it is less important

to process gaps than it is to

process differences. The reason is that

differences mean that it might be a

difference of an allele. In other words, it

might be that your base is different than

the reference genome. Indels are also possible. And there are

different strategies of dealing with those. That would be a great question

for Hang tomorrow about gaps. Because he can tell you

in practice what they do. And we’ll get into

a little bit of that at the end of today’s lecture. Yes. Question? AUDIENCE: I’m Levy. PROFESSOR: Hi, Levy. AUDIENCE: How do

you make sure when you’re backtracking

that you end up with the best possible match? Do you just go down the first– PROFESSOR: The

questions is how do you guarantee you wind up with

the best possible match? The short answer

is that you don’t. There’s a longer

answer, which we’re about to get to, about how

we try to approximate that. And what judgment you would use

to get to what we would think is a practically good match. OK? But in terms of

theoretically optimal, the answer is, it doesn’t

attempt to do that. That’s a good question. Yes. AUDIENCE: In practice, does

this backtracking method at the same time as you’re

computing the matches or– PROFESSOR: Yes. So what’s happening is,

you remember that loop where we’re going

around, where we were moving the top

and bottom pointers. If you get to a point

where they come together, then you would at that

point, begin backtracking and try different bases. And if you look, I

posted the BWA paper on the Stellar website. And if you look in

one of the figures, the algorithm is there,

And you’ll actually see, if you can deconvolute

what’s going on, that inside the loop, it’s

actually doing exactly that. Yes. Question. AUDIENCE: In practice, is the

number of errors is small, would it make sense

just to use [INAUDIBLE]? PROFESSOR: We’re

going to get to that. I think the question was, if

the number of errors is small, would it be good to actually

use a different algorithm, drop into a different algorithm? So there are algorithms

on FM index assisted Smith Waterman, for example. Where you get to

the neighborhood by a fast technique and

then you do a full search, using a more in depth

principles methodology, right? And so there’s some papers

I have in the slides here that I referenced

that do exactly that. These are all great questions. OK. Yes. AUDIENCE: If you’re– If

you only decide to backtrack a certain number of

times, like 100 times, then wouldn’t like the alignment

be biased towards the end of the short read? PROFESSOR: I am so glad

you asked this question. The question is, and

what was your name again? AUDIENCE: Kevin. PROFESSOR: Kevin. Kevin asks, gee, if you’re

matching from the right to the left, and you’re

doing backtracking, isn’t this going to

be biased towards the right end of the read,

in some sense, right? Because if the right end of

the read doesn’t match, then you’re going

to give up, right? In fact, what we know is

the left end of the read is the better end of the read. Because sequences are done five

prime to three prime and thus typically, the highest quality

scores or the best quality scores are in the left

hand side of the read. So do you have any idea about

how you would cope with that? AUDIENCE: You could just

reverse one of them. But you'd reverse– PROFESSOR: Exactly what they do. They take the entire genome

and they reverse it and then they index that. And so, when they create

what’s called a mirror index, they just reverse

the entire genome, and now you can

match left to right, as opposed to right to left. Pretty cool, huh? Yeah. So backtracking, just note that

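The mirror index construction is easy to see with a naive BWT builder (naive in that it materializes and sorts every rotation; as noted, real builders avoid that):

```python
def bwt_of(text):
    # Sort all rotations of the '$'-terminated text; the BWT is the
    # last column of the sorted rotation matrix.
    assert text.endswith("$")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

genome = "acaacg"
forward_bwt = bwt_of(genome + "$")       # index for right-to-left matching
mirror_bwt = bwt_of(genome[::-1] + "$")  # reversed genome: left-to-right
print(forward_bwt, mirror_bwt)  # gc$aaac accaag$
```

With both indices on hand, an aligner can consume a read starting from whichever end has the better quality scores.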
there are different alignments that can occur across

different backtracking paths. And this is not

optimal in any sense. And to your question

about how you actually go about picking a

backtracking strategy, assuming we’re matching

from right to left again for a moment, what you can

do is, if you hit a mismatch, you backtrack to the lowest

quality based position, according to PHRED scores. We talked about

PHRED scores earlier, which are shown

here on the slide. And you backtrack there and

you try a different base and you move forward from there. So you’re assuming

that the read, which is the query, which is

associated quality scores, is most suspect where the

quality score is the lowest. So you backtrack to the right

to the leftmost lowest quality score. Now it’s a very simple approach. Right? And we talked a little

bit about the idea that you don’t necessarily want

to match from the right side and thus, typically the

parameters to algorithms like this include,

how many mismatches are allowed in the first L

bases on the left end, the sum of the mismatch qualities

you’re going to tolerate, and so forth. And you’ll find that these align

yourself with a lot of switches that you can set. And you can consult

with your colleagues about how to set switches,

because it depends upon the particular type

of data you’re aligning, the length of the

reads and so forth. But suffice it to say,

when you’re doing this, typically we create these mirror

indices that actually reverse the entire genome

and then index it. So we can either match either

right to left or left to right. And so for example, if

you have a mirror index, and you only tolerate

up to two errors, then you know that

either, you’re going to get the

first half right in one index or

the mirror index. And so you can use both indices

in parallel, the forward and the reverse

index of the genome, and then get pretty

far into the read before you have to

start backtracking. There are all these

sorts of techniques, shall we say, to

actually overcome some the limitations of backtracking. Any questions about

backtracking at all? Yes. AUDIENCE: Is it

trivial knowing the BWT originally to find

the mirror BWT? Like for example, PROFESSOR: No, it’s not trivial. AUDIENCE: So it’s not

like a simple matrix transforming [INAUDIBLE]. PROFESSOR: No. Not to my knowledge. I think you start with the

original genome, you reverse it and then you compute

the BWT with that. Right? That’s pretty easy to do. And Hang was

explaining to me today his new

ways of computing the BWT, which don't actually involve

sorting the entire thing. There are insertion ways of

computing the BWT that are very [INAUDIBLE], and you could

ask him this question tomorrow if you care to come. All right just to give you an

idea on how complex things are, to build an index

like this, takes, for the entire

human genome, we’re talking five hours

of compute time to compute an index to give

you an order of magnitude time for how to compute the BWT, the LF checkpoints, and

the suffix array sampling. Something like that. So it’s really not

too bad to compute the index of the entire genome. And to do searches, you

know, we’re talking about, like on a four

processor machine, we’re talking

about maybe upwards of 100 million reads

per hour to map. So if you have 200

million reads and you want to map them to a genome,

or align them as it’s sometimes called, it’s going to take

you a couple hours to do it. So this is sort of the order of

magnitude of the time required to do these sorts of functions. And there are a

couple fine points I wanted to end with today. The first is we haven’t

talked at all about paired-end read alignment. In paired-end read alignment,

you get, for each molecule, you get two reads. One starting at the five

prime end on one side, and one starting from the five

prime end on the other side. So typical read lengths might

be 100 base pairs on the left and 100 base pairs on the right. What is called the insert

size is the total size of the molecule from five prime

end to five prime end. And the stuff in the

middle is not observed. We actually don’t

know what it is. And we also don’t

know how long it is. Now when these

libraries are prepared, size selection is

done, so we get a rough idea of

what it should be. We can actually

compute by looking at where things align on the

genome, what it actually is. But we don’t know absolutely. If we were able to

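With hypothetical coordinates, the insert-size estimate from an aligned pair is just the span between the two five-prime ends (all numbers here are made up for illustration):

```python
# Left read's 5' end maps at left_start; the mate's 5' end is the
# rightmost base of its alignment. Both reads are 100 bp here.
read_len = 100
left_start = 10_000    # hypothetical alignment coordinate (0-based)
right_start = 10_250   # hypothetical leftmost coordinate of the mate

# Estimated insert size: 5' end to 5' end.
insert_size = (right_start + read_len) - left_start
print(insert_size)  # 350
```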
strictly control the length of the

unobserved part, which is almost impossible to do, then

we would get molecular rulers. And we would know

exactly down to the base, whether or not there were

indels between the left read and the right read when

we did the alignment. We actually don’t

have that today. The sequencing

instrument actually identifies the read

pairs in its output. That’s the only way to do this. So when you get an output

file, like a FASTQ file, from a sequencing

instrument, it will tell you, for a given molecule,

here’s the left read and here’s the right read. Although left and right are

really sort of misnomers because there really is

no left and right, right? This is one end and then

this is the other end. Typical ways of

processing these paired reads, first you align

left and right reads. And they could really

only be oriented with respect to a

genome sequence where you say that one has a lower

coordinate than the other one when you’re actually

doing the alignment. And if one read fails to align

uniquely, then what you can do is, you know what neighborhood

you’re in because you know, roughly speaking, what

the insert size is, so you can do Smith Waterman

to actually try and locate the other read in

that neighborhood. Or you can tolerate

multiply mapped reads. One thing that I did not

mention to you explicitly, is that when you match

the entire query, and top and bottom are more

than one away from each other, that means you’ve got

many places in the genome that things map. And thus you may report

all of those locations, or just report the first one. So that's one bit of insight

into how to map paired reads. And these are becoming

very important because as sequencing

costs go down, people are doing

more and more paired-end

because they give you much more information about the

original library you created and for certain protocols can

allow you to localize events in the genome far

more accurately. Final piece of advice

on considerations for read alignment. We talked about the idea that

some reads will map or align uniquely to the genome

and some will multimap. You know that the genome is

roughly 50% repeat sequence. And thus it’s likely that if

you have a particular read molecule, there’s

a reasonable chance that it will map to

multiple locations. Is there a question here? No. OK. You have to figure out what your

desired mismatch tolerance is when you’re doing alignment

and set the parameters to your aligner carefully,

after reading the documentation thoroughly, because

as you could tell, there’s no beautiful

matrix formulation like there is with a

well established basis in the literature,

rather it’s more ad hoc. And you need to figure

out what the desired processing is for paired reads. So what we’ve talked

about today is we started off talking about

library complexity and the idea that when we get a bunch

of reads from a sequencer, we can use that collection

of reads to estimate the complexity of

our original library and whether or

not something went wrong in the biological

processing that we were doing. Assuming it’s a

good set of reads, we need to figure out where

they align to the genome. So we talked about this idea

of creating a full-text minute-space index, which involves

a Burrows-Wheeler transform. And we saw how we can

compute that and throw away almost everything else

except for the BWT itself, the suffix array

checkpoints, and the FM index checkpoints to be able

to reconstruct this at a relatively modest increase

in size over the genome itself and do this very,

very rapid matching, albeit with more problematic

matching of mismatches. And then we turned

to the question of how to deal with those

mismatches with backtracking and some fine points on

paired end alignment. So that is the end

of today’s lecture. On Tuesday of next

week, we’ll talk about how to actually construct

a reference genome, which is a really neat thing

to be able to do, take a whole bunch of reads, put

the puzzle back together again. I would encourage you to

make sure you understand how this indexing

strategy works. Look at the slides. Feel free to ask any of us. Thanks so much for

your attention. Have a great weekend. We'll see you next Tuesday.
