0%

Genomics HW4速成指南

Genomics HW4 速成指南

这次的作业中只有第一题有一些难度。因为要用动态规划进行序列比对。更细致的来讲,使用Smith-Waterman算法进行局部比对。

动态规划(英语:Dynamic programming,简称DP)是一种在数学、管理科学、计算机科学、经济学和生物信息学中使用的,通过把原问题分解为相对简单的子问题的方式求解复杂问题的方法。(From Wikipedia)

Smith-Waterman算法中涉及到一个打分矩阵,简单的来说就是先将第一行和第一列赋值为0,再按照下面的打分公式对每一个小格进行填充,并记录下来最大打分分数的来源,最后从最大分数开始回溯到0,就是两个序列之间的最佳匹配。设MijM_{ij}为这个以小格子的填充分数,那么打分的公式有:

F(i,j)=max{F(i1,j1)+s(x_i,y_j)F(i1,j)+GapPenaltyF(i,j1)+GapPenalty 0F(i,j) = max \begin{cases} F(i-1,j-1)+s(x\_i,y\_j)\\ F(i-1,j)+GapPenalty \\F(i,j-1)+ GapPenalty \\\ 0 \end{cases}

用python3实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
#Genomics Homework4 
#Use DP algorithm to find the highest scored alignment
#1. Calculate the alignment score of seq1&seq2
#2. find the best alignment
#3. Calculate the best alignment score

seq1 = "ACGCNCTTTTCTCGCGTACCTTTACTAATAGAATGCAAAGACGTCCCCCG"
seq2 = "CTCGCGTACCTTTACTAAGAGAATGCGNAGACGTCCCCCGGNGGACCGAT"

match_award = 1
mismatch_penalty = -2
N = -1
gap_penalty = -3
Extendgap_recalibration = 1 #延伸gap的罚分是-2,检测到上一个来自于上方向或者左方向,就给gap_penalty+1

def alignScore(seq1,seq2):
score =0
for i,j in zip(seq1,seq2):
if(i == "N" or j == "N"):
score -=1
elif (i == "-" or j =="-"):
score -= 2
elif (i == j):
score +=1
else :
score -=2 #mismatch
return score

# 初始化打分表
def zeros(shape):
retval = []
for x in range(shape[0]):
retval.append([])
for y in range(shape[1]):
retval[-1].append(0)
return retval

# 计算得分
def match_score(alpha, beta):
if alpha == beta:
return match_award
elif alpha == 'N' or beta == 'N':
return N
elif alpha == '-' or beta == '-':
return gap_penalty
else:
return mismatch_penalty

def water(seq1, seq2):
m, n = len(seq1), len(seq2) # length of two sequences

# Generate DP table and traceback path pointer matrix
score = zeros((m+1, n+1)) # the DP table
pointer = zeros((m+1, n+1)) # to store the traceback path

max_score = 0 # initial maximum score in DP table

# Calculate DP table and mark pointers
for i in range(1, m + 1):
for j in range(1, n + 1):#第一行第一列都是0,所以循环从1开始。
score_diagonal = score[i-1][j-1] + match_score(seq1[i-1], seq2[j-1])
#填充的格子从1开始,填充格子的1对应序列的0,所以match_score的下表要减去1
score_up = score[i][j-1] + gap_penalty
score_left = score[i-1][j] + gap_penalty

if (pointer[i][j-1] == ( 2)):
score_up = score[i][j-1] + gap_penalty + Extendgap_recalibration
if (pointer[i-1][j] == (1 )):
score_left = score[i-1][j] + gap_penalty + Extendgap_recalibration

score[i][j] = max(0,score_left, score_up, score_diagonal)

if score[i][j] == 0:
pointer[i][j] = 0 # 0 means end of the path
if score[i][j] == score_left:
pointer[i][j] = 1 # 1 means trace up
if score[i][j] == score_up:
pointer[i][j] = 2 # 2 means trace left
if score[i][j] == score_diagonal:
pointer[i][j] = 3 # 3 means trace diagonal
if score[i][j] >= max_score:
max_i = i
max_j = j
max_score = score[i][j]

align1, align2 = '', '' # initial sequences

i,j = max_i,max_j # indices of path starting point

#traceback, follow pointers
while pointer[i][j] != 0:
if pointer[i][j] == 3:
align1 += seq1[i-1]
align2 += seq2[j-1]
i -= 1
j -= 1
elif pointer[i][j] == 2:
align1 += '-'
align2 += seq2[j-1]
j -= 1
elif pointer[i][j] == 1:
align1 += seq1[i-1]
align2 += '-'
i -= 1
#[align1,align2,symbol,maxscore]
answerlist= finalize(align1, align2)
answerlist.append(max_score)
return answerlist

def finalize(align1, align2):
align1 = align1[::-1] #reverse sequence 1
align2 = align2[::-1] #reverse sequence 2

i,j = 0,0

#输出结果: alignment
symbol = ''
identity = 0
for i in range(0,len(align1)):
# if two AAs are the same, then output the letter
if (align1[i] == align2[i]):
symbol += "*"
# if one of them is N
elif (align1[i] =="N" or align2[i]=="N"):
symbol += "N"
# if they are not identical and none of them is gap
elif (align1[i] != align2[i] and align1[i] != '-' and align2[i] != '-'):
symbol += 'M'
#if one of them is a gap, output a -
elif (align1[i] == '-' or align2[i] == '-'):
symbol += '-'

return[align1,align2,symbol]

def printer(answerlist):
#answerlist: [align1,align2,symbol,maxscore]
print("seq1 " + answerlist[0])
print("seq2 " + answerlist[1])
print("symb " + answerlist[2])
print("Best Alignment score: " + str(answerlist[3]))


print("Alignment socre: " + str(alignScore(seq1,seq2)))
printer(water(seq1,seq2))