I'm writing a Python program that has to compute a numerical coding of the mutated residues and their positions for a set of strings. These strings are protein sequences. The sequences are stored in a FASTA-format file, with each protein sequence separated by a comma. The sequence lengths may differ between proteins. I tried to find the positions and residues that are mutated, and used the following code for this:
a = 'AGFESPKLH'
b = 'KGFEHMKLH'
# Report every position where the two sequences differ
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
But I want to use the sequence file as input. The following figure describes my project: the first box shows the alignment of the input file sequences, and the last box shows the output file. How can I do this in Python? Please help me. Thank you, everyone, for your time.
Example:
Input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
positions          1 2 3 4 5 6
protein sequence1  M T A Q D D
protein sequence2  M T A Q D D
protein sequence3  M T S Q E D
protein sequence4  M T A Q D D
protein sequence5  M K A Q H D
PROTEIN SEQUENCE ALIGNMENT

positions          1 2 3 4 5 6
protein sequence1    T A   D
protein sequence2    T A   D
protein sequence3    T S   E
protein sequence4    T A   D
protein sequence5    K A   H
DISCARD NON-VARIABLE REGION

positions          2 2 3 3 5 5 5
protein sequence1  T   A   D
protein sequence2  T   A   D
protein sequence3  T     S   E
protein sequence4  T   A   D
protein sequence5    K A       H
MUTATED RESIDUE IS SPLIT INTO SEPARATE COLUMNS
Output file should be like this:
position+residue  2T 2K 3A 3S 5D 5E 5H
sequence1         1  0  1  0  1  0  0
sequence2         1  0  1  0  1  0  0
sequence3         1  0  0  1  0  1  0
sequence4         1  0  1  0  1  0  0
sequence5         0  1  1  0  0  0  1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Solution
If you are working with tabular data, consider pandas:
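from pandas import DataFrame
data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
# One row per sequence, one column per position
df = DataFrame([list(row) for row in data.split(',')])
# One 0/1 column per (position, residue) pair; keep the result in df
df = DataFrame({str(col) + val: (df[col] == val).astype(int)
                for col in df.columns for val in sorted(set(df[col]))})
print(df)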
Output:
   0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1
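The question asks for the sequences to come from a file rather than a hard-coded string. A minimal sketch of that step, assuming the file holds comma-separated sequences as in the example. The helper names `read_sequences` and `pad_sequences` and the file name `sequences.txt` are hypothetical and not part of the original answer, and padding unequal lengths with `-` is only one possible way to handle them:

```python
def read_sequences(path):
    """Return the comma-separated sequences stored in a text file."""
    with open(path) as handle:
        text = handle.read()
    # Split on commas, then drop surrounding whitespace and empty entries
    return [seq.strip() for seq in text.split(',') if seq.strip()]


def pad_sequences(sequences, gap='-'):
    """Right-pad shorter sequences with a gap character (an assumption)."""
    if not sequences:
        return []
    width = max(len(seq) for seq in sequences)
    return [seq.ljust(width, gap) for seq in sequences]
```

The result can stand in for `data.split(',')` in the snippet above, for example `DataFrame([list(row) for row in pad_sequences(read_sequences('sequences.txt'))])`. Right-padding does not align the sequences biologically; if the proteins genuinely differ in length, a proper alignment step would probably be needed first.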
If you want to drop the columns that are all ones (positions that never vary):
# df is the 0/1 table printed above; keep the columns that are not all ones
print(df.loc[:, ~df.all()])
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
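The tables above number positions from 0 and label rows by index, whereas the output requested in the question uses 1-based positions and `sequence1` to `sequence5` as row labels. A plain-Python sketch (not the pandas approach above) that produces that layout; `encode_variable_positions` is a hypothetical helper name, and the sketch assumes all sequences have the same length:

```python
def encode_variable_positions(sequences):
    """Return (labels, rows): 0/1 codes for residues at variable positions."""
    labels = []
    # zip(*sequences) yields one tuple of residues per alignment column
    for pos, column in enumerate(zip(*sequences), start=1):
        residues = sorted(set(column))
        if len(residues) > 1:  # keep only positions where residues differ
            labels.extend((pos, res) for res in residues)
    rows = [[int(seq[pos - 1] == res) for pos, res in labels]
            for seq in sequences]
    return ['%d%s' % label for label in labels], rows


sequences = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
labels, rows = encode_variable_positions(sequences)
print('position+residue', *labels)
for number, row in enumerate(rows, start=1):
    print('sequence%d' % number, *row)
```

Residues are sorted alphabetically within each position, so the header reads `2K 2T 3A 3S 5D 5E 5H` rather than the `2T 2K ...` order shown in the question; the codes themselves are the same once the columns are matched up.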