Numerical coding of mutated residues and positions

2023-09-11 23:34:48 Author: 毁誉由人

I'm writing a Python program that has to compute a numerical coding of the mutated residues and positions in a set of strings. These strings are protein sequences. The sequences are stored in a FASTA-format file, and each protein sequence is separated by a comma. The sequence lengths may differ between proteins. I tried to find the positions and residues that are mutated, and used the following code to do so.

a = 'AGFESPKLH'
b = 'KGFEHMKLH'
# Report every index (0-based) where the two sequences differ.
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
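
The loop above compares exactly two strings of equal length. A minimal sketch of the same idea for any number of sequences, including sequences of different lengths, might look like the following. The function name `variable_positions` and the `gap` padding character are assumptions made for illustration; they do not come from the question.

```python
from itertools import zip_longest


def variable_positions(seqs, gap="-"):
    """Return {1-based position: sorted residues} for every non-constant column.

    Hypothetical helper: shorter sequences are padded with `gap` (an assumed
    convention), so a missing residue counts as a difference.
    """
    variable = {}
    # zip_longest walks the sequences column by column, padding short ones.
    for pos, column in enumerate(zip_longest(*seqs, fillvalue=gap), start=1):
        residues = set(column)
        if len(residues) > 1:  # more than one residue means the position varies
            variable[pos] = sorted(residues)
    return variable


seqs = "MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD".split(",")
print(variable_positions(seqs))
# {2: ['K', 'T'], 3: ['A', 'S'], 5: ['D', 'E', 'H']}
```

Positions are reported 1-based here to match the figure in the question, whereas the loop above prints 0-based indices.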

But I want to use the sequence file as the input. The following figure describes my project: the first box represents the alignment of the input file sequences, and the last box represents the output file. How can I do this in Python? Please help me. Thank you all for your time.

Example:

Input file

MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD




        positions  1  2  3  4  5  6                         1  2  3  4  5  6

protein sequence1  M  T  A  Q  D  D                            T  A     D

protein sequence2  M  T  A  Q  D  D                            T  A     D

protein sequence3  M  T  S  Q  E  D                            T  S     E

protein sequence4  M  T  A  Q  D  D                            T  A     D

protein sequence5  M  K  A  Q  H  D                            K  A     H


     PROTEIN SEQUENCE ALIGNMENT                          DISCARD NON-VARIABLE REGION

        positions  2  2  3  3  5  5  5

protein sequence1  T     A     D   

protein sequence2  T     A     D   

protein sequence3  T        S     E

protein sequence4  T     A     D   

protein sequence5     K  A           H

   MUTATED RESIDUES ARE SPLIT INTO SEPARATE COLUMNS

The output file should look like this:

position+residue   2T  2K  3A  3S  5D  5E  5H

       sequence1   1   0   1   0   1   0   0

       sequence2   1   0   1   0   1   0   0

       sequence3   1   0   0   1   0   1   0

       sequence4   1   0   1   0   1   0   0

       sequence5   0   1   1   0   0   0   1

    (RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)

Solution

If you are working with tabular data, consider pandas:

import pandas as pd

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'

# One row per sequence, one column per (0-based) position.
df = pd.DataFrame([list(row) for row in data.split(',')])

# One 0/1 column per position+residue pair; sorted() keeps the column order stable.
coded = pd.DataFrame({str(col) + val: (df[col] == val).astype(int)
        for col in df.columns for val in sorted(set(df[col]))})
print(coded)

output:

  0M  1K  1T  2A  2S  3Q  4D  4E  4H  5D
0   1   0   1   1   0   1   1   0   0   1
1   1   0   1   1   0   1   1   0   0   1
2   1   0   1   0   1   1   0   1   0   1
3   1   0   1   1   0   1   1   0   0   1
4   1   1   0   1   0   1   0   0   1   1

If you want to drop the columns with all ones:

# Keep only the columns that are not 1 in every row.
print(coded.loc[:, ~coded.all()])

   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1
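
The labels in the answer are 0-based, while the question asks for 1-based positions, residues listed in order of first appearance, and a file as input and output. A plain-Python sketch of that layout is shown below, using only the standard library. The helper names `encode`, `read_sequences` and `write_table`, the tab-separated output, and the `gap` padding for unequal lengths are all assumptions made for illustration, and the input is assumed to be a single comma-separated text file as in the example.

```python
from itertools import zip_longest


def encode(seqs, gap="-"):
    """Return (labels, rows): a 0/1 table over the variable positions only.

    Hypothetical helper. Labels look like '2T'; positions are 1-based and
    residues appear in order of first occurrence within each column.
    """
    labels, coded_columns = [], []
    columns = zip_longest(*seqs, fillvalue=gap)  # pad short sequences (assumed)
    for pos, column in enumerate(columns, start=1):
        residues = list(dict.fromkeys(column))  # unique, first-seen order
        if len(residues) < 2:  # non-variable position: discard it
            continue
        for res in residues:
            labels.append(f"{pos}{res}")
            coded_columns.append([int(c == res) for c in column])
    # Transpose so that each row corresponds to one sequence.
    rows = [[col[i] for col in coded_columns] for i in range(len(seqs))]
    return labels, rows


def read_sequences(path):
    """Read comma-separated sequences from a text file (assumed input format)."""
    with open(path) as handle:
        return [s.strip() for s in handle.read().split(",") if s.strip()]


def write_table(path, labels, rows):
    """Write the coded table as tab-separated text (assumed output format)."""
    with open(path, "w") as out:
        out.write("position+residue\t" + "\t".join(labels) + "\n")
        for n, row in enumerate(rows, start=1):
            out.write(f"sequence{n}\t" + "\t".join(map(str, row)) + "\n")


seqs = "MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD".split(",")
labels, rows = encode(seqs)
print(labels)   # ['2T', '2K', '3A', '3S', '5D', '5E', '5H']
print(rows[4])  # [0, 1, 1, 0, 0, 0, 1]
```

With real files, the same steps would be `labels, rows = encode(read_sequences(in_path))` followed by `write_table(out_path, labels, rows)`, where `in_path` and `out_path` are placeholders for your own file names.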
