Single-molecule protein sequencing through fingerprinting: Computational assessment



Yao Yao1,2, Margreet Docter1, Jetty van Ginkel1, Dick de Ridder2,3 and Chirlmin Joo1

1 Kavli Institute of NanoScience and Department of BioNanoScience, Delft University of Technology, Lorentzweg 1, 2628CJ, Delft, The Netherlands

2 The Delft Bioinformatics Lab, Department of Intelligent Systems, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands

3 Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands

E-mail: dick.deridder@wur.nl and c.joo@tudelft.nl

Keywords: single-molecule biophysics, computational biophysics, proteomics

Supplementary material for this article is available online

Abstract

Proteins are vital in all biological systems as they constitute the main structural and functional components of cells. Recent advances in mass spectrometry have brought the promise of complete proteomics by helping draft the human proteome. Yet, this commonly used protein sequencing technique has fundamental limitations in sensitivity. Here we propose a method for single-molecule (SM) protein sequencing. A major challenge lies in the fact that proteins are composed of 20 different amino acids, which demands 20 molecular reporters. We computationally demonstrate that it suffices to measure only two types of amino acids to identify proteins and suggest an experimental scheme using SM fluorescence. When achieved, this highly sensitive approach will result in a paradigm shift in proteomics, with major impact in the biological and medical sciences.

In 2014 two international teams produced the first draft of the human proteome, using mass spectrometry (MS) [1, 2]. By opening a new chapter in proteomics, these large-scale studies will help us understand complex cellular processes. Yet MS, the most widely used protein sequencing technology, requires a large amount of sample. This hampers quantification, precludes detecting many proteins of interest that are present only in low concentrations in the cell, and renders single-cell analysis impossible.

Single-molecule (SM) protein sequencing would bring about ‘protein deep sequencing’ [3–5]. However, unlike DNA sequencing, which needs to read out only four nucleotides, protein sequencing demands differentiation of 20 amino acids, far beyond what current SM techniques can offer [3]. SM protein sequencing has therefore not kept pace with SM DNA sequencing, which uses fluorescence and nanopores [6–8]. Here we propose a novel SM protein sequencing method that overcomes this challenge and assess its feasibility using computational analysis.

Unique to protein sequencing is that a protein can be identified using incomplete information with reference to proteomic databases. Consider a 2 bit fingerprinting scheme in which only two types of amino acids are labeled (figure 1). A consecutive read of 15 labeled amino acids is sufficient to identify up to $2^{15}$ = 32 768 unique protein sequences. This exceeds the number of (major isoform) protein species that most organisms express. As the median length of a protein ranges from 270 (bacteria) to 350 amino acids (eukaryotes), it is not difficult to choose two amino acid types that appear more than 15 times in each protein (supplementary figure 1, stacks.iop.org/PB/12/055003/mmedia).
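The reduction from a full sequence to a two-letter fingerprint can be sketched in a few lines of Python. This is only an illustration: the function name `ck_fingerprint` and the toy sequence are hypothetical and do not come from the database used in this work.

```python
def ck_fingerprint(protein: str) -> str:
    """Keep only cysteine (C) and lysine (K), preserving their order."""
    return "".join(residue for residue in protein.upper() if residue in "CK")


# Hypothetical toy sequence, chosen only to illustrate the reduction.
toy_protein = "MKTCAYKKLCGSK"
print(ck_fingerprint(toy_protein))  # KCKKCK

# Number of distinct reads of 15 labeled residues over the alphabet {C, K}.
print(2 ** 15)  # 32768
```

The second print statement simply restates the counting argument: each of the 15 positions can be C or K.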

Figure 2 describes an SM protein fingerprinting scheme using fluorescence. We chose to label two highly nucleophilic amino acids, lysine (K) and cysteine (C), as they are frequent (supplementary figure 2) and can be labeled both efficiently and orthogonally (NHS–ester coupling with lysine and maleimide coupling with cysteine) [9]. A similar idea using lysine and arginine for monitoring protein synthesis inside a living cell was patented by Anima Cell Metrology [10]. Recently, Swaminathan et al discussed fingerprinting schemes that are based on multiple labels, including two labels [3]. Separately, a work published in 2013 shows how our fingerprinting approach might be implemented using nanopores [4].

OPEN ACCESS

RECEIVED 26 January 2015

REVISED 27 May 2015

ACCEPTED FOR PUBLICATION 30 June 2015

PUBLISHED 11 August 2015

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

To assess the predictive power of fingerprinting, we developed a dedicated search algorithm. In brief, we search CK (cysteine–lysine) fingerprints by combining a filtering strategy to decrease computation time with a dynamic programming-based alignment step, considering a specific set of potential experimental errors (see methods).

For our analysis, we used a canonical human proteome database based on UniProt release 2014.04 [11]. We simulated 2000 different read-outs, searched for each of them in the database and measured the detection precision (P), i.e. the probability of retrieving the correct sequence. In an ideal situation with no experimental error, P is 90% (figure 3(a), blue). Next, we assessed the robustness of the method against inaccuracies that are expected from actual experiments, by iteratively introducing errors into each fingerprint at random (see error simulation in methods for details) up to a specified error level (α, the number of errors (k) divided by the CK fingerprint length ($l_{ck}$, the length of the CK sequence excluding other amino acids)). Figure 3 reports P of these computations. As expected, P drops when α increases. For example, at α = 10%, half of the sequences are correctly and uniquely retrieved (figure 3(a), blue).
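As a worked illustration of the error level, the definition can be written out with numbers. The values of $l_{ck}$ and $k$ below are hypothetical and chosen only to make the arithmetic concrete.

```latex
\[
  \alpha = \frac{k}{l_{ck}}, \qquad
  l_{ck} = 30,\; k = 3
  \;\Longrightarrow\;
  \alpha = \frac{3}{30} = 10\% .
\]
```

Under this reading, a fingerprint of 30 labeled residues at α = 10% carries three introduced errors.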

To improve performance, we considered other information: the distance between C’s and K’s (figure 3(a), red). At any α, P was dramatically higher with the distance included: at α = 10%, P increased to 85%. In general, P increases when $l_{ck}$ becomes longer (figures 3(b)–(c)). At any $l_{ck}$, P for CK fingerprinting with distance information (named ‘CK-dist fingerprinting’) is higher than or equal to P for CK fingerprinting. A similar observation was made when different additional information was considered (supplementary figure 5). Taken together, these results demonstrate the feasibility of the technique for identifying primary protein sequences.

Another application area could be clinical diagnostics. As an example of detecting infections, we chose human respiratory syncytial virus (HRSV) and tuberculosis (TB). We determined that a set of HRSV and TB proteins contain a unique CK fingerprint and thus can be detected at α as high as 15–20% and be potentially used as markers for HRSV and TB (supplementary figure 6).

The CK fingerprinting technique will also enable us to detect post-translational modifications of proteins when it is expanded to a three-color fluorescence measurement. For example, glycosylated amino acids can be labeled with a third acceptor dye using hydrazide–aldehyde coupling chemistry, which is orthogonal to the labeling methods for lysine and cysteine residues. Phosphorylated serine and threonine can be labeled with a third acceptor using another coupling scheme [12]. This will advance the proof-of-principle of detecting a post-translationally modified peptide using a nanopore that was reported in 2014 [5].

We described SM protein fingerprinting, a technique that will provide proteomics with high sensitivity and a large dynamic range. Our computational assessment indicated that, even if we read only two amino acid types, we could correctly identify proteins with reference to proteomic databases. When this entirely new SM protein sequencing approach is achieved, it will become a proteomics tool that complements MS and opens up new avenues in global, high-throughput protein analysis.

Figure 1. A single-molecule read-out. CK fingerprinting read: the order of C’s and K’s is detected. CK-dist fingerprinting read: the distances between C’s and K’s are additionally measured.

Figure 2. The schematic shows how single-molecule fingerprinting is carried out. (a) Proteins are obtained from diverse sources including single cells, tissues, and body fluids. (b) Extracted proteins are denatured, and cysteines and lysines are labeled with fluorescent dyes. (c) An engineered version of a protein translocase (e.g. bacterial ClpX) grabs individual substrate proteins, unfolds them, and translocates them through its nanochannel. Proteins are sequenced using FRET (Förster resonance energy transfer). The translocase is labeled with a donor dye. FRET occurs between the donor on the translocase and the two distinct acceptor dyes on a substrate when the substrate passes through the nanomachine. The FRET signals report the order of the labeled amino acids.

Methods

Here we describe the approaches that we used to simulate errors and find protein fingerprints that match a given query fingerprint pattern.

Error simulation

We simulated 2000 read-outs, each for a different protein. The proteins are randomly picked from the database and thus vary in amino acid composition and fingerprint length. Next, to assess the robustness of the method against inaccuracies that are expected from actual experiments, errors are iteratively introduced for each read-out up to the error level we want to investigate.

We expect that actual data will be affected by poor dye-labeling, photoblinking and photobleaching of dyes, local structures of a substrate protein, non-uniform speed of substrate translocation, proximity between dyes, etc. The poor labeling, photoblinking, and photobleaching of acceptor dyes will appear as deletion errors (figure 4). The non-uniform speed of translocation will introduce insertion and deletion errors to CK-dist fingerprinting. The proximity of acceptor dyes will bring deletion and transposition/substitution errors. If a donor dye is photobleached during a measurement, it will appear as a truncation error. We do not consider this error for fingerprinting analysis since donor photobleaching can be determined from SM time traces and thus can be easily excluded from further analysis. Other complications, such as aggregation of denatured proteins, may also be expected but are not considered in our analysis. Pseudo-code for simulating these errors is given in the supplementary information.

In figure 3, we investigated one combination of errors (70% deletions, 20% insertions, 10% transpositions) for CK fingerprinting, in which we assigned the largest percentage to deletions since this error is the most likely to occur (poor labeling of acceptors due to incomplete denaturation of proteins, photoblinking/photobleaching of acceptors, and presence of consecutive identical acceptor fluorophores). For CK-dist fingerprinting analysis, we considered the same combination of errors as for CK fingerprinting but with errors at CK residues and errors of the distance between CK residues equally likely to occur. In supplementary figure 3, we expanded the error space that we explored and obtained trends nearly identical to those found in figure 3(a).
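The iterative error injection for CK fingerprinting might look roughly like the Python sketch below. It is not the pseudo-code from the supplementary information; the function name `add_errors`, the rounding of the error count, and the uniform choice of error positions are all assumptions made for illustration. Only the 70/20/10 mix of error types is taken from the text.

```python
import random

# Error mix used for CK fingerprinting in figure 3 (from the text).
ERROR_MIX = {"deletion": 0.7, "insertion": 0.2, "transposition": 0.1}


def add_errors(fingerprint: str, alpha: float, rng: random.Random) -> str:
    """Introduce about alpha * len(fingerprint) random errors, one at a time.

    Assumption: positions are drawn uniformly and the error count is rounded.
    """
    symbols = list(fingerprint)
    n_errors = round(alpha * len(symbols))
    kinds = list(ERROR_MIX)
    weights = list(ERROR_MIX.values())
    for _ in range(n_errors):
        kind = rng.choices(kinds, weights=weights)[0]
        if kind == "deletion" and symbols:
            # A label that was missed (poor labeling, blinking, bleaching).
            del symbols[rng.randrange(len(symbols))]
        elif kind == "transposition" and len(symbols) >= 2:
            # Swap two neighbouring labels; identical neighbours stay unchanged.
            i = rng.randrange(len(symbols) - 1)
            symbols[i], symbols[i + 1] = symbols[i + 1], symbols[i]
        else:
            # Insertion; also the fallback when the fingerprint is too short.
            symbols.insert(rng.randrange(len(symbols) + 1), rng.choice("CK"))
    return "".join(symbols)


rng = random.Random(0)
print(add_errors("CKKCKCKKCCKKCKCKKCKC", alpha=0.10, rng=rng))
```

Passing an explicit `random.Random` instance keeps repeated simulations reproducible, which is convenient when independent repetitions are compared.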

Overview of the CK fingerprinting

The 2000 simulated read-outs are searched for in the database, and the number of true positives and the number of matches are recorded. To examine the performance variability of our algorithm in retrieving proteins using fingerprints, three independent repetitions are executed. In each repetition, detection precision (P) (figure 3) and detection recall (R) (supplementary figure 4) are calculated based on the outputs. P is defined as the number of true positives divided by the number of read-outs returned by the algorithm. R is the number of true positives divided by the number of conditional matches.

Figure 3. (a) Detection precision, P, at various error levels, α: blue for CK fingerprinting, red for CK-dist fingerprinting. Error bars are the standard deviation from three independent simulations. (b) P as a function of CK fingerprint length ($l_{ck}$, the length of CK sequences excluding other amino acids) at various α for CK fingerprinting. (c) P as a function of $l_{ck}$ at various α for CK-dist fingerprinting.

Figure 4. Expected experimental errors. ‘R’ is reference sequence. ‘Q’ is query sequence.

The inputs to our method are a reference database $R$ containing fingerprint representations of protein sequences, a query fingerprint $Q$ and an error level $\alpha$. The alphabet is $\Sigma = \{C, K\}$, since we only compare fingerprints of these two amino acids. Let $L_Q$ be the query length and $R_x \in R$ denote the $x$th reference sequence in the database $R$ with length $L_x^R$. The distance $S(R_x, Q)$ between a reference fingerprint $R_x$ and a query $Q$ is the minimal number of steps required to transform $Q$ into $R_x$. Formally, given $Q$, $R$, and $\alpha$, the problem is to find all $R_x \in R$ for which $S(R_x, Q)$ is smaller than $k = \alpha \times L_Q$.

Given the inputs, the algorithm takes two steps to retrieve matches: (1) a filtration strategy is applied to identify candidate sequences in $R$; and (2) a verification method is employed to examine all candidates for possible matches.

Filtration: eliminating uninteresting sequences

Dynamic programming is computationally costly, prohibiting direct application on large databases in a high-throughput setting [13]. In order to reduce the running time without affecting sensitivity, we use filtration to remove those references that definitely cannot match the query fingerprint $Q$ with distance smaller than or equal to $k$. Filtration exploits the fact that it is easier to tell a reference fingerprint that does not match a query fingerprint than to tell one that does match. Typically, it uses a simple and highly efficient filter criterion to analyze the reference sequences, leaving only a small number of $R_x$'s for further (more expensive) analysis. We devised a new filtration method combining two existing algorithms, partial exact matching and n-gram counting.

In partial exact matching, the query fingerprint $Q$ is divided into $(k + 1)$ pieces $q_0, q_1, \ldots, q_k$, where $k$ equals $\alpha \times L_Q$. For a match to be possible, there must be at least one piece that appears exactly in a reference sequence $R_x$ [15]. If this is not the case, $R_x$ is discarded.
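The pigeonhole idea behind partial exact matching can be sketched in Python as follows. The helper names `split_query` and `passes_partial_exact_match` are hypothetical, and splitting into pieces of near-equal length is an assumption; the text only states that the query is divided into k + 1 pieces.

```python
def split_query(query: str, k: int) -> list[str]:
    """Split the query into k + 1 contiguous pieces of near-equal length."""
    n_pieces = k + 1
    base, extra = divmod(len(query), n_pieces)
    pieces, start = [], 0
    for j in range(n_pieces):
        size = base + (1 if j < extra else 0)
        pieces.append(query[start:start + size])
        start += size
    return pieces


def passes_partial_exact_match(reference: str, query: str, k: int) -> bool:
    """Keep the reference only if at least one query piece occurs in it exactly.

    With at most k errors spread over k + 1 pieces, at least one piece must be
    error-free, so a reference containing none of the pieces can be discarded.
    """
    return any(piece in reference for piece in split_query(query, k))


print(split_query("CKKCKC", 1))                            # ['CKK', 'CKC']
print(passes_partial_exact_match("KKCKKCC", "CKKCKC", 1))  # True
print(passes_partial_exact_match("CCCCCC", "CKKCKC", 1))   # False
```

If the query is shorter than k + 1, some pieces are empty and trivially match, so the sketch keeps the reference rather than risk discarding a true match.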

A faster filtration method is n-gram counting, which compares the n-grams of two fingerprints. An n-gram [16] on the alphabet set $\Sigma = \{C, K\}$ is any string in $\Sigma^n$, where $\Sigma^n$ is the set of all possible strings of length $n$ over $\Sigma$. For example, the 2-grams for $\Sigma = \{C, K\}$ are CC, CK, KC and KK. The n-gram distance is defined as the sum of the absolute differences between the numbers of occurrences of each n-gram. If the n-gram distance exceeds $2nk$, $R_x$ is discarded [16].
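A minimal Python sketch of the n-gram distance and the associated threshold test is given below. The function names are hypothetical; the threshold 2nk is the one stated in the text.

```python
from collections import Counter


def ngram_counts(fingerprint: str, n: int) -> Counter:
    """Count every overlapping n-gram in the fingerprint."""
    return Counter(fingerprint[i:i + n] for i in range(len(fingerprint) - n + 1))


def ngram_distance(a: str, b: str, n: int) -> int:
    """Sum of absolute differences between the n-gram counts of a and b."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    return sum(abs(counts_a[g] - counts_b[g])
               for g in counts_a.keys() | counts_b.keys())


def passes_ngram_filter(reference: str, query: str, n: int, k: int) -> bool:
    """Discard the reference when the n-gram distance exceeds 2 * n * k."""
    return ngram_distance(reference, query, n) <= 2 * n * k


print(ngram_distance("CKKC", "CKCK", 2))          # 2
print(passes_ngram_filter("CKKC", "CKCK", 2, 1))  # True
```

In the example, CKKC has the 2-grams CK, KK, KC and CKCK has CK (twice) and KC, so the counts differ by one for CK and by one for KK.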

We combined the partial exact matching and n-gram counting approaches to decide whether there exists at least one piece in $Q$ that appears with a limited number of errors as a piece of $R_x$ [17]. The distance function between two pieces of $Q$ and $R_x$, $q_j$ and $r_x^j$, based on their n-grams was defined as:

$$S_{npm}\left(r_x^j, q_j\right) = \sum_{\nu \in \Sigma^n} \max\left(G(q_j)[\nu] - G(r_x^j)[\nu],\ 0\right),$$

where $[\nu]$ is an n-gram and $G(q_j)[\nu]$ and $G(r_x^j)[\nu]$ denote the total number of times $[\nu]$ occurs in $q_j$ and $r_x^j$, respectively.

For each piece $q_j$ in the query, the corresponding piece $r_x^j$ spans the same positions in the reference sequence with an additional $k$ letters on both sides, as shown in figure 5. It is sufficient to compare the $r_x^j$ in the reference with the $q_j$ in the query to determine whether the piece $q_j$ appears in the reference $R_x$, since $k$ errors cannot alter more than $k$ positions. Since a query piece is searched in a limited range in the reference, this approach can discard more entries in the reference database than the partial exact matching method, in which the $q_j$ is compared with the entire reference sequence.

The distance between a piece $q_j$ in query $Q$ and the corresponding piece $r_x^j$ in $R_x$ is computed to determine whether $R_x$ is a candidate match. For each $q_j$ and its corresponding $r_x^j$, we check whether any n-gram occurs more often in $q_j$ than in $r_x^j$. If not, the $S_{npm}(r_x^j, q_j)$ is zero, i.e. the n-grams in $q_j$ appear exactly in $r_x^j$. Only if for at least one $q_j$, $S_{npm}(r_x^j, q_j)$ is zero, $R_x$ is kept as a candidate.
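The combined filter can be sketched in Python as below. The helper names, the default n = 2, and the splitting of the query into pieces of near-equal length are assumptions made for illustration; the window of k extra letters on each side and the zero-distance criterion follow the description above.

```python
from collections import Counter


def ngram_counts(piece: str, n: int) -> Counter:
    """Count every overlapping n-gram in the piece."""
    return Counter(piece[i:i + n] for i in range(len(piece) - n + 1))


def s_npm(ref_piece: str, query_piece: str, n: int) -> int:
    """Sum over n-grams of how many more times each occurs in the query piece."""
    g_q, g_r = ngram_counts(query_piece, n), ngram_counts(ref_piece, n)
    return sum(max(g_q[gram] - g_r[gram], 0) for gram in g_q)


def passes_combined_filter(reference: str, query: str, k: int, n: int = 2) -> bool:
    """Keep the reference if some query piece has S_npm == 0 against its window."""
    n_pieces = k + 1
    base, extra = divmod(len(query), n_pieces)
    start = 0
    for j in range(n_pieces):
        size = base + (1 if j < extra else 0)
        piece = query[start:start + size]
        # Reference window: same positions, extended by k letters on both sides.
        lo = max(0, start - k)
        hi = min(len(reference), start + size + k)
        if s_npm(reference[lo:hi], piece, n) == 0:
            return True
        start += size
    return False


print(s_npm("KCKKC", "CKK", 2))                         # 0
print(passes_combined_filter("KKCKKCC", "CKKCKC", 1))   # True
print(passes_combined_filter("CCCCCCC", "CKKCKC", 1))   # False
```

A piece shorter than n has no n-grams and therefore passes trivially, which keeps the filter from discarding a possible match.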

Verification: finding matches

The remaining candidate matches are examined by a global alignment dynamic programming approach considering a number of possible error types. In our analysis, four types of error may occur: deletion, insertion, mismatching an amino acid with another one (substitution), and swapping (transposition).

The dynamic programming algorithm is designed to provide the optimal gapped alignment between two sequences, i.e. an alignment with long regions of identical amino acid pairs and very few mismatches and gaps [14]. As the sequences become more dissimilar, more mismatched amino acid pairs and gaps should appear. To find the optimal alignment, a dynamic programming matrix $M$ first needs to be calculated. Each element $M_{i,j}$ represents the maximum score of aligning the substrings $Q[1 \ldots i]$ and $R_x[1 \ldots j]$. Let $c$ denote the scores of the four operations. The base cases, $M_{0,j}$ and $M_{i,0}$, are defined as $(c_{del} \times j)$ and $(c_{ins} \times i)$ for all $1 \le j \le L_x^R$ (length of $R_x$) and $1 \le i \le L_Q$ (length of $Q$) respectively. Then, considering the four possibilities, $M$ is updated using the following recursive relation

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + c_{sub}, \\ M_{i-1,j} + c_{ins}, \\ M_{i,j-1} + c_{del}, \\ M_{i-2,j-2} + c_{swap}. \end{cases}$$

Figure 5. Consecutive pieces $q_{j-1}$, $q_j$ and $q_{j+1}$ of query $Q$ and their corresponding pieces $r_x^{j-1}$, $r_x^j$ and $r_x^{j+1}$ in reference $R_x$. Each query piece is compared to a limited range in the reference.

The score for each operation is set based on the estimation of how likely each error is to occur in our measurements. Currently, deletions caused by low labeling efficiency are the dominating errors, followed by insertions, transpositions and substitutions (i.e. matching C to K or vice versa) (see error simulation). Hence we choose a relatively low penalty (negative) for deletions and higher penalties (negative) for transpositions and substitutions. For the matching positions, the score is positive (see supplementary table 2).

By memorizing the solutions to the subproblems for $1 \le j \le L_x^R$ and $1 \le i \le L_Q$ stored in the dynamic matrix, we can recursively compute the maximum score of aligning $R_x$ and $Q$. We therefore find the score of the optimal alignment of the two sequences starting from the maximum value in the last row or last column. We maintain a matrix of traceback pointers in the recursion, so that we remember which case was used to calculate every cell $M_{i,j}$, allowing us to reconstruct the optimal alignment.
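The recursion can be illustrated with the Python sketch below. The score values are placeholders chosen only to respect the stated ordering (a mild penalty for deletions, stronger penalties for transpositions and substitutions, a positive score for matches); they are not the values of supplementary table 2. Restricting the swap case to two adjacent, different symbols is also an assumption, and the traceback is omitted for brevity.

```python
def alignment_matrix(query: str, reference: str, c_match: int = 1,
                     c_sub: int = -3, c_ins: int = -2, c_del: int = -1,
                     c_swap: int = -3) -> list[list[int]]:
    """Fill the score matrix M, where M[i][j] scores query[:i] against reference[:j]."""
    lq, lr = len(query), len(reference)
    M = [[0] * (lr + 1) for _ in range(lq + 1)]
    for j in range(1, lr + 1):
        M[0][j] = c_del * j  # base case M_{0,j}
    for i in range(1, lq + 1):
        M[i][0] = c_ins * i  # base case M_{i,0}
    for i in range(1, lq + 1):
        for j in range(1, lr + 1):
            same = query[i - 1] == reference[j - 1]
            best = max(
                M[i - 1][j - 1] + (c_match if same else c_sub),
                M[i - 1][j] + c_ins,
                M[i][j - 1] + c_del,
            )
            # Swap of two adjacent, different symbols (assumed restriction).
            if (i >= 2 and j >= 2
                    and query[i - 1] == reference[j - 2]
                    and query[i - 2] == reference[j - 1]
                    and query[i - 1] != query[i - 2]):
                best = max(best, M[i - 2][j - 2] + c_swap)
            M[i][j] = best
    return M


def alignment_score(query: str, reference: str, **scores: int) -> int:
    """Best score found in the last row or last column of the matrix."""
    M = alignment_matrix(query, reference, **scores)
    return max(max(M[-1]), max(row[-1] for row in M))


print(alignment_score("CKCK", "CKCK"))  # 4
print(alignment_score("CKK", "CKCK"))   # 2
```

In the second call the query lacks one C relative to the reference, so three matches and one deletion give 3 − 1 = 2 under the placeholder scores.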

From this alignment the numbers of errors for different types as well as the total number of errors can be calculated. The distance between the query and the reference $S(R_x, Q)$ is defined as the total number of errors. If this distance is smaller than $k$, the reference sequence $R_x$ is considered as a match. Otherwise, it is not a match of the query sequence within the error bound $k$. A match is considered a true positive match when the match is the exact query protein. If a match has the same fingerprint but a different amino acid sequence, it is not considered to be a true positive match. In our analysis, this is determined by checking the protein accession codes.

Additional information, such as the distance between two read-outs, can be deduced from the measurements. This distance is the space between two labeled amino acids, i.e. the number of non-labeled amino acids in between, which show a different pattern in the measurement. For this to be estimated reliably, proteins will have to be sequenced at a relatively constant speed, an assumption which is not a priori valid. From the sequencing signals, we cannot easily determine the start or the end of proteins in the time trace if they do not correspond to a labeled amino acid. Thus, the starting and ending non-labeled amino acids are not included when we construct the fingerprint with distance information.

This distance information is added to the original CK fingerprints using an additional symbol (say, ‘o’), occurring multiple times (representing the length of distance). Next, two distances between query and reference are calculated to examine whether a reference sequence is a match. One is the $S(R_x, Q)$ between fingerprints with distance information, the other the $S'(R_x, Q)$ between CK fingerprints only. Reference sequence $R_x$ is considered a match if and only if $S'(R_x, Q)$ is smaller than $k' = (\alpha \times L_Q')$ and $S(R_x, Q)$ is smaller than $k = \alpha \times L_Q$, where $L_Q$ is the length of the query CK fingerprint, $L_Q'$ is the length of the query fingerprint with distance information and $k'$ and $k$ represent the numbers of errors allowed. Experimental error on the distance information is also taken into consideration.
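How a fingerprint with distance information might be built from a sequence is sketched below in Python. The function name `ck_dist_fingerprint` and the toy sequence are hypothetical; the use of the symbol ‘o’ for each non-labeled residue and the removal of leading and trailing non-labeled residues follow the description above.

```python
def ck_dist_fingerprint(protein: str, labeled: str = "CK", gap: str = "o") -> str:
    """Replace every non-labeled residue by the gap symbol and trim both ends."""
    symbols = [res if res in labeled else gap for res in protein.upper()]
    # Leading and trailing non-labeled residues cannot be located in the
    # time trace, so they are dropped from the fingerprint.
    return "".join(symbols).strip(gap)


# Hypothetical toy sequence, chosen only for illustration.
print(ck_dist_fingerprint("MKTCAYKKLCGSK"))  # KoCooKKoCooK
```

Removing every ‘o’ from the result recovers the plain CK fingerprint, so both representations can be derived from the same database entry.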

Note: Supplementary information is available in the online version of the paper. An animated experimental scheme is available at https://youtube.com/watch?v=YpWCCWO5q10.

Acknowledgments

We would like to thank L Restrepo and L Loeff for critical reading. C J was funded by Foundation for Fundamental Research on Matter (12PR3029).

Author contributions

C J conceived the study. Y Y, M D, and D R conducted the computational analysis. Y Y, M D, J G, D R and C J discussed the data and wrote the manuscript.

Competing financial interests

C J and J G filed a patent (WO2014014347).

References

[1] Kim M S et al 2014 A draft map of the human proteome Nature 509 575–81

[2] Wilhelm M et al 2014 Mass-spectrometry-based draft of the human proteome Nature 509 582–7

[3] Swaminathan J, Boulgakov A A and Marcotte E M 2015 A theoretical justification for single molecule peptide sequencing PLoS Comput. Biol. 11 e1004080

[4] Nivala J, Marks D B and Akeson M 2013 Unfoldase-mediated protein translocation through an alpha-hemolysin nanopore Nat. Biotechnol. 31 247–50

[5] Rosen C B, Rodriguez-Larrea D and Bayley H 2014 Single-molecule site-specific detection of protein phosphorylation with a nanopore Nat. Biotechnol. 32 179–81

[6] Harris T D et al 2008 Single-molecule DNA sequencing of a viral genome Science 320 106–9

[7] Eid J et al 2009 Real-time DNA sequencing from single polymerase molecules Science 323 133–8

[8] Clarke J et al 2009 Continuous base identification for single-molecule nanopore DNA sequencing Nat. Nanotechnology 4 265–70

[9] Joo C, Dekker C, van Ginkel H G T M and Meyer A S 2014 Single molecule protein sequencing WO patent 2014014347

[10] Preminger M and Smilansky Z 2009 Methods for evaluating ribonucleotide sequences US patent 20090081743

[11] UniProt Consortium 2014 Activities at the universal protein resource (UniProt) Nucleic Acids Res. 42 D191–8

[12] Kim J S, Kim J, Oh J M and Kim H J 2011 Tandem mass spectrometric method for definitive localization of phosphorylation sites using bromine signature Anal. Biochem. 414 294–6

[13] Navarro G 2001 A guided tour to approximate string matching ACM Comput. Surv. 33 31–88

[14] Needleman S B and Wunsch C D 1970 A general method applicable to the search for similarities in the amino acid sequence of two proteins J. Mol. Biol. 48 443–53

[15] Wu S and Manber U 1992 Fast text searching: allowing errors Commun. ACM 35 83–91

[16] Ukkonen E 1992 Approximate string-matching with q-grams and maximal matches Theor. Comput. Sci. 92 191–211

[17] Lu C W, Lu C L and Lee R C 2013 A new filtration method and a hybrid strategy for approximate string matching Theor. Comput. Sci. 481 9–17