Improving The Accuracy Of PSI-BLAST Protein Database Searches With Composition-based Statistics And Other Refinements.

A. Schäffer, L. Aravind, T. L. Madden, S. Shavirin, J. Spouge, Y. Wolf, E. Koonin, S. Altschul
Published 2001 · Medicine, Biology

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
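The abstract does not say which ROC variant was used. As a hedged illustration only, the sketch below assumes a truncated ROC score (often written ROC_n): for each of the first n false positives in a ranked hit list, count the true positives ranked above it, then divide the total by n times the number of known true positives. This gives a value between 0 (worst) and 1 (best). The function name `roc_n`, its parameters, and the padding rule for lists with fewer than n false positives are all assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated ROC score for one ranked hit list (illustrative sketch).

    ranked_labels: hits ordered best-first; True marks a true positive.
    total_true:    number of known true positives for the query.
    n:             number of false positives at which to truncate.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # running sum of true positives above each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives appear, treat the
    # missing false positives as ranked below every retrieved true positive.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Two true positives, with one false positive ranked between them.
    print(roc_n([True, False, True, False], total_true=2, n=2))
```

A perfect ranking (all true positives ahead of the first false positive) scores 1, and a ranking whose first n hits are all false positives scores 0.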