Improving The Accuracy Of PSI-BLAST Protein Database Searches With Composition-based Statistics And Other Refinements.
A. Schäffer, L. Aravind, T. L. Madden, S. Shavirin, J. Spouge, Y. Wolf, E. Koonin, S. Altschul
Published 2001 · Medicine, Biology
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
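As an illustration of the normalized ROC measure mentioned above, the following is a minimal sketch of one common truncated form, often written ROC_n: for each of the first n false positives in the ranked hit list, count the true positives ranked above it, then divide the sum by n times the total number of true positives. The abstract does not say which truncation or padding convention the authors used, so the function name `roc_n`, its parameters, and the handling of lists with fewer than n false positives are assumptions made here for illustration only.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Truncated, normalized ROC score in [0, 1].

    ranked_labels: hits ordered best first; True marks a true positive.
    total_true:    number of true positives that exist for the query.
    n:             number of false positives at which the curve is cut off.
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    tp_seen = 0  # true positives encountered so far
    fp_seen = 0  # false positives encountered so far
    area = 0     # sum of true positives ranked above each false positive

    for is_true in ranked_labels:
        if fp_seen == n:
            break
        if is_true:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen

    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranked below every listed hit.
    area += (n - fp_seen) * tp_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Two true positives, one false positive, one true positive, one false positive.
    hits = [True, True, False, True, False]
    print(round(roc_n(hits, total_true=3, n=2), 3))
```

A ranking that places every true positive ahead of the first false positive scores 1, and one that places n false positives ahead of every true positive scores 0, matching the 0 (worst) to 1 (best) range described in the abstract.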