Dirichlet Mixtures: A Method For Improved Detection Of Weak But Significant Protein Sequence Homology

K. Sjölander, K. Karplus, Michael Brown, R. Hughey, A. Krogh, I. Mian, D. Haussler
Published 1996 · Mathematics, Computer Science, Medicine

We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
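The abstract's central step, combining a Dirichlet mixture with the observed amino acid counts at one position to estimate expected probabilities, can be sketched as follows. This is a minimal illustration, assuming the usual posterior-mean form for a Dirichlet mixture prior: each component j is weighted by its posterior probability given the counts, proportional to q_j * B(alpha_j + n) / B(alpha_j), and contributes the estimate (n_i + alpha_j,i) / (|n| + |alpha_j|). The function names `log_beta` and `posterior_mean`, the three-letter alphabet, and all parameter values are hypothetical and are not taken from the paper.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, weights, components):
    """Estimate letter probabilities from counts under a Dirichlet mixture.

    counts     -- observed count n_i for each letter at one position
    weights    -- mixture coefficients q_j (assumed to sum to 1)
    components -- one Dirichlet parameter vector alpha_j per component
    """
    # Log of q_j * B(alpha_j + n) / B(alpha_j) for each component.
    # The multinomial coefficient is shared by all components and cancels.
    log_post = []
    for q, alpha in zip(weights, components):
        shifted = [a + n for a, n in zip(alpha, counts)]
        log_post.append(math.log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalise in log space to avoid underflow with large counts.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    # Posterior-weighted average of the per-component estimates.
    total = sum(counts)
    est = [0.0] * len(counts)
    for r, alpha in zip(resp, components):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            est[i] += r * (n + a) / denom
    return est


if __name__ == "__main__":
    # Toy example: two hypothetical components over a three-letter alphabet.
    q = [0.6, 0.4]
    alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
    print(posterior_mean([3, 0, 1], q, alphas))
```

With no observations the estimate reduces to the weighted average of the component means, and as counts grow it approaches the observed frequencies. This matches the behaviour the abstract describes: the prior dominates for small samples and the data dominate for large ones.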