More Bayesian Probability June 25, 2007
Posted by Peter in Exam 1/P, Exam 4/C.
We know that 10% of all proteins are membrane proteins. There are three types of amino acid: hydrophobic (H), polar (P), and charged (C). In globular (non-membrane) proteins, the three types are equally common: 1/3 each. In membrane proteins the percentages are H 50%, P 25%, C 25%. Now we have an unidentified sequence: HHHPH. What is the probability that it is a membrane protein?
Solution: Let M be the event that the protein is a membrane protein, and G be the event that the protein is globular (non-membrane). Then
$$\Pr[M] = 0.1, \qquad \Pr[G] = 1 - \Pr[M] = 0.9,$$
if we assume that all proteins are classified as belonging to either type M or type G. Now, we are also given that
$$\Pr[H \mid G] = \Pr[P \mid G] = \Pr[C \mid G] = \frac{1}{3};$$
that is, the probability that a selected amino acid is hydrophobic, polar, or charged, given that it belongs to a globular protein, is 1/3 each. We also have
$$\Pr[H \mid M] = \frac{1}{2}, \qquad \Pr[P \mid M] = \Pr[C \mid M] = \frac{1}{4}.$$
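Our prior hypothesis is that there is a 0.1 probability that the protein is a membrane protein. Now, the likelihood of observing the amino acid sequence HHHPH, given that the protein is membrane, is
$$\Pr[\text{HHHPH} \mid M] = \left(\frac{1}{2}\right)^{4} \left(\frac{1}{4}\right) = \frac{1}{64}.$$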
This assumes that amino acid types are independent of each other within a given protein. Similarly, the likelihood of observing the same sequence, given that the protein is globular, is
$$\Pr[\text{HHHPH} \mid G] = \left(\frac{1}{3}\right)^{5} = \frac{1}{243}.$$
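The joint probabilities are then
$$\begin{aligned}
\Pr[\text{HHHPH} \cap M] &= \Pr[\text{HHHPH} \mid M]\,\Pr[M] = \frac{1}{64} \cdot \frac{1}{10} = \frac{1}{640}, \\
\Pr[\text{HHHPH} \cap G] &= \Pr[\text{HHHPH} \mid G]\,\Pr[G] = \frac{1}{243} \cdot \frac{9}{10} = \frac{1}{270},
\end{aligned}$$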
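and therefore the unconditional probability of observing the sequence HHHPH is, by the law of total probability,
$$\Pr[\text{HHHPH}] = \Pr[\text{HHHPH} \cap M] + \Pr[\text{HHHPH} \cap G] = \frac{1}{640} + \frac{1}{270} = \frac{91}{17280}.$$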
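Hence by Bayes’ theorem, the posterior probability of the protein being membrane, given that we observed the particular amino acid sequence HHHPH, is
$$\Pr[M \mid \text{HHHPH}] = \frac{\Pr[\text{HHHPH} \mid M]\,\Pr[M]}{\Pr[\text{HHHPH}]} = \frac{1/640}{91/17280} = \frac{27}{91},$$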
which is approximately 29.67%. This answer makes sense: in the absence of any information, we can only conclude there is a 10% probability of selecting a membrane protein. Once we observe the sequence HHHPH, the posterior probability is significantly greater, since such a sequence is far more likely if the protein is membrane than if it is globular; indeed, the likelihoods are 1/64 versus 1/243. However, because 90% of all proteins are globular, the posterior probability is not vastly greater: only about 30%.
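The arithmetic above can be checked with a short Python sketch. It is only an illustration: the names `PRIOR`, `EMISSION`, `likelihood`, and `posterior_membrane` are made up for this example, and the code relies on the same assumption as the solution, namely that amino acid types are independent within a protein. Exact fractions are used so the result can be compared directly with 27/91.

```python
from fractions import Fraction

# Numbers taken from the problem statement.
PRIOR = {"M": Fraction(1, 10), "G": Fraction(9, 10)}
EMISSION = {
    "M": {"H": Fraction(1, 2), "P": Fraction(1, 4), "C": Fraction(1, 4)},
    "G": {"H": Fraction(1, 3), "P": Fraction(1, 3), "C": Fraction(1, 3)},
}


def likelihood(sequence, protein_type):
    """Pr[sequence | protein_type], assuming independent amino acids."""
    result = Fraction(1)
    for residue in sequence:
        result *= EMISSION[protein_type][residue]
    return result


def posterior_membrane(sequence):
    """Pr[M | sequence] by Bayes' theorem."""
    # Joint probabilities Pr[sequence and type] for each protein type.
    joint = {t: PRIOR[t] * likelihood(sequence, t) for t in PRIOR}
    # Law of total probability gives the denominator.
    return joint["M"] / sum(joint.values())


post = posterior_membrane("HHHPH")
print(post)                   # 27/91
print(round(float(post), 4))  # 0.2967
```

Passing a different string of H, P, and C characters to `posterior_membrane` repeats the same calculation for another observed sequence.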
Exercise: Suppose you observed the sequence HHPCHHPHHHCH. What is the posterior probability of the protein being membrane? Why do we get a different result here? Why do we have to observe far longer sequences before we can have a high posterior probability that the sequence belongs to a membrane protein, compared to a similar degree of confidence that the sequence belongs to a globular protein?
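As a possible starting point for the exercise, Bayes' theorem can also be written in odds form: the posterior odds of membrane versus globular equal the prior odds multiplied by one likelihood ratio per observed amino acid. The sketch below is an illustration under the same independence assumption as the solution; `RATIO`, `PRIOR_ODDS`, `posterior_odds`, and `odds_to_probability` are names invented here.

```python
from fractions import Fraction

# Per-residue likelihood ratios Pr[residue | M] / Pr[residue | G],
# computed from the percentages in the problem statement.
RATIO = {
    "H": Fraction(1, 2) / Fraction(1, 3),  # 3/2: each H shifts the odds toward membrane
    "P": Fraction(1, 4) / Fraction(1, 3),  # 3/4: each P shifts the odds toward globular
    "C": Fraction(1, 4) / Fraction(1, 3),  # 3/4: each C shifts the odds toward globular
}

# Prior odds of membrane versus globular: 0.1 / 0.9 = 1/9.
PRIOR_ODDS = Fraction(1, 10) / Fraction(9, 10)


def posterior_odds(sequence):
    """Posterior odds of M versus G after observing the sequence."""
    odds = PRIOR_ODDS
    for residue in sequence:
        odds *= RATIO[residue]
    return odds


def odds_to_probability(odds):
    """Convert odds in favour of M into Pr[M]."""
    return odds / (1 + odds)


# Reproduces the worked example: odds of 27/64, probability 27/91.
print(posterior_odds("HHHPH"))
print(odds_to_probability(posterior_odds("HHHPH")))
```

Comparing the size of each ratio with the prior odds of 1/9 is one way to think about the last two questions in the exercise.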