## More Bayesian Probability June 25, 2007

Posted by Peter in Exam 1/P, Exam 4/C.

We know that 10% of all proteins are membrane proteins. There are three types of amino acid: hydrophobic (H), polar (P), and charged (C). In globular (non-membrane) proteins, the three types are equally common: 1/3 each. In membrane proteins, the percentages are H 50%, P 25%, and C 25%. Now we have an unidentified sequence: HHHPH. What is the probability that it comes from a membrane protein?

Solution: Let M be the event that the protein is a membrane protein, and G be the event that the protein is globular (non-membrane). Then

$\Pr[M] = \frac{1}{10}; \quad \Pr[G] = 1 - \Pr[M] = \frac{9}{10},$

if we assume that all proteins are classified as belonging to either type M or type G. Now, we are also given that

$\Pr[H|G] = \Pr[P|G] = \Pr[C|G] = \frac{1}{3};$

that is, the probability that a selected amino acid is hydrophobic, polar, or charged, given that it belongs to a globular protein, is 1/3 each. We also have

$\Pr[H|M] = \frac{1}{2}, \Pr[P|M] = \frac{1}{4}, \Pr[C|M] = \frac{1}{4}.$

Our prior probability that the protein is a membrane protein is 0.1. Now, the likelihood of observing the amino acid sequence HHHPH, given that the protein is a membrane protein, is

$\Pr[{\it HHHPH}|M] = \Pr[H|M]^4 \Pr[P|M] = \left(\frac{1}{2}\right)^4 \left(\frac{1}{4}\right) = \frac{1}{64}.$

This assumes that amino acid types are independent of each other within a given protein. Similarly, the likelihood of observing the same sequence given that the protein is globular, is

$\Pr[{\it HHHPH}|G] = \left(\frac{1}{3}\right)^5 = \frac{1}{243}.$
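The two likelihoods above are easy to check with exact rational arithmetic. The following is a minimal sketch, not part of the original solution; the names `P_GIVEN_M`, `P_GIVEN_G`, and `sequence_likelihood` are made up for illustration, and the code relies on the same independence assumption stated above.

```python
from fractions import Fraction

# Per-residue probabilities taken from the problem statement.
P_GIVEN_M = {"H": Fraction(1, 2), "P": Fraction(1, 4), "C": Fraction(1, 4)}
P_GIVEN_G = {"H": Fraction(1, 3), "P": Fraction(1, 3), "C": Fraction(1, 3)}


def sequence_likelihood(sequence, residue_probs):
    """Multiply per-residue probabilities, assuming residues are independent."""
    likelihood = Fraction(1)
    for residue in sequence:
        likelihood *= residue_probs[residue]
    return likelihood


lik_m = sequence_likelihood("HHHPH", P_GIVEN_M)
lik_g = sequence_likelihood("HHHPH", P_GIVEN_G)
print(lik_m, lik_g)  # 1/64 1/243
```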

The joint probabilities are then

$\begin{array}{c} \Pr[{\it HHHPH}|M]\Pr[M] = \left(\frac{1}{64}\right)\left(\frac{1}{10}\right) = \frac{1}{640}, \\ \Pr[{\it HHHPH}|G]\Pr[G] = \left(\frac{1}{243}\right)\left(\frac{9}{10}\right) = \frac{1}{270},\end{array}$

and therefore the unconditional probability of observing the sequence HHHPH is, by the law of total probability,

${\setlength\arraycolsep{2pt} \begin{array}{rcl}\Pr[{\it HHHPH}] &=& \Pr[{\it HHHPH}|M]\Pr[M] + \Pr[{\it HHHPH}|G]\Pr[G] \\ &=& \frac{1}{640} + \frac{1}{270} = \frac{27}{17280} + \frac{64}{17280} \\ &=& \frac{91}{17280}. \end{array}}$

Hence, by Bayes’ theorem, the posterior probability that the protein is a membrane protein, given that we observed the particular amino acid sequence HHHPH, is

$\displaystyle \Pr[M|{\it HHHPH}] = \frac{\Pr[{\it HHHPH}|M]\Pr[M]}{\Pr[{\it HHHPH}]} = \frac{1/640}{91/17280} = \frac{27}{91},$

which is approximately 29.67%. This answer makes sense. With no information about the sequence, we can only say there is a 10% probability that a protein is a membrane protein. Once we observe the sequence HHHPH, the posterior probability is considerably greater, because such a sequence is about 3.8 times more likely to come from a membrane protein than from a globular one: the likelihoods are 1/64 versus 1/243. However, because 90% of all proteins are globular, the posterior probability is not vastly greater: it is still only about 30%.
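The whole calculation (likelihoods, joint probabilities, total probability, and Bayes' theorem) can be reproduced in a few lines. This is only a sketch under the assumptions used above: residues are independent and every protein is either membrane or globular. The function name `membrane_posterior` and its `prior_m` parameter are hypothetical, chosen for this illustration.

```python
from fractions import Fraction


def membrane_posterior(sequence, prior_m=Fraction(1, 10)):
    """Posterior Pr[M | sequence] via Bayes' theorem, assuming independent residues."""
    p_given_m = {"H": Fraction(1, 2), "P": Fraction(1, 4), "C": Fraction(1, 4)}
    p_given_g = {"H": Fraction(1, 3), "P": Fraction(1, 3), "C": Fraction(1, 3)}

    # Likelihood of the sequence under each protein class.
    lik_m = Fraction(1)
    lik_g = Fraction(1)
    for residue in sequence:
        lik_m *= p_given_m[residue]
        lik_g *= p_given_g[residue]

    # Joint probabilities, then the law of total probability.
    joint_m = lik_m * prior_m
    joint_g = lik_g * (1 - prior_m)
    evidence = joint_m + joint_g

    return joint_m / evidence


posterior = membrane_posterior("HHHPH")
print(posterior)                   # 27/91
print(round(float(posterior), 4))  # 0.2967
```

Using `Fraction` rather than floating point keeps every intermediate value exact, so the result can be compared directly with the hand calculation.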

Exercise: Suppose you observed the sequence HHPCHHPHHHCH. What is the posterior probability of the protein being membrane? Why do we get a different result here? Why do we have to observe far longer sequences before we can have a high posterior probability that the sequence belongs to a membrane protein, compared to a similar degree of confidence that the sequence belongs to a globular protein?
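One way to explore the exercise is to rewrite Bayes' theorem in odds form: the posterior odds equal the prior odds times the likelihood ratio, and under the independence assumption the likelihood ratio is a product of one factor per residue. The sketch below is an illustration rather than part of the original post; `posterior_from_odds` and `LIKELIHOOD_RATIO` are hypothetical names. Each H contributes a factor (1/2)/(1/3) = 3/2, and each P or C contributes (1/4)/(1/3) = 3/4.

```python
from fractions import Fraction

# Per-residue likelihood ratios Pr[residue | M] / Pr[residue | G].
LIKELIHOOD_RATIO = {
    "H": Fraction(1, 2) / Fraction(1, 3),  # 3/2
    "P": Fraction(1, 4) / Fraction(1, 3),  # 3/4
    "C": Fraction(1, 4) / Fraction(1, 3),  # 3/4
}


def posterior_from_odds(sequence, prior_m=Fraction(1, 10)):
    """Posterior Pr[M | sequence] computed as prior odds times likelihood ratio."""
    odds = prior_m / (1 - prior_m)  # prior odds of membrane vs. globular: 1/9
    for residue in sequence:
        odds *= LIKELIHOOD_RATIO[residue]
    return odds / (1 + odds)  # convert odds back to a probability


# Sanity check against the worked example above.
print(posterior_from_odds("HHHPH"))  # 27/91
```

Calling `posterior_from_odds("HHPCHHPHHHCH")` gives a value to compare against your own answer, and trying sequences of different lengths and compositions shows how quickly the per-residue factors move the odds away from the 1/9 prior in either direction.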