## More Bayesian Probability

June 25, 2007

Posted by Peter in Exam 1/P, Exam 4/C.

We know that 10% of all proteins are membrane proteins. There are three types of amino acid: hydrophobic (H), polar (P), and charged (C). In globular (non-membrane) proteins, the three types occur in equal proportions, 1/3 each. In membrane proteins, the proportions are H 50%, P 25%, C 25%. Now we have an unidentified sequence: HHHPH. What is the probability that it is a membrane protein?

Solution: Let M be the event that the protein is a membrane protein, and G be the event that the protein is globular (non-membrane). Then

$\Pr[M] = \frac{1}{10}; \quad \Pr[G] = 1 - \Pr[M] = \frac{9}{10},$

if we assume that all proteins are classified as belonging to either type M or type G. Now, we are also given that

$\Pr[H|G] = \Pr[P|G] = \Pr[C|G] = \frac{1}{3};$

that is, the probability that a selected amino acid is hydrophobic, polar, or charged, given that it belongs to a globular protein, is 1/3 each. We also have

$\Pr[H|M] = \frac{1}{2}, \Pr[P|M] = \frac{1}{4}, \Pr[C|M] = \frac{1}{4}.$

Our prior hypothesis is that there is a 0.1 probability that the protein is a membrane protein. Now, the likelihood of observing the amino acid sequence HHHPH, given that the protein is membrane, is

$\Pr[{\it HHHPH}|M] = \Pr[H|M]^4 \Pr[P|M] = \left(\frac{1}{2}\right)^4 \left(\frac{1}{4}\right) = \frac{1}{64}.$

This assumes that amino acid types are independent of each other within a given protein. Similarly, the likelihood of observing the same sequence given that the protein is globular, is

$\Pr[{\it HHHPH}|G] = \left(\frac{1}{3}\right)^5 = \frac{1}{243}.$

The joint probabilities are then

$\begin{array}{c} \Pr[{\it HHHPH}|M]\Pr[M] = \left(\frac{1}{64}\right)\left(\frac{1}{10}\right) = \frac{1}{640}, \\ \Pr[{\it HHHPH}|G]\Pr[G] = \left(\frac{1}{243}\right)\left(\frac{9}{10}\right) = \frac{1}{270},\end{array}$

and therefore the unconditional probability of observing the sequence HHHPH is, by the law of total probability,

${\setlength\arraycolsep{2pt} \begin{array}{rcl}\Pr[{\it HHHPH}] &=& \Pr[{\it HHHPH}|M]\Pr[M] + \Pr[{\it HHHPH}|G]\Pr[G] \\ &=& \frac{91}{17280}. \end{array}}$

Hence by Bayes’ theorem, the posterior probability of the protein being membrane, given that we observed the particular amino sequence HHHPH, is

$\displaystyle \Pr[M|{\it HHHPH}] = \frac{\Pr[{\it HHHPH}|M]\Pr[M]}{\Pr[{\it HHHPH}]} = \frac{1/640}{91/17280} = \frac{27}{91},$

which is approximately 29.67%. This answer makes sense: in the absence of any other information, we could only conclude there is a 10% probability of selecting a membrane protein. Once we have observed the sequence HHHPH, however, the posterior probability is significantly greater, since such a sequence is far more likely to be observed if the protein is membrane than if it is globular (the likelihoods were 1/64 versus 1/243, respectively). But because 90% of all proteins are globular, the posterior probability is still only about 30%.
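
The calculation above is easy to verify by machine. Here is a minimal Python sketch using exact fractions; the function name `posterior_membrane` and the dictionary layout are our own choices, not part of the problem:

```python
from fractions import Fraction

# Priors and per-residue probabilities as given in the problem.
prior_M = Fraction(1, 10)   # membrane
prior_G = 1 - prior_M       # globular
p_M = {"H": Fraction(1, 2), "P": Fraction(1, 4), "C": Fraction(1, 4)}
p_G = {"H": Fraction(1, 3), "P": Fraction(1, 3), "C": Fraction(1, 3)}

def posterior_membrane(seq):
    """Posterior Pr[M | seq], assuming residues are independent."""
    joint_M, joint_G = prior_M, prior_G
    for aa in seq:
        joint_M *= p_M[aa]   # builds Pr[seq | M] Pr[M] residue by residue
        joint_G *= p_G[aa]   # builds Pr[seq | G] Pr[G]
    return joint_M / (joint_M + joint_G)

print(posterior_membrane("HHHPH"))  # 27/91
```

The same function handles any sequence of H, P, and C residues, so it can also be used to check your answer to the exercise below.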

Exercise: Suppose you observed the sequence HHPCHHPHHHCH. What is the posterior probability of the protein being membrane? Why do we get a different result here? Why do we have to observe far longer sequences before we can have a high posterior probability that the sequence belongs to a membrane protein, compared to a similar degree of confidence that the sequence belongs to a globular protein?

## The moment calculation shortcut

June 20, 2007

Posted by Peter in Exam 1/P, Exam 3/MLC, Exam 4/C.

Suppose X is a continuous random variable that takes on nonnegative values. Then we have the following definition:

$\displaystyle {\rm E}[X] = \int_0^\infty x f_X(x) \, dx,$

where $f_X(x)$ is the probability density function of X. Indeed, in the general case, let g be a function on the support of X. Then

$\displaystyle {\rm E}[g(X)] = \int_0^\infty g(x) f_X(x) \, dx.$

So when $g(x) = x^k$ for some positive integer k, we obtain the formula for the k-th raw moment of X. Let’s work through an example.

Show that the expected value (first raw moment) of a Pareto distribution with parameters α and θ is equal to θ/(α-1). Recall that the density of the Pareto distribution is $\displaystyle f(x) = \frac{\alpha \theta^\alpha}{(x+\theta)^{\alpha+1}}.$

Solution: We compute

$\displaystyle {\rm E}[X] = \int_0^\infty \frac{x \alpha\theta^\alpha}{(x+\theta)^{\alpha+1}} \, dx = \alpha\theta^\alpha \!\! \int_0^\infty \frac{x}{(x+\theta)^{\alpha+1}} \, dx.$

Then make the substitution $u = x+\theta$, $du = dx$, to obtain

$\displaystyle {\rm E}[X] = \alpha\theta^\alpha \!\!\int_{u=\theta}^\infty \!\frac{u-\theta}{u^{\alpha+1}} \, du = \alpha\theta^\alpha \left( \int_\theta^\infty \!u^{-\alpha} \, du - \theta \!\int_\theta^\infty \!u^{-\alpha-1} \, du \right) .$

For $\alpha > 1$, the integrals converge, giving

$\displaystyle {\rm E}[X] = \alpha\theta^\alpha\! \left(\!-\frac{\theta^{-\alpha+1}}{-\alpha+1} + \theta \cdot \frac{\theta^{-\alpha}}{-\alpha}\right) = \alpha\theta \left(\frac{1}{\alpha-1} - \frac{1}{\alpha}\right) = \frac{\theta}{\alpha-1},$

which proves the desired result. However, the computation is quite tedious, and there is often an easier approach. We will now show that instead of using the density of X, we can use the survival function of X when computing moments. Recall that

$\displaystyle S_X(x) = \Pr[X > x] = \int_x^\infty f_X(t) \, dt = 1 - F_X(x);$

that is, the survival function is the probability that X exceeds x, which is the integral of the density over the interval $[x, \infty)$, or the complement of the cumulative distribution function $F_X(x)$. With this in mind, let’s try integration by parts on the definition of the expected value, with the choices $u = g(x)$, $du = g'(x) \, dx$; $dv = f_X(x) \, dx$, $v = \int f_X(x) \, dx = F_X(x)$:

${\setlength\arraycolsep{2pt} \begin{array}{rcl} {\rm E}[g(X)] &=& \displaystyle \int_0^\infty \!\! g(x) f_X(x) \, dx = \bigg[g(x) F_X(x)\bigg]_{x=0}^\infty \! - \!\!\int_0^\infty \!\! g'(x) F_X(x) \, dx \\ &=& \displaystyle \bigg[g(x)\left(1-S(x)\right)\bigg]_{x=0}^\infty - \int_0^\infty \! g'(x)\left(1 - S(x)\right) \, dx \\ &=& \displaystyle\int_0^\infty g'(x) S(x) \, dx, \end{array}}$

where the last equality holds because of two assumptions: first, that $\displaystyle \lim_{x \rightarrow \infty} g(x) S(x) = 0$ and that $g(0) = 0$ (both hold for $g(x) = x^k$ whenever the moment exists; if $g(0) \neq 0$, the right-hand side becomes $g(0) + \int_0^\infty g'(x) S(x) \, dx$); and second, that the resulting integral of $g'(x) S(x)$ is convergent. Note that the individual terms of the integration by parts are not themselves convergent, but taken together, they are—thus, a fully rigorous proof requires a more formal treatment than what is furnished here.

A consequence of this result is that for positive integers k,

$\displaystyle {\rm E}[X^k] = \int_0^\infty kx^{k-1} S(x) \, dx.$

This formula is easier to work with in some instances, compared to the original definition. For instance, we know that the Pareto survival function is

$\displaystyle S(x) = \left(\frac{\theta}{x+\theta}\right)^\alpha,$

so we find

$\displaystyle {\rm E}[X] = \int_0^\infty S(x) \, dx = \int_0^\infty \left(\frac{\theta}{x+\theta}\right)^\alpha \, dx$

which we can immediately see has a simpler integrand; indeed, for $\alpha > 1$ it evaluates directly to $\theta^\alpha \cdot \frac{\theta^{1-\alpha}}{\alpha-1} = \frac{\theta}{\alpha-1}$. This result also proves the life contingencies relationship

$\displaystyle \overset{\circ}{e}_x = \int_{t=0}^{\omega-x} \,_t p_x \, dt,$

since the complete expectation of life is simply the expected value E[T(x)] of the future lifetime variable T(x), and $\,_t p_x$ is the survival function of T(x). In life contingencies notation, the definition of expected value would then look like this:

$\displaystyle \overset{\circ}{e}_x = \int_{t=0}^{\omega-x} t \,_t p_x \, \mu_x(t) \, dt,$

which is usually more cumbersome than the formula using only the survival function.
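
As a numerical sanity check on the Pareto example (a sketch, not part of the proof), we can integrate the survival function with a simple midpoint rule and compare against $\theta/(\alpha-1)$; the parameter values below are arbitrary:

```python
# Check E[X] = integral of S(x) dx for a Pareto distribution numerically.
alpha, theta = 3.0, 2.0  # arbitrary illustrative parameters

def S(x):
    # Pareto survival function: (theta / (x + theta))^alpha
    return (theta / (x + theta)) ** alpha

# Midpoint rule on [0, 1000]; for alpha = 3 the tail beyond 1000
# contributes less than 1e-5 to the integral.
n, b = 400_000, 1000.0
h = b / n
mean = h * sum(S((i + 0.5) * h) for i in range(n))

print(mean, theta / (alpha - 1))  # both approximately 1.0
```

The agreement confirms the shortcut formula without ever touching the density.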

## Order Statistics of Exponential RVs

June 5, 2007

Posted by Peter in Exam 1/P, Exam 3/MLC, Exam 4/C.

Here’s a question I read from the AoPS forum, and answered therein:

In analyzing the risk of a catastrophic event, an insurer uses the exponential distribution with mean $\theta$ as the distribution of the time until the event occurs. The insured had n independent catastrophe policies of this type. Find the expected time until the insured will have the first catastrophe claim.

The sum $S = X_{1}+X_{2}+\cdots+X_{n}$ of n independent and identically distributed exponential random variables $X_{1}, X_{2}, \ldots, X_{n}$ is gamma distributed. Specifically, if $X_{k}$ are exponential with mean $\theta$, then S is gamma with mean $n\theta$ and variance $n\theta^{2}$.

It’s noteworthy that the sum S is a random variable that describes the time until the n-th claim if claims followed a Poisson process (whose interarrival times are exponentially distributed).

However, according to the model you specified, the events are not interarrival times, but rather they run concurrently. So the time until the k-th event is NOT gamma; rather, it is the k-th order statistic $Y_{k}$. Fortunately, the first such order statistic $Y_{1}$ is exponential, which we show by recalling that $Y_{k}$ has PDF

$\displaystyle f_{Y_{k}}(x) = \frac{n!}{(k-1)!(n-k)!}F_{X}(x)^{k-1}(1-F_{X}(x))^{n-k}f_{X}(x).$

If k = 1, we immediately obtain

$\displaystyle f_{Y_{1}}(x) = n(e^{-x/\theta})^{n-1}\frac{1}{\theta}e^{-x/\theta} = \frac{n}{\theta}e^{-x(n/\theta)},$

which is exponential with mean $\theta/n$. This answer makes sense, because as the number of policies n held by the insured increases, the expected waiting time until the first claim decreases. This would not be the case if exactly one policy were in force at any given time, each taking effect only after the previous policy's claim; that scenario would correspond to the gamma distribution previously mentioned.
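
The distribution of $Y_1$ is also easy to confirm by simulation. A quick sketch in Python (the parameter values are arbitrary): draw n concurrent exponential lifetimes many times and compare the average minimum to $\theta/n$:

```python
import random

random.seed(2007)  # fixed seed for reproducibility
n, theta, trials = 5, 10.0, 200_000

# Time until the first claim = minimum of n concurrent exponential times.
total = 0.0
for _ in range(trials):
    total += min(random.expovariate(1.0 / theta) for _ in range(n))
empirical_mean = total / trials

print(empirical_mean, theta / n)  # empirical mean is close to theta/n = 2.0
```

With 200,000 trials the standard error is about 0.0045, so the empirical mean should land well within a few hundredths of 2.0.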

To check your understanding, here are some exercises:

1. What is the PDF of the second order statistic $Y_{2}$, and what does it represent?
2. Which of the $Y_{k}$ belong to the gamma or exponential family of distributions?
3. Prove that $\displaystyle {\rm E}[Y_{k}] = \theta \sum_{j=1}^{k}\frac{1}{n-k+j}.$

## Question 12, Spring 2007 Exam 4/C

May 27, 2007

Posted by Peter in Exam 4/C.

12. For 200 auto accident claims you are given:

1. Claims are submitted t months after the accident occurs, t = 0, 1, 2, ….
2. There are no censored observations.
3. $\hat{S}(t)$ is calculated using the Kaplan-Meier product limit estimator.
4. $\displaystyle c_S^2(t) = \frac{\widehat{\rm Var}[\hat{S}(t)]}{\hat{S}^2(t)}$, where $\widehat{\rm Var}[\hat{S}(t)]$ is calculated using Greenwood’s approximation.
5. $\hat{S}(8) = 0.22, \; \hat{S}(9) = 0.16, \; c_S^2(9) = 0.02625, \; c_S^2(10) = 0.04045$.

Determine the number of claims that were submitted to the company 10 months after an accident occurred.

Solution. There are two key observations we need to make. The first is that we are given the risk set at time t = 0, namely $r_0 = 200$. The second observation is that because no observations are censored, the Kaplan-Meier estimator of the survival time takes on a particularly simple form. This is because in the absence of censoring, $r_{j+1} = r_j - s_j$; that is, the risk set at time $y_{j+1}$ is simply the risk set at time $y_j$ minus those who died in the meantime. Therefore

$\displaystyle \hat{S}(y_n) = \prod_{j=0}^n \frac{r_j - s_j}{r_j} = \prod_{j=0}^n \frac{r_{j+1}}{r_j} = \frac{r_{n+1}}{r_0} = \frac{r_{n+1}}{200}.$

So with this in mind, we have $r_{10} = 200\hat{S}(9) = 32$. Recalling Greenwood’s approximation,

$\displaystyle c_S^2(t) = \frac{\widehat{\rm Var}[\hat{S}(t)]}{\hat{S}^2(t)} = \sum_{j=0}^t \frac{s_j}{r_j(r_j-s_j)},$

so $\displaystyle c_S^2(10) - c_S^2(9) = \frac{s_{10}}{r_{10}(r_{10}-s_{10})} = 0.0142.$

Substituting and solving, we obtain $s_{10} = 9.9978 \approx 10$.
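
The final step can be reproduced in a few lines of Python, by solving the linear rearrangement of the Greenwood difference for $s_{10}$:

```python
# r10 follows from the telescoping Kaplan-Meier product: r10 = 200 * S_hat(9).
r10 = round(200 * 0.16)      # = 32
diff = 0.04045 - 0.02625     # c^2(10) - c^2(9) = 0.0142

# diff = s10 / (r10 * (r10 - s10)) rearranges to the linear equation
#   s10 * (1 + diff * r10) = diff * r10**2,
# which we solve directly:
s10 = diff * r10**2 / (1 + diff * r10)

print(s10)  # approximately 9.998, so 10 claims
```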