Gamma & Beta Distributions

Gamma Function

Before introducing the Gamma and Beta distributions, we need the special functions that serve as their building blocks. The gamma function extends the factorial to non-integer (and even complex) arguments. This generalization is essential: the normalizing constants of many continuous distributions involve factorials of non-integer parameters, and the gamma function provides the mathematically rigorous way to handle them.

Definition: Gamma Function

The gamma function is defined as: \[ \Gamma(z) = \int_{0}^{\infty} t^{z-1}e^{-t}\, dt, \quad \text{Re}(z) > 0. \]

Theorem 1:

For any positive integer \(n\), \[ \Gamma (n+1) = n! \]

Proof:

Substituting \(z= n+1\) into the definition: \[ \Gamma (n+1) = \int_{0} ^\infty t^{n}e^{-t} dt \] By integration by parts

(\(u = t^n \Longrightarrow du = nt^{n-1}dt\), and \(dv = e^{-t} dt \Longrightarrow v = -e^{-t}\).)

\[ \begin{align*} \Gamma (n+1) &= [-t^ne^{-t}]_{0} ^\infty + \int_{0} ^\infty nt^{n-1}e^{-t} dt \\\\ &= n\int_{0} ^\infty t^{n-1}e^{-t} dt \\\\ &= n \Gamma(n) \end{align*} \] The boundary term vanishes because \(t^n e^{-t} \to 0\) both as \(t \to 0\) and as \(t \to \infty\). Thus, the gamma function satisfies the same recursion as the factorial.

Here, we assume that \(\Gamma(k+1) = k!\) for some \(k \in \mathbb{Z}^+\). By the recursive property, \[ \begin{align*} \Gamma (k+2) &= (k+1)\Gamma(k+1) \\\\ &= (k+1)k! \\\\ &= (k+1)! \\\\ \end{align*} \] Also, when \(n = 1\): \[ \Gamma (2) = \int_{0} ^\infty te^{-t} dt \] By integration by parts

(\(u = t \Longrightarrow du =dt\), and \(dv = e^{-t} dt \Longrightarrow v = -e^{-t}\).)

\[ \begin{align*} \Gamma (2) &= [-te^{-t}]_{0} ^\infty + \int_{0} ^\infty e^{-t} dt \\\\ &= 1 \\\\ &= (2-1)! \end{align*} \] Therefore, by mathematical induction, \(\Gamma(n+1) = n!\) or equivalently, \(\Gamma (n) = (n-1)!\).
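Theorem 1 can be sanity-checked numerically. The sketch below is only illustrative: it compares `math.factorial`, the standard library's `math.gamma`, and a crude midpoint-rule approximation of the defining integral. The helper name `gamma_integral`, the truncation point `upper`, and the step count are assumptions chosen for this example, not part of the text above.

```python
import math

def gamma_integral(z, upper=60.0, steps=100_000):
    """Midpoint-rule approximation of the integral defining Gamma(z).

    The infinite upper limit is truncated at `upper` (an assumption that is
    harmless here because e^{-t} makes the tail negligible for small z).
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += t ** (z - 1) * math.exp(-t)
    return total * h

for n in range(1, 6):
    # The three values in each row should agree up to numerical error.
    print(n, math.factorial(n), math.gamma(n + 1), round(gamma_integral(n + 1), 6))
```

The midpoint rule is used only because it needs no external libraries; any quadrature routine would serve the same purpose.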

Let's compute the most iconic value of the gamma function. \[ \Gamma (\frac{1}{2}) = \int_{0} ^\infty t^{\frac{-1}{2}}e^{-t} dt \] Let \(t = u^2\), so \(dt =2udu\). Then \[ \begin{align*} \Gamma (\frac{1}{2}) &= \int_{0} ^\infty u^{2\cdot \frac{-1}{2}}e^{-u^2} 2udu \\\\ &= 2\int_{0} ^\infty e^{-u^2} du \\\\ &= \int_{-\infty} ^\infty e^{-u^2} du \tag{1} \\\\ &= \sqrt{\pi} \end{align*} \] Note: step (1) uses the fact that the Gaussian function \[ f(x) = e^{-x^2} \] is even, and the final step uses the Gaussian integral \[ \int_{-\infty} ^\infty e^{-x^2} dx = \sqrt{\pi}. \]

We will revisit this important result in the section on the normal distribution.

Technically, the factorial \(n!\) is defined only for non-negative integers, but "informally", we might consider \(\Gamma (0.5) = (-0.5)!\). By the recursive property of the factorial, \(n! = n(n-1)!\), we can compute non-integer factorials: \[\begin{align*} &(0.5)! = 0.5(-0.5)! = \frac{1}{2}\sqrt{\pi} = \Gamma(1.5) \\\\ &(1.5)! = 1.5(0.5)! = \frac{3}{4}\sqrt{\pi} = \Gamma(2.5) \\\\ &(2.5)! = 2.5(1.5)! = \frac{15}{8}\sqrt{\pi} = \Gamma(3.5) \end{align*} \] (Formally, we should use the recursive property of the gamma function, \(\Gamma (z+1) = z \Gamma(z)\), which holds for every \(z > 0\) by the same integration by parts as in the proof of Theorem 1.)
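The half-integer values above can be checked against the standard library. This is a small sketch; the dictionary name `expected` is just an illustrative choice.

```python
import math

sqrt_pi = math.sqrt(math.pi)

# Closed forms derived above, keyed by the argument of the gamma function.
expected = {
    0.5: sqrt_pi,
    1.5: sqrt_pi / 2,
    2.5: 3 * sqrt_pi / 4,
    3.5: 15 * sqrt_pi / 8,
}

for z, value in expected.items():
    # math.gamma(z) should match the closed form up to rounding error.
    print(z, math.gamma(z), value)
```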

BONUS: Volume of the \(n\)-dimensional ball of radius \(r\) \[ V = \frac{\pi^{\frac{n}{2}}}{\Gamma (\frac{n}{2} + 1)}r^n. \]
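As a quick check of the volume formula, the sketch below evaluates it in low dimensions, where the answers are familiar (length \(2r\), area \(\pi r^2\), volume \(\frac{4}{3}\pi r^3\)). The function name `ball_volume` is an assumption made for this example.

```python
import math

def ball_volume(n, r=1.0):
    """Volume of the n-dimensional ball of radius r via the gamma function."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * r ** n

r = 2.0
print(ball_volume(1, r), 2 * r)                      # length of an interval
print(ball_volume(2, r), math.pi * r ** 2)           # area of a disk
print(ball_volume(3, r), 4 / 3 * math.pi * r ** 3)   # volume of a 3D ball
```

Odd dimensions are where the half-integer values of the gamma function, such as \(\Gamma(\frac{3}{2}) = \frac{1}{2}\sqrt{\pi}\), enter the formula.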

Gamma Distribution

With the gamma function in hand, we can now define a flexible family of continuous distributions on \([0, \infty )\). The gamma distribution arises naturally as the distribution of waiting times: if events occur independently at a constant average rate, the time until the \(\alpha\)-th event (for a positive integer \(\alpha\)) follows a gamma distribution.

Definition: Gamma Distribution

A random variable \(X\) has the gamma distribution with shape parameter \(\alpha > 0\) and rate parameter \(\beta > 0\) if its p.d.f. is: \[ f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, \quad x \geq 0. \tag{2} \] We write \(X \sim \text{Gamma}(\alpha, \beta)\). The mean and variance are: \[ \mathbb{E}[X] = \frac{\alpha}{\beta}, \qquad \text{Var}(X) = \frac{\alpha}{\beta^2}. \]

Mean and Variance of the Gamma Distribution:

Using the definition of the mean, \[ \begin{align*} \mathbb{E }[X] &= \int_{0}^\infty x f(x)dx \\\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_{0}^\infty x^{\alpha}e^{-\beta x} dx \end{align*} \] Let \(t = \beta x\), so that \(x = \frac{t}{\beta}\) and \(dx = \frac{1}{\beta}dt\). Substituting these into the equation: \[ \begin{align*} \mathbb{E }[X] &= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_{0}^\infty (\frac{t}{\beta})^{\alpha}e^{-t} \frac{1}{\beta}dt \\\\ &= \frac{\beta^\alpha}{\Gamma(\alpha) \beta^{\alpha +1}} \int_{0}^\infty t^{\alpha}e^{-t} dt \\\\ &= \frac{\Gamma (\alpha +1)}{\Gamma(\alpha) \beta} \\\\ \end{align*} \] Since \(\Gamma(\alpha +1) = \alpha \Gamma(\alpha)\), \[ \begin{align*} \mathbb{E }[X] &= \frac{\alpha \Gamma (\alpha)}{\Gamma(\alpha) \beta} \\\\ &= \frac{\alpha}{\beta} \end{align*} \] For the variance, we use the fact that \(\text{Var }(X) = \mathbb{E }[X^2] - (\mathbb{E }[X] )^2\).

Compute \(\mathbb{E }[X^2]\) by applying the same substitution as above: \[ \begin{align*} \mathbb{E }[X^2] &= \int_{0}^\infty x^2 f(x)dx \\\\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_{0}^\infty x^{\alpha+1}e^{-\beta x} dx \\\\ &= \frac{\beta^\alpha}{\Gamma(\alpha) \beta^{\alpha +2}} \int_{0}^\infty t^{\alpha+1}e^{-t} dt \\\\ &= \frac{\Gamma (\alpha +2)}{\Gamma(\alpha) \beta^2}. \end{align*} \] Since \(\Gamma(\alpha +2) = (\alpha +1)\Gamma (\alpha +1) = (\alpha +1 )\alpha \Gamma(\alpha)\), \[ \begin{align*} \mathbb{E }[X^2] &= \frac{(\alpha +1 )\alpha \Gamma(\alpha)}{\Gamma(\alpha) \beta^2} \\\\ &= \frac{\alpha(\alpha +1 )}{\beta^2} \end{align*} \] Substitute the results: \[ \begin{align*} \text{Var }(X) &= \mathbb{E }[X^2] - (\mathbb{E }[X] )^2 \\\\ &= \frac{\alpha(\alpha +1 )}{\beta^2} - (\frac{\alpha}{\beta})^2 \\\\ &= \frac{\alpha}{\beta^2} \end{align*} \]
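The closed forms \(\frac{\alpha}{\beta}\) and \(\frac{\alpha}{\beta^2}\) can be compared with a direct numerical integration of the p.d.f. (2). The following is a rough sketch under stated assumptions: the helper names `gamma_pdf` and `moment`, the parameter values, the truncation point, and the step count are all illustrative choices.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Density (2) with shape alpha and rate beta, evaluated via logarithms."""
    if x <= 0:
        return 0.0  # simplification: the integration below never evaluates at x = 0
    log_f = (alpha * math.log(beta) - math.lgamma(alpha)
             + (alpha - 1) * math.log(x) - beta * x)
    return math.exp(log_f)

def moment(k, alpha, beta, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of E[X^k], truncating the integral at `upper`."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** k * gamma_pdf(x, alpha, beta)
    return total * h

alpha, beta = 2.5, 1.5
mean = moment(1, alpha, beta)
var = moment(2, alpha, beta) - mean ** 2

print(mean, alpha / beta)        # numerical vs. closed-form mean
print(var, alpha / beta ** 2)    # numerical vs. closed-form variance
```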

Note: The gamma distribution is also often parametrized by a shape \(k\) and a scale \(\theta\) (the reciprocal of the rate): \[ X \sim \text{Gamma }(k = \alpha, \theta = \frac{1}{\beta}). \]
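The choice of parametrization matters in practice because libraries differ. For instance, Python's standard-library `random.gammavariate(alpha, beta)` treats its second argument as a scale, not a rate, so a rate must be inverted before it is passed in. The sketch below illustrates the conversion; the variable names, the seed, and the sample size are assumptions made for this example.

```python
import random

alpha, rate = 2.5, 1.5
scale = 1.0 / rate  # theta = 1 / beta

random.seed(0)  # fixed seed so that the run is repeatable
# random.gammavariate expects (shape, scale), so we pass the scale here.
samples = [random.gammavariate(alpha, scale) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

# The sample mean should be close to alpha / rate = alpha * scale.
print(sample_mean, alpha / rate)
```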

The exponential distribution is a special case of the gamma distribution: \[ X \sim \text{Exp }(\lambda) \iff X \sim \text{Gamma }(1, \beta = \lambda). \] So, p.d.f. (2) becomes \[ f(x) = \lambda e^{-\lambda x} \qquad x \geq 0. \] The exponential distribution models the waiting time between events in a process in which events occur continuously and independently at an average rate \(\lambda\). For example, if a machine fails on average once every 20 years, then the time to failure is modeled by the exponential distribution with \(\lambda = \frac{1}{20}\).

The mean and variance of an exponential distribution are given by \[ \mathbb{E }[X] = \frac{1}{\lambda} \qquad \text{Var }(X) = \frac{1}{\lambda^2}. \] This implies that the mean is equal to the standard deviation. Also, its c.d.f. is given by \[ F(x) = \lambda \int_{0} ^x e^{-\lambda u} du = 1 - e^{-\lambda x}. \]
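To make the c.d.f. concrete, the sketch below evaluates it for a rate of \(\lambda = \frac{1}{20}\) per year. The function name `exp_cdf` and the 10-year horizon are illustrative assumptions.

```python
import math

def exp_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 otherwise."""
    # -expm1(-y) equals 1 - exp(-y) but is more accurate for small y.
    return -math.expm1(-lam * x) if x >= 0 else 0.0

lam = 1 / 20  # one event per 20 years on average

print(exp_cdf(10, lam))                  # probability of an event within 10 years
print(math.log(2) / lam)                 # median waiting time, where F(x) = 0.5
print(1 / lam, math.sqrt(1 / lam ** 2))  # mean and standard deviation coincide
```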

Beta Function

Just as the gamma function underpins the gamma distribution, the beta function provides the normalizing constant for distributions supported on bounded intervals. Its intimate connection to the gamma function - expressed through Theorem 2 below - is what makes the beta distribution analytically tractable.

Definition: Beta Function

The beta function is defined as: \[ B(a, b) = \int_{0}^1 t^{a-1}(1-t)^{b-1}\, dt, \quad \text{Re}(a) > 0,\; \text{Re}(b) > 0. \]

Theorem 2:

The beta function can be represented by the gamma function: \[ B(a, b) = \frac{\Gamma (a) \Gamma (b)}{\Gamma (a+b)} \]

Proof:

Consider the product of two gamma functions with arguments \(a, b > 0\). \[ \begin{align*} \Gamma (a) \Gamma(b) &= \int_{0}^\infty u^{a-1}e^{-u}du \cdot \int_{0}^\infty v^{b-1}e^{-v}dv \\\\ &= \int_{0}^\infty \int_{0}^\infty u^{a-1} v^{b-1} e^{-(u+v)} dudv \end{align*} \] Let \(s = u + v\) and \(t = \frac{u}{u+v} \), so that \(s \in (0, \infty)\) and \(t \in (0, 1)\). Then \[ u = (u+v)t = st \quad \text{and} \quad v = s - u = (1-t)s, \] and the Jacobian determinant of this change of variables gives \[ \left| \frac{\partial (u, v)}{\partial (s, t)} \right| = \left| t \cdot (-s) - s \cdot (1-t) \right| = s \Longrightarrow du\, dv = s\, dt\, ds. \] Substituting these into the product: \[ \begin{align*} \Gamma (a) \Gamma(b) &= \int_{0}^\infty \int_{0}^1 (st)^{a-1} ((1-t)s)^{b-1} e^{-s} s dt ds \\\\ &= \int_{0}^\infty s s^{a-1} s^{b-1} e^{-s} ds \cdot \int_{0}^1 t^{a-1} (1-t)^{b-1} dt \\\\ &= \int_{0}^\infty s^{(a+b)-1} e^{-s} ds \cdot \int_{0}^1 t^{a-1} (1-t)^{b-1} dt \\\\ &= \Gamma (a+b) \cdot B(a, b) \end{align*} \] Therefore, \[ B(a, b) = \frac{\Gamma (a) \Gamma (b)}{\Gamma (a+b)} \]
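Theorem 2 can also be checked numerically by comparing the gamma-function expression with a direct approximation of the defining integral. This is a sketch only: the function names `beta_via_gamma` and `beta_integral`, the test arguments, and the step count are assumptions for illustration.

```python
import math

def beta_via_gamma(a, b):
    """B(a, b) computed from Theorem 2."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_integral(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral defining B(a, b)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoints avoid evaluating at t = 0 or t = 1
        total += t ** (a - 1) * (1 - t) ** (b - 1)
    return total * h

print(beta_via_gamma(2, 3), 1 / 12)  # B(2, 3) = 1! 2! / 4! = 1/12
print(beta_via_gamma(2.5, 3.5), beta_integral(2.5, 3.5))
```

For arguments below 1 the integrand is unbounded at an endpoint, so this simple quadrature would converge slowly there; the gamma-function form has no such issue.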

Beta Distribution

The beta function leads directly to a distribution on the unit interval \([0, 1]\). This makes the beta distribution the natural choice for modeling quantities that represent probabilities or proportions - and it is the reason the beta distribution plays a central role as a conjugate prior in Bayesian inference.

Definition: Beta Distribution

A random variable \(X\) has a beta distribution on the interval \([0, 1]\) with parameters \(a > 0\) and \(b > 0\) if its p.d.f. is: \[ f(x) = \frac{1}{B(a, b)}x^{a-1}(1-x)^{b-1}, \quad x \in [0, 1]. \tag{3} \] We write \(X \sim \text{Beta}(a, b)\). The mean and variance are: \[ \mathbb{E}[X] = \frac{a}{a+b}, \qquad \text{Var}(X) = \frac{ab}{(a+b)^2(a+b+1)}. \]

Mean and Variance of the Beta Distribution:

We can rewrite (3) using Theorem 2: \[ f(x) = \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)}x^{a-1} (1-x)^{b-1} \quad x \in [0, 1]. \] Using the definition of the mean, \[ \begin{align*} \mathbb{E }[X] &= \int_{0}^1 x f(x)dx \\\\ &= \int_{0}^1 x \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)}x^{a-1} (1-x)^{b-1} dx \\\\ &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} \int_{0}^1 x^{a} (1-x)^{b-1} dx \\\\ &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} B(a+1, b) \\\\ &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} \frac{\Gamma (a+1) \Gamma (b)}{\Gamma (a+b+1)} \end{align*} \] Since \(\Gamma (a+1) = a \Gamma (a)\) and similarly, \(\Gamma (a+b+1) = (a+b)\Gamma (a+b)\), \[ \begin{align*} \mathbb{E }[X] &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} \frac{a\Gamma (a) \Gamma (b)}{(a+b)\Gamma (a+b)} \\\\ &= \frac{a}{a+b} \end{align*} \] For the variance, we use the fact that \(\text{Var }(X) = \mathbb{E }[X^2] - (\mathbb{E }[X] )^2\).

Compute \(\mathbb{E }[X^2]\): \[ \begin{align*} \mathbb{E }[X^2] &= \int_{0}^1 x^2 f(x)dx \\\\ &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} \int_{0}^1 x^{a+1} (1-x)^{b-1} dx \\\\ &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} B(a+2, b) \\\\ &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} \frac{\Gamma (a+2) \Gamma (b)}{\Gamma (a+b+2)} \end{align*} \] Since \(\Gamma (a+2) = (a+1) \Gamma (a+1) = (a+1) a \Gamma (a)\), and similarly, \(\Gamma (a+b+2) = (a+b+1)(a+b)\Gamma (a+b)\), \[ \begin{align*} \mathbb{E }[X^2] &= \frac{\Gamma (a+b)}{\Gamma (a) \Gamma (b)} \frac{(a+1)a\Gamma (a) \Gamma (b)}{(a+b+1)(a+b)\Gamma (a+b)} \\\\ &= \frac{a(a+1)}{(a+b)(a+b+1)}. \end{align*} \] Substitute the results: \[ \begin{align*} \text{Var }(X) &= \mathbb{E }[X^2] - (\mathbb{E }[X] )^2 \\\\ &= \frac{a(a+1)}{(a+b)(a+b+1)} - (\frac{a}{a+b})^2 \\\\ &= \frac{a(a+1)(a+b) - a^2 (a+b+1) }{(a+b)^2(a+b+1)} \\\\ &= \frac{ab}{(a+b)^2(a+b+1)} \end{align*} \]
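The derivation above rests on the identity \(\mathbb{E}[X^k] = \frac{B(a+k, b)}{B(a, b)}\). The sketch below evaluates that ratio with `math.lgamma` and compares the resulting mean and variance with the closed forms. The helper names `log_beta` and `beta_moment` and the parameter values are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """log B(a, b), using Theorem 2 in logarithmic form."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_moment(k, a, b):
    """E[X^k] for X ~ Beta(a, b), computed as B(a + k, b) / B(a, b)."""
    return math.exp(log_beta(a + k, b) - log_beta(a, b))

a, b = 2.0, 5.0
mean = beta_moment(1, a, b)
var = beta_moment(2, a, b) - mean ** 2

print(mean, a / (a + b))                          # mean vs. closed form
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))  # variance vs. closed form
```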

The beta family includes several important special cases. The simplest is the uniform distribution, which arises when both parameters equal one.

The uniform distribution on \([0, 1]\) is a special case of the beta distribution: \[ X \sim U[0, 1] \iff X \sim \text{Beta }(1, 1), \text{ in which case } f(x) = 1 \text{ for } x \in [0, 1]. \] More generally, the p.d.f. of the uniform distribution \(X \sim \text{Uniform}[a, b]\) on an arbitrary interval is given by \[ f(x) = \frac{1}{b - a} \qquad x \in [a, b], \] and \(f(x) = 0\) otherwise. (Here \(a < b\) denote the endpoints of the interval, not the parameters of the beta distribution.)

The uniform distribution represents the situations where every value is equally likely over the interval \([a, b]\). For example, this distribution is very popular for generating random numbers in programming languages.

The mean and variance of the uniform distribution are given by \[ \mathbb{E }[X] = \frac{a+b}{2} \qquad \text{Var }(X) = \frac{(b-a)^2}{12} \] and its c.d.f. is given by \[ F(x) = \frac{x-a}{b-a} \qquad x \in [a, b]. \] Note: if \(x < a\), \(\, F(x) = 0\) and if \(x > b\), \(\, F(x) = 1\).

Mean and Variance of the Uniform Distribution:

Using the definition of the mean, \[ \begin{align*} \mathbb{E }[X] &= \int_{a}^b x \frac{1}{b-a}dx \\\\ &= \frac{1}{b-a} (\frac{b^2 - a^2}{2}) \\\\ &= \frac{a+b}{2} \end{align*} \] Also, \[ \begin{align*} \mathbb{E }[X^2] &= \int_{a}^b x^2 \frac{1}{b-a}dx \\\\ &= \frac{1}{b-a} (\frac{b^3 - a^3}{3}) \\\\ &= \frac{a^2 +ab + b^2}{3} \end{align*} \] and then \[ \begin{align*} \text{Var }(X) &= \mathbb{E }[X^2] - (\mathbb{E }[X] )^2 \\\\ &= \frac{a^2 +ab + b^2}{3} - (\frac{a+b}{2})^2 \\\\ &= \frac{4(a^2 + ab +b^2)-3(a^2 + 2ab + b^2)}{12} \\\\ &= \frac{(b-a)^2}{12}. \end{align*} \]
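Because every step of this derivation is rational arithmetic, it can be verified exactly with `fractions.Fraction`. The sketch below plugs in the interval \([2, 5]\); the function name `uniform_moments` and the endpoints are assumptions chosen for illustration.

```python
from fractions import Fraction

def uniform_moments(a, b):
    """Exact mean and variance of Uniform[a, b] from the integrals above."""
    a, b = Fraction(a), Fraction(b)
    mean = (b ** 2 - a ** 2) / (2 * (b - a))     # integral of x / (b - a)
    second = (b ** 3 - a ** 3) / (3 * (b - a))   # integral of x^2 / (b - a)
    return mean, second - mean ** 2

mean, var = uniform_moments(2, 5)
print(mean, var)  # 7/2 3/4
```

Here \(\frac{a+b}{2} = \frac{7}{2}\) and \(\frac{(b-a)^2}{12} = \frac{9}{12} = \frac{3}{4}\), matching the closed forms.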

Insight: Why Beta and Gamma? (Conjugate Priors)

In Bayesian inference, a conjugate prior is a prior distribution that, when multiplied by the likelihood, results in a posterior distribution of the same functional form (family) as the prior.

The beta distribution is the conjugate prior for Bernoulli and binomial likelihoods. This makes it the standard choice for modeling uncertainty about a probability (e.g., click-through rates).
The gamma distribution is the conjugate prior for the rate of a Poisson distribution and for the precision (inverse variance) of a Gaussian with known mean.

Using these distributions allows us to update our beliefs with simple algebraic additions to the parameters, avoiding the need for complex numerical integration.
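As a concrete illustration of these "algebraic additions", the sketch below applies the standard conjugate update rules: a \(\text{Beta}(a, b)\) prior with \(s\) successes and \(f\) failures yields a \(\text{Beta}(a+s, b+f)\) posterior, and a \(\text{Gamma}(\alpha, \beta)\) prior (rate parametrization) on a Poisson rate with observed counts \(x_1, \dots, x_n\) yields \(\text{Gamma}(\alpha + \sum x_i, \beta + n)\). The function names and the data are hypothetical, chosen only for this example.

```python
def beta_binomial_update(a, b, successes, failures):
    """Posterior Beta parameters after observing Bernoulli/binomial outcomes."""
    return a + successes, b + failures

def gamma_poisson_update(alpha, beta, counts):
    """Posterior Gamma (shape, rate) after observing Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

# Hypothetical click-through data: 7 clicks in 10 impressions, uniform prior.
a_post, b_post = beta_binomial_update(1, 1, successes=7, failures=3)
print(a_post, b_post, a_post / (a_post + b_post))  # posterior mean a / (a + b)

# Hypothetical event counts observed over four intervals.
alpha_post, beta_post = gamma_poisson_update(2.0, 1.0, [3, 0, 2, 4])
print(alpha_post, beta_post, alpha_post / beta_post)  # posterior mean alpha / beta
```

The posterior means use the formulas \(\frac{a}{a+b}\) and \(\frac{\alpha}{\beta}\) derived earlier in this section.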

More generally, most common distributions in machine learning belong to the exponential family, which guarantees the existence of a conjugate prior.

Interactive Visualization

Below is an interactive visualization to help you understand the Gamma and Beta distributions. You can adjust the parameters using the sliders and observe how the probability density function changes. Try the special cases to see important variants of these distributions.

With the Gamma and Beta distributions established, we now turn to the single most important distribution in all of statistics: the normal (Gaussian) distribution.