# Stat 210A Lecture 4: Factorization, Minimal Sufficiency

# Review

\(T(X)\) is *sufficient* for \(\mathcal{P} = \{ P_\theta \colon \theta \in \Theta \}\) if \(P_\theta(X \mid T)\) does not depend on \(\theta\).

*Interpretation*: \(T(X)\) has all the information about \(\theta\). Once we know \(T(X)\), \(X\) has nothing more to tell us about what \(\theta\) might have been.

We can think of the data as being generated by a 2-step process:

\[ \Theta \longrightarrow T \longrightarrow X \]

A similar interpretation: if we forgot to write down \(X\) in a scientific experiment but kept \(T\), we could use \(T\) to generate a new \(\tilde{X}\), and nobody could tell the difference, even if they knew the true \(\theta\).
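To make the two-step picture concrete, here is a minimal sketch under an assumed model that is not part of the lecture: \(X_1, \ldots, X_n\) i.i.d. Bernoulli(\(\theta\)) with \(T(X) = \sum_i X_i\). The helper names `conditional_given_sum` and `regenerate` are hypothetical. The first computes \(P_\theta(X = x \mid T = t)\) exactly, and the second draws fake data \(\tilde{X}\) from \(T\) alone.

```python
import random
from fractions import Fraction
from itertools import product


def conditional_given_sum(theta, n):
    """Map each x in {0,1}^n to P_theta(X = x | sum(X) = sum(x)) for iid Bernoulli(theta)."""
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
    }
    # Marginal probability of each value t of the statistic T = sum(X).
    marginal = {}
    for x, p in joint.items():
        marginal[sum(x)] = marginal.get(sum(x), 0) + p
    return {x: p / marginal[sum(x)] for x, p in joint.items()}


def regenerate(t, n, rng):
    """Draw fake data with sum t by placing t ones in uniformly random positions."""
    ones = set(rng.sample(range(n), t))
    return tuple(1 if i in ones else 0 for i in range(n))


# Exact arithmetic: the conditional law is identical for two different thetas.
cond_a = conditional_given_sum(Fraction(1, 5), 3)
cond_b = conditional_given_sum(Fraction(7, 10), 3)
same_conditional = cond_a == cond_b

# Fake data built from T = 2 with n = 5, without ever using theta.
fake = regenerate(2, 5, random.Random(0))
```

Because `regenerate` never looks at \(\theta\), it can only work when the conditional law of \(X\) given \(T\) is free of \(\theta\), which is what `same_conditional` checks for this assumed model.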

# Sufficiency principle

*Sufficiency principle.* If \(T(X)\) is sufficient, then any statistical procedure (estimation, confidence intervals, Bayesian, etc.) should depend only on \(T(X)\).

Say that we generate \(\tilde{X}\) from \(T(X)\): “fake data” that is equal in distribution to \(X\), regardless of \(\theta\). In particular, \(\delta(X) \stackrel{d}{=} \delta(\tilde{X})\) for any procedure \(\delta\).

It seems silly to estimate based on \(\tilde{X}\). In a precise sense, we can prove that an estimation procedure relying on more than \(T(X)\) carries some additional randomization that is undesirable, even if the data it relies on is “true” data.

*Note:* Sufficiency of a statistic is defined with respect to a model. If the model gets “bigger”, then some statistics may no longer be sufficient.

# Factorization

Let’s now prove a convenient criterion for showing that a statistic is sufficient. We will give a “physics proof”, where we blithely throw symbols around.

*(Factorization)*

Let \(\mathcal{P} = \{ p_\theta \colon \theta \in \Theta \}\) be a family of densities w.r.t a common measure \(\mu\). Then \(T(X)\) is sufficient iff \(\exists g_\theta \geq 0, h \geq 0\) such that

\[ p_\theta(x) \stackrel{a.e.}{=} g_\theta(T(x)) \, h(x) \]

a.e. wrt \(\mu\) (i.e. equal except on some measure-0 set with respect to \(\mu\))

*Proof.* (For a rigorous proof see Keener 6.4)

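(\(\Leftarrow\)) We are given \(g_\theta\) and \(h\). Since \(g_\theta(T(z)) = g_\theta(t)\) is constant on the set \(\{T(z) = t\}\), it cancels between numerator and denominator:

\[ p_\theta(x \mid T(x) = t) = \frac{1_{\{T(x) = t\}} g_\theta(t) h(x)}{\int_{T(z) = t} g_\theta(t) h(z) d\mu(z)} \]

\[ = \frac{1_{\{ T(x) = t \}} h(x)}{\int_{T(z) = t} h(z) d\mu(z)} \]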

This doesn’t depend on \(\theta\)! For a rigorous proof for non-discrete \(\mu\), we need to use measure theory.

(\(\Rightarrow\)) Take \(g_\theta(t)\) to be the density of \(T\), \(g_\theta(t) = P_\theta(T(X) = t)\). Take \(h(x)\) to be the conditional density of \(X\) given \(T\), \(h(x) = P_\theta(X = x \mid T(X) = T(x))\).

By sufficiency this doesn’t depend on \(\theta\), so \(h(x)\) is valid, and \(p_\theta(x) = g_\theta(T(x)) h(x)\).

**Examples**

The big, important one.

*Exponential families.* We’ve been calling \(T(X)\) the “sufficient statistic” in these bad bois so far, and it turns out it is sufficient! See this by factorizing the likelihood:

\[ p_\theta(x) = \underbrace{e^{\eta(\theta)' T(x) - B(\theta)}}_{g_\theta(T(x))} h(x) \]
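As a worked instance (assuming the model \(X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\theta, 1)\), chosen here for illustration), expanding the square in the joint density gives the factorization explicitly:

\[ p_\theta(x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-(x_i - \theta)^2/2} = \underbrace{e^{\theta \sum_i x_i - n\theta^2/2}}_{g_\theta(T(x))} \cdot \underbrace{(2\pi)^{-n/2} e^{-\sum_i x_i^2/2}}_{h(x)} \]

Here \(T(x) = \sum_i x_i\), \(\eta(\theta) = \theta\), and \(B(\theta) = n\theta^2/2\).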

*Order statistics.* Let \(X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} P_\theta^{(1)}\) for any univariate model \(\mathcal{P}^{(1)} = \{ P_\theta^{(1)} \colon \theta \in \Theta \}\), \(X_i \in \mathbb{R}\).

Take \(T(X) = (X_{(1)}, \ldots, X_{(n)})\) where \(X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}\); that is, we sort the vector. \(T\) is called the vector of *order statistics*. \(T(X)\) is sufficient, because once we know the order statistics, the conditional distribution of \(X\) is uniform over all permutations of them.

*Uniform location family.* Let \(X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} U[\theta, \theta + 1]\), with density \(p_\theta^{(1)}(x) = 1_{\{\theta \leq x \leq \theta + 1\}}\).

Then \(p_\theta(x) = \prod_{i=1}^n 1_{\{\theta \leq x_i \leq \theta + 1\}}\) \(= 1_{\{\theta \leq x_{(1)}, \, x_{(n)} \leq \theta + 1 \}}\)

Then \(T(X) = (X_{(1)}, X_{(n)})\) is sufficient for this model, again because we can factorize.
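The factorization for the uniform model can be checked numerically. The sketch below is illustrative only: the function names and the sample values are assumptions, not part of the lecture. It evaluates the joint density once from the full sample and once from \((x_{(1)}, x_{(n)})\) alone.

```python
def likelihood(x, theta):
    """Joint density of x under iid U[theta, theta + 1]: a product of indicators."""
    value = 1.0
    for xi in x:
        value *= 1.0 if theta <= xi <= theta + 1 else 0.0
    return value


def likelihood_from_extremes(x_min, x_max, theta):
    """Same density, computed from the statistic (x_(1), x_(n)) only."""
    return 1.0 if (theta <= x_min and x_max <= theta + 1) else 0.0


x = [0.9, 0.4, 1.1, 0.7]                 # hypothetical sample
thetas = [0.0, 0.2, 0.3, 0.45, 0.5]      # candidate parameter values
full = [likelihood(x, th) for th in thetas]
reduced = [likelihood_from_extremes(min(x), max(x), th) for th in thetas]
```

The two lists agree for every \(\theta\) tried, so here \(g_\theta(T(x))\) is the indicator and \(h(x) = 1\).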

# Minimal sufficiency

As an example, let \(X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\theta, 1)\). There are several sufficient statistics we have discovered:

- \(T(X) = \sum X_i\) is the sufficient statistic for the exponential family defined by the product of distributions
- \(X = (X_1, \ldots X_n)\) (just the data) is trivially sufficient
- \(S(X) = (X_{(1)}, \ldots, X_{(n)})\)

Notice that we can recover \(S(X)\) from \(X\), but not from \(T(X)\). We can recover \(T(X)\) from either \(X\) or \(S(X)\).

Since \(T(X)\) can be recovered from them, \(X\) and \(S(X)\) can be compressed further. Does the same hold for \(T(X)\)? No, because it is *minimal sufficient*. Before we explore this, let’s prove a useful proposition:

**Proposition.** If \(T(X)\) is sufficient and \(T(X) = f(S(X))\) then \(S(X)\) is sufficient.

*Proof.* \(p_\theta(x) = g_\theta(T(x))h(x)\) \(= (g_\theta \circ f)(S(x)) h(x)\).

We were able to factorize \(p_\theta(x)\) using \(S\), so it is sufficient.

This proposition says intuitively that if \(T\) has all information about \(\theta\) and we can compute \(T\) from \(S\), then \(S\) must have all information about \(\theta\).

\(T(X)\) is *minimal sufficient* if:

- \(T(X)\) is sufficient
- For any \(S(X)\) sufficient, \(T(X) \stackrel{a.s.}{=} f(S(X))\) for some \(f\)

Note that condition 2 by itself is not all that restrictive. Constant statistics always satisfy condition 2. It’s the combination that is hard to satisfy.

Note that the sufficiency principle applies to minimal sufficient statistics because they are sufficient.

*Are minimal sufficient statistics unique?* Let \(T_1, T_2\) be minimal sufficient statistics. Then \(T_1 = f(T_2)\) and \(T_2 = g(T_1)\) for some \(f, g\), so they are unique up to a one-to-one correspondence. Note that what is important here is the information that is communicated by the statistics, not the actual values the statistics take on.

*Do minimal sufficient statistics exist?* At the level of this class, yes. In homework, we prove that likelihood ratios are universal minimal sufficient statistics. Rigorously, there is always a minimal sufficient \(\sigma\)-field.

If we drop all assumptions on \(P_\theta\), so that it is an arbitrary joint distribution, then the data itself is minimal sufficient. The data is always at least sufficient, so this is the worst case, so to speak.

The following theorem gives us a convenient sufficient condition for a sufficient statistic to be minimal:

Assume \(\mathcal{P} = \{ p_\theta \colon \theta \in \Theta \}\) is a family of densities w.r.t common \(\mu\), and assume \(T(X)\) is sufficient.

If \(T(x) = T(y)\) for all \(x, y\) such that \(p_\theta(x) \propto_\theta p_\theta(y)\), then \(T(X)\) is minimal sufficient.

*Proof.* Suppose \(S(X)\) is sufficient. Assume there does not exist \(f\) such that \(T = f(S)\). Then \(\exists x, y\) with \(S(x) = S(y) = s\) but \(T(x) \neq T(y)\).

Then \(p_\theta(x) = g_\theta(S(x)) h(x)\) \(\propto_\theta g_\theta(S(y)) h(y) = p_\theta(y)\), so \(T(x) = T(y)\) by assumption, which is a contradiction.

So \(T\) is minimal sufficient.

Let’s elaborate on what \(p_\theta(x) \propto_\theta p_\theta(y)\) means. Here, \(x\) and \(y\) are fixed. Then if we vary \(\theta\), we can think of \(p_\theta(x), p_\theta(y)\) as functions \(\Theta \to \mathbb{R}\) acting as \(\theta \mapsto p_\theta(x)\). Saying they are proportional instead of equal means that their shapes are the same up to some rescaling. Saying they are “proportional w.r.t. theta” \(\propto_\theta\) means that we look at them as functions of \(\theta\). More formally, \(p_\theta(x) \propto_\theta p_\theta(y)\) means there is a \(k(x, y)\) (dependent on \(x\) and \(y\) *but not* \(\theta\)) such that \(p_\theta(x) = k(x, y) p_\theta(y)\) for all \(\theta\).
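The proportionality condition can be probed numerically. The sketch below assumes the Gaussian location model \(\mathcal{N}(\theta, 1)\) for illustration, with hypothetical helper names and sample values. It measures how much \(\log p_\theta(x) - \log p_\theta(y)\) varies over a grid of \(\theta\): zero variation means \(p_\theta(x) \propto_\theta p_\theta(y)\).

```python
import math


def log_density(x, theta):
    """Log joint density of x under iid N(theta, 1)."""
    n = len(x)
    return -0.5 * sum((xi - theta) ** 2 for xi in x) - 0.5 * n * math.log(2 * math.pi)


def log_ratio_spread(x, y, thetas):
    """Range of log p_theta(x) - log p_theta(y) over thetas; 0 means proportional."""
    ratios = [log_density(x, th) - log_density(y, th) for th in thetas]
    return max(ratios) - min(ratios)


thetas = [-2.0, -0.5, 0.0, 1.0, 3.0]
x = [0.0, 1.0, 2.0]
y = [1.0, 1.0, 1.0]   # same sum as x
z = [0.0, 0.0, 0.0]   # different sum from x

spread_same_sum = log_ratio_spread(x, y, thetas)   # essentially zero
spread_diff_sum = log_ratio_spread(x, z, thetas)   # clearly nonzero
```

In this assumed model the log ratio is \(-\tfrac{1}{2}(\sum x_i^2 - \sum y_i^2) + \theta(\sum x_i - \sum y_i)\), so it is constant in \(\theta\) exactly when the sums agree, which is the hypothesis of the theorem with \(T(x) = \sum_i x_i\).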

# Minimal Sufficiency for Exponential Families

Consider an exponential family \(p_\theta(x) = e^{\eta(\theta)' T(x) - B(\theta)} h(x)\), with \(T(x) \in \mathbb{R}^s\). We are parameterizing the family here by \(\theta\) rather than the natural parameter \(\eta\) to see what happens if we restrict to some subset of the parameter space.

We know that \(T(X)\) is sufficient. A natural question to ask: is \(T(X)\) minimal? Let’s use the theorem we just proved.

Fix \(x, y\) such that \(p_\theta(x) \propto_\theta p_\theta(y)\). Then

\[ e^{\eta(\theta)' T(x)} e^{-B(\theta)} h(x) \propto_\theta e^{\eta(\theta)' T(y)} e^{-B(\theta)} h(y) \]

Cancelling the common factor \(e^{-B(\theta)}\) and absorbing \(h(x)\) and \(h(y)\), which do not depend on \(\theta\), into the proportionality constant:

\[ e^{\eta(\theta)' T(x)} \propto_\theta e^{\eta(\theta)' T(y)} \]

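Taking logs, \(\eta(\theta)' T(x) = \eta(\theta)' T(y) + C\) for all \(\theta\), where the constant \(C\) may depend on \(x\) and \(y\) but not on \(\theta\). Subtracting this identity at \(\theta_2\) from the same identity at \(\theta_1\) eliminates \(C\), so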

\[ [\eta(\theta_1) - \eta(\theta_2)]'[T(x) - T(y)] = 0 \quad \forall \theta_1, \theta_2 \]

Stating this in terms of vectors, we have

\[T(x) - T(y) \perp \text{span}\{ \eta(\theta_1) - \eta(\theta_2) \colon \theta_1, \theta_2 \in \Theta \}\]

If \(\text{span}\{ \eta(\theta_1) - \eta(\theta_2) \colon \theta_1, \theta_2 \in \Theta \} = \mathbb{R}^s\), then the only vector orthogonal to the span is 0, which forces \(T(x) = T(y)\), exactly the hypothesis of the theorem.

Therefore we come up with the following condition for minimality of the sufficient statistic for exponential families:

For an exponential family \(p_\theta(x) = e^{\eta(\theta)'T(x) - B(\theta)}h(x)\) with \(T(x) \in \mathbb{R}^s\), \(T(X)\) is minimal sufficient if

\[\text{span}\{ \eta(\theta_1) - \eta(\theta_2) \colon \theta_1, \theta_2 \in \Theta \} = \mathbb{R}^s\]

**Examples**

Let \(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \cos \theta \\ \sin \theta \end{pmatrix}, I_2\right)\).

We have that \(\eta(\theta) = (\cos \theta, \sin \theta)'\). The image of \(\eta\) is the unit circle.

Differences of points on the circle span the plane: two suitably chosen pairs \(\theta_1, \theta_2\) give chords \(\eta(\theta_1) - \eta(\theta_2)\) that form a basis for \(\mathbb{R}^2\).

Then \(T(x) = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\) is minimal sufficient.
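The span condition for this example can be verified with a short computation. The two pairs of angles below are an assumption chosen for convenience, not necessarily the pairs used in lecture, and the helper names are hypothetical. A nonzero determinant shows that the two difference vectors form a basis of \(\mathbb{R}^2\).

```python
import math


def eta(theta):
    """Natural parameter of the example: a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))


def difference(theta_1, theta_2):
    """The vector eta(theta_1) - eta(theta_2)."""
    a, b = eta(theta_1), eta(theta_2)
    return (a[0] - b[0], a[1] - b[1])


def determinant(u, v):
    """Determinant of the 2x2 matrix with columns u and v; nonzero iff they span R^2."""
    return u[0] * v[1] - u[1] * v[0]


u = difference(0.0, math.pi)                   # approximately (2, 0)
v = difference(math.pi / 2, 3 * math.pi / 2)   # approximately (0, 2)
det = determinant(u, v)                        # approximately 4
```

If instead the image of \(\eta\) lay on a single line, every difference would be parallel, the determinant would vanish for all pairs, and the condition would fail.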