\(\newcommand{L}[1]{\| #1 \|}\newcommand{VL}[1]{\L{ \vec{#1} }}\newcommand{R}[1]{\operatorname{Re}\,(#1)}\newcommand{I}[1]{\operatorname{Im}\, (#1)}\)

Hypothesis tesing with the general linear model¶

General linear model reprise¶

This page starts at the same place as introduction to the general linear model.

>>> # Import numerical and plotting libraries
>>> import numpy as np
>>> import numpy.linalg as npl
>>> import matplotlib.pyplot as plt
>>> # Only show 6 decimals when printing
>>> np.set_printoptions(precision=6)

In that page, we had questionnaire measures of psychopathy from 12 students:

>>> psychopathy = np.array([11.416,   4.514,  12.204,  14.835,
...                          8.416,   6.563,  17.343, 13.02,
...                          15.19 ,  11.902,  22.721,  22.324])

We also had skin-conductance scores from the palms of the each of the same 12 students, to get a measure of how sweaty they are:

>>> clammy = np.array([0.389,  0.2  ,  0.241,  0.463,
...                    4.585,  1.097,  1.642,  4.972,
...                    7.957,  5.585,  5.527,  6.964])

We believe that the clammy score has some straight-line relationship to the psychopathy scores. \(n\) is the number of elements in psychopathy and clammy: \(n = 12\). Call the 12 values for psychopathy \(\vec{y} = [y_1, y_2, .... , y_n]\). The 12 values for clammy are \(\vec{x} = [x_1, x_2, ... , x_n]\). Our straight line model is:

\[ \begin{align}\begin{aligned}\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}} \newcommand{Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}} \newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}} \newcommand{\ehat}{\hat{\evec}} \newcommand{\cvec}{\vec{c}} \newcommand{\rank}{\textrm{rank}}\\y_i = c + b x_i + e_i\end{aligned}\end{align} \]

where \(c\) is the intercept, \(b\) is the slope, and \(e_i\) is the remainder of \(y_i\) after subtracting \(c + b x_i\).

We then defined a new vector \(\evec = [e_1, e_2, ... e_n]\) for remaining error, and rewrote the same formula in vector notation:

\[\yvec = c + b \xvec + \evec\]

We defined a new \(n=12\) element vector \(\vec{1}\) containing all ones, and used this to build a two-column design matrix \(\Xmat\), with first column \(\vec{1}\) and second column \(\vec{x}\). This allowed us to rewrite the vector formulation as a matrix multiplication and addition:

\[\yvec = \Xmat \bvec + \evec\]

where \(\bvec\) is:

\[\begin{split}\left[ \begin{array}{\bvec} c \\ b \\ \end{array} \right]\end{split}\]

Using the matrix formulation of the general linear model, we found the least squares estimate for \(\bvec\) is:

\[\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec\]

The formula above applies when \(\Xmat^T \Xmat\) is invertible. Generalizing to the case where \(\Xmat^T \Xmat\) is not invertible, the least squares estimate is:

\[\bhat = \Xmat^+ \yvec\]

where \(\Xmat^+\) is the Moore-Penrose pseudoinverse of \(\Xmat\).

The ^ on \(\bhat\) reminds us that this is an estimate of \(\bvec\). We derived this \(\bhat\) estimate from our sample, hoping that it will be a reasonable estimate for the \(\bvec\) that applies to the whole population.

The residual error¶

\(\bhat\) gives us a corresponding estimate of \(\evec\):

\[\ehat = \yvec - \Xmat \bhat\]

The least squares criterion that we used to derive \(\bhat\) specifies that \(\bhat\) is the vector giving us the smallest sum of squares of \(\ehat\). We can write that criterion for \(\bhat\) like this:

\[\bhat = \textrm{argmin}_{\bvec} \sum_{i=1}^n e_i^2\]

Read this as “\(\bhat\) is the value of the vector \(\bvec\) that gives the minimum value for the sum of the squared residual errors”.

From now on, we will abbreviate \(\sum_{i=1}^n e_i^2\) as \(\sum e_i^2\), assuming it is the sum over all elements index \(1 .. n\).

Remembering the definition of the dot product, we can also write \(\sum e_i^2\) as the dot product of \(\ehat\) with itself:

\[\sum e_i^2 \equiv \ehat \cdot \ehat\]

Read \(\equiv\) as “equivalent to”. We can also express \(\sum e_i^2\) as the matrix multiplication of \(\ehat\) as a row vector with \(\ehat\) as a column vector. Because we assume that vectors are column vectors in matrix operations, we can write that formulation as:

\[\sum e_i^2 \equiv \ehat^T \ehat\]

Unbiased estimate of population variance¶

We will soon need an unbiased estimate of the population variance. The population variance is \(\frac{1}{N} \sum e_i^2\) where the population has \(N\) elements, and \(e_1, e_2, ... e_N\) are the remaining errors for all \(N\) observations in the population.

However, we do not have all \(N\) observations in the population, we only have a \(n\)-size sample from the population. In our particular case \(n=12\).

We could use the sample variance as this estimate: \(\frac{1}{n} \sum e_i\). Unfortunately, for reasons we don’t have space to go into, this is a biased estimate of the population variance.

To get an unbiased estimate of the variance, we need to allow for the number of independent columns in the design \(\Xmat\). The number of independent columns in the design is given by the matrix rank of \(\Xmat\). Specifically, if \(\rank(\Xmat)\) is the matrix rank of \(\Xmat\), an unbiased estimate of population variance is given by:

\[\hat\sigma^2 = \frac{1}{n - \rank(\Xmat)} \sum e_i^2\]

For example, we saw in the worked example of GLM, that when we have a single regressor, and \(\rank(\Xmat) = 1\), we divide the sum of squares of the residuals by \(n - 1\) where \(n\) is the number of rows in the design. This \(n-1\) divisor is Bessel’s correction.

We will also use these terms below:

\(\rank(\Xmat)\): degrees of freedom of the design;
\(n - \rank(\Xmat)\): degrees of freedom of the error.

Hypothesis testing¶

We used contrast vectors to form particular linear combinations of the parameter estimates in \(\bhat\). For example, we used the contrast vector \(\cvec = [0, 1]\) to select the estimate for \(b\) – the slope of the line:

\[b = [0, 1] \bhat\]

t tests using contrast vectors¶

The formula for a t statistic test on any linear combination of the parameters in \(\bhat\) is:

\[\newcommand{\cvec}{\vec{c}} t = \frac{\cvec^T \bhat} {\sqrt{\hat{\sigma}^2 \cvec^T (\Xmat^T \Xmat)^+ \cvec}}\]

where \(\hat{\sigma^2}\) is our unbiased estimate of the population variance.

Here is the t statistic calculation in Python:

>>> # Data vector
>>> y = psychopathy
>>> # Covariate vector
>>> x = clammy
>>> # Contrast vector as column vector
>>> c = np.array([0, 1]).reshape((2, 1))
>>> n = len(y)
>>> # Design matrix
>>> X = np.ones((n, 2))
>>> X[:, 1] = x
>>> # X.T X is invertible
>>> iXtX = npl.inv(X.T.dot(X))
>>> # Least-squares estimate of B
>>> B = iXtX.dot(X.T).dot(y)
>>> e = y - X.dot(B)
>>> # Degrees of freedom of design
>>> rank_x = npl.matrix_rank(X)
>>> # The two columns are not colinear, so rank is 2
>>> rank_x
2
>>> # Unbiased estimate of population variance
>>> df_error = n - rank_x
>>> s2_hat = e.dot(e) / df_error
>>> t = c.T.dot(B) / np.sqrt(s2_hat * c.T.dot(iXtX).dot(c))
>>> t
array([[ 1.914389]])

F tests¶

F tests are another way to test hypotheses about the linear models. They are particularly useful for testing whether there is a significant reduction in the residual error when adding one or more regressors.

The simplest and generally most useful way of thinking of F test is as a test comparing two models: a full model and a reduced model. The full model contains the regressors that we want to test. We will use \(\Xmat_f\) for the full model. The reduced model is a model that does not contain the regressors we want to test, but does contain all other regressors in the full model [1]. We will use \(\Xmat_r\) for the reduced model.

In our case, \(\Xmat_f\) is the model containing the clammy regressor, as well as the column of ones that models the intercept.

\(\Xmat_r\) is our original model, that only contains the column of ones.

If the full model is a better fit to the data than the reduced model, then adding the new regressor(s) will cause a convincing drop in the size of residuals.

The F test is a measure that reflects the drop in the magnitude of squared residuals as a result of adding the new regressors.

Now we define the \(SSR(\Xmat_r)\) and \(SSR(\Xmat_f)\). These are the Sums of Squares of the Residuals of the reduced and full model respectively.

\[ \begin{align}\begin{aligned}\begin{split}\bhat_r = \Xmat_r^+ \yvec \\ \hat\evec_r = \yvec - \Xmat_r \bhat_r \\ SSR(\Xmat_r) = \hat\evec_r^T \hat\evec_r \\\end{split}\\\begin{split}\bhat_f = \Xmat_f^+ \yvec \\ \hat\evec_f = \yvec - \Xmat_f \bhat_f \\ SSR(\Xmat_f) = \hat\evec_f^T \hat\evec_f\end{split}\end{aligned}\end{align} \]

\(ESS = SSR(\Xmat_r) - SSR(\Xmat_f)\) is the Extra Sum of Squared residuals explained by the full compared to the reduced model. The top half of the ratio that forms the F statistic is \(ESS / \nu_1\), where \(\nu_1\) is the number of extra independent regressors (columns) in \(\Xmat_f\) compared to \(\Xmat_r\). Specifically:

\[\nu_1 = \rank(\Xmat_f) - \rank(\Xmat_r)\]

The bottom half of the F statistic is the estimated variance \(\hat{\sigma^2}\) from the full model. This can also be written as \(SSR(\Xmat_f) / \nu_2\) where \(\nu_2\) is the degrees of freedom of the error:

\[\begin{split}\begin{eqnarray} F_{\nu_1, \nu_2} & = & \frac{ (\hat\evec_r^T \hat\evec_r - \hat\evec_f^T \hat\evec_f) / \nu_{1} } {\hat\evec_f^T \hat\evec_f / \nu_{2}} \\ & = & \frac{ (\textrm{SSR}(\Xmat_r) - \textrm{SSR}(\Xmat_f)) / \nu_1} {\textrm{SSR}(\Xmat_f) / \nu_2} \end{eqnarray}\end{split}\]

Here is the F-statistic calculation in Python:

>>> # We already have X, e, rank_x, for the full model, from
>>> # the t calculation
>>> X_f, e_f, rank_f = X, e, rank_x
>>> # Now calculate the same for the reduced model
>>> X_r = np.ones((n, 1))
>>> iXtX_r = npl.inv(X_r.T.dot(X_r))
>>> B_r = iXtX_r.dot(X_r.T).dot(y)
>>> e_r = y - X_r.dot(B_r)
>>> rank_r = npl.matrix_rank(X_r)  # One column, rank 1
>>> rank_r
1
>>> # Calculate the F statistic
>>> SSR_f = e_f.dot(e_f)
>>> SSR_r = e_r.dot(e_r)
>>> nu_1 = rank_f - rank_r
>>> F = ((SSR_r - SSR_f) / nu_1) / (SSR_f / (n - rank_f))
>>> F
3.66488...

For reasons that we haven’t explained here, the F statistic for a single column is the square of the t statistic testing the same column:

>>> t ** 2
array([[ 3.664886]])

Footnotes

[1]

Actually, the full model need not contain exactly the same regressors as the reduced model, but it must be able to model all the data vectors that the reduced design can, once matrix multiplied by a suitable parameter vector \(\bvec\). For example, consider the following two designs:

\[\begin{split}\boldsymbol X_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ \end{bmatrix} \\ \boldsymbol X_2 = \begin{bmatrix} 2 & 1 \\ 1 & 2 \\ 0 & 0 \\ \end{bmatrix}\end{split}\]

These designs do not contain the same regressors, but, with suitable \(\bvec\) vectors, they can both give an exact fit to any data vector of form:

\[\begin{split}y = \begin{bmatrix} p \\ q \\ 0 \\ \end{bmatrix}\end{split}\]

In fact, these are the only vectors they can fit. So, although the two designs \(\boldsymbol X_1\) and \(\boldsymbol X_2\) do not have the same regressors, they do model the same effects – meaning, they can fit the same range of data vectors.