\(\newcommand{L}[1]{\| #1 \|}\newcommand{VL}[1]{\L{ \vec{#1} }}\newcommand{R}[1]{\operatorname{Re}\,(#1)}\newcommand{I}[1]{\operatorname{Im}\, (#1)}\)

A worked example of the general linear model

See: introduction to the general linear model.

Here we go through the matrix formulation of the general linear model with the simplest possible example – a t-test for the difference of the sample mean from zero.

Let us say we have the hypothesis that men are more psychopathic than women.

We find 10 male-female couples, and we give them each a psychopathy questionnaire. We score the questionnaires, and then, for each couple, we subtract the woman’s score from the man’s score, to get a difference score.

We have 10 difference scores:

>>> #: Our standard imports
>>> import numpy as np
>>> import numpy.linalg as npl
>>> # Only show 4 decimals when printing arrays
>>> np.set_printoptions(precision=4)
>>> differences = np.array([ 1.5993, -0.13  ,  2.3806,  1.3761, -1.3595,
...                          1.0286,  0.8466,  1.6669, -0.8241,  0.4469])

Our hypothesis is that the females in the couple have a lower psychopathy score. We therefore predict that, in the whole population of male-female couples, the difference measure will, on average, be positive (men have higher scores than women).

We test this hypothesis by testing whether the sample mean is far enough from zero to make it unlikely that the population mean is actually zero.

The standard error of the mean

One way to test this is to compare the mean value to the standard error of the mean.

The standard error of a statistic (such as a mean) is the standard deviation of the sampling distribution of that statistic. The sampling distribution is the distribution of values we expect for the statistic when taking a very large number of samples. For example, the sampling distribution of the mean is the expected distribution of the sample means generated by taking a very large number of samples.

As usual, define the mean of a vector of values \(\vec{x}\) as:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\]

Define the unbiased estimate of the standard deviation of \(\vec{x}\) as:

\[s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\]

Notice the \(\frac{1}{n-1}\). We have previously used \(n\) rather than \(n-1\) as the divisor for the sample standard deviation. With \(n\) as the divisor, the sample standard deviation is a biased estimate of the population standard deviation. Using \(n-1\) as the divisor removes this bias (strictly, it is the variance estimate \(s_x^2\) that becomes unbiased).
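
NumPy exposes this choice of divisor as the ddof (delta degrees of freedom) argument to np.std: ddof=0 (the default) divides by \(n\), and ddof=1 divides by \(n-1\), giving the unbiased version:

>>> np.std(differences)  # divisor n - the biased estimate
1.1167...
>>> np.std(differences, ddof=1)  # divisor n - 1 - the unbiased estimate
1.1771...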

The standard error of the mean is:

\[SE_{\bar{x}} = \frac{s_x}{\sqrt{n}}\]

where \(n\) is the number of samples, 10 in our case.

A t statistic is given by a value divided by the standard error of that value. The t statistic for \(\bar{x}\) is:

\[t = \frac{\bar{x}}{SE_{\bar{x}}}\]

In our case:

>>> x = differences
>>> n = len(x)
>>> x_bar = np.mean(x)
>>> x_bar
0.70314...
>>> unbiased_var_x = 1. / (n - 1) * np.sum((x - x_bar) ** 2)
>>> s_x = np.sqrt(unbiased_var_x)
>>> s_x
1.17718...
>>> SEM = s_x / np.sqrt(n)
>>> SEM
0.37225...
>>> t = x_bar / SEM
>>> t
1.88884...
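
As a cross-check: if you have Scipy installed, scipy.stats.ttest_1samp runs the same one-sample t-test of the differences against a population mean of zero:

>>> from scipy.stats import ttest_1samp
>>> t_value, p_value = ttest_1samp(differences, 0)
>>> t_value
1.88884...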

The standard error as an estimator

Imagine that we repeat the following procedure many times, collecting each value into a vector \(\vec{m}\):

  • take \(n\) samples from some distribution;
  • take the mean of these \(n\) samples.

The standard error of the mean is an estimate of the standard deviation of the values in \(\vec{m}\).

We can simulate this process when we know the exact distribution of the values in the population. For example, let us imagine that we know that the population distribution is a normal distribution with population mean 5 (\(\mu = 5\)) and population standard deviation 1.5 (\(\sigma = 1.5\)). We can use numpy.random to draw samples from this distribution:

>>> # 10 samples from normal distribution with mean 5 and sd 1.5
>>> sample = np.random.normal(5, 1.5, size=10)
>>> sample
array([ 5.7451,  4.7926,  5.9715,  7.2845,  4.6488,  4.6488,  7.3688,
        6.1512,  4.2958,  5.8138])

We know the exact mean (\(\mu\)) and standard deviation (\(\sigma\)). If we draw a near-infinite number of samples like this, then the standard deviation of the means of these samples will be \(\frac{\sigma}{\sqrt{n}}\):

>>> 1.5 / np.sqrt(n)
0.474341...

We can compare this number to an estimate of the same thing, calculated from the means of a very large number of samples of size 10:

>>> n_samples = 100000
>>> pop_mean = 5
>>> pop_sd = 1.5
>>> means_of_samples = np.zeros(n_samples)
>>> for i in range(n_samples):
...     sample = np.random.normal(pop_mean, pop_sd, size=n)
...     means_of_samples[i] = sample.mean()
>>> sq_deviations = (means_of_samples - pop_mean) ** 2
>>> # With lots of samples, this value is close to the exact number
>>> means_std_dev = np.sqrt(1. / n_samples * np.sum(sq_deviations))
>>> means_std_dev
0.474735...
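
As an aside, we can write the same simulation without the loop, by drawing all the samples at once into a two-dimensional array with one sample per row:

>>> # All the samples at once; each row is one sample of size n
>>> samples = np.random.normal(pop_mean, pop_sd, size=(n_samples, n))
>>> means_of_samples = samples.mean(axis=1)
>>> np.sqrt(np.mean((means_of_samples - pop_mean) ** 2))
0.47...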

Normally we do not know the exact standard deviation of the population, and the standard error of the mean is designed for exactly that situation. First we use the sample that we have to get an unbiased estimate of the population standard deviation:

>>> n = 10
>>> y = sample
>>> y_bar = y.mean()
>>> unbiased_var_estimate = 1. / (n - 1) * np.sum((y - y_bar) ** 2)
>>> unbiased_sd_est = np.sqrt(unbiased_var_estimate)

Of course, this estimate will not be exactly the same as the population standard deviation (\(\sigma = 1.5\)):

>>> unbiased_sd_est
1.433659...

We use this standard deviation estimate to estimate the standard error of the mean. The resulting estimate will be close to the standard deviation of means from many samples of size \(n\):

>>> # Standard deviation of means from population
>>> 1.5 / np.sqrt(n)
0.474341...
>>> # Our estimate for the standard error of mean
>>> unbiased_sd_est / np.sqrt(n)
0.453362...

Testing using the general linear model

It is overkill for us to use the general linear model for this, but it does show how the machinery works in the simplest case.

\(\newcommand{Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}\) \(\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}\)

The matrix expression of the general linear model is:

\[\yvec = \Xmat \bvec + \evec\]

Define our design matrix \(\Xmat\) to have a single column of ones:

>>> X = np.ones((n, 1))
>>> X
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
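
With this design, the model for observation \(i\) is just \(y_i = \beta + \varepsilon_i\). Written out in matrix form:

\[\begin{split}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \begin{bmatrix} \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}\end{split}\]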

\(\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}\)

\(\bhat\) is the least squares estimate for \(\bvec\), and is given by:

\[\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec\]

Because \(\Xmat\) is just a column of ones, \(\Xmat^T \yvec = \sum_i{y_i}\).

\(\Xmat^T \Xmat = n\), so \((\Xmat^T \Xmat)^{-1} = \frac{1}{n}\).

Thus:

\[\begin{split}\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \\ = \frac{1}{n} \sum_i{y_i} \\ = \bar{y}\end{split}\]
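
We can check this numerically. npl.pinv(X) computes \((\Xmat^T \Xmat)^{-1} \Xmat^T\) in one step, and the single element of the resulting \(\bhat\) is indeed the sample mean:

>>> y = differences
>>> X.T.dot(y)  # X.T y is the sum of y
array([ 7.0314])
>>> X.T.dot(X)  # X.T X is n
array([[ 10.]])
>>> B = npl.pinv(X).dot(y)  # the least squares estimate
>>> B[0]
0.70314...
>>> np.mean(y)
0.70314...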

Student’s t statistic from the general linear model is:

\[t = \frac{c^T \hat\beta}{\sqrt{\hat{\sigma}^2 c^T (\Xmat^T \Xmat)^+ c}}\]

where \(\hat{\sigma}^2\) is our estimate of variance in the residuals, \(c\) is a contrast vector to select some combination of the design columns, and \((\Xmat^T \Xmat)^+\) is the pseudoinverse of \(\Xmat^T \Xmat\).

In our case we have only one design column, so \(c = [1]\) and we can omit it. \(\hat{\sigma}^2 = s_x^2\) for \(s_x\) defined above. \(\Xmat^T \Xmat\) is invertible, and we know the inverse already: \(\frac{1}{n}\). Therefore:

\[\begin{split}t = \frac{\bar{y}}{s_x \sqrt{\frac{1}{n}}} \\ = \bar{y} \Big/ \frac{s_x}{\sqrt{n}} \\ = \frac{\bar{y}}{SE_{\bar{y}}}\end{split}\]
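
Here is a sketch of that full GLM t calculation in code, confirming that it gives the same t value as the direct calculation above (the variable names are our own):

>>> fitted = X.dot(B)
>>> residuals = y - fitted
>>> df = n - 1  # n observations minus 1 independent column in the design
>>> sigma_2_hat = np.sum(residuals ** 2) / df
>>> c = np.array([1])
>>> t_again = c.dot(B) / np.sqrt(sigma_2_hat * c.dot(npl.pinv(X.T.dot(X))).dot(c))
>>> t_again
1.88884...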