############################################
A worked example of the general linear model
############################################

See: `introduction to the general linear model`_.

Here we go through the matrix formulation of the general linear model with
the simplest possible example |--| a t-test for the difference of the sample
mean from zero.

Let us say we have the hypothesis that men are more psychopathic than women.
We find 10 male-female couples, and we give them each a psychopathy
questionnaire. We score the questionnaires, and then, for each couple, we
subtract the woman's score from the man's score, to get a difference score.

We have 10 difference scores:

.. nbplot::

    >>> #: Our standard imports
    >>> import numpy as np
    >>> import numpy.linalg as npl
    >>> # Only show 4 decimals when printing arrays
    >>> np.set_printoptions(precision=4)

.. nbplot::

    >>> differences = np.array([ 1.5993, -0.13 , 2.3806, 1.3761, -1.3595,
    ...     1.0286, 0.8466, 1.6669, -0.8241, 0.4469])

Our hypothesis is that the woman in each couple will have a lower psychopathy
score than the man. We therefore predict that, in the whole population of
male-female couples, the difference measure will, on average, be positive
(men have higher scores than women).

We test this hypothesis by asking whether the sample mean is far enough from
zero to make it unlikely that the population mean is actually zero.

******************************
The standard error of the mean
******************************

One way to test this is to compare the mean value to the `standard error of
the mean`_. The `standard error`_ of a statistic (such as a mean) is the
standard deviation of the sampling distribution of the statistic. The
sampling distribution of a statistic is the distribution of values we expect
for that statistic, over a very large number of samples. For example, the
sampling distribution of the mean is the expected distribution of sample
means generated by taking a very large number of samples.

As usual, define the *mean* of a vector of values $\vec{x}$ as:

.. math::

    \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i

Define the *unbiased estimate of the standard deviation* of $\vec{x}$ as:

.. math::

    s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}

Notice the $\frac{1}{n-1}$. We have previously used $n$ rather than $n-1$ as
the divisor for the *sample standard deviation*. Using $n$, our *sample
standard deviation* is a biased estimate of the population standard
deviation; using $n-1$ as the divisor gives an unbiased estimate.

The standard error of the mean is:

.. math::

    SE_{\bar{x}} = \frac{s_x}{\sqrt{n}}

where $n$ is the number of samples, 10 in our case.

A t statistic is given by a value divided by the *standard error* of that
value. The t statistic for $\bar{x}$ is:

.. math::

    t = \frac{\bar{x}}{SE_{\bar{x}}}

In our case:

.. nbplot::

    >>> x = differences
    >>> n = len(x)
    >>> x_bar = np.mean(x)
    >>> x_bar
    0.70314...
    >>> unbiased_var_x = 1. / (n - 1) * np.sum((x - x_bar) ** 2)
    >>> s_x = np.sqrt(unbiased_var_x)
    >>> s_x
    1.17718...
    >>> SEM = s_x / np.sqrt(n)
    >>> SEM
    0.37225...
    >>> t = x_bar / SEM
    >>> t
    1.88884...
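As an aside, we can cross-check these calculations against standard library
routines. This is a sketch, not part of the main thread; it assumes ``scipy``
is installed, which nothing else on this page needs:

.. nbplot::

    >>> # ddof=1 gives the n - 1 divisor, matching s_x above
    >>> np.std(x, ddof=1)
    1.17718...
    >>> # Scipy's one-sample t-test of x against a population mean of zero
    >>> from scipy.stats import ttest_1samp
    >>> result = ttest_1samp(x, 0)
    >>> result.statistic
    1.88884...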
**********************************
The standard error as an estimator
**********************************

Imagine that we repeat the following procedure many times, collecting each
value into a vector $\vec{m}$:

* take $n$ samples from some distribution;
* take the mean of these $n$ samples.

The standard error of the mean is an estimate of the standard deviation of
the values in $\vec{m}$.

We can simulate this process when we know the exact distribution of the
values in the population. For example, let us imagine that we know that the
population distribution is a normal distribution with population mean 5
($\mu = 5$) and population standard deviation 1.5 ($\sigma = 1.5$). We can
use ``numpy.random`` to draw samples from this distribution:

.. nbplot::
    :include-source: false

    >>> np.random.seed(42)

.. nbplot::

    >>> # 10 samples from normal distribution with mean 5 and sd 1.5
    >>> sample = np.random.normal(5, 1.5, size=10)
    >>> sample
    array([ 5.7451,  4.7926,  5.9715,  7.2845,  4.6488,  4.6488,  7.3688,
            6.1512,  4.2958,  5.8138])

We know the exact mean ($\mu$) and standard deviation ($\sigma$). If we drew
a near-infinite number of samples like this, the standard deviation of the
means of these samples would be $\frac{\sigma}{\sqrt{n}}$:

.. nbplot::

    >>> 1.5 / np.sqrt(n)
    0.474341...

We can compare this number to an estimate of the same thing, from a very
large number of means of samples of size 10:

.. nbplot::

    >>> n_samples = 100000
    >>> pop_mean = 5
    >>> pop_sd = 1.5
    >>> means_of_samples = np.zeros(n_samples)
    >>> for i in range(n_samples):
    ...     sample = np.random.normal(pop_mean, pop_sd, size=n)
    ...     means_of_samples[i] = sample.mean()
    >>> sq_deviations = (means_of_samples - pop_mean) ** 2
    >>> # With lots of samples, this value is close to the exact number
    >>> means_std_dev = np.sqrt(1. / n_samples * np.sum(sq_deviations))
    >>> means_std_dev
    0.474735...

Normally we do not know the exact standard deviation of the population; the
standard error of the mean is an estimate for that situation. First we use
the sample that we have to get an *unbiased estimate* of the population
standard deviation:

.. nbplot::

    >>> n = 10
    >>> y = sample
    >>> y_bar = y.mean()
    >>> unbiased_var_estimate = 1. / (n - 1) * np.sum((y - y_bar) ** 2)
    >>> unbiased_sd_est = np.sqrt(unbiased_var_estimate)

Of course, this estimate will not be exactly the same as the population
standard deviation ($\sigma = 1.5$):

.. nbplot::

    >>> unbiased_sd_est
    1.433659...

We use the unbiased standard deviation estimate to give an unbiased estimate
for the standard error of the mean. This estimate will be close to the
standard deviation of means from many samples of size $n$:

.. nbplot::

    >>> # Standard deviation of means from population
    >>> 1.5 / np.sqrt(n)
    0.474341...
    >>> # Our estimate for the standard error of the mean
    >>> unbiased_sd_est / np.sqrt(n)
    0.453362...

**************************************
Testing using the general linear model
**************************************

It is overkill for us to use the general linear model for this problem, but
doing so shows how the machinery works in the simplest case.

:math:`\newcommand{\Xmat}{\boldsymbol X} \newcommand{\bvec}{\vec{\beta}}`

:math:`\newcommand{\yvec}{\vec{y}} \newcommand{\xvec}{\vec{x}} \newcommand{\evec}{\vec{\varepsilon}}`

The matrix expression of the general linear model is:

.. math::

    \yvec = \Xmat \bvec + \evec

Define our design matrix $\Xmat$ to have a single column of ones:

.. nbplot::

    >>> X = np.ones((n, 1))
    >>> X
    array([[ 1.],
           [ 1.],
           [ 1.],
           [ 1.],
           [ 1.],
           [ 1.],
           [ 1.],
           [ 1.],
           [ 1.],
           [ 1.]])

:math:`\newcommand{\bhat}{\hat{\bvec}} \newcommand{\yhat}{\hat{\yvec}}`

$\bhat$ is the least squares estimate for $\bvec$, and is given by:

.. math::

    \bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec

Because $\Xmat$ is just a column of ones, $\Xmat^T \yvec = \sum_i{y_i}$.
$\Xmat^T \Xmat = n$, so $(\Xmat^T \Xmat)^{-1} = \frac{1}{n}$. Thus:

.. math::

    \bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec \\
    = \frac{1}{n} \sum_i{y_i} \\
    = \bar{y}
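We can check this numerically. The sketch below uses ``npl.pinv`` (imported
at the top of the page) to calculate $(\Xmat^T \Xmat)^{-1} \Xmat^T$ in one
step; the variable names are ours:

.. nbplot::

    >>> y = differences
    >>> # X transpose y is the sum of the y values
    >>> X.T @ y
    array([ 7.0314])
    >>> # The estimate B is therefore the mean of the y values
    >>> B = npl.pinv(X) @ y
    >>> B
    array([ 0.7031])
    >>> y.mean()
    0.70314...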
The Student's t statistic from the general linear model is:

.. math::

    t = \frac{c^T \bhat}{\sqrt{\hat{\sigma}^2 c^T (\Xmat^T \Xmat)^+ c}}

where $\hat{\sigma}^2$ is our estimate of the variance of the residuals, $c$
is a contrast vector to select some combination of the design columns, and
$(\Xmat^T \Xmat)^+$ is the *pseudoinverse* of $\Xmat^T \Xmat$.

In our case we have only one design column, so $c = [1]$ and we can omit it.
$\hat{\sigma}^2 = s_x^2$ for $s_x$ as defined above. $\Xmat^T \Xmat$ is
invertible, and we know the inverse already: $\frac{1}{n}$. Therefore:

.. math::

    t = \frac{\bar{y}}{s_x \sqrt{\frac{1}{n}}} \\
    = \bar{y} \Big/ \frac{s_x}{\sqrt{n}} \\
    = \frac{\bar{y}}{SE_{\bar{y}}}
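Finally, here is the whole t calculation from the general linear model,
following the formula above. This is a sketch with our own variable names,
using only the imports from the top of the page:

.. nbplot::

    >>> # Fitted values and residuals
    >>> fitted = X @ B
    >>> residuals = y - fitted
    >>> # Degrees of freedom: n observations minus 1 design column
    >>> df = n - X.shape[1]
    >>> # Unbiased estimate of residual variance; here equal to s_x ** 2
    >>> sigma_2 = np.sum(residuals ** 2) / df
    >>> # Contrast vector selecting our single design column
    >>> c = np.array([1])
    >>> t = c @ B / np.sqrt(sigma_2 * c @ npl.pinv(X.T @ X) @ c)
    >>> t
    1.88884...

This is the same t value that we calculated directly from the standard error
of the mean.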