################################
Validating the GLM against scipy
################################

.. nbplot::

    >>> import numpy as np
    >>> import numpy.linalg as npl
    >>> import matplotlib.pyplot as plt
    >>> # Print array values to 4 decimal places
    >>> np.set_printoptions(precision=4)

Make some random but predictable data with ``numpy.random``:

.. nbplot::

    >>> # Make random number generation predictable
    >>> np.random.seed(1966)
    >>> # Make a fake regressor and data.
    >>> n = 20
    >>> x = np.random.normal(10, 2, size=n)
    >>> y = np.random.normal(20, 1, size=n)
    >>> plt.plot(x, y, '+')
    [...]

Do a simple linear regression with the GLM:

.. math::

    \newcommand{\yvec}{\vec{y}}
    \newcommand{\xvec}{\vec{x}}
    \newcommand{\evec}{\vec{\varepsilon}}
    \newcommand{\Xmat}{\boldsymbol X}
    \newcommand{\bvec}{\vec{\beta}}
    \newcommand{\bhat}{\hat{\bvec}}
    \newcommand{\yhat}{\hat{\yvec}}
    \newcommand{\ehat}{\hat{\evec}}
    \newcommand{\cvec}{\vec{c}}
    \newcommand{\rank}{\textrm{rank}}

    y_i = c + b x_i + e_i \implies \\
    \yvec = \Xmat \bvec + \evec

.. nbplot::

    >>> X = np.ones((n, 2))
    >>> X[:, 1] = x
    >>> B = npl.pinv(X).dot(y)
    >>> B
    array([ 19.3567,   0.0723])
    >>> E = y - X.dot(B)

Build the t statistic:

.. math::

    \newcommand{\cvec}{\vec{c}}

    \hat{\sigma}^2 = \frac{1}{n - \rank(\Xmat)} \sum e_i^2 \\
    t = \frac{\cvec^T \bhat}
    {\sqrt{\hat{\sigma}^2 \cvec^T (\Xmat^T \Xmat)^+ \cvec}}

.. nbplot::

    >>> # Contrast vector selects slope parameter
    >>> c = np.array([0, 1])
    >>> df = n - npl.matrix_rank(X)
    >>> sigma_2 = np.sum(E ** 2) / df
    >>> c_b_cov = c.dot(npl.pinv(X.T.dot(X))).dot(c)
    >>> t = c.dot(B) / np.sqrt(sigma_2 * c_b_cov)
    >>> t
    0.82220...

Test the t statistic against a t distribution with ``df`` degrees of
freedom:

.. nbplot::

    >>> import scipy.stats
    >>> t_dist = scipy.stats.t(df=df)
    >>> p_value = 1 - t_dist.cdf(t)
    >>> # One-tailed t-test (t is positive)
    >>> p_value
    0.21085...
    >>> # Two-tailed p value is just 2 * one tailed value, because
    >>> # distribution is symmetric
    >>> 2 * p_value
    0.42171...

Now do the same test with ``scipy.stats.linregress``:

.. nbplot::

    >>> res = scipy.stats.linregress(x, y)
    >>> res.slope
    0.07227...
    >>> res.intercept
    19.35665...
    >>> # This is the same as the manual GLM fit
    >>> np.allclose(B, [res.intercept, res.slope])
    True
    >>> # p value is always two-tailed
    >>> res.pvalue
    0.42171...
    >>> np.allclose(p_value * 2, res.pvalue)
    True

Now do the same thing with the two-sample t-test.

.. nbplot::

    >>> X2 = np.zeros((n, 2))
    >>> X2[:10, 0] = 1
    >>> X2[10:, 1] = 1
    >>> X2
    array([[ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 1.,  0.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 0.,  1.]])
    >>> B2 = npl.pinv(X2).dot(y)
    >>> E2 = y - X2.dot(B2)
    >>> c2 = np.array([-1, 1])
    >>> df = n - npl.matrix_rank(X2)
    >>> sigma_2 = np.sum(E2 ** 2) / df
    >>> c_b_cov = c2.dot(npl.pinv(X2.T.dot(X2))).dot(c2)
    >>> t = c2.dot(B2) / np.sqrt(sigma_2 * c_b_cov)
    >>> t
    -0.30792...
    >>> t_dist = scipy.stats.t(df=df)
    >>> # One-tailed p value, for negative value
    >>> p_value_2 = t_dist.cdf(t)
    >>> p_value_2
    0.38083...
    >>> # Two-tailed p value
    >>> p_value_2 * 2
    0.76167...

The same thing using ``scipy.stats.ttest_ind`` for a t-test between two
independent samples:

.. nbplot::

    >>> scipy.stats.ttest_ind(y[:10], y[10:])
    Ttest_indResult(statistic=0.30792..., pvalue=0.76167...)

Note that ``ttest_ind`` returns the t statistic with the opposite sign: it
tests the first sample mean minus the second, while the contrast ``[-1, 1]``
estimates the second group mean minus the first. The two-tailed p value is
the same, because the t distribution is symmetric.
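
As a quick check of this sign convention (a small added sketch, not part of
the original walkthrough; it reuses the variables from the code above),
negating the contrast recovers the ``ttest_ind`` statistic:

.. nbplot::

    >>> # Added check: reversing the contrast direction matches ttest_ind.
    >>> result = scipy.stats.ttest_ind(y[:10], y[10:])
    >>> t_flipped = -c2.dot(B2) / np.sqrt(sigma_2 * c_b_cov)
    >>> np.allclose(t_flipped, result.statistic)
    True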
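
An equivalent, common shortcut for the two-tailed p value, shown here as an
extra aside: double the survival function (``sf``, which is 1 - cdf) of the
symmetric t distribution at the absolute t value:

.. nbplot::

    >>> # Added aside: 2 * sf(|t|) gives the two-tailed p value directly
    >>> # for a symmetric distribution.
    >>> 2 * t_dist.sf(np.abs(t))
    0.76167...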
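
Finally, one more added sanity check (an aside, assuming NumPy 1.14+ for the
``rcond=None`` keyword): the pseudoinverse estimate for the simple regression
at the top of the page matches NumPy's least-squares solver:

.. nbplot::

    >>> # Added check: the pinv-based fit equals the least-squares solution.
    >>> np.allclose(B, npl.lstsq(X, y, rcond=None)[0])
    True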