{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipy as sp\n", "import scipy.special\n", "import scipy.stats\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simply put, the **Analysis of Variance** (ANOVA) is a technique for comparing the means of multiple $(\\geq 3)$ populations; the name derives from the way the calculations are performed. \n", "\n", "For example, the typical hypotheses of ANOVA are \n", "$$\n", "H_0:\\quad \\mu_1=\\mu_2=\\mu_3=\\cdots=\\mu_k\\\\\n", "H_1:\\quad \\text{At least two means differ}\n", "$$\n", "\n", "The first question that comes to mind: why not use the same old $t$-tests, such that\n", "$$\n", "H_0: \\mu_1=\\mu_2 \\qquad H_0: \\mu_1=\\mu_3 \\qquad H_0: \\mu_1=\\mu_4 \\qquad H_0:\\quad \\mu_2=\\mu_3\\quad\\cdots\\\\\n", "H_1: \\mu_1\\neq\\mu_2 \\qquad H_1: \\mu_1\\neq\\mu_3\\qquad H_1:\\mu_1\\neq\\mu_4 \\qquad H_1:\\quad \\mu_2\\neq\\mu_3\\quad\\cdots\\\\\n", "$$\n", "and so on, until all pairwise combinations are exhausted?\n", "\n", "The number of $t$-tests would be as large as ${k \\choose 2}$, where $k$ is the number of populations. If there are $5$ populations, then we have to test ${5 \\choose 2}=10$ pairs. With a $95\\%$ confidence level for each test, and treating the tests as independent, $10$ $t$-tests would cut the overall confidence level dramatically to $0.95^{10}\\approx 59.9\\%$, which means the probability of at least one _type I_ error would be around $40\\%$.\n", "\n", "A side note: in econometrics, an ANOVA table is standard output that most statistical packages print automatically. The terminology used in statistics may look peculiar to econometric practitioners, but it is semantically sensible, as later discussion will clarify." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# One-Way Analysis of Variance " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If samples are drawn independently from the populations, the technique is called **One-Way ANOVA**. The statistic that measures the proximity of the sample means to each other is called the **sum of squares for treatments** (SST). The term _treatment_ was introduced in the 1920s, when ANOVA was applied to different fertiliser treatments to test potential yields. For instance, an agronomist can sample corn yields from three plots, each treated with a different fertiliser. \n", "\n", "The sum of squares for treatments (SST) represents the **between-treatment variation**; its mathematical form is\n", "$$\n", "SST=\\sum_{i=1}^kn_i(\\bar{x}_i-\\bar{\\bar{x}})^2\n", "$$\n", "where $n_i$ is the sample size of treatment $i$, $\\bar{x}_i$ is its sample mean, $k$ is the number of treatments, and $\\bar{\\bar{x}}$ is the grand mean, i.e. the mean of all observations, which coincides with the _mean of the sample means_ when all sample sizes are equal.\n", "\n", "There is also **within-treatment variation**, denoted by the **sum of squares for error** (SSE), which measures the deviation of each observation from its own sample mean. \n", "$$\n", "SSE=\\sum_{i=1}^k\\sum_{j=1}^{n_i}(x_{ij}-\\bar{x}_i)^2=\\sum_{j=1}^{n_1}(x_{1j}-\\bar{x}_1)^2+\\sum_{j=1}^{n_2}(x_{2j}-\\bar{x}_2)^2+\\cdots+\\sum_{j=1}^{n_k}(x_{kj}-\\bar{x}_k)^2\n", "$$\n", "Because each inner sum equals $(n_i-1)s_i^2$, where $n_i-1$ is the degrees of freedom of sample $i$, $SSE$ can be rewritten more concisely as\n", "$$\n", "SSE =(n_1-1)s_1^2+(n_2-1)s_2^2+\\cdots+(n_k-1)s_k^2\n", "$$\n", "where $s_i^2$ is the sample variance of sample $i$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To construct the $F$-statistic, we need two more statistics. The first is the **Mean Square for Treatments** (MST)\n", "$$\n", "MST=\\frac{SST}{k-1}\n", "$$\n", "and the second is the **Mean Square for Error** (MSE)\n", "$$\n", "MSE=\\frac{SSE}{n-k}\n", "$$\n", "where $n=n_1+n_2+\\cdots+n_k$ is the total number of observations. Their ratio is the $F$-statistic\n", "$$\n", "F=\\frac{MST}{MSE}\n", "$$\n", "\n", "There are three assumptions for an ANOVA test to be valid.\n", "
\n", "1. The samples are independent of each other.\n", "2. Each sample is drawn from a normally distributed population.\n", "3. The populations are homoskedastic, i.e. they share the same variance.\n", "\n", "
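The statistics defined above can be sketched in a few lines. This is only a minimal illustration on made-up data (the sample sizes, means, standard deviation and seed are assumptions, not part of the height example): it computes SST, SSE, MST, MSE and F by hand, cross-checks the result against `scipy.stats.f_oneway`, and runs Levene's test as one possible check of the equal-variance assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three treatments with unequal sample sizes.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=mu, scale=2.0, size=size)
           for mu, size in [(10.0, 30), (11.0, 40), (12.5, 35)]]

k = len(samples)                       # number of treatments
n = sum(len(s) for s in samples)       # total number of observations
grand_mean = np.concatenate(samples).mean()

# Between-treatment and within-treatment sums of squares.
SST = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
SSE = sum((len(s) - 1) * s.var(ddof=1) for s in samples)

MST = SST / (k - 1)
MSE = SSE / (n - k)
F = MST / MSE
p_value = stats.f.sf(F, k - 1, n - k)  # upper-tail probability

# Cross-check against SciPy's built-in one-way ANOVA.
F_ref, p_ref = stats.f_oneway(*samples)

# Levene's test: a small p-value casts doubt on equal variances.
levene_stat, levene_p = stats.levene(*samples)

print('F = {:.4f}, p = {:.3g}'.format(F, p_value))
print('Levene p = {:.3f}'.format(levene_p))
```

The grand mean here is the mean of all observations, so the sketch stays correct when the sample sizes differ.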
\n", "In practice you will frequently encounter violations of these assumptions; when you do, be fully aware that the test results can be misleading." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If $SST$ is large, then so is $MST$, which tends to produce a larger $F$-statistic and hence a higher probability of rejecting the null hypothesis. The rejection rule is\n", "$$\n", "F>F_{\\alpha, k-1, n-k}\n", "$$\n", "\n", "\n", "The critical value $F_{\\alpha, k-1, n-k}$ is returned by ```scipy.stats.f.ppf()```. For instance, if the number of treatments is $4$ and the sum of sample sizes is $332$, so that $n-k=328$, the critical value at the $95\\%$ confidence level is " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.6321415117354894" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sp.stats.f.ppf(.95, 3, 328)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The well-known variance decomposition equation is \n", "$$\n", "SS(Total)=SST+SSE\n", "$$\n", "Mathematically\n", "$$\n", "\\sum_{i=1}^k \\sum_{j=1}^{n_i}(x_{ij}-\\bar{\\bar{x}})^2= \\sum_{i=1}^kn_i(\\bar{x}_i-\\bar{\\bar{x}})^2+(n_1-1)s_1^2+(n_2-1)s_2^2+\\cdots+(n_k-1)s_k^2\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An Example of Population Height " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a simple example: we use three samples of male heights to perform an ANOVA. The hypotheses are\n", "$$\n", "H_0: \\mu_1=\\mu_2=\\mu_3\\\\\n", "H_1: \\text{At least two means differ}\\\\\n", "$$" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Japanese</th><th>Dutch</th><th>Danish</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>161.783130</td><td>187.726286</td><td>174.746213</td></tr>\n",
"    <tr><th>1</th><td>145.329934</td><td>179.338741</td><td>174.133579</td></tr>\n",
"    <tr><th>2</th><td>174.569597</td><td>176.566656</td><td>178.966745</td></tr>\n",
"    <tr><th>3</th><td>160.003162</td><td>184.570245</td><td>179.335222</td></tr>\n",
"    <tr><th>4</th><td>162.242898</td><td>184.056181</td><td>167.497992</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " Japanese Dutch Danish\n", "0 161.783130 187.726286 174.746213\n", "1 145.329934 179.338741 174.133579\n", "2 174.569597 176.566656 178.966745\n", "3 160.003162 184.570245 179.335222\n", "4 162.242898 184.056181 167.497992" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel('dataset/height_anova2.xlsx')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the ANOVA formulae typed out verbatim." ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F-statistic: 37.96057544749984\n", "p-value: 2.3363533330211794e-12\n" ] } ], "source": [ "dutch_mean = df['Dutch'].mean()\n", "japanese_mean = df['Japanese'].mean()\n", "danish_mean = df['Danish'].mean()\n", "# Mean of the sample means; equals the overall mean only when the sample sizes are equal\n", "grand_mean = (dutch_mean+japanese_mean+danish_mean)/3\n", "\n", "# Between-treatment variation, with k - 1 = 2 degrees of freedom\n", "SST = len(df['Japanese'])*(japanese_mean-grand_mean)**2\\\n", " +len(df['Dutch'])*(dutch_mean-grand_mean)**2\\\n", " +len(df['Danish'])*(danish_mean-grand_mean)**2\n", "MST = SST/2\n", "\n", "# Within-treatment variation, with n - k degrees of freedom\n", "SSE = (len(df['Japanese'])-1)*df['Japanese'].var(ddof=1)\\\n", " +(len(df['Dutch'])-1)*df['Dutch'].var(ddof=1)\\\n", " +(len(df['Danish'])-1)*df['Danish'].var(ddof=1)\n", "n = len(df['Japanese']) + len(df['Dutch']) + len(df['Danish'])\n", "k = 3\n", "MSE = SSE/(n-k)\n", "\n", "F = MST/MSE\n", "print('F-statistic: {}'.format(F))\n", "# p-value is the upper-tail probability of the F distribution\n", "print('p-value: {}'.format(1 - sp.stats.f.cdf(F, 2, n-k)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The test result overwhelmingly favours the alternative hypothesis. \n", "\n", "Before we close the case, let's examine the sample variances." 
] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Japanese sample variance: 145.93369676074872\n", "Danish sample variance: 20.108195066929262\n", "Dutch sample variance: 15.844843124694973\n" ] } ], "source": [ "print('Japanese sample variance: {}'.format(df['Japanese'].var(ddof=1)))\n", "print('Danish sample variance: {}'.format(df['Danish'].var(ddof=1)))\n", "print('Dutch sample variance: {}'.format(df['Dutch'].var(ddof=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variances clearly violate one of the assumptions of ANOVA (homoskedasticity), so caution must be taken when interpreting the results, even though we firmly know that the mean heights in these three countries differ. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Simulation View of Factors That Affect the $F$-Statistic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than merely stating that $F$-tests might be invalid due to violation of critical assumptions, we go a step further and run simulations to show how various parameters affect the $F$-statistic.\n", "\n", "The plotting code is messy and therefore hidden in ```plot_material.anova_plot()```. There are 9 groups of charts, each titled 'Simulation X', and each group has two axes surrounded by a black frame.\n", "\n", "We repeatedly draw samples from three populations, each with its own parameters $\\mu$, $\\sigma$ and $N$, i.e. population mean, population standard deviation and sample size. Each draw yields an $F$-statistic. We repeat this simulation for $1000$ rounds, then plot the frequency distribution of the $F$-statistic on the upper axes and that of the $p$-value on the lower axes. \n", "\n", "The red vertical line is the critical value of the $F$-statistic; any test result that falls to the right of the red line leads to rejection of the null hypothesis." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, Simulation $1$ has $\\mu_1=3, \\mu_2=6, \\mu_3 = 9$, which surely points to a large $F$-statistic because $MST$ is large; still, a small number of $F$'s fall short of the critical value. The distributions of the corresponding $p$-values are plotted under the $F$-statistic distributions. \n", "\n", "Simulation $2$ has $\\mu_1=3, \\mu_2=3.1, \\mu_3 = 2.9$. Unless the standard deviations are negligibly small, we would not expect a high chance of rejecting the null hypothesis, and that is what the chart shows.\n", "\n", "The difference between Simulations $2$ and $3$ lies in the $\\sigma$'s: the latter violates the assumption of homoskedasticity, and its $\\sigma$'s are mostly larger than those of Simulation $1$, which pushes more of the $F$-statistic distribution to the left of $F_c$, i.e. failure to reject the null.\n", "\n", "Simulation $4$ complies with the assumption of homoskedasticity and, because the $\\mu$'s differ, we would presumably expect frequent rejection of the null hypothesis. Visually, however, a large share of the tests fail to reject, possibly due to the relatively large standard deviations.\n", "\n", "Simulations $5$ and $6$ provide some interesting insight: a small sample size such as $n=10$ combined with a relatively large $\\sigma$ results in a predominant number of false negatives. The straightforward remedy is to increase the sample size, as in Simulation $6$.\n", "\n", "You can experiment with the parameters of Simulations $7$, $8$ and $9$. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import plot_material\n", "plot_material.anova_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LSD Confidence Intervals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have rejected the null hypothesis but need further investigation into which treatments deviate from the rest, you need a technique to identify the source of the discrepancy. Here is an example of one such technique.\n", "\n", "We use the same height example, with one more column, Finnish, in sheet 2. Let's import sheet 2." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df2 = pd.read_excel('height_anova2.xlsx', sheet_name='Sheet2')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Japanese</th><th>Dutch</th><th>Danish</th><th>Finnish</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>161.783130</td><td>187.726286</td><td>174.746213</td><td>175.855378</td></tr>\n",
"    <tr><th>1</th><td>145.329934</td><td>179.338741</td><td>174.133579</td><td>175.513979</td></tr>\n",
"    <tr><th>2</th><td>174.569597</td><td>176.566656</td><td>178.966745</td><td>173.363995</td></tr>\n",
"    <tr><th>3</th><td>160.003162</td><td>184.570245</td><td>179.335222</td><td>178.515200</td></tr>\n",
"    <tr><th>4</th><td>162.242898</td><td>184.056181</td><td>167.497992</td><td>173.108095</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " Japanese Dutch Danish Finnish\n", "0 161.783130 187.726286 174.746213 175.855378\n", "1 145.329934 179.338741 174.133579 175.513979\n", "2 174.569597 176.566656 178.966745 173.363995\n", "3 160.003162 184.570245 179.335222 178.515200\n", "4 162.242898 184.056181 167.497992 173.108095" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method we introduce is called **Fisher's Least Significant Difference** (LSD), defined as \n", "$$\n", "LSD= t_{\\alpha/2}\\sqrt{MSE\\bigg(\\frac{1}{n_i}+\\frac{1}{n_j}\\bigg)}\n", "$$\n", "where the degrees of freedom of $t$ are taken to be $n_i+n_j-2$ in the code that follows (the textbook form of Fisher's LSD uses $n-k$, the degrees of freedom of $MSE$), and the confidence interval estimator of the mean difference is\n", "$$\n", "(\\bar{x}_i-\\bar{x}_j)\\pm t_{\\alpha/2}\\sqrt{MSE\\bigg(\\frac{1}{n_i}+\\frac{1}{n_j}\\bigg)}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, calculate the $MSE$." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "dutch_mean = df2['Dutch'].mean()\n", "japanese_mean = df2['Japanese'].mean()\n", "danish_mean = df2['Danish'].mean()\n", "finnish_mean = df2['Finnish'].mean()\n", "grand_mean = (dutch_mean+japanese_mean+danish_mean+finnish_mean)/4\n", "\n", "SSE = (len(df2['Japanese'])-1)*df2['Japanese'].var(ddof=1)\\\n", " +(len(df2['Dutch'])-1)*df2['Dutch'].var(ddof=1)\\\n", " +(len(df2['Danish'])-1)*df2['Danish'].var(ddof=1)\\\n", " +(len(df2['Finnish'])-1)*df2['Finnish'].var(ddof=1)\n", "\n", "n = len(df2['Japanese']) + len(df2['Dutch']) + len(df2['Danish']) + len(df2['Finnish'])\n", "k = 4\n", "MSE = SSE/(n-k)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The $LSD$ statistic is computed for each pair of treatments.\n", "\n", "$4$ groups means there are $\\binom{4}{2}=6$ pairs to test. Let's write a simple function for $LSD$." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def lsd(sig_level, MSE, n1, n2):\n", "    # Half-width of the confidence interval for one pairwise mean difference\n", "    t = sp.stats.t.ppf(1-sig_level/2, n1+n2-2)\n", "    return t*np.sqrt(MSE*(1/n1+1/n2))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comparison, Point Estimate, Lower Bound, Upper Bound\n", "Japanese - Dutch : -17.7116, -21.3998, -14.0233\n", "Japanese - Danish : -12.2196, -15.9078, -8.5313\n", "Japanese - Finnish : -12.8530, -16.5413, -9.1648\n", "Dutch - Danish : 5.4920, 1.8038, 9.1803\n", "Dutch - Finnish : 4.8585, 1.1703, 8.5468\n", "Danish - Finnish : -0.6335, -4.3217, 3.0548\n" ] } ], "source": [ "from itertools import combinations\n", "\n", "# One confidence interval per pair of treatments, in column order\n", "print('Comparison, Point Estimate, Lower Bound, Upper Bound')\n", "for a, b in combinations(['Japanese', 'Dutch', 'Danish', 'Finnish'], 2):\n", "    diff = df2[a].mean() - df2[b].mean()\n", "    half_width = lsd(.05, MSE, len(df2[a]), len(df2[b]))\n", "    print('{} - {} : {:.4f}, {:.4f}, {:.4f}'.format(a, b, diff, diff-half_width, diff+half_width))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The point estimate is straightforward to interpret: the further the estimate is from $0$, the stronger the evidence that the two population means differ. \n", "\n", "But to draw a clear statistical inference at significance level $\\alpha$, we should look at the confidence interval. If the interval excludes $0$, we reject $\\mu_i=\\mu_j$. In our example, the only failure to reject is between Danish and Finnish, which means their male heights are largely indistinguishable.\n", "\n", "Therefore we conclude that the $MST$ is driven mostly by the Japanese sample relative to the other countries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chi-Squared Goodness-of-Fit Test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the last topic of this tutorial session. Let's walk through an example, after which you will be able to grasp the essence of the **chi-squared goodness-of-fit test**.\n", "\n", "Three marksmen are competing at shooting beer bottles hanging on a tree $300m$ away. According to their historical records, their hitting rates (each marksman's share of all bottles hit) are as follows.\n", "\n", "
<table>\n", "<tr><th>Marksman</th><th>Hitting Rate</th></tr>\n", "<tr><td>A</td><td>$24\\%$</td></tr>\n", "<tr><td>B</td><td>$40\\%$</td></tr>\n", "<tr><td>C</td><td>$36\\%$</td></tr>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that $24\\%+40\\%+36\\%=100\\%$, which is a feature of a **multinomial experiment**. \n", "\n", "To improve performance, Marksman A attended a hunter training camp, after which they agreed to compete again. They take turns shooting and stop once the 500th bottle is hit. Here is the result. \n", "\n", "
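<table>\n", "<tr><th>Marksman</th><th>Bottles</th></tr>\n", "<tr><td>A</td><td>$142$</td></tr>\n", "<tr><td>B</td><td>$187$</td></tr>\n", "<tr><td>C</td><td>$172$</td></tr>\n", "</table>\n", "\n", "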
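The observed counts can be compared with the historical hitting rates in code. The sketch below is only an illustration under one stated assumption: the expected counts are scaled to the observed total (the three counts listed add up to 501), because recent SciPy versions check that observed and expected counts have matching sums. The text that follows works with a round 500 instead, so its numbers differ slightly from what this sketch prints.

```python
import numpy as np
from scipy import stats

observed = np.array([142, 187, 172])   # bottles hit by A, B, C
rates = np.array([0.24, 0.40, 0.36])   # historical hitting rates

# Assumption of this sketch: expected counts use the observed total,
# so that both vectors sum to the same number.
expected = rates * observed.sum()

# Chi-squared statistic computed by hand.
chi2_stat = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1
p_value = stats.chi2.sf(chi2_stat, dof)

# SciPy's built-in goodness-of-fit test for comparison.
ref = stats.chisquare(observed, f_exp=expected)

print('chi2 = {:.4f}, p = {:.4f}'.format(chi2_stat, p_value))
```

A p-value above the chosen significance level means the counts are compatible with the historical rates.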
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would like to know if Marksman A has improved which also causes the hitting rate changes. The null hypothesis is specified as\n", "$$\n", "H_0: p_1=24\\%, p_2 = 40\\%, p_3=36\\%\\\\\n", "H_1: \\text{At least one $p_i$ is not equal to its specified value}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without seeing the lastest competition result, we are looking forward to the **expected frequency** to be\n", "$$\n", "e_1 = 500\\times 24\\% = 120\\\\\n", "e_2 = 500\\times 40\\% = 200\\\\\n", "e_3 = 500\\times 36\\% = 180\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's the comparison bar chart." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "labels = ['Marksman 1', 'Marksman 2', 'Marksman 3']\n", "post_training = np.array([142, 187, 172])\n", "exp_frequency = np.array([120, 200, 180])\n", "\n", "x = np.arange(len(labels)) # the label locations\n", "width = .2 # the width of the bars\n", "\n", "fig, ax = plt.subplots(figsize = (10, 4))\n", "rects1 = ax.bar(x - width/2, post_training, width, label='Post-Training')\n", "rects2 = ax.bar(x + width/2, exp_frequency, width, label='Exp. Freq.')\n", "\n", "ax.set_ylabel('Scores')\n", "ax.set_title('Scores of Exp. Freq. And Post-Training')\n", "ax.set_xticks(x)\n", "ax.set_xticklabels(labels)\n", "ax.legend()\n", "\n", "fig.tight_layout()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the expected and observed frequencies differ significantly, we would conclude a rejection to the null hypothesis. The test statistic is \n", "$$\n", "\\chi^2=\\sum_{i=1}^k\\frac{(f_i-e_i)^2}{e_i}\n", "$$\n", "where $f_i$ and $e_i$ are observed and expected frequencies. In this example, $\\chi^2$ is\n", "$$\n", "\\chi^2 = \\frac{(f_1-e_1)^2}{e_1}+\\frac{(f_2-e_2)^2}{e_2}+\\frac{(f_3-e_3)^2}{e_3}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the $\\chi^2$" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.233888888888888" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum((post_training - exp_frequency)**2/exp_frequency)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Degree of freedom is $\\nu=k-1=2$, therefore the rejection region is \n", "$$\n", "\\chi^2>\\chi^2_{.05, 2}\n", "$$\n", "which can be found by ```sp.stats.chi2.ppf```." 
] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.991464547107979" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sp.stats.chi2.ppf(.95, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the $\\chi^2$-statistic does not exceed the critical value, we fail to reject the null hypothesis. That means that even though Marksman A's post-training result is better than expected, the data do not provide sufficient evidence of skill improvement at the $5\\%$ significance level; the difference could be a statistical fluke." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }