Introduction to testing Statistical Hypotheses

Quality Control: when to blow the whistle

(Hypothesis: process in control, Alternative: out of control)

Application: industrial production

Planned experiment: are there any treatment effects?

(Hypothesis: zero effect, Alternative: nonzero effects)

Application: Pharma, Industry, Agriculture (Green Revolution, Fisher's $\chi^{2}$-test)

Independence testing: is there any relation between the observations of different processes?

(Hypothesis: unrelated, independent, Alternative: there is some relation)

Regression Analysis, Econometrics, ...

Suppose our observations are obtained from a data-generating process characterized by certain parameters.

Our generic example: detect a change from a single observation $x_{1}$

"drop money from helicopters" on the economy

$\implies$ treatment effect $\implies$ controlled experiment


  1. pour money into a "poor" (?) school district, observe SAT scores (average) afterwards

  2. compare with what it was before: did it get better? worse? no change?

  3. "ceteris paribus"

NEED: a model for how your data are "generated"

continue with

NAIVE Hypothesis Testing

FORMAL Hypothesis Testing

conclude with

MORE observations, MORE models

Example SAT-scores $X$

  1. Expert opinion on past SAT-scores $X$ (random variable)

    1. model: $X\sim N(\mu,\sigma^{2})$

    2. $\mu\leq1550$ "except once in a lifetime"

    3. $\mu\leq1450$ "once in a lifetime, maybe"

    4. $\sigma\approx30$

    5. why? 95% within $\mu\pm2\sigma$

  2. DATA: $x_{new}=1535$ ("way above average", how much is too much?)

  3. compute probability of observing what you observe (or even more extreme than that)


Question: is there enough evidence for the case $\mu>1500$?

Answer: compute the probability of observing what we observed, or something even more extreme than that ("p-value"): $P(X\geq1535)\approx0.12$


  1. the recipe "better SAT by pouring money" is not convincing (so far)

  2. at best "borderline"

  3. under this model ($\mu=1500$) we would expect to observe what we observed ($x=1535$, or something even more extreme, $x>1535$) about 12% of the time (p-value): not untypical

Example: Alternative motivation of the MODEL: from complete past SAT records, compute $\mu=1500,\sigma=30$


Example: SAT-scores $X$

  1. From records on past SAT-scores $X$

    1. model: $X\sim N(\mu,\sigma^{2})$

    2. $\mu=1500$

    3. $\sigma=30$

  2. DATA: $x_{new}=1535$

  3. compute probability of observing what you observe (or even more extreme than that)


compute "p-value" (standardization): $P(X\geq1535)=P\left(Z\geq\frac{1535-1500}{30}\right)=P(Z\geq1.17)\approx0.12$
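The standardization above can be sketched numerically; a minimal example using Python's stdlib `statistics.NormalDist` (the values $\mu=1500$, $\sigma=30$, $x_{new}=1535$ are the example's; the variable names are my own):

```python
# Sketch of the one-sided p-value computation from the SAT example.
from statistics import NormalDist

mu, sigma = 1500, 30   # model parameters from past SAT records
x_new = 1535           # observed (average) score after the treatment

z = (x_new - mu) / sigma              # standardization: Z = (X - mu) / sigma
p_value = 1 - NormalDist().cdf(z)     # P(X >= 1535) = P(Z >= 1.17)

print(round(z, 2), round(p_value, 2))   # → 1.17 0.12
```

The 12% agrees with the p-value quoted in the first example: not untypical under the model.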

Formal Hypothesis Testing

Null Hypothesis, Alternative, Level of significance

Setting up Hypothesis (H$_{0}$) and Alternative (H$_{1}$)


  1. What is of interest? A positive treatment effect.

  2. What do we expect or hope to show? $\mu>1500$

H$_{0}:$ suppose not

before treatment: $\mu_{0}\leq1500$

after treatment: $\mu>1500$

H$_{0}:$ $\mu\leq1500$

H$_{1}:$ $\mu>1500$

conclusion: there is scant evidence against H$_{0}$ (or there is not enough reason to reject H$_{0}$).

Setting the red line. How much is too much?

Introduce Rejection rule

reject H$_{0}$ if, under H$_{0}$, what you observe (or something more extreme than that) would occur at most $100\alpha\%$ of all times.

$100\alpha\%$: level of significance

customary: $\alpha=0.01~$(engineering), $0.05$ (common), $0.10$ (social sciences)

Example (continued)


suppose H$_{0}$ holds with $\mu=1500$, i.e. $X\sim N(1500,30^{2})$

Rejection rule

Reject H$_{0}$ if $X>x_{0}$, where $x_{0}$ satisfies $P(X>x_{0}\mid\mu=1500)=\alpha$

set $\alpha=0.05$


from the normal table:

$P(Z>1.645)=0.05$ $\implies$ $x_{0}=1500+1.645\cdot30=1549.35$




Formal Rejection rule, Final Form:

reject H$_{0}:\mu\leq1500$ at 5% if $X>1549.35$ (equivalently, if $\frac{X-1500}{30}>1.645$)
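The rejection rule can be checked numerically; a minimal sketch with the example's values ($\alpha=0.05$, $\mu=1500$, $\sigma=30$), using stdlib `statistics.NormalDist` in place of the normal table:

```python
# Sketch of the rejection rule: find x0 with P(X > x0 | mu = 1500) = alpha.
from statistics import NormalDist

mu, sigma, alpha = 1500, 30, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)   # z with P(Z > z) = 0.05
x0 = mu + z_crit * sigma                   # critical value on the SAT scale

print(round(z_crit, 3), round(x0, 2))      # → 1.645 1549.35
```

Since the observed $x_{new}=1535$ is below $x_{0}\approx1549.35$, we do not reject H$_{0}$ at the 5% level, consistent with the p-value of about 12%.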

Type 1 error: reject H$_{0}$ when H$_{0}$ is true.


Question: why ever make a Type 1 error?

Answer: the only way never to make a Type 1 error is never to reject; but never rejecting means failing to reject when H$_{0}$ is wrong, i.e. a Type 2 error.

Question: why ever make a Type 2 error?

Answer: think about it


canned programs

SAS, STATA, SPSS, etc. do not test hypotheses; they report p-values.

before you use these programs, know your data!

two-sided alternatives

before treatment: $\mu_{0}=1500$

after treatment: $\mu\neq1500$

H$_{0}:$ $\mu=1500$

H$_{1}:$ $\mu\neq1500$

two-sided alternative.

split $\alpha$ into $\alpha/2$ left and $\alpha/2$ right

see homework problems
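The two-sided p-value for the running example can be sketched the same way (again with the example's $\mu=1500$, $\sigma=30$, $x_{new}=1535$, and stdlib `statistics.NormalDist`):

```python
# Two-sided version: under H0: mu = 1500, deviations in either direction count,
# so the p-value doubles the one-sided tail probability (alpha/2 per tail).
from statistics import NormalDist

mu, sigma, x_new = 1500, 30, 1535
z = (x_new - mu) / sigma
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_two_sided, 2))   # → 0.24
```

Note the two-sided p-value is exactly twice the one-sided 12%, so a two-sided test rejects even less readily here.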

More elaborate Tests

two-sample test (normal test, t-test)

sample 1: $x_{1},\ldots,x_{m}$ from $N(\mu_{1},\sigma_{1}^{2})$
sample 2: $y_{1},\ldots,y_{n}$ from $N(\mu_{2},\sigma_{2}^{2})$

H$_{0}:$ $\mu_{1}=\mu_{2}$ vs H$_{a}:$ $\mu_{1}\neq\mu_{2}$
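A minimal two-sample sketch (the normal test with known variances, as a simple stand-in for the $t$-test; the two samples below are made up for illustration, and the common $\sigma=30$ is an assumption):

```python
# Two-sample normal test of H0: mu1 = mu2 vs Ha: mu1 != mu2,
# assuming known standard deviations (hypothetical data).
from statistics import NormalDist, mean
from math import sqrt

x = [1510, 1540, 1525, 1560, 1530]   # sample 1 (made up)
y = [1495, 1505, 1500, 1490, 1515]   # sample 2 (made up)
sigma1 = sigma2 = 30                  # assumed known standard deviations

m, n = len(x), len(y)
z = (mean(x) - mean(y)) / sqrt(sigma1**2 / m + sigma2**2 / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided alternative

print(round(z, 2), round(p_value, 3))
```

With unknown variances one would instead estimate them from the samples and compare against a $t$ distribution; the structure of the test statistic is the same.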

k-sample tests (ANOVA)

H$_{0}:$ $\mu_{1}=\mu_{2}=\cdots=\mu_{k}$ vs H$_{a}:$ not all equal

NOTE: a null hypothesis may always be WRONG and may have to be rejected, but it is never accepted.

never accept a null hypothesis.

say "do not reject H$_{0}$" instead of "accept H$_{0}$".