Lesson 2
Hypothesis Testing with SciPy
Introduction to Hypothesis Testing

Welcome back! In this lesson, we will explore hypothesis testing, a fundamental aspect of statistical analysis. Hypothesis testing allows us to make inferences or draw conclusions about a population based on a sample. Imagine you want to compare the effectiveness of two different teaching methods or determine if there is a connection between two categorical variables. Hypothesis testing is used in scenarios like these.

Hypothesis testing is a crucial tool in fields such as psychology, medicine, and marketing, where data-driven decisions are vital. We'll be utilizing SciPy, a powerful Python library, to conduct these tests with ease.

Context of the Lesson: One-Sample t-Test, Two-Sample t-Test, and Chi-Square Test

In this lesson, we will focus on three specific hypothesis tests: the one-sample t-test, the two-sample t-test, and the chi-square test for independence.

  • One-Sample t-Test: This test is used to compare the mean of a single sample to a known or hypothesized population mean. For example, you might want to see if the average test score of a class significantly differs from a standard passing score.

  • Two-Sample t-Test: This test helps us compare the means of two independent groups to see if they are significantly different. For instance, you might want to compare test scores between two different study techniques to see which is more effective.

  • Chi-Square Test for Independence: This test is used to determine if there is a significant association between two categorical variables. For example, you might test whether gender is related to preference for a particular product. The test calculates a chi-square statistic, which reflects how much the observed counts deviate from expected counts if the variables were independent. A high chi-square value often indicates a significant association between the variables.

As a reminder, this course assumes that you already know basic statistics and focuses on introducing the SciPy library. If you want to learn or review hypothesis testing, check out our Introduction to Data Analysis course path.

By the end of this lesson, you'll be proficient in conducting these tests using SciPy, enabling you to apply these techniques in various data analysis scenarios.

Conducting a One-Sample t-Test

Step 1: Prepare Your Data

First, you need a single group of data you want to test against a known population mean. Let's say you're testing whether the average test score of a sample of students significantly differs from a passing score of 10.

Python
import numpy as np

# Sample data for the group
group = np.array([10, 12, 14, 15, 9])
population_mean = 10

Step 2: Conduct the One-Sample t-Test

Use SciPy's stats.ttest_1samp() to perform this test. This function calculates the t-statistic and p-value.

Python
from scipy import stats

# Perform a one-sample t-test
t_statistic, p_value = stats.ttest_1samp(group, population_mean)
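To make the two returned values less of a black box, here is a minimal sketch that recomputes them by hand and compares them with SciPy's result. The names `standard_error`, `t_manual`, and `p_manual` are illustrative and not part of the lesson; the sketch assumes the default two-sided test.

```python
import numpy as np
from scipy import stats

group = np.array([10, 12, 14, 15, 9])
population_mean = 10

n = group.size
sample_mean = group.mean()
# ddof=1 gives the sample standard deviation (divides by n - 1)
standard_error = group.std(ddof=1) / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - population_mean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Compare with SciPy's built-in test
t_scipy, p_scipy = stats.ttest_1samp(group, population_mean)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
# True True
```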

Step 3: Interpret the Results

The t_statistic measures how far the sample mean lies from the hypothesized population mean, in units of the standard error. The p_value is the probability of observing a difference at least this large if the true population mean were equal to the hypothesized value. By convention, a p-value of less than 0.05 is taken to suggest a significant difference.
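The 0.05 rule of thumb can be written as an explicit decision step. This is only a sketch: the variable names `alpha` and `decision` are illustrative, and the right threshold depends on your field and study design.

```python
import numpy as np
from scipy import stats

group = np.array([10, 12, 14, 15, 9])
population_mean = 10
alpha = 0.05  # significance level (assumed here; choose it before looking at the data)

t_statistic, p_value = stats.ttest_1samp(group, population_mean)

# Reject the null hypothesis only when the p-value falls below alpha
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"p = {p_value:.3f}: {decision} at alpha = {alpha}")
# p = 0.154: fail to reject the null hypothesis at alpha = 0.05
```

Note that failing to reject the null hypothesis does not prove the means are equal; with only five observations, the test has little power to detect a difference.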

Python
print(f"T-statistic: {t_statistic}, P-value: {p_value}")
# T-statistic: 1.7541160386140584, P-value: 0.15427287107931661

Conducting a Two-Sample t-Test

Step 1: Prepare Your Data

First, we need to set up our data for the two groups we want to compare. Let's say you're comparing the test scores of two different study methods, where group1 represents the scores from Method A and group2 from Method B.

Python
import numpy as np

# Sample data for two groups
group1 = np.array([10, 12, 14, 15, 9])
group2 = np.array([8, 9, 11, 13])

Step 2: Conduct the Two-Sample t-Test

Next, we perform the two-sample t-test using SciPy's stats.ttest_ind() function, which returns the t-statistic and the p-value. By default, it assumes that the two groups have equal population variances.

Python
from scipy import stats

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(group1, group2)
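If you are not comfortable assuming equal variances in the two groups, `stats.ttest_ind()` accepts `equal_var=False`, which runs Welch's t-test instead. The sketch below runs both versions side by side; the variable names `t_pooled`, `t_welch`, and so on are illustrative only.

```python
import numpy as np
from scipy import stats

group1 = np.array([10, 12, 14, 15, 9])
group2 = np.array([8, 9, 11, 13])

# Default: Student's t-test, which pools the two sample variances
t_pooled, p_pooled = stats.ttest_ind(group1, group2)

# Welch's t-test: does not assume equal population variances
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Pooled: t = {t_pooled:.3f}, p = {p_pooled:.3f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.3f}")
```

With samples this small and similar in spread, the two versions give close results; they can diverge noticeably when group sizes and variances differ a lot.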

Step 3: Interpret the Results

The t_statistic expresses the difference between the two group means relative to the variability in the data. A higher absolute value indicates a larger difference relative to that variability. The p_value is the probability of observing a difference at least this large if the two population means were equal. A p-value of less than 0.05 is conventionally taken to suggest a significant difference.

Python
print(f"T-statistic: {t_statistic}, P-value: {p_value}")
# T-statistic: 1.081227306384228, P-value: 0.3154321485938413

Recall: Contingency Table

A contingency table is a type of data matrix that displays the frequency distribution of variables. It is primarily used to analyze the relationship between two categorical variables. Each cell in the table represents the count or frequency of occurrences of the variable combinations.

For instance, suppose we want to understand if there is an association between gender (male, female) and preference for a product (like, dislike). A contingency table might look like this:

Gender | Like | Dislike
Male | 10 | 20
Female | 20 | 30

In this example, the table shows how many males and females like or dislike the product. The rows represent gender, and the columns represent the preference category. By assessing how these frequencies deviate from what would be expected if the variables were independent, we can perform tests like the chi-square test to determine if a significant relationship exists between the variables.
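To make "expected if the variables were independent" concrete, the following sketch computes the expected counts for this table by hand. Under independence, each expected cell count is the row total times the column total divided by the grand total. The variable names are illustrative.

```python
import numpy as np

observed = np.array([[10, 20],   # Male:   like, dislike
                     [20, 30]])  # Female: like, dislike

row_totals = observed.sum(axis=1)  # totals per gender
col_totals = observed.sum(axis=0)  # totals per preference
grand_total = observed.sum()

# Under independence: expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
# [[11.25 18.75]
#  [18.75 31.25]]
```

The observed counts (10, 20, 20, 30) sit close to these expected values, which already hints that any association in this table is weak.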

Conducting a Chi-Square Test for Independence with SciPy

The chi-square test for independence is a statistical tool used to examine if two categorical variables are related or independent. At its core, the test compares the observed frequency of events to the frequency we would expect if the events were completely independent.

As an analogy, imagine drawing colored cards from a well-shuffled deck over several rounds. If the draws are truly random, you would expect roughly the same proportion of each color in every round. The chi-square test measures how far the observed counts deviate from that expectation.

Step 1: Create a Contingency Table

Using a contingency table, you compile observed frequencies of different combinations of categorical variables. This forms the basis for comparison against expected frequencies.

Python
import numpy as np

# Contingency table (rows: male, female; columns: like, dislike)
contingency_table = np.array([[10, 20], [20, 30]])

Step 2: Perform the Chi-Square Test

The stats.chi2_contingency() function performs the chi-square test, delivering results such as the chi-square statistic, p-value, degrees of freedom, and expected frequencies.

Python
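from scipy import stats

# Conducting the chi-square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

  • The chi-square statistic quantifies how much the observed counts deviate from the expected counts, where larger values imply greater deviation. For a 2×2 table like this one, chi2_contingency() applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected value.
  • The degrees of freedom (dof) are the number of cell counts that are free to vary once the row and column totals are fixed. For a table with r rows and c columns, dof = (r - 1)(c - 1), which is 1 for this 2×2 table.
  • The expected frequencies assume no relationship between the variables, serving as a baseline for comparison.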
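For 2×2 tables, `stats.chi2_contingency()` applies Yates' continuity correction unless you pass `correction=False`. The sketch below runs the test both ways and checks the degrees of freedom against the (rows - 1) × (columns - 1) rule. Variable names such as `chi2_corrected` and `chi2_plain` are illustrative.

```python
import numpy as np
from scipy import stats

contingency_table = np.array([[10, 20], [20, 30]])

# Default behavior: Yates' continuity correction is applied when dof == 1
chi2_corrected, p_corrected, dof, expected = stats.chi2_contingency(contingency_table)

# Same test without the continuity correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(contingency_table, correction=False)

# Degrees of freedom follow (rows - 1) * (columns - 1)
rows, cols = contingency_table.shape
print(dof == (rows - 1) * (cols - 1))

print(f"With correction:    chi2 = {chi2_corrected:.3f}, p = {p_corrected:.3f}")
print(f"Without correction: chi2 = {chi2_plain:.3f}, p = {p_plain:.3f}")
```

Which variant is appropriate depends on your sample size and conventions in your field; the corrected version is more conservative.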

Step 3: Interpret the Results

The chi2_stat measures how far the observed counts deviate from the counts expected under the assumption of independence. A high chi-square statistic suggests a strong deviation from expectations, implying a potential association between the variables. The p_value is the probability of observing a deviation at least this large if the variables were truly independent, with a p-value of less than 0.05 conventionally taken to suggest a significant relationship.

Python
print(f"Chi-square stat: {chi2_stat}, P-value: {p_value}")
# Chi-square stat: 0.128, P-value: 0.7205147871362552

Summary and Preparation for Practice

In this lesson, we've covered how to perform three critical types of hypothesis tests using SciPy: the one-sample t-test, two-sample t-test, and the chi-square test for independence. These tests help you analyze data statistically, drawing meaningful insights.

The code examples we've walked through show the practical application of these tests on sample datasets. As you move on to the practice exercises, use these examples as a guide to cement your understanding. By learning these hypothesis testing techniques, you've taken a significant step in enhancing your data analysis toolkit. Keep exploring different datasets and scenarios as you apply these skills in real-world contexts.
