Inferential Statistical Analysis
T-Test
- The T-test, i.e. Student's T-Test, compares two means and tells us whether they are different from each other.
- It tells whether two samples have been drawn from the same population or not.
- It tells how significant our result is; more specifically, it tells whether the difference is likely to have happened by chance or not.
T-Values and Degrees of Freedom
- T-Value is the ratio of the difference between the means of the two sample sets to the variation that exists within the sample sets.
- T-Value is also called T-Score.
- A large t-score indicates that the groups are different.
- A small t-score indicates that the groups are similar.
- Degrees of Freedom are the number of values that have the freedom to vary.
- Formula of Degrees of Freedom: df = nx + ny - 2, where nx and ny are the sizes of the two samples.
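To make the ratio concrete, here is a minimal sketch that computes a t-score and the degrees of freedom for two small samples. It assumes the pooled-variance form of the statistic; the function name `t_score` and the data in `group_a` and `group_b` are made up for illustration.

```python
import math
import statistics


def t_score(x, y):
    """Return the pooled-variance t-score and the degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: the two sample variances weighted by their own degrees of freedom.
    pooled_var = ((nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)) / df
    # Variation within the samples, expressed as the standard error of the mean difference.
    std_error = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / std_error, df


group_a = [5.1, 4.9, 5.6, 5.8, 6.0]  # illustrative values
group_b = [4.1, 4.4, 3.9, 4.6, 4.0]  # illustrative values
t, df = t_score(group_a, group_b)
print(f"t = {t:.2f}, df = {df}")
```

A sample compared with itself gives a t-score of 0, which is the "groups are similar" extreme.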
Normal Distribution
- It has a bell-shaped density curve.
- The density curve is symmetrical and centered about the mean (μ), which determines the peak of the curve.
- Data spread is determined by the standard deviation (sigma, σ), i.e. it is a measure of variability. It determines how far the data falls from the mean.
The density curve is as follows:
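The properties above can also be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the particular means and standard deviations are arbitrary choices for illustration.

```python
from statistics import NormalDist

curve = NormalDist(mu=0.0, sigma=1.0)
wide_curve = NormalDist(mu=0.0, sigma=2.0)  # same mean, larger spread

# The peak of the density sits at the mean.
peak = curve.pdf(0.0)
# Symmetry: points equally far on either side of the mean have the same density.
left, right = curve.pdf(-1.5), curve.pdf(1.5)
# Share of the data that falls within one standard deviation of the mean.
within_one_sigma = curve.cdf(1.0) - curve.cdf(-1.0)

print("peak density:", round(peak, 4))
print("density at -1.5 and +1.5:", round(left, 4), round(right, 4))
print("peak with sigma = 2:", round(wide_curve.pdf(0.0), 4))
print("within one sigma:", round(within_one_sigma, 4))
```

A larger σ flattens the curve: the peak gets lower because the same total probability is spread over a wider range.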
Standardization (Normalization, z-Scores)
- The process of putting different variables on the same scale is known as Standardization.
- Also called Normalization.
- It allows us to compare scores between different types of variables.
- Formula: z = (x - μ) / σ, where x is an observation, μ is the mean and σ is the standard deviation.
- Result of this formula is known as the z-score.
- The z-score tells where a data point lies compared to the overall population, measured in standard deviations from the mean.
Note: The higher (or lower) the z-score, the more unlikely the result is to happen by chance and the more likely the result is meaningful.
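As a small illustration of comparing different types of variables, the sketch below standardizes two lists measured in different units. The helper name `z_scores` and the data are made up, and each list is treated as the whole population (so the population standard deviation is used).

```python
import statistics


def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # illustrative values
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]       # illustrative values

height_z = z_scores(heights_cm)
weight_z = z_scores(weights_kg)
print([round(z, 2) for z in height_z])
print([round(z, 2) for z in weight_z])
```

After standardization both variables have mean 0 and standard deviation 1, so a height and a weight can be compared by how many standard deviations each lies from its own mean.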
p-Value and alpha
Significance Level (Alpha)
- It is the probability of rejecting the null hypothesis when it is true.
- Drawing a two-tailed graph for alpha = 0.05 and alpha = 0.01: [1]
Some Key Points:
- We need to shade the 5% or 1% of the graph that is furthest from the null hypothesis value (since that is where we reject).
- The sample mean for the given distribution is 330.6.
Observation:
- For the above two-tailed test, the critical regions (the shaded parts) lie equidistant from the null hypothesis value, one in each tail.
- The sample mean 330.6 is significant at the 5% significance level but not at the 1% significance level.
Conclusion
- We reject the null hypothesis at the 5% significance level.
- We fail to reject the null hypothesis at the 1% significance level.
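The same "significant at 5% but not at 1%" situation can be sketched with critical values. The null hypothesis value and standard error behind the sample mean 330.6 are not given here, so the code below uses a hypothetical standardized result `z_observed = 2.2` and assumes a normal sampling distribution; the function names are illustrative.

```python
from statistics import NormalDist


def two_tailed_critical_z(alpha):
    """z value that leaves alpha / 2 of the area in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def is_significant(z, alpha):
    """True when z falls in either shaded tail."""
    return abs(z) > two_tailed_critical_z(alpha)


z_observed = 2.2  # hypothetical value, not derived from the 330.6 example
for alpha in (0.05, 0.01):
    print(
        f"alpha = {alpha}: critical z = {two_tailed_critical_z(alpha):.3f}, "
        f"reject H0 = {is_significant(z_observed, alpha)}"
    )
```

Lowering alpha pushes the critical values further into the tails, which is why a result can clear the 5% threshold and still miss the 1% one.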
P-Value
- The p-value is the probability of getting results at least as extreme as those in our sample data if the null hypothesis is true, i.e. by chance alone.
- Lower p-values are stronger evidence against the null hypothesis because they indicate that the data is unlikely to have occurred by chance.
- Example: p-value = 0.01 indicates that there is only a 1% probability of seeing data this extreme by chance.
- If the observed p-value is less than alpha, then the results are statistically significant.
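A minimal sketch of turning a t-statistic into a two-sided p-value and comparing it with alpha, assuming SciPy is available. The t-statistic 2.5 and the 18 degrees of freedom are illustrative numbers, and `two_sided_p_value` is a hypothetical helper name.

```python
from scipy import stats


def two_sided_p_value(t_statistic, df):
    """Area in both tails of the t distribution beyond |t_statistic|."""
    # sf is the survival function (1 - cdf); abs() treats +t and -t alike.
    return 2 * stats.t.sf(abs(t_statistic), df)


alpha = 0.05
p_value = two_sided_p_value(2.5, df=18)  # illustrative inputs
print("p-value:", round(p_value, 4))
print("statistically significant:", p_value < alpha)
```

Taking the absolute value matters: without it, a negative t-statistic would produce a "p-value" larger than 1.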
Types of T-Test
Student's T-Test
- Samples are independent.
- Samples are drawn from a Gaussian distribution.
- The samples are usually of the same size, although the test also works with unequal sizes.
- Samples have the same variance.
- The test checks whether the samples have different means.
- Values of one sample do not have any effect on the values of the other sample.
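Formula: t = (x̄1 - x̄2) / sqrt(s1²/n1 + s2²/n2), where x̄, s² and n are the mean, variance and size of each sample.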
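As a sanity check for the independent-samples case, the sketch below computes the statistic as the difference of means divided by sqrt(s1²/n1 + s2²/n2) and compares it with `scipy.stats.ttest_ind`. It assumes equal sample sizes, in which case this form coincides with SciPy's pooled-variance statistic; the data is randomly generated for illustration and `unpooled_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def unpooled_t(x, y):
    """t = (mean_x - mean_y) / sqrt(var_x / n_x + var_y / n_y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # ddof=1 gives the sample (n - 1) variance.
    std_error = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    return (x.mean() - y.mean()) / std_error


# Illustrative data: two independent samples of equal size.
rng = np.random.default_rng(0)
sample_x = rng.normal(loc=10.0, scale=2.0, size=30)
sample_y = rng.normal(loc=11.0, scale=2.0, size=30)

manual_t = unpooled_t(sample_x, sample_y)
library = stats.ttest_ind(sample_x, sample_y, equal_var=True)
print("manual t:", manual_t)
print("SciPy t: ", library.statistic, "p-value:", library.pvalue)
```

With unequal sample sizes the two statistics generally differ, and `equal_var=False` (Welch's test) is the SciPy option that matches the unpooled form.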
Paired Student’s T-Test
- Samples are dependent.
- They may be from the same population.
- Used to check whether the mean of the differences between the paired samples is zero or not.
- They may have unequal variances.
- Similar to Student's T-Test, they are also drawn from a Gaussian distribution.
- Values of one sample affect the values of the other sample.
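Formula: t = d̄ / (s_d / sqrt(n)), where d̄ is the mean of the paired differences, s_d is their standard deviation and n is the number of pairs.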
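For the paired case, a minimal sketch: the statistic is the mean of the pairwise differences divided by its standard error, checked here against `scipy.stats.ttest_rel`. The before/after measurements are made-up numbers and `paired_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def paired_t(first, second):
    """t = mean(d) / (std(d) / sqrt(n)) for pairwise differences d."""
    d = np.asarray(first, dtype=float) - np.asarray(second, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))


# Illustrative paired measurements on the same six subjects.
before = [72, 75, 78, 71, 80, 77]
after = [70, 74, 75, 70, 77, 76]

manual_t = paired_t(before, after)
library = stats.ttest_rel(before, after)
print("manual t:", manual_t)
print("SciPy t: ", library.statistic, "p-value:", library.pvalue)
```

The degrees of freedom here are n - 1 (the number of pairs minus one), not nx + ny - 2.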
Hypothesis
- Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data.
- An assumption is made about the population and tests are performed according to it.
Null Hypothesis
- It states/assumes that there is no difference between population characteristics (mean, proportion).
- It is denoted by H0.
Alternative Hypothesis
- It claims that the population is contradictory to the null hypothesis. Accepting it means rejecting the null hypothesis.
- It is denoted by H1.
Procedure
- Determine a null and alternative hypothesis.
- Collect sample data.
- Choose a significance level (alpha) and determine the degrees of freedom.
- Calculate the t-statistic.
- Calculate the critical t-value from the t distribution.
- Compare the critical t-value with the calculated t-statistic and reject the null hypothesis if the statistic is more extreme.
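The steps above can be sketched end to end as follows, assuming two independent samples, a two-tailed test and SciPy for the t distribution. The function name `critical_value_test` and the measurements are illustrative.

```python
import numpy as np
from scipy import stats


def critical_value_test(x, y, alpha=0.05):
    """Return (t_statistic, critical_t, reject_h0) for a two-tailed test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    df = x.size + y.size - 2                      # degrees of freedom
    t_stat = stats.ttest_ind(x, y).statistic      # calculated t-statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df)       # critical t-value
    return t_stat, t_crit, bool(abs(t_stat) > t_crit)


# H0: both groups have the same mean. H1: the means differ.
group_x = [20.1, 19.8, 21.2, 20.5, 19.9, 20.7]  # illustrative values
group_y = [22.4, 23.1, 21.9, 22.8, 23.5, 22.2]  # illustrative values

t_stat, t_crit, reject = critical_value_test(group_x, group_y)
print(f"t = {t_stat:.3f}, critical t = {t_crit:.3f}, reject H0 = {reject}")
```

Comparing |t| with the critical value gives the same decision as comparing the p-value with alpha; the two are different views of the same tail area.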
Creating Samples
_emotion_type = input("Enter type of emotion which needs to be tested: ")
_data, _data_numpy, _data_label, _data_label_numpy = select_emotion(df_all, df_label, df, _emotion_type)
sample_1 = _data_numpy.copy()
sample_1
_emotion_type = input("Enter type of emotion which needs to be tested: ")
_data, _data_numpy, _data_label, _data_label_numpy = select_emotion(df_all, df_label, df, _emotion_type)
sample_2 = _data_numpy.copy()
sample_2
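The helper `select_emotion` and the data frames it reads are not shown in this post. To try the functions below without that data, a stand-in such as the following can be used; it is an assumption that the real samples are 2-D NumPy arrays (rows are observations, columns are features), and the names `demo_sample_1` and `demo_sample_2` are hypothetical.

```python
import numpy as np

# Synthetic stand-ins for the two samples: 50 observations, 4 features each.
rng = np.random.default_rng(42)
demo_sample_1 = rng.normal(loc=0.0, scale=1.0, size=(50, 4))
demo_sample_2 = rng.normal(loc=0.5, scale=1.0, size=(50, 4))  # shifted mean

print(demo_sample_1.shape, demo_sample_2.shape)
print("Means:", demo_sample_1.mean(), demo_sample_2.mean())
```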
T-Test Implementation
Creating some user defined function for Dependent Samples
import numpy as np
import scipy as sci
import scipy.stats  # makes sci.stats available


def find_diff(sample1, sample2):
    """Return the sum of squared differences and the sum of differences between paired observations."""
    sq_diff = sum((sample1[i] - sample2[i]) ** 2 for i in range(len(sample1)))
    diff = sum(sample1[i] - sample2[i] for i in range(len(sample1)))
    return sq_diff, diff


def find_dev(sq_diff, diff, size):
    """Return the sample standard deviation of the paired differences."""
    std = np.sqrt((sq_diff - (diff ** 2 / size)) / (size - 1))
    return std


def dep_ttest(sample1, sample2, n, sample1_mean, sample2_mean, size):
    """Calculate the t-statistic for dependent (paired) samples."""
    sq_diff, diff = find_diff(sample1, sample2)
    std_dev = find_dev(sq_diff, diff, size)
    std_error = std_dev / np.sqrt(size)
    t_stat = (sample1_mean - sample2_mean) / std_error
    return t_stat


def dep_pval(sample1, sample2, size, t_statistic):
    """Calculate the two-sided p-value for the paired t-statistic."""
    # Degrees of freedom.
    df = size - 1
    # Use |t| so that a negative statistic gives the same two-sided p-value.
    p = 1 - sci.stats.t.cdf(np.abs(t_statistic), df=df)
    pval = 2 * p
    return pval
Creating some user defined function for Independent Samples
def ind_ttest(sample1, sample2, n, sample1_mean, sample2_mean, size1, size2):
    """Calculate the t-statistic for independent samples."""
    # find_var_std is a helper that is not shown in this post.
    var_1, std_1 = find_var_std(sample1, n)
    var_2, std_2 = find_var_std(sample2, n)
    print("Variance of sample1:", var_1)
    print("Variance of sample2:", var_2)
    print("Standard Deviation of sample1: ", std_1)
    print("Standard Deviation of sample2: ", std_2)
    # Standard error of the difference between the two means.
    std_error = np.sqrt(np.power(std_1, 2) / size1 + np.power(std_2, 2) / size2)
    t_stat = (sample1_mean - sample2_mean) / std_error
    return t_stat


def ind_pval(sample1, sample2, t_statistic, size1, size2):
    """Calculate the two-sided p-value for the independent t-statistic."""
    # Degrees of freedom.
    df = size1 + size2 - 2
    # Use |t| so that a negative statistic gives the same two-sided p-value.
    p = 1 - sci.stats.t.cdf(np.abs(t_statistic), df=df)
    pval = 2 * p
    return pval
Function for T-Test
import time
from statistics import StatisticsError


def t_test(sample_1, sample_2, alpha, n, sample_type, sample_1_mean, sample_2_mean, size1, size2):
    """Run a t-test. n: 1 = SciPy, 0 = user-defined functions. sample_type: 0 = dependent, 1 = independent."""
    try:
        n = int(n)
        alpha = float(alpha)
        if n not in (0, 1):
            print("Enter correct value")
            return
        if sample_type not in (0, 1):
            print("Enter correct sample type")
            return
        start = time.time()
        if n == 1:
            # Library implementation from SciPy.
            if sample_type == 0:
                statistic, pvalue = sci.stats.ttest_rel(sample_1, sample_2)
            else:
                statistic, pvalue = sci.stats.ttest_ind(sample_1, sample_2)
        else:
            # User-defined implementation.
            if sample_type == 0:
                statistic = dep_ttest(sample_1, sample_2, n, sample_1_mean, sample_2_mean, size1)
                pvalue = dep_pval(sample_1, sample_2, size1, statistic)
            else:
                statistic = ind_ttest(sample_1, sample_2, n, sample_1_mean, sample_2_mean, size1, size2)
                pvalue = ind_pval(sample_1, sample_2, statistic, size1, size2)
        print("Statistics: ", statistic)
        print("P-Value: ", pvalue)
        # Fail to reject H0 only if every p-value is above alpha.
        if np.all(pvalue > alpha):
            print("Same Distributions - fail to reject H0")
        else:
            print("Different Distributions - reject H0")
        print("Time taken to formulate: ", time.time() - start)
    except StatisticsError as error:
        raise error
    except (FloatingPointError, NameError, ZeroDivisionError, ValueError, TypeError, AttributeError) as error:
        print()
        print(error)
        raise error
Null Hypothesis (H0): The means of the two samples are the same.
Alternative Hypothesis (H1): The means of the two samples are not the same.
n = input("Enter 0 (To calculate without library functions), 1 (Via Library function): ")
alpha = input("Enter alpha value: ")
size_sample_1 = sample_1.shape[0] * sample_1.shape[1]
size_sample_2 = sample_2.shape[0] * sample_2.shape[1]
# find_mean is a helper that is not shown in this post.
sample_1_mean = find_mean(sample_1, n)
sample_2_mean = find_mean(sample_2, n)
print("Mean of Sample 1:", sample_1_mean)
print("Mean of Sample 2:", sample_2_mean)
# Treat the samples as independent (1) when their sizes or means differ, otherwise as dependent (0).
if size_sample_1 != size_sample_2:
    sample_type = 1
elif np.any(sample_1_mean != sample_2_mean):
    sample_type = 1
else:
    sample_type = 0
t_test(sample_1, sample_2, alpha, n, sample_type, sample_1_mean, sample_2_mean, size_sample_1, size_sample_2)
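When the samples are 2-D arrays, SciPy runs one test per column by default, so the p-value is an array and needs an element-wise comparison against alpha. The sketch below illustrates this on synthetic data in which only the first column has a shifted mean; the array names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic data: 40 observations of 3 features; only column 0 differs in mean.
rng = np.random.default_rng(1)
group_1 = rng.normal(loc=[3.0, 0.0, 0.0], scale=1.0, size=(40, 3))
group_2 = rng.normal(loc=0.0, scale=1.0, size=(40, 3))

# axis=0 is the default: each column is tested separately.
statistic, pvalue = stats.ttest_ind(group_1, group_2)

alpha = 0.05
reject = pvalue < alpha  # boolean array, one decision per column
for column, (p, decision) in enumerate(zip(pvalue, reject)):
    print(f"column {column}: p-value = {p:.4g}, reject H0 = {bool(decision)}")
```

Whether to summarize the per-column decisions with `all`, `any` or something else depends on the question being asked, so it is worth looking at the individual p-values first.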
Check out my other post:
Do visit my GitHub to view complete code!
I would really appreciate your feedback.
Let me know what you think of this article on Twitter @khushi__411 or leave a comment below!