Exploratory Statistical Analysis
The process of analysing data to extract its important characteristics is known as exploratory data analysis. Performing that analysis through statistical methods such as mode, probability, expected value, and correlation is known as exploratory statistical analysis.
Exploring Binary and Categorical Data
Binary Data: Data which can take only two possible values, 0 or 1.
Categorical Data: Data which has been divided into categories or groups according to their features.
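To make the two definitions concrete, here is a minimal sketch using only the Python standard library. The sample values and the variable names (`is_happy`, `emotion`) are made up for illustration and are not taken from the dataset used in this post.

```python
from collections import Counter

# Hypothetical sample: one binary variable and one categorical variable.
is_happy = [1, 0, 1, 1, 0]
emotion = ["happy", "sad", "happy", "neutral", "sad"]

# Binary data takes exactly two distinct values.
binary_values = sorted(set(is_happy))

# Categorical data falls into a fixed set of named groups.
categories = sorted(set(emotion))

# Counting the entries per category gives the kind of "counts" table
# that the functions in this post work with.
category_counts = Counter(emotion)

print("Binary values:", binary_values)
print("Categories:", categories)
print("Entries per category:", dict(category_counts))
```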
Mode
- The mode is the value that appears most often in a dataset.
- It is a summary statistic for categorical data, i.e. grouped data or a set of values representing possible categories.
- It is generally not used for continuous numeric data, i.e. data expressed on a number scale.
The function to calculate mode:
import time
from statistics import StatisticsError


def find_mode(counts, n):
    """Finds mode of sample.

    This function offers two methods to calculate the mode:
    * without library function
    * via library function

    counts : pandas.Series
        number of entries each emotion type has
    n : str
        select method to calculate mode

    >>> for n = 0, Time taken to calculate: 0.0002598762512207031, Mode of sample: [3, 8989]

    mode : list
        emotion along with its number of values
    """
    try:
        n = int(n)
        print("Shape of selected numpy array: ", counts.shape)
        print("Data type of _data_numpy: ", type(counts))
        if n == 0:
            start = time.time()
            # Track the largest count seen so far and its position.
            max_count = counts[0]
            mode = 0
            for num, row in enumerate(counts):
                if row > max_count:
                    mode = num
                    max_count = row
            result = [mode, max_count]
            print("Time taken to calculate: ", time.time() - start)
            return result
        if n == 1:
            start = time.time()
            # Series.mode returns the most frequent value(s) found in counts.
            mode = counts.mode(dropna=True)
            print("Time taken to calculate: ", time.time() - start)
            return mode
        # Any value other than 0 or 1 is not a valid method choice.
        print("Enter correct value")
    except StatisticsError as error:
        raise error
    except Exception as error:
        raise error
Function Call:
n = input("Enter 0 (To find mode without any library function), Enter 1 (Via using Library Function): ")
mode = find_mode(counts, n)
print("Mode of sample: ", mode)
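As a small cross-check on the idea of a mode, the sketch below uses only the Python standard library on made-up labels; the sample data and variable names are assumptions for illustration. It also shows one detail worth keeping in mind: when the data is already a table of counts per category, the modal category is the label with the largest count, not the most frequent number in the table.

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical raw labels, one per observation.
labels = ["happy", "sad", "happy", "neutral", "happy", "sad"]

# Mode straight from the raw observations.
most_common_label = mode(labels)

# The same answer via a frequency table, together with its count.
top_label, top_count = Counter(labels).most_common(1)[0]

# When the data is already a table of counts per category,
# the mode is the category whose count is largest.
counts_table = {"angry": 5, "happy": 12, "sad": 7}
modal_category = max(counts_table, key=counts_table.get)

# multimode returns every value that ties for most frequent.
tied_modes = multimode(["a", "b", "a", "b", "c"])

print(most_common_label, top_label, top_count, modal_category, tied_modes)
```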
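Probability
The probability that an event will happen is how often the event occurs, as a fraction of all trials, when the test is repeated again and again.
- Formula:
\[\begin{equation} P(A) = \frac{\text{number of occurrences of } A}{\text{total number of observations}} \end{equation}\]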
The function to calculate probability:
import time
from statistics import StatisticsError


def find_prob(counts):
    """To find probability of emotion type to be predicted.

    counts : numpy.ndarray or pandas.Series
        number of entries each emotion type has

    >>> Time taken to calculate: 2.6702880859375e-05, Probability of each emotion type: [0.13801655195474685, 0.01524228829381113, 0.14269791289324826, 0.25048067545350683, 0.1693370858528158, 0.11151670521358709, 0.17270878033828405]

    prob : list
        probability of occurrence of each emotion.
    """
    try:
        print("Shape of selected numpy array: ", counts.shape)
        print("Data type of _data_numpy: ", type(counts))
        start = time.time()
        # Total number of entries across all emotion types.
        total = 0
        for row in counts:
            total += row
        # Each probability is that emotion's share of the total.
        probability = []
        for row in counts:
            p = row / total
            probability.append(p)
        print("Time taken to calculate: ", time.time() - start)
        return probability
    except StatisticsError as error:
        raise error
    except Exception as error:
        raise error
Function Call:
prob = find_prob(counts)
print("Probability of each emotion type: ", prob)
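The same calculation can be tried on a tiny input where the answer is easy to verify by hand. This is a minimal sketch; the helper name `relative_frequencies` and the sample counts are hypothetical and not part of the original code.

```python
import math


def relative_frequencies(counts):
    """Turn a list of counts into probabilities (relative frequencies)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return [count / total for count in counts]


# Hypothetical counts for three categories, 100 observations in total.
sample_counts = [20, 30, 50]
probs = relative_frequencies(sample_counts)

print("Probabilities:", probs)
# A valid set of probabilities over all categories sums to 1.
print("Sum is 1:", math.isclose(sum(probs), 1.0))
```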
Expected Value
Expected Value is calculated by:
i) Multiply each outcome by its probability of occurrence.
ii) Sum these values.
- It is a form of weighted mean.
- This concept is basically used for categorical data.
- Formula:
\[\begin{equation} E[X] = \sum_{i} x_i f(x_i) \end{equation}\]
Here, \(x_1\), \(x_2\), ..., \(x_i\) are the data, \(f(x_i)\) are the probabilities of the outcomes, and \(E[X]\) represents the expected value.
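A fair six-sided die is a convenient worked example of the two steps: every outcome has probability 1/6, so the expected value is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. The sketch below uses exact fractions to avoid rounding; the helper name `expected_value` is an assumption made for illustration.

```python
from fractions import Fraction


def expected_value(outcomes, probabilities):
    """Multiply each outcome by its probability, then sum the products."""
    if len(outcomes) != len(probabilities):
        raise ValueError("each outcome needs exactly one probability")
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))


# Fair die: six outcomes, each with probability 1/6.
die_outcomes = [1, 2, 3, 4, 5, 6]
die_probs = [Fraction(1, 6)] * 6

die_ev = expected_value(die_outcomes, die_probs)
print("Expected value of a fair die:", float(die_ev))
```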
The function to calculate expected value:
import time
from statistics import StatisticsError


def find_ev(counts, prob):
    """Finds expected value.

    counts : numpy.ndarray or pandas.Series
        number of entries each emotion has
    prob : list
        probability of each emotion

    >>> Time taken to calculate: 3.528594970703125e-05
    >>> Expected Value of each emotion type: [683.5959818318612, 8.337531696714688, 730.7560119263244, 2251.570791651573, 1029.0614707275615, 446.2898542647755, 1070.4490205366844]
    >>> Total Expected value: 6220.060662635495

    ev : list
        expected value of each emotion
    total_ev : float
        expected value of complete sample
    """
    try:
        print("Shape of selected numpy array: ", counts.shape)
        print("Data type of _data_numpy: ", type(counts))
        start = time.time()
        # Weight each count by its probability.
        ev = []
        num = 0
        for row in counts:
            val = row * prob[num]
            ev.append(val)
            num += 1
        # Sum the weighted values to get the total expected value.
        total_ev = 0
        for row in ev:
            total_ev += row
        print("Time taken to calculate: ", time.time() - start)
        return ev, total_ev
    except StatisticsError as error:
        raise error
    except Exception as error:
        raise error
Function Call:
ev, total_ev = find_ev(counts, prob)
print("Expected Value of each emotion type: ", ev)
print("Total Expected value: ", total_ev)
Correlation
Correlation is a measure of how closely two variables vary together.
Correlation Coefficient
- It is used to measure the strength of correlation between two variables.
- Range: -1 to 1
- Values above 1 or below -1 indicate an error in the calculation.
- Negative correlation indicates that if one variable increases, the other decreases, and vice versa.
- Positive correlation indicates that the variables move together, i.e. if one increases, the other also increases, and vice versa.
- Zero correlation means that the samples show no linear relationship with each other.
Correlation Matrix
A table where variables are shown as rows and columns and the cell values are the correlations between the variables is known as a Correlation Matrix.
The function to calculate correlation matrix:
import time
from statistics import StatisticsError


def corr_matrix(df_all):
    """To find correlation matrix.

    df_all : dataframe

    >>> Time taken to calculate: 356.91664814949036

    corr : pandas.core.frame.DataFrame
        correlation matrix
    """
    try:
        start = time.time()
        # Pairwise correlation between every pair of columns.
        corr = df_all.corr()
        print("Time taken to calculate: ", time.time() - start)
        return corr
    except StatisticsError as error:
        raise error
    except Exception as error:
        raise error
Function Call:
corr = corr_matrix(df_all)
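To see what the resulting table looks like, here is a minimal sketch on a three-column DataFrame. The column names and values are invented for illustration; `DataFrame.corr()` uses the Pearson method by default.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls as x rises.
df_small = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

corr_small = df_small.corr()
print(corr_small.round(3))

# Each variable appears as both a row and a column, so the table is
# square, symmetric, and has 1 on the diagonal.
xy = corr_small.loc["x", "y"]
xz = corr_small.loc["x", "z"]
```

On a real dataset, columns that are not numeric may need to be dropped or encoded before calling `corr()`.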
Correlation Plot
import time

import matplotlib.pyplot as plt
import seaborn as sns


def corr_plot(corr, cmap_ar):
    """To plot correlation.

    corr : dataframe
        of correlation matrix
    cmap_ar : str
        type of heatmap to be plotted

    >>> Enter the cmap style from the above arguments: 7
    >>> Color plotted: Blues_r
    >>> Time taken to plot: 6.074030876159668
    """
    try:
        start = time.time()
        plt.figure(figsize=(16, 16))
        # Fix the colour scale to the full correlation range, -1 to 1.
        sns.heatmap(corr, vmin=-1, vmax=1, cmap=cmap_ar)
        plt.title('Heatmap correlation')
        print("Time taken to plot: ", time.time() - start)
        # Needed to display the figure when run outside a notebook.
        plt.show()
    except AttributeError as error:
        print("Attribute Error Occurred.")
        print("The error is ", error)
    except ValueError as error:
        print("Value Error Occurred.")
        print("The error is ", error)
Function Call:
val = ['coolwarm', sns.diverging_palette(20, 220, as_cmap=True), 'Blues', 'YlGnBu', 'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 'flare_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'twilight', 'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 
'winter_r']
n = input("Enter the cmap style from the above arguments: ")
n = int(n)
cmap_ar = val[n]
print("Color plotted: ", val[n])
corr_plot(corr, cmap_ar)
Pearson’s Correlation Coefficient
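Steps to calculate:
- Calculate the covariance of the two given variables.
- Calculate the standard deviation of both variables.
- Substitute these values into the formula:
\[\begin{equation} r = \frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \end{equation}\]
Here, \(\bar{x}\) and \(\bar{y}\) are the sample means, \(s_x\) and \(s_y\) are the sample standard deviations, and \(n\) is the number of observations.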
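The steps above can be sketched in plain Python as follows. The function name `pearson_r` and the sample data are assumptions for illustration; in practice a library routine such as `DataFrame.corr()` would normally be used instead.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the sum of cross-deviations and the sample
    standard deviations, following the formula term by term."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Numerator: sum of products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when a variable is constant")

    return cross / ((n - 1) * s_x * s_y)


# Hypothetical sample: the deviations multiply to 6 in total, and the
# squared deviations sum to 10 and 6, so r = 6 / sqrt(60).
x_sample = [1, 2, 3, 4, 5]
y_sample = [2, 4, 5, 4, 5]
r_sample = pearson_r(x_sample, y_sample)
print("Pearson r:", round(r_sample, 4))
```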
Do visit my GitHub to view complete code!
Let me know what you think of this article on Twitter @khushi__411 or leave a comment below!