Exploratory Statistical Analysis
The process of analysing data so as to take out some important characteristics from it, is known as exploratory data analysis. Performing data analysis through statistical methods like mode, probability, expected value, correlation etc is known as exploratory statistical analysis.
Exploring Binary and Categorical Data
Binary Data: Data which can take only two possible values, 0 or 1.
Categorical Data: Data which has been divided into categories or groups according to their features.
Mode
- Modeis the value that appear most offten in dataset.
- It is a summay statistic for categorical data i.e. grouped data or set of values representing a possible categories.
- It is not used for numeric data i.e. data expressed in number scale.
The function to calculate model:
def find_mode(counts, n):
  """Finds model of sample.
  This function consists of two method to calculate:
  * without library function
  * via library function
  Parameters
  ----------
  counts : list
           number of entries each emotion type has
  n : str
      select method to calucate iqr range
  Example
  -------
  >>> for n = 0, Time taken to calculate:  0.0002598762512207031, Mode of sample:  [3, 8989]
  Returns
  -------
  mode : list
         emotion along with its number of values
"""
  try:
    n = int(n)
    print("Shape of selected numpy array: ", counts.shape)
    print("Data type of _data_numpy: ", type(counts))
    if n is 0:
      start = time.time()
      max = counts[0]
      num = 0
      for row in counts:
        num +=1
        if row > max:
          mode = num-1
          max = row
      list = [mode, max]
      print("Time taken to calculate: ", time.time() - start)
      return list
    if n is 1:
   
      start = time.time()
      mode = counts.mode(dropna=True)
      print("Time taken to calculate: ", time.time() - start)
      return mode
    if n is not 0 or n is not 1:
      print("Enter correct value")
  
  except StatisticsError as error:
    raise error
  
  except Exception as error:
    print(error)
    raise error
Function Call:
n = input("Enter 0 (To find mode without any library function), Enter 1 (Via using Library Function): ")
mode = find_mode(counts, n)
print("Mode of sample: ", mode)
Probability
- Probabilityof an event will happen is how likly the event occurs again and again if tested again and again.
- Formula:
The function to calculate probability:
def find_prob(counts):
  
  """To find probability of emotion type to be predicted.
  Parameters
  ----------
  counts : list
           number of entries each emotion type has
  Example
  -------
  >>> Time taken to calculate:  2.6702880859375e-05, Probability of each emotion type:  [0.13801655195474685, 0.01524228829381113, 0.14269791289324826, 0.25048067545350683, 0.1693370858528158, 0.11151670521358709, 0.17270878033828405]
  
  Returns
  -------
  prob : list
         probability of ouccurance of each emotion.
         
  """
  try:
    print("Shape of selected numpy array: ", counts.shape)
    print("Data type of _data_numpy: ", type(counts))
    start = time.time()
    sum = 0
    for row in counts:
      sum += row
    probability = []
    for row in counts:
      p = row/sum
      probability.append(p)
    print("Time taken to calculate: ", time.time() - start)
    return probability
  
  except StatisticsError as error:
    raise error
  
  except Exception as error:
    print(error)
    raise error
    
Function Call:
prob = find_prob(counts)
print("Probability of each emotion type: ", prob)
Expected Value
- 
    Expected Valueis calculated by:i) Multiply each outcome by its probability of occurrence. ii) Sum these values. 
- It is a form of weighted mean.
- This concept is basically used for categorical data.
- Formula:

Here, \(x_1\), \(x_2\), ……, \(x_i$4 are data and f(\)x_i$$) are the probability of outcomes and E[X]: Represents Expected Value
The function to calculate expected value:
def find_ev(counts, prob):
  """Finds expected value.
  Parameters
  ----------
  counts : list
           number of entries each emotion has
  prob : list
         probability of each emotion
  Example 
  -------
  >>> Time taken to calculate:  3.528594970703125e-05
  >>> Expected Value of each emotion type:  [683.5959818318612, 8.337531696714688, 730.7560119263244, 2251.570791651573, 1029.0614707275615, 446.2898542647755, 1070.4490205366844]
  >>> Total Expected value:  6220.060662635495
  Returns
  -------
  ev : list
       expected value of each emotion 
  total_ev : float
             expected value of comeplete sample
  """
  try:
    print("Shape of selected numpy array: ", counts.shape)
    print("Data type of _data_numpy: ", type(counts))
    start = time.time()
    ev = []
    num = 0
    for row in counts:
      val = row * prob[num]
      num += 1
      ev.append(val)
    total_ev = 0
    for row in ev:
      total_ev += row
    
    print("Time taken to calculate: ", time.time() - start)
    return ev, total_ev
  except StatisticsError as error:
    raise error
  
  except Exception as error
    print(error)
    raise error
    
Function Call:
ev, total_ev = find_ev(counts, prob)
print("Expected Value of each emotion type: ", ev)
print("Total Expected value: ", total_ev)
Correlation
Definition
Correlation is the measure which is used to know how similar the two variables are.
Correlation Coefficient
- It is used to measure the strength of correlation between two variables.
- Range: 1 to -1
- Values above 1 or less than -1 are concluded as error in calculation.
- Negative Correlationindicates that if one variable increases the other decreases or vice versa.
- Positive Correlationindicates that both are similar i.e. if one increases the other one also increases and vice versa.
- 0correlation means that the samples shows no relation between each other.

Correlation Matrix
A table where variables are shown as rows and columns and the cell values are the correlations between the variables is known as Correlation Matrix.
The function to calculate correlation matrix:
def corr_matrix(df_all):
  """To find correlation matrix.
  Parameters
  ----------
  df_all : dataframe
           
  Example
  -------
  >>> Time taken to calculate:  356.91664814949036
  Returns
  -------
  corr : pandas.core.frame.DataFrame
         correlation matrix
  """
  try:
    start = time.time()
    corr = df_all.corr() 
    print("Time taken to calculate: ", time.time() - start)
    return corr
  except StatisticsError as error:
    raise error
  
  except Exception as error:
    print()
    print(error)
    raise error
Function Call:
corr = corr_matrix(df_all)
print(corr)
Output:

Correlation Plot
def corr_plot(corr, cmap_ar):
  """To plot correlation.
  Parameters
  ----------
  corr : dataframe
         of correlation matrix
  
  cmap_ar : str
            type of heatmap to be plotted
  
  Example
  -------
  >>> Enter the cmap style from the above arrguments: 7
  >>> Color plotted:  Blues_r   
  >>> Time taken to plot:  6.074030876159668
  Returns
  -------
  Graph
  """
  try:
    start = time.time()
    plt.figure(figsize=(16, 16))
    sns.heatmap(corr, vmin=-1, vmax=1, cmap=cmap_ar) 
    sns.set(font_scale=2,style='white')
    plt.tight_layout()
    
    plt.title('Heatmap correlation')
    plt.show()
    print("Time taken to plot: ", time.time() - start)
  except AttributeError as error:
    print("Attribute Error Occured.")
    print("The error is ", error)
    
  except ValueError as error:
    print("Value Error Occured.")
    print("The error is ", error)
    
Function Call:
val = ['coolwarm', sns.diverging_palette(20, 220, as_cmap=True), 'Blues', 'YlGnBu', 'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 'flare_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'twilight', 'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 'winter_r']
n = input("Enter the cmap style from the above arrguments: ")
n = int(n)
cmap_ar = val[n]
print("Color plotted: ", val[n])
corr_plot(corr, cmap_ar)
Output:

Pearson’s Correlation Coefficient
Steps to calculate:
- Calculate covariance of the given two varaibles.
- Calculate standard deviation of both variables.
- Put in formula.
Formula:
\[\begin{equation} r = \frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_xs_y} \end{equation}\]Check out my other posts related to topic:
Do visit my GitHub to view complete code!
Let me know what you think of this article on Twitter @khushi__411 or leave a comment below!