Statistics

Descriptive Statistics : Provides statistical information about whole sets of data or populations.

Inferential Statistics : Draws conclusions about a whole population from a sample taken from that population.

Collected Data types

  • Time Series : Data collected over a period of time. (eg: same guy at different age)
  • Cross Sectional : Data collected at one point in time. (eg: different guys at one point in time)
  • Pooled : Combination of cross sectional and Time series data.  (eg: different guys at different age)
  • Longitudinal / Panel : Like pooled, but the same units are followed over time (eg: the same guys at different ages)

Data Scale

  • Quantitative
    • Discrete : Integers
    • Continuous : Real Numbers
  • Categorical
    • Ordinal scale : Natural order exists (eg. high, medium, low)
    • Nominal scale : No natural order (eg. Names)

Sampling 

Simple Random Sample : Every element in the population has an equal probability of being included in the sample ie a sample with no bias.

Stratified Sampling : The population is divided into separate groups called strata. Then a probability sample is drawn from each group.

Cluster Sampling : The total population is divided into clusters and a simple random sample of the groups is selected.
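
The three sampling schemes above can be sketched with Python's standard random module. This is only an illustration: the population, the group names A/B/C and the helper function names are made up for the example.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Draw k elements; every element is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, seed=0):
    """Draw a simple random sample of size k_per_stratum from EVERY stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population: 30 people split into 3 groups of 10.
groups = {
    "A": list(range(0, 10)),
    "B": list(range(10, 20)),
    "C": list(range(20, 30)),
}
everyone = [person for members in groups.values() for person in members]

srs = simple_random_sample(everyone, 6)   # 6 people from the whole population
strat = stratified_sample(groups, 2)      # 2 people from each group
clust = cluster_sample(groups, 1)         # 1 whole group

print(srs, strat, clust, sep="\n")
```

Note the difference: stratified sampling takes a few elements from every group, cluster sampling takes every element from a few groups.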

 

Measures of Central Tendency

Mean : μ (population) | x̄ (sample) | x̄ = (1/n) ∑ xi, summed over i = 1 to n
The sum total of units divided by the number of units. [Average]

Median : The middle / midpoint value in a sorted sequence. It divides the data 50(more):50(less). [Middle]

Mode : The most commonly occurring value [Frequent]
Two modes = Bimodal ; Multiple modes = Multimodal

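Q) Why do we use the mean most of the time ?
A) It takes every value into account; the median and mode ignore most of the values.

Q) When do we use the median ?
A) When the data contain outliers or are skewed: the median is barely affected by extreme values, while the mean is pulled towards them.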
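
A small sketch of the point about outliers, using Python's standard statistics module. The salary figures are invented for illustration.

```python
import statistics

salaries = [30, 32, 35, 35, 40]       # hypothetical values, in thousands
with_outlier = salaries + [500]       # the same data plus one extreme value

mean_before = statistics.mean(salaries)        # 34.4
median_before = statistics.median(salaries)    # 35
mode_before = statistics.mode(salaries)        # 35

mean_after = statistics.mean(with_outlier)     # 112, pulled up by the outlier
median_after = statistics.median(with_outlier) # still 35

# multimode returns every most-frequent value, so it also handles bimodal data.
modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

print(mean_before, median_before, mode_before)
print(mean_after, median_after)
print(modes)
```

One extreme value more than triples the mean here, while the median does not move at all.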

 

X = set of population elements    x = set of sample elements
N = population size                          n = sample size
μ = population mean                        x̄ = sample mean
σ = population std dev                    s = sample std dev

 

Normal Distribution Data

68.3 % = μ ± 1σ
95.4 % = μ ± 2σ
99.7 % = μ ± 3σ
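
These percentages can be checked numerically. A minimal sketch, assuming Python 3.8+ for statistics.NormalDist; the variable names are just for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the data lying within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"mu +/- {k} sigma covers {share:.2%} of the data")
```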

Population – our entire group of interest. || Parameter – numeric summary about a population

Sample – subset of the population || Statistic – numeric summary about a sample

 

Measures of Dispersion/Spread

Range : Max – Min value

Percentile : Divides the data into 100 equal parts using 99 points

Decile : Divides the data into 10 equal parts using 9 points

Quartile : Divides the data into 4 equal parts using 3 points (25%, 50%, 75%)

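Median is the 2nd quartile (50%). Interquartile range (IQR) = Q3 – Q1, ie the spread of the middle 50% of the data (from 25% to 75%).
Boxplot whiskers extend at most 1.5 × IQR beyond Q1 and Q3; points outside that limit are flagged as outliers.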
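
A sketch of the quartile, IQR and 1.5 × IQR rule with the standard statistics module. The data are invented, and method="inclusive" is one of several quartile conventions, so other tools may give slightly different cut points.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 40]   # hypothetical values; 40 looks suspicious

# n=4 returns the 3 cut points that split the data into 4 equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier, as in a boxplot.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```

Deciles and percentiles follow the same pattern with n=10 and n=100.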

Standard Deviation : σ (population) | s (sample) |  [Unit = same as data]

Measures how far the data are dispersed from the mean, ie how much the members of a group differ from the group's mean value. It is the square root of the variance.

Variance : σ² (population) | s² (sample)

Indicates how spread out the data are. [Unit = data squared]

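Total Error = Sum of deviations from the mean = ∑ (xi – x̄), which is always 0 because positive and negative deviations cancel out; this is why the deviations are squared.

Sum of squared error (SSE) = ∑ (xi – x̄)²

Variance = SSE ÷ (n – 1) when estimating the population variance from a sample (s²)

Variance = SSE ÷ N when the data are the whole population (σ², with μ in place of x̄)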
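
The chain from deviations to variance to standard deviation, worked by hand on a small invented dataset and then cross-checked against the standard statistics module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical values
n = len(data)

mean = sum(data) / n                        # 5.0
total_error = sum(x - mean for x in data)   # 0.0, deviations cancel out
sse = sum((x - mean) ** 2 for x in data)    # 32.0

var_population = sse / n                    # 4.0, data treated as the whole population
var_sample = sse / (n - 1)                  # larger: estimate of the population variance
std_population = math.sqrt(var_population)  # 2.0, back in the units of the data

print(total_error, sse, var_population, var_sample, std_population)

# The library functions apply the same two divisors.
print(statistics.pvariance(data), statistics.variance(data))
```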

Standard Error (of the mean) (SEM) : The standard deviation of all sample means (of a given size n). It tells how precisely a sample mean estimates the population mean. [see Central Limit Theorem]

SEM = σ ÷ √n (in practice s replaces σ when the population std dev is unknown)

Confidence Interval = x̄ ± [(z-score for the confidence level) × standard error]     Eg : z ≈ 1.96 for a 95% confidence level
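
A sketch of a 95% confidence interval built from the standard error. The measurements are invented, and s is used in place of σ. For a sample this small a t-value would normally replace the z-score; z is kept here only to mirror the formula above.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample of 10 measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]
n = len(sample)

x_bar = mean(sample)
sem = stdev(sample) / math.sqrt(n)     # standard error of the mean, s / sqrt(n)

confidence = 0.95
# Two-sided interval: put half of the leftover 5% in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

ci_low = x_bar - z * sem
ci_high = x_bar + z * sem

print(f"mean = {x_bar:.3f}, SEM = {sem:.3f}, z = {z:.3f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```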

Z Scores : z is a unit of measure that is equivalent to the number of standard deviations a value is away from the mean value.

Z = (Data Point – Mean) / Standard Deviation     Eg :  if z = 1.79 , then it is 1.79 σ away from the mean

 

Distributions

Binomial Distribution : Discrete distribution of the number of successes in n independent trials, where each trial has only 2 possible states (typically 0 and 1) and the same success probability p.
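
As an illustration, the probability of exactly k successes in n trials is C(n, k) · p^k · (1 – p)^(n – k). A minimal sketch with the standard math module; the function name binomial_pmf and the coin-flip numbers are just for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical experiment: 10 fair coin flips, counting heads.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(heads = {k:2d}) = {prob:.4f}")
```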

Normal Distribution (Gaussian Distribution) (Continuous Distribution) : data distributed symmetrically around the centre (skewness = 0, excess kurtosis = 0).
Mean = Median = Mode. Also known as the bell curve.

Uniform Distribution : Every value in the range is equally likely, so the curve is flat

Skewed Distribution : data whose peak sits towards one end with a longer tail on the other side, as opposed to the central peak of a normal distribution (skewness: lack of symmetry)

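Kurtosis : pointiness of the curve (strictly, how heavy its tails are), usually reported relative to the normal distribution as excess kurtosis

Positive excess kurtosis = leptokurtic ||  Negative excess kurtosis = platykurtic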
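
Skewness and excess kurtosis can be computed from the central moments of the data. The sketch below uses the simple population (moment) formulas; libraries often apply additional small-sample corrections, so their numbers may differ slightly. The datasets are invented.

```python
def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    """Third standardized moment: 0 for symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal curve scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # flat and symmetric
right_skewed = [1, 1, 1, 2, 10]    # long tail to the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The flat, symmetric dataset has zero skewness and negative excess kurtosis (platykurtic), while the second dataset has positive skewness because of its long right tail.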

Data dist type        | Measure of Central Tendency | Measure of spread (variation)
Normal                | Mean                        | Standard Deviation
Non normal (skewed)   | Median                      | Range, Percentile & IQR

Standardization : any dataset with any mean & std deviation can be converted to a dataset with mean = 0 & std dev = 1.

This is done by replacing every value with its z-score: z = (x – x̄) ÷ s
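
A minimal sketch of this conversion with the standard statistics module. The function name standardize and the scores are made up for the example.

```python
import statistics

def standardize(data):
    """Replace every value with its z-score, z = (x - mean) / std dev."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)      # sample standard deviation
    return [(x - x_bar) / s for x in data]

scores = [55, 60, 65, 70, 75, 80, 85]   # hypothetical exam scores
z_scores = standardize(scores)

print([round(z, 2) for z in z_scores])
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```

Whatever the original units were, the standardized values have mean 0 and std dev 1 (up to floating-point rounding), so datasets on different scales become comparable.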

Central Limit Theorem

The distribution of the sample means tends towards a normal distribution as the sample size n increases, whatever the shape of the population distribution.

  • Take a random population and plot its distribution (assuming distribution ≠ normal distribution)
  • Samples of constant size n are to be taken from the population  (n1 = n2 = n3)
  • Take a random sample of size n1 from the population and calculate its mean x̄1
  • Take a random sample of size n2 from the population and calculate its mean x̄2
  • Plot the means x̄1, x̄2, x̄3
  • The sampling distribution (plot of x̄1, x̄2, x̄3) tends to be a normal distribution as n grows, and the plot shows this more clearly as the number of samples increases
  • The variance of the sampling distribution = The variance of the population / sample size

σx̄² = σ² / n

  • With a higher value of n, the sampling distribution tends to have a lower variance (a narrower, taller curve), and vice-versa

The standard deviation of the sample means is called the Standard Error of the Mean.
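
The steps above can be tried out as a small simulation. This is only a sketch: the exponential population, the seed, the sample sizes and the number of repeats are arbitrary choices made for the illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical non-normal population: exponential, strongly right-skewed.
population = [rng.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)

def sample_means(n, repeats=2_000):
    """Draw `repeats` random samples of size n and return their means."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

means_small = sample_means(5)    # sampling distribution for n = 5
means_large = sample_means(50)   # sampling distribution for n = 50

for n, means in ((5, means_small), (50, means_large)):
    observed = statistics.pvariance(means)
    predicted = pop_var / n      # variance of the population / sample size
    print(f"n={n:3d}  variance of sample means={observed:.4f}  predicted={predicted:.4f}")
```

A histogram of means_large would look close to a bell curve centred on the population mean even though the population itself is skewed, and its variance is roughly a tenth of that of means_small.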