Statistics

Descriptive Statistics : Provides statistical information about whole sets of data or populations.

Inferential Statistics : Draws conclusions about a population from a sample taken from that population.

Collected Data types

Time Series : Data collected over a period of time. (eg: same guy at different age)
Cross Sectional : Data collected at one point in time. (eg: different guys at one point in time)
Pooled : Combination of cross sectional and Time series data. (eg: different guys at different age)
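Longitudinal / Panel : The same units observed repeatedly over time. Closely related to pooled data, which may use different units in each period.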

Data Scale

Quantitative
- Discrete : Integers
- Continuous : Real Numbers
Categorical
- Ordinal scale : Natural order exists (eg. high, medium, low)
- Nominal scale : No natural order (eg. Names)

Sampling

Simple Random Sample : Every element in the population has an equal probability of being included in the sample ie a sample with no bias.

Stratified Sampling : The population is divided into separate groups called strata. Then a probability sample is drawn from each stratum.

Cluster Sampling : The total population is divided into clusters and a simple random sample of the clusters is selected.
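
The three sampling schemes above can be sketched with Python's standard library. The population, the "region" field, the 10% sampling rate, and the cluster size of 10 are all made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, 60 in the north and 40 in the south.
population = [{"id": i, "region": "north" if i < 60 else "south"}
              for i in range(100)]

# Simple random sample: every element has the same chance of being picked.
simple = rng.sample(population, 10)

# Stratified sample: split by region, then sample inside each stratum
# (assumption: proportional allocation, 10% of every stratum).
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = []
for members in strata.values():
    stratified.extend(rng.sample(members, len(members) // 10))

# Cluster sample: split into 10 clusters of 10, then pick 2 whole clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen = rng.sample(clusters, 2)
cluster_sample = [person for cluster in chosen for person in cluster]

print(len(simple), len(stratified), len(cluster_sample))
```

Note that the stratified sample always contains 6 northerners and 4 southerners, while the simple random sample only matches that split on average.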

Measures of Central Tendency

Mean : μ (population) | x̄ (sample) | x̄ = (1/n) ∑ xi, for i = 1 … n
The sum of all values divided by the number of values. [Average]

Median : The middle / midpoint value in a sorted sequence. Half of the values lie above it and half below. [Middle]

Mode : The most commonly occurring value [Frequent]
Two modes = Bimodal ; Multiple modes = Multimodal

Q) Why do we use the mean most of the time?
A) It takes every value into account. The median and mode do not.

Q) When do we use the median?
A) When the data are skewed or contain outliers. The median depends only on the middle of the sorted values, so extreme values barely affect it.
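
A small sketch of the point made in the two answers above, using Python's statistics module. The numbers are invented for illustration: adding one extreme value moves the mean a lot and the median very little.

```python
import statistics

data = [2, 3, 3, 4, 5]          # made-up values
with_outlier = data + [100]     # same values plus one extreme value

mean_before = statistics.mean(data)              # 17 / 5 = 3.4
median_before = statistics.median(data)          # middle value = 3
mode_before = statistics.mode(data)              # most frequent value = 3

mean_after = statistics.mean(with_outlier)       # 117 / 6 = 19.5
median_after = statistics.median(with_outlier)   # average of 3 and 4 = 3.5

# multimode returns every most-frequent value, so it also covers bimodal data.
modes = statistics.multimode([1, 1, 2, 2, 3])    # [1, 2]

print(mean_before, mean_after)
print(median_before, median_after)
```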

X = set of population elements || x = set of sample elements
N = population size || n = sample size
μ = population mean || x̄ = sample mean
σ = population std dev || s = sample std dev

Normal Distribution Data

68.3 % = μ ± 1σ
95.5 % = μ ± 2σ
99.7 % = μ ± 3σ
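
The percentages above can be checked with statistics.NormalDist from the Python standard library. This is only a sanity check on a standard normal (μ = 0, σ = 1); the same shares hold for any μ and σ.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution lying within k standard deviations of the mean.
shares = {}
for k in (1, 2, 3):
    shares[k] = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within ±{k}σ: {shares[k]:.4f}")
```

The exact shares are roughly 0.6827, 0.9545 and 0.9973.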

Population – our entire group of interest. || Parameter – numeric summary about a population

Sample – subset of the population || Statistic – numeric summary about a sample

Measures of Dispersion/Spread

Range : Max – Min value

Percentile : Divides the data into 100 equal parts using 99 points

Decile : Divides the data into 10 equal parts using 9 points

Quartile : Divides the data into 4 equal parts using 3 points (25%, 50%, 75%)

Median is the 2nd quartile (50%). Interquartile range (IQR) : Q3 – Q1, the spread of the middle 50% of the data (25% to 75%)
Boxplot whiskers extend at most 1.5 × IQR beyond Q1 and Q3; points outside that limit are flagged as outliers
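
A minimal sketch of quartiles, IQR and the 1.5 × IQR outlier rule using statistics.quantiles. The data are made up, and the "inclusive" method is just one of several quartile conventions; other tools may give slightly different cut points.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # made-up values, 50 is suspicious

# n=4 asks for the 3 cut points that split the data into 4 parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boxplot-style fences: anything outside them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)
print(outliers)
```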

Standard Deviation : σ (population) | s (sample) | [Unit = same as data]

How far the data are dispersed from the mean, ie how much the members of a group differ from the group's mean value. It is the square root of the variance.

Variance : σ² (population) | s² (sample)

Indicates how spread out the data are. [Unit = data squared]

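Total Error = Sum of deviations from the mean = ∑ (xi – x̄). This is always 0 because positive and negative deviations cancel, which is why the deviations are squared.

Sum of squared error (SSE) = ∑ (xi – x̄)²

Variance = SSE ÷ (n – 1) when the data are a sample used to estimate the population variance (s²)

Variance = SSE ÷ N when the data are the entire population (σ², with deviations taken from μ)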
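
The quantities above, worked by hand and compared with the statistics module. The data are made up; pvariance divides by the number of values and variance divides by one less.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]            # made-up values, mean = 5
mean = sum(data) / len(data)

deviations = [x - mean for x in data]
total_error = sum(deviations)              # deviations cancel out: 0
sse = sum(d ** 2 for d in deviations)      # 9+1+1+1+0+0+4+16 = 32

var_divide_by_n = sse / len(data)          # 32 / 8 = 4
var_divide_by_n_minus_1 = sse / (len(data) - 1)   # 32 / 7

std_dev = math.sqrt(var_divide_by_n)       # 2, same unit as the data

print(total_error, sse, var_divide_by_n, var_divide_by_n_minus_1)
print(statistics.pvariance(data), statistics.variance(data))
```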

Standard Error of the Mean (SEM) : The standard deviation of the means of all samples of a given size. It measures how precisely a sample mean estimates the population mean. [see Central Limit Theorem]

SEM = σ ÷ √n (when σ is unknown it is estimated by s)

Confidence Interval = mean ± [(z-score for the confidence level) × standard error] (eg: z ≈ 1.96 for 95%)
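
A sketch of the standard error and a 95% confidence interval for a made-up sample. It follows the z-based formula in these notes and estimates σ with s; for a sample this small a t-value is usually preferred, so treat the numbers as illustrative only.

```python
import math
import statistics
from statistics import NormalDist

# Made-up sample of 10 measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)          # sample std dev (divides by n - 1)
sem = s / math.sqrt(n)                # standard error of the mean

# z-score for 95% confidence: 2.5% in each tail, so look up the 97.5% point.
z = NormalDist().inv_cdf(0.975)       # about 1.96

ci_low = mean - z * sem
ci_high = mean + z * sem
print(f"mean={mean:.3f} sem={sem:.3f} 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```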

Z Scores : z is a unit of measure that is equivalent to the number of standard deviations a value is away from the mean value.

z = (Data Point – Mean) / Standard Deviation
Eg : if z = 1.79, the value is 1.79 σ above the mean

Distributions

Binomial Distribution : Discrete distribution of the number of successes in n independent trials, where each trial has only 2 outcomes (typically coded 0 and 1) and the same success probability p.
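
As an illustration, a binomial variable counts the successes across a fixed number of independent two-outcome trials. The sketch below computes its probabilities with the usual formula C(n, k) · p^k · (1 – p)^(n – k); the function name binomial_pmf and the values n = 10, p = 0.5 are chosen here for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example: 10 fair coin flips, probability of each possible number of heads.
n, p = 10, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(probs[5])      # most likely outcome: 5 heads, 252 / 1024
print(sum(probs))    # all outcomes together add up to 1
```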

Normal Distribution (Gaussian Distribution) (Continuous Distribution) : data distributed symmetrically around the centre (skewness = 0, excess kurtosis = 0).
Mean = Median = Mode. Also known as the bell curve.

Uniform Distribution : Every value in the range is equally likely, so the curve is flat

Skewed Distribution : data whose peak sits towards one end with a longer tail on the other side, as opposed to the central peak of the normal distribution (skewness: lack of symmetry)

Kurtosis : pointiness of the curve; more precisely, how heavy the tails are compared with a normal distribution

Positive excess kurtosis = leptokurtic || Negative excess kurtosis = platykurtic
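
Skewness and kurtosis can be computed from central moments. The sketch below uses the simple population-moment definitions (no small-sample correction) and reports excess kurtosis, ie kurtosis minus 3, so that a normal distribution scores 0. The helper names and data are made up for illustration.

```python
def central_moment(data, k):
    """Average of (x - mean)^k over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third standardized moment: 0 for symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (a normal distribution gives 0)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # evenly spread, no tail
right_skewed = [1, 1, 1, 2, 10]    # long tail to the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The evenly spread data come out with zero skewness and negative excess kurtosis (platykurtic), while the second dataset has positive skewness.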

Data dist type	Measure of Central Tendency	Measure of spread (variation)
Normal	Mean	Standard Deviation
Non normal (skewed)	Median	Range, Percentile & IQR

Standardization : converts a dataset with any mean & std deviation into a dataset with mean = 0 & std dev = 1.

Can be done using z-scores: z = (x – x̄) ÷ s
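
A quick check of the z-score conversion on made-up data: after the transformation the values have mean 0 and sample std dev 1 (up to floating-point rounding).

```python
import statistics

data = [10, 12, 14, 16, 18]            # made-up values

mean = statistics.mean(data)           # x̄ = 14
s = statistics.stdev(data)             # sample std dev

# z = (x - x̄) / s for every value
z_scores = [(x - mean) / s for x in data]

print(z_scores)
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```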

Central Limit Theorem

The distribution of the sample means tends towards a normal distribution as the sample size n increases, whatever the shape of the population distribution (provided its variance is finite)

Take a random population and plot its distribution (assuming distribution ≠ normal distribution)

Samples of constant size n are to be taken from the population (n1 = n2 = n3 …)
Take a random sample of size n1 from the population and calculate its mean x̄1
Take a random sample of size n2 from the population and calculate its mean x̄2
Plot the means x̄1, x̄2, x̄3 …
The sampling distribution (plot of x̄1, x̄2, x̄3 …) fills in as more samples are taken, and its shape tends to a normal distribution as the sample size n increases
The variance of the sampling distribution = The variance of the population / sample size

σx̄² = σ² / n

With a higher value of n, the sampling distribution has a lower variance (a narrower, taller peak around μ), and vice-versa

The standard deviation of the sample means is called the Standard Error of the Mean.
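
The procedure above can be simulated in a few lines. This is a rough sketch under stated assumptions: the population is 20,000 draws from an exponential distribution (clearly not normal), samples are drawn with replacement, and 2,000 samples are taken per sample size. The function name sample_means and all sizes are chosen here for illustration.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is repeatable

# A non-normal population: exponential, heavily skewed to the right.
population = [rng.expovariate(1.0) for _ in range(20_000)]
pop_var = statistics.pvariance(population)

def sample_means(n, num_samples=2000):
    """Draw num_samples samples of size n and return their means."""
    return [statistics.fmean(rng.choices(population, k=n))
            for _ in range(num_samples)]

# Compare the observed variance of the sample means with σ² / n.
variance_of_means = {}
for n in (5, 30, 100):
    means = sample_means(n)
    variance_of_means[n] = statistics.pvariance(means)
    print(n, round(variance_of_means[n], 4), round(pop_var / n, 4))
```

The observed variance of the means should sit close to σ² / n for each n, and a histogram of the means looks more bell-shaped as n grows.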