Descriptive Statistics : Provides statistical information about whole sets of data or populations.
Inferential Statistics : Draws conclusions about a population from a sample taken from that population, since the whole population is usually not observed.
Collected Data types
- Time Series : Data collected over a period of time. (eg: same guy at different age)
- Cross Sectional : Data collected at one point in time. (eg: different guys at one point in time)
- Pooled : Combination of cross sectional and Time series data. (eg: different guys at different age)
- Longitudinal / Panel : Pooled data in which the same units are tracked over time (eg: the same group of guys at different ages)
Data Scale
- Quantitative
  - Discrete : Integers
  - Continuous : Real Numbers
- Categorical
  - Ordinal scale : Natural order exists (eg. high, medium, low)
  - Nominal scale : No natural order (eg. Names)
Sampling
Simple Random Sample : Every element in the population has an equal probability of being included in the sample ie a sample with no bias.
Stratified Sampling : The population is divided into separate groups called strata. Then a probability sample is drawn from each stratum.
Cluster Sampling : The total population is divided into clusters and a simple random sample of the clusters is selected; the elements of the chosen clusters form the sample.
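The three sampling schemes can be compared side by side in a short sketch. The population, the `region` grouping variable and the sample sizes below are made-up assumptions for illustration only.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, each tagged with one of 4 regions.
regions = ["north", "south", "east", "west"]
population = [{"id": i, "region": regions[i % 4]} for i in range(100)]

# Simple random sample: every element is equally likely to be picked.
simple_sample = rng.sample(population, k=12)

# Group the population by region (used as strata and as clusters below).
groups = defaultdict(list)
for person in population:
    groups[person["region"]].append(person)

# Stratified sample: draw a random sample from EVERY stratum.
stratified_sample = [
    person for members in groups.values() for person in rng.sample(members, k=3)
]

# Cluster sample: randomly pick whole clusters, then keep everyone in them.
chosen_clusters = rng.sample(sorted(groups), k=2)
cluster_sample = [person for name in chosen_clusters for person in groups[name]]

print(len(simple_sample), len(stratified_sample), len(cluster_sample))
```

Note the difference: the stratified sample contains members of all four regions, while the cluster sample contains only the two regions that were drawn.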
Measures of Central Tendency
Mean : μ (population) | x̄ (sample) | x̄ = (1/n) ∑ xi, for i = 1 … n
The sum of all values divided by the number of values. [Average]
Median : The middle / midpoint value in a sorted sequence. It divides the data 50(more):50(less). [Middle]
Mode : The most commonly occurring value [Frequent]
Two modes = Bimodal ; Multiple modes = Multimodal
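Q) Why do we use the mean most of the time?
A) It takes every value into account. The median and mode depend only on the middle or the most frequent values.
Q) When do we use the median?
A) When the data are skewed or contain outliers. Extreme values pull the mean towards them but barely move the median.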
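A small sketch of how an outlier affects each measure, using Python's standard `statistics` module. The salary figures are invented for illustration.

```python
import statistics

# Hypothetical salaries (in thousands); the second list adds one outlier.
salaries = [30, 32, 32, 35, 36]
with_outlier = salaries + [400]

mean_before = statistics.mean(salaries)        # sum of values / number of values
median_before = statistics.median(salaries)    # middle value of the sorted data
mode_before = statistics.mode(salaries)        # most frequent value

mean_after = statistics.mean(with_outlier)     # pulled strongly towards 400
median_after = statistics.median(with_outlier) # barely moves

# multimode returns every most-frequent value, so it exposes bimodal data.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean_before, median_before, mode_before)
print(mean_after, median_after)
print(modes)
```

The mean jumps from 33 to roughly 94 after one extreme value is added, while the median only moves from 32 to 33.5.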
X = set of population elements || x = set of sample elements
N = population size || n = sample size
μ = population mean || x̄ = sample mean
σ = population std dev || s = sample std dev
Normal Distribution Data (share of values within k standard deviations of the mean)
68.3 % = μ ± 1σ
95.4 % = μ ± 2σ
99.7 % = μ ± 3σ
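These percentages can be checked numerically. For a normal distribution the share within μ ± kσ equals erf(k / √2); the helper name `share_within` is an assumption of this sketch.

```python
import math

def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within mu ± k * sigma."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mu ± {k} sigma covers {share_within(k):.2%} of the data")
```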
Population – our entire group of interest. || Parameter – numeric summary about a population
Sample – subset of the population || Statistic – numeric summary about a sample
Measures of Dispersion/Spread
Range : Max – Min value
Percentile : Divides the data into 100 equal parts using 99 points
Decile : Divides the data into 10 equal parts using 9 points
Quartile : Divides the data into 4 equal parts using 3 points (25%, 50%, 75%)
Median is the 2nd quartile (Q2, 50%). Interquartile range (IQR) = Q3 – Q1, the spread of the middle 50% of the data
Boxplot whiskers extend at most 1.5 × IQR beyond Q1 and Q3; points outside that limit are flagged as outliers
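A minimal sketch of the range, quartiles, IQR and the 1.5 × IQR outlier rule on made-up data. Quartile conventions differ between tools; this one assumes the "inclusive" method of `statistics.quantiles`, so other software may give slightly different cut points.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical values, 30 is suspicious

data_range = max(data) - min(data)

# n=4 splits the data into 4 parts using 3 cut points: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boxplot fences: anything beyond 1.5 * IQR from the box is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(data_range, (q1, q2, q3), iqr)
print(lower_fence, upper_fence, outliers)
```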
Standard Deviation : σ (population) | s (sample) | [Unit = same as data]
How far the data are dispersed from the mean, i.e. how much the members of a group typically differ from the group's mean value. It is the square root of the variance.
Variance : σ² (population) | s² (sample)
Indicates how spread out the data are. [Unit = data squared]
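Total Error = Sum of deviations from the mean = ∑ (xi – x̄), which is always 0 because positive and negative deviations cancel; hence the deviations are squared
Sum of squared errors (SSE) = ∑ (xi – x̄)²
Variance = SSE ÷ (n – 1) when estimating the population variance from a sample (s²)
Variance = SSE ÷ N when the data are the entire population (σ², with deviations taken from μ)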
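The chain from deviations to variance can be followed step by step on a small made-up dataset. The results are cross-checked against `statistics.pvariance` (divide by n) and `statistics.variance` (divide by n – 1).

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
n = len(x)
x_bar = sum(x) / n

deviations = [xi - x_bar for xi in x]
total_error = sum(deviations)           # positive and negative parts cancel
sse = sum(d ** 2 for d in deviations)   # sum of squared errors

pop_var = sse / n            # data treated as the whole population
sample_var = sse / (n - 1)   # data treated as a sample of a larger population
pop_std = math.sqrt(pop_var) # same unit as the data

print(total_error, sse, pop_var, sample_var, pop_std)
print(statistics.pvariance(x), statistics.variance(x))
```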
Standard Error (of the mean) (SEM) : The standard deviation of the means of all possible samples of a given size n. It measures how precisely a sample mean estimates the population mean. [see Central Limit Theorem]
SEM = σ ÷ √n (estimated by s ÷ √n when σ is unknown)
Confidence Interval = mean ± [(z-scores for confidence level)*standard error]
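A hedged sketch of the standard error and a 95% confidence interval built with a z-score. The measurements are invented, σ is assumed unknown so s stands in for it, and for a sample this small a t-score would normally be preferred over z.

```python
import math
import statistics

# Hypothetical sample of 10 measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]
n = len(sample)

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample std dev (n - 1 in the denominator)
sem = s / math.sqrt(n)         # standard error of the mean

# z-score for a 95% confidence level: 2.5% is left in each tail.
z = statistics.NormalDist().inv_cdf(0.975)

ci_low = x_bar - z * sem
ci_high = x_bar + z * sem
print(f"mean = {x_bar:.3f}, SEM = {sem:.3f}, z = {z:.2f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```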
Z Scores : A z-score is the number of standard deviations a value lies away from the mean.
Z = (Data Point – Mean) / Standard Deviation
Eg : if z = 1.79 , then the value is 1.79 σ above the mean
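A minimal sketch that reproduces z = 1.79. The exam-score numbers are made up, and the last step additionally assumes the scores are normally distributed.

```python
from statistics import NormalDist

def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std_dev

# Hypothetical exam: mean 68, standard deviation 10, one student scored 85.9.
z = z_score(85.9, mean=68.0, std_dev=10.0)

# Assuming normality, the share of scores below this value.
share_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, share below = {share_below:.1%}")
```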
Distributions
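Binomial Distribution : Discrete distribution of the number of successes in n independent trials, where each trial has only 2 possible outcomes (typically coded 0 and 1) and the same probability of success.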
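As a rough illustration, the probability of exactly k successes in n trials is C(n, k) · p^k · (1 – p)^(n – k). The function name and the coin-flip numbers below are assumptions of this sketch.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 10 flips of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
expected_successes = sum(k * prob for k, prob in enumerate(pmf))

print(pmf[5], sum(pmf), expected_successes)
```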
Normal Distribution (Gaussian Distribution) (Continuous Distribution) : data distributed symmetrically around the centre (skewness = excess kurtosis = 0).
Mean = Median = Mode. Also known as the bell curve.
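Uniform Distribution : Every value in the range is equally likely, so the distribution is flat with no central peak
Skewed Distribution : data whose peak sits towards one end with a longer tail on the other side, as opposed to the central peak of the normal distribution (skewness: lack of symmetry)
Kurtosis : pointiness of the curve and heaviness of its tails, usually reported as excess kurtosis relative to the normal distribution
Positive excess kurtosis = leptokurtic (sharper peak, heavier tails) || Negative excess kurtosis = platykurtic (flatter peak, lighter tails)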
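To make skewness and kurtosis concrete, the sketch below estimates both from simulated data. It assumes the moment-based definitions (third and fourth standardized moments, with 3 subtracted for excess kurtosis); the sample sizes and the choice of an exponential distribution as the skewed example are arbitrary.

```python
import math
import random

def standardized_moment(xs, power):
    """Average of ((x - mean) / std) ** power over the data."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / std) ** power for x in xs) / n

def skewness(xs):
    return standardized_moment(xs, 3)

def excess_kurtosis(xs):
    return standardized_moment(xs, 4) - 3.0  # a normal curve scores 0

rng = random.Random(0)  # fixed seed for repeatability
samples = {
    "normal": [rng.gauss(0, 1) for _ in range(50_000)],
    "uniform": [rng.uniform(0, 1) for _ in range(50_000)],
    "right-skewed": [rng.expovariate(1.0) for _ in range(50_000)],
}

for name, xs in samples.items():
    print(f"{name:>12}: skew = {skewness(xs):+.2f}, "
          f"excess kurtosis = {excess_kurtosis(xs):+.2f}")
```

The normal sample should land near 0 on both measures, the uniform sample should show negative excess kurtosis (platykurtic), and the exponential sample should show clearly positive skewness and excess kurtosis (leptokurtic).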
| Data dist type | Measure of Central Tendency | Measure of spread (variation) |
|---|---|---|
| Normal | Mean | Standard Deviation |
| Non normal (skewed) | Median | Range, Percentile & IQR |
Standardization : Any dataset, whatever its mean & std deviation, can be converted to a dataset with mean = 0 & std dev = 1.
This is done by replacing each value with its z-score: z = (x – x̄) ÷ s
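A short sketch of this conversion on made-up heights, confirming that the z-scores end up with mean 0 and std dev 1.

```python
import statistics

heights = [150.0, 160.0, 165.0, 170.0, 180.0]  # hypothetical data in cm

x_bar = statistics.mean(heights)
s = statistics.stdev(heights)  # sample std dev

# Replace every value by its z-score.
z_scores = [(x - x_bar) / s for x in heights]

print([round(z, 3) for z in z_scores])
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```

The z-scores are unit-free, which is what makes datasets measured on different scales comparable.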
Central Limit Theorem
The distribution of the sample means tends towards a normal distribution as the sample size n increases, whatever the shape of the population distribution (provided its variance is finite).
- Take a random population and plot its distribution (assuming distribution ≠ normal distribution)
- Samples of constant size n are to be taken from the population (n1 = n2 = n3 …)
- Take a random sample of size n1 from the population and calculate its mean x̄1
- Take a random sample of size n2 from the population and calculate its mean x̄2
- Plot the means x̄1, x̄2, x̄3 …
- The sampling distribution (plot of x̄1, x̄2, x̄3 …) approaches a normal distribution as the sample size n increases; taking more samples only makes the plot smoother
- The variance of the sampling distribution = The variance of the population / sample size
σx̄² = σ² / n
- With a higher value of n, the sampling distribution tends to have a lower variance (a narrower, taller curve), and vice-versa
The standard deviation of the sample means is called the Standard Error of the Mean (σx̄ = σ / √n).
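The procedure above can be simulated in a few lines. This is a sketch under assumptions: the population is an arbitrary right-skewed (exponential) one, samples are drawn with replacement so the draws stay independent, and the number of draws per sample size is chosen only to keep the run short.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for repeatability

# A clearly non-normal population (right-skewed).
population = [rng.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)

def sample_means(n, draws=2_000):
    """Means of `draws` random samples, each of constant size n."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]

for n in (2, 10, 50):
    means = sample_means(n)
    observed = statistics.pvariance(means)  # variance of the sample means
    predicted = pop_var / n                 # population variance / sample size
    print(f"n = {n:>2}: mean of means = {statistics.fmean(means):.3f}, "
          f"variance = {observed:.4f}, predicted = {predicted:.4f}")
```

The mean of the sample means stays close to the population mean for every n, while their variance shrinks roughly as σ² / n. Plotting a histogram of `means` for each n would show the shape moving from skewed towards a bell curve.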