Introduction to Statistics
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data, and it supports decision making grounded in evidence.
It provides tools to quantify uncertainty, summarize patterns, and draw conclusions about populations from samples.
Two main branches are:
- Descriptive Statistics: Summarizes and visualizes data (mean, median, distributions).
- Inferential Statistics: Uses samples to make generalizations about populations (confidence intervals, hypothesis tests).
Basic Concepts
Types of Data
- Qualitative / Categorical Data
- Quantitative / Numerical Data
- Discrete vs Continuous
Data type determines which statistical methods are valid and how to visualize results.
Measures of Central Tendency
Mean: μ = (Σx)/n
Median: Middle value
Mode: Most frequent value
Mean is sensitive to outliers, while median is robust for skewed distributions.
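A quick Python sketch of this outlier sensitivity (illustrative data):

```python
import statistics

# One extreme value pulls the mean but barely moves the median
data = [10, 20, 30, 40, 50]
with_outlier = data + [500]

print(statistics.mean(data), statistics.median(data))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Here the mean jumps from 30 to about 108, while the median only moves from 30 to 35.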
Measures of Dispersion
Variance: σ² = Σ(x-μ)² / n
Standard Deviation: σ = √σ²
Range: Max - Min
Dispersion measures describe variability; higher variance means data is more spread out.
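Note that the variance formula above divides by n (the population form); sample variance divides by n − 1. A short sketch of the difference using the standard library:

```python
import statistics

data = [10, 20, 30]

# Population variance: sigma^2 = sum((x - mu)^2) / n
print(statistics.pvariance(data))   # 200/3, about 66.67

# Sample variance divides by n - 1 (Bessel's correction)
print(statistics.variance(data))    # 200/2 = 100
```

Use the population form when the data is the entire group of interest; use the sample form when generalizing from a sample.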
Probability
Probability quantifies uncertainty and provides a framework for modeling random events.
P(Event) = Number of favorable outcomes / Total outcomes
Example: The probability of rolling a 3 on a fair six-sided die is 1/6.
Key concepts include independent vs dependent events, conditional probability, and Bayes’ theorem.
Probability Rules
- Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
- Multiplication Rule: P(A and B) = P(A) * P(B|A)
- Complement Rule: P(not A) = 1 - P(A)
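These rules can be checked directly on a small sample space; here a single fair die roll, with illustrative events A (even) and B (greater than 3):

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

def p(event):
    return Fraction(len(event), len(outcomes))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(p(A | B) == p(A) + p(B) - p(A & B))   # True
# Complement rule: P(not A) = 1 - P(A)
print(p(outcomes - A) == 1 - p(A))          # True
```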
Conditional probability is critical in real-world decision making:
P(A|B) = P(A and B) / P(B)
Bayes: P(A|B) = P(B|A)P(A) / P(B)
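A worked Bayes' theorem example with made-up screening numbers (all probabilities below are illustrative assumptions, not real data):

```python
# Assumed numbers: P(D) = 0.01, P(pos|D) = 0.95, P(pos|not D) = 0.05
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability: P(pos) = P(pos|D)P(D) + P(pos|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D|pos) = P(pos|D)P(D) / P(pos)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))   # 0.161
```

Despite a 95%-sensitive test, a positive result here implies only about a 16% probability of the condition, because the base rate is low.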
Statistical Distributions
Normal Distribution
Symmetrical bell-shaped curve, mean = median = mode.
PDF: f(x) = (1/(σ√(2π))) * e^(-(x-μ)²/(2σ²))
The normal distribution underpins many statistical methods due to the Central Limit Theorem.
Binomial Distribution
P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
Models the number of successes in fixed trials; common in A/B testing and yes/no outcomes.
Poisson Distribution
P(X=k) = (λ^k * e^(-λ)) / k!
Models event counts over a fixed interval; often used for arrival rates and rare events.
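Both PMFs can be evaluated with scipy (assuming scipy is available):

```python
from scipy.stats import binom, poisson

# Binomial: P(X = 3) for n = 10 trials, p = 0.5
print(binom.pmf(3, 10, 0.5))    # C(10,3) * 0.5^10 = 120/1024

# Poisson: P(X = 2) for rate lambda = 4
print(poisson.pmf(2, 4))        # 4^2 * e^(-4) / 2!
```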
Correlation & Regression
Correlation
r = Σ((X-meanX)(Y-meanY)) / √(Σ(X-meanX)² * Σ(Y-meanY)²)
- Measures strength & direction of relationship
Correlation ranges from -1 to +1 and captures linear association, not causation.
Simple Linear Regression
Y = a + bX
b = Σ((X-meanX)(Y-meanY)) / Σ(X-meanX)²
a = meanY - b*meanX
Regression models predict outcomes; evaluate fit using R² and residual analysis.
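The slope and intercept formulas above, together with R², can be computed directly in plain Python (toy data for illustration):

```python
# Toy data, chosen so the fit is imperfect and R^2 is informative
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# b = sum((X - meanX)(Y - meanY)) / sum((X - meanX)^2)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

# R^2 = 1 - SS_res / SS_tot
pred = [a + b * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print("slope:", b, "intercept:", a, "R^2:", r2)
```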
Advanced Statistics
Hypothesis Testing
Steps:
- Formulate null & alternative hypotheses
- Choose significance level (α)
- Compute test statistic (Z, t, χ²)
- Compare with critical value or p-value
- Reject or fail to reject H₀ (H₀ is never "accepted")
Focus on effect size and confidence intervals, not just p‑values.
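One common effect-size measure is Cohen's d; a minimal sketch on illustrative groups, using the simplified pooled standard deviation for equal-sized groups:

```python
import math
import statistics

group1 = [10, 12, 14, 16]
group2 = [13, 15, 17, 19]

# Pooled standard deviation (simplified form for equal-sized groups)
s1 = statistics.stdev(group1)
s2 = statistics.stdev(group2)
pooled_sd = math.sqrt((s1 ** 2 + s2 ** 2) / 2)

# Cohen's d: standardized difference between the two means
d = (statistics.mean(group2) - statistics.mean(group1)) / pooled_sd
print("Cohen's d:", round(d, 2))
```

A d near 0.2 is conventionally read as small, 0.5 as medium, and 0.8+ as large.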
ANOVA (Analysis of Variance)
F = Variance between groups / Variance within groups
Use ANOVA to compare means across 3+ groups; follow with post‑hoc tests if significant.
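With scipy, a one-way ANOVA takes one call (the groups below are illustrative):

```python
from scipy.stats import f_oneway

group_a = [85, 90, 88, 92]
group_b = [78, 82, 80, 84]
group_c = [91, 95, 93, 97]

# F = variance between groups / variance within groups
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print("F:", f_stat, "p:", p_value)
```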
Chi-Square Test
χ² = Σ((Observed - Expected)² / Expected)
Chi‑square tests independence in categorical data and goodness‑of‑fit to expected distributions.
Time Series & Forecasting
Analyze trends and seasonality in time-ordered data, and make predictions using moving averages or exponential smoothing.
Advanced forecasting uses ARIMA or exponential smoothing state space models when patterns are complex.
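A minimal sketch of both basic techniques on a toy series:

```python
series = [12, 15, 14, 18, 20, 19, 23, 25]

# 3-period simple moving average
window = 3
sma = [sum(series[i - window + 1:i + 1]) / window
       for i in range(window - 1, len(series))]

# Simple exponential smoothing with smoothing factor alpha = 0.3
alpha = 0.3
smoothed = [series[0]]
for value in series[1:]:
    smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])

print("SMA:", [round(v, 2) for v in sma])
print("SES:", [round(v, 2) for v in smoothed])
```

Larger windows (or smaller alpha) smooth more aggressively but react more slowly to real changes in the series.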
Descriptive Statistics
Descriptive Statistics is a branch of statistics that focuses on summarizing, organizing, and describing data. It helps us understand what the data shows without making predictions, answering questions such as:
- What is the average?
- How spread out is the data?
- Where do most values lie?
It focuses on summarizing data without making inferences about a larger population.
1. Types of Descriptive Statistics
- Measures of Central Tendency
- Measures of Dispersion
- Measures of Position
- Graphical Representation
2. Measures of Central Tendency
These describe the central value of the dataset.
🔹 Mean (Arithmetic Average)
Formula: Mean = Σx / n
The mean is sensitive to outliers; for skewed data, use median or trimmed mean.
In practice, also review distribution shape (skewness, kurtosis) to interpret these measures correctly.
Python Example:
data = [10, 20, 30, 40, 50]
mean = sum(data) / len(data)
print("Mean:", mean)
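The trimmed mean mentioned above is available in scipy (assuming scipy is installed); here it drops 20% of the values from each end before averaging:

```python
from scipy.stats import trim_mean

data = [10, 20, 30, 40, 500]   # one large outlier
print("Mean:", sum(data) / len(data))           # 120.0
print("Trimmed mean:", trim_mean(data, 0.2))    # averages [20, 30, 40] -> 30.0
```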
🔹 Median
The middle value after sorting the data.
import statistics
data = [10, 20, 30, 40, 50]
median = statistics.median(data)
print("Median:", median)
🔹 Mode
The value that occurs most frequently.
import statistics
data = [2, 4, 6, 6, 8]
mode = statistics.mode(data)
print("Mode:", mode)
3. Measures of Dispersion
These describe how spread out the data is.
🔹 Range
data = [30, 45, 60, 80]
range_value = max(data) - min(data)
print("Range:", range_value)
🔹 Variance
import statistics
data = [10, 20, 30]
variance = statistics.variance(data)
print("Variance:", variance)
🔹 Standard Deviation
import statistics
data = [10, 20, 30]
std_dev = statistics.stdev(data)
print("Standard Deviation:", std_dev)
4. Measures of Position
🔹 Quartiles
import numpy as np
data = [10, 20, 30, 40, 50, 60]
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)
print("Q1:", q1)
print("Q2 (Median):", q2)
print("Q3:", q3)
🔹 Percentiles
p90 = np.percentile(data, 90)
print("90th Percentile:", p90)
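Quartiles also give the interquartile range (IQR), the basis of the common 1.5 × IQR outlier fences used in box plots:

```python
import numpy as np

data = [10, 20, 30, 40, 50, 60]
q1, q3 = np.percentile(data, [25, 75])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
```

Values outside the fences are flagged as potential outliers worth inspecting, not automatically discarded.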
5. Skewness
Skewness measures the asymmetry of data.
from scipy.stats import skew
data = [10, 20, 30, 40, 100]
print("Skewness:", skew(data))
- Positive Skew → Mean > Median
- Negative Skew → Mean < Median
- Zero Skew → Symmetric
6. Graphical Representation (Python)
🔹 Histogram
import matplotlib.pyplot as plt
data = [10, 20, 20, 30, 40, 50]
plt.hist(data, bins=5)
plt.title("Histogram")
plt.show()
🔹 Box Plot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()
7. Real-Life Applications
- Student marks analysis
- Salary distribution
- Weather data summary
- Sales performance
- Sports statistics
Inferential Statistics
Inferential Statistics is a branch of statistics that focuses on drawing conclusions, making predictions, and supporting decisions about a population based on a sample of data.
It relies on sampling theory and probability to quantify uncertainty and control error rates, addressing questions such as:
- Can results from a sample represent the population?
- Is the observed difference significant?
- What is the probability of an event?
1. Types of Inferential Statistics
- Estimation (Point & Interval)
- Hypothesis Testing
- Correlation Analysis
- Regression Analysis
- Probability Distributions
2. Population and Sample
Population: Entire group of interest
Sample: A subset of the population
Population → All students in a college
Sample → 100 randomly selected students
Sampling method matters: random sampling reduces bias and improves generalization.
3. Estimation
🔹 Point Estimation
A single value used to estimate a population parameter.
Sample mean → Point estimate of population mean
data = [45, 50, 55, 60, 65]
mean = sum(data) / len(data)
print("Point Estimate (Mean):", mean)
🔹 Interval Estimation (Confidence Interval)
A range of values within which the population parameter is likely to lie.
from scipy import stats
import numpy as np
data = [45, 50, 55, 60, 65]
confidence_level = 0.95
mean = np.mean(data)
std_error = stats.sem(data)
ci = stats.t.interval(confidence_level, len(data)-1, mean, std_error)
print("95% Confidence Interval:", ci)
Confidence intervals provide a range of plausible values for population parameters.
4. Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions using data.
🔹 Types of Hypothesis
- Null Hypothesis (H₀): No effect or no difference
- Alternative Hypothesis (H₁): There is an effect or difference
H₀: Mean score = 50
H₁: Mean score ≠ 50
🔹 One-Sample t-Test
from scipy.stats import ttest_1samp
data = [48, 50, 52, 49, 51]
t_stat, p_value = ttest_1samp(data, 50)
print("t-statistic:", t_stat)
print("p-value:", p_value)
Always check assumptions (normality, independence, equal variances) before choosing a test.
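Two quick scipy checks along these lines, shown on illustrative data: Shapiro–Wilk for normality and Levene for equal variances:

```python
from scipy.stats import shapiro, levene

group1 = [10, 12, 14, 16, 13, 11]
group2 = [11, 13, 15, 17, 14, 12]

# Normality check for each group (small samples: interpret cautiously)
print("Shapiro p (group1):", shapiro(group1).pvalue)
print("Shapiro p (group2):", shapiro(group2).pvalue)

# Equal-variance check across groups
print("Levene p:", levene(group1, group2).pvalue)
```

A small p-value in either check suggests the assumption is questionable, pointing toward a non-parametric alternative (e.g., Mann-Whitney U) or Welch's t-test.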
5. Common Statistical Tests
🔹 Independent t-Test
from scipy.stats import ttest_ind
group1 = [10, 12, 14, 16]
group2 = [11, 13, 15, 17]
t_stat, p_val = ttest_ind(group1, group2)
print("p-value:", p_val)
🔹 Chi-Square Test
from scipy.stats import chi2_contingency
data = [[20, 30], [25, 35]]
chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square p-value:", p)
Choose tests based on data type and distribution; misuse can lead to incorrect conclusions.
6. Correlation Analysis
Correlation measures the strength and direction of relationship between variables.
from scipy.stats import pearsonr
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
corr, p_val = pearsonr(x, y)
print("Correlation:", corr)
- +1 → Perfect positive correlation
- 0 → No correlation
- -1 → Perfect negative correlation
Correlation does not imply causation; consider confounding variables.
7. Regression Analysis
Regression predicts a dependent variable based on one or more independent variables.
🔹 Simple Linear Regression
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(x, y)
print("Slope:", model.coef_)
print("Intercept:", model.intercept_)
Regression assumptions include linearity, homoscedasticity, and independent errors.
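A residual-check sketch (noisy illustrative data, since a perfectly linear example would have zero residuals):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # illustrative noisy data

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# OLS residuals always average to ~0; look for patterns, not the mean
print("Residuals:", residuals.round(3))
```

Plot residuals against predictions: curvature suggests non-linearity, and a funnel shape suggests heteroscedasticity.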
8. Probability & Distributions
🔹 Normal Distribution
from scipy.stats import norm
mean = 50
std = 10
prob = norm.cdf(60, mean, std)
print("Probability:", prob)
Distribution choice impacts inference; match model assumptions to the data.
9. Real-Life Applications
- Medical trials
- Market research
- Quality control
- Election predictions
- Machine learning models