Introduction to Statistics
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data, and it supports decision making grounded in evidence.
It provides tools to quantify uncertainty, summarize patterns, and draw conclusions about populations from samples.
Two main branches are:
- Descriptive Statistics: Summarizes and visualizes data (mean, median, distributions).
- Inferential Statistics: Uses samples to make generalizations about populations (confidence intervals, hypothesis tests).
Basic Concepts
Types of Data
- Qualitative / Categorical Data
- Quantitative / Numerical Data
- Discrete vs Continuous
Data type determines which statistical methods are valid and how to visualize results.
Measures of Central Tendency
Mean: μ = (Σx)/n
Median: Middle value
Mode: Most frequent value
Mean is sensitive to outliers, while median is robust for skewed distributions.
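A quick Python sketch of this outlier sensitivity (illustrative data):

```python
import statistics

# One extreme value pulls the mean but barely moves the median
data = [10, 20, 30, 40, 50]
with_outlier = data + [500]

print(statistics.mean(data), statistics.median(data))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Here the mean jumps from 30 to about 108, while the median only moves from 30 to 35.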
Measures of Dispersion
Variance: σ² = Σ(x-μ)² / n
Standard Deviation: σ = √σ²
Range: Max - Min
Dispersion measures describe variability; higher variance means data is more spread out.
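Note that the variance formula above divides by n (the population form); sample variance divides by n − 1. A short sketch of the difference using the standard library:

```python
import statistics

data = [10, 20, 30]

# Population variance: sigma^2 = sum((x - mu)^2) / n
print(statistics.pvariance(data))   # 200/3, about 66.67

# Sample variance divides by n - 1 (Bessel's correction)
print(statistics.variance(data))    # 200/2 = 100
```

Use the population form when the data is the entire group of interest; use the sample form when generalizing from a sample.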
Probability
Probability quantifies uncertainty and provides a framework for modeling random events.
P(Event) = Number of favorable outcomes / Total outcomes
Example: The probability of rolling a 3 on a fair six-sided die is 1/6.
Key concepts include independent vs dependent events, conditional probability, and Bayes’ theorem.
Probability Rules
- Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
- Multiplication Rule: P(A and B) = P(A) * P(B|A)
- Complement Rule: P(not A) = 1 - P(A)
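These rules can be checked directly on a small sample space; here a single fair die roll, with illustrative events A (even) and B (greater than 3):

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

def p(event):
    return Fraction(len(event), len(outcomes))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(p(A | B) == p(A) + p(B) - p(A & B))   # True
# Complement rule: P(not A) = 1 - P(A)
print(p(outcomes - A) == 1 - p(A))          # True
```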
Conditional probability is critical in real-world decision making:
P(A|B) = P(A and B) / P(B)
Bayes: P(A|B) = P(B|A)P(A) / P(B)
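A worked Bayes' theorem example with made-up screening numbers (all probabilities below are illustrative assumptions, not real data):

```python
# Assumed numbers: P(D) = 0.01, P(pos|D) = 0.95, P(pos|not D) = 0.05
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability: P(pos) = P(pos|D)P(D) + P(pos|not D)P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D|pos) = P(pos|D)P(D) / P(pos)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))   # 0.161
```

Despite a 95%-sensitive test, a positive result here implies only about a 16% probability of the condition, because the base rate is low.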
Statistical Distributions
Normal Distribution
Symmetrical bell-shaped curve, mean = median = mode.
PDF: f(x) = (1/(σ√(2π))) * e^(-(x-μ)²/(2σ²))
The normal distribution underpins many statistical methods due to the Central Limit Theorem.
Binomial Distribution
P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
Models the number of successes in fixed trials; common in A/B testing and yes/no outcomes.
Poisson Distribution
P(X=k) = (λ^k * e^(-λ)) / k!
Models event counts over a fixed interval; often used for arrival rates and rare events.
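Both PMFs can be evaluated with scipy (assuming scipy is available):

```python
from scipy.stats import binom, poisson

# Binomial: P(X = 3) for n = 10 trials, p = 0.5
print(binom.pmf(3, 10, 0.5))    # C(10,3) * 0.5^10 = 120/1024

# Poisson: P(X = 2) for rate lambda = 4
print(poisson.pmf(2, 4))        # 4^2 * e^(-4) / 2!
```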
Correlation & Regression
Correlation
r = Σ((X-meanX)(Y-meanY)) / √(Σ(X-meanX)² * Σ(Y-meanY)²)
- Measures strength & direction of relationship
Correlation ranges from -1 to +1 and captures linear association, not causation.
Simple Linear Regression
Y = a + bX
b = Σ((X-meanX)(Y-meanY)) / Σ(X-meanX)²
a = meanY - b*meanX
Regression models predict outcomes; evaluate fit using R² and residual analysis.
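The slope and intercept formulas above, together with R², can be computed directly in plain Python (toy data for illustration):

```python
# Toy data, chosen so the fit is imperfect and R^2 is informative
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# b = sum((X - meanX)(Y - meanY)) / sum((X - meanX)^2)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

# R^2 = 1 - SS_res / SS_tot
pred = [a + b * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print("slope:", b, "intercept:", a, "R^2:", r2)
```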
Advanced Statistics
Hypothesis Testing
Steps:
- Formulate null & alternative hypotheses
- Choose significance level (α)
- Compute test statistic (Z, t, χ²)
- Compare with critical value or p-value
- Reject or fail to reject H₀ (H₀ is never "accepted")
Focus on effect size and confidence intervals, not just p‑values.
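One common effect-size measure is Cohen's d; a minimal sketch on illustrative groups, using the simplified pooled standard deviation for equal-sized groups:

```python
import math
import statistics

group1 = [10, 12, 14, 16]
group2 = [13, 15, 17, 19]

# Pooled standard deviation (simplified form for equal-sized groups)
s1 = statistics.stdev(group1)
s2 = statistics.stdev(group2)
pooled_sd = math.sqrt((s1 ** 2 + s2 ** 2) / 2)

# Cohen's d: standardized difference between the two means
d = (statistics.mean(group2) - statistics.mean(group1)) / pooled_sd
print("Cohen's d:", round(d, 2))
```

A d near 0.2 is conventionally read as small, 0.5 as medium, and 0.8+ as large.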
ANOVA (Analysis of Variance)
F = Variance between groups / Variance within groups
Use ANOVA to compare means across 3+ groups; follow with post‑hoc tests if significant.
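With scipy, a one-way ANOVA takes one call (the groups below are illustrative):

```python
from scipy.stats import f_oneway

group_a = [85, 90, 88, 92]
group_b = [78, 82, 80, 84]
group_c = [91, 95, 93, 97]

# F = variance between groups / variance within groups
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print("F:", f_stat, "p:", p_value)
```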
Chi-Square Test
χ² = Σ((Observed - Expected)² / Expected)
Chi‑square tests independence in categorical data and goodness‑of‑fit to expected distributions.
Time Series & Forecasting
Analyze trends and seasonality in time-ordered data, and make predictions using moving averages or exponential smoothing.
Advanced forecasting uses ARIMA or exponential smoothing state space models when patterns are complex.
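A minimal sketch of both basic techniques on a toy series:

```python
series = [12, 15, 14, 18, 20, 19, 23, 25]

# 3-period simple moving average
window = 3
sma = [sum(series[i - window + 1:i + 1]) / window
       for i in range(window - 1, len(series))]

# Simple exponential smoothing with smoothing factor alpha = 0.3
alpha = 0.3
smoothed = [series[0]]
for value in series[1:]:
    smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])

print("SMA:", [round(v, 2) for v in sma])
print("SES:", [round(v, 2) for v in smoothed])
```

Larger windows (or smaller alpha) smooth more aggressively but react more slowly to real changes in the series.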
Descriptive Statistics
Descriptive Statistics is a branch of statistics that focuses on summarizing, organizing, and describing data. It helps us understand what the data shows without making predictions, answering questions such as:
- What is the average?
- How spread out is the data?
- Where do most values lie?
It focuses on summarizing data without making inferences about a larger population.
1. Types of Descriptive Statistics
- Measures of Central Tendency
- Measures of Dispersion
- Measures of Position
- Graphical Representation
2. Measures of Central Tendency
These describe the central value of the dataset.
🔹 Mean (Arithmetic Average)
Formula: Mean = Σx / n
The mean is sensitive to outliers; for skewed data, use median or trimmed mean.
In practice, also review distribution shape (skewness, kurtosis) to interpret these measures correctly.
Python Example:
data = [10, 20, 30, 40, 50]
mean = sum(data) / len(data)
print("Mean:", mean)
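The trimmed mean mentioned above is available in scipy (assuming scipy is installed); here it drops 20% of the values from each end before averaging:

```python
from scipy.stats import trim_mean

data = [10, 20, 30, 40, 500]   # one large outlier
print("Mean:", sum(data) / len(data))           # 120.0
print("Trimmed mean:", trim_mean(data, 0.2))    # averages [20, 30, 40] -> 30.0
```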
🔹 Median
The middle value after sorting the data.
import statistics
data = [10, 20, 30, 40, 50]
median = statistics.median(data)
print("Median:", median)
🔹 Mode
The value that occurs most frequently.
import statistics
data = [2, 4, 6, 6, 8]
mode = statistics.mode(data)
print("Mode:", mode)
3. Measures of Dispersion
These describe how spread out the data is.
🔹 Range
data = [30, 45, 60, 80]
range_value = max(data) - min(data)
print("Range:", range_value)
🔹 Variance
import statistics
data = [10, 20, 30]
variance = statistics.variance(data)
print("Variance:", variance)
🔹 Standard Deviation
import statistics
data = [10, 20, 30]
std_dev = statistics.stdev(data)
print("Standard Deviation:", std_dev)
4. Measures of Position
🔹 Quartiles
import numpy as np
data = [10, 20, 30, 40, 50, 60]
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)
print("Q1:", q1)
print("Q2 (Median):", q2)
print("Q3:", q3)
🔹 Percentiles
p90 = np.percentile(data, 90)
print("90th Percentile:", p90)
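Quartiles also give the interquartile range (IQR), the basis of the common 1.5 × IQR outlier fences used in box plots:

```python
import numpy as np

data = [10, 20, 30, 40, 50, 60]
q1, q3 = np.percentile(data, [25, 75])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
```

Values outside the fences are flagged as potential outliers worth inspecting, not automatically discarded.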
5. Skewness
Skewness measures the asymmetry of data.
from scipy.stats import skew
data = [10, 20, 30, 40, 100]
print("Skewness:", skew(data))
- Positive Skew → Mean > Median
- Negative Skew → Mean < Median
- Zero Skew → Symmetric
6. Graphical Representation (Python)
🔹 Histogram
import matplotlib.pyplot as plt
data = [10, 20, 20, 30, 40, 50]
plt.hist(data, bins=5)
plt.title("Histogram")
plt.show()
🔹 Box Plot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()
7. Real-Life Applications
- Student marks analysis
- Salary distribution
- Weather data summary
- Sales performance
- Sports statistics
Inferential Statistics
Inferential Statistics is a branch of statistics that focuses on drawing conclusions, making predictions, and supporting decisions about a population based on a sample of data.
It relies on sampling theory and probability to quantify uncertainty and control error rates, addressing questions such as:
- Can results from a sample represent the population?
- Is the observed difference significant?
- What is the probability of an event?
1. Types of Inferential Statistics
- Estimation (Point & Interval)
- Hypothesis Testing
- Correlation Analysis
- Regression Analysis
- Probability Distributions
2. Population and Sample
Population: Entire group of interest
Sample: A subset of the population
Population → All students in a college
Sample → 100 randomly selected students
Sampling method matters: random sampling reduces bias and improves generalization.
3. Estimation
🔹 Point Estimation
A single value used to estimate a population parameter.
Sample mean → Point estimate of population mean
data = [45, 50, 55, 60, 65]
mean = sum(data) / len(data)
print("Point Estimate (Mean):", mean)
🔹 Interval Estimation (Confidence Interval)
A range of values within which the population parameter is likely to lie.
from scipy import stats
import numpy as np
data = [45, 50, 55, 60, 65]
confidence_level = 0.95
mean = np.mean(data)
std_error = stats.sem(data)
ci = stats.t.interval(confidence_level, len(data)-1, mean, std_error)
print("95% Confidence Interval:", ci)
Confidence intervals provide a range of plausible values for population parameters.
4. Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions using data.
🔹 Types of Hypothesis
- Null Hypothesis (H₀): No effect or no difference
- Alternative Hypothesis (H₁): There is an effect or difference
H₀: Mean score = 50
H₁: Mean score ≠ 50
🔹 One-Sample t-Test
from scipy.stats import ttest_1samp
data = [48, 50, 52, 49, 51]
t_stat, p_value = ttest_1samp(data, 50)
print("t-statistic:", t_stat)
print("p-value:", p_value)
Always check assumptions (normality, independence, equal variances) before choosing a test.
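Two quick scipy checks along these lines, shown on illustrative data: Shapiro–Wilk for normality and Levene for equal variances:

```python
from scipy.stats import shapiro, levene

group1 = [10, 12, 14, 16, 13, 11]
group2 = [11, 13, 15, 17, 14, 12]

# Normality check for each group (small samples: interpret cautiously)
print("Shapiro p (group1):", shapiro(group1).pvalue)
print("Shapiro p (group2):", shapiro(group2).pvalue)

# Equal-variance check across groups
print("Levene p:", levene(group1, group2).pvalue)
```

A small p-value in either check suggests the assumption is questionable, pointing toward a non-parametric alternative (e.g., Mann-Whitney U) or Welch's t-test.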
5. Common Statistical Tests
🔹 Independent t-Test
from scipy.stats import ttest_ind
group1 = [10, 12, 14, 16]
group2 = [11, 13, 15, 17]
t_stat, p_val = ttest_ind(group1, group2)
print("p-value:", p_val)
🔹 Chi-Square Test
from scipy.stats import chi2_contingency
data = [[20, 30], [25, 35]]
chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square p-value:", p)
Choose tests based on data type and distribution; misuse can lead to incorrect conclusions.
6. Correlation Analysis
Correlation measures the strength and direction of relationship between variables.
from scipy.stats import pearsonr
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
corr, p_val = pearsonr(x, y)
print("Correlation:", corr)
- +1 → Perfect positive correlation
- 0 → No correlation
- -1 → Perfect negative correlation
Correlation does not imply causation; consider confounding variables.
7. Regression Analysis
Regression predicts a dependent variable based on one or more independent variables.
🔹 Simple Linear Regression
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(x, y)
print("Slope:", model.coef_)
print("Intercept:", model.intercept_)
Regression assumptions include linearity, homoscedasticity, and independent errors.
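A residual-check sketch (noisy illustrative data, since a perfectly linear example would have zero residuals):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # illustrative noisy data

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# OLS residuals always average to ~0; look for patterns, not the mean
print("Residuals:", residuals.round(3))
```

Plot residuals against predictions: curvature suggests non-linearity, and a funnel shape suggests heteroscedasticity.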
8. Probability & Distributions
🔹 Normal Distribution
from scipy.stats import norm
mean = 50
std = 10
prob = norm.cdf(60, mean, std)
print("Probability:", prob)
Distribution choice impacts inference; match model assumptions to the data.
9. Real-Life Applications
- Medical trials
- Market research
- Quality control
- Election predictions
- Machine learning models