
Introduction to Statistics

Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. It supports data-driven decision making.

It provides tools to quantify uncertainty, summarize patterns, and draw conclusions about populations from samples.

Two main branches are:

  • Descriptive Statistics: Summarizes and visualizes data (mean, median, distributions).
  • Inferential Statistics: Uses samples to make generalizations about populations (confidence intervals, hypothesis tests).
Learning statistics is essential for data analysis, research, and real-world decision making.

Basic Concepts

Types of Data

  • Qualitative / Categorical Data
  • Quantitative / Numerical Data
  • Discrete vs Continuous

Data type determines which statistical methods are valid and how to visualize results.

Measures of Central Tendency

Sample mean: x̄ = (Σx) / n
Median: Middle value of the sorted data
Mode: Most frequent value
      

Mean is sensitive to outliers, while median is robust for skewed distributions.
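The outlier sensitivity described above is easy to see directly. A minimal sketch with Python's statistics module, using a small made-up dataset with one extreme value:

```python
import statistics

# Illustration data: one extreme outlier (100)
data = [10, 20, 30, 40, 100]

# The mean is pulled toward the outlier...
print("Mean:", statistics.mean(data))      # 40
# ...while the median stays at the central value
print("Median:", statistics.median(data))  # 30
```

Without the outlier, the mean of the first four values would be 25; a single extreme point shifted it to 40, while the median barely moved.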

Measures of Dispersion

Population variance: σ² = Σ(x - μ)² / N (sample variance divides by n - 1 instead)
Standard Deviation: σ = √σ²
Range: Max - Min
      

Dispersion measures describe variability; higher variance means data is more spread out.
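The dispersion formulas above can be sketched in plain Python (illustration data; population formula, dividing by n):

```python
import math

data = [10, 20, 30, 40, 50]
n = len(data)
mu = sum(data) / n

# Population variance: average squared deviation from the mean
variance = sum((x - mu) ** 2 for x in data) / n
std_dev = math.sqrt(variance)

print("Variance:", variance)           # 200.0
print("Standard Deviation:", std_dev)  # ~14.14
print("Range:", max(data) - min(data)) # 40
```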

Probability

Probability quantifies uncertainty and provides a framework for modeling random events.

P(Event) = Number of favorable outcomes / Total outcomes
Example: Probability of rolling a 3 on a fair six-sided die = 1/6
      

Key concepts include independent vs dependent events, conditional probability, and Bayes’ theorem.

Probability Rules

  • Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
  • Multiplication Rule: P(A and B) = P(A) * P(B|A)
  • Complement Rule: P(not A) = 1 - P(A)

Conditional probability is critical in real-world decision making:

P(A|B) = P(A and B) / P(B)
Bayes: P(A|B) = P(B|A)P(A) / P(B)
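As a numeric sketch of Bayes' theorem (the probabilities below are made-up illustration values, e.g. a rare condition A and a test result B):

```python
# Hypothetical probabilities, for illustration only
p_b_given_a = 0.9   # P(B|A)
p_a = 0.01          # P(A)
p_b = 0.05          # P(B)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print("P(A|B) =", p_a_given_b)  # 0.18
```

Note how a high P(B|A) still yields a modest P(A|B) when P(A) is small, which is why base rates matter in conditional reasoning.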
      

Statistical Distributions

Normal Distribution

Symmetrical bell-shaped curve, mean = median = mode.

PDF: f(x) = (1 / (σ√(2π))) * e^(-(x-μ)² / (2σ²))
      

The normal distribution underpins many statistical methods due to the Central Limit Theorem.
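The PDF formula can be checked against SciPy's implementation. A sketch assuming the standard normal (μ = 0, σ = 1) evaluated at x = 0.5:

```python
import math
from scipy.stats import norm

mu, sigma, x = 0, 1, 0.5

# Evaluate the PDF formula directly
pdf_manual = (1 / (sigma * math.sqrt(2 * math.pi))) * \
             math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Compare with SciPy's built-in
pdf_scipy = norm.pdf(x, mu, sigma)

print(pdf_manual, pdf_scipy)  # both ≈ 0.35207
```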

Binomial Distribution

P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
      

Models the number of successes in fixed trials; common in A/B testing and yes/no outcomes.
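A sketch verifying the binomial PMF formula against SciPy, for illustration values n = 10 trials, p = 0.5, k = 3 successes:

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.5, 3  # illustration values

# Formula: C(n,k) * p^k * (1-p)^(n-k)
pmf_manual = comb(n, k) * p**k * (1 - p) ** (n - k)
pmf_scipy = binom.pmf(k, n, p)

print(pmf_manual, pmf_scipy)  # both ≈ 0.1171875
```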

Poisson Distribution

P(X=k) = (λ^k * e^(-λ)) / k!
      

Models event counts over a fixed interval; often used for arrival rates and rare events.
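As a sketch, the Poisson PMF formula checked against SciPy (illustration: an average of λ = 4 arrivals per hour, probability of exactly k = 2):

```python
import math
from scipy.stats import poisson

lam, k = 4, 2  # illustration values

pmf_manual = (lam**k * math.exp(-lam)) / math.factorial(k)
pmf_scipy = poisson.pmf(k, lam)

print(pmf_manual, pmf_scipy)  # both ≈ 0.1465
```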

Correlation & Regression

Correlation

r = Σ((X-meanX)(Y-meanY)) / √(Σ(X-meanX)² * Σ(Y-meanY)²)
- Measures strength & direction of relationship
      

Correlation ranges from -1 to +1 and captures linear association, not causation.
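The Pearson r formula above can be computed directly and compared with NumPy's built-in (illustration data):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])  # illustration data

# Pearson r from the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r_manual = num / den

# Compare with NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both ≈ 0.7746
```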

Simple Linear Regression

Y = a + bX
b = Σ((X-meanX)(Y-meanY)) / Σ(X-meanX)²
a = meanY - b*meanX
      

Regression models predict outcomes; evaluate fit using R² and residual analysis.
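The slope/intercept formulas above, plus the R² fit measure, sketched in NumPy on illustration data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)  # illustration data

# Least-squares slope and intercept from the formulas above
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()

# R²: proportion of variance in y explained by the fitted line
y_pred = a + b * x
r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - y.mean())**2)

print("Slope:", b)       # 0.6
print("Intercept:", a)   # 2.2
print("R²:", r2)         # 0.6
```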

Advanced Statistics

Hypothesis Testing

Steps:

  • Formulate null & alternative hypotheses
  • Choose significance level (α)
  • Compute test statistic (Z, t, χ²)
  • Compare with critical value or p-value
  • Reject or fail to reject H₀

Focus on effect size and confidence intervals, not just p‑values.
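The steps above can be sketched with a one-sample t-test, reporting an effect size alongside the p-value (illustration data; H₀: μ = 50):

```python
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([52, 48, 51, 53, 49, 50, 54, 47])  # illustration data
mu0, alpha = 50, 0.05

t_stat, p_value = ttest_1samp(sample, popmean=mu0)

# Effect size (Cohen's d): standardized difference from the hypothesized mean
d = (sample.mean() - mu0) / sample.std(ddof=1)

print("t =", t_stat, "p =", p_value, "d =", d)
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Here the sample mean (50.5) is close to 50 relative to the spread, so the test fails to reject H₀ and the effect size is small.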

ANOVA (Analysis of Variance)

F = Variance between groups / Variance within groups
      

Use ANOVA to compare means across 3+ groups; follow with post‑hoc tests if significant.
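A one-way ANOVA sketch with SciPy on three made-up groups (e.g., test scores under three teaching methods):

```python
from scipy.stats import f_oneway

# Illustration data: three independent groups
g1 = [85, 86, 88, 75, 78]
g2 = [81, 80, 83, 84, 79]
g3 = [90, 92, 91, 89, 94]

f_stat, p_value = f_oneway(g1, g2, g3)
print("F:", f_stat, "p:", p_value)
# p < 0.05 here, so at least one group mean differs
```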

Chi-Square Test

χ² = Σ((Observed - Expected)² / Expected)
      

Chi‑square tests independence in categorical data and goodness‑of‑fit to expected distributions.
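A goodness-of-fit sketch of the χ² formula using SciPy (illustration: is a die fair, given 60 made-up rolls?):

```python
from scipy.stats import chisquare

observed = [8, 9, 12, 11, 10, 10]  # counts from 60 illustration rolls
expected = [10] * 6                # a fair die expects 10 of each face

# chi2 = sum((observed - expected)^2 / expected)
chi2, p_value = chisquare(observed, expected)
print("chi2:", chi2, "p:", p_value)  # chi2 = 1.0; large p, no evidence of unfairness
```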

Time Series & Forecasting

Analyze trends, seasonality, and make predictions using moving averages or exponential smoothing.

Advanced forecasting uses ARIMA or exponential smoothing state space models when patterns are complex.
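The moving-average and exponential-smoothing ideas above can be sketched with pandas (made-up monthly sales figures; the smoothing parameter alpha=0.5 is an arbitrary illustration choice):

```python
import pandas as pd

# Illustration data: monthly sales figures
sales = pd.Series([100, 120, 130, 125, 140, 150, 160, 155])

# 3-period simple moving average smooths short-term noise
moving_avg = sales.rolling(window=3).mean()

# Simple exponential smoothing: recent observations weighted more heavily
smoothed = sales.ewm(alpha=0.5).mean()

print(moving_avg)
print(smoothed)
```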

Descriptive Statistics

Descriptive Statistics is a branch of statistics that focuses on summarizing, organizing, and describing data. It helps us understand what the data represents without making predictions.

📊 Descriptive statistics answers:
  • What is the average?
  • How spread out is the data?
  • Where do most values lie?

It focuses on summarizing data without making inferences about a larger population.

1. Types of Descriptive Statistics

  • Measures of Central Tendency
  • Measures of Dispersion
  • Measures of Position
  • Graphical Representation

2. Measures of Central Tendency

These describe the central value of the dataset.

🔹 Mean (Arithmetic Average)

Formula:
Mean = Σx / n
  

The mean is sensitive to outliers; for skewed data, use median or trimmed mean.

Python Example:

data = [10, 20, 30, 40, 50]
mean = sum(data) / len(data)
print("Mean:", mean)

In practice, also review distribution shape (skewness, kurtosis) to interpret these measures correctly.

🔹 Median

The middle value after sorting the data.

import statistics

data = [10, 20, 30, 40, 50]
median = statistics.median(data)
print("Median:", median)
  

🔹 Mode

The value that occurs most frequently.

import statistics

data = [2, 4, 6, 6, 8]
mode = statistics.mode(data)
print("Mode:", mode)
  
✔ Mean is sensitive to outliers
✔ Median works best for skewed data
✔ Mode is useful for categorical data

3. Measures of Dispersion

These describe how spread out the data is.

🔹 Range

data = [30, 45, 60, 80]
range_value = max(data) - min(data)
print("Range:", range_value)
  

🔹 Variance

import statistics

data = [10, 20, 30]
variance = statistics.variance(data)
print("Variance:", variance)
  

🔹 Standard Deviation

import statistics

data = [10, 20, 30]
std_dev = statistics.stdev(data)
print("Standard Deviation:", std_dev)
  
📌 Low SD → data is consistent
📌 High SD → data is widely spread

4. Measures of Position

🔹 Quartiles

import numpy as np

data = [10, 20, 30, 40, 50, 60]
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)

print("Q1:", q1)
print("Q2 (Median):", q2)
print("Q3:", q3)
  

🔹 Percentiles

import numpy as np

data = [10, 20, 30, 40, 50, 60]
p90 = np.percentile(data, 90)
print("90th Percentile:", p90)
  

5. Skewness

Skewness measures the asymmetry of data.

from scipy.stats import skew

data = [10, 20, 30, 40, 100]
print("Skewness:", skew(data))
  
  • Positive Skew → Mean > Median
  • Negative Skew → Mean < Median
  • Zero Skew → Symmetric

6. Graphical Representation (Python)

🔹 Histogram

import matplotlib.pyplot as plt

data = [10, 20, 20, 30, 40, 50]
plt.hist(data, bins=5)
plt.title("Histogram")
plt.show()
  

🔹 Box Plot

import matplotlib.pyplot as plt

data = [10, 20, 20, 30, 40, 50]
plt.boxplot(data)
plt.title("Box Plot")
plt.show()
  
📈 Graphs help detect patterns, outliers, and distribution shape.

7. Real-Life Applications

  • Student marks analysis
  • Salary distribution
  • Weather data summary
  • Sales performance
  • Sports statistics

Inferential Statistics

Inferential Statistics is a branch of statistics that focuses on drawing conclusions, making predictions, and informing decisions about a population based on a sample of data.

It relies on sampling theory and probability to quantify uncertainty and control error rates.

📊 Inferential statistics helps answer:
  • Can results from a sample represent the population?
  • Is the observed difference significant?
  • What is the probability of an event?

1. Types of Inferential Statistics

  • Estimation (Point & Interval)
  • Hypothesis Testing
  • Correlation Analysis
  • Regression Analysis
  • Probability Distributions

2. Population and Sample

Population: Entire group of interest
Sample: A subset of the population

Population → All students in a college
Sample → 100 randomly selected students
  
✔ Inferential statistics uses samples to draw conclusions about populations

Sampling method matters: random sampling reduces bias and improves generalization.
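A simple random sample can be sketched with Python's random module (illustration: 100 students drawn from 500 hypothetical student IDs; the seed is fixed only for reproducibility):

```python
import random

random.seed(42)  # reproducible illustration

population = list(range(1, 501))         # e.g., 500 student IDs
sample = random.sample(population, 100)  # simple random sample, no repeats

print("Sample size:", len(sample))  # 100
```

`random.sample` draws without replacement, giving every ID an equal chance of selection, which is the property that reduces selection bias.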

3. Estimation

🔹 Point Estimation

A single value used to estimate a population parameter.

Sample mean → Point estimate of population mean
  
data = [45, 50, 55, 60, 65]
mean = sum(data) / len(data)
print("Point Estimate (Mean):", mean)
  

🔹 Interval Estimation (Confidence Interval)

A range of values within which the population parameter is likely to lie.

from scipy import stats
import numpy as np

data = [45, 50, 55, 60, 65]
confidence_level = 0.95

mean = np.mean(data)
std_error = stats.sem(data)

ci = stats.t.interval(confidence_level, df=len(data)-1, loc=mean, scale=std_error)
print("95% Confidence Interval:", ci)
  

Confidence intervals provide a range of plausible values for population parameters.

4. Hypothesis Testing

Hypothesis testing is a statistical procedure for deciding, based on sample data, whether to reject a claim about a population.

🔹 Types of Hypothesis

  • Null Hypothesis (H₀): No effect or no difference
  • Alternative Hypothesis (H₁): There is an effect or difference
H₀: Mean score = 50
H₁: Mean score ≠ 50
  

🔹 One-Sample t-Test

from scipy.stats import ttest_1samp

data = [48, 50, 52, 49, 51]
t_stat, p_value = ttest_1samp(data, 50)

print("t-statistic:", t_stat)
print("p-value:", p_value)
  
✔ If p-value < 0.05 → Reject H₀
✔ If p-value ≥ 0.05 → Fail to reject H₀

Always check assumptions (normality, independence, equal variances) before choosing a test.
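A common normality check before a t-test is the Shapiro-Wilk test (a sketch on illustration data; H₀ here is that the data come from a normal distribution):

```python
from scipy.stats import shapiro

data = [48, 50, 52, 49, 51, 50, 47, 53]  # illustration data

# Shapiro-Wilk: H0 = data were drawn from a normal distribution
stat, p_value = shapiro(data)

if p_value < 0.05:
    print("Normality is questionable; consider a non-parametric test")
else:
    print("No evidence against normality")
```

Note that with small samples such tests have little power, so also inspect a histogram or Q-Q plot.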

5. Common Statistical Tests

🔹 Independent t-Test

from scipy.stats import ttest_ind

group1 = [10, 12, 14, 16]
group2 = [11, 13, 15, 17]

t_stat, p_val = ttest_ind(group1, group2)
print("p-value:", p_val)
  

🔹 Chi-Square Test

from scipy.stats import chi2_contingency

data = [[20, 30], [25, 35]]
chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square p-value:", p)
  

Choose tests based on data type and distribution; misuse can lead to incorrect conclusions.
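For example, when a group contains outliers or is clearly non-normal, a non-parametric alternative to the independent t-test is the Mann-Whitney U test (a sketch on made-up data):

```python
from scipy.stats import mannwhitneyu

group1 = [10, 12, 14, 16, 90]  # illustration data, skewed by an outlier
group2 = [11, 13, 15, 17, 19]

# Rank-based test: compares distributions without assuming normality
stat, p_value = mannwhitneyu(group1, group2)
print("U:", stat, "p-value:", p_value)
```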

6. Correlation Analysis

Correlation measures the strength and direction of relationship between variables.

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

corr, p_val = pearsonr(x, y)
print("Correlation:", corr)
  
  • +1 → Perfect positive correlation
  • 0 → No correlation
  • -1 → Perfect negative correlation

Correlation does not imply causation; consider confounding variables.

7. Regression Analysis

Regression predicts a dependent variable based on one or more independent variables.

🔹 Simple Linear Regression

from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(x, y)

print("Slope:", model.coef_)
print("Intercept:", model.intercept_)
  

Regression assumptions include linearity, homoscedasticity, and independent errors.
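A basic residual check can be sketched as follows (illustration data; residuals from a least-squares fit always average to zero, so the point is to look for patterns, not the mean alone):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # illustration data

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# Residuals should scatter around zero with no trend or funnel shape
print("Residuals:", residuals)
print("Mean residual:", residuals.mean())  # ~0 by construction
```

Plotting residuals against fitted values (e.g., with matplotlib) is the usual way to spot violations of linearity or homoscedasticity.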

8. Probability & Distributions

🔹 Normal Distribution

from scipy.stats import norm

mean = 50
std = 10

prob = norm.cdf(60, mean, std)
print("Probability:", prob)
  
📈 Many real-life variables are approximately normally distributed

Distribution choice impacts inference; match model assumptions to the data.

9. Real-Life Applications

  • Medical trials
  • Market research
  • Quality control
  • Election predictions
  • Machine learning models