GitHub - Dicklesworthstone/hoeffdings_d_explainer: A Detailed Introduction to My Favorite Statistical Measure, Hoeffding's D

The Nugget

  • Hoeffding's D is a statistical measure that quantifies the dependency or association between two sequences of data by comparing their joint distribution to what would be expected if the sequences were independent.

Key quotes

  • "When you use Pearson correlation, you are implicitly looking for a particular kind of relationship between the two sequences: a linear relationship."
  • "Hoeffding’s D is the best measure yet discovered for finding any kind of relationship between two sequences without making assumptions about their distributions."
  • "Hoeffding's D provides a statistical measure of dependency between sequences that is more robust to outliers and does not assume a linear relationship compared to other measures like Pearson's correlation."

Key insights

Overview of Measures of Association

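  • Pearson correlation measures the strength of a linear relationship between two sequences and is sensitive to outliers.
  • Spearman’s Rho applies Pearson correlation to the ranks of the data instead of the raw values, which makes it robust to outliers and able to detect any monotonic relationship.
  • Kendall’s Tau compares every pair of data points and assesses whether each pair is concordant (ordered the same way in both sequences) or discordant.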
  • Hoeffding's D goes beyond pairwise comparisons and evaluates quadruples of data points, which lets it quantify dependencies that are neither linear nor monotonic.
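
To make the limits of the three classic measures concrete, here is a minimal sketch that computes them with SciPy on a monotonic curve and on a parabola. The helper name classic_measures and the sample data are illustrative assumptions and do not come from the repository.

```python
import numpy as np
from scipy import stats


def classic_measures(x, y):
    """Return Pearson's r, Spearman's rho and Kendall's tau for two sequences."""
    return {
        "pearson": float(stats.pearsonr(x, y)[0]),
        "spearman": float(stats.spearmanr(x, y)[0]),
        "kendall": float(stats.kendalltau(x, y)[0]),
    }


# Illustrative data: a grid that is exactly symmetric around zero.
x = np.arange(-30, 31) / 10.0

# Monotonic but non-linear: the rank-based measures see a perfect
# relationship, while Pearson is pulled below 1 by the curvature.
monotonic = classic_measures(x, np.exp(x))

# Non-monotonic: y is fully determined by x, yet all three measures
# come out at (numerically) zero because the two halves cancel.
parabola = classic_measures(x, x ** 2)

print("exp(x):", monotonic)
print("x ** 2:", parabola)
```

The parabola is the motivating case for Hoeffding's D: the dependence is total, but none of the three measures can see it.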

Intuition Behind Hoeffding's D

  • Ranking and Pairwise Comparisons:
    • Each data point is ranked within its own sequence, and pairs of points are then compared to classify them as concordant or discordant.
  • Quadruple Comparisons:
    • Hoeffding's D evaluates all possible quadruples to capture more complex dependencies.
  • Summation:
    • The core of Hoeffding's D involves summing terms derived from concordance and discordance assessments across quadruples.
  • Normalization:
    • The sum is normalized so that the result lies between -0.5 and 1, with larger values indicating stronger dependence.
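
The four steps above can be sketched in a few lines of NumPy. This is a minimal sketch based on the commonly published rank-based formula for Hoeffding's D (average ranks, a bivariate count Q for each point, three sums, and a factor of 30 in the normalization). It is not the repository's code, and the function name hoeffdings_d is an illustrative assumption. The pairwise comparison matrices take O(n²) memory, so this version is only suitable for modest sample sizes.

```python
import numpy as np
from scipy.stats import rankdata


def hoeffdings_d(x, y):
    """Hoeffding's D for two equal-length sequences, scaled to [-0.5, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if y.size != n:
        raise ValueError("x and y must have the same length")
    if n < 5:
        raise ValueError("Hoeffding's D needs at least 5 observations")

    # Step 1: ranks within each sequence (ties get the average rank).
    r = rankdata(x, method="average")
    s = rankdata(y, method="average")

    # Entry [i, j] compares point j against point i.
    x_less = r[None, :] < r[:, None]
    x_tied = r[None, :] == r[:, None]
    y_less = s[None, :] < s[:, None]
    y_tied = s[None, :] == s[:, None]

    # Step 2: Q[i] is 1 plus the number of points below point i in both
    # coordinates, with fractional credit for ties (self-match removed).
    q = (
        1.0
        + (x_less & y_less).sum(axis=1)
        + 0.25 * ((x_tied & y_tied).sum(axis=1) - 1)
        + 0.5 * (x_tied & y_less).sum(axis=1)
        + 0.5 * (x_less & y_tied).sum(axis=1)
    )

    # Step 3: the three sums that summarize the quadruple comparisons.
    d1 = np.sum((q - 1) * (q - 2))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * (q - 1))

    # Step 4: normalization.
    numerator = (n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3
    denominator = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return float(30.0 * numerator / denominator)


# Illustrative data: a straight line and a parabola on a symmetric grid.
x = np.arange(-30, 31) / 10.0
d_line = hoeffdings_d(x, 2 * x + 1)
d_parabola = hoeffdings_d(x, x ** 2)
print("line:", d_line, "parabola:", d_parabola)
```

On the parabola, where rank correlations cancel out to zero, this sketch returns a clearly positive value, which is the behavior that sets Hoeffding's D apart.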

Implementation of Hoeffding's D

  • Efficient Python Implementation with NumPy and SciPy:
    • The repository's Python code computes Hoeffding's D with NumPy and SciPy and walks through the calculation step by step.
  • Rust Library for Faster Computation:
    • A faster Rust implementation is available as a Python library, which is particularly useful for large datasets.

Make it stick

  • 💡 Hoeffding's D quantifies the dependency between two sequences by measuring how far their joint distribution departs from what independence would imply.
  • 💪 Hoeffding's D is robust to outliers, does not assume a linear relationship, and captures a broader range of associations than traditional measures such as Pearson's correlation.