AUGUST: An Interpretable, Resolution-based Two-sample Test

Brown, Benjamin; Zhang, Kai

doi:10.51387/23-NEJSDS54

The New England Journal of Statistics in Data Science

Volume 2, Issue 3 (2024), pp. 357–367

Pub. online: 15 December 2023 Type: Methodology Article

Open Access

Area: Statistical Methodology

Accepted: 4 September 2023

Published: 15 December 2023

Abstract

Two-sample testing is a fundamental problem in statistics. While many powerful nonparametric methods exist in both the univariate and multivariate settings, frameworks for determining which data features lead to rejection of the null are comparatively rare. In this paper, we propose a new nonparametric two-sample test named AUGUST, which incorporates a framework for interpretation while maintaining power comparable to existing methods. AUGUST tests for inequality in distribution up to a predetermined resolution using symmetry statistics from binary expansion. Designed for univariate and low- to moderate-dimensional multivariate data, this construction allows us to understand distributional differences as a combination of fundamental orthogonal signals. Asymptotic theory for the test statistic facilitates p-value computation and power analysis, and an efficient algorithm enables computation on large data sets. In empirical studies, we show that our test has power comparable to that of popular existing methods, and greater power in some circumstances. We illustrate the interpretability of our method using NBA shooting data.
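To make the idea of "symmetry statistics from binary expansion" at a fixed resolution more concrete, the following is a minimal univariate sketch. It is not the paper's exact statistic or algorithm: the function names `dyadic_cell_counts` and `symmetry_statistics`, the choice of the empirical CDF of one sample as the transform, and the use of an unnormalized Walsh-Hadamard transform are all assumptions made for illustration. The sketch maps one sample through the empirical CDF of the other, keeps the first `depth` binary digits of each transformed value (equivalently, one of `2**depth` dyadic cells), and combines the cell counts with +1/-1 weights.

```python
import bisect
import random


def dyadic_cell_counts(x, y, depth):
    """Count the y-values falling in each of 2**depth dyadic cells.

    Each y-value is mapped to [0, 1) via the empirical CDF of x (an
    illustrative assumption), and its first `depth` binary digits
    select the cell.
    """
    xs = sorted(x)
    n_cells = 2 ** depth
    counts = [0] * n_cells
    for v in y:
        # bisect_right is in 0..len(xs), so u is always strictly below 1.
        u = bisect.bisect_right(xs, v) / (len(xs) + 1)
        counts[int(u * n_cells)] += 1
    return counts


def symmetry_statistics(counts):
    """Unnormalized Walsh-Hadamard transform of the cell counts.

    Entry 0 is the total count; every other entry is a +1/-1 weighted
    sum of cell counts, i.e. one orthogonal contrast between cells.
    """
    s = list(counts)
    h = 1
    while h < len(s):
        for i in range(0, len(s), 2 * h):
            for j in range(i, i + h):
                a, b = s[j], s[j + h]
                s[j], s[j + h] = a + b, a - b
        h *= 2
    return s


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(500)]
    y = [rng.gauss(0.5, 1.0) for _ in range(500)]  # hypothetical location shift
    counts = dyadic_cell_counts(x, y, depth=2)
    print(counts)
    print(symmetry_statistics(counts))
```

If the two samples share a distribution, the transformed values are roughly uniform, so the cell counts are roughly equal and the contrasts after the first entry should be near zero. Large contrasts indicate which pattern of cells differs, which is the sense in which such statistics can be read as separate orthogonal signals.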

Supplementary material

Supplementary material for AUGUST.

Open access article under the CC BY license.

Keywords

Distributional difference, Interpretability, Power, Symmetry, Visualization

Funding

This research is partially supported by NSF grants DMS-1613112, IIS-1633212, DMS-1916237, and DMS-2152289.
