AUGUST: An Interpretable, Resolution-based Two-sample Test
Volume 2, Issue 3 (2024), pp. 357–367
Pub. online: 15 December 2023
Type: Methodology Article
Open Access
Area: Statistical Methodology
Accepted
4 September 2023
Published
15 December 2023
Abstract
Two-sample testing is a fundamental problem in statistics. While many powerful nonparametric methods exist for both the univariate and multivariate settings, frameworks for determining which features of the data lead to rejection of the null are comparatively rare. In this paper, we propose a new nonparametric two-sample test named AUGUST, which incorporates a framework for interpretation while maintaining power comparable to existing methods. AUGUST tests for inequality in distribution up to a predetermined resolution using symmetry statistics from binary expansion. Designed for univariate and low- to moderate-dimensional multivariate data, this construction allows us to understand distributional differences as a combination of fundamental orthogonal signals. Asymptotic theory for the test statistic facilitates p-value computation and power analysis, and an efficient algorithm enables computation on large data sets. In empirical studies, we show that our test has power comparable to that of popular existing methods, and greater power in some circumstances. We illustrate the interpretability of our method using NBA shooting data.
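To make the idea of symmetry statistics from binary expansion at a fixed resolution concrete, the following is a minimal univariate sketch. It is an illustration under stated assumptions and not the AUGUST statistic or its p-value procedure: the function names (`binary_bits`, `symmetry_statistics`, `ecdf_transform`) are hypothetical, and transforming one sample by the empirical CDF of the other is assumed here simply as one way to reduce the two-sample comparison to a check of uniformity. Each statistic is the sample mean of a product of ±1-coded binary digits, so under uniformity every entry should be close to zero.

```python
import itertools
import numpy as np

def binary_bits(u, depth):
    """First `depth` binary digits of each u in [0, 1), coded as -1/+1."""
    u = np.clip(np.asarray(u, dtype=float), 0.0, np.nextafter(1.0, 0.0))
    cells = np.floor(u * 2**depth).astype(int)   # dyadic cell index, 0 .. 2^depth - 1
    shifts = np.arange(depth - 1, -1, -1)        # most significant digit first
    bits = (cells[:, None] >> shifts) & 1
    return 2 * bits - 1

def symmetry_statistics(u, depth):
    """Mean product of +/-1 digits for every nonempty subset of the first `depth` digits."""
    bits = binary_bits(u, depth)
    stats = {}
    for k in range(1, depth + 1):
        for subset in itertools.combinations(range(depth), k):
            stats[subset] = float(np.mean(np.prod(bits[:, list(subset)], axis=1)))
    return stats

def ecdf_transform(x, y):
    """Map y through the empirical CDF of x (illustrative assumption), landing in [0, 1)."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    ranks = np.searchsorted(x_sorted, np.asarray(y, dtype=float), side="right")
    return ranks / (len(x_sorted) + 1)

# Toy usage: y lies entirely above x, so every transformed value falls in the top cell.
x = np.arange(100)
y = np.arange(100, 150)
shift_stats = symmetry_statistics(ecdf_transform(x, y), depth=2)
```

In this toy case the entry for the first digit, `shift_stats[(0,)]`, is far from zero, which is the kind of signal one would read as a location difference. A depth of `d` yields `2**d - 1` statistics, one per nonempty subset of digits.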