Conformal Prediction for Text Infilling and Part-of-Speech Prediction
Volume 1, Issue 1 (2023), pp. 69–83
Published online: 5 October 2022
Type: Machine Learning and Data Mining
Open Access
Accepted: 18 August 2022
Abstract
Modern machine learning algorithms are capable of providing remarkably accurate point predictions; however, questions remain about their statistical reliability. Unlike conventional machine learning methods, conformal prediction algorithms return confidence sets (i.e., set-valued predictions) that correspond to a given significance level. Moreover, these confidence sets are valid in the sense that they guarantee finite-sample control of the Type I error probability, allowing the practitioner to choose an acceptable error rate. In this paper, we propose inductive conformal prediction (ICP) algorithms for the tasks of text infilling and part-of-speech (POS) prediction for natural language data. We construct new ICP-enhanced algorithms for POS tagging based on BERT (bidirectional encoder representations from transformers) and BiLSTM (bidirectional long short-term memory) models. For text infilling, we design a new ICP-enhanced BERT algorithm. We analyze the performance of the algorithms in simulations using the Brown Corpus, which contains over 57,000 sentences. Our results demonstrate that the ICP algorithms produce valid set-valued predictions that are small enough to be useful in practice. We also provide a real-data example showing how our proposed set-valued predictions can improve machine-generated audio transcriptions.
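To make the coverage guarantee concrete, below is a minimal sketch of inductive (split) conformal prediction for classification in Python. It assumes access to a trained model's class probabilities (for the POS taggers described above, these would be per-token softmax outputs over tags); the names calib_probs, calib_labels, and alpha, and the particular nonconformity score (one minus the probability assigned to the true label), are illustrative assumptions, not the paper's exact construction.

# Minimal sketch of inductive (split) conformal prediction for classification.
# Names and the nonconformity score are illustrative, not the paper's method.
import numpy as np

def icp_threshold(calib_probs, calib_labels, alpha=0.1):
    # calib_probs: (n, k) predicted class probabilities on a held-out calibration set.
    # calib_labels: (n,) true class indices. alpha: target error rate (0.1 -> 90% coverage).
    n = len(calib_labels)
    # Nonconformity score: one minus the probability assigned to the true label.
    scores = 1.0 - calib_probs[np.arange(n), calib_labels]
    # Finite-sample-corrected empirical quantile of the calibration scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def predict_set(test_probs, q):
    # The prediction set contains every label whose nonconformity score is at most q.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

# Toy usage with random probabilities over 5 classes.
rng = np.random.default_rng(0)
calib_probs = rng.dirichlet(np.ones(5), size=200)
calib_labels = rng.integers(0, 5, size=200)
q = icp_threshold(calib_probs, calib_labels, alpha=0.1)
print(predict_set(rng.dirichlet(np.ones(5), size=3), q))

Under exchangeability of the calibration and test data, sets built this way cover the true label with probability at least 1 - alpha; shrinking alpha enlarges the sets, and the question studied in the paper is how small the sets remain at practical error rates.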
References
Cauchois, M., Gupta, S. and Duchi, J. C. Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. Journal of Machine Learning Research 22(81), 1–42 (2021). MR4253774.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014). arXiv preprint 1406.1078.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding (2019). arXiv preprint 1810.04805.
Fisch, A., Schuster, T., Jaakkola, T. and Barzilay, R. Few-shot conformal prediction with auxiliary tasks (2021). arXiv preprint 2102.08898.
Gers, F. A., Schraudolph, N. N. and Schmidhuber, J. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3, 115–143 (2002). https://doi.org/10.1162/153244303768966139. MR1966056.
Goldberg, Y. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57, 345–420 (2016). https://doi.org/10.1613/jair.4992. MR3584073.
Hu, Y., Huber, A., Anumula, J. and Liu, S.-C. Overcoming the vanishing gradient problem in plain recurrent networks (2018). arXiv preprint 1801.06105.
Jurafsky, D. and Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition 3rd ed. (2021). https://web.stanford.edu/~jurafsky/slp3/.
Ling, W., Dyer, C., Black, A. W., Trancoso, I., Fermandez, R., Amir, S., Marujo, L. and Luís, T. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1520–1530 (2015).
Mortier, T., Wydmuch, M., Hüllermeier, E., Dembczynski, K. and Waegeman, W. Efficient algorithms for set-valued prediction in multi-class classification (2019). arXiv preprint 1906.08129. https://doi.org/10.1007/s10618-021-00751-x. MR4277133.
Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P. and Allen, J. A corpus and evaluation framework for deeper understanding of commonsense stories (2016). arXiv preprint 1604.01696.
Olah, C. Understanding LSTM networks (2015). http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Pascanu, R., Mikolov, T. and Bengio, Y. Understanding the exploding gradient problem (2012). arXiv preprint 1211.5063.
Pennington, J., Socher, R. and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543 (2014). https://doi.org/10.3115/v1/D14-1162.
Plank, B., Søgaard, A. and Goldberg, Y. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of ACL 2016, Association for Computational Linguistics (ACL) (2016). https://doi.org/10.18653/v1/P16-2067.
Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research 9, 371–421 (2008). MR2417240.
Vovk, V., Gammerman, A. and Shafer, G. Algorithmic Learning in a Random World. Springer (2005). MR2161220.
Wang, P., Qian, Y., Soong, F. K., He, L. and Zhao, H. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network (2015). arXiv preprint 1510.06168.
Zhu, W., Hu, Z. and Xing, E. Text infilling (2019). arXiv preprint 1901.00158.