The New England Journal of Statistics in Data Science


Conformal Prediction for Text Infilling and Part-of-Speech Prediction
Volume 1, Issue 1 (2023), pp. 69–83
Neil Dey, Jing Ding, Jack Ferrell, et al. (7 authors)

https://doi.org/10.51387/22-NEJSDS8
Pub. online: 5 October 2022 · Type: Methodology Article · Open Access
Area: Machine Learning and Data Mining

Accepted: 18 August 2022
Published: 5 October 2022

Abstract

Modern machine learning algorithms are capable of providing remarkably accurate point predictions; however, questions remain about their statistical reliability. Unlike conventional machine learning methods, conformal prediction algorithms return confidence sets (i.e., set-valued predictions) that correspond to a given significance level. Moreover, these confidence sets are valid in the sense that they guarantee finite-sample control over Type I error probabilities, allowing the practitioner to choose an acceptable error rate. In this paper, we propose inductive conformal prediction (ICP) algorithms for the tasks of text infilling and part-of-speech (POS) prediction for natural language data. We construct new ICP-enhanced algorithms for POS tagging based on BERT (bidirectional encoder representations from transformers) and BiLSTM (bidirectional long short-term memory) models, and we design a new ICP-enhanced BERT algorithm for text infilling. We analyze the performance of the algorithms in simulations using the Brown Corpus, which contains over 57,000 sentences. Our results demonstrate that the ICP algorithms produce valid set-valued predictions that are small enough to be useful in real-world applications. We also provide a real-data example of how our proposed set-valued predictions can improve machine-generated audio transcriptions.
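
To make the construction concrete, the following is a minimal sketch of the generic split (inductive) conformal wrapper that the abstract describes, applied to a multiclass task such as POS tagging. It is an illustration only, not the paper's exact algorithm: the function name, the choice of nonconformity score (one minus the model's probability for the true label), and the default alpha = 0.1 are all assumptions made for this example.

import numpy as np

def icp_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # cal_probs:  (n, K) class probabilities on a held-out calibration set,
    #             e.g. softmax outputs of a fitted BERT or BiLSTM tagger.
    # cal_labels: (n,) true class indices for the calibration examples.
    # test_probs: (m, K) class probabilities for new inputs.
    # alpha:      significance level; the returned sets cover the true label
    #             with probability at least 1 - alpha in finite samples.
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) empirical quantile of the scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # A label enters the prediction set iff its score is at most the threshold.
    return [np.flatnonzero(1.0 - p <= q) for p in test_probs]

Because the calibration data are disjoint from the data used to fit the model, the coverage guarantee holds regardless of how accurate the underlying network is; that model-agnosticism is what allows the same wrapper to sit on top of both BERT and BiLSTM predictors.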



Copyright
© 2023 New England Statistical Society
Open access article under the CC BY license.

Keywords
BERT, BiLSTM, Natural language processing, Set-valued prediction, Uncertainty quantification

Funding
Research reported in this publication was supported by the National Science Foundation and the National Security Agency under Award Numbers 2051010 and H98230-21-1-0014, respectively.

Metrics (since December 2021)

Article info views: 610
Full article views: 683
PDF downloads: 348
XML downloads: 114
