Comparative Analysis of NLP Methods for Emotion Detection in Student Responses During COVID-19
Pub. online: 1 June 2026
Type: Case Study, Application, And/or Practice Article
Open Access
Area: NextGen
Accepted: 19 May 2026
Published: 1 June 2026
Abstract
Natural language processing (NLP) algorithms have demonstrated significant capabilities in understanding responses to open-ended questions in survey data. However, the reliability and uncertainty of these methods on this task have not yet been thoroughly investigated. To address this gap, this paper presents a comparative analysis of NLP methods for detecting fine-grained emotions in student responses about their mental health during the COVID-19 pandemic. The evaluated methods include a lexicon-based approach, the bag-of-words (BoW) model, Term Frequency-Inverse Document Frequency (TF-IDF), a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, MentalBERT, and OpenAI’s GPT-3.5. We assess how accurately these models classify emotions into predetermined categories, using accuracy and F1 score as performance metrics. Model stability and discriminative ability are quantified through repeated cross-validation and the Area Under the Receiver Operating Characteristic Curve (AUC), and the consistency of emotion detection across models is also evaluated. The study highlights that the effectiveness of NLP methods for mental health analysis may vary depending on the emotion being analyzed, and that their stability and uncertainty require thorough examination. Our work can provide guidance for data scientists on applying NLP methods to survey data, particularly for understanding survey respondents’ emotions.
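To make the three metrics named in the abstract concrete, the following is a minimal sketch of how accuracy, F1 score, and AUC could be computed for a single emotion category treated as a binary label. It is not the authors' evaluation code. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of responses whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float], positive: int = 1) -> float:
    """AUC as the probability that a random positive example outscores
    a random negative one, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = response labeled with the emotion, 0 = not.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]            # illustrative model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds. Under repeated cross-validation, these quantities would be recomputed on each held-out fold, and their spread across repetitions gives one view of model stability.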