1 Introduction
In the early 2020s, the lives of university students were significantly impacted by the COVID-19 pandemic. The abrupt transition to online learning, loss of contact with family and friends, and social and financial uncertainty had a considerable impact on students’ mental health. Many surveys [35, 42, 9, 18] have been conducted in colleges and universities to assess the pandemic’s impact and long-term effects. These surveys typically included two types of questions: quantitative questions that measure agreement or disagreement with specific statements, and open-ended questions that invite students to describe their personal feelings and experiences. Traditional survey analysis [24] often focuses on the quantitative responses, as these numerical responses can be easily analyzed using mathematical and statistical models. However, the responses to open-ended questions can provide more detailed and personalized insights into students’ mental health, which in turn make it possible to study the reasons behind these impacts. To analyze this information in large data sets, it is necessary to apply Natural Language Processing (NLP) techniques [23] to detect emotions within the students’ responses.
Advancements in NLP for emotion detection have been significant over the past few decades. Early researchers developed the Lexicon-based method [38] to create a dictionary that scores the emotions associated with individual words. Subsequently, techniques like the Bag-of-Words (BoW) model [29] and Term Frequency-Inverse Document Frequency (TF-IDF) [32] were introduced to convert texts into numerical vectors. Machine learning algorithms, such as logistic regression [36] and Support Vector Machines (SVM) [34], can then be trained using these vectors to classify texts into different emotion categories. Since 2017, the Transformer architecture [41] in deep learning has achieved remarkable success in NLP. Large Language Models (LLMs) [26], such as the Bidirectional Encoder Representations from Transformers (BERT) [12] and the Generative Pre-trained Transformer (GPT) [30], have demonstrated strong capabilities in various NLP tasks. We can either use these LLMs directly for emotion detection or fine-tune them for specific emotions to further enhance accuracy.
Despite the success of NLP methods, two significant challenges remain when applying them to understand the mental health of college students through open-ended survey questions. First, mental health encompasses a range of nuanced emotions, such as depression, anxiety, stress, and isolation. However, many traditional sentiment analysis methods [24] only consider general positive or negative sentiments in texts. This limitation can lead to the neglect of personal emotions expressed in open-ended survey responses, making it difficult to identify subtle differences in mental health. Second, using NLP to analyze mental health can yield educational and psychological insights; thus, the stability and distinguishing ability of these methods are crucial for reliable studies. There has been some effort to explore differences between such methods in the context of social media text [4, 25] and free response surveys [44, 43, 19] related to COVID. However, most of those studies only focus on the consistency in the general sentiment prediction, not on fine-grained emotion detection [10] that we are concerned with.
To address the challenges of reliable emotion detection, we conduct a comparative analysis of various NLP methods based on a recent study of college student responses during COVID-19 [1]. This study surveyed students at a large mid-Atlantic university in the U.S. during the early months of the COVID-19 pandemic to assess its impact on their mental health. The survey included both traditional quantitative scoring questions and open-ended responses. In their research, Amona et al. [1] carefully annotated ten common emotions—such as isolation, depression, and anxiety—derived from the students’ responses to the question, “How is COVID affecting your mental health?” They then examined how these emotions impacted different subgroups within the student population.
Here, we compare a wide range of NLP methods, including Lexicon-based approaches, BoW, TF-IDF, fine-tuned BERT, and zero-shot GPT, for the automatic detection of emotions in these survey responses. First, we assess the performance of various NLP methods across all identified emotions. We find that despite the complicated Transformer method achieving the best overall performance, simpler methods, such as Lexicon, can effectively identify specific emotions. Next, we evaluate the stability and distinguishing ability of these models, demonstrating the performance-complexity trade-off when applying NLP methods. Finally, we evaluate the detection consistency in emotion detection between NLP methods and the true labels, assessing whether these methods yield similar results or not. This comprehensive study highlights that the effectiveness of employing NLP methods to analyze mental health through survey data varies for the emotions being analyzed. Moreover, for the methods with top overall performance, their stability and uncertainty need to be thoroughly examined. We summarize insights from our experimental studies and offer method selection recommendations for NLP analysis of survey data to guide future data science practices.
The paper is organized as follows: In Section 2, we introduce related work that examines students’ mental health during COVID-19 and advancements in NLP methods for emotion detection. Section 3 outlines three aspects of our methodology: data collection and annotation, the implementation of NLP methods, and the comparison framework. In Section 4, we present our results and discuss the outcomes of emotion detection using NLP methods in relation to mental health. Finally, Section 5 summarizes the conclusions of our study.
2 Related Work
2.1 NLP Methods in Comparison Study
First, we review the NLP methods included in our comparison study. Traditional methods include the Lexicon-based approach, where predefined dictionaries of emotional words are used to identify emotions in text. For example, Mohammad et al. [21] developed the National Research Council of Canada (NRC) Emotion Lexicon (EmoLex), a widely used resource for Lexicon-based emotion detection. EmoLex establishes connections between words and basic emotions, including anger, joy, and sadness. This Lexicon was developed through crowdsourcing, ensuring a diverse and comprehensive set of word-emotion associations. Mohammad [20] later extended this work by adding real-valued intensity scores to emotions to create the NRC Affect Intensity Lexicon (AIL), enabling more fine-grained analysis.
The next school of methods for text classification involves converting sentences to numeric vectors using BoW or TF-IDF and applying machine learning algorithms to them. Sebastiani [33] provided a thorough analysis of these algorithms, highlighting their performance across different datasets and establishing their strengths and limitations in text classification. BoW is a technique that turns text or images into a histogram of words. For text, BoW models convert the input into a matrix of token counts, representing the frequency of each word in the text. This representation is then used as input for machine learning classifiers such as logistic regression, SVM, or Naive Bayes. According to the study in [29], this count-based representation makes BoW computationally simple, and it still scores well on performance tests. Barry [2] used BoW on Amazon and Yelp food reviews to classify whether they were positive or negative. With the best machine learning model, the study achieved an accuracy of over $95\% $. Desmet and Hoste [11] used BoW to detect 15 emotions. Their results varied by emotion, but six of the seven most common emotions had acceptable accuracy.
TF-IDF improves upon BoW by weighting terms based on their importance, calculated as the product of Term Frequency (TF) and Inverse Document Frequency (IDF). Ramos et al. [32] explained that the less frequently a word appears in documents, the greater the weight it should receive. This weighting helps to emphasize significant words while downplaying common ones, enhancing the model’s ability to distinguish between different classes. Rahman et al. [31] conducted sentiment classification by tweaking TF-IDF with various vectorization methods and classifiers. With the correct classifier, they achieved $100\% $ accuracy. Sundaram et al. [37] used TF-IDF for six emotions. For emotions with large training sets, they had an accuracy of about $85\% $.
The advent of deep learning models based on the Transformer architecture, such as BERT and GPT, has revolutionized NLP. BERT, introduced in [12], employs a bidirectional training approach to understand the context of words in a sentence. BERT’s architecture comprises multiple layers of Transformer encoders, enabling it to capture intricate relationships between words. BERT can be fine-tuned for specific tasks, such as emotion detection, which involves additional training on a labeled dataset to optimize the model’s performance for that particular task. Tang et al. [39] further explored the fine-tuning of BERT for multi-label sentiment analysis, showcasing its effectiveness in handling multiple co-occurring emotions under unbalanced class distributions. Ji et al. [17] developed MentalBERT, a BERT-based model fine-tuned on mental health-related text, demonstrating significant improvements in understanding and classifying emotional content compared to standard BERT models.
GPT models [30, 27], such as GPT-3.5, leverage generative pre-training on a vast corpus of text to generate human-like responses. Floridi and Chiriatti [13] explained that such models will transform the writing process and are capable of producing texts on the level of some humans. These models can be adapted for emotion detection by fine-tuning them on specific datasets or using prompt engineering to elicit desired outputs. Jain et al. [16] used two GPT models for emotion detection, achieving an accuracy score of 0.98 over the mental health datasets they tested it on. The BERT and GPT models show much promise, with the GPT models being the most cutting-edge technology available.
2.2 Fine-Grained Emotion Detection
For this study, multiple fine-grained emotions related to mental health, such as isolation, anxiety, and depression, need to be detected from students’ responses to the open-ended question. Bouazizi et al. [5] tackled the challenging task of multi-class emotion detection on Twitter posts, achieving $60.2\% $ accuracy for seven emotion classes. Their study emphasized the complexity of multi-class classification and proposed a model to better extract and understand emotions present in text rather than classifying them into predefined categories. The authors introduced a system that first classifies text as positive or negative and then assigns scores for corresponding emotion subclasses, improving the robustness and accuracy of emotion classification. Demszky et al. [10] created a labeled dataset of 58k comments for 27 emotions, including gratitude, confusion, and remorse. They also trained a BERT-based model, achieving an F1 score of 0.46. Mustafa et al. [22] leveraged Twitter data and machine learning to classify depression severity, achieving $91\% $ accuracy by analyzing the top 100 words used by individuals and their psychological attributes. The study highlighted the importance of feature selection in enhancing classifier performance and proposed incorporating additional data, such as emojis and images, to improve future analyses. Guo et al. [14] introduced a multi-way matching deep neural network model for fine-grained emotion detection of user reviews. Their approach predicted scores for specific attributes within reviews, such as location, service, price, and environment. The model consists of two steps: attribute detection and attribute classification. In the first step, the model identifies the relevant attributes mentioned in the text. In the second step, it assigns a score ranging from $-5$ to 5 for each emotion, reflecting the user’s opinion.
This fine-grained analysis offers a more detailed understanding of user emotions by focusing on specific aspects of their reviews, demonstrating that NLP methods can effectively distinguish between various emotion categories.
3 Methodology
3.1 Data Collection and Labeling
The dataset has been previously studied in [1] and [7], and was collected from students at a large mid-Atlantic university in the U.S. between April and June 2020. We focus on one part of the collected data containing short-answer responses from students concerning the impact of COVID-19 on their mental health. The students had to answer the question, “How is coronavirus/COVID-19 affecting your mental health?” The responses are labeled manually by our research team with various emotional indicators, which serve as the ground truth for our emotion detection models. Each response is annotated with binary labels for 10 emotions: Isolation, Depression, Anxiety, Negative Feelings, Lack of Motivation, All Stress, Issues With Home Life, No/Positive Effects, Lack of Routine, Miscellaneous. The binary labels (1 if positive, 0 if negative) indicate whether the labelers believed the respondent’s answers expressed the corresponding emotions.
The ten categories of emotions are inspired by [6] and [35]. In Appendix A, we provide an example for each emotion, and explain our definitions of Negative Feelings, All Stress, and Miscellaneous. A response could be labeled with more than one emotion category. To ensure the labeling quality, two team members collaboratively categorized emotions for each response. For any questionable answers, they consulted additional team members to reach a consensus on labeling. In preparation for our analysis, the data were cleaned by removing responses with no emotion detected, which typically occurred when there was no response or only random characters. We also removed responses that lacked demographic information to facilitate future analysis. This left 398 responses in the dataset. The percentages of the remaining responses labeled as 1 for each emotion, ordered from largest to smallest, are illustrated in Figure 1.
Figure 1
The proportions of the responses expressing the corresponding emotions from the human labeling results. The ten emotions are ordered from the largest to the smallest.
In Figure 2, we analyze correlation and hierarchical clustering for all emotions. We find that most of the emotion pairs have near-zero correlations. The clustering analysis shows that “Depression” and “Anxiety” are the closest emotions. However, their correlation is only 0.3. Other close emotions also show small correlation coefficients of $\sim 0.1$. Thus, we simplify this multiple-label classification problem into 10 binary classification problems. For every NLP method, we train ten models, each using labels for a single emotion. This simplification will provide a fair method-comparison framework.
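The decomposition into per-emotion binary problems, together with the pairwise correlation check that motivates it, can be sketched as follows. This is a minimal illustration on a made-up label matrix, not the study's data; the three emotion columns, the toy rows, and the helper name `pearson` are assumptions introduced only for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Toy multi-label matrix (assumption): one row per response, one column per emotion.
emotions = ["Depression", "Anxiety", "Isolation"]
labels = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

# Split the multi-label problem into one binary target vector per emotion.
targets = {e: [row[j] for row in labels] for j, e in enumerate(emotions)}

# Pairwise correlations between the binary label vectors.
for i, a in enumerate(emotions):
    for b in emotions[i + 1:]:
        print(f"{a} vs {b}: r = {pearson(targets[a], targets[b]):.2f}")
```

Each entry of `targets` would then serve as the label vector for one of the ten independent binary classifiers.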
3.2 Emotion Detection Using NLP
3.2.1 Text Preprocessing
We follow the common steps in NLP [23] to preprocess the students’ responses in all the following methods, except GPT, which takes the original text as input. The preprocessing steps include:
• Text cleaning: Removal of special characters, numbers, and extraneous whitespace.
• Tokenization: Splitting text into individual words or tokens.
• Lowercasing: Converting all text to lowercase to ensure uniformity.
• Stop words removal: Removing common words that do not contribute to emotional meaning, such as “and,” “the,” etc.
• Lemmatization: Reducing words to their base or root form.
After the preprocessing, the tokenized text will serve as input to the following NLP models to detect emotions expressed in the responses.
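The five preprocessing steps can be sketched in a few lines of plain Python. This is only an illustration of the order of operations: the stop-word list and the lemma lookup table below are tiny made-up stand-ins (assumptions), whereas a real pipeline would rely on a full stop-word list and a proper lemmatizer.

```python
import re

# Assumption: a tiny illustrative stop-word list; real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "am", "are", "i", "my", "to", "of", "it"}

# Assumption: a toy lookup table standing in for a real lemmatizer.
LEMMAS = {"feeling": "feel", "classes": "class", "friends": "friend", "stressed": "stress"}

def preprocess(text):
    # Text cleaning: replace non-letters with spaces, then collapse extra whitespace.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Tokenization: split on whitespace.
    tokens = cleaned.split()
    # Lowercasing.
    tokens = [t.lower() for t in tokens]
    # Stop words removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization (toy lookup; unknown words are kept unchanged).
    return [LEMMAS.get(t, t) for t in tokens]

example_tokens = preprocess("I am feeling stressed, and I miss my 3 friends!")
print(example_tokens)
```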
3.2.2 Lexicon-Based Method
The Lexicon-based model uses a custom dictionary created from a human-encoded text dataset. After the preprocessing, the word frequencies are calculated to understand the distribution of terms within the dataset. Words are then scored based on their association with emotion labels, using metrics such as pointwise mutual information (PMI) to quantify the strength of association between words and emotions.
For a response r with words ${w_{1}},{w_{2}},\dots ,{w_{n}}$, its probability of including the emotion ${\textbf{E}_{j}}$ is:
\[ p(r,{\textbf{E}_{j}})=\sigma \Big\{{\sum \limits_{i=1}^{n}}\text{Score}({w_{i}},{\textbf{E}_{j}})+\text{Intercept}({\textbf{E}_{j}})\Big\},\tag{3.1}\]
where $\text{Score}({w_{i}},{\textbf{E}_{j}})$ represents the score of word ${w_{i}}$ for emotion ${\textbf{E}_{j}}$, $\text{Intercept}({\textbf{E}_{j}})$ is the intercept for emotion ${\textbf{E}_{j}}$, and $\sigma \{\cdot \}$ is the Sigmoid function. The Lexicon model will conclude that a response expresses an emotion when its predicted probability is at least 0.50.
Our study implemented the Lexicon-based model using R’s SentimentAnalysis package [28]. All responses were scanned to create a custom dictionary for each emotion. The scores of top words for the three typical emotions identified in Section 4.1, as well as their intercepts, are presented in Table 1. We find that most of the words with coefficients different from 0 are explainable. For the “Depression” emotion, “depress” is associated with positive instances, while “routin”, “tend”, and “schedul” are associated with negative instances. For “Lack of Motivation”, “motiv”, “focus”, and “bed” are associated with positive ones. For “Miscellaneous”, the method only finds two words. The word “sleep” shows a clear positive association, indicating the labelers put the sleep issues in this category. On the other hand, the words with near-zero coefficients are less explainable. They might be introduced in the dictionaries due to random sampling of positive/negative instances.
Table 1
The word scores and intercepts of three typical emotions in the customized dictionary for our dataset.
| Depression | | Lack of Motivation | | Miscellaneous | |
| Word | Score | Word | Score | Word | Score |
| depress | 0.321 | motiv | 0.138 | sleep | 0.028 |
| routin | $-0.033$ | focus | 0.075 | focus | 0.003 |
| tend | $-0.023$ | bed | 0.044 | | |
| schedul | $-0.016$ | there | 0.020 | | |
| becom | $-0.005$ | anymore | 0.012 | | |
| effect | $-0.003$ | cant | 0.001 | | |
| Intercept | 0.036 | Intercept | 0.086 | Intercept | 0.059 |
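To make the scoring rule of Equation (3.1) concrete, the sketch below applies it to the Depression column of Table 1. It is a plain-Python illustration of the formula as stated in the text, not a reproduction of the SentimentAnalysis package; the function names and the two token lists are assumptions made for this example, and words absent from the dictionary are assumed to contribute a score of 0.

```python
from math import exp

# Depression column of Table 1 (stemmed words and their scores).
DEPRESSION_SCORES = {
    "depress": 0.321, "routin": -0.033, "tend": -0.023,
    "schedul": -0.016, "becom": -0.005, "effect": -0.003,
}
DEPRESSION_INTERCEPT = 0.036

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def lexicon_probability(tokens, scores, intercept):
    # Assumption: words missing from the dictionary contribute a score of 0.
    total = sum(scores.get(t, 0.0) for t in tokens)
    return sigmoid(total + intercept)

def lexicon_predict(tokens, scores, intercept, threshold=0.5):
    # A response is flagged when its probability is at least the threshold.
    return int(lexicon_probability(tokens, scores, intercept) >= threshold)

# Hypothetical preprocessed responses.
tokens_a = ["feel", "depress", "lone"]
tokens_b = ["routin", "schedul", "tend", "chang"]

p_a = lexicon_probability(tokens_a, DEPRESSION_SCORES, DEPRESSION_INTERCEPT)
p_b = lexicon_probability(tokens_b, DEPRESSION_SCORES, DEPRESSION_INTERCEPT)
label_a = lexicon_predict(tokens_a, DEPRESSION_SCORES, DEPRESSION_INTERCEPT)
label_b = lexicon_predict(tokens_b, DEPRESSION_SCORES, DEPRESSION_INTERCEPT)
print(round(p_a, 3), label_a, round(p_b, 3), label_b)
```

Because the dictionary scores are small, the predicted probabilities stay close to 0.5, so the sign of the summed score plus intercept effectively decides the label.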
3.2.3 BoW and TF-IDF
The BoW [29] and TF-IDF [32] methods convert the tokenized text into a vector or matrix, and then train machine learning models for emotion detection. The BoW model transforms text into a matrix of token counts. Each response is represented as a vector v indicating the frequency of each word in the text. For a response r with words ${w_{1}},{w_{2}},\dots ,{w_{n}}$, the vector representation ${\mathbf{v}_{\text{BoW}}}(r)$ is given by:
\[ {\mathbf{v}_{\text{BoW}}}(r)=\big[f({w_{1}},r),f({w_{2}},r),\dots ,f({w_{n}},r)\big],\]
where $f({w_{i}},r)$ is the frequency of word ${w_{i}}$ in response r.
The TF-IDF model improves upon the BoW model by weighing terms based on their importance. The term frequency (TF) measures how often a word appears in a document, while the inverse document frequency (IDF) measures how unique or rare a word is across all documents. The TF-IDF score for a word w in response r is calculated as:
\[ \text{TF-IDF}(w,r)=\text{TF}(w,r)\times \text{IDF}(w),\]
where:
\[ \text{TF}(w,r)=\frac{f(w,r)}{{\textstyle\sum _{{w^{\prime }}\in r}}f({w^{\prime }},r)}\]
and
\[ \text{IDF}(w)=\log \left(\frac{N}{|\{r\in R:w\in r\}|}\right),\]
where $f(w,r)$ is the frequency of word w in response r, N is the total number of responses, and $|\{r\in R:w\in r\}|$ is the number of responses containing the word w. Finally, for a response r with words ${w_{1}},{w_{2}},\dots ,{w_{n}}$, the vector representation ${\mathbf{v}_{\text{TF-IDF}}}(r)$ is given by:
\[ {\mathbf{v}_{\text{TF-IDF}}}(r)=\big[\text{TF-IDF}({w_{1}},r),\text{TF-IDF}({w_{2}},r),\dots ,\text{TF-IDF}({w_{n}},r)\big].\]
After transforming each response into a vector, we adopt machine learning methods to train classifiers to detect each emotion. In this study, we consider two methods: logistic regression [36] and SVM [34]. Logistic regression predicts the probability of a class by applying a logistic function to a linear combination of input features, whereas SVM finds the hyperplane that best separates the data into classes by maximizing the margin between the classes. With the kernel method [15], those linear classifiers can be extended for non-linear classification. However, the performance of non-linear classifiers depends on the careful choice of kernels and their hyper-parameters for specific problems and datasets. To avoid excessive parameter tuning, we only consider the linear classifiers in the BoW and TF-IDF methods. To handle imbalanced data, we can also use the Synthetic Minority Over-sampling Technique (SMOTE) [8] to generate synthetic samples for the minority class, balancing the dataset. After conducting preliminary experiments, we employ the SVM with a linear kernel for the BoW method and logistic regression with SMOTE for the TF-IDF method, as these combinations provide generally better accuracy across different emotions.
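The two vector representations described in this subsection can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the study: vectors are indexed by a shared, sorted vocabulary; TF is normalized by the response length; IDF uses the natural logarithm with no smoothing; and the three toy responses are made up.

```python
from collections import Counter
from math import log

def build_vocab(docs):
    # Shared vocabulary across all responses, in a fixed (sorted) order.
    return sorted({w for doc in docs for w in doc})

def bow_vector(doc, vocab):
    # Raw token counts f(w, r) for every vocabulary word.
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def idf_weights(docs, vocab):
    n = len(docs)
    # Every vocabulary word occurs in at least one response, so the count is >= 1.
    return {w: log(n / sum(1 for d in docs if w in d)) for w in vocab}

def tfidf_vector(doc, vocab, idf):
    counts = Counter(doc)
    total = len(doc)
    # Assumption: TF is the count divided by the number of tokens in the response.
    return [(counts[w] / total) * idf[w] if total else 0.0 for w in vocab]

# Toy preprocessed responses (assumption).
docs = [["feel", "alone", "alone"], ["feel", "anxious"], ["no", "motivation"]]
vocab = build_vocab(docs)
idf = idf_weights(docs, vocab)

bow0 = bow_vector(docs[0], vocab)
tfidf0 = tfidf_vector(docs[0], vocab, idf)
print(vocab)
print(bow0)
print([round(x, 3) for x in tfidf0])
```

In this toy example, "alone" receives a larger TF-IDF weight than "feel" in the first response because it is both more frequent there and rarer across responses. The resulting vectors are what a linear SVM or logistic regression would take as input.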
3.2.4 Fine-Tuned MentalBERT
BERT [12] is a Transformer-based model designed to understand the context of words in a sentence through bidirectional training. MentalBERT [17] is a pre-trained BERT model specialized in mental health-related text. We first load the pretrained network “mental-bert-base-uncased”. Then, we fine-tune MentalBERT using our dataset to tailor it to the emotions related to student mental health during the COVID-19 pandemic. The model was trained for 5 epochs with a batch size of 16, a learning rate of $2\times {10^{-5}}$, and a maximum sequence length of 128 tokens. The fine-tuning process adjusts the pre-trained model’s parameters to minimize the loss on the training data using the true emotion labels in our dataset.
3.2.5 Zero-Shot GPT
GPT [30], a generative LLM, has revolutionized NLP and artificial intelligence since ChatGPT was introduced in 2022. In this study, we utilize OpenAI’s GPT-3.5 [27] to generate emotion predictions for each response, as a baseline method. Given its ability to understand and generate human-like text, GPT-3.5 can be prompted with the students’ responses and asked whether the input paragraph expresses specific emotions. The predictions are then mapped to the binary labels for further evaluation. The details of GPT’s prompts are listed in Appendix B. We do not input the human labeling into the GPT prompt; thus, the method can be considered a zero-shot one. The experiment was conducted using GPT-3.5-turbo, the July 2024 version.
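The step of mapping GPT's free-text answers to binary labels can be sketched as below. The prompt wording here is hypothetical (the prompts actually used are listed in Appendix B), no API call is made, and treating any answer that does not start with "yes" as a negative label is an assumption of this sketch.

```python
def build_prompt(response, emotion):
    # Hypothetical wording; the prompts actually used are listed in Appendix B.
    return (
        f"Does the following student response express '{emotion}'? "
        "Answer with Yes or No only.\n\n"
        f"Response: {response}"
    )

def map_to_label(answer):
    """Map a free-text model answer to a binary label (1 = emotion present)."""
    # Assumption: anything that does not start with "yes" counts as 0.
    return 1 if answer.strip().lower().startswith("yes") else 0

prompt = build_prompt("I feel anxious about everything lately.", "Anxiety")
print(prompt)
print(map_to_label(" Yes."), map_to_label("No, it does not."))
```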
In Table 2, we summarize the training complexity of the five methods compared in this paper.
Table 2
The model complexity comparison of the five methods we compare in this study.
| Model | Complexity |
| Lexicon | Build a customized dictionary for each emotion, usually including around 10 words in our application. |
| BoW | Convert each instance to a vector with maximum length 1000, and train a linear SVM (parameters < 1k). |
| TF-IDF | Convert each instance to a vector with maximum length 1000, and train a logistic regression (parameters < 1k). |
| MentalBERT | Pretrained 110M parameters, fine-tuning for each emotion. |
| GPT-3.5 | Pretrained 20B parameters, training is not needed. |
3.3 Comparison Framework
3.3.1 Performance Criterion
To compare the performance of the different models, we use several standard evaluation metrics: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Accuracy, Precision, Recall, F1 Score, Receiver Operating Characteristic (ROC) Curve, and Area Under the ROC Curve (AUC). For a response r, we set ${y_{r}}({\textbf{E}_{j}})=1$ if human labeling considers that it includes the emotion ${\textbf{E}_{j}}$, otherwise ${y_{r}}({\textbf{E}_{j}})=0$. Then, for an NLP method, its prediction result is ${\hat{y}_{r}}({\textbf{E}_{j}})$. The first four metrics are calculated as:
\[\begin{aligned}{}{\text{TP}_{{\textbf{E}_{j}}}}& ={\sum \limits_{r=1}^{R}}\mathbf{1}\Big\{{y_{r}}({\textbf{E}_{j}})={\hat{y}_{r}}({\textbf{E}_{j}})=1\Big\}\\ {} {\text{FP}_{{\textbf{E}_{j}}}}& ={\sum \limits_{r=1}^{R}}\mathbf{1}\Big\{{y_{r}}({\textbf{E}_{j}})\ne 1\hspace{2.5pt}\text{and}\hspace{2.5pt}{\hat{y}_{r}}({\textbf{E}_{j}})=1\Big\}\\ {} {\text{FN}_{{\textbf{E}_{j}}}}& ={\sum \limits_{r=1}^{R}}\mathbf{1}\Big\{{y_{r}}({\textbf{E}_{j}})=1\hspace{2.5pt}\text{and}\hspace{2.5pt}{\hat{y}_{r}}({\textbf{E}_{j}})\ne 1\Big\}\\ {} {\text{TN}_{{\textbf{E}_{j}}}}& ={\sum \limits_{r=1}^{R}}\mathbf{1}\Big\{{y_{r}}({\textbf{E}_{j}})={\hat{y}_{r}}({\textbf{E}_{j}})\ne 1\Big\},\end{aligned}\]
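where $\mathbf{1}\{\cdot \}$ denotes an indicator function. Then, accuracy measures the proportion of correct predictions (both TP and TN) out of the total number of predictions, calculated as:
\[ {\text{Accuracy}_{{\textbf{E}_{j}}}}=\frac{{\text{TP}_{{\textbf{E}_{j}}}}+{\text{TN}_{{\textbf{E}_{j}}}}}{{\text{TP}_{{\textbf{E}_{j}}}}+{\text{TN}_{{\textbf{E}_{j}}}}+{\text{FP}_{{\textbf{E}_{j}}}}+{\text{FN}_{{\textbf{E}_{j}}}}}.\]
Precision measures the proportion of true positive predictions out of all positive predictions (TP and FP), calculated as:
\[ {\text{Precision}_{{\textbf{E}_{j}}}}=\frac{{\text{TP}_{{\textbf{E}_{j}}}}}{{\text{TP}_{{\textbf{E}_{j}}}}+{\text{FP}_{{\textbf{E}_{j}}}}}.\]
Recall measures the proportion of true positive predictions out of all actual positive cases (TP and FN), calculated as:
\[ {\text{Recall}_{{\textbf{E}_{j}}}}=\frac{{\text{TP}_{{\textbf{E}_{j}}}}}{{\text{TP}_{{\textbf{E}_{j}}}}+{\text{FN}_{{\textbf{E}_{j}}}}}.\]
Finally, the F1 Score is the harmonic mean of Precision and Recall, providing a single metric that balances both concerns, calculated as:
\[ {\text{F1}_{{\textbf{E}_{j}}}}=\frac{2\times {\text{Precision}_{{\textbf{E}_{j}}}}\times {\text{Recall}_{{\textbf{E}_{j}}}}}{{\text{Precision}_{{\textbf{E}_{j}}}}+{\text{Recall}_{{\textbf{E}_{j}}}}}.\]
The Receiver Operating Characteristic (ROC) Curve is a graphical representation of a model’s diagnostic ability. It plots the True Positive Rate (Recall) against the False Positive Rate (FPR) with different thresholds, where FPR is defined as:
\[ {\text{FPR}_{{\textbf{E}_{j}}}}=\frac{{\text{FP}_{{\textbf{E}_{j}}}}}{{\text{FP}_{{\textbf{E}_{j}}}}+{\text{TN}_{{\textbf{E}_{j}}}}}.\]
Then, we can calculate the Area Under the ROC Curve (AUC), which quantifies the model’s overall ability to discriminate between positive and negative classes. A higher AUC indicates a better-performing model.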
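The evaluation metrics of this subsection can be computed with a short plain-Python sketch. The labels and predicted probabilities below are made up for illustration. Two choices are assumptions of the sketch: ratios with a zero denominator are reported as 0, and AUC is computed through its rank interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y != 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p != 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y != 1 and p != 1)
    return tp, fp, fn, tn

def safe_div(a, b):
    # Assumption: undefined ratios (zero denominator) are reported as 0.
    return a / b if b else 0.0

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }

def auc(y_true, scores):
    # Rank-based AUC; assumes both classes are present in y_true.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (assumption).
y_true = [1, 1, 0, 0, 1, 0]
probs = [0.9, 0.4, 0.6, 0.2, 0.7, 0.1]
y_pred = [int(p >= 0.5) for p in probs]
result = metrics(y_true, y_pred)
auc_value = auc(y_true, probs)

# A classifier that always predicts "negative" on a rare emotion.
rare_true = [1] + [0] * 19
rare_result = metrics(rare_true, [0] * 20)
print(result, auc_value, rare_result)
```

The last example shows why accuracy and F1 should be read together: on a rare emotion, an all-negative predictor reaches high accuracy while its F1 score is zero.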
3.3.2 Comparison Steps
To give a comprehensive comparison among the NLP methods in the emotion detection from students’ responses, our study includes the following three steps:
1. Compare the emotion detection performance of the five NLP methods. We use the accuracy and F1 scores as the criteria. For the four trainable methods, i.e., Lexicon, BoW, TF-IDF, and MentalBERT, we vary the size of the training data using $20\% $, $50\% $, and $80\% $ of the entire dataset, to evaluate how the sample size affects the performance. We repeat the training/testing splits 100 times and report the average accuracy and F1 scores on the testing set. The zero-shot method, GPT-3.5, is used as the baseline.
2. Evaluate the stability and distinguishing ability of the NLP methods. We perform 5-fold stratified cross-validation [45] 100 times with different data separations. In each stratified cross-validation, the data are split into five stratified folds, ensuring proportional representation of the positive/negative instances. The model is trained on four folds and tested on the remaining fold to evaluate performance. The process is repeated five times, once for each fold as the testing set. By doing so, we reduce variability caused by differences in the proportions of positive/negative instances between the training and testing datasets. The standard deviations of the accuracy and F1 scores are calculated to assess whether the model’s performance is sensitive to data splitting. Then, we obtain the predicted probabilities for emotions and calculate the average AUC for each model across 100 stratified cross-validations to assess their distinguishing abilities.
3. Show the consistency of the detection results among the five NLP methods for different emotions. We conduct pairwise comparisons to determine whether the detection results of one approach are consistent with those of another. This analysis highlights the similarities and differences among the five NLP methods.
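The repeated stratified cross-validation in step 2 can be sketched in plain Python as follows. This is an illustrative outline, not the study's code: `score_fn` is a hypothetical callback that would train a model on the training indices and return a score on the test indices, the majority-class baseline stands in for a real model, and pooling the per-fold scores before taking the mean and standard deviation is an assumption about how the variability is summarized.

```python
import random
from statistics import mean, stdev

def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices, each with roughly the same share of positives."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def repeated_cv_scores(labels, score_fn, k=5, repeats=100):
    scores = []
    for rep in range(repeats):
        # A different seed per repetition gives a different data separation.
        folds = stratified_folds(labels, k=k, seed=rep)
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            scores.append(score_fn(train_idx, test_idx))
    # Assumption: per-fold scores are pooled across repetitions.
    return mean(scores), stdev(scores)

# Toy labels (assumption): 20 positive and 80 negative responses.
labels = [1] * 20 + [0] * 80

def majority_accuracy(train_idx, test_idx):
    # Hypothetical stand-in for "train a model, then score it on the test fold".
    majority = int(2 * sum(labels[i] for i in train_idx) >= len(train_idx))
    return mean([int(labels[i] == majority) for i in test_idx])

folds = stratified_folds(labels, k=5, seed=0)
avg, sd = repeated_cv_scores(labels, majority_accuracy, k=5, repeats=3)
print([sum(labels[i] for i in f) for f in folds], avg, sd)
```

Because every fold keeps the same proportion of positives, the baseline's accuracy does not vary across folds in this toy case; with a real model, the standard deviation reflects sensitivity to which responses are used for training.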
4 Comparison Results and Discussion
4.1 Detection Performance Comparison
In this section, we examine the performance of four trainable methods – Lexicon, BoW, TF-IDF, and MentalBERT – as well as the zero-shot method, GPT. For the four trainable methods, we consider training data splits of $80\% $, $50\% $, and $20\% $ to evaluate the impact of training sample size. We choose accuracy and F1 scores from the testing data, as defined in Section 3, as the performance criteria. For each percentage, we repeat the training/testing splits 100 times, and report the average performance criteria. The original results are presented in the tables in Appendix C.1.
To compare the performance of the five methods, we show the boxplots of their accuracy (upper panel) and F1 scores (lower panel) with various training percentages in Figure 3. We observe that the training percentage has a limited effect on the Lexicon’s performance. The median F1 score of the Lexicon with a $20\% $ training split is even slightly higher than those with $80\% $ and $50\% $ training percentages. The performance of BoW and TF-IDF shows modest impacts of the training percentage, and their overall trends are similar to each other. Finally, the performance of MentalBERT, especially its F1 score, is significantly impacted by the training percentage. The median F1 score of MentalBERT with $80\% $ is the highest among all NLP methods; however, its median F1 score with $20\% $ is the lowest. A possible reason is that a $20\% $ training split, comprising only about 80 instances, is insufficient to fine-tune the 110M parameters in the MentalBERT model. The zero-shot GPT demonstrates modest performance, similar to BoW and TF-IDF with a $50\% $ training percentage. However, its F1 scores surpass those of BoW, TF-IDF, and MentalBERT with a $20\% $ training percentage. Overall, the Lexicon method is relatively resistant to decreases in the training dataset, while MentalBERT is the most sensitive. Meanwhile, the performance changes in BoW and TF-IDF with varying training set sizes are moderate.
Figure 3
The boxplots of accuracy and F1 scores from the testing data among 10 emotions for the four trainable methods with $80\% $, $50\% $, and $20\% $ training percentages and the zero-shot GPT. The values of the three typical emotions, Depression, Lack of Motivation, and Miscellaneous, are also highlighted.
We examine the performance of those methods for individual emotions. We select three emotions: Depression, Lack of Motivation, and Miscellaneous, which present various levels of detection performance. The emotion with good performance, Depression, can be identified by keywords such as “depression” and “depressed” from the responses. The emotion associated with poor performance, Miscellaneous, is ambiguous and has only a small positive sample size. The moderate one, Lack of Motivation, has a clear definition but requires a comprehensive understanding of the responses to detect it. The accuracies and F1 scores of the three typical emotions of NLP methods with various training percentages are shown in Figure 4. We also highlight the values of the three typical emotions in Figure 3.
Figure 4
The accuracy and F1 score of the four trainable methods with $80\% $, $50\% $, and $20\% $ training data split, and the zero-shot GPT for three typical emotions spanning different levels of detection performance.
For Depression, the four trainable NLP methods with $80\% $ training percentages achieve high performance with an accuracy of $0.9\sim 1.0$ and F1 scores of $0.8\sim 1.0$. Moreover, Lexicon, BoW, and MentalBERT all achieve significant improvements in both accuracy and F1 score compared to GPT, when the training percentage exceeds $50\% $, whereas TF-IDF yields a minor improvement. However, with a $20\% $ training percentage, the performance of TF-IDF and MentalBERT drops dramatically, while the performance of Lexicon and BoW remains stable.
For the emotion with moderate detection difficulty, Lack of Motivation, the four trainable methods, with an $80\% $ training percentage, can achieve higher or similar accuracy or F1 scores compared to the zero-shot GPT. However, the increase between the trainable methods and the GPT is smaller than that for the Depression case. The F1 score of the MentalBERT is superior to that of other methods, demonstrating its capability to understand the context of the responses. When the training percentage reaches $50\% $ and $20\% $, the F1 scores of most trainable methods decrease to levels below or similar to those of GPT. This suggests that we require a sufficiently large dataset to train the model for this emotion.
For the challenging emotion, Miscellaneous, the four trainable methods, along with GPT, achieve high accuracies of around $0.90\sim 0.95$ and low F1 scores of around $0\sim 0.1$. These results arise because the Miscellaneous emotion has very few positive data points, so the NLP methods tend to predict all responses as negative. Because of this issue, increasing the training percentage does not improve the performance of the four trainable methods. Moreover, the zero-shot GPT also fails to recognize this ambiguous category, yielding a near-zero F1 score.
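To illustrate why high accuracy can coexist with a near-zero F1 score, the following minimal Python sketch evaluates a classifier that labels every response as negative. The counts (100 responses, 5 of them positive) and the function name `accuracy_and_f1` are hypothetical and chosen only for illustration; they are not taken from our data or code.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN). Assumption: report 0.0 when the
    # denominator is zero (no true or predicted positives at all).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f1


# Hypothetical, illustrative data: 5 positive responses out of 100.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # the classifier predicts "negative" for every response

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")
```

Under these assumed counts, the all-negative classifier is correct on 95 of 100 responses but has no true positives, so accuracy alone would overstate its usefulness for a rare emotion.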
4.2 Model Stability and Distinguishing Ability
The next step is to evaluate the stability and distinguishing ability of the four trainable methods: Lexicon, BoW, TF-IDF, and MentalBERT. We find the mean and standard deviation of the accuracies and F1 scores from 100 repetitions of stratified 5-fold cross-validation. The means are similar to those with an $80\% $ training percentage in Section 4.1, as the models using 5-fold cross-validation were also fitted from $80\% $ of the data. The standard deviation tells us how the performance measurements change depending on which $80\% $ of the data they are trained with, showing the stability of the methods. Then, we calculate the average AUC of the 100 repetitions, which indicates the model’s ability to separate each emotion. The AUC is close to 1 when a method is capable of identifying all instances of an emotion with very few false positives. At the same time, an AUC of 0.5 means the method distinguishes an emotion no better than a random guess. The original results are presented in the tables in Appendix C.2.
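As a minimal illustration of the two AUC reference points mentioned above, the Python sketch below computes AUC as the fraction of (positive, negative) response pairs in which the positive response receives the higher predicted score, with ties counted as one half. The function name `auc` and the toy scores are hypothetical; this pairwise formulation is one standard way to compute AUC and is not necessarily the implementation used in this study.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for five responses.
y_true = [1, 1, 0, 0, 0]
perfect = auc(y_true, [0.9, 0.8, 0.3, 0.2, 0.1])        # full separation
uninformative = auc(y_true, [0.5, 0.5, 0.5, 0.5, 0.5])  # constant score
print(perfect, uninformative)
```

A method whose scores separate the two groups completely reaches an AUC of 1, while a method that assigns the same score to every response cannot rank any pair and stays at 0.5.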
Figure 5
The boxplots of stability and distinguishing ability measurements among 10 emotions of the four trainable methods using 100 repetitions of 5-fold cross-validation. The values of the three typical emotions, Depression, Lack of Motivation, and Miscellaneous, are also highlighted.
The left and middle plots in Figure 5 show the boxplots of the standard deviations of the accuracy and F1 scores, and the right plot illustrates the boxplots of average AUCs among 10 emotions for the four trainable methods. There is no result for GPT, as it does not require a training process and cannot produce a probability of a positive detection. For the accuracy, it is clear that Lexicon has the smallest standard deviations, while MentalBERT has the largest. For the standard deviations of F1 scores, the trend is less obvious. However, the lower bound of Lexicon’s boxplot is lower than that of the other three trainable methods, showing that it can achieve the highest stability for some emotions. Those plots show that Lexicon excels in the stability measurements, while MentalBERT’s performance is sensitive to the part of the data with which it is trained. In the right plot, the upper bounds of the boxplots are all close to 1.0, but their lower bounds are different. Among the four methods, TF-IDF shows the highest average AUCs, and Lexicon and MentalBERT have the lowest values. Thus, when using AUC as the criterion, TF-IDF shows the highest distinguishing ability for some challenging emotions.
We now show the stability and distinguishing ability measurements for the three typical emotions identified in Section 4.1. Their values are also highlighted in Figure 5 with special legends. For the emotion with good detection performance, Depression, the standard deviations of accuracy and F1 score are close to 0 for Lexicon, and slightly increase from BoW and TF-IDF to MentalBERT. The AUCs of all four methods are almost 1, showing excellent distinguishing ability. For the moderate one, Lack of Motivation, the standard deviation of accuracy again increases from Lexicon, BoW, TF-IDF, to MentalBERT. However, MentalBERT achieves the smallest standard deviation of F1 scores. One possible reason is that MentalBERT is better able to understand the concept of Lack of Motivation, thereby improving its stability. The AUCs for this emotion lie between about 0.75 and 0.85, with TF-IDF achieving the highest.
For the emotion with poor performance, Miscellaneous, the standard deviations of Lexicon are still low despite its poor average performance shown in Section 4.1. MentalBERT again has the largest standard deviations in its accuracy and F1 scores. The AUCs lie between about 0.5 and 0.6, indicating that the models’ predictions are only slightly better than a random guess. Among the four methods, TF-IDF has the highest AUC, while Lexicon and MentalBERT have the lowest. We can conclude that for both easy and challenging emotions, Lexicon exhibits the highest stability, while MentalBERT shows the lowest. For the moderate one, MentalBERT’s stability improves. For moderate and challenging emotions, TF-IDF’s distinguishing ability outperforms that of the other methods when using AUC as the criterion.
4.3 Prediction Consistency
In this section, we aim to compare the consistency between the detections of different NLP methods, which indicates whether they yield identical predictions for each student’s response. To handle emotions with a very small number of positive samples, we choose the Jaccard index, also known as the Jaccard similarity, to measure consistency. The Jaccard index between the two methods for a certain emotion can be calculated as:
\[ \text{Jac}(\text{Meth}_{1},\text{Meth}_{2})=\frac{\sum _{r=1}^{R}I(\hat{y}_{r}^{\text{Meth}_{1}}=1\hspace{2.5pt}\text{and}\hspace{2.5pt}\hat{y}_{r}^{\text{Meth}_{2}}=1)}{\sum _{r=1}^{R}I(\hat{y}_{r}^{\text{Meth}_{1}}=1\hspace{2.5pt}\text{or}\hspace{2.5pt}\hat{y}_{r}^{\text{Meth}_{2}}=1)},\]
where $I(\cdot )$ is the indicator function, and ${\hat{y}_{r}^{{\text{Meth}_{1}}}}$ and ${\hat{y}_{r}^{{\text{Meth}_{2}}}}$ denote whether each of the two methods predicts that the emotion is expressed in response $r$ ($r=1,\dots ,R$). The prediction results are based on the first of the 100 repetitions of 5-fold cross-validation in Section 4.2. We calculate the Jaccard indices for the 15 pairwise comparisons among the five NLP methods and the true labels, and the results for the ten emotions are listed in Appendix C.3. Figure 7 presents the average pairwise consistencies over the ten emotions. MentalBERT vs. true labels (0.5) and BoW vs. TF-IDF (0.48) have the two highest average Jaccard indices. The first pair demonstrates the capability of trainable Transformer models to predict results close to the true labels. Meanwhile, the second pair is possibly explained by the similar mechanisms of the two methods, which first convert the responses to vectors and then train a machine learning classifier. Then, we examine the leftmost column, which shows the consistency of the five NLP methods with the true labels. MentalBERT has the highest score (0.5), followed by BoW (0.43), and Lexicon, TF-IDF, and GPT have the lowest scores (0.37).
Figure 7
The average Jaccard indices among 10 emotions between five NLP methods and the true labels.
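The Jaccard index defined above can be sketched in a few lines of Python. The function name `jaccard`, the toy prediction vectors, and the convention of returning 0 when neither method predicts any positive are assumptions made for this illustration and are not specified in the text.

```python
def jaccard(pred_a, pred_b):
    """Jaccard index between two 0/1 prediction vectors of equal length."""
    if len(pred_a) != len(pred_b):
        raise ValueError("both methods must score the same responses")
    pairs = list(zip(pred_a, pred_b))
    both = sum(1 for a, b in pairs if a == 1 and b == 1)    # numerator
    either = sum(1 for a, b in pairs if a == 1 or b == 1)   # denominator
    # Assumed convention: 0.0 when neither method predicts a positive.
    return both / either if either > 0 else 0.0


# Hypothetical predictions of two methods for six responses.
meth1 = [1, 1, 0, 0, 1, 0]
meth2 = [1, 0, 0, 0, 1, 1]

jac = jaccard(meth1, meth2)
agreement = sum(1 for a, b in zip(meth1, meth2) if a == b) / len(meth1)
print(f"Jaccard = {jac:.2f}, simple agreement = {agreement:.2f}")
```

In this toy case, the two methods share 2 positive predictions out of 4 responses flagged by at least one of them, giving a Jaccard index of 0.5, whereas simple agreement is higher because it also counts the shared negatives. This is why the Jaccard index is better suited to emotions with few positive samples.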
Finally, we present the Jaccard indices for the three typical emotions. For the emotion with good detection performance, Depression, almost all pairwise consistencies are above 0.5, and the consistencies among the true labels, Lexicon, BoW, and MentalBERT are higher than 0.75. However, TF-IDF and GPT generate relatively inconsistent predictions, with a Jaccard index below 0.5, which agrees with the results in Figure 4, where these two methods show lower accuracies and F1 scores. For the emotion with moderate detection performance, Lack of Motivation, we find that MentalBERT achieves the highest consistency with the true labels, at around 0.6, while the other Jaccard indices range from 0.22 to 0.57. This result highlights the capabilities of trainable LLMs. For the emotion with poor performance, Miscellaneous, every pairwise comparison is below 0.1, indicating that none of the NLP methods, whether trainable or not, can capture the ambiguous concept of Miscellaneous.
4.4 Discussion
Performance and Complexity Trade-off: The NLP methods investigated in this study exhibit dramatically different levels of complexity. The Lexicon-based method requires learning scores for only dozens of keywords, while MentalBERT must fine-tune over 100 million parameters in its Transformer architecture [17]. These complexity differences translate into distinct performance patterns. The Lexicon-based method achieves stable performance across training sample sizes ranging from $20\% $ to $80\% $, demonstrating the lowest standard deviations in both accuracy and F1 scores. However, its peak performance cannot compete with more sophisticated methods. In contrast, fine-tuned MentalBERT demonstrates the highest accuracy and F1 scores with $80\% $ training data and achieves the best Jaccard Index agreement with true labels. Yet its performance drops dramatically with limited training data ($20\% $), and it exhibits the highest variability in repeated cross-validation experiments.
BoW and TF-IDF methods, both of which convert responses into numeric vectors before training traditional machine learning models, demonstrate balanced performance, complexity, and stability. Their methodological similarity is reflected in their relatively high pairwise Jaccard index. While TF-IDF achieves better AUC scores than BoW, indicating good distinguishing ability for challenging emotions, its performance declines more rapidly with reduced training data.
The zero-shot GPT method, implemented through OpenAI’s API, requires no training process and achieves performance comparable to BoW and TF-IDF models trained on $50\% $ of the data, though it underperforms compared to MentalBERT trained on $80\% $ of the data. Notably, Lossio-Ventura et al. [19] found that zero-shot ChatGPT outperformed fine-tuned Transformers in sentiment analysis for COVID-19 survey data. This apparent discrepancy likely stems from task complexity differences: sentiment analysis predicts general positive/negative sentiment, a more universal task that ChatGPT’s vast training data can handle effectively. In contrast, our fine-grained emotion detection requires distinguishing among ten mental health-related emotions, many specifically defined by our research team. Without access to labeled training data, the zero-shot GPT model cannot accurately detect these domain-specific emotional categories.
Method Selection Suggestions: Based on our performance observations, we offer the following guidelines for NLP method selection. The Lexicon-based method proves particularly effective for detecting well-defined emotions with clear linguistic indicators, especially when training data is limited or stability is prioritized over peak performance. Fine-tuned MentalBERT is most suitable for detecting contextually complex emotions when sufficient training data ($\gt 50\% $ of available samples) and computational resources are available. Traditional machine learning methods, such as BoW and TF-IDF, provide effective predictions when labeled data or computational resources are insufficient for Transformer models. Finally, zero-shot GPT can generate quick assessments when no training data is available, though performance will be limited for domain-specific emotions.
Uncertainty in Mental Health Studies: When deploying NLP methods for emotion detection in mental health research, their inherent uncertainty must be carefully examined, as detection results can have serious consequences for both research conclusions and potential interventions. Uncertainty arises from multiple sources, beginning with the training dataset itself. As demonstrated by our experiments, most NLP methods show performance sensitivity to training size variations. Additionally, repeated cross-validation reveals that resampling the training dataset while maintaining the same sample size yields variable predictions, particularly for high-complexity models like fine-tuned MentalBERT. Therefore, sensitivity analysis for training sample size and cross-validation repetitions is essential for evaluating model stability.
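The repeated cross-validation check described above can be sketched as follows. This is a simplified, standard-library-only illustration and not the code used in our experiments: the function names, the toy labels, and the majority-label placeholder model are hypothetical, and we assume here that the mean and standard deviation are taken over all fold-level scores.

```python
import random
import statistics


def stratified_folds(labels, k, rng):
    """Split indices into k folds with roughly equal class ratios."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal each class out round-robin
    return folds


def repeated_cv(labels, fit_predict, metric, k=5, repetitions=100, seed=0):
    """Return (mean, standard deviation) of the metric over all folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        for held_out in stratified_folds(labels, k, rng):
            held = set(held_out)
            train = [i for i in range(len(labels)) if i not in held]
            preds = fit_predict(train, held_out)
            truth = [labels[i] for i in held_out]
            scores.append(metric(truth, preds))
    return statistics.mean(scores), statistics.stdev(scores)


def accuracy(truth, preds):
    return sum(1 for t, p in zip(truth, preds) if t == p) / len(truth)


# Hypothetical toy labels: 20 positive and 80 negative responses.
labels = [1] * 20 + [0] * 80


def majority_baseline(train_idx, test_idx):
    """Placeholder model: predict the majority label of the training fold."""
    positives = sum(labels[i] for i in train_idx)
    majority = 1 if 2 * positives > len(train_idx) else 0
    return [majority] * len(test_idx)


mean_acc, sd_acc = repeated_cv(labels, majority_baseline, accuracy)
print(f"mean accuracy = {mean_acc:.2f}, standard deviation = {sd_acc:.2f}")
```

The placeholder model ignores the text entirely, so with stratified folds it yields the same accuracy in every fold and a standard deviation of zero. Replacing `majority_baseline` with a fitted model would make the standard deviation reflect how strongly that model depends on the particular training subset.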
A second source of uncertainty stems from the binarization of predicted probabilities. All four trainable methods output probabilities indicating the likelihood of emotion presence, with positive detection determined by a 0.5 threshold. However, a response with 0.99 prediction probability represents a different uncertainty level compared to one with 0.51 probability. In this study, we employed AUC to evaluate the distinguishing ability of NLP models based on predicted probabilities, and ROC curves can provide additional insights into model uncertainty characteristics.
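To make the loss of information under binarization concrete, the minimal Python sketch below applies a 0.5 threshold and records, for each response, how far the predicted probability lies from that threshold. The function name `binarize_with_margin`, the example probabilities, and the choice to treat a probability exactly equal to the threshold as positive are assumptions made for illustration.

```python
def binarize_with_margin(probabilities, threshold=0.5):
    """Return (label, margin) pairs; margin = distance from the threshold."""
    results = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        # Assumed tie rule: p equal to the threshold counts as positive.
        label = 1 if p >= threshold else 0
        results.append((label, abs(p - threshold)))
    return results


# Hypothetical predicted probabilities for three responses.
decisions = binarize_with_margin([0.99, 0.51, 0.10])
for label, margin in decisions:
    print(f"label = {label}, margin = {margin:.2f}")
```

The first two responses receive the same positive label, yet their margins differ by a factor of almost fifty, which is precisely the information that is discarded once only the binary detections are reported.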
Emotion Labeling Impact: A crucial finding is that the target emotion significantly impacts NLP method performance. Figures 3 and 5 demonstrate that accuracy and stability measurements for the same method vary substantially across the ten emotions studied. Training sample size sensitivity also depends on the specific emotions being detected. Consequently, method comparisons yield different conclusions for different emotion types, as illustrated by our analysis of Depression, Lack of Motivation, and Miscellaneous categories. Such performance disparities across emotions can lead to inconsistent findings in studies relying on NLP detection results. While traditional mental health studies, such as [1], design emotion categories based on domain knowledge, the increasing role of NLP methods in data analysis necessitates careful selection and design of emotion categories to ensure both performance and stability of automated detection algorithms. For instance, Miscellaneous, a convenient category for human labelers, leads to poor performance for all NLP methods. Such categories should be avoided when incorporating NLP methods for data analysis.
We note that training on human-labeled data pushes all four trainable methods toward predictions that resemble those of human labelers. Consequently, the zero-shot GPT-3.5 performs relatively poorly compared to the other methods when human labels serve as the “gold standard”. However, as pre-trained LLMs become more powerful, their predictions could be more valuable when the quality of human labels cannot be ensured. Collaboration between LLMs and human experts can be beneficial for mental health research.
Multi-label Classification: There is machine learning research on multi-label classification [3, 40]. These studies found that specialized methods, such as problem transformation and algorithm selection, can capture the inner structure of the labels and thereby improve accuracy. Figure 2 shows weak associations among our ten emotions. For example, No/Positive Effects has correlations of $-0.20\sim -0.03$ with the other nine emotions. Employing multi-label classification to improve the efficiency of emotion detection would be an interesting direction for future study.
5 Conclusion
This paper presents a comprehensive comparative study of NLP methods for detecting fine-grained emotions in college student responses regarding their mental health during the COVID-19 pandemic. We evaluated five distinct approaches: Lexicon, BoW, TF-IDF, fine-tuned MentalBERT, and zero-shot GPT, examining their performance, training sample size sensitivity, stability, distinguishing ability, and inter-method consistency. Our experimental results reveal performance-complexity trade-offs among NLP methods and provide evidence-based guidelines for method selection. We demonstrate the critical importance of recognizing uncertainty inherent in NLP detections and emphasize the need for careful emotion category design to ensure detection quality. These insights help bridge a critical gap between data science and mental health studies in the NLP analysis of survey data, and they are relevant to a variety of survey-based applications.
This work establishes a foundation for future NLP development in mental health survey research through several promising directions. First, hybrid or mixture-of-experts frameworks could be designed to balance performance-complexity trade-offs by selecting appropriate models based on emotion type and available training sample size, thereby providing stable detection results across diverse conditions. Second, uncertainty-aware algorithms could be developed based on our analytical framework, incorporating prediction probabilities, cross-validation standard deviations, and inter-method consistency scores to generate uncertainty estimates. In mental health applications, such systems could restrict automated decisions to low-uncertainty cases while flagging high-uncertainty responses for human review. Finally, our findings highlight the need for developing emotion categories that balance mental-health insights with computational detectability, potentially through collaborative efforts between domain experts and NLP researchers.