TClustVID: A Novel Machine Learning Classification Model to Investigate Topics and Sentiment in COVID-19 Tweets

Md. Shahriare Satu; Md. Imran Khan; Mufti Mahmud; Shahadat Uddin; Matthew A. Summers; Julian M.W. Quinn; Mohammad Ali Moni

doi:10.1101/2020.08.04.20167973

Abstract

COVID-19, caused by the SARS-Cov2, varies greatly in its severity but represent serious respiratory symptoms with vascular and other complications, particularly in older adults. The disease can be spread by both symptomatic and asymptomatic infected individuals, and remains uncertainty over key aspects of its infectivity, no effective remedy yet exists and this disease causes severe economic effects globally. For these reasons, COVID-19 is the subject of intense and widespread discussion on social media platforms including Facebook and Twitter. These public forums substantially impact on public opinions in some cases and exacerbate widespread panic and misinformation spread during the crisis. Thus, this work aimed to design an intelligent clustering-based classification and topics extracting model (named TClustVID) that analyze COVID-19-related public tweets to extract significant sentiments with high accuracy. We gathered COVID-19 Twitter datasets from the IEEE Dataport repository and employed a range of data preprocessing methods to clean the raw data, then applied tokenization and produced a word-to-index dictionary. Thereafter, different classifications were employed to Twitter datasets which enabled exploration of the performance of traditional and TclustVID classification methods. TClustVID showed higher performance compared to the traditional classifiers determined by clustering criteria. Finally, we extracted significant topic clusters from TClustVID, split them into positive, neutral and negative clusters and implemented latent dirichlet allocation for extraction of popular COVID-19 topics. This approach identified common prevailing public opinions and concerns related to COVID-19, as well as attitudes to infection prevention strategies held by people from different countries concerning the current pandemic situation.

1. Introduction

COVID-19 has become a global concern as a major and dangerous public health threat. The World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern (PHEIC) on February 28, 2020. During the 1960s, coronaviruses (CoVs) were found to infect humans mainly in the upper respiratory tract, most commonly human coronavirus 229E and OC43 [1]. Many CoVs circulate in wild mammalian populations, and cause only minor, if any, human health problems. This picture changed with the emergence of severe acute respiratory syndrome (SARS-CoV) and the Middle East Respiratory Syndrome coronavirus (MERS-CoV) that infect the lung epithelial tissues and cause serious and often deadly respiratory disease [2]. However, SARS-CoV and MERS-CoV outbreaks in 2002 and 2012 respectively receded, probably due to the lack of spread from non-symptomatic individuals that allowed rapid containment. In contrast, SARS-CoV2 which causes pneumonia-like symptoms and cardiovascular complications ranging in severity from undetectable to rapidly lethal. This, coupled with its rapid spread has caused huge economic disruption and personal health fears and uncertainties that have dominated both the news and social media.

The massive use of web and mobile technologies gives opportunities for people to share their opinions about issues affecting them on social media platforms such as Facebook and Twitter. During the COVID-19 pandemic social media has been used both for normal daily interaction and to spread health messages, but there are also significant numbers of messages left by users sharing their general feelings about personal situations, their health status, the care they take to stay well, and much other COVID-19-relevant information [3]. Such messages may provide useful large scale insights into behavioral responses to the pandemic, however it is not easy to judge whether a social media message carries important information, not least because semantic abstruseness makes it hard to understand many messages. Nevertheless, machine learning and computational methods have increasingly been used to scrutinize social media data in the biomedical sector [4]. The content of COVID-19 related messages may be used to extract information that can inform physicians and policy-makers. Twitter, in particular, is a popular microblogging and public networking service widely used for messaging and posting [5]. Automatic classification of tweets into particular classes is challenging, not least because these messages are short, 140 characters, or less [6]. The analysis requires identification of sentiments in Twitter messages (tweets) which contain abbreviations, spelling variations and ambiguous or informal language.

Some recent studies have attempted to scrutinize COVID-19 tweets in bulk for health purposes, although it is likely they have also been mined for commercial purposes. Lopez et al. [7] generated a dataset of multilingual tweets collected from all over the world since January 22^nd. In this dataset they identified common responses and how they changed across time. Kouzy et al. [8] explored tweets using 14 trending hashtags and keywords about COVID-19 and investigated the magnitude of misinformation by comparing terms and hashtags of tweets. Cinelli et al. [9] analyzed the dissemination of information about COVID-19 on Twitter, Instagram, YouTube, Reddit, and Gab, and found a quite different volume of misinformation in each of the platforms. Medford et al. [10] analyzed all twitter user data from January 14^th to 28^th, 2020 and applied sentiment analysis and topics modeling using LDA to explore discussion topics over time. However, there are few dedicated machine learning based tweet analysis models to investigate user sentiments about COVID-19. In this study, we sourced several twitter datasets and investigated sentiment topics related to COVID-19 by designing a novel clustering based analysis model named TClustVID. This model was used to explore significant subsets (clusters) from COVID-19 twitter datasets and select them by applying the highest classification performance approach. Each of these twitter clusters has been split into the positive, negative and neutral cluster and employed latent dirichlet allocation (LDA) to extract key topics from each of them. Topics were interpreted to identify the most frequent significant topic among the tweets studied. This methodology can be used to generate information relevant to researchers and policymakers when dealing with COVID-19 issues that relate to the general public and human social behaviour at large.

2. Materials and Methods

We proposed a machine learning based COVID-19 tweets analytic model that can be used to explore significant topics from Twitter datasets. To process different types of tweets, several natural language processing techniques are used, along with machine learning methods as illustrated in Figure 1.

Figure 1:

Details of Working Methodology where A. Data preprocessing B. Traditional classification and evaluation C. Clustering, classification and c valuation D. Comparison the outcomes between traditional and TClustVID and Select the best clusters E. Identify positive, neutral and negative clusters, extract topics by LDA and represent top frequent topics from it

2.1 Data Description

COVID-19 twitter datasets were collected from the IEEE Data portal that originated from the LSTM model, developed by Rabindra Lamsal, which monitors the real-time twitter feed for COVID-19-related tweets [11]. It generates over 0.3 million requests every 24 hours and its time-series graph is updated every 30 seconds. Almost 16 million tweets were identified before March 20^th 2020. Each database (*.db) contains three attributes in which first, second, and the third columns denote the date and time of the tweet, and the tweets and sentiment scores. However, these sentiment scores are manipulated within the range [0,2] where the most negative, neutral, and positive sentiment are indicated as 0, 1 and 2, respectively. Eight twitter datasets (corona tweets 1M.db, corona tweets 1M 2, corona tweets 1M, corona tweets 2L, corona tweets 2M.db, corona tweets 2M 2, corona tweets 2M 3 and corona tweets 3M) have been investigated and deemed suitable models to classify tweets in this study. Each dataset has been denoted the tweets related to COVID-19 of each day before March 20^th 2020. We gathered datasets of a couple days to understand and extract various topics everyday. The first seven of these datasets are denoted as dataset-1, dataset-2, dataset-3, dataset-4, dataset-5, dataset-6, and dataset-7. In this study, corona tweets 3M was split into dataset-8 and dataset-9 because the computational cost is manipulated very high for the corona tweets 3M.

2.2 Data Preprocessing

In preprocessing steps, different twitter datasets have been prepared to manipulate them. Tweets contain various HTML tags, punctuation, numbers, single characters and multiple spaces. Several functions were used to clean datasets in this step. The symbols ‘ <>′ were replaced with empty spaces. Each of the single characters was replaced with a space as the single letters do not indicate any meaningful communication. Finally, all multiple spaces were removed from these tweets. This process was employed in the nine twitter datasets and combined for further analysis. Table 1 represents the number of tweets before and after prepossessing steps.

View this table:

Table 1:

Number of Cleaned Tweets COVID-19 After Data Preprocessing

2.3 Tokenization

After preprocessing steps, tokenization procedures were used to generate a word-to-index dictionary whereby each word is created as a key in the corpus. Hence, the corresponding unique index indicates the value of the keys. In the training phase, each list holds each sentence where the size is dissimilar. Thus, the maximum length of each list is fixed. If the length of any list is exceeded, it is truncated to the maximum permitted length. Zeroes are added to the endpoint of a shortlist until it reaches maximum length, a process called padding. Thus, Glove embedding tokenization [12] has been used to create a dictionary that holds a word as a key and the corresponding list as values. Finally, an embedding matrix is generated whereby each row number matches the index of the word in the corpus. Raw tweets contain text instances which cannot handle by machine learning procedure. Therefore, we run data pre-processing and tokenization process to make it executable for clustering and classification computation.

2.4 Traditional Approach

After manipulating data preprocessing and tokenization, we implemented different machine learning baseline classifiers into twitter datasets and evaluate the results. This process is called tradition approach. It is a general process to apply classifiers into the dataset. In this work, we implement various well known classifiers into traditional way in the dataset and compare the results with TClustVID. This procedure is used to justify the performance of TClustVID and assist to explore best clusters comparing other methods. However, both traditional and TClustVID use same baseline classifier which are indicated at section 2.6.

2.5 TClustVID: Clustered Based Classification and Topics Modeling Approach

We proposed a novel clustering-based topics modeling approach called TClustVID which represents at 1. It splits the twitter datasets into several clusters (groups) applying k-means clustering algorithms. It is implemented into COVID-19 twitter datasets following preprocessing and tokenization process. Clustering is an unsupervised method to find homogeneous groups from the dataset. This procedure is used to create clusters as features and improve classification results. There are remaining various clustering algorithms such as k-means, k-medoids, fuzzy C-means, hierarchical clustering, and density based clustering [13, 14]. K-medoids is not the best choice for analyzing sparse data like tweets. Besides, fuzzy C-means is useful to the sheer volumes of tweets and contains low scalability where human annotation really expensive. The performance of hierarchical clustering is slower than k-means. Density based clustering is highly efficient for clustering unstructured data and less prone to outliers and noise. In this work, we handle a large amount of tweet data where K-means defines the mean point within the cluster by optimizing the Euclidean distance between each instance and cluster mean in a less time [15, 14]. The default values of k are taken as 5 which is also used more frequently this type of work. Each cluster contain positive, negative and neutral tweets. When the clusters are found, the tokens were replaced by primary tweets and re-tokenized each cluster. Then, baseline classifiers were used to investigate the performance of different datasets and extracted clusters using 10 fold cross-validation. Different evaluation metrics such as accuracy, area under the curve (AUC), f-measure, g-mean, sensitivity and specificity have been used to investigate these results.

Algorithm 1

TClustVID: Clustered Based Proposed Classification and Topics Modeling Approach

Compared to the classification results of traditional approach and TClustVID, the best performing clusters represent more frequent topics because they show the highest classification performance relative to the traditional approach. These clusters are divided into positive, neutral and negative clusters for further analysis. Therefore, LDA was used to explore significant topics of positive, neutral and negative clusters from the high performing nine clusters. There were extracted 20 topics from each cluster. We represent individual topics into word cloud where each topics contain different words/tokens. In addition, each word cloud represent individual words into different sizes because they organize words according to the weights of them.

But, LDA cannot interpret these topics, hence, we manually analyze the words/tokens of each topics and interpret them.

2.6 Baseline Classification

In previous studies, various classifiers such as decision tree (DT), Gradient Boosting (GB), K-Nearest Neighbor (KNN), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and XGBoost (XGB) have been commonly used to investigate different types of tweet datasets for sentiment analysis. These classifiers were used in similar kinds of twitter data analysis such as C5.0 (DT), KNN, SVM, LR and ZeroR [16], personality prediction using KNN, NB, SVM, and XGB [17, 18], spam detection using RF, NB, SMO and Ibk (KNN equivalent) [19], sentiment analysis using NB, SVM, and MLP of top colleges [20], prediction of alternation price fluctuation using GB [21]. Following this tasks, we selected them to classify the COVID-19 twitter dataset, then explored the best clusters using 10-fold cross-validation.

2.7 Evaluation Metrics

A confusion matrix is specified for the performance of the classifier that indicates the number of correct and incorrect predictions when considering known true values. Based on positive and negative classes, it denotes True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).

Accuracy: It represents the efficiency of the algorithm in terms of predicting true values that is shown in the following equation.
AUC: It is used to explore machine learning models considering the TP and TN rates represent how well positive classes are isolated from negative classes.
F-measure: It represents the harmonic mean of the precision and recall which shows the following equation.
Geometric-mean (G-mean): It specifies the root of the class-specific sensitivity product and makes a trade-off between the expansion of accuracy on each class and balancing accuracy.
Sensitivity: The portion of appropriately detected actual positives is indicated as sensitivity using the following equation.
Specificity: The portion of correctly identified actual negatives is denoted as specificity which represents by the following equation.

3. Experimental Result & Discussion

3.1 Classification Approach

Various classification algorithms were used to analyze the COVID-19 twitter dataset, using the sci-kit-learn machine learning python library [22]. In this study, we have proposed a clustering based classification and topics extraction model, TClustVID, which detects positive, negative, and neutral tweets more accurately than previous methods which allows to explore more significant thematic topics. COVID-19 twitter datasets were cleaned using the data preprocessing procedures described above. Word-to-index dictionaries were then created using GloVe embedding tokenization. Several classification algorithms such as DT, GB, KNN, LR, MLP, NB, RF, SVM and XGB were analyzed sentiments of the COVID-19 datasets, using 10 fold cross validation approach. The experimental analyses of COVID-19 twitter datasets (from dataset 1 to 9) are represented at Table 2 to 10. We used various evaluation metrics such as accuracy, area under the curve (AUC), f-measure, g-mean, sensitivity and specificity to profile the results of the nine COVID-19 twitter datasets used with this model.

View this table:

Table 2:

Experimental Result of Dataset-1

In traditional approaches (see Table-2, 3, 4, 5, 6,7, 8, 9 and 10), RF gave, respectively, 3, 7, 5, 7, 3 and 7 times the highest accuracy, AUC, f-measure, sensitivity, and specificity, respectively, in different twitter datasets. Alternatively, DT gave 6, 2, 2, 2, 6 and 2 times the highest accuracy, AUC, f-measure, sensitivity, and specificity respectively. If the frequency of generating the best values are calculated, RF showed a total 32 times higher results to analyze twitter datasets. However, DT provided total of 20 times the highest results corresponding to RF. It was also noted that DT was better at predicting true positive instances compared to RF. Both of these showed 90% average results to scrutinize COVID-19 twitter datasets. Without DF and RF, KNN and MLP showed better results than other classifiers from dataset-1 to 9. However, the performance of KNN was better than MLP at all times and MLP showed better results than XGB, GB, LR and SVM. Thus, they were considered as the third and fourth top-performing classifiers correspondingly. In the comparison of GB and XGB, most of the time XGB showed better results than GB. XGB gave 5 times higher results when compared to GB. In some cases, GB showed a better result than XGB for only a few metrics (e.g., better accuracy, specificity in dataset-1, better accuracy and sensitivity in dataset-8 and 9). Hence, XGB was the 5^th high performing classifier in this work. SVM and LR did not demonstrate sound outcomes in analyzing the COVID-19 datasets. For most of these cases, LR gave higher results than SVM.

View this table:

Table 3:

Experimental Result of Dataset-2

View this table:

Table 4:

Experimental Result of Dataset-3

View this table:

Table 5:

Experimental Result of Dataset-4

View this table:

Table 6:

Experimental Result of Dataset-5

View this table:

Table 7:

Experimental Result of Dataset-6

View this table:

Table 8:

Experimental Result of Dataset-7

View this table:

Table 9:

Experimental Result of Dataset-8

View this table:

Table 10:

Experimental Result of Dataset-9

In this work, we implemented TClustVID where these results of individual classifiers have been improved over the traditional approaches (see Table-2, 3, 4, 5, 6,7, 8, 9 and 10). However, the performance order of individual classification remained almost the same. Various classifiers such as DT, RF, GB, KNN, MLP, RF, SVM and XGB were employed to compare with TClustVID. Moreover, the performance of DT and RF were the most similar when investigating COVID-19 twitter data analysis for various evaluation metrics. Some dissimilarities were noted when the data processing methodology is modified in different steps. Using the traditional approaches, RF showed better results than DT except accuracy and sensitivity. When TClustVID was employed, the performance of DT increased comparing to the traditional approach. DT showed 7,6,7,7 and 2 times the highest accuracy, AUC, f-measure, sensitivity and specificity at the nine COVID-19 twitter datasets respectively. Again, RF showed 2,3,2,2 and 7 times the highest accuracy, AUC, f-measure, sensitivity and specificity. The results of KNN and MLP were improved but were still third and fourth high performing classifiers for all of these datasets. With the traditional approach, GB showed better performance than XGB under various conditions. However, the results of XGB showed superiority to GB in almost all the time with TClustVID, so that GB showed greater accuracy and sensitivity than XGB in dataset-8 and 9. Moreover, LR and SVM showed lower performance than GB whereas LR showed better performance than SVM.

However, the average results for the combination of traditional and TClustVID are illustrated in Figure 1). Subsequently, the individual average results of traditional and TClustVID were explored to understand the average hierarchy of individual classifiers for both of these approaches. Using Traditional approach, RF showed 7 times and DT showed 2 times top results corresponding all metrics respectively. Thus, RF considered the best performing and DT represented the second best performing classifier in this analysis, with KNN and MLP third and fourth classifier in terms of their performance. Besides, XGB, GB, LR and SVM showed 8 times as the fifth, sixth, seventh best performing algorithms respectively. Instead, the average results of almost all classifiers were improved by TClustVID. DT showed greater average result with TClustVID, where it showed the highest outcomes at the five twitter clusters. Thus, RF can be considered as the second highest average performing classifier in this work. KNN and MLP showed the third and fourth highest performing classifier in both of these approaches. Therefore, XGB, GB, LR and SVM that showed the next best average performance.

Figure 1:

Average Performance of various classifiers in (a) Dataset-1, (b) Dataset-2, (c) Dataset-3, (d) Dataset-4, (e) Dataset-5, (f) Dataset-6, (g) Dataset-7, (h) Dataset-8 and (i) Dataset-9 of traditional and TClustVID

The highest results of different classifiers are indicated the best performance in analyzing COVID-19 tweets. Therefore, the highest results for different classifiers are shown on Table 11. In this table, the findings of TClustVID were also shown improved outcomes relative to the traditional approach. In both apporaches, RF showed the best results among all of the classification methods. Then, DT showed the second maximum results to investigate COVID-19 related tweets. Again, KNN and MLP showed the third and fourth best results, similar to previous analyses. We then found that XGB and GB also gave better results, with XGB giving better results than GB. Using the traditional approach, SVM showed greater accuracy, AUC, sensitivity and specificity than LR. Instead, LR showed greater accuracy, AUC and sensitivity in TClustVID.

View this table:

Table 11:

Highest Results of Different Classifiers

After calculating the average results of the different classifiers, it was clear that TClustVID showed better results compared to the more traditional approach (see Table 12). However, the order of average performances is similar to whether the traditional approach or TClustVID was used. RF showed the highest average accuracy, f-measure and sensitivity and was the highest average classification model in this analysis. Instead, DT appears as the second ranked for average performing classifier and KNN and MLP were found third and fourth performing classifier, which was also seen in another analysis. XGB showed a better average performance than GB. Hence, XGB and GB represent the fifth and sixth performing classifiers in this work. Finally, LR and SVM show the lowest average order of performance.

View this table:

Table 12:

Average Results of Different Classifiers

In Fig-2a using the traditional approach, the sequence of average highest outcomes of different classifiers are also represented as RF, DT, KNN, MLP, XGB, GB, SVM and LR. Similarly, TClustVID represents the ranking of average best results of classifiers as RF, DT, KNN, MLP, XGB, GB, SVM and LR respectively. On the other hand, the average performance of averaged classification results is illustrated at Fig-2b. In the traditional approach, the sequences of average results of averaged classifiers are represented as RF, DT, KNN, MLP, XGB, GB, LR and SVM. However, the average performance of TClustVID, DT showed better results than RF. Moreover, the performance of other classifiers keep the same sequence according to the previous analysis. Thus, the final average order of top average performing classifiers in TClustVID were DT, RF, KNN, MLP, XGB, GB, LR and SVM. However, in the mixture of the traditional and TClustVID, average results showed a similar average ranking of performance of various classifiers which are RF, DT, KNN, MLP, XGB, GB, LR and SVM respectively.

Along with observing the performance of various classifiers, we noticed that TClustVID shows better performance than traditional approach. Hence, top modeling approach is used high performing clusters to extract significant topics in next section.

3.2 Topic Modeling Approach

A comprehensive analysis of different classifiers in traditional and TClustVID analyses indicated that TClustVID is the best model to identify significant groups of tweets from large COVID-19 Twitter datasets. The data obtained from identification of groups/clusters were significant because they showed the highest classification accuracy were achieved compared to traditional analysis in primary data. In the TClustVID analysis, we generated significant clusters from each of these twitter datasets (for positive neutral, and negative categories) that showed greatly improved results for the different classifiers. These clusters have been denoted as Cluster-1, Cluster-2, Cluster-3, Cluster-4, Cluster-5, Cluster-6, Cluster-7, Cluster-8, and Cluster-9, respectively. A number of topics were then extracted from these clusters where within nine clusters seven clusters produces positive, neutral and negative topics and two of them extracts positive and neutral topics using LDA. Each topic contains 10 tokens along with related weights and they can be used to prioritize each token. 20 topics were identified from each of the categories (positive, neutral and negative) in these clusters. Therefore, all topics of individual clusters are represented as word cloud in the supplementary section. In this paper, extracted positive, neutral and negative topics of cluster-3 are visualized with word cloud in Figure 4, 5 and 6 individually. However, LDA cannot interpret the meaning, so we defined each topic by realising the meaning and weight values at different groups manually. The positive, neutral and negative topics are represented at Table 13, 14 and 15 respectively. These tasks are not simple because many preprocessed words do not have any semantic meaning. However, it can be hard to understand the association between the different words/tokens in these topics and these interpretations may slightly differ with other types of reviews.

View this table:

Table 13:

Positive Topics of All Significant Clusters

View this table:

Table 14:

Neutral Topics of All Significant Clusters

View this table:

Table 15:

Negative Topics of All Significant Clusters

Figure 2:

Average Performance of various classifiers (a) Maximum results (b) Average results corresponding to the nine twitter experimental datasets

Figure 3:

Word Cloud of Various Topics

Figure 4:

Positive Topics of Cluster-3

Figure 5:

Neutral Topics of Cluster-3

Figure 6:

Negative Topics of Cluster-3

In the different categories of tweets, we manipulated the frequency of different topics that appears several times. Positive, neutral and negative topics were represented to identify what activities are generated in the context. To understand individual topics into different categories, we considered the best topics which are appeared more than 1 times (see Figure 7). The examples of positive topics of cluster-3 are shown as the word cloud in Figure 4. The positive topics of different clusters are shown in Table 13 and the top frequent positive topics are shown in Figure 7a. For the positive cases, ‘awareness’ and ‘situation’ are the most frequent topics that appear many times in different clusters. Both of these appear 17 times in different significant clusters. ‘Awareness’ is specified those actions whose are taken by individuals and situation symbolizes the general situation of particular places/incidents where pandemic news indicates a generic situation relating to COVID-19. ‘Wishes’ appear 8 and ‘new’ appears 7 times in this study. Furthermore, ‘caring’, ‘coronavirus’, ‘right’ and ‘treatment’ are found 5 times, and ‘message’, and ‘social distance’ are found 4 times this effort. Subsequently, ‘cases’, ‘prevention’, ‘testing’ and ‘tourism’ are found 3 times in the COVID-19 situation. In addition, other precaution related topics such as ‘affect’, ‘annoying’, ‘blaming’, ‘closing’, ‘crisis’, ‘effect’, ‘facts’, ‘financial help’, ‘help’, ‘infectious’, ‘lockdown’, ‘medicine’, ‘need’, ‘panic’, ‘quarantine’, ‘risk’ and ‘scaring’ are shown their frequency 2 times in different clusters. These are appeared regularly and specifies how we can improve this condition. However, some of negative topics, for instance’blaming’,’crisis’,’infectious’, ‘panic’, ‘risk’ appeared in positive cases but their frequencies are not greater. More upcoming positive issues are also addressed in this analysis included ‘financial help, ‘, ‘help’, ‘lockdown’, ‘quarantine’ and ‘medicine.

Figure 7:

Top Frequency of (a) Positive (b) Neutral (c) Negative COVID-19 Associated Topics

In the neutral category, there are appeared the mixture of positive and negative topics which indicates the most frequent topics in recent times. For example, we represent an example of neutral topics as a world cloud is shown in Figure 5. Besides this, neutral topics of different clusters are provided in Table 14 and top frequent topics are shown at Figure 7b. Therefore, ‘situation’, ‘panic’ and ‘awareness’ are found 19, 16 and 13 times in the following list of twitter topics. ‘Panic’ is a related topic to explain epidemic conditions and news. In addition, ‘wish’ and ‘coronavirus’ appear 6 times as well as ‘caring’ which appears 5 times at negative tweets. Consequently, ‘blaming’, ‘cases’, ‘die’, ‘warning’ and ‘protection’ appear 4 times while education, ‘food’, ‘joke’, ‘message’, ‘news’, ‘prevention’, and ‘symptom’ appear 3 times in this condition. The rest of the topics perform 2 times to represent as neutral topics.

The more upcoming issue before and after COVID-19like’Financial’,’lose’,’crisis’,’food’, ‘education’ also arose in this analysis.

The negative topics using the word cloud are represented in Figure 6. Thus, the meaning of negative topics has been provided in Table 15 and topmost frequent topics are shown in Figure 7c. In this category, panic and situation appear most of the times than other topics. Both of them appear 20 and 18 times respectively. ‘Dead’ and ‘disease’ appear 6 and 5 times enabling estimation of its influence. Thus, ‘food’ and ‘blaming’ appear 4 times and ‘treatment’, ‘sick’, ‘fake news’ and ‘avoid’ appear 3 times to represent significant topics. Some cases like ‘food’ and ‘treatment’ indicate the level of crisis perceived. The rest of the topics presented with a frequency of 2 in this work. Therefore, these topics shown in the top list indicate feelings or perceptions relating to the COVID-19 that are negative.

3.3 Implication

Therefore, we explored different topics that represent feelings or perceptions that relate to the current situation of the COVID-19 pandemic. Every day many people share their idea, opinion, argument etc. on different social media like twitter. But, these huge amount of opinion cannot represent to the general people more frequently. In this case, this dynamic topics modeling is so much helpful to understand this pandemic and predict the future condition. Proposed TClustVID shows more accuracy than traditional approach. In high performing clusters, we extracted positive, neutral and negative topics to investigate what mattered to the tweets and realized the associated topics of individual categories. These opinions and comments on social media reflecting significant values and gives various information about related issues. Hence, these topics can be informative to government and policymakers that need to make a rapid decision and deal with the uncertain COVID-19 situation using the best available information. In addition, these types of analysis help to clarify the concerns to the people finding themselves experiencing the pandemic situation in every day. The most frequently raised topics thus indicate perspectives on the current situation from the point of view of public reaction. These twitter datasets are open source and so can be gathered the largest quantities of tweets of the users. However, physicians and researchers also get various kinds of information that help them to get proper knowledge about this and explore innovative things to prevent this pandemic.

4. Conclusion

In this work, we proposed a clustered based classification and topics extraction model named TClustVID that can produce improved results of the classifiers compared to the traditional methods, and extracted significant topics from the high performing clusters of COVID-19 twitter datasets. This is almost the first study in which COVID-19-related twitter data has been investigated using this proposed model where different machine learning algorithms show the best results compared to more traditional approaches. In this work, TClustVID generates several clusters where one of them represent high classification accuracy that means it contains more significant topics that really represents the public opinions on twitter. In TClustVID, we used most widely used k-means clustering [14], when the contents of primary data are merged with clustered groups again and retokenizes this process. So, it makes the best appropriate results and helps machine learning classifier to understand different categories more clearly. In addition, this study not only identified the best classification model but also extracted significant topics that could be used for designing strategies to counter the pandemic. A great deal of information can be abstracted from very large numbers of tweets by the extraction of commonly occurring topics using LDA. These knowledge can be extracted from positive, neutral and negative tweets and identified high frequent information that are being transmitted and commented as the response to the pandemic situation. Such information can also use to gain insight into population activities, demands, opinions and responsibilities, and might be used to trace otherwise unidentified hot spots of COVID-19 infection by investigating topics and its categories and cross correlating this with medical data from other sources. There are some important limitations to note, such as these datasets do not contain more instances in upcoming months that relate to the COVID-19 tweets on twitter. Again, the interpretation of topics is a challenging task, hence some manual interpretation of topics may misinterpret in the topics modeling. In future work, more COVID-19 twitter data will be collected from different data repositories and investigated with these and other more advanced techniques currently being developed, which will enable more significant information extraction on COVID-19 topics.

References

[1].↵
G. Lippi, M. Plebani, Procalcitonin in patients with severe coronavirus disease 2019 (covid-19): a meta-analysis, Clinica Chimica Acta; Inter- national Journal of Clinical Chemistry (2020).
Google Scholar
[2].↵
R.-H. Xu, J.-F. He, M. R. Evans, G.-W. Peng, H. E. Field, D.-W. Yu, C.-K. Lee, H.-M. Luo, W.-S. Lin, P. Lin, et al., Epidemiologic clues to sars origin in china, Emerging infectious diseases 10 (2004) 1030.
OpenUrl CrossRef PubMed Web of Science Google Scholar
[3].↵
H. Zhang, C. Wheldon, A. G. Dunn, C. Tao, J. Huo, R. Zhang, M. Pros- peri, Y. Guo, J. Bian, Mining twitter to assess the determinants of health behavior toward human papillomavirus vaccination in the united states, Journal of the American Medical Informatics Association 27 (2020) 225– 235.
OpenUrl Google Scholar
[4].↵
A. Akay, A. Dragomir, B. Erlandsson, Network-based modeling and intelligent data mining of social media for improving care, IEEE Journal of Biomedical and Health Informatics 19 (2015) 210–218.
OpenUrl Google Scholar
[5].↵
D. J. Fiander, Social media for academic libraries, in: Social Media for Academics, Elsevier, 2012, pp. 193–210.
Google Scholar
[6].↵
D. T. Nguyen, K. A. A. Mannai, S. Joty, H. Sajjad, M. Imran, P. Mi- tra, Rapid classification of crisis-related data on social networks using convolutional neural networks, arXiv preprint arXiv:1608.03902 (2016).
Google Scholar
[7].↵
C. E. Lopez, M. Vasu, C. Gallemore, Understanding the perception of covid-19 policies by mining a multilanguage twitter dataset, arXiv preprint arXiv:2003.10359 (2020).
Google Scholar
[8].↵
R. Kouzy, J. Abi Jaoude, A. Kraitem, M. B. El Alam, B. Karam, E. Adib, J. Zarka, C. Traboulsi, E. W. Akl, K. Baddour, Coronavirus goes viral: quantifying the covid-19 misinformation epidemic on twitter, Cureus 12 (2020).
Google Scholar
[9].↵
M. Cinelli, W. Quattrociocchi, A. Galeazzi, C. M. Valensise, E. Brugnoli, A. L. Schmidt, P. Zola, F. Zollo, A. Scala, The covid-19 social media infodemic, arXiv preprint arXiv:2003.05004 (2020).
Google Scholar
[10].↵
R. J. Medford, S. N. Saleh, A. Sumarsono, T. M. Perl, C. U. Lehmann, An” infodemic”: Leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak, medRxiv (2020).
Google Scholar
[11].↵
R. Lamsal, Corona virus (covid-19) tweets dataset, 2020. URL: http://dx.doi.org/10.21227/781w-ef42:. doi:10.21227/781w-ef42.
OpenUrl CrossRef Google Scholar
[12].↵
K. Sangeetha, D. Prabha, Sentiment analysis of student feedback using multi-head attention fusion model of word and context embedding for lstm, Journal of Ambient Intelligence and Humanized Computing (2020) 1–10.
Google Scholar
[13].↵
K. Crockett, D. Mclean, A. Latham, N. Alnajran, Cluster analysis of twitter data: a review of algorithms, in: Proceedings of the 9th International Conference on Agents and Artificial Intelligence, volume 2, Science and Technology Publications (SCITEPRESS)/Springer Books, 2017, pp. 239–249.
OpenUrl Google Scholar
[14].↵
S. Ahuja, G. Dubey, Clustering and sentiment analysis on twitter data, in: 2017 2nd International Conference on Telecommunication and Net- works (TEL-NET), IEEE, 2017, pp. 1–5.
Google Scholar
[15].↵
D. Godfrey, C. Johns, C. Meyer, S. Race, C. Sadek, A case study in text mining: Interpreting twitter data from world cup tweets, arXiv preprint arXiv:1408.5427 (2014).
Google Scholar
[16].↵
K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, A. Choudhary, Twitter trending topic classification, in: 2011 IEEE 11th International Conference on Data Mining Workshops, IEEE, 2011, pp. 251–258.
Google Scholar
[17].↵
V. Ong, A. D. Rahmanto, D. Suhartono, A. E. Nugroho, E. W. An-dangsari, M. N. Suprayogi, et al., Personality prediction based on twitter information in bahasa indonesia, in: 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE, 2017, pp. 367–372.
Google Scholar
[18].↵
B. Y. Pratama, R. Sarno, Personality classification based on twitter text using naive bayes, knn and svm, in: 2015 International Conference on Data and Software Engineering (ICoDSE), IEEE, 2015, pp. 170–174.
Google Scholar
[19].↵
M. Mccord, M. Chuah, Spam detection on twitter using traditional classifiers, in: international conference on Autonomic and trusted computing, Springer, 2011, pp. 175–186.
Google Scholar
[20].↵
N. Mamgain, E. Mehta, A. Mittal, G. Bhatt, Sentiment analysis of top colleges in india using twitter data, in: 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), IEEE, 2016, pp. 525–530.
Google Scholar
[21].↵
T. R. Li, A. Chamrajnagar, X. Fong, N. Rizik, F. Fu, Sentiment-based prediction of alternative cryptocurrency price fluctuations using gradient boosting tree model, Frontiers in Physics 7 (2019) 98.
OpenUrl Google Scholar
[22].↵
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Van-derplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, Ádouard Duchesnay, Scikit-learn: Machine learning in python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL:http://jmlr.org/papers/v12/pedregosa11a.html.
OpenUrl CrossRef Google Scholar

Posted August 04, 2020.

Download PDF

Author Declarations

Data/Code

Citation Tools

Get QR code

Tweet Widget

Subject Area

Infectious Diseases (except HIV/AIDS)

Reviews and Context

Comment

TRIP Peer Reviews

Community Reviews

Automated Services

Blogs/Media

Author Videos

Subject Areas

All Articles

Addiction Medicine (413)
Allergy and Immunology (734)
Anesthesia (215)
Cardiovascular Medicine (3142)
Dentistry and Oral Medicine (351)
Dermatology (267)
Emergency Medicine (467)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1115)
Epidemiology (13106)
Forensic Medicine (15)
Gastroenterology (873)
Genetic and Genomic Medicine (4923)
Geriatric Medicine (453)
Health Economics (758)
Health Informatics (3104)
Health Policy (1110)
Health Systems and Quality Improvement (1145)
Hematology (414)
HIV/AIDS (978)
Infectious Diseases (except HIV/AIDS) (14396)
Intensive Care and Critical Care Medicine (891)
Medical Education (459)
Medical Ethics (121)
Nephrology (507)
Neurology (4684)
Nursing (248)
Nutrition (692)
Obstetrics and Gynecology (850)
Occupational and Environmental Health (770)
Oncology (2414)
Ophthalmology (688)
Orthopedics (270)
Otolaryngology (334)
Pain Medicine (313)
Palliative Medicine (88)
Pathology (522)
Pediatrics (1254)
Pharmacology and Therapeutics (530)
Primary Care Research (526)
Psychiatry and Clinical Psychology (4025)
Public and Global Health (7244)
Radiology and Imaging (1619)
Rehabilitation Medicine and Physical Therapy (965)
Respiratory Medicine (949)
Rheumatology (463)
Sexual and Reproductive Health (482)
Sports Medicine (407)
Surgery (522)
Toxicology (65)
Transplantation (223)
Urology (193)

Comments

medRxiv aims to provide a venue for anyone to comment on a medRxiv preprint. Comments are moderated for offensive or irrelevant content (this can take ~24 h). Please avoid duplicate submissions and read our Comment Policy before commenting. The content of a comment is not endorsed by medRxiv.

medRxiv aims to inform readers about online discussion of this preprint occurring elsewhere. The content at the links below is not endorsed by either medRxiv or the preprint's authors.

Community reviews for this article:

There are no community reviews for this paper.

Automated Evaluations

Certain services provide automated analysis of preprints. Analyses invited by the authors are displayed at the top of this tab. Those done independently of authors are shown underneath . None of these analyses is endorsed by medRxiv.

Automated Evaluations:

There are no automated evaluations for this paper.

[1] [1].↵
G. Lippi, M. Plebani, Procalcitonin in patients with severe coronavirus disease 2019 (covid-19): a meta-analysis, Clinica Chimica Acta; Inter- national Journal of Clinical Chemistry (2020).
Google Scholar

[2] [2].↵
R.-H. Xu, J.-F. He, M. R. Evans, G.-W. Peng, H. E. Field, D.-W. Yu, C.-K. Lee, H.-M. Luo, W.-S. Lin, P. Lin, et al., Epidemiologic clues to sars origin in china, Emerging infectious diseases 10 (2004) 1030.
OpenUrl CrossRef PubMed Web of Science Google Scholar

[3] [3].↵
H. Zhang, C. Wheldon, A. G. Dunn, C. Tao, J. Huo, R. Zhang, M. Pros- peri, Y. Guo, J. Bian, Mining twitter to assess the determinants of health behavior toward human papillomavirus vaccination in the united states, Journal of the American Medical Informatics Association 27 (2020) 225– 235.
OpenUrl Google Scholar

[4] [4].↵
A. Akay, A. Dragomir, B. Erlandsson, Network-based modeling and intelligent data mining of social media for improving care, IEEE Journal of Biomedical and Health Informatics 19 (2015) 210–218.
OpenUrl Google Scholar

[5] [5].↵
D. J. Fiander, Social media for academic libraries, in: Social Media for Academics, Elsevier, 2012, pp. 193–210.
Google Scholar

[6] [6].↵
D. T. Nguyen, K. A. A. Mannai, S. Joty, H. Sajjad, M. Imran, P. Mi- tra, Rapid classification of crisis-related data on social networks using convolutional neural networks, arXiv preprint arXiv:1608.03902 (2016).
Google Scholar

[7] [7].↵
C. E. Lopez, M. Vasu, C. Gallemore, Understanding the perception of covid-19 policies by mining a multilanguage twitter dataset, arXiv preprint arXiv:2003.10359 (2020).
Google Scholar

[8] [8].↵
R. Kouzy, J. Abi Jaoude, A. Kraitem, M. B. El Alam, B. Karam, E. Adib, J. Zarka, C. Traboulsi, E. W. Akl, K. Baddour, Coronavirus goes viral: quantifying the covid-19 misinformation epidemic on twitter, Cureus 12 (2020).
Google Scholar

[9] [9].↵
M. Cinelli, W. Quattrociocchi, A. Galeazzi, C. M. Valensise, E. Brugnoli, A. L. Schmidt, P. Zola, F. Zollo, A. Scala, The covid-19 social media infodemic, arXiv preprint arXiv:2003.05004 (2020).
Google Scholar

[10] [10].↵
R. J. Medford, S. N. Saleh, A. Sumarsono, T. M. Perl, C. U. Lehmann, An” infodemic”: Leveraging high-volume twitter data to understand public sentiment for the covid-19 outbreak, medRxiv (2020).
Google Scholar

[11] [11].↵
R. Lamsal, Corona virus (covid-19) tweets dataset, 2020. URL: http://dx.doi.org/10.21227/781w-ef42:. doi:10.21227/781w-ef42.
OpenUrl CrossRef Google Scholar

[12] [12].↵
K. Sangeetha, D. Prabha, Sentiment analysis of student feedback using multi-head attention fusion model of word and context embedding for lstm, Journal of Ambient Intelligence and Humanized Computing (2020) 1–10.
Google Scholar

[13] [13].↵
K. Crockett, D. Mclean, A. Latham, N. Alnajran, Cluster analysis of twitter data: a review of algorithms, in: Proceedings of the 9th International Conference on Agents and Artificial Intelligence, volume 2, Science and Technology Publications (SCITEPRESS)/Springer Books, 2017, pp. 239–249.
OpenUrl Google Scholar

[14] [14].↵
S. Ahuja, G. Dubey, Clustering and sentiment analysis on twitter data, in: 2017 2nd International Conference on Telecommunication and Net- works (TEL-NET), IEEE, 2017, pp. 1–5.
Google Scholar

[15] [15].↵
D. Godfrey, C. Johns, C. Meyer, S. Race, C. Sadek, A case study in text mining: Interpreting twitter data from world cup tweets, arXiv preprint arXiv:1408.5427 (2014).
Google Scholar

[16] [16].↵
K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, A. Choudhary, Twitter trending topic classification, in: 2011 IEEE 11th International Conference on Data Mining Workshops, IEEE, 2011, pp. 251–258.
Google Scholar

[17] [17].↵
V. Ong, A. D. Rahmanto, D. Suhartono, A. E. Nugroho, E. W. An-dangsari, M. N. Suprayogi, et al., Personality prediction based on twitter information in bahasa indonesia, in: 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE, 2017, pp. 367–372.
Google Scholar

[18] [18].↵
B. Y. Pratama, R. Sarno, Personality classification based on twitter text using naive bayes, knn and svm, in: 2015 International Conference on Data and Software Engineering (ICoDSE), IEEE, 2015, pp. 170–174.
Google Scholar

[19] [19].↵
M. Mccord, M. Chuah, Spam detection on twitter using traditional classifiers, in: international conference on Autonomic and trusted computing, Springer, 2011, pp. 175–186.
Google Scholar

[20] [20].↵
N. Mamgain, E. Mehta, A. Mittal, G. Bhatt, Sentiment analysis of top colleges in india using twitter data, in: 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), IEEE, 2016, pp. 525–530.
Google Scholar

[21] [21].↵
T. R. Li, A. Chamrajnagar, X. Fong, N. Rizik, F. Fu, Sentiment-based prediction of alternative cryptocurrency price fluctuations using gradient boosting tree model, Frontiers in Physics 7 (2019) 98.
OpenUrl Google Scholar

[22] [22].↵
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Van-derplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, Ádouard Duchesnay, Scikit-learn: Machine learning in python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL:http://jmlr.org/papers/v12/pedregosa11a.html.
OpenUrl CrossRef Google Scholar

TClustVID: A Novel Machine Learning Classification Model to Investigate Topics and Sentiment in COVID-19 Tweets

Abstract

1. Introduction