Billion-Scale Investigation of COVID-19 Impact on Human Communication in 104 Languages

Muhammad Abdul-Mageed
Jun 15, 2020
Figure 1. World map coverage of Mega-COV V0.1. Each dot is a city. Contiguous cities of the same color belong to the same country.

The Footprint of COVID-19: A Global Impact on Human Life

COVID-19 has changed the way we lead our lives. Regardless of age, gender, culture, economic class, education, income, language, place, or profession, we are all in this together. When they communicate, however, different people choose to talk about different topics, use different styles, express different emotions, report different experiences, refer to different places, and interact with different people and media. That is, even though the pandemic is affecting everyone, not all people are necessarily responding in the same way. But how can we study that? What are the most important issues we should prioritize? Which places (countries, provinces/states, cities, etc.) are most important to start with? Another important question is what our point of comparison should be, so that we can measure how far people’s behaviors, hopes, fears, and needs have shifted from what they used to be. A third question is related to data: What data are useful? There are many other questions besides. Ultimately, the impact the pandemic is having on human life, and perhaps all types of life on our planet, will be studied for a long time.

In our recent paper, currently released on arXiv, we report our efforts to create a billion-scale dataset from social media (i.e., Twitter) to enable the study of COVID-19’s impact on our daily lives. (We note that the current post is based on an updated version of the paper that will soon be on arXiv as well. We will update the link when the new version is online.) We had three design goals. First, we wanted a dataset large enough that its investigation can yield reasonably generalizable conclusions. Second, we wanted a dataset that is diverse (e.g., representing different languages, cultures, communities, countries). Third, because we are interested in comparing user behavior and communication patterns over time, we wanted the data to have extended, multi-year temporal coverage. We designed Mega-COV, the dataset we report in the paper, around these principles. We explain each of these design aspects next.

Mega-COV: A Billion-Scale Dataset

The dataset we created, which we call Mega-COV, is composed of ~1 billion tweets from ~1 million users. The size of the data makes it possible to slice and dice along various attributes, depending on the specific research question, without having to worry about basing analysis on small user or tweet samples. As a single example, imagine trying to measure population well-being in a given city based on a language of interest during a specific week in March 2020. The data should, generally speaking, be sufficient to carry out such modeling at both the user and tweet levels.

Geographic Diversity

Mega-COV comprises data posted from a total of 167,202 cities (or towns, etc.) from 268 countries. We base this analysis on two types of tweet-level information (that we also project at the user level under certain conditions): (a) ‘point’ locations such as cities, and (b) geo-tags (i.e., longitude and latitude). Figure 1 above shows each city from which the data are collected as a point on the map. Figure 2 below shows the geo-locations from which the tweets were posted.

Figure 2. Geo-location coverage in Mega-COV V0.1. Each dot is a point coordinate (longitude and latitude) from which at least one tweet was posted.

Figure 3 below further illustrates geographic diversity in the data.

Figure 3. Geographical diversity in Mega-COV V0.2. We show the distribution of our geo-located data over the top 20 countries with most tweets and responses. Overall, there are 268 countries in the data.

Table 1 below provides the distribution of Mega-COV V0.2 over select countries from different continents.

Table 1. Distribution of data over top countries per continent in Mega-COV V0.2 (all data vs. 2020).

Temporal Coverage

To enable comparisons over time, Mega-COV is designed with past coverage in mind. In other words, rather than crawling individual tweets, we collect up to 3,200 tweets from each user. This gives us historical data going as far back as 2007 for some of the users. As such, historical data can be used to contextualize the outbreak period and provide points of comparison to it.

Figure 4. Distribution of Mega-COV V0.2 data points (including a breakdown of tweets, replies, and retweets) over time.

Twitter users, perhaps for the first time in history, are more interested in talking directly to one another (i.e., in replies) than in just broadcasting (i.e., posting tweets).

A Striking Discovery!

One striking discovery we could uncover from the data is that users, perhaps for the first time in the history of Twitter, interact with one another (i.e., in replies) more than they tweet. This is also shown in Figure 4 above. Another observation is that users are retweeting much more frequently than they are tweeting. There is also a crystal clear surge in use of Twitter in general: 2020 data from Jan.-May already exceeds all data from the whole of 2019!
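To make the tweet/reply/retweet breakdown concrete, here is a minimal sketch of how such counts could be computed per year. It is an illustration rather than the procedure from the paper: it assumes tweets are stored as Twitter API v1.1-style JSON objects, where a retweet carries a `retweeted_status` field, a reply has a non-null `in_reply_to_status_id`, and `created_at` uses the v1.1 timestamp format. The function names and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout used by v1.1-style tweet objects (assumed storage format).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_kind(tweet: dict) -> str:
    """Label a tweet object as 'retweet', 'reply', or 'tweet'."""
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "tweet"


def counts_by_year(tweets) -> Counter:
    """Count data points per (year, kind) pair."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[(created.year, tweet_kind(tweet))] += 1
    return counts


# Made-up records for illustration only.
sample = [
    {"created_at": "Mon Mar 16 10:00:00 +0000 2020", "in_reply_to_status_id": 1},
    {"created_at": "Tue Mar 17 11:30:00 +0000 2020", "in_reply_to_status_id": None},
    {"created_at": "Wed Mar 18 09:15:00 +0000 2020", "retweeted_status": {"id": 7}},
    {"created_at": "Fri Mar 15 08:00:00 +0000 2019", "in_reply_to_status_id": None},
]
by_year = counts_by_year(sample)
```

Comparing the `reply` and `tweet` entries for a given year is then a dictionary lookup, which is the kind of comparison Figure 4 visualizes.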

Our data also show that the biggest surge in Twitter activity was in March. Figure 5 below clearly illustrates this finding.

Figure 5. Twitter user activity for Jan-May, 2020

Linguistic Diversity

Mega-COV V0.2 comprises data from 104 languages. We perform this analysis on tweets, retweets, and responses in the data. A total of 65 language tags come from Twitter, and we use the language identification tool langid (Lui and Baldwin, 2012) to tag all examples labeled as “undef” (for “unidentified” language) by Twitter. Table 2 below shows top languages in the data as tagged by Twitter (left) and langid (right). It is exciting how linguistically rich Mega-COV V0.2 is!

Table 2. Top 20 languages assigned by Twitter (left) and top 20 languages assigned by langid (right) in Mega-COV V0.2. BI: Bahasa Indonesia.
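The two-stage tagging described above (keep Twitter's tag when there is one, fall back to langid otherwise) could be sketched roughly as follows. This is an assumption-laden illustration, not our actual pipeline: `resolve_language`, `UNDEFINED_TAGS`, and `stub_classify` are hypothetical names, and the set of placeholder values treated as "no language" is a guess. The classifier is passed in as a function so the sketch runs without any third-party package.

```python
from typing import Callable, Optional, Tuple

# Values treated as "Twitter did not identify a language" (assumed set).
UNDEFINED_TAGS = {"undef", "und", None}


def resolve_language(
    twitter_lang: Optional[str],
    text: str,
    classify: Callable[[str], Tuple[str, float]],
) -> Tuple[str, str]:
    """Return (language, source), where source is 'twitter' or 'langid'."""
    if twitter_lang not in UNDEFINED_TAGS:
        return twitter_lang, "twitter"
    # Fall back to the external identifier only for untagged examples.
    lang, _score = classify(text)
    return lang, "langid"


def stub_classify(text: str) -> Tuple[str, float]:
    """Stand-in classifier with a fixed answer, used only for this demo."""
    return ("tl", -12.5)


tagged = resolve_language("en", "Stay home, stay safe", stub_classify)
fallback = resolve_language("undef", "Manatili sa bahay", stub_classify)
```

With the langid package installed, `langid.classify` can be passed in place of `stub_classify`, since it also takes a string and returns a (language, score) pair.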

Hashtag Content Analysis

Hashtags usually provide a window on the topics in tweets and user interactions (e.g., attitudes, preferences) around these topics. For this reason, we perform an analysis of the hashtags used in the data. We find that hashtags related to the pandemic are abundant. Examples are COVID19, coronavirus, Coronavirus, Covid19, covid19 and StayAtHome. However, we also find many hashtags related to gaming and politics. Interestingly, there is telling regional variation in the hashtags. For example, in the Chinese data, while we find hashtags such as ChinaPneumonia and WuhanPneumonia, these are not observable in other regions. In addition, apple seemed to be trending in China during the first 4 months of 2020 (hashtags such as appledaily and appledailytw). Figure 6 below shows word clouds based on hashtags from the top 10 languages in our data.

Figure 6: Word clouds for hashtags in tweets from the top 10 languages in the data. We note that non-English tweets can still carry English hashtags or employ Latin script.
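As the examples above show, the same hashtag often appears in several casings (COVID19, Covid19, covid19). Below is a minimal sketch of hashtag frequency counting that keeps both the raw and the case-folded counts. It is illustrative only: it pulls hashtags out of the tweet text with a simple regular expression, whereas hydrated tweet objects also list hashtags explicitly, and the function name and sample texts are made up.

```python
import re
from collections import Counter

# '#' followed by word characters; \w is Unicode-aware for str patterns.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_counts(texts):
    """Return (raw_counts, folded_counts) over an iterable of tweet texts."""
    raw, folded = Counter(), Counter()
    for text in texts:
        for tag in HASHTAG_RE.findall(text):
            raw[tag] += 1
            folded[tag.casefold()] += 1  # merge casing variants
    return raw, folded


# Made-up texts for illustration only.
texts = [
    "Stay safe #COVID19 #StayAtHome",
    "#covid19 update",
    "#Covid19 and #coronavirus",
]
raw_counts, folded_counts = hashtag_counts(texts)
```

Raw counts are what a word cloud of surface forms would use, while folded counts give a better sense of how dominant a topic is overall.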

Domain Sharing Analysis

We also perform an analysis of the top 200 domains shared in each of 2019 and 2020. One major observation stands out: There is a surge in tweets involving news websites, and there is a rise in ranks for the majority of these websites compared to 2019. Table 3 below shows this trend.

Table 3: Top 40 domains in 2020 data and their rank change relative to their rank in 2019.
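A rank-change comparison like the one in Table 3 can be sketched as follows. This is a simplified illustration, not the exact procedure behind the table: it assumes we already have the expanded (non-shortened) URLs shared in each year, treats the host name minus a leading `www.` as the domain, and breaks ties alphabetically. All function names and URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Lower-cased host name with a leading 'www.' removed."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def rank_domains(urls) -> dict:
    """Map each domain to its rank (1 = most shared)."""
    counts = Counter(domain_of(u) for u in urls)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {domain: rank for rank, (domain, _) in enumerate(ordered, start=1)}


def rank_change(prev: dict, curr: dict) -> dict:
    """Positive = moved up since the previous year; None = newly ranked."""
    return {d: (prev[d] - r if d in prev else None) for d, r in curr.items()}


# Made-up URLs for illustration only.
urls_2019 = [
    "https://youtube.com/a", "https://youtube.com/b", "https://youtube.com/c",
    "https://www.instagram.com/x", "https://www.instagram.com/y",
    "https://www.bbc.co.uk/news/1",
]
urls_2020 = [
    "https://youtube.com/a", "https://youtube.com/b", "https://youtube.com/c",
    "https://www.bbc.co.uk/news/2", "https://www.bbc.co.uk/news/3",
    "https://www.instagram.com/z", "https://example.org/new",
]
changes = rank_change(rank_domains(urls_2019), rank_domains(urls_2020))
```

In this toy data the news domain climbs one position while the social media domain drops, mirroring the direction of the trend reported above.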

Case Study: Mapping Human Mobility with Mega-COV V0.2

We also use Mega-COV geotags to map human mobility in a number of ways. Below are some visualizations from this analysis; the reader is referred to the paper for more details. (We show the visualizations in order of ‘interestingness’, which is quite subjective, but also based on the amount of data we have from each country.)
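One simple way to turn geotagged tweets into region-to-region flows is sketched below. The exact definition of mobility we use is given in the paper; this sketch makes its own assumption, namely that a "move" is any pair of consecutive geotagged tweets by the same user that fall in different regions. The function name, the record layout, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def mobility_flows(records) -> Counter:
    """Count (source, destination) region transitions per user.

    `records` is an iterable of (user_id, timestamp, region) tuples, where
    region is whatever unit is being mapped (state, province, and so on).
    """
    visits_by_user = defaultdict(list)
    for user_id, timestamp, region in records:
        visits_by_user[user_id].append((timestamp, region))

    flows = Counter()
    for visits in visits_by_user.values():
        visits.sort()  # chronological order within each user
        for (_, src), (_, dst) in zip(visits, visits[1:]):
            if src != dst:  # ignore consecutive tweets from the same region
                flows[(src, dst)] += 1
    return flows


# Made-up records for illustration only.
records = [
    ("u1", 1, "WA"), ("u1", 2, "WA"), ("u1", 3, "OR"), ("u1", 4, "WA"),
    ("u2", 1, "OR"), ("u2", 5, "WA"),
    ("u3", 2, "CA"),
]
flows = mobility_flows(records)
```

The resulting counts are the kind of edge weights a flow diagram between regions needs; users with a single geotagged tweet contribute nothing.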

The U.S.

Figure 7. Inter-state user mobility in the U.S. for Jan.-May, 2020

The U.K.

Figure 8. User mobility between United Kingdom countries during Jan.-May 2020

Canada

Figure 9. User mobility between Canadian provinces during Jan.-May 2020

Italy

Figure 10. User mobility between Italy regions (regioni) during Jan.-May 2020

Saudi Arabia

Figure 11. User mobility between Saudi Arabia regions during Jan.-May 2020

Brazil

Figure 12: User mobility between Brazil states (estados) during Jan.-May 2020

Ethics

We collect Mega-COV from the public domain (Twitter). In compliance with Twitter policy, we do not publish hydrated tweet content. Rather, we only publish publicly available tweet IDs. All Twitter policies, including respect and protection of user privacy, apply. We decided not to assign geographic region tags to the tweet IDs we distribute, although these already exist on the JSON object that can be retrieved from Twitter. We encourage all researchers who decide to use Mega-COV to review Twitter policy before they start working with the data.

Concluding Remarks

We hope Mega-COV will be useful for research around the pandemic. We are excited about work in this space, and hope researchers will collectively contribute valuable solutions that help society get through these hard times. We will likely post more updates on Mega-COV. We welcome any feedback.

Acknowledgements

I gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), the Canada Foundation for Innovation (CFI), Compute Canada, and UBC. I also acknowledge the hard work of my students and co-authors, AbdelRahim Elmadany, Dinesh Pabbi, Kunal Verma, and Rannie Lin.

Muhammad Abdul-Mageed

Canada Research Chair in Natural Language Processing and Machine Learning, The University of British Columbia; Director of UBC Deep Learning & NLP Group