Automatic Content Analysis of Social Media Short Texts: Scoping Review of Tools and Technologies
Abstract
In the continuously growing methodological pool of modern social research, content analysis remains a widely applied method (Hsieh & Shannon, 2005) and one of the most important research techniques (Krippendorff, 2018), applicable to both qualitative and quantitative data. The method can be used to study verbal, written, or visual information. Today, as social media gains increasing importance, content analysis can be applied to examine social media texts. Many researchers who apply content analysis read and analyze texts manually. However, computer programs can be applied at two phases of content analysis: first, for storing, analyzing, and reporting research data, and second, for automatic screening of texts and for identifying and coding words and phrases (Macnamara, 2005). The aim of this scoping review was to identify the nature and extent of methods and tools for automatic content analysis (ACA) of social media short texts. With a focus on qualitative content analysis, the following research questions were raised: i) how are automatic content analysis tools used for the empirical examination of social media short texts? ii) what are the scope and categories of tools and technologies used in empirical automatic content analysis research? Articles were included according to the following criteria: i) only empirical open-access articles published in 2015-2019 were considered; ii) the articles were freely accessible from the Sage, Springer, and Emerald portfolios; iii) the articles examined methods and/or tools for automatic content analysis; iv) the articles used social media short texts for empirical data analysis. In total, 106 articles were selected as fulfilling the search criteria. Additionally, web portals that provided information not accessible from the reviewed articles were included.
Scoping review methodology (Arksey & O’Malley, 2005) was applied to collect data from empirical articles and web portals, systematically mapping the scientific literature and the newest knowledge available on the topic, and identifying automatic content analysis categories and popular techniques, sources of empirical evidence, and gaps in scholarly research. The following ACA categories were identified from the empirical articles and web portals: classification of texts; text segmentation; text clustering; text retrieval; text extraction; natural language processing; study of a specific language; machine learning (including deep learning); and working with specific language analysis tools. The associated research questions were discussed: how to identify text, how to collect text, how to preprocess text for analysis, how to prepare specific-language text, and how to analyze text. The three questions most frequently raised in empirical ACA studies concerned preprocessing networked texts (61% of the reviewed cases), analyzing social media short texts (51% of the reviewed cases), and collecting texts (30% of the reviewed cases). The largest variety of empirical articles describing the use of ACA was found in the scholarly journals of the Sage and Emerald publishing houses, which covered the topics of identifying, collecting, preprocessing, and analyzing social media data and of working with non-standard-language short texts. In Springer, the topics of identifying, collecting, and analyzing social media short texts were covered in detail. The scoping review showed that ACA tools can be applied to categorize all forms of content (e.g. Faraz, 2015); predict conditions and processes from social media short texts (e.g. De Choudhury, Gamon, Counts et al., 2013); explore influences and powers, judgements, interactions, and decisions (e.g. Riff, Lacy, Fico et al., 2019); and draw general insights.
Automatic prediction and evaluation can be performed by identifying educational and psychosocial parameters from the use of normative or non-standard language-specific words and terms in mediated texts. Forecasting can be developed using large multifunctional mathematical regression models or support vector machines. Social insights and predictions can be measured by repeated univariate (or bivariate) analyses. In this scoping review, the most widely used ACA technologies for the analysis of social media short texts were grouped into two classes: specific software tools, which are either user-friendly (e.g. Leximancer) or require coding (e.g. Mallet or Stanford TMT), and programming language packages and libraries (e.g. in R or Python). Although ACA is a rapidly growing method that is widely applied in empirical research, and different ACA tools and techniques exist, this scoping review discusses the common gaps and draws recommendations for the application of automated content analysis by social scientists. It was found that ACA works best when models are trained on social media short text data and applied to a specific situation or context. Natural language processing requires consistent explanations, and in many cases relatively low accuracy and reliability of the intermediate links in natural language processing studies were observed. This research argues that even the latest natural language processing tools do not match the researcher's ability to analyze texts and to extract codes, categories, and their subjective meanings.
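To make the preprocessing and automatic coding steps named above more concrete, the following is a minimal sketch in Python, using only the standard library. It is not taken from any of the reviewed studies: the codebook, its categories, the function names, and the sample post are all hypothetical, and a real study would normally rely on a validated dictionary or a trained model rather than a hand-written word list.

```python
import re
from collections import Counter

# Hypothetical codebook (category -> indicator words); illustrative only,
# not a dictionary used in any of the reviewed articles.
CODEBOOK = {
    "positive": {"good", "great", "love", "happy"},
    "negative": {"bad", "sad", "hate", "angry"},
}

URL_RE = re.compile(r"https?://\S+")   # web links
MENTION_RE = re.compile(r"@\w+")       # user mentions such as @anna
TOKEN_RE = re.compile(r"[a-z']+")      # lowercase word tokens


def preprocess(text):
    """Lowercase the text, drop URLs and @mentions, keep hashtag words, tokenize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")      # "#great" is kept as the word "great"
    return TOKEN_RE.findall(text)


def code_text(text, codebook=CODEBOOK):
    """Count how many tokens of the text fall into each codebook category."""
    counts = Counter()
    for token in preprocess(text):
        for category, words in codebook.items():
            if token in words:
                counts[category] += 1
    return counts


if __name__ == "__main__":
    post = "Love this! #Great day with @anna https://example.org/x but traffic was BAD"
    print(preprocess(post))
    print(dict(code_text(post)))
```

A dictionary-based count of this kind only screens texts and codes surface words; it does not capture context, irony, or the subjective meanings that a human coder would assign, which is consistent with the limitation noted at the end of the abstract.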