Online reactions to the 2017 'Unite the Right' rally in Charlottesville: measuring polarization in Twitter networks using media followership

We study the Twitter conversation following the August 2017 'Unite the Right' rally in Charlottesville, Virginia, using tools from network analysis and data science. We use media followership on Twitter and principal component analysis (PCA) to compute a 'Left'/'Right' media score on a one-dimensional axis to characterize nodes. We then use these scores, in concert with retweet relationships, to examine the structure of a retweet network of approximately 300,000 accounts that communicated with the #Charlottesville hashtag. The retweet network is sharply polarized, with an assortativity coefficient of 0.8 with respect to the sign of the media PCA score. Community detection using two approaches, a Louvain method and InfoMap, yields largely homogeneous communities in terms of Left/Right node composition. When comparing tweet content, we find that tweets about 'Trump' were widespread in both the Left and Right, though the accompanying language was unsurprisingly different. Nodes with large degrees in communities on the Left include accounts that are associated with disparate areas, including activism, business, arts and entertainment, media, and politics. Support of Donald Trump was a common thread among the Right communities, connecting communities with accounts that reference white-supremacist hate symbols, communities with influential personalities in the alt-right, and the largest Right community (which includes the Twitter account FoxNews).


Introduction
On 11-12 August 2017, a 'Unite the Right' rally was held in Charlottesville, Virginia, USA in the context of the removal of Confederate monuments from nearby Emancipation Park. Attendees at the rally included members of the 'alt-right', white supremacists, neo-Nazis, and members of other far-right extremist groups [1]. Violent clashes between protesters and counter-protesters ensued. A prominent event amidst these clashes was the death of Heather Heyer when a rally attendee rammed his car into a crowd of counter-protesters [2]. In the aftermath, President Donald Trump stated that there were 'very fine people on both sides' [3]. White supremacists were galvanized by Trump's response, with one former leader stating that the president's comments marked "the most important day in the White nationalist movement" ([4], p. 61). Reactions to the removal of Confederate statues, the violence at the rally, and President Trump's controversial response generated vigorous debate across the United States.
In the present paper, we examine the structure of the online conversation surrounding the August 2017 events in Charlottesville as a case study of applying a network-science lens to the study of polarization in online communication. Using tools from network analysis and data science, we examine Twitter data, from communication following the 'Unite the Right' rally, that includes the hashtag #Charlottesville. Our specific objectives are to (1) present a simple approach for characterizing Twitter accounts based on their online media preferences; (2) use this characterization to examine the extent of polarization in the Twitter conversation about Charlottesville; (3) evaluate whether key accounts were particularly influential in shaping this discussion; (4) identify natural groupings (in the form of network 'communities') of accounts based on their Twitter interactions; and (5) characterize these communities in terms of their account composition and tweet content.
Social media platforms are important mechanisms for shaping public discourse, and data analysis of social media is a large and rapidly growing area of research [5]. It has been estimated that almost two-thirds of American adults use social media networking sites [6], with even higher usage among certain subsets of the population (such as activists [7] and college students [6]). Online forums and social media platforms are also significant mechanisms for communication, dissemination, and recruitment for various types of ethnonationalist and extremist groups [4]. Twitter, in particular, has been a key platform for white-supremacist efforts to shape public discourse on race and immigration ([4], p. 64).
The study of how the internet and social media platforms affect public discourse is an extensive research area [32][33][34]. In principle, social media and online news consumption have the potential to increase exposure to disparate political views [35]. However, in practice, they instead often serve as filter bubbles [36,37] and echo chambers [33,34]; and they thus potentially heighten polarization. Several previous studies have examined political homophily in Twitter networks [8,10,29,38,39], including with analysis based on tweet content and followership of political accounts [10]. We also examine political polarization using Twitter data, but we take a different approach: we focus on the homophily of media preferences on Twitter. Specifically, we examine media followership on Twitter and perform principal component analysis (PCA) [40] to calculate a scalar measure of media preference. We then use this scalar measure to characterize accounts in our Charlottesville Twitter data set. To study homophily, we examine assortativity of this scalar quantity for accounts that are linked by one or more retweets.
The influence of Twitter accounts on shaping content propagation and online discourse depends on many factors, including the number of 'followers' (accounts who subscribe to a given account's posts, which then appear in their feed), community structure and other aspects of network architecture [26], account characteristics such as tweet activity [11], and specific tweet content [41]. One can calculate 'centrality' measures [42] to identify important nodes in a Twitter network. There are many notions of centrality, including degree, PageRank [43], betweenness [44], hyperlink-induced topic search (HITS, which allows the examination of both hubs and authorities) [45], and more. In the context of our study, it is also useful to keep in mind that some structural features are particular to Twitter networks, and these may influence which centrality measures are most appropriate to consider. Prominent examples of such features include asymmetry between the numbers of followers and accounts being followed for many accounts [41], automated accounts ('bots') that may retweet at very high frequencies [46], and heterogeneous retweeting properties across different accounts [9]. The importance of such features has also led to the development of Twitter-specific centrality measures [9,47,48]. We examine a variety of different measures of centrality for the #Charlottesville retweet network to identify important accounts both for generating novel content and for spreading existing content.
Community detection, in which one tries to find dense sets (called 'communities') of nodes that are connected sparsely to other dense sets of nodes, is another approach that can give insights into network structure (especially at large scales) [49,50]. Communities in a network can influence dynamical processes, such as content propagation [26,51,52]. Investigating community structure and other large-scale network structures can be very useful for the study of online social networks, as some accounts are anonymous and demographic data may be incomplete or of questionable validity. Community detection yields tightly-knit groupings of accounts that can help reveal what segments of the population are engaged in a conversation on Twitter. One can then examine such groupings, in conjunction with other tools from network analysis, to characterize communities in terms of structural network properties (e.g., distributions of degree or other centrality measures) and/or metadata (e.g., profile information), identify influential accounts within communities, and study dynamical processes on a network (such as how content propagates both within and between communities [26]).
In the present paper, we combine community detection with analysis of tweet content within and across communities. Previous studies have reported differences in language between online communities [53]. Such differences can help reveal differences in demography, political affiliation, and views on specific topics [8,10,54]. For example, the 'linguistic framing' of issues such as immigration can help reveal political orientations and agendas [55,56], and changes in language over time can reflect political movements and influence campaigns [57]. We combine community detection with tweet content analysis to compare subsets of the Twitter population who participated in the #Charlottesville conversation by characterizing them based on the language in different communities for describing both the broader conversation topic (namely, #Charlottesville) and specific subtopics (e.g., 'Trump').
Our paper proceeds as follows. In Section 2, we briefly discuss our Twitter data collection and cleaning. In Section 3, we discuss the media preferences that we infer from our Twitter data. In Section 4, we examine the structure (in terms of both centrality measures and large-scale community structure) of a network of retweet relationships that we construct from these data. In Section 5, we examine the media-preference assortativity of nodes in the network. In Section 6, we compare the content of tweets from nodes on the 'Left' (specifically, nodes with a negative media-preference score) and those on the 'Right' (specifically, nodes with a positive media-preference score). In Section 7, we conclude and discuss our results.

Data collection
We collected tweets with the hashtag #Charlottesville and the follower lists for 13 media organizations using Twitter's API and the Python package tweepy. Public data accessibility through Twitter's API has greatly facilitated research studies on Twitter data, but such data have important limitations [5,13], including potential biases due to Twitter's proprietary API sampling scheme [13]. For example, Morstatter et al. [31] illustrated that the API can produce artifacts in topical tweet volume, potentially resulting in misleading changes in the number of tweets on a given topic over time. In our analysis, we do not consider changes in tweet volume over time; instead, we examine features of the data after aggregating over a collection-time window. Tüfekci [5] discussed several potential issues with hashtag sampling, including different hashtag usages across different groups and discontinuation of a given hashtag once the corresponding topic has been established. (This latter phenomenon is called 'hashtag drift' [58].) We collected the tweets that we study from a six-day period from shortly after the 'Unite the Right' rally; this should lessen the potential for hashtag drift. As was pointed out by Tüfekci [5], hashtag sampling draws from accounts that choose to tweet a given hashtag, and this necessarily entails biases. Nevertheless, hashtag sampling is able to provide valuable insights on the shape of online conversations. For example, we can use the collected data to examine what types of accounts chose to post tweets about #Charlottesville. It is known, for example, that the extent to which 'peripheral' accounts engage in online conversations about social protest can be an important factor for content propagation on Twitter [11].
Our data collection is in accord with the Twitter Terms of Service and Developer Agreement. To protect user privacy, we include account names (i.e., "handles") only for Twitter-verified accounts and Twitter accounts that belong to organizations. As described by Twitter, "an account may be verified if it is determined to be an account of public interest" [59].

Tweets about #Charlottesville
We used Twitter's search API to sample 486,894 publicly available tweets that include the hashtag #Charlottesville and were sent by 270,975 unique accounts between 16 August 2017 and 21 August 2017. Our data include account name (i.e., "handle"), time and date in coordinated universal time (UTC), and tweet content. In UTC, the earliest tweet date is 2017-08-16 22:16:21, and the latest tweet date is 2017-08-20 01:48:00. We performed our data acquisition using the Python package tweepy.

Media followership
In December 2016, we used the Twitter API to acquire the complete lists of Twitter users who follow the following 13 media accounts: BreitbartNews, DRUDGE_REPORT, FiveThirtyEight, FoxNews, MotherJones, NPR, NRO, WSJ, csmonitor, dailykos, theblaze, thenation, and washingtonpost. At the time of access, these media accounts had significant Twitter followings, ranging from 62,078 followers (csmonitor) to more than 12 million followers (WSJ); and they include both sources that studies have identified as preferred by conservative readers and sources that studies have identified as preferred by liberal readers [60,61].

Twitter media preferences
Of the Twitter accounts in our #Charlottesville data set, 99,412 accounts followed at least 1 of the 13 media sources at the time (December 2016) that we accessed the media follower lists. Restricting to these accounts gives a 99,412 × 13 media-choice matrix M of 0 entries (not following) and 1 entries (following). We perform principal-component analysis (PCA) on M , and we highlight the first three components in Table 1.
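As a minimal sketch of this step, the following Python code computes the first principal component of a small, made-up 0/1 media-choice matrix by power iteration on the sample covariance matrix. The toy matrix `M_toy`, the function name `first_principal_component`, and the use of power iteration (rather than a library PCA routine) are illustrative assumptions; the text does not state which software was used for the PCA.

```python
import math

def first_principal_component(M, n_iter=500):
    """Return (loadings, scores) for the first principal component of M.

    M is a list of rows (accounts) with 0/1 entries (media sources).
    Assumption: power iteration on the sample covariance matrix; a
    library PCA routine would normally be used for a large matrix.
    """
    n, d = len(M), len(M[0])
    # Center each column (media source) at zero mean.
    means = [sum(row[j] for row in M) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in M]
    # Sample covariance matrix (d x d).
    C = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1)
          for j in range(d)] for i in range(d)]
    # Power iteration from a non-symmetric starting vector.
    v = [float(j + 1) for j in range(d)]
    for _ in range(n_iter):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The sign of a principal component is arbitrary; fix it so that
    # the first loading is non-negative.
    if v[0] < 0:
        v = [-x for x in v]
    # Project each centered row onto the component to get its score.
    scores = [sum(X[k][j] * v[j] for j in range(d)) for k in range(n)]
    return v, scores

# Hypothetical data: 6 accounts, 4 media sources. Accounts 0-2 follow
# only sources 0-1, and accounts 3-5 follow only sources 2-3.
M_toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
loadings, scores = first_principal_component(M_toy)
```

In this toy example, the two groups of sources receive loadings of opposite sign, and the sign of each account's score separates the two groups of accounts.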
We interpret the first component as encoding liberal versus conservative media preference, as reflected by the signs of the entries of this component. Specifically, media accounts with a positive first component seem to correspond to accounts that previous studies have found to have a conservative slant (and to be preferred by individuals who identify as conservative), whereas accounts with a negative first component correspond predominantly to accounts that studies have concluded to have a liberal slant and/or are preferred by liberals [60,61]. The sign of the first principal component is also consistent with conventional wisdom about liberal versus conservative leanings of these media accounts, with the exception of The Wall Street Journal (WSJ), which is widely considered to be conservative-leaning [33] but has a negative first component in our PCA. However, our findings are consistent with previous studies that, based on readership and co-citations, grouped The Wall Street Journal with liberal media organizations [33,62,63], even though it has also been characterized as politically conservative [60]. Although the sign of the first component has a clear interpretation, the magnitudes of these entries do not appear to provide an intuitive ordering (for example, with respect to a hand-curated media bias chart [64]) on the liberal-conservative spectrum.
In the rest of the paper, we focus on the value of the first principal component; for simplicity, we use the term 'media PCA score' to refer to this score. Positive values for this score indicate followership of the media accounts that we show in red, whereas negative values indicate followership of accounts that we show in black (see Table 1). To frame our discussion, we refer to nodes with a positive media PCA score as nodes on the 'Right' and to those with a negative media PCA score as being on the 'Left', although we note that we have not validated this measure as an indicator of political belief or affiliation. Our approach is similar to that of Bail et al. [32], who applied PCA to followership of a large set of 'opinion leaders' to assess political orientation.

Network structure
Let $\tilde{G}$ denote our retweet network, which is a weighted, directed graph with weighted adjacency matrix $\tilde{A}$, where $\tilde{A}_{ij}$ denotes the number of times that node j retweeted node i. The graph $\tilde{G}$ has 238,892 nodes, 365,589 edges (ignoring weights), and 389,736 retweets. We focus on G, the largest connected component of $\tilde{G}$ when we ignore directionality (so it is $\tilde{G}$'s largest weakly connected component). The graph G has 221,137 nodes, 353,548 edges (ignoring weights), and 376,978 retweets. Let A denote the weighted adjacency matrix for G. In all cases, weights represent multi-edges.

Degree distribution
Let the out-degree of node k correspond to the total number of retweets posted by node k, and let the in-degree of k correspond to the total number of times that node k was retweeted. Unless we specifically note otherwise, we include weights when calculating the in-degrees and out-degrees (i.e., we count all edges in a multi-edge). For example, $\sum_{i=1}^{n} A_{ij}$ gives the out-degree of node j, and $\sum_{j=1}^{n} A_{ij}$ gives the in-degree of node i. In Figure 1, we show the in-degree and out-degree distributions for G. The two distributions are rather different from one another, as the in-degree distribution has a much longer tail (corresponding to a few accounts that were retweeted very heavily).
Figure 1: In-degree represents the number of times that a node was retweeted, and out-degree represents the number of times that a node sent a retweet. The two distributions differ from each other, with the in-degree distribution having a longer tail (corresponding to a few accounts that were retweeted very heavily).
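The weighted in-degree and out-degree definitions can be sketched in a few lines of Python. The account names and retweet counts below are hypothetical, and storing the network as a dictionary of (retweeter, retweeted) pairs is an illustrative choice rather than the representation used in the paper.

```python
from collections import defaultdict

# Hypothetical toy data: (retweeter j, retweeted account i) -> number of retweets.
# This corresponds to the entry A_ij of the weighted adjacency matrix.
retweets = {
    ("bob", "alice"): 2,
    ("carol", "alice"): 1,
    ("carol", "bob"): 1,
}

def weighted_degrees(retweets):
    """Return (in_degree, out_degree) dictionaries, counting multi-edges."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for (j, i), weight in retweets.items():
        out_deg[j] += weight  # node j sent `weight` retweets
        in_deg[i] += weight   # node i was retweeted `weight` times
    return dict(in_deg), dict(out_deg)

in_deg, out_deg = weighted_degrees(retweets)
```

Because every retweet contributes once to an in-degree and once to an out-degree, the two totals agree, which is why the mean in-degree and mean out-degree of a network coincide.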
In Figure 2a, we show the in-degrees for the twenty most heavily retweeted accounts. The mean in-degree is 1.70, and the standard deviation is 69.22, indicating extreme heterogeneity in the number of times retweeted. The account (RepCohen) with the largest in-degree was retweeted 16,180 times. By contrast, 208,241 nodes (i.e., 94% of them) in G were never retweeted at all. We also observe heterogeneity in the out-degree, but it is much less extreme than for in-degree, as the standard deviation is 4.89. (By definition, the mean in-degree and mean out-degree are the same, as every edge has both an origin and terminus in G.) The account with the largest out-degree sent 141 retweets in our data set. By contrast, 7,852 accounts had an out-degree of 0; these accounts were retweeted, but they did not retweet any accounts. In Figure 2b, we show the twenty accounts that sent the most retweets.
We also consider the in-degree and out-degree distributions for accounts with and without media PCA scores to examine whether there are systematic differences between the two types of accounts. The heterogeneity for the in-degree distribution that we observed when examining all nodes in G is also present when we consider the in-degree distribution separately for nodes with and without media PCA scores; the standard deviation is 105.24 for nodes with media PCA scores and 36.65 for nodes without them. The mean in-degree for nodes with a media PCA score is larger than for nodes without one (2.85 versus 1.08). Nodes with large in-degree with media PCA scores include DineshDSouza, pastormarkburns, RepCohen, wkamaubell, johncardillo, and many others. However, there are also some heavily retweeted nodes (such as larryelder, TheNormanLear, and NancyPelosi) that do not follow any of the 13 media accounts that we used for computing media PCA scores. We thus cannot compute media PCA scores for these nodes.
Figure 2: The twenty accounts that (a) were retweeted most heavily and (b) sent the most retweets. In each case, we also show the corresponding in-degrees and out-degrees, respectively. The largest in-degrees are much larger than the largest out-degrees, although the vast majority (94%) of nodes were never retweeted at all. We show the account names (i.e., handles) for verified accounts on the vertical axes. Blank labels correspond to accounts that are not verified. Note that the majority of accounts in panel (a) correspond to verified accounts, whereas none of the nodes that sent the most retweets (i.e., the accounts in panel (b)) are verified accounts.

Centralities
We now examine important accounts by computing several centrality measures. We start with degree (i.e., degree centrality), the simplest way of trying to measure a node's importance. In Figure 2, we show the twenty nodes with the largest in-degrees and the twenty nodes with the largest out-degrees. These two sets are disjoint, indicating that the nodes that generated most of the original content in the Twitter conversation about #Charlottesville were distinct from those that were most active in promoting existing content through retweets. Degree is a local centrality measure that does not take into account any characteristics of neighboring nodes. For comparison, we also calculate two additional widely-used centrality measures, PageRank [43] and HITS [45], that take some non-local information into account. PageRank corresponds to the stationary distribution of a random walk on a network that combines transitions according to network structure and 'teleportation' according to a user-supplied distribution [65], with a parameter that determines the relative weightings of these two processes. We compute PageRank with standard uniform-at-random teleportation using Matlab's centrality function with the default damping factor of 0.85 (so teleportation occurs for 15% of the steps in the associated random walk). In the left column of Figure 3, we list the twenty most central nodes according to PageRank. Nine of these nodes are also on our list of nodes with the largest in-degrees. An exception is harikondobalu, which was retweeted only 38 times in our data set. The large PageRank value for harikondobalu, despite its small in-degree, reflects the fact that harikondobalu was one of only two nodes that were retweeted by wkamaubell, which was retweeted 8,582 times.
Figure 3: The twenty most central nodes according to PageRank (left), authority score (center), and hub score (right). Color corresponds to the mean media PCA score for the community assignment of each node from modularity maximization using a Louvain method (see Section 4.3). Nodes that appear in more than one column are linked by colored lines. There is overlap between the PageRank and Authority nodes, but these two sets are disjoint from the Hub accounts. All of the leading hubs belong to communities with negative (i.e., Left) mean media PCA scores.
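To illustrate how an account with a small in-degree can nevertheless obtain a large PageRank value, the following Python sketch runs a plain power iteration with uniform teleportation and a damping factor of 0.85 on a hypothetical toy network. The account names, the function `pagerank`, and the uniform redistribution of probability from nodes without out-edges are assumptions made for illustration; this is not a reproduction of Matlab's centrality function.

```python
def pagerank(edges, damping=0.85, n_iter=200):
    """PageRank by power iteration.

    edges maps (retweeter, retweeted) -> weight; the random walker moves
    from a retweeter to the accounts that it retweeted.
    """
    nodes = sorted({u for edge in edges for u in edge})
    n = len(nodes)
    out_strength = {u: 0.0 for u in nodes}
    for (src, _), w in edges.items():
        out_strength[src] += w
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(n_iter):
        # Assumption: probability on nodes with no out-edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out_strength[u] == 0)
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for (src, dst), w in edges.items():
            new[dst] += damping * rank[src] * w / out_strength[src]
        rank = new
    return rank

# Hypothetical toy network: five accounts retweet "star", which in turn
# retweets only "protege" and "other". "minor" is retweeted twice, but
# only by low-ranked accounts.
toy_edges = {
    ("f1", "star"): 1, ("f2", "star"): 1, ("f3", "star"): 1,
    ("f4", "star"): 1, ("f5", "star"): 1,
    ("f1", "minor"): 1, ("f2", "minor"): 1,
    ("star", "protege"): 1, ("star", "other"): 1,
}
rank = pagerank(toy_edges)
```

In this toy network, "protege" has a smaller in-degree than "minor" but a larger PageRank value, because its single retweet comes from the most central account. This is the same mechanism as the one described above for harikondobalu.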
Hub and authority centralities [45] are another useful set of centrality measures. Using the HITS algorithm, one can simultaneously examine hubs and authorities. As discussed in [45], a good hub tends to point to good authorities, and a good authority tends to have good hubs that point to it. In the context of retweeting, we expect that accounts with large authority scores tend to be retweeted by accounts with large hub scores, and we expect that good hub accounts tend to retweet accounts that are good authorities. As in PageRank, the importances of adjacent nodes influence a node's hub and authority scores. Under our convention that the (i, j) entry of a graph's adjacency matrix corresponds to the edge weight from j to i, hub and authority scores correspond, respectively, to the principal right eigenvectors of $A^T A$ and $A A^T$. We compute hubs and authorities using Matlab's centrality function.
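The mutual reinforcement between hubs and authorities can be sketched as an alternating power iteration. In the Python code below, the toy edge list, the function name `hits`, and the normalization of scores to sum to 1 are illustrative assumptions. With the convention that an edge points from retweeter j to retweeted account i, the authority update multiplies the hub vector by A and the hub update multiplies the authority vector by the transpose of A.

```python
def hits(edges, n_iter=100):
    """Hub and authority scores by alternating power iteration.

    edges maps (retweeter j, retweeted i) -> weight A_ij.
    """
    nodes = sorted({u for edge in edges for u in edge})
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(n_iter):
        # Authority update: a_i = sum_j A_ij * h_j.
        new_auth = {u: 0.0 for u in nodes}
        for (j, i), w in edges.items():
            new_auth[i] += w * hub[j]
        # Hub update: h_j = sum_i A_ij * a_i.
        new_hub = {u: 0.0 for u in nodes}
        for (j, i), w in edges.items():
            new_hub[j] += w * new_auth[i]
        # Normalize so that each set of scores sums to 1 (a choice of scale).
        a_total = sum(new_auth.values())
        h_total = sum(new_hub.values())
        auth = {u: x / a_total for u, x in new_auth.items()}
        hub = {u: x / h_total for u, x in new_hub.items()}
    return hub, auth

# Hypothetical toy network: h1, h2, and h3 retweet; a1 and a2 are retweeted.
toy_edges = {
    ("h1", "a1"): 1, ("h1", "a2"): 1,
    ("h2", "a1"): 1, ("h2", "a2"): 1,
    ("h3", "a1"): 1,
}
hub, auth = hits(toy_edges)
```

Combining the two updates shows that the authority vector is repeatedly multiplied by $A A^T$ and the hub vector by $A^T A$, which matches the eigenvector characterization above.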
We list the twenty nodes with the largest authority and hub scores, respectively, in the center and right columns of Figure 3. Color indicates the mean media PCA scores for the community assignment of each account from modularity maximization using a Louvain method ([66][67][68]; see Section 4.3). Only two of the nodes among the top twenty authorities are in communities with positive (i.e., Right) media PCA scores; these accounts, pastormarkburns and DineshDSouza, belong to two prominent conservative personalities. Neither pastormarkburns nor DineshDSouza was ever retweeted by any of the top 50 hubs. By contrast, all of the other authorities were retweeted at least 3 times by the leading hubs. The hub scores for all nodes have a bimodal distribution, with a clear separation between the nodes with small and large values (e.g., using $4 \times 10^{-5}$ as a threshold hub score). We refer to nodes with hub scores that are larger than $4 \times 10^{-5}$ as 'large hub-score nodes'. Consider the set of nodes that retweeted DineshDSouza. Of these, the fraction that are large hub-score nodes is $9.0 \times 10^{-4}$. The fraction of nodes that retweeted pastormarkburns that are large hub-score nodes is $1.0 \times 10^{-3}$. For comparison, the fraction of nodes that retweeted itsmikebivins that are large hub-score nodes is $1.3 \times 10^{-2}$. A few other examples of such fractions are 0.05 for wkamaubell, 0.15 for tribelaw, and 1 for RepCohen.
As is standard for hub and authority scores, there are two qualitatively different ways for a node to have a large authority score: it can either be retweeted many times (e.g., DineshDSouza), or it can be retweeted by nodes with large hub scores (e.g., itsmikebivins). Both of the large-authority Right accounts (DineshDSouza and pastormarkburns) lie in the former category. Figure 3 also allows us to compare important accounts according to different centrality measures. As one can see in Figure 3, there is some overlap between the top-PageRank and top-authority accounts. Note, however, that fewer than half of the top-PageRank accounts are also among the top-authority accounts. By comparison, the set of top hubs is disjoint from the top-PageRank and top-authority accounts in Figure 3. Additionally, more than half of the top-PageRank and top-authority accounts in Figure 3 are verified accounts, whereas none of the top-hub accounts are verified.

Community structure
To examine large-scale structure in the #Charlottesville retweet network, we use community detection to identify tightly-knit sets (so-called 'communities') of accounts with relatively sparse connections between these sets [49,50]. There exist numerous methods for community detection. In our investigation, we employ two widely-used methods: modularity maximization [67,68] and InfoMap [69].

Modularity maximization
The modularity of a particular assignment of a network's nodes into communities measures the amount of intra-community edge weight, relative to what one would expect at random under some null model [67,68]. Modularity maximization then treats community detection as an optimization problem by seeking an assignment of nodes into communities that maximizes the modularity objective function. A version of modularity for weighted, directed graphs is [70,71]

$$Q = \frac{1}{w} \sum_{i,j} \left( A_{ij} - \gamma \frac{w_i^{\mathrm{in}} w_j^{\mathrm{out}}}{w} \right) \delta(C_i, C_j), \tag{4.1}$$

where $w = \sum_{i,j} A_{ij}$ is the sum of all edge weights in a network; $w_k^{\mathrm{in}}$ and $w_k^{\mathrm{out}}$ are the in-strength (i.e., a weighted generalization of in-degree) and out-strength (i.e., weighted out-degree), respectively, of node k; the community assignment of node k is $C_k$; the quantity $\delta$ is the Kronecker delta; and $\gamma$ is a resolution parameter that controls the relative weight given to the null model [72]. Our null-model matrix has entries $w_i^{\mathrm{in}} w_j^{\mathrm{out}} / w$, so this null model is a type of configuration model [73], in which we preserve expected in-strength and expected out-strength but otherwise randomize connections [49]. For simplicity, we use the resolution-parameter value $\gamma = 1$.
To maximize Q, we use a variant [74] (which is implemented in Matlab and was released originally in conjunction with [75]) of the locally-greedy Louvain algorithm [66]. To use the GenLouvain code in [74], we symmetrize the modularity matrix B, where $B_{ij} = A_{ij} - \gamma w_i^{\mathrm{in}} w_j^{\mathrm{out}} / w$. As discussed in [71], this is distinct from symmetrizing the adjacency matrix A.
Modularity maximization using GenLouvain yields 228 communities, which range in size from 2 nodes to 47,321 nodes.
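As a small worked example of the modularity calculation, the following Python sketch builds the directed modularity matrix B for a hypothetical four-node network and evaluates Q for a given partition. The toy adjacency matrix, the function names, and the symmetrization (B + B^T)/2 are assumptions for illustration (the text does not spell out the symmetrization formula), and the code only evaluates Q; it does not implement the Louvain heuristic itself.

```python
def modularity_matrix(A, gamma=1.0):
    """Return (B, w) with B_ij = A_ij - gamma * w_in_i * w_out_j / w.

    A[i][j] is the weight of the edge from node j to node i.
    """
    n = len(A)
    w = sum(sum(row) for row in A)
    w_in = [sum(A[i][j] for j in range(n)) for i in range(n)]   # row sums
    w_out = [sum(A[i][j] for i in range(n)) for j in range(n)]  # column sums
    B = [[A[i][j] - gamma * w_in[i] * w_out[j] / w for j in range(n)]
         for i in range(n)]
    return B, w

def symmetrize(B):
    """Assumed symmetrization of the modularity matrix: (B + B^T) / 2."""
    n = len(B)
    return [[(B[i][j] + B[j][i]) / 2 for j in range(n)] for i in range(n)]

def modularity(B, w, communities):
    """Q = (1/w) * sum of B_ij over pairs of nodes in the same community."""
    n = len(B)
    return sum(B[i][j] for i in range(n) for j in range(n)
               if communities[i] == communities[j]) / w

# Hypothetical toy network with two groups, {0, 1} and {2, 3}.
A_toy = [
    [0, 3, 0, 0],
    [2, 0, 1, 0],
    [0, 0, 0, 4],
    [0, 0, 2, 0],
]
B, w = modularity_matrix(A_toy)
Q_split = modularity(B, w, [0, 0, 1, 1])
Q_single = modularity(B, w, [0, 0, 0, 0])
Q_split_sym = modularity(symmetrize(B), w, [0, 0, 1, 1])
```

Here, 11 of the 12 units of edge weight lie within the two groups and the null model expects 6, so Q = (11 - 6)/12 = 5/12 for the two-group partition. Placing all nodes in one community gives Q = 0 for a resolution parameter of 1. Symmetrizing B in this way leaves Q unchanged for any fixed partition, because the Kronecker delta is symmetric in its two arguments.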

InfoMap
InfoMap is a community-detection method that is based on the flow of random walkers on graphs [69]. The intuition for communities in methods based on random walks is that a random walker tends to be trapped for long periods of time within tightly-knit sets of nodes [50]. Rosvall and Bergstrom [69] made this idea concrete by trying to minimize the expected description length of a random walk. For example, one can obtain a concise description of a random walk by allowing node names to be reused between communities. One can apply InfoMap to weighted, directed graphs; and it has been used previously to study Twitter data [26]. To study a directed graph, one introduces a teleportation parameter (as in PageRank); we use the default teleportation value of τ = 0.15 [69].
Our implementation uses code from [76]. With InfoMap, we find 205 communities, which range in size from 1 node to 122,504 nodes.

Large-scale structure of the retweet network
Several features are evident in our community-detection results from both modularity maximization and InfoMap: (1) communities are largely segregated by media PCA score; (2) overall, the communities skew towards the Left; and (3) most of the nodes on the Right are assigned to a large community that includes prominent right-wing personalities and FoxNews.
To examine the relationship between community structure and Left/Right media preference, we compute the mean media PCA score within each community. The proportion of communities with at least one node with a media PCA score is very similar under modularity maximization (204/228; 89%) and InfoMap (183/205; 89%). We also examine the extent of overlap of Left and Right accounts within communities by computing the Shannon diversity index [77] for each community. This index is given by

$$H_k = -\sum_{i=1}^{2} p_i^k \ln p_i^k,$$

where $H_k$ is the Shannon diversity index for community k, and $p_1^k$ and $p_2^k$ (with $p_1^k + p_2^k = 1$ for each k) are the fractions of accounts in community k with Left and Right media preferences, respectively. In Figure 4, we show the Shannon diversity scores versus mean media PCA scores for the communities that we detect using modularity maximization and InfoMap.
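The Shannon diversity index for a single community can be sketched as follows. The function name and the example counts are hypothetical, and the use of the natural logarithm is an assumption (a different base only rescales the index). A term with a fraction of zero contributes nothing to the sum.

```python
import math

def shannon_diversity(n_left, n_right):
    """Shannon diversity of a community from its Left and Right node counts.

    Only nodes with a media PCA score are counted. Assumption: natural log.
    """
    total = n_left + n_right
    if total == 0:
        raise ValueError("community has no nodes with a media PCA score")
    h = 0.0
    for count in (n_left, n_right):
        if count > 0:  # the limit of p * ln(p) as p -> 0 is 0
            p = count / total
            h -= p * math.log(p)
    return h

# Hypothetical communities: evenly mixed, strongly Left, and entirely Right.
h_mixed = shannon_diversity(50, 50)
h_skewed = shannon_diversity(91, 9)
h_pure = shannon_diversity(0, 40)
```

With two types of nodes, the index ranges from 0 for a homogeneous community to ln 2 (approximately 0.69) for an evenly mixed community.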
Both community-detection methods yield a predominantly unimodal shape for PCA score diversity versus mean media PCA score, with more extreme mean media PCA scores associated with lower diversity within a community. Communities with 'centrist' mean media PCA scores (i.e., ones that are near 0) have relatively small sizes. By contrast, the largest communities tend to have mean media scores that are farther from 0, and they have small Shannon diversity. For example, InfoMap gives two communities that are much larger than the others. One is on the Left (with 122,504 nodes and a mean media PCA score of −0.43), and the other is on the Right (with 58,185 nodes and a mean media PCA score of 0.74). In the largest Left community, 91% of the nodes have negative media PCA scores, compared with 6% in the largest Right community. Similarly, the large communities from modularity maximization also have little Left/Right node diversity within communities. Another prominent feature is that both community-detection approaches yield one community on the Right that is much larger than other communities with a positive mean media PCA score. Furthermore, the two methods are similar in terms of large-degree accounts in the largest Right community from each method. Specifically, the five nodes with largest in-degrees and out-degrees are the same, with DineshDSouza, pastormarkburns, larryelder, johncardillo, and FoxNews as the five most heavily retweeted accounts (i.e., the ones with the largest in-degrees) in the community. Figure 4 also suggests that there are more Left-leaning communities than Right-leaning ones. For example, 106/130 (i.e., about 82%) of the InfoMap communities with at least ten nodes have negative mean media PCA scores. Modularity maximization gives a bimodal distribution of community sizes, and we use a community size of 100 to distinguish between 'small' and 'large' communities. Of the large modularity-maximization communities, 76/93 (i.e., 82%) have negative mean media PCA scores. For comparison, we have PCA scores for 78,339 nodes, and 44,797 of them (about 57%) have a negative first PCA score.

Finer features of the retweet network
A difference between the two methods is that two large communities dominate for InfoMap (one each on the Left and Right), whereas modularity maximization yields a partition of the network into many more communities. We now examine some of these details, focusing specifically on moderate to large communities from modularity maximization.
Modularity maximization yields 41 communities with at least 1,001 nodes. To further characterize these 41 communities, we examine the accounts with largest in-degree (i.e., the ones that are retweeted the most) within each community and characterize these nodes by hand from their profiles and, when available (e.g., when account owners are known public personalities), information about the owners of these accounts. More than 85% (specifically, 35 of 41) of these communities have negative (i.e., Left-leaning) mean media PCA scores. The accounts with the largest in-degrees in these 35 communities include activists (e.g., Everytown, IndivisibleTeam, UNHumanRights, and womensmarch), businesses (e.g., benandjerrys), people from arts and entertainment (e.g., jk_rowling, LatuffCartoons, FallonTonight, ladygaga, Sethrogen, TheNormanLear, and wkamaubell), journalists (e.g., AmyKNelson), media organizations (e.g., AJEnglish, CBSThisMorning, and HuffPostCanada), and politicians (e.g., NancyPelosi, RepCohen, and JoeBiden). By comparison, only six of the largest communities have positive (i.e., Right-leaning) mean media PCA scores. The largest of these (with 47,321 nodes) includes opinion leaders on the Right (e.g., DineshDSouza, pastormarkburns, and larryelder) and FoxNews, as we discussed previously. Another community has a mean PCA score close to 0 (specifically, it is 0.086), and it appears to be a business-oriented community with tweets that are critical of Donald Trump. Two of the remaining four communities with positive media scores are Right-oriented activist communities. One activist community has 3,987 nodes, and one of its accounts of largest in-degree (i.e., that is retweeted very heavily) references an influential alt-right account [78]. The other activist community has 2,710 nodes, and one of its most retweeted accounts references a well-known white supremacist hate symbol in its handle. A third community appears to be a media community with foreign media personalities (e.g., KTHopkins), and the final community of these four is a community that is dominated by accounts that tweet in German.
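As a rough illustration of this kind of community summary, the following minimal sketch computes each community's size, its mean media PCA score, and its most retweeted members. The function name, the toy data, and the convention that an edge is a (retweeter, retweeted) pair are assumptions for illustration only. The sketch counts in-degree over the whole network, which may differ from the within-community counting used in our analysis.

```python
from collections import Counter, defaultdict


def summarize_communities(edges, community, score, top_k=3):
    """Summarize each community by size, mean score, and most retweeted members.

    edges     : iterable of (retweeter, retweeted) pairs (assumed convention)
    community : dict mapping node -> community label
    score     : dict mapping node -> media PCA score (may omit some nodes)
    """
    # In-degree = number of listed edges that point at a node.
    in_degree = Counter(retweeted for _, retweeted in edges)

    members = defaultdict(list)
    for node, label in community.items():
        members[label].append(node)

    summary = {}
    for label, nodes in members.items():
        # The mean is taken only over nodes that have a score.
        scored = [score[n] for n in nodes if n in score]
        mean_score = sum(scored) / len(scored) if scored else None
        top = sorted(nodes, key=lambda n: in_degree[n], reverse=True)[:top_k]
        summary[label] = {
            "size": len(nodes),
            "mean_score": mean_score,
            "top_in_degree": top,
        }
    return summary


# Hypothetical toy data (not from the #Charlottesville data set).
TOY_COMMUNITY = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1}
TOY_SCORE = {"a": -0.5, "b": -0.3, "x": 0.4, "y": 0.6}
TOY_EDGES = [("b", "a"), ("c", "a"), ("x", "a"), ("y", "x"), ("a", "b")]

print(summarize_communities(TOY_EDGES, TOY_COMMUNITY, TOY_SCORE, top_k=1))
```

The sign of `mean_score` then labels a community as Left-leaning (negative) or Right-leaning (positive). The characterization of the top accounts by hand is not something that this sketch automates.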

Media-preference assortativity
To examine homophily in media-preference scores in the Twitter conversation about #Charlottesville, we measure media-preference assortativity by computing the Pearson correlation coefficient of the first media PCA score for nodes in the retweet network. Specifically, we compute the correlation of the first media PCA score for dyads (i.e., nodes that are adjacent to each other via an edge) in the retweet network. We ignore edge weights, and we restrict our calculations to dyads for which we have a PCA score for both nodes. There are 93,521 such pairs.
The correlation coefficient of the first media PCA scores is ρ ≈ 0.67. For comparison, we compute the correlation coefficient distribution for 100,000 random permutations of the PCA scores of the nodes. Specifically, in each realization, we fix the network and assign the PCA scores uniformly at random to the nodes for which PCA scores were available originally. The resulting distribution for the correlation coefficient ρ appears to be approximately Gaussian, with a mean of $-1.29 \times 10^{-5}$ and a standard deviation of 0.0033. The z-score for the measured correlation coefficient of 0.67 is larger than 203, indicating that the retweet network has a statistically significant media-preference assortativity.
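The permutation comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code that we used: the function names and the toy network are hypothetical, edges are treated as (retweeter, retweeted) pairs, and far fewer permutations are drawn than the 100,000 above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dyad_correlation(edges, score):
    """Correlate scores across edge endpoints, ignoring edge weights.

    Dyads for which either node lacks a score are skipped.
    """
    pairs = [(score[u], score[v]) for u, v in edges if u in score and v in score]
    return pearson([a for a, _ in pairs], [b for _, b in pairs])


def permutation_z_score(edges, score, n_perm=1000, seed=0):
    """Return (observed correlation, z-score against a permutation null)."""
    rng = random.Random(seed)
    observed = dyad_correlation(edges, score)
    nodes = list(score)
    values = [score[n] for n in nodes]
    null = []
    for _ in range(n_perm):
        # Fix the network; reassign scores among the nodes that have scores.
        rng.shuffle(values)
        null.append(dyad_correlation(edges, dict(zip(nodes, values))))
    mean = sum(null) / n_perm
    std = math.sqrt(sum((r - mean) ** 2 for r in null) / (n_perm - 1))
    return observed, (observed - mean) / std


# Hypothetical toy network: two groups with mostly within-group retweets.
TOY_SCORE = {"a1": -0.9, "a2": -0.7, "a3": -0.5, "a4": -0.3,
             "b1": 0.2, "b2": 0.4, "b3": 0.6, "b4": 0.8}
TOY_EDGES = [("a1", "a2"), ("a2", "a3"), ("a3", "a4"), ("a4", "a1"),
             ("a1", "a3"), ("b1", "b2"), ("b2", "b3"), ("b3", "b4"),
             ("b4", "b1"), ("b2", "b4"), ("a4", "b1"),
             ("a1", "c1")]  # c1 has no score, so this dyad is ignored

print(permutation_z_score(TOY_EDGES, TOY_SCORE))
```

Shuffling the scores while holding the edges fixed preserves both the network structure and the score distribution, so the null distribution isolates the association between adjacency and score.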
We also compute the assortativity coefficient r that was introduced by Newman [79,80]. Suppose that there are g types of nodes in the network. Following [80], we calculate

$$ r = \frac{\sum_{\ell=1}^{g} e_{\ell\ell} - \sum_{\ell=1}^{g} a_\ell b_\ell}{1 - \sum_{\ell=1}^{g} a_\ell b_\ell} \,, \tag{5.4} $$

where $e_{\ell s}$ is the fraction of the total edges that emanate from a node of type $\ell$ and terminate at a node of type $s$, the quantity $a_\ell = \sum_{s=1}^{g} e_{\ell s}$ is the fraction of total edges that emanate from a node of type $\ell$, and $b_s = \sum_{\ell=1}^{g} e_{\ell s}$ is the fraction of total edges that terminate at a node of type $s$.

Table 2: Mixing matrix of the proportion of total edge weight that corresponds to edges between different types of accounts, as characterized by the sign of their media PCA score (i.e., first PCA score). Left indicates a negative media PCA score, and Right indicates a positive media PCA score. (No nodes have a media PCA score of exactly 0.) Accounts tend to mix with (i.e., be adjacent to) accounts with a PCA score of the same sign, as indicated by the larger weights on the diagonal of the matrix.
To calculate (5.4) for the retweet network, we classify nodes according to the sign of their media PCA score. In the largest weakly connected component of the retweet network, we have PCA scores for 78,339 nodes, of which 44,797 (i.e., 57% of them) have a negative first PCA score. The resulting assortativity coefficient is r ≈ 0.80. We show the mixing matrix e in Table 2. As a comparison, Newman [80] calculated an assortativity coefficient of 0.62 by ethnicity for the sexual-partner network that was described in [81].
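To make the calculation concrete, the following minimal sketch computes Newman's coefficient r from a mixing matrix of edge counts or weights between node types. The function names and the numbers in the toy matrix are hypothetical; they are not the entries of Table 2.

```python
def mixing_matrix(edges, node_type, types):
    """Count edges between node types; edges with an untyped endpoint are skipped."""
    index = {t: i for i, t in enumerate(types)}
    counts = [[0] * len(types) for _ in types]
    for u, v in edges:
        if u in node_type and v in node_type:
            counts[index[node_type[u]]][index[node_type[v]]] += 1
    return counts


def assortativity_from_mixing(e):
    """Newman's assortativity coefficient r for a g-by-g mixing matrix.

    e[l][s] is the count (or weight) of edges from a type-l node to a
    type-s node. Entries are normalized here so that they sum to 1.
    """
    g = len(e)
    total = sum(sum(row) for row in e)
    e = [[x / total for x in row] for row in e]
    a = [sum(e[l]) for l in range(g)]                        # edges leaving type l
    b = [sum(e[l][s] for l in range(g)) for s in range(g)]   # edges entering type s
    trace = sum(e[l][l] for l in range(g))
    ab = sum(a[l] * b[l] for l in range(g))
    return (trace - ab) / (1 - ab)


# Hypothetical counts for two types (Left, Right); not the values in Table 2.
TOY_COUNTS = [[45, 5],
              [5, 45]]
print(assortativity_from_mixing(TOY_COUNTS))
```

With two types defined by the sign of the media PCA score, the matrix is 2-by-2. A value of r near 1 indicates that almost all edge weight lies on the diagonal, and r = 0 corresponds to mixing that is independent of type.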
Five 4 of the media accounts that we used to compute the media PCA score also appear as nodes in the retweet network G. Of these, FoxNews was retweeted 3049 times, NPR was retweeted 69 times, MotherJones was retweeted 15 times, and csmonitor was retweeted 6 times. Removing these media accounts from G has a negligible effect on the assortativity coefficient r.
Although the assortativity by PCA score in the retweet network is rather strong, there are some prominent individual exceptions. For example, RepCurbelo and SenatorTimScott 5 , the accounts for two Republican members of Congress, were heavily retweeted in Left-leaning communities that we detected with modularity maximization. However, both RepCurbelo (0.49) and SenatorTimScott (0.12) have positive (i.e., Right) media PCA scores, consistent with their affiliation with the Republican party. RepCurbelo was the fourth-most retweeted account in a community from modularity maximization with a negative (i.e., Left) mean media PCA score (−0.32). RepCurbelo, who spoke out strongly against the events in Charlottesville [82], was retweeted by 22 accounts. We have PCA scores for 9 of these accounts, of which 4 have media PCA scores on the Left. Similarly, SenatorTimScott was the second-most retweeted account in a Left-leaning community (with a mean media PCA score of −0.26) that we obtained from modularity maximization. SenatorTimScott was retweeted by 78 accounts, and nearly half (specifically, 20 of 43) of the accounts that retweeted SenatorTimScott for which we have PCA scores have negative media PCA scores. We identified RepCurbelo and SenatorTimScott as accounts that warrant examination by first compiling the list of nodes that were retweeted by accounts with media PCA scores of the opposite sign and then examining this list for prominent accounts. One can further develop this approach (for example, to identify negative or mocking retweets [5]), and it may be useful in other situations for identifying accounts that generate communication across ideological or other divides.
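The first step of this screening procedure is simple to sketch. In the minimal example below, the function name, the account names, and the convention that an edge is a (retweeter, retweeted) pair are assumptions for illustration. The subsequent examination of the resulting list for prominent accounts is done by hand.

```python
from collections import Counter


def cross_sign_retweeted(edges, score):
    """Rank retweeted accounts by their number of opposite-sign retweeters.

    edges : iterable of (retweeter, retweeted) pairs (assumed convention)
    score : dict mapping node -> media PCA score; unscored nodes are skipped
    """
    counts = Counter()
    for retweeter, retweeted in edges:
        if retweeter in score and retweeted in score:
            # A negative product means that the two scores have opposite signs.
            if score[retweeter] * score[retweeted] < 0:
                counts[retweeted] += 1
    return counts.most_common()


# Hypothetical toy data (account names are placeholders).
TOY_SCORE = {"rep_x": 0.49, "left_1": -0.4, "left_2": -0.2,
             "right_1": 0.3, "left_hub": -0.6}
TOY_EDGES = [("left_1", "rep_x"), ("left_2", "rep_x"), ("right_1", "rep_x"),
             ("right_1", "left_hub"), ("left_1", "left_hub"),
             ("unknown", "rep_x")]  # "unknown" has no score and is ignored

print(cross_sign_retweeted(TOY_EDGES, TOY_SCORE))
```

A retweet across the sign boundary does not by itself indicate endorsement, so the output is best treated as a list of candidates for closer inspection.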

Comparison of tweet content between Left and Right
We use the Python library nltk 3.3 to tokenize tweets into words and punctuation. In Table 3, we show the twenty-five most numerous words in our data set, where we separately consider accounts with negative (i.e., Left) and positive (i.e., Right) media PCA scores after removing stop words. 6 Additionally, we do not stem the words in our data set, and we treat different capitalizations as different words in our analysis. We find some overlap between the Left and Right data sets; for example, tweets related to 'Trump' were very common regardless of media PCA score. 'Barcelona' was also one of the most numerous words in tweets that were sent by both the Left and the Right. There was a 17 August 2017 van attack in that city that killed 13 individuals (at the time of data collection) and injured more than 100 others. 7 However, there are also many differences between the two sets of words that we show in Table 3. We indicate these differences by coloring the relevant words. For example, 'Obama' was the third-most numerous word in tweets that were sent by nodes with positive media PCA scores, but it was not in the top one hundred for nodes with negative media PCA scores. 'Nazi' appeared commonly in tweets from the Left, but it did not appear often in tweets from the Right, whereas the words 'Antifa' and 'MSM' were used often by the Right but not by the Left.
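The counting procedure can be sketched as follows. To keep the example self-contained, it uses a simple regular-expression tokenizer as a stand-in for nltk's tokenizer and a tiny hypothetical stop-word list; it also assumes that stop words are matched case-insensitively and that punctuation tokens are discarded. As in our analysis, words are not stemmed and different capitalizations are counted separately.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (a stand-in for a full list).
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "is", "for", "on"}

# Stand-in tokenizer: runs of word characters; punctuation is dropped.
TOKEN = re.compile(r"\w+")


def top_words(tweets, score, k=25):
    """Return the k most numerous words for Left and Right accounts.

    tweets : iterable of (account, text) pairs
    score  : dict mapping account -> media PCA score; accounts without a
             score are skipped
    """
    counts = {"Left": Counter(), "Right": Counter()}
    for account, text in tweets:
        s = score.get(account)
        if s is None or s == 0:
            continue
        side = "Left" if s < 0 else "Right"
        for token in TOKEN.findall(text):
            # Keep the original capitalization; no stemming.
            if token.lower() not in STOP_WORDS:
                counts[side][token] += 1
    return {side: c.most_common(k) for side, c in counts.items()}


# Hypothetical toy tweets (not from the data set).
TOY_SCORE = {"u1": -0.5, "u2": -0.2, "u3": 0.4}
TOY_TWEETS = [
    ("u1", "Charlottesville rally in the news"),
    ("u2", "charlottesville and Charlottesville"),
    ("u3", "Statues of the city"),
    ("u4", "no score here"),
]

print(top_words(TOY_TWEETS, TOY_SCORE, k=5))
```

Because capitalization is preserved, 'Charlottesville' and 'charlottesville' are counted as distinct words, which matches their separate appearance in Table 3.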

| Left word | Left count | Right word | Right count |
|---|---|---|---|
| Charlottesville | 98782 | Charlottesville | 84282 |
| Trump | 19352 | Trump | 11376 |
| realDonaldTrump | 10289 | Obama | 8195 |
| white | 9472 | white | 8174 |
| Nazis | 7743 | DineshDSouza | 8026 |
| Nazi | 6451 | POTUS | 7614 |
| comments | 5068 | pastormarkburns | 7394 |
| charlottesville | 4759 | Barcelona | 6348 |
| good | 4693 | supremacist | 6004 |
| people | 4637 | organizer | 5864 |
| response | 4091 | rally | 5851 |
| must | 4080 | violence | 5671 |
| hate | 3933 | guy | 5501 |
| Barcelona | 3930 | MSM | 5070 |
| violence | 3884 | hate | 4882 |
| supremacy | 3642 | Right | 4531 |
| introducing | 3584 | city | 4490 |
| attack | 3460 | Americans | 4489 |
| via | 3382 | larryelder | 4421 |
| RepCohen | 3358 | Antifa | 4409 |
| rally | 3345 | Since | 4183 |
| Impeachment | 3341 | 11 | 4011 |
| Articles | 3326 | Chicago | 3946 |
| Klansmen | 3310 | Statues | 3905 |
| right | 3155 | 40 | 3900 |

Table 3: The twenty-five most numerous words for nodes with negative (i.e., Left) and positive (i.e., Right) media PCA scores. The blue text indicates words that appear in the top twenty for the Left but not for the Right, and the red text indicates words that appear in the top twenty for the Right but not for the Left.
We observe additional qualitative differences between the tweet content of the Left and Right on shared common words, such as 'Trump' and 'Barcelona', in the #Charlottesville data set. The 'Trump' subset 8 for which we have media PCA scores consists of 34,084 total tweets (of which about 32% are unique) from the Left, and 18,791 total tweets (of which about 23% are unique) from the Right. 9 As we show in the left set of columns of Table 4, the Left and Right conversations

Table 4, left set of columns ('Trump' subset):

| Left word | Left count | Right word | Right count |
|---|---|---|---|
| Charlottesville | 30995 | Charlottesville | 16218 |
| Trump | 19396 | Trump | 11376 |
| realDonaldTrump | 10291 | realDonaldTrump | 3245 |
| Nazis | 5600 | MAGA | 1403 |
| comments | 4941 | POTUS | 1278 |
| good | 3788 | President | 1229 |
| introducing | 3580 | Romney | 1209 |
| Impeachment | 3331 | comments | 1175 |
| white | 3318 | apologize | 1164 |
| Articles | 3313 | racist | 1149 |
| RepCohen | 3306 | blame | 1096 |
| Klansmen | 3301 | Mayor | 1085 |
| must | 2699 | antifa | 980 |
| Congress | 2606 | charlottesville | 959 |
| censure | 2540 | Vice | 958 |
| supremacy | 2239 | Barcelona | 880 |
| wake | 2107 | left | 868 |
| defense | 2060 | alt | 818 |
| NancyPelosi | 2054 | coming | 781 |
| repulsive | 2018 | non | 776 |

Table 4, right set of columns ('Barcelona' subset):

| Left word | Left count | Right word | Right count |
|---|---|---|---|
| Charlottesville | 4213 | Charlottesville | 7384 |
| Barcelona | 3926 | Barcelona | 6345 |
| attack | 1023 | Muslims | 2101 |
| Trump | 1020 | 13 | 2036 |
| terrorism | 842 | right | 2035 |
| 2 | 665 | CNN | 2019 |
| prayers | 450 | kills | 2006 |
| realDonaldTrump | 438 | condemned | 2002 |
| thoughts | 420 | kill | 1996 |
| Terror | 415 | someone | 1977 |
| gets | 400 | attack | 1707 |
| settle | 399 | left | 1686 |
| directed | 399 | lunatic | 1632 |
| scolding | 399 | One | 1621 |
| intentional | 397 | wholesale | 1620 |
| ambigu | 395 | johncardillo | 1617 |
| condemns | 356 | copycat | 1190 |
| took | 346 | Trump | 718 |
| comment | 317 | Blitzer | 648 |
| immediately | 317 | people | 642 |

The authors of [83] used a chi-square statistic to analyze the different phrase usage of Democrats and Republicans in Congressional speeches. We apply their approach to words in tweets from the Left and Right in the 'Trump' subset (specifically, using equation (1) in [83] with 'phrases' that consist of a single word) and find that the five words (which include 'Nazis', 'antifa', and 'Vice') with the largest chi-square values were also among the most common words (see Table 4). Therefore, we observe some consistency in results across different methods.
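As an illustration of this type of comparison, the following minimal sketch ranks words by the textbook Pearson chi-square statistic for a 2-by-2 contingency table (word versus all other words, Left versus Right). This is an assumption-laden stand-in: the exact form and normalization of equation (1) in [83] may differ, and the function names and toy counts are hypothetical.

```python
def chi_square_2x2(count_left, total_left, count_right, total_right):
    """Pearson chi-square statistic for one word's usage on two sides.

    count_* : number of occurrences of the word on each side
    total_* : number of occurrences of all words on each side
    """
    a, b = count_left, count_right                # uses of this word
    c, d = total_left - a, total_right - b        # uses of all other words
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_words(left_counts, right_counts):
    """Sort words by decreasing chi-square value."""
    total_left = sum(left_counts.values())
    total_right = sum(right_counts.values())
    words = set(left_counts) | set(right_counts)
    scored = {
        w: chi_square_2x2(left_counts.get(w, 0), total_left,
                          right_counts.get(w, 0), total_right)
        for w in words
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical word counts (not the counts in Table 4).
TOY_LEFT = {"w1": 20, "w2": 40, "w3": 40}
TOY_RIGHT = {"w1": 10, "w2": 40, "w3": 50}

print(rank_words(TOY_LEFT, TOY_RIGHT))
```

A word that is used in the same proportion on both sides (such as `w2` in the toy counts) receives a value of 0, whereas words whose relative frequencies differ most between the two sides rank highest.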
We also use hashtags to compare tweets between the Left and Right communities. In Figure 5, we show the most numerous hashtag for each community, 10 together with the community's mean media PCA score. On the Left, the most numerous hashtag is #Trump (in 13 of 35 communities), followed by #HeatherHeyer (in 5 of 35 communities, if we include a single community with '#HeatherHayer') and then #Barcelona (in 4 of 35 communities). Other top hashtags include #ExposeTheAltRight, #DumpTrump, #FightRacism, and #DisarmHate. On the Right, #Barcelona is the most numerous hashtag (in 3 of 6 communities, if we include a single community with #Barcellona). Other top hashtags on the Right are #UniteTheRight (from a community with an account of large in-degree whose Twitter handle references an influential account that identifies with the alt-right [78]) and #fakenews (from a community with an account of large in-degree whose handle references a well-known white-supremacist hate symbol).

Conclusions and discussion
Our investigation illustrates strong polarization in the Twitter conversation about #Charlottesville. We found that media followership on Twitter is informative and that the #Charlottesville retweet network is strongly assortative with respect to a corresponding PCA-based Left/Right orientation score. Our finding of positive assortativity with respect to media preference on Twitter is consistent with previous studies of Twitter data [8,10,38,39]. Our approach of using a principal component analysis of media followership to characterize nodes is simple and easy to interpret, and it provides a valuable complement to characterizing nodes based on the content of their tweets. We found that the #Charlottesville retweet network is strongly assortative with respect to media preference, making this a potentially useful indicator of marked polarization on Twitter about the 'Unite the Right' rally and its aftermath. Whether differences in media preferences are a cause or an effect (or both) of assortativity on social media is not something that our approach allows us to conclude, but they are correlated strongly with each other in our data.
Polarization is also evident in the community structure of the retweet network, as the communities are highly segregated in terms of their Left/Right node composition. The Left has a larger proportion of tweets with original content (as opposed to retweets) than the Right, and nodes with large hub scores tended to retweet nodes on the Left rather than those on the Right. We additionally find that modularity maximization detects Left communities with central nodes from disparate focal areas such as business, media, entertainment, and politics.
On the Right, both employed community-detection methods identify a large community that includes FoxNews and right-wing personas such as DineshDSouza, pastormarkburns, larryelder, and johncardillo. Heavily retweeted posts from this community about #Charlottesville included references to the mainstream media, Antifa, and Barack Obama. There were also many tweets that referenced Trump, and 'POTUS' is the fifth 11 most common word that was tweeted by members of this community.
Note, however, that Twitter users are not a representative sample of the general population [61], and hashtag sampling introduces its own set of biases [5]. Twitter usage and the propensity to tweet political content may also differ with political affiliation [10]. Consequently, it is also important to compare our findings from Twitter to offline information. Our findings are consistent with a Quinnipiac poll that suggested that nearly one third of Republicans (but only 4% of Democrats) considered counterprotesters to be more to blame than white supremacists for the violence at Charlottesville [84]. We observe that several of the communities on the Right that we obtained from modularity maximization of the #Charlottesville retweet network also appear to reflect core participants of the 'Unite the Right' rally [1], as indicated by the referencing by central nodes in these communities of white-supremacist hate symbols or influential personalities in the alt-right.
Our analysis illustrated a stark distinction between Left and Right when we examined tweets that include the word 'Trump', with criticism on the Left versus support on the Right (see Table 4). For example, the most numerous hashtags from the Left in tweets that include 'Trump' were #Impeachment and #ImpeachTrump; by contrast, the most numerous hashtags from accounts in communities on the Right were #Barcelona, #MAGA, and #fakenews. Our findings are consistent with the extreme polarization and political tribalism in American society that have been described by other studies [85-87]. Such societal divisions are apparent on Twitter, as documented both by the present study and by prior ones [8,10,38,39], including recent work that suggested that polarization on Twitter is increasing over time [28].
It is also important to examine the role that fully automated accounts ('bots') and partially automated accounts (which have been dubbed 'cyborgs' [88]) play in shaping conversations (especially political ones) on Twitter and other social media platforms [46,88-93]. Although an in-depth analysis of the role of bots in the #Charlottesville discussion is beyond the scope of the present paper, it is likely that many bot accounts are present in our data set. For example, automated naming schemes have been noted as an indicator of bot accounts [89]; and naming schemes that end in sequences of eight digits, as well as accounts that consist of hexadecimal strings, both exist in the #Charlottesville data. Detailed investigation of these accounts and their behavior is an important topic for future work. Sockpuppet accounts (i.e., false accounts that are operated by an entity [94]), such as those that are operated by the Internet Research Agency in St. Petersburg, Russia [95,96], can also play important roles in content propagation and thus warrant further investigation. Antipathy and distrust across party lines can provide opportunities for actors who seek to fan societal divisions. For example, our data set includes tweets by prominent accounts operated by the Internet Research Agency [95,96] that attacked both the Left and the Right.
It would be interesting to apply our approach for analyzing the Twitter conversation about #Charlottesville to also examine polarization on other topics (e.g., Brexit) and to see how polarization across political divides and attempts to bridge them change over time. It is not clear whether engagement with Twitter accounts with different viewpoints will decrease or increase polarization on divisive topics. For example, the empirical results of Bail et al. [32] suggest that exposure on Twitter to contrasting ideologies can lead to increased polarization. An interesting question is how exposure shapes viewpoints of individuals with 'centrist' media preferences or ideologies. Our investigation focused primarily on the sign of a media PCA score, but the underlying media PCA score is continuous; and one can use it to examine media preferences in a more nuanced way. In particular, characterizing accounts with moderate ('Centrist') media PCA scores, studying the network structure and tweet content of these nodes, and tracking the evolution of these characteristics over time are both feasible and relevant. It would also be interesting to consider multiple ideological dimensions (e.g., as in studies of voting by legislators on bills [97]) and to simultaneously analyze multiple types of Twitter relationships as a multilayer network [98]. More broadly, we expect that our approach is generalizable to other contexts, and it may be helpful for examining other types of node characterization (such as by analyzing different media outlets or types of followed accounts). To conduct increasingly nuanced investigations, it will also be worthwhile to study PCA components other than the first one and to use other types of multidimensional-scaling techniques.