Tracking Russian Twitter Bots
Bots, trolls, and fake news: how does one define a “bot account”? Even if you can define it concretely, how do you identify one? What makes a bot different from some outspoken guy with strong opinions? Those were my questions when I was asked to join this project. While it didn’t require any machine learning, it was loads of fun to work on because it called for some good data diving and detective work. My partner scraped Twitter to get me the nodes and edges, then I got to work analyzing the network. It was a bit of a needle in a haystack, but once I found a little thread, I started pulling!
The Scrape
Twitter does have an API but, to no one’s surprise, they don’t really like people scraping loads of data off of them, so the API is rate limited. As such, we needed to be smart about how we scraped. The procedure was: start with RT, Sputnik, and Ruptly, as they are known sources, grab the users who have had at least 10 interactions with these nodes, then follow all of their edges to new nodes, and so on. This gave us a little network of users who are at least interested in what Russian state media has to say (though simply interacting with any of those sources doesn’t mean one is a bot!).
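To make the procedure concrete, here is a minimal sketch of that kind of snowball crawl. It is not the scraper we actually ran: `fetch_interactions` is a hypothetical stand-in for the rate-limited API calls, the `max_depth` cutoff is an assumption added so the sketch terminates, and applying the 10-interaction threshold only at the seed accounts is just one reading of the procedure above.

```python
from collections import deque

def snowball_crawl(seeds, fetch_interactions, min_interactions=10, max_depth=2):
    """Breadth-first crawl outward from the seed accounts.

    fetch_interactions(user) is a hypothetical callable returning
    {other_user: interaction_count}; a real version would wrap the
    rate-limited API and sleep between calls.
    """
    depth = {seed: 0 for seed in seeds}   # hop distance from the nearest seed
    edges = []                            # (source, target, interaction_count)
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        if depth[user] >= max_depth:
            continue  # assumed cutoff: do not expand the outermost layer
        # Assumption: the threshold applies only to the seed accounts;
        # past the first hop, every edge is followed.
        threshold = min_interactions if depth[user] == 0 else 1
        for other, count in fetch_interactions(user).items():
            if count < threshold:
                continue
            edges.append((user, other, count))
            if other not in depth:
                depth[other] = depth[user] + 1
                queue.append(other)
    return depth, edges

# Toy data standing in for the API (made-up accounts and counts).
toy = {
    "RT": {"alice": 12, "bob": 3},
    "alice": {"carol": 1, "RT": 12},
    "carol": {"dave": 5},
}
depth, edges = snowball_crawl(["RT"], lambda user: toy.get(user, {}))
```

In the toy run, "bob" is dropped for having too few interactions with the seed, while "carol" is picked up through "alice" at the second hop.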
Hunting for Curiosities
With the network in hand, I began looking at it in various ways using matplotlib and networkx. Before long, I came across the graph below. When limiting the network to only the n most connected nodes (the users with the most connections/interactions with other users), one would expect network density to keep increasing as n gets smaller. Density is the share of possible edges that actually exist, so restricting the network to only the most connected nodes should produce a steady increase in density. However, starting at the top 200 most connected users in the RT network, the density drops to 0, ZERO, as in not a single edge. If you plotted the relationships of the 200 most popular kids in a (really big) high school, would you believe that none of those 200 are friends with each other?
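For anyone who wants to try the same check, here is a minimal sketch using networkx. It assumes an undirected graph, where density is 2m / (n(n - 1)) for n nodes and m edges; the helper name `top_n_density` and the toy graphs are made up for illustration and are not taken from my notebooks.

```python
import networkx as nx

def top_n_density(graph, n):
    """Density of the subgraph induced by the n highest-degree nodes."""
    ranked = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)
    top_nodes = [node for node, _ in ranked[:n]]
    return nx.density(graph.subgraph(top_nodes))

# Toy "bot-like" network: three hubs, each with its own private leaves,
# and no edges between the hubs themselves.
bot_like = nx.Graph()
for hub in range(3):
    for leaf in range(5):
        bot_like.add_edge(f"hub{hub}", f"leaf{hub}_{leaf}")

# Toy "social" network: everyone knows everyone.
social = nx.complete_graph(6)

bot_density = top_n_density(bot_like, 3)
social_density = top_n_density(social, 3)
```

In the bot-like toy graph the three most connected nodes share no edges, so their induced subgraph has density 0, while the same cut of the fully connected graph has density 1.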
With this little thread in hand, I started pulling by looking at the connections that these top 200 do have. Again, these subnetworks do not reflect normal social interaction: none of the nodes connected to the top 200 have any connections between themselves either. If you had hundreds of friends, wouldn’t at least some of those friends know each other? Below you will see a spring map of just 4 potential bots (for ease of illustration) and their connections (purple). Note how none of their “friends” are friends with each other (though some are connected to more than 1 bot), resulting in purple “blooms”. In later maps you’ll be able to see how every bot has these blooms, with almost no connections between them.
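One way to put a number on a bloom is to count the edges that run between a node's neighbors, which is also what the local clustering coefficient measures. The sketch below is illustrative only: it assumes an undirected networkx graph, `neighbor_links` is a hypothetical helper, and the star and complete graphs are stand-ins for a bot and a tight-knit friend group.

```python
import networkx as nx

def neighbor_links(graph, node):
    """Number of edges that connect the neighbors of `node` to each other."""
    neighbors = list(graph.neighbors(node))
    return graph.subgraph(neighbors).number_of_edges()

# A star: node 0 touches every other node, and they never touch each other.
bloom = nx.star_graph(5)
# A clique: every friend of node 0 also knows every other friend.
friends = nx.complete_graph(5)

bloom_links = neighbor_links(bloom, 0)
bloom_clustering = nx.clustering(bloom, 0)
friend_links = neighbor_links(friends, 0)
friend_clustering = nx.clustering(friends, 0)
```

A node sitting at the center of a bloom has zero links among its neighbors and a clustering coefficient of 0, whereas ordinary social circles tend to score well above that.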
At this point, you may be wondering if this is all just a consequence of how the data was scraped - perhaps these 200 most popular nodes are simply the first ones that we picked up from the 3 “news” outlets? I looked into that and found that not only was the scraping not responsible for what we were seeing, but that the top 200 nodes almost never came from the first line of nodes (those that directly interacted with the “news” outlets). Instead, the potential bots seemed to trawl for retweets made by regular users and then reply to those.
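A check along these lines can be sketched by measuring how many hops each node sits from the nearest seed outlet. This is a simplified illustration rather than the exact analysis from my notebooks: it assumes an undirected networkx graph, and `hops_from_seeds` and the toy account names are hypothetical.

```python
import networkx as nx

def hops_from_seeds(graph, seeds):
    """Shortest hop count from each reachable node to its nearest seed."""
    hops = {}
    for seed in seeds:
        lengths = nx.single_source_shortest_path_length(graph, seed)
        for node, dist in lengths.items():
            hops[node] = min(dist, hops.get(node, dist))
    return hops

# Toy network: regular users interact with the outlets directly,
# while the suspected bot only replies to those users.
toy = nx.Graph([
    ("RT", "alice"),
    ("Sputnik", "bob"),
    ("alice", "bot1"),
    ("bob", "bot1"),
])
hops = hops_from_seeds(toy, ["RT", "Sputnik"])
```

Nodes with a hop count of 1 are the first line of nodes; in the toy graph the suspected bot only shows up at hop 2.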
It seems as if these bots are playing more of a supportive role to a narrative being told, and this sort of strategy would make sense: if they are bots, they probably don’t have much of a real following, so simply retweeting what the state-sponsored outlets are publishing would have little or no effect. However, arguing against those who argue against the “articles” would be very useful in attempting to provide legitimacy. Further, these bot accounts were making hundreds if not thousands of tweets daily; not very (normal) human behavior.
So, it seems likely that they are either automated bots, state-sponsored “bots” (people), or very passionate trolls who have the time to tweet hundreds of times a day and so might as well be considered bots. It isn’t beyond-a-shadow-of-a-doubt proof, but we aren’t sending anyone to jail either.
You Found the Bots, Now What?
Aside from just the fun of finding them, the findings would be helpful for creating a metric that measures the level of “botieness”; combined with other metrics, it could be used to ban bot accounts, if you’re into that kind of thing. But we can do better. You have perhaps noticed the circles at the center of the network maps that are neither green nor purple. The maps are spring/tension maps, and those central nodes are the ones that the bots target frequently (that is, nodes targeted by many different bots). This, along with applying a little NLP for topic detection, would be a very effective way of seeing what and who is currently of interest to the Kremlin. Take a look below to see the central nodes, and know that this analysis was done in the spring of 2020 (during the first lockdown).
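The spring layout pulls those shared targets to the center visually, but the same idea can be sketched directly on the graph by counting how many distinct suspected bots touch each node. The helper `shared_targets`, the `min_bots` cutoff, and the toy accounts below are all assumptions for illustration, not the method from the paper.

```python
from collections import Counter
import networkx as nx

def shared_targets(graph, bots, min_bots=2):
    """Rank non-bot nodes by how many distinct suspected bots connect to them."""
    bots = set(bots)
    counts = Counter()
    for bot in bots:
        # Each bot counts once per target, however many times it replied.
        for target in set(graph.neighbors(bot)) - bots:
            counts[target] += 1
    return [(node, n) for node, n in counts.most_common() if n >= min_bots]

# Toy network: three suspected bots, one account that all of them target.
toy = nx.Graph([
    ("bot1", "journalist"), ("bot2", "journalist"), ("bot3", "journalist"),
    ("bot1", "user_a"), ("bot3", "user_b"),
])
central = shared_targets(toy, ["bot1", "bot2", "bot3"])
```

In the toy graph only "journalist" is targeted by more than one bot, so it is the only node returned. The text of the tweets aimed at such nodes would then be the natural input for topic detection.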
You can find the paper we wrote to get more details, along with the data and my notebooks here.