FireRosenstein hashtag analysis – hunting for Twitter bots, fakes, or suspicious accounts with rtweet. Part 1 – Getting the Twitter data and preparing it for Gephi


Hunting Twitter bots / sockpuppets with rtweet and Gephi

rtweet and Gephi are two useful (and free) tools for investigating Twitter data via the Twitter API. In this series of articles, I will show you what can be done with this data to try to identify Twitter bots or sockpuppets (i.e. ‘fake’ accounts run by someone pretending to be someone else). I will use the hashtag #firerosenstein as the subject of this case study. As found by the excellent @conspirator0 and friends, this hashtag was used frequently by pro-Trump accounts following the release of Devin Nunes’ memo on 2 February 2018. ‘Rosenstein’, in case you didn’t know, refers to Rod Rosenstein, the Deputy Attorney General of the United States.

In this first post I will show you how to get the data from Twitter and then prepare it in a format that can be read by a social network analysis program such as Gephi.

Just a note: I have only been using rtweet for a couple of weeks, so if anything in the following tutorial looks inefficiently done, please let me know!

Getting Twitter data using rtweet

First you will need a Twitter account, and then you will need to sign up for a ‘Twitter App’ so that you can access data from the Twitter API. You can do that here. I won’t go into the details of that process in this post, but here is a useful guide to get you started.

A couple of things to note from the above: you will need the app name, consumer key and consumer secret key going forward, and on the ‘Settings’ page you will need to set the callback URL to http://127.0.0.1:1410 for rtweet to work. Once that is done, we can move on to using rtweet!

Loading rtweet

A very useful introduction to rtweet can be found on the documentation website here, which gives an overview of all the main functions, including those I will use here.

## install rtweet from CRAN
# install.packages("rtweet")

## load rtweet into R
require("rtweet")

## Create an access token for rtweet. Take the following information from your own app
# create_token(
#   app = "appname",
#   consumer_key = "consumer_key", 
#   consumer_secret = "consumer_secret_key"
#   )
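If you want to confirm the token was created and cached correctly, rtweet provides a get_token() function that returns the token it will use for requests. A quick check (commented out, like the other account-specific code in this post):

## Verify that rtweet can find the access token you just created
# get_token()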

Getting your first tweets

We are now ready to search Twitter for tweets. However, it is worth noting that the public API has significant limitations: you can only search for tweets posted in the last 6-9 days, and rate limits prevent you from pulling tens of thousands of tweets or user details in one go. These limitations are highlighted on the rtweet info page.
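Before launching a big query, you can check how much of your rate limit remains using rtweet’s rate_limit() function. A minimal sketch (assuming your token is already set up; ‘search/tweets’ is the endpoint that search_tweets uses):

## Check how many search requests remain in the current rate-limit window
# rate_limit(query = "search/tweets")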

To get your first tweets, we will use the ‘search_tweets’ function. Below (and throughout) I have shown the original rtweet commands I used as commented code, as I didn’t want to expose my API details here.

# rosenstein_tweets_add <- search_tweets(
#   "#firerosenstein", n = 50000, include_rts = TRUE, type = "recent",
#   retryonratelimit = TRUE, parse = TRUE
# )

Cleaning the data to export as csv

Now that we have the tweets, it would be useful to save them as a csv so that we can keep them and reload them later. Unfortunately the ‘search_tweets’ function returns a data frame with a number of embedded list columns; we can deal with this by converting them to character format. Also, information about the tweet posters is saved as an attribute attached to the main data frame. It is worth saving this as a separate csv file that we can also use later.

## Extract the Tweet poster information from the tweets
# poster_details <- as.data.frame(attr(rosenstein_tweets_add, 'users'))

# write.csv(poster_details, "rosenstein_tweets_2_poster_details.csv")

## Change the embedded lists to character vectors 
# rosenstein_tweets_add <- apply(rosenstein_tweets_add,2,as.character)
# rosenstein_tweets_add <- as.data.frame(rosenstein_tweets_add, stringsAsFactors = F)

# write.csv(rosenstein_tweets_add, "rosenstein_tweets_2.csv")
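As an aside, depending on your rtweet version there are built-in helpers that do much the same job: users_data() pulls the user information out of a tweets object, and write_as_csv() flattens the list columns while writing to disk. A sketch, assuming a reasonably recent rtweet:

## Alternative using rtweet's built-in helpers
# poster_details <- users_data(rosenstein_tweets_add)
# write_as_csv(rosenstein_tweets_add, "rosenstein_tweets_2.csv")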

Now we have the tweets safely stored in csv files. My csv contains 7,400 tweets, spanning 3 to 13 February 2018. You can find it here on Github, along with the poster information file, so that you can follow along with this analysis. However, I have noticed that expert Twitter bot hunter @conspirator0 appears as a prominent tweeter in the data due to his investigation of the #FireRosenstein hashtag. As I am looking for suspicious accounts, I will remove him from the dataset, leaving us with 6,966 tweets.

## Load csv files
rosenstein_tweets_add <- read.csv("rosenstein_tweets_2.csv")
rosenstein_tweets_add_poster_details <- read.csv("rosenstein_tweets_2_poster_details.csv")

## Remove any tweet mentioning @conspirator0, as this is a well-known Twitter troll hunter.
## (Guarding against the case of zero matches, where negative indexing with an
## empty vector would drop every row.)
conspirator_rows <- grep("conspirator0", rosenstein_tweets_add$text)
if (length(conspirator_rows) > 0) {
  rosenstein_tweets_add <- rosenstein_tweets_add[-conspirator_rows, ]
}
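A quick sanity check on the number of remaining tweets:

## Should now report 6966 rows
nrow(rosenstein_tweets_add)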

Finding retweets and poster names

The ‘screen_name’ column gives us the name of the poster. The dataset also has a handy ‘is_retweet’ column that tells us which tweets are retweets. For those tweets, we can use a regex to find the first word after an ‘@’ sign in the tweet text, which gives us the name of the user being retweeted.

We will then put the posters and retweeters side by side, in preparation for export to network visualisation / analysis tools. I will call this an ‘edges’ table because, in network analysis speak, an edge is a connection between two nodes.

require(stringr) # For the 'str_extract' function

## Find the rows that are retweets
rt_patterns <- which(rosenstein_tweets_add$is_retweet == TRUE)

## Get the tweets flagged as retweets
twit <- rosenstein_tweets_add[rt_patterns, ]

## Extract the retweet source from the text (e.g. "RT @user:" or "via @user")
poster <- str_extract(twit$text, "(RT|via)((?:\\b\\W*@\\w+)+)")

## Remove the ':' and everything after it
poster <- gsub(":.*", "", poster)

## Name of the retweeted user
who_post <- gsub("(RT @|via @)", "", poster, ignore.case = TRUE)

## Name of the retweeting user
who_retweet <- as.character(twit$screen_name)

## Prepare a matrix of posters and retweeters
poster_retweet <- cbind(who_post, who_retweet)
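One small defensive addition of my own here (not part of the original workflow): if a tweet is flagged as a retweet but its text doesn’t match the ‘RT @user’ pattern, who_post will be NA, and NA vertex names will cause trouble when we build the graph below. It is safer to drop those rows first:

## Drop any rows where the retweeted user could not be extracted
poster_retweet <- poster_retweet[complete.cases(poster_retweet), , drop = FALSE]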

Working with the Edge table for analysis or export

R has a useful network analysis package called ‘igraph’. You can also create network visualisations using this package, but it takes some time and effort to make them look pretty. That’s where Gephi comes in useful.

Here I will perform a basic network analysis, using the eigenvector centrality function to calculate the relative centrality of the nodes in the retweet network we have created. I will create a new column based on this data (for which I will need the dplyr package), and then export a csv that can be read directly by Gephi to visualise the network.

require(igraph)
require(dplyr)

## Create the network graph in R based on our edge matrix
rt_graph <- graph_from_edgelist(poster_retweet)

## Calculate the eigenvector centrality

eigen_cent <- eigen_centrality(rt_graph)
eigen_cent_users <- as.data.frame(eigen_cent[[1]])
eigen_cent_users$screen_name <- rownames(eigen_cent_users)
colnames(eigen_cent_users)[1] <- "eigen_cent"
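At this point it can be worth a quick look at which accounts come out as most central (purely an inspection step, not required for the export):

## Show the ten accounts with the highest eigenvector centrality
head(eigen_cent_users[order(-eigen_cent_users$eigen_cent), ], 10)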

## Find the nodes in the top 5% of eigenvector centrality scores. This will be used
## to create a label column for Gephi that shows only the names of these top users.

top_5_eigen <- quantile(eigen_cent_users$eigen_cent, prob = .95)
eigen_cent_users$top_5_eigen_users <- ifelse(eigen_cent_users$eigen_cent >= top_5_eigen, eigen_cent_users$screen_name,NA)

# Join the new columns back onto the edge matrix, then convert it back to a matrix
colnames(poster_retweet)[1] <- "screen_name"
poster_retweet <- left_join(as.data.frame(poster_retweet), eigen_cent_users, by = "screen_name")
## Warning: Column `screen_name` joining factor and character vector, coercing
## into character vector
poster_retweet <- as.matrix(poster_retweet)

Now that we have joined our additional data column of interest to our edge matrix, we can prepare and export the Gephi-ready edge and nodes csv files using the following function.

## Create the function
prep_for_gephi <- function(edge_file_name, nodes_file_name, df) {

  ## Edges: Gephi expects the first two columns to be named 'Source' and 'Target'
  df_1 <- as.data.frame(df)
  colnames(df_1)[1:2] <- c("Source", "Target")
  write.csv(df_1, file = paste0(edge_file_name, ".csv"), row.names = FALSE)

  ## Nodes: every unique account appearing as a source or a target
  df_1$Source <- as.character(df_1$Source)
  df_1$Target <- as.character(df_1$Target)
  nodes <- c(df_1$Source, df_1$Target)
  nodes <- as.data.frame(nodes, stringsAsFactors = FALSE)
  nodes <- unique(nodes)

  ## Gephi also requires Id and Label columns that match the node names
  nodes$Id <- nodes$nodes
  nodes$Label <- nodes$nodes

  ## Add the extra columns (e.g. the centrality scores) to the nodes data frame
  df_1_join <- df_1[!duplicated(df_1$Source), ]
  nodes$Source <- nodes$Id
  nodes <- left_join(nodes, df_1_join, by = "Source")
  nodes$Source <- NULL; nodes$Target <- NULL

  write.csv(nodes, file = paste0(nodes_file_name, ".csv"), row.names = FALSE)
}

## Create the csv files
prep_for_gephi("retweet_edges","retweet_nodes",poster_retweet)
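You can then pull these files straight into Gephi via the Data Laboratory’s ‘Import Spreadsheet’ option; import the nodes file first (as a nodes table) and then the edges file (as an edges table) so that the extra node columns are attached correctly.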

Next steps

I won’t go into how to use Gephi in this series of posts; suffice it to say that it is worth learning if you intend to do any sort of network analysis at all (and it is free, so there’s really no excuse). The Gephi website itself has some decent tutorials.

In the next post I will look more deeply into the users that post #FireRosenstein. What patterns can we see in their posts that indicate a suspicious account? And what do we do with this information?

Gephi output of #firerosenstein Twitter network based on data from 3 to 13 February 2018