FireRosenstein hashtag analysis – hunting for Twitter bots, fakes, or suspicious accounts with rtweet. Part 2 – Identifying suspicious Twitter bots / fake accounts for further investigation


Identifying suspicious accounts using rtweet

In my last post I showed you how to get Twitter data for a specific hashtag using rtweet, then how to prepare it for a network analysis program such as Gephi. You can find the data files I am using in this example here on Github.

From our 'firerosenstein' hashtag dataset we are now going to identify the accounts that seem suspicious, i.e. those that may be bots, or may be 'sockpuppet' accounts who may not really be the person they say they are.

First I will load the packages I need and the csv dataset. Like last time, I have removed the tweets referring to the account 'conspirator0', a well-known Twitter bot hunter, from the data.

## loading ggplot2 for graphs
require("ggplot2")
## loading ggplotly for interactive graphs
require("plotly")

## loading dplyr for joins and group_by functions
require("dplyr")

## Load csv files from last time
ros_tweets <- read.csv("rosenstein_tweets_2.csv")
ros_tweets_poster_details <- read.csv("rosenstein_tweets_2_poster_details.csv")

unique_poster_info <- ros_tweets_poster_details[!duplicated(ros_tweets_poster_details$screen_name),]

## Remove any tweet including @conspirator0, as this is a well-known twitter troll hunter
## (a logical mask with grepl() keeps all rows if there is no match, unlike -grep())
ros_tweets <- ros_tweets[!grepl("conspirator0",ros_tweets$text),]
ros_tweets <- ros_tweets[!grepl("conspirator0",ros_tweets$screen_name),]
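A side note on this kind of row filter, as a minimal sketch on made-up data (the data frame `toy` and its contents are hypothetical, not taken from the dataset). Negative indexing with grep() behaves unexpectedly when the pattern matches nothing, because an empty index selects zero rows. A logical mask from grepl() keeps every row in that case.

```r
# Hypothetical toy data, not from the firerosenstein dataset
toy <- data.frame(screen_name = c("alice", "bob", "carol"),
                  text = c("hello", "RT @bob: hi", "good morning"),
                  stringsAsFactors = FALSE)

# The pattern matches nothing here, so grep() returns integer(0)
no_match <- grep("conspirator0", toy$text)

# Negative indexing with an empty index drops every row
dropped_all <- toy[-no_match, ]

# A logical mask keeps all rows when nothing matches
kept_all <- toy[!grepl("conspirator0", toy$text), ]

nrow(dropped_all)  # 0
nrow(kept_all)     # 3
```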

Identifying suspicious Twitter bot characteristics

I have used the principles from this excellent DFR Lab article, '#BotSpot: Twelve Ways to Spot a Bot' to try to identify bots in our data. We do not have the full tweet stream from the accounts listed in our dataset, so we cannot investigate all twelve ways to spot a fake at the moment. We can do that later once we have narrowed down our investigation to a few key accounts.

With the information we have currently, we can investigate the following 8 criteria of the 12 in the article:

1 – Activity
3 – Amplification (partially)
4 – Low posts / high results
5 – Common content
6 – The secret society of silhouettes
8 – Bot's in a name (kind of)
9 – Twitter of Babel
12 – Retweets and likes

We can look at the missing criteria later, when we download the tweet streams of specific users and look at their profile pages in a browser.

I will go through each number in turn. I won't spend much time describing why we are doing this; you should check out the article for a fuller description of each.

1 – Activity

Here I am calculating the number of tweets per day by each poster. I am getting the overall number of tweets for each user, then dividing this by the number of days, then joining this back to the main data frame.

post_number <- as.data.frame(table(ros_tweets$screen_name))
colnames(post_number) <- c("screen_name","post_number")

# ros_tweets$created_at <- as.POSIXlt.factor(ros_tweets$created_at)
ros_tweets$created_at <- as.POSIXct(ros_tweets$created_at)

time_between <- with(ros_tweets, (max(created_at) - min(created_at)))
# Note 'as.double' is required today, can't use 'as.numeric'
time_between <- as.double(time_between, units = "days")

post_number$post_freq <- post_number$post_number / time_between

ros_tweets <- left_join(ros_tweets, post_number, by = "screen_name")
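To make the arithmetic concrete, here is a minimal base R sketch on made-up data (the data frame `toy` and the names `toy_posts` and `window_days` are hypothetical). It assumes, as above, that the frequency is measured over the whole collection window rather than over each user's own active period.

```r
# Hypothetical toy data: three tweets from 'alice', one from 'bob'
toy <- data.frame(
  screen_name = c("alice", "alice", "alice", "bob"),
  created_at = as.POSIXct(c("2018-04-10 00:00:00", "2018-04-10 12:00:00",
                            "2018-04-11 12:00:00", "2018-04-12 00:00:00"),
                          tz = "UTC"),
  stringsAsFactors = FALSE
)

# Number of tweets per user
toy_posts <- as.data.frame(table(toy$screen_name), stringsAsFactors = FALSE)
colnames(toy_posts) <- c("screen_name", "post_number")

# Length of the collection window in days (2 days in this toy data)
window_days <- as.double(max(toy$created_at) - min(toy$created_at), units = "days")

# Tweets per day over the whole window
toy_posts$post_freq <- toy_posts$post_number / window_days
toy_posts
```

Because every user is divided by the same window length, the ranking by `post_freq` is the same as the ranking by raw tweet count; the division only changes the units.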

Let's graph up this new variable 'post_freq'. As we can see, most of the posters post fewer than one tweet per day using the hashtag 'firerosenstein'. However, there are a number of outliers, which we could label as suspicious.

post_freq_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = post_freq)) + geom_histogram(binwidth = 0.5, aes(y=..count..)) + theme_bw() + scale_x_continuous(name = "Number of tweets per day") + scale_y_continuous(name = "Number of users") + coord_cartesian(ylim = c(0, 50))


post_freq_graph <- ggplotly(post_freq_graph)

post_freq_graph_link = api_create(post_freq_graph, filename="firerosenstein_rtweet_2_post_freq_graph")
post_freq_graph_link

Let's create another variable to start labeling the suspicious accounts for further investigation later on. Below, I have labeled any account that is in the top 25% of post frequencies as potentially suspicious.

# If post frequency is greater than the third quartile, label as suspicious
# summary() returns Min, 1st Qu., Median, Mean, 3rd Qu., Max, so the third quartile is element 5
sum_post_number <- as.numeric(summary(ros_tweets$post_freq))

ros_tweets$C1_high_post_freq <- with(ros_tweets, ifelse(post_freq > sum_post_number[5], 1, 0))

ros_tweets$likely_susp <- 0
ros_tweets$likely_susp <- with(ros_tweets, ifelse(C1_high_post_freq == 1, 1, likely_susp))
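As a small sketch of a quartile threshold on made-up numbers (the vector `post_freq` below is hypothetical): summary() returns six values in a fixed order, so picking one by position is easy to get wrong, whereas quantile() asks for the third quartile by name.

```r
# Hypothetical post frequencies for eight users
post_freq <- c(0.1, 0.1, 0.2, 0.2, 0.3, 0.5, 2.0, 6.0)

# summary() returns Min, 1st Qu., Median, Mean, 3rd Qu., Max in that order
s <- summary(post_freq)
names(s)

# quantile() gives the third quartile without relying on position
q3 <- unname(quantile(post_freq, 0.75))

# Flag users above the third quartile
high_freq <- ifelse(post_freq > q3, 1, 0)
high_freq
```

In this toy vector the mean (element 4 of the summary) is larger than the third quartile (element 5), so the two thresholds would flag different sets of users on more finely spread data.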

3 – Amplification

Using dplyr's group_by and summarise, I add together the total number of retweets and quotes, then divide this by the total number of posts to get a proportion.

retweet_number_user <- group_by(ros_tweets, screen_name)

retweet_number_user_sum <- summarise(retweet_number_user,
 no_retweets = sum(is_retweet,na.rm = T),
 no_quotes = sum(is_quote,na.rm = T),
 no_posts = n()
 )

retweet_number_user_sum$quote_retweet_prop <- with(retweet_number_user_sum, (no_retweets + no_quotes) / no_posts)

retweet_quote_nos <- with(retweet_number_user_sum, data.frame(screen_name, quote_retweet_prop))

ros_tweets <- left_join(ros_tweets, retweet_quote_nos, by = "screen_name")
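Here is a minimal sketch of the same proportion on made-up data, using base R's tapply so it runs without extra packages (the data frame `toy` and the column `amplified` are hypothetical). One deliberate difference: it counts a tweet once if it is a retweet or a quote, which keeps the proportion at or below 1 even if a tweet were flagged as both.

```r
# Hypothetical toy data: two tweets from 'alice', three from 'bob'
toy <- data.frame(
  screen_name = c("alice", "alice", "bob", "bob", "bob"),
  is_retweet = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  is_quote = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# A tweet counts as amplification if it is a retweet or a quote
toy$amplified <- toy$is_retweet | toy$is_quote

# Share of each user's tweets that are amplification
amp_prop <- tapply(toy$amplified, toy$screen_name, mean)
amp_prop
```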

Let's graph this up. We can see from the graph that the proportions are highly skewed towards 0 or 1 – this is likely because a lot of users only posted once.

amp_graph <- ggplot(ros_tweets[(!duplicated(ros_tweets$screen_name) & ros_tweets$post_number >= 5),], aes(x = quote_retweet_prop)) + geom_histogram(binwidth = 0.01, aes(y=..count..)) + theme_bw() + scale_x_continuous(name = "Proportion of quotes + retweets / total tweets (users with 5 + tweets)") + scale_y_continuous(name = "Number of users")

amp_graph <- ggplotly(amp_graph)

amp_graph_link = api_create(amp_graph, filename="firerosenstein_rtweet_2_amp_graph")
amp_graph_link

Let's say that anyone with a (quotes + retweets) / total tweets proportion of 1 can be labelled as suspicious. To make sure the proportions are not skewed to 1 or 0 because the user only had 1 tweet, I only consider users who tweeted at least five times.

ros_tweets$C3_amp <- with(ros_tweets, ifelse((quote_retweet_prop == 1 & ros_tweets$post_number >= 5), 1, 0))

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C3_amp == 1, 1, likely_susp))

4 – Low posts / high results

This metric refers to posters whose tweets reach a much larger audience than you would expect. For this measure, I am comparing the number of retweets for each post with the number of followers the poster has. I added 1 to the follower counts to avoid dividing by zero, which would give 'Inf'.

poster_followers <- with(unique_poster_info,data.frame(screen_name, followers_count))

ros_tweets <- left_join(ros_tweets, poster_followers, by = "screen_name")

ros_tweets$retweet_follower_prop <- NA
ros_tweets$retweet_follower_prop <- with(ros_tweets, ifelse(is_retweet == F, retweet_count / (followers_count + 1), 0))

We can visualise this by showing the proportion of retweets over followers for each account

results_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name) & ros_tweets$is_retweet == F,], aes(x = retweet_follower_prop)) + geom_histogram(binwidth = 0.05, aes(y =..count..)) + theme_bw() + scale_x_continuous(name = "Proportion of retweets for post / number of followers user has") + scale_y_continuous(name = "Number of users") + coord_cartesian(ylim=c(0,200))

results_graph <- ggplotly(results_graph)

results_graph_link = api_create(results_graph, filename="firerosenstein_rtweet_2_results_graph")
results_graph_link

It doesn't seem like there are any posts at all with high numbers of retweets compared to the number of followers, and on checking the data, the top four are label. I would say that any post with more retweets than the poster has followers (a proportion greater than 1) can be labelled as suspicious. In this case that wouldn't make any difference. I will keep the code here just in case you want to use it for other datasets.

ros_tweets$C4_high_retweets <- NA

ros_tweets$C4_high_retweets <- with(ros_tweets, ifelse(retweet_follower_prop > 1, 1, 0))

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C4_high_retweets == 1, 1, likely_susp))

5 – Common content

Here we identify accounts that are tweeting exactly the same text as others (that are not retweets). Here I am only considering tweets with more than thirty characters, so we are not including tweets that e.g. just include the Firerosenstein hashtag.

Below I am creating a column labelling tweets as repeated or not, then I am summing them for each user.

ros_tweets$repeated_tweet <- NA
ros_tweets$repeated_tweet <- ifelse(duplicated(ros_tweets$text) & ros_tweets$is_retweet == F & nchar(as.character(ros_tweets$text)) > 30,1,0)

repeated_tweet_by_user <- as.data.frame(tapply(ros_tweets$repeated_tweet, ros_tweets$screen_name,sum))
repeated_tweet_by_user$screen_name <- rownames(repeated_tweet_by_user)
colnames(repeated_tweet_by_user)[1] <- "repeat_tweets_no"

ros_tweets <- left_join(ros_tweets, repeated_tweet_by_user, by = "screen_name")
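One thing worth knowing about duplicated(), shown as a small sketch on made-up text (the vector `texts` is hypothetical): it marks only the second and later copies of a value, so the first account to post a repeated text is not flagged. If you wanted to flag every copy, one option is to combine it with a second pass using fromLast = TRUE.

```r
# Hypothetical tweet texts: the same long text appears three times
texts <- c("This is a long example tweet with the same text",
           "This is a long example tweet with the same text",
           "A different tweet altogether, also fairly long",
           "This is a long example tweet with the same text")

# Marks only the second and later copies
later_only <- duplicated(texts)

# Marks every copy, including the first
all_copies <- duplicated(texts) | duplicated(texts, fromLast = TRUE)

later_only  # FALSE TRUE FALSE TRUE
all_copies  # TRUE TRUE FALSE TRUE
```

Which version is appropriate depends on whether the original poster of a copied text should count as suspicious too.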

The graph below shows the number of accounts that have tweeted repeated content.

common_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(repeat_tweets_no), fill = as.factor(repeat_tweets_no), text = paste("No. of users:", ..count..,"<br>","No. of repeated tweets:",x-1))) +
geom_bar(width = 1) + theme_bw() + scale_x_discrete(name = "Number of tweets with non-unique content", scale_name="Repeated tweets") + scale_y_continuous(name = "Number of users") + coord_cartesian(ylim = c(0,50)) + theme(legend.position = "none") + scale_fill_brewer(type="qual", palette = "Pastel1")

common_graph <- ggplotly(common_graph,tooltip = c("text"))

common_graph_link = api_create(common_graph, filename="firerosenstein_rtweet_2_common_graph")
common_graph_link

I will label accounts with more than one repeated content tweet as suspicious (only four of them):

ros_tweets$C5_repeat_tweets <- with(ros_tweets, ifelse(repeat_tweets_no > 1, 1, 0))

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C5_repeat_tweets == 1, 1, likely_susp))

6 – The secret society of silhouettes

According to Twitter, the default profile image is at the URL below:

https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png

From the user information we get from our rtweet export we can search for all users with this image as their profile picture.

unique_poster_info$C6_default_profile_pic <- NA
unique_poster_info$C6_default_profile_pic <- ifelse(grepl("default_profile_normal.png",unique_poster_info$profile_image_url),1,0)

default_profile_pic_users <- with(unique_poster_info, data.frame(screen_name, C6_default_profile_pic))

ros_tweets <- left_join(ros_tweets, default_profile_pic_users, by = "screen_name")

A surprisingly high number of tweeters in our dataset (253 accounts) are using the default profile picture.

secret_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(C6_default_profile_pic), fill = as.factor(C6_default_profile_pic), text = paste("No. of users:", ..count..,"<br>","Default profile pic:",x-1))) +
geom_bar() + theme_bw() + scale_x_discrete(name = "Uses Twitter default profile pic", scale_name="Repeated tweets") + scale_y_continuous(name = "Number of users") + theme(legend.position = "none") #+ scale_color_brewer(type="qual", palette = "Pastel2") + #coord_cartesian(ylim = c(0,50)) 

secret_graph <- ggplotly(secret_graph,tooltip = c("text"))

secret_graph_link = api_create(secret_graph, filename="firerosenstein_rtweet_2_secret_graph")
secret_graph_link

I will label any account using the default profile pic as suspicious.

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C6_default_profile_pic == 1, 1, likely_susp))

8 – Bot's in a name (kind of)

Here the article talks about identifying accounts whose names are a random collection of letters and numbers. I haven't found a simple way to do this online, but I have come up with my own rough estimation of the same pattern by considering the proportion of consonants and numbers in the name.

If you are using a random alphanumeric generator, you would expect there to be 36 possible characters (26 letters + 10 digits). There are 21 consonants in English (counting 'y'), so you would expect the proportion of consonants in a string to be 21 / 36, or about 0.58. I will allow a band of width 0.1 around this, which I have set at 0.54 to 0.64. Similarly, you would expect the proportion of numbers to be around 10 / 36 or 0.28, which with a band of 0.1 gives 0.23 to 0.33 as our outer bounds. Both conditions need to be true to suggest the name is a random combination of alphanumeric characters.

In the process of writing this code, I noticed that some of the screen names that had a very high or low proportion of consonants were also very strange (i.e. they seemed to be random collections of letters based solely on consonants, or random collections of vowels / numbers). So I have included accounts with screen names of <= 10% or >= 90% consonants in my consideration of 'weird' screen names. Finally, accounts with 75% or more numbers are also, in my opinion, pretty suspicious.

For this, we need the package stringr.

require(stringr)

# First counting the number of consonants
list_const <- str_extract_all(tolower(ros_tweets$screen_name[!duplicated(ros_tweets$screen_name)]), '[bcdfghjklmnpqrstvwxyz]')
# You can use '[bcdfghjklmnpqrstvwxyz]+' if you wanted to gather neighbouring consonants together
list_conc <- unlist(lapply(list_const,paste, sep="", collapse=""))
no_cons <- sapply(list_conc,nchar)

cons_length_df <- with(ros_tweets[!duplicated(ros_tweets$screen_name),],data.frame(screen_name,no_cons))
cons_length_df$screen_name <- as.character(cons_length_df$screen_name)
cons_length_df$prop_cons <-no_cons / nchar(cons_length_df$screen_name) 


# Doing the same as above, but for numbers
list_nums <- str_extract_all(tolower(ros_tweets$screen_name[!duplicated(ros_tweets$screen_name)]), '[0-9]')
list_nums <- unlist(lapply(list_nums,paste, sep="", collapse=""))
no_nums <- sapply(list_nums,nchar)
cons_length_df$prop_nums <-no_nums / nchar(cons_length_df$screen_name) 

# Putting all these rules together

cons_length_df$prop_cons_nums_weird <- NA
cons_length_df$prop_cons_nums_weird <- with(cons_length_df, ifelse((prop_cons > .54 & prop_cons < .64 & prop_nums > 0.23 & prop_nums < 0.33) | prop_cons >= .9 | prop_cons <= .1 | prop_nums >= 0.75 , 1, 0))

cons_length_df_join <- with(cons_length_df, data.frame(screen_name, prop_cons, prop_cons_nums_weird))

ros_tweets <- left_join(ros_tweets, cons_length_df_join, by = "screen_name")
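To see how these rules behave, here is a minimal base R sketch on made-up screen names (the helper `char_prop` and the names in `names_toy` are hypothetical). It computes the proportions by stripping matching characters with gsub instead of using stringr, then applies the same thresholds as above.

```r
# Hypothetical helper: share of characters in 'name' that match 'pattern'
char_prop <- function(name, pattern) {
  name <- tolower(as.character(name))
  stripped <- gsub(pattern, "", name)
  (nchar(name) - nchar(stripped)) / nchar(name)
}

# Made-up screen names for illustration
names_toy <- c("johnsmith", "xk7qa2m", "bcdfg", "12345678")

prop_cons <- char_prop(names_toy, "[bcdfghjklmnpqrstvwxyz]")
prop_nums <- char_prop(names_toy, "[0-9]")

# Same thresholds as in the rules above
weird <- ifelse((prop_cons > .54 & prop_cons < .64 &
                 prop_nums > 0.23 & prop_nums < 0.33) |
                prop_cons >= .9 | prop_cons <= .1 | prop_nums >= 0.75, 1, 0)

data.frame(names_toy, prop_cons, prop_nums, weird)
```

In this toy set 'johnsmith' passes as normal, 'xk7qa2m' falls inside the random-alphanumeric band, 'bcdfg' is all consonants, and '12345678' has no consonants at all.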

Another thing I have noticed in my time using Twitter is the preponderance of bot-like behaviour among accounts with long strings of numbers attached to the end of their name. Here I have considered a string of four or more consecutive numbers as a 'weird' name.

screen_name <- unique_poster_info$screen_name
# Looking for unique posting usernames with four or more consecutive numbers in them
unique_number_posters <- as.data.frame(screen_name)

unique_number_posters$four_number_posters <- NA
unique_number_posters$four_number_posters <- ifelse( grepl("[0-9]{4,}",unique_number_posters$screen_name),1,0)

ros_tweets <- left_join(ros_tweets, unique_number_posters, by = "screen_name")
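A quick sketch of the digit-run check on a few names (two are taken from the output later in this post, the other two are made up). The pattern `[0-9]{4,}` matches a run of four or more digits anywhere in the name; adding `$` is one way to restrict it to runs at the end of the name, if that is what you want.

```r
# Two screen names from this post plus two made-up ones
names_toy <- c("JamesCo69645981", "BobDutton5", "abc123def456", "1234abc")

# Four or more consecutive digits anywhere in the name
has_four_digits <- grepl("[0-9]{4,}", names_toy)

# Four or more consecutive digits at the end of the name only
ends_four_digits <- grepl("[0-9]{4,}$", names_toy)

has_four_digits   # TRUE FALSE FALSE TRUE
ends_four_digits  # TRUE FALSE FALSE FALSE
```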


ros_tweets$C8_weird_names <- NA
ros_tweets$C8_weird_names <- with(ros_tweets, ifelse(prop_cons_nums_weird == 1 | four_number_posters == 1, 1, 0))

We can see from the graph below that 554 of the users in the dataset have a 'weird' name. Only 66 of them have a proportion of consonants / numbers in the name that I have labelled as 'suspicious', while 505 accounts have four or more consecutive numbers in the name.

name_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(C8_weird_names), fill = as.factor(C8_weird_names), text = paste("No. of users:", ..count..,"<br>","Weird names?",x-1))) +
geom_bar() + theme_bw() + scale_x_discrete(name = "Has a strange username?", scale_name="Repeated tweets") + scale_y_continuous(name = "Number of users") + theme(legend.position = "none") #+ scale_color_brewer(type="qual", palette = "Pastel2") + #coord_cartesian(ylim = c(0,50)) 

name_graph <- ggplotly(name_graph,tooltip = c("text"))

name_graph_link = api_create(name_graph, filename="firerosenstein_rtweet_2_name_graph")
name_graph_link

table(ros_tweets$four_number_posters[!duplicated(ros_tweets$screen_name)])
## 
##    0    1 
## 2989  505
table(ros_tweets$prop_cons_nums_weird[!duplicated(ros_tweets$screen_name)])
## 
##    0    1 
## 3428   66

I will label accounts that have a 'weird name' according to my definition above as suspicious.

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C8_weird_names == 1, 1, likely_susp))

9 – Twitter of Babel

Using the group_by and summarise functions, I count the number of distinct languages used by each tweeter. This covers #9. Then I put this new column back into the main data frame.

lang_number_user <- group_by(ros_tweets, screen_name)

# The language 'und' means undefined; I will just convert it to english for the calculation
lang_number_user$lang <- as.character(gsub("und", "en", lang_number_user$lang))
lang_number_user_sum <- summarise(lang_number_user,
no_langs = length(unique(lang))
)

lang_nos <- with(lang_number_user_sum, data.frame(screen_name, no_langs))

ros_tweets <- left_join(ros_tweets, lang_nos, by = "screen_name")

Next, a graph of the number of languages (for #9).

lang_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(no_langs),fill= as.factor(no_langs), text = paste("No. of users",..count..))) + geom_bar() + theme_bw() + scale_x_discrete(name = "Number of languages used") + scale_y_continuous(name = "Number of users") + scale_fill_brewer(type = "qual", palette = "Pastel2") + coord_cartesian(ylim = c(0,100)) + theme(legend.position = "none")

lang_graph <- ggplotly(lang_graph,tooltip = c("text"))

lang_graph_link = api_create(lang_graph, filename="firerosenstein_rtweet_2_lang_graph")
lang_graph_link

We can see from the above graph that there are very few posters posting in more than one language (only 22 out of about 3500 in fact). Let's label anyone tweeting in more than one language as suspicious.

ros_tweets$C9_mult_langs <- with(ros_tweets, ifelse(no_langs > 1, 1, 0))

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C9_mult_langs == 1, 1, likely_susp))

12 – Retweets and likes

This is based on the assumption that accounts that have almost the same number of retweets and likes are suspicious. Unfortunately, the data we get from the API doesn't allow us to separate regular tweets from retweets in an account's totals. However, I have assumed that if an account is liking about as often as it tweets or retweets, that is also pretty suspicious. I have treated a proportion of likes of between 47.5% and 52.5% of total actions (tweets + retweets + likes) as suspicious.

unique_poster_info$sum_status_fav <- with(unique_poster_info, (statuses_count + favourites_count))
unique_poster_info$favs_prop_total <- with(unique_poster_info, favourites_count / sum_status_fav)

# Focusing on users with between 47.5 and 52.5% favourites of total actions (statuses + favourites)
unique_poster_info$C12_same_prop_favs <- NA
unique_poster_info$C12_same_prop_favs <- with(unique_poster_info, ifelse(favs_prop_total >= .475 & favs_prop_total <= .525,1,0))

prop_favs <- with(unique_poster_info,data.frame(screen_name, C12_same_prop_favs))
ros_tweets <- left_join(ros_tweets, prop_favs, by = "screen_name")
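One edge case to be aware of, sketched here on made-up counts (the data frame `toy` is hypothetical): an account with no statuses and no favourites gives 0 / 0, which is NaN in R, and a missing flag can then spread into anything computed from it. A possible guard is to treat such accounts as not flagged.

```r
# Hypothetical counts; account 'c' has no activity at all
toy <- data.frame(screen_name = c("a", "b", "c"),
                  statuses_count = c(100, 900, 0),
                  favourites_count = c(105, 100, 0),
                  stringsAsFactors = FALSE)

total <- toy$statuses_count + toy$favourites_count

# Proportion of likes, left as NA where there is no activity
favs_prop <- ifelse(total > 0, toy$favourites_count / total, NA)

# Flag only accounts with a defined proportion inside the band
flag <- ifelse(!is.na(favs_prop) & favs_prop >= .475 & favs_prop <= .525, 1, 0)
flag  # 1 0 0
```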

We can see that 466 users from the dataset have a very similar proportion of likes to tweets.

ret_likes_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(C12_same_prop_favs),fill= as.factor(C12_same_prop_favs), text = paste("No. of users",..count..))) + geom_bar() + theme_bw() + scale_x_discrete(name = "Similar number of likes to tweets?") + scale_y_continuous(name = "Number of users") + theme(legend.position = "none")

ret_likes_graph <- ggplotly(ret_likes_graph,tooltip = c("text"))

ret_likes_graph_link = api_create(ret_likes_graph, filename="firerosenstein_rtweet_2_ret_likes_graph")
ret_likes_graph_link

I will label users within the defined proportions of likes compared to tweets as suspicious.

ros_tweets$likely_susp <- with(ros_tweets, ifelse(C12_same_prop_favs == 1, 1, likely_susp))

Creating a 'suspicion index'

Looking at the accounts I have so far labeled as suspicious, we can see that 1150 of the 3494 unique accounts considered are 'suspicious' under at least one of the categories we have considered.

susp_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(likely_susp),fill= as.factor(likely_susp), text = paste("No. of users",..count..))) + geom_bar() + theme_bw() + scale_x_discrete(name = "Labelled as suspicious") + scale_y_continuous(name = "Number of users") + theme(legend.position = "none")

susp_graph <- ggplotly(susp_graph,tooltip = c("text"))

susp_graph_link = api_create(susp_graph, filename="firerosenstein_rtweet_2_susp_graph")
susp_graph_link

But how do these accounts break down in terms of how many 'suspicious' criteria they meet? We can easily add them up to make a 'suspicion index' (max. score possible 8), so we can focus on the most likely bots / fakes.

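# Using the full column names rather than relying on partial matching with $
ros_tweets$susp_index <- with(ros_tweets, C1_high_post_freq + C3_amp + C4_high_retweets + C5_repeat_tweets + C6_default_profile_pic + C8_weird_names + C9_mult_langs + C12_same_prop_favs)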
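An alternative way to build the index, as a minimal sketch on made-up flags (the data frame `toy` is hypothetical, and only three of the eight flag columns are included to keep it short). It picks up the flag columns by their 'C<number>_' prefix and uses rowSums with na.rm = TRUE, so a missing flag counts as 0 instead of turning the whole index into NA.

```r
# Hypothetical flags for three accounts; one flag is missing for 'b'
toy <- data.frame(screen_name = c("a", "b", "c"),
                  C1_high_post_freq = c(1, 0, 1),
                  C6_default_profile_pic = c(1, NA, 0),
                  C8_weird_names = c(1, 1, 0),
                  stringsAsFactors = FALSE)

# Select the criterion columns by their 'C<number>_' prefix
flag_cols <- grep("^C[0-9]+_", names(toy), value = TRUE)

# Sum the flags per row, treating missing flags as 0
toy$susp_index <- rowSums(toy[, flag_cols], na.rm = TRUE)
toy
```

Whether a missing flag should count as 0 is a judgment call; leaving it as NA would instead make it obvious which accounts lack data.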

susp_index_graph <- ggplot(ros_tweets[!duplicated(ros_tweets$screen_name),], aes(x = as.factor(susp_index),fill= as.factor(susp_index), text = paste("No. of users",..count..))) + geom_bar() + theme_bw() + scale_x_discrete(name = "Suspicion index score") + scale_y_continuous(name = "Number of users") + theme(legend.position = "none") + scale_fill_brewer()

susp_index_graph <- ggplotly(susp_index_graph,tooltip = c("text"))

susp_index_graph_link = api_create(susp_index_graph, filename="firerosenstein_rtweet_2_susp_index_graph")
susp_index_graph_link

We have 18 accounts that fulfil 3 of the 8 criteria that we have analysed here. Here are their screen names.

unique(ros_tweets$screen_name[ros_tweets$susp_index == 3])
##  [1] "GreyWolf1065"    "JEFF02163191"    "JamesCo69645981"
##  [4] "mikede14785"     "staggerlee420"   "LM16612718"     
##  [7] "Jed42253333"     "tired1025"       "DJTagainin2020" 
## [10] "Patches1811"     "Whatsthe4114"    "jeffb12751"     
## [13] "BobDutton5"      "S92167397"       "JamesMa67750068"
## [16] "parrothead3322"  "yiayia1234"      "senator16"

Focusing on the most influential

18 accounts is still quite a lot of accounts to deal with if you want to go investigating their posting history. One way to prioritise which accounts to look at is by measuring their influence in the 'firerosenstein' hashtag network. As I showed in the previous post, we can do this by getting the eigenvector centrality value of each account using the retweet data. We need the package 'igraph' for this to work.

# Select the tweets that are retweets
rt_patterns <- which(ros_tweets$is_retweet == T)
twit <- ros_tweets[rt_patterns,]

# Get the retweet source, e.g. "RT @user"
poster <- str_extract(twit$text, "(RT|via)((?:\\b\\W*@\\w+)+)")
# Remove the ':' and anything after it
poster <- gsub(":.*", "", poster)
# Name of the retweeted user
who_post <- gsub("(RT @|via @)", "", poster, ignore.case=TRUE)
# Name of the retweeting user
who_retweet <- as.character(twit$screen_name)

# Two-column edge list: retweeted user, retweeting user
poster_retweet <- cbind(who_post, who_retweet)

require(igraph)

## Generate the graph
rt_graph <- graph.edgelist(poster_retweet)

eigen_cent <- eigen_centrality(rt_graph)
eigen_cent_users <- as.data.frame(eigen_cent[[1]])
eigen_cent_users$screen_name <- rownames(eigen_cent_users)
colnames(eigen_cent_users)[1] <- "eigen_cent"

ros_tweets <- left_join(ros_tweets, eigen_cent_users, by = "screen_name")
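To give a feel for what eigenvector centrality measures, here is a small sketch that does not use igraph at all. It computes the leading eigenvector of a tiny made-up adjacency matrix with base R's eigen() (the users 'a' to 'd' and their edges are hypothetical). It assumes edges are treated as undirected and that scores are scaled so the maximum is 1, which as far as I understand matches the defaults of igraph's eigen_centrality; treat it as an illustration, not a replacement.

```r
# Hypothetical retweet edges between four users
users <- c("a", "b", "c", "d")
edges <- rbind(c("a", "b"), c("a", "c"), c("a", "d"), c("b", "c"))

# Build a symmetric (undirected) adjacency matrix
adj <- matrix(0, 4, 4, dimnames = list(users, users))
for (i in seq_len(nrow(edges))) {
  adj[edges[i, 1], edges[i, 2]] <- adj[edges[i, 1], edges[i, 2]] + 1
  adj[edges[i, 2], edges[i, 1]] <- adj[edges[i, 2], edges[i, 1]] + 1
}

# The eigenvector for the largest eigenvalue gives the centrality scores
ev <- eigen(adj, symmetric = TRUE)
cent <- abs(ev$vectors[, 1])

# Scale so the most central user has a score of 1
cent <- cent / max(cent)
names(cent) <- users
sort(cent, decreasing = TRUE)
```

User 'a' is connected to everyone and comes out on top, while 'd', whose only link is to 'a', scores lowest: an account scores highly when it is connected to other well-connected accounts.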

Now let's see who is the most influential in the 'suspicious accounts' list according to their eigenvector centrality.

ros_tweets_most_susp <- ros_tweets[ros_tweets$susp_index == 3,]

ros_tweets_most_susp_rank <- with(ros_tweets_most_susp, data.frame(screen_name, eigen_cent))

ros_tweets_most_susp_rank_unique <- ros_tweets_most_susp_rank[!duplicated(ros_tweets_most_susp_rank$screen_name),]

ros_tweets_most_susp_rank_unique[order(-ros_tweets_most_susp_rank_unique$eigen_cent),]
##         screen_name             eigen_cent
## 5     staggerlee420 0.67402721114207897468
## 1      GreyWolf1065 0.03211250000100186741
## 81       BobDutton5 0.02969193910965948688
## 4       mikede14785 0.02534802841033530901
## 80       jeffb12751 0.02528905501672723544
## 3   JamesCo69645981 0.02528905501672722503
## 307 JamesMa67750068 0.00778118260440472092
## 36   DJTagainin2020 0.00111210562204536764
## 358  parrothead3322 0.00077317700857879577
## 30       LM16612718 0.00045965506863352199
## 32        tired1025 0.00007439031272290101
## 367      yiayia1234 0.00006594323374774065
## 79     Whatsthe4114 0.00000018513001422248
## 31      Jed42253333 0.00000000000000000351
## 2      JEFF02163191                     NA
## 75      Patches1811                     NA
## 306       S92167397                     NA
## 382       senator16                     NA

Far and away the most influential user in our list of 'suspicious' accounts is 'staggerlee420'. In my next post, I will look in more detail at this user's tweet history, and I will consider the final four bot criteria.
