Content Analysis of a Facebook Group Using R

Author: Ilan Dan-Gur
Date of publication: December 28th, 2020
Date of statistical data: December 16th-22nd, 2020

I used data mining [1] to analyze the content of posts in a Facebook group in the Bow Valley, Alberta, over a one-week period in December 2020. The analysis was done in R, a language for statistical computing and graphics.

I will start by presenting the results, followed by a detailed description of how the analysis was done, including the data file and the R code used to produce them.

Results

Type of posts

  1. Help (34.8%)
  2. Cause (15.2%)
  3. Info (15.2%)
  4. Business (9.8%)
  5. Covid (7.1%)
  6. Lost (6.2%)
  7. Complaint (6.2%)
  8. Thanks (4.5%)
  9. Vandalism (0.9%)

List 1: Proportions of posting types

Most frequent words used

  1. Canmore
  2. Christmas
  3. Know
  4. Please
  5. Valley
  6. Banff
  7. Bow
  8. Thanks
  9. Covid
  10. Looking
  11. Need
  12. Parks
  13. Today
  14. Free
  15. Get
  16. Town
  17. Year
  18. Back
  19. Family
  20. Time

List 2: Most frequent words used

One week in one image

The software automatically created an image of a "word cloud" [2] from the words used in the posts. The most frequent words are the largest and appear at the centre of the image; less frequent words are smaller and closer to the edge. The least frequent words were omitted due to space limitations.

[Word cloud image: many words in different sizes, with the largest words at the centre of the image and smaller words toward the edge.]

How the analysis was done

1. Data collection and preprocessing

I collected posts from a Facebook group in the Bow Valley, Alberta, over a one-week period in December 2020. The posts were saved anonymously: only the textual content was kept, without any names.
The data was organized in a CSV file containing two features, "type" and "text" (the data file can be downloaded here [3]). The text of each Facebook post went into the "text" feature (see the data file and the layout sketch after the list below).

Each post was identified manually as one of the following types:

  1. Business (promoting a specific business)
  2. Cause (supporting a cause, e.g. community)
  3. Complaint (complaining about an issue)
  4. Covid (covid related)
  5. Help (looking for help, asking a question, looking for recommendation)
  6. Info (providing info)
  7. Lost (lost and found)
  8. Thanks (thank you note)
  9. Vandalism (reporting vandalism)
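
For reference, here is a minimal sketch of how facebook.csv is laid out: two columns, "type" and "text". The rows below are invented purely for illustration and are not taken from the actual data file:

type,text
Help,"Does anyone know if the pharmacy in Canmore is open today?"
Business,"Holiday sale at our shop this Saturday, everyone welcome."
Thanks,"Thank you to whoever returned my lost mittens!"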

2. Analysis: Proportions of posting types

(All code below is in the R programming language.)

I began by importing the CSV data from the file (facebook.csv) and saving it to a data frame:

fb_raw <- read.csv("facebook.csv", stringsAsFactors = FALSE)

The "type" feature is currently a character vector. Since it's a categorical variable, I converted it to a factor:

fb_raw$type <- factor(fb_raw$type)
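
To confirm the conversion worked, one option (not part of the original analysis) is to inspect the structure of the column and list its categories:

# Check that "type" is now a factor and show its levels
str(fb_raw$type)
levels(fb_raw$type)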

I then verified there were no identical (i.e. duplicate) posts in the dataset:

anyDuplicated(fb_raw$text)

The anyDuplicated() function returns 0 if no duplicates exist; otherwise it returns the row index of the first duplicated post. To remove a duplicate post (say, the 4th row) you can use:

fb_raw <- fb_raw[-4,]
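
If the data contained several duplicates, a more general approach (a sketch, not part of the original analysis) is to drop every repeated post in one step:

# Keep only the first occurrence of each post
fb_raw <- fb_raw[!duplicated(fb_raw$text), ]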

Next, I calculated the proportions of posting types, and sorted the results in decreasing order:

sort(round(100 * prop.table(table(fb_raw$type)), digits = 1), decreasing = TRUE)

This produced the following percentage output (also displayed in list 1 above):

Help        34.8
Cause       15.2
Info        15.2
Business     9.8
Covid        7.1
Lost         6.2
Complaint    6.2
Thanks       4.5
Vandalism    0.9
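
The nested call above can also be written step by step, which may be easier to read; this is an equivalent sketch of the same computation:

# Count the number of posts of each type
type_counts <- table(fb_raw$type)
# Convert counts to percentages, rounded to one decimal place
type_pct <- round(100 * prop.table(type_counts), digits = 1)
# Sort from most to least common
sort(type_pct, decreasing = TRUE)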

3. Analysis: Cleaning the text

I installed the "tm" (Text Mining) package/library, and used it to:

Here is the R code for all that:


install.packages("tm")
library(tm)
fb_corpus <- VCorpus(VectorSource(fb_raw$text))
fb_corpus_clean <- tm_map(fb_corpus,content_transformer(tolower))
fb_corpus_clean <- tm_map(fb_corpus_clean, removeNumbers)
exFillerWords <- c("will","anyone","can","like","just","new","also","everyone")
fb_corpus_clean <- tm_map(fb_corpus_clean,removeWords,c(stopwords("english"),exFillerWords))
replacePunctuation <- function(x){gsub("[[:punct:]]+"," ",x)}
fb_corpus_clean <- tm_map(fb_corpus_clean,content_transformer(replacePunctuation))
fb_corpus_clean <- tm_map(fb_corpus_clean, stripWhitespace)
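
To spot-check what the cleaning did, you could compare one post before and after; this is a sketch, not part of the original analysis:

# The original text of the first post
as.character(fb_corpus[[1]])
# The same post after cleaning
as.character(fb_corpus_clean[[1]])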

4. Analysis: Most frequent words used

I used the tm package (see step 3 above) to split the Facebook posts into individual words, then counted how many times each unique word appeared across all posts. Finally, I kept the most frequent words and sorted them in decreasing order of frequency.

The 20 most frequent words are displayed in list 2 above.

Here is the R code for all that:


# Build a document-term matrix: one row per post, one column per unique word
fb_dtm <- DocumentTermMatrix(fb_corpus_clean)
# Total count of each word across all posts
freq <- colSums(as.matrix(fb_dtm))
# Keep words that appear at least 5 times and sort them by frequency
sort(freq[findFreqTerms(fb_dtm, 5)], decreasing = TRUE)
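
Since list 2 shows exactly 20 words, a direct way to get just the top 20 (a sketch, not the original code) is to sort the full frequency vector and keep the first 20 entries:

# The 20 most frequent words across all posts
head(sort(freq, decreasing = TRUE), 20)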

5. Visualization

I installed the "wordcloud" package/library, and used it to create an image of a "word cloud" that included many words used in the posts.

The image is displayed above.

Here is the R code:


install.packages("wordcloud")
library(wordcloud)
wordcloud(fb_corpus_clean, min.freq = 5, random.order = FALSE)
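
The wordcloud() function also accepts a vector of words together with their counts, so the cloud could equally be built from the freq vector computed in step 4; this sketch (not the original code) should produce a very similar image:

# Build the cloud from explicit word counts instead of the corpus
wordcloud(names(freq), freq, min.freq = 5, random.order = FALSE)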