Using Code to Sift Through Your Facebook History

9 January 2019

Shiri Avni

I often get asked why I enjoy coding so much, and my answer depends on who's asking. I might say that you can run neat simulations with code, develop video games, or tackle interesting computer vision tasks. But some of the friends who ask aren't interested in any of those things; they're more typical computer end users. So this post answers why even everyday end users can find coding useful and rewarding, by demonstrating how a few short lines of code can help you analyze your entire social media history. While this example analyzes personal Facebook history, you could easily adapt it to other social platforms.

The Goal

In this post, I’ll show you how to find all the conversations and chat snippets that include specific keywords. For instance, let’s imagine that you spent a semester living in Antarctica, but you don’t remember much about the details of your time there. By searching for all snippets including the word “Antarctica” in your message history with your friends, you’ll be able to find your messages and remind yourself of the details from that experience. Given that I have 1,047 different conversations in my history, this is not something I would want to do manually. Here’s where code can come in handy!

Download Your Data

To begin, download your Facebook data to your local computer by following the directions at this link. I only downloaded my messages, but feel free to download whatever you’re interested in. Once you extract the zip file, you should have a folder containing a file called your_messages.html and a folder called messages.


facebook download

The extracted zip files assuming you downloaded solely messages from Facebook.

Identify the Relevant Files

As I’m only interested in two-way conversations that I’ve been a part of, the files I’ll be analyzing are those in the messages/archived_threads and messages/inbox folders. However, feel free to include the messages/message_requests folder if you think you’ll find something interesting there.

Looking in either folder, you’ll see a further list of folders: one or two folders per conversation you’ve ever had. If your conversation with person X has only ever consisted of regular text, then you will have one folder for that conversation, containing a single file called message.html. If you’ve also sent or received files or photos as part of a conversation with someone, then the conversation will be split into two folders: the first is identical to the folder we just discussed, and the other will contain a files and a photos folder, which we will ignore.


facebook conversation

An imaginary conversation with Ron Weasley contains two folders, one for the conversation's text, and the other for the conversation's attachments.

Looping

We’ve identified message.html as the file we want to look at for every conversation. Using Python’s os library, we can quickly iterate over the directory’s substructure and extract the relevant file.

import os  # this library provides us with file/directory methods

# your filepaths here
folders = ["/home/files/facebook-shiriavni/messages/archived_threads",
           "/home/files/facebook-shiriavni/messages/inbox"]

for folder in folders:
    for sub_folder in os.listdir(folder):
        file = os.path.join(folder, sub_folder, "message.html")
        if os.path.exists(file):
            analyze_file(file)  # to implement below

Analyze the Files

Our initial implementation will not use any libraries besides Python’s regular expression (re) library, which is useful for searching for patterns in strings. We’ll define a search phrase to look for in the conversations - in my case “china”, as I spent a semester there. Then, I’ve decided to include the next 200 characters after every occurrence of the word “china”, so that I can understand more about the context in which the word is used.

The rest is easy: open the file, read the lines, and combine them into one long string. Transform all characters to lower-case, so that your search phrase doesn’t need to be case-sensitive.

Using the regular expression library, you will get all the indices where your pattern (e.g. “china”) begins. Then, for each such index, print the 200 characters after it. Lastly, print the file name as well, so you know where the message snippet came from.
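To see what re.finditer gives us before plugging it into the real file, here’s a tiny sketch with a made-up string:

```python
import re

# each match object knows the index where it starts in the string
positions = [m.start() for m in re.finditer("na", "banana")]
print(positions)  # [2, 4]
```

In the real code below, the pattern is our search phrase and the string is the whole conversation.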

import re  # the regular expression library

def analyze_file(file):
    search_phrase = "china"  # lower case, set your own
    num_chars_after = 200

    with open(file) as f:
        lines = f.readlines()
        lines = ' '.join(lines).lower()

        positions = [m.start() for m in re.finditer(search_phrase, lines)]
        for i in range(len(positions)):
            end_position = min(len(lines), positions[i] + num_chars_after)  # don't go past end of convo

            print(os.path.basename(os.path.dirname(file)))
            print(lines[positions[i]:end_position])
            print("\n")

Sample output (with names changed):

nevillelongbottom_-okj1nonaq
china's great! ima and i had a wonderful time. sorry for the late reply - my connection has been bad (and facebook is still illegal here, so without a good connection it's difficult to acces


ronweasley_sdd1lqccgq
china's prime ministers come from either tsinghua or peking university (peking is right nearby). some of the people on the board of directors of tsinghua are mark zuckerberg and elon musk, who ac


Looking at the output here, for instance, I was reminded of the VPN problems I’d had in China, despite going with a paid, well-known VPN company. I’d spent an hour with the company’s tech team, but they couldn’t resolve nor figure out my problem (suspicious, perhaps?). All this resulted in sketchy Facebook access. At the same time, the second snippet reminded me that Facebook’s Mark Zuckerberg was on my China university’s School of Economics and Management board of directors. As a side note, this is just one of the many humorous and exasperating contradictions I encountered during my term in China!

There’s a few things we can take away from this:

  1. We should start printing text from before the search pattern occurrence as well, because the search phrase is often found mid-sentence.
  2. Non-ASCII characters look strange, and the raw HTML markup is included in the output.

While this solution is far from perfect, it’s already enough to let you get the gist of the message and identify the file you need to look at to read more about your desired search phrase. If that satisfies you, then happy sailing. If you’d like to know more, read on.

Fix 1

We’ll change the previous for-loop to also include the 200 characters before the search phrase, taking care not to go below index 0 of the string:

num_chars_before = 200

for i in range(len(positions)):
    start_position = max(0, positions[i] - num_chars_before)  # don't go before start of convo
    end_position = min(len(lines), positions[i] + num_chars_after)  # don't go past end of convo

    print(os.path.basename(os.path.dirname(file)))
    print(lines[start_position:end_position])
    print("\n")

The output now looks like this; unfortunately it’s not very helpful yet, due to all the HTML markup. It’ll come in handy after Fix #2, though.

neville_longbottom_-okj1nonaq
="_3-94 _2lem">nov 12, 2015 7:44am</div></div><div class="pam _3-95 _2pi0 _2lej uiboxwhite noborder"><div class="_3-96 _2pio _2lek _2lel">shiri avni</div><div class="_3-96 _2let"><div><div></div><div>china&#039;s great! mom and i had a wonderful time. sorry for the late reply - my connection has been bad (and facebook is still illegal here, so without a good connection it&#039;s difficult to acces

Fix 2

To get rid of the HTML, CSS, and scripts, we’ll use a library called Beautiful Soup, which lets us easily parse HTML files. The code is the same as before, except that it parses the file with this library (see the Beautiful Soup documentation for more details):

from bs4 import BeautifulSoup

def clean_soup(soup):
    for tag in soup.find_all(['script', 'style']):  # remove css and javascript
        tag.decompose()

def analyze_file(file):
    search_phrase = "china"  # lower case
    num_chars_after = 200

    with open(file) as f:
        lines = f.readlines()
        lines = ' '.join(lines).lower()
        soup = BeautifulSoup(lines, 'lxml')
        clean_soup(soup)
        lines = soup.text

        positions = [m.start() for m in re.finditer(search_phrase, lines)]
        for i in range(len(positions)):
            # as before...

Sample output (with names changed):

ronweasley_skwq4yox1q
it was fascinating. they have such a different perspective, and you learn things about china that you simply can't from just visiting there... all of china's prime ministers come from either tsinghua or peking university (peking is right nearby). some of the people on the board of directors of tsinghua are mark zuckerberg and elon musk, who actuall

Notice that besides all the distracting HTML details disappearing, the apostrophes now render correctly too.
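That apostrophe was originally the HTML entity &#039;, which Beautiful Soup decodes when extracting text. If you’re curious, you can reproduce that step on its own with Python’s standard html module:

```python
import html

# html.unescape converts entities like &#039; back into real characters,
# the same decoding beautiful soup performs when extracting text
snippet = "china&#039;s great! mom and i had a wonderful time."
print(html.unescape(snippet))  # china's great! mom and i had a wonderful time.
```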

How Many Messages?

One last point: at the beginning of this post, I stated that I had 1,047 different conversations. To find out how many conversations you’ve been a part of, simply adapt the main code snippet to include the lines marked with a ‘#’ at the end:

num_convos = 0 #
for folder in folders:
    for sub_folder in os.listdir(folder):
        file = os.path.join(folder, sub_folder, "message.html")
        if os.path.exists(file):
            analyze_file(file)
            num_convos += 1 #

print(num_convos) #

Any questions? Feel free to comment below.

Relevant Posts:

Derangements: A Riddle About Misallocated Hats