### 9 January 2019

• Keywords:
• code-for-novices

Shiri Avni

# The Goal

In this post, I’ll show you how to find all the conversations and chat snippets that include specific keywords. For instance, let’s imagine that you spent a semester living in Antarctica, but you don’t remember much about the details of your time there. By searching for all snippets including the word “Antarctica” in your message history with your friends, you’ll be able to find your messages and remind yourself of the details from that experience. Given that I have 1,047 different conversations in my history, this is not something I would want to do manually. Here’s where code can come in handy!

To begin, download your Facebook data to your local computer, by following the directions at this link. I only downloaded my messages, but feel free to download whatever you’re interested in. Once you extract the zip file, you should have a folder with a file called your_messages.html and a folder called messages.

# Identify the Relevant Files

As I’m only interested in two-way conversations that I’ve been a part of, the files I’ll be analyzing are those in the messages/archived_threads and messages/inbox folders. However, feel free to include the messages/message_requests folder if you think you’ll find something interesting there.

Looking in either folder, you’ll further see a list of folders: either one or two folders per conversation you’ve ever had. If your conversation with person X has only ever consisted of regular text, then you will have one folder for that conversation, with the folder containing a single file called message.html. If you’ve also sent or received files or photos as part of a conversation with someone, then the conversation will be split into two folders: the first is identical to the folder we just discussed, and the other folder will contain a files and photos folder, which we will ignore.

An imaginary conversation with Ron Weasley contains two folders, one for the conversation's text, and the other for the conversation's attachments.

## Looping

We’ve identified message.html as the file we want to look at for every given conversation. Using Python’s os library, we can quickly iterate over a directory’s substructure and extract the relevant file.

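    import os  # this library provides us with file/directory methods

    # folders to search, relative to the extracted download (adjust to your own path)
    folders = ["messages/inbox", "messages/archived_threads"]

    for folder in folders:
        for sub_folder in os.listdir(folder):
            file = os.path.join(folder, sub_folder, "message.html")
            if os.path.exists(file):
                analyze_file(file)  # to implement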
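
The same traversal can be tried without touching real data. The sketch below is only an illustration under assumed names: `find_message_files` and `build_fake_export` are hypothetical helpers, and the folder names are made up to mimic the layout described above. It builds a throwaway directory tree and shows that only folders containing a message.html are picked up.

```python
import os
import tempfile

def find_message_files(folders):
    """Return the path of every message.html found one level below each folder."""
    found = []
    for folder in folders:
        for sub_folder in sorted(os.listdir(folder)):
            candidate = os.path.join(folder, sub_folder, "message.html")
            if os.path.exists(candidate):
                found.append(candidate)
    return found

def build_fake_export(root):
    """Create a tiny, made-up folder tree that mimics the export layout."""
    inbox = os.path.join(root, "messages", "inbox")
    # Text folder of a conversation: contains message.html
    text_dir = os.path.join(inbox, "ronweasley_abc123")
    os.makedirs(text_dir)
    with open(os.path.join(text_dir, "message.html"), "w") as f:
        f.write("<html><body>hi</body></html>")
    # Attachments folder: no message.html, so it should be skipped
    os.makedirs(os.path.join(inbox, "ronweasley_def456", "photos"))
    return inbox

with tempfile.TemporaryDirectory() as root:
    inbox = build_fake_export(root)
    files = find_message_files([inbox])
    relative = [os.path.relpath(path, inbox) for path in files]
    print(relative)
```

Returning a list instead of calling the analysis directly makes it easy to check which files will be read before any searching happens.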


# Analyze the Files

Our initial implementation will not use any libraries besides Python’s regular expression (re) library, which is useful for searching for patterns in strings. We’ll define a search phrase to look for in the conversations. In my case that’s “china”, as I spent a semester there. I’ve also decided to print 200 characters starting at every occurrence of the word “china”, so that I can understand more about the context in which the word is used.

The rest is easy: open the file, read the lines, and combine them into one long string. Transform all characters into lower case, so that your search doesn’t need to be case sensitive.

Using the regular expression library, you will get all the indices where your pattern (e.g. “china”) begins. Then, for each such index, print the 200 characters after it. Lastly, print the file name as well, so you know where the message snippet came from.
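
Before applying this to real files, the indexing can be checked on a toy string. This is a minimal sketch, not the post's final function: `find_snippets` is a hypothetical helper name, and the use of `re.escape` is an added assumption so that a phrase containing characters such as "." is matched literally.

```python
import re

def find_snippets(text, search_phrase, num_chars_after=200):
    """Return the text from each match up to num_chars_after characters later."""
    lowered = text.lower()
    # re.escape makes the phrase match literally, even if it contains "." or "?"
    pattern = re.escape(search_phrase.lower())
    snippets = []
    for match in re.finditer(pattern, lowered):
        start = match.start()
        end = min(len(lowered), start + num_chars_after)  # clamp at the end of the text
        snippets.append(lowered[start:end])
    return snippets

sample = "I landed in China today. CHINA is huge!"
print(find_snippets(sample, "china", num_chars_after=12))
```

With a window of 12 characters, this prints `['china today.', 'china is hug']`: both spellings are found because the text is lower-cased first, and each snippet starts at the match itself.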

    import re  # regular expression library

    def analyze_file(file):
        search_phrase = "china"  # lower case, set your own
        num_chars_after = 200

        with open(file) as f:
            lines = f.readlines()
        lines = ' '.join(lines).lower()

        # start index of every occurrence of the search phrase
        positions = [m.start() for m in re.finditer(search_phrase, lines)]
        for i in range(len(positions)):
            end_position = min(len(lines), positions[i] + num_chars_after)  # don't go past end of convo

            print(os.path.basename(os.path.dirname(file)))
            print(lines[positions[i]:end_position])
            print("\n")


Sample output (with names changed):

nevillelongbottom_-okj1nonaq
china&#039;s great! ima and i had a wonderful time. sorry for the late reply - my connection has been bad (and facebook is still illegal here, so without a good connection it&#039;s difficult to acces

ronweasley_sdd1lqccgq
china&#039;s prime ministers come from either tsinghua or peking university (peking is right nearby). some of the people on the board of directors of tsinghua are mark zuckerberg and elon musk, who ac



The messages showed the irony: Facebook was illegal in China, but Mark Zuckerberg was on the board of directors of a top Chinese university.

Looking at the output here, for instance, I was reminded of the VPN problems I’d had in China, despite going with a paid, well-known VPN company. I’d spent an hour with the company’s tech team, but they could neither resolve nor figure out my problem (suspicious, perhaps?). All this resulted in sketchy Facebook access. At the same time, the second snippet reminded me that Facebook’s Mark Zuckerberg was on the board of directors of my Chinese university’s School of Economics and Management. As a side note, this is just one of the many humorous and exasperating contradictions I encountered during my term in China!

There are a few things we can take away from this:

1. We should start printing text before the start of the search pattern occurrence, because the search phrase is often found mid-sentence.
2. Special characters look strange (for example, apostrophes appear as &#039;), and HTML content is included.

While this solution is far from perfect, it’s already enough to let you get the gist of the message and identify the file you need to look at to read more about your desired search phrase. If that satisfies you, then happy sailing. If you’d like to know more, read on.

## Fix 1

We’ll change the previous for-loop to also include the 200 characters before the search phrase, taking care not to go below the 0th index of the entire string:

    num_chars_before = 200

    for i in range(len(positions)):
        start_position = max(0, positions[i] - num_chars_before)  # don't go before start of convo
        end_position = min(len(lines), positions[i] + num_chars_after)  # don't go past end of convo

        print(os.path.basename(os.path.dirname(file)))
        print(lines[start_position:end_position])
        print("\n")
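
The clamping at both ends is easy to get wrong, so here is a small, self-contained sketch of just that step. The helper name `snippet_window` and the short sample sentence are assumptions made for illustration; they are not part of the original code.

```python
def snippet_window(text_length, match_start, num_chars_before=200, num_chars_after=200):
    """Return (start, end) slice bounds around a match, clamped to the text."""
    start = max(0, match_start - num_chars_before)         # never below index 0
    end = min(text_length, match_start + num_chars_after)  # never past the end
    return start, end

text = "we flew to china in the spring"
match_start = text.find("china")
start, end = snippet_window(len(text), match_start, num_chars_before=20, num_chars_after=8)
print(text[start:end])
```

Here the match sits only 11 characters into the text, so asking for 20 characters before it clamps the start to 0, and the printed snippet is `we flew to china in`. Without the `max(0, ...)`, a negative start index would silently slice from the end of the string instead.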


The output now looks like this, and unfortunately it’s not helpful at the moment due to all the HTML text. It’ll come in handy after Fix #2 though.

neville_longbottom_-okj1nonaq
="_3-94 _2lem">nov 12, 2015 7:44am</div></div><div class="pam _3-95 _2pi0 _2lej uiboxwhite noborder"><div class="_3-96 _2pio _2lek _2lel">shiri avni</div><div class="_3-96 _2let"><div><div></div><div>china&#039;s great! mom and i had a wonderful time. sorry for the late reply - my connection has been bad (and facebook is still illegal here, so without a good connection it&#039;s difficult to acces


## Fix 2

To get rid of the HTML, CSS, and scripts, we’ll use a library called Beautiful Soup, which lets us easily parse HTML files. The code is the same as before, but parses the file using this library (read online for more details):

    from bs4 import BeautifulSoup

    def clean_soup(soup):
        for tag in soup.find_all(['script', 'style']):  # remove css and javascript
            tag.clear()

    def analyze_file(file):
        search_phrase = "china"  # lower case
        num_chars_after = 200

        with open(file) as f:
            lines = f.readlines()
        lines = ' '.join(lines).lower()
        soup = BeautifulSoup(lines, 'lxml')  # the 'lxml' parser requires the lxml package
        clean_soup(soup)
        lines = soup.text  # text only, without the HTML tags

        positions = [m.start() for m in re.finditer(search_phrase, lines)]
        for i in range(len(positions)):
            ...  # as before

ronweasley_skwq4yox1q
it was fascinating. they have such a different perspective, and you learn things about china that you simply can't from just visiting there... all of china's prime ministers come from either tsinghua or peking university (peking is right nearby). some of the people on the board of directors of tsinghua are mark zuckerberg and elon musk, who actuall


Notice that besides all the annoying HTML details disappearing, the apostrophes are also rendering correctly now.
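
Both effects, dropping the tags and decoding entities such as &#039;, come from actually parsing the HTML rather than treating it as plain text. To illustrate the idea without installing anything, here is a rough sketch that swaps in the standard library's `html.parser` instead of Beautiful Soup. The names `TextOnlyParser` and `html_to_text` are hypothetical, and the sample markup is a shortened, made-up fragment in the style of the export.

```python
from html.parser import HTMLParser

class TextOnlyParser(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &#039;
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:  # only keep text outside script/style
            self.chunks.append(data)

def html_to_text(markup):
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

sample = '<div class="_3-96"><style>.x{color:red}</style><div>china&#039;s great!</div></div>'
print(html_to_text(sample))
```

This prints `china's great!`: the tags and the CSS are gone, and the entity has been turned back into an apostrophe. Beautiful Soup does considerably more (and copes better with messy markup), so treat this only as a picture of what the parsing step buys you.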

# How Many Conversations?

A last point: at the beginning of this post, I stated that I had 1,047 different conversations. To find out how many conversations you’ve been a part of, simply adapt the main code snippet to include the lines with a ‘#’ at the end:

    num_convos = 0  #
    for folder in folders:
        for sub_folder in os.listdir(folder):
            file = os.path.join(folder, sub_folder, "message.html")
            if os.path.exists(file):
                analyze_file(file)
                num_convos += 1  #

    print(num_convos)  #


Any questions? Feel free to comment below.