I often get asked why I enjoy coding so much, and how I respond depends on the person asking. I may answer that you can do neat simulations with code, develop video games, or do interesting computer vision tasks. But some of the friends who ask me this question aren’t interested in any of those things, and represent more typical computer end users. So this post aims to show why even typical end users could find coding useful and rewarding, by demonstrating how a few short lines of code can help you analyze your entire social media history. While in this example I demonstrate an analysis of personal Facebook history, you could easily adapt it to other social platforms.
In this post, I’ll show you how to find all the conversations and chat snippets that include specific keywords. For instance, let’s imagine that you spent a semester living in Antarctica, but you don’t remember much about the details of your time there. By searching for all snippets including the word “Antarctica” in your message history with your friends, you’ll be able to find your messages and remind yourself of the details from that experience. Given that I have 1,047 different conversations in my history, this is not something I would want to do manually. Here’s where code can come in handy!
To begin, download your Facebook data to your local computer, by following the directions at this link. I only downloaded my messages, but feel free to download whatever you’re interested in. Once you extract the zip file, you should have a folder with a file called
your_messages.html and a folder called
As I’m only interested in two-way conversations that I’ve been a part of, the files I’ll be analyzing are those in the messages/inbox folder. However, feel free to include the messages/message_requests folder if you think you’ll find something interesting there.
Looking in either folder, you’ll further see a list of folders: either one or two folders per conversation you’ve ever had. If your conversation with person X has only ever consisted of regular text, then you will have one folder for that conversation, with the folder containing a single file called message.html. If you’ve also sent or received files or photos as part of a conversation with someone, then the conversation will be split into two folders: the first is identical to the folder we just discussed, and the other folder will contain a photos folder, which we will ignore.
We’ve identified message.html as the file we want to look at for every conversation. Using Python’s os library, we can quickly iterate over a directory’s substructure and extract the relevant file.
    import os  # this library provides us with file/directory methods

    # your filepaths here
    folders = ["/home/files/facebook-shiriavni/messages/archived_threads",
               "/home/files/facebook-shiriavni/messages/inbox"]

    for folder in folders:
        for sub_folder in os.listdir(folder):
            file = os.path.join(folder, sub_folder, "message.html")
            if os.path.exists(file):
                analyze_file(file)  # to implement
Our initial implementation will not use any libraries besides Python’s regular expression (re) library, which is useful for searching for patterns in strings. We’ll define a search phrase to look for in the conversations - in my case “china”, as I spent a semester there. I’ve also decided to include the 200 characters after every occurrence of the word “china”, so that I can understand more about the context in which the word is used.
The rest is easy: open the file, read the lines, and combine them into one long string. Transform all characters into lower-case, so that your search doesn’t need to be case sensitive.
Using the regular expression library, you will get all the indices where your pattern (e.g. “china”) begins. Then, for each such index, print the 200 characters after it. Lastly, print the file name as well, so you know where the message snippet came from.
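Before running this on a whole conversation, it can help to see what the regular expression search returns on a toy string. The sketch below is only an illustration: the helper name find_snippets and the sample sentence are made up for this example, and re.escape is an optional extra I’ve added so that a phrase containing characters like “.” is matched literally.

```python
import re


def find_snippets(text, search_phrase, num_chars_after):
    """Return (position, snippet) pairs for every match of search_phrase."""
    text = text.lower()
    # re.escape makes characters such as "." or "?" match literally
    pattern = re.escape(search_phrase.lower())
    results = []
    for m in re.finditer(pattern, text):
        # m.start() is the index where this match begins
        end_position = min(len(text), m.start() + num_chars_after)
        results.append((m.start(), text[m.start():end_position]))
    return results


# hypothetical sample sentence, just for illustration
for position, snippet in find_snippets("I loved China. China's food was great!", "china", 10):
    print(position, snippet)
# prints:
# 8 china. chi
# 15 china's fo
```

Each match yields the index where the phrase begins, and the snippet is simply a slice of the string starting at that index.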
    import re  # regular expression library

    def analyze_file(file):
        search_phrase = "china"  # lower case, set your own
        num_chars_after = 200
        with open(file) as f:
            lines = f.readlines()
        lines = ' '.join(lines).lower()
        # indices where the search phrase begins
        positions = [m.start() for m in re.finditer(search_phrase, lines)]
        for i in range(len(positions)):
            end_position = min(len(lines), positions[i] + num_chars_after)  # don't go past end of convo
            print(os.path.basename(os.path.dirname(file)))
            print(lines[positions[i]:end_position])
            print("\n")
Sample output (with names changed):
    nevillelongbottom_-okj1nonaq
    china's great! ima and i had a wonderful time. sorry for the late reply - my connection has been bad (and facebook is still illegal here, so without a good connection it's difficult to acces

    ronweasley_sdd1lqccgq
    china's prime ministers come from either tsinghua or peking university (peking is right nearby). some of the people on the board of directors of tsinghua are mark zuckerberg and elon musk, who ac
The messages showed the irony: Facebook was illegal in China, but Mark Zuckerberg was on the board of directors of a top Chinese university.
Looking at the output here, for instance, I was reminded of the VPN problems I’d had in China, despite going with a paid, well-known VPN company. I’d spent an hour with the company’s tech team, but they could neither figure out nor resolve my problem (suspicious, perhaps?). All this resulted in sketchy Facebook access. At the same time, the second snippet reminded me that Facebook’s Mark Zuckerberg was on the board of directors of my Chinese university’s School of Economics and Management. As a side note, this is just one of the many humorous and exasperating contradictions I encountered during my term in China!
There are a few things we can take away from this:
While this solution is far from perfect, it’s already enough to let you get the gist of the message and identify the file you need to look at to read more about your desired search phrase. If that satisfies you, then happy sailing. If you’d like to know more, read on.
We’ll change the previous for-loop to also include the 200 characters before the search phrase, taking care not to go below the 0th index of the entire string:
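    num_chars_before = 200  # set alongside num_chars_after
    for i in range(len(positions)):
        start_position = max(0, positions[i] - num_chars_before)  # don't go before start of convo
        end_position = min(len(lines), positions[i] + num_chars_after)  # don't go past end of convo
        print(os.path.basename(os.path.dirname(file)))
        print(lines[start_position:end_position])
        print("\n")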
The output now looks like this, and unfortunately it’s not helpful at the moment due to all the HTML text. It’ll come in handy after Fix #2 though.
    neville_longbottom_-okj1nonaq
    ="_3-94 _2lem">nov 12, 2015 7:44am</div></div><div class="pam _3-95 _2pi0 _2lej uiboxwhite noborder"><div class="_3-96 _2pio _2lek _2lel">shiri avni</div><div class="_3-96 _2let"><div><div></div><div>china's great! mom and i had a wonderful time. sorry for the late reply - my connection has been bad (and facebook is still illegal here, so without a good connection it's difficult to acces
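As a quick check on why we need to stay at or above the 0th index, here is a toy sketch of the window calculation. The helper name context_window and the sample sentence are my own, purely for illustration. The point is that a negative start index in a Python slice counts from the end of the string, so without clamping to 0 a match near the beginning of a conversation could come back empty or wrong.

```python
def context_window(text, position, num_chars_before, num_chars_after):
    """Return the text around position, clamped to the string's bounds."""
    start_position = max(0, position - num_chars_before)  # never below index 0
    end_position = min(len(text), position + num_chars_after)  # never past the end
    return text[start_position:end_position]


text = "we flew to china in the fall"  # hypothetical sample sentence
position = text.find("china")  # 11

print(context_window(text, position, 5, 10))     # w to china in t
print(context_window(text, position, 200, 200))  # the whole sentence

# Without the clamp, a start of 3 - 5 = -2 would slice from the end instead:
print(repr(text[3 - 5:3 + 4]))  # ''
```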
To get rid of the HTML, CSS, and scripts, we’ll use a library called Beautiful Soup, which lets us easily parse HTML files. The code is the same as before, but parses the file using this library (read online for more details):
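A minimal sketch of what that change might look like is below. It assumes Beautiful Soup is installed (pip install beautifulsoup4), uses Python’s built-in "html.parser" backend, and assumes the export is UTF-8 encoded. The helper name html_to_text is my own, for illustration only.

```python
import os
import re

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def html_to_text(html):
    """Strip tags and decode HTML entities, returning lower-cased plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # a space separator keeps text from neighbouring tags from running together
    return soup.get_text(separator=" ").lower()


def analyze_file(file):
    search_phrase = "china"  # lower case, set your own
    num_chars_before = 200
    num_chars_after = 200
    # assumption: the exported file is UTF-8 encoded
    with open(file, encoding="utf-8") as f:
        text = html_to_text(f.read())
    for m in re.finditer(search_phrase, text):
        start_position = max(0, m.start() - num_chars_before)  # don't go before start of convo
        end_position = min(len(text), m.start() + num_chars_after)  # don't go past end of convo
        print(os.path.basename(os.path.dirname(file)))
        print(text[start_position:end_position])
        print("\n")
```

The only real difference from the earlier version is that the search runs over the text Beautiful Soup extracts, rather than over the raw file contents.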
    ronweasley_skwq4yox1q
    it was fascinating. they have such a different perspective, and you learn things about china that you simply can't from just visiting there... all of china's prime ministers come from either tsinghua or peking university (peking is right nearby). some of the people on the board of directors of tsinghua are mark zuckerberg and elon musk, who actuall
Notice that besides all the annoying HTML details disappearing, the apostrophes are also rendering correctly now.
A last point: in the beginning of this post, I stated that I had 1,047 different conversations. To find out how many conversations you’ve been a part of, simply adapt the main code snippet to include the lines with a ‘#’ at the end:
    num_convos = 0  #
    for folder in folders:
        for sub_folder in os.listdir(folder):
            file = os.path.join(folder, sub_folder, "message.html")
            if os.path.exists(file):
                analyze_file(file)
                num_convos += 1  #
    print(num_convos)  #
Any questions? Feel free to comment below.