It gets a lot harder to do fun data science projects when you become a data scientist, but this annual analysis is one of my favorites. It’s a lot of hard work, but the outcomes are worth all the effort. If you haven’t looked at one of these posts before, I’ve been analyzing monthly emails from my 8 best friends since 2015. I uncover the type of people we are based on the emails we write.
Each month, the 9 of us write an email update to the group and divide it into “highs” (the high points of our month) and “lows” (the bad parts of our month). Sometimes people classify things as “meh” or “medium” for anything in between; I usually include those in the lows. Data was collected from Jan 2018 through Dec 2018, although previous years of data were examined for a trend analysis. All of us are between the ages of 26 and 28.
- Over time we tend to say less and become less positive.
- There is a correlation between how frequently someone writes and how positive their sentiment is.
- Though we are similar in how we write, some words are indicative of specific individuals.
As always, I like to pull the top words from our emails. Filler words like “and”, “or” and “the” get removed during this process. I also used an n-grams model to pull our top phrases. To no one’s surprise, Patrick is one of my top words because I talk about him all the time.
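For anyone who wants to try the same thing, here is a minimal sketch of the top-words and top-phrases step. It uses only the standard library; the tiny `STOP_WORDS` set, the `tokenize` helper, and the sample text are made up for illustration, and counting phrases after stop-word removal is an assumption about the order of the steps. The adjacent pairs built with `zip` are the same pairs `nltk.bigrams` yields for a token list.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "about", "and", "i", "or", "the", "to", "with"}

def tokenize(text):
    """Lowercase the text and keep runs of letters (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

def content_words(text):
    """Tokens with the stop words dropped."""
    return [w for w in tokenize(text) if w not in STOP_WORDS]

def top_words(text, n=3):
    """Most common words after dropping stop words."""
    return Counter(content_words(text)).most_common(n)

def top_bigrams(text, n=3):
    """Most common adjacent word pairs among the remaining words."""
    words = content_words(text)
    return Counter(zip(words, words[1:])).most_common(n)

# Made-up sample text, not taken from the real emails.
sample = (
    "Patrick and I went hiking. Patrick and I cooked dinner. "
    "I talk about Patrick a lot."
)
print(top_words(sample, 1))
print(top_bigrams(sample, 3))
```

As in the post, the phrases that come out still need a human pass to pick the ones that make sense.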
Most Defining Words
I also included the most “insert name here” words. These are words that each of us uses frequently that the rest of the group does not. I think of these as our defining words. Spencer is really into design, so SHOCKER that’s his most defining word.
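The post doesn't spell out how "defining" is scored, so the sketch below is one possible approach rather than the exact method: rank each word by how much larger its share of one person's words is than its share of everyone else's. The `defining_words` function and the sample tokens are hypothetical.

```python
from collections import Counter

def defining_words(tokens_by_person, person, n=3):
    """Rank words by how much more often `person` uses them than the rest."""
    own = Counter(tokens_by_person[person])
    rest = Counter()
    for name, tokens in tokens_by_person.items():
        if name != person:
            rest.update(tokens)
    own_total = sum(own.values())
    rest_total = sum(rest.values()) or 1  # avoid dividing by zero
    # Score = share of this person's words minus share of everyone else's.
    scores = {
        word: count / own_total - rest[word] / rest_total
        for word, count in own.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up tokens, already lowercased and stripped of stop words.
tokens_by_person = {
    "spencer": ["design", "work", "design", "trip", "design"],
    "kayla": ["work", "trip", "family", "work"],
    "todd": ["church", "family", "work", "trip"],
}
print(defining_words(tokens_by_person, "spencer", 1))
```

Words everyone uses (like "work" here) get pushed down the list, which is the point.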
I also look at the sentiment of how we write. I usually take the sentiment of our high points and our low points separately, but this year had so many “mehs” that I ran the sentiment analysis (using VADER) on the full emails instead. The scale runs from -100 (super negative) to +100 (super positive). A sentiment score of 33 is low for our group of friends, but compared to the rest of social media it’s just above neutral (0).
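A rough sketch of the averaging step is below. VADER's compound score runs from -1 to +1, so the -100 to +100 scale is assumed here to be that score multiplied by 100. The `yearly_sentiment` function is hypothetical, and `toy_scorer` is only a stand-in so the example runs without any extra packages; it is not VADER.

```python
from statistics import mean

def yearly_sentiment(monthly_emails, compound_score):
    """Average a per-email compound score over the year, scaled to -100..+100.

    `compound_score` is any function mapping text to a value in [-1, 1],
    for example VADER's compound score.
    """
    scores = [compound_score(text) for text in monthly_emails]
    return round(100 * mean(scores), 1)

# With the vaderSentiment package installed, the scorer could be:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   analyzer = SentimentIntensityAnalyzer()
#   vader = lambda text: analyzer.polarity_scores(text)["compound"]
#   yearly_sentiment(emails, vader)

def toy_scorer(text):
    """Stand-in scorer for illustration only; it is not VADER."""
    words = text.lower().split()
    good = sum(w in {"great", "fun"} for w in words)
    bad = sum(w in {"awful", "tired"} for w in words)
    return (good - bad) / max(len(words), 1)

# Made-up emails, one per month.
print(yearly_sentiment(["great day", "awful week"], toy_scorer))
```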
We are all best friends, but some of us are closer than others. I wanted to see how similar we were to each other based on the emails we wrote. I used word2vec to compare the similarity of the words (and the context) to each other. Most people were similar to Kayla, Nancy and Todd. It could just be that they had the most to say and were the most encompassing of writing styles.
This is one of my poorer performing models, but I wanted to see how we described each other’s cities. I again used a word2vec model, this time to find the words used in the most similar contexts to each city’s name. The results ended up being the names of a lot of other cities and the people who visit them.
I don’t think it would be fair to talk about this group of friends without talking about our significant others. We talk about them so much that I thought I’d include them in this analysis. I looked for the words that were most used in the same context as someone’s name. For Kelsey, Todd’s wife, we get lots of terms around their newborn and how involved they are in the church.
Unfortunately, Nancy’s boyfriend Will does not have any data around his name because the word frequency model I use can’t distinguish between Will as in William and will as in “will you just recognize his name already.”
Descriptions of You
I wanted to know how we as a group talked about each person. Unfortunately, we don’t often talk about each other to each other, so there isn’t enough data and the resulting model is pretty weak. Still, the results are pretty funny and some of them make sense. Like calling Michelle Hitler.
For context, Nancy wrote about how good Michelle was at the game Secret Hitler.
These incredible humans send me their updates every month and I copy and paste them into a csv file for further analysis.
Natural Language Processing
- I used NLTK to tokenize the text and remove stop words.
- VADER sentiment was used to extract the compound sentiment from each person’s monthly email. Sentiment over time was calculated, but the average sentiment shown in this post is an average across the 12 months.
- NLTK.bigrams was used on the tokenized dataset to identify the most frequent pairs of words. The top phrases that made sense were chosen.
- Gensim was used to train a word2vec model. This model was used to pull top associated words for cities and people’s names.
- I calculated how often each word was written by each individual and then calculated the correlation between those word counts for every pair of people. The pairs with the highest correlations were deemed to be most similar. (If anyone has recs on a better way to do this, let me know!)
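The last step above can be sketched as follows. This is a minimal, standard-library version that assumes Pearson correlation over raw word counts; the post doesn't say which correlation was used, and `pearson`, `similarity_ranking`, and the sample tokens are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if undefined."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def similarity_ranking(tokens_by_person):
    """Correlate word-count vectors for every pair, most similar first."""
    counts = {name: Counter(t) for name, t in tokens_by_person.items()}
    # Shared vocabulary so every person's vector has the same word order.
    vocab = sorted(set().union(*counts.values()))
    pairs = []
    for a, b in combinations(sorted(counts), 2):
        xs = [counts[a][w] for w in vocab]
        ys = [counts[b][w] for w in vocab]
        pairs.append((a, b, pearson(xs, ys)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up tokens, already lowercased and stripped of stop words.
sample_tokens = {
    "kayla": ["work", "trip", "family", "work"],
    "nancy": ["work", "trip", "family", "game"],
    "todd": ["church", "baby", "church", "family"],
}
for a, b, r in similarity_ranking(sample_tokens):
    print(a, b, round(r, 2))
```

One thing to keep in mind with this approach: a word that neither person uses counts as a matching zero for both, so two people who both skip the same words look slightly more alike.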