My first big project, before I had any graduate experience, was to analyze emails between a core group of friends. Since we graduated, the nine of us have been emailing each other monthly with updates about our lives, categorized as highs and lows. This analysis covers the year 2015 (January through December).
Subjects: The nine of us, ranging in age from 23 to 25, served on the same executive board for a student organization in business school. Geographically, we were dispersed among Seattle, Chicago, Houston and Austin, with careers in consulting, sales, corporate finance, oil & gas and event planning.
Hypothesis: I stated this mostly after the fact, but because we come from similar backgrounds, with similar levels of education and similar jobs, our syntax would likely be very similar.
Collection Method: I collected every email from each person and listed each word used along with its frequency (with the exception of words in the English corpus). There is some missing data. Some people didn’t email every month, so they have less data associated with them, although in one case a lack of emails did not equate to a lack of words: that person’s emails were always incredibly long.
Variables: I wanted to see how we were similar and how we were dissimilar. I looked at the following:
Top 10 words used overall (on a monthly basis)
Top 5 words used by each person (for the year)
Least used word by each person (for the year)
I also looked at which words are the most descriptive of each person. What words are the most “Chandler”? Meaning, what words does someone use the most that everyone else uses less?
Analysis: Getting the top words for each month and person wasn’t very hard. A simple filter gave me the highest frequency.
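For anyone who wants to try this step, here is a minimal sketch in Python of that kind of frequency filter. The sample emails, the `EXCLUDED` word set, and the `word_counts` and `top_words` helpers are all made up for illustration; this is one way to do it, not necessarily the tool used for the original analysis.

```python
import re
from collections import Counter

# Hypothetical stand-in for the words left out of the counts.
EXCLUDED = {"the", "a", "and", "to", "of", "i", "my", "was", "is", "in", "at"}

def word_counts(text, excluded=EXCLUDED):
    """Count lowercase words in one email body, skipping excluded words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in excluded)

def top_words(emails, n):
    """Return the n most frequent words across a list of email bodies."""
    total = Counter()
    for body in emails:
        total.update(word_counts(body))
    return total.most_common(n)

# Made-up sample data: a few emails from one person (or one month).
sample_emails = [
    "Work was really busy and the project is really late",
    "Really excited about the wedding, work is calmer",
]
print(top_words(sample_emails, 2))
```

Passing in all emails for a given month gives the monthly top words; passing in one person's emails for the year gives that person's top words.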
Finding the most descriptive words was tough (for me). I created a list of every word used by us collectively, with total frequencies, and plotted it against each person’s individual frequencies.
The X axis represents the frequency for the group and the Y axis is the frequency for the individual. Points in the upper left show words used by the individual, but not the group. The top right represents words that everyone uses, like “really”. The bottom left holds words that neither the group nor the individual uses, and the bottom right holds words that are heavily used by the group, but not by the individual.
Using these points, I could identify “outliers”, which would be the words most descriptive of that person. For example, the word I use the most that the group does not is my boyfriend’s name, Patrick.
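The same idea can be roughly sketched in code instead of read off a plot. The scoring rule below (a person's usage rate for a word minus the rest of the group's usage rate) is an assumption chosen for simplicity, not the method behind the scatter plot, and the names and counts are invented.

```python
from collections import Counter

def distinctive_words(person, counts_by_person, n=3):
    """Rank words by how much more often `person` uses them than everyone else.

    The score (the person's usage rate minus the rest of the group's rate)
    is an assumed stand-in for spotting upper-left outliers on the plot.
    """
    own = counts_by_person[person]
    others = Counter()
    for name, counts in counts_by_person.items():
        if name != person:
            others.update(counts)

    own_total = sum(own.values())
    others_total = sum(others.values()) or 1  # avoid dividing by zero
    scores = {
        word: count / own_total - others[word] / others_total
        for word, count in own.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up counts for three hypothetical people.
counts_by_person = {
    "A": Counter({"really": 5, "patrick": 4, "work": 1}),
    "B": Counter({"really": 6, "work": 3, "wedding": 1}),
    "C": Counter({"really": 4, "project": 5, "work": 1}),
}
print(distinctive_words("A", counts_by_person, n=1))
```

Words that score high here are the ones that would sit toward the upper left of the plot: common for the individual, rare for the group.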
Overall, we like saying “really” a lot, and are generally consumed by work, our projects and our friends. We also clearly had a few weddings/engagements going on.
Michelle was clearly in love, as she had been proposed to and was in the midst of planning a wedding. Todd had a particularly bad year, with an unsatisfactory project. Pranitha was excited about living in a new city, and I was obsessed with my boyfriend, Patrick.
Next time: It would be interesting to see whether a model would describe what I was seeing. For instance, a k-means cluster analysis could show which friends are the most similar, based on their word counts. I know we’re different, but next I want to know how different, and whether the difference is statistically significant.
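As a rough illustration of what that k-means step might look like, here is a small pure-Python sketch. The friends, the word list and the usage rates are all invented, and the `kmeans` helper is a simplified teaching version; in practice a library such as scikit-learn's `KMeans` would be the usual choice.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Plain k-means: returns a cluster index for each vector.

    Uses the first k vectors as starting centroids (a simplification;
    real implementations usually pick random or smarter starting points).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its assigned vectors.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up data: each row is one hypothetical friend's usage rate
# for the words ("work", "wedding", "project").
names = ["A", "B", "C", "D"]
vectors = [
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.5, 0.1, 0.4],
    [0.2, 0.7, 0.1],
]
labels = kmeans(vectors, k=2)
print(dict(zip(names, labels)))
```

Friends who land in the same cluster have similar word-usage profiles. Clustering alone does not say whether the differences are statistically significant; that would need a separate test.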