[personal profile] rfmcdonald
Language Log's Mark Liberman has posted the results of a recent study he conducted about language use in the United States. It turns out that Twitter does a great job of letting people track the use of colloquial language.

It took me a while to really make sense of Twitter. For the longest time, it was (to me) the stomping ground of 14-year-olds and Ashton Kutcher, each issuing a minute-by-minute feed of their lives. Around the time Twitter arrived, however, I had just had a breakthrough on YouTube's enormous popularity - it was only after watching a dozen different videos of the Super Mario Brothers theme song performed a dozen different ways that I finally got it: I may not care about cats playing the keyboard or wedding parties dancing down the aisle, but somebody does, and without a distribution system for people to broadcast whatever their hearts felt like, I never would have had my life improved by that kid with the beatboxing flute or the one with the double guitar.

So I waited for a similar breakthrough with Twitter. It came, at long last, after I realized that it was exactly what I first thought it was: 14-year-olds (and Ashton Kutcher) chronicling the minutiae of their lives. It is colloquial language, constrained by 140 characters: everyday conversations about waiting in line at the grocery store, your flight just landing at ORD, what to do this Saturday night, "omg did u see hr dress?" In spurts it is, of course, much more than that, as its use during the protests of the 2009 Iranian election proved, but in its unmarked use, it's the language of how millions of people across the world talk to their friends.

To say Twitter is colloquial is putting it lightly. "Brother," for example, occurs in Twitter data during the week of May 10-17, 2010 with an average frequency of once every 7,338 words, not too distant from its frequency in its closest cousin, the Corpus of Contemporary American English (once every 9,405 words). The difference for "bro," however, is much more dramatic: in the Twitter data during that same period, it occurs once every 5,833 words (more frequently, in fact, than "brother"), while in the COCA it occurs once every 757,575 words - two orders of magnitude less frequently.
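
For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch in Python. The four "once every N words" numbers come from the passage above; the dictionary layout and the `frequency_ratio` helper are just illustrative names, not anything from the original study.

```python
import math

# "Once every N words" figures quoted in the passage above.
words_per_occurrence = {
    "brother": {"twitter": 7_338, "coca": 9_405},
    "bro": {"twitter": 5_833, "coca": 757_575},
}


def frequency_ratio(twitter_interval, coca_interval):
    """How many times more frequent a word is on Twitter than in COCA.

    A smaller interval means a more frequent word, so the ratio of
    frequencies is the inverse ratio of the intervals.
    """
    return coca_interval / twitter_interval


for word, interval in words_per_occurrence.items():
    ratio = frequency_ratio(interval["twitter"], interval["coca"])
    # Convert each interval to occurrences per million words.
    twitter_per_million = 1_000_000 / interval["twitter"]
    coca_per_million = 1_000_000 / interval["coca"]
    print(
        f"{word}: {twitter_per_million:.1f} per million on Twitter, "
        f"{coca_per_million:.1f} per million in COCA, "
        f"ratio {ratio:.1f}x (log10 = {math.log10(ratio):.2f})"
    )
```

On these numbers "brother" comes out only about 1.3 times more frequent on Twitter, while "bro" comes out roughly 130 times more frequent, which is where the "two orders of magnitude" comes from.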

In April 2010, Twitter had approximately 106M registered users. The volume of data that flows through the Twitter pipe dwarfs any other publicly available linguistic corpus in existence (except the web itself), and unlike fixed corpora, it still flows. Such a huge dataset has proven itself to be a fertile resource for a number of natural language processing tasks (such as trend detection and sentiment analysis), but its value as a collection of colloquial language begs to be used for lexicography as well: if the purpose of a dictionary is to record actual usage, then Twitter data allows us to broaden the scope of our corpus beyond newswire, literary works and other forms of privileged publication and include the unedited language of everyday folks as well.


Liberman covers all manner of demographics--estimated age, confirmed location, interests. It's a great post.