research:articles:progress_20110209 [2011/02/13 12:56]
chiragmehta [Running the code]
research:articles:progress_20110209 [2019/11/07 19:40] (current)
==== Predicting what people type ====
  
The primary goal of the [[:home|KType]] project is to improve communication for people with [[research:disabilities|disabilities]]. Typing long words and sentences can be very difficult for such users, and the best way to reduce the discomfort is to predict what they want to type from just a few keystrokes. Chances are that a user who starts to type "goo" wants to say "good morning" or "good bye". Search engines like [[https://www.google.com|Google]] do a great job of showing possible search phrases with just a few letters. [[https://scribe.googlelabs.com/|Google Scribe]] is another good example of predictive typing.
  
==== Why Twitter ====
  
Since the users of the KType iPad app could be offline for long periods, the predictive typing feature cannot rely on any web service. Once I knew I had to build my own auto-complete suggestion database, I started looking for large text datasets. [[https://www.gutenberg.org|Project Gutenberg]] makes tens of thousands of books available, but normal people don't talk like that. For the same reason, I can't use text from Wikipedia, news articles, or even personal blogs. Where do people type naturally? In personal emails, instant messages, text/SMS messages, and, luckily for me, Twitter. I came across a rich dataset of over [[https://infolab.tamu.edu/resources/dataset/|9 million tweets]] from Infolab via [[https://www.reddit.com/r/datasets/comments/fezdi/twitter_data_set_9281007_tweets_across_135825/|reddit]] and immediately started crunching the data.
  
==== N-Grams ====
  
Predicting which word or phrase a user is trying to type depends on the text they have already typed. The phrase "you for" could continue as "you for the" if it was preceded by "thank"; otherwise it could be "you forgot my" or "you forgot to". The simplest way to guess what comes next is to analyze the 9 million short sentences written by over a hundred thousand people and count all the 2-, 3-, and 4-word phrases they use, and how often. These phrases are called [[https://en.wikipedia.org/wiki/N-gram|n-grams]], where 'n' is the number of words in the phrase. N-grams are relatively easy to generate from plain text, and I extracted 1-grams (single words) in my [[https://chir.ag/projects/preztags/|US Presidential Tag-Cloud]] project a few years ago. For KType's prediction algorithm, I think 1- to 4-grams should be sufficient.
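The counting idea is simple enough to sketch in a few lines of Python. This is not the original script (that is in the download below), just a minimal illustration of sliding an n-word window over each tweet and tallying the phrases:

```python
from collections import Counter

def ngrams(words, n):
    """All n-word phrases in a list of words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Tally 1- through 4-grams over a couple of toy tweets.
counts = Counter()
for tweet in ["thank you for the help", "you forgot my name"]:
    words = tweet.split()
    for n in range(1, 5):
        counts.update(ngrams(words, n))

print(counts["you"])      # 2 - appears once in each tweet
print(counts["you for"])  # 1
```

Ranking the most common entries of each `Counter` then gives the suggestion lists.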
  
==== Parsing the data ====
  
Each line of the Infolab tweet data contained a unique ID for the user, a unique ID for the tweet, the tweet text, and the time it was posted. For my purposes, I only needed the user ID and the tweet text. To keep things simple for KType users, I limited the n-grams to lowercase letters only: a-z. So there are no punctuation marks, quotes, capital letters, numbers, URLs, emoticons/smileys, or special characters. For my purposes, that is more than sufficient. If you want a more thorough n-gram list, check out the [[https://ngrams.googlelabs.com/datasets|Google Books Ngram Viewer dataset]].
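An a-z-only filter like the one described can be sketched with two regular expressions (this is my reconstruction, not the original script's exact cleaning logic): strip URLs first, then replace every non-letter with whitespace.

```python
import re

def normalize(tweet):
    """Reduce a tweet to lowercase a-z words only: drop URLs first,
    then replace anything that is not a letter with whitespace."""
    tweet = re.sub(r"https?://\S+", " ", tweet.lower())
    return re.sub(r"[^a-z]+", " ", tweet).split()

print(normalize("OMG!! Check http://t.co/abc :) it's GREAT 123"))
# -> ['omg', 'check', 'it', 's', 'great']
```

One side effect worth noting: an a-z-only rule splits contractions ("it's" becomes "it s"); the original script may well treat apostrophes differently.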
  
Initially, I only looked at the raw tweet text and created my 1-4-gram tables based solely on how many times a word or phrase was used. Naturally, words like 'the', 'of', 'in' showed up on the top but surprisingly words like 'gangsta' and 'apache' also featured quite high. Digging in, I found that some users were typing the latter phrases way too many times, possibly band names or song lyrics. That's why I started looking at the user's ID to limit the effect a single user could have on the weight of a given word or phrase.
  ---------------------------------------------------------------------------------
  Total score for 'I love'                         204 points (2 users, 4 mentions)

If any user mentions 'I love' more than 99 times, it will not add any points beyond the first 101 + 98 = 199 points. This means a phrase said once each by two users will always outscore a phrase said 1000 times by a single user. However, five people saying a phrase 50 times each (150 * 5 = 750 points) will outscore seven people saying it seven times each (107 * 7 = 749 points).
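The scoring rule can be reconstructed from the worked numbers above: each distinct user contributes 100 base points plus one point per mention, with mentions capped at 99 per user. A minimal sketch (`phrase_score` is my naming, not the original script's):

```python
def phrase_score(mentions_per_user, cap=99):
    """Score a phrase: 100 base points per distinct user, plus one
    point per mention by that user, capped at `cap` mentions (so one
    prolific user can contribute at most 100 + 99 = 199 points)."""
    return sum(100 + min(count, cap) for count in mentions_per_user.values())

# 'I love' said by two users, four mentions total -> 204 points
print(phrase_score({"user_a": 1, "user_b": 3}))        # 204
# one user saying it 1000 times still caps out at 199
print(phrase_score({"user_a": 1000}))                  # 199
# five users, 50 mentions each
print(phrase_score({f"u{i}": 50 for i in range(5)}))   # 750
```

Because the per-user cap dominates, the score rewards breadth (many users) over sheer repetition by one user, which is exactly the 'gangsta'/'apache' fix described above.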
  
==== Download the code + full n-gram list ====
  
Download: [[https://ktype.net/share/ktype-twitter-ngrams.zip|ktype-twitter-ngrams.zip]] (30 MB).
  
The file contains sample tweet data, the original Python script I used, and the complete 1-, 2-, 3-, and 4-gram lists I generated. It does not contain the entire 9-million-tweet dataset. I had to prep the Infolab dataset by removing any newline characters within tweets and sorting the entire file by user ID, so if you download the original source files I used, you will have to do the same before the Python script can be run.
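That prep step might look something like the following sketch. The field order (user ID, tweet ID, text, timestamp) comes from the description above, but the tab-separated layout and the "three tabs per complete record" heuristic are my assumptions about the file format, and a tab inside the tweet text itself would defeat them:

```python
def prep(lines):
    """Re-join records whose tweet text contained embedded newlines,
    then sort the records by user ID. Assumes tab-separated fields:
    user_id, tweet_id, text, timestamp (i.e. 3 tabs per full record)."""
    records, buf = [], ""
    for line in lines:
        buf = (buf + " " + line.strip()) if buf else line.rstrip("\n")
        if buf.count("\t") >= 3:   # record is complete
            records.append(buf)
            buf = ""
    records.sort(key=lambda r: r.split("\t", 1)[0])
    return records

lines = [
    "u2\tt9\thello there\t2009-06-01\n",
    "u1\tt3\tgood\n",              # tweet text continues on the next line
    "morning\t2009-06-02\n",
]
for rec in prep(lines):
    print(rec)
```

For the real 9-million-line file you would stream the input and use an external sort (e.g. Unix `sort`) rather than holding everything in memory as this toy version does.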
|boo|621260|you ever|438918|a piece of|138742|the way to go|38504|gooo|115556|
|age|621109|headed to|438684|got home from|138625|is supposed to be|38497|happend|115442|
 