Word Analysis

A Zipfian Distributions Experiment

Insert text below:

Analyse

Remove punctuation

Remove Wikipedia references

Press enter again to submit

Introduction

What are the patterns in word use in the English language? This website is for exploring those patterns. Although its primary use is with English, it is possible to insert text from other languages and see similar results.

The first peculiarity about our language (and others) that can be seen when using this tool is that about 20% of unique words make up 80% of the total words in nearly any sizeable body of text. Does this mean that word occurrence can be predicted? Yes, it can: a word's percentage occurrence and its ranking in one piece of text are likely to be very similar to those of the same word in another piece of text.

It is because of this Zipfian distribution in word usage that we can spot trends and apply a 'formula' to something as abstract as linguistic expression.
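The 20%/80% observation can be checked on any passage. The following is a minimal sketch in Python, not the code this website runs; the function name `top_share` and the simple whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter

def top_share(text, fraction=0.2):
    """Return the share of all words covered by the most frequent
    `fraction` of unique words (hypothetical helper, for illustration)."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    # Number of unique words to keep; always at least one.
    keep = max(1, int(len(counts) * fraction))
    covered = sum(count for _, count in counts.most_common(keep))
    return covered / len(words)

sample = "the cat and the dog and the bird saw the fish"
print(round(top_share(sample), 2))
```

A sample this short will not reach 80%; the share usually only approaches that figure on passages of many thousands of words.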




Zipfian distributions

Zipf's law is a statistical formula that can be used to describe many types of data observed in both the physical and social sciences. It can also be used to predict data patterns for seemingly random occurrences.
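In its usual form, the law says that the frequency of an item is roughly inversely proportional to its rank. The symbols below (f for frequency, r for rank, s for the exponent) are introduced here for illustration and do not come from this page:

```latex
\[
  f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
  \quad\Longrightarrow\quad
  f(2) \approx \frac{f(1)}{2}, \quad f(3) \approx \frac{f(1)}{3}
\]
```

In other words, the second most common word should appear about half as often as the most common one, the third about a third as often, and so on. Real text only follows this approximately.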

To learn more about Zipfian distributions, read about them here, or watch this video by Vsauce.

As it turns out, the most common words in the English language are:

  • 1. the
  • 2. be
  • 3. to
  • 4. of
  • 5. and
  • 6. a
  • 7. in
  • 8. that
  • 9. have
  • 10. I
  • 11. it
  • 12. for
  • 13. not
  • 14. on
  • 15. with
  • 16. he
  • 17. as
  • 18. you
  • 19. do
  • 20. at



Instructions

To use this tool, simply type or paste words into the text box above. As with any statistical analysis, the more words you use, the better. I would recommend using a minimum of 100 words to start to see interesting results, although long passages of many thousands of words are preferred. To help you with this, I have included options to remove punctuation and Wikipedia references for those pasting in text from Wikipedia.
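The two clean-up options could work roughly as follows. This is a sketch under stated assumptions, not this site's actual implementation: the function name `clean_text` is hypothetical, and "Wikipedia references" is assumed to mean bracketed markers such as [1] or [citation needed].

```python
import re
import string

def clean_text(text, remove_punctuation=True, remove_references=True):
    """Hypothetical clean-up step mirroring the two options described above."""
    if remove_references:
        # Assumed reference format: [1], [23] or [citation needed].
        # This runs first, because stripping punctuation would remove the brackets.
        text = re.sub(r"\[(?:\d+|citation needed)\]", "", text)
    if remove_punctuation:
        # Deletes every ASCII punctuation character, including apostrophes,
        # so "Zipf's" becomes "Zipfs".
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any leftover runs of whitespace into single spaces.
    return " ".join(text.split())

print(clean_text("Zipf's law[1] holds, roughly.[citation needed]"))
```

Without this step, "law" and "law," would be counted as two different words, which distorts the results on short passages.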

On the results page, you will see a graph showing each distinct word, ordered by how often it occurs, with the percentage of the text it accounts for. If your passage follows a Zipfian distribution, you should see a curved graph, with about 80% of the total words accounted for by the first 20% of the graph.
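The data behind such a graph can be sketched in a few lines of Python. The function name `word_percentages` is an assumption, and words are split on whitespace only; the site itself may tokenise differently.

```python
from collections import Counter

def word_percentages(text):
    """Return (word, percentage) pairs, most frequent word first
    (hypothetical helper, for illustration)."""
    words = text.lower().split()
    total = len(words)
    counts = Counter(words)
    # most_common() sorts by count, highest first; an empty text gives [].
    return [(word, 100 * count / total) for word, count in counts.most_common()]

for word, percent in word_percentages("to be or not to be that is the question"):
    print(f"{word:10s} {percent:5.1f}%")
```

Plotting the percentages in this order gives the curve described above.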

Following this, there will be a section with 'cards' giving you more information about the text. Included are the most common words found in your piece of text, and the most frequent 'uncommon' words in your text.
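The two cards could be produced along these lines. This is only a sketch: how the site decides that a word is 'uncommon' is not stated, so here it is assumed to mean any word outside the list of 20 common English words given earlier, and the names `COMMON_WORDS` and `summarise` are hypothetical.

```python
from collections import Counter

# Assumption: the 20 most common English words listed on this page.
COMMON_WORDS = {
    "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
    "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
}

def summarise(text, top_n=10):
    """Return (most common words, most frequent 'uncommon' words)."""
    counts = Counter(text.lower().split())
    most_common = [word for word, _ in counts.most_common(top_n)]
    # Walk the full ranking and keep only words outside the common list.
    uncommon = [word for word, _ in counts.most_common()
                if word not in COMMON_WORDS][:top_n]
    return most_common, uncommon

common, uncommon = summarise("the law of the rank and the law", top_n=2)
print(common, uncommon)
```

The 'uncommon' list tends to reveal what a passage is actually about, because the most common words are nearly the same in every text.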

Try to use more words next time for more interesting results

The list below shows the ten most used words in this piece of text

The list below shows the ten most frequent 'uncommon' words found in this passage

Zipfian distributions

If your graph shows a steep decrease in word use that gradually levels off as you move from the most common words to the least common, then your text shows a Zipfian distribution in word use.
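Judging the curve by eye can be backed up with a rough numeric check. A Zipfian text plotted with both axes on a logarithmic scale falls close to a straight line with a slope near -1. The sketch below fits that line by ordinary least squares; the function name `zipf_slope` is an assumption, and this is not a calculation the site is known to perform.

```python
import math
from collections import Counter

def zipf_slope(text):
    """Least-squares slope of log(count) against log(rank)
    (hypothetical helper, for illustration)."""
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Artificial text whose counts are exactly 12, 6, 4, 3 (that is, 12 / rank).
ideal = " ".join(["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3)
print(zipf_slope(ideal))
```

On real passages the slope will only be approximately -1, and short texts can stray a long way from it.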