Word Frequency Comparison

This app takes large amounts of text and helps you check whether you're overusing particular words. It started the way most things do these days: you run across an errant post on social media that RUINS YOUR DAY. For me, that was author Selene Kallan's Facebook repost of her retweet. My own manuscript of nearly 90,000 words had 47 instances of "sigh" in it. Was that too many? Too few? Would other writers want to know?

So I solved this problem like I do most problems in my life: I scraped more data from the web than anyone else ever has on such a snipe hunt. Specifically, I used the Brown Corpus of American English, which has about a million words from various sources, all bundled up nicely for analysis. Very helpfully, its data can be broken down by lemma, which is basically the root form of a word, so plurals and tense changes don't get binned separately (e.g., "sigh", "sighs", and "sighed" all get counted as "sigh").
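
To make the lemma idea concrete, here is a minimal Python sketch of counting by lemma. It is not the code behind this app: the TOY_LEMMAS table and the count_lemmas function are hypothetical names invented for illustration, and a real tool would lean on a full lemmatizer or the corpus's own lemma annotations.

```python
import re
from collections import Counter

# Hypothetical, hand-made lemma table for illustration only.
# Any word not listed here is treated as its own lemma.
TOY_LEMMAS = {
    "sighs": "sigh",
    "sighed": "sigh",
    "sighing": "sigh",
}

def count_lemmas(text):
    """Lowercase the text, split it into words, and count by lemma."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(TOY_LEMMAS.get(word, word) for word in words)

sample = "She sighed. He sighs, and then they both sigh."
counts = count_lemmas(sample)
print(counts["sigh"])  # 3: "sighed", "sighs", and "sigh" share one bin
```

The point of the lookup step is that "sighed" and "sighs" never get their own rows, so the count for "sigh" reflects every form of the word.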

One note: I ran into a problem when I compiled the lemmas. The process seemed to miss the verb "be", which makes up about 7% of American English, which I find hilarious. I could not replicate the error, which means there may be more silliness lurking in this tool. If you see something like that, please contact me.

Instructions
  1. Spell-check your work. This app does not have any ability to do a spell check.
  2. Paste your text down below. Large amounts of text may crash the browser. My full 90,000 words did, but about half of that didn't, so some experimentation may be needed.
  3. Click on the "Word Frequency" button. Depending on how much text you pasted, this will take A WHILE. Go do something else. Seriously. My build processed about 1,000 words per minute on my benchmark.
  4. Keep pasting and clicking on "Word Frequency" until you have added all the text you want analyzed.
  5. Click on the "Analyze Frequencies" button.
  6. Patience is a virtue. This will take a bit.
  7. Your lemmas will appear in the table below, ordered by how far off they are from the word frequency list in the Brown Corpus. You will probably see things like proper names pop to the top of the list, and that's to be expected. But maybe you'll see some other offenders that you may want to cut back on.
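
The paste-in-chunks-then-analyze flow above can be sketched in Python as follows. This is a rough illustration, not the app's actual code: compare_to_corpus is a hypothetical name, the frequencies are made-up toy numbers, "mira" is an invented character name, and ranking by the size of the absolute difference is an assumption about what "how far off" means.

```python
from collections import Counter

def compare_to_corpus(chunk_counts, corpus_freq):
    """Merge per-chunk lemma counts, then rank lemmas by distance from the corpus."""
    total_counts = Counter()
    for counts in chunk_counts:
        total_counts.update(counts)  # adds counts together, chunk by chunk
    total_words = sum(total_counts.values())

    rows = []
    for lemma, count in total_counts.items():
        yours = count / total_words
        corpus = corpus_freq.get(lemma, 0.0)  # lemmas missing from the corpus get 0
        absolute = yours - corpus
        relative = absolute / corpus if corpus > 0 else float("inf")
        rows.append((lemma, count, yours, corpus, absolute, relative))

    # Assumption: biggest absolute gap first.
    rows.sort(key=lambda row: abs(row[4]), reverse=True)
    return rows

# Two pasted chunks, already reduced to lemma counts (toy data).
chunks = [
    Counter({"sigh": 2, "the": 5, "mira": 3}),
    Counter({"sigh": 1, "the": 5, "mira": 4}),
]
toy_corpus_freq = {"the": 0.5, "sigh": 0.05}  # made-up numbers, not Brown Corpus values

rows = compare_to_corpus(chunks, toy_corpus_freq)
for lemma, count, yours, corpus, absolute, relative in rows:
    print(lemma, count, round(yours, 3), corpus, round(absolute, 3), relative)
```

In this toy run the invented name "mira" lands first because it never appears in the corpus, which mirrors the proper-name effect described in step 7.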

Epilogue: My 47 instances of "sigh" represented 0.054% of my manuscript, as opposed to the Brown Corpus, where it is 0.051%. It turns out my heroine does not sigh too often, after all.
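
For anyone who wants to check that arithmetic, here is a small Python sketch. The word count of 87,000 is an assumption picked to be consistent with "nearly 90,000" and the 0.054% figure; the exact manuscript length is not stated here.

```python
sigh_count = 47
assumed_word_count = 87_000  # assumption: the exact manuscript length is not given
corpus_percent = 0.051       # Brown Corpus figure quoted in the epilogue

my_percent = 100 * sigh_count / assumed_word_count
relative_gap = (my_percent - corpus_percent) / corpus_percent

print(f"{my_percent:.3f}%")                         # 0.054%
print(f"{relative_gap:.0%} above the corpus rate")  # 6% above the corpus rate
```

Under that assumed length, the gap works out to roughly three extra sighs over the whole manuscript.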

Lemma | Count | Your Frequency | Corpus Frequency | Absolute Difference | Relative Difference