This is part 3 of my Quixotic tilt against the windmills. Find Part 1 here, and Part 2 here.

So for my previous article, I wanted to include a quick graphic that compared adverbs and nonadverbs in my text.

I have clearly made a rookie mistake. I forgot about the Stop Words!

Stop words are basically the glue of a sentence: they don’t carry much unique meaning of their own, but the sentence needs them to function grammatically. They help the other words relate to each other, but if you yank them out of context they lose any analyzable meaning. Non-stop words, meanwhile, are still analyzable out of context, because they are rare enough that their mere presence tells you something.

More importantly, for my quest of growing ridiculousness, should adverbs that are also stop words really be counted as adverbs? Like, the most represented adverb in my story is “not”. Can I actually get rid of “not” when it occurs without changing the whole meaning of the sentence?

So I will have to start from square one and pick what is and isn’t a stop word. You’d think it would be easy, but lists abound. The NLTK list is one of the most commonly used in Python, so that’s what I elected to use. Of note, about 40% of the Brown Corpus is NLTK stop words. I smell Zipf.
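For the curious, here is a rough sketch of how a stop word share like that 40% could be measured. It is not my actual script: the tiny `STOP_WORDS` set and the sample sentence are stand-ins so the snippet runs on its own. With NLTK installed and its data downloaded, the real list should be available as `nltk.corpus.stopwords.words("english")` and the corpus as `nltk.corpus.brown.words()`.

```python
from collections import Counter

# Stand-in stop word list (an assumption for illustration, NOT the NLTK list).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "not", "was"}


def stop_word_share(tokens, stop_words):
    """Return the fraction of alphabetic tokens that are stop words."""
    words = [t.lower() for t in tokens if t.isalpha()]  # drop punctuation
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in stop_words)
    return hits / len(words)


def most_common_word(tokens):
    """Return the single most frequent lowercase word and its count."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(1)[0]


# Made-up sample text, split naively on whitespace.
sample = "The ghost of Zipf was not in the attic and it was not in the cellar".split()

share = stop_word_share(sample, STOP_WORDS)
top = most_common_word(sample)
print(f"stop word share: {share:.0%}, most common word: {top}")
```

Even in a toy sentence the most frequent word is a stop word, which is the Zipf smell in miniature.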

Enough about the hygiene of mathematicians. When I remove the stop words from the Brown Corpus (fiction/adventure/mystery/romance/sci-fi), 6.6% of the remaining words are adverbs.

Wait – my stuff is 7% – maybe I don’t have too many adverbs after all!

I’m also a little higher now, 7.2%, but we’re making progress. It seems most of that progress came from one fix: I had forgotten that the Stanford POS tagger classifies the “n’t” at the end of contractions as a separate instance of “not”. That is good behavior, but I had not added “n’t” to my stop word list. The ghost of Zipf haunts again.
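To show what the “n’t” slip does to the numbers, here is a minimal sketch, again not my real pipeline. The tagged sentence is written by hand to mimic what I understand the tagger to produce (Penn Treebank tags, with “n't” split off and tagged `RB`), and `BASE_STOPS` is a made-up three-word stop list.

```python
# Penn Treebank adverb tags: plain, comparative, superlative.
ADVERB_TAGS = {"RB", "RBR", "RBS"}


def adverb_share(tagged, stop_words):
    """Share of adverbs among the tokens left after dropping stop words."""
    content = [(w, t) for w, t in tagged if w.lower() not in stop_words]
    if not content:
        return 0.0
    adverbs = sum(1 for _, tag in content if tag in ADVERB_TAGS)
    return adverbs / len(content)


# Hand-tagged stand-in for "She didn't run quickly" (an assumption, not tagger output).
TAGGED = [
    ("She", "PRP"),
    ("did", "VBD"),
    ("n't", "RB"),
    ("run", "VB"),
    ("quickly", "RB"),
]

# Hypothetical stop list that contains "not" but forgets the clitic "n't".
BASE_STOPS = {"she", "did", "not"}

before = adverb_share(TAGGED, BASE_STOPS)           # "n't" still counted as an adverb
after = adverb_share(TAGGED, BASE_STOPS | {"n't"})  # "n't" treated as a stop word

print(f"before: {before:.1%}, after: {after:.1%}")
```

One forgotten token is enough to move the percentage, and since “n’t” is everywhere in dialogue, the effect adds up fast.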

[Figure: side-by-side panels, “Brown Fiction” and “Sandra is Phoenix”]
Say what you will, I sure killed “said”.

Now it seems I only need to remove 272 adverbs to normalize my work, but then again, if “said” is considered normal in the Brown Corpus, then what am I actually doing?
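A number like that can be worked out with a little algebra. If a text has a adverbs among n counted words, deleting k adverbs leaves a share of (a - k) / (n - k). Setting that equal to a target share t and solving gives k = (a - t * n) / (1 - t). The sketch below applies that formula. The counts in it are invented round numbers for illustration, not the real counts from my manuscript, and the function name is hypothetical.

```python
import math


def adverbs_to_remove(adverbs, total, target):
    """Smallest number of adverbs to delete so the adverb share is <= target.

    Deleting k adverbs changes the share to (adverbs - k) / (total - k);
    solving for k at the target share gives (adverbs - target * total) / (1 - target).
    """
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    k = (adverbs - target * total) / (1 - target)
    return max(0, math.ceil(k))


# Invented example: 72 adverbs in 1,000 counted words (7.2%), target 6.6%.
k = adverbs_to_remove(72, 1000, 0.066)
new_share = (72 - k) / (1000 - k)
print(f"remove {k} adverbs, new share {new_share:.2%}")
```

Note that the denominator shrinks along with the numerator, so the answer is a bit larger than simply multiplying the word count by the gap between the two percentages.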

OMG, is this series actually over?

Categories: Data

Greg Neyman

Father, Physician, Computer Programmer, and now Author, apparently?
