the blog for the eggnchips search engine research project

Search and stop words

By jasonslater • Jun 20th, 2008 • Category: Lead Story

Whilst planning how to build a word and phrase vocabulary for use in the EggnChips project, we need to make a decision regarding the use of stop words.

Stop words are short, common, everyday words such as “a”, “when”, “besides”, and “to” that search engines sometimes ignore when a user is searching for information. There are positive and negative sides to this. On the positive side, ignoring them makes word lists much shorter, which makes calculations quicker. On the negative side, it can strip away the context of the intended search. The problem can be compounded by stemming (we will cover stemming in a later post, but as an example “besides” may be stemmed to “beside”).

An example of the negative side: suppose I am searching for images of “plants beside water”. If “beside” is filtered as a stop word (as “besides” would be too, once stemmed to “beside”), then the search engine may think I am interested in “plant water” or “water plants” or something else, which would likely affect the results I receive.
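The effect can be sketched in a few lines of Python. This is only an illustration, not the EggnChips implementation: the `STOP_WORDS` set is a hypothetical sample and `remove_stop_words` is a name made up for this post.

```python
# Hypothetical sample stop word list; a real list would be much longer.
STOP_WORDS = {"a", "beside", "to", "when"}


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Return the query's words, lowercased, with stop words dropped."""
    return [word for word in query.lower().split() if word not in stop_words]


filtered = remove_stop_words("plants beside water")
print(filtered)  # ['plants', 'water']
```

Once “beside” is gone, nothing in the remaining words says how the plants and the water relate to each other.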


An example portion of a stop word list may be:

b ba back be became because become been before began beget behind being beside best bet between big bin both but by

Stemming could effectively increase the size of the list, turning the following variations of the word “back” into stop words as well: backed, backing, backs, backend, backy.
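A rough sketch of how this expansion happens is below. It assumes a deliberately naive suffix-stripping stemmer (`naive_stem`, invented here for illustration and not a real stemming algorithm such as Porter's) and a small sample of the list above. Which variations actually collapse to “back” depends entirely on the stemmer chosen.

```python
# Small sample taken from the example list above.
STOP_WORDS = {"back", "be", "beside", "but", "by"}

# Illustrative suffixes only; a real stemmer has far more careful rules.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def is_stop_word(word):
    """A word counts as a stop word if its stem is on the list."""
    return naive_stem(word.lower()) in STOP_WORDS


for word in ["backed", "backing", "backs", "besides", "backend"]:
    print(word, is_stop_word(word))
```

With this particular stemmer “backed”, “backing”, “backs” and “besides” are all treated as stop words, while “backend” is not, because none of the three suffixes matches it. A more aggressive stemmer could pull in more words.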

There does not appear to be a universal list of stop words, which complicates matters. Language variation is another important factor: a word “as typed” could be a stop word in one language but not in others.
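One way to picture the language problem is to keep a separate list per language. The sketch below assumes two tiny hypothetical lists keyed by language code; `STOP_WORDS_BY_LANGUAGE` and `is_stop_word_in` are names invented for this example. The word “die” is a good test case: it is the definite article in German but a meaningful word in English.

```python
# Tiny hypothetical lists; real lists would be far longer.
STOP_WORDS_BY_LANGUAGE = {
    "en": {"a", "the", "to", "when"},
    "de": {"der", "die", "das", "und"},
}


def is_stop_word_in(word, language):
    """Check a word against the list for one language only."""
    return word.lower() in STOP_WORDS_BY_LANGUAGE.get(language, set())


print(is_stop_word_in("die", "de"))  # True
print(is_stop_word_in("die", "en"))  # False
```

This only works if the search engine knows which language the query is in, which is a separate problem.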

Our initial view is that we may generate a stop word list but not apply it by default, or perhaps make it optional. This way we could assess the difference between the results gained with and without stop words for a particular search phrase.
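The comparison we have in mind could look something like the sketch below. Everything in it is an assumption for illustration: the three documents are made up, the `use_stop_words` switch is a hypothetical option, and matching is a simple "document contains every query word" test rather than anything EggnChips actually does.

```python
# Hypothetical sample stop word list.
STOP_WORDS = {"a", "beside", "by", "to"}

# Made-up documents standing in for an index.
DOCUMENTS = {
    "doc1": "plants beside water",
    "doc2": "how to water plants",
    "doc3": "water pump repair",
}


def tokenize(text, use_stop_words):
    """Split text into a set of words, optionally dropping stop words."""
    words = text.lower().split()
    if use_stop_words:
        words = [word for word in words if word not in STOP_WORDS]
    return set(words)


def search(query, use_stop_words):
    """Return the ids of documents containing every query word."""
    terms = tokenize(query, use_stop_words)
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if terms <= tokenize(text, use_stop_words)
    )


with_list = search("plants beside water", use_stop_words=True)
without_list = search("plants beside water", use_stop_words=False)

print(with_list)     # ['doc1', 'doc2']
print(without_list)  # ['doc1']
```

With the stop word list switched on, “how to water plants” is returned as well, even though it has nothing to do with plants growing beside water. Running the same phrase both ways makes that kind of difference easy to spot.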
