Building a bot, phase I
By jasonslater • Jul 21st, 2008 • Category: Lead Story
Bots are widely used by search engines. When users recommend sites they are sometimes asked to enter information such as Title, URL, Description and possibly some Keywords. Some search engines, however, simply require a URL, and the essential information is then retrieved by an automated mechanism: a bot.
Bots are ideal for this repetitive activity: they take a work queue (a given list of URLs) and produce a set of information that can be passed to the next stage in the submission process. Even though they are called bots, they are essentially just software programs that run either at a scheduled time or in a loop, constantly checking the work queue.
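To make the loop idea concrete, here is a minimal Python sketch of a bot that drains a work queue and then polls for more work. The names run_bot, process_url, poll_seconds and max_idle_polls are hypothetical and chosen only for illustration; the real processing step would fetch and parse each page.

```python
import time
from collections import deque


def process_url(url):
    """Hypothetical stand-in for the real fetch-and-parse step."""
    return {"url": url, "title": None, "description": None}


def run_bot(work_queue, handler, poll_seconds=5.0, max_idle_polls=None):
    """Process queued URLs in order, polling again when the queue is empty.

    With max_idle_polls=None the loop never ends, as a real bot would run.
    A small number makes the loop stop after that many empty polls, which
    is convenient for trying it out.
    """
    results = []
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        if work_queue:
            # Take the oldest URL first so submissions are handled in order.
            results.append(handler(work_queue.popleft()))
            idle_polls = 0
        else:
            idle_polls += 1
            time.sleep(poll_seconds)
    return results


queue = deque(["http://example.com/", "http://example.org/"])
records = run_bot(queue, process_url, poll_seconds=0, max_idle_polls=1)
```

A scheduled run would call the same handler once per queued URL and exit, leaving the timing to something like cron instead of the sleep call.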
In addition to the usual information, a bot can periodically extract additional words from the body text itself by stripping out HTML tags, removing stop words (very common words such as "the" and "and") and building a word frequency table. A word frequency table is simply a list of words with the number of times each word appears in the given text.
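A word frequency table of this kind can be sketched with the Python standard library alone. The stop word list below is a tiny illustrative sample, not a recommended set, and the names TextExtractor and word_frequencies are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A deliberately small, illustrative stop word list (an assumption).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside script or style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def word_frequencies(html):
    """Return a Counter mapping each non-stop word to its number of occurrences."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    words = re.findall(r"[a-z']+", text)
    return Counter(word for word in words if word not in STOP_WORDS)


sample = "<p>The bot reads the page and the bot counts words.</p>"
table = word_frequencies(sample)
```

Calling table.most_common(10) would then list the ten most frequent words, which is one plausible way to suggest keywords for a submitted link.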
Recently I have been working on a way of automatically extracting information for a given URL by fetching the HTML from a submitted URL, parsing the relevant information and using it as input to the submission process. Information is often placed in meta tags in the HEAD section of the HTML, but so far I have found this process to be somewhat hit and miss, as some sites include the information and others do not. For those that do not, an additional way needs to be identified to provide a suitable link title and description. It may be the case that this will always require some user intervention, but further analysis should identify more options.
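As an illustration of the HEAD parsing step, the following Python sketch pulls the title, description and keywords out of a page that has already been downloaded. It is not the program described in this post; HeadInfoParser and extract_head_info are hypothetical names, and the fetch itself is left out. Missing fields come back as empty strings, which is where the fallback or user intervention mentioned above would take over.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.info = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.info[name] = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.info["title"] += data


def extract_head_info(html):
    """Return a dict with title, description and keywords ('' when absent)."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    parser.info["title"] = parser.info["title"].strip()
    return parser.info


page = """<html><head>
<title>Example page</title>
<meta name="description" content="A short sample description.">
</head><body><p>Hello</p></body></html>"""
info = extract_head_info(page)
```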
The bot also needs to be aware of the period of time since a web page last changed, to ensure that valuable information or updates are not missed and that processing power isn’t wasted on a link that does not change very much. One way to address this is to produce a hash for a page and compare it with a previously stored hash.
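The comparison could look roughly like the Python sketch below. It uses SHA-256 from the standard hashlib module purely as a convenient choice; any consistent hash would do. The in-memory dictionary stands in for whatever storage the bot really uses, and page_hash and has_changed are hypothetical names.

```python
import hashlib


def page_hash(content: bytes) -> str:
    """Return a hex digest that changes whenever the page bytes change."""
    return hashlib.sha256(content).hexdigest()


def has_changed(url, content, stored_hashes):
    """Compare the page's hash with the stored one, then store the new hash.

    A URL that has never been seen before counts as changed.
    """
    new_hash = page_hash(content)
    changed = stored_hashes.get(url) != new_hash
    stored_hashes[url] = new_hash
    return changed


hashes = {}
first_visit = has_changed("http://example.com/", b"<html>v1</html>", hashes)
second_visit = has_changed("http://example.com/", b"<html>v1</html>", hashes)
after_edit = has_changed("http://example.com/", b"<html>v2</html>", hashes)
```

One caveat with hashing the raw page: content that differs on every request, such as a timestamp or rotating advert, will make the page look changed each time, so a real bot may want to hash only the extracted text.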
A hash, simply put, is the result of a consistent algorithm applied to information. MD5 is a well-known hash mechanism that I may investigate in a later article. For now, a simple way to demonstrate a hash is to calculate a check digit: take the letter values of a phrase, add them together, then find the remainder after dividing by a number, say 27 (allowing for 26 letters and the space). In the example, the first two phrases give the same answer, whereas the third phrase, with one letter different, gives a different check digit.
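The check digit calculation can be sketched in a few lines of Python. The letter values used here (space = 0, a = 1 through z = 26) are an assumption, since the text does not spell out the numbering, and the phrases are made up for illustration.

```python
def check_digit(phrase, modulus=27):
    """Sum the letter values of a phrase and return the remainder mod 27.

    Assumed values: space = 0, a = 1, ..., z = 26. Case is ignored and any
    other character is skipped.
    """
    total = 0
    for ch in phrase.lower():
        if "a" <= ch <= "z":
            total += ord(ch) - ord("a") + 1
        # A space contributes 0; other characters are ignored.
    return total % modulus


same_a = check_digit("hello world")
same_b = check_digit("Hello World")   # same letters, so the same check digit
different = check_digit("hello worle")  # one letter changed
```

With these values "hello world" sums to 124, giving a check digit of 16, while "hello worle" sums to 125, giving 17. A plain sum like this is easy to fool: rearranging the letters (for example "listen" and "silent") leaves the check digit unchanged, which is one reason stronger hashes such as MD5 exist.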
I have a working program up and running. It does not run in bot mode yet, but it can be initiated manually. I still need to add the loop code to continually check the work queue for submitted URLs.