Google looks to penetrate ‘Deep Web’ with HTML forms crawling

April 24, 2008

Google’s ever-active search bots, which constantly scour the Web for new pages, have started a novel, more active phase of their indexing work. Alon Halevy and Jayant Madhavan of Google’s crawling and indexing team said the firm has launched an experiment in which its indexing software enters text in web site forms to see what previously undiscovered pages appear.

Over the last few months, the two said, Google has been exploring some HTML forms in an effort to discover new Web pages and URLs that it otherwise could not find and index for users who search on Google.

This experiment, according to them, is part of Google’s broader effort to increase and enhance its coverage of the Web. HTML forms have long been thought of as the ‘gateway’ to large volumes of data beyond the normal purview of search engines.

The new Google indexing practice will involve only ‘high quality’ sites and will honor ‘robots.txt’ files, so forms that a site disallows there will not be crawled. To decide what words to type into the forms, the indexing software samples terms from the web page surrounding the form. With this experiment, which also covers forms with drop-down boxes and select menus, Google has taken one step closer to the Deep Web.
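
To make the idea of sampling form inputs from the surrounding page more concrete, here is a minimal Python sketch. It is not Google's implementation, which has not been published in this detail. The function names, the tiny stop-word list, the `ExampleBot` user agent, and the assumption that the form submits via GET are all hypothetical choices made for illustration.

```python
import random
import re
from urllib.parse import urlencode, urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical stop-word list; a real crawler would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "by", "on"}


def sample_terms(page_text, k, seed=0):
    """Sample up to k distinct candidate terms from the text around a form."""
    words = re.findall(r"[a-z]+", page_text.lower())
    # Keep distinct words that are longer than two letters and not stop words.
    candidates = sorted({w for w in words if len(w) > 2 and w not in STOP_WORDS})
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(candidates, min(k, len(candidates)))


def form_urls(page_url, action, field, terms, robots_txt, agent="ExampleBot"):
    """Build one GET URL per term, skipping any URL that robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    urls = []
    for term in terms:
        # Assumes the form uses method GET, so the input ends up in the query string.
        url = urljoin(page_url, action) + "?" + urlencode({field: term})
        if robots.can_fetch(agent, url):
            urls.append(url)
    return urls
```

Calling `sample_terms` on the text near a search box and passing the result to `form_urls` yields the candidate URLs a crawler could then request. If the site's robots.txt disallows the form's action path, the list comes back empty.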
