Updating the Google Search Appliance
Our older spider from another company never had any problem with these two sites: is there some hidden collection limit or content size issue that might be causing the problem?

Dr Search replies: The bad news is that just because one spider can index a site does not mean that other spiders can access it. The GSA, like Google, Yahoo and other respectable web crawlers, tends to respect directives, although there may be some variation. But there is no guideline that requires any crawler to respect any of your directives.

The status code to watch for is '404': file not found. In Figure 1, you can see that our internal site has no robots.txt, so the GSA will start to fetch content.

What you want to look for in your log file is any activity from your GSA crawler, which identifies itself as 'gsa-crawler' in what is known as the User Agent. The GSA further provides a portion of the GSA license identifier and the email address of the GSA administrator.
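For reference, a minimal robots.txt that admits the GSA crawler while keeping other spiders out of a private area might look like the sketch below (the /private/ path is hypothetical, chosen only for illustration):

```
# Allow the GSA crawler everywhere; keep all other spiders out of /private/
User-agent: gsa-crawler
Disallow:

User-agent: *
Disallow: /private/
```

Remember, though, that these directives are advisory: a well-behaved crawler honours them, but nothing forces a crawler to do so.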
This last bit is so webmasters know who to contact if a web spider is unwanted.

Check to see whether your GSA crawler has ever tried to fetch pages. The numeric value just before the Agent Name is the HTTP status code your web server returned to the GSA. There are other 'normal' return values, so don't worry too much if you see many other values (see the Resources section below for pointers to more information on web server status codes).

For additional information on using robots.txt, visit:

I expect the steps in this article will have solved your problem with crawling all of your internal sites. If not, email us and we can drill down on your specific issue.
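To make the log check concrete, here is a short Python sketch that filters Apache-style access log lines for gsa-crawler requests and reports the path and status code of each. The sample log lines, license token and administrator address below are hypothetical placeholders, not values from a real GSA:

```python
import re

# Hypothetical Apache combined-log lines; the license ID (T4-XXXXXXX) and
# admin address are illustrative stand-ins for what a real GSA would send.
log_lines = [
    '10.0.0.5 - - [12/Mar/2009:10:15:32 -0500] "GET /robots.txt HTTP/1.1" 404 209 '
    '"-" "gsa-crawler (Enterprise; T4-XXXXXXX; admin@example.com)"',
    '10.0.0.5 - - [12/Mar/2009:10:15:33 -0500] "GET /index.html HTTP/1.1" 200 5120 '
    '"-" "gsa-crawler (Enterprise; T4-XXXXXXX; admin@example.com)"',
    '192.168.1.9 - - [12/Mar/2009:10:16:01 -0500] "GET / HTTP/1.1" 200 4096 '
    '"-" "Mozilla/5.0"',
]

def gsa_hits(lines):
    """Return (path, status) pairs for requests made by the GSA crawler."""
    # Capture the requested path and the three-digit status code that
    # follows the quoted request string in the combined log format.
    pattern = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})')
    hits = []
    for line in lines:
        if "gsa-crawler" not in line:
            continue  # skip requests from other user agents
        m = pattern.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits

for path, status in gsa_hits(log_lines):
    print(path, status)
```

A 404 against /robots.txt, as in the first sample line, is the harmless case described above: the file simply is not there, and the crawler proceeds to fetch content.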