Today I ran into a weird little thing. A client's SharePoint 2013 environment would not crawl at all. It threw the following standard error, which I am sure you have all seen a million times.
“Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive. (This item was deleted because it was excluded by a crawl rule)”
I tried everything I could think of: deleting and re-creating result sources, crawling a single page, crawling not as SharePoint, and removing the “robots.txt” files, to name a few things. I then resorted to trawling through the “web.config” and found the following entries:
<add name="X-Content-Type-Options" value="nosniff" />
<add name="X-MS-InvokeApp" value="1; RequireReadOnly" />
HTTP Header <add name="X-Content-Type-Options" value="nosniff" />
Each type of file delivered from the web server has an associated MIME type (also called a “content-type”, and no, not a SharePoint Content Type) that describes the nature of the content (e.g. image, text, application, etc.). For compatibility reasons, Internet Explorer has a MIME-sniffing feature that will attempt to determine the content-type for each downloaded resource. In some cases, Internet Explorer reports a MIME type different from the type specified by the web server. For instance, if Internet Explorer finds HTML content in a file delivered with the HTTP response header Content-Type: text/plain, IE determines that the content should be rendered as HTML. Because of the number of legacy servers on the web (e.g. those that serve all files as text/plain), MIME-sniffing is an important compatibility feature.
Unfortunately, MIME-sniffing can also lead to security problems for servers hosting untrusted content. Consider, for instance, the case of a picture-sharing web service which hosts pictures uploaded by anonymous users. An attacker could upload a specially crafted JPEG file that contained script content, and then send a link to the file to unsuspecting victims. When the victims visited the server, the malicious file would be downloaded, the script would be detected, and it would run in the context of the picture-sharing site. This script could then steal the victim’s cookies, generate a phony page, etc. Sending the “X-Content-Type-Options: nosniff” header tells the browser not to sniff the response away from the content-type the server declared, which closes off this kind of attack.
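To make the sniffing behaviour a bit more concrete, here is a toy model in Python. It is only a sketch: the function name `rendered_type` and the "look for HTML markers in the first 256 bytes" heuristic are my own assumptions for illustration, not Internet Explorer's real algorithm.

```python
def rendered_type(declared_type: str, body: bytes, nosniff: bool = False) -> str:
    """Toy model of MIME sniffing (hypothetical heuristic, not IE's real logic)."""
    if nosniff:
        # X-Content-Type-Options: nosniff -> trust the type the server declared.
        return declared_type
    # Assumed heuristic: HTML-looking markers near the start override the declared type.
    head = body[:256].lower()
    if b"<html" in head or b"<script" in head:
        return "text/html"
    return declared_type


# An "image" or "text" upload that actually carries script content.
payload = b"<html><script>alert('stolen cookie')</script></html>"

print(rendered_type("text/plain", payload))                # sniffed as HTML
print(rendered_type("text/plain", payload, nosniff=True))  # declared type is kept
```

Without the header the payload is treated as HTML and the script would run; with “nosniff” the declared type wins.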
HTTP Header <add name="X-MS-InvokeApp" value="1; RequireReadOnly" />
The “DirectInvoke” feature in Internet Explorer enables applications to register their MIME types for direct invocation from a URL. When Internet Explorer encounters a file type that it doesn’t handle natively, it can use a handler application rather than downloading the file. With “DirectInvoke”, handler applications control how their files are downloaded, which enables smart techniques specific to the application’s requirements. Microsoft Office maintains a local cache of documents on a user’s machine. When a user attempts to download an Office document in Windows Internet Explorer, “DirectInvoke” calls Office using the target URL, rather than downloading the file. Office checks its cache and only downloads the file if it isn’t already in the cache. This behavior can provide significant bandwidth savings, especially when handling large, media-rich documents.
When the header value in the “web.config” is set to “1; RequireReadOnly”, the server is requesting that a “DirectInvoke”-configured application be used and requires that the file opens in read-only mode.
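As a rough illustration of how that value breaks down, the sketch below splits it into the “use DirectInvoke” flag and any extra options. The helper `parse_invoke_app` is hypothetical and simply assumes a semicolon-separated format based on the one value seen in this “web.config”; it is not a documented parser.

```python
from typing import List, Tuple


def parse_invoke_app(value: str) -> Tuple[bool, List[str]]:
    """Split an X-MS-InvokeApp value into (enabled, options).

    Assumption: the first token is "1" when DirectInvoke is requested and
    any remaining semicolon-separated tokens are options.
    """
    parts = [p.strip() for p in value.split(";")]
    enabled = parts[0] == "1"
    options = [p for p in parts[1:] if p]
    return enabled, options


enabled, options = parse_invoke_app("1; RequireReadOnly")
print(enabled, options)
```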
For this environment I removed the two lines and the crawl worked. I need to do some further investigation, but right now this worked perfectly. To my mind, the security and functionality impact for the “default” zone crawl is negligible; for the external side I would not remove these at all. I will post updates once I have more detail but for now, this has done the trick. More to come 🙂
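If you would rather script the removal than hand-edit the file, something along these lines should work. Treat it as a sketch under assumptions: the sample XML is a cut-down stand-in for a real “web.config”, the `customHeaders` location is where IIS normally keeps custom headers, and `strip_headers` and “X-Example-Keep” are names I made up for the example.

```python
import xml.etree.ElementTree as ET

# The two headers removed in this post.
HEADERS_TO_REMOVE = {"X-Content-Type-Options", "X-MS-InvokeApp"}

# Cut-down, hypothetical stand-in for the relevant part of a web.config.
SAMPLE = """<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <add name="X-Content-Type-Options" value="nosniff" />
        <add name="X-MS-InvokeApp" value="1; RequireReadOnly" />
        <add name="X-Example-Keep" value="yes" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>"""


def strip_headers(config_xml: str) -> str:
    """Return the config with the listed <add> custom headers removed."""
    root = ET.fromstring(config_xml)
    for parent in root.iter("customHeaders"):
        # Copy the child list so removing while looping is safe.
        for child in list(parent):
            if child.tag == "add" and child.get("name") in HEADERS_TO_REMOVE:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_headers(SAMPLE)
print(cleaned)
```

Take a backup of the “web.config” first, and only touch the web application zone the crawler uses. Note that ElementTree drops XML comments when parsing, so on a real file you may prefer to use the output just to confirm which lines to delete by hand.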