This article follows on from the previous three Searcharoo samples:

Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder and indexes all HTML (and other known types of) documents. A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.

Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than just looking at directory listings in the file system). This means downloading files via HTTP, parsing the HTML to find more links, and ensuring we don't get into a recursive loop, because many web pages refer to each other. The article also discusses how results for multiple search words are combined into a single set of 'matches'.

Searcharoo Version 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across IIS application restarts without having to be regenerated each time. It also spidered FRAMESETs and added stop words, go words and stemming to the indexer. A number of bugs reported via CodeProject were also fixed.

Version 4 of Searcharoo has changed in the following ways (often prompted by CodeProject members):

- It can now index/search Word, PowerPoint, PDF and many other file types, thanks to the excellent Using IFilter in C# article by Eyal Post. This is probably the coolest bit of the whole project - but all credit goes to Eyal for his excellent article.
- It parses and obeys your robots.txt file, in addition to the robots META tag, which it already understood (cool263).
- You can 'mark' regions of your html to be ignored during indexing (xbit45).
- There is a rudimentary effort to follow links hiding in javascript (ckohler).
- You can run the Spider locally via a command-line application, then upload the Catalog file to your server (useful if your server doesn't have all the IFilters installed to parse the documents you want indexed).
- The code has been significantly refactored (thanks to encouragement from mrhassell and j105 Rob).

I hope this makes it easier for people to read/understand and edit to add the stuff they need. You need Visual Studio 2005 to work with this code. In previous versions I tried to keep the code in a small number of files, and structure it so it'd be easy to open/run in Visual Web Developer Express (heck, the first version was written in WebMatrix), but it's just getting too big. As far as I know, it's still possible to shoehorn the code into VWD (with an App_Code directory and assemblies from the ZIP file) if you want to give it a try.

I've included two projects from other authors: Eyal's IFilter code (from CodeProject and his blog on bypassing COM) and the Mono.GetOptions code (a nice way to handle command-line arguments). I do NOT take credit for these projects - but I thank the authors for the hard work that went into them, and for making the source available.

The UI (Search.aspx) hasn't really changed at all (except for class name changes as a result of refactoring) - I have a whole list of ideas & suggestions to improve it, but they will have to wait for another day.

The Catalog-File-Word design that supports searching the Catalog remains basically unchanged (from Version 1!); however, there has been a total reorganization of the classes used to generate the Catalog. In Version 3, all the code to download a file, parse the HTML, extract the links, extract the words, add them to the catalog and save the catalog was crammed into two classes (Spider and HtmlDocument, see right). Notice that the StripHtml() method is in the Spider class - doesn't make sense, does it? This made it difficult to add the new functionality required for supporting IFilter (or any other document types we might like to add) that don't have the same attributes as an Html page. To 'fix' this design flaw, I pulled out all the Html-specific code from Spider and put it into HtmlDocument. Then I took all the 'generic' document attributes (Title, Length, Uri…
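The refactoring described above (generic attributes shared by all document types, Html-specific parsing pushed into a subclass) could be sketched roughly like this. This is a minimal illustration only - the class members and method names are my assumptions, not the actual Searcharoo source:

```csharp
using System;

// Sketch: generic attributes every document type shares live in a base class.
public abstract class Document
{
    public string Title;   // illustrative fields, not the real Searcharoo members
    public long Length;
    public Uri Uri;

    // Each document type knows how to extract its own content.
    public abstract void Parse(string rawContent);
}

// HTML-specific behaviour belongs here, not in the Spider.
public class HtmlDocument : Document
{
    public override void Parse(string rawContent)
    {
        // The HTML-specific work formerly crammed into Spider goes here:
        // strip tags, read <title>, collect <a href> links, etc.
        int start = rawContent.IndexOf("<title>", StringComparison.OrdinalIgnoreCase);
        int end = rawContent.IndexOf("</title>", StringComparison.OrdinalIgnoreCase);
        if (start >= 0 && end > start)
            Title = rawContent.Substring(start + 7, end - start - 7).Trim();
        Length = rawContent.Length;
    }
}
```

With this shape, an IFilter-based document type can subclass Document without pretending to have HTML attributes.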
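The "don't get into a recursive loop" requirement from Version 2 boils down to remembering which URLs have already been visited. A minimal sketch of that bookkeeping (class and method names are mine, not Searcharoo's):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the visited-link bookkeeping a spider needs so that
// mutually-linking pages can't send the crawl into an endless loop.
public class LinkTracker
{
    private readonly HashSet<string> _visited = new HashSet<string>();

    // Returns true the first time a URL is seen, false on any repeat.
    public bool MarkVisited(Uri uri)
    {
        // Normalise so http://site/page and http://site/page#top compare equal.
        string key = uri.GetLeftPart(UriPartial.Query).ToLowerInvariant();
        return _visited.Add(key);
    }
}
```

The spider would call MarkVisited before queueing each discovered link and simply skip any URL that returns false.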
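xbit45's "mark regions of your html to be ignored" feature can be implemented by stripping everything between a pair of marker comments before indexing. A hedged sketch - the marker strings below are invented for illustration; Searcharoo's actual markers may differ:

```csharp
using System.Text.RegularExpressions;

public static class IgnoreRegions
{
    // Hypothetical marker comments - the real Searcharoo markers may differ.
    const string Pattern =
        @"<!--\s*SEARCHAROO_IGNORE_BEGIN\s*-->[\s\S]*?<!--\s*SEARCHAROO_IGNORE_END\s*-->";

    // Remove every marked region (lazy match, so multiple regions work)
    // before the page text is tokenised into catalog words.
    public static string Strip(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```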
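For cool263's robots.txt support, the core idea is to read the Disallow lines that apply to the crawler and refuse to fetch matching paths. The sketch below is a deliberately simplified assumption of how that could look (it only honours the `User-agent: *` section and plain prefix Disallow rules; real robots.txt handling is richer):

```csharp
using System;
using System.Collections.Generic;

// Simplified robots.txt reader: collects Disallow prefixes from the
// "User-agent: *" section and tests candidate URLs against them.
public class RobotsTxt
{
    private readonly List<string> _disallowed = new List<string>();

    public RobotsTxt(string robotsTxtContent)
    {
        bool inStarSection = false;
        foreach (string rawLine in robotsTxtContent.Split('\n'))
        {
            string line = rawLine.Trim();
            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
                inStarSection = line.Substring(11).Trim() == "*";
            else if (inStarSection &&
                     line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                string path = line.Substring(9).Trim();
                if (path.Length > 0) _disallowed.Add(path); // empty Disallow = allow all
            }
        }
    }

    // True unless the URL's path falls under a disallowed prefix.
    public bool IsAllowed(Uri uri)
    {
        foreach (string prefix in _disallowed)
            if (uri.AbsolutePath.StartsWith(prefix)) return false;
        return true;
    }
}
```

The spider would download /robots.txt once per host, build one of these, and consult IsAllowed before every fetch.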