Majestic-12 : DSearch : Technology

DSearch is powered by search engine technology developed by Alex Chudnovsky from scratch using Microsoft .NET. It has already been partially ported to Linux/FreeBSD, and complete portability to all platforms supported by Mono is expected shortly. The technology is split into 4 distinct areas: crawler, indexer, merger and searcher. All components were designed to scale to World Wide Web levels of data. As of the time of writing (17 Mar 2006), the technology had been used to crawl over 7 billion web pages, over 1 billion of which were indexed and made available for searching. As of the end of March 2006 the search engine is able to scale to billions of web pages that can be added on top of the existing index. You can also view a rather simplistic presentation made before the Birmingham Perl Mongers User Group (UK) on the topic of "Building a scalable distributed WWW search engine ... NOT in Perl!" (requires PowerPoint).
The crawler is responsible for the retrieval of web pages. It is an optional component if data is already available or crawled by other means. The main feature of the crawler is its distributed model, which allows crawling to scale linearly by running on consumer-grade broadband connections (cable/DSL). Key features:
- Robust distributed model that allows it to run on consumer-level broadband connections (much like SETI@home)
- Many time-tested features that compensate for running in uncontrolled, non-laboratory conditions
- High-performance, low-CPU-usage crawling that is known to have exceeded 2 million URLs per day from a single PC
- Robust support for robots.txt, including the Crawl-Delay parameter
- Allows processing of crawled data (e.g. HTML parsing or full indexing) to be offloaded
- Website-friendly crawling that limits the impact on sites hosted on the same IPs
- Used to crawl over 7 billion web pages
- Very small final page archives that require far less bandwidth at the central server (compared to the bandwidth used for crawling)
- Supports Linux/FreeBSD (using the Mono framework)
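To illustrate the robots.txt handling mentioned above, here is a minimal sketch in Python (the actual crawler is .NET; the function and agent names here are hypothetical) of how a polite crawler can check whether a URL is allowed and how long to wait between requests, using the Crawl-Delay parameter:

```python
# Illustrative sketch only - not the actual MJ12 crawler code.
from urllib.robotparser import RobotFileParser

def make_policy(robots_txt: str, agent: str = "ExampleBot"):
    """Parse a robots.txt body and return (allowed?, delay) helpers."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return rp.can_fetch(agent, url)

    # Crawl-Delay tells a polite crawler how many seconds to wait
    # between successive requests to the same host.
    delay = rp.crawl_delay(agent) or 1  # default to 1s if unspecified
    return allowed, delay

robots = """
User-agent: *
Disallow: /private/
Crawl-Delay: 5
"""
allowed, delay = make_policy(robots)
```

A crawler worker would call `allowed(url)` before each fetch and sleep for `delay` seconds between requests to the same host, which is what keeps distributed crawling website-friendly.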
The indexer is responsible for converting crawled data into indexed packages, which the merger then turns into the final searchable index. Key features:
- Massively parallel design that allows it to run on multiple machines using multiple cores/CPUs to achieve linear scalability
- High-performance indexing (over 100 typical web pages per second per core/CPU on an AMD Athlon X2 3800)
- Very compact data barrels that minimise temporary space requirements (15-20 MB without HTML cache)
- Currently supports HTML/text files; additional filters for PDF/DOC can easily be added
- Stores the key fields in which words were found (e.g. title, bold text, etc.)
- Stores word positions, which can be used to give phrase matches a higher score
- Supports anchor (link) text pointing to URLs that were not part of the indexed data
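The kind of per-word data the list above describes - which document a word occurred in, which key fields it appeared in, and its positions - can be sketched as follows (Python for illustration only; the real indexer is .NET and all names here are hypothetical):

```python
# Illustrative sketch of an in-memory posting barrel - not the actual indexer.
from collections import defaultdict

FIELD_BODY, FIELD_TITLE, FIELD_BOLD = 1, 2, 4  # bit flags for key fields

def index_document(index, doc_id, words, field=FIELD_BODY):
    """Add the tokenised words of one field of one document to a barrel."""
    for pos, word in enumerate(words):
        postings = index[word.lower()]
        if postings and postings[-1][0] == doc_id:
            # Same document: OR in the field flag, append the position.
            _, flags, positions = postings[-1]
            postings[-1] = (doc_id, flags | field, positions + [pos])
        else:
            postings.append((doc_id, field, [pos]))

barrel = defaultdict(list)
index_document(barrel, 1, ["distributed", "search", "engine"], FIELD_TITLE)
index_document(barrel, 2, ["search", "technology"], FIELD_BODY)
```

Storing field flags and positions alongside each posting is what later allows the searcher to boost title/bold matches and detect exact phrases.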
The merger is responsible for converting indexed packages into the final searchable index. The key requirement for this part is high performance, which allows a new index to be produced quickly and thus kept up to date. Only newly added or updated pages need to be re-indexed before merging.
- High-performance design allowing around 200 million indexed pages to be merged per 24 hours using one dual-core PC
- Avoids potentially dangerous compromises such as dropping very infrequent terms (keywords)
- Soon (before the end of April 2006) to support merging on multiple servers, giving the ability to merge billions of pages very quickly
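The merge step itself can be sketched under simplifying assumptions (Python for illustration, not the actual .NET merger): each indexed package ("barrel") is a list of (word, doc_id) postings already sorted by word, and the merger streams them into one combined sorted index without loading everything into memory:

```python
# Illustrative k-way merge of sorted posting barrels - not the actual merger.
import heapq
from itertools import groupby

def merge_barrels(*barrels):
    """Stream-merge sorted barrels, yielding (word, [postings]) pairs."""
    merged = heapq.merge(*barrels)          # k-way merge, output stays sorted
    for word, group in groupby(merged, key=lambda p: p[0]):
        yield word, [p[1:] for p in group]  # combined postings per word

barrel_a = [("engine", 1), ("search", 1)]
barrel_b = [("search", 2), ("web", 2)]
index = dict(merge_barrels(barrel_a, barrel_b))
```

Because a streaming merge only ever looks at the head of each barrel, its cost grows linearly with the total number of postings, which is why throughput figures like 200 million pages per day on one PC are plausible for this stage.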
The searcher is the actual search engine that end users interact with. Key features:
- High-performance design that allows it to scale well (currently indexed: 1 billion pages, all hosted on two consumer-grade servers; test system is here)
- Real-time ranking: users can change scores
- Phrase matching, including multiple phrases and phrases mixed with non-phrase terms
- Different scoring depending on proximity (distance between words) and position (title, bold text, etc.)
- Geo-targeting: pages can be given a score boost/reduction depending on their top-level domain, either auto-detected from IP or user-controlled (loc: command)
- Boolean AND and NOT logic (OR to be added shortly)
- Supports listing of backlinks (link: command)
- Supports listing and searching of pages for a given site (site: command)
- Supports query-independent ranking of web pages that depends on the links pointing to a given page
- Supports finding sites that have not yet been crawled (based on anchor (link) text pointing to them)
- Provides either exact or approximate match counts
- Supports a secondary index for user-submitted pages, which appear in the index almost instantly
- Supports distributed searching using one or more additional search servers, allowing it to scale to billions of pages
- Many more features in development right now!
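Two of the features above - Boolean AND and phrase matching from stored word positions - can be sketched like this (Python for illustration; the real searcher is .NET and these helper names are hypothetical):

```python
# Illustrative sketches of two searcher primitives - not the actual searcher.

def boolean_and(postings_a, postings_b):
    """Intersect two sorted lists of doc IDs (Boolean AND of two terms)."""
    i = j = 0
    out = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            out.append(postings_a[i]); i += 1; j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return out

def phrase_match(positions_a, positions_b):
    """True if some occurrence of word B directly follows word A,
    i.e. the two words form an exact phrase in the document."""
    next_positions = set(p + 1 for p in positions_a)
    return any(p in next_positions for p in positions_b)

docs = boolean_and([1, 3, 5, 9], [2, 3, 9])
```

The same position lists also support proximity scoring: the smaller the gap between the matched positions, the higher the score a page can be given.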
An alpha version of the search engine can be found here. Please note that this version is hosted on just one (!) consumer-grade server with many caching options turned off. This was done intentionally to ensure that the search engine is improved under "much worse than production" conditions. If you are wondering how fast it will be with your own data sets, we can arrange a test whereby your data is indexed, merged and made available to you for searching - all stages of the process can be benchmarked so that you know exactly how long it takes to process your data.
For enquiries please use this email.