Majestic-12 : DSearch : MJ12bot
Email address for queries about the bot (if you are too busy to read the rest of the page): firstname.lastname@example.org (we respond very quickly!)
You may have reached this page by clicking a link left by MJ12bot in your log files. Below are some of the most frequently asked questions about MJ12bot.
What is MJ12bot doing on my site(s)?
We spider the Web to build a search engine. Our fast and efficient downloadable distributed crawler enables people with broadband connections to contribute to what we hope will become the biggest search engine in the world. Production of a full-text search engine at Majestic-12 is currently in the research phase, funded in part by the commercialisation of research at Majestic.
What happens with crawled data?
Crawled data (currently only the web graph of links) is added to the largest public backlinks search engine index, which we maintain as a dedicated tool called Site Explorer. All webmasters can obtain full free data on backlinks by verifying ownership of their site - learn about your own backlinks from the extensive backlinks index.
Why do you keep crawling 404 or 301 pages?
We have a long memory and want to ensure that temporary errors, website down pages or other temporary changes to sites do not cause irreparable changes to your site profile when they shouldn't. Also, if there are still links to these pages, they will continue to be found and followed. Google are asked this question too and have published a statement. Their reason is the same as ours, and their answer can be found here: Google 404 policy
You are crawling links with rel=nofollow
This is a common misunderstanding of the (perhaps poorly named) nofollow attribute. Google introduced the rel=nofollow attribute in 2005, stating that links so marked would not influence the target's PageRank. It does not stop a crawler from visiting the target page. This becomes particularly obvious when the target page has several links to it, some with the attribute and some without. If you wish to stop bots from crawling a page, use the robots.txt file to disallow the target page.
More information on rel=nofollow can be found here: Wikipedia Nofollow
How can I block MJ12bot?
MJ12bot adheres to the robots.txt standard. To prevent the bot from crawling your website, add a record for the MJ12bot user agent to your robots.txt, using Disallow: / to exclude the whole site.
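As a minimal sketch, assuming the standard robots.txt record form (a User-agent line naming MJ12bot followed by Disallow: / for the whole site), a block can be checked locally with Python's urllib.robotparser before it goes live. The can_crawl helper and the example.com URLs are illustrative only, and robotparser approximates how a crawler reads the file rather than reproducing MJ12bot's own parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed record: the standard way to exclude one crawler from a whole site.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the bare product token ("MJ12bot"), not a full User-Agent header.
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(can_crawl(ROBOTS_TXT, "MJ12bot", "http://example.com/page.html"))
    print(can_crawl(ROBOTS_TXT, "SomeOtherBot", "http://example.com/page.html"))
```

Running the same check against the robots.txt actually served by the live site, rather than a local copy, helps confirm that the rule the bot sees is the one you intended.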
Please do not waste your time trying to block the bot by IP address in .htaccess - we do not use any consecutive IP blocks, so your efforts will be in vain. Also, please make sure the bot can actually retrieve robots.txt itself - if it can't, it will assume (this is the industry practice) that it's okay to crawl your site.
If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: email@example.com. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.
What non-standard features of robots.txt does MJ12bot support?
Our current crawler supports the following non-standard extensions to robots.txt:
- Crawl-Delay for up to 20 seconds (higher values will be rounded down to the maximum our bot supports)
- Redirects (within the same site) when trying to fetch robots.txt
- Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification
- Allow directives can override Disallow if they are more specific (longer in length)
- Certain failures to fetch robots.txt, such as 403 Forbidden, will be treated as a blanket disallow directive
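The interaction between wildcard patterns and the "longer Allow overrides Disallow" rule listed above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's actual code: it assumes that * matches any run of characters, that a trailing $ anchors the end of the path, and that on a tie in length the earlier rule wins. The function names and sample rules are hypothetical.

```python
import re
from typing import Sequence, Tuple


def matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    if anchored:
        return regex.fullmatch(path) is not None
    return regex.match(path) is not None  # prefix match


def is_allowed(path: str, rules: Sequence[Tuple[str, str]]) -> bool:
    """Apply (directive, pattern) rules; the longest matching pattern decides."""
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty Disallow matches nothing
        if matches(pattern, path) and len(pattern) > best_length:
            best_length = len(pattern)
            allowed = directive.lower() == "allow"
    return allowed


# Hypothetical rule set for illustration.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/public/"),
    ("disallow", "/*.pdf$"),
]

if __name__ == "__main__":
    for sample in ("/private/a.html", "/private/public/a.html", "/docs/r.pdf"):
        print(sample, is_allowed(sample, RULES))
```

Comparing pattern lengths keeps the precedence rule independent of the order in which Allow and Disallow lines appear in the file.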
Why did a robots.txt block not work on MJ12bot?
We are keen to see any reports of potential violations of robots.txt by MJ12bot.
There are a number of false positives raised - this can be a useful checklist when configuring a web server:
- Off-site redirects when requesting robots.txt - MJ12bot follows redirects, but only on the same domain. Ideally, robots.txt should be available at "/robots.txt", as specified in the standard.
- Multiple domains running on the same server. Modern web servers such as Apache can log accesses to a number of domains in one file - this can cause confusion when trying to work out which site was accessed at which point. You may wish to consider adding domain information to the access log, or splitting access logs on a per-domain basis.
- Robots.txt out of sync with developer copy. We have had complaints that MJ12bot has disobeyed robots.txt, only to find out that the developer was testing against a development server which was not in sync with the live version.
Historically, there was a period when the MJ12bot User-Agent was spoofed. Bad bots often use spoofed user agents, since a User-Agent string is easily faked. The discussion regarding the fake V1.08 MJ12bot is archived here. Majestic-12 is therefore interested to hear of any reports of robots.txt violation. To check whether a bot claiming to be MJ12bot is ours, we need log entries showing the IP address of the bot, the request for robots.txt and the subsequent requests which you believe are in violation.
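Collecting the log entries requested above can be sketched as follows, assuming the server writes the common Apache "combined" log format. The regular expression, the mj12bot_requests helper and the sample log lines (which use documentation-only IP addresses and an abbreviated User-Agent string) are all hypothetical; adjust the pattern if your log format differs.

```python
import re

# Assumed layout: Apache "combined" format (IP, identity, user, time, request,
# status, size, referer, user agent).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample entries, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/May/2014:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 45 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5)"',
    '198.51.100.7 - - [10/May/2014:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [10/May/2014:10:00:05 +0000] "GET /private/a.html HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5)"',
]


def mj12bot_requests(lines):
    """Return (ip, time, path, status) for every request whose agent mentions MJ12bot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "mj12bot" in match.group("agent").lower():
            hits.append(
                (
                    match.group("ip"),
                    match.group("time"),
                    match.group("path"),
                    int(match.group("status")),
                )
            )
    return hits


if __name__ == "__main__":
    for ip, time, path, status in mj12bot_requests(SAMPLE_LOG):
        print(ip, time, path, status)
```

The output keeps the IP address next to each request, so the robots.txt fetch and the later requests can be reported together in one message.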
How can I slow down MJ12bot?
You can easily slow the bot down by adding a Crawl-Delay directive for the MJ12bot user agent to your robots.txt file.
Crawl-Delay should be an integer and gives the number of seconds to wait between requests. MJ12bot will wait up to 20 seconds between requests to your site. Note, however, that while it is unlikely, your site may still be crawled by multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. The Crawl-Delay parameter is also honoured if it is set for the * wildcard.
If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
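A minimal sketch of the delay rules described above, using Python's urllib.robotparser to read the directive and then applying the 20-second ceiling. The five-second value, the effective_delay helper and the MAX_DELAY_SECONDS name are illustrative assumptions. The sketch covers only a delay set for MJ12bot or for the * wildcard, not the automatic slow-down triggered by delays set for other bots.

```python
from urllib.robotparser import RobotFileParser

MAX_DELAY_SECONDS = 20  # ceiling stated in the text above

# Assumed record asking MJ12bot to wait five seconds between requests.
ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-Delay: 5
"""


def effective_delay(robots_txt: str, agent: str = "MJ12bot"):
    """Return the delay in seconds that applies to `agent`, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() falls back to the "*" record when no record names the agent.
    requested = parser.crawl_delay(agent)
    if requested is None:
        return None
    return min(requested, MAX_DELAY_SECONDS)


if __name__ == "__main__":
    print(effective_delay(ROBOTS_TXT))
    print(effective_delay("User-agent: *\nCrawl-Delay: 60\n"))
```

Note that robotparser only accepts whole numbers for Crawl-Delay, which matches the integer requirement stated above.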
What are the current versions of MJ12bot?
Current operating versions of MJ12bot are:
- v1.4.x series - most common: v1.4.5 (new as of April 2014) and v1.4.4 (to be phased out before end of May 2014)
If you have not been satisfied with the information above then feel free to contact us: firstname.lastname@example.org