Crawler Options

To make your node work more effectively, there are various options you can configure to make your node crawl faster, keep more URL buckets open for work, and compress finished buckets harder so that less data has to be sent back to the master server. Below we describe what each of these options does and what effect it will have on your node. Please note that any changed option marked with a * will only take effect once the node is restarted.

1) Maximum number of async workers* - Workers are used to actually crawl the web; each worker is assigned one URL at a time to crawl and collect the page information. The number of workers used depends on the number of domains available in the bucket being crawled: for example, if a bucket has 20 domains left in it and 300 URLs left to crawl, then the node can ONLY use up to a maximum of 20 workers for that bucket. This is done so that the node will not degrade the performance of a website by overloading it with workers. As the domains in a bucket decrease, so will the workers, in line with this rule. This is not a problem as long as the crawler is allowed to open more buckets (see that setting below).
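
To illustrate the rule, here is a minimal Python sketch of the per-bucket worker cap (the names are hypothetical, not the node's actual code):

    # Hypothetical sketch of the per-bucket worker cap described above.
    def workers_for_bucket(max_async_workers, domains_left_in_bucket):
        # Never assign more workers than there are distinct domains left,
        # so no single website is hit by several workers at once.
        return min(max_async_workers, domains_left_in_bucket)

    workers_for_bucket(300, 20)  # -> 20: only 20 domains remain, so only 20 workers run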

Do NOT assume that specifying the maximum possible number of workers will make the node crawl faster -- this is not the case; in fact, specifying too many workers will result in a higher number of errors and your crawling will actually suffer. The best approach is to add 5-10 more workers, then leave the node for an hour or two while keeping an eye on errors -- the timeout rate should not exceed 3-5%.

Recommended settings: 15 workers for a 512 Kbit-1 Mbit line, 30 workers for a 2 Mbit line, 60 workers for a 4 Mbit line, and over 100 workers for lines faster than 4 Mbit.
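
As a rough model of the tuning loop suggested above -- add a few workers, watch the timeout rate, back off if it climbs past 3-5% -- a hypothetical sketch:

    # Hypothetical tuning check using the 3-5% timeout band from the text.
    def suggest_worker_change(timeouts, requests, step=5):
        rate = timeouts / max(requests, 1)
        if rate > 0.05:
            return -step   # too many timeouts: back off
        if rate < 0.03:
            return +step   # healthy: try a few more workers
        return 0           # inside the 3-5% band: leave it alone

    suggest_worker_change(timeouts=12, requests=1000)  # rate 1.2% -> +5 workers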

2) Maximum open URL buckets - This setting lets you specify how many buckets are allowed to be open at any one time. There will be times when you get a lot of sparse buckets that take a long time to complete. To process more buckets at the same time, so that you can keep using your maximum bandwidth, set this to more than 10 max open buckets. You can set higher values if your Internet connection is very fast and you expect to have many buckets waiting to complete.

Note: as the crawler opens more buckets it will use more temporary space on your hard drive; assume that 250-300 MB will be used per bucket, but note that the crawler will only open as many buckets as necessary.
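
Using the per-bucket figure from the note, you can estimate the worst-case temporary space a given limit implies, for example:

    # Worst-case temp space for a given "max open buckets" setting,
    # using the 250-300 MB per-bucket estimate from the note above.
    def temp_space_mb(max_open_buckets, per_bucket_mb=300):
        return max_open_buckets * per_bucket_mb

    temp_space_mb(10)  # -> 3000 MB (~3 GB) if all 10 buckets are open at once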

3) Use persistent connections - There may be times when you have to disable this option because your router cannot cope with the number of connections the node makes. This often results in the node being unable to contact the master server to send results back, or in very high timeout values shown in the status. Having this option enabled uses more memory, so if you only have a small amount of memory, switch this option off.

Note: if you have a high number of timeouts or other errors, you may want to try turning this option off.
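
The idea behind persistent (keep-alive) connections can be sketched with Python's requests library; this only illustrates the concept, it is not the node's implementation:

    import requests

    # Persistent: one TCP connection is reused across requests (keep-alive).
    # Fewer connection setups, but the open connections use more memory.
    session = requests.Session()
    for url in ["http://example.com/a", "http://example.com/b"]:
        session.get(url)

    # Non-persistent: every request opens and closes its own connection,
    # which is gentler on routers that cannot track many open connections.
    for url in ["http://example.com/a", "http://example.com/b"]:
        requests.get(url, headers={"Connection": "close"})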

4) Delay before crawling* - This is the delay the crawler waits before starting work when the node program is launched. It is designed to ensure the node starts smoothly without taking too many resources.

5) Pre-cache buckets - You can set a number of buckets to be kept in reserve in case the master server goes down. This allows you to continue working even if your node cannot contact the server to get new work buckets. If you only intend to run the node for a few days, please keep this value at around 1-2 buckets, but if you are going to stay crawling with us then feel free to set the value to about 15-20 (this should give you about 150,000-200,000 URLs to work on if the server goes down).

Note: the server keeps track of issued buckets and may refuse to issue you more if you just collect URL buckets without returning crawled data.
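
The 150,000-200,000 figure implies roughly 10,000 URLs per bucket; a quick sketch of how long a reserve might last (the crawl rate used here is an assumption for illustration):

    # Rough reserve estimate, assuming ~10,000 URLs per bucket (implied by
    # the 15-20 buckets -> 150,000-200,000 URLs figure above).
    def reserve_hours(buckets, urls_per_hour=5000):  # assumed crawl rate
        return buckets * 10_000 / urls_per_hour

    reserve_hours(15)  # -> 30.0 hours of work at 5,000 URLs/hour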

6) Ensure min free disk space (MB) - This option lets you set a minimum free disk space so that your computer doesn't run out of disk space while running the node. Typically this should be left at 1,000 MB (1 GB), but there may be circumstances where you need to lower it to 500 MB if you find that your node stops crawling all buckets.

7) Buckets when near low on disk - If the above scenario occurs and your computer hits the low-disk value, you can set the node to work on only the number of buckets specified in this option. Typically you should leave this value set to 1.
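
Options 6 and 7 work together; a minimal sketch of the combined logic using Python's standard library (the function name is hypothetical):

    import shutil

    # Hypothetical sketch of how options 6 and 7 interact: when free space
    # falls below the minimum, drop down to the low-disk bucket count.
    def allowed_buckets(path, normal_buckets, low_disk_buckets=1, min_free_mb=1000):
        free_mb = shutil.disk_usage(path).free // (1024 * 1024)
        if free_mb < min_free_mb:
            return low_disk_buckets   # near the limit: work on 1 bucket only
        return normal_buckets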

8) Limit IP's by subnet - This option should be left as is, as it is primarily used for troubleshooting and is likely to be removed in future versions.

9) Use fixed upload chunk - This option enables you to maximise usage of your upstream bandwidth by sending data back to the server in manageable chunks. You can find the best value for this option by running "Benchmark uploading" and then entering the resulting value here.
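
Conceptually, a fixed upload chunk just means the result file is sent in equal-sized pieces; a sketch where send_chunk() stands in for the actual transfer:

    # Sketch of fixed-size chunked uploading; send_chunk() is hypothetical.
    CHUNK_SIZE = 32 * 1024   # example value -- use your "Benchmark uploading" result

    def upload(path, send_chunk):
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                send_chunk(chunk)   # each piece is small enough to pace reliably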

10) Use persistent connection - This is the same option as in the Downloading section, but this time it applies to uploading. If you are having problems uploading results back to the master server, try switching this option off.

Note: unchecking this option will reduce uploading speed, and generally unchecking it won't help fix things.

11) No delays between upload chunks - This option sends uploads back to the master server as quickly as possible, ignoring any pauses between upload chunks.

Note: if you check this option you will effectively DISABLE upstream throttling and your node will upload at the maximum speed it can. It is not a good idea to have this option checked if you have limited upstream capacity.
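
The pauses being disabled here are what cap the average upload rate: sleeping between fixed-size chunks keeps upstream usage below a target. A sketch with assumed numbers:

    import time

    # Sleeping between chunks caps the average upload rate: with 32 KB
    # chunks and a 64 KB/s target, the node would pause ~0.5 s per chunk.
    def pause_per_chunk(chunk_bytes, target_bytes_per_sec):
        return chunk_bytes / target_bytes_per_sec

    time.sleep(pause_per_chunk(32 * 1024, 64 * 1024))  # -> 0.5 s between chunks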

12) Wait between barrel uploads (secs) - Typically this option is set to wait 120 seconds before uploading a finished bucket back to the master server; however, if you want to get results back quicker and not wait between finished buckets, set this value to 5-10 seconds.

13) Stop if uploads are greater than - You can use this value to set a maximum number of finished buckets; if it is reached, the node will stop all downloading on current buckets and wait until the number drops before resuming. This option is useful if you have a high-bandwidth internet connection but a slow upload speed. It prevents your node from building up a huge backlog; if you do get a backlog of barrels ready for upload, it means you either need to slow down your crawling or optimise uploading.
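
The behaviour is plain backpressure; a hypothetical sketch:

    # Hypothetical backpressure check: crawling pauses while the upload
    # backlog is at or above the configured maximum.
    def may_crawl(finished_buckets_waiting, stop_threshold):
        return finished_buckets_waiting < stop_threshold

    may_crawl(7, 5)  # -> False: stop downloading until the backlog shrinks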

14) Enable barrel sorting before archiving - A new algorithm that will replace HKom compression and enable much smaller uploads when archiving. Feel free to enable it to get smaller uploads, but be careful if you only have a small amount of memory, as it can consume a lot when compressing! You can expect up to 30% better compression with this option enabled.
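
Why sorting helps compression can be seen with a quick experiment: grouping similar lines together lets the compressor find longer repeats. A self-contained demonstration of the idea (not the node's actual algorithm):

    import random, zlib

    # Demonstration only: sorting similar lines before compressing typically
    # yields a smaller archive, which is the idea behind barrel sorting.
    urls = [f"http://site{i % 50}.example.com/page{i}" for i in range(10_000)]
    random.shuffle(urls)

    shuffled = "\n".join(urls).encode()
    sorted_data = "\n".join(sorted(urls)).encode()

    print(len(zlib.compress(shuffled, 9)), len(zlib.compress(sorted_data, 9)))
    # the sorted version typically compresses noticeably smaller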

15) RobotsTXT flush delay, URL's flush delay, Archiving delay (ms per 32k) - These options are for advanced users who wish to finely control how much CPU the node uses while crawling. The higher you set these values, the less spiky CPU usage will be when the node flushes its robots.txt cache. It is highly recommended to leave them as is.
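
The "ms per 32k" unit means the node pauses for that many milliseconds after each 32 KB it processes, which spreads the CPU load out instead of letting it spike. A rough model:

    import time

    def process(block):   # stand-in for the node's real flush work
        pass

    # Rough model of "ms per 32k": pause after each 32 KB block processed.
    def paced_flush(data, delay_ms):
        for i in range(0, len(data), 32 * 1024):
            process(data[i:i + 32 * 1024])
            time.sleep(delay_ms / 1000.0)   # higher delay = smoother, slower flush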

16) Use better compression level - Setting this value above 1 can lead to excessive CPU usage. Only use this if you have a very fast computer or a server (e.g. AMD XP 2200+/Intel P4 2 GHz and above). Typically you will get excellent compression if this value is set to 9; however, it will slow down everything else you are doing and cause a lot of CPU activity. Using RAR instead of the default archiving tool (see Misc Options) will also give you excellent compression.

WARNING: you will experience much higher CPU usage during archiving if you raise this setting; it is NOT recommended unless you are running the node on a dedicated server.
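
The trade-off is the standard speed-versus-size curve of deflate-style compressors; you can see it with Python's zlib (illustration only -- the node's archiver is its own tool):

    import random, time, zlib

    random.seed(0)
    words = [f"word{n}" for n in range(1000)]
    data = " ".join(random.choice(words) for _ in range(500_000)).encode()  # a few MB

    for level in (1, 9):
        start = time.perf_counter()
        size = len(zlib.compress(data, level))
        print(f"level {level}: {size:,} bytes in {time.perf_counter() - start:.2f}s")
    # level 9 produces a smaller archive but uses noticeably more CPU time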

17) Enable real-time HKom pre-processing / non-realtime HKom compression - Please leave these disabled, as they will be phased out in the next version of the node; use barrel sorting before archiving instead. Thanks!