| Author |
Message |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 3:04 pm Post subject: New buckets thread |
|
|
This is the new thread dedicated to the new buckets that are now being added to the main crawl.
New buckets will have a .DZ extension rather than .GZ; no node update is needed to support them (support was added in the past).
Some of the older buckets can still be issued, but we are going to de-prioritise them today. Once we are sure the new model works we'll remove them (next week hopefully).
Server side we see a much improved crawl in terms of data returned.
The new crawl mode allows for a lot of different tuning parameters, and that's something we'll be working on over the next few weeks, so please bear with me.
Please post your observations here if you have any - my expectation is that the number of 404s should be reduced, and I hope other errors will also reduce considerably once all tuning params are in place. |
|
|
 |
C0d3r Full Member
Joined: 15 Jul 2010 Posts: 113 Location: Hungary
|
|
|
 |
Leo Junior Member
Joined: 18 Sep 2011 Posts: 76 Location: Klepp, Norway
|
Posted: Tue Feb 12, 2013 3:47 pm Post subject: |
|
|
| Wow C0d3r, I see I might need to reconfigure my nodes soon.. |
|
|
 |
Evil-Dragon Senior Member

Joined: 30 Aug 2005 Posts: 2249 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 6:23 pm Post subject: |
|
|
Given the data quality (low DNS/conn error rates) I'd say I'm getting some of the new buckets.
Will this new crawl model hopefully cut down on the 'BusyTryLater' errors when requesting URLs?
Also, I had a suggestion to tier members based on the number of nodes: those on a higher tier (more than 5-10 nodes) could be assigned larger work buckets (50,000 URLs per bucket), so that they are still crawling more URLs but requesting fewer buckets per 24 hours.
If the above isn't possible, perhaps users could be allowed to opt in to larger buckets? _________________ Building the Majestic 12 SEO Engine, 10,000 URL's at a time! |
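For illustration only, the tiering suggestion above could be sketched roughly as follows. This is not how the crawl server works; every name here is hypothetical. The 10,000-URL standard size is an assumption taken from the signature line, and the 10-node cutoff is an assumption too, since the post only says "more than 5-10 nodes".

```python
# Hypothetical sketch of tiered bucket sizing; none of these names
# come from the real node or server software.
STANDARD_BUCKET_URLS = 10_000  # assumption: from the "10,000 URL's at a time" signature
LARGE_BUCKET_URLS = 50_000     # figure suggested in the post
LARGE_TIER_MIN_NODES = 10      # assumption: the post only says "more than 5-10 nodes"


def bucket_size_for(node_count: int, opted_in: bool = False) -> int:
    """Return the bucket size (in URLs) a member would be issued."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Members above the node cutoff, or those who opted in, get large buckets.
    if opted_in or node_count > LARGE_TIER_MIN_NODES:
        return LARGE_BUCKET_URLS
    return STANDARD_BUCKET_URLS


def bucket_requests_per_day(urls_per_day: int, bucket_urls: int) -> int:
    """Number of bucket requests needed to cover urls_per_day, rounded up."""
    if urls_per_day < 0 or bucket_urls <= 0:
        raise ValueError("urls_per_day must be >= 0 and bucket_urls > 0")
    return -(-urls_per_day // bucket_urls)  # ceiling division


if __name__ == "__main__":
    daily_urls = 1_000_000  # hypothetical daily crawl volume for one member
    for nodes in (2, 12):
        size = bucket_size_for(nodes)
        print(nodes, size, bucket_requests_per_day(daily_urls, size))
```

The point of the sketch is only that the same daily URL volume needs fewer bucket requests once the bucket size goes up, which is the effect the suggestion is after.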
|
|
 |
nemercry Junior Member
Joined: 12 Jan 2013 Posts: 47
|
Posted: Tue Feb 12, 2013 6:48 pm Post subject: |
|
|
| Evil-Dragon wrote: | Given the data quality (low DNS/conn error rates) I'd say I'm getting some of the new buckets.
Will this new crawl model hopefully cut down on the 'BusyTryLater' errors when requesting URLs?
Also, I had a suggestion to tier members based on the number of nodes: those on a higher tier (more than 5-10 nodes) could be assigned larger work buckets (50,000 URLs per bucket), so that they are still crawling more URLs but requesting fewer buckets per 24 hours.
If the above isn't possible, perhaps users could be allowed to opt in to larger buckets? |
Not sure if this really solves more problems than it causes?
People like C0d3r wouldn't see an improvement from that and might run out of URLs even more often. |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 7:14 pm Post subject: |
|
|
We may look into bigger buckets; the main problems we've had so far were due to a lack of buckets and data that was hard to crawl (too many errors). These will be addressed first.
Right now we are getting a lot of buckets generated in automated mode. |
|
|
 |
C0d3r Full Member
Joined: 15 Jul 2010 Posts: 113 Location: Hungary
|
|
|
 |
Cow_tipping Senior Member

Joined: 08 Jul 2008 Posts: 955 Location: On the Run
|
Posted: Tue Feb 12, 2013 8:02 pm Post subject: |
|
|
I came home and my computer had crashed (as expected).
Will try to get 2 nodes running smoothly on 2 computers before attempting any VMs. _________________
  |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 8:08 pm Post subject: |
|
|
Hmmm, it's possible that the bucket itself is not removed after being finished. I'll check this for the new build of the node, possibly in the next couple of days, if such buckets are indeed not removed on completion! |
|
|
 |
rilian Senior Member

Joined: 30 Mar 2008 Posts: 1074
|
Posted: Tue Feb 12, 2013 8:10 pm Post subject: |
|
|
bigger buckets would be great!!!!!!!!!!!!!!!!!! _________________
I crunch&crawl for Ukraine  |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 8:18 pm Post subject: |
|
|
| I'll think about big buckets; we'll probably run an experiment (it's much easier to do such things now in the new model). |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 8:40 pm Post subject: |
|
|
A large part of the older crawl has now been given lower priority. Tomorrow I'll analyse the split between old and new, and I'll keep an eye on a few other things.
It's moving pretty quickly now |
|
|
 |
Norman_RKN Senior Member

Joined: 26 Dec 2006 Posts: 444 Location: Saarland, Germany
|
Posted: Tue Feb 12, 2013 8:40 pm Post subject: |
|
|
Yes, really big buckets are good.
They might reduce the number of requests for new buckets, and maybe how often archiving runs, so more CPU time is left.
Maybe there are other good effects too _________________ http://alles-schallundrauch.blogspot.de/
 |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Tue Feb 12, 2013 8:49 pm Post subject: |
|
|
There are some negative effects with big buckets: more to recrawl if one fails, and a longer waiting period for the bucket to be returned.
Let's first see how the new model runs its course, especially with regard to the reduction of errors (some already addressed, others still need work). |
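To put rough numbers on those two downsides, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that a node crawls at a constant rate and that a failed bucket has to be recrawled in full; neither assumption is confirmed anywhere in this thread, and the function name is made up.

```python
def big_bucket_costs(bucket_urls: int, urls_per_second: float) -> dict:
    """Estimate the two costs of a bucket of the given size.

    Assumptions (not from the thread): constant crawl rate, and a failed
    bucket is recrawled in full.
    """
    if bucket_urls <= 0 or urls_per_second <= 0:
        raise ValueError("bucket_urls and urls_per_second must be positive")
    return {
        # How long the server waits before the bucket comes back.
        "hours_until_returned": bucket_urls / urls_per_second / 3600,
        # How much work is repeated if this one bucket fails.
        "urls_to_recrawl_on_failure": bucket_urls,
    }


if __name__ == "__main__":
    rate = 5.0  # hypothetical crawl rate in URLs per second
    for size in (10_000, 50_000):
        print(size, big_bucket_costs(size, rate))
```

Under these assumptions both costs grow in direct proportion to bucket size, so a bucket five times larger takes five times longer to come back and repeats five times the work when it fails.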
|
|
 |
Norman_RKN Senior Member

Joined: 26 Dec 2006 Posts: 444 Location: Saarland, Germany
|
|
|
 |
|