majestic12.co.uk dsearch Forum Index
Distributed Search forum
 

New buckets thread
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 3:04 pm    Post subject: New buckets thread

This is the new thread dedicated to new buckets that are now being added to the main crawl.

New buckets will have a .DZ extension rather than .GZ; no node update is needed to support them (support was added in the past).

Some of the older buckets can still be issued, but we are going to de-prioritise them today. Once we are sure the new model works, we'll remove them (next week, hopefully).

Server side, we see a much improved crawl in terms of data returned.

The new crawl mode allows for a lot of different tuning parameters, and tuning them is something we'll be doing over the next few weeks; please bear with me.

Please post your observations here if you have any. My expectation is that the number of 404s should be reduced, and I hope other errors will reduce considerably once all tuning params are in place.
C0d3r
Full Member


Joined: 15 Jul 2010
Posts: 113
Location: Hungary

Posted: Tue Feb 12, 2013 3:20 pm

http://img577.imageshack.us/img577/4521/91273517.png
It is a dream!
I would like to max out my 50 Mbps line with only one node. It is not far off.
Leo
Junior Member


Joined: 18 Sep 2011
Posts: 76
Location: Klepp, Norway

Posted: Tue Feb 12, 2013 3:47 pm

Wow C0d3r, I see I might need to reconfigure my nodes soon...
Evil-Dragon
Senior Member


Joined: 30 Aug 2005
Posts: 2249
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 6:23 pm

Given the data quality (low DNS/connection error rates), I'd say I'm getting some of the new buckets.

Will this new crawl model hopefully cut down on the 'BusyTryLater' errors when requesting URLs?

Also, I had a suggestion to tier members based on the number of nodes: those on a higher tier (more than 5-10 nodes) could be assigned larger work buckets (50,000 URLs per bucket), so that they are still crawling more URLs but requesting fewer buckets per 24 hours.

If the above isn't possible, perhaps users could be allowed to opt in to larger buckets?
_________________
Building the Majestic 12 SEO Engine, 10,000 URL's at a time!
nemercry
Junior Member


Joined: 12 Jan 2013
Posts: 47

Posted: Tue Feb 12, 2013 6:48 pm

Evil-Dragon wrote:
Given the data quality (low DNS/connection error rates), I'd say I'm getting some of the new buckets.

Will this new crawl model hopefully cut down on the 'BusyTryLater' errors when requesting URLs?

Also, I had a suggestion to tier members based on the number of nodes: those on a higher tier (more than 5-10 nodes) could be assigned larger work buckets (50,000 URLs per bucket), so that they are still crawling more URLs but requesting fewer buckets per 24 hours.

If the above isn't possible, perhaps users could be allowed to opt in to larger buckets?

Not sure if this really solves more problems than it creates?
People like C0d3r wouldn't see an improvement from that and might run out of URLs even more often.
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 7:14 pm

We may look into bigger buckets; the main problems we've had so far were due to a lack of buckets and hard-to-crawl data (too many errors). These will be addressed first.

Right now we are getting a lot of buckets generated in automated mode.
C0d3r
Full Member


Joined: 15 Jul 2010
Posts: 113
Location: Hungary

Posted: Tue Feb 12, 2013 7:47 pm

@Alex: I have found a possible bug. The old data files are not deleted automatically:
http://img855.imageshack.us/img855/4575/34604832.png
Cow_tipping
Senior Member


Joined: 08 Jul 2008
Posts: 955
Location: On the Run

Posted: Tue Feb 12, 2013 8:02 pm

I came home and my computer had crashed (as expected).
Will try to get 2 nodes running smoothly on 2 computers before attempting any VMs.
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 8:08 pm

C0d3r wrote:
@Alex: I have found a possible bug. The old data files are not deleted automatically:
http://img855.imageshack.us/img855/4575/34604832.png


Hmmm, it's possible that the bucket itself is not removed after being finished. I'll check this for the new build of the node, possibly in the next couple of days, if such buckets are indeed not removed on completion!
rilian
Senior Member


Joined: 30 Mar 2008
Posts: 1074

Posted: Tue Feb 12, 2013 8:10 pm

bigger buckets would be great!!!!!!!!!!!!!!!!!!
_________________
I crunch&crawl for Ukraine
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 8:18 pm

I'll think about big buckets; we'll probably run an experiment (it's much easier to do such things in the new model).
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 8:40 pm

A large part of the older crawl now has lower priority. Tomorrow I'll analyse the split between old and new and keep an eye on a few other things.

It's moving pretty quickly now.
Norman_RKN
Senior Member


Joined: 26 Dec 2006
Posts: 444
Location: Saarland, Germany

Posted: Tue Feb 12, 2013 8:40 pm

Yes, really big buckets are good.
They might reduce the number of requests for new buckets and maybe the archiving intervals a bit, so more CPU time is left.
Maybe there are other good effects too.
_________________
http://alles-schallundrauch.blogspot.de/

alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

Posted: Tue Feb 12, 2013 8:49 pm

There are some negative effects with big buckets: more to recrawl if one fails, and a longer waiting period for the bucket to be returned.

Let's first see how the new model runs its course, especially with regard to the reduction of errors (some already addressed, others still need work).
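
To put rough numbers on the bucket-size trade-off discussed above, here is a minimal sketch. Only the 50,000-URL bucket comes from the suggestion in this thread; the 10,000-URL baseline, the daily crawl volume, the failure rate, and the function name `bucket_tradeoff` are assumptions made for illustration and do not describe how the server actually schedules work.

```python
import math


def bucket_tradeoff(urls_per_day: int, bucket_size: int, failure_rate: float) -> dict:
    """Rough per-day figures for one node crawling with a fixed bucket size.

    All inputs are illustrative assumptions, not measured project values.
    """
    # Bucket requests the node has to make to stay busy for a day.
    requests = math.ceil(urls_per_day / bucket_size)
    # If a bucket fails as a whole, every URL in it has to be recrawled.
    expected_recrawl = requests * failure_rate * bucket_size
    # Time the server waits for one bucket to come back at this crawl rate.
    hours_until_return = bucket_size / (urls_per_day / 24)
    return {
        "requests_per_day": requests,
        "urls_lost_per_failure": bucket_size,
        "expected_recrawl_urls": expected_recrawl,
        "hours_until_return": hours_until_return,
    }


# Assumed node: 1,000,000 URLs per day, 2% of buckets failing outright.
small = bucket_tradeoff(urls_per_day=1_000_000, bucket_size=10_000, failure_rate=0.02)
large = bucket_tradeoff(urls_per_day=1_000_000, bucket_size=50_000, failure_rate=0.02)

for name, stats in (("10,000-URL buckets", small), ("50,000-URL buckets", large)):
    print(name, stats)
```

Under these assumptions the larger bucket cuts the number of bucket requests by a factor of five, while each failed bucket costs five times as many URLs and each bucket takes five times as long to be returned. If larger buckets also fail more often, which this sketch does not model, the recrawl cost grows further.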
Norman_RKN
Senior Member


Joined: 26 Dec 2006
Posts: 444
Location: Saarland, Germany

Posted: Tue Feb 12, 2013 9:20 pm

Code:
Failed to receive more URLs, response code: BusyTryLater

What's that, eh?
_________________
http://alles-schallundrauch.blogspot.de/
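
The thread does not say how the node reacts to a `BusyTryLater` response. The name suggests the server is temporarily unable to hand out work, and a common client-side pattern for that situation is to retry with exponential backoff and jitter. The sketch below illustrates the pattern only: `request_with_backoff`, `request_bucket`, and all parameters are hypothetical names, not part of the real node software.

```python
import random
import time
from typing import Callable, Optional


def request_with_backoff(
    request_bucket: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry a bucket request while the server answers 'BusyTryLater'.

    `request_bucket` is a hypothetical callable returning a response code.
    Returns the first non-busy response, or None if every attempt was busy.
    """
    for attempt in range(max_attempts):
        response = request_bucket()
        if response != "BusyTryLater":
            return response
        if attempt < max_attempts - 1:
            # Double the upper bound on each retry, capped at max_delay,
            # and pick a random wait below it so nodes do not retry in lockstep.
            upper = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, upper))
    return None
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting; a caller would normally leave the default in place.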

All times are GMT + 2 Hours
Page 1 of 17

 