majestic12.co.uk dsearch Forum Index
Distributed Search forum
 

New buckets thread
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

PostPosted: Fri Feb 15, 2013 1:49 am    Post subject: Reply with quote

CaptainMooseInc wrote:
Crawl speed with new buckets is great! One thing, I'm currently experiencing a high number of buckets in progress that aren't open (61), with 10 buckets currently open and being crawled. Should this be considered normal?


It's too early to say because we've got some of the older data as well.

90% success is looking very nice, but it's possible older data will reduce it.

We are now crawling so hard that we are nearly maxing out our routers' capacity at the data center. Whoever turned on heavy crawl in the last 24 hours, please moderate it a bit - we have a scheduled upgrade to a 1 Gbit pipe in a week that should solve upload capacity.
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

PostPosted: Fri Feb 15, 2013 1:54 am    Post subject: Reply with quote

Grubee wrote:
Question! are we going to see good data, bad data, good data, bad data? Will the new crawling method introduce the harder to crawl buckets in the middle of good ones rather than in a bunch together?


The new crawling model is not fully implemented, but it already takes a number of errors into account to give better buckets. Some of the hard errors are not yet fully dealt with (timeouts can be due to a site being temporarily down or just a node issue, so there is no easy way to trust them 100%), however I expect that once the new model is fully active we should be getting even better buckets than we have now.

It's been a pretty long way to get where we are, but you can see real results now and it should not take long to improve on them.

Grubee wrote:
And finally, when did you last feed the squirrels?


Years ago :cry:
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

PostPosted: Fri Feb 15, 2013 2:23 am    Post subject: Reply with quote

By the way, yesterday we crawled a record amount of data - over 60 TB! :shock:

That's over 2 bln URLs - some of Majestic's own crawlers were redirected to deal with other data separately, so it's not counted in project totals on the homepage here (it's not counted for DCP LLP purposes in any event), but that's only 10% of the total so far.

We now need to upgrade our server connection ASAP and also increase indexing capacity to process all this data.

CharoGarcia
elvis1
Senior Member


Joined: 04 Apr 2009
Posts: 1768
Location: Buenos Aires

PostPosted: Fri Feb 15, 2013 3:26 am    Post subject: Reply with quote

alexc wrote:
CaptainMooseInc wrote:
Crawl speed with new buckets is great! One thing, I'm currently experiencing a high number of buckets in progress that aren't open (61), with 10 buckets currently open and being crawled. Should this be considered normal?


It's too early to say because we've got some of the older data as well.

90% success is looking very nice, but it's possible older data will reduce it.

We are now crawling so hard that we are nearly maxing out our routers' capacity at the data center. Whoever turned on heavy crawl in the last 24 hours, please moderate it a bit - we have a scheduled upgrade to a 1 Gbit pipe in a week that should solve upload capacity.

refic, may I have your attention please lol (I'm joking).
Alex the new system is great. Thanks


alexc wrote:
By the way, yesterday we crawled a record amount of data - over 60 TB! :shock:

That's over 2 bln URLs - some of Majestic's own crawlers were redirected to deal with other data separately, so it's not counted in project totals on the homepage here (it's not counted for DCP LLP purposes in any event), but that's only 10% of the total so far.

CharoGarcia


This is getting really serious. Congratulations.
CharoGarcia CharoGarcia
_________________
"There will soon be infinite supply of buckets - we are not far, after that I'll dare you to crawl them all"
Alex dixit :)
refic
Moderator


Joined: 27 Sep 2005
Posts: 3212
Location: Oulu, Finland

PostPosted: Fri Feb 15, 2013 8:17 am    Post subject: Reply with quote

alexc wrote:
whoever turned on heavy crawl in the last 24 hours please moderate it a bit


:cry: :cry:
_________________
Deadly_Fire
Senior Member


Joined: 28 Jan 2008
Posts: 1589

PostPosted: Fri Feb 15, 2013 8:25 am    Post subject: Reply with quote

Alex, when do you think the server connection will be upgraded? I only have 1 node running now but am planning on adding a few more, though I don't want to cause you any trouble with the router/bandwidth situation.
[eNeRGy]
Senior Member


Joined: 22 Jul 2008
Posts: 417
Location: Nijmegen, The Netherlands

PostPosted: Fri Feb 15, 2013 8:41 am    Post subject: Reply with quote

Deadly_Fire wrote:
Alex, when do you think the server connection will be upgraded? I only have 1 node running now but am planning on adding a few more, though I don't want to cause you any trouble with the router/bandwidth situation.


Quote:
... we got scheduled upgrade in a week to 1 Gbit pipe that should solve upload capacity.

_________________

Norman_RKN
Senior Member


Joined: 26 Dec 2006
Posts: 444
Location: Saarland, Germany

PostPosted: Fri Feb 15, 2013 8:59 am    Post subject: Reply with quote

Same problem as [eNeRGy], with buckets that are not open on all the nodes.
Open: 18 (+45/0) :?
_________________
http://alles-schallundrauch.blogspot.de/

rilian
Senior Member


Joined: 30 Mar 2008
Posts: 1074

PostPosted: Fri Feb 15, 2013 10:23 am    Post subject: Reply with quote

Norman_RKN wrote:
Same problem as [eNeRGy], with buckets that are not open on all the nodes.
Open: 18 (+45/0) :?

same here
_________________
I crunch&crawl for Ukraine
J.Kampmeier
Full Member


Joined: 22 Nov 2005
Posts: 153

PostPosted: Fri Feb 15, 2013 11:41 am    Post subject: Reply with quote

Me too. 14 open and 106/0 CharoGarcia
_________________

Movieman
Senior Member


Joined: 20 Nov 2006
Posts: 2056

PostPosted: Fri Feb 15, 2013 12:31 pm    Post subject: Reply with quote

Same as the other guys BUT Dear Lord! I was popping 14 Mbit on one node for a while last night... I almost needed to change my shorts when I saw that. :lol:
Was seeing an 89% success rate for a while... MINNNGA! :o
rilian
Senior Member


Joined: 30 Mar 2008
Posts: 1074

PostPosted: Fri Feb 15, 2013 1:16 pm    Post subject: Reply with quote

Is there a reason now to have a zillion nodes per Linux machine to use all the downstream?

We need a new topic: suggested structure for dedicated crawlers.
_________________
I crunch&crawl for Ukraine
CaptainMooseInc
Senior Member


Joined: 10 May 2006
Posts: 681

PostPosted: Fri Feb 15, 2013 2:04 pm    Post subject: Reply with quote

The problem is getting worse. I currently have 9 open buckets. There are 123 that are in progress but not open. My max open buckets is set to 80. Last night around 10PM CST, the node threw an error message in the log saying that it found too many open buckets and wouldn't request more. Well, apparently, it ignored that and proceeded to get an additional 40+ buckets overnight.

I just watched the node download 2 new buckets while over the 80 limit and proceed to load them up into the "in progress but not open" section...

I smell a bug!
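
For what it's worth, here is a rough Python sketch of the kind of check I would have expected. This is only a guess: `NodeState` and `may_request_more` are made-up names, not anything from the actual node software, and it assumes the limit is meant to cover buckets that are in progress but not open as well as the open ones.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of a node's bucket counts (not the real client's data model)."""
    open_buckets: int      # buckets currently open and being crawled
    pending_buckets: int   # buckets "in progress but not open"
    max_open_buckets: int  # user-configured limit


def may_request_more(state: NodeState, batch_size: int = 1) -> bool:
    """Allow a new request only if open + pending + batch stays within the limit.

    Assumption: pending buckets count toward the limit. If the real client
    counts only open buckets, that would explain the behaviour described above.
    """
    held = state.open_buckets + state.pending_buckets
    return held + batch_size <= state.max_open_buckets


# The numbers from this post: 9 open, 123 pending, limit of 80.
state = NodeState(open_buckets=9, pending_buckets=123, max_open_buckets=80)
print(may_request_more(state))
```

With a check like this the node would stop requesting as soon as the combined count hits the limit, instead of piling new buckets into the "not open" list.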
_________________
Movieman
Senior Member


Joined: 20 Nov 2006
Posts: 2056

PostPosted: Fri Feb 15, 2013 2:10 pm    Post subject: Reply with quote

Hey captain: Just send them over to me! :lol:
alexc
Site Admin


Joined: 17 Dec 2004
Posts: 25292
Location: Birmingham, UK

PostPosted: Fri Feb 15, 2013 3:15 pm    Post subject: Reply with quote

We've had some other issues that are being fixed now.

Our network upgrade will happen next Saturday. We did a small thing to reduce bandwidth usage, but it's not a full solution yet, so until next week please don't push too hard.
All times are GMT + 2 Hours
Page 4 of 17

 