| Author |
Message |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Fri Feb 15, 2013 1:49 am Post subject: |
|
|
| CaptainMooseInc wrote: | | Crawl speed with new buckets is great! One thing, I'm currently experiencing a high number of buckets in progress that aren't open (61), with 10 buckets currently open and being crawled. Should this be considered normal? |
It's too early to say because we've got some of the older data as well.
90% success is looking very nice, but it's possible older data will reduce it.
We are now crawling so hard that we are close to maxing out our routers' capacity at the data center. Whoever turned on heavy crawl in the last 24 hours, please moderate it a bit: we have an upgrade to a 1 Gbit pipe scheduled in a week, which should solve the upload capacity problem. |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Fri Feb 15, 2013 1:54 am Post subject: |
|
|
| Grubee wrote: | | Question! are we going to see good data, bad data, good data, bad data? Will the new crawling method introduce the harder to crawl buckets in the middle of good ones rather than in a bunch together? |
The new crawling model is not fully implemented, but it already takes a number of errors into account to give better buckets. Some of the hard errors are not yet fully dealt with (a timeout can be due to the site being temporarily down or just a node issue, so there is no easy way to trust them 100%). However, I expect that once the new model is fully active we should be getting even better buckets than the ones we have out now.
It's been a pretty long road to get where we are, but you can see real results now and it should not take long to improve on them.
| Grubee wrote: | | And finally, when did you last feed the squirrels? |
Years ago  |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Fri Feb 15, 2013 2:23 am Post subject: |
|
|
By the way, yesterday we crawled a record amount of data: over 60 TB!
That's over 2 billion URLs. Some of Majestic's own crawlers were redirected to deal with other data separately, so their output is not counted in the project totals on the homepage here (it's not counted for DCP LLP purposes in any event), but that's only 10% of the total so far.
We now need to upgrade our server connection ASAP and also increase indexing capacity to process all this data.
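For a rough sense of scale, the figures above can be turned into an average page size and an average throughput. This is only a back-of-envelope sketch, not something from the post itself: it assumes decimal units (1 TB = 10**12 bytes), reads "2 bln" as 2 * 10**9 URLs, and pretends the crawl was spread evenly over 24 hours. The function names are made up for illustration.

```python
# Back-of-envelope check of the quoted crawl figures.
# Assumptions (not from the post): 1 TB = 10**12 bytes, 2 bln = 2 * 10**9 URLs,
# and the data was fetched evenly over one 24-hour day.

TOTAL_BYTES = 60 * 10**12      # "over 60 TB", so treat this as a lower bound
TOTAL_URLS = 2 * 10**9         # "over 2 bln URLs", also a lower bound
SECONDS_PER_DAY = 24 * 60 * 60


def average_bytes_per_url(total_bytes, total_urls):
    """Mean number of bytes fetched per URL."""
    return total_bytes / total_urls


def average_gbit_per_second(total_bytes, seconds):
    """Mean throughput in Gbit/s (decimal giga) over the given period."""
    return total_bytes * 8 / seconds / 10**9


if __name__ == "__main__":
    kb_per_url = average_bytes_per_url(TOTAL_BYTES, TOTAL_URLS) / 1000
    gbit_s = average_gbit_per_second(TOTAL_BYTES, SECONDS_PER_DAY)
    print(f"about {kb_per_url:.0f} KB per URL")
    print(f"about {gbit_s:.2f} Gbit/s averaged over the day")
```

Note that the throughput here is the combined download rate across all crawling nodes, which is a different quantity from the traffic on the data center connection mentioned above.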
 |
|
|
 |
elvis1 Senior Member

Joined: 04 Apr 2009 Posts: 1768 Location: Buenos Aires
|
Posted: Fri Feb 15, 2013 3:26 am Post subject: |
|
|
| alexc wrote: | | CaptainMooseInc wrote: | | Crawl speed with new buckets is great! One thing, I'm currently experiencing a high number of buckets in progress that aren't open (61), with 10 buckets currently open and being crawled. Should this be considered normal? |
It's too early to say because we've got some of the older data as well.
90% success is looking very nice, but it's possible older data will reduce it.
We are now crawling so hard that we are close to maxing out our routers' capacity at the data center. Whoever turned on heavy crawl in the last 24 hours, please moderate it a bit: we have an upgrade to a 1 Gbit pipe scheduled in a week, which should solve the upload capacity problem. |
refic, may I have your attention please, lol (I'm joking).
Alex, the new system is great. Thanks.
| alexc wrote: | By the way, yesterday we crawled a record amount of data: over 60 TB!
That's over 2 billion URLs. Some of Majestic's own crawlers were redirected to deal with other data separately, so their output is not counted in the project totals on the homepage here (it's not counted for DCP LLP purposes in any event), but that's only 10% of the total so far.
 |
This is getting really serious. Congratulations.
 _________________ "There will soon be infinite supply of buckets - we are not far, after that I'll dare you to crawl them all"
Alex dixit  |
|
|
 |
refic Moderator

Joined: 27 Sep 2005 Posts: 3212 Location: Oulu, Finland
|
Posted: Fri Feb 15, 2013 8:17 am Post subject: |
|
|
| alexc wrote: | | whoever turned on heavy crawl in the last 24 hours please moderate it a bit |
 |
|
|
 |
Deadly_Fire Senior Member

Joined: 28 Jan 2008 Posts: 1589
|
Posted: Fri Feb 15, 2013 8:25 am Post subject: |
|
|
| Alex, when do you think the server connection will be upgraded? I only have 1 node running now but am planning on adding a few more, though I don't want to cause you any trouble with the router/bandwidth situation. |
|
|
 |
[eNeRGy] Senior Member

Joined: 22 Jul 2008 Posts: 417 Location: Nijmegen, The Netherlands
|
Posted: Fri Feb 15, 2013 8:41 am Post subject: |
|
|
| Deadly_Fire wrote: | | Alex, when do you think the server connection will be upgraded? I only have 1 node running now but am planning on adding a few more, though I don't want to cause you any trouble with the router/bandwidth situation. |
| Quote: | | ... we have an upgrade to a 1 Gbit pipe scheduled in a week, which should solve the upload capacity problem. |
 |
|
|
 |
Norman_RKN Senior Member

Joined: 26 Dec 2006 Posts: 444 Location: Saarland, Germany
|
|
|
 |
rilian Senior Member

Joined: 30 Mar 2008 Posts: 1074
|
Posted: Fri Feb 15, 2013 10:23 am Post subject: |
|
|
| Norman_RKN wrote: | Same problem as [eNeRGy], with not-open buckets on all the nodes.
Open: 18 (+45/0)  |
Same here. _________________
I crunch&crawl for Ukraine  |
|
|
 |
J.Kampmeier Full Member
Joined: 22 Nov 2005 Posts: 153
|
Posted: Fri Feb 15, 2013 11:41 am Post subject: |
|
|
Me too. 14 open and 106/0.
 |
|
|
 |
Movieman Senior Member
Joined: 20 Nov 2006 Posts: 2056
|
Posted: Fri Feb 15, 2013 12:31 pm Post subject: |
|
|
Same as the other guys, BUT Dear Lord! I was popping 14 Mbit on one node for a while last night. I almost needed to change my shorts when I saw that.
Was seeing an 89% success rate for a while... MINNNGA!  |
|
|
 |
rilian Senior Member

Joined: 30 Mar 2008 Posts: 1074
|
Posted: Fri Feb 15, 2013 1:16 pm Post subject: |
|
|
Is there a reason now to run a zillion nodes on one Linux machine in order to use all the downstream?
We need a new topic with suggestions on how to structure dedicated crawlers. _________________
I crunch&crawl for Ukraine  |
|
|
 |
CaptainMooseInc Senior Member

Joined: 10 May 2006 Posts: 681
|
Posted: Fri Feb 15, 2013 2:04 pm Post subject: |
|
|
The problem is getting worse. I currently have 9 open buckets. There are 123 that are in progress but not open. My max open buckets is set to 80. Last night around 10 PM CST, the node logged an error message saying that it had found too many open buckets and wouldn't request more. Well, apparently it ignored that and proceeded to get an additional 40+ buckets overnight.
I just watched the node download 2 new buckets while over the 80 limit and proceed to load them up into the "in progress but not open" section...
I smell a bug!
  |
|
|
 |
Movieman Senior Member
Joined: 20 Nov 2006 Posts: 2056
|
Posted: Fri Feb 15, 2013 2:10 pm Post subject: |
|
|
Hey captain: Just send them over to me!  |
|
|
 |
alexc Site Admin

Joined: 17 Dec 2004 Posts: 25292 Location: Birmingham, UK
|
Posted: Fri Feb 15, 2013 3:15 pm Post subject: |
|
|
We've had some other issues that are being fixed now.
Our network upgrade will happen next Saturday. We made a small change to reduce bandwidth usage, but it's not a full solution yet, so until next week please don't push too hard. |
|
|
 |
|