Majestic-12 : Projects : C# HTML parser (.NET)
Free .NET HTML parser (C#) is an open source, high-performance .NET C# module created to parse HTML for links, indexing and other purposes. Full source code (~5k lines) is available under the BSD license, which means you can use it in your commercial applications. This cross-platform code is verified to run very well under Mono. The parser is 100% self-contained managed code that does not depend on any external DLLs apart from the core .NET libraries. We use this parser to process well over 3 TB of HTML every day.
I created this module for use in the Distributed Search Engine, which required processing terabytes of HTML on a daily basis, so naturally it had to be done very fast. Thus, the focus of this project was high performance. I've spent countless hours making sure it's fast, and you will be able to benchmark it on your own hardware. As a reference point, Majestic-12's homepage snapshot (20 KB) is parsed in under 0.47 ms by v3.0 (v1.0 took under 2 ms) on an Athlon x2 3800 (2 GHz) PC, using a single core and dual-channel DDR 400.
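If you want to reproduce this kind of measurement on your own hardware, a minimal timing harness might look like the sketch below. It is not part of the parser's source: `ParseBenchmark`, `AverageMilliseconds` and `CountTagOpeners` are hypothetical names used for illustration only, and the workload simply counts `<` bytes as a stand-in because the parser's own API is not described on this page. Replace the delegate with a call into the parser to time the real thing.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ParseBenchmark
{
    // Returns the average wall-clock time of one call to 'parse', in milliseconds.
    // 'parse' is whatever you want to time; here it is a hypothetical delegate.
    public static double AverageMilliseconds(Action<byte[]> parse, byte[] html, int iterations)
    {
        if (parse == null) throw new ArgumentNullException("parse");
        if (html == null) throw new ArgumentNullException("html");
        if (iterations <= 0) throw new ArgumentOutOfRangeException("iterations");

        // One untimed warm-up call so JIT compilation is not included in the result.
        parse(html);

        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            parse(html);
        }
        timer.Stop();

        return timer.Elapsed.TotalMilliseconds / iterations;
    }

    // Placeholder workload (an assumption, NOT the real parser): counts '<' bytes.
    public static int CountTagOpeners(byte[] html)
    {
        int count = 0;
        foreach (byte b in html)
        {
            if (b == (byte)'<') count++;
        }
        return count;
    }

    public static void Main()
    {
        // In a real run, load a saved page from disk instead of this tiny sample.
        byte[] html = Encoding.UTF8.GetBytes("<html><body><a href=\"/\">home</a></body></html>");

        double ms = AverageMilliseconds(data => CountTagOpeners(data), html, 10000);
        Console.WriteLine("Average per pass: {0:F5} ms", ms);
    }
}
```

Averaging over many iterations matters here because a single sub-millisecond parse is too short to time reliably. Run a release build on an otherwise idle machine to get figures comparable to the ones quoted above.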
The current version is about 4 (!) times faster than the one released last year (previously quoted as 2 times). It also supports non-English text via encodings (see Main.cs for details) as well as Unicode characters specified via entities, and it should be more suitable for XML parsing.
There are NUnit tests that cover approximately 71% of the code, including 91% of the key TagParser.cs, which deals with tag parsing. You can help by adding to the existing tests. It is best to use TestDriven.NET for this, as it makes it easy to run the tests and see how much of the code they cover.
I would be very interested to know how this module compares to others, so if you have done any testing, please email me the results. It would also be nice to get a few more NUnit test cases for automated testing, as it is very easy to break the parser in a subtle way that won't be immediately apparent.
Finally, if you manage to squeeze more speed out of it, it would be nice if you shared the changes with me. This would help you too: I am certainly going to try to make it faster than it is now, so if your changes are already in my version, you won't have to merge my later changes into yours.
Download: HTMLparser.zip v3.1.4 (08/08/08) (625 KB)
Old versions (avoid using them as they are no longer supported):
HTMLparser.zip v2.0.0 (04/12/06) (150 KB)
HTMLparser.zip v1.0.1 (04/02/06) (80 KB)