How to parse HTML in .NET
21 Dec 2015
A short post about different parsers in .NET which can help you work with HTML files.
HtmlAgilityPack is one of the best-known HTML parsers in the .NET world. It is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don’t actually have to understand XPath or XSLT to use it, don’t worry…). It is a .NET code library that allows you to parse “out of the web” HTML files. The parser is very tolerant of “real world” malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).
This parser is useful when you need to parse something small and simple, and it is also very fast.
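As an illustration of the small and simple case, here is a minimal sketch that reads the page title with HtmlAgilityPack. The class and method names (HapTitleExample, GetTitle) are made up for this post; the HtmlAgilityPack NuGet package is assumed to be installed.

```csharp
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HapTitleExample
{
    // Returns the text of the first <title> element, or null if the page has none.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: malformed markup does not throw

        // SelectSingleNode takes an XPath expression and returns null when nothing matches.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title?.InnerText;
    }
}
```

Calling HapTitleExample.GetTitle with the HTML of a page gives back the title text, so a one-off lookup like this stays short and readable.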
But when you try something more involved, the code quickly becomes unreadable and hard to understand.
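To give a feeling for it, here is a sketch that collects every link inside a div with the CSS class "post". The class name "post" and the names HapXPathExample and GetPostLinks are assumptions for the example only. XPath 1.0 has no class selector, so matching one class in a space-separated list needs the concat/normalize-space idiom shown below.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HapXPathExample
{
    // Collects the href of every <a> inside <div class="... post ...">.
    public static List<string> GetPostLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class attribute with spaces so that " post " matches only the
        // whole class name and not, for example, "postscript".
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' post ')]//a[@href]");

        var links = new List<string>();
        if (nodes == null)
        {
            return links; // SelectNodes returns null, not an empty collection, when nothing matches
        }

        foreach (HtmlNode a in nodes)
        {
            links.Add(a.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The query works, but the XPath string is hard to read at a glance and the null check is easy to forget.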
Fizzler is a .NET library to select items from a node tree based on a CSS selector. The default implementation is based on HtmlAgilityPack and selects from HTML documents.
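A minimal sketch of what selecting with a CSS selector on top of HtmlAgilityPack can look like. It assumes the Fizzler.Systems.HtmlAgilityPack NuGet package, which adds QuerySelector and QuerySelectorAll extension methods to HtmlNode; the "post" class and the names FizzlerExample and GetPostLinks are invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;                  // NuGet package: HtmlAgilityPack
using Fizzler.Systems.HtmlAgilityPack;  // NuGet package: Fizzler.Systems.HtmlAgilityPack

public static class FizzlerExample
{
    // Collects the href of every <a> inside an element matching div.post.
    public static List<string> GetPostLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll returns a sequence of HtmlNode, so LINQ works directly on it.
        return doc.DocumentNode
            .QuerySelectorAll("div.post a[href]")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }
}
```

The selector "div.post a[href]" says the same thing a much longer XPath expression would, and an empty result is simply an empty sequence.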
CsQuery is a jQuery port for .NET 4. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods. The majority of the jQuery test suite (as of 1.6.2) has been ported to C#.
CSS selectors and jQuery make it really easy to access and manipulate HTML on the client. There’s no reason it should be any more difficult to do the same thing with some arbitrary HTML on the server. It’s as simple as that. Use it in web projects to do post-processing on HTML pages before they’re served, for web scraping, parsing templates, and more.
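A small sketch of the jQuery-like style, assuming the CsQuery NuGet package (1.3.4). The "post" class and the names CsQueryExample and GetPostLinks are made up for the example.

```csharp
using System.Collections.Generic;
using CsQuery; // NuGet package: CsQuery

public static class CsQueryExample
{
    // Collects the href of every <a> inside an element matching div.post.
    public static List<string> GetPostLinks(string html)
    {
        CQ dom = CQ.Create(html);

        var links = new List<string>();
        // The string indexer plays the role of $("...") in jQuery;
        // enumerating the result yields the matched DOM elements.
        foreach (IDomObject el in dom["div.post a[href]"])
        {
            links.Add(el.GetAttribute("href"));
        }
        return links;
    }
}
```

Anyone who has written jQuery on the client should find the selector syntax familiar.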
Note from the author
CsQuery is not being actively maintained. I no longer use it in my day-to-day work, and indeed don’t even work in .NET much these days! Therefore it is difficult for me to spend any time addressing problems or questions. If you post issues, I may not be able to respond to them, and it’s very unlikely I will be able to make bug fixes.
While the current release on NuGet (1.3.4) is stable, there are a couple of known bugs (see issues) and there are many changes since the last release in the repository. However, I am not going to publish any more official releases, since I don’t have time to validate the current code base and address the known issues, or support any unforeseen problems that may arise from a new release.
AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code. Also current features such as querySelectorAll work for tree traversal.
- Portable (designed as a portable class library)
- Standards-compliant (works exactly as in all modern browsers)
- Great performance (outperforms most other parsers in many cases)
- Extensible (extend with your own services)
- Useful abstractions (type helpers, jQuery like construction)
- Fully functional DOM (all the lists, iterators and events you love)
- Form submission (easily log in everywhere)
- Navigation (a BrowsingContext is like a tab - control it from .NET!).
- LINQ enhanced (use LINQ with DOM elements, naturally)
The advantage over similar libraries like HtmlAgilityPack is that CSS support (including selectors) is already built in. The parser also follows the HTML 5.1 specification, which defines error handling and element correction. The AngleSharp library focuses on standards compliance, interactivity and extensibility. It therefore gives web developers who work with C# all the possibilities they know from using the DOM in any modern browser.
The performance of AngleSharp is quite close to the performance of browsers. Even very large pages can be processed within milliseconds. AngleSharp tries to minimize memory allocations and reuses elements internally to avoid unnecessary object creation.
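A minimal sketch of parsing a string and querying it with AngleSharp. It assumes AngleSharp 0.10 or later, where the parser lives in AngleSharp.Html.Parser and the method is ParseDocument; the 0.9.x releases that were current when this post was written used the AngleSharp.Parser.Html namespace and a Parse method instead. The "post" class and the names AngleSharpExample and GetPostLinks are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Dom;    // IHtmlDocument
using AngleSharp.Html.Parser; // HtmlParser (AngleSharp 0.10+)

public static class AngleSharpExample
{
    // Collects the href of every <a> inside an element matching div.post.
    public static List<string> GetPostLinks(string html)
    {
        var parser = new HtmlParser();
        IHtmlDocument document = parser.ParseDocument(html);

        // QuerySelectorAll mirrors the browser DOM API and works with LINQ.
        return document
            .QuerySelectorAll("div.post a[href]")
            .Select(a => a.GetAttribute("href"))
            .ToList();
    }
}
```

The code reads almost like the equivalent JavaScript in a browser, which is the point of the library.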
Regex is the worst way of parsing HTML files. Try to avoid using Regex for parsing any HTML-like text, as Regex is not a tool that can correctly parse HTML. But if you really want to, a short piece of code can get all links from an HTML string.
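Here is a sketch of such a link extractor, using only System.Text.RegularExpressions. The pattern and the names RegexLinksExample and GetLinks are my own illustration, not a robust solution: it only handles quoted href values and will happily match links inside comments or scripts.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexLinksExample
{
    // Matches <a ... href="..."> or <a ... href='...'> and captures the value as "url".
    // .NET allows the same group name in both branches of the alternation.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the href values in document order.
    public static List<string> GetLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

Unquoted attribute values, a ">" inside another attribute, or commented-out markup will all give wrong results here, which is exactly why a real parser is the better choice.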