You can obtain the latest version of jsoup from Maven’s Central Repository with the following dependency definition.
The DOM and jsoup Essentials
The DOM is the language-independent representation of HTML documents, which defines the structure and the styling of the document. Figure 1 shows the class diagram of jsoup framework classes. Later, I'll show you how they map to the DOM elements.
The abstract class Node is the main element of jsoup. It represents a node in the DOM tree, which could be the document itself, a text node, a comment, or an element (that is, form elements) within the document. The Node class refers to its parent node and knows all the parent's child nodes. The Element class represents an HTML element, which consists of a tag name, attributes, and child nodes. The Attributes class is a container for the attributes of the HTML elements and is composed within the Node class.
Figure 1.
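To make the relationships among Node, Element, and Attributes more concrete, here is a minimal sketch that walks a small parsed document. It assumes jsoup is on the classpath; the class name NodeTour, the sample markup, and the helper methods describe and countTextChildren are hypothetical names introduced only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeTour {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>Demo</title></head>"
      + "<body><p id=\"intro\" class=\"lead\">Hello, <b>jsoup</b>!</p></body></html>";

    // Summarizes an element as "tag[attributeCount] <- parentNodeName".
    static String describe(Element element) {
        Attributes attributes = element.attributes(); // container held by the node
        Node parent = element.parentNode();           // every node refers to its parent
        return element.tagName() + "[" + attributes.size() + "] <- " + parent.nodeName();
    }

    // Counts the direct children of an element that are text nodes.
    static int countTextChildren(Element element) {
        int count = 0;
        for (Node child : element.childNodes()) {
            if (child instanceof TextNode) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element p = doc.getElementById("intro");

        System.out.println(describe(p));          // p[2] <- body
        System.out.println(countTextChildren(p)); // 2 ("Hello, " and "!")
        System.out.println(p.attr("class"));      // lead
    }
}
```

The paragraph element here has three child nodes: two text nodes and one nested b element. This is why the sketch iterates over childNodes() rather than children(), which returns only child elements.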
Today, enterprise Java web application developers use HTML in every aspect of a project. This work is made difficult at times because parsing HTML content is a tedious task. Doing so without a parser framework is a most undesirable chore. Fortunately, there are a handful of Java-based HTML parsers publicly available. In this article, I will focus on one of my favorites, jsoup, which was first released as open source in January 2010. It has been under active development since then by Jonathan Hedley, and the code uses the liberal MIT license.
Jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. Jsoup can manipulate the content: the HTML element itself, its attributes, or its text. It updates older content based on HTML 4.x to HTML5 or XHTML by converting deprecated tags to new versions. It can also do cleanup based on whitelists, tidy HTML output, and complete unbalanced tags automagically. I will demonstrate these features with some working examples.
All the examples in this article are based on jsoup version 1.10.2, which is the latest available version at the time of this writing. The complete source code for this article is available on GitHub.