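1 Overview
Introduction
- Jsoup is a Java HTML parser. It offers a simple, flexible, easy-to-use API for parsing HTML documents from a URL, a file, or a string, and it helps developers extract data from HTML documents, manipulate DOM elements, and handle form submissions.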
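Key features
Jsoup's key features include:
- Easy to use: Jsoup provides a small set of simple APIs that make parsing HTML straightforward. Developers can select DOM elements with a jQuery-like selector syntax and extract the data they need.
- Robust HTML handling: Jsoup supports the HTML5 standard and can handle incomplete or malformed HTML documents. It repairs errors in the markup, such as unclosed tags, and builds a sensible document tree from the input.
- Security: Jsoup includes a safelist-based cleaner (Jsoup.clean with a Safelist) that strips HTML tags and attributes not on the safelist, which helps guard against XSS attacks. Cleaning is an explicit step applied to untrusted HTML; it is not performed automatically during parsing.
- CSS selector support: Jsoup can select DOM elements with CSS selectors, which gives developers a flexible way to locate and manipulate elements in an HTML document.
- Java integration: Jsoup is written in Java and integrates directly with Java programs. Developers can process the parsed data with any Java language feature or library.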
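As a small illustration of the selector-based extraction described above, the following sketch parses an inline HTML string and pulls out link text and an attribute value. The class name `SelectorDemo`, its method names, and the sample markup are invented for illustration; the jsoup calls used are `Jsoup.parse`, `select`, `selectFirst`, `eachText`, and `attr`.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the markup below is sample data, not from a real site.
final class SelectorDemo {
    static final String HTML =
        "<ul>"
        + "<li><a href=\"/a\">Alpha</a></li>"
        + "<li><a href=\"/b\" class=\"ext\">Beta</a></li>"
        + "</ul>";

    // Returns the text of every <a> element that carries an href attribute.
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]").eachText();
    }

    // Returns the href of the first link with the given class, or "" if none matches.
    static String hrefOfClass(String html, String className) {
        Element link = Jsoup.parse(html).selectFirst("a." + className);
        return link == null ? "" : link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(HTML));
        System.out.println(hrefOfClass(HTML, "ext"));
    }
}
```

`select` returns every match as an `Elements` collection, while `selectFirst` returns a single `Element` or `null`, so the null check is needed when a match is not guaranteed.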
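Use cases
Use cases for Jsoup in big-data and cloud-computing work include, but are not limited to:
- Web scraping: Jsoup helps developers extract the data they need from web pages, for example when crawling news articles or product information. Parsing the HTML document gives quick and precise access to that data.
- Data cleaning and processing: Cloud workloads often involve large volumes of data that must be cleaned and processed. Jsoup can parse HTML documents, extract the relevant data, and hand it on for further processing and analysis.
- Page content analysis: Jsoup helps developers analyze page content, for example by extracting keywords or counting how often a tag occurs. This is useful for search engine optimization and web page analysis.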
Comparable tools
Tools for parsing HTML documents in crawlers include:
- GitHub - jhy/jsoup: jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
- https://jsoup.org/
- https://mvnrepository.com/artifact/org.jsoup/jsoup/1.12.2
- Beautiful Soup: We called him Tortoise because he taught us.
- GitHub - DeronW/beautifulsoup at v4.4.0
- https://beautifulsoup.readthedocs.io/
- https://beautifulsoup.readthedocs.io/zh-cn/v4.4.0/
2 Usage Guide
Adding the dependency
- <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
- <dependency>
- <groupId>org.jsoup</groupId>
- <artifactId>jsoup</artifactId>
- <!-- 1.12.2 / 1.14.3 / 1.17.2 -->
- <version>1.14.3</version>
- </dependency>
Core API
org.jsoup.Jsoup
- package org.jsoup;
- import java.io.File;
- import java.io.IOException;
- import java.io.InputStream;
- import java.net.URL;
- import javax.annotation.Nullable;
- import org.jsoup.helper.DataUtil;
- import org.jsoup.helper.HttpConnection;
- import org.jsoup.nodes.Document;
- import org.jsoup.parser.Parser;
- import org.jsoup.safety.Cleaner;
- import org.jsoup.safety.Safelist;
- import org.jsoup.safety.Whitelist;
- public class Jsoup {
- private Jsoup() {
- }
- public static Document parse(String html, String baseUri) {
- return Parser.parse(html, baseUri);
- }
- public static Document parse(String html, String baseUri, Parser parser) {
- return parser.parseInput(html, baseUri);
- }
- public static Document parse(String html, Parser parser) {
- return parser.parseInput(html, "");
- }
- public static Document parse(String html) {
- return Parser.parse(html, "");
- }
- public static Connection connect(String url) {
- return HttpConnection.connect(url);
- }
- public static Connection newSession() {
- return new HttpConnection();
- }
- public static Document parse(File file, @Nullable String charsetName, String baseUri) throws IOException {
- return DataUtil.load(file, charsetName, baseUri);
- }
- public static Document parse(File file, @Nullable String charsetName) throws IOException {
- return DataUtil.load(file, charsetName, file.getAbsolutePath());
- }
- public static Document parse(File file, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
- return DataUtil.load(file, charsetName, baseUri, parser);
- }
- public static Document parse(InputStream in, @Nullable String charsetName, String baseUri) throws IOException {
- return DataUtil.load(in, charsetName, baseUri);
- }
- public static Document parse(InputStream in, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
- return DataUtil.load(in, charsetName, baseUri, parser);
- }
- public static Document parseBodyFragment(String bodyHtml, String baseUri) {
- return Parser.parseBodyFragment(bodyHtml, baseUri);
- }
- public static Document parseBodyFragment(String bodyHtml) {
- return Parser.parseBodyFragment(bodyHtml, "");
- }
- public static Document parse(URL url, int timeoutMillis) throws IOException {
- Connection con = HttpConnection.connect(url);
- con.timeout(timeoutMillis);
- return con.get();
- }
- public static String clean(String bodyHtml, String baseUri, Safelist safelist) {
- Document dirty = parseBodyFragment(bodyHtml, baseUri);
- Cleaner cleaner = new Cleaner(safelist);
- Document clean = cleaner.clean(dirty);
- return clean.body().html();
- }
- /** @deprecated */
- @Deprecated
- public static String clean(String bodyHtml, String baseUri, Whitelist safelist) {
- return clean(bodyHtml, baseUri, (Safelist)safelist);
- }
- public static String clean(String bodyHtml, Safelist safelist) {
- return clean(bodyHtml, "", safelist);
- }
- /** @deprecated */
- @Deprecated
- public static String clean(String bodyHtml, Whitelist safelist) {
- return clean(bodyHtml, (Safelist)safelist);
- }
- public static String clean(String bodyHtml, String baseUri, Safelist safelist, Document.OutputSettings outputSettings) {
- Document dirty = parseBodyFragment(bodyHtml, baseUri);
- Cleaner cleaner = new Cleaner(safelist);
- Document clean = cleaner.clean(dirty);
- clean.outputSettings(outputSettings);
- return clean.body().html();
- }
- /** @deprecated */
- @Deprecated
- public static String clean(String bodyHtml, String baseUri, Whitelist safelist, Document.OutputSettings outputSettings) {
- return clean(bodyHtml, baseUri, (Safelist)safelist, outputSettings);
- }
- public static boolean isValid(String bodyHtml, Safelist safelist) {
- return (new Cleaner(safelist)).isValidBodyHtml(bodyHtml);
- }
- /** @deprecated */
- @Deprecated
- public static boolean isValid(String bodyHtml, Whitelist safelist) {
- return isValid(bodyHtml, (Safelist)safelist);
- }
- }
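The `clean` and `isValid` methods in the listing above are the entry points for sanitizing untrusted HTML. The sketch below shows one way to call them, assuming jsoup 1.14.2 or later (where `Safelist` replaces the deprecated `Whitelist`) and using the built-in `Safelist.basic()` preset. The class name `CleanDemo` and the sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class showing Jsoup.clean and Jsoup.isValid.
final class CleanDemo {
    // Removes every tag and attribute that Safelist.basic() does not permit.
    static String sanitize(String untrustedBodyHtml) {
        return Jsoup.clean(untrustedBodyHtml, Safelist.basic());
    }

    // Reports whether the input already conforms to Safelist.basic().
    static boolean isSafe(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a>"
            + "<script>alert(1)</script></p>";
        System.out.println(sanitize(unsafe)); // the onclick attribute and the script element are dropped
        System.out.println(isSafe("<b>ok</b>"));
        System.out.println(isSafe(unsafe));
    }
}
```

`isValid` only checks the input; when the sanitized markup is needed, call `clean` and use its return value instead of the original string.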
org.jsoup.nodes.Node
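Key API
Most of the lookup and manipulation methods below are declared on Element, a subclass of Node; attr(), attributes(), and outerHtml() come from Node itself.
- Find an element by id: getElementById(String id)
- Find elements by tag: getElementsByTag(String tag)
- Find elements by class: getElementsByClass(String className)
- Find elements by attribute: getElementsByAttribute(String key)
- Sibling traversal: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Moving between levels: parent(), children(), child(int index)
These methods return Element or Elements objects, which expose the following accessors:
- attr(String key): get the value of an attribute
- attributes(): get all attributes of the node
- id(): get the node's id
- className(): get the value of the node's class attribute as a single string
- classNames(): get the set of all class names on the node
- text(): get the combined text of the node and its descendants
- html(): get the node's inner HTML
- outerHtml(): get the node's outer HTML
- data(): get the node's data content, used for script or style tags
- tag(): get the Tag object
- tagName(): get the node's tag name
With these APIs, manipulating the DOM is as convenient as it is with jQuery. The following methods modify an element:
- text(String value): set the text content
- html(String value): replace the inner HTML
- append(String html): add the HTML as the last child of the element
- prepend(String html): add the HTML as the first child of the element
- appendText(String text), prependText(String text): add a text node at the end or the start of the element
- appendElement(String tagName), prependElement(String tagName): create a new child element at the end or the start of the element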
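To make the lookup and manipulation methods listed above concrete, here is a minimal sketch that finds an element by id and edits its children. The class name `EditDemo`, the `data-count` attribute, and the sample list are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the markup and the data-count attribute are sample data.
final class EditDemo {
    // Builds a one-item list, edits it through the Element API, and returns the <ul>.
    static Element editedList() {
        Document doc = Jsoup.parse("<ul id=\"list\"><li>two</li></ul>");
        Element list = doc.getElementById("list");
        list.prepend("<li>one</li>");          // parsed and inserted as the first child
        list.append("<li>three</li>");         // parsed and inserted as the last child
        list.appendElement("li").text("four"); // creates an empty <li>, then sets its text
        list.attr("data-count", String.valueOf(list.children().size()));
        return list;
    }

    public static void main(String[] args) {
        Element list = editedList();
        System.out.println(list.children().eachText());
        System.out.println(list.attr("data-count"));
    }
}
```

Note that `append` and `prepend` insert inside the element; to place markup next to an element as a sibling, `after` and `before` are the corresponding calls.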
Source code
- package org.jsoup.nodes;
- import java.io.IOException;
- import java.util.ArrayList;
- import java.util.Arrays;
- import java.util.Collections;
- import java.util.Iterator;
- import java.util.LinkedList;
- import java.util.List;
- import javax.annotation.Nullable;
- import org.jsoup.SerializationException;
- import org.jsoup.helper.Validate;
- import org.jsoup.internal.StringUtil;
- import org.jsoup.select.NodeFilter;
- import org.jsoup.select.NodeTraversor;
- import org.jsoup.select.NodeVisitor;
- public abstract class Node implements Cloneable {
- static final List<Node> EmptyNodes = Collections.emptyList();
- static final String EmptyString = "";
- @Nullable
- Node parentNode;
- int siblingIndex;
- protected Node() {
- }
- public abstract String nodeName();
- protected abstract boolean hasAttributes();
- public boolean hasParent() {
- return this.parentNode != null;
- }
- public String attr(String attributeKey) {
- ...
- }
-
- public abstract Attributes attributes();
-
- public int attributesSize() {
- return this.hasAttributes() ? this.attributes().size() : 0;
- }
- public Node attr(String attributeKey, String attributeValue) {
- attributeKey = NodeUtils.parser(this).settings().normalizeAttribute(attributeKey);
- this.attributes().putIgnoreCase(attributeKey, attributeValue);
- return this;
- }
- public boolean hasAttr(String attributeKey) {
- Validate.notNull(attributeKey);
- if (!this.hasAttributes()) {
- return false;
- } else {
- if (attributeKey.startsWith("abs:")) {
- String key = attributeKey.substring("abs:".length());
- if (this.attributes().hasKeyIgnoreCase(key) && !this.absUrl(key).isEmpty()) {
- return true;
- }
- }
- return this.attributes().hasKeyIgnoreCase(attributeKey);
- }
- }
- public Node removeAttr(String attributeKey) {
- Validate.notNull(attributeKey);
- if (this.hasAttributes()) {
- this.attributes().removeIgnoreCase(attributeKey);
- }
- return this;
- }
- public Node clearAttributes() {
- if (this.hasAttributes()) {
- Iterator<Attribute> it = this.attributes().iterator();
- while(it.hasNext()) {
- it.next();
- it.remove();
- }
- }
- return this;
- }
- public abstract String baseUri();
- protected abstract void doSetBaseUri(String var1);
- public void setBaseUri(String baseUri) {
- Validate.notNull(baseUri);
- this.doSetBaseUri(baseUri);
- }
- public String absUrl(String attributeKey) {
- Validate.notEmpty(attributeKey);
- return this.hasAttributes() && this.attributes().hasKeyIgnoreCase(attributeKey) ? StringUtil.resolve(this.baseUri(), this.attributes().getIgnoreCase(attributeKey)) : "";
- }
- protected abstract List<Node> ensureChildNodes();
- public Node childNode(int index) {
- return (Node)this.ensureChildNodes().get(index);
- }
- public List<Node> childNodes() {
- if (this.childNodeSize() == 0) {
- return EmptyNodes;
- } else {
- List<Node> children = this.ensureChildNodes();
- List<Node> rewrap = new ArrayList(children.size());
- rewrap.addAll(children);
- return Collections.unmodifiableList(rewrap);
- }
- }
- public List<Node> childNodesCopy() {
- List<Node> nodes = this.ensureChildNodes();
- ArrayList<Node> children = new ArrayList(nodes.size());
- Iterator var3 = nodes.iterator();
- while(var3.hasNext()) {
- Node node = (Node)var3.next();
- children.add(node.clone());
- }
- return children;
- }
- public abstract int childNodeSize();
- protected Node[] childNodesAsArray() {
- return (Node[])this.ensureChildNodes().toArray(new Node[0]);
- }
- public abstract Node empty();
- @Nullable
- public Node parent() {
- return this.parentNode;
- }
- @Nullable
- public final Node parentNode() {
- return this.parentNode;
- }
- public Node root() {
- Node node;
- for(node = this; node.parentNode != null; node = node.parentNode) {
- }
- return node;
- }
- @Nullable
- public Document ownerDocument() {
- Node root = this.root();
- return root instanceof Document ? (Document)root : null;
- }
- public void remove() {
- Validate.notNull(this.parentNode);
- this.parentNode.removeChild(this);
- }
- public Node before(String html) {
- this.addSiblingHtml(this.siblingIndex, html);
- return this;
- }
- public Node before(Node node) {
- Validate.notNull(node);
- Validate.notNull(this.parentNode);
- this.parentNode.addChildren(this.siblingIndex, node);
- return this;
- }
- public Node after(String html) {
- this.addSiblingHtml(this.siblingIndex + 1, html);
- return this;
- }
- public Node after(Node node) {
- Validate.notNull(node);
- Validate.notNull(this.parentNode);
- this.parentNode.addChildren(this.siblingIndex + 1, node);
- return this;
- }
- private void addSiblingHtml(int index, String html) {
- Validate.notNull(html);
- Validate.notNull(this.parentNode);
- Element context = this.parent() instanceof Element ? (Element)this.parent() : null;
- List<Node> nodes = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
- this.parentNode.addChildren(index, (Node[])nodes.toArray(new Node[0]));
- }
- public Node wrap(String html) {
- Validate.notEmpty(html);
- Element context = this.parentNode != null && this.parentNode instanceof Element ? (Element)this.parentNode : (this instanceof Element ? (Element)this : null);
- List<Node> wrapChildren = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
- Node wrapNode = (Node)wrapChildren.get(0);
- if (!(wrapNode instanceof Element)) {
- return this;
- } else {
- Element wrap = (Element)wrapNode;
- Element deepest = this.getDeepChild(wrap);
- if (this.parentNode != null) {
- this.parentNode.replaceChild(this, wrap);
- }
- deepest.addChildren(new Node[]{this});
- if (wrapChildren.size() > 0) {
- for(int i = 0; i < wrapChildren.size(); ++i) {
- Node remainder = (Node)wrapChildren.get(i);
- if (wrap != remainder) {
- if (remainder.parentNode != null) {
- remainder.parentNode.removeChild(remainder);
- }
- wrap.after(remainder);
- }
- }
- }
- return this;
- }
- }
- @Nullable
- public Node unwrap() {
- Validate.notNull(this.parentNode);
- List<Node> childNodes = this.ensureChildNodes();
- Node firstChild = childNodes.size() > 0 ? (Node)childNodes.get(0) : null;
- this.parentNode.addChildren(this.siblingIndex, this.childNodesAsArray());
- this.remove();
- return firstChild;
- }
- private Element getDeepChild(Element el) {
- List<Element> children = el.children();
- return children.size() > 0 ? this.getDeepChild((Element)children.get(0)) : el;
- }
- void nodelistChanged() {
- }
- public void replaceWith(Node in) {
- Validate.notNull(in);
- Validate.notNull(this.parentNode);
- this.parentNode.replaceChild(this, in);
- }
- protected void setParentNode(Node parentNode) {
- Validate.notNull(parentNode);
- if (this.parentNode != null) {
- this.parentNode.removeChild(this);
- }
- this.parentNode = parentNode;
- }
- protected void replaceChild(Node out, Node in) {
- Validate.isTrue(out.parentNode == this);
- Validate.notNull(in);
- if (in.parentNode != null) {
- in.parentNode.removeChild(in);
- }
- int index = out.siblingIndex;
- this.ensureChildNodes().set(index, in);
- in.parentNode = this;
- in.setSiblingIndex(index);
- out.parentNode = null;
- }
- protected void removeChild(Node out) {
- Validate.isTrue(out.parentNode == this);
- int index = out.siblingIndex;
- this.ensureChildNodes().remove(index);
- this.reindexChildren(index);
- out.parentNode = null;
- }
- protected void addChildren(Node... children) {
- List<Node> nodes = this.ensureChildNodes();
- Node[] var3 = children;
- int var4 = children.length;
- for(int var5 = 0; var5 < var4; ++var5) {
- Node child = var3[var5];
- this.reparentChild(child);
- nodes.add(child);
- child.setSiblingIndex(nodes.size() - 1);
- }
- }
- protected void addChildren(int index, Node... children) {
- ...
- }
- protected void reparentChild(Node child) {
- child.setParentNode(this);
- }
- private void reindexChildren(int start) {
- if (this.childNodeSize() != 0) {
- List<Node> childNodes = this.ensureChildNodes();
- for(int i = start; i < childNodes.size(); ++i) {
- ((Node)childNodes.get(i)).setSiblingIndex(i);
- }
- }
- }
- public List<Node> siblingNodes() {
- if (this.parentNode == null) {
- return Collections.emptyList();
- } else {
- List<Node> nodes = this.parentNode.ensureChildNodes();
- List<Node> siblings = new ArrayList(nodes.size() - 1);
- Iterator var3 = nodes.iterator();
- while(var3.hasNext()) {
- Node node = (Node)var3.next();
- if (node != this) {
- siblings.add(node);
- }
- }
- return siblings;
- }
- }
- @Nullable
- public Node nextSibling() {
- if (this.parentNode == null) {
- return null;
- } else {
- List<Node> siblings = this.parentNode.ensureChildNodes();
- int index = this.siblingIndex + 1;
- return siblings.size() > index ? (Node)siblings.get(index) : null;
- }
- }
- @Nullable
- public Node previousSibling() {
- if (this.parentNode == null) {
- return null;
- } else {
- return this.siblingIndex > 0 ? (Node)this.parentNode.ensureChildNodes().get(this.siblingIndex - 1) : null;
- }
- }
- public int siblingIndex() {
- return this.siblingIndex;
- }
- protected void setSiblingIndex(int siblingIndex) {
- this.siblingIndex = siblingIndex;
- }
- public Node traverse(NodeVisitor nodeVisitor) {
- Validate.notNull(nodeVisitor);
- NodeTraversor.traverse(nodeVisitor, this);
- return this;
- }
- public Node filter(NodeFilter nodeFilter) {
- Validate.notNull(nodeFilter);
- NodeTraversor.filter(nodeFilter, this);
- return this;
- }
- public String outerHtml() {
- StringBuilder accum = StringUtil.borrowBuilder();
- this.outerHtml(accum);
- return StringUtil.releaseBuilder(accum);
- }
- protected void outerHtml(Appendable accum) {
- NodeTraversor.traverse(new OuterHtmlVisitor(accum, NodeUtils.outputSettings(this)), this);
- }
- abstract void outerHtmlHead(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;
- abstract void outerHtmlTail(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;
- public <T extends Appendable> T html(T appendable) {
- this.outerHtml(appendable);
- return appendable;
- }
- public String toString() {
- return this.outerHtml();
- }
- protected void indent(Appendable accum, int depth, Document.OutputSettings out) throws IOException {
- accum.append('\n').append(StringUtil.padding(depth * out.indentAmount()));
- }
- public boolean equals(@Nullable Object o) {
- return this == o;
- }
- public int hashCode() {
- return super.hashCode();
- }
- public boolean hasSameValue(@Nullable Object o) {
- if (this == o) {
- return true;
- } else {
- return o != null && this.getClass() == o.getClass() ? this.outerHtml().equals(((Node)o).outerHtml()) : false;
- }
- }
- public Node clone() {
- ...
- }
- ...
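The `wrap`, `unwrap`, `childNodeSize`, and `nextSibling` methods in the Node listing operate on all node types, including text nodes, not only on elements. The following sketch illustrates that difference on a one-paragraph fragment. The class name `NodeDemo` and the sample markup are invented for the example, and pretty-printing is switched off so the output keeps the input's whitespace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical demo class for Node-level operations.
final class NodeDemo {
    // Parses a small fragment: <p> holds a text node, a <b> element, and another text node.
    static Document sample() {
        Document doc = Jsoup.parseBodyFragment("<p>Hello <b>big</b> world</p>");
        doc.outputSettings().prettyPrint(false); // do not re-indent or reflow the output
        return doc;
    }

    // Wraps the <b> element in an <i> element and returns the resulting body HTML.
    static String wrapBold() {
        Document doc = sample();
        doc.selectFirst("b").wrap("<i></i>");
        return doc.body().html();
    }

    // Replaces the <b> element with its own children and returns the resulting body HTML.
    static String unwrapBold() {
        Document doc = sample();
        doc.selectFirst("b").unwrap();
        return doc.body().html();
    }

    public static void main(String[] args) {
        Element p = sample().selectFirst("p");
        Node afterBold = p.selectFirst("b").nextSibling();
        System.out.println(p.childNodeSize());             // counts text nodes and elements
        System.out.println(p.children().size());           // counts child elements only
        System.out.println(afterBold instanceof TextNode);
        System.out.println(wrapBold());
        System.out.println(unwrapBold());
    }
}
```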
org.jsoup.nodes.Element extends Node
org.jsoup.nodes.Document extends Element
Usage examples
CASE: Parse an HTML document => get a Document object
- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
- Document doc = Jsoup.parse(html);
CASE: Parse an HTML fragment => get a Document object
- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- import org.jsoup.nodes.Element;
- String html = "<div><p>Lorem ipsum.</p>";
- Document doc = Jsoup.parseBodyFragment(html);
- Element body = doc.body();
CASE: Parse a URL => get a Document object
- org.jsoup.Connection connection = Jsoup.connect("http://example.com/");
- Document doc = connection.get();//HTTP Method = GET
- String title = doc.title();
You can also send cookies and other request parameters (similar to crawlers written in Python):
- Document doc = Jsoup.connect("http://example.com")
- .data("query", "Java")
- .userAgent("Mozilla")
- .cookie("auth", "token")
- .timeout(3000)
- .post(); //HTTP Method = POST
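CASE: Parse a local HTML file => get a Document object
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileInputStream;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import java.nio.charset.StandardCharsets;
- File input = new File("/tmp/input.html");
- Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
- /**
- * Reads the text content of a file; returns an empty string if the file cannot be read.
- * The original used an undefined ENCODE constant; UTF-8 is assumed here.
- */
- public static String openFile(String szFileName) {
- StringBuilder szContent = new StringBuilder();
- // try-with-resources closes the reader even if reading fails
- try (BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream(new File(szFileName)), StandardCharsets.UTF_8))) {
- String szTemp;
- while ((szTemp = bis.readLine()) != null) {
- szContent.append(szTemp).append("\n");
- }
- return szContent.toString();
- } catch (IOException e) {
- return "";
- }
- }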
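The third argument to `Jsoup.parse(File, String, String)` is the base URI, which jsoup uses to resolve relative links in the document. The sketch below shows the effect on a string input instead of a file, so that it runs without any file on disk. The class name `BaseUriDemo` and the sample link are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical demo class showing how the base URI affects link resolution.
final class BaseUriDemo {
    static final String HTML = "<a href=\"/docs/page.html\">Docs</a>";

    // Resolves the link's href against baseUri; returns "" if no absolute URL can be formed.
    static String absoluteHref(String baseUri) {
        Element link = Jsoup.parse(HTML, baseUri).selectFirst("a");
        return link.absUrl("href");
    }

    // Returns the href exactly as written in the markup.
    static String rawHref() {
        return Jsoup.parse(HTML).selectFirst("a").attr("href");
    }

    public static void main(String[] args) {
        System.out.println(rawHref());
        System.out.println(absoluteHref("http://example.com/"));
        System.out.println(absoluteHref("").isEmpty());
    }
}
```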
CASE : Element#hasText()/text()/ownText()/wholeText()
- <html>
- <head><title>JSoup Parse Demo</title></head>
- <body>
- <div id="demoDivId" style="min-height: 48px;"> hello<span>world<b>!</b></span> </div>
- <p>Parsed HTML into a doc.</p>
- </body>
- </html>
- final static String classpath = ClassLoader.getSystemResource("").getPath(); // e.g. /E:/source_code/xxx/xxx-bigdata/xxx-common-demo/target/classes/
- final static String htmlFilePath = classpath + "sample-dataset/html/demo.html";
- String htmlContent = htmlReader.readHtml( htmlFilePath );
- Document document = Jsoup.parse(htmlContent);
- Element element = document.body().getElementById("demoDivId");
- log.info("cssSelector:{}", element.cssSelector()); // cssSelector:#demoDivId
- log.info( "hasText:{}" , element.hasText() ); // hasText:true
- log.info( "text:{}" , element.text() ); // text:helloworld!
- log.info( "ownText:{}" , element.ownText() ); // ownText:hello
- log.info( "wholeText:{}" , element.wholeText() ); // wholeText: helloworld!
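The snippet above depends on a file on the classpath plus `htmlReader` and `log` helpers that are not shown. For readers who want to try the same comparison directly, here is a self-contained sketch that inlines the demo HTML as a string; the class name `TextDemo` is invented, and `System.out` stands in for the logger.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the markup mirrors the demo.html sample used in this case.
final class TextDemo {
    static Element demoDiv() {
        String html = "<html><head><title>JSoup Parse Demo</title></head><body>"
            + "<div id=\"demoDivId\" style=\"min-height: 48px;\"> hello<span>world<b>!</b></span> </div>"
            + "<p>Parsed HTML into a doc.</p></body></html>";
        return Jsoup.parse(html).getElementById("demoDivId");
    }

    public static void main(String[] args) {
        Element div = demoDiv();
        System.out.println("hasText:" + div.hasText());
        // text(): normalized, trimmed text of the element and all descendants
        System.out.println("text:" + div.text());
        // ownText(): only the text held directly by the element, not by child elements
        System.out.println("ownText:" + div.ownText());
        // wholeText(): text of the element and descendants with original whitespace kept
        System.out.println("wholeText:[" + div.wholeText() + "]");
    }
}
```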
CASE : Element#html()/outerHtml()
This case uses the same HTML as the previous one.
- ...
- Element element = document.body().getElementById("demoDivId");
- log.info( "html(innerHtml):{}" , element.html() );
- log.info( "outerHtml:\n{}" , element.outerHtml() );
- html(innerHtml):hello<span>world<b>!</b></span>
- outerHtml:
- <div id="demoDivId" style="min-height: 48px;">
- hello<span>world<b>!</b></span>
- </div>