易水风萧 / wind-bell

Apache-2.0

wind-bell (风铃虫)

Introduction

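wind-bell (风铃虫) is a lightweight crawler: as sensitive as a wind chime and as nimble as a spider, it picks up the slightest change and fetches content from the web with little effort. It is designed to be relatively friendly to the target server. It ships with more than twenty common and less common browser identifiers (User-Agent strings), handles cookies and referrer information automatically so that it can get past server-side restrictions, and adjusts the interval and frequency of its requests dynamically so that it does not disturb the target server. It is also friendly to ordinary users: a large set of link extractors and content extractors allows quick configuration, and a crawler can be set up from nothing more than a start URL. For advanced users, wind-bell exposes many extension interfaces for customizing crawler behaviour. It also supports distributed and clustered deployment natively, so a crawl is not limited to a single machine. In practice, wind-bell can fetch most of the content on nearly all current websites.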

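[Notice] Do not use wind-bell for any work that may violate the law or ethical constraints. Please use it considerately, respect the robots protocol (robots.txt), and do not use it for any illegal purpose. By choosing to use wind-bell you agree to these terms. The author accepts no legal risk or loss arising from your violation of them; you bear all consequences.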


Quick start

<dependency>
    <groupId>com.yishuifengxiao.common</groupId>
    <artifactId>crawler</artifactId>
    <version>2.3.0</version>
</dependency>

Discussion QQ group: 易水组件交流群 (group number 624646260)

Basic usage

Extract the names of cryptocurrencies from Yahoo Finance content pages.

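// Create a field extraction rule.
// Rule.XPATH selects the XPath extractor,
// "//h1/text()" is its XPath expression, and 0 is the order in which this extractor is applied.
ExtractFieldRule extractFieldRule = new ExtractFieldRule(Rule.XPATH, "//h1/text()", "", 0);

// Create an extraction item (requires java.util.Arrays)
ExtractRule extractRule = new ExtractRule();
extractRule
    // Item code: must not be empty and must be unique within one group of extraction rules
    .setCode("code")
    // Item name: optional
    .setName("Cryptocurrency name")
    // Attach the field extraction rule
    .setRules(Arrays.asList(extractFieldRule));

// Create a wind-bell instance
Crawler crawler = CrawlerBuilder.create()
    // Start URL of the crawl
    .startUrl("https://hk.finance.yahoo.com/cryptocurrencies")
    // wind-bell first extracts every URL from a fetched page,
    // then keeps the links that match the link rules and puts them into the request pool.
    // Links in the request pool are the seeds for the following requests.
    // Several link rules can be added; they are combined with OR.
    // If no rule is set, every link containing the domain keyword (yahoo in this example) goes into the pool.
    // Here, every link matching this regular expression is extracted.
    .addLinkRule(new MatcherRule(Pattern.REGEX, "https://hk.finance.yahoo.com/quote/.+"))
    // The content page rule tells wind-bell which pages are content pages.
    // In complex cases it can be combined with a content matching rule.
    // Data is extracted only from pages that match the content page rule;
    // wind-bell does not try to extract data from other pages.
    // Here, every page matching this regular expression is a content page.
    .contentPageRule(new MatcherRule(Pattern.REGEX, "https://hk.finance.yahoo.com/quote/.+"))
    // Several extraction items can be set; this demo adds only one
    .addExtractRule(extractRule)
    // Average interval between requests, in milliseconds.
    // If not set, the default of 10 seconds is used; the interval keeps the crawl
    // from being blocked by the server for requesting too often.
    .interval(3000)
    .creatCrawler();

// Start the crawler instance
crawler.start();

// No information outputter is set here, so the default one is used.
// The default outputter logs through logback, so check the console for results.

// wind-bell runs asynchronously, so this demo waits in a loop until the crawler stops
while (Statu.STOP != crawler.getStatu()) {
    try {
        Thread.sleep(1000 * 20);
    } catch (InterruptedException e) {
        // Restore the interrupt flag and stop waiting
        Thread.currentThread().interrupt();
        break;
    }
}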

	

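The example above extracts the names of cryptocurrencies from Yahoo Finance content pages. To extract other information, configure additional extraction rules in the same way.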

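Note: the example above is for learning and demonstration only. When crawling web content, users of wind-bell must strictly follow the applicable laws and the target site's robots protocol.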
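The XPath rule used in the example, `//h1/text()`, selects the text of a page's `h1` element. The following is a minimal sketch of that single step, written against the JDK's built-in `javax.xml.xpath` API rather than wind-bell's own extractor, whose internals are not shown in this README. The class name `XPathTitleSketch` and the sample markup are made up for illustration, and the JDK parser expects well-formed markup, which real pages often are not.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of wind-bell.
final class XPathTitleSketch {

    // Evaluates an XPath expression against well-formed markup and returns
    // the trimmed string value of the first matching node.
    static String extract(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page, assumed to be well-formed
        String page = "<html><body><h1> Bitcoin USD </h1><p>price</p></body></html>";
        System.out.println(extract(page, "//h1/text()"));
    }
}
```

Trying an expression on a small snippet like this is a cheap way to check it before putting it into an `ExtractFieldRule`.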


How wind-bell works

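The design of wind-bell is simple. It consists of five main components: the resource scheduler, the page downloader, the link parser, the content parser, and the information outputter.

Their roles are as follows: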

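  • Resource scheduler: schedules wind-bell's resources, for example storing, dispatching, and managing tasks
  • Page downloader: downloads page resources for the tasks dispatched by the scheduler
  • Link parser: parses the pages fetched by the downloader and extracts every link that meets the rules
  • Content parser: parses the content of the pages fetched by the downloader
  • Information outputter: outputs the data produced by the content parser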

The link parser is composed of a series of link extractors. At present, link extractors mainly support regular-expression extraction.
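Regex-based link extraction can be sketched with the JDK alone. The example below assumes that a link is kept when it fully matches at least one rule, in line with the OR relation described in the usage example; whether wind-bell uses a full match or a partial match is not stated here. `LinkRuleSketch` is a hypothetical name, and `Pattern` in this sketch is `java.util.regex.Pattern`, not wind-bell's own `Pattern` type.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of wind-bell.
final class LinkRuleSketch {

    // Keeps the links that fully match at least one rule (rules are OR-ed)
    // and drops duplicates while preserving the original order.
    static List<String> filter(List<String> links, List<Pattern> rules) {
        return links.stream()
                .filter(link -> rules.stream().anyMatch(rule -> rule.matcher(link).matches()))
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Pattern> rules = List.of(
                Pattern.compile("https://hk\\.finance\\.yahoo\\.com/quote/.+"));
        List<String> links = List.of(
                "https://hk.finance.yahoo.com/quote/BTC-USD",
                "https://hk.finance.yahoo.com/cryptocurrencies",
                "https://hk.finance.yahoo.com/quote/BTC-USD");
        // Only the quote page survives, and only once
        System.out.println(filter(links, rules));
    }
}
```

The dots in the host name are escaped in this sketch so that they match literal dots only; the unescaped form used in the usage example also matches, but is slightly more permissive.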

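The content parser is composed of a series of content extractors. Each extractor has a different function and suits a different parsing scenario, and extractors can be combined in several ways, including repeated and looped application.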

Each of these components exposes a customization interface, so users can configure them as needed to handle complex and even abnormal scenarios.
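How the five components fit together can be sketched as one loop. Everything in the example below is an assumption made for illustration: the interface names, their methods, and the in-memory "site" are hypothetical and are not wind-bell's real API. The point is only to show the data flow from scheduler to downloader, link parser, content parser, and outputter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the component pipeline, not wind-bell's real API.
final class PipelineSketch {

    interface Scheduler { void push(String url); String poll(); }
    interface Downloader { String download(String url); }
    interface LinkParser { List<String> links(String page); }
    interface ContentParser { String parse(String page); }
    interface Outputter { void output(String url, String data); }

    // In-memory scheduler that hands out each URL at most once.
    static final class QueueScheduler implements Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        public void push(String url) { if (seen.add(url)) queue.add(url); }
        public String poll() { return queue.poll(); }
    }

    // One crawl loop: schedule, download, parse links, parse content, output.
    static void crawl(String start, Scheduler scheduler, Downloader downloader,
                      LinkParser linkParser, ContentParser contentParser, Outputter outputter) {
        scheduler.push(start);
        for (String url = scheduler.poll(); url != null; url = scheduler.poll()) {
            String page = downloader.download(url);
            if (page == null) continue;                       // nothing fetched
            linkParser.links(page).forEach(scheduler::push);  // new seeds
            String data = contentParser.parse(page);
            if (data != null) outputter.output(url, data);    // content pages only
        }
    }

    // Collects group 1 of every match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) found.add(m.group(1));
        return found;
    }

    // Runs the loop against a tiny fake site and returns the output lines.
    static List<String> demo() {
        Map<String, String> site = Map.of(
                "/index", "<a href=\"/a\">a</a><a href=\"/b\">b</a>",
                "/a", "<h1>Page A</h1><a href=\"/index\">home</a>",
                "/b", "<h1>Page B</h1>");
        Pattern href = Pattern.compile("href=\"([^\"]+)\"");
        Pattern h1 = Pattern.compile("<h1>(.*?)</h1>");
        List<String> out = new ArrayList<>();
        crawl("/index", new QueueScheduler(), site::get,
                page -> findAll(href, page),
                page -> {
                    List<String> titles = findAll(h1, page);
                    return titles.isEmpty() ? null : titles.get(0);
                },
                (url, data) -> out.add(url + " -> " + data));
        return out;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because each stage sits behind its own interface, any one of them can be swapped without touching the loop, which is the same idea as the customization interfaces mentioned above.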

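The content extractors built into wind-bell include:

  1. Raw-text extractor
  2. Chinese-text extractor
  3. Constant extractor
  4. CSS content extractor
  5. CSS text extractor
  6. Email extractor
  7. Number extractor
  8. Regex extractor
  9. Character-removal extractor
  10. Character-replacement extractor
  11. Substring extractor
  12. XPath extractor
  13. Array slicing
  14. ...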

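When extracting text, users can combine these extractors freely to get exactly the content they need. For details on each extractor, see the content extractor usage guide.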
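Combining extractors can be pictured as a chain in which each extractor receives the output of the previous one. The sketch below assumes that ordering model, based on the order argument of `ExtractFieldRule` in the usage example. The two extractors are simplified stand-ins written with the JDK, not the library's implementations, and `ExtractorChainSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of chained extractors, not wind-bell's real classes.
final class ExtractorChainSketch {

    // Regex extractor: returns the first match of the pattern, or "" if none.
    static UnaryOperator<String> regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return input -> {
            Matcher m = compiled.matcher(input);
            return m.find() ? m.group() : "";
        };
    }

    // Character-removal extractor: deletes every occurrence of the given text.
    static UnaryOperator<String> remove(String text) {
        return input -> input.replace(text, "");
    }

    // Applies the extractors in order; each one receives the previous output.
    static String apply(String input, List<UnaryOperator<String>> chain) {
        String value = input;
        for (UnaryOperator<String> extractor : chain) {
            value = extractor.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Made-up input: pull out the number, then strip the thousands separator
        String text = "Price: 1,234.50 USD";
        System.out.println(apply(text, List.of(regex("[0-9][0-9,.]*"), remove(","))));
    }
}
```

With the sample input, the regex step yields `1,234.50` and the removal step turns it into `1234.50`.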

The browser identifiers built into wind-bell include:

  1. Chrome (Windows, Linux)
  2. Opera (Windows, Mac)
  3. Firefox (Windows, Linux, Mac)
  4. Internet Explorer (IE9, IE11)
  5. Edge
  6. Safari (Windows, Mac)
  7. ...
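A browser identifier is sent as the `User-Agent` request header. The sketch below shows one way to rotate identifiers using the JDK 11+ `java.net.http` API; it only builds the request and sends nothing. The User-Agent strings are illustrative examples rather than the list shipped with wind-bell, random selection is an assumption about how rotation might work, and `UserAgentSketch` is a hypothetical name.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of User-Agent rotation, not wind-bell's real code.
final class UserAgentSketch {

    // Illustrative identifiers only; not the list built into the library.
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                    + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
                    + "(KHTML, like Gecko) Version/17.0 Safari/605.1.15");

    // Builds a GET request that carries a randomly chosen User-Agent header.
    static HttpRequest build(String url) {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENTS.get(index))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("https://example.com/");
        // Print the identifier that was picked for this request
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```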

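Crawling JS-rendered sites

The core code is:

Crawler crawler = ...
// Use the Selenium-based downloader; the first argument is the path to the browser driver
crawler.setDownloader(new SeleniumDownloader("C:\\Users\\yishui\\Desktop\\geckodriver\\win32.exe", 3000L));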

Distributed support

The core code is:

....
// other code omitted
....
// Create a Redis-backed resource scheduler
Scheduler scheduler = new RedisScheduler("a unique name", redisTemplate);
// Create a Redis-backed request cache
RequestCache requestCache = new RedisRequestCache(redisTemplate);

crawler
    .setRequestCache(requestCache) // use the Redis request cache
    .setScheduler(scheduler);      // use the Redis resource scheduler

....
// other code omitted
....

// Start the crawler instance
crawler.start();
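The idea behind a shared scheduler and request cache can be sketched without Redis. In the example below, an in-memory `SharedStore` stands in for the Redis-backed scheduler and cache: every worker polls the same queue, and a shared set ensures that a URL is scheduled only once. This is an assumption about the general technique, not a description of how `RedisScheduler` or `RedisRequestCache` are implemented, and all names in the sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of a shared task store, not wind-bell's real code.
final class SharedSchedulerSketch {

    // Stand-in for a shared store such as Redis: one queue of pending URLs
    // plus one set of URLs that have already been scheduled.
    static final class SharedStore {
        private final Queue<String> pending = new ConcurrentLinkedQueue<>();
        private final Set<String> scheduled = ConcurrentHashMap.newKeySet();

        // Enqueues the URL only the first time any worker offers it.
        boolean offer(String url) {
            return scheduled.add(url) && pending.offer(url);
        }

        String poll() {
            return pending.poll();
        }
    }

    // Runs several workers against one store and returns every URL they handled.
    static List<String> run(List<String> seeds, int workers) {
        SharedStore store = new SharedStore();
        seeds.forEach(store::offer);
        List<String> handled = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                for (String url = store.poll(); url != null; url = store.poll()) {
                    handled.add(url); // a real worker would download and parse here
                }
            });
            threads.add(worker);
            worker.start();
        }
        for (Thread worker : threads) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        // The duplicate seed "/a" is scheduled only once
        System.out.println(run(List.of("/a", "/b", "/a", "/c"), 2).size());
    }
}
```

Which worker handles which URL varies from run to run, but each URL is handled exactly once because deduplication happens in the shared store rather than in each worker.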

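Status monitoring

wind-bell also provides status and event monitoring. Through status listeners and event listeners, you can follow how a task is running and see the problems an instance encounters at run time as they occur, which simplifies operations.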

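Parsing simulator

Because wind-bell's parsing is powerful and its rule definitions are flexible, it provides a parsing simulator that shows what a configured rule actually does. Users can quickly check whether their rules produce the expected result and adjust them in time, which makes configuring an instance easier.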



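wind-bell platform demo

  1. Configure basic information
Set the crawler's name, the number of threads it uses, and the timeout after which it stops.

[Screenshot: basic information]

  2. Configure link crawling
Set the crawler's start (seed) URL and the rules for extracting, from each page, the links to crawl next.

[Screenshot: link crawling configuration]

  3. Configure site information
This step can usually be skipped, but it is very useful for sites that check cookies and request header parameters.

[Screenshot: site information configuration]

  4. Configure extraction items
Define the data to extract from the site, such as the news title and the page body.

[Screenshot: content page configuration]

  5. Configure field extraction
Combine content extractors in any way needed to extract the required data.

[Screenshot: field extraction configuration]

  6. Test field extraction
Check in advance whether the extraction items are configured correctly and whether the extracted data matches expectations.

[Screenshot: field extraction test]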

Related resources

Documentation: https://gitee.com/zhiyubujian/wind-bell/wikis/pages

API documentation: https://apidoc.gitee.com/zhiyubujian/wind-bell/

Official documentation: http://doc.yishuifengxiao.com/windbell/

