幽烛 / Taobao spider base on scrapy

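This repository does not declare an open-source license file (LICENSE); before using it, check the project description and the upstream dependencies of its code.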
README

dirbot

This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.

Items

The items scraped by this project are websites, and the item is defined in the class:

dirbot.items.Website

See the source code for more details.
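
As a rough illustration of what such an item might look like, here is a minimal sketch. The field names (name, url, description) are assumptions made for this example, and the project's actual dirbot.items.Website class may well be a scrapy.Item subclass instead; a plain dataclass is used here only so the sketch runs without Scrapy installed (Scrapy 2.2 and later accept dataclass objects as items).

```python
from dataclasses import dataclass, asdict


@dataclass
class Website:
    """One directory entry. Field names are illustrative assumptions,
    not taken from the project's source."""
    name: str = ""
    url: str = ""
    description: str = ""


# Build an item the way a spider callback might, then view it as a dict.
site = Website(
    name="Example",
    url="http://example.com/",
    description="A sample entry",
)
print(asdict(site))
```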

Spiders

This project contains one spider called dmoz that you can see by running:

scrapy list

Spider: dmoz

The dmoz spider scrapes the Open Directory Project (dmoz.org) and is based on the dmoz spider described in the Scrapy tutorial.

This spider doesn't crawl the entire dmoz.org site but only a few pages by default. These pages are listed in the spider's start_urls attribute; see the spider source for the exact URLs.

So, if you run the spider in the usual way (with scrapy crawl dmoz), it will scrape only those two pages. However, you can scrape any dmoz.org page by passing its URL instead of the spider name. Scrapy internally resolves which spider to use by looking at the allowed domains of each spider.

For example, to scrape a different URL use:

scrapy crawl http://www.dmoz.org/Computers/Programming/Languages/Erlang/

You can scrape any URL from dmoz.org using this spider.
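
The idea of picking a spider from a URL by checking allowed domains can be sketched as follows. This is not Scrapy's actual implementation; the registry SPIDER_DOMAINS and the helper names url_is_allowed and resolve_spider are hypothetical and exist only to illustrate the matching rule (exact domain or any subdomain of it).

```python
from urllib.parse import urlparse

# Hypothetical registry: spider name -> its allowed domains.
SPIDER_DOMAINS = {"dmoz": ["dmoz.org"]}


def url_is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain
    or is a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def resolve_spider(url, registry=SPIDER_DOMAINS):
    """Return the name of the first spider whose allowed domains
    cover the URL, or None if no spider matches."""
    for name, domains in registry.items():
        if url_is_allowed(url, domains):
            return name
    return None


print(resolve_spider("http://www.dmoz.org/Computers/Programming/Languages/Erlang/"))
print(resolve_spider("http://example.com/"))
```

Matching on the parsed host rather than on a substring of the URL avoids false positives such as notdmoz.org.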

Pipelines

This project uses a pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:

dirbot.pipelines.FilterWordsPipeline
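
A minimal sketch of how such a filtering pipeline could work is shown below. The word list is an assumption chosen for illustration, and the project's real class may differ in detail. The fallback DropItem class is a stand-in defined only so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


class FilterWordsPipeline:
    """Drop items whose description contains a forbidden word."""

    # Assumed word list for illustration; keep entries lowercase.
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider):
        # Compare case-insensitively against the item's description.
        description = str(item.get("description", "")).lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem(f"Contains forbidden word: {word}")
        return item


pipeline = FilterWordsPipeline()
kept = pipeline.process_item({"description": "Python books"}, spider=None)
print(kept)
```

In a Scrapy project, a pipeline like this is typically enabled through the ITEM_PIPELINES setting in settings.py, keyed by the class path given above.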

Repository (HTTPS): https://gitee.com/liangzhiqian310/taobao-spider-base-on-scrapy.git
Repository (SSH): git@gitee.com:liangzhiqian310/taobao-spider-base-on-scrapy.git
Branch: master