高阳路人 / RuiJi.Net

forked from 朱平齐 / RuiJi.Net 

RuiJi.Net

RuiJi.Net is a distributed crawler framework written in C#.

RuiJi.Net runs as a self-hosted Web API built on Microsoft.Owin. Major features include a distributed crawler, a distributed extractor, and managed cookies.

RuiJi.Net supports IP polling using the server's public network addresses and proxy servers.

Documentation

Building

http://www.ruijihg.com/archives/ruijinet/getting-started

Notice

The project is under development.

Features

Crawler

Feature             Support
web header          custom
method              GET/POST
auto redirection    supported
cookie              managed/custom
service point IP    auto/custom bind
encoding            auto-detect/specified
response            raw/string
proxy               future addition

Extractor

Feature              Support
selector             css/xpath/regex/json/text range/exclude text/clear
extract structure    block/tile/meta
JSON convert         ExtractBlock

About extract structure

[Image illustrating the extract structure]

Examples

Crawl and Extract with local library

        // Fetch the page with the local crawler.
        var crawler = new IPCrawler();
        var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");

        var response = crawler.Request(request);
        var content = response.Data.ToString();

        // Block: the page region to extract from.
        var block = new ExtractBlock();
        block.Selectors = new List<ISelector>
        {
            new CssSelector(".entry-content", CssTypeEnum.InnerHtml)
        };

        // Tile: each repeated item inside the block.
        block.TileSelector = new ExtractTile
        {
            Selectors = new List<ISelector>
            {
                new CssSelector(".pt-cv-content-item", CssTypeEnum.InnerHtml)
            }
        };

        // Metas: named fields read from each tile.
        block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
            new CssSelector(".pt-cv-title")
        });

        block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
            new CssSelector(".pt-cv-readmore", "href")
        });

        // Run the extraction against the downloaded content.
        var ext = new RuiJiExtracter();
        var r = ext.Extract(content, block);

Crawl and Extract with cluster

  1. Download ZooKeeper from an Apache mirror: http://mirrors.hust.edu.cn/apache/zookeeper/zookeeper-3.4.12/

  2. In the conf folder, make a copy of zoo_sample.cfg, rename the copy to zoo.cfg, and change dataDir to a directory of your own.

  3. Confirm that a Java runtime environment is installed.

  4. Run bin/zkServer.cmd in your ZooKeeper folder.

  5. Run RuiJi.Cmd.exe.

If you see the following output:

Server Start At http://x.x.x.x:x
proxy x.x.x.x:x ready to startup!
try connect to zookeeper server : x.x.x.x:2181
zookeeper server connected!

the service has started successfully.
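
For reference, the zoo.cfg created in step 2 might look roughly like the sketch below. The values other than dataDir are the defaults shipped in the stock zoo_sample.cfg; the dataDir path is a placeholder chosen for illustration, not something RuiJi.Net requires. The clientPort of 2181 matches the port in the startup log above.

```properties
# Length of a single tick in milliseconds (default from zoo_sample.cfg)
tickTime=2000
# Ticks allowed for the initial synchronization phase
initLimit=10
# Ticks allowed between sending a request and receiving an acknowledgement
syncLimit=5
# Directory where snapshots are stored.
# Placeholder path (an assumption): replace it with your own directory.
dataDir=C:/zookeeper/data
# Port that clients connect to
clientPort=2181
```

Forward slashes are used in the Windows path because the file is read as a Java properties file, where a backslash acts as an escape character.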

Notice

RuiJi.Cmd.exe must be run as an administrator.

Then crawl and extract through the cluster:

        // Start the RuiJi.Net nodes (as the method name suggests).
        Common.StartupNodes();

        var request = new Request("http://www.ruijihg.com/%e5%bc%80%e5%8f%91/");

        var response = Crawler.Request(request);

        // Stop if the page could not be fetched.
        if (response.StatusCode != System.Net.HttpStatusCode.OK)
            return;

        var content = response.Data.ToString();

        // Block: the page region to extract from.
        var block = new ExtractBlock();
        block.Selectors = new List<ISelector>
        {
            new CssSelector(".entry-content", CssTypeEnum.InnerHtml)
        };

        // Tile: each repeated item inside the block.
        block.TileSelector = new ExtractTile
        {
            Selectors = new List<ISelector>
            {
                new CssSelector(".pt-cv-content-item", CssTypeEnum.InnerHtml)
            }
        };

        // Metas: named fields read from each tile.
        block.TileSelector.Metas.AddMeta("title", new List<ISelector> {
            new CssSelector(".pt-cv-title")
        });

        block.TileSelector.Metas.AddMeta("url", new List<ISelector> {
            new CssSelector(".pt-cv-readmore", "href")
        });

        // Send the block and the content in a single extract request.
        var r = Extracter.Extract(new ExtractRequest {
            Block = block,
            Content = content
        });

Contact

Please contact me with any suggestions.

416803633@qq.com

My website: www.ruijihg.com
