simspider - Net Spider Engine
1.Overview
Simspider is a lightweight cross-platform web spider engine. It provides a set of C function interfaces for quickly building your own net spider application, and it also ships an executable spider program that demonstrates how to use the function library.
Simspider depends on only one third-party library: libcurl.
Simspider supports the following platforms:
* UNIX/Linux
* WINDOWS
The simspider function interface is very easy to use; the main flow is as follows:
* Create a spider engine environment
* Configure the spider engine environment
* Crawl all pages recursively from an entrance URL
* Destroy the spider engine environment
There are many options for customizing your spider engine environment:
* Set the request queue size
* Set the set of file extensions of interest
* Whether to allow empty file extensions
* Whether to allow crawling outside the current web site
* Set the maximum recursion depth
* Set the HTTPS certificate file name
* Set the crawl interval
* Set the maximum number of concurrent crawls
The simspider engine implements a flexible processing framework and exposes several callback function pointers, so spider application designers can insert their own custom processing logic at the following points:
* On building the HTTP request header
* On building the HTTP request body ( POST data etc. )
* After fetching the HTTP response header
* After fetching the HTTP response body ( HTML data etc. )
* Reviewing the done queue after the crawl
2.My first spider
Using the simspider function library to implement a crawler is fairly easy; here is a simple example:
[code]
#include <stdio.h>

#include "libsimspider.h"

int main()
{
	struct SimSpiderEnv	*penv = NULL ;
	int			nret = 0 ;
	
	nret = InitSimSpiderEnv( & penv , NULL ) ;
	if( nret )
	{
		printf( "InitSimSpiderEnv failed[%d]\n" , nret );
		return 1;
	}
	
	nret = SimSpiderGo( penv , "" , "http://www.curl.haxx.se/" ) ;
	if( nret )
	{
		printf( "SimSpiderGo failed[%d]\n" , nret );
		return 1;
	}
	
	CleanSimSpiderEnv( & penv );
	
	return 0;
}
[/code]
The sample itself doesn't do anything useful, because it doesn't output any data, but it does recursively crawl every page reachable from the entrance URL.
Crawling too slow? Enable concurrent crawling ( 5 concurrent connections here ):
[code]
SetMaxConcurrentCount( penv , 5 );
[/code]
Want to add some content to the HTTP request header to disguise your spider? Write a callback function and set it into the crawler engine environment:
[code]
funcRequestHeaderProc RequestHeaderProc ;
int RequestHeaderProc( struct DoneQueueUnit *pdqu )
{
	CURL			*curl = NULL ;
	struct curl_slist	*header_list = NULL ;
	char			buffer[ 1024 + 1 ] ;
	
	curl = GetDoneQueueUnitCurl( pdqu ) ;
	
	header_list = curl_slist_append( header_list , "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0" ) ;
	if( GetDoneQueueUnitRefererUrl(pdqu) )
	{
		memset( buffer , 0x00 , sizeof(buffer) );
		SNPRINTF( buffer , sizeof(buffer) , "Referer: %s" , GetDoneQueueUnitRefererUrl(pdqu) );
		header_list = curl_slist_append( header_list , buffer ) ;
	}
	curl_easy_setopt( curl , CURLOPT_HTTPHEADER , header_list );
	/* hand the list to the engine so it is freed after the request */
	FreeCurlList1Later( pdqu , header_list );
	
	return 0;
}
InitSimSpiderEnv
...
SetRequestHeaderProc( penv , & RequestHeaderProc );
...
SimSpiderGo
[/code]
Want to print each URL to the screen in real time as your spider crawls it? Write a callback function and set it into the crawler engine environment:
[code]
funcResponseBodyProc ResponseBodyProc ;
int ResponseBodyProc( struct DoneQueueUnit *pdqu )
{
	printf( ">>> [%s]\n" , GetDoneQueueUnitUrl(pdqu) );
	return 0;
}
InitSimSpiderEnv
...
SetResponseBodyProc( penv , & ResponseBodyProc );
...
SimSpiderGo
[/code]
Want to output all URLs after the whole site has been crawled? Write a callback function and set it into the crawler engine environment:
[code]
funcTravelDoneQueueProc TravelDoneQueueProc ;
void TravelDoneQueueProc( char *key , void *value , long value_len , void *pv )
{
	struct DoneQueueUnit *pdqu = (struct DoneQueueUnit *)value ;
	printf( "[%5d] [%2ld] [%s] [%s]\n" , GetDoneQueueUnitStatus(pdqu) , GetDoneQueueUnitRecursiveDepth(pdqu) , GetDoneQueueUnitRefererUrl(pdqu) , GetDoneQueueUnitUrl(pdqu) );
	return;
}
InitSimSpiderEnv
...
SetTravelDoneQueueProc( penv , & TravelDoneQueueProc );
...
SimSpiderGo
[/code]
Easy, isn't it?
A complete example ships in the installation package at src/simspider.c.
3.Customize the spider engine environment
* Set the request queue size
The spider environment needs two queues: the request queue and the done queue.
The request queue size defaults to SIMSPIDER_DEFAULT_REQUESTQUEUE_SIZE; to adjust it, call the function ResizeRequestQueue.
* Set the set of file extensions of interest
The spider engine crawls the site recursively and automatically ignores file extensions it is not interested in. The default set is SIMSPIDER_DEFAULT_VALIDFILENAMEEXTENSION; call the function SetValidFileExtnameSet to customize it.
* Whether to allow empty file extensions
Call the function AllowEmptyFileExtname to enable or disable this.
* Whether to allow crawling outside the current web site
Call the function AllowRunOutofWebsite to enable or disable this.
* Set the maximum recursion depth
Call the function SetMaxRecursiveDepth to set the recursion depth.
* Set the HTTPS certificate file name
Call the function SetCertificateFilename to set the certificate file name.
* Set the crawl interval
To sleep between crawling two pages, call the function SetRequestDelay; the unit is seconds.
* Set the maximum number of concurrent crawls
Serial crawling too slow? Use multiplexed parallel crawling: call the function SetMaxConcurrentCount to set the degree of concurrency. A combined sketch of these options follows this list.
When set to SIMSPIDER_CONCURRENTCOUNT_AUTO, the count is adjusted automatically. (experimental)
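Putting it together, here is a minimal sketch of configuring the engine between InitSimSpiderEnv and SimSpiderGo. All option values below are illustrative, and the exact parameter types are assumptions based on the descriptions above; check libsimspider.h for the real signatures.
[code]
/* Hedged sketch: values and parameter types are assumptions, not the documented API. */
ResizeRequestQueue( penv , 100000 );            /* enlarge the request queue */
SetValidFileExtnameSet( penv , ".html .htm" );  /* file extensions of interest */
AllowEmptyFileExtname( penv , 1 );              /* allow URLs with no extension */
AllowRunOutofWebsite( penv , 0 );               /* stay inside the entrance web site */
SetMaxRecursiveDepth( penv , 10 );              /* limit recursion depth */
SetCertificateFilename( penv , "spider.pem" );  /* HTTPS certificate file */
SetRequestDelay( penv , 1 );                    /* sleep 1 second between crawls */
SetMaxConcurrentCount( penv , 5 );              /* up to 5 concurrent crawls */
[/code]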
4.Callback functions
* On building the HTTP request header
To add information while the HTTP request header is being built, write a callback function and set it into the crawler engine environment.
* On building the HTTP request body
To add information while the HTTP request body is being built, write a callback function and set it into the crawler engine environment.
* After fetching the HTTP response header
To read information after the HTTP response header has been fetched, write a callback function and set it into the crawler engine environment.
* After fetching the HTTP response body
To read information after the HTTP response body has been fetched, write a callback function and set it into the crawler engine environment.
* Reviewing the done queue after the crawl
Convenient for saving crawl results, e.g. updating them into a database; see the sketch below.
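For example, the travel-done-queue callback from section 2 can be adapted to dump results to a file instead of the screen. This is a hedged sketch: it assumes the pv argument can carry a FILE* supplied by the caller, which the source does not confirm; only accessor functions already shown above are used.
[code]
#include <stdio.h>

funcTravelDoneQueueProc DumpDoneQueueProc ;
void DumpDoneQueueProc( char *key , void *value , long value_len , void *pv )
{
	struct DoneQueueUnit	*pdqu = (struct DoneQueueUnit *)value ;
	FILE			*fp = (FILE *)pv ; /* assumption: caller passes a FILE* as pv */
	
	/* one CSV line per crawled URL: status,depth,referer,url */
	fprintf( fp , "%d,%ld,%s,%s\n"
		, GetDoneQueueUnitStatus(pdqu)
		, GetDoneQueueUnitRecursiveDepth(pdqu)
		, GetDoneQueueUnitRefererUrl(pdqu)
		, GetDoneQueueUnitUrl(pdqu) );
	return;
}
[/code]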
5.Experience
* How to output the engine's internal log
Set the log file name when initializing the spider engine environment: InitSimSpiderEnv( struct SimSpiderEnv **ppenv , char *log_file_format , ... ). See the sketch below.
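A minimal sketch, assuming a plain file path works as the format string (the path itself is illustrative):
[code]
/* Illustrative: pass a log file name instead of NULL at initialization. */
nret = InitSimSpiderEnv( & penv , "/tmp/simspider.log" ) ;
[/code]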
* How to preset multiple entrance web sites
The function SimSpiderGo sets only one entrance. If there is more than one entrance, use the function AppendRequestQueue to preset them into the spider engine environment, then call SimSpiderGo(penv,NULL), as sketched below.
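A minimal sketch, assuming AppendRequestQueue takes the environment and one entrance URL per call (the URLs are placeholders and the signature is a guess; check libsimspider.h):
[code]
/* Assumed parameters: environment plus one entrance URL per call. */
AppendRequestQueue( penv , "http://site1.example/" );
AppendRequestQueue( penv , "http://site2.example/" );
/* Then crawl all preset entrances, using the call form given above. */
SimSpiderGo( penv , NULL );
[/code]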
* If you create a curl list (struct curl_slist *) in the request-header building callback, it must be released afterwards
Call one of the functions FreeCurlList1Later~FreeCurlList3Later to store the list into the spider engine, which will release it at the end.
6.The spider running demo
Demo source : src/simspider.c
[code]
$ time ./simspider 192.168.6.79 5
>>> [http://192.168.6.79/]
>>> [http://192.168.6.79/curl-config.html]
>>> [http://192.168.6.79/TheArtOfHttpScripting]
>>> [http://192.168.6.79/libcurl/index.html]
>>> [http://192.168.6.79/index.html]
>>> [http://192.168.6.79/libcurl/libcurl.html]
>>> [http://192.168.6.79/libcurl/libcurl-easy.html]
>>> [http://192.168.6.79/libcurl/libcurl-multi.html]
>>> [http://192.168.6.79/libcurl/libcurl-share.html]
>>> [http://192.168.6.79/libcurl/libcurl-errors.html]
>>> [http://192.168.6.79/curl.html]
>>> [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
>>> [http://192.168.6.79/libcurl/curl_easy_escape.html]
>>> [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
>>> [http://192.168.6.79/libcurl/curl_easy_init.html]
>>> [http://192.168.6.79/libcurl/curl_easy_pause.html]
>>> [http://192.168.6.79/libcurl/curl_easy_perform.html]
>>> [http://192.168.6.79/libcurl/curl_easy_recv.html]
>>> [http://192.168.6.79/libcurl/curl_easy_reset.html]
>>> [http://192.168.6.79/libcurl/curl_easy_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_easy_unescape.html]
>>> [http://192.168.6.79/libcurl/curl_escape.html]
>>> [http://192.168.6.79/libcurl/curl_formadd.html]
>>> [http://192.168.6.79/libcurl/curl_formfree.html]
>>> [http://192.168.6.79/libcurl/curl_formget.html]
>>> [http://192.168.6.79/libcurl/curl_free.html]
>>> [http://192.168.6.79/libcurl/curl_getenv.html]
>>> [http://192.168.6.79/libcurl/curl_easy_send.html]
>>> [http://192.168.6.79/libcurl/curl_global_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_global_init.html]
>>> [http://192.168.6.79/libcurl/curl_global_init_mem.html]
>>> [http://192.168.6.79/libcurl/curl_mprintf.html]
>>> [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
>>> [http://192.168.6.79/libcurl/curl_multi_assign.html]
>>> [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_multi_fdset.html]
>>> [http://192.168.6.79/libcurl/curl_getdate.html]
>>> [http://192.168.6.79/libcurl/curl_multi_info_read.html]
>>> [http://192.168.6.79/libcurl/curl_multi_init.html]
>>> [http://192.168.6.79/libcurl/curl_multi_perform.html]
>>> [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
>>> [http://192.168.6.79/libcurl/curl_multi_setopt.html]
>>> [http://192.168.6.79/libcurl/curl_multi_socket.html]
>>> [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
>>> [http://192.168.6.79/libcurl/curl_multi_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_multi_timeout.html]
>>> [http://192.168.6.79/libcurl/curl_share_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_share_init.html]
>>> [http://192.168.6.79/libcurl/curl_share_setopt.html]
>>> [http://192.168.6.79/libcurl/curl_share_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_slist_append.html]
>>> [http://192.168.6.79/libcurl/curl_slist_free_all.html]
>>> [http://192.168.6.79/libcurl/curl_strequal.html]
>>> [http://192.168.6.79/libcurl/curl_unescape.html]
>>> [http://192.168.6.79/libcurl/curl_version.html]
>>> [http://192.168.6.79/libcurl/curl_version_info.html]
>>> [http://192.168.6.79/libcurl/]
>>> [http://192.168.6.79/libcurl/libcurl-tutorial.html]
>>> [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200] [ 1] [] [http://192.168.6.79/]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/TheArtOfHttpScripting]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl-config.html]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/index.html]
[ 200] [ 4] [http://192.168.6.79/libcurl/curl_easy_getinfo.html] [http://192.168.6.79/libcurl/]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_escape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_pause.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_perform.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_recv.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_reset.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_send.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_unescape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_escape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formadd.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formfree.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formget.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_free.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getdate.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getenv.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init_mem.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_mprintf.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_assign.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_fdset.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_info_read.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_perform.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
[ 404] [ 4] [http://192.168.6.79/libcurl/curl_multi_socket.html] [http://192.168.6.79/libcurl/curl_multi_socket_all.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_timeout.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_append.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_free_all.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_strequal.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_unescape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version_info.html]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/libcurl/index.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-easy.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-errors.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-multi.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-share.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-tutorial.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl.html]
real 0m0.452s
user 0m0.062s
sys 0m0.360s
[/code]
7.Finally
It's time to download it and enjoy yourself.
Feel free to contact me ^_^
Project Homepage : https://github.com/calvinwilliams/simspider
Author EMail : calvinwilliams.c@gmail.com