simspider - Net Spider Engine

1.Overview

Simspider is a lightweight, cross-platform web spider engine. It provides a set of C functions for quickly building your own web spider application, along with an executable spider program that demonstrates how to use the function library.
Simspider depends on only one third-party library, libcurl.

Simspider supports the following platforms:
* UNIX/Linux
* Windows

The simspider function interface is easy to use. The main process is as follows:
* Create a spider engine environment
* Configure the spider engine environment
* Crawl all pages recursively from the entrance URL
* Destroy the spider engine environment

Many options are available to customize your spider engine environment:
* Set the request queue size
* Set the file extensions of interest
* Whether to allow an empty file extension
* Whether to allow crawling outside the current web site
* Set the maximum depth of recursion
* Set the HTTPS certificate file name
* Set the interval between requests
* Set the maximum number of concurrent requests
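
To make the queue and depth options concrete, here is a minimal sketch of a depth-limited crawl loop in plain C. It does not use libsimspider or libcurl: the "web site" is a hypothetical in-memory link table, and the names Page, Visit, crawl, MAX_PAGES and MAX_LINKS are invented for this illustration. It is an assumption about the general technique, not a description of the engine's internals.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES 16
#define MAX_LINKS 4

/* Hypothetical in-memory "web site": each page lists the pages it links to. */
struct Page {
    const char *url;
    int links[MAX_LINKS];   /* indexes into the site array, -1 terminates */
};

struct Visit {
    int page;    /* index of the page that was fetched */
    int depth;   /* 1 for the entrance page */
};

/*
 * Breadth-first crawl from `entrance`. Pages waiting to be fetched sit in a
 * request queue; fetched pages are recorded in `done`, which must have room
 * for page_count entries. Links found on a page at max_depth are not
 * followed. Returns the number of pages visited, or -1 on bad arguments.
 */
int crawl(const struct Page *site, int page_count, int entrance,
          int max_depth, struct Visit *done)
{
    struct Visit request_queue[MAX_PAGES];
    int seen[MAX_PAGES] = {0};
    int head = 0, tail = 0, done_count = 0;

    assert(site != NULL && done != NULL);
    if (page_count > MAX_PAGES || entrance < 0 || entrance >= page_count)
        return -1;

    request_queue[tail++] = (struct Visit){ entrance, 1 };
    seen[entrance] = 1;

    while (head < tail) {
        struct Visit cur = request_queue[head++];
        done[done_count++] = cur;            /* stands in for fetching the page */

        if (cur.depth >= max_depth)
            continue;                        /* depth limit reached */

        for (int i = 0; i < MAX_LINKS && site[cur.page].links[i] >= 0; i++) {
            int next = site[cur.page].links[i];
            if (next < page_count && !seen[next]) {   /* request each page once */
                seen[next] = 1;
                request_queue[tail++] = (struct Visit){ next, cur.depth + 1 };
            }
        }
    }
    return done_count;
}
```

Because every page is queued at most once, the request queue never needs more slots than there are pages. A real engine that discovers URLs on the fly cannot know that bound in advance, which is one reason a configurable queue size is useful.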

The simspider engine implements a flexible processing framework. It provides callback function pointers so that application designers can add their own custom processing logic at the following points in the crawl:
* When building the HTTP request header
* When building the HTTP request body (POST data etc.)
* After fetching the HTTP response header
* After fetching the HTTP response body (HTML data etc.)
* When reviewing the queue after the crawl

2.My first spider

Building a crawler with the simspider function library is fairly easy. Here is a simple example:
[code]
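#include <stdio.h>
#include "libsimspider.h"

int main()
{
	struct SimSpiderEnv	*penv = NULL ;
	int			nret = 0 ;
	
	/* Create the spider engine environment */
	nret = InitSimSpiderEnv( & penv , NULL ) ;
	if( nret )
	{
		printf( "InitSimSpiderEnv failed[%d]\n" , nret );
		return 1;
	}
	
	/* Crawl recursively from the entrance URL */
	nret = SimSpiderGo( penv , "" , "http://www.curl.haxx.se/" ) ;
	if( nret )
	{
		printf( "SimSpiderGo failed[%d]\n" , nret );
		/* Release the environment before exiting on failure */
		CleanSimSpiderEnv( & penv );
		return 1;
	}
	
	/* Destroy the spider engine environment */
	CleanSimSpiderEnv( & penv );
	
	return 0;
}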
[/code]
The sample is not very useful on its own because it produces no output, but it does crawl all the web pages of http://www.curl.haxx.se/
	
If crawling is too slow, enable concurrent crawling. The following line sets 5 concurrent requests:
[code]
	SetMaxConcurrentCount( penv , 5 );
[/code]
	
To add headers to the HTTP request in order to disguise the spider, write a callback function and set it into the spider engine environment:
[code]
funcRequestHeaderProc RequestHeaderProc ;
int RequestHeaderProc( struct DoneQueueUnit *pdqu )
{
	CURL			*curl = NULL ;
	struct curl_slist	*header_list = NULL ;
	char			buffer[ 1024 + 1 ] ;
	
	curl = GetDoneQueueUnitCurl(pdqu) ;
	
	header_list = curl_slist_append( header_list , "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0" ) ;
	
	if( GetDoneQueueUnitRefererUrl(pdqu) )
	{
		memset( buffer , 0x00 , sizeof(buffer) );
		SNPRINTF( buffer , sizeof(buffer) , "Referer: %s" , GetDoneQueueUnitRefererUrl(pdqu) );
		header_list = curl_slist_append( header_list , buffer ) ;
	}
	
	curl_easy_setopt( curl , CURLOPT_HTTPHEADER , header_list );
	FreeCurlList1Later( pdqu , header_list );
	
	return 0;
}

	InitSimSpiderEnv
	...
	SetRequestHeaderProc( penv , & RequestHeaderProc );
	...
	SimSpiderGo
[/code]
	
To print each URL to the screen as it is crawled, write a callback function and set it into the spider engine environment:
[code]
funcResponseBodyProc ResponseBodyProc ;
int ResponseBodyProc( struct DoneQueueUnit *pdqu )
{
	printf( ">>> [%s]\n" , GetDoneQueueUnitUrl(pdqu) );

	return 0;
}

	InitSimSpiderEnv
	...
	SetResponseBodyProc( penv , & ResponseBodyProc );
	...
	SimSpiderGo
[/code]
	
To output all URLs after the whole crawl has completed, write another callback function and set it into the spider engine environment:
[code]
funcTravelDoneQueueProc TravelDoneQueueProc ;
void TravelDoneQueueProc( char *key , void *value , long value_len , void *pv )
{
	struct DoneQueueUnit	*pdqu = (struct DoneQueueUnit *)value ;
	
	printf( "[%5d] [%2ld] [%s] [%s]\n" , GetDoneQueueUnitStatus(pdqu) , GetDoneQueueUnitRecursiveDepth(pdqu) , GetDoneQueueUnitRefererUrl(pdqu) , GetDoneQueueUnitUrl(pdqu) );
	
	return;
}

	InitSimSpiderEnv
	...
	SetTravelDoneQueueProc( penv , & TravelDoneQueueProc );
	...
	SimSpiderGo
[/code]
	
Easy, isn't it?
A complete example is included in the installation package at src/simspider.c.
	
3.Customize the spider engine environment

* Set the request queue size
The spider environment needs two queues: the request queue and the done queue.
The default request queue size is SIMSPIDER_DEFAULT_REQUESTQUEUE_SIZE. To adjust it, call the function ResizeRequestQueue.
* Set the file extensions of interest
The spider engine crawls the site recursively and automatically ignores files whose extensions are not of interest. The default set is SIMSPIDER_DEFAULT_VALIDFILENAMEEXTENSION; call the function SetValidFileExtnameSet to customize it.
* Whether to allow an empty file extension
Call the function AllowEmptyFileExtname to enable or disable this.
* Whether to allow crawling outside the current web site
Call the function AllowRunOutofWebsite to enable or disable this.
* Set the maximum depth of recursion
Call the function SetMaxRecursiveDepth to set the maximum recursion depth.
* Set the HTTPS certificate file name
Call the function SetCertificateFilename to set the certificate file name.
* Set the interval between requests
The engine sleeps between two page fetches. Call the function SetRequestDelay to set the interval, in seconds.
* Set the maximum number of concurrent requests
If serial crawling is too slow, use multiplexed parallel crawling: call the function SetMaxConcurrentCount to set the degree of concurrency.
The engine adjusts the concurrency automatically when it is set to SIMSPIDER_CONCURRENTCOUNT_AUTO. (experimental)
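
The extension and off-site options above amount to a filter applied to every link the spider finds. The following plain C sketch shows one way such a filter could work. It is an illustration only: the function names same_site, extension_allowed and should_follow are hypothetical, the space-separated extension set is an assumed format (the real format of SIMSPIDER_DEFAULT_VALIDFILENAMEEXTENSION is not shown in this README), and the URL parsing is deliberately simplified.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the host part of an absolute URL and store its length.
   Simplified: user-info ("user@host") is not handled. */
static const char *url_host(const char *url, size_t *len)
{
    const char *p = strstr(url, "://");
    p = p ? p + 3 : url;
    *len = strcspn(p, "/:?#");
    return p;
}

/* 1 if both URLs name the same host (case-insensitive), else 0. */
int same_site(const char *url_a, const char *url_b)
{
    size_t len_a, len_b;
    const char *a, *b;

    assert(url_a != NULL && url_b != NULL);
    a = url_host(url_a, &len_a);
    b = url_host(url_b, &len_b);
    if (len_a != len_b)
        return 0;
    for (size_t i = 0; i < len_a; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* 1 if the extension of the last path segment is in ext_set (space-separated,
   e.g. "htm html"). URLs with no extension are accepted only if allow_empty. */
int extension_allowed(const char *url, const char *ext_set, int allow_empty)
{
    size_t host_len;
    const char *path, *end, *dot = NULL, *tok;

    assert(url != NULL && ext_set != NULL);
    path = url_host(url, &host_len) + host_len;
    end = path + strcspn(path, "?#");        /* ignore query and fragment */

    for (const char *p = path; p < end; p++) {
        if (*p == '/')
            dot = NULL;                      /* new segment: forget earlier dots */
        else if (*p == '.')
            dot = p;
    }
    if (dot == NULL || dot + 1 == end)
        return allow_empty ? 1 : 0;

    const char *ext = dot + 1;
    size_t ext_len = (size_t)(end - ext);
    for (tok = ext_set; *tok != '\0'; ) {
        size_t tok_len = strcspn(tok, " ");
        if (tok_len == ext_len && strncmp(tok, ext, ext_len) == 0)
            return 1;
        tok += tok_len;
        tok += strspn(tok, " ");
    }
    return 0;
}

/* Combine both checks for a link found on page_url. */
int should_follow(const char *page_url, const char *link_url,
                  const char *ext_set, int allow_empty, int allow_offsite)
{
    if (!allow_offsite && !same_site(page_url, link_url))
        return 0;
    return extension_allowed(link_url, ext_set, allow_empty);
}
```

With the set "htm html" and empty extensions allowed, a link such as http://192.168.6.79/TheArtOfHttpScripting from the demo run would be followed, while a link to an image file would be skipped.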

4.Callback functions

* When building the HTTP request header
To add information while the HTTP request header is being built, write a callback function and set it into the spider engine environment.
* When building the HTTP request body
To add information while the HTTP request body is being built, write a callback function and set it into the spider engine environment.
* After fetching the HTTP response header
To read information after the HTTP response header has been fetched, write a callback function and set it into the spider engine environment.
* After fetching the HTTP response body
To read information after the HTTP response body has been fetched, write a callback function and set it into the spider engine environment.
* When reviewing the queue after the crawl
This is a convenient place to save the crawl results, for example by updating them into a database.
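
As an illustration of the last point, the sketch below shows a review callback that accumulates a summary instead of printing each record. It is self-contained plain C and does not use libsimspider: struct PageResult, struct CrawlSummary, SummarizeProc and travel_results are hypothetical stand-ins, and treating the last parameter (pv) as a caller-supplied pointer is an assumption. Only the parameter shape of the callback is taken from the TravelDoneQueueProc example in this README.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a done-queue record. */
struct PageResult {
    const char *url;
    int status;      /* HTTP status code */
    long depth;      /* recursion depth, 1 for the entrance page */
};

/* Totals that a real application might write to a database afterwards. */
struct CrawlSummary {
    long total;
    long ok;         /* 2xx responses */
    long failed;     /* everything else */
    long max_depth;
};

/* Same parameter shape as TravelDoneQueueProc: `value` is the record and
   `pv` is assumed to be a pointer supplied by the caller. */
void SummarizeProc(char *key, void *value, long value_len, void *pv)
{
    struct PageResult *res = value;
    struct CrawlSummary *sum = pv;

    (void)key;
    (void)value_len;
    assert(res != NULL && sum != NULL);

    sum->total++;
    if (res->status >= 200 && res->status < 300)
        sum->ok++;
    else
        sum->failed++;
    if (res->depth > sum->max_depth)
        sum->max_depth = res->depth;
}

/* Stand-in for the engine walking its done queue after the crawl. */
void travel_results(struct PageResult *results, size_t count,
                    void (*proc)(char *, void *, long, void *), void *pv)
{
    assert(results != NULL && proc != NULL);
    for (size_t i = 0; i < count; i++)
        proc((char *)results[i].url, &results[i], (long)sizeof results[i], pv);
}
```

Passing state through a pointer keeps the callback free of global variables, so the same function can summarize several crawls independently.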

5.Experience

* How to output the engine's internal log
Pass a log file name when initializing the spider engine environment: InitSimSpiderEnv( struct SimSpiderEnv **ppenv , char *log_file_format , ... ).

* How to preset multiple entrance websites
The function SimSpiderGo sets only one entrance. If there is more than one entrance, use the function AppendRequestQueue to preset them into the spider engine environment, and then call SimSpiderGo(penv,NULL).

* If you create a curl list (struct curl_slist *) in the HTTP request header callback function, it must be released afterwards
Call one of the functions FreeCurlList1Later through FreeCurlList3Later to hand the list to the spider engine, which releases it at the end.

6.Spider running demo

Demo source: src/simspider.c
[code]
$ time ./simspider 192.168.6.79 5
>>> [http://192.168.6.79/]
>>> [http://192.168.6.79/curl-config.html]
>>> [http://192.168.6.79/TheArtOfHttpScripting]
>>> [http://192.168.6.79/libcurl/index.html]
>>> [http://192.168.6.79/index.html]
>>> [http://192.168.6.79/libcurl/libcurl.html]
>>> [http://192.168.6.79/libcurl/libcurl-easy.html]
>>> [http://192.168.6.79/libcurl/libcurl-multi.html]
>>> [http://192.168.6.79/libcurl/libcurl-share.html]
>>> [http://192.168.6.79/libcurl/libcurl-errors.html]
>>> [http://192.168.6.79/curl.html]
>>> [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
>>> [http://192.168.6.79/libcurl/curl_easy_escape.html]
>>> [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
>>> [http://192.168.6.79/libcurl/curl_easy_init.html]
>>> [http://192.168.6.79/libcurl/curl_easy_pause.html]
>>> [http://192.168.6.79/libcurl/curl_easy_perform.html]
>>> [http://192.168.6.79/libcurl/curl_easy_recv.html]
>>> [http://192.168.6.79/libcurl/curl_easy_reset.html]
>>> [http://192.168.6.79/libcurl/curl_easy_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_easy_unescape.html]
>>> [http://192.168.6.79/libcurl/curl_escape.html]
>>> [http://192.168.6.79/libcurl/curl_formadd.html]
>>> [http://192.168.6.79/libcurl/curl_formfree.html]
>>> [http://192.168.6.79/libcurl/curl_formget.html]
>>> [http://192.168.6.79/libcurl/curl_free.html]
>>> [http://192.168.6.79/libcurl/curl_getenv.html]
>>> [http://192.168.6.79/libcurl/curl_easy_send.html]
>>> [http://192.168.6.79/libcurl/curl_global_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_global_init.html]
>>> [http://192.168.6.79/libcurl/curl_global_init_mem.html]
>>> [http://192.168.6.79/libcurl/curl_mprintf.html]
>>> [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
>>> [http://192.168.6.79/libcurl/curl_multi_assign.html]
>>> [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_multi_fdset.html]
>>> [http://192.168.6.79/libcurl/curl_getdate.html]
>>> [http://192.168.6.79/libcurl/curl_multi_info_read.html]
>>> [http://192.168.6.79/libcurl/curl_multi_init.html]
>>> [http://192.168.6.79/libcurl/curl_multi_perform.html]
>>> [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
>>> [http://192.168.6.79/libcurl/curl_multi_setopt.html]
>>> [http://192.168.6.79/libcurl/curl_multi_socket.html]
>>> [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
>>> [http://192.168.6.79/libcurl/curl_multi_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_multi_timeout.html]
>>> [http://192.168.6.79/libcurl/curl_share_cleanup.html]
>>> [http://192.168.6.79/libcurl/curl_share_init.html]
>>> [http://192.168.6.79/libcurl/curl_share_setopt.html]
>>> [http://192.168.6.79/libcurl/curl_share_strerror.html]
>>> [http://192.168.6.79/libcurl/curl_slist_append.html]
>>> [http://192.168.6.79/libcurl/curl_slist_free_all.html]
>>> [http://192.168.6.79/libcurl/curl_strequal.html]
>>> [http://192.168.6.79/libcurl/curl_unescape.html]
>>> [http://192.168.6.79/libcurl/curl_version.html]
>>> [http://192.168.6.79/libcurl/curl_version_info.html]
>>> [http://192.168.6.79/libcurl/]
>>> [http://192.168.6.79/libcurl/libcurl-tutorial.html]
>>> [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[  200] [ 1] [] [http://192.168.6.79/]
[  200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/TheArtOfHttpScripting]
[  200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl-config.html]
[  200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/index.html]
[  200] [ 4] [http://192.168.6.79/libcurl/curl_easy_getinfo.html] [http://192.168.6.79/libcurl/]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_escape.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_init.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_pause.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_perform.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_recv.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_reset.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_send.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_strerror.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_unescape.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_escape.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formadd.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formfree.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formget.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_free.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getdate.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getenv.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_cleanup.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init_mem.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_mprintf.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_assign.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_fdset.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_info_read.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_init.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_perform.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_setopt.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
[  404] [ 4] [http://192.168.6.79/libcurl/curl_multi_socket.html] [http://192.168.6.79/libcurl/curl_multi_socket_all.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_strerror.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_timeout.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_cleanup.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_init.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_setopt.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_strerror.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_append.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_free_all.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_strequal.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_unescape.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version_info.html]
[  200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/libcurl/index.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-easy.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-errors.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-multi.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-share.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-tutorial.html]
[  200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl.html]
real    0m0.452s
user    0m0.062s
sys     0m0.360s
[/code]

7.Finally

It's time to download it and enjoy.

Feel free to contact me ^_^
Project Homepage : https://github.com/calvinwilliams/simspider
Author EMail     : calvinwilliams.c@gmail.com
