New collection node, step 1: set the basic information and the URL index page rules
Node basic information
Node name:
Target page encoding:
GB2312
UTF8
BIG5
Region matching mode:
Regular expression
String
Content import order:
Same as the target site
Reverse of the target site
The following options only need to be set when anti-hotlinking mode is enabled. If the target site has no anti-hotlinking protection, do not enable this mode, because it slows down collection.
Anti-hotlinking mode:
Disabled
Enabled
Resource download timeout:
seconds
Reference URL:
(The URL of any article page on the target site)
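Anti-hotlinking checks commonly inspect the Referer header of a request. The sketch below assumes that this is what the reference URL and the download timeout are used for; it is an illustration in Python, not the collector's own code, and the names `build_request` and `fetch` are hypothetical.

```python
from urllib.request import Request, urlopen


def build_request(resource_url, reference_url):
    """Build a request that presents reference_url as the referring page.

    Assumption: the target site decides whether to serve a resource
    by looking at the Referer header.
    """
    return Request(resource_url, headers={"Referer": reference_url})


def fetch(resource_url, reference_url, timeout_seconds=5):
    """Download a resource, giving up after timeout_seconds."""
    request = build_request(resource_url, reference_url)
    with urlopen(request, timeout=timeout_seconds) as response:
        return response.read()
```

Sending the extra header and waiting for slow resources costs time on every download, which is consistent with the advice to leave the mode off when the target site does not need it.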
List URL rules
Source type:
Batch-generate list URLs
Manually specify list URLs
Get from RSS
RSS URL:
Batch generation settings:
URL pattern:
(For example: http://www.dedecms.com/html/test/list_(*).html. If the pattern cannot match every list URL, add the remaining URLs in the manual address field.)
(*) from
to
(page number or sequence number) Increment per page:
Enable multi-column distribution (#)
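The batch settings can be read as a simple expansion of the (*) placeholder. The following Python sketch assumes that the "to" value is inclusive and that the increment is a positive integer; the function name `expand_list_urls` is hypothetical.

```python
def expand_list_urls(pattern, start, end, step=1):
    """Replace (*) in pattern with each number from start to end.

    Assumptions: end is inclusive and step is a positive integer.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return [pattern.replace("(*)", str(n)) for n in range(start, end + 1, step)]


urls = expand_list_urls("http://www.dedecms.com/html/test/list_(*).html", 1, 3)
```

With these inputs, `urls` holds the three list pages ending in `list_1.html`, `list_2.html` and `list_3.html`.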
Manual address:
URLs that the distribution rules do not match can be listed here.
Multi-column distribution rules:
If the target site uses a single template for several columns, you can put "(#)" in the URL pattern to stand for the part of the URL that differs between columns, then define that part in the distribution rules, where you can also specify the export column.
Format: "[(#)=>wildcard string; (*)=>num-num; typeid=>num]", one rule per line
For example: [(#)=>labs/list_3; (*)=>1-25; typeid=>7] with the URL pattern http://www.aaa.com/(#)_(*).html
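To make the rule format concrete, here is a minimal Python sketch that parses one rule line and expands it against a URL pattern. It assumes the "=>" separator used in the example, an inclusive number range, and one rule per line; `parse_rule` and `expand_rule` are hypothetical names, not part of the collector.

```python
import re

# One rule line: [(#)=>text; (*)=>num-num; typeid=>num]
RULE_RE = re.compile(
    r"\[\(#\)=>(?P<wild>[^;]+);\s*"
    r"\(\*\)=>(?P<start>\d+)-(?P<end>\d+);\s*"
    r"typeid=>(?P<typeid>\d+)\]"
)


def parse_rule(line):
    """Return (wildcard string, first number, last number, typeid)."""
    match = RULE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a distribution rule: {line!r}")
    return (
        match["wild"].strip(),
        int(match["start"]),
        int(match["end"]),
        int(match["typeid"]),
    )


def expand_rule(pattern, line):
    """Expand (#) and (*) in pattern; pair every URL with the rule's typeid."""
    wild, start, end, typeid = parse_rule(line)
    base = pattern.replace("(#)", wild)
    return [(base.replace("(*)", str(n)), typeid) for n in range(start, end + 1)]


pages = expand_rule(
    "http://www.aaa.com/(#)_(*).html",
    "[(#)=>labs/list_3; (*)=>1-25; typeid=>7]",
)
```

Under these assumptions the example rule yields 25 list pages, from `http://www.aaa.com/labs/list_3_1.html` to `http://www.aaa.com/labs/list_3_25.html`, each tagged with typeid 7.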
Article URL matching rules
Article URL matching mode:
Specify the region that contains the article URLs (the URL, title, thumbnail and similar information can be obtained)
Specify a regular expression for the URL (only the URL can be obtained)
URL regular expression:
Region containing the article URLs:
HTML at the start of the region:
HTML at the end of the region:
If the link contains an image:
Do not process
Collect as the thumbnail
Filter the URLs in the region again:
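The region mode can be pictured as cutting the list page between two HTML markers and reading the links inside. The Python sketch below assumes plain string matching of the markers and a simple regular expression for href attributes; the function names and the sample markup are hypothetical.

```python
import re

# Simplified link pattern: captures the href value of each <a> tag.
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_region(html, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker."""
    start = html.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return ""
    return html[start:end]


def extract_links(html, start_marker, end_marker):
    """Collect link targets that appear inside the marked region only."""
    return HREF_RE.findall(extract_region(html, start_marker, end_marker))


page = (
    '<a href="/nav.html">Nav</a>'
    '<ul class="list"><li><a href="/a/1.html">One</a></li>'
    '<li><a href="/a/2.html">Two</a></li></ul>'
    '<a href="/foot.html">Foot</a>'
)
links = extract_links(page, '<ul class="list">', "</ul>")
```

Here `links` contains only `/a/1.html` and `/a/2.html`; the navigation and footer links fall outside the region and are ignored.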
(using regular expressions)
Must include:
(takes priority over the latter)
Must not include:
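One possible reading of the priority note is that the "must include" expression, when set, decides on its own, and the exclusion expression is consulted only when "must include" is empty. The Python sketch below implements that reading as an assumption; the collector's actual behavior may differ, and `filter_urls` is a hypothetical name.

```python
import re


def filter_urls(urls, must_include="", cannot_include=""):
    """Filter URLs with regular expressions.

    Assumption: a non-empty must_include takes priority, so
    cannot_include is applied only when must_include is empty.
    """
    if must_include:
        pattern = re.compile(must_include)
        return [u for u in urls if pattern.search(u)]
    if cannot_include:
        pattern = re.compile(cannot_include)
        return [u for u in urls if not pattern.search(u)]
    return list(urls)


candidates = ["/a/1.html", "/a/list_2.html", "/b/3.html"]
kept = filter_urls(candidates, must_include=r"^/a/", cannot_include=r"list_")
```

Under this reading `kept` still contains `/a/list_2.html`, because the exclusion expression is skipped once an inclusion expression is given.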