I've run into a problem scraping articles that are split across several pages.
For example, take this page: http://et.21cn.com/star/zhuixing/neidi/2007/06/21/3307015.shtml
Its pagination markup is: <TD class="link14pp" align="center"><table width="95%" border="0" cellspacing="0" cellpadding="0" align="center"><tr><td style="font-size: 14px"><div align="center"><font color="#3888C9"><font color=#FF0000>[1]</font> <a target=_self href=3307015_1.shtml>[2]</a> <a target=_self href=3307015_2.shtml>[3]</a> <a target=_self href=3307015_3.shtml>[4]</a> <font color=#3888C9><a target=_self href=3307015_1.shtml>[下一页]</font></a> </font></div></td></tr></table></TD>
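There are two problems:
1. The pagination links are relative, so they need a link prefix that includes the date directory. But the articles on the list page are not all from the same day, so pagination is not scraped for articles from other dates.
2. Even when pagination is scraped, the page behind <a target=_self href=3307015_1.shtml>[下一页] (the "next page" link) is always scraped a second time.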
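To make the desired behaviour concrete, here is a minimal Python sketch that runs outside SupeSite. It is not SupeSite's rule syntax and does not claim to show how SupeSite works internally. It resolves each relative pagination link against the URL of the article it was found on (so no fixed date prefix is needed) and skips links that were already seen (so the "next page" link does not add a duplicate). The function name `pagination_urls` and the regular expression are assumptions made for this illustration; the URL and markup come from the example above.

```python
import re
from urllib.parse import urljoin

# URL and markup copied from the example in this post.
ARTICLE_URL = "http://et.21cn.com/star/zhuixing/neidi/2007/06/21/3307015.shtml"
PAGER_HTML = (
    "<font color=#FF0000>[1]</font> "
    "<a target=_self href=3307015_1.shtml>[2]</a> "
    "<a target=_self href=3307015_2.shtml>[3]</a> "
    "<a target=_self href=3307015_3.shtml>[4]</a> "
    "<font color=#3888C9><a target=_self href=3307015_1.shtml>[下一页]</font></a>"
)


def pagination_urls(article_url, pager_html):
    """Hypothetical helper: absolute page URLs, first-seen order, no repeats."""
    # Assumption: every pagination link has exactly this unquoted form.
    hrefs = re.findall(r"<a target=_self href=([^\s>]+)>", pager_html)
    seen = set()
    urls = []
    for href in hrefs:
        # Resolve against the article's own URL, so the date directory
        # (2007/06/21 here) always matches the article being scraped.
        absolute = urljoin(article_url, href)
        if absolute not in seen:  # the "next page" link repeats page 2
            seen.add(absolute)
            urls.append(absolute)
    return urls


for url in pagination_urls(ARTICLE_URL, PAGER_HTML):
    print(url)
```

The pager contains four links but only three distinct targets, so the sketch yields three URLs, all under the article's own `2007/06/21/` directory.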
How can these two problems be solved?
My scraping rule is attached:
# SupeSite Dump
# Version: SupeSite 5.5.2
# Time: 2007-06-25 14:00:25
# From: 黄石门户 ( http://www.huangshi.com)
#
# This file was BASE64 encoded
#
# SupeSite: http://www.supesite.com
# Please visit our website for latest news about SupeSite
# --------------------------------------------------------
YTozNzp7czo3OiJyb2JvdGlkIjtzOjM6IjEyNSI7czo0OiJuYW
1lIjtzOjU1OiIyMWNuX8P30MdfxNq12CjQ6NKq0N64xM7E1cLE
2sjdt9bSs8G0vdNVUkyyubPkx7DXusjVxtopIjtzOjM6InVpZC
I7czoxOiI0IjtzOjg6ImRhdGVsaW5lIjtzOjEwOiIxMTgyNDA5
ODIzIjtzOjg6Imxhc3R0aW1lIjtzOjEwOiIxMTgyNDA5ODM4Ij
tzOjg6InJvYm90bnVtIjtzOjI6IjgzIjtzOjExOiJsaXN0dXJs
dHlwZSI7czo0OiJhdXRvIjtzOjc6Imxpc3R1cmwiO3M6NTU6Im
h0dHA6Ly9ldC4yMWNuLmNvbS9zdGFyL3podWl4aW5nL25laWRp
L2xpc3RbcGFnZV0uc2h0bWwiO3M6MTM6Imxpc3RwYWdlc3Rhcn
QiO3M6MToiMSI7czoxMToibGlzdHBhZ2VlbmQiO3M6MToiMiI7
czo2OiJhbGxudW0iO3M6MjoiMzMiO3M6NjoicGVybnVtIjtzOj
E6IjEiO3M6Nzoic2F2ZXBpYyI7czoxOiIwIjtzOjY6ImVuY29k
ZSI7czowOiIiO3M6MTM6InBpY3VybGxpbmtwcmUiO3M6MDoiIj
tzOjk6InNhdmVmbGFzaCI7czoxOiIwIjtzOjE0OiJzdWJqZWN0
dXJscnVsZSI7czo5OToiOjogPHNwYW4gY2xhc3M9InVubmFtZW
QxIj48Zm9udCBjb2xvcj0iIzAwMDAwMCI+xNq12NDHzsU8L2Zv
bnQ+PC9zcGFuPio6OltsaXN0XSA8IS0tyc/Su9KztcS0+sLrLS
0+IjtzOjE4OiJzdWJqZWN0dXJsbGlua3J1bGUiO3M6NDc6Ijxh
IGhyZWY9Ilt1cmxdIiB0YXJnZXQ9Il9ibGFuayIgY2xhc3M9Ij
E0cDE2aCI+IjtzOjE3OiJzdWJqZWN0dXJsbGlua3ByZSI7czox
ODoiaHR0cDovL2V0LjIxY24uY29tIjtzOjExOiJzdWJqZWN0cn
VsZSI7czoyNDoiPHRpdGxlPltzdWJqZWN0XTwvdGl0bGU+Ijtz
OjEzOiJzdWJqZWN0ZmlsdGVyIjtzOjExNzoiLSAyMUNOLkNPTS
AtINPpwNbGtbXAfC0gMjFDTi5DT00gLSDT6cDWINL9t6LQx7jf
s7EhfC0gMjFDTi5DT00gLSDT6cDWINL9t6K/7LjQIXwtIDIxQ0
4uQ09NIC0g0+nA1iDS/beiv+y40HwtIDIxQ04uQ09NIjtzOjE0
OiJzdWJqZWN0cmVwbGFjZSI7czowOiIiO3M6MTY6InN1YmplY3
RyZXBsYWNldG8iO3M6MDoiIjtzOjEwOiJzdWJqZWN0a2V5Ijtz
OjA6IiI7czoxODoic3ViamVjdGFsbG93cmVwZWF0IjtzOjE6Ij
AiO3M6MTI6ImRhdGVsaW5lcnVsZSI7czowOiIiO3M6ODoiZnJv
bXJ1bGUiO3M6MDoiIjtzOjEwOiJhdXRob3JydWxlIjtzOjA6Ii
I7czoxMToibWVzc2FnZXJ1bGUiO3M6Mjk6IjwhLS3V/c7ELS0+
W21lc3NhZ2VdPC9wPjwvVEQ+IjtzOjEzOiJtZXNzYWdlZmlsdG
VyIjtzOjQ3MzoiJmd0OyZndDsmZ3Q7yKuyv76rssrNvMasfCZn
dDsmZ3Q7Jmd0O7S0vai49sjLzby8r3w8QSB0aXRsZT0iIiBocm
VmPSouMjFjbi5jb20qPnwmZ3Q7Jmd0OyZndDvN+NPRzPnNvHzP
4LnYway90zp8PEEgaHJlZj0qPio8L0E+fDIxQ07T6cDW0bY6fD
IxQ07T6cDW0+nRtnwmZ3Q7Jmd0OyZndDvQtNXmxrW1wMir0MLJ
z8/fzPTVvcTjtcTR28fytdfP33y147v3sum/tNfu0MLPytfuyM
jAsbjbzKjQws7FJmd0OyZndDsmZ3Q7Jmd0O3w8U1RST05HKjwv
U1RST05HPnw8QSo+fDxzY3JpcHQqPjwvc2NyaXB0Pnw8L0E+fD
xUQUJMRSBjZWxsU3BhY2luZz0wIGNlbGxQYWRkaW5nPTEwIHdp
ZHRoPSI5NSUiIGJvcmRlcj0wPio8QSo+Ks28xqzGtbXAKiZndD
sgPEEqPio8L0E+fDxUQUJMRSBib3JkZXJDb2xvcj0jYzBjMGMw
IGNlbGxQYWRkaW5nPTAgd2lkdGg9NTUwIGFsaWduPWNlbnRlci
Bib3JkZXI9MT4qzbzGrMa1tcAqJmd0OyA8QSo+KjwvQT4iO3M6
MTU6Im1lc3NhZ2VwYWdldHlwZSI7czo0OiJwYWdlIjtzOjE1Oi
JtZXNzYWdlcGFnZXJ1bGUiO3M6NDg6Ijxmb250IGNvbG9yPSNG
RjAwMDA+WzFdPC9mb250PltwYWdlYXJlYV1bz8LSu9KzXSI7cz
oxODoibWVzc2FnZXBhZ2V1cmxydWxlIjtzOjI4OiI8YSB0YXJn
ZXQ9X3NlbGYgaHJlZj1bcGFnZV0+IjtzOjIxOiJtZXNzYWdlcG
FnZXVybGxpbmtwcmUiO3M6NTA6Imh0dHA6Ly9ldC4yMWNuLmNv
bS9zdGFyL3podWl4aW5nL25laWRpLzIwMDcvMDYvMTkvIjtzOj
E0OiJtZXNzYWdlcmVwbGFjZSI7czowOiIiO3M6MTY6Im1lc3Nh
Z2VyZXBsYWNldG8iO3M6MDoiIjtzOjc6InZlcnNpb24iO3M6NT
oiNS41LjIiO30=
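For anyone who wants to read the rule without importing it: the header says the file is BASE64 encoded, and the first line decodes to `a:37:{s:7:"robotid";s:3:"125";...`, which looks like PHP `serialize()` output. The Python sketch below decodes such a dump into a dictionary. It is only an illustration under assumptions: the names `parse_value`, `to_text` and `decode_dump` are hypothetical, only the `a:`, `s:` and `i:` types are handled, and the text encoding is assumed to be GBK. The sample payload is a shortened stand-in that reuses two fields visible in the dump, not the full rule.

```python
import base64


def parse_value(data, pos):
    """Parse one PHP-serialized value (a:, s:, i: only) at data[pos]."""
    kind = data[pos:pos + 1]
    if kind == b"i":  # i:<int>;
        end = data.index(b";", pos)
        return int(data[pos + 2:end]), end + 1
    if kind == b"s":  # s:<byte length>:"<bytes>";
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2  # skip the colon and opening quote
        return data[start:start + length], start + length + 2
    if kind == b"a":  # a:<count>:{<key><value>...}
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2  # skip the colon and opening brace
        result = {}
        for _ in range(count):
            key, pos = parse_value(data, pos)
            value, pos = parse_value(data, pos)
            result[key] = value
        return result, pos + 1  # skip the closing brace
    raise ValueError(f"unsupported type marker {kind!r} at offset {pos}")


def to_text(value, encoding):
    """Decode byte strings; leave integers and nested values untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    return value


def decode_dump(dump_text, encoding="gbk"):
    """Drop '#' header lines, BASE64-decode the rest, parse the array."""
    payload = "".join(
        line.strip()
        for line in dump_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    parsed, _ = parse_value(base64.b64decode(payload), 0)
    return {to_text(k, encoding): to_text(v, encoding)
            for k, v in parsed.items()}


# Shortened stand-in payload built from two fields seen in the dump.
sample_body = b'a:2:{s:7:"robotid";s:3:"125";s:7:"version";s:5:"5.5.2";}'
sample_dump = "# SupeSite Dump\n" + base64.b64encode(sample_body).decode("ascii")

rule = decode_dump(sample_dump)
print(rule["version"])
```

Decoded this way, fields such as `messagepagerule` and `messagepageurllinkpre` can be checked directly against the page markup.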