[Answered] Chinese word segmentation / multi-keyword search for SupeSite

littlehz posted on 2009-8-18 11:52:47
Last edited by littlehz on 2009-8-18 12:08

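Because of the limits of the servers that site owners typically use, it is hard for SupeSite to offer Chinese word-segmentation search (done at scale and with high efficiency, that would come close to being a search engine). Recently I learned from Zhang Yan's blog about HTTPCWS, the open-source, HTTP-based Chinese word segmentation system he developed, and had the idea of using it for segmented search in SupeSite.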

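How it works:
1. I set up HTTPCWS on my own server. Segmentation demo (HTTP GET): http://www.littz.cn:1989/?w=尤其是,对于那些已经安装使用 Discuz! 和 UCenter Home 的站长来说,通过 SupeSite 7.0,马上就可以快速搭建一个社区门户,拥有一套简洁高效易用的社区资讯发布系统了。 The response comes back split into:
尤其是 , 对于 那些 已经安装 使用  Discuz !  和  UCenter  Home  的 站长 来说 , 通过  SupeSite  7.0 , 马上 就可以 快速 搭建 一个 社区 门户 , 拥有 一套 简洁 高效 易用 的 社区 资讯 发布系统 了 。
Note that when a whole sentence is passed in, it is split into words fairly accurately.
2. SupeSite's batch.search.php receives the $searchkey variable that the visitor submits when searching.
3. $searchkey is sent to the www.littz.cn:1989 server for processing, which returns the segmented words.
4. All Chinese and English punctuation is replaced with spaces, and runs of consecutive spaces are collapsed into one.
5. The SQL statement used to match the data is:
    SELECT * FROM `supe_spaceitems` WHERE `subject` LIKE '%word1%word2%word3%' ORDER BY `dateline` DESC LIMIT 0,20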
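Steps 4 and 5 can be sketched as one small helper. This is only a minimal sketch, assuming the segmenter's output is UTF-8 with words separated by spaces; `build_like_pattern` is a hypothetical name, not part of SupeSite, and the Unicode-class regex is a stand-in for an explicit punctuation list.

```php
<?php
// Hypothetical helper (not part of SupeSite): turn segmenter output into a LIKE pattern.
// Assumes $segmented is valid UTF-8 with words separated by spaces.
function build_like_pattern($segmented)
{
    // Step 4: replace every run of punctuation, symbols and whitespace with one space.
    $clean = preg_replace('/[\p{P}\p{S}\s]+/u', ' ', $segmented);
    if ($clean === null) {
        $clean = $segmented; // regex failed (e.g. invalid UTF-8): keep the input as-is
    }
    $words = trim($clean);
    if ($words === '') {
        return '%'; // nothing left to search for: match everything
    }
    // Step 5: join the words with % so they must appear in this order.
    return '%' . str_replace(' ', '%', $words) . '%';
}

echo build_like_pattern('wordpress 迁移 至 supesite'), "\n";
```

Because `%`, `_`, quotes and backslashes all fall into the punctuation classes, they are stripped before the pattern is built, but the value should still go through the site's normal SQL escaping before it is placed in the query.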
Demo: http://www.littz.cn, which runs SupeSite 7.0. One article there is titled "Wordpress平滑迁移至SupeSite". Without segmentation, a visitor who searches titles for "wordpress迁移至supesite" gets no results. With segmented search, the visitor's input "wordpress迁移至supesite" is split into the words "wordpress", "迁移", "至" and "supesite". The corresponding SQL statement is:
    SELECT * FROM `supe_spaceitems` WHERE `subject` LIKE '%wordpress%迁移%至%supesite%' ORDER BY `dateline` DESC LIMIT 0,20
so the article "Wordpress平滑迁移至SupeSite" is found.
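To see why the segmented pattern matches while the raw query does not, the LIKE comparison can be imitated in PHP. This is an illustration only: `like_matches` is a hypothetical helper, it handles `%` but not `_`, and it assumes UTF-8 strings and a case-insensitive collation on the `subject` column.

```php
<?php
// Hypothetical helper: emulate "subject LIKE pattern" in PHP, for illustration only.
// Assumes UTF-8 input and a case-insensitive collation; only % is handled, not _.
function like_matches($pattern, $subject)
{
    // Quote each literal piece, then let every % match any run of characters.
    $parts = array_map(function ($p) {
        return preg_quote($p, '/');
    }, explode('%', $pattern));
    $regex = '/^' . implode('.*', $parts) . '$/isu';
    return preg_match($regex, $subject) === 1;
}

$title = 'Wordpress平滑迁移至SupeSite';
var_dump(like_matches('%wordpress迁移至supesite%', $title));    // bool(false): "平滑" breaks the literal match
var_dump(like_matches('%wordpress%迁移%至%supesite%', $title)); // bool(true): each gap is covered by %
var_dump(like_matches('%supesite%wordpress%', $title));         // bool(false): same words, wrong order
```

The last call shows a property of this approach: the words must occur in the title in the order the visitor typed them. AND-ing one LIKE clause per word would be an order-independent alternative, though that is not what the statement above does.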
Original poster | littlehz posted on 2009-8-18 11:53:02
Last edited by littlehz on 2009-8-18 17:31

1. Changes for the GBK edition of SupeSite:
Insert the code below into batch.search.php, on the line directly after this statement (around line 132):
    $urlplus = 'searchkey='.rawurlencode($searchkey).'&type='.rawurlencode($type);
The order must not change, otherwise the segmentation result will not be accurate.
    // Replace Chinese and English punctuation with spaces.
    // Save this file in the site's encoding (GBK) so the full-width marks match.
    function clear_point($jiugui)
    {
        $marks = array(
            // ASCII marks
            '~','!','@','#','$','%','^','&','*',',','.','?',';',':','/',"'",'"','[',']','{','}',
            // Chinese marks
            '!','¥','……','…','、',',','。','?',';',':','‘','’','“','”','【','】',
            // full-width forms; the last entry is the full-width space
            '~','@','#','$','%','^','&','*','.','<','>',''','"','[',']','{','}','/','\',' '
        );
        return str_replace($marks, ' ', $jiugui);
    }
    $rawkey = $searchkey;
    $searchkey = @file_get_contents("http://www.littz.cn:1989/?w=".urlencode($rawkey));
    if ($searchkey === false) {
        $searchkey = $rawkey; // segmentation server unreachable: fall back to the raw keywords
    }
    $searchkey = clear_point($searchkey);
    $searchkey1 = trim(preg_replace('/\s+/', ' ', $searchkey)); // shown back in the search box
    $searchkey = str_replace(' ', '%', $searchkey1);            // used in the LIKE clause
2. Changes for the UTF-8 edition of SupeSite:
Insert the code below into batch.search.php, on the line directly after this statement (around line 132):
    $urlplus = 'searchkey='.rawurlencode($searchkey).'&type='.rawurlencode($type);
The order must not change, otherwise the segmentation result will not be accurate.
    // Replace Chinese and English punctuation with spaces.
    // Save this file in the site's encoding (UTF-8) so the full-width marks match.
    function clear_point($jiugui)
    {
        $marks = array(
            // ASCII marks
            '~','!','@','#','$','%','^','&','*',',','.','?',';',':','/',"'",'"','[',']','{','}',
            // Chinese marks
            '!','¥','……','…','、',',','。','?',';',':','‘','’','“','”','【','】',
            // full-width forms; the last entry is the full-width space
            '~','@','#','$','%','^','&','*','.','<','>',''','"','[',']','{','}','/','\',' '
        );
        return str_replace($marks, ' ', $jiugui);
    }
    $rawkey = $searchkey;
    $searchkey = iconv("UTF-8", "GBK//IGNORE", $rawkey);          // HTTPCWS expects GBK
    $searchkey = @file_get_contents("http://www.littz.cn:1989/?w=".urlencode($searchkey));
    if ($searchkey === false) {
        $searchkey = $rawkey; // segmentation server unreachable: fall back to the raw keywords
    } else {
        $searchkey = iconv("GBK", "UTF-8//IGNORE", $searchkey);   // back to the site's encoding
    }
    $searchkey = clear_point($searchkey);
    $searchkey1 = trim(preg_replace('/\s+/', ' ', $searchkey)); // shown back in the search box
    $searchkey = str_replace(' ', '%', $searchkey1);            // used in the LIKE clause
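Because HTTPCWS only accepts GBK input for segmentation, UTF-8 text has to be converted to GBK before it is segmented and converted back to UTF-8 afterwards.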
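The encoding round trip can be looked at in isolation. In the sketch below, `segment_utf8` is a hypothetical wrapper and `$segmenter` is any callable that takes and returns GBK bytes; it stands in for the HTTP request so that the conversion can be tried without the remote server.

```php
<?php
// Hypothetical wrapper: $segmenter is any callable that takes and returns GBK bytes,
// standing in for the HTTP request to the HTTPCWS server.
function segment_utf8($text, $segmenter)
{
    // UTF-8 -> GBK; //IGNORE asks iconv to drop characters GBK cannot represent.
    $gbk = iconv('UTF-8', 'GBK//IGNORE', $text);
    if ($gbk === false) {
        return $text; // conversion failed: keep the original keywords
    }
    $segmentedGbk = call_user_func($segmenter, $gbk);
    // GBK -> UTF-8 so the result matches the site's own encoding again.
    $utf8 = iconv('GBK', 'UTF-8//IGNORE', $segmentedGbk);
    return $utf8 === false ? $text : $utf8;
}

// Stand-in segmenter for demonstration: returns its input unchanged.
$identity = function ($gbk) {
    return $gbk;
};
echo segment_utf8('wordpress迁移至supesite', $identity), "\n";
```

Characters that GBK cannot represent are silently dropped by `//IGNORE`, so a query made up only of such characters would reach the segmenter empty.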

3. Changes required for both the GBK and UTF-8 editions.
In the default template, templates/default/site_search.html.php, around line 56, find
    <input type="text" class="input_tx" size="50" name="searchkey" value="$searchkey" />
and change it to
    <input type="text" class="input_tx" size="50" name="searchkey" value="$searchkey1" />
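Additional notes:
Segmented search depends on the www.littz.cn server and on the network path between the server hosting your SupeSite (SS) site and www.littz.cn. That server is in Silicon Valley, USA (IP: 64.71.167.26), so access is affected by the undersea cables; the cable fault on 17 August 2009, for example, made it slow to reach. The HTTPCWS interface itself segments Chinese text very quickly. If you have the resources, I recommend setting up your own HTTPCWS + Sphinx search server. I cannot guarantee that this service will keep running in the long term, though I will certainly try to keep it up; it is offered as one way of solving the problem.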
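Since the search depends on a remote server that may be slow or offline, it may be worth bounding how long the request can take. The following is a minimal sketch, not the code from this thread: `segment_with_fallback` is a hypothetical name, the 2-second default is an arbitrary assumption, and the `?w=` parameter follows the demo URL given earlier.

```php
<?php
// Hypothetical helper: ask a segmentation endpoint for help, but never block the search on it.
// $endpoint and the 2-second default are assumptions; ?w= follows the demo URL in this thread.
function segment_with_fallback($searchkey, $endpoint, $timeout = 2)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout), // give up reading after $timeout seconds
    ));
    // @ suppresses the warning PHP raises when the request fails.
    $result = @file_get_contents($endpoint . '?w=' . urlencode($searchkey), false, $context);
    if ($result === false || trim($result) === '') {
        return $searchkey; // unreachable or empty reply: search with the raw keywords
    }
    return $result;
}
```

With this in place a visitor still gets an ordinary unsegmented search when the segmentation server cannot be reached, instead of an empty or stalled result page. For the UTF-8 edition the GBK conversion would still be needed around the call.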

qdcaishen posted on 2010-1-16 13:45:13
Very nice...