php, word segmentation, jieba, scws

Using the PHP word segmentation libraries Jieba and SCWS in Laravel

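Stick with open source, keep sharing.
This article introduces two PHP word segmentation libraries I have used, along with their basic usage.

  • Goal: segment a paragraph of text into words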

1. The Jieba segmentation library

Jieba segmentation library: see its GitHub repository (fukuball/jieba-php)
Installation:

composer require fukuball/jieba-php:dev-master

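Main code:

    // Raise the memory limit first; otherwise the script runs out of memory
    ini_set('memory_limit', '1024M');

    // Initialize
    $this->jieba = new Jieba();
    $this->finalseg = new Finalseg();

    $this->jieba->init();
    $this->finalseg->init();

    // Usage (the second argument, false, leaves full-mode cutting off)
    $cut_array = $this->jieba->cut('string to segment', false);
    // The segmentation result is an array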
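
To see how much memory the Jieba setup really needs before settling on a memory_limit value, the work can be wrapped in a small measuring helper. This is a minimal sketch: runAndReportPeakMemory is a hypothetical name, not part of jieba-php, and the workload here is a stand-in so the snippet runs without the library installed.

```php
<?php
// Hypothetical helper (not part of jieba-php): runs a task and reports
// the script's peak memory once the task has finished.
function runAndReportPeakMemory(callable $task): array
{
    $result = $task();

    return [
        'result'  => $result,
        // memory_get_peak_usage(true) returns the peak number of bytes
        // allocated from the system; convert it to megabytes.
        'peak_mb' => round(memory_get_peak_usage(true) / 1048576, 1),
    ];
}

// Stand-in workload so the sketch runs on its own. In the controller,
// the closure would wrap the init() and cut() calls shown above.
$report = runAndReportPeakMemory(function () {
    return range(1, 100000);
});

printf("peak memory: %.1f MB\n", $report['peak_mb']);
```

The reported figure covers the whole script up to that point, so it is best read as an upper bound for the wrapped task.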

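Notes:

  1. Jieba lets you add keywords, that is, custom vocabulary used during segmentation; if you need this, see the GitHub page
  2. The part-of-speech tags are listed in 'src/dict/pos_tag_readable.txt'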

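2. SCWS segmentation

Official demo site: scws4

This library felt fast to me, and it does not need nearly as much memory as Jieba; I was happy with it after using it. I chose PSCWS4, the pure-PHP implementation, rather than the PHP extension. It does not support Composer.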

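1. Download and install:

  1. pscws4
  2. The dictionary (Simplified Chinese, UTF-8)
  3. Unpack pscws4 into a newly created app/Help/scws directory (the path has to match the App\Help\scws namespace)
  4. Put the dictionary file in the public directory (the code also reads rules.ini from there)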

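2. Preparation:

  1. Rename pscws4.class.php from the unpacked pscws4 to PSCWS4.php, and replace its require statement with use App\Help\scws\XDB_R;
  2. Rename xdb_r.class.php from the unpacked pscws4 to XDB_R.php
  3. Add the namespace declaration namespace App\Help\scws; to both class files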

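3. Coding and testing. Minimal implementation (the full code is in the appendix):

    // Initialize: set the charset to UTF-8, then the dictionary and rule file paths
    $this->pscws = new PSCWS4('utf8');
    $this->pscws->set_charset('utf-8');
    $this->pscws->set_dict(public_path().'/dict.utf8.xdb');
    $this->pscws->set_rule(public_path().'/rules.ini');

    // Usage: results come back in batches until get_result() returns a falsy value
    $article = [];
    $this->pscws->send_text("string to segment...");
    while ($some = $this->pscws->get_result())
    {
        foreach ($some as $word)
        {
            $article[] = $word['word'];
        }
    }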
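
The batch loop above can be pulled into a small reusable function. The sketch below is only an illustration: FakeSegmenter and collectWords are hypothetical names, and the fake class mimics nothing more than the get_result() behaviour the loop relies on (one array of tokens per call, then a falsy value), so the snippet runs without PSCWS4 or its dictionary.

```php
<?php
// Hypothetical stand-in for PSCWS4 so this sketch runs on its own.
// It only mimics the get_result() contract used above.
class FakeSegmenter
{
    private $batches;

    public function __construct(array $batches)
    {
        $this->batches = $batches;
    }

    public function get_result()
    {
        // Hand out one batch per call, then false once nothing is left.
        return $this->batches ? array_shift($this->batches) : false;
    }
}

// Collects the 'word' field of every token from every batch.
function collectWords($segmenter): array
{
    $words = []; // initialised up front, so empty input yields []
    while ($batch = $segmenter->get_result()) {
        foreach ($batch as $token) {
            $words[] = $token['word'];
        }
    }
    return $words;
}

$fake = new FakeSegmenter([
    [['word' => '端午节'], ['word' => '是']],
    [['word' => '传统'], ['word' => '节日']],
]);
$words = collectWords($fake);
echo implode(' / ', $words), "\n";
```

In the controller, the same function would presumably be called with the real object after send_text(), for example collectWords($this->pscws).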

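4. Results

Jieba result: [image]
PSCWS4 result: [image]

As the screenshots show, Jieba does not split some English punctuation cleanly (for example item 42, country, in its output), whereas SCWS splits off every punctuation mark. For my needs SCWS is the better fit; which one to choose depends on your own requirements.

jieba: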

  • Pros: lets you add keywords and use a custom dictionary
  • Cons: needs a lot of memory; English text and punctuation are not handled very well

scws:

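  • Pros: a large dictionary of about 280,000 entries; segments finely, down to individual characters
  • Cons: you cannot extend it yourself; apparently that costs money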
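
If Jieba is preferred for its custom dictionary, the punctuation weakness can be worked around after segmentation. The following is a rough sketch, assuming the problem is punctuation staying attached to a neighbouring word (a token such as country. with its trailing period); splitPunctuation is a hypothetical helper, not part of either library.

```php
<?php
// Hypothetical post-processing helper, not part of jieba-php or SCWS.
// Splits every punctuation character out of each token into its own
// token, and drops whitespace-only tokens.
function splitPunctuation(array $tokens): array
{
    $out = [];
    foreach ($tokens as $token) {
        // \p{P} matches any Unicode punctuation character;
        // the u flag makes the pattern operate on UTF-8 text.
        $parts = preg_split(
            '/(\p{P})/u',
            $token,
            -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
        );
        foreach ($parts as $part) {
            if (trim($part) !== '') {
                $out[] = $part;
            }
        }
    }
    return $out;
}

// Sample tokens made up for illustration, not real Jieba output.
$tokens = ['country.', ' ', 'Nowadays', ',', '端午节', '。'];
$clean = splitPunctuation($tokens);
echo implode(' | ', $clean), "\n";
```

This also splits hyphenated words and contractions (well-known, don't), so the pattern may need narrowing depending on the text.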

Appendix: code

Routes: web.php

Route::get('/scws', 'WordCutController@scwsCut');
Route::get('/jieba', 'WordCutController@jieBaCut');

Controller: WordCutController

<?php

namespace App\Http\Controllers;

use App\Help\scws\PSCWS4;
use Fukuball\Jieba\Finalseg;
use Fukuball\Jieba\Jieba;
use Illuminate\Http\Request;

class WordCutController extends Controller
{
    public $pscws;
    public $jieba;
    public $finalseg;

    /*
     * PSCWS4 segmentation example
     */
    public function scwsCut(){
        $this->pscws = new PSCWS4('utf8');
        $this->pscws->set_charset('utf-8');
        $this->pscws->set_dict(public_path().'/dict.utf8.xdb');
        $this->pscws->set_rule(public_path().'/rules.ini');

        // Start with an empty list so dd() never receives an undefined variable
        $article = [];

        // Usage:
        $this->pscws->send_text("Dragon Boat Festival is one the very classic traditional festivals, which has been celebrated since the old China. Firstly, it is to in honor of the great poet Qu Yuan, who jumped into the water and ended his life for loving the country. Nowadays, different places have different ways to celebrate.
    端午节是一个非常经典的传统节日,自古以来就一直被人们所庆祝。首先,是为了纪念伟大的诗人屈原,屈原跳入水自杀,以此来表达了对这个国家的爱。如今,不同的地方有不同的庆祝方式。");
        while ($some = $this->pscws->get_result())
        {
            foreach ($some as $word)
            {
                $article[] = $word['word'];
            }
        }
        dd($article);

    }

    /*
     * Jieba segmentation example
     */
    public function jieBaCut(){
        // Jieba needs a generous memory limit
        ini_set('memory_limit', '1024M');

        // Initialize
        $this->jieba = new Jieba();
        $this->finalseg = new Finalseg();

        $this->jieba->init();
        $this->finalseg->init();

        // Usage
        $cut_array = $this->jieba->cut('Dragon Boat Festival is one the very classic traditional festivals, which has been celebrated since the old China. Firstly, it is to in honor of the great poet Qu Yuan, who jumped into the water and ended his life for loving the country. Nowadays, different places have different ways to celebrate.
端午节是一个非常经典的传统节日,自古以来就一直被人们所庆祝。首先,是为了纪念伟大的诗人屈原,屈原跳入水自杀,以此来表达了对这个国家的爱。如今,不同的地方有不同的庆祝方式。',false);

        dd($cut_array);
    }
}

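If anything here is wrong or incomplete, please point it out. I only did enough to meet my own needs and have not studied either library in depth. Thanks, everyone :)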

Last modified: 2019-05-27 12:18:04