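Using Python to automatically fetch zblog articles into Tencent Cloud LLM Knowledge Engine (LKE) and build a dedicated knowledge base from your own data
First, log in to Tencent Cloud LLM Knowledge Engine (LKE) and create an application, then note the application ID together with the SecretID and SecretKey of your Tencent Cloud account.
Create a plugin/ai directory under the theme your site currently uses. Taking zblog's built-in default theme as an example, the resulting path is: zb_users/theme/default/plugin/ai/
Create getPost.php in the ai directory with the following code: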
<?php
require '../../../../../zb_system/function/c_system_base.php';
require '../../../../../zb_system/function/c_system_admin.php';
$zbp->Load();

// For safety, set a password; requests without it are redirected to the home page
$password = '123456';
if (GetVars('act', 'GET') != $password) {
    header("Location: " . $zbp->host);
    die();
}

// Pager links carry the password plus the page and category placeholders.
// The query string must start with "?" in both static and dynamic URL modes.
$p = new Pagebar('./getPost.php?act=' . $password . '&p={%page%}&c={%category%}', false);
$p->PageCount = 50; // number of articles per page
$p->PageNow = (int)GetVars('p', 'GET') == 0 ? 1 : (int)GetVars('p', 'GET');
$p->PageBarCount = $zbp->pagebarcount;
$p->UrlRule->Rules['{%category%}'] = GetVars('c');

// Select public articles only (log_Status = 0), optionally limited to one category
$w = array();
$w[] = array('=', 'log_Status', 0);
if (GetVars('c')) {
    $w[] = array('=', 'log_CateID', GetVars('c'));
}
$s = '';
$or = array('log_PostTime' => 'DESC');
$l = array(($p->PageNow - 1) * $p->PageCount, $p->PageCount);
$op = array('pagebar' => $p);
$array = $zbp->GetArticleList($s, $w, $or, $l, $op, false);

// Print a heading when a category was requested
if (GetVars('c')) {
    $ca = GetVars('c');
    if (isset($zbp->categorys[(int)$ca]->Name)) {
        echo '<h3>The following content is about ' . $zbp->categorys[(int)$ca]->Name . '</h3>';
    }
}

if ($array) {
    foreach ($array as $article) {
        if ($article->Status == 0) {
            // Strip HTML so that only plain text is emitted
            $intro = TransferHTML($article->Content, '[nohtml]');
            echo 'Title: ' . $article->Title . '<br>';
            echo 'Content: ' . $intro . '<br>';
            echo '------<br>';
        }
    }
    // Pagination links
    echo '<div class="pagebar">';
    foreach ($p->buttons as $k => $v) {
        if ($p->PageNow == $k) {
            echo '<span class="now-page">' . $k . '</span>';
        } elseif ($p->PageNow + 1 == $k) {
            echo '<a href="' . $v . '" class="next-page">' . $k . '</a>';
        } else {
            echo '<a href="' . $v . '">' . $k . '</a>';
        }
    }
    echo '</div>';
} else {
    echo '<div class="mynull"><p>No data found!</p></div>';
}
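The URL above can also output the articles of a single category, for example getPost.php?act=123456&c=8, where 8 is the category ID. If &c=<category ID> is omitted, all public articles on the site are returned.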
Create a Python file named output.py in the zblog site root. Its job is to collect the article text served by getPost.php and write it to output.txt.
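The listing for output.py is not shown here, so the following is only a minimal sketch of what such a script might look like, not the author's code. It assumes the script requests getPost.php page by page (p=1, 2, ...), stops when the page contains the "mynull" block that getPost.php prints for an empty result, and converts the HTML to plain text. The names EMPTY_MARK, page_to_text, collect and save_articles are hypothetical, and only the Python standard library is used.

```python
import html
import re
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# getPost.php wraps its "no data" message in this element (see the PHP above)
EMPTY_MARK = 'class="mynull"'


def page_to_text(page_html):
    """Turn one getPost.php page into plain text, dropping the pager links."""
    body = re.sub(r'<div class="pagebar">.*?</div>', "", page_html, flags=re.S)
    body = re.sub(r"</h3>|<br\s*/?>", "\n", body, flags=re.I)  # line breaks
    body = re.sub(r"<[^>]+>", "", body)                        # remaining tags
    return html.unescape(body).strip()


def fetch(url):
    """Download one page and decode it as UTF-8."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect(base_url, password, category=None, fetch_page=fetch, max_pages=1000):
    """Walk the pages until an empty (or repeated) page is returned."""
    chunks = []
    for page in range(1, max_pages + 1):
        query = {"act": password, "p": page}
        if category is not None:
            query["c"] = category
        page_html = fetch_page(base_url + "?" + urlencode(query))
        if EMPTY_MARK in page_html:
            break
        text = page_to_text(page_html)
        # A repeated page usually means a redirect (e.g. wrong password)
        if not text or (chunks and text == chunks[-1]):
            break
        chunks.append(text)
    return "\n".join(chunks)


def save_articles(base_url, password, out_path="output.txt", category=None):
    """Write all collected text to out_path and return the character count."""
    text = collect(base_url, password, category=category)
    Path(out_path).write_text(text, encoding="utf-8")
    return len(text)
```

To use it as output.py, the file would end with a call such as save_articles("https://your-site.example/zb_users/theme/default/plugin/ai/getPost.php", "123456"), where the host name is a placeholder and the password must match the one set in getPost.php.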
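Create postLKE.py in the zblog root directory. It submits the article text that output.py wrote to output.txt to Tencent Cloud LLM Knowledge Engine as an offline document. The code is as follows: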
# -*- coding: utf-8 -*-
import os
import sys
from pathlib import Path

from tencentcloud.common.common_client import CommonClient
from tencentcloud.common import credential
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from qcloud_cos import CosConfig
from qcloud_cos import CosS3Client

EndPoint = "lke.tencentcloudapi.com"
SecretID = "replace with the SecretID of your Tencent Cloud account"
SecretKey = "replace with the SecretKey of your Tencent Cloud account"
BotBizID = "replace with the application ID (the value after appid= in the URL of the model configuration page)"
TypeKeyRealtime = "realtime"  # upload type for real-time files
TypeKeyOffline = "offline"    # upload type for offline documents
Region = "ap-guangzhou"

filePath = "output.txt"
fileName = Path(filePath).name
fileExt = Path(filePath).suffix[1:]
print("filePath: ", filePath)
print("fileName: ", fileName)
print("fileExt: ", fileExt)

# One API client is shared by step 1 and step 3
cred = credential.Credential(SecretID, SecretKey)
httpProfile = HttpProfile()
httpProfile.endpoint = EndPoint
clientProfile = ClientProfile()
clientProfile.httpProfile = httpProfile
common_client = CommonClient("lke", "2023-11-30", cred, Region, profile=clientProfile)

######## step 1: obtain temporary credentials ########
try:
    # This is an offline document upload, so TypeKey is "offline".
    # To reuse this code for real-time documents, set TypeKey to "realtime".
    params = {
        "BotBizId": BotBizID,
        "FileType": fileExt,
        "IsPublic": False,
        "TypeKey": TypeKeyOffline,
    }
    resp = common_client.call_json("DescribeStorageCredential", params)
except TencentCloudSDKException as err:
    print(err)
    sys.exit(1)  # the following steps cannot run without credentials

tmpSecretId = resp['Response']['Credentials']['TmpSecretId']
tmpSecretKey = resp['Response']['Credentials']['TmpSecretKey']
tmpToken = resp['Response']['Credentials']['Token']
uploadPath = resp['Response']['UploadPath']
bucket = resp['Response']['Bucket']
region = resp['Response']['Region']
print("======== DescribeStorageCredential Success =======")
# The temporary key and token are deliberately not printed to the task log
print("uploadPath: ", uploadPath)
print("bucket: ", bucket)
print("region: ", region)
print("======= temporary credentials obtained =============\n\n")

######## step 2: upload the document to the knowledge engine's COS bucket ########
# Reference: https://cloud.tencent.com/document/product/436/14113
# Use the region returned with the credentials, since that is where the bucket lives
config = CosConfig(Region=region, SecretId=tmpSecretId, SecretKey=tmpSecretKey,
                   Token=tmpToken, Scheme='https')
client = CosS3Client(config)
# Upload once through the high-level interface, without retries
response = client.upload_file(
    Bucket=bucket,
    Key=uploadPath,
    LocalFilePath=filePath,
    EnableMD5=False,
    progress_callback=None
)
print('upload result: ' + str(response))
eTag = response.get('ETag')
cosHash = response.get('x-cos-hash-crc64ecma')
print('etag: ' + str(eTag))
print('coshash: ' + str(cosHash))
print(" \n\n ============== \n\n")

######## step 3: save the document from COS into the knowledge engine ########
try:
    params = {
        "BotBizId": BotBizID,
        "FileName": fileName,
        "FileType": fileExt,
        "CosUrl": uploadPath,
        "ETag": eTag,
        "CosHash": cosHash,
        "Size": str(os.path.getsize(filePath)),
        "AttrRange": 1
    }
    resp = common_client.call_json("SaveDoc", params)
    print("SaveDoc result: ", resp)
except TencentCloudSDKException as err:
    print(err)
    sys.exit(1)
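With these three files in place, Python can automatically fetch zblog articles and submit them offline to Tencent Cloud LLM Knowledge Engine. Usage is as follows:
In the BaoTa (宝塔) panel, open Scheduled Tasks (计划任务) in the left-hand menu and create two tasks, named "Auto-create knowledge base TXT" and "Auto-submit knowledge base TXT".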
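Task 1: Auto-create knowledge base TXT (task type: Shell script; task name: Auto-create knowledge base TXT; schedule: weekly, Monday, hour 0, minute 0; run as: root by default; script content: python3 output.py);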
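Task 2: Auto-submit knowledge base TXT (task type: Shell script; task name: Auto-submit knowledge base TXT; schedule: weekly, Monday, hour 2, minute 0; run as: root by default; script content: python3 postLKE.py);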
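The two tasks are spaced two hours apart, presumably so that output.txt exists before it is uploaded. As an alternative sketch, and not part of the original setup, a single hypothetical wrapper script could run both steps in order and skip the upload if the first step fails. The function name run_steps and the STEPS list are assumptions made for illustration.

```python
import subprocess
import sys

# The two scripts described above, run with the current Python interpreter
STEPS = [
    [sys.executable, "output.py"],
    [sys.executable, "postLKE.py"],
]


def run_steps(steps, cwd=None):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            # Stop here so that a stale or missing output.txt is never uploaded
            return result.returncode
    return 0
```

A wrapper file would finish with sys.exit(run_steps(STEPS, cwd="/path/to/site/root")), where the path is a placeholder for the zblog root, and a single scheduled task would then call that wrapper.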
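Adjust the schedules to suit your own situation. Once the file has been submitted to LKE, the model can learn from the submitted articles automatically.
Note: the knowledge base has a capacity of 3,000,000 characters, so the submitted article text must stay within that limit. After the model has finished learning, you also need to click the Publish button manually; the model can be put into use only after it has been published.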
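Because of the capacity limit in the note above, it may be worth checking the size of output.txt before uploading it. The following is a minimal sketch, assuming that Python's character count is a reasonable approximation of how the knowledge engine counts characters; the names MAX_CHARS and check_size are hypothetical.

```python
from pathlib import Path

MAX_CHARS = 3_000_000  # capacity stated in the note above


def check_size(path, limit=MAX_CHARS):
    """Return (character_count, fits) for a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    count = len(text)  # counts characters, not bytes
    return count, count <= limit
```

For example, postLKE.py could call check_size("output.txt") first and exit without uploading when the second value is False.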