Automatic Metadata Extraction Based on Conditional Random Fields (e-book)

Abstract

With the development of digital library technology, electronic documents have become the main source from which people obtain information. To help users find the literature they need quickly and conveniently, research on metadata extraction has attracted growing attention. Automatic metadata extraction replaces the earlier, time-consuming and laborious practice of reading documents manually to identify their metadata, and it makes the orderly organization, appropriate control, and efficient retrieval of electronic resources easier. As machine learning theory has matured, automatic metadata extraction has become an active research topic. This work studies techniques for automatic metadata extraction based on conditional random fields.

First, extracting metadata from a paper header treated as a sequence of individual words involves a heavy workload and yields low extraction accuracy. To address this, a text chunking strategy is proposed and the chunking process is described in detail, so that each extraction field corresponds to one specific text chunk. Building on the chunking, information in the text such as feature words is used, and feature extraction rules are defined to determine chunk states. During path search, a heuristic search algorithm determines the states of the remaining chunks.

Second, to extract citation metadata accurately, and in view of the variety of citation formats and the density of extraction fields, reranking is combined with the conditional random field model. The two form a serial pipeline in which the multiple candidate labelings generated by the conditional random field model are ranked to produce the extracted citation metadata.

Finally, the proposed methods are validated and analyzed experimentally and compared with existing methods, and directions for future work are outlined.

Keywords: metadata extraction; conditional random fields; text chunking; heuristic search; reranking

2.3 Comparison of Conditional Random Fields with Other Models
2.3.1 Hidden Markov Models
2.3.2 Maximum Entropy Markov Models
2.4 Strengths and Weaknesses of Conditional Random Fields
2.5 Parameter Estimation for Conditional Random Fields
2.5.1 Maximum Likelihood Estimation
2.5.2 Optimizing Parameter Estimation
2.6 Chapter Summary
Chapter 3 Paper Header Metadata Extraction Based on Heuristic Search
3.1 Overview of Metadata
3.1.1 The Role of Paper Metadata
3.1.2 Definition of the Paper Header Dataset
3.2 Feature Selection for Paper Headers
3.2.1 Local Features
3.2.2 Layout Features
3.2.3 External Dictionary Features
3.2.4 State Transition Features
3.3 Related Work and Techniques for Paper Headers