DOI:10.19850/j.cnki.2096-4706.2022.011.008
CLC number: TP391  Document code: A  Article ID: 2096-4706(2022)11-0030-04
English Text Difficulty Estimation Model of Multiple Linear Regression
AN Kang, ZHANG Yongbo, HUANG Ze
(Hangzhou Dianzi University, Hangzhou 310018, China)
Abstract: This paper applies a multiple linear regression model to estimate the difficulty of English texts. The progressively increasing difficulty of the New Concept textbooks is used to derive a difficulty index, and 8 evaluation indicators are defined across three dimensions: words, sentences, and articles. Python is used to extract the data and compute the indicators. Multiple linear regression is then performed, yielding WSA, an English text difficulty estimation model with a high goodness of fit. In addition, because readers from different cultural backgrounds find English texts difficult to different degrees, the concept of cultural distance is introduced. Its weight is analyzed and this subjective influence is incorporated into the WSA model, yielding the WSAP model.
Keywords: multiple linear regression; difficulty recognition; natural language processing
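The pipeline summarized in the abstract (extract indicators from each text, then regress a difficulty index on them) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the three indicators, the function names, and the synthetic data are all assumptions made for the example, and the paper's actual 8 indicators are not reproduced here.

```python
import re
import numpy as np


def extract_indicators(text):
    """Compute three illustrative indicators (hypothetical, not the paper's 8)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_word_len = sum(len(w) for w in words) / len(words)      # word level
    avg_sent_len = len(words) / len(sentences)                  # sentence level
    type_token_ratio = len({w.lower() for w in words}) / len(words)  # article level
    return [avg_word_len, avg_sent_len, type_token_ratio]


def fit_linear_model(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted coefficients to one or more indicator rows."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]


def r_squared(y, y_hat):
    """Coefficient of determination, a common goodness-of-fit measure."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): in the paper, each row would be the
    # indicators of one textbook lesson and y its position-based difficulty index.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 10.0, size=(40, 3))
    y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * X[:, 2]
    coef = fit_linear_model(X, y)
    print("coefficients:", coef)
    print("R^2:", r_squared(y, predict(coef, X)))
```

With real data, each row of `X` would come from calling `extract_indicators` on one lesson text, and the fitted coefficients would serve as the weights of the indicators in the difficulty estimate.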
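About the author: AN Kang (born August 2001), male, Hui ethnicity, from Nanjing, Jiangsu; undergraduate student; research interest: electronic information.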