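Abstract: The adoption of cloud-native technology has made IT systems large in scale and complex in architecture, posing new challenges for IT operations and maintenance. As IT business system clusters keep growing, traditional alerting fails to detect system anomalies promptly and effectively, massive volumes of logs cannot be analyzed effectively, and business call chains are complex with poor observability, which makes fault demarcation and localization extremely difficult. Using cutting-edge techniques such as big data, AI, and automated orchestration, we developed a business end-to-end solution for intelligent fault discovery, diagnosis, and self-healing. The solution fuses three types of data (metrics, logs, and traces) to discover, diagnose, and heal faults automatically, ushering in an era of minimalist operations and maintenance.
Keywords: business end-to-end; intelligent fault diagnosis; fault self-healing; AIOps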
DOI: 10.19850/j.cnki.2096-4706.2022.24.022
CLC number: TP311 Document code: A Article ID: 2096-4706(2022)24-0085-05
Business End-to-End Fault Intelligent Discovery, Diagnosis and Self-Healing
ZUO Jinhu, CHEN Lihua, XIAO Zhongliang
(China Mobile Information Technology Co., Ltd., Beijing 102200, China)
Abstract: The introduction of cloud-native technology has made IT systems large in scale and complex in architecture, and IT operation and maintenance face new challenges. As IT service system clusters grow ever larger, traditional alarms cannot detect system anomalies in a timely and effective manner, massive logs cannot be analyzed effectively, and service call chains are complex and poorly observable, which makes it extremely difficult to delimit and locate faults. Using cutting-edge technologies such as big data, AI, and automated orchestration, this paper develops a business end-to-end solution for intelligent fault discovery, diagnosis, and self-healing. It integrates three types of data (metrics, logs, and traces) to realize automatic fault discovery, diagnosis, and self-healing, opening an era of minimalist operation and maintenance.
Keywords: business end-to-end; intelligent fault diagnosis; fault self-healing; AIOps
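About the authors: ZUO Jinhu (born 1983), male, Han ethnicity, from Hanchuan, Hubei; application business architect, master's degree; research interests: application system architecture evolution and AIOps. CHEN Lihua (born 1985), male, Han ethnicity, from Shaoyang, Hunan; group-level expert, master's degree; research interest: AIOps operations. XIAO Zhongliang (born 1986), male, Han ethnicity, from Shuozhou, Shanxi; senior engineer, master's degree; research interest: AIOps algorithms.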