Improvement and Practice of Open-Source PaddleOCR Technology in Enterprise Business License Recognition

QIU Jianmin

(China Telecom Corporation Limited Jiangsu Branch, Nanjing 210037, China)

Modern Information Technology (現代信息科技), 2021, No. 9 (posted online 2021-11-04 11:01:06)

DOI: 10.19850/j.cnki.2096-4706.2021.09.018

Abstract: This paper designs an algorithm for recognizing images of enterprise business licenses that automatically extracts key fields such as the unified social credit code and the company name. Building on the open-source PaddleOCR framework, it adds automatic image orientation adjustment, structured text output, and local secondary recognition. These measures address several kinds of poor image quality under which PaddleOCR alone cannot recognize the information accurately, raise overall recognition accuracy to above 90%, and achieve detection within seconds. The result is in production use, where it helps front-desk operators quickly check whether the business license information they entered is correct and improves the efficiency of manual entry.

Keywords: PaddleOCR; image recognition; enterprise business license; AI

CLC number: TP391.4; TP18  Document code: A  Article ID: 2096-4706(2021)09-0065-06

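0  Introduction

Building on the open-source PaddleOCR framework [1], this paper designs an algorithm that automatically extracts key fields such as the unified social credit code and the company name from business license images. A series of improvements allows the key information to be recognized accurately even for the kinds of low-quality images on which PaddleOCR alone fails, with detection completing within seconds. The result is deployed in a production system. The paper is intended for discussion among practitioners working on image recognition.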

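1  Background

Previously, sales clerks registering enterprise business license information in the business system read the license and typed the data in by hand. To make this entry more efficient, an algorithm is needed that automatically extracts key fields such as the unified social credit code and the company name from the license [2]. PaddleOCR is an ultra-lightweight OCR (Optical Character Recognition) system open-sourced by Baidu, consisting of three main parts: DB text detection [3], detection box rectification [4], and CRNN text recognition [5]. PaddleOCR handles both text detection and recognition; Figure 1 shows an image that it recognizes correctly.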
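The listings later in this article index the recognition output as `ocrresult[i][0]` (four corner points) and `ocrresult[i][1][0]` (text). A minimal sketch of splitting such a result into the `coordinates` and `ocrtexts` lists used by those listings, assuming each entry has the layout `[box, (text, score)]`. The helper name `split_ocr_result` and the sample values are illustrative and not part of PaddleOCR:

```python
def split_ocr_result(ocrresult):
    """Split [[box, (text, score)], ...] into two parallel lists.

    Assumes each entry is [four corner points, (text, confidence)],
    which matches how the article's listings index the result.
    """
    coordinates = [[list(pt) for pt in line[0]] for line in ocrresult]
    ocrtexts = [(line[1][0], line[1][1]) for line in ocrresult]
    return coordinates, ocrtexts


# Hand-written sample in the assumed layout (not real OCR output).
sample = [
    [[[100, 20], [300, 20], [300, 60], [100, 60]], ("營業執照", 0.99)],
    [[[40, 90], [200, 90], [200, 110], [40, 110]], ("名稱", 0.95)],
]
coordinates, ocrtexts = split_ocr_result(sample)
```

Copying each point into a fresh list keeps later in-place coordinate updates from modifying the raw OCR result.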

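However, when an image is dark, skewed, unevenly lit, shadowed, or covered by a watermark or seal, PaddleOCR fails to recognize the key information completely and accurately. Figure 2 shows an image that is not recognized correctly. This paper focuses on text recognition under these difficult conditions.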

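2  Objectives

On top of the PaddleOCR source code, add custom algorithms (automatic image orientation adjustment, structured text output, and local secondary recognition) to fix inaccurate recognition under the conditions above, raising the recognition accuracy for key business license fields from 70% to 90%.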

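3  Improvements

3.1  Automatic image orientation adjustment

The original license image may be rotated or skewed, so every image is first rotated upright, which keeps the recognized text lines coherent. The approach relies on the title "營業執照" (business license): these four characters are relatively clear, and PaddleOCR recognizes them in almost all cases. Once the title has been recognized, an added function, rectifyOCRAngle, determines where it lies in the image: in the top third it counts as top, in the bottom third as bottom, in the left third as left, and in the right third as right. The image is then rotated upright according to that result. The rectifyOCRAngle function is: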

    import cv2

    # Title fragments that locate the "營業執照" heading (listed in Section 3.3).
    BUSINESSLINCENSE_RULES = ['營業執照', '營業執', '業執照', '執照', '營業']


    def rectifyOCRAngle(img, coordinates, ocrtexts):
        """Rotate img upright based on where the license title was recognized."""
        angle = 0
        idx = -1
        height, width = img.shape[:2]

        # Find the first recognized line that contains a title fragment.
        for rule in BUSINESSLINCENSE_RULES:
            for j in range(len(ocrtexts)):
                if str(ocrtexts[j][0]).find(rule) > -1:
                    idx = j
                    break
            if idx > -1:
                break
        if idx == -1:
            # Title not found: return the inputs unchanged.
            return idx, img, coordinates, ocrtexts

        # Centre of the title box (points ordered TL, TR, BR, BL).
        title = coordinates[idx]
        x_center = title[0][0] + (title[1][0] - title[0][0]) / 2.0
        y_center = title[0][1] + (title[3][1] - title[0][1]) / 2.0
        transposedImage = cv2.transpose(img)  # not used in the branch shown here
        widthper = width / x_center
        heightper = height / y_center

        if 1.5 < widthper < 2.5:  # title horizontally centred: upright or upside down
            if heightper > 2:  # title in the upper half: already upright
                angle = 0
            else:
                # Title in the lower half: rotate the image by 180 degrees.
                img = cv2.flip(img, -1)
                for box in coordinates:
                    for pt in box:
                        pt[0] = width - 1 - pt[0]
                        pt[1] = height - 1 - pt[1]
                    # Restore the TL, TR, BR, BL point order after the rotation.
                    box[0], box[2] = box[2], box[0]
                    box[1], box[3] = box[3], box[1]
                angle = 180
        else:  # title on the left or right side
            # Omitted in the original (analogous to the branch above).
            pass
        return idx, img, coordinates, ocrtexts

Figure 3 shows an example image before and after adjustment.
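The thirds rule described in this section can be sketched on its own. This is a simplified illustration rather than the author's rectifyOCRAngle: the names `title_region` and `centre_of` are hypothetical, and checking the vertical thirds before the horizontal ones is an assumption of this sketch.

```python
def centre_of(box):
    """Centre of a 4-point box ordered TL, TR, BR, BL."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return sum(xs) / 4.0, sum(ys) / 4.0


def title_region(x_center, y_center, width, height):
    """Classify the title centre into the top/bottom/left/right third.

    Returns 'center' when the point lies in none of the outer thirds.
    Vertical thirds are checked first (an assumption of this sketch).
    """
    if y_center < height / 3:
        return "top"
    if y_center > 2 * height / 3:
        return "bottom"
    if x_center < width / 3:
        return "left"
    if x_center > 2 * width / 3:
        return "right"
    return "center"


# Toy title box near the top of a 1000 x 1400 image.
x_c, y_c = centre_of([[400, 80], [600, 80], [600, 140], [400, 140]])
region = title_region(x_c, y_c, 1000, 1400)
```

A title in the top third means the image is already upright and one in the bottom third means it is upside down. The left and right cases call for a quarter turn whose direction depends on the coordinate convention, which the article does not spell out.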

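3.2  Structured text output

In real license images, the unified social credit code field often carries unrelated extra text such as "(1/1)" or "副本編碼" (duplicate copy number). Following the business-wide convention, regular expressions are used to define a function, UnionCodeSechema, that extracts the standard 18-character code and strips parentheses and other surplus text, so that the unified social credit code is accurate. The UnionCodeSechema function is: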

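    def UnionCodeSechema(sUnionCode, coordinates, iUnionCode):
        """Clean a recognized string down to the 18-character credit code."""
        # Drop a trailing annotation such as "(1/1)". The source prints a
        # full-width parenthesis here; the loop below cuts at that one too.
        i_tmp = sUnionCode.find('(')
        if i_tmp > -1:
            sUnionCode = sUnionCode[0:i_tmp]
        # Cut at the first character that is not 0-9 or A-Z.
        for i in range(len(sUnionCode)):
            if ('0' <= sUnionCode[i] <= '9') or ('A' <= sUnionCode[i] <= 'Z'):
                continue
            # Fixed: the original slice ended at len(sUnionCode), a no-op.
            sUnionCode = sUnionCode[0:i]
            break
        # Keep at most the first 18 characters.
        if len(sUnionCode) >= 18:
            sUnionCode = sUnionCode[0:18]
        return sUnionCode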

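Figure 4 shows an example of a unified social credit code with extraneous text.

Figure 4  Example of a unified social credit code with extraneous text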
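The text of this section mentions regular expressions, while the listing works with character comparisons. A minimal regex-based sketch of the same idea, assuming the code is the first run of 18 digits or upper-case ASCII letters in the recognized string. `extract_union_code` is a hypothetical helper, the sample string is made up, and no check-digit validation is attempted:

```python
import re

# 18 consecutive digits or upper-case ASCII letters.
_CODE_RE = re.compile(r"[0-9A-Z]{18}")


def extract_union_code(text):
    """Return the first 18-character code found in text, or None."""
    match = _CODE_RE.search(text)
    return match.group(0) if match else None


# Made-up recognition output with a label in front and "(1/1)" behind.
code = extract_union_code("統一社會信用代碼91320000MA1ABCDE2X(1/1)")
```

Returning None when nothing matches lets the caller decide whether a second recognition pass is needed, instead of passing on a partial code.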

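3.3  Local secondary recognition

Analysis of the cases that plain PaddleOCR gets wrong shows that 90% of the errors stem from incomplete text: when the original image is dark, skewed, unevenly lit, shadowed, or covered by a watermark or seal, the direct recognition result tends to have missing parts. We therefore propose local secondary recognition. Wherever part of the result is missing, the corresponding image region is cropped out and recognized again on its own, which reduces interference from the rest of the image. The second-pass result is then spliced into the first PaddleOCR result, which greatly improves accuracy. Secondary recognition is needed in the following scenarios: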

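(1) Unified social credit code:

1) The label in front was recognized, but the code after it was missed or is incomplete: use the label's position to crop the region after it and recognize that region again to obtain the code.

2) The code was recognized, but the label in front was missed or is incomplete: use the code's position to crop the region before it and recognize that region again to obtain the label.

3) Both the label and the code were missed or are incomplete: estimate the position and extent of the unified social credit code from the positions of the license title and the company name, then crop that region and recognize it again to obtain the code.

(2) Company name:

1) The label in front was recognized, but the company name after it was missed or is incomplete: use the label's position to crop the region after it and recognize that region again to obtain the company name.

2) The company name was recognized, but the label in front was missed or is incomplete: use the company name's position to crop the region before it and recognize that region again to obtain the label.

3) Both the label and the company name were missed or are incomplete: estimate the position and extent of the company name from the positions of the license title and the unified social credit code, then crop that region and recognize it again to obtain the company name.

Here, the license title counts as located when the recognized text contains any of ['營業執照', '營業執', '業執照', '執照', '營業']; the unified social credit code label when it contains any of ['統一社會信用代碼', '統一社會信用代', '統一社會信用', '統一社會信', '統一社會', '統一社', '社會信用代碼', '信用代碼', '代碼']; and the company name label when it contains any of ['名稱', '名', '稱'].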
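Scenario 1) for the unified social credit code can be sketched as two small steps: find the label with the fragment list, then compute the strip to its right that will be recognized again. The helper names `find_anchor` and `strip_right_of` and the default margin are hypothetical. The strip width (half the image width plus a margin) follows the crop used in the TextSechema listing.

```python
UNIONCODE_RULES = ["統一社會信用代碼", "統一社會信用代", "統一社會信用", "統一社會信",
                   "統一社會", "統一社", "社會信用代碼", "信用代碼", "代碼"]


def find_anchor(texts, rules):
    """Index of the first text containing a rule, trying rules in list order."""
    for rule in rules:
        for j, text in enumerate(texts):
            if rule in text:
                return j
    return -1


def strip_right_of(box, width, height, extra=20):
    """Bounds (x0, y0, x1, y1) of the strip that starts at a label box.

    box holds four [x, y] points ordered TL, TR, BR, BL. The strip keeps
    the label's row and extends half the image width plus `extra` pixels
    to the right, clamped to the image.
    """
    x0 = max(int(box[0][0]), 0)
    y0 = max(int(box[0][1]), 0)
    y1 = min(int(box[3][1]), height)
    x1 = min(x0 + width // 2 + extra, width)
    return x0, y0, x1, y1


# Toy first-pass output: the label is truncated and the code is missing.
texts = ["營業執照", "統一社會信用代", "名稱"]
j = find_anchor(texts, UNIONCODE_RULES)
bounds = strip_right_of([[10, 20], [60, 20], [60, 40], [10, 40]], 200, 100)
```

With a NumPy image, the crop is then `img[y0:y1, x0:x1]`. Trying longer fragments first means a complete label is preferred over a short fragment such as '代碼' that might also occur elsewhere on the license.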

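Two custom functions are defined: secOCRRecog performs the secondary recognition, and TextSechema decides when secondary recognition is needed and calls secOCRRecog. Their code is: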

    import re
    import cv2

    # Method of the recognizer class. TEMPPIC_PATH and self.ocrModel are defined
    # elsewhere in the original project and are not shown in the article.
    def secOCRRecog(self, img, type):
        cv2.imwrite(TEMPPIC_PATH, img)
        if type == '3':
            ocrresult = self.ocrModel.ocr(TEMPPIC_PATH, det=True, rec=True, cls=True)
            height, width = img.shape[:2]
            for i in range(len(ocrresult)):
                text = ocrresult[i][1][0]
                # A run of 8 digits marks the line that holds the code.
                if len(text) > 0 and re.search(r'\d{8}', text):
                    x = int(ocrresult[i][0][0][0])
                    y = int(ocrresult[i][0][0][1])
                    y_dis = int(ocrresult[i][0][3][1] - ocrresult[i][0][0][1])
                    # Re-crop the left or right half of that row, depending on
                    # where the box starts, and recognize it again.
                    if x > (width // 2):
                        cropped = img[y:(y + y_dis), int(width / 2):(width - 20), :]
                    else:
                        cropped = img[y:(y + y_dis), 20:int(width / 2), :]
                    return self.secOCRRecog(cropped, '1')
            # No digit run found: retry on three overlapping horizontal bands.
            cropped = img[(height // 4):(height - height // 4), 30:(width - 30), :]
            ret_tmp = self.secOCRRecog(cropped, '1')
            if ret_tmp:
                return ret_tmp
            cropped = img[(height // 6):(height - height // 2), 30:(width - 30), :]
            ret_tmp = self.secOCRRecog(cropped, '1')
            if ret_tmp:
                return ret_tmp
            cropped = img[(height // 2):(height - height // 6), 30:(width - 30), :]
            return self.secOCRRecog(cropped, '1')
        elif type == '1' or type == '2':
            # Omitted in the original (analogous to the type == '3' branch above).
            pass

    import re

    # Label fragments that locate the credit code field (listed above).
    UNIONCODE_RULES = ['統一社會信用代碼', '統一社會信用代', '統一社會信用', '統一社會信',
                       '統一社會', '統一社', '社會信用代碼', '信用代碼', '代碼']

    # Method of the same class as secOCRRecog. The indentation is reconstructed,
    # and three conditions are truncated in the source; they are marked below.
    def TextSechema(self, img, coordinates, ocrTexts, icode):
        ls_ret = []
        b_find = False
        i_name = -1  # added: the original reads i_name without initializing it
        height, width = img.shape[:2]

        # Pass 1: unified social credit code.
        for rule in UNIONCODE_RULES:
            for j in range(len(ocrTexts)):
                if ocrTexts[j][0].find(rule) > -1:
                    if len(ocrTexts[j][0]) > 10:
                        # Label and code share a line: keep the text from the first digit.
                        s_search = re.search(r'\d', ocrTexts[j][0])
                        if s_search:  # added guard: the original assumes a digit exists
                            i_tag = s_search.start()
                            ls_ret.append(['統一社會信用代碼',
                                           UnionCodeSechema(ocrTexts[j][0][i_tag:], coordinates, j)])
                            b_find = True
                    else:
                        pass  # omitted in the original
                    if b_find == False:
                        # Only the label was found: re-recognize the strip to its right.
                        x = int(coordinates[j][0][0])
                        y = int(coordinates[j][0][1])
                        y_dis = int(coordinates[j][3][1] - coordinates[j][0][1])
                        cropped = img[y:(y + y_dis), x:(x + int(width / 2) + 20), :]
                        ret_tmp = self.secOCRRecog(cropped, '1')
                        if ret_tmp:
                            ls_ret.append(['統一社會信用代碼',
                                           UnionCodeSechema(ret_tmp[1], coordinates, j)])
                            b_find = True

        # Pass 2: company name.
        for j in range(len(ocrTexts)):
            text = ocrTexts[j][0]
            # [Source truncated after "(j + 1": the rest of this condition is lost.]
            if len(text) == 2 and text == '名稱':
                ls_ret.append(['名稱', ocrTexts[j + 1][0]])
                i_name = j + 1
                break
            elif len(text) == 1 and text == '稱' and len(ocrTexts[j + 1][0]) > 2:
                ls_ret.append(['名稱', ocrTexts[j + 1][0]])
                i_name = j + 1
                break
            elif len(text) > 1 and text[0] == '稱':
                ls_ret.append(['名稱', text[1:]])
                i_name = j
                break
            elif len(text) > 2 and text[0] == '名' and text[1] == '稱':
                ls_ret.append(['名稱', text[2:]])
                i_name = j
                break
            elif text.find('名稱') > -1 and len(text) > 2:
                ls_ret.append(['名稱', text])
                i_name = j
                break
            # [Source truncated between "(j + 2)" and the final test below.]
            elif text == '名' and ocrTexts[j + 1][0][0] != '稱':
                ls_ret.append(['名稱', ocrTexts[j + 1][0]])
                i_name = j + 1
                break
            # [Source truncated after "(j + 1)": the rest of this condition is lost.]
            elif text == '名':
                i_name = j
                x = int(coordinates[j][0][0])
                y = int(coordinates[j][0][1])
                y_dis = int(coordinates[j][3][1] - coordinates[j][0][1])
                cropped = img[y:(y + y_dis), x:(x + int(width / 2) + 150), :]
                ret_tmp = self.secOCRRecog(cropped, '2')
                if ret_tmp:
                    ls_ret.append(ret_tmp)
                break

        # Pass 3: code still missing but the name was located. Crop the band between
        # the license title (index icode) and the name line, then re-recognize it.
        if (b_find == False) and (i_name > -1):
            x = 20
            y = int(coordinates[icode][3][1] + 3)
            y_ = int(coordinates[i_name][0][1] - 6)
            if y < y_:
                cropped = img[y:y_, x:(width - 20), :]
            else:
                cropped = img[y_:y, x:(width - 20), :]
            ret_tmp = self.secOCRRecog(cropped, '3')
            if ret_tmp:
                ls_ret.append(['統一社會信用代碼',
                               UnionCodeSechema(ret_tmp[1], coordinates, icode)])
        return ls_ret

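For Figure 2 in the earlier section, PaddleOCR's result is incomplete; with local secondary recognition added, the complete unified social credit code is recovered. Figure 5 shows the comparison.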
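The overall control flow (a second pass only where the first pass left a gap, then splicing the result back) can be sketched independently of any OCR engine. Everything here is illustrative: `fill_missing`, the field labels, and the stub `fake_ocr` are hypothetical stand-ins, and the real functions work on image crops rather than strings.

```python
def fill_missing(fields, regions, recognise):
    """Re-run recognition only for fields the first pass left empty.

    fields:    dict label -> text from the first pass ('' or None if missing)
    regions:   dict label -> cropped region for that field
    recognise: callable(region) -> text or None (stands in for the OCR call)
    Returns the merged fields and the number of second-pass calls made.
    """
    merged = dict(fields)
    calls = 0
    for label, text in fields.items():
        if text:
            continue  # the first pass already succeeded
        if label not in regions:
            continue  # no region could be estimated for this field
        calls += 1
        result = recognise(regions[label])
        if result:
            merged[label] = result
    return merged, calls


def fake_ocr(region):
    """Stub recognizer used only for this example."""
    return "91320000MA1ABCDE2X" if region == "crop-of-code-row" else None


first_pass = {"code": "", "name": "Example Co., Ltd."}
regions = {"code": "crop-of-code-row", "name": "crop-of-name-row"}
merged, calls = fill_missing(first_pass, regions, fake_ocr)
```

Only the missing field triggers a second call, which is why the extra cost stays small when the first pass already succeeds.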

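4  Results and application

The complete code was run on 773 sample images, and the final recognition accuracy exceeded 90%. The remaining failures are mostly caused by rare characters in the text, which the text recognizer gets wrong. Figure 6 shows a failed case.

Thanks to PaddleOCR's lightweight model deployment, images are recognized quickly, and the added code runs a second pass only on the parts that need it. In practical tests, recognizing a single business license takes about 1 to 3 seconds on a CPU machine and about 0.5 to 1 second on a GPU machine. Overall accuracy and speed both meet expectations. The model has therefore been deployed to the production system, where it automatically recognizes business license information as sales staff enter it at the front desk, improving manual entry efficiency.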
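The article does not state how an image is scored as correct. As one possible reading, the sketch below counts an image as correct only when every extracted field matches its label exactly. The function name and the toy data are hypothetical. Under that reading, exceeding 90% on 773 images requires at least 696 correct images.

```python
def exact_match_accuracy(predicted, expected):
    """Share of samples whose predicted fields all equal the expected ones."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Toy data: (code, name) pairs for four images; the third has a wrong name.
expected = [("C1", "Alpha"), ("C2", "Beta"), ("C3", "Gamma"), ("C4", "Delta")]
predicted = [("C1", "Alpha"), ("C2", "Beta"), ("C3", "Gama"), ("C4", "Delta")]
accuracy = exact_match_accuracy(predicted, expected)
```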

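5  Conclusion

Building on the open-source PaddleOCR framework, this paper improves and extends the code to solve PaddleOCR's inability to recognize business licenses accurately in difficult conditions. The model has been deployed to a production system, where it helps front-desk sales staff enter key business license information automatically and brings good economic benefit. Future work can continue to build on the PaddleOCR framework and add fine-tuning on rare characters to raise recognition accuracy further.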

References:

[1] DU Y N, LI C X, GUO R Y, et al. PP-OCR: A Practical Ultra Lightweight OCR System [J/OL]. arXiv:2009.09941 [cs.CV]. (2020-09-21). https://arxiv.org/abs/2009.09941v3.

[2] SHAO Huimin. Research on automatic business license recognition technology [D]. Urumqi: Xinjiang Agricultural University, 2020.

[3] LIAO M H, WAN Z Y, YAO C, et al. Real-Time Scene Text Detection with Differentiable Binarization [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11474-11481.

[4] YU D L, LI X, ZHANG C Q, et al. Towards Accurate Scene Text Recognition With Semantic Reasoning Networks [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020.

[5] LI W, CAO L B, ZHAO D Z, et al. CRNN: Integrating classification rules into neural network [C]//The 2013 International Joint Conference on Neural Networks (IJCNN). Dallas: IEEE, 2013.

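About the author: QIU Jianmin (born June 1988), male, Han ethnicity, from Yangzhou, Jiangsu; intermediate engineer, bachelor's degree. Research interests: IT system construction and operations, big data platforms, data warehouses, and AI development and applications (text classification, image recognition).

Received: 2021-04-06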
