
Deliverables

  1. Formatting the output: saving the ticker text to a .srt file and the other (non-ticker) text to a text file
  2. Improving OCR accuracy using confidence
  3. Eliminating garbage text
  4. A prototype demo

The .srt format

An SRT file is a subtitle file saved in the SubRip file format. Each caption entry in an SRT file consists of four parts:

  1. The number of the caption frame in sequence
  2. Beginning and end timecodes for when the caption frame should appear
  3. The caption itself
  4. A blank line to indicate the start of a new caption sequence

Example:
hours:minutes:seconds,milliseconds --> hours:minutes:seconds,milliseconds
00:04:12,40 --> 00:05:11,30
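
The four parts listed above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function names (format_timestamp, format_cue, build_srt) and the choice of integer milliseconds as input are assumptions made for the example. It zero-pads every field, which is the usual SubRip convention (two digits for hours, minutes and seconds, three for milliseconds).

```python
def format_timestamp(total_ms: int) -> str:
    """Convert a time in milliseconds to an SRT timecode, HH:MM:SS,mmm."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def format_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one caption entry: sequence number, timecodes, caption text."""
    timing = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
    return f"{index}\n{timing}\n{text}\n"


def build_srt(cues) -> str:
    """Join (start_ms, end_ms, text) tuples into SRT text.

    Entries are numbered from 1 and separated by a blank line.
    """
    entries = (
        format_cue(i, start, end, text)
        for i, (start, end, text) in enumerate(cues, start=1)
    )
    return "\n".join(entries)
```

The returned string could then be written to a .srt file with an ordinary open(path, "w", encoding="utf-8") call, so that Hindi and Bengali captions are stored correctly.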

Using confidence

In my previous implementation, several images produced garbage output. I also discovered that some images gave better output when I inverted their colours and removed the noise, while others did not. This was because some images had white text on a dark background and others had dark text on a light background. Hence, writing general code that would work well for all images was proving quite troublesome. At my mentors' suggestion, I decided to use the confidence returned by Tesseract 4.0 to decide which of the two versions works better. The results were as follows:

Without color inversion: “IT HELPS TO IMPROVE PHYSICAL”
Confidence: 94.8

With color inversion: “leh ad he |”
Confidence: 26.75
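
The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the helper names (mean_confidence, pick_best) are hypothetical, and the OCR call itself is left out. It assumes each variant of the image has already been run through Tesseract and reduced to a (text, confidence) pair. If the per-word confidences come from something like pytesseract's image_to_data output, rows that contain no recognized word carry a confidence of -1, so the averaging helper skips negative values.

```python
def mean_confidence(confs) -> float:
    """Average the word confidences, ignoring negative placeholder values."""
    valid = [float(c) for c in confs if float(c) >= 0]
    return sum(valid) / len(valid) if valid else 0.0


def pick_best(candidates):
    """Return (name, text, confidence) of the highest-confidence candidate.

    candidates maps a variant name to a (text, confidence) pair.
    """
    name = max(candidates, key=lambda key: candidates[key][1])
    text, confidence = candidates[name]
    return name, text, confidence


# The two results reported above for the same image.
results = {
    "original": ("IT HELPS TO IMPROVE PHYSICAL", 94.8),
    "inverted": ("leh ad he |", 26.75),
}
best_name, best_text, best_conf = pick_best(results)
```

Here the non-inverted variant wins, so its text is the one that would be kept.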

Using confidence to detect garbage recognitions, however, was a total failure: almost all of the Hindi text was recognized with a low confidence, ranging between 10% and 55%, in spite of being accurately recognized.
Example:

Output: स्वस्थ जीवनशैली अपनाएं
Confidence: 33.5
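
A small check makes the problem concrete. A single confidence cutoff can only separate garbage from good text if every garbage score lies below every accurate score. The sketch below uses only numbers reported in this post. The function name cutoff_exists is hypothetical, and the values 10.0 and 55.0 are simply the endpoints of the reported Hindi range, not individual measurements.

```python
def cutoff_exists(garbage_confidences, accurate_confidences) -> bool:
    """True if one threshold rejects all garbage and keeps all accurate text."""
    return max(garbage_confidences) < min(accurate_confidences)


# English sample: garbage output (26.75) versus accurate output (94.8).
english_separable = cutoff_exists([26.75], [94.8])

# Hindi: accurate output scored anywhere from 10 to 55 (for example 33.5),
# which overlaps the score of the garbage output above.
hindi_separable = cutoff_exists([26.75], [10.0, 33.5, 55.0])
```

For the English sample a cutoff exists, but once the accurate Hindi confidences are included, the garbage score of 26.75 falls inside their range and no single threshold works.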

Demo

1
0.0:0.0:0.0,40.0 –> 0.0:0.0:2.0,240.0
SUTH GAMES 2019’ ON DD SPORTS (TERRESTRIAL

2
0.0:0.0:4.0,440.0 –> 0.0:0.0:6.0,640.0
1D SPORTS (TERRESTRIAL & DTH NETWORK OF DC

3
0.0:0.0:8.0,840.0 –> 0.0:0.0:10.0,1040.0
K OF DOORDARSHAN} FROM OSTH TO 20TH JANL

4
0.0:0.0:13.0,240.0 –> 0.0:0.0:15.0,440.0
“H JANUARY 6 ADMISSIONS 2019 OPEN FOR F

..
..
..

57
0.0:4.0:6.0,440.0 –> 0.0:4.0:8.0,640.0
शनमत्री अगस्त 1947 में जो चक हई थी, करतारपर कॉरिडोर

58
0.0:4.0:10.0,840.0 –> 0.0:4.0:12.0,1040.0
‘कॉरिडोर उसका प्रायश्चित है… ४ प्रधानमंत्री: 1984 के सि

59
0.0:4.0:15.0,240.0 –> 0.0:4.0:17.0,440.0
84 के सिख विरोधी दंगों के पीड़ितों को न्याय दिलाने में जुटी है

60
0.0:4.0:19.0,640.0 –> 0.0:4.0:21.0,840.0
[में जुटी है सरकार. क प्रधानमंत्री नरेंद्र मोदी ने गरु गोबिंद

61
0.0:4.0:24.0,40.0 –> 0.0:4.0:26.0,240.0
रु गोथिद सिह जी की 357वीं जयत्ती के मौके पर 350 रुपये क

For the complete demo, click here.


Poulami Sarkar



Red Hen Lab Bengali and Hindi OCR
