read

Deliverables

  1. Removing redundancies in OCR
  2. Getting started with Slurm jobs

Removing redundancies in OCR

One of the major problems faced while detecting a continuous ticker was the ensuring continuity and removal of incomplete words at the edges of the detected ticker.

By setting the frame_skip (i.e the number of frames skipped before detecting next piece of ticker) value as 110 frames, and eliminating the first and the last words of each piece of detected text, a reasonable amount of continuity can be ensured. The value of frame_skip is channel specific and depends on the speed of the ticker for a particular channel. Since, the news videos I worked on are from DD News, I was able to hardcode this value.
Example:
Before eliminating the first and last words

20190107053014.238|20190107053016.441|TIC2|000450|000 454 501 034|fi DDNEWS4SANSKRIT@GMAIL.COM इति सकते प्र 
20190107053018.641|20190107053020.840|TIC2|000560|031 455 528 029|ति सड्धुते प्रेषयन्त, प्रतियोगिताया परिणाम 12-01-2019 तम-
20190107053023.039|20190107053025.238|TIC2|000670|011 456 548 028|019 तम-दिनाकस्य 'वार्तावली' कार्यक्रमे घोषयिष्यते।  थ

Final output after eliminating the first and last words

20190107053014.238|20190107053016.441|TIC2|000450|000 454 501 034|DDNEWS4SANSKRIT@GMAIL.COM इति सकते
20190107053018.641|20190107053020.840|TIC2|000560|031 455 528 029|सड्धुते प्रेषयन्त, प्रतियोगिताया परिणाम 12-01-2019
20190107053023.039|20190107053025.238|TIC2|000670|011 456 548 028|तम-दिनाकस्य 'वार्तावली' कार्यक्रमे घोषयिष्यते।

Getting started with Slurm jobs

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for queued researcher defined jobs.

Allocating resources to the script

Allocating 5 CPUs and 12GB cpu-memory:
#SBATCH -c 5
#SBATCH --mem-per-cpu=12G
Allocating GPU:
#SBATCH --gres=gpu:k40:2

Final script:
#!/bin/bash
#SBATCH -N 5
#SBATCH -c 5
#SBATCH --gres=gpu:k40:2
#SBATCH --mem-per-cpu=12G
#SBATCH --time=0-24:00:00     # 30 minutes
#SBATCH --output=my.stdout
#SBATCH --mail-user=poulamirulz@gmail.com
#SBATCH --mail-type=ALL
#SBATCH --job-name="just_a_test"

#Put commands for executing job below this line
#This example is loading the default Python module and then
#writing out the version of Python

module load singularity/2.5.1
for month  in /mnt/rds/redhen/gallina/tv/2019/*; do
        for day in $month'/'*;do
                for f in $(ls -R $day | grep DD-News.*.mp4);do
                        date +"%T"
                        [ ! -f outputs/${f/mp4/ocr} ]  && echo $f $day'/' && singularity exec -B /mnt/rds/redhen/gallina/tv:/tv, pwd:/mnt ben-hin-ocr.img python3 /mnt/textdetection.py $f $day'/' 'hin+eng'
                        date +"%T"
                done
        done
done
echo "running OCR"

Job Statistics

[pxs656@hpc3 Bengali-Hindi-OCR]$ scontrol show jobid -dd 13732066
JobId=13732066 JobName=just_a_test
   UserId=pxs656(1061339) GroupId=mbt8(10183) MCS_label=N/A
   Priority=1013 Nice=0 Account=mbt8 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=02:36:22 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-13T08:54:59 EligibleTime=2019-08-13T08:54:59
   StartTime=2019-08-13T08:55:00 EndTime=2019-08-14T08:55:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=hpc3:83019
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compt[167-171]
   BatchHost=compt167
   NumNodes=5 NumCPUs=25 NumTasks=5 CPUs/Task=5 ReqB:S:C:T=0:0:*:*
   TRES=cpu=25,mem=300G,node=5
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=compt[167-171] CPU_IDs=0-4 Mem=61440 GRES_IDX=
   MinCPUsNode=5 MinMemoryCPU=12G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/rds/redhen/gallina/home/pxs656/Github/Bengali-Hindi-OCR/test.slurm
   WorkDir=/mnt/rds/redhen/gallina/home/pxs656/Github/Bengali-Hindi-OCR
   StdErr=/mnt/rds/redhen/gallina/home/pxs656/Github/Bengali-Hindi-OCR/my.stdout
   StdIn=/dev/null
   StdOut=/mnt/rds/redhen/gallina/home/pxs656/Github/Bengali-Hindi-OCR/my.stdout
Blog Logo

Poulami Sarkar


Published

Image

blog

Red Hen Lab bengali and Hindi OCR

Back to Overview