OCR with tesserocr / pytesseract does not work as expected

kassandra
7 years ago | edited 7 years ago

0

i’m using python 3.6, PIL and Selenium, and can run either of the python modules tesserocr or pytesseract. i have a script that loads the level with selenium, takes a screenshot, crops the captcha, rescales it to a bigger size (the google ocr docs said something about a minimum x-height of about 20 pixels) using a bipolar filter, and passes the result to tesserocr or pytesseract. the problem is that the recognition is not good enough. example:

gives “*yAD?gUnxnakER-TGQBPAS ! XvJHCUKIelYeXFW” as the solution string. suspecting the image quality, i’m playing around with filters and will try grayscale next. i’m not sure about posting the code as it could be considered a spoiler, but as it’s not working i hope it’s ok for review:

edit: there seems to be a problem with posting python code, e.g. “img.size[‘width’]” resolves to img.size[‘width’] when underscores are left out
edit2: fixed code formatting
edit3: removed script as image processing turned out to be the problem, not the script
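
Since the image processing turned out to be the weak spot, here is a minimal, generic sketch of two preprocessing steps of the kind discussed above: binarizing the image and enlarging it by an integer factor. It is not the removed script and not tied to any particular captcha. It assumes the pixels are already available as rows of 0-255 grayscale values (for example, read out of a PIL image); the function names, the threshold of 128 and the nearest-neighbour scaling are all assumptions made for illustration.

```python
# Sketch only: pixels are assumed to be a list of rows, each a list of
# 0-255 grayscale values. All names and defaults here are hypothetical.

def binarize(pixels, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


def upscale(pixels, factor):
    """Enlarge the image by an integer factor using nearest-neighbour copying."""
    scaled = []
    for row in pixels:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide_row = [value for value in row for _ in range(factor)]
        for _ in range(factor):
            scaled.append(list(wide_row))
    return scaled


def preprocess(pixels, factor=3, threshold=128):
    """Binarize first, then upscale, so no intermediate gray values appear."""
    return upscale(binarize(pixels, threshold), factor)


if __name__ == "__main__":
    tiny = [[30, 200], [220, 90]]
    for row in preprocess(tiny, factor=2):
        print(row)
```

Whether thresholding before or after scaling works better depends on the image and on the resampling filter, so this ordering is just one option to try.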

import re
def is_prime(num):
    return re.match(r"^1?$|^(11+?)\1+$", '1'*num) is None

dloser
7 years ago

0

Yeah, I’d say don’t post the code. There might be only a small issue, which would make this a big spoiler. Please remove it.

And indeed, there is a bug/feature where [x] is interpreted as a tag inside code blocks. You can circumvent it with spaces or dummy tags (e.g. [x[i][/i]]).

(Can’t really help you as I’ve never used tesseract; perhaps try coding your own as well… ;))

kassandra
7 years ago

0

[quote=dloser]perhaps try coding your own as well[/quote]

it is my own code. i sat on this for about a week as i’m new to python. i don’t like how you suggest that i came here with some random script for people to fix.


dloser
7 years ago

0

That’s not exactly what I meant. I was talking about the actual character-recognition part.

kassandra
7 years ago | edited 7 years ago

0

i’m not sure there is a point in trying to improve the pre-made ocr training data for english by hand, but if you have a specific suggestion i’m all ears. the only options (as far as i understand) are processing the image before passing it to the ocr wrappers pytesseract / tesserocr, which work on top of tesseract; or playing around with the parameters of tesseract directly (which i have not tried yet).


dloser
7 years ago

1

Still not sure we are on the same page. As said, I don’t know anything about tesseract; I’ve implemented my own OCR for these challenges (i.e. without using existing libraries). For the first few levels this is relatively straightforward, of course still depending on your level of experience.

As for training data, I can imagine that the engine doesn’t handle all fonts equally well, so there might be some use to it.

Would be nice if you removed that code, though. :/
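
For readers wondering what a hand-rolled recognizer of the kind mentioned in this post can look like, here is a minimal template-matching sketch. It is not dloser's implementation and not a solution to any level. It assumes a clean binary image given as strings of `#` (ink) and `.` (background), glyphs separated by at least one empty column, and a tiny made-up 3x3 font; `TEMPLATES` and all function names are hypothetical.

```python
# Sketch only: a toy 3x3 "font" invented for this example.
TEMPLATES = {
    "L": ("#..", "#..", "###"),
    "T": ("###", ".#.", ".#."),
    "O": ("###", "#.#", "###"),
}


def split_glyphs(rows):
    """Cut a binary image into glyphs at columns that contain no ink."""
    width = len(rows[0])
    inked = [any(row[x] == "#" for row in rows) for x in range(width)]
    glyphs, start = [], None
    # The trailing False closes a glyph that touches the right edge.
    for x, has_ink in enumerate(inked + [False]):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append(tuple(row[start:x] for row in rows))
            start = None
    return glyphs


def score(glyph, template):
    """Count the cells where glyph and template agree (same size assumed)."""
    return sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )


def classify(glyph):
    """Return the template character that agrees with the glyph in most cells."""
    return max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))


def read_text(rows):
    """Segment the image and classify each glyph in left-to-right order."""
    return "".join(classify(glyph) for glyph in split_glyphs(rows))


if __name__ == "__main__":
    image = [
        "#...###.###",
        "#...#.#..#.",
        "###.###..#.",
    ]
    print(read_text(image))
```

A real captcha would need templates cut from the actual font, plus handling for glyphs of different widths, touching characters and noise, which this sketch deliberately ignores.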

kassandra
7 years ago | edited 7 years ago

0

reimplementing OCR feels like reinventing the wheel, and as you state it will be useless later. i see that as a kind of last-resort option. thanks for the font suggestion. the code is not working, but i’ll put it in spoilers. imho there is no use in asking where i’ve gone wrong if there is nothing to look at and we can only talk about it in the abstract.


dloser
7 years ago

1

It is reinventing the wheel, but it’s far from useless unless you’ve already done it a few times. It’s a good learning experience.

Thanks for putting a spoiler tag around it. In cases where you need to discuss spoiler-territory details, it’s usually best to ask if someone can help and then take the discussion to private messages.

kassandra
7 years ago | edited 7 years ago

0

thanks for the advice, i have some things to try out for now.

[quote=dloser]It’s a good learning experience[/quote]

i agree, but there are several approaches to learning stuff. i often find myself learning better top down instead of bottom up, because i’m quite prone to adopting bad patterns in new areas. still, it’s a very valid point you have there.


kassandra
7 years ago

0

i compared the results of the script with the results of actual ocr programs (without their filters) on the same captchas, and they give the same results. tampering with fonts was a step in the right direction (thx dloser!) and i hope some more image editing will do the trick. i removed the script as it does not seem to be the problem.
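
One way to make this kind of comparison less of a judgment call is to score each OCR output against a string transcribed by hand. The sketch below uses `difflib.SequenceMatcher` from the standard library; the helper names and the sample strings are made up for illustration and are not real captcha data.

```python
from difflib import SequenceMatcher


def similarity(expected, recognized):
    """Return a 0.0-1.0 similarity ratio between two strings."""
    return SequenceMatcher(None, expected, recognized).ratio()


def rank_outputs(expected, outputs):
    """Sort variant names from best to worst match.

    `outputs` maps a hypothetical variant name (e.g. a filter setting)
    to the string the OCR engine returned for it.
    """
    return sorted(
        outputs,
        key=lambda name: similarity(expected, outputs[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up strings, only to show the call pattern.
    candidates = {"raw": "H3LL0", "thresholded": "HELLO"}
    print(rank_outputs("HELLO", candidates))
```

This only tells you which preprocessing variant gets closest on samples you have already solved by hand; it says nothing about why a variant fails.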
