Handling OCR & CAT-based transcription errors
Thread poster: Marc Brunet
Australia
Japanese to English
Oct 23, 2013

Beware of text that CAT and OCR tools extract from PDF documents, even when the source text is accessible in the PDF as binary fonts (actual characters), not as a text image.
If your transcribed CAT working copy contains terms that make no sense in the context you are dealing with, consider these possibilities:
1/ a term delivered as two characters may in fact be a single one:
目良 --> 眼
2/ if, in addition, the character was displayed at a small size, the
recognition engine may also have made an approximation error based on the
character's shape:
目畏 --> 目良 --> 眼
3/ otherwise, consider a possible typo:
スァロイド --> ステロイッド
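
Cases 1/ and 2/ can also be worked through mechanically. Below is a minimal Python sketch, assuming two small hand-made lookup tables (`JOINED` and `CONFUSABLE`, both hypothetical and filled only with the examples above). It does not show how any CAT or OCR tool works internally; it only lists candidate readings for a puzzling term.

```python
# Hypothetical lookup tables, filled only with the examples from this post.
# JOINED maps a pair of characters to the single character it may stand for.
JOINED = {"目良": "眼"}
# CONFUSABLE maps a character to look-alikes it may have been mistaken for.
CONFUSABLE = {"畏": ["良"]}


def candidate_readings(term):
    """Return possible intended readings of a suspicious term."""
    # Step 1: the term as delivered, plus variants with one look-alike swapped in.
    variants = [term]
    for i, ch in enumerate(term):
        for alt in CONFUSABLE.get(ch, []):
            variants.append(term[:i] + alt + term[i + 1:])

    # Step 2: in each variant, try merging adjacent pairs into one character.
    candidates = []
    for v in variants:
        for i in range(len(v) - 1):
            pair = v[i:i + 2]
            if pair in JOINED:
                merged = v[:i] + JOINED[pair] + v[i + 2:]
                if merged not in candidates:
                    candidates.append(merged)
    return candidates


print(candidate_readings("目良"))  # case 1/ -> ['眼']
print(candidate_readings("目畏"))  # case 2/ -> ['眼']
```

The tables would have to be extended by hand with whatever pairs you meet in your own projects; an empty result simply means the term matches none of the known patterns.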

A proofing rule to keep in mind:
How can you confirm your hunch? It is simple, provided your CAT conversion utility 'read' those terms from a PDF document whose text can be accessed as binary fonts.
If so, copy the mysterious term and paste it into the source PDF file's search window to look for an instance of it. You may be surprised by the response you get:

a) In all 3 of the above cases, which I encountered myself, the PDF search
engine did not return "not found";
b) In each case, the PDF search engine landed on one instance of the source
term queried, confirming each in its original form: 眼 and ステロイッド.
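
A similar check can be run outside the PDF viewer if the source text has already been extracted page by page. The following is a minimal sketch, assuming `pages` is a plain list of page strings obtained by some other means (the extraction step is not shown, and the sample text is made up for illustration). It is a plain substring search, so unlike the viewer search described above it will not match a misread term to its original; it only reports which of the two spellings the text contains.

```python
def pages_containing(pages, term):
    """Return the 1-based page numbers whose text contains the term."""
    return [n for n, text in enumerate(pages, start=1) if term in text]


def check_hunch(pages, suspect, hunch):
    """Compare where the delivered term and the guessed term occur."""
    return {
        "suspect_pages": pages_containing(pages, suspect),
        "hunch_pages": pages_containing(pages, hunch),
    }


# Made-up stand-in for the text of a two-page source document.
pages = ["眼", "ステロイッド"]
result = check_hunch(pages, suspect="目良", hunch="眼")
print(result)  # {'suspect_pages': [], 'hunch_pages': [1]}
```

If the suspect term appears nowhere and the hunch does, the hunch is at least consistent with the source.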

If your original copy was not a text image, that's both a bonus and an
invaluable safeguard, too good to pass up. Try it, and enjoy a worry- and
delay-free translation session!

PS: Now, for a bit of discussion fun, if you are interested:
What might be happening in the problematic text conversion above and in
the surprising retrieval process?

Q1- Can we assume that the format converter 'misread' the original?
A1- 'Misreads'? I am not sure, but it certainly 'misinterprets', as far as
1/ and 2/ above are concerned.

Q2- What does the converter pick up: the character shapes displayed or
their underlying codes?
A2- Probably both, as a paired set, if that option is available. This way,
the character codes that the format converter picks up are always retained
correctly as the reference. On the receiving display side, however, the
character shapes eventually settled on and tentatively displayed for
possible editing take precedence, based on analogue matching of the
perceived shapes against those stored on the receiving side.

A3- Since the misinterpreted character shape(s) were saved paired with the
intact original character code(s) picked up, pasting that shape into a
search still refers to the correct code(s) of the original PDF file, and
the term can be retrieved unaltered, as displayed in the source file.

A4- If the assumptions in A2 are wrong, then the codes picked up
occasionally take on different expressions in the receiving character
register, but still retain their original correspondences when reused in
the source character register.
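
One way to test these theories is to look at the character codes directly rather than at the shapes. Here is a minimal sketch using only built-in Python; it lists the Unicode code points of a term, so that the string in the CAT working copy can be compared with the string copied from the source PDF.

```python
def code_points(term):
    """Return the Unicode code point of each character, e.g. 'U+0041' for 'A'."""
    return [f"U+{ord(ch):04X}" for ch in term]


def same_codes(a, b):
    """True only if both strings are the same sequence of character codes."""
    return code_points(a) == code_points(b)


# Compare the term as delivered with the term as it should read.
for term in ("目良", "眼"):
    print(term, code_points(term))

print(same_codes("目良", "眼"))  # False: two codes versus one
```

If the two lists differ, the codes themselves changed somewhere along the way, not just the displayed shapes.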

Penny for your own theories, dear colleagues?
Cheers and enjoy your work