OCR and preservation of accented characters

DenisCad · March 4, 2024, 2:56pm

Hello, I know that your application is not yet perfectly adapted for French, but some results are disappointing due to the poor handling of accented characters.
I am unsure which OCR engine you use, but since I often share memes, graphic screenshots and documents containing French text, I get very poor results.
It is also unfortunate that text extraction from web pages does not retain the specific characters of my language correctly.
In short, it’s a pity, given the power your tool is capable of.
I am not talking about translating the GUI. In my opinion, the average French user is already quite satisfied with the current English GUI.
Cheers

sofia · March 7, 2024, 9:16am

Hi @DenisCad,

Thank you for your feedback.

The App is currently optimised for English; however, it can also answer questions in French.

With regard to text extraction from web pages, even if specific French characters are not recognised, the AI should still understand the overall meaning of the text well enough to answer your questions in French. If that’s not the case, could you please share examples of questions that were not answered correctly, so we can better understand the specifics of the issue?

It’s also worth making sure that the memes, graphic screenshots and documents that you upload are being properly learned by the AI. If it was unable to extract the content, it won’t be able to answer your questions, no matter the language.

I hope this helps.

Best,
Sofia

DenisCad · March 9, 2024, 10:14am

Sofia, for example try this meme image and read the “preview image content”… It’s not good.
I also saw something quite bad with a web page. I have to find where it was and I’ll be back.

DenisCad · March 9, 2024, 10:25am

Another example, this time with a web page. For example, this URL: A. Bordiga : Auschwitz ou le grand alibi
with the screenshot of the preview… We’ll

sofia · March 11, 2024, 6:08pm

Hi @DenisCad,

Thanks for sharing these examples.

For image-to-text extraction (such as the meme you sent), the issue is due to the OCR, which is optimised for English. It does recognise some accents and special characters, but it currently works best with English text.

For the website that you shared, the issue is with the encoding of this particular site. It uses ISO-8859-1, a single-byte encoding, which is why the AI is not able to extract the text correctly. Most pages use UTF-8, a multibyte encoding that can represent the characters of any language, and myReach is able to extract and understand all the text in those. However, when a French page is served in a single-byte encoding, its accented and other special characters are not extracted correctly.
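To illustrate the encoding point, here is a minimal Python sketch of what can go wrong when ISO-8859-1 bytes are read as UTF-8. It is not how myReach actually extracts pages; the sample sentence and the `decode_page` helper are hypothetical and only show the general idea of trying the declared charset before falling back.

```python
from typing import Optional

# Sample French text, encoded the way an ISO-8859-1 page would send it.
text = "Été à Genève : déjà vu"
raw = text.encode("iso-8859-1")

# Reading those bytes as UTF-8 damages every accented character:
# each one becomes the replacement character U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Reading them with the charset the page actually uses recovers the text.
right = raw.decode("iso-8859-1")


def decode_page(data: bytes, declared: Optional[str] = None) -> str:
    """Hypothetical helper: try the declared charset, then UTF-8,
    and finally fall back to ISO-8859-1, which accepts any byte."""
    for encoding in filter(None, (declared, "utf-8")):
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    return data.decode("iso-8859-1")
```

With this fallback order, `decode_page(raw)` returns the original sentence even when no charset is declared, because the strict UTF-8 attempt fails and the ISO-8859-1 fallback takes over.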

I hope this helps.
Best,
Sofia