PastView Blog

Find out more about Digital Access & Discovery

 

Multi-Language Support for Optical Character Recognition (OCR)

Posted by Marshall Parr on March 21, 2022 at 3:01 PM

We have introduced multi-language support for the OCR (Optical Character Recognition) tool. You can apply it to typed-text items when you import them into PastView, or retrospectively to your existing items.

This feature is available to everyone and can be accessed in the following ways:

  • On Item pages, where the OCR item button now offers language options
  • On the Import Files panel
  • On Collection Pages under 'OCR all items'

What is the purpose of Multi-Language Support for OCR?

This addition to the OCR tool improves character recognition accuracy for French and Spanish characters and words. It is particularly suited to typed text that is either:

  • A combination of two or more languages (e.g. English and French) in which more than half of the document is in one language; or
  • Entirely written in French or Spanish.

 

What can Multi-Language Support for OCR do?

  • Recognise English, French or Spanish text and output it as a single transcription
  • Select the relevant word from the chosen language's dictionary in order to recognise a character
  • Let you check OCR output for accuracy and, if it is not accurate, re-process the item in an alternative language

 

Further Notes

When you select a language other than English, the OCR system looks for characters and words in that language. It will still recognise English characters within the text. However, it will not use an English dictionary to determine which letters are likely to appear, so detection of English words will be poorer.
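To illustrate why the chosen dictionary matters, here is a toy sketch in Python. It is not how PastView's OCR engine is implemented. The word lists, the candidate readings and the `pick_reading` function are all hypothetical, and a real engine weighs many more signals than a simple lookup.

```python
# Toy illustration only: hypothetical word lists, not PastView's dictionaries.
ENGLISH_WORDS = {"the", "summer", "garden"}
FRENCH_WORDS = {"le", "été", "jardin"}


def pick_reading(candidates, dictionary):
    """Return the first candidate reading found in the dictionary.

    If no candidate is a known word, fall back to the first candidate,
    which stands in for the raw, uncorrected character guess.
    """
    for reading in candidates:
        if reading.lower() in dictionary:
            return reading
    return candidates[0]


# A smudged English word could be read as "tbe" or "the".
english_candidates = ["tbe", "the"]
# An accented French word could be read as "ete" or "été".
french_candidates = ["ete", "été"]

# With the English dictionary, the English word is corrected.
print(pick_reading(english_candidates, ENGLISH_WORDS))
# With the French dictionary, the French word is corrected,
# but the English word keeps its raw guess.
print(pick_reading(french_candidates, FRENCH_WORDS))
print(pick_reading(english_candidates, FRENCH_WORDS))
```

In this sketch the French word list helps with the accented word but offers no help with the smudged English one, which mirrors the trade-off described above.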

We therefore recommend selecting another language only if the majority of the document being processed is in that language, or if finding accurate words and phrases in the other language matters more to you than the English text.
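If you are planning a large batch and want to write this recommendation down as a rule, it could look something like the Python sketch below. This is only an illustration of the guidance, not a PastView feature or API. The `choose_ocr_language` function, the `shares` estimates and the `priority` argument are hypothetical, and falling back to English when no language has a majority is our assumption.

```python
from typing import Mapping, Optional

# Languages named in this post.
SUPPORTED = ("English", "French", "Spanish")


def choose_ocr_language(shares: Mapping[str, float],
                        priority: Optional[str] = None) -> str:
    """Suggest an OCR language for one document.

    shares   -- your own rough estimate of how much of the document is in
                each language, as fractions that add up to about 1.0
    priority -- a language whose words matter most to you, regardless of
                how much of the document it covers
    """
    if priority is not None:
        if priority not in SUPPORTED:
            raise ValueError(f"Unsupported language: {priority}")
        return priority
    # Pick a language only if it covers more than half of the document.
    for language in SUPPORTED:
        if shares.get(language, 0.0) > 0.5:
            return language
    # Assumption: with no clear majority, stay with English.
    return "English"


# Mostly French document: French is suggested.
print(choose_ocr_language({"English": 0.3, "French": 0.7}))
# Mostly English, but the Spanish passages matter most to the researcher.
print(choose_ocr_language({"English": 0.8, "Spanish": 0.2}, priority="Spanish"))
```

You would still check the OCR output afterwards and re-process the item in another language if the result is not accurate.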