Pay Voice: Point of Sale Recognition for Visually Impaired People

12/14/2018 ∙ by Guilherme Folego, et al. ∙ CPqD

Millions of visually impaired people depend on relatives and friends to perform their everyday tasks. One relevant step towards self-sufficiency is to provide them with means to verify the value and operation presented in payment machines. In this work, we developed and released a smartphone application, named Pay Voice, that uses image processing, optical character recognition (OCR) and voice synthesis to recognize the value and operation presented in POS and PIN pad machines, and then informs the user with auditory and visual feedback. The proposed approach presented significant results for value and operation recognition, especially for POS, due to the higher display quality. Importantly, we achieved the key performance indicators, namely, more than 80% accuracy in a real-world scenario, and less than 5 seconds of processing time for recognition. Pay Voice is publicly available on Google Play and App Store for free.




I Introduction

In the last decade, technological improvements resulted in the development of approaches to provide accessibility for people with disabilities. One important task that has been studied recently is the creation of user interfaces that are appropriate for people with visual impairment [1]. In Brazil, for instance, there are millions of people with severe permanent blindness [2].

As a consequence, many assistive technologies (AT) have been developed to help people identify accessible places, communicate with individuals who have hearing impairments, and describe objects for blind people, among others. One example of AT is the screen reader, software that converts the content shown on a screen into speech. It can be useful for describing objects, informing routes, and allowing blind people to read books and news and to interact on social media. Financial operations, such as withdrawing money, paying bills and checking accounts, can also be done by blind people with the help of screen readers.

In this context, a challenging issue is to allow visually impaired people to check the value and the operation presented in payment machines, such as point of sale (POS) and PIN pad, illustrated in Fig. 1. While POS are stand-alone devices, PIN pads are directly connected to the merchant sales system; from a physical point of view, PIN pads are devices with simpler screens, usually backlit LCD with low resolution. One possible solution for this problem is using an embedded application in a smartphone for capturing the screen of the POS or PIN pad, recognizing the operation (e.g., credit, debit, voucher) and the value to be paid, converting the text to voice, and then informing the user. The main advantage of this approach is to properly work with existing payment devices in the field, which greatly reduces adoption cost. Even when these devices get upgraded in the future, this method remains viable.

(a) Point of sale.
(b) PIN pad.
Fig. 1: Example of payment machines.

However, the development of AT for reading the value presented on POS or PIN pad screens has some challenges. First, the detection of screen, value and operation to be read can be imprecise due to the environment (e.g., light variation, occlusion, reflection on the screen), low quality of older POS and PIN pad machines, and the positioning of the smartphone at the time of capture. Second, different brands and models of POS and PIN pad devices can have different fonts, sizes and positions for value and operation information. Finally, considering an embedded application, the processing environment can be a limiting factor, due to the timeout for completing the payment operation.

In this work, we introduce Pay Voice, a smartphone application that uses image processing, optical character recognition (OCR), and speech synthesis to recognize the value and operation presented in POS and PIN pad machines, and then informs the user with auditory and visual feedback. The main contributions of this work are:

  • Detection of POS and PIN pad screens in real time;

  • Detection of regions of interest in the screen;

  • Application of an OCR system to recognize the value and operation;

  • Application of speech synthesis to provide auditive feedback to the user.

Pay Voice is a software application publicly available on Google Play and App Store, which can be downloaded and used for free. Furthermore, it has received attention from the news media [3, 4].

The remainder of this paper is organized as follows. In Section II, we present some related works, and in Section III, we detail our proposed method. We describe our experimental setup and our dataset in Section IV. Finally, we present and discuss our results in Section V, and conclude the paper along with some future directions in Section VI.

II Related Works

In order to develop this research, we evaluated a number of studies within assistive technologies and image processing areas, including OCR methods.

II-A Assistive technologies

In combination with accessible user interfaces, assistive technologies can help visually impaired people in their rehabilitation and allow them to interact in the virtual world [5].

According to [1], the term assistive technology can be used within several approaches which require some form of assistance, and it can be divided into two main groups. The first group comprises approaches based on tactile solutions, which generally transform images captured by a camera into electrical or vibrotactile stimuli. They are useful for recognizing different shapes [6], for localization tasks [7], and for reading [8]. The second group is composed of approaches based on auditory solutions. These approaches transform the object of interest (e.g., video, text, image) to audio. Auditory solutions can be used, for instance, in obstacle detection [9].

Another example of an auditory assistive approach is text to speech (TTS) [10], with the goal of synthesizing artificial human speech from a text. Basically, TTS approaches allow human interactions with technology without the need for visual interfaces. In general, smartphone operating systems have an integrated screen reader mechanism which uses text to speech technology to improve accessibility. For Android-based systems, the Google TalkBack service allows visually impaired users to interact with their devices using audible feedback, such as spoken words and sounds. For Apple iOS-based systems, there is the VoiceOver screen reader, which, based on gestures, tells the user what they are touching or dragging.

(a) Input image.
(b) Converted to grayscale.
(c) After Laplacian filter.
(d) After Otsu threshold.
(e) Found contours.
(f) Detected screen.
Fig. 2: Screen detection steps.

II-B Optical character recognition

OCR has been studied by the scientific community for several years. Although many researchers consider OCR a solved problem, the detection and recognition of text in images and videos still pose many challenges due to low quality or degraded data in a real-world scenario. Approaches that automatically recognize characters through an optical mechanism and transcribe the text have several applications in different areas, such as car plate recognition, real time translation, multimedia retrieval, and TTS methods [11].

One example of an OCR approach is the open source OCR engine Tesseract [12]. Assuming the input is a binary image, the first step is to detect the outline of the components using connected component analysis. Then, the outlines are gathered together into blobs. After that, the blobs are organized into text lines, which are analyzed for fixed pitch and proportional text. The obtained lines are broken into words based on the character spacing. Fixed pitch text is chopped into character cells, while proportional text is broken into words by definite spaces and fuzzy spaces.

The recognition step of Tesseract has two phases. First, the engine tries to recognize each word in turn. Each satisfactory word is passed to an adaptive classifier as training data. Once this adaptive classifier has been trained, the second phase is to run over the page trying to recognize the words which were not recognized on the first phase.

III Proposed Method

Our method consists primarily of two steps, namely, screen detection, and recognition of value and operation. The main requirement for our pipeline was to run completely embedded on smartphones, so it would not depend on Internet connection and would not suffer delays related to network latency. The screen detection step is performed in real time to provide audiovisual feedback to the user for correct positioning of the camera. After confirming we have a good image of the screen, then we perform the recognition step.

We used Tesseract as OCR engine. In order to improve recognition accuracy, we trained a specialized model to recognize texts from POS and PIN pad screens. We selected freely available fonts on the Internet to train the model. This selection was based on the similarity to fonts present in POS and PIN pad devices. We also built a text corpus suitable for the vocabulary used in the payment context. In this section, we describe Pay Voice version .

(a) Detected screen.
(b) After rotation.
(c) After top hat filter.
(d) After dilation.
(e) After Otsu threshold.
(f) Found contours.
Fig. 3: Regions of interest detection steps.

III-A Screen detection

Since this step needs to run locally in real time, we made some simplifying assumptions, such as the screen being the largest bright region in the image. This proved to work fairly well in our real-world experiments with a number of different machines and environments.

To ensure this step is executed as fast as necessary, we first convert the input image to grayscale, and resize it to a fixed, smaller height with nearest neighbor interpolation [13]. This image is used twice: to check for correct focus, and to detect the screen. The focus is checked by applying the Laplacian filter [14] and calculating the variance of the filtered image. The screen is detected by applying the Otsu threshold [15], and searching for the contour [16] with the largest area. With this contour, we compute a straight rectangle enclosing it, and project it back to the original large image, so we can use the screen region in full scale for the recognition step. We additionally compute a rectangle with minimum area around this contour, so we can estimate the screen rotation. We present intermediate results in Fig. 2.
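The focus check and the Otsu threshold described above can be sketched in a few lines. This is a minimal, dependency-free illustration on images stored as lists of rows of 8-bit gray levels, not the code shipped in Pay Voice; the function names and the 4-neighbour Laplacian kernel are assumptions made for the example.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Focus measure: variance of the 4-neighbour Laplacian over interior pixels.

    Sharp images have strong edges, so the filter response varies a lot;
    blurred images give a low variance.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def otsu_threshold(gray):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the others the bright class.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        between = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t
```

A frame would be accepted as focused when `laplacian_variance` exceeds some empirically chosen limit, and the bright class produced by `otsu_threshold` is where the largest contour would then be searched.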


In order to guarantee good camera positioning, we provide meaningful audiovisual feedback to the user. If the area of the straight rectangle is below a certain threshold, then the camera is too far from the machine. Conversely, if the area is above another threshold, the camera is too close. We also check the distance between screen rectangle borders and image edges, making sure the screen is roughly centered. It is important to note that the employed thresholds were determined empirically. If all these checks, including the focus check, pass for five consecutive frames of the camera stream, then we proceed to the recognition.
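The positioning checks can be illustrated with a small sketch. The threshold values below (`min_area_ratio`, `max_area_ratio`, `min_margin`) and the feedback strings are hypothetical placeholders, since the empirically determined values are not given in the text; only the requirement of five consecutive valid frames comes from the description above.

```python
def positioning_feedback(rect, image_size, in_focus,
                         min_area_ratio=0.15, max_area_ratio=0.70,
                         min_margin=10):
    """Return 'ok' or a hint for the user.

    rect is the straight screen rectangle (x, y, width, height) and
    image_size is (width, height). All thresholds are illustrative.
    """
    x, y, w, h = rect
    img_w, img_h = image_size
    if not in_focus:
        return "out of focus"
    area_ratio = (w * h) / (img_w * img_h)
    if area_ratio < min_area_ratio:
        return "too far"
    if area_ratio > max_area_ratio:
        return "too close"
    # Distance from each rectangle border to the matching image edge.
    margins = (x, y, img_w - (x + w), img_h - (y + h))
    if min(margins) < min_margin:
        return "not centered"
    return "ok"


class StableFrameCounter:
    """Signals readiness after `required` consecutive valid frames."""

    def __init__(self, required=5):
        self.required = required
        self.streak = 0

    def update(self, feedback):
        # Any invalid frame resets the streak.
        self.streak = self.streak + 1 if feedback == "ok" else 0
        return self.streak >= self.required
```

In a camera loop, each frame's feedback would be spoken or displayed to the user and also passed to `StableFrameCounter.update`, which returns `True` once recognition may start.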

III-B Recognition

We start the recognition step with a median filter, to reduce the noise from the camera. Then, if the screen angle is larger than an empirically determined limit, we rotate the screen around its center to the closest straight position, using an affine transformation with cubic interpolation. Since we cannot know whether the screen is vertical or horizontal, our recognition only tolerates a limited range of rotation. For instance, if the screen is vertical and rotated beyond that range, it will end up rotated to a horizontal position, and the recognition will not work properly.
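The "closest straight position" rule, and the ambiguity it causes, can be made concrete with a short sketch. The function name and the `min_angle` default are assumptions for illustration; the actual limit used in Pay Voice is not stated here.

```python
def correction_angle(screen_angle, min_angle=2.0):
    """Rotation (in degrees) that brings the screen to the nearest straight position.

    The nearest straight position is the closest multiple of 90 degrees.
    Small deviations (at most `min_angle`, a hypothetical limit) are ignored.
    """
    nearest_straight = 90.0 * round(screen_angle / 90.0)
    delta = nearest_straight - screen_angle
    return delta if abs(delta) > min_angle else 0.0
```

For a screen tilted by 10 degrees, this returns -10.0 and the screen is straightened as intended. For a tilt of 50 degrees, it returns 40.0: the screen snaps to the other orientation, which is exactly the failure case described above, because the rectangle alone does not reveal whether the screen was originally vertical or horizontal.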

After aligning the screen, we need to find the potential regions of interest. This is necessary since the OCR engine cannot handle the complete screen at once: it outputs garbled text and takes a long time to execute. The detection of these regions starts with resizing the screen image to half its original size with nearest neighbor interpolation, so it executes faster. We then apply a white top hat morphological operation (the difference between the input image and its opening) with a large nearly square kernel, followed by a dilation with a large horizontal kernel, to connect nearby elements.
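To make the morphological step concrete, the following sketch implements erosion, dilation and the white top hat on small images stored as lists of rows. It is a plain-Python illustration with rectangular kernels clipped at the image border; the kernel sizes are left to the caller and are not the ones used in Pay Voice.

```python
def _morph(img, kernel_h, kernel_w, pick):
    """Apply `pick` (min or max) over a kernel_h x kernel_w window at each pixel."""
    h, w = len(img), len(img[0])
    ry, rx = kernel_h // 2, kernel_w // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - ry), min(h, y + ry + 1))
                for i in range(max(0, x - rx), min(w, x + rx + 1))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def erode(img, kernel_h, kernel_w):
    return _morph(img, kernel_h, kernel_w, min)


def dilate(img, kernel_h, kernel_w):
    return _morph(img, kernel_h, kernel_w, max)


def white_top_hat(img, kernel_h, kernel_w):
    """Input minus its opening: keeps bright details smaller than the kernel."""
    opened = dilate(erode(img, kernel_h, kernel_w), kernel_h, kernel_w)
    return [[p - o for p, o in zip(src, opn)] for src, opn in zip(img, opened)]
```

Applying `white_top_hat` with a nearly square kernel suppresses the uniform background and keeps thin bright strokes, and a subsequent `dilate` with a wide, one-row kernel merges neighbouring strokes on the same text line into a single blob.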

Afterwards, we apply an Otsu threshold and detect all contours, extracting a straight rectangle region around each of them, and we project these regions back to the original space, with additional padding (in pixels) on each side. Finally, we select only the horizontal regions that have a proportional area within a predefined range, to ignore regions that are not useful and to speed up the recognition. These steps are illustrated in Fig. 3. Even though there are more general techniques for text spotting with good accuracy [17], limitations regarding latency and hardware rendered the use of such methods impracticable in the present work.
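The projection and filtering of candidate regions might look like the sketch below. The padding and the area range (`padding`, `min_ratio`, `max_ratio`) are hypothetical values chosen for the example, as the predefined ones are not given; the scale factor of 2 reflects the half-size image used for detection.

```python
def select_regions(rects, screen_size, scale=2, padding=4,
                   min_ratio=0.01, max_ratio=0.30):
    """Project rectangles to full scale, pad them, and keep plausible text lines.

    rects are (x, y, width, height) in the half-size image; screen_size is the
    (width, height) of the full-size screen. Thresholds are illustrative.
    """
    screen_w, screen_h = screen_size
    selected = []
    for x, y, w, h in rects:
        # Project back to the original space and add padding, clipped to the screen.
        x0 = max(0, x * scale - padding)
        y0 = max(0, y * scale - padding)
        x1 = min(screen_w, (x + w) * scale + padding)
        y1 = min(screen_h, (y + h) * scale + padding)
        width, height = x1 - x0, y1 - y0
        if width <= height:
            continue  # keep only horizontal regions
        ratio = (width * height) / (screen_w * screen_h)
        if min_ratio <= ratio <= max_ratio:
            selected.append((x0, y0, width, height))
    return selected
```

Regions that are vertical, too small (noise) or too large (most of the screen) are dropped before the comparatively expensive OCR step.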

(a) Detected region.
(b) After dilation.
(c) After Otsu threshold.
(d) After dilation.
Fig. 4: Region filtering steps.

In the last step, for each selected region of interest, we try to recognize the value and operation. We clean up this image using an erosion morphological operation with a small ellipse kernel, followed by an Otsu threshold, and the resulting image goes through the OCR engine. Finally, we apply another erosion, and the OCR again. These steps are illustrated in Fig. 4. This repeated erosion and OCR process is necessary due to thin and unconnected fonts, especially in PIN pad machines, as shown in Fig. 5.

After each OCR process, we try to extract the value and operation (e.g., credit, debit) from the recognized text. For the value, we use a regular expression that matches expressions containing integer digits, followed by a decimal separator and two integer digits. The regular expression is flexible enough to handle small variations and recognition errors (e.g., we accept whitespace next to the decimal separator). Confidence for the value is given as the weighted mean of the score for the integer part and the score for the decimal part. The score for each part is computed as the average score of its characters, as given by the OCR engine. Considering all evaluated regions, we only keep the recognized value with the highest calculated score.
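A sketch of this value extraction is shown below. The regular expression, the equal weights (`int_weight`, `dec_weight`) and the input format (a list of `(character, score)` pairs) are assumptions for illustration; the actual expression and weights used in Pay Voice are not reproduced here.

```python
import re

# Integer digits, a decimal separator with optional surrounding whitespace,
# and exactly two decimal digits (illustrative pattern).
VALUE_RE = re.compile(r"(\d+)\s?[.,]\s?(\d{2})(?!\d)")


def extract_value(chars, int_weight=0.5, dec_weight=0.5):
    """Return (value, confidence) from OCR output, or None if no value is found.

    chars is a list of (character, score) pairs, with scores in [0, 1].
    """
    text = "".join(c for c, _ in chars)
    match = VALUE_RE.search(text)
    if match is None:
        return None

    def mean_score(group):
        start, end = match.span(group)
        scores = [s for _, s in chars[start:end]]
        return sum(scores) / len(scores)

    confidence = (
        int_weight * mean_score(1) + dec_weight * mean_score(2)
    ) / (int_weight + dec_weight)
    return f"{match.group(1)}.{match.group(2)}", confidence


def best_value(regions):
    """Keep the candidate with the highest confidence over all regions."""
    candidates = [v for v in map(extract_value, regions) if v is not None]
    return max(candidates, key=lambda v: v[1], default=None)
```

Because the per-character scores are indexed through the match spans, characters outside the value (currency symbols, labels) do not influence the confidence.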

For operation, we compute scores based on the distance between the recognized text and a set of previously known operations, selecting the one with the highest score. The distance is obtained with a simplified version of the Levenshtein algorithm [18]. After that, we also compute scores based on the distance between the recognized text and a previously defined blacklist, which contains expressions we want to avoid. The blacklist contains expressions that are similar to known operations and might be mistaken for legitimate operations when recognition errors occur. For instance, the word “digite” may be identified as “débito”.

If the score from the blacklist is higher than the score from the set of known operations, then the operation is changed to unknown. Similarly to the value, considering all evaluated regions, we only keep the operation with highest calculated score.
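The operation matching with a blacklist can be sketched as follows. This example uses the standard Levenshtein edit distance rather than the simplified variant mentioned above, and the word lists and the normalization of the distance into a score are assumptions made for illustration.

```python
KNOWN_OPERATIONS = ["crédito", "débito", "voucher"]  # illustrative list
BLACKLIST = ["digite"]  # illustrative list


def levenshtein(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify_operation(text, known=KNOWN_OPERATIONS, blacklist=BLACKLIST):
    """Return (operation, score), or ('unknown', score) if the blacklist wins."""
    text = text.strip().lower()
    operation, score = max(
        ((op, similarity(text, op)) for op in known), key=lambda pair: pair[1]
    )
    black_score = max((similarity(text, b) for b in blacklist), default=0.0)
    if black_score > score:
        return "unknown", black_score
    return operation, score
```

With these lists, an OCR output of "debito" (missing accent) is still matched to "débito", while "digite" is closer to the blacklist entry than to any operation and is therefore reported as unknown.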

(a) Detected region.
(b) After dilation.
(c) After Otsu threshold.
(d) After dilation.
Fig. 5: Region filtering steps in a PIN pad machine.
(a) Simple.
(b) Difficult.
(c) Impossible.
Fig. 6: Samples from the dataset.

IV Experimental Setup

To validate the proposed approach, we crowdsourced the collection of a dataset in a real-world scenario, considering a variety of POS and PIN pad machines, smartphone cameras, and people performing the operation. We collected pictures of both POS and PIN pad machines. This collection made use of our screen detection approach, to ensure a good quality of captured images. Then, each sample was manually annotated with the respective value and operation.

(a) Original.
(b) Cropped.
(c) Rotated only.
Fig. 7: Example of simulated rotations.

We illustrate a simple, a difficult and an impossible sample from our dataset in Fig. 6. Note that the simple image has few regions of interest, which makes processing faster and avoids potential mistakes. The difficult sample shows a number of regions that are larger than the value and operation, including the hours display, while the impossible image has a reflection on the value. We decided to keep every sample, since they represent the real-world scenario and are therefore better reflected in our performance metrics. Image sizes were selected according to availability in each device, choosing the smallest size that met a minimum height requirement in pixels. Resulting sizes and respective frequencies are indicated in Table I.

Size Frequency
TABLE I: Image sizes frequency, in pixels.

Additionally, to evaluate our recognition with rotated images, we also generated another set of images with simulated rotations, since rotation was not very common in our collected data. We selected one sample with a straight screen, and rotated it both clockwise and counterclockwise, with and without cropping. Even though we understand that our approach only tolerates a limited range of rotation, we generated images over a range of angles in each direction, producing one set of rotated images with cropping and another without cropping. The original image used is presented in Fig. 7, along with two generated samples that share the same clockwise rotation.

V Results and discussions

Machine Thr. Correct Incorrect Unrecog.
TABLE II: Performance of value and operation recognition, respectively.

Although there are a number of ways to evaluate our performance, we chose to adopt the traditional accuracy metric. In other words, we need to correctly recognize the complete value and the complete operation; a partially recognized value counts as incorrect, even if we miss only one digit. In addition to the recognized values, our system provides a confidence score, which is used to decide whether the application shows the output from our method or simply informs the user that it was unable to recognize the value or operation. That is, output is shown only if the scores are greater than a given threshold.
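The evaluation protocol can be summarized by the sketch below, which splits samples into correct, incorrect and unrecognized for a given threshold. The function name and the sample format are hypothetical; the data in the usage note is made up purely to show the mechanics and is unrelated to the reported results.

```python
def evaluate(samples, threshold):
    """Return the fraction of correct, incorrect and unrecognized samples.

    samples is a list of (predicted, confidence, truth) tuples, where
    predicted is None when nothing was extracted. A prediction is shown only
    if its confidence is greater than the threshold, and it counts as correct
    only if it matches the annotation exactly.
    """
    counts = {"correct": 0, "incorrect": 0, "unrecognized": 0}
    for predicted, confidence, truth in samples:
        if predicted is None or not confidence > threshold:
            counts["unrecognized"] += 1
        elif predicted == truth:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    total = len(samples)
    return {name: count / total for name, count in counts.items()}
```

For example, with the toy samples `[("12.50", 0.9, "12.50"), ("12.50", 0.4, "72.50"), ("3.00", 0.2, "3.00"), (None, 0.0, "8.00")]`, raising the threshold from 0 to 0.5 removes the incorrect output but also turns one correct low-confidence output into an unrecognized one, which is the trade-off the threshold controls.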

The adopted thresholds, one for value and another for operation, were determined empirically. In Fig. 8, we plot multiple performance metrics for value recognition in POS machines, with varying thresholds starting from zero. We note that most of our confidence scores fall within a narrow range, where there is a steep variation from correct to unrecognized values. From this curve, we chose a safe and sound threshold for value recognition in POS. In Table II, we present the performance metrics for value and operation recognition.

From Table II, we can see that the proposed method achieved a remarkable real-world performance for POS recognition. However, PIN pad machines are somewhat harder, due to inferior illumination, higher reflection, thinner fonts and generally lower screen quality. Using zero as threshold means we always output the recognized value and operation, while higher thresholds imply occasionally informing the user that this information was unrecognized. Note that some POS and almost all PIN pad machines do not display the operation, so we are simply unable to recognize it. Considering the complete dataset with the adopted thresholds, our approach correctly identified most values, with only a small fraction of incorrect recognitions. Similarly, for the operation, it made only a small fraction of mistakes. This means that the output from our method can almost always be trusted, and that we reached our key performance indicator of more than 80% accuracy.

As we designed and developed this approach to work embedded in smartphones, it is important to consider how long the recognition step takes. In Table III, we show execution times for a number of different devices. The idea was to evaluate an Android and an iOS smartphone in each tier. As such, we considered Samsung S8 and iPhone 7 as high end, LG G3 and iPhone 6 Plus as mid end, and LG X Power and iPhone 5 as low end. For comparison, we additionally included a desktop machine equipped with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz.

Device Median Mean StdDev Min Max
iPhone 7
Samsung S8
iPhone 6 Plus
iPhone 5
LG X Power
TABLE III: Execution time of recognition, in seconds.

From Table III, we can see that, even for the slowest device, the expected recognition time completely satisfies the usability requirements from our client and from our users, considering their usage experience. Even though the maximum time was higher, it is still acceptable, since it stays below the usual timeout of POS and PIN pad machines, and smartphone hardware tends to improve over time. As a complementary note, this maximum execution time happened due to OCR delay in unimportant regions caused by image noise.

Regarding the generated dataset with artificial rotations, we measured the share of correctly classified images separately for the images with rotation only and for the images with both rotation and cropping. Both metrics are well within our expectations and demonstrate that our method can properly deal with rotated images, which improves the user experience.

Fig. 8: Value recognition in POS with different thresholds.

VI Conclusions

More than 6.5 million people in Brazil have some type of blindness, relying on relatives and friends to perform ordinary tasks, such as shopping and managing their personal finances. An important task to improve self-sufficiency of visually impaired people is to provide them with means to interact with payment machines. In this work, we developed and released a mobile application, named Pay Voice, for recognition of value and operation in POS and PIN pad machines in the real world, focused on helping visually impaired people. The proposed approach presented significant results for value and operation recognition, especially for POS, due to the higher display quality. Importantly, we achieved the expected results from our client, namely, more than 80% accuracy in a real-world setting, and less than 5 seconds of processing time for recognition. We understand that this work is simply one step towards promoting integration and accessibility for visually impaired people, making them more independent.

Our intended next steps focus mainly on improving recognition performance and speed. In particular, there is a need to increase recognition accuracy for PIN pad machines. We expect that optimizing the OCR system with additional fonts, closely related to the ones we are currently missing, could bring this improvement. Additionally, there are a number of approaches to reduce execution time. For instance, if we detect that the region of interest contains white text on a black background, it is almost certainly a POS, and we can safely skip the repeated dilation and OCR steps. To reduce occasional OCR delays, we could remove border leftovers and small artifacts from the regions of interest before proceeding with the OCR, and possibly add a timeout to the OCR processing. Extending our blacklist could also avoid mistakes and unnecessary steps. Finally, in principle, the regions of interest could be processed in parallel.

For most people technology makes things easier. For people with disabilities, however, technology makes things possible. (Mary Pat Radabaugh)


We thank the support from Claudinei Martins, Rodrigo Morbach, Fernando Marino, and Fabiani de Souza. The resulting Pay Voice application would not be possible without the effort from all the people involved in this project. We also appreciate the financial support from Abecs (Associação Brasileira das Empresas de Cartões de Crédito e Serviços).


  • [1] A. Csapó, G. Wersényi, H. Nagy, and T. Stockman, “A survey of assistive technologies and applications for blind users on mobile platforms: a review and foundation for research,” Springer Journal on Multimodal User Interfaces, vol. 9, no. 4, pp. 275–286, 2015.
  • [2] Instituto Brasileiro de Geografia e Estatística (IBGE), “Censo demográfico: Características gerais da população, religião e pessoas com deficiência,” 2010.
  • [3] G1 Campinas e Região, “Aplicativo de celular criado no CPqD ajuda deficiente visual a conferir valor pago com cartão,” 2018, [Accessed 2018-08-11]. [Online]. Available:
  • [4] Band Campinas, “Compras: aplicativo promete facilitar a vida de pessoas com deficiência visual,” 2018, [Accessed 2018-08-11]. [Online]. Available:
  • [5] F. de Souza, C. Martins, and S. Silva, “Usability, accessibility and affectibility in CPqD,” in Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems, ser. IHC 2017.   New York, NY, USA: ACM, 2017, pp. 74:1–74:4.
  • [6] K. A. Kaczmarek and S. J. Haase, “Pattern identification and perceived stimulus quality as a function of stimulation waveform on a fingertip-scanned electrotactile display,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 1, pp. 9–16, 2003.
  • [7] G. Jansson, “Tactile guidance of movement,” International Journal of Neuroscience, vol. 19, no. 1-4, pp. 37–46, 1983.
  • [8] J. C. Craig, “Tactile letter recognition: Pattern duration and modes of pattern generation,” Perception & Psychophysics, vol. 30, no. 6, pp. 540–546, 1981.
  • [9] L. Dunai, G. P. Fajarnes, V. S. Praderas, B. D. Garcia, and I. Lengua, “Real-time assistance prototype—a new navigation aid for blind people,” in Annual Conference on IEEE Industrial Electronics Society, 2010, pp. 1173–1178.
  • [10] L. S. G. Piccolo, E. M. De Menezes, and B. De Campos Buccolo, “Developing an accessible interaction model for touch screen mobile devices: Preliminary results,” in Proceedings of the 10th Brazilian Symposium on Human Factors in Computing Systems and the 5th Latin American Conference on Human-Computer Interaction, ser. IHC+CLIHC ’11. Porto Alegre, Brazil: Brazilian Computer Society, 2011, pp. 222–226.
  • [11] Q. Ye and D. Doermann, “Text detection and recognition in imagery: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
  • [12] R. Smith, “An overview of the tesseract ocr engine,” in IEEE Intl. Conference on Document Analysis and Recognition, 2007, pp. 629–633.
  • [13] J. A. Parker, R. V. Kenyon, and D. E. Troxel, “Comparison of interpolating methods for image resampling,” IEEE Transactions on Medical Imaging, vol. 2, no. 1, pp. 31–39, 1983.
  • [14] G. Aubert and P. Kornprobst, Mathematical problems in image processing: partial differential equations and the calculus of variations. Springer Science & Business Media, 2006, vol. 147.
  • [15] M. H. J. Vala and A. Baxi, “A review on otsu image segmentation algorithm,” International Journal of Advanced Research in Computer Engineering & Technology, vol. 2, no. 2, pp. pp–387, 2013.
  • [16] S. Suzuki and K. Abe, “Topological structural analysis of digitized binary images by border following,” Computer vision, graphics, and image processing, vol. 30, no. 1, pp. 32–46, 1985.
  • [17] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Deep features for text spotting,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 512–528.
  • [18] V. Levenshtein, “Binary codes capable of correcting spurious insertions and deletions of ones.” Problems of Information Transmission, 1965.