
FedOCR: Communication-Efficient Federated Learning for Scene Text Recognition

While scene text recognition techniques have been widely used in commercial applications, data privacy has rarely been taken into account by this research community. Most existing algorithms assume a set of shared or centralized training data. In practice, however, data may be distributed across different local devices and cannot be centralized or shared because of privacy restrictions. In this paper, we study how to use decentralized datasets to train a robust scene text recognizer while keeping the data on local devices. To the best of our knowledge, we propose the first framework leveraging federated learning for scene text recognition, which is trained collaboratively with decentralized datasets. Hence we name it FedOCR. To make FedOCR suitable for deployment on end devices, we make two improvements: using lightweight models and hashing techniques. We argue that both are crucial for the communication efficiency of FedOCR. Simulations on decentralized datasets show that the proposed FedOCR achieves results competitive with models trained on centralized data, with lower communication costs and a higher level of privacy preservation.





1 Introduction

Text in scene images contains valuable semantic information, and reading it has long been one of the most popular research topics in academia and industry [8, 1, 32, 20, 15]. In practice, scene text recognition has been applied to various real-world scenarios, such as autonomous navigation, photo transcription, and scene understanding. With the development of deep learning and the emergence of public text datasets, significant progress on scene text recognition has been made in recent years.

However, most of the existing scene text recognition algorithms assume that a large scale set of training images is easily accessible. As shown in Fig. 1(a), they may achieve sub-optimal performance and be unable to model the data variations or diversity owing to the lack of sufficient images. To remedy this, some works [3, 40] merge different public datasets to build a more robust text recognizer, as illustrated in Fig. 1(b). However, centralizing data in this way is simply problematic in many real-world scenarios. For example, many laws and regulations strengthening the data privacy constrain the use of data stored on local devices, such as General Data Protection Regulation (GDPR) [34]. Besides, centralizing tremendous image data from different local devices incurs heavy communication loads. That means it is simply intractable to centralize large amounts of data for scene text recognition training in practice. Our solution, which works within the framework of federated learning, is illustrated in Fig. 1(c).

Figure 1: An illustration of training scene text recognizers with (a) a single dataset, (b) a centralized dataset from different devices, and (c) decentralized datasets distributed on different local devices

Federated Learning (FL), a new concept first proposed by McMahan et al. [23], allows data owners to train a shared model collaboratively while keeping data stored on different local devices. However, directly applying FL to scene text recognition faces two inevitable difficulties. First, in most scene text recognition algorithms, a heavyweight backbone model is usually adopted for the sake of better performance. Hence, it results in heavy burdens of the parameter transmission while doing federated learning. Second, there is an extra computational cost from a privacy-preserving module to handle privacy leakage due to the honest-but-curious global server in general federated learning frameworks.

In this paper, to the best of our knowledge, we propose the first federated learning framework for scene text recognition, which we name FedOCR. In FedOCR (a schematic is given in Fig. 2), all participants train a shared model collaboratively without centralizing the training images. In this manner, datasets on different local devices indirectly influence the training of the global model, which leads to performance competitive with that of a model trained on a centralized set of data. To improve the communication efficiency between the global server and local clients, we focus on two important aspects of FedOCR, i.e., lightweight models and hashing techniques. Moreover, thanks to the hashing technique, we can avoid privacy leakage to the global server by means of a specific hashing function and random seeds, which saves the extra computational cost of a privacy-preserving module. As a consequence, the proposed FedOCR is ready to be deployed in practical applications of scene text recognition.

Compared with existing scene text recognition methods [20, 15, 3, 40] without federated learning, the proposed framework has the following intriguing merits. First, FedOCR can make use of more abundant image data from different local devices. Particularly, there are billions of end devices that collect tremendous image data containing text which benefits scene text recognition. Therefore, our framework may have great potential in real-world applications of scene text reading. Second, by design, our framework has a superior trade-off between parameter transmission efficiency and performance. The proposed text recognizer has much fewer parameters than existing scene text recognition algorithms but encouragingly reaches a comparable performance. Last, it can encrypt and decrypt with the hashing technique, which provides higher-level privacy-preserving without an extra computational cost.

In summary, the main contributions of this paper are three-fold.

  • We reveal the problem of data privacy in scene text recognition, which is somehow overlooked by the existing methods.

  • We propose the first federated scene text recognition framework called FedOCR for training a recognizer with decentralized datasets distributed on different local devices.

  • FedOCR is a highly communication-efficient and privacy-preserving framework by incorporating lightweight backbones and hashing techniques, which makes it suitable to be deployed in real privacy-sensitive applications and edge devices.

2 Related Work

Scene text recognition has attracted great interest for a long time. According to Long et al. [18], representative methods can be roughly divided into two mainstreams, i.e., Connectionist Temporal Classification (CTC) based and attention-based methods. Generally, the CTC-based methods model scene text recognition as a sequence recognition task. For example, Shi et al. [29] combine the convolutional neural network (CNN) with the recurrent neural network (RNN) to extract sequence features from input images, and decode the features with a CTC layer. Different from Shi et al. [29], Gao et al. [7] use stacked convolutional layers to extract contextual information from inputs without an RNN, and show advantages in computational cost. Meanwhile, attention-based methods extract features more effectively via the attention mechanism. For instance, Liu et al. [17] propose a binary convolutional encoder-decoder network to provide real-time scene text recognition. Unlike other attention-based algorithms, Bai et al. [2] propose Edit Probability (EP) to handle the misalignment between the output sequence of probability distributions and the ground-truth sequence, which is caused by missing or superfluous characters in the output.

With the improvement of scene text recognition, researchers start to focus on more difficult settings or scenarios, such as irregular texts [37] and perspective distortion [16, 40]. To improve irregular text recognition, Yang et al. [37] propose a symmetry-constrained rectification network to generate better rectification results than existing algorithms. Instead of using the global rectification, Liu et al. [16] propose a character-aware neural network with a hierarchical attention mechanism, which adopts a local transformation to rectify characters individually. Meanwhile, many works [12, 9, 15] exploit synthetic word images to remedy the insufficiency of training data.

Undoubtedly, large amounts of real-world data are needed in practical applications of those scene text recognition methods. However, tremendous image datasets are distributed on different local devices and can not be centralized to share. To handle this problem, McMahan et al. [23] first propose the concept of Federated Learning (FL) to train deep networks from decentralized data collaboratively. Following McMahan et al. [23], many researchers are working on improving the federated learning with more efficient parameter transmission and higher-level privacy-preserving. To improve privacy security, Wei et al. [36] propose a federated learning framework based on differential privacy, in which artificial noises are added to the local parameters of participants before the model aggregation. To improve communication efficiency, Reisizadeh et al. [27] propose a communication-efficient federated learning method with periodic averaging and quantization.

Very recently, the computer vision community has started to pay attention to federated learning, giving rise to several pioneering works. For example, Luo et al. [21] implement object detection algorithms with federated learning and release a reliable benchmark framework. In the medical field, Zhu et al. [43] implement a privacy-preserving federated learning system with differential privacy for brain tumor segmentation. To the best of our knowledge, we propose the first federated scene text recognition framework, which is more efficient in communication and provides a higher level of privacy preservation.

3 Federated Scene Text Recognition Framework

In this section, we first introduce the pipeline of our federated scene text recognition framework. Then, we describe the details of local training and global aggregation, which are the two main steps in federated learning. Finally, we elaborate on how to improve communication efficiency and preserve data privacy in our framework.

3.1 Pipeline of FedOCR

Figure 2: The pipeline of our federated scene text recognition framework

According to Yang et al. [38], our framework is a kind of horizontal federated learning, where the datasets of different participants share the same feature space but differ in samples. Suppose we have $N$ data owners, which have different sets of training images $\{D_1, \dots, D_N\}$. We denote the accuracy of the text recognizer trained with the decentralized datasets as $A_{fed}$. Note that these decentralized datasets are not shared or transferred to other participants during the training procedures. We denote the accuracy of the text recognizer trained with a centralized dataset as $A_{cen}$. Basically, the objective of FedOCR is to minimize the difference between $A_{fed}$ and $A_{cen}$. A smaller difference between $A_{fed}$ and $A_{cen}$ means a better performance of our federated learning for scene text recognition.

Fig. 2 illustrates the pipeline of our federated scene text recognition framework. There are $N$ participants, each of which has a set of data containing cropped text word images and transcriptions, and a global server for local model parameter aggregation. We assume all participants agree in advance on the same network architecture and the same training objective but do not share their datasets. The whole learning process can be decomposed into four steps:

  1. Before each round of local training, all participants start with the same parameters, which are initialized randomly in the first round and downloaded from the global server in the next rounds.

  2. Each participant trains the model on its own dataset for a fixed number of epochs, $E$.

  3. All participants calculate parameter increments compared to the original parameters in a round, and all parameter increments are sent to the global server.

  4. The global server aggregates all parameter increments by average, and updates a set of global parameters. Before the next local training, the global parameters are downloaded for local model updating.

Following this pipeline, our federated training continues until convergence.

3.1.1 Local Training.

In our FedOCR, each participant $i$ maintains a set of local model parameters $w_i$, and the global server maintains a set of global parameters $w$. Algorithm 1 describes the local training process of our framework. As shown, all participants first download the latest global parameters from the global server and overwrite their local parameters. Then, participants train local models with their datasets independently for $E$ epochs and send parameter increments to the global server. During local training, no participant shares any image data with the others. To update the global parameters efficiently, all participants should train their models sufficiently before parameter transmission. McMahan et al. [23] demonstrate that sufficient epochs of local training can bring a dramatic increase in parameter update efficiency. Detailed experiment settings of our FedOCR are provided in the next section.

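Algorithm 1 Local Training
Input: latest global parameters $w^{t}$ in round $t$; local training learning rate $\eta$
Output: all local parameter increments $\{\Delta w_i^{t+1}\}_{i=1}^{N}$
for each participant $i = 1, \dots, N$ do
    Overwrite local weight vectors: $w_i \leftarrow w^{t}$
end for
for all local participants $i = 1, \dots, N$ do
    for $e = 1, \dots, E$ do
        for $b = 1, \dots, \lceil |D_i| / B \rceil$ do
            Sample a minibatch $\mathcal{B} \subset D_i$
            Compute gradients: $g \leftarrow \nabla_{w_i} \mathcal{L}(w_i; \mathcal{B})$
            Update local parameters: $w_i \leftarrow w_i - \eta g$
        end for
    end for
    Compute local parameter increments: $\Delta w_i^{t+1} \leftarrow w_i - w^{t}$
end for
Send $\{\Delta w_i^{t+1}\}_{i=1}^{N}$ to the global server

Algorithm 2 Global Aggregation
Input: all local parameter increments $\{\Delta w_i^{t+1}\}_{i=1}^{N}$ in round $t$; global parameters $w^{t}$
Output: updated global parameters $w^{t+1}$ in round $t+1$
Compute global parameter increments: $\Delta w^{t+1} \leftarrow \frac{1}{N} \sum_{i=1}^{N} \Delta w_i^{t+1}$
Update global parameters: $w^{t+1} \leftarrow w^{t} + \Delta w^{t+1}$
Send $w^{t+1}$ to all participants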

3.1.2 Global Aggregation.

To aggregate parameter increments from different local participants, McMahan et al. [23] propose a straightforward approach that averages the parameters of all local participants. Following the steps in Algorithm 2, we adapt this federated averaging method [23] to our federated scene text recognition framework. In the global aggregation step of our FedOCR, we average all parameter increments and add the result to the former global parameters; the updated global parameters are then available for all participants to download.
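To make the two steps concrete, the following minimal sketch runs local training and global aggregation on a toy linear regression problem with NumPy. It is an illustration under stated assumptions, not the implementation used in the experiments: the model, the synthetic data, the plain SGD update, the hyper-parameter values, and the names `local_training` and `global_aggregation` are all hypothetical. Only the structure (overwrite local parameters, train locally, send increments, average the increments) follows Algorithms 1 and 2.

```python
import numpy as np

def local_training(global_w, X, y, epochs, batch_size, lr, rng):
    """Train a copy of the global parameters locally and return the increment."""
    w = global_w.copy()                              # overwrite local parameters
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # sample a minibatch
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the mean squared error
            w -= lr * grad                           # update local parameters
    return w - global_w                              # local parameter increment

def global_aggregation(global_w, increments):
    """Average the increments and apply them to the global parameters."""
    return global_w + np.mean(increments, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])

# Hypothetical toy data: three participants, each holding a private dataset.
datasets = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    datasets.append((X, X @ true_w + 0.01 * rng.normal(size=200)))

global_w = np.zeros(3)
for round_idx in range(20):
    # Only increments leave the participants; the raw (X, y) pairs never do.
    increments = [
        local_training(global_w, X, y, epochs=1, batch_size=20, lr=0.05, rng=rng)
        for X, y in datasets
    ]
    global_w = global_aggregation(global_w, increments)

print(global_w)
```

In this toy setting the global parameters approach `true_w` although the server never sees any participant's data, which is the behavior the pipeline relies on.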

3.2 Communication Efficiency

Communication efficiency is an essential property in federated learning. For instance, if the size of one participant's model is one hundred megabytes, tens of gigabytes must be transmitted in each round when hundreds of clients participate in a federated learning framework. Under such circumstances, the large number of parameters results in huge communication costs, which lead to a training bottleneck. To reduce communication burdens, we replace the heavyweight backbone used for feature extraction in text recognizers, such as ResNet [10], with a lightweight neural network. To further decrease the parameter size, we extend a hashing technique [4] to compress the parameters of both CNN and RNN, which makes it applicable to any text recognizer. In this way, the text recognizer in our FedOCR has far fewer parameters than existing text recognition algorithms, which shows great potential in practical federated learning deployment.
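The arithmetic behind this estimate can be written out in a few lines. The sketch below is only illustrative: the helper name `upload_per_round_gb` and the client count of 300 are assumptions chosen to match the "hundreds of clients" in the text, and the two model sizes are the ones reported in Table 1.

```python
def upload_per_round_gb(model_size_mb, num_clients):
    """Total data uploaded to the server in one round, in gigabytes (1 GB = 1000 MB)."""
    return model_size_mb * num_clients / 1000.0

NUM_CLIENTS = 300  # hypothetical; the text only says "hundreds of clients"

for name, size_mb in [
    ("100 MB example", 100.0),
    ("ASTER-FL (Table 1)", 80.52),
    ("FedOCR-Hash, r = 1/4 (Table 1)", 13.38),
]:
    print(f"{name}: {upload_per_round_gb(size_mb, NUM_CLIENTS):.2f} GB per round")
```

The estimate counts uploads only; downloading the updated global parameters adds a comparable amount in the other direction.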

Algorithm 3 Hashing Technique
Input: compression ratio $r$; hashing seeds $\{s_1, \dots, s_L\}$, where $L$ is the number of network layers
Output: a compressed network
for each layer $l$ in the entire network do
    Assume the parameter size of the weight matrix $V^l$ is $m \times n$
    Generate a real weight vector $w^l$, whose parameter size is $K^l = \lfloor r \cdot m \cdot n \rfloor$
    Generate a random ordering of the numbers from $1$ to $m \cdot n$ with a hashing function and the seed $s_l$
    Generate an index vector $I^l$ that assigns each of the $m \cdot n$ positions an index into $w^l$
    Reshape $I^l$ to the shape of $V^l$
    Generate a virtual weight matrix: $V^l_{ij} = w^l_{I^l_{ij}}$
end for
Initialize the text recognition network with $\{V^1, \dots, V^L\}$
The actual parameter size of the compressed network is only $\sum_{l=1}^{L} K^l$

3.2.1 Hashing Technique.

In fact, any well-designed scene text recognition model can be applied in our federated learning framework. However, considering communication efficiency, a network with fewer parameters is more appropriate and practical. Therefore, we propose to compress model parameters with a hashing technique. Specifically, we compress network parameters in a weight-sharing manner, in which a random subset of the parameters in a layer shares the same value. Following Algorithm 3, we compress nearly all parameters in a scene text recognition network, with a hyper-parameter $r$ controlling the compression ratio, and the hashing technique can reduce the parameter size to a large extent. It should be noted that $\lfloor x \rfloor$ in Algorithm 3 denotes the largest integer that does not exceed $x$. Notably, the specific hashing function and the random seeds are shared among all local participants, so that the relationship between real weight vectors and virtual weight matrices is identical in all local models.
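The weight-sharing idea can be sketched for a single layer as follows. This is a minimal illustration under explicit assumptions rather than the implementation used in the experiments: NumPy's seeded permutation stands in for the hashing function, positions are assigned to shared weights by a modulo operation, and the names `build_index` and `virtual_matrix` are hypothetical.

```python
import numpy as np

def build_index(shape, ratio, seed):
    """Map every position of an m-by-n virtual matrix to an entry of a shorter real vector."""
    m, n = shape
    k = max(1, int(np.floor(m * n * ratio)))                # size of the real weight vector
    perm = np.random.default_rng(seed).permutation(m * n)   # seeded random ordering
    # Assumption: positions are assigned to real weights by taking the ordering modulo k.
    return (perm % k).reshape(m, n), k

def virtual_matrix(real_w, index):
    """Expand the real weight vector into the virtual weight matrix by lookup."""
    return real_w[index]

index, k = build_index((4, 6), ratio=0.25, seed=7)
real_w = np.random.default_rng(0).normal(size=k)
V = virtual_matrix(real_w, index)

print(k, V.shape)   # 24 virtual entries are backed by only k stored parameters
```

Because the index depends only on the layer shape, the ratio, and the seed, every participant that uses the same seed obtains the same mapping, so real weight vectors from different participants remain element-wise comparable.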

3.2.2 Text Recognizer.

Following the above methods, we can modify any existing text recognition algorithm to construct a lightweight text recognizer. Specifically, in our experiments, we optimize a classical text recognizer, ASTER [30]. We replace the encoder in ASTER with ShuffleNetV2 [22] and apply the hashing technique to the parameters of the entire model. The only exception is the batch normalization layers, whose parameters we do not compress because there are only a few of them. Thanks to the hashing technique and the lightweight network, we decrease communication costs to a large extent in our federated learning framework.

Moreover, we keep the network structure and experiment settings the same as in ASTER as much as possible. We briefly introduce the scene text recognition method as follows. First, an input image is rectified by a rectification network before being sent into a recognition network. The rectification network, based on the Spatial Transformer Network (STN), aims to rectify perspective or curved texts. Second, we use a lightweight neural network as the encoder to extract the feature sequence from the rectified image. Last, we use an attentional sequence-to-sequence model as the decoder to translate the feature sequence. During inference, we use a beam search algorithm that keeps the five candidates with the highest accumulated scores at every step.
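The beam search used at inference can be illustrated with a generic sketch. It assumes a hypothetical scoring function `next_log_probs(prefix)` that returns log-probabilities for the next token; the toy model, the `<eos>` token, and the maximum length are made up for the example and do not reflect the actual decoder.

```python
import math

def beam_search(next_log_probs, beam_width=5, max_len=10, eos="<eos>"):
    """Keep the beam_width prefixes with the highest accumulated log-probability."""
    beams = [((), 0.0)]                 # (token prefix, accumulated score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # Hypotheses ending in <eos> are complete; the rest are expanded further.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return "".join(t for t in best[0] if t != eos)

def toy_model(prefix):
    """Hypothetical next-character distribution, used only for this example."""
    table = {
        (): {"c": 0.6, "b": 0.4},
        ("c",): {"a": 0.55, "o": 0.45},
        ("b",): {"e": 0.9, "a": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

print(beam_search(toy_model, beam_width=5))   # explores both first characters
print(beam_search(toy_model, beam_width=1))   # greedy decoding for comparison
```

In the toy model the greedy path commits to the locally best first character, while the wider beam keeps the alternative alive and finds the sequence with the higher total probability.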

3.2.3 Network Training.

After neural network initialization, the mapping relationship between real weight vectors and virtual weight matrices is fixed, which is defined in Algorithm 3. In the forward computation, it is the virtual weight matrices that participate in calculations with input features. In the backward propagation, the gradients of all parameters in real weight vectors are calculated based on virtual weight matrices’ parameter gradients.

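Let $V^l_{ij}$ denote the element in the $i$-th row and $j$-th column of a virtual weight matrix at layer $l$, and let $w^l_k$ denote the $k$-th element in the corresponding real weight vector. Write $I^l_{ij}$ for the index that Algorithm 3 assigns to position $(i, j)$, so that $V^l_{ij} = w^l_{I^l_{ij}}$. Assume that

$$\delta^l_{ij} = \frac{\partial \mathcal{L}}{\partial V^l_{ij}},$$

where $\delta^l_{ij}$ is computed from the loss $\mathcal{L}$. Moreover,

$$\frac{\partial V^l_{ij}}{\partial w^l_k} = \begin{cases} 1, & I^l_{ij} = k, \\ 0, & \text{otherwise.} \end{cases}$$

Based on the above equations, we can obtain any parameter’s gradient in the real weight vector as follows:

$$\frac{\partial \mathcal{L}}{\partial w^l_k} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial V^l_{ij}} \, \frac{\partial V^l_{ij}}{\partial w^l_k} = \sum_{(i,j)\,:\, I^l_{ij} = k} \delta^l_{ij}.$$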
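The gradient rule for the real weight vector, in which each real weight accumulates the gradients of all virtual-matrix entries that share it, can be checked numerically. The sketch below is an illustration under assumptions: the index mapping, the quadratic toy loss, and the name `real_gradient` are hypothetical, and the check uses central finite differences.

```python
import numpy as np

def real_gradient(grad_virtual, index, k):
    """Sum the virtual-matrix gradients that map to each real weight."""
    grad_real = np.zeros(k)
    np.add.at(grad_real, index.ravel(), grad_virtual.ravel())   # unbuffered scatter-add
    return grad_real

rng = np.random.default_rng(0)
k = 3
index = rng.integers(0, k, size=(2, 4))   # hypothetical mapping onto k real weights
real_w = rng.normal(size=k)
x = rng.normal(size=4)

def loss(w):
    V = w[index]                          # virtual weight matrix built by lookup
    return 0.5 * np.sum((V @ x) ** 2)     # toy loss, chosen only for the check

V = real_w[index]
grad_virtual = np.outer(V @ x, x)         # dL/dV for the toy loss
grad_real = real_gradient(grad_virtual, index, k)

# Independent check with central finite differences on the real weights.
eps = 1e-6
numeric = np.array([
    (loss(real_w + eps * e) - loss(real_w - eps * e)) / (2 * eps)
    for e in np.eye(k)
])
print(np.allclose(grad_real, numeric, atol=1e-5))
```

`np.add.at` is used instead of fancy-index assignment because several positions map to the same real weight and their contributions must be accumulated rather than overwritten.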


3.3 Privacy Preserving

Federated learning can provide training procedures at a high level of security, but the global server still has a chance to compromise data privacy, for example through model inversion [6] and GAN-based attacks [11]. Usually, local network parameters or their increments are sent to the global server in each communication round, which gives the honest-but-curious server a chance to spy on local sets of data. In recent works, Phong et al. [25] show that even a small portion of the gradients may reveal information about training samples, and apply an additively homomorphic encryption scheme to their federated framework. Shokri et al. [31] propose to upload only a fraction of the gradients with added noise to avoid information leakage, and apply differential privacy to parameter updates for a higher level of security. However, the above methods incur additional computational costs or a dramatic decrease in accuracy because of the privacy-preserving module.

In our FedOCR, we adopt the hashing technique to compress the entire model parameters with a hashing function and random seeds, which are equivalent to an encryption-decryption module and the keys. For the parameter aggregation in the global server, we only upload increments of the parameters in real weight vectors, which can not be used to reconstruct the complete network without the specific hashing function and the random seeds. As for all local participants, they share the same hashing function and random seeds, so the average operation in the global aggregation can be directly applied to these parameter increments. Therefore, the global server can not compromise the private data, while it can finish its global aggregation task. In this way, we enhance the privacy-preserving in our FedOCR without introducing an extra computational cost.
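The claim that averaging can be applied directly to the real weight vectors can be illustrated with a short sketch. It assumes a seeded NumPy permutation with a modulo assignment as a stand-in for the shared hashing function, and all names and sizes are hypothetical. The sketch shows only that aggregation commutes with the expansion into virtual matrices; it is not a security analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
k, shape, shared_seed = 6, (4, 6), 123

# Known to the participants only: the mapping derived from the shared seed.
index = (np.random.default_rng(shared_seed).permutation(shape[0] * shape[1]) % k).reshape(shape)

# Each participant uploads an increment of its real weight vector (length k).
increments = [rng.normal(size=k) for _ in range(3)]

# Server side: plain averaging, with no access to `index` or `shared_seed`.
server_average = np.mean(increments, axis=0)

# Participant side: expanding the averaged vector gives the same result as
# averaging the expanded virtual matrices would have given.
averaged_then_expanded = server_average[index]
expanded_then_averaged = np.mean([d[index] for d in increments], axis=0)
print(np.allclose(averaged_then_expanded, expanded_then_averaged))
```

Since the expansion is a fixed lookup shared by all participants, the server can complete its aggregation task while handling only the compressed vectors.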

4 Experiments

4.1 Experiment settings

4.1.1 Datasets.

Two synthetic datasets [12, 9] and six public real-world datasets are used to train local models, and our models are evaluated on seven general datasets. In our federated settings, we construct different local datasets with the public real-world datasets. These datasets are briefly introduced as follows:

  • Synth90k [12] contains 9 million images generated from a set of 90k English words. Words are rendered onto natural images with random transformations and effects.

  • SynthText [9] contains 0.8 million images for end-to-end text detection and recognition tasks. Therefore, we crop word images using the ground-truth word bounding boxes.

  • ICDAR 2003 (IC03) [19] contains 860 cropped word images for evaluation after discarding images that contain non-alphanumeric characters or have fewer than three characters, which follows [24]. For training, we use 1150 cropped images after filtering.

  • ICDAR 2013 (IC13) [14], which inherits most images from IC03 and extends it with new images, contains 1015 cropped word images for evaluation after filtering. For training, we use 848 cropped images after filtering.

  • ICDAR 2015 (IC15) [13] contains images captured by a pair of Google Glasses casually, and many images are severely distorted or blurred. For a fair comparison, we evaluate models on 1811 cropped word images after filtering. For training, we use 4426 cropped images after filtering.

  • IIIT5K-Words (IIIT5K) [24] contains 3000 word images collected for evaluation and 2000 word images for training, which are mostly horizontal text images.

  • Street View Text (SVT) [35] is collected from the Google Street View, and it contains 647 images of cropped words, many of which are severely corrupted by noise, blur, or low resolution.

  • Street View Text Perspective (SVTP) [26], which is collected from Google StreetView and contains many distorted images, contains 645 word images for evaluation.

  • CUTE80 (CUTE) [28] contains 80 real-world curved text images with high quality. For evaluation, we crop 288 word images according to its ground-truth.

  • ArT [5] is a combination of Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text, which contains images with arbitrary-shaped texts. For training, we use 30271 word images after discarding images that contain non-alphanumeric characters and vertical texts.

  • COCO-Text [33] is based on the MS COCO dataset, which contains images of complex everyday scenes. For training, we use 31943 cropped images after discarding images that contain non-alphanumeric characters and vertical texts.

4.1.2 Decentralized Datasets for Federated Learning.

The decentralized datasets are constructed from public real-world datasets in our experiment settings. We use the training word images from IC03 [19], IC13 [14], IC15 [13], IIIT5K [24], ArT [5], and COCO-Text [33]. As a consequence, we have 70638 real-world text images in total. To simulate the decentralized datasets distributed on local devices in federated learning, we split all the real-world text images randomly and uniformly into different sets of image data, one for each participant. It should be mentioned that these sets of image data are not shared or transferred to other participants during the training procedures.
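The random, uniform split can be sketched as follows. The per-dataset training counts are the ones listed in Section 4.1.1; the participant count of 5, the seed, and the name `split_uniformly` are assumptions made only for this illustration.

```python
import numpy as np

# Training word images per dataset, as listed in Section 4.1.1.
train_counts = {
    "IC03": 1150, "IC13": 848, "IC15": 4426,
    "IIIT5K": 2000, "ArT": 30271, "COCO-Text": 31943,
}
total = sum(train_counts.values())
print(total)   # matches the 70638 images stated in the text

def split_uniformly(num_images, num_participants, seed=0):
    """Shuffle image indices and cut them into near-equal, disjoint shards."""
    order = np.random.default_rng(seed).permutation(num_images)
    return np.array_split(order, num_participants)

# Hypothetical participant count, chosen only for the example.
shards = split_uniformly(total, num_participants=5)
print([len(s) for s in shards])
```

Each shard holds indices into the pooled image list, so no image is assigned to more than one participant.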

4.1.3 Federated Settings.

Some hyper-parameters should be noted in our federated settings: $N$, the number of participants in our federated scene text recognition framework; $r$, the compression ratio of the hashing technique; $E$, the number of epochs for which each local participant trains the model with its dataset before communicating with the global server; $B$, the batch size in local training. In our experiments, we set , , and .

4.1.4 Baseline and FedOCR-Hash.

In our experiments, we adopt ASTER [30] as the text recognition baseline in our FedOCR, which is denoted as ASTER-FL. Then, we replace the encoder in ASTER-FL with ShuffleNetV2 [22], and this variant of ASTER-FL in our FedOCR is denoted as FedOCR-Hash. To further reduce the parameter size, we apply the hashing technique to compress FedOCR-Hash with different ratios $r$, and these models are denoted as FedOCR-Hash with the corresponding $r$ in the rest of the paper.

4.1.5 Implementation Details.

Following the federated settings, we construct $N$ participants in our FedOCR for experiments. In each local training, all models are trained locally via Adadelta [39] with an initial learning rate of 1.0, and each participant trains the scene text recognition model with its dataset individually for $E$ epochs in each round. All word images are used for training directly, without data augmentation. As for the complete federated training process of our FedOCR, each participant trains its model with the two synthetic datasets for 4 rounds, then trains on its real-world dataset for 40 rounds.

The learning rate is decayed to 0.1 and 0.01 at the 5-th round and the 30-th round, respectively. Following Algorithm 2, in the global aggregation step, the global server aggregates the parameter increments from all participants by average. To simply simulate the communication procedure of federated learning, we replace the parameter transmission between participants and the global server with saving and restoring checkpoints on the hard-disk.
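The round-based schedule described above can be written as a small function. This is a sketch under one assumption that the text does not state explicitly: rounds are counted from 1 and each decay applies from the named round onward. The name `learning_rate` is hypothetical.

```python
def learning_rate(round_idx):
    """Learning rate for a 1-based round index.

    Assumption: the decay takes effect from the named round onward.
    """
    if round_idx >= 30:
        return 0.01
    if round_idx >= 5:
        return 0.1
    return 1.0

TOTAL_ROUNDS = 4 + 40   # 4 rounds on synthetic data, then 40 on real-world data
schedule = [learning_rate(r) for r in range(1, TOTAL_ROUNDS + 1)]
print(schedule[:6], schedule[-1])
```

Under this reading, the first decay coincides with the first round trained on real-world data.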

4.1.6 Evaluation Metric.

In our experiments, we use case-insensitive word accuracy for evaluation. A prediction is counted as correct if it matches the ground truth after both are converted to lower case. The recognition accuracy is the percentage of correctly recognized words among all test words. Furthermore, the objective of FedOCR is to minimize the difference between the accuracy of the text recognizer trained with decentralized datasets and that of the recognizer trained with a centralized dataset. A smaller difference means a better performance of our FedOCR.
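The metric is simple enough to state in code. The following sketch is illustrative; the function name `word_accuracy` and the example strings are hypothetical.

```python
def word_accuracy(predictions, ground_truths):
    """Case-insensitive word accuracy, in percent."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        raise ValueError("at least one word is required")
    correct = sum(p.lower() == g.lower() for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

# Hypothetical predictions: the last one differs from its label.
print(word_accuracy(["Hello", "WORLD", "exit", "cat"], ["hello", "world", "EXIT", "cart"]))
```

A word counts as correct only if the whole string matches, so a single wrong character makes the prediction incorrect.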

4.2 Experiments on FedOCR

In this subsection, we first compare the parameter reduction and the accuracy decrease of different models in our FedOCR. Then, we analyze the performance of our FedOCR compared with the other two training manners and show that our FedOCR achieves the objective of federated learning. Finally, we evaluate the two improvements in communication efficiency of our FedOCR.

4.2.1 Comparison of Parameter Size and Accuracy.

Table 1 shows the parameter size and model size of different models in our FedOCR. The accuracy is the average result over all testing datasets. The model size refers to the storage occupied on the hard disk. Compared with ASTER-FL, FedOCR-Hash without hashing reduces the parameter size by 36.45%, with only a 3.11% decrease in accuracy. Among the FedOCR-Hash variants in our experiments, FedOCR-Hash with an appropriate compression ratio ($r = 1/4$) achieves an 83.90% reduction in parameter size and drops only 7.12% in accuracy. With the lightweight backbone and the hashing technique, the model size of the scene text recognizers in our FedOCR is reduced to a large extent, and these lightweight text recognizers encouragingly reach a comparable performance.
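The percentages quoted in this paragraph follow directly from the Table 1 values as relative reductions with respect to ASTER-FL. The short check below recomputes them; the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return round(100.0 * (baseline - value) / baseline, 2)

# Parameter sizes in millions (ASTER-FL vs. FedOCR-Hash without hashing and with r = 1/4).
print(reduction_percent(20.99, 13.34))
print(reduction_percent(20.99, 3.38))

# Average accuracies in percent for the same pairs of models.
print(reduction_percent(91.94, 89.08))
print(reduction_percent(91.94, 85.39))
```

Note that the accuracy figures are relative decreases, not differences in percentage points.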

Models        Backbone       r     Param. (M)        Model (MB)        Accuracy (%)
ASTER-FL      ResNet         -     20.99             80.52             91.94
FedOCR-Hash   ShuffleNetV2   -     13.34 (36.45%)    51.37 (36.20%)    89.08 (3.11%)
FedOCR-Hash   ShuffleNetV2   1/2    6.70 (68.08%)    26.05 (67.65%)    86.65 (5.75%)
FedOCR-Hash   ShuffleNetV2   1/4    3.38 (83.90%)    13.38 (83.38%)    85.39 (7.12%)
FedOCR-Hash   ShuffleNetV2   1/8    1.72 (91.81%)     7.05 (91.24%)    82.58 (10.18%)
Table 1: Parameter size and accuracy comparison between different models in our FedOCR. The accuracy is the average result over all testing datasets. The model size refers to the storage occupied on the hard disk. $r$ is the compression ratio of the hashing technique, and “-” means that we do not apply the hashing technique to the model. The reduction percentages of parameter size, model size, and accuracy compared with ASTER-FL, computed from the listed values, are shown in parentheses
Models        r     Training      IIIT5k  SVT   IC03  IC13  IC15  SVTP  CUTE
ASTER-FL      -     single        93.7    89.0  93.7  93.8  80.6  82.3  85.4
ASTER-FL      -     centralized   95.0    91.7  95.3  94.6  82.2  83.3  91.7
ASTER-FL      -     federated     95.0    90.7  94.8  94.0  82.0  82.3  91.0
FedOCR-Hash   -     single        90.8    83.0  90.9  89.4  77.3  77.5  82.6
FedOCR-Hash   -     centralized   93.1    86.4  92.5  92.2  79.7  80.6  86.8
FedOCR-Hash   -     federated     92.9    86.9  92.0  91.7  79.4  80.8  86.5
FedOCR-Hash   1/2   single        89.2    83.0  90.2  88.5  75.3  73.8  77.8
FedOCR-Hash   1/2   centralized   91.6    83.6  91.0  90.3  77.9  75.5  82.3
FedOCR-Hash   1/2   federated     91.2    84.2  91.6  90.7  77.5  76.0  82.6
FedOCR-Hash   1/4   single        87.2    79.1  87.1  86.1  73.4  71.5  77.4
FedOCR-Hash   1/4   centralized   89.4    81.6  89.5  88.8  75.9  74.3  81.6
FedOCR-Hash   1/4   federated     89.0    81.8  89.3  89.2  76.3  75.2  81.6
FedOCR-Hash   1/8   single        83.5    74.8  84.8  81.4  70.2  71.2  73.3
FedOCR-Hash   1/8   centralized   86.7    78.8  86.7  86.0  72.3  71.6  79.5
FedOCR-Hash   1/8   federated     86.6    80.1  87.1  85.4  72.4  71.6  79.9
Table 2: Recognition accuracy (%) in different training manners. “single”: The model is trained only with one participant’s dataset; “centralized”: The model is trained with a centralized set of image data; “federated”: The global model is trained with decentralized sets of image data in a federated manner. The detailed structures of the different FedOCR-Hash models, listed here in the same order as in Table 1, are shown in Table 1

4.2.2 Federated Learning for Scene Text Recognition.

Table 2 shows the detailed results on all testing datasets of ASTER-FL and the different FedOCR-Hash models in three manners of training. First, “single” training means that the model is trained only with one participant’s dataset. Second, “centralized” training means that the model is trained with a centralized set of image data. Third, “federated” training means that the model is trained with decentralized sets of image data in a federated manner. As shown in Table 2, the “federated” and “centralized” training results of all models are similar to each other and better than the “single” training results. In the “single” training manner, the image data available for training is limited, which causes poor recognition performance. In the “federated” training manner, however, we succeed in training a shared model collaboratively with decentralized sets of image data, without exchanging or exposing any image data to other participants. As expected, our FedOCR achieves results that are very close to those of the “centralized” training manner. Therefore, our FedOCR is effective in training a more robust model without centralizing the datasets held on different local devices.

Figure 3: Accuracy on IIIT5k versus number of uploaded megabytes of different models with limited transmitted bytes in federated learning

4.2.3 Communication Efficiency Improvement.

In Table 2, FedOCR-Hash shows accuracy comparable to that of ASTER-FL in the “federated” training manner. Owing to its lightweight backbone, FedOCR-Hash has fewer parameters than ASTER-FL, which benefits communication efficiency in federated learning. As shown in Fig. 3, FedOCR-Hash reaches a higher accuracy than ASTER-FL when only a small number of bytes has been uploaded.
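The link between parameter count and upload size can be made concrete with a back-of-the-envelope sketch. The parameter counts below are hypothetical placeholders, not the values of Table 1; the sketch further assumes 32-bit floating-point parameters and introduces `stored_fraction` as an illustrative knob standing in for any parameter compression.

```python
def upload_megabytes(num_params: int,
                     stored_fraction: float = 1.0,
                     bytes_per_param: int = 4) -> float:
    """Estimate the size (in MB, 10**6 bytes) of one model upload.

    stored_fraction is the share of parameters that is actually stored and
    transmitted (1.0 means no compression); it is an assumption of this sketch.
    """
    if not 0.0 < stored_fraction <= 1.0:
        raise ValueError("stored_fraction must lie in (0, 1]")
    stored_params = round(num_params * stored_fraction)
    return stored_params * bytes_per_param / 1_000_000


# Hypothetical parameter counts, chosen only to illustrate the arithmetic.
HYPOTHETICAL_MODELS = {
    "large_backbone": 20_000_000,
    "lightweight_backbone": 5_000_000,
}

costs = {name: upload_megabytes(n) for name, n in HYPOTHETICAL_MODELS.items()}
compressed_cost = upload_megabytes(HYPOTHETICAL_MODELS["lightweight_backbone"],
                                   stored_fraction=0.25)
print(costs, compressed_cost)
```

Because every participant uploads once per communication round, the per-upload saving is multiplied by both the number of rounds and the number of participants.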

Fig. 3 plots the accuracy of different models on IIIT5k against the number of uploaded bytes during federated training. FedOCR-Hash with a smaller compression ratio achieves higher accuracy when the number of uploaded bytes is limited, and thus shows a greater advantage in communication efficiency. This advantage will become more pronounced as more local clients participate in FedOCR. Considering both Tables 1 and 2, FedOCR-Hash with an appropriate compression ratio offers a strong overall trade-off between communication efficiency and accuracy in federated learning. Only megabytes are required to be transmitted by each participant, which results in faster parameter transmission at the same communication bandwidth.
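The role of the compression ratio can be illustrated with a simplified sketch in the spirit of the hashing trick [4]: a large “virtual” weight matrix is backed by a small shared parameter vector, and each virtual entry is mapped to one shared value by a hash function. This is not the FedOCR-Hash implementation. It omits the sign hash and the training procedure of [4], uses CRC32 as a stand-in hash, and all names are hypothetical.

```python
import zlib
from typing import List


def bucket_index(i: int, j: int, num_buckets: int, seed: int = 0) -> int:
    """Map virtual position (i, j) to a shared bucket with a deterministic hash."""
    key = f"{seed}:{i}:{j}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def expand_hashed_layer(shared: List[float], rows: int, cols: int,
                        seed: int = 0) -> List[List[float]]:
    """Rebuild the rows x cols virtual weight matrix from the shared parameters."""
    k = len(shared)
    return [[shared[bucket_index(i, j, k, seed)] for j in range(cols)]
            for i in range(rows)]


# 4 real parameters back a 4 x 4 virtual matrix of 16 weights.
shared = [0.1, -0.2, 0.3, 0.4]
virtual = expand_hashed_layer(shared, rows=4, cols=4)
compression_ratio = len(shared) / (4 * 4)
print(compression_ratio)
```

In a federated setting built this way, only the shared vector would need to be uploaded, since both sides can recompute the hash mapping from the same seed; a smaller compression ratio therefore means fewer transmitted bytes per round.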

Benefiting from lightweight models and hashing techniques, our federated scene text recognition framework achieves comparable accuracy together with advantages in communication efficiency. Considering the large number of participants and the unstable data transmission networks found in the real world, FedOCR has great potential for practical deployment.

5 Conclusion and Future Work

In this paper, we highlight the problem of data privacy in scene text recognition and use federated learning to address the difficulty of utilizing decentralized datasets distributed on local devices. To the best of our knowledge, we propose the first federated scene text recognition framework, named FedOCR. With FedOCR, we succeed in collaboratively training a shared text recognizer with decentralized datasets while avoiding violations of data privacy rules. Benefiting from lightweight models and hashing techniques, we substantially reduce communication costs and provide a higher level of privacy preservation against an honest-but-curious global server. Given the tremendous amount of decentralized real-world data available in practice, our communication-efficient federated learning framework for scene text recognition offers clear practical merits.

Recently, domain shift in scene text recognition has attracted great interest in academia, and several methods have been proposed, such as GA-DAN [41] and SSDAN [42]. Notably, domain shift also occurs in federated learning for scene text recognition, where it leads to a deterioration in the accuracy of the global model. Hence, as future work, we are working on domain adaptation for decentralized datasets within the framework of FedOCR.

References


  • [1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny (2014) Word spotting and recognition with embedded attributes. TPAMI 36 (12), pp. 2552–2566. Cited by: §1.
  • [2] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou (2018) Edit probability for scene text recognition. In CVPR, Cited by: §2.
  • [3] C. Bartz, J. Bethge, H. Yang, and C. Meinel (2019) KISS: keeping it simple for scene text recognition. arXiv preprint arXiv:1911.08400. Cited by: §1, §1.
  • [4] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In ICML, Cited by: §3.2.
  • [5] C. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, et al. (2019) Icdar2019 robust reading challenge on arbitrary-shaped text (rrc-art). arXiv preprint arXiv:1909.07145. Cited by: 10th item, §4.1.2.
  • [6] M. Fredrikson, S. Jha, and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In CCS, Cited by: §3.3.
  • [7] Y. Gao, Y. Chen, J. Wang, M. Tang, and H. Lu (2019) Reading scene text with fully convolutional sequence modeling. Neurocomputing 339, pp. 161–170. Cited by: §2.
  • [8] V. Goel, A. Mishra, K. Alahari, and C. Jawahar (2013) Whole is greater than sum of parts: recognizing scene text words. In ICDAR, Cited by: §1.
  • [9] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In CVPR, Cited by: §2, 2nd item, §4.1.1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.2.
  • [11] B. Hitaj, G. Ateniese, and F. Perez-Cruz (2017) Deep models under the gan: information leakage from collaborative deep learning. In CCS, Cited by: §3.3.
  • [12] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227. Cited by: §2, 1st item, §4.1.1.
  • [13] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In ICDAR, Cited by: 5th item, §4.1.2.
  • [14] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In ICDAR, Cited by: 4th item, §4.1.2.
  • [15] H. Li, P. Wang, C. Shen, and G. Zhang (2019) Show, attend and read: a simple and strong baseline for irregular text recognition. In AAAI, Cited by: §1, §1, §2.
  • [16] W. Liu, C. Chen, and K. K. Wong (2018) Char-net: a character-aware neural network for distorted scene text recognition. In AAAI, Cited by: §2.
  • [17] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu (2018) Squeezedtext: a real-time scene text recognition by binary convolutional encoder-decoder network. In AAAI, Cited by: §2.
  • [18] S. Long, X. He, and C. Yao (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: §2.
  • [19] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, et al. (2005) ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR 7 (2-3), pp. 105–122. Cited by: 3rd item, §4.1.2.
  • [20] C. Luo, L. Jin, and Z. Sun (2019) Moran: a multi-object rectified attention network for scene text recognition. PR 90, pp. 109–118. Cited by: §1, §1.
  • [21] J. Luo, X. Wu, Y. Luo, A. Huang, Y. Huang, Y. Liu, and Q. Yang (2019) Real-world image datasets for federated learning. arXiv preprint arXiv:1910.11089. Cited by: §2.
  • [22] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV, pp. 116–131. Cited by: §3.2.2, §4.1.4.
  • [23] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: §1, §2, §3.1.1, §3.1.2.
  • [24] A. Mishra, K. Alahari, and C. Jawahar (2012) Top-down and bottom-up cues for scene text recognition. In CVPR, Cited by: 3rd item, 6th item, §4.1.2.
  • [25] L. T. Phong, Y. Aono, T. Hayashi, L. Wang, and S. Moriai (2018) Privacy-preserving deep learning via additively homomorphic encryption. TIFS 13 (5), pp. 1333–1345. Cited by: §3.3.
  • [26] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan (2013) Recognizing text with perspective distortion in natural scenes. In ICCV, Cited by: 8th item.
  • [27] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani (2019) Fedpaq: a communication-efficient federated learning method with periodic averaging and quantization. arXiv preprint arXiv:1909.13014. Cited by: §2.
  • [28] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41 (18), pp. 8027–8048. Cited by: 9th item.
  • [29] B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI 39 (11), pp. 2298–2304. Cited by: §2.
  • [30] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) Aster: an attentional scene text recognizer with flexible rectification. TPAMI. Cited by: §3.2.2, §4.1.4.
  • [31] R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In CCS, Cited by: §3.3.
  • [32] B. Su and S. Lu (2014) Accurate scene text recognition based on recurrent neural network. In ACCV, Cited by: §1.
  • [33] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie (2016) Coco-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140. Cited by: 11th item, §4.1.2.
  • [34] P. Voigt and A. Von dem Bussche (2017) The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing. Cited by: §1.
  • [35] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In ICCV, Cited by: 7th item.
  • [36] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farhad, S. Jin, T. Q. Quek, and H. V. Poor (2019) Federated learning with differential privacy: algorithms and performance analysis. arXiv preprint arXiv:1911.00222. Cited by: §2.
  • [37] M. Yang, Y. Guan, M. Liao, X. He, K. Bian, S. Bai, C. Yao, and X. Bai (2019) Symmetry-constrained rectification network for scene text recognition. In ICCV, Cited by: §2.
  • [38] Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. TIST 10 (2), pp. 12. Cited by: §3.1.
  • [39] M. D. Zeiler (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.1.5.
  • [40] F. Zhan and S. Lu (2019) Esir: end-to-end scene text recognition via iterative image rectification. In CVPR, Cited by: §1, §1, §2.
  • [41] F. Zhan, C. Xue, and S. Lu (2019) GA-dan: geometry-aware domain adaptation network for scene text detection and recognition. In ICCV, Cited by: §5.
  • [42] Y. Zhang, S. Nie, W. Liu, X. Xu, D. Zhang, and H. T. Shen (2019) Sequence-to-sequence domain adaptation network for robust text image recognition. In CVPR, Cited by: §5.
  • [43] W. Zhu, M. Baust, Y. Cheng, S. Ourselin, M. J. Cardoso, and A. Feng (2019) Privacy-preserving federated brain tumour segmentation. In Machine Learning in Medical Imaging: 10th International Workshop, Cited by: §2.