I Introduction
Text extraction is an important problem with many applications in image processing and computer vision, such as optical character recognition, license plate detection, and road sign detection in autonomous driving. It can be very challenging when the background has complicated texture and a color distribution that overlaps with that of the text. Text extraction is usually accomplished in two steps: text region detection, which finds the regions where text is highly likely to be present [1], and text segmentation, which finds a binary mask showing the location of the text. Our main focus in this work is the second step: deriving a binary mask for text segmentation in a detected region containing text.

Different algorithms have been proposed in the past for text segmentation from images. In [2], Haffner et al. proposed an algorithm for text image segmentation and document compression using a hierarchical clustering approach. In [3], an algorithm is proposed for segmentation of text and graphics from screen content images and coding. In [4], Kumar et al. propose an algorithm for extracting text from document images using matched wavelet filters; they use a clustering-based technique to estimate globally matched wavelet filters from a collection of ground-truth images. In [5], Saha et al. proposed a text segmentation algorithm using the Hough transform. Various algorithms based on sparse decomposition have also been proposed, where text extraction is achieved by assuming a proper prior on the text component [6], [7]. There are also many other works based on histogram analysis, maximally stable extremal regions (MSER), and appearance [8]-[12]. The reader is referred to [17] for a good survey of text recognition.

In this work we look at this problem from a signal decomposition perspective. We assume that the foreground text $X_2$ and the background $X_1$ are combined to create an image $X$. But instead of assuming an additive model ($X = X_1 + X_2$) as in [6], [7], we consider an overlaying model, which is a more truthful characterization of images with overlaid text. Specifically, we assume any pixel value in the image comes either from the background component or from the foreground text. We can formulate this as:
$$X = (1 - W) \circ X_1 + W \circ X_2 \tag{1}$$
where $\circ$ denotes the element-wise product, and $W$ is the binary mask indicating foreground pixel locations.
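To make the difference between the overlaying model and the additive model concrete, the following minimal sketch composes a small synthetic image both ways. The function name `overlay`, the array names, and the toy 4x4 data are illustrative assumptions, not part of the original work.

```python
import numpy as np

def overlay(background, text, mask):
    """Overlay model: every pixel is taken from exactly one component."""
    m = mask.astype(background.dtype)
    return (1.0 - m) * background + m * text

# Toy data (assumed for illustration): random background, constant text value.
rng = np.random.default_rng(0)
background = rng.uniform(0.1, 1.0, size=(4, 4))
text = np.full((4, 4), 0.9)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # text occupies the central 2x2 patch

overlay_image = overlay(background, text, mask)
# Additive model for comparison: text is added on top of the background.
additive_image = background + mask * text
```

Under the overlay model the masked pixels carry the text value alone, whereas under the additive model they also contain the background value, which is why the two models call for different decomposition strategies.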
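If we have some prior knowledge about each component, we can solve the decomposition problem as the following optimization problem:
$$\min_{W, X_1, X_2} \; \phi_1(X_1) + \phi_2(X_2) + \phi_3(W) \quad \text{s.t.} \quad W_{ij} \in \{0,1\}, \;\; X = (1 - W) \circ X_1 + W \circ X_2 \tag{2}$$
where $\phi_1$, $\phi_2$, and $\phi_3$ are cost functions that should be minimized based on our prior knowledge about the background $X_1$, the foreground text $X_2$, and the mask $W$. After solving this optimization problem, $W$ gives us the location of the text. The problem formulation and its solution are discussed further in the next section.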
Figure 2 compares the segmentation results of the hierarchical k-means approach, sparse decomposition, and the proposed algorithm for a sample image. As we can see, each of the previous approaches has its own difficulties. For example, the clustering-based scheme has difficulty separating text from background when the text has a similar color to the background, and the sparse decomposition based model misses some parts of the text while detecting some parts of the background as text. The proposed algorithm, on the other hand, performs significantly better. This approach can also be used for segmentation of medical images [14] and for texture extraction from biometrics [15], [16].

The rest of this paper is structured as follows: Section II presents the problem formulation and the iterative algorithm to solve the optimization problem. The experimental results and performance analysis are provided in Section III, and the paper is concluded in Section IV.
II Problem Formulation
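As discussed earlier, we consider text as an overlaid component on the background image. Denoting the vectorized image, background, text, and mask by $x$, $x_1$, $x_2$, and $w$ respectively, we can write an image as the masked summation of the background and text:
$$x = (1 - w) \circ x_1 + w \circ x_2 \tag{3}$$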
To further simplify the problem, we assume that each component has a sparse representation using some properly designed dictionary or subspace. Therefore we can write this problem as:
$$x = (1 - w) \circ P_1 \alpha_1 + w \circ P_2 \alpha_2 \tag{4}$$
where $P_k$ ($k = 1, 2$) is an $N \times M_k$ matrix, each column of which is one of the basis functions of the corresponding subspace/dictionary, and $\alpha_k$ is the corresponding coefficient vector. Note that here we assume that both subspaces are known. In applications where the choice of subspaces/dictionaries is not clear, we can use dictionary/subspace learning algorithms to learn them [18]-[23].
Note that the model in (4) can be rewritten as:
$$x = (I - W) P_1 \alpha_1 + W P_2 \alpha_2 \tag{5}$$
where $W$ is a diagonal matrix with the vector $w$ on its main diagonal (i.e. $W = \mathrm{diag}(w)$). If a diagonal element is 1, the corresponding pixel belongs to the foreground, and otherwise to the background.

The decomposition problem in Eq. (4) is highly ill-posed. Therefore we need to impose some prior on each component, and also on $w$, to be able to perform this decomposition. We assume that each component has a sparse representation with respect to its own subspace, but not with respect to the other one. We also assume that the second component is sparse, which is very desirable for text. To promote sparsity of the second component, we add the $\ell_0$ norm of $w$ to the cost function (note that $w$ corresponds to the support of the second component).
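As a quick numerical sanity check of the rewriting step, the sketch below verifies that the element-wise masked model and the diagonal-matrix form give the same vector. The names `P1`, `P2`, `a1`, `a2`, `w` and the small random sizes are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m1, m2 = 16, 4, 3                     # toy sizes, assumed for illustration
P1 = rng.standard_normal((n, m1))        # background subspace (columns = basis)
P2 = rng.standard_normal((n, m2))        # text subspace
a1 = rng.standard_normal(m1)             # background coefficients
a2 = rng.standard_normal(m2)             # text coefficients
w = (rng.uniform(size=n) > 0.7).astype(float)  # binary mask as a 0/1 vector

# Element-wise (masked) form of the model.
x_elementwise = (1.0 - w) * (P1 @ a1) + w * (P2 @ a2)

# Diagonal-matrix form with W = diag(w).
W = np.diag(w)
x_matrix = (np.eye(n) - W) @ P1 @ a1 + W @ P2 @ a2
```

In practice the element-wise form is preferable, since it avoids building an n-by-n diagonal matrix for every block.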
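We can incorporate all these priors in an optimization problem as shown below:
$$\min_{w, \alpha_1, \alpha_2} \; \frac{1}{2} \left\| x - (I - W) P_1 \alpha_1 - W P_2 \alpha_2 \right\|_2^2 + \lambda \|w\|_0 \quad \text{s.t.} \quad w_i \in \{0,1\}, \;\; \|\alpha_1\|_0 \le K_1, \;\; \|\alpha_2\|_0 \le K_2 \tag{6}$$
where $\lambda$ is the weight of the sparsity term, and $K_1$ and $K_2$ bound the number of nonzero coefficients of each component. This problem is combinatorial and not tractable, both because of the $\ell_0$ term in the cost function and because of the binary nature of $w$. We relax these conditions so that the problem can be solved with an alternating optimization approach: we replace the $\ell_0$ norm of $w$ in the cost function with the $\ell_1$ norm, and relax the condition $w_i \in \{0,1\}$ to $w_i \in [0,1]$. We then get the following optimization problem:
$$\min_{w, \alpha_1, \alpha_2} \; \frac{1}{2} \left\| x - (I - W) P_1 \alpha_1 - W P_2 \alpha_2 \right\|_2^2 + \lambda \|w\|_1 \quad \text{s.t.} \quad w_i \in [0,1], \;\; \|\alpha_1\|_0 \le K_1, \;\; \|\alpha_2\|_0 \le K_2 \tag{7}$$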
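This problem can be solved with different approaches, such as majorization-minimization, the alternating direction method, and random sampling approaches [24]-[29]. Here we present an algorithm based on the alternating direction method, which solves this problem by updating one variable at a time, setting the gradient of the cost function ($F$) w.r.t. that variable to zero. The optimization steps with respect to $\alpha_1$ and $\alpha_2$ are symmetric, therefore we only show the solution for $\alpha_1$. We first ignore the sparsity constraint and solve the unconstrained problem, and then keep the $K_1$ largest components of $\alpha_1$ to satisfy the constraint. With $A_1 = (I - W) P_1$:
$$\frac{\partial F}{\partial \alpha_1} = -A_1^T \left( x - A_1 \alpha_1 - W P_2 \alpha_2 \right) = 0 \;\; \Rightarrow \;\; \hat{\alpha}_1 = \left( A_1^T A_1 \right)^{-1} A_1^T \left( x - W P_2 \alpha_2 \right)$$
Then by keeping the $K_1$ largest components of the above solution, we derive the update as $\alpha_1 \leftarrow \mathcal{H}_{K_1}(\hat{\alpha}_1)$, where $\mathcal{H}_K$ keeps the $K$ largest-magnitude entries and sets the rest to zero.

We also provide the update step of $w$ here. By using the equality $W P_k \alpha_k = \mathrm{diag}(P_k \alpha_k)\, w$, we can rewrite the optimization w.r.t. $w$ as below:
$$\min_{w} \; \frac{1}{2} \left\| x - P_1 \alpha_1 + \mathrm{diag}(P_1 \alpha_1)\, w - \mathrm{diag}(P_2 \alpha_2)\, w \right\|_2^2 + \lambda \|w\|_1 \quad \text{s.t.} \quad w_i \in [0,1]$$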
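First note that we can simplify the above optimization problem as below:
$$\min_{w} \; \frac{1}{2} \left\| y - C w \right\|_2^2 + \lambda \|w\|_1 \quad \text{s.t.} \quad w_i \in [0,1] \tag{8}$$
where $y = x - P_1 \alpha_1$, and $C = \mathrm{diag}(P_2 \alpha_2 - P_1 \alpha_1)$. We can first ignore the constraint, and then project the optimal solution of the cost function onto the feasible set ($w_i \in [0,1]$). Since $C$ is a diagonal matrix, the cost function decouples over the components of $w$ as:
$$\min_{w_i} \; \frac{1}{2} \left( y_i - c_i w_i \right)^2 + \lambda |w_i| \tag{9}$$
where $c_i$ denotes the $i$th diagonal element of $C$ (which is the $i$th element of the vector $P_2 \alpha_2 - P_1 \alpha_1$). This problem can be easily solved with soft-thresholding [30]:
$$w_i = \mathrm{soft}\!\left( \frac{y_i}{c_i}, \frac{\lambda}{c_i^2} \right) \tag{10}$$
where $\mathrm{soft}(\cdot, \cdot)$ denotes the soft-thresholding operator, applied element-wise and defined as:
$$\mathrm{soft}(u, t) = \mathrm{sign}(u) \max(|u| - t, 0)$$
Now if we denote the projection operator onto [0,1] by $\Pi_{[0,1]}$, then the solution of the $w$ update step would be:
$$w_i = \Pi_{[0,1]}\!\left( \mathrm{soft}\!\left( \frac{y_i}{c_i}, \frac{\lambda}{c_i^2} \right) \right) \tag{11}$$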
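The mask update described above (soft-thresholding followed by projection onto [0,1]) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function names, the argument names `bg` and `fg` (the current background and text reconstructions), and the guard `eps` for pixels where both reconstructions coincide are not taken from the original paper.

```python
import numpy as np

def soft_threshold(u, t):
    """Element-wise soft-thresholding: sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def update_mask(x, bg, fg, lam, eps=1e-8):
    """One mask update: per-pixel soft-thresholding, then clipping to [0, 1].

    x  : observed (vectorized) block
    bg : current background reconstruction
    fg : current text reconstruction
    """
    y = x - bg
    c = fg - bg
    usable = np.abs(c) > eps             # where c is ~0 the data term ignores w
    safe_c = np.where(usable, c, 1.0)    # avoid division by zero
    w = soft_threshold(y / safe_c, lam / safe_c**2)
    w = np.clip(w, 0.0, 1.0)             # projection onto [0, 1]
    return np.where(usable, w, 0.0)      # the l1 penalty alone drives w to 0

# Tiny example: first pixel matches the background, second matches the text.
bg = np.array([0.2, 0.2])
fg = np.array([0.9, 0.9])
x = np.array([0.2, 0.9])
w = update_mask(x, bg, fg, lam=0.0)
```

With a positive `lam`, the value at a text pixel shrinks from 1 by `lam / c**2`, so pixels where the two reconstructions are close are the first to be pushed towards the background.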
The overall algorithm is summarized in Algorithm 1.
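Since the algorithm listing itself is not reproduced here, the following is a hedged sketch of how the alternating updates could be put together: a least-squares fit followed by hard-thresholding for each coefficient vector, then the soft-threshold-and-clip mask update. The function name `masked_decomposition`, the update order, the default iteration count, and the final 0.5 binarization threshold are all assumptions rather than the authors' exact procedure.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[max(len(v) - k, 0):]
    out[keep] = v[keep]
    return out

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def masked_decomposition(x, P1, P2, k1, k2, lam, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.0, 1.0, size=x.shape[0])   # mask initialised in [0, 1]
    a1 = np.zeros(P1.shape[1])
    a2 = np.zeros(P2.shape[1])
    for _ in range(n_iter):
        A1 = (1.0 - w)[:, None] * P1             # (I - W) P1 without forming W
        A2 = w[:, None] * P2                     # W P2
        # Coefficient updates: unconstrained least squares, then keep k largest.
        a1 = hard_threshold(np.linalg.lstsq(A1, x - A2 @ a2, rcond=None)[0], k1)
        a2 = hard_threshold(np.linalg.lstsq(A2, x - A1 @ a1, rcond=None)[0], k2)
        # Mask update: per-pixel soft-thresholding, then projection onto [0, 1].
        bg, fg = P1 @ a1, P2 @ a2
        y, c = x - bg, fg - bg
        usable = np.abs(c) > 1e-8
        safe_c = np.where(usable, c, 1.0)
        w_new = np.clip(soft_threshold(y / safe_c, lam / safe_c**2), 0.0, 1.0)
        w = np.where(usable, w_new, 0.0)
    return a1, a2, w

# Toy run on random data (sizes are illustrative only).
rng = np.random.default_rng(1)
P1 = rng.standard_normal((64, 5))
P2 = rng.standard_normal((64, 3))
x = rng.standard_normal(64)
a1, a2, w = masked_decomposition(x, P1, P2, k1=3, k2=2, lam=0.01)
mask = w > 0.5   # binarise once at the end; the 0.5 threshold is an assumption
```

The sketch never forms the diagonal matrix explicitly; multiplying rows by `w` or `1 - w` is equivalent and much cheaper for 4096-dimensional blocks.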
III Experimental Results
In this section we provide the experimental study on the application of the proposed algorithm for text extraction from several challenging images. These images are manually generated by adding text on top of a relatively complicated background. Our dataset contains more than 300 image blocks.
We first define the parameters of our model. We apply our algorithm on blocks of 64x64 pixels. We first convert each block into a vector of dimension 4096, and then use the proposed algorithm for background-foreground separation. For the first component we use a 40-dimensional low-frequency DCT subspace [31], and for the second component we use a 10-dimensional Hadamard subspace [32]. The sparsity levels of the coefficients are chosen as and . The weight parameter for the sparsity term is chosen to be , which is tuned by testing on a validation set. The number of iterations for the optimization algorithm is chosen to be , and $w$ is initialized with a uniform random variable in [0,1].
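For readers who want to reproduce the setup, the sketch below shows one plausible way to build a 40-dimensional low-frequency 2D DCT subspace and a 10-dimensional Hadamard subspace for vectorized 64x64 blocks. The helper names, the separable construction, and in particular the rule used to select which atoms to keep (lowest index sum first) are assumptions; the paper does not specify these details here.

```python
import numpy as np

def dct_1d(n):
    """Orthonormal DCT-II basis; row k is the k-th frequency."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (i + 0.5) * k / n)
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def hadamard_1d(n):
    """Orthonormal Sylvester-type Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def separable_subspace(basis_1d, dim):
    """Stack `dim` separable 2D atoms (outer products of 1D rows) as columns.

    Atoms are ordered by the sum of their row indices (an assumed selection rule).
    """
    n = basis_1d.shape[0]
    pairs = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda p: (p[0] + p[1], p[0]))[:dim]
    atoms = [np.outer(basis_1d[u], basis_1d[v]).ravel() for u, v in pairs]
    return np.stack(atoms, axis=1)

P1 = separable_subspace(dct_1d(64), 40)        # 4096 x 40 background subspace
P2 = separable_subspace(hadamard_1d(64), 10)   # 4096 x 10 text subspace
```

Both matrices have orthonormal columns, because outer products of orthonormal 1D basis vectors are themselves orthonormal when flattened.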
We now discuss two possible ways to perform the binarization. As mentioned earlier, the goal is to solve the binary optimization problem, but to make it tractable, we approximate the binary variables with continuous variables in [0,1], and binarize them after solving the optimization problem. In Algorithm 1, we first solve the optimization problem in Eq. (7) for a continuous $w$, and then binarize $w$ at the very end. An alternative approach is to binarize the variables after each update of $w$ in Algorithm 1. We tested both approaches on two of the test images, and provide the results in Figure 2. As we can see, doing the binarization at the very end usually works better. One possible reason is that with the second approach, the final solution is very sensitive to the initialization.

We now compare the proposed algorithm with prior approaches to text extraction. We compare against three previous algorithms: hierarchical k-means clustering [2], shape primitive extraction and coding (SPEC) [3], and the sparsity-based approach [7]. Figure 3 shows the results of the proposed algorithm and the previous approaches for 3 sample images from our dataset. As can be seen, the proposed algorithm achieves more accurate results than the previous methods.
We also report the average precision, recall and F1 score [33] achieved by the different algorithms for the above sample images; they are given in Table 1. The precision, recall and F1 score are defined in Eq. (12) and (13), where TP, FP and FN denote the number of true positives, false positives and false negatives respectively. In our evaluation, we treat the text pixels as positive. A pixel that is correctly identified as text (compared to the ground truth) is considered a true positive.
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \tag{12}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$
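The pixel-level metrics can be computed directly from a predicted mask and a ground-truth mask. The sketch below is a minimal illustration; the function names and the convention of returning 0 when a denominator is zero are assumptions.

```python
import numpy as np

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0

def segmentation_scores(pred, truth):
    """Pixel-level precision, recall and F1, with text pixels as positives."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # text predicted as text
    fp = np.count_nonzero(pred & ~truth)   # background predicted as text
    fn = np.count_nonzero(~pred & truth)   # text predicted as background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy example: one true positive, one false positive, one false negative.
scores = segmentation_scores([1, 1, 0, 0], [1, 0, 1, 0])
```

As a consistency check, `f1_from(0.95, 0.925)` evaluates to about 0.937, matching the F1 score reported for the proposed algorithm from its precision and recall.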
| Segmentation Algorithm | Precision | Recall | F1 score |
|---|---|---|---|
| SPEC [3] | 67% | 77% | 71.6% |
| Hierarchical Clustering [2] | 66.5% | 92% | 77.2% |
| Sparse Dec. with TV [7] | 71% | 91.7% | 80% |
| The proposed algorithm | 95% | 92.5% | 93.7% |
As can be seen, the proposed scheme achieves much higher precision than SPEC, hierarchical k-means clustering and the sparse decomposition approach (95% versus at most 71%), with a recall that is slightly higher than the best of the previous methods (92.5% versus 92%). It is worth mentioning that our algorithm is fairly robust to the initial values of the variables. A complete study of the impact of initialization on the segmentation result is presented in [34].

IV Conclusion
This paper proposes a text extraction algorithm from a signal decomposition perspective. We consider text as an overlaying component on top of a natural scene image, where the pixel value at each point comes from one and only one of the components (in contrast with the traditional signal decomposition case, where the signal at each point is assumed to be the sum of the corresponding values from the different components). Each component is assumed to have a sparse representation with respect to a suitable subspace. The text component is also assumed to be sparse. We then propose an optimization framework to separate the background and text components using the alternating direction method. Experimental results show that the proposed algorithm provides significantly better text extraction when the background is textured and has a color distribution similar to that of the text. This algorithm could be further improved by learning the subspaces for the desired application.
Acknowledgment
The authors would like to thank Ivan Selesnick, Pablo Sprechmann, and Arian Maleki for their valuable comments regarding this work.
References
[1] S Tian, Y Pan, C Huang, S Lu, K Yu, C Tan, “Text flow: A unified text detection system in natural scene images”, IEEE International Conference on Computer Vision, 2015.
[2] P Haffner, PG Howard, P Simard, Y Bengio, Y Lecun, “High quality document image compression with DjVu”, Journal of Electronic Imaging, 7(3), 410-425, 1998.
[3] T Lin, P Hao, “Compound image compression for real-time computer screen image transmission”, IEEE Transactions on Image Processing, 14(8), 993-1005, 2005.
[4] S Kumar, R Gupta, N Khanna, S Chaudhury, S Joshi, “Text extraction and document image segmentation using matched wavelets and MRF model”, IEEE Transactions on Image Processing, 2007.
[5] S Saha, S Basu, M Nasipuri, DK Basu, “A Hough transform based technique for text segmentation”, Journal of Computing, 2010.
[6] TV Hoang, S Tabbone, “Text extraction from graphical document images using sparse representation”, Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, ACM, 2010.
[7] S Minaee, Y Wang, “Screen content image segmentation using robust regression and sparse decomposition”, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 6.4: 573-584, 2016.
[8] V Khare, P Shivakumara, P Raveendran, “A new Histogram Oriented Moments descriptor for multi-oriented moving text detection in video”, Expert Systems with Applications, 2015.
[9] L Gomez, D Karatzas, “MSER-based real-time text detection and tracking”, International Conference on Pattern Recognition, IEEE, 2014.
[10] C Yi, Y Tian, “Text extraction from scene images by character appearance and structure modeling”, Computer Vision and Image Understanding, 117.2: 182-194, 2013.
[11] L Gomez-Bigorda, D Karatzas, “TextProposals: A text-specific selective search algorithm for word spotting in the wild”, arXiv preprint arXiv:1604.02619, 2016.
[12] Y Zhu, C Yao, X Bai, “Scene text detection and recognition: Recent advances and future trends”, Frontiers of Computer Science, 2016.
[13] S Minaee, Y Wang, “Screen content image segmentation using sparse decomposition and total variation minimization”, International Conference on Image Processing, IEEE, 2016.
[14] MP Hosseini, MR Nazem-Zadeh, D Pompili, H Soltanian, “Statistical validation of automatic methods for hippocampus segmentation in MR images of epileptic patients”, Engineering in Medicine and Biology Society (EMBC), IEEE, 2014.
[15] S Minaee, Y Wang, “Fingerprint recognition using translation invariant scattering network”, Signal Processing in Medicine and Biology Symposium, IEEE, 2015.
[16] S Minaee, A Abdolrashidi, Y Wang, “Iris recognition using scattering transform and textural features”, Signal Processing Education Workshop, IEEE, 2015.
[17] Q Ye, D Doermann, “Text detection and recognition in imagery: A survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence: 1480-1500, 2015.
[18] X Shu, F Porikli, N Ahuja, “Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices”, IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[19] M Rahmani, G Atia, “A subspace learning approach for high dimensional matrix decomposition with efficient column/row sampling”, International Conference on Machine Learning, 2016.
[20] M Aharon, M Elad, A Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation”, IEEE Transactions on Signal Processing, 54(11), 4311-4322, 2006.
[21] J Mairal, F Bach, J Ponce, G Sapiro, “Online dictionary learning for sparse coding”, Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009.
[22] S Minaee, Y Wang, “Subspace Learning in The Presence of Sparse Structured Outliers and Noise”, International Symposium on Circuits and Systems, IEEE, 2017.
[23] A Taalimi, H Shams, A Rahimpour, R Khorsandi, W Wang, R Guo, H Qi, “Multimodal weighted dictionary learning”, Advanced Video and Signal Based Surveillance, IEEE, 2016.
[24] S Boyd, N Parikh, E Chu, B Peleato, J Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers”, Foundations and Trends in Machine Learning, 3(1), 1-122, 2011.
[25] DP Bertsekas, “Nonlinear programming”, Belmont: Athena Scientific, 1999.
[26] F Bach, R Jenatton, J Mairal, G Obozinski, “Convex optimization with sparsity-inducing norms”, Optimization for Machine Learning, 2011.
[27] PL Combettes, VR Wajs, “Signal recovery by proximal forward-backward splitting”, Multiscale Modeling and Simulation, 4(4), 1168-1200, 2005.
[28] MA Fischler, RC Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Communications of the ACM, 24: 381-395, 1981.
[29] M Zuliani, CS Kenney, BS Manjunath, “The multiRANSAC algorithm and its application to detect planar homographies”, IEEE International Conference on Image Processing, IEEE, 2005.
[30] D Donoho, “De-noising by soft-thresholding”, IEEE Transactions on Information Theory, 41.3: 613-627, 1995.
[31] AB Watson, “Image compression using the discrete cosine transform”, Mathematica Journal, 4.1: 81, 1994.
[32] WK Pratt, J Kane, HC Andrews, “Hadamard transform image coding”, Proceedings of the IEEE, 57.1: 58-68, 1969.
[33] DM Powers, “Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation”, 2011.
[34] S Minaee, Y Wang, “Masked Signal Decomposition Using Subspace Representation and Its Applications”, arXiv preprint arXiv:1704.07711, 2017.