Understanding text in images shared on platforms such as Facebook and Instagram, along with the context in which it appears, makes it possible to proactively identify inappropriate or harmful content and keep our community safe. While over the years we had gotten good at handling policy-violating text composed in posts and captions, we were exposed to hate speech, clickbait, policy-violating ads, and other low-quality content that manifested as part of an image. Motivated by this problem, Rosetta was built to extract overlaid and scene text from images and video frames and identify policy-violating content. Rosetta extracts text from more than a billion public Facebook and Instagram images and video frames (in a wide variety of languages), daily and in real time, and feeds the output to downstream classifiers to understand the context of the text and the image together.
Rosetta employs a two-step approach. The text detection model, based on Faster-RCNN with the ResNet convolutional body replaced with a ShuffleNet-based architecture, is responsible for detecting regions of the image that contain words. Each detected region is cropped and fed to a fully-convolutional character-based text recognition model to extract the word. Widely used object detection models such as Faster-RCNN are designed for generic cases and thus output rectangular bounding boxes. For scene-text extraction, this is insufficient, as text may come in arbitrary orientations while the second-stage text recognition model typically expects an image patch with horizontal text. According to an error analysis on Rosetta soon after it was deployed, oriented text was the most common source of mistakes, with rotations as small as 20° failing to be recognized correctly. Therefore, it’s important to enable the system to correctly handle rotated text among all the adversarial cases we might face.
In traditional Faster-RCNN detection, while most words are detected with a bounding box, we cannot correctly infer the actual words due to the lack of textual orientation information (Figure 1). One solution is to apply a trained Spatial Transformer Network on the rectangular patch, which benefits from being a standalone module. Another approach is to predict an oriented bounding box in the detection stage, which benefits from being trainable end-to-end. The two approaches can be combined if needed. We follow the idea of RRPN (Ma et al.) and replace the Region Proposal Network (RPN) in our Faster-RCNN-based detection pipeline with RRPN. There are a few differences in our implementation compared to the original RRPN:
RoI transformation: We use Rotated Region of Interest Align (RRoI-Align), which applies bilinear interpolation, instead of Rotated Region of Interest Pooling (RRoI-Pooling) as the RoI transformation, to avoid misaligned results on the boundaries due to rounding.
Boundary-breaking anchors: The original RRPN filters out boundary-breaking R-anchors during training, and its authors had to use a border padding of 0.25 times each side to retain more positive proposals. Our approach automatically performs on-demand padding in the RRoI-Align operator as well as in the bounding box transformation between the detection and recognition pipelines.
Orientation coverage: Due to trade-offs between orientation coverage and computational efficiency, the original RRPN restricts the angle range to [-45, 135) degrees. This range is not symmetric about zero, so text rotated by an angle just outside the range in one direction is treated as text rotated by the equivalent angle 180 degrees away. Therefore, we use a more natural orientation coverage of [-90, 90] degrees so that any text that is not rotated by more than 90 degrees can be correctly identified. The RPN anchors are chosen accordingly.
2 Rotational Region Proposal Network
2.1 Rotated Box Representation
Traditionally, a non-rotated bounding box is represented as $(x, y, w, h)$, where $(x, y)$ is the top-left corner of the box and $w$ and $h$ are its width and height, or as $(x_1, y_1, x_2, y_2)$, where $(x_1, y_1)$ is the top-left corner and $(x_2, y_2)$ is the bottom-right corner of the box. RRPN uses $(x, y, w, h, \theta)$ as the representation, where $(x, y)$ represents the geometric center of the bounding box. The orientation $\theta$ is the angle from the positive direction of the x-axis to the direction parallel to the width of the rotated box. We follow a similar representation.
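To make the five-parameter representation concrete, the following minimal sketch converts a rotated box into its four corner points. The function name and the counter-clockwise sign convention (in a y-up coordinate system) are assumptions made for illustration; an actual pipeline may use image coordinates with the y-axis pointing down, which flips the visual direction of rotation.

```python
import math


def rotated_box_corners(cx, cy, w, h, angle_deg):
    """Return the four corners of a rotated box as (x, y) tuples.

    Assumes angle_deg is measured from the positive x-axis to the
    direction parallel to the box width (hypothetical convention).
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Half-extent vectors along the width and height directions.
    wx, wy = 0.5 * w * cos_t, 0.5 * w * sin_t
    hx, hy = -0.5 * h * sin_t, 0.5 * h * cos_t
    return [
        (cx - wx - hx, cy - wy - hy),
        (cx + wx - hx, cy + wy - hy),
        (cx + wx + hx, cy + wy + hy),
        (cx - wx + hx, cy - wy + hy),
    ]
```

With an angle of 0 the corners reduce to the usual axis-aligned rectangle centered at (cx, cy); with an angle of 90 degrees the width runs along the y-axis.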
2.2 Rotated Anchors
To fit objects of different sizes, RPN uses two parameters to control the size and shape of anchors: scale and aspect ratio. For rotated text detection, anchor generation should include orientation as well. The original RRPN defines its own sets of scale, aspect-ratio, and angle anchors (its scale anchors map to an equivalent set in our pipeline through a scaling multiplier). Apart from citing the trade-off between orientation coverage and computational efficiency as the reason for using half of the angle space (180 degrees), it gives no explanation for its specific cropping angle, which causes boxes on the far side of that angle to be treated as text rotated by 180 degrees from their actual orientation. From both theoretical and empirical perspectives, we believe the reason is that angle convergence becomes unstable near the cropping angles, so the authors placed the boundary away from right angles. In our latest experiments, by contrast, we use the same scales and aspect ratios as our non-rotated model (5 scale anchors and 7 aspect-ratio anchors), while the angle anchors lie between -90 and 90 degrees. While the range of [-90, 90] degrees is used for now, our pipeline also supports detecting arbitrary angles (covering all 360 degrees) with a config change. The extra angle dimension multiplies the number of anchors by the number of angle anchors and thus increases memory usage during training by a non-negligible amount, which we handle by using in-place operators.
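The multiplicative effect of the angle dimension on the anchor count can be illustrated with a minimal sketch. The function name, the base size, and all scale, aspect-ratio, and angle values below are hypothetical placeholders, not the values used in our pipeline or in the original RRPN.

```python
import itertools
import math


def generate_rotated_anchors(base_size, scales, aspect_ratios, angles_deg):
    """Enumerate (cx, cy, w, h, angle) anchors centered at the origin.

    Each anchor has area (base_size * scale) ** 2 and h / w == aspect_ratio.
    """
    anchors = []
    for scale, ratio, angle in itertools.product(scales, aspect_ratios, angles_deg):
        area = (base_size * scale) ** 2
        w = math.sqrt(area / ratio)
        h = w * ratio
        anchors.append((0.0, 0.0, w, h, float(angle)))
    return anchors


# Hypothetical values, for illustration only.
anchors = generate_rotated_anchors(
    base_size=16,
    scales=[1, 2],
    aspect_ratios=[0.5, 1.0, 2.0],
    angles_deg=[-60, -30, 0, 30, 60],
)
print(len(anchors))  # 2 scales * 3 ratios * 5 angles = 30
```

Every additional angle anchor adds one full copy of the scale and aspect-ratio grid at each feature-map location, which is why the memory cost during training grows noticeably.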
2.3 Rotated RoI Transformation
Given the region proposals, we’d like to obtain a fixed-size feature map for each of them. Operators like RoI Pooling allow us to achieve this goal. In the RRPN paper, the authors extended RoI Pooling to RRoI Pooling by applying a rotation transformation. However, quantization in RoI and RRoI Pooling creates misaligned results on the boundaries. To handle this, we implemented a Rotated RoI Align operator that applies bilinear interpolation.
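The idea behind a rotated RoI Align can be sketched in plain Python as follows. This is a simplified single-channel illustration under stated assumptions, not the Caffe2 operator: the box is assumed to be given directly in feature-map coordinates, the feature value at index (row, col) is assumed to sit at coordinates (col, row), and the function names and the `samples` parameter are hypothetical.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D list `feature` at continuous (x, y)."""
    height, width = len(feature), len(feature[0])
    # Sample points well outside the map contribute zero.
    if x < -1.0 or x > width or y < -1.0 or y > height:
        return 0.0
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    lx, ly = x - x0, y - y0
    top = feature[y0][x0] * (1 - lx) + feature[y0][x1] * lx
    bottom = feature[y1][x0] * (1 - lx) + feature[y1][x1] * lx
    return top * (1 - ly) + bottom * ly


def rotated_roi_align(feature, box, out_h, out_w, samples=2):
    """Pool a rotated box (cx, cy, w, h, angle_deg) into an out_h x out_w grid."""
    cx, cy, w, h, angle_deg = box
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    bin_w, bin_h = w / out_w, h / out_h
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Sample position in the box's own frame, relative to its center.
                    u = -w / 2 + (j + (sx + 0.5) / samples) * bin_w
                    v = -h / 2 + (i + (sy + 0.5) / samples) * bin_h
                    # Rotate into feature-map coordinates; no rounding is applied.
                    x = cx + u * cos_t - v * sin_t
                    y = cy + u * sin_t + v * cos_t
                    total += bilinear(feature, x, y)
            row.append(total / (samples * samples))
        output.append(row)
    return output
```

Because the sample positions are never rounded to integer indices, the pooled values vary smoothly with the box parameters, which is the property that pooling with quantization lacks.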
3 Proximity Estimation for Rotated Box
Assuming we have two proposals, how should we evaluate which one is closer to the ground truth? This is the key question for proposal selection as well as non-maximum suppression (NMS). For horizontal boxes, it’s sufficient to use the IoU (intersection over union) between two boxes. However, IoU alone is not enough to estimate the proximity of rotated boxes. A given horizontal box can be represented in up to four ways, as a box rotated by 0, 90, 180, or 270 degrees (with width and height swapped where needed). These boxes occupy exactly the same pixels of the image, but they mean different things in OCR when processed by the text-recognition model: a letter may be recognized as a different character depending on which of these angles the predicted box carries. Therefore, the angle difference between the rotated boxes needs to be taken into account. We apply the same strategy as RRPN: positive R-anchors are those with the highest IoU overlap or an IoU above a positive threshold with respect to the ground truth, and an intersection angle with respect to the ground truth below an angle threshold. Negative R-anchors are those with an IoU below a negative threshold, or an IoU above the positive threshold but with an intersection angle with the ground truth exceeding the angle threshold. Regions belonging to neither the positive nor the negative set are not used during training.
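The labeling rule can be written down as a short sketch. It assumes the rotated IoU has already been computed elsewhere, and the default thresholds (0.7, 0.3, and 15 degrees) are placeholder values following those commonly cited for the original RRPN; they are assumptions here, not confirmed settings of our pipeline. The function names are hypothetical.

```python
def angle_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in [0, 180] degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def label_anchor(iou, anchor_angle, gt_angle, is_best_match=False,
                 pos_iou=0.7, neg_iou=0.3, max_angle=15.0):
    """Return 1 (positive), 0 (negative), or -1 (ignored during training).

    Threshold defaults are assumed placeholder values.
    """
    diff = angle_difference(anchor_angle, gt_angle)
    # Positive: good overlap (or best overlap) AND a small angle difference.
    if (is_best_match or iou > pos_iou) and diff < max_angle:
        return 1
    # Negative: poor overlap, or good overlap with the wrong orientation.
    if iou < neg_iou or (iou > pos_iou and diff > max_angle):
        return 0
    return -1
```

Note that an anchor covering exactly the same pixels as the ground truth (IoU of 1) but rotated by 90 degrees is labeled negative, which is the behavior IoU alone cannot provide.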
While the key concept is not complicated, the change to support the 5-dimensional bounding box representation without breaking the current pipeline turned out to be pretty invasive. Besides the core changes mentioned above, various efforts were made in supporting training with mixed non-rotated and rotated datasets to bias the model towards the predominant scenario of non-rotated text without sacrificing too much accuracy, on-the-fly rotated data augmentation, and supporting evaluation of rotated boxes.
For efficient inference, we perform int8 quantization on the model to reduce runtime and memory-bandwidth usage. We also implemented efficient CPU versions of the Caffe2 operators that handle rotated boxes, including RotatedRoIAlign, BBoxTransform, Non-Maximum Suppression, GenerateProposals and BoxWithNMSLimit, leading to no increase in inference time compared to the baseline non-rotated model. These Caffe2 operators have been open-sourced at https://github.com/pytorch/pytorch/blob/master/caffe2/operators.
Finally, we need to perform a rotated box transformation in order to generate the image patch based on the predicted rotated box parameters and feed it to the recognition pipeline. Traditional Faster-RCNN clips any boxes that might overflow image boundaries, but doing so for rotated boxes isn’t straightforward and would lead to cutting out some characters, since the detection output is still a rectangle as opposed to a polygon. To handle this, we pad the original image until the boxes no longer overflow the boundaries (Figure 2), in the following steps: First, we use the horizontal bounding rectangle of the predicted rotated box to crop out the region from the original image. Then, we add zero-padding so that the center of the image patch is the same as that of the predicted rotated box. Finally, we use the warp-affine transformation in OpenCV to simultaneously rotate the patch and crop out the region of interest with the correct width and height from the prediction.
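The geometry of the first two steps can be sketched without any image library. The following minimal example computes the horizontal bounding rectangle of a rotated box and the amount of zero-padding needed on each side so that the rectangle fits inside the image. The function name and the return layout are hypothetical, and the sketch only covers the bookkeeping, not the pixel operations.

```python
import math


def enclosing_rect_and_padding(box, img_w, img_h):
    """Return the axis-aligned rectangle enclosing a rotated box, plus the
    zero-padding (left, top, right, bottom) needed for it to fit the image."""
    cx, cy, w, h, angle_deg = box
    theta = math.radians(angle_deg)
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))
    # Half-extents of the horizontal bounding rectangle of the rotated box.
    half_w = 0.5 * (w * cos_t + h * sin_t)
    half_h = 0.5 * (w * sin_t + h * cos_t)
    x0, y0 = math.floor(cx - half_w), math.floor(cy - half_h)
    x1, y1 = math.ceil(cx + half_w), math.ceil(cy + half_h)
    # Pad only on the sides where the rectangle overflows the image.
    pad = (max(0, -x0), max(0, -y0), max(0, x1 - img_w), max(0, y1 - img_h))
    return (x0, y0, x1, y1), pad
```

After padding and cropping, the final rotate-and-crop step would typically be done with OpenCV's `cv2.getRotationMatrix2D` and `cv2.warpAffine`, using the patch center as the rotation center and the predicted width and height as the output size.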
4.1 Qualitative Results
Using the RRPN approach, we can correctly predict the rotated bounding box for the example from the beginning (the image with its text detection and recognition result is shown on the right of Figure 1).
4.2 Quantitative Evaluation
We performed end-to-end evaluation on two datasets, Rotated (with uniform random rotation) and Non-rotated, using five models: (1) a baseline model without RRPN; (2) an RRPN model trained on a rotation-augmented multilingual dataset; (3) an RRPN model after int8 quantization; (4) a differently configured RRPN model trained on a rotation-augmented multilingual dataset; (5) an RRPN model after int8 quantization. All of them are trained for approximately the same number of iterations. Results are summarized in Figure 3. The RRPN models perform significantly better on the Rotated dataset, while performing slightly worse on the Non-rotated dataset. It would be more interesting to train for as many iterations as possible (for example, an RRPN model should be trained for proportionally more iterations so that it sees as many non-rotated examples as the baseline non-rotated model) and evaluate the results. If the results show a significant trade-off between the models, it might be evidence that the model capacity has been reached. We also compared the latency of the int8-quantized rotated model with that of the baseline non-rotated model during inference and found no significant increase, which means we can process images at a throughput similar to before.
We adapted the Rotation Region Proposal Network for detecting oriented text and made several improvements to it. This further improves the quality of Rosetta, the OCR system at Facebook, in protecting against adversarial text in images, including engagement bait, policy-violating ads, and images with profanity and hate speech, and it also improves the accuracy of screen readers to make Facebook more accessible for the visually impaired. In the future, this framework could be extended to more adversarial scenarios such as text with general affine transformations.
The authors would like to thank Albert Gordo, Peizhao Zhang, Manohar Paluri, Tyler Matthews, Mahalia Miller and others who contributed, supported and collaborated with us during the development and deployment of our system.
- F. Borisyuk, A. Gordo, and V. Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.
- J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.