Lukas Tuggener Ismail Elezi Jürgen Schmidhuber Thilo Stadelmann
ZHAW Datalab, Zurich University of Applied Sciences, Winterthur, Switzerland
Dept. of Environmental Sciences, Informatics and Statistics, Ca’Foscari University of Venice, Italy
Faculty of Informatics, Università della Svizzera Italiana, Lugano, Switzerland
1 Introduction and Problem Statement
The goal of Optical Music Recognition (OMR) is to transform images of printed or handwritten music scores into machine-readable form, thereby understanding the semantic meaning of music notation. It is an important and actively researched area within the music information retrieval community. The two main challenges of OMR are: first, the accurate detection and classification of music objects in digital images; and second, the reconstruction of valid music in some digital format. This work focuses solely on the first task.
Recent progress in computer vision, thanks to the adaptation of convolutional neural networks (CNNs) [8, 15], provides a solid foundation for the assumption that OMR systems can be drastically improved by using CNNs as well. Initial results of applying deep learning [14, 27] to heavily restricted settings such as staff-line removal, symbol classification, or end-to-end OMR for monophonic scores support such expectations.
In this paper, we introduce a novel general object detection method called Deep Watershed Detector (DWD), motivated by the following two hypotheses: a) deep learning can be used to overcome the classical OMR approach of hand-crafted pipelines with many preprocessing steps by operating in a fully data-driven fashion; b) deep learning can cope with larger, more complex inputs than simple glyphs, thereby learning to recognize musical symbols in their context. This disambiguates meanings (e.g., between staccato and augmentation dots) and allows the system to directly detect a complex alphabet.
DWD operates on full pages of music scores in one pass, without any preprocessing besides interline normalization, and detects handwritten and digitally rendered music symbols without any restriction on the alphabet of symbols to be detected. We further show that it learns a meaningful representation of music notation and achieves state-of-the-art detection rates on common symbols.
The remaining structure of this paper is as follows: Sec. 2 puts our approach in context with existing methods; in Sec. 3 we derive our original end-to-end model and give a detailed explanation of how we use the deep watershed transform for the task of object recognition; Sec. 4 reports on experimental results of our system on the digitally rendered DeepScores dataset as well as on the handwritten MUSCIMA++ dataset, before we conclude in Sec. 5 with a discussion and pointers for future research.
2 Related Work
The visual detection and recognition of objects is one of the most central problems in the field of computer vision. With the recent developments of CNNs, many competing CNN-based approaches have been proposed to solve the problem. R-CNNs, and in particular their successors, are generally considered to be state-of-the-art models in object recognition, and many deployed recognition systems are R-CNN based. On the other hand, researchers have also proposed models which are tailored towards computational efficiency instead of detection accuracy. YOLO systems and Single-Shot Detectors, while slightly compromising on accuracy, are significantly faster than R-CNN models and can even achieve super real-time performance.
A common aspect of the above-mentioned methods is that they are specifically developed for cases where the images are relatively small and contain a small number of relatively large objects [7, 18]. On the contrary, musical sheets usually have high resolution and contain a very large number of very small objects, making the mentioned methods unsuitable for the task.
The watershed transform is a well-understood method that has been applied to segmentation for decades. Bai and Urtasun were the first to propose combining the strengths of deep learning with the power of this classical method. They proposed to directly learn the energy for the watershed transform such that all dividing ridges are at the same height. As a consequence, the components can be extracted by a cut at a single energy level without leading to over-segmentation. The model has been shown to achieve state-of-the-art performance on object segmentation.
For the most part, OMR detectors have been rule-based systems working well only within a hard set of constraints. Typically, they require domain knowledge and work well only on simple typeset music scores with a known music font and a relatively small number of classes. When faced with low-quality images, complex or even handwritten scores, the performance of these models quickly degrades, to some degree because errors propagate from one step to the next. Additionally, it is not clear what to do when the classes change, and in many cases this requires building a new model from scratch.
In response to the above-mentioned issues, some deep learning based, data-driven approaches have been developed. Hajic and Pecina proposed an adaptation of Faster R-CNN with a custom region proposal mechanism based on the morphological skeleton to accurately detect noteheads, while Choi et al. were able to detect accidentals in dense piano scores with high accuracy, given previously detected noteheads that are used as input features to the network. A big limitation of both approaches is that the experiments have been done only on a tiny vocabulary of musical symbols, and therefore their scalability remains an open question.
To our knowledge, the best results so far have been reported in the work of Pacha and Choi, who explored many models on the MUSCIMA++ dataset of handwritten music notation. They obtained the best results with a Faster R-CNN model, achieving an impressive score on the standard mAP metric. A serious limitation of that work is that the system was not designed in an end-to-end fashion and needs heavy pre- and post-processing. In particular, they cropped the images in a context-sensitive way, by cutting images first vertically and then horizontally, such that each image contains exactly one staff and has a width-to-height ratio of no more than 1, with some horizontal overlap to adjacent slices. In practice, this means that all objects significantly exceeding the size of such a cropped region will appear neither in the training nor in the testing data, as only annotations whose intersection-over-area between the object and the cropped region lies above a fixed threshold are considered part of the ground truth. Furthermore, all the intermediate results must be combined into one concise final prediction, which is a non-trivial task.
3 Deep Watershed Detection
In this section we present the Deep Watershed Detector (DWD) as a novel object detection system, built on the idea of the deep watershed transform. The watershed transform is a mathematically well-understood method with a simple core idea that can be applied to any topological surface. The algorithm starts filling up the surface from all the local minima, with all the resulting basins corresponding to connected regions. When applied to image gradients, the basins correspond to homogeneous regions of said image (see Fig. 2a). One key drawback of the watershed transform is its tendency to over-segment. This issue can be addressed by using the deep watershed transform. It combines the classical method with deep learning by training a deep neural network to create an energy surface based on an input image. This has the advantage that one can design the energy surface to have certain properties. When designed in such a way that all segmentation boundaries have energy zero, the watershed transform is reduced to a simple cutoff at a fixed energy level (see Fig. 2b). An objectness energy of this fashion has been used by Bai and Urtasun for instance segmentation. Since we want to do object detection, we further simplify the desired energy surface to having small conical energy peaks of radius $r$ pixels at the center of each object, and being zero everywhere else (see Fig. 2c).
More formally, we define our energy surface (or: energy map) $E$ as follows:

$$E(x) = \max_{c \in C} \; E_{\mathrm{max}} \cdot \max\!\left(0,\; 1 - \frac{\lVert x - c \rVert_2}{r}\right),$$

where $E(x)$ is the value of $E$ at position $x$, $C$ is the set of all object centers and $c$ are the coordinates of a given center. $E_{\mathrm{max}}$ corresponds to the maximum energy and $r$ is the radius of the center marking.
At first glance this definition might lead to the misinterpretation that object centers closer together than $2r$ cannot be disambiguated using the watershed transform on $E$. This is not the case, since we can cut the energy map at any given energy level between $0$ and $E_{\mathrm{max}}$. However, using this method it is not possible to detect multiple bounding boxes that share the exact same center.
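As an illustration, such a target energy surface can be constructed directly from a list of object centers. The sketch below is our own (the default values for the maximum energy and radius are illustrative, not taken from the paper); it places a conical peak around every center and keeps the higher value where peaks overlap:

```python
import numpy as np

def energy_map(height, width, centers, e_max=4.0, radius=3.0):
    """Target energy surface: a conical peak of height e_max and radius
    `radius` around every object center in `centers` (a list of (row, col)
    tuples), zero everywhere else."""
    ys, xs = np.mgrid[0:height, 0:width]
    energy = np.zeros((height, width))
    for cy, cx in centers:
        dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
        peak = e_max * np.clip(1.0 - dist / radius, 0.0, None)
        # where peaks overlap, keep the higher energy value
        energy = np.maximum(energy, peak)
    return energy
```

Cutting such a map at any level between zero and the peak height separates the cones of nearby centers, which is exactly the property exploited in the detection step.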
3.1 Retrieving Object Centers
After computing an estimate $\hat{E}$ of the energy map, we retrieve the coordinates of detected objects by the following steps:
Cut the energy map at a certain fixed energy level and then binarize the result.
Label the resulting connected components, using the two-pass algorithm. Every component $c$ receives a label in $\{1, \dots, n\}$; for every component we define $C_c$ as the set of all tuples $(x, y)$ for which the pixel with coordinates $x$ and $y$ is part of $c$.
The center of any component $c$ is given by its center of gravity:

$$\hat{c} = \frac{1}{|C_c|} \sum_{(x, y) \in C_c} (x, y).$$
We use these component centers $\hat{c}$ as estimates for the object centers.
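The steps above can be sketched in a few lines. In the following illustration, scipy's connected-component labeling stands in for the two-pass algorithm, and the threshold value is our own choice:

```python
import numpy as np
from scipy import ndimage

def detect_centers(energy, threshold=1.0):
    """Estimate object centers from a predicted energy map: cut at a fixed
    energy level, binarize, label the connected components, and take each
    component's center of gravity."""
    binary = (energy > threshold).astype(float)
    labels, n = ndimage.label(binary)
    return ndimage.center_of_mass(binary, labels, range(1, n + 1))
```

Each returned tuple is the (row, column) center of gravity of one component, i.e. one detected object.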
3.2 Object Class and Bounding Box
In order to recover bounding boxes we do not only need the object centers, but also the object classes and bounding box dimensions. To achieve this we output two additional maps, $M^{\mathrm{class}}$ and $M^{\mathrm{bbox}}$, as predictions of our network. $M^{\mathrm{class}}$ is defined as:

$$M^{\mathrm{class}}(x) = \begin{cases} \Lambda(c^*) & \text{if } E(x) > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $0$ is the class label indicating background and $\Lambda(c^*)$ is the class label associated with the center $c^*$ that is closest to $x$. We define our estimate for the class of component $c$ by a majority vote of the values $\hat{M}^{\mathrm{class}}(x)$ over all $x \in C_c$, where $\hat{M}^{\mathrm{class}}$ is the estimate of $M^{\mathrm{class}}$. Finally, we define the bounding box map $M^{\mathrm{bbox}}$ as follows:
$$M^{\mathrm{bbox}}(x) = (w_c, h_c) \quad \text{for } x \in C_c,$$

where $w_c$ and $h_c$ are the width and height of the bounding box for component $c$. Based on this, we define our bounding box estimation as the average of all estimations $\hat{M}^{\mathrm{bbox}}(x)$ over $x \in C_c$.
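A minimal sketch of the class vote and the bounding box averaging for a single detected component might look as follows (the array layouts and names are our assumptions, not the paper's: an integer class map of shape (H, W), and a bounding box map of shape (H, W, 2) holding per-pixel width and height predictions):

```python
import numpy as np
from collections import Counter

def component_class_and_bbox(component_pixels, class_map, bbox_map):
    """For one detected component (a list of (row, col) pixels), recover
    the class by majority vote over the predicted class map, and the
    bounding box as the average of the per-pixel (width, height)
    predictions."""
    votes = Counter(class_map[r, c] for r, c in component_pixels)
    cls = votes.most_common(1)[0][0]
    boxes = np.array([bbox_map[r, c] for r, c in component_pixels])
    return cls, boxes.mean(axis=0)  # (mean width, mean height)
```

Together with the detected center, the averaged width and height fully determine the predicted bounding box of the component.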
3.3 Network Architecture and Losses
As mentioned above, we use a deep neural network to predict the dense output maps $E$, $M^{\mathrm{class}}$ and $M^{\mathrm{bbox}}$ (see Fig. 1). The base neural network for this prediction can be any fully convolutional network with the same input and output dimensions. We use a ResNet-101 (a special case of a Highway Net) in conjunction with the elaborate RefineNet upsampling architecture. For the estimators defined above it is crucial to have the highest spatial prediction resolution possible. Our network has three output layers, all of which are a 1 by 1 convolution applied to the last feature map of the RefineNet.
3.3.1 Energy prediction
We predict a quantized and one-hot encoded version of $E$, called $E^*$, by applying a 1 by 1 convolution, with depth equal to the number of quantized energy levels, to the last feature map of the base network. The loss of the prediction $\hat{E}^*$, $L_{\mathrm{energy}}$, is defined as the cross-entropy between $E^*$ and $\hat{E}^*$.
3.3.2 Class prediction
We again use the corresponding one-hot encoded version of $M^{\mathrm{class}}$ and predict it using a 1 by 1 convolution, with depth equal to the number of classes, on the last feature map of the base network. The cross-entropy loss $L_{\mathrm{class}}$ is calculated between $M^{\mathrm{class}}$ and $\hat{M}^{\mathrm{class}}$. Since it is not the goal of this prediction to distinguish between foreground and background, all the loss stemming from locations with zero energy gets masked out.
3.3.3 Bounding box prediction
$M^{\mathrm{bbox}}$ is predicted in its initial form using a 1 by 1 convolution of depth 2 on the last feature map of the base network. The bounding box loss $L_{\mathrm{bbox}}$ is the mean-squared difference between $M^{\mathrm{bbox}}$ and $\hat{M}^{\mathrm{bbox}}$. For $L_{\mathrm{bbox}}$, the components stemming from background locations are masked out, analogous to $L_{\mathrm{class}}$.
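The masking of background locations described for the class and bounding box losses can be illustrated with a small numpy sketch of a masked mean-squared bounding box loss (the shapes and function name are ours, not the paper's):

```python
import numpy as np

def masked_mse(pred_bbox, target_bbox, energy):
    """Mean-squared bounding box loss where background locations
    (zero energy) are masked out. pred_bbox and target_bbox have
    shape (H, W, 2) for width and height; energy has shape (H, W)."""
    mask = energy > 0  # True only at foreground locations
    if not mask.any():
        return 0.0
    return float(((pred_bbox - target_bbox) ** 2)[mask].mean())
```

Only pixels inside an object's energy marking contribute to the loss, so the network is never penalized for its bounding box output on background.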
3.3.4 Combined prediction
We want to jointly train on all tasks, therefore we define a total loss as:

$$L_{\mathrm{total}} = \tau_{\mathrm{energy}} \frac{L_{\mathrm{energy}}}{\bar{L}_{\mathrm{energy}}} + \tau_{\mathrm{class}} \frac{L_{\mathrm{class}}}{\bar{L}_{\mathrm{class}}} + \tau_{\mathrm{bbox}} \frac{L_{\mathrm{bbox}}}{\bar{L}_{\mathrm{bbox}}},$$

where the $\bar{L}$ are running means of the corresponding losses and the scalars $\tau$ are hyper-parameters of the DWD network. We purposefully use very short extraction heads of one convolutional layer; by doing so we force the base network to do all three tasks simultaneously. We expect this to lead to the base network learning a meaningful representation of music notation, from which it can extract the solutions of the three tasks defined above.
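The weighting scheme can be sketched as follows; the function and argument names are ours, and the exponential form of the running mean (with an illustrative momentum value) is an assumption:

```python
def total_loss(losses, running_means, weights):
    """Combine the task losses: normalize each by a running mean of its
    own history so the weights act on comparable scales, then scale by a
    per-task hyper-parameter and sum."""
    return sum(w * l / m for l, m, w in zip(losses, running_means, weights))

def update_running_mean(mean, value, momentum=0.9):
    """One possible running mean: an exponential moving average,
    updated once per training step."""
    return momentum * mean + (1.0 - momentum) * value
```

Normalizing by the running means keeps any one task from dominating the gradient simply because its raw loss happens to be on a larger numerical scale.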
4 Experiments and Results
4.1 Used Datasets
DeepScores is currently the largest publicly available dataset of musical sheets with ground truth for various machine learning tasks, consisting of high-quality pages of written music rendered at a fixed dots-per-inch resolution. The dataset has full pages as images, containing tens of millions of objects separated into classes. We randomly split the set into training, validation and testing sets. The dataset being so large allows efficient training of large convolutional neural networks, in addition to being suitable for transfer learning.
MUSCIMA++ is a dataset of handwritten music notation for musical symbol detection. It contains symbols spread across pages, consisting of both notation primitives and higher-level notation objects, such as key signatures or time signatures, and it features 105 object classes. The notes in the dataset have either a full notehead or an empty notehead, or are grace notes. We randomly split the dataset into training, validation and testing sets.
4.2 Network Training and Experimental Setup
We pre-train our network in two stages in order to achieve reasonable results. First we train the ResNet on music symbol classification using the DeepScores classification dataset . Then, we train the ResNet and RefineNet jointly on semantic segmentation data also available from DeepScores. After this pre-training stage we are able to use the network on the tasks defined above in Sec. 3.3.
Since music notation is composed of hierarchically organized sub-symbols, there does not exist a canonical way to define a set of atomic symbols to be detected (e.g., individual numbers in time signatures vs. complete time signatures). We address this issue using a fully data driven approach and detecting the unaltered labels as they are provided by the two datasets.
We rescale every input image to the desired interline value (the number of pixels between two staff lines), with dataset-specific values for DeepScores and MUSCIMA++. Other than that we apply no preprocessing. We do not define a subset of target objects for our experiments, but attempt to detect all classes for which there is ground truth available. We always feed single images to the network, i.e. we only use a batch size of 1. During training we crop the full-page input (and the ground truth) to fixed-size quadratic patches at random coordinates. This serves two purposes: it saves GPU memory and performs efficient data augmentation. This way the network never sees the exact same input twice, even if we train for many epochs. For all of the results described below we train individually on $L_{\mathrm{energy}}$, $L_{\mathrm{class}}$ and $L_{\mathrm{bbox}}$, and then refine the training using $L_{\mathrm{total}}$. It turns out that the prediction of $\hat{E}$ is the most fragile, therefore we retrain on $L_{\mathrm{energy}}$ again after training on the individual losses in the order defined above, before moving on to $L_{\mathrm{total}}$. All training is done using the RMSProp optimizer with fixed learning and decay rates.
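The random cropping used during training can be sketched as follows; the same random coordinates are applied to the input image and to all dense ground-truth maps so that they stay aligned (the crop size default and array-based representation are our assumptions):

```python
import numpy as np

def random_crop(image, targets, size=512):
    """Crop the full-page input and all dense ground-truth maps at the
    same random coordinates. `targets` is a list of (H, W, ...) arrays
    aligned with `image`."""
    h, w = image.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    crop = lambda a: a[top:top + size, left:left + size]
    return crop(image), [crop(t) for t in targets]
```

Because the crop coordinates are drawn fresh for every step, the network effectively sees a new patch every time, which is the data augmentation effect described above.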
Since our design is invariant to how many objects are present in the input (as long as their centers do not overlap), and since we want to obtain bounding boxes for full pages at once, we feed whole pages to the network at inference time. The maximum input size is only bounded by the memory of the GPU. For typical pieces of sheet music this is not an issue, but pieces that use very small interline values (e.g. pieces written for conductors) result in very large inputs due to the interline normalization. At inputs of several million pixels, even a Tesla P40 with 24 gigabytes of memory runs out of memory.
4.3 Results and Discussion
Tab. 1 shows the average precision (AP) for the twenty best-detected classes at two different overlap thresholds between the detected bounding box and the ground truth. We observe that in both cases there are common symbol classes that get detected very well, but there is also a steep fall-off. The detection rate outside the top twenty continues to drop and is almost zero for most of the rare classes. We further observe that there is a significant performance gain for the lower overlap threshold, indicating that the bounding-box regression is not very accurate.
Fig. 3 shows an example detection for qualitative analysis. It confirms the conclusions drawn above. The rarest symbol present, an arpeggio, is not detected at all, while the bounding boxes are sometimes inaccurate, especially for large objects (note that stems, bar-lines and beams are not part of the DeepScores alphabet and hence do not constitute missed detections). On the other hand, staccato dots are detected very well. This is surprising, since they are typically hard to detect due to their small size and the context-dependent interpretation of the symbol shape (compare the dots in dotted notes or F-clefs). We attribute this to the opportunity of detecting objects in context, enabled by training on larger parts of full raw pages of sheet music, in contrast to the classical processing of tiny, pre-processed image patches or glyphs.
The results for the experiments on MUSCIMA++ in Tab. 2 and Fig. 3b show a very similar outcome. This is intriguing, because it suggests that the gap in difficulty between detecting digitally rendered and handwritten scores might be smaller than anticipated. We attribute this to the fully data-driven approach enabled by deep learning, instead of hand-crafted rules for handling individual symbols. It is worth noting that ledger lines are detected with very high performance (see the AP scores in Tab. 2). This explains the relatively poor detection of noteheads on MUSCIMA++, since the two tend to overlap.
Fig. 4 shows an estimate for a class map with its corresponding input overlaid. Each color corresponds to one class. This figure demonstrates that the network is learning a sensible representation of music notation: even though it is only trained to mark the centers of each object with the correct colors, it learns a primitive segmentation mask. This is best illustrated by the (purple) segmentation of the beams.
5 Conclusions and Future Work
We have presented a novel method for object detection that is specifically tailored to detecting many tiny objects on large inputs. We have shown that it is able to detect common symbols of music notation with high precision, both in digitally rendered and in handwritten music, without a drop in performance when moving to the "more complicated" handwritten input. This suggests that deep learning based approaches are able to deal with handwritten sheets just as well as with digitally rendered ones, in addition to their benefit of recognizing objects in their context and with minimal preprocessing as compared to classical OMR pipelines. Pacha et al. show that higher detection rates, especially for uncommon symbols, are possible when using R-CNN on small snippets (cp. Fig. 5). Despite their higher scores, it is unclear how recognition performance is affected when results of overlapping and potentially disagreeing snippets are aggregated to full-page results. A big advantage of our end-to-end system is the complete avoidance of error propagation in longer recognition pipelines of independent components like classifiers, aggregators etc. Moreover, our full-page end-to-end approach has the advantages of speed (compared to a sliding-window patch classifier) and ease of domain change (we use the same architecture for both the digital and the handwritten dataset), and it is easily integrated into complete OMR frameworks.
Arguably the biggest problem we faced is that symbol classes in the datasets are heavily imbalanced. In the DeepScores dataset in particular, the class notehead contains more than half of all the symbols in the entire dataset, while the most common classes together account for the vast majority of the symbols. Considering that we did not do any class balancing whatsoever, this imbalance had its effect on training. We observe that in cases where a symbol is common, we get a very high average precision, but it quickly drops as symbols become less common. Furthermore, it is interesting to observe that the neural network actually forgets about the existence of these rarer symbols: Fig. 6 depicts the loss evolution of a network that is already trained and gets further trained for additional iterations. When faced with an image containing rare symbols, the initial loss is larger than the loss on more common images. But to our surprise, later during the training process the loss actually increases when the net encounters rare symbols again, giving the impression that the network is treating these symbols as outliers and ignoring them.
Future work will thus concentrate on dealing with the catastrophic imbalance in the data to successfully train DWD to detect all classes. We believe that the solution lies in a combination of data augmentation and improved training regimes (i.e. sample pages containing rare objects more often, synthesizing mock pages filled with rare objects etc.).
Additionally, we plan to investigate the ability of our method beyond OMR, on natural images. Initially we will approach canonical datasets like PASCAL VOC and MS-COCO that have been at the front line of object recognition research. However, images in those datasets are not exactly natural, and for the most part they are simplistic (small images containing a few large objects). Recently, researchers have been investigating the ability of state-of-the-art recognition systems on more challenging natural datasets, like DOTA, and unsurprisingly the results leave much to be desired. The DOTA dataset shares a lot of similarities with musical datasets, with images being of high resolution and containing hundreds of small objects, making it a suitable benchmark for our DWD method for recognizing tiny objects.
-  M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
-  D. Bainbridge and T. Bell. The challenge of optical music recognition. Computers and the Humanities, 2001.
-  A. Baro, P. Riba, and A. Fornés. Towards the recognition of compound music notes in handwritten music scores. In ICFHR, 2016.
-  S. Beucher. The watershed transformation applied to image segmentation. SCANNING MICROSCOPY-SUPPLEMENT-, 1992.
-  J. Calvo-Zaragoza, J. J. Valero-Mas, and A. Pertusa. End-to-end optical music recognition using neural networks. In ISMIR, 2017.
-  K.-Y. Choi, B. Coüasnon, Y. Ricquebourg, and R. Zanibbi. Bootstrapping samples of accidentals in dense piano scores for cnn-based detection. In GREC@ICDAR, 2017.
-  M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. IJCV, 2010.
-  K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.
-  R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
-  R. B. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  J. Hajic and P. Pecina. The MUSCIMA++ dataset for handwritten optical music recognition. In ICDAR, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  J. H. Jr. and P. Pecina. Detecting noteheads in handwritten scores with convnets and bounding box regression. CoRR, abs/1708.01806, 2017.
-  Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 2015.
-  Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
-  T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, 2016.
-  A. Pacha, K.-Y. Choi, B. Coüasnon, Y. Ricquebourg, and R. Zanibbi. Handwritten music object detection: Open issues and baseline results. In International Workshop on Document Analysis Systems, 2018.
-  A. Pacha and H. Eidenberger. Towards self-learning optical music recognition. In ICMLA, 2017.
-  A. Rebelo, G. Capela, and J. S. Cardoso. Optical recognition of music symbols - A comparative study. IJDAR, 2010.
-  J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In CVPR, 2017.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, 2015.
-  F. Rossant and I. Bloch. Robust and adaptive OMR system including fuzzy modeling, fusion of musical rules, and possible error detection. EURASIP, 2007.
-  A. J. Gallego Sánchez and J. Calvo-Zaragoza. Staff-line removal with selectional auto-encoders. Expert Syst. Appl., 2017.
-  J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 2015.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.
-  T. Tieleman and G. E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4.2, 2012.
-  L. Tuggener, I. Elezi, J. Schmidhuber, M. Pelillo, and T. Stadelmann. DeepScores - a dataset for segmentation, detection and classification of tiny objects. In ICPR, 2018.
-  K. Wu, E. Otoo, and K. Suzuki. Optimizing two-pass connected-component labeling algorithms. Pattern Anal. Appl., 2009.
-  G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, M. Datcu, M. Pelillo, and L. Zhang. Dota: A large-scale dataset for object detection in aerial images. In CVPR, 2018.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.