I Introduction and Prior Work
Neural networks in machine learning systems are commonly employed to tackle classification problems involving characters or images. In such problems, the neural network (NN) processes an input sample and predicts which class it belongs to. The inputs and their classes are drawn from a dataset, such as the MNIST [1] dataset containing images of handwritten digits, or the CIFAR [2] and ImageNet [3] datasets containing images of common objects such as birds and houses. A NN is first trained using numerous examples where the input sample and its class label are both available, then used for inference (i.e. prediction) of the classes of input samples whose class labels are not available. The training stage is data-hungry and typically requires thousands of labeled examples. It is therefore often a challenge to obtain adequate amounts of high-quality, accurate data to sufficiently train a NN.
A possible solution is to obtain data by synthetic instead of natural means. Synthetic data are generated using computer algorithms instead of being collected from real-world scenarios. The advantages are that a) computer algorithms can be tuned to mimic real-world settings to desired levels of accuracy, and b) a theoretically unlimited amount of data can be generated by running the algorithm long enough. The effect of dataset size on network performance has been explored in [4]; in particular, more inputs are beneficial in reducing overfitting and improving robustness and generalization capabilities of NNs [5, 6]. Synthetic data have been successfully used in problems such as 3D imaging [7], point tracking [8], breaking Captchas on popular websites [9], and augmenting real-world datasets [10].
The present work introduces a family of synthetic datasets on classifying Morse codewords. Morse code is a system of communication where each letter, number or symbol in a language is represented using a sequence of dots and dashes, separated by spaces. It is widely used to communicate in situations where voice is not possible, such as helping people with disabilities talk [11, 12, 13, 14], where message transmission needs to be achieved using only 2 states [15], or in rehabilitation and education [16]. Morse code is a useful skill to learn and there exist cellphone apps designed to train people in its usage [17, 18].
Our work uses feed-forward multi-layer perceptron neural networks to directly classify Morse codewords into 64 character classes comprising letters, numbers and symbols. This is different from previous works such as [13, 19, 15, 14], which only had 2 classes corresponding to dots and dashes. In particular, one of these works used fuzzy logic on inputs from a microcontroller used in security systems, while another used least mean squares approximation, both to classify dots and dashes. There has also been previous work using time series and recurrent networks to decode English words in Morse code [20, 21], while [22] used radial basis function networks to classify characters with 84% accuracy. Accuracy is a common metric for describing the performance of a classification NN and is measured as the percentage of class labels correctly predicted by the NN during inference.
The key contributions of the present work are as follows:
- An algorithm (described in Section II) to generate machine learning datasets of varying difficulty. To the best of our knowledge, we are the first to develop an algorithm which can scale the difficulty of machine learning datasets. The difficulty of a dataset can be observed from the accuracy of a NN training on it: harder datasets lead to lower accuracy, and vice versa. We discuss techniques to make datasets harder and show corresponding accuracy results in Section III. Encountering harder datasets leads to aggressive exploration of network hyperparameters and learning algorithms, which ultimately increases the robustness of NNs training on them. The algorithm and datasets are open source and available on Github [23].
- In Section IV, we introduce metrics to quantify the difficulty of a dataset. While some of these arise from information theory, we also propose a new metric which achieves a high correlation coefficient with the accuracy achieved by NNs on a dataset. Our metrics are a useful way to characterize how hard a dataset is without having a NN train on it.
- This work is one of few to introduce a spatially 1-dimensional dataset. This is in contrast to the wide array of image and character recognition datasets which are usually 2-dimensional such as MNIST, where each image has width and height, or 3-dimensional such as CIFAR and ImageNet, where each image has width, height and a number of features. The number of spatial dimensions in the input data is important when dealing with low-complexity sparse NNs. Previous works [24, 25, 26, 27, 28, 29, 30] have focused on making NNs sparse, while keeping the resulting accuracy reduction to a minimum. The family of Morse code datasets described in the present work was designed to test the limits of sparse NNs, as described in Section III-C.
II Generating Algorithm
We picked 64 class labels for our dataset: the 26 English letters, the 10 Arabic numerals, and 28 other symbols such as (, +, :, etc. Each of these is represented by a sequence of dots and dashes in Morse code; for example, + is represented as · − · − · (dot, dash, dot, dash, dot). To mimic a real-world scenario in our algorithm, we imagined a human or a Morse code machine writing out this sequence within a frame of fixed size. Wherever the pen or electronic instrument touches is darkened and has a high intensity, indicating the presence of dots and dashes, while the other parts are left blank (i.e. spaces).
II-1 Step 1 – Frame Partitioning
For our algorithm, each Morse codeword lies in a frame which is a vector of 64 values. Within the frame, the length of a sequence having consecutive similar values is used to differentiate between a dot and a dash. In the baseline dataset, a dot can be 1-3 values wide and a dash 4-9. This is in accordance with international Morse code regulations [31], where the size or duration of a dash is around 3 times that of a dot. The space between a dot and a dash can have a length of 1-3 values. The exact length of a dot, dash or space is chosen from these ranges according to a uniform probability distribution. This mimics a human writer, who is not expected to give each symbol a consistent length, but can be expected to make dots and spaces around the same size, and dashes longer than them. The baseline dataset has no leading spaces before the 1st dot or dash, i.e. the codeword starts from the left edge of the frame. Trailing spaces fill up the right side of the frame after all the dots and dashes are complete.
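The partitioning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `partition_frame`, the string encoding of the pattern, and the 0/1 mask output are all hypothetical choices, and lengths are assumed to be integers drawn uniformly from the baseline ranges given above.

```python
import numpy as np

# Morse pattern for '+': dot, dash, dot, dash, dot
PLUS = ".-.-."

def partition_frame(pattern, rng, frame_len=64,
                    dot_range=(1, 3), dash_range=(4, 9), space_range=(1, 3)):
    """Return a 0/1 mask of length frame_len: 1 marks dots/dashes, 0 marks spaces."""
    mask = []
    for i, symbol in enumerate(pattern):
        lo, hi = dot_range if symbol == "." else dash_range
        # Generator.integers excludes the upper bound, hence hi + 1
        mask += [1] * int(rng.integers(lo, hi + 1))
        if i < len(pattern) - 1:
            # intermediate space between consecutive symbols only
            s_lo, s_hi = space_range
            mask += [0] * int(rng.integers(s_lo, s_hi + 1))
    if len(mask) > frame_len:
        raise ValueError("codeword does not fit in the frame")
    # baseline: no leading spaces, trailing spaces fill the rest of the frame
    mask += [0] * (frame_len - len(mask))
    return np.array(mask, dtype=np.int64)

rng = np.random.default_rng(0)
mask = partition_frame(PLUS, rng)
```

With the baseline ranges, the longest possible + codeword is 3 dots of 3, 2 dashes of 9 and 4 spaces of 3, i.e. 39 values, so it always fits in the 64-value frame.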
II-2 Step 2 – Assigning Values for Intensity Levels
All values in the frame are initially real numbers in a fixed range and indicate the intensity of that point in the frame. For dots and dashes, the values are drawn from a normal distribution. The idea is to have the ‘six-sigma’ range (the mean plus or minus 3 standard deviations) coincide with the upper half of the range of possible values. This ensures that, with very high probability, any value making up a dot or a dash will lie in the upper half of possible values. The value of a space is exactly 0. Once again, these conditions mimic the human or machine writer who is not expected to have consistent intensity for every dot and dash, but can be expected to not let the writing instrument touch portions of the frame which are spaces.
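A minimal sketch of the intensity assignment is given below. The numeric scale is not recoverable from this copy of the text, so `max_intensity=16.0` is a placeholder assumption, as is the function name `assign_intensities`. Clipping rare out-of-range draws is also an assumption made here so that the "upper half" property holds exactly.

```python
import numpy as np

def assign_intensities(mask, rng, max_intensity=16.0):
    """mask: 0/1 array from frame partitioning. Spaces stay exactly 0."""
    lo, hi = max_intensity / 2.0, max_intensity   # six-sigma range = upper half
    mean = (lo + hi) / 2.0
    std = (hi - lo) / 6.0                         # mean +/- 3 std spans [lo, hi]
    values = rng.normal(mean, std, size=mask.shape)
    values = np.clip(values, lo, hi)              # assumption: clip rare outliers
    return np.where(mask == 1, values, 0.0)

mask = np.array([1, 1, 0, 1, 0, 0])
frame = assign_intensities(mask, np.random.default_rng(0))
```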
II-3 Step 3 – Noising
Noise in input samples is often deliberately injected as a means of avoiding overfitting in NNs, and has been shown to be superior to other methods of avoiding overfitting [32]. This was, however, the secondary reason behind our experimenting with noise. The primary reason was to deliberately make the data hard to classify and test the limits of different NNs processing it. Noise can be thought of as a human accidentally varying the intensity of writing the Morse codeword, or a Morse communication channel having noise. The baseline dataset has no noise, while others have additive noise from a mean-zero normal distribution applied to them. Fig. 1 shows the 3 steps up to this point. Finally, all the values are normalized to lie within the range [0,1] with a precision of 3 decimal places.
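The noising and normalization step might look like the sketch below. The text does not say how out-of-range values are handled after adding noise, so clipping to the original scale before dividing is an assumption, as are the function name and the placeholder `max_intensity=16.0`.

```python
import numpy as np

def add_noise_and_normalize(frame, rng, noise_std, max_intensity=16.0):
    """Add zero-mean Gaussian noise, then map to [0, 1] with 3 decimal places."""
    noisy = frame + rng.normal(0.0, noise_std, size=frame.shape)
    noisy = np.clip(noisy, 0.0, max_intensity)    # assumption: clip before scaling
    return np.round(noisy / max_intensity, 3)

rng = np.random.default_rng(0)
clean = np.array([12.0, 12.0, 0.0, 10.0, 0.0])
baseline = add_noise_and_normalize(clean, rng, noise_std=0.0)  # no noise
noisy = add_noise_and_normalize(clean, rng, noise_std=2.0)
```

Note that with this clipping choice, noise can lift a space above 0 but never push it below 0.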
II-4 Step 4 – Mass Generation
Steps 1-3 describe the generation of 1 input sample corresponding to some particular class label. This can be repeated as many times as required for each of the 64 class labels. This demonstrates a key advantage of synthetic over real-world data – the ability to generate an arbitrary amount of data having an arbitrary prior probability distribution over its classes. The baseline dataset has 7,000 examples for each class, for a total of 448,000 examples.
II-A Variations and Difficulty Scaling
The baseline dataset is as described so far, with the standard deviation of the noise set to 0, i.e. it has no additive noise. We experimented with the following variations in datasets:
1. Baseline with additive noise drawn from a zero-mean normal distribution. These are called Morse 1.x, where x runs from 0 to 4 and indexes the noise level, i.e. 1.0 to 1.4, where 1.0 is the baseline.
2. Instead of having the codeword start from the left edge of the frame, we introduced a random number of leading spaces. For example, in Fig. 1, the codeword occupies a length of 26 values. The remaining 38 space values can be randomly divided between leading and trailing spaces. This increases the difficulty of the dataset since no particular set of neurons is expected to be learning dots and dashes, as the actual codeword could be anywhere in the frame. Just like variation 1, we added noise and call these datasets Morse 2.x, x being as before.
3. There is no overlap between the lengths of dots and dashes in the datasets described so far. The difficulty can be increased by making dash length = 3-9 values, which is exactly according to the convention of having dash length thrice the dot length. This means that dashes can masquerade as dots and spaces, and vice versa. This is done on top of introducing leading spaces. These datasets are called Morse 3.x, x being as before.
4. The Morse datasets only have 64 inputs, which is quite small compared to others such as MNIST (784 inputs), CIFAR (3072 inputs), or ImageNet (150,528 inputs). This makes the Morse datasets hard to classify since there is less redundancy in inputs, so a given amount of noise will lead to greater reductions in signal-to-noise ratio (SNR) compared to other datasets. To make the Morse datasets easier, we introduced dilation by a factor of 4. This is done by scaling all lengths in variation 3 by a factor of 4, i.e. frame length is 256, dot sizes and space sizes are 4-12, and dash size is 12-36. These datasets are called Morse 4.x, x being as before.
5. Increasing the number of training examples, i.e. the size of the dataset, makes it easier to classify since a NN has more labeled training examples to learn from. Accordingly we chose Morse 3.1 and scaled the number of examples to obtain Morse Size x, where x is the factor by which the number of examples is scaled. For example, Morse Size 1/2 has 3,500 examples for each class, for a total of 224,000 examples.
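Two of these variations are easy to illustrate in code. The sketch below, with hypothetical helper names `place_codeword` and `total_examples`, shows the random split of free space into leading and trailing spaces (variation 2) and the dataset size arithmetic (variation 5). It assumes the number of leading spaces is drawn uniformly from all values that keep the codeword inside the frame, which the text does not state explicitly.

```python
import numpy as np

def place_codeword(codeword, rng, frame_len=64, leading_spaces=False):
    """Embed a codeword in a frame, optionally after a random number of leading spaces."""
    codeword = np.asarray(codeword, dtype=float)
    free = frame_len - codeword.size              # space values left over
    if free < 0:
        raise ValueError("codeword does not fit in the frame")
    lead = int(rng.integers(0, free + 1)) if leading_spaces else 0
    frame = np.zeros(frame_len)
    frame[lead:lead + codeword.size] = codeword
    return frame, lead

def total_examples(size_factor, per_class=7000, num_classes=64):
    """Dataset size when the baseline 7,000 examples per class are scaled."""
    return int(per_class * size_factor) * num_classes

rng = np.random.default_rng(0)
frame, lead = place_codeword(np.ones(26), rng, leading_spaces=True)
```

For a codeword of length 26, `lead` can be anything from 0 to 38, matching the 38 remaining space values mentioned for Fig. 1.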
III Neural Network Results and Analysis
III-A Network Setup
Our NN needs to have 64 output neurons to match the number of classes. The number of input neurons always matches the frame length, i.e. 256 for the Morse 4.x datasets, and 64 for all others. We used a single hidden layer with 1024 neurons. The performance, i.e. accuracy, generally increases on adding more hidden neurons; however, we stuck with 1024 since values above that yielded diminishing returns. The network is purely a multi-layer perceptron, i.e. there are only fully connected layers. The hidden layer has ReLU activations, while the output is a softmax probability distribution. We used the Adam optimizer with default parameters [33], He normal initialization for the weights [34], and trained for 30 epochs using a minibatch size of 128. We used a fixed fraction of the total examples for training the NN and the remaining examples for testing at the end of training. All reported accuracies are those obtained on the test samples.
No constraints were imposed on the weights for the NNs training on Morse 1.x, 2.x and 3.x, since our experimental results indicated that this led to optimum performance. However, the NNs for Morse 4.x are more prone to overfitting due to having more input neurons, leading to more weight parameters. Accordingly we regularized the weights using an L2 penalty, with the coefficient set to the best value as determined experimentally.
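The architecture described in this subsection can be sketched with a NumPy forward pass. This is only an illustration of the (64, 1024, 64) layout with He normal initialization, ReLU and softmax; it is not the authors' implementation, the names `he_normal`, `init_mlp` and `forward` are hypothetical, and training (Adam, 30 epochs, minibatches of 128) is deliberately left out.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    # He normal initialization: zero mean, variance 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def init_mlp(rng, n_in=64, n_hidden=1024, n_out=64):
    return {"W1": he_normal(n_in, n_hidden, rng), "b1": np.zeros(n_hidden),
            "W2": he_normal(n_hidden, n_out, rng), "b2": np.zeros(n_out)}

def forward(params, x):
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])    # ReLU hidden layer
    logits = h @ params["W2"] + params["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)                 # softmax over classes

rng = np.random.default_rng(0)
params = init_mlp(rng)
probs = forward(params, rng.random((5, 64)))                # 5 random input frames
```

For the Morse 4.x datasets the same sketch applies with `n_in=256`.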
III-B Results
Note that the entirety of this work, i.e. generation of the various datasets, implementation of NNs to process them, and evaluation of metrics, uses the Python programming language. Test accuracy results after training the NN on the different Morse datasets are shown in Fig. 2. As expected, increasing the standard deviation of noise results in a drop in performance. This effect is not felt strongly at the lowest noise levels, since noise can then take spaces up to a value of only about 3 on the original intensity scale (i.e. before normalizing to [0,1]), which stays below the values that dots and dashes can drop to, so the probability of a space being confused with a dot or dash is basically 0. Confusion can occur at higher noise levels, and gets worse as the noise standard deviation increases, as shown in Fig. 3.
Since the codeword lengths do not often stretch beyond 32, the first half of the input neurons usually encounter high input intensity values corresponding to dots and dashes during training, while the latter half mostly encounter lower input values corresponding to spaces. This changes when introducing leading spaces, which become inputs to some neurons in the first half. The result is an increase in the variance of the input to each neuron, and accuracy drops as a result. The degradation is worse when dashes can have a length of 3-9. Since the lengths are drawn from a uniform distribution, 1/7th of dashes (those of length 3) can now be confused with 1/3rd of dots and 1/3rd of intermediate spaces. As an example, the + codeword has 2 dashes, 3 dots and 4 intermediate spaces, each of which gives this confusion a chance to occur. Dilating by 4, however, reduces this chance. Accuracy is better as a result, and is further improved by properly regularizing the NN so that it doesn’t overfit.
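The overlap fractions can be checked with a few lines of exact arithmetic. The sketch assumes that lengths are integers drawn uniformly from the stated ranges, in the dilated datasets as well; the helper name `overlap_fractions` is hypothetical.

```python
from fractions import Fraction

def overlap_fractions(dot_range, dash_range):
    """Fraction of dash lengths, and of dot lengths, that both symbols can take."""
    dots = set(range(dot_range[0], dot_range[1] + 1))
    dashes = set(range(dash_range[0], dash_range[1] + 1))
    shared = dots & dashes
    return Fraction(len(shared), len(dashes)), Fraction(len(shared), len(dots))

morse3 = overlap_fractions((1, 3), (3, 9))      # dash 3-9, dot 1-3
morse4 = overlap_fractions((4, 12), (12, 36))   # dilated by 4
```

Under this assumption the ambiguous share falls from 1/7 of dashes and 1/3 of dots to 1/25 of dashes and 1/9 of dots, since only a length of 12 is shared after dilation.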
Increasing dataset size has a beneficial effect on performance. Giving the NN more examples to train from is akin to training on a smaller dataset for more epochs, with the important added advantage that overfitting is reduced. This is shown in Fig. 4, which shows improving test accuracy as the dataset is made larger. At the same time, the difference between final training accuracy and test accuracy reduces, which implies that the network is generalizing better and not overfitting. Note that Morse Size 8 has 3 million labeled training examples – a beneficial consequence of being able to cheaply generate large quantities of synthetic data.
III-C Results for Sparse Networks
Our previous work [24, 25, 26] has focused on network complexity reduction in the form of pre-defined sparsity. In a pre-defined sparse network, as opposed to a fully connected one, a fraction of the weights are chosen to be deleted before starting training. These weights never appear during the workflow of the NN. Consider our (64,1024,64) NN as an example. When fully connected, it has 64 × 1024 + 1024 × 64 = 131,072 weights, which gives a fractional density = 1. If we choose to delete 75% of the weights at the beginning, then we are left with a NN which has 32,768 weights, i.e. fractional density = 0.25. This leads to reduced storage and operational complexity, which is particularly important for hardware realizations of NNs, but possibly at the cost of performance degradation.
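The weight counts above can be reproduced with the sketch below, which also builds a fixed connectivity mask before training. Choosing the surviving connections uniformly at random is an assumption made here for brevity; the cited works on pre-defined sparsity may use structured connection patterns instead. The names `count_weights` and `random_mask` are hypothetical.

```python
import numpy as np

def count_weights(layers):
    """Number of weights in a fully connected network with the given layer sizes."""
    return sum(a * b for a, b in zip(layers[:-1], layers[1:]))

def random_mask(fan_in, fan_out, density, rng):
    """Boolean mask keeping a fixed fraction of connections, chosen before training."""
    n_keep = int(round(density * fan_in * fan_out))
    flat = np.zeros(fan_in * fan_out, dtype=bool)
    flat[rng.choice(flat.size, size=n_keep, replace=False)] = True
    return flat.reshape(fan_in, fan_out)

rng = np.random.default_rng(0)
full = count_weights((64, 1024, 64))
masks = [random_mask(64, 1024, 0.25, rng), random_mask(1024, 64, 0.25, rng)]
kept = int(sum(m.sum() for m in masks))
```

During training, such a mask would be multiplied elementwise with the weight matrix (and its gradient) so that deleted weights never participate.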
Fig. 5 shows the performance degradation for 4 different Morse datasets. Note how the baseline dataset is reasonably accurately classified by a NN with only a quarter of the weights, while performance drops off much more rapidly when dataset variations are introduced. These variations lead to increased information content per neuron per training example. As a result, the reduction in information learning capability as a result of deleting weights is much more severe. Also note that as density is reduced, Morse 4.2 has the best performance out of the non-baseline models tested in Fig. 5. This is because it has more weights to begin with, due to the increased number of input neurons. Finally, note that regularization was not applied to any of the sparse models since reducing the number of NN parameters reduces the chances of overfitting, so is in itself a form of regularization.
IV Metrics for Dataset Difficulty
This section discusses possible metrics for quantifying how difficult a dataset is to classify. Each sample in a dataset is a point in a space whose number of dimensions equals the number of features, which is 64 for the Morse datasets (not considering dilation). The number of classes is also 64 in our case. The classification problem is essentially finding the class of any new point. Any machine learning classifier will attempt to learn and construct decision boundaries between the classes by partitioning the whole space into regions, one per class, such that the samples of a particular class are clustered in the region for that class. Suppose a particular input sample actually belongs to class m. The classifier commits an error if it ranks some other class j, j ≠ m, higher than m when deciding where that input sample belongs. The probability of this happening is a pairwise (PW) error probability, i.e. a quantity specific to classes m and j. The overall probability of error also depends on the prior probability of class m occurring. Considering all classes in the dataset, the overall probability of error is bounded according to [35]: the error probability for a given class m is at least the largest of its pairwise error probabilities and at most their sum, and the overall probability of error is the average of these per-class probabilities weighted by the class priors.
The pairwise probabilities can be approximately computed by assuming that the locations of samples of a particular class are from a Gaussian distribution with mean located at the centroid, which is the average of all samples for the class. To simplify the math, we take the average variance across all dimensions within a class, which gives a single variance for each class. The distance between 2 classes m and j is the L2-norm between their centroids. A particular class will be more prone to errors if it is close to other classes. This can be quantified by looking at the minimum distance between class m and any other class j, relative to the spread of class m.
With the Gaussian assumption, the bounds on the overall probability of error simplify to a lower bound and an upper bound expressed in terms of Q, the tail function for a standard Gaussian distribution, evaluated on the ratios of inter-class distances to class standard deviations.
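The displayed equations for these bounds are not legible in this copy. A standard form consistent with the description is sketched below; it should be read as an assumption rather than the authors' exact expressions. All symbols are introduced here for illustration: $C$ is the number of classes, $P(m)$ the prior of class $m$, $\mathbf{c}_m$ its centroid, $\sigma_m$ its standard deviation, and $P_{\mathrm{PW}}(j \mid m)$ the pairwise error probability. The $Q(d/2\sigma)$ expression additionally assumes an isotropic Gaussian for class $m$ and a decision boundary halfway between the two centroids.

```latex
\begin{align}
P(\varepsilon) &= \sum_{m=1}^{C} P(m)\, P(\varepsilon \mid m), \\
\max_{j \neq m} P_{\mathrm{PW}}(j \mid m) \;&\le\; P(\varepsilon \mid m) \;\le\; \sum_{j \neq m} P_{\mathrm{PW}}(j \mid m), \\
P_{\mathrm{PW}}(j \mid m) &\approx Q\!\left(\frac{d(m,j)}{2\sigma_m}\right),
\qquad d(m,j) = \lVert \mathbf{c}_m - \mathbf{c}_j \rVert_2 .
\end{align}
```

Substituting the third line into the second and averaging over the priors gives a lower bound driven by each class's nearest neighbor and an upper bound that sums over all other classes.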
The lower and upper bounds can thus be used as metrics for dataset difficulty: higher values for them imply higher probabilities of error, i.e. lower accuracy. A simpler metric can be obtained by just considering, for each class, the ratio of its standard deviation to the distance to its nearest other class. Higher values for this ratio indicate that a) the class is close to some other class and the NN will have a hard time differentiating between them, and b) the variance of the class is high, so it is harder to form a decision boundary to separate inputs having its label from those with other labels. Since this ratio is different for every class, we experimented with ways to reduce it to a single measure such as taking the minimum, the average and the median. The average worked best, which gives our 3rd metric: the average over all classes of this ratio. Therefore high values of the 3rd metric lead to low accuracy.
The 4th and final metric is a threshold-based count. To obtain it, we first compute the class centroids just as before. Then we compute the L1-norm between every pair of centroids and average over the number of dimensions.
Since all features in each input sample are normalized to [0,1], all the elements in all the centroid vectors also lie in the range [0,1]. So the resulting number for every pair of classes is always between 0 and 1; in fact, it is proportional to the absolute distance between the 2 classes. Then we simply count how many of these numbers are less than a threshold, which we empirically set to 0.05, i.e. we sum, over all pairs of classes, an indicator function which is 1 if the condition in its argument is true, otherwise 0. The higher the value of this count, the lower the accuracy. Note that the total number of pairwise values for C classes is C(C-1)/2, so the count will typically be higher for datasets that have more classes. This is a desired property since a larger number of classes usually makes a dataset harder to classify. Note that the maximum value of the count for the Morse datasets is 64 × 63 / 2 = 2016.
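The two simpler metrics can be sketched in NumPy as follows. The function names are hypothetical, and the 3rd metric is taken here to be the average over classes of (class standard deviation) / (L2 distance to the nearest other centroid), which is an interpretation of the description rather than a confirmed formula. The toy data only exercise the functions and are not Morse samples.

```python
import numpy as np

def class_stats(X, y):
    """Centroid and single standard deviation (from dimension-averaged variance) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    sigmas = np.sqrt(np.array([X[y == c].var(axis=0).mean() for c in classes]))
    return centroids, sigmas

def ratio_metric(centroids, sigmas):
    """Average over classes of sigma / distance to the nearest other centroid."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                   # ignore distance to itself
    return float(np.mean(sigmas / d.min(axis=1)))

def threshold_metric(centroids, threshold=0.05):
    """Count class pairs whose dimension-averaged L1 centroid distance is below threshold."""
    d1 = np.abs(centroids[:, None, :] - centroids[None, :, :]).mean(axis=2)
    upper = np.triu_indices(len(centroids), k=1)  # each unordered pair once
    return int(np.sum(d1[upper] < threshold))

# Toy data: 3 classes, 2 samples each, 4 features in [0, 1]
bases = [0.1, 0.12, 0.9]
X = np.array([[b + s] * 4 for b in bases for s in (-0.01, 0.01)])
y = np.repeat([0, 1, 2], 2)
centroids, sigmas = class_stats(X, y)
ratio = ratio_metric(centroids, sigmas)
count = threshold_metric(centroids)
```

In the toy data only the first two classes are closer than the 0.05 threshold, so the count is 1 out of 3 possible pairs.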
IV-A Goodness of the Metrics
We computed the values of all four metrics for all the Morse datasets and plotted these against the classification accuracy results obtained from Section III-B. The results are shown in Fig. 6, while the correlation coefficient of each metric with the accuracy is given in Table I. Note that the metrics are an indicator of dataset difficulty, so they are negatively correlated with accuracy. The best metrics are those whose correlation coefficients in Table I have the highest magnitude.
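A correlation coefficient of this kind can be computed as sketched below, assuming the Pearson coefficient is meant (the text does not name the type). The arrays are toy placeholders purely to exercise the function; they are not values from Fig. 6 or Table I.

```python
import numpy as np

def pearson(metric_values, accuracies):
    """Pearson correlation between a difficulty metric and test accuracy."""
    return float(np.corrcoef(metric_values, accuracies)[0, 1])

# Toy placeholders, NOT results from the paper
toy_metric = np.array([0.1, 0.2, 0.3, 0.4])
toy_accuracy = np.array([95.0, 90.0, 85.0, 80.0])
r = pearson(toy_metric, toy_accuracy)   # perfectly linear and decreasing
```

A value close to -1 indicates a metric that tracks difficulty well, since higher metric values should accompany lower accuracy.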
IV-B Limitations of the Metrics
As mentioned, each class has a single variance value which is the average variance across dimensions. This is a reasonable simplification to make because our experiments indicate that the variance of the variance values for different dimensions is small. However, this simplification possibly leads to the lower and upper error bounds not being sufficiently tight. A possible improvement, involving significantly more computation, would be to compute the covariance matrix for each class.
It is worth noting that all these metrics are a function of the dataset only and are independent of the machine learning algorithm or training setup used. On the other hand, percentage accuracy depends on the learning algorithm and training conditions. As shown in Fig. 4, increasing dataset size leads to accuracy improvement, i.e. the dataset becoming easier, since the NN has more training examples to learn from. However, increasing dataset size drives all the metric values towards indicating higher difficulty. This is because the occurrence of more examples in each class increases its standard deviation and also makes samples of a particular class more scattered, leading to reduced values for the inter-class distance measures. We hypothesize that these shortcomings of the metrics are due to the fact that most variations of the Morse datasets have a low SNR, while the metrics (the error bounds in particular) are designed for high SNR problems.
V Conclusion
This paper presents an algorithm to generate datasets of varying difficulty on classifying Morse code symbols. While the results have been shown for neural networks, any machine learning algorithm can be tried and the challenge arising from more difficult datasets used to fine-tune it. The datasets are synthetic and consequently may not completely represent reality unless statistically verified with real-world tests. However, the different aspects of the generating algorithm help to mimic real-world scenarios which can suffer from noise or other inconsistencies. This work highlights one of the biggest advantages of synthetic data: the ability to easily produce large amounts of it and thereby improve the performance of learning algorithms. The given Morse datasets are also useful for testing the limits of various learning algorithms and identifying when they fail or possibly overfit/underfit.
The metrics discussed, while not perfect, can be used to understand the inherent difficulty of the classification problem on any dataset before applying learning algorithms to it. Future work will involve improving the metrics to achieve higher magnitudes of correlation with accuracy, and extension to other types of neural networks and algorithms.
References
[1] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[2] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[4] G. M. Weiss and F. Provost, “Learning when training data are costly: The effect of class distribution on tree induction,” Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[6] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[7] X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning deep object detectors from 3d models,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society, 2015, pp. 1278–1286.
[8] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Toward geometric deep SLAM,” in arXiv:1707.07410, 2017.
[9] T. Anh Le, A. G. Baydin, R. Zinkov, and F. Wood, “Using synthetic data to train neural networks is model-based reasoning,” in arXiv:1703.00868, 2017.
[10] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016, pp. 399–410.
[11] N. S. Bakde and A. P. Thakare, “Morse code decoder - using a PIC microcontroller,” International Journal of Science, Engineering and Technology Research (IJSETR), vol. 1, no. 5, 2012.
[12] C.-H. Yang, L.-Y. Chuang, C.-H. Yang, and C.-H. Luo, “Morse code application for wireless environmental control systems for severely disabled individuals,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 4, pp. 463–469, Dec 2003.
[13] C.-H. Yang, C.-H. Yang, L.-Y. Chuang, and T.-K. Truong, “The application of the neural network on morse code recognition for users with physical impairments,” Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine, vol. 215, no. 3, pp. 325–331, 2001.
[14] C.-H. Luo and C.-H. Shih, “Adaptive morse-coded single-switch communication system for the disabled,” International Journal of Bio-Medical Computing, vol. 41, no. 2, pp. 99–106, 1996.
[15] C. P. Ravikumar and M. Dathi, “A fuzzy-logic based morse code entry system with a touch-pad interface for physically disabled persons,” in Proceedings of the IEEE Annual India Conference (INDICON), Dec 2016.
[16] T. W. King, Modern Morse Code in Rehabilitation and Education: New Applications in Assistive Technology, 1st ed. Allyn and Bacon, 1999.
[17] R. Sheinker. (2017, Aug) Morse code - apps on google play. [Online]. Available: https://play.google.com/store/apps/details?id=com.dev.morsecode&hl=en
[18] F. Bonnin. (2018, Mar) Morse-it on the app store. [Online]. Available: https://itunes.apple.com/us/app/morse-it/id284942940?mt=8
[19] C.-H. Luo and D.-T. Fuh, “Online morse code automatic recognition with neural network system,” in Proceedings of the 23rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 1, 2001, pp. 684–686.
[20] D. Hill, “Temporally processing neural networks for morse code recognition,” in Theory and Applications of Neural Networks. Springer London, 1992, pp. 180–197.
[21] G. N. Aly and A. M. Sameh, “Evolution of recurrent cascade correlation networks with distributed collaborative species,” in Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks, 2000, pp. 240–249.
[22] R. Li, M. Nguyen, and W. Q. Yan, “Morse codes enter using finger gesture recognition,” in Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA), Nov 2017.
[23] S. Dey. Github repository: souryadey/morse-dataset. [Online]. Available: https://github.com/souryadey/morse-dataset
[24] S. Dey, Y. Shao, K. M. Chugg, and P. A. Beerel, “Accelerating training of deep neural networks via sparse edge processing,” in Proceedings of the 26th International Conference on Artificial Neural Networks (ICANN). Springer, 2017, pp. 273–280.
[25] S. Dey, P. A. Beerel, and K. M. Chugg, “Interleaver design for deep neural networks,” in Proceedings of the 51st Asilomar Conference on Signals, Systems, and Computers, Oct 2017, pp. 1979–1983.
[26] S. Dey, K.-W. Huang, P. A. Beerel, and K. M. Chugg, “Characterizing sparse connectivity patterns in neural networks,” in Proceedings of the Information Theory and Applications Workshop, 2018.
[27] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[28] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
[29] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in Proceedings of the International Conference on Machine Learning (ICML). JMLR.org, 2015, pp. 2285–2294.
[30] X. Zhou, S. Li, K. Qin, K. Li, F. Tang, S. Hu, S. Liu, and Z. Lin, “Deep adaptive network: An efficient deep neural network with sparse binary connections,” in arXiv:1604.06154, 2016.
[31] International Morse Code, Radiocommunication Sector of International Telecommunication Union, Oct 2009, available at http://www.itu.int/rec/R-REC-M.1677-1-200910-I/.
[32] R. M. Zur, Y. Jiang, L. L. Pesce, and K. Drukker, “Noise injection for training artificial neural networks: A comparison with weight decay and early stopping,” Medical Physics, vol. 36, no. 10, pp. 4810–4818, Oct 2009.
[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 1026–1034.
[35] K. Chugg, A. Anastasopoulos, and X. Chen, Iterative Detection: Adaptivity, Complexity Reduction, and Applications. Springer Science & Business Media, 2012, vol. 602.