Image-free single-pixel segmentation

08/24/2021 ∙ by Haiyan Liu, et al. ∙ Beijing Institute of Technology

Existing segmentation techniques require high-fidelity images as input to perform semantic segmentation. Since segmentation results mostly contain edge information, which is much less than the information in the acquired images, this throughput gap wastes both hardware and software resources. In this letter, we report an image-free single-pixel segmentation technique. The technique combines structured illumination and single-pixel detection to efficiently sample and multiplex the scene's segmentation information into compressed one-dimensional measurements. The illumination patterns are optimized together with the subsequent reconstruction neural network, which directly infers segmentation maps from the single-pixel measurements. The end-to-end encoding-and-decoding learning framework enables optimized illumination with a corresponding network, which provides both high acquisition and segmentation efficiency. Both simulation and experimental results validate that accurate segmentation can be achieved using two orders of magnitude less input data. When the sampling ratio is 1%, the Dice coefficient reaches above 80% and the pixel accuracy reaches above 96%. The reported technique can be widely applied in various resource-limited platforms, such as UAVs and unmanned vehicles, that require real-time sensing.





Figure 2: The single-pixel image-free segmentation network structure. The network consists of two parts: an encoder module, which encodes the target’s information into one-dimensional measurements, and a decoder module, which comprises a feature extraction block and a feature map segmentation block. The decoder receives single-pixel measurements and directly outputs the segmentation results. The feature extraction block contains seven convolutional layers and one deconvolutional layer, and the sizes of the convolution kernels are , and , respectively. The feature map segmentation block mainly contains downsampling, upsampling and skip connection operations.

The reported image-free single-pixel segmentation technique.

To further improve modulation and sensing efficiency, we built an end-to-end deep learning framework to simultaneously optimize the modulation patterns and the corresponding inferring network, as shown in Fig. 2. It consists of two parts. The first part is an encoder module that modulates and couples the target light field into one-dimensional measurements. The second part is a decoder module that infers segmentation information from the non-visual measurements.

The encoder contains one convolution layer, in which the kernels represent the modulation patterns. The convolutional filters have the same size as the target scene ( in this work). The $i$-th single-pixel measurement is mathematically modeled as

$$y_i = \sum_{(u,v)} \left( P_i \odot I \right)(u,v),$$

where $I$ denotes the target scene, $P_i$ represents the $i$-th modulation pattern, $(u,v)$ indexes the pixels, and $\odot$ denotes the Hadamard product. After the encoder outputs the one-dimensional measurements (), we added a fully connected layer to extract richer semantic information. As a result, the data dimension was increased to , and then adjusted into a feature map of to input into the subsequent decoder.
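The measurement model described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function name `single_pixel_measurements`, the 8×8 scene, the four random patterns and the seed are all hypothetical.

```python
import numpy as np

def single_pixel_measurements(scene, patterns):
    """Return one scalar per pattern: the sum over all pixels of the
    Hadamard (element-wise) product between that pattern and the scene."""
    # patterns: (M, H, W), scene: (H, W) -> measurements: (M,)
    return (patterns * scene).sum(axis=(1, 2))

rng = np.random.default_rng(0)
scene = rng.random((8, 8))          # hypothetical 8x8 target scene
patterns = rng.random((4, 8, 8))    # hypothetical set of 4 modulation patterns
y = single_pixel_measurements(scene, patterns)

# Equivalent matrix form: each flattened pattern is one row of a sensing matrix.
A = patterns.reshape(4, -1)
y_matrix = A @ scene.ravel()
```

A convolution layer whose kernels have the same size as its input produces exactly one such scalar per kernel, which is why the encoder's kernels can be read as modulation patterns.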

With the input of a feature map produced by the encoder, the decoder outputs a segmentation map. The decoder comprises two parts. The first part is a feature extraction block designed based on the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [17], which includes five steps: feature extraction, shrinking, mapping, expanding and deconvolution. The feature extraction step utilizes 56 filters to first perform feature extraction on the input feature map. Then, the shrinking step adopts a smaller number of filters to reduce parameters and simplify the model, during which the feature dimension is reduced from 56 to 12. Next, the non-linear mapping step, which consists of 4 convolution layers, maps fuzzy features into clear features without changing the number of feature map channels. Further, the expanding step adopts filters to extend the dimensions of the mapped high-resolution feature map and restore the number of channels to 56, which is conducive to the reconstruction of a clearer segmentation map. Finally, the deconvolution step upsamples and aggregates the features using a set of deconvolution filters.

By applying the feature extraction block, a feature map with the size of is obtained. Then, the second part of the decoder is a segmentation block designed based on the Unet++ architecture [18]. It mainly contains three steps: upsampling, downsampling and skip connection, the last of which connects feature maps of different scales. This segmentation block realizes feature extraction and fusion at different scales, and then outputs the final segmentation results.

Due to the lack of large-scale segmentation datasets, direct training of the reported network results in low segmentation accuracy (as validated in Fig. 3). Considering that large-scale natural image datasets are available, we derived a two-stage training strategy that intrinsically corresponds to transfer learning theory [19]. In the first training stage, we apply large-scale natural image datasets to learn image-related prior features, restricting the gradient flow so that it only goes back through the encoder and the feature extraction block. This helps to better extract target features and improve sensing performance. In the second training stage, the gradient flow goes back through the entire network, so that both the encoder and the decoder are updated at the same time. This operation transfers the learned features to the segmentation datasets and improves the learning efficiency of the model. Once the network has converged, the optimized filters of the encoder are set as the light modulation patterns. Correspondingly, the single-pixel detector collects a sequence of coupled measurements that are input into the sensing decoder, which outputs the final segmentation results.
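The gating of the gradient flow in the two stages can be illustrated with a toy update rule. The sketch below is a simplification under stated assumptions: the parameter groups are stand-in vectors named after the modules in the text, the gradients are made up, and plain SGD is swapped in for the Adam solver used in the paper. `STAGE_TRAINABLE` and `sgd_step` are hypothetical names.

```python
import numpy as np

# Which parameter groups receive updates in each training stage.
STAGE_TRAINABLE = {
    1: {"encoder", "feature_extraction"},                  # pre-training on natural images
    2: {"encoder", "feature_extraction", "segmentation"},  # end-to-end training
}

def sgd_step(params, grads, stage, lr=1e-3):
    """Apply one plain SGD update, touching only the groups trainable in `stage`."""
    trainable = STAGE_TRAINABLE[stage]
    return {
        name: value - lr * grads[name] if name in trainable else value.copy()
        for name, value in params.items()
    }

names = ("encoder", "feature_extraction", "segmentation")
params = {name: np.ones(3) for name in names}      # stand-in weights
grads = {name: np.full(3, 0.5) for name in names}  # made-up gradients

after_stage1 = sgd_step(params, grads, stage=1)        # segmentation block untouched
after_stage2 = sgd_step(after_stage1, grads, stage=2)  # every group updated
```

In a deep learning framework the same effect is usually obtained by excluding the frozen block from the optimizer or by disabling gradients for its parameters during the first stage.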


When training the network, we used the normalization initialization method with the bias initialized to 0, and used the Adam solver for gradient optimization. The weight decay was set to 1e-4. The loss objective was set as the Mean Square Error. In the first training stage, we applied the STL-10 dataset [20]. The learning rate was initialized as 2e-3 and was decreased by 0.8 every 20 epochs. In the second training stage, the learning rate was set as 1e-3 and was decreased by 0.8 every 50 epochs. We applied the White Blood Cell (WBC) segmentation dataset [21], which contains 300 images. The corresponding ground truth segmentation results were manually sketched by domain experts, where the nuclei, cytoplasms and background including red blood cells were marked in white, gray and black respectively. We expanded the dataset to 1200 images by horizontal and vertical mirroring, and then expanded these images three times through affine transformation and 50° rotation. Ultimately, we acquired 3600 images, randomly selected 2700 of them to train the network, and evaluated its performance using the other 900 images. Each image is in gray scale and resized to pixels. The entire training process took 7 hours on a computer with an AMD 3700X processor and an NVIDIA RTX 3090 graphics card.
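The learning-rate schedule and the dataset bookkeeping can be checked with a short sketch. It assumes that "decreased by 0.8" means the rate is multiplied by 0.8 at each step, which the text does not state explicitly; `step_decay` is a hypothetical helper name.

```python
def step_decay(base_lr, epoch, step, gamma=0.8):
    """Learning rate at `epoch`, multiplied by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Stage 1: start at 2e-3, step every 20 epochs.
stage1_lr = [step_decay(2e-3, e, step=20) for e in (0, 19, 20, 40)]
# Stage 2: start at 1e-3, step every 50 epochs.
stage2_lr = [step_decay(1e-3, e, step=50) for e in (0, 49, 50, 100)]

# Dataset sizes from the text: 300 originals, 1200 after mirroring
# (a fourfold expansion), 3600 after the threefold expansion.
n_original = 300
n_mirrored = n_original * 4
n_total = n_mirrored * 3
n_train = 2700
n_test = n_total - n_train
```

Under this reading, the stage-one rate is 2e-3 for epochs 0 to 19, 1.6e-3 for epochs 20 to 39, and 1.28e-3 from epoch 40.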

Sampling ratio   Metric   One-stage training (%)   Two-stage training (%)
0.5              PA       96.80                    97.21
                 DICE     80.88                    81.72
0.05             PA       94.35                    97.00
                 DICE     78.68                    81.71
0.01             PA       77.90                    96.76
                 DICE     57.41                    80.89
0.001            PA       77.43                    94.40
                 DICE     57.12                    78.15
0.0002           PA       76.87                    91.30
                 DICE     57.01                    75.77
Table 1: Segmentation accuracy comparison of the two-stage and one-stage training strategies.

To demonstrate the effectiveness of the reported two-stage training strategy, we first compared the segmentation accuracy of the two-stage training strategy with that of conventional one-stage training, as shown in Tab. 1. We employed two metrics to quantitatively evaluate segmentation accuracy: the Pixel Accuracy (PA) and the Dice coefficient (DICE). The PA represents the percentage of correctly marked pixels, and the DICE is an ensemble measure of the structural similarity of two maps [22]. The results in Tab. 1 show that the two-stage training strategy obtains higher PA and DICE than the one-stage training at all tested sampling ratios. When the sampling ratio is 1%, the Dice coefficient reaches 80.89% and the pixel accuracy reaches 96.76%. This validates that accurate segmentation can be achieved using two orders of magnitude less input data.
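For concreteness, the two metrics can be computed as in the sketch below. It is a minimal illustration, not the evaluation code behind Tab. 1: the label encoding (0 background, 1 cytoplasm, 2 nucleus) and the toy 4×4 maps are assumptions, and the text does not say how DICE is aggregated over classes, so the score is shown per label.

```python
import numpy as np

def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    return float(np.mean(pred == truth))

def dice_coefficient(pred, truth, label):
    """Dice score 2|A∩B| / (|A| + |B|) for the binary masks of one label."""
    a, b = pred == label, truth == label
    denom = a.sum() + b.sum()
    # Convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else float(2.0 * np.logical_and(a, b).sum() / denom)

# Hypothetical encoding: 0 = background, 1 = cytoplasm, 2 = nucleus.
truth = np.array([[0, 0, 1, 1],
                  [0, 2, 2, 1],
                  [0, 2, 2, 1],
                  [0, 0, 1, 1]])
pred = truth.copy()
pred[0, 2] = 0  # mislabel one cytoplasm pixel as background

pa = pixel_accuracy(pred, truth)
dice_cytoplasm = dice_coefficient(pred, truth, label=1)
dice_nucleus = dice_coefficient(pred, truth, label=2)
```

With one wrong pixel out of 16, the PA is 15/16, while the Dice score of the affected class drops to 10/11, which shows why DICE is the more sensitive of the two metrics for small regions.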

Figure 3: The comparison of the one-stage and two-stage training strategies at different sampling ratios on the WBC dataset.

The exemplar segmentation maps of the two training strategies are presented in Fig. 3. We can see that as the sampling ratio reduces from 0.5 to 0.01, the segmentation results of the two-stage training strategy maintain high fidelity to the ground truth. In contrast, the one-stage strategy produces distorted segmentation maps even at the sampling ratio of 0.5, and they degrade seriously as the sampling ratio decreases. The superior performance of the reported two-stage training strategy originates from its intrinsic transfer learning nature. Specifically, the two-stage training runs between a source domain (classification features) and a target domain (segmentation features), which transfers image-based semantic knowledge to segmentation features of single-pixel measurements. Such a strategy effectively improves the generalization ability of the network and the corresponding segmentation accuracy.

Sampling ratio   Metric   Random modulation (%)   Hadamard modulation (%)   Optimized modulation (%)
0.1              PA       96.28                   96.95                     97.08
                 DICE     78.34                   80.11                     81.78
0.05             PA       95.61                   97.00                     97.01
                 DICE     78.49                   80.84                     81.71
0.01             PA       94.39                   96.51                     96.76
                 DICE     76.69                   79.61                     80.89
0.0002           PA       76.83                   76.72                     91.30
                 DICE     56.16                   56.77                     75.77
Table 2: Segmentation accuracy on the WBC dataset under different modulation patterns (random patterns, Hadamard patterns and the optimized patterns) and different sampling ratios (0.0002-1).

Then, we trained two more networks with the encoder filters fixed to random and Hadamard modulation patterns [23], to illustrate the effectiveness of the reported optimized modulation. In the training process of these two networks, the encoder was fixed while only the decoder parameters were updated. The segmentation performance of the different modulation strategies is presented in Tab. 2. The results show that the optimized modulation obtains the highest accuracy, especially at low sampling ratios. Specifically, the DICE of the optimized modulation at 0.01 sampling ratio (80.89) is even higher than that of the other two modulation strategies at 0.1 sampling ratio (78.34 and 80.11), while its PA (96.76) remains comparable to theirs (96.28 and 96.95). This validates that the optimized illumination patterns better extract target features and improve segmentation accuracy.
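The two fixed baselines can be generated as in the following sketch. It is illustrative only: the Sylvester construction, the choice of the first rows of the Hadamard matrix, the uniform random distribution, the 8×8 scene size and the function names are all assumptions, since the text does not specify how the baseline patterns were selected or displayed.

```python
import numpy as np

def hadamard_matrix(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def hadamard_patterns(side, count):
    """First `count` rows of the Hadamard matrix, each reshaped to side x side."""
    return hadamard_matrix(side * side)[:count].reshape(count, side, side)

def random_patterns(side, count, seed=0):
    """Uniform random gray-scale patterns with values in [0, 1)."""
    return np.random.default_rng(seed).random((count, side, side))

side = 8                         # hypothetical 8x8 scene
count = int(0.1 * side * side)   # sampling ratio 0.1 gives 6 patterns here
had = hadamard_patterns(side, count)
rnd = random_patterns(side, count)
```

Hadamard rows are mutually orthogonal with entries of +1 and -1, whereas the learned patterns of the reported method are unconstrained gray-scale kernels.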

Next, with the same set of single-pixel measurements as input, we compared the segmentation accuracy of the reported image-free technique with that of the conventional first-reconstruction-then-segmentation image-based methods at different sampling ratios. Two state-of-the-art single-pixel imaging reconstruction algorithms were introduced for comparison: the total-variation-based reconstruction technique (TV-Rec) [24] and the deep-learning-based reconstruction method (DL-Rec) [25]. The images reconstructed by these two techniques were input into the pre-trained Unet++ segmentation network to produce the final semantic maps.

Figure 4: The comparison of segmentation accuracy between the reported image-free segmentation technique and the conventional first-reconstruction-then-segmentation method at different sampling ratios (0.0002-1). (a) and (b) show the PA and DICE evaluations respectively.

Figure 4 shows the quantitative comparison among the above segmentation methods at different sampling ratios (0.0002-1). We can see that the TV-Rec technique obtains higher segmentation accuracy when the sampling ratio is higher than 0.2, while its segmentation accuracy degrades quickly at low sampling ratios due to unsatisfactory image reconstruction quality. The deep-learning reconstruction improves reconstruction quality at low sampling ratios to a certain degree. In contrast, the reported image-free segmentation technique performs the best when the sampling ratio is lower than 0.02. As the sampling ratio reduces from 1.0 to 0.0002, the PA and DICE of the reported technique decrease only slightly, within a range of 6%. Figure 5 (a) shows several exemplar segmentation maps of the three methods at 0.01 sampling ratio. We can see that the TV-Rec technique failed to segment both the nuclei and cytoplasms areas. Although the DL-Rec technique produces a clearer cytoplasms region than the TV-Rec method, there exist serious aberrations in its segmented nuclei regions compared with the ground truth. In contrast, the reported image-free segmentation technique produces the best visual quality among all the competing algorithms. It distinctly segments nuclei and cytoplasms with fewer aberrations.

Figure 5: The segmentation maps on synthetic measurements, with the sampling ratio being 0.01. (a) and (b) show the results of the WBC dataset and the UAS dataset respectively.

In addition, we performed another simulation comparison of the different segmentation methods on the UESTC all-day Scenery (UAS) dataset [26]. The UAS dataset provides all-weather road images and corresponding binary labels, which discriminate the passable and impassable areas. We employed the images of four kinds of weather, namely sun, dusk, night and rain, which contain 5670 training images and 710 testing images. We retrained and tested the image-free segmentation method and the conventional first-reconstruction-then-segmentation methods, and the results are presented in Fig. 5(b) (the sampling ratio is 0.01). We can see that the image-free segmentation technique produces better segmentation results than the conventional methods. It produces clear passable areas with high quality at a low sampling ratio, while the TV-Rec technique failed to acquire the road information needed for segmentation, and the DL-Rec technique produced a wrong structure of the passable areas.


Figure 6: The proof-of-concept setup of image-free segmentation. The light source illuminated the film printed with the target scene. The DMD implemented the modulation of the pre-trained optimized patterns, and the single-pixel detector acquired the total intensity of the light field. The single-pixel measurements were input into the decoder network to output the final segmentation results.

To further validate the effectiveness of the reported technique in practical applications, we built a proof-of-concept setup to acquire single-pixel measurements, as shown in Fig. 6. The pre-trained patterns were projected by a digital micromirror device (DMD, ViALUX V-7001) for illumination modulation. We set the sampling ratio as 0.01, and 40 patterns were projected for each target. The modulated light was projected onto a transmissive film printed with the target scene. The coupled signal was focused by a lens onto a Si amplified photodetector (Thorlabs PDA100A2, 320–1100 nm). The measurements corresponding to different modulation patterns were input into the decoder network to output the final semantic segmentation results. The exemplar segmentation maps are presented in Fig. 7. We can see that the nuclei and cytoplasms areas were wrongly segmented by the TV-Rec method, while the DL-Rec method still shows segmentation aberrations. In comparison, the results of the image-free segmentation technique are consistent with the ground truth with high fidelity.
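The relation between the sampling ratio and the number of projected patterns can be made explicit with a small calculation. The sketch assumes that the pattern count is the sampling ratio times the pixel count, rounded down. The 64×64 scene size is a hypothetical value chosen because it is consistent with 40 patterns at a ratio of 0.01; it is not stated in this text.

```python
def num_patterns(sampling_ratio, height, width):
    """Number of modulation patterns (one measurement each), rounded down."""
    return int(sampling_ratio * height * width)

# 40 patterns at a sampling ratio of 0.01 imply a scene of roughly 4000 pixels.
implied_pixels = round(40 / 0.01)

# Hypothetical 64x64 scene (4096 pixels), consistent with the figure above.
patterns_64 = num_patterns(0.01, 64, 64)

# Measurements per scene relative to a full pixel-wise acquisition.
reduction_factor = round(1 / 0.01)
```

A reduction factor of 100 corresponds to the two orders of magnitude of data saving claimed for the technique.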

Figure 7: The segmentation results on experimental data.

Conclusion and Discussion

In summary, we report a novel image-free single-pixel segmentation technique that maintains low hardware and software complexity. Unlike the conventional first-reconstruction-then-segmentation methods, which require 2-D image acquisition, transmission and processing, the reported technique directly produces segmentation maps from compressed 1-D measurements through an end-to-end neural network, which reduces both the hardware and software complexity of the system. Besides, the modulation patterns in the single-pixel acquisition process are pre-trained and optimized together with the sensing network, which improves both the system’s acquisition and segmentation efficiency. In such a computational sensing framework, the reported technique effectively reduces the amount of data by two orders of magnitude, which might open up a new avenue for real-time scene segmentation on resource-limited platforms such as unmanned aerial vehicles and unmanned vehicles.

The image-free single-pixel segmentation technique can be further extended. First, the segmentation accuracy can be further improved by introducing finely designed network modules such as the self-attention module [27], which focuses on the areas that need to be segmented. Second, although the learned modulation patterns achieve higher segmentation efficiency, they are gray-scale patterns, which take longer to implement on a DMD or other light modulators than binary patterns. We can further train binarized convolution kernels that maintain high implementation speed to lower the acquisition time [28]. Third, the current network runs on a GPU server. To make the system applicable on resource-limited embedded platforms, we will further simplify the network model by judging the importance of different convolution kernels and deleting filters with little loss of accuracy to reduce the network scale [29, 30].


  • [1] Zaitoun, N. M. & Aqel, M. J. Survey on image segmentation techniques. Procedia Computer Sci. 65, 797–806 (2015).
  • [2] Kuruvilla, J., Sukumaran, D., Sankar, A. & Joy, S. P. A review on image processing and image segmentation. In 2016 international conference on data mining and advanced computing (SAPIENCE), 198–203 (IEEE, 2016).
  • [3] Ghosh, S., Das, N., Das, I. & Maulik, U. Understanding deep learning techniques for image segmentation. ACM Computing Surveys (CSUR) 52, 1–35 (2019).
  • [4] Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
  • [5] Badrinarayanan, V., Kendall, A. & Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. on Pattern Anal. and Mach. Intell. 39, 2481–2495 (2017).
  • [6] Leibe, B., Seemann, E. & Schiele, B. Pedestrian detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 878–885 (IEEE, 2005).
  • [7] Kang, W.-X., Yang, Q.-Q. & Liang, R.-P. The comparative research on image segmentation algorithms. In 2009 First International Workshop on Education Technology and Computer Science, vol. 2, 703–707 (IEEE, 2009).
  • [8] Minaee, S. et al. Image segmentation using deep learning: A survey. IEEE Trans. on Pattern Anal. and Mach. Intell. (2021).
  • [9] Song, Y. & Yan, H. Image segmentation techniques overview. In 2017 Asia Modelling Symposium (AMS), 103–107 (IEEE, 2017).
  • [10] Kulkarni, K. & Turaga, P. Reconstruction-free action inference from compressive imagers. IEEE T. Pattern Anal. 38, 772–784 (2015).
  • [11] Lohit, S., Kulkarni, K., Turaga, P., Wang, J. & Sankaranarayanan, A. C. Reconstruction-free inference on compressive measurements. Conference on Computer Vision and Pattern Recognition (CVPR) 16–24 (2015).
  • [12] Lohit, S., Kulkarni, K. & Turaga, P. Direct inference on compressive measurements using convolutional neural networks. In International Conference on Image Processing (ICIP), 1913–1917 (IEEE, 2016).
  • [13] Adler, A., Elad, M. & Zibulevsky, M. Compressed learning: A deep neural network approach. arXiv preprint arXiv:1610.09615 (2016).
  • [14] Xu, Y., Liu, W. & Kelly, K. F. Compressed domain image classification using a dynamic-rate neural network. IEEE Access 8, 217711–217722 (2020).
  • [15] Fu, H., Bian, L. & Zhang, J. Single-pixel sensing with optimal binarized modulation. Opt. Lett. 45, 3111–3114 (2020).
  • [16] Zhong, J., Zhang, Z., Li, X., Zheng, S. & Zheng, G. Image-free classification of fast-moving objects using ’learned’ structured illumination and single-pixel detection. Opt. Express 28 (2020).
  • [17] Dong, C., Loy, C. C. & Tang, X. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision (ECCV), 391–407 (Springer, New York, NY, USA, 2016).
  • [18] Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N. & Liang, J. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE T. Med. Imaging 39, 1856–1867 (2019).
  • [19] Tan, C. et al. A survey on deep transfer learning. In International Conference on Artificial Neural Networks (ICANN), 270–279 (Springer, 2018).
  • [20] Coates, A., Ng, A. & Lee, H. An analysis of single layer networks in unsupervised feature learning. In AISTATS (2011).
  • [21] Zheng, X., Wang, Y., Wang, G. & Liu, J. Fast and robust segmentation of white blood cell images by self-supervised learning. Micron 107, 55–71 (2018).
  • [22] Bertels, J. et al. Optimizing the dice score and jaccard index for medical image segmentation: Theory and practice. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 92–100 (Springer, 2019).
  • [23] Zhang, Z., Wang, X., Zheng, G. & Zhong, J. Hadamard single-pixel imaging versus fourier single-pixel imaging. Opt. Express 25, 19619–19639 (2017).
  • [24] Bian, L., Suo, J., Dai, Q. & Chen, F. Experimental comparison of single-pixel imaging algorithms. J. Opt. Soc. Am. A 35, 78–87 (2018).
  • [25] Higham, C. F., Murray-Smith, R., Padgett, M. J. & Edgar, M. P. Deep learning for real-time single-pixel video. Sci. Rep. 8, 1–9 (2018).
  • [26] Zhang, Y. et al. Road segmentation for all-day outdoor robot navigation. Neurocomputing 314, 316–325 (2018).
  • [27] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • [28] Li, Y. et al. Discrete cosine single-pixel salient object detection base on deep learning via fast binary illumination. In CLEO: QELS_Fundamental Science, JTh2E–33 (Optical Society of America, 2020).
  • [29] Molchanov, P., Mallya, A., Tyree, S., Frosio, I. & Kautz, J. Importance estimation for neural network pruning. In Conference on Computer Vision and Pattern Recognition (CVPR), 11264–11272 (2019).
  • [30] Lin, M. et al. Hrank: Filter pruning using high-rank feature map. In Conference on Computer Vision and Pattern Recognition (CVPR), 1529–1538 (2020).