The optical microscope remains a critical tool in medicine and biology. Over the past several decades, advances in digital microscopes and associated post-processing algorithms have led to fundamental improvements in resolution [8], 3D imaging depth [14] and acquisition speed [7], opening up new avenues for scientific discovery. At the same time, these improvements have led to increasingly complex microscope devices and astronomically larger datasets.
To help handle the increasingly large quantities of raw data collected from various biological systems, many researchers are now turning to machine learning. However, most machine learning methods are currently used to process “standard” microscope image data that has already been digitally captured and saved. These algorithms do not influence how data is captured and make no effort to improve the quality or relevance of the images.
In this work, we aim to change this paradigm by creating a sensing framework for an “intelligent” microscope, whose physical parameters (e.g., illumination settings, lens configuration, and sensor properties) are adaptively tuned to capture better task-specific images that contain the most relevant information for each learning goal. To achieve this, we integrate an attention mechanism, which models active control of the microscope, within a deep learning post-processing system. Using closed-loop feedback and reinforcement learning, we optimize a series of measurements, captured under different microscope settings, to maximize the performance of our automated decision software.
Effectively, this principle affords the microscope a large amount of flexibility to extract useful information from the sample of interest. Much in the same way as we physically interact with a new object when trying to understand what it is, we aim here to allow the microscope to reason through and execute physical interactions to form decisions about each sample with as high an accuracy as possible. This approach is directly related to reinforcement learning techniques [12], in which agents iteratively explore and interact with their environment to complete a task.
Our first aim for this new method is to help with the general problem of improving the quality of measurements for automated microscope image diagnosis. Almost always, samples extracted from patients are either too large to fit within the microscope’s field of view, or too thick, and thus scatter light significantly. In this paper, we examine two classification problems: the identification of the malaria parasite within experimentally collected images of blood smears, and a heavily modified version of the MNIST digit recognition problem. Although both tasks are classification-based, we argue that the presented paradigm can be generally applied to any high-dimensional, controllable measurement scheme with a single desired artificial intelligence-driven outcome.
2 Previous Work
Several recent works consider the use of machine learning to jointly optimize hardware and software for imaging tasks [15, 11, 6, 3, 2, 4, 9]. These approaches aim to find a fixed set of optical parameters that is optimal for a particular task. Although these methods improve performance relative to standard parameterizations, their results are optimal only for a single task and provide no means of adapting to individual samples.
Two recent works [10, 5] have studied the impact of optimizing a programmable LED array over entire datasets. They showed that a fixed, optimized pattern increases performance in both classification and image-to-image mapping tasks, and that including programmable illumination within a microscope allows joint hardware-software optimization using machine learning. The programmable hardware leads to better task-specific performance and can be tuned without any moving parts.
Adaptively choosing imaging parameters is a relatively underexplored area. Yang et al. [17], for example, consider the use of reinforcement learning for real-time metering and exposure control. However, no work has yet aimed to dynamically change acquisition parameters during multi-image capture in response to the contents of the sample.
Visual attention mechanisms allow machine learning algorithms to self-select the visual information that is most relevant for further processing. The concept of recurrent visual attention was first shown by Mnih et al. [13], where a recurrent policy learned to iteratively sample a “glimpse” of the x-y plane. Further work has both reinforced the performance of recurrent visual attention [1] and expanded the attention mechanism [16].
3.1 Adaptive Sensing
We consider an agent interacting with a visual environment (i.e., the sample of interest) via an imaging system. The agent has direct control of the parameters of the imaging system through a visual attention mechanism, allowing it to self-select the information required to accomplish its task. The system is bandwidth-limited: the interaction of light with both the lens and the sensor prevents complete capture of the environment state within a single image. In response, the agent must synthesize information across multiple time steps to build an accurate state representation. At each step, the agent is given a choice: either make a decision about the sample using the information gathered so far (e.g., predict its class), or select new information through the integrated attention mechanism (i.e., capture a new image under different illumination). This choice presents a trade-off between acquisition cost and task performance, which can be tuned to the needs of the overarching system.
The image formation process is modelled as $\mathcal{F}$, such that the image $I$ of a given sample $S$ formed by a system parameterized by $\theta$ is given by $I = \mathcal{F}(S; \theta)$. An agent (Figures 2 and 3) is constructed that optimizes the configuration of the hardware across multiple acquisitions jointly with the post-processing of the images. A Visual Encoder, $f_e$, converts each observed image $I_t = \mathcal{F}(S; \theta_t)$ into an embedded representation, $e_t = f_e(I_t)$. This representation is fed into a recurrent unit, $f_h$, which maintains the agent’s internal state by synthesizing embeddings across the history of observations: $h_t = f_h(e_t, h_{t-1})$.
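At each step, the agent’s decision engine, $f_d$, uses the internal state $h_t$ to decide whether enough information has been gathered to make an accurate classification, or whether more samples are required. Depending on this outcome, either a new system parameterization is produced by the Parameter Network, $f_\theta$, or a classification decision is made using the Classification Network, $f_c$:

$$\theta_{t+1} = f_\theta(h_t) \ \text{ if } f_d(h_t) = \text{stay}, \qquad \hat{y} = f_c(h_t) \ \text{ if } f_d(h_t) = \text{exit}.$$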
The implemented agent uses two-layer fully-connected networks for two of the components above and a single fully-connected layer for a third, while the recurrent unit is a 256-unit LSTM.
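To make the control flow of the agent concrete, the following is a minimal NumPy sketch of one inference episode: encode the current image, update the recurrent state, then either exit and classify or emit a new LED pattern and capture again. It is not the authors’ implementation. All layer sizes, the uniform initial pattern, the sigmoid used to bound brightnesses, the greedy (argmax) stay/exit choice and the single-layer encoder are assumptions, the weights are random and untrained, and a plain tanh RNN cell stands in for the 256-unit LSTM to keep the example short. The names `f_e`, `f_h`, `f_d`, `f_theta` and `f_c` in the comments refer to the encoder, recurrent unit, decision engine, parameter network and classifier described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes are illustrative assumptions, not values from the paper.
N_LEDS, IMG_SIDE, HID, N_CLASSES, MAX_STEPS = 25, 28, 32, 10, 5
STAY, EXIT = 0, 1


def dense(n_in, n_out):
    """Randomly initialised weights for one (untrained) fully-connected layer."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)


W_e, b_e = dense(IMG_SIDE * IMG_SIDE, HID)  # f_e: visual encoder (assumed single layer)
W_h, b_h = dense(2 * HID, HID)              # f_h: tanh RNN cell standing in for the LSTM
W_d, b_d = dense(HID, 2)                    # f_d: decision engine, logits for [stay, exit]
W_p, b_p = dense(HID, N_LEDS)               # f_theta: parameter network -> LED brightnesses
W_c, b_c = dense(HID, N_CLASSES)            # f_c: classification network


def capture(single_led_images, c):
    """Incoherent image formation: brightness-weighted sum of single-LED images."""
    return np.tensordot(c, single_led_images, axes=1)


def run_agent(single_led_images):
    """Run one episode; returns (predicted class, number of images captured)."""
    c = np.full(N_LEDS, 1.0 / N_LEDS)  # initial pattern: uniform (assumption)
    h = np.zeros(HID)
    for t in range(1, MAX_STEPS + 1):
        image = capture(single_led_images, c).ravel()
        e = np.tanh(image @ W_e + b_e)                   # e_t = f_e(I_t)
        h = np.tanh(np.concatenate([e, h]) @ W_h + b_h)  # h_t = f_h(e_t, h_{t-1})
        action = int(np.argmax(h @ W_d + b_d))           # greedy stay/exit choice
        if action == EXIT or t == MAX_STEPS:
            return int(np.argmax(h @ W_c + b_c)), t      # y_hat = f_c(h_t)
        # Sigmoid keeps each new LED brightness in [0, 1].
        c = 1.0 / (1.0 + np.exp(-(h @ W_p + b_p)))       # theta_{t+1} = f_theta(h_t)


stack = rng.random((N_LEDS, IMG_SIDE, IMG_SIDE))  # hypothetical single-LED images
label, steps = run_agent(stack)
print(label, steps)
```

During training, one would typically sample the stay/exit action from the decision engine’s softmax rather than taking the argmax, so that the reward signal can explore both branches; the hard cap `MAX_STEPS` is likewise an illustrative choice that guarantees every episode terminates.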
3.2 LED array illumination
In our work, we consider the microscope as the visual environment, and optimize over the crucial element of illumination. Using a programmable LED array, we can design an optimal light pattern, which could be a mixture of bright-field and dark-field illumination, to highlight sample features that are important for a particular task [10, 5].
Consider the image $I_i$ of the sample $S$ formed by turning on the $i$th LED in the array with brightness $c_i$. It can be written as $I_i = \mathcal{F}(S; c_i)$, where $\mathcal{F}$ is the physical model of image formation. Under linear image formation, $I_i$ can be found by multiplying $c_i$ with the image $\hat{I}_i$ formed with the $i$th LED at a fixed brightness:

$$I_i = c_i \hat{I}_i.$$
In the absence of noise, and assuming the LEDs are mutually incoherent, the image formed by turning on multiple LEDs is equal to the sum of the images captured with each LED turned on individually [10]. Specifically, if the brightness of the $i$th LED is denoted $c_i$ and the associated image formed by illuminating the sample with this LED only, at a fixed brightness, is $\hat{I}_i$, then the image $I$ formed by turning on a set of $N$ LEDs at variable brightness is given by:

$$I = \sum_{i=1}^{N} c_i \hat{I}_i.$$
The set of LED brightnesses is denoted $\mathbf{c} = (c_1, \dots, c_N)$, which is the parameterization of the imaging system, such that $\theta = \mathbf{c}$. We use this image formation process to construct a visual attention mechanism that uses the LED brightnesses ($\mathbf{c}$) to select information from the underlying sample $S$.
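The incoherent image formation model above amounts to a brightness-weighted sum over a stack of single-LED images. A minimal NumPy sketch, assuming the single-LED images are already available as an `(N, H, W)` array (the function name `illuminate` and the random stand-in data are hypothetical), shows that any pattern can then be synthesized digitally, and checks the two properties stated in the text: a single LED simply scales its reference image, and a multi-LED image is the sum of the scaled single-LED images.

```python
import numpy as np


def illuminate(c, single_led_images):
    """Noise-free image for mutually incoherent LEDs: I = sum_i c_i * I_hat_i.

    c                 -- (N,) LED brightnesses (the system parameters)
    single_led_images -- (N, H, W) stack; entry i is the image with only LED i
                         on at the fixed reference brightness
    """
    return np.einsum("i,ihw->hw", np.asarray(c, dtype=float), single_led_images)


rng = np.random.default_rng(1)
I_hat = rng.random((25, 8, 8))   # hypothetical single-LED image stack
c = rng.random(25)               # hypothetical brightness pattern

I = illuminate(c, I_hat)

# One LED on at brightness 0.7 simply scales that LED's reference image.
one_hot = np.zeros(25)
one_hot[3] = 0.7
assert np.allclose(illuminate(one_hot, I_hat), 0.7 * I_hat[3])

# A multi-LED image equals the sum of the individually scaled single-LED images.
assert np.allclose(I, sum(c[i] * I_hat[i] for i in range(25)))
```

Because the map from $\mathbf{c}$ to $I$ is linear, it is also differentiable with respect to the brightnesses, which is what allows an illumination pattern to be produced by a network and optimized together with the post-processing.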
3.3 Data preparation
For the MNIST-based task, we simulate microscope image formation under variable illumination using the MNIST dataset. Each normalized MNIST image was used to define the height profile of a thin, translucent sample (maximum thickness …). The profile was then sequentially illuminated by 25 distinct LEDs placed in a grid located 50 mm beneath the sample at a 6 mm pitch. The optical field from each LED was propagated through the sample and a simulated objective lens with … magnification and 0.175 NA. Finally, a simulated detector (…, … μm pixels) received the optical field and formed an image with Gaussian readout noise. The final training set consisted of 60000 samples, while the test set contained 10000. Each image set was stored as a tensor.
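The stated geometry (25 LEDs, 6 mm pitch, 50 mm below the sample, 0.175 NA objective) determines which LEDs provide bright-field and which provide dark-field illumination. The sketch below assumes a 5×5 grid centred on the optical axis, considers only the point of the sample on that axis, and uses the standard criterion that an LED is bright-field when the sine of its illumination angle is below the objective NA. Under these assumptions, the inner 3×3 block is bright-field and the outer ring is dark-field, consistent with the mixture of bright-field and dark-field patterns mentioned in Section 3.2.

```python
import numpy as np

PITCH_MM, DIST_MM, NA = 6.0, 50.0, 0.175

# 5x5 grid of LED positions, assumed centred on the optical axis.
offsets = (np.arange(5) - 2) * PITCH_MM
x, y = np.meshgrid(offsets, offsets)
r = np.hypot(x, y)                      # lateral distance of each LED from the axis

# Sine of each LED's illumination angle, seen from the on-axis sample point.
sin_theta = r / np.hypot(r, DIST_MM)

# Bright-field if the illumination angle falls inside the objective's NA.
brightfield = sin_theta < NA
print(brightfield.sum(), (~brightfield).sum())   # -> 9 16
```

The diagonal neighbours of the centre LED sit just inside the cutoff (sin θ ≈ 0.167), so for off-axis points of a finite field of view some of these LEDs may cross into dark-field; the sketch ignores that effect.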
Our experimental dataset consists of images of blood cells from thin-smear slides used in the clinic to diagnose malaria (experimental details in Ref. [10]). The images were cropped and labelled to construct a binary classification task: diagnosing the presence of the malaria parasite. Variable illumination was provided by 29 LEDs, each containing 3 individually addressable spectral channels, creating 87 uniquely illuminated images per sample. The images were augmented by flipping and rotation, creating a total of 4100 samples stored as tensors. Train and test sets were constructed with an 80-20 split.
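The bookkeeping for the malaria dataset can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors’ pipeline: the helper names `dihedral_augment` and `train_test_split_indices` are hypothetical, the augmentation is assumed to use the 8 flip/rotation variants of a square crop (the text does not say which transforms were used), and the split is a simple seeded random permutation of sample indices.

```python
import numpy as np

N_LEDS, N_CHANNELS, N_SAMPLES = 29, 3, 4100
N_ILLUMINATIONS = N_LEDS * N_CHANNELS   # 87 uniquely illuminated images per sample


def dihedral_augment(stack):
    """Return the 8 flip/rotation variants of a (C, H, W) image stack.

    The same transform is applied to every illumination channel, so the
    single-LED images of a sample stay spatially registered.
    """
    variants = []
    for k in range(4):
        rotated = np.rot90(stack, k, axes=(1, 2))
        variants.extend([rotated, np.flip(rotated, axis=2)])
    return variants


def train_test_split_indices(n_samples, train_frac=0.8, seed=0):
    """Shuffle sample indices and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    return idx[:n_train], idx[n_train:]


train_idx, test_idx = train_test_split_indices(N_SAMPLES)
print(N_ILLUMINATIONS, len(train_idx), len(test_idx))   # -> 87 3280 820
```

Applying each geometric transform to all 87 channels at once matters here, since the learned illumination pattern combines the channels pixel by pixel.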
Our adaptive machine learning model uses two distinct outputs to accomplish its task. The first is a classifier, which translates the hidden state into a classification decision. The second is a decision engine, which evaluates overall classification confidence and decides whether enough information has been captured to make a correct classification, thus “exiting” the feedback loop. The exit mechanism allows samples to be evaluated over variable-length trajectories, capturing more data when necessary and exiting to make a decision when appropriate.
A successful exit is defined as an exit followed by a correct classification, and an unsuccessful exit as an exit followed by an incorrect classification. We assign a reward to each of these outcomes, which allows control of the trade-off between acquisition cost and accuracy. The decision engine is trained using this outcome-dependent reward, mapped to a cross-entropy loss function, and the classifier is trained with a categorical cross-entropy loss. Although the classification loss does not affect the decision model and vice versa, both losses influence the other portions of the model.
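The text does not specify how the outcome-dependent reward is mapped to a cross-entropy loss, so the sketch below swaps in one common choice: a REINFORCE-style loss, in which the cross-entropy of the chosen stay/exit action is weighted by its reward. The reward values (`+r_exit` for a successful exit, `-r_exit` for an unsuccessful one, a fixed `r_stay` for staying) and all function names are hypothetical; the classifier loss is ordinary categorical cross-entropy.

```python
import numpy as np

STAY, EXIT = 0, 1


def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def decision_reward(action, correct, r_exit, r_stay):
    """Hypothetical outcome-dependent reward for one decision-engine step.

    Successful exit -> +r_exit, unsuccessful exit -> -r_exit, stay -> r_stay.
    """
    if action == STAY:
        return r_stay
    return r_exit if correct else -r_exit


def decision_loss(decision_logits, action, reward):
    """Reward-weighted cross-entropy (REINFORCE-style) on the chosen action."""
    return -reward * log_softmax(decision_logits)[action]


def classification_loss(class_logits, label):
    """Categorical cross-entropy for the classifier output."""
    return -log_softmax(class_logits)[label]


# Example: the agent exits and classifies correctly.
logits = np.array([0.2, -0.4])   # hypothetical [stay, exit] logits
r = decision_reward(EXIT, correct=True, r_exit=1.0, r_stay=-0.05)
print(r, decision_loss(logits, EXIT, r))
```

Minimizing $-r \log \pi(a)$ raises the probability of actions that earned a positive reward and lowers it for actions that earned a negative one, so increasing `r_exit` relative to `r_stay` shifts how readily the agent exits.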
We conducted two experiments, using the simulated MNIST and the physically collected malaria image data, to understand the benefit of an adaptive attention mechanism. For each experiment, we ran both a baseline (a single forward pass of the agent’s model, with a fixed learned LED pattern) and our new adaptive approach. Using the same forward model in each case allowed a fair comparison between the approaches, uninfluenced by network architecture. We evaluated the agent’s performance across a sweep of exit decision rewards ($r_{\text{exit}}$) over the range …, while the stay decision reward ($r_{\text{stay}}$) remained fixed at …. Figure 5 shows a visualisation of the trajectories that the network took for a sample; we observe that the agent prefers to probe the samples under a variety of different illumination schemes during the decision process.
These results show that an adaptive sampling paradigm can effectively trade off task performance against acquisition cost. Both the MNIST and malaria datasets (Figure 4) show a positive relationship, with diminishing returns, between average trajectory length and overall system accuracy. This relationship is driven by additional information: longer trajectories allow the agent to gather more information about the sample before making a decision. We observe, however, that more information isn’t always better: some systems are limited not by the information that is available, but by their ability to process it. We hypothesize that this is the case for the malaria classification task, where, although the positive relationship does exist, a more direct training paradigm (the baseline method) offers higher performance.
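Each point on a trajectory-length versus accuracy curve summarizes one trained agent from the reward sweep. A minimal sketch of that summary step, assuming per-sample evaluation logs of trajectory length, prediction and label for each exit-reward setting (the function names and the logged numbers below are hypothetical, not results from this work):

```python
import numpy as np


def tradeoff_point(trajectory_lengths, predictions, labels):
    """Average trajectory length and accuracy for one reward setting."""
    lengths = np.asarray(trajectory_lengths, dtype=float)
    correct = np.asarray(predictions) == np.asarray(labels)
    return float(lengths.mean()), float(correct.mean())


def tradeoff_curve(runs):
    """runs: {r_exit: (lengths, predictions, labels)} -> points sorted by length."""
    points = [tradeoff_point(*logs) for logs in runs.values()]
    return sorted(points)


# Hypothetical evaluation logs for two exit-reward settings.
runs = {
    0.5: ([1, 1, 2, 1], [0, 1, 0, 0], [0, 1, 1, 1]),
    2.0: ([1, 3, 2, 4], [0, 1, 1, 0], [0, 1, 0, 0]),
}
print(tradeoff_curve(runs))   # -> [(1.25, 0.5), (2.5, 0.75)]
```

Sorting by average trajectory length, rather than by the reward value itself, gives the curve directly in terms of acquisition cost, which is the axis on which the adaptive agent and the single-shot baseline can be compared.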
This work establishes a reinforcement learning-based framework for deploying an adaptive imaging system, in which an attention mechanism performs data-dependent sampling. Although our experiments are limited to controlling illumination for a classification task, we expect our framework to extend to other elements of the microscope and to different kinds of tasks. By including an exit mechanism within the recurrent structure, we establish a relationship between trajectory length and task performance. We postulate that this relationship is not only task-dependent but also depends on the processing capability of the network. A classification system may not be fundamentally information-limited; however, as the amount of information demanded by the task increases (such as in an image-to-image inference task), we expect this trajectory-to-performance curve to shift. We demonstrate that adaptive systems not only offer increased performance compared to fixed imaging systems, but also show a path towards the integration of data capture and processing.
References
[1] Multiple object recognition with visual attention. CoRR abs/1412.7755, 2014.
[2] On the use of deep learning for computational imaging. Optica 6 (8), pp. 921–943, 2019.
[3] Learning sensor multiplexing design through back-propagation. In Advances in Neural Information Processing Systems, pp. 3081–3089, 2016.
[4] Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification. Sci. Rep.
[5] Physics-Enhanced Machine Learning for Microscope Image Segmentation. In preparation, 2019.
[6] Using Machine Learning to Optimize Phase Contrast in a Low-cost Cellphone Microscope. PLoS One, 2018.
[7] Single-shot Compressed Ultrafast Photography at One Hundred Billion Frames per Second. Nature 516 (7529), pp. 74, 2014.
[8] Far-field Optical Nanoscopy. Science 316 (5828), pp. 1153–1158, 2007.
[9] Multicolor localization microscopy and point-spread-function engineering by deep learning. Opt. Express 27 (5), pp. 6158–6183, 2019.
[10] Convolutional Neural Networks that Teach Microscopes How to Image. arXiv:1709.07223, 2017.
[11] Physics-based Learned Design: Optimized Coded-Illumination for Quantitative Phase Imaging. IEEE Trans. Comput. Imaging, 2019.
[12] Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[13] Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems 27, pp. 2204–2212, 2014.
[14] In Vivo Three-Photon Imaging of Activity of GCaMP6-labeled Neurons Deep in Intact Mouse Brain. Nature Methods 14 (4), pp. 388.
[15] End-to-end Optimization of Optics and Image Processing for Achromatic Extended Depth of Field and Super-resolution Imaging. ACM Trans. Graph. (SIGGRAPH).
[16] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning, pp. 2048–2057, 2015.
[17] Personalized Exposure Control Using Adaptive Metering and Reinforcement Learning. arXiv:1803.02269, 2018.