In many engineered networks, a payload needs to be transported between nodes without central control. These systems are often represented as weighted, directed graphs. Each edge has a fixed capacity, which represents the maximum amount of traffic the edge can carry at any time. Traffic in these networks consists of a series of flows, each of which contains some amount of data that originates at a source node and attempts to reach a target node via a path in the network. For example, in vehicular transportation systems (nodes are intersections, edges are roads), cars travel from one location to another, and each road has a capacity that limits the number of cars that can traverse the road at once. In network-on-a-chip circuits (nodes are components such as CPU and GPU cores equipped with extremely basic routers, and edges are circuit wiring between cores), tiny units of data called flits flow through the circuit, and each link can only transport a limited number of flits at a time (Cota et al., 2012). On the Internet, packets navigate from one host to another, and each communication link has a capacity that constrains the rate of data flow through the link. In all these cases, the goal is to optimize performance metrics, including how long it takes for data to reach the target and how much data is queued or lost along the way because capacities are exceeded.
There are two primary services required in these networks: routing and flow control. Network routing refers to moving a payload from a source node in the network to a target node through some path (Royer and Toh, 1999), often maintained using a distributed routing table (Gavoille, 2001). The majority of networks always send traffic along the shortest path from source to target, where shortest refers to physical distance or transit time. In this paper, we also assume routing occurs via the shortest source-target path and focus on algorithms for flow control.
Flow control determines when and how much data can be sent through the network. Engineered networks are often designed based on the concept of oversubscription or blocking, where there is not enough capacity to simultaneously service every possible flow at maximum capacity due to bottlenecks in the network. Bottlenecks can cause congestion, which results in data loss and reduced useful throughput: when demand for a link exceeds its capacity, the excess data must be queued or dropped. Effective flow control maximizes the utilization of link capacity while maintaining low loss and delay. In general, flow control is an NP-hard problem (Wang et al., 2003; Fortz and Thorup, 2000). It is even more challenging in online systems that serve many flows concurrently, where traffic can change unpredictably, and where optimization must happen in real time. Further, control logic must be implemented in a distributed manner, with minimal communication between nodes.
To address these challenges, we developed a new neuro-inspired distributed computing model where flow control occurs by modulating (increasing or decreasing) edge weights using only 1-bit (binary) local feedback. Edge weights represent a control variable that denotes how much data should flow along the edge at the current time. For example, if there are 10 units of data that want to travel from one node to a neighboring node, and if the weight of the edge between them is 6, then 6 units can be successfully transmitted, and the other 4 units are either queued or dropped. There exists an optimal global weight distribution at every time step, which depends on the trade-off between maximizing data flow rates and minimizing data drops and queueing. To inform the direction in which weights should change to approach this distribution, the network relies on 1-bit local feedback between neighboring nodes, which indicates whether the data was successfully transmitted (without being queued or dropped). This feedback is used to increase or decrease edge weights to ensure that links are not under-utilized or overloaded, respectively.
How does this relate to synaptic plasticity in the brain? Building on prior work, we will argue that synaptic weight-update rules can be viewed as a distributed gradient descent process that attempts to find a weight distribution that optimizes a global objective (Bengio et al., 2015a, b). While the computational models and feedback mechanisms used to trigger weight changes in engineered networks are clearly different from those used in the brain, the questions we consider here are: 1) What the direction and magnitude of weight updates should be; i.e. when and by how much to increase and decrease; 2) How these local decisions affect global performance objectives in engineering (bandwidth, drops, queueing delay); and 3) Whether general principles for modulating edge weights may be found across engineered and neural systems.
Overall, this paper makes four contributions: 1) A new neuro-inspired distributed computing model to optimize traffic flow based only on edge weight updates and 1-bit feedback between adjacent nodes; 2) A casting of long-term potentiation (LTP) and long-term depression (LTD) in terms of distributed gradient descent dynamics; 3) Simulations, using simulated and real networks, and theoretical analysis of five classes of weight-update rules; and 4) Comparisons of the best performing classes with experimental data detailing the functional forms of LTP and LTD in the brain.
1.1 Related work
Congestion control in many distributed networks, such as the Internet, is performed on a global end-to-end basis per-flow, meaning that the source of each flow regulates the rate at which it sends data into the network based on a binary feedback signal from the target node (Corless et al., 2016). Our neuro-inspired per-link model described below is stricter in its feedback constraints; each node can regulate traffic but based only on the congestion it observes on its incoming and outgoing links, independently of the source and target of the data. This model is more relevant to vehicular traffic networks (where it is impossible for a traffic control device, such as a traffic light, to know the ultimate destination of a vehicle) and network-on-a-chip circuits (where flits travel independently through the circuit, like vehicles). Traffic control algorithms have been analyzed in many forms, but largely assume the problem is centralized or offline (Gayme and Topcu, 2011). Other approaches attempt to emulate a centralized algorithm by passing large-sized messages (Mateos-Nunez and Cortes, 2013). Distributed gradient descent algorithms have been studied in many areas, but they also assume the ability to pass large and frequent messages, or they require significant data aggregation at individual nodes (Li et al., 2014; Zinkevich et al., 2010; Yuan et al., 2014).
2 A distributed network model for traffic flow control
We are given a directed network whose node set is partitioned into three non-overlapping subsets: sources, targets, and routers. Data is transmitted from sources to targets via the routers (Figure 1A). The edges are directed; for simplicity, we assume each source is connected to exactly one router (i.e. a “feeder road”), and each target has an incoming edge from exactly one router (i.e. a “highway exit”). The routers are connected with a uniform or scale-free degree topology.
The weight of each edge at a given time step corresponds to the maximum amount of traffic the flow control algorithm allows to travel across that edge during that step. Every weight, for all edges and for all time steps, is bounded above by a fixed maximum capacity defined for each edge. Edge weights from routers to targets are always set to this maximum capacity because there are no outgoing edges from the targets, and hence no possibility of downstream congestion. Weights of all other edges will vary over time based on traffic flow and congestion.
Each source desires to send a fixed number of data units to one random target; hence, there is one flow per source, and all of these flows actively compete in the network and send data. Data for each flow is routed from the source to the target via the shortest path in the network. In practice, this is easily accomplished using a distributed routing algorithm (Gavoille, 2001). (Footnote 1: This model of routing is likely different than that occurring in the brain. Our work is not meant to derive a mapping between the two or suggest that their mechanisms are similar.) In each time step, each source injects new data into the network, the amount of which is equal to the weight of the outgoing edge from the source to the router it is connected to. Thus, if this weight changes, so does the amount of new data injected into the network by the source. Data arriving at a router is forwarded to the next node along the shortest path to the target. If data arrives at a router destined for a neighboring node, the amount of data actually sent is upper-bounded by the weight of the edge between them, which can change over time. If multiple flows desire to use the same edge in the same time step, we process the flows in a random order. This means that a random incoming flow is chosen and its data units are sent, up to the link weight, with the rest dropped. Then another incoming flow is chosen at random, and its units are handled similarly. A flow is complete when its source has successfully transferred all of its data units to its target.
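The random-order processing of competing flows on one edge can be sketched as follows. This is a minimal illustration under the drop model, not the authors' simulator; the function name `forward_on_edge`, its arguments, and the assumption that later flows share whatever allowance earlier flows left unused are ours.

```python
import random


def forward_on_edge(demands, weight, rng=random):
    """Serve competing flows on a single edge in random order (drop model).

    demands: dict mapping a flow id to the units it wants to send this step.
    weight:  current edge weight, i.e. the most data the edge may carry now.
    Returns two dicts (sent, dropped), both keyed by flow id.
    """
    order = list(demands)
    rng.shuffle(order)              # flows are handled in a random order
    remaining = weight              # allowance left on the edge this step
    sent, dropped = {}, {}
    for flow in order:
        amount = min(demands[flow], remaining)
        sent[flow] = amount
        dropped[flow] = demands[flow] - amount   # excess is discarded
        remaining -= amount
    return sent, dropped


# The example from the introduction: 10 units arrive at an edge of weight 6.
sent, dropped = forward_on_edge({"flow_a": 10}, weight=6)
print(sent, dropped)   # {'flow_a': 6} {'flow_a': 4}
```

With several flows, which flow absorbs the loss depends on the shuffle, but the total sent never exceeds the edge weight.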
The only control variables of the algorithm to achieve flow control are the edge weights at each time step. Our objective is to set these weights so as to optimize a global objective function, subject to the constraint that the traffic carried by an edge cannot exceed the edge's current weight.
Here, the objective function (described later) measures how well the current edge weights perform when routing data from the flows currently in the network. To enforce the constraint, we propose two application-dependent models for penalizing excess traffic on an edge: a drop model and a queue model (Figure 1B). Consider the total amount of data at a router that desires to be passed to a neighboring node at a given time step. If this amount exceeds the weight of the connecting edge, then the excess data is either dropped (i.e. discarded), or it is queued at the router until at least the next time step. Drop models are consistent with many data networks and neural circuits (Branco and Staras, 2009). Queue models are consistent with transport networks and can help smooth transient or bursty traffic. In each time step, data queued at a node is processed before non-queued data (i.e. a first-in first-out buffer). We assume a queue of infinite length, but we penalize solutions that produce long queues.
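As a hedged sketch of the queue model, the snippet below keeps a first-in first-out buffer in front of one edge and serves previously queued data before new arrivals. The name `step_queue_model` and the `(flow_id, units)` tuple layout are illustrative assumptions; new arrivals are taken in the order given rather than shuffled.

```python
from collections import deque


def step_queue_model(queue, arrivals, weight):
    """Advance a FIFO queue in front of one edge by a single time step.

    queue:    deque of (flow_id, units) left over from earlier steps.
    arrivals: list of (flow_id, units) arriving during this step.
    weight:   current edge weight (most data the edge may carry this step).
    Returns (sent, jammed); jammed is True if any data remains queued.
    """
    queue.extend(arrivals)          # older data stays at the front of the line
    remaining = weight
    sent = []
    while queue and remaining > 0:
        flow, units = queue.popleft()
        moved = min(units, remaining)
        sent.append((flow, moved))
        remaining -= moved
        if moved < units:
            # the unsent remainder keeps its place at the head of the queue
            queue.appendleft((flow, units - moved))
    return sent, bool(queue)


q = deque()
print(step_queue_model(q, [("a", 4), ("b", 5)], weight=6))  # ([('a', 4), ('b', 2)], True)
print(step_queue_model(q, [("a", 1)], weight=6))            # ([('b', 3), ('a', 1)], False)
```

The boolean returned here plays the role of the 1-bit congestion signal for that edge in the queue model.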
Thus, at every time step, there exists an optimal global weight distribution for a given objective function. This optimum will change as traffic demand varies over time. The goal is to track this distribution as closely as possible by applying gradient descent on the edge weights. We use 1-bit feedback between adjacent nodes, indicating whether data was successfully sent from one node to the other (i.e. if it resulted in no dropped or queued data). If the transmission is successful, data rate is increased (to probe whether higher bandwidth can be achieved). Conversely, if congestion is experienced, the data rate is decreased. This feedback thus serves as the direction of the local gradient (at the edge) for the global objective. By modifying the edge weight in accordance with this local gradient, we attempt to minimize the global objective. The magnitude of the edge weight change is based on both the direction of change and the current weight. Therefore, the algorithm seeks the point that locally maximizes traffic flow without triggering congestion.
Next, we describe how synaptic plasticity rules can serve as inspiration towards regulating these edge weights in an online, activity-dependent manner using simple distributed computation.
2.1 Synaptic plasticity as distributed gradient descent on edge weights
Recent work by Bengio et al. has argued that many forms of Hebbian learning, such as spike-timing dependent plasticity, may correspond to gradient descent towards neural activities that optimize a global objective (Bengio et al., 2015a, b; Osogami and Otsuka, 2015). They propose that neurons perform “inference” to try and better predict future activity, given current and past data. To approach optimal activity levels, feedback signals between pre- and post-synaptic neurons, such as those used to trigger long-term potentiation (LTP) and long-term depression (LTD), cause firing rates to increase or decrease based on the gradient of the objective.
Under a connectionist assumption, where neural activity is a function of the synaptic weights coupling neurons together, the state of the network can be described by the edge weights over the population of synapses (along with other presumed constants, such as activation functions). The evolution of the system can be described by how these weights change in response to activity. In our case, activity-dependent feedback signals between adjacent nodes (described in detail below) provide a measurement of the direction of the local gradient at the edge, which similarly triggers edge weights to increase or decrease. These weight changes thus correspond to a distributed gradient descent algorithm for finding a set of edge weights that optimizes a global objective. The movement towards the global optimum is complicated by the non-independence of weight changes (network effects) and the uncertainty in future inputs (non-stationarity of traffic).
One critical question then is: How much should the weight increase (“LTP”) or decrease (“LTD”) following feedback? We next describe experimental data detailing what forms these rules might take in the brain.
2.2 Experimentally-derived weight update rules for LTP and LTD
To inspire our search into different possible weight update rules, we surveyed the recent literature for models based on experimental data (electrophysiology, imaging, etc.) that provided evidence of the functional forms of LTP and LTD (Table 1). We categorized rules into four classes: 1) Additive: the change in edge weight is based on an additive constant; 2) Multiplicative: the change is based on a multiplicative constant; 3) Weight-dependent: the change more generally is based on a function of the existing edge weight; and 4) Time-dependent (listed as history-dependent in Table 1): the change also depends on the history of recent edge-weight changes. In this paper, we focus on the first three classes.
These rules will be used to derive a class of neuro-inspired distributed gradient descent algorithms for updating edge weights, as described in the next section. Table 1 is not meant to exhaustively list all forms of synaptic plasticity rules derived in the literature (see Discussion) but rather to provide some basic structure for possible, simple-to-implement rules and their parameters.
| Type | Rule class | Neuroscience references | Engineering references |
|---|---|---|---|
| LTP | Additive | (Song et al., 2000; Kopec et al., 2006) | (Chiu and Jain, 1989; Corless et al., 2016) |
| | Multiplicative | (none) | (Chiu and Jain, 1989) |
| | Weight-Dependent | (Oja, 1982; Bi and Poo, 1998; van Rossum et al., 2000) | (Bansal and Balakrishnan, 2001; Kelly, 2003; Ha et al., 2008) |
| | History-Dependent | (Bienenstock et al., 1982; Froemke et al., 2010; Cooper and Bear, 2012; Kramar et al., 2012) | (Jin et al., 2002) |
| LTD | Subtractive | (Song et al., 2000) | (none) |
| | Multiplicative | (Bi and Poo, 1998; van Rossum et al., 2000; Zhou et al., 2004) | (Chiu and Jain, 1989; Corless et al., 2016) |
| | Weight-Dependent | (Oja, 1982) | (Bansal and Balakrishnan, 2001; Corless and Shorten, 2012) |
| | History-Dependent | (Bienenstock et al., 1982; Froemke et al., 2010; Cooper and Bear, 2012) | (Jin et al., 2002) |
2.3 Distributed algorithms for updating edge weights
First, to inform the direction of the weight change (increase or decrease), in each time step, we allow 1-bit feedback between adjacent nodes. Let Jam be an indicator variable, defined per edge and per time step, that is equal to 1 if data at a node is lost or queued due to congestion on that outgoing edge, and 0 otherwise. Assume data for a flow is traveling along two consecutive edges: an incoming edge into a node, followed by an outgoing (downstream) edge out of that node. When considering how to modify the weight of the incoming edge, we need to consider what happens on the adjacent downstream edge where traffic is flowing. Intuitively, if the downstream edge is jammed, then the incoming edge weight should LTD since it contributed data to the jam. Further, the downstream edge weight should LTP to attempt to alleviate the congestion. If neither the incoming edge nor the downstream edge is jammed, then both should LTP (Figure 1C). Thus, the Jam term serves as a 1-bit measurement of the direction of the local gradient at the edge. Overall, the logic implemented at each edge is given by Algorithm 1.
We assume a node (in this case, the node between the incoming and outgoing edges) can modify the edge weight of both its incoming and outgoing edges. In synapses, this may be achieved by modulating pre-synaptic release probability or number of post-synaptic receptors (e.g. Costa et al. (2015); Markram et al. (2012); Yang and Calakos (2013); Fitzsimonds et al. (1997)) or other gating mechanisms (Vogels and Abbott, 2009); in data networks, a node can pause incoming data by transmitting a signal and can pause outgoing data by simply not transmitting. If an edge gets both LTP and LTD signals in a time step, it defaults to LTD. In every time step, the weight of every edge that carries data will either LTP or LTD.
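The direction logic described above can be sketched in a few lines. This is our reading of the prose, not a transcription of Algorithm 1: the function name `decide_updates`, the representation of an edge as a pair of node labels, and the restriction to two-edge segments are all assumptions made for illustration.

```python
def decide_updates(two_hop_segments, jammed):
    """Turn 1-bit Jam feedback into an LTP or LTD signal for each edge.

    two_hop_segments: iterable of (incoming_edge, outgoing_edge) pairs that
                      carried data through a shared middle node this step.
    jammed:           set of edges whose Jam bit is 1 this step.
    Returns a dict mapping each edge to the string "LTP" or "LTD".
    """
    signals = {}

    def signal(edge, kind):
        # An edge that receives both signals in one step defaults to LTD.
        if signals.get(edge) != "LTD":
            signals[edge] = kind

    for incoming, outgoing in two_hop_segments:
        if outgoing in jammed:
            signal(incoming, "LTD")   # it contributed data to the jam
            signal(outgoing, "LTP")   # try to alleviate the congestion
        else:
            signal(incoming, "LTP")   # no congestion seen: probe for more
            signal(outgoing, "LTP")
    return signals


segments = [(("u", "v"), ("v", "z")), (("s", "u"), ("u", "v"))]
print(decide_updates(segments, jammed={("v", "z")}))
```

In the example, edge `("u", "v")` receives LTD from the first segment and LTP from the second, and ends up with LTD, matching the tie-breaking rule in the text.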
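Second, to determine the magnitude of the weight change, we consider two weight-update rules for LTP, additive increase (AI) and multiplicative increase (MI), and two for LTD, subtractive decrease (SD) and multiplicative decrease (MD), each governed by a single constant.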
We consider four combinations of LTP and LTD: AIMD, AISD, MIMD, and MISD. Each of these algorithms has some theoretical or experimental basis, or both (Table 1). For example, AISD was proposed by Kopec et al. (2006) and Song et al. (2000). AIMD was proposed by van Rossum et al. (2000) and Delgado et al. (2010). Multiplicative decrease rules have been proposed by Zhou et al. (2004), amongst others.
We also compare to an algorithm based on the classic Oja learning rule (Oja, 1982).
Unlike the previous rules, Oja uses the activity (traffic) of the edge as a variable, namely the amount of data traversing the edge at the current time step. This rule is slightly different from the typical Oja rule, where the change in the weight for the $i$th input is $\Delta w_i = \eta\,(x_i y - y^2 w_i)$ (learning rate $\eta$, synaptic input $x_i$, and output $y$). The squared term $y^2$ functions as a decay on the weight. We include a similar activity-dependent squared decay term, but we normalize it by the link capacity so that it lies within the required weight range. Our term decreases the effect of LTP and increases the effect of LTD as the link approaches capacity. This allows more aggressive increase and decrease constants to quickly adjust traffic.
Another algorithm we compare is called Bang-Bang control, a type of controller that switches abruptly between two extreme settings. This rule is often used in neural circuit design (Feng and Tuckwell, 2003; Zanutto and Staddon, 2007) and in engineering (Lazar, 1983) to control and stabilize activity.
Finally, we compare to a baseline rule, Max Send, which keeps all edge weights fixed at the maximum capacity in every time step. For all algorithms, if a weight equals the maximum capacity and is triggered to LTP, it stays at the maximum. Likewise, if a weight equals the minimum allowed value and is triggered to LTD, it stays at the minimum.
We only consider integer units of data; thus, for all update rules, the updated weight is rounded to an integer. To prevent a link from getting stuck at 0 weight (e.g. for MISD), every weight was required to have a minimum of 1. The link capacity is large relative to a single data unit, so the integer rounding and minimum value were negligible in terms of overall performance.
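The four increase/decrease combinations, together with the rounding and clamping described above, can be sketched as below. The exact functional forms and constants are assumptions on our part (the displayed equations are not reproduced here): AI adds a constant, MI multiplies by a constant greater than 1, SD subtracts a constant, and MD multiplies by a constant between 0 and 1. The capacity of 1000 in the usage lines is purely hypothetical.

```python
def additive_increase(w, k):
    return w + k


def multiplicative_increase(w, k):
    return w * k            # assumes k > 1


def subtractive_decrease(w, k):
    return w - k


def multiplicative_decrease(w, k):
    return w * k            # assumes 0 < k < 1


# (LTP rule, LTD rule) for each named combination in the text.
RULES = {
    "AIMD": (additive_increase, multiplicative_decrease),
    "AISD": (additive_increase, subtractive_decrease),
    "MIMD": (multiplicative_increase, multiplicative_decrease),
    "MISD": (multiplicative_increase, subtractive_decrease),
}


def update_weight(w, signal, rule, k_ltp, k_ltd, capacity, w_min=1):
    """Apply one LTP or LTD step, then round and clamp to [w_min, capacity]."""
    ltp, ltd = RULES[rule]
    raw = ltp(w, k_ltp) if signal == "LTP" else ltd(w, k_ltd)
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return max(w_min, min(capacity, round(raw)))


print(update_weight(100, "LTP", "AIMD", k_ltp=5, k_ltd=0.5, capacity=1000))   # 105
print(update_weight(100, "LTD", "AIMD", k_ltp=5, k_ltd=0.5, capacity=1000))   # 50
print(update_weight(1000, "LTP", "MIMD", k_ltp=2, k_ltd=0.5, capacity=1000))  # 1000
print(update_weight(1, "LTD", "MISD", k_ltp=2, k_ltd=5, capacity=1000))       # 1
```

The last two calls show the saturation behaviour: a weight at capacity stays there under LTP, and a weight at the minimum stays there under LTD.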
2.4 Objective functions, simulations, and data
Next, we describe network performance measures to quantify how well these rules behave towards optimizing global objectives. The objective functions we selected are typically used to evaluate performance in engineered networks (Pande et al., 2005; Ahn et al., 1995).
Consider the set of competing flows and the load (number of data units) of each flow. The three objective functions are:
Bandwidth: Amount of data successfully transferred per time step, averaged over all flows; that is, the mean over flows of each flow's load divided by the number of time steps the flow takes to complete.
Drop Penalty: Percentage of data lost, averaged over all flows; that is, the mean over flows of the amount of data lost by the flow over all time steps until completion, divided by the flow's load. The drop penalty may go above 100% if more data sent by the source is lost than delivered.
Queue Penalty: The percentage of data that is queued per hop, averaged over all flows; that is, the mean over flows of the total amount of the flow's data inserted into queues, divided by the flow's load and by the path length of the flow.
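A minimal sketch of the three measures follows, assuming the normalizations read off the verbal definitions (loss divided by load; queued data divided by load times path length). The `FlowStats` record and its field names are hypothetical bookkeeping, not part of the original simulator.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FlowStats:
    load: int             # data units the flow must deliver
    steps: int            # time steps until the flow completed
    lost: int = 0         # units dropped over the flow's lifetime
    queued: int = 0       # units inserted into queues over its lifetime
    path_length: int = 1  # number of hops on the flow's route


def bandwidth(flows):
    """Mean data delivered per time step, averaged over flows."""
    return mean(f.load / f.steps for f in flows)


def drop_penalty(flows):
    """Mean percentage of data lost; can exceed 100 when losses exceed load."""
    return 100 * mean(f.lost / f.load for f in flows)


def queue_penalty(flows):
    """Mean percentage of data queued per hop."""
    return 100 * mean(f.queued / (f.load * f.path_length) for f in flows)


flows = [
    FlowStats(load=100, steps=10, lost=50, queued=40, path_length=4),
    FlowStats(load=100, steps=20, lost=150, queued=0, path_length=5),
]
print(bandwidth(flows), drop_penalty(flows), queue_penalty(flows))
```

The second flow loses more data than it delivers, which is the situation in which the drop penalty for a single flow goes above 100%.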
Parameter robustness: One critical component of these algorithms is that they must work well in general. Traffic is highly dynamic, and thus optimizing parameters for one particular traffic regime or network topology will not be sufficient for real-world use. We thus focus on the robustness or sensitivity of each algorithm by testing the variability in their performance across a broad range of parameters.
Simulation framework. We created a directed network of sources, targets, and routers at two network sizes. Each source is connected to exactly one random router (the same router can have an edge from multiple different sources), and each target has an incoming edge from one random router. The router-router network was defined using a uniform or scale-free degree topology with each router connected to six other routers (Corless et al., 2016). We defined one concurrent flow per source, each selecting a random target (the same target may be selected twice across flows). Each flow contained the same fixed number of data units, chosen with reference to the average number of round-trip times taken for a data transfer on the Internet (Corless et al., 2016). The weight of each edge was initialized to the maximum capacity in order to immediately experience contention. The mean path length of the artificial network was 4.7; this is large enough to provide several links of potential contention. Each performance measure was averaged over 25 repeat simulations.
Parameter variation. For the weight-update rules, we varied the constants of the AI, SD, MI, and MD rules.
Real-world data. We used the CAIDA Autonomous System (AS) relationship data to generate a graph based on the Internet routing network (Cai, 2016). Each AS represents the highest-level routing subnetwork on the Internet. The CAIDA data contained connectivity between 53,195 AS subnetworks. We treat each AS as a single routing node in our model. We created the same number of sources and targets (53,195 each); each source was connected to one random AS and each target had one incoming edge from a random AS. The network contained 537,582 total directed edges. To simulate flows, we selected a subset of the sources (500) and paired each source with a random target. Each performance measure was averaged over 10 repeat simulations. As before, every edge was given the same fixed capacity.
First, we describe the performance of each weight-update algorithm against the global objectives (bandwidth, drops, and queueing) using both simulated and real-world networks. Second, we support these results by analytically deriving the performance of each algorithm as it adapts to changing traffic demands. Third, we describe how the best performing rules compare to those commonly used in engineering.
3.1 Observations from simulations and real-world network flows
We first compared the performance of the seven edge-weight update algorithms (Section 2.3) via simulation. Each algorithm was evaluated according to the bandwidth offered and the amount of data that was dropped or queued.
The two strongest performing algorithms were AIMD and MIMD, with Oja lying in between (Figure 2). AIMD, for some parameters, achieved a bandwidth comparable to other algorithms, but its main strength was in reducing the drop penalty by at least one order of magnitude (Figure 2A). The former was due to its conservative rule for increasing edge weights for LTP (additive), and the latter was due to its aggressive edge weight decrease for LTD (multiplicative). AISD and MISD showed very high drop penalty primarily because upon contention, edge weights were only decreased subtractively; this led to slow adaptation, though higher bandwidth because few links were ever under-utilized. In general, the Oja rule (a variant of AISD) improved over AISD but still incurred a much higher drop penalty than AIMD, also due to its conservative decrease. MIMD showed great sensitivity to algorithm parameters, only some of which performed well. While keeping all edge weights at maximum capacity may be intuitively appealing (Max Send), this is in general not a good solution because any downstream bottleneck will result in massive data drops. We also observed similar trends for all algorithms using the queue model (Figure 2B).
Overall, for both models, the multiplicative decrease algorithms (AIMD, MIMD) and the Oja algorithm demonstrated a better trade-off between bandwidth and drop/queue penalties than other algorithms, indicating their ability to more closely approach the optimal edge weights. We also tested these observations for larger networks with different router connectivity topologies and observed similar overall trends (Appendix, Figure S1).
Next, we tested how well and how quickly each algorithm could adapt to new traffic, simulated on a real Internet backbone routing network. The simulation ran in three phases: during the first and last phases, 500 flows concurrently competed; during the middle phase, 500 additional flows were temporarily added (“rush hour”). All algorithms demonstrated some reduction in bandwidth when rush hour began due to the additional number of competing flows. However, AIMD and MIMD incurred the least transient drop penalty (i.e., the drop penalty incurred immediately after rush hour starts; Figure 3B). Overall, AIMD yielded significantly lower transient drop penalty than all other algorithms (2-sample t-tests), and significantly lower overall drop penalty than MIMD. AIMD and MIMD also produced only a 4–5% difference in bandwidth compared to the other algorithms (Figure 3A). These results suggest that AIMD and MIMD adapt faster to changing traffic, yielding fewer transient drops with comparable bandwidth.
Next, to support these observations, we formally analyze the adaptive behavior of each algorithm.
3.2 Analyzing transient response times for AIMD, MIMD, AISD, MISD, and Oja
An important aspect of algorithm performance is its non-stationary or transient behavior; i.e. how well it adapts when traffic suddenly increases and the available bandwidth per flow decreases (Figure 3). Real networks are never static; all flows experience some level of perturbation due to varying traffic. Thus, as opposed to analyzing convergence properties (as is typically done for gradient descent algorithms), we analyzed a simple but informative scenario: the performance of each algorithm when a second flow is added to a link that is initially serving only a single flow at maximum capacity.
Assume the link under consideration has a fixed weight, and that there are two flows, starting from different upstream nodes, that both need to use this link to reach their downstream targets (and no other link in the network is limiting). Initially, the first flow alone occupies the link at maximum capacity; then the second flow begins with some initial demand of its own. Assume a single LTD operation will lower the total traffic sent by both flows along the link to be below the link's weight, hence alleviating the congestion. After the initial congestion event there will be some number of time steps of LTP before congestion re-occurs. We wish to find the amount of data dropped/queued (called overshoot) at the time step when congestion re-occurs, defined as the difference between the amount of data desired by both flows along the link and the link's weight at the moment LTD is re-activated.
Theorem 1 (Overshoot of AIMD).
The two-flow transient response of AIMD has an overshoot that increases linearly with the additive-increase constant, but is reduced by a term proportional to the capacity.
Proof sketch (tracking Flow 1 and Flow 2 at each time step). At the first time step, both flows place their desired traffic on the link.
Since the sum of the flows is greater than the link's weight, congestion occurs and the jam indicator variable is true. In the next step, LTD via multiplicative decrease is applied to both flows.
This brings the total desired traffic along the link under the link's weight. Additive increase then occurs at every step until congestion re-occurs.
At that time step, the overshoot along the link, which is the excess traffic over the link's weight, has the form stated in the theorem, assuming the single lower-order term is negligibly small. Note that the second term, which is proportional to the capacity, is negative. ∎
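The two-flow scenario can also be explored numerically. The sketch below is a simplified stand-in for the analysis, not a reproduction of it: it assumes both flows are scaled by the same multiplicative-decrease constant once, then each grows by the additive constant per step until their sum again exceeds the link's capacity. The function name, argument names, and example values are ours.

```python
def aimd_two_flow_overshoot(capacity, second_flow_start, k_ai, k_md):
    """Simulate two AIMD flows sharing one link after the second flow joins.

    capacity:          fixed weight of the shared link.
    second_flow_start: initial demand of the newly added flow.
    k_ai:              additive-increase constant (per flow, per step).
    k_md:              multiplicative-decrease constant, with 0 < k_md < 1.
    Returns (ltp_steps, overshoot) at the moment congestion re-occurs.
    """
    flow1, flow2 = float(capacity), float(second_flow_start)
    if flow1 + flow2 <= capacity:
        raise ValueError("the second flow must cause congestion")
    # One multiplicative decrease is assumed to clear the congestion.
    flow1, flow2 = flow1 * k_md, flow2 * k_md
    if flow1 + flow2 > capacity:
        raise ValueError("a single LTD step does not relieve the link")
    steps = 0
    while flow1 + flow2 <= capacity:     # LTP until the link jams again
        flow1 += k_ai
        flow2 += k_ai
        steps += 1
    return steps, flow1 + flow2 - capacity


print(aimd_two_flow_overshoot(1000, 500, k_ai=1, k_md=0.5))   # (126, 2.0)
```

In this simplified setting the overshoot can never exceed twice the additive constant, since the total was at or below capacity one step earlier; this is consistent with the theorem's statement that overshoot grows linearly with the additive-increase constant.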
| Rule | Transient Overshoot | Balanced | Aggressive Increase | Aggressive Decrease |
|---|---|---|---|---|
| AIMD | | 0.5 (250) | 101 (3) | 0.1 (450) |
| AISD | | 1.0 (5) | 196 (1) | 1.0 (100) |
| MIMD | | 72.8 (8) | 126 (2) | 84.6 (25) |
| MISD | | 95.6 (1) | 494 (1) | 90.2 (2) |
| OJA | Undershoots | -1 | -1 | -1 |
| OPT | 0 | 0 | 0 | 0 |
We similarly derived the overshoot of MIMD, AISD, and MISD (Appendix; summary in the overshoot summary table). The theoretical performance of the algorithms on the two-flow case correlates well with the simulated performance using hundreds of concurrent flows in a larger network. Both AI algorithms have an overshoot containing a term that grows with the additive-increase constant. For AIMD, this term is reduced by an amount proportional to the capacity. AISD, on the other hand, is hurt by a slow subtractive decrease, and hence only reduces the term by an amount set by its subtractive constant. Since the multiplicative reduction is the larger of the two for large values of the capacity, AIMD typically performs better (compare blue dots vs. yellow dots in Figure 2).
The MIMD overshoot shows a complex dependence on its parameters, which makes performance highly parameter-dependent, as we also observe via simulation (see variability of red dots in Figure 2). MISD has uniformly high overshoot, since its overshoot contains a term multiplied by a large constant, and thus performs poorly. Finally, Oja uses an AISD rule with a weight-dependent squared decay; this decay cancels out the additive increase term as the traffic of Flows 1 and 2 approaches capacity. This leads to performance that always under-shoots for this simple two-flow case, improving drop and queue penalty over AISD; however, in practice when many flows concurrently compete and overshoot does occur (as in Figures 2 and 3), it is limited by the weak decrease term, as the decay only doubles the subtractive decrease, at best.
We also simulated the transient overshoot under the assumptions of Theorems 1–4 using different parameter settings (overshoot summary table, right side). These simulations further validate the theorems, showing that AIMD overshoots the least, while MIMD varies from the second-best to second-worst, depending on its increase and decrease constants. Further analysis of each algorithm is provided in the Appendix.
3.3 Comparing distributed gradient descent algorithms in the brain and the Internet
Simulations and theory both suggest that AIMD achieves a robust and well-balanced trade-off between bandwidth and drop/queue penalties, with MIMD and Oja also performing comparably depending on the parameters selected. This implies that these algorithms approach the optimal global edge-weight distribution more quickly and accurately than other distributed gradient descent algorithms, including MISD, AISD, Bang-Bang, and Max-Send.
In the brain, the additive increase and multiplicative decrease rule (AIMD) for LTP and LTD, respectively, has strong theoretical and experimental support (Table 1), particularly over MI and SD models. The AIMD algorithm is also very similar to the rule proposed by van Rossum et al. (2000) and has been commonly referred to as the mixed-weight update rule (Delgado et al., 2010). This rule, in neural network simulations, has been shown to produce stable Hebbian learning compared to many other rules (van Rossum et al., 2000; Billings and van Rossum, 2009).
While Oja was one type of weight-dependent rule, other rules have also been proposed where a weak (i.e. low-weight) synapse that undergoes LTP is strengthened by a greater amount than a strong synapse that undergoes LTP, and vice-versa for LTD (Table 1). Prior work has also suggested that individual synapses may have a “memory” that allows for more sophisticated update rules to be implemented, including history-dependent updates (Lahiri and Ganguli, 2013). One such form of update is called integral control in engineering, which utilizes the time integral of a variable from some initial time to the present. We list references for these classes of rules in Table 1 but do not explore them further here.
Interestingly, in engineering, AIMD also lies at the heart of the most popular congestion control algorithm used on the Internet today: the transmission control protocol (TCP; Corless et al., 2016). In contrast to our per-link model, congestion control on the Internet is performed on a global end-to-end basis per-flow, meaning that the source of each flow regulates its transmission rate based on a binary feedback signal (called an ACK or acknowledgment) sent by the target. If there is a long delay before the ACK is received (or if the ACK is never received at all), congestion is assumed, and the source decreases its rate of sending data by a multiplicative constant. Otherwise, the source increases its rate by an additive constant. TCP was also designed with the goal of converging to steady-state flow rates over time in a gradient-descent-like manner (Shakkottai and Srikant, 2007; Jose et al., 2015). Thus, despite different models and objectives, both the brain and the Internet may have discovered a similar distributed algorithm for optimizing network activity using sparse, local feedback.
Our work connects distributed traffic flow control algorithms in engineered networks to synaptic plasticity rules used to regulate activity in neural circuits. While we do not claim that there is a one-to-one mapping between mechanisms of synaptic plasticity and flow control problems, we showed how both problems can be viewed abstractly in terms of gradient descent dynamics on global objectives, with simple local feedback. We performed simulations and theoretical analyses of several edge-weight update rules and found that the additive-increase multiplicative-decrease (AIMD) algorithm performed the best in terms of minimizing drops/queueing while achieving bandwidth comparable to that of the other algorithms. This algorithm also matches experimental data on the functional forms of LTP and LTD in the brain, as well as congestion control on the Internet, suggesting a similar design principle in biology and engineering. Further, these weight rules use limited (1-bit) local communication and hence may be useful for implementing energy-efficient and scalable flow control in other applications, including integrated circuits, wireless networks, or neuromorphic computing.
There are many avenues for future work. First, other plasticity rules may also be explored within our framework, such as short-term plasticity. Second, in cases where source and/or receiver rates are fixed, the payload needs to be routed over alternative paths (i.e., routes may change over time based on traffic; Isa et al., 2015). This requires that heavily used edges become down-weighted and unused edges become more attractive, which effectively performs load balancing over all resources (edges) in the network. Biologically, similar behavior is observed due to homeostatic plasticity mechanisms, which may inspire algorithms for this problem. Third, these distributed gradient descent update rules may be useful in machine learning applications for non-stationary learning. Fourth, more sophisticated weight- and history-dependent update rules already explored in engineering may provide insight into their form and function in the brain. Overall, we hope our work inspires closer collaborations between distributed computing theorists and neuroscientists (Navlakha et al., 2015).
Analyzing transient response times for AIMD, MIMD, AISD, MISD, and Oja
Below, we derive the transient overshoot of AIMD, MIMD, AISD, and MISD.
Theorem 2 (Overshoot of AIMD).
The two-flow transient response of AIMD has an overshoot that increases linearly with , but is reduced by a term proportional to the capacity, .
See the main text for the derivation. The overshoot is:
assuming the single term is negligibly small. ∎
Theorem 3 (Overshoot of MIMD).
The two-flow transient response of MIMD is highly parameter dependent: based on and , the scaling of overshoot can either be dominated by or by a power of .
Following the proof of Theorem 1 in the main text, the time evolution will be:
|Time||Flow 1||Flow 2|
The overshoot for MIMD is:
if . Thus, the overshoot shows a positive dependence on (unlike AIMD, which has a negative dependence on ) and a power dependence on . ∎
Theorem 4 (Overshoot of AISD).
The two-flow transient response of AISD increases linearly with , but does not scale relative to the link capacity, .
The time evolution will be:
|Time||Flow 1||Flow 2|
The overshoot for AISD is thus:
In our theorems, we ignore the constraint that . This limit in our implementation was due to integer rounding to ensure MISD links do not get stuck at weight; hence, the theorems we present here are more general. If we apply this limit, it affects the weight of flow 2 at , which should equal . In all cases except AISD, the difference is removed by the approximation at the end. For AISD, Eqn. (5) becomes , a negligible change when .
Theorem 5 (Overshoot of MISD).
The two-flow transient response of MISD increases to the power of and as a product with .
The time evolution will be:
|Time||Flow 1||Flow 2|
The overshoot for MISD is thus:
since is large relative to . ∎
General conclusions can be drawn from these relations that correlate well with the simulation results. We can bound the overshoot of both AI algorithms (AIMD and AISD) by , where is the number of flows that must share a link, because each individual flow will not overshoot more than one . The more precise overshoot of AIMD, (Theorem 1 main text), shows a significant capacity-dependent stabilizing factor, as , which is negative, counteracts the factor of . When we repeated the AIMD flow simulation (Table 1 main text) with (as opposed to ), the overshoot of AIMD increased, as expected.
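The bound for the AI algorithms can be checked numerically with a small sketch. It assumes a simplified setting that is not taken from the main text: all flows share one link, receive the same 1-bit congestion signal in the same time step, and use AIMD with illustrative constants. The names `n_flows`, `add_step`, and `mult_factor` are hypothetical.

```python
def max_overshoot_aimd(n_flows, capacity, add_step=1.0, mult_factor=0.5,
                       steps=500):
    """Largest amount by which total traffic exceeds capacity (toy model)."""
    rates = [0.0] * n_flows
    worst = 0.0
    for _ in range(steps):
        total = sum(rates)
        worst = max(worst, total - capacity)
        if total > capacity:
            # Shared congestion signal: every flow backs off multiplicatively.
            rates = [r * mult_factor for r in rates]
        else:
            # No congestion: every flow adds one additive step.
            rates = [r + add_step for r in rates]
    return worst


# The total was at or below capacity before the last increase, so it can
# exceed capacity by at most n_flows * add_step.
for n in (1, 2, 5, 10):
    assert 0 < max_overshoot_aimd(n, 100.0) <= n * 1.0 + 1e-9
```

Under these assumptions the overshoot never exceeds the number of flows times the additive step, because each flow contributes at most one step beyond the last uncongested state.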
AISD has a conservative increase similar to AIMD but suffers because subtractive decrease adjusts slowly when available bandwidth decreases. While some parameter settings may overcome this, it is difficult in practice to tune the SD constant to be effective in all scenarios. Simulation results support this observation, showing a large drop penalty during rush hour and a slow recovery after traffic subsides (Figure 3 main text). For AISD, if multiple flows share a link, transient overshoot is approximately . When a large number of flows share a link, e.g., millions of flows on an Internet backbone, overshoot will be very large.
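The difference between subtractive and multiplicative decrease after a sudden loss of bandwidth can be illustrated with a short sketch. The function name, the starting weight, and the constants `mult_down=0.5` and `sub=1.0` are assumptions for illustration only; a positive capacity is assumed.

```python
def recovery_steps(weight, capacity, decrease, mult_down=0.5, sub=1.0):
    """Count decrease steps until the weight is back at or below capacity.

    decrease is "MD" (multiplicative) or "SD" (subtractive).
    """
    steps = 0
    while weight > capacity:
        if decrease == "MD":
            weight *= mult_down  # shrink by a constant factor
        else:
            weight -= sub        # shrink by a constant amount
        steps += 1
    return steps


# A weight of 100 when the usable capacity drops to 20:
md_steps = recovery_steps(100.0, 20.0, "MD")  # 100 -> 50 -> 25 -> 12.5
sd_steps = recovery_steps(100.0, 20.0, "SD")  # 100 -> 99 -> ... -> 20
```

With these values, multiplicative decrease needs 3 steps and subtractive decrease needs 80, and every one of those steps is a time step in which data is queued or dropped.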
The Oja-inspired algorithm is based on AISD, but subtracts a normalized traffic-dependent quadratic term. This attempts to correct for the slow LTD decrease of AISD, especially at high weights, and we do observe improvement over AISD in our network simulations. However, we also observe larger drop and queue penalties for Oja compared to AIMD due to its still rather conservative decrease following congestion. The decay term can also completely counteract the AI contribution when (i.e. when edge utilization is high), thus potentially leading to edge under-utilization. This convergence to a weight less than causes LTD never to be triggered (hence, the term in Table 2 in the main text, for this simple two-flow case). The lack of periodic overshooting appears to be key for the drop and queue penalty improvements over AISD.
The overshoot and performance of MIMD are highly dependent on relative to the and parameters, explaining the scattered performance of MIMD in Figure 2 (main text). With optimal parameter tuning, MIMD can be made to operate well under transient behavior, which is seen when certain MIMD points reach the AIMD region (Figure 3 main text).
MISD can easily be seen as worse than AISD in terms of drop/queue penalties in our simulations (Figures 2 and 3 main text). Intuitively, this is because MISD reduces slowly (subtractively) but increases aggressively (multiplicatively). The transient overshoot analysis for MISD shows that it increases as a product of and a constant to the power of . Since the decrease term is weak, will be small, meaning that poor performance will be especially seen for high capacity links.
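The capacity dependence of multiplicative increase can be seen in a single-flow sketch. This is a simplification of the two-flow setting analyzed in the theorems, and the names and constants (`add=1.0`, `mult_up=1.1`, `start=1.0`) are assumptions rather than values from the simulations.

```python
def first_overshoot(capacity, increase, add=1.0, mult_up=1.1, start=1.0):
    """Amount by which a growing weight first exceeds capacity.

    increase is "AI" (additive) or "MI" (multiplicative).
    """
    weight = start
    while weight <= capacity:
        weight = weight + add if increase == "AI" else weight * mult_up
    return weight - capacity


for capacity in (10.0, 100.0, 1000.0):
    ai = first_overshoot(capacity, "AI")
    mi = first_overshoot(capacity, "MI")
    # AI can overshoot by at most one additive step, whatever the capacity.
    assert ai <= 1.0
    # MI can overshoot by up to (mult_up - 1) * capacity.
    assert mi <= 0.1 * capacity + 1e-9
```

In this sketch the additive overshoot stays fixed as capacity grows, whereas the worst-case multiplicative overshoot grows in proportion to capacity, which is consistent with the poor behavior of MI rules on high-capacity links noted above.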
Algorithm performance with additional topologies
We performed simulations with 10-fold larger networks , with both uniform and scale-free degree distributions. The scale-free topology was derived using the Barabasi-Albert preferential attachment model (Barabasi and Albert, 1999). When generating graphs using this model, new nodes are connected to existing nodes with probability proportional to their existing degree. This mechanism has been shown to produce a power-law degree distribution. We found no qualitative change in our conclusions here (Figure 4 below) compared to the results discussed in the main text. Thus, our results are applicable to at least two classes of network topologies: uniform (inspired by grid-like road networks) and power-law (Internet). This invariance is likely due to the distributed nature of the flow control algorithms. The overshoot theorems exemplify this, having no assumptions on network size and topology.
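Preferential attachment can be sketched as follows. This is a minimal, undirected illustration using only the standard library, not the generator used for the simulations; the function name, the fully connected starting core, and the parameters `n_nodes`, `m_edges`, and `seed` are assumptions.

```python
import random


def barabasi_albert(n_nodes, m_edges, seed=0):
    """Return an undirected edge list grown by preferential attachment."""
    rng = random.Random(seed)
    # Start from a small fully connected core of m_edges + 1 nodes.
    edges = [(i, j) for i in range(m_edges + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks a node with probability proportional to its degree.
    stubs = [node for edge in edges for node in edge]
    for new in range(m_edges + 1, n_nodes):
        targets = set()
        while len(targets) < m_edges:  # m_edges distinct existing nodes
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((new, target))
            stubs.extend((new, target))
    return edges


edges = barabasi_albert(200, 2)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

Sorting the nodes by `degree` shows a few highly connected hubs and many low-degree nodes, the signature of a heavy-tailed degree distribution.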
Changes in edge weights (W) under dynamic traffic
To study how W (edge weights) changed under the rush hour protocol of Figure 3, we plotted the average W (computed in 10 time-step bins) for all used source-to-router links in the network. We focused on source-router links as these primarily control the amount of new data injected into the network in each time step. Error bars in this figure correspond to the standard error of the average over 10 trials. Both MIMD and AIMD reduce edge weights in response to rush hour, showing that they handle excess traffic by reducing bandwidth instead of dropping data. The characteristic oscillatory probing nature of both algorithms is also apparent. In both cases, the range of W in each time step is narrowly bounded, indicating network stability.
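The binning and error-bar computation can be sketched as below. It assumes that each trial has already been reduced to one value per time step (the edge weight averaged over the links of interest) and that the standard error uses the sample standard deviation across trials; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def bin_means(series, bin_size=10):
    """Average a per-time-step series in consecutive bins of bin_size steps."""
    return [mean(series[i:i + bin_size])
            for i in range(0, len(series), bin_size)]


def mean_and_sem(trials, bin_size=10):
    """Per-bin mean and standard error across trials.

    trials: list of equal-length per-time-step series, one per trial.
    """
    binned = [bin_means(trial, bin_size) for trial in trials]
    summary = []
    for values in zip(*binned):  # the same bin across all trials
        sem = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
        summary.append((mean(values), sem))
    return summary


# Two toy trials of 20 time steps each give two bins.
summary = mean_and_sem([[1.0] * 20, [3.0] * 20])
```

Each entry of `summary` is a `(mean, standard error)` pair for one 10-step bin, which is the form needed to draw a curve with error bars.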
- Cai (2016) (2016). The CAIDA [AS Relationships]. www.caida.org/data/as-relationships. Accessed: 2016-02-01.
- Ahn et al. (1995) Ahn, J. S., Danzig, P. B., Liu, Z., and Yan, L. (1995). Evaluation of tcp vegas: Emulation and experiment. SIGCOMM Comput. Commun. Rev., 25(4):185–195.
- Bansal and Balakrishnan (2001) Bansal, D. and Balakrishnan, H. (2001). Binomial Congestion Control Algorithms. In IEEE Infocom 2001, Anchorage, AK.
- Barabasi and Albert (1999) Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439):509–512.
- Bengio et al. (2015a) Bengio, Y., Lee, D., Bornschein, J., and Lin, Z. (2015a). Towards biologically plausible deep learning. arXiv, abs/1502.04156.
- Bengio et al. (2015b) Bengio, Y., Mesnard, T., Fischer, A., Zhang, S., and Wu, Y. (2015b). STDP as presynaptic activity times rate of change of postsynaptic activity. arXiv, abs/1509.05936.
- Bi and Poo (1998) Bi, G. Q. and Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18(24):10464–10472.
- Bienenstock et al. (1982) Bienenstock, E. L., Cooper, L. N., and Munro, P. W. (1982). Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1):32–48.
- Billings and van Rossum (2009) Billings, G. and van Rossum, M. C. (2009). Memory retention and spike-timing-dependent plasticity. J. Neurophysiol., 101(6):2775–2788.
- Branco and Staras (2009) Branco, T. and Staras, K. (2009). The probability of neurotransmitter release: variability and feedback control at single synapses. Nat. Rev. Neurosci., 10(5):373–383.
- Chiu and Jain (1989) Chiu, D.-M. and Jain, R. (1989). Analysis of the increase and decrease algorithms for congestion avoidance in computer networks. Comput. Netw. ISDN Syst., 17(1):1–14.
- Cooper and Bear (2012) Cooper, L. N. and Bear, M. F. (2012). The BCM theory of synapse modification at 30: interaction of theory with experiment. Nat. Rev. Neurosci., 13(11):798–810.
- Corless et al. (2016) Corless, M., King, C., Shorten, R., and Wirth, F. (2016). AIMD Dynamics and Distributed Resource Allocation. Society for Industrial & Applied Mathematics (SIAM).
- Corless and Shorten (2012) Corless, M. and Shorten, R. (2012). Deterministic and stochastic convergence properties of aimd algorithms with nonlinear back-off functions. Automatica, 48(7):1291–1299.
- Costa et al. (2015) Costa, R. P., Froemke, R. C., Sjostrom, P. J., and van Rossum, M. C. (2015). Unified pre- and postsynaptic long-term plasticity enables reliable and flexible learning. Elife, 4.
- Cota et al. (2012) Cota, É., de Morais Amory, A., and Lubaszewski, M. (2012). Reliability, Availability and Serviceability of Networks-on-Chip. Springer Science.
- Delgado et al. (2010) Delgado, J. Y., Gomez-Gonzalez, J. F., and Desai, N. S. (2010). Pyramidal neuron conductance state gates spike-timing-dependent plasticity. J. Neurosci., 30(47):15713–15725.
- Feng and Tuckwell (2003) Feng, J. and Tuckwell, H. C. (2003). Optimal control of neuronal activity. Phys. Rev. Lett., 91(1):018101.
- Fitzsimonds et al. (1997) Fitzsimonds, R. M., Song, H. J., and Poo, M. M. (1997). Propagation of activity-dependent synaptic depression in simple neural networks. Nature, 388(6641):439–448.
- Fortz and Thorup (2000) Fortz, B. and Thorup, M. (2000). Internet traffic engineering by optimizing ospf weights. In INFOCOM 2000. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, volume 2, pages 519–528 vol.2.
- Froemke et al. (2010) Froemke, R. C., Debanne, D., and Bi, G. Q. (2010). Temporal modulation of spike-timing-dependent plasticity. Front Synaptic Neurosci, 2:19.
- Gavoille (2001) Gavoille, C. (2001). Routing in distributed networks: Overview and open problems. SIGACT News, 32(1):36–52.
- Gayme and Topcu (2011) Gayme, D. and Topcu, U. (2011). Optimal power flow with distributed energy storage dynamics. In Proceedings of the 2011 American Control Conference, pages 1536–1542.
- Ha et al. (2008) Ha, S., Rhee, I., and Xu, L. (2008). Cubic: A new tcp-friendly high-speed tcp variant. SIGOPS Oper. Syst. Rev., 42(5):64–74.
- Isa et al. (2015) Isa, N., Mohamed, A., and Yusoff, M. (2015). Implementation of dynamic traffic routing for traffic congestion: A review, volume 545 of Communications in Computer and Information Science, pages 174–186. Springer Verlag.
- Jin et al. (2002) Jin, S., Guo, L., Matta, I., and Bestavros, A. (2002). A spectrum of tcp-friendly window-based congestion control algorithms. Technical report, Computer Science Department, Boston University, Boston, MA, USA.
- Jose et al. (2015) Jose, L., Yan, L., Alizadeh, M., McKeown, G. V. N., and Katti, S. (2015). High speed networks need proactive congestion control. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks, HotNets-XIV, pages 14:1–14:7, New York, NY, USA. ACM.
- Kelly (2003) Kelly, T. (2003). Scalable tcp: Improving performance in highspeed wide area networks. SIGCOMM Comput. Commun. Rev., 33(2):83–91.
- Kepecs et al. (2002) Kepecs, A., van Rossum, M. C., Song, S., and Tegner, J. (2002). Spike-timing-dependent plasticity: common themes and divergent vistas. Biol Cybern, 87(5-6):446–458.
- Kopec et al. (2006) Kopec, C. D., Li, B., Wei, W., Boehm, J., and Malinow, R. (2006). Glutamate receptor exocytosis and spine enlargement during chemically induced long-term potentiation. J. Neurosci., 26(7):2000–2009.
- Kramar et al. (2012) Kramar, E. A., Babayan, A. H., Gavin, C. F., Cox, C. D., Jafari, M., Gall, C. M., Rumbaugh, G., and Lynch, G. (2012). Synaptic evidence for the efficacy of spaced learning. Proc. Natl. Acad. Sci. U.S.A., 109(13):5121–5126.
- Lahiri and Ganguli (2013) Lahiri, S. and Ganguli, S. (2013). A memory frontier for complex synapses. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26, pages 1034–1042. Curran Associates, Inc.
- Lazar (1983) Lazar, A. (1983). Optimal flow control of a class of queueing networks in equilibrium. IEEE Transactions on Automatic Control, 28(11):1001–1007.
- Li et al. (2014) Li, F., Wu, B., Xu, L., Shi, C., and Shi, J. (2014). A fast distributed stochastic gradient descent algorithm for matrix factorization. In Proc. of the Intl. Workshop on Big Data, Streams and Heterogeneous Source Mining (BigMine), pages 77–87.
- Markram et al. (2012) Markram, H., Gerstner, W., and Sjostrom, P. J. (2012). Spike-timing-dependent plasticity: a comprehensive overview. Front Synaptic Neurosci, 4:2.
- Mateos-Nunez and Cortes (2013) Mateos-Nunez, D. and Cortes, J. (2013). Noise-to-state exponentially stable distributed convex optimization on weight-balanced digraphs. In IEEE Conf. on Decision and Control, pages 2781–2786.
- Morrison et al. (2008) Morrison, A., Diesmann, M., and Gerstner, W. (2008). Phenomenological models of synaptic plasticity based on spike timing. Biol Cybern, 98(6):459–478.
- Navlakha et al. (2015) Navlakha, S., Barth, A. L., and Bar-Joseph, Z. (2015). Decreasing-Rate Pruning Optimizes the Construction of Efficient and Robust Distributed Networks. PLoS Comput. Biol., 11(7):e1004347.
- Oja (1982) Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273.
- Osogami and Otsuka (2015) Osogami, T. and Otsuka, M. (2015). Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity. Sci Rep, 5:14149.
- Pande et al. (2005) Pande, P. P., Grecu, C., Jones, M., Ivanov, A., and Saleh, R. (2005). Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans. Comput., 54(8):1025–1040.
- Royer and Toh (1999) Royer, E. M. and Toh, C.-K. (1999). A review of current routing protocols for ad hoc mobile wireless networks. IEEE personal communications, 6(2):46–55.
- Shakkottai and Srikant (2007) Shakkottai, S. and Srikant, R. (2007). Network optimization and control. Found. Trends Netw., 2(3):271–379.
- Song et al. (2000) Song, S., Miller, K. D., and Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3(9):919–926.
- van Rossum et al. (2000) van Rossum, M. C., Bi, G. Q., and Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20(23):8812–8821.
- Vogels and Abbott (2009) Vogels, T. P. and Abbott, L. F. (2009). Gating multiple signals through detailed balance of excitation and inhibition in spiking networks. Nat. Neurosci., 12(4):483–491.
- Wang et al. (2003) Wang, J., Li, L., Low, S. H., and Doyle, J. C. (2003). Can shortest-path routing and tcp maximize utility. In INFOCOM 2003. Twenty-Second Annual Joint Conference of the IEEE Computer and Communications. IEEE Societies, volume 3, pages 2049–2056 vol.3.
- Watt and Desai (2010) Watt, A. J. and Desai, N. S. (2010). Homeostatic Plasticity and STDP: Keeping a Neuron’s Cool in a Fluctuating World. Front Synaptic Neurosci, 2:5.
- Yang and Calakos (2013) Yang, Y. and Calakos, N. (2013). Presynaptic long-term plasticity. Front Synaptic Neurosci, 5:8.
- Yuan et al. (2014) Yuan, K., Ling, Q., and Yin, W. (2014). On the Convergence of Decentralized Gradient Descent. arXiv, abs/1310.7063.
- Zanutto and Staddon (2007) Zanutto, B. S. and Staddon, J. E. (2007). Bang-bang control of feeding: role of hypothalamic and satiety signals. PLoS Comput. Biol., 3(5):e97.
- Zhou et al. (2004) Zhou, Q., Homma, K. J., and Poo, M. M. (2004). Shrinkage of dendritic spines associated with long-term depression of hippocampal synapses. Neuron, 44(5):749–757.
- Zinkevich et al. (2010) Zinkevich, M., Weimer, M., Li, L., and Smola, A. (2010). Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc.