PALS: Plesiochronous and Locally Synchronous Systems

Consider an arbitrary network of communicating modules on a chip, each requiring a local signal telling it when to execute a computational step. There are three common solutions to generating such a local clock signal: (i) by deriving it from a single, central clock source, (ii) by local, free-running oscillators, or (iii) by handshaking between neighboring modules. Conceptually, each of these solutions is the result of a perceived dichotomy in which (sub)systems are either clocked or fully asynchronous, suggesting that the designer's choice is limited to deciding where to draw the line between synchronous and asynchronous design. In contrast, we take the view that the better question to ask is how synchronous the system can and should be. Based on a distributed clock synchronization algorithm, we present a novel design providing modules with local clocks whose frequency bounds are almost as good as those of corresponding free-running oscillators, yet neighboring modules are guaranteed to have a phase offset substantially smaller than one clock cycle. Concretely, parameters obtained from a 15nm ASIC implementation running at 2GHz yield mathematical worst-case bounds of 30ps on phase offset for a 32x32 node grid network.


I Introduction and Related Work

At surface level, the synchronous and asynchronous design paradigms seem to be opposing extremes. In their purest forms, this is true: early synchronous systems would wait for a clock signal to be propagated throughout the system and for all computations of the current clock cycle to complete before moving on to the next, whereas delay-insensitive circuits make no assumptions on timing whatsoever, explicitly acknowledging completion of any computational step.

In reality, however, fully synchronous or asynchronous systems are the exception. It has long since become impractical to wait for the clock to propagate across a chip, and there are numerous clock domains and asynchronous interfaces in any off-the-shelf “synchronously” clocked computer [1]. On the other hand, delay-insensitive circuits [2] suffer from substantial computational limitations [3, 4, 5] and provide no timing guarantees, rendering them unsuitable for many applications – in particular the construction of a general-purpose computer. Accordingly, most real-world “asynchronous” systems will utilize timing assumptions on some components, which in fact could be used to construct a (possibly very primitive) clock.

As systems grow in size – physically or due to further miniaturization – maintaining the illusion of perfect synchronism becomes increasingly challenging. Due to various scalability issues, more and more compromises are made. A well-known compromise that has gained popularity in recent years is the Globally Asynchronous Locally Synchronous (GALS) approach [6, 7]. Here, several clock domains are independently clocked and communicate asynchronously via handshakes, where synchronizers are used to ensure sufficiently reliable clock domain crossing [7, 8]. While this approach resolves important scalability issues, arguably it does so by surrendering to them: between clock domains, all interaction is asynchronous. However, fixing a sufficiently small probability of synchronizer failure, communication latency becomes bounded, permitting bounded response times to internal and external events. Yet, as timing relations between different clock domains remain desirable, GALS systems with guaranteed frequency relations between clock domains (but without any bound on their phase offsets), so-called mesochronous architectures, have been conceived [7].

One might think that GALS systems exemplify a fundamental struggle between the synchronous and asynchronous paradigms. We argue that this dichotomy is false! Rather, choices between clocked and clockless designs are driven by tradeoffs between guarantees on response times, cost (in terms of energy, buffer size, area, etc.), and complexity of development. Ideally, we would like to provide the convenient synchronous abstraction to the developer, yet have the system respond quickly to external and internal events. Unfortunately, existing approaches behave less than ideally in this regard:

  • Centralized clocking does not scale. In large systems, the resulting timing guarantees become too loose (requiring the system to be slowed down). Indeed, it has been shown that the achievable local skew, i.e., the maximum phase offset between neighbors, in a grid grows linearly with the width of the grid; see Section V-B.

  • A system-wide asynchronous design results in challenging development, especially when tight timing constraints are to be met. While in a clocked system one can bound response times by bounding the number of clock cycles for computation and communication, analyzing the (worst-case) response time of a large-scale asynchronous system has to be performed bottom-up. In addition, without highly constraining design rules, it is difficult to ensure that waiting for acknowledgements does not delay the response to a high-priority local event or an external request for a significant time. Causal acknowledge chains can span the entire system, potentially resulting in waiting times that grow linearly with the system diameter.

  • A GALS design ostensibly does not suffer from these issues, as each clock domain can progress on its own due to independent clocks. [Footnote 1: This is different for designs with pausible clocks [9, 10], rendering them even more problematic in this context.] However, clock domain crossings require synchronizers, incurring additional clock cycles of latency. If synchronizers are placed in the data path, communication becomes slow, even if a simple command is to be spread across the chip or information is acquired from an adjacent clock domain.

  • Alternative solutions that do not require synchronizers in the data path have been proposed in [11, 12]. The designs either skip clock cycles or switch to a clock signal shifted by half a period when the transmitter and receiver clocks risk violating setup/hold conditions. The indicating signal is synchronized without adding latency to the data path. Depending on the implementation and intended guarantees, the additional latency is on the order of a clock period. While this can, in principle, be brought down to the order of setup/hold windows, such designs would require considerable logical overhead and fine-tuning of delays. Further, note that an application of such a scheme has to periodically insert no-data packets, and an application-level transmission may be delayed by such a timeslot. In [11], this additional delay can be up to two periods when the no-data packet is oversampled. Finally, note that a potential application that runs on top of this scheme and uses handshaking to make sure that all packets of a (logical) time step have arrived before the next time step is locally initiated faces the same problem as a fully asynchronous design, i.e., the worst-case waiting time between consecutive time steps grows linearly with the system diameter.

Our Contribution

In this work, we present a radically different approach. By using a distributed clock synchronization algorithm, we essentially create a single, system-wide clock domain without needing to spread a clock signal from a single dedicated source with small skew. We employ results on gradient clock synchronization (GCS) by Lenzen et al. [13], in which the goal is to minimize the worst-case clock skew between adjacent nodes in a network. In our setting, the modules correspond to nodes, and they are connected by an edge if they directly communicate (i.e., exchange data). Thus, nodes of the clock synchronization algorithm communicate only if the respective nodes exchange data for computational functionality. This leads to an easy integration of our algorithm into the existing communication infrastructure.

The algorithm provides strong parametrized guarantees. Consider a network of local clocks that are controlled by our GCS algorithm. Let $D$ be the diameter of the network. Further, let $\rho$ be the (unintended) drift of the local clock, $\mu$ a freely chosen constant, and $\delta$ an upper bound on how precisely the phase difference to neighbors is known. Then:

  • The synchronized clocks are guaranteed to run at normalized rates between $1$ and $(1+\mu)(1+\rho)$.

  • The local skew is bounded by $O(\delta \log_{\mu/\rho} D)$.

  • The global skew, i.e., the maximum phase offset between any two nodes in the system, is $O(\delta D)$.

In other words, the synchronized clocks are almost as good as free-running clocks with drift , yet the local skew grows only logarithmically in the chip’s diameter. The local and global skew bounds are optimal up to roughly factor 2 [13].

As a novel theoretical result, we improve the global skew bound by roughly factor compared to [14]. This improvement brings our theoretical worst-case skew to within a factor of roughly of the theoretical optimum (which is only known to be achieved by a significantly more complicated mechanism [13]). As a second theoretical contribution, we prove that a minor modification of the algorithm reduces the obtained local skew bound by an additive .

We can control the base of the logarithm in the local skew bound by choosing $\mu$. Picking, e.g., means that for any . Of course, the constants hidden in the $O$-notation matter, but they are reasonably small. Concretely, for a grid network of $32 \times 32$ nodes in the 15 nm FinFET-based Nangate OCL [15], 2 GHz clock sources with an assumed drift of , and , our simple sample implementation guarantees that in the worst case. The resulting local skew is 30 ps, well below a clock cycle. We stress that this enables much faster communication than handshake-based solutions, which incur synchronizer delay.

Note that locking the local oscillators to a common stable reference does not require balancing the respective path delays, implying that our assumed $\rho$ is very pessimistic. Smaller $\rho$ (while keeping $\mu$ fixed) increases the base of the logarithm, further improving scalability. To show that the asymptotic behavior is relevant already to current systems and with our pessimistic $\rho$, we compare the above results to skews obtained by clock trees in the same grid networks in Section V-B.

Organization of this paper

We present the GCS algorithm in Section II, stating worst-case bounds on the local and global skews proved in the appendix. We then break down the algorithm into modules in Section III and discuss their implementation in Section IV. Section V presents Spice simulations for a network of four nodes organized in a line and compares them to clock trees. We conclude in Section VI.

II Algorithm

II-A High-level Description

We give a high-level description of our algorithm that achieves close synchronization between neighboring nodes in a network. We model the network as an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges (or links). Abstractly, we think of each node $v \in V$ as maintaining a logical clock, which we view as a function $L_v$ of time. That is, for each (Newtonian) time $t$, $L_v(t)$ is $v$’s logical clock value at time $t$. The local skew is the maximum clock difference between neighbors: $\mathcal{L}(t) = \max_{\{v,w\} \in E} |L_v(t) - L_w(t)|$. The global skew is the maximum clock difference between any two nodes in the network: $\mathcal{G}(t) = \max_{v,w \in V} |L_v(t) - L_w(t)|$. The goal of our algorithm is for each node to compute a logical clock minimizing $\mathcal{L}(t)$ at all times $t$, subject to the condition that all logical clocks progress at least at (normalized) rate $1$. [Footnote 2: Without the minimum rate requirement, the task becomes trivial: all nodes can simply set $L_v(t) = 0$ for all times $t$ to achieve perfect “synchronization.”]
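As an illustration of the two skew definitions above, the following minimal Python sketch evaluates the local and global skew for one snapshot of logical clock values. The four-node line, the clock values (in ps), and the function names are hypothetical and chosen only for illustration.

```python
def local_skew(clocks, edges):
    """Maximum clock difference over all pairs of neighboring nodes."""
    return max(abs(clocks[v] - clocks[w]) for v, w in edges)


def global_skew(clocks):
    """Maximum clock difference over all pairs of nodes in the network."""
    return max(clocks.values()) - min(clocks.values())


# Hypothetical snapshot: four nodes connected in a line, clock values in ps.
edges = [(0, 1), (1, 2), (2, 3)]
clocks = {0: 100.0, 1: 104.0, 2: 109.0, 3: 111.0}

print(local_skew(clocks, edges))  # 5.0  (between nodes 1 and 2)
print(global_skew(clocks))        # 11.0 (between nodes 0 and 3)
```

The local skew only inspects edges, so it can be much smaller than the global skew when offsets accumulate gradually along a path.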

We assume that each node $v$ has an associated reference clock signal, which we refer to as $v$’s hardware clock, denoted $H_v$. For notational convenience, we assume that the minimum (normalized) rate of $H_v$ is $1$ and its maximum rate is $1+\rho$: for all $v \in V$ and all times $t < t'$,

$$t' - t \le H_v(t') - H_v(t) \le (1+\rho)(t' - t). \tag{1}$$

[Footnote 3: It is common to assume a two-sided frequency error, i.e., a rate between $1-\rho$ and $1+\rho$. However, the one-sided notation simplifies expressions. Translating between the two models is a straightforward renormalization.]

To compute a logical clock, after initially setting , adjusts the rate of relative to the rate of (where this rate itself is neither known to nor under the influence of the algorithm). Specifically, can be either in slow mode or fast mode. In slow mode, runs at the same rate as , while in fast mode, sets the rate of to be times the one of its hardware clock. Here, is a parameter fixed by the designer. In order for the algorithm to work, a fast node must always run faster than a slow node—i.e., . We impose the stronger condition that .
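The rate envelopes of the two modes can be sketched as follows. The sketch assumes that the hardware clock runs at a normalized rate between 1 and 1 + rho and that fast mode multiplies this rate by 1 + mu; the numeric values of `RHO` and `MU` and the function name are hypothetical, not taken from the implementation.

```python
def logical_rate_bounds(rho, mu, fast):
    """Return (min, max) normalized rate of the logical clock in one mode.

    Assumption: the hardware clock rate lies in [1, 1 + rho]; fast mode
    scales it by (1 + mu), slow mode leaves it unchanged.
    """
    hw_min, hw_max = 1.0, 1.0 + rho
    factor = 1.0 + mu if fast else 1.0
    return factor * hw_min, factor * hw_max


# Hypothetical parameters for illustration only.
RHO, MU = 1e-5, 1e-3

slow_lo, slow_hi = logical_rate_bounds(RHO, MU, fast=False)
fast_lo, fast_hi = logical_rate_bounds(RHO, MU, fast=True)

# A fast node always outruns a slow node exactly when mu > rho.
print(slow_hi < fast_lo)  # True
```

With these assumptions, the slowest fast clock runs at rate 1 + mu and the fastest slow clock at rate 1 + rho, which is why mu must exceed rho for the modes to be separated.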

The GCS algorithm of Lenzen et al. [13] specifies conditions for a node to be in slow or fast mode that ensure asymptotically optimal local skew, provided that the global skew is bounded. The algorithm is parametrized by a variable , whose value determines the quality of synchronization.

Definition 1.

Let be a parameter. We say that a node satisfies the fast condition at time if there exists a natural number such that the following two conditions hold:

FC1

has a neighbor such that

FC2

all of ’s neighbors satisfy .

It satisfies the slow condition if there exists such that:

SC1

has a neighbor such that

SC2

all of ’s neighbors satisfy .

Definition 2.

We say that an algorithm is a GCS algorithm with parameters if the following invariants hold, for every node and all times :

I1

,

I2

I3

if satisfies the fast condition throughout the interval , then

I4

if satisfies the slow condition throughout the interval , then .

Invariants (I3) and (I4) still allow a node’s clock to vary within the rates of the underlying hardware clock, which is assumed not to be under the control of the algorithm.

Theorem 1.

Suppose algorithm is a GCS algorithm. Then maintains global skew and local skew for all sufficiently large .

Remark 1.

The precise local and global skew bounds achieved by a GCS algorithm at an arbitrary time depend on the initial state of the system. GCS algorithms are self-stabilizing in the sense that, starting from an arbitrary initial state, the algorithm will eventually achieve the skew bounds claimed in Thm. 1 (see [16]). In the appendix, we analyze the speed of convergence as a function of the local skew at initialization.

In order to fulfill the invariants of a GCS algorithm, each node $v$ maintains estimates of the offsets to neighboring clocks. Specifically, for each neighboring node $w$, $v$ computes an offset estimate $\hat{o}_w$. Given offset estimates for each neighbor, the synchronization algorithm determines if $v$ should run in fast mode by checking if the fast trigger (FT) is satisfied, as defined below. The trigger is parametrized by variables $\kappa$ (as in the GCS algorithm) and $\delta$, whose values are determined by the quality of estimates of neighboring clock values.

Definition 3.

We say that satisfies the fast trigger, FT, if there exists such that the following conditions hold:

FT1

,

FT2

.

We are now in the position to formalize our GCS algorithm, OffsetGCS (Algorithm 1). OffsetGCS is simple: at each time, each node checks if it satisfies FT. If so, it runs in fast mode. Otherwise, the node runs in slow mode. As the decision to run fast or slow is a discrete decision, a hardware implementation will be prone to metastability [17]. We discuss how to work around this problem in Section III.

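1: At each time $t$ do
2:     $\hat{o}_{\max} \leftarrow \max_{w} \hat{o}_w(t)$
3:     $\hat{o}_{\min} \leftarrow \min_{w} \hat{o}_w(t)$
4:     if $v$ satisfies FT then
5:         # fast mode (rate in $[1+\mu, (1+\rho)(1+\mu)]$)
6:         rate of $L_v$ $\leftarrow$ $(1+\mu) \cdot$ rate of $H_v$
7:     else
8:         # slow mode (rate in $[1, 1+\rho]$)
9:         rate of $L_v$ $\leftarrow$ rate of $H_v$
Algorithm 1 OffsetGCS algorithm for node $v$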

In what follows, we show that for a suitable choice of parameters, OffsetGCS is a GCS algorithm in the sense of Def. 2. Thus, OffsetGCS maintains the skew bounds of Thm. 1.
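The decision rule of OffsetGCS can be sketched in a few lines of Python. The sketch assumes a fast trigger of the form common in the gradient clock synchronization literature: there is a natural number s such that the largest offset estimate is at least 2·s·kappa − delta (FT1) and the smallest offset estimate is at least −2·s·kappa − delta (FT2), where an offset estimate approximates "neighbor clock minus own clock". This form, the search bound `s_max`, the parameter values, and all names are assumptions made for illustration.

```python
def satisfies_fast_trigger(offsets, kappa, delta, s_max=64):
    """Check the (assumed) fast trigger for one node.

    offsets: estimates of (neighbor clock - own clock), one per neighbor.
    kappa, delta: trigger parameters; s_max bounds the search over s.
    """
    o_max, o_min = max(offsets), min(offsets)
    for s in range(1, s_max + 1):
        ft1 = o_max >= 2 * s * kappa - delta    # some neighbor is far ahead
        ft2 = o_min >= -2 * s * kappa - delta   # no neighbor is too far behind
        if ft1 and ft2:
            return True
    return False


def logical_rate(hardware_rate, offsets, kappa, delta, mu):
    """Rate of the logical clock: sped up by (1 + mu) iff the trigger holds."""
    if satisfies_fast_trigger(offsets, kappa, delta):
        return (1.0 + mu) * hardware_rate
    return hardware_rate


# Hypothetical values (ps): kappa = 10, delta = 2.
print(satisfies_fast_trigger([25.0, -3.0], 10.0, 2.0))   # True: a neighbor is 25 ahead
print(satisfies_fast_trigger([5.0, -3.0], 10.0, 2.0))    # False: nobody is far ahead
print(satisfies_fast_trigger([19.0, -30.0], 10.0, 2.0))  # False: a neighbor lags too far
```

In the last case, speeding up to chase the neighbor ahead would increase the skew to the neighbor that is already 30 ps behind, so the node stays in slow mode.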

II-B Analysis of the OffsetGCS algorithm

We denote an upper bound on the overall uncertainty of ’s estimate of by :

(2)

In our analysis, it will be helpful to distinguish two sources of uncertainty faced by any implementation of the GCS algorithm. The first is the propagation delay uncertainty, which is the absolute timing variation in signal propagation adding to the measurement error. We use the parameter to denote an upper bound on this value.

The second source of error is the time between initiating a measurement and actually “using” it in control of the logical clock speed. During this time, the logical clocks advance at rates that are not precisely known. Here, we can exploit that the maximum rate difference between any two logical clocks is . Thus, denoting the maximum end-to-end latency by , this contributes an error of at most at any given time. Time includes the time for the logical clock to respond to control signal.

Once suitable values of and are determined, can be computed easily.

Lemma 1.

With , Ineq. (2) holds.

Based on , we now seek to choose as small as possible to realize the invariants given in Def. 2. The basic idea is to ensure that if a node satisfies the fast condition at time (which depends on the unknown phase difference), then it must satisfy the fast trigger (which is expressed in terms of the estimates ), thus ensuring that is in fast mode at time . In turn, if the slow condition is not satisfied, we must make sure that the fast trigger does not hold either.

Lemma 2.

Suppose for all times an implementation of OffsetGCS satisfies (2). Then for any

(3)

and , OffsetGCS is a GCS algorithm.

Proof.

We verify the conditions of Def. 2. Conditions I1 and I2 are direct consequences of the algorithm specification. For Condition I3, suppose first that satisfies the fast condition at time . Therefore, there exists some and neighbor of such that . Therefore, by Ineq. 2, , so that FT1 is satisfied. Similarly, since satisfies the fast condition, all of its neighbors satisfy . Therefore, , hence FT2 is satisfied for the same value of and runs in fast mode at time .

It remains to show that if satisfies the slow condition at time , then it does not satisfy FT at time (and, accordingly, is in slow mode). To this end suppose to the contrary that satisfies FT at . Since satisfies the slow condition at time ,

(4)
(5)

Since is assumed to satisfy FT at time , combining FT1 and FT2 with (2) imply that there exists some with

(6)
(7)

Combining (5) and (6), we must have

hence . Since , the previous expression implies that . Similarly, combining (4) and (7) gives , hence . Thus, , or equivalently (since and are integers), that . However, this final expression contradicts from before. Thus FT cannot be satisfied at time if the slow condition is satisfied at time , as desired. ∎

Applying Thm. 1 and Lem. 2 we obtain:

Corollary 1.

For suitable choices of parameters, OffsetGCS maintains local skew

III Modules

For a hardware implementation of the OffsetGCS algorithm, we break down the distributed algorithm into modules. Per node, this will be a local clock and a controller. Per link, we have a time offset measurement module for each node connected via the link. For each module we specify its input and output ports, its functionality, and its delay. We further relate the delay from Section II to the module delays.

III-A Local Clock

The clock signal of node is derived from a tunable local clock oscillator. It has input , the mode signal (given by the controller; see Section III-C), and output , the clock signal. The mode signal is used to tune the frequency of the oscillator within a factor of . An oscillator responds within time , i.e., switching between the two frequency modes takes at most time. We have four requirements to the local clock module:

(C1)

The initial maximum local skew is bounded by for a parameter depending on the implementation of the module.

(C2)

If is constantly (respectively ) during , then the local oscillator is in slow (respectively fast) mode at time and the rate of the local oscillator is in (respectively ).

(C3)

If is neither constantly nor during , then the local oscillator is unlocked and its rate is in .

(C4)

Clocks in slow mode are never faster than clocks in fast mode, hence .

Note that if (C2) does not apply, i.e., the mode signal is not stable, (C3) allows an arbitrary rate between fast and slow.

III-B Time Offset Measurement

In order to check whether the FT conditions are met, a node needs to measure the current phase offset to each of its neighbors . This is achieved by a time offset measurement module between and each neighbor . Note that the algorithm does not require a full access to the function , but only to the knowledge of whether has reached a bounded number of thresholds – we elaborate on this shortly.

The inputs of the module are the clock signal of and .

The outputs of the module are defined as follows. Let with . The output of the measurement module is a binary string of length bits where the first bits, denoted as , are going from to , followed by additional bits, denoted as , going from to . For example, a module with has outputs with thresholds , , , and .

Let be a (small) time. We require that output is set to if . Output is set to if . Otherwise, is unconstrained, i.e., within . Here, M denotes a meta-/unstable signal between logical values  and . Intuitively, will account for setup/hold times that any realistic hardware implementation will have to account for.

We further require that . This guarantees that at most one output is M at a time: Assume that bit is metastable, then . Since the adjacent thresholds are away, their corresponding outputs are either or . In fact, by Eq. (3) and since (we account for setup/hold times in ), we get that , hence our requirement is satisfied.

Choosing , where is the guaranteed local skew of the OffsetGCS algorithm, guarantees that the nodes will always be within the module’s measurement range. Note that here needs to respect the initial local skew as well, i.e., here is given by the bound from Cor. 1 plus the local skew on initialization (as we show in the appendix) .

Given the above, the module outputs form a unary thermometer code of the phase difference between and ’s clocks. Moreover, since this module decides whether a subset of the thresholds are met or not, then inevitably, any implementation of this module (see Section IV) is susceptible to metastable upsets. If implemented correctly, one can leverage the output encoding, which is a unary thermometer code, and guarantee that at most one bit is in a metastable state, located conveniently between a prefix of ’s and a suffix of ’s.
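A behavioral sketch of such a measurement module is given below. It assumes that a bit reads 1 once the offset exceeds its threshold by at least eps, reads 0 once the offset is below the threshold by at least eps, and may be metastable (written `M`) in between; the codes are therefore of the form 1s, then at most one M, then 0s. The threshold values, eps, the bit polarity, and the function name are hypothetical.

```python
import re


def measure(offset, thresholds, eps):
    """Thermometer-code an offset against ascending thresholds.

    Returns a string over {'0', 'M', '1'}, where 'M' marks a bit whose
    threshold lies within eps of the offset (possible metastability).
    """
    bits = []
    for th in thresholds:
        if offset >= th + eps:
            bits.append("1")
        elif offset <= th - eps:
            bits.append("0")
        else:
            bits.append("M")
    return "".join(bits)


# Hypothetical thresholds (ps); their spacing of 20 ps exceeds 2 * eps.
THRESHOLDS = [-30.0, -10.0, 10.0, 30.0]
EPS = 2.0

print(measure(0.0, THRESHOLDS, EPS))   # 1100
print(measure(11.0, THRESHOLDS, EPS))  # 11M0

# With threshold spacing larger than 2 * eps, at most one bit can be 'M'.
for off in range(-50, 51):
    assert re.fullmatch(r"1*M?0*", measure(float(off), THRESHOLDS, EPS))
```

The final loop checks the structural property used in the text: because adjacent thresholds are further apart than the uncertainty window, the metastable bit, if any, sits between the run of 1s and the run of 0s.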

Let denote the maximum end-to-end latency of the measurement module, i.e., an upper bound on the elapsed time from when is set, to when the measurements are available at the output. More precisely, if is set to for the entire duration of an interval , then the corresponding output is .

III-C Controller

Each node is equipped with a controller module. Its input is the (thermometer encoded) time measurement for each of ’s neighbors, i.e., the outputs of the time offsets measurement module on each link connecting to an adjacent node. It outputs the mode signal .

Denote by the maximum end-to-end delay of the controller circuit, i.e., the delay between its inputs (the measurement offset outputs) and its output . The specification of the controller’s interface is as follows:

(L1)

For , if algorithm OffsetGCS continuously maps the rate of to fast mode (resp. slow mode) during , then (resp. )

(L2)

In all other cases, the output at time is arbitrary, i.e., any value from .

III-D Putting it all together

The module specifications above, together, specify a realization of the OffsetGCS algorithm in hardware. The parameters of this hardware specification of OffsetGCS are: , and , where . These parameters are mapped to parameters of Cor. 1 by applying Lem. 2.

IV Hardware Implementation

We have implemented the modules from Section III and compiled them into a system of 4 nodes, connected in a line from node to node . To resemble a realistically sparse spacing of clocks, we placed nodes at distances of m. Target technology was the 15 nm FinFET-based Nangate OCL [15]. The gate-level design was laid out and routed with Cadence Encounter, which was also used for extraction of parasitics and timing. Local clocks run at a frequency of approximately 2 GHz, controllable within a factor of . We use here to make the interplay of and better visible in traces. We will discuss the gate-level design and its performance measures in the following.

IV-A Gate-level Implementation

Figures 1(a) to 1(c) show the schematics of an implementation of the time offset measurement module (Figure 1(a)) and the controller (Figures 1(b) and 1(c)).

Figure 1: Gate-level implementation of the OffsetGCS algorithm’s modules. Sub-figure 1(a) shows a linear TDC-based circuit for the module that measures the time offsets between nodes $v$ and $w$. Buffers and inverters are used as delay elements; the delay of each element appears next to it. Given node $v$’s time offsets to its neighbors, the circuit in Sub-figure 1(b) computes the minimum and maximum threshold levels that have been reached. Sub-figure 1(c) shows the circuit that computes whether the FT conditions are satisfied, i.e., whether there is an $s$ that satisfies both FT1 and FT2.

As a local clock source, we used a ring oscillator with some of its inverters being starved-inverters to set the frequency to either fast mode or slow mode. Nominal frequency is around  GHz, controllable by a factor via the signal. We choose , assuming a moderately stable oscillator. While this is below drifts achievable with uncontrolled ring oscillators, one may lock the frequency of the ring oscillator to a stable external quartz oscillator, see e.g., [18]. For such an implementation, we only require a stable frequency reference for local clocks; the phase difference of the distributed clock signal between adjacent nodes (which may be large) is immaterial. If distributing a stable clock source to all nodes is not feasible or considered too costly for a design, one may choose a larger resulting in a larger local and global skew bound; see Thm. 1 and Cor. 1.

We measure the logical clock value in terms of the time passed since its first active clock transition.

The time offset measurement module resembles a time-to-digital converter (TDC) in both its structure and function. The upper delay line in Figure 1(a), fed by remote clock , is tapped at intervals of . The lower delay line is used to shift the module’s own local clock to the middle of the delay line (plus some offset) so that phase differences can be measured both in the negative and positive direction. The module in Figure 1(a) is instantiated for with taps for threshold levels. In fact, in our hardware implementation we set , as even for this is sufficient for networks of diameter up to around (see how to choose this set of thresholds in the specification of this module in Sec. III).

If both clocks are perfectly synchronized, i.e., , then the state of the flip-flops will be after a rising transition of . Now, assume that clock is earlier than clock , say by a small more than  ps. Then . For the moment assuming that we do not make a measurement error, we get . From the delays in Figure 1(a) one verifies that in this case, the flip-flops are clocked before clock has reached the second flip-flop with output , resulting in a snapshot of . Likewise, an offset of results in a snapshot of , etc.

However, care has to be taken for non-binary outputs. Given the output specification above, one can verify that measurements are of the form or .

The circuit in Figure 1(b) then computes the minimum and the maximum of the thermometer codes (by AND and OR gates), determining the thresholds reached by the furthest node ahead and behind (while possibly masking metastable bits); compare this with lines 2 and 3 in OffsetGCS (Algorithm 1). Figure 1(c) finally computes the mode signal of $v$ from the thermometer codes, namely verifying whether there is an $s$ that satisfies both triggers; compare this with FT1 and FT2 in Def. 3.
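The effect of the AND and OR stages can be mimicked in software. The sketch below assumes thermometer codes written as a run of 1s followed by 0s (more 1s meaning a larger offset) and models a metastable bit `M` with worst-case three-valued gate behavior: an AND gate outputs 0 as soon as one input is 0, an OR gate outputs 1 as soon as one input is 1, and otherwise an `M` input may propagate. This gate model, the bit polarity, the example codes, and the function names are assumptions for illustration.

```python
from functools import reduce


def and3(a, b):
    """Three-valued AND over {'0', 'M', '1'}: a 0 input masks an M input."""
    if a == "0" or b == "0":
        return "0"
    return "1" if a == "1" and b == "1" else "M"


def or3(a, b):
    """Three-valued OR over {'0', 'M', '1'}: a 1 input masks an M input."""
    if a == "1" or b == "1":
        return "1"
    return "0" if a == "0" and b == "0" else "M"


def combine(codes, gate):
    """Apply a two-input gate bitwise across equal-length codes."""
    return "".join(reduce(gate, column) for column in zip(*codes))


def code_min(codes):
    """Bitwise AND: thermometer code of the smallest measured offset."""
    return combine(codes, and3)


def code_max(codes):
    """Bitwise OR: thermometer code of the largest measured offset."""
    return combine(codes, or3)


# Hypothetical measurements for three neighbors.
codes = ["1100", "1M00", "1110"]
print(code_min(codes))  # 1M00
print(code_max(codes))  # 1110 (the metastable bit is masked by a stable 1)
```

Because thermometer codes are ordered by their number of leading 1s, no comparator is needed: the bitwise AND and OR directly yield the minimum and maximum.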

Timing Parameters

We next discuss how the modules’ timing parameters relate to the extracted physical timing of the above design.

The time required for switching between oscillator modes is about the delay of the ring oscillator, which in our case is about  ps. The measurement latency plus the controller latency are given by a clock cycle (500 ps) plus the delay ( ps) from the flip-flops through the AND/OR circuitry in Figures 1(b) and 1(c) to the mode signal. In our case, delay extraction of the circuit yields . We thus have,  ps.

The propagation delay uncertainty, , in measuring if has reached a certain threshold is given by the uncertainties in latency of the upper delay chain plus the lower delay chain in Figure 1(b). For the described naive implementation using an uncalibrated delay line, this would be problematic. With an uncertainty of for gate delays, and starting with moderately sized and thus length of delay chains, extraction of minimum and maximum delays showed that the constraints for and from Lem. 2 were not met. Successive cycles of increasing and do not converge due to the linear dependency of and on the uncertainty with too large a factor. Rather, delay variations (of the entire system) have to be less than for the linear offset measurement circuit, depicted in Sub-figure 1(a), to fulfill Lem. 2’s requirements.

IV-B Improvements

Figure 2 shows an improved TDC-type offset-measurement circuit that does not suffer from the problem above. Conceptually the TDC of node that measures offsets w.r.t. node is integrated into the local ring oscillator of neighboring node . If has several neighbors, e.g., up to in a grid, they share the taps, but have their own flip-flops within node . The Figure shows a design for with taps, as used in our setup.

Figure 2: Improved offset measurement implementation. The delay of each delay element is written inside it. The gray buffers at the offset measurement taps decouple the load of the remaining circuitry. At the bottom of the ring oscillator, an odd number of starved inverters is used to set slow or fast mode for node $w$. The phase offset that we measure in each tap is written next to the corresponding flip-flop. The delay elements at the top are inverters instead of buffers to achieve a latency of  ps. We inverted the clock output to account for the negated signal at the tap of clock at the top.

Integration of the TDC into ’s local ring oscillator greatly reduces uncertainties at both ends: (i) the uncertainty at the remote clock port (of node ) is removed to a large extent, since the delay elements which are used for the offset measurements are part of ’s oscillator, and (ii) the uncertainty at the local clock port is greatly reduced by removing the delay line of length . Remaining timing uncertainties are the latency from taps to the D-ports of the flip-flops and from clock to the CLK-ports of the flip-flop. Timing extraction yielded  ps in presence of gate delay variations.

From Lem. 2, we thus readily obtain  ps and  ps which matched the previously chosen latencies of the delay elements. Applying Thm. 1 and Cor. 1 finally yields bounds of  ps on the global skew and of on the local skew. For our design with diameter this results in a maximum global skew of  ps and a maximum local skew of  ps. Note that considerably larger systems, e.g., a grid with side length of nodes and diameter , still are guaranteed to have a maximum local skew of  ps – and for , the base of the logarithm becomes .

V Simulation and Comparison to Clock Trees

V-A Spice Simulations on a Line Topology

We ran Spice simulations with Cadence Spectre of the post-layout extracted design for nodes arranged in a line, as described in Section IV. The line’s nodes are labeled to . For the simulations, we set instead of , resulting in slower decrease of skew, to better observe how skew is removed. We simulated two scenarios where node is initialized with an offset of  ps ahead of (resp. behind) all other nodes. Simulation time is  ns ( clock cycles) for the first and  ns for the second scenario.

Figure 3 shows the clock signals of nodes to at three points in time for the first scenario: (i) shortly after the initialization, (ii) around  ns, and (iii) after  ns.

For the mode signals, in the first scenario, we observe the following: Since node is ahead of nodes and , node ’s mode signal is correctly set to (slow mode) while node and ’s mode signals are set to (fast mode). Node is unaware that node is ahead since it only observes node . By default its mode signal is set to slow mode. When the gap to is large enough it switches to fast mode. This configuration remains until nodes and catch up to , where they switch to slow mode, to not overtake node . Again node sees only node which is still ahead and switches only after it catches up to .

Figure 4 (red lines) depicts the dynamics of the maximum local and global skews for the first scenario. Observe that the local skew decreases from the beginning until it falls below  ps. It then remains in a stable oscillatory state, in which it increases until the algorithm detects and reduces it. This is well below our worst-case bound of  ps on the local skew. The global skew first increases, as node does not switch to fast mode immediately. Scenario two shows a similar behaviour (blue lines in Figure 4).
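
The mode-switching behavior described above can be reproduced qualitatively with a toy model. The following is a minimal sketch, not the circuit or the full algorithm: it assumes a single threshold (rather than the multi-level fast and slow conditions), integer clock units, four nodes with the second one initially ahead, and hypothetical values for the initial offset, the threshold, and the fast-mode speed-up. The names `simulate_line`, `local_skew`, and `global_skew` are ours.

```python
def local_skew(clocks):
    """Largest clock difference between adjacent nodes on the line."""
    return max(abs(a - b) for a, b in zip(clocks, clocks[1:]))


def global_skew(clocks):
    """Largest clock difference between any two nodes."""
    return max(clocks) - min(clocks)


def simulate_line(clocks, steps, threshold, slow_rate=10, fast_rate=11):
    """Toy single-threshold rule on a line (an assumption, not the full GCS rule).

    A node runs in fast mode if some neighbor is ahead of it by at least
    `threshold`; otherwise it stays in the default slow mode.
    Returns the final clocks and the modes chosen in the last step.
    """
    clocks = list(clocks)
    n = len(clocks)
    modes = ["slow"] * n
    for _ in range(steps):
        # All nodes decide on the same snapshot of clock values.
        for v in range(n):
            neighbors = [w for w in (v - 1, v + 1) if 0 <= w < n]
            lead = max(clocks[w] - clocks[v] for w in neighbors)
            modes[v] = "fast" if lead >= threshold else "slow"
        # Then every node advances according to its chosen mode.
        for v in range(n):
            clocks[v] += fast_rate if modes[v] == "fast" else slow_rate
    return clocks, modes


# Hypothetical scenario: the second node starts 1000 units ahead of the others.
start = [0, 1000, 0, 0]
_, first_modes = simulate_line(start, steps=1, threshold=100)
final, _ = simulate_line(start, steps=2000, threshold=100)
print(first_modes)
print("local skew:", local_skew(start), "->", local_skew(final))
print("global skew:", global_skew(start), "->", global_skew(final))
```

In this toy run, the two neighbors of the leading node start in fast mode, while the end node, which sees only its direct neighbor, stays in slow mode until the gap to that neighbor reaches the threshold. The model ignores clock drift and measurement uncertainty, so it does not reproduce the initial increase of the global skew seen in Figure 4.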

Figure 3: Spice simulation of the line topology. Node has been initialized with a skew of  ps ahead of the other nodes. Nodes from left to right: (i) before , (ii) before before , (iii) before before .
Figure 4: Maximum local skew (dotted) and global skew (solid) for the scenarios of node initially being ahead (red) and behind (blue) of all other nodes.

V-B Comparison to Clock Tree

For comparison, we laid out a grid of flip-flops, evenly spread in distance in x and y direction across the chip. The data port of a flip-flop is driven by the OR of the up to four adjacent flip-flops. Clock trees were synthesized and routed with Cadence Encounter, with the target to minimize local skews. Delay variations on gates and nets were set to . The results are presented in Figure 5. For comparison, we plotted local skews guaranteed by our algorithm for the same grids with parameters extracted from the implementation described in Section IV. Observe the linear growth of the local skew of the clock tree, in contrast to the logarithmic growth of the local skew in our implementation. The figure also shows the skew for a clock tree with delay variations of . This comparison is relevant, as is governed by local delay variations, which can be expected to be smaller than those across a large chip.

Figure 5: Local skew (ps) between neighboring flip-flops in the grid. Clock tree with delay variation (solid green) and our algorithm with delay variation (solid magenta). The dotted line shows the clock tree with delay variation, demonstrating linear growth of the skew also in a different setting. Clock trees are shown up to after which Encounter ran out of memory.

It is worth mentioning that one can show that, for any clock tree, there are always two neighboring nodes in the grid whose local skew is proportional to  [19]. This follows from the fact that there are always two neighboring nodes in the grid that are at distance proportional to from each other in the clock tree [19, 20]. Accordingly, uncertainties accumulate in a worst-case fashion to create a local skew proportional to ; this behavior can be observed in Figure 6.

To gain intuition on this result, note that there is always an edge that, if removed (see the edge which is marked by an X in Figure 6), partitions the tree into two subtrees each spanning an area of and hence having a shared perimeter of length . Thus, there must be two adjacent nodes, one on each side of the perimeter, at distance in the tree.
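
To make the notion of tree distance between grid neighbors concrete, the following sketch measures it for one specific spanning tree. It is an illustration under stated assumptions, not a proof of the general statement: the comb-shaped tree (one spine column with a tooth per row) is our choice for simplicity and is not the low-stretch tree of Figure 6, and `comb_tree`, `tree_distances`, and `max_neighbor_distance` are hypothetical helper names.

```python
from collections import deque


def comb_tree(m):
    """Spanning tree of an m x m grid: every row is a path, joined along column 0."""
    adj = {(i, j): [] for i in range(m) for j in range(m)}

    def link(a, b):
        adj[a].append(b)
        adj[b].append(a)

    for i in range(m):
        for j in range(m - 1):
            link((i, j), (i, j + 1))      # teeth: horizontal edges in each row
    for i in range(m - 1):
        link((i, 0), (i + 1, 0))          # spine: vertical edges in column 0
    return adj


def tree_distances(adj, source):
    """Breadth-first search distances from `source` along tree edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def max_neighbor_distance(m):
    """Largest tree distance between two nodes that are adjacent in the grid."""
    adj = comb_tree(m)
    worst = 0
    for (i, j) in adj:
        dist = tree_distances(adj, (i, j))
        for neighbor in ((i + 1, j), (i, j + 1)):
            if neighbor in adj:
                worst = max(worst, dist[neighbor])
    return worst


for side in (4, 8, 16):
    print(side, max_neighbor_distance(side))
```

For this tree, the worst pair sits at the far ends of two adjacent teeth, so the value grows linearly with the side length of the grid. Better trees reduce the constant, but the argument above shows that some such growth is unavoidable.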

Our algorithm, on the other hand, manages to reduce the local skew exponentially to being proportional to .

Figure 6: A low stretch spanning tree of an grid [21]. The bold lines depict the spanning tree, i.e., our clock tree in this example. The two neighboring nodes that are of distance in the tree are circled (at the middle right side of the grid).

VI Conclusion

Low skew between neighboring nodes in a chip allows for efficient low-latency communication and provides the illusion of a single clock domain. A classical solution for this problem is to use a clock tree. However, clock trees inevitably produce local skews which are proportional to the diameter of the chip. We propose a solution based on a distributed clock synchronization algorithm. Its main idea is to control the local clocks of each node by measuring the time offsets from its neighbors and switching between fast and slow clock rates.

We compare our implementation to tool-generated  GHz clock trees for grids in  nm technology. Asymptotically, the implementation improves over the clock tree exponentially. Our simulations show an improvement of roughly on the local skew already for .

The algorithmic approach is highly robust. It does not rely on a single node or link, and can stabilize to small skews even under poor initialization conditions. In particular, it will recover from transient faults, and can handle the loss of individual nodes or links by adding simple detection mechanisms [16]. Moreover, it is known how to integrate new or recovering links or nodes by a simple mechanism without interfering with the skew bounds [16]. Thus, our approach provides a flexible and resilient alternative to classic designs.

In future work, we intend to design a full implementation including suitable (locked) oscillators. As demonstrated by the work of Mota et al. [18], systems with much smaller values of than are feasible. Consequently, even a simple design is likely to result in sufficiently stable local time references. However, a challenge here is that the oscillators need to be locked to a (frequency) reference. This prevents directly adjusting their phase, which would be in conflict with their locking. This issue can be resolved by using a digitally controlled oscillator derived from the local clock. Such a design is possible using synchronizers (which however would increase ), or could make use of metastability-containing techniques in the vein of Függer et al. [22].

Acknowledgments. We thank the reviewers for their valuable feedback, and in particular the third reviewer for pointers to related work. This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 716562), the Israel Science Foundation under Grant 867/19, ANR grant FREDDA (ANR-17-CE40-0013), and the Digicosme working group HicDiesMeus.

References

  • [1] H. D. Foster, “Trends in functional verification: A 2014 industry study,” in 52nd Annual Design Automation Conference.   ACM, 2015, p. 48.
  • [2] A. J. Martin, “Compiling communicating processes into delay-insensitive VLSI circuits,” Dist. comp., vol. 1, no. 4, pp. 226–234, 1986.
  • [3] ——, “The limitations to delay-insensitivity in asynchronous circuits,” in Beauty is our business.   Springer, 1990, pp. 302–311.
  • [4] R. Manohar and Y. Moses, “The eventual c-element theorem for delay-insensitive asynchronous circuits,” in 23rd IEEE International Symposium on Asynchronous Circuits and Systems.   IEEE, 2017, pp. 102–109.
  • [5] ——, “Asynchronous signalling processes,” in 25th IEEE Int. Symposium on Asynchronous Circuits and Systems.   IEEE, 2019, pp. 68–75.
  • [6] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems.” Stanford Univ CA Dept of Computer Science, Tech. Rep., 1984.
  • [7] P. Teehan, M. Greenstreet, and G. Lemieux, “A survey and taxonomy of gals design styles,” IEEE Design & Test of Computers, vol. 24, no. 5, pp. 418–428, 2007.
  • [8] R. Dobkin, R. Ginosar, and C. P. Sotiriou, “Data synchronization issues in gals socs,” in 10th International Symposium on Asynchronous Circuits and Systems.   IEEE, 2004, pp. 170–179.
  • [9] K. Y. Yun and R. P. Donohue, “Pausible clocking: A first step toward heterogeneous systems,” in Proc. Int. Conference on Computer Design. VLSI in Computers and Processors.   IEEE, 1996, pp. 118–123.
  • [10] X. Fan, M. Krstić, and E. Grass, “Analysis and optimization of pausible clocking based GALS design,” in IEEE International Conference on Computer Design. IEEE, 2009, pp. 358–365.
  • [11] L. R. Dennison, W. J. Dally, and D. Xanthopoulos, “Low-latency plesiochronous data retiming,” in Proceedings Sixteenth Conference on Advanced Research in VLSI.   IEEE, 1995, pp. 304–315.
  • [12] A. Chakraborty and M. R. Greenstreet, “Efficient self-timed interfaces for crossing clock domains,” in 9th International Symposium on Asynchronous Circuits and Systems.   IEEE, 2003, pp. 78–88.
  • [13] C. Lenzen, T. Locher, and R. Wattenhofer, “Tight Bounds for Clock Synchronization,” Journal of the ACM, vol. 57, no. 2, pp. 1–42, 2010.
  • [14] F. Kuhn and R. Oshman, “Gradient Clock Synchronization Using Reference Broadcasts,” in Principles of Distributed Systems, 13th International Conference, 2009, pp. 204–218. [Online]. Available: https://doi.org/10.1007/978-3-642-10877-8_17
  • [15] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, “Open cell library in 15nm FreePDK technology,” in Proceedings of the 2015 International Symposium on Physical Design. ACM, 2015, pp. 171–178.
  • [16] F. Kuhn, C. Lenzen, T. Locher, and R. Oshman, “Optimal gradient clock synchronization in dynamic networks,” CoRR, vol. abs/1005.2894, 2010. [Online]. Available: http://arxiv.org/abs/1005.2894
  • [17] L. R. Marino, “General Theory of Metastable Operation,” IEEE Transactions on Computers, vol. 30, no. 2, pp. 107–115, 1981.
  • [18] M. Mota, J. Christiansen, S. Debieux, V. Ryjov, P. Moreira, and A. Marchioro, “A flexible multi-channel high-resolution time-to-digital converter ASIC,” in IEEE Nuclear Science Symp., vol. 2, 2000, pp. 9–155.
  • [19] A. L. Fisher and H. T. Kung, “Synchronizing Large VLSI Processor Arrays,” IEEE Transactions on Computers, vol. C-34, no. 8, pp. 734–740, 1985.
  • [20] P. Boksberger, F. Kuhn, and R. Wattenhofer, “On the approximation of the minimum maximum stretch tree problem,” Technical report/ETH, Department of Computer Science, vol. 409, 2003.
  • [21] M. James, “Linear solver in linear time.” [Online]. Available: https://www.i-programmer.info/news/181-algorithms/5573-linear-solver-in-linear-time.html
  • [22] M. Függer, A. Kinali, C. Lenzen, and B. Wiederhake, “Fast All-Digital Clock Frequency Adaptation Circuit for Voltage Droop Tolerance,” in Symp. on Asynchronous Circuits and Systems, 2018.
  • [23] W. Rudin, Principles of Mathematical Analysis, 3rd ed.   New York: McGraw-Hill Education, 1976.

Appendix A Proof of Lemma 1

Proof.

Consider the estimate that the algorithm uses at node for neighbor at time . By definition of , the measurement is based on clock values and for some . Without loss of generality, we assume that to measure whether , the signals are sent at logical times satisfying . (Footnote 4: One can account for asymmetric propagation times by shifting and accordingly, so long as this is accounted for in , and carry out the proof analogously.) Denote by and the times when the respective signals arrive at the data or clock input, respectively, of the register indicating whether for a given threshold . (Footnote 5: We assume a register here, but the same argument applies to any state-holding component serving this purpose in the measurement circuit.) By definition of , we have that

Note that the register indicates , i.e., latches , if and only if . (Footnote 6: For simplicity of the presentation we neglect the setup/hold time (accounted for in ) and metastability; see Section III for a discussion.) Thus, we need to show

Assume first that . Then, using I4 and that , we can bound

Hence,

For the second implication, observe that it is equivalent to

As we have shown the first implication for any , the second follows analogously by exchanging the roles of and . ∎

Appendix B Proof of Theorem 1

In this appendix, we prove Theorem 1. We assume that at (Newtonian) time , the system satisfies some bound on local skew. The analysis we provide shows that the GCS algorithm maintains a (slightly larger) bound on local skew for all . An upper bound on the local skew also bounds the number of values of for which FC or SC (Definition 1) can hold, as a large implies a large local skew. (For example, if a node satisfies FC1 for some , then has a neighbor satisfying , implying that .) Accordingly, an implementation need only test for values of satisfying , where is an upper bound on the local skew. Our analysis also shows that given an arbitrary initial global skew , the system will converge to the skew bounds claimed in Theorem 1 within time . We note that the skew upper bounds of Theorem 1 match the lower bounds of [13] up to a factor of approximately 2, and these lower bounds apply even under the assumption of initially perfect synchronization (i.e., systems with ).

Our analysis also assumes that logical clocks are differentiable functions. This assumption is without loss of generality: By the Stone-Weierstrass Theorem (cf. Theorem 7.26 in [23]) every continuous function on a compact interval can be approximated arbitrarily closely by a differentiable function.

We will rely on the following technical result. We provide a proof in Section B-E.

Lemma 3.

For and with , let , where each is a differentiable function. Define by . Suppose has the property that for every and , if , then . Then for all , we have .
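
As a numerical sanity check of the kind of statement Lemma 3 makes, the sketch below samples a small family of linear functions and tests two things: the premise (every function that attains the maximum has derivative at most a given rate) and the conclusion (the maximum grows at most at that rate). The functions, the rate, the sampling grid, and the name `check_max_growth` are our own assumptions for illustration; a sampled check is no substitute for the proof.

```python
def check_max_growth(lines, rate, t0, t1, samples=800):
    """Sampled check for linear functions f_i(t) = a_i + b_i * t.

    `lines` is a list of (a_i, b_i) pairs, so f_i has derivative b_i.
    Returns (premise_ok, bound_ok):
      premise_ok: whenever f_i attains the maximum F at a sample point,
                  its derivative b_i is at most `rate`;
      bound_ok:   F(t) <= F(t0) + rate * (t - t0) at every sample point.
    """
    def envelope(t):
        return max(a + b * t for a, b in lines)

    start = envelope(t0)
    premise_ok, bound_ok = True, True
    for k in range(samples + 1):
        t = t0 + (t1 - t0) * k / samples
        top = envelope(t)
        for a, b in lines:
            # A function that (numerically) attains the maximum must be slow.
            if abs(a + b * t - top) < 1e-12 and b > rate:
                premise_ok = False
        if top > start + rate * (t - t0) + 1e-9:
            bound_ok = False
    return premise_ok, bound_ok


# The third function is fast (slope 3) but only reaches the maximum from t = 5.5 on.
family = [(3.0, 0.5), (1.0, 1.0), (-10.0, 3.0)]
print(check_max_growth(family, rate=1.0, t0=0.0, t1=5.0))
print(check_max_growth(family, rate=1.0, t0=0.0, t1=8.0))
```

On the shorter interval the fast function never attains the maximum, so it does not affect the bound; this mirrors how the analysis only needs to control the rate of the nodes that currently define the maximum. On the longer interval the premise fails, and so does the bound.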

Throughout this section, we assume that each node runs an algorithm satisfying the invariants stated in Definition 2. By Lemmas 2 and 1, Algorithm 1 meets this requirement if .

B-A Leading Nodes

We start by showing that skew cannot build up too quickly. This is captured by analyzing the following functions.

Definition 4 ( and Leading Nodes).

For each , , and , we define

where denotes the distance between and in . Moreover, set

Finally, we say that is a leading node if there is some satisfying

Observe that any bound on implies a corresponding bound on : If , then for any adjacent nodes we have . Therefore, . Our analysis will show that in general, for every and all times . In particular, considering gives a bound on in terms of . Because , the skew bounds will then follow if we can suitably bound at all times.

Note that the definition of is closely related to the definition of the slow condition. In fact, the following lemma shows that if is a leading node, then satisfies the slow condition. Thus, cannot increase quickly: I4 (Def. 2) then stipulates that leading nodes increase their logical clocks at rate at most . This behavior allows nodes in fast mode to catch up to leading nodes.

Lemma 4 (Leading Lemma).

Suppose is a leading node at time . Then .

Proof.

By I4, the claim follows if satisfies the slow condition at time . As is a leading node at time , there are and satisfying

In particular, , so . For any , we have

Rearranging this expression yields

In particular, for any , and hence

i.e., SC2 holds for at .

Now consider so that . Such a node exists because . We obtain

Thus SC1 is satisfied for , i.e., indeed the slow condition holds at at time . ∎

Lemma 4 can readily be translated into a bound on the growth of whenever .

Lemma 5 (Wait-up Lemma).

Suppose satisfies for all . Then

Proof.

Fix , and as in the hypothesis of the lemma. For and , define the function . Observe that

Moreover, for any satisfying , we have . Thus, Lemma 4 shows that is in slow mode at time . As (we assume that) logical clocks are differentiable, so is , and it follows that for any and time satisfying . By Lemma 3, it follows that grows at most at rate :

We conclude that

which can be rearranged into the desired result. ∎

Corollary 2.

For all and times , .

Proof.

Choose such that . As for all times , nothing is to show if . Let be the supremum of times from with the property that . Because is continuous, implies that . Hence, . By I2 and Lemma 5, we get that

Trailing Nodes

As at all times by I2, Lemma 7 implies that cannot grow faster than at rate when . This means that nodes whose clocks are far behind leading nodes can catch up, so long as the lagging nodes satisfy the fast condition and thus run at rate at least by I3. Our next task is to show that “trailing nodes” always satisfy the fast condition so that they are never too far behind leading nodes. The approach to showing this is similar to the one for Lemma 5, where now we need to exploit the fast condition.

Definition 5 ( and Trailing Nodes).

For each , , and , we define

where denotes the distance between and in . Moreover, set

Finally, we say that is a trailing node at time , if there is some satisfying