# Predicting Switching Graph Labelings with Cluster Specialists

We address the problem of predicting the labeling of a graph in an online setting when the labeling is changing over time. We provide three mistake-bounded algorithms based on three paradigmatic methods for online algorithm design. The algorithm with the strongest guarantee is a quasi-Bayesian classifier which requires O(t log n) time to predict at trial t on an n-vertex graph. The fastest algorithm (with the weakest guarantee) is based on a specialist [10] approach and surprisingly only requires O(log n) time on any trial t. We also give an algorithm based on a kernelized Perceptron with an intermediate per-trial time complexity of O(n) and a mistake bound which is not strictly comparable. Finally, we provide experiments on simulated data comparing these methods.

## 1 Introduction

We study the problem of predicting graph labelings that evolve over time. Consider the following game for predicting the labeling of a graph in the online setting. Nature presents a graph $\mathcal{G}$; Nature queries a vertex $i_1$; the learner predicts the label of the vertex $\hat{y}_1 \in \{-1,1\}$; Nature presents a label $y_1$; Nature queries a vertex $i_2$; the learner predicts $\hat{y}_2$; and so forth. The learner's goal is to minimize the total number of mistakes $M = |\{t : \hat{y}_t \neq y_t\}|$. If Nature is strictly adversarial, the learner will incur a mistake on every trial, but if Nature is regular or simple, there is hope that the learner may incur only a few mistakes. Thus, a central goal of mistake-bounded online learning is to design algorithms whose total mistakes can be bounded relative to the complexity of Nature's labeling. This (non-switching) graph labeling problem has been studied extensively in the online learning literature (Herbster-LargeDiameter; HL09; CGVZ10; VCGZ11; HPG15). In this paper we generalize the setting to allow the underlying labeling to change arbitrarily over time. The learner has no knowledge of when a change in labeling will occur and therefore must be able to adapt quickly to these changes.

Consider an example of services placed throughout a city, such as public bicycle sharing stations. As the population uses these services the state of each station, such as the number of available bikes, naturally evolves throughout the day, at times gradually and at others abruptly, and we might want to predict the state of any given station at any given time. Since the location of a given station as well as the state of nearby stations will be relevant to this learning problem, it is natural to use a graph-based approach. Another setting might be a graph of major road junctions (vertices) connected by roads (edges), in which one wants to predict whether or not a junction is congested at any given time. Traffic congestion is naturally non-stationary and also exhibits both gradual and abrupt changes to the structure of the labeling over time (K98).

The structure of this paper is as follows. In Section 2 we discuss the background literature. In Section 3 we present the Switching Cluster Specialists algorithm (SCS), a modification of the method of specialists (Freund-Specialists) with the novel machinery of cluster specialists, a set of specialists that in a rough sense correspond to clusters in the graph. We consider two distinct sets of specialists, $\mathcal{B}_{1,n}$ and $\mathcal{F}_n$, where $\mathcal{B}_{1,n} \subset \mathcal{F}_n$. With the smaller set of specialists the bound is only larger by a factor of $\log n$. On the other hand, prediction is exponentially faster per trial, remarkably requiring only $O(\log n)$ time to predict. In Section 4 we provide experiments on Chicago Divvy Bicycle Sharing data. In Section 5 we provide some concluding remarks. All proofs are contained in the technical appendices.

### 1.1 Notation

We first present common notation. Let $\mathcal{G} = (V,E)$ be an undirected, connected, $n$-vertex graph with vertex set $V = \{1,2,\ldots,n\}$ and edge set $E$. Each vertex of this graph may be labeled with one of two states $\{-1,1\}$ and thus a labeling of a graph may be denoted by a vector $u \in \{-1,1\}^n$ where $u_i$ denotes the label of vertex $i$. The underlying assumption is that we are predicting vertex labels from a sequence $u_1,\ldots,u_T \in \{-1,1\}^n$ of graph labelings over $T$ trials. The set $K := \{t \in \{2,\ldots,T\} : u_t \neq u_{t-1}\} \cup \{1\}$ contains the first trial of each of the $|K|$ “segments” of the prediction problem. Each segment corresponds to a time period when the underlying labeling is unchanging. The cut-size of a labeling $u$ on a graph $\mathcal{G}$ is defined as $\Phi_{\mathcal{G}}(u) := |\{(i,j) \in E : u_i \neq u_j\}|$, i.e., the number of edges between vertices of disagreeing labels.
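The two quantities just defined, the cut-size of a labeling and the set of segment start trials, are simple to compute. The following is a minimal sketch; the function names `cut_size` and `segment_starts`, the 0-indexed vertices, and the toy path graph are illustrative assumptions rather than anything taken from the paper.

```python
def cut_size(edges, u):
    """Number of edges whose two endpoints carry different labels."""
    return sum(1 for i, j in edges if u[i] != u[j])


def segment_starts(labelings):
    """Trials (1-indexed) at which a new segment begins.

    labelings[t - 1] is the labeling in force on trial t; trial 1 always
    starts a segment, and every later change of labeling starts another.
    """
    return [1] + [t for t in range(2, len(labelings) + 1)
                  if labelings[t - 1] != labelings[t - 2]]


# Hypothetical path graph on 4 vertices (0-indexed here for convenience).
edges = [(0, 1), (1, 2), (2, 3)]
a = [1, 1, -1, -1]
b = [1, -1, -1, -1]
sequence = [a, a, b, b, b, a]

print(cut_size(edges, a))        # 1
print(segment_starts(sequence))  # [1, 3, 6]
```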

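We let $r_{\mathcal{G}}(i,j)$ denote the resistance distance (effective resistance) between vertices $i$ and $j$ when the graph $\mathcal{G}$ is seen as a circuit where each edge has unit resistance (e.g., Klein-ResistanceDistance). The resistance diameter of a graph is $R_{\mathcal{G}} := \max_{i,j \in V} r_{\mathcal{G}}(i,j)$. The resistance-weighted cut-size of a labeling $u$ is $\Phi^r_{\mathcal{G}}(u) := \sum_{(i,j) \in E : u_i \neq u_j} r_{\mathcal{G}}(i,j)$. Let $\Delta_d := \{\mu \in [0,1]^d : \sum_{i=1}^d \mu_i = 1\}$ be the $d$-dimensional probability simplex. For $\mu \in \Delta_d$ we define $H(\mu) := \sum_{i=1}^d \mu_i \log \frac{1}{\mu_i}$ to be the entropy of $\mu$. For $\mu, \omega \in \Delta_d$ we define $d(\mu,\omega) := \sum_{i=1}^d \mu_i \log \frac{\mu_i}{\omega_i}$ to be the relative entropy between $\mu$ and $\omega$. For a vector $\omega$ and a set of indices $\mathcal{I}$ let $\omega(\mathcal{I}) := \sum_{i \in \mathcal{I}} \omega_i$. For any positive integer $N$ we define $[N] := \{1,2,\ldots,N\}$ and for any predicate $[\text{pred}] := 1$ if pred is true and equals 0 otherwise.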

## 2 Related Work

The problem of predicting the labeling of a graph in the batch setting was introduced as a foundational method for semi-supervised (transductive) learning. In this work, the graph was built using both the unlabeled and labeled instances. The seminal work by (Blum:2001) used a metric on the instance space and then built a $k$NN or $\epsilon$-ball graph. The partial labeling was then extended to the complete graph by solving a mincut-maxflow problem where opposing binary labels represented sources and sinks. In practice this method suffered from very unbalanced cuts. Significant practical and theoretical advances were made by replacing the mincut/maxflow model with methods based on minimising a quadratic form of the graph Laplacian. Influential early results include but are not limited to (ZhuGL03; BN04; ZhouBLWS03). A limitation of the graph Laplacian-based techniques is that these batch methods, depending on their implementation, typically require $\Theta(n^2)$ to $\Theta(n^3)$ time to produce a single set of predictions. In the online switching setting we will aim for our fastest algorithm to have $O(\log n)$ time complexity per trial.

Predicting the labeling of a graph in the online setting was introduced by Herbster-OnlineLearningGraphs. The authors proved bounds for a Perceptron-like algorithm with a kernel based on the graph Laplacian. Since this work there have been a number of extensions and improvements in bounds including but not limited to (Herbster-LargeDiameter; CesaBianchi-FastOptimalTrees; HL09; Herbster-SwitchingGraphs; HPG15; RS17). Common to all of these papers is that a dominant term in their mistake bounds is the (resistance-weighted) cut-size.

From a simplified perspective, the methods for predicting the labeling of a graph (online) split into two approaches. The first approach works directly with the original graph and is usually based on a graph Laplacian (Herbster-OnlineLearningGraphs; HL09; HPG15); it provides bounds that utilize the additional connectivity of non-tree graphs, which are particularly strong when the graph contains uniformly-labeled clusters of small (resistance) diameter. The drawbacks of this approach are that the bounds are weaker on graphs with large diameter, and that computation times are slower.

The second approach is to approximate the original graph with an appropriately selected tree or “line” graph (Herbster-LargeDiameter; CGVZ10; CesaBianchi-FastOptimalTrees; VCGZ11). This enables faster computation times, and bounds that are better on graphs with large diameters. These algorithms may be extended to non-tree graphs by first selecting a spanning tree uniformly at random (CGVZ10) and then applying the algorithm to the sampled tree. This randomized approach induces expected mistake bounds that also exploit the cluster structure in the graph (see Section 2.2). Our algorithm takes this approach.

### 2.1 Switching Prediction

In this paper, rather than predicting a single labeling of a graph, we instead predict a (switching) sequence of labelings. Switching in the mistake- or regret-bound setting refers to the problem of predicting an online sequence when the “best comparator” is changing over time. In the simplest of switching models the set of comparators is structureless and we simply pay per switch. A prominent early result in this model is Herbster-TrackingExpert, which introduced the fixed-share update that will play a prominent role in our main algorithm. Other prominent results in the structureless model include but are not limited to (derandom; Bousquet-MPP; wacky; koolenswitch; kaw12; fixmir). A stronger model is to instead prove a bound that holds for any arbitrary contiguous sequence of trials. Such a bound is called an adaptive-regret bound. This type of bound automatically implies a bound on the structureless switching model. Adaptive regret was introduced in (HS07) (however, see the analysis of WML in (LW94) for a precursory result); other prominent results in this model include (AKCV12; fixmir; DGS15).

The structureless model may be generalized by introducing a divergence measure on the set of comparators. Thus, whereas in the structureless model we pay for the number of switches, in the structured model we instead pay in the sum of divergences between successive comparators. This model was introduced in (Herbster-TrackLinPred); prominent results include (KSW04; fixmir).

In (Herbster-SwitchingGraphs) the authors also consider switching graph label prediction. However, their results are not directly comparable to ours since they consider the combinatorially more challenging problem of repeated switching within a small set of labelings contained in a larger set. That set-up was a problem originally framed in the “experts” setting, posed as an open problem by freundopen and solved in (Bousquet-MPP). If we apply the bound in (Herbster-SwitchingGraphs) to the case where there is no repeated switching within a smaller set, then their bound is uniformly and significantly weaker than the bounds in this paper, and the algorithm is quite slow, requiring $\Theta(n^3)$ time per trial in a typical implementation. Also contained in (Herbster-SwitchingGraphs) is a baseline algorithm based on a kernel Perceptron with a graph Laplacian kernel. The bound of that algorithm has the significant drawback that it scales with respect to the “worst” labeling in a sequence of labelings. However, it is simple to implement and we use it as a benchmark in our experiments.

### 2.2 Random Spanning Trees and Linearization

Since we operate in the transductive setting where the entire unlabeled graph is presented to the learner beforehand, this affords the learner the ability to perform any reconfiguration to the graph as a preprocessing step. The bounds of most existing algorithms for predicting a labeling on a graph are usually expressed in terms of the cut-size of the graph under that labeling. A natural approach then is to use a spanning tree of the original graph which can only reduce the cut-size of the labeling.

The effective resistance between adjacent vertices $i$ and $j$, denoted $r_{\mathcal{G}}(i,j)$, is equal to the probability that a spanning tree of $\mathcal{G}$ drawn uniformly at random (from the set of all spanning trees of $\mathcal{G}$) includes $(i,j) \in E$ as one of its edges (e.g., LP17). As first observed by CesaBianchi-FastOptimalTrees, by selecting a spanning tree uniformly at random from the set of all possible spanning trees, mistake bounds expressed in terms of the cut-size then become expected mistake bounds now in terms of the effective-resistance-weighted cut-size of the graph. That is, if $\mathcal{R}$ is a random spanning tree of $\mathcal{G}$ then $\mathbb{E}[\Phi_{\mathcal{R}}(u)] = \Phi^r_{\mathcal{G}}(u)$ and thus $\Phi^r_{\mathcal{G}}(u) \le \Phi_{\mathcal{G}}(u)$. A random spanning tree can be sampled from a graph efficiently using a random walk or similar methods (see e.g., Wilson-RST).
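As an illustration of the sampling step, the sketch below draws a uniform random spanning tree with Wilson's loop-erased random-walk algorithm, which is one of the "random walk or similar methods" alluded to above; whether the authors used this particular sampler is an assumption. The function name `wilson_spanning_tree`, the adjacency-list representation and the triangle example are hypothetical. On a triangle with unit resistances the effective resistance across any edge is 2/3, so the empirical inclusion frequency of an edge should be close to that value.

```python
import random


def wilson_spanning_tree(adjacency, rng):
    """Sample a uniform random spanning tree with Wilson's algorithm.

    adjacency[v] lists the neighbours of vertex v (0-indexed, connected graph).
    Returns a list of n - 1 edges (child, parent).
    """
    n = len(adjacency)
    in_tree = [False] * n
    parent = [None] * n
    in_tree[rng.randrange(n)] = True  # arbitrary root
    for start in range(n):
        # Random walk until the current tree is hit. Overwriting parent[v]
        # on every revisit erases loops implicitly.
        v = start
        while not in_tree[v]:
            parent[v] = rng.choice(adjacency[v])
            v = parent[v]
        # Retrace the loop-erased path and attach it to the tree.
        v = start
        while not in_tree[v]:
            in_tree[v] = True
            v = parent[v]
    return [(v, parent[v]) for v in range(n) if parent[v] is not None]


# Triangle: each edge has effective resistance 2/3 (1 ohm in parallel with 2).
triangle = [[1, 2], [0, 2], [0, 1]]
rng = random.Random(0)
samples = 3000
hits = sum(
    any({a, b} == {0, 1} for a, b in wilson_spanning_tree(triangle, rng))
    for _ in range(samples)
)
frequency = hits / samples
print(frequency)  # empirical inclusion frequency of edge (0, 1), near 2/3
```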

To illustrate the power of this randomization consider the simplified example of a graph with two cliques each of size $n/2$, where one clique is labeled uniformly with ‘+1’ and the other ‘-1’, with an additional arbitrary $n/2$ “cut” edges between the cliques. This dense graph exhibits two disjoint clusters and $\Phi_{\mathcal{G}}(u) = n/2$. On the other hand $\Phi^r_{\mathcal{G}}(u) = \Theta(1)$, since between any two vertices in the opposing cliques there are $n/2$ edge-disjoint paths of length at most 3 and thus the effective resistance between any pair of vertices is $\Theta(1/n)$. Since bounds usually scale linearly with (resistance-weighted) cut-size, the cut-size bound would be vacuous but the resistance-weighted cut-size bound would be small.

We will make use of this preprocessing step of sampling a uniform random spanning tree, as well as a linearization of this tree to produce a (spine) line-graph, $\mathcal{S}$. The linearization of $\mathcal{G}$ to $\mathcal{S}$ as a preprocessing step was first proposed by Herbster-LargeDiameter and has since been applied in, e.g., (CGVZ10; PSST16). In order to construct $\mathcal{S}$, a random spanning tree $\mathcal{R}$ is picked uniformly at random. A vertex of $\mathcal{R}$ is then chosen and the graph is fully traversed using a depth-first search, generating an ordered list $V_{\mathcal{L}}$ of vertices in the order they were visited. Vertices in $V$ may appear multiple times in $V_{\mathcal{L}}$. A subsequence $V_{\mathcal{L}'} \subseteq V_{\mathcal{L}}$ is then chosen such that each vertex in $V$ appears only once. The line graph $\mathcal{S}$ is then formed by connecting each vertex in $V_{\mathcal{L}'}$ to its immediate neighbors in $V_{\mathcal{L}'}$ with an edge. We denote the edge set of $\mathcal{S}$ by $E_{\mathcal{S}}$ and let $\Phi_t := \Phi(u_t)$, where the cut $\Phi$ is with respect to the linear embedding $\mathcal{S}$. Surprisingly, as stated in the lemma below, the cut on this linearized graph is no more than twice the cut on the original graph.

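###### Lemma 1 (Herbster-LargeDiameter).

Given a labeling $u \in \{-1,1\}^n$ on a graph $\mathcal{G}$, for the mapping $\mathcal{G} \to \mathcal{R} \to \mathcal{S}$, as above, we have $\Phi_{\mathcal{S}}(u) \le 2\Phi_{\mathcal{R}}(u) \le 2\Phi_{\mathcal{G}}(u)$.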
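The linearization can be sketched as follows, taking the spine to be the vertices of the tree in order of their first visit by a depth-first search, which is one valid choice of the subsequence described above. The names `linearize`, `tree_cut` and `spine_cut`, the 0-indexed vertices and the example tree are assumptions for illustration. The final loop checks the factor-of-two relation between the spine cut and the tree cut on every labeling of the small tree.

```python
def linearize(n, tree_edges, root=0):
    """Spine of a tree: its vertices in order of first depth-first visit."""
    adjacency = [[] for _ in range(n)]
    for i, j in tree_edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    order, seen, stack = [], [False] * n, [root]
    while stack:
        v = stack.pop()
        if seen[v]:
            continue
        seen[v] = True
        order.append(v)
        # Reversed so that neighbours are explored in their listed order.
        stack.extend(reversed(adjacency[v]))
    return order


def tree_cut(tree_edges, u):
    """Cut-size of labeling u on the tree."""
    return sum(1 for i, j in tree_edges if u[i] != u[j])


def spine_cut(spine, u):
    """Cut-size of labeling u on the line graph joining consecutive spine vertices."""
    return sum(1 for a, b in zip(spine, spine[1:]) if u[a] != u[b])


# Hypothetical 7-vertex binary tree.
tree_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
spine = linearize(7, tree_edges)
print(spine)  # [0, 1, 3, 4, 2, 5, 6]

# Exhaustive check of the factor-of-two relation on this tree.
for mask in range(2 ** 7):
    u = [1 if (mask >> v) & 1 else -1 for v in range(7)]
    assert spine_cut(spine, u) <= 2 * tree_cut(tree_edges, u)
```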

By combining the above observations we may reduce the problem of learning on a graph to that of learning on a line graph. In particular, if we have an algorithm with a mistake bound of the form $M \le O(\Phi_{\mathcal{G}}(u))$, this implies that we may give an expected mistake bound of the form $\mathbb{E}[M] \le O(\Phi^r_{\mathcal{G}}(u))$ by first sampling a random spanning tree and then linearizing it as above. Thus, for simplicity in presentation, we will only state the deterministic mistake bounds in terms of cut-size, although the expected bounds in terms of resistance-weighted cut-sizes will hold simultaneously.

## 3 Switching Specialists

In this section we present a new method based on the idea of specialists (Freund-Specialists) from the prediction with expert advice literature (LW94; AS90; nicolobook). Although the achieved bounds are slightly worse than other methods for predicting a single labeling of a graph, the derived advantage is that it is possible to obtain “competitive” bounds with fast algorithms to predict a sequence of changing graph labelings.

Our inductive bias is to predict well when a labeling has a small (resistance-weighted) cut-size. The complementary perspective implies that the labeling consists of a few uniformly labeled clusters. This suggests the idea of maintaining a collection of basis functions where each such function is specialized to predict a constant function on a given cluster of vertices. To accomplish this technically we adapt the method of specialists (Freund-Specialists; kaw12). A specialist is a prediction function $\varepsilon$ from an input space to an extended output space with abstentions. So for us the input space is just $V = [n]$, the vertices of a graph; and the extended output space is $\{-1,1,\square\}$, where $\{-1,1\}$ corresponds to predicted labels of the vertices, but ‘$\square$’ indicates that the specialist abstains from predicting. Thus a specialist specializes its prediction to part of the input space, and in our application the specialists correspond to a collection of clusters which cover the graph, each cluster uniformly predicting $-1$ or $1$.

In Algorithm 1 we give our switching specialists method. The algorithm maintains a weight vector $\omega_t$ over the specialists in which the magnitudes may be interpreted as the current confidence we have in each of the specialists. The updates and their analyses are a combination of three standard methods: i) Halving loss updates, ii) specialists updates and iii) (delayed) fixed-share updates. The loss update of Algorithm 1 zeros the weight components of incorrectly predicting specialists, while the non-predicting specialists are not updated at all. The share update of Algorithm 1 is our delayed fixed-share style update. A standard fixed-share update may be written in the following form:

$$\omega_{t,\varepsilon} = (1-\alpha)\,\dot{\omega}_{t-1,\varepsilon} + \frac{\alpha}{|\mathcal{E}|}. \tag{3}$$

Although (3) superficially appears different to the delayed share update of Algorithm 1, in fact these two updates are exactly the same in terms of the predictions generated by the algorithm. This is because the delayed update caches updates until the given specialist is again active. The computational purpose of this caching is that if the number of active specialists is, for example, logarithmic in the size of the total specialist pool, we may then achieve an exponential speedup over (3), which in fact we will exploit.
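To see why caching is harmless, note that a specialist that sleeps for $m$ consecutive trials is touched only by the share update, so the $m$ applications of (3) compose into a single closed form, $(1-\alpha)^m \dot{\omega} + (1-(1-\alpha)^m)/|\mathcal{E}|$. The sketch below checks this numerically. The closed form is derived here from (3) as a geometric sum and is an assumption about the shape of the delayed update, which is not reproduced in this text; the function names are likewise hypothetical.

```python
def fixed_share(w, alpha, pool_size):
    """One application of the standard fixed-share update (3)."""
    return (1 - alpha) * w + alpha / pool_size


def delayed_fixed_share(w, alpha, pool_size, m):
    """Closed form for m consecutive applications of (3) to a sleeping specialist."""
    keep = (1 - alpha) ** m
    return keep * w + (1 - keep) / pool_size


w, alpha, pool_size, m = 0.3, 0.1, 50, 7

iterated = w
for _ in range(m):
    iterated = fixed_share(iterated, alpha, pool_size)

cached = delayed_fixed_share(w, alpha, pool_size, m)
print(abs(iterated - cached) < 1e-12)  # True
```

With this form only the specialists that are active on a trial need to be touched, each at constant cost, provided the trial on which each specialist was last active is stored.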

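In the following theorem we give our switching specialists bound. The dominant cost of switching from trial $t$ to $t+1$ is given by the non-symmetric $J_{\mathcal{E}}(\mu_t,\mu_{t+1}) := |\{\varepsilon \in \mathcal{E} : \mu_{t,\varepsilon} = 0,\ \mu_{t+1,\varepsilon} \neq 0\}|$, i.e., we pay only for each new specialist introduced but we do not pay for removing specialists.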

###### Theorem 2.

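For a given specialist set $\mathcal{E}$, let $M_{\mathcal{E}}$ denote the number of mistakes made in predicting the online sequence $(i_1,y_1),\ldots,(i_T,y_T)$ by Algorithm 1. Then,

$$M_{\mathcal{E}} \le \frac{1}{\pi_1}\log|\mathcal{E}| + \sum_{t=1}^{T}\frac{1}{\pi_t}\log\frac{1}{1-\alpha} + \sum_{i=1}^{|K|-1} J_{\mathcal{E}}(\mu_{k_i},\mu_{k_{i+1}})\log\frac{|\mathcal{E}|}{\alpha}, \tag{4}$$

for any sequence of consistent and well-formed comparators $\mu_1,\ldots,\mu_T \in \Delta_{|\mathcal{E}|}$, where $K := \{k_1 = 1 < \cdots < k_{|K|}\} := \{t \in \{2,\ldots,T\} : \mu_t \neq \mu_{t-1}\} \cup \{1\}$, and $\pi_t := \mu_t(\mathcal{Y}_t)$, with $\mathcal{Y}_t$ the set of specialists that are active and predict correctly on trial $t$.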

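The bound in the above theorem depends crucially on the best sequence of consistent and well-formed comparators $\mu_1,\ldots,\mu_T$. The consistency requirement implies that on every trial there is no active incorrect specialist assigned “mass” ($\mu_t(\mathcal{A}_t \setminus \mathcal{Y}_t) = 0$, where $\mathcal{A}_t$ is the set of specialists active on trial $t$). We may eliminate the consistency requirement by “softening” the loss update. A comparator $\mu \in \Delta_{|\mathcal{E}|}$ is well-formed if for all $v \in V$ there exists a unique $\varepsilon \in \mathcal{E}$ such that $\varepsilon(v) \neq \square$ and $\mu_\varepsilon > 0$, and furthermore there exists a $\pi \in (0,1]$ such that $\mu_\varepsilon \in \{0,\pi\}$ for all $\varepsilon \in \mathcal{E}$, i.e., each specialist in the support of $\mu$ has the same mass $\pi$ and these specialists disjointly cover the input space ($V$). At considerable complication to the form of the bound the well-formedness requirement may be eliminated.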
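The interplay of the three updates can be made concrete with a small simulation. The sketch below is not the authors' Algorithm 1 (which is not reproduced in this text); it is a plain, non-cached learner under stated assumptions: specialists are all labelled intervals of a spine $1,\ldots,n$, the share update (3) is applied to every specialist on every trial, prediction is a weighted vote of the active specialists, and the Halving-style loss update with the specialists renormalisation is applied only after a mistake. The name `scs_sketch` and the toy labeling are hypothetical. With $\alpha = 0$ and a single fixed labeling of cut-size 2, the printed mistake count can be compared with $(2+1)\log_2(n^2+n)$.

```python
import math
import random


def scs_sketch(n, trials, alpha):
    """Count mistakes of a simple specialists learner on a spine 1..n."""
    # Specialist (l, r, y) predicts y on vertices l..r and abstains elsewhere.
    specialists = [(l, r, y) for l in range(1, n + 1)
                   for r in range(l, n + 1) for y in (-1, 1)]
    size = len(specialists)
    w = [1.0 / size] * size
    mistakes = 0
    for v, y in trials:
        # Standard fixed-share update (3), applied to the whole pool.
        w = [(1 - alpha) * wk + alpha / size for wk in w]
        active = [k for k, (l, r, _) in enumerate(specialists) if l <= v <= r]
        margin = sum(w[k] * specialists[k][2] for k in active)
        y_hat = 1 if margin > 0 else -1
        if y_hat == y:
            continue  # the loss update below is applied only after a mistake
        mistakes += 1
        active_mass = sum(w[k] for k in active)
        correct_mass = sum(w[k] for k in active if specialists[k][2] == y)
        if correct_mass == 0.0:
            continue  # no active specialist was right; leave weights as they are
        for k in active:
            # Zero the wrong specialists and rescale the right ones so that
            # the active set keeps its total mass; sleepers are untouched.
            right = specialists[k][2] == y
            w[k] = w[k] * active_mass / correct_mass if right else 0.0
    return mistakes


n = 8
u = [1, 1, 1, -1, -1, -1, 1, 1]  # one fixed labeling, cut-size 2 on the spine
rng = random.Random(1)
trials = [(v, u[v - 1]) for v in (rng.randint(1, n) for _ in range(200))]
mistakes = scs_sketch(n, trials, alpha=0.0)
bound = (2 + 1) * math.log2(n * n + n)
print(mistakes, "<=", round(bound, 2))
```

This version costs time linear in the pool size per trial; the caching discussed above and the binary-tree basis introduced later are what bring the cost down.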

The above bound is “smooth” in that it scales with a gradual change in the comparator. In the next section we describe the novel specialist sets that we have tailored to graph-label prediction so that a small change in comparator corresponds to a small change in a graph labeling.

### 3.1 Cluster Specialists

In order to construct the cluster specialists over a graph $\mathcal{G} = (V = [n], E)$, we first construct a line graph as described in Section 2.2. A cluster specialist is then defined by $\varepsilon^{l,r}_y(\cdot)$, which maps $V \to \{-1,1,\square\}$, where $\varepsilon^{l,r}_y(v) := y$ if $l \le v \le r$ and $\varepsilon^{l,r}_y(v) := \square$ otherwise. Hence cluster specialist $\varepsilon^{l,r}_y$ corresponds to a function that predicts the label $y$ if vertex $v$ lies between vertices $l$ and $r$ and abstains otherwise. Recall that by sampling a random spanning tree the expected cut-size of a labeling on the spine is no more than twice the resistance-weighted cut-size on $\mathcal{G}$. Thus, given a labeled graph with a small resistance-weighted cut-size with densely interconnected clusters and modest inter-cluster connections, this implies that a cut-bracketed linear segment on the spine will in expectation roughly correspond to one of the original dense clusters. We will consider two basis sets of cluster specialists.

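#### Basis $\mathcal{F}_n$:

We first introduce the complete basis set $\mathcal{F}_n := \{\varepsilon^{l,r}_y : l,r \in [n],\ l \le r;\ y \in \{-1,1\}\}$. We say that a set of specialists $\mathcal{C}_u \subseteq \mathcal{E}$ from basis $\mathcal{E}$ covers a labeling $u \in \{-1,1\}^n$ if for all $v \in V = [n]$ and $\varepsilon \in \mathcal{C}_u$ we have $\varepsilon(v) \in \{u_v, \square\}$, and if for every $v \in V$ there exists $\varepsilon \in \mathcal{C}_u$ such that $\varepsilon(v) = u_v$. The basis $\mathcal{E}$ is complete if every labeling $u \in \{-1,1\}^n$ is covered by some $\mathcal{C}_u \subseteq \mathcal{E}$. The basis $\mathcal{F}_n$ is complete and in fact has the following approximation property: for any $u \in \{-1,1\}^n$ there exists a covering set $\mathcal{C}_u \subseteq \mathcal{F}_n$ such that $|\mathcal{C}_u| = \Phi_{\mathcal{S}}(u) + 1$. This follows directly as a line with $k-1$ cuts is divided into $k$ segments.

We now illustrate the use of basis $\mathcal{F}_n$ to predict the labeling of a graph. For simplicity we illustrate by considering the problem of predicting a single graph labeling without switching. As there is no switch we will set $\alpha := 0$, and thus if the graph is labeled with $u \in \{-1,1\}^n$ with cut-size $\Phi_{\mathcal{S}}(u)$ then we will need $\Phi_{\mathcal{S}}(u) + 1$ specialists to predict the labeling. The comparators may therefore be post-hoc optimally determined so that $\mu = \mu_1 = \cdots = \mu_T$, and there will be $\Phi_{\mathcal{S}}(u) + 1$ components of $\mu$ each with “weight” $1/(\Phi_{\mathcal{S}}(u)+1)$; thus $1/\pi_1 = \Phi_{\mathcal{S}}(u) + 1$, since there will be only one specialist (with non-zero weight) active per trial. Since the cardinality of $\mathcal{F}_n$ is $n^2 + n$, by substituting into (4) we have that the number of mistakes will be bounded by $(\Phi_{\mathcal{S}}(u)+1)\log(n^2+n)$. Note that for a single graph labeling on a spine this bound is not much worse than the best known result (Herbster-LargeDiameter, Theorem 4). In terms of computation time, however, it is significantly slower than the algorithm in (Herbster-LargeDiameter), requiring $\Theta(n^2)$ time to predict on a typical trial since on average there are $\Theta(n^2)$ specialists active per trial.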
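The counting argument above can be checked directly on a small spine. In this sketch a specialist is represented as a tuple `(l, r, y)`, meaning "predict `y` on vertices `l..r`, abstain elsewhere"; this representation and the names `basis_F` and `minimal_cover` are illustrative assumptions. The cover uses one specialist per maximal constant run of the labeling, giving cut-size plus one specialists.

```python
import math


def basis_F(n):
    """All interval specialists (l, r, y) on the spine 1..n."""
    return [(l, r, y) for l in range(1, n + 1)
            for r in range(l, n + 1) for y in (-1, 1)]


def minimal_cover(u):
    """One specialist per maximal constant run; u[v - 1] is the label of vertex v."""
    n, cover, start = len(u), [], 1
    for v in range(2, n + 2):
        if v == n + 1 or u[v - 1] != u[start - 1]:
            cover.append((start, v - 1, u[start - 1]))
            start = v
    return cover


u = [1, 1, -1, -1, 1]  # cut-size 2 on the spine
n = len(u)
cover = minimal_cover(u)

print(len(basis_F(n)))  # 30, i.e. n^2 + n
print(cover)            # [(1, 2, 1), (3, 4, -1), (5, 5, 1)]

# Value of the no-switch bound (cut-size + 1) * log2(n^2 + n) for this labeling.
print(len(cover) * math.log2(n * n + n))
```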

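#### Basis $\mathcal{B}_{1,n}$:

We now introduce the basis $\mathcal{B}_{1,n}$, which has $\Theta(n)$ specialists and only requires $O(\log n)$ time per trial to predict, with only a small increase in bound. The basis is defined as

$$\mathcal{B}_{p,q} := \begin{cases} \{\varepsilon^{p,q}_{-1}, \varepsilon^{p,q}_{1}\} & p = q, \\ \{\varepsilon^{p,q}_{-1}, \varepsilon^{p,q}_{1}\} \cup \mathcal{B}_{p,\lfloor\frac{p+q}{2}\rfloor} \cup \mathcal{B}_{\lfloor\frac{p+q}{2}\rfloor+1,q} & p \neq q \end{cases}$$

and is analogous to a binary tree. We have the following approximation property for $\mathcal{B}_{1,n}$.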

###### Proposition 3.

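The basis $\mathcal{B}_{1,n}$ is complete. Furthermore, for any labeling $u \in \{-1,1\}^n$ there exists a covering set $\mathcal{C}_u \subseteq \mathcal{B}_{1,n}$ such that $|\mathcal{C}_u| \le 2(\Phi_{\mathcal{S}}(u)+1)\lceil \log_2 \frac{n}{2} \rceil$ for $n > 2$.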

From a computational perspective the binary tree structure ensures that there are only $\Theta(\log n)$ specialists active per trial, leading to an exponential speed-up in prediction. A similar set of specialists was used for obtaining adaptive-regret bounds in (DGS15; KOWW17). In that work, however, the “binary tree” structure is over the time dimension (trial sequence), whereas in this work the binary tree is over the space dimension (graph) and a fixed-share update is used to obtain adaptivity over the time dimension. (An interesting open problem is to try to find good bounds and time-complexity with sets of specialists over both the time and space dimensions.)
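The recursive definition of the binary-tree basis translates directly into code. The sketch below builds the basis over a spine $1,\ldots,n$, counts the specialists active at a vertex, and covers a labeling by decomposing each constant run into tree intervals in the manner of a segment tree. The tuple representation `(l, r, y)` and the names `basis_B`, `cover_run` and `cover_labeling` are illustrative assumptions, and the decomposition is a standard one that is not claimed to match the construction used in the authors' proof.

```python
def basis_B(p, q):
    """Specialists (l, r, y) of the binary-tree basis over spine vertices p..q."""
    nodes = [(p, q, -1), (p, q, 1)]
    if p != q:
        mid = (p + q) // 2
        nodes += basis_B(p, mid) + basis_B(mid + 1, q)
    return nodes


def cover_run(p, q, l, r, y):
    """Cover the constant run l..r (label y) with tree intervals inside p..q."""
    if l <= p and q <= r:
        return [(p, q, y)]
    mid = (p + q) // 2
    left = cover_run(p, mid, l, r, y) if l <= mid else []
    right = cover_run(mid + 1, q, l, r, y) if r > mid else []
    return left + right


def cover_labeling(u):
    """Cover labeling u run by run; u[v - 1] is the label of vertex v."""
    n, cover, start = len(u), [], 1
    for v in range(2, n + 2):
        if v == n + 1 or u[v - 1] != u[start - 1]:
            cover += cover_run(1, n, start, v - 1, u[start - 1])
            start = v
    return cover


n = 16
basis = basis_B(1, n)
active = [s for s in basis if s[0] <= 5 <= s[1]]
print(len(basis), len(active))  # 62 10

print(cover_labeling([1, 1, 1, -1, -1, -1, -1, -1]))
# [(1, 2, 1), (3, 3, 1), (4, 4, -1), (5, 8, -1)]
```

For $n = 16$ the pool has $2(2n-1) = 62$ specialists, yet only the ten on the root-to-leaf path of the queried vertex are active.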

In the corollary that follows we will exploit the fact that by making the algorithm conservative we may reduce the usual $\log T$ term in the mistake bound induced by a fixed-share update to $\log\log T$. A conservative algorithm only updates the specialists’ weights on trials on which a mistake is made. Furthermore the bound given in the following corollary is smooth, as the cost per switch will be measured with a Hamming-like divergence $H$ on the “cut” edges between successive labelings, defined as

$$H(u,u') := \sum_{(i,j) \in E_{\mathcal{S}}} \Big[\, \big[[u_i \neq u_j] \lor [u'_i \neq u'_j]\big] \land \big[[u_i \neq u'_i] \lor [u_j \neq u'_j]\big] \,\Big].$$

Observe that $H(u,u')$ is no larger than twice the Hamming distance between $u$ and $u'$ and is often significantly smaller. To achieve the bounds we will need the following proposition, which upper bounds the divergence $J$ by $H$. A subtlety is that there are many distinct sets of specialists consistent with a given comparator. For example, consider a uniform labeling on $\mathcal{S}$. One may “cover” this labeling with a single specialist or alternatively with $n$ specialists, one covering each vertex. For the sake of simplicity in bounds we will always choose the smallest set of covering specialists. Thus we introduce the following notions of consistency and minimal-consistency.
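The divergence counts the spine edges that are cut in at least one of the two labelings and have at least one endpoint whose label changed. A minimal sketch follows, assuming the spine is the path through positions $0,\ldots,n-1$ of the label lists; the function names are hypothetical. The exhaustive loop checks the comparison with twice the Hamming distance on all pairs of labelings of a five-vertex spine.

```python
def switch_divergence(u, v):
    """Hamming-like divergence between two labelings of the same spine."""
    total = 0
    for i in range(len(u) - 1):
        j = i + 1
        cut_in_either = (u[i] != u[j]) or (v[i] != v[j])
        endpoint_changed = (u[i] != v[i]) or (u[j] != v[j])
        total += int(cut_in_either and endpoint_changed)
    return total


def hamming(u, v):
    """Number of vertices whose label differs between u and v."""
    return sum(1 for a, b in zip(u, v) if a != b)


u = [1, 1, 1, -1, -1, -1]
v = [1, 1, -1, -1, -1, -1]  # the boundary moved by one vertex
print(switch_divergence(u, v), hamming(u, v))  # 2 1

flipped = [-x for x in u]  # every label changes but only one edge is cut
print(switch_divergence(u, flipped), hamming(u, flipped))  # 1 6

# Exhaustive check on a 5-vertex spine.
labelings = [[1 if (m >> k) & 1 else -1 for k in range(5)] for m in range(32)]
assert all(switch_divergence(a, b) <= 2 * hamming(a, b)
           for a in labelings for b in labelings)
```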

###### Definition 4.

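A comparator $\mu \in \Delta_{|\mathcal{E}|}$ is consistent with the labeling $u \in \{-1,1\}^n$ if $\mu$ is well-formed and $\mu_\varepsilon > 0$ implies that $\varepsilon(v) \in \{u_v, \square\}$ for all $v \in V$.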

###### Definition 5.

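A comparator $\mu \in \Delta_{|\mathcal{E}|}$ is minimal-consistent with the labeling $u \in \{-1,1\}^n$ if it is consistent with $u$ and the cardinality of its support set $|\{\varepsilon : \mu_\varepsilon > 0\}|$ is the minimum of all comparators consistent with $u$.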

###### Proposition 6.

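For a linearized graph $\mathcal{S}$, for comparators $\mu, \mu' \in \Delta_{|\mathcal{F}_n|}$ that are minimal-consistent with $u$ and $u'$ respectively,

$$J_{\mathcal{F}_n}(\mu,\mu') \le \min\left(2H(u,u'),\ \Phi_{\mathcal{S}}(u')+1\right).$$

A proof is given in Appendix C. In the following corollary we summarize the results of the SCS algorithm using the basis sets $\mathcal{F}_n$ and $\mathcal{B}_{1,n}$ with an optimally-tuned switching parameter $\alpha$.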

###### Corollary 7.

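For a connected $n$-vertex graph $\mathcal{G}$ and with randomly sampled spine $\mathcal{S}$, the number of mistakes made in predicting the online sequence $(i_1,y_1),\ldots,(i_T,y_T)$ by the SCS algorithm with optimally-tuned $\alpha$ is upper bounded with basis $\mathcal{F}_n$ by

$$O\left(\Phi_1 \log n + \sum_{i=1}^{|K|-1} H(u_{k_i},u_{k_{i+1}})\left(\log n + \log|K| + \log\log T\right)\right)$$

and with basis $\mathcal{B}_{1,n}$ by

$$O\left(\left(\Phi_1 \log n + \sum_{i=1}^{|K|-1} H(u_{k_i},u_{k_{i+1}})\left(\log n + \log|K| + \log\log T\right)\right)\log n\right)$$

for any sequence of labelings $u_1,\ldots,u_T \in \{-1,1\}^n$ such that $u_{t,i_t} = y_t$ for all $t \in [T]$.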

Thus the bounds are equivalent up to a factor of $\log n$, although the computation times vary dramatically. See Appendix D for a technical proof of these results, and details on the selection of the switching parameter $\alpha$. Note that we may avoid the issue of needing to optimally tune $\alpha$ using the following method proposed by Herbster-TBE2 and by koolenswitch. We use a time-varying parameter and on trial $t$ we set $\alpha_t = \frac{1}{t+1}$. We have the following guarantee for this method; see Appendix E for a proof.

###### Proposition 8.

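For a connected $n$-vertex graph $\mathcal{G}$ and with randomly sampled spine $\mathcal{S}$, the SCS algorithm with bases $\mathcal{F}_n$ and $\mathcal{B}_{1,n}$ in predicting the online sequence $(i_1,y_1),\ldots,(i_T,y_T)$, now with time-varying $\alpha$ set equal to $\frac{1}{t+1}$ on trial $t$, achieves the same asymptotic mistake bounds as in Corollary 7 with an optimally-tuned $\alpha$, under the assumption that $\Phi_{\mathcal{S}}(u_1) \le \sum_{i=1}^{|K|-1} J_{\mathcal{E}}(\mu_{k_i},\mu_{k_{i+1}})$.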

## 4 Experiments

In this section we present results of experiments on real data. The City of Chicago currently contains public bicycle stations for its “Divvy Bike” sharing system. Current and historical data is available from the City of Chicago containing a variety of features for each station, including latitude, longitude, number of docks, number of operational docks, and number of docks occupied. The latest data on each station is published approximately every ten minutes.

We used a sample of hours of data, consisting of three consecutive weekdays in April . The first hours of data were used for parameter selection, and the remaining hours of data were used for evaluating performance. On each ten-minute snapshot we took the percentage of empty docks of each station. We created a binary labeling from this data by setting a threshold of . Thus each bicycle station is a vertex in our graph and the label of each vertex indicates whether that station is ‘mostly full’ or ‘mostly empty’. Due to this thresholding the labels of some ‘quieter’ stations were observed not to switch, as the percentage of available docks rarely changed. These stations tended to be on the ‘outskirts’, and thus we excluded these stations from our experiments, giving vertices in our graph.

Using the geodesic distance between each station’s latitude and longitudinal position a connected graph was built using the union of a -nearest neighbor graph () and a minimum spanning tree. For each instance of our algorithm the graph was then transformed in the manner described in Section 2.2, by first drawing a spanning tree uniformly at random and then linearizing using depth-first search.

As natural benchmarks for this setting we considered the following four methods. For all vertices predict with the most frequently occurring label of the entire graph from the training data (“Global”). For each vertex predict with its most frequently occurring label from the training data (“Local”). For all vertices at any given time predict with the most frequently occurring label of the entire graph at that time from the training data (“Temporal-Global”). For each vertex at any given time predict with that vertex’s label observed at the same time in the training data (“Temporal-Local”). We also compare our algorithms against a kernel Perceptron proposed by Herbster-SwitchingGraphs for predicting switching graph labelings (see Appendix F for details).

Following the experiments of CGVZ10, in which ensembles of random spanning trees were drawn and aggregated by an unweighted majority vote, we tested the effect on performance of using ensembles of instances of our algorithms, aggregated in the same fashion. We tested a range of ensemble sizes, using odd numbers to avoid ties.
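The aggregation step is an unweighted majority vote over the $\pm 1$ predictions of the ensemble members, each member running on its own random spine. A minimal sketch follows; the function name `majority_vote` and the example votes are hypothetical.

```python
def majority_vote(predictions):
    """Unweighted majority vote over +1/-1 predictions from an odd-sized ensemble."""
    if len(predictions) % 2 == 0:
        raise ValueError("use an odd ensemble size so that ties cannot occur")
    return 1 if sum(predictions) > 0 else -1


print(majority_vote([1, -1, 1]))           # 1
print(majority_vote([-1, -1, 1, -1, 1]))   # -1
```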

For every ten-minute snapshot (labeling) we queried vertices uniformly at random (with replacement) in an online fashion, giving a sequence of trials over hours. The average performance over iterations is shown in Figure 1. There are several surprising observations to be made from our results. Firstly, both SCS algorithms performed significantly better than all benchmarks and competing algorithms. Additionally, basis $\mathcal{B}_{1,n}$ outperformed basis $\mathcal{F}_n$ by quite a large margin, despite having the weaker bound and being exponentially faster. Finally, we observed a significant increase in performance of both SCS algorithms by increasing the ensemble size (see Figure 1). Additional details on these experiments and results for all ensemble sizes are given in Appendix G.

Interestingly, when tuning $\alpha$ we found basis $\mathcal{B}_{1,n}$ to be very robust, while $\mathcal{F}_n$ was very sensitive. This observation, combined with the logarithmic per-trial time complexity, suggests that SCS with $\mathcal{B}_{1,n}$ has promise to be a very practical algorithm.

## 5 Conclusion

Our primary result was an algorithm for predicting switching graph labelings with a per-trial prediction time of $O(\log n)$ and a mistake bound that smoothly tracks changes to the graph labeling over time. In the long version of this paper we plan to extend the analysis of the primary algorithm to the expected regret setting by relaxing our simplifying assumption of the well-formed comparator sequence that is minimal-consistent with the labeling sequence. From a technical perspective the open problem that we found most intriguing is to eliminate the $\log\log T$ term from our bounds. The natural approach to this would be to replace the conservative fixed-share update with a variable-share update (Herbster-TrackingExpert); in our efforts, however, we found many technical problems with this approach. On both the more practical and speculative side, we observe that the specialist sets $\mathcal{B}_{1,n}$ and $\mathcal{F}_n$ were chosen to “prove bounds”. In practice we can use any hierarchical graph clustering algorithm to produce a complete specialist set, and furthermore multiple such clusterings may be pooled. Such a pooled set of subgraph “motifs” could then be used, for example, in a multi-task setting (see for example (kaw12)).

## References

• (1) D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT’12, pages 290–304, 2012.
• (2) M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004.
• (3) A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 19–26, 2001.
• (4) O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
• (5) N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS ’12, pages 980–988, 2012.
• (6) N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction on a labeled tree. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 145–156. Omnipress, 2009.
• (7) N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. Journal of Machine Learning Research, 14(1):1251–1284, 2013.
• (8) N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
• (9) A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 1405–1411, 2015.
• (10) Y. Freund. Private communication, 2000. Also posted on http://www.learning-theory.org.
• (11) Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC ’97, pages 334–343, 1997.
• (12) A. György, T. Linder, and G. Lugosi. Tracking the best of many experts. In Proceedings of the 18th Annual Conference on Learning Theory, COLT ’05, pages 204–216, 2005.
• (13) E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14(088), 2007.
• (14) M. Herbster. Tracking the best expert II. Unpublished manuscript, 1997.
• (15) M. Herbster and G. Lever. Predicting the labelling of a graph via minimum $p$-seminorm interpolation. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.
• (16) M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS ’08, pages 649–656, 2008.
• (17) M. Herbster, S. Pasteris, and S. Ghosh. Online prediction at the limit of zero temperature. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2935–2943, 2015.
• (18) M. Herbster, S. Pasteris, and M. Pontil. Predicting a switching sequence of graph labelings. Journal of Machine Learning Research, 16(1):2003–2022, 2015.
• (19) M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, pages 577–584, 2006.
• (20) M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 305–312, 2005.
• (21) M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
• (22) M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, Sept. 2001.
• (23) K. Jun, F. Orabona, S. Wright, and R. Willett. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 943–951. PMLR, 20–22 Apr 2017.
• (24) B. S. Kerner. Experimental features of self-organization in traffic flow. Phys. Rev. Lett., 81:3797–3800, Oct 1998.
• (25) J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. Trans. Sig. Proc., 52(8):2165–2176, Aug. 2004.
• (26) D. J. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.
• (27) W. M. Koolen, D. Adamskiy, and M. K. Warmuth. Putting Bayes to sleep. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS ’12, pages 135–143, 2012.
• (28) W. M. Koolen and S. Rooij. Combining expert advice efficiently. In 21st Annual Conference on Learning Theory - COLT 2008, pages 275–286, 2008.
• (29) N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
• (30) R. Lyons and Y. Peres. Probability on Trees and Networks. Cambridge University Press, New York, NY, USA, 1st edition, 2017.
• (31) O. H. M. Padilla, J. Sharpnack, J. G. Scott, and R. J. Tibshirani. The DFS fused lasso: Linear-time denoising over general graphs. Journal of Machine Learning Research, 18(1):1–36, 2018.
• (32) A. Rakhlin and K. Sridharan. Efficient online multiclass prediction on graphs via surrogate losses. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, pages 1403–1411, 2017.
• (33) F. Vitale, N. Cesa-Bianchi, C. Gentile, and G. Zappella. See the tree through the lines: The shazoo algorithm. In Advances in Neural Information Processing Systems 23, pages 1584–1592, 2011.
• (34) V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT ’90, pages 371–386, 1990.
• (35) V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247–282, 1999.
• (36) D. B. Wilson. Generating random spanning trees more quickly than the cover time. In Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, STOC ’96, pages 296–303, 1996.
• (37) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS ’03, pages 321–328, 2003.
• (38) X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning, ICML ’03, pages 912–919, 2003.

## Appendix A Proof of Theorem 2

###### Proof.

Recall that the cached share update (LABEL:ShareUpdateLogN) is equivalent to performing (3). We thus simulate the latter update in our analysis. We first argue the inequality

$$[\hat{y}_t \neq y_t] \le \frac{1}{\mu_t(Y_t)}\bigl(d(\mu_t,\omega_t) - d(\mu_t,\dot{\omega}_t)\bigr), \tag{5}$$

as this is derived by observing that

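$$\begin{aligned}
d(\mu_t,\omega_t) - d(\mu_t,\dot{\omega}_t) &= \sum_{\varepsilon\in E}\mu_{t,\varepsilon}\log\frac{\dot{\omega}_{t,\varepsilon}}{\omega_{t,\varepsilon}} \\
&= \sum_{\varepsilon\in Y_t}\mu_{t,\varepsilon}\log\frac{\dot{\omega}_{t,\varepsilon}}{\omega_{t,\varepsilon}} \\
&\ge \mu_t(Y_t)\,[\hat{y}_t \neq y_t],
\end{aligned}$$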

where the second line follows the fact that if as either the specialist predicts ‘’ and or it predicts incorrectly and hence . The third line follows as for , if there has been a mistake on trial and otherwise the ratio is . Indeed, since Algorithm LABEL:Main_Alg is conservative, this ratio is exactly when no mistake is made on trial , thus without loss of generality we will assume the algorithm makes a mistake on every trial.

For clarity we will now use simplified notation and let $\pi_t := \mu_t(Y_t)$. We now prove the following inequalities, which we will add to (5) to create a telescoping sum of relative entropy terms and entropy terms.

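$$\frac{1}{\pi_t}\bigl[d(\mu_t,\dot{\omega}_t) - d(\mu_t,\omega_{t+1})\bigr] \ge -\frac{1}{\pi_t}\log\frac{1}{1-\alpha}, \tag{6}$$

$$\frac{1}{\pi_t}d(\mu_t,\omega_{t+1}) - \frac{1}{\pi_{t+1}}d(\mu_{t+1},\omega_{t+1}) \ge -\frac{1}{\pi_t}H(\mu_t) + \frac{1}{\pi_{t+1}}H(\mu_{t+1}) - J_E(\mu_t,\mu_{t+1})\log\frac{|E|}{\alpha}. \tag{7}$$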

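Firstly, (6) is proved with the following:

$$d(\mu_t,\dot{\omega}_t) - d(\mu_t,\omega_{t+1}) = \sum_{\varepsilon\in E}\mu_{t,\varepsilon}\log\frac{\omega_{t+1,\varepsilon}}{\dot{\omega}_{t,\varepsilon}} \ge \sum_{\varepsilon\in E}\mu_{t,\varepsilon}\log\left(\frac{(1-\alpha)\,\dot{\omega}_{t,\varepsilon}}{\dot{\omega}_{t,\varepsilon}}\right) = \log(1-\alpha),$$

where the inequality has used $\omega_{t+1,\varepsilon} \ge (1-\alpha)\,\dot{\omega}_{t,\varepsilon}$ from (3).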
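The step above can be checked numerically. The following is a minimal sketch that assumes update (3) has the standard fixed-share form, mixing the post-loss weights with the uniform distribution; the function names `fixed_share` and `entropy_drop` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math

def fixed_share(w_dot, alpha):
    """Assumed form of update (3): mix the (already normalized) post-loss
    weights with the uniform distribution over all specialists."""
    n = len(w_dot)
    return [(1.0 - alpha) * w + alpha / n for w in w_dot]

def entropy_drop(mu, w_dot, w_next):
    """d(mu, w_dot) - d(mu, w_next) = sum_e mu_e * log(w_next_e / w_dot_e),
    summed over the support of mu only."""
    return sum(m * math.log(wn / wd)
               for m, wd, wn in zip(mu, w_dot, w_next) if m > 0)

alpha = 0.1
w_dot = [0.5, 0.3, 0.2, 0.0]   # toy post-loss weights, summing to 1
mu = [0.5, 0.5, 0.0, 0.0]      # toy comparator, supported where w_dot > 0
w_next = fixed_share(w_dot, alpha)
gap = entropy_drop(mu, w_dot, w_next)

# Both lower bounds on the new weights hold entrywise,
# and the entropy drop is at least log(1 - alpha).
print(gap >= math.log(1.0 - alpha))
```

Under this assumed form, each new weight is at least $(1-\alpha)$ times the old one and at least $\alpha/|E|$, which are the two facts the proof draws from (3).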

To prove (7) we first define the following sets.

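$$\begin{aligned}
\Theta_t &:= \{\varepsilon\in E : \mu_{t-1,\varepsilon}\neq 0,\ \mu_{t,\varepsilon} = 0\},\\
\Psi_t &:= \{\varepsilon\in E : \mu_{t-1,\varepsilon}\neq 0,\ \mu_{t,\varepsilon} \neq 0\},\\
\Omega_t &:= \{\varepsilon\in E : \mu_{t-1,\varepsilon} = 0,\ \mu_{t,\varepsilon} \neq 0\}.
\end{aligned}$$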
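As an illustration of these three sets, here is a small sketch that partitions specialist indices according to whether their comparator mass is dropped, kept, or newly introduced between two consecutive trials. The list-based representation of a comparator and the name `transition_sets` are assumptions made for this example.

```python
def transition_sets(mu_prev, mu_curr):
    """Split specialist indices by how their comparator mass changes
    from the previous trial (mu_prev) to the current one (mu_curr)."""
    indices = range(len(mu_prev))
    theta = {e for e in indices if mu_prev[e] != 0 and mu_curr[e] == 0}  # mass dropped
    psi = {e for e in indices if mu_prev[e] != 0 and mu_curr[e] != 0}    # mass kept
    omega = {e for e in indices if mu_prev[e] == 0 and mu_curr[e] != 0}  # mass introduced
    return theta, psi, omega

# Toy comparators over five specialists (illustrative values only).
mu_prev = [0.5, 0.5, 0.0, 0.0, 0.0]
mu_curr = [0.0, 1 / 3, 1 / 3, 1 / 3, 0.0]
theta, psi, omega = transition_sets(mu_prev, mu_curr)
print(theta, psi, omega)
```

Specialists with zero mass on both trials belong to none of the three sets and contribute nothing to the sums that follow.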

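We now expand the following:

$$\begin{aligned}
&\frac{1}{\pi_t}d(\mu_t,\omega_{t+1}) - \frac{1}{\pi_{t+1}}d(\mu_{t+1},\omega_{t+1}) \\
&\quad= \frac{1}{\pi_t}d(\mu_t,\omega_{t+1}) - \frac{1}{\pi_t}d(\mu_{t+1},\omega_{t+1}) + \frac{1}{\pi_t}d(\mu_{t+1},\omega_{t+1}) - \frac{1}{\pi_{t+1}}d(\mu_{t+1},\omega_{t+1}) \\
&\quad= \frac{1}{\pi_t}\sum_{\varepsilon\in E}\mu_{t,\varepsilon}\log\frac{\mu_{t,\varepsilon}}{\omega_{t+1,\varepsilon}} - \frac{1}{\pi_t}\sum_{\varepsilon\in E}\mu_{t+1,\varepsilon}\log\frac{\mu_{t+1,\varepsilon}}{\omega_{t+1,\varepsilon}} \\
&\qquad+ \frac{1}{\pi_t}\sum_{\varepsilon\in E}\mu_{t+1,\varepsilon}\log\frac{\mu_{t+1,\varepsilon}}{\omega_{t+1,\varepsilon}} - \frac{1}{\pi_{t+1}}\sum_{\varepsilon\in E}\mu_{t+1,\varepsilon}\log\frac{\mu_{t+1,\varepsilon}}{\omega_{t+1,\varepsilon}} \\
&\quad= -\frac{1}{\pi_t}H(\mu_t) + \frac{1}{\pi_t}H(\mu_{t+1}) + \sum_{\varepsilon\in E}\left(\frac{\mu_{t,\varepsilon}}{\pi_t} - \frac{\mu_{t+1,\varepsilon}}{\pi_t}\right)\log\frac{1}{\omega_{t+1,\varepsilon}} \\
&\qquad- \frac{1}{\pi_t}H(\mu_{t+1}) + \frac{1}{\pi_{t+1}}H(\mu_{t+1}) + \sum_{\varepsilon\in E}\left(\frac{\mu_{t+1,\varepsilon}}{\pi_t} - \frac{\mu_{t+1,\varepsilon}}{\pi_{t+1}}\right)\log\frac{1}{\omega_{t+1,\varepsilon}}.
\end{aligned} \tag{8}$$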

Recall that a comparator is well-formed if , there exists a unique such that and , and furthermore there exists a such that , i.e., each specialist in the support of has the same mass and these specialists disjointly cover the input space (). Thus, by collecting terms into the three sets , , and we have

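$$\sum_{\varepsilon\in E}\left(\frac{\mu_{t,\varepsilon}}{\pi_t} - \frac{\mu_{t+1,\varepsilon}}{\pi_t}\right)\log\frac{1}{\omega_{t+1,\varepsilon}} = \sum_{\varepsilon\in\Theta_{t+1}}\frac{\mu_{t,\varepsilon}}{\pi_t}\log\frac{1}{\omega_{t+1,\varepsilon}} + \sum_{\varepsilon\in\Psi_{t+1}}\left(\frac{\mu_{t,\varepsilon}}{\pi_t} - \frac{\mu_{t+1,\varepsilon}}{\pi_t}\right)\log\frac{1}{\omega_{t+1,\varepsilon}} - \sum_{\varepsilon\in\Omega_{t+1}}\frac{\mu_{t+1,\varepsilon}}{\pi_t}\log\frac{1}{\omega_{t+1,\varepsilon}} \tag{9}$$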

and similarly

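$$\sum_{\varepsilon\in E}\left(\frac{\mu_{t+1,\varepsilon}}{\pi_t} - \frac{\mu_{t+1,\varepsilon}}{\pi_{t+1}}\right)\log\frac{1}{\omega_{t+1,\varepsilon}} = \sum_{\varepsilon\in\Psi_{t+1}}\left(\frac{\mu_{t+1,\varepsilon}}{\pi_t} - 1\right)\log\frac{1}{\omega_{t+1,\varepsilon}} + \sum_{\varepsilon\in\Omega_{t+1}}\left(\frac{\mu_{t+1,\varepsilon}}{\pi_t} - 1\right)\log\frac{1}{\omega_{t+1,\varepsilon}}. \tag{10}$$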

Substituting (9) and (10) into (8) and simplifying gives

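$$\begin{aligned}
\frac{1}{\pi_t}d(\mu_t,\omega_{t+1}) - \frac{1}{\pi_{t+1}}d(\mu_{t+1},\omega_{t+1}) &= -\frac{1}{\pi_t}H(\mu_t) + \frac{1}{\pi_{t+1}}H(\mu_{t+1}) + \sum_{\varepsilon\in\Theta_{t+1}}\frac{\mu_{t,\varepsilon}}{\pi_t}\log\frac{1}{\omega_{t+1,\varepsilon}} - \sum_{\varepsilon\in\Omega_{t+1}}\log\frac{1}{\omega_{t+1,\varepsilon}} \\
&\ge -\frac{1}{\pi_t}H(\mu_t) + \frac{1}{\pi_{t+1}}H(\mu_{t+1}) - |\Omega_{t+1}|\log\frac{|E|}{\alpha},
\end{aligned} \tag{11}$$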

where the inequality has used the fact that $\omega_{t+1,\varepsilon} \ge \frac{\alpha}{|E|}$ from (3).
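The equality and inequality in (11) can be sanity-checked on a toy instance. The sketch below assumes that well-formedness gives every specialist in the support of a comparator the same mass $\pi_t$, and that the weights come from a fixed-share step of the standard form, so that each weight is at least $\alpha/|E|$; all names and numbers are hypothetical.

```python
import math

def rel_entropy(mu, w):
    """Relative entropy d(mu, w) = sum_e mu_e * log(mu_e / w_e)."""
    return sum(m * math.log(m / x) for m, x in zip(mu, w) if m > 0)

def entropy(mu):
    """Entropy H(mu) = -sum_e mu_e * log(mu_e)."""
    return -sum(m * math.log(m) for m in mu if m > 0)

def uniform_on(support, size):
    """Comparator with equal mass on `support`; returns (vector, mass)."""
    mass = 1.0 / len(support)
    return [mass if e in support else 0.0 for e in range(size)], mass

size, alpha = 6, 0.2
mu_t, pi_t = uniform_on({0, 1}, size)
mu_next, pi_next = uniform_on({1, 2, 3}, size)

# Weights after an assumed fixed-share step, so every entry >= alpha / size.
w_dot = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]
w = [(1.0 - alpha) * x + alpha / size for x in w_dot]

theta = [e for e in range(size) if mu_t[e] != 0 and mu_next[e] == 0]
omega = [e for e in range(size) if mu_t[e] == 0 and mu_next[e] != 0]

lhs = rel_entropy(mu_t, w) / pi_t - rel_entropy(mu_next, w) / pi_next
entropies = -entropy(mu_t) / pi_t + entropy(mu_next) / pi_next
middle = (entropies
          + sum(mu_t[e] / pi_t * math.log(1.0 / w[e]) for e in theta)
          - sum(math.log(1.0 / w[e]) for e in omega))
lower = entropies - len(omega) * math.log(size / alpha)

print(abs(lhs - middle) < 1e-9, middle >= lower)
```

The terms over $\Psi_{t+1}$ cancel because a retained specialist has mass exactly $\pi_t$ at trial $t$, which is why only the dropped and newly introduced specialists appear in (11).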

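Summing over all trials then leaves a telescoping sum of relative entropy terms, a cost of $\frac{1}{\pi_t}\log\frac{1}{1-\alpha}$ on each trial, and $J_E(\mu_t,\mu_{t+1})\log\frac{|E|}{\alpha}$ for each switch. Thus,

$$\sum_{t=1}^{T}[\hat{y}_t \neq y_t] \le \frac{1}{\pi_1}d(\mu_1,\omega_1) + \frac{1}{\pi_1}H(\mu_1) + \sum_{t=1}^{T}\frac{1}{\pi_t}\log\frac{1}{1-\alpha} + \sum_{i=1}^{|K|-1}J_E(\mu_{k_i},\mu_{k_{i+1}})\log\frac{|E|}{\alpha}, \tag{12}$$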

where , and since , we can combine the remaining entropy and relative entropy terms to give , concluding the proof. ∎

## Appendix B Proof of Proposition 3

We recall the proposition:

The basis is complete. Furthermore, for any labeling there exists a covering set such that .

We first give a brief intuition of the proof; any required terms will be defined more completely later. For a given labeling of cut-size , the spine can be cut into clusters, where a cluster is a contiguous segment of vertices with the same label. We will upper bound the maximum number of cluster specialists required to cover a single cluster, and therefore obtain an upper bound for by summing over the clusters.

Without loss of generality we assume for some integer and thus the structure of is analogous to a perfect binary tree of depth . Indeed, for a fixed label parameter we will adopt the terminology of binary trees such that for instance we say specialist for has a so-called left-child and right-child . Similarly, we say that and are siblings, and is their parent. Note that any specialist is both an ancestor and a descendant of itself, and a proper descendant of a specialist is a descendant of one of its children. Finally the depth of specialist is defined to be equal to the depth of the corresponding node in a binary tree, such that is of depth , and are of depth , etc.

The first claim of the proposition is easy to prove as and thus any labeling can be covered. We now prove the second claim of the proposition.

We will denote a uniformly-labeled contiguous segment of vertices by the pair , where are the two end vertices of the segment. For completeness we will allow the trivial case when . Given a labeling , let be the set of maximum-sized contiguous segments of uniformly-labeled vertices. Note that or may be vacuous. When the context is clear, we will also describe as a cluster, and as the set of vertices .

For a given and cluster , we say is an -covering set with respect to if for all we have , and if for all there exists some such that and . That is, every vertex in the cluster is ‘covered’ by at least one specialist and no specialists cover any vertices . We define to be the set of all possible -covering sets with respect to .

We now define

 δ(B(l,r)):=|B(l,r)|

to be the complexity of .

For a given and cluster , we wish to produce an -covering set of minimum complexity, which we denote . Note that an -covering set of minimum complexity cannot contain any two specialists which are siblings, since they can be removed from the set and replaced by their parent specialist.
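To make the sibling-merging observation concrete, the following sketch builds a covering of a contiguous segment under the assumption that the specialists for a fixed label correspond to the dyadic intervals of a spine of length $n = 2^k$, as the perfect-binary-tree analogy suggests. The function `min_dyadic_cover` is a hypothetical helper using the usual segment-tree decomposition, not the paper's construction.

```python
from collections import Counter

def min_dyadic_cover(l, r, lo, hi, depth=0):
    """Cover the segment [l, r] with maximal dyadic intervals of [lo, hi].
    Returns a list of (left, right, depth) triples, left to right."""
    if l <= lo and hi <= r:        # interval lies inside the segment: take it whole
        return [(lo, hi, depth)]
    if hi < l or r < lo:           # interval is disjoint from the segment
        return []
    mid = (lo + hi) // 2           # otherwise recurse into the two children
    return (min_dyadic_cover(l, r, lo, mid, depth + 1)
            + min_dyadic_cover(l, r, mid + 1, hi, depth + 1))

# Toy example: spine of n = 16 vertices, cluster covering vertices 3..13.
cover = min_dyadic_cover(3, 13, 1, 16)
per_depth = Counter(depth for _, _, depth in cover)
print(cover)
print(max(per_depth.values()))
```

Because an interval is returned as soon as it fits inside the segment, two siblings are never both selected: their parent would have been taken instead. In this toy example no depth contributes more than two intervals.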

###### Lemma 9.

For any , for any , the -covering set of minimum complexity, contains at most two specialists of each unique depth.

###### Proof.

We first give an intuitive sketch of the proof. For a given and cluster assume that there are at least three specialists of equal depth in , then any of these specialists that are in the ‘middle’ may be removed, along with any of their siblings or proper descendants that are also members of without creating any ‘holes’ in the covering, decreasing the complexity of .

We use a proof by contradiction. Suppose for contradiction that for a given and , the -covering set of minimum complexity, , contains three distinct specialists of the same depth, . Without loss of generality let . Note that we have . We consider the following two possible scenarios: when two of the three specialists are siblings, and when none are.

If and are siblings, then we have and thus is an -covering set of smaller complexity, leading to a contradiction. The equivalent argument holds if and are siblings.

If none are siblings, then let be the sibling of and let be the parent of and . Note that