# Learning Set Functions that are Sparse in Non-Orthogonal Fourier Bases

Many applications of machine learning on discrete domains, such as learning preference functions in recommender systems or auctions, can be reduced to estimating a set function that is sparse in the Fourier domain. In this work, we present a new family of algorithms for learning Fourier-sparse set functions. They require at most nk - k log_2 k + k queries (set function evaluations), under mild conditions on the Fourier coefficients, where n is the size of the ground set and k the number of non-zero Fourier coefficients. In contrast to other work that focused on the orthogonal Walsh-Hadamard transform, our novel algorithms operate with recently introduced non-orthogonal Fourier transforms that offer different notions of Fourier-sparsity. These naturally arise when modeling, e.g., sets of items forming substitutes and complements. We demonstrate effectiveness on several real-world applications.

## 1 Introduction

Numerous problems in machine learning on discrete domains involve learning set functions, i.e., functions that map subsets of some ground set to the real numbers. In recommender systems, for example, such set functions express diversity among sets of articles and their relevance w.r.t. a given need Sharma et al. (2019); Balog et al. (2019); in sensor placement tasks, they express the informativeness of sets of sensors Krause et al. (2008); in combinatorial auctions, they express valuations for sets of items Brero et al. (2019). A key challenge is to estimate such a set function $s$ from a small number of observed evaluations. Without structural assumptions, an exponentially large (in the ground set size $n$) number of queries is needed. Thus, a key question is which families of set functions can be efficiently learnt, while capturing important applications. One key property is sparsity in the Fourier domain Stobbe and Krause (2012); Amrollahi et al. (2019); Weissteiner et al. (2020b).

The Fourier transform for set functions is classically known as the orthogonal Walsh-Hadamard transform (WHT) Bernasconi et al. (1996); De Wolf (2008); Li and Ramchandran (2015); Cheraghchi and Indyk (2017). Using the WHT, it is possible to learn functions with at most $k$ non-zero Fourier coefficients with evaluations Amrollahi et al. (2019). In this paper, we consider an alternative family of non-orthogonal Fourier transforms, recently introduced in the context of discrete signal processing on set functions (DSSP) Püschel (2018); Püschel and Wendler (2020). In particular, we present the first algorithms that (under mild assumptions on the Fourier coefficients) efficiently learn $k$-Fourier-sparse set functions requiring at most $nk - k \log_2 k + k$ evaluations. In contrast, naively computing the Fourier transform requires $2^n$ evaluations and $O(n 2^n)$ operations Püschel and Wendler (2020).

Importantly, sparsity in the WHT domain does not imply sparsity in the alternative Fourier domains we consider, or vice versa. Thus, we significantly expand the class of set functions that can be efficiently learnt. One natural example of set functions that are sparse in one of the non-orthogonal transforms, but not in the WHT, are certain preference functions considered by Djolonga et al. (2016) in the context of recommender systems and auctions. In recommender systems, each item may cover the set of needs that it satisfies for a customer. If needs are covered by several items at once, or items depend on each other to provide value, there are substitutability or complementarity effects between the respective items. A natural way to learn such set functions is to compute their respective sparse Fourier transforms.

Contributions. In this paper we develop, analyze, and evaluate novel algorithms for computing the sparse Fourier transform under the various notions of Fourier basis introduced by Püschel (2018):

1. We are the first to introduce an efficient algorithm to compute the sparse Fourier transform under the recent notions of non-orthogonal Fourier basis for set functions Püschel (2018); Püschel and Wendler (2020). In contrast to the naive fast Fourier transform algorithm that requires queries and operations, our sparse Fourier transform requires at most queries and operations to compute the non-zero coefficients of a Fourier-sparse set function. The algorithm works in all cases up to a null set of pathological set functions.

2. We then further extend our algorithm to handle an even larger class of Fourier-sparse set functions with queries and operations using filtering techniques.

3. We demonstrate the effectiveness of our algorithms in two real-world set function learning tasks: learning surrogate objective functions for sensor placement tasks and preference elicitation in combinatorial auctions. The sensor placements obtained by our learnt surrogates are indistinguishable from the ones obtained using the compressive sensing based WHT by Stobbe and Krause (2012). However, our algorithm does not require prior knowledge of the Fourier support and runs significantly faster. In the preference elicitation task the non-orthogonal basis naturally captures the structure of the valuation functions: only half as many Fourier coefficients were required in our basis as in the WHT basis.

Please note that all our proofs are in the appendix.

## 2 Fourier Transforms for Set Functions

We introduce background and definitions for set functions and associated Fourier bases, following the discrete-set signal processing (DSSP) introduced by Püschel (2018); Püschel and Wendler (2020). DSSP generalizes key concepts from classical signal processing, including shift, convolution or filtering, and Fourier transform to the powerset domain. The approach follows a general procedure that derives these concepts from a suitable definition of the shift operation Püschel and Moura (2006, 2008).

Set functions. We consider a ground set $N = \{x_1, \dots, x_n\}$. An associated set function maps each subset of $N$ to a real value:

$$s: 2^N \to \mathbb{R};\quad A \mapsto s(A). \tag{1}$$

Each set function can be identified with a $2^n$-dimensional vector $(s(A))_{A \subseteq N}$ by fixing an order on the subsets. We choose the lexicographic order on the corresponding set indicator vectors.

Shifts. Classical convolution (e.g., on images) is associated with the translation operator. Analogously, DSSP considers different versions of "set translations," called models 1–5. One choice (model 4) is $T_Q s(A) = s(A \cup Q)$ for $Q \subseteq N$. The shift operators are parameterized by the powerset monoid $(2^N, \cup)$, since the equality $T_Q(T_R s) = T_{Q \cup R} s$ holds for all $Q, R \subseteq N$ and all set functions $s$.

Convolutional filters. The corresponding linear, shift-equivariant convolution is given by

$$(h * s)(A) = \sum_{Q \subseteq N} h(Q)\, s(A \cup Q). \tag{2}$$

Namely, $h * (T_Q s) = T_Q (h * s)$, for all $Q \subseteq N$. Convolving with $h$ is a linear mapping called a filter; $h$ is also a set function.

Fourier transform and convolution theorem. The Fourier transform (FT) simultaneously diagonalizes all filters. Thus, different definitions of set shifts yield different notions of Fourier transform. For the shift chosen above, the Fourier transform of $s$ takes the form

$$\hat{s}(B) = \sum_{A \subseteq N:\, A \cup B = N} (-1)^{|A \cap B|}\, s(A) \tag{3}$$

with the inverse

$$s(A) = \sum_{B \subseteq N:\, A \cap B = \emptyset} \hat{s}(B). \tag{4}$$

As a consequence we obtain the convolution theorem

$$\widehat{(h * s)}(B) = \bar{h}(B)\, \hat{s}(B). \tag{5}$$

Interestingly, $\bar{h}$ (the so-called frequency response) is computed differently than $\hat{s}$, namely as

$$\bar{h}(B) = \sum_{A \subseteq N:\, A \cap B = \emptyset} h(A). \tag{6}$$
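The definitions (2), (3), and (6) and the convolution theorem (5) can be checked numerically on a small ground set. The following Python sketch does this by brute force. Encoding subsets of $N$ as $n$-bit integer masks is an assumption of the sketch, and the function names are hypothetical, not taken from the paper.

```python
import random

def convolve(h, s, n):
    """(h * s)(A) = sum over Q of h(Q) s(A | Q), Eq. (2); subsets are n-bit masks."""
    size = 1 << n
    return [sum(h[q] * s[a | q] for q in range(size)) for a in range(size)]

def fourier(s, n):
    """Brute-force model-4 Fourier transform, Eq. (3)."""
    full = (1 << n) - 1
    return [sum((-1) ** bin(a & b).count("1") * s[a]
                for a in range(1 << n) if a | b == full)
            for b in range(1 << n)]

def frequency_response(h, n):
    """h_bar(B) = sum of h(A) over all A disjoint from B, Eq. (6)."""
    return [sum(h[a] for a in range(1 << n) if a & b == 0)
            for b in range(1 << n)]

n = 3
rng = random.Random(0)  # integer test signals keep the comparison exact
s = [rng.randint(-5, 5) for _ in range(1 << n)]
h = [rng.randint(-5, 5) for _ in range(1 << n)]

# Left- and right-hand side of the convolution theorem, Eq. (5).
lhs = fourier(convolve(h, s, n), n)
rhs = [hb * sb for hb, sb in zip(frequency_response(h, n), fourier(s, n))]
theorem_holds = lhs == rhs
```

The brute-force loops cost on the order of $4^n$ operations, so this is only meant for sanity checks on tiny ground sets.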

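In matrix form, with respect to the chosen order of the subsets, the Fourier transform and its inverse are

$$F = \begin{pmatrix} 0 & 1 \\ 1 & -1 \end{pmatrix}^{\otimes n} \quad \text{and} \quad F^{-1} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{\otimes n}, \tag{7}$$

respectively, in which $(\cdot)^{\otimes n}$ denotes the $n$-fold Kronecker product of a matrix with itself. Thus, the Fourier transform and its inverse can be computed in $O(n 2^n)$ operations.

The columns of $F^{-1}$ form the Fourier basis and can be viewed as indexed by $B \subseteq N$. The $B$-th column, viewed as a set function of $A$, takes the value $1$ if $A \cap B = \emptyset$ and $0$ otherwise. The basis is not orthogonal, as can be seen from the triangular structure in (7).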
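The Kronecker structure in (7) translates into a butterfly scheme that applies the $2 \times 2$ factor once per ground-set element. The sketch below is one possible way to write it in Python, assuming subsets are encoded as $n$-bit masks (bit $i$ set means element $i$ is in the set); the function names are hypothetical. A brute-force version of (3) is included only to cross-check the fast transform.

```python
import random

def fourier_model4(s, n):
    """Fast transform: apply the factor ((0, 1), (1, -1)) along every bit."""
    out = list(s)
    for i in range(n):
        bit = 1 << i
        for a in range(1 << n):
            if not a & bit:
                lo, hi = out[a], out[a | bit]
                out[a], out[a | bit] = hi, lo - hi
    return out

def inverse_model4(s_hat, n):
    """Fast inverse: apply the factor ((1, 1), (1, 0)) along every bit."""
    out = list(s_hat)
    for i in range(n):
        bit = 1 << i
        for b in range(1 << n):
            if not b & bit:
                lo, hi = out[b], out[b | bit]
                out[b], out[b | bit] = lo + hi, lo
    return out

def fourier_bruteforce(s, n):
    """Direct evaluation of Eq. (3), used only as a reference."""
    full = (1 << n) - 1
    return [sum((-1) ** bin(a & b).count("1") * s[a]
                for a in range(1 << n) if a | b == full)
            for b in range(1 << n)]

n = 4
rng = random.Random(1)
s = [rng.randint(-9, 9) for _ in range(1 << n)]
s_hat = fourier_model4(s, n)
```

Each of the $n$ passes touches $2^{n-1}$ pairs, which is where the $O(n 2^n)$ operation count comes from.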

Example and interpretation. We start by considering a special class of preference functions that, e.g., model customers in a recommender system Djolonga et al. (2016). Preference functions naturally occur in machine learning tasks on discrete domains such as recommender systems and auctions, in which they are used, e.g., to model complementarity and substitution effects between goods. Goods complement each other when their combined utility is greater than the sum of their individual utilities. Analogously, goods substitute each other when their combined utility is smaller than the sum of their individual utilities. Formally, a preference function is given by

$$p(A) = \sum_{i \in A} u_i + \sum_{\ell=1}^{L} \Big( \max_{i \in A} r_{\ell i} - \sum_{i \in A} r_{\ell i} \Big) - \sum_{k=1}^{K} \Big( \max_{i \in A} a_{k i} - \sum_{i \in A} a_{k i} \Big). \tag{8}$$

Equation (8) is composed of a modular part parametrized by $u$, a repulsive part parametrized by the $L \times n$ matrix $r$, and an attractive part parametrized by the $K \times n$ matrix $a$. The repulsive part captures substitution and the attractive part complementary effects.

In Lemma 1 we show that these preference functions are indeed Fourier-sparse.

###### Lemma 1.

Preference functions of the form (8) are Fourier-sparse w.r.t. model 4 with at most non-zero Fourier coefficients.

Motivated by Lemma 1, we call set functions that are Fourier-sparse w.r.t. model 4 generalized preference functions. Formally, a generalized preference function is defined in terms of a collection $N = \{S_1, \dots, S_n\}$ of distinct subsets of some universe $U$, $S_i \subseteq U$, and a weight function $w: U \to \mathbb{R}$. The weight of a set $S \subseteq U$ is $w(S) = \sum_{u \in S} w(u)$. Then, the corresponding generalized preference function is

$$s: 2^N \to \mathbb{R};\quad A \mapsto w\Big( \bigcup_{S_i \in A} S_i \Big). \tag{9}$$

For non-negative weights, $s$ is called a weighted coverage function Krause and Golovin (2014), but here we allow general (signed) weights. Thus, generalized preference functions are generalized coverage functions. Generalized coverage functions can be visualized by a bipartite graph, see Fig. 1. In recommender systems, $S_i$ could model the customer needs covered by item $i$. Then, the score that a customer associates with a set of items corresponds to the needs covered by the items in that set. Substitution as well as complementary effects occur if the needs covered by items overlap (e.g., $S_i \cap S_j \neq \emptyset$ for some $i \neq j$).

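Interestingly, the Fourier coefficients in (3) of a generalized coverage function are

$$\hat{s}(B) = \begin{cases} \sum_{u \in U} w(u), & \text{if } B = \emptyset, \\ -w\Big( \bigcap_{S_i \in B} S_i \setminus \bigcup_{S_i \in N \setminus B} S_i \Big), & \text{otherwise,} \end{cases} \tag{10}$$

which corresponds to the (negative) weights of the fragments of the Venn diagram of the sets $S_i$ (Fig. 1). If the universe $U$ contains fewer than $2^n$ items, some fragments will have weight zero, i.e., the corresponding set functions are Fourier-sparse.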
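To make (9) and (10) concrete, the following Python sketch builds a small signed coverage function, transforms it, and compares the result with the fragment weights. The three sets, the weights, and all function names are made up for illustration. The sketch assumes that every universe element is covered by at least one set, so that $\hat{s}(\emptyset) = s(N)$ equals the total weight of the universe.

```python
def fourier_model4(s, n):
    """Model-4 transform via the butterfly induced by the Kronecker form."""
    out = list(s)
    for i in range(n):
        bit = 1 << i
        for a in range(1 << n):
            if not a & bit:
                lo, hi = out[a], out[a | bit]
                out[a], out[a | bit] = hi, lo - hi
    return out

def coverage_function(sets, weight, n):
    """s(A) = w(union of the S_i in A), Eq. (9); A is a bitmask over the sets."""
    values = []
    for a in range(1 << n):
        covered = set().union(*(sets[i] for i in range(n) if (a >> i) & 1))
        values.append(sum(weight[u] for u in covered))
    return values

def fragment_coefficients(sets, weight, n):
    """Closed form of Eq. (10): minus the weight of each Venn-diagram fragment."""
    coeffs = [sum(weight.values())]  # B = empty set
    for b in range(1, 1 << n):
        # Elements lying in exactly the sets selected by the mask b.
        fragment = [u for u in weight
                    if all((u in sets[i]) == bool((b >> i) & 1) for i in range(n))]
        coeffs.append(-sum(weight[u] for u in fragment))
    return coeffs

# Hypothetical example: three items covering three needs with signed weights.
sets = [{"a", "b"}, {"b", "c"}, {"c"}]
weight = {"a": 2, "b": -1, "c": 3}
s = coverage_function(sets, weight, 3)
s_hat = fourier_model4(s, 3)
support = [b for b, c in enumerate(s_hat) if c != 0]
```

With three universe elements, at most four of the eight coefficients can be non-zero: one for the empty set and one per non-empty fragment.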

Other shifts and Fourier bases. There are several other natural definitions of shifts, each with its respective shift-equivariant convolution, associated Fourier basis, and thus notion of Fourier-sparsity. Püschel and Wendler (2020) call these variants model 1–5, with 5 being the classical definition that yields the WHT and 4 the version introduced above. Table 1 collects the key concepts, also including model 3.

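The notions of Fourier-sparsity can differ dramatically. For example, consider the coverage function for which there is only one element $u$ in the universe and this element is covered by all sets $S_1, \dots, S_n$. Then, by (10), $\hat{s}(\emptyset) = w(u)$, $\hat{s}(N) = -w(u)$, and $\hat{s}(B) = 0$ for $\emptyset \subset B \subset N$ w.r.t. model 4, whereas $\hat{s}(B) \neq 0$ for all $B \subseteq N$ w.r.t. the WHT (assuming $w(u) \neq 0$).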
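This contrast is easy to reproduce. The Python sketch below evaluates the single-element coverage function with unit weight for $n = 4$ and counts non-zero coefficients in both bases. The unit weight, the unnormalized WHT (the usual $2^{-n}$ factor is omitted to stay in integers), and the function names are assumptions of the sketch.

```python
def fourier_model4(s, n):
    """Model-4 transform: factor ((0, 1), (1, -1)) along every bit."""
    out = list(s)
    for i in range(n):
        bit = 1 << i
        for a in range(1 << n):
            if not a & bit:
                lo, hi = out[a], out[a | bit]
                out[a], out[a | bit] = hi, lo - hi
    return out

def wht_unnormalized(s, n):
    """Walsh-Hadamard transform without scaling: factor ((1, 1), (1, -1))."""
    out = list(s)
    for i in range(n):
        bit = 1 << i
        for a in range(1 << n):
            if not a & bit:
                lo, hi = out[a], out[a | bit]
                out[a], out[a | bit] = lo + hi, lo - hi
    return out

n = 4
# One universe element of weight 1 covered by every set: s(A) = 1 unless A is empty.
s = [0] + [1] * ((1 << n) - 1)

model4 = fourier_model4(s, n)
wht = wht_unnormalized(s, n)
nonzero_model4 = [b for b, c in enumerate(model4) if c != 0]
nonzero_wht = [b for b, c in enumerate(wht) if c != 0]
```

Here `nonzero_model4` contains only the masks of the empty set and of $N$, while `nonzero_wht` contains all $2^n$ masks.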

###### Remark 1.

For the same reason the preference functions in (8) with are dense w.r.t. the WHT basis.

The Fourier bases have appeared in different contexts before. For example, (3) can be related to the W-transform, which has been used by Chakrabarty and Huang (2012) to test coverage functions.

## 3 Learning Fourier-Sparse Set Functions

We now present our algorithm for learning Fourier-sparse set functions w.r.t. model 4. One of our main contributions is that the presented derivation also applies to the other models. In particular, we derive the variants for models 3 and 5 from Table 1 in the appendix.

###### Definition 1.

A set function $s$ is called $k$-Fourier-sparse if

$$\operatorname{supp}(\hat{s}) = \{B : \hat{s}(B) \neq 0\} = \{B_1, \dots, B_k\}. \tag{11}$$

Thus, exactly learning a $k$-Fourier-sparse set function is equivalent to computing its $k$ non-zero Fourier coefficients and associated support. Formally, we want to solve:

###### Problem 1 (Sparse FT).

Given oracle access to query a $k$-Fourier-sparse set function $s$, compute its Fourier support and associated Fourier coefficients.

### 3.1 Sparse FT with Known Support

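First, we consider the simpler problem of computing the Fourier coefficients if the Fourier support $\operatorname{supp}(\hat{s})$ (or a small enough superset $\mathcal{B} \supseteq \operatorname{supp}(\hat{s})$) is known. In this case, the solution boils down to selecting queries $\mathcal{A} \subseteq 2^N$ such that the linear system of equations

$$s_{\mathcal{A}} = F^{-1}_{\mathcal{A}\mathcal{B}}\, \hat{s}_{\mathcal{B}}, \tag{12}$$

admits a solution. Here, $s_{\mathcal{A}}$ is the vector of queries, $F^{-1}_{\mathcal{A}\mathcal{B}}$ is the submatrix of $F^{-1}$ obtained by selecting the rows indexed by $\mathcal{A}$ and the columns indexed by $\mathcal{B}$, and $\hat{s}_{\mathcal{B}}$ are the unknown Fourier coefficients we want to compute.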

###### Theorem 1 (Theorem 1 of Püschel and Wendler (2020)).

Let be -Fourier-sparse with . Let . Then is invertible and can be perfectly reconstructed from the queries .

Consequently, we can solve Problem 1 if we have a way to discover a superset $\mathcal{B}$ of the Fourier support, which is what we do next.
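As an illustration of the known-support case, the following Python sketch solves a system of the form (12) for one particular query choice. It is an assumption of this sketch, derived from the inverse (4), that each candidate $B$ is paired with the query $N \setminus B$: then $s(N \setminus B)$ is the sum of all coefficients $\hat{s}(B')$ with $B' \subseteq B$, the system is triangular once the candidates are sorted by cardinality, and it can be solved by forward substitution. Subsets are encoded as bitmasks and all names are hypothetical.

```python
def make_oracle(s_hat):
    """Query oracle for a sparse s via Eq. (4); also records the queried sets."""
    calls = []
    def query(a):
        calls.append(a)
        return sum(c for b, c in s_hat.items() if a & b == 0)
    return query, calls

def coefficients_on_support(query, candidates, n):
    """Recover s_hat on a known superset of the support with one query per candidate."""
    full = (1 << n) - 1
    coeffs = {}
    # Smaller sets first, so all proper subsets of b are solved before b.
    for b in sorted(candidates, key=lambda m: bin(m).count("1")):
        value = query(full & ~b)  # s(N \ B) = sum of s_hat(B') over B' subset of B
        value -= sum(c for b2, c in coeffs.items() if b2 & ~b == 0)
        coeffs[b] = value
    return coeffs

n = 5
true_coeffs = {0b00000: 2.0, 0b00110: -1.5, 0b10001: 0.5}
query, calls = make_oracle(true_coeffs)

# A superset of the true support: two extra candidates with zero coefficient.
candidates = [0b00000, 0b00110, 0b10001, 0b00010, 0b11111]
recovered = coefficients_on_support(query, candidates, n)
```

Candidates outside the true support come back with coefficient zero, so a loose superset costs extra queries but does not affect correctness.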

### 3.2 Sparse FT with Unknown Support

In the following we present our algorithm to solve Problem 1. As mentioned, the key challenge is to determine the Fourier support w.r.t. (3). The initial skeleton is similar to the algorithm Recover Coverage by Chakrabarty and Huang (2012), who used it to test coverage functions. Here we take the novel view of Fourier analysis to expand it to a sparse Fourier transform algorithm for all set functions. Doing so creates challenges since the connection to a positive weight function is lost (see (9)). Using the framework in Section 2, we analyze and address them.

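Let $M \subseteq N$, and consider the associated restriction of a set function $s$ on $N$:

$$s_{\downarrow 2^M}: 2^M \to \mathbb{R};\quad A \mapsto s(A). \tag{13}$$

The Fourier coefficients of $s$ and the restriction $s_{\downarrow 2^M}$ can be related as (proof in appendix):

$$\widehat{s_{\downarrow 2^M}}(B) = \sum_{A \subseteq N \setminus M} \hat{s}(A \cup B). \tag{14}$$

We observe that, if the Fourier coefficients on the right-hand side of (14) do not cancel, knowing $\widehat{s_{\downarrow 2^M}}$ reveals information about the sparsity of $\widehat{s_{\downarrow 2^{M \cup \{x\}}}}$, for $x \in N \setminus M$. To be precise, splitting the sum in (14) according to whether $A$ contains $x$ gives the relation

$$\widehat{s_{\downarrow 2^M}}(B) = \widehat{s_{\downarrow 2^{M \cup \{x\}}}}(B) + \widehat{s_{\downarrow 2^{M \cup \{x\}}}}(B \cup \{x\}), \tag{15}$$

which implies that $\widehat{s_{\downarrow 2^{M \cup \{x\}}}}(B)$ and $\widehat{s_{\downarrow 2^{M \cup \{x\}}}}(B \cup \{x\})$ both must be zero whenever $\widehat{s_{\downarrow 2^M}}(B)$ is zero, assuming Fourier coefficients do not cancel. As a consequence, we can construct

$$\mathcal{B} = \bigcup_{B \in \operatorname{supp}(\widehat{s_{\downarrow 2^M}})} \{B,\, B \cup \{x\}\}, \tag{16}$$

with $\operatorname{supp}(\widehat{s_{\downarrow 2^{M \cup \{x\}}}}) \subseteq \mathcal{B}$, from (15), and then compute $\widehat{s_{\downarrow 2^{M \cup \{x\}}}}$ with Theorem 1.

As a result we can solve Problem 1 with our algorithm SSFT, under mild conditions on the coefficients, by successively computing the non-zero Fourier coefficients of restricted set functions along the chain

$$\emptyset = M_0 \subseteq M_1 \subseteq \dots \subseteq M_n = N, \quad M_i = \{x_1, \dots, x_i\}. \tag{17}$$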

###### Remark 2 (Implementation of SSFT).

For practical reasons we only process up to a fixed maximum number of subsets (a hyperparameter) in line 6. In line 11, we consider a Fourier coefficient whose magnitude is below a threshold (a hyperparameter) as zero.
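The procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: subsets are bitmasks, the chain adds elements in index order, each restricted transform is obtained by querying the complement of every candidate within the current $M_i$ and solving the resulting triangular system, and coefficients with magnitude at most `eps` (a hypothetical name for the threshold) are pruned. A query cache shares evaluations between consecutive iterations.

```python
def popcount(mask):
    return bin(mask).count("1")

def ssft_sketch(query, n, eps=1e-9):
    """Grow M_i one element at a time and track the restricted Fourier support."""
    cache = {}
    def ask(a):
        if a not in cache:
            cache[a] = query(a)
        return cache[a]

    ground = 0                      # bitmask of M_0, the empty set
    coeffs = {0: ask(0)}            # transform of the restriction to M_0 is s(empty set)
    coeffs = {b: c for b, c in coeffs.items() if abs(c) > eps}
    for i in range(n):
        bit = 1 << i
        ground |= bit               # M_{i+1} = M_i plus element i
        # Candidate support from the previous support, as in Eq. (16).
        candidates = sorted({c for b in coeffs for c in (b, b | bit)}, key=popcount)
        new = {}
        for b in candidates:
            # s(M \ B) sums the restricted coefficients of all subsets of B.
            value = ask(ground & ~b)
            value -= sum(c for b2, c in new.items() if b2 & ~b == 0)
            new[b] = value
        coeffs = {b: c for b, c in new.items() if abs(c) > eps}
    return coeffs, len(cache)

# Hypothetical 4-sparse test function on n = 6 elements, evaluated via Eq. (4).
true_coeffs = {0b000000: 1.0, 0b000101: -2.5, 0b110000: 0.75, 0b111111: 4.0}
def query(a):
    return sum(c for b, c in true_coeffs.items() if a & b == 0)

recovered, num_queries = ssft_sketch(query, 6)
```

The sketch returns an empty support whenever a restricted coefficient cancels to zero along the chain, for example when $s(\emptyset) = 0$, which is exactly the failure mode analyzed below. The test coefficients were chosen so that no partial sum of them vanishes.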

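Analysis. We consider set functions that are $k$-Fourier-sparse (but not $(k-1)$-Fourier-sparse) with support $\operatorname{supp}(\hat{s}) = \{B_1, \dots, B_k\}$. This class of set functions is isomorphic to

$$\mathcal{S} = \{\hat{s} \in \mathbb{R}^k : \hat{s}_i \neq 0 \text{ for all } i \in \{1, \dots, k\}\}. \tag{18}$$

Let $\lambda$ denote the Lebesgue measure on $\mathbb{R}^k$. Let $P^{M}_{C} = \{B \in \operatorname{supp}(\hat{s}) : B \cap M = C\}$.

Pathological set functions. SSFT fails to compute the Fourier coefficients for which $\widehat{s_{\downarrow 2^{M_i}}}(C) = 0$ despite $P^{M_i}_{C} \neq \emptyset$. Thus, the set of pathological set functions can be written as the finite union of kernels

$$\mathcal{K}_1(M_i, C) = \{\hat{s} \in \mathbb{R}^k : \widehat{s_{\downarrow 2^{M_i}}}(C) = 0\} \tag{19}$$

intersected with $\mathcal{S}$.

###### Theorem 2.

Using prior notation, the set of pathological set functions for SSFT is given by

$$\mathcal{D}_1 = \bigcup_{i=0}^{n-1}\; \bigcup_{C \subseteq M_i:\, P^{M_i}_{C} \neq \emptyset} \mathcal{K}_1(M_i, C) \cap \mathcal{S}, \tag{20}$$

and has Lebesgue measure zero, i.e., $\lambda(\mathcal{D}_1) = 0$.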

Complexity. By reusing queries and computations from the $i$-th iteration of SSFT in the $(i+1)$-th iteration, we obtain:

###### Theorem 3.

SSFT requires at most queries and operations.

### 3.3 Shrinking the Set of Pathological Fourier Coefficients

According to Theorem 2, the set of pathological Fourier coefficients for a given support has measure zero. Unfortunately, this set includes important classes of set functions, including graph cuts (in the case of unit weights) and hypergraph cuts. (Footnote 1: As an example, consider the cut function associated with the graph , and , using .)

Solution. The key idea to exclude these and further narrow down the set of pathological cases is to use the convolution theorem (5), i.e., the fact that we can modulate Fourier coefficients by filtering. Concretely, we choose a random filter $h$ such that SSFT works for $h * s$ with probability one. $\hat{s}$ is then obtained from $\widehat{h * s}$ by dividing by the frequency response $\bar{h}$. We keep the associated overhead in $O(n)$ by choosing a one-hop filter, i.e., $h(B) = 0$ for $|B| > 1$. Motivated by the fact that, e.g., the product of a Rademacher random variable (which would lead to cancellations) and a normally distributed random variable is again normally distributed, we sample our filtering coefficients i.i.d. from a normal distribution. We call the resulting algorithm SSFT+, shown above.
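The effect of a one-hop filter can be illustrated on the smallest graph cut. The Python sketch below uses a hypothetical example (two vertices joined by one unit-weight edge) and fixed filter coefficients instead of Gaussian samples so that the arithmetic is exact. For brevity it uses the dense transform rather than SSFT. It only shows that filtering removes the cancellation at $s(\emptyset)$ and that dividing by the frequency response (6) restores $\hat{s}$; it is not the SSFT+ listing.

```python
def fourier_model4(s, n):
    """Model-4 transform: factor ((0, 1), (1, -1)) along every bit."""
    out = list(s)
    for i in range(n):
        bit = 1 << i
        for a in range(1 << n):
            if not a & bit:
                lo, hi = out[a], out[a | bit]
                out[a], out[a | bit] = hi, lo - hi
    return out

def one_hop_filter(s, h_empty, h_single, n):
    """(h * s)(A) for a filter supported on the empty set and the singletons."""
    return [h_empty * s[a] + sum(h_single[i] * s[a | (1 << i)] for i in range(n))
            for a in range(1 << n)]

def frequency_response(h_empty, h_single, n):
    """h_bar(B) = h(empty set) + sum of h({x}) over all x not in B, Eq. (6)."""
    return [h_empty + sum(h_single[i] for i in range(n) if not (b >> i) & 1)
            for b in range(1 << n)]

n = 2
cut = [0.0, 1.0, 1.0, 0.0]            # cut function of a single edge {0, 1}
h_empty, h_single = 1.0, [0.5, 0.25]  # fixed stand-ins for the random coefficients

filtered = one_hop_filter(cut, h_empty, h_single, n)
h_bar = frequency_response(h_empty, h_single, n)
recovered = [t / r for t, r in zip(fourier_model4(filtered, n), h_bar)]
```

Here `cut[0]` is zero although the cut function has three non-zero Fourier coefficients, which is the kind of cancellation that makes plain SSFT stop at the first step, whereas `filtered[0]` is non-zero. Each filtered evaluation costs up to $n + 1$ evaluations of the original function.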

Analysis. Building on the analysis of SSFT, recall that $\mathcal{S}$ denotes the set of $k$-Fourier-sparse (but not $(k-1)$-Fourier-sparse) set functions and $P^{M}_{C}$ are the elements $B \in \operatorname{supp}(\hat{s})$ satisfying $B \cap M = C$. Let

$$\mathcal{K}_2(M_i, C) = \{\hat{s} \in \mathbb{R}^k : \widehat{s_{\downarrow 2^{M_i}}}(C) = 0 \text{ and } \widehat{s_{\downarrow 2^{M_i \cup \{x_j\}}}}(C) = 0 \text{ for all } j \in \{i+1, \dots, n\}\}. \tag{21}$$

###### Theorem 4.

With probability one with respect to the randomness of the filtering coefficients, the set of pathological set functions for SSFT+ has the form (using prior notation)

$$\mathcal{D}_2 = \bigcup_{i=0}^{n-2}\; \bigcup_{C \subseteq M_i:\, P^{M_i}_{C} \neq \emptyset} \mathcal{K}_2(M_i, C) \cap \mathcal{S}. \tag{22}$$

Theorem 4 shows that SSFT+ correctly processes $\widehat{s_{\downarrow 2^{M_i}}}(C) = 0$ with $P^{M_i}_{C} \neq \emptyset$, iff there is an element $x \in \{x_{i+1}, \dots, x_n\}$ for which $\widehat{s_{\downarrow 2^{M_i \cup \{x\}}}}(C) \neq 0$.

###### Theorem 5.

If is non-empty, is a proper subset of . In particular, implies , for all with .

Complexity. There is a trade-off between the amount of non-zero filtering coefficients used and the size of the set of pathological set functions. For example, for the one-hop filters used, computing requires queries.

###### Theorem 6.

The query complexity of SSFT+ is and the algorithmic complexity is .

## 4 Related Work

We briefly discuss related work on learning set functions.

Fourier-sparse learning. There is a substantial body of research concerned with learning Fourier/WHT-sparse set functions Stobbe and Krause (2012); Scheibler et al. (2013); Kocaoglu et al. (2014); Li and Ramchandran (2015); Cheraghchi and Indyk (2017); Amrollahi et al. (2019). Recently, Amrollahi et al. (2019) have imported ideas from the hashing based sparse Fourier transform algorithm Hassanieh et al. (2012) to the set function setting. The resulting algorithms compute the WHT of $k$-WHT-sparse set functions with a query complexity for general frequencies, for low degree () frequencies and for low degree set functions that are only approximately sparse. To the best of our knowledge this latest work improves on previous algorithms, such as the ones by Scheibler et al. (2013), Kocaoglu et al. (2014), Li and Ramchandran (2015), and Cheraghchi and Indyk (2017), providing the best guarantees in terms of both query complexity and runtime. E.g., Scheibler et al. (2013) utilize similar ideas like hashing/aliasing to derive sparse WHT algorithms that work under random support (the frequencies are uniformly distributed on ) and random coefficient (the coefficients are samples from continuous distributions) assumptions. Kocaoglu et al. (2014) propose a method to compute the WHT of a $k$-Fourier-sparse set function that satisfies a so-called unique sign property using queries polynomial in and .

In a different line of work, Stobbe and Krause (2012) utilize results from compressive sensing to compute the WHT of $k$-WHT-sparse set functions, for which a superset of the support is known. This approach can also be used to find a $k$-Fourier-sparse approximation and has a theoretical query complexity of . In practice, it even seems to be more query-efficient than the hashing based WHT (see the experimental section of Amrollahi et al. (2019)), but suffers from high computational complexity, which scales at least linearly with . Regarding coverage functions, to our knowledge, there has not been any work in the compressive sensing literature for the non-orthogonal Fourier bases, which do not satisfy RIP properties and hence lack sparse recovery and robustness guarantees.

In summary, all prior work on Fourier-based methods for learning set functions was based on the WHT. Our work leverages the broader framework of signal processing with set functions proposed by Püschel and Wendler (2020), which provides a larger class of Fourier transforms and thus new types of Fourier-sparsity.

Other learning paradigms. Other lines of work for learning set functions include methods based on new neural architectures Dolhansky and Bilmes (2016); Zaheer et al. (2017); Weiss et al. (2017), methods based on backpropagation through combinatorial solvers Djolonga and Krause (2017); Tschiatschek et al. (2018); Wang et al. (2019); Vlastelica et al. (2019), kernel based methods Buathong et al. (2020), and methods based on other succinct representations such as decision trees Feldman et al. (2013) and disjunctive normal forms Raskhodnikova and Yaroslavtsev (2013).

## 5 Empirical Evaluation

We evaluate the two variants of our algorithm (SSFT and SSFT+) for model 4 on three classes of real-world set functions. First, we approximate the objective functions of sensor placement tasks by Fourier-sparse functions and evaluate the quality of the resulting surrogate objective functions. Second, we learn facility locations functions (i.e., preference functions) that are used to determine cost-effective sensor placements in water networks Leskovec et al. (2007). Finally, we learn simulated bidders from a spectrum auctions test suite Weiss et al. (2017).

Benchmark learning algorithms. We compare our algorithm against three state-of-the-art algorithms for learning WHT-sparse set functions: the compressive sensing based approach CS-WHT Stobbe and Krause (2012), the hashing based approach H-WHT Amrollahi et al. (2019), and the robust version of the hashing based approach R-WHT Amrollahi et al. (2019). For our algorithm we set and . CS-WHT requires a superset of the (unknown) Fourier support, which we set to all with and the parameter for expected sparsity to . For H-WHT we used the exact algorithm without low-degree assumption and set the expected sparsity parameter to . For R-WHT we used the robust algorithm without low-degree assumption and set the expected sparsity parameter to unless specified otherwise.

### 5.1 Sensor Placement Tasks

We consider a discrete set of sensors located at different fixed positions measuring a quantity of interest, e.g., temperature, amount of rainfall, or traffic data, and want to find an informative subset of sensors subject to a budget constraint on the number of sensors selected (e.g., due to hardware costs). To quantify the informativeness of subsets of sensors, we fit a multivariate normal distribution to the sensor measurements Krause et al. (2008) and associate each subset of sensors $A$ with its information gain Srinivas et al. (2010)

$$G(A) = \frac{1}{2} \log \left| I_{|A|} + \sigma^{-2} (K_{ij})_{i,j \in A} \right|, \tag{23}$$

where $(K_{ij})_{i,j \in A}$ is the submatrix of the covariance matrix $K$ that is indexed by the sensors $A$ and $I_{|A|}$ the $|A| \times |A|$ identity matrix. We construct two covariance matrices this way for temperature measurements from 46 sensors at Intel Research Berkeley and for velocity data from 357 sensors deployed under a highway in California.

The information gain is a submodular set function and, thus, can be approximately maximized using the greedy algorithm by Nemhauser et al. (1978): to obtain informative subsets. We do the same using Fourier-sparse surrogates of : and compute . As a baseline we place sensors at random and compute . Figure 2 shows our results. The x-axes correspond to the cardinality constraint used during maximization and the y-axes to the information gain obtained by the respective informative subsets. In addition, we report next to the legend the execution time and number of queries needed by the successful experiments.
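The information gain (23) and its greedy maximization under a cardinality constraint can be sketched as follows. The $3 \times 3$ covariance matrix, the noise level, and the function names are made up for illustration; the log-determinant is computed with a plain Cholesky factorization to keep the sketch free of dependencies.

```python
import math

def log_det_spd(m):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    size = len(m)
    low = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            acc = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(size))

def information_gain(cov, subset, sigma):
    """G(A) = 0.5 * log det(I + sigma^-2 * K_A), Eq. (23)."""
    idx = sorted(subset)
    m = [[(1.0 if a == b else 0.0) + cov[a][b] / sigma ** 2 for b in idx]
         for a in idx]
    return 0.5 * log_det_spd(m)

def greedy_maximize(f, n, budget):
    """Greedy selection: repeatedly add the element with the largest value."""
    chosen = []
    for _ in range(budget):
        rest = [i for i in range(n) if i not in chosen]
        chosen.append(max(rest, key=lambda i: f(chosen + [i])))
    return chosen

# Hypothetical covariance: sensors 0 and 1 are strongly correlated.
cov = [[1.0, 0.9, 0.0],
       [0.9, 1.0, 0.0],
       [0.0, 0.0, 0.8]]
placement = greedy_maximize(lambda a: information_gain(cov, a, 1.0), 3, 2)
```

On this toy matrix the greedy rule picks sensor 0 first and then prefers the uncorrelated sensor 2 over the redundant sensor 1. A learnt surrogate would be plugged in by passing it as `f` in place of the true objective.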

Interpretation of results. H-WHT only works for the Berkeley data. For the other dataset it is not able to reconstruct enough Fourier coefficients to provide a meaningful result. The likely reason is that the target set function is not exactly Fourier-sparse, which can cause an excessive amount of collisions in the hashing step. In contrast, CS-WHT is noise-robust and yields sensor placements that are indistinguishable from the ones obtained by maximizing the true objective function in the first task. However, for the California data, CS-WHT times out. In contrast, SSFT and R-WHT work well on both tasks. In the first task, SSFT is on par with CS-WHT in terms of sensor placement quality and significantly faster despite requiring more queries. On the California data, SSFT yields sensor placements of similar quality as the ones obtained by R-WHT while requiring orders of magnitude fewer queries and time.

### 5.2 Learning Preference Functions

We now consider a class of preference functions that are used for the cost-effective contamination detection in water networks Leskovec et al. (2007). The networks stem from the Battle of Water Sensor Networks (BWSN) challenge Ostfeld et al. (2008). The junctions and pipes of each BWSN network define a graph. Additionally, each BWSN network has dynamic parameters such as time-varying water consumption demand patterns, opening and closing valves, and so on.

To determine a cost-effective subset of sensors (e.g., given a maximum budget), Leskovec et al. (2007) make use of facility locations functions of the form

$$p: 2^N \to \mathbb{R};\quad A \mapsto \sum_{\ell=1}^{L} \max_{i \in A} r_{\ell i}, \tag{24}$$

where $r$ is an $L \times n$ matrix. Each row corresponds to an event (e.g., contamination of the water network at any junction) and the entry $r_{\ell i}$ quantifies the utility of the $i$-th sensor in case of the $\ell$-th event. It is straightforward to see that (24) is a preference function of the form (8) with $a = 0$ and $u_i = \sum_{\ell=1}^{L} r_{\ell i}$. Thus, such functions are sparse w.r.t. model 4 and dense w.r.t. WHT (see Lemma 1 and Remark 1).
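A small numerical example illustrates the sparsity claim for (24). The Python sketch below uses a made-up $2 \times 3$ utility matrix, takes the maximum over the empty set to be zero (an assumption of the sketch), and compares the number of non-zero coefficients w.r.t. model 4 and w.r.t. the unnormalized WHT. All names are hypothetical.

```python
def facility_location(r, n):
    """p(A) = sum over events of the best utility among sensors in A, Eq. (24)."""
    def value(a):
        members = [i for i in range(n) if (a >> i) & 1]
        return sum(max((row[i] for i in members), default=0) for row in r)
    return [value(a) for a in range(1 << n)]

def butterfly(s, n, top, bottom):
    """Apply the 2x2 factor with rows `top` and `bottom` along every bit."""
    out = list(s)
    for i in range(n):
        bit = 1 << i
        for a in range(1 << n):
            if not a & bit:
                lo, hi = out[a], out[a | bit]
                out[a] = top[0] * lo + top[1] * hi
                out[a | bit] = bottom[0] * lo + bottom[1] * hi
    return out

# Hypothetical utilities: rows are events, columns are sensors.
r = [[3, 1, 2],
     [1, 2, 2]]
n = 3
p = facility_location(r, n)

model4 = butterfly(p, n, (0, 1), (1, -1))   # model-4 factor from Eq. (7)
wht = butterfly(p, n, (1, 1), (1, -1))      # Walsh-Hadamard factor, no scaling
model4_support = {b: c for b, c in enumerate(model4) if c != 0}
wht_nonzeros = sum(1 for c in wht if c != 0)
```

On this instance five model-4 coefficients are non-zero, one for the empty set and one per distinct set of sensors reaching a utility level, while all eight WHT coefficients are non-zero.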

Leskovec et al. (2007) determined three different utility matrices that take into account the fraction of events detected, the detection time, and the population affected, respectively. The matrices were obtained by costly simulating millions of possible contamination events in a 48-hour timeframe. For our experiments we select one of the utility matrices and obtain subnetworks by selecting the columns that provide the maximum utility, i.e., we select the columns with the largest .

In Table 3 we compare the sparsity of the corresponding facility locations function in model 4 against its sparsity in the WHT. For , we compute the full WHT and select the largest coefficients. For , we compute the largest WHT coefficients using R-WHT. The model 4 coefficients are always computed using SSFT. If the facility locations function is sparse w.r.t. model 4 for some , we set the expected sparsity parameter of R-WHT to different multiples up to the first for which the algorithm runs out of memory. We report the number of queries, time, number of Fourier coefficients , and relative reconstruction error. For R-WHT experiments that require less than one hour we report average results over 10 runs (indicated by italic font). For , the relative error cannot be computed exactly and thus is obtained by sampling 100,000 sets uniformly at random and computing , where denotes the real facility locations function and the estimate.

Interpretation of results. The considered facility locations functions are indeed sparse w.r.t. model 4 and dense w.r.t. the WHT. As expected, SSFT outperforms R-WHT in this scenario, which can be seen by the lower number of queries, reduced time, and an error of exactly zero for the SSFT. This experiment shows certain classes of set functions of practical relevance are better represented in the model 4 basis than in the WHT basis.

### 5.3 Preference Elicitation in Auctions

In combinatorial auctions a set of goods is auctioned to a set of bidders. Each bidder is modeled as a set function that maps each bundle of goods to its subjective value for this bidder. The problem of learning bidder valuation functions from queries is known as the preference elicitation problem Brero et al. (2019). Our experiment sketches an approach under the assumption of Fourier sparsity.

As is common in this field Weissteiner et al. (2020a, b), we resort to simulated bidders. Specifically, we use the multi-region valuation model (MRVM) from the spectrum auctions test suite Weiss et al. (2017). In MRVM, 98 goods are auctioned off to 10 bidders of different types (3 local, 4 regional, and 3 national). We learn these bidders using the prior Fourier-sparse learning algorithms, this time including SSFT+, but excluding CS-WHT, since a superset $\mathcal{B}$ of the Fourier support is not known in this scenario. Table 2 shows the results: means and standard deviations of the number of queries, number of Fourier coefficients, and relative error (estimated using 10,000 samples) taken over the bidder types and 25 runs.

Interpretation of results. First, we note that SSFT+ can indeed improve over SSFT for set functions that are relevant in practice. Namely, SSFT+ consistently learns sparse representations for local and regional bidders, while SSFT fails. H-WHT also achieves perfect reconstruction for local and regional bidders. For the remaining bidders none of the methods achieves perfect reconstruction, which indicates that those bidders do not admit a sparse representation. Second, we observe that, for the local and regional bidders, in the non-orthogonal model 4 basis only half as many coefficients are required as in the WHT basis. Third, SSFT+ requires fewer queries than H-WHT in the Fourier-sparse cases.

## 6 Conclusion

We introduced an algorithm for learning set functions that are sparse with respect to various generalized, non-orthogonal Fourier bases. In doing so, our work significantly expands the set of efficiently learnable set functions. As we explained, the new notions of sparsity connect well with preference functions in recommender systems, which we consider an exciting avenue for future research.

## Ethical Statement

Our approach is motivated by a range of real world applications, including modeling preferences in recommender systems and combinatorial auctions, that require the modeling, processing, and analysis of set functions, which is notoriously difficult due to their exponential size. Our work adds to the tool set that makes working with set functions computationally tractable. Since the work is of foundational and algorithmic nature we do not see any immediate ethical concerns. In case that the models estimated with our algorithms are used for making decisions (such as recommendations, or allocations in combinatorial auctions), of course additional care has to be taken to ensure that ethical requirements such as fairness are met. These questions are complementary to our work.

## References

• A. Amrollahi, A. Zandieh, M. Kapralov, and A. Krause (2019) Efficiently Learning Fourier Sparse Set Functions. In Advances in Neural Information Processing Systems, pp. 15094–15103. Cited by: Appendix E, §1, §1, §4, §4, §5.
• K. Balog, F. Radlinski, and S. Arakelyan (2019) Transparent, Scrutable and Explainable User Models for Personalized Recommendation. In Proc. Conference on Research and Development in Information Retrieval (ACM SIGIR), pp. 265–274. Cited by: §1.
• A. Bernasconi, B. Codenotti, and J. Simon (1996) On the Fourier analysis of Boolean functions. preprint, pp. 1–24. Cited by: Table 4, §1.
• A. Björklund, T. Husfeldt, P. Kaski, and M. Koivisto (2007) Fourier meets Möbius: fast subset convolution. In Proc. ACM Symposium on Theory of Computing, pp. 67–74. Cited by: Table 4.
• G. Brero, B. Lubin, and S. Seuken (2019) Machine Learning-powered Iterative Combinatorial Auctions. Note: arXiv preprint arXiv:1911.08042 Cited by: §1, §5.3.
• P. Buathong, D. Ginsbourger, and T. Krityakierne (2020) Kernels over Sets of Finite Sets using RKHS Embeddings, with Application to Bayesian (Combinatorial) Optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2731–2741. Cited by: §4.
• D. Chakrabarty and Z. Huang (2012) Testing coverage functions. In International Colloquium on Automata, Languages, and Programming, pp. 170–181. Cited by: Table 4, Appendix A, §2, §3.2.
• M. Cheraghchi and P. Indyk (2017) Nearly optimal deterministic algorithm for sparse Walsh-Hadamard transform. ACM Transactions on Algorithms (TALG) 13 (3), pp. 1–36. Cited by: §1, §4.
• R. De Wolf (2008) A brief introduction to Fourier analysis on the Boolean cube. Theory of Computing, pp. 1–20. Cited by: §1.
• J. Djolonga and A. Krause (2017) Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pp. 1013–1023. Cited by: §4.
• J. Djolonga, S. Tschiatschek, and A. Krause (2016) Variational inference in mixed probabilistic submodular models. In Advances in Neural Information Processing Systems, pp. 1759–1767. Cited by: Appendix A, §1, §2.
• B. W. Dolhansky and J. A. Bilmes (2016) Deep Submodular Functions: Definitions and Learning. In Advances in Neural Information Processing Systems, pp. 3404–3412. Cited by: §4.
• V. Feldman, P. Kothari, and J. Vondrák (2013) Representation, approximation and learning of submodular functions using low-rank decision trees. In Conference on Learning Theory, pp. 711–740. Cited by: §4.
• H. Hassanieh, P. Indyk, D. Katabi, and E. Price (2012) Nearly Optimal Sparse Fourier Transform. In Proc. ACM Symposium on Theory of Computing, pp. 563–578. Cited by: §4.
• A. Khare (2009) Vector spaces as unions of proper subspaces. Linear algebra and its applications 431 (9), pp. 1681–1686. Cited by: §C.2.
• M. Kocaoglu, K. Shanmugam, A. G. Dimakis, and A. Klivans (2014) Sparse polynomial learning and graph sketching. In Advances in Neural Information Processing Systems, pp. 3122–3130. Cited by: §4.
• A. Krause and D. Golovin (2014) Submodular function maximization. Cited by: §2.
• A. Krause, A. Singh, and C. Guestrin (2008) Near-optimal Sensor Placements in Gaussian processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research 9, pp. 235–284. Cited by: §1, §5.1.
• J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance (2007) Cost-effective outbreak detection in networks. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429. Cited by: §5.2, §5.2, §5.2, §5.
• X. Li and K. Ramchandran (2015) An active learning framework using sparse-graph codes for sparse polynomials and graph sketching. In Advances in Neural Information Processing Systems, pp. 2170–2178. Cited by: §1, §4.
• G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher (1978) An analysis of approximations for maximizing submodular set functions — I. Mathematical programming 14 (1), pp. 265–294. Cited by: §5.1.
• A. Ostfeld, J. G. Uber, E. Salomons, J. W. Berry, W. E. Hart, C. A. Phillips, J.-P. Watson, G. Dorini, P. Jonkergouw, Z. Kapelan, et al. (2008) The battle of the water sensor networks (BWSN): a design challenge for engineers and algorithms. Journal of Water Resources Planning and Management 134 (6), pp. 556–568. Cited by: §5.2.
• M. Püschel and J. M.F. Moura (2008) Algebraic signal processing theory: Foundation and 1-D time. IEEE Trans. on Signal Processing 56 (8), pp. 3572–3585. Cited by: §2.
• M. Püschel and J.M.F. Moura (2006) Algebraic Signal Processing Theory. Note: arXiv preprint arXiv:cs/0612077v1 Cited by: §2.
• M. Püschel and C. Wendler (2020) Discrete signal processing with set functions. Note: arXiv preprint arXiv:2001.10290 Cited by: Appendix D, Appendix E, Appendix E, Appendix E, Appendix E, item 1, §1, §2, §2, §4, Theorem 1.
• M. Püschel (2018) A discrete signal processing framework for set functions. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4359–4363. Cited by: item 1, §1, §1, §2.
• S. Raskhodnikova and G. Yaroslavtsev (2013) Learning pseudo-Boolean k-DNF and Submodular Functions. In Proc. ACM-SIAM Symposium on Discrete Algorithms, pp. 1356–1368. Cited by: §4.
• R. Scheibler, S. Haghighatshoar, and M. Vetterli (2013) A Fast Hadamard Transform for Signals with Sub-linear Sparsity. In Proc. Annual Allerton Conference on Communication, Control, and Computing, pp. 1250–1257. Cited by: Appendix E, §4.
• M. Sharma, F. M. Harper, and G. Karypis (2019) Learning from sets of items in recommender systems. ACM Trans. on Interactive Intelligent Systems (TiiS) 9 (4), pp. 1–26. Cited by: §1.
• N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. International Conference on Machine Learning (ICML), pp. 1015–1022. Cited by: §5.1.
• P. Stobbe and A. Krause (2012) Learning Fourier Sparse Set Functions. In Artificial Intelligence and Statistics, pp. 1125–1133. Cited by: Appendix E, item 3, §1, §4, §4, §5.
• S. Tschiatschek, A. Sahin, and A. Krause (2018) Differentiable submodular maximization. In Proc. International Joint Conference on Artificial Intelligence, pp. 2731–2738. Cited by: §4.
• M. Vlastelica, A. Paulus, V. Musil, G. Martius, and M. Rolínek (2019) Differentiation of Blackbox Combinatorial Solvers. arXiv preprint arXiv:1912.02175. Cited by: §4.
• P. Wang, P. L. Donti, B. Wilder, and Z. Kolter (2019) SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149. Cited by: §4.
• M. Weiss, B. Lubin, and S. Seuken (2017) SATS: a universal spectrum auction test suite. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 51–59. Cited by: §4, §5.3, §5.
• J. Weissteiner, S. Ionescu, N. Olberg, and S. Seuken (2020a) Deep Learning-powered Iterative Combinatorial Auctions. In 34th AAAI Conference on Artificial Intelligence, Cited by: §5.3.
• J. Weissteiner, C. Wendler, S. Seuken, B. Lubin, and M. Püschel (2020b) Fourier Analysis-based Iterative Combinatorial Auctions. Note: arXiv preprint arXiv:2009.10749 Cited by: §1, §5.3.
• C. Wendler and M. Püschel (2019) Sampling signals on meet/join lattices. In Proc. Global Conference on Signal and Information Processing (GlobalSIP), Cited by: Appendix E, Appendix E.
• M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §4.

## Appendix A Preference Functions

Let $N = \{x_1, \dots, x_n\}$ denote our ground set. For this section, we assume $x_i = i$, i.e., $N = \{1, \dots, n\}$.

An important aspect of our work is that certain set functions are sparse in one basis but not in the others. In this section we show that preference functions Djolonga et al. (2016) indeed constitute a class of set functions that are sparse w.r.t. model 4 (see Table 4, which we replicate from the paper for convenience) and dense w.r.t. model 5 (= WHT basis). Preference functions naturally occur in machine learning tasks on discrete domains such as recommender systems and auctions, where they are used, e.g., to model complementarity and substitution effects between goods. Goods complement each other when their combined utility is greater than the sum of their individual utilities. E.g., a pair of shoes is more useful than the two shoes individually, and a round trip has higher utility than the combined individual utilities of the outbound and return flights. Analogously, goods substitute each other when their combined utility is smaller than the sum of their individual utilities. E.g., it might not be necessary to buy a pair of glasses if you already have one. Formally, a preference function is given by

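$$p: 2^N \to \mathbb{R};\quad A \mapsto \sum_{i \in A} u_i + \sum_{\ell=1}^{L} \Big( \max_{i \in A} r_{\ell i} - \sum_{i \in A} r_{\ell i} \Big) - \sum_{k=1}^{K} \Big( \max_{i \in A} a_{ki} - \sum_{i \in A} a_{ki} \Big). \tag{25}$$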

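Equation (25) is composed of a modular part parametrized by $u \in \mathbb{R}^n$, a repulsive part parametrized by $r \in \mathbb{R}_{\geq 0}^{L \times n}$, with $L \in \mathbb{N}$, and an attractive part parametrized by $a \in \mathbb{R}_{\geq 0}^{K \times n}$, with $K \in \mathbb{N}$.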
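To make the three parts of Equation (25) concrete, the following is a minimal sketch that evaluates a preference function with NumPy. The function name `preference`, the 0-based item indices, and the convention that the maximum over the empty set is 0 are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def preference(A, u, r, a):
    """Evaluate p(A) as in Eq. (25), using 0-based item indices.

    u: shape (n,)    modular weights
    r: shape (L, n)  nonnegative repulsive parameters
    a: shape (K, n)  nonnegative attractive parameters
    """
    idx = sorted(A)
    if not idx:
        # Assumption: empty sums and the max over the empty set are 0.
        return 0.0
    modular = u[idx].sum()
    repulsive = (r[:, idx].max(axis=1) - r[:, idx].sum(axis=1)).sum()
    attractive = (a[:, idx].max(axis=1) - a[:, idx].sum(axis=1)).sum()
    return float(modular + repulsive - attractive)

# Toy parameters (illustrative only): three items, L = 1, K = 1.
u = np.array([1.0, 2.0, 3.0])
r = np.array([[1.0, 2.0, 0.0]])   # items 0 and 1 repel each other
a = np.array([[0.0, 1.0, 1.0]])   # items 1 and 2 attract each other

print(preference({0, 1}, u, r, a))  # 2.0, below p({0}) + p({1}) = 3.0
print(preference({1, 2}, u, r, a))  # 6.0, above p({1}) + p({2}) = 5.0
```

In this toy instance, items 0 and 1 behave as substitutes and items 1 and 2 as complements, which is the behavior the repulsive and attractive parts are meant to capture.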

###### Lemma 2.

Preference functions of the form (25) are Fourier-sparse w.r.t. model 4.

###### Proof.

In order to prove that preference functions are sparse w.r.t. model 4, we exploit the linearity of the Fourier transform. That is, we show that $p$ is Fourier-sparse by showing that it is a sum of Fourier-sparse set functions. In particular, there are only two types of summands (= set functions):

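First, $A \mapsto \sum_{i \in A} u_i$, $A \mapsto -\sum_{i \in A} r_{\ell i}$, for $\ell \in \{1, \dots, L\}$, and $A \mapsto \sum_{i \in A} a_{ki}$, for $k \in \{1, \dots, K\}$, are modular set functions whose only non-zero Fourier coefficients are summed up in $\hat{p}(\{i\})$ for $i \in N$ and $\hat{p}(\emptyset)$.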

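Second, $A \mapsto \max_{i \in A} r_{\ell i}$, for $\ell \in \{1, \dots, L\}$, and $A \mapsto -\max_{i \in A} a_{ki}$, for $k \in \{1, \dots, K\}$, are weighted and negative weighted coverage functions, respectively. In order to see that $A \mapsto \max_{i \in A} r_{\ell i}$ is a weighted coverage function, observe that the codomain of $r_\ell$ is $\mathbb{R}_{\geq 0}$. Let $\sigma$ denote the permutation that sorts $r_\ell$, i.e., $r_{\ell \sigma(1)} \leq r_{\ell \sigma(2)} \leq \dots \leq r_{\ell \sigma(n)}$. Let $U = \{1, \dots, n\}$ denote the universe. We set $w(u) = r_{\ell \sigma(1)}$ for $u = 1$ and $w(u) = r_{\ell \sigma(u)} - r_{\ell \sigma(u-1)}$ for $u \in \{2, \dots, n\}$. Let $w(S) = \sum_{u \in S} w(u)$ for $S \subseteq U$. Let the set $S_{\sigma(i)} = \{1, \dots, i\}$. Notice that $w(S_{\sigma(i)}) = r_{\ell \sigma(i)}$, and, because of $S_{\sigma(1)} \subseteq S_{\sigma(2)} \subseteq \dots \subseteq S_{\sigma(n)}$ we have, for all non-empty $A \subseteq N$,

$$w\Big(\bigcup_{i \in A} S_i\Big) = w(S_j) = r_{\ell j}, \tag{26}$$

where $j$ is the element in $A$ that satisfies $S_i \subseteq S_j$ for all $i \in A$. Now, observe that $S_i \subseteq S_j$ implies $r_{\ell i} \leq r_{\ell j}$, for all $i \in A$. Thus, by definition of $j$ we have $r_{\ell j} = \max_{i \in A} r_{\ell i}$.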

The same construction works for $A \mapsto \max_{i \in A} a_{ki}$. Weighted coverage functions with $|U| = n$ are $n$-Fourier-sparse with respect to the W-transform Chakrabarty and Huang (2012) and $(n+1)$-Fourier-sparse with respect to model 4 (one additional coefficient for $\emptyset$). The preference function $p$ is a sum of $1 + L + K$ modular set functions, $L$ sparse weighted coverage functions that require at most $n$ additional Fourier coefficients (with $|B| \geq 2$) each and $K$ sparse negative weighted coverage functions that require at most $n$ additional Fourier coefficients each. Therefore, $p$ has at most $1 + n + Ln + Kn$ non-zero Fourier coefficients w.r.t. model 4. ∎
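The coverage construction used in the proof can be checked numerically. The sketch below builds one possible nested family of sets and nonnegative weights from a score vector (sorted scores, with weights given by consecutive increments) and verifies that the weight of the union equals the maximum score. The helper names `coverage_from_scores` and `coverage_value`, the 0-based indices, and this particular choice of sets and weights are assumptions for illustration, not necessarily the authors' exact construction.

```python
import numpy as np
from itertools import combinations

def coverage_from_scores(scores):
    """Build weights w on a universe {0, ..., n-1} and sets S_i such that
    w(union of S_i for i in A) == max(scores[i] for i in A)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)                  # order[t]: item with the t-th smallest score
    # Increments between consecutive sorted scores; nonnegative if scores >= 0.
    weights = np.diff(scores[order], prepend=0.0)
    # Nested sets: the item at sorted position t covers universe elements 0..t.
    sets = {int(item): set(range(t + 1)) for t, item in enumerate(order)}
    return weights, sets

def coverage_value(A, weights, sets):
    """Weighted coverage function: total weight of all covered universe elements."""
    covered = set().union(*(sets[i] for i in A))
    return float(sum(weights[e] for e in covered))

scores = [0.5, 2.0, 1.25]                       # plays the role of one row of r
weights, sets = coverage_from_scores(scores)

# Check the identity on every non-empty subset of items.
for size in range(1, len(scores) + 1):
    for A in combinations(range(len(scores)), size):
        assert np.isclose(coverage_value(A, weights, sets), max(scores[i] for i in A))
```

Because the sets are nested, the union over any non-empty subset is the largest set in it, which is why the coverage value collapses to a single score.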

###### Remark 3.

The construction in the second part of the proof of Lemma 2 shows that preference functions with are dense w.r.t. the WHT basis, because there is an element in $U$ that is covered by all $S_i$.
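Lemma 2 and Remark 3 can be illustrated by brute force on a small ground set. The sketch below assumes the convention that the inverse model 4 transform is $s(A) = \sum_{B: A \cap B = \emptyset} \hat{s}(B)$, which is not restated in this appendix, so both that convention and the resulting $2 \times 2$ Kronecker factors should be read as assumptions. Under this convention, the count $1 + n + (L + K)n$ used as `bound` is a simple upper bound (one coefficient for the empty set, $n$ singletons, and at most $n$ further sets per max term). The random parameters and all function names are illustrative.

```python
import numpy as np

def preference_values(u, r, a):
    """Table of p(A) for all A; index m encodes A via bit i <=> item i in A."""
    n = len(u)
    vals = np.zeros(2 ** n)          # p(empty set) = 0 by convention
    for m in range(1, 2 ** n):
        idx = [i for i in range(n) if (m >> i) & 1]
        vals[m] = (u[idx].sum()
                   + (r[:, idx].max(axis=1) - r[:, idx].sum(axis=1)).sum()
                   - (a[:, idx].max(axis=1) - a[:, idx].sum(axis=1)).sum())
    return vals

def kron_transform(values, T):
    """Apply the n-fold Kronecker power of the 2x2 matrix T to a length-2^n vector."""
    n = values.size.bit_length() - 1
    x = values.reshape((2,) * n)
    for axis in range(n):
        x = np.moveaxis(np.tensordot(T, x, axes=(1, axis)), 0, axis)
    return x.reshape(-1)

# Assumed model 4 convention: inverse factor [[1, 1], [1, 0]], forward factor is its inverse.
MODEL4_INV = np.array([[1.0, 1.0], [1.0, 0.0]])
MODEL4 = np.array([[0.0, 1.0], [1.0, -1.0]])
WHT = np.array([[1.0, 1.0], [1.0, -1.0]])   # unnormalized Walsh-Hadamard factor

rng = np.random.default_rng(0)
n, L, K = 6, 2, 1
u = rng.normal(size=n)
r = rng.uniform(0.5, 1.5, size=(L, n))
a = rng.uniform(0.5, 1.5, size=(K, n))

values = preference_values(u, r, a)
coeffs_model4 = kron_transform(values, MODEL4)
coeffs_wht = kron_transform(values, WHT)

nnz_model4 = int(np.sum(np.abs(coeffs_model4) > 1e-9))
nnz_wht = int(np.sum(np.abs(coeffs_wht) > 1e-9))
bound = 1 + n + (L + K) * n
print(f"model 4: {nnz_model4} non-zero coefficients (bound {bound}), WHT: {nnz_wht} of {2 ** n}")
```

For generic strictly positive parameters, one would expect the WHT spectrum to have far more non-zero entries than the model 4 spectrum, in line with the remark.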

## Appendix B SSFT: Support Discovery

In this section we prove the equations necessary for the support discovery mechanism of SSFT.

Let $s: 2^N \to \mathbb{R}$ be a set function and let $M_i \subseteq N$. As before we denote the restriction of $s$ to $M_i$ with

$$s{\downarrow}_{2^{M_i}}: 2^{M_i} \to \mathbb{R};\quad A \mapsto s(A). \tag{27}$$

Recall the problem we want to solve and our algorithms (Fig. 3) for doing so (under mild assumptions on the Fourier coefficients).

###### Problem 2 (Sparse Fourier transform).

Given oracle access to query a $k$-Fourier-sparse set function $s$, compute its Fourier support and associated Fourier coefficients.
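The oracle model of Problem 2 and the restriction in (27) can be mimicked in code as follows. This is only a sketch of the query model, not of SSFT itself: `CountingOracle`, `restrict`, and `query_bound` are hypothetical helper names, and counting only distinct sets as queries is an assumption. The bound implemented in `query_bound` is the $nk - k \log_2 k + k$ figure stated in the abstract.

```python
import math

class CountingOracle:
    """Wraps a set function and counts the distinct sets it is evaluated on."""

    def __init__(self, set_function):
        self._f = set_function
        self._cache = {}

    def __call__(self, A):
        key = frozenset(A)
        if key not in self._cache:
            self._cache[key] = self._f(key)
        return self._cache[key]

    @property
    def num_queries(self):
        return len(self._cache)

def restrict(oracle, M):
    """Restriction of a set function to subsets of M, cf. Eq. (27)."""
    M = frozenset(M)

    def restricted(A):
        A = frozenset(A)
        if not A <= M:
            raise ValueError("A must be a subset of M")
        return oracle(A)

    return restricted

def query_bound(n, k):
    """Query bound n*k - k*log2(k) + k quoted in the abstract."""
    return n * k - k * math.log2(k) + k

# Toy set function (illustrative): cardinality of the set.
oracle = CountingOracle(len)
s_restricted = restrict(oracle, {0, 1, 2})
s_restricted({0, 1})
s_restricted({1, 0})          # same set, served from the cache
print(oracle.num_queries)     # 1
print(query_bound(10, 4))     # 36.0
```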