1 Introduction
Consider an online dating service in which the recommendation system chooses a (female, male) pair from the user pool, matches them, and receives feedback about whether they like each other or not. Over many repetitions of this step, the system aims to maximize the number of positive matches. Similarly, a clothing website may want to choose a pair of items (top, bottom) that appeal to a customer when worn together. In medicine, scientists would like to find (protein, drug) pairs that are most likely to dock. In these applications, the two types of entities are recommended and evaluated as a unit. Having feature vectors of the entities available (these can be obtained either directly from the entity description, for example hobbies or age, or by other preprocessing techniques, for example embedding), the system must explore and learn what features of the two entities jointly predict positive feedback in order to make effective recommendations.
The recommendation system aims to obtain large rewards (the amount of positive feedback) but does not know ahead of time the relationship between the features and the feedback. The system thus faces two conflicting goals: choosing pairs that (i) maximally help estimate the relationship (“exploration”) but may give small rewards, and (ii) return relatively large, but possibly suboptimal, rewards (“exploitation”), given the limited information inferred from the feedback collected so far. Such an exploration-exploitation dilemma can be formulated as a multi-armed bandit problem (Lai & Robbins, 1985, Auer et al., 2002). When the feature vectors are available for each arm, one can postulate simple reward structures such as (generalized) linear models to allow a large or even infinite number of arms (Auer, 2002, Dani et al., 2008, Abbasi-Yadkori et al., 2011, Filippi et al., 2010), a paradigm that has received much attention during the past decade, with such applications as online news recommendations (Li et al., 2010). Less is known for the situation we consider here, where the recommendation (action) involves two different entity types and forms a bilinear structure. The closest work we are aware of is Kveton et al. (2017), whose action space is the same but without arm feature vectors. Our problem is different from dueling bandits (Yue et al., 2012a) or bandits with unknown user segment (Bhargava et al., 2017), which choose two arms from the same entity set rather than from two different entity types. Section 6 below contains detailed comparisons to related work.
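This paper introduces the bilinear bandit problem with low-rank structure. In each round $t$, an algorithm chooses a left arm $x_t$ from $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ and a right arm $z_t$ from $\mathcal{Z} \subseteq \mathbb{R}^{d_2}$, and observes a noisy reward of a bilinear form:
$$y_t = x_t^\top \Theta^* z_t + \eta_t, \tag{1}$$
where $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is an unknown parameter and $\eta_t$ is a conditionally $\sigma$-sub-Gaussian random variable given all the observations before time $t$. Denoting by $r$ the rank of $\Theta^*$, we assume that $r$ is small ($r \ll \min\{d_1, d_2\}$), which means that the reward is governed by a few factors. Such low-rank structure appears in many recommendation applications (Ma et al., 2008).
The goal is to maximize the cumulative reward up to time $T$. Equivalently, we aim to minimize the cumulative regret (this regret definition is actually called pseudo-regret; we refer to Bubeck & Cesa-Bianchi (2012, Section 1) for details):
$$\mathrm{Regret}_T = \sum_{t=1}^{T} \Big\{ \max_{x \in \mathcal{X},\, z \in \mathcal{Z}} x^\top \Theta^* z - x_t^\top \Theta^* z_t \Big\}. \tag{2}$$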
A naive approach to this problem is to reduce the bilinear problem to a linear problem, as follows:
$$x^\top \Theta^* z = \langle \mathrm{vec}(x z^\top), \mathrm{vec}(\Theta^*) \rangle.$$
Throughout the paper, we focus on the regime where the numbers of possible actions $N_1 = |\mathcal{X}|$ and $N_2 = |\mathcal{Z}|$ are much larger than the dimensions $d_1$ and $d_2$, respectively. (Otherwise, one can reduce the problem to the standard multi-armed bandit problem with $N_1 N_2$ arms and enjoy regret of $\tilde{O}(\sqrt{N_1 N_2 T})$. With SupLinRel (Auer, 2002), one may also achieve $\tilde{O}(\sqrt{d_1 d_2 T \log(N_1 N_2)})$, but this approach wastes a lot of samples and does not allow an infinite number of arms.) The reduction above allows us to use the standard linear bandit algorithms (see, for example, Abbasi-Yadkori et al. (2011)) in the $d_1 d_2$-dimensional space and achieve regret of $\tilde{O}(d_1 d_2 \sqrt{T})$, where $\tilde{O}$ hides logarithmic factors. However, $d_1 d_2$ can be large, resulting in a very large regret. Furthermore, the regret does not decrease as $r$ gets smaller, since the reduction hinders us from exploiting the low-rank structure.
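The naive reduction rests on the identity between a bilinear form and an inner product of vectorized matrices. The following is a minimal numerical check of that identity; the dimensions and the random matrix `Theta` are hypothetical stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 4, 3
Theta = rng.standard_normal((d1, d2))  # hypothetical stand-in for the unknown parameter
x = rng.standard_normal(d1)            # a left arm
z = rng.standard_normal(d2)            # a right arm

# Bilinear reward mean: x^T Theta z.
bilinear = x @ Theta @ z

# Linear view: <vec(x z^T), vec(Theta)>. Both vectorizations must use the
# same ordering (row-major here), giving a (d1 * d2)-dimensional problem.
linear = np.outer(x, z).ravel() @ Theta.ravel()

print(bilinear, linear)
```

The linearized arm has $d_1 d_2$ coordinates regardless of the rank of the parameter, which is why this view cannot benefit from low-rank structure on its own.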
We address the following challenge: Can we design an algorithm for the bilinear bandit problem that exploits the low-rank structure and enjoys strictly smaller regret than $\tilde{O}(d_1 d_2 \sqrt{T})$? We answer the question in the affirmative by proposing Explore Subspace Then Refine (ESTR), an approach that achieves a regret bound of $\tilde{O}((d_1 + d_2)^{3/2} \sqrt{rT})$. ESTR consists of two stages. In the first stage, we estimate the row and column subspaces by randomly sampling arms. In the second stage, we leverage the estimated subspaces by invoking an approach that we call almost-low-dimensional OFUL (LowOFUL), a variant of OFUL (Abbasi-Yadkori et al., 2011) that uses regularization to penalize the subspaces that are apparently not spanned by the rows and columns (respectively) of $\Theta^*$. We conjecture that our regret upper bound is minimax optimal up to polylogarithmic factors, based on the fact that the bilinear model has a much lower expected signal strength than the linear model. We provide a detailed argument on the lower bound in Section 5.
While the idea of explore-then-commit is not new, the way we exploit the subspace with LowOFUL is novel. After the first stage that outputs a subspace estimate, it is tempting to project all the arms onto the identified subspaces ($r$ dimensions for each of the row and column spaces), and naively invoke OFUL in the $r^2$-dimensional space. However, the subspace mismatch invalidates the upper confidence bound used in OFUL; that is, the confidence bound is no longer faithful (it does not actually bound the mean reward).
Attempts to correct the confidence bound so that it is faithful are not trivial, and we are unaware of a solution that leads to improved regret bounds. Departing from completely committing to the identified subspaces, LowOFUL works with the full $d_1 d_2$-dimensional space while penalizing the subspace that is complementary to the estimated subspace, thus continuing to refine the subspace. We calibrate the amount of regularization to be a function of the subspace estimation error; this is the key to achieving our final regret bound.
The paper is structured as follows. In Section 2, we define the problem formally and provide a sketch of the main contribution. Sections 3 and 4 describe the details of stages 1 and 2 of ESTR, respectively. We elaborate our conjecture on the regret lower bound in Section 5. We discuss related work in Section 6 and conclude the paper by proposing future research directions in Section 7.
2 Preliminaries
We define the problem formally as follows. Let $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ and $\mathcal{Z} \subseteq \mathbb{R}^{d_2}$ be the left and right arm space, respectively. Define $N_1 = |\mathcal{X}|$ and $N_2 = |\mathcal{Z}|$. (Either or both can be infinite.) We assume that both the left and right arms have Euclidean norm at most 1: $\|x\|_2 \le 1$ and $\|z\|_2 \le 1$ for all $x \in \mathcal{X}$ and $z \in \mathcal{Z}$. Without loss of generality, we assume $\mathcal{X}$ ($\mathcal{Z}$) spans the whole $d_1$ ($d_2$) dimensional space (respectively); otherwise a preprocessing step can reduce to lower-dimensional spaces. We assume $d_2 = \Theta(d_1)$ and define $d = \max\{d_1, d_2\}$. If $A$ is a positive integer, we use notation $[A] = \{1, 2, \ldots, A\}$. We denote by $v_{i:j}$ the $(j-i+1)$-dimensional vector taking values from the coordinates from $i$ to $j$ from $v$. Similarly, we define $M_{i:j,k:\ell} \in \mathbb{R}^{(j-i+1) \times (\ell-k+1)}$ to be a submatrix taking values from $M$ with the row indices from $i$ to $j$ and the column indices from $k$ to $\ell$. We denote by $v_i$ the $i$-th component of the vector $v$ and by $M_{ij}$ the entry of a matrix $M$ located at the $i$-th row and $j$-th column. Denote by $s_{\max}(M)$ and $s_{\min}(M)$ the largest and smallest nonzero singular value of a matrix $M$, respectively. $|M|$ denotes the determinant of a matrix $M$.
The protocol of the bilinear bandit problem is as follows. At time $t$, the algorithm chooses a pair of arms $(x_t, z_t) \in \mathcal{X} \times \mathcal{Z}$ and receives a noisy reward $y_t$ according to (1). We make the standard assumptions in linear bandits: the Frobenius and operator norms of $\Theta^*$ are bounded by known constants, $\|\Theta^*\|_F \le S_F$ and $\|\Theta^*\|_2 \le S_2$, and the sub-Gaussian scale $\sigma$ of $\eta_t$ is known to the algorithm. (When $S_2$ is not known, one can set $S_2 = S_F$. In some applications, $S_2$ is known. For example, the popular binary reward can be formulated as (1). Since the reward cannot be greater than 1 in this case, we can use $S_2 = 1$.) We denote by $s_i^*$ the $i$-th largest singular value of $\Theta^*$. We assume that the rank $r$ of the matrix is known and that $s_r^* \ge S_r$ for some known $S_r > 0$. (In practice, one can perform rank estimation after the first stage; see, for example, Keshavan et al. (2010).)
The main contribution of this paper is the first nontrivial upper bound on the achievable regret for the bilinear bandit problem. In this section, we provide a sketch of the overall result and the key insight. For simplicity, we omit constants and variables other than $d$, $r$, and $T$. Our proposed ESTR algorithm enjoys the following regret bound, which strictly improves on the naive linear bandit reduction when $r \ll d$.
Theorem 1 (Informal).
Under mild assumptions, the regret of ESTR is $\tilde{O}(d^{3/2} \sqrt{rT})$ with high probability.
We conjecture that the regret bound above is minimax optimal up to polylogarithmic factors since the expected signal strength in the bilinear model is much weaker than the linear model. We elaborate on this argument in Section 5.
We describe ESTR in Figure 1. The algorithm proceeds in two stages. In the first stage, we estimate the column and row subspaces of $\Theta^*$ from noisy rank-one measurements, using a matrix recovery algorithm. Specifically, we first identify $d_1$ and $d_2$ arms from the sets $\mathcal{X}$ and $\mathcal{Z}$ in such a way that the smallest eigenvalues of the matrices formed from these arms are maximized approximately (Theorem 1 is valid for any approximation). We then use the popular matrix recovery algorithm OptSpace (Keshavan et al., 2010) to estimate $\Theta^*$. The theorem of Wedin (Stewart & Sun, 1990) is used to convert the matrix recovery error bound from OptSpace to the desired subspace angle guarantee (4) with $\gamma(T_1) = \tilde{O}(d^3 r / T_1)$, where $T_1$ is the number of rounds spent in stage 1. The regret incurred in stage 1 is bounded trivially by $2 S_2 T_1$.
In the second stage, we transform the problem into a $d_1 d_2$-dimensional linear bandit problem and invoke LowOFUL, which we introduce in Section 4. This technique projects the arms onto both the estimated subspace and its complementary subspace and uses $\gamma(T_1)$ to penalize weights in the complementary subspaces $\hat{U}_\perp$ and $\hat{V}_\perp$. LowOFUL enjoys the regret bound $\tilde{O}((dr + \sqrt{T}\gamma(T_1))\sqrt{T})$ during $T - T_1$ rounds. By combining with the regret for the first stage, we obtain an overall regret of
$$\tilde{O}\big(T_1 + (dr + \sqrt{T}\gamma(T_1))\sqrt{T}\big).$$
Choosing $T_1$ to minimize this expression, we obtain a regret bound of $\tilde{O}(d^{3/2} \sqrt{rT})$.
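The balancing of the two stages can be illustrated numerically. The sketch below assumes the overall regret has the shape $T_1 + (dr + \sqrt{T}\,\gamma(T_1))\sqrt{T}$ with $\gamma(T_1) = d^3 r / T_1$, dropping constants and logarithmic factors; this functional form and the values of `d`, `r`, and `T` are assumptions made for illustration only.

```python
import numpy as np

def stage_tradeoff(T1, d, r, T):
    """Assumed regret shape: stage-1 cost T1 plus the stage-2 cost."""
    gamma = d**3 * r / T1  # assumed subspace-error level after T1 exploration rounds
    return T1 + (d * r + np.sqrt(T) * gamma) * np.sqrt(T)

d, r, T = 20, 2, 10**8

# Brute-force search over a grid of stage-1 lengths.
grid = np.arange(1_000, T, 1_000, dtype=float)
T1_best = grid[np.argmin(stage_tradeoff(grid, d, r, T))]

# Closed-form balance point: T1 = d^{3/2} sqrt(r T).
T1_pred = d**1.5 * np.sqrt(r * T)

print(T1_best, T1_pred)
```

Under these assumptions the grid minimizer lands next to $d^{3/2}\sqrt{rT}$, and the value of the expression at that point is of the same order, which is where the final rate comes from.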
3 Stage 1: Subspace estimation
The goal of stage 1 is to estimate the row and column subspaces for the true parameter $\Theta^*$. How should we choose which arm pairs to pull, and what guarantee can we obtain on the subspace estimation error? One could choose to apply a noisy matrix recovery algorithm with affine rank minimization (Recht et al., 2010, Mohan & Fazel, 2010) to the measurements attained from the arm pulls. However, these methods require the measurements to be Gaussian or Rademacher, so their guarantees depend on satisfaction of the restricted isometry property (RIP) (Recht et al., 2010). Such assumptions are not suitable for our setting since measurements are restricted to the arbitrarily given arm sets $\mathcal{X}$ and $\mathcal{Z}$. Uniform sampling from the arm set cannot guarantee RIP, as the arm set itself can be biased in certain directions.
We design a simple reduction procedure through matrix recovery with noisy entry observations, leaving a more sophisticated treatment as future work. The $d_1$ left arms, stacked as the rows of a matrix $X \in \mathbb{R}^{d_1 \times d_1}$, are chosen according to the criterion (3), which is a combinatorial problem that is hard to solve exactly. Our analysis does not require its exact solution, however; it is enough that the objective value is not infinite (that is, the matrix $X$ constructed from these arms is nonsingular). (Similar comments hold for the matrix $Z \in \mathbb{R}^{d_2 \times d_2}$ of right arms.) If $K^*$ is the matrix defined by $K^*_{ij} = x^{(i)\top} \Theta^* z^{(j)}$, each time step of stage 1 obtains a noisy estimate of one element of $K^*$. Since multiple measurements of each entry are made, in general, we compute average measurements for each entry. A matrix recovery algorithm applied to this matrix of average measurements yields the estimate $\hat{K}$ of the rank-$r$ matrix $K^*$. Since $K^* = X \Theta^* Z^\top$, we estimate $\Theta^*$ by $\hat{\Theta} = X^{-1} \hat{K} (Z^\top)^{-1}$ and then compute the subspace estimates $\hat{U} \in \mathbb{R}^{d_1 \times r}$ and $\hat{V} \in \mathbb{R}^{d_2 \times r}$ by applying SVD to $\hat{\Theta}$.
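The stage-1 pipeline described above (average noisy entry observations, recover a low-rank matrix, undo the arm matrices, take an SVD) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a truncated SVD is swapped in for OptSpace, the arm matrices `X` and `Z` are idealized as orthonormal, and all sizes, the noise level, and the number of repetitions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, sigma, reps = 6, 5, 2, 0.1, 200

# Hypothetical rank-r ground truth.
Theta = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))

# Rows of X and Z play the role of the d1 and d2 selected arms. Orthonormal
# rows are an idealized choice: unit norm and perfectly conditioned.
X = np.linalg.qr(rng.standard_normal((d1, d1)))[0]
Z = np.linalg.qr(rng.standard_normal((d2, d2)))[0]
K = X @ Theta @ Z.T  # K[i, j] = x_i^T Theta z_j

# Pull every (i, j) pair `reps` times and average the noisy rewards.
noise = sigma * rng.standard_normal((reps, d1, d2))
K_avg = (K + noise).mean(axis=0)

# Stand-in for OptSpace: best rank-r approximation by truncated SVD.
U, s, Vt = np.linalg.svd(K_avg, full_matrices=False)
K_hat = (U[:, :r] * s[:r]) @ Vt[:r]

# Undo the arm matrices, Theta_hat = X^{-1} K_hat Z^{-T}, then read off subspaces.
Theta_hat = np.linalg.solve(X, K_hat) @ np.linalg.inv(Z.T)
U_hat, _, Vt_hat = np.linalg.svd(Theta_hat)
U_hat, V_hat = U_hat[:, :r], Vt_hat[:r].T

rel_err = np.linalg.norm(Theta_hat - Theta) / np.linalg.norm(Theta)
print(rel_err)
```

With poorly conditioned arm matrices the inverses amplify the recovery error, which is why the arm-selection criterion targets the smallest eigenvalue.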
We choose the recovery algorithm OptSpace by Keshavan et al. (2010) because of its strong (near-optimal) guarantee. Denoting the SVD of by , we use the matrix incoherence definition from Keshavan et al. (2010) and let be the smallest values such that
Define the condition number . We present the guarantee of OptSpace (Keshavan et al., 2010) in a paraphrased form. (The proof of this result, and all subsequent proofs, are deferred to the appendix.)
Theorem 2.
There exists a constant such that for , we have, with probability at least , that
(5) 
where is an absolute constant.
Although the original theorem from Keshavan et al. (2010) assumes and does not allow repeated sampling, we show in the proof that the same guarantee holds for . Repeated sampling of entries has the effect of reducing the noise parameter ; see Appendix A for details.
Our recovery of an estimate of implies the bound where is the RHS of (5). However, our goal in stage 1 is to obtain a bound on the subspace estimation errors. That is, given the SVD , we wish to identify how close () is to ( respectively), where is the SVD of . Such guarantees on the subspace error can be obtained via the theorem of Wedin (Stewart & Sun, 1990), which we restate in our appendix. Roughly speaking, this theorem bounds the canonical angles between two subspaces by the Frobenius norm of the difference between the two matrices. Recall that is the th largest singular value of .
Theorem 3.
Suppose we invoke OptSpace to compute as an estimate of the matrix . After stage 1 of ESTR with satisfying the condition of Theorem 2, with probability at least , we have
(6) 
where .
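One common way to quantify the subspace error discussed here is the Frobenius norm of the sines of the canonical angles between the estimated and true subspaces, which equals the Frobenius norm of the product of the complement of one basis with the other. The helper below is a minimal sketch; the function name `subspace_error` is hypothetical, and whether this is exactly the quantity bounded in (6) is an assumption.

```python
import numpy as np

def subspace_error(U_hat, U_star):
    """Frobenius norm of the sines of the canonical angles between two column spaces."""
    Q_hat, _ = np.linalg.qr(U_hat)
    Q_star, _ = np.linalg.qr(U_star)
    # Cosines of the canonical angles are the singular values of Q_hat^T Q_star.
    cosines = np.linalg.svd(Q_hat.T @ Q_star, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding above 1
    return np.sqrt(np.sum(1.0 - cosines**2))

E = np.eye(4)
same = subspace_error(E[:, :2], E[:, :2])        # identical subspaces
orthogonal = subspace_error(E[:, :2], E[:, 2:])  # mutually orthogonal subspaces
print(same, orthogonal)
```

The error ranges from 0 for identical subspaces to the square root of the subspace dimension for orthogonal ones.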
4 Stage 2: Almost-low-dimensional linear bandits
The goal of stage 2 is to exploit the subspaces $\hat{U}$ and $\hat{V}$ estimated in stage 1 to perform efficient bandit learning. At first, it is tempting to project all the left and right arms to $r$-dimensional subspaces using $\hat{U}$ and $\hat{V}$, respectively, which seems to yield a bilinear bandit problem with an $r$ by $r$ unknown matrix. One can then reduce it to an $r^2$-dimensional linear bandit problem and solve it by standard algorithms such as OFUL (Abbasi-Yadkori et al., 2011). Indeed, if $\hat{U}$ and $\hat{V}$ exactly span the row and column spaces of $\Theta^*$, this strategy yields a regret bound of $\tilde{O}(r^2 \sqrt{T})$. In reality, these matrices (subspaces) are not exact, so there is model mismatch, making it difficult to apply standard regret analysis. The upper confidence bound (UCB) used in popular algorithms becomes invalid, and there is no known correction that leads to a regret bound lower than $\tilde{O}(d_1 d_2 \sqrt{T})$, to the best of our knowledge.
In this section, we show how stage 2 of our approach avoids the mismatch issue by returning to the full $d_1 d_2$-dimensional space, allowing the subspace estimates to be inexact, but penalizing those components that are complementary to $\hat{U}$ and $\hat{V}$. This effectively constrains the hypothesis space to be much smaller than the full $d_1 d_2$-dimensional space. We show how the bilinear bandit problem with subspace estimates can be turned into the almost-low-dimensional linear bandit problem, and how much penalization / regularization is needed to achieve a low overall regret bound. Finally, we state our main theorem showing the overall regret bound of ESTR.
Reduction to linear bandit.
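Recall that $\Theta^* = U^* S^* V^{*\top}$ is the SVD of $\Theta^*$ (where $S^*$ is $r \times r$ diagonal) and that $\hat{U}_\perp$ and $\hat{V}_\perp$ are the complementary subspaces of $\hat{U}$ and $\hat{V}$, respectively. Let $M = [\hat{U}\ \hat{U}_\perp]^\top \Theta^* [\hat{V}\ \hat{V}_\perp]$ be a rotated version of $\Theta^*$. Then we have
$$\Theta^* = [\hat{U}\ \hat{U}_\perp]\, M\, [\hat{V}\ \hat{V}_\perp]^\top \quad \text{and} \quad x^\top \Theta^* z = \big([\hat{U}\ \hat{U}_\perp]^\top x\big)^\top M \big([\hat{V}\ \hat{V}_\perp]^\top z\big).$$
Thus, the bilinear bandit problem with the unknown $\Theta^*$ with arm sets $\mathcal{X}$ and $\mathcal{Z}$ is equivalent to the one with the unknown $M$ with arm sets $\mathcal{X}' = \{[\hat{U}\ \hat{U}_\perp]^\top x : x \in \mathcal{X}\}$ and $\mathcal{Z}'$ (defined similarly). As mentioned earlier, this problem can be cast as a $d_1 d_2$-dimensional linear bandit problem by considering the unknown vector $\mathrm{vec}(M)$. The difference is, however, that we have learnt something about the subspace in stage 1. We define $\theta^*$ to be a rearranged version of $\mathrm{vec}(M)$ so that the last $(d_1 - r)(d_2 - r)$ dimensions of $\theta^*$ are $M_{ij}$ for $i \in \{r+1, \ldots, d_1\}$ and $j \in \{r+1, \ldots, d_2\}$, that is, with $p = d_1 d_2$ and $k = r(d_1 + d_2 - r)$,
$$\theta^*_{1:k} = \big[\mathrm{vec}(M_{1:r,1:r});\ \mathrm{vec}(M_{r+1:d_1,1:r});\ \mathrm{vec}(M_{1:r,r+1:d_2})\big], \tag{7a}$$
$$\theta^*_{k+1:p} = \mathrm{vec}(M_{r+1:d_1,r+1:d_2}). \tag{7b}$$
Then we have
$$\|\theta^*_{k+1:p}\|_2^2 = \sum_{i > r,\, j > r} M_{ij}^2 = \|\hat{U}_\perp^\top (U^* S^* V^{*\top}) \hat{V}_\perp\|_F^2 \le \|\hat{U}_\perp^\top U^*\|_F^2\, \|S^*\|_2^2\, \|\hat{V}_\perp^\top V^*\|_F^2. \tag{8}$$
Our knowledge of the subspace thus translates into knowledge of the norm of certain coordinates. Can we exploit this knowledge to enjoy a better regret bound than $\tilde{O}(d_1 d_2 \sqrt{T})$? We answer this question in the affirmative below.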
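The rotation and rearrangement can be checked numerically. The sketch below builds orthogonal bases whose leading columns are perturbed copies of the true singular subspaces, rotates the parameter, verifies that rewards are unchanged, and compares the norm of the trailing block with a product of subspace errors and the operator norm of the singular-value matrix. All dimensions, singular values, the perturbation size, and the block ordering inside `theta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 6, 5, 2

def basis_near(U, rng, eps=0.05):
    """Orthogonal matrix whose first r columns span a perturbed copy of span(U)."""
    d, r = U.shape
    start = np.hstack([U + eps * rng.standard_normal((d, r)),
                       rng.standard_normal((d, d - r))])
    return np.linalg.qr(start)[0]

# Hypothetical truth Theta = U* S* V*^T of rank r.
U_star = np.linalg.qr(rng.standard_normal((d1, r)))[0]
V_star = np.linalg.qr(rng.standard_normal((d2, r)))[0]
S_star = np.diag([3.0, 1.5])
Theta = U_star @ S_star @ V_star.T

QU = basis_near(U_star, rng)  # columns: [U_hat, U_hat_perp]
QV = basis_near(V_star, rng)  # columns: [V_hat, V_hat_perp]
M = QU.T @ Theta @ QV         # rotated parameter

# Rotating the arms the same way leaves the reward unchanged.
x, z = rng.standard_normal(d1), rng.standard_normal(d2)
reward_original = x @ Theta @ z
reward_rotated = (QU.T @ x) @ M @ (QV.T @ z)

# Rearranged vector: the (d1 - r)(d2 - r) "doubly complementary" entries go last.
theta = np.concatenate([M[:r, :r].ravel(), M[r:, :r].ravel(),
                        M[:r, r:].ravel(), M[r:, r:].ravel()])
k = r * (d1 + d2 - r)
tail_norm = np.linalg.norm(theta[k:])
bound = (np.linalg.norm(QU[:, r:].T @ U_star) * np.linalg.norm(S_star, 2)
         * np.linalg.norm(QV[:, r:].T @ V_star))
print(tail_norm, bound)
```

The tail norm shrinks with the product of the two subspace errors, so better stage-1 estimates translate directly into a smaller trailing block.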
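Almost-low-dimensional OFUL (LowOFUL).
We now focus on an abstraction of the conversion described in the previous paragraph, which we call the almost-low-dimensional linear bandit problem. In the standard linear bandit problem in $p$ dimensions, the player chooses an arm $a_t$ at time $t$ from an arm set $\mathcal{A} \subseteq \mathbb{R}^p$ and observes a noisy reward $y_t = \langle a_t, \theta^* \rangle + \eta_t$, where the noise $\eta_t$ has the same properties as in (1). We assume that $\|a\|_2 \le 1$ for all $a \in \mathcal{A}$, and $\|\theta^*\|_2 \le B$ for some known constant $B > 0$. In almost-low-dimensional linear bandits, we have additional knowledge that $\|\theta^*_{k+1:p}\|_2 \le B_\perp$ for some index $k$ and some constant $B_\perp$ (ideally $B_\perp \ll B$). This means that all but $k$ dimensions of $\theta^*$ are close to zero.
To exploit the extra knowledge on the unknown, we propose almost-low-dimensional OFUL (LowOFUL), which extends the standard linear bandit algorithm OFUL (Abbasi-Yadkori et al., 2011). To describe OFUL, consider a linear model with design matrix $A \in \mathbb{R}^{t \times p}$ with rows $a_s^\top$, $s = 1, \ldots, t$, and right-hand side $y = [y_1, \ldots, y_t]^\top$. The key estimator is based on regression with the standard squared $\ell_2$-norm regularizer, as follows:
$$\hat{\theta}_t = \arg\min_{\theta} \tfrac{1}{2}\|A\theta - y\|_2^2 + \tfrac{\lambda}{2}\|\theta\|_2^2.$$
OFUL then defines a confidence ellipsoid around $\hat{\theta}_t$ based on which one can compute an upper confidence bound on the mean reward of any arm. In our variant, we allow a different regularization for each coordinate, replacing the regularizer $\tfrac{\lambda}{2}\|\theta\|_2^2$ by $\tfrac{1}{2}\|\theta\|_\Lambda^2 = \tfrac{1}{2}\theta^\top \Lambda \theta$ for some positive diagonal matrix $\Lambda$. Specifically, we define $\Lambda = \mathrm{diag}(\lambda, \ldots, \lambda, \lambda_\perp, \ldots, \lambda_\perp)$, where $\lambda$ occupies the first $k$ diagonal entries and $\lambda_\perp$ the last $p - k$ positions. With this modification, the estimator becomes
$$\hat{\theta}_t = \arg\min_{\theta} \tfrac{1}{2}\|A\theta - y\|_2^2 + \tfrac{1}{2}\|\theta\|_\Lambda^2 = (\Lambda + A^\top A)^{-1} A^\top y. \tag{9}$$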
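The estimator with coordinate-wise regularization has a closed form, sketched below. The function name `low_oful_estimate`, the synthetic data, and the particular regularization values are hypothetical; the code only illustrates how a large penalty on the trailing coordinates shrinks them toward zero while reducing to ordinary ridge regression when both penalties coincide.

```python
import numpy as np

def low_oful_estimate(A, y, k, lam, lam_perp):
    """Ridge estimate with penalty lam on the first k coordinates and lam_perp on the rest."""
    p = A.shape[1]
    lambda_diag = np.concatenate([np.full(k, lam), np.full(p - k, lam_perp)])
    V = np.diag(lambda_diag) + A.T @ A
    theta_hat = np.linalg.solve(V, A.T @ y)
    return theta_hat, V

rng = np.random.default_rng(0)
t, p, k = 50, 6, 2
theta_true = np.r_[1.0, -0.5, np.zeros(p - k)]  # hypothetical: only k active coordinates
A = rng.standard_normal((t, p)) / np.sqrt(p)    # rows are the pulled arms
y = A @ theta_true + 0.1 * rng.standard_normal(t)

theta_ridge, _ = low_oful_estimate(A, y, k, lam=1.0, lam_perp=1.0)  # plain ridge
theta_low, _ = low_oful_estimate(A, y, k, lam=1.0, lam_perp=1e4)    # heavy tail penalty
print(np.round(theta_ridge, 3), np.round(theta_low, 3))
```

The second return value is the regularized Gram matrix, which is the same matrix that defines the confidence ellipsoid.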
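Define $V_t = \Lambda + \sum_{s=1}^{t} a_s a_s^\top$ and let $\delta$ be the failure rate we are willing to endure. The confidence ellipsoid for $\theta^*$ becomes
$$c_t = \Big\{\theta : \|\theta - \hat{\theta}_t\|_{V_t} \le \sqrt{\beta_t}\Big\}, \quad \text{where } \sqrt{\beta_t} = \sigma\sqrt{\log\frac{|V_t|}{|\Lambda|\,\delta^2}} + \sqrt{\lambda}\,B + \sqrt{\lambda_\perp}\,B_\perp. \tag{10}$$
This ellipsoid enjoys the following guarantee, which is a direct consequence of Valko et al. (2014, Lemma 3), which in turn is based on the self-normalized martingale inequality of Abbasi-Yadkori et al. (2011, Theorem 1).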
Lemma 1.
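With probability at least $1 - \delta$, we have $\theta^* \in c_t$ for all $t \ge 1$.
We summarize LowOFUL in Algorithm 1, where $\max_{\theta \in c_{t-1}} \langle \theta, a \rangle$ can be simplified to $\langle \hat{\theta}_{t-1}, a \rangle + \sqrt{\beta_{t-1}}\, \|a\|_{V_{t-1}^{-1}}$.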
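The arm-selection step of an optimistic algorithm of this kind can be sketched as below. The confidence radius used here, a noise term with a log-determinant ratio plus the two regularization bias terms, follows the usual self-normalized form and is an assumption rather than a quotation of the paper's expression; the function name and all numeric values are hypothetical.

```python
import numpy as np

def low_oful_select(arms, theta_hat, V, k, lam, lam_perp, sigma, delta, B, B_perp):
    """Return the index of the arm with the largest upper confidence bound, and all UCBs."""
    p = arms.shape[1]
    _, logdet_V = np.linalg.slogdet(V)
    logdet_Lambda = k * np.log(lam) + (p - k) * np.log(lam_perp)
    # Assumed radius: sigma * sqrt(log(|V| / (|Lambda| delta^2)))
    #                 + sqrt(lam) * B + sqrt(lam_perp) * B_perp.
    sqrt_beta = (sigma * np.sqrt(logdet_V - logdet_Lambda - 2.0 * np.log(delta))
                 + np.sqrt(lam) * B + np.sqrt(lam_perp) * B_perp)
    # ||a||_{V^{-1}} for every arm (rows of `arms`).
    widths = np.sqrt(np.einsum("ij,ij->i", arms @ np.linalg.inv(V), arms))
    ucb = arms @ theta_hat + sqrt_beta * widths
    return int(np.argmax(ucb)), ucb

p, k, lam, lam_perp = 4, 2, 1.0, 100.0
V0 = np.diag([lam] * k + [lam_perp] * (p - k))  # no data yet, so V equals Lambda
arms = np.eye(p)[[0, p - 1]]                    # one arm in the head, one in the tail
best, ucb = low_oful_select(arms, np.zeros(p), V0, k, lam, lam_perp,
                            sigma=1.0, delta=0.1, B=1.0, B_perp=0.01)
print(best, ucb)
```

Before any data is collected, the heavily regularized tail direction has a much narrower confidence width, so the optimistic rule prefers exploring the low-dimensional head coordinates.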
We now state the regret bound of LowOFUL in Theorem 4, which is based on the standard linear bandit regret analysis dating back to Auer (2002).
Theorem 4.
The regret of LowOFUL is, with probability at least ,
(11) 
In the standard linear bandit setting where and , we recover the regret bound of OFUL, since (Abbasi-Yadkori et al., 2011, Lemma 10).
To alleviate the dependence on in the regret bound, we propose a carefully chosen value of in the following corollary.
Corollary 1.
Then, the regret of LowOFUL with is, with probability at least ,
The bound improves the dependence on dimensionality from to , but introduces an extra factor of to , resulting in linear regret. While this choice is not interesting in general, it is useful for our algorithm ESTR because the subspace error decreases with . Since , we can set to be a valid upper bound of . Then, by setting , the regret bound in Corollary 1 scales with rather than . Concretely, using (8), we set the values of and as follows:
(12) 
which are valid upper bounds of and , respectively, with high probability. Note we must use , , and instead of , , and , respectively, since the latter is unknown to the learner.
Overall regret.
Theorem 5 shows the overall regret bound of ESTR.
Theorem 5.
One can see that there exists an optimal choice of , which we state in the following corollary.
Corollary 2.
Suppose the assumptions in Theorem 5 hold. If , then the regret of ESTR is, with probability at least ,
Remark.
One might notice that the submatrices and can also be regularized further, since these submatrices come partly from the complementary subspace of and partly from the complement of (but not both). In practice, such a regularization can be done to reduce the regret slightly, but it does not affect the order of the regret. We do not have sufficient decrease in the magnitude to provide interesting bounds. One can show that, while , the quantities and are .
5 Lower bound
A simple lower bound is $\Omega(d\sqrt{T})$, since when the arm set $\mathcal{Z}$ is a singleton the problem reduces to a $d_1$-dimensional linear bandit problem. We have attempted to extend existing lower-bound proof techniques in Rusmevichientong & Tsitsiklis (2010), Dani et al. (2008), and Lattimore & Szepesvári (2018), but the bilinear nature of the problem introduces cross terms between the left and right arm, which are difficult to deal with in general. However, we conjecture that the lower bound is $\Omega(d^{3/2}\sqrt{rT})$. We provide an informal argument below that the dependence on $d$ must be $d^{3/2}$, based on the observation that the rank-one bilinear reward model’s signal-to-noise ratio (SNR) is significantly worse than that of the linear reward model.
Consider a rank-one $\Theta^*$ that can be decomposed as $s_1 u v^\top$ for some $u, v \in \{\pm 1/\sqrt{d}\}^d$. Suppose the left and right arm sets are $\mathcal{X} = \mathcal{Z} = \{\pm 1/\sqrt{d}\}^d$. Let us choose $x_t$ and $z_t$ uniformly at random (which is the sort of pure exploration that must be performed initially). Then a simple calculation shows that the expected squared signal strength with such a random choice is $\mathbb{E}\,|x_t^\top \Theta^* z_t|^2 = s_1^2 / d^2$. In contrast, the expected squared signal strength for a linear reward model with $\theta^* = s_1 u$ is $\mathbb{E}\,|x_t^\top \theta^*|^2 = s_1^2 / d$. The effect of this is analogous to increasing the sub-Gaussian scale parameter $\sigma$ of the noise $\eta_t$ by a factor of $\sqrt{d}$. We thus conjecture that the difference in the SNR introduces the dependence $d^{3/2}$ in the regret rather than $d$.
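The signal-strength comparison can be verified exactly for a small dimension by enumerating the whole arm set. The sketch below assumes a unit leading singular value and picks two arbitrary sign vectors as the factors; the dimension and the chosen indices are hypothetical.

```python
import itertools
import numpy as np

d = 4
# All 2^d sign vectors scaled to unit norm: the arm set {+-1/sqrt(d)}^d.
arms = np.array(list(itertools.product([-1.0, 1.0], repeat=d))) / np.sqrt(d)
u, v = arms[3], arms[10]  # arbitrary hypothetical factors from the same set

# Exact expectations under uniform sampling of x and z (unit singular value).
linear_signal = np.mean((arms @ u) ** 2)                      # E <x, u>^2
bilinear_signal = np.mean(np.outer(arms @ u, arms @ v) ** 2)  # E (x^T u v^T z)^2

print(linear_signal, bilinear_signal)
```

The squared signal drops from $1/d$ to $1/d^2$, a factor of $d$, which corresponds to a factor of $\sqrt{d}$ in the effective noise scale.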
6 Related work
There exist a few studies on pulling a pair of arms as a unit action, as we do. Kveton et al. (2017) consider the $K$-armed bandit with $N_1$ left arms and $N_2$ right arms. The expected reward can be represented as an $N_1 \times N_2$ matrix, which the authors assume has low rank. The main difference from our setting is that they do not assume that the arm features are available, so our work is related to Kveton et al. (2017) in the same way as linear bandits are related to $K$-armed bandits. The problem considered in Katariya et al. (2017b) is essentially a rank-one version of Kveton et al. (2017), which is motivated by a click-feedback model called the position-based model with items and positions. This work is further extended to have a tighter KL-based bound by Katariya et al. (2017a). All these studies successfully exploit the low-rank structure to enjoy regret bounds that scale with $N_1 + N_2$ rather than $N_1 N_2$.
There are other works that exploit the low-rank structure of the reward matrix, although the action is just a single arm pull. Sen et al. (2017) consider the contextual bandit setting where there are $N_1$ discrete contexts and $N_2$ arms, but do not take into account the observed features of contexts or arms. Under the so-called separability assumption, the authors make use of the Hottopixx algorithm to exploit the low-rank structure. Gopalan et al. (2016) consider a similar setting, but employ the robust tensor power method for recovery. Kawale et al. (2015) study essentially the same problem, but make assumptions on the prior that generates the unknown matrix and perform online matrix factorization with particle filtering to leverage the low-rank structure. These studies also exploit the low-rank structure successfully and enjoy regret bounds that scale much better than $N_1 N_2$.
There has been a plethora of contextual bandit studies that exploit structures other than low-rankness, where the context is usually the user identity or features. For example, Gentile et al. (2014) and its follow-up studies (Li et al., 2016, Gentile et al., 2017) leverage the clustering structure of the contexts. In Cesa-Bianchi et al. (2013) and Vaswani et al. (2017), a graph structure of the users is utilized to enjoy a regret bound that is lower than running bandits on each context (i.e., user) independently. Deshmukh et al. (2017) introduce a multitask learning view and exploit arm similarity information via kernels, but their regret guarantee is valid only when the similarity is known ahead of time. In this vein, if we think of the right arm set as tasks, we effectively assume different parameters for each task but with a low-rank structure. That is, the parameters can be written as a linear combination of a few hidden factors, which are estimated on the fly rather than being known in advance. Johnson et al. (2016) consider low-rank structured bandits but in a different setup. Their reward model has expected reward of the form $\mathrm{tr}(X_t^\top \Theta^*)$ with the arm $X_t \in \mathbb{R}^{d_1 \times d_2}$ and the unknown $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$. While $X_t$ corresponds to $x_t z_t^\top$ in our setting, they consider a continuous arm set, so their algorithm cannot be applied to our problem.
Our subroutine LowOFUL is quite similar to SpectralUCB (Valko et al., 2014), which is designed specifically for graph-structured arms in which expected rewards of two arms are close to each other (i.e., “smooth”) when there is an edge between them. Although the technical ingredients for showing the magnitude of the log-determinant term stem from Valko et al. (2014), LowOFUL is for an inherently different setup in which we design the regularization matrix to maximally exploit the subspace knowledge and minimize the regret, rather than receiving it from the environment as a part of the problem definition. Gilton & Willett (2017) study a similar regularizer in the context of sparse linear bandits under the assumption that a superset of the sparse locations is known ahead of time. Yue et al. (2012b) consider a setup similar to LowOFUL. They assume an estimate of the subspace is available, but their regret bound still depends on the total dimension.
7 Conclusion
In this paper, we introduced the bilinear low-rank bandit problem and proposed the first algorithm with a nontrivial regret guarantee. Our study opens up several future research directions. First, there is currently no nontrivial lower bound, and showing whether the regret of $\tilde{O}(d^{3/2}\sqrt{rT})$ is tight or not remains open. Second, while our algorithm improves the regret bound over the trivial linear bandit reduction, the algorithm is not practical since the uniform exploration in stage 1 hurts the performance. A more natural algorithm would combine stage 1 and stage 2 in such a way that we always use LowOFUL to choose the arm but continuously update the subspace estimate and the amount of regularization. The challenge is, however, that most matrix recovery algorithms require some sort of uniform sampling with a “nice” set of measurements. We speculate that one can employ randomized arm selection and use importance-weighted data to perform effective and provable matrix recovery on the fly.
Appendix
Appendix A Proof of Theorem 2
Theorem 2 (Restated) There exists a constant such that for , we have, with probability at least , that
(13) 
where is an absolute constant.
Proof.
There are a number of assumptions required for the guarantee of OptSpace to hold. Given a noise matrix , let be the noisy observation of matrix . Among various noise models in Keshavan et al. (2010, Theorem 1.3), the independent sub-Gaussian model fits our problem setting well. Let be the indicator of observed entries and let be a censored version of in which the unobserved entries are zeroed out. Recall that we assume , that , and that is the condition number of .
We first state the guarantee and then describe the required technical assumptions. Keshavan et al. (2010, Theorem 1.2) states that the following is true for some constant :
Here, by Keshavan et al. (2010, Theorem 1.3), is no larger than , for some constant , under Assumption (A3) below, where is the sub-Gaussian scale parameter for the noise . ( can be different from , as we explain below.) The original version of the statement has a preprocessed version rather than , but they are the same under our noise model, according to Keshavan et al. (2010, Section 1.5). Together, in our notation, we have
In the case of , the guarantee above holds true with and . If , the guarantee holds true with and . In both cases, we arrive at (5).
We now state the conditions. Let . Define to be the smallest nonzero singular value of .

(A1): is incoherent. Note .

(A2): (Sufficient observation) For some , we have
which we loosen and simplify to (using )

(A3): .

(A4): We combine the bound on and the condition in Keshavan et al. (2010, Theorem 1.2) that says “provided that the RHS is smaller than ”, which results in requiring