Online Learning Algorithms in Varied Bandit Scenarios

1. ABSTRACT

In this paper, we discuss the performance of non-stochastic (adversarial-setting) bandit algorithms in two special scenarios:

  • when agents/players are given only partial or sample feedback, and
  • when the agent’s reward information is falsified by the adversary.

2. INTRODUCTION

A setting in which the agent can observe the payoff of all N arms after each round is called full information. Practical problems rarely afford this luxury: the agent receives only limited feedback based on its own decision, i.e., it observes the payoff only of the arm it pulled in that round. This scenario is known as semi-bandit feedback.

For a long time, the lack of methods for handling settings without full information limited the application of bandit theory to many practical problems. When bandit algorithms are combined with online optimization methods, however, accurate importance-weighted estimates of the rewards can be obtained from partial feedback alone. Online optimization differs from other frameworks such as stochastic optimization, robust optimization, and Markov decision processes; online shortest-path problems are a classic example of problems solved with online optimization.
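To make the importance-weight idea concrete, here is a minimal Python sketch (the function name and interface are ours, not taken from any of the cited papers): only the chosen arm’s payoff is observed, and dividing it by the probability of that choice keeps the estimate unbiased in expectation.

```python
import numpy as np

def importance_weighted_estimate(chosen_arm, observed_reward, probs):
    """Build an unbiased reward-estimate vector from bandit feedback.

    Only the chosen arm's reward is observed; dividing it by the
    probability of having chosen that arm keeps the estimate unbiased,
    E[r_hat[i]] = r[i], for every arm i.
    """
    r_hat = np.zeros(len(probs))
    r_hat[chosen_arm] = observed_reward / probs[chosen_arm]
    return r_hat

# example: arm 2 of 5 was pulled with probability 0.2 and paid out 1
print(importance_weighted_estimate(2, 1.0, [0.2, 0.2, 0.2, 0.2, 0.2]))
```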

Let us say we have the following Adversarial MAB setting in-hand:

Consider a simple 2-player capture-the-flag game, which consists of 5 sites and 2 flags. At each round, Player 1 (the hider) chooses 2 sites to hide her flags. Note that each site can contain at most 1 flag. Player 2 (the seeker), without knowing the exact places of the flags, must guess where the flags could be. To do so, she can choose 2 sites to check whether the flags are hidden there. If the seeker finds both flags, she gets 2 points; if she finds only 1 flag, her reward is 1, and 0 otherwise. The goal of the seeker is thus to maximize her total points over time, while the hider aims to minimize the total reward of her opponent. This problem can be viewed as a bandit with 5 non-identical slot machines, which enables us to apply online learning algorithms that cope with non-stationary, adversarial environments and minimize regret while balancing exploration and exploitation.
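A minimal sketch of this game as a simulation environment (our own illustration, assuming sites are indexed 0–4 and the per-round reward is simply the number of flags found):

```python
import random

N_SITES, N_FLAGS = 5, 2

def play_round(hider_sites, seeker_sites):
    """One round of capture-the-flag: the seeker's reward is the number
    of her chosen sites that actually hold a flag (0, 1, or 2); the
    hider tries to keep this number low."""
    return len(set(hider_sites) & set(seeker_sites))

# example round with uniformly random players
hider = random.sample(range(N_SITES), N_FLAGS)
seeker = random.sample(range(N_SITES), N_FLAGS)
print(hider, seeker, play_round(hider, seeker))
```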

In the remainder of this paper we describe the approach taken and the performance of our strategies in two different scenarios of the above problem. Sections 3 and 4 describe the two scenarios and discuss the observations.

 

3. PARTIAL FEEDBACK SCENARIO

Consider the following asymmetric game setting: Player 1 (the hider) can see which sites Player 2 has chosen each round, while Player 2 (the seeker) only sees a flag if it was hidden at one of her chosen sites. The game is played for 500 rounds and repeated 100 times.

 

To ensure the flag seeker can learn and minimize his regret in the above setting, we implement two combinations of adversarial bandit algorithms based on EXP3 and FPL. Both were chosen for their fast, dynamic learning and because they do not require full information to update the weight of each arm (here, each flag-hiding site): they rely only on the feedback from the arm actually chosen by the player, which makes them suitable for a partial-feedback setting. We also choose them for their simplicity; more sophisticated choices for this scenario include the FPL-GR [2] and FPL-UE [3] algorithms, which are popular for combinatorial bandit problems.

3.1 EXP3 Hider and EXP3 Seeker

The only hyperparameter we must choose beforehand, for both the hider and the seeker, is γ ∈ [0, 1], which tunes the desire to pick an action uniformly at random and thus determines the balance between exploration and exploitation. We study the effect of different values of γ used by the hider and the seeker on the reward earned by the seeker, looking at the trend and convergence in each case; this is depicted in Fig 1.

Note: when γ = 1, exploration is purely uniform-random and the weights computed for the different sites no longer matter. Otherwise, with probability 1 − γ the algorithm exploits the sites with higher weights.
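For reference, a minimal EXP3 sketch in Python (the class and method names are ours; rewards are assumed rescaled to [0, 1], e.g. by dividing the seeker’s 0–2 payoff by 2, and for simplicity each action is treated as one arm rather than as a pair of sites):

```python
import numpy as np

class Exp3:
    """Minimal EXP3 sketch over K arms with rewards in [0, 1]."""

    def __init__(self, n_arms, gamma):
        self.gamma = gamma                      # exploration rate in [0, 1]
        self.weights = np.ones(n_arms)

    def probabilities(self):
        w = self.weights / self.weights.sum()
        # mix the weight-based distribution with uniform exploration
        return (1 - self.gamma) * w + self.gamma / len(self.weights)

    def select(self, rng=np.random):
        return rng.choice(len(self.weights), p=self.probabilities())

    def update(self, arm, reward):
        p = self.probabilities()[arm]
        x_hat = reward / p                      # importance-weighted estimate
        self.weights[arm] *= np.exp(self.gamma * x_hat / len(self.weights))
```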

 

Observations

Fig 1: Average Seeker reward plotted over the number of plays. Each plot shows the reward trend for different values of γ used by the Seeker, with a constant γ used by the Hider. The blue, green, gold, and red trends show the convergence of the Seeker’s reward for Seeker γ values of 0.25, 0.5, 0.75, and 1 respectively.

  1. Hider uses γ = 0.25:

  • The first plot from the left depicts the scenario in which the Hider exploits with probability 75% and explores only 25% of the time. The blue trend shows the convergence of the reward earned by the Seeker when he explores just 25% of the time. Since the Hider is mostly exploiting, the Seeker can better predict the hiding sites and earns the highest reward even with little exploration.
  • As the Seeker chooses higher values of γ, his earnings tend to drop. The red trend is when the Seeker only explores; this strategy earns him the lowest average reward over all 100 games, as he fails to utilize the information he collects during exploration.

  2. Hider uses higher values of γ: as the Hider explores more, he seems to discover better hiding spots, since the maximum reward earned by the Seeker with any γ value drops relative to the first (left-most) plot. By the time the Hider is exploring 75% of the time, the Seeker’s average reward has dropped from about 0.95 to 0.85.

This also implies the Seeker is struggling to predict the Hider’s moves: as the Hider opts to explore more, he randomizes his actions.

  3. Hider only explores (γ = 1):

The last plot shows the reward trends when the Hider only explores and the Seeker chooses varying values of γ.

  • Surprisingly, no matter what γ value the Seeker chooses, the average reward remains about the same around convergence.
  • Also, in all plots, whenever the Seeker is purely exploring, the average reward is the same at about 0.8 regardless of the Hider’s γ value (this is the red trend in all plots). Even with partial information, the Seeker was able to achieve an average reward of about 0.8.

3.2 EXP3 Hider and FPL Seeker

In this experiment, we implement the FPL strategy for the seeker. The game is played under the settings described earlier, and results are recorded in a similar manner to allow for comparison. The hyperparameters chosen here are γ for the hider and η for the FPL seeker. We plot the reward trend over 100 games for different combinations of these hyperparameters, as shown in Fig 2.
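A minimal sketch of the FPL seeker described here (the names and the naive credit-assignment update are our simplifications; a rigorous bandit version would instead use geometric resampling as in FPL-GR [2]):

```python
import numpy as np

class FPLSeeker:
    """Follow-the-Perturbed-Leader sketch for choosing 2 of 5 sites.

    Cumulative per-site reward estimates are perturbed each round with
    exponential noise of scale eta, and the 2 sites with the largest
    perturbed totals are checked.  Larger eta means larger perturbations
    and hence more randomization in the seeker's choices.
    """

    def __init__(self, n_sites=5, n_flags=2, eta=1.0):
        self.estimates = np.zeros(n_sites)
        self.n_flags = n_flags
        self.eta = eta

    def select(self, rng=np.random):
        perturbed = self.estimates + rng.exponential(self.eta, size=len(self.estimates))
        return np.argsort(perturbed)[-self.n_flags:]          # top-2 sites

    def update(self, chosen_sites, found_flags):
        # naive credit assignment: +1 to each chosen site that held a flag
        for site in chosen_sites:
            if site in found_flags:
                self.estimates[site] += 1.0
```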

Observations:

  1. In the first three plots from the left (η = 1, 10, and 100), we observe that when the Hider chooses to purely explore, the Seeker tends to gain the least average reward. Unexpectedly, when η = 1000 and the Hider uses γ = 1 (pure exploration), the average Seeker reward is higher than the reward earned with all other γ values.

  2. When the Seeker uses η = 1, he earns a higher average reward than he could while using EXP3. Similar observations hold for η = 10 and 100 as well.

Fig 2: Average Seeker reward plotted over the number of plays. Each plot shows the reward trends for different values of γ used by the Hider, and a constant η used by the Seeker. The blue, green, gold, and red trends show the convergence of Seeker reward when Hider uses γ values: 0.25, 0.5, 0.75, and 1 respectively.

  3. Just as with EXP3, we see that the more adventurous the Hider is, the more unpredictable the hiding sites are and consequently the lower the Seeker’s average reward. For example, when the Hider’s γ is 0.25, he only explores 25% of the time and otherwise exploits sites that fetch him a higher reward; the Seeker is then better able to learn the hiding pattern, and his average reward is higher as well.
  4. As η increases from 1 to 1000, the Seeker’s average reward drops from about 1 to 0.85.
  5. Also, when η = 1000, the average reward our FPL Seeker earns is almost the same for all values of γ used by the Hider, implying that for larger exponential perturbations the Seeker becomes indifferent to the exploration/exploitation strategy of the Hider; the Hider is no longer able to limit the average reward earned.

4. WHEN THE ADVERSARY FALSIFIES REWARDS

Now suppose that the hider can lie about the positions of the true flags during the game. For example, if the seeker chooses sites 1 and 2 while the flags are in fact at sites 3 and 4, the hider can still claim that the seeker found the flags and received a reward of 2 (the hider can also lie about just 1 site instead of 2). In short, when the hider lies, she can report any arbitrary information. To restrict the hider’s power, we assume that she can lie at most B times during a game. The task is to design a faking strategy for the hider that manipulates the seeker into a low true performance at the end of the game (after the fake flags are revealed). We play the game with B = 10, 50, 100 for 500 rounds, repeated 100 times, and plot the seeker’s average reward (the total reward divided by the number of plays) over the number of plays, both before and after the reveal, for all three cases. We also plot the difference between the true total reward (without the fake flags) and the total reward the seeker thought she had received at the end of the game.

4.1 Lying Strategy

The overall program flow for each game is shown in Fig 3, and the lying strategy is depicted in Fig 4. It is a very simple strategy built with the intention of misleading or confusing the Seeker so that he updates his weights wrongly; the flowchart illustrates this, and a short code sketch follows the list below. There are three possibilities:

Fig 3. Program flow when adversary has provision to lie about the flag positions

  1. The Seeker is right about both positions: randomly pick one of the guesses to mislead the Seeker about, say guess 1. Let him think he is wrong: return a position vector with a flag at some other random position instead of at guess 1.
  2. The Seeker is wrong about both positions: randomly pick one of the guesses, say guess 1. Let him think he is right: return a position vector with a flag at guess 1 even though the flag is really at some other position.
  3. The Seeker is wrong about one position: pick the position he is wrong about and let him think he is right.
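The sketch below mirrors these three cases (a hypothetical helper that is only called while the hider still has lying budget left in the current round; names are ours):

```python
import random

def fake_positions(true_flags, guesses, n_sites=5):
    """Return the flag positions the hider reports when lying this round."""
    true_flags, guesses = set(true_flags), set(guesses)
    hits = guesses & true_flags

    if len(hits) == 2:
        # Case 1: right about both -> deny one hit and claim the flag
        # sits at some other, unchosen site instead.
        denied = random.choice(sorted(hits))
        elsewhere = random.choice(sorted(set(range(n_sites)) - guesses))
        return (true_flags - {denied}) | {elsewhere}
    if len(hits) == 0:
        # Case 2: wrong about both -> pretend one random guess was right.
        confirmed = random.choice(sorted(guesses))
        return {confirmed} | {next(iter(true_flags))}
    # Case 3: wrong about one -> pretend the wrong guess was also right,
    # so the seeker believes he found both flags.
    return guesses
```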

The idea is that if we make the Seeker think he is right even though he is wrong, he will give more importance to that position, i.e., update its weight wrongly, and therefore his future guesses will also tend to be wrong.

Fig 4. Flowchart of the lying strategy used by the hider

4.2 EXP3 Hider with falsifying powers against an EXP3 Seeker

Fig 5: Plots show the average reward accumulated by the Hider with falsifying powers (blue), as well as the true and fake reward accumulated by the Seeker (green is the true reward, black the false reward). Plots from left to right show the convergence when the Seeker is lied to B = 10, 50, 100, and 500 times per game, respectively.

Observations:

  1. When B = 10: the Hider can lie only 10 times per game and earns only slightly more than he would without faking the positions.
  2. When B is increased further: the gap in earnings between the two players widens slightly, as the Hider can better mislead the Seeker; this gap grows as B increases further.
  3. When B = 500: the Hider lies in every round, and this is when the Seeker earns the least reward. This is understandable, as he is misled with false information in every round. What is surprising is that, despite this amount of false information, the Seeker still reaches about 0.7 at convergence, where one might expect a much lower figure. One might also think that lying randomly even 10 times would be enough to make the Seeker update his weights wrongly and earn lower rewards, but it appears that lying, unless done in every round, does not really throw off the Seeker.

This is probably because the lying strategy is not sophisticated enough. It could be modified to steer the Seeker toward the sites he weights the least. In the current strategy we only redirect him to a random false location, whereas we could have made him look at the site he gives the least weight to, or at the site the Hider gives the least weight to. In other words, if we redirected him to a site where the Hider would be least likely to hide, he could be misled into making more mistakes in future rounds; a possible sketch of this refinement is given below.
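A possible sketch of that refinement, assuming the lying hider consults its own EXP3 site weights when choosing where to redirect the seeker (names are hypothetical):

```python
def least_likely_site(hider_weights, exclude):
    """Pick the site the hider itself weights lowest, outside `exclude`.

    Redirecting a 'confirmed' fake flag to such a site teaches the
    seeker exactly the wrong lesson about where flags tend to be hidden.
    """
    candidates = [s for s in range(len(hider_weights)) if s not in exclude]
    return min(candidates, key=lambda s: hider_weights[s])

# e.g. replace the random `elsewhere` choice in fake_positions with:
#   elsewhere = least_likely_site(hider.weights, exclude=guesses)
```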

Figure 5 reconfirms our understanding that lying in every single round fetches the maximum reward for the Hider, while lying for just 10 rounds is about as good as not lying at all. The reward accumulated by the Seeker decreases as the lying power of the Hider increases. This gap in earnings may be widened with a more sophisticated lying strategy, as mentioned earlier.

Fig 6. Difference in accumulated reward between the players for each type of game

5. REFERENCES

[1] G. Neu and G. Bartók. An efficient algorithm for learning with semi-bandit feedback. In ALT, pages 234–248, 2013.

[2] G. Neu and G. Bartók. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. Journal of Machine Learning Research, 17(154):1–21, 2016.

[3] H. Xu et al. Playing repeated security games with no prior knowledge. In AAMAS, 2016.

[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[5] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
