07/01/2024
By Monish Reddy Kotturu

The Richard A. Miner School of Computer & Information Sciences, Department of Computer Science, invites you to attend a Master's Thesis defense by Monish Reddy Kotturu on "Enhancing Team Performance in Multi-Agent Multi-Armed Bandits through Optimization."

Candidate name: Monish Reddy Kotturu
Date: Monday, July 1, 2024
Time: 11 a.m. ET
Location: DAN 309 and via Zoom

Thesis title: Enhancing Team Performance in Multi-Agent Multi-Armed Bandits through Optimization

Committee:

  • Reza Azadeh (Advisor), Miner School of Computer and Information Sciences, University of Massachusetts Lowell
  • Hadi Amiri, Miner School of Computer and Information Sciences, University of Massachusetts Lowell
  • Amanda Redlich, Department of Mathematics and Statistics, University of Massachusetts Lowell

Abstract:
The multi-armed bandit (MAB) problem involves sequential decision-making under uncertainty with the goal of maximizing an agent's cumulative reward. In reinforcement learning, the MAB problem provides a foundation for developing algorithms that tackle the exploration-exploitation trade-off. MABs are used in areas such as recommender systems, online advertising, dynamic pricing, and adaptive experimental design. In a multi-agent setting, the MAB problem can be extended to a team of agents cooperating to reach a consensus and maximize the team reward. Effective decision-making in Multi-Agent Multi-Armed Bandit (MAMAB) scenarios requires effective communication (i.e., sharing information) among agents; suboptimal communication can reduce team performance. This thesis studies and develops methods for improving team performance in MAMABs through optimization.
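To make the exploration-exploitation trade-off concrete, below is a minimal single-agent sketch using UCB1, a standard MAB baseline. The abstract does not name the algorithms the thesis evaluates, so the algorithm choice, the Bernoulli reward model, and all parameters here are purely illustrative.

```python
import math
import random

def ucb1(arm_means, horizon=1000):
    """Run UCB1 on a Bernoulli bandit and return the cumulative reward."""
    n = len(arm_means)
    counts = [0] * n            # pulls per arm
    values = [0.0] * n          # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1         # pull each arm once to initialize estimates
        else:
            # Exploit the empirical mean plus an exploration bonus that
            # shrinks as an arm is sampled more often.
            arm = max(range(n),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total

print(ucb1([0.2, 0.5, 0.8]))    # approaches 0.8 * horizon as horizon grows
```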

We define the team structure using a relational network: a graph that dictates how information is exchanged among agents and assigns weights (i.e., importance) to transmitted and received information. In teams governed by relational networks, one important step toward effective communication is finding optimal edge weights. The edge weight optimization problem can be formulated as a convex optimization problem whose solution gives relational weights that expedite consensus formation. We study the effects of various edge weight optimization algorithms in MAMABs. Our results show that in large, communication-constrained networks of agents, the time needed to reach a consensus can be shortened through optimization.
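The abstract does not spell out the exact convex program the thesis uses. One classic formulation of this kind of edge weight problem is fastest distributed averaging (in the style of Xiao and Boyd), sketched below with cvxpy on a hypothetical 5-agent path graph; the graph, constraints, and library choice are all illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

# Hypothetical 5-agent path graph; only adjacent agents may communicate.
n = 5
edges = {(0, 1), (1, 2), (2, 3), (3, 4)}

W = cp.Variable((n, n), symmetric=True)
ones = np.ones((n, 1))
constraints = [W @ ones == ones]               # rows sum to 1 (weights average neighbors)
for i in range(n):
    for j in range(i + 1, n):
        if (i, j) not in edges:
            constraints.append(W[i, j] == 0)   # no edge, no information flow

# Minimize the spectral norm of W - (1/n) * 11^T: the smaller this norm,
# the faster repeated mixing with W drives all agents to consensus.
objective = cp.Minimize(cp.norm(W - ones @ ones.T / n, 2))
cp.Problem(objective, constraints).solve()
print(np.round(W.value, 3))
```

Here the sparsity constraints encode the communication restrictions of the network, which is why optimized weights matter most in the large, communication-constrained regime the abstract mentions.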

A major shortcoming of the above experimental setting is that it assumes perfect agents playing a given MAB, which is rarely the case in real-world scenarios. Agents often possess unique abilities that influence their performance on specific tasks. To account for this variability, we introduce the notion of competency, defined as an agent's ability to find the optimal arm of a MAB with high probability within a finite time. We simulate agent competency by adding noise to agents' observations. In our experiments, we represent each agent by a vector of competencies that captures its performance in different scenarios: an agent can show higher competency when playing a set of MABs of a certain difficulty and lower competency when playing another. When agents with varying levels of competency work together, it is important to test the performance of the network as a whole over a longer period of time and across different problems or missions. We hypothesize that the optimization process can improve team performance when the team plays bandits of varying difficulty over a long horizon. To validate this hypothesis, we propose a long-term online optimization process in which, at each bandit stage, the team iteratively plays a batch of bandits and optimizes its edge weights based on the resulting team performance. Our results show that team performance can be substantially improved through this approach, which limits the spread of noisy information from less competent agents while feeding helpful information back to them.
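The thesis's actual optimizer is not described in the abstract; the random search in the sketch below is a crude stand-in used only to illustrate the play-a-batch-then-reoptimize loop. The epsilon-greedy learning rule, the Gaussian observation-noise model of competency, the 4-agent team, and every parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_batch(W, noise_std, n_arms=5, horizon=200, n_bandits=10, eps=0.1):
    """Mean per-pull team reward over a batch of bandits.

    Each agent keeps per-arm value estimates and acts epsilon-greedily;
    after every round the estimates are mixed through the weight matrix W
    (the consensus step). Competency is simulated, as in the abstract, by
    adding Gaussian observation noise with per-agent std `noise_std`."""
    n_agents = len(noise_std)
    total = 0.0
    for _ in range(n_bandits):
        means = rng.uniform(0, 1, n_arms)       # true arm means for this bandit
        Q = np.zeros((n_agents, n_arms))        # per-agent value estimates
        for _ in range(horizon):
            for a in range(n_agents):
                arm = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(Q[a]))
                r = means[arm] + rng.normal(0, 0.1)       # true reward earned
                obs = r + rng.normal(0, noise_std[a])     # noisy observation
                Q[a, arm] += 0.1 * (obs - Q[a, arm])
                total += r
            Q = W @ Q                            # consensus: mix estimates over the network
    return total / (n_bandits * horizon * n_agents)

# Hypothetical 4-agent team; agent 3 is much less competent (noisier observations).
noise_std = np.array([0.05, 0.05, 0.05, 1.0])
W = np.full((4, 4), 0.25)                        # start from uniform mixing weights

# Stand-in optimizer: random search that keeps a candidate weight matrix
# whenever it improves batch performance.
best = run_batch(W, noise_std)
for _ in range(30):
    cand = np.abs(W + rng.normal(0, 0.05, W.shape))
    cand /= cand.sum(axis=1, keepdims=True)      # keep rows stochastic
    score = run_batch(cand, noise_std)
    if score > best:
        W, best = cand, score
print(np.round(W, 2), best)
```

In a setup like this, one would expect the surviving weight matrices to place less weight on the noisy agent's estimates, mirroring the abstract's point about limiting the spread of noisy information from less competent agents.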