A Paper in Thousand Words: Dueling Network Architectures for Deep Reinforcement Learning


The paper can be found here.

Introduction and the Contribution

This paper’s main contribution is architectural. Instead of relying on standard neural network architectures such as Convolutional Neural Networks (CNNs) or Long Short-Term Memory networks (LSTMs) as-is (the Deep Q-Network, for example, uses a standard CNN), the authors introduce an alteration to the network architecture that is better suited for model-free reinforcement learning (RL). Importantly, the new network remains compatible with existing and future RL algorithms.

In a nutshell, this “alteration” is to share the convolutional features, split the fully connected layers into two separate modules, and then combine the two modules again to form a single output layer. “What in the world?” you might say. However, this particular design choice is justified. The figure below depicts the architecture and gives more intuition about how it is done.

Now, why does this “alteration” make sense? The primary objective of this deep model is the same as for any other network: it takes a high-dimensional input, i.e. an image from an Atari game (a state s_t), and spits out Q-values, i.e., the state-action values Q(s_t,a') for all actions a'\in A, where A is the action space. However, this state-action value Q(s_t,a') is made up of two main components.

  1. A state-dependent, action-independent value function V(s): the goodness of state s
  2. A state- and action-dependent advantage function A(s,a): the goodness of taking action a in state s

Looking more closely at these two entities, it becomes clear that they behave differently: V changes more slowly than A. Broadly speaking, V (being action-independent) is shaped by the outcomes of many actions, whereas A changes with the effect of a single action. And Q is simply the sum of these two entities (\theta, \alpha and \beta are just the parameters of the neural network).

(1)   \begin{equation*} Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s,a;\theta,\alpha) \end{equation*}

Now let’s put the ideas of the last two paragraphs together. If the learning model outputs Q, and Q is made up of two entities that behave differently, it makes sense to have separate sets of parameters learn these two entities (V and A) and then combine them at the output layer. This gives rise to the “Dueling Network Architecture”.

Duel Network
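To make the wiring concrete, here is a minimal PyTorch sketch of what such a two-stream network could look like. The layer sizes follow the usual DQN convolutional stack and are my own assumption, not something specified in the paper summary above; for now the two streams are combined with the naive sum of Equation (1).

```python
import torch
import torch.nn as nn

class DuelingNetwork(nn.Module):
    """Shared convolutional features, then separate value and advantage streams."""

    def __init__(self, num_actions: int):
        super().__init__()
        # Shared convolutional feature extractor (sizes assumed, DQN-style, 4x84x84 input).
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Value stream V(s; theta, beta): a single scalar per state.
        self.value = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1)
        )
        # Advantage stream A(s, a; theta, alpha): one value per action.
        self.advantage = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        v = self.value(h)          # shape (batch, 1)
        a = self.advantage(h)      # shape (batch, num_actions)
        return v + a               # naive combination from Equation (1)
```

The key point is that both streams sit on top of the same convolutional features, so the extra cost over a single-stream network is just one additional fully connected head.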

The loss function for the dueling architecture is the same as for the Deep Q-Network.

(2)   \begin{equation*} L_i(\theta_i) = E_{s,a,r,s'}[(y_i^{DQN} - Q(s,a;\theta_i))^2] \end{equation*}

where y_i^{DQN} = r + \gamma \max_{a'}Q(s',a';\theta^-). I’m not going to explain the notation here, but it is pretty standard in deep reinforcement learning. \theta^- comes from a “target network”, as in the Deep Q-Network paper.
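For concreteness, here is a rough sketch of how this loss could be computed in PyTorch. The batch unpacking and variable names (`states`, `actions`, `rewards`, `next_states`, `dones`) are my own hypothetical conventions; the only things taken from the paper are the form of Equation (2) and the use of a frozen target network for y_i^{DQN}.

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """DQN-style loss of Equation (2), with targets from a frozen target network."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta_i) for the actions that were actually taken.
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y^DQN = r + gamma * max_a' Q(s', a'; theta^-), zeroed on terminal transitions.
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    # Squared error between the target and the current estimate.
    return F.mse_loss(q, targets)
```

Because the dueling architecture only changes the network, this loss (and any other Q-learning variant) can be used unchanged.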

Issue of Unidentifiability

We now delve into a certain issue underlying this approach, known as the issue of unidentifiability. To see it, add a constant to the value V and subtract the same constant from A: the effect cancels out, right? This means V and A cannot be recovered uniquely from Q, which is not a good property to have, as the two entities serve two very different purposes (this is the best I can explain it; I’d happily welcome any theoretically backed-up explanations). To solve this, a “baseline”, an unbiased normalization term, is introduced into the calculation of Q, which turns the equation into,

(3)   \begin{equation*} Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + (A(s,a;\theta,\alpha)-b) \end{equation*}

where b is the baseline. In the paper they use b=\frac{1}{|A|}\sum_{a'}A(s,a';\theta,\alpha). I’m not exactly sure about the purpose of the baseline, but I assume it is motivated by this paper. So the baseline helps to solve the problem of unidentifiability, reduce variance, and converge faster (resource for theoretical understanding).
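In code, the fix amounts to changing one line in the aggregation step. Below is a minimal sketch of the mean-subtracted combination of Equation (3), operating directly on value and advantage tensors (the function name and shapes are my own choices for illustration).

```python
import torch

def combine_streams(v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Combine a value stream of shape (batch, 1) and an advantage stream of shape
    (batch, num_actions) as in Equation (3), using the mean advantage as baseline b."""
    return v + (a - a.mean(dim=1, keepdim=True))

# Quick check: shifting V up by a constant and A down by the same constant no longer
# cancels out, because the mean-subtraction removes the shift from A but not from V.
v = torch.zeros(1, 1)
a = torch.tensor([[1.0, 2.0, 3.0]])
print(combine_streams(v, a))              # tensor([[-1., 0., 1.]])
print(combine_streams(v + 5.0, a - 5.0))  # tensor([[4., 5., 6.]]) -- Q has changed
```

With this aggregation the decomposition becomes identifiable: for a given set of Q-values, the value stream equals the mean of Q over actions, and the advantages are forced to have zero mean.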

Experiments

Corridor Navigation

The first experiment is to navigate a simple corridor composed of two vertical sections (10 units each) and one horizontal section (50 units). The agent starts at one end and must navigate to the other end. Actions include going left, right, up, down, and a no-op. Three different experiments were performed by augmenting the action space to sizes 5, 10 and 20 through the addition of redundant no-op actions. This simple experiment is designed to show that the dueling network converges effectively with a large action space, whereas the standard single-stream network converges more slowly (Figure 3 in the paper).

Figure from the original paper

Atari Games

To evaluate the performance of the Duel Network in the arcade game environment, the following performance measure is used.

(4)   \begin{equation*} \frac{\text{Score}_{Agent}-\text{Score}_{Baseline}}{\max(\text{Score}_{Human},\text{Score}_{Baseline})-\text{Score}_{Random}} \end{equation*}

where \text{Score}_{Human} is the best human performance (the median score after playing the game for 2 hours), \text{Score}_{Baseline} is the performance of a baseline agent that performs well, and \text{Score}_{Random} is the score obtained by sampling actions uniformly at random.
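As a small sanity check, the measure can be computed directly from raw scores; this sketch just transcribes Equation (4), with argument names and example values of my own choosing (not figures from the paper).

```python
def normalized_score(agent, baseline, human, random):
    """Normalized improvement over a baseline agent, as in Equation (4)."""
    return (agent - baseline) / (max(human, baseline) - random)

# Example: an agent scoring 120 where the baseline scores 100, the best human 150,
# and a random policy 10 gets a normalized improvement of about 0.14 (i.e., 14%).
print(normalized_score(agent=120, baseline=100, human=150, random=10))
```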

Experiments include comparisons against two baselines: the Double Deep Q-Network (DDQN) and Prioritized DDQN. The experiments indicate that the dueling network outperforms both baselines in a majority of games (Figures 4 and 5 in the paper).

Figure from the original paper

Conclusion

I find the idea behind the paper quite interesting: exploiting the formulation of the state-action value and incorporating it into the neural network architecture makes the learning more effective. However, I was thrown off by the following paragraph in the paper (page 4, 3^{rd} from last).
However, we need to keep in mind that Q is only a parameterized estimate of the true Q-function. Moreover, it would be wrong to conclude that V is a good estimator of the state-value function, or likewise that A provides a reasonable estimate of the advantage function.

Well, this paragraph (in my opinion) undercuts all the arguments they built up to justify their method. They basically say that splitting the network to learn V and A separately helps it outperform the standard architectures, but then they say it would be wrong to treat the two streams as reasonable estimators of V and A. (¯\_(ツ)_/¯)

Also, I think it would be interesting to see this model applied to some real-world control problems instead of just Atari games.