Hi Phil,
I implemented your DuelDDQN architecture myself, and I was curious about the following snippet from the learning function, since my question wasn't covered in the course.
```python
q_pred = T.add(V_s,
               (A_s - A_s.mean(dim=1, keepdim=True)))[indices, actions]
q_next = T.add(V_s_, (A_s_ - A_s_.mean(dim=1, keepdim=True)))
q_eval = T.add(V_s_eval, (A_s_eval - A_s_eval.mean(dim=1, keepdim=True)))
```
Why is it that only `q_pred` is indexed with the actions tensor? Is it because it represents the actions we just took in the current state? And are all of these Q-value matrices of the same dimensions?
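For reference, here is a minimal, self-contained sketch of the shapes I believe are involved. The sizes, random tensors, and variable names other than those in the snippet are placeholders for illustration, assuming `V_s` is the state-value head output and `A_s` is the advantage head output of the dueling network:

```python
import torch as T

# Hypothetical sizes, for illustration only
batch_size, n_actions = 4, 6

# Dueling network heads: one scalar state value per sample,
# one advantage per action per sample
V_s = T.randn(batch_size, 1)          # shape [batch_size, 1]
A_s = T.randn(batch_size, n_actions)  # shape [batch_size, n_actions]

# Centering the advantages over the action dimension and broadcast-adding
# V_s yields the full Q-value matrix, shape [batch_size, n_actions]
q_full = T.add(V_s, A_s - A_s.mean(dim=1, keepdim=True))

# indices = 0..batch_size-1; actions = the action taken in each sampled
# transition. Fancy indexing selects one Q-value per row, so the result
# is a 1-D tensor of shape [batch_size]
indices = T.arange(batch_size)
actions = T.randint(0, n_actions, (batch_size,))
q_pred = q_full[indices, actions]

print(q_full.shape)  # torch.Size([4, 6])
print(q_pred.shape)  # torch.Size([4])
```

So, as I understand it, all three additions produce matrices of the same `[batch_size, n_actions]` shape, but the `[indices, actions]` indexing reduces `q_pred` to one value per transition.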