Reviewing Guidelines for ICLR 2020

To make this concise, we are providing three main guidelines to keep in mind when reviewing.

Answer three key questions for yourself, to make a decision to Accept or Reject:
1. What is the specific question/problem tackled by the paper?
2. Is the approach well motivated, including being well-placed in the literature?
3. Does the paper support the claims? This includes determining if results, whether theoretical or empirical, are correct and if they are scientifically rigorous.

Organize your review as follows:
1. Summarize what the paper claims to do/contribute. Be positive and generous.
2. Clearly state your decision (accept or reject) with one or two key reasons for this choice.
3. Provide supporting arguments for the reasons for the decision.
4. Provide additional feedback with the aim to improve the paper. Make it clear that these points are here to help, and not necessarily part of your decision assessment.
5. Ask questions you would like the authors to answer that would help clarify your understanding of the paper and provide the additional evidence you need to be confident in your assessment.

For in-depth advice on reviewing, see these resources:

Daniel Dennett, Criticising with Kindness.
Comprehensive advice: Mistakes Reviewers Make
Views from multiple reviewers: Last minute reviewing advice
Perspective from instructions to Area Chairs: Dear ACs.

Sample Reviews

Here are two sample reviews from previous conferences that illustrate what we consider a good review in the leaning-to-accept and leaning-to-reject cases.

Review for a Paper where Leaning-to-Accept

This paper proposes a method, Dual-AC, for optimizing the actor (policy) and critic (value function) simultaneously, which takes the form of a zero-sum game, resulting in a principled method for using the critic to optimize the actor. In order to achieve that, they take the linear programming approach of solving the Bellman optimality equations, outline the deficiencies of this approach, and propose solutions to mitigate those problems. The discussion on the deficiencies of the naive LP approach is mostly well done. Their main contribution is extending the single step LP formulation to a multi-step dual form that reduces the bias and makes the connection between policy and value function optimization much clearer without losing convexity by applying a regularization. They perform an empirical study in the Inverted Double Pendulum domain to conclude that their extended algorithm outperforms the naive linear programming approach without the improvements. Lastly, there are empirical experiments done to conclude the superior performance of Dual-AC in contrast to other actor-critic algorithms.

Overall, this paper could be a significant algorithmic contribution, with the caveat that some clarifications on the theory and experiments are needed. Given these clarifications in an author response, I would be willing to increase the score.

For the theory, a few steps need clarification, as does the novelty of the results. It is unclear if Theorem 2 and Theorem 3 are both being stated as novel results. It looks like Theorem 2 has already been shown in “Randomized Linear Programming Solves the Discounted Markov Decision Problem in Nearly-Linear Running Time”. There is a statement that “Chen & Wang (2016); Wang (2017) apply stochastic first-order algorithms (Nemirovski et al., 2009) for the one-step Lagrangian of the LP problem in reinforcement learning setting. However, as we discussed in Section 3, their algorithm is restricted to tabular parametrization”. Is your Theorem 2 somehow an extension? Is Theorem 3 completely new?

This is particularly called into question due to the lack of assumptions about the function class for value functions. It seems like the value function is required to be able to represent the true value function, which can be almost as restrictive as requiring tabular parameterizations (which can represent the true value function). This assumption seems to be used right at the bottom of Page 17, where U^{pi*} = V^*. Further, eta_v must be chosen to ensure that it does not affect (constrain) the optimal solution, which implies it might need to be very small. More about conditions on eta_v would be illuminating.

There is also one step in the theorem that I cannot verify. On Page 18, how is the square removed from the difference between U and Upi? The transition from the second line of the proof to the third line is not clear. It would also be good to state more clearly on Page 14 how you get the first inequality, for || V^* ||_{2,mu}^2.

For the experiments, the following should be addressed.

1. It would have been better to also show the performance graphs with and without the improvements for multiple domains.

2. The central contribution is extending the single step LP to a multi-step formulation. It would be beneficial to empirically demonstrate how increasing k (the multi-step parameter) affects the performance gains.

3. Increasing k also comes at a computational cost. I would like to see some discussion of this and of how long Dual-AC takes to converge in comparison to the other algorithms tested (PPO and TRPO).

4. The authors concluded the presence of local convexity based on Hessian inspection due to the use of path regularization. It was also mentioned that increasing the regularization parameter size increases the convergence rate. Empirically, how does changing the regularization parameter affect the performance in terms of reward maximization? In the experimental section of the appendix, it is mentioned that multiple regularization settings were tried but their performance is not mentioned. Also, for the regularization parameters that were tried, based on Hessian inspection, did they all result in local convexity? A bit more discussion on these choices would be helpful.

Minor comments:

1. Page 2: In equation 5, there should not be a 'ds' in the dual variable constraint.

Review for a Paper where Leaning-to-Reject

This paper introduces a variation on temporal difference learning for the function approximation case that attempts to resolve the issue of over-generalization across temporally-successive states. The new approach is applied to both linear and non-linear function approximation, and for prediction and control problems. The algorithmic contribution is demonstrated with a suite of experiments in classic benchmark control domains (Mountain Car and Acrobot), and in Pong.

This paper should be rejected because (1) the algorithm is not well justified either by theory or practice, (2) the paper never clearly demonstrates the existence of the problem they are trying to solve (nor differentiates it from the usual problem of generalizing well), (3) the experiments are difficult to understand, missing many details, and generally do not support a significant contribution, and (4) the paper is imprecise and unpolished.

Main argument

The paper does not do a great job of demonstrating that the problem it is trying to solve is a real thing. There is no experiment in this paper that clearly shows how this temporal generalization problem is different from the need to generalize well with function approximation. The paper points to references to establish the existence of the problem but, for example, the Durugkar and Stone paper is a workshop paper, and the conference version of that paper was rejected from ICLR 2018, with the reviewers highlighting serious issues with the paper; that is not work to build upon. Further, the paper under review here claims this problem is most pressing in the non-linear case, but the analysis in section 4.1 is for the linear case.

The resultant algorithm does not seem well justified, and has a different fixed point than TD, but there is no discussion of this other than section 4.4, which does not make clear statements about the correctness of the algorithm or what it converges to. Can you provide a proof or any kind of evidence that the proposed approach is sound, or how its fixed point relates to TD?

The experiments do not provide convincing evidence of the correctness of the proposed approach or its utility compared to existing approaches. There are so many missing details it is difficult to draw many conclusions:

1) What was the policy used in exp1 for policy evaluation in MC?

2) Why Fourier basis features?

3) In MC with DQN how did you adjust the parameters and architecture for the MC task?

4) Was the reward in MC and Acrobot -1 per step or something else?

5) How did you tune the parameters in the MC and Acrobot experiments?

6) Why so few runs in MC? None of the results presented are significant.

7) Why is the performance so bad in MC?

8) Did you evaluate online learning or do tests with the greedy policy?

9) How did you initialize the value functions and weights?

10) Why did you use experience replay for the linear experiments?

11) In MC and Acrobot, why only a one-layer MLP?

Ignoring all that, the results are not convincing. Most of the results in the paper are not statistically significant. The policy evaluation results in MC show little difference to regular TD. The Pong results show DQN is actually better. This makes the reader wonder if the results with DQN on MC and Acrobot are only worse because you did not properly tune DQN for those domains, whereas the default DQN architecture is well tuned for Atari and that is why your method is competitive in the smaller domains.

The differences in the “average change in value plots” are very small if the rewards are -1 per step. Can you provide some context to understand the significance of this difference? In the last experiment (linear FA and MC), the step-size is set equal for all methods; this is not a valid comparison. Your method may just work better with alpha = 0.1.

The paper has many imprecise parts; here are a few:

1) The definition of the value function would be an approximation, not an equality, unless you specify some properties of the function approximation architecture. Same for the Bellman equation.

2) equation 1 of section 2.1 is neither an algorithm nor a loss function

3) TD does not minimize the squared TD error. Saying that this is the objective function of TD learning is not true

4) end of section 2.1 says “It is computed as” but the following equation just gives a form for the partial derivative

5) equation 2, x is not bounded

6) You state TC-loss has an unclear solution property. I don’t know what that means, and I don’t think your approach is well justified either

7) Section 4.1 assumes linear FA, but it’s implied up until paragraph 2 that linearity has not been assumed

8) treatment of n_t in alg differs from appendix (t is no time episode number)

9) Your method has a n_t parameter that is adapted according to a schedule seemingly giving it an unfair advantage over DQN.

10) Over-claim not supported by the results: “we see that HR-TD is able to find a representation that is better at keeping the target value separate than TC is”. The results do not show this.

11) Section 4.4 does not seem to go anywhere or produce any tangible conclusions

Things to improve the paper that did not impact the score:

0) It’s hard to follow how the prox operator is used in the development of the alg; this could use some higher-level explanation

1) Intro p2 is about bootstrapping, use that term and remove the equations

2) It’s not clear why you are talking about stochastic vs deterministic in P3

3) Perhaps you should compare against a MC method in the experiments to demonstrate the problem with TD methods and generalization

4) Section 2: “can often be a regularization term” >> can or must be?

5) update law is an odd term

6) “tends to alleviate” >> odd phrase

7) section 4 should come before section 3

8) Alg 1 is not helpful because it just references an equation

9) section 4.4 is very confusing, I cannot follow the logic of the statements

10) Q learning >> Q-learning

11) Not sure what you mean with the last sentence of p2 section 5

12) Where are the results for Acrobot with linear function approximation?

13) appendix Q-learning with linear FA is not DQN (table 2)