Checks, double checks and triple checks

Dear readers,

As we come toward the period for author responses, we thought we’d update everyone about the status of the review process. This year, we adopted a short initial review cycle of two weeks to ensure that the reviews would be completed with enough time for quality assurance checks before authors see the reviews. The longer discussion period also helped reviewers read and check each other’s reviews (check). Jointly with 61 Area Chairs, Regina and I focused our efforts along three dimensions: (1) chasing late reviewers, (2) ACs’ manual checking of the reviews, and (3) PC chairs’ global checking of the reviews using scripts.

  1. Chasing Late Reviews. While the vast majority of reviews were completed on time, around 1% of reviews were not delivered even within a week of the deadline. For those papers, we had to ask area chairs to either find new reviewers or review the papers themselves. We are happy to say that we now have all 3900+ reviews in. We have collected a list of reviewers who didn’t deliver (and didn’t notify us) for future PC chairs’ consideration.
  2. AC Manual Checks. Each and every paper’s reviews were read personally by at least one area chair and vetted for quality (double check). This year we recruited many ACs, which made it manageable for each of them to closely supervise a cohort of papers. The goal of this check was to identify reviews with vague statements or unclear questions to the authors, and also to closely monitor cases of inconsistency across reviewers. To resolve these issues, ACs started discussions among reviewers and provided direct feedback to those who needed to change their reviews.
  3. PC Chairs’ Programmatic Checks for Quality Assurance. We implemented a spreadsheet that downloaded all of the reviews (for all 2200+ long and 1700+ short submissions) to check on the status by area. We ran status checks for inconsistent reviews (submissions where the review scores had a standard deviation above 1.5; about 3% initially), ones that had low-confidence reviews (confidence score of 2 or 1; about 15% initially), and ones where at least one review was particularly short (50 words or less; about 3% initially), and flagged these for ACs to do a final round of checking (over 200 submissions; triple check). A minimal sketch of this flagging logic appears after this list. The excellent AC crew were already aware of most of these problems, but it definitely helped to have this layer of consistent checks implemented across the board.
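
For readers curious about what such a script looks like in practice, here is a minimal sketch of the flagging logic in item 3, assuming a simple list-of-dicts layout per submission; the field names ("overall_score", "confidence", "text") are hypothetical and not the actual Softconf export format.

```python
# Minimal sketch of the quality-assurance flags described in item 3.
# The thresholds (std dev > 1.5, confidence <= 2, <= 50 words) come from
# the post; the data layout is an assumption for illustration.
from statistics import pstdev

def flag_submission(reviews):
    """Return the QA flags raised by one submission's set of reviews."""
    flags = []
    scores = [r["overall_score"] for r in reviews]
    if len(scores) > 1 and pstdev(scores) > 1.5:
        flags.append("inconsistent scores (std dev above 1.5)")
    if any(r["confidence"] <= 2 for r in reviews):
        flags.append("low-confidence review (1 or 2)")
    if any(len(r["text"].split()) <= 50 for r in reviews):
        flags.append("very short review (50 words or fewer)")
    return flags

# Example: one submission with three (made-up) reviews.
example = [
    {"overall_score": 1.0, "confidence": 2, "text": "Interesting idea but the evaluation is weak."},
    {"overall_score": 5.0, "confidence": 4, "text": "Strong results and analysis. " * 20},
    {"overall_score": 4.0, "confidence": 3, "text": "Solid contribution overall. " * 20},
]
print(flag_submission(example))
```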

Inevitably, some authors will not be happy with the reviews that they receive, and that is statistically expected. We are writing this post to let you know the extent of the steps and checks that the reviewers, ACs, and PC chairs implemented to ensure that ACL remains the top venue for peer-reviewed, published work in NLP and CL. Also, as any reviewer knows, while the reviews themselves give authors their feedback, the discussions among ACs and peer reviewers are also an unseen and often significant source of work that adds to the quality of the program. Through checks #1 and #2, these discussions (which authors are not privy to) strongly affect many reviews before they are released to submission authors. Indeed, there are some papers where the peer review discussions rival the length and quality of the reviews themselves!

We are looking forward to submission authors’ responses over the next few days, and will be posting about this shortly.

P.S. (edit: quartile info added as of March 15) Below we provide some statistics on the scores per area at this midpoint juncture (excluding Biomedical and Speech, which had fewer than 10 submissions in each of the long/short categories). Hopefully, these will help you to put your own scores into perspective when the initial reviews are released in the next day:

[Screenshot: table of per-area score statistics, 15 March 2017]
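
As a rough illustration only, statistics like those in the table can be computed from a score export with a few lines of pandas; the file name and column names below ("area", "paper_type", "avg_score") are hypothetical, not the chairs’ actual export.

```python
# Hypothetical sketch: per-area quartiles of the average review score,
# excluding area/paper-type groups with fewer than 10 submissions.
import pandas as pd

scores = pd.read_csv("submission_scores.csv")  # hypothetical export file

# Drop area/type groups with fewer than 10 submissions (e.g., Biomedical, Speech).
group_size = scores.groupby(["area", "paper_type"])["avg_score"].transform("size")
eligible = scores[group_size >= 10]

# 25th/50th/75th percentile of the average score per area and paper type.
quartiles = (
    eligible.groupby(["area", "paper_type"])["avg_score"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```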

19 thoughts on “Checks, double checks and triple checks”

  1. One striking issue with the table is the lower ratings of short papers compared to long papers, which apparently cuts across all tracks. Here are some possible explanations I can think of:

    – Authors are submitting lower quality work as short papers.
    – Reviewers are applying inappropriate criteria when judging short papers.
    – There is a mismatch of expectations between authors and reviewers regarding short papers.

    It could be a combination of these factors, or other reasons, but either way it appears we have a problem: I don’t think it’s good for ACL to have a submission category that fares consistently worse than another. I’m not sure how to solve the problem, but in future calls, it’s probably a good idea to clarify (to both authors and reviewers) what the expectations are of short papers.


    1. Great points, Ron!

      Also, in my experience, the (overall) scores tend to be somewhat disconnected from the review texts, with the latter usually being more positive.

      Alas, the scores are what decides the final outcome!


  2. Thanks for another interesting post! A couple of questions: 1) Did you mean “% scores > 4” and “% scores < 3” to be inclusive (>= and <=)? Otherwise, what are the “% remaining”? 2) You say: “Hopefully, these will help you to put your own scores into perspective when the initial reviews are released in the next day.” However, we didn’t actually receive any scores with our reviews, just the text of the reviews! Is this a mistake? It would be really helpful to have the numeric scores in order to gauge how likely it is that the paper might be accepted, especially since the turn-around time between final ACL decisions and the EMNLP deadline is so short this year.


    1. Min and I are currently working with Softconf to configure the system to show the scores to the authors. I hope that it will be available shortly; please stay tuned. We decided to release the reviews before Softconf fixes this issue on their side, so that authors can start thinking about their replies.

      About your question related to the range of the scores: we divided it into three brackets based on the average score: <=3 (likely rejects), >=4 (likely accepts), and the rest of the papers (>3 and <4). The latter category contains the papers that will generate most of the discussion.
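
      In code form, the brackets just described amount to a simple threshold check; this tiny sketch is purely illustrative (the function name is made up, and the boundary handling at exactly 3 and 4 follows the description above):

      ```python
      # Illustrative only: the three brackets described above, applied to a
      # submission's average review score.
      def score_bracket(avg_score):
          if avg_score <= 3:
              return "likely reject"
          if avg_score >= 4:
              return "likely accept"
          return "borderline (> 3 and < 4): expected to generate most discussion"

      print(score_bracket(3.67))  # -> borderline
      ```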


  3. Thanks for these interesting stats.

    Are the reviewers’ scores final, or may the authors’ responses affect these scores?

    Will the final decision be based on the recommendation score only, or will the other scores be used as well?


    1. We certainly hope that, based on your responses, at least some reviews and the corresponding scores will be modified. We don’t actually know how often this happens in reality. This year we saved a copy of all the reviews and scores before the rebuttal. We will compare how they change after the reviewers see your responses, and report back to you.

      Note that this year, in addition to the standard response to the reviewers, we added an additional communication channel: a direct line to the ACs that is not seen by the reviewers. So if you think that your paper was mishandled in some way, please let the ACs know.


  4. Hi, I think the Softconf system could send us a message saying that our rebuttal was delivered successfully, because when I log in now, after the rebuttal phase, I can only see “at this time, there are no action items available for this submission.”

    It would be great in the future, I guess.


    1. Hi Diego:

      Thanks for your helpful input! Yes, we will be asking Softconf to add such functionality so that authors can be assured that their rebuttal was properly received by the system.


  5. Hi, can I ask how the chairs decide whether papers are accepted or not?
    I imagine a process like this:
    First, the chairs rank all papers by score in descending order, then find the borderline score based on the acceptance rate; papers with scores above the borderline score are accepted, and papers at the borderline score are discussed individually to decide whether they are accepted or not.

    Is that right?


    1. Hi Frank, thanks for your email. Regina and I will definitely be writing more about the accept/reject decision process that we will be going through in the next week or so, as this is one of the most mystifying parts of organizing a conference (“how did my paper with average score XX, which is better than a colleague’s YY, get rejected when the other paper got accepted?”).

      However, let me state up front that while scores are certainly a very useful indicator of quality, they are certainly not the sole basis for a decision. Area chair teams will be asked to come up with rankings for clear and borderline accepts, not solely based on scores, and Regina and I will be using these as input to organise the final programme. As with many projects that have multiple “objective functions” (quality, diversity, etc.), the balancing act is difficult and, ultimately, subjective. We hope to give the community a bit more transparency into our thought process as we proceed with these decisions.


      1. Thanks for your reply. But I think the recommendation score is the most important factor in deciding a paper’s fate, because it is the fairest way for the authors; the only concern is how to maintain the diversity of papers.

