Average rating: 5 of 5
Level of importance: 5 of 5
Level of validity: 5 of 5
Level of completeness: 5 of 5
Level of comprehensibility: 5 of 5
Competing interests: None
The paper makes a valuable contribution to the analysis of diversity data by providing a blueprint for constructing a data set with minimal sources of bias and then analyzing that data set with methods that are easy to interpret and statistically sound. I comment on these aspects of the paper, not on the conclusions of the study.
The authors took great care in constructing their data set, reducing many of the ways an analysis might lead one astray. The details provided in the paper are good guidance for other researchers. For instance, questions asked during a seminar had to be labelled according to type. The authors had several raters label each question to reduce any rater-induced bias in the labels (“two raters reviewed each video and a third rater resolved differences”). In addition, the labels were carefully defined; Table 2 is the outcome of iteration, collaboration, and discussion to make sure that all raters were on the same page. Kudos also to whoever provided the funding for this effort; I’m sure it took a lot of time.
The authors also took great care in analyzing and interpreting the data. They calculate p-values using a randomization scheme, which makes their p-values valid regardless of the underlying distribution of the data. In contrast, p-values from classical parametric methods rely on distributional assumptions that often are not satisfied, and are not even approximately satisfied when sample sizes are small. If the distributional assumptions are not satisfied, the p-values are not valid, and the conclusions are suspect. This randomization method for calculating a p-value can be applied to any test statistic, and the authors have considered several. They provide excellent detail on the importance of this technique. The only criticism one can make is that, sadly, randomization methods don’t have as much power as parametric methods; this may be the reason the paper’s results were null (no differences detected).
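To make the idea concrete for readers who have not used randomization tests, here is a minimal sketch of a two-sample permutation test in Python. This is my own illustration, not the authors’ code; the data, the group names, and the choice of a difference-in-means statistic are hypothetical.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    The observed statistic is compared against the distribution obtained
    by repeatedly reassigning the pooled observations to the two groups
    at random, which requires no distributional assumptions.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    observed = abs(np.mean(group_a) - np.mean(group_b))

    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        stat = abs(np.mean(pooled[:n_a]) - np.mean(pooled[n_a:]))
        if stat >= observed:
            count += 1

    # Include the observed arrangement itself so the p-value is never zero.
    return (count + 1) / (n_permutations + 1)

# Hypothetical example: counts of questions asked of two groups of speakers.
questions_group_1 = np.array([3, 5, 2, 6, 4, 7])
questions_group_2 = np.array([4, 6, 5, 8, 7, 9])
print(permutation_p_value(questions_group_1, questions_group_2))
```

The same loop works with any other test statistic (a difference in medians, a count of interruptions, and so on), which is the flexibility the authors exploit.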
I am not familiar with all of the literature in this area, so I really can’t comment on that (I was forced to make a rating, and simply used quantity as a metric!).
The paper is well written, well organized, and an interesting read.
One part of the paper, a very small part, was a little disappointing and stood in contrast to the rest, which was so carefully laid out. This is section 5.1, “Are interruptions bad?” That is a very interesting question, of course. The authors write “Table 1 shows that the proportion of female pre-tenure faculty in CEE, EECS, and IEOR is higher than the proportion of women in their applicant pools. These departments also spent more time questioning women than men.” I’m not sure what to take from this. The statement relates past hiring practices to current questioning practices, which is a questionable way to address the question “Are interruptions bad?”. The CEE department data seem to provide a more direct way to answer it, since we read “In CEE, faculty presenters who received offers generally were asked more questions during their talk than presenters who did not receive offers.” There is no statistical analysis here, which is OK, I guess, since all of this is in the discussion. But I feel the authors should add some cautionary remarks here about drawing any conclusions.