A Closer Look at AUROC and AUPRC under Class Imbalance

Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, Jack Gallifant

How Should We Prioritize Fixing Model Mistakes?

This paper critically examines the widely held belief in machine learning (ML) that the area under the precision-recall curve (AUPRC) is superior to the area under the receiver operating characteristic (AUROC) for binary classification tasks in class-imbalanced scenarios. Through novel mathematical analysis, it demonstrates that AUPRC is not inherently superior and may even be detrimental due to its tendency to overemphasize improvements in subpopulations with more frequent positive labels, potentially exacerbating algorithmic biases.

Using Atomic Mistakes

Atomic mistakes occur when neighboring samples, ordered by model score, are out of order with respect to the classification label. AUROC improves by a constant amount no matter which atomic mistake is corrected; AUPRC improves by amounts that decrease as model score decreases, due to its dependence on the model's firing rate (Theorem 1).
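This asymmetry can be checked numerically. The sketch below is not the paper's code: it estimates AUPRC via average precision, takes binary labels listed in descending score order with no ties, and treats "fixing an atomic mistake" as swapping an adjacent negative/positive pair.

```python
def auroc(ranked):
    """AUROC for binary labels listed in descending score order (no ties)."""
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    # Count (positive, negative) pairs where the positive outranks the negative.
    wins = sum(ranked[i + 1:].count(0) for i, y in enumerate(ranked) if y == 1)
    return wins / (n_pos * n_neg)

def auprc(ranked):
    """Average precision (a common AUPRC estimator) for the same input format."""
    tp, ap = 0, 0.0
    for k, y in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += tp / k
    return ap / tp

# Labels in descending score order; atomic mistakes at indices (1,2) and (5,6).
base = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
fix_high = base.copy(); fix_high[1], fix_high[2] = fix_high[2], fix_high[1]
fix_low  = base.copy(); fix_low[5],  fix_low[6]  = fix_low[6],  fix_low[5]

print(auroc(fix_high) - auroc(base))  # 1/21 -- identical to the low-score fix
print(auroc(fix_low)  - auroc(base))  # 1/21
print(auprc(fix_high) - auprc(base))  # ~0.111 -- far larger than the low-score fix
print(auprc(fix_low)  - auprc(base))  # ~0.024
```

Both fixes buy the same AUROC gain (1/(n_pos * n_neg)), while the AUPRC gain from fixing the high-scoring mistake is several times larger, consistent with Theorem 1.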

Atomic Mistakes Diagram

Different types of mistakes a model can learn to fix. y = 0 is the negative class and y = 1 is the positive class; a = 0 is subgroup 1 and a = 1 is subgroup 2.

Which mistake you should prioritize fixing first depends on how the model will be used. In a classification setting, where you do not know whether the sample of interest comes from a high-scoring or a low-scoring region, you want a metric that values improvements uniformly across the score range, like AUROC. In a single-stream retrieval setting, where you select the top-k samples regardless of group membership and evaluate on those, a metric that favors fixing mistakes in high-scoring regions, like AUPRC, will be most impactful. But if you care about retrieving the top-k samples from multiple distinct subpopulations within your dataset, AUPRC is dangerous, as it will favor the higher-prevalence subpopulation.
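A toy illustration of the multi-group danger (the groups, scores, and sizes below are hypothetical, not from the paper): when a low-prevalence group's calibrated scores sit below a high-prevalence group's, a single global top-k cutoff can exclude the low-prevalence group's best-ranked positives entirely, while per-group retrieval does not.

```python
# (group, label, score): group "A" has high prevalence, so its calibrated
# scores sit above group "B"'s, even for B's within-group best positives.
samples = [
    ("A", 1, 0.90), ("A", 1, 0.85), ("A", 1, 0.80),
    ("A", 0, 0.60), ("A", 0, 0.55),
    ("B", 1, 0.50), ("B", 1, 0.45),   # B's top-ranked positives
    ("B", 0, 0.20), ("B", 0, 0.15), ("B", 0, 0.10),
]

# Single-stream retrieval: one global top-k cutoff over all samples.
k = 3
global_topk = sorted(samples, key=lambda s: -s[2])[:k]
print([g for g, y, s in global_topk])  # ['A', 'A', 'A'] -- group B is shut out

# Per-group retrieval: take each subpopulation's best sample separately.
per_group_top1 = {
    g: max((s for s in samples if s[0] == g), key=lambda s: s[2])
    for g in ("A", "B")
}
print(per_group_top1["B"])  # ('B', 1, 0.5) -- B's best positive is recovered
```

A metric that rewards improvements only near the global top of the ranking, as AUPRC does, cannot see the quality of group B's within-group ordering in the first case.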

Optimizing AUPRC Introduces Disparities


Optimizing overall AUROC.


Optimizing overall AUPRC.

Comparison of the impact of optimizing for overall AUROC versus overall AUPRC on the per-group AUROC and AUPRC of two groups in a synthetic setting, using the sequential atomic-mistake-fixing optimization procedure. Left: fixing atomic mistakes to optimize overall AUROC. Right: fixing atomic mistakes to optimize overall AUPRC.

These figures demonstrate the impact of the optimization metric on subpopulation disparity. In particular, on the right we observe a notable disparity introduced when optimizing under the AUPRC metric: performance across the high- and low-prevalence subpopulations diverges significantly as the optimization process favors the higher-prevalence group. In comparison, when optimizing for overall AUROC (left), the AUROC and AUPRC of both groups increase together.
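The sequential atomic-mistake-fixing procedure can be sketched as a greedy loop. This is a minimal single-group reimplementation under an assumed average-precision estimate of AUPRC, not the authors' code:

```python
def average_precision(ranked):
    """AUPRC estimate for binary labels listed in descending score order."""
    tp, ap = 0, 0.0
    for k, y in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += tp / k
    return ap / tp

def greedy_fix(ranked, metric):
    """Repeatedly apply the adjacent swap that most improves `metric`."""
    ranked = list(ranked)
    order_of_fixes = []
    while True:
        # Atomic mistakes: a negative ranked immediately above a positive.
        mistakes = [i for i in range(len(ranked) - 1)
                    if ranked[i] == 0 and ranked[i + 1] == 1]
        if not mistakes:
            return ranked, order_of_fixes
        def gain(i):
            swapped = ranked.copy()
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            return metric(swapped) - metric(ranked)
        best = max(mistakes, key=gain)
        ranked[best], ranked[best + 1] = ranked[best + 1], ranked[best]
        order_of_fixes.append(best)

base = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
final, fixes = greedy_fix(base, average_precision)
print(fixes[0])  # 1 -- AUPRC-greedy fixes the highest-scoring mistake first
print(final)     # fully sorted: all positives above all negatives
```

Running the same loop with a per-group metric on group-tagged samples, as in the figures above, is what surfaces the disparity: the AUPRC-greedy loop keeps choosing mistakes in the high-scoring region, which the higher-prevalence group tends to dominate.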

Related Work

Our work builds on prior work that has examined the robustness of models and metrics across subpopulations:


Yang, Zhang*, Katabi, and Ghassemi. Change is Hard: A Closer Look at Subpopulation Shift. 2023.

Notes: This work is a fine-grained analysis of the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale.

How To Cite

This work is not yet peer-reviewed. The preprint can be cited as follows.

Bibliography

Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, and Jack Gallifant. "A Closer Look at AUROC and AUPRC under Class Imbalance." arXiv preprint arXiv:2401.06091 (2024).

BibTeX

@misc{mcdermott2024closer,
  title={A Closer Look at AUROC and AUPRC under Class Imbalance},
  author={Matthew B. A. McDermott and Lasse Hyldig Hansen and Haoran Zhang and Giovanni Angelotti and Jack Gallifant},
  year={2024},
  eprint={2401.06091},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}