Machine learning is used extensively to predict patient outcomes, and the inherent difficulty of prediction in this field may considerably affect model performance. Research in this area therefore requires context-sensitive performance metrics. The area under the receiver operating characteristic curve (AUC), precision, recall, specificity, and the F1 score are widely used to evaluate patient outcome prediction. These metrics have several merits: they are easy to interpret and require no subjective input from the user. However, they weight all samples equally and therefore do not adequately reflect a model's ability to classify difficult samples. In this paper, we propose the Difficulty Weight Adjustment (DWA) algorithm, a simple method that incorporates the difficulty level of samples when evaluating predictive models. Using a large dataset of 139,367 unique ICU admissions from the eICU Collaborative Research Database (eICU-CRD), we show that the classification difficulty and the discrimination ability of samples are critical aspects that need to be considered when comparing machine learning models that predict patient outcomes.