Evaluation

Evaluation will be done according to the following metrics:

  • Dice similarity coefficient (DSC)
  • Hausdorff distance (modified, 95th percentile)
  • Average volume difference (in percentage)
  • Sensitivity for individual lesions (in percentage)
  • F1-score for individual lesions

Individual lesions are defined as 3D connected components. The full source code that will be used for evaluation can be found here: evaluation.py.
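
To illustrate how these metrics relate to the segmentation masks, the sketch below computes Dice, the average volume difference, and the lesion-wise sensitivity and F1-score from two binary volumes. It is a simplified approximation, not the official evaluation.py: the function names (getDsc, getAvd, getLesionDetection) and the use of NumPy/SciPy connected-component labelling are illustrative assumptions, edge cases such as empty masks are ignored, and the 95th-percentile Hausdorff distance is omitted because it requires surface distance computations.

import numpy as np
from scipy import ndimage

def getDsc(testImage, resultImage):
    # Dice similarity coefficient between two binary masks.
    overlap = np.logical_and(testImage, resultImage).sum()
    return 2.0 * overlap / (testImage.sum() + resultImage.sum())

def getAvd(testImage, resultImage):
    # Average volume difference as a percentage of the reference volume.
    testVolume   = float(testImage.sum())
    resultVolume = float(resultImage.sum())
    return abs(testVolume - resultVolume) / testVolume * 100

def getLesionDetection(testImage, resultImage):
    # Individual lesions are 3D connected components of the binary masks.
    testLabels,   numTestLesions   = ndimage.label(testImage)
    resultLabels, numResultLesions = ndimage.label(resultImage)

    # A reference lesion counts as detected if any of its voxels is segmented.
    detected = sum(resultImage[testLabels == i].any()
                   for i in range(1, numTestLesions + 1))
    recall = detected / numTestLesions

    # A segmented lesion is a true positive if it overlaps the reference.
    truePositives = sum(testImage[resultLabels == j].any()
                        for j in range(1, numResultLesions + 1))
    precision = truePositives / numResultLesions

    f1 = 2 * precision * recall / (precision + recall)
    return recall * 100, f1  # sensitivity in percent, F1 as a fraction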

Ranking

Each metric is averaged over all test scans. For each metric, the participating teams are sorted from best to worst. The best team receives a rank of 0 and the worst team a rank of 1; all other teams receive a rank in the open interval (0, 1), scaled linearly to their position within the range of that metric. Finally, the five ranks are averaged into the overall rank that is used for the Results.

For example: suppose the best team A has a DSC of 80 and the worst team B a DSC of 60; in the ranking, A = 0.00 and B = 1.00. A third team C with a DSC of 78 is then ranked at 1.0 - (78 - 60) / (80 - 60) = 0.10. The actual Python code to compute this is:

import pandas

def getRankingHigherIsBetter(df, metric):
    # Invert the lower-is-better ranking, so the highest mean value maps to 0.
    return 1.0 - getRankingLowerIsBetter(df, metric)

def getRankingLowerIsBetter(df, metric):
    # Average the metric over all test scans for each team.
    rank = df.groupby('team')[metric].mean()

    lowest  = rank.min()
    highest = rank.max()

    # Scale linearly to [0, 1]: best (lowest) team -> 0, worst (highest) team -> 1.
    return (rank - lowest) / (highest - lowest)

# Pandas DataFrame containing the results for each team for each test image
df = loadResultData()

rankDsc    = getRankingHigherIsBetter(df, 'dsc')
rankH95    = getRankingLowerIsBetter(df, 'h95')
rankAvd    = getRankingLowerIsBetter(df, 'avd')
rankRecall = getRankingHigherIsBetter(df, 'recall')
rankF1     = getRankingHigherIsBetter(df, 'f1')

finalRank = (rankDsc + rankH95 + rankAvd + rankRecall + rankF1) / 5
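
As a quick sanity check, the ranking functions reproduce the DSC example above. The small DataFrame built here is purely illustrative; the real results are loaded by loadResultData():

# Toy DataFrame with one DSC value per team, matching the worked example.
example = pandas.DataFrame({'team': ['A', 'B', 'C'],
                            'dsc':  [80.0, 60.0, 78.0]})

print(getRankingHigherIsBetter(example, 'dsc'))
# A -> 0.0, B -> 1.0, C -> 0.1 (the ranks from the worked example)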