irrCAC package
Module contents
Inter-rater reliability
In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, and so on) is the degree of agreement among raters. It is a score of how much homogeneity or consensus exists in the ratings given by various judges.
There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are the joint probability of agreement and many others.
Submodules
irrCAC.benchmark module
The “Cumulative Probability” approach to Benchmarking.
- class irrCAC.benchmark.Benchmark(coeff, se)
Bases: object
Compute benchmark scale membership probabilities.
An elaborate approach to interpreting a kappa value is based on the notion of cumulative interval membership probability (CIMP). The interval membership probability is the Normality-based probability that the “true” agreement coefficient \(\kappa\) belongs to the interval in question. It is calculated from the estimate \(\hat{\kappa}\), its standard error \(se(\hat{\kappa})\), and an arbitrary interval \((a, b)\) as follows:
\[\begin{aligned} P(a \le \kappa \le b) &= P[(\hat{\kappa} - b)/se(\hat{\kappa}) \le Z \le (\hat{\kappa} - a)/se(\hat{\kappa})] \\ &= \Phi[(\hat{\kappa} - a)/se(\hat{\kappa})] - \Phi[(\hat{\kappa} - b)/se(\hat{\kappa})] \end{aligned}\]
where \(\Phi\) is the cumulative distribution function of the standard Normal distribution. The general rule consists of retaining the highest interval whose CIMP equals or exceeds the threshold of 0.95. For more details see Inter-rater reliability among multiple raters when subjects are rated by different pairs of subjects.
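The interval probabilities can be reproduced with the Python standard library alone. The sketch below is an illustration, not the package's implementation. The helper name cumulative_membership is hypothetical. It also assumes that each interval probability is divided by the Normal probability mass on [-1, 1] before the probabilities are accumulated from the highest interval down; that normalization is inferred from the example output on this page and is not stated in the formula.

```python
from itertools import accumulate
from statistics import NormalDist


def cumulative_membership(coeff, se, intervals):
    """Hypothetical helper: cumulative interval membership probabilities.

    `intervals` is a list of (lower, upper) bounds ordered from the
    highest interval to the lowest.
    """
    phi = NormalDist().cdf  # standard Normal CDF

    # Assumption: normalize by the Normal mass on [-1, 1].
    mass = phi((coeff + 1.0) / se) - phi((coeff - 1.0) / se)

    # P(a <= kappa <= b) = Phi[(coeff - a)/se] - Phi[(coeff - b)/se]
    probs = [
        (phi((coeff - a) / se) - phi((coeff - b) / se)) / mass
        for a, b in intervals
    ]
    # Running total from the highest interval downwards.
    return [round(p, 5) for p in accumulate(probs)]


altman_scale = [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)]
cum_prob = cumulative_membership(0.67, 0.15, altman_scale)
print(cum_prob)
```

With coeff=0.67 and se=0.15, the result should agree, up to rounding, with the CumProb values reported for the Altman scale in the examples on this page.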
- Parameters:
coeff (float) – The estimated value of an agreement coefficient.
se (float) – The coefficient standard error.
Examples
In the following example, with a kappa of 0.67 and a standard error of 0.15, the agreement is best described as Moderate on the Altman scale.
>>> from irrCAC.benchmark import Benchmark
>>> benchmark = Benchmark(coeff=0.67, se=0.15)
>>> print(benchmark.altman())
{'scale': [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)], 'Altman': ['Very Good', 'Good', 'Moderate', 'Fair', 'Poor'], 'CumProb': [0.18168, 0.67511, 0.96356, 0.99912, 1.0]}
>>> my_scale = dict(
...     lb=[0.6, 0.3, 0.0],
...     ub=[1.0, 0.6, 0.3],
...     interp=['Excellent', 'Acceptable', 'Poor'],
...     scale_name='My Scale')
>>> print(benchmark.interpret(my_scale))
{'scale': [(0.6, 1.0), (0.3, 0.6), (0.0, 0.3)], 'My Scale': ['Excellent', 'Acceptable', 'Poor'], 'CumProb': [0.67511, 0.99308, 1.0]}
- altman()
Interpret the level of agreement using the Altman [Alt90] benchmark scale.
Interpretation   Scale
Very Good        0.8 - 1.0
Good             0.6 - 0.8
Moderate         0.4 - 0.6
Fair             0.2 - 0.4
Poor             -1.0 - 0.2
- cicchetti_sparrow()
Interpret the level of agreement using the Cicchetti and Sparrow [CS81] benchmark scale.
Interpretation   Scale
Excellent        0.75 - 1.0
Good             0.6 - 0.75
Fair             0.4 - 0.6
Poor             0.0 - 0.4
- fleiss()
Interpret the level of agreement using the Fleiss [Fle71] benchmark scale.
Interpretation   Scale
Excellent        0.75 - 1.0
Fair to Good     0.4 - 0.75
Poor             0.0 - 0.4
- interpret(bench)
Interpret the agreement coefficient on a benchmark scale.
To interpret the agreement coefficient, we look for the highest interval whose cumulative probability equals or exceeds 0.95. For example, with a coefficient value of 0.67 and a standard error of 0.15, we get the following results on the Altman scale.
Scale         Altman      CumProb
(0.8, 1.0)    Very Good   0.18168
(0.6, 0.8)    Good        0.67511
(0.4, 0.6)    Moderate    0.96356
(0.2, 0.4)    Fair        0.99912
(-1.0, 0.2)   Poor        1.0
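It is safer to say that we have Moderate agreement (the first interval whose cumulative probability is >= 0.95) than to say that we have Good agreement (because 0.6 <= 0.67 <= 0.8). The reason is the large standard error: the point estimate falls in the Good interval, but the cumulative probability down to that interval is only 0.67511.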
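The selection rule can be applied mechanically to the returned dictionary. The following is a minimal sketch; first_interval_reaching is a hypothetical helper, not part of irrCAC, and the result dictionary is copied from the Altman example for coeff=0.67 and se=0.15.

```python
def first_interval_reaching(result, scale_name, threshold=0.95):
    """Hypothetical helper: return the highest interval whose cumulative
    probability equals or exceeds `threshold`, or None if there is none."""
    rows = zip(result["scale"], result[scale_name], result["CumProb"])
    for interval, label, cum_prob in rows:
        if cum_prob >= threshold:
            return interval, label, cum_prob
    return None  # no interval reaches the threshold


# Output of benchmark.altman() for coeff=0.67, se=0.15 (copied from this page).
result = {
    "scale": [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)],
    "Altman": ["Very Good", "Good", "Moderate", "Fair", "Poor"],
    "CumProb": [0.18168, 0.67511, 0.96356, 0.99912, 1.0],
}
chosen = first_interval_reaching(result, "Altman")
print(chosen)  # ((0.4, 0.6), 'Moderate', 0.96356)
```

Because the intervals are listed from highest to lowest, the first match is also the highest interval that satisfies the threshold.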
- Parameters:
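bench (dict) – A dictionary describing the benchmark scale, with the keys lb (lower bounds), ub (upper bounds), interp (the interpretation of each interval) and scale_name (the name of the scale), as in the my_scale example above.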
- Returns:
A dict with three keys: 'scale' (the kappa intervals), the scale name (the interpretation of each interval), and 'CumProb' (the cumulative probabilities). For example:
{'scale': [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)], 'Altman': ['Very Good', 'Good', 'Moderate', 'Fair', 'Poor'], 'CumProb': [0.18168, 0.67511, 0.96356, 0.99912, 1.0]}
- Return type:
dict
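A bench dictionary for a published scale can be written by hand in the same layout as my_scale in the class example. The sketch below encodes the Cicchetti and Sparrow table from this page and runs a few sanity checks on it. check_bench is a hypothetical helper, not part of irrCAC, the scale name is an arbitrary label, and the requirement that intervals be contiguous and ordered from highest to lowest is an assumption based on the examples shown here.

```python
def check_bench(bench):
    """Hypothetical sanity checks for a benchmark-scale dictionary."""
    lb, ub, interp = bench["lb"], bench["ub"], bench["interp"]
    if not (len(lb) == len(ub) == len(interp)):
        raise ValueError("lb, ub and interp must have the same length")
    if any(lo >= hi for lo, hi in zip(lb, ub)):
        raise ValueError("each lower bound must be below its upper bound")
    # Assumption: intervals run from highest to lowest without gaps, so each
    # lower bound equals the upper bound of the next interval.
    if any(lo != next_hi for lo, next_hi in zip(lb, ub[1:])):
        raise ValueError("intervals must be contiguous, highest first")
    return True


cicchetti_sparrow = dict(
    lb=[0.75, 0.6, 0.4, 0.0],
    ub=[1.0, 0.75, 0.6, 0.4],
    interp=["Excellent", "Good", "Fair", "Poor"],
    scale_name="Cicchetti-Sparrow",  # arbitrary label for illustration
)
print(check_bench(cicchetti_sparrow))  # True
```

A dictionary that passes these checks could then be handed to benchmark.interpret(cicchetti_sparrow), in the same way as my_scale.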
irrCAC.datasets module
irrCAC.raw module
irrCAC.table module
irrCAC.weights module
References
[Alt90] D. Altman. Practical statistics for medical research. Chapman and Hall/CRC, 1990. ISBN 9780412276309.
[CS81] Domenic V Cicchetti and Sara A Sparrow. Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior. American Journal of Mental Deficiency, 1981.
[Fle71] Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
Darrel A Regier, William E Narrow, Diana E Clarke, Helena C Kraemer, S Janet Kuramoto, Emily A Kuhl, and David J Kupfer. DSM-5 field trials in the United States and Canada, part II: test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170(1):59–70, 2013.