irrCAC package

Module contents

Inter-rater reliability

In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, and so on) is the degree of agreement among raters. It is a score of how much homogeneity or consensus exists in the ratings given by various judges.

There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are

and many others.

Submodules

irrCAC.benchmark module

The “Cumulative Probability” approach to Benchmarking.

class irrCAC.benchmark.Benchmark(coeff, se)

Bases: object

Compute benchmark scale membership probabilities.

A more elaborate way to interpret a kappa value is based on the cumulative interval membership probability (CIMP). The interval membership probability is the Normality-based probability that the “true” agreement coefficient \(\kappa\) belongs to a given interval. For an arbitrary interval \((a, b)\), it is calculated from the estimate \(\hat{\kappa}\) and its standard error \(se(\hat{\kappa})\) as follows:

\[\begin{aligned} P(a \le \kappa \le b) &= P\left[(\hat{\kappa} - b)/se(\hat{\kappa}) \le Z \le (\hat{\kappa} - a)/se(\hat{\kappa})\right] \\ &= \Phi\left[(\hat{\kappa} - a)/se(\hat{\kappa})\right] - \Phi\left[(\hat{\kappa} - b)/se(\hat{\kappa})\right] \end{aligned}\]

where \(\Phi\) is the cumulative distribution function of the standard Normal distribution and \(Z\) is a standard Normal variable. The CIMP of an interval is the sum of the membership probabilities of that interval and of all higher intervals on the scale. The general rule is to retain the highest interval whose CIMP equals or exceeds the threshold of 0.95. For more details, see “Inter-rater reliability among multiple raters when subjects are rated by different pairs of subjects”.
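As a rough illustration of the interval probability described above, the following sketch computes cumulative membership probabilities using only the Python standard library. The helper names `interval_probability` and `cumulative_membership` are hypothetical and are not part of irrCAC. The sketch also assumes that each probability is divided by the probability mass on [-1, 1] (a truncated Normal). This is an inference, made because it reproduces the documented `CumProb` values for kappa 0.67 and standard error 0.15 to about four decimals; the package's actual implementation may differ.

```python
from itertools import accumulate
from statistics import NormalDist

PHI = NormalDist().cdf  # CDF of the standard Normal distribution


def interval_probability(coeff, se, a, b):
    """Unnormalized P(a <= kappa <= b) under the Normal approximation."""
    return PHI((coeff - a) / se) - PHI((coeff - b) / se)


def cumulative_membership(coeff, se, intervals):
    """Cumulative interval membership probabilities, highest interval first.

    Assumption: each probability is rescaled by the mass on [-1, 1].
    """
    total = interval_probability(coeff, se, -1.0, 1.0)
    probs = [interval_probability(coeff, se, a, b) / total for a, b in intervals]
    return [round(p, 5) for p in accumulate(probs)]


# Altman intervals, listed from the highest to the lowest.
altman = [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)]
cum_prob = cumulative_membership(0.67, 0.15, altman)
print(cum_prob)
```

Without the rescaling step, the first value would come out slightly lower (about 0.179 instead of 0.182), which is why the normalization is assumed here.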

Parameters:

coeff (float) – A floating number representing the estimated value of an agreement coefficient.
se (float) – The coefficient standard error.

Examples

In the following example, with kappa 0.67 and standard error 0.15, the agreement is best described as Moderate on the Altman scale, because Moderate is the first interval whose cumulative probability reaches 0.95.

>>> from irrCAC.benchmark import Benchmark
>>> benchmark = Benchmark(coeff=0.67, se=0.15)
>>> print(benchmark.altman())  
{'scale': [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)],
'Altman': ['Very Good', 'Good', 'Moderate', 'Fair', 'Poor'],
'CumProb': [0.18168, 0.67511, 0.96356, 0.99912, 1.0]}
>>> my_scale = dict(
...     lb=[0.6, 0.3, 0.0],
...     ub=[1.0, 0.6, 0.3],
...     interp=['Excellent', 'Acceptable', 'Poor'],
...     scale_name='My Scale')
>>> print(benchmark.interpret(my_scale))  
{'scale': [(0.6, 1.0), (0.3, 0.6), (0.0, 0.3)],
'My Scale': ['Excellent', 'Acceptable', 'Poor'],
'CumProb': [0.67511, 0.99308, 1.0]}

altman()

Interpret the level of agreement using the Altman [Alt90] benchmark scale.

Interpretation	Scale
Very Good	0.8 - 1.0
Good	0.6 - 0.8
Moderate	0.4 - 0.6
Fair	0.2 - 0.4
Poor	-1.0 - 0.2

cicchetti_sparrow()

Interpret the level of agreement using the Cicchetti and Sparrow [CS81] benchmark scale.

Interpretation	Scale
Excellent	0.75 - 1.0
Good	0.6 - 0.75
Fair	0.4 - 0.6
Poor	0.0 - 0.4

fleiss()

Interpret the level of agreement using the Fleiss [Fle71] benchmark scale.

Interpretation	Scale
Excellent	0.75 - 1.0
Fair to Good	0.4 - 0.75
Poor	0.0 - 0.4

interpret(bench)

Interpret the agreement coefficient on a benchmark scale.

To interpret the agreement coefficient, we look for the first range, starting from the top of the scale, whose cumulative probability reaches or exceeds 0.95. For example, a coefficient value of 0.67 with standard error 0.15 gives the following results.

Scale	Altman	CumProb
(0.8, 1.0)	Very Good	0.18168
(0.6, 0.8)	Good	0.67511
(0.4, 0.6)	Moderate	0.96356
(0.2, 0.4)	Fair	0.99912
(-1.0, 0.2)	Poor	1.0

It is safer to say that we have Moderate agreement (the first interval whose cumulative probability is >= 0.95) than to say that we have Good agreement (because 0.6 <= 0.67 <= 0.8). The reason is the large standard error.
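The selection rule can be sketched in a few lines. The function `first_interval_reaching` is a hypothetical helper, not part of irrCAC; it only assumes a result dict shaped like the documented Altman output, with the intervals ordered from highest to lowest.

```python
def first_interval_reaching(result, scale_name, threshold=0.95):
    """Return (interval, label, cum_prob) for the first row meeting the threshold."""
    rows = zip(result["scale"], result[scale_name], result["CumProb"])
    for interval, label, cum_prob in rows:
        if cum_prob >= threshold:
            return interval, label, cum_prob
    return None  # no interval reaches the threshold


# Values taken from the documented Altman output for coeff=0.67, se=0.15.
result = {
    "scale": [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)],
    "Altman": ["Very Good", "Good", "Moderate", "Fair", "Poor"],
    "CumProb": [0.18168, 0.67511, 0.96356, 0.99912, 1.0],
}
chosen = first_interval_reaching(result, "Altman")
print(chosen)  # ((0.4, 0.6), 'Moderate', 0.96356)
```

Lowering the threshold makes the rule less conservative: with `threshold=0.6`, the same data would select Good.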

Parameters:

bench (dict) – A dictionary with the keys lb and ub (the lower and upper bounds of each interval), interp (the interpretation of each interval), and scale_name (the name of the scale).

Returns:

A dict with three keys: 'scale' (the kappa intervals), the scale name (the interpretation of each interval on that scale), and 'CumProb' (the cumulative probabilities). For example:

{'scale': [(0.8, 1.0), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4), (-1.0, 0.2)], 'Altman': ['Very Good', 'Good', 'Moderate', 'Fair', 'Poor'], 'CumProb': [0.18168, 0.67511, 0.96356, 0.99912, 1.0]}

Return type:

dict
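A custom scale is easier to read as a list of rows than as parallel lists. The sketch below builds a bench dict with the lb, ub, interp, and scale_name keys used in the class example. `make_bench` is a hypothetical helper, not part of irrCAC, and its contiguity check is an added sanity check that the library may not require.

```python
def make_bench(rows, scale_name):
    """Build a bench dict from (lower, upper, label) rows, highest interval first."""
    for lower, upper, _ in rows:
        if lower >= upper:
            raise ValueError(f"empty interval: ({lower}, {upper})")
    # Each interval should end where the next-higher interval begins.
    for (lower_hi, _, _), (_, upper_lo, _) in zip(rows, rows[1:]):
        if upper_lo != lower_hi:
            raise ValueError("intervals must be contiguous and sorted high to low")
    return {
        "lb": [lower for lower, _, _ in rows],
        "ub": [upper for _, upper, _ in rows],
        "interp": [label for _, _, label in rows],
        "scale_name": scale_name,
    }


my_scale = make_bench(
    [(0.6, 1.0, "Excellent"), (0.3, 0.6, "Acceptable"), (0.0, 0.3, "Poor")],
    scale_name="My Scale",
)
print(my_scale)
```

The resulting dict has the same shape as the `my_scale` dict in the class example, so it could presumably be passed to `interpret` in the same way.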

landis_koch()

Interpret the level of agreement using the Landis and Koch [LK77] scale.

Interpretation	Scale
Almost Perfect	0.8 - 1.0
Substantial	0.6 - 0.8
Moderate	0.4 - 0.6
Fair	0.2 - 0.4
Slight	0.0 - 0.2
Poor	-1.0 - 0.0

regier()

Interpret the level of agreement using the Regier et al. [RNC+13] benchmark scale.

Interpretation	Scale
Excellent	0.8 - 1.0
Very Good	0.6 - 0.8
Good	0.4 - 0.6
Questionable	0.2 - 0.4
Unacceptable	0.0 - 0.2

irrCAC.datasets module

irrCAC.raw module

irrCAC.table module

irrCAC.weights module

References

[Alt90]

D. Altman. Practical statistics for medical research. Chapman and Hall/CRC, 1990. ISBN 9780412276309.

[CS81]

Domenic V Cicchetti and Sara A Sparrow. Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior. American Journal of Mental Deficiency, 1981.

[Fle71]

Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.

[LK77]

J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.

[RNC+13]

Darrel A Regier, William E Narrow, Diana E Clarke, Helena C Kraemer, S Janet Kuramoto, Emily A Kuhl, and David J Kupfer. DSM-5 field trials in the United States and Canada, part II: test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170(1):59–70, 2013.