# Patterns classes are heart disease present and heart disease absent

**Performance**

The area under the curve is given by

$$\int \bigl(1 - F(u)\bigr)\, dG(u) = 1 - \int F(u)\, dG(u) \qquad (8.10)$$

value than a randomly chosen class $\omega_1$ pattern is $\int G(u) f(u)\, du$. This is the same as the definition (8.11) for the area under the ROC curve.
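The equivalence can be checked with one integration by parts. The sketch below assumes that $F$ and $G$ are cumulative distribution functions, that $f$ is the density of $F$, and that the integrals run over the whole range of $u$, so that the boundary term $F(u)G(u)$ goes from 0 to 1:

$$\int G(u) f(u)\, du = \bigl[ F(u) G(u) \bigr]_{-\infty}^{\infty} - \int F(u)\, dG(u) = 1 - \int F(u)\, dG(u)$$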

**Calculating the area under the ROC curve** The area under the ROC curve is easily calculated by applying the classification rule to a test set.

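With the test patterns ranked in increasing order of classifier output, and $r_i$ the rank of the $i$th class $\omega_1$ pattern, the number of pairs in which the class $\omega_2$ pattern has the smaller value is

$$\sum_{i=1}^{n_1} (r_i - i) = \sum_{i=1}^{n_1} r_i - \sum_{i=1}^{n_1} i = S_0 - \frac{1}{2} n_1 (n_1 + 1)$$

where $S_0$ is the sum of the ranks of the class $\omega_1$ test patterns. Since there are $n_1 n_2$ pairs of patterns, one from each class, the estimate of the area under the curve is

$$\hat{A} = \frac{1}{n_1 n_2} \left\{ S_0 - \frac{1}{2} n_1 (n_1 + 1) \right\}$$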

The estimate of the area has been obtained using the rankings alone and has not used threshold values to calculate it.
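The rank-sum calculation can be sketched in a few lines of Python. This is an illustrative sketch, not code from the text: the function name `rank_auc` and its arguments are hypothetical, and the use of midranks for tied scores is an assumption, since ties are not discussed here.

```python
def rank_auc(scores_pos, scores_neg):
    """Estimate the area under the ROC curve from ranks alone.

    scores_pos: classifier outputs for class omega_1 test patterns.
    scores_neg: classifier outputs for class omega_2 test patterns.
    """
    n1, n2 = len(scores_pos), len(scores_neg)
    # Pool the scores, remembering which class each came from.
    pooled = [(s, 1) for s in scores_pos] + [(s, 0) for s in scores_neg]
    pooled.sort(key=lambda pair: pair[0])

    # Assign 1-based ranks in increasing order of score.
    # Assumption: tied scores share the average of their ranks (midranks).
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = ((i + 1) + (j + 1)) / 2.0
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    # S0 is the sum of the ranks of the class omega_1 patterns.
    s0 = sum(r for r, (_, label) in zip(ranks, pooled) if label == 1)
    return (s0 - 0.5 * n1 * (n1 + 1)) / (n1 * n2)


if __name__ == "__main__":
    print(rank_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

In the small example, the class $\omega_1$ scores take ranks 3, 5 and 6, so $S_0 = 14$ and the estimate is $(14 - 6)/9 = 8/9$, which matches a direct count of the 8 pairs (out of 9) in which the class $\omega_2$ score is the smaller.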

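The standard deviation of the statistic $\hat{A}$ is given by Hand and Till (2001) in terms of the quantities

$$\hat{\theta} = \frac{S_0}{n_1 n_2}$$

$$Q_0 = \frac{1}{6} (2 n_1 + 2 n_2 + 1)(n_1 + n_2) - Q_1$$

$$Q_1 = \sum_{j=1}^{n_1} (r_j - 1)^2$$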

An alternative approach, considered by Bradley (1997), is to construct an estimate of the ROC curve directly for specific classifiers by varying a threshold and then to use an integration rule (for example, the trapezium rule) to obtain an estimate of the area beneath the curve.
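The threshold-based alternative can be sketched as follows. This is a minimal illustration rather than Bradley's procedure: the function names are hypothetical, the thresholds are assumed to be the distinct observed scores, and the curve is assumed to plot the true positive rate against the false positive rate, with a pattern assigned to class $\omega_1$ when its score is at or above the threshold.

```python
def roc_points(scores_pos, scores_neg):
    """Return (false positive rate, true positive rate) points of an empirical ROC curve."""
    # Assumption: sweep the threshold over every distinct observed score,
    # from the highest down to the lowest.
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing assigned to omega_1
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    return points  # the last point is (1.0, 1.0)


def trapezium_auc(points):
    """Integrate the piecewise-linear curve through the points with the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
    print(trapezium_auc(pts))
```

For these six scores the trapezium estimate is 8/9, the same value that the rank-based calculation gives for the same data. With a coarser grid of thresholds the two estimates would generally differ.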

**The data** There are six data sets comprising
measurements on two classes:

1. Cervical cancer. Six features, 117 patterns; classes are normal and abnormal cervical cell nuclei.

6. Heart disease 2. Eleven features, 261 patterns; classes are heart disease present and heart disease absent.

Incomplete patterns (patterns for which measurements on some features are missing) were removed from the data sets.