Reliability and Validity

Reliability and validity are presented together here because they are related and are often confused with one another.
Reliability

Reliability is a property of a measure that refers to its statistical stability, or the degree to which multiple observations of identical phenomena yield identical results.
In public health, sometimes we are interested in the actual number of health events, but more often we use observed measures such as birth or death rates to indicate the underlying risk of illness or disability in a population. But the observed measures of risk fluctuate even when the true underlying risk of disease does not. The reasons for the variability usually include one or more of the following factors: 1) the health event is relatively rare, 2) the population size is relatively small, and 3) the health events do not occur at regular time intervals.
Even for complete count datasets, such as birth and death certificate datasets, random fluctuations over time will yield estimates that are not reliable. Consider the case of low birth weight in a small community. In this community one low birth weight infant is born each month, on average. But health events such as low birth weight do not occur at regular time intervals - there is randomness in the timing of low birth weight occurrence. In our small community, if three mothers give birth to low birth weight infants in December of Year 1, and none do in January or February of Year 2, it may appear as though the risk of low birth weight births has declined from Year 1 to Year 2. Actually the true underlying risk did not change, the rates were merely subject to randomness in the timing of the low birthweight births.
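The low birth weight example can be illustrated with a short simulation. This is a minimal sketch, assuming the yearly count of a rare health event follows a Poisson distribution with a constant true mean of 12 (one per month, on average, as in the community described above); the numbers and helper function are hypothetical, used only to show that observed counts fluctuate while the underlying risk does not.

```python
import math
import random

random.seed(42)  # fixed seed so the illustration is repeatable

TRUE_MEAN_PER_YEAR = 12  # the true underlying risk: one event per month, on average

def simulate_yearly_count(mean):
    """Draw one year's observed count from a Poisson distribution (Knuth's method)."""
    threshold = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= random.random()
        if product <= threshold:
            return count
        count += 1

# Five years of observed counts: they fluctuate from year to year
# even though the true underlying risk never changes.
counts = [simulate_yearly_count(TRUE_MEAN_PER_YEAR) for _ in range(5)]
print(counts)
```

Running this repeatedly (with different seeds) makes the point concrete: the year-to-year variation in the printed counts is pure randomness in the timing of events, not a change in risk.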
Rates that fluctuate over time, in the absence of changes in underlying risk, are considered "unreliable" or "unstable." Since the underlying risk typically changes very slowly, the term "unstable" is used to refer to any observed rates that fluctuate widely in the absence of changes in the true underlying risk.
The terms "reliability," "precision," and "stability" are used to refer to the amount of random error that is likely to be included in an observed measure. Fortunately, we can use statistical techniques to assess the stability of a given rate. The confidence interval is a common statistical measure that conveys the reliability of an estimate. It may be thought of as the range of probable true values for a statistic. A wide confidence interval (wide in relation to the rate) indicates that the rate is likely to include a lot of random error.
Relative Standard Error (RSE)
Another related measure is the "Relative Standard Error" (RSE). RSE is the ratio of the standard error to the point estimate (e.g., rate, average), and is commonly expressed as a percentage. For instance, the 2006 teen birth rate for girls age 15-17 in Bernalillo County was 33.1, with a 95% confidence interval half-width of 3.1 and a relative standard error of just 4.8%. The standard error (1.58) divided by the rate (33.1) equals 4.8%. Compare those statistics with the comparable 2006 teen birth rate in Sierra County, below.
The teen birth rates for the two counties are similar, but Sierra County's smaller population size (just 226 girls age 15-17) yields a teen birth rate that is more subject to random error. Intuitively, you can see that with only 7 teen births in the entire year, if only one or two of those teen births had occurred just before January 1 or just after December 31 of that year, the rate would be very different (but the underlying risk would not!). In the larger county, a few cases more or fewer in a given year don't make nearly as much difference, and the sheer number of cases allows the random variation to even out.
RSE is on a continuum from 0 to 1. Cut-off points of 0.30 and 0.50 are used as conventions for interpreting the RSE. A rate associated with an RSE of 0.30 (the standard error is 30% as large as the estimate) is deemed by most public health epidemiologists as too unstable to report. The criteria for cell suppression used in the IBIS query system are more liberal, with warnings delivered when rates are associated with an RSE above either 0.30 (one red flag) or 0.50 (two red flags).
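The RSE calculation and the conventional cut-offs above can be sketched in a few lines. This uses the Bernalillo County figures quoted earlier (standard error 1.58, rate 33.1); the flag labels are illustrative, not the IBIS system's actual output.

```python
def relative_standard_error(standard_error, estimate):
    """RSE = standard error / point estimate, as a proportion."""
    return standard_error / estimate

def stability_flag(rse):
    """Apply the conventional 0.30 / 0.50 cut-offs described in the text.
    The label strings are illustrative, not IBIS's actual wording."""
    if rse > 0.50:
        return "two red flags: very unstable"
    if rse > 0.30:
        return "one red flag: unstable"
    return "stable enough to report"

# Bernalillo County, 2006 teen birth rate (figures from the text):
rse = relative_standard_error(standard_error=1.58, estimate=33.1)
print(round(rse, 3), stability_flag(rse))  # 0.048, well below the 0.30 cut-off
```

A rate with a standard error of 1.58 against an estimate of 33.1 gives an RSE of about 0.048 (4.8%), comfortably below the 0.30 convention, so it would be reported without warnings.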
Measures of statistical stability are related to one another. The confidence interval is based on the standard error, and the standard error is based on the variance of the statistic and the population or sample size.
The 95% confidence interval is calculated as the point estimate plus or minus 1.96 times its standard error.
Validity

Validity is a property of a measurement that refers to its accuracy, or the degree to which observations reflect the true value of a phenomenon. It is possible to have a measure that is very reliable but not at all valid.
In public health, we are lucky because the validity of most of our measures is really quite good. "Cause of death" on death certificates is certified by a physician. Survey measures have been tested to maximize validity. Birthweight is measured and reported at the birth hospital. There are some measures that we question (self-reported body weight, for instance), but on the whole, the measures we use have a high degree of validity.
The Bulls-eye Analogy

In the three figures below, the bulls-eye of the target represents the true underlying risk of disease in a population, and the holes in the target represent multiple observed measurements of the risk. In the first figure, the measure is reliable: it measures nearly the same value each time. But it is not valid, because the average of the scores is not close to the true underlying risk. In the second figure, the scores are not very reliable; there is a lot of variability in them, but they center around the true risk value, so they are valid (at least on average). In the third figure, the measure is both reliable and valid.
The term "accuracy" is often used in relation to validity, while the term "precision" is used to describe reliability.
Proceed to the page on confidence intervals.