WEEK 4: RELIABILITY AND VALIDITY
RELIABILITY (of indicators)
Definition- repeated measurements of a concept (the indicator) should yield similar results.
Tests of reliability:
1) Test-Retest- using the same indicator on the same people at two or more time points. You should get consistent responses at both time points.
TEST-RETEST RELIABILITY TEST OF PARTY IDENTIFICATION
(Note: the following table is derived from Herbert B. Asher's Presidential Elections and American Politics, 5th edition, page 71; Brooks/Cole Co., 1992)

                                     1976 Partisanship
1972 Party Id | Strong Dem. | Weak Dem. | Indep. Dem. | Pure Indep. | Indep. Rep. | Weak Rep. | Strong Rep.
Strong Dem.   |      9      |     4     |      1      |      0      |      0      |     0     |      0
Weak Dem.     |      5      |    13     |      3      |      2      |      1      |     1     |      0
Indep. Dem.   |      2      |     3     |      4      |      1      |      1      |     0     |      0
Pure Indep.   |      1      |     1     |      2      |      5      |      2      |     1     |      0
Indep. Rep.   |      1      |     0     |      1      |      3      |      5      |     2     |      1
Weak Rep.     |      0      |     1     |      0      |      1      |      3      |     7     |      2
Strong Rep.   |      0      |     0     |      0      |      0      |      1      |     4     |      6
First
of all, how do you read such a table? Each cell indicates the number of people
who gave that response to the 7-category party identification indicator at both
of the time points. The total number of people in the table is 100; all of them
were asked their party identification in 1972, and then the same people were
asked their partisanship in 1976. Therefore, the 9 people in the top leftmost
cell were strong Democrats in both 1972 and in 1976. The 4 people in the cell
next to them were strong Democrats in 1972 but had become weak Democrats in 1976.
How
much stability (consistency) is there in this table? How many people have given
the same response at both time points? The cells in the diagonal going from top
left to the bottom right of the table are those people giving the same
responses at both time points. Count the number of people in the diagonal. The
number remaining stable in attitudes is therefore (9 + 13 + 4 + 5 + 5 + 7 + 6) which
is a total of 49. The total number of people
in the table is 100. Therefore, 49% of the sample has remained stable in
attitudes. Is 49% a high or low reliability score? The stable percent must be
compared to chance alone. Chance stability is the number of stable cells,
divided by the total number of cells in the table. Hence, chance stability is 7
/ 49 = 14%. Since 49% is significantly higher than 14%, this indicator is
reliable.
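The arithmetic above can be sketched in a few lines of Python. This is only an illustration (the function name and table layout are mine, not from the text); it computes observed stability from the diagonal cells and chance stability from the cell counts, exactly as described:

```python
def stability(table):
    """Return (observed %, chance %) for a square test-retest table.

    table[i][j] = number of people in row category i at time 1
    and column category j at time 2.
    """
    n = len(table)                        # number of categories
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(n))
    observed = 100 * diagonal / total     # % giving the same response twice
    chance = 100 * n / (n * n)            # stable cells / total cells
    return observed, chance

# The 7x7 party identification table from the text (1972 rows, 1976 columns):
party = [
    [9,  4, 1, 0, 0, 0, 0],
    [5, 13, 3, 2, 1, 1, 0],
    [2,  3, 4, 1, 1, 0, 0],
    [1,  1, 2, 5, 2, 1, 0],
    [1,  0, 1, 3, 5, 2, 1],
    [0,  1, 0, 1, 3, 7, 2],
    [0,  0, 0, 0, 1, 4, 6],
]
obs, chance = stability(party)
print(round(obs, 1), round(chance, 1))  # 49.0 and 14.3
```

Since observed stability (49%) is well above chance (about 14%), the indicator passes the test.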
Another
example of test-retest reliability:
                     | Democrats in 1982 | Independents in 1982 | Republicans in 1982
Democrats in 1997    |        23         |          5           |          4
Independents in 1997 |         7         |         27           |         10
Republicans in 1997  |         2         |          5           |         17
(Source: Political Behavior of the American Electorate, 12th edition, by William H. Flanigan and Nancy Zingale, p. 104; the data are originally from the Youth-Parent Socialization Panel Study, 1965-1997, Youth Wave, provided by the ICPSR.)
There were 100 people surveyed. How many people kept the same partisanship at both time points? Add up 23 + 27 + 17 = 67. What percentage of stability do you have? 67/100 = 67%. What is chance stability? 3 stable cells / 9 total cells = 33%. Is actual stability significantly greater than chance alone? Yes, since 67% is much higher than 33%. So party identification remains a reliable indicator.
Two examples of possible test questions:
Question. (10 points) A public administration graduate student was hired to test the reliability of a Mississippi city government's indicator of job satisfaction among its municipal workers. 500 of the city's workers were surveyed in October 2023, and the same people were re-surveyed in December 2023. She conducts a test-retest of the job satisfaction indicator's reliability and obtains the following table of workers' responses at the two time points. Is this a reliable indicator, yes or no? How can you tell? Be specific, and include the definition of reliability in your answer.
                                       JOB SATISFACTION LEVEL IN OCTOBER 2023
Job satisfaction in December of 2023 | Dissatisfied | Mixed | Satisfied
Dissatisfied                         |      85      |   10  |     10
Mixed                                |      10      |  130  |     20
Satisfied                            |       5      |   10  |    220
Another question. (10 points) A department head of a state agency is concerned about the extent to which agency workers are satisfied with their jobs. She creates a worker satisfaction scale that seeks to measure the extent to which her employees are satisfied with their jobs. It ranges from 1 for low satisfaction to 3 for high satisfaction, with 2 being a medium level of satisfaction. She then has an independent company (which guarantees the employees' anonymity) conduct a test-retest of the scale's reliability by determining workers' scores on the scale in September 2023 and then in December 2023. One hundred workers were measured at both time points.
NOTE: Cell entries in the table below are the
numbers of workers who gave the listed combination of responses in September and
December 2023.
Examining the table below, is this worker
satisfaction scale a "reliable" indicator? Why or why not? Defend
your answer mathematically, and by relying on the verbal definition of
reliability.
                                              WORKER SATISFACTION SCORES IN SEPTEMBER 2023
WORKER SATISFACTION SCORES IN DECEMBER 2023 | LOW | MEDIUM | HIGH
LOW                                         |  15 |   10   |   5
MEDIUM                                      |  25 |   10   |  10
HIGH                                        |  10 |   10   |   5
2) Alternate Forms (Parallel Forms)- using two or more indicators (that measure the same concept) on the same people at one time point. You should get consistent responses across the indicators, since they are different measures of the same concept. In the 2002 Mississippi Poll, we not only asked respondents their party identifications, but we also asked them which (if either) major party was best for people like themselves. We got the following responses:
ALTERNATE FORMS TEST

                                              2002 Party Identification
Party that is best for "people like you" | Democratic | Independent | Republican
Democratic                               |    172     |      51     |      7
Both are Equal                           |     18     |      40     |     29
Republican                               |      6     |      39     |    157
Consistent responses for
both indicators are Democrats who believe that the Democratic party is best for
people like themselves, Republicans who believe that the Republican party is
best for people like themselves, and Independents who believe that both parties
are equally good (or bad) for people like themselves. The number of consistent
responses is (172 + 40 + 157) = 369.
The total number of
people in the table is 519. The percentage of people who give consistent
responses is: 369 / 519 = 71%. How reliable is the party identification
indicator compared to chance alone? Chance is the number of consistent cells
divided by the total number of cells: 3 / 9 = 33%. Since 71% is significantly
greater than 33%, the party identification indicator is reliable.
3) Split Half- using multiple indicators of a concept on the same people at one time point. The researcher forms two scales, each combining people's responses on half of the indicators. Each person's scores on the two scales should be consistent.
A health care example. In 2004 the Mississippi Poll included seven questions about how important people thought a number of health care issues were, rated from 1 for Very Important to 4 for Not Important. An item on recruiting and retaining doctors was not highly related to the other six items, so we excluded it from the analysis, leaving six items.
These six indicators were divided into two groups: Group A included items 1, 3, and 5; and Group B included items 2, 4, and 6. Responses to all three items in each group were added together to create a scale for each group. Since each item was coded to range from 1 to 4, each group's scale ranges from 3 to 12. The Pearson correlation between the two group scales is .71, which is quite respectable, since correlation coefficients range from 0 for no relationship to 1.0 for a perfect relationship.
Another way of testing consistency is with a crosstabulation. Looking at the frequency distributions of each scale, I combined each scale's codes as follows: 3 and 4 were coded as High Priority; 5 and 6 were coded as Medium; 7 through 12 were coded as Low Priority. The crosstabulation follows:
SPLIT HALF EXAMPLE

                     Group A Scale
Group B Scale | High | Medium | Low
High          |  141 |    25  |   1
Medium        |   81 |   104  |  29
Low           |    6 |    24  |  47
Notice that 292 people
(141 + 104 + 47) gave consistent responses to both of the scales. They fall into
the diagonal, being high-high, medium-medium, or low-low. The total number of
people in the table is 458. Therefore, 292/458 people gave consistent responses,
or 64% of the sample. Chance alone would predict about one-third or 33%. So the
six indicators of the importance of health care demonstrate some reliability.
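The split-half procedure itself is easy to sketch in Python. The respondent data below are invented for illustration; only the grouping of items 1, 3, and 5 versus items 2, 4, and 6 follows the text:

```python
from statistics import fmean

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

# Each row: one (hypothetical) respondent's answers to six items coded 1-4.
responses = [
    [1, 1, 2, 1, 1, 2],
    [2, 2, 1, 2, 2, 1],
    [3, 3, 4, 3, 4, 3],
    [4, 4, 3, 4, 3, 4],
    [1, 2, 2, 1, 2, 2],
]
# Group A = items 1, 3, 5; Group B = items 2, 4, 6 (0-based indices).
scale_a = [r[0] + r[2] + r[4] for r in responses]  # each scale ranges 3-12
scale_b = [r[1] + r[3] + r[5] for r in responses]
print(round(pearson(scale_a, scale_b), 2))  # high consistency between halves
```

A high correlation between the two half-scales, as with the .71 reported above, indicates the items are reliable measures of one concept.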
4) Cronbach's Alpha- used for multi-indicator indexes; it calculates how reliable the component indicators are. Cronbach's Alpha ranges from 0 for unreliable to 1 for most reliable. The Cronbach's Alpha for the six health care items included in the 2004 Mississippi Poll analysis discussed earlier was .80.
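The standard Cronbach's alpha formula can be sketched as follows. The scores here are made up for illustration (the actual 2004 poll data are not reproduced in these notes); the formula itself is the usual one: alpha = k/(k-1) * (1 - sum of item variances / variance of totals).

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per indicator, all over the same people."""
    k = len(items)
    item_vars = sum(pvariance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # each person's total
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Three hypothetical indicators measured on five people:
items = [
    [1, 2, 3, 4, 4],
    [1, 2, 4, 4, 3],
    [2, 1, 3, 4, 4],
]
print(round(cronbach_alpha(items), 2))
```

Because these three invented indicators move together closely, the resulting alpha is high; weakly related items would drive it toward 0.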
Reasons for low observed reliability:
VALIDITY (of indicators)
Definition- are we really measuring what we think we are measuring?
Types of validity tests:
1) Face Validity- a researcher might believe that an indicator is so well established and obvious that, on its face, it appears to be valid. An example is a ruler: we don't question its accuracy, we just use it. Are there any such obvious indicators in political science? What about race, sex, age, or income? Even those are increasingly complicated concepts, given multi-racial people, transgender people, people young at heart, and so on.
2) Construct (Criterion) Validity- relate your questionable indicator to more well-established indicators, and see whether it behaves as you would expect.
CONSTRUCT VALIDITY
Questionable Indicator is Party Identification

Well-Established Indicators | Strong Dem | Weak Dem | Indep. Dem. | Pure Indep. | Indep. Rep. | Weak Rep. | Strong Rep.
Pres. Vote:                 |            |          |             |             |             |           |
  1984-1992                 |    13%     |   54%    |     49%     |     77%     |     95%     |    91%    |    95%
  1996-2004                 |     7      |   32     |     22      |     58      |     87      |    92     |    94
  1988                      |    13      |   46     |     52      |     68      |     94      |    87     |    98
  1996                      |     7      |   26     |     23      |     45      |     90      |    84     |    92
  2004                      |     7      |   40     |     15      |     69      |     86      |    96     |    97
Senate Vote:                |            |          |             |             |             |           |
  1984-1994                 |    29%     |   54%    |     55%     |     80%     |     86%     |    80%    |    92%
  1994                      |    50      |   73     |     77      |     80      |     93      |    93     |    94
  2014                      |    13      |   30     |     40      |     67      |     79      |    97     |    98
Note: The table above actually provides 8 tests of construct
validity for the party identification indicator, since each row is a separate
test relating party identification to the more well-established vote indicator
(in different years and for different offices). The cell entries are the percentage vote for the
Republican candidate among each of the seven party identification categories.
These data are from the Mississippi Poll.
Our expectations are that the percentage Republican vote would
increase steadily as one moves from the most Democratic party identification
category of Strong Democrat to the most Republican party identification
category of Strong Republican. Examining the 1988 presidential vote indicator,
we see a steady increase in Republican vote as we move from Strong Dem. to
Strong Rep. with one set of exceptions.
Only 87% of Weak Republicans voted for Republican Bush, while 94% of
Independent Republicans voted for Bush. Those two categories should have
reversed percentages, so draw one circle around both of those cells, since they involve
validity problems with the party identification indicator. Examine the 1996
presidential vote and you find two sets of validity problems among Democrats
and Republicans. So you have two circles, where you circle the pairs of
adjacent categories that should be reversed.
Repeat this validity test for the other vote indicators, and
discuss the validity problems with the party identification indicator that you
find.
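The "circle the reversed pairs" exercise can be automated. A sketch (the function and category names are mine, not from the text) that flags each adjacent pair of categories violating the expected steady increase in Republican voting from Strong Democrat to Strong Republican:

```python
CATEGORIES = ["Strong Dem", "Weak Dem", "Indep Dem", "Pure Indep",
              "Indep Rep", "Weak Rep", "Strong Rep"]

def reversed_pairs(row, increasing=True):
    """Return the adjacent category pairs whose percentages are out of order."""
    flags = []
    for i in range(len(row) - 1):
        out_of_order = row[i] > row[i + 1] if increasing else row[i] < row[i + 1]
        if out_of_order:
            flags.append((CATEGORIES[i], CATEGORIES[i + 1]))
    return flags

# 1988 presidential vote row from the table above (% voting Republican):
print(reversed_pairs([13, 46, 52, 68, 94, 87, 98]))
# flags the Indep Rep / Weak Rep reversal discussed in the text
```

Running the 1996 row ([7, 26, 23, 45, 90, 84, 92]) flags two reversals, one on the Democratic side and one on the Republican side, matching the discussion above.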
Now, answer the following practice test question:

(10 points) This is a Construct or Criterion Validity Test.

                                                        | Strong Dem | Weak Dem | Indep Dem | Pure Indep | Indep Rep | Weak Rep | Strong Rep
% favoring affirmative action                           |    55%     |   29%    |    35%    |    21%     |    15%    |   13%    |    12%
% opposing any legal recognition of gay couples         |    20%     |   32%    |    29%    |    33%     |    45%    |   50%    |    78%
% pro-life on abortion question                         |    41%     |   52%    |    46%    |    54%     |    61%    |   57%    |    71%
% favoring the death penalty for murder                 |    35%     |   48%    |    47%    |    58%     |    67%    |   70%    |    72%
% favoring the federal gov’t providing jobs and welfare |    87%     |   77%    |    75%    |    65%     |    48%    |   46%    |    41%
It is drawn from the Mississippi Poll pooled dataset (with a few minor revisions). It relates the questionable 7-category party identification indicator to five well-established issue items. For each row, each cell entry is the percentage of people with that partisan orientation (Strong Democrat, Weak Democrat, etc.) who agree with the opinion at the extreme left of that row. For example, in the first row and column, 55% of Strong Democrats favor affirmative action (and 45% therefore oppose it). Also note that the five well-established issue items are sometimes liberal policies and sometimes conservative policies. The highest percentages will therefore sometimes be in the Strong Democrat category and sometimes in the Strong Republican category, so the percentages will either decrease or increase going across a row depending on the ideology of the policy.
Remember that party
identification is measured at the ordinal level of measurement, so the validity
question that we are addressing is whether this party indicator is actually a
valid ordinal indicator. For each row, just circle each pair of adjacent
categories that exhibits a validity problem (in terms of ordinality of the
questionable indicator). No other answer is needed.
3) Convergent-Discriminant Validity Test- different measures of the same concept should yield similar results, while measures of different concepts should yield different results. Examine a correlation matrix.
Convergent-discriminant validity tests help determine whether your multiple indicators of one concept are actually measuring only one concept, or whether they are measuring more than one concept (a multi-dimensional concept). Generate a correlation matrix as shown below, and remember that the correlations range from 0 for no relationship to 1 for the strongest relationship. Then pick out the highest correlations in order of their size.
The
following example uses 10 state government spending items listed below, as
measured in the pooled Mississippi Poll datasets from 2004, 2006, 2008, and
2010.
CORRELATION MATRIX, SPENDING ITEMS, 2004-2010

               | Poor | Health Care | K-12 | Universities | Day Care | Environment | Tourism | Industry | Roads
Poor           |  -   |             |      |              |          |             |         |          |
Health Care    | .45  |      -      |      |              |          |             |         |          |
K-12 Education | .32  |     .31     |  -   |              |          |             |         |          |
Universities   | .23  |     .41     | .39  |      -       |          |             |         |          |
Day Care       | .37  |     .44     | .33  |     .35      |    -     |             |         |          |
Environment    | .32  |     .27     | .23  |     .21      |   .28    |      -      |         |          |
Tourism        | .01  |     .06     | .11  |     .10      |   .12    |     .08     |    -    |          |
Industry       | .11  |     .19     | .10  |     .19      |   .09    |     .16     |   .23   |    -     |
Roads          | .17  |     .15     | .17  |     .17      |   .21    |     .13     |   .15   |   .17    |  -
Police         | .13  |     .16     | .16  |     .13      |   .18    |     .17     |   .13   |   .15    | .21
How many dimensions do you get? I'd say three. What do they pertain to? I'd say social welfare (poor, health care, K-12 education, universities, day care, environment), economic development (tourism, industry), and public safety (roads, police). What are the intra-cluster item correlations? For social welfare, take the average of the 15 correlations among these 6 items; this value is 4.91/15 = .33. For economic development, it is .23. For public safety, it is .21. The inter-cluster correlations are: social welfare-economic development is 1.32/12 = .11; social welfare-public safety is 1.93/12 = .16; and economic development-public safety is (.15 + .17 + .13 + .15)/4 = .60/4 = .15. So you can see that with real-world data, we are far from the ideal 1.0 correlation. But at least the intra-cluster correlations (.33, .23, .21) are higher than the inter-cluster correlations (.11, .16, .15).
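These averages can be reproduced directly from the lower triangle of the matrix. A sketch (the data layout and function names are mine; the correlation values come from the table above):

```python
from itertools import combinations

# Items in table order: Poor, Health Care, K-12, Universities, Day Care,
# Environment, Tourism, Industry, Roads, Police.
LOWER = [
    [],                                            # Poor
    [.45],                                         # Health Care
    [.32, .31],                                    # K-12 Education
    [.23, .41, .39],                               # Universities
    [.37, .44, .33, .35],                          # Day Care
    [.32, .27, .23, .21, .28],                     # Environment
    [.01, .06, .11, .10, .12, .08],                # Tourism
    [.11, .19, .10, .19, .09, .16, .23],           # Industry
    [.17, .15, .17, .17, .21, .13, .15, .17],      # Roads
    [.13, .16, .16, .13, .18, .17, .13, .15, .21], # Police
]

def corr(i, j):
    """Order-free lookup into the lower triangle."""
    return LOWER[max(i, j)][min(i, j)]

def intra(cluster):
    """Average correlation among all pairs within one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(corr(i, j) for i, j in pairs) / len(pairs)

def inter(a, b):
    """Average correlation across all pairs spanning two clusters."""
    return sum(corr(i, j) for i in a for j in b) / (len(a) * len(b))

welfare = [0, 1, 2, 3, 4, 5]   # poor ... environment
econ = [6, 7]                  # tourism, industry
safety = [8, 9]                # roads, police

print(round(intra(welfare), 2), round(intra(econ), 2), round(intra(safety), 2))
print(round(inter(welfare, econ), 2), round(inter(welfare, safety), 2),
      round(inter(econ, safety), 2))
```

The first line prints the intra-cluster averages (.33, .23, .21) and the second the inter-cluster averages (.11, .16, .15), matching the hand calculation above.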
4) Factor Analysis- can be used as a validity test of whether a concept is multi-dimensional.
A
2004 health care example. The six relevant items were subjected to a Principal
Components Factor Analysis with Varimax Rotation. Only 457 of the 523
respondents were analyzed, since others lacked responses on one or more of the
six items. Thus, 13% of the respondents were excluded from this factor
analysis. Only one factor emerged, explaining 51% of the variance in all six
items. Other factors explained less of the variance than each item did, so they
were dropped from the analysis. The factor loadings for each item ranged from a
low of .66 for public education to encourage nutrition and exercise to a high
of .78 for providing health care for adults who can't afford it.
The Component Matrix and the Component 1 factor loading scores follow. Extraction method: Principal Component Analysis; 1 component extracted.
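For intuition only, the extraction step can be sketched with NumPy on a made-up 3x3 correlation matrix (the actual six-item matrix is not reproduced in these notes). Loadings on the first component are its eigenvector scaled by the square root of its eigenvalue:

```python
import numpy as np

# Hypothetical correlation matrix for three items (illustration only):
R = np.array([
    [1.00, 0.50, 0.40],
    [0.50, 1.00, 0.45],
    [0.40, 0.45, 1.00],
])
eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings on the first component = eigenvector * sqrt(eigenvalue).
loadings = eigvecs[:, 0] * np.sqrt(eigvals[0])
explained = eigvals[0] / eigvals.sum()      # share of variance explained
print(np.abs(loadings), round(float(explained), 2))
```

With these fairly strong intercorrelations, one dominant component emerges that explains well over a third of the variance, which is the pattern described for the six health care items.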
These results suggest that it is valid to combine these six health care importance indicators into one scale measuring one dimension. If we had included the third health care item, on the importance of recruiting and retaining doctors in Mississippi, we would still have ended up with one dimension, but that item's loading on the factor was only .47, clearly the lowest of the factor loadings. This suggests that the item does not measure the one dimension very well, so we excluded it from the scale. Don't worry about this or some of the other more complex reliability and validity tests, since they are designed more for a graduate school education. But do focus on the test-retest reliability method, the construct validity method, and the practice test questions.