WEEK 4: RELIABILITY AND VALIDITY

 

RELIABILITY (of indicators)

Definition- repeated measurements of a concept, using the same indicator, should yield similar results.

Tests of reliability:

1) Test-Retest- using the same indicator on the same people at two or more time points. You should have consistent responses at both time points.

TEST-RETEST RELIABILITY TEST OF PARTY IDENTIFICATION

(Note: the following table is derived from Herbert B. Asher's Presidential Elections and American Politics, 5th edition, page 71; Brooks/Cole co., 1992)

                         1976 Partisanship
1972 Party Id    Strong  Weak  Indep.  Pure    Indep.  Weak  Strong
                 Dem.    Dem.  Dem.    Indep.  Rep.    Rep.  Rep.

Strong Dem.         9      4      1       0       0      0      0
Weak Dem.           5     13      3       2       1      1      0
Indep. Dem.         2      3      4       1       1      0      0
Pure Indep.         1      1      2       5       2      1      0
Indep. Rep.         1      0      1       3       5      2      1
Weak Rep.           0      1      0       1       3      7      2
Strong Rep.         0      0      0       0       1      4      6

First of all, how do you read such a table? Each cell indicates the number of people who gave that pair of responses to the 7-category party identification indicator at the two time points. The total number of people in the table is 100; all of them were asked their party identification in 1972, and the same people were asked their partisanship again in 1976. Therefore, the 9 people in the top leftmost cell were strong Democrats in both 1972 and 1976. The 4 people in the cell next to them were strong Democrats in 1972 but had become weak Democrats by 1976.

How much stability (consistency) is there in this table? How many people gave the same response at both time points? The cells in the diagonal running from the top left to the bottom right of the table contain the people who gave the same response both times. Count the number of people in the diagonal: 9 + 13 + 4 + 5 + 5 + 7 + 6 = 49. The total number of people in the table is 100, so 49% of the sample remained stable in their attitudes. Is 49% a high or a low reliability score? The stable percentage must be compared to what chance alone would produce. Chance stability is the number of stable (diagonal) cells divided by the total number of cells in the table: 7 / 49 = 14%. Since 49% is substantially higher than 14%, this indicator is reliable.
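The diagonal and chance calculations above can be checked in a few lines of Python. The matrix values are copied from the test-retest crosstab above:

```python
# Test-retest stability check for the 7x7 party identification table.
# Rows are 1972 responses; columns are 1976 responses.
table = [
    [9,  4, 1, 0, 0, 0, 0],   # Strong Dem.
    [5, 13, 3, 2, 1, 1, 0],   # Weak Dem.
    [2,  3, 4, 1, 1, 0, 0],   # Indep. Dem.
    [1,  1, 2, 5, 2, 1, 0],   # Pure Indep.
    [1,  0, 1, 3, 5, 2, 1],   # Indep. Rep.
    [0,  1, 0, 1, 3, 7, 2],   # Weak Rep.
    [0,  0, 0, 0, 1, 4, 6],   # Strong Rep.
]

def stability(table):
    """Return (observed stability, chance stability) as proportions."""
    n = sum(sum(row) for row in table)                   # total respondents
    diag = sum(table[i][i] for i in range(len(table)))   # consistent responses
    k = len(table)                                       # number of categories
    observed = diag / n          # share giving the same answer at both times
    chance = k / (k * k)         # stable cells / total cells
    return observed, chance

obs, ch = stability(table)
print(f"Observed stability: {obs:.0%}, chance stability: {ch:.0%}")
# → Observed stability: 49%, chance stability: 14%
```

The same function works for any square test-retest table, such as the 3x3 examples below.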

Another example of test-retest reliability:

 

                        DEMOCRATS   INDEPENDENTS   REPUBLICANS
                        in 1982     in 1982        in 1982

DEMOCRATS in 1997          23            5              4
INDEPENDENTS in 1997        7           27             10
REPUBLICANS in 1997         2            5             17

(Source of this info is: Political Behavior of the American Electorate, 12th edition, by William H. Flanigan and Nancy Zingale, p. 104; data originally are from the Youth-Parent Socialization Panel Study, 1965-1997, Youth Wave, data provided by the ICPSR).

There were 100 people surveyed. How many people kept the same partisanship at both time points? Add up 23 + 27 + 17 = 67. What percentage of stability do you have? 67/100 = 67%. What is chance stability? 3 stable cells / 9 total cells = 33%. Is actual stability substantially greater than chance alone? Yes, since 67% is much higher than 33%. So party identification remains a reliable indicator.

Two examples of a possible test question:

Question. (10 points) A public administration graduate student was hired to test the reliability of a Mississippi city government’s indicator of job satisfaction of its municipal workers. 500 workers of the city were surveyed in October of 2023, and the same people were re-surveyed in December of 2023. She conducts a test-retest of the job satisfaction indicator’s reliability, and obtains the following table of workers’ responses at the two time points. Is this a reliable indicator? Yes or no? How could you tell? Be specific, and include the definition of reliability in your answer.

 

                          JOB SATISFACTION LEVEL IN OCTOBER 2023
Job satisfaction
in December 2023        Dissatisfied   Mixed   Satisfied

Dissatisfied                 85          10        10
Mixed                        10         130        20
Satisfied                     5          10       220

 

Another question. (10 points) A department head of a state agency is concerned about the extent to which agency workers are satisfied with their jobs. She creates a worker satisfaction scale that seeks to measure the extent to which her employees are satisfied with their jobs. It ranges from 1 for low satisfaction to 3 for high satisfaction, with 2 being a medium level of satisfaction. She then has an independent company (which guarantees the employees anonymity) conduct a test‑retest of the scale’s reliability by determining workers’ scores on the scale in September 2023 and then December 2023. One hundred workers were measured at both time points.

NOTE: Cell entries in the table below are the numbers of workers who gave the listed combination of responses in September and December 2023.

Examining the table below, is this worker satisfaction scale a "reliable" indicator? Why or why not? Defend your answer mathematically, and by relying on the verbal definition of reliability.

 

                   WORKER SATISFACTION SCORES IN SEPTEMBER 2023
WORKER SATISFACTION
SCORES IN DECEMBER 2023      LOW   MEDIUM   HIGH

LOW                           15     10       5
MEDIUM                        25     10      10
HIGH                          10     10       5

 

2) Alternate Forms (Parallel Forms) reliability test- using two or more indicators (that measure the same concept) on the same people at one time point. You should have consistent responses for both indicators, since they are different measures of the same concept. In the 2002 Mississippi Poll, we not only asked respondents their party identifications, but we also asked them which (if either) major party was best for people like themselves. We got the following responses:

ALTERNATE FORMS TEST

                             2002 Party Identification
Party that is best for
"People like you"        Democratic   Independent   Republican

Democrats                   172            51             7
Both are Equal               18            40            29
Republicans                   6            39           157

Consistent responses for both indicators are Democrats who believe that the Democratic party is best for people like themselves, Republicans who believe that the Republican party is best for people like themselves, and Independents who believe that both parties are equally good (or bad) for people like themselves. The number of consistent responses is (172 + 40 + 157) = 369.

The total number of people in the table is 519. The percentage of people who give consistent responses is: 369 / 519 = 71%. How reliable is the party identification indicator compared to chance alone? Chance is the number of consistent cells divided by the total number of cells: 3 / 9 = 33%. Since 71% is significantly greater than 33%, the party identification indicator is reliable.

3) Split Half- using multiple indicators of a concept on the same people at one time point. The researcher forms two scales, each combining people's responses on half of the indicators. Each person's scores on the two scales should be consistent.

A health care example. In 2004 the Mississippi Poll included seven questions about how important people thought a number of health care issues were, rated from 1 for Very Important to 4 for Not Important. An item on recruiting and retaining doctors was not highly related to the other six items, so we excluded it from the analysis. The other six items were:

These six indicators were divided into two groups: Group A included items 1, 3, and 5; Group B included items 2, 4, and 6. Responses to the three items in each group were added together to create a scale for each group. Since each item is coded from 1 to 4, each group's scale ranges from 3 to 12. The Pearson correlation between the two group scales is .71, which is quite respectable, since correlation coefficients range from 0 (no relationship) to 1.0 (a perfect relationship) in absolute value.
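The half-scale construction and Pearson correlation described above can be sketched in Python. The respondent data below is invented for illustration; only the Group A / Group B split (items 1, 3, and 5 versus 2, 4, and 6) comes from the text:

```python
# Split-half sketch: build two half-scales from six items and correlate them.
import math

responses = [  # each row: one respondent's answers to items 1-6, coded 1-4
    [1, 1, 2, 1, 1, 2],
    [2, 2, 2, 3, 2, 2],
    [4, 3, 4, 4, 3, 4],
    [1, 2, 1, 1, 2, 1],
    [3, 3, 2, 3, 4, 3],
]

# Group A = items 1, 3, 5 (indices 0, 2, 4); Group B = items 2, 4, 6.
scale_a = [r[0] + r[2] + r[4] for r in responses]  # each scale ranges 3-12
scale_b = [r[1] + r[3] + r[5] for r in responses]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"Split-half correlation: {pearson(scale_a, scale_b):.2f}")
```

A high correlation between the two half-scales indicates that the items are measuring the same concept consistently.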

Another way of testing consistency is with a crosstabulation. Looking at the frequency distributions of each scale, I combined each scale's codes as follows: 3 and 4 were coded as High Priority; 5 and 6 were coded as Medium; 7 through 12 were coded as Low Priority. The crosstabulation follows:

 

 

SPLIT HALF EXAMPLE

                    Group A Scale
Group B Scale    High   Medium   Low

High              141      25      1
Medium             81     104     29
Low                 6      24     47

Notice that 292 people (141 + 104 + 47) gave consistent responses to both of the scales. They fall into the diagonal, being high-high, medium-medium, or low-low. The total number of people in the table is 458. Therefore, 292/458 people gave consistent responses, or 64% of the sample. Chance alone would predict about one-third or 33%. So the six indicators of the importance of health care demonstrate some reliability.

4) Cronbach's Alpha- used for multi-indicator indexes, it measures how reliably the component indicators hang together. Cronbach’s Alpha ranges from 0 for completely unreliable to 1 for perfectly reliable. The Cronbach's Alpha for the six health care items in the 2004 Mississippi Poll analysis discussed above was .80.
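Cronbach's Alpha has a simple closed form: alpha = k/(k-1) x (1 - sum of the item variances / variance of the summed scale), where k is the number of items. A minimal Python sketch, using invented respondent data rather than the 2004 poll:

```python
# Cronbach's alpha for a multi-item index.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))

def cronbach_alpha(rows):
    """rows: one list of item scores per respondent."""
    k = len(rows[0])                       # number of items
    def var(xs):                           # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([r[i] for r in rows]) for i in range(k)]
    total_var = var([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Invented data: 5 respondents answering 4 items coded 1-4.
rows = [
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [1, 1, 2, 1],
    [3, 4, 3, 4],
]
print(f"alpha = {cronbach_alpha(rows):.2f}")
# → alpha = 0.95
```

When the items move together, the variance of the summed scale far exceeds the sum of the individual item variances, driving alpha toward 1.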

Reasons for low observed reliability:

VALIDITY (of indicators)

Definition- whether we are really measuring what we think we are measuring.

Types of validity tests:

1) Face Validity- a researcher may believe that an indicator is so well established and obvious that, on its face, it appears valid. An example is a ruler: we don’t question its accuracy, we just use it. Are there any such obvious indicators in political science? What about race, sex, age, or income? Even those are increasingly complicated concepts, given multi-racial people, transgender people, people young at heart, and so on.

2) Construct (Criterion) Validity- relate your questionable indicator to more well-established indicators, and see whether it behaves as you expect it to behave.

CONSTRUCT VALIDITY

Questionable Indicator is Party Identification

Well Established     Strong  Weak  Indep.  Pure    Indep.  Weak  Strong
Indicators           Dem.    Dem.  Dem.    Indep.  Rep.    Rep.  Rep.

Pres. Vote
  1984-1992           13%     54%    49%     77%     95%    91%    95%
  1996-2004            7      32     22      58      87     92     94
  1988                13      46     52      68      94     87     98
  1996                 7      26     23      45      90     84     92
  2004                 7      40     15      69      86     96     97

Senate Vote
  1984-1994           29%     54%    55%     80%     86%    80%    92%
  1994                50      73     77      80      93     93     94
  2014                13      30     40      67      79     97     98

Note: The table above actually provides 8 tests of construct validity for the party identification indicator, since each row is a separate test relating party identification to the more well-established vote indicator (in different years and for different offices). The cell entries are the percentage vote for the Republican candidate among each of the seven party identification categories. These data are from the Mississippi Poll.

Our expectation is that the percentage voting Republican should increase steadily as one moves from the most Democratic category (Strong Democrat) to the most Republican category (Strong Republican). Examining the 1988 presidential vote row, we see a steady increase in the Republican vote from Strong Dem. to Strong Rep. with one set of exceptions: only 87% of Weak Republicans voted for Republican Bush, while 94% of Independent Republicans did. Those two categories should have reversed percentages, so draw one circle around both of those cells, since they indicate a validity problem with the party identification indicator. Examine the 1996 presidential vote and you find two such problems, one among Democrats and one among Republicans; so you draw two circles, each around the pair of adjacent categories whose percentages should be reversed.
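The "circle the out-of-order pairs" exercise amounts to scanning adjacent categories for violations of the expected ordering. A sketch in Python, using the 1988 presidential vote row from the table above:

```python
# Flag adjacent party-ID categories whose Republican vote percentages are out
# of order -- these are the pairs of cells you would circle.
categories = ["Strong Dem", "Weak Dem", "Indep Dem", "Pure Indep",
              "Indep Rep", "Weak Rep", "Strong Rep"]
rep_vote_1988 = [13, 46, 52, 68, 94, 87, 98]  # % voting Republican, 1988 row

def ordinality_violations(values, expect_increasing=True):
    """Return index pairs where adjacent values break the expected ordering."""
    bad = []
    for i in range(len(values) - 1):
        if expect_increasing and values[i] > values[i + 1]:
            bad.append((i, i + 1))
        elif not expect_increasing and values[i] < values[i + 1]:
            bad.append((i, i + 1))
    return bad

for i, j in ordinality_violations(rep_vote_1988):
    print(f"Circle: {categories[i]} ({rep_vote_1988[i]}%) and "
          f"{categories[j]} ({rep_vote_1988[j]}%)")
# → Circle: Indep Rep (94%) and Weak Rep (87%)
```

Passing `expect_increasing=False` handles rows where the percentages should fall across the categories, as with the liberal policy items in the practice question below.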

Repeat this validity test for the other vote indicators, and discuss the validity problems with the party identification indicator that you find.

 

Now, answer the following practice test question:

(10 points) This is a Construct or Criterion Validity Test.

 

                           Strong  Weak  Indep.  Pure    Indep.  Weak  Strong
                           Dem.    Dem.  Dem.    Indep.  Rep.    Rep.  Rep.

% favoring affirmative
  action                     55%    29%    35%     21%     15%    13%    12%

% opposing any legal
  recognition of gay
  couples                    20%    32%    29%     33%     45%    50%    78%

% pro-life on abortion
  question                   41%    52%    46%     54%     61%    57%    71%

% favoring the death
  penalty for murder         35%    48%    47%     58%     67%    70%    72%

% favoring the federal
  gov't providing jobs
  and welfare                87%    77%    75%     65%     48%    46%    41%

 

It is drawn from the Mississippi Poll pooled dataset (with a few minor revisions). It relates the questionable 7-category indicator of party identification to five well-established issue items. In each row, each cell entry is the percentage of people with that partisan orientation (Strong Democrat, Weak Democrat, etc.) who agree with the opinion at the far left of that row. For example, in the first row and column, 55% of Strong Democrats favor affirmative action (and 45% therefore oppose it). Also note that the five well-established issue items include both liberal and conservative policies, so the highest percentage will sometimes fall in the Strong Democrat category and sometimes in the Strong Republican category; the percentages will either decrease or increase across a row depending on the ideology of the policy.

Remember that party identification is measured at the ordinal level of measurement, so the validity question that we are addressing is whether this party indicator is actually a valid ordinal indicator. For each row, just circle each pair of adjacent categories that exhibits a validity problem (in terms of ordinality of the questionable indicator). No other answer is needed.

3) Convergent-Discriminant Validity Test- different measures of the same concept should yield similar results; the same measures of different concepts should yield different results. Examine correlation matrix.

Convergent-discriminant validity tests help to determine whether your multiple indicators of one concept are actually measuring only one concept, or whether they are measuring more than one concept (a multi-dimensional concept). Generate a correlation matrix as indicated below, remembering that correlations range from 0 (no relationship) to 1.0 (the strongest relationship) in absolute value. Then pick out the highest correlations in order of size.

The following example uses 10 state government spending items listed below, as measured in the pooled Mississippi Poll datasets from 2004, 2006, 2008, and 2010.

CORRELATION MATRIX, SPENDING ITEMS, 2004-2010

 

               Poor  Health  K-12   Univer-  Day   Environ-  Tour-  Indus-  Roads
                     Care    Educ.  sities   Care  ment      ism    try

Poor             -
Health Care     .45     -
K-12 Educ.      .32    .31     -
Universities    .23    .41    .39      -
Day Care        .37    .44    .33    .35      -
Environment     .32    .27    .23    .21    .28      -
Tourism         .01    .06    .11    .10    .12    .08        -
Industry        .11    .19    .10    .19    .09    .16      .23       -
Roads           .17    .15    .17    .17    .21    .13      .15     .17       -
Police          .13    .16    .16    .13    .18    .17      .13     .15     .21

How many dimensions do you get? I'd say three. What do they pertain to? I'd say social welfare (poor, health care, K-12 education, universities, day care, environment), economic development (tourism, industry), and public safety (roads, police). What are the intra-cluster item correlations? For social welfare, take the average of the 15 correlations among these 6 items: 4.91 / 15 = .33. For economic development it is .23, and for public safety it is .21. The inter-cluster correlations are: social welfare-economic development, 1.32 / 12 = .11; social welfare-public safety, 1.93 / 12 = .16; economic development-public safety, (.15 + .17 + .13 + .15) / 4 = .60 / 4 = .15. So you can see that with real-world data we fall far short of the ideal intra-cluster correlation of 1.0. But at least the intra-cluster correlations (.33, .23, and .21) are all higher than the inter-cluster correlations (.11, .16, and .15).
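The intra- and inter-cluster averages worked out above can be computed directly from the lower triangle of the correlation matrix. A Python sketch (the cluster groupings match the three dimensions identified above):

```python
# Average intra- and inter-cluster correlations from the spending-item matrix.
items = ["Poor", "Health", "K12", "Univ", "DayCare", "Envir",
         "Tourism", "Industry", "Roads", "Police"]
lower = {  # lower triangle: each row's correlations with the items before it
    "Health":   [.45],
    "K12":      [.32, .31],
    "Univ":     [.23, .41, .39],
    "DayCare":  [.37, .44, .33, .35],
    "Envir":    [.32, .27, .23, .21, .28],
    "Tourism":  [.01, .06, .11, .10, .12, .08],
    "Industry": [.11, .19, .10, .19, .09, .16, .23],
    "Roads":    [.17, .15, .17, .17, .21, .13, .15, .17],
    "Police":   [.13, .16, .16, .13, .18, .17, .13, .15, .21],
}
corr = {}  # look up any pair regardless of order
for row, vals in lower.items():
    for col, r in zip(items, vals):
        corr[frozenset((row, col))] = r

def avg_corr(group_a, group_b=None):
    """Mean correlation within one cluster, or between two clusters."""
    if group_b is None:  # intra-cluster: all pairs within group_a
        pairs = [(a, b) for i, a in enumerate(group_a) for b in group_a[i+1:]]
    else:                # inter-cluster: all cross-group pairs
        pairs = [(a, b) for a in group_a for b in group_b]
    vals = [corr[frozenset(p)] for p in pairs]
    return sum(vals) / len(vals)

welfare = ["Poor", "Health", "K12", "Univ", "DayCare", "Envir"]
economy = ["Tourism", "Industry"]
safety = ["Roads", "Police"]
print(round(avg_corr(welfare), 2))           # intra: social welfare → .33
print(round(avg_corr(welfare, safety), 2))   # inter: welfare vs safety → .16
```

The convergent-discriminant criterion is met when every intra-cluster average exceeds every inter-cluster average, as it does here.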

4) Factor Analysis- can be used as a validity test for testing whether a concept is multi-dimensional.

A 2004 health care example. The six relevant items were subjected to a Principal Components Factor Analysis with Varimax Rotation. Only 457 of the 523 respondents were analyzed, since others lacked responses on one or more of the six items. Thus, 13% of the respondents were excluded from this factor analysis. Only one factor emerged, explaining 51% of the variance in all six items. Other factors explained less of the variance than each item did, so they were dropped from the analysis. The factor loadings for each item ranged from a low of .66 for public education to encourage nutrition and exercise to a high of .78 for providing health care for adults who can't afford it.

(Component Matrix with the Component 1 factor loading scores omitted. Extraction Method: Principal Component Analysis; 1 component extracted.)
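A principal-components extraction can be sketched as an eigendecomposition of the inter-item correlation matrix, with a component's loadings equal to its eigenvector scaled by the square root of its eigenvalue. The 6x6 matrix below is a hypothetical stand-in, not the actual 2004 Mississippi Poll correlations:

```python
# Sketch of principal-components extraction from a correlation matrix.
# R is a HYPOTHETICAL 6x6 inter-item correlation matrix for illustration.
import numpy as np

R = np.array([
    [1.00, 0.45, 0.40, 0.38, 0.42, 0.35],
    [0.45, 1.00, 0.44, 0.39, 0.41, 0.37],
    [0.40, 0.44, 1.00, 0.43, 0.38, 0.36],
    [0.38, 0.39, 0.43, 1.00, 0.40, 0.34],
    [0.42, 0.41, 0.38, 0.40, 1.00, 0.39],
    [0.35, 0.37, 0.36, 0.34, 0.39, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R)       # eigh handles symmetric matrices
order = np.argsort(eigvals)[::-1]          # components, largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings on component 1 = eigenvector scaled by sqrt(its eigenvalue).
# (An eigenvector's sign is arbitrary, so report absolute loadings.)
loadings = np.abs(eigvecs[:, 0] * np.sqrt(eigvals[0]))
explained = eigvals[0] / len(R)            # share of total variance explained
print(f"Component 1 explains {explained:.0%} of the variance")
print("Loadings:", np.round(loadings, 2))
```

With correlations this uniform, one dominant component emerges with all items loading on it at roughly similar levels, which is the pattern the 2004 health care analysis found.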

These results suggest that it is valid to combine these six health care importance indicators into one scale measuring a single dimension. If we had included the health care item on the importance of recruiting and retaining doctors in Mississippi, we would still have ended up with one dimension, but that item's loading on the factor was only .47, clearly the lowest of the loadings. This suggests that it does not measure the single dimension very well, which is why we excluded it from the scale. Don’t worry about this or some of the other more complex reliability and validity tests, since they are designed more for graduate study. But do focus on the test-retest reliability method, the construct validity method, and the practice test questions.

 

Here is a link to the actual use of one test of reliability and two tests of validity in one of my published articles on Health Care Attitudes in Mississippi.