FUNDAMENTAL MEASUREMENT IN SOCIAL SCIENCE AND EDUCATION
Benjamin D. Wright
Research Memorandum No. 33a
MESA Psychometric Laboratory
University of Chicago
March 30, 1983
No discussion of scientific method is complete without an argument for the importance of fundamental measurement - measurement of the kind characterizing length and weight. Yet few social scientists attempt to construct fundamental measures. This is not because social scientists disapprove of fundamental measurement. It is because they despair of obtaining it.
The conviction that fundamental measurement is unobtainable in social science and education has such a grip that we do not see that our despair is unnecessary. Fundamental measurement is not only obtainable in social science but, in an unaware and hence incomplete form, is widely relied on. Social scientists are practicing fundamental measurement without knowing it and hence without enjoying its benefits or building on its strengths.
The realization that fundamental measurements can be made in social science research is usually traced to Luce and Tukey (1964), who show that fundamental measurement can be constructed from an axiomatization of comparisons among responses to arbitrary pairs of quantities of two specified kinds. But Thurstone's 1927 Law of Comparative Judgement contains an equivalent idea, and his empirical work (e.g., 1928a, 1928b, 1929) contains results which are rough examples of fundamental measurement. Fundamental measurement also occurs in Bradley and Terry (1952) and Rasch (1958, 1960, 1966a, 1966b).
The fundamental measurement which follows from Rasch's 'specific objectivity' is developed in Rasch (1960, 1961, 1967, 1977). Rasch's specific objectivity and R. A. Fisher's estimation sufficiency are two sides of the same approach to inference. Andersen (1977) shows that the only measuring processes which support specific objectivity, and hence fundamental measurement, are those which have sufficient statistics for their parameters. It follows that sufficient statistics lead to and are necessary for fundamental measurement.
Several authors connect 'additive conjoint' fundamental measurement with Rasch's work (Keats, 1967, 1971; Fischer, 1968; Brogden, 1977). Perline, Wright and Wainer (1979) provide two empirical demonstrations of the equivalence of non-metric multidimensional scaling (Kruskal, 1964, 1965) and the Rasch process in realizing fundamental measurement. Wright and Stone (1979) show how to obtain fundamental measurement from mental tests. Wright and Masters (1982) give examples of its application to rating scales and partial credit scoring.
In spite of this considerable literature advancing, explaining and illustrating the successful application of fundamental measurement in social science research, most current psychometric practice is either unaware of the opportunity or considers it impractical.
MAINTAINING A UNIT
Thurstone says "The linear continuum which is implied in all measurement is always an abstraction. . . . All measurement implies the recreation or restatement of the attribute measured to an abstract linear form." And "There is a popular fallacy that a unit of measurement is a thing such as a piece of yardstick. This is not so. A unit of measurement is always a process of some kind which can be repeated without modification in the different parts of the measurement continuum" (Thurstone, 1931, 257).
Campbell (1920) specifies an addition operation as the hallmark of fundamental measurement. At bottom, it is maintaining a unit that supports addition. Let us see how this requirement can be met in psychological measurement. Rasch (1960, 171-172) shows that, if

    P = exp(b - d) / G,    where    G = 1 + exp(b - d),
is the way person ability b and item difficulty d combine to govern the probability of a successful outcome and, if Event AB is person A succeeding but person B failing on a particular item, while Event BA is person B succeeding but person A failing on the same item, then a distance between persons A and B on a scale defined by a set of items of a single kind can be estimated by

    b_A - b_B = log(N_AB) - log(N_BA)

where N_AB is the number of times A succeeds but B fails and N_BA is the number of times B succeeds but A fails on any subset of these items. This happens because, for Rasch's model,
    P_AB = P_A(1 - P_B) = exp(b_A - d) / (G_A G_B)

    P_BA = P_B(1 - P_A) = exp(b_B - d) / (G_A G_B)

so that d cancels out of

    P_AB / P_BA = exp(b_A - b_B)

leaving

    log(P_AB / P_BA) = b_A - b_B

a distance which holds regardless of the value of d. This result is equivalent to Case 5 of Thurstone's Law of Comparative Judgement of 1927 and to Bradley and Terry of 1952, and conforms to Luce and Tukey of 1964. Since d does not appear in this equation, estimates of the distance between A and B must be statistically equivalent whatever the item difficulty d. And since the unit defined by the distance between A and B holds over the range of the continuum defined by the values d can take, and is thus independent of d, Rasch's model for specifying measures is the unit-maintaining process Thurstone requires.
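To make this invariance concrete, here is a small simulation sketch in Python (the abilities, the item difficulties and the replication counts are invented for the illustration). It recovers b_A - b_B = 1.0 from log(N_AB / N_BA) twice, once on easy items and once on hard items, and the two estimates agree:

    import math
    import random

    random.seed(1)

    def p_success(b, d):
        # Rasch probability of success: P = exp(b - d) / [1 + exp(b - d)]
        return 1.0 / (1.0 + math.exp(-(b - d)))

    b_A, b_B = 1.5, 0.5            # invented "true" abilities, one logit apart

    def estimate_distance(difficulties):
        n_AB = n_BA = 0            # A succeeds, B fails; B succeeds, A fails
        for d in difficulties:
            x_A = random.random() < p_success(b_A, d)
            x_B = random.random() < p_success(b_B, d)
            if x_A and not x_B: n_AB += 1
            if x_B and not x_A: n_BA += 1
        return math.log(n_AB / n_BA)   # estimates b_A - b_B

    print(estimate_distance([-1.0] * 20000))   # easy items: near 1.0
    print(estimate_distance([ 2.0] * 20000))   # hard items: also near 1.0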
Whether a particular kind of data can be disciplined to follow the Rasch process can only be discovered by applying the process to the data and examining the consequences. It is worth noticing, however, that whenever we have deemed it useful to count right answers or to add scale ratings, we have taken it for granted that the data concerned did, in fact, follow the Rasch process well enough to suit our purposes. This is so because counts and additions are exactly the sufficient statistics for the Rasch process and for no other!
If we subscribe to Thurstone's requirement, then we want data that we can govern in this way. That means that fitting the Rasch process becomes more than a convenience; it becomes the essential criterion for data good enough to support the construction of fundamental measures. The Rasch process becomes the criterion for valid data.
VERIFYING FIT
How well do data have to fit the Rasch process in order to obtain fundamental measurement? The only reasonable or useful answer is: "Well enough to serve the practical problem for which the measures are intended, that is, well enough to maintain an invariance sufficient to serve the needs at hand."
How can we document the degree of invariance the Rasch process obtains with a particular set of data? One method is to specify subsets of items in any way that is substantively interesting but also independent of the particular person scores we have already examined (N_AB, N_BA) and then to see whether the new counts resulting from these item subsets estimate statistically equivalent distances. The extent to which the distance between A and B is invariant over challenging partitions of items is the extent to which the data succeed in making use of the Rasch process to maintain a unit.
A more general way to examine and document fit is to compose for each response x = 0 or 1 the score residual

    y = x - P

in which

    P = exp(b - d) / [1 + exp(b - d)]

comes from the current estimates of person ability b and item difficulty d, so that E(x) = P, and then to accumulate these score residuals over the item subsets chosen to challenge fit. If (b1 - b0) is defined as the extent to which a subset of items fails to maintain the unit constructed by the full set of items, then that subset score residual sum(y) estimates (b1 - b0) sum(dy/db).
When the data fit the Rasch process, then the differential of y with respect to b,

    dy/db = dP/db = P(1 - P) = w,

equals the score variance, so that

    sum(y) =~ (b1 - b0) sum(w)

and

    (b1 - b0) =~ sum(y) / sum(w) = g
The statistic g = sum(y)/sum(w) estimates the logit discrepancy in scale invariance (b1 - b0) due to the item subset specified, with E(g) = 0 and V(g) = 1/sum(w) when the data fit this unit-maintaining, i.e. Rasch, process. Subsets need not be limited to items. Groups of persons can be used to review the extent to which any item is vulnerable to bias for or against the type of persons grouped. In general, any combination of items and persons thought to interact in a way that might interfere with the unit-maintaining process can be used to define a subset for calculating g. The resulting value of g estimates the direction and logit magnitude of the putative disturbance to scale invariance. The stability of any particular value of g can be evaluated from the root of its modeled variance, V(g) = 1/sum(w).
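A sketch of this calculation in Python (the estimates and responses shown are invented for the illustration):

    import math

    def p_success(b, d):
        return 1.0 / (1.0 + math.exp(-(b - d)))

    def g_statistic(subset):
        # subset: list of (b, d, x) triples of current estimates and a response
        sum_y = sum(x - p_success(b, d) for b, d, x in subset)
        sum_w = sum(p_success(b, d) * (1 - p_success(b, d)) for b, d, _ in subset)
        g = sum_y / sum_w              # estimated logit disturbance (b1 - b0)
        se = math.sqrt(1.0 / sum_w)    # root of the modeled variance V(g) = 1/sum(w)
        return g, se

    # e.g. the responses of one group of persons to an item suspected of bias:
    subset = [(0.8, 0.2, 1), (1.1, 0.2, 1), (-0.3, 0.2, 0), (0.5, 0.2, 1)]
    g, se = g_statistic(subset)
    print(g, se)                       # |g| well beyond se signals misfit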
CONSTRUCTING ADDITION
The way to build a linear scale is to construct an addition operation. This can be done by finding an operation which answers the question: "If person A has more ability than person B, then how much 'ability' must be added to B to make the performance of B appear the same as the performance of A?" To be more specific:

    "What 'addition' will cause P_B = P_A?"

To answer this question we must realize that the only situation in which we can observe these P's is the one in which we expose the persons to items of the specified kind. This changes the question to: "What change in the situation through which we find out about persons by testing them with items will give B the same probability of success as A?" In other words:

    "What 'addition' will cause P_Bj = P_Ai?"
Or, to be explicit: "What item j will make the performance of person B appear the same as the performance of person A on item i?" The Rasch process specifies that when P_Bj = P_Ai then

    b_B - d_j = b_A - d_i

The 'addition' required to cause B to perform like A is

    b_B + (b_A - b_B) = b_A

The way this 'addition' is accomplished is to give person B an item j which is

    d_i - d_j = b_A - b_B

easier than item i, namely, an item j with difficulty

    d_j = d_i - (b_A - b_B)

so that

    b_B + (b_A - b_B) = b_B + (d_i - d_j) = b_A

The way the success of this 'addition' is evaluated is to see whether the performance of person B on items like j is observed to be statistically equivalent to the performance of person A on items like i. This, in fact, is the comparison checked in any detailed test of fit.
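A numerical check in Python (the values are invented for the illustration) verifies that this shift equates the two probabilities exactly:

    import math

    def p_success(b, d):
        return 1.0 / (1.0 + math.exp(-(b - d)))

    b_A, b_B = 2.0, 0.7
    d_i = 1.2
    d_j = d_i - (b_A - b_B)        # item j is (b_A - b_B) logits easier

    print(p_success(b_A, d_i))     # P_Ai = 0.6899...
    print(p_success(b_B, d_j))     # P_Bj: the same, since b_B - d_j = b_A - d_i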
CURRENT PRACTICE
It has long been customary in social science research to construct scores by counting answers (scored by their ordinal position in a sequence of ordered response possibilities) and then to use these scores and monotonic transformations of them as measures. When the questions asked have only two answer categories, then we count right answers. When the questions have an ordered series of answer categories, then we count how many categories from 'least' to 'most' ('worst' to 'best', 'weakest' to 'strongest') have been surpassed. There is scarcely any quantitative data in social science research not already in this form or easily put so.
If there has been any progress in quantitative social science, then this kind of counting must have been useful. But this has implications. Counting in this way implies a measurement process, not just any process, but the particular one for which counting is the necessary and sufficient scoring procedure. Well, counting is exactly the sufficient statistic for estimating measures under the Rasch process. Since the Rasch process constructs simultaneous conjoint measures whenever data are valid for such a construction, we have, in our counting, been practicing the first steps of fundamental measurement all along. All we need do is take this implication of our actions seriously and complete our data analyses by verifying the extent to which our data fit the Rasch process and so are valid for fundamental measuring. When our data can be organized to fit well enough to be useful, then we can use the results to define Thurstone linear scales and to make Luce and Tukey fundamental measures on them.
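The sufficiency of the count can be checked directly: under the Rasch process the probability of any particular response pattern, given the raw score, is the same whatever the person's ability, so nothing in the pattern beyond the count carries information about b. A sketch in Python (the item difficulties are invented for the illustration):

    import math
    from itertools import combinations

    d = [-1.0, 0.0, 0.5, 1.5]          # invented difficulties of four items

    def p_success(b, di):
        return 1.0 / (1.0 + math.exp(-(b - di)))

    def p_pattern(b, pattern):
        return math.prod(p_success(b, di) if x else 1 - p_success(b, di)
                         for x, di in zip(pattern, d))

    def p_score(b, r):
        # probability of raw score r: sum over all patterns with r successes
        total = 0.0
        for idx in combinations(range(len(d)), r):
            total += p_pattern(b, [1 if i in idx else 0 for i in range(len(d))])
        return total

    pattern = [1, 1, 0, 0]             # one pattern with raw score r = 2
    for b in (-1.0, 0.0, 2.0):
        print(p_pattern(b, pattern) / p_score(b, 2))   # identical for every b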
WHAT OF OTHER MODELS?
The Rasch process maintains a unit that supports addition. Is that so for the other processes advocated for the construction of psychological measurement systems? Consider the three item parameter process (Lord, 1980, 12):

    Q = c + (1 - c)P            1 - Q = (1 - c)(1 - P)

    P = exp[a(b - d)] / G       G = 1 + exp[a(b - d)]
Now

    Q_AB / Q_BA = Q_A(1 - Q_B) / [Q_B(1 - Q_A)]

                = [c(1 - P_B) + (1 - c)P_A(1 - P_B)] / [c(1 - P_A) + (1 - c)P_B(1 - P_A)]
Is there any way to cancel the three item parameters out of this expression in order to maintain a unit among b's over the range of the item parameters? Is there any way to cancel b out of this expression in order to enable a sample-free estimation of the item parameters?
If c were a single constant known beforehand and always the same for all items, no matter how much persons differed in their guessing behavior, then we could use

    (Q - c) / (1 - Q) = P / (1 - P)

to eliminate the influence of this one common c and so concentrate on the problems caused by the interaction of b with a. But when c varies from item to item, then, even if its various values were known, the differential consequences of b variation on [c/(1 - c)](1 - P_B) versus [c/(1 - c)](1 - P_A) would prevent the Q process from maintaining a fixed distance between persons A and B over the range of d and c. Nor can we construct an addition for the Q process. What shall we 'add' to b_B to cause person B to perform like person A, that is, to cause Q_Bj = Q_Ai?
There is no single 'amount' to add, because the amount called for varies with the varying values of c and a. If we abandon c as a variable, then

    P_AB / P_BA = exp[a(b_A - d)] / exp[a(b_B - d)]

and

    log(P_AB / P_BA) = a(b_A - b_B)

The item parameter d is gone, so that a(b_A - b_B) is maintained over the range of d. But what shall we do with a? If we advance a as an item parameter, then we have to estimate a different unit for every item. The distance between A and B can only be maintained if every a for every item can be known independently of every b to be compared. But that prevents us from using the behavior of persons to estimate the values of a. This happens because when we try to estimate a we find that we cannot separate it from its interactions with the estimation of the b's used for its estimation. When we try to estimate these b's we find that we cannot separate them from their interactions with a. We can maintain the distance between A and B only when a is constant over persons and items, that is, when we are back to the Rasch process. Nor can the process which includes a as a variable support addition. When

    P = exp[a(b - d)] / {1 + exp[a(b - d)]}

then P_Bj = P_Ai implies that

    a_j(b_B - d_j) = a_i(b_A - d_i)

so that

    b_A = d_i + (a_j/a_i)(b_B - d_j)
We see that an 'addition' which will equate the performances of persons A and B is defined in general only over persons and items for which a is a constant, so that (a_j/a_i) = 1 and

    b_A = b_B + (d_i - d_j)

as in the Rasch process.
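A numerical contrast in Python (the parameter values are invented for the illustration) displays the difference: the Rasch log-odds between persons A and B is the same for every item, while the corresponding log-odds under the three item parameter process shifts as d, a and c vary:

    import math

    def rasch(b, d):
        return 1.0 / (1.0 + math.exp(-(b - d)))

    def three_parameter(b, d, a, c):
        return c + (1 - c) / (1.0 + math.exp(-a * (b - d)))

    b_A, b_B = 1.5, 0.5
    for d, a, c in [(-1.0, 0.7, 0.25), (0.5, 1.0, 0.20), (2.0, 1.8, 0.10)]:
        P_A, P_B = rasch(b_A, d), rasch(b_B, d)
        Q_A, Q_B = three_parameter(b_A, d, a, c), three_parameter(b_B, d, a, c)
        print(math.log(P_A * (1 - P_B) / (P_B * (1 - P_A))),   # always 1.0
              math.log(Q_A * (1 - Q_B) / (Q_B * (1 - Q_A))))   # drifts with d, a, c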
CONCLUSION
If measurement is our aim, nothing can be gained by chasing after extra item parameters like c and a. We must seek, instead, items which can be managed by an observation process in which any potentially misleading disturbances that might be blamed on variation in possible c's and a's are kept slight enough not to interfere with the maintenance of a scale stability sufficient for the measuring job at hand.
That we have been content to use unweighted raw scores, just the count of right answers, as our 'good enough' statistic for all these eighty years, testifies to our latent conviction that the data with which we work can be usefully managed with a process no more complicated than the Rasch process. A good thing too! Only the Rasch process can maintain units that support addition and so produce results that qualify as fundamental measurement.
REFERENCES
Andersen, E.B. Sufficient statistics and latent trait models. Psychometrika, 1977, 42, 69-81.
Bradley, R.A. and Terry, M.E. Rank analysis of incomplete block designs I: The method of paired comparisons. Biometrika, 1952, 39, 324-345.
Brogden, H.E. The Rasch model, the law of comparative judgement and additive conjoint measurement. Psychometrika, 1977, 42, 631-634.
Campbell, N.R. Physics: The elements. London: Cambridge University Press,
1920.
Fischer, G. Psychologische Testtheorie. Bern: Huber, 1968.
Keats, J.A. Test theory. Annual Review of Psychology, 1967, 18, 217-238.
Keats, J.A. An Introduction to Quantitative Psychology. Sydney: John Wiley,
1971.
Kruskal, J.B. Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypothesis. Psychometrika, 1964, 29, 1-27.
Kruskal, J.B. Analysis of factorial experiments by estimating monotone transformations of the data. Journal of the Royal Statistical Society (Series B), 1965, 27, 251-263.
Lord, F. Applications of Item Response Theory to Practical Testing
Problems. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1980.
Luce, R. D. and Tukey, J. W. Simultaneous conjoint measurement: A new type
of fundamental measurement. Journal of Mathematical Psychology, 1964, 1,
1-27.
Perline, R., Wright, B.D. and Wainer, H. The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 1979, 3, 237-256.
Rasch, G. On Applying a General Measuring Theory of Bridge-building between Similar Psychological Tests. Copenhagen: Danmarks Paedagogiske Institut, 1958.
Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danmarks Paedagogiske Institut, 1960 (Chicago: University of Chicago Press, 1980).
Rasch, G. On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961, 321-333.
Rasch, G. An individualistic approach to item analysis. In P.F. Lazarsfeld and N.W. Henry (Eds.), Readings in Mathematical Social Science. Chicago: Science Research Associates, 1966a.
Rasch, G. An item analysis which takes individual differences into account.
British Journal of Mathematical and Statistical Psychology, 1966b, 19,
49-57.
Rasch, G. An informal report on the present state of a theory of objectivity in comparisons. In L.J. van der Kamp and C.A.J. Vlek (Eds.), Proceedings of the NUFFIC International Summer Session in Science at "Het Oude Hof." Leiden, 1967.
Rasch, G. On specific objectivity: An attempt at formalizing the request
for generality and validity of scientific statements. Danish Yearbook of
Philosophy, 1977, 16, 58-94.
Thurstone, L. L. A law of comparative judgement. Psychological Review,
1927, 34, 273-286.
Thurstone, L.L. Attitudes can be measured. American Journal of Sociology, 1928a, 33, 529-554.
Thurstone, L.L. The measurement of opinion. Journal of Abnormal and Social
Psychology, 1928b, 22, 415-430.
Thurstone, L.L. Theory of attitude measurement. Psychological Review, 1929, 36, 222-241.
Thurstone, L.L. Measurement of social attitudes. Journal of Abnormal and
Social Psychology, 1931, 26, 249-269.
Wright, B.D. and Masters, G.N. Rating Scale Analysis. Chicago: MESA Press, 1982.
Wright, B.D. and Stone, M.H. Best Test Design. Chicago: MESA Press, 1979.