Journal of Applied Measurement
GUIDELINES FOR MANUSCRIPTS
Reprinted from Smith, R.M., Linacre, J.M., and Smith, Jr., E.V. (2003). Guidelines for Manuscripts. Journal of Applied Measurement, 4, 198-204.
Included in this editorial are guidelines for manuscripts submitted to the Journal of Applied Measurement that involve applications of Rasch measurement. These guidelines may also be of use to those attempting to publish Rasch measurement applications in other journals whose editors and reviewers may be less familiar with these methods.
Following the guidelines, we provide a list of references that may assist individuals in gaining an overview of some of the material discussed in the guidelines. The guidelines and the list of references are by no means exhaustive. If you feel an important reference has been left out or have a recommendation for the guidelines, please e-mail us your suggestions (rsmith@jampress.org, mike@winsteps.com, or evsmith@uic.edu).
Finally, we consider this a work in progress and thank William Fisher and George Karabatsos for comments on an earlier version. We will attempt to incorporate ideas and references as we receive them. Please periodically visit the journal website at http://www.jampress.org for the most recent updates.
A. Describing the problem
1. Adequate references, including at least a reference to Rasch (1960) where appropriate.
2. Adequate theory, including at least an exact algebraic representation of the Rasch model(s) used and a citation for the primary developer(s); a minimal example follows this list.
3. Adequate description of the measurement problem, including the hypothesized definition of the latent variable, identification of the facets under investigation, and a description of the rating scales or response formats.
4. Rationale for using Rasch measurement techniques. This may include, for example, a preference for the unique properties that Rasch models embody, the goal of establishing generalized reference-standard metrics, or empirical justification, such as a comparison of the generalizability of the parameter estimates obtained from competing models. Addressing the rationale for using Rasch measurement is particularly important when reviewers are more familiar with the philosophy behind Item Response Theory or True Score Theory.
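For item 2, a minimal example in the notation of Wright and Masters (1982): the dichotomous Rasch model gives the probability that person n succeeds on item i as

```latex
P(X_{ni} = 1 \mid \beta_n, \delta_i)
  = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}
```

where β_n is the person measure and δ_i the item difficulty (Rasch, 1960).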
B. Describing the analysis
1. Name and citation or adequate description of software or estimation methodology employed.
2. Provide a rationale for the choice of fit statistics and the criteria employed to indicate adequate fit to the model requirements. This should include some acknowledgment of the Type I error rate that the critical values imply. Note: The mean square is not a symmetric statistic. A value of 0.7 is further from 1.0 than is 1.3, so a 1.3/0.7 cutoff for mean squares implies different Type I error rates in the upper and lower tails of the mean square distribution.
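As a rough illustration, suppose a mean square is approximated by a chi-square statistic divided by its degrees of freedom (an assumption made only for this sketch; the exact distribution depends on the data). The two tail probabilities implied by the 1.3/0.7 cutoffs then differ:

```python
# Sketch only: approximate a mean-square fit statistic as chi-square/df
# and compare the Type I error rates implied by the 1.3/0.7 cutoffs.
from scipy.stats import chi2

df = 100  # hypothetical degrees of freedom for the statistic

upper = chi2.sf(1.3 * df, df)   # P(mean square >= 1.3)
lower = chi2.cdf(0.7 * df, df)  # P(mean square <= 0.7)

print(f"P(MS >= 1.3) = {upper:.4f}")  # the two tails are unequal
print(f"P(MS <= 0.7) = {lower:.4f}")
```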
C. Reporting the analysis
1. Map of the linear variable as defined by the items.
2. Map of the distribution of the sample on the same linear variable (the two maps are often combined; see the sketch below).
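A crude text version of such a combined item-person map, using simulated measures purely for illustration:

```python
import numpy as np

# Crude item-person ("Wright") map: persons ('#') and items ('X')
# binned on the same logit scale. Simulated values stand in for
# real person measures and item difficulties.
rng = np.random.default_rng(0)
persons = rng.normal(0.5, 1.0, 120)   # hypothetical person measures
items = rng.normal(0.0, 1.2, 15)      # hypothetical item difficulties

edges = np.arange(-4.0, 4.5, 0.5)     # half-logit bins
for lo, hi in zip(edges[:-1], edges[1:]):
    p = np.sum((persons >= lo) & (persons < hi))
    i = np.sum((items >= lo) & (items < hi))
    print(f"{lo:5.1f} | {'#' * p:<35}| {'X' * i}")
```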
3. Report on the functioning of the rating scale(s) and on any procedures taken to improve measurement (e.g., category collapsing).
Note: It is extremely difficult to make decisions about the use of response categories in the rating scale or partial credit model if there are fewer than 30 persons in the sample or fewer than 10 observations in each category. You may want to reserve that task until your samples are somewhat larger. If the sample person distribution is skewed, you may need even larger samples, since one tail of the distribution will not be well populated. The same is true if the sample mean is offset from the mean of the item difficulties: there will be few observations in the extreme categories of the items lying opposite the concentration of persons.
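A quick way to apply this rule of thumb is to tally the observations in each category before interpreting any category-level statistics. A sketch with simulated responses (substitute your own persons-by-items data matrix):

```python
import numpy as np

# Count observations per rating category and flag sparse categories.
rng = np.random.default_rng(1)
data = rng.integers(0, 5, size=(30, 10))  # hypothetical persons x items

values, counts = np.unique(data, return_counts=True)
for v, c in zip(values, counts):
    note = "" if c >= 10 else "  <- fewer than 10 observations"
    print(f"category {v}: {c} observations{note}")
```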
4. Investigation of secondary dimensions in items, persons, etc., using, for example, fit statistics and other analyses of the residuals.
Note: The fact that all of the point-biserial correlations are greater than 0.30 does not, by itself, lend much support to the claim of unidimensionality in the rating scale and partial credit models. The median point-biserial in rating scale or partial credit data is often well above 0.70; in that situation, a number of items in the 0.30 to 0.40 range would be a good sign of multidimensionality.
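A sketch of this screen with simulated data (in practice, use the correlations reported by your Rasch software): compute each item's correlation with the rest of the test and compare individual values to the median.

```python
import numpy as np

# Compare each item's correlation with the rest score to the median.
rng = np.random.default_rng(2)
ability = rng.normal(0.0, 1.0, 200)            # hypothetical persons
noise = rng.normal(0.0, 0.8, (200, 15))
data = np.clip(np.round(ability[:, None] + noise + 2.0), 0, 4)

rest = data.sum(axis=1, keepdims=True) - data  # rest score per item
r = np.array([np.corrcoef(data[:, j], rest[:, j])[0, 1]
              for j in range(data.shape[1])])

print(f"median correlation: {np.median(r):.2f}")
for j, rj in enumerate(r):
    if np.median(r) > 0.70 and 0.30 <= rj <= 0.40:
        print(f"item {j}: r = {rj:.2f}  <- far below the median")
```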
5. Investigation of local idiosyncrasies in items, persons, etc.
Note: Fit statistics for small sample sizes are very unstable; one or two unusual responses can produce a large fit statistic. Count the number of item/person standardized residuals that are larger than 2.0. You may be surprised how few there are. Do you want to drop an item just because of a few unexpected responses?
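Counting large standardized residuals is straightforward once they have been exported from your Rasch software; in this sketch, random normal values stand in for real residuals:

```python
import numpy as np

# Count standardized residuals with |z| > 2.0. Under the model, about
# 5% are expected to exceed this bound by chance alone.
rng = np.random.default_rng(3)
z = rng.standard_normal((200, 15))  # placeholder for exported residuals

n_large = int(np.sum(np.abs(z) > 2.0))
print(f"{n_large} of {z.size} residuals exceed |2.0| "
      f"({100 * n_large / z.size:.1f}%)")
```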
6. Report Rasch separation and reliabilities, not KR-20 or Alpha.
Note: Reliability was originally conceptualized as the ratio of the true variance to the observed variance. Since the true score model provides no way of estimating the standard error of measurement (SEM), a variety of methods (e.g., KR-20, Alpha) were developed to estimate reliability without knowing the SEM. In the Rasch model it is possible to approach reliability as originally conceived, rather than relying on a less-than-ideal substitute.
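A sketch of this approach, assuming you have person measures and their standard errors in logits (simulated here): the true variance is estimated as the observed variance minus the mean error variance.

```python
import numpy as np

# Rasch person reliability and separation from measures and SEs.
rng = np.random.default_rng(4)
measures = rng.normal(0.0, 1.2, 300)   # hypothetical person measures
se = rng.uniform(0.3, 0.5, 300)        # hypothetical standard errors

obs_var = measures.var(ddof=1)         # observed variance
err_var = float(np.mean(se ** 2))      # mean square measurement error
true_var = obs_var - err_var           # adjusted ("true") variance

reliability = true_var / obs_var       # true variance / observed variance
separation = (true_var / err_var) ** 0.5

print(f"reliability = {reliability:.2f}, separation = {separation:.2f}")
```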
7. Report on applicable validity issues.
Note: This is of particular importance when attempting to convey the results of a Rasch analysis to non-Rasch-oriented readers. Attempts should be made to address the validity issues raised by Messick (1989, 1995), Cherryholmes (1988), and the Medical Outcomes Trust (1995). See Smith (2001) for one interpretation, and Fisher (1994) for a connection between qualitative criteria for meaningfulness and quantitative mathematical criteria.
8. Any special measurement concerns?
For example: Missing data: missing by design (e.g., not administered) or for some other reason? Folded data: how were they resolved? Nested data: how were they accommodated? Loosely connected facets: how were differences in local origins removed? Measurement vs. description facets: how were they disentangled?
9. For tests of statistical significance, in addition to the test statistics, degrees of freedom, and p-values, we encourage authors to report and interpret effect sizes and/or confidence intervals.
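For example, a difference between two person measures can be reported with a confidence interval built from the measures' standard errors, rather than with a p-value alone (hypothetical numbers; the normal-theory interval is an approximation):

```python
import math

# 95% CI and standardized difference for two measures (in logits).
m1, se1 = 1.20, 0.35   # hypothetical measure and standard error
m2, se2 = 0.40, 0.30

diff = m1 - m2
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
low, high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"difference = {diff:.2f} logits, "
      f"95% CI = ({low:.2f}, {high:.2f}), "
      f"standardized difference = {diff / se_diff:.2f}")
```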
D. Style and Terminology
1. Use "score" for raw scores and "measure" or "calibration" for Rasch-constructed linear measures.
2. We do not encourage the use of Item Response Theory as a term for Rasch measurement.
3. Rescale from logits to a user-oriented scale (see the sketch following this list).
4. If appropriate, attempt to convey the results in graphical format.
5. Do not use inappropriate language when discussing reliability and validity (e.g., "the test is reliable and valid"). It is the measures that are reliable, and it is the inferences made from the item and person measures and fit information that are valid for specific purposes.
6. When citing formulas or equations from other authors' work, please use the notation of the original author. For example, when citing the formula for the partial credit model from Wright and Masters (1982), please use β for person measures, not θ. If you decide to change notation after that citation, please explain the reason for the change.
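A minimal sketch of item 3: a linear transformation from logits to a user-oriented reporting scale. The anchor points below are arbitrary choices for illustration; see Ludlow and Haley (1995) for a fuller treatment of logit transformations.

```python
def to_user_scale(logit, lo=-5.0, hi=5.0, new_lo=0.0, new_hi=100.0):
    """Map a measure in [lo, hi] logits onto a 0-100 reporting scale."""
    slope = (new_hi - new_lo) / (hi - lo)
    return new_lo + slope * (logit - lo)

print(to_user_scale(0.0))   # 50.0
print(to_user_scale(1.5))   # 65.0
```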
E. Common Oversights
1. Do not take the mean and standard deviation of point-biserial correlations; correlations are even more non-linear than raw scores. It is best to report the median and inter-quartile range, or to apply a Fisher z-transformation before calculating a mean (see the sketch following this list).
2. When comparing the results of several calibrations of the same data, do not use the item and person reliability as criteria for improvement. These indices suffer from the same floor and ceiling effects as their true score counterparts and hence may not accurately reflect increases in reliability. If an increase in reliability is one of your criteria for improvement, use the item and person separation indices to compare the results of multiple calibrations as these indices do not suffer from the same deficiencies.
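A sketch of item 1 with hypothetical correlations: report the median and inter-quartile range, or average the correlations on the Fisher z scale and transform the result back.

```python
import numpy as np

r = np.array([0.35, 0.55, 0.60, 0.62, 0.70, 0.72, 0.75])  # hypothetical

print("median:", np.median(r))
print("IQR:", np.percentile(r, 75) - np.percentile(r, 25))

z = np.arctanh(r)   # Fisher z-transformation
print("Fisher-z mean:", round(float(np.tanh(z.mean())), 3))
```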
References
Cherryholmes, C. (1988). Construct validity and the discourses of research. American
Journal of Education, 96, 421-457.
Medical Outcomes Trust Scientific Advisory Committee. (1995). Instrument review criteria. Medical Outcomes Trust Bulletin, 1-4.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences
from persons' responses and performances as scientific inquiry into score
meaning. American Psychologist, 50, 741-749.
Rasch Measurement Models
Adams, R. J., Wilson, M. R., and Wang, W. C. (1997). The multidimensional random
coefficients multinomial logit model. Applied Psychological
Measurement, 21, 1-24.
Andrich, D. (1978). A rating formulation for ordered response categories.
Psychometrika, 43, 561-574.
Andrich, D. (1988). Rasch models for measurement. Sage university paper series on quantitative applications in the social sciences. Newbury Park, CA: Sage Publications.
Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent
developments, and applications. New York: Springer-Verlag.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47,
149-174.
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests.
Copenhagen: Danish Institute for Educational Research (Expanded edition,
1980. Chicago: University of Chicago Press).
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Wright, B. D., and Mok, M. (2000). Rasch models overview. Journal of Applied
Measurement, 1, 83-106.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Rationale for Using Rasch Models
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42,
69-81.
Andrich, D. (1989). Distinctions between assumptions and requirements in measurement in the social sciences. In J. A. Keats, R. Taft, R. A. Heath, and S. H. Lovibond (Eds.), Mathematical and theoretical systems (pp. 7-16). Amsterdam: North-Holland.
Andrich, D. (1995). Distinctive and incompatible properties of two common classes of IRT models for graded responses. Applied Psychological Measurement, 19, 101-119.
Andrich, D. (2001, October). Controversy and the Rasch model: A characteristic of a scientific revolution. Paper presented at the International Conference on Objective Measurement: Focus on Health Care, Chicago, IL.
Andrich, D. (2002). Understanding resistance to the data-model relationship in
Rasch’s paradigm: A reflection for the next generation. Journal of
Applied Measurement, 3, 325-359.
Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Choppin, B. (1985). Lessons for psychometrics from thermometry. International
Journal of Educational Research (formerly Evaluation In Education), 9, 9-12.
Fisher, Jr., W. P. (1993). Scale-free measurement revisited. Rasch Measurement Transactions, 7, 272-273. [http://www.rasch.org/rmt/rmt71.htm].
Fisher, Jr., W. P. (1995). Opportunism, a first step to inevitability? Rasch Measurement Transactions, 9, 426. [http://www.rasch.org/rmt/rmt92.htm].
Fisher, Jr., W. P. (1996). The Rasch alternative. Rasch Measurement Transactions, 9, 466-467. [http://www.rasch.org/rmt/rmt94.htm].
Linacre, J. M. (1996). The Rasch model cannot be “disproved”! Rasch Measurement Transactions, 10, 512-514. [http://www.rasch.org/rmt/rmt103.htm].
Perline, R., Wright, B. D., and Wainer, H. (1979). The Rasch model as additive
conjoint measurement. Applied Psychological Measurement, 3, 237-256.
Romanoski, J., and Douglas, G. (2002). Test scores, measurement, and the use
of analysis of variance: An historical overview. Journal of Applied
Measurement, 3, 232-242.
Smith, R. M. (1992). Applications of Rasch measurement. Chicago: MESA Press.
Wright, B. D. (1967). Sample-free test calibration and person measurement. In B. S. Bloom (Chair), Invitational Conference on Testing Problems (pp. 84-101). Princeton, NJ: Educational Testing Service. Available at http://www.rasch.org/memo1.htm.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal
of Educational Measurement, 14 (2), 97-116. Available at
http://www.rasch.org/memo42.htm.
Wright, B. D., and Linacre, J. M. (1989). Observations are always ordinal;
measurements, however, must be interval. Archives of Physical Medicine and
Rehabilitation, 70, 857-860. Available at http://www.rasch.org/memo44.htm.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Estimation Methodology
Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent
developments, and applications. New York: Springer-Verlag.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (1999). Estimation methods for Rasch measures. Journal of Outcome
Measurement, 3, 382-405.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Assessing Dimensionality and Fit
Andersen, E. B. (1973). A goodness-of-fit test for the Rasch model. Psychometrika,
38, 123-140.
Bond, T. G., and Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Engelhard, Jr., G. (1994). Examining rater errors in the assessment of written
composition with a Many-Facet Rasch model. Journal of Educational
Measurement, 31, 93-112.
Engelhard, Jr., G. (1996). Clarification to “Examining rater errors in the assessment of
written composition with a Many-Facet Rasch model”. Journal of Educational
Measurement, 33, 115-116.
Fischer, G. H., and Molenaar, I. W. (1995). Rasch models: Foundations, recent
developments, and applications. New York: Springer-Verlag.
Glas, C. A. W. (1988). The derivation of some tests for the Rasch model from the
multinomial distribution. Psychometrika, 53, 525-546.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223-245.
Linacre, J. M. (1992). Prioritizing misfit indicators. Rasch Measurement Transactions, 9, 422-423.
Linacre, J. M. (1998a). Structure in Rasch residuals: Why principal component analysis? Rasch Measurement Transactions, 12, 636.
Linacre, J. M. (1998b). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2, 266-283.
Linacre, J. M., and Wright, B. D. (1994). Chi-square fit statistics. Rasch
Measurement Transactions, 8, 360-361.
Smith, Jr., E. V. (2002). Detecting and evaluating the impact of multidimensionality
using item fit statistics and principal component analysis of residuals. Journal
of Applied Measurement, 3, 205-231.
Smith, R. M. (1991a). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.
Smith, R. M. (1991b). The distributional properties of Rasch item fit statistics.
Educational and Psychological Measurement, 51, 541-565.
Smith, R. M. (1996a). A comparison of methods for determining dimensionality in Rasch measurement. Structural Equation Modeling, 3, 25-40.
Smith, R. M. (1996b). Polytomous mean square fit statistics. Rasch Measurement Transactions, 10, 516-517.
Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of
Applied Measurement, 1, 199-218.
Smith, R. M., Schumacker, R. E., and Bush, M. J. (1998). Using item mean squares to
evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66-78.
Wright, B. D. (1991a). Diagnosing misfit. Rasch Measurement Transactions, 5, 156.
Wright, B. D. (1991b). Factor item analysis versus Rasch item analysis. Rasch
Measurement Transactions, 5, 134-135.
Wright, B. D. (1996a). Comparing Rasch measurement and factor analysis. Structural
Equation Modeling, 3, 3-24.
Wright, B. D. (1996b). Local dependence, correlation, and principal components.
Rasch Measurement Transactions, 10, 509-511.
Wright, B. D., and Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch
Measurement Transactions, 8, 370.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Rating Scale Category Effectiveness
Andrich, D. (1996). Category ordering and their utility. Rasch Measurement
Transactions, 9, 465-466.
Andrich, D. (1998). Thresholds, steps, and rating scale conceptualization. Rasch
Measurement Transactions, 12, 648-649.
Linacre, J. M. (1991). Step disordering and Thurstone thresholds. Rasch Measurement
Transactions, 5, 171.
Linacre, J. M. (1999). Investigating rating scale category utility. Journal of Outcome
Measurement, 3, 102-122.
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of
Applied Measurement, 3, 86-106.
Stone, M., and Wright, B. D. (1994). Maximizing rating scale information. Rasch
Measurement Transactions, 8, 386.
Wright, B. D., and Linacre, J. M. (1992). Disordered steps? Rasch Measurement
Transactions, 6, 225.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Zhu, W., Updyke, W. F., and Lewandowski, C. (1997). Post-hoc Rasch analysis of
optimal categorization of an ordered-response scale. Journal of Outcome
Measurement, 1, 286-304.
Reliability and Validity
Fisher, Jr., W. P. (1994). The Rasch debate: Validity and revolution in educational measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice, Vol. 2 (pp. 36-72). Norwood, NJ: Ablex Publishing Corporation.
Fisher, Jr., W. P. (1997). Is content validity valid? Rasch Measurement Transactions, 11, 548.
Linacre, J. M. (1993). Rasch-based generalizability theory. Rasch Measurement Transactions, 7, 283-284.
Linacre, J. M. (1995). Reliability and separation nomograms. Rasch Measurement Transactions, 9, 421.
Linacre, J. M. (1996). True-score reliability or Rasch statistical validity? Rasch
Measurement Transactions, 9, 455-456.
Linacre, J. M. (1999). Relating Cronbach and Rasch reliabilities. Rasch Measurement
Transactions, 13, 696.
Smith, Jr., E. V. (2001). Reliability of measures and validity of measure interpretation:
A Rasch measurement perspective. Journal of Applied Measurement,
2, 281-311.
Wright, B. D. (1995). Which standard error? Rasch Measurement Transactions, 9,
436-437.
Wright, B. D. (1996). Reliability and separation. Rasch Measurement Transactions, 9,
472.
Wright, B. D. (1998). Interpreting reliabilities. Rasch Measurement Transactions, 11,
602.
Wright, B. D., and Masters, G. N. (1982). Rating scale analysis. Chicago:
MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Metric Development and Score Reporting
Linacre, J. M. (1997). Instantaneous measurement and diagnosis. In R. M. Smith (Ed.), Physical Medicine and Rehabilitation State of the Art Reviews, Vol. 11: Outcome Measurement (pp. 315-324). Philadelphia: Hanley & Belfus, Inc.
Ludlow, L. H., and Haley, S. M. (1995). Rasch model logits: Interpretation, use, and
transformations. Educational and Psychological Measurement, 55, 967-975.
Smith, Jr., E. V. (2000). Metric development and score reporting in Rasch
measurement. Journal of Applied Measurement, 1, 303-326.
Smith, R. M. (1991). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.
Smith, R. M. (1992). Applications of Rasch measurement. Chicago: MESA Press.
Smith, R. M. (1994). Person response maps for rating scales. Rasch Measurement
Transactions, 8, 372-373.
Stanek, J., and Lopez, W. (1996). Explaining variables. Rasch Measurement
Transactions, 10, 518-519.
Woodcock, R. W. (1999). What can Rasch-based scores convey about a person’s test performance? In S. E. Embretson and S. L. Hershberger (Eds.), The new rules of measurement: What every psychologist and educator should know. Mahwah, NJ: Erlbaum.
Wright, B. D., Mead, R. J., and Ludlow, L. H. (1980). Kidmap (Research Memorandum No. 29). Chicago: MESA Press.
Wright, B. D., and Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Zhu, W. (1995). Communicating measurement. Rasch Measurement Transactions, 9,
437-438.