Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Current Volume Article Abstracts
Vol. 14, No. 1 Spring 2013
A Bootstrap Approach to Evaluating Person and Item Fit to the Rasch Model
Edward W. Wolfe
Abstract
Historically, rule-of-thumb critical values have been employed for interpreting fit statistics that depict anomalous
person and item response patterns in applications of the Rasch model. Unfortunately, prior research has shown
that these values are not appropriate in many contexts. This article introduces a bootstrap procedure for identifying
reasonable critical values for Rasch fit statistics and compares the results of that procedure to applications
of rule-of-thumb critical values for three example datasets. The results indicate that rule-of-thumb values may
over- or under-identify the number of misfitting items or persons.
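
The abstract does not spell out the bootstrap procedure, so the following is only a minimal sketch of one way a parametric bootstrap could produce item-specific critical values for the outfit mean-square. It assumes person measures (`thetas`) and item difficulties (`deltas`) are already estimated, and, as a simplification, it reuses those values in every replicate rather than re-estimating them. All function and parameter names are hypothetical.

```python
import math
import random


def rasch_prob(theta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def simulate_responses(thetas, deltas, rng):
    """Draw one model-conforming 0/1 response matrix (persons x items)."""
    return [[1 if rng.random() < rasch_prob(t, d) else 0 for d in deltas]
            for t in thetas]


def item_outfit(responses, thetas, deltas):
    """Outfit mean-square per item: mean squared standardized residual."""
    outfits = []
    for i, d in enumerate(deltas):
        total = 0.0
        for p, t in enumerate(thetas):
            prob = rasch_prob(t, d)
            total += (responses[p][i] - prob) ** 2 / (prob * (1.0 - prob))
        outfits.append(total / len(thetas))
    return outfits


def bootstrap_critical_values(thetas, deltas, n_reps=200,
                              lower=0.025, upper=0.975, seed=0):
    """Return (lower, upper) empirical outfit bounds for each item.

    Assumption: replicates are simulated from the supplied parameters and
    outfit is computed against those same parameters (no re-estimation).
    """
    rng = random.Random(seed)
    draws = [[] for _ in deltas]
    for _ in range(n_reps):
        data = simulate_responses(thetas, deltas, rng)
        for i, value in enumerate(item_outfit(data, thetas, deltas)):
            draws[i].append(value)
    bounds = []
    for values in draws:
        values.sort()
        lo = values[int(lower * (n_reps - 1))]
        hi = values[int(upper * (n_reps - 1))]
        bounds.append((lo, hi))
    return bounds
```

An observed outfit value falling outside its item's simulated interval would then be flagged, in place of a fixed rule-of-thumb range applied to every item.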
****
Using The Rasch Measurement Model to Design a Report Writing Assessment Instrument
Wayne R. Carlson
Abstract
This paper describes how the Rasch measurement model was used to develop an assessment instrument designed
to measure student ability to write law enforcement incident and investigative reports. The ability to
write reports is a requirement of all law enforcement recruits in the state of Michigan and is a part of the state’s
mandatory basic training curriculum, which is promulgated by the Michigan Commission on Law Enforcement
Standards (MCOLES). Recently, MCOLES conducted research to modernize its training and testing in the area
of report writing. A structured validation process was used, which included: a) an examination of the job tasks
of a patrol officer, b) input from content experts, c) a review of the professional research, and d) the creation of
an instrument to measure student competency. The Rasch model addressed several measurement principles that
were central to construct validity, which were particularly useful for assessing student performances. Based on
the results of the report writing validation project, the state established a legitimate connectivity between the
report writing standard and the essential job functions of a patrol officer in Michigan. The project also produced
an authentic instrument for measuring minimum levels of report writing competency, which generated results
that are valid for inferences of student ability. Ultimately, the state of Michigan must ensure the safety of its
citizens by licensing only those patrol officers who possess a minimum level of core competency. Maintaining
the validity and reliability of both the training and testing processes can ensure that the system for producing
such candidates functions as intended.
****
Using Multidimensional Rasch to Enhance Measurement Precision:
Initial Results from Simulation and Empirical Studies
Magdalena Mo Ching Mok and Kun Xu
Abstract
This study aimed to explore the effect on measurement precision of multidimensional, as compared with unidimensional,
Rasch measurement for constructing measures from multidimensional Likert-type scales. Many
educational and psychological tests are multidimensional but common practice is to ignore correlations among
the latent traits in these multidimensional scales in the measurement process. These practices may have serious
validity and reliability implications. This study made use of both empirical data from 208,083 students,
and simulated data generated under 24 systematic combinations, each replicated 1000 times, of three conditions,
namely, sample size, degree of dimensionality, and scale length to compare unidimensional and multidimensional
approaches and to identify effects of sample size, dimensionality and scale length on measurement precision.
Results showed that the multidimensional Rasch approach yielded more precise estimates than did the unidimensional
approach if the two dimensions were strongly correlated. The effect was more pronounced for long scales.
****
Using the Dichotomous Rasch Model to Analyze Polytomous Items
Qingping He and Chris Wheadon
Abstract
One of the most important applications of the Rasch measurement models in educational assessment is the
equating of tests. An important feature of attainment tests is the use of both dichotomous and polytomous items.
The partial credit model (PCM) developed by Masters (1982) represents an extension of the dichotomous Rasch
model for analysing polytomous item data. The dichotomous Rasch model has been used primarily to analyse
dichotomous item data. Whilst the partial credit model can provide detailed information on the performance of
individual score categories of polytomous items, it is mathematically more complex to use than the dichotomous
Rasch model and can, under certain circumstances, present difficulties in interpreting item measures and
in practical applications. This study explores the potential of using the dichotomous Rasch model to analyse
polytomous items and equate tests. Results obtained from a simulation study and from analysing the data of a
science achievement test indicate that the partial credit model and the dichotomous Rasch model produce similar
item and person measures and equivalent cut scores on different test forms.
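
For reference, the two models named in this abstract can be written in their standard textbook forms. The notation is an assumption, since the abstract defines no symbols: \theta_n is the measure of person n, \delta_i the difficulty of dichotomous item i, and \delta_{ik} the k-th step difficulty of polytomous item i with maximum score m_i.

```latex
% Dichotomous Rasch model
P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

% Partial credit model (Masters, 1982), for x = 0, 1, \dots, m_i,
% with the convention \sum_{k=0}^{0} (\theta_n - \delta_{ik}) \equiv 0
P(X_{ni} = x) =
  \frac{\exp \sum_{k=0}^{x} (\theta_n - \delta_{ik})}
       {\sum_{h=0}^{m_i} \exp \sum_{k=0}^{h} (\theta_n - \delta_{ik})}
```

Setting m_i = 1 in the second expression recovers the first, which is the sense in which the PCM extends the dichotomous model.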
****
With Hiccups and Bumps: The Development of a Rasch-based Instrument
to Measure Elementary Students’ Understanding of the Nature of Science
Shelagh M. Peoples, Laura M. O’Dwyer, Katherine A. Shields, and Yang Wang
Abstract
This research describes the development process, psychometric analyses and part validation study of a theoretically-grounded
Rasch-based instrument, the Nature of Science Instrument-Elementary (NOSI-E). The NOSI-E was
designed to measure elementary students’ understanding of the Nature of Science (NOS). Evidence is provided for
three of the six validity aspects (content, substantive and generalizability) needed to support the construct validity
of the NOSI-E. A future article will examine the structural and external validity aspects. Rasch modeling proved
especially productive in scale improvement efforts. The instrument, designed for large-scale assessment use, is
conceptualized using five construct domains. Data from 741 elementary students were used to pilot the Rasch
scale, with continuous improvements made over three successive administrations. The psychometric properties
of the NOSI-E instrument are consistent with the basic assumptions of Rasch measurement, namely that the
items are well-fitting and invariant. Items from each of the five domains (Empirical, Theory-Laden, Certainty,
Inventive, and Socially and Culturally Embedded) are spread along the scale’s continuum and appear to overlap
well. Most importantly, the scale seems appropriately calibrated and responsive for elementary school-aged
children, the target age group. As a result, the NOSI-E should prove beneficial for science education research.
As the United States’ science education reform efforts move toward students’ learning science through engaging
in authentic scientific practices (NRC, 2011), it will be important to assess whether this new approach to teaching
science is effective. The NOSI-E can be used as one measure of whether this reform effort has an impact.
****
Application of Single-level and Multi-level Rasch Models using the lme4 Package
Iasonas Lamprianou
Abstract
The aim of the article is to illustrate how researchers may use the lme4 package to run multilevel Rasch models.
The lme4 package is popular open-source software that is frequently used by researchers around the world
to fit generalized mixed-effects models with crossed or partially crossed random effects. The article starts with
a short discussion of the reasons why a researcher might, sometimes, be motivated to use a multi-level Rasch
model and presents a practical example using empirical data. The main features of the lme4 package are presented,
and finally, the paper presents information about other open-source software that could alternatively be
used to fit multi-level Rasch models.
****
Rasch Modeling to Assess Albanian and South African Learners’ Preferences
for Real-life Situations to be Used in Mathematics: A Pilot Study
Suela Kacerja, Cyril Julie, and Said Hadjerrouit
Abstract
This paper reports on an investigation of the real-life situations that students in grades 8 and 9 in South Africa and
Albania prefer to use in Mathematics. The functioning of the instrument used to assess the order of preference
learners from both countries have for contextual situations is assessed using Rasch modeling techniques. For
both the cohorts, the data fit the Rasch model. The differential item functioning (DIF) analysis rendered 3 items
operating differentially for the two cohorts. Explanations for these differences are provided in terms of differences
in experiences learners in the two countries have related to some of the contextual situations. Implications
for interpretation of international comparative tests are offered, as are the possibilities for the cross-country
development of curriculum materials related to contexts that learners prefer to use in Mathematics.
****
Vol. 14, No. 2 Summer 2013
Adaptive Testing for Psychological Assessment: How Many Items Are Enough To Run an Adaptive Testing Algorithm?
Michaela M. Wagner-Menghin and Geoff N. Masters
Abstract
Although the principles of adaptive testing were established in the psychometric literature many years ago (e.g.,
Weiss, 1977), and the practice of adaptive testing is established in educational assessment, it is not yet widespread in
psychological assessment. One obstacle to adaptive psychological testing is a lack of clarity about the necessary
number of items to run an adaptive algorithm. The study explores the relationship between item bank size, test
length and measurement precision. Simulated adaptive test runs (allowing a maximum of 30 items per person)
out of an item bank with 10 items per ability level (covering .5 logits, 150 items total) yield a standard error
of measurement (SEM) of .47 (.39) after an average of 20 (29) items for 85-93% (64-82%) of the simulated
rectangular sample. Expanding the bank to 20 items per level (300 items total) did not improve the algorithm’s
performance significantly. With a small item bank (5 items per ability level, 75 items total) it is possible to reach
the same SEM as with a conventional test, but with fewer items or a better SEM with the same number of items.
****
DIF Cancellation in the Rasch Model
Adam E. Wyse
Abstract
Differential item functioning (DIF) cancellation occurs when the cumulative effect of an item or set of items
exhibiting DIF against one subgroup cancels with other items that exhibit DIF against the comparison group
and hence results in non-existent DIF at the test level. This paper investigates DIF cancellation in the context
of Rasch measurement. It is shown that this phenomenon is not a property of the Rasch model, but rather, a
function of the manner in which item parameters are estimated and the way that DIF impacts these estimates.
The conditions under which DIF cancellation would exist when using the Rasch model are suggested and a
proof is provided to support this suggestion. Empirical examples are provided to refute prior suggestions that
DIF cancellation always exists if the Rasch model is used.
****
Multidimensional Diagnostic Perspective on Academic Achievement Goal Orientation Structure, Using the Rasch Measurement Models
Daeryong Seo, Husein Taherbhai, and Insu Paek
Abstract
This study is designed to investigate a multidimensional structure of academic achievement goal orientations
from a diagnostic perspective, using the Rasch measurement models. A data set of Korean students who responded
to the Patterns of Adaptive Learning Survey (PALS) was analyzed. Both consecutive unidimensional
and multidimensional Rasch measurement models were applied for comparative purposes. Each goal orientation
dimension (i.e., the attitude) was standardized and then classified into three categorical levels, i.e., low, middle
and high. These categorizations of goal dimensions were used to examine the role of students’ performance-approach
goals on mathematics achievement in relation to the other achievement goals. Results indicate that
the multidimensional partial credit model was the best model with respect to the fit of the data to the models.
Findings of the current study also demonstrate that practitioners who need specific feedback for instruction and/or
intervention can benefit from the multidimensional approach.
****
An Extension of a Bayesian Approach to Detect Differential Item Functioning
Sandip Sinharay
Abstract
The application of the existing test statistics to determine differential item functioning (DIF) requires large
samples, but test administrators often face the challenge of detecting DIF with small samples. One advantage
of a Bayesian approach over a frequentist approach is that the former can incorporate, in the form of a prior
distribution, existing information on the inference problem at hand. Sinharay, Dorans, Grant, and Blew (2009)
suggested the use of information from past data sets as a prior distribution in a Bayesian DIF analysis. This
paper suggests an extension of the method of Sinharay et al. (2009). The suggested extension is compared to
the existing DIF detection methods in a realistic simulation study.
****
The Development of the de Morton Mobility Index (DEMMI) in an Older Acute Medical Population: Item Reduction using the Rasch Model (Part 1)
Natalie A. de Morton, Megan Davidson, and Jennifer L. Keating
Abstract
The DEMMI (de Morton Mobility Index) is a new and advanced instrument for measuring the mobility of all older
adults across clinical settings. It overcomes practical and clinimetric limitations of existing mobility instruments.
This study reports the process of item reduction using the Rasch model in the development of the DEMMI. Prior
to this study, qualitative methods were employed to generate a pool of 51 items for potential inclusion in the
DEMMI. The aim of this study was to reduce the item set to a unidimensional subset of items that ranged across
the mobility spectrum from bed bound to high levels of independent mobility. Fifty-one physical performance
mobility items were tested in a sample of older acute medical patients. A total of 215 mobility assessments were
performed. Seventeen mobility items that spanned the mobility spectrum were selected for inclusion in the new
instrument. The 17-item scale fitted the Rasch model. Items operated consistently across the mobility spectrum
regardless of patient age, gender, cognition, primary language or time of administration during hospitalisation.
Using the Rasch model, an interval level scoring system was developed with a score range of 0 to 100.
****
A Comparison of Confirmatory Factor Analysis and Multidimensional Rasch Models to Investigate the Dimensionality of Test-Taking Motivation
Christine E. DeMars
Abstract
Using a scale of test-taking motivation designed to have multiple factors, results are compared from a confirmatory
factor analysis (CFA) using LISREL and a multidimensional Rasch partial credit model using ConQuest. Both
types of analyses work with latent factors and allow the comparison of nested models. CFA models most typically
model a linear relationship between observed and latent variables, while Rasch models specify a non-linear
relationship between observed and latent variables. The CFA software provides many more measures of overall
fit than ConQuest, which is focused more on the fit of individual items. Despite the conceptual differences in
these techniques, the results were similar. The data fit a three-dimensional model better than the one-dimensional
or two-dimensional models also hypothesized, although some misfit remained.
****
Measuring Alternative Learning Outcomes: Dispositions to Study in Higher Education
Maria Pampaka, Julian Williams, Graeme Hutcheson, Laura Black, Pauline Davis, Paul Hernandez-Martinez, and Geoff Wake
Abstract
In this paper we describe the validation of two scales constructed to measure pre-university students’ changing
disposition (i) to enter Higher Education (HE) and (ii) to further study mathematically-demanding subjects. Items
were selected drawing on interview data, and on a model of disposition as socially- as well as self-attributed.
Rasch analyses showed that the two scales each produce robust one-dimensional measures on what we call a
‘strength of commitment to enter HE’ and ‘disposition to study mathematically-demanding subjects further’
respectively. However, the former scale was initially found to suffer psychometrically from a ceiling effect,
which we ‘corrected’ by adding some harder items at a later data point; we then revised the scale according to our
interpretation of subsequent results. We finally discuss the potential significance of the constructed measures of
learning outcomes, as variables in monitoring or even explaining students’ progress into different subjects in HE.
****