Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 11, 2010 Article Abstracts
Vol. 11, No. 1 Spring 2010
Predicting Responses from Rasch Measures
John M. Linacre
Abstract
There is a growing family of Rasch models for polytomous observations. Selecting a suitable model for an
existing dataset, estimating its parameters and evaluating its fit is now routine. Problems arise when the model
parameters are to be estimated from the current data, but used to predict future data. In particular, ambiguities in
the nature of the current data, or overfit of the model to the current dataset, may mean that better fit to the current
data may lead to worse fit to future data. The predictive power of several Rasch and Rasch-related models
is discussed in the context of the Netflix Prize. Rasch-related models are proposed based on Singular Value
Decomposition (SVD) and Boltzmann Machines.
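To make the SVD approach concrete, here is a minimal matrix-factorization sketch in the Netflix Prize spirit. This is an illustration only, not code from the article: the ratings matrix, the chosen rank, and the mean-imputation step are all invented for the example.

```python
import numpy as np

# Rows are persons, columns are items; 0 marks a missing rating.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 4.0, 4.0],
])

mask = R > 0
mean = R[mask].mean()
filled = np.where(mask, R, mean)   # crude imputation of missing cells

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2                              # keep the two largest singular values
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The low-rank reconstruction fills every cell; the previously missing
# cells serve as the predicted ratings.
print(np.round(pred, 2))
```

Overfitting to the observed cells, the concern the abstract raises, corresponds here to choosing k too large: a full-rank reconstruction reproduces the current data exactly but predicts future ratings poorly.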
****
Concrete, Abstract, Formal, and Systematic Operations as Observed in a “Piagetian” Balance-Beam Task Series
Theo Linda Dawson, Eric Andrew Goodheart, Karen Draney, Mark Wilson, and Michael Lamport Commons
Abstract
We performed a Rasch analysis of cross-sectional developmental data gathered from children and adults who
were presented with a task series derived from Inhelder’s and Piaget’s balance beam. The partial credit model
situates both participants and items along a single hierarchically ordered dimension. As the Model of Hierarchical
Complexity predicted, order of hierarchical complexity accurately predicted item difficulty, with notable
exceptions at the formal and systematic levels. Gappiness between items was examined using the saltus model.
A two-level saltus model, which examined the gap between the concrete/abstract and formal/systematic items,
was a better predictor of performance than the Rasch analysis (χ² = 71.91, df = 4, p < .01).
****
Sources of Self-Efficacy Belief: Development and Validation of Two Scales
Ou Lydia Liu and Mark Wilson
Abstract
Self-efficacy belief has been an instrumental affective factor in predicting student behavior and achievement in
academic settings. Although there is abundant literature on efficacy belief per se, the sources of efficacy belief
have not been fully researched. Very few instruments exist to quantify the sources of efficacy beliefs. To fill
this void, we developed two scales for the two main sources of self-efficacy belief: past performance and social
persuasion. Pilot test data were collected from 255 middle school students. A self-efficacy measure was also
administered to the students as a criterion measure. The Rasch rating scale model was used to analyze the data.
Information on item fit, item design, content validity, external validity, internal consistency, and person separation
reliability was examined. The two scales displayed satisfactory psychometric properties. Applications and
limitations of these two scales are also discussed.
****
Reducible Or Irreducible? Mathematical Reasoning and the Ontological Method
William P. Fisher, Jr.
Abstract
Science is often described as nothing but the practice of measurement. This perspective follows from longstanding
respect for the roles mathematics and quantification have played as media through which alternative hypotheses
are evaluated and experience becomes better managed. Many figures in the history of science and psychology
have contributed to what has been called the “quantitative imperative,” the demand that fields of study employ
number and mathematics even when they do not constitute the language in which investigators think together.
But what makes an area of study scientific is, of course, not the mere use of number, but communities of investigators
who share common mathematical languages for exchanging qualitative and quantitative value. Such
languages require rigorous theoretical underpinning, a basis in data sufficient to the task, and instruments traceable
to reference standard quantitative metrics. The values shared and exchanged by such communities typically
involve the application of mathematical models that specify the sufficient and invariant relationships necessary
for rigorous theorizing and instrument equating. The mathematical metaphysics of science are explored with the
aim of connecting principles of quantitative measurement with the structures of sufficient reason.
****
Children’s Understanding of Area Concepts: Development, Curriculum and Educational Achievement
Trevor G. Bond and Kellie Parkinson
Abstract
As one part of a series of studies undertaken to investigate the contribution of developmental attributes of
learners to school learning, a representative sample of forty-two students (ages ranging from 5 years 3 months
to 13 years 1 month) was randomly selected from a total student population of 142 students at a small private
primary school in northern Australia. Those children’s understandings of area concepts taught during the primary
school years were assessed by their performance in two testing situations. The first consisted of a written classroom
test of ability to solve ‘area’ problems with items drawn directly from school texts, school examinations
and other relevant curriculum documents. The second, which focused more directly on each child’s cognitive
development, was an individual interview for each child in which four “area” tasks such as the Meadows and
Farmhouse Experiment taken from Chapter 11 of The Child’s Conception of Geometry (Piaget, Inhelder and
Szeminska, 1960, pp. 261-301) were administered.
Analysis using the Rasch Partial Credit Model provided a finely detailed quantitative description of the
developmental and learning progressions revealed in the data. It is evident that the school mathematics curriculum
does not satisfactorily match the learner’s developmental sequence at some key points. Moreover, the
children’s ability to conserve area on the Piagetian tasks, rather than other learner characteristics, such as age
and school grade seems to be a precursor for complete success on the mathematical test of area. The discussion
focuses on the assessment of developmental (and other) characteristics of school-aged learners and suggests how
curriculum and school organization might better capitalize on such information in the design and sequencing
of learning experiences for school children. Some features unique to the Rasch family of measurement models
are held to have special significance in elucidating the development/attainment nexus.
****
Thinking about Thinking—Thinking about Measurement: A Rasch Analysis of Recursive Thinking
Ulrich Müeller and Willis F. Overton
Abstract
Two studies were conducted to examine the dimensionality and hierarchical organization of a measure of recursive
thinking. In Study 1, Rasch analysis supported the claim that the recursive thinking task measures a single
underlying dimension. Item difficulty, however, appeared to be influenced not only by level of embeddedness but
also by syntactic features. In Study 2, this hypothesis was tested by adding new items to the recursive thinking
measure. Rasch analysis of the modified recursive thinking task produced evidence for unidimensionality and
segmentation. However, Study 2 did not support the idea that syntactic features influence item difficulty.
****
Understanding Rasch Measurement: Psychometric Aspects of Item Mapping for Criterion-Referenced
Interpretation and Bookmark Standard Setting
Huynh Huynh
Abstract
Locating an item on an achievement continuum (item mapping) is well-established in technical
work for educational/psychological assessment. Applications of item mapping may be found in criterion-referenced
(CR) testing (or scale anchoring; Beaton and Allen, 1992; Huynh, 1994, 1998a, 2000a, 2000b,
2006), computer-assisted testing, test form assembly, and in standard setting methods based on ordered test
booklets. These methods include the bookmark standard setting originally used for the CTB/TerraNova tests
(Lewis, Mitzel, Green, and Patz, 1999), the item descriptor process (Ferrara, Perie, and Johnson, 2002) and
a similar process described by Wang (2003) for multiple-choice licensure and certification examinations.
While item response theory (IRT) models such as the Rasch and two-parameter logistic (2PL) models
traditionally place a binary item at its location, Huynh has argued in the cited papers that such mapping
may not be appropriate in selecting items for CR interpretation and scale anchoring.
****
Vol. 11, No. 2 Summer 2010
Using Item Response Modeling Methods to Test Theory Related to Human Performance
Diane D. Allen
Abstract
Testing theories of human performance requires measurement of latent constructs like performer perception
and motivation, intrinsic parts of performance, but nebulous to assess. Item response modeling (IRM) methods
model latent constructs directly and thus can assist in the development and testing of theory. This article
presents an application of IRM methods in initial testing of the Movement Continuum Theory. A construct was
derived from the theory, and an instrument was generated to put the construct into operation. Over 300 people
of varying movement abilities, aged 18 to 101, completed the 24-item Movement Ability Measure, a self-report
questionnaire asking for participants’ perceptions of their current and preferred ability to move. Wright Maps
derived from estimated locations of participants and item thresholds showed the strong relationship between the
theorized construct and the empirical data. Data gathered with the instrument and analyzed with IRM methods
provided mixed support for principles of the theory.
****
From Model to Measurement with Dichotomous Items
Don Burdick, A. Jackson Stenner, and Andrew Kyngdon
Abstract
Psychometric models typically represent encounters between persons and dichotomous items as a random variable
with two possible outcomes, one of which can be labeled success. For a given item, the stipulation that each
person has a probability of success defines a construct on persons. This model specification defines the construct,
but measurement is not yet achieved. The path to measurement must involve replication; unlike coin-tossing, this
cannot be attained by repeating the encounter between the same person and the same item. Such replication can
only be achieved with more items whose features are included in the model specifications. That is, the model
must incorporate multiple items. This chapter examines multi-item model specifications that support the goal
of measurement. The objective is to select the model that best facilitates the development of reliable measuring
instruments. From this perspective, the Rasch model offers important advantages over other models.
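For readers unfamiliar with the model, the dichotomous Rasch success probability the authors build on can be sketched as follows. This is a standard statement of the model, not code from the article, and the function name is ours:

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """Probability that a person of ability theta (in logits) succeeds
    on a dichotomous item of difficulty b under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Equal ability and difficulty give a 50% chance of success;
# a person one logit above the item succeeds about 73% of the time.
print(round(rasch_p(0.0, 0.0), 2), round(rasch_p(1.0, 0.0), 2))  # 0.5 0.73
```

Replication across items, the chapter's theme, amounts to evaluating this function for one person against many items whose difficulties b are jointly specified in the model.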
****
Using Guttman’s Mapping Sentences and Many Facet Rasch Measurement Theory to Develop an Instrument that Examines
the Grading Philosophies of Teachers
Jennifer Randall and George Engelhard, Jr.
Abstract
This study presents an approach to questionnaire design within educational research based on Guttman’s mapping
sentences (Guttman, 1977) and Many-Facet Rasch Measurement Theory (Linacre, 1994). The primary
purpose of this study was to illustrate how Guttman’s mapping sentences can be used to develop an instrument
that explores the grading philosophies of teachers. A secondary purpose was to clarify teacher grading
philosophies (i.e., severity or leniency) as a measurement construct. We designed a 54-item questionnaire in
which each item represented a unique combination of student characteristics, i.e., varying levels of classroom
achievement, ability, behavior, and effort. The grades assigned by the teachers to the scenarios were analyzed
using the FACETS (Linacre, 2007) computer program. The results of the analyses suggest that the grading
philosophies of teachers represent a unidimensional construct which is influenced, to varying extents, by the
classroom achievement (primarily), behavior, and effort of students; whereas the measurement value added by
the inclusion of the ability facet is uncertain.
****
Development of Scales Relating to Professional Development of Community College Administrators
Edward W. Wolfe and Kim E. Van Der Linden
Abstract
This article reports the results of an application of the Multidimensional Random Coefficients Multinomial
Logit Model (MRCMLM) to the measurement of professional development activities in which community college
administrators participate. The analyses focus on confirmation of the factorial structure of the instrument,
evaluation of the quality of the activities calibrations, examination of the internal structure of the instrument, and
comparison of groups of administrators. The dimensionality analysis results suggest a five-dimensional model that
is consistent with previous literature concerning career paths of community college administrators—education
and specialized training, internal professional development and mentoring, external professional development,
employer support, and seniority. The indicators of the quality of the activity calibrations suggest that measures
of the five dimensions are adequately reliable, that the activities in each dimension are internally consistent,
and that the observed responses to each activity are consistent with the expected values of the MRCMLM. The
hierarchy of administrator measure means and of activity calibrations is consistent with substantive theory
relating to professional development for community college administrators. For example, readily available
activities that occur at the institution were most likely to be engaged in by administrators, while participation in
selective specialized training institutes was the least likely activity. Finally, group differences with respect to
age and title were consistent with substantive expectations—the greater the administrator’s age and the higher
the rank of the administrator’s title, the greater the probability of having engaged in various types of professional
development.
****
Comparing Décalage and Development with Cognitive Developmental Tests
Trevor Bond
Abstract
The use of Rasch measurement techniques with data from developmental psychology has provided important
insights into human development (e.g., Bond, 1997, 2003; Dawson, 2002a, b). In particular, Rasch methods
support investigations into what have been, until now, intractable theoretical and empirical problems. Research
into the development of formal operational thinking using the Rasch model (Bond, 1995a, b; Bond and Bunting,
1995; Bond and Fox, 2001) substantiates important aspects of the original theorizing of Piaget (Inhelder and
Piaget, 1955/1958), which was based wholly on qualitative structural analyses of children’s problem-solving
responses. Common-person equating of student performances has been used across different formal operational
thinking tasks to estimate the relative difficulties of tasks measuring the same underlying developmental construct
(Bond, 1995b; Bond and Fox, 2001). Repeated person performance measures on the same task have been used in
order to estimate cognitive development over time. Rasch measurement estimates of cognitive development do
not exceed 0.5 logits per annum (Bond, 1996; Endler, 1998; Stanbridge, 2001), a result that has been estimated
independently in two large research projects in the United Kingdom (Shayer, 1999) and in Papua-New Guinea
(Lake, 1996). Interestingly, difficulty differences (décalage) between tests of formal thought are as large as 2.0
logits (Bond, 1995a; Bond, 1996; Bond and Fox, 2001), confounding attempts to differentiate development from
décalage. Given the problems and possibilities raised by the Rasch measurement quantification of cognitive
development, this article canvasses the promise of using Rasch modelling techniques to investigate systematically
these fundamental aspects of human cognitive performance.
****
Reliability of Performance Examinations: Revisited
Mary E. Lunz and John M. Linacre
Abstract
This article discusses the reliability of performance examinations with respect to the reproducibility of candidate
pass-fail decisions. The multi-facet Rasch model accounts for the difficulty of test forms (examiners + tasks +
items) taken by each candidate, so that all candidates are measured against the same criterion. The examiners
provide analytic ratings of candidate performance using a defined rating scale. The examination is standardized
so that sufficient information is collected about the candidates, and facets of the examination are linked. When
the examination data are properly collected, there are high levels of confidence in the reproducibility of candidate
outcomes for a high percentage of candidates.
****
Understanding Rasch Measurement: Equating Designs and Procedures Used in Rasch Scaling
Gary Skaggs and Edward W. Wolfe
Abstract
The development of alternate forms of tests requires a statistical score adjustment called equating that permits
the interchanging of scores from different test forms. Equating makes possible several important measurement
applications, including removing practice effects in pretest-posttest research designs, improving test security,
comparing scores between new and old forms, and supporting item bank development for computerized adaptive
testing. This article summarizes equating methods from a Rasch measurement perspective. The four sections of
this article present an introduction and definition of equating and related linking methods, data collection designs,
equating procedures, and evaluating equated measures. The methods are illustrated with worked examples.
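One of the simplest Rasch equating procedures of the kind surveyed here is a common-item mean shift: because the Rasch scale is interval, a single additive constant places a new form's item difficulties on the reference form's scale. A hypothetical sketch (all numbers invented for illustration):

```python
# Difficulties (in logits) of the anchor items under each calibration.
ref_common = [-1.2, -0.3, 0.4, 1.1]   # common items, reference calibration
new_common = [-0.9,  0.0, 0.7, 1.4]   # same items, new-form calibration

# The equating constant is the mean difference on the common items.
shift = sum(r - n for r, n in zip(ref_common, new_common)) / len(ref_common)

# Apply the constant to the remaining new-form item difficulties.
new_form = [-2.0, -0.5, 0.7, 1.6]
equated = [b + shift for b in new_form]
print(round(shift, 2), [round(b, 2) for b in equated])  # -0.3 [-2.3, -0.8, 0.4, 1.3]
```

In practice the common-item differences would first be screened for drift (e.g., by displacement statistics) before averaging, a step omitted here for brevity.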
****
Vol. 11, No. 3 Fall 2010
Foreword, Emergence of Efficiency in Health Outcome Measurement
Nikolaus Bezruczko, Guest Editor
Abstract
Psychosocial measurement in the 21st Century is a dynamic field that is addressing challenges unthinkable
even a generation ago. Sophisticated methods and modern technology have brought psychometrics to the cusp
of scientific objectivity. This Foreword provides historical context and intellectual foundations for appreciating
contemporary psychometric advancements, as well as a perspective on issues that are determining future advances.
Efficiency in outcome measurement is one of the forces driving future advances. Efficiency, however, can
easily become conflated with expediency, and neither can substitute for effectiveness. Blind efficiency runs the
risk of degrading measurement properties. Likewise, measurement advancement without accommodation to ordinary
needs leads to practical rejection. Bouchard presents a biographical link between scientific physics and Rasch
models that opened the door for fundamental psychosocial measurement. Symposium papers presented in this
issue present a broad range of ideas about contemporary psychosocial measurement. Granger summarizes key
ideas underlying achievement of objective, fundamental measurement. Massof, then, Stenner and Stone present
alternative perspectives on scientific knowledge systems, which are prominent landmarks on the psychometric
horizon. Fisher and Burton describe fundamental measurement methodology in diagnosis and implementation
of technology, which will consolidate isolated and redundant constructs in PROMIS. Hart presents an overview
on computer adaptive testing, which is the vanguard in health outcome measurement. Kisala and Tulsky present
a qualitative strategy that is improving the sensitivity and validity of new outcome measures. The diversity of
these papers reflects an intense competition of ideas about solving measurement problems; their collection in
this special issue is a milestone and a tribute to scientific ingenuity.
****
Measuring One Variable at a Time: The Wright Way
Ed Bouchard
Abstract
This article, based on research for Ben Wright’s biography, explores some influences leading to his profound
commitment to useful and accurate measures. We briefly touch on his decision to leave a promising career in
physics for education and psychometrics, stemming from his belief that understanding how children learn is even
more important than understanding molecular structure. Along the way, we focus on a debate over measurement
models that Wright began with Fred Lord at an ETS Invitational Conference in 1967.
****
Rasch-Derived Latent Trait Measurement of Outcomes: Insightful Use Leads to Precision Case Management and Evidence-Based Practices
in Functional Healthcare
Carl V. Granger, Marsha Carlin, John M. Linacre, Ronald Mead, Paulette Niewczyk, A. Jackson Stenner, and Luigi Tesio
Abstract
Rasch-derived latent trait measurement evolved from other fields, particularly education. Person-metrics is
the measurement of how much chronic disease and disablement affects an individual’s daily activities physically,
cognitively, and through vocational and social role participation. The Rasch model’s assumption that the
probability of a given person/item interaction is governed by the difficulty of the item and the ability of the
person is invaluable to disability
measurement. The difference between raw scores and true measures is illustrated by an example of a patient
whose physical difficulty is rated on rising from a wheelchair and walking 100m (known to be more difficult),
and then walking an additional 200m. Though number ratings of 0-1-2 are assigned to these tasks, they are not
equidistant, and only a true measure shows the actual levels of physical difficulty.
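The person/item relationship described above is, in its usual dichotomous form (a standard statement of the Rasch model, not an equation reproduced from the article):

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}
```

where \theta_n is the ability of person n and b_i the difficulty of item i. The 0-1-2 task ratings in the wheelchair example are ordinal labels, while \theta and b lie on an equal-interval logit scale, which is why equal rating steps need not represent equal amounts of physical difficulty.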
****
Generally Objective Measurement of Human Temperature and Reading Ability: Some Corollaries
A. Jackson Stenner and Mark Stone
Abstract
We argue that a goal of measurement is general objectivity: point estimates of a person’s measure (height, temperature,
and reader ability) should be independent of the instrument and independent of the sample in which
the person happens to find herself. In contrast, Rasch’s concept of specific objectivity requires only differences
(i.e., comparisons) between person measures to be independent of the instrument. We present a canonical case
in which there is no overlap between instruments and persons: each person is measured by a unique instrument.
We then show what is required to estimate measures in this degenerate case. The canonical case encourages a
simplification and reconceptualization of validity and reliability. Not surprisingly, this reconceptualization looks
a lot like the way physicists and chemometricians think about validity and measurement error. We animate this
presentation with a technology that blurs the distinction between instruction, assessment, and generally objective
measurement of reader ability. We encourage adaptation of this model to health outcomes measurement.
****
A Clinically Meaningful Theory of Outcome Measures in Rehabilitation Medicine
Robert W. Massof
Abstract
Comparative effectiveness research in rehabilitation medicine requires the development and validation of
clinically meaningful and scientifically rigorous measurements of patient states and theories that explain and
predict outcomes of intervention. Patient traits are latent (unobservable) variables that can be measured only
by inference from observations of surrogate manifest (observable) variables. In the behavioral sciences, latent
variables are analogous to intensive physical variables such as temperature and manifest variables are analogous
to extensive physical variables such as distance. Although only one variable at a time can be measured,
the variable can have a multidimensional structure that must be understood in order to explain disagreements
among different measures of the same variable. The use of Rasch theory to measure latent trait variables can be
illustrated with a balance scale metaphor that has randomly added variability in the weights of the objects being
measured. Knowledge of the distribution of the randomly added variability provides the theoretical structure
for estimating measures from ordinal observation scores (e.g., performance measures or rating scales) using
statistical inference. In rehabilitation medicine, the latent variable of primary interest is the patient’s functional
ability. Functional ability can be estimated from observations of surrogate performance measures (e.g., speed
and accuracy) or self-report of the difficulty the patient experiences performing specific activities. A theoretical
framework borrowed from project management, called the Activity Breakdown Structure (ABS), guides the
choice of activities for assessment, based on the patient’s value judgments, to make the observations clinically
meaningful. In the case of low vision, the functional ability measure estimated from Rasch analysis of activity
difficulty ratings was discovered to be a two-dimensional variable. The two visual function dimensions are independent
of physical limitations and psychological state. To explain outcome measures (latent variable estimated
from difficulty ratings), a theory must be developed that explicitly defines how latent variables are related to
the observed manifest variables and to each other. The latent variables are categorized as primary variables,
which in the case of low vision are the two visual function dimensions, and as effect modifiers, which in the case
of low vision are other physical and psychological latent traits of the patients that can influence the outcome
measures. Interventions give rise to latent intervention effect variables that can alter the latent primary variables
or independently affect the outcome measures. The latent effect modifier variables, in turn, can alter the latent
intervention effect variables. Once developed and validated, a theory of this form will predict the rehabilitation
potential of individual patients, i.e., the probability of obtaining criterion outcome measures given the observed
state of the patient and the choice of interventions.
****
Embedding Measurement within Existing Computerized Data Systems: Scaling Clinical Laboratory and Medical Records Heart Failure Data
to Predict ICU Admission
William P. Fisher, Jr. and Elizabeth C. Burton
Abstract
This study employs existing data sources to develop a new measure of intensive care unit (ICU) admission risk
for heart failure patients. Outcome measures were constructed from laboratory, accounting, and medical record
data for 973 adult inpatients with primary or secondary heart failure. Several scoring interpretations of the laboratory
indicators were evaluated relative to their measurement and predictive properties. Cases were restricted to
tests within the first lab draw that included at least 15 indicators. After optimizing the original clinical observations,
a satisfactory heart failure severity scale was calibrated on a 0-1000 continuum. Patients with unadjusted CHF
severity measures of 550 or less were 2.7 times more likely to be admitted to the ICU than those with higher
measures. Patients with low HF severity measures (<550) adjusted for demographic and diagnostic risk factors
are about six times more likely to be admitted to the ICU than those with higher adjusted measures. A nomogram
facilitates routine clinical application. Existing computerized data systems could be programmed to automatically
structure clinical laboratory reports using the results of studies like this one to reduce data volume with
no loss of information, make laboratory results more meaningful to clinical end users, improve the quality of
care, reduce errors and unneeded tests, prevent unnecessary ICU admissions, lower costs, and improve patient
satisfaction. Existing data, typically examined piecemeal, form a coherent scale measuring heart failure severity
sensitive to increased likelihood of ICU admission. Marked improvements in ROC curves were found for the
aggregate measures relative to individual clinical indicators.
****
Implementing Computerized Adaptive Tests in Routine Clinical Practice: Experience Implementing CATs
Dennis L. Hart, Daniel Deutscher, Mark W. Werneke, Judy Holder, and Ying-Chih Wang
Abstract
This paper traces the development, testing and use of CATs in outpatient rehabilitation from the perspective
of one proprietary international medical rehabilitation database management company, Focus On Therapeutic
Outcomes, Inc. (FOTO). Between the FOTO data in the United States and Maccabi Healthcare Services data
in Israel, over 1.5 million CATs have been administered. Using findings from published studies and results of
internal public relations surveys, we discuss (1) reasons for CAT development, (2) how the CATs were received
by clinicians and patients in the United States and Israel, (3) results of psychometric property assessments of
CAT estimated measures of functional status in routine clinical practice, (4) clinical interpretation of CAT functional
status measures, and (5) future development directions. Results of scientific studies and business history
provide confidence that CATs are pertinent and valuable to clinicians, patients and payers, and suggest CATs
will be prominent in the development of future integrated computerized electronic medical record systems with
electronic outcomes data collection.
****
The Use of PROMIS and Assessment Center to Deliver Patient-Reported Outcome Measures in Clinical Research
Richard C. Gershon, Nan Rothrock, Rachel Hanrahan, Michael Bass, and David Cella
Abstract
The Patient-Reported Outcomes Measurement Information System (PROMIS) was developed as one of the
first projects funded by the NIH Roadmap for Medical Research Initiative to re-engineer the clinical research
enterprise. The primary goal of PROMIS is to build item banks and short forms that measure key health outcome
domains that are manifested in a variety of chronic diseases and that could be used as a “common currency” across
research projects. To date, item banks, short forms and computerized adaptive tests (CAT) have been developed
for 13 domains with relevance to pediatric and adult subjects. To enable easy delivery of these new instruments,
PROMIS built a web-based resource (Assessment Center) for administering CATs and other self-report data,
tracking item and instrument development, monitoring accrual, managing data, and storing statistical analysis
results. Assessment Center can also be used to deliver custom researcher-developed content, and has numerous
features that support both simple and complicated accrual designs (branching, multiple arms, multiple time
points, etc.). This paper provides an overview of the development of the PROMIS item banks and details Assessment
Center functionality.
****
Opportunities for CAT Applications in Medical Rehabilitation: Development of Targeted Item Banks
Pamela A. Kisala and David S. Tulsky
Abstract
Researchers in the field of rehabilitation medicine have increasingly turned to qualitative data collection methods
to better understand the experience of living with a disability. In rehabilitation psychology, these techniques
are embodied by participatory action research (PAR; Hall 1981; White, Suchowierska, and Campbell 2004),
whereby researchers garner qualitative feedback from key stakeholders such as patients and physicians. Glaser
and Strauss (1967) and, later, Strauss and Corbin (1998) have outlined a systematic method of gathering and
analyzing qualitative data to ensure that results are conceptually grounded in the population of interest. This
type of analysis yields a set of interrelated concepts (“codes”) to describe the phenomenon of interest. Using
this data, however, becomes somewhat of a methodological problem. While this data is often used to describe
phenomena of interest, it is challenging to transform the knowledge gained into practical data to inform research
and clinical practice. In the case of developing patient-reported outcomes (PRO) measures for use in a rehabilitation
population, it is difficult to make sense of the qualitative analysis results. Qualitative feedback tends to be
open-ended and free-flowing, not conforming to any traditional data analysis methodology. Researchers involved
in measure development need a practical way to quantify the qualitative feedback. This manuscript will focus
on a detailed methodology of empiricizing qualitative data for practical application, in the context of developing
targeted, rehabilitation-specific PRO measures within a larger, more generic PRO measurement system.
****
Postscript, Emergence of Efficiency in Health Outcomes Measurement
Karon F. Cook
Abstract
The purpose of this postscript is to comment on psychometric issues raised by the collective articles of this special
issue of the Journal of Applied Measurement. Topics discussed include the need to engage relevant literature
in the psychometric and larger academic communities, the role of measurement models in psychometrics, and
challenges to efficient measurement of patient-reported outcomes. Finally, I argue that psychometrics should
play “second fiddle” to larger scientific questions.
****
Vol. 11, No. 4 Winter 2010
Rasch Model’s Contribution to the Study of Items and Item Response Scales Formulation in Opinion/Perception Questionnaires
Jean-Guy Blais, Julie Grondin, Nathalie Loye, and Gilles Raîche
Abstract
Questionnaire-based inquiries make it possible to obtain data rather quickly and at relatively low cost, but a
number of factors may influence respondents’ answers and affect the validity of the data. Some of these factors are
related to the individuals and the environment, while others are directly related to the characteristics of the questionnaire
and its items: the text introducing the questionnaire, the order in which the items are presented, the number of
response categories and their labels on the proposed scale, and the wording of the items. The focus of this article is
on this last point, and its goal is to show how the diagnostic features developed around Rasch modelling
can be used to study the impact of item wording in opinion/perception questionnaires on the responses
obtained and on the location of anchor points of the item response scale.
****
Estimating Tests Including Subtests
Steffen Brandt
Abstract
Current assessment studies often face a dilemma that arises from the necessity of a simultaneous unidimensional
and multidimensional interpretation of a test. When, in addition to the measurement of a single domain, subdomains
of that domain are to be measured, one and the same data set has to be analyzed unidimensionally and
multidimensionally at the same time, even though this contradicts the theoretical assumptions underlying
the analysis. This article first describes the psychometric deficiencies typically associated with current
analysis approaches. Subsequently, a subdimension model is proposed that explicitly allows for the existence
of subdomains, or subdimensions, within a measured domain. The model thereby provides a means to obtain calibration
results that suffer less from the deficiencies described, and also allows for an item selection process in test
development that takes the multidimensional structure of the test into account.
****
Measure for Measure: Curriculum Requirements and Children’s Achievement in Music Education
Trevor Bond and Marie Bond
Abstract
Children in all public primary schools in Queensland, Australia have weekly music lessons designed to develop
key musical concepts such as reading, writing, singing and playing simple music notation. Their understanding
of basic musical concepts is developed through a blend of kinaesthetic, visual and auditory experiences.
In keeping with the pedagogical principles outlined by the Hungarian composer Zoltán Kodály, early musical
experiences are based on singing well-known children’s chants—usually restricted to notes of the pentatonic
scale. In order to determine the extent to which primary school children’s musical understandings developed
in response to these carefully structured developmental learning experiences, the Queensland Primary Music
Curriculum was examined to yield a set of over 70 indicators of musical understanding in the areas of rhythm,
melody and part-work—the essential skills for choral singing. Data were collected from more than 400 children’s
attempts at elicited musical performances. Quantitative data analysis procedures derived from the Rasch model
for measurement were used to establish the sequence of children’s mastery of key musical concepts. Results
suggested that while the music curriculum did reflect the general development of musical concepts, the grade
allocation for a few concepts needed to be revised. Subsequently, children’s performances over several years
were also analysed to track the musical achievements of students over time. The empirical evidence confirmed
that children’s musical development was enhanced by school learning and that indicators can be used to identify
both outstanding and atypical development of musical understanding. It was concluded that modest adjustments
to the music curriculum might enhance children’s learning opportunities in music.
****
On the Factor Structure of Standardized Educational Achievement Tests
Tim W. Gaffney, Robert Cudeck, Emilio Ferrer, and Keith F. Widaman
Abstract
This research analyzed the factor structure, at both the item- and subtest-level, of California’s norm- and criterion-referenced
standardized educational achievement tests (SEATs) used in that state’s high-stakes educational
accountability assessments. It was shown through full information factor analysis and multidimensional IRT
models (e.g., TESTFACT and NOHARM) that, at the item-level, SEATs are invariably highly unidimensional
(i.e., they appear to tap a unidimensional theta scale) even when items representing content areas
as diverse as English, science, mathematics, and history are analyzed simultaneously as a single measure. These
item-level factors also accounted for a relatively small proportion (1/4 to 1/3) of the variance. It was also shown
that, when these tests are analyzed using more reliable indicators such as subtests, a much richer factor structure
emerged that accounted for a larger portion (about 2/3) of the total common variance. As expected, these factor
structure configurations (and underlying dimensionality) were preserved across the item- and subtest-levels.
However, the factors emerging from both the item- and subtest-level analyses were highly correlated and produced
strong second-order and general factors. The meaning underlying these results was examined, along with
their implications with respect to the assumptions underlying modern approaches to test calibration, scaling,
and score interpretation.
****
The Practical Application of Optimal Appropriateness Measurement on Empirical Data using Rasch Models
Iasonas Lamprianou
Abstract
Optimal Appropriateness Measurement (OAM) is a general statistical method for the identification of examinees
whose test scores might not be a valid indicator of their true latent ability or trait. The method is statistically
very powerful, and it points to the direction of the suspected aberrance instead of simply identifying
that a specific response pattern is, in some way, aberrant. The method has traditionally been used with multiparameter
IRMs for the identification of examinees with spuriously low and high scores. This article presents
the practical application of the method, using Rasch models, in the context of a large-scale activity which aimed
to provide secondary education schools with feedback about their students’ performance on a high-stakes University
entrance science test. Although researchers in the past claimed that OAM was not ready to be routinely
used in practical settings, this article maintains that the practical use of OAM to answer specific educationally
meaningful questions is feasible.
****
Features of the Sampling Distribution of the Ability Estimate in Computerized Adaptive Testing According to Two Stopping Rules
Jean-Guy Blais and Gilles Raîche
Abstract
Whether paper-and-pencil or computerized adaptive, tests are usually described by a set of rules governing how
they are administered: which item comes first, which should follow any given item, and when to administer the
last one. This article focuses on the latter and examines the effect of two stopping rules on the estimated sampling
distribution of the ability estimate in a CAT: the number of items administered and the a priori determined size
of the standard error of the ability estimate.
****
Understanding Rasch Measurement: Developing Examinations that use Equal Raw Scores for Cut Scores
Andrew Swanlund and Everett Smith
Abstract
This study describes and demonstrates a set of processes for developing new examination forms that
are intended to have equivalent cut scores in the raw score metric. This approach goes beyond the traditional
Rasch-based approach, which develops forms with cut scores that are equated in the logit metric. The methods
described in this study can be used to create multiple forms of an assessment, all of which have the same raw
score cut score (i.e., the number correct required to pass each examination form represents the same amount of
the underlying construct). This paper provides an overview of equating standards, the research related specifically
to pre-equating procedures, and three guidelines which can be used to achieve equal raw score cut scores.
Three examples of how to use the guidelines as part of an iterative form-development process are provided
using simulated data sets.
****