Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 13, 2012 Article Abstracts
Vol. 13, No. 1 Spring 2012
Formulating Latent Growth Using an Explanatory Item Response Model Approach
Mark Wilson, Xiaohui Zheng, and Leah McGuire
Abstract
In this paper, we present a way to extend the Hierarchical Generalized Linear Model (HGLM; Kamata, 2001;
Raudenbush, 1995) to include the many forms of measurement models available under the formulation known
as the Multidimensional Random Coefficients Multinomial Logit (MRCML) Model (Adams, Wilson and Wang, 1997), and apply
that to growth modeling. First, we review two different traditions in modeling growth studies: the first is based
in the hierarchical linear modeling (HLM) tradition, and the second, which is the topic of this paper, is rooted
in the Rasch measurement tradition—this is the linear Latent Growth Item Response Model (LG-IRM). Going
beyond the linear case, the LG-IRM approach allows us to considerably extend the range of models available in
the HLM tradition to incorporate several of the extensions of IRT models that are used in creating explanatory
item response models (EIRM; De Boeck and Wilson, 2004). We next present a number of extensions—including
polynomial growth modeling, differential item functioning (DIF) effects, growth functions that can be approximated
by polynomial expressions, provision for polytomous responses, person and item covariates (and
time varying covariates), and multiple dimensions of growth. We provide two empirical examples to illustrate
several of the models, using the ConQuest software (Wu, Adams, Wilson and Haldane, 2008) to carry out the
analyses. We also provide several simulations to investigate the success of the estimation procedures.
****
Using the Mixed Rasch Model to Analyze Data from the Beliefs and Attitudes About Memory Survey
Everett V. Smith, Jr., Yuping Ying, and Scott W. Brown
Abstract
In this study, we used the Mixed Rasch Model (MRM) to analyze data from the Beliefs and Attitudes About
Memory Survey (BAMS; Brown, Garry, Silver, and Loftus, 1997). We used the original 5-point BAMS data
to investigate the functioning of the “Neutral” category via threshold analysis under a 2-class MRM solution.
The “Neutral” category was identified as not eliciting the model-expected responses, and observations in the
“Neutral” category were subsequently treated as missing data. For the BAMS data without the “Neutral”
category, exploratory MRM analyses specifying up to 5 latent classes were conducted to evaluate data-model
fit using the consistent Akaike information criterion (CAIC). For each of three BAMS subscales, a two latent
class solution was identified as fitting the mixed Rasch rating scale model the best. Results regarding threshold
analysis, person parameters, and item fit based on the final models are presented and discussed as well as the
implications of this study.
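
As a minimal sketch of the model-selection step described above, the consistent AIC can be computed as -2 log L + k(ln N + 1) and the class solution with the smallest value retained. The log-likelihoods, parameter counts, and sample size below are hypothetical placeholders, not values from the study.

```python
import math

def caic(log_likelihood, n_params, n_obs):
    """Consistent AIC: -2 log L + k * (ln N + 1); smaller is better."""
    return -2.0 * log_likelihood + n_params * (math.log(n_obs) + 1.0)

# Hypothetical fit results: number of classes -> (log-likelihood, parameter count)
fits = {1: (-5200.0, 20), 2: (-5050.0, 41), 3: (-5020.0, 62)}
n_obs = 500  # hypothetical sample size

scores = {classes: caic(ll, k, n_obs) for classes, (ll, k) in fits.items()}
best = min(scores, key=scores.get)  # class count with the lowest CAIC
```

With these made-up inputs the two-class solution has the lowest CAIC, because the penalty term grows with each added class while the gain in log-likelihood shrinks.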
****
An Examination of Personality Characteristics Related to Acquiescence
Christine DiStefano, Grant B. Morgan, and Robert W. Motl
Abstract
Acquiescence, the tendency to agree with statements regardless of content, is often a concern when administering
self-report instruments. While there is evidence to support acquiescence as a response style, this reporting
tendency may be related to personality factors of individuals. Using a sample of 757 adults, we investigated
the Rosenberg Self-Esteem Scale for acquiescence response tendencies by applying the Rasch partial credit
model. Results suggested that favorable (i.e., Agree or Strongly Agree) responses were more frequent for the
positively worded items than for negatively worded items. Second, the relationship between acquiescence and
seven additional personality measures was examined overall and by sex. Among females, acquiescence was
correlated with personality measures measuring perceptions by others, whereas acquiescence among males was
related to exhibition types of behaviors.
****
Construction and Validation of Two Parent-Report Scales for the Evaluation of Early Intervention Programs
William P. Fisher, Jr., Batya Elbaum, and W. Alan Coulter
Abstract
The State Performance Plan (SPP) developed under the 2004 reauthorization of the Individuals with Disabilities
Education Act (IDEA 2004, Public Law 108-446) requires states to collect data and report on the impact of
early intervention services on three key outcomes for participating families. The NCSEAM Impact on Family
Scale (NIFS) and the NCSEAM Family Centered Services Scale (NFCSS) were developed to provide states
with a means to address this new reporting requirement and to collect additional data that would inform program
improvement efforts. Items suggested by stakeholder groups were piloted with a nationally representative
sample of parents of children with developmental delays or disabilities ages birth to three participating in early
intervention services in eight states. The 28-item NIFS had measurement reliabilities ranging from .93 to .96 in a
sample of 1,750; measurement reliabilities for the 135-item NFCSS ranged from .94 to .97 in a sample of 1,755
respondents. A 29-item version of the NFCSS had measurement reliabilities ranging from .87 to .92. Using data
from the pilot study, stakeholders established a recommended performance standard, set at a meaningful point
in the NIFS item hierarchy, for each of the three established outcome areas.
****
Multi-Factor Scale Consolidation When Theory is Weak
Nikolaus Bezruczko and Kyle Perkins
Abstract
As a practical matter, Spirituality and Quality of Life in the health sciences are usually measured separately.
Theoretical foundations for this distinction, however, are not strong. In this research, an empirical investigation
was conducted into their joint calibration with a Rasch model. Functional Assessment of Cancer Therapy-General
(28 items), a cancer health-related quality of life measure (HRQOL), and Functional Assessment of Chronic
Illness - Spiritual Well-Being (12 items), a measure of religious and existential well-being (Spirituality), were
co-calibrated with a Rasch model implemented with WINSTEPS software for ratings from 545 breast cancer
patients. The results show a hierarchical integration of QOL and Spirituality items on a common variable, and
both patient separation (2.66) and reliability (.88) improve after co-calibration. Principal Component Analysis
of co-calibrated item residuals did not show major threats to dimensionality, and joint calibration explains item
variance comparable to separate calibrations (51.9%). Although patient measures (logits) based on separate and
co-calibration are within two standard errors, ethnic and racial group values shift after co-calibration.
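
The reported separation and reliability can be checked against each other with the usual Rasch relationship, reliability = G^2 / (1 + G^2), where G is the separation index. The sketch below assumes that relationship applies to the values reported here; the function names are illustrative.

```python
import math

def reliability_from_separation(g):
    """Reliability implied by a separation index G."""
    return g ** 2 / (1.0 + g ** 2)

def separation_from_reliability(r):
    """Separation index implied by a reliability coefficient R (0 <= R < 1)."""
    return math.sqrt(r / (1.0 - r))

# Separation of 2.66 implies a reliability of about .88
implied = reliability_from_separation(2.66)
```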
****
Developing an Emotional Distress Item Bank for Cancer Patients
Allen W. Heinemann, Rita K. Bode, Sarah Rosenbloom, and David Cella
Abstract
Emotional distress is common among cancer patients during and after treatment. Many instruments have been
used to measure emotional distress; however, none of them has emerged as a standard. Although the diversity
of instruments has some merit, the lack of a common measure limits our ability to compare studies. This paper
describes how we constructed a 46-item emotional distress bank. Using expert judgment, we selected a pool of
items with emotional content from a set of six instruments. Rasch rating scale analysis helped us identify a set of
general distress items with good model fit and a measurement gap causing floor effects. Additional items were
written to augment the measure where found deficient. The resulting set of items reflects a spectrum of positive
and negative affect. The measure demonstrated excellent reliability (person separation reliability = .96) and a
wide range of emotional distress and was able to distinguish among levels of disease severity.
****
Vol. 13, No. 2 Summer 2012
Is the Partial Credit Model a Rasch Model?
Robert W. Massof
Abstract
A balance scale metaphor is offered as a tool for explaining the principles of measurement and for visualizing the
internal structure of dichotomous and polytomous Rasch models. The balance scale metaphor is used to guide
the derivation of a general polytomous Rasch model and to illustrate the additional assumptions subsequently
required to derive the Andrich (1978) rating scale model (RSM) and the Masters (1982) partial credit model
(PCM). The metaphor is used to present the argument that the RSM conforms to the rules of measurement, but
the PCM has interactions implicit in its structure that violate specific objectivity and sufficiency of raw scores,
which challenge its status as a Rasch model. Using the metaphor and a literal interpretation of the narrative
description of the PCM by Masters (1982), a new version of the PCM is derived that does conform to the rules
of measurement.
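
For orientation, the sketch below computes category probabilities under the standard textbook forms of the PCM and RSM, in which the RSM is a PCM whose step difficulties are constrained to an item location plus shared thresholds. It does not reproduce the balance scale derivation or the revised PCM proposed in the article; the function and argument names are illustrative.

```python
import math

def pcm_probs(theta, deltas):
    """Partial credit model: probabilities of categories 0..m for one item.

    theta  -- person measure (logits)
    deltas -- step difficulties delta_1..delta_m for this item (logits)
    """
    cumulative = [0.0]  # category 0 has an empty sum by convention
    for d in deltas:
        cumulative.append(cumulative[-1] + (theta - d))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def rsm_probs(theta, item_location, thresholds):
    """Rating scale model: PCM with steps delta_i + tau_k shared across items."""
    return pcm_probs(theta, [item_location + t for t in thresholds])
```

With a single step difficulty, `pcm_probs` reduces to the dichotomous Rasch model.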
****
Exploring the Alignment of Writing Self-Efficacy with
Writing Achievement Using Rasch Measurement Theory and Qualitative Methods
George Engelhard, Jr. and Nadia Behizadeh
Abstract
Alignment of writing self-efficacy and writing achievement is defined as the congruence between student confidence
regarding writing skills (writing self-efficacy) and the actual performance on these writing skills as reflected
in teacher grades (achievement). One purpose of this study is to examine the relationship between these two
variables. A second purpose is to demonstrate a mixed-methods approach to investigating relationships between
affective variables using Rasch measurement and interviews. Participants were eighth grade students (N = 94)
from an ethnically and socioeconomically diverse school in the southeastern United States. Our results suggest
that students who struggle with the mechanics of writing yet appreciate the expressive capacity of writing may
have higher senses of writing self-efficacy that are not predictive of performance.
****
Measuring Positive and Negative Affect
in Older Adults Over 56 Days: Comparing Trait Level
Scoring Methods Using the Partial Credit Model
Monica K. Erbacher, Karen M. Schmidt, Steven M. Boker, Cindy S. Bergeman
Abstract
Positive (PA) and negative affect (NA) are important constructs in health and well-being research. Good longitudinal
measurement is crucial to conducting meaningful research on relationships between affect, health, and
well-being across the lifespan. One common affect measure, the PANAS, has been evaluated thoroughly with
factor analysis, but not with Rasch-based latent trait models (RLTMs) such as the partial credit model (PCM),
and not longitudinally. Current longitudinal RLTMs can computationally handle few occasions of data. The
present study compares four methods of anchoring PCMs across 56 occasions to longitudinally evaluate the
psychometric properties of the PANAS plus additional items. Anchoring item parameters on mean parameter
values across occasions produced more desirable results than using no anchor, using first occasion parameters
as anchors, or allowing anchor values to vary across occasions. Results indicated problems with NA items,
including poor category utilization, gaps in the item distribution, and a lack of easy-to-endorse items. PA items
had much more desirable psychometric qualities.
****
Item Set Discrimination and the Unit in the Rasch Model
Stephen Humphry
Abstract
The aim is to show that it is possible to parameterize discrimination for sets of items, rather than individual
items, without destroying conditions for sufficiency in a form of the Rasch model. The form of the model is
obtained by formalizing the relationship between discrimination and the unit of a metric. The raw score vector
across item sets is the sufficient statistic for the person parameter. Simulation studies are used to show the
implementation of conditional estimation solution equations based on the relevant form of the Rasch model.
The model is also applied to two numeracy tests attempted by a group of common persons in a large-scale testing
program. The results show improved fit compared with the Rasch model in its standard form. They also show
the units of the scales were more accurately equated. The paper discusses implications for applied measurement
using Rasch models and contrasts the approach with the application of the two parameter logistic (2PL) model.
****
Educational Achievement, Personality, and Behavior:
Assessment, Factor Structure and Implications for Theory and Practice
Tim W. Gaffney and Cassandra Perryman
Abstract
The purposes of this research were to first examine the evidence regarding the factor structure of educational
achievement tests in the context of two theoretical models of cognitive ability (psychometric g and mutualism)
that have been proposed to explain this structure as well as the underlying processes that may be responsible
for its emergence in dimensionality studies. Then, the factor structure underlying a sample of the standardized
educational achievement tests used by California in its statewide school accountability program was compared
to those emerging from a selection of behavioral and personality assessments. As expected, the educational
achievement tests exhibited a strong and uniformly positive manifold resulting in greater unidimensionality as
evidenced by a dominant general factor in bi-factor analysis than either the personality or behavioral assessments.
The implications of these structural differences are discussed with respect to the two theoretical perspectives as
well as in the context of formative and summative educational inferences in particular, and the school accountability
and reform movement in general.
****
An External Validation Study of a Classification
of Mixed Connective Tissue Disease and Systemic Lupus Erythematosus Patients
Robert W. Hoffman, Nikolaus Bezruczko, and Kyle Perkins
Abstract
Mixed Connective Tissue Disease (MCTD) and Systemic Lupus Erythematosus (SLE) are autoimmune rheumatic
diseases that are difficult for physicians to diagnose and to distinguish for a variety of reasons. The correct classification
of these two diseases is a crucial issue for clinicians who treat autoimmune rheumatic diseases. In prior
research, medical risk factors represented by instrument or laboratory measures and physician judgments (12
key features for MCTD and 12 key features for SLE) were parameterized with a one parameter logistic function
in a Rasch model. Those results identified separate diagnostic dimensions for MCTD and SLE. This procedure
was replicated in the present research with a sample of largely African American and Hispanic patients. Results
verified separate dimensions for MCTD and SLE, which suggests MCTD is a separate disease from SLE.
****
Vol. 13, No. 3 Fall 2012
The Development and Validation of the Core Competencies Scale (CCS) for the College and University Students
Bin Ruan, Magdalena Mo Ching Mok, Christopher R. Edginton, and Ming Kai Chin
Abstract
This article describes the development and validation of the Core Competencies Scale (CCS) using Bok’s (2006)
competency framework for undergraduate education. The framework included: communication, critical thinking,
character development, citizenship, diversity, global understanding, widening of interest, and career and
vocational development. The sample comprised 70 college and university students. Results of analysis using
Rasch rating scale modelling showed that there was strong empirical evidence on the validity of the measures
in contents, structure, interpretation, generalizability, and response options of the CCS scale. The implication
of having developed Rasch-based valid and dependable measures in this study for gauging the value added of
college and university education to their students is that the feedback generated from CCS will enable evidence-based
decision and policy making to be implemented and strategized. Further, program effectiveness can be
measured and thus accountability on the achievement of the program objectives.
****
Fixed or Random Testlet Effects: A Comparison of Two Multilevel Testlet Models
Tzu-An Chen
Abstract
This simulation study compared the performance of two multilevel measurement testlet (MMMT) models:
Beretvas and Walker’s (2012) two-level MMMT model and Jiao, Wang, and Kamata’s (2005) three-level model.
Several conditions were manipulated (including testlet length, sample size, and the pattern of the testlet effects)
to assess the impact on estimation of fixed and random effect parameters.
The results of the present simulation study showed that MMMT-2r yielded the best parameter bias in estimation
on fixed item effects, fixed testlet effects, and random testlet effects for conditions with nonzero equal
pattern of random testlet effects’ variance even when the MMMT-2r was not the generating model. However,
random effects estimation did not perform well when unequal random testlet effects’ variances were generated.
Fit indices did not perform well either, as other studies have found. In addition, from the modeling perspective,
MMMT-2r allows the greatest flexibility in terms of modeling testlet effects as fixed, random, or both.
****
A Study of Rasch, Partial Credit, and Rating Scale Model Parameter Recovery in WINSTEPS and jMetrik
J. Patrick Meyer and Emily Hailey
Abstract
jMetrik and WINSTEPS are two Rasch measurement software applications that implement joint maximum
likelihood estimation of Rasch, partial credit, and rating scale model parameters via a proportional curve fitting
algorithm. We describe this algorithm in this paper and explain the handling of missing data and extreme cases.
Results from a simulation study that manipulated sample size and the number of test items indicate that both
programs produce similar bias and root mean squared error values. In addition, root mean squared difference
values indicate that estimates from each program are within 0.001 and 0.004 logits of each other depending on
the model in question.
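
The three summary statistics named above can be sketched as follows, assuming the usual definitions: bias is the mean signed error against the generating values, RMSE is the root mean squared error against them, and the root mean squared difference compares two programs' estimates of the same parameters. The function names are illustrative and not taken from either program.

```python
import math

def bias(estimates, truth):
    """Mean signed error of estimates against generating values."""
    return sum(e - t for e, t in zip(estimates, truth)) / len(truth)

def rmse(estimates, truth):
    """Root mean squared error of estimates against generating values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))

def rmsd(estimates_a, estimates_b):
    """Root mean squared difference between two programs' estimates."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(estimates_a, estimates_b)) / len(estimates_a)
    )
```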
****
Cognitive Assessment in Mathematics with the Least Squares Distance Method
Lin Ma, Emre Çetin, and Kathy E. Green
Abstract
This study investigated the validation of comprehensive cognitive attributes of an eighth-grade mathematics test
using the least squares distance method and compared performance on attributes by gender and region. A sample
of 5,000 students was randomly selected from the data of the 2005 Turkish national mathematics assessment of
eighth-grade students. Twenty-five math items were assessed for presence or absence of 20 cognitive attributes
(content, cognitive processes, and skill). Four attributes were found to be misspecified or nonpredictive. However,
results demonstrated the validity of cognitive attributes in terms of the revised set of 17 attributes. The girls
performed similarly to the boys on the attributes. The students from the two eastern regions significantly
underperformed on most attributes.
****
Using Rasch Measurement to Validate the Big Five Factor Marker Questionnaire for a Japanese University Population
Matthew T. Apple and Peter Neff
Abstract
In recent years, psychological studies have increasingly come to support the so-called “Big Five” or “Five-factor
Model” (FFM) of human personality. However, the vast majority of research in this field has been undertaken
in Western contexts, thus raising the question of how applicable the Big Five is to Asian populations. Moreover,
nearly all research into the Big Five has relied on traditional techniques of statistical analysis (e.g., factor analysis,
correlation) to validate their results, despite the limitations of such methods. This study examined instrument
validation of a widely-used Big Five instrument (the Factor Markers questionnaire) given to a Japanese population
(n = 283) by using the Rasch rating scale model (Andrich, 1978). Rasch principal components analysis
of the item residuals indicated the possible existence of additional factors within the Intellect/Imagination and
Agreeableness factors, as well as additional item fit problems within each hypothesized construct.
****
An Applied Examination of Different Weighting Methods for the Root
Expected Mean Square Difference and Root Mean Square Difference Indices
Anne Corinne Huggins
Abstract
It is possible that functions used to link tests are sensitive to subpopulations of test takers. The REMSD and
RMSD(x) are weighted effect sizes of linking invariance, yet it is often unclear how the weights are most appropriately
applied when subpopulation group sizes are heterogeneous. The objective of this research is to apply
two different weighting methods to the REMSD and RMSD(x) functions while testing for population invariance
in a linkage across subpopulations of disparate sample sizes, and to subsequently compare the results across
these differentially weighted effect sizes. The findings demonstrate that utilizing proportional weights in the
REMSD and RMSD(x) indices can underestimate differences in linking functions for small subpopulations.
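
The effect of the weighting choice can be illustrated with a small sketch. It assumes the commonly used form of the index, RMSD(x) = sqrt(sum_j w_j [e_j(x) - e(x)]^2) / sd_Y, where e_j(x) is the subgroup linking function, e(x) the overall linking function, and sd_Y the standard deviation on the target scale. The abstract does not state the formula, and all numbers below are hypothetical.

```python
import math

def rmsd_x(subgroup_links, overall_link, weights, sd_y):
    """Weighted root mean square difference at one score point x."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    squared = sum(
        w * (e_j - overall_link) ** 2 for w, e_j in zip(weights, subgroup_links)
    )
    return math.sqrt(squared) / sd_y

# Hypothetical example: one large subgroup and one small, divergent subgroup
sizes = [900, 100]
linked = [50.2, 53.0]   # linked score at x within each subgroup
overall = 50.5          # linked score at x in the total group
sd_y = 10.0

proportional = [n / sum(sizes) for n in sizes]
equal = [1.0 / len(sizes)] * len(sizes)

rmsd_proportional = rmsd_x(linked, overall, proportional, sd_y)
rmsd_equal = rmsd_x(linked, overall, equal, sd_y)
```

In this made-up case the proportional weights give the small subgroup only a tenth of the weight, so its larger departure from the overall linking function contributes less to the index than under equal weights.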
****
Development and Validation of a Questionnaire to Evaluate Attitudes toward Family Medicine
Francisco Escobar Rabadán, Jesús López-Torres Hidalgo,
Julio Montoya Fernández, Juan M. Téllez Lapeira,
Mª Aranzazu Romero Cebrián, and Juan M. Armero Simarro
Abstract
A questionnaire that evaluates medical students’ knowledge of and attitudes towards primary care and family
medicine was developed and validated in order to analyze changes during medical training. A 34-item questionnaire
with 5 options on a Likert-type scale was designed. Based on this, a 21-item version was developed
and validated. Internal consistency was estimated using Cronbach’s alpha, and the questionnaire was
analyzed with the Rasch model. One hundred fifty-nine students responded to the brief questionnaire (95.78% of
those enrolled). Cronbach’s alpha was 0.72. Rasch analysis showed good fit of the items to the model,
with an average reliability of 0.97 for the item estimates and 0.68 for the person estimates. Our questionnaire has
shown acceptable internal consistency and good construct validity.
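
For readers who want to reproduce an internal consistency estimate of this kind, Cronbach's alpha can be sketched from a persons-by-items score matrix as k/(k-1) * (1 - sum of item variances / variance of total scores). The response data below are hypothetical and complete (no missing values); they are not the study's data.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical 5-point responses from four respondents on three items
responses = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [3, 2, 3],
]
alpha = cronbach_alpha(responses)
```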
****
Vol. 13, No. 4 Winter 2012
Conditional Pairwise Person Parameter
Estimates in Rasch Models
Svend Kreiner
Abstract
Conditional pairwise estimates of parameters in Rasch models separate inference on item parameters from inference
on person parameters. Pairwise item parameter estimates are consistent when sample size approaches infinity,
but add an extra random error to estimation compared to conditional maximum likelihood estimates. Pairwise
estimates of person parameters are easily calculated, but can rarely be assumed to be consistent since the number
of items is often small and the properties of the estimates generally unknown. This note gives results from a
study of conditional pairwise estimation of person parameters and suggests a modification of the estimate that
takes care of some of the error.
****
Examining Rating Quality in Writing
Assessment: Rater Agreement, Error, and Accuracy
Stefanie A. Wind and George Engelhard, Jr.
Abstract
The use of performance assessments in which human raters evaluate student achievement has become increasingly
prevalent in high-stakes assessment systems such as those associated with recent policy initiatives (e.g.,
Race to the Top). In this study, indices of rating quality are compared between two measurement perspectives.
Within the context of a large-scale writing assessment, this study focuses on the alignment between indices of
rater agreement, error, and accuracy based on traditional and Rasch measurement theory perspectives. Major
empirical findings suggest that Rasch-based indices of model-data fit for ratings provide information about
raters that is comparable to direct measures of accuracy. The use of easily obtained approximations of direct
accuracy measures holds significant implications for monitoring rating quality in large-scale rater-mediated
performance assessments.
****
Beliefs about Language Development:
Construct Validity Evidence
Mavis L. Donahue, Qiong Fu, and Everett V. Smith, Jr.
Abstract
Understanding language development is incomplete without recognizing children’s sociocultural environments,
including adult beliefs about language development. Yet there is a need for data supporting valid inferences to
assess these beliefs. The current study investigated the psychometric properties of data from a survey (MODeL)
designed to explore beliefs in the popular culture, and their alignment with more formal theories. Support for
the content, substantive, structural, generalizability, and external aspects of construct validity of the data was
investigated. Subscales representing Behaviorist, Cognitive, Nativist, and Sociolinguistic models were identified
as dimensions of beliefs. More than half of the items showed a high degree of consensus, suggesting culturally transmitted
beliefs. Behaviorist ideas were most popular. Bilingualism and ethnicity were related to Cognitive
and Sociolinguistic beliefs. Identifying these beliefs may clarify the nature of child-directed speech, and enable
the design of language intervention programs that are congruent with family and cultural expectations.
****
Concurrent Validation of CHIRP, a New
Instrument for Measuring Healthcare Student Attitudes towards Interdisciplinary Teamwork
David Hollar, Cherri Hobgood, Beverly Foster, Marco Aleman, and Susan Sawning
Abstract
Positive attitudes towards teamwork among health care professionals are critical to patient safety. The purpose
of this study is to describe the development and concurrent validation of a new instrument to measure attitudes
towards healthcare teamwork that is generalizable across various populations of healthcare students. The Collaborative
Healthcare Interdisciplinary Relationship Planning (CHIRP) scale was validated against the Readiness for Inter-Professional
Learning Scale (RIPLS). Analyses included student (n = 266) demographics, ANOVA, internal
consistency, factor analysis, and Rasch analysis. The two instruments correlated at r = .582 (p < .01). The CHIRP
showed a multifactorial structure having excellent internal consistency (alpha = .850), with 25 of the 36 scale
items loading onto a single Teamwork Attitudes factor. The RIPLS likewise had strong internal consistency (alpha
= .796) and a three-factor structure, supporting previous studies of the instrument. However, Rasch analyses
showed 14 (38.9%) of the 36 CHIRP items, but only four (21.1%) of the 19 RIPLS items remaining within the
satisfactory standardized OUTFIT zone of ± 2.0 standard deviation units. We propose the 14 fitting items as a
new, validated teamwork attitudes scale.
****
Using Extended Rasch Models to Assess
Validity of Diagnostic Tests in the Presence of a Reference Standard
Vivian Viallon, Emmanuel Ecosse, Mounir Mesbah, Jacques Pouchot, and Joel Coste
Abstract
We show that extended Rasch models, built to deal with continuous latent variables, may be useful for assessing
validity of medical diagnostic tests in the presence of a reference standard, particularly for chronic diseases.
We derive estimates for sensitivity and specificity under the Rasch model assumptions. Our estimations can be
computed conditionally on the level of potential confounding covariates, making use of a variety of extended
Rasch models, namely log-linear Rasch models, to examine the association between covariates and both disease
intensity and response to the tests. Also, another variety of extended Rasch models—partial credit models—can
determine appropriate thresholds for quantitative diagnostic tests. As an example, we study the validity of some
diagnostic tests of heart failure.
****
Measuring Work Stress among Correctional
Staff: A Rasch Measurement Approach
George E. Higgins, Richard Tewksbury, and Andrew Denney
Abstract
Today, the amount of stress the correctional staff endures at work is an important issue. Research has addressed
this issue, but has yielded no consensus as to a properly calibrated measure of perceptions of work stress for
correctional staff. Using data from a non-random sample of correctional staff (n = 228), the Rasch model was
used to assess whether a specific measure of work stress would fit the model. Results show that three items
rather than six items accurately represented correctional staff perceptions of work stress.
****
A Rasch Measure of Teachers’ Views of
Teacher-Student Relationships in the Primary School
Natalie Leitão and Russell F. Waugh
Abstract
This study investigated teacher-student relationships from the teachers’ point of view at Perth metropolitan
schools in Western Australia. The study identified three key social and emotional aspects that affect teacher-student
relationships, namely, Connectedness, Availability and Communication. Data were collected by questionnaire
(N = 139) with stem-items answered in three perspectives: (1) Idealistic: this is what I would like to
happen; (2) Capability: this is what I am capable of; and (3) Behaviour: this is what actually happens, using
four ordered response categories: not at all (score 1), some of the time (score 2), most of the time (score 3), and
almost always (score 4). Data were analysed with a Rasch measurement model and a uni-dimensional, linear
scale with 24 items, ordered from easy to hard, was created. The data were shown to be highly reliable, so that
valid inferences could be made from the scale. The Person Separation Index (akin to a reliability index) was
0.93; there was good global teacher and item fit to the measurement model; there was good item fit; the targeting
of the item difficulties against the teacher measures was good, and the response categories were answered
consistently and logically. Teachers said that the ideal items were all easier than their corresponding capability
items which were in turn easier than the behaviour items (where the items fitted the model), as conceptualised.
The easiest ideal items were ‘I like this child’ and ‘this child and I get along well together.’ The hardest ideal
item (but still easy) was ‘I am available for this child.’ The easiest behaviour item (but still hard) was ‘This child
and I get along well together.’ The hardest behaviour item (and very hard) was ‘I am interested to learn about
this child’s personal thoughts, feelings and experiences.’ The difficulties of the items supported the conceptual
structure of the variable.
****