Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 10, 2009 Article Abstracts
Vol. 10, No. 1 Spring 2009
****
Mapping Multiple Dimensions of Student Learning: The
ConstructMap Program
Cathleen A. Kennedy and Karen Draney
Abstract
In the past, many assessments, especially standardized
assessments, tended to be composed of items with specific right and wrong
answers, such as those found in multiple-choice, true-false, and short-response
items. Performance-based questions that require students to construct answers
rather than select correct responses introduce the complexities of multiple
correct answers, dependence on teacher judgment for scoring, and requisite
ancillary skills such as language fluency, which are technically difficult to
handle, and may even introduce problems such as bias against certain groups of
students. Recent developments in assessment design and psychometrics have
improved the feasibility of assessing performance-based tasks more efficiently
and effectively, thereby providing a rich domain of information from which
interpretations can be made about what students know and what they can do when
they draw upon that knowledge. We developed the ConstructMap computer program
specifically to assist teachers in interpreting and representing this type of
performance data. The program accepts as input student scores on items
associated with one or multiple performance variables, computes proficiencies
using multidimensional item response methods, and produces graphical
representations of students’ estimated proficiency on each of the variables.
****
Response Dependence and the Measurement of Change
Ida Marais
Abstract
Because of confounding effects that can mask change when
persons respond to the same items on more than one occasion, the measurement of
change is a challenge. The specific effect on change studied in this paper is
that observed when persons' responses to items at time 2 are statistically dependent
on their responses at time 1. In addition, because this response
dependence may affect the change differently for different locations of items
relative to persons at time 1, the initial targeting of persons to items was
studied. For a specific change in means of persons, dichotomous data were
simulated according to the Rasch model with varying degrees of dependence and
varying initial targeting of persons to items. Data were analysed, also using
the Rasch model, in two ways: firstly, by treating items used at time 1 and time
2 as distinct ones (rack analysis) and, secondly, by treating persons at time 1
and time 2 as distinct ones (stack analysis). With the rack analysis the change
is revealed through the item parameters and with the stack analysis the change
is revealed through the person parameters. With no response dependence the two
analyses gave equivalent and correct measures of change. With increasing
dependence change was increasingly masked or increasingly amplified, depending
on the targeting of items to persons at time 1. Response dependence affected the
measurement of change in both analyses, but not always in the same way. The
paper serves as a warning against undetected dependence and also considers
evidence that can be used in the analysis of real data sets for detecting the
presence of dependence when measuring change.
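A minimal sketch of the kind of simulation the abstract describes, assuming one common way of inducing response dependence (shifting an item's time-2 difficulty toward the person's time-1 response by d logits; the paper's exact design may differ). The `rack` and `stack` matrices correspond to the two analysis layouts; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def sim_change_with_dependence(n_persons=500, n_items=20, change=0.5, d=0.0):
    """Simulate dichotomous Rasch responses at two time points.
    `d` (logits) is the magnitude of response dependence: at time 2 an item
    is made easier by d for a person who scored 1 on it at time 1 and harder
    by d for a person who scored 0 (one common formulation)."""
    theta1 = rng.normal(0.0, 1.0, n_persons)       # persons at time 1
    theta2 = theta1 + change                        # true change in means
    b = np.linspace(-2.0, 2.0, n_items)             # item difficulties (targeting)

    p1 = 1.0 / (1.0 + np.exp(-(theta1[:, None] - b[None, :])))
    x1 = (rng.random((n_persons, n_items)) < p1).astype(int)

    # dependence: shift the time-2 difficulty toward the time-1 response
    b2 = b[None, :] - d * (2 * x1 - 1)
    p2 = 1.0 / (1.0 + np.exp(-(theta2[:, None] - b2)))
    x2 = (rng.random((n_persons, n_items)) < p2).astype(int)

    rack = np.hstack([x1, x2])    # items at the two times treated as distinct
    stack = np.vstack([x1, x2])   # persons at the two times treated as distinct
    return rack, stack

rack, stack = sim_change_with_dependence(change=0.5, d=0.3)
```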
****
Using Paired Comparison Matrices to Estimate Parameters
of the Partial Credit Rasch Measurement Model for Rater-Mediated Assessments
Mary Garner and George Engelhard, Jr.
Abstract
The purpose of this paper is to describe a technique for
estimating the parameters of a Rasch model that accommodates ordered categories
and rater severity. The technique builds on the conditional pairwise algorithm
described by Choppin (1968, 1985) and represents an extension of a conditional
algorithm described by Garner and Engelhard (2000, 2002) in which parameters
appear as the eigenvector of a matrix derived from paired comparisons. The
algorithm is used successfully to recover parameters from a simulated data set.
No one has previously described such an extension of the pairwise algorithm to a
Rasch model that includes both ordered categories and rater effects. The paired
comparisons technique has importance for several reasons: it relies on the
separability of parameters that is true only for the Rasch measurement model; it
works in the presence of missing data; it makes transparent the connectivity
needed for parameter estimation; and it is very simple. The technique also
shares the mathematical framework of a very popular technique in the social
sciences called the Analytic Hierarchy Process (Saaty, 1996).
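A rough sketch of the pairwise idea for the plain dichotomous Rasch case (not the partial credit and rater extension developed in the paper): item difficulties are recovered from the principal eigenvector of a matrix of pairwise "wins," in the spirit of Choppin's algorithm and the AHP eigenvector method. Function and variable names are illustrative.

```python
import numpy as np

def pairwise_difficulties(X):
    """Choppin-style pairwise estimation sketch for dichotomous Rasch items.
    X: persons x items matrix of 0/1 responses; np.nan marks missing data.
    Returns centred item difficulties from the principal eigenvector of a
    matrix of pairwise 'wins'."""
    X = np.asarray(X, dtype=float)
    n_items = X.shape[1]
    wins = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            # persons who got item i right and item j wrong
            wins[i, j] = np.sum((X[both, i] == 1) & (X[both, j] == 0))
    ratio = (wins + 0.5) / (wins.T + 0.5)      # smoothed paired-comparison matrix
    eigvals, eigvecs = np.linalg.eig(ratio)
    v = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
    easiness = np.log(v)                        # eigenvector ~ item easiness
    return -(easiness - easiness.mean())        # difficulties, centred at 0
```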
****
Toward a Domain Theory in English as a Second Language
Diane Strong-Krause
Abstract
This paper demonstrates how domain theory development is
enhanced by using both theoretical and empirical data. The study explored
the domain of speaking English as a second language (ESL), comparing hypothetical
data on speaking tasks provided by an experienced teacher and by a certified
ACTFL oral proficiency interview rater with observed data from scores on a
computer-delivered speaking exam. While the hypothetical data and observed data
showed similar patterns in task difficulty in general, some tasks were
identified as being much easier or harder than expected. These differences raise
questions not only about test task design but also about the theoretical
underpinnings of the domain. The results of the study suggest that this
approach, where theory and data are examined together, will improve test design
as well as benefit domain theory development.
****
Comparison of Single- and Double-Assessor Scoring Designs for the Assessment of Accomplished Teaching
George Engelhard, Jr. and Carol M. Myford
Abstract
This article is based on a more extensive research
report (Engelhard, Myford and Cline, 2000) prepared for the National Board for
Professional Teaching Standards (NBPTS) concerning the Early
Childhood/Generalist and Middle Childhood/Generalist assessment systems. The
report is available from the Educational Testing Service (ETS). An earlier
version of the article was presented at the American Educational Research
Association Conference in New Orleans in 2000. We would like to acknowledge the
helpful advice of Mike Linacre regarding the use of the FACETS computer program
and the assistance of Fred Cline in analyzing these data. The material contained
in this article is based on work supported by the NBPTS. Any opinions, findings,
conclusions, and recommendations expressed herein are those of the authors and
do not necessarily reflect the views of the NBPTS, Emory University, ETS, or the
University of Illinois at Chicago.
****
A Rasch Model Prototype for Assessing Vocabulary Learning Resulting from Different Instructional Methods: A Preschool Example
Cynthia B. Leung and William Steve Lang
Abstract
This study explored the effects of using Rasch modeling
to analyze data on vocabulary knowledge of preschoolers who participated in
repeated read-aloud events and hands-on science activities with their classroom
teachers. A Rasch prototype for literacy research was developed and applied to
the preschool data. Thirty-one target words were selected for analysis from
three children’s informational picture books on light and color. After different
instructional activities, each child received scores on individual target words
measured with a total of six assessments, including free response vocabulary
tests and expressive and receptive picture vocabulary tests. Rasch modeling was
used to assess the learning difficulty of target words in different
instructional settings. Suggestions are made for applying Rasch modeling to
classroom studies of instructional interventions.
****
An Empirical Study on the Relationship between Teacher’s
Judgments and Fit Statistics of the Partial Credit Model
Sun-Geun Baek and Hye-Sook Kim
Abstract
The main purpose of the study was to investigate
empirically the relationship between classroom teachers' judgments and the item
and person fit statistics of the partial credit model. In this study, classroom
teachers' judgments were made intuitively, by checking each item's consistency with
the general response pattern and each student's need for additional treatment or
advice. The item and person fit statistics of the partial credit model were
estimated using the WINSTEPS program (Linacre, 2003). The subjects of this study
were 321 sixth grade students in 9 classrooms within 3 elementary schools in
Seoul, Korea. For this research, a performance assessment test for 6th grade
mathematics was developed. It consisted of 20 polytomous response items and its
total scores ranged between 0 and 50. In addition, the 9 classroom teachers made
their judgments for each item of the test and for each student in their own
classroom. They judged intuitively using 4 categories: (1) well fit, (2) fit,
(3) misfit, and (4) badly misfit, for each item as well as each student. Their
judgments were scored from 1 to 4 for each item as well as each student. There
are two significant findings in this study. First, there is a statistically
significant relationship between the classroom teachers' judgments and the item fit
statistic for each item (the median correlation coefficient between the
teachers' judgments and the item outfit ZSTD is 0.61). Second, there is a
statistically significant relationship between the teachers' judgments and the
person fit statistic for each student (the median correlation coefficient
between the teachers' judgments and the person outfit ZSTD is 0.52). In
conclusion, the item and person fit statistics of the partial credit model
correspond with the teachers' judgments for each test item and each student.
****
Understanding Rasch Measurement: Tools for Measuring
Academic Growth
G. Gage Kingsbury, Martha McCall, and Carl
Hauser
Abstract
Growth measurement and growth modeling have gained
substantial interest in the last few years with the development of new
statistical procedures and policy decisions such as the incorporation of growth
into No Child Left Behind. The current study investigates the following four
aspects of growth measurement:
• Issues in the development of vertical scales to measure growth
• Design of instruments to measure academic growth
• Techniques for modeling individual student growth, and
• Uses of growth information in a classroom
Measuring growth has always been a daunting task, but the development
of measurement tools such as the Rasch model and computerized adaptive testing
positions us well to obtain high-quality data with which to measure and model the
growth of an individual student across a course of study. This growth
information, in norm-referenced and standards-referenced form, should enhance
educators' ability to enrich student learning.
Vol. 10, No. 2 Summer 2009
****
The Relationships Among Design Experiments,
Invariant Measurement Scales, and Domain Theories
C. Victor Bunderson and Van A. Newby
Abstract
In this paper we discuss principled design experiments, a rigorous, experimentally oriented form of design-based
research. We show the dependence of design experiments on invariant measurement scales. We discuss
four kinds of invariance culminating in interpretive invariance, and how this in turn depends on increasingly
adequate theories of a domain. These theories give an account of the dimensions and ordered attainments on a set
of dimensions that span a domain appropriately. This account may be called a domain theory or learning theory
of progressive attainments (in a local domain). We show the direct and the broader benefits of developing and
using these descriptive theories of a domain to guide prescriptive design approaches to research.
In the process of giving an account of this set of interdependencies, we discuss aspects of the design method
we are using, called Validity-Centered Design. This design framework guides the development of instruments
based on domain theories, the development of learning opportunities (also based on domain theories), and the
construction of a sound validity argument for systems that integrate learning with assessment.
****
Considerations About Expected a
Posteriori Estimation in Adaptive Testing: Adaptive a Priori, Adaptive Correction for Bias, and Adaptive
Integration Interval
Gilles Raîche and Jean-Guy Blais
Abstract
In a computerized adaptive test, we would like to obtain an acceptable precision of the proficiency level estimate
using an optimal number of items. Unfortunately, decreasing the number of items is accompanied by a certain
degree of bias when the true proficiency level differs significantly from the a priori estimate. The authors suggest
that it is possible to reduce the bias, and even the standard error of the estimate, by applying to each provisional
estimate one or a combination of the following strategies: the adaptive correction for bias proposed by Bock and
Mislevy (1982), an adaptive a priori estimate, and an adaptive integration interval.
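A minimal sketch of an expected a posteriori (EAP) estimate for dichotomous Rasch items, computed by quadrature over a fixed integration interval with a fixed normal prior; the adaptive a priori, adaptive bias correction, and adaptive interval strategies the authors propose would modify these ingredients at each provisional step. Function and parameter names are illustrative.

```python
import numpy as np

def eap_estimate(responses, difficulties, prior_mean=0.0, prior_sd=1.0,
                 lo=-4.0, hi=4.0, n_points=61):
    """EAP proficiency estimate for dichotomous Rasch items, by quadrature
    over the integration interval [lo, hi] with a normal prior."""
    theta = np.linspace(lo, hi, n_points)
    prior = np.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
    b = np.asarray(difficulties, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    x = np.asarray(responses)
    likelihood = np.prod(np.where(x[None, :] == 1, p, 1.0 - p), axis=1)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    eap = np.sum(theta * posterior)                        # posterior mean
    psd = np.sqrt(np.sum((theta - eap) ** 2 * posterior))  # posterior SD
    return eap, psd

# e.g. three provisional responses on items of increasing difficulty
estimate, error = eap_estimate([1, 1, 0], [-1.0, 0.0, 1.0])
```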
****
Local Independence and Residual Covariance:
A Study of Olympic Figure Skating Ratings
John M. Linacre
Abstract
Rasch fit analysis has focused on tests of global fit and tests of the fit of individual parameter estimates. Critics
have noted that slight, but pervasive, patterns of misfit to a Rasch model within the data may escape detection
using these approaches. These patterns contradict the Rasch axiom of local independence, and so degrade
measurement and may bias measures. Misfit to a Rasch model is captured in the observation residuals. Traces
of pervasive, but faint, secondary dimensions within the observations may be identified using factor analytic
techniques. To illustrate these techniques, the ratings awarded during the Pairs Figure Skating competition
at the 2002 Winter Olympic Games are examined. The intention is to detect analytically the patterns of rater
bias admitted publicly after the event. It is seen that one-parameter-at-a-time fit statistics and differential
item functioning approaches fail to detect the crucial misfit patterns, while factor analytic methods do. In fact, the
competition was held in two stages, and factor analytic techniques already detect the rater bias after the first stage.
This suggests that remedial rater retraining or other rater-related actions could be taken before the final ratings
are collected.
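A minimal sketch of the residual-based approach, assuming dichotomous Rasch data for brevity (the Olympic ratings in the paper are polytomous and rater-mediated): standardized residuals are computed from the model expectations, and the first principal component of their correlations is inspected for a pervasive secondary pattern.

```python
import numpy as np

def residual_pca(X, theta, b):
    """Standardized residuals for dichotomous Rasch data and the loadings of
    the first principal component of their correlations, i.e. a trace of a
    possible secondary dimension.  X: persons x items 0/1 matrix; theta, b:
    person and item estimates."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    z = (X - p) / np.sqrt(p * (1.0 - p))     # standardized residuals
    r = np.corrcoef(z, rowvar=False)          # inter-item residual correlations
    eigvals, eigvecs = np.linalg.eigh(r)
    loadings = eigvecs[:, -1]                 # first (largest) component
    strength = eigvals[-1]                    # its eigenvalue
    return loadings, strength
```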
****
Constructing One Scale to Describe
Two Statewide Exams
Insu Paek, Deborah G. Peres, and Mark Wilson
Abstract
This study applies two approaches in creating a single scale from two separate statewide exams
(Golden State Math Exam and California Standard Math Test) and compares some aspects of the two statewide tests. The
first analysis involves a sequence of unidimensional Rasch scalings, using anchored items to scale the two tests
together. The second analysis employs a 2-dimensional Rasch scaling, using the previous unidimensional analysis
results to link the scales. The linking facilitates the investigation of the measurement properties of the two
exams and is a basis for combining items from both exams to develop a more efficient testing program. The
results of the comparisons of the two statewide exams based on the linking are shown and discussed.
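For illustration only, a very simple common-item (anchor) linking step that places one exam's Rasch item difficulties on the other's scale by a mean shift; the study's anchored calibrations and two-dimensional scaling are more elaborate than this sketch, and all names here are hypothetical.

```python
import numpy as np

def mean_shift_link(b_exam_a, b_exam_b, anchors_a, anchors_b):
    """Shift exam B's Rasch item difficulties onto exam A's scale by the mean
    difference on the common (anchor) items.  anchors_a / anchors_b are the
    index positions of the shared items in each exam's calibration."""
    b_a = np.asarray(b_exam_a, dtype=float)
    b_b = np.asarray(b_exam_b, dtype=float)
    shift = np.mean(b_a[anchors_a] - b_b[anchors_b])
    return b_b + shift
```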
****
Multidimensional Models in a Developmental Context
Yiyu Xie and Theo L. Dawson
Abstract
The concept of epistemological development is useful in psychological assessment only insofar as instruments
can be designed to measure it consistently, reliably, and without bias. In the psychosocial domain, most traditional
stage assessment systems rely on a process of matching concepts in a scoring manual generated from a limited
number of construction cases, and thus suffer to some extent from bias introduced by an over-dependence on
particular content. On the other hand, Commons’ Hierarchical Complexity Scoring System (HCSS) is an assessment
that employs criteria for assessing the hierarchical complexity of texts that are independent of specific
content. This paper examines whether the HCSS and one of the conventional systems, Kohlberg's Standard Issue
Scoring System (SISS), measure the same dimension of performance. A multidimensional partial credit analysis
was performed on data collected between 1955 and 1999. The correlation between performance estimates on the
SISS and HCSS is 0.92. The high correlation provides strong evidence that the order of hierarchical complexity
identified by the HCSS is the same latent dimension of ability assessed with the SISS. The HCSS produced more
distinct patterns of ordered stages and wider gaps between adjacent stages. This evidence implies that individual
performances display a higher degree of consistency in their hierarchical complexity under the HCSS. A developmental
scoring system that employs scoring criteria that are independent of particular content might be more
powerful than the traditional scoring systems, as it simplifies scoring and also opens possibilities for cross-cultural,
cross-gender, and cross-context comparison of conceptual knowledge within developmental levels.
****
An Application of the Multidimensional
Random Coefficients Multinomial Logit Model to Evaluating Cognitive Models of Reasoning in Genetics
Edward W. Wolfe, Daniel T. Hickey, and Ann C.H. Kindfield
Abstract
This article summarizes multidimensional Rasch analyses in which several alternative models of genetics
reasoning are evaluated based on item response data from secondary students who participated in a genetics
reasoning curriculum. The various depictions of genetics reasoning are compared by fitting several models to
the item response data and comparing data-to-model fit at the model level between hierarchically nested models.
We conclude that two two-dimensional models provide a substantively better depiction of student performance
than does a unidimensional model or more complex three- and four-dimensional models.
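One standard way such hierarchically nested comparisons are made (stated here schematically; it is not necessarily the only index the authors use) is a likelihood-ratio test on the difference in deviances:

```latex
G^2 \;=\; D_{\text{restricted}} - D_{\text{general}}
     \;=\; -2\left(\ln L_{\text{restricted}} - \ln L_{\text{general}}\right)
     \;\sim\; \chi^2_{\Delta q},
```

where Δq is the difference in the number of estimated parameters between the two nested models.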
****
Understanding Rasch
Measurement: The ISR: Intelligent Student Reports
Ronald Mead
Abstract
Rasch-based Scale Scores are a simple linear transformation of the basic logit metric. Scale Scores are the
quantification of the measurement continuum. This quantification makes it possible to do arithmetic, compute
differences, and apply standard statistical techniques. However, qualitative meaning is not in the numbers and
must come from experience with the scale and from the descriptive information that can (and should) be attached.
This includes item content and exemplars, normative information for relevant groups, historical data
for the individual, and evaluative assessments like performance level standards. The Scale Score metric is the
structure that manages the organization of intelligent reports and recognizes anomalies.
Scale Scores have no meaning, per se, but can provide a strong framework for organizing useful reports and
presenting meaningful information. They facilitate diagnosis by “Analysis of Fit” and by “Analysis of Misfit.”
The Analysis of Fit relies on the general definition of the construct to describe what a student at a particular point
on the scale can and cannot do. It is meaningful to the extent that the student conforms to the expectations of
the measurement model. The Analysis of Misfit uses the model to identify surprises, i.e., departures from the
model expectations. It highlights atypical areas of strong and weak performance. The intent is to bring these
exceptions to the attention of the experts for informed, substantive interpretation and diagnosis.
Intelligent reports, to be useful, and to justify the time and expense of testing, need to provide more information
in a useable format than the candidate, student, parent, or educator had available otherwise. This requires
more than reporting a single number or a single decision. It should include sufficient scaffolding to allow the
consumer to extract quickly and efficiently all the useful information that can be taken from the test. Rasch
Scale Scores are an important, perhaps essential, tool in this process.
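As a concrete, purely hypothetical example of the linear transformation from logits to Scale Scores (the constants here are illustrative, not an operational scale):

```latex
S = A\,\theta + B, \qquad \text{e.g. } A = 50,\; B = 500:\quad
\theta = 1.3 \text{ logits} \;\Rightarrow\; S = 50(1.3) + 500 = 565 .
```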
Vol. 10, No. 3 Fall 2009
****
Using Classical and Modern Measurement Theories to Explore Rater,
Domain, and Gender Influences on Student Writing Ability
Ismail S. Gyagenda and George Engelhard, Jr.
Abstract
This study i) examined the rater, domain, and gender influences on the assessed quality of students' writing
ability and ii) described and compared different approaches for examining these influences based on classical
and modern measurement theories.
Twenty raters were randomly selected from a group of 87 trained raters contracted to rate essays of the annual
Georgia High School Writing Test. Each rater scored the entire set of 375 essays on a 1-4 rating scale (366
essays were used in the analyses because nine cases had missing values and were dropped).
Two approaches, the classical approach and the item response theory-based Rasch model, were used to obtain
psychometric measures of reliability and inter-rater reliability and to conduct statistical analyses with rater and gender
as the predictor variables and the total and domain scores as the dependent variables. To achieve the second
purpose, the Classical Test Model and the Rasch model were compared and contrasted and their strengths and
limitations discussed as they relate to student writing assessment.
Analyses from both approaches indicated statistically significant rater and gender effects on student writing.
Using domain scores as the dependent variables, there was a statistically significant rater by gender interaction
effect at the multivariate level, but not at the univariate level. The Rasch analysis indicated a statistically
significant rater by gender effect. The comparison between the two approaches highlighted their strengths and
limitations, their different measurement and statistical models, and their different procedures.
****
The Efficacy of Link Items in the Construction of a Numeracy
Achievement Scale—from Kindergarten to Year 6
Juho Looveer and Joanne Mulligan
Abstract
A large-scale numeracy research project was commissioned by the Australian Government, involving 4732 Australian
students from 91 NSW primary schools. Rasch analysis was applied in the construction of a Numeracy
Achievement Scale (NAS) in order to measure numeracy growth. Following trialling, a pool of 244 items was
developed to assess number, space and measurement concepts. Link items were included in test forms within
year levels and across adjacent year levels to enable linking of the forms and the construction of a scale spanning
Kindergarten to Year 6 (5 to 13 years of age). However, results from the scaling were not consistent with
expectations of increases in student abilities or item difficulties across year levels. Differential item functioning
analysis identified the problematic role of link items across year levels. After a different set of items was used for linking
test forms, the results were consistent with expectations. A key finding was that items used to link forms
must not exhibit differential item functioning across those levels.
****
The Study Skills Self-Efficacy Scale
for Use with Chinese Students
Mantak Yuen, Everett V. Smith, Jr., Lidia Dobria, and Qiong Fu
Abstract
Silver, Smith and Greene (2001) examined the dimensionality of responses to the Study Skills Self-Efficacy
Scale (SSSES) using exploratory principal factor analysis (PFA) and Rasch measurement techniques based on
a sample of social science students from a community college in the United States. They found that responses
defined three related dimensions. In the present study, Messick’s (1995) conceptualization of validity was
used to organize the exploration of the psychometric properties of data from a Chinese version of the SSSES.
Evidence related to the content aspect of validity was obtained via item fit evaluation; the substantive aspect
of validity was addressed by examining the functioning of the rating scales; the structural aspect of validity
was explored with exploratory PFA and Rasch item fit statistics; and support for the generalizability aspect of
validity was investigated via differential item functioning and internal consistency reliability estimates for both
items and persons. The exploratory PFA and Rasch analysis of responses to the Chinese version of the SSSES
were conducted with a sample of 494 Hong Kong high school students. Four factors emerged: Study
Routines, Resource Use, Text-Based Critical Thinking, and Self-Modification. The fit of the data to the Rasch
rating scale model for each dimension generally supported the unidimensionality of the four constructs. The
ordered average measures and thresholds from the four Rasch analyses supported the continued use of the
six-point response format. Item and person reliability were found to be adequate. Differential item functioning
across gender and language of instruction was minimal.
****
Rasch Family Models in e-Learning: Analyzing Architectural
Sketching with a Digital Pen
Kathleen Scalise, Nancy Yen-wen Cheng, and Nargas Oskui
Abstract
Since architecture students studying design drawing are usually assessed qualitatively on the basis of their
final products, the challenges and stages of their learning have remained masked. To clarify the challenges in
design drawing, we have been using the BEAR Assessment System and Rasch family models to measure levels
of understanding for individuals and groups, in order to correct pedagogical assumptions and tune teaching
materials. This article discusses the analysis of 81 drawings created by architecture students to solve a space
layout problem, collected and analyzed with digital pen-and-paper technology. The approach allows us to map
developmental performance criteria and perceive achievement overlaps in learning domains assumed separate,
and then re-conceptualize a three-part framework to represent learning in architectural drawing. Results and
measurement evidence from the assessment and Rasch modeling are discussed.
****
Measuring Measuring: Toward a Theory of Proficiency
with the Constructing Measures Framework
Brent Duckor, Karen Draney, and Mark Wilson
Abstract
This paper is relevant to measurement educators who are interested in the variability of understanding and use of
the four building blocks in the Constructing Measures framework (Wilson, 2005). It proposes a uni-dimensional
structure for understanding Wilson’s framework, and explores the evidence for and against this conceptualization.
Constructed and fixed choice response items are utilized to collect responses from 72 participants who
range in experience and expertise with constructing measures. The data, scored by two raters, were analyzed
with the Rasch partial credit model using ConQuest (1998). Guided by the 1999 Testing Standards, analyses
of validity and reliability evidence provide support for the construct theory and for limited uses of the instrument
pending item design modifications.
****
Plausible Values: How to Deal with Their Limitations
Christian Monseur and Raymond Adams
Abstract
Rasch modeling and plausible values methodology were used to scale and report the results of the Organization
for Economic Cooperation and Development's Programme for International Student Assessment (PISA).
This article describes the scaling approach adopted in PISA. In particular, it focuses on the use of
plausible values, a multiple imputation approach that is now commonly used in large-scale assessment. As with
all imputation models, the plausible values must be generated using models that are consistent with those used in
subsequent data analysis. In the case of PISA, the plausible value generation assumes a flat (single-level) linear regression with
all students' background variables collected through the international student questionnaire included as regressors.
Further, as in most linear models, homoscedasticity and normality of the conditional distribution are assumed.
This article explores some of the implications of this approach. First, we discuss the conditions
under which the secondary analyses on variables not included in the model for generating the plausible values
might be biased.
Second, as plausible values were not drawn from a multi-level model, the article explores the adequacy
of the PISA procedures for estimating variance components when the data have a hierarchical structure.
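A minimal sketch of drawing plausible values for one student under a single-level (flat) latent regression conditioning model, assuming dichotomous Rasch items and a grid approximation to the posterior; the regression coefficients, residual SD, and variable names are all illustrative, not PISA's operational values.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_plausible_values(responses, difficulties, background, beta, sigma,
                          n_draws=5):
    """Draw plausible values for one student: the prior mean is the latent
    regression prediction from the background variables (single-level,
    homoscedastic normal residuals), the likelihood comes from dichotomous
    Rasch items, and draws are taken from a grid approximation of the
    posterior."""
    grid = np.linspace(-4.0, 4.0, 81)
    prior_mean = float(np.dot(background, beta))
    prior = np.exp(-0.5 * ((grid - prior_mean) / sigma) ** 2)
    b = np.asarray(difficulties, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - b[None, :])))
    x = np.asarray(responses)
    likelihood = np.prod(np.where(x[None, :] == 1, p, 1.0 - p), axis=1)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return rng.choice(grid, size=n_draws, p=posterior)

pvs = draw_plausible_values([1, 0, 1], [-0.5, 0.2, 1.0],
                            background=[1.0, 0.3], beta=[0.1, 0.8], sigma=0.9)
```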
****
Understanding Rasch Measurement:
Item and Rater Analysis of Constructed Response Items via the Multi-Faceted Rasch Model
Edward W. Wolfe
Abstract
This article describes how the multi-faceted Rasch model (MFRM) can be applied to item and rater analysis
and the types of information that are made available by a multi-faceted analysis of constructed-response items.
In particular, the text describes the evidence made available by such analyses that is relevant to improving
item and rubric development as well as rater training and monitoring. The article provides an introduction to
MFRM extensions of the family of Rasch models, a description of item analysis procedures, and a description of
rater analysis procedures, and concludes with an example analysis conducted using a commercially available
program that implements the MFRM, Facets.
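A common statement of the MFRM for a rating in category k given by rater j to examinee n on item i, written for adjacent rating categories, is:

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) \;=\; B_n - D_i - C_j - F_k ,
```

where B_n is the examinee's proficiency, D_i the item difficulty, C_j the rater's severity, and F_k the difficulty of category step k.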
Vol. 10, No. 4 Winter 2009
****
The Rasch Model and Additive Conjoint Measurement
Van A. Newby, Gregory R. Conner, Christopher P. Grant, and C. Victor Bunderson
Abstract
In this paper we clarify the relationship between the Rasch model, additive conjoint measurement, and Luce and
Tukey’s (1964) axiomatization of additive conjoint measurement. We prove a theorem which links the Rasch
model with additive conjoint measurement.
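The dichotomous Rasch model referred to here can be written so that its additive structure in the person and item parameters is explicit, which is the structure that additive conjoint measurement axiomatizes:

```latex
P(X_{ni}=1) \;=\; \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad\text{equivalently}\qquad
\ln\!\left(\frac{P_{ni}}{1-P_{ni}}\right) \;=\; \theta_n - b_i .
```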
****
The Construction and Implementation of User-Defined
Fit Tests for Use with Marginal Maximum Likelihood Estimation and Generalized Item Response Models
Raymond J. Adams and Margaret L. Wu
Abstract
Wu (1997) developed a residual-based fit statistic for the generalized Rasch model when marginal maximum
likelihood estimation is used. This statistic is based upon a comparison of individuals' contributions to the sufficient
statistics with the expectations of those contributions, for modeled parameters. In this paper, we present a
more flexible approach in which linear combinations of individuals' contributions to the sufficient statistics are
compared to their expectations. This more flexible approach can be used to test the fit of combinations of items.
This article first briefly describes the theoretical derivation of the fit statistics, and then illustrates how user-defined
fit tests can be implemented for a number of simulated data sets and for two real data sets. The results show that
user-defined fit tests are much more powerful than fit tests at the individual parameter level in testing hypothesized
violations of the item response model, such as local dependence and multidimensionality.
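Schematically (and only schematically; the paper's derivation handles the expectations and variances under marginal maximum likelihood), a user-defined fit statistic for a chosen weight vector w over items compares the weighted contributions to the sufficient statistics with their model expectations and standardizes the total:

```latex
t(\mathbf{w}) \;=\;
\frac{\sum_{n}\sum_{i} w_i\,\bigl(x_{ni} - E[x_{ni}]\bigr)}
     {\sqrt{\operatorname{Var}\!\left(\sum_{n}\sum_{i} w_i\,x_{ni}\right)}} ,
```

which is referred to an approximately standard normal distribution; choosing w to pick out a block of items yields a fit test for that combination of items.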
****
Development of a Multidimensional Measure of Academic
Engagement
Kyra Caspary and Maria Veronica Santelices
Abstract
This article describes development of a measure of academic engagement using items from an existing survey
of undergraduates enrolled at the University of California. The use of academic engagement as a criterion in
higher education admissions has been justified by the argument that highly engaged students benefit the most
from the academic experience an institution offers. A valid and reliable measure of academic engagement
would allow for research into the relationship between student engagement and student learning. After a review of
the literature on engagement at both the secondary and postsecondary levels, a multidimensional model of
engagement is proposed using a construct modeling approach. First, the various hypothesized dimensions of
engagement are described; then items are mapped onto these dimensions; and finally responses to these items
are compared to our hypothesized dimensions. Results support the conceptualization of academic engagement
as a multidimensional measure composed of goals and behavioral constructs.
****
Random Parameter Structure and the Testlet Model:
Extension of the Rasch Testlet Model
Insu Paek, Haniza Yon, Mark Wilson, and Taehoon Kang
Abstract
The current Rasch testlet model (RT) assumes independence of the testlet effect and the target dimension. This
article investigated the impact of the violation of that assumption on RT and the performance of an extended
Rasch testlet model (ET) in which the random parameter variance-covariance matrix is estimated without any
constraints. Our simulation results showed that ET performed as well as or better than RT. The
target dimension variance in RT was the most strongly affected parameter, and its bias
was largest when the testlet effect was large and the correlation between the testlet effect and the target
dimension was high. This suggests that in some real data applications it may be difficult to accurately assess
the size of the testlet effect relative to the target dimension. RT performed similarly to ET with regard to
item and testlet effect parameter estimation.
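For reference, the dichotomous Rasch testlet model can be written as below, where γ_{nd(i)} is person n's random effect for the testlet d(i) containing item i; RT fixes Cov(θ, γ) = 0, while ET estimates the full variance-covariance matrix of the random parameters:

```latex
P\bigl(X_{ni}=1 \mid \theta_n, \gamma_{nd(i)}\bigr)
\;=\;
\frac{\exp\bigl(\theta_n + \gamma_{nd(i)} - b_i\bigr)}
     {1 + \exp\bigl(\theta_n + \gamma_{nd(i)} - b_i\bigr)} .
```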
****
A Comparative Analysis of the Ratings in Performance
Assessment Using Generalizability Theory and Many-Facet Rasch Model
Sungsook C. Kim and Mark Wilson
Abstract
The purpose of this study is to compare two different methods for modeling rater effects in performance assessment:
Generalizability (G) Theory and the Many-facet Rasch Model (MFRM). The view that G theory and the
MFRM are alternative solutions to the same measurement problem, in particular, rater effects, is seen to be only
partially true. G theory provides a general summary including an estimation of the relative influence of each
facet on a measure and the reliability of a decision based on the data. MFRM concentrates on the individual
examinee or rater and provides as fair a measure as it is possible to derive from the data as well as summary
information such as reliability indices and ways to express the relative influence of the facets. These conclusions
are illustrated using data for ratings of student writing assessments.
****
The Family Approach to Assessing Fit in Rasch
Measurement
Richard M. Smith and Christie Plackner
Abstract
There has been a renewed interest in comparing the usefulness of a variety of model and non-model based fit
statistics to detect measurement disturbances. Most of the recent studies compare the results of individual statistics
trying to find the single best statistic. Unfortunately, the nature of measurement disturbances is such that they
are quite varied in how they manifest themselves in the data. That is to say, there is not a single fit statistic that
is optimal for detecting every type of measurement disturbance. Because of this, it is necessary to use a family
of fit statistics designed to detect the most important measurement disturbances when checking the fit of data
to the appropriate Rasch model. The early Rasch fit statistics (Wright and Panchapakesan, 1969) were based on
the Pearsonian chi-square. The ability to recombine the N x L chi-squares into a variety of different fit statistics,
each looking at specific threats to the measurement process, is critical to this family approach to assessing fit.
Calibration programs, such as WINSTEPS and FACETS, that use only one type of fit statistic to assess the fit
of the data to the model, seriously underestimate the presence of measurement disturbances in the data. This is
due primarily to the fact that the total fit statistics (INFIT and OUTFIT), used exclusively in these programs,
are relatively insensitive to systematic threats to unidimensionality. This paper, which focuses on the Rasch
model and the Pearsonian chi-square approach to assessing fit, reviews the different types of measurement
disturbances and their underlying causes, and identifies the types of fit statistics that must be used to detect these
disturbances with maximum efficiency.
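A minimal sketch of the "family" idea for dichotomous data, assuming person and item estimates are already in hand: the same N x L matrix of squared standardized residuals is recombined into unweighted (OUTFIT-style) and information-weighted (INFIT-style) item and person statistics; other recombinations targeting other disturbances follow the same pattern. Names are illustrative.

```python
import numpy as np

def rasch_fit_family(X, theta, b):
    """Recombine the N x L squared standardized residuals of dichotomous
    Rasch data into several members of the fit-statistic family:
    unweighted (OUTFIT-style) and information-weighted (INFIT-style)
    mean-squares for items and for persons."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    w = p * (1.0 - p)                       # binomial information
    z2 = (X - p) ** 2 / w                   # squared standardized residuals
    item_outfit = z2.mean(axis=0)           # recombined over persons
    person_outfit = z2.mean(axis=1)         # recombined over items
    item_infit = (w * z2).sum(axis=0) / w.sum(axis=0)
    person_infit = (w * z2).sum(axis=1) / w.sum(axis=1)
    return item_outfit, item_infit, person_outfit, person_infit
```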
****
Understanding Rasch
Measurement: Standard Setting with Dichotomous and Constructed Response Items: Some Rasch Model Approaches
Robert G. MacCann
Abstract
Using real data comprising responses to both dichotomously scored and constructed response items, this paper
shows how Rasch modeling may be used to facilitate standard-setting. The modeling uses Andrich’s Extended
Logistic Model, which is incorporated into the RUMM software package. After a review of the fundamental
equations of the model, an application to Bookmark standard setting is given, showing how to calculate the
bookmark difficulty location (BDL) for both dichotomous items and tests containing a mixture of item types.
An example showing how the bookmark is set is also discussed. The Rasch model is then applied in various
ways to the Angoff standard-setting methods. In the first Angoff approach, the judges’ item ratings are compared
to Rasch model expected scores, allowing the judges to find items where their ratings differ significantly
from the Rasch model values. In the second Angoff approach, the distribution of item ratings are converted to
a distribution of possible cutscores, from which a final cutscore may be selected. In the third Angoff approach,
the Rasch model provides a comprehensive information set to the judges. For every total score on the test,
the model provides a column of item ratings (expected scores) for the ability associated with the total score.
The judges consider each column of item ratings as a whole and select the column that best fits the expected
pattern of responses of a marginal candidate. The total score corresponding to the selected column is then the
performance band cutscore.
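For a dichotomous Rasch item of difficulty b_i, the bookmark mapping with response probability criterion RP places the item at the ability where the probability of success equals RP:

```latex
\theta_{\mathrm{BDL}} \;=\; b_i + \ln\!\left(\frac{RP}{1-RP}\right),
\qquad RP = 0.67 \;\Rightarrow\; \theta_{\mathrm{BDL}} \approx b_i + 0.71 .
```

The mixed-format and polytomous cases discussed in the paper require the corresponding expected-score or category-probability equations of the Extended Logistic Model rather than this dichotomous form.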