Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

 

Volume 10, 2009 Article Abstracts

Vol. 10, No. 1 Spring 2009

****

Mapping Multiple Dimensions of Student Learning: The ConstructMap Program

Cathleen A. Kennedy and Karen Draney

Abstract

In the past, many assessments, especially standardized assessments, tended to be composed of items with specific right and wrong answers, such as those found in multiple-choice, true-false, and short-response items. Performance-based questions that require students to construct answers rather than select correct responses introduce the complexities of multiple correct answers, dependence on teacher judgment for scoring, and requisite ancillary skills such as language fluency, which are technically difficult to handle and may even introduce problems such as bias against certain groups of students. Recent developments in assessment design and psychometrics have improved the feasibility of assessing performance-based tasks more efficiently and effectively, thereby providing a rich domain of information from which interpretations can be made about what students know and what they can do when they draw upon that knowledge. We developed the ConstructMap computer program specifically to assist teachers in interpreting and representing this type of performance data. The program accepts as input student scores on items associated with one or multiple performance variables, computes proficiencies using multidimensional item response methods, and produces graphical representations of students’ estimated proficiency on each of the variables.

****

Response Dependence and the Measurement of Change

Ida Marais

Abstract

Because of confounding effects that can mask change when persons respond to the same items on more than one occasion, the measurement of change is a challenge. The specific effect on change studied in this paper is that observed when responses of persons to items at time 2 are dependent statistically on their responses at time 1. In addition, because this response dependence may affect the change differently for different locations of items relative to persons at time 1, the initial targeting of persons to items was studied. For a specific change in means of persons, dichotomous data were simulated according to the Rasch model with varying degrees of dependence and varying initial targeting of persons to items. Data were analysed, also using the Rasch model, in two ways: firstly, by treating items used at time 1 and time 2 as distinct ones (rack analysis) and, secondly, by treating persons at time 1 and time 2 as distinct ones (stack analysis). With the rack analysis the change is revealed through the item parameters and with the stack analysis the change is revealed through the person parameters. With no response dependence the two analyses gave equivalent and correct measures of change. With increasing dependence change was increasingly masked or increasingly amplified, depending on the targeting of items to persons at time 1. Response dependence affected the measurement of change in both analyses, but not always in the same way. The paper serves as a warning against undetected dependence and also considers evidence that can be used in the analysis of real data sets for detecting the presence of dependence when measuring change.
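
Illustrative sketch (not part of the published abstract): the Python fragment below simulates dichotomous Rasch data at two time points, injects response dependence at time 2 by letting a respondent simply repeat the time 1 response with some probability, and shows how that dependence pulls time 2 scores back toward time 1, masking the simulated change. The variable names, the repeat-the-response dependence mechanism, and the raw-score summary are assumptions for illustration, not the author's simulation design.

    import numpy as np

    rng = np.random.default_rng(0)

    def rasch_prob(theta, delta):
        """Probability of a correct response under the dichotomous Rasch model."""
        return 1.0 / (1.0 + np.exp(-(theta - delta)))

    n_persons, n_items = 500, 20
    deltas = np.linspace(-2, 2, n_items)         # item difficulties (initial targeting)
    theta1 = rng.normal(0.0, 1.0, n_persons)     # person measures at time 1
    theta2 = theta1 + 0.5                        # true change in person means

    x1 = (rng.random((n_persons, n_items)) < rasch_prob(theta1[:, None], deltas)).astype(int)
    x2_free = (rng.random((n_persons, n_items)) < rasch_prob(theta2[:, None], deltas)).astype(int)

    dependence = 0.3                             # probability of repeating the time 1 response
    repeat = rng.random((n_persons, n_items)) < dependence
    x2 = np.where(repeat, x1, x2_free)           # time 2 responses with dependence

    # With dependence > 0 the mean time 2 score is pulled toward time 1,
    # so the apparent change is smaller than the simulated 0.5 logits.
    print("mean proportion correct, time 1:", x1.mean())
    print("mean proportion correct, time 2:", x2.mean())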

****

Using Paired Comparison Matrices to Estimate Parameters of the Partial Credit Rasch Measurement Model for Rater-Mediated Assessments

Mary Garner and George Engelhard, Jr.

Abstract

The purpose of this paper is to describe a technique for estimating the parameters of a Rasch model that accommodates ordered categories and rater severity. The technique builds on the conditional pairwise algorithm described by Choppin (1968, 1985) and represents an extension of a conditional algorithm described by Garner and Engelhard (2000, 2002) in which parameters appear as the eigenvector of a matrix derived from paired comparisons. The algorithm is used successfully to recover parameters from a simulated data set. No one has previously described such an extension of the pairwise algorithm to a Rasch model that includes both ordered categories and rater effects. The paired comparisons technique has importance for several reasons: it relies on the separability of parameters that is true only for the Rasch measurement model; it works in the presence of missing data; it makes transparent the connectivity needed for parameter estimation; and it is very simple. The technique also shares the mathematical framework of a very popular technique in the social sciences called the Analytic Hierarchy Process (Saaty, 1996).
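
A minimal sketch of the pairwise idea behind the algorithm, for the dichotomous case only (the extension to ordered categories and rater severity described above is not reproduced here): count, for each pair of items, how often one is answered correctly while the other is missed, form the matrix of pairwise ratios, and read relative item difficulties off its principal eigenvector, which is also the mechanism behind Saaty's Analytic Hierarchy Process. Function and variable names, and the 0.5 smoothing constant, are assumptions for illustration.

    import numpy as np

    def pairwise_rasch_difficulties(X):
        """Estimate centered Rasch item difficulties from dichotomous data X
        (persons x items) via the principal eigenvector of a pairwise ratio matrix."""
        n_items = X.shape[1]
        B = np.zeros((n_items, n_items))
        for i in range(n_items):
            for j in range(n_items):
                if i != j:
                    # persons answering item j correctly and item i incorrectly
                    B[i, j] = np.sum((X[:, j] == 1) & (X[:, i] == 0))
        # Under the Rasch model B[i, j] / B[j, i] estimates exp(delta_i - delta_j),
        # so the ratio matrix is approximately consistent and its principal
        # eigenvector is proportional to exp(delta).
        R = (B + 0.5) / (B.T + 0.5)              # 0.5 smoothing avoids division by zero
        eigvals, eigvecs = np.linalg.eig(R)
        v = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
        deltas = np.log(v)
        return deltas - deltas.mean()            # centered for identifiability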

****

Toward a Domain Theory in English as a Second Language

Diane Strong-Krause

Abstract

This paper demonstrates how domain theory development is enhanced by using both theoretical data and empirical data. The study explored the domain of speaking English as a second language (ESL) comparing hypothetical data on speaking tasks provided by an experienced teacher and by a certified ACTFL oral proficiency interview rater with observed data from scores on a computer-delivered speaking exam. While the hypothetical data and observed data showed similar patterns in task difficulty in general, some tasks were identified as being much easier or harder than expected. These differences raise questions not only about test task design but also about the theoretical underpinnings of the domain. The results of the study suggest that this approach, where theory and data are examined together, will improve test design as well as benefit domain theory development.

****

Comparison of Single- and Double-Assessor Scoring Designs for the Assessment of Accomplished Teaching

George Engelhard, Jr. and Carol M. Myford

Abstract

This article is based on a more extensive research report (Engelhard, Myford and Cline, 2000) prepared for the National Board for Professional Teaching Standards (NBPTS) concerning the Early Childhood/Generalist and Middle Childhood/Generalist assessment systems. The report is available from the Educational Testing Service (ETS). An earlier version of the article was presented at the American Educational Research Association Conference in New Orleans in 2000. We would like to acknowledge the helpful advice of Mike Linacre regarding the use of the FACETS computer program and the assistance of Fred Cline in analyzing these data. The material contained in this article is based on work supported by the NBPTS. Any opinions, findings, conclusions, and recommendations expressed herein are those of the authors and do not necessarily reflect the views of the NBPTS, Emory University, ETS, or the University of Illinois at Chicago.

****

A Rasch Model Prototype for Assessing Vocabulary Learning Resulting from Different Instructional Methods: A Preschool Example

Cynthia B. Leung and William Steve Lang

Abstract

This study explored the effects of using Rasch modeling to analyze data on vocabulary knowledge of preschoolers who participated in repeated read-aloud events and hands-on science activities with their classroom teachers. A Rasch prototype for literacy research was developed and applied to the preschool data. Thirty-one target words were selected for analysis from three children’s informational picture books on light and color. After different instructional activities, each child received scores on individual target words measured with a total of six assessments, including free response vocabulary tests and expressive and receptive picture vocabulary tests. Rasch modeling was used to assess the learning difficulty of target words in different instructional settings. Suggestions are made for applying Rasch modeling to classroom studies of instructional interventions.

****

An Empirical Study on the Relationship between Teacher’s Judgments and Fit Statistics of the Partial Credit Model

Sun-Geun Baek and Hye-Sook Kim

Abstract

The main purpose of the study was to investigate empirically the relationship between classroom teachers’ judgments and the item and person fit statistics of the partial credit model. In this study, classroom teachers’ judgments were made intuitively by checking each item’s consistency with the general response pattern and each student’s need for additional treatment or advice. The item and person fit statistics of the partial credit model were estimated using the WINSTEPS program (Linacre, 2003). The subjects of this study were 321 sixth-grade students in 9 classrooms within 3 elementary schools in Seoul, Korea. For this research, a performance assessment test for sixth-grade mathematics was developed. It consisted of 20 polytomous response items, and its total scores ranged between 0 and 50. In addition, the 9 classroom teachers made judgments for each item of the test and for each student in their own classroom. They judged intuitively using four categories: (1) well fit, (2) fit, (3) misfit, and (4) badly misfit for each item as well as each student. Their judgments were scored from 1 to 4 for each item as well as each student. There are two significant findings in this study. First, there is a statistically significant relationship between the teachers’ judgments and the item fit statistic for each item (the median correlation coefficient between the teachers’ judgments and the item outfit ZSTD is 0.61). Second, there is a statistically significant relationship between the teachers’ judgments and the person fit statistic for each student (the median correlation coefficient between the teachers’ judgments and the person outfit ZSTD is 0.52). In conclusion, the item and person fit statistics of the partial credit model correspond with the teachers’ judgments for each test item and each student.

****

Understanding Rasch Measurement: Tools for Measuring Academic Growth

G. Gage Kingsbury, Martha McCall, and Carl Hauser

Abstract

Growth measurement and growth modeling have gained substantial interest in the last few years with the development of new statistical procedures and policy decisions such as the incorporation of growth into No Child Left Behind. The current study investigates the following four aspects of growth measurement:

• Issues in the development of vertical scales to measure growth
• Design of instruments to measure academic growth
• Techniques for modeling individual student growth, and
• Uses of growth information in a classroom

Measuring growth has always been a daunting task, but the development of measurement tools such as the Rasch model and computerized adaptive testing position us well to obtain high-quality data with which to measure and model the growth of an individual student across a course of study. This growth information, in norm-referenced and standards-referenced form, should enhance educators’ ability to enrich student learning.

 

 

Vol. 10, No. 2 Summer 2009

****

The Relationships Among Design Experiments, Invariant Measurement Scales, and Domain Theories

C. Victor Bunderson and Van A. Newby

Abstract

In this paper we discuss principled design experiments, a rigorous, experimentally oriented form of design-based research. We show the dependence of design experiments on invariant measurement scales. We discuss four kinds of invariance, culminating in interpretive invariance, and how this in turn depends on increasingly adequate theories of a domain. These theories give an account of the dimensions and ordered attainments on a set of dimensions that span a domain appropriately. This account may be called a domain theory or learning theory of progressive attainments (in a local domain). We show the direct and the broader benefits of developing and using these descriptive theories of a domain to guide prescriptive design approaches to research. In the process of giving an account of this set of interdependencies, we discuss aspects of the design method we are using, called Validity-Centered Design. This design framework guides the development of instruments based on domain theories, the development of learning opportunities, also based on domain theories, and the construction of a sound validity argument for systems that integrate learning with assessment.

****

Considerations About Expected a Posteriori Estimation in Adaptive Testing: Adaptive a Priori, Adaptive Correction for Bias, and Adaptive Integration Interval

Gilles Raîche and Jean-Guy Blais

Abstract

In a computerized adaptive test, we would like to obtain an acceptable precision of the proficiency level estimate using an optimal number of items. Unfortunately, decreasing the number of items is accompanied by a certain degree of bias when the true proficiency level differs significantly from the a priori estimate. The authors suggest that it is possible to reduce the bias, and even the standard error of the estimate, by applying to each provisional estimate one or a combination of the following strategies: the adaptive correction for bias proposed by Bock and Mislevy (1982), an adaptive a priori estimate, and an adaptive integration interval.
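
For readers unfamiliar with expected a posteriori (EAP) estimation, the sketch below computes an EAP proficiency estimate and its posterior standard deviation on a fixed quadrature grid for dichotomous Rasch items. The grid bounds, the normal prior, and the item difficulties are placeholders; making the prior mean, prior spread, and integration interval change from one provisional estimate to the next is the adaptive refinement the authors propose.

    import numpy as np

    def eap_estimate(responses, difficulties, prior_mean=0.0, prior_sd=1.0,
                     grid=np.linspace(-4.0, 4.0, 81)):
        """EAP proficiency estimate for dichotomous Rasch items via quadrature."""
        p = 1.0 / (1.0 + np.exp(-(grid[:, None] - np.asarray(difficulties)[None, :])))
        r = np.asarray(responses)[None, :]
        likelihood = np.prod(p ** r * (1 - p) ** (1 - r), axis=1)
        prior = np.exp(-0.5 * ((grid - prior_mean) / prior_sd) ** 2)
        posterior = likelihood * prior
        posterior /= posterior.sum()
        theta_hat = np.sum(grid * posterior)                        # EAP estimate
        se = np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior))   # posterior SD
        return theta_hat, se

    # Hypothetical example: three items answered 1, 0, 1
    print(eap_estimate([1, 0, 1], [-1.0, 0.0, 1.0]))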

****

Local Independence and Residual Covariance: A Study of Olympic Figure Skating Ratings

John M. Linacre

Abstract

Rasch fit analysis has focused on tests of global fit and tests of the fit of individual parameter estimates. Critics have noted that slight, but pervasive, patterns of misfit to a Rasch model within the data may escape detection using these approaches. These patterns contradict the Rasch axiom of local independence, and so degrade measurement and may bias measures. Misfit to a Rasch model is captured in the observation residuals. Traces of pervasive, but faint, secondary dimensions within the observations may be identified using factor analytic techniques. To illustrate these techniques, the ratings awarded during the Pairs Figure Skating competition at the 2002 Winter Olympic Games are examined. The intention is to detect analytically the patterns of rater bias admitted publicly after the event. It is seen that the one-parameter-at-a-time fit statistics and differential item functioning approaches fail to detect the crucial misfit patterns. Factor analytic methods do. In fact, the competition was held in two stages. Factor analytic techniques already detect the rater bias after the first stage. This suggests that remedial rater retraining or other rater-related actions could be taken before the final ratings are collected.
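
A compressed sketch of the residual-based diagnostic referred to above: compute standardized Rasch residuals, then inspect the principal components (the contrasts) of their correlations for a dominant secondary dimension, such as a block of judges loading together. Treating the person and item estimates as known, and all names below, are simplifying assumptions for illustration.

    import numpy as np

    def residual_principal_components(X, theta, delta):
        """Eigen-decomposition of the correlations among standardized Rasch
        residuals for dichotomous data X (persons x items)."""
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
        z = (X - p) / np.sqrt(p * (1 - p))       # standardized residuals
        corr = np.corrcoef(z, rowvar=False)      # item-by-item residual correlations
        eigvals, eigvecs = np.linalg.eigh(corr)
        order = np.argsort(eigvals)[::-1]
        # A first eigenvalue well above what random residuals would produce
        # flags a pervasive violation of local independence.
        return eigvals[order], eigvecs[:, order]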

****

Constructing One Scale to Describe Two Statewide Exams

Insu Paek, Deborah G. Peres, and Mark Wilson

Abstract

This study applies two approaches in creating a single scale from two separate statewide exams (the Golden State Math Exam and the California Standard Math Test) and compares some aspects of the two statewide tests. The first analysis involves a sequence of unidimensional Rasch scalings, using anchored items to scale the two tests together. The second analysis employs a two-dimensional Rasch scaling that uses the previous unidimensional analysis results to link the scales. The linking facilitates the investigation of the measurement properties of the two exams and is a basis for combining items from both exams to develop a more efficient testing program. The results of the comparisons of the two statewide exams based on the linking are shown and discussed.
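
One elementary way to put two separately calibrated item sets on a common metric with anchor items, in the spirit of the first analysis described above, is a mean shift on the common items. The sketch below uses hypothetical numbers and simple mean-mean linking rather than the anchored joint scaling the authors actually ran.

    import numpy as np

    def mean_mean_link(deltas_a, deltas_b, common_a, common_b):
        """Shift test B's item difficulties onto test A's scale using the mean
        difference of the anchor (common) items' calibrations."""
        shift = np.mean(np.asarray(deltas_a)[common_a] - np.asarray(deltas_b)[common_b])
        return np.asarray(deltas_b) + shift

    # Hypothetical example: items 0-2 of each calibration are the anchors.
    a = [-1.2, 0.0, 0.9, 1.5]
    b = [-0.7, 0.5, 1.4, -0.3]
    print(mean_mean_link(a, b, common_a=[0, 1, 2], common_b=[0, 1, 2]))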

****

Multidimensional Models in a Developmental Context

Yiyu Xie and Theo L. Dawson

Abstract

The concept of epistemological development is useful in psychological assessment only insofar as instruments can be designed to measure it consistently, reliably, and without bias. In the psychosocial domain, most traditional stage assessment systems rely on a process of matching concepts in a scoring manual generated from a limited number of construction cases, and thus suffer to some extent from bias introduced by an over-dependence on particular content. On the other hand, Commons’ Hierarchical Complexity Scoring System (HCSS) is an assessment that employs criteria for assessing the hierarchical complexity of texts that are independent of specific content. This paper examines whether the HCSS and one of the conventional systems, Kohlberg’s Standard Issue Scoring System (SISS), measure the same dimension of performance. A multidimensional partial credit analysis was performed on data collected between 1955 and 1999. The correlation between performance estimates on the SISS and HCSS is 0.92. The high correlation provides strong evidence that the order of hierarchical complexity identified by the HCSS is the same latent dimension of ability assessed with the SISS. The HCSS produced more distinct patterns of ordered stages and wider gaps between adjacent stages. This evidence implies that individual performances display a higher degree of consistency in their hierarchical complexity under the HCSS. A developmental scoring system that employs scoring criteria independent of particular content may be more powerful than traditional scoring systems because it is easier to apply and opens possibilities for cross-cultural, cross-gender, and cross-context comparison of conceptual knowledge within developmental levels.

****

An Application of the Multidimensional Random Coefficients Multinomial Logit Model to Evaluating Cognitive Models of Reasoning in Genetics

Edward W. Wolfe, Daniel T. Hickey, and Ann C.H. Kindfield

Abstract

This article summarizes multidimensional Rasch analyses in which several alternative models of genetics reasoning are evaluated based on item response data from secondary students who participated in a genetics reasoning curriculum. The various depictions of genetics reasoning are compared by fitting several models to the item response data and comparing data-to-model fit at the model level between hierarchically nested models. We conclude that two two-dimensional models provide a substantively better depiction of student performance than does a unidimensional model or more complex three- and four-dimensional models.

****

Understanding Rasch Measurement: The ISR: Intelligent Student Reports

Ronald Mead

Abstract

Rasch-based Scale Scores are a simple linear transformation of the basic logit metric. Scale Scores are the quantification of the measurement continuum. This quantification makes it possible to do arithmetic, compute differences, and apply standard statistical techniques. However, qualitative meaning is not in the numbers and must come from experience with the scale and from the descriptive information that can (and should) be attached. This includes item content and exemplars, normative information for relevant groups, historical data for the individual, and evaluative assessments like performance level standards. The Scale Score metric is the structure that manages the organization of intelligent reports and recognizes anomalies. Scale Scores have no meaning, per se, but can provide a strong framework for organizing useful reports and presenting meaningful information. They facilitate diagnosis by “Analysis of Fit” and by “Analysis of Misfit.” The Analysis of Fit relies on the general definition of the construct to describe what a student at a particular point on the scale can and cannot do. It is meaningful to the extent that the student conforms to the expectations of the measurement model. The Analysis of Misfit uses the model to identify surprises, i.e., departures from the model expectations. It highlights atypical areas of strong and weak performance. The intent is to bring these exceptions to the attention of the experts for informed, substantive interpretation and diagnosis. Intelligent reports, to be useful, and to justify the time and expense of testing, need to provide more information in a usable format than the candidate, student, parent, or educator had available otherwise. This requires more than reporting a single number or a single decision. It should include sufficient scaffolding to allow the consumer to extract quickly and efficiently all the useful information that can be taken from the test. Rasch Scale Scores are an important, perhaps essential, tool in this process.
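
The linear transformation mentioned above is simply a slope and an intercept applied to the logit measure; a one-line sketch, where the slope of 10, intercept of 500, and rounding convention are placeholders rather than the author's reporting metric:

    def scale_score(logit, slope=10.0, intercept=500.0):
        """Map a Rasch logit measure onto a reported Scale Score metric."""
        return round(slope * logit + intercept)

    print(scale_score(-1.3), scale_score(0.0), scale_score(2.1))   # 487 500 521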

 

 

Vol. 10, No. 3 Fall 2009

****

Using Classical and Modern Measurement Theories to Explore Rater, Domain, and Gender Influences on Student Writing Ability

Ismail S. Gyagenda and George Engelhard, Jr.

Abstract

This study (i) examined the rater, domain, and gender influences on the assessed quality of students’ writing ability and (ii) described and compared different approaches for examining these influences based on classical and modern measurement theories. Twenty raters were randomly selected from a group of 87 trained raters contracted to rate essays from the annual Georgia High School Writing Test. Each rater scored the entire set of 375 essays on a 1-4 rating scale (366 essays were used in the analyses because nine cases had missing values and were dropped). Two approaches, the classical approach and the item response theory-based Rasch model, were used to obtain psychometric measures of reliability and inter-rater reliability and to conduct statistical analyses with rater and gender as the predictor variables and the total and domain scores as the dependent variables. To achieve the second purpose, the Classical Test Model and the Rasch model were compared and contrasted and their strengths and limitations discussed as they related to student writing assessment. Analyses from both approaches indicated statistically significant rater and gender effects on student writing. Using domain scores as the dependent variables, there was a statistically significant rater by gender interaction effect at the multivariate level, but not at the univariate level. The Rasch analysis indicated a statistically significant rater by gender effect. The comparison between the two approaches highlighted their strengths and limitations, their different measurement and statistical models, and their different procedures.

****

The Efficacy of Link Items in the Construction of a Numeracy Achievement Scale—from Kindergarten to Year 6

Juho Looveer and Joanne Mulligan

Abstract

A large-scale numeracy research project was commissioned by the Australian Government, involving 4732 Australian students from 91 NSW primary schools. Rasch analysis was applied in the construction of a Numeracy Achievement Scale (NAS) in order to measure numeracy growth. Following trialling, a pool of 244 items was developed to assess number, space, and measurement concepts. Link items were included in test forms within year levels and across adjacent year levels to enable linking of the forms and the construction of a scale spanning Kindergarten to Year 6 (5 to 13 years of age). However, results from the scaling were not consistent with expectations of increases in student abilities or item difficulties across year levels. Differential item functioning analysis identified the problematic role of link items across year levels. After a different set of items was used for linking test forms, the results were consistent with expectations. A key finding was that items used to link forms must not exhibit differential item functioning across those levels.

****

The Study Skills Self-Efficacy Scale for Use with Chinese Students

Mantak Yuen, Everett V. Smith, Jr., Lidia Dobria, and Qiong Fu

Abstract

Silver, Smith, and Greene (2001) examined the dimensionality of responses to the Study Skills Self-Efficacy Scale (SSSES) using exploratory principal factor analysis (PFA) and Rasch measurement techniques based on a sample of social science students from a community college in the United States. They found that responses defined three related dimensions. In the present study, Messick’s (1995) conceptualization of validity was used to organize the exploration of the psychometric properties of data from a Chinese version of the SSSES. Evidence related to the content aspect of validity was obtained via item fit evaluation; the substantive aspect of validity was addressed by examining the functioning of the rating scales; the structural aspect of validity was explored with exploratory PFA and Rasch item fit statistics; and support for the generalizability aspect of validity was investigated via differential item functioning and internal consistency reliability estimates for both items and persons. The exploratory PFA and Rasch analyses of responses to the Chinese version of the SSSES were conducted with a sample of 494 Hong Kong high school students. Four factors emerged: Study Routines, Resource Use, Text-Based Critical Thinking, and Self-Modification. The fit of the data to the Rasch rating scale model for each dimension generally supported the unidimensionality of the four constructs. The ordered average measures and thresholds from the four Rasch analyses supported the continued use of the six-point response format. Item and person reliability were found to be adequate. Differential item functioning across gender and language of instruction was minimal.

****

Rasch Family Models in e-Learning: Analyzing Architectural Sketching with a Digital Pen

Kathleen Scalise, Nancy Yen-wen Cheng, and Nargas Oskui

Abstract

Since architecture students studying design drawing are usually assessed qualitatively on the basis of their final products, the challenges and stages of their learning have remained masked. To clarify the challenges in design drawing, we have been using the BEAR Assessment System and Rasch family models to measure levels of understanding for individuals and groups, in order to correct pedagogical assumptions and tune teaching materials. This chapter discusses the analysis of 81 drawings created by architectural students to solve a space layout problem, collected and analyzed with digital pen-and-paper technology. The approach allows us to map developmental performance criteria and perceive achievement overlaps in learning domains assumed separate, and then re-conceptualize a three-part framework to represent learning in architectural drawing. Results and measurement evidence from the assessment and Rasch modeling are discussed.

****

Measuring Measuring: Toward a Theory of Proficiency with the Constructing Measures Framework

Brent Duckor, Karen Draney, and Mark Wilson

Abstract

This paper is relevant to measurement educators who are interested in the variability of understanding and use of the four building blocks in the Constructing Measures framework (Wilson, 2005). It proposes a unidimensional structure for understanding Wilson’s framework, and explores the evidence for and against this conceptualization. Constructed and fixed-choice response items are utilized to collect responses from 72 participants who range in experience and expertise with constructing measures. The data, scored by two raters, were analyzed with the Rasch partial credit model using ConQuest (1998). Guided by the 1999 Testing Standards, analyses of validity and reliability evidence provide support for the construct theory and limited uses of the instrument pending item design modifications.

****

Plausible Values: How to Deal with Their Limitations

Christian Monseur and Raymond Adams

Abstract

Rasch modeling and plausible values methodology were used to scale and report the results of the Organization for Economic Cooperation and Development’s Programme for International Student Assessment (PISA). This article describes the scaling approach adopted in PISA. In particular, it focuses on the use of plausible values, a multiple imputation approach that is now commonly used in large-scale assessment. As with all imputation models, the plausible values must be generated using models that are consistent with those used in subsequent data analysis. In the case of PISA, the plausible value generation assumes a flat linear regression with all students’ background variables collected through the international student questionnaire included as regressors. Further, like most linear models, homoscedasticity and normality of the conditional variance are assumed. This article explores some of the implications of this approach. First, we discuss the conditions under which secondary analyses on variables not included in the model for generating the plausible values might be biased. Second, as plausible values were not drawn from a multi-level model, the article explores the adequacy of the PISA procedures for estimating variance components when the data have a hierarchical structure.
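
To make the conditioning step concrete, the sketch below draws plausible values from a normal posterior whose mean comes from a linear regression of a provisional ability estimate on background variables, with one common residual variance, mirroring the homoscedastic, normal regression assumptions noted above. It is a deliberate caricature: the operational procedure conditions on the full item response likelihood rather than a point estimate, and every name and number here is a placeholder.

    import numpy as np

    rng = np.random.default_rng(1)

    def draw_plausible_values(theta_hat, background, n_pv=5):
        """Draw n_pv plausible values per student from a normal posterior centered
        on the predictions of a regression of provisional ability on background."""
        X = np.column_stack([np.ones(len(theta_hat)), background])   # add intercept
        beta, *_ = np.linalg.lstsq(X, theta_hat, rcond=None)
        fitted = X @ beta
        resid_sd = np.std(theta_hat - fitted)                        # homoscedastic residual SD
        return fitted[:, None] + rng.normal(0.0, resid_sd, size=(len(theta_hat), n_pv))

    # Hypothetical example: four students, one background variable
    theta_hat = np.array([-0.5, 0.2, 1.1, 0.4])
    ses_index = np.array([[-1.0], [0.0], [1.5], [0.3]])
    print(draw_plausible_values(theta_hat, ses_index))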

****

Understanding Rasch Measurement: Item and Rater Analysis of Constructed Response Items via the Multi-Faceted Rasch Model

Edward W. Wolfe

Abstract

This article describes how the multi-faceted Rasch model (MFRM) can be applied to item and rater analysis and the types of information that are made available by a multifaceted analysis of constructed-response items. In particular, the text describes evidence made available by such analyses that is relevant to improving item and rubric development as well as rater training and monitoring. The article provides an introduction to MFRM extensions of the family of Rasch models, a description of item analysis procedures, and a description of rater analysis procedures, and concludes with an example analysis conducted using a commercially available program that implements the MFRM, Facets.
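
For readers new to the model, the multi-faceted Rasch model adds a severity term for the rater facet; in a typical formulation (standard notation, not quoted from the article), the log-odds of being rated in category k rather than k-1 of item i by rater j is

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k ,

where \theta_n is the person measure, \delta_i the item difficulty, \lambda_j the rater severity, and \tau_k the category threshold.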

 

Vol. 10, No. 4 Winter 2009

****

The Rasch Model and Additive Conjoint Measurement

Van A. Newby, Gregory R. Conner, Christopher P. Grant, and C. Victor Bunderson

Abstract

In this paper we clarify the relationship between the Rasch model, additive conjoint measurement, and Luce and Tukey’s (1964) axiomatization of additive conjoint measurement. We prove a theorem which links the Rasch model with additive conjoint measurement.
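
As background for the theorem mentioned above, the dichotomous Rasch model in its standard form (the usual textbook statement, not an equation quoted from the paper) makes the log-odds additive in separate person and item parameters, which is the kind of non-interactive, additive structure that additive conjoint measurement axiomatizes:

    P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
    \qquad
    \log \frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - \delta_i .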

****

The Construction and Implementation of User-Defined Fit Tests for Use with Marginal Maximum Likelihood Estimation and Generalized Item Response Models

Raymond J. Adams and Margaret L. Wu

Abstract

Wu (1997) developed a residual-based fit statistic for the generalized Rasch model when marginal maximum likelihood estimation is used. This statistic is based upon a comparison of individuals’ contributions to the sufficient statistics with the expectations of those contributions, for modeled parameters. In this paper, we present a more flexible approach in which linear combinations of individuals’ contributions to the sufficient statistics are compared to their expectations. This more flexible approach can be used to test the fit of combinations of items. This article first describes briefly the theoretical derivation of the fit statistics, and then illustrates how user-defined fit tests can be implemented for a number of simulated data sets and for two real data sets. The results show that user-defined fit tests are much more powerful than fit tests at the individual parameter level in testing hypothesized violations of the item response model, such as local dependence and multi-dimensionality.

****

Development of a Multidimensional Measure of Academic Engagement

Kyra Caspary and Maria Veronica Santelices

Abstract

This article describes development of a measure of academic engagement using items from an existing survey of undergraduates enrolled at the University of California. The use of academic engagement as a criterion in higher education admissions has been justified by the argument that highly engaged students benefit the most from the academic experience an institution offers. A valid and reliable measure of academic engagement would allow for research into this relationship between student engagement and student learning. After reviewing the literature on engagement at both the secondary and postsecondary level, a multidimensional model of engagement is proposed using a construct modeling approach. First the various hypothesized dimensions of engagement are described, then items are mapped onto these dimensions, and finally responses to these items are compared to our hypothesized dimensions. Results support the conceptualization of academic engagement as a multidimensional measure composed of goals and behavioral constructs.

****

Random Parameter Structure and the Testlet Model: Extension of the Rasch Testlet Model

Insu Paek, Haniza Yon, Mark Wilson, and Taehoon Kang

Abstract

The current Rasch testlet model (RT) assumes independence of the testlet effect and the target dimension. This article investigated the impact of violating that assumption on RT and the performance of an extended Rasch testlet model (ET) in which the random parameter variance-covariance matrix is estimated without any constraints. Our simulation results showed that ET performed as well as or better than RT. The target dimension variance in RT was the most strongly affected parameter, and the bias of the target dimension variance was largest when the testlet effect was large and the correlation between the testlet effect and the target dimension was high. This suggests that in some real data applications, it may be difficult to accurately assess the size of the testlet effect relative to the target dimension. RT performed similarly to ET with regard to item and testlet effect parameter estimation.
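
For orientation, the dichotomous Rasch testlet model is usually written with a person-specific testlet effect added to the target ability for items within the same testlet (standard notation, not quoted from the article):

    \log \frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n + \gamma_{n\,d(i)} - \delta_i ,

where d(i) indexes the testlet containing item i. The standard model fixes the covariance between \theta_n and each \gamma_{n\,d} at zero; the extended model described above estimates that covariance freely.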

****

A Comparative Analysis of the Ratings in Performance Assessment Using Generalizability Theory and Many-Facet Rasch Model

Sungsook C. Kim and Mark Wilson

Abstract

The purpose of this study is to compare two different methods for modeling rater effects in performance assessment: Generalizability (G) Theory and the Many-facet Rasch Model (MFRM). The view that G theory and the MFRM are alternative solutions to the same measurement problem, in particular, rater effects, is seen to be only partially true. G theory provides a general summary including an estimation of the relative influence of each facet on a measure and the reliability of a decision based on the data. MFRM concentrates on the individual examinee or rater and provides as fair a measure as it is possible to derive from the data as well as summary information such as reliability indices and ways to express the relative influence of the facets. These conclusions are illustrated using data for ratings of student writing assessments.

****

The Family Approach to Assessing Fit in Rasch Measurement

Richard M. Smith and Christie Plackner

Abstract

There has been a renewed interest in comparing the usefulness of a variety of model and non-model based fit statistics to detect measurement disturbances. Most of the recent studies compare the results of individual statistics trying to find the single best statistic. Unfortunately, the nature of measurement disturbances is such that they are quite varied in how they manifest themselves in the data. That is to say, there is not a single fit statistic that is optimal for detecting every type of measurement disturbance. Because of this, it is necessary to use a family of fit statistics designed to detect the most important measurement disturbances when checking the fit of data to the appropriate Rasch model. The early Rasch fit statistics (Wright and Panchapakesan, 1969) were based on the Pearsonian chi-square. The ability to recombine the N x L chi-squares into a variety of different fit statistics, each looking at specific threats to the measurement process, is critical to this family approach to assessing fit. Calibration programs, such as WINSTEPS and FACETS, that use only one type of fit statistic to assess the fit of the data to the model seriously underestimate the presence of measurement disturbances in the data. This is due primarily to the fact that the total fit statistics (INFIT and OUTFIT), used exclusively in these programs, are relatively insensitive to systematic threats to unidimensionality. This paper, which focuses on the Rasch model and the Pearsonian chi-square approach to assessing fit, reviews the different types of measurement disturbances and their underlying causes, and identifies the types of fit statistics that must be used to detect these disturbances with maximum efficiency.
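
To show how one table of residuals supports a whole family of statistics, the sketch below computes the squared standardized residuals once and then recombines them into item outfit and infit mean squares; other members of the family (person fit, between-group fit, and so on) are different recombinations of the same N x L table. The dichotomous simplification and all names are assumptions for illustration.

    import numpy as np

    def rasch_item_fit(X, theta, delta):
        """Item outfit and infit mean squares from the N x L table of Rasch
        score residuals for dichotomous data X (persons x items)."""
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
        w = p * (1 - p)                                 # model variance of each response
        z2 = (X - p) ** 2 / w                           # squared standardized residuals
        outfit = z2.mean(axis=0)                        # unweighted mean square
        infit = (z2 * w).sum(axis=0) / w.sum(axis=0)    # information-weighted mean square
        return outfit, infit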

****

Understanding Rasch Measurement: Standard Setting with Dichotomous and Constructed Response Items: Some Rasch Model Approaches

Robert G. MacCann

Abstract

Using real data comprising responses to both dichotomously scored and constructed-response items, this paper shows how Rasch modeling may be used to facilitate standard setting. The modeling uses Andrich’s Extended Logistic Model, which is incorporated into the RUMM software package. After a review of the fundamental equations of the model, an application to Bookmark standard setting is given, showing how to calculate the bookmark difficulty location (BDL) for both dichotomous items and tests containing a mixture of item types. An example showing how the bookmark is set is also discussed. The Rasch model is then applied in various ways to the Angoff standard-setting methods. In the first Angoff approach, the judges’ item ratings are compared to Rasch model expected scores, allowing the judges to find items where their ratings differ significantly from the Rasch model values. In the second Angoff approach, the distribution of item ratings is converted to a distribution of possible cutscores, from which a final cutscore may be selected. In the third Angoff approach, the Rasch model provides a comprehensive information set to the judges. For every total score on the test, the model provides a column of item ratings (expected scores) for the ability associated with the total score. The judges consider each column of item ratings as a whole and select the column that best fits the expected pattern of responses of a marginal candidate. The total score corresponding to the selected column is then the performance band cutscore.
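
For a dichotomous item the bookmark difficulty location is just the ability at which the Rasch model gives the chosen response probability; a small sketch under the common RP67 convention (the 0.67 value and the Rasch-only simplification are assumptions, not necessarily the paper's settings):

    import math

    def bookmark_difficulty_location(delta, rp=0.67):
        """Ability at which a Rasch item of difficulty delta is answered
        correctly with probability rp."""
        return delta + math.log(rp / (1.0 - rp))

    print(bookmark_difficulty_location(0.5))   # about 1.21 logits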
