Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

Article abstracts for Volumes 1 to 7 are available in PDF format. Just click on the links below.

Abstracts for Volume 1, 2000

Abstracts for Volume 2, 2001

Abstracts for Volume 3, 2002

Abstracts for Volume 4, 2003

Abstracts for Volume 5, 2004

Abstracts for Volume 6, 2005

Abstracts for Volume 7, 2006

Article abstracts for Volumes 8 to 15 are available in HTML format. Just click on the links below.

Abstracts for Volume 8, 2007

Abstracts for Volume 9, 2008

Abstracts for Volume 10, 2009

Abstracts for Volume 11, 2010

Abstracts for Volume 12, 2011

Abstracts for Volume 13, 2012

Abstracts for Volume 14, 2013

Abstracts for Volume 15, 2014


Current Volume Article Abstracts


Vol. 16, No. 1 Winter 2015

A Mathematical Theory of Ability Measure Based on Partial Credit Item Responses

Nan L. Kong

Abstract

This paper defines a measure of examinees’ abilities using additivity, the fundamental property of a measure, based on partial credit item responses. The fundamental properties of this newly defined ability measure are demonstrated using mathematical proofs. The paper also shows that interactive ability and conditional ability are measurable with additivity. Finally, it examines the ability measures associated with subscales and their decompositions.
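
For orientation, the partial credit parameterization on which such ability measures are based is commonly written as follows (standard notation; this formula is supplied as background and is not taken from the paper):

```latex
P(X_{ni}=x) \;=\;
\frac{\exp\left(\sum_{k=0}^{x}(\theta_n-\delta_{ik})\right)}
     {\sum_{j=0}^{m_i}\exp\left(\sum_{k=0}^{j}(\theta_n-\delta_{ik})\right)},
\qquad x = 0, 1, \ldots, m_i,
```

where \theta_n is the ability of person n, \delta_{ik} is the k-th threshold of item i with m_i score categories, and the k = 0 term in each sum is defined to be zero.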

____________________

Differential Item Functioning Analysis by Applying Multiple Comparison Procedures

Paolo Eusebi and Svend Kreiner

Abstract

Analysis within a Rasch measurement framework aims at the development of valid and objective test scores. One requirement of both validity and objectivity is that items show no evidence of differential item functioning (DIF). A number of procedures exist for the assessment of DIF, including those based on the analysis of contingency tables by Mantel-Haenszel tests and partial gamma coefficients. The aim of this paper is to illustrate Multiple Comparison Procedures (MCP) for the analysis of DIF relative to a variable defining a very large number of groups with an unclear ordering with respect to the DIF effect. We propose a single-step procedure controlling the false discovery rate for DIF detection. The procedure applies to both dichotomous and polytomous items. In addition to providing evidence against a hypothesis of no DIF, the procedure also provides information on subsets of groups that are homogeneous with respect to the DIF effect. A stepwise MCP procedure for this purpose is also introduced.

____________________

Visually Discriminating Upper Case Letters, Lower Case Letters and Numbers

Janet Richmond, Russell F. Waugh, and Deslea Konza

Abstract

English and number literacy are important for successful learning, and testing student literacy and numeracy standards enables early identification and remediation of children who have difficulty. Rasch measures were created with the RUMM2020 computer program for the perceptual constructs of visual discrimination of upper case letters, lower case letters and numbers. Thirty items for Visual Discrimination of Upper Case Letters (VDUCL), 36 for Visual Discrimination of Lower Case Letters (VDLCL) and 20 for Visual Discrimination of Numbers (VDN) were presented to 324 Pre-Primary through Year 4 children, aged 4-9 years. All students attended school in Perth, Western Australia. Eighteen of the initial 30 VDUCL items, 31 of the original 36 VDLCL items and 13 of the original 20 VDN items were used to create linear scales (the others were deleted due to misfit), and these clearly showed which letters and numbers children said were easy and which were hard.

____________________

Testing the Multidimensionality of the Inventory of School Motivation in a Dutch Student Sample

Hanke Korpershoek, Kun Xu, Magdalena Mo Ching Mok, Dennis M. McInerney, and Greetje van der Werf

Abstract

A factor analytic and a Rasch measurement approach were applied to evaluate the multidimensional nature of the school motivation construct among more than 7,000 Dutch secondary school students. The Inventory of School Motivation (McInerney and Ali, 2006) was used, which is intended to measure four motivation dimensions (mastery, performance, social, and extrinsic motivation), each comprising two first-order factors. One unidimensional model and three multidimensional models (4-factor, 8-factor, higher order) were fit to the data. Results of both approaches showed that the multidimensional models validly represented school motivation among Dutch secondary school pupils, whereas model fit of the unidimensional model was poor. The differences in model fit between the three multidimensional models were small, although a different model was favoured by the two approaches. The need to improve some of the items and to increase the measurement precision of several first-order factors is discussed.

____________________

Measuring Teaching Assistants’ Efficacy using the Rasch Model

Zi Yan, Chun Wai Lum, Rick Tze Leung Lui, Steven Sing Wa Chu, and Ming Lui

Abstract

Teaching assistants (TAs) play an influential role in primary and secondary schools. However, the literature on TAs’ efficacy is scarce, and to date no instrument has been available for measuring it. The present study aims to develop and validate a scale, the Teaching Assistant Efficacy Scale (TAES), for measuring TAs’ efficacy on identified capabilities. A total of 531 teaching assistants from Hong Kong schools participated in the survey. The multidimensional Rasch model was used to analyse the data. The results supported a five-dimension structure of TA efficacy. The final 30-item version of the TAES assesses TAs’ efficacy in learning support, teaching support, behaviour management, cooperation, and administrative support. The Rasch reliabilities for all five dimensions were around 0.90. The six-category response structure worked well for the scale. Further research is recommended to validate and test the robustness of the TAES both in Hong Kong and elsewhere.

____________________

Detecting Measurement Disturbance Effects: The Graphical Display Of Item Characteristics

Randall E. Schumacker

Abstract

Traditional identification of misfitting items in Rasch measurement models has relied on interpreting the Infit and Outfit standardized z statistics. A more recent approach, made possible by Winsteps, is to specify “group = 0” in the control file and subsequently view the item characteristic curve for each item against the true probability curve. The graphical display reveals whether an item follows the true probability curve or deviates substantially from it, thus indicating measurement disturbance. The probability of item response and the logit ability are easily copied into data vectors in R and then graphed. An example control file, output item data, and the subsequent preparation of an overlay graph for misfitting items are presented using Winsteps and R software. For comparison purposes the data are also analyzed using a multi-dimensional (MD) mapping procedure.
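
The overlay comparison described in this abstract can be sketched in a few lines of code. The abstract’s own workflow uses Winsteps and R; the sketch below uses Python instead, and every value in it (the item difficulty, the ability bins, and the observed proportions) is hypothetical:

```python
import math

def rasch_prob(theta, b):
    # Probability of a correct response under the dichotomous Rasch model,
    # for a person of ability theta (logits) and an item of difficulty b.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical item: estimated difficulty b = 0.5 logits.
b = 0.5

# Hypothetical ability bins (logits) and the observed proportion of
# correct responses in each bin (the "empirical" curve).
ability_bins = [-2.0, -1.0, 0.0, 1.0, 2.0]
observed = [0.15, 0.20, 0.55, 0.60, 0.95]

# A crude numeric stand-in for visual inspection of the overlay graph:
# flag any bin where the empirical curve deviates from the model curve
# by more than 0.15.
for theta, obs in zip(ability_bins, observed):
    expected = rasch_prob(theta, b)
    flag = "possible disturbance" if abs(obs - expected) > 0.15 else "ok"
    print(f"theta={theta:+.1f}  model={expected:.2f}  observed={obs:.2f}  {flag}")
```

In an actual analysis the two curves would be plotted as an overlay graph, as the abstract does in R, rather than compared against a fixed numeric tolerance.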

____________________

Criteria Weighting with Respect to Institution’s Goals for Faculty Selection

Sheu Hua Chen, Yen Ting Chen, and Hong Tau Lee

Abstract

Employers frequently must select one employee from among numerous candidates. They evaluate these candidates against multiple criteria, which raises the problem of how to determine the relative importance of those criteria. Traditionally, when engaging a new employee, the employer develops a set of criteria and their associated weightings in accordance with the institution’s goals. However, the requirement that the weighting also reflect the priority of those goals is frequently ignored. That is to say, it is necessary to recheck whether the weighting set appropriately reflects the priority of the institution’s goals. In this research, we propose a mechanism for reviewing the criteria weighting to see whether it adequately satisfies the institution’s actual goals. This double-check procedure can further help the employer select appropriate personnel for his or her institution.

____________________

Gendered Language Attitudes: Exploring Language as a Gendered Construct using Rasch Measurement Theory

Kris A. Knisely and Stefanie A. Wind

Abstract

Gendered language attitudes (GLAs) are gender-based perceptions of language varieties, based on connections between gender-related and linguistic characteristics of individuals, including the perception of language varieties as possessing degrees of masculinity and femininity. This study combines substantive theory about language learning and gender with a model based on Rasch measurement theory to explore the psychometric properties of a new measure of GLAs. Findings suggest that GLAs form a unidimensional construct and that the items can be used to describe differences among students in terms of the strength of their GLAs. Implications for research, theory, and practice are discussed, with special emphasis on the teaching and learning of languages.

____________________

Vol. 16, No. 2 Spring 2015

Implications of Removing Random Guessing from Rasch Item Estimates in Vertical Scaling

Ida Marais

Abstract

Large scale testing programs often involve a number of assessments that include multiple choice items administered to students in different grades. The Rasch model is sometimes used to transform the raw test scores onto a common vertical scale of proficiency. However, with multiple choice items students may guess, and the Rasch model makes no provision for guessing. In this study a procedure for removing random guessing from Rasch item estimates is applied to two assessments. The results showed that, when there was guessing, the vertical scale of proficiency was shrunk. Moreover, highly proficient students were penalised more than low proficiency students were advantaged by guessing. After removing the effect of guessing from the estimates, the vertical scale was more spread out. Also, because proficient students answer the more difficult items correctly at a greater rate than less proficient students, they obtained the greatest benefit when the effect of guessing had been removed from the estimates of these items.

____________________

Funding Medical Research Projects: Taking into Account Referees’ Severity and Consistency through Many-Faceted Rasch Modeling of Projects’ Scores

Luigi Tesio, Anna Simone, Mariuzs T. Grzeda, Michela Ponzio, Gabriele Dati, Paola Zaratin, Laura Perucca, and Mario A. Battaglia

Abstract

The funding policy of research projects often relies on scores assigned by a panel of experts (referees). The nonlinear nature of raw scores and the severity and inconsistency of individual raters may generate unfair numeric project rankings. Rasch measurement (in its “many-facets” version, MFRM) provides a valid alternative to scoring. MFRM was applied to the scores achieved by 75 research projects on multiple sclerosis submitted in response to a previous annual call by FISM-Italian Foundation for Multiple Sclerosis. This allowed us to simulate, a posteriori, the impact of MFRM on the funding scenario. The applications were each scored by 2 to 4 independent referees (total = 131) on a 10-item, 0-3 rating scale called FISM-ProQual-P. The rotation plan assured “connection” of all pairs of projects through at least 1 shared referee. The questionnaire satisfactorily fulfilled the stringent criteria of Rasch measurement for psychometric quality (unidimensionality, reliability and data-model fit). Arbitrarily, 2 acceptability thresholds were set at a raw score of 21/30 and at the equivalent Rasch measure of 61.5/100, respectively. When the cut-off was switched from score to measure, 8 of the 18 acceptable projects had to be rejected, while 15 rejected projects became eligible for funding. Some referees, of various severity, were grossly inconsistent (z-std fit indexes <–1.9 or >1.9). The FISM-ProQual-P questionnaire appears to be a valid and reliable scale. MFRM may aid the decision-making process for allocating funds, not only to MS research projects but also in other fields. In repeated assessment exercises it can help in the selection of reliable referees. Their severity can be steadily “calibrated”, thus obviating the need to “connect” them with other referees assessing the same projects.
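
For orientation, the many-facets model referred to above is commonly written as follows for person (here, project) n, item i, referee j, and rating category k (standard notation; supplied as background rather than taken from the paper):

```latex
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
\;=\; \theta_n - \delta_i - C_j - \tau_k,
```

where \theta_n is the measure of project n, \delta_i the difficulty of item i, C_j the severity of referee j, and \tau_k the threshold of rating category k.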

____________________

A Family of Rater Accuracy Models

Edward W. Wolfe, Hong Jiao, and Tian Song

Abstract

Engelhard (1996) proposed a rater accuracy model (RAM) as a means of evaluating rater accuracy in rating data, but very little research exists to determine the efficacy of that model. The RAM requires a transformation of the raw score data to accuracy measures by comparing rater-assigned scores to true scores. Indices computed from raw scores also exist for measuring rater effects, but these indices ignore deviations of rater-assigned scores from true scores. This paper compares the efficacy of two versions of the RAM (based on dichotomized and polytomized deviations of rater-assigned scores from true scores) to two raw score rater effect models (a Rasch partial credit model, PCM, and a Rasch rating scale model, RSM). Simulated data are used to demonstrate the efficacy with which these four models detect and differentiate three rater effects: severity, centrality, and inaccuracy. Results indicate that the RAMs are able to detect, but not differentiate, rater severity and inaccuracy, and are unable to detect rater centrality. The PCM and RSM, on the other hand, are able both to detect and to differentiate all three of these rater effects. However, the RSM and PCM do not take true scores into account and may, therefore, be misleading when pervasive trends exist in the rater-assigned data.

____________________

Using PISA as an International Benchmark in Standard Setting

Gary W. Phillips and Tao Jiang

Abstract

This study describes how the Programme for International Student Assessment (PISA) can be used to internationally benchmark state performance standards. The process is accomplished in three steps. First, PISA items are embedded in the administration of the state assessment and calibrated on the state scale. Second, the international item calibrations are used to link the state scale to the PISA scale through common item linking. Third, the statistical linking results are used as part of the state standard setting process to help standard setting panelists determine how high their state standards need to be in order to be internationally competitive. This process was carried out in Delaware, Hawaii, and Oregon, in three subjects (science, mathematics, and reading), with initial results reported by Phillips and Jiang (2011). An in-depth discussion of the methods and results is reported in this article for one subject (mathematics) and one state (Hawaii).

____________________

Investigating the Function of Content and Argumentation Items in a Science Test: A Multidimensional Approach

Shih-Ying Yao, Mark Wilson, J. Bryan Henderson, and Jonathan Osborne

Abstract

The latest national science framework has formally stated the need for developing assessments that test both students’ content knowledge and scientific practices. In response to this call, a science assessment has been developed that consists of (a) content items that measure students’ understanding of a grade eight physics topic and (b) argumentation items that measure students’ argumentation competency. This paper investigated the function of these content and argumentation items within a multidimensional measurement framework from two perspectives. First, we performed a dimensionality analysis to investigate whether the relationship between the content and argumentation items conformed to the test design. Second, we conducted a differential item functioning analysis in the multidimensional framework to examine whether any content or argumentation item unfairly favored students with an advanced level of English literacy. The methods and findings of this study could inform future research on the validation of assessments measuring higher-order and complex abilities.

____________________

Using a Rasch Model to Account for Guessing as a Source of Low Discrimination

Stephen Humphry

Abstract

The most common approach to modelling item discrimination and guessing for multiple-choice questions is the three parameter logistic (3PL) model. However, proponents of Rasch models generally avoid using the 3PL model because to model guessing entails sacrificing the distinctive property and advantages of Rasch models. One approach to dealing with guessing based on the application of Rasch models is to omit responses in which guessing appears to play a significant role. However, this approach entails loss of information and it does not account for variable item discrimination. It has been shown, though, that provided specific constraints are met, it is possible to parameterize discrimination while preserving the distinctive property of Rasch models. This article proposes an approach that uses Rasch models to account for guessing on standard multiple-choice items simply by treating it as a source of low item discrimination. Technical considerations are noted although a detailed examination of such considerations is beyond the scope of this article.

____________________

Chi-Squared Test of Fit and Sample Size — A Comparison between a Random Sample Approach and a Chi-Square Value Adjustment Method

Daniel Bergh

Abstract

Chi-square statistics are commonly used for tests of fit of measurement models. Chi-square is also sensitive to sample size, which is why several approaches to handling large samples in tests of fit have been developed. One strategy for handling the sample size problem is to adjust the sample size in the analysis of fit. An alternative is to adopt a random sample approach. The purpose of this study was to analyze and compare these two strategies using simulated data. Given an original sample size of 21,000, for reductions of sample size down to the order of 5,000 the adjusted sample size function works as well as the random sample approach. In contrast, when applying adjustments to sample sizes of a lower order, the adjustment function is less effective at approximating the chi-square value for an actual random sample of the relevant size. Hence, fit is exaggerated and misfit underestimated by the adjusted sample size function. Although there are large differences in chi-square values between the two approaches at lower sample sizes, the inferences based on the p-values may be the same.
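
The sample size adjustment examined in this abstract amounts to a simple rescaling, since chi-square fit statistics grow roughly in proportion to sample size. The sketch below is a minimal Python illustration of that rescaling; the function name and the example numbers are hypothetical, and the paper’s actual adjustment function may differ in detail:

```python
def adjusted_chisq(chisq, n_actual, n_target):
    # Rescale a chi-square fit value as if it had been computed on a
    # sample of n_target cases instead of n_actual cases.  This is the
    # "adjusted sample size" strategy; the alternative discussed in the
    # abstract recomputes chi-square on an actual random subsample of
    # the raw data.
    return chisq * (n_target / n_actual)

# Hypothetical example: a fit chi-square of 84.0 observed with n = 21,000.
chisq_full, n_full = 84.0, 21000

for n_target in (21000, 10000, 5000, 1000):
    print(n_target, round(adjusted_chisq(chisq_full, n_full, n_target), 2))
```

The abstract’s finding is that below roughly n = 5,000 this linear rescaling diverges from what a genuine random subsample of the same size would yield, exaggerating fit.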

____________________

Properties of the Tampa Scale for Kinesiophobia across Workers with Different Pain Experiences and Cultural Backgrounds: A Rasch Analysis

M. B. Jørgensen, E. Damsgård, A. Holtermann, A. Anke, K. Søgaard, and C. Røe

Abstract

The main aim of this study was to evaluate whether the construct validity of the Tampa Scale for Kinesiophobia (TSK) is consistent with respect to its scaling properties, unidimensionality and targeting among workers with different levels of pain. The 311 participating Danish workers reported kinesiophobia on the TSK (13-statement version) and the number of days with pain during the past year (fewer than 8 days, fewer than 90 days, and more than 90 days). A Rasch analysis was used to evaluate the measurement properties of the TSK in the workers across pain levels, ages, genders and ethnicities. The TSK did not fit the Rasch model, but removing one item resolved the misfit. Invariance was found across the pain levels, ages and genders. Thus, with a few modifications, the TSK was shown to capture a unidimensional construct of fear of movement in workers with different pain levels, ages, and genders.

____________________

Vol. 16, No. 3 Fall 2015

Comparison of Models and Indices for Detecting Rater Centrality

Edward W. Wolfe and Tian Song

Abstract

To date, much of the research concerning rater effects has focused on rater severity/leniency. Consequently, other potentially important rater effects have been largely ignored by those conducting operational scoring projects. This simulation study compares four rater centrality indices (rater fit, residual-expected correlations, rater slope, and rater threshold variance) in terms of their Type I and Type II error rates under varying levels of centrality magnitude, centrality pervasiveness, and rating scale construction when each of four latent trait models is fitted to the simulated data (the Rasch rating scale and partial credit models and the generalized rating scale and partial credit models). Results indicate that the residual-expected correlation may be the most appropriately sensitive to rater centrality under most conditions.

____________________

Measuring Psychosocial Impact of CBRN Incidents by the Rasch Model

Stef van Buuren and Diederik J. D. Wijnmalen

Abstract

An effective response to chemical, biological, radiological and nuclear (CBRN) incidents requires capability planning based upon an assessment of risks in which all types of possible consequences of such incidents have been taken into account. CBRN incidents can have a wide range of consequences, of which psychological and social effects (possibly leading to societal unrest) are often pointed out as very likely to occur. The goal of our research was to establish an objective measurement of the psychosocial impact of CBRN incidents with the use of the Rasch model. We created a list of eleven items, each of which tapped into an aspect of the psychosocial impact of incidents. Eleven judges scored ten CBRN scenarios on this list of items. Two items needed to be removed due to misfit. The resulting nine-item test fitted the Rasch model well. Three items showed mild forms of differential item functioning, but were retained in the test. The reliability of the instrument was 0.83. The scale can be used to quantitatively measure the inherently qualitative nature of the psychosocial impact of CBRN incident scenarios in order to better compare this type of impact with quantitative impact types such as number of casualties, costs, etc. Administration of the scale is simple and takes about one minute per scenario. We recommend wider use of the Rasch model for improving the quality of total impact measurement when faced with both qualitative and quantitative types of impact.

____________________

Using the Partial Credit Model to Evaluate the Student Engagement in Mathematics Scale

Micela Leis, Karen M. Schmidt, and Sara E. Rimm-Kaufman

Abstract

The Student Engagement in Mathematics Scale (SEMS) is a self-report measure that was created to assess three dimensions of student engagement (social, emotional, and cognitive) in mathematics based on a single day of class. In the current study, the SEMS was administered to a sample of 360 fifth graders from a large Mid-Atlantic district. The Rasch partial credit model (PCM) was used to analyze the psychometric properties of each sub-dimension of the SEMS. Misfitting items were removed from the final analysis. In general, items represented a range of engagement levels. Results show that the SEMS is an effective measure for researchers and practitioners to assess upper elementary school students’ perception of their engagement in math. The paper concludes with several recommendations for researchers considering using the SEMS.

____________________

Estimation of Parameters of the Rasch Model and Comparison of Groups in Presence of Locally Dependent Items

Mohand-Larbi Feddag, Myriam Blanchin, Véronique Sébille, and Jean-Benoit Hardouin

Abstract

Measurement specialists routinely assume that examinee responses to test items are independent of one another. However, previous research has shown that many tests contain item dependencies, and not accounting for these dependencies leads to misleading estimates of item and person parameters. In this paper, marginal maximum likelihood estimation of the Rasch model under violation of local independence is studied. The power of the Wald test of a group effect on the latent trait in cross-sectional studies is examined under both the local independence and the local item dependence assumptions. The results are illustrated with simulation studies.

____________________

Help Me Tell My Story: Development of an Oral Language Measurement Scale

Patrick Charles, Michelle Belisle, Kevin Tonita, and Julie Smith

Abstract

Help Me Tell My Story (HMTMS) is an assessment tool that uses a holistic approach and an electronic application to measure the oral language development of pre-kindergarten and kindergarten children. It includes access to an online portal that provides meaningful information to caregivers, educators and administrators. This study examines the psychometric characteristics of one of the five questionnaires included in the HMTMS assessment, which explores the ability of children to talk to family members, friends and teachers. It uses an unrestricted partial credit Rasch model to analyse data from 844 children. Results indicate that, although the reliability index was modest, the scale’s psychometric characteristics are within effective ranges: no response dependency was found, and the items constitute a unidimensional scale. There is no differential item functioning (DIF) related to gender, grade level or ethnicity on this scale. Thus this assessment tool is appropriate for use in early years oral language measurement.

____________________

A Dual-purpose Rasch Model with Joint Maximum Likelihood Estimation

Xiao Luo and John T. Willse

Abstract

In practice, there is a growing need to report both an overall score for ranking and decision-making purposes and subscores for diagnostic purposes. The Rasch model with subdimensions (RMS) was employed in this study to address this problem. A joint maximum likelihood estimation (JMLE) procedure was proposed to obtain computationally efficient estimation for this model. A simulation study was conducted to investigate the properties of this model with the JMLE procedure under conditions of varying sample size, test length and subdimension loading structure. Results indicated that, in general, parameters were estimated well using the JMLE procedure. The item parameters and overall ability parameters in the RMS were in accordance with parameters obtained from the Rasch model.

____________________

Using Rasch Analysis to Evaluate Accuracy of Individual Activities of Daily Living (ADL) and Instrumental Activities of Daily Living (IADL) for Disability Measurement

Bruce Friedman and Yanen Li

Abstract

Our study objectives were to examine the accuracy of individual activities of daily living (ADLs) and instrumental ADLs (IADLs) for disability measurement, and to determine whether dependence or difficulty is more useful for disability measurement. We analyzed data from 499 patients with 2+ ADLs or 3+ IADLs who participated in a home visiting nurse intervention study and whose function had been assessed at study baseline and at 22 months. Rasch analysis was used to evaluate the accuracy of 24 individual ADL and IADL items. The individual items differed in the amount of information provided in measuring functional disability along the range of disability, providing much more information in (usually) one part of the range. While nearly all of the Item Information Curves (IICs) for the ADL dependence, IADL difficulty, and IADL dependence items were unimodal with one information peak each, the IICs for ADL difficulty exhibited a bimodal pattern with two peaks. Which of the individual items performed better in disability measurement varied by the extent of functional disability (i.e., by how disabled the patients were). The information peaks of most ADLs and many IADLs rise or drop steeply over a relatively short distance. Thus, whether dependence or difficulty is superior often changes very quickly along the disability continuum. There was considerable heterogeneity in which individual items provided the most and the least information at the three points of interest examined across the disability range (–2 SD units, mean, +2 SD units). While the disability region (low, medium, and high disability) for which each individual item provided the most information remained quite stable between baseline and 22 months for ADL difficulty, IADL difficulty, and IADL dependence, relatively large shifts occurred for ADL dependence items. At the disability mean, dependence items offered more information for assessment than difficulty items. While ADLs also provided more information at –2 and +2 SD units, there was more heterogeneity at these points for IADLs, with little difference between dependence and difficulty assessment for some IADLs.

____________________