Vol. 20, No. 1, Spring 2019

The Effects of Probability Threshold Choice on an Adjustment for Guessing using the Rasch Model

Glenn Thomas Waterbury and Christine E. DeMars


This paper investigates a strategy for accounting for correct guessing with the Rasch model that we call the Guessing Adjustment. This strategy involves the identification of all person/item encounters where the probability of a correct response is below a specified threshold. These responses are converted to missing data and the calibration is conducted a second time. This simulation study focuses on the effects of different probability thresholds across varying conditions of sample size, amount of correct guessing, and item difficulty. Biases, standard errors, and root mean squared errors (RMSEs) were calculated within each condition. Larger probability thresholds were generally associated with reductions in bias and increases in standard errors. Across most conditions, the reduction in bias outweighed the decrease in precision, as reflected by the RMSE. The Guessing Adjustment is an effective means for reducing the impact of correct guessing, and the choice of probability threshold matters.
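
The masking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the example ability and difficulty values, and the threshold of 0.25 are hypothetical, and the first-pass person and item estimates are assumed to be already available from an initial Rasch calibration.

```python
import numpy as np

def rasch_probability(theta, b):
    """Rasch probability of a correct response for every person/item pair."""
    theta = np.asarray(theta, dtype=float)[:, None]  # persons as rows
    b = np.asarray(b, dtype=float)[None, :]          # items as columns
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def mask_likely_guesses(responses, theta, b, threshold):
    """Return a copy of `responses` with low-probability encounters set to NaN.

    Every person/item encounter whose model probability of success falls
    below `threshold` is treated as missing, whether the observed response
    was correct or incorrect.
    """
    p = rasch_probability(theta, b)
    adjusted = np.asarray(responses, dtype=float).copy()
    adjusted[p < threshold] = np.nan
    return adjusted

# Hypothetical first-pass estimates (logits) and scored responses.
theta_hat = [-2.0, 0.0, 2.0]
b_hat = [-1.0, 0.0, 1.5]
responses = np.array([[1, 1, 0],
                      [1, 0, 1],
                      [1, 1, 1]])

# Hypothetical threshold; the study compares several such values.
adjusted = mask_likely_guesses(responses, theta_hat, b_hat, threshold=0.25)
```

The adjusted matrix would then be passed back to whatever calibration software is in use for the second run; that step is omitted here because it depends on the software.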


Quantifying Item Invariance for the Selection of the Least Biased Assessment

W. Holmes Finch, Brian F. French, and Maria E. Hernandez Finch


An important aspect of educational and psychological measurement and evaluation of individuals is the selection of scales with appropriate evidence of reliability and validity for inferences and uses of the scores for the population of interest. One aspect of validity is the degree to which a scale fairly assesses the construct(s) of interest for members of different subgroups within the population. Typically, this issue is addressed statistically through assessment of differential item functioning (DIF) of individual items, or differential bundle functioning (DBF) of sets of items. When selecting an assessment to use for a given application (e.g., measuring intelligence), or which form of an assessment to use in a given instance, researchers need to consider the extent to which the scales work with all members of the population. Little research has examined methods for comparing the amount or magnitude of DIF/DBF present in two assessments when deciding which assessment to use. The current simulation study examines 6 different statistics for this purpose. Results show that a method based on the random effects item response theory model may be optimal for instrument comparisons, particularly when the assessments being compared are not of the same length.


Rasch Model Calibrations with SAS PROC IRT and WINSTEPS

Ki Cole


The WINSTEPS software is widely used for Rasch model calibrations. Recently, SAS/STAT® released the PROC IRT procedure for IRT analysis, including Rasch. The purpose of the study is to compare the performance of the PROC IRT procedure with WINSTEPS in calibrating dichotomous and polytomous Rasch models in order to determine whether PROC IRT is a viable alternative. A simulation study was used to compare the two programs in terms of the convergence rate, run time, item parameter estimates, and ability estimates with different test lengths and sample sizes. Implications of the results and the features of each software are discussed for research applications and large-scale assessment.


Student Perceptions of Grammar Instruction in Iranian Secondary Education: Evaluation of an Instrument using Rasch Measurement Theory

Stefanie A. Wind, Behzad Mansouri, and Parvaneh Yaghoubi Jami


Isolated and integrated grammar instruction are two approaches to grammar teaching that can be implemented within a form-focused instruction (FFI) framework. In both approaches, instructors primarily concentrate on meaning, and the difference is in the timing of instruction on specific language forms. In previous studies, researchers have observed that the match between teachers’ and learners’ beliefs related to the effectiveness of instructional approaches is an important component in predicting the success of grammar instruction. In this study, we report on the psychometric properties of a questionnaire designed to measure students’ perceptions of isolated and integrated FFI taking place in Iranian secondary schools. The Iranian context is interesting with regard to approaches to grammar instruction in light of recent policy reforms that emphasize isolated FFI. Using a combination of principal components analysis and Rasch measurement theory techniques, we observed that Iranian students distinguish between the two forms of grammar instruction. Looking within each approach, we observed significant differences among individual students as well as differences in the difficulty for students to endorse different instructional activities related to both isolated and integrated instruction. Together, our findings highlight the importance of examining students’ beliefs about the effectiveness of approaches to grammar instruction within different instructional contexts. We discuss implications for research and practice.


Computer Adaptive Test Stopping Rules Applied to the Flexilevel Shoulder Functioning Test

Trenton J. Combs, Kyle W. English, Barbara G. Dodd, and Hyeon-Ah Kang


Computerized adaptive testing (CAT) is an attractive alternative to traditional paper-and-pencil testing because it can provide accurate trait estimates while administering fewer items than a linear test form. A stopping rule is an important factor in determining an assessment’s efficiency. This simulation compares three variable-length stopping rules, each with and without a maximum number of items (20) imposed: standard error (SE) of .3, minimum information (MI) of .7, and change in trait (CT) of .02. We use fixed-length criteria of 10 and 20 items as a comparison for two versions of a linear assessment. The MI rules resulted in longer assessments with more biased trait estimates in comparison to other rules. The CT rule resulted in more biased estimates at the higher end of the trait scale and larger standard errors. The SE rules performed well across the trait scale in terms of both measurement precision and efficiency.
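
The three variable-length rules named in this abstract can be sketched as a single check, shown below. The cutoffs (.3, .7, .02) and the 20-item cap come from the abstract. The function name, the argument names, and the direction and strictness of each inequality are assumptions based on how these rules are commonly defined, not details confirmed by the authors.

```python
def should_stop(rule, n_administered, se=None, max_remaining_info=None,
                theta_change=None, max_items=None):
    """Decide whether a CAT should end after the latest item (illustrative).

    rule               -- "SE", "MI", or "CT"
    n_administered     -- number of items given so far
    se                 -- current standard error of the trait estimate
    max_remaining_info -- largest information any unused item offers
                          at the current trait estimate
    theta_change       -- change in the trait estimate since the last item
    max_items          -- optional cap on test length (20 in the abstract)
    """
    # An imposed maximum ends the test regardless of the variable-length rule.
    if max_items is not None and n_administered >= max_items:
        return True
    if rule == "SE":   # assumed: stop once the estimate is precise enough
        return se <= 0.3
    if rule == "MI":   # assumed: stop once no remaining item is informative enough
        return max_remaining_info < 0.7
    if rule == "CT":   # assumed: stop once the estimate has stabilized
        return abs(theta_change) < 0.02
    raise ValueError(f"unknown stopping rule: {rule!r}")

# Hypothetical states of an ongoing test.
stop_se = should_stop("SE", 5, se=0.29)
stop_capped = should_stop("SE", 20, se=0.35, max_items=20)
stop_ct = should_stop("CT", 8, theta_change=-0.05)
```

Keeping the item cap as a separate, optional argument mirrors the design of the study, in which each rule is evaluated both with and without the maximum imposed.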


Examining Rater Judgements in Music Performance Assessment using Many-Facets Rasch Rating Scale Measurement Model

Pey Shin Ooi and George Engelhard, Jr.


The fairness of raters in music performance assessment has become an important concern in the field of music. The assessment of students’ music performance depends in a fundamental way on rater judgements. The quality of rater judgements is crucial to provide fair, meaningful and informative assessments of music performance. There are many external factors that can influence the quality of rater judgements. Previous research has used different measurement models to examine the quality of rater judgements (e.g., generalizability theory). There are limitations with the previous analysis methods that are based on classical test theory and its extensions. In this study, we use modern measurement theory (Rasch measurement theory) to examine the quality of rater judgements. The many-facets Rasch rating scale model is employed to investigate the extent of rater-invariant measurement in the context of music performance assessments related to university degrees in Malaysia (159 students rated by 24 raters). We examine the rating scale structure, the severity levels of the raters, and the judged difficulty of the items. We also examine the interaction effects across musical instrument subgroups (keyboard, strings, woodwinds, brass, percussion and voice). The results suggest that there were differences in severity levels among the raters. The results of this study also suggest that raters had different severity levels when rating different musical instrument subgroups. The implications for research, theory and practice in the assessment of music performance are included in this paper.


Examining Differential Item Functioning in the Household Food Insecurity Scale: Does Participation in SNAP Affect Measurement Invariance?

Victoria T. Tanaka, George Engelhard, Jr., and Matthew P. Rabbitt


The Household Food Security Survey Module (HFSSM) is a scale used by the U.S. Department of Agriculture to measure the severity of food insecurity experienced by U.S. households. In this study, measurement invariance of the HFSSM is examined across households based on participation in the Supplemental Nutrition Assistance Program (SNAP). Households with children who responded to the HFSSM in 2015 and 2016 (N = 3,931) are examined. The Rasch model is used to analyze differential item functioning (DIF) related to SNAP participation. Analyses suggest a small difference in reported food insecurity between SNAP and non-SNAP participants (27% versus 23%, respectively). However, the size and direction of the DIF mitigate the impact on overall estimates of household food insecurity. Person-fit indices suggest that the household aberrant response rate is 6.6% and the number of misfitting households is comparable for SNAP (6.80%) and non-SNAP participants (6.30%). Implications for research and policy related to food insecurity are discussed.


Accuracy and Utility of the AUDIT-C with Adolescent Girls and Young Women (AGYW) Who Engage in HIV Risk Behaviors in South Africa

Tracy Kline, Corina Owens, Courtney Peasant Bonner, Tara Carney, Felicia A. Browne, and Wendee M. Wechsberg


Hazardous drinking is a risk factor associated with sexual risk, gender-based violence, and HIV transmission in South Africa. Consequently, sound and appropriate measurement of drinking behavior is critical to determining what constitutes hazardous drinking. Many research studies use internal consistency estimates as the determining factor in psychometric assessment; however, deeper assessments are needed to best define a measurement tool. Rasch methodology was used to evaluate a shorter version of the Alcohol Use Disorders Identification Test, the AUDIT-C, in a sample of adolescent girls and young women (AGYW) who use alcohol and other drugs in South Africa (n = 100). Investigations of operational response range, item fit, sensitivity, and response option usage provide a richer picture of AUDIT-C functioning than internal consistency alone in women who are vulnerable to hazardous drinking and therefore at risk of HIV. Analyses indicate that the AUDIT-C does not adequately measure this specialized population, and that more validation is needed to determine if the AUDIT-C should continue to be used in HIV prevention intervention studies focused on vulnerable adolescent girls and young women.