Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Article abstracts for Volumes 1 to 7 are available in pdf format. Just click on the links below.
Abstracts for Volume 1, 2000
Abstracts for Volume 2, 2001
Abstracts for Volume 3, 2002
Abstracts for Volume 4, 2003
Abstracts for Volume 5, 2004
Abstracts for Volume 6, 2005
Abstracts for Volume 7, 2006
Article abstracts for Volumes 8 to 14 are available in html format. Just click on the links below.
Abstracts for Volume 8, 2007
Abstracts for Volume 9, 2008
Abstracts for Volume 10, 2009
Abstracts for Volume 11, 2010
Abstracts for Volume 12, 2011
Abstracts for Volume 13, 2012
Abstracts for Volume 14, 2013
Current Volume Article Abstracts
Vol. 15, No. 1 Spring 2014
Automatic Item Generation Implemented for Measuring Artistic Judgment Aptitude
Automatic item generation (AIG) is a broad class of methods that are being developed to address psychometric issues arising from internet and computer-based testing. In general, issues emphasize efficiency, validity, and diagnostic usefulness of large scale mental testing. Rapid prominence of AIG methods and their implicit perspective on mental testing is bringing painful scrutiny to many sacred psychometric assumptions. This report reviews basic AIG ideas, then presents conceptual foundations, image model development, and operational application to artistic judgment aptitude testing.
Comparison Is Key
Mark H. Stone and A. Jackson Stenner
Several concepts from Georg Rasch’s last papers are discussed. The key one is comparison because Rasch considered the method of comparison fundamental to science. From the role of comparison stems scientific inference made operational by a properly developed frame of reference producing specific objectivity. The exact specifications Rasch outlined for making comparisons are explicated from quotes, and the role of causality derived from making comparisons is also examined. Understanding causality has implications for what can and cannot be produced via Rasch measurement. His simple examples were instructive, but the implications become far-reaching once the key role of comparison is established.
Rasch Model of a Dynamic Assessment: An Investigation of the Children’s Inferential Thinking Modifiability Test
Linda L. Rittner and Steven M. Pulos
The purpose of this study was to develop a general procedure for evaluation of a dynamic assessment and to demonstrate an analysis of a dynamic assessment, the CITM (Tzuriel, 1995b), as an objective measure for use as a group assessment. The techniques used to determine the fit of the CITM to a Rasch partial credit model are explicitly outlined. A modified format of the CITM was administered to 266 diverse second grade students in the USA; 58% of participants were identified as low SES. The participants (males n = 144) were White Anglo and Latino American students (55%), many of whom were first generation Mexican immigrants. The CITM was found to adequately fit a Rasch partial credit model (PCM), indicating that the CITM is a likely candidate for a group administered dynamic assessment that can be measured objectively. The data also supported the feasibility of a model for objectively measuring change in learning ability for inferential thinking in the CITM.
Performance Assessment of Higher Order Thinking
This article describes a study investigating the effect of intervention on student problem solving and higher order competency development using a series of complex numeracy performance tasks (Airasian and Russell, 2008). The tasks were sequenced to promote and monitor student development towards hypothetico-deductive reasoning. Using Rasch partial credit analysis (Wright and Masters, 1982) to calibrate the tasks and analysis of residual gain scores to examine the effect of class and school membership, the study illustrates how directed intervention can improve students’ higher order competency skills. This paper demonstrates how the segmentation defined by Wright and Masters can offer a basis for interpreting the construct underlying a test and how segment definitions can deliver targeted interventions. Implications for teacher intervention and teaching mentor schemes are considered. The article also discusses multilevel regression models that differentiate class and school effects, and describes a process for generating, testing and using value added models.
A Rasch Measure of Young Children’s Temperament (Negative Emotionality) in Hong Kong
Po Lin Becky Bailey-Lau and Russell F. Waugh
An aspect of child behavior and temperament, called Negative Emotionality in the literature, is very important to teachers of very young children. The Children’s Behavior Questionnaire, initially designed by Rothbart, Ahadi, Hershey and Fisher (2001) for use in western countries, was modified in line with Rasch measurement theory, revised for suitability with Hong Kong preschool children, and conceptually ordered from easy to hard along a continuum of attitude/behavior for negative emotionality, before data collection. Three ordered scoring categories (never or rarely scored 1, on some occasions scored 2, and on many occasions scored 3) were used. Data were collected from preschool teachers for N = 628 preschool children from 32 schools in Hong Kong and analyzed with the 2010 Rasch unidimensional measurement model computer program (RUMM2030). The item-trait interaction probability is 0.05 (chi square = 101.88, df = 80) which indicates that there is reasonable agreement about the different difficulties of the items along the scale for all the children. Results and implications are discussed, and revisions for the scale suggested.
Snijders’s Correction of Infit and Outfit Indexes with Estimated Ability Level: An Analysis with the Rasch Model
David Magis, Sébastien Béland, and Gilles Raîche
The Infit mean square W and the Outfit mean square U are commonly used person fit indexes under Rasch measurement. However, they suffer from two major weaknesses. First, their asymptotic distribution is usually derived by assuming that the true ability levels are known. Second, such distributions are not even clearly stated for indexes U and W. Both issues can seriously affect the selection of an appropriate cut-score for person fit identification. Snijders (2001) proposed a general approach to correct some person fit indexes when specific ability estimators are used. The purpose of this paper is to adapt this approach to the U and W indexes. First, a brief sketch of the methodology and its application to U and W is proposed. Then, the corrected indexes are compared to their classical versions through a simulation study. The suggested correction yields controlled Type I errors against both conservatism and inflation, while the power to detect specific misfitting response patterns increases significantly.
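For readers unfamiliar with the two indexes, the following minimal sketch computes the classical (uncorrected) Outfit U and Infit W for one person under the dichotomous Rasch model, treating the ability level as known. It does not implement the Snijders correction studied in the paper; the function names and the example values are illustrative assumptions, not taken from the article.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def outfit_infit(responses, theta, difficulties):
    """Return (U, W): classical Outfit and Infit mean squares for one person.

    responses    -- list of 0/1 item scores
    theta        -- ability level (assumed known here, which is exactly the
                    simplification the abstract criticizes)
    difficulties -- list of item difficulties on the same logit scale
    """
    probs = [rasch_prob(theta, b) for b in difficulties]
    variances = [p * (1.0 - p) for p in probs]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probs)]

    # Outfit U: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit W: information-weighted mean square.
    infit = sum(sq_resid) / sum(variances)
    return outfit, infit


# Hypothetical example: a person who fails an easy item and passes a hard one.
u, w = outfit_infit([0, 1], theta=0.0, difficulties=[-2.0, 2.0])
```

Values near 1 are what the model expects; the aberrant pattern in the example gives values well above 1 for both indexes.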
Optimal Discrimination Index and Discrimination Efficiency for Essay Questions
Recommended guidelines for the discrimination index of multiple-choice questions are often indiscriminately applied to essay-type questions as well. The optimal discrimination index for an essay question under the normality condition is independently derived. With the passing mark at 50% of the total, the satisfactory region for the discrimination index of essay questions is between 0.12 and 0.31, instead of 0.40 or more as in the case of multiple-choice questions. The optimal discrimination index for an essay question is shown to increase in proportion to the range of scores. Discrimination efficiency is defined as the ratio of the observed discrimination index to the optimal discrimination index. Recommended guidelines for the discrimination index of essay questions are provided.
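The ratio described above can be illustrated with a short sketch. The abstract does not state how the observed index is computed, so the code assumes the common upper-lower group form (difference between the mean item scores of the top and bottom 27% of examinees, divided by the full mark). The optimal index is passed in as a number, because its derivation is not reproduced here; all names and values are hypothetical.

```python
def discrimination_index(item_scores, total_scores, max_mark, fraction=0.27):
    """Upper-lower group discrimination index for one essay question.

    Assumption: D = (mean of upper group - mean of lower group) / max_mark,
    with groups formed from the top and bottom `fraction` of total scores.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    mean_upper = sum(item_scores[i] for i in upper) / k
    mean_lower = sum(item_scores[i] for i in lower) / k
    return (mean_upper - mean_lower) / max_mark


def discrimination_efficiency(observed, optimal):
    """Ratio of the observed index to the optimal index."""
    if optimal <= 0:
        raise ValueError("optimal discrimination index must be positive")
    return observed / optimal


# Hypothetical data: ten examinees, essay question marked out of 10.
item = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
totals = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
d_observed = discrimination_index(item, totals, max_mark=10)
```

An efficiency close to 1 would mean the question discriminates about as well as the optimal value supplied allows.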
Vol. 15, No. 2 Summer 2014
Examining Rating Scales Using Rasch and Mokken Models for Rater-Mediated Assessments
Stefanie A. Wind
A variety of methods for evaluating the psychometric quality of rater-mediated assessments have been proposed, including rater effects based on latent trait models (e.g., Engelhard, 2013; Wolfe, 2009). Although information about rater effects contributes to the interpretation and use of rater-assigned scores, it is also important to consider ratings in terms of the structure of the rating scale on which scores are assigned. Further, concern with the validity of rater-assigned scores necessitates investigation of these quality control indices within student subgroups, such as gender, language, and race/ethnicity groups. Using a set of guidelines for evaluating the interpretation and use of rating scales adapted from Linacre (1999, 2004), this study demonstrates methods that can be used to examine rating scale functioning within and across student subgroups with indicators from Rasch measurement theory (Rasch, 1960) and Mokken scale analysis (Mokken, 1971). Specifically, this study illustrates indices of rating scale effectiveness based on Rasch models and models adapted from Mokken scaling, and considers whether the two approaches to evaluating the interpretation and use of rating scales lead to comparable conclusions within the context of a large-scale rater-mediated writing assessment. Major findings suggest that indices of rating scale effectiveness based on a parametric and nonparametric approach provide related, but slightly different, information about the structure of rating scales. Implications for research, theory, and practice are discussed.
Differential Item Functioning Analysis Using a Multilevel Rasch Mixture Model: Investigating the Impact of Disability Status and Receipt of Testing Accommodations
W. Holmes Finch and Maria E. Hernández Finch
The assessment of differential item functioning (DIF) remains an area of active research in psychometrics and educational measurement. In recent years, methodological innovations involving mixture Rasch models have provided researchers with an additional set of tools for more deeply understanding the root causes of DIF, while at the same time increased interest in the role of disabilities and accommodations has also made itself felt in the measurement community. The current study furthered work in both areas by using the newly described multilevel mixture Rasch model to investigate the presence of DIF associated with disability and accommodation status at both examinee and school levels for a 3rd grade language assessment. The results indicated that DIF was indeed present at both levels of analysis, and that it was associated with the presence of disabilities and the receipt of accommodations. Implications of these results for both practitioners and researchers are discussed.
Rater Effect Comparability in Local Independence and Rater Bundle Models
Edward W. Wolfe and Tian Song
A large body of literature exists describing how rater effects may be detected in rating data. In this study, we compared the flag and agreement rates for several rater effects based on calibration of a real data set under two psychometric models: the Rasch rating scale model (RSM) and the Rasch testlet-based rater bundle model (RBM). The results show that the RBM provided more accurate diagnoses of rater severity and leniency than did the RSM, which is based on the local independence assumption. However, the statistical indicators associated with rater centrality and inaccuracy remain consistent between these two models.
Improving the Individual Work Performance Questionnaire using Rasch Analysis
Linda Koopmans, Claire M. Bernaards, Vincent H. Hildebrandt, Stef van Buuren, Allard J. van der Beek, and Henrica C.W. de Vet
Recently, the Individual Work Performance Questionnaire (IWPQ) version 0.2 was developed using Rasch analysis. The goal of the current study was to improve targeting of the IWPQ scales by including additional items. The IWPQ 0.2 (original) and 0.3 (including additional items) were examined using Rasch analysis. Additional items that showed misfit or did not improve targeting were removed from the IWPQ 0.3, resulting in a final IWPQ 1.0. Subsequently, the scales showed good model fit and reliability, and were examined for key measurement requirements (e.g., category ordering, unidimensionality, and differential item functioning). Finally, calculation and interpretability of scores were addressed. Compared to its previous version, the final IWPQ 1.0 showed improved targeting for two out of three scales. As a result, it can more reliably measure workers at all levels of ability, discriminate between workers at a wider range on each scale, and detect changes in individual work performance.
Influence of DIF on Differences in Performance of Italian and Asian Individuals on a Reading Comprehension Test of Spanish as a Foreign Language
Gerardo Prieto and Eloísa Nieto
Research into Differential Item Functioning (DIF) has been an active research area in language testing (Ferne and Rupp, 2007). In this study we analyzed the DIF of two groups with different types of native language (927 Italians and 280 Asians) in a reading comprehension task forming part of an exam in Spanish as a foreign language. The Mantel-Haenszel (MH) and Rasch procedures for the detection of uniform and nonuniform DIF were used. The results reveal that the Rasch model and MH yield substantially convergent results. Uniform DIF was detected in 6.6% of the items and nonuniform DIF in 16.7%. Half of the items affected by DIF favored the focal group (Asians) and the other half favored the reference group (Italians). The difference in test performance of the two groups did not appear to be affected by the elimination of items with DIF.
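As background on the MH procedure named above, the sketch below computes the standard Mantel-Haenszel common odds ratio for a single item across score-matched strata, together with its usual transformation to the ETS delta scale. It covers uniform DIF only and is not the authors' analysis; the function names and the counts are invented for illustration.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels.

    Each stratum is a tuple (a, b, c, d):
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no focal-correct/reference-incorrect pairs")
    return numerator / denominator


def mh_d_dif(alpha):
    """ETS delta-scale index; negative values indicate the item favors the reference group."""
    return -2.35 * math.log(alpha)


# Hypothetical counts for one item at two matched score levels.
strata = [(20, 10, 10, 10), (30, 10, 15, 10)]
alpha = mantel_haenszel_odds_ratio(strata)
delta = mh_d_dif(alpha)
```

An odds ratio of 1 (delta of 0) means that examinees of equal ability in the two groups have the same odds of answering the item correctly.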
Rasch Rating Scale Analysis of the Attitudes Toward Research Scale
Elena C. Papanastasiou and Randall Schumacker
College students may view research methods courses with negative attitudes; however, few studies have investigated this issue due to the lack of instruments that measure students’ attitudes towards research. Therefore, the purpose of this study was to examine the psychometric properties of an Attitudes Toward Research Scale using Rasch rating scale analysis. Assessment of attitudes toward research is essential to determine if students have negative attitudes towards research and to assist instructors in better facilitating the learning of research methods in their courses. The results of this study have shown that a thirty-item Attitudes Toward Research Scale yielded scores with high person and item reliability.
Measuring the Ability of Military Aircrews to Adapt to Perceived Stressors when Undergoing Centrifuge Training
Jenhung Wang, Pei-Chun Lin, and Shih-Chin Li
This study assessed the ability of military aircrews to adapt to stressors when undergoing centrifuge training and determined what equipment items caused perceived stress and needed to be upgraded. We used questionnaires and the Rasch model to measure aircrew personnel’s ability to adapt to centrifuge training. The measurement items were ranked by 611 military aircrew personnel. Analytical results indicated that the majority of the stress perceived by aircrew personnel resulted from the lightproof cockpit without outer reference. This study prioritized the equipment requiring updating as the lightproof cockpit design, the dim lighting of the cockpit, and the pedal design. A significant difference was found between pilot and non-pilot subjects’ stress from the pedal design, and a considerable association was discernible between the seat angle design and flight hours accrued. The study results provide aviators, astronauts, and air forces with reliable information as to which equipment items urgently need to be upgraded, as their present physiological and psychological effects can reduce the effectiveness of centrifuge training.