Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Article abstracts for Volumes 1 to 7 are available in PDF format. Just click on the links below.
Abstracts for Volume 1, 2000
Abstracts for Volume 2, 2001
Abstracts for Volume 3, 2002
Abstracts for Volume 4, 2003
Abstracts for Volume 5, 2004
Abstracts for Volume 6, 2005
Abstracts for Volume 7, 2006
Article abstracts for Volumes 8 to 16 are available in HTML format. Just click on the links below.
Abstracts for Volume 8, 2007
Abstracts for Volume 9, 2008
Abstracts for Volume 10, 2009
Abstracts for Volume 11, 2010
Abstracts for Volume 12, 2011
Abstracts for Volume 13, 2012
Abstracts for Volume 14, 2013
Abstracts for Volume 15, 2014
Abstracts for Volume 16, 2015
Current Volume Article Abstracts
Vol. 17, No. 1 Spring 2016
Assessing the Validity of a Continuum-of-care Survey: A Rasch Measurement Approach
Michael Peabody, Kelly D. Bradley, and Melba Custer
Satisfied patients are more likely to be compliant, have better outcomes, and are more likely to return to the same provider or institution for future care. The Satisfaction with a Continuum of Care survey (SCC) was designed to improve patient care using measures of patient satisfaction and facilitate a cultural shift from a “silos-of-care” to a “continuum-of-care” mentality by fostering inter-departmental communication as patients moved between environments of care at a Midwestern rehabilitation hospital. This study provides a Rasch measurement framework for investigating issues related to survey reliability and validity. The results indicate that although certain aspects of the survey seem to function in a psychometrically sound manner, the questions are too easy to endorse and provide little information to help improve patient care. Suggestions for future revisions to this survey instrument are provided.
What You Don’t Know Can Hurt You: Missing Data and Partial Credit Model Estimates
Sarah L. Thomas, Karen M. Schmidt, Monica K. Erbacher, and Cindy S. Bergeman
The authors investigated the effect of missing completely at random (MCAR) item responses on partial credit model (PCM) parameter estimates in a longitudinal study of Positive Affect. Participants were 307 adults from the older cohort of the Notre Dame Study of Health and Well-Being (Bergeman and Deboeck, 2014) who completed questionnaires including Positive Affect items for 56 days. Additional missing responses were introduced to the data, randomly replacing 20%, 50%, and 70% of the responses on each item and each day with missing values, in addition to the existing missing data. Results indicated that item locations and person trait level measures diverged from the original estimates as the level of degradation from induced missing data increased. In addition, standard errors of these estimates increased with the level of degradation. Thus, MCAR data does damage the quality and precision of PCM estimates.
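The degradation step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `induce_mcar`, the use of `None` to mark a missing response, and the choice to apply the proportion to the responses still observed on each item are assumptions, since the abstract does not specify these details. The sketch handles a single person-by-item matrix (one day of data).

```python
import random


def induce_mcar(responses, proportion, seed=0):
    """Return a copy of `responses` with extra values set to None, item by item.

    `responses` is a list of rows (persons) by columns (items); None marks a
    response that is already missing. For each item, `proportion` of the
    currently observed responses are chosen uniformly at random and replaced
    with None (assumption: the rate applies to observed responses only).
    """
    rng = random.Random(seed)
    degraded = [list(row) for row in responses]  # copy; input is left intact
    n_items = len(degraded[0]) if degraded else 0
    for item in range(n_items):
        observed = [p for p, row in enumerate(degraded) if row[item] is not None]
        n_remove = round(proportion * len(observed))
        for person in rng.sample(observed, n_remove):
            degraded[person][item] = None
    return degraded


# Hypothetical data: 10 persons, 4 items scored 0-4, no missing values.
data = [[(p + i) % 5 for i in range(4)] for p in range(10)]
degraded = induce_mcar(data, 0.5)
```

Because the deleted cells are chosen without reference to the response values or to any person characteristic, the induced missingness is completely at random by construction.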
Rasch Measurement of Collaborative Problem Solving in an Online Environment
Susan-Marie E. Harding and Patrick E. Griffin
This paper describes an approach to the assessment of human-to-human collaborative problem solving using a set of online interactive tasks completed by student dyads. Within the dyad, roles were nominated as either A or B and students selected their own roles. The question as to whether role selection affected individual student performance measures is addressed. Process stream data were captured from 3402 students in six countries who explored the problem space by clicking, dragging the mouse, moving the cursor and collaborating with their partner through a chat box window. Process stream data were explored to identify behavioural indicators that represented elements of a conceptual framework. These indicative behaviours were coded into a series of dichotomous items. These items represented actions and chats performed by students. The frequency of occurrence was used as a proxy measure of item difficulty. Then, given a measure of item difficulty, student ability could be estimated using the difficulty estimates of the range of items demonstrated by the student. The Rasch simple logistic model was used to review the indicators to identify those that were consistent with the assumptions of the model and were invariant across national samples, language, curriculum and age of the student. The data were analysed using one- and two-dimensional one-parameter models. Rasch separation reliability, fit to the model, distribution of students and items on the underpinning construct, estimates for each country and the effect of role differences are reported. This study provides evidence that collaborative problem solving can be assessed in an online environment involving human-to-human interaction using behavioural indicators shown to have a consistent relationship between the estimate of student ability and the probability of demonstrating the behaviour.
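For readers unfamiliar with the Rasch simple logistic model named in this abstract, the relationship between ability, difficulty, and the probability of demonstrating a behaviour can be sketched as below. The function names are hypothetical, and the log-odds conversion from frequency to difficulty is only a rough illustration of why rarer behaviours are treated as harder; it is an assumption and not the calibration procedure used in the study.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of demonstrating a dichotomous indicator under the
    Rasch model, with ability and difficulty on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def difficulty_from_frequency(proportion_observed):
    """Illustrative proxy (assumption): the log-odds of NOT observing the
    behaviour, so that rarer behaviours receive higher difficulty values."""
    return math.log((1.0 - proportion_observed) / proportion_observed)


# A behaviour shown by 20% of students is treated as harder than one
# shown by 80% of students.
rare = difficulty_from_frequency(0.2)
common = difficulty_from_frequency(0.8)
p_match = rasch_probability(ability=rare, difficulty=rare)
```

When ability equals difficulty, the model gives a probability of exactly one half, which is what anchors persons and items to a common scale.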
The Impact of Item Parameter Drift in Computer Adaptive Testing (CAT)
This study looked at numerous aspects of item parameter drift (IPD) and its impact on measurement in computer adaptive testing (CAT). A series of CAT simulations were conducted, varying the amount and magnitude of IPD, as well as the size of the item pool. The effects of IPD on measurement precision, classification, and test efficiency were evaluated using a number of criteria. These included bias, root mean square error (RMSE), absolute average difference (AAD), total percentages of misclassification, the number of false positives and false negatives, the total test lengths, and item exposure rates. The results revealed negligible differences when comparing the IPD conditions to the baseline condition for all measures of precision, classification accuracy, and test efficiency. The most relevant finding indicates that magnitude of drift has a larger impact on measurement precision than the number of items with drift.
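The three precision criteria named in this abstract are commonly computed as shown in the sketch below. These are the usual textbook definitions, taken over simulated examinees with known true abilities; the abstract does not give the study's exact formulas, so the definitions, function names, and example values here are assumptions.

```python
import math


def bias(estimates, truths):
    """Mean signed difference between estimated and true values."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)


def rmse(estimates, truths):
    """Root mean square error of the estimates."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)
    )


def aad(estimates, truths):
    """Absolute average difference: mean unsigned estimation error."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


# Hypothetical ability estimates for three simulated examinees whose
# true ability is 0.0 logits.
truths = [0.0, 0.0, 0.0]
estimates = [0.5, -0.5, 1.0]
summary = (bias(estimates, truths), rmse(estimates, truths), aad(estimates, truths))
```

Bias can be near zero while RMSE and AAD are large, because positive and negative errors cancel in the signed average; this is why simulation studies typically report all three.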
Exploring the Utility of Logistic Mixed Modeling Approaches to Simultaneously Investigate Item and Testlet DIF on Testlet-based Data
Hirotaka Fukuhara and Insu Paek
This study explored the utility of logistic mixed models for the analysis of differential item functioning when item response data were testlet-based. Decomposition of differential item functioning (DIF) into item level and testlet level for the testlet-based data was introduced to separate possible sources of DIF: (1) an item, (2) a testlet, and (3) both the item and the testlet. A simulation study was conducted to investigate the performance of several logistic mixed models as well as the Mantel-Haenszel method under conditions in which the item-related DIF and testlet-related DIF were present simultaneously. The results revealed that a new DIF model based on a logistic mixed model with random item effects and item covariates could capture the item-related DIF and testlet-related DIF well under certain conditions.
What Are You Measuring? Dimensionality and Reliability Analysis of Ability and Speed in Medical School Didactic Examinations
James J. Thompson
Summative didactic evaluation often involves multiple choice questions which are then aggregated into exam scores, course scores, and cumulative grade point averages. To be valid, each of these levels should have some relationship to the topic tested (dimensionality) and be sufficiently reproducible between persons (reliability) to justify student ranking. Evaluation of dimensionality is difficult and is complicated by the classic observation that didactic performance involves a generalized component (g) in addition to subtest specific factors. In this work, 183 students were analyzed over two academic years in 13 courses with 44 exams and 3352 questions for both accuracy and speed. Reliability at all levels was good (>0.95). Assessed by bifactor analysis, g effects dominated most levels resulting in essential unidimensionality. Effect sizes on predicted accuracy and speed due to nesting in exams and courses were small. There was little relationship between person ability and person speed. Thus, the hierarchical grading system appears warranted because of its g-dependence.
Applying the Rasch Model to Measure Mobility of Women: A Comparative Analysis of Mobility of Informal Workers in Fisheries in Kerala, India
Mobility or ‘freedom and ability to move’ is gendered in many cultural contexts. In this paper I analyse mobility associated with work from the capability approach perspective of Sen. This is an empirical paper which uses the Rasch Rating Scale Model (RSM) to construct the measure of mobility of women for the first time in the development studies discourse. I construct a measure of mobility (latent trait) of women workers engaged in two types of informal work, namely, peeling work and fish vending, in fisheries in the cultural context of India. The scale measure makes it possible, first, to test the unidimensionality of my construct of mobility of women and, second, to analyse the domains of mobility of women workers. The comparative analysis of the scale of permissibility of mobility constructed using the RSM for the informal women workers shows that women face constraints on mobility in social and personal spaces in the socially advanced state of Kerala in India. Work mobility does not expand the real freedoms, hence work mobility can be termed as ‘bounded capability’ which is a capability ‘limited or bounded’ by either the social, cultural and gender norms or a combination of all of these. Therefore at the macro level, growth in informal employment in sectors like fisheries which improve mobility of women through work mobility does not necessarily expand the capability sets by contributing to greater freedoms and transformational mobility. This paper makes a significant methodological contribution in that it uses an innovative method for the measurement of mobility of women in the development studies discipline.
Vol. 17, No. 2 Summer 2016
Creating a Physical Activity Self-Report Form for Youth using Rasch Methodology
Christine DiStefano, Russell Pate, Kerry McIver, Marsha Dowda, Michael Beets, and Dale Murrie
Measurement of youth’s physical activity levels is recommended to ensure that children are meeting recommended activity guidelines. This article describes the creation of an instrument to measure youth’s levels of physical activity, where a strong test validation perspective (Benson, 1998) was followed to create the scale. The development process involved a mixed-method (qualitative followed by quantitative) framework. First, focus groups were conducted, where results informed item creation. Next, three alternative forms were created with different response formats to measure children’s frequency of participation in various physical activities and intensity of participation. Lastly, a sample of over 500 middle school children was obtained, where three different response scales were investigated. The optimal scale considered measurement of physical activity using a three-point Likert frequency; intensity of activity participation did not strongly contribute to the measurement of children’s activity levels. The final version form is thought to be acceptable for use with children in surveillance and large-group studies, as well as in smaller sample applications.
Examining the Psychometric Quality of Multiple-Choice Assessment Items using Mokken Scale Analysis
Stefanie A. Wind
The concept of invariant measurement is typically associated with Rasch measurement theory (Engelhard, 2013). Concerned with the appropriateness of the parametric transformation upon which the Rasch model is based, Mokken (1971) proposed a nonparametric procedure for evaluating the quality of social science measurement that is theoretically and empirically related to the Rasch model. Mokken’s nonparametric procedure can be used to evaluate the quality of dichotomous and polytomous items in terms of the requirements for invariant measurement. Despite these potential benefits, the use of Mokken scaling to examine the properties of multiple-choice (MC) items in education has not yet been fully explored. A nonparametric approach to evaluating MC items is promising in that this approach facilitates the evaluation of assessments in terms of invariant measurement without imposing potentially inappropriate transformations. Using Rasch-based indices of measurement quality as a frame of reference, data from an eighth-grade physical science assessment are used to illustrate and explore Mokken-based techniques for evaluating the quality of MC items. Implications for research and practice are discussed.
A Practitioner’s Instrument for Measuring Secondary Mathematics Teachers’ Beliefs Surrounding Learner-Centered Classroom Practice
Alyson E. Lischka and Mary Garner
In this paper we present the development and validation of a Mathematics Teaching Pedagogical and Discourse Beliefs Instrument (MTPDBI), a 20-item partial-credit survey designed and analyzed using Rasch measurement theory. Items on the MTPDBI address beliefs about the nature of mathematics, teaching and learning mathematics, and classroom discourse practices. A Rasch partial credit model (Masters, 1982) was estimated from the pilot study data. Results show that item separation reliability is .96 and person separation reliability is .71. Other analyses indicate the instrument is a viable measure of secondary teachers’ beliefs about reform-oriented mathematics teaching and learning. This instrument is proposed as a useful measure of teacher beliefs for those working with pre-service and in-service teacher development.
Using the Rasch Model and k-Nearest Neighbors Algorithm for Response Classification
In this paper we propose using the k-nearest neighbors (k-NN) algorithm (Cover and Hart, 1967) for classifying and predicting the responses to dichotomous items. Using the percent correct statistic, we show how k-NN can be used with Rasch model parameter estimation methods such as joint maximum likelihood estimation (JMLE), conditional maximum likelihood estimation (CMLE), marginal maximum likelihood estimation (MMLE), and marginal Bayes modal estimation (MBME). We further suggest how one can use the algorithm to predict responses on future assessments. The empirical data set that we used to illustrate this procedure was the fraction subtraction data set from Tatsuoka (1984). Using R software, we show the accuracy and efficacy of k-NN for classifying responses.
Exploring Aberrant Responses using Person Fit and Person Response Functions
A. Adrienne Walker, George Engelhard, Jr., Mari-Wells Hedgpeth, and Kenneth D. Royal
Person fit statistics provide equivocal interpretations regarding aberrant responses. This study uses person response functions (PRF) to supplement the interpretation of person fit statistics. Sixty-three multiple-choice items were administered to a sample of persons (N=31) who used guessing strategies to answer them. After answering each item, participants indicated which guessing strategy they used. The data were analyzed with a Rasch (1960) model, where the item calibrations were anchored to values obtained when the items were appropriately administered. The participants showed poor model-data fit as expected. Further examination of person misfit using person response functions suggests that PRF can provide information about absolute person fit to a model, whereas fit statistics provide information about relative fit, given the other persons in the testing group. PRF can also provide information about where and how person responses misfit the model. This additional information can assist practitioners in using and interpreting individual scores appropriately.
Evaluation of the Bifactor Nominal Response Model Analysis of a Health Efficacy Measure
Zexuan Han and Kathleen Suzanne Johnson Preston
The bifactor nominal response item response theory (IRT) model, proposed by Cai, Yang and Hansen (2011), provides an extension of Bock’s (1972, 1997) unidimensional nominal response model to multidimensional IRT. This model has not been utilized in any published studies since its original development. In this study, the model was applied to data from a sample of college students (n = 799) to evaluate the psychometric properties of a health efficacy measure. The nominal response model has the unique capability to estimate the functioning of each single response category, and higher response categories were found to have better functioning in this study. Poor-functioning categories were identified and combined into their adjacent categories. Items with the revised response format showed improved functioning. The bifactor nominal response model is a useful tool for evaluation of bifactor scales with ordered but non-equivalently functioning categories.
Measurement Properties of the Nordic Questionnaire for Psychological and Social Factors at Work: A Rasch Analysis
C. Røe, K. Myhre, G. H. Marchand, B. Lau, G. Leivseth, and E. Bautz-Holter
The main aim of this study was to evaluate the measurement properties of the Nordic Questionnaire for Psychological and Social Factors at Work (QPS Nordic) and the domains of demand, control and support. The Rasch analysis (RUMM 2030) was based on responses from 226 subjects with back pain who completed the QPS Nordic dimensions of demand, control, and social support (30 items) at one-year follow-up. The Rasch analysis revealed disordered thresholds in a total of 25 of the 30 items. The domains of demand, control and support fit the Rasch model when analyzed separately. The demand domain was well targeted, whereas patients with current neck and back pain had lower control and higher support than reflected by the questions. Two items revealed DIF by gender; otherwise, invariance across age, gender, occupation and sick-leave was documented. The demand, control, and support domains of QPS Nordic comprised unidimensional constructs with adequate measurement properties.
Ben Wright: A wisp of greatness: Brief photographic review of his life and times
With Ben Wright's death in October 2015, the Rasch measurement community lost its most enthusiastic and dedicated supporter. In honor of Ben's great contribution to Rasch measurement, Nick Bezruczko was invited to write a memorial article about Ben's life, education, and scholarship.