Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

 


 

Volume 20 Article Abstracts

 

Vol. 20, No. 1, Spring 2019

The Effects of Probability Threshold Choice on an Adjustment for Guessing using the Rasch Model

Glenn Thomas Waterbury and Christine E. DeMars

Abstract

This paper investigates a strategy for accounting for correct guessing with the Rasch model that we entitled the Guessing Adjustment. This strategy involves the identification of all person/item encounters where the probability of a correct response is below a specified threshold. These responses are converted to missing data and the calibration is conducted a second time. This simulation study focuses on the effects of different probability thresholds across varying conditions of sample size, amount of correct guessing, and item difficulty. Biases, standard errors, and root mean squared errors were calculated within each condition. Larger probability thresholds were generally associated with reductions in bias and increases in standard errors. Across most conditions, the reduction in bias was more impactful than the decrease in precision, as reflected by the RMSE. The Guessing Adjustment is an effective means for reducing the impact of correct guessing and the choice of probability threshold matters.
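The masking step described above can be sketched as follows. This is only an illustration under stated assumptions: the person and item estimates are taken as given from a first calibration, the function names are hypothetical, and the recalibration itself (which would be run in Rasch software) is not shown.

```python
import math

def rasch_probability(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def apply_guessing_adjustment(responses, thetas, difficulties, threshold):
    """Return a copy of `responses` with low-probability encounters set to None.

    responses: list of rows (one per person) of 0/1 scores.
    thetas, difficulties: estimates from a first calibration (assumed given).
    threshold: probability below which an encounter is treated as missing.
    """
    adjusted = []
    for row, theta in zip(responses, thetas):
        new_row = []
        for score, b in zip(row, difficulties):
            p = rasch_probability(theta, b)
            # Encounters below the threshold become missing data (None).
            new_row.append(None if p < threshold else score)
        adjusted.append(new_row)
    return adjusted

# Hypothetical toy data: 2 persons, 3 items.
responses = [[1, 1, 0], [1, 0, 1]]
thetas = [0.0, -1.0]
difficulties = [-1.0, 0.0, 2.0]
adjusted = apply_guessing_adjustment(responses, thetas, difficulties, threshold=0.25)
```

The adjusted matrix would then be passed to a second calibration. A larger threshold masks more encounters, which is consistent with the trade-off the abstract reports between reduced bias and larger standard errors.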

____________________

Quantifying Item Invariance for the Selection of the Least Biased Assessment

W. Holmes Finch, Brian F. French, and Maria E. Hernandez Finch

Abstract

An important aspect of educational and psychological measurement and evaluation of individuals is the selection of scales with appropriate evidence of reliability and validity for inferences and uses of the scores for the population of interest. One aspect of validity is the degree to which a scale fairly assesses the construct(s) of interest for members of different subgroups within the population. Typically, this issue is addressed statistically through assessment of differential item functioning (DIF) of individual items, or differential bundle functioning (DBF) of sets of items. When selecting an assessment to use for a given application (e.g., measuring intelligence), or which form of an assessment to use in a given instance, researchers need to consider the extent to which the scales work with all members of the population. Little research has examined methods for comparing the amount or magnitude of DIF/DBF present in two assessments when deciding which assessment to use. The current simulation study examines 6 different statistics for this purpose. Results show that a method based on the random effects item response theory model may be optimal for instrument comparisons, particularly when the assessments being compared are not of the same length.

____________________

Rasch Model Calibrations with SAS PROC IRT and WINSTEPS

Ki Cole and Insu Paek

Abstract

The WINSTEPS software is widely used for Rasch model calibrations. Recently, SAS/STAT® released the PROC IRT procedure for IRT analysis, including Rasch. The purpose of the study is to compare the performance of the PROC IRT procedure with WINSTEPS to calibrate dichotomous and polytomous Rasch models in order to diagnose the possibility of using PROC IRT as a viable alternative. A simulation study was used to compare the two programs in terms of the convergence rate, run time, item parameter estimates, and ability estimates with different test lengths and sample sizes. Implications of the results and the features of each software are discussed for research applications and large-scale assessment.

____________________

Student Perceptions of Grammar Instruction in Iranian Secondary Education: Evaluation of an Instrument using Rasch Measurement Theory

Stefanie A. Wind, Behzad Mansouri, and Parvaneh Yaghoubi Jami

Abstract

Isolated and integrated grammar instruction are two approaches to grammar teaching that can be implemented within a form-focused instruction (FFI) framework. In both approaches, instructors primarily concentrate on meaning, and the difference is in the timing of instruction on specific language forms. In previous studies, researchers have observed that the match between teachers’ and learners’ beliefs related to the effectiveness of instructional approaches is an important component in predicting the success of grammar instruction. In this study, we report on the psychometric properties of a questionnaire designed to measure students’ perceptions of isolated and integrated FFI taking place in Iranian secondary schools. The Iranian context is interesting with regard to approaches to grammar instruction in light of recent policy reforms that emphasize isolated FFI. Using a combination of principal components analysis and Rasch measurement theory techniques, we observed that Iranian students distinguish among the two forms of grammar instruction. Looking within each approach, we observed significant differences among individual students as well as differences in the difficulty for students to endorse different instructional activities related to both isolated and integrated instruction. Together, our findings highlight the importance of examining students’ beliefs about the effectiveness of approaches to grammar instruction within different instructional contexts. We discuss implications for research and practice.

____________________

Computer Adaptive Test Stopping Rules Applied to the Flexilevel Shoulder Functioning Test

Trenton J. Combs, Kyle W. English, Barbara G. Dodd, and Hyeon-Ah Kang

Abstract

Computerized adaptive testing (CAT) is an attractive alternative to traditional paper-and-pencil testing because it can provide accurate trait estimates while administering fewer items than a linear test form. A stopping rule is an important factor in determining an assessment's efficiency. This simulation compares three variable-length stopping rules, namely standard error (SE) of .3, minimum information (MI) of .7, and change in trait (CT) of .02, each with and without a maximum number of items (20) imposed. We use fixed-length criteria of 10 and 20 items as a comparison for two versions of a linear assessment. The MI rules resulted in longer assessments with more biased trait estimates in comparison to other rules. The CT rule resulted in more biased estimates at the higher end of the trait scale and larger standard errors. The SE rules performed well across the trait scale in terms of both measurement precision and efficiency.
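As a rough illustration of how the three variable-length rules might be checked after each administered item, consider the sketch below. The abstract gives only the cut values (.3, .7, .02) and the 20-item cap. The operational definitions are assumptions based on common usage: SE stops once the standard error is at or below the cut, MI stops once no remaining item offers at least the cut amount of information, and CT stops once the estimate changes by less than the cut. The function name and signature are hypothetical.

```python
def should_stop(rule, n_items, se, max_remaining_info, theta_change, max_items=None):
    """Decide whether a variable-length CAT should stop after the current item.

    rule: "SE", "MI", or "CT".
    n_items: number of items administered so far.
    se: standard error of the current trait estimate.
    max_remaining_info: largest information any unused item offers at the
        current trait estimate.
    theta_change: change in the trait estimate from the previous item.
    max_items: optional cap on test length (None means no cap).
    """
    # An imposed maximum test length overrides every rule.
    if max_items is not None and n_items >= max_items:
        return True
    if rule == "SE":
        return se <= 0.3                    # assumed: precision target reached
    if rule == "MI":
        return max_remaining_info < 0.7     # assumed: no informative item left
    if rule == "CT":
        return abs(theta_change) < 0.02     # assumed: estimate has stabilized
    raise ValueError(f"unknown rule: {rule}")
```

For example, `should_stop("SE", 20, 0.35, 1.0, 0.5, max_items=20)` returns `True` only because the cap has been reached, not because the precision target was met.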

____________________

Examining Rater Judgements in Music Performance Assessment using Many-Facets Rasch Rating Scale Measurement Model

Pey Shin Ooi and George Engelhard, Jr.

Abstract

The fairness of raters in music performance assessment has become an important concern in the field of music. The assessment of students’ music performance depends in a fundamental way on rater judgements. The quality of rater judgements is crucial to provide fair, meaningful and informative assessments of music performance. There are many external factors that can influence the quality of rater judgements. Previous research has used different measurement models to examine the quality of rater judgements (e.g., generalizability theory). There are limitations with the previous analysis methods that are based on classical test theory and its extensions. In this study, we use modern measurement theory (Rasch measurement theory) to examine the quality of rater judgements. The many-facets Rasch rating scale model is employed to investigate the extent of rater-invariant measurement in the context of music performance assessments related to university degrees in Malaysia (159 students rated by 24 raters). We examine the rating scale structure, the severity levels of the raters, and the judged difficulty of the items. We also examine the interaction effects across musical instrument subgroups (keyboard, strings, woodwinds, brass, percussions and vocal). The results suggest that there were differences in severity levels among the raters. The results of this study also suggest that raters had different severity levels when rating different musical instrument subgroups. The implications for research, theory and practice in the assessment of music performance are included in this paper.

____________________

Examining Differential Item Functioning in the Household Food Insecurity Scale: Does Participation in SNAP Affect Measurement Invariance?

Victoria T. Tanaka, George Engelhard, Jr., and Matthew P. Rabbitt

Abstract

The Household Food Security Survey Module (HFSSM) is a scale used by the U.S. Department of Agriculture to measure the severity of food insecurity experienced by U.S. households. In this study, measurement invariance of the HFSSM is examined across households based on participation in the Supplemental Nutrition Assistance Program (SNAP). Households with children who responded to the HFSSM in 2015 and 2016 (N = 3,931) are examined. The Rasch model is used to analyze differential item functioning (DIF) related to SNAP participation. Analyses suggest a small difference in reported food insecurity between SNAP and non-SNAP participants (27% versus 23%, respectively). However, the size and direction of the DIF mitigate the impact on overall estimates of household food insecurity. Person-fit indices suggest that the household aberrant response rate is 6.6% and the number of misfitting households is comparable for SNAP (6.80%) and non-SNAP participants (6.30%). Implications for research and policy related to food insecurity are discussed.

____________________

Accuracy and Utility of the AUDIT-C with Adolescent Girls and Young Women (AGYW) Who Engage in HIV Risk Behaviors in South Africa

Tracy Kline, Corina Owens, Courtney Peasant Bonner, Tara Carney, Felicia A. Browne, and Wendee M. Wechsberg

Abstract

Hazardous drinking is a risk factor associated with sexual risk, gender-based violence, and HIV transmission in South Africa. Consequently, sound and appropriate measurement of drinking behavior is critical to determining what constitutes hazardous drinking. Many research studies use internal consistency estimates as the determining factor in psychometric assessment; however, deeper assessments are needed to best define a measurement tool. Rasch methodology was used to evaluate a shorter version of the Alcohol Use Disorders Identification Test, the AUDIT-C, in a sample of adolescent girls and young women (AGYW) who use alcohol and other drugs in South Africa (n = 100). Investigations of operational response range, item fit, sensitivity, and response option usage provide a richer picture of AUDIT-C functioning than internal consistency alone in women who are vulnerable to hazardous drinking and therefore at risk of HIV. Analyses indicate that the AUDIT-C does not adequately measure this specialized population, and that more validation is needed to determine if the AUDIT-C should continue to be used in HIV prevention intervention studies focused on vulnerable adolescent girls and young women.

____________________

 

Vol. 20, No. 2, Summer 2019

Loevinger on Unidimensional Tests with Reference to Guttman, Rasch, and Wright

Mark H. Stone and A. Jackson Stenner

Abstract

Loevinger’s specifications for a unidimensional test are discussed. The implications are reviewed using commentary from Guttman’s and Rasch’s specification for specific objectivity. A large population is sampled to evaluate the implications of this approach in light of Wright’s early presentation regarding data analysis. The results of this analysis show the sample follows the specifications of Loevinger and those of Rasch for a unidimensional test.

____________________

Standard-Setting Procedures for Counts Data

Rianne Janssen, Jorge González, and Ernesto San Martín

Abstract

An examinee- and an item-centered procedure are proposed to set cut scores for counts data. Both procedures assume that the counts data are modelled according to the Rasch Poisson counts model (RPCM). The examinee-centered method is based on Longford’s (1996) approach and links contrasting-groups judgements to the RPCM ability scale using a random logistic regression model. In the item-centered method, the judges are asked to describe the minimum performance level of the minimally competent student by giving the minimum number of correct responses (or, equivalently, the maximum number of admissible errors). On the basis of these judgements for each subtest, the position of the minimally competent student on the RPCM ability scale is estimated. Both procedures are illustrated with a standard-setting study on mental arithmetic for students at the end of primary education.

____________________

Expected Values for Category-To-Measure and Measure-To-Category Statistics: A Simulation Study

Eivind Kaspersen

Abstract

There are many sources of evidence for a well-functioning rating-scale. Two of these sources are analyses of measure-to-category and category-to-measure statistics. An absolute cut-value of 40% for these statistics has been suggested. However, no evidence exists in the literature that this value is appropriate. Thus, this paper discusses the results of simulation studies that examined the expected values in different contexts. The study concludes that a static cut-value of 40% should be replaced with expected values for measure-to-category and category-to-measure analyses.

____________________

Missing Data and the Rasch Model: The Effects of Missing Data Mechanisms on Item Parameter Estimation

Glenn Thomas Waterbury

Abstract

This simulation study explores the effects of missing data mechanisms, proportions of missing data, sample size, and test length on the biases and standard errors of item parameters using the Rasch measurement model. When responses were missing completely at random (MCAR) or missing at random (MAR), item parameters were unbiased. When responses were missing not at random (MNAR), item parameters were severely biased, especially when the proportion of missing responses was high. Standard errors were primarily affected by sample size, with larger samples associated with smaller standard errors. Standard errors were inflated in MCAR and MAR conditions, while MNAR standard errors were similar to what they would have been, had the data been complete. This paper supports the conclusion that the Rasch model can handle varying amounts of missing data, provided that the missing responses are not MNAR.
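The distinction among the three mechanisms can be illustrated with a small sketch. The abstract does not describe how missingness was generated, so the specific rules below (missingness tied to an observed group indicator for MAR, and to the unobserved response itself for MNAR) are assumptions chosen only to show the idea. All names are hypothetical.

```python
import random

def make_missing(responses, groups, mechanism, rate, rng):
    """Return a copy of `responses` with some entries replaced by None.

    responses: list of rows (one per person) of 0/1 scores.
    groups: observed 0/1 covariate per person (used only for MAR).
    mechanism: "MCAR", "MAR", or "MNAR".
    rate: base probability of missingness (keep at or below 0.5 here).
    rng: a random.Random instance, so results are reproducible.
    """
    out = []
    for row, group in zip(responses, groups):
        new_row = []
        for score in row:
            if mechanism == "MCAR":
                p = rate                                # unrelated to anything
            elif mechanism == "MAR":
                p = 2 * rate if group == 1 else 0.0     # depends on observed data only
            elif mechanism == "MNAR":
                p = 2 * rate if score == 0 else 0.0     # depends on the response itself
            else:
                raise ValueError(f"unknown mechanism: {mechanism}")
            new_row.append(None if rng.random() < p else score)
        out.append(new_row)
    return out

# Hypothetical toy data: 4 persons, 3 items.
responses = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
groups = [0, 1, 0, 1]
mnar = make_missing(responses, groups, "MNAR", 0.3, random.Random(1))
```

Under the MNAR rule assumed here, only incorrect responses can go missing, so the observed data overstate performance, which is the kind of dependence that would be expected to bias item parameters.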

____________________

Cross-Cultural Comparisons of School Leadership using Rasch Measurement

Sijia Zhang and Stefanie A. Wind

Abstract

School leadership influences school conditions and organizational climate; these conditions in turn impact student outcomes. Accordingly, examining differences in principals’ perceptions of leadership activities within and across countries may provide insight into achievement differences. The major purpose of this study was to explore differences in the relative difficulty of principals’ leadership activities across four countries that reflect Asian and North American national contexts: (1) Hong Kong SAR, (2) Chinese Taipei, (3) the United States, and (4) Canada. We also sought to illustrate the use of Rasch measurement theory as a modern measurement approach to exploring the psychometric properties of a leadership survey, with a focus on differential item functioning. We applied a rating scale formulation of the Many-facet Rasch model to principals’ responses to the Leadership Activities Scale in order to examine the degree to which the overall ordering of leadership activities was invariant across the four countries. Overall, the results suggested that there were significant differences in the difficulty ordering of leadership activities across countries, and that these differences were most pronounced between the two continents. Implications are discussed for research and practice.

____________________

Development of a Mathematics Self-Efficacy Scale: A Rasch Validation Study

Song Boon Khing and Tay Eng Guan

Abstract

The main objective of this study is to develop and validate a sources of mathematics self-efficacy (SMSE) scale to be used in a polytechnic adopting Problem Based Learning (PBL) as its main instructional strategy. Based on a socio-constructivist learning approach, PBL emphasizes collaborative and self-directed learning. A non-experimental cross-sectional design using a questionnaire was employed in this study. The validation process was conducted over three phases. Phase 1 was the initial development stage to generate a pool of items in the questionnaire. In Phase 2, a pilot test was performed to obtain qualitative and quantitative feedback to refine the initial pool of items in the questionnaire. Finally, in Phase 3, the revised scale was administered to the main student cohort taking the mathematics module. The collected data from the questionnaire was subjected to empirical scrutiny, including exploratory factor analysis (EFA) and Rasch analysis. The participants for this study were first year polytechnic students taking a mathematics module. There were 29 participants taking part in Phase 2 of the study, comprising 12 (41%) females and 17 (59%) males. For Phase 3, there were 161 participants, comprising 91 (57%) males and 70 (43%) females. The EFA yielded a three-factor solution, comprising (a) personal experience; (b) vicarious experience; and (c) psychological states. The items in the SMSE scale demonstrated good internal consistency and reliability. The results from the Rasch rating scale analysis showed acceptable item and person fit statistics. The final 23-item SMSE scale was found to be invariant across gender. Finally, the study showed that the SMSE scale is a psychometrically reliable and valid instrument to measure the sources of mathematics self-efficacy among students. PBL educators could use the results from the SMSE scale in the study to adopt appropriate interventions in curriculum design and delivery to boost self-efficacy of students and hence improve their mathematics achievement.

____________________

Lucky Guess? Applying Rasch Measurement Theory to Grade 5 South African Mathematics Achievement Data

Sarah Bansilal, Caroline Long, and Andrea Juan

Abstract

The use of multiple-choice items in assessments in the interest of increased efficiency brings associated challenges, notably the phenomenon of guessing. The purpose of this study is to use Rasch measurement theory to investigate the extent of guessing in a sample of responses taken from the Trends in International Mathematics and Science Study (TIMSS) 2015. A method of checking the extent of the guessing in test data, a tailored analysis, is applied to the data from a sample of 2188 learners on a subset of items. The analysis confirms prior research that showed that as the difficulty of the item increases, the probability of guessing also increases. An outcome of the tailored analysis is that items at the high proficiency end of the continuum increase in difficulty. A consequence of item difficulties being estimated as relatively lower than they would be without guessing is that learner proficiency at the higher end is underestimated while the achievement of learners with lower proficiencies is overestimated. Hence, it is important that finer analysis of systemic data takes into account guessing, so that more nuanced information can be obtained to inform subsequent cycles of education planning.

____________________

A Note on the Relation between Item Difficulty and Discrimination Index

Xiaofeng Steven Liu

Abstract

Item difficulty and discrimination index are often used to evaluate test items and diagnose possible issues in true score theory. The two statistics are more related than the literature suggests. In particular, the discrimination index can be mathematically determined by the item difficulty and the correlation between the item performance and the total test score.

____________________

 

Vol. 20, No. 3, Fall 2019

Psychometric Validation of the 10-Item USDA Food Security Scale for Use with College Students

Allison J. Ames and Tracey M. Barnett

Abstract

Food insecurity is defined as inadequate access to food due to limited resources. Studies regarding college student food insecurity have shown consistently higher rates than the rest of the nation. Many of these studies measure food insecurity using the United States Department of Agriculture’s Adult Food Security Survey Module. Despite its prevalence, the module has not been evaluated for use with the college student population. This study uses Rasch analysis, which underlies the current food insecurity classification approach used by the Department of Agriculture, to investigate the Adult Food Security Survey Module’s psychometric properties. A sample of 511 students from a public university in the South was used. Findings indicate that the requirements of the Rasch model do not hold for the module with college students. Specifically, the requirements of equal item discrimination and unidimensionality were violated, and moderate to large differential item functioning was present.

____________________

A Validity Study Applying the Rasch Model to the American Association for the Advancement of Science Force and Motion Sub-Topic Assessment for Middle School Students

Kristin L. K. Koskey, Nidaa Makki, Wondimu Ahmed, Nicholas G. Garafolo, Donald P. Visco, Jr., Benjamin G. Kruggel, and Katrina Halasa

Abstract

The purpose of this research was to investigate the reliability of the scores produced and validity of the inferences drawn from the American Association for the Advancement of Science (AAAS, 2018) force and motion sub-topic assessment for middle school students. The assessment of student outcomes in STEM is an international focus in K-12 education. Project 2061, initiated by the AAAS, focuses on addressing challenges related to standards and assessments. This study informs this effort through testing a 14-item multiple-choice test constructed of questions from the AAAS item bank. Two samples of eighth-grade students participated (N = 1777). Rasch analysis applying the dichotomous model (Rasch, 1960) indicated sufficient item separation and reliability. Thirteen items fit the Rasch model and one item was removed for misfit. Further support for construct validity was observed with 78% of item ordering aligned with that predicted by physics educators and stability of measures for 11 items across the two samples. One item exhibited significant differential item functioning by gender and minority status in science. After inspection by physics educators, no bias in item wording or context was determined. Recommendation for additional items is made to increase item targeting and variance explained by the Rasch linear measure.

____________________

Rasch Analysis of Catholic School Survey Data

Stephen M. Ponisciak and Monica J. Kowalski

Abstract

Tracking changes over time, especially as schools transition to a new operating model, is important to understand the effects of the model on students’ perceptions and experiences. An accurate measure of such changes requires a stable measure of item difficulty, which the Rasch model can provide. The University of Chicago Consortium on School Research has applied Rasch analysis to surveys in Chicago Public Schools since 1991; these surveys (in whole or in part) are now used in many non-public schools as well. We examine Rasch measures derived from these survey data in 15 Catholic schools in five U.S. communities, and compare them to previous results from public schools. We also study changes in these measures over time, and their relationship to student academic outcomes. In our sample, the student and teacher surveys provide reliable individual school climate measures, but we are unable to differentiate between schools, likely due to the homogeneity of the schools.

____________________

Validating a Measure of Numeracy Skill Use in the Workplace for Incarcerated and Household Adults

Emily D. Buehler and Maria Pampaka

Abstract

The aim of this study is to construct a measure of numeracy skill use in the workplace for incarcerated and household adults. The 2012/2014 Programme for the International Assessment of Adult Competencies (PIAAC) Survey of Adult Skills asked about the type and frequency of numeracy tasks performed as part of one’s job to nationally-representative incarcerated and household adult samples. This paper takes these items from this survey and focuses on the validation of a measure of numeracy skill use in the workplace using the principles of the Rasch rating scale model. In the interest of exploring options for strengthened validity, response categories were collapsed to produce an optimal categorization structure. Findings suggest an instrument to measure numeracy skill use in prison and free market workplaces could potentially be improved with fewer response categories and more items that ask about a broader range of numeracy skills.

____________________

Using Confidence Intervals of the Item and Test Information Functions to Test Differential Item and Test Functioning: Visual and Statistical Analyses

Georgios D. Sideridis, Ioannis Tsaousis, and Khaleel Al. Harbi

Abstract

The purpose of the present paper was twofold: (a) to use 95% confidence intervals of the item and test information functions as a means of visualizing differences between groups on the information provided at the item and test levels, and, (b) to statistically compare item and test information functions as a method for evaluating differential item and differential test functioning. Participants were 2,305 high school students who took a Mathematics National entrance examination in Saudi Arabia. Item and test information functions, conditional standard errors of measurement and reliability were estimated for both males and females. Differences between groups became evident when plotting 95% confidence intervals of the item and test information functions and the visual findings were confirmed using population-based Z-tests of point estimates using a Monte-Carlo simulation. It was concluded that differential group behavior at the item and test levels can be evidenced using information functions and inferential tests of significance can be constructed using the bootstrap distribution. The current procedure involves both item difficulties and discrimination indices and provides increased sensitivity over the traditional methods relying on item difficulties only.

____________________

Examining Parameter Estimation when Treating Semi-Mixed Multidimensional Constructs as Unidimensional

Sakine Gocer Sahin, Selahattin Gelbal, and Cindy M. Walker

Abstract

In this study, parameter estimation error was examined when three-dimensional tests of a semi-mixed structure were estimated unidimensionally. Since previous studies have generally focused on two-dimensional mixed structured tests or three-dimensional approximately simple structured tests, this study adds to the literature by considering the impact of fitting a unidimensional model to multidimensional data using a test structure that has not previously been considered. Test structure, interdimensional correlation, difficulty of the test, and different underlying distributions of ability were considered. Test length was set at 30 items for all conditions. Although test length was fixed, the number of approximately simple and complex items varied. Under all conditions for both moderately difficult and difficult tests, the lowest error values for all discrimination parameters, with the exception of MDISC, were obtained, surprisingly, with a correlation of 0.00. The lowest RMSE values for the difficulty parameter were obtained for tests of medium difficulty when the underlying ability distribution was simulated as standard normal for all three dimensions. The estimation errors associated with the difficulty parameter were greatly impacted by differences in the underlying ability distributions. Ability estimation errors associated with the unidimensional estimate of ability decreased as the correlation between dimensions increased.

____________________

Dimensionality of the Russian CORE-OM from a Rasch Perspective

Marina Zeldovich, Andrey A. Ivanov, and Rainer W. Alexandrowicz

Abstract

The evaluation of outcomes in mental health care embraces evaluation, quality assurance, and progress measurement of treatments. The Clinical Outcome in Routine Evaluation – Outcome Measure (CORE-OM) is an outcome-focused self-assessment instrument, comprising 34 items covering four scales: well-being, problems, functioning, and risk. The questionnaire has been translated into 52 languages, including Russian. Despite its broad application, the dimensionality of the CORE-OM deserves some further research. Thus, the present study examines the dimensionality of the Russian CORE-OM using the multidimensional random coefficients multinomial logit model (MRCMLM) based on data of N = 240 patients. The results indicate the need for further research on factorial structure and response formats of the CORE-OM. In addition, differential item functioning was found for gender and diagnostic groups, suggesting separate test norms. Again, the MRCMLM and the Test Analysis Modules (TAM) package have proven valuable tools for investigating a questionnaire’s psychometric properties.

____________________

 

Vol. 20, No. 4, Winter 2019

Evaluating Angoff Method Structured Training Judgments by the Rasch Model

Ifeoma C. Iyioke

Abstract

This paper presents a methodology for evaluating the judgments of participants of a Completely Structured Training (CST) on the Angoff standard setting method based on the Rasch measurement model. The CST was designed for K-12 teachers on the strategy of judging their students’ performance. The evaluation examines judgments for reflecting realistic expectations based on bootstrap sampling distribution of the Rasch measurement model difficulty estimates of the test items. Additionally, the report includes a study application of the methodology and recommendations.

____________________

An Examination of Sensitivity to Measurement Error in Rasch Residual-based Fit Statistics

R. Noah Padgett and Grant B. Morgan

Abstract

The purpose of this paper is to examine the sensitivity of commonly used Rasch fit measures to different distributions of error in item responses. Using Monte Carlo methods, we generated 10 different measurement error conditions within the Rasch rating scale model or partial credit model, and we recorded the estimates of INFIT MNSQ, OUTFIT MNSQ, and person separation reliability for each error distribution condition. INFIT MNSQ and OUTFIT MNSQ were not sensitive to error distributions when the distribution was the same across items. When the error distribution varies across items, INFIT MNSQ and OUTFIT MNSQ detected items with higher levels of measurement error as potentially misfitting. The Rasch person separation reliability statistic was sensitive to varying levels of measurement error, as expected. Our findings have implications for the use of fit measures in diagnosing model misfit.
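For readers unfamiliar with the two residual-based statistics, the sketch below computes them for a single dichotomous item using their standard definitions. OUTFIT MNSQ is the unweighted mean of squared standardized residuals, and INFIT MNSQ is the information-weighted version. This is a simplification: the study used rating scale and partial credit models, whereas the sketch assumes the dichotomous Rasch model with person and item estimates taken as given. Function names are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(scores, thetas, b):
    """Return (infit_mnsq, outfit_mnsq) for one dichotomous item.

    scores: 0/1 responses of each person to the item.
    thetas: person ability estimates (assumed already estimated).
    b: item difficulty estimate (assumed already estimated).
    """
    sq_resid, variances = [], []
    for x, theta in zip(scores, thetas):
        p = rasch_p(theta, b)
        sq_resid.append((x - p) ** 2)        # squared raw residual
        variances.append(p * (1.0 - p))      # model variance of the response
    # OUTFIT: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(scores)
    # INFIT: squared residuals weighted by model variance (information).
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit

# Hypothetical example: four persons located exactly at the item difficulty.
infit, outfit = item_fit([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)
```

Values near 1 indicate responses about as noisy as the model expects. Because OUTFIT is unweighted, a single unexpected response from a person far from the item's difficulty inflates it much more than INFIT.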

____________________

Identifying Bullied Youth: Re-engineering the Child-Adolescent Bullying Scale into a Brief Screen

Judith A. Vessey, Tania D. Strout, Rachel L. Difazio, and Larry H. Ludlow

Abstract

While youth bullying is a critical public health problem, standardized exposure screening is not routinely practiced. The Child-Adolescent Bullying Scale (CABS), a psychometrically robust 22-item tool, was designed and evaluated for this purpose using classical test theory. The goals of the present study were to examine and optimize the measurement properties of the CABS using a Rasch psychometric analysis to develop a brief screening tool appropriate for clinical use. A methodologic design and the Rasch rating scale model were employed. Three hundred and fifty-two youths from two clinical sites participated. Rasch-based analyses included evaluation of response category functioning, measurement precision, dimensionality, targeting, differential item functioning, and guidance in item reduction. After iterative revisions, the resulting screening instrument consists of 9 items. Cut-scores and interpretive guidance are provided to aid clinical identification of bullying-related risk. Findings suggest the CABS-9 holds promise as a useful screening tool for identifying bullying exposure.

____________________

Priors in Bayesian Estimation under the Rasch Model

Seock-Ho Kim, Allan S. Cohen, Minho Kwak, and Juyeon Lee

Abstract

A review of various priors used in Bayesian estimation under the Rasch model is presented together with clear mathematical definitions of the hierarchical prior distributions. A Bayesian estimation method, Gibbs sampling, was compared with conditional, marginal, and joint maximum likelihood estimation methods using the Knox Cube Test data under the Rasch model. The shrinkage effect of the priors on item and ability parameter estimates was also investigated using the Knox Cube Test data. In addition, item response data for a mathematics test with 14 items by 765 examinees were analyzed with the joint maximum likelihood estimation method and Gibbs sampling under the Rasch model. Both methods yielded nearly identical item parameter estimates. The shrinkage effect was observed in the ability estimates from Gibbs sampling. The computer program OpenBUGS that implemented the rejection sampling method of Gibbs sampling was the main program employed in the study.

____________________

An IRT-Based Objection Against the IQ

Takuya Yanagida and Klaus D. Kubinger

Abstract

Item Response Theory (IRT) offers several models which, if they hold, guarantee that the scoring rule in question is indeed fair. IRT can therefore be applied to the scoring rule of the IQ (intelligence quotient) used in many intelligence test-batteries. Müller’s continuous Rasch model (1987, 1999) applies. Analyses were carried out for three test-batteries, that is, the German and the English version of AID 3 and a respective version for group administration, to show this in an exemplary way. Sample sizes comprised 431, 761, and 2278, respectively. Above all, the graphical model check disclosed a serious misfit of the model: There is no support for the notion that the respective scoring rule is fair. Detailed inspections give the impression that essentially calculating the sum of the subtest scores mixes (at least) two components, “intelligence” and “willingness to achieve in unchallenging tasks.” Practitioners should doubt whether all other intelligence test-batteries in use, which have not been evaluated accordingly, score fairly.

____________________

Evaluating Observer Ratings: The Case of Measuring Neighborhood Disorder

Mei Ling Ong, George Engelhard, Jr., Eric T. Klopack, and Ronald L. Simons

Abstract

The purpose of this study is to evaluate the quality of observer ratings of neighborhood disorder using a many-facet Rasch model (MFRM). Our goal is to investigate observer severity and observer consistency. Observers trained in the use of systematic social observation visited and rated residential neighborhoods. Data for this study are drawn from the Family and Community Health Study (FACHS). The FACHS sample consisted of 673 neighborhoods. Two observers, out of a total of 67 observers used for this study, rated each residential neighborhood. The results of this study suggested that there were statistically significant differences in observer severity, even after observer training, and that the ratings of observers were not consistent. Therefore, more or better observer training is necessary. In addition, the interaction effect between observer and item was significant, indicating significant variance in observer severity across at least one item.

____________________

Measuring Genuine Progress: An Example from the UN Millennium Development Goals

William P. Fisher, Jr.

Abstract

Proposals for incorporating information on the quality of human, social, and environmental conditions in more authentic and comprehensive versions of the Gross National Product (GNP) or Gross Domestic Product (GDP) date back to the foundations of econometrics. Typically treated as external to markets, these domains have lately been objects of renewed interest. Calls for accountability and transparency have expanded to include the now topical but previously neglected economic implications of human, social, and natural capital. Clear advantages for the measurement and management of these forms of capital can be drawn from econometric criteria for identifiable models of structurally invariant relationships. The United Nations’ Millennium Development Goals (MDG) provide an example application of a probabilistic model for measurement used to evaluate data quality, reduce data volume with no loss of information, estimate linear units of comparison with known uncertainties, express measures from different sets of indicators in a common metric, and frame a meaningful interpretive context. Data on 22 MDG indicators from 64 countries are scored and analyzed. Model fit was reasonable, the item hierarchy tells a meaningful story of structural invariance in economic development, and Cronbach’s alpha was 0.93. The measures estimated in this study correlated over 0.90 with independently produced measures of per-capita GDP and life satisfaction. These results provide a positive demonstration of relevant methods applicable in the context of today’s Sustainable Development Goals 2030 Agenda.

____________________

Using Rasch Analyses To Inform the Revision of a Scale Measuring Students’ Process-Oriented Writing Competence in Portfolios

Mai Duong, Cuc Nguyen, and Patrick Griffin

Abstract

Thanks to the wide range of benefits it provides to teaching and learning, portfolio assessment has maintained widespread popularity in language education over the last few decades. However, the practical use of this assessment method is still subject to debate, particularly about the lack of clear definitions and empirical validations of the constructs underlying the assessment. This problem can be addressed by research into portfolio scale development and examination. This article reports on the process of investigating the psychometric properties of a scale assessing the portfolio-based writing competence of Vietnamese students who speak English as a foreign language (EFL). The psychometric validation in this investigation involved applying different Rasch models, including multidimensional, partial credit, and many-facet models, to examine the characteristics of the scale items. The findings support the use of the scale, with mostly good item functioning and acceptable rater consistency in using the scale. Finally, only one item addressing the length of writing is removed from the developed scale, and the items assessing the planning stage of writing in writing portfolios are flagged for further inspection in a larger scale study. Implications for using the scale to improve the quality of teaching and assessment of writing via portfolios can be drawn.

____________________
