Journal of Applied Measurement

P.O. Box 1283

Maple Grove, MN 55311

 

Volume 8, 2007 Article Abstracts

 

Vol. 8, No. 1 Spring 2007

Attitudes, Order and Quantity: Deterministic and Direct Probabilistic Tests of Unidimensional Unfolding

Andrew Kyngdon and Ben Richards

Abstract

This article is the final in a series on unidimensional unfolding. The investigations of Kyngdon (2006b) and Michell (1994) were extended to include direct probabilistic tests of the quantitative and ordinal components of unfolding theory with the multinomial Dirichlet model (Karabatsos, 2005), and tests of triple cancellation, a higher order condition of axiomatic conjoint measurement (ACM; Krantz, Luce, Suppes, and Tversky [KLST], 1971). Strong Dirichlet model support for both the ordinal and quantitative components of unfolding was found only in datasets that satisfied at least double cancellation. In contrast, the Item Response Theory (IRT) simple hyperbolic cosine model for pairwise preferences (SHCMpp; Andrich, 1995) fitted all datasets. The paper concludes that the SHCMpp is suited to the instrumental rather than the scientific task (Michell 2000) of psychological measurement, with the caveat that its chi-square fit statistic is problematic. The paper also presents original work by the second author on coherent tests of triple cancellation.
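For readers unfamiliar with ACM, the cancellation conditions at stake have a compact form. The rendering below is a common textbook statement in our own notation (after Krantz, Luce, Suppes, and Tversky, 1971), not a quotation from the article; under an additive representation, the antecedent inequalities sum to yield the consequent.

```latex
% Double cancellation, for levels a_1, a_2, a_3 of one attribute and
% b_1, b_2, b_3 of the other, with \succsim the observed order on pairs:
\big[(a_1, b_2) \succsim (a_2, b_1)\big] \wedge \big[(a_2, b_3) \succsim (a_3, b_2)\big]
  \;\Rightarrow\; (a_1, b_3) \succsim (a_3, b_1)

% Triple cancellation extends the chain with a third antecedent over
% four levels of each attribute:
\big[(a_1, b_2) \succsim (a_2, b_1)\big] \wedge \big[(a_2, b_3) \succsim (a_3, b_2)\big]
  \wedge \big[(a_3, b_4) \succsim (a_4, b_3)\big]
  \;\Rightarrow\; (a_1, b_4) \succsim (a_4, b_1)
```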

****

Conception and Construction of a Rasch-Scaled Measure for Self-Confidence in One’s Vocabulary Ability

Michaela M. Wagner-Menghin

Abstract

Although different theoretical approaches indicate the importance of self-confidence in one’s ability in educational and counselling psychology, only a few standardized psychological instruments, mainly questionnaires, focus on this construct. The idea pursued here is to expand a traditional ability test into an objective personality measure of self-confidence in one’s ability. Empirical evidence is provided as to whether it is possible to construct and cross-validate this measure using the Rasch model. A two-step procedure allowing for a self-rating of one’s knowledge of words was designed using 25 items from a general-knowledge vocabulary test. The results indicate that it is possible to construct and cross-validate a measure of self-confidence in one’s ability using the Rasch model. Additionally, it is shown that the measure of self-confidence in one’s ability is independent of general knowledge.

****

Relative Precision, Efficiency and Construct Validity of Different Starting and Stopping Rules for a Computerized Adaptive Test: The GAIN Substance Problem Scale

Barth B. Riley, Kendon J. Conrad, Nikolaus Bezruczko, and Michael L. Dennis

Abstract

Substance abuse treatment programs are being pressed to measure an increasing array of problems and to make clinical decisions about them more efficiently. This computerized adaptive testing (CAT) simulation examined the relative efficiency, precision, and construct validity of different starting and stopping rules used to shorten the Global Appraisal of Individual Needs’ (GAIN) Substance Problem Scale (SPS) and facilitate diagnosis based on it. Data came from 1,048 adolescents and adults referred to substance abuse treatment centers at five sites. CAT performance was evaluated using: (1) average standard errors, (2) average number of items, (3) bias in person measures, (4) root mean squared error of person measures, (5) Cohen’s kappa comparing CAT classification to clinical classification, (6) the correlation between CAT and full-scale measures, and (7) the construct validity of CAT classification vs. clinical classification, assessed via correlations with five theoretically associated instruments. Results supported both CAT efficiency and validity.
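The GAIN-specific rules are detailed in the paper; purely to illustrate what starting and stopping rules control in a Rasch-based CAT, here is a minimal sketch. The item bank, the theta start value, and the SE-based stopping threshold are invented for the example and are not the study's rules.

```python
import numpy as np

def rasch_cat(responder, b, theta_start=0.0, se_stop=0.4, max_items=10):
    """Minimal Rasch CAT sketch: maximum-information item selection,
    Newton-Raphson provisional theta, stop on SE threshold or item budget."""
    administered, responses = [], []
    theta = theta_start                      # starting rule: initial theta
    while len(administered) < max_items:     # stopping rule 2: item budget
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        info = p * (1.0 - p)                 # Rasch item information at theta
        info[administered] = -np.inf         # never re-administer an item
        nxt = int(np.argmax(info))
        administered.append(nxt)
        responses.append(responder(nxt))     # 0/1 answer from the examinee
        for _ in range(10):                  # Newton-Raphson ML update of theta
            p = 1.0 / (1.0 + np.exp(-(theta - b[administered])))
            grad = np.sum(np.array(responses) - p)
            hess = -np.sum(p * (1.0 - p))
            # clipping stands in for the safeguards a real CAT needs when
            # all responses are 0 or all are 1 (the ML estimate diverges)
            theta = float(np.clip(theta - grad / hess, -6.0, 6.0))
        se = 1.0 / np.sqrt(np.sum(p * (1.0 - p)))
        if se <= se_stop:                    # stopping rule 1: precision target
            break
    return theta, se, administered

# Toy run against a random responder and a hypothetical 40-item bank:
rng = np.random.default_rng(0)
theta, se, items = rasch_cat(lambda i: int(rng.integers(0, 2)),
                             np.linspace(-2.0, 2.0, 40))
```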

****

Bookmark Locations and Item Response Model Selection in the Presence of Local Item Dependence

Garry Skaggs

Abstract

The bookmark standard setting procedure is a popular method for setting performance standards in state assessment programs. This study reanalyzed data from an application of the bookmark procedure to a passage-based test that used the Rasch model to create the ordered item booklet. Several problems were noted in this implementation of the bookmark procedure, including disagreement among the subject matter experts (SMEs) about the correct order of items in the bookmark booklet, performance level descriptions of the passing standard being based on passage difficulty as well as item difficulty, and the presence of local item dependence within reading passages. Bookmark item locations were recalculated for the IRT three-parameter model and the multidimensional bifactor model. The results showed that the order of item locations was very similar for all three models when items of high difficulty and low discrimination were excluded. However, the items whose positions were most discrepant between models were not the items that the SMEs disagreed about most in the original standard setting. The choice of latent trait model did not address the problem of item order disagreement. Implications for the use of the bookmark method in the presence of local item dependence are discussed.

****

Comparing Concurrent versus Fixed Parameter Equating with Common Items: Using the Rasch Dichotomous and Partial Credit Models in a Mixed Item-Format Test

Husein M. Taherbhai and Daer Yong Seo

Abstract

There has been some discussion among researchers as to the benefits of using one calibration process over another during equating. Although the literature is rife with the pros and cons of the different methods, hardly any research has been done on anchoring (i.e., fixing item parameters to their predetermined values on an established scale), a method commonly used by psychometricians in large-scale assessments. This simulation study compares the fixed form of calibration with the concurrent method (in which the different forms are calibrated on the same scale in a single run, treating all items not included on a form as missing or not reached), using the dichotomous Rasch (Rasch, 1960) and Rasch partial credit (Masters, 1982) models and the WINSTEPS (Linacre, 2003) computer program. Contrary to some researchers’ contention that a concurrent run, with larger n-counts for the common items, would estimate item parameters more accurately, the results indicate that the relative accuracy of the two methods is confounded with sample size, the number of common items, and other factors, and that there is no real benefit to using one method over the other in the calibration and equating of parallel test forms.
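To make the concurrent design concrete: the response records from the forms are stacked into one sparse matrix, with every item a form does not contain coded as missing, and the whole matrix is calibrated in one run. The sketch below uses invented dimensions (40 unique items per form plus 10 common items) and random placeholder responses rather than model-generated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns 0-39: form A unique items; 40-49: common items; 50-89: form B unique.
n_a, n_b = 1000, 1000
X_a = rng.integers(0, 2, size=(n_a, 50))  # form A responses: unique + common
X_b = rng.integers(0, 2, size=(n_b, 50))  # form B responses: common + unique

concurrent = np.full((n_a + n_b, 90), np.nan)  # NaN = item not administered
concurrent[:n_a, :50] = X_a    # form A examinees saw columns 0-49
concurrent[n_a:, 40:] = X_b    # form B examinees saw columns 40-89

# A single calibration of `concurrent` (NaN treated as missing/not reached)
# puts all 90 items on one scale. The fixed-parameter alternative would
# calibrate form B alone, with the 10 common items anchored at the values
# previously estimated from form A.
```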

****

Understanding Rasch Measurement: Instrument Development Tools and Activities for Measure Validation using Rasch Models: Part I – Instrument Development Tools

Edward W. Wolfe and Everett V. Smith, Jr.

Abstract

Instrument development is an arduous task that, if undertaken with care and consideration, can lay the foundation for the development of validity arguments relating to the inferences and decisions that are based on test measures. This article, Part I of a two-part series, provides an overview of validity concepts and describes how instrument development efforts can be conducted to facilitate the development of validity arguments. Our discussion focuses on documentation of the purpose of measurement, creation of test specifications, item development, expert review, and planning of pilot studies. Through these instrument development activities and tools, essential information is documented that will feed into the analysis, summary, and reporting of data relevant to validity arguments discussed in Part II of this series.

 

Vol. 8, No. 2 Summer 2007

Mental Self-Government: Development of the Additional Democratic Learning Style Scale using Rasch Measurement Models

Tine Nielsen, Svend Kreiner, and Irene Styles

Abstract

This paper describes the development and validation of a democratic learning style scale intended to fill a gap in Sternberg’s theory of mental self-government and the associated learning style inventory (Sternberg, 1988, 1997). The scale was constructed as an 8-item scale with a 7-category response format, following an adapted version of DeVellis’ (2003) guidelines for scale development. The validity of the Democratic Learning Style Scale was assessed by item analysis using graphical loglinear Rasch models (Kreiner and Christensen, 2002, 2004, 2006). The item analysis confirmed that the full 8-item revised Democratic Learning Style Scale fitted a graphical loglinear Rasch model with no differential item functioning but weak to moderate uniform local dependence between two items. In addition, a reduced 6-item version of the scale fitted the pure Rasch model with a rating scale parameterization. The revised Democratic Learning Style Scale can therefore be regarded as a sound measurement scale meeting the requirements of both construct validity and objectivity.

****

Measuring Math Anxiety (in Spanish) with the Rasch Rating Scale Model

Gerardo Prieto and Ana R. Delgado

Abstract

Two successive studies probed the psychometric properties of a Spanish-language Math Anxiety questionnaire by means of the Rasch Rating Scale Model. Participants were 411 and 216 Spanish adolescents, respectively. Convergent validity was examined by correlating the scale with both the Fennema-Sherman Attitude Scale and a math achievement test. The results show that the scores are psychometrically appropriate and replicate those reported in meta-analyses: medium-sized negative correlations with achievement and with attitudes toward mathematics, as well as moderate sex-related differences (with girls presenting higher anxiety levels than boys).

****

Using Rasch Analysis to Construct a Clinical Problem-Solving Inventory in the Dental Clinic: A Case Study

Chien-Lin Yang and Gene A. Kramer

Abstract

A 28-item inventory was developed to measure the clinical problem-solving abilities of third- and fourth-year dental students. Fifty-seven expert raters (dental school faculty) from four dental schools used the inventory to evaluate 183 dental students on a 5-point rating scale. The Rasch measurement model was employed to examine the psychometric properties and construct validity of the inventory. Fit statistics identified the “noise” in the data, and residual analysis assisted in extracting a meaningful structure. The results indicate that the Rasch measurement model is a useful method for producing a unidimensional instrument. All five rating categories were used in a coherent manner, and four discernible levels of clinical problem-solving ability were identified. After removal of four repetitious items, a version of the Clinical Problem-Solving Inventory was finalized that could serve as a criterion measure for validating the use of a critical thinking test on the Dental Admission Test.

****

Evidence-Based Practice for Equating Health Status Items: Sample Size and IRT Model

Karon F. Cook, Patrick W. Taylor, Barbara G. Dodd, Cayla R. Teal, and Colleen A. McHorney

Abstract

Background: In the development of health outcome measures, the pool of candidate items may be divided into multiple forms, thus “spreading” response burden over two or more study samples. Item responses collected this way yield two or more forms whose scores are not equivalent, so the responses must be equated (adjusted) to a common mathematical metric. Objectives: The purpose of this study was to examine the effect of sample size, test size, and choice of item response theory model in equating three forms of a health status measure. Each form comprised a set of items unique to it and a set of anchor items common across forms. Research Design: The study was a secondary data analysis of patients’ responses to the developmental item pool for the Health of Seniors Survey. A completely crossed design was used with 25 replications per study cell. Results: We found that the quality of equatings was affected greatly by sample size; its effect was far more substantial than the choice of IRT model. Little or no advantage was observed for equatings based on 60 or 72 items versus those based on 48 items. Conclusions: We concluded that samples of fewer than 300 are clearly unacceptable for equating multiple forms. Additional sample size guidelines are offered based on our results.

****

Computing Confidence Intervals of Item Fit Statistics in the Family of Rasch Models Using the Bootstrap Method

Ya-Hui Su, Ching-Fan Sheu, and Wen-Chung Wang

Abstract

The item infit and outfit mean square error (MSE) statistics and their t-transformed versions are widely used to screen poorly fitting items. The t-transformed statistics, however, do not follow the standard normal distribution, so hypothesis testing of item fit based on conventional critical values is likely to be inaccurate (Wang and Chen, 2005). The MSE statistics are effect-size measures of misfit and have an expected value of unity when the data fit the model. Unfortunately, most computer programs for item response analysis do not report confidence intervals for the item infit and outfit MSE, mainly because their sampling distributions are analytically intractable; the user is thus left without interval estimates of the magnitude of misfit. In this study, we developed a FORTRAN 90 computer program that works in conjunction with the commercial program WINSTEPS (Linacre, 2001) to yield confidence intervals for the item infit and outfit MSE using the bootstrap method. The utility of the program is demonstrated with three illustrations based on simulated data sets.
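The authors' program is FORTRAN 90 coupled to WINSTEPS; the sketch below merely illustrates the parametric-bootstrap logic for the dichotomous Rasch model in Python. Holding the parameter estimates fixed across replicates (rather than re-estimating them) and the percentile-interval choice are simplifying assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def infit_outfit(X, theta, b):
    """Item infit and outfit mean squares for a dichotomous Rasch model."""
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # N x I probabilities
    W = P * (1.0 - P)                                         # score variances
    Z2 = (X - P) ** 2 / W                                     # squared std. residuals
    infit = (W * Z2).sum(axis=0) / W.sum(axis=0)              # information-weighted
    outfit = Z2.mean(axis=0)                                  # unweighted mean square
    return infit, outfit

def bootstrap_ci(theta_hat, b_hat, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CIs: simulate data from the estimated model,
    recompute infit/outfit each replicate, take the empirical quantiles."""
    N, I = len(theta_hat), len(b_hat)
    P = 1.0 / (1.0 + np.exp(-(theta_hat[:, None] - b_hat[None, :])))
    stats = np.empty((n_boot, 2, I))
    for r in range(n_boot):
        Xr = (rng.random((N, I)) < P).astype(float)   # parametric resample
        stats[r, 0], stats[r, 1] = infit_outfit(Xr, theta_hat, b_hat)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2], axis=0)
    return lo, hi   # each 2 x I: row 0 = infit, row 1 = outfit, columns = items
```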

****

Understanding Rasch Measurement: Instrument Development Tools and Activities for Measure Validation using Rasch Models: Part II – Validation Activities

Edward W. Wolfe and Everett V. Smith, Jr.

Abstract

The accumulation of validity evidence is an important part of the instrument development process. In Part I of this two-part series, we provided an overview of validity concepts and described how instrument development efforts can be conducted to facilitate the development of validity arguments. In Part II, we identify how analyses, especially those conducted within a Rasch measurement framework, can be used to provide evidence supporting the validity arguments formulated during the instrument development process.

 

Vol. 8, No. 3 Fall 2007

The Programme for International Student Assessment: An Overview

Ross Turner and Raymond J. Adams

Abstract

This paper provides an overview of the Programme for International Student Assessment (PISA), an ongoing international comparative survey of educational outcomes among 15-year-olds. PISA is sponsored by the Organisation for Economic Co-operation and Development (OECD) and, for the period 1998-2010, has been designed and implemented by a consortium led by the Australian Council for Educational Research (ACER).

****

Translation Equivalence across PISA Countries

Aletta Grisay, John H.A.L. de Jong, Eveline Gebhardt, Alla Berezner, and Beatrice Halleux-Monseur

Abstract

Due to the continuous increase in the number of countries participating in international comparative assessments such as TIMSS and PISA, ensuring linguistic and cultural equivalence across the various national versions of the assessment instruments has become an increasingly crucial challenge. For example, 58 countries participated in the PISA 2006 Main Study. Within each country, the assessment instruments had to be adapted into each language of instruction used in the sampled schools. All national versions in languages used for 5 per cent or more of the target population (a total of 77 versions in 42 different languages) were verified for equivalence against the English and French source versions developed by the PISA consortium. Information gathered both through the verification process and through empirical analyses of the data is used to adjudicate whether the level of linguistic equivalence reached an acceptable standard in each participating country. The paper briefly describes the procedures typically used in PISA to ensure high levels of translation/adaptation accuracy, and then focuses on the development of the set of indicators used as criteria in the equivalence adjudication exercise. Empirical data from the PISA 2005 Field Trial are used to illustrate both the analyses and the major conclusions reached.

****

Ameliorating Culturally Based Extreme Response Tendencies to Attitude Items

Maurice Walker

Abstract

Using data from the PISA 2006 field trial, Rasch item response models are used to demonstrate that extreme response tendencies are exhibited differentially across culturally distinct countries when respondents answer Likert-type attitude items. A single attitude scale is examined across eight culturally distinct countries. Two avenues for ameliorating this tendency are explored: first, using dichotomous variants of the items; and second, incorporating the country-specific response tendency into the Rasch item response model. Analysis of the item variants reveals similar scale outcomes and correlations with achievement, but a preference for the Likert variant when test information is considered. A hierarchical analysis using facet models reveals that the data fit significantly better in a model that incorporates an interaction effect between the country and the item delta parameters. Implications for reporting attitudes measured with Likert items across cultures are outlined.

****

The Impact of Differential Investment of Student Effort on the Outcomes of International Studies

Jayne Butler and Raymond J. Adams

Abstract

International comparative assessments of student achievement, such as the Trends in International Mathematics and Science Study (TIMSS) and the Programme for International Student Assessment (PISA), are becoming increasingly important in the development of evidence-based education policy. The potentially far-reaching influence of such studies underscores the need for these assessments to be valid and reliable. In education, increasing recognition is being given to motivational factors that affect student learning. This research considers a possible threat to the validity of such studies by investigating how the amount of effort test-takers invest influences their outcomes. Reassuringly, the reported expenditure of effort by students is found to be fairly stable across countries, which counters the claim that systematic cultural differences in the effort expended by students invalidate international comparisons. Reported effort expenditure is related to reading achievement with an effect size similar to that of variables such as single-parent family structure, gender, and socioeconomic background. Finally, taking effort into account when reporting trends should be considered, as it may well facilitate the interpretation of national and gender trends in reading achievement.

****

The Influence of Equating Methodology on Reported Trends in PISA

Eveline Gebhardt and Raymond J. Adams

Abstract

In 2005, PISA published trend indicators comparing the results of PISA 2000 and PISA 2003. In this paper we explore the extent to which the outcomes of these trend analyses are sensitive to the choice of test equating methodology, regression model, and linking items. To establish trends, PISA equated its 2000 and 2003 tests using a methodology based on Rasch modelling, estimating linear transformations that mapped 2003 Rasch-scaled scores onto the previously established PISA 2000 Rasch-scaled scores. We compare the outcomes of this approach with an alternative that involves the joint Rasch scaling of the PISA 2000 and PISA 2003 data separately for each country. Note that under the joint-scaling approach the item parameters are estimated separately for each country, whereas the linear transformation approach used a common set of item parameter estimates for all countries. Further, as its primary trend indicators, PISA reported changes in mean scores between 2000 and 2003. These means are not adjusted for changes in the background characteristics of the PISA 2000 and PISA 2003 samples; that is, they are marginal rather than conditional means. The use of conditional rather than marginal means leads to some differing conclusions regarding trends at both the country and within-country level.

****

The Computation of Equating Errors in International Surveys in Education

Christian Monseur and Alla Berezner

Abstract

Since the IEA’s Third International Mathematics and Science Study, one of the major objectives of international surveys in education has been to report trends in achievement. The names of the two current IEA surveys reflect this growing interest: the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS). Similarly, a central concern of the OECD’s PISA is with trends in outcomes over time. To facilitate trend analyses, these studies link their tests using common-item equating in conjunction with item response modelling. IEA and PISA policies differ in how they report the error associated with trends: in IEA surveys, the standard errors of trend estimates do not include the uncertainty associated with the linking step, while PISA includes a linking error component in the standard errors of its trend estimates. In other words, PISA implicitly acknowledges that trend estimates partly depend on the selected common items, while the IEA’s surveys do not recognise this source of error. Failing to recognise the linking error leads to an underestimation of the standard errors and thus inflates the Type I error rate, resulting in reports of significant changes in achievement when in fact the changes are not significant. The growing interest of policy makers in trend indicators, and the bearing such indicators have on the evaluation of educational reforms, appear incompatible with such underestimation. However, the procedure implemented by PISA raises a few issues about the assumptions underlying the computation of the equating error. After a brief introduction, this paper describes the procedure PISA implemented to compute the linking error. The underlying assumptions of this procedure are then discussed. Finally, an alternative method based on replication techniques is presented, evaluated in a simulation study, and applied to the PISA 2000 data.
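For orientation, the linking-error component PISA adds can be written, in one common formulation, as the standard error of the mean shift in the link items' difficulty estimates between administrations. The notation below is ours, and the paper's discussion concerns the assumptions behind exactly this kind of expression.

```latex
% L link items; \hat{\delta}_i^{(t)} is the difficulty estimate of link item i
% at administration t, expressed on a common scale;
% d_i = \hat{\delta}_i^{(2003)} - \hat{\delta}_i^{(2000)}, with mean \bar{d}.
\mathrm{SE}_{\mathrm{link}}
  = \sqrt{\frac{1}{L(L-1)} \sum_{i=1}^{L} \left(d_i - \bar{d}\right)^{2}}
```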

 

Vol. 8, No. 4 Winter 2007

Nonequivalent Survey Consolidation: An Example From Functional Caregiving

Nikolaus Bezruczko and Shu-Pi C. Chen

Abstract

Functional Caregiving (FC) is a construct about mothers caring for children (both old and young) with intellectual disabilities, operationally defined by two nonequivalent survey forms, urban and suburban. The first purpose of this research is to generalize school-based achievement test principles to survey methods by equating the two nonequivalent survey forms. A second purpose is to expand FC foundations by a) establishing linear measurement properties for new caregiving items, b) replicating a hierarchical item structure across an urban, school-based population, c) consolidating the survey forms to establish a calibrated item bank, and d) collecting additional external construct validity data. Results supported invariant item parameters of a fixed item form (96 items) for two urban samples (N = 186). FC measures also showed expected construct relationships with age, mental depression, and health status. However, only five common items between the urban and suburban forms were statistically stable, because suburban mothers’ age and their child’s age appear to interact with medical information and social activities.

****

Mindfulness Practice: A Rasch Variable Construct Innovation

Sharon G. Solloway and William P. Fisher, Jr.

Abstract

Is it possible to establish a consistent, stable relationship between the structure of number and additive amounts of mindfulness practice? The instrument comprised a bank of thirty items constructed from a review of the literature and from novice practitioners’ journal responses to mindfulness practice. A convenience sample of students in a teacher education program participated. The WINSTEPS Rasch measurement software was used for all analyses. Measurement separation reliability was 0.92 and item separation reliability was 0.98, with satisfactory model fit. The 30 items measure a single construct of mindfulness practice. Construct validity was supported by the meaningfulness of the item ordering from easy to hard. The same scale was produced when the items were calibrated separately on the T1 and T2 groups (R² = 0.83). The experimental group’s T2 measures were significantly different from both its own T1 measures and the control group’s T1 and T2 measures. ANOVA showed a significant difference between the experimental and control groups at T2 (F = 43.66, 151 d.f., p < .001), a nearly two-logit (20-unit) difference (48.9 vs. 68.0). The study is innovative in its demonstration of mindfulness practice as a measurable variable.

****

Substance Use Disorder Symptoms: Evidence of Differential Item Functioning by Age

Kendon J. Conrad, Michael L. Dennis, Nikolaus Bezruczko, Rodney R. Funk, and Barth B. Riley

Abstract

This study examined the applicability of substance abuse diagnostic criteria for adolescents, young adults, and adults using the Global Appraisal of Individual Needs’ Substance Problem Scale (SPS) with data from 7,408 clients. Rasch analysis was used to: 1) evaluate whether the SPS operationalized a single reliable dimension, and 2) examine the extent to which the severity of each symptom, and the test overall, functioned the same or differently by age. Rasch analysis indicated that the SPS was unidimensional with a person reliability of .84. Eight symptoms differed significantly between adolescents and adults, and young adult calibrations tended to fall between the two. Differential test functioning was clinically negligible for adolescents but resulted in about 7% more adults being classified as high need. These findings have theoretical implications for the screening and treatment of adolescents vs. adults. The SPS can be used across age groups, though age-specific calibrations enable more precise measurement.

****

A Monte Carlo Study of the Impact of Missing Data and Differential Item Functioning on Theta Estimates from Two Polytomous Rasch Family Models

Carolyn F. Furlow, Rachel T. Fouladi, Phill Gagné, and Tiffany A. Whittaker

Abstract

This paper examines the impact of differential item functioning (DIF), missing item values, and different methods for handling missing item values on theta estimates, with data simulated from the partial credit model and Andrich’s rating scale model, two Rasch family models commonly used to estimate a respondent’s attitude. The degree of missing data, the DIF magnitude, and the percentage of DIF items were varied under missing-completely-at-random (MCAR) conditions in which the focal group made up 10% of the total population. Four methods for handling missing data were compared: complete-case analysis, mean substitution, hot-decking, and multiple imputation. Bias, RMSE, means, and standard errors of the focal group’s theta estimates were adversely affected by the amount and magnitude of DIF items. RMSE and fidelity coefficients for both the reference and focal groups were adversely affected by the amount of missing data. While all methods of handling missing data performed fairly similarly, multiple imputation and hot-decking performed slightly better.
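For readers unfamiliar with the two simpler competitors, here is a toy sketch of mean substitution and one basic hot-deck variant applied to an item-response matrix; the paper's actual implementations are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_substitution(X):
    """Replace each missing entry (np.nan) with its item's observed mean."""
    X = X.astype(float).copy()
    item_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = item_means[cols]
    return X

def hot_deck(X):
    """Replace each missing entry with a randomly drawn observed value
    (a 'donor') from the same item, a simple within-item hot-deck."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        X[miss, j] = rng.choice(X[~miss, j], size=miss.sum())
    return X
```

Real hot-deck procedures usually match donors to recipients on background variables; the within-item draw above is only the simplest case.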

****

Investigation of 360-Degree Instrumentation Effects: Application of the Rasch Rating Scale Model

John T. Kulas and Kelly M. Hannum

Abstract

Performance appraisals have frequently been investigated for inaccuracies attributable to raters. In addition to rater contributions, problems in measurement can arise from properties of the instrumentation. If the administered items do not parallel the full range of employee performance, a restriction of range can occur: employees are rated similarly (no distinction is made) not because of rater error, but because of an instrument floor or ceiling effect. A Rasch measurement procedure was applied to a 360-degree dataset in order to uncover potential instrumentation effects. A ceiling effect of similar magnitude was found across rater categories. It is recommended that performance appraisal practitioners consider including items of greater difficulty in their criterion measures.

****

Rasch Measurement of Self-Regulated Learning in an Information and Communication Technology (ICT)-rich Environment

Joseph N. Njiru and Russell F. Waugh

Abstract

This report describes how a linear scale of self-regulated learning in an ICT-rich environment was created by analysing student data with the Rasch measurement model. A convenience sample of university students in Western Australia (N = 409) was used. The instrument initially comprised 41 stem-items, each answered from two perspectives (I aim for this and I actually do this); it was reduced to the 16 stem-items that fitted the measurement model and formed a unidimensional scale. Items for motivation (extrinsic, intrinsic, and social rewards), academic goals (fear of performing poorly, but not standards), self-learning beliefs (ability and interest), task management (strategies and time management, but not cooperative learning), volition (action control, but not environmental control), and self-evaluation (cognitive self-evaluation and metacognition) fitted the measurement model. The proportion of observed variance considered true was 0.90. A new instrument is proposed to handle the conceptually valid but non-fitting items. Characteristics of highly self-regulated learners are identified.

****

The Saltus Model Applied to Proportional Reasoning Data

Karen Draney

Abstract

This article examines an application of the saltus model, a mixture model designed for the analysis of developmental data. Background on the types of research for which such a model might be useful is discussed. The equations defining the model are given, as well as the model’s relationships to the Rasch model and to other mixture models. An application of the saltus model to an example data set, collected using Noelting’s orange juice mixture tasks, is examined in detail, along with the control files necessary to run the software and the output file produced.
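The defining equations appear in the article itself; as a reminder of the model's general shape, the dichotomous saltus model is usually written as a Rasch model plus a person-class-by-item-class shift. The notation below is ours (after Wilson, 1989), not a reproduction of the article's equations.

```latex
% Person n in latent developmental class h, item i belonging to item class k:
P(X_{ni} = 1 \mid \theta_n, h)
  = \frac{\exp(\theta_n - \beta_i + \tau_{hk})}
         {1 + \exp(\theta_n - \beta_i + \tau_{hk})}
% \tau_{hk} is the saltus parameter, the advantage (or disadvantage) of
% class-h persons on class-k items; the marginal model mixes over classes
% with proportions \pi_h, and constraints fix a reference class.
```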
