Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 8, 2007 Article Abstracts
Vol. 8, No. 1 Spring 2007
Attitudes, Order and Quantity: Deterministic
and Direct Probabilistic Tests of Unidimensional Unfolding
Andrew Kyngdon and Ben Richards
Abstract
This article is the final one in a series on unidimensional unfolding. The investigations of Kyngdon (2006b) and
Michell (1994) were extended to include direct probabilistic tests of the quantitative and ordinal components
of unfolding theory with the multinomial Dirichlet model (Karabatsos, 2005), and tests of the higher-order
axiomatic conjoint measurement (ACM; Krantz, Luce, Suppes, and Tversky [KLST], 1971) condition of triple
cancellation. Strong Dirichlet model support for both the ordinal and quantitative components of unfolding was
found only in datasets that satisfied at least double cancellation. In contrast, the Item Response Theory (IRT)
simple hyperbolic cosine model for pairwise preferences (SHCMpp; Andrich, 1995) fitted all datasets. The paper
concludes that the SHCMpp is suited to the instrumental rather than the scientific task (Michell, 2000) of
psychological measurement, with the caveat that its chi-square fit statistic is problematic. The paper also
presents original work by the second author on coherent tests of triple cancellation.
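For readers new to the cancellation hierarchy mentioned here, the double cancellation condition of ACM can be stated in its standard textbook form; this is background, not a restatement of the authors' test procedure. For levels a, b, c of one attribute and x, y, z of the other,

\[
(a, y) \succsim (b, x) \ \text{and}\ (b, z) \succsim (c, y) \ \Rightarrow\ (a, z) \succsim (c, x).
\]

Triple cancellation is the analogous, stronger condition in which three antecedent inequalities jointly imply a fourth.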
****
Conception and Construction of a Rasch-Scaled
Measure for Self-Confidence in One’s Vocabulary Ability
Michaela M. Wagner-Menghin
Abstract
Although different theoretical approaches indicate the importance of self-confidence in one's ability in educational
or counselling psychology, only a few standardized psychological instruments, mainly questionnaires, focus
on this construct. The idea pursued here is to expand a traditional ability test and create an objective
personality measure for self-confidence in one's ability. Empirical evidence is provided as to whether it is
possible to construct and cross-validate this measure using the Rasch model. A two-step procedure allowing for
a self-rating of one's knowledge of words was designed using 25 items from a general-knowledge vocabulary
test. The results indicate that it is possible to construct and cross-validate a measure for self-confidence
in one's ability using the Rasch model. Additionally, it is shown that the measure for self-confidence in
one's ability is independent of general knowledge.
****
Relative Precision, Efficiency and Construct
Validity of Different Starting and Stopping Rules for a Computerized Adaptive Test: The GAIN Substance Problem
Scale
Barth B. Riley, Kendon J. Conrad, Nikolaus Bezruczko, and Michael L. Dennis
Abstract
Substance abuse treatment programs are being pressed to measure an increasing array of problems and to make
clinical decisions about them more efficiently. This computerized adaptive testing (CAT) simulation examined the
relative efficiency, precision, and construct validity of different starting and stopping rules used to shorten the
Global Appraisal of Individual Needs' (GAIN) Substance Problem Scale (SPS) and facilitate diagnosis based
on it. Data came from 1,048 adolescents and adults referred to substance abuse treatment centers at five sites. CAT
performance was evaluated using: (1) average standard errors, (2) average number of items, (3) bias in person
measures, (4) root mean squared error of person measures, (5) Cohen’s kappa to evaluate CAT classification
compared to clinical classification, (6) correlation between CAT and full-scale measures, and (7) construct
validity of CAT classification vs. clinical classification using correlations with five theoretically associated
instruments. Results supported both CAT efficiency and validity.
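As a rough illustration of how evaluation criteria of this kind are computed, the following Python sketch compares simulated CAT person measures against full-scale measures. The data, the cut score, and all names are hypothetical assumptions of this illustration and are not drawn from the GAIN study.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_cat(theta_full, theta_cat, cut=0.0):
    """Compare CAT person measures against full-scale measures (in logits)."""
    diff = theta_cat - theta_full
    bias = diff.mean()                      # criterion (3): bias in person measures
    rmse = np.sqrt((diff ** 2).mean())      # criterion (4): root mean squared error
    r, _ = pearsonr(theta_cat, theta_full)  # criterion (6): CAT vs. full-scale correlation

    # Criterion (5): Cohen's kappa for agreement of dichotomous classifications
    cat_cls, full_cls = theta_cat >= cut, theta_full >= cut
    po = (cat_cls == full_cls).mean()       # observed agreement
    pe = (cat_cls.mean() * full_cls.mean()
          + (1 - cat_cls.mean()) * (1 - full_cls.mean()))  # chance agreement
    kappa = (po - pe) / (1 - pe)
    return bias, rmse, r, kappa

# Hypothetical demonstration data: CAT adds estimation error to the true measures
rng = np.random.default_rng(0)
theta_full = rng.normal(0, 1, 1000)
theta_cat = theta_full + rng.normal(0, 0.3, 1000)
print(evaluate_cat(theta_full, theta_cat))
```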
****
Bookmark Locations and Item Response Model
Selection in the Presence of Local Item Dependence
Gary Skaggs
Abstract
The bookmark standard setting procedure is a popular method for setting performance standards on state assessment
programs. This study reanalyzed data from an application of the bookmark procedure to a passage-based
test that used the Rasch model to create the ordered item booklet. Several problems were noted in this
implementation of the bookmark procedure, including disagreement among the subject matter experts (SMEs) about
the correct order of items in the bookmark booklet, performance level descriptions of the passing standard being
based on passage difficulty as well as item difficulty, and the presence of local item dependence within reading passages.
Bookmark item locations were recalculated for the IRT three-parameter model and the multidimensional bifactor
model. The results showed that the order of item locations was very similar for all three models when items
of high difficulty and low discrimination were excluded. However, the items whose positions were the most
discrepant between models were not the items that the SMEs disagreed about the most in the original standard
setting. The choice of latent trait model did not address problems of item order disagreement. Implications for
the use of the bookmark method in the presence of local item dependence are discussed.
****
Comparing Concurrent versus Fixed Parameter
Equating with Common Items: Using the Rasch Dichotomous and Partial Credit Models in a Mixed Item-Format Test
Husein M. Taherbhai and Daeryong Seo
Abstract
There has been some discussion among researchers as to the benefits of using one calibration process over
another during equating. Although the literature is rife with the pros and cons of the different methods, hardly any
research has been done on anchoring (i.e., fixing item parameters to their predetermined values on an established
scale) as a method that is commonly used by psychometricians in large-scale assessments.
This simulation research compares the fixed-parameter method of calibration with the concurrent method (where calibration
of the different forms on the same scale is accomplished by a single run of the calibration process, treating all
non-included items on the forms as missing or not reached), using the dichotomous Rasch (Rasch, 1960) and
the Rasch partial credit (Masters, 1982) models, and the WINSTEPS (Linacre, 2003) computer program.
Contrary to some researchers' contention that a concurrent run with larger n-counts for the
common items would provide greater accuracy in the estimation of item parameters, the results of this paper
indicate that the relative accuracy of the two methods is confounded by sample size, the number
of common items, and other factors, and that there is no real benefit in using one method over the other in the
calibration and equating of parallel test forms.
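For reference, the two models being compared have the following standard forms; the notation is generic rather than taken from the paper. Under the dichotomous Rasch model,

\[
P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\]

and under the partial credit model, for a response in category x of item i with m_i + 1 ordered categories,

\[
P(X_{ni} = x) = \frac{\exp \sum_{j=0}^{x} (\theta_n - \delta_{ij})}{\sum_{k=0}^{m_i} \exp \sum_{j=0}^{k} (\theta_n - \delta_{ij})},
\]

with the convention that the j = 0 term of each sum is zero. Anchoring fixes the delta parameters at predetermined values from an earlier calibration, whereas concurrent calibration estimates them in a single joint run.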
****
Understanding Rasch
Measurement: Instrument Development Tools and Activities for Measure Validation using Rasch Models:
Part I – Instrument Development Tools
Edward W. Wolfe and Everett V. Smith, Jr.
Abstract
Instrument development is an arduous task that, if undertaken with care and consideration, can lay the foundation
for the development of validity arguments relating to the inferences and decisions that are based on test
measures. This article, Part I of a two-part series, provides an overview of validity concepts and describes
how instrument development efforts can be conducted to facilitate the development of validity arguments.
Our discussion focuses on documentation of the purpose of measurement, creation of test specifications, item
development, expert review, and planning of pilot studies. Through these instrument development activities
and tools, essential information is documented that will feed into the analysis, summary, and reporting of data
relevant to validity arguments discussed in Part II of this series.
Vol. 8, No. 2 Summer 2007
Mental Self-Government: Development of the
Additional Democratic Learning Style Scale using Rasch Measurement Models
Tine Nielsen, Svend Kreiner, and Irene Styles
Abstract
This paper describes the development and validation of a democratic learning style scale intended to fill a gap
in Sternberg's theory of mental self-government and the associated learning style inventory (Sternberg, 1988,
1997). The instrument comprises eight items with a 7-category response scale, and was developed
following an adapted version of DeVellis' (2003) guidelines for scale development.
The validity of the Democratic Learning Style Scale was assessed by item analysis using graphical loglinear
Rasch models (Kreiner and Christensen, 2002, 2004, 2006). The item analysis confirmed that the full 8-item
revised Democratic Learning Style Scale fitted a graphical loglinear Rasch model with no differential item
functioning but weak to moderate uniform local dependence between two items. In addition, a reduced 6-item
version of the scale fitted the pure Rasch model with a rating scale parameterization. The revised Democratic
Learning Style Scale can therefore be regarded as a sound measurement scale meeting the requirements of both
construct validity and objectivity.
****
Measuring Math Anxiety (in Spanish)
with the Rasch Rating Scale Model
Gerardo Prieto and Ana R. Delgado
Abstract
Two successive studies probed the psychometric properties of a Math Anxiety questionnaire (in Spanish) by means
of the Rasch Rating Scale Model. Participants were 411 and 216 Spanish adolescents. Convergent validity was
examined by correlating the scale with both the Fennema and Sherman Attitude Scale and a math achievement
test. The results show that the scores are psychometrically appropriate and replicate those reported in meta-analyses:
medium-sized negative correlations with achievement and with attitudes toward mathematics, as well
as moderate sex-related differences (with girls presenting higher anxiety levels than boys).
****
Using Rasch Analysis to
Construct a Clinical Problem-Solving Inventory in the Dental Clinic: A Case Study
Chien-Lin Yang and Gene A. Kramer
Abstract
A 28-item inventory was developed to measure the clinical problem-solving abilities of third- and fourth-year dental
students. Fifty-seven expert raters (dental school faculty) from four dental schools used the inventory to
evaluate 183 dental students on a 5-point rating scale. The Rasch measurement model was employed to examine
the psychometric properties and construct validity of this inventory. In this study, fit statistics identified the "noise"
in the data, and residual analysis assisted in extracting a meaningful structure. The study results indicate that the
Rasch measurement model is a useful method for producing a unidimensional instrument.
All five rating categories were used in a coherent manner, and four discernible levels of clinical problem-solving
ability were identified. After removal of four repetitious items, a version of the Clinical Problem-Solving
Inventory was finalized that could serve as a criterion measure for validating the use of a critical thinking test
on the Dental Admission Test.
****
Evidence-Based
Practice for Equating Health Status Items: Sample Size and IRT Model
Karon F. Cook, Patrick W. Taylor, Barbara G. Dodd, Cayla R. Teal, and Colleen A. McHorney
Abstract
Background: In the development of health outcome measures, the pool of candidate items may be divided into
multiple forms, thus “spreading” response burden over two or more study samples. Item responses collected
using this approach result in two or more forms whose scores are not equivalent. Therefore, the item responses
must be equated (adjusted) to a common mathematical metric. Objectives: The purpose of this study was to
examine the effect of sample size, test size, and selection of item response theory model in equating three forms
of a health status measure. Each form comprised a set of items unique to it and a set of anchor
items common across forms. Research Design: The study was a secondary data analysis of patients' responses
to the developmental item pool for the Health of Seniors Survey. A completely crossed design was used with
25 replications per study cell. Results: We found that the quality of equatings was affected greatly by sample
size. Its effect was far more substantial than choice of IRT model. Little or no advantage was observed for
equatings based on 60 or 72 items versus those based on 48 items. Conclusions: We concluded that samples
of fewer than 300 are clearly unacceptable for equating multiple forms. Additional sample size guidelines are
offered based on our results.
****
Computing Confidence
Intervals of Item Fit Statistics in the Family of Rasch Models Using the Bootstrap Method
Ya-Hui Su, Ching-Fan Sheu, and Wen-Chung Wang
Abstract
The item infit and outfit mean square errors (MSE) and their t-transformed statistics are widely used to screen
poorly fitting items. The t-transformed statistics, however, do not follow the standard normal distribution,
so hypothesis testing of item fit based on the conventional critical values is likely to be inaccurate (Wang and
Chen, 2005). The MSE statistics are effect-size measures of misfit and have an expected value of unity when
the data fit the model’s expectation. Unfortunately, most computer programs for item response analysis do not
report confidence intervals of the item infit and outfit MSE, mainly because their sampling distributions are
analytically intractable. Hence, the user is left without interval estimates of the magnitudes of misfit. In this study,
we developed a FORTRAN 90 computer program in conjunction with the commercial program WINSTEPS
(Linacre, 2001) that yields confidence intervals of the item infit and outfit MSE using the bootstrap method. The
utility of the program is demonstrated through three illustrations of simulated data sets.
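The authors' implementation pairs a FORTRAN 90 program with WINSTEPS; the Python sketch below conveys only the general bootstrap logic, under the simplifying assumption (made here for illustration, not in the paper) that person and item parameters are treated as known and data are resampled parametrically from the fitted dichotomous Rasch model.

```python
import numpy as np

rng = np.random.default_rng(42)

def rasch_prob(theta, delta):
    """P(X=1) under the dichotomous Rasch model, persons x items."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))

def fit_mse(x, p):
    """Infit and outfit mean squares per item.

    z_ni = (x_ni - p_ni) / sqrt(w_ni), with w_ni = p_ni (1 - p_ni);
    outfit_i = mean_n z_ni^2;  infit_i = sum_n w_ni z_ni^2 / sum_n w_ni.
    """
    w = p * (1 - p)
    z2 = (x - p) ** 2 / w
    return (w * z2).sum(axis=0) / w.sum(axis=0), z2.mean(axis=0)

# Hypothetical calibrated parameters (in practice, read from WINSTEPS output)
theta = rng.normal(0, 1, 500)      # person measures
delta = np.linspace(-2, 2, 20)     # item difficulties
p = rasch_prob(theta, delta)

# Parametric bootstrap: simulate B datasets from the fitted model
B = 1000
infits, outfits = np.empty((B, delta.size)), np.empty((B, delta.size))
for b in range(B):
    x = (rng.random(p.shape) < p).astype(float)
    infits[b], outfits[b] = fit_mse(x, p)

# 95% percentile intervals for each item's infit and outfit MSE
ci_infit = np.percentile(infits, [2.5, 97.5], axis=0)
ci_outfit = np.percentile(outfits, [2.5, 97.5], axis=0)
print(ci_infit[:, :3])  # intervals for the first three items
```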
****
Understanding Rasch
Measurement: Instrument Development Tools and Activities for Measure Validation using Rasch Models:
Part II – Validation Activities
Edward W. Wolfe and Everett V. Smith, Jr.
Abstract
Accumulation of validity evidence is an important part of the instrument development process. In Part
I of a two-part series, we provided an overview of validity concepts and described how instrument
development efforts can be conducted to facilitate the development of validity arguments. In this, Part
II of the series, we identify how analyses, especially those conducted within a Rasch measurement
framework, can be used to provide evidence to support the validity arguments formulated during the
instrument development process.
Vol. 8, No. 3 Fall 2007
The Programme for International Student Assessment: An Overview
Ross Turner and Raymond J. Adams
Abstract
This paper provides an overview of the Programme for International Student Assessment (PISA), an ongoing
international comparative survey of educational outcomes at age 15. PISA is sponsored by the Organisation
for Economic Co-operation and Development (OECD) and for the period 1998-2010 has been designed and
implemented by a consortium led by the Australian Council for Educational Research (ACER).
****
Translation Equivalence across PISA Countries
Aletta Grisay, John H.A.L. de Jong, Eveline Gebhardt, Alla Berezner, and Beatrice Halleux-Monseur
Abstract
Due to the continuous increase in the number of countries participating in international comparative assessments
such as TIMSS and PISA, ensuring linguistic and cultural equivalence across the various national versions of the
assessment instruments has become an increasingly crucial challenge. For example, 58 countries participated in
the PISA 2006 Main Study. Within each country, the assessment instruments had to be adapted into each language
of instruction used in the sampled schools. All national versions in languages used for 5 per cent or more of the
target population (that is, a total of 77 versions in 42 different languages) were verified for equivalence against
the English and French source versions developed by the PISA consortium. Information gathered both through
the verification process and through empirical analyses of the data is used to adjudicate whether the
level of linguistic equivalence reached an acceptable standard in each participating country.
The paper briefly describes the procedures typically used in PISA to ensure high levels of translation/adaptation
accuracy, and then focuses on the development of the set of indicators that are used as criteria in the
equivalence adjudication exercise. Empirical data from the PISA 2005 Field Trial are used to illustrate both
the analyses and the major conclusions reached.
****
Ameliorating Culturally Based Extreme Response Tendencies to Attitude Items
Maurice Walker
Abstract
Using data from the PISA 2006 field trial, Rasch item response models are used to demonstrate that an extreme
response tendency was exhibited differentially across culturally distinct countries when answering Likert-type
attitude items. A single attitude scale is examined across eight culturally distinct countries in this paper. Two
avenues for ameliorating this tendency are explored: first, using dichotomous variants of the items, and second,
incorporating the country-specific response tendency into the Rasch item response model. Analysis of the item
variants reveals similar scale outcomes and correlations with achievement, but a preference for the Likert variant
when test information is considered. A hierarchical analysis using facet models reveals that the data fit significantly
better under a model that incorporates an interaction effect between the country and the item delta parameters.
The implications for reporting attitudes measured with Likert items across cultures are outlined.
****
The Impact of Differential Investment of Student Effort on the Outcomes of International Studies
Jayne Butler and Raymond J. Adams
Abstract
International comparative assessments of student achievement, such as the Trends in International Mathematics
and Science Study (TIMSS) and the Programme for International Student Assessment (PISA), are becoming
increasingly important in the development of evidence-based education policy. The potentially far-reaching
influence of such studies underscores the need for these assessments to be valid and reliable. In education,
increasing recognition is being given to motivational factors that affect student learning. This research considers
a possible threat to the validity of such studies by investigating the influence that the amount of effort invested
by test-takers has on their outcomes. Reassuringly, it is found that the reported expenditure of effort by students
is fairly stable across countries. This finding counters the claim that systematic cultural differences in the effort
expended by students invalidate international comparisons. Reported effort expenditure is related to reading
achievement, with an effect size similar to that of variables such as single-parent family structure, gender, and
socioeconomic background. Finally, when reporting trends, taking effort into account should be considered and may
well facilitate the interpretation of national and gender trends in reading achievement.
****
The Influence of Equating Methodology on Reported Trends in PISA
Eveline Gebhardt and Raymond J. Adams
Abstract
In 2005 PISA published trend indicators that compared the results of PISA 2000 and PISA 2003. In this paper
we explore the extent to which the outcomes of these trend analyses are sensitive to the choice of test equating
methodologies, the choice of regression models, and the choice of linking items. To establish trends, PISA
equated its 2000 and 2003 tests using a methodology based on Rasch modelling that involved estimating linear
transformations that mapped 2003 Rasch-scaled scores to the previously established PISA 2000 Rasch-scaled
scores. In this paper we compare the outcomes of this approach with an alternative, which involves the joint
Rasch scaling of the PISA 2000 and PISA 2003 data separately for each country. Note that under this approach
the item parameters are estimated separately for each country, whereas the linear transformation approach used a
common set of item parameter estimates for all countries. Further, as its primary trend indicators, PISA reported
changes in mean scores between 2000 and 2003. These means are not adjusted for changes in the background
characteristics of the PISA 2000 and PISA 2003 samples – that is, they are marginal rather than conditional
means. The use of conditional rather than marginal means results in some differing conclusions regarding trends
at both the country and within-country level.
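Because the Rasch model fixes the scale's unit, the linear transformation in a common-item linking step of this kind often reduces to a shift. As a generic illustration (not necessarily PISA's exact computation), with L link items whose difficulties are estimated on both scales,

\[
\theta^{*}_{2003} = \theta_{2003} + B, \qquad
B = \frac{1}{L}\sum_{i=1}^{L}\bigl(\hat{\delta}_i^{\,2000} - \hat{\delta}_i^{\,2003}\bigr),
\]

so that the 2003 scores are shifted by the mean difference between the link items' difficulty estimates on the two scales.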
****
The Computation of Equating Errors in International Surveys in Education
Christian Monseur and Alla Berezner
Abstract
Since the IEA’s Third International Mathematics and Science Study, one of the major objectives of international
surveys in education has been to report trends in achievement. The names of the two current IEA surveys
reflect this growing interest: Trends in International Mathematics and Science Study (TIMSS) and Progress in
International Reading Literacy Study (PIRLS). Similarly, a central concern of the OECD's PISA is with trends
in outcomes over time. To facilitate trend analyses, these studies link their tests using common-item equating
in conjunction with item response modelling methods.
IEA and PISA policies differ in terms of reporting the error associated with trends. In IEA surveys, the
standard errors of the trend estimates do not include the uncertainty associated with the linking step while PISA
does include a linking error component in the standard errors of trend estimates. In other words, PISA implicitly
acknowledges that trend estimates partly depend on the selected common items, while the IEA’s surveys do
not recognise this source of error.
Failing to recognise the linking error leads to an underestimation of the standard errors and thus increases
the Type I error rate, thereby resulting in changes in achievement being reported as significant when in fact
they are not. The growing interest of policy makers in trend indicators, and the impact of trend results on the
evaluation of educational reforms, appear to be incompatible with such underestimation.
However, the procedure implemented by PISA raises a few issues about the underlying assumptions for the
computation of the equating error.
After a brief introduction, this paper will describe the procedure PISA implemented to compute the linking
error. The underlying assumptions of this procedure will then be discussed. Finally, an alternative method based
on replication techniques will be presented, first evaluated in a simulation study and then applied to the PISA 2000 data.
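One generic way to see where a linking error component comes from (offered as an illustration, not as PISA's exact formula) is to treat the L link items as a random sample from a pool of possible link items. The estimated shift between scales is then a mean of item-parameter differences, and its standard error is

\[
d_i = \hat{\delta}_i^{\,2003} - \hat{\delta}_i^{\,2000}, \qquad
s_d^2 = \frac{1}{L-1}\sum_{i=1}^{L}\bigl(d_i - \bar{d}\bigr)^2, \qquad
SE_{\text{link}} = \frac{s_d}{\sqrt{L}},
\]

which is the kind of component that can be added to the sampling variance of a trend estimate.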
Vol. 8, No. 4 Winter 2007
Nonequivalent Survey Consolidation: An Example From Functional Caregiving
Nikolaus Bezruczko and Shu-Pi C. Chen
Abstract
Functional Caregiving (FC) is a construct about mothers caring for children (both young and old) with intellectual
disabilities; it is operationally defined by two nonequivalent survey forms, urban and suburban.
The purposes of this research are, first, to generalize school-based achievement test principles to
survey methods by equating two nonequivalent survey forms, and second, to expand FC foundations by
a) establishing linear measurement properties for new caregiving items, b) replicating a hierarchical item structure
across an urban, school-based population, c) consolidating survey forms to establish a calibrated item bank, and
d) collecting further external construct validity data. Results supported invariant item parameters of a fixed item
form (96 items) for two urban samples (N = 186). FC measures also showed expected construct relationships
with age, mental depression, and health status. However, only five common items between the urban and suburban
forms were statistically stable, because suburban mothers' and children's ages appear to interact with medical
information and social activities.
****
Mindfulness Practice: A Rasch Variable Construct Innovation
Sharon G. Solloway and William P. Fisher, Jr.
Abstract
Is it possible to establish a consistent, stable relationship between the structure of number and additive amounts of
mindfulness practice? A bank of thirty items, constructed from a review of the literature and from novice practitioners'
journal responses to mindfulness practice, comprised the instrument. A convenience sample of students in a teacher
education program participated. The WINSTEPS Rasch measurement software was used for all analyses. Measurement
separation reliability was 0.92 and item separation reliability was 0.98, with satisfactory model fit. The 30 items
measure a single construct of mindfulness practice. Construct validity was supported by the meaningfulness of the
ordering of the items from easy to hard. The same scale was produced when the items were calibrated separately on the T1
and T2 groups (R² = 0.83). The experimental group's T2 measures were significantly different from both its own
T1 measures and the control group's T1 and T2 measures. ANOVA showed significant variance between the
experimental and control groups at T2 (F = 43.66, 151 d.f., p < .001) for a nearly two-logit (20 unit) difference (48.9
vs. 68.0). The study is innovative in its demonstration of mindfulness practice as a measurable variable.
****
Substance Use Disorder Symptoms: Evidence of
Differential Item Functioning by Age
Kendon J. Conrad, Michael L. Dennis, Nikolaus Bezruczko, Rodney R. Funk, and Barth B. Riley
Abstract
This study examined the applicability of substance abuse diagnostic criteria for adolescents, young adults, and
adults using the Global Appraisal of Individual Needs' (GAIN) Substance Problem Scale (SPS), with data from
7,408 clients. Rasch analysis was used to: 1) evaluate whether the SPS operationalized a single reliable dimension,
and 2) examine the extent to which the severity of each symptom and the overall test functioned the same or
differently by age. Rasch analysis indicated that the SPS was unidimensional, with a person reliability of .84.
Eight symptoms differed significantly in severity between adolescents and adults. Young adult calibrations tended
to fall between those of adolescents and adults. Differential test functioning was clinically negligible for
adolescents but resulted in about 7% more adults being classified as high need. These findings have theoretical
implications for the screening and treatment of adolescents vs. adults. The SPS can be used across age groups,
though age-specific calibrations enable greater precision of measurement.
****
A Monte Carlo Study of the
Impact of Missing Data and Differential Item Functioning on Theta Estimates from
Two Polytomous Rasch Family Models
Carolyn F. Furlow, Rachel T. Fouladi, Phill Gagné, and Tiffany A. Whittaker
Abstract
This paper examines the impact of differential item functioning (DIF), missing item values, and different methods
for handling missing item values on theta estimates, with data simulated from the partial credit model and
Andrich's rating scale model. Both Rasch family models are commonly used when obtaining an estimate of a
respondent's attitude. The degree of missing data, the DIF magnitude, and the percentage of DIF items were varied
in missing-completely-at-random (MCAR) data conditions in which the focal group was 10% of the total population.
Four methods for handling missing data were compared: complete-case analysis, mean substitution, hot-decking,
and multiple imputation. Bias, RMSE, means, and standard errors of the theta estimates for the focal group were
adversely affected by the percentage and magnitude of DIF items. RMSE and fidelity coefficients for both the
reference and focal groups were adversely affected by the amount of missing data. While all methods of handling
missing data performed fairly similarly, multiple imputation and hot-decking showed slightly better performance.
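Of the four methods, hot-decking may be the least familiar: each missing response is filled in with an observed response from a similar "donor" respondent. The following Python sketch illustrates the idea; the nearest-neighbour matching rule and the simulated data are assumptions of this illustration, not details taken from the study.

```python
import numpy as np

def hot_deck(responses, rng=None):
    """Fill missing item responses (np.nan) from similar complete 'donor' respondents.

    responses : (n_persons, n_items) array, np.nan marking missing values.
    Assumes at least one respondent has no missing data.
    """
    rng = rng or np.random.default_rng()
    filled = responses.copy()
    complete = ~np.isnan(responses).any(axis=1)   # rows with no missing data
    donors = responses[complete]
    for n in np.flatnonzero(~complete):
        observed = ~np.isnan(responses[n])
        # Mean absolute distance to each donor, over the recipient's observed items
        dist = np.abs(donors[:, observed] - responses[n, observed]).mean(axis=1)
        nearest = np.flatnonzero(dist == dist.min())
        donor = donors[rng.choice(nearest)]       # break ties at random
        filled[n, ~observed] = donor[~observed]
    return filled

# Hypothetical demonstration: 5-category rating data with 10% MCAR missingness
rng = np.random.default_rng(1)
data = rng.integers(0, 5, size=(200, 10)).astype(float)
data[rng.random(data.shape) < 0.10] = np.nan
print(np.isnan(hot_deck(data, rng)).sum())  # 0: all gaps filled
```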
****
Investigation of 360-Degree Instrumentation
Effects: Application of the Rasch Rating Scale Model
John T. Kulas and Kelly M. Hannum
Abstract
Performance appraisals have frequently been investigated for inaccuracies attributable to raters. In addition to
rater contributions, problems in measurement can arise from properties of the instrumentation. If the administered
items do not parallel the full range of employee performance, a restriction of range can occur: employees are rated
similarly (with no distinction made) not because of rater error, but because of an instrument floor or ceiling effect.
A Rasch measurement procedure was applied to a 360-degree dataset in order to uncover potential instrumentation
effects. A ceiling effect of similar magnitude was found across rater categories. It is recommended that
performance appraisal practitioners consider including items of greater difficulty in their criterion measures.
****
Rasch Measurement of Self-Regulated Learning in an Information and
Communication Technology (ICT)-rich Environment
Joseph N. Njiru and Russell F. Waugh
Abstract
This report describes how a linear scale of self-regulated learning in an ICT-rich environment was created by
analysing student data using the Rasch measurement model. A convenience sample of university students in
Western Australia (N = 409) was used. The initial pool of 41 stem items, each answered from two perspectives
(I aim for this and I actually do this), was reduced to 16 items that fitted the measurement model and formed a
unidimensional scale. Items for motivation (extrinsic, intrinsic, and social rewards), academic goals
(fear of performing poorly, but not standards), self-learning beliefs (ability and interest), task management
(strategies and time management, but not cooperative learning), volition (action control, but not environmental
control), and self-evaluation (cognitive self-evaluation and metacognition) fitted the measurement model. The
proportion of observed variance considered true was 0.90. A new instrument is proposed to handle the conceptually
valid but non-fitting items. Characteristics of highly self-regulated learners are measured.
****
The Saltus Model Applied to Proportional Reasoning Data
Karen Draney
Abstract
This article examines an application of the saltus model, a mixture model designed for the analysis
of developmental data. Some background on the types of research for which such a model might be useful is
discussed. The equations defining the model are given, as well as the model's relationship to the Rasch model
and to other mixture models. An application of the saltus model to an example data set, collected using Noelting's
orange juice mixture tasks, is examined in detail, along with the control files necessary to run the estimation
software and the output file it produced.
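As a sketch of the model's structure (following Wilson's saltus formulation; the notation here is generic rather than taken from the article): for a person n in latent developmental group g answering item i of item class h(i), the Rasch kernel acquires a group-by-class shift parameter tau, and responses are mixed over groups with proportions pi_g:

\[
P(X_{ni} = 1 \mid g) = \frac{\exp\bigl(\theta_n - \delta_i + \tau_{g h(i)}\bigr)}{1 + \exp\bigl(\theta_n - \delta_i + \tau_{g h(i)}\bigr)}, \qquad
P(X_{ni} = 1) = \sum_{g} \pi_g \, P(X_{ni} = 1 \mid g).
\]

Setting all tau parameters to zero recovers the ordinary Rasch model, which is the sense in which the saltus model nests it.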