Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 14, 2013 Article Abstracts
Vol. 14, No. 1 Spring 2013
A Bootstrap Approach to Evaluating Person and Item Fit to the Rasch Model
Edward W. Wolfe
Abstract
Historically, rule-of-thumb critical values have been employed for interpreting fit statistics that depict anomalous
person and item response patterns in applications of the Rasch model. Unfortunately, prior research has shown
that these values are not appropriate in many contexts. This article introduces a bootstrap procedure for identifying
reasonable critical values for Rasch fit statistics and compares the results of that procedure to applications
of rule-of-thumb critical values for three example datasets. The results indicate that rule-of-thumb values may
over- or under-identify the number of misfitting items or persons.
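To make the bootstrap procedure concrete, the sketch below simulates model-conforming datasets from calibrated parameters and takes empirical quantiles of an item fit statistic as critical values. It is a minimal illustration in Python, assuming a dichotomous Rasch model and outfit mean squares; the parameter values and function names are illustrative rather than taken from the article, and a fuller implementation would re-estimate person and item parameters for each replicate.

import numpy as np

rng = np.random.default_rng(0)

def rasch_prob(theta, b):
    # P(X = 1) under the dichotomous Rasch model; persons in rows, items in columns.
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def item_outfit(x, p):
    # Outfit mean square per item: average squared standardized residual.
    z2 = (x - p) ** 2 / (p * (1.0 - p))
    return z2.mean(axis=0)

# Calibrated parameters from the observed data (placeholder values here).
theta_hat = rng.normal(0.0, 1.0, size=500)   # person measures
b_hat = np.linspace(-2.0, 2.0, 20)           # item difficulties
p = rasch_prob(theta_hat, b_hat)

# Simulate B model-conforming datasets and collect the fit statistics.
B = 1000
boot = np.empty((B, b_hat.size))
for r in range(B):
    x = rng.binomial(1, p)
    boot[r] = item_outfit(x, p)

# Empirical quantiles serve as dataset-specific critical values,
# replacing rule-of-thumb cutoffs such as 0.7 and 1.3.
crit_lo, crit_hi = np.percentile(boot, [2.5, 97.5], axis=0)
print(crit_lo.round(2), crit_hi.round(2))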
****
Using the Rasch Measurement Model to Design a Report Writing Assessment Instrument
Wayne R. Carlson
Abstract
This paper describes how the Rasch measurement model was used to develop an assessment instrument designed
to measure student ability to write law enforcement incident and investigative reports. The ability to
write reports is a requirement of all law enforcement recruits in the state of Michigan and is a part of the state’s
mandatory basic training curriculum, which is promulgated by the Michigan Commission on Law Enforcement
Standards (MCOLES). Recently, MCOLES conducted research to modernize its training and testing in the area
of report writing. A structured validation process was used, which included: a) an examination of the job tasks
of a patrol officer, b) input from content experts, c) a review of the professional research, and d) the creation of
an instrument to measure student competency. The Rasch model addressed several measurement principles that
were central to construct validity, which were particularly useful for assessing student performances. Based on
the results of the report writing validation project, the state established a legitimate connectivity between the
report writing standard and the essential job functions of a patrol officer in Michigan. The project also produced
an authentic instrument for measuring minimum levels of report writing competency, which generated results
that are valid for inferences of student ability. Ultimately, the state of Michigan must ensure the safety of its
citizens by licensing only those patrol officers who possess a minimum level of core competency. Maintaining
the validity and reliability of both the training and testing processes can ensure that the system for producing
such candidates functions as intended.
****
Using Multidimensional Rasch to Enhance Measurement Precision:
Initial Results from Simulation and Empirical Studies
Magdalena Mo Ching Mok and Kun Xu
Abstract
This study aimed to explore the effect on measurement precision of multidimensional, as compared with unidimensional,
Rasch measurement for constructing measures from multidimensional Likert-type scales. Many
educational and psychological tests are multidimensional, but common practice is to ignore correlations among the latent traits in these multidimensional scales in the measurement process. This practice may have serious validity and reliability implications. This study made use of both empirical data from 208,083 students and simulated data generated under 24 systematic combinations, each replicated 1,000 times, of three conditions (sample size, degree of dimensionality, and scale length) to compare unidimensional and multidimensional approaches and to identify the effects of sample size, dimensionality, and scale length on measurement precision. Results showed that the multidimensional Rasch approach yielded more precise estimates than did the unidimensional approach when the two dimensions were strongly correlated. The effect was more pronounced for long scales.
****
Using the Dichotomous Rasch Model to Analyze Polytomous Items
Qingping He and Chris Wheadon
Abstract
One of the most important applications of the Rasch measurement models in educational assessment is the
equating of tests. An important feature of attainment tests is the use of both dichotomous and polytomous items.
The partial credit model (PCM) developed by Masters (1982) represents an extension of the dichotomous Rasch
model for analysing polytomous item data. The dichotomous Rasch model has been used primarily to analyse
dichotomous item data. Whilst the partial credit model can provide detailed information on the performance of
individual score categories of polytomous items, it is mathematically more complex to use than the dichotomous
Rasch model and can, under certain circumstances, present difficulties in interpreting item measures and
in practical applications. This study explores the potential of using the dichotomous Rasch model to analyse
polytomous items and equate tests. Results obtained from a simulation study and from analysing the data of a
science achievement test indicate that the partial credit model and the dichotomous Rasch model produce similar
item and person measures and equivalent cut scores on different test forms.
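For reference, the two models being compared can be written side by side. For person n and item i, the dichotomous Rasch model and Masters's (1982) partial credit model for an item with score categories 0, ..., M_i are, in LaTeX notation,

$$P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}, \qquad P(X_{ni}=k) = \frac{\exp \sum_{j=0}^{k} (\theta_n - \delta_{ij})}{\sum_{m=0}^{M_i} \exp \sum_{j=0}^{m} (\theta_n - \delta_{ij})},$$

where \theta_n is the person measure, \delta_i the item difficulty, \delta_{ij} the j-th step difficulty, and the j = 0 term is defined to be zero. The PCM reduces to the dichotomous model when M_i = 1.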
****
With Hiccups and Bumps: The Development of a Rasch-based Instrument
to Measure Elementary Students’ Understanding of the Nature of Science
Shelagh M. Peoples, Laura M. O’Dwyer, Katherine A. Shields, and Yang Wang
Abstract
This research describes the development process, psychometric analyses, and partial validation study of a theoretically-grounded Rasch-based instrument, the Nature of Science Instrument-Elementary (NOSI-E). The NOSI-E was
designed to measure elementary students’ understanding of the Nature of Science (NOS). Evidence is provided for
three of the six validity aspects (content, substantive and generalizability) needed to support the construct validity
of the NOSI-E. A future article will examine the structural and external validity aspects. Rasch modeling proved
especially productive in scale improvement efforts. The instrument, designed for large-scale assessment use, is
conceptualized using five construct domains. Data from 741 elementary students were used to pilot the Rasch
scale, with continuous improvements made over three successive administrations. The psychometric properties
of the NOSI-E instrument are consistent with the basic assumptions of Rasch measurement, namely that the
items are well-fitting and invariant. Items from each of the five domains (Empirical, Theory-Laden, Certainty,
Inventive, and Socially and Culturally Embedded) are spread along the scale’s continuum and appear to overlap
well. Most importantly, the scale seems appropriately calibrated and responsive for elementary school-aged
children, the target age group. As a result, the NOSI-E should prove beneficial for science education research.
As the United States’ science education reform efforts move toward students’ learning science through engaging
in authentic scientific practices (NRC, 2011), it will be important to assess whether this new approach to teaching
science is effective. The NOSI-E can be used as one measure of whether this reform effort has an impact.
****
Application of Single-level and Multi-level Rasch Models using the lme4 Package
Iasonas Lamprianou
Abstract
The aim of the article is to illustrate how researchers may use the lme4 package to run multilevel Rasch models. The lme4 package is popular open-source software, frequently used by researchers around the world to fit generalized mixed-effects models with crossed or partially crossed random effects. The article starts with a short discussion of the reasons why a researcher might sometimes be motivated to use a multilevel Rasch model and presents a practical example using empirical data. The main features of the lme4 package are presented, and finally, the paper presents information about other open-source software that could alternatively be used to fit multilevel Rasch models.
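As background for readers, the Rasch model is a generalized linear (mixed) model with a logit link, which is the formulation lme4 exploits; in lme4's R syntax this is typically written along the lines of glmer(resp ~ -1 + item + (1 | person), family = binomial). The minimal sketch below uses Python's statsmodels, rather than R, to fit the analogous fixed-effects (joint maximum likelihood) formulation on simulated data; the data and variable names are illustrative, not from the article, and a true multilevel analysis would add further random effects (e.g., for schools).

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulate dichotomous responses from a Rasch model (illustrative only).
n_persons, n_items = 200, 10
theta = rng.normal(0.0, 1.0, n_persons)      # person abilities
b = np.linspace(-1.5, 1.5, n_items)          # item difficulties
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
x = rng.binomial(1, p)

# Long format, one row per person-item response, as lme4 also expects.
df = pd.DataFrame({
    "resp": x.ravel(),
    "person": np.repeat(np.arange(n_persons), n_items),
    "item": np.tile(np.arange(n_items), n_persons),
})

# Drop persons with perfect or zero scores; their estimates diverge.
totals = df.groupby("person")["resp"].transform("sum")
df = df[(totals > 0) & (totals < n_items)]

# Logistic model with item and person dummies; item coefficients
# estimate easiness up to the scale's arbitrary origin, so negated
# coefficients are difficulties. (A GLMM, as fitted by glmer, would
# instead treat persons as random effects.)
fit = smf.glm("resp ~ 0 + C(item) + C(person)", data=df,
              family=sm.families.Binomial()).fit()
print((-fit.params.filter(like="item")).round(2))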
****
Rasch Modeling to Assess Albanian and South African Learners’ Preferences
for Real-life Situations to be Used in Mathematics: A Pilot Study
Suela Kacerja, Cyril Julie, and Said Hadjerrouit
Abstract
This paper reports on an investigation of the real-life situations that students in grades 8 and 9 in South Africa and Albania prefer to use in Mathematics. Rasch modeling techniques are used to assess the functioning of the instrument, which measures the order of preference that learners from the two countries have for contextual situations. For both cohorts, the data fit the Rasch model. The differential item functioning (DIF) analysis identified three items operating differentially for the two cohorts. Explanations for these differences are provided in terms of differences in the experiences learners in the two countries have had with some of the contextual situations. Implications for the interpretation of international comparative tests are offered, as are possibilities for the cross-country development of curriculum materials related to contexts that learners prefer to use in Mathematics.
****
Vol. 14, No. 2 Summer 2013
Adaptive Testing for Psychological Assessment: How Many Items Are Enough to Run an Adaptive Testing Algorithm?
Michaela M. Wagner-Menghin and Geoff N. Masters
Abstract
Although the principles of adaptive testing were established in the psychometric literature many years ago (e.g., Weiss, 1977), and the practice of adaptive testing is established in educational assessment, it is not yet widespread in psychological assessment. One obstacle to adaptive psychological testing is a lack of clarity about the number of items necessary to run an adaptive algorithm. The study explores the relationship between item bank size, test
length and measurement precision. Simulated adaptive test runs (allowing a maximum of 30 items per person)
out of an item bank with 10 items per ability level (covering .5 logits, 150 items total) yield a standard error
of measurement (SEM) of .47 (.39) after an average of 20 (29) items for 85-93% (64-82%) of the simulated
rectangular sample. Expanding the bank to 20 items per level (300 items total) did not improve the algorithm’s
performance significantly. With a small item bank (5 items per ability level, 75 items total) it is possible to reach the same SEM as with a conventional test but with fewer items, or a better SEM with the same number of items.
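To connect item counts with the reported SEM values, note that under the Rasch model the standard error of a person measure is the inverse square root of the test information summed over the administered items; in LaTeX notation,

$$\mathrm{SEM}(\hat{\theta}) = \Bigl( \sum_i P_i(\hat{\theta})\,[1 - P_i(\hat{\theta})] \Bigr)^{-1/2}.$$

An SEM of .47 therefore corresponds to test information of about 1/.47^2 ≈ 4.5; since a dichotomous Rasch item contributes at most .25 information (at P = .5), roughly 18 perfectly targeted items are needed, consistent with the average of 20 adaptive items reported above.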
****
DIF Cancellation in the Rasch Model
Adam E. Wyse
Abstract
Differential item functioning (DIF) cancellation occurs when the cumulative effect of an item or set of items
exhibiting DIF against one subgroup cancels with other items that exhibit DIF against the comparison group
and hence results in non-existent DIF at the test level. This paper investigates DIF cancellation in the context
of Rasch measurement. It is shown that this phenomenon is not a property of the Rasch model, but rather, a
function of the manner in which item parameters are estimated and the way that DIF impacts these estimates.
The conditions under which DIF cancellation would exist when using the Rasch model are suggested and a
proof is provided to support this suggestion. Empirical examples are provided to refute prior suggestions that
DIF cancellation always exists if the Rasch model is used.
****
Multidimensional Diagnostic Perspective on Academic Achievement Goal Orientation Structure, Using the Rasch Measurement Models
Daeryong Seo, Husein Taherbhai, and Insu Paek
Abstract
This study is designed to investigate a multidimensional structure of academic achievement goal orientations
from a diagnostic perspective, using the Rasch measurement models. A data set of Korean students who responded
to the Patterns of Adaptive Learning Survey (PALS) was analyzed. Both consecutive unidimensional
and multidimensional Rasch measurement models were applied for comparative purposes. Each goal orientation
dimension (i.e., the attitude) was standardized and then classified into three categorical levels, i.e., low, middle
and high. These categorizations of goal dimensions were used to examine the role of students' performance-approach goals on mathematics achievement in relation to the other achievement goals. Results indicate that
the multidimensional partial credit model was the best model with respect to the fit of the data to the models.
Findings of the current study also demonstrate that practitioners who need specific feedback for instruction and/
or intervention can benefit from the multidimensional approach.
****
An Extension of a Bayesian Approach to Detect Differential Item Functioning
Sandip Sinharay
Abstract
The application of the existing test statistics to determine differential item functioning (DIF) requires large
samples, but test administrators often face the challenge of detecting DIF with small samples. One advantage
of a Bayesian approach over a frequentist approach is that the former can incorporate, in the form of a prior
distribution, existing information on the inference problem at hand. Sinharay, Dorans, Grant, and Blew (2009)
suggested the use of information from past data sets as a prior distribution in a Bayesian DIF analysis. This
paper suggests an extension of the method of Sinharay et al. (2009). The suggested extension is compared to
the existing DIF detection methods in a realistic simulation study.
****
The Development of the de Morton Mobility Index (DEMMI) in an Older Acute Medical Population: Item Reduction using the Rasch Model (Part 1)
Natalie A. de Morton, Megan Davidson, and Jennifer L. Keating
Abstract
The DEMMI (de Morton Mobility Index) is a new and advanced instrument for measuring the mobility of all older
adults across clinical settings. It overcomes practical and clinimetric limitations of existing mobility instruments.
This study reports the process of item reduction using the Rasch model in the development of the DEMMI. Prior
to this study, qualitative methods were employed to generate a pool of 51 items for potential inclusion in the
DEMMI. The aim of this study was to reduce the item set to a unidimensional subset of items that ranged across
the mobility spectrum from bed bound to high levels of independent mobility. Fifty-one physical performance
mobility items were tested in a sample of older acute medical patients. A total of 215 mobility assessments were
performed. Seventeen mobility items that spanned the mobility spectrum were selected for inclusion in the new
instrument. The 17-item scale fitted the Rasch model. Items operated consistently across the mobility spectrum
regardless of patient age, gender, cognition, primary language or time of administration during hospitalisation.
Using the Rasch model, an interval level scoring system was developed with a score range of 0 to 100.
****
A Comparison of Confirmatory Factor Analysis and Multidimensional Rasch Models to Investigate the Dimensionality of Test-Taking Motivation
Christine E. DeMars
Abstract
Using a scale of test-taking motivation designed to have multiple factors, results are compared from a confirmatory
factor analysis (CFA) using LISREL and a multidimensional Rasch partial credit model using ConQuest. Both
types of analyses work with latent factors and allow the comparison of nested models. CFA most typically models a linear relationship between observed and latent variables, while Rasch models specify a non-linear one. The CFA software provides many more measures of overall
fit than ConQuest, which is focused more on the fit of individual items. Despite the conceptual differences in
these techniques, the results were similar. The data fit a three-dimensional model better than the one-dimensional
or two-dimensional models also hypothesized, although some misfit remained.
****
Measuring Alternative Learning Outcomes: Dispositions to Study in Higher Education
Maria Pampaka, Julian Williams, Graeme Hutcheson, Laura Black, Pauline Davis, Paul Hernandez-Martinez, and Geoff Wake
Abstract
In this paper we describe the validation of two scales constructed to measure pre-university students’ changing
disposition (i) to enter Higher Education (HE) and (ii) to further study mathematically-demanding subjects. Items
were selected drawing on interview data, and on a model of disposition as socially- as well as self- attributed.
Rasch analyses showed that the two scales each produce robust one-dimensional measures of what we call 'strength of commitment to enter HE' and 'disposition to study mathematically-demanding subjects further', respectively. However, the former scale was initially found to suffer psychometrically from a ceiling effect,
which we ‘corrected’ by adding some harder items at a later data point, and revised the scale according to our
interpretation of subsequent results. We finally discuss the potential significance of the constructed measures of
learning outcomes, as variables in monitoring or even explaining students’ progress into different subjects in HE.
****
Vol. 14, No. 3 Fall 2013
The Development of the de Morton Mobility Index (DEMMI) in an
Independent Sample of Older Acute Medical Patients: Refinement and Validation using the Rasch Model (Part 2)
Natalie A. de Morton, Megan Davidson, and Jennifer L. Keating
Abstract
This study describes the refinement and validation of the 17-item DEMMI in an independent sample of older acute
medical patients. Instrument refinement was based on Rasch analysis and input from clinicians and researchers.
The refined DEMMI was tested on 106 older general medical patients and a total of 312 mobility assessments
were conducted. Based on the results of this study, a further two items were removed and the 15-item DEMMI was
adopted. The Rasch measurement properties of the DEMMI were consistent with estimates obtained from the
instrument development sample. No differential item functioning was identified and an interval level scoring
system was established. The DEMMI is the first mobility instrument for older people to be developed, refined
and validated using the Rasch model. This study confirms that the DEMMI provides clinicians and researchers
with a unidimensional instrument for measuring and monitoring changes in mobility of hospitalised older acute
medical patients.
****
Rasch Modeling of Accuracy and Confidence Measures from Cognitive Tests
Insu Paek, Jihyun Lee, Lazar Stankov, and Mark Wilson
Abstract
IRT models have not been rigorously applied in studies of the relationship between test-takers' confidence and accuracy. This study applied Rasch measurement models to investigate the relationship between test-takers' confidence and accuracy on English proficiency tests, proposing potentially useful measures of under- or overconfidence. The Rasch approach provided the scaffolding to formulate indices that can assess the
discrepancy between confidence and accuracy at the item or total test level, as well as at particular ability levels
locally. In addition, a “disattenuated” measure of association between accuracy and confidence, which takes
measurement error into account, was obtained through a multidimensional Rasch modeling of the two constructs
where the latent variance-covariance structure is directly estimated from the data. The results indicate that the
participants tend to show overconfidence bias in their own cognitive abilities.
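For readers unfamiliar with disattenuation, the classical correction divides the observed correlation by the square root of the product of the two reliabilities; in LaTeX notation,

$$r_{T_x T_y} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}},$$

where r_{xy} is the observed accuracy-confidence correlation and r_{xx}, r_{yy} are the score reliabilities. The multidimensional Rasch approach described above reaches the same goal more directly by estimating the latent variance-covariance matrix of the two constructs within a single model.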
****
Baselines for the Pan-Canadian Science Curriculum Framework
Xiufeng Liu
Abstract
Using a Canadian student achievement assessment database, the Science Achievement Indicators Program
(SAIP), and employing the Rasch partial credit measurement model, this study estimated the difficulties of
items corresponding to the learning outcomes in the Pan-Canadian science curriculum framework and the latent
abilities of students of grades 7, 8, 10, 11, 12 and OAC (Ontario Academic Course). The above estimates serve
as baselines for validating the Pan-Canadian science curriculum framework in terms of the learning progression
of learning outcomes and expected mastery of learning outcomes by grade. It was found that there was no statistically significant progression in learning outcomes from grades 4-6 to grades 7-9, or from grades 7-9 to grades 10-12, and that the curriculum framework sets mastery expectations about two grades higher than students' potential abilities. In light of these findings, this paper discusses theoretical issues related to deciding the progression of learning outcomes and setting expectations of student mastery of learning outcomes, and highlights the importance of using national assessment data to establish baselines for these purposes. This paper concludes with
recommendations for further validating the Pan-Canadian science curriculum frameworks.
****
An Experimental Study Using Rasch Analysis to Compare Absolute Magnitude Estimation and Categorical Rating Scaling as Applied in Survey Research
Kristin L. K. Koskey, Toni A. Sondergeld, Svetlana A. Beltyukova, and Christine M. Fox
Abstract
Limited research has applied a measurement model to compare the rating scale functioning of categorical rating
scaling (CRS) and absolute magnitude estimation scaling (MES) when rating subjective stimuli. We used an
experimental design and applied the Rasch model to the survey data, with each respondent rating items using
MES and one of four commonly used agreement-disagreement rating scales. The results indicated that the CRS
and MES data were comparable in person and item separation and reliability when the respondents’ scales were
known. MES had lower standard errors for persons and items; however, MES had disordered step calibrations. Finally, the respondents reported preferring CRS to MES.
****
Developing Two Instruments to Measure Attitudes of Vietnamese Parents and Students toward Schooling
Thi Kim Cuc Nguyen and Patrick Griffin
Abstract
The attitudes of parents and students towards schooling are often considered to be important factors associated
with students’ educational outcomes. This article presents the process of constructing and calibrating two scales
to measure the attitudes of students and parents in Vietnam, and then linking these two scales to compare the
two groups. A set of items that covered both development and opportunity aspects of education was designed.
After the items were trialled, a final version of 13 items was compiled. The two scales yielded scores that were
shown to have logical, face, content and construct validity.
****
The Tendency of Individuals to Respond to High-Stakes Tests in Idiosyncratic Ways
Iasonas Lamprianou
Abstract
It has been frequently suggested that personal characteristics (e.g., language deficiencies, atypical schooling) may be responsible for the tendency of individuals to answer with aberrant response patterns on high-stakes tests.
This has not, however, been adequately validated using empirical data. This research uses datasets from seven
mathematics, English and science papers to investigate the consistency with which individuals respond aberrantly
across papers. Pupils who responded aberrantly on one paper were more likely to do so on other papers
on the same subject. Also, pupils who responded aberrantly on one paper of one subject were more likely to do
so on papers of another subject. Logistic multilevel models using the generation of aberrant response patterns
as a dependent variable have suggested non-negligible intra-pupil and intra-school correlations.
****
Development and Validation of the Sense of Competence Scale, Revised
Cara McFadden, Gary Skaggs, and Steven Janosik
Abstract
The purpose of this study was to develop an instrument to measure the sense of competence of traditional age
college students across the dimensions that define the construct. The Sense of Competence Scale-Revised (SCS-R)
was developed to provide a measure of Chickering’s (1969) first vector, an important psychosocial construct.
Administrators can use data from the instrument to modify an institution’s academic and social environment
to enhance the development of the intellectual, physical, and interpersonal competencies of college students.
During the development and validation, various aspects of the SCS-R were examined in accordance with the
validity framework outlined by Messick (1995). Of the six types of validity evidence proposed by Messick
(1995), four were the primary focus: content, substantive, structural and generalizability. The evidence generated
from the study suggested that the chosen items for the SCS-R support the validity of estimates of a student’s
personal assessment of their sense of competence.
****
Vol. 14, No. 4 Winter 2013
Application of the Rasch Model to Measuring the Performance of Cognitive Radios
Edward W. Wolfe, Carl B. Dietrich, and Garrett Vanhoy
Abstract
Cognitive radios (CRs) are recent technological developments that rely on artificial intelligence to adapt a radio’s
performance to suit environmental demands, such as sharing radio frequencies with other radios. Measuring the
performance of the cognitive engines (CEs) that underlie a CR’s performance is a challenge for those developing
CR technology. This simulation study illustrates how the Rasch model can be applied to the evaluation
of CRs. We simulated the responses of 50 CEs to 35 performance tasks and applied the Multidimensional Random Coefficients Multinomial Logit Model (MRCMLM) to those data. Our results indicate that CEs based
on different algorithms may exhibit differential performance across manipulated performance task parameters.
We found that a multidimensional mixture model may provide the best fit to the simulated data, and that the two simulated algorithms may respond differently to tasks that emphasize achieving high data throughput with less emphasis on power conservation than they do to other combinations of performance task characteristics.
****
Properties of Rasch Residual Fit Statistics
Margaret Wu and Raymond J. Adams
Abstract
This paper examines the residual-based fit statistics commonly used in Rasch measurement. In particular, the
paper analytically examines some of the theoretical properties of the residual-based fit statistics with a view to
establishing the inferences that can be made using these fit statistics. More specifically, the relationships between
the distributional properties of the fit statistics and sample size are discussed; some research that erroneously
concludes that residual-based fit statistics are unstable is reviewed; and finally, it is analytically illustrated that,
for dichotomous items, residual-based fit statistics provide a measure of the relative slope of empirical item
characteristic curves. With a clear understanding of the theoretical properties of the fit statistics, the use and
limitations of these statistics can be placed in the right light.
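For orientation, the residual-based statistics under discussion are conventionally defined, for a dichotomous item i administered to persons n = 1, ..., N, as, in LaTeX notation,

$$z_{ni} = \frac{x_{ni} - P_{ni}}{\sqrt{P_{ni}(1 - P_{ni})}}, \qquad \mathrm{Outfit}_i = \frac{1}{N} \sum_{n} z_{ni}^2, \qquad \mathrm{Infit}_i = \frac{\sum_{n} W_{ni}\, z_{ni}^2}{\sum_{n} W_{ni}},$$

where P_{ni} is the model probability of a correct response and W_{ni} = P_{ni}(1 - P_{ni}) is its model variance. The dependence of the null distributions of these statistics on sample size is what the paper examines analytically.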
****
Validating Workplace Performance Assessments in Health Sciences Students: A Case Study from Speech Pathology
Sue McAllister, Michelle Lincoln, Alison Ferguson, and Lindy McAllister
Abstract
Valid assessment of health science students’ ability to perform in the real world of workplace practice is critical
for promoting quality learning and ultimately certifying students as fit to enter the world of professional practice.
Current practice in performance assessment in the health sciences field has been hampered by multiple issues
regarding assessment content and process. Evidence for the validity of scores derived from assessment tools is usually evaluated against traditional validity categories, with reliability evidence privileged over validity evidence, resulting in the paradoxical effect of compromising the assessment validity and learning processes the assessments seek to promote. Furthermore, the dominant statistical approaches used to validate scores from these assessments fall under the umbrella of classical test theory. This paper reports on the successful national
development and validation of measures derived from an assessment of Australian speech pathology students’
performance in the workplace. Validation of these measures considered each of Messick’s interrelated validity
evidence categories and included using evidence generated through Rasch analyses to support score interpretation
and related action. This research demonstrated that it is possible to develop an assessment of real, complex, work-based performance of speech pathology students that generates valid measures without compromising the learning processes the assessment seeks to promote. The process described provides a model for other health
professional education programs to trial.
****
Rasch Analysis for the Evaluation of Rank of Student Response Time in Multiple Choice Examinations
James J. Thompson, Tong Yang, and Sheila W. Chauvin
Abstract
The availability of computerized testing has broadened the scope of person assessment beyond the usual accuracy-ability domain to include response time analyses. Because there are contexts in which speed is important, e.g.,
medical practice, it is important to develop tools by which individuals can be evaluated for speed. In this paper,
the ability of Rasch measurement to convert ordinal nonparametric rankings of speed to measures is examined
and compared to similar measures derived from parametric analysis of response times (pace) and semi-parametric
logarithmic time-scaling procedures. Assuming that similar spans of the measures were used, non-parametric
methods of raw ranking or percentile-ranking of persons by questions gave statistically acceptable person estimates
of speed virtually identical to the parametric or semi-parametric methods. Because no assumptions were
made about the underlying time distributions with ranking, generality of conclusions was enhanced. The main
drawbacks of the non-parametric ranking procedures were the lack of information on question duration and the model's overall assignment of variance to the person-by-question interaction.
****
Assessing DIF Among Small Samples with Separate Calibration t and Mantel-Haenszel Chi-Square Statistics in the Rasch Model
Ira Bernstein, Ellery Samuels, Ada Woo, and Sarah L. Hagge
Abstract
The National Council Licensure Examination (NCLEX) program has evaluated differential item functioning
(DIF) using the Mantel-Haenszel (M-H) chi-square statistic. Since a Rasch model is assumed, DIF implies a difference
in item difficulty between a reference group, e.g., White applicants, and a focal group, e.g., African-American
applicants. The National Council of State Boards of Nursing (NCSBN) is planning to change the statistic used
to evaluate DIF on the NCLEX from M-H to the separate calibration t-test (t). In theory, M-H and t should yield identical results in large samples if the assumptions of the Rasch model hold (Linacre and Wright, 1989; see also Smith, 1996). However, as is true throughout statistics, "how large is large" is undefined, so it is quite
possible that systematic differences exist in relatively smaller samples. This paper compares M-H and t in four
sets of computer simulations. Three simulations used a ten-item test with nine “fair” items and one potentially
containing DIF. To address instability that may result from a ten-item test, the fourth used a 30-item test with 29
“fair” items and one potentially containing DIF. Depending upon the simulation, the magnitude of population
DIF (0, .5, 1.0, and 1.5 z-score units), the ability difference between the focal and reference group (–1, 0, and 1
z-score units), the focal group size (0, 10, 20, 40, 50, 80, 160, and 1000), and the reference group size (500 and
1000) were varied. The results were that (a) differences in estimated DIF between the M-H and t statistics are generally small; (b) t tends to estimate lower chance probabilities than M-H with small sample sizes; (c) neither method is likely to detect DIF, especially when it is of slight magnitude and the focal group is small; and (d) M-H does marginally better than t at detecting DIF, but this improvement is also limited to very small focal group sizes.
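For reference, the separate calibration t statistic compared in these simulations takes the difference between an item's difficulty estimates from independent calibrations in the focal (F) and reference (R) groups, scaled by their combined standard errors; in LaTeX notation,

$$t_i = \frac{\hat{d}_{iF} - \hat{d}_{iR}}{\sqrt{SE_{iF}^2 + SE_{iR}^2}}.$$

Large absolute values of t_i flag the item as functioning differentially between the groups.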
****
Application of Latent Variable Models to the Rosenberg Self-Esteem Scale
Shing-On Leung and Hui-Ping Wu
Abstract
Latent variable models (LVM) are applied to the Rosenberg Self-Esteem Scale (RSES). Parameter estimates automatically take negative signs, hence no recoding is necessary for negatively scored items. Bad items can be located through parameter estimates, item characteristic curves, and other measures. Two factors are extracted, one reflecting self-esteem and the other the tendency to take moderate views, with the latter not often covered in previous studies. A goodness-of-fit measure based on two-way margins is used, but more work is needed. Results show that the scaling provided by models with a more formal statistical grounding correlates highly with the conventional method, which may provide justification for the usual practice.
****
A Rasch Analysis of the Statistical Anxiety Rating Scale
Eric D. Teman
Abstract
The conceptualization of a distinct construct known as statistics anxiety has led to the development of numerous
rating scales, including the Statistical Anxiety Rating Scale (STARS), designed to assess levels of statistics
anxiety. In the current study, the STARS was administered to a sample of 423 undergraduate and graduate students
from a midsized, western United States university. The Rasch measurement rating scale model was used
to analyze scores from the STARS. Misfitting items were removed from the analysis. In general, items from the
six subscales represented a broad range of abilities, with the major exception being a lack of items at the lower
extremes of the subscales. Additionally, a differential item functioning (DIF) analysis was performed across sex
and student classification. Several items displayed DIF, which indicates subgroups may ascribe different meanings
to those items. The paper concludes with several recommendations for researchers considering using the STARS.
****