Journal of Applied Measurement
P.O. Box 1283
Maple Grove, MN 55311
Volume 15, 2014 Article Abstracts
Vol. 15, No. 1 Spring 2014
Automatic Item Generation Implemented for Measuring Artistic Judgment Aptitude
Nikolaus Bezruczko
Abstract
Automatic item generation (AIG) is a broad class of methods being developed to address psychometric
issues arising from internet- and computer-based testing. These issues center on the efficiency, validity, and
diagnostic usefulness of large-scale mental testing. The rapid rise of AIG methods, and the perspective
on mental testing implicit in them, is bringing painful scrutiny to many sacred psychometric assumptions. This report
reviews basic AIG ideas, then presents conceptual foundations, image model development, and an operational
application to artistic judgment aptitude testing.
____________________
Comparison Is Key
Mark H. Stone and A. Jackson Stenner
Abstract
Several concepts from Georg Rasch’s last papers are discussed. The key one is comparison, because Rasch
considered the method of comparison fundamental to science. From the role of comparison stems scientific
inference, made operational by a properly developed frame of reference producing specific objectivity. The exact
specifications Rasch outlined for making comparisons are explicated from quotes, and the role of causality
derived from making comparisons is also examined. Understanding causality has implications for what can and
cannot be produced via Rasch measurement. His simple examples were instructive, but the implications are
far-reaching once the key role of comparison is established.
____________________
Rasch Model of a Dynamic Assessment: An Investigation of the Children’s Inferential Thinking Modifiability Test
Linda L. Rittner and Steven M. Pulos
Abstract
The purpose of this study was to develop a general procedure for evaluation of a dynamic assessment and to
demonstrate an analysis of a dynamic assessment, the CITM (Tzuriel, 1995b), as an objective measure for use
as a group assessment. The techniques used to determine the fit of the CITM to a Rasch partial credit model are
explicitly outlined. A modified format of the CITM was administered to 266 diverse second-grade students in
the USA; 58% of the participants were identified as low SES. The participants (n = 144 males) were White Anglo
and Latino American students (55%), many of whom were first-generation Mexican immigrants. The CITM was
found to fit a Rasch partial credit model (PCM) adequately, indicating that the CITM is a likely candidate for a
group-administered dynamic assessment that can be measured objectively. The data also supported the feasibility
of a model for objectively measuring change in learning ability for inferential thinking in the CITM.
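For reference, the partial credit model the CITM was fit to assigns each score category a probability driven by person ability and item step difficulties. A minimal Python sketch of those category probabilities follows, with hypothetical parameter values rather than estimates from the study:

```python
import numpy as np

def pcm_category_probs(theta, deltas):
    """Category probabilities under the Rasch partial credit model (PCM).

    theta  : person ability in logits
    deltas : step difficulties delta_1..delta_m for one item, in logits

    P(X = k) is proportional to exp(sum_{j <= k} (theta - delta_j)),
    with the empty sum for k = 0 defined as 0.
    """
    cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    e = np.exp(cum - cum.max())   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical 3-category item (scores 0, 1, 2) with steps -0.5 and 1.0
print(pcm_category_probs(theta=0.3, deltas=[-0.5, 1.0]))
```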
____________________
Performance Assessment of Higher Order Thinking
Patrick Griffin
Abstract
This article describes a study investigating the effect of intervention on student problem solving and higher-order
competency development using a series of complex numeracy performance tasks (Airasian and Russell,
2008). The tasks were sequenced to promote and monitor student development towards hypothetico-deductive
reasoning. Using Rasch partial credit analysis (Wright and Masters, 1982) to calibrate the tasks, and analysis of
residual gain scores to examine the effect of class and school membership, the study illustrates how directed
intervention can improve students’ higher-order competency skills. This paper demonstrates how the segmentation
defined by Wright and Masters can offer a basis for interpreting the construct underlying a test and how
segment definitions can inform targeted interventions. Implications for teacher intervention and teaching mentor
schemes are considered. The article also discusses multilevel regression models that differentiate class and
school effects, and describes a process for generating, testing, and using value-added models.
____________________
A Rasch Measure of Young Children’s Temperament (Negative Emotionality) in Hong Kong
Po Lin Becky Bailey-Lau and Russell F. Waugh
Abstract
An aspect of child behavior and temperament, called Negative Emotionality in the literature, is very important
to teachers of young children. The Children’s Behavior Questionnaire, initially designed by Rothbart,
Ahadi, Hershey, and Fisher (2001) for use in Western countries, was modified in line with Rasch measurement
theory, revised for suitability with Hong Kong preschool children, and conceptually ordered from easy to hard
along a continuum of attitude/behavior for negative emotionality, before data collection. Three ordered scoring
categories (never or rarely scored 1, on some occasions scored 2, and on many occasions scored 3) were used.
Data were collected from preschool teachers for N = 628 preschool children from 32 schools in Hong Kong
and analyzed with the 2010 Rasch unidimensional measurement model computer program (RUMM2030). The
item-trait interaction probability is 0.05 (chi-square = 101.88, df = 80), which indicates reasonable agreement
about the different difficulties of the items along the scale for all the children. Results and implications
are discussed, and revisions for the scale are suggested.
____________________
Snijders’s Correction of Infit and Outfit Indexes with Estimated Ability Level: An Analysis with the Rasch Model
David Magis, Sébastien Béland, and Gilles Raîche
Abstract
The Infit mean square W and the Outfit mean square U are commonly used person fit indexes under Rasch
measurement. However, they suffer from two major weaknesses. First, their asymptotic distribution is usually
derived by assuming that the true ability levels are known. Second, such distributions are not even clearly stated
for the indexes U and W. Both issues can seriously affect the selection of an appropriate cut-score for person fit
identification. Snijders (2001) proposed a general approach for correcting some person fit indexes when specific
ability estimators are used. The purpose of this paper is to adapt this approach to the U and W indexes. First, a
brief sketch of the methodology and its application to U and W is given. Then, the corrected indexes are
compared to their classical versions through a simulation study. The suggested correction yields controlled Type
I errors against both conservatism and inflation, while the power to detect specific misfitting response patterns
is significantly increased.
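For orientation, the classical (uncorrected) statistics the paper starts from are straightforward to compute at an ability estimate. The following minimal Python sketch shows the usual U and W mean squares under the dichotomous Rasch model, with hypothetical values; it does not implement Snijders's correction, which further adjusts the standardization:

```python
import numpy as np

def person_fit_uw(x, p):
    """Classical person fit mean squares under the dichotomous Rasch model.

    x : 0/1 response vector for one person
    p : model probabilities of a correct response on each item,
        evaluated at an ability estimate (not the true ability)

    Outfit U is the unweighted mean of squared standardized residuals;
    Infit W weights each squared residual by the item variance p(1 - p).
    """
    x, p = np.asarray(x, float), np.asarray(p, float)
    var = p * (1.0 - p)
    sq_res = (x - p) ** 2
    U = np.mean(sq_res / var)          # Outfit mean square
    W = sq_res.sum() / var.sum()       # Infit mean square
    return U, W

# Hypothetical person who misses the easiest item: both indexes exceed 1
print(person_fit_uw(x=[0, 1, 1, 1], p=[0.9, 0.7, 0.5, 0.3]))
```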
____________________
Optimal Discrimination Index and Discrimination Efficiency for Essay Questions
Wing-shing Chan
Abstract
Recommended guidelines for the discrimination index of multiple-choice questions are often applied indiscriminately
to essay-type questions as well. The optimal discrimination index for essay questions under a normality condition is
derived independently here. The satisfactory region for the discrimination index of essay questions with a passing mark at 50%
of the total is between 0.12 and 0.31, rather than the 0.40 or more recommended for multiple-choice questions. The optimal
discrimination index for essay questions is shown to increase in proportion to the range of scores. Discrimination
efficiency is defined as the ratio of the observed discrimination index to the optimal discrimination index.
Recommended guidelines for the discrimination index of essay questions are provided.
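To make the efficiency ratio concrete, the sketch below shows one common upper-lower form of the discrimination index for a polytomous (essay) item, along with the efficiency ratio defined in the abstract. The data and the 0.31 optimum are hypothetical, and the paper's normality-based derivation is not reproduced here:

```python
import numpy as np

def discrimination_index(item_scores, total_scores, max_item_score, frac=0.27):
    """One common upper-lower discrimination index for an essay item.

    Examinees are ranked by total test score; the index contrasts the mean
    item score of the top and bottom `frac` groups, scaled by the maximum
    attainable item score so that the index lies in [-1, 1].
    """
    order = np.argsort(total_scores)
    n = max(1, int(round(frac * len(order))))
    item = np.asarray(item_scores, float)
    return (item[order[-n:]].mean() - item[order[:n]].mean()) / max_item_score

def discrimination_efficiency(observed_d, optimal_d):
    """Efficiency as defined in the abstract: observed over optimal index."""
    return observed_d / optimal_d

# Hypothetical essay item scored 0-10 for eight examinees
d = discrimination_index(item_scores=[3, 5, 4, 6, 5, 4, 5, 5],
                         total_scores=[30, 55, 42, 90, 75, 35, 88, 60],
                         max_item_score=10)
print(d, discrimination_efficiency(d, optimal_d=0.31))
```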
____________________
Vol. 15, No. 2 Summer 2014
Examining Rating Scales Using Rasch and Mokken Models for Rater-Mediated Assessments
Stefanie A. Wind
Abstract
A variety of methods for evaluating the psychometric quality of rater-mediated assessments have been proposed,
including rater effects based on latent trait models (e.g., Engelhard, 2013; Wolfe, 2009). Although information about rater effects
contributes to the interpretation and use of rater-assigned scores, it is also important to consider ratings in terms of the structure
of the rating scale on which scores are assigned. Further, concern with the validity of rater-assigned scores necessitates investigation
of these quality control indices within student subgroups, such as gender, language, and race/ethnicity groups. Using a set of guidelines
for evaluating the interpretation and use of rating scales adapted from Linacre (1999, 2004), this study demonstrates methods that can be
used to examine rating scale functioning within and across student subgroups with indicators from Rasch measurement theory (Rasch, 1960)
and Mokken scale analysis (Mokken, 1971). Specifically, this study illustrates indices of rating scale effectiveness based on Rasch models
and models adapted from Mokken scaling, and considers whether the two approaches to evaluating the interpretation and use of rating scales
lead to comparable conclusions within the context of a large-scale rater-mediated writing assessment. Major findings suggest that indices
of rating scale effectiveness based on a parametric and nonparametric approach provide related, but slightly different, information about
the structure of rating scales. Implications for research, theory, and practice are discussed.
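Two of the Linacre guidelines invoked in this line of work, a minimum number of observations per category and average person measures that advance monotonically with category, lend themselves to compact screening. A minimal Python sketch under those assumptions (not the author's code):

```python
import numpy as np

def rating_scale_checks(ratings, measures, min_count=10):
    """Screen two guidelines for rating scale functioning:
    (1) each category has at least `min_count` observations;
    (2) average person measures advance monotonically with category.

    ratings  : observed category scores, one per response
    measures : person measures (logits) for the same responses
    """
    ratings, measures = np.asarray(ratings), np.asarray(measures)
    cats = np.unique(ratings)
    counts = {int(c): int(np.sum(ratings == c)) for c in cats}
    avg = [measures[ratings == c].mean() for c in cats]
    return {
        "counts_ok": all(v >= min_count for v in counts.values()),
        "avg_measures_monotonic": all(a < b for a, b in zip(avg, avg[1:])),
        "counts": counts,
        "average_measures": avg,
    }
```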
____________________
Differential Item Functioning Analysis Using a Multilevel Rasch Mixture Model: Investigating the Impact of
Disability Status and Receipt of Testing Accommodations
W. Holmes Finch and Maria E. Hernández Finch
Abstract
The assessment of differential item functioning (DIF) remains an area of active research in psychometrics and educational
measurement. In recent years, methodological innovations involving mixture Rasch models have provided researchers with an additional set of tools
for more deeply understanding the root causes of DIF, while at the same time increased interest in the role of disabilities and accommodations
has also made itself felt in the measurement community. The current study furthered work in both areas by using the newly described multilevel
mixture Rasch model to investigate the presence of DIF associated with disability and accommodation status at both examinee and school levels for
a 3rd-grade language assessment. The study found that DIF was indeed present at both levels of analysis and that it was associated
with the presence of disabilities and the receipt of accommodations. Implications of these results for both practitioners and researchers are discussed.
____________________
Rater Effect Comparability in Local Independence and Rater Bundle Models
Edward W. Wolfe and Tian Song
Abstract
A large body of literature exists describing how rater effects may be detected in rating data. In this study, we compared
the flag and agreement rates for several rater effects based on calibration of a real data set under two psychometric models: the Rasch rating scale
model (RSM) and the Rasch testlet-based rater bundle model (RBM). The results show that the RBM provided more accurate diagnoses of rater severity
and leniency than did the RSM, which is based on the local independence assumption. However, the statistical indicators associated with rater
centrality and inaccuracy remained consistent between the two models.
____________________
Improving the Individual Work Performance Questionnaire using Rasch Analysis
Linda Koopmans, Claire M. Bernaards, Vincent H. Hildebrandt, Stef van Buuren, Allard J. van der Beek, and Henrica C.W. de Vet
Abstract
Recently, the Individual Work Performance Questionnaire (IWPQ) version 0.2 was developed using Rasch analysis. The goal of
the current study was to improve targeting of the IWPQ scales by including additional items. The IWPQ 0.2 (original) and 0.3 (including additional
items) were examined using Rasch analysis. Additional items that showed misfit or did not improve targeting were removed from the IWPQ 0.3,
resulting in a final IWPQ 1.0. Subsequently, the scales showed good model fit and reliability, and were examined for key measurement requirements
(e.g., category ordering, unidimensionality, and differential item functioning). Finally, calculation and interpretability of scores were addressed.
Compared to its previous version, the final IWPQ 1.0 showed improved targeting for two out of three scales. As a result, it can more reliably
measure workers at all levels of ability, discriminate between workers at a wider range on each scale, and detect changes in individual work performance.
____________________
Influence of DIF on Differences in Performance of Italian and Asian Individuals on a Reading Comprehension Test of Spanish
as a Foreign Language
Gerardo Prieto and Eloísa Nieto
Abstract
Differential Item Functioning (DIF) has been an active research area in language testing (Ferne and Rupp, 2007).
In this study we analyzed DIF between two groups with different native languages (927 Italians and 280 Asians) on a reading comprehension task
forming part of an exam in Spanish as a foreign language. The Mantel-Haenszel (MH) and Rasch procedures for the detection of uniform and nonuniform
DIF were used. The two procedures converged substantially in their results. Uniform DIF was detected in 6.6% of the items and
nonuniform DIF in 16.7%. Half of the items affected by DIF favored the focal group (Asians) and the other half favored the reference group (Italians).
The difference in test performance between the two groups did not appear to be affected by the elimination of items with DIF.
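For readers unfamiliar with the MH procedure, its core is a common odds ratio pooled over matched score strata, often reported on the ETS delta metric. A minimal Python sketch with hypothetical counts (not the study's data):

```python
import numpy as np

def mantel_haenszel_delta(strata):
    """Mantel-Haenszel common odds ratio and MH D-DIF for one item.

    strata : iterable of per-score-level 2x2 tables (A, B, C, D), with
             A = reference correct, B = reference incorrect,
             C = focal correct,     D = focal incorrect.

    alpha_MH = sum_k(A_k * D_k / N_k) / sum_k(B_k * C_k / N_k);
    MH D-DIF = -2.35 * ln(alpha_MH) places the result on the ETS delta
    metric, where |D-DIF| >= 1.5 is conventionally flagged as large DIF.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)

# Hypothetical 2x2 tables for three matched score strata
print(mantel_haenszel_delta([(30, 10, 20, 15), (25, 15, 18, 17), (40, 5, 30, 10)]))
```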
____________________
Rasch Rating Scale Analysis of the Attitudes Toward Research Scale
Elena C. Papanastasiou and Randall Schumacker
Abstract
College students may view research methods courses with negative attitudes; however, few studies have investigated this issue
due to the lack of instruments that measure students’ attitudes towards research. Therefore, the purpose of this study was to examine the
psychometric properties of an Attitudes Toward Research Scale using Rasch rating scale analysis. Assessment of attitudes toward research is essential
to determine whether students have negative attitudes towards research and to assist instructors in better facilitating the learning of research methods
in their courses. The results of this study show that a thirty-item Attitudes Toward Research Scale yielded scores with high person and item reliability.
____________________
Measuring the Ability of Military Aircrews to Adapt to Perceived Stressors when Undergoing Centrifuge Training
Jenhung Wang, Pei-Chun Lin, and Shih-Chin Li
Abstract
This study assessed the ability of military aircrews to adapt to stressors when undergoing centrifuge training and determined
what equipment items caused perceived stress and needed to be upgraded. We used questionnaires and the Rasch model to measure aircrew personnel’s
ability to adapt to centrifuge training. The measurement items were ranked by 611 military aircrew personnel. Analytical results indicated that the
majority of the stress perceived by aircrew personnel resulted from the lightproof cockpit, which lacks external visual references. This study prioritized the
equipment requiring updating as the lightproof cockpit design, the dim lighting of the cockpit, and the pedal design. A significant difference was
found between pilot and non-pilot subjects’ stress from the pedal design, and a considerable association was discernible between the seat angle design
and flight hours accrued. The study results provide aviators, astronauts, and air forces with reliable information as to which equipment items most
urgently need upgrading, since their present physiological and psychological effects can reduce the effectiveness of centrifuge training.
____________________
Vol. 15, No. 3 Fall 2014
A Comparison of Stopping Rules for Computerized Adaptive Screening Measures Using the Rating Scale Model
Audrey J. Leroux and Barbara G. Dodd
Abstract
The current study evaluates three stopping rules for computerized adaptive testing (CAT): the predicted standard error reduction (PSER), the fixed-length, and the minimum-SE rules, using Andrich’s rating scale model with a survey designed to identify at-risk students. PSER attempts to reduce the number of items administered while increasing the measurement precision of the trait. Several variables are manipulated, such as trait distribution and item pool size, in order to evaluate how these conditions interact and potentially help improve the correct classification of students. The findings indicate that the PSER stopping rule may be preferred when the goal is to correctly diagnose or classify at-risk students while alleviating the test burden on those taking screening measures based on the rating scale model with smaller item pools.
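To make the stopping rules concrete: under Rasch-family polytomous models the item information at an ability value equals the variance of the item score there, the standard error is one over the square root of the accumulated information, and the fixed-length and minimum-SE rules check an item count or an SE threshold (PSER additionally forecasts the SE reduction the best remaining item would yield). A minimal Python sketch with hypothetical parameters:

```python
import numpy as np

def rsm_probs(theta, delta, taus):
    """Category probabilities for one item under Andrich's rating scale
    model: P(X = k) proportional to exp(sum_{j <= k}(theta - delta - tau_j))."""
    cum = np.concatenate(([0.0], np.cumsum(theta - delta - np.asarray(taus))))
    e = np.exp(cum - cum.max())
    return e / e.sum()

def item_information(theta, delta, taus):
    """For Rasch-family polytomous models, item information at theta equals
    the variance of the item score at theta."""
    p = rsm_probs(theta, delta, taus)
    k = np.arange(len(p))
    return float(np.sum(k**2 * p) - np.sum(k * p) ** 2)

def should_stop(administered, theta, deltas, taus, max_items=30, min_se=0.35):
    """Fixed-length and minimum-SE stopping checks. (PSER additionally
    forecasts the SE reduction from the best remaining item; omitted.)"""
    info = sum(item_information(theta, deltas[i], taus) for i in administered)
    se = 1.0 / np.sqrt(info) if info > 0 else np.inf
    return len(administered) >= max_items or se <= min_se

# Hypothetical survey: shared thresholds, three items given so far
print(should_stop(administered=[0, 1, 2], theta=0.4,
                  deltas=[-1.0, 0.0, 1.2], taus=[-0.8, 0.8]))
```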
____________________
Creating the Individual Scope of Practice (I-SOP) Scale
Thomas O’Neill, Michael R. Peabody, Brenna E. Blackburn, and Lars E. Peterson
Abstract
Research indicates that the scope of practice for primary care physicians has been shrinking (Tong, Makaroff, Xierali, Parhat, Puffer, Newton, et al., 2012; Xierali, Puffer, Tong, Bazemore, and Green, 2012; Bazemore, Makaroff, Puffer, Parhat, Phillips, Xierali, et al., 2012) despite research showing that areas with robust primary care services have better population health outcomes at lower costs (Starfield, Shi, and Macinko, 2005). Examining issues related to the scope of practice for primary care physicians has wide-ranging implications for both patient health outcomes and related healthcare costs. This article describes the development and use of a scale intended to measure the breadth of the individual physician’s scope of practice using 22 self-reported, dichotomous indicators obtained from a physician survey.
____________________
Measuring Teacher Dispositions using the DAATS Battery: A Multifaceted Rasch Analysis of Rater Effect
W. Steve Lang, Judy R. Wilkerson, Dorothy C. Rea, David Quinn, Heather L. Batchelder, Deirdre S. Englehart, and Kelly J. Jennings
Abstract
The purpose of this study was to examine the extent to which raters’ subjectivity impacts measures of teacher dispositions using the Dispositions Assessments Aligned with Teacher Standards (DAATS) battery. This is an important component of the collection of evidence of validity and reliability of inferences made using the scale. It also provides needed support for the use of subjective affective measures in teacher training and other professional preparation programs, since these measures are often feared to be unreliable because of rater effect. It demonstrates the advantages of the multifaceted Rasch model as a better alternative to the methods typically used in preparation programs, such as Cohen’s Kappa. DAATS instruments require subjective scoring using a six-point rating scale derived from the affective taxonomy defined by Krathwohl, Bloom, and Masia (1964). Rater effect is a serious challenge and can worsen or drift over time. Errors in rater judgment can impact the accuracy of ratings, and these effects are common, but they can be lessened through training of raters and monitoring of their efforts. This study uses multifaceted Rasch measurement models (MFRM) to detect and understand the nature of these effects.
____________________
On Robustness and Power of the Likelihood-ratio Test as a Model Test of the Linear Logistic Test Model
Christine Hohensinn, Klaus D. Kubinger, and Manuel Reif
Abstract
Recently, the linear logistic test model (LLTM; Fischer, 1973) has been used increasingly. In applications of the LLTM, a likelihood-ratio test comparing the likelihood of the LLTM to the likelihood of the Rasch model is the most often applied model test. The present simulation study evaluates the empirical Type I risk, test power, and approximation to the expected distribution in the context of the LLTM. Furthermore, as possible influences on the distribution of the likelihood-ratio test statistic, misspecification of the superior model, closeness to singularity of the design matrix, and different sorts of misspecification of the design matrix are implemented. In summary, results of the simulations indicate that the likelihood-ratio test statistic holds the fixed Type I risk under typical conditions. Nevertheless, it is especially important to ensure the fit of the superior model, the Rasch model, and to consider the closeness to singularity of the design matrix.
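Since the LLTM constrains the Rasch item difficulties to linear combinations of basic parameters, the model test reduces to a nested-model likelihood-ratio comparison. A minimal Python sketch with hypothetical log-likelihoods (degrees-of-freedom conventions vary with the normalization chosen):

```python
from scipy.stats import chi2

def lltm_lr_test(loglik_lltm, loglik_rasch, n_items, n_basic_params):
    """Likelihood-ratio model test of the LLTM against the Rasch model.

    The LLTM restricts the Rasch item difficulties to linear combinations
    of basic parameters (beta = Q @ eta), so it is nested in the Rasch
    model: LR = 2 * (logL_Rasch - logL_LLTM) is referred to a chi-square
    with df = (n_items - 1) - n_basic_params here (one difficulty fixed
    for identification; df conventions vary with the normalization used).
    """
    lr = 2.0 * (loglik_rasch - loglik_lltm)
    df = (n_items - 1) - n_basic_params
    return lr, df, chi2.sf(lr, df)

# Hypothetical conditional log-likelihoods from the two fits
print(lltm_lr_test(loglik_lltm=-5130.2, loglik_rasch=-5121.7,
                   n_items=20, n_basic_params=5))
```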
____________________
Performance of the Likelihood Ratio Difference (G2 Diff) Test for Detecting Unidimensionality in Applications of the Multidimensional Rasch Model
Leigh Harrell-Williams and Edward W. Wolfe
Abstract
Previous research has investigated the influence of sample size, model misspecification, test length, ability distribution offset, and generating model on the likelihood ratio difference test in applications of item response models. This study extended that research to the evaluation of dimensionality using the multidimensional random coefficients multinomial logit model (MRCMLM). Logistic regression analysis of simulated data reveals that sample size and test length have a large effect on the capacity of the LR difference test to correctly identify unidimensionality, with shorter tests and smaller sample sizes leading to smaller Type I error rates. Higher levels of simulated misfit resulted in fewer incorrect decisions than data with no or little misfit. However, the Type I error rates indicate that the likelihood ratio difference test is not suitable under any of the simulated conditions for evaluating dimensionality in applications of the MRCMLM.
____________________
Applying the Rasch Sampler to Identify Aberrant Responding through Person Fit Statistics under Fixed Nominal Alpha-level
Christian Spoden, Jens Fleischer, and Detlev Leutner
Abstract
Testing hypotheses about a respondent’s individual fit under the Rasch model requires knowledge of the distributional properties of a person fit statistic. We argue that the Rasch Sampler (Verhelst, 2008), a Markov chain Monte Carlo algorithm for sampling binary data matrices from a uniform distribution, can be applied to simulate the distribution of person fit statistics under the Rasch model in the same way it is used to test for other forms of misfit. Results from two simulation studies are presented that compare this approach to the original person fit statistics based on normalization formulas. Simulation 1 shows that the new approach holds the expected Type I error rates while the normalized statistics deviate from the nominal alpha-level. In Simulation 2, the power of the new approach was found to be approximately the same as or higher than that of the normalized statistics under most conditions.
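The idea can be illustrated generically: sample 0/1 matrices that preserve the observed row and column sums, recompute the person fit statistic on each draw, and take the empirical tail proportion as the p-value. The Rasch Sampler itself implements a more elaborate algorithm, so the following Python sketch, using simple tetrad swaps and hypothetical helper names, only illustrates the principle:

```python
import numpy as np

rng = np.random.default_rng(1)

def margin_preserving_sample(X, n_swaps=20000):
    """Sample a 0/1 matrix with the same row and column sums as X via
    'checkerboard' tetrad swaps, a generic margin-preserving MCMC."""
    X = X.copy()
    n, k = X.shape
    for _ in range(n_swaps):
        r1, r2 = rng.choice(n, 2, replace=False)
        c1, c2 = rng.choice(k, 2, replace=False)
        # A swap is legal when the 2x2 submatrix forms a checkerboard.
        if X[r1, c1] == X[r2, c2] == 1 and X[r1, c2] == X[r2, c1] == 0:
            X[r1, c1] = X[r2, c2] = 0
            X[r1, c2] = X[r2, c1] = 1
    return X

def person_fit_pvalue(X, person, statistic, n_samples=500):
    """Monte Carlo p-value: the proportion of sampled matrices on which the
    person's fit statistic (larger = more misfit; item parameters baked in
    via closure) is at least as extreme as the observed value."""
    observed = statistic(X[person])
    hits = sum(statistic(margin_preserving_sample(X)[person]) >= observed
               for _ in range(n_samples))
    return (hits + 1) / (n_samples + 1)
```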
____________________
Power Analysis on the Time Effect for the Longitudinal Rasch Model
M. L. Feddag, M. Blanchin, J. B. Hardouin, and V. Sebille
Abstract
The statistics literature in the social, behavioral, and biomedical sciences typically stresses the importance of power analysis. Patient Reported Outcomes (PRO), such as quality of life and other perceived health measures (pain, fatigue, stress, etc.), are increasingly used as important health outcomes in clinical trials and epidemiological studies. They cannot be directly observed or measured like other clinical or biological data, and they are often collected through questionnaires with binary or polytomous items. The Rasch model is the best-known item response theory (IRT) model for binary data. This article proposes an approach for evaluating the statistical power of the time effect in the longitudinal Rasch model with two time points. The performance of this method is compared to that obtained by a simulation study. Finally, the proposed approach is illustrated on one subscale of the SF-36 questionnaire.
____________________
Application of Rasch Analysis to Turkish Version of ECOS-16 Questionnaire
Pinar Gunel Karadeniz, Nural Bekiroglu, Ilker Ercan, and Lale Altan
Abstract
The aim of this study is to reevaluate the validity of the Turkish version of the ECOS-16 questionnaire by using Rasch analysis in post-menopausal women with osteoporosis. The ECOS-16 (Assessment of health related quality of life in osteoporosis) is a quality of life questionnaire suitable for measuring the quality of life of postmenopausal women with osteoporosis. A total of 132 post-menopausal women with osteoporosis who attended Uludag University, Atatürk Rehabilitation and Research Center between January 2010 and March 2011 were included in this study. The subjects filled out the Turkish version of the ECOS-16 questionnaire by themselves. The Rasch model was used to assess the construct validity of the ECOS-16 data. Internal consistency was assessed with Cronbach’s alpha coefficient. The mean infit and outfit mean squares (z std) were 1.08 (0.1) and 1.02 (-0.1), respectively. The separation indices for items and persons were 7.72 and 3.13, and the separation reliabilities were 0.98 and 0.91, respectively. Cronbach’s alpha coefficient was 0.90. The construct validity of the ECOS-16 questionnaire was assessed by Rasch analysis.
____________________
Vol. 15, No. 4 Winter 2014
An Attempt to Lower Sources of Systematic Measurement Error Using Hierarchical Generalized Linear Modeling (HGLM)
Georgios D. Sideridis, Ioannis Tsaousis, and Athanasios Katsis
Abstract
The purpose of the present studies was to test the effects of systematic sources of measurement error on the
parameter estimates of scales using the Rasch model. Studies 1 and 2 tested the effects of mood and affectivity.
Study 3 evaluated the effects of fatigue. Finally, Studies 4 and 5 tested the effects of motivation on a number of
parameters of the Rasch model (e.g., ability estimates). Results indicated that (a) the parameters of interest and
the psychometric properties of the scales were substantially distorted in the presence of all systematic sources
of error, and (b) the use of HGLM provides a way of adjusting the parameter estimates in the presence of these
sources of error. It is concluded that validity in measurement requires a thorough evaluation of potential sources
of error and appropriate adjustments on each occasion.
____________________
The Nature of Science Instrument-Elementary (NOSI-E): The End of the Road?
Shelagh M. Peoples and Laura M. O’Dwyer
Abstract
This research continues prior work published in this journal (Peoples, O’Dwyer, Shields and Wang, 2013). The
first paper described the scale development, psychometric analyses and part-validation of a theoretically-grounded
Rasch-based instrument, the Nature of Science Instrument-Elementary (NOSI-E). The NOSI-E was designed
to measure elementary students’ understanding of the Nature of Science (NOS). In the first paper, evidence was
provided for three of the six validity aspects (content, substantive and generalizability) needed to support the
construct validity of the NOSI-E.
The research described in this paper examines two additional validity aspects (structural and external). The
purpose of this study was to determine which of three competing internal models provides reliable, interpretable,
and responsive measures of students’ understanding of NOS. One postulate is that the NOS construct is
unidimensional; alternatively, the NOS construct is composed of five independent unidimensional constructs
(the consecutive approach). Lastly, the NOS construct is multidimensional and composed of five inter-related
but separate dimensions. The vast body of evidence supported the claim that the NOS construct is multidimensional.
Measures from the multidimensional model were positively related to student science achievement
and students’ perceptions of their classroom environment; this provided supporting evidence for the external
validity aspect of the NOS construct. As US science education moves toward students learning science through
engaging in authentic scientific practices and building learning progressions (NRC, 2012), it will be important
to assess whether this new approach to teaching science is effective, and the NOSI-E may be used as a measure
of the impact of this reform.
____________________
Toward a Theory Relating Text Complexity, Reader Ability, and Reading Comprehension
Carl W. Swartz, Donald S. Burdick, Sean T. Hanlon, A. Jackson Stenner, Andrew Kyngdon, Harold Burdick, and Malbert Smith
Abstract
The validity of specification equations used by auto-text processors to estimate theoretical text complexity has
taken on increased importance because of the Common Core State Standards. Theoretical estimates of text complexity
will inform (a) setting standards for college and career readiness, (b) grade-level standards, (c) matching readers
to text, and (d) creating a daily diet of stretch and targeted text designed to grow reading ability and content
knowledge. The purpose of this research was to investigate the specification equation used in the Lexile Framework
for Reading to measure text complexity. The Lexile Reading Analyzer contains a specification equation
that uses proxies for semantic difficulty and syntactic complexity to estimate the theoretical complexity
of professionally-edited text. Differences between theoretical and empirical estimates of text complexity were
examined for a set of 446 professionally authored, previously published passages. Students in grades 2-12 read
these passages using A Learning Oasis, a web-based technology, to ensure that most of the articles read were
well-targeted to student ability (±100L). Each article was response-illustrated using an auto-generated semantic
cloze item type embedded into the passages. Observed student performance on this item type was used to derive an
empirical estimate of text complexity for each passage. Theoretical estimates of text complexity accounted for
approximately 90% of the variance in empirical estimates of text complexity. These findings suggest that the
specification equation contains powerful predictors of empirical text complexity; speculation remains about what
additional variables might account for the remaining 10% of unexplained variation.
____________________
Measuring Students’ Perceptions of Plagiarism: Modification and Rasch Validation of a Plagiarism Attitude Scale
Steven J. Howard, John F. Ehrich, and Russell Walton
Abstract
Plagiarism is a significant area of concern in higher education, given university students’ high self-reported
rates of plagiarism. However, research remains inconsistent in prevalence estimates and suggested precursors
of plagiarism. This may be a function of the unclear psychometric properties of the measurement tools adopted.
To investigate this, we modified an existing plagiarism scale (to broaden its scope), established its psychometric
properties using traditional (EFA, Cronbach’s alpha) and modern (Rasch analysis) survey evaluation approaches,
and examined results of well-functioning items. Results indicated that traditional and modern psychometric
approaches differed in their recommendations. Further, responses indicated that although most respondents
acknowledged the seriousness of plagiarism, these attitudes were neither unanimous nor consistent across the
range of issues assessed. This study thus provides rigorous psychometric testing of a plagiarism attitude scale
and baseline data from which to begin a discussion of contextual, personal, and external factors that influence
students’ plagiarism attitudes.
____________________
Survey Analysis with Mixture Rasch Models
Andrew D. Dallas and John T. Willse
Abstract
This research provides a demonstration of the utility of mixture Rasch models (MRMs) for the analysis of survey
data. Specifically, a framework based on a mixture partial credit model (MPCM) will be presented. MRMs
are able to provide information regarding latent classes (subpopulations without manifest grouping variables)
and separate item parameter estimates for each of these latent classes. Analyses can provide insight into how a
survey scale is functioning and how survey respondents differ from one another. The paper works through a detailed
example with real data from a higher education survey administered to college seniors, covering all stages
of model estimation and selection, description of model results, and follow-up analyses using the MRM results.
The analysis identified three distinct classes, and each class is discussed in terms of its pattern of item parameter
estimates. The paper also investigated differences in class assignment based on the college to which the student
belonged on campus.
____________________
Validating a Developmental Scale for Young Children Using the Rasch Model: Applicability of the Teaching Strategies GOLD® Assessment System
Do-Hong Kim, Richard G. Lambert, and Diane C. Burts
Abstract
This article reports the results of an application of the Rasch rating scale model to the Teaching Strategies
GOLD® assessment system in a norm sample of children aged birth to 71 months. The analyses focused on the
examination of dimensionality, rating scale effectiveness, the hierarchy of item difficulties, and the relationship
of developmental scale scores to child age. Results show that each subscale satisfies the Rasch model requirement of
unidimensionality. Ratings were found to be less reliable at the lowest and highest ends of the scale and less distinct at
in-between levels. Items appear to form theoretically expected hierarchies, providing supporting evidence for construct
validity for the measures. Moderately high correlations of developmental scale scores with child age suggest that
teachers are able to make valid ratings of the developmental progress of children across the intended age range.
____________________
Does the Model of Hierarchical Complexity Produce Significant Gaps between Orders and Are the Orders Equally Spaced?
Michael Lamport Commons, Eva Yujia Li, Andrew Michael Richardson, Robin Gane-McCalla, Cory David Barker, and Charu Tara Tuladhar
Abstract
The model of hierarchical complexity (MHC) provides an analytic a priori measurement of the difficulty of
tasks. As part of the theory of measurement in mathematical psychology, the model of hierarchical complexity
(Commons and Pekker, 2008) defines a new kind of scale. It is important to note that the orders of hierarchical
complexity of tasks are postulated to form an ordinal scale. A formal definition of the model of hierarchical
complexity is presented along with the descriptions of its five axioms that help determine how the model of
hierarchical complexity orders actions to form a hierarchy. The fourth and the fifth axioms are of particular
importance in establishing that the orders of hierarchical complexity form an equally spaced ordinal scale.
Previously, it was shown that Rasch-scaled items followed the same sequence as their orders of hierarchical
complexity. Here, it is shown that gaps exist between the highest Rasch-scaled item scores at a lower order and
the lowest scores at the next higher order: there was no overlap between the Rasch-scaled item scores at one
order of complexity and those of the adjoining orders, leaving “gaps” between the stages of performance on those
items. Second, we tested for equal spacing between the orders of hierarchical complexity and found that the
orders were equally spaced. To deviate significantly from the data, the orders would have to deviate from linearity
by more than .25 of an order. This would appear to be an empirical and mathematical confirmation of the equally
spaced stages of development.
____________________