Psychometric properties of the questionnaire version (ERS-Q) of the Environmental Rating Scale (ERS) for assessment of residential programmes for individuals with autism

The Environmental Rating Scale (ERS) is the only autism spectrum disorders (ASD) specific tool for the assessment of residential services and treatment models. However, one limitation of the ERS is its dependence on expert observations and interviews, particularly in larger-scale studies. The ERS has therefore been adapted into a staff self-report questionnaire (ERS-Q). Here the measurement properties of the ERS-Q were examined according to traditional test theory criteria. The data supported summation of raw item scores into total and subscale ERS-Q scores, and item-total correlations indicated that items within scales tap a common construct, suggesting that the ERS-Q is useful in survey as well as interventional studies. As such, the ERS-Q appears a valuable addition to the current ASD research toolbox.

Because several community-based treatment strategies exist (Van Bourgondien et al. 1998), a major force in creating public policy is to document and evaluate the daily care and the pedagogical environment for individuals with ASD. Lord et al. (2005) suggest that measures of outcome in ASD should be a high priority in order to document the goals for, and effects of, treatment. However, research on the effectiveness of treatment models is limited (Van Bourgondien et al. 1998). Such outcomes are typically measured by means of summed rating scales. Valid interpretation of rating scale data requires that certain criteria are met (Ware and Gandek 1998; Ware et al. 1997; Hobart et al. 2004). Measures of outcome of therapeutic interventions must be rigorously evaluated (Hobart et al. 2004), as the results are used to make judgements about the effectiveness of the care and pedagogical work for vulnerable individuals who are unable to express their wishes and rights. By confirming the assumptions underlying the construction of summated rating scales, scores can be calculated with confidence that they have their desired properties and that they represent the variables they are assumed to measure.
The Environmental Rating Scale (ERS) was developed based on well-founded theories of autism spectrum disorder (ASD) (Schopler 1997) and appears to be the only ASD-specific instrument for the assessment of residential ASD services and treatment models (Van Bourgondien et al. 1998; Van Bourgondien, Reichle, and Schopler 2003). However, one limitation of the ERS, particularly in larger-scale studies, is its dependence on expert observations of residential homes and face-to-face interviews with staff members.
In an attempt to overcome this, the original interview based ERS was adapted into a staff self-administered questionnaire version (the ERS-Q) and pilot tested among 18 residential staff members (Hubel, Hagell, and Sivberg 2007). The results provided general support for the notion that the ERS can be adapted into a questionnaire without substantial loss of conceptual meaning. However, further evaluations in larger samples are needed to document the measurement properties of the ERS-Q. This study describes the psychometric properties of the ERS-Q in such a sample of residential staff members.

Setting and participants
All group homes (n=26) for individuals with ASD in a west Swedish municipality were invited, of which 19 (73.1%) agreed to participate. The residents of the group homes were adults functioning in the moderate to severe ranges of intellectual disability. All group homes worked according to the Treatment and Education of Autistic and related Communication Handicapped CHildren (TEACCH) philosophy (Schopler 1997). Activities and daily routines of the homes were also very similar and adapted to the individual residents.
All staff members who had been working at the included group homes for more than two years were eligible for participation. A total of 303 of 360 eligible respondents at the 19 group homes gave written informed consent and responded to the questionnaire (response rate, 84%). The mean age of respondents was 38.5 (SD=11.46) years and 82.5% were female (Table 1). Respondents had been working with ASD for 9.47 (SD=8.06) years and at their current homes for 5.33 (SD=3.51) years; the large spread in these figures reflects the wide age range of the respondents. A majority (63%) stated that their choice of occupation was based on an interest in individuals with ASD. A majority of the respondents (76.5%) had undergone TEACCH training, declared that they worked according to the TEACCH pedagogy (80.1%), had regular staff meetings (94.8%) and discussed pedagogical issues regularly (96.9%) (Table 1).

The ERS-Q
The ERS-Q was adapted from the ERS on an item-by-item basis (Hubel, Hagell, and Sivberg 2007). As such, it consists of 32 items, each scored from 1 to 5, that are grouped conceptually into five subscales (Table 2): Communication (five items); Structure (six items); Socialization (six items); Developmental Assessment and Planning (eight items); and Behaviour Management (seven items). These subscales are considered to reflect various aspects of the overall construct, which is represented by a total score. Similar to the original ERS, the ERS-Q is scored using Likert's method of summing scale items into total scores (Likert 1932). Subscale scores are thus created by summing the item scores in each area, and the total score is obtained by summing all 32 item scores. Higher scores indicate opportunities for greater environmental adaptation for individuals with ASD (Van Bourgondien, Reichle, and Schopler 2003).
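The Likert-style summation described above can be sketched in a few lines of code. This is a minimal illustration only: the subscale item counts follow the text, but the item-number ranges assigned to each subscale below are hypothetical, not the instrument's actual layout.

```python
# Illustrative sketch of Likert-style summation into ERS-Q subscale and
# total scores. Subscale item counts follow the paper; the item-number
# ranges are invented for the example.
SUBSCALES = {
    "Communication": range(1, 6),                             # 5 items
    "Structure": range(6, 12),                                # 6 items
    "Socialization": range(12, 18),                           # 6 items
    "Developmental Assessment and Planning": range(18, 26),   # 8 items
    "Behaviour Management": range(26, 33),                    # 7 items
}

def score_ers_q(responses):
    """responses: dict mapping item number (1-32) to a 1-5 rating."""
    scores = {name: sum(responses[i] for i in items)
              for name, items in SUBSCALES.items()}
    scores["Total"] = sum(responses.values())  # possible range 32-160
    return scores

flat_profile = {i: 3 for i in range(1, 33)}  # a flat "3" response profile
print(score_ers_q(flat_profile)["Total"])    # -> 96
```

Because the method is a plain unweighted sum, the total score is simply the sum of the five subscale scores, which is what makes the item-level assumptions examined below (equal item contributions, similar item variances) worth checking.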

Procedures
Data were collected during a two-month period in 2007. Study participants received the ERS-Q and a demographic questionnaire from their staff manager at the group homes. Each respondent was the contact person for one resident only and worked directly with this individual on a daily basis. The respondents were instructed to complete the ERS-Q individually with reference to this person. All responses were kept confidential from other staff members.

Analyses
Data quality was assessed by examining the amount of missing data. Up to 10% missing item responses has been suggested as acceptable (Saris-Baglama et al. 2004).
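As a concrete illustration of this criterion (with invented data, not the study's):

```python
# Minimal sketch of the data-quality check: the percentage of missing
# responses per item, judged against the <=10% acceptability threshold
# cited above. None marks a missing response; the data are invented.
def pct_missing(column):
    return 100.0 * sum(v is None for v in column) / len(column)

item_responses = [4] * 19 + [None]  # one item, 20 respondents, 1 missing
p = pct_missing(item_responses)
print(f"{p:.1f}% missing -> {'acceptable' if p <= 10 else 'too high'}")
# -> 5.0% missing -> acceptable
```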
To evaluate whether it is legitimate to sum items into scale scores within the traditional test theory framework (Likert 1932; Ware et al. 1997; Ware and Gandek 1998), item descriptive statistics were examined. Item means and standard deviations should be roughly equivalent within a scale; otherwise, standardization or weighting of item scores will be needed. Furthermore, each item should contribute about equally to the total score. This was examined by inspection of the corrected item-total correlations, i.e., the correlation between each item's score and the total score of the other items in its scale, which should exceed 0.30.
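The corrected item-total correlation can be computed as sketched below, in pure Python rather than SPSS, with invented responses; the key point is that each item is correlated with the sum of the *other* items in its scale, so the item does not inflate its own correlation.

```python
# Sketch of the corrected item-total correlation: each item's Pearson
# correlation with the sum of the remaining items in its scale.
# Responses are invented for illustration.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def corrected_item_total(items):
    """items: list of item columns (one list of respondent scores per item)."""
    out = []
    for i, col in enumerate(items):
        rest = [sum(vals) for vals in
                zip(*(c for j, c in enumerate(items) if j != i))]
        out.append(pearson(col, rest))
    return out

scale = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]  # 3 items, 5 raters
print([round(r, 2) for r in corrected_item_total(scale)])
```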
To support the validity of summing item responses into scale scores, there should also be evidence that items in a scale tap a common underlying variable and that they are correctly grouped into their respective subscales. To assess this, corrected item-total correlations were examined. Different criteria have been suggested, ranging from 0.20 (Streiner and Norman 2000) to 0.40 (Ware and Gandek 1998). To support the hypothesized grouping of items, items should also correlate more strongly with the scale they are hypothesized to represent (corrected item-total correlation) than with the other scales. This criterion has been considered satisfied when at least 80% of the hypothesized item-scale correlations are stronger than the alternative item-scale correlations (Saris-Baglama et al. 2004). This aspect can also be assessed more rigorously by determining the number of instances in which corrected item-total correlations are significantly stronger than the alternative item-scale correlations, as determined by 95% confidence intervals around the correlations. Significantly and non-significantly stronger item-total correlations are considered definite and probable scaling success, respectively. Similarly, significantly and non-significantly weaker item-to-own-scale correlations are considered definite and probable scaling failure. Some degree of scaling failure can be acceptable when scales within an instrument measure correlated constructs as defined by a strong theory (Ware and Gandek 1998).
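The confidence-interval comparison can be sketched with a Fisher r-to-z interval. This is a simplification (it treats the two correlations as if they came from independent samples, which item-scale correlations within one sample strictly are not), and the values are invented, but it shows the logic of the definite/probable classification.

```python
# Sketch of the CI-based scaling test: build a 95% Fisher-z confidence
# interval around an item's corrected item-total correlation and ask
# whether the competing item-to-other-scale correlation falls outside it.
from math import atanh, tanh, sqrt

def fisher_ci(r, n, z_crit=1.96):
    z = atanh(r)                 # Fisher r-to-z transform
    se = 1.0 / sqrt(n - 3)       # standard error of z
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

def classify(r_own, r_other, n):
    lo, hi = fisher_ci(r_own, n)
    if r_own > r_other:
        return "definite success" if r_other < lo else "probable success"
    return "definite failure" if r_other > hi else "probable failure"

# n = 303 as in the present sample; the correlations are invented.
print(classify(0.55, 0.25, 303))  # -> definite success
```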
Next, the amounts of floor and ceiling effects and score reliabilities were examined. Floor and ceiling effects are the proportions of minimum and maximum scores, respectively, observed for each scale. These should not exceed 15% (McHorney and Tarlov 1995). The reliabilities of scale scores were estimated using Cronbach's coefficient alpha (Cronbach 1951), which should be above 0.70 to be regarded as acceptable and preferably exceed 0.80 (Streiner and Norman 2000).
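Both statistics have simple closed forms; a self-contained sketch with invented data:

```python
# Sketch of the reliability and targeting checks: Cronbach's alpha from
# item and total-score variances, and floor/ceiling effects as the share
# of minimum and maximum possible scale scores. Data are invented.
def variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)  # sample variance

def cronbach_alpha(items):
    """items: list of item columns (one list of respondent scores per item)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(c) for c in items) / variance(totals))

def floor_ceiling(totals, min_score, max_score):
    n = len(totals)
    return (100 * totals.count(min_score) / n,   # floor %
            100 * totals.count(max_score) / n)   # ceiling %

items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
totals = [sum(row) for row in zip(*items)]
print(round(cronbach_alpha(items), 2))   # -> 0.95
print(floor_ceiling(totals, 3, 15))      # -> (0.0, 0.0)
```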
To assess whether the subscales tap distinct aspects, scale-to-scale correlations were computed and compared with each subscale's internal consistency (Cronbach's alpha), the latter of which can be viewed as the correlation of a scale with itself. Alpha values that exceed correlations with other scales are therefore taken as support for the notion that subscales represent distinct aspects of the overall construct (Ware and Gandek 1998).
All statistical analyses were performed using SPSS for Windows, version 11.0 (SPSS Inc., Chicago, 2003).

Results
There was evidence of good data quality, as the proportion of missing item responses ranged from 3.3% to 7.3% (mean, 5.0%), indicating that the ERS-Q was acceptable to the sample. There was also support for summing items to generate scale scores, as descriptive statistics showed roughly similar item means and standard deviations within the five subscales and for the total score (Table 3). Similarly, the total score and all subscales had corrected item-total correlations exceeding 0.30, ranging from 0.32 to 0.60. The exception was item 23, which had a corrected item-total correlation of 0.22 with its subscale (Table 3). With few exceptions, corrected item-total correlations also exceeded the 0.40 criterion, suggesting that items within scales tap a common construct (Table 3). The corrected item-total correlations were stronger than the item-to-other-subscale correlations in 87.5–95.8% (mean, 91.7%) of instances across the five subscales (Tables 3 and 4). This provides general support for the suggested grouping of items into subscales. When statistical criteria were taken into consideration, the definite scaling success rates varied between 32.1% and 75%, and there was no instance of definite scaling failure (Table 4).
The distribution of ERS-Q scores spanned the entire subscale ranges with no notable floor or ceiling effects. Floor effects were virtually absent (0.4% in the Structure scale, otherwise 0%) and ceiling effects ranged between 0.8% (ERS-Q total score) and 8.1% (Behaviour Management). Reliability of the total ERS-Q was good (0.90), and 0.73 or above for all subscales (Table 5). The reliability coefficients were substantially greater than the inter-scale correlations for all subscales (Table 6). This suggests that the ERS-Q subscales represent related but distinct constructs.

Discussion
This study assessed the psychometric properties of the ERS-Q. The result showed good data quality, general support for assumptions underlying summation of ERS-Q items into subscales and total scores and acceptable score reliabilities. Furthermore, there were no notable floor or ceilings effects, and the five ERS-Q subscales appear to represent related but distinct constructs.
The corrected item-total correlations varied but were above the 0.30 level, except for item 23 (item 6, 'functional needs are incorporated into training', in the Developmental Assessment and Planning subscale). This item also showed signs of scaling failure in that its correlations with other subscales were stronger than or equal to that with its hypothesized subscale, and it had a somewhat lower mean score than the other ERS-Q items. This could indicate that the item should be revised or deleted from the ERS-Q. Speaking against deletion is the fact that the item appears to work well in the context of the total ERS-Q score and that the Developmental Assessment and Planning subscale (which includes this item) appears to represent a construct separate from, but related to, the other subscales. This implies that the construct is sufficiently anchored by the subscale items to allow item 6 to be retained. Possible explanations for the observed problems with this item may relate to cultural differences between Sweden and the US (where the original ERS was developed) and/or ambiguities in wording. However, no such indications were found in the pilot study of the ERS-Q (Hubel, Hagell, and Sivberg 2008). Similar arguments also hold for other observations of potential item-level problems.
The fact that a small subset of items correlated stronger with other subscales than their own is not surprising. In their factor analysis of the original ERS, Van Bourgondien et al. (1998) found factor (subscale) interrelatedness. This idea is also supported by the theory behind the ERS (Schopler 1997;Van Bourgondien et al. 1998). Taken together, we therefore do not consider the current observations to justify any modifications of the ERS-Q at this stage.
Floor and ceiling effects were absent or trivial for all ERS-Q scores. This is an important observation for two related reasons. First, it indicates that the scale scores are able to reflect the relevant levels of therapeutic environmental qualities among the assessed group homes. Consequently, this increases the possibility of detecting differences between settings and changes over time (Baron et al. 2006). With large floor/ceiling effects, changes or differences outside the range covered by the scale go undetected, and only changes or differences in one direction can be detected. While floor/ceiling effects were acceptable, there was a tendency for ERS-Q scores in this study to be somewhat skewed towards the better end of therapeutic environments. This implies that the ERS-Q may be less effective at differentiating among group homes performing at the high end of the continuum than at the lower end. We do not consider this a major problem, since it appears feasible to assume that, in general, it is the less well performing homes that are of most concern. In addition to the targeting of the sample, i.e., the distribution of scores, the ability of scores to detect differences is also influenced by their reliabilities.
Specifically, compromised reliability has an adverse impact on the sample size required to detect differences; power calculations traditionally do not take this into account (Fleiss 1986). That is, with compromised reliability, even if it exceeds the minimal acceptable criteria, sample size requirements will be underestimated for a given desired level of statistical power (Fleiss 1986). These concerns are relevant for the ERS and ERS-Q, as this and other studies (Van Bourgondien et al. 1998) have shown that their reliabilities are similar and somewhat compromised (albeit considered 'acceptable'). Given the requirement for expert on-site assessments to administer the original ERS and the maintained validity of its questionnaire version (Hubel, Hagell, and Sivberg 2008), there thus appears to be a good case for the ERS-Q in future ASD studies. That is, the ERS-Q should allow for considerably larger sample sizes without unreasonable funding demands.
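The magnitude of this effect can be illustrated with a standard classical-test-theory approximation (a rough sketch, not the exact correction in Fleiss 1986): the observed standardized effect is attenuated by a factor of the square root of the reliability, and since the required sample size scales with the inverse square of the effect size, n grows by roughly 1/reliability relative to perfectly reliable scores.

```python
# Rough illustration of how required sample size inflates with
# imperfect reliability under classical test theory. The baseline
# n of 100 is arbitrary; 0.90 and 0.73 echo the total-score and
# lowest-subscale reliabilities reported in this study.
def n_adjusted(n_perfect, reliability):
    """Approximate n needed given the n for perfectly reliable scores."""
    return n_perfect / reliability

for rel in (1.00, 0.90, 0.73):
    print(f"reliability {rel:.2f}: n ~ {round(n_adjusted(100, rel))}")
# -> reliability 1.00: n ~ 100
# -> reliability 0.90: n ~ 111
# -> reliability 0.73: n ~ 137
```

Even at an 'acceptable' reliability of 0.73, the approximation suggests roughly a third more respondents are needed, which is why the lower per-respondent cost of a questionnaire matters.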
While the present observations provide further evidence that the ERS can be adapted into a questionnaire version with acceptable measurement properties, a number of important aspects were not considered in this study. In particular, the study design did not allow us to address test-retest reliability or to study responsiveness directly. Responsiveness refers to the ability of an instrument to detect small but clinically important changes. As discussed above, while the score distributions observed here suggest that the ERS-Q should not be associated with major responsiveness problems, this needs to be assessed empirically.
In conclusion, we found general support for the measurement properties of the ERS-Q according to traditional test theory criteria, which lends support for its use in ASD residential home surveys and interventional studies. Our observations indicate that the five ERS-Q subscales appear to tap different but related aspects of a single construct, represented by its total score. Whether the total and/or subscale score profile should be used needs to be governed by the objective of the study at hand. While our observations indicate that a few items did not meet the applied criteria, we do not consider any changes to the ERS-Q to be justified at this stage. For such actions to be taken, more rigorous analyses such as Rasch analysis and confirmatory factor analysis are needed, preferably followed by replications in independent samples.