Testing Writing in the EFL Classroom: Student Expectations

Nahla Nola Bacha
No EFL program can deny or ignore the significance of testing for evaluating learners’ acquisition of the target language. An important area of concern in testing is how students view their own achievements. Often students’ expectations of test results differ from actual results. Students’ grade expectations are often higher than the grades they earn, which may negatively affect student motivation. This situation calls for raising students’ awareness of their abilities.
The focus of this article is testing writing in the EFL classroom. Specifically, it describes a study comparing students’ expectations of grades with their actual grades earned for essays assigned in Freshman English classes at the Lebanese American University. The results confirm a divergence between expected and actual grades, as has been reported in other research. The article concludes with implications for classroom teaching and testing.

Experience has shown teachers, researchers, and school administrators that, just like language itself, testing practices in ELT are not static but dynamic and changing. One controversial area is testing writing, which requires that test construction and evaluation criteria be based on course objectives and teaching methodologies. In the English language classroom, especially at the high school and university levels, teachers are always challenged by how to reliably and validly evaluate students’ writing skills, so that the students will be better prepared for internal and external proficiency and achievement exams. Indeed, writing in the academic community is paramount; a student can’t be successful without a certain level of academic writing proficiency.
Another question that many ELT programs are addressing is how students perceive the process used to evaluate their work. Do they know how they are being tested and what is acceptable by the standards of the institution and their teachers? These are questions this study seeks to answer, but first, it is necessary to differentiate between assessment and evaluation of writing and to present the main issues involved.
Assessing and evaluating writing
There are many reasons for testing writing in the English language classroom, including to meet diagnostic, proficiency, and promotional needs. Each purpose requires different test construction (Bachman 1990, 1991; Pierce 1991). Recent approaches to academic writing instruction have necessitated testing procedures that deal with both the process and the product of writing (Cohen 1994; Connor-Linton 1995; Upshur and Turner 1995). It is generally accepted by teachers and researchers that there are two main goals of testing: first, to provide feedback during the process of acquiring writing proficiency (also referred to as responding or assessing), and second, to assign a grade or score that will indicate the level of the written product (also referred to as evaluating).
The present study focuses on evaluating student essays, that is, assigning scores in order to indicate proficiency level. Evaluation of writing in ELT has a long history, with various procedures and scoring criteria being revised and adapted to meet the needs of administrators, teachers, and learners (see Oller and Perkins 1980; Siegel 1990; Silva 1990; Douglas 1995; Shohamy 1995; Tchudi 1997; Bacha 2001). For testing writing, reliability and validity, as well as choice of topics and rater training, are important and must be addressed whatever the purpose of the testing situation may be (Jacobs et al. 1981; Kroll 1990; Hamp-Lyons 1991; Airasian 1994; Kunnan 1998; Elbow 1999; Bacha 2001).
Reliability is the degree to which the scores assigned to students’ work accurately and consistently indicate their levels of performance or proficiency. Correlation coefficients of .80 and above between readers’ scores (inter-rater reliability) as well as between the scores assigned by the same reader (intra-rater reliability) to the same task are considered acceptable for decision making (Bachman 1990). There is research that indicates that the gender, background, and training of the reader can affect the reliability of scores (Brown 1991; Cushing-Weigle 1994). Thus, to maintain reliability many programs put heavy emphasis on the training of raters and as a result have obtained high positive correlations (Jacobs et al. 1981; Hamp-Lyons 1991).
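To make the .80 threshold concrete, the following minimal Python sketch computes a correlation between two raters’ scores for the same set of essays and checks it against that threshold. The scores, the names `rater_a` and `rater_b`, and the choice of the Pearson coefficient are assumptions made purely for illustration; the sources cited above do not specify which coefficient or software was used.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one rater's scores never vary")
    return cross / sqrt(ss_x * ss_y)

def acceptable_reliability(scores_1, scores_2, threshold=0.80):
    """True if the correlation meets the .80 level mentioned in the text."""
    return pearson_r(scores_1, scores_2) >= threshold

# Hypothetical percentage scores given by two raters to the same six essays.
rater_a = [62, 70, 75, 58, 81, 66]
rater_b = [60, 72, 78, 55, 79, 70]

print(round(pearson_r(rater_a, rater_b), 2))
print(acceptable_reliability(rater_a, rater_b))
```

The same function could be applied to two scoring sessions by one reader to estimate intra-rater reliability.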
Validity is the degree to which a test or assignment actually measures what it is intended to measure. There are five important aspects of validity (Hamp-Lyons 1991; Jacobs et al. 1981):
1. Face validity: Does the test appear to measure what it purports to measure?
2. Content validity: Does the test require writers to perform tasks similar to what they are normally required to do in the classroom? Does it sample these tasks representatively?
3. Concurrent validity: Does the test require the same skill or sub-skills that other similar tests require?
4. Construct validity: Do the test results provide significant information about a learner’s ability to communicate effectively in English?
5. Predictive validity: Does the test predict learners’ performance at some future time?
To what extent should we teachers communicate these reliability and validity concerns to our students? Teachers’ awareness of the issues of reliability and validity is crucial, but perhaps equally important is how accurately students perceive their own abilities and the extent to which they understand what is considered acceptable EFL writing at the university level.
Perceptions of achievement
Research on how students perceive their language abilities compared with faculty perceptions and actual performance indicates that there is a problem that needs to be addressed (Kroll 1990). In a survey carried out by Pennington (1997) with students graduating from university in the United Kingdom, results indicated that 42 of the 48 students rated their writing ability as very good or quite good. In contrast, the teachers did not indicate such confidence. Another study indicated that first-year university students, who were L1 speakers of Arabic, rated their EFL writing skills in general as good, while faculty rated their skills as only fair (Bacha 1993). There were similar findings in another study comparing student and faculty grade expectations with actual test scores (Douglas 1995). In a needs analysis project carried out at Kuwait University, Basturkmen (1998:5) reported that “over 60% of faculty members perceived students to have inadequate writing skills.” She also found that students’ English language proficiency did not meet professors’ expectations and students were not aware of the level of proficiency that was expected of them (Basturkmen 1998:5). Basturkmen concludes that one curricular objective should be to “raise students’ awareness of the levels of proficiency which the faculty find acceptable” (1998:5).
If EFL students studying at the university level are deficient in academic language skills, a critical question is, to what extent are the students aware of their deficiencies? From the studies cited above, it appears they are not very aware of their deficiencies or, at best, seem to be more confident of their abilities, and thus hold higher grade expectations, than is warranted by their teachers’ perceptions or by their actual test scores. This study will examine the problem in the Lebanese university context.
Survey on student grade expectations
Participants and procedure
During the Fall 2000 semester at the Lebanese American University, 150 students in the Freshman English 1 course in the EFL Program (the first of four required courses) were surveyed on their grade expectations. These courses stress essay writing and reading comprehension skills, focusing on sentences, paragraphs, and short essays. The students who completed the survey were L1 Arabic speakers who had studied English during their preuniversity schooling and were pursuing different majors in the Schools of Arts and Sciences, Business, Engineering and Architecture, and Pharmacy. They had English entrance scores equivalent to TOEFL scores of 525 to 574, and were enrolled in Freshman English 1 sections with between 25 and 30 students each.
Specifically, the survey was given in order to find out if there were any differences between students’ grade expectations and the actual grades they earned. The survey was given two weeks before the end of the semester with the belief that students would have a better idea of their abilities later in the semester than they would at the beginning of the semester. They were requested to indicate the grade range they expected on two end-of-course essays. The five grade ranges were: below 60%, failing; 60–69%, fair; 70–79%, satisfactory; 80–89%, good; and 90–100%, excellent.
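The five grade ranges can be expressed as a simple lookup, which is handy when actual percentage scores need to be compared with the range a student selected on the survey. This is only an illustrative sketch: the function name `grade_band` is hypothetical, and the treatment of fractional scores (for example, 69.5 counted as fair) is an assumption, since the survey lists whole-number ranges only.

```python
def grade_band(percent):
    """Map a percentage score to one of the five grade ranges used in the survey."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent < 60:
        return "failing"       # below 60%
    if percent < 70:
        return "fair"          # 60-69%
    if percent < 80:
        return "satisfactory"  # 70-79%
    if percent < 90:
        return "good"          # 80-89%
    return "excellent"         # 90-100%

# A score of 74% falls in the satisfactory range, a score of 64% in the fair range.
print(grade_band(74), grade_band(64))
```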
Essay 1 (E 1) was given toward the end of the semester in the Freshman English 1 course. It is usually in the comparison or contrast rhetorical mode with a choice of different topics and completed in two fifty-minute class periods. During the first class period, students write a first draft. The teacher makes comments for improvement on the first draft, which is then rewritten during the second period. Essay 1 constitutes 20% of the final course grade.
Essay 2 (E 2) was given at the end of the semester as part of the final exam for the course, which also included a reading comprehension and vocabulary component. The reading and vocabulary component of the final exam is similar in content for all Freshman English 1 sections, but students have a choice of three or four topics in the essay section with each topic requiring a different rhetorical mode. Essay 2 also constitutes 20% of the final course grade.
Table 2
Percentage of Students Selecting Each Grade Range for Essays 1 and 2
Expected vs. Actual Grades (figures are in percentages)

Grade range    Expected E 1   Actual E 1   Expected E 2   Actual E 2
90–100%             2.5           0.5           5.6           0.0
80–89%             37.7           4.0          46.3           6.9
70–79%             50.6          41.6          44.2          36.1
60–69%              9.3          42.1           3.9          42.6
Below 60%           0.0          11.9           0.0          14.4
The survey asked students to indicate their grade expectations for these two end-of-course essays. In addition, for each essay, the students were asked to indicate their grade expectations for the three major sub-skills of essay writing emphasized in the course: language (sentence structure, grammar, vocabulary, coherence, mechanics), organization (format, logical order of ideas, thesis and topic sentences), and content (major and minor supporting ideas). To indicate each expected grade, students selected one of the five possible grade ranges.
Results and discussion
A statistical comparison was made on a random sample of 30 surveys using the Wilcoxon Signed Ranks Test. This test indicates whether paired scores differ in their mean ranks and is appropriate when a normal distribution cannot be assumed. Results of the Wilcoxon test indicated significant differences (p ≤ .001) on all tests, confirming with a high degree of certainty that the differences between expected and actual grades shown in the survey results are not due to chance.
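For readers unfamiliar with the procedure, the sketch below shows the mechanics of a signed-rank comparison on paired data. It is not the computation carried out for this study: the eight paired values are invented for illustration, the coding of the five grade ranges as 1 (failing) to 5 (excellent) is an assumption, and the p-value uses a plain normal approximation with no correction for ties or continuity, which a statistics package would normally refine.

```python
from math import sqrt, erf

def wilcoxon_signed_rank(expected, actual):
    """Return (w_plus, w_minus, z, p) for paired samples (normal approximation)."""
    # Paired differences; pairs with no difference carry no sign and are dropped.
    diffs = [e - a for e, a in zip(expected, actual) if e != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all pairs are identical; the test is undefined")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # Sum the ranks separately for positive and negative differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation for the smaller rank sum (two-sided p-value).
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p = 1 + erf(z / sqrt(2))  # equals 2 * Phi(z), since z is never positive here
    return w_plus, w_minus, z, p

# Hypothetical paired grade ranges for eight students, coded 1-5.
expected = [4, 4, 3, 5, 4, 3, 4, 3]
actual = [3, 2, 3, 3, 3, 2, 4, 2]

w_plus, w_minus, z, p = wilcoxon_signed_rank(expected, actual)
print(w_plus, w_minus, round(z, 2), round(p, 3))
```

A large positive rank sum with a small negative one, as in this invented sample, is the pattern that would indicate expectations running systematically above actual grades.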
It is not possible to pinpoint the accuracy with which individual students predicted their grades because the survey responses were tallied as means. The results are most revealing when student expectations are examined as a whole, and we can see that student grade expectations differed from actual grades.
Table 1 shows that the mean actual scores of the students on the two essays are one grade level lower (10%) than their mean grade expectations.
Since the gap between mean expected and mean actual grades is large, a whole proficiency level, a question raised is whether the students are aware of the criteria for each grade level. In other words, do students understand what is expected of them in the writing skills on which they are being tested? From random interviews with students and faculty, it seems they are not and that more work needs to be done in this area in the university’s EFL program. All of our efforts to set up valid and reliable testing criteria seem self-defeating if the learners themselves are unaware of their potential achievement level or what is expected in their writing. These are important issues that need to be addressed in any educational program.
Table 2 compares the percentage of students who expected each of the possible grade ranges with the percentage of students who actually received those grades on Essays 1 and 2. We can see that no student expected to fail on either of the essays, but actual results show a failure rate of 11.9 percent on Essay 1 and 14.4 percent on Essay 2. The most accurate predictions were made in the grade range 70–79%. Perhaps many of the students placed their expectations in this range because it represented a cautious and modest expectation.
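The comparison in Table 2 can also be read as a set of gaps, in percentage points, between the share of students expecting each grade range and the share actually receiving it. The short sketch below computes those gaps from the Table 2 figures. The variable and function names are hypothetical, and the sign convention (expected minus actual, so that a positive gap means more students expected the range than attained it) is a choice made here for illustration.

```python
BANDS = ["90-100%", "80-89%", "70-79%", "60-69%", "below 60%"]

# Figures from Table 2 (percentages of students), in the order of BANDS.
TABLE_2 = {
    "Essay 1": {
        "expected": [2.5, 37.7, 50.6, 9.3, 0.0],
        "actual": [0.5, 4.0, 41.6, 42.1, 11.9],
    },
    "Essay 2": {
        "expected": [5.6, 46.3, 44.2, 3.9, 0.0],
        "actual": [0.0, 6.9, 36.1, 42.6, 14.4],
    },
}

def gaps(essay):
    """Expected minus actual share of students, per grade range, in points."""
    row = TABLE_2[essay]
    return {
        band: round(exp - act, 1)
        for band, exp, act in zip(BANDS, row["expected"], row["actual"])
    }

for essay in TABLE_2:
    print(essay, gaps(essay))
```

Positive gaps cluster in the upper ranges and negative gaps in the two lowest ranges, which is the overall pattern of overestimation the survey reports.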
As can be seen in Table 2, expected and actual grades also differed in the 60–69% grade range: only 9.3% and 3.9% of the students expected grades in this range on Essays 1 and 2, respectively, whereas 42.1% and 42.6% actually received them. In the grade range 80–89%, students showed overconfident predictions of 37.7% and 46.3% on Essays 1 and 2, while only 4.0% and 6.9% actually attained these levels, respectively. Students were most overconfident in their predictions of grades in the 90–100% range; only 0.5% of the students actually attained this score on Essay 1, and none did so on Essay 2.

Table 1
Differences in Mean Expected Grades and Mean Actual Grades
(expressed as a percentage of total possible grade)

                  Mean Expected Grade   Mean Actual Grade
Essay 1 (E 1)            74%                  64%
Essay 2 (E 2)            75%                  65%

Table 3
Percentage of Students Selecting Each Grade Range for Writing Sub-skills in Essay 1
Expected vs. Actual Grades (figures are in percentages)

Grade range   Expected   Actual     Expected       Actual         Expected   Actual
              Language   Language   Organization   Organization   Content    Content
90–100%          7.1        0.5        10.2            0.0           9.0        0.0
80–89%          36.1        3.4        48.5            4.4          49.1        5.9
70–79%          44.1       36.9        34.9           40.9          36.4       38.9
60–69%          12.7       36.5         6.6           41.9           5.6       45.3
Below 60%        0.0       22.7         0.0           12.8           0.0        9.9
Table 3 shows expected and actual grades for the three sub-skills of writing (language, organization, and content) in Essay 1 (E 1). It indicates that the actual scores were lower than student expectations and that failure was not expected. In fact, the findings show that for E 1 there is a failure rate of 22.7%, 12.8%, and 9.9% on language, organization, and content, respectively. Again, grade expectations and actual grades were closest in the grade range 70–79%. Students had much higher expectations than actually obtained for both of the upper grade ranges, 80–89% and 90–100%. Of the three sub-skills, language proved to be the weakest for students, indicating a need to focus more on this sub-skill in the classroom.
Table 4 shows expected and actual grades for the three sub-skills of writing in Essay 2 (E 2). Similar to E 1, it indicates that students’ expectations in the sub-skills for that essay were higher than their actual test scores, and that all students expected to pass. In general, student expectations in the sub-skills were higher for E 2 than for E 1. Perhaps students gained more confidence in their abilities by the end of the semester and thus expected higher grades at the completion of the course, even though their actual scores do not support this expectation. In fact, no student attained a grade level of 90–100% in any of the sub-skills in E 2, and there were more actual scores in the failing range than in the grade range 80–89%. Also similar to E 1, students’ expectations were most realistic in the grade range 70–79%.
Table 4
Percentage of Students Selecting Each Grade Range for Writing Sub-skills in Essay 2
Expected vs. Actual Grades (figures are in percentages)

Grade range   Expected   Actual     Expected       Actual         Expected   Actual
              Language   Language   Organization   Organization   Content    Content
90–100%          9.5        0.0        14.8            0.0          10.1        0.0
80–89%          38.0        5.9        50.1            6.9          50.4        7.9
70–79%          45.7       34.7        32.9           36.1          35.3       37.1
60–69%           6.8       42.1         2.1           44.6           4.2       42.1
Below 60%        0.0       17.3         0.0           12.4           0.0       12.9

The results obtained from this survey reveal that students and their instructors have different perceptions of acceptable essay writing. This has important implications for writing evaluation in the university’s EFL program. Teachers need to help students increase their awareness and understanding of the proficiency levels required in writing essays.
One way teachers can do this is by showing their students sample essays, perhaps drawn from the students’ own work, that represent each of the grade levels from poor to excellent. These model essays could be photocopied for the class so that they can be read and discussed in detail. Students could take part in practice evaluation sessions by assigning grades for each sample essay, including the three sub-skills language, organization, and content, according to the criteria for essays used by the EFL program. Such practice evaluation could be done in small groups, with each group justifying the grades it assigns in short oral presentations to the rest of the class, followed by questions and discussion. Once this exercise is done, the teacher could discuss the different grade ranges and comment on the grades assigned by the groups in light of what grades the essays would likely receive in a testing situation.
A second way to raise students’ awareness of essay evaluation criteria is through individual or small group conferences held periodically with the teacher. In fact, although student-teacher conferences are carried out irregularly, they have been quite successful in the EFL program at the university, especially for lower proficiency level writers. Students become more involved in the evaluation process and more aware of what is expected in their essays, and thus realistically build confidence in their writing.
In addition to these awareness-raising activities, teachers need to revisit periodically the writing criteria being used for essay evaluation in light of recent research and innovations in teaching writing. Teachers also might need to clarify criteria for the different proficiency levels for the various types of writing tasks assigned throughout a semester. Essay tests in certain rhetorical modes, such as narration or description, might require different evaluation criteria than those used for essays in the comparison or contrast mode. Although the essay tests included in this survey were from the end of the semester, teachers might want to consider whether they should evaluate essays written earlier in the course according to objectives covered up to that point.
Testing is an inextricable part of the instructional process. If a test is to provide meaningful information on which teachers and administrators can base their decisions, then many variables and concerns must be considered. Testing writing is undeniably difficult. Although we teachers try hard to help students acquire acceptable writing proficiency levels, are we aware that perhaps our students do not know what is expected of them and do not have a realistic concept of their own writing abilities?
This article has reported the grade expectations of students and the actual grades they earned on two important end-of-semester essays. Results show that students’ expectations are significantly higher than their actual proficiency levels. Developing test procedures for more valid and reliable evaluation is necessary and important; however, it does very little to motivate students to continue learning if their perceived levels of performance are not compatible with those of their teachers. In addition to the need to develop valid and reliable testing procedures, we must not overlook the need to raise students’ awareness of their abilities. It is perhaps only through this understanding that genuine learning occurs.
Note: This is a revised version of a paper presented at the 21st Annual TESOL Greece convention, held in April 2000. The author received a grant from the Center for Research and Development at the Lebanese American University to support this research.
Airasian, P. W. 1994. Classroom assessment (2nd ed.). New York: McGraw-Hill.
Bacha, N. N. 1993. Faculty and EFL student perceptions of the language abilities of the students in the English courses at the Lebanese American University, Byblos Branch. Unpublished survey results, Byblos, Lebanon.
Bacha, N. N. 2001. Writing evaluation: What can analytic versus holistic scoring tell us? System, 29, 3, pp. 371–383.
Bachman, L. F. 1990. Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. 1991. What does language testing have to offer? TESOL Quarterly, 25, 4, pp. 671–672.

English Proficiency Test - The Oral Component of a Primary School


Ishbel Hingle and Viv Linington
Many teachers feel comfortable setting pencil-and-paper tests. Years of experience marking written work have made them familiar with the level of written competence pupils need in order to succeed in a specific standard. However, teachers often feel much less secure when dealing with tests which measure speaking and listening even though these skills are regarded as essential components of a diagnostic test which measures overall linguistic proficiency. Although the second-language English pupils often come from an oral rather than a written culture, and so are likely to be more proficient in this mode of communication, at least in their own language, speaking in English may be a different matter. In English medium schools in particular a low level of English may impede students’ acquisition of knowledge. Therefore, identifying the correct level of English of the student is all the more challenging and important.
This article outlines some of the problem areas described by researchers when designing a test of oral production for beginning-level speakers of English and suggests ways in which they may be addressed.

How does one set a test which does not intimidate children but encourages them to provide an accurate picture of their oral ability?
In replying to this question, one needs to consider briefly the findings of researchers working in the field of language testing. “The testing of speaking is widely regarded as the most challenging of all language tests to prepare, administer and score,” writes Harold Madsen, an international expert on testing (Madsen 1983:147). This is especially true when examining beginning-level pupils who have just started to acquire English, such as those applying for admission to primary school. Theorists suggest three reasons why this type of test is so different from more conventional types of tests.
Firstly, the nature of the speaking skill itself is difficult to define. Because of this, it is not easy to establish criteria to evaluate a speaking test. Is “fluency” more important than “accuracy,” for example? If we agree fluency is more important, then how will we define this concept? Are we going to use “amount of information conveyed per minute” or “quickness of response” as our definition of fluency?
A second set of problems emerges when testing beginning-level speakers of English, which involves getting them to speak in the first place, and then defining the role the tester will play while the speaking is taking place. Relevant elicitation procedures which will prompt speakers to demonstrate their optimum oral performance are unique to each group of speakers and perhaps even unique to each occasion in which they are tested. The tester will therefore need to act as a partner in the production process, while at the same time evaluating a number of things about this production.
A third set of difficulties emerges if one tries to treat an oral test like any other more conventional test. “In the latter, the test is often seen as an object with an identity and purpose of its own, and the children taking the test are often reduced to subjects whose only role is to react to the test instrument” (Madsen 1983:159). In oral tests, however, the priority is reversed. The people involved are important, not the test, and what goes on between tester and testee may have an existence independent of the test instrument and still remain a valid response.
How can one accommodate these difficulties and still come up with a valid test of oral production?
In answering this question, especially in relation to the primary school mentioned earlier, I would like to refer to the experience I and one of my colleagues, Viv Linington, had in designing such a test for the Open Learning Systems Education Trust (OLSET) to measure the success of their English-in-Action Programme with Sub B pupils. This Programme is designed to teach English to pupils in the earliest grades of primary school, using the medium of the tape recorder or radio.
In devising this test, we decided to use fluency as our basic criterion, i.e., “fluency” in the sense Brumfit uses it: “the maximally effective operation of the language system so far acquired by the student” (Brumfit 1984: 543). To this end, we decided to record the total number of words used by each pupil during the test administration and to employ this as an overall index to rank order the testees in terms of performance.
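A word-count index of this kind is easy to compute once each pupil’s speech has been transcribed. The following sketch is an illustration only, not the procedure OLSET used: it assumes that a “word” is any whitespace-separated token in the transcript, and the names `word_count`, `rank_pupils`, `pupil_a`, and `pupil_b` are hypothetical. The two sample answers are the ones quoted later in this article.

```python
def word_count(transcript):
    """Count whitespace-separated tokens in a pupil's transcribed speech."""
    return len(transcript.split())

def rank_pupils(transcripts):
    """Return (pupil, word count) pairs, from most to fewest words produced."""
    counts = {pupil: word_count(text) for pupil, text in transcripts.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Two answers to "What can you see in the picture?" quoted in this article.
transcripts = {
    "pupil_a": ("I can see a car and a woman going to the shop and a boy "
                "had a bicycle and the other one riding a bicycle"),
    "pupil_b": "Boy and bicycle",
}

for pupil, count in rank_pupils(transcripts):
    print(pupil, count)
```

How hesitations, repetitions, and home-language words are counted would need to be decided before such an index is applied to real recordings.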
To address the second and third set of problems outlined above, we decided to use elicitation procedures with which the children were familiar. Figures 1 and 2 would require the teacher to find a picture full of images the pupils could relate to such as children playing. Students could participate in the following types of activities:
• an informal interview, to put the children at ease by getting them to talk about themselves, their families and their home or school lives (See Figure 1).
• a set of guided answers to questions about a poster, to test their knowledge of the real life objects and activities depicted on the poster as well as their ability to predict the consequences of these activities (See Figure 2).
• narratives based upon packs of story cards, to generate extended language in which the children might display such features as cohesion or a knowledge of the English tense system in an uninterrupted flow of speaking.
Instead of treating the situation as a “test,” we asked testers to treat it as a “game.” Both partners would be seated informally on the ground (with, in our case, a recorder placed unobtrusively on the floor between them because of the research nature of our test). If the occasion was unthreatening to the pupil with the tester acting in a warm, friendly way, we anticipated the child would respond in a similar way, and thus produce a more accurate picture of his or her oral productive ability. We suggested the tester act as a Listener/Speaker only while the test was being conducted, and as Assessor once the test administration was over.

Figure 1
The tester should capture personal details by asking the following type of questions:
What is your name?
Where do you live?
Do you have any brothers or sisters?
Does anyone else live at home with you?
Now tell me, what do you all do when you get up in the morning?
How do you all go to school and work?
Do you have any brothers or sisters in this school?
What standards are they in?
Which subject do you enjoy most? Why?
What do you do at break?
Tell me about your best friends.
What does your mother/grandmother cook for dinner?
Can you tell me how she cooks it? Why do you all enjoy this food most?
Do you listen to the radio/watch TV in your house?
What is your favorite programme? Why do you enjoy it most?
What do you do when you are getting ready to sleep in the evening?
What time do you go to sleep? Why?
Now look at the picture and tell me what this little boy is doing. Let’s give him a name. What do you suggest?

To maintain a more human approach to the testing situation, we decided to allow the tester a certain flexibility in choosing questions to suit each particular child, and also in the amount of time she spent on each subtest. The time allowed for testing each pupil would be limited to 8 minutes, and all three subtests would be covered during this period, but the amount of time spent on each could vary.
Question banks were provided for testers to select questions they felt were within the range of each child’s experience, but there was an understanding that how and why questions were more difficult to answer than other Wh- questions. A range of both types should therefore be used.
Story packs also provided for a range of experiences and could be used by the tester telling a story herself first, thus demonstrating what was required of the pupil. However, it was anticipated that some pupils might be sufficiently competent to use the story packs without any prompting from the teacher. Pupils could place the cards in any order they chose, as the sole purpose of this procedure was to generate language. Story packs were composed of picture stories that had been photocopied from appropriate level books, cut up into individual pictures, and mounted on cardboard. Six pictures to a story pack were considered sufficient to prompt the anticipated length of a story pupils could handle.
This test of oral production was administered at both rural and urban schools to children who were on the English-in-Action Programme and those who were not. The comparative results are not relevant here, but findings about which aspects of the test worked and which did not may be of assistance to those who wish to set similar tests. In summarising these findings, I will comment on the administration of the test, the success of each subtest in eliciting language, and, finally, on the criteria we used for evaluating the test outcomes.

Figure 2
Questions for guided response:
What are the children doing? Where are they?
How many children are there? Are there more boys than girls? How do you know this?
What is the girl in the green dress doing?
What are the boys going to do when they finish playing marbles?
Do you think the children are happy?
Have you ever played marbles?
(If yes) How do you play marbles?
(If no) What other game do you play with your friends? How do you play it?
Now look at the picture and tell me what this little boy is doing. Let’s give him a name. What do you suggest?

Firstly, both testers commented that this type of test was more difficult to organise and administer than other kinds of evaluation tests they had used. This was caused by the need to find a quiet and relatively private place to administer the test and record the outcome and because the procedure could be done only on a one-to-one basis. We had anticipated this type of feedback but were also not surprised when told that subsequent administrations “were much easier and the children were more enthusiastic about participating than the previous time.” The testing procedure was new to both tester and testee, but once experienced, it gave children greater freedom of expression than other kinds of tests.
Secondly, while the test as a whole did elicit oral language production, the amount and type of language varied from subtest to subtest. The interview produced rather less language than the other two subtests; it also elicited rather learned chunks of language, which we called “patterned responses.”
The guided responses, on the other hand, produced a much greater variety of answers, couched in a fairly wide range of grammatical structures. But even these responses consisted on the whole of single words or phrases. Open-ended questions evoked longer responses from the more able students, but seemed to confound less able students. For example, the question “What can you see in the picture?” produced the answer “I can see a car and a woman going to the shop and a boy had a bicycle and the other one riding a bicycle,” from a bright pupil, but only “Boy and bicycle” from a weaker pupil.
Higher order Wh- questions such as “What do you think is in the suitcase?” or “What will happen next?” seemed to produce only “I don’t know” responses from even the most competent pupils. They seemed to lack the linguistic resources, or perhaps the cognitive resources, to predict or suggest answers.
The narrative subtest, based on the story cards, elicited the best display of linguistic ability from the testees, both in terms of amount of language produced and range of grammatical structures used.
Competent pupils were able to respond well to the tell/retell aspect and constructed sentences of 7 to 10 words in length, joined by a variety of coordinating devices. They also employed past tense forms in retelling the story such as the following:
The boys they played with the cow’s what ...... what ...... a ...... bells three bells ...... then they got some apples and went to swim ...... the monkey saw them swim and putted them shirts and shorts ...... some they said hey ...... I want my shirts ...... wait I want my shirts ...... but monkey she run away
Less competent students could describe isolated images on each card without using narrative in any way to link them together.
From these results we therefore concluded that the story packs were the most successful of the three elicitation procedures we used in stimulating optimum language output.
The final relevant issue from the findings of the OLSET test concerns the criteria used for assessing the language output. Our decision to count “number of words produced” as a measure of speaking ability was a mixed blessing. Initially it did seem to rank order the pupils in terms of ability and gave us a base for comparison at subsequent test administrations, but non-verbal factors such as self-confidence, familiarity with the tester, and presence of the teacher may have affected even these results. In the second administration of the test, it was not at all accurate because improvement in ability to speak and respond in English was reflected more in the quality of how the testees spoke, rather than in the quantity of language they produced. Several of the more competent pupils spoke the second time in round 1 but displayed knowledge/features not present in their own home languages such as prepositions and articles, correctly used subordinating and coordinating conjunctions they had been introduced to only in the course of conversation, and employed a variety of tenses in their story telling. We therefore used this data to develop a number of assessment levels, or descriptive band scales, based upon these various grammatical competencies, when evaluating the pupils’ output (a band scale outlines a set of linguistic features and skills a pupil needs to display in order to be placed in that category).
In response to our discussion, some schools have begun to introduce two components in their diagnostic test. The first is a multiple-choice comprehension test and the second an oral test based upon a set of story cards.
The same test will be used for pupils at all levels of the primary school, using the lead provided by a test produced by the Human Sciences Research Council for the same purpose. However, the expected proficiency levels to enter a particular grade or standard will be different.
In conclusion, let me summarise the advice I would give to teachers who need to design speaking tests but who are afraid to take the plunge into this area of assessment:
• Do not be afraid to set such a test in the first place.
• Draw on your own materials to set a test appropriate for your group of testees.
• Keep the factor of time constant for each test administration.
• Give the testee the opportunity to lead once he or she is at ease.
• Do not allow factors such as accent to cloud your perception of linguistic competence.
• Rely on your own instinctive judgment when assigning a value to performance on such a test.
• Try and think of this value in terms of words rather than marks.
Brumfit, C. 1984. Communicative methodology in language teaching. Cambridge: Cambridge University Press.
Madsen, H. S. 1983. Techniques in testing. New York: Oxford University Press.
This article was originally published in the April 1997 issue.

English Teaching: Using Self-assessment for Evaluation

Richard Watson Todd
Using Self-assessment for Evaluation
SELF-ASSESSMENT APPEARED TO COME OF AGE IN 1980 WITH THE publication of a Council of Europe text on the topic (Oskarsson 1980). Since then, more and more programmes around the world have attempted to integrate self-assessment into the learning and evaluation¹ process, with varying degrees of success. The usefulness of self-assessment for learning purposes seems to be widely accepted, as illustrated by the widespread use of learner diaries. Self-assessment for evaluation purposes, however, is far less common, and many teachers actively resist its implementation. This situation is due, in part, to the ways in which self-assessment is frequently conducted. In this paper, I will argue that learners can conduct reliable, global self-assessment, and I will suggest three ways in which such data-driven self-assessment can be done.
1. I am using assessment as a broad term for all attempts to gain information concerning learners’ performance and ability, regardless of the purpose. Evaluation, in contrast, is used for those attempts that produce quantitative data that are used to generate scores measuring the learners’ performance and ability.
Purposes of self-assessment
Several reasons for using self-assessment have been suggested including:
• Self-assessment is a prerequisite for a self-directed learner. If a goal of learning is for learners to be self-sufficient and independent in language use, then training and experience in self-assessment are needed.
• Self-assessment can raise learners’ awareness of language, effective ways of learning, and their own performance and needs.
• Self-assessment increases motivation and goal orientation in learning.
• Some aspects of language learning, such as effort and learner beliefs, can only be assessed through self-assessment.
• Self-assessment can reduce the teacher’s workload.
The first four reasons clearly suggest that self-assessment can be integrated into courses for learning purposes.
Less clear, however, is whether these reasons imply that self-assessment should be used as part of the input in generating a learner’s score for a course. This depends on the objectives of the course. For final evaluations of learners’ performance on a course to be valid, the evaluations should match the course objectives. If the objectives include increased motivation, positive attitudes towards English, and greater independence and awareness, for example, then self-assessment should be seriously considered as a potential part of the overall evaluation for a course. Most teachers, however, strongly resist such a move, arguing that self-assessment is subjective, unreliable, open to cheating, and more reflective of the learner’s self-image than actual performance and ability. Such an attitude is at least partially due to the nature and characteristics of existing self-assessment instruments.
Self-assessment instruments
Some self-assessment instruments, while powerful when used for learning purposes, are inappropriate for evaluation purposes. These include learner diaries; the task-based self-assessment instruments of Tudor (1996) by which learners are encouraged to analyse various aspects of their learning, such as their difficulties in completing a task; and the critical incidents in learning of Singh (1998). These instruments are very subjective (indeed, subjectivity is the raison d’être of critical incidents) and produce qualitative information that cannot be converted into scores for evaluation purposes.
Self-assessment instruments that produce quantitative information that can be used for evaluation purposes fall into two categories: global self-assessments and self-marking instruments.
Global self-assessment
It is in the area of global self-assessment that Oskarsson’s (1980) work is most influential. Oskarsson suggested that global self-assessments could be conducted through rating scales and checklists. However, both of these, as Oskarsson suggests, are very problematic. To illustrate this, here are example questions used to measure learners’ speaking ability:
• Give yourself a rating for your speaking skills on a scale of 0 to 10, where 10 means “I am completely fluent in English” and 0 means “I cannot speak English at all.”
• Can you ask someone to help you to arrange an appointment with a doctor?
• Can you express sympathy using phrases like “I am sorry to hear that”?
At face value, these questions may seem fairly straightforward. But if we were to apply Oskarsson’s question format to teaching, we would produce questions such as:
• Give yourself a rating for your classroom management skills as a teacher on a scale of 0 to 10, where 10 is a perfect classroom manager and 0 is a complete incompetent.
• Can you explain the meaning of “behave”?
• Can you give clear instructions for a jigsaw reading activity?
As a teacher, your reaction to the first of these is likely to be a complete lack of confidence in your answer. Maybe, like me, you just plumped for a number in the middle of the range. For all of the items, you also probably feel that your answer depends on the teaching situation. I’d have no problem explaining “behave” to my postgraduate students, but I wouldn’t even attempt an explanation with a class of undisciplined kids on a Monday evening. In fact, it may even seem unfair to ask items like these.
Yet these are exactly the sort of items learners are faced with in Oskarsson’s questions promoting self-assessment of speaking and in other instruments for global self-assessment.
This type of self-assessment instrument lacks specificity and is divorced from reality. Instead of rating any real-world language performance, learners are asked to rate their own beliefs and perceptions with little or no evidence on which to base their assessments. Such self-assessment, although valuable for learning, is grist for the mill for teachers who argue that self-assessment is too subjective to be used for evaluation purposes.
Self-marking instruments
Self-marking involves learners in giving themselves a score for a piece of work. Where the task is objective, such as a multiple-choice exercise, an answer key can be provided, and learners can mark their own work easily. This reduces the teacher’s marking load and provides a reliable score (with a little cross-checking to discourage cheating) that can be used for evaluation purposes. The learning benefits of this approach are, however, negligible.
For more open-ended tasks, where there may be a very large number of possible answers, self-marking is more problematic. One example of how self-marking may be conducted is given by Gardner and Miller (1999). In their example, the task is to skim a newspaper and then listen to the news on the radio. Learners then give themselves two marks: one for their understanding of the main ideas of the news, and one for their understanding of details. This self-assessment task serves a useful learning purpose by highlighting areas in which learners need to do further work, but the marks from the self-assessment are hardly reliable enough to persuade most teachers to include them as part of the final score for a course.
To increase reliability, self-assessment on open-ended tasks needs to be clearly guided by detailed scoring criteria. The easiest way to generate such criteria is to break down the task into smaller components. For example, for a letter-writing task, the finished product could be self-marked for how well it follows the standard letter-writing conventions, such as introducing the purpose of the letter in the first paragraph, assigning each topic to a separate paragraph, and so on. The close guidance of scoring criteria such as these is likely to increase the reliability of the learner’s self-assessment, making it more palatable for inclusion in the final score for a course.
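As a rough illustration of criterion-guided self-marking, the sketch below turns a learner's checklist answers for a letter-writing task into a score. The first two criteria come from the example above; the third criterion, the function name self_mark, and the equal weighting of criteria are assumptions made purely for illustration, not part of any published instrument.

```python
# Hypothetical checklist for a letter-writing task.
# The first two criteria are taken from the text; the third is an
# invented placeholder to show how the list might be extended.
LETTER_CRITERIA = [
    "Purpose of the letter introduced in the first paragraph",
    "Each topic assigned to a separate paragraph",
    "Appropriate closing used",  # assumed criterion
]


def self_mark(checked, criteria=LETTER_CRITERIA):
    """Return the share of criteria the learner judges as met (0.0 to 1.0).

    `checked` maps each criterion to True (met) or False (not met).
    Equal weighting of criteria is an assumption of this sketch.
    """
    missing = [c for c in criteria if c not in checked]
    if missing:
        raise ValueError(f"Unanswered criteria: {missing}")
    met = sum(1 for c in criteria if checked[c])
    return met / len(criteria)


learner_answers = {
    LETTER_CRITERIA[0]: True,
    LETTER_CRITERIA[1]: True,
    LETTER_CRITERIA[2]: False,
}
score = self_mark(learner_answers)
print(f"Self-marked score: {score:.0%}")
```

Forcing the learner to answer every criterion, rather than giving a single overall mark, is what ties the score to observable features of the finished letter.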
Data-driven global assessments
The use of objective tasks and detailed scoring criteria for self-marking, as described above, is restricted to self-assessment of particular tasks. Such self-assessment may be included in the overall score for a course. However, it can provide only a snapshot of a learner’s performance. To obtain a measurement of a learner’s development throughout a course, self-assessment at a more global level is needed.
The usual approaches to global self-assessment, such as those of Oskarsson, are, as we have seen, fraught with problems. Divorced from any real-world performance, they end up as very subjective and unreliable guesses that are unsuitable as components of a final score. What is needed is some way to directly relate a learner’s performance to his or her global self-assessment to make that self-assessment more reliable and more reflective of actual performance and ability.
A key question in designing global self-assessments, therefore, is how they can be directly related to learners’ experiences. At the task level, the process of completing the task and the finished product provide a clear focus for and input into self-assessment. For global self-assessments at, say, the level of a course, what things could be used as data driving the self-assessment?
The most obvious and widely-used learning instrument that could be used as input for self-assessment is the portfolio. A portfolio is “a purposeful collection of students’ work that demonstrates to students and others their efforts, progress, and achievements in given areas” (Genesee and Upshur 1996:99). Since the portfolio is evidence to learners of their own efforts, progress, and achievements, it is suitable for self-assessment. To use a portfolio as self-assessment for evaluation purposes, questions to guide the self-assessment must be provided. Sample questions could include the following:
• To what extent did you achieve your goals in learning during this course?
• To what extent did you improve your reading? List some of the problems you faced while reading and how you solved those problems.
• To what extent has your knowledge of vocabulary improved? List the new words you have learnt from your portfolio.
• To what extent has your confidence in using English improved?
By referring to their portfolios in answering these questions, learners have concrete evidence of their performance and are not forced to rely on their intuition and possible bias about their performance or ability.
Pre- and post-course writing
A second way of conducting data-driven, rather than intuition-driven, global self-assessments is to use the time-honoured research technique of pre- and post-tests. Learners can be asked to write two essays about their attitudes towards learning English, one at the start and another at the end of the course. Comparing the two, learners are able to see the extent of their development through the course. With guiding questions, learners’ perceptions of their own development based on the pre- and post-course writing can provide self-assessment that can be used for evaluation purposes. The two pieces of writing can also be self-marked for certain language points. Whereas self-marking instruments applied to a given task provide a snapshot of the learner’s performance at a given moment in a course, a comparison of self-marking on pre- and post-course writing can give a clear indication of the learner’s development and improvement throughout the course.
Learner contracts
A third potential instrument for global self-assessment is the learner contract (e.g., Dickinson 1987). At the start of a course, learners identify two or three goals they want to achieve in the course, tasks and materials that can be used to reach these goals, and ways of measuring the extent to which the goals have been reached. For example, a learner may decide to increase his or her speed in reading. The learner can then identify some texts with comprehension questions to be used as practice and set a target level of achievement, such as an increase in reading speed of 50 words per minute while retaining a minimum of 70% for comprehension questions answered correctly. A learner contract, then, provides an organised series of tasks throughout a course and makes attaining specific goals an integral part of the learning process. The choice of goals in learner contracts can be left to the learner or can be controlled by the teacher to match the objectives of the course. In the latter case, self-assessment in learner contracts can be used as a valid part of the overall evaluation of learners in the course.
At present, self-assessment is a valuable tool in the teacher’s repertoire of techniques that enhance learning. If its uses are to be extended to include evaluation, self-assessment needs to be set up in such a way as to overcome the resistance of teachers. In this article, I have suggested that this can be done by basing self-assessment on concrete evidence of the learner’s performance and by giving guidelines on how to conduct the self-assessment. In these ways, self-assessment can become more reliable and fulfil an important role in providing learner input into evaluation for a course.
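To make the reading-speed example concrete, here is a minimal sketch of how a learner might check a contract target against recorded measurements. The class name ReadingRecord, the function contract_met, and the sample figures are hypothetical; only the thresholds (a gain of 50 words per minute and at least 70% comprehension) come from the example above.

```python
from dataclasses import dataclass


@dataclass
class ReadingRecord:
    """One timed reading measurement (hypothetical structure)."""
    words_per_minute: float
    comprehension: float  # fraction of questions answered correctly, 0.0 to 1.0


def contract_met(start, end, min_gain_wpm=50.0, min_comprehension=0.70):
    """Check whether the reading-speed goal in the learner contract was reached.

    The goal is met only if speed rose by at least `min_gain_wpm` AND
    comprehension at the end of the course is at least `min_comprehension`.
    """
    gain = end.words_per_minute - start.words_per_minute
    return gain >= min_gain_wpm and end.comprehension >= min_comprehension


# Invented sample measurements for illustration only.
start_of_course = ReadingRecord(words_per_minute=120, comprehension=0.80)
end_of_course = ReadingRecord(words_per_minute=175, comprehension=0.75)

print(contract_met(start_of_course, end_of_course))
```

Because both conditions must hold, a learner who reads faster but understands less does not count as having met the goal, which mirrors the way the target is phrased in the contract.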
Dickinson, L. 1987. Self-instruction in language learning. Cambridge: Cambridge University Press.
Gardner, D. and L. Miller. 1999. Establishing self-access: From theory to practice. Cambridge: Cambridge University Press.
Genesee, F. and J. A. Upshur. 1996. Classroom-based evaluation in second language education. Cambridge: Cambridge University Press.
Oskarsson, M. 1980. Approaches to self-assessment in foreign language learning. Oxford: Pergamon, for the Council of Europe.
Singh, K. 1998. Using critical incidents in self-assessment. Paper presented at the 3rd Annual Thai TESOL English for Specific Purposes Special Interest Group Conference, ESP at the Cutting Edge: New Challenges, New Solutions, Bangkok.
Tudor, I. 1996. Learner-centredness as language education. Cambridge: Cambridge University Press.
RICHARD WATSON TODD is an associate professor in applied linguistics at King Mongkut’s University of Technology Thonburi in Bangkok, Thailand.