1. Why Measurement Matters in Language Testing
When we design a language test, we are not simply creating an exam; we
are building a scientific tool for observing and quantifying human
communication. In language education, measurement allows us to turn
abstract ideas like “fluency,” “vocabulary control,” or “writing coherence”
into observable evidence. This process follows principles
shared by all social sciences, where human behavior is measured under
controlled and transparent procedures (Bachman, 1990).
In other
words, language testing isn’t guesswork — it’s a systematic way of
collecting information that can later support sound educational decisions.
But this process is only meaningful when our instruments are both reliable
(consistent over time and across raters) and valid (accurately measuring
what they claim to measure).
2. Defining the Core Terms: Measurement, Test, and Evaluation
Although we
often use these words interchangeably, each serves a unique function in the
assessment process. Understanding their differences helps teachers design
better, fairer evaluations.
Measurement
Measurement
refers to the act of quantifying human characteristics following
explicit rules and procedures. It’s about assigning numbers or categories to
observable features of learners — such as their reading comprehension or
speaking fluency — in a way that is replicable and systematic (Carroll, 1987).
For
example, when we rate a student’s speaking performance on a 1–5 scale, we are
quantifying an attribute (fluency or accuracy) using agreed-upon
criteria. What makes it measurement rather than opinion is the presence
of clear rules. Without shared criteria, one teacher might weight
pronunciation more heavily while another prioritizes vocabulary, making the
results subjective and unreliable.
So,
measurement must always be explicit, structured, and replicable,
ensuring that the same performance would receive a similar score regardless of
who observes it.
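To make this concrete, here is a minimal sketch (in Python) of what an explicit, replicable scoring rule can look like. The criteria, descriptors, and the rounded-mean combination rule are hypothetical illustrations, not a published rubric.

```python
# Hypothetical analytic speaking rubric: each criterion is scored 1-5 against
# written descriptors, and the overall band is the mean, rounded to a whole band.
RUBRIC = {
    "fluency":       "1 = frequent long pauses ... 5 = smooth, effortless delivery",
    "accuracy":      "1 = errors block meaning ... 5 = rare, non-impeding errors",
    "pronunciation": "1 = often unintelligible ... 5 = consistently intelligible",
    "vocabulary":    "1 = very limited range ... 5 = wide, precise range",
}

def overall_band(criterion_scores):
    """Apply the same explicit combination rule regardless of who the rater is."""
    missing = set(RUBRIC) - set(criterion_scores)
    if missing:
        raise ValueError(f"Unscored criteria: {missing}")
    return round(sum(criterion_scores.values()) / len(criterion_scores))

print(overall_band({"fluency": 4, "accuracy": 3, "pronunciation": 4, "vocabulary": 4}))  # 4
```

Because the combination rule is written out, two raters who assign the same criterion scores necessarily arrive at the same band.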
Test
A test
is a particular kind of measurement instrument. It is a structured procedure
designed to elicit specific behaviors that reflect an underlying ability
(Carroll, 1968). For example, a speaking test like the Interagency Language
Roundtable (ILR) oral interview follows a defined sequence of prompts and
rating criteria to capture authentic speech samples (Lowe, 1982).
The key
point is that a test doesn’t measure everything a person can do with
language — it measures what the elicitation procedures allow us to
observe. This is why test design matters: a poorly designed task may not
reflect the skill we intend to measure.
For
instance, if a teacher uses a casual classroom conversation to assess academic
writing ability, the results won’t be valid. The task must target the
construct — the specific skill or ability — that the teacher wants to
measure.
In short,
tests are controlled samples of behavior carefully designed to allow
teachers to make valid inferences about learners’ linguistic abilities (Fulcher
& Davidson, 2007).
Evaluation
Evaluation goes one step further. It’s the process
of making decisions based on the information we’ve collected (Weiss, 1972).
Evaluation involves judgment — deciding what a test score or observation means
and what actions to take as a result.
For
example, a test might measure a student’s reading comprehension level;
evaluation is when we decide whether that level qualifies the student for an
advanced class.
Not all evaluation involves testing, and not all tests are
evaluative. A classroom quiz might be used purely for practice rather than
grading, just as a teacher’s narrative feedback can be evaluative even without a
numerical score. The distinction lies in purpose: measurement provides information,
while evaluation uses that information to make decisions (Bachman &
Palmer, 2010).
3. Essential Qualities of Good Measurement: Reliability and Validity
Reliability
Reliability
refers to consistency — the degree to which test scores remain stable
across different times, tasks, or raters. A reliable test gives you similar
results under similar conditions (American Psychological Association, 2014).
For
instance, if a student scores 80 on a grammar test today but 60 on the same
test tomorrow without any change in ability, the instrument is unreliable.
Likewise, if two teachers score the same writing sample very differently, the
lack of consistency threatens the test’s reliability.
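One common way to check this kind of rater consistency is Cohen’s kappa, a chance-corrected agreement index. The sketch below is illustrative only; the two teachers’ band scores are invented.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same performances."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

# Invented band scores (1-5) from two teachers rating the same ten essays.
teacher_1 = [3, 4, 2, 5, 3, 3, 4, 2, 5, 4]
teacher_2 = [3, 4, 3, 5, 3, 2, 4, 2, 4, 4]
print(f"Cohen's kappa: {cohens_kappa(teacher_1, teacher_2):.2f}")  # ≈ 0.59
```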
To improve reliability, bilingual teachers can:
- Use clear rubrics with defined descriptors.
- Train raters to ensure inter-rater agreement.
- Pilot test items and analyze them for difficulty and discrimination (see the sketch after this list).
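As a minimal sketch of that last step, difficulty can be estimated as the proportion of students answering an item correctly, and discrimination as the correlation between the item and the total score (an uncorrected point-biserial here; an operational analysis would normally exclude the item from the total). The response data below are invented.

```python
# Invented right/wrong (1/0) responses: rows = students, columns = items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]
totals = [sum(row) for row in responses]

def mean(xs):
    return sum(xs) / len(xs)

def correlation(xs, ys):
    """Pearson correlation; with a 0/1 item this is the point-biserial."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

for j in range(len(responses[0])):
    item = [row[j] for row in responses]
    difficulty = mean(item)                      # proportion correct (facility value)
    discrimination = correlation(item, totals)   # item score vs. total score
    print(f"Item {j + 1}: difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")
```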
Reliability
doesn’t make a test “good” by itself, but without it, results are meaningless.
Validity
Validity is
the core of all assessment. It concerns whether the interpretations and
uses of test scores are meaningful, appropriate, and justified (Messick,
1989). In simple terms, a test is valid if it measures what it claims to
measure.
For
example, if a writing test requires students to listen to a lecture first,
their performance might reflect both listening and writing ability —
making it an invalid measure of writing alone.
Validity
also depends on context. A vocabulary test for native English-speaking
children cannot validly assess bilingual learners’ communicative competence;
the purpose, population, and construct must align (Bachman & Palmer, 2010).
In
practice, validity is never “absolute.” It’s built on evidence and
argumentation — combining theoretical justification, statistical data, and
professional judgment.
4. Understanding Measurement Scales: From Names to Numbers
When we
measure language abilities, we assign numbers or categories to learner
performance. These measurement scales determine how we interpret results
(Stevens, 1946).
| Scale Type | Description | Example in Language Testing |
|---|---|---|
| Nominal | Labels or categories without order | Native language (1 = Spanish, 2 = English, 3 = Arabic) |
| Ordinal | Ordered rankings, but intervals unequal | Speaking proficiency ranked as Beginner, Intermediate, Advanced |
| Interval | Equal distances between scores, no absolute zero | Standardized test scores (e.g., TOEFL 90 vs. 100) |
| Ratio | Equal intervals and absolute zero | Number of correct answers on a grammar quiz |
Most language test scores fall on ordinal or interval scales,
meaning we can compare learners but must interpret numerical differences
carefully. A score of 90 is not necessarily “twice” as proficient as a score of
45; it simply represents a higher position on the same scale.
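A short sketch (with invented scores) shows why the scale type constrains interpretation: ratio statements make sense for raw counts of correct answers, but not for interval-style scaled scores or ordinal labels.

```python
# Ratio scale: counts of correct answers have a true zero, so ratios are meaningful.
correct_a, correct_b = 40, 20
print(correct_a / correct_b)  # 2.0 -> A answered twice as many items correctly

# Interval-style scaled scores have no absolute zero, so the same arithmetic
# does not license a "twice as proficient" claim.
scaled_a, scaled_b = 90, 45
print(scaled_a / scaled_b)    # 2.0, but not "twice the proficiency"

# Ordinal labels support ordering only; averaging or dividing them is meaningless.
levels = ["Beginner", "Intermediate", "Advanced"]
print(levels.index("Advanced") > levels.index("Beginner"))  # True: order is all we can claim
```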
5. Bringing It All Together: Designing Fair, Meaningful Assessments
For
bilingual teachers, applying these principles means ensuring that every
classroom test is:
- Purposeful – Clearly defines what it intends to measure.
- Consistent – Produces similar results across contexts.
- Valid – Reflects the skill it aims to assess.
- Ethical – Respects learner diversity and uses results fairly.
In
practice, that could mean developing speaking rubrics with explicit
descriptors, using peer calibration sessions for rating reliability, or
combining quantitative test scores with qualitative observations for a holistic
evaluation.
When we understand the difference between measurement, testing,
and evaluation, we don’t just test students; we empower them by
providing clear, accurate, and actionable feedback about their learning
journey.
📚 References
American Psychological Association. (2014). Standards for educational and psychological testing. Washington, DC: APA.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford University Press.
Carroll, J. B. (1968). The psychology of language testing. In A. Davies (Ed.), Language testing symposium (pp. 46–69). Oxford University Press.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. Routledge.
Lowe, P. (1982). The ILR oral proficiency interview. Language Testing, 1(2), 163–178.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Weiss, C. H. (1972). Evaluation research: Methods for assessing program effectiveness. Prentice-Hall.