Ugh… I’m deep in the AI sphere, and this seems like a bad idea to me. GPT (let’s face it, they’re probably using OpenAI) can be deeply biased and arbitrary in its evaluations.
For example, “Two apples and four oranges.” might score better than “4 oranges and 2 apples.” for inscrutable reasons. Say the question spelled out the numbers, and the LLM has a weighted bias toward overall textual consistency; it might then produce a reason to dock points that is apparently unrelated to that weight, such as “incomplete sentence” for the second answer but not the first.
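If you want to see this for yourself, here’s a rough probe I’d run (a sketch, assuming the `openai` Python package; the rubric, model name, and answers are made up for illustration, not anyone’s real grading setup):

```python
# Probe for phrasing sensitivity in an LLM grader: score two semantically
# equivalent answers repeatedly and compare the score distributions.
# Assumes the `openai` package (>=1.0) and an OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Question: Tom has two apples and four oranges. List his fruit. "
    "Score the student's answer from 0 to 10 and reply with the number only."
)

ANSWERS = [
    "Two apples and four oranges.",  # matches the question's spelled-out style
    "4 oranges and 2 apples.",       # same content, digits and reversed order
]

def score(answer: str, trials: int = 10) -> list[int]:
    """Grade the same answer several times (default temperature, so
    run-to-run variance is part of what we're measuring)."""
    scores = []
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": answer},
            ],
        )
        # Assumes the model obeys the "number only" instruction.
        scores.append(int(resp.choices[0].message.content.strip()))
    return scores

for ans in ANSWERS:
    print(ans, "->", score(ans))
```

If the two distributions diverge, the grader is reacting to surface form, not content.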
Students may also receive lower scores because of cultural biases around certain phrasings, or for factors as straightforward as their name.
Finally, if you ask an AI to evaluate text that contains no errors, it will hallucinate errors constantly. Constantly. Consistently.
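You can measure that false-positive rate the same way: hand it a passage with no errors and count how often it invents one (same assumptions as above; the passage and model name are placeholders):

```python
# False-positive check: repeatedly ask the grader to find errors in a
# deliberately clean passage and see how often it reports "NO ERRORS".
from openai import OpenAI

client = OpenAI()

CLEAN_TEXT = "The sum of two and four is six. Tom has six pieces of fruit."

for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "List every error in the following answer, or reply "
                       f"'NO ERRORS' if there are none:\n\n{CLEAN_TEXT}",
        }],
    )
    print(resp.choices[0].message.content)
```

Any run that doesn’t come back “NO ERRORS” is a hallucinated error, and in a grading pipeline that’s points docked from a correct answer.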