Evaluating large language models for criterion-based grading from agreement to consistency