this post was submitted on 24 Feb 2025

Machine Learning

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Successful-Western27 on 2025-02-22 07:02:41+00:00.
A new evaluation benchmark tests language models across 285 graduate-level disciplines using an iterative human-AI collaborative approach to generate and validate questions. The methodology combines expert review with model-assisted filtering to ensure high-quality, discipline-appropriate assessment.

Key technical points:

  • Uses a two-stage question generation process: initial AI generation followed by expert review
  • Implements collaborative filtering where both human experts and LLMs help identify and remove problematic questions
  • Covers disciplines from traditional academia to specialized industrial fields
  • Tests both factual knowledge and reasoning capabilities
  • Evaluated on multiple leading LLMs including GPT-4, Claude 2, and DeepSeek
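The two-stage pipeline above can be sketched in a few lines. This is a hypothetical illustration of the generate-then-filter structure only; all function names and the stub draft/review callables are stand-ins, not the paper's actual implementation:

```python
# Hypothetical sketch of the generate-then-filter pipeline: an LLM drafts
# candidate questions (stage 1), then a question survives only if both a
# human-expert check and a model-assisted check accept it (stage 2).
# The callables passed in are stand-ins, not the paper's real components.

def generate_candidates(discipline, n, draft_fn):
    """Stage 1: draft n candidate questions for one discipline."""
    return [draft_fn(discipline, i) for i in range(n)]

def collaborative_filter(questions, expert_ok, model_ok):
    """Stage 2: keep a question only if expert and model both accept it."""
    return [q for q in questions if expert_ok(q) and model_ok(q)]

def build_dataset(disciplines, n_per, draft_fn, expert_ok, model_ok):
    """Run both stages per discipline and collect the validated set."""
    return {
        d: collaborative_filter(
            generate_candidates(d, n_per, draft_fn), expert_ok, model_ok
        )
        for d in disciplines
    }

# Toy demo with stub callables:
draft = lambda d, i: f"[{d}] question {i}"
keep_even = lambda q: int(q.rsplit(" ", 1)[1]) % 2 == 0  # stand-in "expert" review
accept_all = lambda q: True                              # stand-in "model" check
validated = build_dataset(["geology"], 4, draft, keep_even, accept_all)
```

The key design point is that filtering is conjunctive: disagreement from either the human or the model check removes a question, which biases the retained set toward higher quality.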

Results:

  • Best performance: DeepSeek-R1 at 61.82% accuracy
  • Significant variance in performance across different disciplines
  • 80+ expert annotators involved in validation
  • Generated dataset of 2,855 validated questions
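To make headline numbers like "61.82% accuracy" and "significant variance across disciplines" concrete, here is a small sketch of how overall accuracy and per-discipline spread are typically computed from per-question grades. The data below is synthetic, not from the benchmark:

```python
# Synthetic-data sketch: aggregate per-question correctness into an
# overall accuracy plus a per-discipline breakdown, then measure the
# spread across disciplines.

from collections import defaultdict
from statistics import pstdev

def score(results):
    """results: list of (discipline, correct: bool) tuples."""
    per = defaultdict(list)
    for discipline, correct in results:
        per[discipline].append(1 if correct else 0)
    overall = sum(sum(v) for v in per.values()) / sum(len(v) for v in per.values())
    by_disc = {d: sum(v) / len(v) for d, v in per.items()}
    return overall, by_disc

demo = [("math", True), ("math", False), ("law", True), ("law", True)]
overall, by_disc = score(demo)          # overall accuracy and per-field accuracy
spread = pstdev(by_disc.values())       # dispersion across disciplines
```

Reporting the spread alongside the mean matters here: an aggregate 62% can hide fields where the model performs far below that average.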

I think this benchmark addresses a critical gap in LLM evaluation by going beyond common academic subjects. The methodology of combining human expertise with AI assistance for question validation could be valuable for developing future evaluation datasets.

I think the relatively modest performance (62%) on graduate-level questions across diverse fields suggests current LLMs still have significant room for improvement in specialized domains. This could influence how we approach model training and evaluation for domain-specific applications.

TLDR: New benchmark tests LLMs across 285 graduate disciplines using human-AI collaborative question generation. Best model achieved 62% accuracy, revealing gaps in specialized knowledge.

Full summary is here. Paper here.
