According to PANews, OpenAI has launched a new benchmark called MLE-bench, which aims to evaluate how well AI agents perform at developing machine learning solutions. The benchmark covers 75 Kaggle competitions, focusing on challenging tasks in current machine learning development and comparing agent results with human performance.

In preliminary tests, the o1-preview model combined with the AIDE framework performed best, achieving at least a bronze medal in 16.9% of the competitions and surpassing Anthropic's Claude 3.5 Sonnet. When given more attempts per competition, o1-preview's success rate roughly doubled to 34.1%.
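For context, the percentages above describe a "medal in at least one of k attempts" rate across competitions. The minimal sketch below shows one way such a rate could be tallied; the function name and sample data are illustrative assumptions, not code from MLE-bench.

```python
# Illustrative sketch only (hypothetical names and data, not MLE-bench's code):
# tally the fraction of competitions where any attempt reached medal level.
from typing import Dict, List

def medal_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of competitions where at least one attempt earned a medal.

    `results` maps a competition name to one boolean per attempt
    (True = that attempt's submission reached at least bronze-medal level).
    """
    if not results:
        return 0.0
    medalled = sum(1 for attempts in results.values() if any(attempts))
    return medalled / len(results)

# Hypothetical example: three competitions, two attempts each.
runs = {
    "spaceship-titanic": [False, True],
    "house-prices": [False, False],
    "digit-recognizer": [True, True],
}
print(f"medal rate: {medal_rate(runs):.1%}")  # -> 66.7%
```

With more attempts per competition, the chance that any single attempt clears the medal bar can only stay the same or rise, which is consistent with the reported jump from 16.9% to 34.1%.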

OpenAI believes that MLE-bench helps assess core ML engineering skills, although it does not cover all areas of AI research.