Programmatic and model-based evaluations
Tasks in CURIE are diverse and have ground-truth annotations in mixed and heterogeneous forms, e.g., as JSON, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response for each field can take different forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
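Before any programmatic metric can be applied, heterogeneous surface forms like the two grid-point notations above must be reduced to a common canonical form. A minimal sketch of that kind of normalization, assuming integer grid dimensions (the function name is our own, not from the benchmark code):

```python
import re

def normalize_grid_points(text: str) -> tuple:
    """Reduce a materials grid specification to a canonical tuple of ints.

    Handles both surface forms seen in practice, e.g. "[4, 4, 2]"
    and "4 × 4 × 2" (hypothetical helper; assumes integer dimensions).
    """
    return tuple(int(n) for n in re.findall(r"\d+", text))

# Both notations collapse to the same canonical value:
assert normalize_grid_points("[4, 4, 2]") == normalize_grid_points("4 × 4 × 2")
```

Only after such canonicalization do exact-match style metrics become meaningful; for fields where no normalization is possible, the model-based metrics below take over.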
(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We consider the weighted average of the log-likelihood scores of the tokens to produce a final confidence.
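One way to read "weighted average of the log-likelihood scores" is to convert the judge model's log-likelihoods over the three rating tokens into probabilities and take a probability-weighted average of per-label scores. The sketch below assumes label weights of 1.0 / 0.5 / 0.0 for good / okay / bad; the paper's exact weighting scheme may differ:

```python
import math

# Assumed numeric weight for each rating token (our choice, for illustration).
LABEL_WEIGHTS = {"good": 1.0, "okay": 0.5, "bad": 0.0}

def lmscore(token_logprobs: dict) -> float:
    """Collapse the judge LLM's log-likelihoods over the rating tokens
    into one confidence in [0, 1]: normalize to probabilities, then take
    the probability-weighted average of the label weights."""
    probs = {tok: math.exp(lp) for tok, lp in token_logprobs.items()}
    total = sum(probs.values())
    return sum(LABEL_WEIGHTS[tok] * p / total for tok, p in probs.items())

# A judge strongly favoring "good" yields a confidence near 1.
score = lmscore({"good": -0.1, "okay": -2.5, "bad": -5.0})
```

Because the weights are normalized by the total probability mass of the three tokens, the score stays in [0, 1] even when the judge spreads mass across labels.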
(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with the predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents.
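Once the LLM judge has matched predicted records to ground-truth records, the per-document scores reduce to standard set-retrieval arithmetic: precision over the predicted records, recall over the ground-truth records, and their harmonic mean. A sketch under those assumptions (function name is ours):

```python
def retrieval_scores(num_matched: int, num_predicted: int,
                     num_ground_truth: int) -> tuple:
    """Per-document precision, recall, and F1, given how many predicted
    records the LLM judge matched to ground-truth records."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. the judge matched 3 of 4 predicted records against 5 ground-truth ones:
p, r, f1 = retrieval_scores(3, 4, 5)  # precision 0.75, recall 0.6
```

The document-level numbers are then averaged across all documents to give the mean precision, recall, and F1 reported for the task.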