Simbian's Cyber Defense Benchmark reveals that LLMs can find and exploit vulnerabilities but fail at defense out of the box, without a sophisticated harness.
A study published in Nature by an international research team has found that current AI benchmarks fail to accurately measure large language models' core capabilities. Existing tests often mix skills ...
Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v6.0 benchmark suite. This release includes several important advances that ensure the benchmark suite tests ...
A Cairo-based artificial intelligence startup has released Horus 1.0-4B, a fully open-source large language model built in Egypt that outperforms several ...
Simbian’s new Cyber Defense Benchmark found that no leading large language model (LLM) could pass realistic enterprise cyber defense tests, despite their offensive capabilities. The study highlights a ...
Chinese artificial intelligence developer DeepSeek today released a new series of open-source large language models. V4, as ...
NEW YORK – Bloomberg today released a research paper detailing the development of BloombergGPT™, a new large-scale generative artificial intelligence (AI) model. This large language model (LLM) has ...
DeepSeek says both models are more efficient and more performant than DeepSeek V3.2 thanks to architectural improvements, and have ...