Recent Posts
- Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
- DataSciBench: An LLM Agent Benchmark for Data Science
- Beyond Code Generation: LLM-supported Exploration of the Program Design Space
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
- AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents