Recent Posts
- Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
- How confessions can keep language models honest
- Evaluating AI’s ability to perform scientific research tasks
- Evaluating chain-of-thought monitorability
- AGOD: Enhancing Multi-Agent Generalization via Attribution-Guided Observation Dropout