OpenAI explores a novel approach to AI alignment research: using weaker AI models to supervise more advanced systems and elicit much of their capability, offering a potential template for managing future superintelligence.
In the quest to align superhuman AI systems, a crucial question arises: can humans, as comparatively weak supervisors, reliably control AI systems that surpass their own intelligence? OpenAI's Superalignment team studies an empirical proxy for this problem, asking whether small AI models can supervise larger, more capable ones, with the aim of building a testbed for future alignment work in which humans must oversee superintelligent systems.
The team frames this as an analogy: instead of humans supervising weaker AI (the standard setup today), they reverse the roles and use a smaller AI model to supervise a more advanced one, mimicking the future challenge of humans aligning superhuman systems. Their hypothesis is that, rather than teaching the strong model new skills, the right supervision can elicit capabilities the model already possesses.
Applying a simple method, fine-tuning GPT-4 on labels produced by a GPT-2-level model acting as the weak supervisor, OpenAI found that the resulting model consistently outperforms its supervisor, reaching performance between GPT-3 and GPT-3.5 on NLP tasks. This suggests meaningful weak-to-strong generalization is possible, though it is not yet reliable across all tasks.
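To make the setup concrete, here is a minimal, self-contained PyTorch sketch of the weak-to-strong training loop on a synthetic task: a small model is trained on ground-truth labels, a larger model is then fine-tuned on the small model's imperfect predictions, and both are evaluated against held-out ground truth. The architectures, data, and hyperparameters here are illustrative assumptions, not OpenAI's experimental setup, and on a toy task like this the student may simply imitate its teacher; the paper's strongest results also rely on additional techniques such as an auxiliary confidence loss.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-in for an NLP benchmark: a linearly separable binary task.
# Model sizes, data sizes, and hyperparameters are illustrative assumptions.
DIM = 64
W_TRUE = torch.randn(DIM)  # fixed ground-truth direction shared by all splits

def make_data(n):
    x = torch.randn(n, DIM)
    y = (x @ W_TRUE > 0).long()  # ground-truth labels
    return x, y

def mlp(hidden):
    return nn.Sequential(nn.Linear(DIM, hidden), nn.ReLU(), nn.Linear(hidden, 2))

def train(model, x, y, epochs=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True):
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

x_sup, y_sup = make_data(2000)    # labeled data for the weak supervisor
x_pool, y_pool = make_data(2000)  # pool the strong model sees only via weak labels
x_test, y_test = make_data(2000)  # held-out ground truth for evaluation

# 1. Train the small "weak supervisor" on ground-truth labels.
weak = train(mlp(hidden=16), x_sup, y_sup)

# 2. The weak supervisor labels the pool; these labels contain its mistakes.
with torch.no_grad():
    weak_labels = weak(x_pool).argmax(dim=1)

# 3. Fine-tune the larger "strong student" on the weak, imperfect labels.
strong = train(mlp(hidden=256), x_pool, weak_labels)

# 4. Ceiling: the same strong model trained directly on ground truth.
ceiling = train(mlp(hidden=256), x_pool, y_pool)

print(f"weak supervisor: {accuracy(weak, x_test, y_test):.3f}")
print(f"weak-to-strong:  {accuracy(strong, x_test, y_test):.3f}")
print(f"strong ceiling:  {accuracy(ceiling, x_test, y_test):.3f}")
```

The quantity of interest is how much of the gap between the weak supervisor's accuracy and the strong ceiling the weakly supervised student recovers, the "performance gap recovered" metric used in the paper.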
Recognizing important disanalogies between this setup and the real-world challenge of aligning superintelligent AI, OpenAI identifies avenues for further research: refining the method, addressing those disanalogies, and making the approach more scalable. They invite the machine learning community to engage with this research through open-source code and a substantial grants program.