The Twilight Era: Humans Supervising Generative AI (While They Still Can)

Rishi Yadav
Published in roost · Oct 22, 2023

As we delve deeper into the world of generative artificial intelligence, we continue to draw upon insights from Andrej Karpathy’s talk at Microsoft Build 2023 on the training infrastructure behind AI assistants like ChatGPT. In our previous discussion, we explored pre-training: the initial stage where the model learns to predict the next word in a sentence using a technique known as masking. However, the journey doesn’t end there. After pre-training, two additional training stages, along with one crucial intermediate stage, further shape the capabilities of these AI assistants.
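To make that recap concrete, here is a minimal sketch of the next-word-prediction objective in PyTorch. The tiny model and random token data are placeholders of my own, not the actual infrastructure Karpathy describes, but the shifted-targets trick is the essence of this stage.

```python
# Minimal sketch of the pre-training objective: predict the next token at
# every position. The tiny model and random data are illustrative placeholders.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))  # (batch, seq, vocab) logits

model = TinyLM()
tokens = torch.randint(0, vocab_size, (4, 16))  # a batch of toy token ids
logits = model(tokens[:, :-1])                  # predictions for each position
targets = tokens[:, 1:]                         # the held-out "next words"
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # this update, repeated at enormous scale, is pre-training
```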

The Fine Art of Supervised Tuning

Once pre-training lays the groundwork, the Supervised Fine Tuning (SFT) process steps in to further refine the model’s capabilities. SFT takes a different approach from pre-training: instead of predicting missing words, it relies on labeling, where human experts provide the model with pairs of input and output examples to learn from. This difference in methodology underscores the flexibility of transformers: they can adapt to various learning techniques to hone their skills.

Through this process, SFT leverages human expertise to guide the model, helping it to align more closely with human values and expectations. This fine-tuning process ensures that the AI model can handle specific tasks with greater precision, paving the way for improved performance in its subsequent stages.
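As a rough illustration, here is what one SFT step might look like in PyTorch: a human-written (prompt, response) pair is fed to the model, and the loss is computed only on the response tokens. The names sft_step, prompt_ids, and response_ids are my own placeholders, and a toy model like the one sketched earlier would stand in for a real assistant.

```python
# Illustrative sketch of a supervised fine-tuning (SFT) step on one labeled
# (prompt, response) pair. Identifiers here are placeholders, not a real API.
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, prompt_ids, response_ids, vocab_size):
    """One SFT update on a single human-labeled example."""
    tokens = torch.cat([prompt_ids, response_ids], dim=0).unsqueeze(0)
    logits = model(tokens[:, :-1])                 # predict every next token
    targets = tokens[:, 1:].clone()
    targets[:, : prompt_ids.numel() - 1] = -100    # ignore loss on the prompt
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1), ignore_index=-100
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```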

The Bridge of Rewards: Modeling Anticipation

Once the AI model has been sufficiently fine-tuned, it’s time for the intermediate step — reward modeling. This stage is essential for preparing the model for the final stage of training. Reward modeling involves creating a model that can predict the rewards or ‘scores’ that human raters would give to different model outputs.

This process serves as a bridge between SFT and Reinforcement Learning from Human Feedback (RLHF), enabling the AI model to understand and adapt to human feedback more effectively. By learning to anticipate the rewards associated with different outputs, the model becomes better equipped to adapt its responses in line with human expectations.
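One common way to build such a reward model, sketched below, is to train it on pairwise human preferences: given two candidate responses to the same prompt, it learns to score the one raters preferred higher. This is an illustrative simplification of reward modeling, not necessarily the exact recipe from the talk.

```python
# Sketch of a reward model trained on pairwise human preferences: it learns
# to give the human-preferred response a higher scalar score. The pairwise
# ranking loss below is a common choice; the model itself is a toy stand-in.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens):
        # Mean-pool token embeddings and map to a single scalar "reward".
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

rm = RewardModel()
chosen = torch.randint(0, 100, (4, 16))    # responses human raters preferred
rejected = torch.randint(0, 100, (4, 16))  # responses they ranked lower

# Push the chosen response's score above the rejected one's.
loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
```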

The Dual-Edged Sword of Reinforcement Learning

The final chapter in the training journey of an AI model is Reinforcement Learning from Human Feedback (RLHF). This stage employs a reinforcement learning algorithm to fine-tune the model, refining its responses based on the reward predictions from the reward model. This iterative process enables the model to learn from its errors, progressively enhancing its proficiency and adaptability.
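The sketch below shows an intentionally simplified version of one such update: the model samples a response, the reward model scores it, and a REINFORCE-style loss nudges the model toward higher-scoring outputs. Real systems typically use PPO with a KL penalty to stay close to the SFT model; names like rlhf_step and policy are my own placeholders.

```python
# Highly simplified RLHF-style update. Production pipelines usually use PPO
# plus a KL penalty to the SFT model; this REINFORCE-style step is only a
# sketch of the idea: reinforce samples the reward model scores highly.
import torch

def rlhf_step(policy, reward_model, optimizer, prompt_ids, max_new_tokens=16):
    tokens = prompt_ids.unsqueeze(0)              # (1, prompt_len)
    log_probs = []
    for _ in range(max_new_tokens):               # sample a response token by token
        logits = policy(tokens)[:, -1, :]
        dist = torch.distributions.Categorical(logits=logits)
        next_token = dist.sample()
        log_probs.append(dist.log_prob(next_token))
        tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)

    reward = reward_model(tokens).detach().squeeze()   # scalar score
    loss = -(torch.stack(log_probs).sum() * reward)    # favor high-reward samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```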

A recurring theme in my discussions has been the dual nature of RLHF: its advantages and its potential downsides. But one question that has always intrigued me is why RLHF works so effectively. According to Andrej Karpathy, responses from RLHF are preferred by humans over those from SFT and pre-training. One of the primary reasons for this, he suggests, is the computational ease of comparison versus generation.

In a fascinating parallel to the P vs NP problem, a fundamental question in computer science about the relative difficulty of finding solutions versus verifying them, it is easier to evaluate a solution (comparison) than to find one (generation). In the context of RLHF, it is computationally simpler to compare and rank different model outputs (given a situation and several possible responses) than to generate an appropriate response from scratch. This efficient optimization strategy likely contributes to the strong performance of, and human preference for, RLHF responses.

The Final Act: Making Sense of it All

This concludes my commentary on Andrej Karpathy’s talk (I may write bonus takeaways). I hope I’ve been able to distill the complex ideas presented into simpler language. As a wise person once said, “If you can’t explain something simply, you may not fully understand it yourself.” I’m not claiming complete mastery over the subject of generative AI; I am, like you, a student, ever curious and constantly learning. Together with my team, we’re at the cutting edge of this exciting field. Visit Roost AI to see how we’re harnessing the power of generative AI to automate end-to-end test case generation and execution.

Originally published at https://www.linkedin.com.


This blog is mostly about my passion for generative AI and ChatGPT. I will also cover features of our ChatGPT-driven end-to-end testing platform: https://roost.ai