Note: I had stopped writing posts in 2017. Slowly getting back into it in 2024, mostly for AI.

Language Models and GPT’s evolution

Jan 15, 2024 | LLM

As explained in this Harvard CS50 tech talk, Language Models (LMs) are basically a probability distribution over some vocabulary. For every sequence of words we give an LM, it can determine the most probable word to come next. It's trained to predict the Nth word, given the previous N-1 words. If that sounds like a simple probability calculation, consider that predicting the next word requires keeping every word that came before it in context. ChatGPT does this over a vocabulary of roughly 50,000 tokens. Initially I was surprised to hear that. 50K sounded small (intuitively). Then I learned that the average vocabulary size of an adult English speaker is between 20,000 and 35,000 words.
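
To make that concrete, here is a minimal sketch (mine, not from the talk) using the Hugging Face transformers library: given a prefix, GPT-2 assigns a probability to every one of the 50,257 tokens in its vocabulary, and next-word prediction is just picking from that distribution.

```python
# An LM is a probability distribution over its vocabulary: for a given
# prefix, GPT-2 scores every one of its 50,257 tokens as the next token.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, seq_len, vocab_size)

# Softmax over the last position gives the distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
print("vocab size:", next_token_probs.shape[0])  # 50257

# The five most probable next tokens.
top = torch.topk(next_token_probs, 5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {prob.item():.3f}")
```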

Stanford’s CS224N 2023 Lecture 10 gave a nice overview of the evolution of OpenAI’s work on GPT. The following was noteworthy:

  • GPT: Released 2018. Was the first iteration. 117M parameters. It was trained on a corpus of 7,000 unique books (4.68 GB of text). Published paper titled Improving Language Understanding by Generative Pre-Training. BTW, I was at Google back then and completely unaware of this development. I was marveling at the then-recently published paper by Googlers about using deep learning techniques on raw EHR data. Those researchers had removed the need to specify predictor variables, instead letting neural networks learn representations of the key factors and interactions from the data itself. That paper never mentioned LLMs.
  • GPT-2: Released 2019. Same architecture as GPT, but an order of magnitude bigger at 1.5B parameters, and trained on much more data: 40 GB of internet text. Most famously, they scraped links posted on Reddit – anything with 3+ upvotes (as a proxy for human-judged quality). Published paper titled Language Models are Unsupervised Multitask Learners. This was the version that started showing promise at doing things the model wasn’t trained to do (!): zero-shot learning.
  • GPT-3: Released 2020. Two orders of magnitude bigger, with 175B parameters. Trained on >600 GB of data. Published paper titled Language Models are Few-Shot Learners. At that kind of scale, this model showed the emergence of the ‘few-shot learning’ property – i.e., given examples of the task in the prompt, it made better predictions (see the first sketch after this list). So the LLM was frozen, but that ‘in-context learning’ from the prompt allowed GPT-3 to do some on-the-fly optimization and/or reasoning. That’s different from the traditional approach where we give examples to the model and use them for gradient updates to ‘fine-tune’ the LLM. It’s mind-bending to think that this behavior of the LLM ‘emerged’ on its own. Why it emerged only at this scale is an area of active research.
  • GPT-3.5: GPT-3 didn’t have the apparatus to interact in a Q&A way. GPT-3.5 did, and was released as a series of models in 2022. Notable among them were:
    • ChatGPT, which was fine-tuned from a model in the GPT-3.5 series and trained on data up to June 2021.
    • InstructGPT models (now the default), which are fine-tuned versions of GPT-3 trained on a dataset of human-written instructions. They were built using a technique called reinforcement learning from human feedback (RLHF). Basically, prompts submitted by customers were used by OpenAI’s labelers to provide demonstrations of the desired model behavior and to rank several outputs from the models. That data (which surfaced more ‘alignment’ with the user’s intent, rather than just word prediction) was then used to fine-tune GPT-3. The result was the InstructGPT models (see the second sketch after this list). All this is wonderfully explained in this research post by OpenAI.
  • GPT-4: Released March 2023. Rumored to have 1.7T parameters. Trained on both public data and data licensed from third-party providers. GPT-4 can interact with external interfaces to accomplish tasks like making bookings, creating calendar appointments, etc.
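
Here is the first sketch: a toy contrast between zero-shot and few-shot ("in-context") prompting, using the Hugging Face transformers library (my choice of tooling, not from the lecture). The model's weights never change; the task examples live entirely in the prompt. GPT-2 small is far too weak to actually translate; the point is only the prompt format, which mirrors the translation example in the GPT-3 paper.

```python
# Zero-shot vs. few-shot ("in-context") prompting. No gradient updates:
# the task examples are just text prepended to the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small stand-in model

zero_shot = "Translate English to French: cheese =>"

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

for prompt in (zero_shot, few_shot):
    out = generator(prompt, max_new_tokens=8, do_sample=False)
    print(repr(out[0]["generated_text"]))
```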
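
And the second sketch: the heart of the reward-modeling step in RLHF (per the InstructGPT paper) is a pairwise ranking loss on human-ranked outputs. The tiny linear "reward model" and random features below are stand-ins for illustration only; the real reward model is a fine-tuned LM scoring (prompt, response) pairs.

```python
# Toy sketch of reward-model training in RLHF: a pairwise ranking loss
# on outputs that human labelers ranked. A linear layer over dummy
# features stands in for the real fine-tuned-LM reward model.
import torch
import torch.nn as nn

reward_model = nn.Linear(16, 1)  # stand-in for a fine-tuned LM head

# Dummy features for pairs of model outputs, where a labeler preferred
# 'chosen' over 'rejected'.
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)

r_chosen = reward_model(chosen)      # rewards for the preferred outputs
r_rejected = reward_model(rejected)  # rewards for the dispreferred outputs

# Push rewards of preferred outputs above rejected ones. The trained
# reward model then supplies the RL signal (via PPO) that fine-tunes
# the policy LM.
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(f"ranking loss: {loss.item():.3f}")
```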

PS: In the Lex Fridman interview, at around the 59-minute mark, Perplexity CEO Aravind Srinivas explains a cake metaphor that is worth noting. He says self-supervised pretraining (to predict the next token) is the bulk of the cake, supervised fine-tuning is the icing, and RLHF is just the cherry on top that gives it conversational capabilities.