This doc is meant for a technical audience that is new to AI/ML, has their social feeds filled with GenAI news, and wants to wrap their head around it by grounding their understanding in foundations rather than hype.
Hence, this doc:
- Has less news, and more timeless concepts
- Aims to give you a lay of the land within 10-15 mins
- Links to help you go deeper—curated based on what pushed my own understanding forward
- Does not give you a practitioner’s intuition (nothing like spending hours on ChatGPT), but complements it
[This doc has been shared externally; feedback is welcome!]
The model
- What is “generative” about this?
- Older ML models were “discriminative”: they could classify data points and choose the best ones to meet criteria (e.g. recommendation engines). Generative models, by contrast, can create new data that resembles their training dataset
- Discriminative example: “Is this a painting by Van Gogh?”, and the model responds with a likelihood number (say, 0.8)
- Generative example: “Create a painting as if it were made by Van Gogh”, and the model responds with an image (a toy code contrast between the two appears at the end of this section)
- What has changed in the last few years: Four things happened together
- Generative models have become viable across modalities (generating text, voice, and images)
- A specific architecture for generative text modeling (the transformer) has scaled to trillions of tokens (= unit of data, roughly equal to one word)
- Where is this data from and how much is it?
- Architectures will come and go: even the transformer is not static (keywords: encoder only, encoder-decoder, decoder-only)
- The goal of an architecture is to make full use of the compute available today
- As compute scales, the architecture will evolve. If you zoom in, progress is made by architectures; if you zoom out, progress is made only by compute (see the Bitter Lesson)
- There are many off-the-shelf models (exposed as HTTP endpoints) that generalize to multiple tasks, and therefore one deployment can enable multiple applications
- ChatGPT demoed these capabilities in a simple UX, thereby attracting builders and investors into the space
- Understanding the model
- Key learning: “Predicting the next word” is a good way to learn world knowledge
- What is knowing how to cook? Predicting the next word of a recipe
- What is knowing how to code? Predicting the next word in a line of code
- As models have grown larger, more world knowledge is captured by next-word prediction (a minimal decoding sketch appears at the end of this section)
- Next-word prediction does not give you “instruction following”; that requires further training
- Training for following instructions made it possible to build ChatGPT: given a question, the model follows the instruction and answers the question
- See Karpathy’s talk to understand the GPT training pipeline
- Next word prediction also gives the model the “ability to reason”
- Reasoning is the foundation for decision-making abilities, which enable “autonomy” (IMHO, too early to tell if this will happen)
- Key learning: this is a by-product of training on code
- The theoretical mental model for autonomy: the General LLM Company
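To make the discriminative/generative contrast above concrete, here is a toy sketch in scikit-learn: a logistic regression scores how likely a data point is to belong to a class (discriminative), while a fitted Gaussian mixture samples new data points that resemble its training set (generative). The 2-D points standing in for “paintings”, the cluster locations, and the specific estimators are illustrative choices, not part of any real system.

```python
# A toy contrast between the two model types, using scikit-learn on 2-D points
# (stand-ins for "paintings"): the discriminative model scores data, the
# generative model produces new data resembling its training set.
# Requires: pip install numpy scikit-learn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
van_gogh = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))   # "real Van Goghs"
other    = rng.normal(loc=[4, 4], scale=1.0, size=(200, 2))   # "everything else"

# Discriminative: "Is this a Van Gogh?" -> a likelihood number
clf = LogisticRegression().fit(np.vstack([van_gogh, other]),
                               np.r_[np.ones(200), np.zeros(200)])
print(clf.predict_proba([[0.5, -0.2]])[0, 1])   # e.g. a number close to 1.0

# Generative: "Make me something like a Van Gogh" -> new data points
gen = GaussianMixture(n_components=1).fit(van_gogh)
new_points, _ = gen.sample(3)
print(new_points)                                # three new "paintings"
```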
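And here is a minimal sketch of “predicting the next word” itself: a greedy decoding loop that repeatedly asks a language model for its most likely next token. GPT-2 via the Hugging Face transformers library is used only because it is small enough to run locally; the prompt and the number of generated tokens are arbitrary, and modern LLMs sample from the distribution rather than always taking the top token.

```python
# A minimal sketch of next-token prediction: greedy decoding with GPT-2.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To make an omelette, first crack the"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # text -> tokens

with torch.no_grad():
    for _ in range(10):                            # generate 10 more tokens
        logits = model(input_ids).logits           # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()           # "the most likely next token"
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))              # prompt + continuation
```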
The stack
Analogous to the OSI model for networking. These boundaries are still emerging: companies are crossing over them as business models and customer needs evolve (e.g. OpenAI is in the application layer, base model layer, inference layer, and arguably also in the fine-tuning and app framework layers)
| Layer name | Description | Examples |
| --- | --- | --- |
| Application layer | App or feature built on top of an LLM. [Key question] Who stands to gain from these features: incumbents (e.g. GitHub) or new startups? (Related: do LLMs create a new business model? Sell the work) | ChatGPT, GitHub Copilot, Jasper |
| App frameworks layer | Libraries or frameworks that help build applications for a particular workload[1] (e.g. RAG). [Optional layer] Not all apps use frameworks, since they can straitjacket experimentation (learn more) | LangChain, LlamaIndex |
| Middleware layer | Observability, testing, model failover or routing. Can retrofit existing tools (e.g. an APM tool) or use LLM-specific tooling. [Optional layer] Becomes relevant after some level of product maturity | LangSmith (from LangChain), Portkey, Martian |
| Fine-tuning layer | There are many ways to fine-tune models; techniques like LoRA sit on top of base models (learn more). [Optional layer] Just use base models to launch the end product faster | Frameworks like Axolotl, companies like Predibase |
| Base model layer | The actual LLM. Can come from a large closed player (e.g. OpenAI), a large open player (e.g. Llama 3), or the long tail of specialized model builders (e.g. Defog for text-to-SQL). Hugging Face is the registry for all open models (from larger players or the long tail). Model builders are partnering with inference layers/cloud providers for enterprise distribution | GPT-x from OpenAI, Claude 3 from Anthropic, Gemini from Google |
| Inference infra layer | Running the model in production to generate text etc. is called inference. For large proprietary models, the model builder is also the inference provider (e.g. OpenAI does both training and inference). Pure inference providers can run open-source or custom models. The big 3 cloud providers also have inference product lines (e.g. AWS SageMaker). You can also build your own with something like vLLM | Pure inference players: Fireworks, Together, Modal |
| Hardware layer | NVIDIA/GPUs take up mindshare, but most applications use them via a cloud computing provider (the big 3 clouds) | — |

[1] The workload might require other components that are not in this table. For example, RAG might require a vector database for retrieval (a minimal retrieval sketch appears below).
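The app frameworks row above mentions RAG. The sketch below shows the retrieval half of that workload, which frameworks like LangChain or LlamaIndex package up alongside prompt templating and LLM calls. TF-IDF vectors stand in for a real embedding model, a tiny in-memory list stands in for a vector database, and the documents, question, and prompt template are made up for illustration.

```python
# A minimal sketch of the retrieval step in RAG: embed documents, find the one
# closest to the question, and stuff it into the prompt sent to the LLM.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email from 9am to 5pm on weekdays.",
    "Shipping is free for orders above $50.",
]
question = "How long do I have to return an item?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)             # the "vector database"
query_vector = vectorizer.transform([question])

best = cosine_similarity(query_vector, doc_vectors).argmax()  # retrieval step
prompt = f"Answer using this context:\n{documents[best]}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt would then be sent to the base model
```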
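For the fine-tuning layer, here is a minimal sketch of attaching LoRA adapters to a base model with the peft library. The base model, rank, and target modules are illustrative choices, and a real fine-tune would also need a dataset and a training loop; the point of the sketch is only to show why LoRA is cheap: just the small adapter matrices are trained on top of the frozen base model.

```python
# A minimal sketch of LoRA fine-tuning setup with peft (training loop omitted).
# Requires: pip install transformers peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")            # illustrative base model
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["c_attn"],                 # GPT-2's attention projection
                    lora_dropout=0.05)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```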
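Finally, the base model and inference layers are what make the earlier point about off-the-shelf models concrete: most of them are consumed as plain HTTP endpoints, so one deployment can serve many applications. Below is a minimal sketch of such a call against OpenAI's chat completions API; the model name, prompt, and timeout are placeholder choices, and other providers expose similar (often API-compatible) endpoints.

```python
# A minimal sketch of using a base model "off the shelf" as an HTTP endpoint.
# Requires: pip install requests, and an OPENAI_API_KEY environment variable.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # placeholder: any chat-capable model id works here
        "messages": [
            {"role": "user", "content": "Explain RAG in one sentence."},
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```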
The future
Where is this headed?