This doc is meant for a technical audience that is new to AI/ML, has their social feeds filled with GenAI news, and wants to wrap their head around it by grounding their understanding in foundations rather than hype.
Hence, this doc:
- Has less news, and more timeless concepts
- Aims to give you a lay of the land within 10-15 mins
- Links to help you go deeper—curated based on what pushed my own understanding forward
- Does not give you a practitioner’s intuition (nothing like spending hours on ChatGPT), but complements it
[This doc has been shared externally; feedback is welcome!]
The model
- What is “generative” about this?
- Older ML models were “discriminative”: they could classify data points and choose the best ones to meet criteria (e.g. recommendation engines). Generative models, by contrast, can create new data that resembles their training dataset
- Discriminative example: “Is this a painting by Van Gogh?”, and the model responds with a likelihood number (say, 0.8)
- Generative example: “Create a painting as if it were made by Van Gogh”, and the model responds with an image (a toy code contrast between the two appears at the end of this section)
- What has changed in the last few years: Four things happened together
- Generative models have become viable across modalities (generating text, voice, and images)
- A specific architecture for generative text modeling (the transformer) has scaled to trillions of tokens (= unit of data, roughly equal to one word)
- Where is this data from and how much is it?
- Architectures will come and go: even the transformer is not static (keywords: encoder only, encoder-decoder, decoder-only)
- The goal of an architecture is to make full use of the compute available today
- As compute scales, the architecture will evolve. If you zoom in, progress is made by architectures; if you zoom out, progress is made only by compute (see the Bitter Lesson)
- There are many off-the-shelf models (exposed as HTTP endpoints) that generalize to multiple tasks, and therefore one deployment can enable multiple applications
- ChatGPT demoed these capabilities in a simple UX, thereby attracting builders and investors into the space
- Understanding the model
- Key learning: “Predicting the next word” is a good way to learn world knowledge
- What is knowing how to cook? Predicting the next word of a recipe
- What is knowing how to code? Predicting the next word in a line of code
- As models have grown larger, more world knowledge is captured by next-word prediction (a minimal decoding sketch appears at the end of this section)
- Next-word prediction does not give you “instruction following”; that requires further training
- Training for following instructions made it possible to build ChatGPT: given a question, the model follows the instruction and answers the question
- See Karpathy’s talk to understand the GPT training pipeline
- Next word prediction also gives the model the “ability to reason”
- Reasoning is the foundation for decision-making abilities, which enable “autonomy” (IMHO, too early to tell if this will happen)
- Key learning: this is a by-product of training on code
- The theoretical mental model for autonomy: the General LLM Company
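To make the discriminative/generative contrast above concrete, here is a toy sketch in scikit-learn: a logistic regression scores how likely a data point is to belong to a class (discriminative), while a fitted Gaussian mixture samples new data points that resemble its training set (generative). The 2-D points standing in for “paintings”, the cluster locations, and the specific estimators are illustrative choices, not part of any real system.

```python
# A toy contrast between the two model types, using scikit-learn on 2-D points
# (stand-ins for "paintings"): the discriminative model scores data, the
# generative model produces new data resembling its training set.
# Requires: pip install numpy scikit-learn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
van_gogh = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))   # "real Van Goghs"
other    = rng.normal(loc=[4, 4], scale=1.0, size=(200, 2))   # "everything else"

# Discriminative: "Is this a Van Gogh?" -> a likelihood number
clf = LogisticRegression().fit(np.vstack([van_gogh, other]),
                               np.r_[np.ones(200), np.zeros(200)])
print(clf.predict_proba([[0.5, -0.2]])[0, 1])   # e.g. a number close to 1.0

# Generative: "Make me something like a Van Gogh" -> new data points
gen = GaussianMixture(n_components=1).fit(van_gogh)
new_points, _ = gen.sample(3)
print(new_points)                                # three new "paintings"
```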
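And here is a minimal sketch of “predicting the next word” itself: a greedy decoding loop that repeatedly asks a language model for its most likely next token. GPT-2 via the Hugging Face transformers library is used only because it is small enough to run locally; the prompt and the number of generated tokens are arbitrary, and modern LLMs sample from the distribution rather than always taking the top token.

```python
# A minimal sketch of next-token prediction: greedy decoding with GPT-2.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To make an omelette, first crack the"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # text -> tokens

with torch.no_grad():
    for _ in range(10):                            # generate 10 more tokens
        logits = model(input_ids).logits           # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()           # "the most likely next token"
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))              # prompt + continuation
```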
The stack
Analogous to the OSI model for networking. These boundaries are still emerging: companies are crossing over them as business models and customer needs evolve (e.g. OpenAI is in the application layer, base model layer, inference layer, and arguably also in the fine-tuning and app framework layers)
| Layer name | Description | Examples |
| --- | --- | --- |
| Application layer | App or feature built on top of an LLM. [Key question] Who stands to gain from these features: incumbents (e.g. GitHub) or new startups? (Related: do LLMs create a new business model? Sell the work) | ChatGPT, GitHub Copilot, Jasper |
| App frameworks layer | Libraries or frameworks that help build applications for a particular workload[1] (e.g. RAG). [Optional layer] Not all apps use frameworks, since they can straitjacket experimentation (learn more) | LangChain, LlamaIndex |
| Middleware layer | Observability, testing, model failover or routing. Can retrofit existing tools (e.g. an APM tool) or use LLM-specific tooling. [Optional layer] Becomes relevant after some level of product maturity | LangSmith (from LangChain), Portkey, Martian |
| Fine-tuning layer | There are many ways to fine-tune models; techniques like LoRA sit on top of base models (learn more). [Optional layer] Just use base models to launch the end product faster | Frameworks like Axolotl, companies like Predibase |
| Base model layer | The actual LLM. Can come from a large closed player (e.g. OpenAI), a large open player (e.g. Llama 3), or the long tail of specialized model builders (e.g. Defog for text-to-SQL). Hugging Face is the registry for all open models (from larger players or the long tail). Model builders are partnering with inference layers/cloud providers for enterprise distribution | GPT-x from OpenAI, Claude 3 from Anthropic, Gemini from Google |
| Inference infra layer | Running the model in production to generate text etc. is called inference. For large proprietary models, the model builder is also the inference provider (e.g. OpenAI does both training and inference). Pure inference providers can run open-source or custom models. The big 3 cloud providers also have inference product lines (e.g. AWS SageMaker). You can also build your own with something like vLLM | Pure inference players: Fireworks, Together, Modal |
| Hardware layer | NVIDIA/GPUs take up mindshare, but most applications use them via a cloud computing provider (the big 3 clouds) | — |

[1] The workload might require other components that are not in this table. For example, RAG might require a vector database for retrieval (a minimal retrieval sketch appears below).
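The app frameworks row above mentions RAG. The sketch below shows the retrieval half of that workload, which frameworks like LangChain or LlamaIndex package up alongside prompt templating and LLM calls. TF-IDF vectors stand in for a real embedding model, a tiny in-memory list stands in for a vector database, and the documents, question, and prompt template are made up for illustration.

```python
# A minimal sketch of the retrieval step in RAG: embed documents, find the one
# closest to the question, and stuff it into the prompt sent to the LLM.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email from 9am to 5pm on weekdays.",
    "Shipping is free for orders above $50.",
]
question = "How long do I have to return an item?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)             # the "vector database"
query_vector = vectorizer.transform([question])

best = cosine_similarity(query_vector, doc_vectors).argmax()  # retrieval step
prompt = f"Answer using this context:\n{documents[best]}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt would then be sent to the base model
```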
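For the fine-tuning layer, here is a minimal sketch of attaching LoRA adapters to a base model with the peft library. The base model, rank, and target modules are illustrative choices, and a real fine-tune would also need a dataset and a training loop; the point of the sketch is only to show why LoRA is cheap: just the small adapter matrices are trained on top of the frozen base model.

```python
# A minimal sketch of LoRA fine-tuning setup with peft (training loop omitted).
# Requires: pip install transformers peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")            # illustrative base model
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["c_attn"],                 # GPT-2's attention projection
                    lora_dropout=0.05)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```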
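Finally, the base model and inference layers are what make the earlier point about off-the-shelf models concrete: most of them are consumed as plain HTTP endpoints, so one deployment can serve many applications. Below is a minimal sketch of such a call against OpenAI's chat completions API; the model name, prompt, and timeout are placeholder choices, and other providers expose similar (often API-compatible) endpoints.

```python
# A minimal sketch of using a base model "off the shelf" as an HTTP endpoint.
# Requires: pip install requests, and an OPENAI_API_KEY environment variable.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # placeholder: any chat-capable model id works here
        "messages": [
            {"role": "user", "content": "Explain RAG in one sentence."},
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```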
The future
Where is this headed?