AI from first principles.
A practitioner's overview for capital allocators who want to understand artificial intelligence at the atomic level.
10 Principles
We built this briefing because we're tired of watching smart people get bad AI advice.
Here's what we mean.
Every board meeting, every conference keynote, every consultant deck gives you one of two stories. AI is saving the world. Or AI is ending it.
We think the reality, like most things, lives in the shades of gray between those two narratives.
We've spent the last two years going deep on this.
Not reading tweets or watching videos.
Reading the actual papers. Studying the researchers who built these systems. Talking to people building with them every day. Helping organizations join the minority (20% or less) that actually gets value from AI, versus the more than 80% of organizations that see no ROI, as measured by MIT, McKinsey, and others.
We've spent thousands and thousands of hours learning how to write code, how to design, how to direct attention and ideas. And now AI can do parts of that for us better than we ever could.
We're not theorizing from a lectern. We're building with this technology every single day.
The Bitter Lesson:
Scale Beats Cleverness
Rich Sutton is one of the founding figures of reinforcement learning. In 2019, he wrote a short essay called "The Bitter Lesson." It's maybe the most important thing written about AI in the last decade.
His argument: Across 70 years of AI research, one pattern repeats. Researchers try to build human knowledge into systems. Hand-code the rules. Program the expertise. It works in the short term. It always plateaus. Then someone comes along with a simpler system that just uses more data and more processing power. And the simpler system wins.
Chess? Experts built elaborate knowledge systems encoding grandmaster strategy. Then Deep Blue won with brute-force search. Go? Same story. AlphaGo won with neural networks and Monte Carlo tree search. Not Go expertise. Language? Linguists spent decades building grammar rules into software. GPT models crushed them by scaling data and compute.
Two methods scale: search and learning. Everything else hits a ceiling.
Scale beats cleverness. Every time. That's the bitter lesson.
Why is it "bitter"? Because it means human expertise in AI gets superseded by raw scale. Researchers pour years into clever approaches. A bigger model with more compute beats them. For anyone who prides themselves on intellectual horsepower, this is a hard pill.
We'd encourage you to read Sutton's essay. It's short. It's free. And it will change how you think about every AI investment you evaluate.
The Three Pillars of Progress
Compute. Data. Model size.
Modern AI progress comes from the convergence of three variables. When you scale all three simultaneously, capability increases in a predictable, mathematical way. This predictability is what makes AI different from most technology bets.
Compute:
This is the total volume of mathematical operations performed by specialized hardware during training. Think of it as how much capital you're deploying against the problem.
Data:
The vast repository of human knowledge used as training material. Data is the underlying asset class. The quality and quantity of your data sets the ceiling of what's possible.
Model Size:
The number of internal parameters that store the patterns learned from data. More parameters means more capacity to learn patterns. But just like fund capacity, there are diminishing returns.
If you think of these as a three-factor model, the insight is that each factor contributes independently. Underweighting any one creates a bottleneck. Right now, the data factor is becoming the constraint. The companies solving the data problem are the ones pulling ahead.
From Neurons to Nodes
How the machine learns.
To understand how AI "thinks," you need to understand how it learns. The analogy to the human brain is everywhere, but the reality is more mathematical than biological.
| Feature | Biological System | Artificial System (ANN) |
| --- | --- | --- |
| Interconnects | Synapses (~100 trillion) | Parameters (Billions to Trillions) |
| Signal Type | Electrochemical | Mathematical Tensors/Numbers |
| Learning Mode | Synaptic Plasticity | Backpropagation & Gradient Descent |
| Efficiency | ~20 Watts | Megawatts (Data Center Scale) |
A biological neuron receives signals, processes them, and if the total signal exceeds a threshold, fires an impulse to other neurons.
An artificial neuron does the same thing but with numbers. It receives numerical inputs, each assigned a weight representing its importance. These weighted inputs are summed together and passed through a function. If the value is high enough, the node passes a signal to the next layer.
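To see how little machinery is involved, here is a toy artificial neuron in Python. The numbers are made up for illustration; they don't come from any real model:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a nonlinearity."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation: squashes the sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-total))

# A strong signal on a heavily weighted input pushes the output toward 1.
print(round(neuron([1.0, 0.5], [2.0, -1.0], 0.0), 3))  # → 0.818
```

That's the whole unit. Everything else in a neural network is billions of these, wired in layers.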
When a model is first created, its weights are random. Its outputs are gibberish. Training is the process of showing it an input, measuring how far its output is from the correct answer, and then working backward through the network to adjust every weight slightly so the error is smaller next time.
The process is called gradient descent. By repeating this trillions of times, the network "converges" on a configuration that can identify a face, translate a language, or solve a complex coding problem.
The key insight: the machine isn't programmed with rules. It learns patterns by adjusting billions of tiny knobs, over and over, until the output matches reality.
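Here is gradient descent in miniature: one weight, one training example, the same nudge-the-knob loop repeated until the error disappears. A toy sketch, not a real training run:

```python
# Tiny gradient descent: learn w so that w * x matches y for one example.
x, y = 3.0, 6.0   # "training data": the right answer is w = 2
w = 0.0           # start from a blank weight; outputs begin as nonsense
lr = 0.01         # learning rate: how big each knob-adjustment is

for _ in range(1000):
    pred = w * x           # forward pass: make a guess
    error = pred - y       # measure how far off we are
    grad = 2 * error * x   # gradient of squared error with respect to w
    w -= lr * grad         # nudge the weight downhill

print(round(w, 3))  # → 2.0
```

Scale this from one knob to hundreds of billions, and from one example to trillions of tokens, and you have modern training.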
Tokens, Embeddings, and the Geometry of Meaning.
Here's where it gets interesting for someone who thinks in terms of markets.
Computers don't see the word "Investment" as a concept. They see it as a point in a vast, multi-dimensional coordinate system. To understand AI's ability to process language, you need to move past "words" and think about "vectors."
Tokens:
The Atom of Language. Before AI can process text, it breaks everything down into tokens. A token is typically a word or a fragment of a word. "Innovator" becomes "innov" and "ator." Tokenization lets the model handle a finite vocabulary while representing an infinite variety of word combinations. Think of tokens as the smallest tradeable unit. Like how a bond gets broken into principal and coupon, language gets broken into processable atoms.
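A toy sketch of the mechanics in Python. Real tokenizers (like the BPE variants behind GPT models) use learned merge rules; this greedy longest-match version over a small hand-built vocabulary just shows the idea:

```python
# Greedy longest-match subword tokenizer: a toy stand-in for BPE.
# The vocabulary here is hand-built for the example, not learned.
VOCAB = {"innov", "ator", "invest", "ment", "in", "a", "t", "o", "r", "e", "s", "n", "v", "m"}

def tokenize(word):
    word = word.lower()
    tokens = []
    while word:
        # Take the longest vocabulary entry that prefixes the remaining text.
        for end in range(len(word), 0, -1):
            if word[:end] in VOCAB:
                tokens.append(word[:end])
                word = word[end:]
                break
        else:
            raise ValueError("unknown character")
    return tokens

print(tokenize("Innovator"))   # → ['innov', 'ator']
print(tokenize("Investment"))  # → ['invest', 'ment']
```

A finite vocabulary of fragments can cover any word it has never seen, which is the whole point.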
Embeddings
Once tokenized, each token gets converted into an embedding. That's a vector of hundreds or thousands of numbers.
Words with similar meanings end up mathematically close to each other.
The part that blew our minds: you can do arithmetic with meaning. Take the vector for "Japan," subtract the vector for "Tokyo," add the vector for "Paris," and you get a vector very close to "France."
For a finance mind: embeddings are like multi-factor risk models that map every word to a point in factor space. The geometry is the understanding.
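You can verify the arithmetic-with-meaning trick on hand-built toy vectors. Real embeddings are learned from data and have hundreds of dimensions; these three-dimensional ones are rigged so the analogy holds by construction:

```python
# Toy, hand-built 3-D "embeddings" chosen so the analogy holds by construction.
E = {
    "tokyo":  [1.0, 0.0, 0.0],
    "japan":  [1.0, 0.0, 1.0],   # city vector plus a "country-ness" direction
    "paris":  [0.0, 1.0, 0.0],
    "france": [0.0, 1.0, 1.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def nearest(v):
    # Euclidean nearest neighbor in the toy vocabulary.
    return min(E, key=lambda w: sum((x - y) ** 2 for x, y in zip(E[w], v)))

# "japan" - "tokyo" + "paris" lands on "france".
result = add(sub(E["japan"], E["tokyo"]), E["paris"])
print(nearest(result))  # → france
```

In a real model nobody rigs the geometry; it emerges from training, which is what makes the result remarkable.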
The Transformer:
Attention Is the Breakthrough
The 2017 invention of the Transformer architecture by a team at Google is the single breakthrough behind everything you're seeing today. Every major AI model (GPT, Claude, Gemini, Llama) is built on this foundation. Nearly a decade later, it's still the core design.
Before transformers, AI processed language sequentially. It read a sentence from left to right, one word at a time. Like reading a 500-page contract by starting at page one and forgetting page one by the time you reach page 50. These systems were slow to train and bad at understanding long-range connections.
The transformer's core innovation is "self-attention." It processes all words simultaneously and determines which ones are most relevant to each other.
Think of it like a well-run investment committee doing due diligence on an acquisition. They don't read the pitch deck word by word from top to bottom. They process the whole thing at once and ask: which facts are relevant to which other facts? The revenue numbers are relevant to the growth claims. The customer concentration is relevant to the risk assessment. The management bios are relevant to the execution thesis. Self-attention does the same thing. When the model processes "The bank was closed because it was a holiday," the attention mechanism assigns heavy weight between "it" and "bank" to resolve what "it" refers to. It looks at every word simultaneously and determines the relationships.
The "cocktail party" analogy works too. At a loud party, you can filter out background noise to focus on a specific conversation. But if someone across the room says your name, your attention immediately shifts. That's self-attention. Fluid, context-dependent focus.
Transformers use "multi-head attention," which Grant Sanderson (the math educator behind 3Blue1Brown) explains as multiple specialists looking at the same text from different angles. One head focuses on grammar. Another on pronoun references. Another on semantic themes. Like an investment committee where the lawyer, the analyst, and the strategist each read the same deal from their own perspective. The final understanding combines all their views.
Here's the part Sanderson emphasizes that most explanations miss: the breakthrough isn't what attention does. It's that attention is massively parallelizable. GPUs can run thousands of these attention computations simultaneously. In investment terms, the breakthrough was not a better analyst. It was a better way to run many analysts in parallel.
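Stripped of the engineering, self-attention is a few lines of math. Here is a single-head, pure-Python sketch with made-up numbers: each position scores every other position, turns the scores into weights, and blends the values accordingly:

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over a tiny sequence.
    Every position attends to every other position at once —
    no left-to-right scan, which is why it parallelizes so well."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Score this query against every key (similarity, scaled by sqrt(d)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how relevant is each word to this one?
        # Output = relevance-weighted blend of all the values.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three "words" with made-up vectors. The query points strongly at the
# second key, so the output is dominated by the second value [5, 5].
Q = [[1.0, 0.0]]
K = [[0.0, 1.0], [10.0, 0.0], [0.0, -1.0]]
V = [[1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
print(attention(Q, K, V))
```

Each of those score computations is independent of the others, which is exactly what lets GPUs run thousands of them at once.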
We've seen what this looks like in practice. In a workshop with a European creative agency, something that would take a 3D artist a day and a half to build, we did in five seconds. That's the transformer at work. Not because the AI is "smarter." Because it can process everything at once.
The LLM Lifecycle: From raw intelligence to useful tool.
Stage 1: Pre-training
The Private Company Phase. A startup absorbs everything. Market knowledge, customer feedback, competitive intelligence, technical skills. Pre-training does the same thing. You feed the model trillions of tokens from the internet. Wikipedia articles, computer manuals, classic novels, code repositories. During this phase, the model learns to predict the next word in a sequence with incredible accuracy.
Here's the counterintuitive part. By learning to predict the next word, the model inadvertently learns the underlying logic of the world. To predict that "The capital of France is __" should be followed by "Paris," it needs to have internalized the concept of countries and capitals. The learning task is simple. The emergent capability is profound.
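A toy illustration of learning-by-prediction: a next-word predictor built from nothing but counts. It's a crude stand-in for pre-training (a real model conditions on the whole context, not just the previous word), but the principle is the same:

```python
from collections import Counter, defaultdict

# Toy "pre-training corpus": just count which word follows which.
corpus = (
    "the capital of france is paris . "
    "the capital of japan is tokyo . "
    "the capital of france is paris ."
).split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(word):
    # Most frequent continuation seen in training.
    return following[word].most_common(1)[0][0]

print(predict("is"))  # → paris (seen twice, versus tokyo once)
```

Notice the limitation: this model can't tell "the capital of France is" from "the capital of Japan is," because it only sees one word back. The whole premise of large models is conditioning the prediction on everything that came before.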
The result is a "base model." It possesses vast knowledge but has no manners, no focus, and no specialization. Think of a brilliant founder who can talk about anything but can't stay on message for an investor meeting.
Stage 2: Supervised Fine-Tuning
(The IPO Roadshow Prep). Before going public, a company goes through intense preparation. Investment bankers coach the CEO on how to present, what to emphasize, how to handle tough questions. Fine-tuning does the same thing. A curated dataset of high-quality conversations teaches the model the format of being helpful. It learns to follow instructions, stay on topic, and present information clearly. This is where a raw knowledge base becomes a useful assistant.
Stage 3: RLHF
(Post-IPO Market Feedback). After listing, the market teaches the company what investors actually value. Earnings calls, analyst ratings, and price signals shape future behavior. RLHF (Reinforcement Learning from Human Feedback) works the same way. Human evaluators rank multiple model responses. A reward model learns those preferences. The AI is then trained to produce outputs that align with human values, safety guidelines, and quality standards.
Andrej Karpathy (former Tesla AI lead, co-founder of OpenAI) has identified a fourth stage emerging: RLVR (reinforcement learning with verifiable rewards), where the model trains against automatically verifiable problems. Math puzzles, coding challenges, logic tests. The model discovers reasoning strategies nobody explicitly taught it. This is like a company discovering that the discipline of quarterly reporting accidentally made it better at core operations.
Here's the thing. Understanding these stages matters for evaluating AI vendors. A company that has better RLHF (better human feedback, better reward modeling) will produce more reliable outputs than one that just has a bigger base model. The "moat" isn't just size. It's the quality of each stage.
Scaling Laws: Predictable Returns on Intelligence
One of the most profound realizations in modern AI: intelligence is surprisingly predictable.
AI performance follows power laws. You already live in power laws. Venture capital returns follow them (1% of deals generate 50%+ of total returns). Wealth distribution follows them (Pareto's 80/20). City sizes follow them (Zipf's law). And now AI capability follows them too.
The specific insight: model error decreases as a precise mathematical function of compute invested. Plotted on a log-log graph, this appears as a straight line. Researchers at OpenAI and DeepMind discovered that you can predict the performance of a $100 million model using tests on a $1,000 model. This predictability turns AI development from a series of experimental guesses into an engineering roadmap.
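You can see the mechanics with synthetic numbers that follow a power law: fit a straight line on log-log axes from three cheap measurements, then extrapolate several orders of magnitude. The data and exponent below are made up for illustration:

```python
import math

# Hypothetical loss measurements at small compute budgets, generated
# from L = 10 * C^(-0.1) — the shape scaling papers report.
observations = [(1e3, 10 * 1e3 ** -0.1),
                (1e4, 10 * 1e4 ** -0.1),
                (1e5, 10 * 1e5 ** -0.1)]

# On log-log axes a power law is a straight line, so fit it there.
xs = [math.log(c) for c, _ in observations]
ys = [math.log(l) for _, l in observations]
n = len(xs)
slope = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2)
intercept = (sum(ys) - slope * sum(xs)) / n

def predicted_loss(compute):
    return math.exp(intercept + slope * math.log(compute))

# Extrapolate five orders of magnitude beyond anything we "trained":
print(round(predicted_loss(1e10), 3))  # → 1.0
```

Three cheap experiments predicting the outcome of one expensive one: that is the engineering roadmap in a nutshell.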
But there's a frontier debate worth tracking.
Ilya Sutskever (co-founder of OpenAI, now building Safe Superintelligence Inc. at a $32 billion valuation) says the "age of scaling" for pre-training is over. We're running out of new data to feed these models. DeepSeek proved that algorithmic efficiency can substitute for raw compute. Their R1 model reportedly matched frontier performance on a training budget of approximately $6 million and around 2,000 GPUs while their competitors spent billions.
The industry is pivoting to a new scaling axis: inference-time scaling. Traditionally, once a model was trained, its intelligence was fixed. New models like OpenAI's o1 and DeepSeek-R1 can "think longer" before responding. They create a tree of possible approaches, evaluate each one, and pick the best path. This is the transition from System 1 thinking (instant, instinctive) to System 2 thinking (slow, deliberative).
For complex tasks (drafting a merger strategy, debugging critical code, analyzing a 500-page contract), paying for extra thinking time during inference can produce dramatically better results than a quick answer.
In macroeconomic terms, this is the debate between extensive and intensive growth. Do you grow by adding more inputs (more data, more compute)? Or by using existing inputs more efficiently (better algorithms, more thinking time at inference)? The answer is probably both. But the easy wins from extensive growth are maturing. The next phase demands more sophistication.
Agentic AI: From Assistants to Autonomous Workers
Here's where the money is heading.
A chatbot is a single analyst. You ask it a question. It gives you an answer. That's useful but limited. An agent is a portfolio manager with a mandate. You give it an objective ("reduce our exposure to emerging market duration risk") and it plans the steps, executes the trades, monitors results, and adjusts course.
The difference between a chatbot and an agent comes down to four building blocks: a goal it owns, a plan it generates, tools it can invoke to act on the world, and memory that lets it track progress and adjust.
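A minimal illustration of those building blocks in Python. The objective, the tools, and the hard-coded plan here are hypothetical stand-ins, not any real framework:

```python
# Illustrative agent loop: goal in, plan, tool calls, memory.
# Everything here is a toy; a real agent would ask an LLM to plan
# and would call real APIs as tools.

def plan(objective):
    # Stand-in for LLM-driven decomposition of the objective.
    return ["gather_data", "analyze", "draft_report"]

TOOLS = {
    "gather_data":  lambda memory: "raw positions pulled",
    "analyze":      lambda memory: "duration risk flagged",
    "draft_report": lambda memory: f"report based on: {memory[-1]}",
}

def run_agent(objective):
    memory = []                       # working record of what happened
    for step in plan(objective):      # plan ...
        result = TOOLS[step](memory)  # ... act via tools ...
        memory.append(result)         # ... remember ...
    return memory                     # ... and a human reviews the trail

print(run_agent("reduce EM duration risk"))
```

The loop is the point: plan, act, remember, repeat, with the full trail available for a human supervisor at the end.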
We've seen this in practice. We built an AI agent that took a project worth about $30,000 in traditional consulting and completed it in three hours. Site audit, redirect mapping, jobs-to-be-done analysis, high-fidelity wireframes. For an enterprise consumer goods company. That's not task acceleration. That's a different category of work.
Most companies are stuck at what we call Level 1 AI maturity. AI does individual tasks faster. Same workflow, slightly accelerated. "We use ChatGPT for first drafts." Level 2 is workflow automation. AI handles entire multi-step processes. Maybe 8-10% of companies are here. Level 3 is capability creation. Workflows redesigned around what AI makes possible. Doing things that weren't possible before. Maybe 1-2% of companies are here.
The gap between Level 1 and Level 3 is where the asymmetric returns live.
A team member on our staff became 10x more valuable after we built AI tools around her role. She wasn't replaced. She was transformed. Meeting recaps that used to take an hour now happen automatically. That saves her about 10 hours a week. And the time she saved goes to higher-value thinking, not more busywork.
McKinsey and Deloitte both highlight the critical insight for executives: agentic AI does not reduce the need for management. It raises the bar. You need "agent supervisors" who monitor AI workflows at designed decision points. This is delegation, not abdication. The analogy is a CIO who delegates portfolio management to sub-advisors but retains control over asset allocation, risk limits, and rebalancing triggers.
The Controls: Tuning Knobs Every Leader Must Understand
Temperature
The Risk Tolerance Dial. Temperature controls how much randomness the model introduces. Temperature at 0 is like a systematic quant strategy. Deterministic. Consistent. Always takes the highest-probability path. Good for financial analysis, legal review, anything where accuracy matters more than creativity. Temperature at 1.0 is like a macro trader. Creative. Willing to explore low-probability paths. Sometimes brilliant. Sometimes wrong. Good for brainstorming, marketing copy, creative work. Same model, different risk parameters.
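Under the hood, temperature is just a divisor applied to the model's raw preferences before they become probabilities. A sketch with made-up logits (real systems typically special-case temperature 0 as a pure argmax to avoid dividing by zero):

```python
import math

def sample_distribution(logits, temperature):
    """Softmax with temperature: low T sharpens toward the top choice,
    high T flattens toward exploration."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # model's raw preference for three next words
print([round(p, 2) for p in sample_distribution(logits, 0.1)])  # near-deterministic
print([round(p, 2) for p in sample_distribution(logits, 1.0)])  # real odds on alternatives
```

Same logits, different risk parameter: at 0.1 the top word gets essentially all the probability; at 1.0 the alternatives stay live.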
Context Window
Depth of Analysis. This is how much information the model can hold in working memory at once. Gemini 3 Pro handles 1 million tokens (roughly 2,500 pages). GPT-5.2 handles 400,000 tokens. This is the difference between an analyst who can only remember the last 50 pages they read and a senior partner who has the entire deal room in their head.
Top-P and Top-K:
Concentration Limits. These parameters limit which word choices the model considers. Top-P of 0.9 means "only consider words in the top 90% probability mass." This is like a portfolio constraint that says "only invest in securities above a minimum credit rating." It prevents the model from making wildly improbable choices that could derail the output.
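A sketch of top-p filtering with made-up probabilities, showing how the cutoff acts as a concentration limit:

```python
def top_p_filter(probs, p=0.9):
    """Keep only the most likely tokens whose cumulative probability
    reaches p, then renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Made-up next-word probabilities for illustration.
probs = {"yield": 0.55, "return": 0.30, "risk": 0.10, "banana": 0.05}
print(top_p_filter(probs, 0.9))  # the improbable 'banana' is excluded
```

The wild 5% outlier never gets a chance to derail the output; the remaining candidates are renormalized to sum to 1.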
Jagged Intelligence:
Uncorrelated Factor Exposure. This is the concept that matters most. AI performs at a PhD level in one domain (medical diagnosis, legal analysis, code generation) and fails at a middle-school level in another (counting letters in a word, basic spatial reasoning, simple arithmetic on large numbers).
Ethan Mollick at Wharton coined the term "jagged frontier" to describe this. The capability boundary isn't smooth. It's wildly uneven. For a finance executive, think of it as a fund with massive alpha in equities that consistently loses money in fixed income. You don't fire the fund. You constrain its mandate to what it does well. And you never, ever assume that excellence in one domain transfers to another.
We don't have AGI yet. These machines can't think for themselves. Don't buy into the hype. AI is really good at what's been done before. It's not good at figuring out what's going to be done next. We don't have superintelligence. We don't have sentient beings. Knowing where the boundaries are is what separates informed deployment from expensive mistakes.
Security: The Risks That Keep Getting Undersold
Every major AI system has a novel vulnerability that doesn't exist in traditional software. AI cannot always distinguish between "data" and "commands." This single architectural fact creates an entirely new attack surface.
The practical takeaway:
AI security is not an IT problem. It's an operational risk problem. Layered defenses. Independent verification. Human checkpoints at high-stakes decisions.
The same principles you'd apply to any system that handles sensitive information or makes consequential recommendations apply here, with higher urgency.
What To Do With This
You just read 5,000 words on how AI works at the atomic level. Here's what to do with it.
1. Define tangible outcomes, not vague "productivity."
Don't ask "how can AI make us more productive?" Ask "which decisions get better if we can process 100x the information in the same time?" We've done this exercise with 40+ clients. The answer is never "write emails faster." It's things like "evaluate every competitive threat in real-time" or "spot patterns in client behavior before they become trends."
2. Start with your most expensive people.
Roughly 20% of what any employee does can be accelerated with AI. But start the pilot with your highest-cost team members. A $500,000 analyst saving 10 hours a week is a different ROI than an entry-level coordinator saving 10 hours a week. The math is obvious when you start at the top.
3. Role-model adoption from the top.
We call it the train-the-trainer model. Show people what's possible. Don't mandate. Don't send a memo. We've seen skeptics convert after a single workshop where we demonstrated AI doing in seconds what used to take days. Jaws on the floor. Demonstration beats argument every time.
4. Build for amplification, not replacement.
There are two modes of AI adoption. Replacement mode: AI does the thinking, humans format the output. Skills atrophy. You create proxy workers who depend on the machine. Amplification mode: humans do the thinking, AI stress-tests and extends ideas. Skills compound. You create augmented thinkers who get stronger over time.
A finance executive who understands compound interest gets this immediately. Replacement is linear. Amplification compounds. Choose accordingly.
5. Maintain your margin of safety.
If the AI system is 95% accurate, build your processes as if it's 85% accurate. Organizations consistently underestimate total AI investment by 40-60%, mostly in data preparation and change management. Budget with a buffer. Never assume AI accuracy is 100%. The margin covers unknown failure modes.
And one more thing. The most important thing.
The one thing we can't lose as we build AI skills: don't forget how to think. Brain rot is real. The more you rely on AI for everything, the more your own judgment atrophies. A newspaper writer let AI write an article and it ended with "What do you want us to write about?" It was all over the internet. Don't be that person.
AI is not coming for your job. But somebody using AI is coming for your job. When you combine breakthrough people with AI, no one can touch that combination.
The technology we just walked you through is a force multiplier. Your people are the force. The first principles are clear: computation as the engine, data as the experience, attention as the breakthrough. The question is what you do with that understanding.
A tech CEO told us last month: "Last year was the year of AI exploration. This year is the year of adoption. If you don't adopt, you die." We completely agree with him.
But adopt intelligently. Adopt with first-principles understanding. Adopt knowing where the real risks are and where the real returns live.
That's what this briefing is for.