The Value of Thought: How Human-AI Collaboration Is Measured Economically

This touches on how large language models (LLMs) operate. Tokenization is the fundamental process in natural language processing (NLP) of breaking down raw text into smaller units called tokens, such as words, subwords, or characters. This is a crucial first step that transforms unstructured text into a structured format that machine learning models can process.

🧠 What is Tokenized?

Tokenization is the process of breaking down raw text into smaller units called tokens.

Tokens are typically words, parts of words (subwords), or individual characters. For example, the sentence “The quick brown fox” might be broken into the tokens: [“The”, “quick”, “brown”, “fox”].

The LLM processes these tokens to predict the next most probable token in a sequence, which forms the output text.

For instance, the sentence “Tokenization is important” would be tokenized into a list like [“Tokenization”, “is”, “important”]. A more complex example could be breaking “AI-based” into [“AI-“, “based”] or even [“AI”, “-“, “based”] depending on the tokenizer’s rules.
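As an illustrative sketch, here is a crude rule-based tokenizer showing both behaviors. Real LLM tokenizers use learned subword vocabularies rather than hand-written rules like this:

```python
import re

def simple_tokenize(text):
    """Illustrative word-level tokenizer: splits on whitespace and keeps
    punctuation (including hyphens) as separate tokens. Only a toy stand-in
    for the learned subword tokenizers that LLMs actually use."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Tokenization is important"))  # ['Tokenization', 'is', 'important']
print(simple_tokenize("AI-based"))                   # ['AI', '-', 'based']
```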

🤔 The Relation to “Thinking”

“Thinking” in LLMs is an emergent property of the complex mathematical operations performed on the tokenized input. It’s not a unit you can tokenize itself.

The “thought process” is a series of billions of weighted matrix multiplications and activation functions (the model architecture) that manipulate the numerical representations (embeddings) of the input tokens.

The steps the model takes to go from the input tokens (e.g., your question) to the output tokens (e.g., this answer) simulate a cognitive process like reasoning or understanding, but it’s fundamentally an algorithmic, mathematical process, not an internal, abstract concept that gets converted into a token.

In short, text is tokenized, and the algorithm operates on those tokens to produce the output that we interpret as the result of “thinking.”
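Concretely, each token is first mapped to an integer id, and each id indexes a row of a learned embedding matrix; those vectors are what the model's matrix multiplications operate on. A minimal sketch of that lookup step (the vocabulary and vectors here are made up; real embeddings are learned during training):

```python
import random

random.seed(0)

# Hypothetical 4-word vocabulary mapping each token to an integer id.
vocab = {"The": 0, "quick": 1, "brown": 2, "fox": 3}
dim = 4
# In a real model these rows are learned parameters; random stand-ins here.
embedding_table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

tokens = ["The", "quick", "brown", "fox"]
ids = [vocab[t] for t in tokens]             # tokens  -> integer ids
vectors = [embedding_table[i] for i in ids]  # ids     -> embedding vectors
print(ids)  # [0, 1, 2, 3]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```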

The majority of modern Large Language Models (LLMs) use subword tokenization methods, which strike a balance between word-level (which creates a huge vocabulary and struggles with rare words) and character-level (which is inefficient). These subword methods break words into smaller, meaningful segments (subwords) when necessary, allowing the model to handle vocabulary effectively and manage out-of-vocabulary (OOV) words.

The three primary types of subword tokenization used in LLMs are Byte Pair Encoding (BPE), WordPiece, and SentencePiece.

In the context of Large Language Models (LLMs) and AI, the concept of “thinking” is not directly tokenized.

This is a profound and complex question at the intersection of philosophy, economics, and cognitive science. The value of “thought” in the context of human-AI collaboration is determined not by a single metric, but by a combination of measurable outcomes and hard-to-measure cognitive enhancements.

Here is a breakdown of how the value of critical thinking/thought is currently being determined in AI Agent/human collaborative environments:

Value Based on Measurable Outcomes (The “What”)

In a practical, business, or operational setting, the value of thought is measured by its impact on the final result, regardless of whether the thought originated from the human or the AI.

Decision Quality: The thought process leads to a better, more successful, or more robust final decision. Success Rate: the percentage of times the team’s recommendation (human + AI thought) achieves its objective (e.g., a stock pick, a medical diagnosis, a successful sales strategy).

Error & Risk Reduction: The critical thought process identifies and corrects flaws, biases, or inconsistencies in the AI’s or the human’s input. Bias Detection Rate: how often the human or AI agent questions a flawed recommendation and identifies a systemic bias (e.g., in loan applications). Hallucination Rate: the reduction in factually incorrect information published, due to human verification of AI outputs.

Efficiency/Cost: The thought process, even if critical, leads to a faster or cheaper successful outcome. Cost Per Interaction: the reduction in the number of costly human-AI cycles (prompts/API calls) needed to reach a high-quality decision. Time to Solution: the time taken for the collaborative team to solve a complex, ambiguous (“ill-structured”) problem.

Innovation & Creativity: The human-AI collaboration generates novel solutions that neither could have produced alone. Novelty Score: a measure of how unique or non-obvious the solution is compared to past solutions (often judged by human domain experts).

Value Based on Cognitive Enhancement (The “How”)

This is where the unique value of the human collaborator lies—in maintaining and strengthening the very act of thinking. In this view, if the AI makes the human a worse thinker, the AI’s value is diminished, regardless of short-term task success.

Metacognition and Oversight

The human’s critical thinking has value by managing the entire process, not just the output.

Role Clarity: The value is in the human accurately determining when to trust the AI (e.g., for pattern recognition) and when to override it (e.g., for ethical or contextual judgment).

The “Why” Factor: The value lies in the human demanding and understanding the reasoning behind the AI’s recommendation (Explainable AI/XAI), which improves long-term organizational learning.

Avoiding Cognitive Atrophy

Research shows that over-reliance on AI can weaken human critical thinking muscles. The value of thought is therefore determined by how well the system prevents this atrophy.

Active Engagement Metrics: Systems are being designed to measure if the human is simply accepting the AI’s answer or actively engaging with it (e.g., modifying the prompt, verifying sources, challenging the core assumption).

The World Economic Forum emphasizes skills like analytical thinking and systems thinking as the most valuable for the future, skills that must be exercised to be maintained.

Contextual & Ethical Judgment

AI is excellent at pattern matching but often blind to unique human context, politics, and unwritten rules.

Contextual Alignment: The value of the human’s thought is measured by their ability to successfully integrate the AI’s generalized insight with domain-specific knowledge, ethical trade-offs, and cultural context. This is the difference between a statistically correct answer and a practically valuable decision.

In summary, the value of thought is moving away from being solely measured by a simple “correct or incorrect” output and towards a comprehensive evaluation of process quality, human cognitive health, and resilience in complex, real-world scenarios.

Here’s a breakdown of the three main subword tokenization methods:

Byte Pair Encoding (BPE) 💡

BPE was originally developed as a data compression algorithm and is now a popular tokenization method used by models like the GPT series (OpenAI) and RoBERTa.

Mechanism: It starts with a vocabulary of all individual characters (or bytes) in the training corpus. It then iteratively merges the most frequently occurring pair of adjacent characters or subwords into a new, single token.

Criterion: The merge decision is strictly based on the frequency count of the pairs.

Advantage: It creates a relatively compact, efficient vocabulary and is great at handling morphological variations (like plurals or different verb tenses) by splitting them into common subword units (e.g., “tokenizing” might become [“token”, “iz”, “ing”], if “iz” and “ing” are common suffixes).
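The merge loop can be sketched in a few lines of Python. This is a toy version of the classic BPE training procedure; the corpus and the number of merge steps are made up:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus.
    `vocab` maps a space-separated symbol sequence to its word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with one merged symbol."""
    merged, new_symbol = " ".join(pair), "".join(pair)
    return {word.replace(merged, new_symbol): freq for word, freq in vocab.items()}

# Toy corpus: each word split into characters, mapped to its count.
vocab = {"t o k e n": 5, "t o k e n s": 3, "t o k e n i z e": 2}
for _ in range(4):  # perform 4 merge steps
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # strictly frequency-based criterion
    vocab = merge_pair(best, vocab)
    print(best, "->", "".join(best))
# After these merges, every word begins with the learned subword "token".
```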

WordPiece 🧱

WordPiece is the tokenization method famously used by models like BERT (Google). It is conceptually similar to BPE but uses a different merging criterion.

Mechanism: Like BPE, it starts from an initial vocabulary and iteratively merges pairs. However, it typically requires a pre-tokenization step, where the text is first split by whitespace. It then applies the merging algorithm to the individual words.

Criterion: The merge decision is based on likelihood maximization. Instead of simply picking the most frequent pair, it picks the pair that maximizes the likelihood of the training data when added to the vocabulary. In essence, it maximizes the score $P(\text{merged pair}) / [P(\text{first part}) \times P(\text{second part})]$.

Handling OOV: WordPiece tokens for subwords often start with a special symbol (like ## in BERT) to denote that they are a continuation of a word (e.g., “tokenization” might become [“token”, “##iz”, “##ation”], showing that “##iz” and “##ation” are not standalone words).
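Once the vocabulary is built, segmenting a word is a greedy longest-match-first scan, which is how BERT-style WordPiece tokenizes at inference time. A sketch with a hypothetical toy vocabulary (training-time vocabulary construction uses the likelihood criterion described above and is not shown):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation against a finished vocabulary.
    Continuation pieces carry the '##' prefix, as in BERT."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the match and try again
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
vocab = {"token", "##iz", "##ation"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##iz', '##ation']
```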

SentencePiece (SPM) 🌍

SentencePiece is a tokenization library/framework often used in multilingual models, such as LLaMA and T5. It offers a unified approach that is language-agnostic.

Mechanism: The key difference is that SentencePiece treats the entire input text as a raw stream of characters/bytes, without relying on whitespace to pre-split words. It converts whitespace into a special visible token (▁) and then applies subword algorithms (often BPE or a variant called the Unigram Language Model) to the entire stream.

Advantage: This makes it highly effective for languages like Chinese, Japanese, and Thai, which do not use clear word boundaries (spaces). By standardizing the input across all languages, it is ideal for multilingual LLMs.

Algorithm Options: Unlike BPE or WordPiece, SentencePiece can implement different algorithms, most commonly a Unigram Language Model method (which finds the tokenization that maximizes the probability of the sequence) or BPE.
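A simplified sketch of the whitespace pre-processing step, assuming only the reversible space-to-marker replacement (real SentencePiece also normalizes the text and handles the marker placement more carefully):

```python
# '\u2581' is the visible "lower one eighth block" character (▁) that
# SentencePiece substitutes for spaces, so whitespace survives tokenization
# as an ordinary symbol and decoding is a lossless inverse replacement.
def to_sentencepiece_stream(text):
    return text.replace(" ", "\u2581")

def from_sentencepiece_stream(stream):
    return stream.replace("\u2581", " ")

encoded = to_sentencepiece_stream("Hello world")
print(encoded)  # Hello▁world
assert from_sentencepiece_stream(encoded) == "Hello world"  # round-trip is lossless
```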

The choice of tokenizer significantly affects how an LLM processes language, handles rare words, and manages its total memory footprint.

You can learn more about the differences between these tokenization methods in LLM Tokenizers Explained: BPE Encoding, WordPiece and SentencePiece.


The healthcare domain provides one of the clearest examples of measuring the value of human critical thinking in collaboration with AI, because the ultimate metric—patient outcome—is so unambiguous and high-stakes.

In healthcare, the value of critical thinking isn’t just measured by the accuracy of the AI, but by how the human clinician’s judgment augments, questions, and corrects the AI’s output.

🏥 Concrete Example: AI-Assisted Medical Diagnosis

The AI Task: Pattern Recognition (System 1 Thinking)

An AI model (often Deep Learning) analyzes large, complex data sets—like images, lab results, and patient history in the Electronic Health Record (EHR)—to provide a diagnostic probability or suggested differential diagnosis.

Example: In radiology, an AI flags a lung nodule on a CT scan and assigns a 92% probability of being benign.

The Human Task: Critical Reasoning (System 2 Thinking)

The clinician uses critical thinking to interrogate the AI’s suggestion, applying context, ethics, and external knowledge. The value of the human’s thought is measured by its impact on the final decision and outcome.

Metrics for Measuring Critical Thinking Value

The value of the AI-human collaboration is measured through a set of integrated Human-Centric Key Performance Indicators (KPIs):

Decision Quality & Accuracy Metrics

This measures the team’s ability to arrive at the correct outcome, avoiding both false alarms and dangerous missed diagnoses.

Reduction in False Positives/Negatives: How much the human’s review reduces incorrect diagnoses (e.g., preventing an unnecessary invasive biopsy for a benign nodule, a false positive).

Human Override Rate: The frequency with which the clinician correctly rejects or modifies the AI’s suggestion. A high correct-override rate indicates the human is adding value by applying contextual or novel knowledge the AI missed.

Time-to-Accurate Diagnosis: The speed with which the correct diagnosis is finalized. The AI expedites the initial analysis, but the human’s critical validation is necessary to declare the decision “accurate.”

Bias Mitigation Score: How often the clinician identifies and corrects an AI recommendation that is biased against a specific demographic (e.g., if the AI underperforms for certain patient populations).
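Several of these KPIs can be computed directly from a log of reviewed cases. A minimal sketch in Python; the field names and toy data are entirely hypothetical:

```python
# Toy case log; each record notes whether the AI's call was correct, whether
# the clinician overrode it, and whether the final decision was correct.
cases = [
    {"ai_correct": False, "clinician_overrode": True,  "final_correct": True},
    {"ai_correct": True,  "clinician_overrode": False, "final_correct": True},
    {"ai_correct": True,  "clinician_overrode": True,  "final_correct": False},
    {"ai_correct": False, "clinician_overrode": False, "final_correct": False},
    {"ai_correct": False, "clinician_overrode": True,  "final_correct": True},
]

overrides = [c for c in cases if c["clinician_overrode"]]
# Correct-override rate: the share of overrides that fixed a wrong AI call.
correct_override_rate = sum(
    1 for c in overrides if c["final_correct"] and not c["ai_correct"]
) / len(overrides)

# Team accuracy vs. AI-alone accuracy: the added value of human review.
team_accuracy = sum(c["final_correct"] for c in cases) / len(cases)
ai_accuracy = sum(c["ai_correct"] for c in cases) / len(cases)

print(f"correct override rate: {correct_override_rate:.2f}")  # 0.67
print(f"team accuracy: {team_accuracy:.2f} vs AI alone: {ai_accuracy:.2f}")  # 0.60 vs 0.40
```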

Cognitive State & Skill Metrics

This is the value derived from preventing “automation bias” and ensuring the human’s skill does not atrophy.

Structured Reflection Prompts: In training and post-case reviews, systems track if clinicians ask critical questions of the AI, such as: “What is the assumption behind this 92% confidence score?” or “What data point did the AI prioritize?” This measures the depth of critical engagement rather than passive acceptance.

Maintenance of Diagnostic Skill: Through simulated non-AI scenarios or by tracking expertise over time, systems measure whether the use of the AI causes deskilling. If critical thinking value is negative, it means the human is losing the ability to think independently.

Patient Outcome Metrics

The ultimate, non-negotiable metric is the impact on the patient.

Reduced Adverse Events: Tracking the reduction in medical errors, complications, or readmission rates when the human-AI team is used, compared to human-only or AI-only decisions.

Mortality/Morbidity Reduction: For critical conditions like sepsis or cardiac events, the value is quantified by the percentage decrease in mortality due to the human-AI system’s ability to predict, flag, and intervene earlier than traditional methods.

In this context, the value of the human’s critical thinking is the essential verification, contextualization, and ethical oversight that converts a statistically probable AI prediction into a safe, patient-centered clinical decision.

You can explore how diagnostic reasoning is taught and how AI can enhance it in Preparing Clinicians for Diagnostic Artificial Intelligence: Tools, Teammates, and Teaching.