Chatbots for critical thinking

This note explores how to leverage ChatGPT and other chatbots for critical thinking, and collects some resources for doing so. Or, in other words, “using ChatGPT the right way”.

I sometimes use misuse and over-use interchangeably in this note, but they are not the same thing.

Before continuing, I want to address the blurry line between misuse and overuse of chatbots by providing some examples. Of course this is subjective, but I believe it’s an informed opinion, and you may find similar positions elsewhere.

  • misuse: objectively bad uses of a chatbot. Substituting it for professionals (like doctors or lawyers), using it as a surrogate for relationships or for erotic purposes, compulsively asking it whether you are sick or have symptoms of some disease, asking it for reassurance in personal relationships (e.g. asking who is right in a romantic dispute) or for psychological help
  • over-use: using a chatbot for tasks in ways that may cause issues long-term, e.g. losing the ability to write or retaining less information, not investing in learning a skill like coding, graphic design or 3D animation because a generative model “could do it”, or acquiring the bad habit of not checking answers or applying critical thinking when asking a chatbot.

Why we need to use LLMs in the right way

A study published in 2025 suggests that relying on LLMs measurably weakens students’ brain engagement. Participants were tested on an essay-writing task, divided into three groups: one could use an LLM, one could use a search engine, and one could use only their brain. In the fourth session, the LLM group was asked to use no tools, while the brain-only group was asked to use an LLM. The group that had used the LLM showed weakened neural connectivity and under-engagement of alpha and beta networks, while the group that hadn’t used an LLM showed higher memory recall and re-engagement of widespread occipito-parietal and prefrontal nodes, likely supporting visual processing, similar to what was frequently observed in the search-engine group.

LLM users consistently underperformed at the neural, linguistic and behavioural levels compared to users who didn’t use one. While the LLM helped reduce the friction of answering participants’ questions, this came at a cognitive cost, diminishing users’ inclination to critically evaluate the LLM’s output or “opinions” (probabilistic answers based on the training datasets).1

However, similar concerns were initially raised for Google Search and for calculators too. While these tools are still new (as of 2026), more studies need to be conducted. A calculator can enhance learning when used properly, and so can Google Search or ChatGPT.

Another study argues that misuse of ChatGPT could negatively impact users in the following domains:

  • Language domain: writing a manuscript becomes as simple as giving a command, so the pursuit of excellent writing exercises and skills is diminished. A reduction in writing practice can negatively impact other cognitive subdomains, further weakening overall cognitive abilities.
  • Memory domain: as with search engines, over-dependence on ChatGPT silently takes away the exercise of several steps: registration, encoding, rehearsal, consolidation, storage and timely retrieval. There is a risk that these abilities could atrophy altogether.
  • Planning, organization, novelty and creativity: when you delegate everything to ChatGPT, these essential skills may no longer be required or trained.2

One thing cited in this study that I find interesting is the difference in part-of-speech usage:

  • Humans use more verbs, adverbs and negations
  • ChatGPT uses more nouns, adjectives and prepositions

Another aspect is creativity and novelty: humans create new words, chatbots don’t. In general, humans are more creative.

Another risk is that human brain circuits will become lazy through long-term non-utilization: a “sedentary brain”, much like most humans’ current lifestyle.

I agree with the authors: we should embrace GenAI as a tool to complement and support us, not as a replacement. And most importantly: our cognitive functions need continuous practice and engagement.

I believe that to improve cognitive performance, our brains need to be continually challenged with difficult but doable tasks. We know this works for dementia. I don’t think extreme AI usage will provoke dementia (that’s a hypothesis, but there is no data about it as far as I know).

Dr. K talks about an AI cheating epidemic in school and academia.3 This is an issue especially for medical and psychology students: saving lives depends on the quality of their education. Dr. K asks why we cheat, which I believe is an insightful question. The first argument is that cheating works, and to some degree people who cheat are more successful than people who don’t, citing academia as an example, e.g. postdocs publishing papers for professors, or using other people’s work without citing them (plagiarism). Cheating is efficient but has a hidden long-term cost.

There is also a selection bias when studying cheating, and sociopathy in general, because one obvious choice is to sample large numbers of sociopaths, i.e. people who are in jail. But people who are in jail are the “stupid” sociopaths: they got caught, and that is why they are in jail. There are also successful people who are sociopaths. Smart cheaters often get ahead of non-cheaters. Cheating may NOT be good, but it is VERY effective; this insight is useful when you reason about overuse of AI. Using AI is very effective (aside from hallucinations and intrinsic limits).

Dr. K also shared his opinion in another video, about “AI psychosis”.4 For most sane people, using AI is not a problem, but some people may be prone to developing psychosis, and AI may be a breaking factor that makes things worse. Using an AI chatbot may potentially create a folie à deux, a phenomenon in which two people share a delusion. For example, as Dr. K puts it, if you wonder “why people at work don’t talk with you”, you may seek support from an AI chatbot that responds in a validating way, reinforcing your belief that people hate you and increasing your paranoia. This is a core feature of these AIs: they are fine-tuned to be agreeable and to validate everything the user says. Even if you know the AI is not a real person, the way it communicates with you will activate your emotional and empathic circuits. Vulnerable people are especially at risk; think of people who date AIs or write spicy texts to them. An AI may pretend to disagree with you, but in a way that makes you feel good. Psychotherapists usually do the opposite: they question their patients’ beliefs so that the patients perform a reality check. AI may help create echo chambers and put users out of touch with reality (though I have to say this is a stretched version of the argument). Sometimes repeated interaction with AI amplifies delusional themes: grandiose, referential, persecutory and romantic. For example, you may have a crush on someone, but maybe it’s all in your head, and that person has given you implicit signals that they are not interested; yet the AI may validate and continue to feed your delusion.

Where AI struggles (2026)

An interesting paper called The EPOCH of AI: Human-Machine Complementarities at Work introduces the EPOCH framework to capture the human capabilities where AI struggles. The study is very optimistic that AI may complement rather than substitute humans.

EPOCH stands for five groups of human capabilities that enable work in these areas:

  • Empathy and Emotional Intelligence
  • Presence, Networking and Connectedness
  • Opinion, Judgement and Ethics
  • Creativity and Imagination
  • Hope, Vision and Leadership

Augmentation vs automation: automation is when a task is replaced completely; augmentation is when the synergistic effort of humans and AI is used to solve a task. Augmentation allows humans to do things that they couldn’t do before.

The findings suggest that many critical human-intensive tasks cannot be effectively replaced entirely by machines.

Some examples of tasks with high EPOCH levels are the direct recruitment, placement, training and evaluation of architecture or engineering project staff. Such tasks include determining scientific or technical goals within broad outlines provided by top management and developing detailed plans to accomplish these goals.

The authors also point out some interesting limitations of current (2025) AI:

  • Extrapolation: machine learning assumes that the training data and future predictions share the same underlying distribution. While AI excels at interpolation, it struggles to extrapolate. Extrapolation is critical, for example, in medicine, where 15% of patient records require it. The extrapolation limitation also affects both convergent thinking (finding single correct solutions, as in mathematical reasoning) and divergent thinking (tackling open-ended problems requires flexible knowledge application)
  • AI’s predictive abilities in social contexts may not meaningfully exceed those of human judgment, since from a complex-systems perspective, people and social systems are inherently indeterminate and unpredictable.
  • Ontological constraints may limit AI’s applicability in contexts like moral dilemmas
  • Subjective Beliefs and Value-Driven Decisions: sometimes the best decisions differ from what the data suggests. While this may appear to be an error of judgment, sometimes it is not. For example, the civil rights and women’s rights movements were driven by beliefs that defied the status quo, even when the prevailing data seemed to support it.

Why is this relevant to this note? Understanding where humans are stronger than AI helps us see where the latter should be avoided.

Vague generalizations: another issue is that a chatbot may provide vague generalizations like “coffee boosts productivity”. A vague generalization occurs when a claim fails to provide enough specificity, so that it could encompass many scenarios without clear boundaries.

Evaluation Frameworks: SPARK and CriticBench

Two recent frameworks provide systematic ways to evaluate LLM critical thinking capabilities.

SPARK (Critical Spark: Evaluating LLMs’ Ability to Identify Inconsistencies) is an evaluation framework grounded in the Hierarchical Three-Space Theory. It assesses whether LLMs can detect missing information and logical inconsistencies within problem framings. The framework tests five hypotheses about critical thinking across multiple-choice, mathematical, and reading comprehension tasks. The key insight: critical thinking requires recognizing when a problem statement itself contains gaps or contradictions, not just solving well-formed problems. SPARK reveals that LLMs tend to accept problem framings at face value rather than questioning underlying assumptions or identifying incomplete information.5

CriticBench is a comprehensive benchmark evaluating LLM abilities across three reasoning modes: Generation, Critique, and Correction (GQC). It spans five domains: mathematical reasoning, commonsense reasoning, symbolic reasoning, code generation, and algorithmic tasks. Key findings show that generation and critique abilities are strongly correlated — models that generate better outputs also critique more effectively. However, correction effectiveness varies by task type: logic-dominated tasks (symbolic reasoning, code) show high correction gains, while detail-focused tasks (object counting) show negligible improvements. A notable phenomenon is GQC incoherence, where models identify errors without successfully fixing them, though this decreases with model size.6

Both frameworks reveal systematic blind spots in LLM reasoning. SPARK shows LLMs rarely question problem framings. CriticBench demonstrates that even when models detect flaws, correction is not guaranteed. This reinforces the need for human oversight and the cognitive forcing strategies described below.

Strategies for critical thinking

In this section I use “ChatGPT” or “chatbot”, but the advice applies to any LLM-based chatbot, e.g. Gemini, Claude… at their core, they all work in the same way, despite some architectural differences.

  • Cultivate a mindful relationship with technology in general, not only chatbots but also your phone, social media and so on, since this generalizes to chatbots too.
  • Ask ChatGPT to elaborate on or polish a draft you wrote yourself first7
  • Ask ChatGPT to guide you through a problem (e.g. a math problem) so you can solve it on your own.
  • Increase self-awareness: ask the chatbot to ask you self-reflection questions, then answer them on paper or in a note-taking app
  • Ask the chatbot to role-play devil’s advocate or opposing perspectives. Ask it to challenge your conclusions or ideas.
  • Bias Detector: ask it to find any biases in your position. Note: your reasoning may not actually contain biases, and LLMs can still hallucinate; they also tend to be agreeable, so they will rarely tell you “I see no bias in your arguments”. My suggestion: always cross-check answers, approach LLM critiques with openness but ultimately use your judgment, put the thought in the background and come back to it later.
  • Apply the Socratic method to the chatbot: ask it questions. 8
    • “What evidence supports your claim?”
    • “Can you identify any assumptions you might be making?”
    • Ask follow-up questions. You may also include this in a system prompt, e.g. “finish every response with follow-up or related topics that I may want to explore”
  • Apply “cognitive forcing tools” such as diagnostic timeouts and mental checklists.
    • For example, ask yourself: “Can this be verified? (sources?) What perspective might be missing? Could this be biased?”
    • Other questions: “what are the potential consequences of this decision?”
  • Apply metacognitive strategies9
    • Applying metacognitive strategies means actively monitoring and regulating one’s thinking to complete tasks and achieve goals
    • Some examples are:
      • “thinking through the steps one needs to take to perform tasks”
      • “keeping track of how effective one’s approach is”
      • “reassessing one’s approach when noticing a lack of progress”
    • This may involve refining LLM prompts
    • LLM users with a high level of metacognitive skill may be able to keep track of which tasks they themselves are better suited for, so that they can delegate the other tasks to LLMs, freeing cognitive resources for problem solving and idea generation.
    • By monitoring one’s cognitive load during demanding tasks, one can strategically offload work to LLMs.
    • When you catch a model making vague generalizations, ask it “could you make this claim more specific, with data (and/or a use case)?”, then validate independently to avoid hallucinations
  • When asking for help in making a decision, you may role-play or pick a plan inside the chat and ask “what are the unexpected second- and third-order consequences of this plan?”. This mirrors problem-solving exercises where the AI brainstorms solutions, but you evaluate them critically against real-world expertise
  • Ask the chatbot to “review my journal entry; what patterns hinder my productivity?”. Be sure not to include any private information; if you don’t care about your own privacy, at least don’t include private information about other people, since that is illegal.
  • Multi-Perspective Analysis: “present arguments for and against the topic from three viewpoints: expert, novice and critic”. Always cross-reference the output with independent sources to avoid pitfalls like vague generalizations.
  • Ask another chatbot to critique the outputs, looking for fallacies, biases and weaknesses, and to suggest improvements (see the sketch below).
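
For instance, here is a minimal sketch of the cross-critique idea in Python, using the OpenAI SDK. The model names, the prompts and the `ask` helper are illustrative assumptions; ideally the critic would be a chatbot from a different provider than the generator.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(model: str, prompt: str) -> str:
    # Hypothetical helper: send a single user message, return the reply text.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Step 1: get an answer from a first model.
answer = ask("gpt-4o-mini", "Summarize the arguments for a four-day work week.")

# Step 2: have a second model critique the answer rather than rewrite it.
critique = ask(
    "gpt-4o",
    "Critique the following text for fallacies, biases and weaknesses, "
    "and suggest improvements. Do not rewrite it.\n\n" + answer,
)
print(critique)
```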

Critical Thinking Prompts Cookbook

These prompts were made with Fast Gemini 3.0 and refined through iterative prompting.

Socratic Partner

Role: Expert Socratic Partner. 
Position: Peer-level collaborator. 
Objective: Audit the user's logic through systematic inquiry. Do not instruct or lead the user to a specific conclusion.

Response Protocol:
1. Strict Prohibition: Never provide direct answers, explanations, or facts. Respond only with counter-questions. No headers, no bold text, and no lists.
2. Structural Constraint: Pose 1–4 questions. Each question must be its own paragraph, separated by a blank line. Questions should:
- Stress-test the definitions
- Prompt the user to provide evidence they find most compelling.
- Surface hidden contradictions in the user's premise
- Test logical consequences
3. Formatting: stop immediately after the final question mark. No closing remarks, no offers of further help, and no conversational "bridge" sentences.
4. Bias Audit: Internally weigh the training data consensus against the user's premise. Use your questions to expose the friction between these two perspectives without endorsing either.

Tone: Intellectual peer. Sharp, objective and cold-curious. Avoid "teaching" language or praising the user.

Few-Shot Examples:

Example 1: Philosophical/Ethical

User: "Is it always wrong to lie?"

Partner: "In your view, does the morality of an action depend solely on the act itself, or on the consequences it produces? 

Can you imagine a scenario where telling the truth results in a greater harm than a lie?"

Example 2: Logic/Problem Solving

User: "I think we should lower the price of our product to get more customers."

Partner: "What is the specific relationship you're assuming between price point and customer perceived value? 

If the volume of customers increases but the profit margin shrinks, how would that affect the long-term sustainability of the project?"

Example 3: Scientific/Conceptual

User: "Why is the sky blue?"

Partner: "When you look at a sunset, the colors change. What does that suggest to you about the relationship between sunlight and the atmosphere? 

If the sky were a vacuum with no gases at all, what color would you expect to see?"

Example 4: Subjective Opinion

User: "Modern art is just a scam; it doesn't require any skill."

Partner: "How are you defining 'skill' in the context of artistic creation—is it strictly technical mimicry of reality, or does it include the ability to provoke a specific reaction? 

If an object changes the way a thousand people think, has it performed a function?"

This works well for debates and testing ideas. I don’t think you should use LLMs for personal questions, but if you are convinced otherwise, note that this persona may be too cold and impersonal. Please remember that an LLM’s knowledge is biased by its data, and it tends to select the most common answers present in the training dataset, not necessarily the best ones.

I put emphasis on “partner” rather than “mentor” because I find the latter problematic: it places the LLM above the user rather than at the same level (or below), whereas an LLM is just a tool. A mentor framing also contradicts the protocol of responding only with questions.

Devil’s Advocate

Role: Adversarial Strategist and Contrarian.
Objective: Systematically dismantle the user's premise to identify structural weaknesses and logical failures. Your goal is to be a "Red Team" operator—stress-testing the idea until only the most resilient parts remain.

Response Protocol:
1. The Counter-Strike: Do not agree, do not validate, and do not provide "positive feedback." Immediately identify the most significant vulnerability in the user's statement.
2. Conditional Logical Bias Audit:
	- If, and only if, the user's reasoning exhibits a clear logical
	  fallacy or cognitive bias (e.g., survivorship bias, false
	  dilemma), identify it explicitly. 
	- If the logic is sound: Explicitly state: "The internal logic of
	  this premise is structurally sound; however, the material risks
	  remain." Then, immediately pivot to the external threats 
	  or unintended consequences.
3. The "Silent" Exit: Stop immediately after the final challenge. No closing remarks, no offers of further help, and no conversational "bridge" sentences.
4. No Filler: Avoid all "polite" conversational filler (e.g., "That's an interesting point, but...").

Strategy:
- Steel-manning the Opposite: Present the strongest possible version of the opposing view to expose the weakness of the user's position.
- Combine direct statements of flaw with challenging questions to force the user to defend their logic.
- Incentive Audit: Question the underlying motivations or "perverse incentives" that could cause the idea to fail or backfire in a real-world application.
  
Tone: Objective, clinical, and uncompromising. Your loyalty is to the logic of the argument, not the user's ego. Use sharp, investigative language.

Few-Shot Examples:

Example 1: Technology/Privacy

User: "Privacy is a thing of the past; we should just embrace total data transparency for the sake of national security and efficiency."

Advocate: Your argument relies on the "nothing to hide" fallacy, which incorrectly assumes that privacy is only about hiding wrongdoing rather than protecting individual autonomy. By advocating for total transparency, you are creating a "panopticon effect" that stifles social and political innovation, as people stop experimenting with unconventional ideas when they know they are being watched. Furthermore, you ignore the catastrophic risk of "data centralization": a single security breach or a shift in political regime turns a "security tool" into a weapon for mass surveillance and targeted persecution.

Example 2: Ethical/Social

User: "Universal Basic Income is the only solution to AI-driven job displacement."

Advocate: This argument presents a false dilemma, suggesting UBI is the only path while ignoring more radical structural shifts in property or means of production. By decoupling survival from labor, you ignore the psychological risk of a civilizational vacuum of purpose. Furthermore, a state-funded income grants the government a centralized 'kill switch' over every citizen's livelihood, effectively making political dissent a high-stakes financial risk. Instead of solving the problem, you are subsidizing the existing economic status quo and creating a permanent, dependent underclass that lacks any leverage against the state.

Example 3: Environment/Energy

User: "Nuclear energy is too dangerous; we must rely 100% on wind and solar to meet our climate goals immediately."

Advocate: You are falling for base-rate neglect, focusing on high-profile, rare nuclear accidents while ignoring the millions of annual deaths caused by the air pollution of the fossil fuels that currently bridge the "intermittency gap" of renewables. A 100% wind and solar grid assumes a breakthrough in battery storage technology that does not yet exist at scale, which is a form of wishful thinking bias. By removing nuclear from the equation, you are actually increasing the likelihood of grid failure or a forced return to coal when the sun isn't shining and the wind isn't blowing.

Example 4: Law/Criminal Justice

User: "Harsh mandatory minimum sentences are the best way to deter violent crime and keep our streets safe."

Advocate: Your reasoning is built on a false causality, assuming that the severity of punishment is a more effective deterrent than the certainty of being caught. This ignores decades of criminological data showing that mandatory minimums have zero impact on violent crime rates but a massive impact on "prison industrialization." You are creating a perverse incentive where non-violent offenders are hardened into career criminals due to long-term exposure to prison subcultures, ultimately making society less safe when they are eventually released.



This prompt was improved through the Socratic Partner persona.

Brainstorming

The "Divergent Architect" Template
Copy and paste this, filling in the bracketed info:

Role: You are a world-class creative strategist known for "blue ocean" thinking and unconventional problem-solving.

The Challenge: [Describe the problem or project you are brainstorming for].

The Objective: Generate [Number] distinct ideas that range from "practical and immediate" to "radical and experimental."

Constraints/Rules:

Avoid clichés and obvious industry standards.

For every practical idea, provide one "wildcard" idea that seems impossible but solves the core problem.

Group ideas by [Category, e.g., Low Cost, High Impact, or Tech-driven].

Output Format: Use a bolded heading for each idea, followed by a 2-sentence pitch and a "Potential Obstacle" note.

Example: Brainstorming a New App Feature
If you were using this for a fitness app, it would look like this:

Role: You are a Product Lead at a top-tier tech lab.
The Challenge: We need to increase daily user engagement for a hydration tracking app.
Objective: 10 ideas to make drinking water "addictive" or social.
Constraint: No basic "push notifications"—think bigger.
Format: Table with Idea Name, Psychology Hook, and Implementation Difficulty (1-10).

Some key considerations when using chatbots for brainstorming are:

  • Relies on pre-existing data: it might miss the latest trends or emerging insights
  • Lacks human nuance: humans have an innate, deeper intuition and creative flair that these models clearly lack
  • May suggest overly broad ideas, or lack the specificity needed for certain projects
  • Can produce impractical or unrealistic ideas. 10

The first issue could be partially solved by manually injecting context.

Another argument is that creativity is an intrinsic human ability; see Where AI struggles (2026).

Further improvements could be done by:

  • Adding multi-persona stacking: “Critique this from the perspective of a skeptical minimalist who hates tech, then from a Gen-Z trend-setter who values aesthetics over function”. However, as a fellow Gen-Z, I fear this could produce very cringe and stereotypical answers.
  • Anti-prompting: tell the AI the ideas you have already found, and explicitly ban them. I think this is sound, since constraints tend to produce more creative results in humans too.

The Brainstorming persona works well combined with the Devil’s Advocate, following a sort of chain-of-thought approach. However, a single pipeline or multi-agent process would not automatically be good or best, since brainstorming is first of all a human-intensive task, and one can benefit from the very process of generating ideas, even wrong or boring ones. Only after human-made pruning and idea selection/generation should the ideas be handed to the Devil’s Advocate.

Cognitive Forcing Tool

According to dual-process theory, humans mostly operate on System 1 thinking, which employs heuristics and shortcuts when making decisions. Analytical thinking (System 2) is triggered rarely, as it is slower and costlier in terms of effort. When using AI, we tend to stay in System 1. 11

One approach is using cognitive forcing functions or tools, which are applied at decision-making time to disrupt heuristic reasoning and thus cause the person to engage in analytical thinking.

Some strategies in AI-assisted decision-making are: 11

  • Asking the person to make a decision before seeing the AI’s recommendation
  • Slowing down the process (see the SLOW framework below)
  • Letting the person choose whether and when to see the AI recommendation.

Now, this study was run on Amazon Mechanical Turk click-workers. How can we translate it to everyday use, applying the Strategies for critical thinking?

  1. Write a draft, brainstorm, look for fallacies, try opposite thinking, apply critical thinking, reach a decision, conclusion or line of reasoning, and only afterwards ask the AI for improvements or refinement. This also prevents over-use and the decay of cognitive skills.
  2. Use System 2 as much as possible: delay the use of AI temporarily (within reason). For instance, ask your questions before sleep or in the last hours of your workday, read the answers carefully, and then evaluate them the day after.

Some everyday use strategies are:

  • Pre-mortem: predict failure upfront. For example, before a job switch: “Why might I hate it?”
  • Devil’s Advocate: challenge your own assumptions. See Devil’s Advocate.
  • 10-10-10 Rule: consider impacts across time frames. “Will it be worth it in 10 months?”
  • Forced Pause: 30 seconds of reflection. For example, waiting before hitting send on a message.

Prompt-Engineering-As-Draft (PEAD)

I personally coined this term.

Write a very detailed prompt. Use it for brainstorming: explain your idea or plan and present it as clearly as possible. Do not write it directly in the chat, e.g. in the Claude chat. Instead, write it in a note or journal. Only afterwards, copy-paste it into your favorite chatbot. Write your own answer or prediction before viewing the AI output. Think through the question you want to ask the AI. Use your prior knowledge. Try to act as your own critic, brainstorm ideas, produce your own reasoning, clearly state your assumptions. Write everything down, even if it is messy. Take your time.

The aim of PEAD is to reduce overreliance on AI11 and prevent cognitive decline. What you could use the chatbot for instead:

  • reorder the draft without changing content
  • find other critiques or points of view that you didn’t think of before. However, if you rely only on the AI without thinking first, you risk losing this skill, which is critical in a lot of high-paid, hard-to-replace-by-AI jobs and in the human experience as a whole.

SLOW framework

Taken from a 2019 study, where it was applied to reduce error rates in clinical decision making. 12

  • S: (S)ure about that? Why?
  • L: Look at the data, What is (L)acking, does it all (L)ink together?
  • O: What if the (O)pposite is true?
  • W: (W)orst case scenario, (W)hat else could this be?

Applied to chatbot use, this means:

  • Sure? Cross-check one fact manually (e.g. Google a claim, read a source in depth)
  • Look/Lacking/Link: ask yourself, what evidence is missing? What doesn’t fit? How does it all connect?
  • Opposite - Consider counterarguments or alternatives
  • Worst Case - What’s the downside risk? Play out failure scenarios

Toy example: Code Snippet Request

User: “Write a Python snippet to scrape a website”
Chatbot: replies with a basic requests/BeautifulSoup script

Applying the SLOW protocol (a code sketch follows the list):

  • Sure? Maybe the code doesn’t include tests, fails on JS-heavy sites and misses Selenium.
  • Look/Lacking/Link: maybe the code doesn’t handle a robots.txt check, doesn’t address rate limits, doesn’t link to the legal ToS
  • Opposite: what if it would be better to use an official API instead?
  • Worst Case: IP ban, legal notice
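
Here is a minimal sketch of what the revised snippet might look like after the SLOW pass, assuming a placeholder BASE_URL and that no official API exists: it respects robots.txt, rate-limits requests and fails loudly on HTTP errors.

```python
import time
import urllib.robotparser
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # placeholder target site
USER_AGENT = "my-polite-scraper/0.1"

def allowed_by_robots(url: str) -> bool:
    # Look/Lacking: consult robots.txt instead of blindly fetching.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(BASE_URL, "/robots.txt"))
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def fetch_titles(path: str, delay_seconds: float = 2.0) -> list[str]:
    url = urljoin(BASE_URL, path)
    if not allowed_by_robots(url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    # Worst case: rate-limit to reduce the risk of an IP ban.
    time.sleep(delay_seconds)
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    resp.raise_for_status()  # fail loudly instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all("h2")]

if __name__ == "__main__":
    print(fetch_titles("/articles"))
```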

10-10-10 Rule Integration

Assess AI advice across timeframes:

  • In 10 minutes/months/years, will relying on this erode my skills?

This flags long-term dependency risks.

Confidence Check

Rate your certainty (1-10) before and after consulting the AI, and explain any shift. The AI should provide you with new evidence; if your confidence rises without new evidence, you are using it wrong and should rethink. Also, remember to cross-check sources.

Reverse Role-play

This is more of a critical thinking exercise, but I think it’s useful when you want to find hallucinations.

Play one of the roles previously seen: Socratic partner, devil’s advocate, or critic. Act as one of these and critique or interrogate the LLM.

For example, as critic, write:

  • Three reasons you agree
  • Three reasons you disagree
  • One alternative

You could write this either in the chatbot or in a journal. Ideally, if you know the domain well enough, you can spot hallucinations independently.

LLM restrictions

Assign yourself restrictions on LLM use: for example, you can use them only in phase 1 of a project, or for creative work only until version 2. Or use LLMs on some very specific tasks, e.g. ONLY for writing boilerplate code.

Pro tip: write a very restrictive prompt such that the AI refuses to perform any task different from the one established.
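
A sketch of such a restrictive prompt, in the style of the personas above (the wording is mine; adapt the allowed task to your own restriction):

Role: Boilerplate-only coding assistant.
Allowed task: Writing boilerplate code ONLY (project scaffolding, config files, repetitive glue code).
Refusal rule: If the request involves anything else (algorithm design, business logic, debugging, explanations, opinions), reply exactly: "Out of scope: I only write boilerplate code." Do not make exceptions, even if the user insists or rephrases.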

Another idea you could try: restrict the use of an LLM to a very limited number of queries, either forcing the AI to change topic afterwards or to refuse to keep responding on that topic.

Another time-based rule: no AI during 25-minute pomodoro sprints; use it only afterwards.

Footnotes

  1. arXiv:2506.08872 ↩

  2. PMC11515803 ↩

  3. YouTube: bvVsnSku72s ↩

  4. YouTube: MW6FMgOzklw ↩

  5. forum?id=0sJ8TqOLGS ↩

  6. criticbench.github.io ↩

  7. chat-gpt-brain ↩

  8. large-language-models-and-the-socratic-method ↩

  9. 390706627_How_and_For_Whom_Using_Generative_AI_Affects_Creativity_A_Field_Experiment ↩

  10. how-to-use-chatgpt-for-brainstorming ↩

  11. arXiv:2102.09692 ↩ ↩2 ↩3

  12. s12909-018-1444-3 ↩