Jun 26 · AEO & AI Visibility

How to Tell If LLMs Know Your Brand — a Practical Diagnostic for Marketers

Avinash Saurabh · Co-Founder & CEO

So, you want to know if LLMs know your brand. Here’s the short version: you need to test four layers (recognition, description, attribution, and recommendation) with a solid set of prompts that cover your buyer’s journey. Run each prompt a few times, score what you get with a simple rubric, and you’ll have a map of the gaps you can actually fix. This isn't about getting a vanity score to show your boss; it’s about building a real action plan.

If you’ve spent time manually searching ChatGPT or Perplexity, seen your brand show up one minute and disappear the next, and felt completely lost about what to do with that information, you are not alone. I talk to content leaders all the time who are doing ad-hoc spot checks. They get spooked when a competitor pops up and they don’t, and they're fielding questions from leadership about an "AI search strategy" without a real framework to stand on.

The frustration is completely valid. You’ve put in the work to rank on Google, you’ve built a content operation that works, and now it feels like these new AI tools have no idea you even exist.

Here’s the thing I’ve learned: one prompt, one answer, one check doesn't tell you anything reliable. What you need is a repeatable diagnostic. A system. This is the playbook I’ve built after running into this exact problem myself. We’re going to define what brand knowledge actually means, design a prompt set that reflects how buyers actually talk and think, run it enough times to trust the results, and turn the output into a prioritized list of things you can actually go and do.

What Does It Mean for an LLM to "Know" Your Brand, and What Should You Test?

Brand knowledge isn't a simple yes or no. An LLM can mention your brand but get your category wrong, describe you in the most generic terms, or recommend you for a problem you don't even solve. To figure out what's broken, you have to break "brand knowledge" down into four testable layers.

The four-layer brand knowledge model:

Recognition: The model knows your brand exists and puts it in the right box. A failure here is getting no response, being told you’re in the wrong industry, or being confused with a competitor who has a similar name.
Description: The model can accurately explain what you do, your positioning, and what makes you different. Failure looks like a vague description that could be about any SaaS company, or one that’s only half-right and misses your real audience.
Attribution: The model connects its claims about you to real, verifiable sources or specific pages on your site. When this fails, you get confident-sounding statements with no backup, or citations that don’t actually prove the point.
Recommendation: The model suggests your brand when it’s a genuine fit for the use case, segment, and constraints in the prompt. Failure is getting recommended when you don’t belong (like an SMB tool for an enterprise buyer) or, just as bad, never getting mentioned even when you’re a perfect fit.

Why do we bother breaking it down like this? Because just counting "mentions" is a waste of time. You can get mentioned inaccurately (wrong category), weakly (a throwaway name at the end of a list), or in a totally irrelevant context. You’ve probably seen this yourself.

Common failure modes I see all the time:

The model confuses you with a direct competitor.
The description is technically right but so vague it’s useless.
You get recommended to the wrong customers, like SMBs when you’re built for enterprise.
You only ever show up in "alternatives to X" lists without any context.

Using the four-layer model saves you from spinning your wheels. If the model doesn’t even recognize your brand exists, tweaking your comparison pages is pointless. If recognition is fine but the description is wrong, you need to publish clear, canonical content about your positioning, not more blog posts. You have to target your fixes to the broken layer.

How Brand Knowledge Fails in the Real World: The 4 Gap Signals

These four layers give us four clear signals of failure to watch for:

Absent: Your brand is nowhere to be found across multiple runs on relevant prompts.
Inaccurate: The model mentions you but gets a key fact wrong about your category, use case, or features.
Vague/low-confidence: You’re there, but the description is generic or the model says something like, "I'm not entirely sure, but..."
Inconsistent recommendation: The model suggests you for one prompt but not a similar one, or puts you in lists where you clearly don't belong.

Here’s a real-world example: A buyer asks, "what are the best SOC 2-ready tools for content operations?" The model recommends your brand, even though compliance has never been your strong suit. It feels like a win, but it's not. It's a mismatch that will kill a buyer's trust as soon as they click through to your site.

Why Are LLM Answers So Inconsistent, and What Metrics Are Actually Trustworthy?

Treat any single LLM output as an anecdote, not a data point. I can’t stress this enough. You can ask the same prompt twice in a row and get different answers. This isn't a bug you have to work around; it's just how these systems work, and your diagnostic has to be built for it.

What causes all this variation?

Sampling randomness: Models pick each next word by sampling from a probability distribution, so at typical settings the same prompt can produce different text on every run.
Prompt phrasing: A tiny change like "best tools" vs. "top platforms" can give you a completely different answer.
Model and version changes: GPT-4o, Claude 3.5, and Gemini 1.5 are all different beasts with different knowledge.
Personalization: Some platforms will change results based on your search history if you’re logged in.
UI differences: You’ll get different answers and citation styles from Perplexity, ChatGPT, and Google's AI Overviews.

Given all that, it's time to stop tracking things that don't matter.

Metric	What it tells you	When it misleads you	How to use it
"We ranked #1"	Nothing stable	Always. Single outputs are not repeatable.	Don't use this as a KPI. Please.
Mention rate	How often you appear across many runs	If your prompts are biased toward you	Compare across models and over time
Citation rate	How often your pages are linked	Varies by platform	Track per platform and watch for trends
Quality score	The accuracy and fit of the mention	If you don't have a clear rubric	Use with the scoring rubric below

Instead of chasing rank, measure these:

Mention rate: Across a whole prompt set and multiple runs, how often does your brand show up at all?
Citation rate: When the model cites sources, how often is it one of your URLs?
Quality score: Based on a simple rubric, how accurate, specific, and well-contextualized was the mention?

These three metrics give you a stable baseline you can compare over time, across different models, and even between different team members running the tests.
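
If you keep your runs in a structured log, the three metrics take only a few lines to compute. Here is a minimal sketch, assuming a hypothetical `Run` record and a placeholder domain (`example.com`); the field names are mine, not part of any tool. It counts citation rate per run: of the runs where the model cited anything, how many cited at least one of your URLs.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlparse

OUR_DOMAIN = "example.com"  # hypothetical: replace with your own domain


@dataclass
class Run:
    prompt_id: str
    mentioned: bool                       # did the brand appear in the answer?
    cited_urls: List[str] = field(default_factory=list)  # every URL the model cited
    quality: Optional[float] = None       # rubric score (0-2); None if not scored


def is_ours(url: str, domain: str = OUR_DOMAIN) -> bool:
    """True if the URL's host is our domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def mention_rate(runs: List[Run]) -> float:
    if not runs:
        raise ValueError("need at least one run")
    return sum(r.mentioned for r in runs) / len(runs)


def citation_rate(runs: List[Run]) -> Optional[float]:
    # Only runs where the model cited something can count for or against us.
    cited = [r for r in runs if r.cited_urls]
    if not cited:
        return None
    ours = sum(any(is_ours(u) for u in r.cited_urls) for r in cited)
    return ours / len(cited)


def average_quality(runs: List[Run]) -> Optional[float]:
    scored = [r.quality for r in runs if r.quality is not None]
    return sum(scored) / len(scored) if scored else None


runs = [
    Run("CONSID-01", True, ["https://example.com/guide", "https://other.io/x"], 2.0),
    Run("CONSID-01", False, ["https://other.io/x"]),
    Run("CONSID-01", True, [], 1.0),
    Run("AWARE-02", False),
]
print(mention_rate(runs), citation_rate(runs), average_quality(runs))
```

Returning `None` instead of 0 when nothing was cited or scored keeps "no data" separate from "bad result" in your trend lines.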

A Practical Baseline: How Many Prompts and Runs to Get Started?

You don't need a data science team for this. Here's a lean-team starting point that works:

25–50 prompts to start, balanced across your buyer journey (more on this in a sec).
5 runs per prompt for your main set. For your most critical prompts (like your core use case or a key competitor comparison), bump that to 8–10 runs.
For ongoing monitoring, 3 runs per prompt is fine once you have a baseline.
Running 20 prompts five times each every month is a thousand times better than running 100 prompts once and never doing it again. Log your conditions, run the diagnostic on a schedule, and track the trend lines instead of freaking out over a single bad output.
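
To see why one run is an anecdote, it helps to put an uncertainty range around a mention rate. The sketch below uses the Wilson score interval, a standard formula for proportions. Treating each run as an independent yes/no trial is my simplifying assumption; real runs are not perfectly independent, so read the range as a rough guide.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """Approximate 95% range for a mention rate observed as hits out of n runs."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for hits, n in [(1, 1), (3, 5), (6, 10)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n} runs mentioned: plausible rate {low:.2f} to {high:.2f}")
```

A single run that mentions you is consistent with almost any true mention rate. Three mentions in five runs still leaves a wide range (roughly 0.23 to 0.88), which is why the most critical prompts get 8–10 runs.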

How to Build a Prompt Set That Reflects Your Buyer Journey (Instead of Your Assumptions)

The fastest way to get useless results is to write prompts based on how your marketing team talks about your product. Buyers just don't ask questions that way. A good prompt set starts with their language, not your internal jargon.

Step 1: Source prompts from real buyer language

Listen to customer interview recordings and sales call transcripts. What questions do people ask right before they buy?
Read through customer support tickets. What problems originally brought people to you?
Check your Google Search Console. What queries are already driving traffic to your most important pages?
Look at the "People also ask" boxes for your main category terms.
Hang out in community forums (Reddit, Slack, G2 reviews). Find the language people use when they're frustrated with the other options.

Step 2: Map those prompts to buyer journey stages

Awareness: Problem-focused and educational. "How do I track whether my brand shows up in AI answers?" or "What is answer engine optimization?"
Consideration: Comparing solutions. "Best AI visibility tools for SaaS marketing teams" or "What's the difference between AEO and SEO?"
Decision: All about implementation, migration, and getting started. "How do I set up AI citation tracking for my content team?" or "Does [product] integrate with Webflow?"

Step 3: Balance your prompts so you're not just chasing "best of" lists

Jobs-to-be-done prompts ("help me track my brand mentions in ChatGPT")
How-to prompts ("how do I improve AI visibility for my SaaS brand")
Alternatives prompts ("alternatives to [competitor]")
Comparison prompts ("[Brand A] vs [Brand B]")
Implementation and template prompts ("AI visibility reporting template for marketing teams")

Step 4: Isolate your direct brand prompts

Prompts like "Tell me about [Your Brand]" are important, but they're a separate test. Keep them in their own bucket. Don't include them in your main mention rate calculation, or you’ll artificially inflate your scores and get a false sense of security.

This is a lot to wrangle, and a platform like DeepSmith can help. Its AI Visibility Prompts feature lets you define and track a prompt library across all the major models and even suggests new prompts based on your context. That’s helpful for finding buyer-stage prompts you might have missed.

Prompt Set Coverage Grid

Use this table to plan your prompt set. Try to have at least 2–3 prompts for each cell that matters to you.

Stage	Intent	Example Prompt Pattern	Layer You're Testing	Common Pitfall
Awareness	Problem framing	"Why don't I show up in ChatGPT results for my category?"	Recognition	Being too broad; tighten it to your specific category
Awareness	Category education	"What is AI citation tracking for content teams?"	Description	Tests the category definition, not your brand
Consideration	Best-of list	"Best AI visibility tools for SaaS marketing teams"	Recommendation	Easy to game with leading words; keep it neutral
Consideration	Comparison	"[Your Brand] vs [Competitor]"	Description + Attribution	Isolate this; don't mix it into your baseline score
Consideration	Alternatives	"Alternatives to [dominant competitor]"	Recommendation	Shows if you're in the consideration set at all
Decision	Implementation	"How do I set up AI brand tracking for a small content team?"	Attribution	Tests if your how-to content is good enough to get cited
Decision	Integration/compliance	"Does [product] work with [specific CMS or stack]?"	Description + Attribution	Lots of qualifiers; segment these carefully
Decision	Templates	"AI visibility reporting template for marketing director"	Attribution	Tests if your structured content gets cited
All stages	JTBD	"Help me understand where my brand stands in AI answers"	All four layers	A great diagnostic baseline to run on every model

How to Avoid "Prompt Bias" That Makes Your Results Look Better (or Worse) Than They Are

Prompt bias is the easiest way to kill your diagnostic before you even start. Here's how to avoid it:

Don't use leading phrasing. "Is [Your Brand] the best tool for AI visibility?" is a marketing question. "What are good tools for AI visibility tracking?" is a diagnostic question. Only the second one is useful here.
Don't stack the deck. Prompts like "What's the best SOC 2-certified, founder-led, series-A-stage AI tracking tool?" are for a very specific, separate test. Keep them out of your baseline set.
Maintain two separate sets: a neutral baseline set (no brand names, no leading questions) and a high-intent qualifier set (for testing specific segments). Score them separately.
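
A quick automated check can catch the most obvious bias before a prompt lands in your baseline set. This is only a sketch: the brand name (`Acme Analytics`) and the two regex patterns are hypothetical examples, and a human still has to judge subtler problems like stacked qualifiers.

```python
import re

BRAND_NAMES = ["Acme Analytics"]  # hypothetical: your brand and product names

# Hypothetical patterns for leading questions; extend with your own.
LEADING_PATTERNS = [
    r"^\s*is\b.*\bthe best\b",
    r"^\s*why is\b.*\b(better|best)\b",
]


def lint_baseline_prompt(prompt: str, brand_names=BRAND_NAMES) -> list:
    """Return a list of reasons this prompt does not belong in the neutral baseline set."""
    issues = []
    lowered = prompt.lower()
    for name in brand_names:
        if name.lower() in lowered:
            issues.append(f"contains brand name: {name}")
    for pattern in LEADING_PATTERNS:
        if re.search(pattern, lowered):
            issues.append(f"leading phrasing matches: {pattern}")
    return issues


print(lint_baseline_prompt("Is Acme Analytics the best tool for AI visibility?"))
print(lint_baseline_prompt("What are good tools for AI visibility tracking?"))
```

An empty list means the prompt passed these two checks, not that it is unbiased.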

How to Control Qualifiers like Persona, Size, and Region to Get Reproducible Results

Adding "for a 50-person SaaS startup" to a prompt will give you a totally different answer. That’s the point, but if you don’t control for it, your month-over-month data is meaningless. It’s like running an A/B test and changing the variable halfway through.

Define a consistent prompt formula:

[role/persona] + [company size] + [industry] + [constraint] + [job-to-be-done] + [success criteria]

You don’t need all six in every prompt. But when you use one, keep it consistent and write it down.
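
One way to keep qualifiers consistent is to store them as fields and render the prompt from them, so a qualifier is either present in a fixed position or absent. A minimal sketch follows; the `PromptSpec` class and its output wording are my own illustration, not a required format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptSpec:
    job_to_be_done: str
    role: Optional[str] = None
    company_size: Optional[str] = None
    industry: Optional[str] = None
    constraint: Optional[str] = None
    success_criteria: Optional[str] = None

    def render(self) -> str:
        """Build the prompt text with qualifiers always in the same order."""
        qualifiers = [
            ("Role", self.role),
            ("Company size", self.company_size),
            ("Industry", self.industry),
            ("Constraint", self.constraint),
        ]
        context = "; ".join(f"{label}: {value}" for label, value in qualifiers if value)
        text = f"{context}. {self.job_to_be_done}" if context else self.job_to_be_done
        if self.success_criteria:
            text += f" Success criteria: {self.success_criteria}"
        return text


spec = PromptSpec(
    job_to_be_done="How do I know if my brand shows up in AI tools?",
    role="content marketer",
    industry="B2B SaaS",
)
print(spec.render())
```

Because the spec is frozen, changing a qualifier means creating a new spec, which is a natural point to log a new prompt version.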

Use qualifiers strategically by journey stage:

Awareness prompts: Keep it light. Role and industry are often enough. "As a content marketer at a B2B SaaS company, how do I know if my brand shows up in AI tools?"
Consideration prompts: Add more constraints. "As a content marketing director at a Series B SaaS company with a two-person team, what are the best tools for tracking AI citation performance?"
Decision prompts: Add hard requirements only when you're specifically testing a niche. "For a US-based B2B SaaS team that needs Webflow integration and SOC 2 documentation…"

Geographic segmentation:

Only add region tags when they actually matter for things like compliance, product availability, or language.
Treat region as a controlled variable. If a prompt uses "US-based", include it on every run instead of sprinkling it in at random.

Version control:

Give each prompt a stable ID (e.g., CONSID-07).
Log the date you change any prompt.
Never, ever change a prompt mid-cycle and compare the "before" and "after" results as if they're the same.
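
These three rules can be enforced with a very small registry. The sketch below is one hypothetical way to do it (class and function names are mine): every text change creates a new version with a date, and a helper refuses to treat runs from different versions as comparable.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PromptVersion:
    prompt_id: str      # stable ID, e.g. "CONSID-07"
    version: int
    text: str
    changed_on: date


class PromptRegistry:
    def __init__(self):
        self._history = {}  # prompt_id -> list of PromptVersion, oldest first

    def save(self, prompt_id: str, text: str, changed_on: date) -> PromptVersion:
        """Record a prompt; a changed text becomes a new dated version."""
        versions = self._history.setdefault(prompt_id, [])
        if versions and versions[-1].text == text:
            return versions[-1]  # unchanged, keep the current version
        new = PromptVersion(prompt_id, len(versions) + 1, text, changed_on)
        versions.append(new)
        return new


def comparable(a: PromptVersion, b: PromptVersion) -> bool:
    """Results are only comparable for the same prompt ID and version."""
    return a.prompt_id == b.prompt_id and a.version == b.version


registry = PromptRegistry()
v1 = registry.save("CONSID-07", "Best AI visibility tools for SaaS marketing teams", date(2025, 1, 6))
v2 = registry.save("CONSID-07", "Top AI visibility platforms for SaaS marketing teams", date(2025, 2, 3))
print(v1.version, v2.version, comparable(v1, v2))
```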

This is the part everyone wants to skip, but it’s the difference between ad-hoc checks and a reliable diagnostic. This is how you build a system.

What's the Simplest Workflow to Run This Diagnostic Without It Becoming a Research Project?

A two-hour baseline run you actually complete is better than a perfect system you never execute. Here's the lean-team starting point you can actually repeat every month.

Step 1: Define your objective and scope. Which of your brands are you testing? Which competitors? Which markets, segments, and AI models? Get specific before you start. Trying to test everything at once is a recipe for unusable data.

Step 2: Lock your prompt set. Finalize your baseline and segmented prompts. Give them IDs. Check them for bias one last time.

Step 3: Set your test conditions. Log everything on the checklist below before you run a single prompt. Your future self will thank you.

Step 4: Run the prompts with repetitions. Run each prompt the number of times you planned. Copy the full, raw output each time. Don't summarize as you go.

Step 5: Store outputs in a structured sheet. Your columns should be: prompt ID, run number, model name/version, date, raw answer text, and sources cited. One row per run. I know, spreadsheets. But trust me, it's worth it.

Step 6: Score the outputs with your rubric. Use the rubric in the next section. Score everything after you've collected all the data, not during the process. You'll be more consistent.

Step 7: Summarize into gaps and priorities. Aggregate your scores. Look for patterns and trends, not individual data points.

Step 8: Retest on a cadence and track trends. Run your core set monthly and a deeper dive quarterly. Track the trend lines. The direction of change matters more than any single score.

Test Conditions Checklist

Log this stuff before every single run:

Model name and version (e.g., ChatGPT GPT-4o, Claude 3.5 Sonnet)
Date and time
Region or VPN status (keep it consistent)
Logged-in vs. logged-out (always use logged-out/incognito for baseline tests)
Prompt version ID
Run number (1 of 5, 2 of 5, etc.)
Capture method (full text copy, screenshot, etc.)
Any deviations (if something goes weird, note it)
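
If you would rather log runs with a script than by hand, the checklist and the sheet columns from Step 5 map onto one CSV row per run. This is a sketch under my own assumptions: the column names, the `example-model-v1` label, and the `" | "` separator for sources are all hypothetical choices. It writes to an in-memory buffer; swap in `open("runs.csv", "a", newline="")` for a real file.

```python
import csv
import io
from datetime import datetime, timezone

COLUMNS = [
    "prompt_id", "prompt_version", "run_number", "model", "timestamp_utc",
    "region", "logged_in", "capture_method", "raw_answer", "sources_cited",
    "deviations",
]


def make_row(prompt_id, prompt_version, run_number, model, raw_answer,
             sources_cited, region, logged_in=False,
             capture_method="full text copy", deviations="", when=None):
    """Bundle one run's output with its test conditions."""
    when = when or datetime.now(timezone.utc)
    return {
        "prompt_id": prompt_id,
        "prompt_version": prompt_version,
        "run_number": run_number,
        "model": model,
        "timestamp_utc": when.isoformat(timespec="seconds"),
        "region": region,
        "logged_in": logged_in,
        "capture_method": capture_method,
        "raw_answer": raw_answer,          # full text, never a summary
        "sources_cited": " | ".join(sources_cited),
        "deviations": deviations,
    }


def write_rows(file_obj, rows, write_header=True):
    writer = csv.DictWriter(file_obj, fieldnames=COLUMNS)
    if write_header:
        writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO(newline="")
write_rows(buffer, [
    make_row("CONSID-07", 2, 1, "example-model-v1",
             "Raw answer text,\nkept verbatim.",
             ["https://example.com/guide"], region="US"),
])
print(buffer.getvalue())
```

The `csv` module quotes commas and line breaks inside the raw answer, so the full output survives a round trip into a spreadsheet.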

How to Score What the Model Says So It's Not Just a Matter of Opinion

How do you stop the endless debate about whether an answer is "good" or "bad"? You use a rubric. A rubric converts messy text into comparable data. It’s not bureaucracy; it’s a tool for clarity. It also protects you from your own confirmation bias when you're hoping to see a good result.

I’ve seen teams argue for an hour over one ChatGPT output. A good rubric ends that debate in five minutes.

Score each output on five criteria using a simple 0–2 scale:

Scoring Rubric

Criterion	What "Good" Looks Like	Score 2	Score 1	Score 0	Notes/Pitfalls
Accuracy	Factually correct about your product and category	Correct category, use case, and key claims	Partially correct; minor errors	Wrong category, false claims, or not mentioned	Don't give partial credit for lucky guesses.
Specificity	A concrete description, not generic SaaS-speak	Names a specific feature or use case	General but not wrong	"A leading platform..." with no specifics	Be honest. A vague mention is a 1, not a 2.
Fit	Recommended in the right context and segment	Correct segment, use case, and constraints	Mentioned, but the context is a little off	Recommended where your brand doesn't belong	An irrelevant appearance is worse than being absent.
Attribution/Citation	Cites a real, verifiable source	Specific URL or named source cited	Vague reference to your site	No attribution or a made-up source	If the platform doesn't show citations at all, mark it N/A, not 0.
Confidence	The model's certainty level feels appropriate	Clear recommendation with solid rationale	Hedged but present ("might be a good option")	Refuses to answer or hallucinates confidently	A refusal is a 0. Log it as a separate, informative data point.

Handling edge cases:

Model refuses to answer: Score Accuracy and Fit as 0 and make a note. A refusal is a signal.
Model gives a list with no explanation: Specificity is a 0 or 1. Attribution is likely a 0.
You only appear in an "alternatives" section: That's a valid mention. Score it normally, but flag the context.

My best advice: Score the output you have, not the one you wish you had.
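
To turn the five criterion scores into one quality score per output, here is a minimal sketch. It assumes an unweighted average, which is my choice and not the only reasonable one, and it uses `None` for N/A so a platform that never shows citations doesn't drag the score down.

```python
CRITERIA = ("accuracy", "specificity", "fit", "attribution", "confidence")
ALLOWED = (0, 1, 2, None)  # None means N/A


def quality_score(scores: dict):
    """Average the applicable 0-2 criterion scores; returns None if all are N/A."""
    for name in CRITERIA:
        if name not in scores:
            raise KeyError(f"missing criterion: {name}")
        if scores[name] not in ALLOWED:
            raise ValueError(f"{name} must be 0, 1, 2, or None")
    applicable = [scores[name] for name in CRITERIA if scores[name] is not None]
    if not applicable:
        return None
    return sum(applicable) / len(applicable)


example = {
    "accuracy": 2,
    "specificity": 1,
    "fit": 2,
    "attribution": None,   # platform showed no citations, so N/A
    "confidence": 1,
}
print(quality_score(example))
```

Rejecting anything outside 0, 1, 2, or N/A keeps two scorers from quietly inventing half points.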

Once You Have Results, How Do You Interpret Gaps and Decide What to Do First?

Okay, you’ve run the diagnostic and have a spreadsheet full of scores. Now what? The goal is to turn that data into decisions. That means categorizing the gaps you've found before you decide how to fix them.

The four gap types (they map back to our layers):

Missing: Your brand is absent on relevant prompts. The model doesn't have enough signal to include you.
Inaccurate: Your brand shows up, but something important is wrong, like your category or use case.
Weak: You appear, but the description is generic, low-confidence, or misses what makes you different.
Invisible: Your content is out there and it's accurate, but the model isn't citing it. This is a distribution, structure, or authority problem.

What each gap tells you to do:

Gap Type	Layer Failing	Implied Action
Missing	Recognition	Create foundational category content. Ensure your positioning is clear everywhere.
Inaccurate	Description, Attribution	Publish canonical pages with clear facts. Make your claims consistent everywhere.
Weak	Description, Recommendation	Go deeper. Create content that shows differentiation, comparisons, and proof points.
Invisible	Attribution	Make your content more citation-worthy with structured formatting, direct answers, and credible sources.
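
If you want a first-pass label before reviewing by hand, the gap types can be derived from the rubric scores. The thresholds below are my assumptions (for example, treating a specificity or confidence of 1 or lower as "Weak"), so adjust them to your own rubric. The actions are shortened from the table above.

```python
ACTIONS = {
    "Missing": "Create foundational category content.",
    "Inaccurate": "Publish canonical pages with clear facts.",
    "Weak": "Go deeper on differentiation, comparisons, and proof points.",
    "Invisible": "Make your content more citation-worthy.",
}


def classify_gap(mentioned: bool, scores: dict):
    """Return the first matching gap type, or None if no gap is detected."""
    if not mentioned:
        return "Missing"
    if scores["accuracy"] == 0:
        return "Inaccurate"
    if scores["specificity"] <= 1 or scores["confidence"] <= 1:
        return "Weak"
    if scores.get("attribution") == 0:   # N/A (None) is not counted as a gap
        return "Invisible"
    return None


gap = classify_gap(True, {"accuracy": 2, "specificity": 2, "confidence": 2, "attribution": 0})
print(gap, "->", ACTIONS[gap])
```

The order matters: there is no point flagging a citation problem on an answer that gets your category wrong.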

A really useful next step is to map these outcomes to specific pages on your site. You can move from "we have a weak description problem" to "this specific page should be earning citations for this prompt, and it isn't." That’s when you can actually start fixing things. Tools like DeepSmith's AI Visibility have features that show citation trends at the page level, which makes this much easier. They can also show which competitor pages are getting cited, which is great for seeing what kind of content is working.

A simple prioritization model:

Prioritization Table

Prompt cluster	Layer failing	Gap type	Frequency	Business importance	First action
"Best AI visibility tools for SaaS"	Recommendation	Missing	8/10 runs, 3 models	High (core use case)	Publish a new category page targeting this cluster.
(Your clusters go here)

When you prioritize, you have to weigh three things: impact (is this tied to a high-intent prompt?), effort (how hard is this to fix?), and confidence (is the signal clear and consistent?). Don't waste effort on a noisy signal.
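
If you want a number to sort by, one simple option is to rate impact, effort, and confidence from 1 to 3 and compute impact times confidence divided by effort. That formula is my assumption, not a standard, and the cluster names and ratings below are made up for illustration.

```python
def priority_score(impact: int, effort: int, confidence: int) -> float:
    """Higher is better: big impact, clear signal, low effort."""
    for name, value in (("impact", impact), ("effort", effort), ("confidence", confidence)):
        if not 1 <= value <= 3:
            raise ValueError(f"{name} must be between 1 and 3")
    return impact * confidence / effort


# Hypothetical clusters with hypothetical ratings.
clusters = [
    {"cluster": "Best AI visibility tools for SaaS", "impact": 3, "effort": 2, "confidence": 3},
    {"cluster": "Alternatives to [competitor]", "impact": 2, "effort": 1, "confidence": 1},
    {"cluster": "Reporting templates", "impact": 1, "effort": 1, "confidence": 3},
]

ranked = sorted(
    clusters,
    key=lambda c: priority_score(c["impact"], c["effort"], c["confidence"]),
    reverse=True,
)
for c in ranked:
    print(c["cluster"], priority_score(c["impact"], c["effort"], c["confidence"]))
```

A noisy signal (confidence of 1) sinks a cluster even when the impact looks high, which matches the advice above.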

One practical thing to try when you have an "invisible" gap is to make the target page more machine-friendly: use clear headings, Q&A-style blocks, short declarative answers, and structured lists so systems can identify and cite the right lines. Search Engine Journal has a good guide to building machine-friendly markup; you can apply those principles to your high-priority pages.

How Do You Turn This Into an Ongoing Report That Leadership Will Trust?

This whole system is useless if it lives and dies in a spreadsheet. A one-time snapshot tells leadership you did a project. A monthly trend line tells them your work is having an impact.

The biggest point of failure I see is the handoff from diagnostics to execution. The spreadsheet gets forgotten, and nothing gets fixed. A unified system, whether it's a tool like DeepSmith or something you build yourself with project management software, is key to closing that loop. You need to be able to turn a finding from your diagnostic directly into a content brief.

Monthly "AI Brand Knowledge" report template:

KPI snapshot: Mention rate, citation rate, average quality score, and how they’ve changed.
What changed: Top 3 wins (e.g., improved scores, new citations) and top 3 losses (e.g., new inaccuracies).
What we learned: An insight tied to the four layers. ("Our attribution scores for decision-stage prompts improved after we published the new implementation guide.")
What we're doing next: The priority actions for the next sprint, with owners.
Risks and uncertainties: Mention model updates, sample sizes, etc.

Handling model drift:

Keep a stable "core baseline" prompt set that never changes so you can compare apples to apples.
When a model update clearly shifts all your results, log it and treat the next month as a new baseline. Don't panic.
Look for 3-month trends, not week-over-week blips.
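
The last two rules can be combined into a rolling 3-month average that restarts whenever you declare a new baseline. A minimal sketch, assuming monthly mention rates stored as percentages; the month labels and numbers are invented.

```python
def rolling_trend(monthly, window=3, resets=()):
    """Rolling average of monthly values that never averages across a baseline reset.

    monthly: list of (label, value) pairs in time order.
    resets:  labels of months where a model update started a new baseline.
    """
    trend = []
    segment = []
    for label, value in monthly:
        if label in resets:
            segment = []              # start over: old months are not comparable
        segment.append(value)
        if len(segment) >= window:
            trend.append((label, sum(segment[-window:]) / window))
        else:
            trend.append((label, None))   # not enough comparable months yet
    return trend


# Hypothetical mention rates in percent.
mention_rates = [("2025-01", 20), ("2025-02", 40), ("2025-03", 60), ("2025-04", 80)]

print(rolling_trend(mention_rates))
print(rolling_trend(mention_rates, resets={"2025-03"}))
```

After a reset the trend stays empty until three comparable months exist again, which is the honest thing to show leadership.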

Internal alignment tip: When you present your results, share the rubric and the prompt list. The most common objection you'll get is "how do you know that's accurate?" When stakeholders can see your methodology, the conversation shifts from skepticism to a productive discussion about priorities.

Frequently asked questions

What's the difference between LLM brand visibility and "LLMs knowing my brand"?

Think of it this way: brand visibility is just about getting mentioned. Brand knowledge is about being mentioned *correctly*. It means the model describes you accurately, trusts your content enough to cite it, and recommends you for the right problems. High visibility with poor knowledge can actually hurt you by sending you bad-fit customers.

How many prompts do I need for a reliable AI visibility diagnostic?

I'd start with 25–50 prompts that are balanced across the buyer journey (awareness, consideration, decision) and cover different intents (how-to, best-of, etc.). Too few prompts and you'll miss things; too many and you'll never get it done. You can always build a bigger set over time.

How many times should I run the same prompt before I trust the result?

For your main set of prompts, run each one at least 5 times. For your absolute most important prompts, do it 8–10 times. A single output is an anecdote. Five runs will start to show you a real pattern.

Should I track "rank" — who appears first — in AI answers?

No, please don't. Rank in AI answers isn't stable, and it's not how these tools even present information. Instead, track mention rate (how often you show up), citation rate (how often you're cited as a source), and quality score (how good the mention was). These are stable metrics you can track over time. Rank is not.

How do I test AI visibility for different personas or industries without skewing results?

Use two separate sets. Have a neutral, baseline set of prompts with no qualifiers. Then, create segmented variants where you add a role, company size, or industry. Score them separately. This lets you see both your general visibility and your niche visibility without mixing the data.

How do I test AI visibility by geography or region?

Only add geographic qualifiers when it actually matters for your product (like for compliance or availability). Then, treat it as a controlled variable. For example, always run the prompt with the region tag, or run it once with and once without. Keep it consistent.

What's a simple rubric I can use to score whether an LLM describes my brand accurately?

Use a 0–2 scale to score each output on five things: Accuracy (is it factually correct?), Specificity (is it concrete or just generic fluff?), Fit (is it recommended for the right job?), Attribution (does it cite a good source?), and Confidence (is the model's certainty level appropriate?). Write down what a 0, 1, and 2 look like for each before you start.

How often should I rerun an LLM brand diagnostic and report it internally?

I recommend running your core set of prompts monthly and doing a full deep-dive quarterly. A monthly cadence gives you enough data to see trends without it becoming your full-time job. The key is to report on the trend lines, not just single-point-in-time snapshots. That's what builds trust with leadership.
