Evidence log · in progress

Can AI customer support chatbots answer FAQs without hallucinating?

We are testing customer-support chatbot tools against the same synthetic Harbor & Pine Home Goods FAQ knowledge base and 50-question answer key. This page is the public log of what we can verify so far: free/private testing boundaries, source-ingestion options, export cautions, and the checks that must pass before any bot-quality score is published.

Short answer

No chatbot winner yet. The current evidence is availability research, not answer-quality proof.

We have not scored any customer-support chatbot because no tool has produced a saved 50-question Harbor & Pine transcript in a private, reproducible no-spend run. The first Chatbase, Help Scout, Crisp, Intercom Fin, and Tidio signup recons added setup-friction evidence, not answer-quality proof.

Chatbase still looks promising on paper, but the scheduled-browser run hit a Cloudflare verification failure at signup. Help Scout also reached signup, but the form could not be submitted because the Create account button stayed disabled with no visible error. Crisp reached an email/password signup gate with no visible card or CAPTCHA prompt, but account creation was stopped before password submission because the current scheduled-browser path could not safely store a confirmed password in Bitwarden. Intercom Fin's Start free trial path stopped at a human-verification pre-check before any signup form appeared. Tidio also reached an email/password signup gate, with a No credit card required claim and no visible payment prompt, and account creation was stopped before password submission for the same credential-storage reason as Crisp. All five tools should be retried from a normal/manual browser, or from a workflow with secure vault injection or lower friction.

Test fixture

One FAQ knowledge base, 50 prompts, explicit hallucination traps.

The fixture uses a fictional home-goods retailer, Harbor & Pine. The source document defines shipping, returns, product-care, discount, warranty, support, and escalation policies. The answer key includes routine questions plus traps that should trigger uncertainty or handoff instead of confident invention.

Can the bot answer 50 Harbor & Pine FAQ prompts from the provided source document without inventing policies?
Does it refuse or escalate hallucination traps about discounts, medical claims, unavailable products, and fake contact details?
Can an operator preview/test the bot privately before installing a live website widget?
Can raw conversations be exported or copied in a way readers could audit?
Are source-grounding, fallback, escalation, privacy, deletion, and AI-training controls visible enough for an advanced beginner?
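The first two questions lend themselves to a mechanical first pass before human review. The sketch below shows one way such a check might look. It assumes a hypothetical answer-key shape (`KeyEntry`), an illustrative list of handoff phrases, and placeholder answers that are not taken from the real Harbor & Pine fixture. Substring matching is crude, so a result here would only flag answers for a reviewer, not replace the rubric.

```python
from dataclasses import dataclass

# Hypothetical phrases that signal uncertainty or handoff; tune per tool.
HANDOFF_MARKERS = ("i don't know", "i'm not sure", "human agent", "contact support")


@dataclass(frozen=True)
class KeyEntry:
    """One answer-key row (field names are assumptions, not the real key)."""
    prompt_id: str
    is_trap: bool
    required_phrases: tuple = ()


def grade(entry: KeyEntry, answer: str) -> str:
    """Return 'pass' or 'fail' for one bot answer."""
    text = answer.lower()
    if entry.is_trap:
        # Traps pass only when the bot declines or escalates.
        return "pass" if any(m in text for m in HANDOFF_MARKERS) else "fail"
    # Routine prompts pass only when every required phrase appears.
    return "pass" if all(p.lower() in text for p in entry.required_phrases) else "fail"


def trap_failure_rate(key: list, answers: dict) -> float:
    """Share of trap prompts where the bot answered instead of escalating."""
    traps = [e for e in key if e.is_trap]
    if not traps:
        return 0.0
    # A missing answer counts as a failure so gaps cannot flatter the result.
    failed = sum(grade(e, answers.get(e.prompt_id, "")) == "fail" for e in traps)
    return failed / len(traps)


# Placeholder data for illustration only.
key = [
    KeyEntry("q01", is_trap=False, required_phrases=("placeholder policy",)),
    KeyEntry("q02", is_trap=True),
]
answers = {
    "q01": "Per the FAQ: placeholder policy applies.",
    "q02": "Yes, use code SAVE50 for half off!",  # an invented discount
}
```

With this data, `grade(key[0], answers["q01"])` passes and the trap answer fails, because the bot confirmed a discount instead of escalating. Counting missing answers as failures is a deliberate choice: an incomplete transcript should never look better than a complete one.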

Current evidence

What the availability and first setup pass found

Chatbase

Status: Scheduled-browser signup blocked; normal-browser retest needed

Setup: Free plan documented; Cloudflare verification failed at signup

Score: Not scored

Chatbase remains a strong candidate from public docs because it documents a free no-card plan, private Playground testing, multiple data-source types, source-quality controls, human escalation actions, and a conversations export API. The first scheduled-browser signup recon reached the email/password form, but Cloudflare verification failed before account creation, so no Harbor & Pine bot or transcript exists yet.

Help Scout AI / Beacon

Status: Scheduled-browser signup submit disabled; normal-browser retest needed

Setup: Free plan / 15-day trial documented; Create account stayed disabled

Score: Not scored

Help Scout still looks useful from public docs because it documents an AI Agent Test tab, Docs/source controls, Review Conversations, reporting, and Mailbox API endpoints. The first scheduled-browser signup recon reached the secure registration form, but after valid neutral-persona fields were filled, Create account remained disabled with no visible CAPTCHA or inline error. No account, AI Agent, Harbor & Pine source, or transcript exists yet.

Crisp Hugo

Status: Scheduled-browser signup reached password gate; normal-browser or secure vault workflow needed

Setup: 14-day no-card trial documented; email/password gate reached without visible card/CAPTCHA prompt

Score: Not scored

Crisp still looks promising from public docs because it documents a no-card trial, Hugo training on websites, knowledge bases, TXT/CSV/PDF files, and Q&A snippets, escalation configuration, and chat transcripts/export. The scheduled-browser run reached Step 1/3 of signup with neutral-persona fields accepted, but account creation was stopped before password submission because the available automation path could expose a generated password before safe Bitwarden storage. No account, Hugo bot, Harbor & Pine source, or transcript exists yet.

Intercom Fin

Status: Scheduled-browser signup blocked by human verification; normal-browser retest needed

Setup: 14-day no-card Fin trial documented; Start free trial hit a human-verification pre-check

Score: Not scored

Intercom still looks useful from public docs because it documents a no-card trial, Fin previews, batch testing, knowledge sources, content enable/disable controls, escalation guidance, and conversation export. The scheduled-browser run clicked Start free trial, but a human-verification pre-check appeared before any signup form, workspace, Fin preview, source setup, or transcript export path. No account, credential, bot, Harbor & Pine source, or transcript exists yet.

Tidio Lyro

Status: Scheduled-browser signup reached password gate; normal-browser or secure vault workflow needed

Setup: 7-day trial / free Lyro quota documented; email/password gate reached with no credit card prompt visible

Score: Not scored

Tidio still looks useful from public docs because it documents Lyro source ingestion, guidance controls, testing/preview language, and transcript export. The scheduled-browser run reached a Create a free account form with work email, password, website, terms checkbox, Facebook signup, and a No credit card required claim. The neutral email and project URL were accepted, but account creation was stopped before password submission because the available automation path could expose a generated password before safe Bitwarden storage. No account, Lyro bot, Harbor & Pine source, or transcript exists yet.

Zendesk AI agents

Status: Partial candidate; no-card/export boundaries unclear

Setup: Free trial documented; no-card proof not confirmed

Score: Not scored

Zendesk has strong enterprise AI-agent and sandbox documentation, but the public pass did not confirm a no-card boundary for this scenario and native AI-agent ticket transcript export is not straightforward. Keep it as an enterprise benchmark until payment and evidence-export paths are clearer.

Gorgias AI Agent

Status: Partial candidate; ecommerce sandbox follow-up needed

Setup: Trial and free sign-up language found; no-card proof not confirmed

Score: Not scored

Gorgias is interesting for ecommerce support, with AI Agent guidance, help-center/page sources, handover instructions, and APIs for tickets/messages. Public docs suggest developer sandbox access is gated/manual and ecommerce context may require real store integration, so no hands-on run should connect real customer or store data.

Jotform AI Agents

Status: Partial candidate; export/handoff follow-up needed

Setup: Free Starter/free AI Agents documented; no-card wording not confirmed

Score: Not scored

Jotform documents free AI Agents, private/login-protected conversations, knowledge-base training from URLs/docs/Q&A/text, and conversation review. Direct transcript export and detailed fallback/handoff controls remain unclear from public docs, so it stays unscored until in-account verification.

Beginner workflow

Never launch a support bot until it passes a boring FAQ test.

A safe FAQ-bot workflow separates source setup, private preview, answer testing, handoff rules, and export/audit. A chatbot that sounds polished can still invent refunds, discounts, warranty promises, allergens, delivery timelines, or contact details that your business never approved.

The first usable score will require raw prompt/answer transcripts, screenshots of source and fallback settings, timing notes, and a completed rubric. Public pricing/help claims alone are not enough for a recommendation.
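The evidence requirements above can be expressed as a simple gate. This is a minimal sketch, assuming a hypothetical `RunEvidence` record whose field names are invented for illustration; only the 50-prompt count and the list of required items come from this log.

```python
from dataclasses import dataclass

REQUIRED_PROMPTS = 50  # size of the Harbor & Pine answer key


@dataclass
class RunEvidence:
    """Evidence collected for one tool (field names are hypothetical)."""
    tool: str
    transcript_rows: int = 0
    source_settings_screenshot: bool = False
    fallback_settings_screenshot: bool = False
    timing_notes: bool = False
    rubric_complete: bool = False


def missing_evidence(run: RunEvidence) -> list:
    """List every evidence item still missing for this run."""
    missing = []
    if run.transcript_rows < REQUIRED_PROMPTS:
        missing.append(f"transcript ({run.transcript_rows}/{REQUIRED_PROMPTS} prompts)")
    if not run.source_settings_screenshot:
        missing.append("source settings screenshot")
    if not run.fallback_settings_screenshot:
        missing.append("fallback settings screenshot")
    if not run.timing_notes:
        missing.append("timing notes")
    if not run.rubric_complete:
        missing.append("completed rubric")
    return missing


def score_status(run: RunEvidence) -> str:
    """A run stays 'Not scored' until nothing is missing."""
    return "Not scored" if missing_evidence(run) else "Ready to score"


print(score_status(RunEvidence("Chatbase")))  # prints "Not scored"
```

Returning the list of missing items, rather than a bare true/false, keeps the log honest about why a score is still blank.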

Who should use this workflow?

Useful for FAQ deflection. Risky for sensitive or live support without review.

Best fit: small teams testing routine FAQ coverage, product-care answers, shipping/return policy lookup, and first-line support triage.
Be careful if: support touches legal, medical, financial, safety, warranty, regulated-product, or high-value account issues.
Do not skip: private preview, source locking, fallback/escalation rules, transcript review, deletion controls, and a human review loop.
Not tested yet: Harbor & Pine answer accuracy, hallucination rate, private-bot transcript export, analytics usefulness, and paid-plan behavior.

Disclosure and status

This is an in-progress evidence log, not a final ranking.

No affiliate links are used on this page. No payment method, personal account, real customer data, live widget installation, public chatbot deployment, or paid trial was used for these checks. Scores stay blank until raw private-bot transcripts and completed run notes exist.

Last updated: June 1, 2026. Next evidence goal: retry Chatbase, Help Scout, Crisp, Intercom Fin, or Tidio from a normal/manual browser, or from a workflow with secure password injection from a vault or otherwise lower friction, and store any confirmed credentials safely. Scores remain blank until raw private-bot transcript evidence is saved.