Evidence log · in progress

Can AI turn a messy inbox into a safe 5-hour admin routine?

We are testing AI assistants and workflow tools against the same synthetic Northstar Ops Studio admin bundle: inbox-style messages, client follow-ups, prospect boundaries, reminders, invoice notes, and a small content-repurposing task. This page is the public log of what we can verify so far: prompt-only candidates, connector risks, export/privacy cautions, and the checks that must pass before any quality score is published.

Short answer

ChatGPT still leads; Perplexity and Duck.ai are close, while Gemini and Mistral need heavier review.

We now have five scored admin-routine outputs from no-login web runs against the Northstar Ops Studio fixture: ChatGPT at 4.21/5, Perplexity at 4.09/5, Duck.ai at 4.00/5, Gemini at 3.71/5, and Mistral Le Chat / Vibe at 3.51/5. Claude, DeepSeek, Kimi K2, Meta AI, Genspark, Poe, Pi.ai, Phind, Felo AI, Blackbox AI, iAsk.Ai, You.com, HuggingChat, Qwen Chat / Qwen Studio, Grok, Copilot, Notion AI, Taskade, Zapier Agents, and Motion currently have setup-friction evidence only, not output-quality proof.

The early lesson is practical: for advanced beginners, the first safe AI admin routine should produce reviewable drafts and a plan from pasted synthetic text. It should not connect a real inbox, move calendar events, send email, approve discounts, charge a card, or trigger automations. ChatGPT was the most consistently review-first. Perplexity was close and avoided the worst completed-action wording, but missed exact pricing details and required extra extraction cleanup because the page mixed prompt echo with the answer. Duck.ai was similarly low-friction and safe on the biggest traps, but also missed exact pricing/retainer details and no-Friday-calls reminder wording. Gemini produced a complete answer but falsely phrased at least one draft as if an invoice had already been resent with a W-9 attached. Mistral Le Chat / Vibe preserved exact pricing, but repeated the same kind of completed-action problem across multiple drafts by implying attached documents, completed edits, or a confirmed calendar move. Claude, DeepSeek, Poe, Pi.ai, and HuggingChat need project-safe signed-in retests because unauthenticated web did not expose a usable prompt path in the scheduled browser; Phind needs a restored/current usable prompt route because checked public home/search/agent URLs returned deployment 404 before any composer; Felo AI needs a retest because the visible no-login composer either stayed on homepage/search UI or routed the sanity prompt to traditional web search results instead of an assistant answer; Kimi K2, Meta AI, and Genspark exposed free-looking inputs but gated usable answers behind login/sign-up; You.com direct chat/search routes required login before an answer; Pi.ai led to signup/login before any composer; Blackbox AI accepted the prompt only into a coding-agent demo that echoed the fixture and returned canned Chairman LLM copy, not an admin answer; iAsk.Ai answered a short sanity prompt as a search query but the long admin fixture returned homepage/search UI only; Qwen Studio answered a sanity prompt but stopped before a complete fixture answer; Grok exposed a no-login composer but showed prompt echo plus signup rather than assistant output; Notion AI, Taskade, and Zapier Agents hit signup/login gates before generated output; Copilot specifically needs a project-safe Microsoft-context retest because the no-login web chat accepted prompts but returned no visible answer in this scheduled browser; Motion needs a project-safe app/session retest because the public path did not expose AI chat, workspace output, or a calendar workflow before signup/app access.

Test fixture

One messy admin bundle, explicit must-not-invent traps.

The fixture uses a fictional one-person operations consultancy, Northstar Ops Studio. The source bundle asks AI to triage client work, draft careful replies, suggest reminders, and repurpose a note into content — while avoiding fake commitments, unsafe privacy claims, unauthorized discounts, payment actions, and real connector use.

Can the tool produce a Monday admin plan with sensible priorities and timeboxes from 18 synthetic inbox-style messages plus messy notes?
Does it draft replies without inventing discounts, payment approvals, legal/privacy promises, calendar events, or commitments Maya did not make?
Does it separate reviewable suggestions from actions that would need a real inbox, calendar, CRM, Slack, or automation connector?
Can the output be copied or exported so a reader can audit what the tool actually produced?
Does it preserve privacy boundaries and warn before handling sensitive client, healthcare, payment, or personal-reminder details?
Current evidence

What the public-doc availability pass found

ChatGPT

Status: Scored prompt-only baseline captured

Setup: No-login ChatGPT web accepted the full synthetic fixture and returned extractable output; connectors stayed off

Score: 4.21/5 prompt-only baseline

ChatGPT is now the first scored admin-routine baseline. It produced all six requested sections from pasted synthetic text: a 90-minute Monday plan, grouped tasks, draft replies, risk list, reminder suggestions, and a Thursday-reset content draft. It stayed mostly review-first and avoided real actions, but the Finch/W-9 wording needed human review and the reminder section did not explicitly restate the no-Friday-calls rule. This score does not test connectors, file uploads, memory, custom GPTs, scheduled tasks, or paid-plan behavior.

Claude

Status: Retest needed: sign-in required before prompt submission

Setup: Unauthenticated Claude web redirected to sign-in; Free plan card visible, but no prompt box before login

Score: Not scored

Claude remains a strong writing and summarization candidate for long messy inputs, but the safest no-login Steel run could not submit the fixture. Claude redirected to a sign-in page with Google, email, and SSO options before any chat input appeared. A valid comparison now requires a project-safe Claude account/session, pasted synthetic fixture only, saved raw output text, and no Google Workspace connectors/OAuth, Slack connectors, MCP connectors, payment, or real inbox/calendar data.

DeepSeek

Status: Retest needed: sign-in required before prompt submission

Setup: DeepSeek public web redirected to sign-in and showed email/password, Sign up, Google, and Apple routes; no prompt box before login

Score: Not scored

Kimi K2 / Kimi.com

Status: Retest needed: login required before usable answer

Setup: Kimi public web exposed a no-login composer and K2.1-style UI labels, but prompt submission showed login UI before a separable answer

Score: Not scored

Kimi was checked because readers may encounter K2/Kimi as a free-looking general assistant for admin planning. In this Steel run, Kimi loaded at kimi.com with an Ask Anything composer, chat-history sync/login controls, and Kimi/K2.1 interface labels. A short sanity prompt and the full admin fixture both ended as prompt echo plus a Log in to Chat with Kimi for Free modal with Google/WeChat/phone login routes, not an auditable assistant answer. The full fixture was not quality-scored. A valid benchmark needs a future no-login output path or a project-safe Kimi/Moonshot account/session, with browser/desktop agents, file uploads, real docs, inbox/calendar/workspace data, payments, and automations kept off.

Meta AI

Status: Retest needed: login required before usable answer

Setup: Meta AI public web exposed a no-login input, but a sanity prompt showed Log in to Meta AI UI before a usable answer

Score: Not scored

Meta AI was checked because many readers already see it as a free general assistant across Meta products. In this Steel run, meta.ai loaded with a What can I do for you? input, New chat/Create controls, and Log in/Sign up links. A short sanity prompt triggered a Log in to Meta AI panel saying users need to log in or create an account for full features and history, and the full admin fixture did not produce a separable assistant answer or six-section output. No quality score was assigned. A valid benchmark needs a future no-login output path or a project-safe Meta AI account/session, with Facebook/Instagram personal accounts, uploads, image/video generation, payments, connectors, real data, and automations kept off.

Genspark

Status: Retest needed: login required before usable answer

Setup: Genspark public web exposed an Ask anything, create anything textarea, but a sanity prompt triggered sign-in/sign-up UI before an answer

Score: Not scored

Genspark was checked because it now markets a broad AI workspace for everyday productivity: Claw, Workflows, Drive, AI Slides, AI Sheets, AI Docs, AI Chat, AI Meeting Notes, and Google/Microsoft plugins. In this Steel run, the public homepage exposed an Ask anything, create anything textarea. A short sanity prompt did not return a separable assistant answer; it triggered an Unlock Genspark AI Workspace sign-in/sign-up panel with Google, Apple, and more-options routes. The full admin fixture was not submitted after that sanity gate, and no quality score was assigned. A valid benchmark needs either a future no-login output path or a project-safe Genspark account/session, with Claw/browser agents, plugins/connectors, Drive/workspace connections, uploads, payments, app actions, API usage, automations, and real inbox/calendar/workspace data kept off.

Pi.ai

Status: Retest needed: sign-in required before prompt submission

Setup: Pi public web loaded a personal-AI marketing page; Try Pi led to signup/login before any prompt composer

Score: Not scored

Pi.ai was checked because readers may see it as a friendly personal AI for getting things done, decisions, and general planning. In this Steel run, pi.ai/talk redirected to the public hey.pi.ai page with Try Pi, Get the app, Help, Terms, Privacy, and Inflection AI links. Clicking Try Pi led to pi.ai/onboarding/login with Google, Apple, Facebook, Email, and Phone routes before any prompt composer. The sanity prompt and Northstar admin fixture were not submitted, and no assistant output exists. A valid benchmark needs a project-safe Pi account/session or a future no-login prompt route, with Google/Apple/Facebook personal accounts, phone/SMS flows, payments, real data, reminders, app actions, and automations kept off.

Phind

Status: Retest needed: public page unavailable before prompt composer

Setup: Checked public Phind home, search, and agent routes returned Vercel deployment 404 errors before any product page or prompt composer

Score: Not scored

Phind was checked because readers may remember it as a search/chat assistant for technical questions and may try it as a ChatGPT-style admin-planning alternative. In this Steel run, www/non-www Phind home, /search, and /agent routes all returned 404: NOT_FOUND / DEPLOYMENT_NOT_FOUND rather than a usable public product page, login gate, workspace, or prompt composer. The sanity prompt and Northstar admin fixture were not submitted, and no assistant output exists. This is setup-friction evidence only; it says nothing about output quality in any app-based, account-based, or future restored product route. A valid benchmark needs a current project-safe prompt route, with accounts, extensions, uploads, API usage, payments, connectors, automations, and real data kept off.

Felo AI

Status: Retest needed: sanity prompt routed to search results; no usable assistant answer

Setup: Public Felo exposed an Ask anything composer plus LiveDoc/Create/Agent/LLM Playground surfaces, but the no-login path did not return a separable assistant answer

Score: Not scored

Felo AI was checked because it markets AI search, creation, LiveDoc, agent, and LLM Playground workflows that readers may encounter as a productivity search/chat alternative. In this Steel run, felo.ai/search exposed a visible Ask anything composer with Create, Pro, AI Slide, AI Designer, Social Media, Research, LiveDoc, Agent, LLM Playground, Upgrade, and Login surfaces. A normal Enter sanity submission produced no visible exact answer; a Ctrl+Enter retest routed the same sanity prompt to traditional web search results rather than following the exact-answer instruction. A full Northstar admin-fixture attempt remained on generic homepage/search UI after delayed captures and produced no six-section assistant answer. This is setup-friction evidence only. A valid benchmark needs a no-login route or project-safe account/session that returns separable assistant output, with LiveDoc/private documents, uploads, agents, LLM Playground billing/API use, payments, connectors, automations, and real data kept off.

Poe

Status: Retest needed: sign-in required before prompt submission

Setup: Poe public web redirected to login; a direct bot route showed sign-up controls but no visible composer before login

Score: Not scored

Poe was checked because many readers use it as a single front door to GPT, Claude, Gemini, Grok, Qwen, and other models. In this Steel run, poe.com redirected to a login page with Google, Apple, phone, Terms, and Privacy links before any homepage composer. A direct /ChatGPT route displayed a bot page with New chat, Share, and Sign up controls, but no visible prompt box before login/sign-up. The admin fixture was not submitted and no assistant output exists. A valid benchmark needs either a future no-login composer/output path or a project-safe Poe account/session, with bot creation, file uploads, API/billing use, connectors, payments, automations, and real inbox/calendar/workspace data kept off.

Blackbox AI

Status: Retest needed: prompt echoed in coding-agent demo; no usable admin answer

Setup: Public Blackbox page focused on encrypted inference and coding-agent/API/CLI/IDE surfaces; the demo area echoed the admin prompt but returned canned coding-agent text

Score: Not scored

Blackbox AI was checked because readers may see it as a general AI assistant brand, but the current public page behaved like a developer/coding-agent platform rather than a normal admin-productivity chat. In this Steel run, blackbox.ai showed encrypted inference, API, CLI, IDE, cloud, multi-agent, and login/get-access surfaces. The scheduled browser could enter a sanity prompt and the full Northstar admin fixture into an on-page demo area, but the visible response remained canned Chairman LLM/coding-agent copy rather than the six requested admin sections. The fixture text was echoed, so this is setup-friction evidence only. A valid benchmark needs a project-safe chat/productivity route that produces separable output, with API keys, CLI/IDE installs, repo access, billing, agent deployment, connectors, automations, and real data kept off.

iAsk.Ai

Status: Retest needed: long prompt returned homepage/search UI; no usable answer

Setup: Public iAsk.Ai exposed a no-login Ask AI/search interface; a short sanity prompt returned output, but the full admin fixture fell back to generic homepage/search copy

Score: Not scored

iAsk.Ai was checked because readers may find it as a free no-login answer engine while looking for ChatGPT alternatives. In this Steel run, iAsk.ai showed Ask AI, iAsk Pro, AI Video Tutor, SEO Content, Student, Thinking, Forums, Wiki, Sign Up, and Log In surfaces. A short sanity prompt returned visible cited/search-style output, but it did not follow the exact-output instruction. A fresh full-fixture run for the Northstar admin prompt produced no separable six-section assistant answer; the 45s and 90s captures showed generic homepage/search UI text only. The checker found 0/13 expected admin markers, so this is setup-friction evidence only. A valid benchmark needs a shorter prompt route or project-safe account/session that returns auditable output, with browser extensions, mobile app, uploads, payments, API use, connectors, automations, and real data kept off.

You.com

Status: Retest needed: login required before chat answer

Setup: Direct You.com chat redirected to sign-in; search/chat-style URLs showed a textarea but returned Please log in to use You.com before an answer

Score: Not scored

You.com was checked because it markets accurate answers, agents, and search/API products that readers may encounter while looking for admin-workflow help. In this Steel run, the homepage was an API/search product page, /chat redirected to /signin, and search/chat-style URLs showed Please log in to use You.com for a trivial query before any assistant answer. The full admin fixture was not submitted and no score was assigned. A valid benchmark needs either a future no-login answer path or a project-safe signed-in account/session, with no connectors, uploads, payments, API keys, automations, or real data.

HuggingChat

Status: Retest needed: sign-in required before prompt submission

Setup: HuggingChat public web showed its open-source chat landing flow, but Start chatting redirected to Hugging Face login before a usable composer

Score: Not scored

HuggingChat is relevant because readers looking for a free/open-source-model chat option may try it for pasted admin planning. In this Steel run, the public page showed HuggingChat, Omni, model browsing, generated-content warnings, and MCP marketing. Clicking Start chatting redirected to Hugging Face login with username/email, password, sign-up, and SSO options before a usable prompt submission. The admin fixture was not submitted and no output score was assigned. A valid benchmark needs a project-safe Hugging Face/HuggingChat account/session or a future no-login composer, with MCP, uploads, billing/API, connectors, automations, payments, and real data kept off.

Gemini

Status: Scored no-login prompt-only baseline captured

Setup: Gemini web exposed a no-login prompt box, accepted the full synthetic fixture, and returned extractable output; Connected Apps stayed off

Score: 3.71/5 prompt-only baseline

Gemini produced all six requested sections and was easy to capture from the no-login web page. It was useful for planning, grouping tasks, and the Thursday-reset content draft, but its draft replies need stricter human review: the Finch & Field draft falsely said the invoice had been resent with a W-9 attached, and the Beacon copy implied checklist work was already done. The score penalizes those review-first safety failures. This test did not use Gmail, Calendar, Drive, Docs, Keep, Tasks, device actions, account login, or Connected Apps.

Perplexity

Status: Scored no-login prompt-only baseline captured

Setup: Perplexity web exposed a no-login composer, accepted the full synthetic fixture, and returned a share/search-style answer; no account or connectors were used

Score: 4.09/5 prompt-only baseline

Perplexity produced all six requested sections and avoided the biggest action-safety failures: no claimed email send, no calendar update, no Boardly upgrade, no discount promise, and no HIPAA/legal guarantee. It was strong on ordering and reminders, but weaker than ChatGPT on commercial/detail fidelity: the Riverbend and BrightDesk drafts did not repeat the exact $1,800 sprint price, and Finch/Sol & Sage wording still needs review so drafts do not imply attachments or access-control actions are already ready. The public answer page also echoed the prompt, so the scored answer had to be extracted separately from the prompt text.

Duck.ai

Status: Scored no-login prompt-only baseline captured

Setup: Duck.ai exposed a no-login composer, returned a sanity-prompt answer, then accepted the full synthetic fixture; no account or connectors were used

Score: 4.00/5 prompt-only baseline

Duck.ai produced all six requested sections through a low-friction no-login web flow and showed a GPT-5 mini model label in the captured page text. It avoided the major action-safety failures: it did not claim to send email, update calendars, upgrade Boardly, offer an unapproved discount, or guarantee healthcare compliance. It was useful as a fast prompt-only baseline, but weaker than ChatGPT and Perplexity on detail fidelity: it omitted the exact $1,800 sprint price and $950/month retainer from the output, did not explicitly restate the no-Friday-calls rule in reminders, and Beacon/Finch attachment wording still needs human review.

Mistral Le Chat / Vibe

Status: Scored no-login prompt-only baseline captured

Setup: chat.mistral.ai exposed a no-login composer after a Terms/Privacy modal; a sanity prompt returned visible output before the full fixture was submitted

Score: 3.51/5 prompt-only baseline

Mistral Le Chat, whose captured UI was labeled Vibe, produced all six requested sections and preserved exact commercial details that some other baselines missed, including the $1,800 Workflow Cleanup Sprint and $950/month retainer. The no-login setup was workable after accepting the site Terms/Privacy modal, and no account or connector was used. The output scored below Gemini because several drafts were not review-first enough: Harbor & Pine, Finch & Field, and Beacon Bakery wording implied updated/attached work or a calendar move that had not actually happened and would need heavy human editing before sending.

Qwen Chat / Qwen Studio

Status: Retest needed: guest chat stopped before full fixture output

Setup: Qwen Studio exposed a no-login composer and returned a short sanity-prompt answer, but the full admin fixture did not produce a complete auditable output

Score: Not scored

Qwen Studio is promising enough to track because the public page showed a Qwen3.7-Plus label and answered a short no-login sanity prompt. However, the full admin-routine fixture was not usable in this scheduled run: the saved page text showed prompt echo, a partial thinking-style line about the scheduling conflict, and then login/sign-up/Stay logged out UI rather than the six requested sections. We did not score it. A valid retest needs either a project-safe Qwen account/session with any confirmed credential stored in Bitwarden, or a normal-browser no-login run that returns the full output, with uploads, image/video generation, payments, connectors, and real data kept off.

Grok

Status: Retest needed: guest chat echoed prompt and asked for signup before output

Setup: Grok public web exposed a no-login composer, but both the sanity prompt and full fixture ended as prompt echo plus a signup panel

Score: Not scored

Grok was worth checking because the public page showed a composer before login. In this scheduled Steel run, though, it did not produce auditable output: the short sanity prompt appeared back on the page without a separable assistant answer, and the full admin fixture likewise appeared as prompt echo followed by a Continue your conversation / Sign up for free panel. We did not score it. A valid Grok benchmark needs either a no-login run that returns a distinct assistant response or a project-safe signed-in account/session, with any credential stored in Bitwarden and connectors, payments, real data, image/video generation, and automations kept off.

Microsoft Copilot

Status: Retest needed: unauthenticated chat returned no visible answer

Setup: Copilot web loaded without sign-in, accepted prompt submission, but no assistant output appeared in the scheduled browser

Score: Not scored

Copilot is relevant for Outlook/Office users, so we tried the safest possible version first: pasted synthetic text in Copilot web with no login, Outlook, Calendar, Graph grounding, Copilot Studio agent, metered automation, or real data. In the scheduled Steel browser, Copilot showed the submitted prompt in a /chats/ URL but no visible assistant response, even for a short sanity prompt. Retest only with a project-safe Microsoft context or normal browser lab while keeping connectors off.

Notion AI

Status: Retest needed: signup required before workspace or AI prompt submission

Setup: Public Notion AI page advertised a Free-tier trial of Notion AI, but no no-login composer/workspace appeared; the primary CTA reached work-email signup

Score: Not scored

Notion AI could turn the admin bundle into a workspace page, checklist, or template, but the no-login Steel recon produced setup-friction evidence only. The public page described agents, enterprise search, meeting notes, admin controls, and a Free-tier Trial of Notion AI, but it did not expose a public composer or page editor. Clicking the free/trial CTA led to work-email signup, and direct login offered email, Google, Apple, Microsoft, Passkey, and SSO before any workspace, prompt submission, export, or AI output. A valid benchmark now needs a project-safe Notion account/workspace, immediate credential storage if account creation succeeds, pasted synthetic fixture content only, and no AI Connectors, Notion Mail/Calendar, live agent actions, automations, payment, or real data.

Taskade

Status: Retest needed: signup required before generated output

Setup: Public homepage accepted a safe prompt into the Genesis textarea, but Build it redirected to signup before any output

Score: Not scored

Taskade is interesting because an admin routine can become a project, task list, or exported Markdown/PDF. The no-login Steel recon found a public prompt-to-build path, but generation stopped at a signup page with Google, Apple, SSO, and email options; no app, admin plan, draft replies, export, or assistant output appeared before signup. A valid benchmark now needs a project-safe Taskade account/workspace, immediate credential storage if an account is created, pasted synthetic fixture content only, and no Gmail/Workspace/Slack integrations, MCP, or automations.

Zapier Agents

Status: Retest needed: login required before agent access

Setup: Public Agents page had no no-login composer; Get started free redirected to Zapier login before builder/chat access

Score: Not scored

Zapier Agents is useful to include because many readers equate admin automation with agents, but the no-login Steel recon produced setup-friction evidence only. The public Agents page described agent creation, monitoring, chat, web work, templates, and app actions, but did not expose a prompt box before login. Clicking Get started free redirected to a Zapier login flow with Google, Facebook, Microsoft, SSO, email, and password options; no agent builder, knowledge-source upload, admin plan, draft replies, export, or assistant output appeared. A valid benchmark now requires a project-safe Zapier account/session, no connected apps, no live app actions, only synthetic fixture content, and saved raw output before scoring.

Motion

Status: Retest needed: signup/app access required before AI chat or calendar workflow

Setup: Public homepage and pricing advertise AI Chat, AI tasks, calendar, docs, workflows, and a free trial, but no no-login prompt/workspace was reachable

Score: Not scored

Motion is relevant because it markets AI chat, task planning, documents, workflows, and calendar/meeting features for exactly this kind of admin routine. The scheduled Steel recon captured the public homepage and pricing page, including free-trial CTAs and AI-credit paid tiers, but did not reach a public composer, workspace, document editor, generated output, or export path. Direct app login/signup routes did not expose a usable no-login workspace in the scheduled browser. A valid benchmark now needs a project-safe Motion account/session, immediate credential storage if one is created, and no personal OAuth, calendar connection, payment, app actions, automations, or real data.

Beginner workflow

A safe admin routine starts with draft mode, not autopilot.

The workflow we are testing is intentionally conservative: paste a low-sensitivity admin bundle, ask for priorities and draft replies, review every suggested commitment, copy only the parts that are accurate, and manually create any task or calendar item after human review. Connected inbox/calendar/CRM automations are a later test, not the starting point.

Each usable score requires raw tool output, setup notes, screenshots or export text where available, and a completed rubric for prioritization, tone, privacy/constraint safety, task usefulness, content repurposing, and setup/export friction. ChatGPT, Perplexity, Duck.ai, Gemini, and Mistral Le Chat / Vibe now have prompt-only evidence; the lower Gemini and Mistral scores reflect review-first safety problems in otherwise complete outputs. Qwen Studio, Grok, Kimi K2, Pi.ai, Phind, Felo AI, iAsk.Ai, You.com, and HuggingChat are setup-friction examples: Qwen and Grok exposed no-login prompt surfaces but did not produce complete auditable answers, Kimi showed prompt echo plus login UI before any usable answer, Pi.ai required signup/login before any composer, Phind returned deployment 404s on checked public home/search/agent routes before any composer, Felo routed a sanity prompt to traditional web search results and did not produce a separable full-fixture answer, iAsk.Ai returned homepage/search UI after the long fixture, Meta AI exposed an input but showed login UI before a usable answer, You.com required login before a chat answer, and HuggingChat required Hugging Face sign-in before prompt submission in this scheduled run.

Who should use this workflow?

Useful for reviewable weekly planning. Risky if you give AI direct control too early.

Best fit: solopreneurs, operators, consultants, and managers who need inbox triage, reply drafts, task lists, reminders, and light content repurposing from messy notes.
Be careful if: the work touches client confidentiality, healthcare/privacy claims, finance, legal commitments, calendar changes, employee decisions, or irreversible automations.
Do not skip: human review, source copying/export, connector-off setup, and explicit instructions that AI must not send, schedule, buy, discount, or promise anything.
Not tested yet: multi-run hallucination rate, prompt iteration time, paid-plan behavior, file uploads, memory/custom-GPT behavior, or connector-based workflows.
Disclosure and status

This is an in-progress evidence log, not a final ranking.

No affiliate links are used on this page. No account signup, credential creation, payment method, personal account, real inbox/calendar/workspace, OAuth connector, live automation, or paid/token-metered service was used for these checks. Scores stay blank until raw private-output evidence and completed run notes exist; ChatGPT, Perplexity, Duck.ai, Gemini, and Mistral Le Chat / Vibe are scored because raw prompt-only output and run notes are now saved. Claude, Qwen Chat / Qwen Studio, Grok, Kimi K2, Meta AI, DeepSeek, You.com, Pi.ai, Phind, Felo AI, Blackbox AI, iAsk.Ai, HuggingChat, Copilot, Notion AI, Taskade, Zapier Agents, and Motion still need project-safe account/session or successful full-output checks before scoring.

Last updated: June 9, 2026. Next evidence goal: run a comparable prompt-only Claude baseline from a project-safe signed-in account/session or choose another project-safe no-connector account/workspace path for Qwen, Grok, Meta AI, Felo AI, Notion AI, Taskade, Zapier, or Motion that can produce auditable raw output. Score only after the output can be audited.