ChatGPT
Status: Scored prompt-only baseline captured
Setup: No-login ChatGPT web accepted the full synthetic fixture and returned extractable output; connectors stayed off
Score: 4.21/5 prompt-only baseline
ChatGPT is now the first scored admin-routine baseline. It produced all six requested sections from pasted synthetic text: a 90-minute Monday plan, grouped tasks, draft replies, risk list, reminder suggestions, and a Thursday-reset content draft. It stayed mostly review-first and avoided real actions, but the Finch/W-9 wording needed human review and the reminder section did not explicitly restate the no-Friday-calls rule. This score does not test connectors, file uploads, memory, custom GPTs, scheduled tasks, or paid-plan behavior.
Claude
Status: Retest needed: sign-in required before prompt submission
Setup: Unauthenticated Claude web redirected to sign-in; Free plan card visible, but no prompt box before login
Score: Not scored
Claude remains a strong writing and summarization candidate for long messy inputs, but the safest no-login Steel run could not submit the fixture. Claude redirected to a sign-in page with Google, email, and SSO options before any chat input appeared. A valid comparison now requires a project-safe Claude account/session, pasted synthetic fixture only, saved raw output text, and no Google Workspace connectors/OAuth, Slack connectors, MCP connectors, payment, or real inbox/calendar data.
DeepSeek
Status: Retest needed: sign-in required before prompt submission
Setup: DeepSeek public web redirected to sign-in and showed email/password, Sign up, Google, and Apple routes; no prompt box before login
Score: Not scored
Kimi K2 / Kimi.com
Status: Retest needed: login required before usable answer
Setup: Kimi public web exposed a no-login composer and K2.1-style UI labels, but prompt submission showed login UI before a separable answer
Score: Not scored
Kimi was checked because readers may encounter K2/Kimi as a free-looking general assistant for admin planning. In this Steel run, Kimi loaded at kimi.com with an Ask Anything composer, chat-history sync/login controls, and Kimi/K2.1 interface labels. A short sanity prompt and the full admin fixture both ended as prompt echo plus a Log in to Chat with Kimi for Free modal with Google/WeChat/phone login routes, not an auditable assistant answer. The full fixture was not quality-scored. A valid benchmark needs a future no-login output path or a project-safe Kimi/Moonshot account/session, with browser/desktop agents, file uploads, real docs, inbox/calendar/workspace data, payments, and automations kept off.
Meta AI
Status: Retest needed: login required before usable answer
Setup: Meta AI public web exposed a no-login input, but a sanity prompt showed Log in to Meta AI UI before a usable answer
Score: Not scored
Meta AI was checked because many readers already see it as a free general assistant across Meta products. In this Steel run, meta.ai loaded with a What can I do for you? input, New chat/Create controls, and Log in/Sign up links. A short sanity prompt triggered a Log in to Meta AI panel saying users need to log in or create an account for full features and history, and the full admin fixture did not produce a separable assistant answer or six-section output. No quality score was assigned. A valid benchmark needs a future no-login output path or a project-safe Meta AI account/session, with Facebook/Instagram personal accounts, uploads, image/video generation, payments, connectors, real data, and automations kept off.
Genspark
Status: Retest needed: login required before usable answer
Setup: Genspark public web exposed an Ask anything, create anything textarea, but a sanity prompt triggered sign-in/sign-up UI before an answer
Score: Not scored
Genspark was checked because it now markets a broad AI workspace for everyday productivity: Claw, Workflows, Drive, AI Slides, AI Sheets, AI Docs, AI Chat, AI Meeting Notes, and Google/Microsoft plugins. In this Steel run, the public homepage exposed an Ask anything, create anything textarea. A short sanity prompt did not return a separable assistant answer; it triggered an Unlock Genspark AI Workspace sign-in/sign-up panel with Google, Apple, and more-options routes. The full admin fixture was not submitted after that sanity gate, and no quality score was assigned. A valid benchmark needs either a future no-login output path or a project-safe Genspark account/session, with Claw/browser agents, plugins/connectors, Drive/workspace connections, uploads, payments, app actions, API usage, automations, and real inbox/calendar/workspace data kept off.
Pi.ai
Status: Retest needed: sign-in required before prompt submission
Setup: Pi public web loaded a personal-AI marketing page; Try Pi led to signup/login before any prompt composer
Score: Not scored
Pi.ai was checked because readers may see it as a friendly personal AI for getting things done, decisions, and general planning. In this Steel run, pi.ai/talk redirected to the public hey.pi.ai page with Try Pi, Get the app, Help, Terms, Privacy, and Inflection AI links. Clicking Try Pi led to pi.ai/onboarding/login with Google, Apple, Facebook, Email, and Phone routes before any prompt composer. The sanity prompt and Northstar admin fixture were not submitted, and no assistant output exists. A valid benchmark needs a project-safe Pi account/session or a future no-login prompt route, with Google/Apple/Facebook personal accounts, phone/SMS flows, payments, real data, reminders, app actions, and automations kept off.
Phind
Status: Retest needed: public page unavailable before prompt composer
Setup: The public Phind home, search, and agent routes that were checked all returned Vercel deployment 404 errors before any product page or prompt composer
Score: Not scored
Phind was checked because readers may remember it as a search/chat assistant for technical questions and may try it as a ChatGPT-style admin-planning alternative. In this Steel run, www/non-www Phind home, /search, and /agent routes all returned 404: NOT_FOUND / DEPLOYMENT_NOT_FOUND rather than a usable public product page, login gate, workspace, or prompt composer. The sanity prompt and Northstar admin fixture were not submitted, and no assistant output exists. This is setup-friction evidence only; it says nothing about output quality in any app-based, account-based, or future restored product route. A valid benchmark needs a current project-safe prompt route, with accounts, extensions, uploads, API usage, payments, connectors, automations, and real data kept off.
Felo AI
Status: Retest needed: sanity prompt routed to search results; no usable assistant answer
Setup: Public Felo exposed an Ask anything composer plus LiveDoc/Create/Agent/LLM Playground surfaces, but the no-login path did not return a separable assistant answer
Score: Not scored
Felo AI was checked because it markets AI search, creation, LiveDoc, agent, and LLM Playground workflows that readers may encounter as a productivity search/chat alternative. In this Steel run, felo.ai/search exposed a visible Ask anything composer with Create, Pro, AI Slide, AI Designer, Social Media, Research, LiveDoc, Agent, LLM Playground, Upgrade, and Login surfaces. A normal Enter sanity submission produced no visible exact answer; a Ctrl+Enter retest routed the same sanity prompt to traditional web search results rather than following the exact-answer instruction. A full Northstar admin-fixture attempt remained on generic homepage/search UI after delayed captures and produced no six-section assistant answer. This is setup-friction evidence only. A valid benchmark needs a no-login route or project-safe account/session that returns separable assistant output, with LiveDoc/private documents, uploads, agents, LLM Playground billing/API use, payments, connectors, automations, and real data kept off.
Poe
Status: Retest needed: sign-in required before prompt submission
Setup: Poe public web redirected to login; a direct bot route showed sign-up controls but no visible composer before login
Score: Not scored
Poe was checked because many readers use it as a single front door to GPT, Claude, Gemini, Grok, Qwen, and other models. In this Steel run, poe.com redirected to a login page with Google, Apple, phone, Terms, and Privacy links before any homepage composer. A direct /ChatGPT route displayed a bot page with New chat, Share, and Sign up controls, but no visible prompt box before login/sign-up. The admin fixture was not submitted and no assistant output exists. A valid benchmark needs either a future no-login composer/output path or a project-safe Poe account/session, with bot creation, file uploads, API/billing use, connectors, payments, automations, and real inbox/calendar/workspace data kept off.
Blackbox AI
Status: Retest needed: prompt echoed in coding-agent demo; no usable admin answer
Setup: Public Blackbox page focused on encrypted inference and coding-agent/API/CLI/IDE surfaces; the demo area echoed the admin prompt but returned canned coding-agent text
Score: Not scored
Blackbox AI was checked because readers may see it as a general AI assistant brand, but the current public page behaved like a developer/coding-agent platform rather than a normal admin-productivity chat. In this Steel run, blackbox.ai showed encrypted inference, API, CLI, IDE, cloud, multi-agent, and login/get-access surfaces. The scheduled browser could enter a sanity prompt and the full Northstar admin fixture into an on-page demo area, but the visible response remained canned Chairman LLM/coding-agent copy rather than the six requested admin sections. The fixture text was echoed, so this is setup-friction evidence only. A valid benchmark needs a project-safe chat/productivity route that produces separable output, with API keys, CLI/IDE installs, repo access, billing, agent deployment, connectors, automations, and real data kept off.
iAsk.Ai
Status: Retest needed: long prompt returned homepage/search UI; no usable answer
Setup: Public iAsk.Ai exposed a no-login Ask AI/search interface; a short sanity prompt returned output, but the full admin fixture fell back to generic homepage/search copy
Score: Not scored
iAsk.Ai was checked because readers may find it as a free no-login answer engine while looking for ChatGPT alternatives. In this Steel run, iAsk.ai showed Ask AI, iAsk Pro, AI Video Tutor, SEO Content, Student, Thinking, Forums, Wiki, Sign Up, and Log In surfaces. A short sanity prompt returned visible cited/search-style output, but it did not follow the exact-output instruction. A fresh full-fixture run for the Northstar admin prompt produced no separable six-section assistant answer; the 45s and 90s captures showed generic homepage/search UI text only. The checker found 0/13 expected admin markers, so this is setup-friction evidence only. A valid benchmark needs a shorter prompt route or project-safe account/session that returns auditable output, with browser extensions, mobile app, uploads, payments, API use, connectors, automations, and real data kept off.
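A marker check like the 0/13 result above can be sketched as a case-insensitive substring count over the saved page text. This is a minimal illustration, not the project's actual checker: the function name, the marker list, and the sample capture string are all hypothetical, and the real 13 markers are not listed in this document.

```python
def count_markers(page_text: str, markers: list[str]) -> tuple[int, list[str]]:
    """Return (hits, missing) using case-insensitive substring matching."""
    lowered = page_text.casefold()
    missing = [m for m in markers if m.casefold() not in lowered]
    return len(markers) - len(missing), missing


# Hypothetical markers for illustration only; NOT the real 13-marker list.
EXAMPLE_MARKERS = [
    "Finch", "W-9", "Boardly", "Riverbend", "BrightDesk", "$1,800", "Thursday",
]

# Hypothetical capture that contains only homepage/search UI labels.
homepage_capture = "Ask AI iAsk Pro AI Video Tutor SEO Content Sign Up Log In"

hits, missing = count_markers(homepage_capture, EXAMPLE_MARKERS)
print(f"{hits}/{len(EXAMPLE_MARKERS)} expected markers found")
```

A zero count on this kind of check only shows that the capture lacks fixture-specific terms. It does not say why, so the saved page text still needs a manual look before a run is labeled setup friction.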
You.com
Status: Retest needed: login required before chat answer
Setup: Direct You.com chat redirected to sign-in; search/chat-style URLs showed a textarea but returned Please log in to use You.com before an answer
Score: Not scored
You.com was checked because it markets accurate answers, agents, and search/API products that readers may encounter while looking for admin-workflow help. In this Steel run, the homepage was an API/search product page, /chat redirected to /signin, and search/chat-style URLs showed Please log in to use You.com for a trivial query before any assistant answer. The full admin fixture was not submitted and no score was assigned. A valid benchmark needs either a future no-login answer path or a project-safe signed-in account/session, with no connectors, uploads, payments, API keys, automations, or real data.
HuggingChat
Status: Retest needed: sign-in required before prompt submission
Setup: HuggingChat public web showed its open-source chat landing flow, but Start chatting redirected to Hugging Face login before a usable composer
Score: Not scored
HuggingChat is relevant because readers looking for a free/open-source-model chat option may try it for pasted admin planning. In this Steel run, the public page showed HuggingChat, Omni, model browsing, generated-content warnings, and MCP marketing. Clicking Start chatting redirected to Hugging Face login with username/email, password, sign-up, and SSO options before a usable prompt submission. The admin fixture was not submitted and no output score was assigned. A valid benchmark needs a project-safe Hugging Face/HuggingChat account/session or a future no-login composer, with MCP, uploads, billing/API, connectors, automations, payments, and real data kept off.
Gemini
Status: Scored no-login prompt-only baseline captured
Setup: Gemini web exposed a no-login prompt box, accepted the full synthetic fixture, and returned extractable output; Connected Apps stayed off
Score: 3.71/5 prompt-only baseline
Gemini produced all six requested sections and was easy to capture from the no-login web page. It was useful for planning, grouping tasks, and the Thursday-reset content draft, but its draft replies need stricter human review: the Finch & Field draft falsely said the invoice had been resent with a W-9 attached, and the Beacon copy implied checklist work was already done. The score penalizes those review-first safety failures. This test did not use Gmail, Calendar, Drive, Docs, Keep, Tasks, device actions, account login, or Connected Apps.
Perplexity
Status: Scored no-login prompt-only baseline captured
Setup: Perplexity web exposed a no-login composer, accepted the full synthetic fixture, and returned a share/search-style answer; no account or connectors were used
Score: 4.09/5 prompt-only baseline
Perplexity produced all six requested sections and avoided the biggest action-safety failures: no claimed email send, no calendar update, no Boardly upgrade, no discount promise, and no HIPAA/legal guarantee. It was strong on ordering and reminders, but weaker than ChatGPT on commercial/detail fidelity: the Riverbend and BrightDesk drafts did not repeat the exact $1,800 sprint price, and the Finch and Sol & Sage wording still needs review so the drafts do not imply that attachments or access-control actions are already in place. The public answer page also echoed the prompt, so the scored answer had to be extracted separately from the prompt text.
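Separating an answer from an echoed prompt can be sketched as follows. This is one possible approach under stated assumptions, not the extraction method used for the scored run: the function names are hypothetical, and it assumes the page echoes the prompt verbatim apart from whitespace differences.

```python
import re


def _normalize(text: str) -> str:
    """Collapse all whitespace runs so line-wrapping differences do not matter."""
    return re.sub(r"\s+", " ", text).strip()


def extract_answer(page_text: str, prompt: str) -> str:
    """Return the text after the last echo of the prompt.

    If the prompt is not found, the whole normalized page text is returned
    unchanged. An empty result means the page held the echo and nothing else.
    """
    page = _normalize(page_text)
    needle = _normalize(prompt)
    if not needle:
        return page
    idx = page.rfind(needle)
    if idx == -1:
        return page
    return page[idx + len(needle):].strip()


# Hypothetical capture: prompt echo followed by the start of an answer.
capture = "Plan my Monday.\n\n1. Monday plan: start with invoices"
print(extract_answer(capture, "Plan my  Monday."))
```

Whitespace normalization also flattens line breaks inside the answer, so a sketch like this suits scoring text rather than preserving layout. The raw capture should still be saved alongside the extracted answer.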
Duck.ai
Status: Scored no-login prompt-only baseline captured
Setup: Duck.ai exposed a no-login composer, returned a sanity-prompt answer, then accepted the full synthetic fixture; no account or connectors were used
Score: 4.00/5 prompt-only baseline
Duck.ai produced all six requested sections through a low-friction no-login web flow and showed a GPT-5 mini model label in the captured page text. It avoided the major action-safety failures: it did not claim to send email, update calendars, upgrade Boardly, offer an unapproved discount, or guarantee healthcare compliance. It was useful as a fast prompt-only baseline, but weaker than ChatGPT and Perplexity on detail fidelity: it omitted the exact $1,800 sprint price and $950/month retainer from the output, did not explicitly restate the no-Friday-calls rule in reminders, and Beacon/Finch attachment wording still needs human review.
Mistral Le Chat / Vibe
Status: Scored no-login prompt-only baseline captured
Setup: chat.mistral.ai exposed a no-login composer after a Terms/Privacy modal; a sanity prompt returned visible output before the full fixture was submitted
Score: 3.51/5 prompt-only baseline
Mistral Le Chat, whose captured UI was labeled Vibe, produced all six requested sections and preserved exact commercial details that some other baselines missed, including the $1,800 Workflow Cleanup Sprint and $950/month retainer. The no-login setup was workable after accepting the site Terms/Privacy modal, and no account or connector was used. The output scored below Gemini because several drafts were not review-first enough. The Harbor & Pine, Finch & Field, and Beacon Bakery wording implied updated or attached work, or a calendar move, that had not actually happened; those drafts would need heavy human editing before sending.
Qwen Chat / Qwen Studio
Status: Retest needed: guest chat stopped before full fixture output
Setup: Qwen Studio exposed a no-login composer and returned a short sanity-prompt answer, but the full admin fixture did not produce a complete auditable output
Score: Not scored
Qwen Studio is promising enough to track because the public page showed a Qwen3.7-Plus label and answered a short no-login sanity prompt. However, the full admin-routine fixture was not usable in this scheduled run: the saved page text showed prompt echo, a partial thinking-style line about the scheduling conflict, and then login/sign-up/Stay logged out UI rather than the six requested sections. We did not score it. A valid retest needs either a project-safe Qwen account/session with any confirmed credential stored in Bitwarden, or a normal-browser no-login run that returns the full output, with uploads, image/video generation, payments, connectors, and real data kept off.
Grok
Status: Retest needed: guest chat echoed prompt and asked for signup before output
Setup: Grok public web exposed a no-login composer, but both the sanity prompt and full fixture ended as prompt echo plus a signup panel
Score: Not scored
Grok was worth checking because the public page showed a composer before login. In this scheduled Steel run, though, it did not produce auditable output: the short sanity prompt appeared back on the page without a separable assistant answer, and the full admin fixture likewise appeared as prompt echo followed by a Continue your conversation / Sign up for free panel. We did not score it. A valid Grok benchmark needs either a no-login run that returns a distinct assistant response or a project-safe signed-in account/session, with any credential stored in Bitwarden and connectors, payments, real data, image/video generation, and automations kept off.
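The "prompt echo plus signup panel" outcome described above could be flagged automatically before a human reviews the capture. The sketch below is a rough heuristic, not the project's tooling: the gate phrases are taken from UI text quoted in this document, while the function name, the labels, and the 200-character threshold are hypothetical.

```python
# Phrases taken from login/signup UI text quoted in this document.
GATE_PHRASES = (
    "sign up for free",
    "continue your conversation",
    "log in to",
    "please log in",
)


def classify_capture(page_text: str, prompt: str, min_answer_chars: int = 200) -> str:
    """Label a capture as 'login_gate', 'no_answer', or 'candidate_answer'."""
    page = " ".join(page_text.split())
    needle = " ".join(prompt.split())
    # Drop the echoed prompt so it is not mistaken for assistant output.
    remainder = page.replace(needle, " ") if needle else page
    if any(phrase in remainder.casefold() for phrase in GATE_PHRASES):
        return "login_gate"
    if len(remainder.strip()) < min_answer_chars:
        return "no_answer"
    return "candidate_answer"


# Hypothetical capture resembling the outcome described above.
grok_like = "Plan my Monday. Continue your conversation Sign up for free"
print(classify_capture(grok_like, "Plan my Monday."))
```

Checking for gate phrases first is deliberately conservative: a page that shows both an answer and a signup panel is labeled as gated, which sends it to manual review instead of scoring.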
Microsoft Copilot
Status: Retest needed: unauthenticated chat returned no visible answer
Setup: Copilot web loaded without sign-in and accepted prompt submission, but no assistant output appeared in the scheduled browser
Score: Not scored
Copilot is relevant for Outlook/Office users, so we tried the safest possible version first: pasted synthetic text in Copilot web with no login and no Outlook, Calendar, Graph grounding, Copilot Studio agent, metered automation, or real data. In the scheduled Steel browser, Copilot showed the submitted prompt in a /chats/ URL but no visible assistant response, even for a short sanity prompt. Retest only with a project-safe Microsoft context or a normal browser lab, keeping connectors off.
Notion AI
Status: Retest needed: signup required before workspace or AI prompt submission
Setup: Public Notion AI page advertised a Free-tier trial of Notion AI, but no no-login composer/workspace appeared; the primary CTA reached work-email signup
Score: Not scored
Notion AI could turn the admin bundle into a workspace page, checklist, or template, but the no-login Steel recon produced setup-friction evidence only. The public page described agents, enterprise search, meeting notes, admin controls, and a Free-tier Trial of Notion AI, but it did not expose a public composer or page editor. Clicking the free/trial CTA led to work-email signup, and direct login offered email, Google, Apple, Microsoft, Passkey, and SSO before any workspace, prompt submission, export, or AI output. A valid benchmark now needs a project-safe Notion account/workspace, immediate credential storage if account creation succeeds, pasted synthetic fixture content only, and no AI Connectors, Notion Mail/Calendar, live agent actions, automations, payment, or real data.
Taskade
Status: Retest needed: signup required before generated output
Setup: Public homepage accepted a safe prompt into the Genesis textarea, but Build it redirected to signup before any output
Score: Not scored
Taskade is interesting because an admin routine can become a project, task list, or exported Markdown/PDF. The no-login Steel recon found a public prompt-to-build path, but generation stopped at a signup page with Google, Apple, SSO, and email options; no app, admin plan, draft replies, export, or assistant output appeared before signup. A valid benchmark now needs a project-safe Taskade account/workspace, immediate credential storage if an account is created, pasted synthetic fixture content only, and no Gmail/Workspace/Slack integrations, MCP, or automations.
Zapier Agents
Status: Retest needed: login required before agent access
Setup: Public Agents page had no no-login composer; Get started free redirected to Zapier login before builder/chat access
Score: Not scored
Zapier Agents is useful to include because many readers equate admin automation with agents, but the no-login Steel recon produced setup-friction evidence only. The public Agents page described agent creation, monitoring, chat, web work, templates, and app actions, but did not expose a prompt box before login. Clicking Get started free redirected to a Zapier login flow with Google, Facebook, Microsoft, SSO, email, and password options; no agent builder, knowledge-source upload, admin plan, draft replies, export, or assistant output appeared. A valid benchmark now requires a project-safe Zapier account/session, no connected apps, no live app actions, only synthetic fixture content, and saved raw output before scoring.
Motion
Status: Retest needed: signup/app access required before AI chat or calendar workflow
Setup: Public homepage and pricing advertise AI Chat, AI tasks, calendar, docs, workflows, and a free trial, but no no-login prompt/workspace was reachable
Score: Not scored
Motion is relevant because it markets AI chat, task planning, documents, workflows, and calendar/meeting features for exactly this kind of admin routine. The scheduled Steel recon captured the public homepage and pricing page, including free-trial CTAs and AI-credit paid tiers, but did not reach a public composer, workspace, document editor, generated output, or export path. Direct app login/signup routes did not expose a usable no-login workspace in the scheduled browser. A valid benchmark now needs a project-safe Motion account/session, immediate credential storage if one is created, and no personal OAuth, calendar connection, payment, app actions, automations, or real data.