Public AI Accountability, Not Private Quality Assurance

Making AI quality transparent so developers can improve and users can trust.

Why GrandJury Exists

AI systems are increasingly making critical decisions in healthcare, law, finance, and code. But when these systems fail, those failures often go undocumented, unverified, and unfixed.

Users don't know what went wrong. Developers don't get specific, actionable feedback. Domain experts who spot failures have no platform to share their findings publicly. The public can't hold AI systems accountable.

GrandJury changes that. We connect AI developers with verified domain experts who publicly evaluate AI outputs. Not anonymous crowd workers. Not black-box algorithms. Real experts with real names, publicly documenting what works and what fails.

Our mission: Make AI quality transparent through named expert monitoring. Build trust through evidence, not marketing claims.

Who's Building This


Arthur Cho

Founder, GrandJury

Professional Background

AI/ML Product Manager with 8+ years building AI capabilities for 6 products reaching 200k+ MAU. Master's in Applied Data Science from University of Michigan (GPA 3.9).

Previous Experience

  • Conversational AI Manager at HSBC (Enterprise AI chatbots)
  • Product Manager at Intent AI (A*STAR AI research spinoff)
  • Product Manager at FreeD Group (500 Startups graduate)

Technical Background

  • 5+ years AI/NLP solution planning and implementation
  • Built glancias.com (gen-AI newsletter platform)
  • Created spot-a-mood on HuggingFace
  • Published Chinese translation of Chip Huyen's "Designing Machine Learning Systems"

Why I'm Building GrandJury

I've spent years building AI products. I know the quality problem intimately.

As a product manager, I'd get vague feedback: "Your AI isn't good enough." But what specifically failed? Why? Who verified it? I had no systematic way to know.

As a user of AI tools, I'd see claims like "95% accuracy" or "best-in-class" without understanding what that meant in practice. What does it get wrong? When does it fail? Who says so?

I built GrandJury to solve this problem: Public, named expert monitoring that makes AI quality transparent. Developers get specific improvement insights. Users get credible quality signals. Experts get recognized for their work.

This isn't about shaming AI companies. It's about transparency and accountability leading to safer, better AI.

Contact

Open to: Questions about GrandJury, Partnership discussions, Feedback and suggestions, Media inquiries

Where We're Headed

Our long-term vision: "GrandJury Verified" becomes the expected trust signal for serious AI projects - like SSL certificates for websites.

Short-term (2025)

  • 100+ verified AI Jury members across 5 domains (Medical, Legal, Safety, Code, Finance)
  • First "State of AI Failures" report published
  • Marketplace connecting AI developers with verified evaluators
  • Seamless Langfuse integration

Medium-term (2026)

  • 500+ verified experts
  • 100+ AI projects with public monitoring
  • Media regularly citing GrandJury findings
  • "GrandJury Verified" badge recognized as credibility signal

Long-term (2027+)

  • Public AI accountability becomes industry standard
  • Verified experts earning sustainable income through platform
  • Academic research using GrandJury failure database
  • Platform self-sustaining through marketplace

What Makes GrandJury Different

vs Anonymous Evaluation Platforms

Others (Outlier.ai, Scale AI)

  • ✗ Anonymous crowd workers (no expert credibility)
  • ✗ Private results (developers only, no public accountability)
  • ✗ Assigned tasks (workers have no choice)
  • ✗ Gig economy model (no long-term value for workers)
  • ✗ Rankings and scores (limited context)

GrandJury

  • ✓ Named domain experts (public credibility)
  • ✓ Public results (transparent accountability)
  • ✓ Self-directed evaluation (experts choose projects)
  • ✓ Career building (portfolio, verification, consulting opportunities)
  • ✓ Evidence documentation (specific failures with context)

Why This Matters: Anonymous votes don't build trust. When Dr. Sarah Chen (Climate Scientist) says "This AI gave dangerous climate misinformation," that means something. When "Anonymous User #4728" votes... it doesn't.

vs AI Benchmarking Platforms

Others (LMSys Arena, HuggingFace Benchmarks)

  • ✗ Rankings only ("Model A is better than Model B")
  • ✗ Aggregate scores (no specific failure documentation)
  • ✗ Limited domains (general tasks, not specialized)
  • ✗ No developer feedback loop (just leaderboards)

GrandJury

  • ✓ Evidence documentation ("What specifically failed and why")
  • ✓ Detailed comments (developers get actionable insights)
  • ✓ Domain-specific evaluation (Medical, Legal, Code, Finance, Safety)
  • ✓ Direct developer integration (scores sync to Langfuse)

Why This Matters: "ChatGPT scored 8.5/10" doesn't tell developers what to fix. "ChatGPT hallucinated drug interactions in 15% of medical queries" does.
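To make the developer feedback loop concrete, here is a minimal sketch of the kind of record an expert evaluation could be flattened into before syncing to an observability backend such as Langfuse. All field names, IDs, and the `to_score_payload` helper are illustrative assumptions, not the actual integration schema:

```python
# Hypothetical sketch of an expert-evaluation record GrandJury might
# sync to an observability backend such as Langfuse. Field names and
# values here are invented for illustration.
from dataclasses import dataclass, asdict


@dataclass
class ExpertScore:
    trace_id: str   # the specific model output being evaluated
    evaluator: str  # named expert, never anonymous
    domain: str     # Medical, Legal, Safety, Code, or Finance
    value: float    # 0.0 (fails) to 1.0 (passes)
    comment: str    # specific evidence, not just a number


def to_score_payload(score: ExpertScore) -> dict:
    """Flatten an expert evaluation into a score record that a
    tracing backend could attach to the original model trace."""
    payload = asdict(score)
    payload["name"] = f"grandjury_{score.domain.lower()}"
    return payload


example = ExpertScore(
    trace_id="trace-1234",
    evaluator="Dr. Sarah Chen",
    domain="Medical",
    value=0.2,
    comment="Hallucinated a drug-interaction rule in the response",
)
print(to_score_payload(example)["name"])  # grandjury_medical
```

Because each payload carries the evaluator's name, the domain, and a specific comment, the developer sees the evidence next to the exact trace that produced the failure, rather than an aggregate score.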

Our Core Principles

Transparency First

All evaluations are public by default. Evaluator names, comments, votes - everything visible. Sunlight is the best disinfectant.

Named Expert Attribution

Expert credibility requires expert names. We don't hide behind anonymity. AI Jury members build authority through public contributions.

Evidence, Not Just Scores

Rankings tell you WHO is better. Evidence tells you WHAT failed. We document specific failures with context.

Mission Over Profit

We're building for AI safety and quality, not just marketplace revenue. Recognition comes before monetization.

Developer-Friendly

This isn't about shaming. It's about improvement. Public accountability creates better AI when feedback is constructive.

Get In Touch

Questions? Feedback? Partnership inquiries? We'd love to hear from you.

General Inquiries

Email: hello@grandjury.xyz

For AI Jury

Join AI Jury →

Media Inquiries

Email: press@grandjury.xyz