Google via Remotasks
Improving LLM Outputs and Documenting Hallucination Risk

01 The Problem
Google Bard produced frequent hallucinations and factual inaccuracies across historical, social, and cultural domains, making its outputs unreliable as a source of verified information. When prompted with leading or misleading inputs, Bard consistently generated confidently stated but inaccurate responses, particularly on nuanced historical topics requiring contextual judgment.
The core problem was that Bard could not distinguish factual accuracy from plausible-sounding output, so human SME evaluation was required to identify where and why the model failed.
02 Logan’s Role
- Applied subject-matter expertise in History to evaluate Bard’s outputs across articles, essays, journal papers, and multiple-choice responses, scoring each against verified historical sources.
- Engineered advanced prompts to expose model weaknesses, including structured inputs that nudged Bard toward outputs that were 99% accurate but contained one critical inaccuracy that invalidated the entire response.
- Submitted structured evaluations through scored questionnaires in each working session, documenting output quality, reasoning gaps, and hallucination patterns to inform model improvement by Google’s development team (a sketch of one such evaluation record appears below).
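The actual Remotasks questionnaire format is not public; the sketch below is only a minimal, hypothetical illustration of how a scored evaluation record of this kind might be structured. All field names, the score scale, and the example prompt are assumptions made for illustration, not the schema actually used.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical evaluation record; field names and scales are illustrative,
# not the actual Remotasks/Google questionnaire schema.
@dataclass
class BardEvaluation:
    prompt: str                       # input given to the model
    output_format: str                # e.g. "essay", "journal paper", "multiple choice"
    domain: str                       # e.g. "History"
    factual_accuracy: int             # assumed 1-5 score against verified sources
    hallucinations: List[str] = field(default_factory=list)  # specific fabricated claims
    reasoning_gaps: List[str] = field(default_factory=list)  # missing contextual judgment
    invalidating_error: bool = False  # one critical inaccuracy that voids the response

    def is_reliable(self) -> bool:
        """Flag a response as unreliable if any single critical error is present,
        regardless of how accurate the rest of the output is."""
        return (
            self.factual_accuracy >= 4
            and not self.hallucinations
            and not self.invalidating_error
        )


# Example: an output that is largely accurate but contains one critical error.
record = BardEvaluation(
    prompt="Summarize the causes of the 1905 Russian Revolution.",
    output_format="essay",
    domain="History",
    factual_accuracy=4,
    hallucinations=["Attributed Bloody Sunday to the wrong year"],
    invalidating_error=True,
)
print(record.is_reliable())  # False: one critical inaccuracy invalidates the response
```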
03 Results
- Evaluated Bard’s outputs over a six-month period in daily working sessions of four to six hours, documenting measurable improvement in output accuracy over the course of the engagement.
- Despite these improvements, Bard never achieved full contextual accuracy; advanced prompt engineering continued to expose reasoning gaps and invalid outputs throughout the evaluation period.
- Demonstrated that LLM output without human judgment remains insufficient for domain-specific use, and established a repeatable evaluation methodology across multiple output formats and prompt types.
