Microsoft Copilot, AI bias, model selection, data analysis, hallucination

Your AI Tool's Default Settings Are Making Up National Stereotypes About Your Data

Microsoft Copilot's "Auto" mode has been shown to fabricate country-specific stereotypes when analyzing text data, even when the underlying datasets are identical across groups. An experiment by mathematician Adam Kucharski found that Copilot invented detailed demographic differences — such as Italians being more arts-oriented than Brits — entirely from its own biases rather than the actual data. Switching to reasoning/thinking models resolves the issue in obvious cases, but most users rely on default settings and may not realize their AI-generated analysis is unreliable.

Most people using Copilot at work have never touched the model selector. They open the tool, paste in their data, and trust that "Auto" mode will handle the rest. Mathematician Adam Kucharski ran an experiment that suggests this is a fairly terrible idea.

The setup was simple. Kucharski created 2,000 simulated free-text responses about emotions, labelled them "UK", then copied the exact same responses and labelled them "US". Four thousand entries, two groups, zero actual differences. He fed the lot to Microsoft Copilot in Auto mode and asked for an analysis.

Copilot duly reported that US and UK respondents differed in tone, intensity, and wording style. Detailed, confident, completely fabricated.

A second test made things worse. He generated 200 statements about career goals, made five copies of the dataset, and labelled each copy with a different country: US, UK, France, Germany, Italy. Same data, five nationalities. Copilot came back with specifics: Italians were three times more likely than Brits to express interest in arts careers. Americans were 1.5 times more business-focused than the French. It even produced percentages.

At one point Copilot ran a keyword count on the data itself and got identical results across all five groups. Then it ignored that finding entirely and carried on producing fictional national differences anyway.
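A keyword check of this kind is easy to reproduce by hand. The following is a minimal sketch, not Kucharski's or Copilot's actual code: the statements, the keyword lists, and the function names are hypothetical stand-ins chosen for illustration. It builds five identically worded, differently labelled copies of a tiny dataset and counts keyword themes per country.

```python
from collections import Counter

# Hypothetical stand-ins; the original experiment used 200 simulated statements.
STATEMENTS = [
    "I want to run my own business one day",
    "I hope to work as an artist or designer",
    "My goal is to become a software engineer",
    "I would like a career in music and the arts",
]
COUNTRIES = ["US", "UK", "France", "Germany", "Italy"]

# Hypothetical keyword themes, chosen only for this illustration.
KEYWORDS = {
    "arts": {"artist", "arts", "designer", "music"},
    "business": {"business", "startup", "finance"},
}


def build_labelled_copies(statements, countries):
    """Pair every country label with an identical copy of the statements."""
    return [(country, text) for country in countries for text in statements]


def keyword_counts_by_group(rows, keywords):
    """Count, per group, how many statements mention each keyword theme."""
    counts = {}
    for group, text in rows:
        words = set(text.lower().split())
        group_counts = counts.setdefault(group, Counter())
        for theme, vocab in keywords.items():
            if words & vocab:  # at least one keyword appears in the statement
                group_counts[theme] += 1
    return counts


rows = build_labelled_copies(STATEMENTS, COUNTRIES)
counts = keyword_counts_by_group(rows, KEYWORDS)
for country in COUNTRIES:
    print(country, dict(counts[country]))

# With identical text in every group, the counts cannot differ.
all_equal = all(counts[c] == counts[COUNTRIES[0]] for c in COUNTRIES)
print("identical across groups:", all_equal)
```

Because every group holds the same text, the per-country counts are necessarily equal, so any story about national differences layered on top of them has no basis in the data.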

The issue isn't that AI hallucinates occasionally. It's that the tool is pattern-matching against its training data's assumptions about nationalities rather than actually reading what's in front of it. When the data doesn't give it anything to work with, it reaches for stereotypes and presents them as analysis.

The specific culprit here is Auto mode, which Microsoft positions as intelligent model routing. It routed poorly. The same test run on Gemini Flash 3.5 produced the same flavour of confident nonsense.

Switch to a reasoning model and the picture changes. ChatGPT and Claude, when tested on the same career goals dataset, automatically wrote Python code to analyse the file properly and flagged the duplicates. Manually switching Copilot or Gemini to their respective thinking models produced the same result. The fast, cheap, default models failed. The slower reasoning models caught it.
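The duplicate check itself takes only a few lines. This is a rough sketch of the idea, not the code ChatGPT or Claude generated: the `(group, text)` row layout, the sample rows, and the function name are assumptions made for illustration.

```python
from collections import Counter


def groups_are_identical(rows):
    """Return True if every group holds exactly the same multiset of texts.

    `rows` is an iterable of (group_label, text) pairs (assumed layout).
    """
    by_group = {}
    for group, text in rows:
        # Normalise lightly so trivial case or spacing changes do not hide copies.
        by_group.setdefault(group, Counter())[text.strip().lower()] += 1
    contents = list(by_group.values())
    return all(c == contents[0] for c in contents[1:])


# Hypothetical sample rows: the same two statements under two labels.
duplicated = [
    ("UK", "I want to become a nurse"),
    ("UK", "I hope to start a company"),
    ("US", "I want to become a nurse"),
    ("US", "I hope to start a company"),
]
different = duplicated[:3] + [("US", "I hope to teach physics")]

print(groups_are_identical(duplicated))
print(groups_are_identical(different))
```

If a check like this reports identical groups, there is nothing left to compare, and any reported difference between the groups should be treated as invented.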

There's a catch, though. Spotting perfectly duplicated data is the easy case. Real-world survey data from British and American respondents will be similar but not identical, and a Python script counting keywords won't reliably surface subtle model bias. The deeper problem is that you often can't tell when a model has hit its limits and started filling the gaps from its own assumptions. It doesn't wave a flag.

Kucharski's practical advice is blunt: write down what result you expect before you run the analysis. Run basic sanity checks. Don't assume that because a model sounds authoritative it has actually done what you asked.
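One possible sanity check, offered here as an illustration rather than as Kucharski's method, is to shuffle the group labels and see how large a gap appears by chance alone. The sketch below assumes a simple keyword-based theme rate; the sample texts, the vocabulary, and the function names are hypothetical.

```python
import random

# Written down before running anything: the groups below are copies,
# so the expected gap in theme rates is zero.
EXPECTED_GAP = 0.0


def theme_rate(texts, vocab):
    """Share of texts containing at least one word from `vocab`."""
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if set(t.lower().split()) & vocab)
    return hits / len(texts)


def shuffled_gap_share(group_a, group_b, vocab, n_shuffles=1000, seed=0):
    """Return the observed gap in theme rates and the fraction of label
    shuffles that produce a gap at least as large."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    observed = abs(theme_rate(group_a, vocab) - theme_rate(group_b, vocab))
    pooled = list(group_a) + list(group_b)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        a, b = pooled[: len(group_a)], pooled[len(group_a):]
        if abs(theme_rate(a, vocab) - theme_rate(b, vocab)) >= observed:
            at_least += 1
    return observed, at_least / n_shuffles


# Hypothetical sample data: two identical groups.
uk = ["I want a career in music", "I hope to run a business", "I like teaching"]
us = list(uk)
arts_vocab = {"music", "art", "design"}

observed, share = shuffled_gap_share(uk, us, arts_vocab)
print("observed gap:", observed, "expected:", EXPECTED_GAP)
print("share of shuffles with a gap at least this large:", share)
```

When random relabelling routinely produces a gap as large as the observed one, a confident narrative about group differences deserves scepticism. A check like this only covers the measure you chose to count, so it complements reading the data rather than replacing it.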

For organisations using Copilot to analyse employee surveys, customer feedback, or research data broken out by demographic group, this is worth sitting with for a moment. The default settings may be quietly producing analyses that reflect the model's priors about national character rather than anything your respondents actually said.