Working with Claude to build a merchant-risk model, then everything on top

After the hackathon, I went back to the merchant-risk problem with Claude. The pattern across many sessions was the same: I taught Claude what a fraud analyst actually looks for, we generalised what was learned into a regression model, then repeated for the next class of fraud and the next. By the end we had built around six regression models in total, each one targeting a different fraud vector.

The whole thing was me and Claude. Not a data-science team in the usual sense; one of me, several Claude sessions a day, and a Databricks Model Context Protocol (MCP) server the company had installed that gave me direct query access to every transaction, dispute and merchant table I needed.

The MCP made the difference. I have built risk models before. The bit that usually slows me down is not the modelling technique; it is the data access, the back-and-forth with a data scientist to get the right joins or the right cuts. With the MCP I could just ask. The experience was closer to having a team of very fast data scientists available to explore lots of things at once than to building one model on my own.

This post is what we built across those sessions, what worked well, and what is still a work in progress.

What we modelled

Each model was a different angle on the same question: is this merchant doing what they say they are doing? The raw signal came from two places. One was the merchant’s website, crawled via Jina, which returns clean markdown for any URL. The other was the transactional data in Databricks.

The website signals had to be engineered into something a model could use. We built features for:

Contact information. Real corporate domain or free webmail, address present, phone present.
Compliance pages. Whether terms of service, privacy policy and refund policy pages are present, and how closely each one resembles a stock template.
Site depth. Internal navigation count. Shell sites have almost none, and the log-transformed internal-link count turned out to be one of the strongest single features in the model.
Brand and content. Social media references, error-page handling, custom build versus template.
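As a rough illustration of how a few of the signals above might be pulled out of crawled markdown, here is a minimal sketch. It is not the production feature code: the function name, the regexes, the webmail list and the phrase lists are all assumptions made for the example, and log1p is used as the log transform only so that a site with zero internal links still gets a defined value.

```python
import math
import re
from urllib.parse import urlparse

# Assumed, incomplete list of free webmail providers.
FREE_WEBMAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)")          # markdown [text](target)
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")  # captures the domain
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")         # loose phone pattern

# Hypothetical phrases used to detect compliance pages.
COMPLIANCE_PHRASES = {
    "has_terms": ("terms of service", "terms and conditions"),
    "has_privacy": ("privacy policy",),
    "has_refund": ("refund policy", "returns policy"),
}


def bare_host(netloc: str) -> str:
    """Lower-case a host name and drop a leading 'www.'."""
    host = netloc.lower()
    return host[4:] if host.startswith("www.") else host


def website_features(markdown: str, site_url: str) -> dict:
    """Turn one crawled page (as markdown) into a flat feature dict."""
    site_host = bare_host(urlparse(site_url).netloc)
    internal_paths = set()
    for target in LINK_RE.findall(markdown):
        parsed = urlparse(target)
        if parsed.scheme not in ("", "http", "https"):
            continue  # skip mailto:, tel: and similar
        if not parsed.netloc and not parsed.path:
            continue  # skip pure #fragment links
        if bare_host(parsed.netloc) in ("", site_host):
            # Relative and same-host links count as internal navigation.
            internal_paths.add(parsed.path.rstrip("/") or "/")

    email_domains = {d.lower() for d in EMAIL_RE.findall(markdown)}
    text = markdown.lower()
    features = {
        "has_corporate_email": any(d not in FREE_WEBMAIL for d in email_domains),
        "uses_free_webmail": any(d in FREE_WEBMAIL for d in email_domains),
        "has_phone": PHONE_RE.search(markdown) is not None,
        "internal_link_count": len(internal_paths),
        "log_internal_links": math.log1p(len(internal_paths)),
    }
    for name, phrases in COMPLIANCE_PHRASES.items():
        features[name] = any(p in text for p in phrases)
    return features
```

Calling website_features(page_markdown, "https://merchant.example") returns a dict that can be joined to the transactional features by merchant. Template similarity, error-page handling and the custom-build check are left out here because they need more than one page to compute.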

From the transactional side we added dispute rates, IP geolocation patterns, merchant category code (MCC) mismatches, and authorisation behaviour. That gave twenty-three engineered features in the early versions, narrowed to sixteen in the current one.

For modelling technique we tried logistic regression, random forests, gradient-boosted trees and small neural nets, plus a few ensemble blends. The current version is a two-model blend of a gradient-boosted model (GBM) and a sigmoid model, using sixteen features and trained on a curated sample of 539 merchants chosen to cover the range of issues an analyst would actually want the model to flag. Of those, 101 came from the original hackathon work (69 clearly clean, 32 clearly suspicious) and a further 438 were added later from the payment service provider’s transaction catalogues to widen the coverage. The output is a score across six risk tiers, from “Not Suspicious” to “Extremely Suspicious”, with the strongest contributing signals attached. Twelve versions in, it works well at the individual merchant level.
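To make the blend-then-tier step concrete, here is a small sketch under stated assumptions. The equal blend weight, the tier cut-offs and the four middle tier names are placeholders invented for the example; only the two end-tier names come from the real model. The signal ranking assumes each feature already has a signed contribution to the score, which is one common way to do it and not necessarily how the real model does.

```python
from bisect import bisect_right

# Only the first and last names are real; the middle four are placeholders.
TIERS = [
    "Not Suspicious",
    "Tier 2 (placeholder)",
    "Tier 3 (placeholder)",
    "Tier 4 (placeholder)",
    "Tier 5 (placeholder)",
    "Extremely Suspicious",
]
# Hypothetical cut-offs: five boundaries split [0, 1] into six tiers.
TIER_CUTS = [0.15, 0.30, 0.50, 0.70, 0.85]


def blend_scores(gbm_prob: float, sigmoid_prob: float, gbm_weight: float = 0.5) -> float:
    """Weighted average of the two model probabilities (weight is an assumption)."""
    for value in (gbm_prob, sigmoid_prob, gbm_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities and weight must lie in [0, 1]")
    return gbm_weight * gbm_prob + (1.0 - gbm_weight) * sigmoid_prob


def risk_tier(score: float) -> str:
    """Map a blended score to a tier; a score on a cut-off goes to the higher tier."""
    return TIERS[bisect_right(TIER_CUTS, score)]


def top_signals(contributions: dict, k: int = 3) -> list:
    """Names of the k features with the largest absolute contribution."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

An analyst-facing result would then be something like risk_tier(blend_scores(gbm_p, sig_p)) shown alongside top_signals(contribs) for the same merchant.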

From model output to a draft case file

The score and the contributing signals tell an analyst where to look. They do not tell them what to write in the case file. The next layer I built does that.

It is Claude in an investigator role, calling the model as one of its tools alongside Databricks queries, website crawls and dispute lookups. The output is a draft case memo with the headline finding, the supporting evidence, the dispute pattern, the network observations and the recommended next step. The analyst edits and submits that memo rather than writing it from scratch. The model still does the scoring, the agent does the writing, and the two together produce something an analyst can put their name to in a fraction of the time.
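One way to pin down the memo format is to give the agent a fixed structure to fill in. The sketch below is illustrative only: the class name, field names and rendering layout are assumptions, with the sections taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class CaseMemo:
    """Hypothetical container for the sections of a draft case memo."""
    merchant_id: str
    headline: str
    evidence: list = field(default_factory=list)
    dispute_pattern: str = ""
    network_observations: str = ""
    next_step: str = ""

    def render(self) -> str:
        """Render the memo as plain text for the analyst to edit."""
        lines = [
            f"Draft case memo: {self.merchant_id}",
            "",
            f"Headline finding: {self.headline}",
            "",
            "Supporting evidence:",
        ]
        # Fall back to a visible marker so an empty section is not missed.
        lines += [f"  - {item}" for item in self.evidence] or ["  (none recorded)"]
        lines += [
            "",
            f"Dispute pattern: {self.dispute_pattern or '(none recorded)'}",
            f"Network observations: {self.network_observations or '(none recorded)'}",
            f"Recommended next step: {self.next_step or '(to be decided by analyst)'}",
        ]
        return "\n".join(lines)
```

Keeping the sections as explicit fields makes it obvious to the analyst when the agent has left one empty, instead of burying the gap in free text.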

Where it gets harder: portfolio and quarterly review

The next step was lifting the same approach from a single merchant to a whole book. Take the per-merchant analysis, run it across a payment service provider’s portfolio, aggregate the outputs, and surface the systemic patterns: concentration risks, sector clusters, geographic patterns, new fraud typologies appearing across multiple unrelated merchants. Then use the portfolio scan to draft a quarterly business review for the payment provider’s leadership: where things are working well, where the gaps are, what the trend lines look like, what to focus on next.
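The aggregation step can be sketched as a simple roll-up of per-merchant results by sector. Everything named here is hypothetical: the record fields, the 0.7 flag threshold and the choice of sector as the grouping key are assumptions for illustration, and a real scan would group along several dimensions, not just one.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MerchantResult:
    """Hypothetical per-merchant output of the scoring step."""
    merchant_id: str
    sector: str
    volume: float  # processed volume over the review period
    score: float   # blended risk score in [0, 1]


def portfolio_summary(results: list, flag_threshold: float = 0.7) -> list:
    """Per-sector counts and shares, sorted by share of flagged merchants."""
    total_volume = sum(r.volume for r in results)
    by_sector = defaultdict(list)
    for r in results:
        by_sector[r.sector].append(r)

    rows = []
    for sector, items in by_sector.items():
        flagged = [r for r in items if r.score >= flag_threshold]
        sector_volume = sum(r.volume for r in items)
        rows.append({
            "sector": sector,
            "merchants": len(items),
            "flagged": len(flagged),
            "flagged_share": len(flagged) / len(items),
            # Share of the whole book's volume sitting in this sector.
            "volume_share": sector_volume / total_volume if total_volume else 0.0,
        })
    rows.sort(key=lambda row: row["flagged_share"], reverse=True)
    return rows
```

A roll-up like this only surfaces patterns along dimensions chosen in advance, so it can show a sector cluster or a concentration risk but will not, on its own, reveal a typology nobody thought to group by.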

The data access through the MCP made this technically possible; Claude can run the per-merchant analysis at scale and aggregate the outputs. It does well on the established fraud vectors it has been trained on. What it lacks is human curiosity: the ability to notice something new that it has not seen before. Portfolio reviews are most useful for spotting exactly that, so for now we still lean heavily on fraud analysts for the more exploratory work, with the agent’s output as a starting point rather than a finished scan.

What I have learned

A few things stuck.

The data-access tier is the bottleneck. The MCP on Databricks changed how fast I could iterate by an order of magnitude. Every minute saved on data wrangling is a minute spent on the modelling itself.
Variable engineering is where the value is. The choice of model technique mattered less than the choice of what to measure. The strongest single feature in the current version came from a session where Claude and I were just looking at examples and noticing what the shell sites had in common.
Narrower models are more useful. Twelve versions in, the current one is also the most clearly bounded, and that is why it works.
Individual merchant work generalises less easily than I expected. The same toolchain that produces a strong per-merchant case file produces a noisy portfolio scan. The shape of the question changes more between scales than I had assumed.

If you have a Databricks MCP and a problem you have been close to for a long time, why not try something similar?