Image-similarity scoring with CLIP for marketplace listings
Counterfeiters got better at writing innocent listing copy. The image often gives them away — same packaging photo, sometimes literally the same JPEG. Here's how we run CLIP-based image-similarity scoring for marketplace counterfeit detection at low per-detection cost.
The text-only era of brand-protection detection is ending. Counterfeiters got better at writing innocent-looking listing copy that doesn’t mention the brand at all, but the product photo is exact-replica imagery, sometimes literally the same JPEG lifted from the genuine listing. This piece covers how Brand Protector runs CLIP-based image-similarity scoring across marketplace surfaces, what the trade-offs look like at scale, and why we ship it as a separate Cloud Run service rather than inline in each scanner.
The signal: visual replicas with sanitised text
About 40% of high-confidence counterfeit detections in our 2026 cohort don’t mention the brand by name in the listing title. Title says something generic (“Premium pet supplement, 60 chews”) and the product photo is a 1:1 copy of the brand’s genuine packaging — same lighting, same angle, sometimes literally the same image with the wordmark Photoshopped out. Text-only detection misses this entirely.
Why CLIP, not perceptual hashing
Two technical options for “is this image similar to a reference image”:
- Perceptual hashing (pHash, dHash) — fast, local, deterministic. Catches exact-copy and minor crop / colour variations. Misses anything more complex than that.
- CLIP embeddings: a vision-language model that maps images and text into a shared embedding space (512-D for the ViT-B variants). Two images of the same product from different angles still land near each other. Slower (~50ms per image) and costs an API call or local GPU.
We landed on CLIP. The angle / lighting / minor-crop variation coverage is worth the per-call cost, and the cost can be amortised via a per-tenant embedding cache keyed by image hash (we keep embeddings for 30 days at tenants/{tid}/embedding_cache/{sha}), so re-scoring the same image across runs is free.
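The caching idea can be sketched as follows. This is a minimal in-memory illustration, not the production code: `EmbeddingCache`, `embed_fn`, and the dict-backed store are hypothetical stand-ins for the Firestore documents at tenants/{tid}/embedding_cache/{sha}, and the choice of SHA-256 over the raw image bytes is an assumption.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day retention, as described above


class EmbeddingCache:
    """In-memory stand-in for a per-tenant embedding cache (hypothetical)."""

    def __init__(self, embed_fn: Callable[[bytes], List[float]],
                 now: Callable[[], float] = time.time) -> None:
        self._embed_fn = embed_fn   # the expensive CLIP call
        self._now = now             # injectable clock, handy for tests
        self._store: Dict[Tuple[str, str], Tuple[float, List[float]]] = {}
        self.misses = 0             # number of times the model was actually called

    def get(self, tenant_id: str, image_bytes: bytes) -> List[float]:
        # Assumption: the cache key is the SHA-256 of the raw image bytes.
        sha = hashlib.sha256(image_bytes).hexdigest()
        key = (tenant_id, sha)
        entry = self._store.get(key)
        if entry is not None and self._now() - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]         # cache hit: no model call
        self.misses += 1
        embedding = self._embed_fn(image_bytes)
        self._store[key] = (self._now(), embedding)
        return embedding
```

One consequence of hashing bytes: only byte-identical files hit the cache. A re-encoded or resized copy of the same photo gets a different hash and pays for a fresh embedding, which is acceptable because the embedding comparison itself still catches it.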
The architecture
Three pieces:
- imgsim_service: a long-running Cloud Run service that hosts the CLIP model in memory (~600MB resident). It receives requests over HTTPS with an ID-token auth check. The model loads once at cold start; subsequent requests are pure inference. We run it under a dedicated service account (imgsim-runner@) with read-only Firestore access, which keeps it separate from the scanners that call it.
- image_clusterer_job: a nightly Cloud Run Job that pulls all detection images from the prior day, embeds them via imgsim_service, and clusters them by cosine-similarity threshold (0.92 default per tenant, configurable). Same-counterfeit-photo-different-seller patterns surface here.
- per-scanner inline call: the marketplace scanners (Amazon, eBay, Walmart, Google Shopping, Apify) call imgsim_service inline during scoring. Per-tenant opt-in: it only fires for tenants who’ve uploaded reference brand assets. If the service is unreachable, the scanner falls back to text-only scoring rather than failing the run.
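The graceful-fallback behaviour in the last bullet might look roughly like this. It is a sketch under assumptions: `score_listing`, the `fetch_similarity` callable, the result fields, and the rule for combining the two scores (taking the maximum) are all hypothetical. The only behaviour taken from the description above is the per-tenant opt-in and the fallback to text-only scoring.

```python
from typing import Callable, Optional


def score_listing(
    text_score: float,
    image_url: Optional[str],
    imgsim_enabled: bool,
    fetch_similarity: Callable[[str], float],
) -> dict:
    """Combine text and image signals, degrading to text-only on failure."""
    result = {"score": text_score, "image_similarity": None, "mode": "text_only"}

    # Per-tenant opt-in: skip the call unless reference assets were uploaded.
    if not imgsim_enabled or image_url is None:
        return result

    try:
        # In production this would be the HTTPS call to imgsim_service.
        similarity = fetch_similarity(image_url)
    except OSError:
        # ConnectionError and TimeoutError are OSError subclasses.
        # Service unreachable: keep the text-only score rather than fail the run.
        result["mode"] = "text_only_fallback"
        return result

    result["image_similarity"] = similarity
    # Hypothetical combination rule: the stronger of the two signals wins.
    result["score"] = max(text_score, similarity)
    result["mode"] = "text_and_image"
    return result
```

Recording the mode on the result makes it possible to tell afterwards which detections were scored without the image signal, for example to re-score them once the service is back.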
Cost guards (real numbers)
Image-similarity isn’t free. Per-tenant per-day caps are the structural defence against runaway cost (a tenant who suddenly uploads 10,000 reference images and asks us to score against the entire Amazon US catalog will burn the per-call budget). Default cap:
- 1,000 imgsim calls per tenant per day (overridable per tenant via tenants/{tid}/usage_caps/image_clustering).
- 30-day embedding cache: the same image hash re-scored against the same reference is free after the first call.
- Global kill switch at system_config/imgsim_kill_switch: a single operator-flippable doc that halts ALL imgsim calls platform-wide. Used during incident response.
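The three guards above can be combined into a single pre-call decision. The sketch below is illustrative only: `CostGuard`, its field names, the order of the checks, and the returned labels are assumptions, and the in-memory dicts stand in for the Firestore documents named in the list. For brevity the cache is keyed by tenant and image hash only, and the daily counter reset is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

DEFAULT_DAILY_CAP = 1_000  # default imgsim calls per tenant per day


@dataclass
class CostGuard:
    kill_switch: bool = False                                     # global halt flag
    cap_overrides: Dict[str, int] = field(default_factory=dict)  # per-tenant caps
    calls_today: Dict[str, int] = field(default_factory=dict)    # per-tenant usage
    cached: Set[Tuple[str, str]] = field(default_factory=set)    # (tenant, sha)

    def decide(self, tenant_id: str, image_sha: str) -> str:
        # 1. The kill switch wins over everything else.
        if self.kill_switch:
            return "blocked_kill_switch"
        # 2. Cached embeddings cost nothing and do not count against the cap.
        if (tenant_id, image_sha) in self.cached:
            return "cache_hit"
        # 3. Enforce the per-tenant daily cap, honouring any override.
        cap = self.cap_overrides.get(tenant_id, DEFAULT_DAILY_CAP)
        used = self.calls_today.get(tenant_id, 0)
        if used >= cap:
            return "blocked_daily_cap"
        # 4. Allowed: count the call and remember the image for next time.
        self.calls_today[tenant_id] = used + 1
        self.cached.add((tenant_id, image_sha))
        return "call_service"
```

Checking the cache before the cap means a tenant who has exhausted the day's budget can still be served from cache, which matches the "free after the first call" framing.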
Threshold calibration
The 0.92 cosine-similarity default for clustering is empirically derived. Anything above ~0.94 is almost certainly the same image. Below ~0.88 starts catching unrelated images that happen to share visual style (white background, similar product category). The 0.92 sweet spot catches angle / lighting variations of the same product without too many false positives.
Per-tenant override at tenants/{tid}/config/image_clustering.threshold lets brand-protection operators tune for their category. Pet supplements (mostly bottles, white background) tolerate a higher threshold; cosmetics (highly visual brand identity, many subtle variants) need a lower one.
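To make the threshold concrete, here is one way threshold-based clustering could work. This is a sketch, not a description of image_clusterer_job: the article does not say which clustering algorithm the job uses, so the connected-components (single-linkage) approach, the `cosine` and `cluster` helpers, and the toy 2-D vectors are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence

DEFAULT_THRESHOLD = 0.92  # the default discussed above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(embeddings: List[Sequence[float]],
            threshold: float = DEFAULT_THRESHOLD) -> List[List[int]]:
    """Group indices whose embeddings are linked by similarity >= threshold."""
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the two groups when they are similar.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Toy vectors: the first two point in nearly the same direction.
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(cluster(vectors))                   # first two grouped together
print(cluster(vectors, threshold=0.999))  # stricter threshold separates them
```

A per-tenant override would simply pass a different `threshold` value. Two caveats of this particular approach: the pairwise loop is quadratic in the number of images, and single-linkage grouping can chain A to C through B even when A and C are below the threshold themselves.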
What we’d build differently next time
- Custom embedding for the brand-protection domain. CLIP is general-purpose; a fine-tuned embedding trained on counterfeit-vs-authentic pairs would probably tighten the threshold by 0.02-0.05. Our training data is just starting to be substantial enough — see the ML-scoring design doc on the public roadmap.
- Vector DB rather than per-tenant embedding cache. Today the cache is a Firestore document per image hash. Works at current scale; will become a read-amplification problem at 100+ tenants. Vertex AI Vector Search is the obvious upgrade path.
- Per-region CLIP models. The base CLIP model is trained on English-leaning web data. Performance on Asian-marketplace listings (Mercari JP, Shopee SE Asia) is slightly worse. A multilingual variant or a per-region fine-tune is the medium-term move.
The architecture is on the public roadmap. If you’re building something similar and want to compare notes, engineering at brandprotector dot io.