Cutting weeks off the ML cycle: a churn prediction journey with InsightWorker


Data Science May 8, 2026 by Shreyash Methi 10 min read

It was Q1 retrain season. I'd just spent two weeks pulling features from three different databases, another week tuning gradient boosters across four candidate target windows, and was about to start the validation script I'd written from scratch every six months for the past four years. I caught myself thinking: I am about to do — for the eighth time — exactly what I did the last seven times.

That was the moment I stopped retraining the way I'd been taught and started retraining with an agent in the loop. This post is the full, honest journey.

What the manual cycle actually looks like

If you've built propensity, churn, or fraud models in production, you know this rhythm:

Days 1–3: pull customer master, billing, support, and product usage tables. Profile distributions. Decide window definitions (28-day? 90-day? rolling vs. fixed?).
Days 4–7: write the feature engineering. 80+ features across recency / frequency / monetary, product mix, support sentiment, billing health.
Days 8–10: stitch into a training table. Resolve quality issues — timezone bugs, late-arriving billing rows, returning customers who double-count.
Days 11–15: train baselines. Logistic regression, random forest, XGBoost, LightGBM. Cross-validation. Hyperparameter sweeps that you set up by hand because Optuna's defaults don't match your eval shape.
Days 16–18: validation report. Gain charts, lift at decile, calibration plots, holdout simulation against last quarter's bookings.
Days 19–20: packaging. Notebook for batch scoring. Deploy YAML. Model card. Runbook.
Days 21–22: hand off to MLOps and start answering review questions.

Three weeks for a model the business already had a clear shape for. Not because the work was hard — because there was a lot of it, and 70% of it was the same boilerplate I'd written six months earlier in a slightly different shape.

Where InsightWorker fits in

InsightWorker isn't an AutoML platform. It's an agent with tools — it can read schemas (db_describe_table, db_query), write files, run shell commands and Python, generate notebooks, commit to git, and chain those things together with reasoning. Which means: every step in that 22-day cycle that I was doing in a Jupyter notebook by hand can be done by the agent, with me reviewing and steering instead of typing.

I'm not handing it the keys. I'm handing it the keyboard. I still own the target definition, the business logic, and the sign-off. But I stopped writing the SQL.

The full example: rebuilding the 30-day churn model

Same business problem I'd done seven times before (predict which subscribers will cancel in the next 30 days), but with InsightWorker driving. Total time: 2 days end-to-end, including review and committee prep, compared to the typical three weeks.
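To make "cancel in the next 30 days" concrete, here is a minimal sketch of how such a label could be computed for one subscriber. The function name, the snapshot-date framing, and the inclusive upper bound are assumptions for illustration; the post does not spell out the exact business definition.

```python
from datetime import date, timedelta
from typing import Optional


def churn_label(snapshot: date, cancel_date: Optional[date], window_days: int = 30) -> int:
    """Return 1 if the subscriber cancels within `window_days` after `snapshot`."""
    if cancel_date is None:
        return 0  # no cancellation on record
    # Cancellations on or before the snapshot are not future churn; in practice
    # those customers would usually be filtered out before labeling.
    return int(snapshot < cancel_date <= snapshot + timedelta(days=window_days))


snapshot = date(2026, 1, 1)
print(churn_label(snapshot, date(2026, 1, 20)))  # cancels inside the window
print(churn_label(snapshot, date(2026, 3, 1)))   # cancels after the window
print(churn_label(snapshot, None))               # never cancelled
```

Whether the window is inclusive, and whether it starts at the snapshot or at the next billing date, is exactly the kind of decision that stays with the modeler.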

Step 1 — Schema discovery and column profiling (8 minutes)

I started with this prompt:

> List the tables in our warehouse that carry customer behavior signals over the last 90 days. For the top candidates, profile column distributions and missing-value rates. Skip system / staging schemas.

InsightWorker called db_list_tables against Snowflake, ranked by recency of writes and row volume, and returned 14 candidate tables. Then it ran db_describe_table + sample queries against the top 6, surfacing column-level stats: nulls, cardinality, range. It flagged three quality issues I needed to know about before going further (a UTC vs. local-time bug in the billing table I'd patched last quarter that had crept back in, plus two columns that had been silently renamed).

The agent didn't fix the data. It told me what was wrong and asked whether to proceed or pause. I told it to skip those two columns and keep going.
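The column-level stats mentioned above (nulls, cardinality, range) are simple to compute once sample rows are in memory. The sketch below is not InsightWorker's code; it assumes the sample query results arrive as a list of dictionaries, and the table contents are made up.

```python
from typing import Any, Dict, List


def profile_columns(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Compute null rate, cardinality, and min/max for every column in `rows`."""
    columns = sorted({key for row in rows for key in row})
    profile: Dict[str, Dict[str, Any]] = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        profile[col] = {
            "null_rate": 1 - len(present) / len(rows) if rows else 0.0,
            "cardinality": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return profile


# Hypothetical sample rows from a billing table.
sample = [
    {"customer_id": 1, "plan": "free", "mrr": 0.0},
    {"customer_id": 2, "plan": "pro", "mrr": 49.0},
    {"customer_id": 3, "plan": None, "mrr": 49.0},
    {"customer_id": 4, "plan": "pro", "mrr": None},
]
for column, stats in profile_columns(sample).items():
    print(column, stats)
```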

Step 2 — Feature proposal with rationale (12 minutes)

> Propose features for a 30-day churn model. Justify each based on the distributions you saw. Group by category (RFM, product, support, billing). Aim for ~50 features.

It returned 47 features with a one-line rationale per feature, grouped by category. I removed 4 (couldn't be computed in time for scoring) and added 6 specific ones the business team had asked about. The agent didn't argue — it accepted the edits and moved on.
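For readers less familiar with the RFM category, this is roughly what three such features look like. It is a hedged Python sketch of the idea, not one of the 47 proposed features; the event tuple layout, the 90-day window, and the function name are all assumptions.

```python
from datetime import date
from typing import Dict, List, Tuple

# (customer_id, event_date, amount): a hypothetical flattened billing feed.
Event = Tuple[str, date, float]


def rfm_features(events: List[Event], as_of: date, window_days: int = 90) -> Dict[str, Dict[str, float]]:
    """Recency (days since last event), frequency (count), and monetary (sum) per customer."""
    features: Dict[str, Dict[str, float]] = {}
    for customer_id, event_date, amount in events:
        age = (as_of - event_date).days
        if age < 0 or age >= window_days:
            continue  # skip rows dated after as_of (leakage) or outside the window
        f = features.setdefault(
            customer_id, {"recency_days": float(age), "frequency": 0.0, "monetary": 0.0}
        )
        f["recency_days"] = min(f["recency_days"], float(age))
        f["frequency"] += 1
        f["monetary"] += amount
    return features


events = [
    ("a", date(2026, 1, 1), 10.0),
    ("a", date(2026, 1, 20), 5.0),
    ("a", date(2026, 2, 5), 99.0),   # dated after as_of, ignored
    ("b", date(2025, 6, 1), 7.0),    # older than the window, ignored
]
print(rfm_features(events, as_of=date(2026, 1, 31)))
```

The `age < 0` guard is the scoring-time constraint in miniature: a feature that needs data from after the scoring date cannot be computed in time.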

Step 3 — Feature SQL pipeline (25 minutes including review)

> Generate the feature view as Snowflake SQL. Use the corrected mapping for billing_v2.amount. Validate against today's data and output a column quality report.

Out came 380 lines of well-structured SQL with CTEs per category, a final SELECT joining everything, and consistent NULL handling. It then ran a profiling query against the view and dropped a markdown report flagging three columns where >5% of rows were null and one feature whose distribution looked suspiciously bimodal. I investigated: the bimodal feature turned out to be a real signal (free vs. paid plans had genuinely different shapes), so I kept it, and I dropped the three high-null columns.
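The two checks in that report can be approximated in a few lines. The null threshold is the 5% from the post; the bimodality test is an assumption, since the post does not say how the agent detected it. The sketch uses Sarle's bimodality coefficient, which equals 5/9 for a uniform distribution and rises toward 1 for two well-separated clusters.

```python
from typing import List, Optional

NULL_THRESHOLD = 0.05      # the >5% cutoff mentioned in the post
BIMODAL_THRESHOLD = 5 / 9  # coefficient value for a uniform distribution


def null_rate(values: List[Optional[float]]) -> float:
    return sum(v is None for v in values) / len(values)


def bimodality_coefficient(values: List[float]) -> float:
    """Sarle's coefficient from population moments: (skewness^2 + 1) / kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3.0 for a normal distribution)
    return (skewness ** 2 + 1) / kurtosis


def quality_flags(values: List[Optional[float]]) -> List[str]:
    flags = []
    rate = null_rate(values)
    if rate > NULL_THRESHOLD:
        flags.append(f"null rate {rate:.1%} exceeds {NULL_THRESHOLD:.0%}")
    present = [v for v in values if v is not None]
    # Skip the shape check for constant columns, where the variance is zero.
    if len(set(present)) > 1 and bimodality_coefficient(present) > BIMODAL_THRESHOLD:
        flags.append("distribution may be bimodal")
    return flags


# Made-up feature: two usage clusters (think free vs. paid) plus some nulls.
sessions = [1.0] * 45 + [20.0] * 45 + [None] * 10
for flag in quality_flags(sessions):
    print(flag)
```

A flag like this only says "look here". As the free vs. paid example shows, a bimodal shape can be a real signal, not a defect.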

Step 4 — Baseline notebook (12 minutes generation, 22 minutes training)

> Generate a notebook that trains XGBoost on this feature view. Use temporal splits: train on data up through 60 days ago, validate on 60–30 days ago, test on the most recent 30 days. Output ROC, gain chart, calibration plot. Save the notebook under notebooks/churn_baseline.ipynb.

InsightWorker generated a clean notebook: pandas pulls from the feature view, a proper temporal split (this matters; a random split lets the model train on rows dated after the ones it is tested on, which inflates every metric), an XGBoost fit with sane defaults, and eval cells for each chart. I ran it. ROC AUC 0.83 on test, lift at top decile 4.2x. Already as good as last quarter's production model for a fraction of the work.
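The split described in the prompt is worth seeing in code, because the boundaries are where leakage sneaks in. This is a minimal stdlib sketch, not the generated notebook; the `snapshot_date` key and the choice of which side each boundary day falls on are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List


def temporal_split(rows: List[dict], as_of: date, date_key: str = "snapshot_date") -> Dict[str, List[dict]]:
    """Train: older than 60 days. Valid: 60 to 31 days ago. Test: the most recent 30 days."""
    valid_start = as_of - timedelta(days=60)
    test_start = as_of - timedelta(days=30)
    split: Dict[str, List[dict]] = {"train": [], "valid": [], "test": []}
    for row in rows:
        d = row[date_key]
        if d > as_of:
            continue  # rows dated after as_of are not usable
        if d < valid_start:
            split["train"].append(row)
        elif d < test_start:
            split["valid"].append(row)
        else:
            split["test"].append(row)
    return split


as_of = date(2026, 3, 31)
rows = [{"snapshot_date": as_of - timedelta(days=n)} for n in (100, 61, 60, 45, 31, 30, 10, 0)]
parts = temporal_split(rows, as_of)
print({name: len(part) for name, part in parts.items()})
```

Every training row is strictly older than every validation row, and every validation row is strictly older than every test row, which is the property a random split cannot guarantee.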

Step 5 — Hyperparameter tuning (1 hour 14 minutes, mostly compute)

> Sweep XGBoost hyperparameters with Optuna. 50 trials. Optimize PR AUC (we care more about precision at low base rates). Stop trials early on validation degradation. Save the study to studies/churn_xgb_v3.db.

It wrote the Optuna driver, ran 50 trials in parallel via joblib, surfaced the top 3 configurations and recommended one based on a stability + score tradeoff. ROC AUC 0.86, PR AUC up 8% over the baseline, top-decile lift 4.6x. I accepted the recommendation. The agent committed the chosen config to a YAML file.
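The post does not say how the agent weighed stability against score, so the following is only one plausible reading: penalize each configuration's mean validation score by its fold-to-fold spread and pick the best penalized value. The trial names, scores, and `stability_weight` parameter are invented for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List


def pick_config(fold_scores: Dict[str, List[float]], stability_weight: float = 1.0) -> str:
    """Choose the config with the highest mean score minus a penalty for spread across folds."""

    def penalized(name: str) -> float:
        scores = fold_scores[name]
        return mean(scores) - stability_weight * pstdev(scores)

    return max(fold_scores, key=penalized)


# Made-up PR AUC values per validation fold for two finalists.
candidates = {
    "trial_a": [0.50, 0.30, 0.49],  # higher mean, but one weak fold
    "trial_b": [0.41, 0.42, 0.40],  # slightly lower mean, very consistent
}
print(pick_config(candidates))                        # favors the stable config
print(pick_config(candidates, stability_weight=0.0))  # pure score ranking
```

Setting `stability_weight` to zero recovers plain "best mean wins", which makes it easy to see how much the recommendation depends on the penalty.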

Step 6 — Validation against last quarter's bookings (35 minutes)

> Validate the tuned model against the past three months of held-out cohorts. Compare lift at top 1%, 5%, 10% to last quarter's production model. Report which segments improved and which regressed.

This was the part I'd been dreading. The agent ran scoring against the holdout, joined back to actual churn outcomes from the last 90 days, and produced a markdown report with comparison tables and per-segment cuts. The new model gained meaningful lift in the small-business segment and matched on enterprise. It regressed slightly on free-tier customers, which the agent flagged for me to investigate before deploying. I dug into it: after a UI change, free-tier customers had a different feature distribution, and my features didn't capture it. Real finding, surfaced cleanly.
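Lift at the top k% is the metric doing most of the work in that comparison, so here is a minimal sketch of how it is usually computed: the churn rate among the highest-scored fraction of customers divided by the overall churn rate. The scores and labels are toy values, not results from the post.

```python
from typing import Sequence


def lift_at(scores: Sequence[float], labels: Sequence[int], top_fraction: float) -> float:
    """Churn rate in the top-scored fraction divided by the overall churn rate."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))  # always score at least one customer
    top_rate = sum(label for _, label in ranked[:k]) / k
    return top_rate / base_rate


# Toy holdout: 10 customers, 2 churners, both ranked at the top.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
for fraction in (0.1, 0.2, 0.5):
    print(fraction, lift_at(scores, labels, fraction))
```

Running the same function on the old and new model's scores, once per segment, gives the kind of per-segment comparison table described above.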

Step 7 — Deploy artifacts (45 minutes)

> Generate the batch scoring notebook (deploys to our daily scoring DAG), a model card following our internal template at .insightworker/playbooks/model-card.md, and the MLOps handoff README. Commit to a feature branch.

Three artifacts, all in the standard shapes my MLOps team expects, all committed to a branch. PR was clean. Reviewer asked two questions about the free-tier regression I'd already flagged in the model card. We deployed three days later.

What I still owned

It wasn't all gravy. Three things stayed mine, and I think they should:

The target definition. InsightWorker proposed candidate windows, but I had to pick the one matching our business definition of churn. Domain knowledge isn't going anywhere.
Sanity checks on feature semantics. The agent wrote correct code. Whether a 7-day support contact average is the right operationalization for 'customer frustration' was my call.
The committee narrative. The agent generated charts and tables. The committee wanted a story about why the regression in the free-tier segment was acceptable. I wrote that.

If you're worried an agent will replace your judgement, don't be. It replaces your typing.

What I tried next

After churn worked, I tried InsightWorker on five other model rebuilds over the next two months. The times listed are roughly the agent-driven cycle vs. my typical from-scratch baseline:

Lead propensity scoring — same shape as churn, ran in ~1.5 days vs. my old 2 weeks. Easy win.
Cross-sell next-best-offer — multi-class with a long tail of products. ~3 days vs. ~3 weeks. The longest part was the offer-eligibility logic, which I owned anyway.
Fraud anomaly detection on transaction streams — required more domain customization for the unsupervised baseline. ~5 days vs. 6 weeks. The agent generated a competent isolation forest plus an autoencoder; I picked iForest after looking at the false-positive volume.
Demand forecasting (weekly product-level) — needed a Prophet vs. LightGBM bake-off; the agent ran it cleanly. ~4 days vs. ~3 weeks.
Survival modeling for renewal timing — most complex of the five, but the agent produced a respectable Cox proportional hazards baseline I could extend. ~5 days vs. ~5–6 weeks.

Where I think this goes next

I haven't tried these yet, but the pattern fits cleanly:

Recommender systems — collaborative filtering or two-tower retrieval. The boilerplate is enormous and largely identical across implementations.
NLP classification — intent, sentiment, document type. Off-the-shelf transformer fine-tuning loops are textbook agent territory.
Topic modeling on support transcripts — BERTopic-style pipelines with embedding generation, clustering, and per-cluster summarization.
Image classification for claims documents — preprocessing + transfer learning on a frozen backbone is a single notebook the agent can produce well.
Model monitoring + drift detection — this one I think is the killer app. Schedule it daily via the InsightWorker scheduler, get a digest when feature distributions or prediction calibration drifts past thresholds.
A/B test sample size + analysis automation — the math is well-defined; what eats time is hooking it up to the right metric tables and writing the readout.
Feature store maintenance — stale-feature detection, lineage audits, cost attribution. Boring, important, perfect for an agent.
Model card generation for compliance / model risk — given a deployed model, regenerate the documentation in the regulator-acceptable template. Big win for any team in a regulated industry.
Reproducibility audits — given a deployed model and its training run pointer, regenerate the artifacts and verify they match. Audit-ready in minutes instead of days.
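For the monitoring and drift detection item in the list above, one common building block is the population stability index (PSI), which compares a feature's binned distribution at training time with its distribution today. This is a sketch of the general technique, not an InsightWorker feature; the bin count, the epsilon floor, and the sample data are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index over equal-width bins fitted on `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values into edge bins
        # Floor empty bins at eps so the logarithm below stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]  # feature values at training time
shifted = [v + 0.3 for v in baseline]     # the same feature after a shift
print(psi(baseline, baseline))
print(psi(baseline, shifted))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though the right thresholds depend on the feature and on how costly a false alarm is.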

What I'd tell another data scientist starting today

Pick your next retrain. Set aside two days. Point InsightWorker at your warehouse, give it your target definition, and watch where it gets stuck. The places it gets stuck are the places where your domain knowledge is real; the places it doesn't get stuck are the places you've been wasting your time on. That's the actual value of running this experiment: even if you didn't keep using the agent, you'd learn which 70% of your job was boilerplate.

I keep using it. The hours I got back went into thinking about what the model couldn't capture. That's the part of the job I actually like.

I put together the full screenshot walkthrough as a use-case page if you want to see what the prompts and outputs actually look like at each step.

See the ML model creation use case