The Last 30 Metres: What 400 AI-Simulated Travellers Taught Me About Retention in Metasearch
A six-week concept project building an agentic retention system using Llama 3.1, n8n, and LLM-simulated user personas — and what the reasoning traces revealed that the numbers alone never could.
There is a moment in every metasearch session that I have started to think of as the last 30 metres. The user has searched. They have filtered. They have compared prices across three OTAs, opened two hotel pages in new tabs, and come back to the results. They are, by any reasonable measure, ready to book. And then they leave.
Not because they found something better elsewhere. Often, not for any reason you can instrument at all. They just stopped. And in the economics of travel metasearch — where you earn only when they click out, where you have spent real money acquiring that session, and where the average user books fewer than four trips a year — that exit is expensive twice: once today, and again in three months when you have to re-acquire them for their next trip.
This is the problem I set out to study. Not the click-out drop-off, which every metasearch team tracks obsessively, but the longer-arc retention problem: why do users who have already demonstrated intent — who searched, filtered, compared — not come back? And when they do not, what can a well-timed, context-aware nudge actually accomplish?
To investigate this, I built an end-to-end agentic retention pipeline for a fictional travel metasearch platform I called Wayfr. I used Meta's Llama 3.1 8B model, running locally via Ollama, to simulate both the users themselves and their responses to intervention. I used n8n to orchestrate the experiment. And I ran two waves of a three-group A/B test over six weeks on 400 simulated users.
What I found was more interesting — and more humbling — than I expected.
Starting with the commercial problem, not the feature
The first thing I want to be honest about is that I did not start this project by saying "let's build a nudge system." That is a solution posture. I started by asking a harder question: is low retention actually the right problem to solve, or is it a symptom of something upstream?
This distinction matters enormously for prioritisation. If users are not returning because the product failed to demonstrate value in their first session — if the search results were poor, or the price disparity on click-out destroyed their trust — then no retention nudge will fix that. You would be adding expensive personalisation infrastructure on top of a broken core experience. The return on that investment would be approximately zero.
So before designing any intervention, I ran a structured diagnostic on the simulated Wayfr user base. Four questions:
Is low retention real and measurable? Yes. Only 22% of users who clicked out at least once returned within 30 days. For a platform where the average user has 3–4 bookable trips per year, the addressable retention opportunity is significant: each recovered user represents roughly €145 in LTV over a typical 8-month active window.
Is it acquisition quality or product failure? The simulation was configured so that retention rates were consistent across signup sources: organic, paid, and referral users all showed similar drop-off patterns. That removes "we're acquiring the wrong users" as the primary explanation by construction, which places the problem in the post-session experience rather than at the top of the funnel.
Where exactly does the drop-off happen? Here is where it got interesting. The biggest single segment was not users who had a bad experience. It was users who had an incomplete one. They searched, they compared — and then life intervened. The trip was 60 days out. They needed to think about it. They forgot Wayfr existed. This is the low-frequency use problem unique to travel: unlike a SaaS tool you open daily, a metasearch platform has no natural re-entry point unless you create one.
Is solving it commercially worth it? The unit economics said yes, decisively. At a marginal cost of less than €0.10 per personalised nudge (Llama 3.1 running locally), the break-even required recovering fewer than one additional booking per cohort of 400 users. Even conservative estimates of intervention lift would comfortably exceed that threshold.
Only after this diagnostic did I feel confident that building a nudge system was the right intervention — not a distraction from a deeper product problem.
Building personas that could say no
The most unconventional part of this project was using Llama 3.1 not just to generate nudge content, but to simulate the users themselves — and critically, to simulate their rejection of nudges.
Most synthetic data approaches assign response probabilities statistically. A 20% open rate here, a 5% click-through there, drawn from industry benchmarks and applied uniformly. This is fine for testing infrastructure, but it cannot answer the question I actually cared about: why do certain users respond to certain messages, and what does a failed personalisation attempt do to trust?
To answer that, I needed users who could reason. So I built a schema for each persona archetype — a structured profile covering demographics, travel behaviour, platform behaviour, communication preferences, retention risk, and behavioural constraints — and then used Llama to generate four core archetypes for Wayfr:
Priya is a budget solo backpacker in her late twenties. She plans trips 3–6 weeks ahead, filters aggressively by price, and abandons when the displayed rate exceeds her budget or when she detects price disparity on click-out. She books 4–6 trips a year, which makes her medium-LTV but high-frequency. She responds to price-drop signals. She does not respond to anything that sounds like a travel magazine.
Marcus is a business traveller in his late thirties. He searches same-day or 48 hours out, filters by rating, and converts quickly. His price sensitivity is low and his time sensitivity is high. He books 10–15 times a year: high LTV, but nearly impossible to re-engage via email because by the time a nudge reaches him, he has already booked through whoever he found first.
The Hoffmanns are a family of four planning 1–2 trips a year, 60–90 days in advance. They compare extensively, are highly sensitive to price disparity, and make decisions as a unit. Their trust threshold is high and their tolerance for friction is low. They are the hardest users to win back once lost.
Kemal is a luxury traveller who books 3–4 times a year, filters immediately to 5-star, and is driven by trust signals rather than price signals. He comes back on his own schedule. Nudges have limited effect on him unless they speak directly to curation and exclusivity.
Each of these personas was then used to spawn 100 simulated users with attribute variation around the archetype — slightly different lead times, slightly different price sensitivities, slightly different session durations — to create a realistic distribution rather than four identical clones.
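To make the spawning step concrete, here is a minimal sketch of how one archetype could be turned into 100 varied users. The field names, the Gaussian jitter, and the spread values are my illustrative assumptions, not the project's actual schema; only Priya's €35 budget, her 3–6 week lead time, and her 4–6 trips a year come from the article.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Persona:
    # Hypothetical subset of the persona schema, for illustration only.
    name: str
    lead_time_days: int       # typical days between search and trip
    budget_eur: float         # nightly budget ceiling
    price_sensitivity: float  # 0 = indifferent, 1 = highly sensitive
    trips_per_year: int


def spawn_users(archetype: Persona, n: int, seed: int) -> list[Persona]:
    """Create n users whose attributes vary around the archetype."""
    rng = random.Random(seed)  # seeded so a cohort can be regenerated exactly
    users = []
    for i in range(n):
        lead = rng.gauss(archetype.lead_time_days, 0.2 * archetype.lead_time_days)
        budget = rng.gauss(archetype.budget_eur, 0.15 * archetype.budget_eur)
        sensitivity = rng.gauss(archetype.price_sensitivity, 0.1)
        users.append(replace(
            archetype,
            name=f"{archetype.name}-{i:03d}",
            lead_time_days=max(0, round(lead)),
            budget_eur=round(max(10.0, budget), 2),
            price_sensitivity=min(1.0, max(0.0, sensitivity)),  # clamp to [0, 1]
        ))
    return users


# Assumed archetype values: 0.9 price sensitivity is a placeholder.
priya = Persona("priya", lead_time_days=30, budget_eur=35.0,
                price_sensitivity=0.9, trips_per_year=5)
users = spawn_users(priya, n=100, seed=42)
```

Seeding the generator is the design choice worth keeping whatever the distributions are: it lets the same 400 users be rebuilt for both waves.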
The simulation loop worked like this: for each user-session decision point, I gave Llama the full persona profile and the platform state ("you have searched for hostels in Lisbon, the cheapest result is €42/night, your budget is €35") and asked it to reason through the decision and return an action plus a brief explanation. These reasoning traces became one of the most valuable outputs of the entire project.
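A single decision step in that loop might look like the sketch below. It assumes Ollama is listening on its default local port and uses its `/api/generate` endpoint with JSON output; the prompt wording, the three action names, and the `decide` helper are hypothetical stand-ins rather than the project's actual prompt templates.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
ALLOWED_ACTIONS = {"click_out", "keep_browsing", "leave"}  # assumed action set


def build_prompt(persona: dict, platform_state: str) -> str:
    """Combine the persona profile and the platform state into one prompt."""
    return (
        "You are simulating a traveller with this profile:\n"
        f"{json.dumps(persona, indent=2)}\n\n"
        f"Current situation: {platform_state}\n\n"
        'Reply with JSON only: {"action": "click_out" | "keep_browsing" | "leave", '
        '"reasoning": "one or two sentences"}'
    )


def call_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send one non-streaming generation request and return the model's text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    ).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


def decide(persona: dict, platform_state: str, llm=call_ollama) -> dict:
    """Return the action plus reasoning trace, or mark the reply as invalid."""
    raw = llm(build_prompt(persona, platform_state))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict) or parsed.get("action") not in ALLOWED_ACTIONS:
        return {"action": "invalid", "reasoning": raw}  # keep the raw text for review
    return {"action": parsed["action"], "reasoning": str(parsed.get("reasoning", ""))}
```

Passing the model call in as the `llm` argument means the loop can be exercised with a stub before any real inference is run, and malformed replies are logged rather than silently dropped.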
Designing the experiment: three groups, two waves, one uncomfortable trade-off
I structured the experiment in two waves, each with three groups.
Wave 1 targeted one-and-done users: people who had clicked out at least once but had not returned in 30 days. The intervention channel was email. Wave 2 targeted browsers: users who had searched and filtered but never clicked out. The intervention channel was in-app nudge, triggered on their next session.
Three groups in each wave:
- Control: no intervention
- Treatment A: a well-written generic nudge ("Still planning your trip to Lisbon?")
- Treatment B: a Llama-generated nudge personalised to the user's specific session context — the destination they searched, the price range they filtered to, the hotel type they spent most time on
The three-group design was a deliberate choice, and it is worth explaining why. A two-group test (control vs. personalised) would tell you whether personalisation works. But the commercially relevant question is whether personalisation adds value over and above basic automation — because generic nudges are cheap and require no AI infrastructure. If Treatment B barely outperforms Treatment A, the business case for the personalisation layer disappears. You need the three groups to answer the right question.
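One simple way to split users across the three groups is a deterministic hash of the user ID, sketched below. This is an assumption about how assignment could be done, not a description of the n8n workflow; the group labels and the `assign_group` helper are hypothetical.

```python
import hashlib
from collections import Counter

GROUPS = ("control", "treatment_a", "treatment_b")


def assign_group(user_id: str, wave: str) -> str:
    """Map a user to a group; the same user and wave always give the same group."""
    digest = hashlib.sha256(f"{wave}:{user_id}".encode()).hexdigest()
    return GROUPS[int(digest, 16) % len(GROUPS)]


# Rough balance check on 200 hypothetical Wave 1 user IDs.
counts = Counter(assign_group(f"user-{i}", "wave1") for i in range(200))
```

Hashing gives roughly, not exactly, equal groups. With cohorts this small, stratifying the assignment by persona type would be a reasonable refinement so that no group ends up short of Hoffmanns or Marcuses.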
The uncomfortable trade-off I spent the most time on was email cadence. An aggressive sequence over 14 days would almost certainly produce higher absolute lift — more touchpoints, more opportunities to re-engage. But it would also increase unsubscribes, permanently removing users from future acquisition lists. At a €40 estimated CAC per trial signup, each unsubscribe destroys roughly €40 in future marketing value.
I modelled this explicitly. At a 5% unsubscribe rate among the 200 eligible users in Wave 1, you destroy approximately €400 in future acquisition value. Against a potential LTV recovery of €133 per re-engaged user, the break-even is roughly three additional re-engagements. That is achievable — but it is tight. A single frequency cap rule (no more than one email per user every three days) was the design decision that managed this risk. It is a small thing, but it is the kind of small thing that makes the difference between a system that is commercially sustainable and one that cannibalises its own future.
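The arithmetic in that paragraph, and the frequency cap itself, fit in a few lines. The sketch below uses the figures quoted above (200 eligible users, 5% unsubscribes, €40 CAC, €133 recovered LTV); the function names are mine.

```python
from datetime import datetime, timedelta


def unsubscribe_break_even(eligible_users: int, unsubscribe_rate: float,
                           cac_eur: float, ltv_per_reengaged_eur: float):
    """Return (acquisition value destroyed, re-engagements needed to offset it)."""
    lost_value = eligible_users * unsubscribe_rate * cac_eur
    return lost_value, lost_value / ltv_per_reengaged_eur


lost, needed = unsubscribe_break_even(200, 0.05, 40.0, 133.0)
# lost is about 400 euros; needed is just over 3 re-engagements

MIN_GAP = timedelta(days=3)  # the frequency cap: one email per user every three days


def may_send_email(last_sent, now: datetime) -> bool:
    """Allow a send only if the user has had no email within the cap window."""
    return last_sent is None or now - last_sent >= MIN_GAP
```

Keeping the cap as a single check in front of every send makes it hard for a later sequence change to bypass it by accident.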
What the data said — and what the reasoning traces said louder
The aggregate results were encouraging. Across Wave 1, Treatment B produced a D14 return rate of 23% compared to 16% for Treatment A and 9% for the control group. Across Wave 2, Treatment B produced a D14 activation rate of 29% compared to 19% for Treatment A and 11% for control. Both differences were statistically significant at p < 0.05 with adequate power for the sample sizes used.
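For readers who want to check a comparison like this themselves, a two-proportion z-test is one common choice; I am not claiming it is the test used in the analysis notebooks. The sketch below assumes roughly 67 users per group in Wave 1 (200 eligible users split three ways) and reconstructs the counts from the quoted rates, so both the group size and the counts are assumptions.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Two-sided z-test for a difference between two proportions (pooled variance)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Assumed Wave 1 counts: 6/67 is about 9%, 11/67 about 16%, 15/67 about 22%.
z_b_vs_control, p_b_vs_control = two_proportion_z_test(6, 67, 15, 67)
z_b_vs_a, p_b_vs_a = two_proportion_z_test(11, 67, 15, 67)
```

Under these assumed counts, Treatment B against control falls below 0.05 while Treatment B against Treatment A does not. With groups this small the outcome is sensitive to the exact numbers, so it is worth rerunning with the real per-group counts.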
But the aggregate numbers were almost the least interesting thing I found.
Priya responded strongly to Treatment B — but only when the personalisation was accurate. When Llama correctly referenced her budget constraint and framed the nudge around a price drop on the specific destination she had searched, her response rate was 34%. When the generated email was contextually accurate but tonally off (describing a "boutique experience" when she was clearly looking for a hostel), her response rate dropped to 14%. The persona knew the difference. The reasoning trace made this explicit: "The price information is useful, but the framing sounds expensive. I'd assume there's been a bait-and-switch and ignore it."
Marcus was essentially unreachable via email. His response rate to Treatment B was 11% — barely above the control group's 9%. The reasoning traces explained why with uncomfortable clarity: "I searched for a hotel in Frankfurt on Monday night for a Tuesday check-in. By Wednesday morning when this email arrived, I had already booked through the OTA directly. This email is three days too late." The experiment had a timing dimension I had not adequately controlled for. For same-day bookers, a 3-day frequency cap means you will almost always be too late. The right intervention for Marcus is not email at all — it is a same-session in-app prompt or nothing.
The Hoffmanns produced the most revealing finding of the entire experiment. Treatment B underperformed Treatment A by 4 percentage points for this persona group. The personalised email referenced their search context — a family holiday in Porto, 3 nights, family rooms — but Llama generated a hotel suggestion at €220/night, which was above their inferred budget range. The generic email (Treatment A) at least made no claims about knowing them. The personalised email claimed to know them and then got it wrong. The reasoning trace: "They found out what we searched for and then recommended something we obviously can't afford. That feels intrusive and also useless. I'd trust this platform less now, not more."
This is the finding I consider the most commercially important in the project. Personalisation can actively damage trust if the inference is inaccurate. The asymmetry is severe: a good personalisation produces a modest lift; a bad personalisation produces active churn. This means the expected value of personalisation is not simply "average lift times coverage." It depends critically on the confidence of the underlying inference — and that confidence should be measured and gated before any personalised message is sent.
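The asymmetry can be written as a small expected-value calculation. This is a simplified model of my own, in which a message is either right (modest lift over the generic nudge) or wrong (some harm relative to it), and the numbers plugged in are illustrative: 7 points is the aggregate Wave 1 gap between Treatment B and Treatment A, and 4 points is the Hoffmanns shortfall.

```python
def expected_lift(confidence: float, lift_if_right: float, harm_if_wrong: float) -> float:
    """Expected lift over the generic nudge, given the chance the inference is right."""
    return confidence * lift_if_right - (1.0 - confidence) * harm_if_wrong


def break_even_confidence(lift_if_right: float, harm_if_wrong: float) -> float:
    """Confidence at which personalising is exactly as good as staying generic."""
    return harm_if_wrong / (lift_if_right + harm_if_wrong)


# Illustrative inputs only; they are not measured per-message lift and harm.
threshold = break_even_confidence(lift_if_right=0.07, harm_if_wrong=0.04)
```

The useful property is the shape rather than the specific threshold: the larger the harm from a wrong inference relative to the lift from a right one, the higher the confidence needed before a personalised message is worth sending.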
What I missed and what I would build differently
I did not build a personalisation confidence gate. Before sending any Treatment B message, the system should assess: how much usable, high-confidence context do I actually have about this user? If the answer is "a single search session from 30 days ago," the system should fall back to Treatment A rather than risk a Hoffmann-style trust violation. I did not build this, and the results showed why it matters.
I treated user segments as static. The experiment assigned users to segments at the start and kept them there for six weeks. In reality, a "browser" who receives an in-app nudge and returns to search again — but still does not click out — has given you new contextual information. The system should update the segment assignment and the nudge strategy dynamically. Building this feedback loop into the n8n workflow would have made the experiment more realistic and the findings more robust.
The timing dimension was an afterthought. I set a 3-day frequency cap without first asking: which day after inactivity is the optimal intervention window? For Priya-type users with 3–6 week lead times, day 5 of inactivity might be optimal: the trip is still live in her mind and prices are in flux. For the Hoffmanns, planning 90 days out, the optimal window might be day 14, when enough time has passed that a gentle reminder feels helpful rather than intrusive. A timing experiment nested within Wave 1 would have produced richer and more actionable findings.
I under-invested in the in-app channel design. The in-app nudge in Wave 2 outperformed the email nudge in Wave 1 in every group: Treatment B's lift over control was 18 percentage points in-app against 14 by email, and its lift over Treatment A was 10 points against 7. Because the two waves targeted different segments, this is a signal rather than a clean channel comparison, but it suggests that channel fit matters more than content sophistication. I should have given Wave 2 equal experimental rigour to Wave 1 rather than treating it as a follow-on.
The commercial recommendation
Putting the findings together into a recommendation forced the most useful thinking in the entire project, because aggregate lift numbers and persona-level findings point in different directions and you cannot optimise for both simultaneously.
The aggregate case for deploying Treatment B is clear for browsers (Wave 2 in-app): a 10 percentage point lift in D14 activation over Treatment A, negligible guardrail metric impact (dismiss rate was actually lower for Treatment B than Treatment A, likely because more relevant nudges feel less interruptive), and a marginal cost close to zero. Deploy this for all browser-segment users. The ROI is asymmetric in your favour.
For one-and-done users via email (Wave 1), the recommendation is conditional. Deploy Treatment B for users whose session data supports a high-confidence personalisation: specifically, users with a clearly identified destination, a defined price range from their filter history, and a trip date that is still in the future. For users where any of these signals is absent or ambiguous, default to Treatment A. The Hoffmanns data makes the risk of low-confidence personalisation concrete enough to design around.
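The three signals in that rule translate directly into a gate. The sketch below is one possible shape for it, with a hypothetical `SessionContext` record; the extra budget check on the suggested price is my addition, aimed at the failure seen with the Hoffmanns.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SessionContext:
    # Hypothetical summary of what the platform knows from a user's last session.
    destination: Optional[str]
    price_min_eur: Optional[float]
    price_max_eur: Optional[float]
    trip_date: Optional[date]


def choose_treatment(ctx: SessionContext, today: date) -> str:
    """Personalise only when all three signals are present; otherwise stay generic."""
    has_destination = bool(ctx.destination and ctx.destination.strip())
    has_price_range = (
        ctx.price_min_eur is not None
        and ctx.price_max_eur is not None
        and ctx.price_min_eur <= ctx.price_max_eur
    )
    trip_in_future = ctx.trip_date is not None and ctx.trip_date > today
    if has_destination and has_price_range and trip_in_future:
        return "treatment_b"
    return "treatment_a"


def suggestion_within_budget(ctx: SessionContext, suggested_price_eur: float) -> bool:
    """Reject generated suggestions priced above the user's filtered range."""
    return ctx.price_max_eur is not None and suggested_price_eur <= ctx.price_max_eur
```

A gate like this runs before generation, while the budget check runs after it, so a message has to pass both the "do we know enough" test and the "did the model respect what we know" test.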
Do not use email for Marcus-type users at all. The timing window makes it structurally ineffective. Build a same-session in-app prompt for high-frequency business travellers instead — triggered during the session itself if they have been browsing for more than 8 minutes without clicking out.
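Taken together, the channel rules reduce to a small routing function. This is a sketch of the recommendation, not something the experiment ran; the persona and segment labels and the returned channel names are placeholders.

```python
def route_nudge(persona_type: str, segment: str,
                session_minutes: float = 0.0,
                clicked_out_this_session: bool = False) -> str:
    """Pick a channel from persona type, segment, and live session state."""
    if persona_type == "business":
        # Marcus-type users: same-session prompt after 8 minutes of browsing, never email.
        if session_minutes > 8 and not clicked_out_this_session:
            return "in_app_same_session"
        return "none"
    if segment == "browser":
        return "in_app_next_session"  # Wave 2 pattern: searched, never clicked out
    if segment == "one_and_done":
        return "email"                # Wave 1 pattern: clicked out, did not return
    return "none"
```

Putting the business-traveller rule first makes the exclusion explicit: no later branch can route a Marcus-type user to email.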
Across the addressable segments, a full deployment of this system (with the confidence gate, the channel routing by persona type, and the in-app optimisation) would be expected to produce approximately a 12–15% improvement in D30 retention among non-activated users. Reading that as 12–15 percentage points on a cohort of 400 gives 48–60 recovered users; at a blended LTV of €145 per recovered user, that represents roughly €7,000–€8,700 in incremental LTV. Against an infrastructure cost of less than €50 per cohort to run, this is one of the better-yielding product investments available to a metasearch retention team at this scale.
What this project actually taught me
The honest answer: not what I expected it to teach me.
I expected to learn about A/B test design, about personalisation lift, about LLM-as-orchestrator patterns. And I did learn those things. But the most durable learning was about the relationship between personalisation and trust — specifically, that these are not simply correlated but can actively work against each other when the personalisation inference is wrong.
Every PM I have met talks about personalisation as an unambiguous good. More context, better relevance, higher conversion. The Hoffmanns data suggests this is only true above a confidence threshold that is rarely discussed and almost never measured. Below that threshold, personalisation is not neutral — it is actively worse than saying nothing, because it makes an implicit claim ("we understand you") that your product then fails to honour.
I also learned something about the value of qualitative data in simulation. The reasoning traces Llama returned were, in many cases, more actionable than the quantitative outcomes. Knowing that Priya dismissed an email because the tone was wrong even though the price was right is a product insight you can act on immediately. Knowing that Marcus had already booked before the email arrived is a signal to rethink the channel entirely. These are things the open rate and click-through rate would never have told me.
And I learned something about what makes a simulation trustworthy. The most important design decision in this project was not which model to use or how to structure the n8n workflow. It was building persona profiles with explicit, testable behavioural constraints — and then validating that the simulated behaviour was coherent against those constraints before running a single experimental trial. Garbage in, garbage out applies to LLM-based simulations with at least as much force as it applies to statistical models.
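A coherence check of that kind can be very small. The sketch below tests one hypothetical constraint, that a persona should not click out when the cheapest price exceeds its budget ceiling, and reports the share of simulated decisions that respect it. The field names and the 10% tolerance are assumptions.

```python
def is_coherent(persona: dict, state: dict, action: str) -> bool:
    """A click-out above the persona's budget ceiling counts as incoherent."""
    ceiling = persona["budget_eur"] * (1 + persona["budget_tolerance"])
    if action == "click_out" and state["cheapest_price_eur"] > ceiling:
        return False
    return True


def coherence_rate(records: list) -> float:
    """Share of (persona, state, action) records that respect the constraint."""
    if not records:
        raise ValueError("no simulated decisions to validate")
    coherent = sum(is_coherent(p, s, a) for p, s, a in records)
    return coherent / len(records)


# Hypothetical records: Priya with a 35 euro budget and 10% tolerance.
priya = {"budget_eur": 35.0, "budget_tolerance": 0.10}
records = [
    (priya, {"cheapest_price_eur": 42.0}, "leave"),      # coherent
    (priya, {"cheapest_price_eur": 42.0}, "click_out"),  # violates the ceiling
    (priya, {"cheapest_price_eur": 33.0}, "click_out"),  # coherent
]
rate = coherence_rate(records)
```

Running a check like this over a batch of dry-run decisions, and setting a minimum acceptable rate before the experiment starts, is one way to make "validated against constraints" a measurable step.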
Closing thought
Travel metasearch is a structurally hard retention problem. Low purchase frequency, high competitive intensity, zero switching cost — these are the conditions under which users have every reason to forget you exist. The conventional response is to outspend on re-acquisition. The more interesting response, and the one I tried to explore here, is to earn the return visit by demonstrating that you understand the user better than they might have expected from a platform they visited once six weeks ago.
The simulation showed that this is achievable — but only with a precision that most teams do not currently apply. Getting the channel right matters more than getting the content right. Getting the confidence of the personalisation inference right matters more than getting the personalisation itself right. And the reasoning behind user rejection, not just the fact of it, is where the most actionable product insight lives.
The last 30 metres, it turns out, are less about the nudge and more about whether the nudge earns the right to exist.
This is a concept project built over six weeks using Meta Llama 3.1 8B (via Ollama), n8n (self-hosted), Python, and SQLite. All user data is synthetic. The platform "Wayfr" is fictional. Full methodology, persona schemas, prompt templates, and analysis notebooks are available on GitHub. Questions and pushback welcome.
Tools used: Llama 3.1 8B · Ollama · n8n · Python 3.11 · SQLite · Jupyter · pandas · scipy