The Story of WalkingRAG

Multi-modal agentic search on complex data

WalkingRAG (2023–2024) was one of the first — I'm not certain, but it might have been the very first — products to implement multi-modal, graph-based search: what we'd now call agentic search. Multi-modal meant it could handle images, diagrams, even video at one point.

Below is one of the demos, where we read and trace a path through an IKEA manual with no words in it at all — and we did it all with GPT-4 and the very first vision models. What a time.

This is the story of WalkingRAG, told in tweet threads, Hugging Face articles, and videos. I'll editorialize where it helps.

§The beginning

@hrishioa·Dec 11, 20231,520 ♥ ↗

A week ago one of our customers handed us 1000 pages of this (10,000 more to come), and asked us for RAG solution. We said yes - because we said yes before we saw the document. But we've solved it - and there's a chance it's a strong improvement on all RAG SoTA.

@hrishioa·Dec 11, 2023View on X ↗

We tried a lot of approaches that got us closer, but the key to unlocking it was thinking like a human: How do I do research? I'll usually find a relevant page or two, but they're not always the right ones. This is the best of RAG today. What if we kept looking?

@hrishioa·Dec 11, 2023View on X ↗

The pages you find will usually add some more context to the answer, but they'll have key information on the pages you need. What if we could 'walk' to those pages? This is what our system does. In simple terms, we build a graph of relationships at ingest, then use context from those pages to find new ones until we've collected enough for an answer.

A page from the 1000-page technical document that started it all

§Proof of concept

We called it a demo at the time, but looking back this was really the first proof of concept — the thing that proved to me WalkingRAG was actually possible.

@hrishioa·Dec 13, 20231,324 ♥ ↗

No more waiting - we finally we have a demo for multi-modal, 'walking' RAG! Still blows my mind - this is an AI that's reading complex diagrams in a document like a human, 'looking' at pages, then 'walking' to more relevant pages until it's found an answer.

@hrishioa·Dec 13, 2023View on X ↗

For each loop, about seven different prompts run: one large text and 3 vision prompts collect information from pages.

@hrishioa·Dec 13, 2023View on X ↗

All of them are stream processed, so facts are immediately streaming to the user - there's usually one every 2s. They all have the ability to 'short-circuit' and terminate all prompts if they find the full answer. Backpressure is applied immediately so we don't waste tokens.

@hrishioa·Dec 13, 2023View on X ↗

After the first iteration, no more embedding-based retrieval from the question - we've practically discarded it. All future iterations use relevant pages unearthed from the previous one.

§The first demo

A couple of weeks later, the first real, working demo of WalkingRAG.

@hrishioa·Dec 31, 2023172 ♥ ↗

New year, new demo for Vision based walking RAG! Bit of a rush job, but V1 got done before the end of year so this is me recording a quick one. More to come

§A proper product

And then the first proper demo of it as a product — with an actual UI.

@hrishioa·Jan 12, 2024304 ♥ ↗

WalkingRAG is finally out! After so many versions and prototypes the UI is done, I couldn't help but record a quick demo Glad to have this out as we close the week, pretty excited to get this in the hands of customers 🚀

§Making it local

The breakthrough that made me happiest came in February: the visual indexing pass — the step that reads every page — finally ran end to end, fast, on an open model.

@hrishioa·Feb 1, 2024621 ♥ ↗

Vision models are having their StableDiffusion moment 🤯 For the first time ever, our visual indexing pipeline for WalkingRAG now runs end-to-end, at 2 seconds per page! I am beyond excited - what does this mean? Let me explain, and compare different sizes of models👇

@hrishioa·Feb 1, 2024View on X ↗

Truly visual document indexing and retrieval is something I've been working on for a while, and something that I still see scarce amounts of research going into. With multi-modal LLMs we can now skip the OCR step and embed a truly visual understanding of what's on a page.

@hrishioa·Feb 1, 2024View on X ↗

I think we're going to see an incredible acceleration from here on. RAG is going to change, and so are most now-stable pipelines in the LLM ecosystem.

§One more thing

Around the same time we pulled off something I'd been chasing since the first release of WishfulSearch. It's less a WalkingRAG demo than pure data magic — the coolest, least sexy thing we'd done, and worth a watch.

@hrishioa·Feb 3, 2024422 ♥ ↗

This is the coolest thing we've ever done - and probably the least sexy demo This has been my white whale ever since the first release of WishfulSearch, to know if this is even possible. Let me explain - or you can just watch it work👇

§Paying it forward

WalkingRAG ran counter to the ideas of the time — those of chunking and embedding. These three articles cover the techniques behind what we did differently. Even today, building double-sided graphs and connecting them with embeddings is an underexplored idea that I believe could significantly improve today's systems.

Better RAG 1: Basics — how RAG actually works, the complexities, and the problems worth solving.
Better RAG 2: Single-shot is not good enough — the WalkingRAG idea itself: giving a model the ability to request new information and walk to it.
Better RAG 3: The text is your friend — transformations on the query and source data to make retrieval better.

Hrishi Olickel

1 Feb 2024

hrishi[o][a][at]gmail.com

Not a newsletter

Hi! If you drop your email/Twit/Bluesky thing down here, I'll let you know next time I write something you should read. (Manually, if I can find the time)