Case Study: Legal Intelligence Platform with RAG & GPT-4o
tl;dr: We helped a US law firm ingest gigabytes of mixed-format case files and, within three weeks, launch a chatbot that answers legal queries and predicts case value in near real time. A cost-optimised RAG pipeline, a Pinecone vector store, and GPT-4o power the solution, cutting storage by 75% and keeping median response time under one second.

The problem
In law, each case is unique and comes with its own challenges and contextual information. The documentation for a single matter can be diverse, ranging from plain text and PDFs to scans and documents running to hundreds of pages, much of it unstructured. On top of this, clients want to know as fast as possible whether their case can be settled, and what its approximate value is, based on similar cases.
Humans can't parse and make sense of large volumes of unstructured data, at least not in real time. The data is often "dirty" and needs cleaning, organization, and proper processing. Even though generic information management systems exist, there is still no magic wand that works in every context. The more specific a company's business, the more it needs a bespoke solution to fit its needs.
This is where AI Flow came into play, designing and building an end-to-end system for organizing unstructured information and exposing a chatbot on top of it.
The approach
- Discovery: we took the time to understand what the client wanted to build. Clients are sometimes influenced by the over-promising marketing of product companies and form an idea of what they need, when in fact they need a much more specific solution. We guided the client through the discovery process, asking questions to define the ideal outcome, the underlying need, and the problem they were trying to solve. We also worked to understand the ideal UX once the solution was in place.
- Design: after understanding the setup, we put together a proposal covering the end-to-end flow, time and cost estimates, and the technology stack we planned to use. We walked through each component with the client, making sure they were comfortable with the decisions.
- Implementation: we agreed to start with a POC and scale it up afterwards. Within 3 weeks, we had a fully functional version deployed on a public domain, behind an authentication layer, so the client could test it extensively.
- Delivery: after the first round of quality control, the POC was approved by the client and delivered. Storage requirements were reduced by 75%, and median response time stayed below 1 second.
The solution: data cleaning pipeline, RAG system, LLM engine, and a full-stack Streamlit app
From a technical point of view, we built a data curation pipeline, a RAG system with a vector store for the data chunks, a relational database for storing requests, results, and pricing data, and an LLM layer for running the queries. Everything was wrapped in a Streamlit application with a token-based security system.

- For the data curation pipeline, we had to parse a large amount of HTML data. To optimize for cost and storage, the HTML needed to be cleaned of irrelevant information such as styling, tags, and scripts. The less data we keep, and the higher its share of relevant content, the better the ROI. Building this component was crucial: it cut storage and associated costs by 75%. We used BeautifulSoup to parse the HTML, removing the irrelevant parts and keeping the important information (first sketch after this list).
- For RAG and the vector store, we started with FAISS for the dev and staging versions; having an on-premises index allowed us to test at zero cost. For production, we moved to a Pinecone index. The chunking strategy was optimized for fact-based retrieval (second sketch after this list).
- For inference and embeddings (LLM), we started with a Mistral model on a local Ollama server for the dev and staging environments, which let us test for free. In production, we used the OpenAI API with the "gpt-4o" model. For factual queries we used a lower temperature; for summarization, a higher value (third sketch after this list).
- For the DB, we used PostgreSQL, deployed on a DigitalOcean database cluster, with Prisma for the schema and ORM.
- To put everything together as a web app, we used Streamlit. It is our go-to choice for quick prototypes, offering a good ratio of development speed to quality (final sketch after this list).
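
The cleaning step looked roughly like the sketch below, a minimal version assuming BeautifulSoup's built-in `html.parser`; the exact list of tags we strip here is illustrative, not the production configuration:

```python
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    """Strip markup, scripts, and styling, keeping only the visible text."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Drop elements that carry no legal content (illustrative tag list).
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # Collapse the remaining text into clean, newline-separated lines.
    text = soup.get_text(separator="\n")
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```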
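The dev-time retrieval layer was along these lines. The chunk size, overlap, and the `embed` callable are placeholders, not the tuned production values; `embed` stands in for whichever embedding model the environment uses:

```python
import faiss
import numpy as np

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split cleaned text into small overlapping chunks suited to fact retrieval."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

def build_index(chunks: list[str], embed) -> faiss.IndexFlatL2:
    """Embed each chunk and store the vectors in an in-memory FAISS index."""
    vectors = np.array([embed(c) for c in chunks], dtype="float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index

def retrieve(query: str, index, chunks: list[str], embed, k: int = 4) -> list[str]:
    """Return the k chunks whose embeddings sit closest to the query's."""
    q = np.array([embed(query)], dtype="float32")
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]
```

Swapping FAISS for Pinecone in production changes only `build_index` and `retrieve`; the chunking stays the same.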
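Because Ollama exposes an OpenAI-compatible endpoint, one client can serve both environments. A sketch of the idea, with the environment variable, model names, and temperature values as illustrative assumptions:

```python
import os
from openai import OpenAI

# Dev/staging talks to a local Ollama server via its OpenAI-compatible
# endpoint; production talks to the OpenAI API. APP_ENV is a hypothetical
# switch, not the production configuration.
if os.getenv("APP_ENV") == "production":
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    MODEL = "gpt-4o"
else:
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    MODEL = "mistral"

def answer(question: str, context: str, factual: bool = True) -> str:
    """Run a grounded query: lower temperature for facts, higher for summaries."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0.1 if factual else 0.7,  # illustrative values
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```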
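Finally, a sketch of how the Streamlit front end and the token gate fit together, reusing the `retrieve`, `answer`, `index`, `chunks`, and `embed` names from the sketches above; the secret name is an assumption:

```python
import streamlit as st

st.title("Legal Case Assistant")

# Simple token gate in front of the app; the secret name is illustrative.
token = st.text_input("Access token", type="password")
if token != st.secrets.get("ACCESS_TOKEN"):
    st.warning("Enter a valid access token to continue.")
    st.stop()

question = st.text_input("Ask a question about your case files")
if question:
    # Ground the answer in the closest chunks from the vector store.
    context = "\n\n".join(retrieve(question, index, chunks, embed))
    st.write(answer(question, context))
```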
The results
- An end-to-end information retrieval system, optimized for both factual and summarization queries.
- From hundreds of pages of case files to a chatbot that answers questions in seconds. No human can beat this.
- 75% cheaper than an off-the-shelf solution, sub-second response times, and high response relevance.
- A comprehensive proof of concept that helps the client make decisions about the company's technology direction and the future of our partnership.
If you've read this far, let's set up a call to discuss AI and how it could transform your business. Click here to find out more.