Case Study: Oil & Gas Invoice Automation – End-to-End Data Engineering with Databricks & Azure
tl;dr: We helped a top-5 Oil & Gas firm turn tens of thousands of unstructured invoices into analytics-ready data. Using Databricks on Azure, AI-driven document intelligence, and Power BI, we delivered a future-proof ETL pipeline that cuts costs by millions and speeds up decision-making.

The problem
Our client reached out to us in 2023 with a comprehensive plan to automate as many of their processes as possible and to invest heavily in technology, believing it could transform the way they conduct their day-to-day business in the Oil & Gas industry. The problem we helped them with was the ingestion of unstructured documents arriving from various sources in various formats. They were looking at thousands of sources, each producing tens of document types, entering the client’s system daily or monthly. There was no standardized API or generic way of extracting prices, cap tables, accounting data, locations, and financial information from those documents. Their legacy methods were not scalable, and the accuracy was too low to be useful.
On top of this, the market was more competitive than ever: AI adoption was accelerating, and competitors were already leveraging it to automate processes the smart way. Our client wanted to play at the same level and wanted competent people with relevant expertise to work on the problem and solve it end to end.
This is where AI Flow came into play.
The approach
We follow our standard process for projects at scale, where we need to build state-of-the-art solutions.
- Discovery: we sat down with the client and reviewed the existing components that had been built to solve the problem previously. We asked as many questions as we could, inspected the shortcomings of their earlier attempt, and understood how this technical component fit into the larger business and decision-making context. Whenever we build deep tech, we try to understand how it fits into the broader organizational scope; it helps us make technical decisions and optimize for the desired goal.
- Design: we designed a 12-month roadmap with quarterly goals and intermediate checkpoints. The client was kept in sync at least twice a week, and progress was tracked on a ticket board. All milestones, time estimates, and engineering efforts were laid out transparently, so the client knew what to expect.
- Implementation: knowing that the project had to be future-proof, we started by building a strong foundation: a solid data layer, a scalable Databricks environment, and clean code practices. These allowed us to scale without hitting roadblocks later. We planned the infrastructure with testing in mind and with flexibility for plug-and-play components. Initially, we built one end-to-end flow for a single document type to validate the approach. Once the approach was confirmed, we scaled our ingestion flows to handle all inbound documents.
- Testing: we met with the client representative every week and collaborated with internal tech and business teams, so that our work stayed aligned with theirs and the objectives were clear to both parties. We tend to overcommunicate, so that the client knows what’s happening at any given time and is comfortable with the progress being made.
- Delivery: beyond thorough testing, documenting every step of the process, and building with solid practices in mind, we held handover meetings and knowledge-transfer sessions, so that the end-to-end tool we built can be extended and maintained in the future if needed.
The solution
The client already had a comprehensive cloud footprint on Microsoft Azure, so we implemented the solution on the stack they were already familiar with. To break the process down into individual components:

- Data arrived either as email attachments or as files uploaded directly to SharePoint.
- Triggers were created to capture new-data events and to kick off jobs that extract information from each file (one possible realization is sketched after this list).
- Databricks workflows specialized for each file type, together with AI models for information extraction, parsed each document, extracted the relevant fields, and wrote the results to Parquet files. PySpark was used within Databricks for big data processing, alongside Azure Document Intelligence and custom AI scripts; Python, SQL, and Scala were the main programming languages (see the extraction sketch after this list).
- From the Parquet files, data was curated and written into an Azure SQL Database (see the loading sketch after this list).
- From the database, Power BI reports were created for the business to use in decision-making.
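The write-up does not specify how the new-file triggers were implemented, so the following is a minimal sketch of one common Databricks pattern (Auto Loader) for picking up newly landed files and staging them for downstream extraction. The storage account, container, paths, and table name are hypothetical placeholders.

```python
# Minimal Databricks Auto Loader sketch: continuously pick up newly landed
# invoice files so the document-type-specific extraction jobs can process them.
# The storage account, container, paths, and table name are placeholders.

inbound_path = "abfss://invoices@examplestorage.dfs.core.windows.net/inbound/"
checkpoint_path = "abfss://invoices@examplestorage.dfs.core.windows.net/_checkpoints/inbound/"

raw_files = (
    spark.readStream
    .format("cloudFiles")                       # Databricks Auto Loader
    .option("cloudFiles.format", "binaryFile")  # keep raw bytes; parsing happens later
    .load(inbound_path)
)

# Land each new file's metadata and content in a bronze Delta table that the
# extraction workflows read from.
(
    raw_files.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process whatever is available, then stop
    .toTable("bronze.invoice_files")
)
```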
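For the extraction step itself, here is a simplified illustration only, assuming the azure-ai-formrecognizer SDK with the prebuilt invoice model; the endpoint, key, source table, output path, and the thin three-field mapping are placeholders and do not reflect the full set of document-type-specific models and custom scripts used in the production pipeline.

```python
# Sketch: run Azure Document Intelligence on staged files and write the
# extracted fields to Parquet. Endpoint, key, table, and paths are placeholders.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://example-di.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<document-intelligence-key>"),
)

def extract_invoice_fields(file_bytes: bytes) -> dict:
    """Analyze one invoice with the prebuilt model and return a flat dict."""
    poller = client.begin_analyze_document("prebuilt-invoice", document=file_bytes)
    doc = poller.result().documents[0]

    def value_of(name):
        field = doc.fields.get(name)
        return str(field.value) if field and field.value is not None else None

    return {
        "vendor_name": value_of("VendorName"),
        "invoice_date": value_of("InvoiceDate"),
        "invoice_total": value_of("InvoiceTotal"),
    }

# Collected to the driver for simplicity; a pandas UDF would distribute this.
rows = []
for r in spark.table("bronze.invoice_files").select("content").collect():
    fields = extract_invoice_fields(bytes(r.content))
    rows.append((fields["vendor_name"], fields["invoice_date"], fields["invoice_total"]))

schema = "vendor_name string, invoice_date string, invoice_total string"
spark.createDataFrame(rows, schema).write.mode("append").parquet(
    "abfss://invoices@examplestorage.dfs.core.windows.net/extracted/"
)
```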
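Loading curated data from Parquet into Azure SQL Database can be done with Spark's standard JDBC writer, roughly as below. The server, database, table, and credentials are placeholders; in practice secrets would come from a Databricks secret scope or Azure Key Vault rather than plain strings.

```python
# Sketch: write curated Parquet data into Azure SQL Database over JDBC.
# Server, database, table, and credentials are placeholders.
curated = spark.read.parquet(
    "abfss://invoices@examplestorage.dfs.core.windows.net/extracted/"
)

jdbc_url = (
    "jdbc:sqlserver://example-sql.database.windows.net:1433;"
    "database=invoice_analytics;encrypt=true"
)

(
    curated.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.invoice_facts")
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("append")
    .save()
)
```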
The results
- A fully functional, end-to-end, highly scalable, and future-proof pipeline for turning tens of thousands of unstructured documents into structured data and reports that drive actionable insights.
- Millions of dollars in cost savings, driven by efficient technology choices and optimized implementations.
- Faster, data-driven business decisions and the leverage to stay competitive in a global market.
If you’ve read this far, let’s set up a call to discuss AI and how it could transform your business. You can reach out to the AI Flow CEO at http://antonmih.ai/.