About this episode
Bobby Neelon and John Kalfayan from Collide break down the messy reality of getting data ready for RAG, why PDFs are dumpster fires for unstructured data, how extraction changes depending on whether you're dealing with drilling surveys or handwritten logs, and why chunking strategy matters more than people think. They walk through embeddings, vector databases, MCP servers for pulling external data without leaking internal info, and why good metadata and folder structure actually make AI deployments way easier. Plus the hard truth that AI isn't a silver bullet for bad data management, and the crap-in-crap-out problem is getting worse because now it can hallucinate on top of the crap.

Join the conversation shaping the future of energy. Collide is the community where oil & gas professionals connect, share insights, and solve real-world problems together. No noise. No fluff. Just the discussions that move our industry forward. Apply today at collide.io
0:00 - Introductions and RAG overview
3:15 - Document identification and classification challenges
8:40 - Extracting data from unstructured PDFs
13:25 - Real world examples of messy data formats
18:50 - OCR paired with vision models for extraction
22:10 - Chunking strategies and when to use each
26:35 - Embeddings and vector databases explained
30:20 - MCP servers and external data integration
35:45 - Getting data AI-ready with metadata and structure
40:30 - Text-to-SQL approaches and database access
44:15 - Handling duplicates and M&A data integration
48:50 - How AI learns context over time
53:40 - Why traditional data management matters more than ever

https://twitter.com/collide_io
https://www.tiktok.com/@collide.io
https://www.facebook.com/collide.io
https://www.instagram.com/collide.io
https://www.youtube.com/@collide_io
https://bsky.app/profile/digitalwildcatters.bsky.social
https://www.linkedin.com/company/collide-digital-wildcatters