This is a re-post of one of my favorite articles, which I originally published on 7/23/2018 on my old Blogger blog.
I think I would really like to revisit automating the extraction of text from PDF files. There is a lot of untapped value many companies could be leveraging but aren’t.
Recently, I received a request from a team member to find a way to:
- Extract a large amount of text from a large PDF file.
- Parse the extracted text and write specific elements into an Excel file.
- Format the Excel file into separate tabs, one for each type of report extracted, and add column headers.
- Create validation code in an Excel macro that calls an Ajax web service connected to a data warehouse and validates the data based on an ID in one of the columns.
Pretty cool, right? Just finished the prototype today (7/31/2018)!
In this article I’ll cover the first step of this task: using a free tool called Ghostscript to extract text from a PDF file.
What is Ghostscript?
Ghostscript is a high-performance PostScript and PDF interpreter and rendering engine. It supports a broad set of page description languages (PDLs) and offers conversion capabilities covering the PDF, PostScript, PCL and XPS languages.
Ghostscript has been under active development for over 20 years and offers a versatile feature set. It can be deployed across a wide range of platforms, modules and end uses: embedded in hardware, as an engine in document management systems, as part of cloud solution integrations, and as an engine in leading PDF generators and tools.
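How to extract text from a PDF using Ghostscript
Please note that the PDF file must contain actual text. The txtwrite device used below extracts existing text and does not perform OCR, so an image-only (scanned) PDF will not work.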
– Download Ghostscript
– Install Ghostscript
– Copy your PDF file to the bin directory where you installed Ghostscript.
– Open a command line window in the bin directory (run it as Administrator if you get an access error).
– Command format: gswin64 -sDEVICE=txtwrite -o[Output File Name] [Input File Name]
– Sample Ghostscript command: gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf
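If you want to automate this step instead of typing the command by hand, here is a minimal Python sketch that wraps the same command. This is not part of my original prototype, just one way it could be done. The function names `build_txtwrite_command` and `extract_text` are made up for illustration, and I'm assuming the console build `gswin64c` is on your PATH (pass `gswin64`, or `gs` on Linux/macOS, if that is what you have). I'm also assuming the output can be read as UTF-8, with undecodable bytes replaced rather than raising an error.

```python
import shutil
import subprocess
from pathlib import Path


def build_txtwrite_command(pdf_path, txt_path, executable="gswin64c"):
    """Build the same argument list as the sample command above.

    'executable' is an assumption: use whichever Ghostscript binary
    is installed on your machine (gswin64, gswin64c, gs, ...).
    """
    return [
        str(executable),
        "-sDEVICE=txtwrite",   # text-extraction output device
        f"-o{txt_path}",       # output file, written the same way as in the sample
        str(pdf_path),         # input PDF
    ]


def extract_text(pdf_path, txt_path, executable="gswin64c"):
    """Run Ghostscript and return the extracted text as a string."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {executable}")
    # check=True raises CalledProcessError if Ghostscript exits with an error
    subprocess.run(build_txtwrite_command(pdf_path, txt_path, executable), check=True)
    # Assumption: decode as UTF-8 and replace anything that does not decode
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("test.pdf", "output2.txt")` would run the equivalent of the sample command and hand back the text, ready for the parsing step.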
I hope this helps someone!
~ Cyber Abyss