How to Extract Text from a PDF Using Ghostscript (Command Line)

This is a re-post of one of my favorite articles, which I originally posted on 7/23/2018 on my old Blogger blog.

I would really like to revisit automating the extraction of text from PDF files. There is a lot of untapped value locked in PDFs that many companies could be leveraging but aren’t.

Recently, I received a request from a team member to find a way to:

  1. Extract a large amount of text from a large PDF file.
  2. Parse the extracted text and load specific elements into an Excel file.
  3. Format the Excel file into separate tabs, one for each type of report extracted, and add column headers.
  4. Create validation code in an Excel macro that calls an Ajax web service connected to a data warehouse and validates the data based on an ID in one of the columns.

Pretty cool, right? I just finished the prototype today, 7/31/2018.

In this article, I’ll cover the first step of this task, where I use a free tool called Ghostscript to extract text from a PDF file.

What is Ghostscript?

Ghostscript is a high-performance PostScript and PDF interpreter and rendering engine. It supports a comprehensive set of page description languages (PDLs) and offers conversion capabilities covering the PDF, PostScript, PCL and XPS languages.
Ghostscript has been under active development for over 20 years and offers an extremely versatile feature set. It can be deployed across a wide range of platforms, modules and end uses: embedded in hardware, as an engine in document management systems, integrated into cloud solutions, and as an engine in leading PDF generators and tools.

How to extract text from a PDF using Ghostscript

Please note that the PDF must contain actual text. If the pages are images only (for example, an unprocessed scan), there is no text for Ghostscript to extract.

Steps:
– Download Ghostscript.
– Install Ghostscript.
– Copy your PDF file to the bin directory where you installed Ghostscript.
– Open a command line window in the bin directory (run it as Administrator if you get an access error when running the command).
– Command syntax: gswin64 -sDEVICE=txtwrite -o[Output File Name] [Input File Name]
– Sample Ghostscript command: gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf
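
If you want to script this step rather than type the command each time, here is a minimal Python sketch that wraps the same command. The function names build_gs_command and extract_text are my own and purely illustrative. It assumes gswin64 can be found from where the script runs (on Linux or macOS the executable is usually called gs) and that the output file can be read as UTF-8.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path, txt_path, gs_exe="gswin64"):
    """Build the same argument list as the sample command in the steps.

    gs_exe defaults to the executable name used in this article; pass a
    different name or a full path if that is an incorrect assumption
    for your install.
    """
    return [gs_exe, "-sDEVICE=txtwrite", f"-o{txt_path}", str(pdf_path)]


def extract_text(pdf_path, txt_path, gs_exe="gswin64"):
    """Run Ghostscript and return the extracted text (hypothetical helper)."""
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(pdf_path, txt_path, gs_exe), check=True)

    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    text = Path(txt_path).read_text(encoding="utf-8", errors="replace")

    # An empty result is a hint that the PDF may be image-only.
    if not text.strip():
        print("No text was extracted. The PDF may contain images only.")
    return text
```

Calling extract_text("test.pdf", "output2.txt") should run the same command as the sample above and hand back the contents of output2.txt as a string, ready for the parsing step.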


I hope this helps someone!

~ Cyber Abyss

Capture Right Click Context Menus Using Windows Snipping Tool

I’m working on a huge documentation project where I’m documenting operational support for a suite of C# MVC portal sites with a lot of back-end SQL administrative functions.

I used to have SnagIt but my company has been cutting back on licenses.

I’m forced to rely on the Windows native screenshot tool, the Windows Snipping Tool.

One of my first big struggles was figuring out how to capture right-click context menus with the Windows Snipping Tool.

In my case, I’m documenting a folder structure and how to commit code to an SVN repository.

  1. Open the Snipping Tool, cancel the current snip and leave the tool open in standby mode.
  2. Give focus to your window or folder.
  3. Press Shift + F10 to open the right-click context menu.
  4. Press Ctrl + fn + Print Screen (prtsc).
  5. The right-click menu should still be open, and the Snipping Tool should prompt you to select an area to capture. Select your menu area.
Windows Snipping Tool with the right-click menu captured.