How To Extract Words From a Specific Area of a PDF

Crop the page, extract all the words on the page, reverse crop the page

Mar 23, 2024

Automation Background Example

I was hired to develop an automation system for income tax preparers using Lacerte tax preparation software. Each year the tax professionals would send tax organizers to their clients to complete and send back. These organizers consisted of tax return pages with information completed from the previous year. The clients would then revise or add information to help prepare for the current year’s tax returns. The tax pages were PDFs, but they were not fillable.

In partnership with a tax professional, I developed an automation system for taking these static tax organizers and adding fillable fields to them. Because the organizers had specific pages for each unique tax client, out of a possible 200+ pages, it wasn’t simply a matter of sending out a 200-page fillable PDF. The tax organizers contained only the pages that would be used for each tax client, and they showed the previous year’s data for reference.

The system I developed allows mass production of fillable PDFs from static organizers using JavaScript and the Action Wizard in Acrobat Pro. The way it works is this:

It "reads" the title bar of each page and looks for the fillable version of that page in a folder on the hard drive.
It inserts that page into the organizer, uses Acrobat’s replace pages function to replace the static part of the fillable PDF with the underlying static PDF page showing the previous year’s data, then removes the original page.

Since the information in the title bar is in the exact same location on every page, I was able to program the system to read the title bar of each page in order to find the fillable version of the page on the hard drive and make this work. This article describes how to extract the words from a specific location on a PDF.

Getting Words From a PDF Page

The doc.getPageNumWords method returns the number of words on a page and takes the 0-based page number as the input parameter. For example, if page 1 has 325 words, running this.getPageNumWords(0) in the JavaScript console will return 325.

The doc.getPageNthWord method returns the nth word on a specific page and takes three input parameters:

nPage - the 0-based page number (0 is page 1, 1 is page 2, etc.)
nWord - the 0-based index of the word (0 is the first word on the page, 1 is the second word, etc.)
bStrip - Specifies whether punctuation and white space should be removed from the returned word (true is yes, false is no, default is true).

For example, this.getPageNthWord(0, 324, true) will return the 325th word on page 1, with the punctuation and white space removed. Using the two methods, we can create a function that returns all words on a page like this:

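function get_Words(oDoc,pg)
{
    var words="";
    /*loop through every word on the page*/
    for(var i=0;i<oDoc.getPageNumWords(pg);i++)
    {
        /*bStrip is false so each word keeps its punctuation and white space*/
        words += oDoc.getPageNthWord(pg,i,false);
    }
    return words;
}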

Run the function for page 2 like this:

get_Words(this, 1);

and it will return all words on page 2.
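
For matching or comparing text, a list of individual words can be easier to work with than one long string. The sketch below is a variation on get_Words and not part of the original system: the name get_Word_List is hypothetical, and it passes true for bStrip so each entry comes back without punctuation or white space.

```javascript
// Hypothetical helper: returns the words on a page as an array.
// oDoc is an Acrobat Doc object (for example, this in the console);
// pg is the 0-based page number.
function get_Word_List(oDoc, pg)
{
    var list = [];
    var count = oDoc.getPageNumWords(pg); // read the word count once
    for (var i = 0; i < count; i++)
    {
        // bStrip = true strips punctuation and white space from each word
        list.push(oDoc.getPageNthWord(pg, i, true));
    }
    return list;
}

// In the Acrobat console, for page 2:
// get_Word_List(this, 1).join(" ");
```

Joining the array with a single space gives a predictable string, which may be handier than the raw spacing when the text is used as a lookup key.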

Narrowing The Page Area For Words To Extract

Since we don’t want all the words on the page, but only the words in a specific area of the page, we can crop the page to that area, extract all the words, then return the page back to the way it was (reverse crop). You can learn the details of Crop/Reverse Crop here.

Getting Words From a Specific Area of a PDF Page

First, get the rectangle array of the area you want to crop: create a form field over that area, then read its rect property by running the following script in the console (assuming the form field is named "Text1"):

this.getField("Text1").rect;

If you draw the field with your mouse, the script above will probably return numbers with lots of decimal places. These can usually be rounded since the decimals are fractions of points, or you can modify the script above to do the rounding for you like this:

var rc=this.getField("Text1").rect;
Math.round(rc[0])+","+Math.round(rc[1])+","+Math.round(rc[2])+","+Math.round(rc[3]);
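
If you would rather have the rounded values as an array than as a comma-separated string, the rounding can be wrapped in a small helper. This is only a sketch: round_Rect is a hypothetical name, not an Acrobat method.

```javascript
// Hypothetical helper: rounds each value of a rectangle array
// to the nearest whole point and returns a new array.
function round_Rect(rc)
{
    var rounded = [];
    for (var i = 0; i < rc.length; i++)
    {
        rounded.push(Math.round(rc[i]));
    }
    return rounded;
}

// In the Acrobat console, assuming a field named "Text1":
// round_Rect(this.getField("Text1").rect);
```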

The following function will crop a specific page, extract all the words on that page, and reverse crop the page back to its original size. The input parameters are the document, the 0-based page number, and each of the four values of the rectangle array:

/*make sure the crop box is equal to the trim box by running the following script in the console*/
this.getPageBox("Crop",0).toString()==this.getPageBox("Trim",0).toString();

/*If the script above returns true you can use the following function*/

function crop_Words_reverse(oDoc, pg, rc0, rc1, rc2, rc3)
{
    var words="";
    /*crop the page to the target rectangle*/
    oDoc.setPageBoxes("Crop", pg, pg, [rc0, rc1, rc2, rc3]);
    /*read the trim box, which is used to restore the page*/
    var trimBox=oDoc.getPageBox("Trim",pg);

    /*only the words inside the cropped area are extracted*/
    for(var i=0;i<oDoc.getPageNumWords(pg);i++)
    {
        words += oDoc.getPageNthWord(pg,i,false);
    }
    /*reverse crop: set the crop box back to the trim box*/
    oDoc.setPageBoxes("Crop", pg, pg, trimBox);

    return words;
}

How to run the function for the first page:

crop_Words_reverse(this, 0, 69,633,156,618);

Running the function above should return all the words in the specific area (69, 633, 156, 618) of page one.
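
When this runs across many pages, it may be worth guarding against two things: a page whose crop box and trim box differ, and an error that leaves a page cropped. The sketch below is one possible variation on crop_Words_reverse, not the function used in the original system. The name crop_Words_checked and the choice to take the rectangle as a single array are assumptions made for illustration.

```javascript
// Hypothetical variation on crop_Words_reverse.
// rc is the rectangle array, for example [69, 633, 156, 618].
function crop_Words_checked(oDoc, pg, rc)
{
    // Stop early if the crop box and trim box differ, since the
    // restore step relies on them being equal.
    if (oDoc.getPageBox("Crop", pg).toString() != oDoc.getPageBox("Trim", pg).toString())
    {
        throw new Error("Crop box and trim box differ on page " + (pg + 1));
    }

    var words = "";
    oDoc.setPageBoxes("Crop", pg, pg, rc);
    // Read the trim box after cropping, as crop_Words_reverse does
    var trimBox = oDoc.getPageBox("Trim", pg);
    try
    {
        var count = oDoc.getPageNumWords(pg);
        for (var i = 0; i < count; i++)
        {
            words += oDoc.getPageNthWord(pg, i, false);
        }
    }
    finally
    {
        // Runs even if word extraction throws, so the page is not left cropped
        oDoc.setPageBoxes("Crop", pg, pg, trimBox);
    }
    return words;
}

// In the Acrobat console, for the first page:
// crop_Words_checked(this, 0, [69, 633, 156, 618]);
```

The try/finally block is the main design choice here: the reverse crop is attempted whether or not the word loop completes.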