Automating Highlighting and Text Extraction Details
Action Wizard tricks and preference settings
Last week I wrote about Extracting PDF Text Using Markup Tools. In that article the markup tools were applied manually: you read the document and highlighted words, phrases, or numbers by hand. An example script then extracted the highlighted words and organized them into a text string that was used to build a spreadsheet.
Highlighting Can Be Automated
In the Protection category of the Acrobat Action Wizard lies a feature called Search & Remove Text, which marks words or phrases with redaction annotations.
These redaction annotations can be easily converted to highlight annotations with a simple line of JavaScript, by changing the annotation type from Redact to Highlight.
this.getAnnots()[0].type="Highlight";
NOTE: Redactions can also be converted to Strikethrough, Underline, and Replace Text annotations. This conversion does not work between arbitrary annotation types. It works with these because the quads property is based on the location of the words and is identical for all the types listed. This trick does not work with form fields either: you can’t change a form field to another type by setting its type property. You must delete the field and create another one.
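As a rough sketch of the note above, a small helper can refuse the conversion unless both the current type and the target type belong to the quads-based group. The names retypeAnnot and QUAD_TYPES are hypothetical, and the type strings "StrikeOut" and "Underline" are assumptions about how the API names those markups, so check them against your version of Acrobat. Replace Text is left out because its type string is not given here.

```javascript
// Markup types assumed to share the same quads-based geometry (see the note above).
// This list is an assumption for illustration, not an exhaustive API listing.
var QUAD_TYPES = ["Redact", "Highlight", "StrikeOut", "Underline"];

// Change an annotation's type only when both the old and new types are quads-based.
// Returns true if the type was changed, false if the conversion was refused.
function retypeAnnot(annot, newType) {
  if (QUAD_TYPES.indexOf(annot.type) === -1 || QUAD_TYPES.indexOf(newType) === -1) {
    return false;
  }
  annot.type = newType;
  return true;
}

// In Acrobat, something like:
// retypeAnnot(this.getAnnots()[0], "StrikeOut");
```

Keeping the allowed types in one list makes it easy to add or remove a type once you have confirmed how it behaves in your own documents.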
All Redact annotations can be converted to Highlight annotations by looping through the annotations, testing the type for Redact, and setting the type to Highlight. The highlight color can also be set to the desired color:
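var anot = this.getAnnots();
// getAnnots() returns null when the document has no annotations
if (anot != null)
{
    for (var i = 0; i < anot.length; i++)
    {
        // Convert only redaction marks; leave other annotations untouched
        if (anot[i].type == "Redact")
        {
            anot[i].type = "Highlight";
            anot[i].strokeColor = color.yellow;
        }
    }
}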
The script above can be added as a second step in the Action. When you add the Search & Remove Text step, check the Prompt User box if the user should enter the text to be highlighted each time the Action runs. If the words and phrases are always the same, save them with the Action and leave Prompt User unchecked. A dialog for entering the search terms is then presented.
Words or phrases can be added by entering them in the field and clicking the Add button. A line-separated text file can also be imported by clicking the Import button.
The Setting That Adds Highlighted Text to Highlights
In last week's article I discussed the preferences setting that adds highlighted text as the contents of highlight annotations. If this setting is on, the contents will be added during the script conversion from Redact to Highlight.
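One way to confirm that the setting took effect is to look for highlights whose contents are still empty after the conversion. The following is a minimal sketch under that assumption. The function name highlightsWithoutContents is hypothetical, and it takes the document as a parameter so that it can be tried outside Acrobat with a stand-in object.

```javascript
// Return the 1-based page numbers of Highlight annotations that have no contents.
// An empty result suggests the preference was on when the conversion ran.
function highlightsWithoutContents(doc) {
  // getAnnots() returns null when there are no annotations at all
  var annots = doc.getAnnots() || [];
  var missing = [];
  for (var i = 0; i < annots.length; i++) {
    var a = annots[i];
    if (a.type === "Highlight" && !a.contents) {
      missing.push(Number(a.page) + 1); // page is zero-based
    }
  }
  return missing;
}

// In Acrobat, something like:
// console.println("Highlights without text on pages: " + highlightsWithoutContents(this));
```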
Extracting Information Based on Highlighted Words
There's so much you can do by automating the highlighting of words or phrases when the "Copy selected text into Highlight, Strikethrough, Underline and Replace Text comment pop-ups" option is selected in preferences. Here's one example. Suppose you had a document containing hundreds of pages and you wanted to identify every page containing a specific term (I'll use Social Security Number and SSN in this example).
Simply add a step to the action above that prints a list of page numbers that contain those terms by looping through the annotations and writing the page property (plus 1, since page numbers are zero-based) to the console:
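var anot = this.getAnnots();
if (anot != null)
{
    for (var i = 0; i < anot.length; i++)
    {
        // contents holds the highlighted text when the preference is on
        if (anot[i].contents == "Social Security Number" || anot[i].contents == "SSN")
        {
            var pg = Number(anot[i].page) + 1; // page is zero-based
            console.println("Page: " + pg + ":" + anot[i].contents);
        }
    }
}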
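The loop above prints one line per match, so a page with several hits is listed several times. If all you want is the list of pages, a variation along these lines collects each page once and sorts the result. This is only a sketch: the function name pagesWithTerms and its terms parameter are hypothetical, and the match is an exact comparison against the annotation contents, as in the script above.

```javascript
// Return the sorted, de-duplicated 1-based page numbers of Highlight
// annotations whose contents exactly match one of the given terms.
function pagesWithTerms(doc, terms) {
  // getAnnots() returns null when there are no annotations at all
  var annots = doc.getAnnots() || [];
  var seen = {};
  var pages = [];
  for (var i = 0; i < annots.length; i++) {
    var a = annots[i];
    if (a.type === "Highlight" && terms.indexOf(a.contents) !== -1) {
      var pg = Number(a.page) + 1; // page is zero-based
      if (!seen[pg]) {
        seen[pg] = true;
        pages.push(pg);
      }
    }
  }
  // Numeric sort; the default sort would compare the numbers as strings
  pages.sort(function (x, y) { return x - y; });
  return pages;
}

// In Acrobat, something like:
// console.println("Pages: " + pagesWithTerms(this, ["Social Security Number", "SSN"]).join(", "));
```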