Extracting PDF Text Using Markup Tools
PDF text can be extracted using the highlighter, strikethrough, underline, and replace text commenting tools.
Important Setting
Both Acrobat and Reader have a preferences setting that makes it possible to extract text from PDFs with the markup tools listed in the subtitle of this article. Press Ctrl + K to open the preferences window and select the Comments category at the top of the list. Under the Making Comments section, select Copy selected text into Highlight, Strikethrough, Underline and Replace Text comment pop-ups. Once this setting is selected, any text marked with those tools becomes the contents of the annotation and is available in that annotation's pop-up.
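One way to confirm that the setting was active when the markups were made is to look for markup annotations with empty contents. The following is a minimal sketch, not a built-in feature: findEmptyMarkups is a hypothetical helper name, and it only checks the Highlight, StrikeOut and Underline annotation types.

```javascript
// Sketch: list markup annotations whose contents are empty, which suggests
// the preference was off when they were created.
// findEmptyMarkups is a hypothetical helper, not part of the Acrobat API.
function findEmptyMarkups(annots) {
  var types = ["Highlight", "StrikeOut", "Underline"];
  var empty = [];
  if (!annots) { return empty; } // getAnnots() returns null when nothing is found
  for (var i = 0; i < annots.length; i++) {
    var a = annots[i];
    if (types.indexOf(a.type) !== -1 && !a.contents) {
      empty.push({ page: a.page, type: a.type });
    }
  }
  return empty;
}

// In the Acrobat console, against the open document (assumed usage):
// console.println(findEmptyMarkups(this.getAnnots()).length + " markups have no text");
```

The function takes the annotation array as a parameter, so it can be tried outside Acrobat with plain objects that have type, contents and page properties.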
Extracting Contents With JavaScript
PDFs can be converted to Excel spreadsheets by selecting File > Export To > Spreadsheet > Microsoft Excel Workbook. While the result might resemble the PDF visually, the process is far from perfect, and the data might not be organized into usable rows and columns. This is especially true for scanned documents that have been run through OCR (Recognize Text). Consider a bank or credit card statement from which you need to extract transactions. Suppose you need data from four columns:
Date
Transaction description
Funds out
Funds in
If you highlight the data with the highlighter tool, a script can loop through the annotations and build a string in which tabs separate the columns and carriage returns separate the rows. That string can then be copied and pasted into a spreadsheet, where it falls into rows and columns.
In the image above, the entries in the right column were highlighted with a different color (green) so the script could add another tab before those entries and keep the last two columns aligned. For speed, the highlight color was not changed until the end. At that point, all highlights in the right column were selected and changed to green together. The script can be run in the console and the result copied and pasted into Excel, or the string can be passed to the document's createDataObject method, which creates the spreadsheet as an attachment. Here's the script:
var data = "Date\tTransaction\tFunds Out\tFunds In\r"; // 1
var before = ""; // 2
var after = "\t"; // 3
var anot = this.getAnnots(); // 4
for (var i = 0; i < anot.length; i++) // 5
{
    if (anot[i].type == "Highlight") // 6
    {
        if ((i + 1) % 3 == 0) // 7
        { after = "\r"; } else { after = "\t"; } // 8
        if (anot[i].strokeColor.join(",") == "RGB,0,1,0") // 9
        { before = "\t"; } else { before = ""; } // 10
        data += before + anot[i].contents + after; // 11
    }
}
Script Explanation
1. A string variable data is set to the header row, with entries separated by tabs and a carriage return at the end.
2. Variable before is declared as an empty string.
3. Variable after is declared as a tab character.
4. Variable anot is declared as the array of all the annotations in the document.
5. A loop is constructed to go through all annotations in the document.
6. The annotation type is tested. If it's a Highlight, continue.
7. The loop index plus 1 (the annotation array is zero-based) is tested for even divisibility by three, because each transaction in this example has three highlights: the date, the description, and one amount.
8. If true, after is a carriage return. If false, after is a tab.
9. The stroke color of the highlight annotation is tested for green.
10. If true, before is a tab. If false, before is an empty string.
11. before + annotation contents + after are appended to the data string.
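Once the data string has been built, it can be passed to createDataObject to save it as an attachment instead of copying it from the console. The sketch below is one possible way to do that, under some assumptions: attachData is a hypothetical wrapper, and the file name and the text/plain MIME type are placeholders chosen for illustration. Excel can open tab-delimited text files.

```javascript
// Sketch: save the tab-delimited string as a file attachment.
// attachData and the default file name are hypothetical; createDataObject
// is the Doc method mentioned in the article.
function attachData(doc, data, fileName) {
  var name = fileName || "transactions.txt"; // placeholder name
  doc.createDataObject(name, data, "text/plain");
  return name;
}

// In the Acrobat console, after the script has built `data` (assumed usage):
// attachData(this, data);
```

Passing the document in as a parameter keeps the wrapper independent of where it is run; in the console, this refers to the open document.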
Checklist
Entries should be highlighted in order because the annotation array is in the order in which the annotations were created. If an error is made, the annotation should be deleted before adding another.
The document shouldn't contain any annotations other than the highlights that pertain to the entries. The row test in the script uses the position in the annotation array, so any extra annotation shifts the row breaks and disrupts the organization of the data into rows and columns. This point makes point six in the script explanation redundant. In other words, if there are only highlights, there is no need to test the annotation type.
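If removing every other annotation is not practical, a variant of the script can count the highlights itself instead of relying on the position in the annotation array. This is a sketch of that alternative, not the script above: buildRows and perRow are hypothetical names, and perRow is 3 here on the assumption that each transaction has three highlights, as in the example.

```javascript
// Sketch: build the tab-delimited string while counting only highlights,
// so annotations of other types do not shift the row breaks.
// buildRows and perRow are hypothetical names.
function buildRows(annots, header, perRow) {
  var data = header;
  var count = 0; // number of highlights processed so far
  if (!annots) { return data; } // getAnnots() returns null when nothing is found
  for (var i = 0; i < annots.length; i++) {
    var a = annots[i];
    if (a.type !== "Highlight") { continue; } // ignore other annotation types
    count++;
    // Green highlights belong to the last column, so add an extra tab first.
    var before = (a.strokeColor.join(",") === "RGB,0,1,0") ? "\t" : "";
    // End the row after every perRow highlights.
    var after = (count % perRow === 0) ? "\r" : "\t";
    data += before + a.contents + after;
  }
  return data;
}

// In the Acrobat console (assumed usage):
// var data = buildRows(this.getAnnots(), "Date\tTransaction\tFunds Out\tFunds In\r", 3);
```

The highlights still need to be created in order, because this variant only changes how rows are counted, not the order in which the annotations are read.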