How to Extract & Search for Text with Foxit PDF SDK (Java)

Text Page

Foxit PDF SDK provides APIs to extract, select, search, and retrieve text in PDF documents. The text content of a PDF page is exposed through a TextPage object, which is bound to that specific page. Use the TextPage class to retrieve information about the text on a page, such as a single character, a single word, or the text within a specified character range or rectangle. A TextPage object can also be used to construct objects of other text-related classes, which perform further operations on the text or access specific information in it:

  • To search for text in the text contents of a PDF page, construct a TextSearch object with a TextPage object.
  • To access text that represents hypertext links, construct a PageTextLinks object with a TextPage object.

Example:

How to extract text from a PDF page

import com.foxit.sdk.pdf.PDFDoc;
import com.foxit.sdk.pdf.PDFPage;
import com.foxit.sdk.pdf.TextPage;
...
// Assuming PDFPage page has been loaded and parsed.
// Get the text page object.
TextPage textpage = new TextPage(page, TextPage.e_ParseTextNormal);
// Read every character on the page: start at index 0 and take getCharCount() characters.
int nCharCount = textpage.getCharCount();
String texts = textpage.getChars(0, nCharCount);
...

How to select the text inside a rectangular area of a PDF page

import com.foxit.sdk.pdf.PDFDoc;
import com.foxit.sdk.pdf.PDFPage;
import com.foxit.sdk.pdf.TextPage;
import com.foxit.sdk.common.fxcrt.RectF;
import com.foxit.sdk.common.fxcrt.RectFArray;
...
// Assuming PDFPage page has been loaded and parsed.
...
TextPage textpage = new TextPage(page, TextPage.e_ParseTextNormal);
// Selection rectangle in PDF page coordinates.
RectF selRc = new RectF(100, 100, 250, 250);
// The text that falls inside the selection rectangle.
String selText = textpage.getTextInRect(selRc);
// The rectangles of the text that falls inside the selection rectangle.
RectFArray rcArr = textpage.getTextRectArrayByRect(selRc);
...
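
To make the idea of rectangle-based selection concrete, the following is a minimal plain-Java sketch that does not use Foxit PDF SDK. It assumes a simplified model in which every character has a bounding box and a character is selected when its box overlaps the selection rectangle. The names RectSelectSketch, Glyph, layout, and textInRect are hypothetical, and the overlap rule is an assumption of this sketch; the rule the SDK applies inside getTextInRect may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; none of these types belong to Foxit PDF SDK.
class RectSelectSketch {
    // Hypothetical stand-in for one positioned character.
    // Coordinates follow the PDF convention: origin at bottom-left, y grows upward.
    static final class Glyph {
        final char ch;
        final float left, bottom, right, top;

        Glyph(char ch, float left, float bottom, float right, float top) {
            this.ch = ch;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    // Places a string on one line, giving each character a fixed 10 x 12 box.
    static List<Glyph> layout(String s, float x, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            float left = x + i * 10f;
            glyphs.add(new Glyph(s.charAt(i), left, y, left + 10f, y + 12f));
        }
        return glyphs;
    }

    // True when the glyph box and the selection rectangle overlap (edges touching do not count).
    static boolean intersects(Glyph g, float left, float bottom, float right, float top) {
        return g.left < right && g.right > left && g.bottom < top && g.top > bottom;
    }

    // Concatenates, in reading order, every character whose box overlaps the rectangle.
    static String textInRect(List<Glyph> glyphs, float left, float bottom, float right, float top) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (intersects(g, left, bottom, right, top)) {
                sb.append(g.ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = layout("Foxit SDK", 90f, 100f);
        System.out.println(textInRect(line, 100f, 100f, 140f, 250f));
    }
}
```

In this model the selection depends only on geometry, which is why the same rectangle can be used both to read the text and to collect the rectangles that cover it.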

Text Search

Foxit PDF SDK provides APIs to search for text in a PDF document, an XFA document, a text page, or a PDF annotation’s appearance. It offers functions to perform a text search and to retrieve the search results:

  • To specify the search pattern and options, use the functions TextSearch.setPattern, TextSearch.setStartPage (only applies to a search in a PDF document), TextSearch.setEndPage (only applies to a search in a PDF document), and TextSearch.setSearchFlags.
  • To perform the search, use the function TextSearch.findNext or TextSearch.findPrev.
  • To get the search results, use the TextSearch.getMatchXXX() functions, for example TextSearch.getMatchRects().

Example:

How to search for a text pattern in a PDF document

import com.foxit.sdk.common.fxcrt.RectF;
import com.foxit.sdk.common.fxcrt.RectFArray;
import com.foxit.sdk.pdf.PDFDoc;
import com.foxit.sdk.pdf.TextSearch;
...
// Assuming PDFDoc doc has been loaded.
// Create a search over the document; no cancel callback is supplied.
TextSearch search = new TextSearch(doc, null);
// Search every page, from the first to the last.
int start_index = 0, end_index = doc.getPageCount() - 1;
search.setStartPage(start_index);
search.setEndPage(end_index);
String pattern = "Foxit";
search.setPattern(pattern);
int flags = TextSearch.e_SearchNormal;
// To specify additional flags, combine them like this:
// flags |= TextSearch.e_SearchMatchCase;
// flags |= TextSearch.e_SearchMatchWholeWord;
// flags |= TextSearch.e_SearchConsecutive;
search.setSearchFlags(flags);
int match_count = 0;
while (search.findNext()) {
    // Rectangles that cover the current match.
    RectFArray rect_array = search.getMatchRects();
    match_count++;
}
...
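
The effect of combining search flags with the bitwise OR operator can be shown with a small plain-Java sketch that searches an ordinary string instead of a PDF. Everything in it is an assumption made for illustration: the class SearchFlagsSketch, the flag constants and their values, and the word-boundary rule are hypothetical and are not the SDK's definitions or its matching algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of bit-flag search options; not Foxit PDF SDK code.
class SearchFlagsSketch {
    // Hypothetical flag values chosen for this sketch only.
    static final int SEARCH_NORMAL = 0;
    static final int MATCH_CASE = 1;
    static final int MATCH_WHOLE_WORD = 1 << 1;

    // Returns the start index of every match of pattern in text, honoring the flags.
    static List<Integer> findAll(String text, String pattern, int flags) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits;
        }
        boolean matchCase = (flags & MATCH_CASE) != 0;
        boolean wholeWord = (flags & MATCH_WHOLE_WORD) != 0;
        // Without MATCH_CASE, compare lowercase copies of both strings.
        String haystack = matchCase ? text : text.toLowerCase(Locale.ROOT);
        String needle = matchCase ? pattern : pattern.toLowerCase(Locale.ROOT);

        int from = 0;
        while (true) {
            int at = haystack.indexOf(needle, from);
            if (at < 0) {
                break;
            }
            int end = at + needle.length();
            // A whole-word match must not touch a letter or digit on either side.
            boolean leftOk = at == 0 || !Character.isLetterOrDigit(haystack.charAt(at - 1));
            boolean rightOk = end == haystack.length() || !Character.isLetterOrDigit(haystack.charAt(end));
            if (!wholeWord || (leftOk && rightOk)) {
                hits.add(at);
            }
            from = at + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Foxit PDF SDK: foxit, Foxits";
        int flags = SEARCH_NORMAL;
        flags |= MATCH_CASE;
        flags |= MATCH_WHOLE_WORD;
        System.out.println(findAll(text, "Foxit", SEARCH_NORMAL));
        System.out.println(findAll(text, "Foxit", flags));
    }
}
```

Because each option occupies its own bit, options can be switched on independently and tested with a bitwise AND, which is the same pattern the commented-out flags |= lines in the example above rely on.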

Text Link

In a PDF page, text that represents a hypertext link to a website or other internet resource, or an email address, is stored in the same way as ordinary text. Before processing a text link, first call PageTextLinks.getTextLink to get a TextLink object.

Example:

import com.foxit.sdk.pdf.PDFPage;
import com.foxit.sdk.pdf.PageTextLinks;
import com.foxit.sdk.pdf.TextLink;
import com.foxit.sdk.pdf.TextPage;
...
// Assuming PDFPage page has been loaded and parsed.
...
TextPage text_page = new TextPage(page, TextPage.e_ParseTextNormal);
// Collect the text links found in the page's text.
PageTextLinks page_textlinks = new PageTextLinks(text_page);
TextLink text_link = page_textlinks.getTextLink(index); // specify an index.
// The URI that the link text refers to.
String str_uri = text_link.getURI();
...
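
Because link text is stored like any other text, links have to be recognized from the characters themselves. The plain-Java sketch below illustrates that idea on an ordinary string with a deliberately simple regular expression. It is not the SDK's detection logic: the class TextLinkSketch, the patterns, and the choice to prefix email addresses with mailto: are all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; not how Foxit PDF SDK detects text links.
class TextLinkSketch {
    // Group 1: an http(s) URL up to the next whitespace. Group 2: a simple email address.
    private static final Pattern LINK = Pattern.compile(
            "(https?://\\S+)|([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,})");

    // Returns a URI for every link-like run of characters found in the text.
    static List<String> findLinks(String pageText) {
        List<String> uris = new ArrayList<>();
        Matcher m = LINK.matcher(pageText);
        while (m.find()) {
            if (m.group(1) != null) {
                uris.add(m.group(1));
            } else {
                // Sketch-only convention: expose email addresses as mailto: URIs.
                uris.add("mailto:" + m.group(2));
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        String pageText = "Visit https://www.foxit.com or write to support@example.com today";
        for (String uri : findLinks(pageText)) {
            System.out.println(uri);
        }
    }
}
```
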
Updated on October 23, 2019
