Full-Text Search

Retrieving textual content from documents is a vital part of many PDF workflows. For various reasons, text extraction isn’t always straightforward, but here is how we make it easy for you with Foxit PDF SDK.

What is Full-Text Search?

PDF documents consist of many parts, with the textual content often being the most important. Text search and text extraction are two common tasks required by both PDF developers and the end users of PDF software. To search for text in a document, the text content must first be extracted from the PDF, which can be difficult without our SDK!

Using an index allows a document to be searched quickly, because the text extraction phase only needs to be completed once. This lets the search operation scale up to large sets of documents.
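To make the "extract once, search many times" idea concrete, here is a minimal sketch of an in-memory inverted index. It is an illustration only: the class TinyIndex and its methods are hypothetical names, not part of Foxit PDF SDK, and the strings passed to add() stand in for text that has already been extracted from a PDF.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Foxit PDF SDK.
class TinyIndex {
    // Maps each lower-cased word to the names of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index a document once. "text" stands in for text already extracted from a PDF.
    void add(String docName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Look up a single word without reading the documents again.
    Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("contract.pdf", "Service contract for Acme Corp");
        index.add("invoice.pdf", "Invoice issued to Acme Corp");
        System.out.println(index.search("acme"));
        System.out.println(index.search("invoice"));
    }
}
```

Each query is a map lookup, so its cost does not depend on how long the indexed documents are. A production index would also store positions for phrase search and persist the data to disk.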

Simplifying the search process with PDF SDK

PDF SDK offers the fastest text search technology in the market. The biggest challenge with text searching lies in the way the PDF format organizes text and, more specifically, text objects. The format does not constrain the order of those objects (or characters) by the location, size or rotation angle at which they are displayed, nor by the page, line or word a character belongs to when you eventually read it.

Although challenging, this is very useful when a library handles the logic well, as Foxit PDF SDK does. You can find words anywhere in your document and customize the engine to account for common issues. These include split words (for example, with a hyphen at the end of a line), certain combined characters (for example, a single fi ligature character instead of separate f and i), and words of a phrase that fall on different lines.
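The two problems described above, characters stored in arbitrary order and words broken up by layout, can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not how Foxit PDF SDK works internally: the class TextRecovery, the Glyph type and the tolerance parameter are hypothetical, and the sketch assumes unrotated, left-to-right text.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Foxit PDF SDK.
class TextRecovery {
    // A positioned character. y grows upward, as in PDF page coordinates.
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        Glyph(char ch, double x, double y) { this.ch = ch; this.x = x; this.y = y; }
    }

    // Step 1: rebuild lines from glyphs stored in arbitrary order.
    // Glyphs whose baselines differ by at most "tolerance" share a line.
    static String toText(List<Glyph> glyphs, double tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y)); // top of page first
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= tolerance) {
                last.add(g);
            } else {
                List<Glyph> line = new ArrayList<>();
                line.add(g);
                lines.add(line);
            }
        }
        StringBuilder text = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x)); // left to right
            if (text.length() > 0) text.append('\n');
            for (Glyph g : line) text.append(g.ch);
        }
        return text.toString();
    }

    // Step 2: make the recovered text searchable.
    static String normalize(String s) {
        // NFKC expands compatibility characters, including the fi ligature (U+FB01).
        String out = Normalizer.normalize(s, Normalizer.Form.NFKC);
        // Rejoin a word split by a hyphen at the end of a line.
        out = out.replaceAll("(\\p{L})-\\r?\\n(\\p{Ll})", "$1$2");
        // Treat remaining line breaks as spaces so wrapped phrases still match.
        return out.replaceAll("\\s*\\r?\\n\\s*", " ");
    }

    public static void main(String[] args) {
        List<Glyph> shuffled = Arrays.asList(
            new Glyph('b', 10, 700), new Glyph('H', 0, 720),
            new Glyph('a', 0, 700), new Glyph('i', 10, 720));
        System.out.println(toText(shuffled, 2.0));
        System.out.println(normalize("split docu-\nment across\nlines"));
    }
}
```

The hyphen rule is deliberately naive: it also joins genuine compounds that happen to break at a line end, and NFKC rewrites other compatibility characters besides ligatures. A real engine needs language-aware handling for both.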

Full-text search makes searching and text extraction easier and faster. It covers every piece of text in the document, according to the index of the text object, regardless of language, document type or encoding. We do this by using an SQLite database to check all the content, which returns a very quick response to your query.

ELEMENTS OF FULL-TEXT SEARCH

SEARCH A STRING OF TEXT ACROSS ANY/ALL DOCUMENTS

HIGHLIGHT ALL INSTANCES OF A STRING ON A DOCUMENT

NAVIGATE THROUGH PREVIOUS/NEXT SEARCH RESULTS

ABILITY TO SEARCH META INFORMATION

COMPLETE FILE SEARCH IN SECONDS

KEYWORD, STRING OR PHRASE SEARCH

Tagging to help Full-text search

The PDF format offers full tagging support for blocks of text and other items in the page, which allows items to be identified, read, searched and rendered properly. Foxit PDF SDK offers full support for programmatic tagging of phrases, paragraphs, and all other PDF items, which serves a double purpose:

1. Faster, more streamlined PDF searching
2. Enhanced accessibility and compliance with many document accessibility standards

WHY USE FULL-TEXT SEARCH IN PDFS?

NEVER LOSE INFORMATION AGAIN

SEARCH COLLECTIONS OF DOCUMENTS IN SECONDS

SEARCH FULL PDFS, INCLUDING METADATA AND ANNOTATIONS

Full-Text Search and Metadata

When creating documents, information can be organized and managed so that full-text search can be done easily and logically. This involves editing document metadata to ensure it is complete and updating document tags to outline the topics discussed in each file. Using frequent topics or department names as tags, companies can easily search folders full of hundreds of files to find the information they seek quickly and accurately. Foxit PDF SDK allows you to set, remove, and edit all metadata in your documents programmatically, based on preset logic and workflows.
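As a rough sketch of why consistent tags pay off, the example below keeps a small catalog of topic tags per file and filters by tag. Everything here is hypothetical: TagCatalog is not a Foxit PDF SDK class, and in practice the tags would be read from each document's metadata rather than entered by hand.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Foxit PDF SDK.
class TagCatalog {
    // File name -> lower-cased tags, kept in insertion order.
    private final Map<String, Set<String>> tagsByFile = new LinkedHashMap<>();

    // Record one or more topic or department tags for a file.
    void tag(String file, String... tags) {
        Set<String> set = tagsByFile.computeIfAbsent(file, k -> new HashSet<>());
        for (String t : tags) {
            set.add(t.toLowerCase(Locale.ROOT));
        }
    }

    // Return every file carrying the given tag, ignoring case.
    List<String> filesTagged(String tag) {
        String wanted = tag.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : tagsByFile.entrySet()) {
            if (entry.getValue().contains(wanted)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TagCatalog catalog = new TagCatalog();
        catalog.tag("q1-report.pdf", "Finance", "2024");
        catalog.tag("handbook.pdf", "HR");
        catalog.tag("budget.pdf", "finance");
        System.out.println(catalog.filesTagged("FINANCE"));
    }
}
```

Lower-casing tags on the way in is one simple way to keep "Finance" and "finance" from splitting a result set.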

Full-Text Search and Redaction

In many industries, particularly after the approval of GDPR in Europe, searching for and deleting customer information (such as the name of a customer) across one or all of your document management systems has become a nightmare. Just think about all the information you hold right now on any one of your customers: contracts, support tickets, archived emails, financial records… Now imagine receiving a GDPR request for removal of all their information.

Foxit PDF SDK turns this manual, multi-hour task into a quick search-and-remove operation. You can search for all instances of any given string of text (such as the name of your customer) across all your records, select them and securely redact them while maintaining the integrity of the original document.
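The search-then-redact flow can be illustrated on plain extracted text. This is a sketch of the idea only, with hypothetical names (TextRedactor, redact) that are not Foxit PDF SDK APIs. It masks characters in a string, whereas real PDF redaction must remove the underlying content from the file, not just cover it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Foxit PDF SDK.
class TextRedactor {
    // Replace every case-insensitive occurrence of "term" with block characters.
    static String redact(String text, String term) {
        // Pattern.quote treats the term as literal text, not as a regular expression.
        Pattern pattern = Pattern.compile(
            Pattern.quote(term), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder(text);
        while (matcher.find()) {
            for (int i = matcher.start(); i < matcher.end(); i++) {
                out.setCharAt(i, '\u2588'); // full block character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String record = "Jane Doe signed the contract. Contact: jane doe.";
        System.out.println(redact(record, "Jane Doe"));
    }
}
```

Masking one character per original character keeps the text length unchanged, so offsets recorded before redaction remain valid afterwards.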

USE CASES

SECURELY SEARCHING LEGAL DOCUMENTS

Foxit PDF SDK full-text search can ensure that files are searched completely so that no stone is left unturned. All instances of a chosen word or phrase are returned in a search, giving the searcher a complete picture of the files and documents needed for review in the legal industry.

ACHIEVE GDPR COMPLIANCE

GDPR allows your European customers to request all personally identifiable information you hold on them. Have you asked yourself how you would search all the information you hold on any given customer, across all your document management systems, if you were to get a request like that? With PDF SDK, you can search for, select and redact all instances of your customers’ information across all documents quickly and securely.

Trusted by some of the biggest companies in the world

AIG
Bank of America

Sample Code

Full-Text Search

public static void doFullTextSearch(String[] args) throws PDFException {
	// createResultFolder(output_path);
	// Initialize the library; sn and key are the license serial number and key supplied with your SDK.
	int error_code = Library.initialize(sn, key);
	if (error_code != e_ErrSuccess) {
		System.out.println("Library Initialize Error: " + error_code);
		return;
	}
	String directory = "A search directory...";
	FullTextSearch search = new FullTextSearch();
	try {
		String dbPath = "The path of data base to store the indexed data...";
		search.setDataBasePath(dbPath);
		// Get document source information.
		DocumentsSource source = new DocumentsSource(directory);
	 
		// Create a Pause callback object implemented by users to pause the updating process.
		PauseUtil pause = new PauseUtil(30);
	 
		// Start to update the index of PDF files which receive from the source.
		Progressive progressive = search.startUpdateIndex(source, pause, false);
		int state = Progressive.e_ToBeContinued;
		while (state == Progressive.e_ToBeContinued) {
			state = progressive.resume();
		}
	 
		// Create a callback object which will be invoked when a matched one is found.
		MySearchCallback searchCallback = new MySearchCallback();
	 
		// Search the specified keyword from the indexed data source.
		search.searchOf("looking for this text", RankMode.e_RankHitCountASC, searchCallback);
	} catch (PDFException e) {
		e.printStackTrace();
	}
		
	// Release the library once indexing and searching are done.
	Library.release();
}