Retrieving textual content from documents is a vital part of many PDF workflows. For various reasons, text extraction isn’t always straightforward, but here is how we make it easy for you with Foxit PDF SDK.
What is Full-Text Search?
PDF documents consist of many parts, with the textual content often being the most important. Text search and text extraction are two common tasks required by both PDF developers and the end users of PDF software. To search for text in a document, the text content must first be extracted from the PDF, which can be difficult without our SDK.
Using an index allows a document to be searched quickly, because the text extraction phase only needs to be completed once. The same index lets the search operation scale up to large sets of documents.
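To illustrate why indexing pays off, here is a minimal sketch of an inverted index in plain Python. It is not Foxit PDF SDK code: the function names and the sample document texts are hypothetical, and the text is assumed to have been extracted from the PDFs already.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted names of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

# Hypothetical, already-extracted text standing in for real PDF content.
documents = {
    "contract.pdf": "Service agreement between Acme and Jane Doe",
    "invoice.pdf": "Invoice for Acme, due in thirty days",
    "memo.pdf": "Internal memo about the office move",
}

index = build_index(documents)    # extraction and indexing happen once
print(search(index, "acme"))      # ['contract.pdf', 'invoice.pdf']
print(search(index, "jane doe"))  # ['contract.pdf']
```

Every later query is a dictionary lookup rather than a fresh pass over the documents, which is what makes searching large collections practical.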
Simplifying the search process with PDF SDK
Foxit PDF SDK offers the fastest text search technology in the market. The biggest challenge with text searching lies in the way the PDF format organizes text, and more specifically text objects. The format places no restrictions on the location, size, or rotation angle at which those objects (or characters) are displayed. The same freedom applies to the page, line, or word a character belongs to when you eventually read it.
Although this is challenging, it becomes very useful when a library such as Foxit PDF SDK handles the logic well. You can find words anywhere in your document and customize the engine to account for common issues. These include split words (for example, with a hyphen at the end of a line), certain combined characters (for example, the ﬁ ligature instead of separate f and i), words of a phrase that fall on different lines, and so on.
Full-text search makes searching and text extraction easier and faster. It covers every piece of text in the document, located through the index of its text object, and it works regardless of language, document type, or encoding. We do this by using an SQLite database to index all the content, which returns a very quick response to your query.
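The issues listed above can be illustrated with a small normalization pass over extracted text. This is a sketch in plain Python, not the SDK's own logic; the function name and sample string are hypothetical.

```python
import re
import unicodedata

def normalize_extracted_text(text):
    """Clean up extracted text so that simple substring search can match it."""
    # NFKC folds compatibility characters: the single ligature U+FB01 becomes "fi".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse line breaks and runs of whitespace so phrases spanning lines match.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "The \ufb01nal docu-\nment covers full-text\nsearch."
clean = normalize_extracted_text(raw)
print(clean)                        # The final document covers full-text search.
print("full-text search" in clean)  # True
```

One limitation of this simple approach: a genuinely hyphenated compound that happens to break at the end of a line would also be joined, so a production engine needs a dictionary or layout information to decide.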
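As a rough picture of how a database-backed index answers queries, the sketch below stores word positions in an in-memory SQLite database using Python's standard sqlite3 module. The table layout, function names, and sample pages are assumptions made for illustration; they do not describe the schema Foxit PDF SDK uses internally.

```python
import re
import sqlite3

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per word occurrence.
conn.execute(
    "CREATE TABLE postings (word TEXT, doc TEXT, page INTEGER, position INTEGER)"
)
conn.execute("CREATE INDEX idx_postings_word ON postings (word)")

def index_page(doc, page, text):
    """Record every word on a page together with its position on that page."""
    rows = [(word, doc, page, pos) for pos, word in enumerate(tokenize(text))]
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?, ?)", rows)

def find_word(word):
    """Return (doc, page, position) for each occurrence of the word."""
    cursor = conn.execute(
        "SELECT doc, page, position FROM postings "
        "WHERE word = ? ORDER BY doc, page, position",
        (word.lower(),),
    )
    return cursor.fetchall()

# Hypothetical page texts standing in for extracted PDF content.
index_page("contract.pdf", 1, "Agreement between Acme and Jane Doe")
index_page("contract.pdf", 2, "Signed by Jane Doe")
index_page("invoice.pdf", 1, "Invoice for Acme")

print(find_word("Jane"))  # [('contract.pdf', 1, 4), ('contract.pdf', 2, 2)]
```

Because the lookup goes through the index on the word column, the query does not need to rescan the page text, and the stored page and position let a viewer jump straight to each hit.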
ELEMENTS OF FULL-TEXT SEARCH
SEARCH A STRING OF TEXT ACROSS ANY/ALL DOCUMENTS
HIGHLIGHT ALL INSTANCES OF A STRING ON A DOCUMENT
NAVIGATE THROUGH PREVIOUS/NEXT SEARCH RESULTS
ABILITY TO SEARCH META INFORMATION
COMPLETE FILE SEARCH IN SECONDS
KEYWORD, STRING OR PHRASE SEARCH
Tagging to help Full-text search
The PDF format offers full tagging support for blocks of text and other items in the page, which allows items to be identified, read, searched and rendered properly. Foxit PDF SDK offers full support for programmatic tagging of phrases, paragraphs, and all other PDF items, which serves a double purpose:
1. Faster, more streamlined PDF searching
2. Enhanced accessibility and compliance with many document accessibility standards
WHY USE FULL-TEXT SEARCH IN PDFS?
NEVER LOSE INFORMATION AGAIN
SEARCH COLLECTIONS OF DOCUMENTS IN SECONDS
SEARCH FULL PDFS, INCLUDING METADATA AND ANNOTATIONS
When documents are created, information can be organized and managed so that full-text search is easy and logical. This involves editing document metadata to ensure it is complete and updating document tags to outline the topics discussed in each file. By using frequent topics or department names as tags, companies can search folders containing hundreds of files and find the information they seek quickly and accurately. Foxit PDF SDK allows you to set, remove, and edit all metadata in your documents programmatically, based on preset logic and workflows.
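Tag-based lookup of this kind can be pictured with a short Python sketch. The DocumentRecord class, its fields, and the sample records are hypothetical and stand in for metadata that would be read from real PDF files.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    """Hypothetical stand-in for the metadata read from one PDF file."""
    path: str
    title: str = ""
    keywords: set = field(default_factory=set)

def find_by_tags(records, *tags):
    """Return paths of records whose keywords include every requested tag."""
    wanted = {tag.lower() for tag in tags}
    return [
        record.path
        for record in records
        if wanted <= {keyword.lower() for keyword in record.keywords}
    ]

records = [
    DocumentRecord("hr/policy.pdf", "Leave policy", {"HR", "policy"}),
    DocumentRecord("finance/q1.pdf", "Q1 report", {"Finance", "report"}),
    DocumentRecord("hr/onboarding.pdf", "Onboarding", {"HR", "training"}),
]

print(find_by_tags(records, "hr"))            # ['hr/policy.pdf', 'hr/onboarding.pdf']
print(find_by_tags(records, "hr", "policy"))  # ['hr/policy.pdf']
```

Matching is case-insensitive here so that "HR" and "hr" are treated as the same department tag.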
Full-Text Search and Redact Information
In many industries, particularly after the approval of GDPR in Europe, searching for and deleting customer information (such as the name of a customer) across all documents in any or all of your document management systems has become a nightmare. Just think about all the information you hold right now on any one of your customers: contracts, support tickets, archived emails, financial records… Now imagine receiving a GDPR request for removal of all their information.
Foxit PDF SDK turns this manual, multi-hour nightmare into a quick search-and-remove task: you can search for all instances of any given string of text (such as the name of your customer) across all your records, select them, and securely redact them while maintaining the integrity of the original document.
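The search-then-redact workflow can be sketched on plain strings as follows. This is only an illustration of the idea: the function names and sample records are hypothetical, and the code operates on already-extracted text. Secure redaction of an actual PDF must also remove the underlying content from the file, which this sketch does not attempt.

```python
import re

def redact(text, term, mask="#"):
    """Mask every case-insensitive occurrence of term; return (text, count)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.subn(lambda match: mask * len(match.group()), text)

def redact_all(records, term):
    """Apply redact() to each record and keep only the records that changed."""
    changed = {}
    for name, text in records.items():
        new_text, count = redact(text, term)
        if count:
            changed[name] = (new_text, count)
    return changed

# Hypothetical extracted text from a customer's records.
records = {
    "contract.pdf": "Contract with Jane Doe. jane doe signed.",
    "ticket.pdf": "Support ticket opened by Jane Doe",
    "memo.pdf": "Internal memo about the office move",
}

for name, (text, count) in redact_all(records, "Jane Doe").items():
    print(name, count, text)
```

Replacing each match with a mask of the same length keeps the rest of the text aligned, which mirrors the goal of leaving the surrounding document intact.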