TET Plugin

The TET Plugin provides easy access to the PDFlib Text and Image Extraction Toolkit (TET). Although the TET Plugin runs as an Acrobat plugin, the underlying content extraction features do not use Acrobat functions but are based entirely on TET. The TET Plugin is provided as a free tool which demonstrates the power of PDFlib TET. Because the TET Plugin is more powerful than Acrobat’s built-in text and image extraction tools and offers a number of convenient user interface features, it is useful as a replacement for Acrobat’s built-in copy and find features. PDFlib TET can successfully process many documents for which Acrobat produces only garbage when trying to extract the text. The TET Plugin offers the following functions:

  • Copy the text from a PDF document in plain text to the system clipboard or a disk file. Enhanced clipboard controls facilitate the use of copy/paste.
  • Convert a PDF to an XML dialect called TETML and place it in the clipboard or a disk file.
  • Copy XMP document metadata to the clipboard or a disk file.
  • Find words in the document. The search text can be supplied literally or in hex syntax to facilitate the search for unusual characters.
  • Highlight all instances of a search term on the page simultaneously.
  • Extract images from the document as TIFF, JPEG, JPEG 2000 or JBIG2 files.
  • Display color space and position information for images.
  • Adjust text and image extraction to your requirements with detailed configuration settings. Configuration sets can be saved and reloaded.

Advantages over Acrobat’s copy function

The copy feature of the TET Plugin offers several advantages over Acrobat’s built-in copy facility:

  • The output can be customized to match different application requirements.
  • TET is able to correctly interpret the text in many cases where Acrobat copies only garbage to the clipboard.
  • Unknown glyphs (for which a proper Unicode mapping cannot be established) are highlighted in red and can be replaced with a user-selected character (e.g. a question mark).
  • TET processes documents much faster than Acrobat.
  • Images can be selected interactively for export, or all images on the page or in the document can be extracted.
  • Tiny image fragments are merged into usable images.

What is PDFlib TET?

PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images and metadata from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed color, glyph and font information as well as the position on the page. Raster images are extracted in common image formats. TET optionally converts PDF documents to an XML-based format called TETML which contains text and metadata as well as resource information.

TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.
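
To give a feel for what determining word boundaries involves, here is a minimal Python sketch of one common heuristic: glyphs on a single text line are joined into a word until the horizontal gap to the next glyph exceeds a fraction of the font size. This is not TET's actual algorithm, which is not described here. The `Glyph` record, the `group_words` function and the `gap_ratio` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record; not a TET data structure."""
    char: str     # Unicode character of the glyph
    x: float      # left edge on the page, in points
    width: float  # horizontal extent, in points
    size: float   # font size, in points


def group_words(glyphs, gap_ratio=0.25):
    """Join the glyphs of one text line into words.

    A new word starts when the gap to the previous glyph is wider than
    gap_ratio * font size, or when an explicit whitespace glyph occurs.
    """
    words, current, prev = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.size and current:
                words.append(current)
                current = ""
        if g.char.isspace():
            # An explicit space always ends the current word.
            if current:
                words.append(current)
                current = ""
        else:
            current += g.char
        prev = g
    if current:
        words.append(current)
    return words


line = [
    Glyph("T", 0, 5, 10), Glyph("E", 5, 5, 10), Glyph("T", 10, 5, 10),
    # 5 pt gap here, wider than 0.25 * 10 pt, so a new word starts
    Glyph("r", 20, 5, 10), Glyph("u", 25, 5, 10), Glyph("n", 30, 5, 10),
]
print(group_words(line))  # ['TET', 'run']
```

Real documents need considerably more than this (kerning, rotated text, multiple columns, overlapping shadow text), which is where dedicated content analysis comes in.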

With PDFlib TET you can:

  • Implement the PDF indexer for a search engine
  • Repurpose text and images in PDFs
  • Convert the contents of PDFs to other formats
  • Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)
  • Check whether an area on the page is empty or contains any text, image, or vector graphics
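
As an illustration of the first item, the sketch below shows the indexing step of a search engine in Python. It assumes the text of each page has already been extracted (for example with TET) and is available as a list of plain strings; the extraction call itself is deliberately left out. The function name `build_index` and the sample page texts are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build a simple inverted index from per-page text.

    pages: list of strings, one per page, assumed to be already extracted.
    Returns a dict mapping each lowercased word to the sorted list of
    1-based page numbers on which it occurs.
    """
    index = defaultdict(set)
    for pageno, text in enumerate(pages, start=1):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(pageno)
    return {word: sorted(nums) for word, nums in index.items()}


# Hypothetical page texts standing in for extracted PDF content.
pages = [
    "Text extraction from PDF",
    "Image extraction and metadata",
]
index = build_index(pages)
print(index["extraction"])  # [1, 2]
print(index["pdf"])         # [1]
```

A production indexer would add stemming, stop words and position information, but the overall shape (extract text per page, tokenize, record where each term occurs) stays the same.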

TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks.
Fully functional evaluation versions of PDFlib TET are available for download for a variety of platforms.