pCOS Features

PDFlib pCOS 4 - PDF Information Retrieval Tool

What is PDFlib pCOS?

PDFlib pCOS provides a simple and elegant facility for retrieving any information from a PDF document which is not part of the page contents. For example, PDF metadata, interactive elements (links, form fields, etc.), or page dimensions can easily be queried with pCOS.
With pCOS you can extract a variety of interesting items and create output for different purposes. By processing multiple PDF documents with a single call you can easily create summaries of document info entries, page formats, fonts, or any other property. Combined with tabular output this provides a powerful PDF administration tool.
There are many application scenarios for the PDF Information Retrieval Tool PDFlib pCOS within PDF workflows, but you can also use PDFlib pCOS as a tool for learning or debugging PDF. Here are some typical situations:

  • Check incoming documents for predefined criteria
  • Identify problem files in a large collection
  • Create metadata summaries for document management
  • Quality assurance before publishing documents
  • Document retrieval and repository workflows
  • Summarize the bookmarks
  • Extract components of PDF documents, e.g. ICC profiles
  • Check PDFs for security problems (JavaScript etc.)
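
As an illustration of checking incoming documents against predefined criteria, the following minimal sketch applies a set of rules to properties that have already been retrieved from a document. It does not call pCOS itself: the property names (pages, has_javascript, encrypted), the rule names and the check_document helper are all hypothetical and stand in for whatever values a real query would return.

```python
# Hypothetical sketch: the property dictionary stands in for values that
# would be retrieved from a PDF beforehand; no pCOS call is made here.

def check_document(props, rules):
    """Return the names of all rules that the document violates."""
    return [name for name, predicate in rules.items() if not predicate(props)]


# Illustrative acceptance criteria (names and thresholds are assumptions).
rules = {
    "no JavaScript": lambda p: not p.get("has_javascript", False),
    "at most 100 pages": lambda p: p.get("pages", 0) <= 100,
    "not encrypted": lambda p: not p.get("encrypted", False),
}

violations = check_document({"pages": 250, "has_javascript": True}, rules)
```

A document passes when the returned list is empty, so the same rule set can be applied to every file in a collection to identify problem files.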

The pCOS programming interface is included in other PDFlib GmbH products: if you use PDFlib+PDI, PDFlib Personalization Server (PPS), PDFlib TET, PDFlib PLOP or PDFlib PLOP DS, you also have access to the pCOS interface. In PDFlib TET PDF IFilter you can use pCOS paths to retrieve information from PDF documents and use it for indexing and search. The pCOS command-line tool is only included in the pCOS product. If you need access to text or images on the page, use our product PDFlib TET for PDF content extraction.

pCOS Cookbook

The pCOS Cookbook is a collection of programming examples which demonstrate the use of pCOS for various PDF retrieval tasks. The Cookbook is available here and includes sample code, input documents and sample output.

PDFlib pCOS Features

Supported Input

PDFlib pCOS supports all flavors of PDF input:

  • All PDF versions up to Acrobat XI, including ISO 32000
  • Encrypted documents (password may be required)
  • Damaged PDF input documents will be repaired if possible

Information Retrieval

PDFlib pCOS offers a simple query interface. With PDFlib pCOS you can extract a variety of interesting items, such as:

  • Document info fields and XMP metadata
  • General information: linearization and tagged PDF status, encryption details and permission settings, number of pages and fonts
  • Fonts with name, embedding status, etc.
  • Image data, such as bit depth, color space, compression, XMP
  • Color space details
  • Target URLs and coordinates of Web links
  • Bookmarks and the corresponding page numbers, e.g. to create a table of contents
  • Form field data: full field names, contents, position, etc.
  • Page size, CropBox, page rotation
  • Status of ISO standards: PDF/X, PDF/A, PDF/UA, PDF/E, and PDF/VT
  • Geospatial reference information
  • List or extract file attachments
  • Layer names, page labels, article threads
  • Annotation details
  • List all comments along with the reviewer’s name
  • Digital signature details: name of signature field(s), signed/unsigned, name of signer, date and reason of signature
  • Extract ICC output intent profiles from PDF/X or PDF/A documents
  • Block properties for PDFlib Personalization Server
  • JavaScript on document, page, annotation, or field level
  • Retrieve XML invoice data from ZUGFeRD documents
  • Properties of PDF Packages/Portfolios

Output Formats

PDFlib pCOS can create output for different purposes:

  • Plain text output
  • Unicode text output in UTF-8 or UTF-16 formats
  • Tabular output for processing with a spreadsheet/database
  • Binary data, e.g. ICC profiles or file attachments
  • User-defined output formats for custom post-processing
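
To show what tabular output for a spreadsheet or database can look like, here is a minimal sketch that writes one row per document as comma-separated values. It uses only the Python standard library and does not reproduce the output of the pCOS command-line tool; the summarize helper, the column names and the sample rows are placeholders chosen for illustration.

```python
import csv
import io


def summarize(rows, columns, delimiter=","):
    """Format per-document property dictionaries as a delimited table."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing properties become empty cells instead of raising an error.
        writer.writerow({c: row.get(c, "") for c in columns})
    return buf.getvalue()


# Placeholder values; in practice they would come from querying each PDF.
rows = [
    {"file": "a.pdf", "pages": 3, "title": "Report, final"},
    {"file": "b.pdf", "pages": 12},
]
table = summarize(rows, ["file", "pages", "title"])
```

Values containing the delimiter are quoted by the csv module, which keeps the table importable even when titles contain commas.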

pCOS Paths - Simple Syntax for PDF Objects

Instead of getting bogged down by complex tree structures, e.g. for bookmarks or form fields, you can easily access PDF objects by using the simple pCOS path syntax. It offers convenient shortcuts for accessing commonly used PDF objects, such as pages, fonts, bookmarks, form fields, etc.
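
The following sketch gives an impression of what such paths look like by assembling a few of them as plain strings. The path forms shown (length:pages, pages[0]/width, /Info/Title and so on) are recalled from the pCOS Path Reference and should be checked against it before use; the helper functions are hypothetical, and passing the paths to the library's query functions is deliberately left out.

```python
# Sketch only: builds pCOS path strings; it does not query a PDF.
# Path forms are assumptions to be verified against the pCOS Path Reference.

def page_size_paths(page_count):
    """Return (width_path, height_path) pairs for each zero-based page index."""
    return [(f"pages[{i}]/width", f"pages[{i}]/height")
            for i in range(page_count)]


def info_path(key):
    """Return the path of a document info entry, e.g. Title or Author."""
    return f"/Info/{key}"


COMMON_PATHS = {
    "page count": "length:pages",   # number of pages
    "font count": "length:fonts",   # number of fonts
    "PDF version": "pdfversion",
    "title": info_path("Title"),
}
```

The point of the notation is that a page, font or info entry is addressed by a short string rather than by walking the underlying object tree by hand.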

pCOS Library or Command-Line Tool?

pCOS is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks.

The pCOS programming library is used...

...for integration into desktop or server applications. Examples of using the library with all supported language bindings are included in the pCOS package.

The pCOS command-line tool is suited...

...for batch processing PDF documents. It doesn’t require any programming, but offers powerful command-line options which can be used to integrate it into complex workflows. The pCOS command-line tool extends the features of the library:

  • Simple retrieval of common PDF elements, such as bookmarks, annotations, metadata, form fields, etc.
  • Extended mode for querying more complex objects and customizing the output format
  • Extract data items, such as file attachments, ICC profiles, etc.
  • Emit information as comma-separated values or a user-defined format for import into a spreadsheet or database
  • Recursion feature for dumping composite PDF objects, such as dictionaries and arrays
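
The idea behind dumping composite objects recursively can be sketched as follows. This is not the tool's implementation or its output format: nested Python dictionaries and lists stand in for PDF dictionaries and arrays, and the dump function and sample data are made up for illustration.

```python
def dump(obj, path=""):
    """Flatten nested dictionaries and lists into 'path = value' lines."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.extend(dump(value, f"{path}/{key}"))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            lines.extend(dump(value, f"{path}[{index}]"))
    else:
        # Leaf value: record it together with the path that leads to it.
        lines.append(f"{path} = {obj!r}")
    return lines


# Made-up stand-in for a composite PDF object.
sample = {"Info": {"Title": "Demo"}, "Kids": [{"Rotate": 0}, {"Rotate": 90}]}
dumped = dump(sample)
```

Each leaf ends up on its own line with the full path that leads to it, which makes deeply nested structures easy to search or compare.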

Supported Development Environments

PDFlib pCOS is everywhere - it runs on practically all computing platforms. We offer 32-bit and 64-bit packages for all common flavors of Windows, OS X/macOS, Linux and Unix.

The pCOS core is written in highly optimized C and C++ code for maximum performance and small overhead. Via a simple API (Application Programming Interface) the pCOS functionality is accessible from a variety of development environments:

  • COM for use with VB, ASP, etc.
  • C and C++
  • Java, including servlets and JSP
  • .NET for use with C#, VB.NET, ASP.NET, etc.
  • Perl
  • PHP
  • Python