Java APIs for text extraction and parsing

Extract raw and formatted text as well as metadata from different formats including emails, zip files and legal documents


GroupDocs.Parser for Java

 

GroupDocs.Parser for Java is a text and metadata extraction API. It extracts text from containers, as well as formatted, structured and highlighted text, from supported formats, including Microsoft Office, Visio, PDF, email and image formats. The API is built for accurate and fast extraction. It also provides convenient tools for detecting text encodings such as UTF-32 LE, UTF-32 BE, UTF-16 LE, UTF-16 BE and more.


Advanced Document Text Extraction API Features

 

 

  • Extracts raw and formatted text
  • Extracts metadata
  • Extracts structured text
  • Extracts highlighted text
  • Extracts text from databases
  • Searches text in documents
  • Fetches text from containers that hold other files, such as zip archives
  • Gets formatted text from TXT, Markdown and HTML files
  • Support for encoding detection
  • Support for media type detectors
  • Extraction from password-protected files
  • Metered licensing

Text and Metadata Extractors

GroupDocs.Parser for Java provides a range of text extractor classes. The API also includes utility classes, such as encoding and media type detectors, for working with different files. For example:

  • EmailTextExtractor and EmailFormattedTextExtractor classes for extracting text from email messages.
  • ExtractorFactory class for creating text extractors, formatted text extractors and containers.
  • EncodingDetector class for detecting the encoding of a file.
  • MediaTypeDetector abstract class, the base for custom detector classes that determine the media type of a given file.

Similarly, the API provides classes for extracting metadata from various document types.
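To make the idea of encoding detection concrete, here is a minimal sketch that recognises the UTF-32 and UTF-16 variants mentioned earlier by their byte order mark. It uses only the Java standard library and is not the GroupDocs.Parser EncodingDetector; the class name BomSniffer and its detect method are hypothetical names chosen for this illustration, and a real detector would likely also handle files that carry no byte order mark.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
final class BomSniffer {

    /** Returns the name of the encoding indicated by a leading byte order mark, if any. */
    static Optional<String> detect(byte[] data) {
        // UTF-32 LE is checked before UTF-16 LE because its mark (FF FE 00 00)
        // begins with the UTF-16 LE mark (FF FE). A UTF-16 LE file whose first
        // character is U+0000 would be misread here; that case is rare in practice.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        return Optional.empty(); // no byte order mark found
    }

    // Compares the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

For example, calling BomSniffer.detect on the bytes FF FE 41 00 yields "UTF-16LE", while input without a byte order mark yields an empty Optional.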

Container Text Extractor

The container extractor works with files that contain other documents, such as zip archives. The API can also be used to extract messages from mail containers, for example an OST container.
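The container idea can be sketched with the Java standard library alone. The example below walks the entries of a zip archive and reads each one as UTF-8 text. It is an illustration of the concept, not the GroupDocs.Parser container API; ZipTextReader and readAll are hypothetical names, and treating every entry as UTF-8 text is a simplifying assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of GroupDocs.Parser.
final class ZipTextReader {

    /** Reads every file entry of a zip stream as UTF-8 text, keyed by entry name. */
    static Map<String, String> readAll(InputStream zipStream) throws IOException {
        Map<String, String> texts = new LinkedHashMap<>(); // keeps archive order
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue; // directories carry no content
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                // read() returns -1 at the end of the current entry, not the archive
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                texts.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return texts;
    }
}
```

A caller would typically pass a FileInputStream for the archive and then hand each entry's content to whichever extractor suits that entry's format.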

Support and Learning Resources

 

GroupDocs.Parser offers document parsing APIs for other popular development environments as listed below: