Java APIs for text extraction and parsing
Extract raw and formatted text as well as metadata from different formats including emails, zip files and legal documents
GroupDocs.Parser for Java
GroupDocs.Parser for Java is a text and metadata extraction API. It extracts text from containers, as well as formatted, structured or highlighted text, from supported formats, including Microsoft Office, Visio, PDF, email and image formats. The API is built for accurate, fast extraction and also provides tools for detecting encodings such as UTF32 LE, UTF32 BE, UTF16 LE, UTF16 BE and more.
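To make the encoding detection mentioned above more concrete, here is a minimal sketch of one common technique: inspecting the byte-order mark (BOM) at the start of a file. It uses only the JDK and is not the GroupDocs.Parser implementation; the class name BomSniffer and its detect method are hypothetical names chosen for this illustration. A real detector would also need heuristics for files that carry no BOM.

```java
/** Hypothetical helper for illustration only; not part of GroupDocs.Parser. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the encoding name implied by a leading byte-order mark,
     * or null when the data does not start with a known BOM.
     * UTF-32 LE is checked before UTF-16 LE because its BOM (FF FE 00 00)
     * begins with the UTF-16 LE BOM (FF FE).
     */
    static String detect(byte[] data) {
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    /** Compares the first bytes of data, read as unsigned values, with prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

One known limitation of this approach: UTF-16 LE text whose first character is U+0000 produces the same leading bytes as the UTF-32 LE BOM, so the sketch would report UTF-32 LE for it.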
At A Glance
An overview of the Java text extraction API for retrieving raw and formatted text from documents.
- Text from files
- Text from containers
- Formatted text
- Structured text
- Highlighted text
- Plain text
- Gets Input File
- Fetches Raw or Formatted Text
- Fetches Metadata
- Encoding Detectors
- Media Type Detectors
Document text extraction within any Java application
- Java SE 7.0 and above
- Windows desktops and servers
- Mac OS
Text extractor API Supported File Formats
GroupDocs.Parser for Java supports Microsoft Word, Excel, PowerPoint, TXT, HTML and MHTML files, among other formats.
- Word: DOC, DOT, DOCX, DOCM, RTF, WordprocessingML (XML)
- Excel: XLS, XLSX, XLSM, XLSB, SpreadsheetML (XML)
- PowerPoint: PPT, PPTX, PPTM, PPS, PPSX, PPSM
- Outlook: MSG, PST, OST
- OneNote: ONE
- OpenDocument Formats: ODT, ODS, ODP
- Portable Document Format: PDF, PDF Portfolio, Encrypted PDF
- Web: HTML, XHTML, MHTML, XML
- Text: TXT, CSV
- Email Formats: EML, EMLX, TNEF, POP, IMAP
- Markdown: MD
- Compression: ZIP, CHM
- Other: EPUB, FB2
Advanced Document Text Extraction API Features
Extract raw and formatted text
Extract structured text
Extract highlighted text
Search text in documents
Extract text from containers that hold other files, such as ZIP archives
Extract formatted text from TXT, Markdown and HTML files
Support for encoding detection
Support for media type detectors
Extraction from password-protected files
Text and Metadata Extractors
GroupDocs.Parser for Java provides a range of text extractor classes. The API also includes utility classes, such as encoding and media type detectors, for working with different files, e.g.:
- EmailTextExtractor and EmailFormattedTextExtractor classes to extract text from email messages
- ExtractorFactory class for creating text extractors, formatted text extractors and containers.
- EncodingDetector class for detecting the text encoding of a file.
- MediaTypeDetector abstract class, the base for custom detector classes that determine the media type of the corresponding file.
In the same way, the API has various classes for extracting metadata from different document types.
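The idea of an abstract media type detector with one custom subclass per file type can be illustrated with a short, JDK-only sketch. This is not the library's MediaTypeDetector class or its API: SketchDetector, PdfDetector, ZipDetector and detectFirst are hypothetical names invented for this example, and the detection relies on the well-known leading signatures of PDF ("%PDF-") and ZIP ("PK" followed by 0x03 0x04) files.

```java
import java.util.List;

/** Hypothetical base class for this sketch; not the library's MediaTypeDetector. */
abstract class SketchDetector {
    /** Returns a media type when the header matches, otherwise null. */
    abstract String detect(byte[] header);

    /** Checks whether data starts with the given unsigned byte values. */
    static boolean hasPrefix(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Tries each detector in order; falls back to a generic binary type. */
    static String detectFirst(List<SketchDetector> detectors, byte[] header) {
        for (SketchDetector detector : detectors) {
            String type = detector.detect(header);
            if (type != null) return type;
        }
        return "application/octet-stream";
    }
}

/** Custom detector for PDF, which typically begins with the ASCII text "%PDF-". */
final class PdfDetector extends SketchDetector {
    @Override
    String detect(byte[] header) {
        return hasPrefix(header, '%', 'P', 'D', 'F', '-') ? "application/pdf" : null;
    }
}

/** Custom detector for ZIP, whose local file header starts with "PK" 0x03 0x04. */
final class ZipDetector extends SketchDetector {
    @Override
    String detect(byte[] header) {
        return hasPrefix(header, 'P', 'K', 0x03, 0x04) ? "application/zip" : null;
    }
}
```

Calling SketchDetector.detectFirst with a list containing both detectors and the first few bytes of a file returns the first matching type. Note that DOCX, XLSX and PPTX files are ZIP packages, so a signature check alone would report them as application/zip; telling them apart requires looking inside the archive.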
Container Text Extractor
A container works with files that hold other documents, such as ZIP archives. The API can also be used to extract messages from such containers, for example an OST container.
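As a rough illustration of what walking a container involves, the sketch below reads the plain-text entries of a ZIP archive with the JDK's java.util.zip package. It is not the GroupDocs.Parser container API: ZipTextReader and readTextEntries are hypothetical names, and the sketch assumes that entries ending in .txt are UTF-8 encoded. Other entry types, which the library would hand to a suitable extractor, are simply skipped here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration only; not part of GroupDocs.Parser. */
final class ZipTextReader {
    private ZipTextReader() {}

    /** Maps each .txt entry name in the archive to its content, decoded as UTF-8. */
    static Map<String, String> readTextEntries(InputStream archive) throws IOException {
        Map<String, String> texts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.toLowerCase(Locale.ROOT).endsWith(".txt")) {
                    continue; // this sketch only handles plain-text entries
                }
                // Read the current entry fully; read() returns -1 at the end of the entry.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                texts.put(name, new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return texts;
    }
}
```

A caller could pass a FileInputStream for an archive on disk and iterate over the returned map, which preserves the order of entries in the archive.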