GroupDocs.Parser for Java

Extract data from XLSX documents in Java

Extract structured content such as text, metadata, tables, and graphics from PDF, Word, Excel, and image-based documents using GroupDocs.Parser in your Java applications.

Maven Download

Start Free Trial

How to extract data from XLSX using Java

To extract useful information from XLSX documents in your Java projects using GroupDocs.Parser, follow these instructions:

1. Open the XLSX file with a Parser object.
2. Use the parser to retrieve the required data (text, tables, metadata, etc.).
3. Verify that the extracted output is correct and complete.
4. Integrate the parsed content into your data flow, business processes, or applications.

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;

// Initialize the Parser with the input document
try (Parser parser = new Parser("input.xlsx"))
{
    // Retrieve all available text content from the document
    try (TextReader reader = parser.getText())
    {
        // getText() returns null when text extraction is not supported for this format
        // Incorporate the extracted content into your solution
        System.out.println(reader == null
            ? "This format may not support text extraction"
            : reader.readToEnd());
    }
}
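
The sample above covers the first two steps. Verification and integration depend on your application, so the following is only a minimal sketch. It assumes the text from reader.readToEnd() is already held in a String, and that each worksheet row arrives as one line with cells separated by tabs. That layout is an assumption made for illustration, not documented behavior, so inspect real output before relying on it. ExtractedTextChecker and its methods are hypothetical helpers and are not part of GroupDocs.Parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of GroupDocs.Parser.
final class ExtractedTextChecker {

    // Step 3: treat null or whitespace-only text as "nothing extracted".
    static boolean hasContent(String text) {
        return text != null && !text.trim().isEmpty();
    }

    // Step 4: split the text into rows and cells for downstream use.
    // ASSUMPTION: one row per line, cells separated by tab characters.
    static List<List<String>> toRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        if (!hasContent(text)) {
            return rows;
        }
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // The -1 limit keeps trailing empty cells.
            rows.add(Arrays.asList(line.split("\t", -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample text standing in for the extracted content.
        String sample = "Name\tQty\nWidget\t3\n";
        System.out.println(toRows(sample));
    }
}
```

Keeping the check and the split separate lets you reject empty extractions early and swap in a different row format if your documents turn out to be laid out differently.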

<dependencies>
    <dependency>
        <groupId>com.groupdocs</groupId>
        <artifactId>groupdocs-parser</artifactId>
        <version>24.9</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
        <id>repository.groupdocs.com</id>
        <name>GroupDocs Repository</name>
        <url>https://repository.groupdocs.com/repo/</url>
    </repository>
</repositories>

Versatile document parsing functionality

Beyond text extraction, GroupDocs.Parser parses barcodes, metadata, images, tables, and other data, which supports automation and data-driven applications.

Extract from multiple file formats

Access data like text, tables, and media from widely used file types such as PDF, Word, Excel, PowerPoint, HTML, and others.

Parse content from digital and scanned sources

Process content from both native digital files and scanned images, using OCR when necessary to interpret embedded text.

Flexible configuration options

Tailor your parsing with settings for page selection, layout zones, and custom field templates to meet specific extraction needs.
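
To make the idea of a layout zone concrete, the sketch below filters positioned text fragments by a rectangular zone. Zone, Fragment, and ZoneFilter are hypothetical stand-ins written for this example; they are not GroupDocs.Parser types, and the library's own API for page areas may work differently. The sample fragments are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration only; these are not GroupDocs.Parser classes.
final class ZoneFilter {

    // Axis-aligned rectangle given by its top-left corner and size.
    static final class Zone {
        final double x, y, width, height;

        Zone(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        // True when the point lies inside the zone or on its border.
        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    // A piece of text anchored at a page coordinate.
    static final class Fragment {
        final String text;
        final double x, y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keep only the fragments whose anchor point falls inside the zone.
    static List<String> textInZone(List<Fragment> fragments, Zone zone) {
        List<String> result = new ArrayList<>();
        for (Fragment f : fragments) {
            if (zone.contains(f.x, f.y)) {
                result.add(f.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Zone details = new Zone(35, 320, 530, 55);
        List<Fragment> page = Arrays.asList(
                new Fragment("Invoice", 40, 100),  // above the zone
                new Fragment("Item A", 40, 330));  // inside the zone
        System.out.println(textInZone(page, details));
    }
}
```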

Parsing PDF using a data extraction template

This sample shows how to extract structured fields from a PDF using a custom template via GroupDocs.Parser.

Java

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.*;
import com.groupdocs.parser.templates.*;
import java.util.Arrays;

// Open the PDF using the Parser class
try (Parser parser = new Parser("input.pdf"))
{
    // Apply the parsing template to extract the defined data
    DocumentData data = parser.parseByTemplate(getTemplate());

    // parseByTemplate returns null when template-based extraction is unavailable
    if (data == null) {
        return;
    }

    // Work with the extracted data fields
    for (int i = 0; i < data.getCount(); i++) {
        System.out.print(data.get(i).getName() + ": ");
        // Only text areas expose getText(); any other area type is reported as non-text
        PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
                ? (PageTextArea) data.get(i).getPageArea() : null;
        System.out.println(area == null ? "Not a text area" : area.getText());
    }
}

private static Template getTemplate()
{
    // Define the page rectangle in which to detect the 'details' table
    TemplateTableParameters detailsTableParameters =
        new TemplateTableParameters(new Rectangle(new Point(35, 320), new Size(530, 55)), null);

    TemplateItem[] templateItems = new TemplateItem[]
    {
        new TemplateTable(detailsTableParameters, "details", null)
    };

    return new Template(Arrays.asList(templateItems));
}

What is GroupDocs.Parser for Java?

GroupDocs.Parser is a document parsing API for Java developers. It extracts and processes text, images, tables, structured fields, and barcodes from formats such as PDF, DOCX, XLSX, and PPTX, without requiring additional libraries to be installed.

Ready to get started?

Download GroupDocs.Parser for free or get a trial license for full access!

Useful resources

Explore documentation, code samples, and community support to enhance your experience.

File types supported for content extraction

GroupDocs.Parser is compatible with a wide range of document and image file types, making it easy to extract information from commonly used formats in parsing and data automation scenarios.

Parse PDF (Portable Document Format)
Parse DOCX (Office 2007+ Word Document)
Parse PPTX (Open XML Presentation Format)
Parse TXT (Text file)
Parse RTF (Rich Text Format)
Parse XML (eXtensible Markup Language)
Parse EPUB (Open eBook File)
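
A quick check on the file extension can keep files of an unexpected type away from the parser. The sketch below covers only the formats named on this page, so it is a starting point and not the full list of supported formats. SupportedFormats is a hypothetical helper and is not part of GroupDocs.Parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of GroupDocs.Parser.
final class SupportedFormats {

    // ASSUMPTION: only the formats named on this page; extend as needed.
    private static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList(
            "pdf", "docx", "xlsx", "pptx", "txt", "rtf", "xml", "epub"));

    // True when the file name ends in one of the listed extensions (case-insensitive).
    static boolean isListed(String fileName) {
        if (fileName == null) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension, or the name ends with a dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isListed("input.xlsx"));
        System.out.println(isListed("archive.zip"));
    }
}
```

An extension check says nothing about the actual file contents, so the null checks shown in the parsing samples are still needed.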