GroupDocs.Parser for Java

Retrieve text from PDF using Java

Seamlessly pull readable or structured text from files like PDF, Word, Excel, and more using GroupDocs.Parser in your Java development projects.


How to retrieve text from PDF using Java

Follow the steps below to extract text from PDF files using GroupDocs.Parser within your Java project:

Load the PDF document using the Parser class.
Perform text extraction from the file content.
Check if the text was successfully retrieved.
Use the text data in search, analytics, or automation systems.


import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;

// Create a Parser instance for your document
try (Parser parser = new Parser("input.pdf"))
{
    // getText() returns null if text extraction isn't supported for the format
    try (TextReader reader = parser.getText())
    {
        // Print the extracted text, or a notice when extraction is unsupported
        System.out.println(reader == null ?
            "Text extraction isn't supported for this format" : reader.readToEnd());
    }
}

<dependencies>
    <dependency>
        <groupId>com.groupdocs</groupId>
        <artifactId>groupdocs-parser</artifactId>
        <version>24.9</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
        <id>repository.groupdocs.com</id>
        <name>GroupDocs Repository</name>
        <url>https://repository.groupdocs.com/repo/</url>
    </repository>
</repositories>
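
The final step in the list above, using the extracted text in search or analytics, could be sketched as follows. This is a minimal illustration in plain Java, not part of the GroupDocs.Parser API: the class `ExtractedTextStats` and its methods are hypothetical helpers, and the input is assumed to be the string returned by `reader.readToEnd()` (or null when extraction is unsupported).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of GroupDocs.Parser: works on text that
// has already been extracted (for example via reader.readToEnd()).
class ExtractedTextStats {

    // Returns a map from lower-cased word to its number of occurrences,
    // in order of first appearance. A null input yields an empty map.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (text == null) {
            return counts; // nothing was extracted, so nothing to count
        }
        // Split on any run of characters that is neither a letter nor a digit
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Case-insensitive check for a search term in the extracted text.
    static boolean containsTerm(String text, String term) {
        return text != null && term != null
            && text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }
}
```

In a real project the extracted string would typically be passed to a search index or analytics pipeline instead; the null checks mirror the null result that `getText()` gives for unsupported formats.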


Rich text extraction functionality

GroupDocs.Parser goes beyond simple text extraction: it also retrieves images, metadata, and structured data to support content processing tasks.

Extract and structure text content from documents

Works across numerous document formats

Capture both raw and structured text from DOCX, XLSX, PPTX, PDF, HTML, and other formats.

Extract text from visual and textual content

Parse text from scanned documents, slides, spreadsheets, and other file types while preserving logical structure.

Detailed control over extraction process

Configure page ranges, layout zones, and accuracy parameters for fine-tuned text parsing.

Sample: Extracting text regions from a PPTX document

This sample demonstrates extracting text blocks along with their spatial coordinates from a PowerPoint presentation using GroupDocs.Parser.

Java

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageTextArea;

// Load your PPTX file with the Parser API
try (Parser parser = new Parser("input.pptx"))
{
    // Get all rectangular text areas; the result is null if the feature isn't supported
    Iterable<PageTextArea> areas = parser.getTextAreas();

    // Exit if this feature is not supported
    if (areas == null)
    {
        return;
    }

    // Loop through the text areas
    for (PageTextArea a : areas)
    {
        // Print each text block with its page index and bounding rectangle
        System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
    }
}
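
Once the text areas have been read, a common next step is to group them by page. The sketch below shows one way to do that in plain Java. It does not call GroupDocs.Parser: the nested `Area` class is a hypothetical stand-in holding the values that the sample above reads through `a.getPage().getIndex()` and `a.getText()`, and `TextAreaGrouping` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of GroupDocs.Parser.
class TextAreaGrouping {

    // Stand-in for the page index and text read from a PageTextArea.
    static final class Area {
        final int pageIndex;
        final String text;

        Area(int pageIndex, String text) {
            this.pageIndex = pageIndex;
            this.text = text;
        }
    }

    // Groups area texts by page index; pages are sorted in ascending order
    // and texts keep the order in which they were encountered.
    static Map<Integer, List<String>> textByPage(Iterable<Area> areas) {
        Map<Integer, List<String>> byPage = new TreeMap<>();
        if (areas == null) {
            return byPage; // feature not supported: return an empty result
        }
        for (Area a : areas) {
            byPage.computeIfAbsent(a.pageIndex, k -> new ArrayList<>()).add(a.text);
        }
        return byPage;
    }
}
```

A `TreeMap` is used here so that pages come out in order even if the areas arrive unsorted; whether the library guarantees any ordering is not assumed.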

Introducing the GroupDocs.Parser for Java API

GroupDocs.Parser is a robust and scalable document parser designed for Java developers. It extracts text, tables, images, and structured components from formats including PDF, DOCX, XLSX, and PPTX, without relying on external utilities.
