Extracting Text from PDF Documents Using React and .NET
PDF documents are a common format for storing and sharing information, but extracting text from them programmatically can be challenging. In this tutorial, we’ll explore how to extract text from PDF documents using a React frontend and a .NET backend, including detailed explanations and end-to-end code examples.
What is PDF Text Extraction?
PDF text extraction involves programmatically extracting text content from PDF documents. This process enables applications to analyze, search, or manipulate the text within PDF files. Text extraction is particularly useful in scenarios such as data mining, document analysis, and content management.
How Does it Work?
Text extraction from PDF documents typically involves parsing the internal structure of the PDF file and extracting text content from text objects, annotations, and other elements. Libraries and frameworks provide APIs and tools to simplify this process, allowing developers to access text content efficiently.
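To make the idea of "text objects" concrete, here is a minimal sketch of reading text out of a page's content stream. Inside a PDF, text is usually drawn with the Tj and TJ operators, whose operands are strings in parentheses. The function names extractShownText and unescapePdfString are made up for this illustration, and the sketch assumes an uncompressed stream with simple literal strings. Real PDFs normally compress their streams and use font encodings, which is why a library is used later in this tutorial.

```javascript
// Sketch only: collects the string operands of the Tj and TJ text-showing
// operators from an *uncompressed* PDF content stream.
// Assumptions: no nested parentheses, no hex strings, no font decoding.

// Resolve the most common escape sequences inside a PDF literal string.
function unescapePdfString(raw) {
  return raw.replace(/\\([()\\nrt])/g, (_, ch) => {
    switch (ch) {
      case 'n': return '\n';
      case 'r': return '\r';
      case 't': return '\t';
      default: return ch; // "(", ")" or a backslash
    }
  });
}

function extractShownText(contentStream) {
  // A literal string such as (Hello), allowing escaped characters.
  const literal = /\((?:\\.|[^\\()])*\)/g;
  // Either "(text) Tj" or "[ (a) -120 (b) ] TJ".
  const operator = /(\((?:\\.|[^\\()])*\))\s*Tj|\[((?:\\.|[^\]\\])*)\]\s*TJ/g;
  const pieces = [];
  let match;
  while ((match = operator.exec(contentStream)) !== null) {
    const source = match[1] !== undefined ? match[1] : match[2];
    const strings = source.match(literal) || [];
    // Drop the surrounding parentheses and join the fragments of one operator.
    pieces.push(strings.map((s) => unescapePdfString(s.slice(1, -1))).join(''));
  }
  return pieces;
}
```

For example, a stream containing (Hello) Tj followed by [(Wor) -20 (ld)] TJ yields the two pieces "Hello" and "World". The numbers inside a TJ array are spacing adjustments, which this sketch simply ignores.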
Where Can We Use PDF Text Extraction?
PDF text extraction can be used in various scenarios, including:
- Data Analysis: Extracting text data from PDF reports, forms, or invoices for analysis and processing.
- Content Management: Parsing text content from PDF documents to index, search, or categorize documents in content management systems.
- Document Conversion: Converting PDF documents to other formats, such as plain text or HTML, for further processing or presentation.
- Information Extraction: Extracting specific information, such as names, dates, or amounts, from structured or semi-structured PDF documents.
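As an illustration of the information extraction scenario, the sketch below pulls dates and amounts out of text that has already been extracted. The function name extractFields and the two patterns (ISO-style dates and dollar amounts) are assumptions chosen for this example. Real documents usually need patterns tuned to their own layout.

```javascript
// Sketch only: find ISO-style dates (YYYY-MM-DD) and dollar amounts in
// already-extracted text. Both patterns are illustrative assumptions.
function extractFields(text) {
  const dates = text.match(/\b\d{4}-\d{2}-\d{2}\b/g) || [];
  // "$" followed by digits, optional thousands groups, optional cents.
  const amounts = text.match(/\$\d+(?:,\d{3})*(?:\.\d{2})?/g) || [];
  return { dates, amounts };
}
```

Calling extractFields on the text of an invoice returns an object with two arrays, which can then be validated or stored by the application.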
The rest of this tutorial builds a React UI for uploading PDF files and displaying the extracted text, backed by an ASP.NET Core Web API that performs the extraction.
Frontend: React UI
Step 1: Create a React Application
Create a new React application using Create React App or your preferred method.
npx create-react-app pdf-text-extractor
Step 2: Install Axios
Install Axios for making HTTP requests to the backend.
npm install axios
Step 3: Create File Upload Component
Create a component for uploading PDF files.
// FileUpload.js
import React, { useState } from 'react';
import axios from 'axios';

const FileUpload = () => {
  const [file, setFile] = useState(null);

  // Remember the file the user picked
  const handleChange = (event) => {
    setFile(event.target.files[0]);
  };

  const handleSubmit = async () => {
    // Nothing to upload until a file has been selected
    if (!file) {
      return;
    }
    const formData = new FormData();
    // The field name must match the backend parameter name (pdfFile)
    formData.append('pdfFile', file);
    try {
      const response = await axios.post('http://localhost:5000/Upload', formData, {
        headers: {
          'Content-Type': 'multipart/form-data',
        },
      });
      console.log(response.data);
      // Handle success: display extracted text
    } catch (error) {
      console.error('Error uploading file:', error);
      // Handle error
    }
  };

  return (
    <div>
      <input type="file" accept="application/pdf" onChange={handleChange} />
      <button onClick={handleSubmit}>Upload PDF</button>
    </div>
  );
};

export default FileUpload;
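Before sending the file, it can help to check it on the client so that obviously wrong uploads never reach the server. The helper below is a sketch under stated assumptions: the name validatePdfFile and the 10 MB limit are made up for this example, and the function only looks at the name, type, and size properties that browser File objects expose. It does not replace validation on the backend.

```javascript
// Hypothetical upload limit for this example (10 MB).
const MAX_PDF_BYTES = 10 * 1024 * 1024;

// Returns an error message for the user, or null when the file looks acceptable.
// "file" is expected to have the name, type, and size properties of a browser File.
function validatePdfFile(file) {
  if (!file) {
    return 'Please choose a file first.';
  }
  // Some browsers leave "type" empty, so fall back to the file extension.
  const looksLikePdf = file.type === 'application/pdf' || /\.pdf$/i.test(file.name);
  if (!looksLikePdf) {
    return 'Only PDF files are supported.';
  }
  if (file.size === 0) {
    return 'The selected file is empty.';
  }
  if (file.size > MAX_PDF_BYTES) {
    return 'The selected file is too large.';
  }
  return null;
}
```

In the FileUpload component, handleSubmit could call validatePdfFile(file) first and show the returned message instead of posting when it is not null.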
Step 4: Display Extracted Text
Display the extracted text in another component.
// ExtractedText.js
import React, { useState, useEffect } from 'react';
import axios from 'axios';

const ExtractedText = () => {
  const [extractedText, setExtractedText] = useState('');

  // Fetch the extracted text once, when the component mounts.
  // This assumes the backend exposes a GET /ExtractedText endpoint
  // that returns the text as the response body.
  useEffect(() => {
    const fetchData = async () => {
      try {
        const response = await axios.get('http://localhost:5000/ExtractedText');
        setExtractedText(response.data);
      } catch (error) {
        console.error('Error fetching extracted text:', error);
        // Handle error
      }
    };
    fetchData();
  }, []);

  return (
    <div>
      <h2>Extracted Text</h2>
      <p>{extractedText}</p>
    </div>
  );
};

export default ExtractedText;
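One detail worth knowing: a single p element collapses line breaks, so multi-line extracted text is rendered as one long run. A possible approach, sketched here with a made-up helper name toLines, is to split the text into lines and render one element per line.

```javascript
// Sketch only: split extracted text into trimmed, non-empty lines so that
// each line can be rendered as its own element.
function toLines(text) {
  const value = text == null ? '' : String(text);
  return value
    .replace(/\r\n?/g, '\n') // normalize Windows and old Mac line endings
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```

In the component, the result could be rendered with something like toLines(extractedText).map((line, index) => <p key={index}>{line}</p>). Wrapping the text in a pre element is a simpler alternative when the original spacing should be preserved.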
Step 5: Integrate Components in App
Integrate the file upload and extracted text components in the main App component.
// App.js
import React from 'react';
import FileUpload from './FileUpload';
import ExtractedText from './ExtractedText';

const App = () => {
  return (
    <div>
      <h1>PDF Text Extractor</h1>
      <FileUpload />
      <ExtractedText />
    </div>
  );
};

export default App;
Backend: ASP.NET Core Web API
Uploading PDF Files in ASP.NET Core Web Application
Step 1: Create an ASP.NET Core Web Application
Create a new ASP.NET Core Web Application project in Visual Studio or your preferred IDE.
Step 2: Add File Upload Feature
Add a file upload feature to your web application. You can use a simple HTML form with an input field of type “file”.
<form method="post" enctype="multipart/form-data" action="/Upload">
  <input type="file" name="pdfFile" />
  <button type="submit">Upload PDF</button>
</form>
Step 3: Handle File Upload in Controller
In your controller, handle the file upload request and save the uploaded PDF file to a temporary location.
[HttpPost("Upload")]
public IActionResult Upload(IFormFile pdfFile)
{
    if (pdfFile != null && pdfFile.Length > 0)
    {
        // Save the upload to a uniquely named temporary file
        var filePath = Path.GetTempFileName();
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            pdfFile.CopyTo(stream);
        }
        // Hand the temporary file path to the ExtractText action
        return RedirectToAction("ExtractText", new { filePath });
    }
    // No file was sent: go back to the start page
    return RedirectToAction("Index");
}
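The upload action above accepts any non-empty file. One inexpensive sanity check is to confirm that the file starts with the PDF header marker %PDF- before trying to parse it. The sketch below is written in JavaScript to match the frontend examples, with a made-up function name isPdfHeader. The same byte comparison could be written in C# inside the action, and it is only a first filter, not proof that the file is a valid PDF.

```javascript
// Sketch only: check whether a byte sequence starts with "%PDF-",
// the marker that begins a PDF file.
// "bytes" can be a Uint8Array or a plain array of numbers.
function isPdfHeader(bytes) {
  const marker = [0x25, 0x50, 0x44, 0x46, 0x2d]; // % P D F -
  if (!bytes || bytes.length < marker.length) {
    return false;
  }
  return marker.every((value, index) => bytes[index] === value);
}
```

In a browser, the first bytes of a selected file can be read with file.slice(0, 5).arrayBuffer() and wrapped in a Uint8Array before being passed to this function.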
Extracting Text from Uploaded PDF Files
Step 1: Install iTextSharp Library
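Install the iTextSharp library via NuGet Package Manager. The service below uses its iTextSharp.text.pdf and iTextSharp.text.pdf.parser namespaces. Note that iTextSharp is the older 5.x line of iText; its successor, iText 7, has a different API.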
Step 2: Create a PDF Text Extractor Service
Create a service class to handle PDF text extraction.
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class PdfTextExtractorService
{
    public string ExtractText(string pdfFilePath)
    {
        using (PdfReader reader = new PdfReader(pdfFilePath))
        {
            StringWriter text = new StringWriter();
            // Page numbers in iTextSharp start at 1
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i));
            }
            return text.ToString();
        }
    }
}
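Text extracted from a PDF often carries layout artifacts such as words hyphenated across line breaks, runs of spaces, and stacks of blank lines between pages. The sketch below shows one possible cleanup pass that could run on the text before it is displayed. The function name normalizeExtractedText and the specific rules are assumptions for this example. In particular, the hyphen rule is a heuristic and will also join genuinely hyphenated words that happen to break at the end of a line.

```javascript
// Sketch only: tidy up text extracted from a PDF.
function normalizeExtractedText(text) {
  return text
    .replace(/\r\n?/g, '\n')                    // normalize line endings
    .replace(/([A-Za-z])-\n([a-z])/g, '$1$2')   // rejoin words split by a hyphen at a line end
    .replace(/[ \t]+/g, ' ')                    // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, '\n')                   // drop spaces around line breaks
    .replace(/\n{3,}/g, '\n\n')                 // keep at most one blank line in a row
    .trim();
}
```

The same rules can be ported to the backend with Regex.Replace if the cleanup should happen before the text leaves the API.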
Step 3: Display Extracted Text
Display the extracted text on a web page.
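@* Assumes ExtractedTextViewModel is resolvable here, for example via _ViewImports.cshtml *@
@model ExtractedTextViewModel
<h2>Extracted Text</h2>
<p>@Model.ExtractedText</p>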
Step 4: Inject PdfTextExtractorService into Controller
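Inject the PdfTextExtractorService into your controller and use it to extract text from the uploaded PDF file. For constructor injection to work, the service must also be registered with the dependency injection container, for example with builder.Services.AddScoped<PdfTextExtractorService>() in Program.cs.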
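public class HomeController : Controller
{
    private readonly PdfTextExtractorService _pdfTextExtractorService;

    // The service is supplied by dependency injection
    public HomeController(PdfTextExtractorService pdfTextExtractorService)
    {
        _pdfTextExtractorService = pdfTextExtractorService;
    }

    // Extracts the text of the uploaded file and passes it to the view
    public IActionResult ExtractText(string filePath)
    {
        var extractedText = _pdfTextExtractorService.ExtractText(filePath);
        return View(new ExtractedTextViewModel { ExtractedText = extractedText });
    }
}

// View model that carries the extracted text to the view
public class ExtractedTextViewModel
{
    public string ExtractedText { get; set; } = string.Empty;
}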
Conclusion
By following the steps outlined above, you can pair an ASP.NET Core Web API with a React UI so that users can upload PDF documents from the frontend, have the backend extract their text content, and view the extracted text within the application.