Back to GSoC 2026
Mentors:Till KamppeterTill KamppeterSahil AroraDheeraj Yadav
GSoC 2026Project Idea

cups-filters: Create OCR filter to deliver scans as searchable PDFs

2 minute read

Introduction

1 contributor medium-size (175 hrs), Level of difficulty: Intermediate

Scanning with IPP Scan gives the user the possibility to request the scanned image in PDF format. If the IPP Scan server is a Scanner Application, a filter function from cups-filters would convert the the raster image coming from the scanner into PDF.

Now such PDF files are simply raster images in a PDF frame, not high-level graphics with text and fonts, as PDFs produced by office applications are. Especially one cannot search text in a PDF coming from a scanning process.

Ghostscript has a new "pdfocr8" device with which Ghostscript takes raster graphics PDFs (or PostScript files) as input, applies OCR (Optical Character Recognition) to the raster image, and creates a PDF which contains the raster image to visually show the scan but adds data about the contained text and where it is located, so that you can find text with the search facility of a PDF viewer.

Here the contributor's task is to write a filter function (or extend the cfFilterGhostscript() filter function) to make the "pdfocr8" output device of Ghostscript being used so that a searchable PDF is obtained.

Mentors

Till Kamppeter, organization lead OpenPrinting (till at linux dot com), Sahil Arora (sahilarora dot 535 at gmail dot com), Dheeraj Yadav (dhirajyadav135 at gmail dot com), TBD

Desired knowledge

C/C++, CUPS

Code License

Apache 2.0