file size problem: 433mb generated from a 15mb document #1338

juanludlf opened this issue Nov 2, 2022 · 10 comments

Comments

@juanludlf

juanludlf commented Nov 2, 2022

What were you trying to do?

I'm trying to generate a new PDF document based on an existing one.
See How can we reproduce the issue? section to download the original pdf document that is causing this issue.

How did you attempt to do it?

I'm using code similar to this:

import * as fs from "fs";
import { PDFDocument, PageSizes, ParseSpeeds } from "pdf-lib";

// fs.readFileSync returns a Buffer, which is a Uint8Array and can be passed to load() directly
const pdfBytes = fs.readFileSync("original.pdf");

// Load a PDFDocument from the existing PDF bytes
const inputPdf = await PDFDocument.load(pdfBytes, {
  ignoreEncryption: true,
  parseSpeed: ParseSpeeds.Fastest,
  capNumbers: true
});

// create a new PDFDocument
this.output = await PDFDocument.create();

// get document pages (getPages() is synchronous, so no await is needed)
const pages = inputPdf.getPages();

for (let pageIndex = 0; pageIndex < pages.length; pageIndex++) {
  const page = pages[pageIndex];

  // add new page
  const newPage = this.output.addPage(PageSizes.A4);

  // embed and scale original page
  const embedPage = await this.output.embedPage(page);
  const scaledPageDims = embedPage.scale(0.75);

  newPage.drawPage(embedPage, {
    ...scaledPageDims,
    x: 10,
    y: 10
  });
}

// Serialize the PDFDocument to bytes (a Uint8Array)
const newPdfBytes = await this.output.save();

What actually happened?

The original document is 15 MB in size and the generated document is 433 MB.

What did you expect to happen?

I expected to get similar sizes from both the original and the generated document.

How can we reproduce the issue?

The code attached in section How did you attempt to do it? will reproduce this issue.

I think this is an issue specifically with this document, which is based on scanned images.

Version

1.17.1

What environment are you running pdf-lib in?

Node

Checklist

  • My report includes a Short, Self Contained, Correct (Compilable) Example.
  • I have attached all PDFs, images, and other files needed to run my SSCCE.

Additional Notes

No response

@juanludlf changed the title from "Optimize file size" to "file size problem: 433mb generated from a 15mb document" on Nov 8, 2022
@juanludlf
Author

Hi @Hopding
Can you help me guess what's wrong here?
Thank you

@mrdavidrees

Also wondering about this

@SergeiReutov

Yeah, same issue here. Even simply copying the pages increases the size of the resulting PDF:

const copyDocument = async (buffer) => {
  console.log('initial size: ', buffer.byteLength); // 20296
  const newPdf = await PDFDocument.create();
  const initialPdf = await PDFDocument.load(buffer);
  const pages = initialPdf.getPages();
  for (let i = 0; i < pages.length; i++) {
    const [newPage] = await newPdf.copyPages(initialPdf, [i]);
    newPdf.addPage(newPage);
  }
  const bufferCopy = await newPdf.save();
  console.log('copy size: ', bufferCopy.byteLength); // 31691
};

@ns-sjli

ns-sjli commented Dec 10, 2022

Yes, I encountered the same issue: a 9 MB file split into sub-files of 10 pages each grows to 60 MiB per sub-file.

// split.pdf.js
const fs = require('fs');
const path = require('path');
const { PDFDocument } = require('pdf-lib');

const splitPDF = async (pdfFilePath, outputDirectory) => {
  const data = await fs.promises.readFile(pdfFilePath);
  const readPdf = await PDFDocument.load(data);
  const { length } = readPdf.getPages();

  for (let i = 0, n = length; i < n; i += 10) {
    const writePdf = await PDFDocument.create();
    // Clamp to the page count so the last chunk does not request pages that do not exist
    for (let j = i; j < Math.min(i + 10, n); j += 1) {
      const [page] = await writePdf.copyPages(readPdf, [j]);
      writePdf.addPage(page);   
    }
    const bytes = await writePdf.save();
    const outputPath = path.join(outputDirectory, `I100_${i + 1}.pdf`);
    await fs.promises.writeFile(outputPath, bytes);
     
    console.log(`Added ${outputPath}`);
  }
};

// Attach .catch to the promise chain, not to the return value of console.log
splitPDF('100.pdf', 'invoices')
  .then(() => console.log('Files have been split!'))
  .catch(console.error);
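
One variation that may be worth trying, offered as a sketch and not a confirmed fix: as far as the pdf-lib source shows, each `copyPages` call sets up its own object copier, so objects shared between pages (fonts, images) can be copied again on every call when pages are copied one at a time. Passing all of a chunk's indices to a single `copyPages` call lets that chunk share one copier. The names `chunkPageIndices` and `splitPdfBatched` are hypothetical, and whether this shrinks the output for the documents in this thread is untested.

```javascript
// Sketch only: chunkPageIndices and splitPdfBatched are hypothetical names.
const fs = require('fs');
const path = require('path');

// Split the range [0, pageCount) into arrays of at most chunkSize consecutive indices.
const chunkPageIndices = (pageCount, chunkSize) => {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks = [];
  for (let start = 0; start < pageCount; start += chunkSize) {
    const end = Math.min(start + chunkSize, pageCount);
    const chunk = [];
    for (let i = start; i < end; i += 1) chunk.push(i);
    chunks.push(chunk);
  }
  return chunks;
};

const splitPdfBatched = async (pdfFilePath, outputDirectory, chunkSize = 10) => {
  // Required lazily so the pure helper above works without pdf-lib installed.
  const { PDFDocument } = require('pdf-lib');
  const readPdf = await PDFDocument.load(await fs.promises.readFile(pdfFilePath));

  for (const indices of chunkPageIndices(readPdf.getPageCount(), chunkSize)) {
    const writePdf = await PDFDocument.create();
    // One copyPages call per output file, so the whole chunk shares a single copier.
    const pages = await writePdf.copyPages(readPdf, indices);
    pages.forEach((page) => writePdf.addPage(page));

    const outputPath = path.join(outputDirectory, `I100_${indices[0] + 1}.pdf`);
    await fs.promises.writeFile(outputPath, await writePdf.save());
  }
};
```

It would be called the same way as the original, for example `splitPdfBatched('100.pdf', 'invoices').catch(console.error)`. Comparing the resulting file sizes against the per-page version would show whether duplicated shared objects are the cause here.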

@p-kuen

p-kuen commented Dec 19, 2022

Have you tried using copyPages instead of embedPage?

// append to created pdf
const [copyPage] = await this.output.copyPages(inputPdf, [0])
this.output.addPage(copyPage)

@juanludlf
Author

juanludlf commented Dec 19, 2022

Hi @p-kuen
I will give it a try. However, the code samples provided by @SergeiReutov and @ns-sjli use the copyPages method and have the same problem 🤔

@p-kuen

p-kuen commented Dec 19, 2022

Oh sorry, I should've looked more closely. I use copyPages myself and, as a workaround, run the whole merged PDF through Ghostscript for compression, so I never had problems with this one. Not the cleanest solution, but effective.
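
For anyone wanting to try the same workaround from Node, a minimal sketch follows. It assumes a Ghostscript executable named `gs` is on the PATH (on Windows it is typically `gswin64c`), and the `/ebook` preset is an arbitrary choice that downsamples images, so output quality may drop. The function names `buildGhostscriptArgs` and `compressWithGhostscript` are hypothetical.

```javascript
// Sketch only: assumes the Ghostscript binary is available as `gs` on the PATH.
const { execFile } = require('child_process');
const { promisify } = require('util');

const execFileAsync = promisify(execFile);

// Build the argument list for Ghostscript's pdfwrite device.
const buildGhostscriptArgs = (inputPath, outputPath, preset = '/ebook') => [
  '-sDEVICE=pdfwrite',
  '-dCompatibilityLevel=1.4',
  `-dPDFSETTINGS=${preset}`, // e.g. /screen, /ebook, /printer, /prepress
  '-dNOPAUSE',
  '-dQUIET',
  '-dBATCH',
  `-sOutputFile=${outputPath}`,
  inputPath,
];

// Rewrite inputPath into outputPath through Ghostscript; rejects if gs exits with an error.
const compressWithGhostscript = async (inputPath, outputPath, preset) => {
  await execFileAsync('gs', buildGhostscriptArgs(inputPath, outputPath, preset));
};
```

A call such as `await compressWithGhostscript('merged.pdf', 'merged.small.pdf')` would run after pdf-lib has written the merged file to disk. Using `execFile` with an argument array avoids shell quoting problems with file paths.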

@vpatil007

Has anybody found a solution to this issue?

@weihuiling071

Same issue.
Has anybody found a solution to this issue?

@rakurtz

rakurtz commented Dec 29, 2024

Well it seems that this is the same kind of bug described here: issue 1662

copyPages() seems to copy all objects referenced in the source PDF into the new PDF, regardless of whether they appear in the copied page.

I guess this is a severe bug; I wonder why it doesn't come up more often.
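
One way to check this hypothesis against a specific file is to compare how many indirect objects the source document holds with how many end up in a new document after copying a single page. The sketch below relies on `context.enumerateIndirectObjects()`, which pdf-lib exposes on a document's `context`; the function names `countIndirectObjects` and `compareObjectCounts` are hypothetical, and the counts are only a rough indicator, not proof of where the extra bytes come from.

```javascript
// Sketch only: countIndirectObjects and compareObjectCounts are hypothetical names.

// Count the indirect objects currently registered in a pdf-lib document.
const countIndirectObjects = (pdfDoc) =>
  pdfDoc.context.enumerateIndirectObjects().length;

// Copy one page into a fresh document and report object counts for both documents.
const compareObjectCounts = async (pdfBytes, pageIndex = 0) => {
  // Required lazily so the counting helper can be used without pdf-lib installed.
  const { PDFDocument } = require('pdf-lib');
  const source = await PDFDocument.load(pdfBytes);
  const copy = await PDFDocument.create();
  const [page] = await copy.copyPages(source, [pageIndex]);
  copy.addPage(page);
  return {
    sourcePages: source.getPageCount(),
    sourceObjects: countIndirectObjects(source),
    copyObjects: countIndirectObjects(copy),
  };
};
```

If `copyObjects` comes out close to `sourceObjects` for a source with many pages, that would be consistent with most of the source document being pulled in along with the one copied page.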
