In the wake of the recent uproar in Spain over metadata, I have been reviewing several tools for deleting PDF metadata. The tools analyzed here run on GNU/Linux systems, although some of them may work on other systems as well.
I started from a PDF that I created myself. As you can see in the following image, it contains metadata (the screenshot is in Spanish, but I guess you get the idea):
The first tool tested was exiftool:
$ exiftool -all= clean.pdf
As you can see, apparently there is no metadata left (“Ninguno” means “None”):
However, on the exiftool website you can read: “Note: Changes to PDF files are reversible because the original metadata is never actually deleted from these files. See the PDF Tags documentation for details.” So the metadata is hidden, not deleted.
Interesting. The question is: how do we recover the metadata? Those who know me probably know the answer: Python. Using the pyPdf library, I ran some tests:
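#!/usr/bin/env python2
import sys
from pyPdf import PdfFileReader

# Use the files given on the command line, or the two test files by default.
fnames = sys.argv[1:] or ['dirty.pdf', 'exiftool.pdf']
for fname in fnames:
    pdf = PdfFileReader(open(fname, 'rb'))
    print fname
    # Dump the document information dictionary (author, producer, dates...).
    print pdf.getDocumentInfo()
    print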
If we run it, we get the following information:
dirty.pdf
{'/CreationDate': u"D:20131114101846+01'00'", '/ModDate': u"D:20131114101846+01'00'", '/Producer': u'blablabla', '/Creator': u'blablabla', '/Author': u'blablabla'}

exiftool.pdf
{'/CreationDate': u"D:20131114101846+01'00'", '/ModDate': u"D:20131114101846+01'00'", '/Producer': u'blablabla', '/Creator': u'blablabla', '/Author': u'blablabla'}
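Why does the old metadata survive? As far as I understand the PDF Tags documentation mentioned above, exiftool edits PDF files with an incremental update: the changes are appended to the end of the file and the original objects stay where they were. Each revision normally ends with its own %%EOF marker, so counting those markers gives a rough hint. The following is a minimal sketch and was not part of the tests above: it uses only the Python 3 standard library, and the function names are made up for illustration.

```python
def count_eof_markers(data):
    """Return how many %%EOF markers appear in the raw bytes of a PDF."""
    return data.count(b"%%EOF")


def looks_incrementally_updated(data):
    """Heuristic only: more than one %%EOF suggests appended revisions."""
    return count_eof_markers(data) > 1


# Hypothetical usage on a real file:
#     with open("exiftool.pdf", "rb") as handle:
#         print(count_eof_markers(handle.read()))
```

Treat the result as a hint, not as proof: other kinds of files, for example linearized PDFs, may also contain more than one marker.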
So, what alternatives do we have? We could look for other programs or just write one ourselves. As you may have guessed, I’ve done both.
First, I developed a small program (Python, of course) that deletes PDF metadata.
#!/usr/bin/env python2
from pyPdf import PdfFileReader, PdfFileWriter
from pyPdf.generic import NameObject
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("input")
parser.add_argument("output")
args = parser.parse_args()

fin = open(args.input, 'rb')
pdfIn = PdfFileReader(fin)
pdfOut = PdfFileWriter()

# Copy only the pages; the input's document info is not carried over.
for page in range(pdfIn.getNumPages()):
    pdfOut.addPage(pdfIn.getPage(page))

# pyPdf adds its own /Producer entry to the new file; remove that too.
info = pdfOut._info.getObject()
del info[NameObject('/Producer')]

fout = open(args.output, 'wb')
pdfOut.write(fout)
fin.close()
fout.close()
It couldn’t be easier:
$ ./limpia.py dirty.pdf clean.pdf
The result is a PDF file named “clean.pdf”, generated from “dirty.pdf”, this time with the metadata actually deleted:
$ ./test.py clean.pdf
clean.pdf
{}
On the other hand, we could also use MAT, which supports PDF and other file types:
$ mat dirty.pdf
[+] Cleaning dirty.pdf
dirty.pdf cleaned !
$ ./test.py dirty.pdf
dirty.pdf
{}
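Whichever tool you use, it may be worth checking the raw bytes as well as the document info dictionary. The sketch below (Python 3, standard library only; find_leftovers is a hypothetical helper, not part of pyPdf or MAT) looks for strings you know were in the original metadata, such as “blablabla”, both as plain Latin-1 text and as UTF-16BE, which PDF text strings may also use. It cannot see inside compressed streams, so an empty result is a good sign rather than a proof.

```python
def encoded_variants(text):
    """Return the byte encodings of text that this sketch searches for."""
    variants = [text.encode("utf-16-be")]
    try:
        variants.append(text.encode("latin-1"))
    except UnicodeEncodeError:
        # The text has characters outside Latin-1; only UTF-16BE applies.
        pass
    return variants


def find_leftovers(data, needles):
    """Return the needles that still occur verbatim in the raw PDF bytes."""
    found = []
    for needle in needles:
        if any(variant in data for variant in encoded_variants(needle)):
            found.append(needle)
    return found


# Hypothetical usage on a cleaned file:
#     with open("clean.pdf", "rb") as handle:
#         print(find_leftovers(handle.read(), ["blablabla"]))
```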
Remember: eat vegetables and clean metadata. See you next time.