HOEKSTRA.CO.UK

I was given a USB stick with 18,000 PowerPoint files with the .ppt extension dating from 2005, each of which is a manually scanned page of sheet music. This was done by a diligent soul some years ago, who knew little about the pernicious ways in which the Big Tech Companies even back then operated to ensure that the data formats for digital artefacts regularly change in order to force users to constantly upgrade their expensive software. No surprise then that files produced by the version of PowerPoint back in 2005 are not recognised by the more recent versions of PowerPoint. So all this hard work appears to be entirely useless. Or is it?

LibreOffice and BASH Wizardry to the Rescue!

If one can convert each of these old-version PowerPoint files to a PDF file, which is actually more useful than having it in PowerPoint format in any case, then all this hard work can be made useful to consumers of printed sheet music again. Furthermore, the PDF data format also has a far better chance of "not going out of date", so we can be reasonably sure that all this scanned sheet music will still be accessible in many years to come.

Luckily, it is possible to solve this problem with a single BASH-incantation that would make Harry Potter blush:

find . -name "*.ppt" -type f -exec bash -c 'd="${0%/*}" ; libreoffice --headless --invisible --convert-to pdf "$0" --outdir "$d"' {}  \; -print

What goes on here? We use LibreOffice, which can still read all the old, discontinued office document formats, and we use its built-in conversion function to convert each file to a PDF document. Since we don't want to open the GUI up every time that we process a file, we add the --headless and the --invisible parameters to the command that invokes LibreOffice. Then we use the find command to find all the .ppt files and apply this command to each file that we find.

You still with me?

Find's -exec parameter

We use the -exec parameter of the find command to perform all the magic. Whatever is in the -exec parameter is executed for every file found that matches the search criteria. (Reminder: we are searching for all files with a .ppt extension, as specified in the -name parameter.) The -exec parameter is limited to a single command, so if multiple operations are required, which is the case here, then the commands need to be put in a small script that is then called from here. Or we can use a sub-shell, which is a more elegant approach.

The behaviour of this call to LibreOffice is to dump the converted PDF output file in our current working directory. We would end up with all the PDF files in one directory, with the likelihood of some files overwriting each other. What we really would like it to do is put the converted PDF file alongside the old .ppt file in their respective directories. Luckily, we can force LibreOffice to dump the output file in any desired directory by setting the --outdir [some directory] parameter. All we need to do is specify the name of that directory, which clearly will be the same as the one that our .ppt file is in. The only value that we have to work with is the full .ppt file path, from which we can extract the directory path and use it in the --outdir parameter.

This means that the -exec portion of the find command needs to execute more than one command, so we need to use the sub-shell approach in the -exec portion, so that we can embed a mini-script of commands in the -exec parameter, in the form:

-exec bash -c '...a bunch of commands that use $0...' {}

The found-file placeholder {}, which contains the full file path to the next found file, is passed as parameter 0 into the sub-shell, where it is referred to as $0. The code in the sub-shell can be broken down as follows:

  • Extract the directory from the file path into variable 'd'. 
d="${0%/*}" 
  •  Invoke LibreOffice in headless mode to convert the file in the file path and dump the output to the directory indicated by variable 'd':
libreoffice --headless --invisible --convert-to pdf "$0" --outdir "$d"
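Before letting this loose on 18,000 files, it may be worth doing a dry run. The sketch below keeps the same find / -exec / sub-shell structure, but swaps the LibreOffice call for a harmless echo, so that we can see which file would be converted and where the output would land. The temporary directory tree and the file names in it are made up purely for this illustration.

```shell
#!/usr/bin/env bash
# Dry run: same structure as the real command, but nothing is converted.
# The directories and file names below are invented for this demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/hymns" "$demo/carols"
touch "$demo/hymns/page 001.ppt" "$demo/carols/page 002.ppt"

# {} arrives in the sub-shell as $0; ${0%/*} strips the trailing "/filename".
dry_run=$(cd "$demo" && find . -name "*.ppt" -type f \
  -exec bash -c 'd="${0%/*}"; echo "would convert $0 into $d"' {} \; | sort)

printf '%s\n' "$dry_run"
# would convert ./carols/page 002.ppt into ./carols
# would convert ./hymns/page 001.ppt into ./hymns

rm -rf "$demo"
```

Note that the file names contain spaces on purpose: because "$0" and "$d" are quoted inside the sub-shell, they survive intact.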

This process took 5 hours to run for the 18,000 files on an old-ish quad-core laptop, which quickly broke out into a sweat and had the fans buzzing. But computers are our slaves, right?

Restart from where we left off

But what if there were some failures along the way, or the process had to be stopped to yield to more important processes? Can we restart this long process where it stopped at the last conversion, without having to repeat the conversion for the files that have already been converted? Yes, we can! We check whether the converted PDF file already exists, and perform the conversion only if it does not. Sounds simple, but we need to code a conditional invocation of the LibreOffice conversion utility, based on whether the PowerPoint file's corresponding PDF file exists or not.

Some interesting string manipulations in BASH

And what's more, we need to make up the PDF file name first, by taking the PowerPoint file path string, stripping the .ppt extension, and then adding the .pdf bit. Now remember, the PowerPoint file path in {} arrives in the sub-shell as parameter 0, which we access as $0, or ${0}. Using the curly-bracket form allows us to add in-line operators to the variable in order to manipulate its content "on-the-fly", such as removing the bits that match a pattern, changing its case, and much more. Note that these patterns are shell glob patterns (the same wildcards that are used for file names, such as * and ?) and not regular expressions. Here we just want to strip the .ppt extension from the file path, which we do with the %[pattern] operator, which removes the shortest part of the string that matches the pattern, anchored at the right hand side. We can call this the "non-greedy" mode. Now if we used the %%[pattern] operator, the longest matching part would be removed from the right hand side of the string, so this mode is unsurprisingly called the "greedy" mode. For a fixed pattern without wildcards, such as .ppt, the two behave identically. It works in a similar way with the #[pattern] operator and the ##[pattern] operator, except that they match from the left hand side. So, here we are at last: to remove the .ppt bit off the right hand side of the string held in parameter 0, and to assign the result to variable t, we do this:

t=${0%\.ppt}

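Note that we escape the period (.) here, although strictly speaking this is not required: the pattern is a shell glob pattern and not a regular expression, so a period only ever matches an actual period, and the escape is harmless. Let's now add the .pdf file extension to variable t by explicitly appending .pdf: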

t=${0%\.ppt}.pdf
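To get a feel for these operators, here is a small sketch that applies each of them to one made-up file path (the path and the variable names are invented for this illustration). The comments show what each expression leaves behind.

```shell
#!/usr/bin/env bash
# A made-up file path to experiment on.
f="music/carols/silent.night.ppt"

dir_part="${f%/*}"        # shortest match of /* from the right: music/carols
base_part="${f##*/}"      # longest match of */ from the left:   silent.night.ppt
first_cut="${f#*/}"       # shortest match of */ from the left:  carols/silent.night.ppt
short_cut="${f%.*}"       # shortest match of .* from the right: music/carols/silent.night
long_cut="${f%%.*}"       # longest match of .* from the right:  music/carols/silent
pdf_name="${f%.ppt}.pdf"  # swap the extension: music/carols/silent.night.pdf

# The period in a glob pattern is literal, so "Xppt" is NOT treated as ".ppt":
g="musicXppt"
literal_dot="${g%.ppt}"   # nothing matches, so the value stays musicXppt

printf '%s\n' "$dir_part" "$base_part" "$first_cut" \
              "$short_cut" "$long_cut" "$pdf_name" "$literal_dot"
```

The last case is the one that would behave differently if these were regular expressions, where a period stands for "any character".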

Note that the period in the appended .pdf did not need to be escaped, since it sits outside the curly brackets and is just plain text. We are now ready to check if the PDF file path that is held in variable t exists or not with the -f file operator, which returns TRUE if the file exists and is a regular file. To check for the inverse condition, use the ! not-operator, and since we want to say "If this file does not exist, then perform this process", the condition statement looks like this:

[[ ! -f $t ]]
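Here is a minimal sketch of this check in action, assuming a made-up directory with two PowerPoint files, of which only one already has its PDF. The file names are invented, and an echo stands in for the actual conversion.

```shell
#!/usr/bin/env bash
# Made-up work directory: done.ppt already has a PDF, todo.ppt does not.
workdir=$(mktemp -d)
touch "$workdir/done.ppt" "$workdir/done.pdf" "$workdir/todo.ppt"

# As with find's {}, each file path is handed to the sub-shell as $0.
pending=$(for p in "$workdir"/*.ppt; do
  bash -c 't=${0%.ppt}.pdf; [[ ! -f $t ]] && echo "convert: ${0##*/}"' "$p"
done)

printf '%s\n' "$pending"
# convert: todo.ppt

rm -rf "$workdir"
```

Only todo.ppt is reported, because the test fails for done.ppt and the && then skips whatever follows it.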

Combining all of this, we get an improved BASH incantation, that will resume the conversion process where it left off last time:

find . -name "*.ppt" -type f -exec bash -c 'd="${0%/*}" ; t=${0%\.ppt}.pdf; [[ ! -f $t ]] && libreoffice --headless --invisible --convert-to pdf "$0" --outdir "$d"' {}  \; -print
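A handy side effect of this construction is that find only moves on to -print when the -exec command succeeds, so files that are skipped are not printed. The same trick can be used to see how much work is left before restarting. The sketch below is one way of doing that, not part of the original incantation: count_pending is a hypothetical helper name, and the demo directory and its files are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the .ppt files under directory $1 that
# do not have a matching .pdf next to them yet.
count_pending() {
  # The sub-shell only tests; -print fires only when the test succeeds.
  find "$1" -name "*.ppt" -type f \
    -exec bash -c '[[ ! -f "${0%.ppt}.pdf" ]]' {} \; -print | wc -l | tr -d ' '
}

# Made-up tree: three PowerPoint files, one of them already converted.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
touch "$demo/a/one.ppt" "$demo/a/one.pdf" "$demo/a/two.ppt" "$demo/b/three.ppt"

remaining=$(count_pending "$demo")
echo "files still to convert: $remaining"

rm -rf "$demo"
```

This counts one line per file, so it assumes that no file name contains a newline.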

Eat this, Harry!