In this installment, I'll cover concatenating multiple image files into a multi-page pdf--a very handy trick made possible by the imagemagick utility convert. But first, a bit of grousing on the subject of academia, budget-constrained research, and academic publishing.
Pricing for on-line academic resources tends, not surprisingly, to be linked to the budgetary allowances of large academic institutions: what institutions can afford to pay for electronic access to some journal or other, for example, will influence the fee charged to anyone wishing to gain such access. If one is affiliated with such an institution--whether in an ongoing way, as a student, staff member, or faculty member, or in a more ephemeral way, such as by physically paying a visit to one's local academic library--one typically need pay nothing at all for such access: the institution pays an annual fee that enables these users to utilize the electronic resource.
But for someone who has no long-term affiliation with such an institution and who may find it difficult to be physically present in its library, some sort of payment may be required. To give an example, while doing some research recently for an article I'm writing, I found several related articles I needed to read. I should mention that, since I am still in the early stages of writing my article, there are bound to be several additional articles to which I will need ongoing access. I will address two journal articles in particular, both of which were published between twenty and thirty years ago.
I discovered that both articles were available through an on-line digital library. I was offered the option of downloading the articles at between twenty and forty dollars apiece. At that rate--since one of my article's main topics has received fairly scant attention in modern times and I might need to review only another twenty or so articles--it could cost me well over six hundred dollars just to research and provide references for this topic. The time I spend actually writing and revising the article--a less tangible cost, and one that will substantially exceed the time spent researching--is an additional "expense" of producing the material.
There are, of course, ways to reduce the more tangible costs. Inter-library loan often proves a valuable resource in this regard, since even non-academic libraries that lack subscriptions to academic publishers or digital libraries can nonetheless request journals or books containing relevant articles or, better yet, obtain such articles for their patrons in electronic format--typically pdf files, often created by scanning paper-and-ink journals or books.
Some digital libraries even offer free--though quite limited--access to their materials. In researching my project I found three articles available from such a site. On registration, the site offered free on-line viewing, in low-resolution scans, of just a couple of articles at a time--each being made available for viewing for only a few days. Once that limit was reached, another article could be viewed only after those few days had elapsed. For the budget-constrained researcher, this is a promising development, but not an entirely practicable one.
Being able to view an article on a computer screen is a lot better than having no electronic access to it at all. But it is of no help in circumstances where one is without an internet connection. Having the ability to save the article to an e-reader would be preferable and far more flexible than reading it, one page at a time, in a browser window. But the service seems calculated to preclude that option without payment of the twenty- to forty-dollar per-article fee. It turns out, however, that ways around such restrictions can sometimes be discovered. And that, finally, is where the tools mentioned in the first paragraph of this article enter in. Thus, without further ado, on to the technical details.
Some digital libraries actually display, on each page of the article as you go about reading it in a web browser window, a low-resolution image of the scanned page. As I discovered, one can right-click on that image and choose to save it to the local drive. The file may have, instead of a humanly comprehensible name, just a long string of characters and/or numbers. And it may, as well, lack any file extension. But I discovered that the page images could, in my case, be saved as png files. Those png files, appropriately named so as to cause them to retain their proper order, can then be concatenated, using imagemagick tools, into a multi-page pdf. That multi-page pdf can then be transferred to the reading device of choice. I found that, although the image quality is quite poor, it is nonetheless sufficient to allow deciphering of even such smaller fonts as one typically finds in footnotes. Although involving a bit of additional time and labor, using this tactic can yet further defray the budget-constrained researcher's more tangible costs.
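Incidentally, if the saved file lacks an extension and you're unsure of its format, the standard file utility will identify it. A quick sketch, with a made-up file name standing in for the long character string the site actually assigns:

file a9f4c1d2e07b       # reports something like "PNG image data, 850 x 1100, ..."
mv a9f4c1d2e07b 01.png  # rename with a numerical prefix and the proper extension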
For reasons that will become obvious, the image files should be saved to a directory empty of other png files. How the images should be named is essentially a numerical question, dependent on the total number of pages in the article. If the total number of pages is in the single digits, it is a simple matter of naming them, for example, 1.png, 2.png, 3.png, and so forth. If the number of pages reaches double digits--from ten through ninety-nine--leading zeros must be introduced so that all file names begin with two digits: for example, 01.png, 02.png, and so forth, up through 10.png, 11.png, and beyond. The same formula holds--God forbid, since the task would become quite tedious--for articles whose page count reaches triple digits.
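If, as happened to me, you only realize after the fact that padding is needed, a short shell loop can retrofit the zeros. This is just a minimal sketch, assuming the files were initially saved under bare numerical names like 1.png, 2.png, and so forth:

for f in *.png; do
  n="${f%.png}"                        # strip the extension, leaving the bare number
  mv "$f" "$(printf '%02d.png' "$n")"  # rename to the zero-padded form: 01.png, 02.png, ...
done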
Provided imagemagick is already installed, once the saving is done, the very simple command convert *.png full-article.pdf can be used to produce the pdf of concatenated image files. Since the file names are numerically ordered, the shell will expand the glob and hand the files to convert in the proper order.
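Before running the conversion, it doesn't hurt to confirm that order:

ls -1 *.png                      # should list 01.png, 02.png, ... in page order
convert *.png full-article.pdf   # concatenates the images into a single multi-page pdf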
In the next installment I will be covering manipulation of pdfs provided through inter-library loan services--focusing on removal of extraneous pages (e.g., copyright-notice pages) routinely included by those services.
The Field notes of an Audacious Amateur series is offered in the spirit of the open source movement. While the concept of open source is typically associated with computer programmers, there is a growing body of those who don't know jack about programming, but who nevertheless use the creations of open source programmers. . . .
This series is written by a representative of the latter group, which is composed mostly of what might be called "productivity users" (perhaps "tinkerly productivity users"?). Though my lack of training precludes me from writing code or improving anyone else's, I can, nonetheless, try to figure out creative ways of utilizing open source programs. By the same token, though I may be capable of deploying open source programs in creative ways, my modest technical acumen hinders me from utilizing them in optimal ways. The open-source character of this series, then, consists in my presenting to the community of open source users and programmers my own crude and halting attempts at accomplishing computing tasks, in the hope that those more knowledgeable than I can offer advice, alternatives, and corrections. The desired end result is the discovery, through a communal process, of optimal and/or alternate ways of accomplishing the sorts of tasks that I and other open source productivity users need to perform.
Saturday, April 12, 2014
Miscellaneous Friday quickies: crop pdf margins with psnup
As sometimes happens, I recently needed to print off a portion of a public-domain work that's been scanned to portable document format and made available for download via Google Books. As it turned out, the original book had pages with fairly wide margins; when that sort of scanned book gets printed on letter-sized paper, you end up with a smallish text-box in the middle of a page that has something like two-inch margins. That makes the text harder to read, because the font ends up being relatively small, and it also wastes a lot of paper.
What to do, then, to make the text larger and make it occupy more of the page? I used three utilities for this job: xpdf to excise the target pages into a postscript file, psnup to enlarge the text and crop margins, and ps2pdf to convert the modified pages back to a pdf file. psnup is part of the psutils package, while ps2pdf relies on Ghostscript. A description of the steps I took follows.
With the pdf-viewer xpdf, the print dialog offers two options: print the document to a physical printer, or print to a postscript file. Both options allow a page range to be stipulated. That's how I created a postscript file from the target pages.
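As an aside, those who prefer to avoid the GUI could, I believe, accomplish the same extraction with the pdftops utility (distributed with xpdf, and also found in the poppler-utils package), whose -f and -l switches specify the first and last pages to extract. The page numbers and file names below are, of course, hypothetical:

pdftops -f 42 -l 57 scanned-book.pdf target-pages.ps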
Next, psnup--a really handy tool I've used previously for creating booklets (more about that in a future entry), but which I'd never considered might perform a job like this--was used to reduce margins, which had the added effect of increasing the text size. The command I used, which I appropriated from here, looks like this:
psnup -1 -b-200 infile.ps file_zoomed.ps
The crucial switch here seems to be -b (which is short for borders) followed by a negative numeral. Of course the numeral 200 as seen in this example will need to be modified to suit your circumstances.
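Since finding the right value involves some trial and error, one time-saving approach--just a sketch, reusing the file names from the command above--is to generate several candidates at once and compare them in a viewer:

for b in 100 150 200 250; do
  psnup -1 -b-"$b" infile.ps "zoomed_$b.ps"  # each negative value trims a different amount of margin
done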
The final step--converting file_zoomed.ps to a pdf using ps2pdf--was the simplest of all; in fact it's so simple that it hardly needs description. I hope this brief account may be an aid to others wishing to execute a task like this.
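For completeness, though, here it is; in its simplest form, ps2pdf needs only the input file name, the output name being optional:

ps2pdf file_zoomed.ps file_zoomed.pdf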
Finally, I ran across some other related information of interest while researching how to crop pdf margins. Here, for example, is a page that describes a few other, more technically involved ways to reduce margins: http://ma.juii.net/blog/scale-page-content-of-pdf-files. This posting on the Ubuntu forums has a really interesting script someone put together for making pdf booklets: http://ubuntuforums.org/showthread.php?t=1148764. And this one offers some clever tips on using another psutils utility--psbook--for making booklets: http://www.peppertop.com/blog/?p=35. Then, there's what might be the most straightforward utility of all for cropping pdf margins--pdfcrop. Since I could not get it to work for me in this instance, I looked for and found the alternative described above.
MUCH LATER EDIT: Having done some further experimentation with pdfcrop, I've discovered an important element: using negative values to specify the margin widths. Doing this, I was able to trim margins to my satisfaction with this tool. A command such as pdfcrop --margins '-5 -25 -5 -40' MyDoc.pdf MyDoc-cropped.pdf ended up doing the trick for me quite nicely. (The four values correspond, if I understand the tool correctly, to the left, top, right, and bottom margins, respectively.)