Number 310 - March 2009

Cut It Out
by Bob Hawes (bob hawes@acgnj.org), ACGNJ News January 2009


   Literally, this article ought to be called Copy It Out, but that just doesn't have the same Deja-Vu-ish ring to it. (I like using clichés in my titles almost as much as I like using silly jokes). Anyway, this article was inspired by a fellow ACGNJ member. At one of our Special Interest Group meetings, he demonstrated a "new" feature he'd found in newsletter PDF files that were posted on the Internet by another club. On our club computer, he highlighted text in one of those files, and pasted it into a new file that he'd created. He then criticized us because he couldn't do the same thing from our newsletters.

   I was amazed, astonished and confused. I'd always thought that paying Adobe enormous piles of money for the full version of Acrobat was the only way to get any information out of a PDF file. When I got home, I went to that other club's website, and discovered that (after jumping through certain prerequisite hoops) I could do the same thing. Intrigued, I called up one of our recent newsletters, and found that I could do it there, too.

   It seems that I've been the victim of a self-fulfilling prophecy. Because I assumed that I couldn't copy information out of a PDF file, I never even looked to see if I could. (Apparently, my aforementioned fellow member also fell into the same trap, at least as far as ACGNJ newsletters went). Wondering just how long this had been going on, I began a systematic investigation. It turns out that you can do this in any PDF file that wasn't scanned in from already printed pages. For our newsletter, that means any file from January 1998 onward except for the June 1998 and November 2000 issues (and mysteriously including the July 1995, August 1995 and January 1996 issues).

   I'm sure that some of you figured out this technique by yourselves a long time ago; but I'm also certain that many of my readers followed the same prevailing common wisdom as I did, and never even tried. Now, I'm going to reveal the "secret" to everyone. We'll look at four experimental subjects: Acrobat Reader 5.0 (from before Adobe changed the name) under Windows XP, Adobe Reader 6.0 under Windows 98 SE, Adobe Reader 7.0 under Windows XP, and Foxit Reader 2.3 under Windows XP. (I never use any Adobe versions above 7. They have too many "phone home" capabilities for my taste; and sorry, no Linux programs this time).

   I briefly mentioned Foxit in my November 2008 article "Turkey with Gremlins", giving it only a so-so review. However, it's since become my preferred PDF reader. It's not perfect, but I like it better than the Adobe products. One thing I particularly didn't like at first was the advertisement in the upper right hand corner. (Actually, the Adobe products each have somewhat similar "features" in just about the same places, but theirs aren't nearly as annoying). In a sort-of halfway nice-guy gesture, Foxit provides a way for users to turn their ad off. If you select the "View" menu from the Menu Bar, one of the nineteen options is Advertisement, which comes up checked by default. Simply uncheck it, and the ad goes away; but the next time you start Foxit, it's back. There's no way to save your settings and keep it turned off permanently. I ran Foxit from the XP Command Prompt with the "/?" switch added, thinking that they might have stuck in a special "off" option for knowledgeable users. Unfortunately, none of the seven "Usage" switches listed in the little window that popped up concerned the advertisement. So much for niceness. Strangely (though maybe it's just from familiarity), ever since I found out that I could, in fact, temporarily turn off the ad any time I want, I usually don't bother.

   Personally, I have two programs capable of making PDF files: Writer (the word processor component of OpenOffice.org 2.2.0) and Scribus 1.3.3.12 (a free and very, very good open source desktop publisher). For my first two experimental sources, I chose a file produced by each program, both containing combinations of text and images. For my third source, I chose one of our newsletter PDF files (created by Corel Ventura Publisher 5). For my destinations, I pasted text into TXT files opened in Microsoft® Notepad; and I pasted images into various file formats (BMP, GIF, JPG, PNG, etc.) opened in Microsoft® Paint. (XP specified Version 5.1 for both programs. Windows 98 SE didn't list version numbers for either). You might guess that I wouldn't have gone into that much detail if my experiments hadn't been mostly successful; and you'd be right. With the exceptions to be noted further below, I was able to copy both text and graphics out of all three experimental source files, using all four experimental subject programs.

   Here's the "secret": Each program starts up with the mouse pointer set to the "Hand" tool by default. To get stuff out, you need to change tools. We'll examine our subjects by seniority, oldest first. Acrobat Reader 5.0 has a Text Select Tool and a Graphics Select Tool, located only as icons on its Tool Bar. On the Adobe Reader 6.0 Menu Bar, if you select the Tools menu, and then the Basic sub-menu, you'll find Select Text and Select Image options. By default (although it's removable), there is also a Select Text icon on the Tool Bar. Plus, if you click on the drop-arrow next to that Select Text icon, you'll get the same Select Text and Select Image options that you got from the Menu Bar. On the Adobe Reader 7.0 Menu Bar, if you select Tools, then Basic, you'll find a single Select option that handles both text and images. There's also a single Select icon on the Tool Bar. All three Adobe products passed all of my tests with flying colors. In addition to complete images, with all three you can copy just parts of images, too.

   Now for Foxit. If you click the Tools menu on the Menu Bar, you'll find a Select Text option. There's also a Select Text icon on the Tool Bar. Foxit Reader 2.3 passed all of my text tests just fine. However, it contains absolutely no graphics copying capabilities. To work with images, you have to download the evaluation version of Foxit PDF Editor 2.0. This is not an intuitively obvious user-friendly program (at least as far as my intuition is concerned). However, I was eventually able to figure out how to copy a whole image into the clipboard. From there, it was a snap to paste it into Paint. (Maybe I'll also figure out how to do partial images someday, but not yet).


   Every time you start Foxit PDF Editor, you get a message stating that you can use all of its features; but if you change and then save a PDF document, an evaluation mark will be put on every page that you modified. To test this, I deleted an image from one of my expendable PDF test files, and saved the file. Then, when I viewed that file with Foxit Reader, I found the following message (in red letters) in the upper right hand corner, just like they promised:

   Edited by Foxit PDF Editor

   For Evaluation Only.

   Copyright (c) by Foxit Software Company, 2004-2007

   Going back to the image I'd pasted into Paint, I saved that file after adding a tiny black rectangle to one side (as my own mark). Then, when I opened it in Windows Picture and Fax Viewer, I saw my pasted image, my little black rectangle, and nothing else. So I can use my evaluation version of Foxit PDF Editor to copy images out of PDF files if I need to, without any penalties. In addition, it's supposedly an exceptional PDF editor. Maybe, given time and practice, I'll learn how to use it properly. Someday, it might even be worth my while to actually pay them for the full version. Unfortunately, my current experience leads me to the conclusion that I can't recommend it to anyone else. Therefore, I haven't included any instructions (based on my woefully incomplete knowledge of how to use it) here.

   I don't know whether they just haven't yet gotten around to adding the image selection feature to Foxit Reader, or if they left it out on purpose. Whatever the case, this lack (and that annoying ad) keeps me from giving Foxit Reader 2.3 the high recommendation that it would otherwise deserve.

   Some final caveats: There are definite disadvantages to using a text file as your destination. However, I think that they are much preferable to the damage that could be caused by copying who-knows-what hidden control codes directly into a word processing document. I follow this same rule when copying text from the Internet. Here's what happened one time when I didn't.

   It was late, I was doing a read-through of a draft newsletter article, and my deadline was "breathing down my neck". I decided that one section was a little weak, and thus required punching up. I already knew exactly where the information I needed could be found on the Internet, so I went there, located what I wanted, copied it, and then pasted it directly into my DOC file. The trouble was, absolutely nothing appeared on my screen. Thinking I'd messed up the copy process, I went back and did it again; and I got exactly the same result: nothing. After a few more frustrating failures, it finally occurred to me that the web site that I was trying to copy from was displayed in white letters on a black background. All of my copy operations had been successful; but because I'd been pasting white text onto a white background, I just couldn't see any of it.

   Now, that was just an innocuous coloring error. Have you ever looked at the complete contents of an HTML file that was created by Microsoft Word? I'd say that less than 10% of it is data (legible characters that would actually be displayed on the computer screen). At most, a further 1% consists of required HTML commands. The rest is made up of ridiculously over-elaborate and absolutely unnecessary code. Who knows what unanticipated consequences might occur if random, incomplete sections of such code were to be copied into another word processor document? I certainly don't want to find out.
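   For anyone curious enough to check that kind of ratio on a file of their own, here's a minimal Python sketch that estimates how much of an HTML file is displayable text. It's only an illustration: the names VisibleTextCounter, visible_text_ratio and SAMPLE_HTML are made up for this example, the sample is a tiny hand-written snippet in the general style of word processor output (not a real Word file), and the "visible" test is a rough assumption that simply skips anything inside style, script and title tags.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters a browser would roughly display (hypothetical helper)."""

    # Assumption: text inside these tags is never shown on the page itself.
    SKIPPED = {"style", "script", "title"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if self.skip_depth == 0:
            self.visible_chars += len(data.strip())


def visible_text_ratio(html: str) -> float:
    """Return visible characters divided by total characters in the file."""
    if not html:
        return 0.0
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars / len(html)


# Hand-written stand-in for word-processor-style HTML (not real Word output).
SAMPLE_HTML = (
    "<html><head><style>p.MsoNormal {margin:0in; font-size:12.0pt;}</style></head>"
    '<body><p class="MsoNormal"><span style="font-size:12.0pt">Hello</span></p>'
    "</body></html>"
)

if __name__ == "__main__":
    print(f"Visible text: {visible_text_ratio(SAMPLE_HTML):.1%} of the file")
```

   To try it on a real file, read the file's contents into a string and pass that string to visible_text_ratio. The number you get will depend heavily on the document, so treat it as a rough gauge rather than a measurement.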

   Even when using a text file as the destination for a PDF copy, you should be careful. There are strange things going on "behind the scenes" in PDF files that may affect the text that you're trying to copy. Sometimes, blocks of text that you highlighted in the source file will be copied to your destination file in a different order. Sometimes, text in the source which seems to be contiguous just flat out refuses to be highlighted for selection all at once. (In that case, get what you can the first time, then go back for the rest. You'll have to edit everything back together later). Finally, sometimes the text selection tool follows a sort-of "rectangular" rule (as if you were laying out some kind of "highlighting" frame on top of your source document). If this happens, and you accidentally stop your selection box just a tiny little bit short, it's possible to lose the first and/or last few letters from each and every line in all of your paragraphs. (Acrobat Reader 5.0 had a Column Select Tool which could do this anywhere, anytime, on purpose. That particular idea doesn't seem to have been carried over into Adobe Reader 6.0 or 7.0).

   Here are some other things that happened during several of my tests: First, of course, effects like bold, italics, underlining, strikeout and colors were lost. (You have to expect that when pasting into a text file). Blank lines between paragraphs were NOT copied (they just vanished). Occasionally, a single space that should have been between two words disappeared as well. What would have been "soft" returns in the document were copied as "hard" returns in the text file. "Straight-up" quotation marks (ASCII Code 34 in decimal notation, where there's only one symbol displayed for both opening and closing quotes) were copied correctly. The more elaborate "curved" quotation marks used by word processors (which use two different symbols that look like they were both made from double commas, with the opening pair seeming to be upside-down) were displayed by Windows 98's Notepad as little black boxes. (Those are Extended ASCII Codes 147 and 148 in decimal notation, at least in the Windows-1252 character set. Plain-vanilla text editors often see them as unprintable characters). However, the codes themselves weren't changed, because they displayed correctly when later copied into a word processor file.
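   If you'd like to see those quotation mark codes for yourself, here's a minimal Python sketch. It assumes that the pasted text was saved in the Windows-1252 encoding (which Python calls "cp1252"); that's a guess about how a given text file was saved, not something every file will satisfy. The function names decode_pasted_bytes and straighten_quotes are made up for this example. The sketch decodes bytes 147 and 148 into the two "curved" quote characters, then swaps each one for the plain "straight-up" quote (code 34).

```python
# Map the two "curved" double quotes to the plain ASCII quote (code 34).
CURLY_TO_STRAIGHT = {
    "\u201c": '"',  # opening curved quote, byte 147 in Windows-1252
    "\u201d": '"',  # closing curved quote, byte 148 in Windows-1252
}


def decode_pasted_bytes(raw: bytes) -> str:
    """Decode raw file bytes, assuming they were saved as Windows-1252."""
    return raw.decode("cp1252")


def straighten_quotes(text: str) -> str:
    """Replace curved double quotes with the straight ASCII quote."""
    return text.translate(str.maketrans(CURLY_TO_STRAIGHT))


if __name__ == "__main__":
    # Bytes 147 and 148 wrapped around some ordinary ASCII text.
    raw = bytes([147]) + b"Cut It Out" + bytes([148])
    text = decode_pasted_bytes(raw)
    print([ord(ch) for ch in (text[0], text[-1])])  # Unicode code points
    print(straighten_quotes(text))
```

   Only the double quotes are handled here. Curved single quotes, long dashes and the like live in the same range of codes, and could be added to the table in the same way.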

   Isn't it fun to learn something new?