Literally, this
article ought to be called Copy It Out, but that just doesn't have the
same Deja-Vu-ish ring to it. (I like using clichés in my titles almost
as much as I like using silly jokes). Anyway, this article was inspired
by a fellow ACGNJ member. At one of our Special Interest Group meetings,
he demonstrated a "new" feature he'd found in newsletter PDF files that
were posted on the Internet by another club. On our club computer, he
highlighted text in one of those files, and pasted it into a new file
that he'd created. He then criticized us because he couldn't do the same
thing from our newsletters.
I was amazed, astonished and confused. I'd
always thought that paying Adobe enormous piles of money for the full
version of Acrobat was the only way to get any information out of a PDF
file. When I got home, I went to that other club's website, and
discovered that (after jumping through certain prerequisite hoops) I
could do the same thing. Intrigued, I called up one of our recent
newsletters, and found that I could do it there, too.
It seems that I've been the victim of a
self-fulfilling prophecy. Because I assumed that I couldn't copy
information out of a PDF file, I never even looked to see if I could.
(Apparently, my aforementioned fellow member also fell into the same
trap, at least as far as ACGNJ newsletters went). Wondering just how
long this had been going on, I began a systematic investigation. It
turns out that you can do this in any PDF file that wasn't scanned in
from already printed pages. For our newsletter, that means any file from
January 1998 onward except for the June 1998 and November 2000 issues
(and mysteriously including the July 1995, August 1995 and January 1996
issues).
I'm sure that some of you figured out this
technique by yourselves a long time ago; but I'm also certain that many
of my readers followed the same prevailing common wisdom as I did, and
never even tried. Now, I'm going to reveal the "secret" to everyone.
We'll look at four experimental subjects: Acrobat Reader 5.0 (from
before Adobe changed the name) under Windows XP, Adobe Reader 6.0 under
Windows 98 SE, Adobe Reader 7.0 under Windows XP, and Foxit Reader 2.3
under Windows XP. (I never use any Adobe versions above 7. They have too
many "phone home" capabilities for my taste; and sorry, no Linux
programs this time).
I briefly mentioned Foxit in my November 2008
article "Turkey with Gremlins", giving it only a so-so review. However,
it's since become my preferred PDF reader. It's not perfect, but I like
it better than the Adobe products. One thing I particularly didn't like
at first was the advertisement in the upper right hand corner.
(Actually, the Adobe products each have somewhat similar "features" in
just about the same places, but theirs aren't nearly as annoying). In a
sort-of halfway nice-guy gesture, Foxit provides a way for users to turn
their ad off. If you select the "View" menu from the Menu Bar, one of
the nineteen options is Advertisement, which comes up checked by
default. Simply uncheck it, and the ad goes away; but the next time you
start Foxit, it's back. There's no way to save your settings and keep it
turned off permanently. I ran Foxit from the XP Command Prompt with the
"/?" switch added, thinking that they might have stuck in a special
"off" option for knowledgeable users. Unfortunately, none of the seven
"Usage" switches listed in the little window that popped up concerned
the advertisement. So much for niceness. Strangely (though maybe it's
just from familiarity), ever since I found out that I could, in fact,
temporarily turn off the ad any time I want, I usually don't bother.
Personally, I have two programs capable of
making PDF files: Writer (the word processor component of OpenOffice.org
2.2.0) and Scribus 1.3.3.12 (a free and very, very good open source
desktop publisher). For my first two experimental sources, I chose a
file produced by each program, both containing combinations of text and
images. For my third source, I chose one of our newsletter PDF files
(created by Corel Ventura Publisher 5). For my destinations, I pasted
text into TXT files opened in Microsoft Notepad; and I pasted images
into various file formats (BMP, GIF, JPG, PNG, etc.) opened in
Microsoft Paint. (XP specified Version 5.1 for both programs. Windows
98 SE didn't list version numbers for either). You might guess that I
wouldn't have gone into that much detail if my experiments hadn't been
mostly successful; and you'd be right. With the exceptions to be noted
further below, I was able to copy both text and graphics out of all
three experimental source files, using all four experimental subject
programs.
Here's the "secret": Each program starts up
with the mouse pointer set to the "Hand" tool by default. To get stuff
out, you need to change tools. We'll examine our subjects by seniority,
oldest first. Acrobat Reader 5.0 has a Text Select Tool and a Graphics
Select Tool, located only as icons on its Tool Bar. On the Adobe Reader
6.0 Menu Bar, if you select the Tools menu, and then the Basic sub-menu,
you'll find Select Text and Select Image options. By default (although
it's removable), there is also a Select Text icon on the Tool Bar. Plus,
if you click on the drop-arrow next to that Select Text icon, you'll
get the same Select Text and Select Image options that you got from the
Menu Bar. On the Adobe Reader 7.0 Menu Bar, if you select Tools, then
Basic, you'll find a single Select option that handles both text and
images. There's also a single Select icon on the Tool Bar. All
three Adobe products passed all of my tests with flying colors. In
addition to complete images, with all three you can copy just parts of
images, too.
Now for Foxit. If you click the Tools menu on
the Menu Bar, you'll find a Select Text option. There's also a Select
Text icon on the Tool Bar. Foxit Reader 2.3 passed all of my text tests
just fine. However, it contains absolutely no graphics copying
capabilities. To work with images, you have to download the evaluation
version of Foxit PDF Editor 2.0. This is not an intuitively obvious
user-friendly program (at least as far as my intuition is concerned).
However, I was eventually able to figure out how to copy a whole image
into the clipboard. From there, it was a snap to paste it into Paint.
(Maybe I'll also figure out how to do partial images someday, but not
yet).
Every time you
start Foxit PDF Editor, you get a message stating that you can use all
of its features; but if you change and then save a PDF document, an
evaluation mark will be put on every page that you modified. To test
this, I deleted an image from one of my expendable PDF test files, and
saved the file. Then, when I viewed that file with Foxit Reader, I found
the following message (in red letters) in the upper right hand corner,
just like they promised:
Edited by Foxit PDF Editor
For Evaluation Only.
Copyright (c) by Foxit Software Company, 2004-2007
Going back to the image I'd pasted into Paint,
I saved that file after adding a tiny black rectangle to one side (as
my own mark). Then, when I opened it in Windows Picture and Fax Viewer, I
saw my pasted image, my little black rectangle, and nothing else. So I
can use my evaluation version of Foxit PDF Editor to copy images out of
PDF files if I need to, without any penalties. In addition, it's
supposedly an exceptional PDF editor. Maybe, given time and practice,
I'll learn how to use it properly. Someday, it might even be worth my
while to actually pay them for the full version. Unfortunately, my
current experience leads me to the conclusion that I can't recommend it
to anyone else. Therefore, I haven't included any instructions (based on
my woefully incomplete knowledge of how to use it) here.
I don't know whether they just haven't yet
gotten around to adding the image selection feature to Foxit Reader, or
if they left it out on purpose. Whatever the case, this lack (and that
annoying ad) keeps me from giving Foxit Reader 2.3 the high
recommendation that it would otherwise deserve.
Some final caveats: There are definite
disadvantages to using a text file as your destination. However, I think
that they are much preferable to the damage that could be caused by
copying who-knows-what hidden control codes directly into a word
processing document. I follow this same rule when copying text from the
Internet. Here's what happened one time when I didn't.
It was late, I was doing a read-through of a
draft newsletter article, and my deadline was "breathing down my neck". I
decided that one section was a little weak, and thus required punching
up. I already knew exactly where the information I needed could be found
on the Internet, so I went there, located what I wanted, copied it, and
then pasted it directly into my DOC file. The trouble was, absolutely
nothing appeared on my screen. Thinking I'd messed up the copy process, I
went back and did it again; and I got exactly the same result: nothing.
After a few more frustrating failures, it finally occurred to me that
the web site that I was trying to copy from was displayed in white
letters on a black background. All of my copy operations had been
successful; but because I'd been pasting white text onto a white
background, I just couldn't see any of it.
Now, that was just an innocuous coloring
error. Have you ever looked at the complete contents of an HTML file
that was created by Microsoft Word? I'd say that less than 10% of it is
data (legible characters that would actually be displayed on the
computer screen). At most, a further 1% consists of required HTML
commands. The rest is made up of ridiculously over-elaborate and
absolutely unnecessary code. Who knows what unanticipated consequences
might occur if random, incomplete sections of such code were to be
copied into another word processor document? I certainly don't want to
find out.
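To put a rough number on that "less than 10%" impression, here is a minimal Python sketch. It is only an illustration, not a tool that was used for this article. It assumes the page is already available as a string of HTML, and it counts only the characters a browser would actually display, ignoring tags, comments, and the contents of style and script blocks. The names VisibleText, extract_visible_text and visible_fraction are made up for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects only the text a browser would display (hypothetical helper)."""

    SKIPPED_TAGS = {"style", "script"}  # their contents are never shown

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <style> or <script> is code, not displayed data.
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_visible_text(html: str) -> str:
    """Return the displayed text with runs of whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def visible_fraction(html: str) -> float:
    """Fraction of the file's characters that are displayed text."""
    if not html:
        return 0.0
    return len(extract_visible_text(html)) / len(html)


if __name__ == "__main__":
    # A tiny stand-in for a word-processor export, not real Word output.
    sample = (
        '<html><head><style>p{color:red}</style></head>'
        '<body><p class="MsoNormal">Hello</p></body></html>'
    )
    print(extract_visible_text(sample))
    print(f"{visible_fraction(sample):.1%} of the file is visible text")
```

Running the same two functions on a real exported file (read in with open(...).read()) would give a figure to compare against my 10% guess. The extracted text is also the only part that is safe to paste anywhere.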
Even when using a text file as the destination
for a PDF copy, you should be careful. There are strange things going
on "behind the scenes" in PDF files that may affect the text that you're
trying to copy. Sometimes, blocks of text that you highlighted in the
source file will be copied to your destination file in a different
order. Sometimes, text in the source which seems to be contiguous just
flat out refuses to be highlighted for selection all at once. (In that
case, get what you can the first time, then go back for the rest. You'll
have to edit everything back together later). Finally, sometimes the
text selection tool follows a sort-of "rectangular" rule (as if you were
laying out some kind of "highlighting" frame on top of your source
document). If this happens, and you accidentally stop your selection box
just a tiny little bit short, it's possible to lose the first and/or
last few letters from each and every line in all of your paragraphs.
(Acrobat Reader 5.0 had a Column Select Tool which could do this
anywhere, anytime, on purpose. That particular idea doesn't seem to have
been carried over into Adobe Reader 6.0 or 7.0).
Here are some other things that happened
during several of my tests: First, of course, effects like bold,
italics, underlining, strikeout and colors were lost. (You have to
expect that when pasting into a text file). Blank lines between
paragraphs were NOT copied (they just vanished). Occasionally, a single
space that should have been between two words disappeared as well. What
would have been "soft" returns in the document were copied as "hard"
returns in the text file. "Straight-up" quotation marks (ASCII Code 34
in decimal notation, where there's only one symbol displayed for both
opening and closing quotes) were copied correctly. The more elaborate
"curved" quotation marks used by word processors (which use two
different symbols that look like they were both made from double commas,
with the opening pair seeming to be upside down) were displayed by
Windows 98's Notepad as little black boxes. (Those are Codes 147 and 148
in decimal notation in the Windows "ANSI" character set, which is often
loosely called Extended ASCII. Plain-vanilla text editors often see them
as unprintable characters). However, the codes themselves
weren't changed, because they displayed correctly when later copied into
a word processor file.
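For anyone who wants to check those code numbers, here is a minimal Python sketch. It assumes the pasted text was saved in the Windows-1252 code page (what Notepad calls "ANSI" on a Western-language Windows system), where bytes 147 and 148 are the opening and closing curved quotes. The function name straighten_quotes is made up for this example.

```python
# Bytes 147 and 148, wrapped around the two letters "Hi", as they would
# sit in a text file saved in the Windows-1252 ("ANSI") code page.
raw = bytes([147, 72, 105, 148])

# Decoding with cp1252 turns them into the curved quotation marks
# U+201C (opening) and U+201D (closing).
text = raw.decode("cp1252")


def straighten_quotes(s: str) -> str:
    """Replace curved double quotes with the plain ASCII quote (code 34)."""
    return s.replace("\u201c", '"').replace("\u201d", '"')


plain = straighten_quotes(text)

if __name__ == "__main__":
    print([ord(ch) for ch in text])   # code points after decoding
    print([ord(ch) for ch in plain])  # code points after straightening
```

Straightening the quotes is optional. As noted above, the original codes survive the trip through Notepad intact, so this only matters if the little black boxes bother you or the text is headed somewhere that expects plain ASCII.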
Isn't it fun to learn something new?