Number 219 - August 2001

Computer Files & Formats

by Alex Dumestre, 1960 PC Users Group, Jan 2001 Newsletter, Houston, TX

Here it is, three days
before the deadline, and I still haven't come up with a real jazzy
subject. I've whined in the past about the fact that choosing a subject
is the most difficult part of trying to write a PC News article each
month. I had sort of intended to write about my long delayed, but
recently accomplished, purchase of a CD-RW drive for my computer. The
trouble is I haven't used it enough to write knowledgeably about it yet.
Maybe I will in a month or two.

Luckily for me (but perhaps unluckily for you), something happened at a SIG a few nights ago that told me that an oft-covered subject requires yet another visit. A good friend of mine came up to me as the meeting was breaking up and, sort of apologetically, asked a question. It went something like this:

Good Friend: "I thought all data that is stored on a computer is digital. Isn't that right?"
Me: "That's right."
GF: "Well, what about these wave files that music is stored in? That's different, isn't it? Sounds like it has something to do with sine waves."
Me: "No, that's digital too."

Note: you can tell that this is not a true and accurate transcription of the conversation because, in real life, my answers are never shorter than the questions!

I thought I saw the basic misunderstanding that was troubling him. A common type of audio file on a PC is the .wav file, pronounced "wave". This conjures up visions of sound waves as seen on an oscilloscope, and that is suspiciously analog. A few more sentences showed that he was comfortable with the fact that graphic files (e.g., a JPEG containing a color photograph) are digital: they store a series of numbers that are codes for specific colors, arranged as rows and columns of pixels. When arranged on the screen (or printed page) these tiny blobs of color form a picture. If the blobs (pixels) are small enough and placed close enough together, they can form a very clear and sharp picture.

Likewise, a number in a file could also define the height (or amplitude) of a sound wave at some instant in time. If the sound wave is sampled at numerous successive instants, then the series of numerical values can be used to mimic the shape of the sound wave. Figure 1* shows what a digitized waveform might look like. If the samples are close enough together in time, then the series of numbers is such a precise representation of the original sound wave that it is "hi-fi". This sounded perfectly reasonable to my friend.
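
For readers who like to see the numbers, here is a minimal Python sketch of the sampling idea just described. It is only an illustration under stated assumptions: the "sound" is a pure sine wave, the function name `sample_wave` is made up for this example, and the amplitude range of -127 to 127 is an arbitrary choice. It does not show how a real .wav file is laid out.

```python
import math

def sample_wave(freq_hz, sample_rate_hz, n_samples, max_amplitude=127):
    """Return n_samples whole-number heights of a sine wave.

    Hypothetical helper for illustration: each sample is the height of
    the wave at one instant, rounded to an integer a file could store.
    """
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # time of this sample, in seconds
        height = math.sin(2 * math.pi * freq_hz * t)  # wave height, between -1 and 1
        samples.append(round(height * max_amplitude))
    return samples

# One full cycle of a 1 Hz wave, sampled only 8 times: a coarse picture.
coarse = sample_wave(freq_hz=1, sample_rate_hz=8, n_samples=8)
print(coarse)  # [0, 90, 127, 90, 0, -90, -127, -90]

# The same cycle sampled 64 times follows the curve much more closely.
fine = sample_wave(freq_hz=1, sample_rate_hz=64, n_samples=64)
print(len(fine), "samples in the finer version")
```

The list of integers is all that needs to be stored. Raising the number of samples per second is the audio equivalent of making the pixels smaller.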

Still, I suspect there was a lingering but unvoiced question: "If a string of numbers in a file sometimes represents text, sometimes represents a picture and sometimes represents sound, how does the computer know which is which?"

File Formats

This throws us right into the realm of file formats. This is a subject that confuses, bores or repels many computer users. It is viewed as complex and suitable only for the geekiest of geeks. I agree with this if we are talking about a deep understanding of the details of formats, but a superficial understanding is not only within the ability and tolerance level of all of us, it is generally sufficient to remove the veil of mystery and magic from some computer operations. I've made several previous attempts at articles about formats, but with less than total success. Here is one more try at a very basic level.

Let's build up to this slowly. Assume that I have a small file that contains only this string of numbers: 0651081011206012302814442040. I've written them here as decimal digits, but in reality the file would hold a much longer string of binary digits, 0s and 1s. So, what does this file mean? Only the person (or program) that wrote it knows, but if you knew the format of the file then you too would know how to read it.

Example 1. (Format #1)
First 12 digits are four ASCII characters, 3 digits each.
Next 3 digits represent height: 1 digit for feet followed by 2 digits for inches.
Next 3 digits are weight in pounds.
Last 10 digits are a telephone number.

The only part that is probably confusing is the one dealing with the 4 ASCII characters. ASCII (American Standard Code for Information Interchange) is a numeric code for characters that was published by a standards organization back in the 1960s and was used on teletypes as well as computers. I covered it in some detail in my article in the September 1999 PC News(1). That article, as well as articles in the May, June and October issues of that year, touches on other aspects of text and graphic file formats(2).
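
Format #1 above is simple enough to sketch as a few lines of Python. This is a hedged illustration, not a real program from the article: the function name `parse_format1` and the dictionary keys are invented here, and the digits are held in a text string for readability rather than in binary.

```python
# The example file's contents, written as decimal digits.
DATA = "0651081011206012302814442040"

def parse_format1(digits):
    """Split a 28-digit string according to Format #1 (hypothetical helper)."""
    # First 12 digits: four ASCII codes of three digits each.
    name = "".join(chr(int(digits[i:i + 3])) for i in range(0, 12, 3))
    feet = int(digits[12])          # 1 digit for feet
    inches = int(digits[13:15])     # 2 digits for inches
    weight_lb = int(digits[15:18])  # 3 digits for weight in pounds
    phone = digits[18:28]           # 10 digits, kept as text
    return {
        "name": name,
        "feet": feet,
        "inches": inches,
        "weight_lb": weight_lb,
        "phone": phone,
    }

record = parse_format1(DATA)
print(record)
```

Python's built-in `chr` does the ASCII lookup, so no table is needed. The phone number stays a string because nobody does arithmetic on it.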

You can look up the 4 codes used here or you can take my word for it: 065=A, 108=l, 101=e, and 120=x. With this info, it is easy to see that Format #1 leads us to the interpretation below. I've placed spaces to separate the digits into groups, but that is for your reading convenience and the spaces are not actually part of the example file (remember, we are attempting to show that all files are nothing but numbers).

065 108 101 120 6 01 230 2814442040

"Alex" is 6'1", weighs 230 lbs. and his phone number is (281)444-2040.

Example 2. (Format #2)
Just to drive home the point that data in a file means only what a format says it means, I will make up a different format:
First 6 digits are a product code number.
Next 4 digits are the number of items on hand.
Next 3 digits are the manufacturer code.
Next 6 digits are the storage bin location in the warehouse, 2 digits each for aisle, rack and shelf.
Next 2 digits are the item weight in oz.
Next 2 digits are the reorder level.
Last 5 digits are an older product code.

Using this definition leads us to the following interpretation:

065108 1011 206 012302 81 44 42040

Product #65108, 1011 on hand, from manufacturer #206, located in aisle 1, rack 23, shelf 2; each unit weighs 81 oz., we should reorder when the number on hand drops to 44, and the product number in the previous system was 42040. (OK, this isn't a very convincing example, so shoot me!)

Wow! Exactly the same data -- different format, different meaning. How is the computer to know how to interpret the data? One way would be to identify the type of file as part of the file name. For instance, we could name the file in Example 1 something like "personal.fm1" and the file in Example 2 "inventory.fm2". Assuming that we had set up our computer's file association table properly, clicking on "personal.fm1" would kick off our hypothetical program PersonalData.exe, which reads and writes "fm1" files. Clicking on "inventory.fm2" would launch WidgitInventory.exe. This illustrates why you can use an arbitrary string for the file's first name (e.g., "inventory") but you had better be careful before changing the file extension (e.g., ".fm2").

We might refer to each of the two example formats above as unpublished proprietary formats. They were both designed for use by a particular program and are not publicly documented. The computing community knows nothing about them and doesn't particularly care.

Contrast this with published proprietary formats. These are created by a person or company for use by a particular program. Their formats are widely published but have not been sanctioned by a recognized standards organization. Let's take as an example of this type the format output by the WordPerfect word processor program, which uses the extension ".WPD". The programmers of WordPerfect were free to make up the format to suit their own needs and desires (just as we were free to make up ".fm2"). This format was designed to be written and read by WordPerfect, but there is one other requirement that comes into play here.
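
Before moving on, the two made-up formats and the file-association idea can be tied together in one short Python sketch. Everything named here is an assumption for illustration only: the functions `parse_fm1`, `parse_fm2` and `open_by_extension`, and the `ASSOCIATIONS` dictionary that stands in for the computer's file association table (the two parsers play the roles of PersonalData.exe and WidgitInventory.exe).

```python
import os

# The same 28 digits serve as the contents of both example files.
DATA = "0651081011206012302814442040"

def parse_fm1(d):
    """Format #1: name, height, weight, phone (hypothetical reader)."""
    name = "".join(chr(int(d[i:i + 3])) for i in range(0, 12, 3))
    return {"name": name, "feet": int(d[12]), "inches": int(d[13:15]),
            "weight_lb": int(d[15:18]), "phone": d[18:28]}

def parse_fm2(d):
    """Format #2: warehouse inventory record (hypothetical reader)."""
    return {"product": int(d[0:6]), "on_hand": int(d[6:10]),
            "manufacturer": int(d[10:13]), "aisle": int(d[13:15]),
            "rack": int(d[15:17]), "shelf": int(d[17:19]),
            "weight_oz": int(d[19:21]), "reorder_level": int(d[21:23]),
            "old_product": int(d[23:28])}

# Stand-in for the file association table: extension -> reader.
ASSOCIATIONS = {".fm1": parse_fm1, ".fm2": parse_fm2}

def open_by_extension(filename, digits):
    """Pick a reader from the file extension, as a double-click would."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        reader = ASSOCIATIONS[ext]
    except KeyError:
        raise ValueError(f"no program is associated with {ext!r}") from None
    return reader(digits)

print(open_by_extension("personal.fm1", DATA))
print(open_by_extension("inventory.fm2", DATA))
```

Renaming "inventory.fm2" to "inventory.fm1" would not raise an error in this sketch. The digits would simply be read as a person's record, which is exactly why changing an extension carelessly is risky.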

Users of MS Word would be greatly annoyed if they were unable to access WPD files, and users of WP would be just as annoyed if they were unable to access .DOC files produced by Word. What is the solution? The makers of WP publish the details of the WPD format so that the programmers of Word can write a translator program (or filter) that reads WPD files and opens them in Word. The same is true in the other direction. These translators or filters are not always perfect and may not be able to handle all of the bells and whistles of the very latest version of the competitor's files. Still, they allow some degree of conversion between two different proprietary formats.

Sometimes it is very important that programs from various vendors be able to read and write the very same formats. This is where a standards organization comes into play. Good examples of these types of formats might be the graphics formats JPEG (.jpg) and TIFF (.tif). JPEG was designed from scratch and published by the Joint Photographic Experts Group. These formats tend to be very stable and are modified only infrequently. JPEG, for instance, has been around since 1988 with only a few compatible updates and extensions. JPEG2000, a major revision, will be released within a few months and looks as if it will be a spectacular improvement in terms of both quality and small file size. It is such a large change that it has to be viewed as an entirely new format and will have its own extension of ".JP2".

Analog vs. Digital

One last subject must be mentioned. I have a microphone attached to my computer sound card. I also have speakers attached to the sound card. Both of these are analog devices, in that when I speak into the microphone it outputs a varying voltage signal that mimics (is the analog of) the varying air pressure waves output by my vocal cords. But wait a minute. Computers don't understand analog signals! Computers don't, but certain hardware items attached to them do.
In particular, the sound card I have plugged into my computer receives analog signals through the microphone connector and converts them to digital. With a straightforwardness of nomenclature that is a little frightening to computer geeks, this circuitry is called an Analog-to-Digital Converter (ADC). Likewise, when my computer reads a digital sound file it sends the digital numbers to the sound card, which also contains, you guessed it, a Digital-to-Analog Converter (DAC). The resulting analog signal (a varying voltage wave) then goes to the speakers. So you see, the computer is pure and only the sound card deals with analog in this example. There are several other computer accessories that have these ADC and DAC circuits built into them. Scanners and modems are examples.

* There was no Figure 1 with the original article on the web.

1 (You may find an ASCII table on the Web at http://telecom.tbi.net/ascii7e.html or www.asciitable.com. -Ed)

2 If today's article whips you into a passion for learning more about formats, you can find these earlier articles on the club's Web site. Just go to www.1960pcug.org, click the "PC News Magazine" button, and then click on the desired issue date. This brings up the table of contents for that issue. Click on the article name to see the entire text. Every PC News article since October 1996 is available.

Alex Dumestre has been associated with computers since the mid-'60s, most of the time developing geophysical applications for use on mainframes, minicomputers, and workstations. He is a bit of a nut about graphics but is a perpetual novice on PCs. He is a member of the 1960 PC Users Group and can be contacted by e-mail at DumestreA@PDQ.net.