WEEK 8: DIGITIZATION
Here are some basic terms and concepts that you need to be aware of in digitization. Parts of this post are adapted from the University of Virginia’s Electronic Text Center Scanning Help Sheets.
Note: The best place to go for help with digitization is the Digital Studio on the 2nd floor of Bobst Library, where they have many “digital authoring” tools and where they will help you learn the software and make decisions about standards.
The gold standard of image capture and editing software is Photoshop. However, there are other image editing programs that can do many of the most common image editing tasks. You might want to download and try these:
* Paint.net for Windows
* Seashore for Mac
* GIMP for Mac (also available for Windows and Linux)
There’s also a “light” (and cheaper) version of Photoshop called Photoshop Elements or Photoshop LE.
Pixels, DPI, and Resolution
A “pixel,” or “dot,” is the atom of a digital image. The resolution of a digital image is measured in dots per inch (dpi). If you keep zooming in on a digital image, the image will eventually become “pixelated,” meaning that you can see the individual pixels.
The NINCH Guide to Good Practice recommends a minimum of 300 dpi for image scanning. At this resolution, images can be printed with little loss of quality. If you have enough space, you may want to scan at 600 dpi instead. Serious preservation scanning is done at 1200 dpi. Computer monitors have traditionally been assumed to display images at about 72 dpi (many current screens are denser), so an image that will only be displayed on screen (not printed) need not have a resolution much higher than that.
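To see what these resolutions mean in practice, here is a minimal Python sketch of the arithmetic. The function names and the 8 x 10 inch example page are illustrative assumptions, not something taken from the NINCH Guide, and the sizes are for raw, uncompressed 24-bit color.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a scan of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Estimate the raw (uncompressed) size of the scan in bytes."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Hypothetical 8 x 10 inch original, scanned at the resolutions named above.
    for dpi in (72, 300, 600, 1200):
        w, h = pixel_dimensions(8, 10, dpi)
        mb = uncompressed_bytes(8, 10, dpi) / 1_000_000
        print(f"{dpi:>4} dpi: {w} x {h} pixels, about {mb:.1f} MB uncompressed")
```

Doubling the dpi quadruples the number of pixels, which is why a 600 dpi scan needs roughly four times the storage of a 300 dpi scan of the same page.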
Images can be scanned with four basic levels of color information:
* 1-bit black and white — each dot can be either black or white; used for cartoons, line drawings, and to increase contrast on images of text.
* 8-bit greyscale — each dot can be one of 256 grey shades; used for ordinary (non-archival) scanning of what we normally call “black and white” images.
* 8-bit color — each dot can be one of 256 colors; 8-bit color can look a little grainy at times; it’s used mainly for color cartoons and “clip art.”
* 24-bit color — each dot can be one of 16.8 million colors; best used for photographs. There’s even 48-bit color, now, but many programs will not support it.
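The numbers of shades and colors in the list above all come from the same rule: each extra bit doubles the number of values a dot can take. A short Python sketch (the function name is just an illustrative choice):

```python
def distinct_values(bits_per_pixel):
    """Number of different values one dot can take at a given bit depth."""
    return 2 ** bits_per_pixel


if __name__ == "__main__":
    for bits in (1, 8, 24, 48):
        print(f"{bits:>2}-bit: {distinct_values(bits):,} possible values per dot")
```

For 24-bit color this gives 16,777,216, which is the "16.8 million" figure quoted above.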
File Formats, Loss, Losslessness, and Compression
* TIFF — a lossless format (usually stored uncompressed) that works on all platforms. TIFF files are suited to long-term archival use, but are usually too big to put on a web site.
* PNG — a compressed but still lossless format that works on all platforms; the best one for web delivery.
* JPG or JPEG — a highly compressed, lossy format. Unlike PNG, it discards some image information to make the file smaller.
* GIF — a moderately compressed format limited to 256 colors, so it is only suitable for 8-bit or 1-bit images (i.e., cartoons and logos).
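The difference between lossless and lossy storage can be shown with a toy Python sketch. It uses the standard-library zlib module (DEFLATE, the same compression family PNG relies on) for the lossless case. The "lossy" step here simply throws away the low bits of each pixel; that is a deliberately crude stand-in chosen for illustration and is not how JPEG actually works. The pixel values are made up.

```python
import zlib

# A made-up run of 8-bit greyscale pixel values (0-255), repeated so the
# compressor has some redundancy to exploit.
pixels = bytes([12, 12, 13, 200, 201, 201, 90, 91] * 100)

# Lossless: compress, then decompress, and get the identical bytes back.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)

# Toy lossy step: keep only the top 4 bits of every pixel. The fine detail
# that was discarded cannot be recovered afterwards.
quantized = bytes(p & 0xF0 for p in pixels)

print("original bytes:  ", len(pixels))
print("compressed bytes:", len(packed))
print("lossless round trip identical:", restored == pixels)
print("lossy version identical:      ", quantized == pixels)
```

The lossless round trip returns every pixel unchanged, while the lossy version is permanently different from the original. This is the reason archival masters are usually kept in a lossless format, with lossy copies made only for delivery.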
Optical Character Recognition (OCR)
To perform “OCR” on an image is to transform a picture of text into editable, searchable, manipulatable text. One piece of software that can do this is called “FineReader,” made by a company called ABBYY.
The books available on Google Book Search have been OCR’d. It can be instructive to view the “plain text” behind the page images, as in this page from the introduction to a version of Jane Eyre. There are some small errors; such errors are inevitably greater when the page is stained or faded, or when the font is unusual.
As the University of Virginia’s Electronic Text Center advises, “even with clean text of a decent type size there will be occasional errors; this error rate increases as the text’s size and clarity decreases. Altering the brightness and resolution can improve results, but little can be done with a badly faded photocopy or a 17th or 18th century typeface.” When it is important to treat a text *as* text, and not simply as images, many prefer to outsource the work of retyping the text by hand. There are firms that will do this and guarantee a 99.99% rate of accuracy.
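One way to put a number on accuracy is to count the single-character edits needed to turn the OCR output into the correct text. The following is a minimal Python sketch of that idea. The function names are illustrative, the sample strings are invented for the example rather than taken from the Google Book Search page, and real projects may measure accuracy differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, correct_text):
    """Rough share of the correct text's characters that the OCR got right."""
    if not correct_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, correct_text)
    return max(0.0, 1 - errors / len(correct_text))


def expected_errors(characters, accuracy):
    """How many wrong characters to expect at a given accuracy rate."""
    return characters * (1 - accuracy)


if __name__ == "__main__":
    correct = "There was no possibility of taking a walk that day."
    ocr = "There was no possibi1ity of taking a wa1k that day."  # "l" read as "1"
    print("edits needed:", edit_distance(ocr, correct))
    print(f"character accuracy: {character_accuracy(ocr, correct):.2%}")
    print("errors expected in 10,000 characters at 99.99%:",
          round(expected_errors(10_000, 0.9999), 2))
```

At 99.99% accuracy, about one character in every 10,000 is wrong, so even a guaranteed transcription of a long book will contain some errors.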
Audio and Video Digitization
Note: Audio and video are sometimes called “time-based media,” for reasons that are obvious if you think about them — audio and video happen over time, whereas images and text are “still.” Many of the issues for digitizing audio and video are therefore the same.
One of the best audio editing programs is a free, open-source program called Audacity. You can use this to manage digital audio files. The Digital Studio also has Pro Tools, a more advanced program.
There are still not many very good free video editing and conversion programs. The Digital Studio has both iMovie (which is simpler) and Final Cut Pro (which is more difficult).
Sample rate and bit rate
“Sample rate” and “bit rate” in audio are analogous to “resolution” for images: they measure how much information the computer has captured. If you would like to read more about what exactly sample rates and bit rates are, check the NINCH Guide’s section on Audio and Video Management. The basic thing you need to know, however, is that audio should be sampled at a rate of at least 44.1 kHz, which is the same rate as a CD; 96 kHz is the new professional archival standard. The archival bit rate (strictly speaking, the bit depth, meaning the number of bits stored per sample) is 24-bit, although many projects have opted to use 16-bit.
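These settings translate directly into storage requirements. As a rough Python sketch, the size of uncompressed audio is the sample rate times the bits per sample (the 16-bit and 24-bit figures above) times the number of channels times the duration. The function name and the choice of one minute of stereo are illustrative assumptions.

```python
def uncompressed_audio_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Raw size in bytes of uncompressed audio with the given settings."""
    return sample_rate_hz * bits_per_sample * channels * seconds // 8


if __name__ == "__main__":
    # One minute of stereo audio at CD settings and at archival settings.
    cd = uncompressed_audio_bytes(44_100, 16, 2, 60)
    archival = uncompressed_audio_bytes(96_000, 24, 2, 60)
    print(f"CD quality (44.1 kHz, 16-bit): {cd / 1_000_000:.1f} MB per minute")
    print(f"Archival (96 kHz, 24-bit):     {archival / 1_000_000:.1f} MB per minute")
```

One minute at the archival settings takes a little more than three times the space of one minute at CD settings, which is one reason some projects settle for 16-bit.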
Video digitization is more complicated. There is a good tutorial about archival digitization of video at the University of Texas’s Information School.
Just as with images, audio and video digital files can be saved in formats that are either “lossy” (meaning that some information is discarded so that the file is smaller) or “lossless” (meaning that no information has been lost).
* mp3 — A lossy, heavily compressed audio format, and therefore one of the most common on the Internet. It is not suitable for archival purposes.
* aiff — Lossless (uncompressed) audio format developed by Apple.
* wav — Lossless audio format developed by Microsoft and IBM; the most common for archival audio.
* mpeg — A family of video formats, among the most common, used for both preservation and access.
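That a lossless format hands back exactly what was stored can be checked with Python's standard-library wave module. The sketch below writes one second of a test tone as a 44.1 kHz, 16-bit mono WAV held in memory, reads it back, and compares the bytes. The 440 Hz tone and the variable names are arbitrary choices made for the example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # samples per second, the CD rate
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit
FREQUENCY = 440.0     # pitch of the test tone in Hz (arbitrary choice)

# Build one second of a sine tone as 16-bit signed little-endian samples.
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]
frames = struct.pack(f"<{len(samples)}h", *samples)

# Write the WAV into an in-memory buffer rather than onto disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(SAMPLE_WIDTH)
    out.setframerate(SAMPLE_RATE)
    out.writeframes(frames)

# Read it back and confirm that the audio data is unchanged.
buffer.seek(0)
with wave.open(buffer, "rb") as src:
    restored = src.readframes(src.getnframes())
    print("sample rate:", src.getframerate())
    print("bits per sample:", src.getsampwidth() * 8)
    print("frames:", src.getnframes())

print("identical after round trip:", restored == frames)
```

The same comparison would fail for an mp3 made from this tone, because a lossy encoder discards part of the signal.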
The Digital Studio has a slide scanner that they can teach you to use, if necessary. The Microfilm department at NYU’s Bobst can also help you get a digital copy of a microfilm, if necessary.