File Formats

 

  1. Introduction
  2. File Format Glossary
  3. IVRLA File Formats

1. Introduction

A body of digitised content will be presented by the Irish Virtual Research Library and Archive team, as part of the pilot project.  The project will be undertaking the digitisation of most of the material; however, a portion will already be in digital format.  The digitised content will mainly consist of image, text, audio and video files. 

Initially, Preservation Master (PM) files are created for deep storage purposes only.  Subsequently, Compressed Web (CW) files are created from these for use as surrogate files in the repository and on the information web site.

Preservation Master files must be uncompressed in order to retain archival integrity. The surrogate files use compressed file formats, but with little perceptible loss of quality.

Specific file formats are used for both preservation and surrogate files, depending on the type of content in the original resource:

Original    Preservation Master     Surrogates
Image       TIFF                    JPEG, DjVu
Text        TIFF                    JPEG, DjVu, PDF
Audio       Linear WAV              MP3 / RealAudio ram
Video       MPEG 21                 -
Dataset     Microsoft Excel File    -

2. File Format Glossary

Definitions for the file formats used by the IVRLA project are given below:

TIFF
TIFF stands for Tagged Image File Format and it is a flexible file format.  

It can handle multiple images and data in a single file through the inclusion of "tags" in the file header.  Tags can indicate the basic geometry of the image, such as its size, or define how the image data is arranged and whether various image compression options are used.

The ability to store image data in a lossless format makes TIFF files a useful method for archiving images. Unlike standard JPEG, TIFF files can be edited and resaved without suffering a compression loss. [Adapted from www.wikipedia.org, accessed 20th September 2006]
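
As an illustration of how tags describe an image, the following minimal Python sketch reads the byte-order mark and the first directory of tags from a TIFF file. It is not project code: the function name `read_tiff_tags` and the tiny in-memory sample are assumptions made for the example, and the tag numbers 256 (ImageWidth), 257 (ImageLength) and 259 (Compression) are taken from the TIFF 6.0 specification.

```python
import struct

# Tag numbers from the TIFF 6.0 specification.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}


def read_tiff_tags(data: bytes) -> dict:
    """Return the byte order and single SHORT/LONG tag values from the first IFD."""
    order = data[:2]
    if order == b"II":        # "Intel" byte order: little-endian
        endian = "<"
    elif order == b"MM":      # "Motorola" byte order: big-endian
        endian = ">"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {"byte_order": "Intel" if endian == "<" else "Motorola"}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i          # each directory entry is 12 bytes
        tag, ftype, n = struct.unpack(endian + "HHI", data[start:start + 8])
        if n != 1 or ftype not in (3, 4):
            continue  # this sketch only handles a single SHORT (3) or LONG (4)
        fmt, size = ("H", 2) if ftype == 3 else ("I", 4)
        (value,) = struct.unpack(endian + fmt, data[start + 8:start + 8 + size])
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags


# Hypothetical sample: header plus three tags, with no pixel data attached.
sample = (
    b"II" + struct.pack("<HI", 42, 8)            # byte order, magic number, IFD offset
    + struct.pack("<H", 3)                       # three directory entries
    + struct.pack("<HHII", 256, 4, 1, 2700)      # ImageWidth
    + struct.pack("<HHII", 257, 4, 1, 1800)      # ImageLength
    + struct.pack("<HHIHH", 259, 3, 1, 1, 0)     # Compression = 1 (uncompressed)
    + struct.pack("<I", 0)                       # no further directories
)
print(read_tiff_tags(sample))
```

A Compression value of 1 indicates uncompressed image data, which is the property the Preservation Master files rely on.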

JPEG
JPEG is a commonly used standard method of compression for photographic images. The name JPEG stands for Joint Photographic Experts Group, which is the committee that created the standard.

JPEG uses lossy compression algorithms on images. JPEG itself specifies only how an image is transformed into a stream of bytes, but not how those bytes are encapsulated in any particular storage medium. A further standard, created by the Independent JPEG Group, called JFIF (JPEG File Interchange Format) specifies how to produce a file suitable for computer storage and transmission from a JPEG stream. [Adapted from www.wikipedia.org, accessed 20th September 2006]
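
The distinction between the JPEG stream and its JFIF wrapper can be seen in the first few bytes of a file. The sketch below is an illustrative assumption rather than project code: the function name `is_jfif` and the sample bytes are hypothetical. It checks only for the JFIF signature, so a JPEG stream wrapped in another way would return False.

```python
def is_jfif(data: bytes) -> bool:
    """True if data starts with a JPEG SOI marker followed by a JFIF APP0 segment."""
    return (
        data[:2] == b"\xff\xd8"          # SOI: start of image
        and data[2:4] == b"\xff\xe0"     # APP0 application segment marker
        and data[6:11] == b"JFIF\x00"    # identifier after the 2-byte segment length
    )


# Hypothetical sample: only the start of a file, not a complete image.
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02" + bytes(9)
print(is_jfif(sample))        # True
print(is_jfif(b"II*\x00"))    # False (this is the start of a TIFF file)
```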

DjVu
DjVu is a computer file format designed primarily to store scanned images, especially those containing text.

It features advanced technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal images.  This allows for high quality, readable images to be stored in a minimum of space, so that they can be made available on the web. Progressive loading makes the format ideal for images served over the internet.

DjVu can contain an OCRed text layer, making it easy to perform cut and paste operations. [Adapted from www.wikipedia.org, accessed 20th September 2006]

PDF
Portable Document Format (PDF) is a file format proprietary to Adobe Systems for representing two-dimensional documents in a device independent and resolution independent fixed-layout document format.

Each PDF file encapsulates a complete description of a 2D document that includes the text, fonts, images, and 2D vector graphics that compose the document. PDF files are most appropriately used to encode the exact look of a document in a device-independent way. [Adapted from www.wikipedia.org, accessed 20th September 2006]

WAV
WAV (or WAVE), short for Waveform audio format, is a Microsoft and IBM audio file format standard for storing audio on PCs.

It is a variant of the RIFF bitstream format method for storing data in "chunks". WAVs are compatible with Windows and Macintosh operating systems. The RIFF format acts as a "wrapper" for various audio compression codecs. It is the main format used on Windows systems for raw audio.

Though a WAV file can hold compressed audio, the most common WAV format contains uncompressed audio in the pulse-code modulation (PCM) format. PCM audio is the standard audio file format for CDs at 44,100 samples per second. Since PCM uses an uncompressed, lossless storage method, which keeps all the samples of an audio track, professional users or audio experts may use the WAV format for maximum audio quality. [Adapted from www.wikipedia.org, accessed 20th September 2006]
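
The following Python sketch, using only the standard library `wave` module, illustrates the RIFF wrapper and the data rate of uncompressed PCM. The 16-bit stereo settings are the CD audio parameters and are an assumption for the example; the project's own capture settings are not stated here.

```python
import io
import wave


def pcm_bytes_per_second(sample_rate: int, channels: int, bits_per_sample: int) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate * channels * bits_per_sample // 8


# Write one second of 16-bit stereo silence at 44,100 samples per second into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)                     # 2 bytes = 16 bits per sample
    wav.setframerate(44100)
    wav.writeframes(bytes(44100 * 2 * 2))   # all-zero samples

data = buffer.getvalue()
print(data[:4], data[8:12])                 # b'RIFF' b'WAVE'
print(pcm_bytes_per_second(44100, 2, 16))   # 176400
```

At roughly 176 KB per second, an hour of uncompressed audio at these settings occupies about 635 MB, which is why compressed surrogates are used for delivery.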

MP3
MPEG-1 Audio Layer 3, more commonly referred to as MP3, is a popular digital audio encoding and lossy compression format, designed to greatly reduce the amount of data required to represent audio, yet still sound like a faithful reproduction of the original uncompressed audio to most listeners.

MP3 is an audio-specific compression format. It provides a representation of pulse-code modulation-encoded audio in much less space than straightforward methods, by using psychoacoustic models to discard components less audible to human hearing, and recording the remaining information in an efficient manner. MP3 audio can be compressed with different bit rates, providing a range of tradeoffs between data size and sound quality. [Adapted from www.wikipedia.org, accessed 20th September 2006]
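
The tradeoff between bit rate and data size can be estimated with simple arithmetic. The sketch below is illustrative only: the function names and the bit rates of 64, 128 and 192 kbit/s are assumptions chosen for the example, and the calculation assumes a constant bit rate and ignores tags and headers.

```python
def mp3_size_bytes(bitrate_kbps: int, seconds: float) -> int:
    """Approximate size of a constant-bit-rate MP3 stream."""
    return int(bitrate_kbps * 1000 * seconds / 8)


def pcm_size_bytes(seconds: float, sample_rate: int = 44100,
                   channels: int = 2, bits: int = 16) -> int:
    """Size of the same duration as uncompressed PCM (CD settings by default)."""
    return int(sample_rate * channels * bits / 8 * seconds)


for kbps in (64, 128, 192):
    mp3 = mp3_size_bytes(kbps, 60)
    ratio = pcm_size_bytes(60) / mp3
    print(f"{kbps} kbit/s: {mp3:,} bytes per minute, about {ratio:.0f}x smaller than PCM")
```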

RealAudio ram
RealAudio is a proprietary audio format developed by RealNetworks. It uses a variety of audio codecs, ranging from low-bitrate formats that can be used over dialup modems, to high-fidelity formats for music.

It can also be used as a streaming audio format, played at the same time as it is downloaded. In many cases, web pages do not link directly to a RealAudio file.  Instead, they link to a .ram (Real Audio Metadata) or SMIL file. This is a small text file containing a link to the audio stream.

When a user clicks on such a link, the user's web browser downloads the .ram or .smil file and launches the user's media player. The media player reads the RTSP (Real Time Streaming Protocol) URL from the file and then plays the stream. [Adapted from www.wikipedia.org, accessed 20th September 2006]
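
The step in which the media player reads the stream URL can be sketched as follows. This is a simplified assumption about the file layout (one URL per line); the function name `stream_urls` and the example address are hypothetical.

```python
from urllib.parse import urlparse


def stream_urls(ram_text: str) -> list:
    """Return the rtsp:// URLs found in the text of a .ram file, one per line."""
    urls = []
    for line in ram_text.splitlines():
        line = line.strip()
        if urlparse(line).scheme == "rtsp":
            urls.append(line)
    return urls


# Hypothetical contents of a .ram file.
example = "rtsp://media.example.org/archive/interview01.ra\n"
print(stream_urls(example))
```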

MPEG 21
The MPEG-21 standard, from the Moving Picture Experts Group, aims at defining an open framework for multimedia applications. Specifically, MPEG-21 defines a "Rights Expression Language" standard as a means of sharing digital rights/permissions/restrictions for digital content from content creator to content consumer.

As an XML-based standard, MPEG-21 is designed to communicate machine-readable license information and do so in a "ubiquitous, unambiguous and secure" manner.

MPEG-21 is based on two essential concepts: the definition of a fundamental unit of distribution and transaction, which is the Digital Item, and the concept of users interacting with Digital Items. Digital Items can be considered the kernel of the Multimedia Framework, and users are those who interact with them inside the Multimedia Framework.

At its most basic level, MPEG-21 provides a framework in which one user interacts with another, and the object of that interaction is a Digital Item.  Therefore, the main objective of MPEG-21 is to define the technology needed to allow users to exchange, access, consume, trade or manipulate Digital Items in an efficient and transparent way.

MPEG-21 does not, however, give a consumer the ability to avoid paying multiple times for the same content in different formats. [From www.wikipedia.org, accessed 20th September 2006]

3. IVRLA File Formats

A detailed description of the type of files created by the project, together with their content and function, is given below.

Images
A substantial collection of images was identified for inclusion in the IVRLA project.  Material includes photographs, negatives, slides, watercolours, maps, postcards, and drawings.

The IVRLA project creates four different files for digitised image resources:-

1. Preservation Master (PM) Image file - TIFF

The Preservation Master files for images are scanned as TIFF files in Intel byte order.   It is important that these files are uncompressed, thus losing no details and remaining as true to the original as possible.

The images are scanned as colour RGB, 24 bits per pixel, and at 450 dpi (Dots per Inch).

TIFF files are then burned to DVD-Gold (2 copies) and once technical metadata has been appended, they are written to the LTO (Linear Tape Open - high capacity backup and storage) drive for deep archival storage.

TIFF file sizes for this project are quite large, typically between 2MB and 330MB, thus making them unsuitable for display purposes. Consequently they are for preservation purposes only.
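
The file sizes follow directly from the scanning settings: an uncompressed 24-bit RGB image needs three bytes for every pixel. The sketch below shows the arithmetic; the function name and the 6 x 4 inch and 20 x 16 inch originals are hypothetical examples, and the small TIFF header and tags are ignored.

```python
def uncompressed_rgb_bytes(width_in: float, height_in: float,
                           dpi: int = 450, bits_per_pixel: int = 24) -> int:
    """Pixel data size of an uncompressed scan of an original measured in inches."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical originals: a 6 x 4 inch photograph and a 20 x 16 inch map.
for width_in, height_in in ((6, 4), (20, 16)):
    size = uncompressed_rgb_bytes(width_in, height_in)
    print(f"{width_in} x {height_in} in: {size:,} bytes ({size / 1_000_000:.1f} MB)")
```

For the 6 x 4 inch example this gives 2,700 x 1,800 pixels and about 14.6 MB, and for the 20 x 16 inch example about 194.4 MB, both within the range quoted above.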

2. Compressed Web 1 (CW1) Image file – JPEG

One surrogate derived from the Preservation Master is in the JPEG format, at medium quality 5, and unconstrained.

JPEGs are more suitable for display over the web, as the file sizes are much smaller. JPEG uses a lossy compression method, with some data loss.

The JPEGs are initially being used as resources for the information web site, and as an aid to cataloguing.  Ultimately they will reside on the IVRLA server for use in the repository and will be controlled by disseminators.

JPEGs are watermarked and have technical metadata.

3. Compressed Web 2 (CW2) Image file – DjVu

Another surrogate derived from the Preservation Master is in the DjVu format.

This file format generates small file sizes, without the same degree of data loss as the JPEG format.

This file format requires a viewer, which provides the user with a great deal of functionality. The user can zoom to 1200% and pan.

DjVu files will be kept on the server and are watermarked.

4. Thumbnail (TN) Image file – JPEG

The final surrogate for images is a thumbnail generated from the Preservation Master in the JPEG format.

This is a compressed file format, at medium quality 5, and constrained (100 pixels tbc).

Thumbnails will be kept on the server and used by the repository, and are watermarked.
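
A constrained image is one whose dimensions are limited to a maximum size while the proportions of the original are kept. The sketch below assumes that the 100 pixel limit (still to be confirmed) applies to the longer side; the function name `constrain` and the sample dimensions are hypothetical.

```python
def constrain(width: int, height: int, max_side: int = 100) -> tuple:
    """Scale dimensions so the longer side is at most max_side, keeping proportions."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height              # already small enough; do not enlarge
    return (
        max(1, round(width * max_side / longest)),
        max(1, round(height * max_side / longest)),
    )


print(constrain(2700, 1800))   # (100, 67)
print(constrain(1800, 2700))   # (67, 100)
```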

Text
A substantial amount of text was identified for inclusion in the IVRLA project. Material includes letters, memorabilia, notes, newspaper cuttings, and pamphlets.

Printed material is subjected to a further process of optical character recognition to generate searchable text.

The IVRLA project creates four or five different files for digitised text resources, as applicable:-

1. Preservation Master (PM) Text File – TIFF

The Preservation Master files for text are scanned as TIFF files in Intel byte order.

It is important that these files are uncompressed, thus losing no details and remaining as true to the original as possible.

The images are scanned as colour RGB, 24 bits per pixel, and at 450 dpi (Dots per Inch).

TIFF files are then burned to DVD-Gold (2 copies) and once technical metadata has been appended, they are written to the LTO (Linear Tape Open - high capacity backup and storage) drive for deep archival storage.

TIFF file sizes for this project are quite large, typically between 2MB and 330MB, thus making them unsuitable for display purposes. Consequently they are for preservation purposes only.

2. Compressed Web 1 (CW1) Text File – JPEG

One surrogate derived from the Preservation Master is in the JPEG format, at medium quality 5, and unconstrained.

JPEGs are more suitable for display over the web, as the file sizes are much smaller.  JPEG uses a lossy compression method, with some data loss.

The JPEGs are initially being used as resources for the information web site, and as an aid to cataloguing. Ultimately they will reside on the IVRLA server for use in the repository and will be controlled by disseminators.

JPEGs are watermarked and have technical metadata.

3. Compressed Web 2 (CW2) Text File – DjVu

Another surrogate derived from the Preservation Master is in the DjVu format.

This file format generates small file sizes, without the same degree of data loss as the JPEG format.

This file format requires a viewer, which provides the user with a great deal of functionality.  The user can zoom to 1200%, pan, and perform a basic page text search.

DjVu files will be kept on the server and are watermarked.

4. Thumbnail (TN) Image file – JPEG

Another surrogate for text is a thumbnail generated from the Preservation Master in the JPEG format.

It is a compressed file format, at medium quality 5, and constrained (100 pixels tbc).

Thumbnails will be kept on the server and used by the repository, and are watermarked.

5. OCR text file – PDF

The final surrogate derived from the Preservation Master file for text is in the PDF format.  This is only used for printed material of a certain quality.

This is generated through the OCR (Optical Character Recognition) process, which scans the TIFF file, deciphers the characters on the page and creates a searchable text file.

This file is then saved as a PDF, giving an even greater degree of searchability.

Because the OCR output contains an accepted percentage of errors and loses the original formatting, this file is for backend use only and will not be seen by the user. The PDF files will be kept on the server.

Dataset
A sample set made up of cards from the Irish Dialects Archive was deemed unsuitable for scanning.  It contained data that needed to be searchable and yet was not suitable for OCR, due to its use of orthography.

Consequently the data was manually transferred from the cards into a customised FileMaker Pro database, and the data was then exported to an Excel file.

Audio
A number of audio files have been identified by the IVRLA Project for inclusion in the repository.

These analogue formats will have to be converted into digital formats.

The digitisation of audio files is currently in the planning stage.  The project envisages using the following three file formats:

1. Audio Preservation Master file – WAV

The Preservation Master files for audio will be converted from the current analogue format into the digital file format Linear WAV.

It is important that these files are uncompressed, thus losing no sound quality and remaining as true to the original as possible. 

2. Audio Compressed Web 1 file – MP3

One surrogate derived from the Preservation Master WAV file will be in the MP3 format.

MP3 is a lossy compression format, designed to reduce the amount of data required to represent audio, while remaining as faithful as possible to the original uncompressed audio.

Its smaller size makes it well suited to presenting audio over the web.

3. Audio Compressed Web 2 file – RealAudio ram

Another surrogate derived from the Preservation Master WAV will be the Real Audio ram format.

This file format can be used as streaming audio that is played at the same time as it is downloaded.

Video
A number of video files have been identified by the IVRLA Project for inclusion in the repository.

The digitisation of video files is currently in the planning stage.  The project envisages using the following file format:

Video Preservation Master file – MPEG 21
The Preservation Master files for video will be converted from the current 16mm, Beta, U-matic and VHS formats into the digital file format MPEG 21.

It is important that these files are uncompressed, thus losing no picture or sound quality and remaining as true to the original as possible. 
