A comparison of logical and physical parallel I/O patterns


Huseyin Simitci, Daniel A. Reed

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, U.S.A.

Address reprint requests to Huseyin Simitci, Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 West Springfield Avenue, Urbana, IL 61801 U.S.A., e-mail simitci@cs.uiuc.edu.

Summary

Although there are several extant studies of parallel scientific application request patterns, there is little experimental data on the correlation of physical I/O patterns with application I/O stimuli. To understand these correlations, the authors have instrumented the SCSI device drivers of the Intel Paragon OSF/1 operating system to record key physical I/O activities, and have correlated this data with the I/O patterns of scientific applications captured via the Pablo analysis toolkit. This analysis shows that disk hardware features profoundly affect the distribution of request delays and that current parallel file systems respond to parallel application I/O patterns in nonscalable ways.

1 Introduction

I/O for scalable parallel systems continues to be the major performance bottleneck for many large-scale scientific applications (Crandall et al., 1996; Purakayastha et al., 1995). Market forces are increasing the disparity between processor and disk system performance, exacerbating the already difficult problem of achieving high performance for applications with large I/O components. Moreover, most current parallel file systems (PFS) were constructed as extensions of workstation file systems and were optimized for large sequential data transfers.

Recent experimental studies (Crandall et al., 1996; Purakayastha et al., 1995; Smirni and Reed, 1996, 1997; Reed, Elford, Madhyastha, Scullin, et al., 1996) have shown that parallel applications have much more complex access patterns, with greater spatial and temporal variability, than first suspected. Although there is a large, complementary body of experimental data on disk behavior (Ruemmler and Wilkes, 1993, 1994) for sequential file systems, there is much less experimental data on the correlation of physical I/O patterns with parallel application I/O stimuli. Because PFSs mediate application I/O stimuli and physical I/O responses, developing appropriate designs for scalable parallel I/O systems requires a detailed characterization of the I/O behavior at multiple system levels (Gibson, Vitter, and Wilkes, 1996).

To correlate parallel application I/O stimuli with disk system responses, we augmented our portable application I/O instrumentation infrastructure (Reed, Elford, Madhyastha, Scullin, et al., 1996) with disk device driver instrumentation. Built atop the Pablo performance analysis toolkit, the former can capture both statistical summaries and time-stamped traces of application I/O patterns. In turn, our device driver instrumentation creates temporal summaries and activity histograms for selected SCSI devices, including read/write request sizes, device driver delays, SCSI device service time, response times, and queue lengths. Using this experimental infrastructure, we have studied logical and physical I/O patterns for three disk configurations of the Intel Paragon™ XP/S and the Intel PFS, one of the few extant commercial PFSs now available.

The remainder of this paper is organized as follows. In Section 2, we outline related work in parallel I/O characterization and disk modeling. We then describe our logical (application) and physical (disk) I/O characterization methodology in Section 3. As a baseline for analysis, Section 4 summarizes logical and physical I/O characteristics for a set of simple benchmarks. This is followed in Section 5 by a description of MESSKIT (High Performance Computational Chemistry Group, 1995), a large, multiphase quantum chemistry code with demanding I/O requirements that are representative of current parallel scientific applications. In Sections 6 and 7, we analyze the logical and physical I/O patterns for MESSKIT when executed on three different hardware configurations. Finally, Sections 8 and 9 summarize our findings and outline directions for future work.

2 Related Work

Although our understanding of I/O parallelism is still evolving, there is a long history of file access characterization for mainframes and vector supercomputers. Notable examples include Lawrie, Randal, and Barton's (1982) study of automatic file migration algorithms, Stritter's (1977) analysis of file lifetime distributions, Smith's (1981) study of mainframe file access behavior, and Jensen and Reed's (1993) study of file archive accesses.

More recently, Miller and Katz (1991) captured detailed traces of application file accesses from a suite of Cray applications, identifying compulsory, checkpoint, and staging I/O. Pasquale and Polyzos (1993, 1994) followed with two additional studies of vector workloads, concluding that most I/O had regular behavior.

In the parallel domain, Kotz and colleagues (Kotz and Nieuwejaar, 1994; Purakayastha et al., 1995) used library instrumentation to study I/O patterns on the Intel iPSC/860 and the Thinking Machines CM-5. They observed that parallel I/O patterns were more complex than expected, with small requests quite common.

Complementary studies of physical I/O patterns have focused on modeling and simulation of single disks (Ruemmler and Wilkes, 1994) and analysis of disk workloads in UNIX systems (Ruemmler and Wilkes, 1993; Baker, 1991). This work showed that without higher level file system optimizations, the benefits from even the best disk scheduling algorithms were limited (Seltzer, Chen, and Ousterhout, 1990).

Our work differs from these earlier studies by examining the correlations between parallel application I/O requests, PFS policies, and physical disk request streams. This correlation is a prerequisite to understanding how PFSs mediate and transduce logical and physical request streams and can provide a basis for intelligent design of massively parallel file systems.


3 Experimental Methodology

Application requests are the logical stimuli to an I/O system; their sizes, temporal spacing, and spatial patterns (e.g., sequential or random) constrain possible library and file system optimizations (e.g., by prefetching or caching). After mediation by a PFS, the physical patterns of I/O manifest at the storage devices are the ultimate system response.

To understand the implications of application request patterns for PFSs and the efficacy of parallel disk configurations, we have augmented the Pablo performance analysis environment's support for application I/O tracing (Crandall et al., 1996) with SCSI device driver instrumentation on the Intel Paragon XP/S system. Below, we describe the experimental platform and our measurement toolkit.

3.1 EXPERIMENTAL PLATFORM

The distributed memory Paragon XP/S architecture consists of a group of compute and I/O nodes, all connected by a two-dimensional mesh. These nodes execute a distributed version of OSF/1 AD, with application I/O requests on the compute nodes sent to file servers on the I/O nodes. In turn, the I/O nodes support Intel's PFS. The PFS stripes files across the I/O nodes in 64-KB units, with the initial 64-KB file segment randomly placed on one of the I/O nodes; subsequent 64-KB stripes are distributed using a round-robin algorithm.
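The striping rule just described (64-KB units, a randomly chosen starting I/O node, then round-robin) can be sketched in a few lines. This is an illustrative model only, not the PFS implementation; the function and parameter names (`io_node_for_offset`, `place_file`, `first_node`) are hypothetical.

```python
import random

STRIPE_UNIT = 64 * 1024  # 64-KB stripe unit, as stated in the text


def io_node_for_offset(offset, num_io_nodes, first_node):
    """Return the index of the I/O node that holds the byte at `offset`.

    `first_node` is the (randomly chosen) node holding the first stripe;
    later stripes follow in round-robin order.
    """
    stripe_index = offset // STRIPE_UNIT
    return (first_node + stripe_index) % num_io_nodes


def place_file(file_size, num_io_nodes, rng=random):
    """Return the I/O node index for every stripe of a new file."""
    first_node = rng.randrange(num_io_nodes)      # random initial placement
    num_stripes = -(-file_size // STRIPE_UNIT)    # ceiling division
    return [(first_node + i) % num_io_nodes for i in range(num_stripes)]
```

Under this model, a file larger than `num_io_nodes * STRIPE_UNIT` bytes touches every I/O node, which is why large sequential accesses would be expected to spread evenly across the disks.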

Together, OSF/1 and the PFS I/O node file servers support both buffered and unbuffered modes, selectable via a system call for each file. When buffering is enabled, PFS and OSF/1 use a read-ahead and write-behind caching algorithm with LRU replacement and 64-KB units. Finally, if the last two consecutive disk reads have contiguous logical block numbers, one more logical file block is prefetched asynchronously.
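The read-ahead trigger described above (two consecutive reads with contiguous logical block numbers cause one more block to be prefetched) can be modeled as follows. This is a minimal sketch under the assumption that "contiguous" means the second block number is exactly one greater than the first; the class name `SequentialPrefetcher` is hypothetical, and the real OSF/1 logic is not shown in the source.

```python
class SequentialPrefetcher:
    """Toy model of a one-block read-ahead policy."""

    def __init__(self):
        self.last_block = None  # logical block number of the previous read

    def on_read(self, block):
        """Record a read; return the block to prefetch, or None."""
        prefetch = None
        # Two consecutive reads are contiguous: fetch one block ahead.
        if self.last_block is not None and block == self.last_block + 1:
            prefetch = block + 1
        self.last_block = block
        return prefetch
```

A policy of this kind looks only one block ahead, so it cannot hide latency when many blocks are demanded in a burst.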

By default, PFS buffering is disabled, and a technique called fast path I/O is used to avoid data caching and copying. In this case, I/O node buffer caches and client side memory mapped file support are bypassed, and data are transferred directly between disks and user buffers.

All our experiments were conducted on the 512-node Intel Paragon XP/S at the Center for Advanced Computing Research at the California Institute of Technology (Caltech) and used OSF/1 version R1.4.1. As a major test platform for the Scalable I/O Initiative (Pool, 1996), this system supports multiple I/O configurations, each differing in the number of I/O nodes and both number and type of disks.

In our experiments, we considered two disk types and three different numbers of I/O nodes, each with either a single attached SCSI disk or a single RAID-3 disk array. Disk hardware parameters for each I/O node configuration are given in Table 1. The stripe group is the number of disks on which the files are striped for each I/O hardware configuration, and the stripe size is the amount of data stored on each disk before moving to the next disk.

Although the two disk types differ in logical disk block size, the basic unit of storage on the disk, OSF/1 treats the different disks identically; it always transfers data in multiples of a single sector (i.e., multiples of 2 KB) in response to read or write requests.

Varying the number of I/O nodes allowed us to assess the effects of I/O node parallelism on disk I/O patterns and application I/O response times. Conversely, varying the hardware characteristics of the disks allowed us to understand the effects of disk capabilities on observed behavior.

Although exploring an even larger set of hardware/software configurations would be desirable, we were constrained by the fact that the Caltech Intel Paragon XP/S is in production use. In consequence, wholesale hardware configuration changes (e.g., modifying the placement and number of disks) or major operating system changes were not practical.

3.2 LOGICAL I/O INSTRUMENTATION

The Scalable I/O Initiative (Pool, 1996) is a broad-based multiagency research group working in concert with vendors to design parallel I/O APIs and file systems. As part of the Scalable I/O Initiative's I/O characterization effort, we have extended the Pablo performance environment (Reed et al., 1993; Reed, Elford, Madhyastha, Scullin, et al., 1996) to capture application I/O behavior on a variety of single processor and parallel systems. The extended Pablo I/O toolkit wraps invocations of I/O routines with instrumentation calls that record the parameters and duration of each invocation.
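The wrapping approach described above can be illustrated with a small decorator that records the parameters and duration of each call. This is only a sketch of the general technique; it is not the Pablo toolkit's API, and the names `instrumented`, `trace`, and `read_bytes` are hypothetical.

```python
import functools
import time

trace = []  # in-memory event trace; a real toolkit would buffer or reduce it


def instrumented(func):
    """Wrap an I/O routine so each call logs its parameters and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the event even if the wrapped call raises.
            trace.append({
                "op": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "duration": time.perf_counter() - start,
            })
    return wrapper


@instrumented
def read_bytes(buffer, offset, size):
    """Stand-in for a real read routine (hypothetical example)."""
    return buffer[offset:offset + size]
```

Each call to `read_bytes` appends one record to `trace`; summarizing those records on the fly instead of storing them corresponds to the real-time reduction option discussed next.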

To minimize potential I/O perturbations due to performance data extraction, the Pablo toolkit supports both real-time reduction of I/O performance data and capture of detailed event traces. These two options trade computation perturbation for I/O perturbation. Extensive use of the Pablo toolkit for application analysis (Crandall et al., 1996; Smirni et al., 1996; Reed, Elford, Madhyastha, Scullin, et al., 1996) has shown that the instrumentation overhead is negligible for most application codes.


3.3 PHYSICAL I/O INSTRUMENTATION

Device drivers define the interface between file system services and I/O devices, isolating the idiosyncrasies of specific devices behind standard interfaces. Because all physical I/O requests transit the device drivers, instrumenting these device drivers allows one to capture and analyze the temporal and spatial patterns of all requests generated by a PFS.

In most current parallel I/O systems, including the Intel Paragon XP/S, the disks are connected to the I/O nodes via standard SCSI controllers, and SCSI device drivers service all requests. A SCSI disk appears externally as a linear vector of addressable blocks. All physical characteristics (i.e., cylinders, tracks, sectors, and bad blocks) are hidden by this virtual device interface.

This separation of logical and physical views simplifies the device driver interface and allows the storage device to transparently optimize requests. However, external entities, including the device driver, have little or no knowledge of the data layout on the physical media; the status of the on-board disk cache or scheduling algorithm; or the seek, rotational latency, or transfer components of a request service time.

Although there are experimental techniques for determining these features (Wilkes et al., 1995), for simplicity's sake we have restricted our analysis to behavior observable at the SCSI device drivers. As a further compromise between detail and overhead, we have opted to generate activity histograms that summarize read and write request sizes, device driver delays, device service times, request run lengths, driver and device queue lengths, and interarrival times.

To generate these histograms, we have modified the SCSI disk driver read/write routines (Forin, Golub, and Bershad, 1991) to time stamp each request on arrival, transmission to or receipt from the device, and departure. Using these time stamps, we then compute the time each request spent queued for service at the device driver, the time spent on the device, and the total response time (i.e., the sum of queuing and service delay). This and other data are kept in a kernel data structure associated with each SCSI device.
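The delay computation described above can be sketched as follows. For brevity, this illustration assumes three time stamps per request (arrival at the driver, dispatch to the device, and completion), whereas the instrumented driver records more points; the names `RequestStamps`, `delays`, and `histogram` are hypothetical and do not come from the driver source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RequestStamps:
    arrival: float      # request enters the device driver
    dispatch: float     # request is transmitted to the SCSI device
    completion: float   # device finishes; request departs the driver


def delays(r):
    """Return (queuing delay, device service time, total response time)."""
    queue = r.dispatch - r.arrival
    service = r.completion - r.dispatch
    return queue, service, queue + service  # response = queuing + service


def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by bin index."""
    return Counter(int(v // bin_width) for v in values)
```

Keeping only per-bin counts, rather than every time stamp, is what bounds the memory the kernel must devote to each device.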

Figure 1 illustrates the high-level structure of the resulting SCSI device driver instrumentation. On the Paragon XP/S, processors in the service partition provide general operating system services and the external interface to the users. Parallel applications are executed on the processors in the compute partition. I/O nodes differ from other compute nodes by having an SCSI interface and an SCSI disk.

Table 1 SCSI Disk Configurations (Intel Paragon XP/S)

Table 2 Benchmark Logical/Physical I/O Comparison (Intel Paragon XP/S)
NOTE: Request size = 80 KB, interrequest time = 1000 ms, file type = private, requests/processor = 130, number of processors = 64, access pattern = sequential.

During application program execution, an external user program executing on a service node can configure, reset, and retrieve the physical I/O data via an extended set of ioctl() calls. When synchronized with application instrumentation, this periodic histogram extraction provides the data needed to correlate logical and physical request patterns.

4 Physical I/O Benchmarks

As a basis for understanding the more complex I/O patterns found in parallel applications (e.g., the quantum chemistry code described in Section 5), we first measured the logical and physical I/O behavior of a suite of benchmarks. Via these configurable benchmarks, one can specify request types and sizes, interrequest latencies, parallelism, file sharing, and PFS buffering.

Table 2 summarizes the logical and physical I/O attributes of two such benchmarks where every processor sequentially reads or writes 80-KB units to or from a private file. In turn, Figures 2 and 3 illustrate the temporal patterns of physical I/O for the file read benchmark. The histograms in Figures 2 and 3 are captured and plotted with 10-s intervals, while Figures 4 through 6 are plotted

Fig. 2 Read benchmark physical I/O histograms (64 Seagate disks without buffering)

Several striking attributes are immediately apparent from the table and figures. First, both the number of physical operations and the total data volume differ markedly from that requested at the application level. For the write benchmark, PFS not only must update the file metadata as file blocks are written, it must also write the file system metadata to reflect the creation of new files.

When a processor creates a new file, other experiments (not displayed here due to space limitations) have shown that PFS must write two 64-KB blocks of metadata on each disk across which the file will be striped. Hence, the physical data volume of Table 2 rises with the number of disks (e.g., when 64 processors create a file that is striped across 64 disks, over 512 MB of metadata must be written before any application data are stored).
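The arithmetic behind the 512-MB figure can be checked directly. The sketch below assumes, as in the private-file benchmark, that each of the 64 processors creates its own file striped across all 64 disks, and that binary units are used (1 KB = 1024 bytes); the function name `creation_metadata_bytes` is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def creation_metadata_bytes(num_files, stripe_disks,
                            blocks_per_disk=2, block_size=64 * KB):
    """Metadata written at file creation: two 64-KB blocks per disk per file."""
    return num_files * stripe_disks * blocks_per_disk * block_size


total = creation_metadata_bytes(num_files=64, stripe_disks=64)
total_mb = total / MB  # 64 files x 64 disks x 2 blocks x 64 KB
```

With these inputs, `total` is 536,870,912 bytes, that is, 512 MB in binary units. Because the cost is the product of the file count and the stripe width, it grows quadratically when both scale together.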

Although this metadata overhead is partially an artifact of the interactions between PFS and OSF/1, it highlights the often hidden costs of scaling workstation file systems to hundreds or thousands of disks. The UNIX inode scheme was originally developed to efficiently support small files and dynamic growth. In contrast, files associated with scientific applications are both very small and very large. As others have also noted, this suggests that for the latter, extent-based allocation schemes could increase physical contiguity on storage devices and reduce metadata manipulation costs.

Fig. 3 Read benchmark physical I/O histograms (64 Seagate disks with buffering)

Table 2 and Figures 2 and 3 also show that the file-buffering policy has profound effects on physical request sizes and response times for both reads and writes. With the default PFS policy (no buffering), satisfying sequential 80-KB application read or write requests requires reading from or writing data to two disk stripes. Although buffering and write-behind ameliorate this overhead, they incur additional memory costs; a more flexible file system would match the size and placement of the disk stripes to the request size and access pattern.

Turning to the read benchmark, both Figures 2 and 3 illustrate the effects of SCSI command queuing and track buffering. Some requests are satisfied from the track buffers, whereas others see the full seek, rotation, and transfer latencies.

Figure 2 shows that multiples of 16 KB are retrieved to satisfy unbuffered read requests. Again, this reflects the mismatch between request size and striping factor. A succession of 80-KB requests requires 16-, 32-, and 48-KB portions of the 64-KB stripes. With similar benchmarks using 16-KB request sizes, we observed that the application and physical request counts and volumes were identical for unbuffered I/O.
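To see where the 16-, 32-, and 48-KB portions come from, one can split each sequential 80-KB request at 64-KB stripe boundaries. The following sketch is illustrative only and assumes requests start at offset 0 of the file; the function name `split_request` is hypothetical.

```python
KB = 1024
STRIPE = 64 * KB  # PFS stripe unit from the text


def split_request(offset, size, stripe=STRIPE):
    """Split a byte range into (stripe index, bytes in that stripe) pieces."""
    pieces = []
    end = offset + size
    while offset < end:
        stripe_index = offset // stripe
        stripe_end = (stripe_index + 1) * stripe
        piece = min(end, stripe_end) - offset  # bytes before the boundary
        pieces.append((stripe_index, piece))
        offset += piece
    return pieces


# Physical pieces generated by the first four sequential 80-KB requests.
first_four = [split_request(i * 80 * KB, 80 * KB) for i in range(4)]
```

Each 80-KB request touches two stripes, with piece sizes of 64+16, 48+32, 32+48, and 16+64 KB before the pattern repeats. A 16-KB request never crosses a stripe boundary, which is consistent with the identical logical and physical counts observed for that size.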

With buffering enabled, PFS and OSF/1 sequentially prefetch data in 64-KB units for file reads, yielding the single physical request size of Figure 3. This reduces the number of actual physical reads but increases the total I/O volume, as Table 2 illustrates. However, Figure 3 shows that even with a read interrequest interval of 1 s, the prefetching algorithm is unable to retrieve all data before they are requested. In turn, this results in the response time spike at the top of Figure 3.

Third, as Table 2 and Figures 2 and 3 show, the read benchmark generates a nontrivial number of physical writes. For reads, these writes accrue from metadata processing to record last access times for file blocks. Every 30 s, all I/O nodes write this data to their disks. By their nature, these file synchronizations are bursty, resulting in large disk response times. Moreover, comparison of Figures 2 and 3 shows that the interaction between prefetching and file synchronization adversely affects both.

Although read and write benchmarks highlight many possible interactions between access patterns, file system policies, and hardware configurations, their resource demands often are simpler and more regular than those in realistic applications. Hence, we turn now to an analysis of a large, I/O intensive parallel application.

5 Quantum Chemistry Code

As noted earlier, one of the primary goals of the Scalable I/O Initiative is analyzing the I/O patterns present in a large suite of scientific and engineering codes. These span a broad range of disciplines and have been the subject of several application characterization studies (Crandall et al., 1996; Smirni et al., 1996; Reed, Elford, Madhyastha, Scullin, et al., 1996).

As an initial basis for integrated analysis of application and physical I/O, we selected one code (MESSKIT) from the Scalable I/O Initiative suite. This code has been the subject of earlier application analysis (Crandall et al., 1996) and is representative of the I/O patterns observed in parallel scientific applications. As such, it provides a baseline for comparison of logical and physical I/O patterns.

MESSKIT is a Fortran implementation of the Hartree-Fock self-consistent field method (High Performance Computational Chemistry Group, 1995) that computes the electron density around a molecule by considering each electron in the molecule in the collective field of the others. The implementation uses basis sets derived from the atoms and the relative geometry of the atomic centers. Atomic integrals are then calculated over these basis functions and are used to approximate molecular density. A Fock matrix is derived using molecular densities and the atomic integrals. Finally, a self-consistent field method is used until the molecular density converges to within an acceptable threshold. Because a Fock matrix of size N generates O(N^2) one-electron and O(N^4) two-electron integrals, the total I/O demand for realistic problems is beyond what can be feasibly supported with current parallel I/O systems.

The MESSKIT code consists of three distinct programs that operate as a logical pipeline, with each stage accepting input from the previous one.

• psetup: The processors read the initial files, transform the data in ways needed by the later phases, and write the result to disk.

• pargos: Each processor locally calculates and writes to disk integrals that involve either one or two electrons.

• pscf: Finally, each processor repeatedly reads its private integral files to retrieve the necessary quadrature data and solves the self-consistent field equations. The results are periodically collected and written to disk by processor zero.

Although both the pargos and pscf phases are I/O intensive, for brevity's sake, we consider only the input intensive pscf phase below.

6 Logical I/O Patterns

Table 3 shows the I/O behavior of MESSKIT's pscf phase, as captured using our application instrumentation software. The data were obtained by executing the code on 64 processors and different hardware configurations, all using a small 16-atom test problem and the Intel PFS M_UNIX file access mode, a direct extension of standard UNIX file system semantics. Table 3 shows that even for this small test problem, the pscf phase requires over 50,000 accesses to secondary storage. With the RAID-3 disk array, these accesses consume nearly 40% of all execution time, although this decreases to nearly 10% with a larger number of faster disks.

Table 3 pscf Logical I/O Summary (64 processors)

Fig. 4 pscf logical I/O histograms (64 disks)

In addition to input costs, Table 3 also illustrates the high cost of file open and close; each file open averages roughly 0.3 s per processor. As described in Section 4, these costs arise from PFS metadata manipulation. In principle, one could preallocate space for the output files, a method used by many database systems to reduce metadata manipulation overhead. However, for scientific applications like MESSKIT, the size of the output files is strongly dependent on the input data, and limited excess storage space is available for preallocation.

As a basis for comparison with physical I/O histograms, Figure 4 shows a histogram of application read sizes and durations for the 64-disk configuration. Clearly, the pscf read activity is bursty, with six cycles visible in Figure 4. Most read request sizes are near 80 KB, although a few are near 200 KB. As a consequence of request burstiness, the application read durations are highly variable; PFS makes no attempt to minimize read latency by aggressively prefetching during compute intensive intervals.

Table 4 pscf Logical/Physical I/O Comparison (64 processors)

7 Physical I/O Patterns

Using the SCSI device driver instrumentation described in Section 3, we measured the physical I/O characteristics for the MESSKIT code phases on each of our three SCSI disk configurations. Table 4 summarizes these measurements for the pscf phase.

When PFS buffering is disabled, the physical read data volume roughly equals the logical data volume, although the number of physical requests exceeds the number of logical requests by a factor of four. With buffering, the data volume for physical reads increases substantially, but the total number of physical requests declines, reflecting the fact that prefetching retrieves a smaller number of larger 64-KB stripes. As with the benchmarks of Section 4, the majority of the write traffic is attributable to metadata updates.

As a complement to the logical access histograms of Figure 4, Figure 5 illustrates the temporal distribution of physical request sizes and durations for the pscf phase with PFS file buffering disabled. The logical and physical access patterns are quite similar, although the 80-KB

More striking is the effect of changing hardware attributes on the distribution of physical request response times. The SCSI standard supports multiple outstanding requests through a mechanism called command queuing. Using this mechanism, a disk controller can resequence requests based on internal state to minimize request response times. The older RAID-3 disk arrays of Table 1 do not support command queuing, but the newer Seagate disks do.

In Figure 5, with 64 Seagate disks, the I/O system has sufficient parallelism to avoid long queuing delays at each disk. This, together with command queuing and on-board request resequencing, allows the disks to satisfy most requests from the disk track buffers; these are the response times below 5 ms in Figure 5. As the number of disks declines to 16, a larger fraction of the requests require disk arm movement or encounter queuing delays. Finally, the 12 slower RAID-3 disks are saturated during application request bursts and lack command queuing to resequence requests. In consequence, most physical requests see large queuing delays.

7.1 QUEUING AND BLOCK RUNS

Although Figure 5 shows both the distribution of physical request sizes and the pernicious effects of insufficient hardware parallelism, it reveals little about the locality of physical requests or the sizes of disk queues. Using our SCSI driver instrumentation, we also captured SCSI block run lengths (i.e., the number of consecutive blocks accessed on each disk) and the distribution of queue sizes.

An analysis of these data shows that the product of the block run length and block size generally equals the size of the physical requests. This means that there is little or no locality across successive physical read requests, even though the high-level file operations are almost all sequential reads in the MESSKIT code.
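The run-length reasoning above can be made concrete with a small sketch. It assumes each physical request is described by a starting block number and a block count, observed in arrival order at one disk; the function name `run_lengths` is hypothetical and this is not the driver's actual bookkeeping.

```python
def run_lengths(requests):
    """Return the length, in blocks, of each run of consecutive blocks.

    `requests` is a list of (start_block, num_blocks) pairs in the order
    they arrive at a single disk. A request extends the current run only
    if it starts exactly where the previous request ended.
    """
    runs = []
    next_block = None
    for start, count in requests:
        if runs and start == next_block:
            runs[-1] += count   # contiguous with the previous request
        else:
            runs.append(count)  # a new run begins (the arm likely moves)
        next_block = start + count
    return runs
```

When every run is exactly as long as a single request, as the measurements indicate, run length times block size equals the request size and no two successive requests at a disk are contiguous.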

PFS file striping distributes all files across all available disks in the file system. When multiple processes contend for access on a single disk, successive accesses to that disk have little locality. Simply put, the disks see nonsequential requests, necessitating disk arm movements and increasing access times.

Turning to device queue lengths, Figure 6 illustrates the temporal distribution of queue sizes for the 12 RAID-3 disk arrays. As expected from the high response times of Figure 5, these disks are operating in saturation, with queue lengths exceeding 40 requests during periods of high activity. Even the 16- and 64-disk configurations see maximum queue lengths of 10 and 20, respectively.

Fig. 6 pscf driver queue (12 RAIDs)

Fig. 7 pscf disk request distributions (64 Seagate disks)

7.2 DISK LOAD DISTRIBUTIONS

As noted earlier, Intel's PFS stripes file data in 64-KB units, beginning with a randomly selected I/O node. For large sequential file accesses like those in MESSKIT, one would expect this data placement to distribute the number of physical requests nearly equally across the I/O nodes and disks. Surprisingly, this is not the case for the pscf phase.

Figure 7 shows both the number of requests and the total read service time for each disk in the 64-disk configuration as a function of application parallelism. When the number of application processors is no larger than the number of disks, read time is largely independent of application parallelism levels. However, when the number of contending processors equals or exceeds the number of disks, the total service time increases substantially.

Counterintuitively, smaller numbers of processors generate larger numbers of requests, even though the total application I/O volume in pscf is independent of the number of processors. This, together with the lack of physical locality described in Section 7.1, suggests that there are substantial opportunities for performance improvement.


8 Discussion and Implications

Based on our comparison of application I/O stimuli with physical I/O system responses on the Intel Paragon XP/S, several implications for both performance measurement toolkits and next-generation PFSs are clear.

First, SCSI device driver instrumentation and metric histograms strike the right balance between the limited detail provided by simple counts and the excessive overhead from tracing physical I/O within device drivers. In our experience, histogramming is efficient and provides a wealth of time-varying detail on request sizes, device driver service times, request run lengths, driver and device queue lengths, and interarrival times. When complemented by user-level measurement of application I/O patterns, one can analyze the interactions of application requests, PFS policies, and disk parallelism.
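The trade-off between counts, histograms, and traces can be made concrete with a small sketch. It is written in Python for readability rather than as driver code, and the class name, bin edges, and sample values are hypothetical. The point is that a histogram keeps one counter per bin and never stores individual events, so its memory cost is fixed while it still retains a distribution for each time window.

```python
import bisect


class MetricHistogram:
    """Fixed-bin histogram that stores counts only, never individual events."""

    def __init__(self, upper_bounds):
        # upper_bounds: sorted inclusive upper edges; the last bin is overflow.
        self.upper_bounds = list(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def record(self, value):
        # bisect_left returns the first bin whose upper edge is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1

    def snapshot_and_reset(self):
        """Return the counts for the current time window and start a new one."""
        window, self.counts = self.counts, [0] * len(self.counts)
        return window


# Hypothetical bin edges for request sizes in bytes.
sizes = MetricHistogram([4096, 16384, 65536])
for request_bytes in (512, 4096, 8192, 65536, 131072):
    sizes.record(request_bytes)
print(sizes.snapshot_and_reset())
```

Taking a snapshot at regular intervals yields the time-varying view described above; the same structure could hold service times, queue lengths, or interarrival times with different bin edges.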

Second, maintaining metadata is a major overhead for file systems that naively retain UNIX file system semantics. Current teraflop systems are being configured with thousands of disks, and proposed petaflop systems would have nearly 100,000 disks. Straightforward extrapolation of PFS-style metadata storage for such systems would entail writing gigabytes of metadata just to create a file. Alternative representations (e.g., extent-based storage) are needed that are more amenable to file striping while retaining fault tolerance.
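A back-of-the-envelope version of this extrapolation is sketched below. The figure of two metadata block writes per stripe file follows Note 2; the 8-KB block size and the premise of one stripe file per disk are assumptions made only for illustration, as is the function name.

```python
def creation_metadata_bytes(num_disks, blocks_per_stripe_file=2, block_bytes=8 * 1024):
    """Estimate metadata written to create one file striped over every disk.

    Assumptions (not from the measurements): one stripe file per disk and
    an 8-KB metadata block. The default of two blocks follows Note 2.
    """
    return num_disks * blocks_per_stripe_file * block_bytes


for disks in (64, 1_000, 100_000):
    total = creation_metadata_bytes(disks)
    print(f"{disks:>7} disks: {total / 2**20:10.1f} MiB")
```

Because the cost grows linearly with the number of disks, any per-disk metadata write at file creation becomes prohibitive at the scale of proposed petaflop systems.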

Third, both our benchmarks and the MESSKIT chemistry code illustrated the limitations of a single disk stripe size and distribution policy. When application requests are not a natural multiple of the stripe size, either because they are much smaller or much larger, the number and volume of physical I/O can differ markedly from that specified by the application. This mismatch, together with the demonstrable overheads, suggests that next-generation file systems must support flexible storage formats that can dynamically select a stripe size and a distribution of stripes across disks.
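The mismatch between logical and physical request counts and volumes can be sketched as follows. The 64-KB stripe unit comes from the text; the optional whole-block rounding, the 8-KB block size used in the example, and the function name are assumptions for illustration.

```python
def physical_io(offset, length, stripe_unit=64 * 1024, block=None):
    """Return (pieces, bytes_moved) for one application request (length > 0).

    pieces: number of stripe units the request touches, so at least this
    many physical requests are needed.
    bytes_moved: the request length, or the length rounded out to whole
    blocks when a block size is given (an assumed transfer granularity).
    """
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    pieces = last - first + 1
    if block is None:
        return pieces, length
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # ceiling to a block edge
    return pieces, end - start


# A 100-KB request that starts 40 KB into the file spans three stripe units.
print(physical_io(40 * 1024, 100 * 1024))
# A 2-KB request straddling a stripe boundary, with assumed 8-KB blocks.
print(physical_io(63 * 1024, 2 * 1024, block=8 * 1024))
```

In the second case a single 2-KB logical request becomes two physical requests that together move 16 KB under the assumed block size, which is the kind of amplification a fixed stripe size cannot avoid.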

As a complement to more flexible data distributions, file system policies must aggressively exploit application access patterns. Intel PFS and OSF/1 include simple read-ahead and write-behind policies with LRU cache replacement. For the MESSKIT application, these policies fragment application requests, generate unnecessary disk activity, and fail to exploit bursty behavior to aggressively prefetch data during idle periods. For example, we observed that the combination of PFS policies and data distributions across disks eliminated almost all access locality present in the application pattern. Automatic access pattern classification (Madhyastha and Reed, 1996), coupled with performance-directed adaptive control for policy selection (Reed, Elford, Madhyastha, Smirni, et al., 1996), could dynamically tailor policies to access patterns.

Fourth, despite the temptation to sacrifice I/O systems for additional processors or primary memory, high performance parallel systems can realize their potential only when balanced. For the MESSKIT application, we saw that a 4:1 processor-to-disk ratio was insufficient to maximize performance. Although Amdahl’s suggestion that a MIPS of computing must be balanced by a megabyte of memory and a megabyte per second of I/O may not hold for massively parallel systems, the premise of system balance remains true. Our experiments suggest that the number of disks in parallel I/O systems must be within a small constant factor of the number of processors for many scientific applications (Note 5).

Finally, characterization studies are by their nature inductive, covering only a small sample of the possibilities and attempting to extract more general patterns. Although the benchmarks and MESSKIT chemistry application we studied on the Intel Paragon XP/S are but a few samples from a large space of possible I/O patterns, earlier application characterization studies (Crandall et al., 1996; Smirni and Reed, 1996, 1997; Reed, Elford, Madhyastha, Scullin, et al., 1996; Purakayastha et al., 1995) suggest that our selections are representative of current practice. Although a wider range of experiments is desirable, the level of instrumentation and experiments we conducted required access to the operating system code and single-user time to load experimental operating system kernels. This restricted our ability to conduct comparative experiments on multiple platforms. A complete exploration will require analysis of additional applications, hardware platforms, and PFSs.

9 Conclusions and Futures

We have examined the interactions of application I/O requests, PFS policies, and disk hardware configurations using both application I/O measurements and SCSI device driver instrumentation. This analysis suggests that the physical I/O patterns induced by application requests are strongly affected by data-striping mechanisms, file system policies, and disk hardware attributes. Simply put, no single file policy or data distribution is optimal for all application access patterns.

Based on this analysis, we are exploring three directions: access pattern classification based on trained neural networks and hidden Markov models (Madhyastha and Reed, 1996), flexible policy selection using fuzzy logic techniques (Reed, Elford, Madhyastha, Smirni, et al., 1996), and adaptive storage formats based on redundant representations.

ACKNOWLEDGMENTS

We thank Evgenia Smirni, Christopher Elford, Tara Madhyastha, and Ruth Aydt for their insights on parallel I/O and instrumentation. We also thank Rick Kendall of the Molecular Science Software Group at the Molecular Science Research Center, Pacific Northwest National Laboratory, for the MESSKIT code. All data presented here were obtained from code executions at the Caltech Center for Advanced Computing Research. This work was supported in part by the Defense Advanced Research Projects Agency under DARPA Contracts DABT63-94-C0049 (Scalable I/O Initiative), DABT63-91-C-0029, and DABT63-93-C0040, by the National Science Foundation under Grant NSF ASC 92-12369, by a joint Grand Challenge grant with Caltech, and by the National Aeronautics and Space Administration under NASA Contract NAG-1-613.

BIOGRAPHIES

Huseyin Simitci is a doctoral candidate in computer science at the University of Illinois at Urbana-Champaign. His research interests include parallel file systems, high performance computing, and intelligent software. He obtained an M.S. and a B.S. from Bilkent University, Ankara, Turkey, in 1994 and 1992, respectively. He is a member of the Pablo Research Group.

Daniel A. Reed is a professor and head of the Department of Computer Science at the University of Illinois at Urbana-Champaign. In addition, he holds a joint appointment as a senior research scientist with the National Center for Supercomputing Applications. He received a B.S. in computer science from the University of Missouri at Rolla in 1978 and an M.S. and Ph.D., also in computer science, from Purdue University in 1980 and 1983, respectively. He was a recipient of the 1987 National Science Foundation Presidential Young Investigator Award.

NOTES

1. As a basis for comparison, this combination of request size and access pattern was chosen to match the application access pattern in Section 5.

2. Traces showed that in most cases there are two block writes per stripe file, but occasionally the number of blocks is one or three; hence the small variations in the metafile volume.

3. In Table 3, the I/O time column is the sum of all time spent performing I/O across all processors.

4. The PFS M_ASYNC mode, which does not preserve file access atomicity when files are concurrently opened by multiple processes, is a potentially lower overhead alternative to the M_UNIX mode. However, in the MESSKIT code, each file is accessed by a single processor.

5. Clearly, for some application domains this is not the case.

REFERENCES

Baker, M. G. 1991. Measurements of a distributed file system. Proceedings of the Thirteenth Symposium on Operating System Principles 25. Association for Computing Machinery, pp. 198-212.

Crandall, P., Aydt, R. A., Chien, A. A., and Reed, D. A. 1996. I/O characterization of scalable parallel applications. In Proceedings of Supercomputing 1995, San Diego.

Forin, A., Golub, D., and Bershad, B. 1991. An I/O system for Mach 3.0. In Proceedings of the USENIX Mach Symposium, USENIX, pp. 163-176.

Gibson, G. A., Vitter, J. S., and Wilkes, J. 1996. Strategic directions in computing research: Working group on storage I/O issues in large-scale computing. ACM Computing Surveys 28(4): 779-793.

High Performance Computational Chemistry Group. 1995. NWChem, a computational chemistry package for parallel computers, version 1.1. Available at Pacific Northwest National Laboratory, Richland, WA, 99352, U.S.A.

Jensen, D. W., and Reed, D. A. 1993. File archive activity in a supercomputing environment. In Proceedings of the 1993 ACM International Conference on Supercomputing, July.

Kotz, D., and Nieuwejaar, N. 1994. Dynamic file-access characteristics of a production parallel scientific workload. In Proceedings of Supercomputing ’94, Los Alamitos, November, pp. 640-649.

Lawrie, D. H., Randal, J. M., and Barton, R. R. 1982. Experiments with automatic file migration. IEEE Computer, July, pp. 45-55.

Madhyastha, T., and Reed, D. A. 1996. Intelligent, adaptive file system policy selection. In Proceedings of Frontiers ’96, October, pp. 172-179.

Miller, E. L., and Katz, R. H. 1991. I/O behavior of supercomputer applications. In Proceedings of Supercomputing ’91, November, pp. 567-576.

Pasquale, B. K., and Polyzos, G. C. 1993. Static analysis of I/O characteristics of scientific applications in a production workload. In Proceedings of Supercomputing ’93, Portland, November, pp. 388-397.

Pasquale, B. K., and Polyzos, G. C. 1994. Dynamic I/O characterization of I/O intensive scientific applications. In Proceedings of Supercomputing ’94, Washington, DC, November, pp. 660-669.

Pool, J. T. 1996. Scalable I/O Initiative. California Institute of Technology. Available at http://www.ccsf.caltech.edu/SIO/

Purakayastha, A., Ellis, C. S., Kotz, D., Nieuwejaar, N., and Best, M. 1995. Characterizing parallel file access patterns on a large-scale multiprocessor. In Proceedings of the Ninth International Parallel Processing Symposium, April, pp. 165-172.

Reed, D. A., Aydt, R. A., Noe, R. J., Roth, P. C., Shields, K. A., Schwartz, B. W., and Tavera, L. F. 1993. Scalable performance analysis: The Pablo performance analysis environment. In Proceedings of the Scalable Parallel Libraries Conference, edited by A. Skjellum. Los Alamitos, CA: IEEE Computer Society Press, 1993, pp. 104-113.

Reed, D. A., Elford, C. L., Madhyastha, T., Scullin, W. H., Aydt, R. A., and Smirni, E. 1996. I/O, performance analysis, and performance data immersion. In Proceedings of MASCOTS ’96, San Jose, February, pp. 1-12.

Reed, D. A., Elford, C. L., Madhyastha, T., Smirni, E., and Lamm, S. L. 1996. The next frontier: Interactive and closed loop performance steering. In Proceedings of the 1996 International Conference on Parallel Processing Workshop, Bloomington, August, pp. 20-31.

Ruemmler, C., and Wilkes, J. 1993. UNIX disk access patterns. In Proceedings of the Winter 1993 USENIX Conference.

Ruemmler, C., and Wilkes, J. 1994. An introduction to disk drive modeling. Computer 27(3): 17-28.

Seltzer, M., Chen, P., and Ousterhout, J. 1990. Disk scheduling revisited. In Proceedings of the Winter 1990 USENIX Conference, January.

Smirni, E., Aydt, R. A., Chien, A. A., and Reed, D. A. 1996. I/O requirements of scientific applications: An evolutionary view. In Proceedings of the Fifth IEEE International Symposium on High-Performance Distributed Computing, August, pp. 49-59.

Smirni, E., and Reed, D. A. 1997. Workload characterization of I/O intensive parallel applications. In Proceedings of the 9th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, June.

Smith, A. J. 1981. Analysis of long term file reference patterns for application to file migration algorithms. IEEE Transactions on Software Engineering SE-7(4): 403-417.

Stritter, T. R. 1977. File migration. Ph.D. thesis, Department of Computer Science, Stanford University.

Wilkes, J., Worthington, B. L., Ganger, G. R., and Patt, Y. N. 1995. On-line extraction of SCSI disk drive parameters. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, Ottawa, Canada, May, pp. 146-156.