Workload clustering for increasing energy savings on embedded MPSOCS


SESSION TA2: Embedded Systems, Sensors and MEMS

Chair: Christopher Ryan, Vitesse

Co-chair: Kenneth Hsu, Rochester Institute of Technology


Workload Clustering for Increasing Energy Savings on Embedded MPSoCs*

S.H.K. Narayanan, O. Ozturk, M. Kandemir

Department of Computer Science and Engineering

The Pennsylvania State University

{snarayan,ozturk,kandemir}@cse.psu.edu

ABSTRACT

Voltage/frequency scaling and processor low-power modes (i.e., processor shut-down) are two important mechanisms used for reducing energy consumption in embedded MPSoCs. While a unified scheme that combines these two mechanisms can achieve significant savings in some cases, such an approach is limited by the code parallelization strategy employed. In this paper, we propose a novel, integer linear programming (ILP) based workload clustering strategy across parallel processors, oriented towards maximizing the number of idle processors without impacting original execution times. These idle processors can then be switched to a low-power mode to maximize energy savings, whereas the remaining ones can make use of voltage/frequency scaling. In order to check whether this approach brings any energy benefits over the pure voltage scaling based, pure processor shut-down based, or a simple unified scheme, we implemented four different approaches and tested them using a set of eight array/loop-intensive embedded applications. Our simulation-based analysis reveals that the proposed ILP based approach (1) is very effective in reducing the energy consumptions of the applications tested and (2) generates much better energy savings than all the alternate schemes tested (including a unified scheme that combines voltage/frequency scaling and processor shutdown).

I. INTRODUCTION

We can roughly divide the efforts on energy savings in embedded multi-processor system-on-a-chip architectures (MPSoCs) into two categories. In the first category are the studies that employ processor voltage/frequency scaling. The basic idea is to scale down the voltage/frequency of a processor if its current workload is less than the workloads of other processors. In comparison, the studies in the second category shut down unused processors (i.e., put them into low-power states along with their private memory components) during the execution of the current computation. Both these techniques, i.e., voltage scaling and processor shut-down, can be applied at the software level (e.g., directed by an optimizing compiler) or at the hardware level (e.g., based on a past history-based workload/idleness detection algorithm). It is also conceivable to combine these two techniques under a unified optimizer.

Each of these techniques has its advantages and drawbacks. For example, a processor shut-down based scheme may not be applicable if there is no unused processor (note that this does not mean that the workloads of all the processors in the MPSoC are similar). Similarly, the effectiveness of a voltage scaling based scheme is limited by the number of voltage/frequency levels supported by the underlying hardware. In general, exploiting processor/memory shutdown saves more energy when it is applicable (as it reduces leakage energy significantly) or when we have only a couple of voltage/frequency levels to use. If this is not the case, then voltage scaling can be effective (and in some cases it is the only choice). Based on this discussion, one can expect a unified scheme to be successful. However, we want to reiterate that if there is no unused (idle) processor in the current workload assignment, such a unified scheme simply reduces to a voltage scaling based approach.

M. Karakoy

180 Queen's Gate, Imperial College

[email protected]

Our goal in this paper is to explore a workload (job) clustering scheme that combines voltage scaling with processor shut-down (footnote 1). The uniqueness of the proposed unified approach is that it maximizes the opportunities for processor shut-down by assigning workloads to processors carefully. It achieves this by clustering the original workloads of processors in as few processors as possible. In this paper, we discuss the technical details of this approach to energy saving in embedded MPSoCs. The proposed approach is ILP (integer linear programming) based; that is, it determines the optimal workload clusterings across the processors by formulating the problem using ILP and solving it using a linear solver. In order to check whether this approach brings any energy benefits over the pure voltage scaling based, pure processor shut-down based, or a simple unified scheme, we implemented four different approaches within our linear solver and tested them using a set of eight array/loop-intensive embedded applications. Our simulation-based analysis reveals that the proposed ILP based approach (1) is very effective in reducing the energy consumptions of the applications tested and (2) generates much better energy savings than all the alternate schemes tested (including one that combines voltage/frequency scaling and processor shutdown).

II. EMBEDDED MPSOC ARCHITECTURE, EXECUTION MODEL, AND RELATED WORK

The chip multiprocessor we consider in this work is a shared-memory architecture; that is, the entire address space is accessible by all processors. Each processor has a private L1 cache, and the shared memory is assumed to be off-chip. Optionally, we may include a (shared) L2 cache as well. Note that several architectures from academia and industry fit this description [1, 10, 8, 9]. We keep the subsequent discussion simple by using a shared bus as the interconnect (though one could use fancier/higher bandwidth interconnects as well). We also use the MESI protocol (the choice is orthogonal to the focus of this paper) to keep the caches coherent across the CPUs. We assume that the voltage level and frequency of each processor in this architecture can be set independently of the others, and also that processors can be placed into low-power modes independently. This paper focuses on a single-issue, five-stage (instruction fetch (IF), instruction decode/operand fetch (ID), execution (EXE), memory access (MEM), and write-back (WB) stages) pipelined datapath for each on-chip processor.

Our application execution model in this embedded MPSoC can be summarized as follows. We focus on array-based embedded applications that are constructed from loop nests. Typically, each loop nest in such an application is small but executes a large number of iterations and accesses/manipulates large datasets (typically multidimensional arrays of signals). We employ a loop nest based application parallelization strategy. More specifically, each loop nest is parallelized independently of the others. In this context, parallelizing a loop nest means distributing its iterations across processors and allowing processors to execute their portions in parallel. For example, a loop with 1000 iterations can be parallelized across 10 processors by allocating 100 iterations to each processor.

* This work is supported in part by NSF Career Award #0093082 and by a grant from GSRC.

Footnote 1: In this paper, we use the terms "processor shut-down" and "low-power mode" interchangeably.

[Figure 1, panels (a) through (e): workloads of processors P0 to P5 under each scheme. Legend: idle/unused, voltage scaled, shut down.]

Figure 1: Comparison of different energy-saving approaches for a six-processor architecture. Arrows indicate how the workloads (jobs) are clustered by our approach.
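To make the iteration distribution in the execution model concrete (1000 iterations over 10 processors), the following is a minimal sketch of a contiguous block distribution. The paper does not commit to a particular parallelization strategy, so the block scheme and the function name `block_partition` are assumptions made purely for illustration.

```python
def block_partition(num_iterations, num_processors):
    """Split iterations 0..num_iterations-1 into contiguous per-processor ranges.

    Assumed scheme (not taken from the paper): block distribution, with any
    remainder spread over the first processors one iteration at a time.
    """
    base, extra = divmod(num_iterations, num_processors)
    ranges = []
    start = 0
    for p in range(num_processors):
        size = base + (1 if p < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# The example from the text: 1000 iterations on 10 processors.
chunks = block_partition(1000, 10)
sizes = [len(c) for c in chunks]

# An uneven case, which is where per-processor workloads start to differ.
uneven_sizes = [len(c) for c in block_partition(10, 3)]
```

When the iteration count does not divide evenly, or when iterations differ in cost, the per-processor workloads become unequal, which is the kind of imbalance the energy-saving schemes in this paper exploit.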

There are many proposals for power management of a dynamic voltage scaling-capable processor. Most of them are at the operating system level and are either task-based [12] or interval-based [4]. While some proposals aim at reducing energy without compromising performance, a recent study by Grunwald et al. [5] observed noticeable performance loss for some interval-based algorithms using actual measurements. Most of the existing compiler based studies such as [11] target single processor architectures. In comparison, our work targets a chip multiprocessor based environment and combines voltage scaling and processor shutdown. The work in [15] presents and analyzes a voltage/frequency scaling scheme but does not consider processor shut-down. The work in [6] employs a processor shut-down based mechanism but does not consider voltage/frequency scaling. In our experimental evaluation, we compare our approach to pure voltage/frequency scaling and to pure processor shut-down as well.

III. OUR APPROACH

III.1 Overview

Figure 1 compares four alternate schemes that save energy in an embedded MPSoC architecture. It is assumed, for illustrative purposes, that the architecture has six processors. Figure 1(a) shows the workloads of the processors (i.e., the jobs assigned to them) in a given loop nest. These are assumed to be the loads either estimated by the compiler or calculated through profiling, and are for a single nest. Figures 1(b) and (c) show the scenarios with pure voltage/frequency scaling and pure processor shut-down based approaches, respectively. In (b), four out of our six processors take advantage of voltage scaling (note that P5 is not used in the computation at all). In (c), on the other hand, we can place only one processor (P5) into the low-power mode. A combination of these two approaches is depicted in Figure 1(d). Basically, this version combines the benefits of voltage/frequency scaling and processor shut-down. Finally, the result that can be obtained by the ILP approach proposed in this paper is illustrated in Figure 1(e). Note that what our approach essentially does is to cluster the total amount of computational load in as few processors as possible so that the number of unused processors is maximized. In this particular case, the original loads of three processors (P2, P3, and P4) are combined and assigned to processor P2. As a result, processors P3 and P4 can also be placed into the low-power mode (along with their private memory components) to maximize energy savings, in addition to P5. The next subsection gives the technical details of this approach.

When there are opportunities, our approach can also use voltage/frequency scaling for the clustered jobs. It is important to point out that the benefits from our approach can be expected to be even more significant when the number of voltage/frequency levels is small. In such a case, a pure voltage/frequency scaling based approach cannot stretch the execution time of a processor to fill the available slack completely.
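The point about a small number of levels leaving slack unfilled can be illustrated with a short sketch. It assumes, purely for illustration, that a job's execution time is its cycle count divided by the clock frequency; the frequency lists, the cycle count, and the function name `slowest_feasible_level` are hypothetical and not taken from the paper.

```python
def slowest_feasible_level(cycles, t_max_us, freqs_mhz):
    """Pick the slowest frequency that still finishes within t_max_us.

    Assumption: execution time in microseconds = cycles / frequency in MHz.
    Returns (frequency, execution_time, unused_slack), or None if no level fits.
    """
    feasible = [f for f in freqs_mhz if cycles / f <= t_max_us]
    if not feasible:
        return None
    f = min(feasible)
    exec_time = cycles / f
    return f, exec_time, t_max_us - exec_time


# A 250,000-cycle job with a 1000 us limit (set by some busier processor).
# With four coarse levels, 200 MHz is too slow, so 300 MHz is chosen and
# part of the slack stays unused.
coarse = slowest_feasible_level(250_000, 1000.0, [400, 300, 200, 100])

# With finer-grained (hypothetical) levels, 250 MHz fills the slack exactly.
fine = slowest_feasible_level(250_000, 1000.0, [400, 350, 300, 250, 200, 150, 100])
```

Under these made-up numbers the coarse configuration leaves roughly one sixth of the time limit idle on that processor, time during which it still leaks; clustering other jobs onto it is one way to put that slack to use.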

However, we first need to clarify two important issues. One may ask at this point, "why has the application (corresponding to the scenario in Figure 1(a)) not been parallelized in the first place as shown in Figure 1(e)?" There are several reasons for this. First, most current code parallelizers do not consider any energy optimizations. Therefore, there is really little reason for calculating the workloads of individual processors, and thus little opportunity for workload clustering. Second, conventional parallelizing compilers try to use as many processors as possible for executing a given computation unless there exists a compelling reason to do otherwise (e.g., excessive synchronization costs). Third, in many cases, trying to cluster computation in very few processors can have an impact on execution cycles. Since most parallelizing compilers do not predict or quantify this impact, they do not attempt such clusterings, staying on the conservative side.

The second issue is that the scenario depicted in Figure 1(e) may have poor data locality as compared to the scenarios in Figures 1(b), (c), and (d). This is because conventional code parallelizers generally try to achieve good data locality by ensuring that each processor uses the same set of data elements as much as possible (i.e., high data reuse). As a result, the scenario in Figure 1(e) can lead to an increase in data cache misses, which in turn increases overall energy consumption. This overhead should also be factored into our clustering approach to ensure a fair comparison.

The main contribution of the ILP approach proposed in this paper is to obtain, for each loop nest in an application, the result shown in Figure 1(e), given the initial scenario (workload assignment) shown in Figure 1(a), and thus reduce energy consumption.

III.2 Technical Details and Problem Formulation

This section elaborates on the ILP model used to represent the problem. In our problem, there exists a set of jobs (workloads) that have to be executed on a set of available processors in the embedded MPSoC such that the total energy spent by the system is minimal and the execution of the jobs completes within a specified time limit, Tmax (footnote 2). The processors can run different jobs at different voltage and frequency levels, which affects energy consumption. The energy expended by each processor is the sum of the dynamic energy and the leakage energy expended while running. The rest of this section describes the ILP model in detail.

III.2.1 System and Job Model

We assume that the jobs are members of the set J consisting of Jmax elements and the processors belong to the set P in which there are Pmax elements. The processors can run at Vnum discrete voltage/frequency levels (as supported by the architecture). It is assumed that only one job can run on a processor at any time and that once a job starts running on a processor, it runs uninterrupted to completion. However, a processor can be assigned to run more than one job, as a result of workload clustering. The duration that a job occupies the processor depends on the supply voltage and the frequency at which the processor is running that particular job. The time (latency) each job takes at different voltage levels is specified in the array Job_Length(j, v). Similarly, the dynamic energy spent by each job at different voltage levels varies and is captured by Job_Dynamic(j, v) (footnote 3).

Total_Energy is the sum of the energies spent by all jobs on all processors due to their running as well as the leakage energy consumed by the processors. This is the metric whose value we want to minimize.

III.2.2 Mathematical Programming Model

The constraints specified below give the mathematical representation of our model. We use 0-1 integer linear programming (ILP). This ILP formulation is executed for each loop nest separately. Table 1 gives the notation used in our formulation.

Job Assignment Constraints. The 0-1 variable X(p, j, v) determines whether processor p runs job j at voltage/frequency level v. One job runs completely on one processor and all jobs are scheduled to run only once. This is specified as follows:

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V: \quad X(p, j, v) \in \{0, 1\} \tag{1}$$

$$\forall j \in J: \quad \sum_{p=0}^{P_{max}-1} \sum_{v=0}^{V_{num}-1} X(p, j, v) = 1 \tag{2}$$

Constraint (1) expresses the term X(p, j, v) as a binary variable; a processor either runs the job or it does not. Constraint (2) states that each job can be run only on one processor and that all jobs are assigned to some processor (i.e., no job is left unassigned). Notice that we want to determine the value of X(p, j, v) for all p, j, and v.

Deadline Constraints. Jobs are assigned to processors as long as they can meet the time deadline that is specified. Constraint (3) expresses this:

$$\forall p \in P: \quad \sum_{j=0}^{J_{max}-1} \sum_{v=0}^{V_{num}-1} X(p, j, v) \cdot \mathit{Job\_Length}(j, v) \le T_{max} \tag{3}$$

Note that Tmax is determined, for each loop nest, by the longest (largest) workload.

Clustering and Processor Shut-Down Constraints. Multiple jobs are run on the same processor if the number of jobs, Jmax, exceeds the number of processors, Pmax, but also if such an arrangement reduces the overall energy spent by the system. In case a processor is not assigned any job, either because of clustering of jobs or because Jmax < Pmax or because of both these reasons, then it is shut down. Such a processor does not consume any dynamic energy as it has no jobs running on it, and it does not consume any leakage energy since it is shut down (except for some small amount of leakage in memory components). Constraint (4) is introduced to capture processor shut-down:

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V: \quad \mathit{Busy}(p) \ge X(p, j, v) \tag{4}$$

For a particular processor p, Busy(p) is necessarily 1 if any of the values in X(p, j, v) is 1. Through this constraint, the value of Busy(p) is not explicitly determined if all values in X(p, j, v) are 0. However, a value of 1 in Busy(p) adds leakage to the overall energy. As the objective of the ILP-based model is to reduce energy, Busy(p) will be assigned 0 if all values in X(p, j, v) are 0 (footnote 4).

Leakage and Dynamic Energy Calculation. The following expressions capture the leakage energy and dynamic energy spent by the system as the sum of the leakage and dynamic energies, respectively, spent by each processor. The total amount of dynamic energy spent by a processor is the sum of the dynamic energies spent for each job that is run on that processor. This is captured by Expression (5):

$$\mathit{D\_Energy} = \sum_{p=0}^{P_{max}-1} \sum_{j=0}^{J_{max}-1} \sum_{v=0}^{V_{num}-1} X(p, j, v) \cdot \mathit{Job\_Dynamic}(j, v) \tag{5}$$

Expression (6) calculates the leakage energy spent. As mentioned earlier, if Busy(p) is 1, then leakage is spent by processor p.

$$\mathit{L\_Energy} = \mathit{Leakage\_Value} \cdot \sum_{p=0}^{P_{max}-1} \mathit{Busy}(p) \tag{6}$$

Table 1: Notation used in our model.

Footnote 2: In this paper, we do not assume a specific code (loop nest) parallelization strategy. Rather, we assume that each loop nest is parallelized using one of the known techniques. For each loop nest, Tmax is determined by the processor with the largest workload. This is to ensure that our workload clustering does not have a negative impact on execution times.

Footnote 3: Here j represents a job (workload) and v represents a voltage (frequency) level. In our implementation, the entries of Job_Length(j, v) and Job_Dynamic(j, v) are filled using profiling. All energy estimations are performed using Wattch [2] under the 70nm process technology. The increase in data cache misses as a result of clustering is captured during our profiling.

Footnote 4: To preserve data in memory components, a shut-down processor consumes some leakage [3]. Our experiments are performed based on this principle. However, in our presentation of the ILP formulation, we assume no leakage consumption in the shut-down state for ease of presentation.

Objective Function. The objective function, which is the total energy spent by the system, is the sum of the leakage and dynamic energies. This is the objective function that our approach tries to minimize:

$$\mathit{Total\_Energy} = \mathit{D\_Energy} + \mathit{L\_Energy} \tag{7}$$

The constraints and expressions mentioned in this section are sufficient to express our problem within ILP. We next look at the additional constraints that can be used in order to handle two special cases.
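As a rough illustration of Constraints (1) through (4) and the objective in Expression (7), the sketch below finds the minimum-energy assignment for a tiny instance by exhaustive enumeration. This is a stand-in for an ILP solver, chosen only so the example has no dependencies; it is not the solver setup used in the paper. The job lengths, dynamic energies, leakage value, and the function name `solve_clustering` are all hypothetical.

```python
from itertools import product


def solve_clustering(job_length, job_dynamic, p_max, t_max, leakage_value):
    """Exhaustively minimize Total_Energy = D_Energy + L_Energy.

    job_length[j][v] and job_dynamic[j][v] play the roles of Job_Length(j, v)
    and Job_Dynamic(j, v). Returns (total_energy, assignment), where
    assignment[j] = (p, v) means X(p, j, v) = 1, or None if nothing is feasible.
    """
    j_max = len(job_length)
    v_num = len(job_length[0])
    choices = [(p, v) for p in range(p_max) for v in range(v_num)]
    best = None
    # Choosing exactly one (p, v) pair per job enforces Constraints (1) and (2).
    for assignment in product(choices, repeat=j_max):
        load = [0] * p_max
        d_energy = 0
        for j, (p, v) in enumerate(assignment):
            load[p] += job_length[j][v]
            d_energy += job_dynamic[j][v]  # Expression (5)
        if max(load) > t_max:  # Constraint (3): every processor meets Tmax
            continue
        # Constraint (4): a processor is busy iff some job is assigned to it
        # (the minimization makes the inequality tight).
        busy = {p for p, _ in assignment}
        total = d_energy + leakage_value * len(busy)  # Expressions (6) and (7)
        if best is None or total < best[0]:
            best = (total, assignment)
    return best


# Hypothetical instance: 4 jobs, 4 processors, 2 levels (index 0 = fast, 1 = slow).
job_length = [[10, 20], [4, 8], [3, 6], [2, 4]]
job_dynamic = [[100, 40], [40, 16], [30, 12], [20, 8]]

total, assignment = solve_clustering(job_length, job_dynamic,
                                     p_max=4, t_max=10, leakage_value=30)
busy_count = len({p for p, _ in assignment})
print(total, busy_count)  # prints "226 3"
```

For these made-up numbers the search leaves one processor unused and runs the three small jobs at the slow level on two processors, which beats both pure voltage scaling with one job per processor (256) and packing the small jobs onto one processor at the fast level (250). Raising `leakage_value` shifts the optimum towards fewer busy processors. Enumeration grows as (Pmax x Vnum)^Jmax, so it is only usable for toy sizes.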

Voltage/Frequency Scaling without Clustering. To model classical voltage/frequency scaling within our ILP formulation, an input value Assign(p, j) should specify the processor on which each job runs. Further, by connecting this value to that of X(p, j, v), all jobs are forced to run on the assigned processors alone. This connection can be captured by the following constraint:

$$\forall p \in P,\ \forall j \in J: \quad \sum_{v=0}^{V_{num}-1} X(p, j, v) = \mathit{Assign}(p, j) \tag{8}$$

Clustering without Voltage/Frequency Scaling. To model job clustering without voltage and frequency scaling, we need to constrain the choice of available voltage/frequency levels, either for each processor individually or for all processors. In the case of constraining the voltage levels of all processors to one value, Constraint (9) can be used to ensure that no jobs are assigned voltage levels other than the one specified.

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V - \{v'\}: \quad X(p, j, v) = 0 \tag{9}$$

To constrain each individual processor to an independent voltage level, Constraint (10) below can be used.

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V - \{v_p\}: \quad X(p, j, v) = 0 \tag{10}$$

Here, v' and v_p are the universal and individual (for processor p) voltage levels, respectively. These constraints simply limit the voltage levels to be used. In this case, the decision to cluster jobs together on a processor is made by our solver and depends on whether it results in a lowered overall energy consumption.
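When every job is pinned to a single level v' as in Constraint (9), the dynamic energy is the same for every assignment, so minimizing total energy amounts to using as few processors as possible within Tmax, which is a bin-packing problem. The sketch below uses the first-fit-decreasing heuristic to illustrate this. It is a greedy approximation swapped in for illustration, not the ILP used in the paper, and it may use more processors than the true optimum; the job lengths and the function name `first_fit_decreasing` are hypothetical.

```python
def first_fit_decreasing(lengths, p_max, t_max):
    """Greedily cluster jobs (all at one fixed level v') onto processors.

    lengths[j] stands for Job_Length(j, v'). Returns a list of job-index lists,
    one per busy processor, or None if the heuristic cannot place every job.
    """
    if any(length > t_max for length in lengths):
        return None  # such a job violates the deadline on any processor
    # Longest jobs first; ties keep their original order (stable sort).
    order = sorted(range(len(lengths)), key=lambda j: lengths[j], reverse=True)
    clusters, loads = [], []
    for j in order:
        for i in range(len(clusters)):
            if loads[i] + lengths[j] <= t_max:  # deadline check, Constraint (3)
                clusters[i].append(j)
                loads[i] += lengths[j]
                break
        else:
            if len(clusters) == p_max:
                return None  # heuristic ran out of processors
            clusters.append([j])
            loads.append(lengths[j])
    return clusters


# Hypothetical six-processor case in the spirit of Figure 1: five jobs, Tmax = 10.
clusters = first_fit_decreasing([10, 8, 4, 3, 3], p_max=6, t_max=10)
shut_down = 6 - len(clusters)  # processors left without any job
```

Here the three smallest jobs end up on one processor, leaving three of the six processors free to be shut down. When the number of jobs does not exceed the number of processors, the heuristic always returns some clustering, since placing each job on its own processor is a fallback it can reach.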

IV. EXPERIMENTAL EVALUATION

We present only energy results in this section. The reason is that none of the techniques evaluated increases original execution cycles (i.e., we do not exceed Tmax in any loop nest). Specifically, for each loop nest, the processor with the largest workload sets the limit for voltage/frequency scaling and processor shut-down. The ILP solver used in our experiments is lp_solve [7]. We observed that the ILP solution times with the application codes in our experimental suite varied between 56.7 seconds and 13.2 minutes. Considering the large energy savings, these solution times are within tolerable limits.

All the experimental results are obtained using the SIMICS simulation platform [13]. Specifically, we embedded in the SIMICS platform timing and energy models that help us simulate the behavior of the following four schemes: VS (pure voltage/frequency scaling based approach); SD (pure processor shut-down based approach); VS+SD (a unified approach that combines VS and SD); and CLUSTERING (the ILP-based approach proposed in this paper). The default simulation parameters used in our experiments are listed in Table 2. In the last three schemes, when a processor is unused in the current loop nest, it is shut down and its L1 instruction and data caches are placed into the low-power mode. The specific low-power mode employed in this paper is from [3].

Notation             Explanation
Job_Dynamic(j, v)    Dynamic energy for running job (workload) j at voltage v
Job_Length(j, v)     Time taken to run job j at voltage v
X(p, j, v)           Value is 1 if job j runs on processor p at voltage v
J                    Set of jobs
P                    Set of processors
Tmax                 Time deadline before which all jobs must finish
Jmax                 Total number of jobs to be executed
Pmax                 Total number of processors available
Vnum                 Total number of voltage (and frequency) levels available
Total_Energy         Total energy consumption of the system (to be minimized)
Leakage_Value        Leakage energy spent by a processor if it is not [...]

Simulation Parameter                           Value
Processor Speed                                400 MHz
Number of Processors                           8
Lowest/Highest Voltage Levels                  0.8V/1.4V
Number of Voltage Levels                       4
Instruction Cache                              8KB, 2-way associative, 32 byte blocks
Data Cache                                     8KB, 2-way associative, 32 byte blocks
Memory                                         32MB (banked)
Off-Chip Memory Access Latency                 100 cycles
Bus Arbitration Delay                          5 cycles
Replacement Policy                             Strict LRU
Cache Dynamic Energy Consumption               0.6 nJ
Memory Dynamic Energy Consumption              1.17 nJ
Leakage Energy Consumption for 32 bytes
  Normal Operation                             4.49 pJ
  Shut-Down State                              0.92 pJ
Resynchronization Time for Shut-Down State     30 msec
Resynchronization Time for Voltage Scaling     5 msec

Table 2: The default simulation parameters.

We used 8 array/loop-intensive applications for evaluating the four approaches mentioned above: 3D, DFE, LU, SPLAT, MGRID, WAVE5, SPARSE, and XSEL. 3D is an image-based modeling application that simplifies the task of building 3D models and scenes. DFE is a digital image filtering and enhancement code. LU is an LU decomposition program. SPLAT is a volume rendering application which is used in multi-resolution volume visualization through hierarchical wavelet splitting. MGRID and WAVE5 are C versions of two Spec95FP applications. SPARSE is an image processing code that performs sparse matrix operations, and finally, XSEL is an image rendering code. These C programs are written in such a fashion that they can operate on inputs of different sizes. We first ran these applications through our simulator without any voltage scaling or processor shut-down. This version of an application is referred to as the base version or the base execution in the remainder of this paper. The energy consumptions (which include energies spent in processors, caches, interconnects, and off-chip memory) under the base execution are 272.1mJ, 388.3mJ, 197.9mJ, 208.4mJ, 571.0mJ, 466.2mJ, 292.2mJ, and 401.5mJ for 3D, DFE, LU, SPLAT, MGRID, WAVE5, SPARSE, and XSEL, respectively. The energy results presented in this section are given as normalized values with respect to this base execution.

To calculate the dynamic energy consumptions for caches and memory, we used the Cacti tool [14]. We approximated the leakage energy consumption by assuming that the leakage energy per cycle for 4KB SRAM is equal to the dynamic energy consumed per access to 32 bytes of data from the same SRAM. Note that this assumption tries to capture the anticipated importance of leakage energy in the future, as leakage becomes the dominant part of energy consumption for 0.10 micron (and below) technologies for the typical internal junction temperatures in a chip. In the shut-down state, a processor and its caches consume only a small percentage of their original (per cycle) leakage energy. However, when a processor and its data and instruction caches in the shut-down state are needed, they need to be reactivated (resynchronized). This resynchronization costs extra execution cycles as well as extra energy consumption as noted in [3], and all these costs are captured in our simulations and included in all our results.

Our first set of results, the normalized energy consumptions with the different schemes, is presented in Figure 2. Each group of bars in this graph corresponds to an application, and the last group of bars gives the average results across all eight applications. The energy savings achieved by the VS scheme are not very large (6.55% on the average). There are two main reasons for this. The first one is the inherent characteristics of some applications. More specifically, when there are no long idle periods, VS is not applicable. The second reason is the limited number of voltage/frequency levels used in the default configuration (see Table 2). In comparison, the SD scheme behaves in a different manner. While it is not applicable in some cases (e.g., in applications DFE, MGRID, SPARSE, and XSEL), the energy savings it brings are significant in cases where it is applicable. VS+SD simply combines the benefits of the VS and SD schemes, reducing to VS when SD is not applicable. The average energy savings (across all eight applications) achieved by SD and VS+SD are 7.36% and 13.52%, respectively. The highest energy savings are obtained by our ILP-based approach, 22.65% on the average. These results clearly show the potential benefits of our ILP-based workload clustering approach.

Figure 2: Normalized energy consumptions.

Figure 3: The active and idle periods of processors in the mx3-raw.c routine from MGRID.

To better illustrate where our energy benefits are coming from, we give in Figure 3 the percentage of time each processor spends in the active and idle states for procedure mx3-raw.c, one of the thirteen subprograms in application MGRID. We see from this graph that our ILP-based approach is able to increase the number of idle processors. We observed similar trends with most of the other procedures in our applications. These results explain the energy benefits observed in Figure 2.

V. CONCLUSIONS

This paper proposes a workload clustering scheme for embedded MPSoCs that combines voltage scaling with processor shut-down. The uniqueness of the proposed unified approach is that it maximizes the use of processor shut-down by clustering workloads (jobs) in as few processors as possible. We tested this approach along with three alternate schemes using a simulation-based platform and eight embedded applications. Our experiments show that this clustering approach is very effective in reducing energy consumption and generates better results than the three alternative schemes evaluated.

VI. REFERENCES

[1] L. A. Barroso et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Proceedings of ISCA'2000.

[2] D. Brooks et al. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of ISCA, Canada, 2000.

[3] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy Caches: Simple techniques for reducing leakage power. Proceedings of ISCA, 2002.

[4] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU. Proceedings of the 1st ACM International Conference on Mobile Computing and Networking, November 1995.

[5] D. Grunwald, P. Levis, K. Farkas, C. Morreym, and M. Neufeld. Policies for Dynamic Clock Scheduling. Proceedings of OSDI'2000.

[6] I. Kadayif, M. Kandemir, and U. Sezer. An Integer Linear Programming Based Approach for Parallelizing Applications in On-Chip Multiprocessors. In Proceedings of DAC'2002.

[7] lp_solve. ftp://ftp.es.ele.tue.nl/pub/lp_solve/

[8] MAJC-5200. http://www.sun.com/microelectronics/MAJC/5200wp.html

[9] MP98: A Mobile Processor. http://www.labs.nec.co.jp/MP98/top-e.htm.

[10] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single Chip Multiprocessor. Proceedings of ASPLOS'1996.

[11] H. Saputra et al. Energy-Conscious Compilation Based on Voltage Scaling. Proceedings of ACM SIGPLAN Joint Conference LCTES'02 and SCOPES'02, Berlin, Germany, June 2002.

[12] Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Embedded Systems on Variable Speed Processors. Proceedings of the International Conference on Computer-Aided Design, November 2000.

[13] SIMICS. http://www.virtutech.com/simics/simics.html.

[14] S. Wilton and N. Jouppi. Cacti: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, May 1996.

[15] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark. Formal on-line methods for voltage/frequency control in multiple clock domain microprocessors. Proceedings of ASPLOS'2004.