Minutes of the GCC for IA-64 Summit

Held June 6, 2001 at Hewlett Packard in Cupertino, California, USA

Minutes were compiled by Janis Johnson based on her notes and updated
with corrections from other summit participants.


Executive Summary
-----------------

This one-day summit brought together a diverse group of people who share
an interest in improving GCC (the GNU Compiler Collection) for IPF (the
Itanium Processor Family, the preferred name for IA-64).  Participants
included experienced GCC developers, both with and without experience
supporting IPF; three members of the GCC Steering Committee; new GCC
developers interested in IPF support; developers of proprietary IPF
compilers who want to share their experiences in order to improve IPF
support in GCC; people interested in long-term infrastructure changes to
improve GCC for all platforms; and people from large companies who came
to learn about GCC and its current and potential viability on IPF.
There were 27 people present in Cupertino, and 10 participated by
telephone.

Those of us who organized the summit are extremely pleased with the
wealth of information that was discussed, the lists of short-term and
long-term projects that were proposed, and the opportunity we had to
meet our far-flung colleagues.


Table of Contents
-----------------

   Welcome
   Who's who
   Current state of IA-64 support in GCC
      Completed work
      Unfinished and/or abandoned work
      Current shortcomings
      Things tried without success
      Discussion
   HP IPF Compiler
      Talk by Suneel Jain
      Discussion
   Intel IPF compiler
      Talk by Bill Savage
      Discussion
   SGI Pro64 compiler
   IBM IPF compiler
      Talk by Jim McInnes
      Discussion
   Long-term infrastructure improvements to GCC
      Talk by Brian Thomson
      Discussion
   GCC projects to improve IA-64 support
      Short-term IPF optimizations
      Long-term optimizations or infrastructure changes
      Tools: performance tools, benchmarks, etc.
   Where do we go from here?


Welcome
-------

Gary Tracy of Hewlett Packard welcomed us to the summit.  The goal of
the summit is to discuss optimization ideas and divide them into
short-term (those that can be done without major surgery to GCC) and
long-term (those that require infrastructure changes).  We will have
been successful if GCC IPF code is "good enough" by the end of calendar
year 2001.

Gary asked other participants to introduce themselves and for one member
of each group to explain the purpose of the group.


Who's who
---------

[The order here is different from how people introduced themselves at
the summit so that people from the same team can be listed together
and teams from the same company are adjacent.]

Gary Tracy ; HP
Reva Cuthbertson 
Steve Ellcey 
Jessica Han 
Robin Sun 
Jeff Okamoto 
   Gary manages this group that ports software from the open source
   community to HP-UX.  Unlike most other participants at the summit,
   their customers are developers that target HP-UX rather than Linux.
   Reva, Steve, and Jessica are porting GCC for IA-64 to HP-UX; unlike
   the Linux version of GCC, the HP-UX version of GCC on IA-64 supports
   ILP32 as well as LP64.  Robin is the group's Quality Engineer and
   Jeff handles open source tools other than GCC.

David Mosberger ; HP Labs (via telephone)
Stephane Eranian 
Hans Boehm 
   David and Stephane are involved in Linux kernel changes for IPF.
   David is the official maintainer for Linux on IPF and a core member
   of the Linux on IA-64 project, and worked on the initial port of GCC
   to IA-64.  Hans has recently worked on the Java run-time in GCC, some
   of which has been IA-64 specific.

Suneel Jain ; HP
   Suneel is involved with the HP compiler for IPF and wants to help GCC
   developers leverage his experience with IPF.

Vatsa Santhanam ; HP
   Vatsa manages compiler work at HP and is interested in helping to
   improve GCC for Itanium.

Sassan Hazeghi ; HP
   Sassan is the project manager for HP's C++ compiler team.

John Wolfe ; Caldera (via telephone)
Dave Prosser  (via telephone)
   John and Dave don't yet know just what they'll be doing with GCC or
   how much support they have from their employer to do GCC
   improvements.

Gerrit Huizenga ; IBM LTC (via telephone)
Gary Hade  (via telephone)
Steve Christiansen 
Janis Johnson 
   The purpose of this team within IBM's Linux Technology Center is to
   improve Linux on IA-64 in preparation for IBM products based on
   IA-64.  This team is not as constrained in what it works on as some of
   the other participants, and is working on GCC because improving the
   compiler's generated code was identified as a good way to improve
   application performance.  The team has a goal of working on overall
   system and application performance, beginning with short-term
   projects that can show improved performance within 6 months to 2
   years.  Gerrit, a member of the Linux on IA-64 project (formerly
   known as Trillian), leads the team.  Steve, Janis, and Gary are
   former Sequent compiler people who are coming up to speed on GCC but
   already have some experience with IA-64.

David Edelsohn ; IBM Research (via telephone)
   David is on the GCC Steering Committee and is the co-maintainer of
   the PowerPC port of GCC.  He is involved in efforts to improve the
   GCC infrastructure.

Brian Thomson ; IBM (via telephone)
   Brian manages a team that is responsible for application development
   tools for Linux.

Jim McInnes ; IBM (via telephone)
   Jim is in IBM's compiler group in Toronto and spent a couple of
   years adding IA-64 support to IBM's Visual Age compilers.

John Sias ; University of Illinois (via telephone)
   John is involved with the Impact compiler and works as an intern with
   Jim McInnes in the IBM compiler group.

Richard Henderson ; Red Hat
Jim Wilson 
   Richard is in the GCC core maintainer group and has been working on
   the GCC IA-64 port for 18-24 months.  Jim is the main GCC maintainer
   for the IPF port and a member of the Linux on IA-64 project.  Jim has
   concentrated on support work while Richard has done more feature
   work.  Jim is a member of the GCC Steering Committee.

Richard Kenner ; Ada Core Technologies
   Richard has been heavily involved with GCC work, but not with IA-64.

Suresh Rao ; Intel MicroComputer Software Labs
Bill Savage 
   Bill is co-manager of the Intel Compiler Lab, which develops C/C++ 
   and Fortran compilers for IA32 and IPF.  Suresh manages the Compiler
   Product portion of the Intel Compiler Lab and is responsible for the
   Intel Linux compilers.

Pete Kronowitt ; Intel
Chuck Piper 
Tracey Erway 
Dawn Foster 
   Pete manages a group at Intel that works with Linux distributors.
   He attended the summit to learn about GCC issues that will help him
   ensure that Intel's resources are distributed correctly.  Chuck
   manages the Performance Tools Enabling Team, which does marketing for
   Intel tools including the Intel compiler.  Tracey manages a marketing
   group with a focus on 3rd party core tools including compilers.  Dawn
   works for Tracey on 3rd party tools enabling, particularly Linux and
   GCC.

Bruce Korb 
   Bruce has worked with GCC but not for IA-64.  He's the maintainer
   of fixincludes.

Ross Towle ; SGI
   Ross manages the group that does compilers and tools, including SGI's
   Pro64 compiler.  His group is currently more interested in C++ and
   Fortran than C, and in technical computing.

Con Bradley ; SuperH
   Con is interested in short-term changes to GCC that will help IPF and
   architectures with similar features.

Waldo Bastian ; SuSE
   Waldo is a KDE developer.  He attended the summit to take information
   back to GCC developers at SuSE.

Mark Mitchell ; Code Sourcery
   Mark is the release manager for GCC 3.0 and a member of the GCC
   Steering Committee.  He has been involved with high-level GCC work,
   including the C++ ABI, and is interested in long-term infrastructure
   changes to GCC.

Sverre Jarp ; CERN (via telephone)
   Sverre is a member of the Linux on IA-64 project.


Current state of IA-64 support in GCC
-------------------------------------

All of this information is from Jim Wilson and Richard Henderson unless
otherwise indicated.


Completed work
--------------

 - Initial port of GCC to IA-64;

   This was done by David Mosberger.

 - Constructing control flow graphs (CFGs) and doing optimizations
   based on them

   This provides an interface between the front end and middle end.
   The front end builds an entire function in tree format and then
   converts the entire function to RTL.  The C front end doesn't yet do
   anything to optimize at the high level, but the C++ front end has
   function inlining that uses trees.

   The C++ inliner should be merged to work with a C front end; that
   wouldn't be too much work.  Right now, any function that takes a
   complex argument won't be inlined, but with this change it could be.

 - A little bit of dependence analysis

   Red Hat is planning to use this information in loop dependence
   analysis.

 - New optimization pass for predicated instructions (if-conversion)

 - Work on the scheduler, to bundle and emit stop bits at the same time
   as scheduling


Unfinished and/or abandoned work
--------------------------------

 - A new pipeline description model and a new scheduler that uses it,
   with support for software pipelining

   This is for resource-constrained software pipelining and helps loops
   that can't be modulo scheduled.  The new scheduler is in the Cygnus
   source tree and has not been merged into the FSF tree.  Red Hat 
   believes that this new scheduler is the right way to go long-term,
   but it is not yet ready.  It is a very large piece of work that shows
   only 1-2% performance improvement for Itanium.  Red Hat is using it
   from the Cygnus tree for MIPS and PowerPC and for several other
   embedded processors.

   The new scheduler can do incremental updates of its state.  In theory
   it should be a lot faster than the Haifa scheduler (for compile
   time), but in practice it didn't prove to be.  The model was much
   larger, so the theoretical speed gains could have been eaten up by
   extra work caused by the larger size.

   The current pipeline model can indicate the duration of an
   instruction, but not really enough information for IPF instruction
   scheduling.  The new model does have enough information.

   The FSF GCC sources use the Haifa scheduler, but only as a moderately
   sophisticated list scheduler.  The IA-64 scheduler has lots of tweaks
   outside of that, including bundling.

 - A new pass to use more registers to avoid an anti-dependence stall in
   Itanium that David Mosberger discovered.

   By default the register allocator uses as few registers as possible,
   which isn't always the right thing to do.  Using more registers in this
   pass resulted in very measurable speed-ups of 5-6% on IPF.


Current shortcomings
--------------------

 - First in line among the things that need to happen is language
   independence in the tree representation.  Someone at Red Hat in Toronto
   is working on
   adding SSA form.  Multiple IL levels are necessary so we don't lose
   so much information so soon.

 - The machine description could be improved

   It currently doesn't describe dependencies between functional units,
   or resources that must be claimed at some cycle other than cycle
   zero, e.g. needing the memory unit at cycle two.

   Currently GCC tweaks partial redundancy code and adds fixup code.
   There's not enough pressure on units to show improvement.  The code
   knows when it's decreasing the number of instructions but doesn't
   know about decreasing the number of bundles instead.

 - Missing high-level loop optimization

   Everything above the expression level is language-specific; that
   needs to be changed.

   Vatsa asked if GCC synthesizes post-increments on the fly.  Richard
   Henderson said that post-increments are done in the loop code of GCC,
   generated as part of induction variable manipulation.  The second
   part is that regmove.c should generate post-increment but doesn't do
   a very good job.  Some source in the Cygnus tree tries harder and
   should do better but hasn't in practice and is buggy.  Red Hat will
   remove it shortly and use a form of GCSE instead.  They tweak partial
   redundancy code to not delete certain code in some places, then
   optimize later.  Post-increment hasn't been seeing much improvement
   yet.

 - Weak on loop optimizations

 - No pipelining in the FSF sources; there is in the new scheduler in
   the Cygnus tree

 - No rotating registers

 - No prefetching support

 - Control, data speculation


Things tried without success
----------------------------

[Gary Tracy: Potentially there is great value in "known failure paths":
things that were tried and taught us what we should never try again.]

 - Control speculation tied into Haifa scheduler

   The scheduler is supposed to handle this but it exposed the fact that
   alias analysis in GCC during scheduling is extremely weak; it can
   even lose track of which addresses are supposed to come from the
   stack frame and so it would speculate way too much.  This project,
   though, was tried quickly, and maybe it could be done successfully
   with more time, e.g. 2 months rather than a week.

 - The new scheduler

   Alias analysis is a general infrastructure problem; GCC has no
   knowledge of cross-block scheduling.

   Richard Henderson thought of a scheme that could be done in 4-6 weeks
   using the existing alias code to keep information disambiguated.  The
   problem is that GCC drops down the representation practically to the
   machine level, so the compiler just sees memory with a register base.

   Alias analysis in GCC is weak in general and is even weaker on IA-64.

   Gary Tracy asked if the new scheduler resulted in slower code rather
   than incorrect code; the answer is yes.

   Richard Kenner is planning something that could help here, linking
   each MEM to the declaration it's from so that alias analysis can know
   that two MEMs from different declarations can't conflict.  This will
   also allow other things to be specified in a MEM, like alignment,
   which was his original motivation.

   Vatsa asked if the register allocator is predicate aware; the answer
   is no.

 - Data speculation work

   There was one patch, but it was never reviewed.  There currently is
   not enough ILP to make completion of this patch worthwhile.

 - Control speculation work

   Bernd Schmidt might have an unfinished patch that could be picked up.


Discussion
----------

Richard Henderson:  Almost all of the things that didn't show
   performance improvements didn't work because the GCC infrastructure
   didn't support them.  A lot of work was done right out of the Intel
   optimization guide, and presumably those changes were successful in
   other compilers.

Richard Kenner:  We should get an idea of how much speedup an
   optimization should provide, with a quantitative answer for each
   optimization, to know where to spend time if we could get ILP.  It
   would also help if we could know how to get ILP, but we might not be
   able to separate them.  We should collect empirical results.

Bill Savage:  He and Suneel should be able to help us see what pays off
   on Itanium.  ILP is only one thing to exploit.  He has some numbers,
   but they'll vary by infrastructure.

Ross Towle: Reported some experiences SGI customers have had using GCC (gcc
   and g++ on ia64 and ia32) that were not performance related.  They
   found it hard to get programs to work at anything other than -O0, for
   code that works on a number of other platforms but doesn't work when
   compiled with gcc on ia64.  This is proprietary code for which it is
   not easy to submit problem reports.

SuSE and HP have successfully built packages other than those included in
a Linux distribution, so this is not a generally-seen problem.

Gary Tracy: This is an important issue but not one to cover here.

someone:  What kind of code do we want to improve the performance of?

most:  System code and applications that are primarily integer

Ross Towle:  SGI is more interested in technical computing with floating
   point code.

Bill Savage: IPF excels at floating point code, and Bill put in a strong
   plug for supporting good floating point performance.

Suneel Jain:  GCC does not support Fortran-90, so technical computing
   isn't as much of an issue for GCC.

Richard Henderson: There is lots to do to succeed in technical
   computing.  GCC is so far off the mark now that it's years away from
   being there.

In general, different users and different markets dictate different
priorities.


[break while people figured out how to set up the projector; lunch had
been brought in, so some people started eating]


HP IPF Compiler; talk by Suneel Jain
------------------------------------

[Suneel's slides]

Key IPF Optimizations in HP Compilers

 o Predication
   - If-conversion before register allocation, scheduling
   - Implies predicate aware dataflow and register allocation
 o Control and data speculation
   - Control speculation more important
   - Generation of recovery code
 o Cross block instruction scheduling
   - Region based or Hyperblocks
 o Profile based optimization
   - Affects region formation, code layout, predication decisions
 o Data prefetching

Other Optimizations

 o Accurate machine model for latencies, stalls
   - Handling of variable shifts, fcmp
   - Register allocation to avoid w/w stall
   - Nop padding for multimedia instructions
 o Template selection, intra-bundle stops
 o Post-increment synthesis, esp. for loops
 o Sign-extension, zero-extension removal
 o Predicate optimizations

Strong Infrastructure

 o High level loop transformations
 o Good aliasing and data dependency information
 o Cross module optimizations
 o SSA based global optimizations
 o Accurate machine modelling
 o Support for predicated instructions, PQS.

Application domain specific issues

 o Commercial Applications
   - Instruction and Data cache/TLB misses
   - Profile based optimization
   - Large pages
   - Selective inlining
 o Technical Applications
   - Loop optimizations
   - Data prefetching
   - Modulo Scheduling

[notes from his talk]

HP had a compiler for PA-RISC, but decided to write a brand-new back end
because of all the things needed for IPF.

The key to good optimization for IPF is the ability to expose ILP.
Having a scheduler that is able to expose ILP is the reason for having
the other optimizations.  The predication and speculation features
significantly increase the number of instructions that can be fed to the
scheduler in parallel.

The compiler is not yet at an optimal point for predication.  There's a
danger in doing too much; there's a penalty for i-cache misses if it's
too aggressive, which causes a degradation in performance.  Currently it
uses lots of tuning and several tweaking heuristics.  Predication is not
an easy feature to use even if you have the infrastructure.

It's best to do predication before scheduling and register allocation.
Loop unrolling in GCC will require infrastructure changes in terms of
allowing the RTL to recognize predicate registers and whether
definitions are killed or not.  Lots of things are important here.  You
do want to use predication for simple if-then-else cases early, but not
be more aggressive.
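
[As a concrete picture of the simple if-then-else case mentioned above, the
kind of source an if-converter turns into branch-free code, with the
assignment executed under a predicate instead of behind a branch; the
function is made up for illustration:]

   int clamp(int x, int limit)
   {
       /* A branchy form the compiler can if-convert: compute the compare,
          then execute the assignment only if its predicate is true.  */
       if (x > limit)
           x = limit;
       return x;
   }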

Control speculation is more important than data speculation.  It needs
cross-block scheduling, since the compiler doesn't see the opportunity
or need within a basic block.  Both require generating recovery code,
which introduces new instructions and new register definitions and uses.
It might be difficult to build in.

For cross-block instruction scheduling, the compiler must identify
regions to concentrate on and then schedule the whole region at once.
There are also related optimizations, like if-conversion.  You can
extend scheduling for basic blocks to hyperblocks, which is a reasonable
approach for IPF.

Profile-based optimization is very important for IPF code, especially
integer code, as shown by measurements HP has done.

Data prefetching provided the biggest wins in technical computing
applications.  David Mosberger saw some gains in the Linux kernel.

It's important to validate that the machine model in GCC is accurate.

Template selection and intra-bundle stops should be integrated into the
scheduler rather than done in a separate pass.  This allows a robust
mechanism to find the best layout for templates and minimize the number
of nops.

Cross-module optimizations reduce the overhead of calls, build a larger
scope over which to generate code, and expose more ILP.

Discussion
----------

Richard Kenner:  Does HP have any quantitative measures of the benefit
   of each optimization?

Suneel Jain:  They all go together.  It doesn't help to have only one of
   them.  A compiler needs good aliasing infrastructure.  Profile-based
   optimization is a key factor.  The instruction scheduler must be
   aware of predication.

Vatsa Santhanam, Bill Savage:  Code locality is even more important for
   this architecture than for others where it shows a benefit.

Ross Towle:  Aggressive code motion allows data speculation to be more
   important than control speculation.


Intel IPF compiler; talk by Bill Savage
---------------------------------------

[Bill's slides]

Compiling for IPF; High Level View

General Purpose Computing

 o Characteristics
   - Very large, with few or no hot spots.
   - Heavily dependent on i-cache / TLB for perf.
 o Code size and locality most significant
 o To get the best performance, compiler needs
   - profile-guided optimization
   - code-size sensitive scheduling, bundling
   - interprocedural optimization

Profile Guided Optimization

 o Instruction cache management
   - block ordering -- move cold blocks to end of functions
   - function ordering -- move cold functions to end of executable
   - function splitting -- move cold blocks out to end of executable
 o Branch elimination and branch prediction
   - block ordering -- make better use of static heuristics
   - indirect branch profiling -- especially for C++
   - predication

Interprocedural Optimization

 o Improves TLB, i-cache
   - Dead static function elimination
   - Better function ordering
   - Inlining of small functions

Technical Computing

 o Characteristics
   - Large data sets, loops, hot spots
   - Heavily dependent on d-cache / TLB
 o Loop scheduling and data locality most important
 o To get best performance, compiler needs
   - Software pipelining
   - Loop transformations and prefetch insertion
   - Interprocedural optimizations

Software Pipelining

 o Get the most from loop computations
 o Overlap multiple iterations
 o Works best with moderate-to-large number of iterations
 o Requires loop dependence information

Loop Transformations, Prefetching, IP Optimization

 o Loop transformations
   - Enhance data locality by reordering accesses
   - Requires data dependence analysis, cost model
   - Highest benefits
     o  interchange, distribution, fusion
 o Prefetching
   - Generate request to move data that overlaps other computation
 o Interprocedural optimizations
   - MOD/REF analysis important, data promotion

[notes from his talk; Janis was eating lunch and didn't get many notes.]

The value of profile guided optimization is mostly for code locality.
This is the highest priority.

Interprocedural analysis is important.

The developers looked at how to get all the ILP available.  The most
aggressive predication is not the best predication.

There are some low-tech big hitters that can make a difference.  You can
compromise in the register allocator between using the fewest and most
registers.  GCC could do well for compiling the kernel and some
applications.

Loop transformations enable other things.

Suneel's discussion gave a complete list of what a good ILP compiler
needs.

[Bill showed a slide that showed the performance impact on SPEC, which
can't be shared because it was prepared with others under NDA.  In mail,
he said that the interesting data for the Intel compiler is:
  12% improvement from IPO on integer SPEC
  58% improvement from IPO on FP SPEC
He didn't have hard data on profile based optimizations, but
observations are that it provides an improvement of 10%-30%.]


Discussion
----------

Mark Mitchell:  Some of the work for profile-directed optimization is
   independent of the compiler itself and might be able to use
   technology from other compilers.

Richard Henderson:  Nobody real is using profile-directed optimization
   in GCC; he doesn't know how we can address that.

Bill Savage:  All database vendors will use profile-directed
   optimization if it's available.  Other vendors will use it depending
   on how much gain there is and how important performance is to
   them.

Janis Johnson:  Sequent's C compiler supported feedback-directed
   ordering of functions and Sequent's database customers used this
   extensively.

David Edelsohn:  Profiling the Linux kernel and libraries (for profile
   directed optimization) would be useful.

Vatsa Santhanam: HP profiles its kernel.

Jim Wilson:  Red Hat has some customers who have used profile-directed
   optimization, but it takes a long time to do a build, e.g. 16 hours
   rather than 4 hours, so it is not very practical for them.
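
   [For context, a sketch of why a feedback build roughly doubles the work:
   the program is compiled with instrumentation, run on a training workload,
   and then recompiled using the recorded counts.  The options below are
   GCC's long-standing arc-profiling flags; the source file and workload are
   made up for illustration:]

      /* app.c -- compiled twice in a profile-directed build:
       *
       *   1. gcc -O2 -fprofile-arcs -o app app.c      (instrumented build)
       *   2. ./app < training-input                   (records arc counts)
       *   3. gcc -O2 -fbranch-probabilities -o app app.c
       */
      int count_positive(const int *v, int n)
      {
          int i, count = 0;
          for (i = 0; i < n; i++)
              if (v[i] > 0)     /* branch whose probability the counts feed */
                  count++;
          return count;
      }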

Suneel Jain:  The build for a final release of a product might use
   different optimizations than development versions, and profile-
   directed optimization might be used only at the end.

Mark Mitchell:  Is there a way to get "sloppy data" for profile-directed
   optimization so that code can change slightly but the same profiling
   data can be used?

[this wasn't answered]

Suneel Jain:  It's useful to have pragmas for hints and to be used by
   static heuristics.

Jim Wilson:  We can build that information into the compiler.  GCC
   allows the developer to flag branch probability via __builtin_expect.

Hans Boehm:  That mechanism is limited.  If the user says to expect 0
   percent, it's not possible to give some other hint or override it.

Richard Henderson:  GCC uses the given hints and then unannotated
   branches are run through the usual heuristics.

Hans Boehm:  For some things it's useful to know that the branch will
   be taken 50% of the time, rather than never taken.  The problem
   really is that if one of the default heuristics says 0%, the user
   can't specify 50%.

Richard Henderson:  Currently GCC expects either 1% or 99% based on user
   information.

Bill Savage:  Annotations get out of date quickly and can give a false
   sense of security.  Intel has given up on using them.

Richard Henderson:  This (__builtin_expect) is used within the layout of
   a function. GCC does not split a function into multiple regions,
   which has been mentioned as a possibility.

   The compiler can put each function into a separate section (with the
   option -ffunction-sections) so that the linker could rearrange them.

   GCC does block ordering within the compiler.

Mark Mitchell:  It wouldn't be hard to do function ordering within a
   file.

Dave Prosser:  Fur [a proprietary tool from SCO pre-Caldera that permits
   some editing of code in object files, not ported to IA-64] is worth
   looking at for ordering across modules, which is the real challenge
   for IA-64.  For some applications fur can increase performance by
   30-40 percent, while for others it has no real effect.

someone: Would it be better to port fur to IA-64 or to put a
   profile-directed optimization framework within GCC?

Dave Prosser:  Certainly there are advantages that fur gets over
   profile-directed optimization, but we could get more through less
   effort by putting a profile-directed optimization framework in GCC.

Bill Savage:  Would Fur give the same benefits as profile-directed
   optimization, if we didn't change the GCC infrastructure?

[Dave Prosser provided this during the review of the minutes:]
   Fur has to figure out the intent of code -- a level or two above the
   mere job of disassembly.  Unlike with the IA-32 instruction set,
   IA-64 makes this job much, much harder.  Moreover, doing anything
   finer with code than at the function-at-a-time level with IA-64 for
   fur will also take a lot of effort.  There are tons of stuff to keep
   track of, instead of 5 or so registers.  For example, with IA-32 we
   can insert a call to a function (at the start of a basic block)
   without touching "anything" in the calling context.  Fur makes use of
   this to be able to do its simple instrumenting.  With IA-64, there is
   no simple instruction sequence that can be used in this way.  It's
   even possible to imagine the start of a basic block where every
   single register and predicate is live in IA-64, thus making it
   impossible to drop in any transfer code sequence, let alone a canned
   one.

   Thus, in the end, it seems *much* more feasible to get the
   per-compilation-unit benefits of more localized code by putting
   profile-directed optimization into the compiler before one ever
   starts to generate code, for IA-64.  Yes, fur can still serve as a
   function-at-a-time editing tool, but that isn't sufficient for good
   IA-64 code generation, and we can still use that level of fur on
   IA-64 on top of code generated based on feedback.
[end of Dave's addition]

Vatsa Santhanam:  What mechanisms are used in the GCC community to track
   performance?

GCC developers: laughter

Richard Henderson:  Nightly SPEC runs are posted to watch performance go
   up and down.  These are not completely ignored, and some people have
   fixed performance regressions shown by these results.

[SPEC runs are posted at http://www.cygnus.com/~dnovillo/SPEC/]

Richard Kenner:   We need to actually study generated code occasionally.
   He does diffs of generated assembly files and tries to understand
   the differences.

Mark Mitchell:  A key infrastructure problem is that we can't unit test
   different parts of the compiler.  We feed in source and get assembler
   out, and can't tell what changes come from each part of the compiler,
   or that the expected transformations were done.  There is no
   conceptual reason why this is impossible to do.

Vatsa Santhanam:  Do GCC developers have tools to help analyze changes?

Mark Mitchell:  It would be nice to know which part of the compiler
   caused worse performance.  Some benchmark loops do well on some chips
   but poorly on others.

Richard Kenner:  We could have regression tests for specific
   optimizations to recognize when they break.

someone:  Who would look at results and fix the problems?

Mark Mitchell:  Most funding for GCC has been for features, such as
   porting to new chips or adding new languages, rather than for
   particular optimizations or for ongoing maintenance.

[Vatsa Santhanam said during the review that he got the sense that GCC
developers do not have adequate tools to help analyze performance
regressions and opportunities at the assembly code level.]


[break for lunch, and to switch phone numbers for those dialing in]


SGI Pro64 compiler
------------------

Ross Towle:
   If he had brought slides they would look exactly like Suneel's. SGI
   has seen the same thing about what is important.

   The key point regarding infrastructure, going back to data
   dependence and data analysis, is being able to see memory references
   from the start; it doesn't work to derive this information later.
   In C and C++, the compiler needs information about subscripted
   references.  If it has that information then data dependence is that
   much more correct and less fuzzy.  Other optimizations fall out
   nicely if data dependence information is as perfect as it can be.


IBM IPF compiler; talk by Jim McInnes
-------------------------------------

[The following is from mail that Jim sent and which Gary Tracy copied
for those at the summit.  Comments like this were added by Janis from
her notes of additional things Jim said during his talk.]

Here is a list of things that we feel are important to do well if you
want to do well on IA64.  The list is more or less in descending order.
I'm restricting my attention to things that are IA64 specific and am
leaving out things that are platform independent.  I know little about
GCC, so these remarks are not targeted at any specific aspects of GCC.

0) Alignment.

   You probably already know that misaligned data references cost
   plenty.  Zero of them is a good number to have.

   I haven't had the impression that this is going to get better.

   Perhaps this is a bigger problem for us because PPC allows them.
   This can affect application performance by a factor of 25 to 100.
   This stopped our plans to try to exploit FP load quad word insns
   because they require 16 byte alignment and we don't always have the
   info to guarantee that.

   [Interactions with other optimizations means that this might not
   always be known.]

1) Good bundle aware local scheduling after register allocation.

   This pretty much has to be the last thing that looks at the code
   before it is written out.

   The code needs to be arranged into a stream in a way that maximizes
   dispatch bandwidth, i.e.

   i)   bundles and stop bits need to be correct and the number of stop
        bits needs to be minimal.
   ii)  Pipeline delays are met.
   iii) machine resources are not oversubscribed

   In order to do this the compiler must have knowledge about which
   predicate registers are disjoint, and this knowledge must survive the
   register allocation process.

   I think you will lose a lot by trying to separate the "pipeline
   awareness" from the "bundle awareness".

   This is good for all code and is essential for code size.

2) Software pipelining

   You need to do Modulo Scheduling using the special hardware provided.
   All instructions are fully pipelined to facilitate this.

   It is less important to get all the funky cases involving predicate
   registers used inside the loop etc.  We struggled to get every case
   correct.  This is very important for fp code.  While loops are also
   important, but for those you frequently need ...

3) Speculation.

   We did control speculation and not data speculation.  Both kinds are
   moderately dangerous because they can introduce debugging headaches
   for users of the compiler.  We found this useful - I have no
   particular comments about it.

   For us the speculation was driven by our Global scheduler [similar
   to the Haifa scheduler used in GCC].  We saw code in important
   applications that could have benefitted from very local data
   speculation.  In particular the dependent sequence:

      LD
      CMP
      BR   label x
      STORE
   label x:

   repeated many times.  In this case it is a large win to data-
   speculate the LD up at least one block and replace it with a LD.CHK.
   In the instance we saw, all the stores were through *char or *void
   and couldn't be disambiguated from the loads.  I think that this
   kind of local speculation where only one or two stores have a chance
   to invalidate the ALAT entry is less likely to incur penalties.

   [This kind requires less infrastructure.  Code motion out of loops,
   for example, couldn't be disambiguated from a load through a pointer;
   that might require major infrastructure changes.]

4) Predication.

   This is difficult.  My understanding is that great performance
   benefits can come from predicating, but this understanding didn't
   come from my work on IA64.  I think that you need to get many things
   right before this is effective. In particular:

   i)   The heuristics for deciding what to predicate need work.  Static
        heuristics seem hopeless, so I think that it should be disabled
        unless PDF [profile-directed feedback] is being used.  We used a
        fairly simple-minded scheme
        from one of the popular papers.  It was inadequate, in our view.
        [Most customers don't use PDF.]

        Real compilers have to pay attention to compile time as well as
        optimization - this might hinder efforts in this area.

        The biggest danger here is over predicating.  The big
        predication boosters always fail to mention that predicated
        instructions that get squashed are just extra path length.

   ii)  The register allocator has to be fully with the program and be
        able to assign registers optimally in predicated code.  I don't
        know all the problems here.

   iii) You need pretty good technology to represent relations among
        predicates.  At the very least you need to know accurately if
        two predicates are disjoint.  You will also need to make some
        predicates manifest in the code (that were previously only
        implicit) when you predicate.  Doing this  efficiently is also
        hard.  [The Intel IPF assembler requires a lot of information
        about branches to be explicit in branch expressions.]

5) Branch prediction.

   Our efforts in this area were hampered by bugs in early hardware, and
   we were never able to measure the benefit.  My understanding is that it
   is important.

[end of Jim's prepared talk.]

The IBM compiler targets a lot of platforms, but the IL is lower level.
The developers had trouble redirecting optimizations that deal with
addresses because IA-64 doesn't have base+displacement addressing.  It
was difficult to teach optimizations about addressing.  To minimize code
size, the compiler must make effective use of post-increment forms; this
was challenging.
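
[To make the post-increment point concrete, a small made-up C loop of the
kind a compiler would ideally turn into post-increment loads on IA-64,
bumping the address register in the load itself instead of with separate
add instructions:]

   int sum(const int *p, int n)
   {
       int total = 0;
       while (n-- > 0)
           total += *p++;   /* candidate for a post-increment load */
       return total;
   }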

The optimizations all interacted with each other, so the people working
on them had to work closely to get the optimizations to all work well
together.  Other platforms supported by the compiler allowed the
optimizations to be separate.


Discussion
----------

Richard Henderson:  Alignment is not really an issue for GCC; other
   platforms that it supports have similar issues, so it already keeps
   data aligned.
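
   [As an aside, when 16-byte alignment does need to be requested
   explicitly -- for example for the FP load quad word case Jim mentioned --
   GCC's aligned attribute is one way to do it; the variable is made up for
   illustration:]

      /* Ask GCC to place the array on a 16-byte boundary.  */
      static double coeff[64] __attribute__ ((aligned (16)));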

Mark Mitchell:  For GCC, it is an absolute requirement to target almost
   any crazy chip.  What were Jim's impressions working on IA-64 support
   in a retargetable compiler?   How much could be done in a target
   independent way, and how much was specific to IA-64?

Jim McInnes:  The IBM compiler has three phases of optimizing.  The
   first is inter-procedural and platform independent, and they didn't
   have to do much to that in order to get it to work with IA-64.  This
   phase uses a different IL from other phases.  The back end has two
   phases.  The first of these is bread-and-butter optimizations [a long
   list that I didn't record], getting closer and closer to what the
   code will be on the machine.  The goal was that for IA-64 those two
   phases would continue to be the same on all platforms.

   For IA-64, the compiler didn't optimize properly because
   post-increment had to be done early.  There were specific problems
   that required getting more platform-specific in the first back-end
   phase, but it didn't really affect most optimizations.

   The third phase is instruction scheduling and register allocation.
   Scheduling was pretty much rewritten, since the existing scheduler
   didn't work for IA-64.

   Allocating registers was not very hard, although they had trouble
   with understanding that you're spilling when doing calls and need to
   minimize the use of registers in a function.  The allocator tended to
   use a different register for every address, so it ended up using too
   many registers.

   Rotating registers were new to them and caused problems in prolog and
   epilog at first.

[John Sias said during the review:]  If there are insufficient registers
   available on the register stack when a function is invoked, the
   machine (at least Itanium) stalls to spill some entries into the
   backing store, making room for the new function's register frame.
   The register allocator needs to acknowledge that there's some cost in
   allocating additional stack registers because there's the danger of
   this hidden spilling.

Mark Mitchell:  What about loop optimization?

Jim McInnes:  This was totally unique, and was rewritten for IA-64 to be
   specific to that platform.

Mark Mitchell:  The optimization passes in GCC are the same for all
   platforms now.  Might we need to write different versions of some
   passes for IA-64?

Richard Henderson:  The pipeliner was originally written for RISC chips,
  and it's not an issue to use modulo scheduling.

Jim McInnes:  IBM found that modulo scheduling is profitable on RISC
   chips as well.

Richard Kenner:  No matter how something looks, you might find another
   platform later on to take advantage of some optimizations.

Jim McInnes:  Rotating registers are important.  The epilog counter
   is not that important.  One way or another you'll have a big chunk of
   code that is only used on IA-64.

Ross Towle:  Yes, modulo scheduling is important on other platforms as
   well.  It's necessary to use a very different register allocation for
   rotating registers; this is very mechanical, very simple.

[Vatsa Santhanam said during the review:]  I think the code example does
   not clearly illustrate the point being made in the text.
   Specifically, the LD in the code fragment is already positioned above
   the STORE and so the need for data speculation is not evident.  So
   either the LD was below the STORE to begin with and had to be data
   (and control) speculated or the above code sequence repeats
   *back-to-back* multiple times giving rise to the data speculation
   opportunity.


Long-term infrastructure improvements to GCC; talk by Brian Thomson
-------------------------------------------------------------------

Brian is soliciting support for long-term infrastructure improvements
to GCC.  He sees a real synergy between that effort and efforts like
this one.

Brian is working with vendors who have a dependence on GCC generating
good code, to get them signed up to support the effort.  They will lay
out requirements and then invite the GCC development community to offer
ideas.  This effort is broader than a single platform; the efforts will
probably help all processors, but some more than others.

Some of the changes needed for GCC are more fundamental, with broader
effect, involve more upheaval in existing code, and will take longer to
implement.  There will be a natural breakdown of work that comes out of
this summit for the two groups of effort.  Some of what comes out of
this summit will be input into Brian's effort with other system vendors.


Discussion
----------

Gerrit Huizenga:  Is there any list so far of what changes are proposed?

Brian Thomson:  There has been some discussion about specific items.
   The intention is to identify targets that we want to see improvement
   in; what kind of code, languages, and architectures, and then allow
   GCC designers to propose technology to do that.  He doesn't want to
   be prescriptive, but let people who own the solution provide that.
   For example, deciding whether and how to do multiple levels of IL as
   in the IBM compiler.

Richard Kenner: There are trees and RTL in GCC.  The tree representation
   is used for expression folding and inlining.

Richard Henderson:  Making tree-level optimizations language-
   independent is a high priority.  Red Hat has a person in Toronto who
   is working on SSA.  In general, GCC shouldn't throw away so much
   information so quickly.  Moving to multiple levels of IL is going to
   have to come before any longer-term projects can see any benefit.
   The first step there is to create a clean interface between the
   front end and the optimizer so we can reuse cool optimization
   technology for all languages.

Mark Mitchell:  The existing tree representation also needs changes.  It
   needs to be more regular and clean.

Richard Henderson:  This is a fair description of what he had in mind.
   The higher level RTL that Brian or Jim McInnes mentioned has been
   discussed several times in the last couple of years.

   Jeff Law has been doing some serious thinking on that subject and has
   started submitting some bits.  GCC has an SSA pass but it suffers
   from representational problems in the current RTL.  In certain
   situations it isn't reliable, so it is not turned on by default.
   Jeff has started attacking some of those issues so it can be turned
   on.

   As for longer term directions, Richard is not sure he has any.  When
   fundamental problems are resolved then the future direction is more
   based on desire and whatever else we identify that can best help
   performance.

Suneel Jain:  Is the goal of having a higher-level tree representation
   to do inter-procedural optimizations from information written to
   disk?

Mark Mitchell:  This question comes up a lot.  The sticky issue is that
   the FSF is morally opposed to doing this.  The aim of the FSF is not
   to produce the best compiler, but to convince the world that all
   software should be free.  The concern is that writing out the
   representation to disk would allow a vendor to use a GCC front end,
   write the IL to disk, read it back in, and graft proprietary code
   into GCC in a sneaky way to get around the GPL.  This is a very
   firmly held position in the FSF.

Richard Henderson:  Pre-compiled headers write some information to disk,
   but not in a way that is entirely accessible.  If performance
   speed-ups for that are as benchmarks have suggested (20x speedup in
   C++ compile time) then it will be almost impossible to disallow it.
   For this, though, the representation is very close to the source
   level and is not really a GCC internal representation.

Mark Mitchell:  Other related work to write out parts of the internal
   representation is inevitable.  Eventually the political issue
   might be weakened if this is important for the long-term viability
   of the compiler.

Richard Kenner:  We can write out some information and still get
   inter-module optimizations, e.g. information about register usage
   within a function.

Mark Mitchell:  We don't want to let vendors leverage GCC in their
   products.

Bill Savage:  Summary information could be used for analysis.

Richard Kenner:  We don't want a vendor to be able to use a GCC
   back end or front end.  It isn't clear whether an IL used in GCC
   could be GPLed, since it's not just the actual code but the methods
   that it uses.

David Edelsohn:  Is it useful to discuss specific optimizations, with
   long-term vs. short-term?

Richard Kenner:  There was a private discussion earlier about setting
   up a data structure with useful information about a MEM.

Mark Mitchell:  He would like to understand the state of the current
   profiling code in GCC.  We might be able to take advantage of that
   in places where GCC is now guessing.

Richard Henderson:  GCC currently collects trip counts off a minimal
   spanning tree, for how many times you went from this block to
   another block.  There are a couple of PRs in the GCC bug database,
   but it mostly works except for some computed gotos.

   He has built SPEC with it and it seemed to function.  He doesn't
   know how much performance improvement it gave, but it got data,
   which went back in and got attached to the right branches.

   We could use the information to improve linearization of the code,
   and sometimes for if-conversion to decide which side of the branch
   should be predicated.  It could also be used for delay slots.

   For profiling, GCC generates different code and increments counters
   inline.

Jim Wilson:  Some information can be computed after the fact from
   execution counts.

Bill Savage:  What we want is basic block profiling with extra
   instrumentation around loops.  The IR can be annotated with branch
   probabilities, with counts to guide the heuristics of all
   optimizations downstream.  Code locality is biggest payoff.

Jim Wilson:  Jim wrote this functionality when he first started at
   Cygnus about 11 years ago.  It might have been six years before it
   was approved.  It's online and usable but very few people use it.
   It can be used for profiling; profile-directed feedback is extra.

Bill Savage:  Someone really ought to look into using this.

Richard Henderson:  Block reordering is a year and a half old.  Before
   that it was used for branch hints.  He doesn't know what kind of
   performance help it gives.

Mark Mitchell:  Data prefetching might be simple to tackle.

Richard Henderson:  Jan Hubicka at SuSE did this.

Bill Savage:  There are two approaches: one for floating point that's
   complicated, one that's simple-minded that didn't hurt anything but
   didn't buy much.  There might be other techniques that are simple for
   integer computing.  Some techniques work well on linked lists if
   they are laid out so elements are contiguous.  This showed a good
   speedup on SPEC, but real applications might not work that way.
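
   [To make the idea concrete, here is the __builtin_prefetch form that
   later appeared in GCC; it is shown only as an illustration and was not
   necessarily available in the compilers discussed here.  The function and
   the prefetch distance of 16 elements are made up:]

      void scale(double *dst, const double *src, double k, int n)
      {
          int i;
          for (i = 0; i < n; i++) {
              /* Hint that data a few iterations ahead will be read soon.  */
              __builtin_prefetch(&src[i + 16], 0, 1);
              dst[i] = k * src[i];
          }
      }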

Suneel Jain:  He has done comparative SPEC runs and analysis with GCC,
   the Intel compiler, and SGI's Pro64 on IPF Linux, using default
   optimization levels, -O2, no special options.

   The Intel compiler was about 30% higher than GCC. GCC and Pro64 were
   comparable at -O2, although Pro64 was really bad at Perl, which
   brought down its average.

   The Intel and HP compilers are comparable at -O2.

   With peak numbers from HP and Intel, GCC gets about half the
   performance (GCC 1.0, HP and Intel 1.8).  This was two months ago
   with a GCC pre-3.0 version.

   If we want to focus on application performance then we should improve
   -O2 and not require special options.

   The difference for GCC runtime (from SPEC) was close to 10-15%,
   without profiling feedback.

Mark Mitchell:  Was there any analysis of where the differences came
   from?

Suneel Jain:  No.

Mark Mitchell:  We can put anything in -O2 we want.

Suneel Jain:  But not profile-based optimization.

Richard Henderson:  If our goal is to improve performance of a Linux
   system, then profile feedback is not where we should begin looking.

lots:  Why?

Richard Henderson:  Because we're not going to build the whole
   distribution that way.

Waldo Bastian:  Lots of people just use the distribution as they receive
   it so it would be useful for those people.

Richard Henderson:  We need more than a compiler that supports it, we
   need representative test cases.

Waldo Bastian:  But we can do that.  A project like KDE can build its
   own test cases if only the profiling tools are easy enough to use.

Richard Henderson:  It's not that hard to get the data out of the
   kernel.

Reva Cuthbertson:  The problem is getting accurate data.

Steve Christiansen:  You have to decide what workload to use.

Mark Mitchell:  Workload issues are important, but using a workload
   that is close wouldn't hurt other workloads.

Bill Savage:  Linux kernel performance should be a high priority.
   Behind that are general applications and database software, and
   behind that is technical computing, which is lower because there
   might be other compilers available for those applications.

Hans Boehm:  There might not be much opportunity in the kernel, since
   much of it is hand tuned.  David Mosberger might know whether there
   are problems shown by benchmarking.

Mark Mitchell:  The shell and the C library could benefit; they both
   have a lot of CPU usage.

Richard Kenner:  A webserver might be an interesting workload.  The
   shell is too simple.

Mark Mitchell:  It's interesting to profile a workload to see which
   processes are running.

Jim Wilson:  Itanium's performance monitoring registers let you see a
   lot of information.

David Edelsohn:  We should focus on uses of the system to guide which
   areas would benefit the most from performance improvements.

someone:  The kernel is important but has a lower priority than other
   parts of the system.

Mark Mitchell:  It's reasonable to focus on enterprise applications for
   Itanium.

Gary Tracy:  We need to have a list of projects and let people sign up
   for them.

Mark Mitchell:  There has been a project file for GCC for years and he
   can't remember any of those projects being done.

Gerrit Huizenga:  We should keep track of failed projects so that others
   don't go down the same path.

Mark Mitchell:  We should get detailed technical information from
   developers of other compilers so we don't need to start from scratch.

Richard Kenner:  It could slow things down if some of the people
   improving GCC are planning to patent their methods.  Is anyone
   planning to do that?

someone:  The various companies which have cross-licensed the compiler
   optimization patents could license them to the FSF.

David Edelsohn:  The cross-license does not include the right to
   sub-license the patents to others, such as the FSF.

Richard Kenner:  Companies can't do cross-licensing with the FSF because
   it doesn't have its own patents.

David Edelsohn:  Daniel Berlin has written a new register allocator
   based on a paper that touches on every register allocation patent,
   from lots of companies and universities.  That work will not go into
   GCC until the FSF decides what to do about the relevant patents.

Mark Mitchell:  It would be nice if big companies could do patent
   searches on behalf of the FSF.  Unintentional patent infringement is
   a potential risk to the open source community.


[Break]


GCC projects to improve IA-64 support
-------------------------------------

We had a brainstorming session, with lots of sidetracks into the items
being brought up, to divide potential enhancements into three
categories:

   short-term IPF optimizations
   long-term optimizations
   tools: performance tools, benchmarks, etc.

Items were written on large charts as they were brought up.  The
information below shows what was written on the charts and some of the
discussion about them.


Short-term IPF optimizations
----------------------------

 - alias analysis improvements

   Richard Henderson:  This work is self-contained and doesn't affect
   the rest of the compiler.  The idea is to track the origin of the
   memory when it is known, despite the memory reference being broken
   down.  Register+displacement addressing doesn't usually require this
   kind of information.  With IA-64 we start losing information
   immediately.

   Richard Kenner is already planning some work on tracking memory
   origin.

 - prefetching

   Richard Henderson:  There are existing patches to examine in the
   gcc-patches archive.  There is dependence distance code already
   checked into the compiler that no one uses; that information could be
   hooked into the loop unroller and the prefetcher and we might see
   improvements.

 - prefetch intrinsic

 - code locality; function order based on profiling

   Bill Savage:  Getting functions ordered requires interaction with the
   linker.

   Richard Henderson:  There was some work on such a tool, but it might
   be easiest to start from scratch.  The [GNU] linker has a scripting
   language that can tell it where to place functions.  The tool could
   almost be a shell script.

   Bill Savage:  This functionality requires more than a call graph.

   Hans Boehm:  There might be a problem that profiling tools don't work
   with threads.

   There is an article by Karl Pettis and Bob Hansen about how to order
   functions based on a call graph: "Profile guided code positioning",
   http://acm.proxy.nova.edu/pubs/articles/proceedings/pldi/93542/p16-pettis/p16-pettis.pdf

 - static function ordering

   SGI has a tool called CORD for code ordering that uses either static
   or dynamic information.

 - machine model

   Richard Henderson:  There is a good machine model from Vlad [Vladimir
   Makarov], but it was not submitted.  The current one isn't good
   enough for advanced scheduling.

 - improve GCC bundling of instructions

   Richard Henderson:  GCC currently uses an ad-hoc method of bundling;
   the machine model should guide it.

   Vatsa Santhanam:  Look at nop density.

 - selective inlining

   Mark Mitchell:  GCC with -O2 inlines functions that are declared as
   inline; -O3 will inline everything "small".  GCC could be smarter
   about how to inline.

   Vatsa Santhanam:  GCC could do profile-based inlining.

 - hook up to open source KAPI library (machine model description)

   Suresh Rao:  We can use it to build the machine model rather than
   using it directly.

 - control speculation for loads only

   Suneel Jain:  Speculation for loads doesn't need recovery code and is
   quite simple, with chk.s.

   someone: Recovery mode is not supported in Linux.

   Richard Henderson:  If you don't care about seg faults you don't even
   need the check.

 - region formation heuristics

   Richard Henderson:  We could rip out CFG detection, use regular data
   structures, and fix region detection.

[John Sias sent this during the review of the minutes:]
   Region formation is a way of coping with either limitations of the
   machine or limitations of the compiler / compile time.  "Regions" are
   control-flow-subgraphs, formed by various heuristics, usually to
   perform transformations (i.e. hyperblock formation) or to do register
   allocation or other work-intensive things.  For hyperblock formation,
   for example, region formation heuristics are critical---selecting too
   much unrelated code wastes resources; conversely, missing important
   paths that interact well with each other defeats the purpose of the
   transformation.  Large functions are sometimes broken heuristically
   into regions for compilation, with the goal of reducing compile time.

 - new Cygnus scheduler

   Richard Henderson:  This scheduler makes the compiler slower and
   doesn't always make code faster.  It was written by Vlad.

 - exploit the PBO (profile based optimization) capability that already
   exists in GCC

   Make sure it works and improve the documentation.

   Try it on the Linux kernel and discuss the information.

   Make the instrumentation thread-safe.

   Build gcc with feedback; but Mark Mitchell says that the time spent
   in gcc is mostly paging because it allocates too much memory.

 - straight-line post-increment

   non-loop induction variable opportunities

   Jeff Law is looking at post-increment work.

 - make better use of dependence information in scheduling

   Richard Henderson: This is very helpful and very easy.

 - enable branch target alignment

   It's necessary to measure trade-offs between alignment and code size.

 - alignment of procedures


Long-term optimizations or infrastructure changes
-------------------------------------------------

 - language-independent tree optimizations

   Richard Henderson:  Cool optimizations require more information than
   is available in RTL.  The C and C++ front ends now render an entire
   function into tree format, but it is transformed into RTL before
   going to the optimization passes.  We need to represent everything
   that is needed to be represented from every language.  Every
   construct doesn't need to be represented; WHIRL (SGI's IL) level 4 is
   about what he means.

   Mark Mitchell:  This is one of the projects he's wanted to do for a
   couple of years.  The IL needs to maintain machine independence
   longer.

 - hyperblock scheduling

   Richard Henderson:  This requires highly predicated code.

 - predication

   if-conversion, predication, finding longer strings of logical

   notion of disjoint predicates

   PQS (predicate query system); a database of known relationships
   between predicate registers

 - data speculation

 - control speculation

 - modulo scheduling

 - rotating registers

 - function splitting (moving function into two regions), for locality

   Richard Kenner:  This is difficult if an exception is involved.

   Vatsa Santhanam:  There might be synergistic effects with reordering
   functions for code locality.

   Jim Wilson: Dwarf2 is the only debugging format that can handle it.

 - optimization of structures that are larger than a register

   The infrastructure doesn't currently handle this.  This is related to
   memory optimizations.

 - make better use of alias information

 - instruction prefetching

 - use of BBB template for multi-way branches (e.g. switches)

   It might be difficult to keep track of this in the machine-
   independent part of GCC.

 - cross-module optimizations

   Avoid reloads of GP when it is not necessary.  The compiler needs
   more information than is currently available.

 - high-level loop optimizations

   This requires infrastructure changes.

 - C++ optimizations

   Jason Merrill invented cool stuff, e.g. thunks for multiple
   inheritance, that hasn't been done yet.

   It's possible to inline stubs.

 - "external" attribute or pragma

   This would be for information like DLL import/export; it is not
   machine independent.

   If GCC defined such an attribute, glibc would probably use it.

 - register allocator handling GP as special


Tools: performance tools, benchmarks, etc.
------------------------------------------

 - GCC measurements and analysis, comparison with other compilers

   Mark Mitchell:  It would be useful to compare performance using real
   applications, e.g. Apache and MySQL.

 - profile the Linux kernel

 - dispersal analysis

   Steve Christiansen has a dispersal analysis tool.  The output is
   similar to the comments in GCC assembler output with -O2 or greater,
   but it can be used on any object file and prints information at the
   end of each function with the number of bundles and nops.
   [Currently this uses McKinley rules and so would still be under NDA,
   but if there's interest, Steve could use Itanium rules instead.]

 - statistics gathering tool

 - PMU-based performance monitor

 - small test cases and sample codes for examining generated code

   These could come from developers of proprietary IPF compilers, who
   presumably have used such code fragments to analyze the code that
   their compilers generate.

 - compiler instrumentation that would cause an application to dump
   performance counter information


Where do we go from here?
-------------------------

Richard Henderson:  Any changes can go into the mainline CVS now, but
   there's no way to tell when there will be another release.

Mark Mitchell:  Perhaps GCC should go to a more schedule-driven release
   policy; he'll bring it up at the next steering committee  meeting.

Gary Tracy:  His group will be making commitments sometime in June.

Further communications should take place in the gcc mailing list
(gcc@gcc.gnu.org, archived at gcc.gnu.org).  Use "ia64" in the subject
line to allow people who are only interested in IA-64 work to search for
it.

Janis Johnson will merge items from the list above with the existing GCC
IA-64 wish list (at linuxia64.org) and get someone to add it to the GCC
project list.  People planning to work on a project can mail the gcc
list and record their plans in the projects file.