manta - [Manta] Re: Re: Gcc __m128 register to/from stack

Manta Interactive Ray Tracer Development Mailing List

[Manta] Re: Re: Gcc __m128 register to/from stack

From: Aaron Knoll < >
To: Solomon Boulos < >
Cc: " " < >, " " < >
Subject: [Manta] Re: Re: Gcc __m128 register to/from stack
Date: Thu, 21 Jan 2010 20:12:27 +0100

I second Solomon... in my experience I have found:

slowest:
_mm_set_ps
_mm_load_ps

slow:
union{ __m128 sse; float f[4]; }
and pointer arithmetic

fast:
_mm_shuffle_ps

fastest:
_mm_unpacklo_ps, _mm_unpackhi_ps, _mm_movehl_ps

You can do a lot of SSE magic with the bottom 3, and it makes code a lot faster :)
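
As a rough illustration of the ranking above (a minimal sketch, not code from Manta), the helpers below stay in registers by using _mm_movehl_ps, _mm_shuffle_ps and _mm_unpacklo_ps instead of going through a union or rebuilding a vector with _mm_set_ps. The names hsum, splat_lane2 and interleave_lo are hypothetical and exist only for this example.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: horizontal sum of the four lanes of v,
// computed without a round trip through memory.
static inline float hsum(__m128 v) {
    __m128 hi   = _mm_movehl_ps(v, v);      // (v2, v3, v2, v3)
    __m128 sum2 = _mm_add_ps(v, hi);        // lane 0 = v0+v2, lane 1 = v1+v3
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane 1
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd)); // lane 0 = v0+v1+v2+v3
}

// Hypothetical helper: broadcast lane 2 of v to all four lanes,
// rather than extracting a scalar and calling _mm_set_ps / _mm_set1_ps.
static inline __m128 splat_lane2(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
}

// Hypothetical helper: interleave the low halves of a and b,
// giving (a0, b0, a1, b1).
static inline __m128 interleave_lo(__m128 a, __m128 b) {
    return _mm_unpacklo_ps(a, b);
}
```

Whether this actually beats the union or _mm_set_ps version depends on the compiler and the processor, so it is worth checking the generated assembly and timing it.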

-Aaron

On Jan 21, 2010, at 8:03 PM, Solomon Boulos wrote:

> This is almost certainly just an issue with filling up the vector from
> scalars. In other portions of code, you shouldn't see the same behavior.
> You can usually replace things like this with better _mm_shuffle_ps
> variants but this varies from system to system.
>
> On more recent processors you are also unlikely to notice much of a hit
> from these ops, but it is still good practice to avoid _mm_set_ps and its
> equivalents.
>
> On Jan 21, 2010, at 10:09, "Li-Ta Lo" < > wrote:
>
>> Hi,
>>
>> I recently noticed that there are many instances of "inefficient?"
>> SSE code generated by GCC, for both Manta and my own SSE vector/matrix
>> library. For example, in Manta's RayPacket.o, you can find code that
>> loads/stores an XMM register from/to the stack using a movlps plus
>> a movhps instead of a single movaps.
>>
>>    c35:       45 0f 12 21             movlps (%r9),%xmm12
>>    c39:       45 0f 12 1a             movlps (%r10),%xmm11
>>    c3d:       45 0f 16 61 08          movhps 0x8(%r9),%xmm12
>>    c42:       45 0f 16 5a 08          movhps 0x8(%r10),%xmm11
>>
>> My small test program
>>
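>> #include <xmmintrin.h>  // GCC's header defines the __v4sf vector type
>>
>> extern void print(const __v4sf &);
>>
>> int main()
>> {
>>   __v4sf a = { 1.0f, 2.0f, 3.0f, 3.0f};
>>   __v4sf b = { 3.0f, 2.0f, 1.0f, 0.0f};
>>   __v4sf c = a + b;  // element-wise add via GCC vector extensions
>>
>>   print(c);          // passing by reference forces c onto the stack
>> }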
>>
>> will generate code like this
>>
>>       .text
>>       .p2align 4,,15
>> .globl main
>>       .type   main, @function
>> main:
>> .LFB1859:
>>       .loc 1 21 0
>>       .cfi_startproc
>>       subq    $24, %rsp       #,
>> .LCFI1:
>>       .cfi_def_cfa_offset 32
>> .LBB17:
>>       .loc 1 25 0
>>       movaps  .LC3(%rip), %xmm0       #, tmp61
>>       .loc 1 27 0
>>       movq    %rsp, %rdi      #, tmp62
>>       .loc 1 25 0
>>       addps   .LC2(%rip), %xmm0       #, tmp61
>>       movlps  %xmm0, (%rsp)   # tmp61, c
>> .LVL0:
>>       movhps  %xmm0, 8(%rsp)  # tmp61, c
>> .LVL1:
>>       .loc 1 27 0
>>       call    print(float __vector const&)    #
>> .LBE17:
>>       .loc 1 28 0
>>       xorl    %eax, %eax      #
>>       addq    $24, %rsp       #,
>>       ret
>>       .cfi_endproc
>>
>> Does anyone know the reason behind this?
>>
>> Ollie
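
One way to narrow this down (a sketch under stated assumptions, not an answer from the thread) is to spell out the aligned load and store with intrinsics and compare the assembly. _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses and typically map to movaps, though the compiler is free to fold or reorder them. The function name add_store is hypothetical, and the inputs in the usage note reuse the values from the test program above.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: out = a + b for four floats.
// Assumption: a, b and out all point to 16-byte-aligned storage;
// _mm_load_ps / _mm_store_ps have undefined behaviour otherwise.
static inline void add_store(const float* a, const float* b, float* out) {
    __m128 va = _mm_load_ps(a);             // aligned 128-bit load
    __m128 vb = _mm_load_ps(b);             // aligned 128-bit load
    _mm_store_ps(out, _mm_add_ps(va, vb));  // aligned 128-bit store
}
```

Calling it with alignas(16) float arrays holding { 1, 2, 3, 3 } and { 3, 2, 1, 0 } and compiling with -S shows whether the store is still split into movlps/movhps for a given GCC version and -mtune setting.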

[Manta] Gcc __m128 register to/from stack, Li-Ta Lo, 01/21/2010
- [Manta] Re: Gcc __m128 register to/from stack, Solomon Boulos, 01/21/2010
  - [Manta] Re: Re: Gcc __m128 register to/from stack, Aaron Knoll, 01/21/2010
