- From: Aaron Knoll
- To: Solomon Boulos
- Subject: [Manta] Re: Re: Gcc __m128 register to/from stack
- Date: Thu, 21 Jan 2010 20:12:27 +0100
I second Solomon... in my experience I have found:
slowest:
_mm_set_ps
_mm_load_ps
slow:
union{ __m128 sse; float f[4]; }
and pointer arithmetic
fast:
_mm_shuffle_ps
fastest:
_mm_unpacklo_ps, _mm_unpackhi_ps, _mm_movehl_ps
You can do a lot of SSE magic with the bottom 3, it makes code a lot faster :)
-Aaron
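As an illustration of the kind of "SSE magic" the unpack and move ops allow, here is a minimal sketch of a 4x4 transpose that never leaves the XMM registers. The helper name transpose4x4 is hypothetical, and _mm_movelh_ps (the sibling of _mm_movehl_ps) is an addition that is not named in the list above. It assumes a compiler that ships xmmintrin.h.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: transpose four row vectors in place using only
// unpack/move ops, so no lane goes through a scalar or the stack.
static inline void transpose4x4(__m128 &r0, __m128 &r1, __m128 &r2, __m128 &r3)
{
    __m128 t0 = _mm_unpacklo_ps(r0, r1);  // r0[0] r1[0] r0[1] r1[1]
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  // r2[0] r3[0] r2[1] r3[1]
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  // r0[2] r1[2] r0[3] r1[3]
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  // r2[2] r3[2] r2[3] r3[3]

    r0 = _mm_movelh_ps(t0, t1);  // r0[0] r1[0] r2[0] r3[0]
    r1 = _mm_movehl_ps(t1, t0);  // r0[1] r1[1] r2[1] r3[1]
    r2 = _mm_movelh_ps(t2, t3);  // r0[2] r1[2] r2[2] r3[2]
    r3 = _mm_movehl_ps(t3, t2);  // r0[3] r1[3] r2[3] r3[3]
}
```

Whether this beats a shuffle-based version depends on the processor, so it is worth timing both on the target machine.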
On Jan 21, 2010, at 8:03 PM, Solomon Boulos wrote:
> This is almost certainly just an issue with filling up the vector from
> scalars. In other portions of code, you shouldn't see the same behavior.
> You can usually replace things like this with better _mm_shuffle_ps
> variants but this varies from system to system.
>
> On more recent processors you are also unlikely to notice much of a hit
> from these ops, but it is still good practice to avoid _mm_set_ps and its
> equivalents.
>
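To make the _mm_shuffle_ps suggestion concrete, here is a minimal sketch of two cases where a vector that might otherwise be assembled from scalars with _mm_set_ps can be built directly from data already in registers. The helper names splat_x and combine_low are hypothetical, and whether this is actually faster varies from system to system, as noted above.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: broadcast lane 0 of v to all four lanes
// without extracting it to a scalar first.
static inline __m128 splat_x(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

// Hypothetical helper: build (a[0], a[1], b[0], b[1]).
// The two low result lanes are picked from the first operand and
// the two high result lanes from the second operand.
static inline __m128 combine_low(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 1, 0));
}
```

Both helpers take and return __m128 by value, so the compiler is free to keep everything in XMM registers.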
> On Jan 21, 2010, at 10:09, "Li-Ta Lo" wrote:
>
> > Hi,
> >
> > I recently noticed that there are many instances of "inefficient?"
> > SSE code generated by GCC, for both Manta and my own SSE vector/matrix
> > library. For example, in Manta's RayPacket.o, you can find code that
> > loads/stores an XMM register from/to the stack with a movlps plus
> > a movhps instead of a single movaps.
> >
> >  c35:   45 0f 12 21       movlps (%r9),%xmm12
> >  c39:   45 0f 12 1a       movlps (%r10),%xmm11
> >  c3d:   45 0f 16 61 08    movhps 0x8(%r9),%xmm12
> >  c42:   45 0f 16 5a 08    movhps 0x8(%r10),%xmm11
> >
> > My small test program
> >
> > #include <xmmintrin.h>  // GCC header that defines __v4sf
> >
> > extern void print(const __v4sf &);
> >
> > int main()
> > {
> >     __v4sf a = { 1.0f, 2.0f, 3.0f, 3.0f};
> >     __v4sf b = { 3.0f, 2.0f, 1.0f, 0.0f};
> >     __v4sf c = a + b;
> >
> >     print(c);
> > }
> >
> > will generate code like this
> >
> >         .text
> >         .p2align 4,,15
> > .globl main
> >         .type   main, @function
> > main:
> > .LFB1859:
> >         .loc 1 21 0
> >         .cfi_startproc
> >         subq    $24, %rsp               #,
> > .LCFI1:
> >         .cfi_def_cfa_offset 32
> > .LBB17:
> >         .loc 1 25 0
> >         movaps  .LC3(%rip), %xmm0       #, tmp61
> >         .loc 1 27 0
> >         movq    %rsp, %rdi              #, tmp62
> >         .loc 1 25 0
> >         addps   .LC2(%rip), %xmm0       #, tmp61
> >         movlps  %xmm0, (%rsp)           # tmp61, c
> > .LVL0:
> >         movhps  %xmm0, 8(%rsp)          # tmp61, c
> > .LVL1:
> >         .loc 1 27 0
> >         call    print(float __vector const&)    #
> > .LBE17:
> >         .loc 1 28 0
> >         xorl    %eax, %eax              #
> >         addq    $24, %rsp               #,
> >         ret
> >         .cfi_endproc
> >
> > Does anyone know the reason behind this?
> >
> > Ollie