Text archives Help
- From: "Solomon Boulos" <boulos@cs.utah.edu>
- To: manta@sci.utah.edu
- Subject: [Manta] r2053 - in trunk: Core/Thread Core/Util Engine/Control Interface Model/Groups
- Date: Tue, 12 Feb 2008 15:05:59 -0700 (MST)
Author: boulos
Date: Tue Feb 12 15:05:56 2008
New Revision: 2053
Modified:
trunk/Core/Thread/AtomicCounter.h
trunk/Core/Util/CallbackHelpers.h
trunk/Engine/Control/RTRT.cc
trunk/Interface/Task.cc
trunk/Interface/Task.h
trunk/Model/Groups/DynBVH.cc
trunk/Model/Groups/DynBVH.h
Log:
This commit expands the maturity of the UpdateGraph and demonstrates
parallel BVH update and build. It's still not ready for primetime for
a number of reasons, the biggest of which is memory consumption. I am
pre-allocating memory like crazy and using placement new to try to
mitigate dynamic allocation (which isn't going to be successful all
over the place). In a future commit, I'd like to add a MemoryArena so
that Tasks, Callbacks, and TaskLists all eat from the MemoryArena
instead.
Details for the individual files follow.
Core/Thread/AtomicCounter.h
I lied about how operator += returns the new value. It actually
returns the old value (which I prefer anyway).
Core/Util/CallbackHelpers.h
Everyone loves adding more variants to the CallbackHelpers. Added
Static_0Data_5Arg, 1Data_4Arg and 1Data_5Arg.
Engine/Control/RTRT.cc
Adding some harness code to run per-frame rebuild. A warning during
compilation will let you know that you're using this (don't use it
yet)
Interface/Task.cc
Interface/Task.h
Reductions should tack the TaskList* to the start of the function so
that useful reductions over the Tasks can be performed. Intermediate
results should be placed in the scratchpad_data of each Task and then
can be merged during reduction.
Correspondingly, the Task scratchpad is now bigger to account for
parallel BVH build (need space for the 24 SampleBins)
Model/Groups/DynBVH.cc
Model/Groups/DynBVH.h
Mostly updates to add a parallel build and update through the UpdateGraph.
Updating all 1e-5f references to be T_EPSILON instead.
Changing num_nodes and nextFree to be AtomicCounter's so that
parallel build is easily integrated with serial build.
The parallel build works okay but eats up tons of memory to
pre-allocate all the Task, TaskList and callback data. I hope to
replace this with a Memory Arena. Note that per-thread memory is not
sufficient here since we need to allocate and delete the data from
potentially different threads.
Adding partitionObjects to do in-place partitioning of the object ids
(to avoid the STL vector push_backs we used to have). This was mostly
for laziness but caused a huge problem when trying to parallelize the
build (too much memory allocation and de-allocation). This also
improves the single thread build quite a bit.
No longer using the exact SAH when the number of objects is below a
threshold. Due to the way buildEvents works, it becomes a prohibitive
bottleneck during parallel build (lots of vector push back).
Modified: trunk/Core/Thread/AtomicCounter.h
==============================================================================
--- trunk/Core/Thread/AtomicCounter.h (original)
+++ trunk/Core/Thread/AtomicCounter.h Tue Feb 12 15:05:56 2008
@@ -101,7 +101,7 @@
// Increment the counter and return the old value
int operator++(int);
- // Increment the counter by the given value and return the new
+ // Increment the counter by the given value and return the old
// value. Like operator++ this doesn't return AtomicCounter& to
// avoid atomicity issues.
int operator+=(int);
Modified: trunk/Core/Util/CallbackHelpers.h
==============================================================================
--- trunk/Core/Util/CallbackHelpers.h (original)
+++ trunk/Core/Util/CallbackHelpers.h Tue Feb 12 15:05:56 2008
@@ -192,7 +192,7 @@
template<typename Arg1, typename Arg2, typename Arg3, typename Arg4,
typename Arg5>
class Callback_Static_0Data_5Arg : public CallbackBase_0Data {
public:
- Callback_Static_0Data_5Arg(void (*pmf)(Arg1, Arg2, Arg3, Arg4, Arg5),
+ Callback_Static_0Data_5Arg(void (*pmf)(Arg1, Arg2, Arg3, Arg4, Arg5),
Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5)
: pmf(pmf), arg1(arg1), arg2(arg2), arg3(arg3), arg4(arg4), arg5(arg5)
{
@@ -618,6 +618,53 @@
void (T::*pmf)(Data1, Arg1, Arg2);
Arg1 arg1;
Arg2 arg2;
+ };
+
+ template<class T, typename Data1, typename Arg1, typename Arg2, typename
Arg3, typename Arg4>
+ class Callback_1Data_4Arg : public CallbackBase_1Data<Data1> {
+ public:
+ Callback_1Data_4Arg(T* ptr, void (T::*pmf)(Data1, Arg1, Arg2, Arg3,
Arg4), Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4)
+ : ptr(ptr), pmf(pmf), arg1(arg1), arg2(arg2), arg3(arg3), arg4(arg4)
+ {
+ }
+ virtual ~Callback_1Data_4Arg()
+ {
+ }
+ virtual void call(Data1 data1)
+ {
+ (ptr->*pmf)(data1, arg1, arg2, arg3, arg4);
+ }
+ private:
+ T* ptr;
+ void (T::*pmf)(Data1, Arg1, Arg2, Arg3, Arg4);
+ Arg1 arg1;
+ Arg2 arg2;
+ Arg3 arg3;
+ Arg4 arg4;
+ };
+
+ template<class T, typename Data1, typename Arg1, typename Arg2, typename
Arg3, typename Arg4, typename Arg5>
+ class Callback_1Data_5Arg : public CallbackBase_1Data<Data1> {
+ public:
+ Callback_1Data_5Arg(T* ptr, void (T::*pmf)(Data1, Arg1, Arg2, Arg3,
Arg4, Arg5), Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5)
+ : ptr(ptr), pmf(pmf), arg1(arg1), arg2(arg2), arg3(arg3), arg4(arg4),
arg5(arg5)
+ {
+ }
+ virtual ~Callback_1Data_5Arg()
+ {
+ }
+ virtual void call(Data1 data1)
+ {
+ (ptr->*pmf)(data1, arg1, arg2, arg3, arg4, arg5);
+ }
+ private:
+ T* ptr;
+ void (T::*pmf)(Data1, Arg1, Arg2, Arg3, Arg4, Arg5);
+ Arg1 arg1;
+ Arg2 arg2;
+ Arg3 arg3;
+ Arg4 arg4;
+ Arg5 arg5;
};
// 2 Data
Modified: trunk/Engine/Control/RTRT.cc
==============================================================================
--- trunk/Engine/Control/RTRT.cc (original)
+++ trunk/Engine/Control/RTRT.cc Tue Feb 12 15:05:56 2008
@@ -74,6 +74,7 @@
#include <Core/Math/CheapRNG.h>
#include <Core/Math/MT_RNG.h>
+#include <Model/Groups/Group.h>
// NOTE(boulos): This is temporary to get the code to compile and
// allow us to discuss the interface (so that I don't have to do all
// the Factory work just yet)
@@ -641,6 +642,10 @@
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
+void PrintHello() {
+
+}
+
void RTRT::internalRenderLoop(int proc, bool lateComerFlag)
{
#ifdef MANTA_SSE
@@ -732,8 +737,22 @@
doSerialAnimationCallbacks(changed, proc, workersAnimAndImage);
doParallelAnimationCallbacks(changed, proc, workersAnimAndImage);
#else
- doUpdates(changed, proc, workersAnimAndImage);
+#warning "Using UpdateGraph instead of animation callbacks"
+ doSerialAnimationCallbacks(changed, proc, workersAnimAndImage);
+ doParallelAnimationCallbacks(changed, proc, workersAnimAndImage);
+ doUpdates(changed, proc, workersRendering);
barrier1.wait(workersRendering);
+ if (proc == 0) {
+ // NOTE(boulos): Use this code to rebuild the first Object in the
group every frame
+#if 1
+#warning "Sending update transaction for first Object in group"
+ // Temporary
+ Group* top_level_group = dynamic_cast<Group*>(getScene()->getObject());
+ addUpdateTransaction("Test Update Graph",
+ Callback::create(PrintHello),
+ top_level_group->get(0));
+#endif
+ }
#endif
if(!firstFrame){
@@ -1016,8 +1035,9 @@
unsigned long end = (proc+1)*totalCallbacks/numProcs;
ACallbackMapType::iterator iter = start+serialAnimationCallbacks.begin();
ACallbackMapType::iterator stop = end+serialAnimationCallbacks.begin();
- for(; iter != stop; iter++)
+ for(; iter != stop; iter++) {
(*iter)->call(proc, numProcs, changed);
+ }
}
void RTRT::doTerminationCallbacks() {
Modified: trunk/Interface/Task.cc
==============================================================================
--- trunk/Interface/Task.cc (original)
+++ trunk/Interface/Task.cc Tue Feb 12 15:05:56 2008
@@ -39,10 +39,10 @@
if (num_left == 0) {
// Call the reduction function
if (reduction_function)
- reduction_function->call();
+ reduction_function->call(this);
}
}
-void TaskList::setReduction(CallbackBase_0Data* new_reduction) {
+void TaskList::setReduction(ReductionCallback* new_reduction) {
reduction_function = new_reduction;
}
Modified: trunk/Interface/Task.h
==============================================================================
--- trunk/Interface/Task.h (original)
+++ trunk/Interface/Task.h Tue Feb 12 15:05:56 2008
@@ -16,7 +16,7 @@
typedef CallbackBase_1Data<Task*> TaskCallback;
Task(TaskCallback* task_function);
- static const unsigned int MaxScratchpadSize = 128;
+ static const unsigned int MaxScratchpadSize = 1024;
unsigned int task_id;
TaskList* task_list;
@@ -30,6 +30,7 @@
class TaskList {
public:
+ typedef CallbackBase_1Data<TaskList*> ReductionCallback;
TaskList();
// NOTE(boulos): push_back will modify your task to point to this
@@ -39,11 +40,10 @@
void finishTask(unsigned int task_id);
- void setReduction(CallbackBase_0Data* new_reduction);
+ void setReduction(ReductionCallback* new_reduction);
// After finishing all the tasks, the reduction is called by the
// finishing thread
- CallbackBase_0Data* reduction_function;
-
+ ReductionCallback* reduction_function;
std::vector<Task*> tasks;
AtomicCounter num_assigned;
// Need to keep track of how many tasks are left unfinished
Modified: trunk/Model/Groups/DynBVH.cc
==============================================================================
--- trunk/Model/Groups/DynBVH.cc (original)
+++ trunk/Model/Groups/DynBVH.cc Tue Feb 12 15:05:56 2008
@@ -138,19 +138,31 @@
#if TEST_MASKS
if (rays.rayIsMasked(i)) continue;
#endif
- float tmin = 1e-5f;
- float tmax = rays.getMinT(i);
+ __m128 t00 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[0][0]),
_mm_set_ss(rays.getOrigin(i, 0))),
+ _mm_set_ss(rays.getInverseDirection(i, 0)));
+ __m128 t01 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[1][0]),
_mm_set_ss(rays.getOrigin(i, 0))),
+ _mm_set_ss(rays.getInverseDirection(i, 0)));
+ __m128 tmin0 = _mm_max_ss(_mm_min_ss(t00, t01), _mm_set_ss(T_EPSILON));
+ __m128 tmax0 = _mm_min_ss(_mm_max_ss(t00, t01),
_mm_set_ss(rays.getMinT(i)));
- for (int c = 0; c < 3; c++) {
- float t0 = (box[0][c] - rays.getOrigin(i,c)) *
rays.getInverseDirection(i,c);
- float t1 = (box[1][c] - rays.getOrigin(i,c)) *
rays.getInverseDirection(i,c);
+ __m128 t10 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[0][1]),
_mm_set_ss(rays.getOrigin(i, 1))),
+ _mm_set_ss(rays.getInverseDirection(i, 1)));
+ __m128 t11 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[1][1]),
_mm_set_ss(rays.getOrigin(i, 1))),
+ _mm_set_ss(rays.getInverseDirection(i, 1)));
+ __m128 tmin1 = _mm_max_ss(_mm_min_ss(t10, t11), tmin0);
+ __m128 tmax1 = _mm_min_ss(_mm_max_ss(t10, t11), tmax0);
- float near = (t0 < t1) ? t0 : t1;
- float far = (t0 < t1) ? t1 : t0;
- tmin = (tmin < near) ? near : tmin; // max of tmin, near
- tmax = (far < tmax) ? far : tmax; // min of tmax, far
- }
- if (tmin <= tmax) { // valid intersection
+ __m128 t20 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[0][2]),
_mm_set_ss(rays.getOrigin(i, 2))),
+ _mm_set_ss(rays.getInverseDirection(i, 2)));
+ __m128 t21 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[1][2]),
_mm_set_ss(rays.getOrigin(i, 2))),
+ _mm_set_ss(rays.getInverseDirection(i, 2)));
+ __m128 tmin2 = _mm_max_ss(_mm_min_ss(t20, t21), tmin1);
+ __m128 tmax2 = _mm_min_ss(_mm_max_ss(t20, t21), tmax1);
+
+ float tmin, tmax;
+ _mm_store_ss(&tmin, tmin2);
+ _mm_store_ss(&tmax, tmax2);
+ if(tmin < tmax){
*out_tmin = tmin;
return i;
}
@@ -159,25 +171,6 @@
int i = rays.rayBegin;
{
- // Try first hit as scalar
-#if 0
- float tmin = 1e-5f;
- float tmax = rays.getMinT(i);
-
- for (int c = 0; c < 3; c++) {
- float t0 = (box[0][c] - rays.getOrigin(i,c)) *
rays.getInverseDirection(i,c);
- float t1 = (box[1][c] - rays.getOrigin(i,c)) *
rays.getInverseDirection(i,c);
-
- float near = (t0 < t1) ? t0 : t1;
- float far = (t0 < t1) ? t1 : t0;
- tmin = (tmin < near) ? near : tmin; // max of tmin, near
- tmax = (far < tmax) ? far : tmax; // min of tmax, far
- }
- if (tmin <= tmax) { // valid intersection
- *out_tmin = tmin;
- return i;
- }
-#else
__m128 t00 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[0][0]),
_mm_set_ss(rays.getOrigin(i, 0))),
_mm_set_ss(rays.getInverseDirection(i, 0)));
__m128 t01 = _mm_mul_ss(_mm_sub_ss(_mm_set_ss(box[1][0]),
_mm_set_ss(rays.getOrigin(i, 0))),
@@ -206,12 +199,11 @@
*out_tmin = tmin;
return i;
}
-#endif
}
// try a frustum miss
{
- float tmin_frustum = 1e-5;
+ float tmin_frustum = T_EPSILON;
float tmax_frustum = FLT_MAX;
for (int axis = 0; axis < 3; axis++) {
@@ -254,7 +246,7 @@
// Scalar pickup loop
for (; i < b; i++) {
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -310,7 +302,7 @@
__m128 zmin = _mm_min_ps(z0,z1);
__m128 zmax = _mm_max_ps(z0,z1);
- __m128 maximum_minimum =
_mm_max_ps(xmin,_mm_max_ps(ymin,_mm_max_ps(zmin, _mm_set1_ps(1e-5f))));
+ __m128 maximum_minimum =
_mm_max_ps(xmin,_mm_max_ps(ymin,_mm_max_ps(zmin, _mm_set1_ps(T_EPSILON))));
__m128 minimum_maximum =
_mm_min_ps(xmax,_mm_min_ps(ymax,_mm_min_ps(zmax,_mm_load_ps(&data->minT[i]))));
__m128 valid_intersect = _mm_cmple_ps(maximum_minimum,minimum_maximum);
if (_mm_movemask_ps(valid_intersect) != 0x0)
@@ -321,7 +313,7 @@
#if TEST_MASKS
if (rays.rayIsMasked(i)) continue;
#endif
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -342,7 +334,7 @@
return rays.end();
#else // NOT MANTA_SSE (scalar case)
for (int i = rays.begin(); i < rays.end(); i++) {
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -415,7 +407,7 @@
#if TEST_MASKS
if (rays.rayIsMasked(i)) continue;
#endif
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -438,7 +430,7 @@
#if TEST_MASKS
if (rays.rayIsMasked(i)) continue;
#endif
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -494,7 +486,7 @@
__m128 zmin = _mm_min_ps(z0,z1);
__m128 zmax = _mm_max_ps(z0,z1);
- __m128 maximum_minimum =
_mm_max_ps(xmin,_mm_max_ps(ymin,_mm_max_ps(zmin, _mm_set1_ps(1e-5f))));
+ __m128 maximum_minimum =
_mm_max_ps(xmin,_mm_max_ps(ymin,_mm_max_ps(zmin, _mm_set1_ps(T_EPSILON))));
__m128 minimum_maximum =
_mm_min_ps(xmax,_mm_min_ps(ymax,_mm_min_ps(zmax,_mm_load_ps(&data->minT[i]))));
__m128 valid_intersect = _mm_cmple_ps(maximum_minimum,minimum_maximum);
if (_mm_movemask_ps(valid_intersect) != 0x0)
@@ -504,7 +496,7 @@
for (; i > rays.rayBegin;) {
i--;
//if (rays.rayIsMasked(i)) continue;
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -524,7 +516,7 @@
#else
for (int i = rays.rayEnd - 1; i > rays.rayBegin; i--) {
//if (rays.rayIsMasked(i)) continue;
- float tmin = 1e-5f;
+ float tmin = T_EPSILON;
float tmax = rays.getMinT(i);
for (int c = 0; c < 3; c++) {
@@ -556,6 +548,7 @@
void DynBVH::addToUpdateGraph(ObjectUpdateGraph* graph,
ObjectUpdateGraphNode* parent) {
+ cerr << MANTA_FUNC << endl;
ObjectUpdateGraphNode* node = graph->insert(this, parent);
getGroup()->addToUpdateGraph(graph, node);
}
@@ -577,31 +570,83 @@
}
}
-typedef Callback_1Data_2Arg<DynBVH, Task*, int, UpdateContext> bvhtask_type;
+typedef Callback_1Data_2Arg<DynBVH, Task*, int, UpdateContext> BVHUpdateTask;
+typedef Callback_1Data_2Arg<DynBVH, TaskList*, int, Task*>
BVHUpdateReduction;
-void DynBVH::performUpdate(const UpdateContext& context) {
- //cerr << "Asking for BVH update" << endl;
+typedef Callback_1Data_4Arg<DynBVH, Task*, int, int, int, UpdateContext>
BVHBuildTask;
+typedef Callback_1Data_2Arg<DynBVH, TaskList*, int, Task*> BVHBuildReduction;
+typedef Callback_1Data_5Arg<DynBVH, TaskList*, Task*, int, int, int,
UpdateContext> BVHSAHReduction;
+
+#define USE_PARALLEL_BUILD 1
+
+void DynBVH::beginParallelBuild(UpdateContext context) {
+ cerr << "Doing parallel BVH build" << endl;
+ num_nodes.set(1);
+ nextFree.set(1);
+
+ int num_possible_nodes = (2*currGroup->size()) + 1;
+ TaskListMemory = new char[(num_possible_nodes) * sizeof(TaskList)];
+ CurTaskList = 0;
+ TaskMemory = new char[2 * (num_possible_nodes) * sizeof(Task)];
+ CurTask = 0;
+
+ FourArgCallbackMemory = new char[3 * (num_possible_nodes) *
sizeof(BVHBuildTask)];
+ CurFourArgCallback = 0;
+
+ FiveArgCallbackMemory = new char[3 * (num_possible_nodes) *
sizeof(BVHSAHReduction)];
+
+ TaskList* new_list = new (TaskListMemory + sizeof(TaskList)*CurTaskList)
TaskList();
+
+ Task* build_tree =
+ new (TaskMemory + sizeof(Task) * CurTask) Task(
+ new (FourArgCallbackMemory) BVHBuildTask
+ (this,
+ &DynBVH::parallelTopDownBuild,
+ 0,
+ 0,
+ currGroup->size(),
+ context));
+
+ CurTask++;
+ CurTaskList++;
+ CurFourArgCallback++;
+
+ new_list->push_back(build_tree);
+ new_list->setReduction(
+ Callback::create(this,
+ &DynBVH::beginParallelUpdate,
+ context));
+
+ // Ask the work queue to update my BVH
+ context.insertWork(new_list);
+}
+
+void DynBVH::beginParallelUpdate(TaskList* list, UpdateContext context) {
+ cerr << "Doing parallel BVH update" << endl;
// preallocate task list memory
- tasklist_memory = new char[(num_nodes+1) * sizeof(TaskList)];
- cur_tasklist = 0;
- task_memory = new char[2 * (num_nodes+1) * sizeof(Task)];
- cur_task = 0;
- callback_memory = new char[3 * (num_nodes+1) * sizeof(bvhtask_type)];
- cur_callback = 0;
+ int num_possible = num_nodes + 1;
+ cerr << "Updating " << num_possible << " nodes\n";
+ TaskListMemory = new char[(num_possible) * sizeof(TaskList)];
+ CurTaskList = 0;
+ TaskMemory = new char[2 * (num_possible) * sizeof(Task)];
+ CurTask = 0;
+
+ TwoArgCallbackMemory = new char[3 * (num_possible) *
sizeof(BVHUpdateTask)];
+ CurTwoArgCallback = 0;
- TaskList* new_list = new (tasklist_memory + sizeof(TaskList)*cur_tasklist)
TaskList();
+ TaskList* new_list = new (TaskListMemory + sizeof(TaskList)*CurTaskList)
TaskList();
Task* update_tree =
- new (task_memory + sizeof(Task) * cur_task) Task(
- new (callback_memory) Callback_1Data_2Arg<DynBVH, Task*, int,
UpdateContext>
+ new (TaskMemory + sizeof(Task) * CurTask) Task(
+ new (TwoArgCallbackMemory) BVHUpdateTask
(this,
&DynBVH::parallelTopDownUpdate,
0,
context));
- cur_task++;
- cur_tasklist++;
- cur_callback++;
+ CurTask++;
+ CurTaskList++;
+ CurTwoArgCallback++;
new_list->push_back(update_tree);
new_list->setReduction(
@@ -613,13 +658,36 @@
context.insertWork(new_list);
}
-void DynBVH::finishUpdate(UpdateContext context) {
- cur_tasklist = 0;
- cur_task = 0;
- cur_callback = 0;
- delete[] tasklist_memory;
- delete[] task_memory;
- delete[] callback_memory;
+void DynBVH::performUpdate(const UpdateContext& context) {
+#if USE_PARALLEL_BUILD
+ beginParallelBuild(context);
+#else
+ beginParallelUpdate(0, context);
+#endif
+}
+
+void DynBVH::finishUpdate(TaskList* list, UpdateContext context) {
+ CurTaskList = 0;
+ CurTask = 0;
+ CurTwoArgCallback = 0;
+ CurFourArgCallback = 0;
+ CurFiveArgCallback = 0;
+
+ if (TaskListMemory)
+ delete[] TaskListMemory;
+
+ if (TaskMemory)
+ delete[] TaskMemory;
+
+ if (TwoArgCallbackMemory)
+ delete[] TwoArgCallbackMemory;
+
+ if (FourArgCallbackMemory)
+ delete[] FourArgCallbackMemory;
+
+ if (FiveArgCallbackMemory)
+ delete[] FiveArgCallbackMemory;
+
context.finish(this);
}
@@ -627,18 +695,17 @@
// If node is a leaf update bounds and exit
//cerr << MANTA_FUNC << "(node_id = " << node_id << ")\n";
BVHNode& node = nodes[node_id];
- if (node.isLeaf()) {
+ unsigned int num_objects = num_prims_beneath[node_id];
+ // TODO(boulos): Determine a good value for this leads to low
+ // contention and good parallel speedup when scenes are large
+ // enough.
+ const unsigned int kMinObjects = 2048;
+ if (node.isLeaf() || num_objects < kMinObjects) {
PreprocessContext preprocess_context(context.manta_interface,
context.proc,
context.num_procs,
NULL);
- node.bounds.reset();
- for (int i = 0; i < node.children; i++) {
- BBox temp;
- int obj_id = object_ids[node.child + i];
- currGroup->get(obj_id)->computeBounds(preprocess_context, temp);
- node.bounds.extendByBox(temp);
- }
+ updateBounds(preprocess_context, node_id);
task->finished();
//cerr << MANTA_FUNC << "(node_id = " << node_id << ") finishing at a
leaf\n";
} else {
@@ -647,20 +714,20 @@
// Recurse into two subtrees
taskpool_mutex.lock();
- size_t tasklist_id = cur_tasklist;
- size_t left_task_id = cur_task;
- size_t right_task_id = cur_task + 1;
- size_t callback_id = cur_callback;
-
- cur_tasklist++;
- cur_task += 2;
- cur_callback += 3;
+ size_t tasklist_id = CurTaskList;
+ size_t left_task_id = CurTask;
+ size_t right_task_id = CurTask + 1;
+ size_t callback_id = CurTwoArgCallback;
+
+ CurTaskList++;
+ CurTask += 2;
+ CurTwoArgCallback += 3;
taskpool_mutex.unlock();
- TaskList* children = new (tasklist_memory + sizeof(TaskList) *
tasklist_id) TaskList();
- Task* left_child = new (task_memory + sizeof(Task) * left_task_id) Task(
- new (callback_memory + callback_id * sizeof(bvhtask_type)
)Callback_1Data_2Arg<DynBVH, Task*, int, UpdateContext>
+ TaskList* children = new (TaskListMemory + sizeof(TaskList) *
tasklist_id) TaskList();
+ Task* left_child = new (TaskMemory + sizeof(Task) * left_task_id) Task(
+ new (TwoArgCallbackMemory + callback_id * sizeof(BVHUpdateTask)
)BVHUpdateTask
(this,
&DynBVH::parallelTopDownUpdate,
left_son,
@@ -668,8 +735,8 @@
children->push_back(left_child);
- Task* right_child = new (task_memory + sizeof(Task) * right_task_id)
Task(
- new (callback_memory + (callback_id+1)*sizeof(bvhtask_type))
Callback_1Data_2Arg<DynBVH, Task*, int, UpdateContext>
+ Task* right_child = new (TaskMemory + sizeof(Task) * right_task_id) Task(
+ new (TwoArgCallbackMemory +
(callback_id+1)*sizeof(BVHUpdateTask))BVHUpdateTask
(this,
&DynBVH::parallelTopDownUpdate,
right_son,
@@ -677,19 +744,22 @@
children->push_back(right_child);
+ // NOTE(boulos): This is wasteful, but apparently BVHUpdateTasks are
larger than BVHUpdateReductions...
+
// When the children finish, call reduction
children->setReduction(
- new (callback_memory + (callback_id+2) * sizeof(bvhtask_type))
Callback_0Data_2Arg<DynBVH, int, Task*>(this,
-
&DynBVH::updateBoundsReduction,
- node_id,
- task));
+ new (TwoArgCallbackMemory + (callback_id+2)
*sizeof(BVHUpdateTask))BVHUpdateReduction
+ (this,
+ &DynBVH::updateBoundsReduction,
+ node_id,
+ task));
// Insert new TaskList for the BVH Object* in UpdateGraph
context.insertWork(children);
}
}
-void DynBVH::updateBoundsReduction(int node_id, Task* task) {
+void DynBVH::updateBoundsReduction(TaskList* list, int node_id, Task* task) {
//cerr << MANTA_FUNC << "(node_id = " << node_id << ")\n";
BVHNode& node = nodes[node_id];
@@ -706,6 +776,462 @@
//cerr << MANTA_FUNC << "(node_id = " << node_id << ") after
task->finished()\n";
}
+void DynBVH::splitBuild(Task* task, int nodeID, int objectBegin, int
objectEnd, UpdateContext context) {
+ int* vals = (int*)task->scratchpad_data;
+ int bestAxis = vals[0];
+ int split = vals[1];
+
+ BVHNode& node = nodes[nodeID];
+
+ if (bestAxis == -1) {
+ // make leaf
+ node.makeLeaf(objectBegin, objectEnd-objectBegin);
+
+ std::sort(object_ids.begin()+objectBegin,object_ids.begin()+objectEnd);
+
+ num_nodes++;
+ task->finished();
+ } else {
+ // make internal node
+ // allocate two spots
+ int my_spot = (nextFree += 2);
+ node.makeInternal(my_spot, bestAxis);
+ num_nodes++;
+
+ // Make two subtasks to build the children
+ int left_son = node.child;
+ int right_son = left_son + 1;
+ // Recurse into two subtrees
+ taskpool_mutex.lock();
+
+ size_t tasklist_id = CurTaskList;
+ size_t left_task_id = CurTask;
+ size_t right_task_id = CurTask + 1;
+ size_t callback_id = CurFourArgCallback;
+
+ CurTaskList++;
+ CurTask += 2;
+ CurFourArgCallback += 3;
+
+ taskpool_mutex.unlock();
+
+ TaskList* children = new (TaskListMemory + sizeof(TaskList) *
tasklist_id) TaskList();
+ Task* left_child = new (TaskMemory + sizeof(Task) * left_task_id) Task
+ (
+ new (FourArgCallbackMemory + callback_id * sizeof(BVHBuildTask)
)BVHBuildTask
+ (this,
+ &DynBVH::parallelTopDownBuild,
+ left_son,
+ objectBegin,
+ split,
+ context));
+
+ children->push_back(left_child);
+
+ Task* right_child = new (TaskMemory + sizeof(Task) * right_task_id) Task
+ (
+ new (FourArgCallbackMemory +
(callback_id+1)*sizeof(BVHBuildTask))BVHBuildTask
+ (this,
+ &DynBVH::parallelTopDownBuild,
+ right_son,
+ split,
+ objectEnd,
+ context));
+
+ children->push_back(right_child);
+ // Insert new TaskList for the BVH Object* in UpdateGraph
+
+ // When the children finish, call reduction
+ children->setReduction(
+ new (FourArgCallbackMemory + (callback_id+2)
*sizeof(BVHBuildTask))BVHBuildReduction
+ (this,
+ &DynBVH::parallelBuildReduction,
+ nodeID,
+ task));
+
+ context.insertWork(children);
+ }
+}
+
+// We know that there is "lots of work" (objectEnd - objectBegin is
+// sort of big), so no need to check for small cases here
+void DynBVH::parallelApproximateSAH(Task* task, int nodeID, int objectBegin,
int objectEnd, UpdateContext context) {
+ // First compute overall bounds in parallel
+ int num_objects = objectEnd - objectBegin;
+ const int kNumObjectsPerBound = 1024;
+ int num_tasks = num_objects / kNumObjectsPerBound;
+
+ taskpool_mutex.lock();
+ size_t tasklist_id = CurTaskList;
+ size_t first_task = CurTask;
+ size_t callback_id = CurFourArgCallback;
+ size_t reduction_id = CurFiveArgCallback;
+
+ CurTaskList++;
+ CurTask += num_tasks;
+ CurFourArgCallback += num_tasks;
+ CurFiveArgCallback++;
+ taskpool_mutex.unlock();
+
+ TaskList* children = new (TaskListMemory + sizeof(TaskList) * tasklist_id)
TaskList();
+ for (int i = 0; i < num_tasks; i++) {
+ int begin = objectBegin + i * kNumObjectsPerBound;
+ int end = begin + kNumObjectsPerBound;
+ if (i == num_tasks - 1) {
+ end = objectEnd;
+ }
+ Task* child = new (TaskMemory + sizeof(Task) * (first_task + i)) Task
+ (new (FourArgCallbackMemory +
(callback_id+i)*sizeof(BVHBuildTask))BVHBuildTask
+ (this,
+ &DynBVH::parallelComputeBounds,
+ nodeID,
+ begin,
+ end,
+ context));
+ children->push_back(child);
+ }
+
+ children->setReduction(new (FiveArgCallbackMemory + reduction_id *
sizeof(BVHSAHReduction))BVHSAHReduction
+ (this,
+ &DynBVH::parallelComputeBoundsReduction,
+ task,
+ nodeID,
+ objectBegin,
+ objectEnd,
+ context));
+ context.insertWork(children);
+
+ // Then compute bin entries in parallel (sweep during reduction?)
+
+ // If a split has been deemed necessary (very likely), segment the object
ids in parallel
+}
+
+void DynBVH::parallelComputeBounds(Task* task, int nodeID, int objectBegin,
int objectEnd, UpdateContext context) {
+ BBox result;
+ for (int i = objectBegin; i < objectEnd; i++) {
+ result.extendByBox(obj_bounds[object_ids[i]]);
+ }
+ BBox* output = (BBox*)task->scratchpad_data;
+ *output = result;
+ //cerr << MANTA_FUNC << " NodeID " << nodeID << " has bounds " << result
<< endl;
+ task->finished();
+}
+
+void DynBVH::parallelComputeBoundsReduction(TaskList* tasklist,
+ Task* task,
+ int nodeID,
+ int objectBegin,
+ int objectEnd,
+ UpdateContext context) {
+ // Merge the bounds
+ TaskList& list = *tasklist;
+ BBox overall_bounds;
+ for (size_t i = 0; i < list.tasks.size(); i++) {
+ BBox* task_bounds = (BBox*)list.tasks[i]->scratchpad_data;
+ overall_bounds.extendByBox(*task_bounds);
+ }
+ //cerr << MANTA_FUNC << " NodeID " << nodeID << " has bounds " <<
overall_bounds << endl;
+
+ // Now kick off a parallel bin computation
+ const int kBinningObjects = 1024;
+ int num_objects = objectEnd - objectBegin;
+ int num_tasks = num_objects / kBinningObjects;
+
+ taskpool_mutex.lock();
+ size_t tasklist_id = CurTaskList;
+ size_t first_task = CurTask;
+ size_t callback_id = CurFourArgCallback;
+ size_t reduction_id = CurFiveArgCallback;
+
+ CurTaskList++;
+ CurTask += num_tasks;
+ CurFourArgCallback += num_tasks;
+ CurFiveArgCallback++;
+ taskpool_mutex.unlock();
+
+ TaskList* children = new (TaskListMemory + sizeof(TaskList) * tasklist_id)
TaskList();
+ for (int i = 0; i < num_tasks; i++) {
+ int begin = objectBegin + i * kBinningObjects;
+ int end = begin + kBinningObjects;
+ if (i == num_tasks - 1) {
+ end = objectEnd;
+ }
+ Task* child = new (TaskMemory + sizeof(Task) * (first_task + i)) Task
+ (new (FourArgCallbackMemory +
(callback_id+i)*sizeof(BVHBuildTask))BVHBuildTask
+ (this,
+ &DynBVH::parallelComputeBins,
+ nodeID,
+ begin,
+ end,
+ context));
+ BBox* task_bounds = (BBox*)child->scratchpad_data;
+ *task_bounds = overall_bounds;
+ children->push_back(child);
+ }
+
+ children->setReduction(new (FiveArgCallbackMemory + reduction_id *
sizeof(BVHSAHReduction))BVHSAHReduction
+ (this,
+ &DynBVH::parallelComputeBinsReduction,
+ task,
+ nodeID,
+ objectBegin,
+ objectEnd,
+ context));
+
+ context.insertWork(children);
+}
+
+void DynBVH::parallelComputeBins(Task* task, int nodeID, int objectBegin,
int objectEnd, UpdateContext context) {
+ // The overall bounds is in the scratch pad
+ BBox& overall_bounds = *((BBox*)task->scratchpad_data);
+ //cerr << MANTA_FUNC << " NodeID " << nodeID << " has bounds " <<
overall_bounds << endl;
+ struct SampleBin {
+ SampleBin() { count = 0; }
+ SampleBin(const SampleBin& copy) : bounds(copy.bounds),
count(copy.count) {
+ }
+ BBox bounds;
+ int count;
+ };
+
+ const int num_samples = 8;
+ SampleBin bins[3][num_samples];
+
+ Vector min_point = overall_bounds.getMin();
+ Vector max_point = overall_bounds.getMax();
+ Vector width = max_point - min_point;
+
+ for (int i = objectBegin; i < objectEnd; i++) {
+ BBox& obj_box = obj_bounds[object_ids[i]];
+ Vector& obj_centroid = obj_centroids[object_ids[i]];
+ for (int axis = 0; axis < 3; axis++) {
+ // Sample bin is where this position would fall to the left
+ // int(num_samples * (centroid[axis] - min_point[axis])/width[axis])
+ int which_bin = static_cast<int>
+ (num_samples * (obj_centroid[axis] - min_point[axis])/width[axis]);
+
+ if (width[axis] == 0)
+ which_bin = 0;
+ else if (which_bin >= num_samples)
+ which_bin = num_samples - 1;
+
+ bins[axis][which_bin].count++;
+ bins[axis][which_bin].bounds.extendByBox(obj_box);
+ }
+ }
+
+ SampleBin* output = (SampleBin*)task->scratchpad_data;
+ for (int i = 0; i < 3; i++) {
+ for (int s = 0; s < num_samples; s++) {
+ *output++ = bins[i][s];
+ }
+ }
+
+ task->finished();
+}
+
+int DynBVH::partitionObjects(int objBegin, int objEnd, int axis, float
position) {
+ //cerr << MANTA_FUNC << " begin = " << objBegin << ", end = " << objEnd <<
endl;
+ int first = objBegin;
+ int last = objEnd;
+ --first;
+ while (1) {
+ for (++first; first < last; first++) {
+ if (obj_centroids[object_ids[first]][axis] > position)
+ break;
+ }
+
+ for(--last; first < last; --last) {
+ if (obj_centroids[object_ids[last]][axis] <= position)
+ break;
+ }
+
+ if (first < last) {
+ // Swap first and last
+ int temp = object_ids[first];
+ object_ids[first] = object_ids[last];
+ object_ids[last] = temp;
+ } else {
+ return first;
+ }
+ }
+ // To quiet warnings
+ return -1;
+}
+
+void DynBVH::parallelComputeBinsReduction(TaskList* tasklist,
+ Task* task,
+ int nodeID,
+ int objectBegin,
+ int objectEnd,
+ UpdateContext context) {
+ const int num_samples = 8;
+ struct SampleBin {
+ SampleBin() { count = 0; }
+ BBox bounds;
+ int count;
+ };
+ SampleBin bins[3][num_samples];
+ int num_objects = objectEnd - objectBegin;
+
+ // Merge the bins
+ TaskList& list = *tasklist;
+ BBox overall_bounds;
+ for (size_t i = 0; i < list.tasks.size(); i++) {
+ SampleBin* task_bins = (SampleBin*)list.tasks[i]->scratchpad_data;
+ for (int axis = 0; axis < 3; axis++) {
+ for (int sample = 0; sample < num_samples; sample++) {
+ SampleBin& task_bin = *(task_bins + axis * num_samples + sample);
+ bins[axis][sample].count += task_bin.count;
+ bins[axis][sample].bounds.extendByBox(task_bin.bounds);
+ overall_bounds.extendByBox(task_bin.bounds);
+ }
+ }
+ }
+
+ float inv_overall_area = 1.f/overall_bounds.computeArea();
+ Vector min_point = overall_bounds.getMin();
+ Vector max_point = overall_bounds.getMax();
+ Vector width = max_point - min_point;
+ // Search for best cost
+ BVHCostEval best_cost;
+ best_cost.cost = BVH_C_isec * num_objects;
+ best_cost.axis = -1;
+ best_cost.position = FLT_MAX;
+ best_cost.event = -1;
+
+ for (int axis = 0; axis < 3; axis++) {
+ float left_areas[num_samples];
+ float right_areas[num_samples];
+ int left_counts[num_samples];
+ int right_counts[num_samples];
+ BBox left_box, right_box;
+ int num_left = 0;
+ int num_right = num_objects;
+
+ // Sweep from left to right, aggregating area and counts
+ for (int i = 0; i < num_samples; i++) {
+ num_left += bins[axis][i].count;
+ num_right -= bins[axis][i].count;
+ left_box.extendByBox(bins[axis][i].bounds);
+ left_counts[i] = num_left;
+ right_counts[i] = num_right;
+ left_areas[i] = left_box.computeArea();
+ }
+
+ // Sweep from right to left, aggregating areas
+ for (int i = num_samples - 1; i >= 0; i--) {
+ if (i == num_samples - 1)
+ right_areas[i] = FLT_MAX;
+ else
+ right_areas[i] = right_box.computeArea();
+
+ right_box.extendByBox(bins[axis][i].bounds);
+ }
+
+ // NOTE(boulos): The last bin is pointless (since it indicates leaf)
+ for (int i = 0; i < num_samples - 1; i++) {
+ float cost = (left_areas[i] * left_counts[i] +
+ right_areas[i] * right_counts[i]);
+ cost *= inv_overall_area;
+ cost *= BVH_C_isec;
+ cost += BVH_C_trav;
+
+ if (cost < best_cost.cost) {
+ // Found new best cost
+ best_cost.cost = cost;
+ best_cost.axis = axis;
+ float sample_position =
+ (min_point[axis] +
width[axis]*(static_cast<float>(i)/num_samples));
+ best_cost.position = sample_position;
+ best_cost.num_left = left_counts[i];
+ best_cost.num_right = right_counts[i];
+ best_cost.event = -1; // Unused
+ }
+ }
+ }
+
+ int* vals = (int*)task->scratchpad_data;
+ vals[0] = best_cost.axis;
+ vals[1] = 0;
+ if (best_cost.axis == -1) {
+ // no need for parallel split
+ splitBuild(task, nodeID, objectBegin, objectEnd, context);
+ return;
+ }
+
+ // Otherwise, we need to kick off a parallel partition of the object
+ // ids based on the axis and position
+
+ // TODO(boulos): Actually do this in parallel
+ // write out object ids [objBegin,objEnd) in appropriate order
+
+ // First collect all the left_objs and right_objs into local
+ // vectors (because we need to copy them out)
+ int middle = partitionObjects(objectBegin, objectEnd, best_cost.axis,
best_cost.position);
+ if (middle == objectBegin || middle == objectEnd) {
+ // Splitting didn't find a valid split, split in the middle unless
+ // we have too few
+ const int kMinObjects = 3;
+ if (num_objects < kMinObjects) {
+ vals[0] = -1;
+ } else {
+ middle = objectBegin + num_objects / 2;
+ }
+ }
+ vals[1] = middle;
+ splitBuild(task, nodeID, objectBegin, objectEnd, context);
+}
+
+void DynBVH::parallelTopDownBuild(Task* task, int nodeID, int objectBegin,
int objectEnd, UpdateContext context) {
+ //cerr << MANTA_FUNC << "(nodeId = " << nodeID << ", objectBegin = " <<
objectBegin << ", objectEnd = " << objectEnd << ")\n";
+ int num_objects = objectEnd - objectBegin;
+ const int kSmallObjects = 1024;
+ PreprocessContext preprocess_context(context.manta_interface,
+ context.proc,
+ context.num_procs,
+ NULL);
+ if (num_objects < kSmallObjects) {
+ // Just have this thread do the rest of the build
+ build(preprocess_context, nodeID, objectBegin, objectEnd);
+ task->finished();
+ } else {
+
+ if (objectEnd <= objectBegin) {
+ throw InternalError("Tried building BVH over invalid range");
+ }
+
+ num_prims_beneath[nodeID] = num_objects;
+
+ // We know that there are more than kSmallObjects, but how many
+ // are necessary for parallel sorting?
+ const int kParallelSortThreshold = kSmallObjects * 2;
+
+ if (num_objects > kParallelSortThreshold) {
+ // do split in parallel
+ parallelApproximateSAH(task, nodeID, objectBegin, objectEnd, context);
+ } else {
+ int bestAxis = -1;
+ int split = -1;
+
+ // NOTE(abe): 30% slower rendering for a single CAD dataset.
+#if USE_APPROXIMATE_BUILD
+ split = partitionApproxSAH(preprocess_context, objectBegin, objectEnd,
bestAxis);
+#else
+ split = partitionSAH(preprocess_context, objectBegin, objectEnd,
bestAxis);
+#endif
+
+ int* vals = (int*)task->scratchpad_data;
+ vals[0] = bestAxis;
+ vals[1] = split;
+ splitBuild(task, nodeID, objectBegin, objectEnd, context);
+ }
+ }
+}
+
+void DynBVH::parallelBuildReduction(TaskList* list, int node_id, Task* task)
{
+ task->finished();
+}
+
void DynBVH::update(int proc, int numProcs) {
PreprocessContext context;
parallelUpdateBounds(context, proc, numProcs);
@@ -732,7 +1258,9 @@
if(currGroup->size() > object_ids.size()) {
nodes.resize(2*currGroup->size());
+ num_prims_beneath.resize(2*currGroup->size());
object_ids.resize(currGroup->size());
+
// TODO(boulos): Free these after construction? or keep around
// for rebuild?
obj_bounds.resize(currGroup->size());
@@ -745,10 +1273,10 @@
obj_centroids[i] = obj_bounds[i].center();
}
- num_nodes = 1; // root node
- int nextFree = 1;
+ num_nodes.set(1);
+ nextFree.set(1);
double build_start = Time::currentSeconds();
- build(context, 0, 0, currGroup->size(), nextFree);
+ build(context, 0, 0, currGroup->size());
double updateBound_start = Time::currentSeconds();
updateBounds(context, 0);
double end = Time::currentSeconds();
@@ -766,27 +1294,23 @@
}
void DynBVH::build(const PreprocessContext& context,
- int nodeID, int objectBegin, int objectEnd,
- int& nextFree, int depth)
+ int nodeID, int objectBegin, int objectEnd)
{
-
+ //cerr << MANTA_FUNC << " begin = " << objectBegin << " , end = " <<
objectEnd << endl;
if (objectEnd <= objectBegin) {
throw InternalError("Tried building BVH over invalid range");
}
BVHNode& node = nodes[nodeID];
+ int num_objects = objectEnd - objectBegin;
+ num_prims_beneath[nodeID] = num_objects;
int bestAxis = -1;
int split = -1;
// NOTE(abe): 30% slower rendering for a single CAD dataset.
#if USE_APPROXIMATE_BUILD
- int num_objects = objectEnd - objectBegin;
- if (num_objects < 32) {
- split = partitionSAH(context, objectBegin,objectEnd,bestAxis);
- } else {
- split = partitionApproxSAH(context, objectBegin,objectEnd,bestAxis);
- }
+ split = partitionApproxSAH(context,objectBegin,objectEnd,bestAxis);
#else
split = partitionSAH(context, objectBegin,objectEnd,bestAxis);
#endif
@@ -800,17 +1324,14 @@
num_nodes++;
} else {
// make internal node
- node.makeInternal(nextFree, bestAxis);
- num_nodes++;
// allocate two spots
- nextFree += 2;
+ int my_spot = (nextFree += 2);
+ node.makeInternal(my_spot, bestAxis);
+ num_nodes++;
- build(context, node.child+0,objectBegin,split,nextFree,depth+1);
- build(context, node.child+1, split, objectEnd,nextFree,depth+1);
- }
- if (depth == 0 && print_info) {
- cerr << "DynBVH Build Complete (used " << nextFree << " nodes)\n";
+ build(context, node.child+0,objectBegin,split);
+ build(context, node.child+1, split, objectEnd);
}
}
@@ -986,26 +1507,8 @@
// First collect all the left_objs and right_objs into local
// vectors (because we need to copy them out)
- std::vector<int> left_objs;
- std::vector<int> right_objs;
- for (int i = objBegin; i < objEnd; i++) {
- if (obj_centroids[object_ids[i]][output_axis] < best_cost.position) {
- left_objs.push_back(object_ids[i]);
- } else {
- right_objs.push_back(object_ids[i]);
- }
- }
-
- // Write them back in order
- for (size_t i = 0; i < left_objs.size(); i++) {
- object_ids[i + objBegin] = left_objs[i];
- }
-
- for (size_t i = 0; i < right_objs.size(); i++) {
- object_ids[i + objBegin + left_objs.size()] = right_objs[i];
- }
- if (left_objs.size() == (size_t)num_objects ||
- left_objs.size() == 0) {
+ int middle = partitionObjects(objBegin, objEnd, best_cost.axis,
best_cost.position);
+ if (middle == objBegin || middle == objEnd) {
// Splitting didn't find a valid split; split in the middle unless
// we have too few objects
const int kMinObjects = 3;
@@ -1015,7 +1518,7 @@
}
return objBegin + num_objects / 2;
}
- return objBegin + left_objs.size();
+ return middle;
}
// making a leaf
@@ -1725,6 +2228,6 @@
//archive->readwrite("group", currGroup);
archive->readwrite("object_ids", object_ids);
archive->readwrite("nodes", nodes);
- archive->readwrite("num_nodes", num_nodes);
+ //archive->readwrite("num_nodes", num_nodes);
}
Modified: trunk/Model/Groups/DynBVH.h
==============================================================================
--- trunk/Model/Groups/DynBVH.h (original)
+++ trunk/Model/Groups/DynBVH.h Tue Feb 12 15:05:56 2008
@@ -5,12 +5,14 @@
#include <Core/Geometry/BBox.h>
#include <Interface/RayPacket.h>
#include <Interface/AccelerationStructure.h>
+#include <Core/Thread/AtomicCounter.h>
#include <Core/Thread/Barrier.h>
#include <Core/Util/SpinLock.h>
namespace Manta
{
class Task;
+ class TaskList;
class DynBVH : public AccelerationStructure
{
@@ -63,29 +65,39 @@
protected:
vector<BVHNode> nodes;
vector<int> object_ids;
- unsigned int num_nodes;
+ vector<unsigned int> num_prims_beneath; // the number of primitives
beneath each node.
+
+ AtomicCounter num_nodes;
Group* currGroup;
bool group_changed;
Barrier barrier;
SpinLock taskpool_mutex;
+ AtomicCounter nextFree;
+
vector<BBox> obj_bounds;
vector<Vector> obj_centroids;
- char* tasklist_memory;
- size_t cur_tasklist;
- char* task_memory;
- size_t cur_task;
+ char* TaskListMemory;
+ size_t CurTaskList;
+
+ char* TaskMemory;
+ size_t CurTask;
+
+ char* TwoArgCallbackMemory;
+ size_t CurTwoArgCallback;
+
+ char* FourArgCallbackMemory;
+ size_t CurFourArgCallback;
- char* callback_memory;
- size_t cur_callback;
+ char* FiveArgCallbackMemory;
+ size_t CurFiveArgCallback;
bool print_info;
public:
- //DynBVH() : currGroup(NULL), group_changed(false), barrier("DynBVH
barrier"), print_info(false) {
- //}
- DynBVH(bool print=false) : currGroup(NULL), group_changed(false),
barrier("DynBVH barrier"), print_info(print)
+
+ DynBVH(bool print = true) : num_nodes("DynBVH Num Nodes", 0),
currGroup(NULL), group_changed(false), barrier("DynBVH barrier"),
nextFree("DynBVH Next Free", 0), TaskListMemory(0), CurTaskList(0),
TaskMemory(0), CurTask(0), TwoArgCallbackMemory(0), CurTwoArgCallback(0),
FourArgCallbackMemory(0), CurFourArgCallback(0), FiveArgCallbackMemory(0),
CurFiveArgCallback(0), print_info(print)
{}
virtual ~DynBVH();
@@ -96,10 +108,35 @@
ObjectUpdateGraphNode* parent);
void performUpdate(const UpdateContext& context);
- void finishUpdate(UpdateContext context);
+ void finishUpdate(TaskList* list, UpdateContext context);
+
+ void beginParallelBuild(UpdateContext context);
+ void beginParallelUpdate(TaskList* list, UpdateContext context);
void parallelTopDownUpdate(Task* task, int node_id, UpdateContext
context);
- void updateBoundsReduction(int node_id, Task* task);
+ void updateBoundsReduction(TaskList* list, int node_id, Task* task);
+
+ void parallelTopDownBuild(Task* task, int node_id, int objectBegin, int
objectEnd, UpdateContext context);
+ void parallelBuildReduction(TaskList* list, int node_id, Task* task);
+
+ void parallelApproximateSAH(Task* task, int nodeID, int objectBegin, int
objectEnd, UpdateContext context);
+ void parallelComputeBounds(Task* task, int nodeID, int objectBegin, int
objectEnd, UpdateContext context);
+ void parallelComputeBoundsReduction(TaskList* tasklist,
+ Task* task,
+ int nodeID,
+ int objectBegin,
+ int objectEnd,
+ UpdateContext context);
+ void parallelComputeBins(Task* task, int nodeID, int objectBegin, int
objectEnd, UpdateContext context);
+ void parallelComputeBinsReduction(TaskList* tasklist,
+ Task* task,
+ int nodeID,
+ int objectBegin,
+ int objectEnd,
+ UpdateContext context);
+
+ int partitionObjects(int first, int last, int axis, float position);
+ void splitBuild(Task* task, int nodeID, int objectBegin, int objectEnd,
UpdateContext context);
void preprocess(const PreprocessContext&);
void intersect(const RenderContext& context, RayPacket& rays) const;
@@ -129,8 +166,7 @@
void rebuild(int proc=0, int numProcs=1);
void build(const PreprocessContext& context,
- int nodeID, int objectBegin, int objectEnd,
- int &nextFree, int depth = 0);
+ int nodeID, int objectBegin, int objectEnd);
void parallelUpdateBounds(const PreprocessContext& context,
int proc, int numProcs);
- [Manta] r2053 - in trunk: Core/Thread Core/Util Engine/Control Interface Model/Groups, Solomon Boulos, 02/12/2008
Archive powered by MHonArc 2.6.16.