The abstract from the HiFrames paper ("HiFrames: High Performance Data Frames in a Scripting Language" [1]) describes HPAT better than the description in the repo:
> Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not tightly integrated with array computations (e.g., Spark SQL). This paper proposes a novel compiler-based approach where we integrate data frames into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It provides expressive and flexible data frame APIs which are tightly integrated with array operations. HiFrames then automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. We demonstrate that HiFrames is significantly faster than alternatives such as Spark SQL on clusters, without forcing the programmer to switch to embedded SQL for part of the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytics operations, such as weighted moving averages (WMA), that the map-reduce paradigm cannot handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26 on 64 nodes of Cori supercomputer.
The linked papers demonstrate the performance gains in a specific arena: static compilation around a loop instead of mere data parallelism of the iterations. For example:
    for i in range(num_iterations):
        some_function(i)
A library like Spark can execute the function in parallel, but control must return to the host language after each iteration, which effectively restarts the distributed computation every time around the loop.
HPAT is a compiler. The function is executed in parallel, just like in Spark, but the distributed environment doesn't return control to the host language between iterations. Instead, the loop is run in its entirety in the distributed environment.
Spark's driver must farm work out to executors during each iteration. HPAT removes that runtime overhead.
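To make the difference concrete, here's a toy cost model (purely illustrative — the function names and unit costs are made up, not real Spark or HPAT APIs):

```python
# Hypothetical cost model: a "dispatch" is one driver-to-cluster round trip.
DISPATCH_COST = 1   # per-round-trip driver overhead (arbitrary units)
WORK_COST = 1       # cost of one parallel iteration body

def spark_style(num_iterations):
    """Control returns to the driver after every iteration."""
    total = 0
    for _ in range(num_iterations):
        total += DISPATCH_COST  # driver farms work out to executors
        total += WORK_COST      # the actual parallel work
    return total

def hpat_style(num_iterations):
    """The whole loop is compiled into one distributed program."""
    total = DISPATCH_COST       # a single launch covers the entire loop
    total += num_iterations * WORK_COST
    return total

# The per-iteration dispatch overhead grows with the loop, the compiled
# version pays it once:
assert spark_style(1000) == 2000
assert hpat_style(1000) == 1001
```

The model is crude, but it captures why the gap widens as the loop gets longer: Spark pays the coordination cost every iteration, the compiled program pays it once.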
From a usability standpoint this is really interesting. If I don't have to think about moving to Spark, and can instead just code within the restrictions of a subset of Python, that is really neat.
It's especially interesting because of all of the other implications of moving to Spark — like trying to debug application code that runs through PySpark, which in turn runs Scala Spark on the JVM. It's very difficult to trace errors all the way through, and there's no way to use tools like pdb. It takes me so much longer to debug errors in Spark code.
So this framework generates MPI-compliant code, which is intended to be sent out to a bunch of supercomputer nodes and executed.
What kind of existing infrastructure would you need to actually run this? I assume you would need to install agents that understand MPI on all your nodes and run some sort of centralised master agent to send out messages and supervise the whole thing. Is there a standard way to do this?
Yes, you need to install an MPI implementation such as Open MPI or MPICH. You probably also want a scheduler such as Grid Engine (formerly Sun Grid Engine, https://en.m.wikipedia.org/wiki/Oracle_Grid_Engine), Torque, or Slurm. Grid Engine and Torque provide a program called qsub that launches multiple copies of a job in parallel across machines (Slurm's equivalent is sbatch). Perfect for MPI.
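For example, a minimal Slurm batch script might look like this (the job name, node counts, and program path are all hypothetical):

```shell
#!/bin/bash
#SBATCH --job-name=hpat-job      # arbitrary job name
#SBATCH --nodes=4                # number of machines to use
#SBATCH --ntasks-per-node=16     # MPI ranks per machine

# srun (or mpirun) launches one copy of the program per rank;
# Slurm tells MPI which hosts to use
srun ./my_hpat_program
```

You submit it with `sbatch job.sh`, and the scheduler handles placement and supervision across the cluster.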
You don't need anything special at build time: a vanilla compiler suffices for MPI code. MPI implementations ship a compiler wrapper (mpicc) that is just the ordinary system compiler plus the right include paths and link flags.
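Concretely, building and launching an MPI program looks like this (assuming MPICH or Open MPI is installed; `prog.c` is a hypothetical source file):

```shell
# mpicc is a thin wrapper around the system C compiler that adds
# MPI include paths and link flags
mpicc -O2 prog.c -o prog

# run 8 ranks on the local machine; on a cluster, the scheduler
# supplies the host list instead
mpirun -np 8 ./prog
```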
HPAT looks pretty promising. I wonder how they managed to significantly increase the performance of shuffling and sorting, which are known to be quite difficult operations in map-reduce.
[1]: https://arxiv.org/abs/1704.02341