
UKMAC 2017

UK Manycore Developer Conference

Tuesday 11th July 2017

University of Warwick

The UK Manycore Developer Conference is an informal day of talks spanning the whole landscape of accelerated, heterogeneous and manycore computing. Topics of interest include high-performance computing, embedded and mobile systems, computational science, finance, computer vision, formal verification, and beyond. The goal of the event is to develop and bring together the UK community of manycore developers, both industrial and academic.

The 2017 event was held on Tuesday 11th July 2017 at the University of Warwick. This is the eighth event in the series. Previous meetings have taken place at:

  • University of Edinburgh (2016)
  • University of Cambridge (2010 and 2014)
  • University of Oxford (2009 and 2013)
  • University of Bristol (2012)
  • Imperial College (2011)

These meetings regularly attract 100 participants and have proved to be invaluable opportunities to meet colleagues and swap stories of manycore successes and challenges.

Registration

Registration is free for this event, due to sponsorship from the UK Manycore Network.

Registration is now available through the UK Manycore Network webpage.

Programme

09.00-09.15 Coffee and Registration
09.15-09.45 Talk 1: Marco Cianfriglia - Dividiti Ltd
Adaptive libraries for emerging applications
High-performance libraries expose multiple tunable parameters that influence the performance. Traditionally, such parameters are tuned and hardcoded for a given target architecture to perform well for typical inputs (e.g. large operand sizes). Emerging AI and Big Data applications, however, are often data-driven, rendering the conventional approach ineffective.
In this talk, we describe our work-in-progress on designing adaptive GPU libraries for data-driven applications. We proceed in three stages. First, starting with some tractable subsets of the parameter space, we evaluate various search strategies to discover highly performant parameter combinations for a range of inputs in reasonable time. Second, we use the discovered combinations to train models that most accurately predict the best performing combination for a given input. Third, we explore trade-offs between prediction accuracy and runtime overhead to make this approach feasible in practice.
We introduce runtime adaptation to the CLBlast library which we use with the Caffe framework, and study its effects across several convolutional neural networks, and GPU architectures for mobile and desktop platforms. Our workflows are implemented using the Collective Knowledge framework for collaborative and reproducible R&D (cknowledge.org), and will be extended to crowdsource experimentation across diverse platforms provided by volunteers.
09.45-10.15 Talk 2: Tim Law - University of Warwick
Optimisation of a molecular dynamics simulation of chromosome condensation
We present optimisations applied to a bespoke biophysical molecular dynamics simulation designed to investigate chromosome condensation. Our primary focus is on domain-specific algorithmic improvements to determining short-range interaction forces between particles, as certain qualities of the simulation render traditional methods less effective. We implement tuned versions of the code for both traditional CPU architectures and the modern many-core architecture found in the Intel Xeon Phi coprocessor and compare their effectiveness. We achieve speed-ups starting at a factor of 10 over the original code, facilitating more detailed and larger-scale experiments.
10.15-10.45 Talk 3: Christopher Brown – University of St. Andrews
ParaFormance: Democratizing Parallel Software Development
Emerging multicore and manycore architectures offer major advantages in terms of performance and low energy usage. We are already seeing designs for 100+ core CPUs and 1000+ core GPUs, offering significant potential for parallelism. However, programming models are lagging behind. Exploiting the potential of new parallel systems, even using higher-level programming models, is highly challenging.
Fundamentally:
"Parallelism is too hard for programmers today"
Bjarne Stroustrup, Inventor of C++
ParaFormance is a novel software toolset for C and C++ that allows software developers to optimise systems for performance and energy consumption by exploiting parallelism quickly and easily. Our ParaFormance tool discovers the potential areas in the application for parallelism, refactors it to introduce the parallel business logic automatically and checks it for thread-safety and runtime bugs. Our case studies have shown 2.5 million lines of code analysed and refactored using ParaFormance, reducing 1 month of manual effort to around 5 minutes. In this talk I will introduce the ParaFormance toolset and give a demonstration of it on a realistic use-case.
10.45-11.00 Coffee
11.00-11.45 Keynote Speaker - Eiko Yoneki - University of Cambridge
Efficient Massive-Scale Graph Processing
The analysis of graph-structured data is gaining importance due to its relevance to social media and big data. Due to the interconnection patterns in social network graphs, the performance of graph analytics is impeded by irregular memory access patterns which expose memory latency.
The emergence of big data requires fundamentally new methodology for data analysis, processing, and information extraction. The main challenge here is to perform efficient and robust data processing, while adapting to the underlying resource availability in a dynamic, large-scale computing environment. Do we really need high-performance computers or large computing clusters? I will introduce our recent work on processing graphs with billions of vertices and edges on a single commodity computer, which requires secondary storage as external memory. Executing algorithms results in accesses to such secondary storage, and I/O performance plays an important role, regardless of the algorithmic complexity or runtime efficiency of the actual algorithm in use.
11.45-12.15 Talk 4: Hans Vandierendonck - Queen's University Belfast
GraphGrind: Taming Irregular Memory Accesses in Graph Analytics Workloads
The analysis of graph-structured data is gaining importance due to its relevance to social media and big data. Due to the interconnection patterns in social network graphs, the performance of graph analytics is impeded by irregular memory access patterns which expose memory latency.
This talk presents our recent work on high-performance graph analytics. We will demonstrate how graph partitioning is crucial to tame memory locality and how it can be used to map graph analytics to non-uniform memory architectures (NUMA). Key to the graph partitioning algorithm is that it achieves memory locality, avoids overlap in write-sets between threads and is efficient to apply. We will discuss the difficulties of making graph partitioning scalable as a result of a strongly biased degree distribution in the partitions. We will demonstrate solutions to these problems. We will moreover identify new opportunities to switch between different representations of the graph during graph traversal in order to maximise processing speed.
These ideas are implemented in GraphGrind, an open source framework for graph analytics on shared memory systems.
12.15-13.45 Lunch
13.45-14.15 Talk 5: Tim Harris – Oracle
Big Graphs on Big Machines
Oracle's largest SPARC M7 system provides 4096 hardware threads spread over 16 sockets in one cache-coherent address space. I will talk about our experience tuning graph analytics workloads to run well on this system, and how we went from an implementation that stopped scaling at around 200 threads to a version that provides super-linear speed-ups on PageRank and SSSP running on 1TB+ inputs over the full machine. I will focus on the interactions between the threads and the memory system, and the lessons we learned in terms of how to allocate memory and distribute work on these large NUMA systems.
14.15-14.45 Talk 6: Pablo Gonzales – Imperial College London
An optimization approach for the computational modeling of biological development
Current research in the field of computational biology often involves simulations on high-performance computer clusters. It is crucial that the code of such simulations is efficient and correctly reflects the model specifications. In this paper, we present an optimization strategy for simulations of biological dynamics using Intel Xeon Phi coprocessors, demonstrated by a winning entry of the "Intel Modern Code Developer Challenge" competition. These optimizations allow simulating various biological mechanisms, in particular, the simulation of millions of cell agents, their proliferation, movements and interactions in 3D space. Overall, our results demonstrate a powerful approach to implement and conduct very detailed and large-scale computational simulations for biological research. We also highlight the main difficulties faced when developing such optimizations, in particular, the changing simulation load over time, the dependencies between different optimization techniques and counter-intuitive effects in the speed of the optimized solution. The overall speedup of 320x shows good parallel scalability.
14.45-15.15 Talk 7: Jose Nunez/Dr Mohammad Hosseinabady - University of Bristol
Simultaneous multiprocessing in a software defined heterogeneous chip
Recent advances in hardware compilers and synthesis tools have resulted in significant increases in hardware design productivity, while heterogeneous chips that combine CPUs and FPGAs can be used to distribute processing so that the tasks present in an algorithm map to the most suitable processing element. These software-defined high-level design environments use general purpose languages such as C++ and OpenCL without requiring hardware description language expertise. In this paper, we investigate how to enhance an existing software-defined framework to reduce overheads and enable the utilisation of all the programmable processing resources present in the system in parallel to optional hardware accelerators. Instead of selecting the best processing resource for a task and simply offloading, we create a dynamic scheduler that distributes the task between all processing resources in an optimal way. A new hardware platform is created based on interrupts that removes spin-locks and allows the processing resources to sleep when not performing useful work. Performance and functional portability between 32-bit and 64-bit chips using the same source code with different CPU and FPGA hardware is investigated. The performance and energy evaluation shows up to 40% reduction in execution time when the CPU cores assist FPGA execution at the same level of energy requirements depending on hardware speed-ups.
15.15-15.30 Coffee
15.30-16.00 Talk 8: Kevin Hammond – University of St. Andrews
Automatically Deriving Cost Models for Structured Parallel Processes Using Hylomorphisms
Structured parallelism using nested algorithmic skeletons can greatly ease the task of writing parallel software, since common but hard-to-debug problems such as race conditions are eliminated by design. However, choosing the right combination of algorithmic skeletons to yield good parallel speedups for a specific program on a specific parallel architecture is still a difficult problem. This talk introduces the unifying notion of hylomorphisms, a general recursion pattern, to make it possible to reason about both functional correctness properties and extra-functional timing properties of structured parallel programs. Using our approach, we can now automatically and statically choose the provably optimal parallel structure for a given program with respect to a parallel architecture or a class of architectures.
16.00-17.00 Discussion Panel - with Daniel Goodman, Paul Kelly and Eiko Yoneki
17.00 Closing

Location

The UKMAC 2017 event was held in Room CS1.04 of the Computer Science department at the University of Warwick. Please see the Warwick campus maps for more information. There is a downloadable PDF map.

Arrival

Parking:
There is pay and display parking on campus. Parking for the day is £4.50.
Car park 15 is just behind the CS department.
We are also able to reserve some spaces in the Arts Centre car park (near Psychology and Senate House on the map). The top two floors of this car park can be reserved for events up to 11am. There is still a requirement to pay and display, but you would have a guaranteed space. If you wish to have a reserved parking space, please contact us directly.
Alternatively, anyone staying overnight in Scarman House can park for free at the conference centre.
Train:
The easiest approach to the department is to get a train to Coventry train station and then take the number 11 or number 12 bus onto campus (arriving at the Bus Interchange). Alternatively, from Leamington Spa (if you happen to be on that train line), you can catch bus U1 onto campus.
Those who fancy more of a walk can get off at Canley station and reach the campus with a 20 minute walk: go down Sir Henry Parkes Road, cross the A45, and continue on Sir Henry Parkes Road until you approach Cannon Park. Follow the road past Cannon Park and eventually you will arrive at the roundabout in the bottom right corner of this map. It is a short walk from here to the Computer Science department.

Accommodation

If you wish to stay overnight, there is accommodation on campus in Scarman House. Although this is a conference centre, it's also a hotel that can be booked. Unfortunately there are no discounts available, since the workshop is being held in the Computer Science department. Please note that UKMAC is not able to cover any travel/accommodation costs.
You can book rooms at this page: https://bandb.warwick.ac.uk

Organizing Committee


Organization for UKMAC 2017:

  • Jeremy Singer, School of Computing Science, University of Glasgow
  • Steven Wright, Department of Computer Science, University of Warwick

Steering Committee for the UKMAC series: