Automatic performance tuning and reproducibility as a side effect

Author(s)

Grigori Fursin

Posted on 22 July 2014

Estimated read time: 11 min

By Grigori Fursin, President and CTO of the international cTuning foundation.

Users are always eager to have faster, smaller, cheaper, more reliable and power-efficient computer systems, either to improve their everyday tasks or to continue innovation in science and technology. However, designing and optimising such systems is becoming excessively time consuming, costly and error prone due to the enormous number of available design and optimisation choices and the complex interactions between all software and hardware components. Furthermore, multiple characteristics have to be carefully balanced at the same time, including execution time, code size, compilation time, power consumption and reliability, using a growing number of incompatible tools and techniques with many ad-hoc, intuition-based heuristics.

During the EU FP6 MILEPOST project in 2006-2009, we attempted to solve the above issues by combining empirical performance auto-tuning with machine learning. We wanted to be able to automatically and adaptively explore and model large design and optimisation spaces. This, in turn, could allow us to quickly predict better program optimisations and hardware designs to minimise execution time, power consumption, code size, compilation time and other important characteristics. However, during this project, we faced multiple problems.
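
Before turning to those problems, here is a minimal, hypothetical sketch in Python of the kind of prediction we had in mind: learning which optimisation tends to work best from simple program features. The feature names, data and the choice of a decision tree are purely illustrative and not the actual MILEPOST GCC setup.

```python
# Illustrative sketch of machine-learning-based optimisation prediction.
# Features, labels and model choice are invented for this example; the real
# MILEPOST GCC setup used many more program features and optimisation passes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row holds static features of a benchmark, e.g.
# [number of basic blocks, number of branches, number of memory accesses].
features = np.array([
    [120,  34, 210],
    [ 15,   4,  30],
    [300, 120, 800],
    [ 45,  10,  95],
])

# Label = index of the best optimisation found empirically for that benchmark
# (0 = -O2, 1 = -O3, 2 = -O3 -funroll-loops).
best_optimisation = np.array([1, 0, 2, 0])

model = DecisionTreeClassifier(max_depth=3).fit(features, best_optimisation)

# Predict a promising optimisation for a previously unseen program.
new_program = np.array([[200, 80, 500]])
print("Predicted optimisation class:", model.predict(new_program)[0])
```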

In particular:

- there was a lack of common, large and diverse benchmarks and data sets needed to build statistically meaningful predictive models;
- there was no common experimental methodology and no unified way to preserve, systematise and share our growing optimisation knowledge and research material (including benchmarks, data sets, tools, tuning plugins, predictive models and optimisation results);
- there was a dramatic lack of computational power to automatically explore the large program and architecture design and optimisation spaces required to effectively train a compiler (that is, to build predictive models);
- it was difficult to reproduce and validate related performance-tuning and machine-learning techniques from existing publications, due to the lack of a culture in computer engineering of sharing research artifacts and full experiment specifications along with publications.

Based on our background in physics and machine learning, we proposed an alternative solution: to develop a common experimental infrastructure, repository and public web portal that could help crowdsource performance tuning across multiple users. At the same time, we tried to persuade our community to share benchmarks, data sets, tools and predictive models, together with experimental results, alongside their publications. This, in turn, could help the community validate and improve past techniques or quickly prototype new ones using the shared code and data.

In the beginning, many academic researchers were not very enthusiastic about this approach, since it broke with the traditional research model in computer engineering, where promotion is often based on the number of publications rather than on the reproducibility and practicality of techniques or the sharing of research artifacts. Nevertheless, we decided to take the risk and validate our approach with the community by releasing our whole program and compiler optimisation and learning infrastructure together with all benchmarks and data sets. This infrastructure was connected to a public repository of knowledge, allowing the community to share their experimental results and treat program optimisation as a collaborative big data problem. At the same time, we shared all experimental results, the program, architecture and data set features (or meta-information) necessary for machine learning and data mining, and the generated predictive models along with our open-access publication. As a result, we gained several important and practical experiences, summarised below.

The community served as a reviewer of our open-access publication, our shared code and data, and our experimental results on a machine-learning-based self-tuning compiler. For example, our work was featured twice on the front page of the Slashdot news website with around 150 comments (IBM Releases Open Source Machine Learning Compiler and Using AI With GCC to Speed Up Mobile Design). Of course, such public comments can be mere likes, dislikes, or unrelated and possibly unfair remarks, which may be difficult to cope with, particularly since academic researchers often consider their work and publications unique and exceptional. On the other hand, quickly filtering comments and focusing on constructive feedback or criticism helped us validate and improve our research techniques, besides fixing obvious bugs. Furthermore, the community helped us find the most relevant and missing citations, related projects and tools - this is particularly important nowadays, with a growing number of publications, conferences, workshops, journals and initiatives, and only a few truly novel ideas.

Exposing our research to the community and engaging in public discussions was much more fun and motivating than the traditional publication model, particularly after the following remark, which we received on Slashdot about our machine-learning-based compiler (MILEPOST GCC): "GCC goes online on the 2nd of July, 2008. Human decisions are removed from compilation. GCC begins to learn at a geometric rate. It becomes self-aware 2:14 AM, Eastern time, August 29th. In a panic, they try to pull the plug. GCC Strikes back".

It was even more motivating to see our shared techniques being immediately used in practice, improved by the community, added to mainline GCC, and even having an impact on industry. For example, our community-driven approach was referenced in 2009 by IBM for speeding up the development and optimisation of their embedded systems.

Moreover, it was possible to fight unfair or biased reviewing, which is sometimes intended to block other authors from publishing new ideas and to keep a monopoly on certain research topics for a few large academic groups or companies. To some extent, rebuttals were originally intended to solve this problem, but due to the excessive number of submissions and the lack of reviewing time, they nowadays have very little effect on acceptance decisions. This problem often makes academic research look like a business rather than collaborative science, puts off many students and younger researchers, and was emphasised at all the events and panels we organised.

However, with an open-access publication and shared artifacts, it is possible to have a time stamp on your work and immediately engage in public discussions, thus advertising and explaining it or even collaboratively improving it - something that academic research was originally about. At the same time, having an open-access paper does not prevent you from publishing a considerably improved article in a traditional journal while acknowledging all contributors, including engineers whose important work is often not even recognised in academic research. Therefore, open-access and traditional publication models may well co-exist while still helping academic researchers with traditional promotion.

It is even possible to share and discuss negative results (failed techniques, model mispredictions, performance degradations, unexplainable results) to prevent the community from making the same mistakes and to improve the underlying techniques collaboratively. Sharing negative results is largely ignored by our community and practically impossible in current publication venues. However, negative results are very important for machine-learning-based optimisation and auto-tuning. Such techniques are finally becoming popular in computer engineering, but they require sharing all benchmarks, data sets and model mispredictions, not only positive results, in order to improve them, as is already done in some other scientific disciplines.

The community continues to be interested in our projects mainly because they are accompanied by enough code and data to reproduce, validate and extend our model-driven optimisation techniques. At the same time, sharing all research material in a unified way helped us bring interdisciplinary communities together to explain performance anomalies, improve machine-learning models or find missing features for automatic program and architecture optimisation, while treating it as a big data problem. We also used it to conduct internal student competitions to find the best-performing predictive model. Finally, we used such data to automatically generate interactive graphs to simplify research in workgroups and to enable interactive publications, as shown in the online example.
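
As a rough illustration of such a competition, the hypothetical sketch below compares two simple predictive models on synthetic data using cross-validation; the feature values, labels and model choices are invented and only meant to show the workflow, not our actual shared data.

```python
# Hypothetical sketch: comparing predictive models on shared optimisation data.
# The data here is synthetic; in practice the features and labels would come
# from a shared repository of experimental results.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Pretend these are program features and the empirically best optimisation class.
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

candidates = [
    ("decision tree", DecisionTreeClassifier(max_depth=4)),
    ("k-nearest neighbours", KNeighborsClassifier(n_neighbors=5)),
]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f} (+/- {scores.std():.2f})")
```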

Furthermore, such community-driven research helped us expose a major problem that makes reproducibility in computer engineering very challenging: we have to deal with an ever-changing hardware and software stack, which makes it extremely difficult to describe experiments with all their software and hardware dependencies and to explain the unexpected behaviour of computer systems. Therefore, just reporting and sharing experimental results, including performance numbers and the versions of the compiler, operating system and platform, is not enough - we need to preserve the whole experimental setup with all related artifacts and meta-information describing all software and hardware dependencies. Sharing virtual machines does not help either, since performance and energy results naturally differ across hardware configurations.

The above problems motivated us to start developing a new methodology and open-source infrastructure (Collective Mind: paper, wiki, repository) to gradually and collaboratively define, preserve and share whole experimental setups and describe all their software and hardware dependencies. It helps convert the ad-hoc, error-prone, time-consuming and costly process of benchmarking, optimising and co-designing computer systems into a unified and collaborative big data problem, tackled using predictive analytics and collective intelligence. There is still a long way to go, but we continue to improve it collaboratively with the help of the cTuning foundation and our academic and industrial partners.
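
To give a flavour of what preserving the whole experimental setup might look like in practice, here is a rough, hypothetical example of the meta-information one could record for a single experiment. The field names and values are invented for this illustration and do not reflect the actual Collective Mind format.

```python
# Hypothetical meta-description of one experiment, covering software and
# hardware dependencies as well as repeated measurements (invented fields,
# not the actual Collective Mind JSON schema).
import json

experiment_meta = {
    "benchmark": {"name": "susan_corners", "dataset": "image-pgm-0001"},
    "software_dependencies": {
        "compiler": {"name": "gcc", "version": "4.4.4", "flags": "-O3 -funroll-loops"},
        "os": {"name": "Ubuntu", "version": "10.04", "kernel": "2.6.32"},
        "libraries": ["libc 2.11", "libm 2.11"],
    },
    "hardware": {
        "cpu": "ARM Cortex-A9",
        "frequency_mhz": 1000,
        "memory_mb": 1024,
    },
    "results": {
        "execution_time_s": [1.42, 1.39, 1.41],  # repeated measurements
        "code_size_bytes": 52144,
    },
}

print(json.dumps(experiment_meta, indent=2))
```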

After many years of evangelising collaborative and reproducible research in computer engineering based on the practical experience presented above, we have finally started to see a change in mentality in academia, industry and funding agencies. At our last ADAPT workshop, the authors of two papers (out of nine accepted) agreed to have their papers validated by volunteers. Note that rather than enforcing specific validation rules, we asked authors to pack all their research artifacts as they wished (for example, as a shared virtual machine or a standard archive) and to describe their own validation procedure. Thanks to our volunteers, the experiments from these papers have been validated, the archives shared in our public repository, and the papers marked with a "validated by the community" stamp.

Note that experimental reproducibility comes naturally in our research as a side effect, rather than being enforced or pursued only because it is a noble thing to do. However, we continue to experience considerable difficulties when reproducing complex experimental setups from existing publications, often due to specific requirements placed on operating systems, libraries, benchmarks, data sets, compilers, architecture simulators and other tools, or due to a lack of precise specifications, missing software dependencies and a lack of access to some hardware.

Therefore, we started organising regular workshops, such as REPRODUCE and TRUST sponsored by ACM, to discuss the technological aspects of enabling collaborative and reproducible research and experimentation, particularly for empirical program and architecture analysis, optimisation, simulation, co-design and run-time adaptation. These workshops cover how to capture, catalogue, systematise, modify, replay and exchange experiments (possibly whole setups with all artefacts, including benchmarks, codelets, data sets, tools and models); how to validate and verify experimental results; and how to deal with a rising amount of experimental data using statistical analysis, data mining and predictive modelling.
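
As a small example of the statistical side, the sketch below (not taken from the workshops themselves) shows the kind of minimal treatment repeated measurements need before results are shared: reporting variation rather than a single number and flagging unstable runs. The timings are invented.

```python
# Minimal sketch of summarising repeated measurements before sharing results.
# The execution times below are invented for illustration.
import statistics

execution_times = [1.42, 1.39, 1.41, 1.85, 1.40, 1.38]  # seconds, repeated runs

mean = statistics.mean(execution_times)
stdev = statistics.stdev(execution_times)
print(f"mean = {mean:.3f}s, stdev = {stdev:.3f}s")

# Flag runs far from the mean; they may indicate frequency scaling, contention
# or other system noise worth investigating before publishing the numbers.
outliers = [t for t in execution_times if abs(t - mean) > 2 * stdev]
print("possible outliers:", outliers)
```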

Furthermore, with the help of the cTuning foundation and the community, we started setting up an artefact evaluation (AE) process for the major conferences in computer engineering. Briefly, AE is intended to help validate the experimental results of accepted publications. Authors now have the option to participate in AE by submitting all the research material (tools, benchmarks, data sets, models) necessary to validate the claims and results of their accepted articles, together with a readme file describing the artefacts and the validation procedure. Volunteers from the Program Committee (or colleagues and postgraduate researchers familiar with the topic whom they nominate) then perform validation and ranking, and papers ranked above a given threshold receive a validation stamp. At the same time, workshops like REPRODUCE and TRUST allow us to discuss how to continuously improve this evaluation process.

Finally, based on our practical experience, we propose a new publication model for computer engineering in which publications, experimental results and all related artefacts are continuously reviewed and discussed by the community. We plan to use this model at our next ADAPT workshop and will push it to major conferences and journals if successful. We hope that such an approach will help reduce reviewers' burden, improve the quality and fairness of reviews, and restore the attractiveness of academic research in computer engineering as a traditional, collaborative and fair science rather than a hacking exercise, a publication machine or a monopolised business.
