Designing the SKA Observatory’s supercomputers: maximising science while minimising their environmental impact
Published on 13 October 2023 – Updated on 13 October 2023
Author
Shan Mignot
Université Côte d'Azur, CNRS, OCA, UMR 7293 Lagrange, EUR SPECTRUM, Nice, France
In observational astronomy, angular resolution is determined by the receiver's diameter and instrument sensitivity is driven by its collecting area. The pursuit of ever fainter and colder targets, which often represent farther and older objects, is currently driving the trend towards bigger optical and radio telescopes. Modern radio astronomy relies on arrays of antennas to increase resolution and produces images via aperture synthesis, a technique derived from interferometry which, in the radio domain, is carried out through data processing. The SKA is an international project to develop the world’s largest radio observatory (SKAO), with the completion of phase 1 expected by 2029. With its two telescopes located in Australia (SKA-LOW) and South Africa (SKA-MID), the SKA will address an impressive variety of fundamental physics and astronomy topics.
Data-wise, the SKA represents a major leap forward. Radio waves are digitised at the level of the 197 dishes (SKA-MID) and 131 072 antennas (SKA-LOW) and the data are transported to two edge computers, the Central Signal Processors (CSPs), where preliminary operations are performed to reduce the data flow. As a result, 1 TB/s of data is then transferred to two supercomputers, the Science Data Processors (SDPs), where it is transformed into science-ready products and reduced in volume to 700 PB/year. The data are finally distributed to a network of computing centres, the SKA Regional Centre Network (SRCNet), where they are stored and made accessible to scientists for interpretation.
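These figures imply a substantial data reduction within the SDPs. As a back-of-the-envelope check, assuming a 100% observing duty cycle and using only the numbers quoted above, the overall volume reduction can be estimated as follows:

```python
# Back-of-the-envelope estimate of the data reduction performed by the SDPs,
# using only the figures quoted in the text (100% duty cycle assumed).
SECONDS_PER_YEAR = 365.25 * 24 * 3600

sdp_input_rate_tb_s = 1.0    # TB/s delivered by the CSPs to the SDPs
sdp_output_pb_year = 700.0   # PB/year of science-ready products

input_pb_year = sdp_input_rate_tb_s * SECONDS_PER_YEAR / 1000.0  # TB -> PB
reduction_factor = input_pb_year / sdp_output_pb_year

print(f"SDP input:  ~{input_pb_year:,.0f} PB/year")
print(f"SDP output: ~{sdp_output_pb_year:,.0f} PB/year")
print(f"Overall volume reduction: ~{reduction_factor:.0f}x")
```

Under these assumptions the SDPs reduce the data volume by roughly a factor of 45; a lower duty cycle would reduce the ingest accordingly.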
The SDP is at the frontier between the highly predictable, simple and repetitive processing of the CSPs, which calls for dedicated hardware, and the versatile use of the SRCNet by scientists, which points to data centres, while facing both large data flows and the need to run a diverse set of tasks. The estimated 125 petaflops of peak processing power required from each SDP [1] compares with the performance of the Sierra supercomputer, which currently ranks 6th in the TOP500 [2].
Supercomputers consist of a set of nodes¹ integrated in blades and arranged in racks, which are themselves interconnected with fast networks. Systems with 1.5 million processing cores like Sierra comprise 4320 nodes and 240 racks [3], that is, a tremendous number of high-technology components organised in a large and complex system. Evaluating the environmental impact of such systems is difficult because it requires considering the full life cycle (manufacturing, transport, operation, disposal), which involves many economic operators, and because of the multiplicity of aspects to be considered, such as the use of natural resources², energy consumption³ and pollution⁴. Although significant progress has been made over the last decade to improve the energy efficiency of systems, the power consumption of data centres is still growing due to bigger systems and the greater number of data centres (rebound effect).
SKAO has acknowledged the complexity of procuring two such machines within what is not, primarily, a high-performance computing project: a complexity related to capital and operational costs, fast-evolving technologies and the difficulty of accurately specifying the need. The SDPs are hence considered part of the telescopes, as they need to be available both to process the data and to provide feedback to the antennas and the CSPs within seconds for observations to be possible. This online processing is an innovative approach to data reduction compared to the classical offline approach.
While the objective of the SKA is to further our understanding of the universe and fundamental physics, with positive societal impacts notably in terms of knowledge and innovation, the environmental impact of the SDP, which is only one part of the SKA, is significant. With installations in South Africa and Australia, where the energy mix entails considerable greenhouse gas emissions, it was roughly estimated that the construction footprint of the SKA would be 312 kt CO2e and the operations footprint 18 kt CO2e per year [5], with a significant share owed to computing infrastructures.
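To put the one-off construction footprint and the recurring operations footprint on a common footing, they can be compared over the lifetime of the observatory. The sketch below uses only the estimates quoted above; the 15-year operating lifetime is an illustrative assumption, not an SKAO figure:

```python
# Illustrative comparison of construction vs. cumulative operations footprint,
# based on the estimates quoted in the text [5].
construction_ktco2e = 312.0           # one-off construction footprint (kt CO2e)
operations_ktco2e_per_year = 18.0     # annual operations footprint (kt CO2e/year)
assumed_lifetime_years = 15           # hypothetical operating lifetime (assumption)

cumulative_operations = operations_ktco2e_per_year * assumed_lifetime_years
total = construction_ktco2e + cumulative_operations

print(f"Operations over {assumed_lifetime_years} years: {cumulative_operations:.0f} kt CO2e")
print(f"Construction + operations: {total:.0f} kt CO2e")
print(f"Construction share of total: {construction_ktco2e / total:.0%}")
```

Under this assumption, construction and operations contribute comparable amounts, which is why both the manufacturing of the SDPs and their power consumption matter.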
SKAO has included sustainability as one of the core values of the observatory. Together with budget considerations, this has led SKAO to devise a power plan to reduce its exposure to fluctuations in the price of energy and to reduce greenhouse gas emissions (e.g., with solar plants close to the SDPs), and to set an upper bound of 2 MW on the power consumption of the SDP [6]. This is a very aggressive target: given the current efficiency of High Performance Computing (HPC) systems, meeting it relies on uncertain extrapolations of current power-efficiency trends to the date of the systems' procurement.
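Combining this 2 MW cap with the estimated 125 petaflops of peak processing power gives a sense of how aggressive the target is. In the sketch below, the Green500 figure is an approximate 2023 value quoted only for comparison, not taken from SKA documents:

```python
# Energy efficiency implied by the 2 MW cap, compared with a rough
# 2023 Green500-class figure (the latter is an assumed, approximate value).
peak_pflops = 125.0   # estimated peak processing power of an SDP [1]
power_cap_mw = 2.0    # SKAO power cap for the SDP [6]

implied_gflops_per_watt = (peak_pflops * 1e6) / (power_cap_mw * 1e6)
green500_leader_2023_approx = 65.0  # approximate leading Green500 efficiency (GFLOPS/W)

print(f"Implied efficiency: {implied_gflops_per_watt:.1f} GFLOPS/W (peak)")
print(f"~2023 Green500 leaders: ~{green500_leader_2023_approx:.0f} GFLOPS/W (measured on HPL)")
```

The implied figure of roughly 60 GFLOPS/W, to be sustained on real imaging workloads rather than on a benchmark, sits at or beyond what the most efficient machines achieve today, which is what makes the 2 MW bound so demanding.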
The cost cap for the SDPs in SKAO's budget follows a similar strategy and is also very aggressive. SKAO has hence identified a risk that, for the allocated money and power, the procured SDPs might limit the exploitation of the telescopes [7]. The procurement of the SDPs will be led by France and will build on a co-design activity to achieve an optimised design with matching software and hardware for maximum efficiency. The Observatoire de la Côte d'Azur leads this effort and, while the details of the scope are still being discussed with SKAO, it is our understanding that its objective is to allow the SDPs to achieve maximum science return within the assigned envelopes while minimising their environmental impact.
The community developing embedded systems has a long history of co-design, as a result of targeted applications and highly constrained conditions of operation (power, volume, mass). Until recently, HPC relied mostly on scaling existing systems: Dennard scaling⁵ and adding racks provided regular increases in performance. The demise of Dennard scaling has led the largest systems to consume increasing amounts of power, so that by 2014 IBM made a case for increasing performance with a heterogeneous architecture mixing GPUs and CPUs during the development of Summit and Sierra, based on predefined use cases [8]. More recently, a system-level analysis of power consumption was carried out for Fugaku to improve its efficiency, also for predefined use cases [9]. Co-design, as a means to jointly specialise the hardware and software architectures of a machine for more targeted uses while still using commodity components, is now the agreed way to increase performance [8][9][10] but is only an emerging practice in HPC.
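To recall why the end of Dennard scaling matters for power, dynamic power roughly follows P ≈ C·V²·f. The toy sketch below uses purely illustrative numbers, not tied to any actual process node, to show that power density stayed constant as long as supply voltage could be scaled down, and grows once it cannot:

```python
# Toy illustration of Dennard (constant-field) scaling and of its breakdown.
# Dynamic power per transistor: P ~ C * V^2 * f (arbitrary units).
def power_density(c, v, f, area):
    return c * v**2 * f / area

k = 1.4  # one process generation: linear dimensions shrink by ~1/k

c0, v0, f0, a0 = 1.0, 1.0, 1.0, 1.0          # baseline transistor
base = power_density(c0, v0, f0, a0)

# Dennard scaling: capacitance, voltage and area shrink, frequency rises.
dennard = power_density(c0 / k, v0 / k, f0 * k, a0 / k**2)

# Post-Dennard: the voltage can no longer be reduced.
post_dennard = power_density(c0 / k, v0, f0 * k, a0 / k**2)

print(f"Power density, baseline:         {base:.2f}")
print(f"Power density, Dennard scaling:  {dennard:.2f} (unchanged)")
print(f"Power density, voltage stuck:    {post_dennard:.2f} (grows by ~k^2)")
```

This is the backdrop against which heterogeneous architectures and co-design became the main levers for further performance gains.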
Such a strategy is well adapted to the SDPs which, although they will need to run diverse tasks related to different observation modes and to evolve as the science progresses, are intrinsically specialised in the sense that they are part of the telescopes. The ongoing development of the corresponding software, while it makes describing the expected use more difficult, also offers additional opportunities to deeply revisit the software architecture, both at the application level and in the supporting stack.
CNRS-INSU has adopted a twofold approach to this challenge: first, engineering activities carried out by the SCOOP team as an international collaboration led by the Côte d'Azur Observatory and ASTRON, inscribed in the SKA project itself; then, upstream research and R&D, soon to be organised under the umbrella of a joint laboratory (ECLAT) between CNRS, Inria and Atos, directed by the Paris Observatory and intended to feed SCOOP activities in the longer run.
Besides benchmarking the software to estimate the performance achieved and understand what limits it, SCOOP also tests its behaviour on a variety of hardware and simulates its execution via models of the SDP in order to explore the solution space and identify promising avenues. Beyond this effort to improve the match between software and hardware, SCOOP also intends to contribute to a system-level view of the SDP in order to refine the architectural and operational needs and to contribute to devising a system able to adapt to scarce temporal, computational and energetic resources. Developing a long-term vision of the management of the SDP is also proposed, including staged deployment, maintenance, extension and decommissioning, to reduce the capital environmental cost of the SDP over its entire life cycle.
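A minimal example of what simulating execution via models of the SDP can mean in practice is a roofline-style estimate of whether a given pipeline step is compute- or bandwidth-bound on a candidate node. The kernel and hardware numbers below are hypothetical placeholders, not SDP or SCOOP figures:

```python
# Minimal roofline-style model: lower-bound execution time of a kernel on a
# candidate node. All numbers are hypothetical placeholders for illustration.
def roofline_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (time lower bound in seconds, True if memory-bound)."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time), memory_time > compute_time

# Hypothetical gridding-like step: 10 TFLOP of work touching 4 TB of data.
kernel_flops = 10e12
kernel_bytes = 4e12

# Hypothetical candidate node: 40 TFLOP/s peak, 1.5 TB/s memory bandwidth.
node_peak_flops = 40e12
node_peak_bandwidth = 1.5e12

t, memory_bound = roofline_time(kernel_flops, kernel_bytes,
                                node_peak_flops, node_peak_bandwidth)
print(f"Estimated lower bound: {t:.2f} s "
      f"({'memory' if memory_bound else 'compute'}-bound)")
```

Chaining such estimates over the pipeline, and sweeping the node parameters, is one simple way to explore the solution space before committing to hardware.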
In addition to enlarging SCOOP's horizon with a prospective view fuelled by upstream research efforts and by making recent findings applicable, ECLAT will provide expertise on the design of supercomputers through the contribution of Atos. The hope is that this joint and long-running collaboration will lead to a shared understanding of needs and solutions and result in a finely tuned, tailored system when the time for procurement comes.
The design and procurement of the SDPs is a challenge in terms of complexity, cost and schedule, especially if SKAO is to fully profit from the investment of building the antenna arrays, in contrast with the LOFAR, NenuFAR and MWA pathfinder radio telescopes, whose current use is severely limited by the associated computing resources. The environmental impact adds to this challenge while not being entirely disconnected from budgetary constraints. The co-design endeavour led by France in the frame of such a large project triggers interest in HPC research communities and is expected to foster research and innovation at the hardware and software levels, but also concerning the process of developing and operating such infrastructures. Besides the SDPs, the tools and know-how developed could apply to the CSPs and SRCs within the SKA and, beyond, to HPC infrastructures associated with research experiments, in the search for an improved balance between science's high-end requirements and environmental sustainability.
References
[1] Alexander P. et al., 2019, SDP Critical Design Review, SKA1 SDP High Level Overview (SKA-TEL-SDP-0000180)
[2] TOP500 ranking of supercomputers (https://www.top500.org)
[3] Bertsch A., 2018, Stanford HPC Conference, The Sierra Supercomputer: Science and Technology on a Mission
[4] Das A. et al., 2021, IEEE International Parallel and Distributed Processing Symposium, Systemic Assessment of Node Failures in HPC Production Platforms
[5] Knödlseder J. et al., 2022, arXiv, Estimate of the carbon footprint of astronomical research infrastructures
[6] Barriere K., Schutte A., 2021, SKA1 Power Budget (SKA-TEL-SKO-0000035-06)
[7] Degan M., Broekema C., 2019, SKA Computing Hardware Risk Mitigation Plan (SKA-TEL-SKO-0001083)
[8] Moreno J.H., 2019, International Conference on High Performance Computing & Simulation, Benchmarking Summit and Sierra Supercomputers from Proposal to Acceptance
[9] Sato M. et al., 2020, Supercomputing Conference (SC20), Co-design for A64FX Manycore Processor and "Fugaku"
[10] Gómez C. et al., 2019, International Parallel and Distributed Processing Symposium, Design Space Exploration of Next-Generation HPC Machines
¹ Individual computing unit.
² ~180 kg of raw material to produce a smartphone.
³ ~12 MW to operate Sierra.
⁴ The mean time between failures is of the order of minutes for data centres [4].
⁵ The reduction of the size of transistors, allowing consumption to be reduced and frequency to be increased.