# Allen: Processing 4 TB/s of Streaming Data From the LHCb Experiment on GPUs

#### Roel Aaij

**October 19th, 2021** 



### LHCb Upgrade in a Single Slide



30 MHz (4 TB/s) of input contains about 1 MHz of signal, while only 10 GB/s can be stored long-term
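
To put these numbers in perspective, a back-of-the-envelope calculation gives the implied average event size and the overall reduction factor. This is only an illustrative sketch: the variable names are ours, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Numbers quoted on the slide (decimal units assumed).
input_rate_hz = 30e6        # 30 MHz input rate
input_bandwidth = 4e12      # 4 TB/s of input data, in bytes/s
storage_bandwidth = 10e9    # 10 GB/s to long-term storage, in bytes/s

# Average size of one event implied by rate and bandwidth.
avg_event_size_kb = input_bandwidth / input_rate_hz / 1e3

# Overall data reduction needed between input and long-term storage.
reduction_factor = input_bandwidth / storage_bandwidth

print(f"average event size: {avg_event_size_kb:.0f} kB")
print(f"required reduction: {reduction_factor:.0f}x")
```

Under these assumptions the average event is roughly 133 kB and the data volume has to shrink by a factor of about 400 before it reaches storage.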



#### **LHCb Upgrade Dataflow**



HLT1 challenge: reduce 5 TB/s to 70-200 GB/s in real time with high physics efficiency

### **DAQ Architecture**



#### 30 MHz (5 TB/s) of event building and processing in a data center

#### **HLT1 on GPUs: Allen**



GPU solution (Allen) selected as baseline HLT1; up to 3 GPUs installed in each event builder server



## **HLT1 on GPUs: Allen**

- Fully standalone software project: https://gitlab.cern.ch/lhcb/Allen
- Dependencies: C++17 compliant compiler, boost, ZeroMQ, cppgsl
- Built-in physics validation
- Configurable algorithm sequence, custom memory manager
- Cross-architecture compatibility (CPU, CUDA, HIP)
- Approximately 100k LOC, 90% written from scratch
- Integrated with LHCb stack and LHCb DAQ and control system
- Project started in February 2018
- After 15 months of development time:
  - Project reviewed as viable solution for Run 3 (starting in 2022)
- Accepted as baseline solution in May 2020
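
The combination of a configurable algorithm sequence and a custom memory manager can be illustrated with a small sketch. This is not Allen's code or API: the `Arena` class, the `decode` and `track` algorithms and the buffer sizes are all hypothetical, and a simple bump allocator stands in for whatever strategy the real memory manager uses. The idea shown is that memory is reserved once and handed out in slices, while the list of algorithm names plays the role of the configuration.

```python
class Arena:
    """Hypothetical bump allocator over one pre-allocated block."""

    def __init__(self, size):
        self.buffer = bytearray(size)  # reserved once, up front
        self.offset = 0

    def allocate(self, nbytes):
        # Hand out the next free slice; fail loudly when the block is full.
        if self.offset + nbytes > len(self.buffer):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buffer)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view

    def reset(self):
        # Reuse the same block for the next batch; nothing is freed.
        self.offset = 0


def decode(arena, store):
    # Hypothetical algorithm: reserves space for decoded hits.
    store["hits"] = arena.allocate(64)


def track(arena, store):
    # Hypothetical algorithm: reserves space for reconstructed tracks.
    store["tracks"] = arena.allocate(32)


ALGORITHMS = {"decode": decode, "track": track}


def run_sequence(names, arena):
    """Run the configured algorithms in order on one batch."""
    store = {}
    for name in names:
        ALGORITHMS[name](arena, store)
    used = arena.offset
    arena.reset()
    return store, used


arena = Arena(128)
store, used = run_sequence(["decode", "track"], arena)
print(sorted(store), used)
```

Changing the list passed to `run_sequence` changes which algorithms run, without touching the allocator or the algorithms themselves.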

## **Reconstruction and Selection**



#### **Reconstruction Performance**





# **Selection Performance**

| Signal                       | GEC        | TIS -OR- TOS | TOS        | $\operatorname{GEC} \times \operatorname{TOS}$ |
|------------------------------|------------|--------------|------------|------------------------------------------------|
| $B^0 \to K^{*0} \mu^+ \mu^-$ | $89 \pm 2$ | $91 \pm 2$   | $89 \pm 2$ | $79 \pm 3$                                     |
| $B^0 \to K^{*0} e^+ e^-$     | $84 \pm 3$ | $69 \pm 4$   | $62 \pm 4$ | $52 \pm 4$                                     |
| $B_s^0 \to \phi \phi$        | $83 \pm 3$ | $76 \pm 3$   | $69 \pm 3$ | $57 \pm 3$                                     |
| $D_s^+ \to K^+ K^- \pi^+$    | $82 \pm 4$ | $59\pm5$     | $43 \pm 5$ | $35 \pm 4$                                     |
| $Z \to \mu^+ \mu^-$          | $78 \pm 1$ | $99 \pm 0$   | $99\pm0$   | $77 \pm 1$                                     |
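
The last column is labelled as the product of the GEC and TOS columns. A quick consistency check could look like the sketch below. It assumes the entries are efficiencies in percent and uses central values only (uncertainties are ignored); the dictionary layout and names are ours.

```python
# (GEC, TOS, quoted GEC x TOS) central values copied from the table.
rows = {
    "B0 -> K*0 mu+ mu-": (89, 89, 79),
    "B0 -> K*0 e+ e-": (84, 62, 52),
    "Bs0 -> phi phi": (83, 69, 57),
    "Ds+ -> K+ K- pi+": (82, 43, 35),
    "Z -> mu+ mu-": (78, 99, 77),
}

products = {}
for signal, (gec, tos, quoted) in rows.items():
    # Treat both entries as percentages and multiply the efficiencies.
    products[signal] = gec * tos / 100
    print(f"{signal:20s} product: {products[signal]:5.1f}  quoted: {quoted}")

# Largest difference between the recomputed product and the quoted value.
max_diff = max(abs(products[s] - rows[s][2]) for s in rows)
print(f"largest deviation: {max_diff:.2f} percentage points")
```

With these inputs every recomputed product agrees with the quoted value to within one percentage point.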

| Trigger          | Rate [kHz]   |
|------------------|--------------|
| 1-Track          | $215 \pm 18$ |
| 2-Track          | $659 \pm 31$ |
| High- $p_T$ muon | $5 \pm 3$    |
| Displaced dimuon | $74 \pm 10$  |
| High-mass dimuon | $134 \pm 14$ |
| Total            | $999 \pm 38$ |
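
The quoted total is smaller than the sum of the individual lines. One plausible reading, which is an assumption on our part, is that the total counts each event once even when it fires several trigger lines. Under that assumption the difference estimates the overlap between lines. The sketch uses central values only.

```python
# Central values of the per-line rates from the table, in kHz.
rates_khz = {
    "1-Track": 215,
    "2-Track": 659,
    "High-pT muon": 5,
    "Displaced dimuon": 74,
    "High-mass dimuon": 134,
}
total_khz = 999  # quoted total rate

# Naive sum counts an event once per line it fires.
line_sum = sum(rates_khz.values())

# Assumed interpretation: the excess over the total is double counting.
overlap_khz = line_sum - total_khz

print(f"sum of lines: {line_sum} kHz, total: {total_khz} kHz")
print(f"implied overlap: {overlap_khz} kHz")
```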




# **Throughput Evolution Since TDR**

Figure: trigger rate [kHz] versus GPU theoretical 32-bit TFLOPS.

Performance at the time of the TDR; approximately linear scaling with theoretical FLOPS






# Integration

- DAQ:
  - Data input (event-builder layout)
  - Data output (event-by-event layout)
- Avoid large-scale transposition of data
  - Process data in the layout provided by the event builder
    - Batches of 30k grouped by frontend, not by event
  - Process in batches of 1k events
- Steering by the experiment control system
- Error handling and failover
- Detector information such as geometry and alignment
  - Obtain from "regular" LHCb stack on the fly
  - Deal with changing conditions
- Monitoring for data quality and shifters
- HLT1 hardware and processes share the server with event building: keep a close eye on CPU and memory usage
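
The point about avoiding transposition can be made concrete with a toy model of the frontend-grouped layout. The sketch below is an illustration only: the number of frontends, the tuple used as a stand-in for a data fragment and all function names are hypothetical. It shows that a 1k-event processing batch is simply a contiguous range inside each frontend's block, so the data never has to be regrouped event by event.

```python
N_FRONTENDS = 3            # hypothetical; the real system has many more
EVENTS_PER_BATCH = 30_000  # "batches of 30k" from the event builder
SUB_BATCH = 1_000          # events processed together

# Event-builder layout: one block per frontend, holding that frontend's
# fragment for every event in the large batch.
eb_batch = [
    [(fe, ev) for ev in range(EVENTS_PER_BATCH)]
    for fe in range(N_FRONTENDS)
]


def batch_slices(n_events, batch_size):
    """Yield (start, stop) event ranges covering the large batch."""
    for start in range(0, n_events, batch_size):
        yield start, min(start + batch_size, n_events)


def fragments(batch, frontend, start, stop):
    # A processing batch is a contiguous range within one frontend's
    # block; no per-event regrouping across frontends is needed.
    return batch[frontend][start:stop]


slices = list(batch_slices(EVENTS_PER_BATCH, SUB_BATCH))
first = fragments(eb_batch, frontend=2, start=slices[0][0], stop=slices[0][1])
print(len(slices), len(first), first[0], first[-1])
```

An event-by-event layout would instead require gathering one fragment from every frontend block for each event, which is the large-scale transposition the list above says to avoid.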

### **Event Builder Server (Gigabyte G481-Z51)**



#### **Event Builder Server**



# **Throughput Test**

- 10 event builder nodes
- Event builder generates data with scaled-up events
- HLT1 consumes the generated data on a (single) GPU-equipped node
- HLT1 runs without reconstruction and only copies data to and from the GPU
- Frontends in data-generator mode
- Goal: verify that data I/O and throughput requirements are met
- Goal: verify that a single GPU works




Screenshot: Buffer Manager Monitor on TBEB01 (Fri 13 Aug 2021 11:41:31, pid 50767), listing event counts (produced and seen), buffer space and user counts for two input event buffers and one output buffer.

- Test successful: can reach 43 MHz
- Bottleneck: PCIe throughput to GPU
- Important takeaway: NUMA and (device) runtime considerations are very important
- Need more monitoring

# Summary

- GPU HLT1 (Allen) selected as the baseline solution for LHCb
- Hardware has been purchased (RTX A5000)
- Work continues to improve performance, both physics and throughput
- Shifting to commissioning mode: focus on integration, consolidation and testing
- Careful consideration of dataflow, memory and networks is crucial to success
- Larger-scale integration tests a success
- Exciting times ahead!


