USB Bulk Endpoint Throughput in a USB HS Device

Overview

Although USB 2.0 High-Speed signals at 480 Mbps, the actual data throughput depends on the endpoint type and is reduced by USB protocol overhead. While certain USB endpoint types are allocated guaranteed bandwidth or fixed throughput, bulk endpoints are not. Instead, they are scheduled only after all other endpoint transfers have been serviced, using whatever bus bandwidth remains. This document explains the factors that affect the achievable throughput of a bulk endpoint when an RA MCU is used as a USB High-Speed device, and presents the resulting measurements.

About Bulk Endpoints

If the USB HS device is the only one on the bus, with no other USB devices or hubs competing for bandwidth, the following bulk endpoint throughput is typically considered achievable. 

 

| Case | Throughput | Remark |
| --- | --- | --- |
| EHCI specification (Theory) | ~425 Mbps | Up to 13 max-size bulk transactions are permitted per microframe. |
| Practical (Optimistic at best) | ~320 Mbps | Typically, fewer than 10 max-size bulk transactions per microframe are achievable because of scheduling delays and bus idle gaps. |

Although a PC’s USB port may appear to be a direct root port from the host controller, in most cases it is actually connected through an internal USB hub IC, and what we see externally is just one of its downstream ports. Multiple devices, such as webcams, fingerprint readers, and Bluetooth modules, share this hub and compete for bandwidth, often reducing throughput below the Practical (Optimistic at best) value shown in the table above.

Throughput Influencing Factors

  1. USB host scheduling latency

    1. All USB read and write operations are initiated by the USB host, so the throughput of bulk transfers also depends on how the USB host operates. 

    2. The measurement results below compare two USB host software cases, LibUsbDotNet and PyUSB, both of which are based on the WinUSB driver. The results may differ further depending on whether the USB host software uses a synchronous or asynchronous API.

  2. USB FSP using DMA or interrupt

    1. As shown in the measurement results below, DMA achieves much higher performance than interrupt mode. However, when using DMA, the USB peripheral's D0FIFO and D1FIFO can each be assigned to only one pipe until the DMA transfer completes. This restriction means that only one bulk endpoint per IN/OUT direction can be used with DMA at a time.

    2. The current USB FSP driver does not support configuring two or more bulk endpoints in the same direction, with one using DMA mode and the other using interrupt mode.

  3. FIFO size allocated for the bulk pipe

    1. Pipes 1 to 5 can be used for bulk endpoints, and their FIFO size (PIPEBUF.BUFSIZE[4:0]) can be configured up to 0x1F (2 Kbytes). When Double Buffer mode (PIPECFG.DBLB) is enabled, this gives the pipe a 4-Kbyte FIFO. Note that the total FIFO capacity is limited to 8.5 Kbytes and is shared across all pipes.

    2. The measurement results below compare cases where the FIFO size (PIPEBUF.BUFSIZE[4:0]) is set to the default 512 bytes or to 2048 bytes.

  4. User buffer size allocated for the bulk transfer

    1. If the total data to be transferred by bulk is 1 Mbyte and sufficient user memory is available, a 1 Mbyte buffer can be used. However, if a smaller buffer is allocated due to memory limitations, the smaller the buffer, the lower the throughput will be.

  5. USB device scheduling latency

    1. For example, with a FIFO size of 4 Kbytes and a user buffer size of 16,384 bytes, two kinds of delay can be observed on the USB bus: one each time the 4-Kbyte FIFO becomes full, and another when the 16,384-byte transfer finishes and is re-armed. The second delay, re-arming a USB bulk transfer, can vary considerably depending on the RTOS being used, the way its resources are managed, the synchronization method of the data transfer thread, the USB device class stack employed, and so on.

[Figure: image-2025-8-29_9-26-59.png]

Measurement results

  • Hardware

    • EK-RA6M5, 200MHz CPU clock, USB High-speed port

  • Software

    • FreeRTOS (v11.1.0) + PVND class stack

    • e2 studio (2025-07), FSP (6.0.0), GCC (14.3.1) 

    • Continuous Transfer mode (PIPECFG.CNTMD)=ON

    • Double Buffer mode (PIPECFG.DBLB)=ON

  • Others

    • Transfer size=1,000,000 bytes

    • Direct connection to USB host (no USB hubs or other devices in between)

    • USB host=Windows PC

Test code used:

    uint8_t g_buf[BUF_SIZE];

    /* OUT (host -> device): */
    R_USB_PipeRead(&g_basic_ctrl, g_buf, BUF_SIZE, bulk_out_pipe);
    /* or IN (device -> host): */
    R_USB_PipeWrite(&g_basic_ctrl, g_buf, BUF_SIZE, bulk_in_pipe);
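To keep the re-arming delay from factor 5 small, the next transfer is typically issued from the USB event callback as soon as the previous one completes. The following is a minimal device-side sketch, assuming the FSP r_usb_basic peripheral API with an RTOS callback and a vendor-class bulk OUT pipe; names such as g_basic_ctrl and bulk_out_pipe follow the test code above:

```c
/* Sketch: re-arm the bulk OUT pipe immediately on read completion.
 * Assumes FSP r_usb_basic in peripheral (PVND) mode; not a complete application. */
static uint8_t g_buf[BUF_SIZE];

void usb_event_callback(usb_event_info_t * p_event, usb_hdl_t handle, usb_onoff_t state)
{
    (void) handle;
    (void) state;

    if ((USB_STATUS_READ_COMPLETE == p_event->event) &&
        (bulk_out_pipe == p_event->pipe))
    {
        /* p_event->data_size bytes have arrived in g_buf. Hand them off
         * quickly (e.g. queue a pointer to a worker thread), then re-arm
         * at once so the host sees as little NAK/idle time as possible. */
        R_USB_PipeRead(&g_basic_ctrl, g_buf, BUF_SIZE, bulk_out_pipe);
    }
}
```

Deferring heavy processing to another thread and re-arming inside the callback shortens the second delay described in factor 5; the first delay (FIFO drain) is bounded by the FIFO size and DMA/interrupt mode.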

 

Data-only bandwidth (Mbps) measured for OUT (host->device) and IN (device->host) operations:

| Mode | FIFO size (bytes) | Compiler optimization | User buffer size (BUF_SIZE, bytes) | OUT, LibUsbDotNet host | OUT, PyUSB host | IN, LibUsbDotNet host | IN, PyUSB host |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Interrupt | 512 | None (-O0) | 8192 | 27 | 29 | 30 | 31 |
| Interrupt | 512 | More (-O2) | 8192 | 64 | 70 | 65 | 68 |
| Interrupt | 512 | More (-O2) | 16384 | 71 | 74 | 67 | 71 |
| Interrupt | 2048 | None (-O0) | 8192 | 34 | 37 | 41 | 42 |
| Interrupt | 2048 | More (-O2) | 8192 | 77 | 87 | 93 | 98 |
| Interrupt | 2048 | More (-O2) | 16384 | 96 | 97 | 98 | 104 |
| DMA | 512 | None (-O0) | 8192 | 110 | 132 | 121 | 139 |
| DMA | 512 | More (-O2) | 8192 | 143 | 162 | 132 | 156 |
| DMA | 512 | More (-O2) | 16384 | 175 | 201 (Note 1) | 148 | 172 |
| DMA | 2048 | None (-O0) | 8192 | 120 | 126 | 138 | 150 |
| DMA | 2048 | More (-O2) | 8192 | 138 (Note 3) | 153 (Note 3) | 155 | 179 |
| DMA | 2048 | More (-O2) | 16384 | 175 | 196 | 176 | 219 (Note 2) |

Note 1. Best OUT (host->device) result, obtained with FIFO size=512, compiler optimization=More (-O2), the larger BUF_SIZE, and the PyUSB host.

Note 2. Best IN (device->host) result, obtained with FIFO size=2048, compiler optimization=More (-O2), the larger BUF_SIZE, and the PyUSB host.

Note 3. A FIFO size of 2048 bytes yields lower throughput than 512 bytes in this case, as more idle packets were observed on the bus, causing delays.

[Figure: image-2025-8-28_17-41-5.png]