Connecting an FPGA accelerator to the Raspberry Pi 5
Since the first version of the Raspberry Pi was released in 2012, four more versions of this Single Board Computer (SBC) have been released. The fourth version came with a new feature that increased the power of this SBC: it was the first with a PCIe interface (only on the CM4 variant), although we had to wait for the fifth version, the Raspberry Pi 5, to have that feature on the regular board. Having a PCIe interface accessible to the processor means a lot of things. First, we can use NVMe drives as the primary storage unit, and the (limited) use of an external graphics card is now possible. Notice that I used the word limited. The reason is that the PCIe interface available in the Raspberry Pi 5 has just one lane and can achieve up to 8 GT/s, so what we have available is a PCIe Gen3 x1 link, which is close to 1 GB/s of raw bandwidth. Is this slow? Not at all: we can use this interface for many different things and, this being a blog about FPGAs, that is exactly what we will use it for.
For this article, I am going to use the Litefury board. This board features an Artix-7 XC7A100T in an M.2 form factor. To connect the board to the Raspberry Pi, I will use the Pineberry HAT Drive Bottom. This HAT takes the PCIe x1 connector of the Raspberry Pi 5 (RP5) and routes it to an M.2 socket. It is intended for SSDs, but it works well with the Litefury. When I started this article, I was concerned about the power consumption of the Litefury, but I was able to use the board without any external supply. As the operating system for the Raspberry Pi, I used the Raspberry Pi OS server edition.
When we power on the RP5 for the first time, we need to modify the config.txt file to enable the PCIe interface. This file is located at /boot/firmware/config.txt. To enable the PCIe interface, we need to add the following line.
# Enable the PCIe external connector.
dtparam=pciex1
This configuration enables the PCIe interface, but with the speed limited to 5 GT/s (Gen2). By adding another parameter we can increase the speed to 8 GT/s (Gen3). For this project that is not necessary, because the Artix-7 used in the Litefury can only be configured for Gen2, so the default will be enough.
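For reference, this is the parameter that would force the Gen3 speed; I did not use it here, and its exact behaviour may depend on the firmware version.
# Optional: increase the PCIe link speed to 8 GT/s (Gen3). Not needed for this project.
dtparam=pciex1_gen=3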
Once the RP5 is configured, it is time to create the FPGA design. When the project for the XC7A100T-2LFGG484 part is created, and a new block design is added to the project, we need to add the DMA/Bridge Subsystem for PCI Express IP.
When this IP is added to the block design, we can configure it through a wizard. The configuration is the one you can see in the next figure. We are going to use just one PCIe lane at 5 GT/s, and the DMA interface will be set to AXI Stream, with just one channel for read and one for write.
When the basic configuration is done, a buffer for the PCIe clock will be added automatically. Now that the basic configuration of the IP is finished, we can go into the configuration of the IP and change other parameters, like the PCI Class Code, which will be set to 120000 (the class for processing accelerators). The rest of the configuration can be left at its defaults.
Now it is time to add some other IP to the design. The project will be very simple. We are going to use the AXI Stream interface to write to and read from an AXI4-Stream Data FIFO. Then, the AXI4-Lite Master interface will be connected to an AXI GPIO IP that controls the four LEDs of the board. Also, we need to add the port pcie_clkreq_l and assign it a value of ‘0’. The entire block design is shown in the next figure.
Regarding the constraints, we can divide them into two different files and change their processing order. The reason is that the Litefury does not use the default pins for the PCIe lanes, so when the DMA/Bridge Subsystem for PCI Express IP is synthesized, it generates its own constraints, which are added to the project. Those constraints are not valid for the Litefury, so we need to override them after Vivado implements the IP. To summarize, we need to be sure that our constraints are processed after all the predefined constraints. UG896 states that the order in which XDC files are processed is as follows:
- User XDC set to EARLY
- IP XDC set to EARLY
- User XDC set to NORMAL (default)
- IP XDC set to LATE
- User XDC set to LATE
So we need to set our constraints to LATE, so they are read last and are the ones used in the implementation. We can do this by executing this command:
set_property PROCESSING_ORDER LATE [get_files -all ./raspberrypi_litefury/src/late_litefury.xdc]
The constraints that have to be processed last are the ones with the PCIe lane locations.
# Late constraints
# PCIe lane 0
set_property PACKAGE_PIN A10 [get_ports {pcie_mgt_rxn[0]}]
set_property PACKAGE_PIN B10 [get_ports {pcie_mgt_rxp[0]}]
set_property PACKAGE_PIN A6 [get_ports {pcie_mgt_txn[0]}]
set_property PACKAGE_PIN B6 [get_ports {pcie_mgt_txp[0]}]
# PCIe refclock
set_property PACKAGE_PIN F6 [get_ports {pcie_clkin_clk_p[0]}]
set_property PACKAGE_PIN E6 [get_ports {pcie_clkin_clk_n[0]}]
# Other PCIe signals
set_property PACKAGE_PIN G1 [get_ports {pcie_clkreq_l[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {pcie_clkreq_l[0]}]
set_property PACKAGE_PIN J1 [get_ports pcie_reset]
set_property IOSTANDARD LVCMOS33 [get_ports pcie_reset]
Also, there will be another set of constraints for the LEDs and some instructions to reduce the configuration time.
## Normal constraints
# LEDs
set_property PACKAGE_PIN G3 [get_ports {leds[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[0]}]
set_property PULLUP true [get_ports {leds[0]}]
set_property DRIVE 8 [get_ports {leds[0]}]
set_property PACKAGE_PIN H3 [get_ports {leds[1]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[1]}]
set_property PULLUP true [get_ports {leds[1]}]
set_property DRIVE 8 [get_ports {leds[1]}]
set_property PACKAGE_PIN G4 [get_ports {leds[2]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[2]}]
set_property PULLUP true [get_ports {leds[2]}]
set_property DRIVE 8 [get_ports {leds[2]}]
set_property PACKAGE_PIN H4 [get_ports {leds[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds[3]}]
set_property PULLUP true [get_ports {leds[3]}]
set_property DRIVE 8 [get_ports {leds[3]}]
# Input reset is resynchronized within FPGA design as necessary
set_false_path -from [get_ports pcie_reset]
# Power down on overtemp
set_property BITSTREAM.CONFIG.OVERTEMPPOWERDOWN ENABLE [current_design]
# High-speed configuration so FPGA is up in time to negotiate with PCIe root complex
set_property BITSTREAM.CONFIG.EXTMASTERCCLK_EN Div-1 [current_design]
set_property BITSTREAM.CONFIG.SPI_BUSWIDTH 4 [current_design]
set_property CONFIG_MODE SPIx4 [current_design]
set_property BITSTREAM.CONFIG.SPI_FALL_EDGE YES [current_design]
set_property BITSTREAM.GENERAL.COMPRESS TRUE [current_design]
set_property CONFIG_VOLTAGE 3.3 [current_design]
set_property CFGBVS VCCO [current_design]
Now, we are ready to generate the bitstream, and the FPGA design is finished.
We can now come back to the RP5. To talk with the Litefury through PCIe, we are going to use the dma_ip_drivers from Xilinx. The problem is that these drivers are developed for x86 processors, so ARM processors, like the one in the RP5, are not supported, at least in principle. Fortunately, Laurence Barker discovered what makes them incompatible with ARM processors, and in his GitHub repository we can find a version of the drivers with some modifications that make them compatible with ARM processors. The changes are very well documented in the readme file, so what I did was download the original drivers from the Xilinx repository and apply just the first change. Then I installed the drivers following the instructions in the Xilinx repository, and everything went fine.
To test the drivers we can execute the script run_test.sh, but it did not work for me (it did not work on the x86 processor I used before either), so I executed the test directly using this command.
pablo@raspberrypi:~/dma_ip_drivers/XDMA/linux-kernel/tests $ sudo bash ./dma_memory_mapped_test.sh xdma0 1024 1 1 1
The test finished well, so I assume the drivers are installed correctly, and indeed, when I list the devices, all the xdma0 devices are shown.
pablo@raspberrypi:~ $ ls /dev
autofs gpiochip4 mapper ram12 stdin tty25 tty44 tty63 vcsa2 video24 xdma0_events_11
block gpiomem0 media0 ram13 stdout tty26 tty45 tty7 vcsa3 video25 xdma0_events_12
btrfs-control gpiomem1 media1 ram14 tty tty27 tty46 tty8 vcsa4 video26 xdma0_events_13
bus gpiomem2 media2 ram15 tty0 tty28 tty47 tty9 vcsa5 video27 xdma0_events_14
cachefiles gpiomem3 mem ram2 tty1 tty29 tty48 ttyAMA10 vcsa6 video28 xdma0_events_15
cec0 gpiomem4 mmcblk0 ram3 tty10 tty3 tty49 ttyprintk vcsu video29 xdma0_events_2
cec1 hwrng mmcblk0p1 ram4 tty11 tty30 tty5 uhid vcsu1 video30 xdma0_events_3
char initctl mmcblk0p2 ram5 tty12 tty31 tty50 uinput vcsu2 video31 xdma0_events_4
console input mqueue ram6 tty13 tty32 tty51 urandom vcsu3 video32 xdma0_events_5
cpu_dma_latency kmsg net ram7 tty14 tty33 tty52 v4l vcsu4 video33 xdma0_events_6
cuse kvm null ram8 tty15 tty34 tty53 vc-mem vcsu5 video34 xdma0_events_7
disk log port ram9 tty16 tty35 tty54 vcio vcsu6 video35 xdma0_events_8
dma_heap loop-control ppp random tty17 tty36 tty55 vcs vga_arbiter video36 xdma0_events_9
dri loop0 pps0 rfkill tty18 tty37 tty56 vcs1 vhci video37 xdma0_h2c_0
fd loop1 ptmx rtc tty19 tty38 tty57 vcs2 vhost-net watchdog xdma0_user
full loop2 ptp0 rtc0 tty2 tty39 tty58 vcs3 vhost-vsock watchdog0 xdma0_xvc
fuse loop3 pts serial0 tty20 tty4 tty59 vcs4 video19 xdma0_c2h_0 zero
gpiochip0 loop4 ram0 shm tty21 tty40 tty6 vcs5 video20 xdma0_control
gpiochip1 loop5 ram1 snd tty22 tty41 tty60 vcs6 video21 xdma0_events_0
gpiochip2 loop6 ram10 spidev10.0 tty23 tty42 tty61 vcsa video22 xdma0_events_1
gpiochip3 loop7 ram11 stderr tty24 tty43 tty62 vcsa1 video23 xdma0_events_10
Now, it is time to put the RP5 and the Litefury together and see what happens. When I turned on the system for the first time, I listed the PCI devices and a processing accelerator appeared. At this point, I had not yet loaded my design into the FPGA, so what we are seeing is the example design that comes in the QSPI memory of the Litefury.
pablo@raspberrypi:~ $ lspci
0000:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712 (rev 21)
0000:01:00.0 Processing accelerators: Xilinx Corporation Device 7021
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712 (rev 21)
0001:01:00.0 Ethernet controller: Device 1de4:0001
To load my design, I used openFPGALoader installed on the RP5 (loading the design through PCIe is left for another article), so the Digilent HS3 is connected to the RP5 and I can manage the load remotely. openFPGALoader can be installed from the official repositories.
pablo@raspberrypi:~ $ sudo apt install openFPGALoader
Now, since the bitstream was generated on my host computer, I need to send it to the RP5 using the scp command.
pablo@friday:/litefury_rp5/litefury_rp5.runs/impl_1$ scp ./litefury_rp5_bd_wrapper.bit [email protected]:
[email protected]'s password:
litefury_rp5_bd_wrapper.bit 100% 1770KB 97.9MB/s 00:00
And finally, I can load the design on the Litefury using the following command.
pablo@raspberrypi:~ $ openFPGALoader -c digilent_hs3 ./litefury_rp5_bd_wrapper.bit
empty
Jtag frequency : requested 6.00MHz -> real 6.00MHz
Open file DONE
Parse file DONE
load program
Load SRAM: [==================================================] 100.00%
Done
At this point, we have everything we need to send and receive packets through PCIe. I am going to use a Jupyter Notebook to develop the application that manages those transactions. To use a Jupyter notebook on the RP5 from another computer, we need to redirect the HTTP port that Jupyter will create. To do this, we need to access the RP5 over SSH and redirect the address localhost:8080 of the remote to port 8080 of the host. The way to do that is the following.
~$ ssh -L 8080:localhost:8080 [email protected]
When the SSH connection is established, we can open the Jupyter notebook on the remote computer as sudo. Also, remember to change the port to the one redirected to the host.
~$ sudo jupyter-notebook --no-browser --port=8080 --allow-root
Then, opening the URL localhost:8080 on the host computer, we will see the Jupyter Notebook webpage of the RP5, so everything executed in this notebook is actually executed on the RP5. (An FPGA engineer needs to know about everything.)
Now, the Python application. First of all, some imports. The important ones are os and struct; the rest are specific to my application.
import os
import struct
from scipy import signal as sig
import numpy as np
from matplotlib import pyplot as plt
Next, we need to define and open the devices where we are going to write and read. Since I only used one channel in the DMA, only the *_0 channels are available.
# Open devices for DMA operation
xdma_axis_rd_data = os.open('/dev/xdma0_c2h_0',os.O_RDONLY)
xdma_axis_wr_data = os.open('/dev/xdma0_h2c_0',os.O_WRONLY)
Now, I generate a sine signal with 256 points (the FIFO depth is 512), and using the command os.pwrite I will send the points to the Litefury. There is an extra step that uses the struct.pack command. The width of the DMA channels is 64 bits, so we need to send packets with that width. With the pack command, we convert the list of integers into a bytes object containing 256 little-endian 64-bit values (‘<256Q’), that is, 2048 bytes.
# writing a sine signal on DMA channel
nSamples = 256
angle = np.linspace(0,2*np.pi,nSamples, endpoint=False)
sig = np.sin(2*angle)*100
sig_int = sig.astype(int)+100
# Write DMA data channel
data = struct.pack('<256Q', *sig_int)
os.pwrite(xdma_axis_wr_data,data,0)
When this command is executed, it returns the number of bytes written, which is 256x8.
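If we want to verify this, we can capture the return value, replacing the last line of the previous block with something like this sketch (same names as above).
# os.pwrite returns the number of bytes actually written
bytes_written = os.pwrite(xdma_axis_wr_data,data,0)
print(bytes_written) # expected 2048, that is, 256 samples x 8 bytes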
Then, we are going to perform the opposite operation: read 256x8 bytes from the read channel. The last zero is the offset.
# Read DMA data channel
data_rd = os.pread(xdma_axis_rd_data,2048,0)
data_unpack = struct.unpack('<256Q',data_rd)
plt.plot(data_unpack)
As we did when sending, we need to unpack the data to obtain a list of integers again. If everything works, we can plot the data sent and the data received, and they will be the same.
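A minimal check of the loopback, assuming the variable names used above (sig_int holds the data sent and data_unpack the data received).
# Compare the samples sent with the samples received
print(np.array_equal(np.array(data_unpack),sig_int)) # expected: True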
Also, we enabled an extra AXI4-Lite Master interface that can be accessed through the device /dev/xdma0_user. The procedure is the same, but this time we need to use the offset of the pwrite command to write to a specific AXI4 address. Remember that this interface is connected to an AXI GPIO, so writing at offset 0 sets the value of the output.
# Open User channel for DMA operation
xdma_user_wr = os.open('/dev/xdma0_user',os.O_WRONLY)
# Write DMA User channel
leds = 10
data = struct.pack('<Q',(15-leds)) # leds status are negated
os.pwrite(xdma_user_wr,data,0)
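As a usage example, since the LED status is negated, writing a zero turns on the four LEDs (same device handle and imports as in the previous block).
# Turn on all the LEDs (the status is negated, so write 0)
os.pwrite(xdma_user_wr,struct.pack('<Q',0),0)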
If you have arrived at this point, you can send bursts of data from the Raspberry Pi 5 to an FPGA and read them back, which means that you can execute algorithms over data read by the Raspberry Pi 5 using hardware; ergo, you can do hardware acceleration. The power of the Raspberry Pi has increased over the different versions of the board, but improvements like an accessible PCIe interface are a big step, not to connect an RTX 4090, which is useful mostly for research, but because adding hardware accelerators can open the door to using these SBCs as powerful computation devices since, many times, in hardware acceleration the processor is used just to move data. The question "is this like a Zynq device?" has an easy answer: no. On Zynq devices, the FPGA and the processor are tightly integrated, with many different communication channels, but we have to keep in mind that the cost of this setup is less than 200 euros, including the Litefury. As always, whether this kind of setup is useful depends on your application.
This project is available on GitHub.