Artificial Intelligence (AI) and Machine Learning (ML) are key drivers in pushing the frontiers of technology and transforming our society. AI/ML has led to significant progress in several domains, including self-driving cars, robotics, and wearable healthcare devices. However, AI/ML algorithms must process ever-growing amounts of data for low-latency decision making while constrained by the limited resources available at the edge.
Heterogeneous architectures consisting of CPUs and FPGAs have become popular because they combine the general-purpose processing power of CPUs with the energy-efficient, fine-grained parallelism of FPGAs. Developing energy-efficient Reinforcement Learning (RL) accelerators targeting these platforms requires a deep understanding of RL algorithms as well as the features of the platform. Challenges such as conflicts in parallel access to shared objects, irregular memory accesses, low data reuse, and load imbalance adversely impact performance and require innovative optimizations. We focus on developing novel techniques to accelerate AI/ML workflows on heterogeneous architectures, including Deep Neural Network compression, memory layout and access pattern optimization, load-balancing strategies, and optimizations that exploit the high available memory bandwidth.
AREAS OF INTEREST:
AI/ML acceleration, ML Compression, DNN Compression, Graph ML, Memory Optimization
Deep Neural Network Compression
Convolutional Neural Networks (CNNs) are a popular choice for FPGA acceleration due to their widespread utility in tasks such as classification, detection, and segmentation. Because a CNN contains a large number of potentially redundant operations, compressing the model can significantly reduce memory and computation overheads, making compression a widely accepted technique for improving efficiency.
We have focused on increasing performance by designing a low-latency, low-bandwidth, spectral sparse CNN accelerator on FPGAs. We have analyzed the bandwidth-storage tradeoff of sparse convolutional layers and located the communication bottlenecks. We have also developed a dataflow that flexibly optimizes data reuse in different layers to minimize off-chip communication, along with novel scheduling algorithms that optimally schedule the on-chip memory accesses of multiple sparse kernels and minimize read conflicts.
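To make the idea of a sparse spectral kernel concrete, the following is a minimal sketch, assuming that "spectral" refers to computing convolution as an element-wise product in the frequency domain. It is not the accelerator's dataflow. The 1D signals, the naive DFT, the magnitude-based pruning rule, and all function names (`dft`, `prune_spectrum`, `spectral_conv`) are illustrative assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_conv(x, w):
    """Reference spatial-domain circular convolution."""
    n = len(x)
    return [sum(x[(i - j) % n] * w[j] for j in range(n)) for i in range(n)]


def prune_spectrum(w_spec, keep):
    """Keep the `keep` largest-magnitude spectral coefficients.

    Returns a dict {frequency index: coefficient}, so only nonzero
    entries are stored. Magnitude pruning is an assumed criterion.
    """
    ranked = sorted(range(len(w_spec)),
                    key=lambda k: abs(w_spec[k]), reverse=True)
    return {k: w_spec[k] for k in ranked[:keep]}


def spectral_conv(x, w_spec_sparse):
    """Convolve x with a kernel given by its sparse spectrum.

    One complex multiply is performed per stored coefficient, so a
    sparser kernel means fewer multiplies and less kernel storage.
    """
    x_spec = dft(x)
    y_spec = [0j] * len(x)
    for k, coeff in w_spec_sparse.items():
        y_spec[k] = x_spec[k] * coeff
    # Pruning may break conjugate symmetry; this sketch keeps the real part.
    return [v.real for v in dft(y_spec, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
w = [1.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
dense_kernel = prune_spectrum(dft(w), keep=8)   # nothing pruned
sparse_kernel = prune_spectrum(dft(w), keep=4)  # half the coefficients
y_dense = spectral_conv(x, dense_kernel)
y_sparse = spectral_conv(x, sparse_kernel)      # an approximation
```

With all coefficients kept, `y_dense` matches `circular_conv(x, w)` up to floating-point error. With `keep=4`, the result is an approximation that needs half as many kernel multiplies. On hardware, the irregular positions of the surviving coefficients are one reason concurrent reads of multiple sparse kernels can conflict.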
Accelerating Graph Convolutional Neural Networks (GCN)
Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. Accelerating GCN training is challenging because of the substantial and irregular data communication needed to propagate information within the graph and the intensive computation needed to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems by incorporating multiple algorithm-architecture co-optimizations. We analyze the computation and communication characteristics of various GCN training algorithms and select a subgraph-based algorithm that is well suited for hardware execution. To optimize feature propagation within subgraphs, we use a lightweight pre-processing step based on a graph-theoretic approach. This pre-processing, performed on the CPU, significantly reduces both the memory accesses and the computation required on the FPGA. To accelerate the weight update in GCN layers, we develop a systolic array-based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline and analyze its load balance and resource utilization through accurate performance modeling.
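The two kinds of work in a GCN layer, feature propagation within a subgraph and the dense weight transformation, can be sketched in a few lines. This is a minimal software illustration and not the accelerator design. Mean aggregation with self-loops, the ReLU activation, and the helper names `induced_subgraph` and `gcn_layer` are assumptions made for the example, and the graph-theoretic pre-processing step is not modeled here.

```python
def induced_subgraph(adj, nodes):
    """Return the adjacency lists of the subgraph induced by `nodes`.

    `adj` is a list of neighbor lists for the full graph. Vertices are
    renumbered 0..len(nodes)-1 in the order given by `nodes`.
    """
    index = {v: i for i, v in enumerate(nodes)}
    return [[index[u] for u in adj[v] if u in index] for v in nodes]


def gcn_layer(adj, features, weights):
    """One GCN layer: aggregate neighbor features, then transform.

    Step 1 (feature propagation): irregular, memory-bound accesses that
    follow the graph structure. Mean over neighbors plus the vertex
    itself is an assumed normalization.
    Step 2 (weight transformation): a regular dense matrix product,
    followed by ReLU.
    """
    dim_in = len(features[0])
    dim_out = len(weights[0])
    aggregated = []
    for v in range(len(adj)):
        neighborhood = adj[v] + [v]  # add a self-loop
        aggregated.append([
            sum(features[u][d] for u in neighborhood) / len(neighborhood)
            for d in range(dim_in)
        ])
    return [
        [max(0.0, sum(row[d] * weights[d][j] for d in range(dim_in)))
         for j in range(dim_out)]
        for row in aggregated
    ]


# A path graph 0-1-2-3 with one feature per vertex.
full_adj = [[1], [0, 2], [1, 3], [2]]
full_features = [[1.0], [2.0], [3.0], [4.0]]

sampled = [0, 1, 2]                        # vertices of the subgraph
sub_adj = induced_subgraph(full_adj, sampled)
sub_features = [full_features[v] for v in sampled]
weights = [[1.0, -1.0]]                    # 1 input dim, 2 output dims
hidden = gcn_layer(sub_adj, sub_features, weights)
```

Restricting propagation to a sampled subgraph bounds the working set, which is what makes the aggregation step a candidate for on-chip buffering. The dense transformation in step 2 is the kind of regular computation that maps naturally onto a systolic array.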
Accelerating Reinforcement Learning
Reinforcement Learning (RL) is an area of AI that comprises a wide range of algorithms spanning the Observe, Orient, Decide, and Act phases of autonomous agents. Recently, certain classes of RL algorithms, such as policy gradient methods and Q-Learning based methods, have found widespread success in a variety of applications. We focus on accelerating these algorithms by developing novel techniques for parallel neural network inference, high-throughput training, and data streaming.
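As a reference point for the Q-Learning family mentioned above, here is a minimal sketch of the standard tabular Q-Learning update. It is a simplification: the work described here targets neural-network-based methods, whereas this example stores Q-values in a plain table. The function name `q_update` and the values chosen for the learning rate `alpha` and discount factor `gamma` are assumptions for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning update in place.

    q[s][a] is the estimated return of taking action a in state s.
    The estimate moves toward the temporal-difference target
    reward + gamma * max_a' q[next_state][a'] by a step of size alpha.
    """
    best_next = max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Two states, two actions each.
q_table = [[0.0, 0.0],
           [1.0, 2.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.5)
```

Each update reads a row of Q-values for the next state and writes a single entry for the current one. When many agents or environment instances update shared values in parallel, these reads and writes are one source of the shared-object access conflicts noted earlier.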
The following papers may have copyright restrictions. Downloads will have to adhere to these restrictions. They may not be reposted without explicit permission from the copyright holder.