InMemQK: A Product Quantization Based MatMul Module for Compute-in-Memory Attention Macro

Large Language Models (LLMs), based on transformer architecture, have demonstrated remarkable capabilities in natural language processing tasks, enabling machines to generate human-like text and engage in meaningful dialogues. However, the exponential increase in model parameters has led to limitati...

Full description

Saved in:
Bibliographic Details
Main Authors: Pengcheng Feng, Yihao Chen, Jinke Yu, Hao Yue, Zhelong Jiang, Yi Xiao, Wan’ang Xiao, Huaxiang Lu, Gang Chen
Format: Article
Language:English
Published: MDPI AG 2024-12-01
Series:Applied Sciences
Subjects:
Online Access:https://www.mdpi.com/2076-3417/14/23/11198
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Large Language Models (LLMs), based on transformer architecture, have demonstrated remarkable capabilities in natural language processing tasks, enabling machines to generate human-like text and engage in meaningful dialogues. However, the exponential increase in model parameters has led to limitations in inference speed and energy efficiency. Compute-in-memory (CIM) technology offers a promising solution to accelerate AI inference by performing analog computations directly within memory, potentially reducing latency and power consumption. At the same time, CIM has been successfully applied to accelerate Convolutional Neural Networks (CNNs); however, the matrix–matrix multiplication (MatMul) operations inherent in the scaled dot-product attention of the transformer present unique challenges for direct CIM implementation. In this work, we propose InMemQK, a compute-in-memory-based attention accelerator that focuses on optimizing MatMul operations through software and hardware co-design. At the software level, InMemQK employs product quantization (PQ) to eliminate data dependencies. At the hardware level, InMemQK integrates energy-efficient time-domain MAC macros for ADC-free computations. Experimental results show InMemQK achieves 13.2×–13.9× lower power consumption than existing CIM-based accelerators.
ISSN:2076-3417