Ervin Tasnadi’s blog

GPU programming

Nanobenchmarking: cycle accurate benchmarking of CUDA kernels

Dec 3, 2025

CUDA, GPU programming

CUDA, GPU, Microbenchmarking

This post focuses on accurately measuring the number of cycles needed to execute a particular CUDA device code snippet. We will use the clock() function for the measurement and focus on adjusting the compiled device code with an assembler to get accurate results. Methodology: We measure the latency using CUDA’s clock()…
Memory efficient Scaled Dot Product Attention (SDPA) with Tensor Cores acceleration implemented in Vulkan

Jan 19, 2025

Deep learning, GPU programming, Uncategorized

ai, attention, FlashAttention, FlashAttention-2, GLSL, machine-learning, Scaled Dot Product Attention, SDPA, Vulkan

I recently uploaded an implementation of the forward pass of a memory-efficient attention algorithm (FlashAttention-2 (Dao et al., 2023)) using Vulkan compute and the VK_KHR_cooperative_matrix extension, which lets it use Tensor Cores or equivalent hardware to accelerate matrix-matrix multiplications. In this post I will go into the details. Background: The goal of this project is to…