Li Li's Blog

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 11: Wrapping C++ Code with Cython

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the twelfth and final article in the series; it uses Cython to wrap the previous C++ code into an extension module that Python can call.

Posted by lili on September 25, 2025

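Hands-On Implementation and Optimization of BPE Tokenizer Training, Part 11: Wrapping C++ Code with Cython (Chinese version)

This series of articles completes a subtask of Stanford CS336 Assignment 1: implementing an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm's training time on OpenWebText dropped from more than 10 hours to under 10 minutes. The series explains this optimization process, including algorithmic optimization, data structure optimization, parallelization with OpenMP, Cython optimization, implementing key code in C++, and integrating the C++ library via Cython. This is the twelfth and final article; it uses Cython to wrap the earlier C++ code into an extension module that Python can call.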

Posted by lili on September 25, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 10: Using Cython and PyPy for Acceleration

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This article, the eleventh in the series, will cover using Cython and PyPy to accelerate Python code.

Posted by lili on September 24, 2025

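Hands-On Implementation and Optimization of BPE Tokenizer Training, Part 10: Using Cython and PyPy for Acceleration (Chinese version)

This series of articles completes a subtask of Stanford CS336 Assignment 1: implementing an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm's training time on OpenWebText dropped from more than 10 hours to under 10 minutes. The series explains this optimization process, including algorithmic optimization, data structure optimization, parallelization with OpenMP, Cython optimization, implementing key code in C++, and integrating the C++ library via Cython. This is the eleventh article; it uses Cython and PyPy to accelerate Python code.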

Posted by lili on September 24, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 9: Using a Heap to Find the Maximum Pair

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the tenth article, which replaces the search for the maximum pair with a heap data structure to improve performance.

Posted by lili on September 21, 2025
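As a rough illustration of the idea summarized in this entry, the sketch below keeps pair counts in a heap so that the most frequent pair can be popped without scanning every pair. It is a minimal sketch and not the article's code: the function names `build_heap`, `update_count`, and `pop_max_pair` are hypothetical, the lazy-deletion strategy is an assumption, and ties are broken by plain tuple ordering rather than by any rule the assignment may require.

```python
import heapq


def build_heap(pair_counts):
    """Build a max-heap (via negated counts) from a {pair: count} dict."""
    heap = [(-count, pair) for pair, count in pair_counts.items()]
    heapq.heapify(heap)
    return heap


def update_count(heap, pair_counts, pair, delta):
    """Change a pair's count and push a fresh heap entry.

    Old entries for the same pair are left in the heap and become stale;
    they are discarded later by pop_max_pair (lazy deletion).
    """
    new_count = pair_counts.get(pair, 0) + delta
    if new_count > 0:
        pair_counts[pair] = new_count
        heapq.heappush(heap, (-new_count, pair))
    else:
        pair_counts.pop(pair, None)


def pop_max_pair(heap, pair_counts):
    """Pop the most frequent pair, skipping stale heap entries.

    An entry is stale when its recorded count no longer matches pair_counts.
    Returns (pair, count), or None when no valid entry remains.
    """
    while heap:
        neg_count, pair = heapq.heappop(heap)
        if neg_count != 0 and pair_counts.get(pair, 0) == -neg_count:
            return pair, -neg_count
    return None


if __name__ == "__main__":
    counts = {(b"a", b"b"): 5, (b"b", b"c"): 3, (b"c", b"d"): 7}
    heap = build_heap(counts)
    update_count(heap, counts, (b"c", b"d"), -5)  # (c, d) drops from 7 to 2
    print(pop_max_pair(heap, counts))
```

Each pop and push costs O(log n) in the heap size, at the price of stale entries that accumulate in the heap until they are popped and discarded.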

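Hands-On Implementation and Optimization of BPE Tokenizer Training, Part 9: Using a Heap to Find the Maximum Pair (Chinese version)

This series of articles completes a subtask of Stanford CS336 Assignment 1: implementing an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm's training time on OpenWebText dropped from more than 10 hours to under 10 minutes. The series explains this optimization process, including algorithmic optimization, data structure optimization, parallelization with OpenMP, Cython optimization, implementing key code in C++, and integrating the C++ library via Cython. This is the tenth article; it uses a heap data structure in place of the search for the maximum pair to improve performance.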

Posted by lili on September 21, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 8: Implementing Fine-Grained Updates

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the ninth article, which focuses on optimizing the update process for data structures like pair_counts.

Posted by lili on September 19, 2025

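Hands-On Implementation and Optimization of BPE Tokenizer Training, Part 8: Implementing Fine-Grained Updates (Chinese version)

This series of articles completes a subtask of Stanford CS336 Assignment 1: implementing an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm's training time on OpenWebText dropped from more than 10 hours to under 10 minutes. The series explains this optimization process, including algorithmic optimization, data structure optimization, parallelization with OpenMP, Cython optimization, implementing key code in C++, and integrating the C++ library via Cython. This is the ninth article; it optimizes the update process for data structures such as pair_counts.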

Posted by lili on September 19, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 7: Using Flat Hash Map instead of std::unordered_map

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the eighth article in the series, focusing on using a flat hash map to replace the C++ standard library’s std::unordered_map for improved performance.

Posted by lili on September 18, 2025

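Hands-On Implementation and Optimization of BPE Tokenizer Training, Part 7: Using Flat Hash Map instead of std::unordered_map (Chinese version)

This series of articles completes a subtask of Stanford CS336 Assignment 1: implementing an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm's training time on OpenWebText dropped from more than 10 hours to under 10 minutes. The series explains this optimization process, including algorithmic optimization, data structure optimization, parallelization with OpenMP, Cython optimization, implementing key code in C++, and integrating the C++ library via Cython. This is the eighth article; it replaces the C++ standard library's std::unordered_map with a flat hash map to improve performance.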