李理的博客

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 6: Implement parallel maximum finding with OpenMP

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the seventh article, exploring a method for using OpenMP to find the pair with the highest count in pair_counts in parallel.

Posted by lili on September 15, 2025

动手实现和优化BPE Tokenizer的训练——第6部分：用OpenMP实现并行求最大

本系列文章完成Stanford CS336作业1的一个子任务——实现BPE Tokenizer的高效训练算法。通过一系列优化，我们的算法在OpenWebText上的训练时间从最初的10多个小时优化到小于10分钟。本系列文章解释这一系列优化过程，包括：算法的优化，数据结构的优化，并行(openmp)优化，cython优化，用c++实现关键代码和c++库的cython集成等内容。本文是第七篇，探索使用OpenMP并行求pair_counts里最大pair的方法。

Posted by lili on September 15, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 5: Implementing the Merge Algorithm in C++

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the sixth article in the series. It focuses on implementing the merge algorithm in C++ to be equivalent to Python’s version. We also compare two traversal methods for std::unordered_map in preparation for a discussion on parallel maximum computation with OpenMP.

Posted by lili on September 12, 2025

动手实现和优化BPE Tokenizer的训练——第5部分：用C++实现Merge算法

本系列文章完成Stanford CS336作业1的一个子任务——实现BPE Tokenizer的高效训练算法。通过一系列优化，我们的算法在OpenWebText上的训练时间从最初的10多个小时优化到小于10分钟。本系列文章解释这一系列优化过程，包括：算法的优化，数据结构的优化，并行(openmp)优化，cython优化，用c++实现关键代码和c++库的cython集成等内容。本文是第六篇，用C++实现和Python等价的merge算法，并且比较std::unordered_map的两种遍历方式，为下问的openmp并行求max做准备。

Posted by lili on September 12, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 4: A Failed Parallel Optimization

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the fifth article, documenting a failed attempt at parallel optimization. Through this failure, we can understand the problems that exist with Python’s multiprocessing module and, in turn, learn when it should be used.

Posted by lili on September 10, 2025

动手实现和优化BPE Tokenizer的训练——第4部分：一次失败的并行优化

本系列文章完成Stanford CS336作业1的一个子任务——实现BPE Tokenizer的高效训练算法。通过一系列优化，我们的算法在OpenWebText上的训练时间从最初的10多个小时优化到小于10分钟。本系列文章解释这一系列优化过程，包括：算法的优化，数据结构的优化，并行(openmp)优化，cython优化，用c++实现关键代码和c++库的cython集成等内容。本文是第五篇，记录一次失败的并行优化尝试，通过这次失败，我们可以了解Python multiprocessing存在的问题，从而理解什么时候应该用multiprocessing什么时候不应该用它。

Posted by lili on September 10, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 3: Parallel word segmentation and word frequency counting

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the fourth article, using parallel algorithms for acceleration.

Posted by lili on September 9, 2025

动手实现和优化BPE Tokenizer的训练——第3部分：并行分词和统计词频

本系列文章完成Stanford CS336作业1的一个子任务——实现BPE Tokenizer的高效训练算法。通过一系列优化，我们的算法在OpenWebText上的训练时间从最初的10多个小时优化到小于10分钟。本系列文章解释这一系列优化过程，包括：算法的优化，数据结构的优化，并行(openmp)优化，cython优化，用c++实现关键代码和c++库的cython集成等内容。本文是第四篇，使用并行算法加速。

Posted by lili on September 9, 2025

Implementing and Optimizing a BPE Tokenizer from Scratch—Part 2: Optimizing the Algorithm

This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE Tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. This series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This is the third article, focusing on optimizing the previous algorithm.

Posted by lili on September 8, 2025

动手实现和优化BPE Tokenizer的训练——第2部分：优化算法

本系列文章完成Stanford CS336作业1的一个子任务——实现BPE Tokenizer的高效训练算法。通过一系列优化，我们的算法在OpenWebText上的训练时间从最初的10多个小时优化到小于10分钟。本系列文章解释这一系列优化过程，包括：算法的优化，数据结构的优化，并行(openmp)优化，cython优化，用c++实现关键代码和c++库的cython集成等内容。本文是第三篇，优化之前的算法。