This series of articles implements a subtask of Stanford’s CS336 Assignment 1: building an efficient training algorithm for a BPE tokenizer. Through a series of optimizations, our algorithm’s training time on OpenWebText was reduced from over 10 hours to less than 10 minutes. The series explains these optimizations, including algorithmic improvements, data structure enhancements, parallelization with OpenMP, Cython optimization, and implementing key code in C++ along with its integration via Cython. This twelfth and final article uses Cython to wrap the C++ code from the previous articles into an extension module callable from Python.
Table of Contents
- 1. Goal
- 2. Wrapping the C++ Code into a Dynamic Library
- 3. Wrapping with Cython into an Extension Module
- 4. Testing
- 5. Conclusion
1. Goal
In previous articles, we’ve done a lot of exploration. BPE training can be divided into two stages. The first stage is tokenization and counting, which we can perform in parallel using multiprocessing. Our bpe_v3.py implementation can complete this in 120 seconds using 32 cores, while the bpe_v3_bytes_time.py version, which reads bytes, can finish in 70 seconds using 64 cores. The second stage is merging. The Python version, bpe_v8_v3.py, takes about 500 seconds, with a total time of over ten minutes. The fastest C++ version, bpe_train_updater_fine_grained_heap_emhash8_set9_opt.cpp, has a merging time of around 100 seconds.
Today, our goal is to create the fastest version possible by wrapping the C++ code into an extension module using Cython, which can then be called by Python for the second stage.
2. Wrapping the C++ Code into a Dynamic Library
We’ve conducted many experiments in cppupdate, and now we need to wrap the best-performing versions into a dynamic library for easy use with Cython.
We will create a new C++ project called cppstep2. The full code can be found at cppstep2.
Based on the cppupdate experiment results, I have selected the following files: bpe_train_updater_fine_grained_emhash8.cpp, bpe_train_updater_fine_grained_emhash8_set.cpp, bpe_train_updater_fine_grained_emhash8_set9.cpp, bpe_train_updater_fine_grained_heap_emhash8_set9.cpp, and bpe_train_updater_fine_grained_heap_emhash8_set9_opt.cpp.
I chose emhash over absl for two reasons: it’s faster and simpler to use. I previously git cloned the full emhash code, but to simplify, I’ve only copied the three necessary header files: hash_set8.hpp, hash_set4.hpp, and hash_table8.hpp. Additionally, max_heap.h and max_heap.cpp were copied directly from the cppupdate project.
Our dynamic library will consist of two main files: the header file bpe_train_step2.h and the implementation file bpe_train_step2.cpp.
2.1 Header File
#pragma once
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include "emhash/hash_table8.hpp"
struct pair_hash {
template <class T1, class T2>
std::size_t operator () (const std::pair<T1,T2> &p) const {
auto h1 = std::hash<T1>{}(p.first);
auto h2 = std::hash<T2>{}(p.second);
return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
}
};
void bpe_train_step2(int vocab_size,
emhash8::HashMap<std::pair<int, int>, int, pair_hash> & pair_counts,
std::unordered_map<std::pair<int, int>, std::vector<std::vector<int>>, pair_hash> & pair_strings,
std::unordered_map<int, std::vector<int>> & vocabulary,
std::unordered_map<std::pair<int, int>, std::unordered_set<int>, pair_hash> & pair_wordids,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges);
void bpe_train_step2_v2(int vocab_size,
std::unordered_map<int, std::vector<int>> & vocabulary,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges);
void bpe_train_step2_v3(int vocab_size,
std::unordered_map<int, std::vector<int>> & vocabulary,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges);
void bpe_train_step2_v4(int vocab_size,
std::unordered_map<int, std::vector<int>> & vocabulary,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges);
void bpe_train_step2_v5(int vocab_size,
std::unordered_map<int, std::vector<int>> & vocabulary,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges);
void bpe_train_step2_v6(int vocab_size,
std::unordered_map<int, std::vector<int>> & vocabulary,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges);
This header file declares six functions. The five variants bpe_train_step2_v2, bpe_train_step2_v3, bpe_train_step2_v4, bpe_train_step2_v5, and bpe_train_step2_v6 correspond to bpe_train_updater_fine_grained_emhash8.cpp, bpe_train_updater_fine_grained_emhash8_set.cpp, bpe_train_updater_fine_grained_emhash8_set9.cpp, bpe_train_updater_fine_grained_heap_emhash8_set9.cpp, and bpe_train_updater_fine_grained_heap_emhash8_set9_opt.cpp, respectively.
bpe_train_step2 is a direct copy of the cppupdate implementation and requires seven input parameters. As we’ll see later, pair_counts and wordid_counts can be calculated inside C++ for better speed and fewer parameters passed. Thus, bpe_train_step2_v2 and subsequent versions only require five parameters.
2.2 CMakeLists.txt
Our goal is to compile a dynamic library. We’ll write the following in CMake syntax:
cmake_minimum_required(VERSION 3.20)
project(BPE_TRAIN_STEP2 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
add_library(bpe_train_step2 SHARED bpe_train_step2.cpp max_heap.cpp)
target_include_directories(bpe_train_step2 PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_compile_definitions(bpe_train_step2 PUBLIC MY_LIBRARY_EXPORT)
target_include_directories(bpe_train_step2 PUBLIC
"${PROJECT_SOURCE_DIR}/emhash"
)
install(TARGETS bpe_train_step2
EXPORT bpe_train_step2_export_targets
RUNTIME DESTINATION bin
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib)
install(FILES "bpe_train_step2.h"
DESTINATION include
)
We create the bpe_train_step2 target using add_library. target_compile_definitions defines the MY_LIBRARY_EXPORT macro, which appears to be intended for Windows compatibility: on Windows, such a macro typically switches exported symbols between __declspec(dllexport) and __declspec(dllimport). I don’t have a Windows environment to test this, so I’m unsure whether the project compiles there.
The next target_include_directories includes the emhash headers. Finally, install copies the bpe_train_step2 target (mainly libbpe_train_step2.so) and bpe_train_step2.h to the appropriate locations. I’ll explain the installation process below.
2.3 bpe_train_step2
First, we will integrate it in the same way as cppupdate, which means we need to pass seven parameters, including pair_counts.
void bpe_train_step2(int vocab_size,
emhash8::HashMap<std::pair<int, int>, int, pair_hash> & pair_counts,
std::unordered_map<std::pair<int, int>, std::vector<std::vector<int>>, pair_hash> & pair_strings,
std::unordered_map<int, std::vector<int>> & vocabulary,
std::unordered_map<std::pair<int, int>, std::unordered_set<int>, pair_hash> & pair_wordids,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges){
auto start = std::chrono::steady_clock::now();
while(vocabulary.size() < vocab_size){
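// One merge per iteration: find the most frequent pair, breaking count ties by
// lexicographically comparing the pairs' byte strings.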
int max_count = -1;
std::pair<int, int> max_pair;
std::vector<std::vector<int>> max_strings;
for(const auto& [pair, count] : pair_counts){
if(count > max_count){
max_count = count;
max_pair = pair;
max_strings = pair_strings[pair];
}else if(count == max_count){
std::vector<std::vector<int>> strings = pair_strings[pair];
ComparisonResult r1 = three_way_compare(strings[0], max_strings[0]);
if(r1 == ComparisonResult::Greater){
max_count = count;
max_pair = pair;
max_strings = strings;
}else if(r1 == ComparisonResult::Equal){
ComparisonResult r2 = three_way_compare(strings[1], max_strings[1]);
if(r2 == ComparisonResult::Greater){
max_count = count;
max_pair = pair;
max_strings = strings;
}
}
}
}
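// Concatenate the two tokens' byte sequences to form the new vocabulary entry.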
const std::vector<int>& bytes1 = vocabulary[max_pair.first];
const std::vector<int>& bytes2 = vocabulary[max_pair.second];
std::vector<int> merge_bytes;
merge_bytes.reserve(bytes1.size() + bytes2.size());
merge_bytes.insert(merge_bytes.end(), bytes1.begin(), bytes1.end());
merge_bytes.insert(merge_bytes.end(), bytes2.begin(), bytes2.end());
int size = vocabulary.size();
vocabulary[size] = merge_bytes;
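// Incrementally update pair statistics only for the words that contain the
// merged pair, looked up via the inverted index.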
auto& affected_words = pair_wordids[max_pair];
updated_affected_word_count(max_pair, affected_words, wordid_encodings, wordid_counts,
pair_counts, pair_wordids, size, pair_strings, vocabulary);
merges.push_back({bytes1, bytes2});
}
auto end = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "bpe_train_step2: " << duration.count() << "ms." << std::endl;
}
The code is a complete copy of bpe_train_updater_fine_grained_emhash8.cpp.
2.4 bpe_train_step2_v2
If you’ve looked carefully at the timing statistics in the previous articles, you’ll notice that the total time is about 50-60 seconds more than “tokenization time” plus “merging time”. This extra time goes to calling the BPE_Trainer._count_pairs function:
@staticmethod
def _count_pairs(word_counts, word_encodings, pair_strings, vocabulary, pair_to_words):
pair_counts = defaultdict(int)
for word, count in word_counts.items():
encoding = word_encodings[word]
for i in range(0, len(encoding) - 1):
pair = encoding[i], encoding[i + 1]
pair_counts[pair] += count
if pair not in pair_strings:
pair_strings[pair] = (vocabulary[pair[0]], vocabulary[pair[1]])
pair_to_words[pair].add(word)
return pair_counts
This code calculates pair_counts based on word_counts and builds the initial inverted index pair_to_words and pair_strings. Its input parameters are word_counts, word_encodings, and vocabulary. This code can be implemented entirely in C++, which would not only speed it up but also reduce the number of parameters passed between Python and C++.
Therefore, the input parameters for bpe_train_step2_v2 can be reduced by three:
void bpe_train_step2_v2(int vocab_size,
std::unordered_map<int, std::vector<int>> & vocabulary,
const std::unordered_map<int, long long> & wordid_counts,
std::unordered_map<int, std::vector<int>> & wordid_encodings,
std::vector<std::pair<std::vector<int>, std::vector<int>>> & merges){
std::unordered_map<std::pair<int, int>, std::vector<std::vector<int>>, pair_hash> pair_strings;
emhash8::HashMap<std::pair<int, int>, int, pair_hash> pair_counts;
std::unordered_map<std::pair<int, int>, std::unordered_set<int>, pair_hash> pair_wordids;
std::pair<int, int> pair;
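// Build the initial pair counts, pair strings, and pair-to-word inverted index
// in C++ instead of receiving them from Python.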
for(const auto& [wordid, count] : wordid_counts){
const auto& word_tokens = wordid_encodings[wordid];
for(int i = 0; i < word_tokens.size() - 1; ++i){
pair.first = word_tokens[i];
pair.second = word_tokens[i + 1];
pair_counts[pair] += count;
if (pair_strings.find(pair) == pair_strings.end()) {
pair_strings[pair] = {vocabulary[pair.first], vocabulary[pair.second]};
}
pair_wordids[pair].insert(wordid);
}
}
bpe_train_step2(vocab_size, pair_counts, pair_strings, vocabulary,
pair_wordids, wordid_counts, wordid_encodings, merges);
}
The bpe_train_step2_v2 function first calculates pair_counts, pair_strings, and pair_wordids from wordid_counts, wordid_encodings, and vocabulary, and then calls bpe_train_step2.
The bpe_train_step2_v3 and subsequent versions have the exact same interface as bpe_train_step2_v2, and the code is copied from the corresponding cppupdate versions, which I won’t detail here.
2.5 Compile and Install
Use the following commands to compile and install:
cd cppstep2
mkdir build && cd build
cmake -D CMAKE_INSTALL_PREFIX=../../lib_bpe_train_step2/ -D CMAKE_BUILD_TYPE=Release ..
cmake --build . -- -j8
# If you are using gcc, you can directly use make -j8
cmake --install .
# Or make install
Since we don’t want to install this library into system paths like /usr/local (which requires root privileges and is not convenient for Cython integration due to different paths on different systems), I’m installing it to ../../lib_bpe_train_step2/, which is a subdirectory of the project root assignment1-basics-bpe:
$ ls assignment1-basics-bpe
cppstep2/
cppupdate/
cs336_basics/
data/
lib_bpe_train_step2/
After installation, the lib_bpe_train_step2 directory will look like this:
lib_bpe_train_step2$ tree
.
├── include
│ ├── bpe_train_step2.h
│ └── emhash
│ └── hash_table8.hpp
└── lib
└── libbpe_train_step2.so
Besides bpe_train_step2.h, hash_table8.hpp is also installed because bpe_train_step2.h includes it: the declaration of bpe_train_step2 takes an emhash8::HashMap parameter. If we kept only bpe_train_step2_v2 and later versions, this header would not need to be installed (though all the emhash headers would still be needed to compile the library).
3. Wrapping with Cython into an Extension Module
For those new to wrapping C++ libraries with Cython, you can refer to the Using C++ in Cython guide.
We need two files: bpe_train_step2_wrapper.pxd and bpe_train_step2_wrapper.pyx.
3.1 bpe_train_step2_wrapper.pxd
First, let’s look at the .pxd file:
# distutils: language = c++
# Import C++ standard library types
from libcpp.utility cimport pair
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map
from libcpp.unordered_set cimport unordered_set
cdef extern from "../lib_bpe_train_step2/include/bpe_train_step2.h" :
cppclass pair_hash:
pass
cdef extern from "../lib_bpe_train_step2/include/emhash/hash_table8.hpp" namespace "emhash8":
cppclass HashMap[K, V, H]:
#ValueT& operator[](const KeyT& key) noexcept
V& operator[](const K& key)
cdef extern from "../lib_bpe_train_step2/include/bpe_train_step2.h":
void bpe_train_step2(int vocab_size,
HashMap[pair[int, int], int, pair_hash] & pair_counts,
unordered_map[pair[int, int], vector[vector[int]], pair_hash] & pair_strings,
unordered_map[int, vector[int]] & vocabulary,
unordered_map[pair[int, int], unordered_set[int], pair_hash] & pair_wordids,
const unordered_map[int, long long] & wordid_counts,
unordered_map[int, vector[int]] & wordid_encodings,
vector[pair[vector[int], vector[int]]] & merges) except +
void bpe_train_step2_v2(int vocab_size,
unordered_map[int, vector[int]] & vocabulary,
const unordered_map[int, long long] & wordid_counts,
unordered_map[int, vector[int]] & wordid_encodings,
vector[pair[vector[int], vector[int]]] & merges) except +
The first line tells distutils (or setuptools) that you are building a C++ extension module, not the default C extension.
The next four lines import the C++ standard library types std::pair, std::vector, std::unordered_map, and std::unordered_set, which Cython wraps under libcpp. We bring them in with cimport, a compile-time import. To see which C++ standard library types are available, refer to this link.
Next, we declare pair_hash with cppclass. Since Cython only needs to know that the pair_hash symbol exists (it is used purely as a template argument), we can leave the body empty with pass. Note that the path in cdef extern from is "../lib_bpe_train_step2/include/bpe_train_step2.h", which requires lib_bpe_train_step2 to be installed at the location described earlier.
Next is the declaration of emhash8::HashMap. The namespace "emhash8" clause after cdef extern from tells Cython that HashMap lives in the emhash8 namespace. cppclass HashMap[K, V, H] indicates that HashMap is a template class over a key type, a value type, and a hash functor. Because we’ll use operator[] in the .pyx file, we must also declare its prototype here.
Finally, we declare the bpe_train_step2 functions. The syntax is similar to C++, but < > is replaced with [ ], which might look a bit strange.
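For instance, a hypothetical C++ function void toy_count(std::unordered_map<int, std::vector<int>>& m) declared in some toy.h would be written like this (a toy sketch for illustration only; toy.h and toy_count are made-up names, not part of our wrapper):
cdef extern from "toy.h":
    # < > in the C++ signature becomes [ ] in Cython
    void toy_count(unordered_map[int, vector[int]] & m) except +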
3.2 bpe_train_step2_wrapper.pyx
Now, we’ll wrap the C++ functions into Python-callable functions. The main task here is parameter conversion, such as converting a Python dict to a C++ std::unordered_map. Let’s look at the most direct implementation first.
3.2.1 py_bpe_train_step2
cpdef py_bpe_train_step2(int vocab_size,
pair_counts_py,
pair_strings_py,
vocabulary_py,
pair_wordids_py,
wordid_counts_py,
wordid_encodings_py,
merges_py):
# Declare C++ containers
cdef HashMap[pair[int, int], int, pair_hash] pair_counts_cpp
cdef unordered_map[pair[int, int], vector[vector[int]], pair_hash] pair_strings_cpp
cdef unordered_map[int, vector[int]] vocabulary_cpp
cdef unordered_map[pair[int, int], unordered_set[int], pair_hash] pair_wordids_cpp
cdef unordered_map[int, long long] wordid_counts_cpp
cdef unordered_map[int, vector[int]] wordid_encodings_cpp
cdef vector[pair[vector[int], vector[int]]] merges_cpp
cdef pair[int, int] pair_key
cdef vector[vector[int]] strings_value
cdef vector[int] vector_value
cdef unordered_set[int] set_value
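# Copy each Python container element-by-element into its C++ counterpart.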
for p, count in pair_counts_py.items():
pair_key.first = p[0]
pair_key.second = p[1]
pair_counts_cpp[pair_key] = count
for p, string in pair_strings_py.items():
pair_key.first = p[0]
pair_key.second = p[1]
strings_value.clear()
value = [list(item) for item in string]
vector_value = value[0]
strings_value.push_back(vector_value)
vector_value = value[1]
strings_value.push_back(vector_value)
pair_strings_cpp[pair_key] = strings_value
for k, v in vocabulary_py.items():
value = list(v)
vector_value = value
vocabulary_cpp[k] = vector_value
for p, wordids in pair_wordids_py.items():
pair_key.first = p[0]
pair_key.second = p[1]
set_value = wordids
pair_wordids_cpp[pair_key] = set_value
for k, v in wordid_counts_py.items():
wordid_counts_cpp[k] = v
for k, v in wordid_encodings_py.items():
vector_value = v
wordid_encodings_cpp[k] = vector_value
# Call the C++ function
bpe_train_step2(vocab_size,
pair_counts_cpp,
pair_strings_cpp,
vocabulary_cpp,
pair_wordids_cpp,
wordid_counts_cpp,
wordid_encodings_cpp,
merges_cpp)
return merges_cpp, vocabulary_cpp
Most of the code in this function converts Python variables to C++ variables before calling bpe_train_step2. Let’s look at two typical examples:
cdef HashMap[pair[int, int], int, pair_hash] pair_counts_cpp
cdef pair[int, int] pair_key
for p, count in pair_counts_py.items():
pair_key.first = p[0]
pair_key.second = p[1]
pair_counts_cpp[pair_key] = count
The key type of pair_counts_cpp is pair[int, int], i.e., std::pair<int, int>, and we can assign to its first and second members directly. We then insert with operator[], which is why we declared V& operator[](const K& key) in the .pxd earlier. The loop itself reads just like an ordinary Python for loop.
Let’s look at a more complex example:
cdef unordered_map[pair[int, int], vector[vector[int]], pair_hash] pair_strings_cpp
for p, string in pair_strings_py.items():
pair_key.first = p[0]
pair_key.second = p[1]
strings_value.clear()
value = [list(item) for item in string]
vector_value = value[0]
strings_value.push_back(vector_value)
vector_value = value[1]
strings_value.push_back(vector_value)
pair_strings_cpp[pair_key] = strings_value
The key for pair_strings_py is a tuple, and the value is also a tuple with two bytes elements. For example:
(111,110): (b'o', b'n')
The corresponding pair_strings_cpp is unordered_map[pair[int, int], vector[vector[int]], pair_hash], so we need to convert the bytes to list[int]. This is done with the following statements:
value = [list(item) for item in string]
vector_value = value[0]
strings_value.push_back(vector_value)
First, a list comprehension converts the tuple of bytes into list[list[int]]. Then each Python list[int] is converted to a C++ vector[int] by assignment and pushed onto strings_value.
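To make the comprehension concrete, here is what it produces for the example pair above (plain Python, safe to try in a REPL):
string = (b'o', b'n')
value = [list(item) for item in string]  # iterating bytes yields ints
print(value)  # [[111], [110]]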
3.2.2 py_bpe_train_step2_v2
cpdef py_bpe_train_step2_v2(int vocab_size,
vocabulary_py,
wordid_counts_py,
wordid_encodings_py,
merges_py):
# Declare C++ containers
cdef unordered_map[int, vector[int]] vocabulary_cpp
cdef unordered_map[int, long long] wordid_counts_cpp
cdef unordered_map[int, vector[int]] wordid_encodings_cpp
cdef vector[pair[vector[int], vector[int]]] merges_cpp
cdef pair[int, int] pair_key
cdef vector[vector[int]] strings_value
cdef vector[int] vector_value
cdef unordered_set[int] set_value
for k, v in vocabulary_py.items():
value = list(v)
vector_value = value
vocabulary_cpp[k] = vector_value
for k, v in wordid_counts_py.items():
wordid_counts_cpp[k] = v
for k, v in wordid_encodings_py.items():
vector_value = v
wordid_encodings_cpp[k] = vector_value
# Call the C++ function
bpe_train_step2_v2(vocab_size,
vocabulary_cpp,
wordid_counts_cpp,
wordid_encodings_cpp,
merges_cpp)
return merges_cpp, vocabulary_cpp
This version is similar to the previous one, but with three fewer parameters.
Note: We return C++ variables merges_cpp and vocabulary_cpp at the end. Their types are:
cdef unordered_map[int, vector[int]] vocabulary_cpp
cdef vector[pair[vector[int], vector[int]]] merges_cpp
When they are returned to Python, Cython automatically converts them to dict[int, list[int]] and list[tuple[list[int], list[int]]]. We’ll need to convert the list[int] back to bytes later.
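The reverse direction is a single bytes() call, so the round trip looks like this in plain Python:
tokens = list(b'on')       # bytes -> list[int]: [111, 110]
restored = bytes(tokens)   # list[int] -> bytes: b'on'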
In fact, this applies not only to return values: whenever we copy a Python variable to a C++ variable or vice versa, Cython automatically performs these common conversions:
| Python type => | C++ type | => Python type |
|---|---|---|
| bytes | std::string | bytes |
| iterable | std::vector | list |
| iterable | std::list | list |
| iterable | std::set | set |
| iterable | std::unordered_set | set |
| mapping | std::map | dict |
| mapping | std::unordered_map | dict |
| iterable (len 2) | std::pair | tuple (len 2) |
| complex | std::complex | complex |
We can use this feature to simplify the variable conversions and get py_bpe_train_step2_opt:
cpdef py_bpe_train_step2_opt(int vocab_size,
vocabulary_py,
wordid_counts_py,
wordid_encodings_py,
merges_py):
# Declare C++ containers
cdef unordered_map[int, vector[int]] vocabulary_cpp
cdef unordered_map[int, long long] wordid_counts_cpp
cdef unordered_map[int, vector[int]] wordid_encodings_cpp
cdef vector[pair[vector[int], vector[int]]] merges_cpp
vocabulary_cpp = vocabulary_py
wordid_counts_cpp = wordid_counts_py
wordid_encodings_cpp = wordid_encodings_py
# Call the C++ function
bpe_train_step2_v2(vocab_size,
vocabulary_cpp,
wordid_counts_cpp,
wordid_encodings_cpp,
merges_cpp)
return merges_cpp, vocabulary_cpp
Here, just three assignment statements suffice, and Cython handles the conversion between Python and C++ automatically. The wrappers for the other versions, such as py_bpe_train_step2_v3, are identical to py_bpe_train_step2_opt apart from the C++ function they call:
cpdef py_bpe_train_step2_v3(int vocab_size,
vocabulary_py,
wordid_counts_py,
wordid_encodings_py,
merges_py):
# Declare C++ containers
cdef unordered_map[int, vector[int]] vocabulary_cpp
cdef unordered_map[int, long long] wordid_counts_cpp
cdef unordered_map[int, vector[int]] wordid_encodings_cpp
cdef vector[pair[vector[int], vector[int]]] merges_cpp
vocabulary_cpp = vocabulary_py
wordid_counts_cpp = wordid_counts_py
wordid_encodings_cpp = wordid_encodings_py
# Call the C++ function
bpe_train_step2_v3(vocab_size,
vocabulary_cpp,
wordid_counts_cpp,
wordid_encodings_cpp,
merges_cpp)
return merges_cpp, vocabulary_cpp
3.3 Modifying setup.py
project_root = os.path.dirname(os.path.abspath(__file__))
ext_modules = [
Extension(
name="cs336_basics.bpe_train_step2_wrapper",
sources=["cs336_basics/bpe_train_step2_wrapper.pyx"],
language="c++",
#extra_compile_args=['-std=c++17', '-O3'],
extra_compile_args=['-std=c++17'],
libraries=["bpe_train_step2"],
library_dirs=[f"{project_root}/lib_bpe_train_step2/lib"],
runtime_library_dirs=[f"{project_root}/lib_bpe_train_step2/lib"],
include_dirs=[f"{project_root}/lib_bpe_train_step2/include",
f"{project_root}/lib_bpe_train_step2/include/emhash"],
)
]
setup(
packages=['cs336_basics'],
name='bpe_train_step2',
ext_modules=cythonize(ext_modules),
)
We need to compile bpe_train_step2_wrapper.pyx.
- name specifies the module name, allowing Python to import cs336_basics.bpe_train_step2_wrapper. Here, cs336_basics is the package name and bpe_train_step2_wrapper is the module name.
- sources specifies the source files to be compiled.
- language specifies the language of the module as C++.
- extra_compile_args specifies additional compilation options, in this case '-std=c++17'.
- libraries lists the libraries to link against.
- library_dirs specifies where to find those libraries at link time.
- runtime_library_dirs specifies where to find them at runtime (it sets the rpath).
- include_dirs specifies the header file locations for compilation.
To avoid hard-coding, project_root is set to the directory where setup.py is located.
These options ultimately result in the following C++ compiler commands (for my GCC environment; they may vary):
c++ -pthread -fno-strict-overflow -Wsign-compare -Wunreachable-code -DNDEBUG -g -O3 -Wall -fPIC -fPIC -Ics336_basics -I......codes/assignment1-basics-bpe/lib_bpe_train_step2/include -I......codes/assignment1-basics-bpe/lib_bpe_train_step2/include/emhash -I......codes/assignment1-basics-bpe/.venv/include -I.......local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/include/python3.12 -c cs336_basics/bpe_train_step2_wrapper.cpp -o build/temp.linux-x86_64-cpython-312/cs336_basics/bpe_train_step2_wrapper.o -std=c++17
c++ -pthread -fno-strict-overflow -Wsign-compare -Wunreachable-code -DNDEBUG -g -O3 -Wall -fPIC -shared -Wl,--exclude-libs,ALL build/temp.linux-x86_64-cpython-312/cs336_basics/bpe_train_step2_wrapper.o -L......codes/assignment1-basics-bpe/lib_bpe_train_step2/lib -L.......local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib -Wl,--enable-new-dtags,-rpath,......codes/assignment1-basics-bpe/lib_bpe_train_step2/lib -lbpe_train_step2 -o build/lib.linux-x86_64-cpython-312/cs336_basics/bpe_train_step2_wrapper.cpython-312-x86_64-linux-gnu.so
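With this setup.py, the extension is built with the usual setuptools workflow, for example python setup.py build_ext --inplace (or an editable pip install -e .). A quick sanity check from the project root, assuming the build succeeded:
from cs336_basics.bpe_train_step2_wrapper import py_bpe_train_step2_v2
print(py_bpe_train_step2_v2)  # the compiled Cython function object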
3.4 Using it in Python
The Python drivers map to the wrapper functions as follows:
- bpe_v9.py calls py_bpe_train_step2
- bpe_v10.py calls py_bpe_train_step2_v2
- bpe_v10_v2.py calls py_bpe_train_step2_opt
- bpe_v11.py and bpe_v11_bytes.py call py_bpe_train_step2_v3
- bpe_v11_v2.py calls py_bpe_train_step2_v4
- bpe_v11_v3.py and bpe_v11_v3_bytes.py call py_bpe_train_step2_v5
- bpe_v11_v4.py and bpe_v11_v4_bytes.py call py_bpe_train_step2_v6
Their code is mostly the same. Let’s look at bpe_v11.py:
vocabulary = {i: bytes([i]) for i in range(N_BYTES)} # every byte
for i, token in enumerate(special_tokens):
vocabulary[N_BYTES + i] = token.encode('utf-8')
size = N_BYTES + len(special_tokens)
merges = []
# initial word encodings are utf-8
word_encodings = {}
for word in word_counts:
word_encodings[word] = list(word.encode('utf-8'))
word_ids = {word:id for id, word in enumerate(word_counts)}
wordid_counts = {word_ids[word]:count for word, count in word_counts.items()}
wordid_encodings = {word_ids[word]:encoding for word, encoding in word_encodings.items()}
merges_cpp, vocabulary_cpp = py_bpe_train_step2_v3(vocab_size,
vocabulary,
wordid_counts,
wordid_encodings,
merges)
vocabulary = {k:bytes(v) for k, v in vocabulary_cpp.items()}
merges = [(bytes(arr[0]), bytes(arr[1])) for arr in merges_cpp]
Before the call, we need to convert string words to integer IDs. This mapping is stored in word_ids. We then use word_ids to convert word_counts to wordid_counts and word_encodings to wordid_encodings. After the call, the returned merges_cpp and vocabulary_cpp need to have their list[int] converted back to bytes.
4. Testing
| Version | Data | Total Time (s) | Tokenization Time (s) | Merging Time (s) | Other |
|---|---|---|---|---|---|
| bpe_v8_v3 | open_web | 897/899/951 | 395/399/395 | 442/438/493 | num_counter=8, num_merger=1 |
| bpe_v9 | open_web | 831/816/867 | 400/401/390 | prepare & convert: 93/86/94 py_bpe_train_step2: 330/320/374 c++:289/281/326 | num_counter=8, num_merger=1 |
| bpe_v10 | open_web | 816/769/788 | 390/400/400 | prepare & convert: 21/17/19 py_bpe_train_step2: 402/350/367 c++:338/296/309 | num_counter=8, num_merger=1 |
| bpe_v10_v2 | open_web | 767/774/767 | 400/401/401 | prepare & convert: 18/17/17 py_bpe_train_step2: 346/355/347 c++:292/298/294 | num_counter=8, num_merger=1 |
| bpe_v10_v2 | open_web | 498/477/495 | 120/120/120 | prepare & convert: 20/19/18 py_bpe_train_step2: 355/336/354 c++:299/282/298 | num_counter=32, num_merger=4 |
| bpe_v11 | open_web | 350/340/354 | 120/120/120 | prepare & convert: 21/19/19 py_bpe_train_step2: 207/199/212 c++:183/175/190 | num_counter=32, num_merger=4 |
| bpe_v11_bytes | open_web | 311/307/305 | 80/80/80 | prepare & convert: 18/19/18 py_bpe_train_step2: 211/206/204 c++:189/183/182 | num_counter=64, num_merger=8, chunk_size 8mb |
| bpe_v11_v2 | open_web | 362/350/338 | 130/120/120 | prepare & convert: 19/18/19 py_bpe_train_step2: 210/210/197 c++:189/190/176 | num_counter=32, num_merger=4 |
| bpe_v11_v3 | open_web | 269/274/270 | 120/120/120 | prepare & convert: 18/19/18 py_bpe_train_step2: 129/133/129 c++: 106/109/106 | num_counter=32, num_merger=4 |
| bpe_v11_v3_bytes | open_web | 218/219/215 | 72/74/69 | prepare & convert: 21/21/21 py_bpe_train_step2: 123/122/123 c++: 101/100/101 | num_counter=64, num_merger=8, chunk_size 8mb |
| bpe_v11_v4 | open_web | 258/256/261 | 116/117/117 | prepare & convert: 19/18/19 py_bpe_train_step2: 121/119/123 c++: 98/97/100 | num_counter=32, num_merger=4 |
| bpe_v11_v4_bytes | open_web | 210/206/207 | 71/69/70 | prepare & convert: 20/18/19 py_bpe_train_step2: 117/117/117 c++: 95/96/95 | num_counter=64, num_merger=8, chunk_size 8mb |
Comparing bpe_v10 and bpe_v10_v2, the difference is between manually converting parameters in Python to C++ and letting Cython do it automatically. Automatic conversion is not only more convenient but also faster.
Ultimately, with bpe_v11_v4_bytes on 64 cores, the fastest training time is a little over 200 seconds, which is more than 100 times faster than our initial time of over ten hours!
5. Conclusion
This concludes the series. Here is a brief summary and links to the corresponding articles:
- Part 0: Introduction Introduces the basic BPE training algorithm and related tasks, as well as the development environment.
- Part 1: The Simplest Implementation The simplest implementation of BPE training.
- Part 2: Optimized Algorithm Implements incremental updates for pair_counts.
- Part 3: Parallel Tokenization and Frequency Counting Uses multiprocessing to implement a multi-process parallel algorithm.
- Part 4: A Failed Parallel Optimization An attempt to parallelize the max pair calculation using multiple processes.
- Part 5: Implementing the Merge Algorithm in C++ Implements a C++ merge algorithm equivalent to the Python version, and compares two ways of iterating through std::unordered_map.
- Part 6: Parallelizing the Max Pair Search with OpenMP Uses OpenMP to find the max pair in pair_counts in parallel.
- Part 7: Using Flat Hashmap to Replace std::unordered_map Uses flat hashmap to replace std::unordered_map.
- Part 8: Implementing Fine-Grained Updates Implements a fine-grained update algorithm for pair_counts using an inverted index.
- Part 9: Using a Heap to Find the Max Pair Uses a heap to find the max pair and improve performance.
- Part 10: Using Cython and PyPy for Acceleration Uses Cython and PyPy to accelerate Python code.
- Part 11: Wrapping C++ Code with Cython Wraps C++ code using Cython.