bodo.pandas.BodoSeries.ai.tokenize¶
Tokenize a series of string dtype into a series of lists of int64.Parameters
-
tokenizer: function: A function returning a Transformers.PreTrainedTokenizer.
Returns
-
BodoSeries
Example
import bodo.pandas as pd
from transformers import AutoTokenizer
a = pd.Series(["bodo.ai will improve your workflows.", "This is a professional sentence.", "I am the third entry in this series.", "May the fourth be with you."])
def ret_tokenizer():
# Load a pretrained tokenizer (e.g., BERT)
return AutoTokenizer.from_pretrained("bert-base-uncased")
b = a.ai.tokenize(ret_tokenizer)
print(b)
Output:
0 [ 101 28137 1012 9932 2097 5335 2115 21...
1 [ 101 2023 2003 1037 2658 6251 1012 102]
2 [ 101 1045 2572 1996 2353 4443 1999 2023 2186 ...
3 [ 101 2089 1996 2959 2022 2007 2017 1012 102]
dtype: list<item: int64>[pyarrow]