sklearn.feature_extraction

sklearn.feature_extraction.text.CountVectorizer

Bodo supports CountVectorizer, which converts a collection of text documents into a matrix of token counts.

Note

The max_df and min_df arguments are not yet supported.

Methods

sklearn.feature_extraction.text.CountVectorizer.fit_transform

  • sklearn.feature_extraction.text.CountVectorizer.fit_transform(raw_documents, y=None)

    Supported Arguments

    • raw_documents: an iterable of strings (list, tuple, NumPy array, or pandas Series)

    Note

    Bodo ignores y, which is consistent with scikit-learn.

sklearn.feature_extraction.text.CountVectorizer.get_feature_names_out

  • sklearn.feature_extraction.text.CountVectorizer.get_feature_names_out()

Example Usage

>>> import bodo
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> @bodo.jit
... def test_count_vectorizer(corpus):
...     vectorizer = CountVectorizer()
...     X = vectorizer.fit_transform(corpus)
...     print(vectorizer.get_feature_names_out())
...
>>> test_count_vectorizer(corpus)
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
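
To illustrate what the token-count matrix holds, the following plain-Python sketch rebuilds the vocabulary and the per-document counts for the same corpus without Bodo or scikit-learn. It assumes scikit-learn's default settings (lowercasing, and a token pattern that keeps tokens of two or more word characters). count_matrix is a hypothetical helper, not part of either library, and it returns dense lists where scikit-learn returns a sparse matrix.

```python
import re
from collections import Counter

# scikit-learn's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_matrix(documents):
    """Hypothetical helper: return (vocabulary, rows), where rows[i][j]
    is the number of times vocabulary[j] occurs in documents[i]."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # Sorted vocabulary, so column order matches get_feature_names_out().
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        rows.append(row)
    return vocabulary, rows


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vocabulary, rows = count_matrix(corpus)
print(vocabulary)
print(rows[1])  # counts for the second document: [0, 2, 0, 1, 0, 1, 1, 0, 1]
```

Each row corresponds to one document and each column to one vocabulary term, so the second row records that 'document' appears twice in the second document.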

sklearn.feature_extraction.text.HashingVectorizer

Bodo supports HashingVectorizer, which converts a collection of text documents into a matrix of token occurrences.

Methods

sklearn.feature_extraction.text.HashingVectorizer.fit_transform

  • sklearn.feature_extraction.text.HashingVectorizer.fit_transform(X, y=None)

    Supported Arguments

    • X: an iterable of strings (list, tuple, NumPy array, or pandas Series)

    Note

    Bodo ignores y, which is consistent with scikit-learn.

Example Usage

>>> import bodo
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> @bodo.jit
... def test_hashing_vectorizer(corpus):
...     vectorizer = HashingVectorizer(n_features=2**4)
...     X = vectorizer.fit_transform(corpus)
...     print(X.shape)
...
>>> test_hashing_vectorizer(corpus)
(4, 16)
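
The fixed output width comes from the hashing trick: each token is hashed to one of n_features columns, so no vocabulary is stored. The sketch below shows the idea in plain Python under simplifying assumptions. hashed_counts is a hypothetical helper, and zlib.crc32 stands in for the MurmurHash3 function that scikit-learn uses. It also omits the signed hashing and L2 normalization that HashingVectorizer applies by default, so the cell values will differ from the real output even though the shape matches.

```python
import re
import zlib

# scikit-learn's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def hashed_counts(documents, n_features=2**4):
    """Hypothetical helper: map each token to a column by hashing it,
    then count occurrences per column for every document."""
    rows = []
    for doc in documents:
        row = [0] * n_features
        for token in TOKEN_PATTERN.findall(doc.lower()):
            # Stand-in hash (assumption): CRC32 instead of MurmurHash3.
            bucket = zlib.crc32(token.encode("utf-8")) % n_features
            row[bucket] += 1
        rows.append(row)
    return rows


corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
rows = hashed_counts(corpus)
print((len(rows), len(rows[0])))  # (4, 16)
```

Because distinct tokens can hash to the same column, a small n_features trades accuracy for memory; the number of columns stays at n_features regardless of how many distinct tokens the corpus contains.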