Text analysis is a major application area of machine learning. Since most machine learning algorithms can only accept fixed-length numeric feature matrices, raw text strings cannot be used directly. To address this, Scikit-Learn provides tools for converting text into numeric features, so let's learn them together today.

The sklearn.feature_extraction.text module in Scikit-Learn provides tools for converting text into feature vectors:

  • CountVectorizer(): converts text into a word frequency matrix
  • TfidfTransformer(): converts the word frequency matrix from CountVectorizer() into a tf-idf matrix
  • TfidfVectorizer(): converts text directly into a tf-idf matrix
  • HashingVectorizer(): converts text into a hashed feature matrix

CountVectorizer

CountVectorizer transforms the words in a text collection into a word frequency matrix via the fit_transform method. The matrix element a[i][j] is the frequency of word j in the i-th document, i.e., the number of times each word occurs. The keywords of all documents can be listed with get_feature_names(), and the word frequency matrix can be viewed with toarray().

Example.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(vectorizer.vocabulary_)
print(count.toarray())

# Output
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 2 1 0 1]
#  [1 0 0 0 1 0 1 1 0]
#  [0 1 1 1 0 0 1 0 1]]
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Parameter description (a short usage sketch follows this list).

  • input:string {‘filename’, ‘file’, ‘content’}
    • If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need to be read to get the original content to be analyzed.
    • If ‘file’, the sequence item must have a ‘read’ method (file-like object) that is called to get the bytes in memory.
    • Otherwise, the input is expected to be a sequence of strings or bytes items, which are analyzed directly.
  • encoding:string,‘utf-8’ by default.
    • If bytes or files are given for analysis, this encoding is used for decoding.
  • decode_error: {‘strict’,‘ignore’,‘replace’}
    • Instructs what to do if a byte sequence to be analyzed contains characters not of the given encoding. By default it is 'strict', meaning a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.
  • strip_accents: {‘ascii’, ‘unicode’, None}
    • Whether to remove accents during the preprocessing step. 'ascii' is a fast method that only works on characters with a direct ASCII mapping. 'unicode' is a slightly slower method that works on any character. None (default) does nothing.
  • lowercase: boolean, True by default
    • Convert all characters to lowercase before tokenizing.
  • preprocessor: callable or None (default)
    • Override the preprocessor (string conversion) stage while preserving the tokenizing and n-grams generation steps.
  • tokenizer: callable or None (default)
    • Override the string tokenization step, while keeping the preprocessing and n-grams generation steps.
    • Only applies to analyzer == ‘word’
  • stop_words: string {’english’}, list, or None (default)
    • If a string, it is passed to _check_stop_list and the corresponding stop list is returned. 'english' is currently the only supported string value.
    • If a list, it is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
    • If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on the document frequency of terms within the corpus.
  • token_pattern: string
    • Regular expression denoting what constitutes a "token". The default selects tokens of two or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). Only used if analyzer == 'word'.
  • ngram_range: tuple (min_n, max_n)
    • The lower and upper bounds of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
  • analyzer: string, {‘word’, ‘char’, ‘char_wb’} or callable
    • Whether the feature should consist of a word or a character n-gram. The option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are filled with spaces.
    • If passed callable, it will be used to extract feature sequences from the original unprocessed input.
  • max_df: float in range [0.0, 1.0] or int, default=1.0
    • When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
  • min_df: float in range [0.0, 1.0] or int, default=1
    • When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
  • max_features: int or None, default=None
    • If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
    • If vocabulary is not None, this parameter is ignored.
  • vocabulary: Mapping or iterable, optional
    • Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, the vocabulary is determined from the input documents. Indices in the mapping must not be duplicated and must not have any gap between 0 and the largest index.
  • binary: boolean, default=False
    • If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
  • dtype: type, optional
    • The type of the matrix returned by fit_transform() or transform().
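
To see how a few of these parameters combine, here is a minimal sketch; the specific values (ngram_range=(1, 2), stop_words='english', max_features=8) are arbitrary choices for illustration, not recommendations. Note that in newer scikit-learn versions get_feature_names() is replaced by get_feature_names_out().

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Keep unigrams and bigrams, drop English stop words, and keep at most
# 8 features ordered by corpus-wide term frequency.
vectorizer = CountVectorizer(ngram_range=(1, 2),
                             stop_words='english',
                             max_features=8)
count = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(count.toarray())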

Attributes.

  • vocabulary_: dict
    • A mapping of terms to feature indexes.
  • stop_words_: set
    • Terms that were ignored because they either:
      • appeared in too many documents (max_df)
      • appeared in too few documents (min_df)
      • were cut off by feature selection (max_features)

Methods.

  • build_analyzer(self) Return a callable that handles preprocessing and tokenization
  • build_preprocessor(self) Return a function to preprocess the text before tokenization
  • build_tokenizer(self) Return a function that splits a string into a sequence of tokens
  • decode(self, doc) Decode the input into a string of unicode symbols
  • fit(self, raw_documents[, y]) Learn a vocabulary dictionary from the raw documents.
  • transform(self, raw_documents) Transform the documents into a document-term matrix using the learned vocabulary.
  • fit_transform(self, raw_documents[, y]) Perform fit and transform in one step.
  • get_feature_names(self) Get all features, i.e. a list of keywords
  • get_params(self[, deep]) Get parameters for this estimator.
  • get_stop_words(self) Build or fetch the effective stop words list
  • inverse_transform(self, X) Return terms per document with nonzero entries in X.
  • set_params(self, **params) Set the parameters of this estimator.
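
A small sketch of some of these methods on a tiny two-document corpus (the documents are just for illustration): build_analyzer() exposes the combined preprocessing and tokenization step, and inverse_transform() maps matrix rows back to their non-zero terms.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.', 'Is this the first document?']

vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)

# The analyzer applies the preprocessing (lowercasing) and tokenization steps.
analyzer = vectorizer.build_analyzer()
print(analyzer('This is the first document.'))

# inverse_transform() returns, for each row, the terms with non-zero counts.
print(vectorizer.inverse_transform(count))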

TfidfTransformer

TfidfTransformer computes the tf-idf weights for each word in the word frequency matrix produced by CountVectorizer.

Example.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
count = vectorizer.fit_transform(corpus)
tfidf_matrix = transformer.fit_transform(count)
print(tfidf_matrix.toarray())

# Output
# [[0.         0.43877674 0.54197657 0.43877674 0.         0.
#   0.35872874 0.         0.43877674]
#  [0.         0.27230147 0.         0.27230147 0.         0.85322574
#   0.22262429 0.         0.27230147]
#  [0.55280532 0.         0.         0.         0.55280532 0.
#   0.28847675 0.55280532 0.        ]
#  [0.         0.43877674 0.54197657 0.43877674 0.         0.
#   0.35872874 0.         0.43877674]]
class sklearn.feature_extraction.text.TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Parameters.

  • norm: ’l1’, ’l2’ or None, optional (default=‘l2’)
    • Whether to normalize the data, None means no normalization.
  • use_idf : boolean (default=True)
    • Whether to use idf reweighting; if False, it degrades to simple term frequency statistics.
  • smooth_idf: boolean (default=True)
    • Smooth idf weights by adding 1 to document frequencies, as if an extra document containing every term had been seen once; this prevents division by zero.
  • sublinear_tf: boolean (default=False)
    • Apply sublinear tf scaling; if True, replace tf with 1 + log(tf).

Attributes.

  • idf_: array, shape (n_features)
    • The inverse document frequency (IDF) vector; only defined if use_idf is True.

Methods.

  • fit(self, X[, y]) Learn the idf vector (global term weights)
  • transform(self, X[, copy]) Transform a count matrix to a tf or tf-idf representation
  • fit_transform(self, X[, y]) Fit to data, then transform it.
  • get_params(self[, deep]) Get parameters for this estimator.
  • set_params(self, **params) Set the parameters of this estimator.
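
A minimal sketch of how some of these options interact (the parameter choices are illustrative only): turning off normalization and idf smoothing makes the learned idf_ vector and the raw tf-idf products easier to inspect.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

count = CountVectorizer().fit_transform(corpus)

# norm=None skips the l2 normalization; smooth_idf=False uses the
# unsmoothed idf formula. Both choices are only for easier inspection.
transformer = TfidfTransformer(norm=None, smooth_idf=False)
tfidf = transformer.fit_transform(count)

print(transformer.idf_)   # learned inverse document frequencies
print(tfidf.toarray())    # unnormalized tf-idf weights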

TfidfVectorizer

TfidfVectorizer transforms a collection of raw documents into a matrix of tf-idf features, which is equivalent to using CountVectorizer followed by TfidfTransformer. In other words, the TfidfVectorizer class wraps CountVectorizer and TfidfTransformer together (a quick numerical check of this equivalence follows the example below).

Example.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
print(tfidf_vec.get_feature_names())
print(tfidf_vec.vocabulary_)
print(tfidf_matrix.toarray())

# Output
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
# [[0.         0.43877674 0.54197657 0.43877674 0.         0.
#   0.35872874 0.         0.43877674]
#  [0.         0.27230147 0.         0.27230147 0.         0.85322574
#   0.22262429 0.         0.27230147]
#  [0.55280532 0.         0.         0.         0.55280532 0.
#   0.28847675 0.55280532 0.        ]
#  [0.         0.43877674 0.54197657 0.43877674 0.         0.
#   0.35872874 0.         0.43877674]]
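
As a quick check of the equivalence claimed above (a minimal sketch), the matrix produced by TfidfVectorizer should numerically match the one produced by chaining CountVectorizer and TfidfTransformer with default settings:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

chained = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(corpus))
direct = TfidfVectorizer().fit_transform(corpus)

# Both routes should yield the same tf-idf matrix.
print(np.allclose(chained.toarray(), direct.toarray()))  # True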

HashingVectorizer

Word frequencies and tf-idf weights are useful, but when the vocabulary becomes very large, the two approaches above become limited: they require huge vectors to encode documents, which demands a lot of memory and slows down the algorithms. A good workaround is to use a one-way hash to convert words into integers. The advantage is that no vocabulary is needed and a fixed-length vector of arbitrary size can be chosen. The disadvantage is that the hashing is one-way, so it is not possible to convert the encoding back into words (which may not matter for many supervised learning tasks).

The HashingVectorizer class implements this approach: it hashes words on the fly and encodes documents as needed. The example below encodes a single document with HashingVectorizer, using an arbitrarily chosen fixed vector length of 20. This value corresponds to the range of the hash function; small values (such as 20) may lead to hash collisions. In computer science courses, heuristics are taught for choosing the hash length based on the estimated vocabulary size and the acceptable collision probability.

Note that this vectorizer does not need to be fitted on training data. After instantiation, it can be used directly to encode documents.

from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]
vectorizer = HashingVectorizer(n_features=20)
vector = vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())

Running the sample code encodes the sample document as a sparse matrix with 20 columns. The values of the encoded document correspond to normalized word counts, which by default lie between -1 and 1; the defaults can be changed to produce integer counts instead (see the sketch after the output below).

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
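
To obtain plain non-negative counts instead of the signed, normalized values above, the normalization and sign alternation can be switched off. This is a minimal sketch; the alternate_sign parameter is available in scikit-learn 0.19 and later.

from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]

# norm=None disables the l2 normalization and alternate_sign=False disables
# the +1/-1 sign flipping, so entries are plain (possibly collided) token counts.
vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
vector = vectorizer.transform(text)
print(vector.toarray())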
