Text Scraping#
In a Retrieval Augmented Generation (RAG) framework, the “document” retrieved and provided to the Large Language Model (LLM) to generate an answer corresponds to chunks extracted from the documentation.
The first important aspect to be aware of is that the context of the LLM is limited. Therefore, we need to provide chunks of documentation that are small and focused enough not to exceed this context limit.
The most common strategy is to extract chunks of text with a given number of tokens and an overlap between chunks.
The various tutorials to build RAG models use this strategy. While it is a fast way to get started, it is not the best strategy to get the most out of the scikit-learn documentation. In the subsequent sections, we present different strategies specifically designed for certain portions of the scikit-learn documentation.
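As an illustration, here is a minimal sketch of this naive strategy. It uses whitespace-separated words as a stand-in for tokens; a real implementation would rely on the tokenizer of the embedding model, and the chunk_size and chunk_overlap values below are arbitrary:

def naive_chunks(text, chunk_size=100, chunk_overlap=20):
    """Split ``text`` into overlapping chunks of ``chunk_size`` words."""
    words = text.split()  # crude stand-in for a real tokenizer
    step = chunk_size - chunk_overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]

Each chunk shares chunk_overlap words with the previous one, which reduces the risk of splitting a sentence that carries the information needed to answer a query.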
API documentation scraper#
We refer to “API documentation” as the following documentation entry point: https://scikit-learn.org/stable/modules/classes.html. It corresponds to the documentation of each class and function implemented in scikit-learn. This documentation is automatically generated from the docstrings of the classes and functions, which follow the numpydoc formatting. As an example, we show below a generated HTML page containing the documentation of a scikit-learn estimator:
Before diving into the chunking mechanism, it is interesting to think about the type of queries that such documentation can help answer. Indeed, these documentation pages are intended to provide information about class or function parameters, short usage snippets of code, and related classes or functions. The narration on these pages is relatively short, and further discussions are generally provided in the user guide instead. So we would expect the chunks of documentation to be useful to answer questions such as:
What are the parameters of LogisticRegression?
What are the values of the strategy parameter in a dummy classifier?
Now that we have better framed our expectations, we can think about the chunks extraction. We could go forward with the naive approach described above. However, it will fall short of helping the LLM answer these questions. Let’s go through an example to illustrate this point.
Consider the second question above: “What are the values of the strategy parameter in a dummy classifier?” While our retrievers (refer to the Retriever section of the documentation) are able to capture the association between the DummyClassifier and the strategy parameter, the LLM will not be able to make this link if the retrieved chunk does not contain this relationship. Indeed, the naive approach will provide a chunk where strategy could be mentioned, but it might not belong to the DummyClassifier class.
For instance, we could retrieve the following three chunks that are relatively relevant to the query:
Chunk #1:
strategy : {"most_frequent", "prior", "stratified", "uniform", \
"constant"}, default="prior"
Strategy to use to generate predictions.
* "most_frequent": the `predict` method always returns the most
frequent class label in the observed `y` argument passed to `fit`.
The `predict_proba` method returns the matching one-hot encoded
vector.
* "prior": the `predict` method always returns the most frequent
class label in the observed `y` argument passed to `fit` (like
"most_frequent"). ``predict_proba`` always returns the empirical
class distribution of `y` also known as the empirical class prior
distribution.
* "stratified": the `predict_proba` method randomly samples one-hot
vectors from a multinomial distribution parametrized by the empirical
class prior probabilities.
The `predict` method returns the class label which got probability
one in the one-hot vector of `predict_proba`.
Each sampled row of both methods is therefore independent and
identically distributed.
* "uniform": generates predictions uniformly at random from the list
of unique classes observed in `y`, i.e. each class has equal
probability.
* "constant": always predicts a constant label that is provided by
the user. This is useful for metrics that evaluate a non-majority
class.
Chunk #2:
strategy : {"mean", "median", "quantile", "constant"}, default="mean"
Strategy to use to generate predictions.
* "mean": always predicts the mean of the training set
* "median": always predicts the median of the training set
* "quantile": always predicts a specified quantile of the training set,
provided with the quantile parameter.
* "constant": always predicts a constant value that is provided by
the user.
Chunk #3:
strategy : str, default='mean'
The imputation strategy.
- If "mean", then replace missing values using the mean along
each column. Can only be used with numeric data.
- If "median", then replace missing values using the median along
each column. Can only be used with numeric data.
- If "most_frequent", then replace missing using the most frequent
value along each column. Can be used with strings or numeric data.
If there is more than one such value, only the smallest is returned.
- If "constant", then replace missing values with fill_value. Can be
used with strings or numeric data.
Therefore, the chunks are all relevant to a strategy parameter, but they are related to the DummyClassifier, DummyRegressor, and SimpleImputer classes, respectively.
If we provide such information to a human who is not familiar with the scikit-learn API, they will not be able to determine which of the above chunks is relevant to answer the query. An expert, however, might use their prior knowledge to select the relevant chunk.
When it comes to an LLM, we should not expect more than from a human: if the LLM has been trained on similar queries, it might be able to use the relevant information, but otherwise it will not. For example, the Mistral 7b model would only summarize the information of the chunks and provide an unhelpful answer.
A straightforward solution to the above problem is to go beyond the naive chunking strategy. For instance, if a chunk contains the class or function associated with the parameter description, it allows us to disambiguate the information and thus helps our LLM answer the question.
As previously stated, scikit-learn uses the numpydoc formalism to document the classes and functions. This library comes with a parser that structures the docstring information, such that you know about the sections, the parameters, the types, etc. We implemented APINumPyDocExtractor that leverages this information to build meaningful chunks of documentation. The chunk size in this case is not controlled, but because of the nature of the documentation, we know that it will never be too large.
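As a rough illustration of the idea (and not the actual APINumPyDocExtractor implementation), the snippet below parses the docstring of DummyClassifier with numpydoc and builds one chunk per parameter, keeping the fully qualified class name inside each chunk:

import inspect

from numpydoc.docscrape import NumpyDocString
from sklearn.dummy import DummyClassifier

doc = NumpyDocString(inspect.getdoc(DummyClassifier))
chunks = []
for param in doc["Parameters"]:
    description = " ".join(param.desc)
    chunks.append(
        {
            "source": "sklearn.dummy.DummyClassifier",
            "text": (
                f"Parameter {param.name} of sklearn.dummy.DummyClassifier. "
                f"{param.name} is described as '{description}' "
                f"and has the following type(s): {param.type}"
            ),
        }
    )

Because the class name is repeated in every chunk, the relationship between strategy and DummyClassifier survives the retrieval step.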
For example, a chunk relevant to the previous query is the following:
source: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
content: Parameter strategy of sklearn.dummy.DummyClassifier.
strategy is described as 'Strategy to use to generate predictions.
* "most_frequent": the `predict` method always returns the most
frequent class label in the observed `y` argument passed to `fit`.
The `predict_proba` method returns the matching one-hot encoded
vector.
* "prior": the `predict` method always returns the most frequent
class label in the observed `y` argument passed to `fit` (like
"most_frequent"). ``predict_proba`` always returns the empirical
class distribution of `y` also known as the empirical class prior
distribution.
* "stratified": the `predict_proba` method randomly samples one-hot
vectors from a multinomial distribution parametrized by the empirical
class prior probabilities.
The `predict` method returns the class label which got probability
one in the one-hot vector of `predict_proba`.
Each sampled row of both methods is therefore independent and
identically distributed.
* "uniform": generates predictions uniformly at random from the list
of unique classes observed in `y`, i.e. each class has equal
probability.
* "constant": always predicts a constant label that is provided by
the user. This is useful for metrics that evaluate a non-majority
class.
.. versionchanged:: 0.24
The default value of `strategy` has changed to "prior" in version
0.24.' and has the following type(s): {"most_frequent", "prior", "stratified",
"uniform", "constant"}, default="prior"
By providing chunks that maintain the relationship between the parameter and its corresponding class, we enable the Mistral 7b model to disambiguate the information and provide a relevant answer.
Chunk formatting leveraging numpydoc#
In this section, we provide detailed information regarding the formatting used to create the chunks for classes and functions by leveraging the numpydoc formalism. You can refer to the numpydoc documentation for more information regarding this formalism.
We are creating individual chunks for the following sections:
class signature with default parameters
class short and extended summary
class parameters description
class attributes description
associated class or function in “See Also” section
class note section
class example usage
class references
For each of these sections, we create a chunk of text in natural language to summarize the information. A similar approach is used for functions and for the methods of a class. Below is an example of the chunks extracted for sklearn.feature_extraction.image.extract_patches_2d:
sklearn.feature_extraction.image.extract_patches_2d
The parameters of extract_patches_2d with their default values when known are:
image, patch_size, max_patches (default=None), random_state (default=None).
The description of the extract_patches_2d is as follow.
Reshape a 2D image into a collection of patches.
The resulting patches are allocated in a dedicated array.
Read more in the :ref:`User Guide <image_feature_extraction>`.
Parameter image of sklearn.feature_extraction.image.extract_patches_2d.
image is described as 'The original image data. For color images, the last dimension
specifies
the channel: a RGB image would have `n_channels=3`.' and has the following type(s):
ndarray of shape (image_height, image_width) or
(image_height, image_width, n_channels)
Parameter patch_size of sklearn.feature_extraction.image.extract_patches_2d.
patch_size is described as 'The dimensions of one patch.' and has the following
type(s): tuple of int (patch_height, patch_width)
Parameter max_patches of sklearn.feature_extraction.image.extract_patches_2d.
max_patches is described as 'The maximum number of patches to extract. If
`max_patches` is a float between 0 and 1, it is taken to be a proportion of the
total number of patches. If `max_patches` is None it corresponds to the total number
of patches that can be extracted.' and has the following type(s): int or float,
default=None
Parameter random_state of sklearn.feature_extraction.image.extract_patches_2d.
random_state is described as 'Determines the random number generator used for
random sampling when `max_patches` is not None. Use an int to make the randomness
deterministic.
See :term:`Glossary <random_state>`.' and has the following type(s): int,
RandomState instance, default=None
patches is returned by sklearn.feature_extraction.image.extract_patches_2d.
patches is described as 'The collection of patches extracted from the image, where
`n_patches` is either `max_patches` or the total number of patches that can be
extracted.' and has the following type(s): array of shape
(n_patches, patch_height, patch_width) or
(n_patches, patch_height, patch_width, n_channels)
sklearn.feature_extraction.image.extract_patches_2d
Here is a usage example of extract_patches_2d:
>>> from sklearn.datasets import load_sample_image
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the first image in this dataset:
>>> one_image = load_sample_image("china.jpg")
>>> print('Image shape: {}'.format(one_image.shape))
Image shape: (427, 640, 3)
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> print('Patches shape: {}'.format(patches.shape))
Patches shape: (272214, 2, 2, 3)
>>> # Here are just two of these patches:
>>> print(patches[1])
[[[174 201 231]
[174 201 231]]
[[173 200 230]
[173 200 230]]]
>>> print(patches[800])
[[[187 214 243]
[188 215 244]]
[[187 214 243]
[188 215 244]]]
User Guide documentation scraper#
We refer to “User Guide documentation” as the narrative documentation that is handwritten and provides a detailed explanation of machine learning concepts and how they translate into scikit-learn usage. The generated HTML pages are available at https://scikit-learn.org/stable/user_guide.html. Each page has the following look:
Here, we observe that the information is not structured as in the API documentation, so the naive chunking approach is more appropriate. UserGuideDocExtractor is a scraper that chunks the documentation in this manner. It relies on beautifulsoup4 to parse the HTML content and recursively chunk it.
It provides two main parameters, chunk_size and chunk_overlap, to control the chunking process. It is quite important not to create chunks so large that the number of tokens exceeds the limit of the retriever; otherwise, the embedding model will simply truncate the input. Also, it seems that having a small overlap is beneficial to avoid retrieving the same information multiple times.
Here, we can foresee an improvement: parsing the documentation at the section level and performing the chunking within these sections, as sketched below. This improvement could be done in the future.
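A minimal sketch of such a section-aware variant is shown below. It assumes that the Sphinx-generated pages wrap each section in a <section> tag and uses characters instead of tokens to keep the example short; it is not the current UserGuideDocExtractor implementation:

from bs4 import BeautifulSoup

def chunk_per_section(html, chunk_size=1_000, chunk_overlap=100):
    """Chunk each top-level documentation section independently."""
    soup = BeautifulSoup(html, "html.parser")
    step = chunk_size - chunk_overlap
    chunks = []
    for section in soup.find_all("section"):
        if section.find_parent("section") is not None:
            continue  # skip nested sections to avoid duplicating their text
        text = " ".join(section.get_text().split())  # collapse whitespace
        chunks.extend(
            text[start:start + chunk_size] for start in range(0, len(text), step)
        )
    return chunks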
The class also provides the parameter folders_to_exclude to exclude some files or folders that we don’t want to incorporate into our index.
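For instance (the folder names below are purely illustrative, and this is not the actual implementation), excluding folders can be as simple as filtering the paths while walking the documentation tree:

from pathlib import Path

def iter_html_files(root, folders_to_exclude=()):
    """Yield the HTML files under ``root`` that are not in an excluded folder."""
    for path in Path(root).rglob("*.html"):
        if not any(part in folders_to_exclude for part in path.parts):
            yield path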
Example gallery scraper#
The last type of documentation in scikit-learn is the gallery of examples. It corresponds to a set of Python examples that show some use cases or tutorial-like examples. These examples are written following the sphinx-gallery formalism. The generated HTML pages are available at https://scikit-learn.org/stable/auto_examples/index.html.
We mainly have two types of examples in scikit-learn. The first type is closer to a usage example, as shown below:
These examples have a title and a description followed by a single block of code.
The second type of examples is more tutorial-like: it has sections with titles and interlaces code blocks with text. An example is shown below:
GalleryExampleExtractor is a scraper that chunks these two types of examples. In the first case, it chunks the title and the description as an individual block and chunks the code block separately. In the second case, it instead first parses the sections of the example and creates blocks for each section. Then, we chunk each block separately. The idea behind this strategy is that a section of text is usually an introduction to or a description of the code that follows it.
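Below is a minimal sketch of the second case (not the actual GalleryExampleExtractor). It assumes the example is a sphinx-gallery style Python script where text sections are written as # %% comment cells, and it does not handle the module-level docstring that holds the title and description of the example, which would need to be extracted separately:

def split_gallery_example(script):
    """Split a sphinx-gallery style script into ("text", ...) and ("code", ...) blocks."""
    blocks = []
    for cell in script.split("# %%"):
        text_lines, code_lines = [], []
        for line in cell.splitlines():
            stripped = line.strip()
            if not code_lines and (not stripped or stripped.startswith("#")):
                # still in the leading comment (text) block of the cell
                text_lines.append(stripped.lstrip("#").strip())
            else:
                code_lines.append(line)
        if any(text_lines):
            blocks.append(("text", "\n".join(text_lines).strip()))
        if any(line.strip() for line in code_lines):
            blocks.append(("code", "\n".join(code_lines).strip()))
    return blocks

Each ("text", ...) block and its following ("code", ...) block can then be chunked separately, which keeps the introduction of a section close to the code it describes.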
Scraper API#
The different scraper classes share a common API, namely the scikit-learn transformer API. They all implement the methods fit, transform, and fit_transform. The scrapers are stateless: only parameter validation is done during fit, and all the processing happens when calling transform.
This API allows us to leverage the scikit-learn Pipeline and, for instance, to create a pipeline chaining a scraper and a retriever as a single Python instance.
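To make this concrete, here is a minimal sketch of what such a stateless scraper looks like; ToyScraper is a hypothetical class used for illustration, not part of the library:

from sklearn.base import BaseEstimator, TransformerMixin

class ToyScraper(TransformerMixin, BaseEstimator):
    """Hypothetical scraper following the stateless transformer convention."""

    def __init__(self, chunk_size=300, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def fit(self, X=None, y=None):
        # Stateless: only validate the parameters, nothing is learned from X.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        return self

    def transform(self, X):
        # All the processing happens here; X is an iterable of (source, text) pairs.
        step = self.chunk_size - self.chunk_overlap
        return [
            {"source": source, "text": text[start:start + self.chunk_size]}
            for source, text in X
            for start in range(0, len(text), step)
        ]

fit_transform comes for free from TransformerMixin, and such an instance can be placed as the first step of a scikit-learn Pipeline, followed by a retriever.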