UserGuideDocExtractor
- class ragger_duck.scraping.UserGuideDocExtractor(*, folders_to_exclude=None, chunk_size=300, chunk_overlap=50, n_jobs=None)
Extract text from the User Guide documentation.
This function can process classes and functions.
- Parameters:
- folders_to_exclude : list of str, default=None
A list of folder names to exclude from the HTML pages to process.
- chunk_size : int or None, default=300
The size of the chunks to split the text into. If None, the text is not chunked.
- chunk_overlap : int, default=50
The overlap between two consecutive chunks.
- n_jobs : int, default=None
The number of jobs to run in parallel. If None, then the number of jobs is set to the number of CPU cores.
- Attributes:
- text_splitter_ : langchain.text_splitter.RecursiveCharacterTextSplitter
The text splitter to use to chunk the document. If chunk_size is None, this attribute is None.
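To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the RecursiveCharacterTextSplitter used by the class (which also tries to split on separators), and it assumes sizes are measured in characters. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters.

    Simplified illustration only: a plain sliding window, not the
    separator-aware strategy of RecursiveCharacterTextSplitter.
    """
    if chunk_size is None:
        # Mirrors the documented behaviour: no chunking when chunk_size is None.
        return [text]
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 700-character dummy document made of repeating digits.
text = "".join(str(i % 10) for i in range(700))
chunks = chunk_text(text, chunk_size=300, chunk_overlap=50)
```

With the default values, each window starts 250 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.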
Methods
- fit([X, y]): No-op operation, only validate parameters.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Extract text from the User Guide documentation.
- fit(X=None, y=None)
No-op operation, only validate parameters.
- Parameters:
- X : None
This parameter is ignored.
- y : None
This parameter is ignored.
- Returns:
- self
The fitted estimator.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
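As a rough illustration of the "fit, then transform" contract described above, the sketch below uses a hypothetical stateless transformer (UpperCaser is not part of ragger_duck or scikit-learn). It only shows that fit_transform(X) is expected to give the same result as fit(X).transform(X). For this extractor, X is a path rather than an array.

```python
class UpperCaser:
    """Hypothetical stateless transformer, used only for illustration."""

    def fit(self, X=None, y=None, **fit_params):
        # Nothing to learn: fitting is a no-op that returns the estimator itself.
        return self

    def transform(self, X):
        return [item.upper() for item in X]

    def fit_transform(self, X, y=None, **fit_params):
        # Fit first, then transform the same input.
        return self.fit(X, y, **fit_params).transform(X)


docs = ["user guide", "api reference"]
one_step = UpperCaser().fit_transform(docs)
two_steps = UpperCaser().fit(docs).transform(docs)
```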
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas", "polars"}, default=None
Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
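The <component>__<parameter> naming can be illustrated with a small sketch that separates an estimator's own parameters from those addressed to a nested component. The helper split_nested_params and the component name "extractor" are hypothetical. This shows the naming convention only, not scikit-learn's actual set_params implementation.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        name, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {
        "verbose": True,
        "extractor__chunk_size": 500,
        "extractor__chunk_overlap": 100,
    }
)
```

Here "extractor__chunk_size" is read as "set chunk_size on the component named extractor", while "verbose" applies to the outer object.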
- transform(X)
Extract text from the User Guide documentation.
- Parameters:
- X : pathlib.Path
The path to the User Guide documentation folder.
- Returns:
- output : list
A list of dictionaries containing the source and text of the User Guide documentation.
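A minimal usage sketch follows. It assumes each returned dictionary uses the keys "source" and "text", which the description above suggests but does not spell out. The helper names, the folder argument, and the sample records are hypothetical. fit_transform is used so that parameters are validated before extraction.

```python
from collections import defaultdict
from pathlib import Path


def extract_user_guide(html_folder):
    """Run the extractor on a folder of built User Guide HTML pages."""
    # Imported lazily so the helper below works without ragger_duck installed.
    from ragger_duck.scraping import UserGuideDocExtractor

    extractor = UserGuideDocExtractor(chunk_size=300, chunk_overlap=50, n_jobs=1)
    return extractor.fit_transform(Path(html_folder))


def group_by_source(records):
    """Collect the text chunks of each source page (assumed keys: 'source', 'text')."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["source"]].append(record["text"])
    return dict(grouped)


# Hypothetical records shaped like the documented output.
sample = [
    {"source": "page_a.html", "text": "first chunk"},
    {"source": "page_a.html", "text": "second chunk"},
    {"source": "page_b.html", "text": "only chunk"},
]
grouped = group_by_source(sample)
```

Grouping by source is one way to inspect how many chunks each page produced for a given chunk_size.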
Examples using ragger_duck.scraping.UserGuideDocExtractor
Documentation scraping strategies