.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_documentation_scraping.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_documentation_scraping.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_documentation_scraping.py:

=================================
Documentation scraping strategies
=================================

This example illustrates how the different documentation scraping strategies
work in `ragger_duck`.

.. GENERATED FROM PYTHON SOURCE LINES 11-22

API documentation scraping
--------------------------

First, we look at the :class:`~ragger_duck.scraping.APINumPyDocExtractor` class.
This class is used to scrape the API documentation of scikit-learn. It leverages
the `numpydoc` scraper and creates semi-structured chunks of text.

Let's show an example where we scrape the documentation of
:class:`~sklearn.ensemble.RandomForestClassifier`. Our scraper requires the
generated HTML file to infer whether a class or function is part of the public
API. To do so, we copied the generated HTML file into the folder
`toy_documentation/api`. We can therefore process this folder.

.. GENERATED FROM PYTHON SOURCE LINES 22-29

.. code-block:: Python

    from pathlib import Path

    from ragger_duck.scraping import APINumPyDocExtractor

    path_api_doc = Path(".") / "toy_documentation" / "api"
    chunks = APINumPyDocExtractor().fit_transform(path_api_doc)

.. GENERATED FROM PYTHON SOURCE LINES 30-31

The chunks are stored in a list of dictionaries.

.. GENERATED FROM PYTHON SOURCE LINES 31-34

.. code-block:: Python

    print(f"Chunks is {type(chunks)}")
    print(f"A chunk is {type(chunks[0])}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Chunks is <class 'list'>
    A chunk is <class 'dict'>

.. GENERATED FROM PYTHON SOURCE LINES 35-37

A chunk contains 2 keys: `"source"` that is the HTML source page and `"text"`
that is the extracted text.

.. GENERATED FROM PYTHON SOURCE LINES 37-39

.. code-block:: Python

    chunks[0].keys()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    dict_keys(['source', 'text'])

.. GENERATED FROM PYTHON SOURCE LINES 40-42

For the API documentation, we use `numpydoc` to generate meaningful chunks.
For instance, this is the first chunk of text.

.. GENERATED FROM PYTHON SOURCE LINES 42-44

.. code-block:: Python

    print(chunks[0]["text"])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    sklearn.ensemble.RandomForestClassifier
    The parameters of RandomForestClassifier with their default values when known
    are: n_estimators (default=100), criterion (default=gini), max_depth
    (default=None), min_samples_split (default=2), min_samples_leaf (default=1),
    min_weight_fraction_leaf (default=0.0), max_features (default=sqrt),
    max_leaf_nodes (default=None), min_impurity_decrease (default=0.0),
    bootstrap (default=True), oob_score (default=False), n_jobs (default=None),
    random_state (default=None), verbose (default=0), warm_start (default=False),
    class_weight (default=None), ccp_alpha (default=0.0), max_samples
    (default=None), monotonic_cst (default=None).
    The description of the RandomForestClassifier is as follow.
    A random forest classifier.
    A random forest is a meta estimator that fits a number of decision tree
    classifiers on various sub-samples of the dataset and uses averaging to
    improve the predictive accuracy and control over-fitting.
    Trees in the forest use the best split strategy, i.e. equivalent to passing
    `splitter="best"` to the underlying :class:`~sklearn.tree.DecisionTreeRegressor`.
    The sub-sample size is controlled with the `max_samples` parameter if
    `bootstrap=True` (default), otherwise the whole dataset is used to build
    each tree.
    For a comparison between tree-based ensemble models see the example
    :ref:`sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py`.
    Read more in the :ref:`User Guide `.

.. GENERATED FROM PYTHON SOURCE LINES 45-64

The first line of the chunk corresponds to the estimator or class name and its
module. This information is useful to disambiguate the documentation when using
an LLM: sometimes parameters with the same name are defined in different classes
or functions. An LLM will tend to summarize the information coming from the
different chunks. However, if we provide the class or function name and this
information is present in the user prompt, then the LLM is likely to generate a
more accurate answer.

Since `numpydoc` offers structured information based on the sections of the
docstring, we use these sections to create hand-crafted chunks that we find
meaningful with regard to the API documentation.

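To illustrate what `numpydoc` gives us to work with, here is a minimal sketch.
It is not the actual :class:`~ragger_duck.scraping.APINumPyDocExtractor`
implementation, which also inspects the generated HTML page and the estimator
signature (for instance to collect default values); it only shows the
structured docstring sections from which such a chunk can be assembled.

.. code-block:: Python

    # Minimal sketch (not the actual APINumPyDocExtractor implementation):
    # numpydoc parses a docstring into named sections that can be reassembled
    # into a hand-crafted chunk.
    from numpydoc.docscrape import NumpyDocString

    from sklearn.ensemble import RandomForestClassifier

    doc = NumpyDocString(RandomForestClassifier.__doc__)

    # The one-line summary and the parameter entries are separate sections.
    print(RandomForestClassifier.__name__)
    print("Summary:", " ".join(doc["Summary"]))
    print("Parameters:", ", ".join(param.name for param in doc["Parameters"]))
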
User guide documentation scraping
---------------------------------

Now, we look at the :class:`~ragger_duck.scraping.UserGuideDocExtractor` class.
This class is used to scrape the user guide documentation of scikit-learn. The
chunking strategy is really simple: we split the text into chunks of a fixed
size. Additionally, chunks can overlap. These behaviors are controlled by the
`chunk_size` and `chunk_overlap` parameters.

.. GENERATED FROM PYTHON SOURCE LINES 64-71

.. code-block:: Python

    from ragger_duck.scraping import UserGuideDocExtractor

    path_user_guide = Path(".") / "toy_documentation" / "user_guide"
    chunks = UserGuideDocExtractor(chunk_size=500, chunk_overlap=100).fit_transform(
        path_user_guide
    )

.. GENERATED FROM PYTHON SOURCE LINES 72-73

We provide an example of two overlapping chunks.

.. GENERATED FROM PYTHON SOURCE LINES 73-78

.. code-block:: Python

    print("Chunk #1\n")
    print(chunks[0]["text"])
    print("\nChunk #2\n")
    print(chunks[1]["text"])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Chunk #1

    Getting Started¶ The purpose of this guide is to illustrate some of the main
    features that scikit-learn provides. It assumes a very basic working
    knowledge of machine learning practices (model fitting, predicting,
    cross-validation, etc.). Please refer to our installation instructions for
    installing scikit-learn. Scikit-learn is an open source machine learning
    library that supports supervised and unsupervised learning. It also provides
    various tools for

    Chunk #2

    supervised and unsupervised learning. It also provides various tools for
    model fitting, data preprocessing, model selection, model evaluation, and
    many other utilities. Fitting and predicting: estimator basics¶ Scikit-learn
    provides dozens of built-in machine learning algorithms and models, called
    estimators. Each estimator can be fitted to some data using its fit method.
    Here is a simple example where we fit a RandomForestClassifier to some very
    basic data:

.. GENERATED FROM PYTHON SOURCE LINES 79-80

The size of the chunks might vary depending on the break characters in the text.

.. GENERATED FROM PYTHON SOURCE LINES 80-83

.. code-block:: Python

    print(len(chunks[0]["text"]))
    print(len(chunks[1]["text"]))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    456
    462

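To make the effect of these two parameters concrete, here is a minimal sketch of
fixed-size chunking with overlap. It is not the implementation used by
:class:`~ragger_duck.scraping.UserGuideDocExtractor`, which also takes break
characters into account (hence the varying chunk sizes above); the hypothetical
`naive_chunk` helper below only illustrates how `chunk_size` and `chunk_overlap`
interact.

.. code-block:: Python

    def naive_chunk(text, chunk_size=500, chunk_overlap=100):
        """Illustrative sliding-window chunking (assumes chunk_overlap < chunk_size)."""
        step = chunk_size - chunk_overlap
        return [text[start:start + chunk_size] for start in range(0, len(text), step)]

    toy_text = "scikit-learn is a machine learning library " * 20
    toy_chunks = naive_chunk(toy_text, chunk_size=50, chunk_overlap=10)
    # Consecutive chunks share their last/first 10 characters.
    print(toy_chunks[0][-10:] == toy_chunks[1][:10])
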
.. GENERATED FROM PYTHON SOURCE LINES 84-94

It should be noted that this strategy could be improved with more sophisticated
chunking. For instance, we could detect the sections and make sure that chunks
do not overlap across independent sections. In the same manner, we could think
of a strategy that does not split the code blocks of the user guide, since they
are quite small and self-contained.

Examples documentation scraping
-------------------------------

Finally, we look at the :class:`~ragger_duck.scraping.GalleryExampleExtractor`
class. This class is used to scrape examples from the scikit-learn gallery.

.. GENERATED FROM PYTHON SOURCE LINES 94-99

.. code-block:: Python

    from ragger_duck.scraping import GalleryExampleExtractor

    path_examples = Path(".") / "toy_documentation" / "gallery"
    chunks = GalleryExampleExtractor(chunk_size=1_000).fit_transform(path_examples)

.. GENERATED FROM PYTHON SOURCE LINES 100-106

In scikit-learn, we have two types of examples. The first type contains a single
introduction paragraph followed by a single code block. The second type contains
multiple blocks of code and text and looks like a tutorial. We therefore use
different strategies for each. Let's look first at the first type of example.

.. GENERATED FROM PYTHON SOURCE LINES 108-109

Extract the chunks of the first example.

.. GENERATED FROM PYTHON SOURCE LINES 109-112

.. code-block:: Python

    chunks_text = [chunk["text"] for chunk in chunks if "pca" in chunk["source"]]
    print(len(chunks_text))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    3

.. GENERATED FROM PYTHON SOURCE LINES 113-115

We see that for the first type of example, we only have a few chunks. Let's
check the content of the chunks in more detail.

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. code-block:: Python

    for chunk in chunks_text:
        print("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
        print(chunk)
        print("\n")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    =======================================================
    Comparison of LDA and PCA 2D projection of Iris dataset
    =======================================================

    The Iris dataset represents 3 kind of Iris flowers (Setosa, Versicolour
    and Virginica) with 4 attributes: sepal length, sepal width, petal length
    and petal width.

    Principal Component Analysis (PCA) applied to this data identifies the
    combination of attributes (principal components, or directions in the
    feature space) that account for the most variance in the data. Here we
    plot the different samples on the 2 first principal components.

    Linear Discriminant Analysis (LDA) tries to identify attributes that
    account for the most variance *between classes*. In particular, LDA, in
    contrast to PCA, is a supervised method, using known class labels.

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    import matplotlib.pyplot as plt

    from sklearn import datasets
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    iris = datasets.load_iris()

    X = iris.data
    y = iris.target
    target_names = iris.target_names

    pca = PCA(n_components=2)
    X_r = pca.fit(X).transform(X)

    lda = LinearDiscriminantAnalysis(n_components=2)
    X_r2 = lda.fit(X, y).transform(X)

    # Percentage of variance explained for each components
    print(
        "explained variance ratio (first two components): %s"
        % str(pca.explained_variance_ratio_)
    )

    plt.figure()
    colors = ["navy", "turquoise", "darkorange"]
    lw = 2

    for color, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(
            X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=0.8, lw=lw, label=target_name
        )
    plt.legend(loc="best", shadow=False, scatterpoints=1)
    plt.title("PCA of IRIS dataset")

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    plt.figure()
    for color, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(
            X_r2[y == i, 0], X_r2[y == i, 1], alpha=0.8, color=color, label=target_name
        )
    plt.legend(loc="best", shadow=False, scatterpoints=1)
    plt.title("LDA of IRIS dataset")

    plt.show()

.. GENERATED FROM PYTHON SOURCE LINES 121-125

For this type of example, we split the text blocks from the code blocks. Once
these blocks are separated, we create chunks of a fixed size.

Let's now look at the second type of example.

.. GENERATED FROM PYTHON SOURCE LINES 125-128

.. code-block:: Python

    chunks_text = [chunk["text"] for chunk in chunks if "causal" in chunk["source"]]
    print(len(chunks_text))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    9

.. GENERATED FROM PYTHON SOURCE LINES 129-131

For the second type of example, we observe many more chunks. Let's check the
content of the chunks in more detail.

.. GENERATED FROM PYTHON SOURCE LINES 131-136

.. code-block:: Python

    for chunk in chunks_text:
        print("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
        print(chunk)
        print("\n")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    ===================================================
    Failure of Machine Learning to infer causal effects
    ===================================================

    Machine Learning models are great for measuring statistical associations.
    Unfortunately, unless we're willing to make strong assumptions about the
    data, those models are unable to infer causal effects.

    To illustrate this, we will simulate a situation in which we try to answer
    one of the most important questions in economics of education: **what is the
    causal effect of earning a college degree on hourly wages?**

    Although the answer to this question is crucial to policy makers,
    `Omitted-Variable Biases `_ (OVB) prevent us from identifying that causal
    effect.

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    The dataset: simulated hourly wages
    -----------------------------------
    The data generating process is laid out in the code below. Work experience
    in years and a measure of ability are drawn from Normal distributions; the
    hourly wage of one of the parents is drawn from Beta distribution. We then
    create an indicator of college degree which is positively impacted by
    ability and parental hourly wage. Finally, we model hourly wages as a linear
    function of all the previous variables and a random component.
    Note that all variables have a positive effect on hourly wages.

    import numpy as np
    import pandas as pd

    n_samples = 10_000
    rng = np.random.RandomState(32)

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    experiences = rng.normal(20, 10, size=n_samples).astype(int)
    experiences[experiences < 0] = 0
    abilities = rng.normal(0, 0.15, size=n_samples)
    parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)
    parent_hourly_wages[parent_hourly_wages < 0] = 0
    college_degrees = (
        9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7
    ).astype(int)
    true_coef = pd.Series(
        {
            "college degree": 2.0,
            "ability": 5.0,
            "experience": 0.2,
            "parent hourly wage": 1.0,
        }
    )
    hourly_wages = (
        true_coef["experience"] * experiences
        + true_coef["parent hourly wage"] * parent_hourly_wages
        + true_coef["college degree"] * college_degrees
        + true_coef["ability"] * abilities
        + rng.normal(0, 1, size=n_samples)
    )

    hourly_wages[hourly_wages < 0] = 0

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Description of the simulated data
    ---------------------------------
    The following plot shows the distribution of each variable, and pairwise
    scatter plots. Key to our OVB story is the positive relationship between
    ability and college degree.

    import seaborn as sns

    df = pd.DataFrame(
        {
            "college degree": college_degrees,
            "ability": abilities,
            "hourly wage": hourly_wages,
            "experience": experiences,
            "parent hourly wage": parent_hourly_wages,
        }
    )

    grid = sns.pairplot(df, diag_kind="kde", corner=True)

    In the next section, we train predictive models and we therefore split the
    target column from over features and we split the data into a training and
    a testing set.

    from sklearn.model_selection import train_test_split

    target_name = "hourly wage"
    X, y = df.drop(columns=target_name), df[target_name]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Income prediction with fully observed variables
    -----------------------------------------------
    First, we train a predictive model, a
    :class:`~sklearn.linear_model.LinearRegression` model. In this experiment,
    we assume that all variables used by the true generative model are available.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    features_names = ["experience", "parent hourly wage", "college degree", "ability"]

    regressor_with_ability = LinearRegression()
    regressor_with_ability.fit(X_train[features_names], y_train)
    y_pred_with_ability = regressor_with_ability.predict(X_test[features_names])
    R2_with_ability = r2_score(y_test, y_pred_with_ability)

    print(f"R2 score with ability: {R2_with_ability:.3f}")

    This model predicts well the hourly wages as shown by the high R2 score. We
    plot the model coefficients to show that we exactly recover the values of
    the true generative model.

    import matplotlib.pyplot as plt

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    model_coef = pd.Series(regressor_with_ability.coef_, index=features_names)
    coef = pd.concat(
        [true_coef[features_names], model_coef],
        keys=["Coefficients of true generative model", "Model coefficients"],
        axis=1,
    )
    ax = coef.plot.barh()
    ax.set_xlabel("Coefficient values")
    ax.set_title("Coefficients of the linear regression including the ability features")
    _ = plt.tight_layout()

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Income prediction with partial observations
    -------------------------------------------
    In practice, intellectual abilities are not observed or are only estimated
    from proxies that inadvertently measure education as well (e.g. by IQ
    tests). But omitting the "ability" feature from a linear model inflates the
    estimate via a positive OVB.

    features_names = ["experience", "parent hourly wage", "college degree"]

    regressor_without_ability = LinearRegression()
    regressor_without_ability.fit(X_train[features_names], y_train)
    y_pred_without_ability = regressor_without_ability.predict(X_test[features_names])
    R2_without_ability = r2_score(y_test, y_pred_without_ability)

    print(f"R2 score without ability: {R2_without_ability:.3f}")

    The predictive power of our model is similar when we omit the ability
    feature in terms of R2 score. We now check if the coefficient of the model
    are different from the true generative model.

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    model_coef = pd.Series(regressor_without_ability.coef_, index=features_names)
    coef = pd.concat(
        [true_coef[features_names], model_coef],
        keys=["Coefficients of true generative model", "Model coefficients"],
        axis=1,
    )
    ax = coef.plot.barh()
    ax.set_xlabel("Coefficient values")
    _ = ax.set_title("Coefficients of the linear regression excluding the ability feature")
    plt.tight_layout()
    plt.show()

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Lessons learned
    ---------------
    Machine learning models are not designed for the estimation of causal
    effects. While we showed this with a linear model, OVB can affect any type
    of model.

    Whenever interpreting a coefficient or a change in predictions brought about
    by a change in one of the features, it is important to keep in mind
    potentially unobserved variables that could be correlated with both the
    feature in question and the target variable. Such variables are called
    `Confounding Variables `_. In order to still estimate causal effect in the
    presence of confounding, researchers usually conduct experiments in which
    the treatment variable (e.g. college degree) is randomized. When an
    experiment is prohibitively expensive or unethical, researchers can
    sometimes use other causal inference techniques such as
    `Instrumental Variables `_ (IV) estimations.

.. GENERATED FROM PYTHON SOURCE LINES 137-155

For this type of example, we first detect the sections using `sphinx-gallery`
and then retrieve the text and code blocks within these sections. Since the code
is usually related to the text around it, we do not split the text from the code
blocks. Instead, we create chunks of a fixed size, as illustrated by the sketch
below.

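As a rough illustration of this strategy, and not the actual
:class:`~ragger_duck.scraping.GalleryExampleExtractor` implementation, the
hypothetical `chunk_gallery_example` function below splits a notebook-style
example on the `# %%` markers that `sphinx-gallery` uses to delimit sections,
keeps the text and code of a section together, and then cuts each section into
fixed-size chunks.

.. code-block:: Python

    import re

    def chunk_gallery_example(source, chunk_size=1_000):
        """Illustrative only: split on sphinx-gallery's ``# %%`` section markers,
        then cut each section into fixed-size chunks."""
        sections = re.split(r"^# %%.*$", source, flags=re.MULTILINE)
        chunks = []
        for section in sections:
            section = section.strip()
            if not section:
                continue
            chunks.extend(
                section[start:start + chunk_size]
                for start in range(0, len(section), chunk_size)
            )
        return chunks

    toy_example = (
        "# %%\n"
        "# A text section explaining the next code block.\n"
        "import numpy as np\n"
        "print(np.arange(3))\n"
        "# %%\n"
        "# Another section with more narrative text.\n"
        "print(np.ones(3))\n"
    )
    print(len(chunk_gallery_example(toy_example)))  # 2 small sections -> 2 chunks
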
Conclusion
----------

In this example, we have seen the different strategies used to scrape the API
documentation, the user guide documentation, and the examples documentation of
scikit-learn. The API documentation is the most structured and we can leverage
the sections of the docstring to create meaningful chunks. The user guide
documentation is less structured and we use a simple chunking strategy. Finally,
the examples documentation is the least structured and we use a more
sophisticated strategy to detect the sections and create meaningful chunks.

Since documentation scraping is a crucial step for the RAG model, more
sophisticated strategies could be used to improve the quality of the generated
chunks. Here, they are advanced enough for a proof of concept.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.106 seconds)


.. _sphx_glr_download_auto_examples_plot_documentation_scraping.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_documentation_scraping.ipynb <plot_documentation_scraping.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_documentation_scraping.py <plot_documentation_scraping.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_documentation_scraping.zip <plot_documentation_scraping.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_