author tmartins <thigm85@gmail.com> 2020-09-02 11:54:41 +0200
committer tmartins <thigm85@gmail.com> 2020-09-02 11:54:41 +0200
commit 2f6da2e21409ab57a97cb7c1df6031939bb17579
tree 9f42acbe5b5dc0341d6984f60afd7959c4cb4113 /python/vespa/docs
parent a6958a9ec7c0cffd1818e7d71b26b165a3ea436d
include notebooks that will be part of the documentation
Diffstat (limited to 'python/vespa/docs')
-rw-r--r-- python/vespa/docs/sphinx/source/application-package.ipynb 176
-rw-r--r-- python/vespa/docs/sphinx/source/collect-training-data.ipynb 1231
-rw-r--r-- python/vespa/docs/sphinx/source/create-and-deploy-vespa-cloud.ipynb 691
-rw-r--r-- python/vespa/docs/sphinx/source/deploy-application.ipynb 32
-rw-r--r-- python/vespa/docs/sphinx/source/evaluation.ipynb 296
-rw-r--r-- python/vespa/docs/sphinx/source/query-model.ipynb 32
-rw-r--r-- python/vespa/docs/sphinx/source/query.ipynb 320
7 files changed, 2778 insertions, 0 deletions
diff --git a/python/vespa/docs/sphinx/source/application-package.ipynb b/python/vespa/docs/sphinx/source/application-package.ipynb
new file mode 100644
index 00000000000..5cc1638f7de
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/application-package.ipynb
@@ -0,0 +1,176 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hide\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Vespa - Application Package\n",
+ "\n",
+ "> Python API to create, modify and deploy application packages"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Our goal is to create, modify and deploy simple application packages using our Python API. This enables us to run data analysis experiments that are fully integrated with Vespa. As an example, we want to create the application package we used in our [text search tutorial](https://docs.vespa.ai/documentation/tutorials/text-search.html). "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Application spec"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Our goal in this section is to create the following `msmarco` schema using our Python API."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "schema msmarco {\n",
+ " document msmarco {\n",
+ " field id type string {\n",
+ " indexing: attribute | summary\n",
+ " }\n",
+ " field title type string {\n",
+ " indexing: index | summary\n",
+ " index: enable-bm25\n",
+ " }\n",
+ " field body type string {\n",
+ " indexing: index | summary\n",
+ " index: enable-bm25\n",
+ " }\n",
+ " }\n",
+ "\n",
+ " fieldset default {\n",
+ " fields: title, body\n",
+ " }\n",
+ "\n",
+ " rank-profile default {\n",
+ " first-phase {\n",
+ " expression: nativeRank(title, body)\n",
+ " }\n",
+ " }\n",
+ "\n",
+ " rank-profile bm25 inherits default {\n",
+ " first-phase {\n",
+ " expression: bm25(title) + bm25(body)\n",
+ " }\n",
+ " }\n",
+ "\n",
+ "}\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Schema API"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import Document, Field, Schema, FieldSet, RankProfile, ApplicationPackage\n",
+ "\n",
+ "document = Document(\n",
+ " fields=[\n",
+ " Field(name = \"id\", type = \"string\", indexing = [\"attribute\", \"summary\"]),\n",
+ " Field(name = \"title\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\"),\n",
+ " Field(name = \"body\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\") \n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "msmarco_schema = Schema(\n",
+ " name = \"msmarco\", \n",
+ " document = document, \n",
+ " fieldsets = [FieldSet(name = \"default\", fields = [\"title\", \"body\"])],\n",
+ " rank_profiles = [RankProfile(name = \"default\", first_phase = \"nativeRank(title, body)\")]\n",
+ ")\n",
+ "\n",
+ "app_package = ApplicationPackage(name = \"msmarco\", schema=msmarco_schema)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Deploy it locally"
+ ]
+ },
+ {
+ "cell_type": "raw",
+ "metadata": {},
+ "source": [
+ "from vespa.package import VespaDocker\n",
+ "\n",
+ "vespa_docker = VespaDocker(application_package=app_package)\n",
+ "vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Change the application package and redeploy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can add a new rank profile and redeploy our application."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app_package.schema.add_rank_profile(\n",
+ " RankProfile(name = \"bm25\", inherits = \"default\", first_phase = \"bm25(title) + bm25(body)\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "raw",
+ "metadata": {},
+ "source": [
+ "vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/python/vespa/docs/sphinx/source/collect-training-data.ipynb b/python/vespa/docs/sphinx/source/collect-training-data.ipynb
new file mode 100644
index 00000000000..ab0952bb11c
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/collect-training-data.ipynb
@@ -0,0 +1,1231 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hide\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Vespa - collect training data\n",
+ "\n",
+ "> Collect training data to analyse and/or improve ranking functions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Example setup"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Connect to the application and define a query model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.application import Vespa\n",
+ "from vespa.query import Query, RankProfile, OR\n",
+ "\n",
+ "app = Vespa(url = \"https://api.cord19.vespa.ai\")\n",
+ "query_model = Query(\n",
+ " match_phase = OR(),\n",
+ " rank_profile = RankProfile(name=\"bm25\", list_features=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Define some labelled data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "labelled_data = [\n",
+ " {\n",
+ " \"query_id\": 0, \n",
+ " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
+ " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
+ " },\n",
+ " {\n",
+ " \"query_id\": 1, \n",
+ " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
+ " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
+ " }\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Collect training data in batch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>attributeMatch(authors.first)</th>\n",
+ " <th>attributeMatch(authors.first).averageWeight</th>\n",
+ " <th>attributeMatch(authors.first).completeness</th>\n",
+ " <th>attributeMatch(authors.first).fieldCompleteness</th>\n",
+ " <th>attributeMatch(authors.first).importance</th>\n",
+ " <th>attributeMatch(authors.first).matches</th>\n",
+ " <th>attributeMatch(authors.first).maxWeight</th>\n",
+ " <th>attributeMatch(authors.first).normalizedWeight</th>\n",
+ " <th>attributeMatch(authors.first).normalizedWeightedWeight</th>\n",
+ " <th>attributeMatch(authors.first).queryCompleteness</th>\n",
+ " <th>...</th>\n",
+ " <th>textSimilarity(results).queryCoverage</th>\n",
+ " <th>textSimilarity(results).score</th>\n",
+ " <th>textSimilarity(title).fieldCoverage</th>\n",
+ " <th>textSimilarity(title).order</th>\n",
+ " <th>textSimilarity(title).proximity</th>\n",
+ " <th>textSimilarity(title).queryCoverage</th>\n",
+ " <th>textSimilarity(title).score</th>\n",
+ " <th>document_id</th>\n",
+ " <th>query_id</th>\n",
+ " <th>relevant</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>56212</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.187500</td>\n",
+ " <td>0.5</td>\n",
+ " <td>0.617188</td>\n",
+ " <td>0.428571</td>\n",
+ " <td>0.457087</td>\n",
+ " <td>34026</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>3</td>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>56212</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.187500</td>\n",
+ " <td>0.5</td>\n",
+ " <td>0.617188</td>\n",
+ " <td>0.428571</td>\n",
+ " <td>0.457087</td>\n",
+ " <td>34026</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.071429</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.039286</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>29774</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.500000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>0.333333</td>\n",
+ " <td>0.700000</td>\n",
+ " <td>22787</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.058824</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.036765</td>\n",
+ " <td>5</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>10</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>29774</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>11</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.500000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>0.333333</td>\n",
+ " <td>0.700000</td>\n",
+ " <td>22787</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>12 rows × 984 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " attributeMatch(authors.first) \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).averageWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).completeness \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).fieldCompleteness \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).importance \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).matches \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).maxWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).normalizedWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).normalizedWeightedWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).queryCompleteness ... \\\n",
+ "0 0.0 ... \n",
+ "1 0.0 ... \n",
+ "2 0.0 ... \n",
+ "3 0.0 ... \n",
+ "4 0.0 ... \n",
+ "5 0.0 ... \n",
+ "6 0.0 ... \n",
+ "7 0.0 ... \n",
+ "8 0.0 ... \n",
+ "9 0.0 ... \n",
+ "10 0.0 ... \n",
+ "11 0.0 ... \n",
+ "\n",
+ " textSimilarity(results).queryCoverage textSimilarity(results).score \\\n",
+ "0 0.0 0.0 \n",
+ "1 0.0 0.0 \n",
+ "2 0.0 0.0 \n",
+ "3 0.0 0.0 \n",
+ "4 0.0 0.0 \n",
+ "5 0.0 0.0 \n",
+ "6 0.0 0.0 \n",
+ "7 0.0 0.0 \n",
+ "8 0.0 0.0 \n",
+ "9 0.0 0.0 \n",
+ "10 0.0 0.0 \n",
+ "11 0.0 0.0 \n",
+ "\n",
+ " textSimilarity(title).fieldCoverage textSimilarity(title).order \\\n",
+ "0 0.000000 0.0 \n",
+ "1 1.000000 1.0 \n",
+ "2 0.187500 0.5 \n",
+ "3 0.000000 0.0 \n",
+ "4 1.000000 1.0 \n",
+ "5 0.187500 0.5 \n",
+ "6 0.071429 0.0 \n",
+ "7 1.000000 1.0 \n",
+ "8 0.500000 1.0 \n",
+ "9 0.058824 0.0 \n",
+ "10 1.000000 1.0 \n",
+ "11 0.500000 1.0 \n",
+ "\n",
+ " textSimilarity(title).proximity textSimilarity(title).queryCoverage \\\n",
+ "0 0.000000 0.000000 \n",
+ "1 1.000000 1.000000 \n",
+ "2 0.617188 0.428571 \n",
+ "3 0.000000 0.000000 \n",
+ "4 1.000000 1.000000 \n",
+ "5 0.617188 0.428571 \n",
+ "6 0.000000 0.083333 \n",
+ "7 1.000000 1.000000 \n",
+ "8 1.000000 0.333333 \n",
+ "9 0.000000 0.083333 \n",
+ "10 1.000000 1.000000 \n",
+ "11 1.000000 0.333333 \n",
+ "\n",
+ " textSimilarity(title).score document_id query_id relevant \n",
+ "0 0.000000 0 0 1 \n",
+ "1 1.000000 56212 0 0 \n",
+ "2 0.457087 34026 0 0 \n",
+ "3 0.000000 3 0 1 \n",
+ "4 1.000000 56212 0 0 \n",
+ "5 0.457087 34026 0 0 \n",
+ "6 0.039286 1 1 1 \n",
+ "7 1.000000 29774 1 0 \n",
+ "8 0.700000 22787 1 0 \n",
+ "9 0.036765 5 1 1 \n",
+ "10 1.000000 29774 1 0 \n",
+ "11 0.700000 22787 1 0 \n",
+ "\n",
+ "[12 rows x 984 columns]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "training_data_batch = app.collect_training_data(\n",
+ " labelled_data = labelled_data,\n",
+ " id_field = \"id\",\n",
+ " query_model = query_model,\n",
+ " number_additional_docs = 2\n",
+ ")\n",
+ "training_data_batch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Collect training data point\n",
+ "\n",
+ "> You can have finer control with the `collect_training_data_point` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>attributeMatch(authors.first)</th>\n",
+ " <th>attributeMatch(authors.first).averageWeight</th>\n",
+ " <th>attributeMatch(authors.first).completeness</th>\n",
+ " <th>attributeMatch(authors.first).fieldCompleteness</th>\n",
+ " <th>attributeMatch(authors.first).importance</th>\n",
+ " <th>attributeMatch(authors.first).matches</th>\n",
+ " <th>attributeMatch(authors.first).maxWeight</th>\n",
+ " <th>attributeMatch(authors.first).normalizedWeight</th>\n",
+ " <th>attributeMatch(authors.first).normalizedWeightedWeight</th>\n",
+ " <th>attributeMatch(authors.first).queryCompleteness</th>\n",
+ " <th>...</th>\n",
+ " <th>textSimilarity(results).queryCoverage</th>\n",
+ " <th>textSimilarity(results).score</th>\n",
+ " <th>textSimilarity(title).fieldCoverage</th>\n",
+ " <th>textSimilarity(title).order</th>\n",
+ " <th>textSimilarity(title).proximity</th>\n",
+ " <th>textSimilarity(title).queryCoverage</th>\n",
+ " <th>textSimilarity(title).score</th>\n",
+ " <th>document_id</th>\n",
+ " <th>query_id</th>\n",
+ " <th>relevant</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>56212</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.187500</td>\n",
+ " <td>0.5</td>\n",
+ " <td>0.617188</td>\n",
+ " <td>0.428571</td>\n",
+ " <td>0.457087</td>\n",
+ " <td>34026</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>3</td>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>56212</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.187500</td>\n",
+ " <td>0.5</td>\n",
+ " <td>0.617188</td>\n",
+ " <td>0.428571</td>\n",
+ " <td>0.457087</td>\n",
+ " <td>34026</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.071429</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.039286</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>29774</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.500000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>0.333333</td>\n",
+ " <td>0.700000</td>\n",
+ " <td>22787</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.058824</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.036765</td>\n",
+ " <td>5</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>10</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>29774</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>11</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.500000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>0.333333</td>\n",
+ " <td>0.700000</td>\n",
+ " <td>22787</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>12 rows × 984 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " attributeMatch(authors.first) \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).averageWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).completeness \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).fieldCompleteness \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).importance \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).matches \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).maxWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).normalizedWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).normalizedWeightedWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).queryCompleteness ... \\\n",
+ "0 0.0 ... \n",
+ "1 0.0 ... \n",
+ "2 0.0 ... \n",
+ "3 0.0 ... \n",
+ "4 0.0 ... \n",
+ "5 0.0 ... \n",
+ "6 0.0 ... \n",
+ "7 0.0 ... \n",
+ "8 0.0 ... \n",
+ "9 0.0 ... \n",
+ "10 0.0 ... \n",
+ "11 0.0 ... \n",
+ "\n",
+ " textSimilarity(results).queryCoverage textSimilarity(results).score \\\n",
+ "0 0.0 0.0 \n",
+ "1 0.0 0.0 \n",
+ "2 0.0 0.0 \n",
+ "3 0.0 0.0 \n",
+ "4 0.0 0.0 \n",
+ "5 0.0 0.0 \n",
+ "6 0.0 0.0 \n",
+ "7 0.0 0.0 \n",
+ "8 0.0 0.0 \n",
+ "9 0.0 0.0 \n",
+ "10 0.0 0.0 \n",
+ "11 0.0 0.0 \n",
+ "\n",
+ " textSimilarity(title).fieldCoverage textSimilarity(title).order \\\n",
+ "0 0.000000 0.0 \n",
+ "1 1.000000 1.0 \n",
+ "2 0.187500 0.5 \n",
+ "3 0.000000 0.0 \n",
+ "4 1.000000 1.0 \n",
+ "5 0.187500 0.5 \n",
+ "6 0.071429 0.0 \n",
+ "7 1.000000 1.0 \n",
+ "8 0.500000 1.0 \n",
+ "9 0.058824 0.0 \n",
+ "10 1.000000 1.0 \n",
+ "11 0.500000 1.0 \n",
+ "\n",
+ " textSimilarity(title).proximity textSimilarity(title).queryCoverage \\\n",
+ "0 0.000000 0.000000 \n",
+ "1 1.000000 1.000000 \n",
+ "2 0.617188 0.428571 \n",
+ "3 0.000000 0.000000 \n",
+ "4 1.000000 1.000000 \n",
+ "5 0.617188 0.428571 \n",
+ "6 0.000000 0.083333 \n",
+ "7 1.000000 1.000000 \n",
+ "8 1.000000 0.333333 \n",
+ "9 0.000000 0.083333 \n",
+ "10 1.000000 1.000000 \n",
+ "11 1.000000 0.333333 \n",
+ "\n",
+ " textSimilarity(title).score document_id query_id relevant \n",
+ "0 0.000000 0 0 1 \n",
+ "1 1.000000 56212 0 0 \n",
+ "2 0.457087 34026 0 0 \n",
+ "3 0.000000 3 0 1 \n",
+ "4 1.000000 56212 0 0 \n",
+ "5 0.457087 34026 0 0 \n",
+ "6 0.039286 1 1 1 \n",
+ "7 1.000000 29774 1 0 \n",
+ "8 0.700000 22787 1 0 \n",
+ "9 0.036765 5 1 1 \n",
+ "10 1.000000 29774 1 0 \n",
+ "11 0.700000 22787 1 0 \n",
+ "\n",
+ "[12 rows x 984 columns]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from pandas import concat, DataFrame\n",
+ "\n",
+ "\n",
+ "training_data = []\n",
+ "for query_data in labelled_data:\n",
+ " for doc_data in query_data[\"relevant_docs\"]:\n",
+ " training_data_point = app.collect_training_data_point(\n",
+ " query = query_data[\"query\"],\n",
+ " query_id = query_data[\"query_id\"],\n",
+ " relevant_id = doc_data[\"id\"],\n",
+ " id_field = \"id\",\n",
+ " query_model = query_model,\n",
+ " number_additional_docs = 2\n",
+ " )\n",
+ " training_data.extend(training_data_point)\n",
+ "training_data = DataFrame.from_records(training_data)\n",
+ "training_data"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/python/vespa/docs/sphinx/source/create-and-deploy-vespa-cloud.ipynb b/python/vespa/docs/sphinx/source/create-and-deploy-vespa-cloud.ipynb
new file mode 100644
index 00000000000..3c0f4a4201d
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/create-and-deploy-vespa-cloud.ipynb
@@ -0,0 +1,691 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hide\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Build end-to-end Vespa apps with pyvespa\n",
+ "\n",
+ "> Python API to create, modify, deploy and interact with Vespa applications"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial will create a text search application from scratch based on the MS MARCO dataset, similar to our [text search tutorials](https://docs.vespa.ai/documentation/tutorials/text-search.html). We will first show how to define the app by creating an application package [REF]. Then we deploy the app to Vespa Cloud. Once the app is up and running we show how to feed data to it. After the data is sent, we can make queries and inspect the results. We then show how to add a new rank profile to the application package and redeploy the app with the latest changes. Finally, we show how to evaluate and compare two rank profiles with evaluation metrics such as Recall and Reciprocal Rank."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Application package API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We first create a `Document` instance containing the `Field`s that we want to store in the app. In this case we will keep the application simple and only feed a unique `id`, `title` and `body` of the MS MARCO documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import Document, Field\n",
+ "\n",
+ "document = Document(\n",
+ " fields=[\n",
+ " Field(name = \"id\", type = \"string\", indexing = [\"attribute\", \"summary\"]),\n",
+ " Field(name = \"title\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\"),\n",
+ " Field(name = \"body\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\") \n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The complete `Schema` of our application will be named `msmarco` and contains the `Document` instance that we defined above. The default `FieldSet` indicates that queries will look for matches by searching both the titles and the bodies of the documents. The default `RankProfile` indicates that all matched documents will be ranked by the `nativeRank` expression involving the title and the body of the matched documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import Schema, FieldSet, RankProfile\n",
+ "\n",
+ "msmarco_schema = Schema(\n",
+ " name = \"msmarco\", \n",
+ " document = document, \n",
+ " fieldsets = [FieldSet(name = \"default\", fields = [\"title\", \"body\"])],\n",
+ " rank_profiles = [RankProfile(name = \"default\", first_phase = \"nativeRank(title, body)\")]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once the `Schema` is defined, all we have to do is to create our msmarco `ApplicationPackage`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import ApplicationPackage\n",
+ "\n",
+ "app_package = ApplicationPackage(name = \"msmarco\", schema=msmarco_schema)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "At this point, `app_package` contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Deploy to Vespa Cloud"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial shows how to deploy the application package to [Vespa Cloud](https://cloud.vespa.ai/). For the following to work you need to sign up for Vespa Cloud, register an application name there, and generate your user API key in the Vespa Cloud console."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We first create a `VespaCloud` context named `cloud` that will handle the secure communication with Vespa Cloud servers. In order to do that, all we need is your Vespa Cloud tenant name, the application name that you registered and the user key you generated on the Vespa Cloud console:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Note:** The first call to `cloud.deploy` takes around 15 minutes, as Vespa Cloud has to set up the environment. Subsequent calls will be much faster, usually taking less than 10 seconds."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import VespaCloud\n",
+ "\n",
+ "with VespaCloud(\"vespa-team\", \"ms-marco\", \"/Users/tmartins/sample_application/tmartins.vespa-team.pem\") as cloud:\n",
+ " vespa = cloud.deploy('from-notebook', app_package)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Deployment started in run 12 of dev-aws-us-east-1c for vespa-team.ms-marco.from-notebook. This may take about 15 minutes the first time.\n",
+ "INFO [10:37:04] Deploying platform version 7.278.21 and application version unknown ...\n",
+ "INFO [10:37:05] No services requiring restart.\n",
+ "INFO [10:37:05] Deployment successful.\n",
+ "INFO [10:37:05] Session 13751 for tenant 'vespa-team' prepared and activated.\n",
+ "INFO [10:37:06] ######## Details for all nodes ########\n",
+ "INFO [10:37:06] h711a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
+ "INFO [10:37:06] --- platform docker.ouroath.com:4443/vespa/centos-tenant:7.278.21\n",
+ "INFO [10:37:06] --- container on port 4080 has not started \n",
+ "INFO [10:37:06] h712a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
+ "INFO [10:37:06] --- platform docker.ouroath.com:4443/vespa/centos-tenant:7.278.21\n",
+ "INFO [10:37:06] --- logserver-container on port 4080 has config generation 13751, wanted is 13751\n",
+ "INFO [10:37:06] h713a.dev.aws-us-east-1c.vespa-external.aws.oath.cloud: expected to be UP\n",
+ "INFO [10:37:06] --- platform docker.ouroath.com:4443/vespa/centos-tenant:7.278.21\n",
+ "INFO [10:37:06] --- container-clustercontroller on port 19050 has config generation 13751, wanted is 13751\n",
+ "INFO [10:37:06] --- storagenode on port 19102 has config generation 13750, wanted is 13751\n",
+ "INFO [10:37:06] --- searchnode on port 19107 has config generation 13751, wanted is 13751\n",
+ "INFO [10:37:06] --- distributor on port 19111 has config generation 13751, wanted is 13751\n",
+ "INFO [10:37:30] Found endpoints:\n",
+ "INFO [10:37:30] - dev.aws-us-east-1c\n",
+ "INFO [10:37:30] |-- https://msmarco-container.from-notebook.ms-marco.vespa-team.aws-us-east-1c.dev.public.vespa.oath.cloud/ (cluster 'msmarco_container')\n",
+ "INFO [10:37:31] Installation succeeded!\n"
+ ]
+ }
+ ],
+ "source": [
+ "from vespa.package import VespaCloud\n",
+ "\n",
+ "vespa_cloud = VespaCloud(\n",
+ " \"vespa-team\", \n",
+ " \"ms-marco\", \n",
+ " \"/Users/tmartins/sample_application/tmartins.vespa-team.pem\", \n",
+ " app_package\n",
+ ")\n",
+ "app = vespa_cloud.deploy('from-notebook', \"/Users/tmartins/sample_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `app` variable above holds a `Vespa` instance that will be used to connect to and interact with our text search application. We can see the deployment message returned by the Vespa engine:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app.__class__"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app.deployment_message"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Feed data to the app "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We now have our text search app up and running. We can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. We can load 996 documents that we want to feed and check the first two documents in this sample."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pandas import read_csv\n",
+ "\n",
+ "docs = read_csv(\"https://thigm85.github.io/data/msmarco/docs.tsv\", sep = \"\\t\")\n",
+ "docs.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "docs.head(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To feed the data we need to specify the `schema` that we are sending data to. We named our schema `msmarco` in a previous section. Each data point needs a unique `data_id` associated with it, regardless of whether the schema has an id field. The `fields` should be a dict containing all the fields in the schema, which are `id`, `title` and `body` in our case."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app.feed_data_point(\n",
+ " schema = \"msmarco\", \n",
+ " data_id = \"test\", \n",
+ " fields = {\n",
+ " \"id\": \"test\", \n",
+ " \"title\": \"this is a test title\", \n",
+ " \"body\": \"this is test body\"\n",
+ " }\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for idx, row in docs.iterrows():\n",
+ " print(idx)\n",
+ " response = app.feed_data_point(\n",
+ " schema = \"msmarco\", \n",
+ " data_id = str(row[\"id\"]), \n",
+ " fields = {\n",
+ " \"id\": str(row[\"id\"]), \n",
+ " \"title\": str(row[\"title\"]), \n",
+ " \"body\": str(row[\"body\"])\n",
+ " }\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each call to the method `feed_data_point` sends a POST request to the appropriate Vespa endpoint and we can check the response of the requests if needed, such as the status code and the message returned."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response.json()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Make a simple query"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once our application is fed we can start to use it by sending queries to it. The MS MARCO app expects to receive questions as queries, and the goal of the application is to return documents that are relevant to the questions asked."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the example below, we will send a question via the `query` parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a `Query` model. The query model below will have the `OR` operator in the match phase, indicating that the application will match all the documents which have at least one query term within the title or the body (due to the default `FieldSet` we defined earlier) of the document. And we will rank all the matched documents by the default `RankProfile` that we defined earlier."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Query, OR, RankProfile as Ranking\n",
+ "\n",
+ "results = app.query(\n",
+ " query=\"Where is my text?\", \n",
+ " query_model = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"default\")\n",
+ " ),\n",
+ " hits = 2\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results.hits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In addition to the `query` and `query_model` parameters, we can specify a multitude of relevant Vespa parameters such as the number of `hits` that we want Vespa to return. We chose `hits=2` for simplicity in this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "len(results.hits)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Change the application package and redeploy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also make specific changes to our application by changing the application package and redeploying. Let's add a new rank profile based on BM25 to our `Schema`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app_package.schema.add_rank_profile(\n",
+ " RankProfile(name = \"bm25\", inherits = \"default\", first_phase = \"bm25(title) + bm25(body)\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "After that we can redeploy our application, similar to what we did earlier:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app = vespa_cloud.deploy('from-notebook', \"/Users/tmartins/sample_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then use the newly created `bm25` rank profile to make queries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = app.query(\n",
+ " query=\"Where is my text?\", \n",
+ " query_model = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"bm25\")\n",
+ " ),\n",
+ " hits = 2\n",
+ ")\n",
+ "len(results.hits)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Compare query models"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When we are building a search application, we often want to experiment and compare different query models. In this section we want to show how easy it is to compare different query models in Vespa."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's load some labelled data where each data point contains a `query_id`, a `query` and a list of `relevant_docs` associated with the query. In this case, we have only one relevant document for each query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "\n",
+ "labelled_data = json.loads(\n",
+ " requests.get(\"https://thigm85.github.io/data/msmarco/query-labels.json\").text\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below are two examples of the labelled data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "labelled_data[0:2]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's define two `Query` models to be compared. We are going to use the same `OR` operator in the match phase and compare the `default` and `bm25` rank profiles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "default_ranking = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"default\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bm25_ranking = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"bm25\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we will choose which evaluation metrics we want to look at. In this case we will choose the `MatchRatio` to check how many documents have been matched by the query, the `Recall` at 10 and the `ReciprocalRank` at 10."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
+ "\n",
+ "eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now run the `evaluate` method for each `Query` model. This will send queries to the application and process the results to compute the `eval_metrics` defined above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "default_evaluation = app.evaluate(\n",
+ " labelled_data=labelled_data, \n",
+ " eval_metrics=eval_metrics, \n",
+ " query_model=default_ranking, \n",
+ " id_field=\"id\",\n",
+ " timeout=5,\n",
+ " hits=10\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bm25_evaluation = app.evaluate(\n",
+ " labelled_data=labelled_data, \n",
+ " eval_metrics=eval_metrics, \n",
+ " query_model=bm25_ranking, \n",
+ " id_field=\"id\",\n",
+ " timeout=5,\n",
+ " hits=10\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then merge the DataFrames returned by the `evaluate` method and start to analyse the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pandas import merge\n",
+ "\n",
+ "eval_comparison = merge(\n",
+ " left=default_evaluation, \n",
+ " right=bm25_evaluation, \n",
+ " on=\"query_id\", \n",
+ " suffixes=('_default', '_bm25')\n",
+ ")\n",
+ "eval_comparison.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that we expect to observe the same match ratio for both query models since they use the same `OR` operator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "eval_comparison[[\"match_ratio_value_default\", \"match_ratio_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `bm25` rank profile obtained a significantly higher recall than the `default`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "eval_comparison[[\"recall_10_value_default\", \"recall_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Similarly, `bm25` also gets a significantly higher reciprocal rank value when compared to the `default` rank profile."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "eval_comparison[[\"reciprocal_rank_10_value_default\", \"reciprocal_rank_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/python/vespa/docs/sphinx/source/deploy-application.ipynb b/python/vespa/docs/sphinx/source/deploy-application.ipynb
new file mode 100644
index 00000000000..956cbd9b30a
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/deploy-application.ipynb
@@ -0,0 +1,32 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/python/vespa/docs/sphinx/source/evaluation.ipynb b/python/vespa/docs/sphinx/source/evaluation.ipynb
new file mode 100644
index 00000000000..9a37effc691
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/evaluation.ipynb
@@ -0,0 +1,296 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hide\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Vespa - Evaluate query models\n",
+ "\n",
+ "> Define metrics and evaluate query models"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Example setup"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Connect to the application and define a query model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.application import Vespa\n",
+ "from vespa.query import Query, RankProfile, OR\n",
+ "\n",
+ "app = Vespa(url = \"https://api.cord19.vespa.ai\")\n",
+ "query_model = Query(\n",
+ " match_phase = OR(),\n",
+ " rank_profile = RankProfile(name=\"bm25\", list_features=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Define some labelled data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "labelled_data = [\n",
+ " {\n",
+ " \"query_id\": 0, \n",
+ " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
+ " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
+ " },\n",
+ " {\n",
+ " \"query_id\": 1, \n",
+ " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
+ " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
+ " }\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define metrics"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
+ "\n",
+ "eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Evaluate in batch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>query_id</th>\n",
+ " <th>match_ratio_retrieved_docs</th>\n",
+ " <th>match_ratio_docs_available</th>\n",
+ " <th>match_ratio_value</th>\n",
+ " <th>recall_10_value</th>\n",
+ " <th>reciprocal_rank_10_value</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>52526</td>\n",
+ " <td>58692</td>\n",
+ " <td>0.894943</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>1</td>\n",
+ " <td>54048</td>\n",
+ " <td>58692</td>\n",
+ " <td>0.920875</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " query_id match_ratio_retrieved_docs match_ratio_docs_available \\\n",
+ "0 0 52526 58692 \n",
+ "1 1 54048 58692 \n",
+ "\n",
+ " match_ratio_value recall_10_value reciprocal_rank_10_value \n",
+ "0 0.894943 0 0 \n",
+ "1 0.920875 0 0 "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "evaluation = app.evaluate(\n",
+ " labelled_data = labelled_data,\n",
+ " eval_metrics = eval_metrics, \n",
+ " query_model = query_model, \n",
+ " id_field = \"id\",\n",
+ ")\n",
+ "evaluation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Evaluate specific query\n",
+ "\n",
+ "> You can have finer control with the `evaluate_query` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>query_id</th>\n",
+ " <th>match_ratio_retrieved_docs</th>\n",
+ " <th>match_ratio_docs_available</th>\n",
+ " <th>match_ratio_value</th>\n",
+ " <th>recall_10_value</th>\n",
+ " <th>reciprocal_rank_10_value</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>52526</td>\n",
+ " <td>58692</td>\n",
+ " <td>0.894943</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>1</td>\n",
+ " <td>54048</td>\n",
+ " <td>58692</td>\n",
+ " <td>0.920875</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " query_id match_ratio_retrieved_docs match_ratio_docs_available \\\n",
+ "0 0 52526 58692 \n",
+ "1 1 54048 58692 \n",
+ "\n",
+ " match_ratio_value recall_10_value reciprocal_rank_10_value \n",
+ "0 0.894943 0 0 \n",
+ "1 0.920875 0 0 "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from pandas import concat, DataFrame\n",
+ "\n",
+ "evaluation = []\n",
+ "for query_data in labelled_data:\n",
+ " query_evaluation = app.evaluate_query(\n",
+ " eval_metrics = eval_metrics, \n",
+ " query_model = query_model, \n",
+ " query_id = query_data[\"query_id\"], \n",
+ " query = query_data[\"query\"], \n",
+ " id_field = \"id\",\n",
+ " relevant_docs = query_data[\"relevant_docs\"],\n",
+ " default_score = 0\n",
+ " )\n",
+ " evaluation.append(query_evaluation)\n",
+ "evaluation = DataFrame.from_records(evaluation)\n",
+ "evaluation"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/python/vespa/docs/sphinx/source/query-model.ipynb b/python/vespa/docs/sphinx/source/query-model.ipynb
new file mode 100644
index 00000000000..956cbd9b30a
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/query-model.ipynb
@@ -0,0 +1,32 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/python/vespa/docs/sphinx/source/query.ipynb b/python/vespa/docs/sphinx/source/query.ipynb
new file mode 100644
index 00000000000..82bc1b8ac29
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/query.ipynb
@@ -0,0 +1,320 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hide\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Query API\n",
+ "\n",
+ "> Python query API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can connect to the CORD-19 Search app and use it to illustrate the query API."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.application import Vespa\n",
+ "\n",
+ "app = Vespa(url = \"https://api.cord19.vespa.ai\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Specify the request body\n",
+ "\n",
+ "> Full flexibility by specifying the entire request body"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "body = {\n",
+ " 'yql': 'select title, abstract from sources * where userQuery();',\n",
+ " 'hits': 5,\n",
+ " 'query': 'Is remdesivir an effective treatment for COVID-19?',\n",
+ " 'type': 'any',\n",
+ " 'ranking': 'bm25'\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = app.query(body=body)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "108882"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results.number_documents_retrieved"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Specify a query model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Query + term-matching + rank profile"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Query, OR, RankProfile\n",
+ "\n",
+ "results = app.query(\n",
+ " query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
+ " query_model = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=RankProfile(name=\"bm25\")\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "108882"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results.number_documents_retrieved"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Query + term-matching + ann operator + rank_profile"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Query, ANN, WeakAnd, Union, RankProfile\n",
+ "from random import random\n",
+ "\n",
+ "match_phase = Union(\n",
+ " WeakAnd(hits=10),\n",
+ " ANN(\n",
+ " doc_vector=\"title_embedding\",\n",
+ " query_vector=\"title_vector\",\n",
+ " # placeholder embedding: ignores the query text x and returns a random 768-dimensional vector\n",
+ " embedding_model=lambda x: [random() for _ in range(768)],\n",
+ " hits=10,\n",
+ " label=\"title\"\n",
+ " )\n",
+ ")\n",
+ "rank_profile = RankProfile(name=\"bm25\", list_features=True)\n",
+ "query_model = Query(match_phase=match_phase, rank_profile=rank_profile)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = app.query(query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
+ " query_model=query_model)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "947"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results.number_documents_retrieved"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Recall specific documents"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's look at the ids of the top 3 hits from the last query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[117166, 60125, 28903]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "top_ids = [hit[\"fields\"][\"id\"] for hit in results.hits[0:3]]\n",
+ "top_ids"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Suppose we now want to retrieve only the documents with the second and third ids above. We can do so with the `recall` argument."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results_with_recall = app.query(query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
+ " query_model=query_model,\n",
+ " recall = (\"id\", top_ids[1:3]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The query retrieves only documents whose Vespa field `id` matches one of the values in the list passed as the second element of the tuple."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[60125, 28903]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "id_recalled = [hit[\"fields\"][\"id\"] for hit in results_with_recall.hits]\n",
+ "id_recalled"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#hide\n",
+ "from fastcore.test import all_equal, test\n",
+ "\n",
+ "test(id_recalled, top_ids[1:3], all_equal)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}