author    tmartins <thigm85@gmail.com>  2020-09-07 09:50:26 +0200
committer tmartins <thigm85@gmail.com>  2020-09-07 09:50:26 +0200
commit    d7b80f32746f7f69b77d5c0388c10c1b19d6e492 (patch)
tree      2527dae2d2146382422448b4cfdf04248b548e89 /python
parent    5ab6a06dabd0367003e619d000d1d877c8d611e6 (diff)
remove duplicated notebooks
Diffstat (limited to 'python')
-rw-r--r--  python/vespa/notebooks/application_package.ipynb       |  176
-rw-r--r--  python/vespa/notebooks/collect_training_data.ipynb     | 1231
-rw-r--r--  python/vespa/notebooks/connect-to-vespa-instance.ipynb |  977
-rw-r--r--  python/vespa/notebooks/create-and-deploy-vespa.ipynb   | 1064
-rw-r--r--  python/vespa/notebooks/evaluation.ipynb                |  296
-rw-r--r--  python/vespa/notebooks/index.ipynb                     |  255
-rw-r--r--  python/vespa/notebooks/query.ipynb                     |  320
7 files changed, 0 insertions, 4319 deletions
diff --git a/python/vespa/notebooks/application_package.ipynb b/python/vespa/notebooks/application_package.ipynb
deleted file mode 100644
index 5cc1638f7de..00000000000
--- a/python/vespa/notebooks/application_package.ipynb
+++ /dev/null
@@ -1,176 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Vespa - Application Package\n",
- "\n",
- "> Python API to create, modify and deploy application packages"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Our goal is to create, modify and deploy simple application packages using our python API. This enables us to run data analysis experiments that are fully integrated with Vespa. As an example, we want to create the application package we used in our [text search tutorial](https://docs.vespa.ai/documentation/tutorials/text-search.html). "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Application spec"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Our goal in this section is to create the following `msmarco` schema using our python API."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "```\n",
- "schema msmarco {\n",
- " document msmarco {\n",
- " field id type string {\n",
- " indexing: attribute | summary\n",
- " }\n",
- " field title type string {\n",
- " indexing: index | summary\n",
- " index: enable-bm25\n",
- " }\n",
- " field body type string {\n",
- " indexing: index | summary\n",
- " index: enable-bm25\n",
- " }\n",
- " }\n",
- "\n",
- " fieldset default {\n",
- " fields: title, body\n",
- " }\n",
- "\n",
- " rank-profile default {\n",
- " first-phase {\n",
- " expression: nativeRank(title, body)\n",
- " }\n",
- " }\n",
- "\n",
- " rank-profile bm25 inherits default {\n",
- " first-phase {\n",
- " expression: bm25(title) + bm25(body)\n",
- " }\n",
- " }\n",
- "\n",
- "}\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Schema API"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.package import Document, Field, Schema, FieldSet, RankProfile, ApplicationPackage\n",
- "\n",
- "document = Document(\n",
- " fields=[\n",
- " Field(name = \"id\", type = \"string\", indexing = [\"attribute\", \"summary\"]),\n",
- " Field(name = \"title\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\"),\n",
- " Field(name = \"body\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\") \n",
- " ]\n",
- ")\n",
- "\n",
- "msmarco_schema = Schema(\n",
- " name = \"msmarco\", \n",
- " document = document, \n",
- " fieldsets = [FieldSet(name = \"default\", fields = [\"title\", \"body\"])],\n",
- " rank_profiles = [RankProfile(name = \"default\", first_phase = \"nativeRank(title, body)\")]\n",
- ")\n",
- "\n",
- "app_package = ApplicationPackage(name = \"msmarco\", schema=msmarco_schema)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Deploy it locally"
- ]
- },
- {
- "cell_type": "raw",
- "metadata": {},
- "source": [
- "from vespa.package import VespaDocker\n",
- "\n",
- "vespa_docker = VespaDocker(application_package=app_package)\n",
- "vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Change the application package and redeploy"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can add a new rank profile and redeploy our application"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "app_package.schema.add_rank_profile(\n",
- " RankProfile(name = \"bm25\", inherits = \"default\", first_phase = \"bm25(title) + bm25(body)\")\n",
- ")"
- ]
- },
- {
- "cell_type": "raw",
- "metadata": {},
- "source": [
- "vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/python/vespa/notebooks/collect_training_data.ipynb b/python/vespa/notebooks/collect_training_data.ipynb
deleted file mode 100644
index ab0952bb11c..00000000000
--- a/python/vespa/notebooks/collect_training_data.ipynb
+++ /dev/null
@@ -1,1231 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Vespa - collect training data\n",
- "\n",
- "> Collect training data to analyse and/or improve ranking functions"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Example setup"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Connect to the application and define a query model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.application import Vespa\n",
- "from vespa.query import Query, RankProfile, OR\n",
- "\n",
- "app = Vespa(url = \"https://api.cord19.vespa.ai\")\n",
- "query_model = Query(\n",
- " match_phase = OR(),\n",
- " rank_profile = RankProfile(name=\"bm25\", list_features=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Define some labelled data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "labelled_data = [\n",
- " {\n",
- " \"query_id\": 0, \n",
- " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
- " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
- " },\n",
- " {\n",
- " \"query_id\": 1, \n",
- " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
- " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
- " }\n",
- "]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Collect training data in batch"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>attributeMatch(authors.first)</th>\n",
- " <th>attributeMatch(authors.first).averageWeight</th>\n",
- " <th>attributeMatch(authors.first).completeness</th>\n",
- " <th>attributeMatch(authors.first).fieldCompleteness</th>\n",
- " <th>attributeMatch(authors.first).importance</th>\n",
- " <th>attributeMatch(authors.first).matches</th>\n",
- " <th>attributeMatch(authors.first).maxWeight</th>\n",
- " <th>attributeMatch(authors.first).normalizedWeight</th>\n",
- " <th>attributeMatch(authors.first).normalizedWeightedWeight</th>\n",
- " <th>attributeMatch(authors.first).queryCompleteness</th>\n",
- " <th>...</th>\n",
- " <th>textSimilarity(results).queryCoverage</th>\n",
- " <th>textSimilarity(results).score</th>\n",
- " <th>textSimilarity(title).fieldCoverage</th>\n",
- " <th>textSimilarity(title).order</th>\n",
- " <th>textSimilarity(title).proximity</th>\n",
- " <th>textSimilarity(title).queryCoverage</th>\n",
- " <th>textSimilarity(title).score</th>\n",
- " <th>document_id</th>\n",
- " <th>query_id</th>\n",
- " <th>relevant</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>56212</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.187500</td>\n",
- " <td>0.5</td>\n",
- " <td>0.617188</td>\n",
- " <td>0.428571</td>\n",
- " <td>0.457087</td>\n",
- " <td>34026</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>3</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>56212</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.187500</td>\n",
- " <td>0.5</td>\n",
- " <td>0.617188</td>\n",
- " <td>0.428571</td>\n",
- " <td>0.457087</td>\n",
- " <td>34026</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.071429</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.039286</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>7</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>29774</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>8</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.500000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>0.333333</td>\n",
- " <td>0.700000</td>\n",
- " <td>22787</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>9</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.058824</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.036765</td>\n",
- " <td>5</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>10</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>29774</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>11</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.500000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>0.333333</td>\n",
- " <td>0.700000</td>\n",
- " <td>22787</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "<p>12 rows × 984 columns</p>\n",
- "</div>"
- ],
- "text/plain": [
- " attributeMatch(authors.first) \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).averageWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).completeness \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).fieldCompleteness \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).importance \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).matches \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).maxWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).normalizedWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).normalizedWeightedWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).queryCompleteness ... \\\n",
- "0 0.0 ... \n",
- "1 0.0 ... \n",
- "2 0.0 ... \n",
- "3 0.0 ... \n",
- "4 0.0 ... \n",
- "5 0.0 ... \n",
- "6 0.0 ... \n",
- "7 0.0 ... \n",
- "8 0.0 ... \n",
- "9 0.0 ... \n",
- "10 0.0 ... \n",
- "11 0.0 ... \n",
- "\n",
- " textSimilarity(results).queryCoverage textSimilarity(results).score \\\n",
- "0 0.0 0.0 \n",
- "1 0.0 0.0 \n",
- "2 0.0 0.0 \n",
- "3 0.0 0.0 \n",
- "4 0.0 0.0 \n",
- "5 0.0 0.0 \n",
- "6 0.0 0.0 \n",
- "7 0.0 0.0 \n",
- "8 0.0 0.0 \n",
- "9 0.0 0.0 \n",
- "10 0.0 0.0 \n",
- "11 0.0 0.0 \n",
- "\n",
- " textSimilarity(title).fieldCoverage textSimilarity(title).order \\\n",
- "0 0.000000 0.0 \n",
- "1 1.000000 1.0 \n",
- "2 0.187500 0.5 \n",
- "3 0.000000 0.0 \n",
- "4 1.000000 1.0 \n",
- "5 0.187500 0.5 \n",
- "6 0.071429 0.0 \n",
- "7 1.000000 1.0 \n",
- "8 0.500000 1.0 \n",
- "9 0.058824 0.0 \n",
- "10 1.000000 1.0 \n",
- "11 0.500000 1.0 \n",
- "\n",
- " textSimilarity(title).proximity textSimilarity(title).queryCoverage \\\n",
- "0 0.000000 0.000000 \n",
- "1 1.000000 1.000000 \n",
- "2 0.617188 0.428571 \n",
- "3 0.000000 0.000000 \n",
- "4 1.000000 1.000000 \n",
- "5 0.617188 0.428571 \n",
- "6 0.000000 0.083333 \n",
- "7 1.000000 1.000000 \n",
- "8 1.000000 0.333333 \n",
- "9 0.000000 0.083333 \n",
- "10 1.000000 1.000000 \n",
- "11 1.000000 0.333333 \n",
- "\n",
- " textSimilarity(title).score document_id query_id relevant \n",
- "0 0.000000 0 0 1 \n",
- "1 1.000000 56212 0 0 \n",
- "2 0.457087 34026 0 0 \n",
- "3 0.000000 3 0 1 \n",
- "4 1.000000 56212 0 0 \n",
- "5 0.457087 34026 0 0 \n",
- "6 0.039286 1 1 1 \n",
- "7 1.000000 29774 1 0 \n",
- "8 0.700000 22787 1 0 \n",
- "9 0.036765 5 1 1 \n",
- "10 1.000000 29774 1 0 \n",
- "11 0.700000 22787 1 0 \n",
- "\n",
- "[12 rows x 984 columns]"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "training_data_batch = app.collect_training_data(\n",
- " labelled_data = labelled_data,\n",
- " id_field = \"id\",\n",
- " query_model = query_model,\n",
- " number_additional_docs = 2\n",
- ")\n",
- "training_data_batch"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Collect training data point\n",
- "\n",
- "> You can have finer control with the `collect_training_data_point` method."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>attributeMatch(authors.first)</th>\n",
- " <th>attributeMatch(authors.first).averageWeight</th>\n",
- " <th>attributeMatch(authors.first).completeness</th>\n",
- " <th>attributeMatch(authors.first).fieldCompleteness</th>\n",
- " <th>attributeMatch(authors.first).importance</th>\n",
- " <th>attributeMatch(authors.first).matches</th>\n",
- " <th>attributeMatch(authors.first).maxWeight</th>\n",
- " <th>attributeMatch(authors.first).normalizedWeight</th>\n",
- " <th>attributeMatch(authors.first).normalizedWeightedWeight</th>\n",
- " <th>attributeMatch(authors.first).queryCompleteness</th>\n",
- " <th>...</th>\n",
- " <th>textSimilarity(results).queryCoverage</th>\n",
- " <th>textSimilarity(results).score</th>\n",
- " <th>textSimilarity(title).fieldCoverage</th>\n",
- " <th>textSimilarity(title).order</th>\n",
- " <th>textSimilarity(title).proximity</th>\n",
- " <th>textSimilarity(title).queryCoverage</th>\n",
- " <th>textSimilarity(title).score</th>\n",
- " <th>document_id</th>\n",
- " <th>query_id</th>\n",
- " <th>relevant</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>56212</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.187500</td>\n",
- " <td>0.5</td>\n",
- " <td>0.617188</td>\n",
- " <td>0.428571</td>\n",
- " <td>0.457087</td>\n",
- " <td>34026</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.000000</td>\n",
- " <td>3</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>56212</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.187500</td>\n",
- " <td>0.5</td>\n",
- " <td>0.617188</td>\n",
- " <td>0.428571</td>\n",
- " <td>0.457087</td>\n",
- " <td>34026</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.071429</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.039286</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>7</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>29774</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>8</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.500000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>0.333333</td>\n",
- " <td>0.700000</td>\n",
- " <td>22787</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>9</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.058824</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.036765</td>\n",
- " <td>5</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>10</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>29774</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>11</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.500000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>0.333333</td>\n",
- " <td>0.700000</td>\n",
- " <td>22787</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "<p>12 rows × 984 columns</p>\n",
- "</div>"
- ],
- "text/plain": [
- " attributeMatch(authors.first) \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).averageWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).completeness \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).fieldCompleteness \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).importance \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).matches \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).maxWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).normalizedWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).normalizedWeightedWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).queryCompleteness ... \\\n",
- "0 0.0 ... \n",
- "1 0.0 ... \n",
- "2 0.0 ... \n",
- "3 0.0 ... \n",
- "4 0.0 ... \n",
- "5 0.0 ... \n",
- "6 0.0 ... \n",
- "7 0.0 ... \n",
- "8 0.0 ... \n",
- "9 0.0 ... \n",
- "10 0.0 ... \n",
- "11 0.0 ... \n",
- "\n",
- " textSimilarity(results).queryCoverage textSimilarity(results).score \\\n",
- "0 0.0 0.0 \n",
- "1 0.0 0.0 \n",
- "2 0.0 0.0 \n",
- "3 0.0 0.0 \n",
- "4 0.0 0.0 \n",
- "5 0.0 0.0 \n",
- "6 0.0 0.0 \n",
- "7 0.0 0.0 \n",
- "8 0.0 0.0 \n",
- "9 0.0 0.0 \n",
- "10 0.0 0.0 \n",
- "11 0.0 0.0 \n",
- "\n",
- " textSimilarity(title).fieldCoverage textSimilarity(title).order \\\n",
- "0 0.000000 0.0 \n",
- "1 1.000000 1.0 \n",
- "2 0.187500 0.5 \n",
- "3 0.000000 0.0 \n",
- "4 1.000000 1.0 \n",
- "5 0.187500 0.5 \n",
- "6 0.071429 0.0 \n",
- "7 1.000000 1.0 \n",
- "8 0.500000 1.0 \n",
- "9 0.058824 0.0 \n",
- "10 1.000000 1.0 \n",
- "11 0.500000 1.0 \n",
- "\n",
- " textSimilarity(title).proximity textSimilarity(title).queryCoverage \\\n",
- "0 0.000000 0.000000 \n",
- "1 1.000000 1.000000 \n",
- "2 0.617188 0.428571 \n",
- "3 0.000000 0.000000 \n",
- "4 1.000000 1.000000 \n",
- "5 0.617188 0.428571 \n",
- "6 0.000000 0.083333 \n",
- "7 1.000000 1.000000 \n",
- "8 1.000000 0.333333 \n",
- "9 0.000000 0.083333 \n",
- "10 1.000000 1.000000 \n",
- "11 1.000000 0.333333 \n",
- "\n",
- " textSimilarity(title).score document_id query_id relevant \n",
- "0 0.000000 0 0 1 \n",
- "1 1.000000 56212 0 0 \n",
- "2 0.457087 34026 0 0 \n",
- "3 0.000000 3 0 1 \n",
- "4 1.000000 56212 0 0 \n",
- "5 0.457087 34026 0 0 \n",
- "6 0.039286 1 1 1 \n",
- "7 1.000000 29774 1 0 \n",
- "8 0.700000 22787 1 0 \n",
- "9 0.036765 5 1 1 \n",
- "10 1.000000 29774 1 0 \n",
- "11 0.700000 22787 1 0 \n",
- "\n",
- "[12 rows x 984 columns]"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "from pandas import concat, DataFrame\n",
- "\n",
- "\n",
- "training_data = []\n",
- "for query_data in labelled_data:\n",
- " for doc_data in query_data[\"relevant_docs\"]:\n",
- " training_data_point = app.collect_training_data_point(\n",
- " query = query_data[\"query\"],\n",
- " query_id = query_data[\"query_id\"],\n",
- " relevant_id = doc_data[\"id\"],\n",
- " id_field = \"id\",\n",
- " query_model = query_model,\n",
- " number_additional_docs = 2\n",
- " )\n",
- " training_data.extend(training_data_point)\n",
- "training_data = DataFrame.from_records(training_data)\n",
- "training_data"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/python/vespa/notebooks/connect-to-vespa-instance.ipynb b/python/vespa/notebooks/connect-to-vespa-instance.ipynb
deleted file mode 100644
index 6dfecd0c099..00000000000
--- a/python/vespa/notebooks/connect-to-vespa-instance.ipynb
+++ /dev/null
@@ -1,977 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Vespa library for data analysis\n",
- "\n",
- "> Provide data analysis support for Vespa applications \n",
- "\n",
- "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/vespa/blob/tgm/pyvespa-tutorial/python/vespa/notebooks/connect-to-vespa-instance.ipynb)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This tutorial will show you how to connect to a pre-existing Vespa instance. We will use the https://cord19.vespa.ai/ app as an example. You can run this tutorial yourself in Google Colab by clicking on the badge located at the top of the tutorial."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Install"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The library is available at PyPI and therefore can be installed with `pip`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "!pip install pyvespa"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Connect to a Vespa app\n",
- "\n",
- "> Connect to a running Vespa application"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can connect to a running Vespa instance by creating an instance of `Vespa` with the appropriate URL. The resulting `app` will then be used to communicate with the application."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.application import Vespa\n",
- "\n",
- "app = Vespa(url = \"https://api.cord19.vespa.ai\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Define a Query model\n",
- "\n",
- "> Easily define matching and ranking criteria"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "When building a search application, we usually want to experiment with different query models. A `Query` model consists of a match phase and a ranking phase. The matching phase defines how to match documents based on the query sent, while the ranking phase defines how to rank the matched documents. Both phases can get quite complex, and being able to easily express and experiment with them is very valuable."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In the example below we define the match phase to be the `Union` of the `WeakAnd` and the `ANN` operators. The `WeakAnd` will match documents based on query terms while the Approximate Nearest Neighbor (`ANN`) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term and semantic matching in Vespa. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.query import Union, WeakAnd, ANN\n",
- "from random import random\n",
- "\n",
- "match_phase = Union(\n",
- " WeakAnd(hits = 10), \n",
- " ANN(\n",
- " doc_vector=\"title_embedding\", \n",
- " query_vector=\"title_vector\", \n",
- " embedding_model=lambda x: [random() for x in range(768)],\n",
- " hits = 10,\n",
- " label=\"title\"\n",
- " )\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We then define the ranking to be done by the `bm25` rank-profile that is already defined in the application package. We set `list_features=True` to be able to collect ranking features later in this tutorial. After defining the `match_phase` and the `rank_profile` we can instantiate the `Query` model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.query import Query, RankProfile\n",
- "\n",
- "rank_profile = RankProfile(name=\"bm25\", list_features=True)\n",
- "\n",
- "query_model = Query(match_phase=match_phase, rank_profile=rank_profile)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Query the vespa app\n",
- "\n",
- "> Send queries via the query API. See the [query page](/vespa/query) for more examples."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can use the `query_model` that we just defined to issue queries to the application via the `query` method."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "query_result = app.query(\n",
- " query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
- " query_model=query_model\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can see the number of documents that were retrieved by Vespa:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "965"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "query_result.number_documents_retrieved"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And the number of documents that were returned to us:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "10"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len(query_result.hits)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Labelled data\n",
- "\n",
- "> How to structure labelled data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We often need to either evaluate query models or to collect data to improve query models through ML. In both cases we usually need labelled data. Let's create some labelled data to illustrate its expected format and its usage in the library."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Each data point contains a `query_id`, a `query` and `relevant_docs` associated with the query."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "labelled_data = [\n",
- " {\n",
- " \"query_id\": 0, \n",
- " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
- " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
- " },\n",
- " {\n",
- " \"query_id\": 1, \n",
- " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
- " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
- " }\n",
- "]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Non-relevant documents are assigned `\"score\": 0` by default. Relevant documents will be assigned `\"score\": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified on the appropriate methods."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Collect training data\n",
- "\n",
- "> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can collect training data with the `collect_training_data` method according to a specific `query_model`. Below we will collect two documents for each query in addition to the relevant ones."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>attributeMatch(authors.first)</th>\n",
- " <th>attributeMatch(authors.first).averageWeight</th>\n",
- " <th>attributeMatch(authors.first).completeness</th>\n",
- " <th>attributeMatch(authors.first).fieldCompleteness</th>\n",
- " <th>attributeMatch(authors.first).importance</th>\n",
- " <th>attributeMatch(authors.first).matches</th>\n",
- " <th>attributeMatch(authors.first).maxWeight</th>\n",
- " <th>attributeMatch(authors.first).normalizedWeight</th>\n",
- " <th>attributeMatch(authors.first).normalizedWeightedWeight</th>\n",
- " <th>attributeMatch(authors.first).queryCompleteness</th>\n",
- " <th>...</th>\n",
- " <th>textSimilarity(results).queryCoverage</th>\n",
- " <th>textSimilarity(results).score</th>\n",
- " <th>textSimilarity(title).fieldCoverage</th>\n",
- " <th>textSimilarity(title).order</th>\n",
- " <th>textSimilarity(title).proximity</th>\n",
- " <th>textSimilarity(title).queryCoverage</th>\n",
- " <th>textSimilarity(title).score</th>\n",
- " <th>document_id</th>\n",
- " <th>query_id</th>\n",
- " <th>relevant</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.062500</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.142857</td>\n",
- " <td>0.055357</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>97200</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.266667</td>\n",
- " <td>1.0</td>\n",
- " <td>0.869792</td>\n",
- " <td>0.571429</td>\n",
- " <td>0.679189</td>\n",
- " <td>69447</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.142857</td>\n",
- " <td>0.0</td>\n",
- " <td>0.437500</td>\n",
- " <td>0.142857</td>\n",
- " <td>0.224554</td>\n",
- " <td>3</td>\n",
- " <td>0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>97200</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>5</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.266667</td>\n",
- " <td>1.0</td>\n",
- " <td>0.869792</td>\n",
- " <td>0.571429</td>\n",
- " <td>0.679189</td>\n",
- " <td>69447</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>6</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.111111</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.047222</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>7</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>116256</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>8</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.187500</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>0.250000</td>\n",
- " <td>0.612500</td>\n",
- " <td>14888</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>9</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.0</td>\n",
- " <td>0.000000</td>\n",
- " <td>0.083333</td>\n",
- " <td>0.041667</td>\n",
- " <td>5</td>\n",
- " <td>1</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>10</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>1.000000</td>\n",
- " <td>116256</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>11</th>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>...</td>\n",
- " <td>0.0</td>\n",
- " <td>0.0</td>\n",
- " <td>0.187500</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " <td>0.250000</td>\n",
- " <td>0.612500</td>\n",
- " <td>14888</td>\n",
- " <td>1</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "<p>12 rows × 984 columns</p>\n",
- "</div>"
- ],
- "text/plain": [
- " attributeMatch(authors.first) \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).averageWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).completeness \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).fieldCompleteness \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).importance \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).matches \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).maxWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).normalizedWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).normalizedWeightedWeight \\\n",
- "0 0.0 \n",
- "1 0.0 \n",
- "2 0.0 \n",
- "3 0.0 \n",
- "4 0.0 \n",
- "5 0.0 \n",
- "6 0.0 \n",
- "7 0.0 \n",
- "8 0.0 \n",
- "9 0.0 \n",
- "10 0.0 \n",
- "11 0.0 \n",
- "\n",
- " attributeMatch(authors.first).queryCompleteness ... \\\n",
- "0 0.0 ... \n",
- "1 0.0 ... \n",
- "2 0.0 ... \n",
- "3 0.0 ... \n",
- "4 0.0 ... \n",
- "5 0.0 ... \n",
- "6 0.0 ... \n",
- "7 0.0 ... \n",
- "8 0.0 ... \n",
- "9 0.0 ... \n",
- "10 0.0 ... \n",
- "11 0.0 ... \n",
- "\n",
- " textSimilarity(results).queryCoverage textSimilarity(results).score \\\n",
- "0 0.0 0.0 \n",
- "1 0.0 0.0 \n",
- "2 0.0 0.0 \n",
- "3 0.0 0.0 \n",
- "4 0.0 0.0 \n",
- "5 0.0 0.0 \n",
- "6 0.0 0.0 \n",
- "7 0.0 0.0 \n",
- "8 0.0 0.0 \n",
- "9 0.0 0.0 \n",
- "10 0.0 0.0 \n",
- "11 0.0 0.0 \n",
- "\n",
- " textSimilarity(title).fieldCoverage textSimilarity(title).order \\\n",
- "0 0.062500 0.0 \n",
- "1 1.000000 1.0 \n",
- "2 0.266667 1.0 \n",
- "3 0.142857 0.0 \n",
- "4 1.000000 1.0 \n",
- "5 0.266667 1.0 \n",
- "6 0.111111 0.0 \n",
- "7 1.000000 1.0 \n",
- "8 0.187500 1.0 \n",
- "9 0.083333 0.0 \n",
- "10 1.000000 1.0 \n",
- "11 0.187500 1.0 \n",
- "\n",
- " textSimilarity(title).proximity textSimilarity(title).queryCoverage \\\n",
- "0 0.000000 0.142857 \n",
- "1 1.000000 1.000000 \n",
- "2 0.869792 0.571429 \n",
- "3 0.437500 0.142857 \n",
- "4 1.000000 1.000000 \n",
- "5 0.869792 0.571429 \n",
- "6 0.000000 0.083333 \n",
- "7 1.000000 1.000000 \n",
- "8 1.000000 0.250000 \n",
- "9 0.000000 0.083333 \n",
- "10 1.000000 1.000000 \n",
- "11 1.000000 0.250000 \n",
- "\n",
- " textSimilarity(title).score document_id query_id relevant \n",
- "0 0.055357 0 0 1 \n",
- "1 1.000000 97200 0 0 \n",
- "2 0.679189 69447 0 0 \n",
- "3 0.224554 3 0 1 \n",
- "4 1.000000 97200 0 0 \n",
- "5 0.679189 69447 0 0 \n",
- "6 0.047222 1 1 1 \n",
- "7 1.000000 116256 1 0 \n",
- "8 0.612500 14888 1 0 \n",
- "9 0.041667 5 1 1 \n",
- "10 1.000000 116256 1 0 \n",
- "11 0.612500 14888 1 0 \n",
- "\n",
- "[12 rows x 984 columns]"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "training_data_batch = app.collect_training_data(\n",
- " labelled_data = labelled_data,\n",
- " id_field = \"id\",\n",
- " query_model = query_model,\n",
- " number_additional_docs = 2\n",
- ")\n",
- "training_data_batch"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Evaluating a query model\n",
- "\n",
- "> Define metrics and evaluate query models. See the [evaluation page](/vespa/evaluation) for more examples."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will define the following evaluation metrics:\n",
- "* % of documents retrieved per query\n",
- "* recall @ 10 per query\n",
- "* MRR @ 10 per query"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
- "\n",
- "eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Evaluate:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>query_id</th>\n",
- " <th>match_ratio_retrieved_docs</th>\n",
- " <th>match_ratio_docs_available</th>\n",
- " <th>match_ratio_value</th>\n",
- " <th>recall_10_value</th>\n",
- " <th>reciprocal_rank_10_value</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>1033</td>\n",
- " <td>127518</td>\n",
- " <td>0.008101</td>\n",
- " <td>0.0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1</td>\n",
- " <td>928</td>\n",
- " <td>127518</td>\n",
- " <td>0.007277</td>\n",
- " <td>0.0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " query_id match_ratio_retrieved_docs match_ratio_docs_available \\\n",
- "0 0 1033 127518 \n",
- "1 1 928 127518 \n",
- "\n",
- " match_ratio_value recall_10_value reciprocal_rank_10_value \n",
- "0 0.008101 0.0 0 \n",
- "1 0.007277 0.0 0 "
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "evaluation = app.evaluate(\n",
- " labelled_data = labelled_data,\n",
- " eval_metrics = eval_metrics, \n",
- " query_model = query_model, \n",
- " id_field = \"id\",\n",
- ")\n",
- "evaluation"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/python/vespa/notebooks/create-and-deploy-vespa.ipynb b/python/vespa/notebooks/create-and-deploy-vespa.ipynb
deleted file mode 100644
index 86d5fa08fc5..00000000000
--- a/python/vespa/notebooks/create-and-deploy-vespa.ipynb
+++ /dev/null
@@ -1,1064 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Build end-to-end Vespa apps with pyvespa\n",
- "\n",
- "> Python API to create, modify, deploy and interact with Vespa applications"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This tutorial will create a text search application from scratch based on the MS MARCO dataset, similar to our [text search tutorials](https://docs.vespa.ai/documentation/tutorials/text-search.html). We will first show how to define the app by creating an application package [REF]. Then we locally deploy the app in a Docker container. Once the app is up and running we show how to feed data to it. After the data is sent, we can make queries and inspect the results. We then show how to add a new rank profile to the application package and to redeploy the app with the latest changes. We proceed to show how to evaluate and compare two rank profiles with evaluation metrics such as Recall and Reciprocal Rank."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Application package API"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We first create a `Document` instance containing the `Field`s that we want to store in the app. In this case we will keep the application simple and only feed a unique `id`, `title` and `body` of the MS MARCO documents."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.package import Document, Field\n",
- "\n",
- "document = Document(\n",
- " fields=[\n",
- " Field(name = \"id\", type = \"string\", indexing = [\"attribute\", \"summary\"]),\n",
- " Field(name = \"title\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\"),\n",
- " Field(name = \"body\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\") \n",
- " ]\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The complete `Schema` of our application will be named `msmarco` and contains the `Document` instance that we defined above. The default `FieldSet` indicates that queries will look for matches by searching both the titles and the bodies of the documents. The default `RankProfile` indicates that all matched documents will be ranked by the `nativeRank` expression involving the title and the body of the matched documents."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.package import Schema, FieldSet, RankProfile\n",
- "\n",
- "msmarco_schema = Schema(\n",
- " name = \"msmarco\", \n",
- " document = document, \n",
- " fieldsets = [FieldSet(name = \"default\", fields = [\"title\", \"body\"])],\n",
- " rank_profiles = [RankProfile(name = \"default\", first_phase = \"nativeRank(title, body)\")]\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Once the `Schema` is defined, all we have to do is create our msmarco `ApplicationPackage`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.package import ApplicationPackage\n",
- "\n",
- "app_package = ApplicationPackage(name = \"msmarco\", schema=msmarco_schema)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "At this point, `app_package` contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Deploy it locally"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This tutorial shows how to deploy the application package locally in a Docker container. For the following to work you need to run this from a machine with Docker installed. We first create a `VespaDocker` instance based on the application package."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.package import VespaDocker\n",
- "\n",
- "vespa_docker = VespaDocker(application_package=app_package)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We then call the `deploy` method and specify a `disk_folder` with write access. Behind the scenes, `pyvespa` will write the Vespa config files and store them in the `disk_folder`. It will then run a Vespa engine Docker container and deploy those config files in the container."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "app = vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `app` variable above holds a `Vespa` instance that will be used to connect to and interact with our text search application. We can see the deployment message returned by the Vespa engine:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[\"Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session\",\n",
- " \"Session 18 for tenant 'default' created.\",\n",
- " 'Preparing session 18 using http://localhost:19071/application/v2/tenant/default/session/18/prepared',\n",
- " \"WARNING: Host named 'msmarco' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.\",\n",
- " \"Session 18 for tenant 'default' prepared.\",\n",
- " 'Activating session 18 using http://localhost:19071/application/v2/tenant/default/session/18/active',\n",
- " \"Session 18 for tenant 'default' activated.\",\n",
- " 'Checksum: 09203c16fa5f582b712711bb98932812',\n",
- " 'Timestamp: 1598011224920',\n",
- " 'Generation: 18',\n",
- " '']"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "app.deployment_message"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Feed data to the app "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We now have our text search app up and running. We can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. We can load 996 documents that we want to feed and check the first two documents in this sample."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(996, 3)"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "from pandas import read_csv\n",
- "\n",
- "docs = read_csv(\"https://thigm85.github.io/data/msmarco/docs.tsv\", sep = \"\\t\")\n",
- "docs.shape"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>id</th>\n",
- " <th>title</th>\n",
- " <th>body</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>D2185715</td>\n",
- " <td>What Is an Appropriate Gift for a Bris</td>\n",
- " <td>Hub Pages Religion and Philosophy Judaism...</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>D2819479</td>\n",
- " <td>lunge</td>\n",
- " <td>1lungenoun ˈlənj Popularity Bottom 40 of...</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " id title \\\n",
- "0 D2185715 What Is an Appropriate Gift for a Bris \n",
- "1 D2819479 lunge \n",
- "\n",
- " body \n",
- "0 Hub Pages Religion and Philosophy Judaism... \n",
- "1 1lungenoun ˈlənj Popularity Bottom 40 of... "
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "docs.head(2)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "To feed the data we need to specify the `schema` that we are sending data to. We named our schema `msmarco` in a previous section. Each data point needs a unique `data_id` associated with it, regardless of whether the schema has an id field. The `fields` argument should be a dict containing all the fields in the schema, which are `id`, `title` and `body` in our case."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "for idx, row in docs.iterrows():\n",
- " response = app.feed_data_point(\n",
- " schema = \"msmarco\", \n",
- " data_id = str(row[\"id\"]), \n",
- " fields = {\n",
- " \"id\": str(row[\"id\"]), \n",
- " \"title\": str(row[\"title\"]), \n",
- " \"body\": str(row[\"body\"])\n",
- " }\n",
- " )"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Each call to the `feed_data_point` method sends a POST request to the appropriate Vespa endpoint, and we can inspect the response if needed, for example the status code and the message returned."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "200"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "response.status_code"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'id': 'id:msmarco:msmarco::D2002872',\n",
- " 'pathId': '/document/v1/msmarco/msmarco/docid/D2002872'}"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "response.json()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Make a simple query"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Once our application is fed we can start using it by sending queries. The MS MARCO app expects to receive questions as queries, and the goal of the application is to return documents that are relevant to those questions."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "In the example below, we will send a question via the `query` parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a `Query` model. The query model below uses the `OR` operator in the match phase, meaning the application will match all documents that contain at least one query term in the title or the body (due to the default `FieldSet` we defined earlier). All matched documents will then be ranked by the default `RankProfile` that we defined earlier."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.query import Query, OR, RankProfile as Ranking\n",
- "\n",
- "results = app.query(\n",
- " query=\"Where is my text?\", \n",
- " query_model = Query(\n",
- " match_phase=OR(), \n",
- " rank_profile=Ranking(name=\"default\")\n",
- " ),\n",
- " hits = 2\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In addition to the `query` and `query_model` parameters, we can specify a multitude of relevant Vespa parameters such as the number of `hits` that we want Vespa to return. We chose `hits=2` for simplicity in this tutorial."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "2"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "len(results.hits)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Change the application package and redeploy"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "We can also modify our application by changing the application package and redeploying it. Let's add a new BM25-based rank profile to our `Schema`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "app_package.schema.add_rank_profile(\n",
- " RankProfile(name = \"bm25\", inherits = \"default\", first_phase = \"bm25(title) + bm25(body)\")\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "After that we can redeploy our application, just as we did earlier:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Vespa(http://localhost, 8080)"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can then use the newly created `bm25` rank profile to make queries:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "2"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "results = app.query(\n",
- " query=\"Where is my text?\", \n",
- " query_model = Query(\n",
- " match_phase=OR(), \n",
- " rank_profile=Ranking(name=\"bm25\")\n",
- " ),\n",
- " hits = 2\n",
- ")\n",
- "len(results.hits)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Compare query models"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "When we are building a search application, we often want to experiment with and compare different query models. In this section we show how easy it is to do that with Vespa."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Let's load some labelled data where each data point contains a `query_id`, a `query` and a list of `relevant_docs` associated with the query. In this case, we have only one relevant document for each query."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
 - "import json\n",
 - "import requests\n",
- "\n",
- "labelled_data = json.loads(\n",
- " requests.get(\"https://thigm85.github.io/data/msmarco/query-labels.json\").text\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Below are two examples of the labelled data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[{'query_id': '1',\n",
- " 'query': 'what county is aspen co',\n",
- " 'relevant_docs': [{'id': 'D1098819'}]},\n",
- " {'query_id': '2',\n",
- " 'query': 'where is aeropostale located',\n",
- " 'relevant_docs': [{'id': 'D2268823'}]}]"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "labelled_data[0:2]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Let's define two `Query` models to be compared. We are going to use the same `OR` operator in the match phase and compare the `default` and `bm25` rank profiles."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "default_ranking = Query(\n",
- " match_phase=OR(), \n",
- " rank_profile=Ranking(name=\"default\")\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "bm25_ranking = Query(\n",
- " match_phase=OR(), \n",
- " rank_profile=Ranking(name=\"bm25\")\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Now we will choose which evaluation metrics to look at. In this case we choose `MatchRatio`, to check how many documents were matched by each query, along with `Recall` at 10 and `ReciprocalRank` at 10."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
- "\n",
- "eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "We can now run the `evaluate` method for each `Query` model. This will send queries to the application and process the results to compute the `eval_metrics` defined above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "default_evaluation = app.evaluate(\n",
- " labelled_data=labelled_data, \n",
- " eval_metrics=eval_metrics, \n",
- " query_model=default_ranking, \n",
- " id_field=\"id\",\n",
- " timeout=5,\n",
- " hits=10\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "bm25_evaluation = app.evaluate(\n",
- " labelled_data=labelled_data, \n",
- " eval_metrics=eval_metrics, \n",
- " query_model=bm25_ranking, \n",
- " id_field=\"id\",\n",
- " timeout=5,\n",
- " hits=10\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "We can then merge the DataFrames returned by the `evaluate` method and start analysing the results."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>query_id</th>\n",
- " <th>match_ratio_retrieved_docs_default</th>\n",
- " <th>match_ratio_docs_available_default</th>\n",
- " <th>match_ratio_value_default</th>\n",
- " <th>recall_10_value_default</th>\n",
- " <th>reciprocal_rank_10_value_default</th>\n",
- " <th>match_ratio_retrieved_docs_bm25</th>\n",
- " <th>match_ratio_docs_available_bm25</th>\n",
- " <th>match_ratio_value_bm25</th>\n",
- " <th>recall_10_value_bm25</th>\n",
- " <th>reciprocal_rank_10_value_bm25</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>1</td>\n",
- " <td>914</td>\n",
- " <td>997</td>\n",
- " <td>0.916750</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000</td>\n",
- " <td>914</td>\n",
- " <td>997</td>\n",
- " <td>0.916750</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>2</td>\n",
- " <td>896</td>\n",
- " <td>997</td>\n",
- " <td>0.898696</td>\n",
- " <td>1.0</td>\n",
- " <td>0.125</td>\n",
- " <td>896</td>\n",
- " <td>997</td>\n",
- " <td>0.898696</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>2</th>\n",
- " <td>3</td>\n",
- " <td>971</td>\n",
- " <td>997</td>\n",
- " <td>0.973922</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000</td>\n",
- " <td>971</td>\n",
- " <td>997</td>\n",
- " <td>0.973922</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>3</th>\n",
- " <td>4</td>\n",
- " <td>982</td>\n",
- " <td>997</td>\n",
- " <td>0.984955</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000</td>\n",
- " <td>982</td>\n",
- " <td>997</td>\n",
- " <td>0.984955</td>\n",
- " <td>1.0</td>\n",
- " <td>1.000000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>4</th>\n",
- " <td>5</td>\n",
- " <td>748</td>\n",
- " <td>997</td>\n",
- " <td>0.750251</td>\n",
- " <td>1.0</td>\n",
- " <td>0.500</td>\n",
- " <td>748</td>\n",
- " <td>997</td>\n",
- " <td>0.750251</td>\n",
- " <td>1.0</td>\n",
- " <td>0.333333</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " query_id match_ratio_retrieved_docs_default \\\n",
- "0 1 914 \n",
- "1 2 896 \n",
- "2 3 971 \n",
- "3 4 982 \n",
- "4 5 748 \n",
- "\n",
- " match_ratio_docs_available_default match_ratio_value_default \\\n",
- "0 997 0.916750 \n",
- "1 997 0.898696 \n",
- "2 997 0.973922 \n",
- "3 997 0.984955 \n",
- "4 997 0.750251 \n",
- "\n",
- " recall_10_value_default reciprocal_rank_10_value_default \\\n",
- "0 1.0 1.000 \n",
- "1 1.0 0.125 \n",
- "2 1.0 1.000 \n",
- "3 1.0 1.000 \n",
- "4 1.0 0.500 \n",
- "\n",
- " match_ratio_retrieved_docs_bm25 match_ratio_docs_available_bm25 \\\n",
- "0 914 997 \n",
- "1 896 997 \n",
- "2 971 997 \n",
- "3 982 997 \n",
- "4 748 997 \n",
- "\n",
- " match_ratio_value_bm25 recall_10_value_bm25 reciprocal_rank_10_value_bm25 \n",
- "0 0.916750 1.0 1.000000 \n",
- "1 0.898696 1.0 1.000000 \n",
- "2 0.973922 1.0 1.000000 \n",
- "3 0.984955 1.0 1.000000 \n",
- "4 0.750251 1.0 0.333333 "
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "from pandas import merge\n",
- "\n",
- "eval_comparison = merge(\n",
- " left=default_evaluation, \n",
- " right=bm25_evaluation, \n",
- " on=\"query_id\", \n",
- " suffixes=('_default', '_bm25')\n",
- ")\n",
- "eval_comparison.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Notice that we expect to observe the same match ratio for both query models since they use the same `OR` operator."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>match_ratio_value_default</th>\n",
- " <th>match_ratio_value_bm25</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>mean</th>\n",
- " <td>0.866861</td>\n",
- " <td>0.866861</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>std</th>\n",
- " <td>0.181418</td>\n",
- " <td>0.181418</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " match_ratio_value_default match_ratio_value_bm25\n",
- "mean 0.866861 0.866861\n",
- "std 0.181418 0.181418"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "eval_comparison[[\"match_ratio_value_default\", \"match_ratio_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The `bm25` rank profile obtained a significantly higher recall than the `default`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>recall_10_value_default</th>\n",
- " <th>recall_10_value_bm25</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>mean</th>\n",
- " <td>0.840000</td>\n",
- " <td>0.960000</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>std</th>\n",
- " <td>0.368453</td>\n",
- " <td>0.196946</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " recall_10_value_default recall_10_value_bm25\n",
- "mean 0.840000 0.960000\n",
- "std 0.368453 0.196946"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "eval_comparison[[\"recall_10_value_default\", \"recall_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Similarly, `bm25` also gets a significantly higher reciprocal rank value compared to the `default` rank profile."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>reciprocal_rank_10_value_default</th>\n",
- " <th>reciprocal_rank_10_value_bm25</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>mean</th>\n",
- " <td>0.724750</td>\n",
- " <td>0.943333</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>std</th>\n",
- " <td>0.399118</td>\n",
- " <td>0.216103</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " reciprocal_rank_10_value_default reciprocal_rank_10_value_bm25\n",
- "mean 0.724750 0.943333\n",
- "std 0.399118 0.216103"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "eval_comparison[[\"reciprocal_rank_10_value_default\", \"reciprocal_rank_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/python/vespa/notebooks/evaluation.ipynb b/python/vespa/notebooks/evaluation.ipynb
deleted file mode 100644
index 9a37effc691..00000000000
--- a/python/vespa/notebooks/evaluation.ipynb
+++ /dev/null
@@ -1,296 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Vespa - Evaluate query models\n",
- "\n",
- "> Define metrics and evaluate query models"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Example setup"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Connect to the application and define a query model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.application import Vespa\n",
- "from vespa.query import Query, RankProfile, OR\n",
- "\n",
- "app = Vespa(url = \"https://api.cord19.vespa.ai\")\n",
- "query_model = Query(\n",
- " match_phase = OR(),\n",
- " rank_profile = RankProfile(name=\"bm25\", list_features=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Define some labelled data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "labelled_data = [\n",
- " {\n",
- " \"query_id\": 0, \n",
- " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
- " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
- " },\n",
- " {\n",
- " \"query_id\": 1, \n",
- " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
- " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
- " }\n",
- "]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Define metrics"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
- "\n",
- "eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Evaluate in batch"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>query_id</th>\n",
- " <th>match_ratio_retrieved_docs</th>\n",
- " <th>match_ratio_docs_available</th>\n",
- " <th>match_ratio_value</th>\n",
- " <th>recall_10_value</th>\n",
- " <th>reciprocal_rank_10_value</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>52526</td>\n",
- " <td>58692</td>\n",
- " <td>0.894943</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1</td>\n",
- " <td>54048</td>\n",
- " <td>58692</td>\n",
- " <td>0.920875</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " query_id match_ratio_retrieved_docs match_ratio_docs_available \\\n",
- "0 0 52526 58692 \n",
- "1 1 54048 58692 \n",
- "\n",
- " match_ratio_value recall_10_value reciprocal_rank_10_value \n",
- "0 0.894943 0 0 \n",
- "1 0.920875 0 0 "
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "evaluation = app.evaluate(\n",
- " labelled_data = labelled_data,\n",
- " eval_metrics = eval_metrics, \n",
- " query_model = query_model, \n",
- " id_field = \"id\",\n",
- ")\n",
- "evaluation"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Evaluate specific query\n",
- "\n",
- "> You can have finer control with the `evaluate_query` method."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<style scoped>\n",
- " .dataframe tbody tr th:only-of-type {\n",
- " vertical-align: middle;\n",
- " }\n",
- "\n",
- " .dataframe tbody tr th {\n",
- " vertical-align: top;\n",
- " }\n",
- "\n",
- " .dataframe thead th {\n",
- " text-align: right;\n",
- " }\n",
- "</style>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>query_id</th>\n",
- " <th>match_ratio_retrieved_docs</th>\n",
- " <th>match_ratio_docs_available</th>\n",
- " <th>match_ratio_value</th>\n",
- " <th>recall_10_value</th>\n",
- " <th>reciprocal_rank_10_value</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>0</td>\n",
- " <td>52526</td>\n",
- " <td>58692</td>\n",
- " <td>0.894943</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>1</td>\n",
- " <td>54048</td>\n",
- " <td>58692</td>\n",
- " <td>0.920875</td>\n",
- " <td>0</td>\n",
- " <td>0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " query_id match_ratio_retrieved_docs match_ratio_docs_available \\\n",
- "0 0 52526 58692 \n",
- "1 1 54048 58692 \n",
- "\n",
- " match_ratio_value recall_10_value reciprocal_rank_10_value \n",
- "0 0.894943 0 0 \n",
- "1 0.920875 0 0 "
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "from pandas import concat, DataFrame\n",
- "\n",
- "evaluation = []\n",
- "for query_data in labelled_data:\n",
- " query_evaluation = app.evaluate_query(\n",
- " eval_metrics = eval_metrics, \n",
- " query_model = query_model, \n",
- " query_id = query_data[\"query_id\"], \n",
- " query = query_data[\"query\"], \n",
- " id_field = \"id\",\n",
- " relevant_docs = query_data[\"relevant_docs\"],\n",
- " default_score = 0\n",
- " )\n",
- " evaluation.append(query_evaluation)\n",
- "evaluation = DataFrame.from_records(evaluation)\n",
- "evaluation"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/python/vespa/notebooks/index.ipynb b/python/vespa/notebooks/index.ipynb
deleted file mode 100644
index b9864688ff4..00000000000
--- a/python/vespa/notebooks/index.ipynb
+++ /dev/null
@@ -1,255 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Vespa library for data analysis\n",
- "\n",
- "> Provide data analysis support for Vespa applications"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Install"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "`pip install pyvespa`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Connect to a Vespa app\n",
- "\n",
- "> Connect to a running Vespa application"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.application import Vespa\n",
- "\n",
- "app = Vespa(url = \"https://api.cord19.vespa.ai\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Define a Query model\n",
- "\n",
- "> Easily define matching and ranking criteria"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.query import Query, Union, WeakAnd, ANN, RankProfile\n",
- "from random import random\n",
- "\n",
- "match_phase = Union(\n",
- " WeakAnd(hits = 10), \n",
- " ANN(\n",
- " doc_vector=\"title_embedding\", \n",
- " query_vector=\"title_vector\", \n",
- " embedding_model=lambda x: [random() for x in range(768)],\n",
- " hits = 10,\n",
- " label=\"title\"\n",
- " )\n",
- ")\n",
- "\n",
- "rank_profile = RankProfile(name=\"bm25\", list_features=True)\n",
- "\n",
- "query_model = Query(match_phase=match_phase, rank_profile=rank_profile)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Query the vespa app\n",
- "\n",
- "> Send queries via the query API. See the [query page](/vespa/query) for more examples."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "query_result = app.query(\n",
- " query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
- " query_model=query_model\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "query_result.number_documents_retrieved"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Labelled data\n",
- "\n",
- "> How to structure labelled data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "labelled_data = [\n",
- " {\n",
- " \"query_id\": 0, \n",
- " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
- " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
- " },\n",
- " {\n",
- " \"query_id\": 1, \n",
- " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
- " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
- " }\n",
- "]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "Non-relevant documents are assigned `\"score\": 0` by default. Relevant documents will be assigned `\"score\": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified in the appropriate methods."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Collect training data\n",
- "\n",
- "> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "training_data_batch = app.collect_training_data(\n",
- " labelled_data = labelled_data,\n",
- " id_field = \"id\",\n",
- " query_model = query_model,\n",
- " number_additional_docs = 2\n",
- ")\n",
- "training_data_batch"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Evaluating a query model\n",
- "\n",
- "> Define metrics and evaluate query models. See the [evaluation page](/vespa/evaluation) for more examples."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will define the following evaluation metrics:\n",
- "* % of documents retrieved per query\n",
- "* recall @ 10 per query\n",
- "* MRR @ 10 per query"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
- "\n",
- "eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Evaluate:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "evaluation = app.evaluate(\n",
- " labelled_data = labelled_data,\n",
- " eval_metrics = eval_metrics, \n",
- " query_model = query_model, \n",
- " id_field = \"id\",\n",
- ")\n",
- "evaluation"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/python/vespa/notebooks/query.ipynb b/python/vespa/notebooks/query.ipynb
deleted file mode 100644
index 82bc1b8ac29..00000000000
--- a/python/vespa/notebooks/query.ipynb
+++ /dev/null
@@ -1,320 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "%load_ext autoreload\n",
- "%autoreload 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Query API\n",
- "\n",
- "> Python query API"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
 - "We can connect to the CORD-19 Search app and use it to demonstrate the query API."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.application import Vespa\n",
- "\n",
- "app = Vespa(url = \"https://api.cord19.vespa.ai\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Specify the request body\n",
- "\n",
- "> Full flexibility by specifying the entire request body"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "body = {\n",
- " 'yql': 'select title, abstract from sources * where userQuery();',\n",
- " 'hits': 5,\n",
- " 'query': 'Is remdesivir an effective treatment for COVID-19?',\n",
- " 'type': 'any',\n",
- " 'ranking': 'bm25'\n",
- "}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "results = app.query(body=body)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "108882"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "results.number_documents_retrieved"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Specify a query model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Query + term-matching + rank profile"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.query import Query, OR, RankProfile\n",
- "\n",
- "results = app.query(\n",
- " query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
- " query_model = Query(\n",
- " match_phase=OR(), \n",
- " rank_profile=RankProfile(name=\"bm25\")\n",
- " )\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "108882"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "results.number_documents_retrieved"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Query + term-matching + ANN operator + rank profile"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from vespa.query import Query, ANN, WeakAnd, Union, RankProfile\n",
- "from random import random\n",
- "\n",
- "match_phase = Union(\n",
- " WeakAnd(hits = 10), \n",
- " ANN(\n",
- " doc_vector=\"title_embedding\", \n",
- " query_vector=\"title_vector\", \n",
- " embedding_model=lambda x: [random() for x in range(768)],\n",
- " hits = 10,\n",
- " label=\"title\"\n",
- " )\n",
- ")\n",
- "rank_profile = RankProfile(name=\"bm25\", list_features=True)\n",
- "query_model = Query(match_phase=match_phase, rank_profile=rank_profile)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "results = app.query(query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
- " query_model=query_model)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "947"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "results.number_documents_retrieved"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Recall specific documents"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's take a look at the top 3 ids from the last query."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[117166, 60125, 28903]"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "top_ids = [hit[\"fields\"][\"id\"] for hit in results.hits[0:3]]\n",
- "top_ids"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Assume that we now want to retrieve the second and third ids above. We can do so with the `recall` argument."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "results_with_recall = app.query(query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
- " query_model=query_model,\n",
- " recall = (\"id\", top_ids[1:3]))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This will retrieve only the documents whose Vespa field `id` is contained in the list inside the tuple."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[60125, 28903]"
- ]
- },
- "execution_count": null,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "id_recalled = [hit[\"fields\"][\"id\"] for hit in results_with_recall.hits]\n",
- "id_recalled"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# hide\n",
- "from fastcore.test import all_equal, test\n",
- "\n",
- "test(id_recalled, top_ids[1:3], all_equal)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "vespa",
- "language": "python",
- "name": "vespa"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}