author Kristian Aune <kkraune@users.noreply.github.com> 2020-08-24 13:08:17 +0200
committer GitHub <noreply@github.com> 2020-08-24 13:08:17 +0200
commit 9e710c2d1e1bd73e00e2ec1c2a6ab229c02e14d6 (patch)
tree 9469b9695c6976b6aae6b5ef4bd6c0a78ec8331b
parent 799f23ff147c17a79cd9164c99925bafca9486c8 (diff)
parent 46a3d61a10ff6b7b792e735e1efec8fa179a0aae (diff)
Merge pull request #14144 from vespa-engine/tgm/pyvespa-tutorial
pyvespa tutorials
-rw-r--r-- python/vespa/notebooks/connect-to-vespa-instance.ipynb 977
-rw-r--r-- python/vespa/notebooks/create-and-deploy-vespa.ipynb 1064
2 files changed, 2041 insertions, 0 deletions
diff --git a/python/vespa/notebooks/connect-to-vespa-instance.ipynb b/python/vespa/notebooks/connect-to-vespa-instance.ipynb
new file mode 100644
index 00000000000..6dfecd0c099
--- /dev/null
+++ b/python/vespa/notebooks/connect-to-vespa-instance.ipynb
@@ -0,0 +1,977 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Vespa library for data analysis\n",
+ "\n",
+ "> Provide data analysis support for Vespa applications \n",
+ "\n",
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/vespa/blob/tgm/pyvespa-tutorial/python/vespa/notebooks/connect-to-vespa-instance.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial will show you how to connect to a pre-existing Vespa instance. We will use the https://cord19.vespa.ai/ app as an example. You can run this tutorial yourself in Google Colab by clicking on the badge located at the top of the tutorial."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Install"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The library is available on PyPI and can therefore be installed with `pip`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install pyvespa"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Connect to a Vespa app\n",
+ "\n",
+ "> Connect to a running Vespa application"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can connect to a running Vespa instance by creating an instance of `Vespa` with the appropriate URL. The resulting `app` will then be used to communicate with the application."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.application import Vespa\n",
+ "\n",
+ "app = Vespa(url = \"https://api.cord19.vespa.ai\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define a Query model\n",
+ "\n",
+ "> Easily define matching and ranking criteria"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When building a search application, we usually want to experiment with different query models. A `Query` model consists of a match phase and a ranking phase. The matching phase defines how to match documents based on the query sent, and the ranking phase defines how to rank the matched documents. Both phases can get quite complex, and being able to easily express and experiment with them is very valuable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the example below we define the match phase to be the `Union` of the `WeakAnd` and the `ANN` operators. The `WeakAnd` will match documents based on query terms while the Approximate Nearest Neighbor (`ANN`) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term and semantic matching in Vespa. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Union, WeakAnd, ANN\n",
+ "from random import random\n",
+ "\n",
+ "match_phase = Union(\n",
+ " WeakAnd(hits = 10), \n",
+ " ANN(\n",
+ " doc_vector=\"title_embedding\", \n",
+ " query_vector=\"title_vector\", \n",
+ " embedding_model=lambda x: [random() for x in range(768)],\n",
+ " hits = 10,\n",
+ " label=\"title\"\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We then define the ranking to be done by the `bm25` rank-profile that is already defined in the application package. We set `list_features=True` to be able to collect ranking features later in this tutorial. After defining the `match_phase` and the `rank_profile` we can instantiate the `Query` model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Query, RankProfile\n",
+ "\n",
+ "rank_profile = RankProfile(name=\"bm25\", list_features=True)\n",
+ "\n",
+ "query_model = Query(match_phase=match_phase, rank_profile=rank_profile)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Query the vespa app\n",
+ "\n",
+ "> Send queries via the query API. See the [query page](/vespa/query) for more examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can use the `query_model` that we just defined to issue queries to the application via the `query` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_result = app.query(\n",
+ " query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
+ " query_model=query_model\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see the number of documents that were retrieved by Vespa:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "965"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "query_result.number_documents_retrieved"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And the number of documents that were returned to us:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "10"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(query_result.hits)"
+ ]
+ },
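+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each entry in `query_result.hits` is the raw hit returned by Vespa. As a quick sanity check, the sketch below (assuming each hit follows Vespa's default JSON result format, with `id` and `relevance` keys) prints the id and relevance score of the top hit."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A minimal sketch, assuming each hit is a dict following Vespa's\n",
+ "# default JSON result format with \"id\" and \"relevance\" keys.\n",
+ "top_hit = query_result.hits[0]\n",
+ "print(top_hit[\"id\"], top_hit[\"relevance\"])"
+ ]
+ },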
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Labelled data\n",
+ "\n",
+ "> How to structure labelled data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We often need to either evaluate query models or collect data to improve them through ML. In both cases we usually need labelled data. Let's create some labelled data to illustrate its expected format and usage in the library."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each data point contains a `query_id`, a `query` and `relevant_docs` associated with the query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "labelled_data = [\n",
+ " {\n",
+ " \"query_id\": 0, \n",
+ " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
+ " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
+ " },\n",
+ " {\n",
+ " \"query_id\": 1, \n",
+ " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
+ " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
+ " }\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Non-relevant documents are assigned `\"score\": 0` by default. Relevant documents will be assigned `\"score\": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified in the appropriate methods."
+ ]
+ },
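+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an illustration of these defaults, the snippet below rewrites the first data point without explicit `score` fields; the library will treat the listed documents as relevant with score 1."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Relevant docs listed without an explicit \"score\": the library treats\n",
+ "# them as \"score\": 1 by default, as described above.\n",
+ "labelled_data_default_scores = [\n",
+ "    {\n",
+ "        \"query_id\": 0,\n",
+ "        \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
+ "        \"relevant_docs\": [{\"id\": 0}, {\"id\": 3}]\n",
+ "    }\n",
+ "]"
+ ]
+ },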
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Collect training data\n",
+ "\n",
+ "> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can collect training data with the `collect_training_data` method according to a specific `query_model`. Below we will collect two documents for each query in addition to the relevant ones."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>attributeMatch(authors.first)</th>\n",
+ " <th>attributeMatch(authors.first).averageWeight</th>\n",
+ " <th>attributeMatch(authors.first).completeness</th>\n",
+ " <th>attributeMatch(authors.first).fieldCompleteness</th>\n",
+ " <th>attributeMatch(authors.first).importance</th>\n",
+ " <th>attributeMatch(authors.first).matches</th>\n",
+ " <th>attributeMatch(authors.first).maxWeight</th>\n",
+ " <th>attributeMatch(authors.first).normalizedWeight</th>\n",
+ " <th>attributeMatch(authors.first).normalizedWeightedWeight</th>\n",
+ " <th>attributeMatch(authors.first).queryCompleteness</th>\n",
+ " <th>...</th>\n",
+ " <th>textSimilarity(results).queryCoverage</th>\n",
+ " <th>textSimilarity(results).score</th>\n",
+ " <th>textSimilarity(title).fieldCoverage</th>\n",
+ " <th>textSimilarity(title).order</th>\n",
+ " <th>textSimilarity(title).proximity</th>\n",
+ " <th>textSimilarity(title).queryCoverage</th>\n",
+ " <th>textSimilarity(title).score</th>\n",
+ " <th>document_id</th>\n",
+ " <th>query_id</th>\n",
+ " <th>relevant</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.062500</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.142857</td>\n",
+ " <td>0.055357</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>97200</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.266667</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.869792</td>\n",
+ " <td>0.571429</td>\n",
+ " <td>0.679189</td>\n",
+ " <td>69447</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.142857</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.437500</td>\n",
+ " <td>0.142857</td>\n",
+ " <td>0.224554</td>\n",
+ " <td>3</td>\n",
+ " <td>0</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>97200</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.266667</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.869792</td>\n",
+ " <td>0.571429</td>\n",
+ " <td>0.679189</td>\n",
+ " <td>69447</td>\n",
+ " <td>0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.111111</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.047222</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>116256</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.187500</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>0.250000</td>\n",
+ " <td>0.612500</td>\n",
+ " <td>14888</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.000000</td>\n",
+ " <td>0.083333</td>\n",
+ " <td>0.041667</td>\n",
+ " <td>5</td>\n",
+ " <td>1</td>\n",
+ " <td>1</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>10</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>116256</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>11</th>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>...</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0.187500</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " <td>0.250000</td>\n",
+ " <td>0.612500</td>\n",
+ " <td>14888</td>\n",
+ " <td>1</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>12 rows × 984 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " attributeMatch(authors.first) \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).averageWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).completeness \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).fieldCompleteness \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).importance \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).matches \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).maxWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).normalizedWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).normalizedWeightedWeight \\\n",
+ "0 0.0 \n",
+ "1 0.0 \n",
+ "2 0.0 \n",
+ "3 0.0 \n",
+ "4 0.0 \n",
+ "5 0.0 \n",
+ "6 0.0 \n",
+ "7 0.0 \n",
+ "8 0.0 \n",
+ "9 0.0 \n",
+ "10 0.0 \n",
+ "11 0.0 \n",
+ "\n",
+ " attributeMatch(authors.first).queryCompleteness ... \\\n",
+ "0 0.0 ... \n",
+ "1 0.0 ... \n",
+ "2 0.0 ... \n",
+ "3 0.0 ... \n",
+ "4 0.0 ... \n",
+ "5 0.0 ... \n",
+ "6 0.0 ... \n",
+ "7 0.0 ... \n",
+ "8 0.0 ... \n",
+ "9 0.0 ... \n",
+ "10 0.0 ... \n",
+ "11 0.0 ... \n",
+ "\n",
+ " textSimilarity(results).queryCoverage textSimilarity(results).score \\\n",
+ "0 0.0 0.0 \n",
+ "1 0.0 0.0 \n",
+ "2 0.0 0.0 \n",
+ "3 0.0 0.0 \n",
+ "4 0.0 0.0 \n",
+ "5 0.0 0.0 \n",
+ "6 0.0 0.0 \n",
+ "7 0.0 0.0 \n",
+ "8 0.0 0.0 \n",
+ "9 0.0 0.0 \n",
+ "10 0.0 0.0 \n",
+ "11 0.0 0.0 \n",
+ "\n",
+ " textSimilarity(title).fieldCoverage textSimilarity(title).order \\\n",
+ "0 0.062500 0.0 \n",
+ "1 1.000000 1.0 \n",
+ "2 0.266667 1.0 \n",
+ "3 0.142857 0.0 \n",
+ "4 1.000000 1.0 \n",
+ "5 0.266667 1.0 \n",
+ "6 0.111111 0.0 \n",
+ "7 1.000000 1.0 \n",
+ "8 0.187500 1.0 \n",
+ "9 0.083333 0.0 \n",
+ "10 1.000000 1.0 \n",
+ "11 0.187500 1.0 \n",
+ "\n",
+ " textSimilarity(title).proximity textSimilarity(title).queryCoverage \\\n",
+ "0 0.000000 0.142857 \n",
+ "1 1.000000 1.000000 \n",
+ "2 0.869792 0.571429 \n",
+ "3 0.437500 0.142857 \n",
+ "4 1.000000 1.000000 \n",
+ "5 0.869792 0.571429 \n",
+ "6 0.000000 0.083333 \n",
+ "7 1.000000 1.000000 \n",
+ "8 1.000000 0.250000 \n",
+ "9 0.000000 0.083333 \n",
+ "10 1.000000 1.000000 \n",
+ "11 1.000000 0.250000 \n",
+ "\n",
+ " textSimilarity(title).score document_id query_id relevant \n",
+ "0 0.055357 0 0 1 \n",
+ "1 1.000000 97200 0 0 \n",
+ "2 0.679189 69447 0 0 \n",
+ "3 0.224554 3 0 1 \n",
+ "4 1.000000 97200 0 0 \n",
+ "5 0.679189 69447 0 0 \n",
+ "6 0.047222 1 1 1 \n",
+ "7 1.000000 116256 1 0 \n",
+ "8 0.612500 14888 1 0 \n",
+ "9 0.041667 5 1 1 \n",
+ "10 1.000000 116256 1 0 \n",
+ "11 0.612500 14888 1 0 \n",
+ "\n",
+ "[12 rows x 984 columns]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "training_data_batch = app.collect_training_data(\n",
+ " labelled_data = labelled_data,\n",
+ " id_field = \"id\",\n",
+ " query_model = query_model,\n",
+ " number_additional_docs = 2\n",
+ ")\n",
+ "training_data_batch"
+ ]
+ },
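+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since `training_data_batch` is a regular DataFrame, it can be used directly in an ML workflow. The sketch below (just an illustration, not part of the library) separates the collected rank features from the label and id columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A minimal sketch: split the collected rank features from the label\n",
+ "# and id columns so they can be used to train a ranking model.\n",
+ "feature_columns = [\n",
+ "    col for col in training_data_batch.columns\n",
+ "    if col not in (\"document_id\", \"query_id\", \"relevant\")\n",
+ "]\n",
+ "X = training_data_batch[feature_columns]\n",
+ "y = training_data_batch[\"relevant\"]\n",
+ "X.shape, y.shape"
+ ]
+ },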
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Evaluating a query model\n",
+ "\n",
+ "> Define metrics and evaluate query models. See the [evaluation page](/vespa/evaluation) for more examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will define the following evaluation metrics:\n",
+ "* % of documents retrieved per query\n",
+ "* recall @ 10 per query\n",
+ "* MRR @ 10 per query"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
+ "\n",
+ "eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Evaluate:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>query_id</th>\n",
+ " <th>match_ratio_retrieved_docs</th>\n",
+ " <th>match_ratio_docs_available</th>\n",
+ " <th>match_ratio_value</th>\n",
+ " <th>recall_10_value</th>\n",
+ " <th>reciprocal_rank_10_value</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>0</td>\n",
+ " <td>1033</td>\n",
+ " <td>127518</td>\n",
+ " <td>0.008101</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>1</td>\n",
+ " <td>928</td>\n",
+ " <td>127518</td>\n",
+ " <td>0.007277</td>\n",
+ " <td>0.0</td>\n",
+ " <td>0</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " query_id match_ratio_retrieved_docs match_ratio_docs_available \\\n",
+ "0 0 1033 127518 \n",
+ "1 1 928 127518 \n",
+ "\n",
+ " match_ratio_value recall_10_value reciprocal_rank_10_value \n",
+ "0 0.008101 0.0 0 \n",
+ "1 0.007277 0.0 0 "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "evaluation = app.evaluate(\n",
+ " labelled_data = labelled_data,\n",
+ " eval_metrics = eval_metrics, \n",
+ " query_model = query_model, \n",
+ " id_field = \"id\",\n",
+ ")\n",
+ "evaluation"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/python/vespa/notebooks/create-and-deploy-vespa.ipynb b/python/vespa/notebooks/create-and-deploy-vespa.ipynb
new file mode 100644
index 00000000000..86d5fa08fc5
--- /dev/null
+++ b/python/vespa/notebooks/create-and-deploy-vespa.ipynb
@@ -0,0 +1,1064 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hide\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Build end-to-end Vespa apps with pyvespa\n",
+ "\n",
+ "> Python API to create, modify, deploy and interact with Vespa applications"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial will create a text search application from scratch based on the MS MARCO dataset, similar to our [text search tutorials](https://docs.vespa.ai/documentation/tutorials/text-search.html). We will first show how to define the app by creating an application package [REF]. Then we deploy the app locally in a Docker container. Once the app is up and running we show how to feed data to it. After the data is sent, we can make queries and inspect the results. We then show how to add a new rank profile to the application package and redeploy the app with the latest changes. Finally, we show how to evaluate and compare two rank profiles with evaluation metrics such as Recall and Reciprocal Rank."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Application package API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We first create a `Document` instance containing the `Field`s that we want to store in the app. In this case we will keep the application simple and only feed a unique `id`, `title` and `body` of the MS MARCO documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import Document, Field\n",
+ "\n",
+ "document = Document(\n",
+ " fields=[\n",
+ " Field(name = \"id\", type = \"string\", indexing = [\"attribute\", \"summary\"]),\n",
+ " Field(name = \"title\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\"),\n",
+ " Field(name = \"body\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\") \n",
+ " ]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The complete `Schema` of our application will be named `msmarco` and contain the `Document` instance that we defined above. The default `FieldSet` indicates that queries will look for matches by searching both the titles and the bodies of the documents, and the default `RankProfile` indicates that all the matched documents will be ranked by the `nativeRank` expression involving the title and the body of the matched documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import Schema, FieldSet, RankProfile\n",
+ "\n",
+ "msmarco_schema = Schema(\n",
+ " name = \"msmarco\", \n",
+ " document = document, \n",
+ " fieldsets = [FieldSet(name = \"default\", fields = [\"title\", \"body\"])],\n",
+ " rank_profiles = [RankProfile(name = \"default\", first_phase = \"nativeRank(title, body)\")]\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once the `Schema` is defined, all we have to do is create our msmarco `ApplicationPackage`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import ApplicationPackage\n",
+ "\n",
+ "app_package = ApplicationPackage(name = \"msmarco\", schema=msmarco_schema)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "At this point, `app_package` contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Deploy it locally"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial shows how to deploy the application package locally in a Docker container. For the following to work you need to run this from a machine with Docker installed. We first create a `VespaDocker` instance based on the application package."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import VespaDocker\n",
+ "\n",
+ "vespa_docker = VespaDocker(application_package=app_package)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We then call the `deploy` method and specify a `disk_folder` with write access. Behind the scenes, `pyvespa` will write the Vespa config files and store them in the `disk_folder`. It will then run a Vespa engine Docker container and deploy those config files inside it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app = vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `app` variable above will hold a `Vespa` instance that will be used to connect and interact with our text search application. We can see the deployment message returned by the Vespa engine:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[\"Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session\",\n",
+ " \"Session 18 for tenant 'default' created.\",\n",
+ " 'Preparing session 18 using http://localhost:19071/application/v2/tenant/default/session/18/prepared',\n",
+ " \"WARNING: Host named 'msmarco' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.\",\n",
+ " \"Session 18 for tenant 'default' prepared.\",\n",
+ " 'Activating session 18 using http://localhost:19071/application/v2/tenant/default/session/18/active',\n",
+ " \"Session 18 for tenant 'default' activated.\",\n",
+ " 'Checksum: 09203c16fa5f582b712711bb98932812',\n",
+ " 'Timestamp: 1598011224920',\n",
+ " 'Generation: 18',\n",
+ " '']"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "app.deployment_message"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Feed data to the app "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We now have our text search app up and running. We can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. We can load 996 documents that we want to feed and check the first two documents in this sample."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(996, 3)"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from pandas import read_csv\n",
+ "\n",
+ "docs = read_csv(\"https://thigm85.github.io/data/msmarco/docs.tsv\", sep = \"\\t\")\n",
+ "docs.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>id</th>\n",
+ " <th>title</th>\n",
+ " <th>body</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>D2185715</td>\n",
+ " <td>What Is an Appropriate Gift for a Bris</td>\n",
+ " <td>Hub Pages Religion and Philosophy Judaism...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>D2819479</td>\n",
+ " <td>lunge</td>\n",
+ " <td>1lungenoun ˈlənj Popularity Bottom 40 of...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " id title \\\n",
+ "0 D2185715 What Is an Appropriate Gift for a Bris \n",
+ "1 D2819479 lunge \n",
+ "\n",
+ " body \n",
+ "0 Hub Pages Religion and Philosophy Judaism... \n",
+ "1 1lungenoun ˈlənj Popularity Bottom 40 of... "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "docs.head(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To feed the data we need to specify the `schema` that we are sending data to. We named our schema `msmarco` in a previous section. Each data point needs a unique `data_id` associated with it, regardless of whether the schema has an id field. The `fields` should be a dict containing all the fields in the schema, which are `id`, `title` and `body` in our case."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for idx, row in docs.iterrows():\n",
+ " response = app.feed_data_point(\n",
+ " schema = \"msmarco\", \n",
+ " data_id = str(row[\"id\"]), \n",
+ " fields = {\n",
+ " \"id\": str(row[\"id\"]), \n",
+ " \"title\": str(row[\"title\"]), \n",
+ " \"body\": str(row[\"body\"])\n",
+ " }\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each call to `feed_data_point` sends a POST request to the appropriate Vespa endpoint, and we can inspect the response if needed, such as the status code and the message returned."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "200"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'id': 'id:msmarco:msmarco::D2002872',\n",
+ " 'pathId': '/document/v1/msmarco/msmarco/docid/D2002872'}"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "response.json()"
+ ]
+ },
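+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If something goes wrong during feeding, the status code will tell us. As a sketch, the variation of the feed loop below keeps track of the ids of the documents whose feed request did not return a 200 status code."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A variation of the feed loop above that collects the ids of documents\n",
+ "# whose feed request did not return a 200 status code.\n",
+ "failed_ids = []\n",
+ "for idx, row in docs.iterrows():\n",
+ "    response = app.feed_data_point(\n",
+ "        schema = \"msmarco\",\n",
+ "        data_id = str(row[\"id\"]),\n",
+ "        fields = {\n",
+ "            \"id\": str(row[\"id\"]),\n",
+ "            \"title\": str(row[\"title\"]),\n",
+ "            \"body\": str(row[\"body\"])\n",
+ "        }\n",
+ "    )\n",
+ "    if response.status_code != 200:\n",
+ "        failed_ids.append(str(row[\"id\"]))\n",
+ "len(failed_ids)"
+ ]
+ },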
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Make a simple query"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once our application is fed we can start to use it by sending queries to it. The MS MARCO app expects to receive questions as queries, and the goal of the application is to return documents that are relevant to the questions asked."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the example below, we will send a question via the `query` parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a `Query` model. The query model below uses the `OR` operator in the match phase, indicating that the application will match all the documents that have at least one query term in the title or the body (due to the default `FieldSet` we defined earlier). All the matched documents will then be ranked by the default `RankProfile` that we defined earlier."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Query, OR, RankProfile as Ranking\n",
+ "\n",
+ "results = app.query(\n",
+ " query=\"Where is my text?\", \n",
+ " query_model = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"default\")\n",
+ " ),\n",
+ " hits = 2\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In addition to the `query` and `query_model` parameters, we can specify many other relevant Vespa parameters, such as the number of `hits` that we want Vespa to return. We chose `hits=2` for simplicity in this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(results.hits)"
+ ]
+ },
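+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As an illustration, the sketch below repeats the query with a `timeout` in addition to `hits`, assuming that such extra keyword arguments are forwarded to the Vespa query API together with the generated query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A sketch assuming extra keyword arguments (such as \"timeout\" below)\n",
+ "# are forwarded to the Vespa query API together with the generated query.\n",
+ "results = app.query(\n",
+ "    query=\"Where is my text?\",\n",
+ "    query_model = Query(\n",
+ "        match_phase=OR(),\n",
+ "        rank_profile=Ranking(name=\"default\")\n",
+ "    ),\n",
+ "    hits = 2,\n",
+ "    timeout = 5\n",
+ ")\n",
+ "len(results.hits)"
+ ]
+ },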
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Change the application package and redeploy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also make specific changes to our application by changing the application package and redeploying. Let's add a new rank profile based on BM25 to our `Schema`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "app_package.schema.add_rank_profile(\n",
+ " RankProfile(name = \"bm25\", inherits = \"default\", first_phase = \"bm25(title) + bm25(body)\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "After that we can redeploy our application, similar to what we did earlier:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Vespa(http://localhost, 8080)"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then use the newly created `bm25` rank profile to make queries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results = app.query(\n",
+ " query=\"Where is my text?\", \n",
+ " query_model = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"bm25\")\n",
+ " ),\n",
+ " hits = 2\n",
+ ")\n",
+ "len(results.hits)"
+ ]
+ },
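+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To get a feel for what the `bm25` profile returns, the sketch below (assuming each hit carries a `fields` dictionary following Vespa's default JSON result format) prints the title of each returned hit."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A minimal sketch, assuming each hit carries a \"fields\" dict following\n",
+ "# Vespa's default JSON result format; \"title\" is a summary field in our schema.\n",
+ "for hit in results.hits:\n",
+ "    print(hit[\"fields\"][\"title\"])"
+ ]
+ },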
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Compare query models"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When building a search application, we often want to experiment with and compare different query models. In this section we show how easy it is to do that in Vespa."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's load some labelled data where each data point contains a `query_id`, a `query` and a list of `relevant_docs` associated with the query. In this case, we have only one relevant document for each query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "import requests\n",
+ "\n",
+ "labelled_data = json.loads(\n",
+ " requests.get(\"https://thigm85.github.io/data/msmarco/query-labels.json\").text\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below are two examples from the labelled data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'query_id': '1',\n",
+ " 'query': 'what county is aspen co',\n",
+ " 'relevant_docs': [{'id': 'D1098819'}]},\n",
+ " {'query_id': '2',\n",
+ " 'query': 'where is aeropostale located',\n",
+ " 'relevant_docs': [{'id': 'D2268823'}]}]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "labelled_data[0:2]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's define two `Query` models to be compared. We are going to use the same `OR` operator in the match phase and compare the `default` and `bm25` rank profiles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "default_ranking = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"default\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bm25_ranking = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"bm25\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we will choose which evaluation metrics we want to look at. In this case we will choose the `MatchRatio` to check how many documents have been matched by the query, the `Recall` at 10 and the `ReciprocalRank` at 10."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
+ "\n",
+ "eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now run the `evaluate` method for each `Query` model. This will send queries to the application and process the results to compute the `eval_metrics` defined above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "default_evaluation = app.evaluate(\n",
+ " labelled_data=labelled_data, \n",
+ " eval_metrics=eval_metrics, \n",
+ " query_model=default_ranking, \n",
+ " id_field=\"id\",\n",
+ " timeout=5,\n",
+ " hits=10\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bm25_evaluation = app.evaluate(\n",
+ " labelled_data=labelled_data, \n",
+ " eval_metrics=eval_metrics, \n",
+ " query_model=bm25_ranking, \n",
+ " id_field=\"id\",\n",
+ " timeout=5,\n",
+ " hits=10\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then merge the DataFrames returned by the `evaluate` method and start to analyse the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>query_id</th>\n",
+ " <th>match_ratio_retrieved_docs_default</th>\n",
+ " <th>match_ratio_docs_available_default</th>\n",
+ " <th>match_ratio_value_default</th>\n",
+ " <th>recall_10_value_default</th>\n",
+ " <th>reciprocal_rank_10_value_default</th>\n",
+ " <th>match_ratio_retrieved_docs_bm25</th>\n",
+ " <th>match_ratio_docs_available_bm25</th>\n",
+ " <th>match_ratio_value_bm25</th>\n",
+ " <th>recall_10_value_bm25</th>\n",
+ " <th>reciprocal_rank_10_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>1</td>\n",
+ " <td>914</td>\n",
+ " <td>997</td>\n",
+ " <td>0.916750</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000</td>\n",
+ " <td>914</td>\n",
+ " <td>997</td>\n",
+ " <td>0.916750</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>2</td>\n",
+ " <td>896</td>\n",
+ " <td>997</td>\n",
+ " <td>0.898696</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.125</td>\n",
+ " <td>896</td>\n",
+ " <td>997</td>\n",
+ " <td>0.898696</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>3</td>\n",
+ " <td>971</td>\n",
+ " <td>997</td>\n",
+ " <td>0.973922</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000</td>\n",
+ " <td>971</td>\n",
+ " <td>997</td>\n",
+ " <td>0.973922</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>4</td>\n",
+ " <td>982</td>\n",
+ " <td>997</td>\n",
+ " <td>0.984955</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000</td>\n",
+ " <td>982</td>\n",
+ " <td>997</td>\n",
+ " <td>0.984955</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>5</td>\n",
+ " <td>748</td>\n",
+ " <td>997</td>\n",
+ " <td>0.750251</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.500</td>\n",
+ " <td>748</td>\n",
+ " <td>997</td>\n",
+ " <td>0.750251</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.333333</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " query_id match_ratio_retrieved_docs_default \\\n",
+ "0 1 914 \n",
+ "1 2 896 \n",
+ "2 3 971 \n",
+ "3 4 982 \n",
+ "4 5 748 \n",
+ "\n",
+ " match_ratio_docs_available_default match_ratio_value_default \\\n",
+ "0 997 0.916750 \n",
+ "1 997 0.898696 \n",
+ "2 997 0.973922 \n",
+ "3 997 0.984955 \n",
+ "4 997 0.750251 \n",
+ "\n",
+ " recall_10_value_default reciprocal_rank_10_value_default \\\n",
+ "0 1.0 1.000 \n",
+ "1 1.0 0.125 \n",
+ "2 1.0 1.000 \n",
+ "3 1.0 1.000 \n",
+ "4 1.0 0.500 \n",
+ "\n",
+ " match_ratio_retrieved_docs_bm25 match_ratio_docs_available_bm25 \\\n",
+ "0 914 997 \n",
+ "1 896 997 \n",
+ "2 971 997 \n",
+ "3 982 997 \n",
+ "4 748 997 \n",
+ "\n",
+ " match_ratio_value_bm25 recall_10_value_bm25 reciprocal_rank_10_value_bm25 \n",
+ "0 0.916750 1.0 1.000000 \n",
+ "1 0.898696 1.0 1.000000 \n",
+ "2 0.973922 1.0 1.000000 \n",
+ "3 0.984955 1.0 1.000000 \n",
+ "4 0.750251 1.0 0.333333 "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from pandas import merge\n",
+ "\n",
+ "eval_comparison = merge(\n",
+ " left=default_evaluation, \n",
+ " right=bm25_evaluation, \n",
+ " on=\"query_id\", \n",
+ " suffixes=('_default', '_bm25')\n",
+ ")\n",
+ "eval_comparison.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that we expect to observe the same match ratio for both query models since they use the same `OR` operator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>match_ratio_value_default</th>\n",
+ " <th>match_ratio_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>0.866861</td>\n",
+ " <td>0.866861</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>0.181418</td>\n",
+ " <td>0.181418</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " match_ratio_value_default match_ratio_value_bm25\n",
+ "mean 0.866861 0.866861\n",
+ "std 0.181418 0.181418"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_comparison[[\"match_ratio_value_default\", \"match_ratio_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `bm25` rank profile obtained a significantly higher recall than the `default`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>recall_10_value_default</th>\n",
+ " <th>recall_10_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>0.840000</td>\n",
+ " <td>0.960000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>0.368453</td>\n",
+ " <td>0.196946</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " recall_10_value_default recall_10_value_bm25\n",
+ "mean 0.840000 0.960000\n",
+ "std 0.368453 0.196946"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_comparison[[\"recall_10_value_default\", \"recall_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Similarly, `bm25` also obtained a significantly higher reciprocal rank value than the `default` rank profile."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>reciprocal_rank_10_value_default</th>\n",
+ " <th>reciprocal_rank_10_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>0.724750</td>\n",
+ " <td>0.943333</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>0.399118</td>\n",
+ " <td>0.216103</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " reciprocal_rank_10_value_default reciprocal_rank_10_value_bm25\n",
+ "mean 0.724750 0.943333\n",
+ "std 0.399118 0.216103"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_comparison[[\"reciprocal_rank_10_value_default\", \"reciprocal_rank_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}