author     tmartins <thigm85@gmail.com>  2020-08-27 13:52:57 +0200
committer  tmartins <thigm85@gmail.com>  2020-08-27 13:52:57 +0200
commit     7bbca41e1806bbd6295112da481f015fbd79b7ad (patch)
tree       9378692a72763a3abfdaa95204ee8e67e5295e29 /python
parent     cc49d0a8581aa12c3a3cf49e6a29d0463f107148 (diff)
include notebook
Diffstat (limited to 'python')
-rw-r--r--  python/vespa/docs/sphinx/source/connect-to-vespa-instance.ipynb  357
1 file changed, 357 insertions, 0 deletions
diff --git a/python/vespa/docs/sphinx/source/connect-to-vespa-instance.ipynb b/python/vespa/docs/sphinx/source/connect-to-vespa-instance.ipynb
new file mode 100644
index 00000000000..53ca3263741
--- /dev/null
+++ b/python/vespa/docs/sphinx/source/connect-to-vespa-instance.ipynb
@@ -0,0 +1,357 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Vespa library for data analysis\n",
+ "\n",
+ "> Provide data analysis support for Vespa applications \n",
+ "\n",
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vespa-engine/vespa/blob/tgm/pyvespa-tutorial/python/vespa/notebooks/connect-to-vespa-instance.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`pyvespa` provides a python API to [vespa.ai](vespa.ai). It allow us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to allow for faster prototyping and ML experimentation. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial will show you how to connect to a pre-existing Vespa instance. We will use the https://cord19.vespa.ai/ app as an example. You can run this tutorial yourself in Google Colab by clicking on the badge located at the top of the tutorial."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Install"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The library is available at PyPI and therefore can be installed with `pip`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install pyvespa"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Connect to a Vespa app\n",
+ "\n",
+ "> Connect to a running Vespa application"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can connect to a running Vespa instance by created an instance of `Vespa` with the appropriate url. The resulting `app` will then be used to communicate with the application."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.application import Vespa\n",
+ "\n",
+ "app = Vespa(url = \"https://api.cord19.vespa.ai\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Define a Query model\n",
+ "\n",
+ "> Easily define matching and ranking criteria"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When building a search application, we usually want to expirement with different query models. A `Query` model consists of a match phase and a ranking phase. The matching phase will define how to match documents based on the query sent and the ranking phase will define how to rank the matched documents. Both phases can get quite complex and being able to easily express and experiment with them is very valuable."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the example below we define the match phase to be the `Union` of the `WeakAnd` and the `ANN` operators. The `WeakAnd` will match documents based on query terms while the Approximate Nearest Neighbor (`ANN`) operator will match documents based on the distance between the query and document embeddings. This is an illustration of how easy it is to combine term and semantic matching in Vespa. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Union, WeakAnd, ANN\n",
+ "from random import random\n",
+ "\n",
+ "match_phase = Union(\n",
+ " WeakAnd(hits = 10), \n",
+ " ANN(\n",
+ " doc_vector=\"title_embedding\", \n",
+ " query_vector=\"title_vector\", \n",
+ " embedding_model=lambda x: [random() for x in range(768)],\n",
+ " hits = 10,\n",
+ " label=\"title\"\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We then define the the ranking to be done by the `bm25` rank-profile that is already defined in the application package. We set `list_features=True` to be able to collect ranking-features later in this tutorial. After defining the `match_phase` and the `rank_profile` we can instantiate the `Query` model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.query import Query, RankProfile\n",
+ "\n",
+ "rank_profile = RankProfile(name=\"bm25\", list_features=True)\n",
+ "\n",
+ "query_model = Query(match_phase=match_phase, rank_profile=rank_profile)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Query the vespa app\n",
+ "\n",
+ "> Send queries via the query API. See the [query page](/vespa/query) for more examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can use the `query_model` that we just defined to issue queries to the application via the `query` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_result = app.query(\n",
+ " query=\"Is remdesivir an effective treatment for COVID-19?\", \n",
+ " query_model=query_model\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can see the number of documents that were retrieved by Vespa:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_result.number_documents_retrieved"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And the number of documents that were returned to us:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "len(query_result.hits)"
+ ]
+ },
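+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also inspect the first hit directly to get a feel for what is returned. The exact fields depend on the application's schema, so treat this as a generic illustration:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_result.hits[0]"
+ ]
+ },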
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Labelled data\n",
+ "\n",
+ "> How to structure labelled data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We often need to either evaluate query models or to collect data to improve query models through ML. In both cases we usually need labelled data. Lets create some labelled data to illustrate their expected format and their usage in the library."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each data point contains a `query_id`, a `query` and `relevant_docs` associated with the query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "labelled_data = [\n",
+ " {\n",
+ " \"query_id\": 0, \n",
+ " \"query\": \"Intrauterine virus infections and congenital heart disease\",\n",
+ " \"relevant_docs\": [{\"id\": 0, \"score\": 1}, {\"id\": 3, \"score\": 1}]\n",
+ " },\n",
+ " {\n",
+ " \"query_id\": 1, \n",
+ " \"query\": \"Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus\",\n",
+ " \"relevant_docs\": [{\"id\": 1, \"score\": 1}, {\"id\": 5, \"score\": 1}]\n",
+ " }\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Non-relevant documents are assigned `\"score\": 0` by default. Relevant documents will be assigned `\"score\": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified on the appropriate methods."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Collect training data\n",
+ "\n",
+ "> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can colect training data with the `collect_training_data` method according to a specific `query_model`. Below we will collect two documents for each query in addition to the relevant ones."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "training_data_batch = app.collect_training_data(\n",
+ " labelled_data = labelled_data,\n",
+ " id_field = \"id\",\n",
+ " query_model = query_model,\n",
+ " number_additional_docs = 2\n",
+ ")\n",
+ "training_data_batch"
+ ]
+ },
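+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As noted earlier, the default scores assigned to relevant and non-relevant documents can be overridden. The sketch below passes them explicitly; the argument names `relevant_score` and `default_score` are assumptions for illustration, so check the `collect_training_data` signature of your `pyvespa` version before relying on them:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The relevant_score and default_score argument names are assumed for\n",
+ "# illustration -- verify them against the collect_training_data signature.\n",
+ "app.collect_training_data(\n",
+ " labelled_data = labelled_data,\n",
+ " id_field = \"id\",\n",
+ " query_model = query_model,\n",
+ " number_additional_docs = 2,\n",
+ " relevant_score = 1,\n",
+ " default_score = 0\n",
+ ")"
+ ]
+ },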
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Evaluating a query model\n",
+ "\n",
+ "> Define metrics and evaluate query models. See the [evaluation page](/vespa/evaluation) for more examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will define the following evaluation metrics:\n",
+ "* % of documents retrieved per query\n",
+ "* recall @ 10 per query\n",
+ "* MRR @ 10 per query"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
+ "\n",
+ "eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Evaluate:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "evaluation = app.evaluate(\n",
+ " labelled_data = labelled_data,\n",
+ " eval_metrics = eval_metrics, \n",
+ " query_model = query_model, \n",
+ " id_field = \"id\",\n",
+ ")\n",
+ "evaluation"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "vespa",
+ "language": "python",
+ "name": "vespa"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}