author     tmartins <thigm85@gmail.com>  2020-08-24 09:06:37 +0200
committer  tmartins <thigm85@gmail.com>  2020-08-24 09:06:37 +0200
commit     24dadde24919aecb6ac8bab94d14f559d48bd9af (patch)
tree       5e83a4b62814c683ddcb6f55c38daa6d1ccfa021 /python
parent     de1851d7ca0e0da32e479fb584d39037e88029bc (diff)
finish create and deploy tutorial
Diffstat (limited to 'python')
-rw-r--r--  python/vespa/notebooks/create-and-deploy-vespa.ipynb  902
1 file changed, 870 insertions, 32 deletions
diff --git a/python/vespa/notebooks/create-and-deploy-vespa.ipynb b/python/vespa/notebooks/create-and-deploy-vespa.ipynb
index 334b39e21ee..86d5fa08fc5 100644
--- a/python/vespa/notebooks/create-and-deploy-vespa.ipynb
+++ b/python/vespa/notebooks/create-and-deploy-vespa.ipynb
@@ -24,14 +24,28 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Our goal is to create, modify and deploy simple application packages using our python API. This enables us to run data analysis experiments that are fully integrated with Vespa. As an example, we want to create the application package we used in our [text search tutorial](https://docs.vespa.ai/documentation/tutorials/text-search.html). "
+ "`pyvespa` provides a Python API to [vespa.ai](https://vespa.ai). It allows us to create, modify, deploy and interact with running Vespa instances. The main goal of the library is to enable faster prototyping and ML experimentation. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Schema API"
+ "This tutorial creates a text search application from scratch based on the MS MARCO dataset, similar to our [text search tutorials](https://docs.vespa.ai/documentation/tutorials/text-search.html). We first show how to define the app by creating an application package [REF]. We then deploy the app locally in a Docker container. Once the app is up and running, we show how to feed data to it. After the data is sent, we can make queries and inspect the results. We then show how to add a new rank profile to the application package and redeploy the app with the latest changes. Finally, we show how to evaluate and compare two rank profiles with evaluation metrics such as Recall and Reciprocal Rank."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Application package API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We first create a `Document` instance containing the `Field`s that we want to store in the app. In this case we keep the application simple and only feed a unique `id`, the `title` and the `body` of each MS MARCO document."
]
},
{
@@ -40,7 +54,7 @@
"metadata": {},
"outputs": [],
"source": [
- "from vespa.package import Document, Field, Schema, FieldSet, RankProfile, ApplicationPackage\n",
+ "from vespa.package import Document, Field\n",
"\n",
"document = Document(\n",
" fields=[\n",
@@ -48,14 +62,46 @@
" Field(name = \"title\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\"),\n",
" Field(name = \"body\", type = \"string\", indexing = [\"index\", \"summary\"], index = \"enable-bm25\") \n",
" ]\n",
- ")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The complete `Schema` of our application will be named `msmarco` and contains the `Document` instance that we defined above. The default `FieldSet` indicates that queries will look for matches by searching both the titles and the bodies of the documents. The default `RankProfile` indicates that all the matched documents will be ranked by the `nativeRank` expression involving the title and the body of the matched documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import Schema, FieldSet, RankProfile\n",
"\n",
"msmarco_schema = Schema(\n",
" name = \"msmarco\", \n",
" document = document, \n",
" fieldsets = [FieldSet(name = \"default\", fields = [\"title\", \"body\"])],\n",
" rank_profiles = [RankProfile(name = \"default\", first_phase = \"nativeRank(title, body)\")]\n",
- ")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once the `Schema` is defined, all we have to do is create our msmarco `ApplicationPackage`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import ApplicationPackage\n",
"\n",
"app_package = ApplicationPackage(name = \"msmarco\", schema=msmarco_schema)"
]
@@ -64,10 +110,24 @@
"cell_type": "markdown",
"metadata": {},
"source": [
+ "At this point, `app_package` contains all the relevant information required to create our MS MARCO text search app. We now need to deploy it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
"## Deploy it locally"
]
},
{
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial shows how to deploy the application package locally in a Docker container. For the following to work you need to run this from a machine with Docker installed. We first create a `VespaDocker` instance based on the application package."
+ ]
+ },
+ {
"cell_type": "code",
"execution_count": null,
"metadata": {},
@@ -75,8 +135,14 @@
"source": [
"from vespa.package import VespaDocker\n",
"\n",
- "vespa_docker = VespaDocker(application_package=app_package)\n",
- "deployment_msg, app = vespa_docker.deploy(disk_folder=\"/Users/tmartins/projects/vespa/vespa/python/vespa/notebooks/sample_application\")"
+ "vespa_docker = VespaDocker(application_package=app_package)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We then call the `deploy` method and specify a `disk_folder` with write access. Behind the scenes, `pyvespa` writes the Vespa config files to the `disk_folder`, runs a Vespa engine Docker container, and deploys those config files in the container."
]
},
{
@@ -85,16 +151,44 @@
"metadata": {},
"outputs": [],
"source": [
- "deployment_msg"
+ "app = vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `app` variable above holds a `Vespa` instance that will be used to connect to and interact with our text search application. We can see the deployment message returned by the Vespa engine:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[\"Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session\",\n",
+ " \"Session 18 for tenant 'default' created.\",\n",
+ " 'Preparing session 18 using http://localhost:19071/application/v2/tenant/default/session/18/prepared',\n",
+ " \"WARNING: Host named 'msmarco' may not receive any config since it is not a canonical hostname. Disregard this warning when testing in a Docker container.\",\n",
+ " \"Session 18 for tenant 'default' prepared.\",\n",
+ " 'Activating session 18 using http://localhost:19071/application/v2/tenant/default/session/18/active',\n",
+ " \"Session 18 for tenant 'default' activated.\",\n",
+ " 'Checksum: 09203c16fa5f582b712711bb98932812',\n",
+ " 'Timestamp: 1598011224920',\n",
+ " 'Generation: 18',\n",
+ " '']"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "app"
+ "app.deployment_message"
]
},
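+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a side note, once the container is up and running we can also obtain a connection without redeploying. A minimal sketch, assuming the default local endpoint shown in the output above and that the `Vespa` class is importable from `vespa.application`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.application import Vespa\n",
+ "\n",
+ "# Sketch: connect directly to the already running local container\n",
+ "# (assumed endpoint; equivalent to the `app` instance returned by deploy).\n",
+ "running_app = Vespa(url = \"http://localhost\", port = 8080)"
+ ]
+ },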
{
@@ -108,24 +202,104 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "To feed data we need to specify the `schema` that we are sending data to. Each data point needs to have a unique `data_id` associated with it, independent of having an id field or not. The `fields` should be a dict containing all the fields in the schema. "
+ "We now have our text search app up and running, and we can start to feed data to it. We have pre-processed and sampled some MS MARCO data to use in this tutorial. Below we load the 996 documents that we want to feed and check the first two documents in this sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(996, 3)"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "response = app.feed_data_point(\n",
- " schema = \"msmarco\", \n",
- " data_id = 1, \n",
- " fields = {\n",
- " \"id\": \"1\", \n",
- " \"title\": \"This is a text\", \n",
- " \"body\": \"This is the body of the text\"\n",
- " }\n",
- ")"
+ "from pandas import read_csv\n",
+ "\n",
+ "docs = read_csv(\"https://thigm85.github.io/data/msmarco/docs.tsv\", sep = \"\\t\")\n",
+ "docs.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>id</th>\n",
+ " <th>title</th>\n",
+ " <th>body</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>D2185715</td>\n",
+ " <td>What Is an Appropriate Gift for a Bris</td>\n",
+ " <td>Hub Pages Religion and Philosophy Judaism...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>D2819479</td>\n",
+ " <td>lunge</td>\n",
+ " <td>1lungenoun ˈlənj Popularity Bottom 40 of...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " id title \\\n",
+ "0 D2185715 What Is an Appropriate Gift for a Bris \n",
+ "1 D2819479 lunge \n",
+ "\n",
+ " body \n",
+ "0 Hub Pages Religion and Philosophy Judaism... \n",
+ "1 1lungenoun ˈlənj Popularity Bottom 40 of... "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "docs.head(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To feed the data we need to specify the `schema` that we are sending data to. We named our schema `msmarco` in a previous section. Each data point needs a unique `data_id` associated with it, regardless of whether the schema has an id field. The `fields` argument should be a dict containing all the fields in the schema, which are `id`, `title` and `body` in our case. "
]
},
{
@@ -134,6 +308,42 @@
"metadata": {},
"outputs": [],
"source": [
+ "for idx, row in docs.iterrows():\n",
+ " response = app.feed_data_point(\n",
+ " schema = \"msmarco\", \n",
+ " data_id = str(row[\"id\"]), \n",
+ " fields = {\n",
+ " \"id\": str(row[\"id\"]), \n",
+ " \"title\": str(row[\"title\"]), \n",
+ " \"body\": str(row[\"body\"])\n",
+ " }\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each call to the `feed_data_point` method sends a POST request to the appropriate Vespa endpoint. We can check the response of each request if needed, such as the status code and the message returned."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "200"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
"response.status_code"
]
},
@@ -141,7 +351,19 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'id': 'id:msmarco:msmarco::D2002872',\n",
+ " 'pathId': '/document/v1/msmarco/msmarco/docid/D2002872'}"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"response.json()"
]
@@ -154,31 +376,62 @@
]
},
{
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Once our application is fed, we can start to use it by sending queries to it. The MS MARCO app expects to receive questions as queries, and the goal of the application is to return documents that are relevant to the questions asked."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the example below, we send a question via the `query` parameter. In addition, we need to specify how we want the documents to be matched and ranked. We do this by specifying a `Query` model. The query model below uses the `OR` operator in the match phase, indicating that the application will match all the documents which have at least one query term within the title or the body of the document (due to the default `FieldSet` we defined earlier). All the matched documents are then ranked by the default `RankProfile` that we defined earlier."
+ ]
+ },
+ {
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
- "from vespa.query import Query, OR, RankProfile\n",
+ "from vespa.query import Query, OR, RankProfile as Ranking\n",
"\n",
"results = app.query(\n",
" query=\"Where is my text?\", \n",
" query_model = Query(\n",
" match_phase=OR(), \n",
- " rank_profile=RankProfile(name=\"default\")\n",
- " )\n",
- ")\n",
- "\n",
- "results.number_documents_retrieved"
+ " rank_profile=Ranking(name=\"default\")\n",
+ " ),\n",
+ " hits = 2\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In addition to the `query` and `query_model` parameters, we can specify a multitude of relevant Vespa parameters such as the number of `hits` that we want Vespa to return. We chose `hits=2` for simplicity in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "results.hits"
+ "len(results.hits)"
]
},
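+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each entry of `results.hits` is a dictionary following Vespa's default JSON result format. As a minimal sketch (assuming the relevance score and the document fields are exposed per hit under the `relevance` and `fields` keys), we can inspect the top ranked hit:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: inspect the top ranked hit; assumes Vespa's default JSON hit layout\n",
+ "# where the relevance score and the document fields are available per hit.\n",
+ "top_hit = results.hits[0]\n",
+ "top_hit[\"relevance\"], top_hit[\"fields\"][\"title\"]"
+ ]
+ },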
{
@@ -192,7 +445,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can add a new rank profile and redeploy our application"
+ "We can also make specific changes to our application by modifying the application package and redeploying. Let's add a new rank profile based on BM25 to our `Schema`."
]
},
{
@@ -207,11 +460,596 @@
]
},
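+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference, one way to register such a profile is sketched below. This is only a sketch: it assumes that `app_package.schema` exposes our msmarco `Schema`, that `Schema` provides an `add_rank_profile` method, and that the `bm25` rank feature is applied to the title and body fields."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.package import RankProfile\n",
+ "\n",
+ "# Sketch (assumptions noted above): add a BM25-based rank profile to the schema\n",
+ "# before redeploying the application package.\n",
+ "app_package.schema.add_rank_profile(\n",
+ "    RankProfile(name = \"bm25\", first_phase = \"bm25(title) + bm25(body)\")\n",
+ ")"
+ ]
+ },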
{
- "cell_type": "raw",
+ "cell_type": "markdown",
"metadata": {},
"source": [
+ "After that, we can redeploy our application, just as we did earlier:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Vespa(http://localhost, 8080)"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
"vespa_docker.deploy(disk_folder=\"/Users/username/sample_application\")"
]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then use the newly created `bm25` rank profile to make queries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "results = app.query(\n",
+ " query=\"Where is my text?\", \n",
+ " query_model = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"bm25\")\n",
+ " ),\n",
+ " hits = 2\n",
+ ")\n",
+ "len(results.hits)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Compare query models"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When we are building a search application, we often want to experiment with and compare different query models. In this section we show how easy it is to do this in Vespa."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's load some labelled data where each data point contains a `query_id`, a `query` and a list of `relevant_docs` associated with the query. In this case, we have only one relevant document for each query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "import requests\n",
+ "\n",
+ "labelled_data = json.loads(\n",
+ " requests.get(\"https://thigm85.github.io/data/msmarco/query-labels.json\").text\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below we can see two examples of the labelled data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'query_id': '1',\n",
+ " 'query': 'what county is aspen co',\n",
+ " 'relevant_docs': [{'id': 'D1098819'}]},\n",
+ " {'query_id': '2',\n",
+ " 'query': 'where is aeropostale located',\n",
+ " 'relevant_docs': [{'id': 'D2268823'}]}]"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "labelled_data[0:2]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's define two `Query` models to be compared. We are going to use the same `OR` operator in the match phase and compare the `default` and `bm25` rank profiles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "default_ranking = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"default\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bm25_ranking = Query(\n",
+ " match_phase=OR(), \n",
+ " rank_profile=Ranking(name=\"bm25\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we choose which evaluation metrics we want to look at. In this case we choose `MatchRatio` to check how many documents have been matched by the query, `Recall` at 10 and `ReciprocalRank` at 10."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from vespa.evaluation import MatchRatio, Recall, ReciprocalRank\n",
+ "\n",
+ "eval_metrics = [MatchRatio(), Recall(at = 10), ReciprocalRank(at = 10)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now run the `evaluate` method for each `Query` model. This will send queries to the application and process the results to compute the `eval_metrics` defined above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "default_evaluation = app.evaluate(\n",
+ " labelled_data=labelled_data, \n",
+ " eval_metrics=eval_metrics, \n",
+ " query_model=default_ranking, \n",
+ " id_field=\"id\",\n",
+ " timeout=5,\n",
+ " hits=10\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bm25_evaluation = app.evaluate(\n",
+ " labelled_data=labelled_data, \n",
+ " eval_metrics=eval_metrics, \n",
+ " query_model=bm25_ranking, \n",
+ " id_field=\"id\",\n",
+ " timeout=5,\n",
+ " hits=10\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can then merge the DataFrames returned by the `evaluate` method and start to analyse the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>query_id</th>\n",
+ " <th>match_ratio_retrieved_docs_default</th>\n",
+ " <th>match_ratio_docs_available_default</th>\n",
+ " <th>match_ratio_value_default</th>\n",
+ " <th>recall_10_value_default</th>\n",
+ " <th>reciprocal_rank_10_value_default</th>\n",
+ " <th>match_ratio_retrieved_docs_bm25</th>\n",
+ " <th>match_ratio_docs_available_bm25</th>\n",
+ " <th>match_ratio_value_bm25</th>\n",
+ " <th>recall_10_value_bm25</th>\n",
+ " <th>reciprocal_rank_10_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>1</td>\n",
+ " <td>914</td>\n",
+ " <td>997</td>\n",
+ " <td>0.916750</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000</td>\n",
+ " <td>914</td>\n",
+ " <td>997</td>\n",
+ " <td>0.916750</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>2</td>\n",
+ " <td>896</td>\n",
+ " <td>997</td>\n",
+ " <td>0.898696</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.125</td>\n",
+ " <td>896</td>\n",
+ " <td>997</td>\n",
+ " <td>0.898696</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>3</td>\n",
+ " <td>971</td>\n",
+ " <td>997</td>\n",
+ " <td>0.973922</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000</td>\n",
+ " <td>971</td>\n",
+ " <td>997</td>\n",
+ " <td>0.973922</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>4</td>\n",
+ " <td>982</td>\n",
+ " <td>997</td>\n",
+ " <td>0.984955</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000</td>\n",
+ " <td>982</td>\n",
+ " <td>997</td>\n",
+ " <td>0.984955</td>\n",
+ " <td>1.0</td>\n",
+ " <td>1.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>5</td>\n",
+ " <td>748</td>\n",
+ " <td>997</td>\n",
+ " <td>0.750251</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.500</td>\n",
+ " <td>748</td>\n",
+ " <td>997</td>\n",
+ " <td>0.750251</td>\n",
+ " <td>1.0</td>\n",
+ " <td>0.333333</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " query_id match_ratio_retrieved_docs_default \\\n",
+ "0 1 914 \n",
+ "1 2 896 \n",
+ "2 3 971 \n",
+ "3 4 982 \n",
+ "4 5 748 \n",
+ "\n",
+ " match_ratio_docs_available_default match_ratio_value_default \\\n",
+ "0 997 0.916750 \n",
+ "1 997 0.898696 \n",
+ "2 997 0.973922 \n",
+ "3 997 0.984955 \n",
+ "4 997 0.750251 \n",
+ "\n",
+ " recall_10_value_default reciprocal_rank_10_value_default \\\n",
+ "0 1.0 1.000 \n",
+ "1 1.0 0.125 \n",
+ "2 1.0 1.000 \n",
+ "3 1.0 1.000 \n",
+ "4 1.0 0.500 \n",
+ "\n",
+ " match_ratio_retrieved_docs_bm25 match_ratio_docs_available_bm25 \\\n",
+ "0 914 997 \n",
+ "1 896 997 \n",
+ "2 971 997 \n",
+ "3 982 997 \n",
+ "4 748 997 \n",
+ "\n",
+ " match_ratio_value_bm25 recall_10_value_bm25 reciprocal_rank_10_value_bm25 \n",
+ "0 0.916750 1.0 1.000000 \n",
+ "1 0.898696 1.0 1.000000 \n",
+ "2 0.973922 1.0 1.000000 \n",
+ "3 0.984955 1.0 1.000000 \n",
+ "4 0.750251 1.0 0.333333 "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from pandas import merge\n",
+ "\n",
+ "eval_comparison = merge(\n",
+ " left=default_evaluation, \n",
+ " right=bm25_evaluation, \n",
+ " on=\"query_id\", \n",
+ " suffixes=('_default', '_bm25')\n",
+ ")\n",
+ "eval_comparison.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that we expect to observe the same match ratio for both query models since they use the same `OR` operator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>match_ratio_value_default</th>\n",
+ " <th>match_ratio_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>0.866861</td>\n",
+ " <td>0.866861</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>0.181418</td>\n",
+ " <td>0.181418</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " match_ratio_value_default match_ratio_value_bm25\n",
+ "mean 0.866861 0.866861\n",
+ "std 0.181418 0.181418"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_comparison[[\"match_ratio_value_default\", \"match_ratio_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `bm25` rank profile obtained a significantly higher recall than the `default`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>recall_10_value_default</th>\n",
+ " <th>recall_10_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>0.840000</td>\n",
+ " <td>0.960000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>0.368453</td>\n",
+ " <td>0.196946</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " recall_10_value_default recall_10_value_bm25\n",
+ "mean 0.840000 0.960000\n",
+ "std 0.368453 0.196946"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_comparison[[\"recall_10_value_default\", \"recall_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Similarly, `bm25` also gets a significantly higher reciprocal rank value when compared to the `default` rank profile."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>reciprocal_rank_10_value_default</th>\n",
+ " <th>reciprocal_rank_10_value_bm25</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>0.724750</td>\n",
+ " <td>0.943333</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>0.399118</td>\n",
+ " <td>0.216103</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " reciprocal_rank_10_value_default reciprocal_rank_10_value_bm25\n",
+ "mean 0.724750 0.943333\n",
+ "std 0.399118 0.216103"
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "eval_comparison[[\"reciprocal_rank_10_value_default\", \"reciprocal_rank_10_value_bm25\"]].describe().loc[[\"mean\", \"std\"]]"
+ ]
}
],
"metadata": {