# Vespa library for data analysis
> Provide data analysis support for Vespa applications
## Install
`pip install pyvespa`
## Connect to a Vespa app
> Connect to a running Vespa application
```
from vespa.application import Vespa
app = Vespa(url = "https://api.cord19.vespa.ai")
```
## Define a Query model
> Easily define matching and ranking criteria
```
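from random import random

from vespa.query import Query, Union, WeakAnd, ANN, RankProfile

# Match phase: combine term-based matching (WeakAnd) with
# approximate nearest neighbour (ANN) search on the title embedding.
match_phase = Union(
    WeakAnd(hits=10),
    ANN(
        doc_vector="title_embedding",
        query_vector="title_vector",
        # Placeholder embedding model: returns a random 768-dimensional vector.
        embedding_model=lambda query: [random() for _ in range(768)],
        hits=10,
        label="title",
    ),
)
# Rank phase: use the "bm25" rank profile and return its rank features.
rank_profile = RankProfile(name="bm25", list_features=True)
query_model = Query(match_phase=match_phase, rank_profile=rank_profile)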
```
## Query the Vespa app
> Send queries via the query API. See the [query page](/vespa/query) for more examples.
```
query_result = app.query(
    query="Is remdesivir an effective treatment for COVID-19?",
    query_model=query_model,
)
```
```
query_result.number_documents_retrieved
```
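To make the attribute above more concrete, the sketch below reads the same kind of count from a raw response. It assumes the standard Vespa JSON result layout, where the total number of matched documents sits under `root.fields.totalCount` and the returned hits under `root.children`. The `sample_response` dictionary (including its values) and the helpers `count_retrieved` and `hit_ids` are made up for illustration and are not part of pyvespa.

```python
# Hand-written stand-in for a Vespa query response (illustrative values only).
sample_response = {
    "root": {
        "id": "toplevel",
        "fields": {"totalCount": 1083},
        "children": [
            {"id": "index:content/0/a", "relevance": 0.52, "fields": {"id": 0}},
            {"id": "index:content/0/b", "relevance": 0.41, "fields": {"id": 3}},
        ],
    }
}


def count_retrieved(response: dict) -> int:
    """Return the number of matched documents, or 0 if the field is absent."""
    return response.get("root", {}).get("fields", {}).get("totalCount", 0)


def hit_ids(response: dict, id_field: str = "id") -> list:
    """Return the value of `id_field` for every returned hit, in rank order."""
    hits = response.get("root", {}).get("children", [])
    return [hit["fields"][id_field] for hit in hits]


print(count_retrieved(sample_response))
print(hit_ids(sample_response))
```

The number of matched documents is usually much larger than the number of hits returned, which is why the two are read from different places in the response.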
## Labelled data
> How to structure labelled data
```
labelled_data = [
    {
        "query_id": 0,
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    },
    {
        "query_id": 1,
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}],
    },
]
```
Non-relevant documents are assigned `"score": 0` by default. A relevant document whose `score` field is missing from the labelled data is assigned `"score": 1` by default. Both defaults can be changed on the methods that consume the labelled data.
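The default for relevant documents can be pictured with a small helper that fills in a missing `score` field. This is only a sketch of the rule described above: `apply_default_scores` and its `default_score` parameter are hypothetical names, not pyvespa functions.

```python
def apply_default_scores(labelled_data, default_score=1):
    """Return a copy of labelled_data where every relevant doc has a score.

    Hypothetical helper: documents that already carry a "score" keep it,
    documents without one receive `default_score`.
    """
    completed = []
    for entry in labelled_data:
        docs = [
            {**doc, "score": doc.get("score", default_score)}
            for doc in entry["relevant_docs"]
        ]
        completed.append({**entry, "relevant_docs": docs})
    return completed


# The first relevant document omits "score"; the second states it explicitly.
example = [
    {
        "query_id": 0,
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0}, {"id": 3, "score": 1}],
    }
]
completed = apply_default_scores(example)
print(completed[0]["relevant_docs"])
```

The helper builds new dictionaries instead of mutating its input, so the original labelled data stays as written.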
## Collect training data
> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples.
```
training_data_batch = app.collect_training_data(
    labelled_data=labelled_data,
    id_field="id",
    query_model=query_model,
    number_additional_docs=2,
)
training_data_batch
```
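One way to picture the role of `number_additional_docs` is that each relevant document is paired with a few extra hits that are treated as non-relevant. The sketch below is a conceptual, offline illustration only. The function `build_training_rows`, the `ranked_hits` list, the row layout, and the choice to skip known relevant documents when picking extras are all assumptions; they do not describe pyvespa's internals or its actual output format.

```python
def build_training_rows(query_id, relevant_ids, ranked_hits,
                        number_additional_docs=2,
                        relevant_score=1, default_score=0):
    """Pair every relevant document with a few additional, non-relevant hits.

    Hypothetical helper: `ranked_hits` is a list of document ids in rank order.
    """
    # Candidate extras: ranked hits that are not labelled as relevant.
    extras = [doc_id for doc_id in ranked_hits if doc_id not in relevant_ids]
    rows = []
    for relevant_id in relevant_ids:
        rows.append({"query_id": query_id,
                     "document_id": relevant_id,
                     "relevant": relevant_score})
        for doc_id in extras[:number_additional_docs]:
            rows.append({"query_id": query_id,
                         "document_id": doc_id,
                         "relevant": default_score})
    return rows


# Made-up ranking for query 0, whose relevant documents are 0 and 3.
rows = build_training_rows(0, [0, 3], ranked_hits=[7, 0, 12, 3, 9])
for row in rows:
    print(row)
```

With two relevant documents and `number_additional_docs=2`, the sketch yields six rows: each relevant document followed by the same two extra hits.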
## Evaluating a query model
> Define metrics and evaluate query models. See the [evaluation page](/vespa/evaluation) for more examples.
We will define the following evaluation metrics:
* match ratio: % of documents retrieved per query
* recall @ 10 per query
* reciprocal rank @ 10 per query (its mean over all queries is MRR @ 10)
```
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank
eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]
```
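For readers who want to see what these metrics measure, the sketch below computes them by hand for a single query. It follows the textbook definitions and is not pyvespa's implementation; the function names, the `ranked` list, and the document counts are made up for illustration.

```python
def match_ratio(number_retrieved, number_documents):
    """Fraction of the corpus matched by the query."""
    return number_retrieved / number_documents if number_documents else 0.0


def recall_at(ranked_ids, relevant_ids, at=10):
    """Fraction of the relevant documents found in the top `at` hits."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:at])) / len(relevant)


def reciprocal_rank_at(ranked_ids, relevant_ids, at=10):
    """1 / position of the first relevant hit in the top `at`, else 0."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids[:at], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Made-up ranking for a query whose relevant documents are 0 and 3.
ranked = [7, 0, 12, 3, 9]
print(match_ratio(50, 200))                # 0.25
print(recall_at(ranked, [0, 3]))           # 1.0
print(recall_at(ranked, [0, 3], at=2))     # 0.5
print(reciprocal_rank_at(ranked, [0, 3]))  # 0.5
```

Averaging the per-query reciprocal ranks over all labelled queries gives the mean reciprocal rank.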
Evaluate:
```
evaluation = app.evaluate(
    labelled_data=labelled_data,
    eval_metrics=eval_metrics,
    query_model=query_model,
    id_field="id",
)
evaluation
```