# Vespa library for data analysis

> Provide data analysis support for Vespa applications

## Install

`pip install vespa`

## Connect to a Vespa app

> Connect to a running Vespa application

```
from vespa.application import Vespa

app = Vespa(url="https://api.cord19.vespa.ai")
```

## Define a Query model

> Easily define matching and ranking criteria

```
from vespa.query import Query, Union, WeakAnd, ANN, RankProfile
from random import random

match_phase = Union(
    WeakAnd(hits=10),
    ANN(
        doc_vector="title_embedding",
        query_vector="title_vector",
        embedding_model=lambda x: [random() for _ in range(768)],
        hits=10,
        label="title"
    )
)
rank_profile = RankProfile(name="bm25", list_features=True)
query_model = Query(match_phase=match_phase, rank_profile=rank_profile)
```

## Query the Vespa app

> Send queries via the query API. See the [query page](/vespa/query) for more examples.

```
query_result = app.query(
    query="Is remdesivir an effective treatment for COVID-19?",
    query_model=query_model
)
```

```
query_result["root"]["fields"]
```

    {'totalCount': 1077}

## Labelled data

> How to structure labelled data

```
labelled_data = [
    {
        "query_id": 0,
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]
    },
    {
        "query_id": 1,
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]
    }
]
```

Non-relevant documents are assigned `"score": 0` by default. Relevant documents are assigned `"score": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be changed on the appropriate methods.

## Collect training data

> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples.

```
training_data_batch = app.collect_training_data(
    labelled_data=labelled_data,
    id_field="id",
    query_model=query_model,
    number_additional_docs=2
)
training_data_batch
```
|   | attributeMatch(authors.first) | attributeMatch(authors.first).averageWeight | attributeMatch(authors.first).completeness | attributeMatch(authors.first).fieldCompleteness | attributeMatch(authors.first).importance | attributeMatch(authors.first).matches | attributeMatch(authors.first).maxWeight | attributeMatch(authors.first).normalizedWeight | attributeMatch(authors.first).normalizedWeightedWeight | attributeMatch(authors.first).queryCompleteness | ... | textSimilarity(results).queryCoverage | textSimilarity(results).score | textSimilarity(title).fieldCoverage | textSimilarity(title).order | textSimilarity(title).proximity | textSimilarity(title).queryCoverage | textSimilarity(title).score | document_id | query_id | relevant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0 | 0 | 1 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 56212 | 0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.187500 | 0.5 | 0.617188 | 0.428571 | 0.457087 | 34026 | 0 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 3 | 0 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 56212 | 0 | 0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.187500 | 0.5 | 0.617188 | 0.428571 | 0.457087 | 34026 | 0 | 0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.071429 | 0.0 | 0.000000 | 0.083333 | 0.039286 | 1 | 1 | 1 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 29774 | 1 | 0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.500000 | 1.0 | 1.000000 | 0.333333 | 0.700000 | 22787 | 1 | 0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.058824 | 0.0 | 0.000000 | 0.083333 | 0.036765 | 5 | 1 | 1 |
| 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 29774 | 1 | 0 |
| 11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.500000 | 1.0 | 1.000000 | 0.333333 | 0.700000 | 22787 | 1 | 0 |

12 rows × 984 columns

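The collected batch pairs the rank features requested by the query model with the `document_id`, `query_id`, and `relevant` columns, so it can feed straight into a model-training workflow. Below is a minimal, hypothetical sketch of fitting a simple relevance classifier on those features; it assumes `training_data_batch` is a pandas DataFrame as displayed above, and the scikit-learn estimator is an illustrative choice, not part of the library.

```
# Hypothetical follow-up: fit a simple relevance model on the collected rank features.
# Assumes training_data_batch is the pandas DataFrame shown above.
from sklearn.linear_model import LogisticRegression

feature_columns = [
    col for col in training_data_batch.columns
    if col not in ("document_id", "query_id", "relevant")
]
X = training_data_batch[feature_columns].values  # rank feature values per (query, document) pair
y = training_data_batch["relevant"].values       # 1 = labelled relevant, 0 = sampled non-relevant

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
```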
## Evaluating a query model

> Define metrics and evaluate query models. See the [evaluation page](/vespa/evaluation) for more examples.

We will define the following evaluation metrics:

* % of documents retrieved per query
* recall @ 10 per query
* MRR @ 10 per query

```
from vespa.evaluation import MatchRatio, Recall, ReciprocalRank

eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]
```

Evaluate:

```
evaluation = app.evaluate(
    labelled_data=labelled_data,
    eval_metrics=eval_metrics,
    query_model=query_model,
    id_field="id",
)
evaluation
```
|   | query_id | match_ratio_retrieved_docs | match_ratio_docs_available | match_ratio_value | recall_10_value | reciprocal_rank_10_value |
|---|---|---|---|---|---|---|
| 0 | 0 | 1267 | 62529 | 0.020263 | 0 | 0 |
| 1 | 1 | 887 | 62529 | 0.014185 | 0 | 0 |
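
Each row reports per-query values; for example, `match_ratio_value` is the fraction of available documents retrieved by the match phase (1267 / 62529 ≈ 0.020263 for query 0). To summarise an evaluation across queries, one option is to average the metric columns; the sketch below assumes `evaluation` is a pandas DataFrame with the columns shown above.

```
# Sketch: aggregate per-query metrics, assuming `evaluation` is a pandas DataFrame.
metric_columns = ["match_ratio_value", "recall_10_value", "reciprocal_rank_10_value"]
summary = evaluation[metric_columns].mean()  # mean of each metric across queries
print(summary)
```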