# Vespa library for data analysis

> Provide data analysis support for Vespa applications

## Install

`pip install vespa`

## Connect to a Vespa app

> Connect to a running Vespa application

```
from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")
```

## Define a Query model

> Easily define matching and ranking criteria

```
from vespa.query import Query, Union, WeakAnd, ANN, RankProfile
from random import random

match_phase = Union(
    WeakAnd(hits = 10),
    ANN(
        doc_vector="title_embedding",
        query_vector="title_vector",
        embedding_model=lambda x: [random() for x in range(768)],
        hits = 10,
        label="title"
    )
)

rank_profile = RankProfile(name="bm25", list_features=True)

query_model = Query(match_phase=match_phase, rank_profile=rank_profile)
```

## Query the vespa app

> Send queries via the query API. See the [query page](/vespa/query) for more examples.

```
query_result = app.query(
    query="Is remdesivir an effective treatment for COVID-19?",
    query_model=query_model
)
```

```
query_result["root"]["fields"]
```

    {'totalCount': 1077}

## Labelled data

> How to structure labelled data

```
labelled_data = [
    {
        "query_id": 0,
        "query": "Intrauterine virus infections and congenital heart disease",
        "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]
    },
    {
        "query_id": 1,
        "query": "Clinical and immunologic studies in identical twins discordant for systemic lupus erythematosus",
        "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]
    }
]
```

Non-relevant documents are assigned `"score": 0` by default. Relevant documents will be assigned `"score": 1` by default if the field is missing from the labelled data. The defaults for both relevant and non-relevant documents can be modified on the appropriate methods.

## Collect training data

> Collect training data to analyse and/or improve ranking functions. See the [collect training data page](/vespa/collect_training_data) for more examples.

```
training_data_batch = app.collect_training_data(
    labelled_data = labelled_data,
    id_field = "id",
    query_model = query_model,
    number_additional_docs = 2
)
training_data_batch
```

| | attributeMatch(authors.first) | attributeMatch(authors.first).averageWeight | attributeMatch(authors.first).completeness | attributeMatch(authors.first).fieldCompleteness | attributeMatch(authors.first).importance | attributeMatch(authors.first).matches | attributeMatch(authors.first).maxWeight | attributeMatch(authors.first).normalizedWeight | attributeMatch(authors.first).normalizedWeightedWeight | attributeMatch(authors.first).queryCompleteness | ... | textSimilarity(results).queryCoverage | textSimilarity(results).score | textSimilarity(title).fieldCoverage | textSimilarity(title).order | textSimilarity(title).proximity | textSimilarity(title).queryCoverage | textSimilarity(title).score | document_id | query_id | relevant |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0 | 0 | 1 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 56212 | 0 | 0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.187500 | 0.5 | 0.617188 | 0.428571 | 0.457087 | 34026 | 0 | 0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 3 | 0 | 1 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 56212 | 0 | 0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.187500 | 0.5 | 0.617188 | 0.428571 | 0.457087 | 34026 | 0 | 0 |
6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.071429 | 0.0 | 0.000000 | 0.083333 | 0.039286 | 1 | 1 | 1 |
7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 29774 | 1 | 0 |
8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.500000 | 1.0 | 1.000000 | 0.333333 | 0.700000 | 22787 | 1 | 0 |
9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.058824 | 0.0 | 0.000000 | 0.083333 | 0.036765 | 5 | 1 | 1 |
10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 29774 | 1 | 0 |
11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.500000 | 1.0 | 1.000000 | 0.333333 | 0.700000 | 22787 | 1 | 0 |

12 rows × 984 columns
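
The row count in the output above is consistent with each relevant document being followed by `number_additional_docs` extra documents labelled as non-relevant: 2 queries × 2 relevant documents × (1 + 2) = 12 rows. The plain-Python sketch below reproduces only the `document_id`, `query_id`, and `relevant` columns to make that layout and the default scores concrete. It is an illustration, not the library's implementation: the helper `sketch_training_rows` and the `additional_docs` mapping are hypothetical, and the additional document ids are simply copied from the table above.

```python
# Hypothetical sketch of the row layout seen in the table above.
# It does not call Vespa and is not how the library is implemented.

def sketch_training_rows(labelled_data, additional_docs,
                         relevant_default=1, non_relevant_default=0):
    """Build (document_id, query_id, relevant) rows.

    additional_docs maps a query_id to the ids of the extra documents
    collected for that query (assumed input, normally returned by the app).
    """
    rows = []
    for item in labelled_data:
        query_id = item["query_id"]
        for doc in item["relevant_docs"]:
            # A relevant document keeps its score, or gets the default if missing.
            rows.append({
                "document_id": doc["id"],
                "query_id": query_id,
                "relevant": doc.get("score", relevant_default),
            })
            # Each relevant document is followed by the additional documents,
            # which receive the non-relevant default score.
            for doc_id in additional_docs[query_id]:
                rows.append({
                    "document_id": doc_id,
                    "query_id": query_id,
                    "relevant": non_relevant_default,
                })
    return rows


labelled_data = [
    {"query_id": 0, "relevant_docs": [{"id": 0, "score": 1}, {"id": 3, "score": 1}]},
    {"query_id": 1, "relevant_docs": [{"id": 1, "score": 1}, {"id": 5, "score": 1}]},
]

# Additional document ids copied from the table above (2 per query).
additional_docs = {0: [56212, 34026], 1: [29774, 22787]}

rows = sketch_training_rows(labelled_data, additional_docs)
print(len(rows))  # 12
```

Because the same additional documents are paired with every relevant document of a query, ids such as 56212 and 34026 appear more than once in the output.
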

| | query_id | match_ratio_retrieved_docs | match_ratio_docs_available | match_ratio_value | recall_10_value | reciprocal_rank_10_value |
---|---|---|---|---|---|---|
0 | 0 | 1267 | 62529 | 0.020263 | 0 | 0 |
1 | 1 | 887 | 62529 | 0.014185 | 0 | 0 |
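
The columns in the table above can be read with the standard definitions of these metrics. The sketch below is an assumption about what the column names mean, not code taken from the library: the function names and the `ranked_ids` list are hypothetical. The match ratio arithmetic does reproduce the `match_ratio_value` column (retrieved documents divided by documents available).

```python
# Standard metric definitions, assumed to correspond to the column names above.

def match_ratio(retrieved_docs, docs_available):
    """Fraction of the available documents matched by the query."""
    return retrieved_docs / docs_available


def recall_at_k(relevant_ids, ranked_ids, k=10):
    """Share of the relevant documents found among the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank_at_k(relevant_ids, ranked_ids, k=10):
    """1 / position of the first relevant document in the top k, else 0."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Values taken from the table above.
print(round(match_ratio(1267, 62529), 6))  # 0.020263
print(round(match_ratio(887, 62529), 6))   # 0.014185

# Hypothetical ranking for query 0: none of the relevant docs (0 and 3)
# appear in the top 10, so both metrics are 0.
ranked_ids = [56212, 34026, 7, 8]
print(recall_at_k([0, 3], ranked_ids))           # 0.0
print(reciprocal_rank_at_k([0, 3], ranked_ids))  # 0.0
```

A ranking that placed document 3 second would instead give a recall of 0.5 and a reciprocal rank of 0.5 for query 0.
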