# Vespa tutorial utility scripts

This directory contains utility code for the blog-search and blog-recommendation sample applications.

## Vespa Tutorial pt. 1

### From raw JSON to Vespa Feeding format

    $ python parse.py trainPosts.json > somefile.json

Parses the JSON in the file trainPosts.json, downloaded from Kaggle during the [blog search tutorial](https://docs.vespa.ai/documentation/tutorials/blog-search.html), and formats it according to the Vespa Document JSON format.

    $ python parse.py -p trainPosts.json > somefile.json

Pass the flag `-p` (or `--popularity`) to also calculate and add the field `popularity`, as introduced [in the tutorial](https://docs.vespa.ai/documentation/tutorials/blog-search.html#blog-popularity-signal).
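At a high level, the conversion done by parse.py wraps each raw record in a Vespa `put` operation. The sketch below illustrates the idea only; the document type `blog_post` and the field names used here are assumptions for illustration, not the tutorial's exact schema:

```python
import json

def to_vespa_put(record):
    """Wrap one raw Kaggle record as a Vespa Document JSON put operation.

    The document type `blog_post` and the selected fields are illustrative
    assumptions; consult the tutorial schema for the real ones.
    """
    fields = {k: record[k] for k in ("post_id", "title", "content") if k in record}
    return {
        "put": "id:blog-search:blog_post::" + str(record["post_id"]),
        "fields": fields,
    }

# One raw record in, one Vespa feed operation out
raw = {"post_id": "1234", "title": "A post", "content": "Hello"}
print(json.dumps(to_vespa_put(raw)))
```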

## Vespa Tutorial pt. 2

### Building and running the Spark script for calculating latent factors

1. Install the latest version of [Apache Spark](http://spark.apache.org/) and [sbt](http://www.scala-sbt.org/download.html).

2. Clone this repository and build the Spark script with `sbt package` (in the root directory of this repo).

3. Use the resulting jar file when running spark jobs included in the tutorials.

## Vespa Tutorial pt. 3

Pre-computed data used throughout the tutorial will be made available shortly.

### Create Training Dataset

    $ ./src/R/generateDataset.R -d blog_job/user_item_cf_cv/product.json \
                                -u blog_job/user_item_cf_cv/user.json \
                                -t blog_job/training_and_test_indices/train.txt \
                                -o blog_job/nn_model/training_set.txt

### Train model with TensorFlow

Train the model with

    $ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
                           --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
                           --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt

Model parameters and summary statistics are saved in the folder `runs/${start_time}`, where `${start_time}` is the time you started training the model.

Visualize the accuracy and loss metrics with

    $ tensorboard --logdir runs/1473845959/summaries/

**Note**: The folder name `1473845959` depends on when you started training the model and will differ in your case.

### Export model parameters to Tensor Vespa format

`checkpoint_dir` holds the folder where TensorFlow writes the learned model parameters (stored using protobuf), and `output_dir` is the folder where we will output the model parameters in Vespa tensor format.

    import vespaModel

    checkpoint_dir = "./runs/1473845959/checkpoints"
    output_dir = "application_package/constants"

    serializer = vespaModel.serializeVespaModel(checkpoint_dir, output_dir)
    serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])
    serializer.serialize_to_disk(variable_name = "b_hidden", dimension_names = ['hidden'])
    serializer.serialize_to_disk(variable_name = "W_final", dimension_names = ['hidden', 'final'])
    serializer.serialize_to_disk(variable_name = "b_final", dimension_names = ['final'])

The Python code containing the class `serializeVespaModel` can be found in `src/python/vespaModel.py`.
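Internally, a serializer like this has to flatten each weight matrix into Vespa's literal tensor text format, one cell per entry. The helper below is a minimal sketch of that step for a 2-D variable, assuming the checkpoint values are already available as nested lists; the function name and the exact cell formatting are illustrative assumptions:

```python
def to_vespa_tensor_literal(matrix, dimension_names):
    """Render a 2-D matrix (list of rows) as a Vespa tensor literal string,
    e.g. {{input:0,hidden:0}:0.5,{input:0,hidden:1}:1.25,...}."""
    row_dim, col_dim = dimension_names
    cells = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            cells.append("{%s:%d,%s:%d}:%s" % (row_dim, i, col_dim, j, value))
    return "{" + ",".join(cells) + "}"

# Hypothetical 2x2 hidden-layer weight matrix
W_hidden = [[0.5, 1.25], [2.0, -1.0]]
print(to_vespa_tensor_literal(W_hidden, ["input", "hidden"]))
```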

### Offline evaluation

Query Vespa using the rank profile `tensor` for users in the test set and return 100 blog post recommendations. Use those recommendations together with the information contained in the test set to compute the metrics defined in Tutorial pt. 2.

    pig -x local -f tutorial_compute_metric.pig \
      -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
      -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
      -param ENDPOINT=$(hostname):8080 \
      -param NUMBER_RECOMMENDATIONS=100 \
      -param RANKING_NAME=tensor \
      -param OUTPUT=blog-job/cf-metric

Repeat the process, but now using the rank profile `nn_tensor`.

    pig -x local -f tutorial_compute_metric.pig \
      -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
      -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
      -param ENDPOINT=$(hostname):8080 \
      -param NUMBER_RECOMMENDATIONS=100 \
      -param RANKING_NAME=nn_tensor \
      -param OUTPUT=blog-job/cf-metric
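The evaluation the Pig script performs can be sketched in plain Python. The snippet below uses a recall-style metric (fraction of a user's actually-read posts that appear among the top-k recommendations) as a stand-in; the function name and the assumption that this matches the metric from Tutorial pt. 2 are illustrative, not taken from the script itself:

```python
def recall_at_k(recommended, relevant, k=100):
    """Fraction of the posts a user actually read that appear among
    the top-k recommendations returned by Vespa."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

# Hypothetical user: 2 of the 4 read posts appear in the recommendations
print(recall_at_k(["p1", "p2", "p3"], ["p1", "p3", "p9", "p7"]))  # 0.5
```

Averaging this value over all users in the test set gives a single number per rank profile, which makes the `tensor` and `nn_tensor` runs directly comparable.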