Diffstat (limited to 'sample-apps')
-rw-r--r-- | sample-apps/blog-tutorial-shared/README.md | 83 |
1 file changed, 80 insertions, 3 deletions
diff --git a/sample-apps/blog-tutorial-shared/README.md b/sample-apps/blog-tutorial-shared/README.md
index 09ac61e6b56..846156908c3 100644
--- a/sample-apps/blog-tutorial-shared/README.md
+++ b/sample-apps/blog-tutorial-shared/README.md
@@ -1,6 +1,8 @@
 # Vespa tutorial utility scripts
 
-## From raw JSON to Vespa Feeding format
+## Vespa Tutorial pt. 1
+
+### From raw JSON to Vespa Feeding format
 
     $ python parse.py trainPosts.json > somefile.json
 
@@ -10,10 +12,85 @@ Parses JSON from the file trainPosts.json downloaded from Kaggle during the [blo
 Give it the flag "-p" or "--popularity", and the script also calculates and adds the field `popularity`, as introduced [in the tutorial](https://git.corp.yahoo.com/pages/vespa/documentation/documentation/tutorials/blog-search.html#blog-popularity-signal).
 
-## Building and running the Spark script for calculating latent factors
+## Vespa Tutorial pt. 2
+
+### Building and running the Spark script for calculating latent factors
 
 1. Install the latest version of [Apache Spark](http://spark.apache.org/) and [sbt](http://www.scala-sbt.org/download.html).
 2. Clone this repository and build the Spark script with `sbt package` (in the root directory of this repo).
-3. Use the resulting jar file when running spark jobs included in the tutorials.
\ No newline at end of file
+3. Use the resulting jar file when running spark jobs included in the tutorials.
+
+## Vespa Tutorial pt. 3
+
+Pre-computed data used throughout the tutorial can be found [here](http://trdstorage.trondheim.corp.yahoo.com/~tmartins/vespa_tutorial_data/).
+
+You can download `vespa_tutorial_data.tar.gz` (144 MB) and decompress it with
+
+    $ wget http://trdstorage.trondheim.corp.yahoo.com/~tmartins/vespa_tutorial_data.tar.gz
+    $ tar -xvzf vespa_tutorial_data.tar.gz
+
+### Create Training Dataset
+
+    $ ./generateDataset.R -d vespa_tutorial_data/user_item_cf_cv/product.json \
+                          -u vespa_tutorial_data/user_item_cf_cv/user.json \
+                          -t vespa_tutorial_data/training_and_test_indices/train.txt \
+                          -o vespa_tutorial_data/nn_model/training_set.txt
+
+### Train model with TensorFlow
+
+Train the model with
+
+    $ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \
+                           --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \
+                           --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt
+
+Model parameters and summary statistics will be saved in the folder `runs/${start_time}`, with `${start_time}` being the time you started training the model.
+
+Visualize the accuracy and loss metrics with
+
+    $ tensorboard --logdir runs/1473845959/summaries/
+
+**Note**: The folder `1473845959` depends on the time you started training the model and will be different in your case.
+
+### Export model parameters to Vespa tensor format
+
+`checkpoint_dir` holds the folder where TensorFlow writes the learned model parameters (stored using protobuf), and `output_dir` is the folder where we will output the model parameters in Vespa tensor format.
+
+    from vespaModel import serializeVespaModel
+
+    checkpoint_dir = "./runs/1473845959/checkpoints"
+    output_dir = "application_package/constants"
+
+    serializer = serializeVespaModel(checkpoint_dir, output_dir)
+    serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])
+    serializer.serialize_to_disk(variable_name = "b_hidden", dimension_names = ['hidden'])
+    serializer.serialize_to_disk(variable_name = "W_final", dimension_names = ['hidden', 'final'])
+    serializer.serialize_to_disk(variable_name = "b_final", dimension_names = ['final'])
+
+The Python code containing the class `serializeVespaModel` can be found at `src/python/vespaModel.py`.
+
+### Offline evaluation
+
+Query Vespa using the rank-profile `tensor` for users in the test set and return 100 blog post recommendations. Use those recommendations together with the information contained in the test set to compute the metrics defined in Tutorial pt. 2.
+
+    pig -x local -f tutorial_compute_metric.pig \
+      -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
+      -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
+      -param ENDPOINT=$(hostname):8080 \
+      -param NUMBER_RECOMMENDATIONS=100 \
+      -param RANKING_NAME=tensor \
+      -param OUTPUT=blog-job/cf-metric
+
+Repeat the process, but now using the rank-profile `nn_tensor`.
+
+    pig -x local -f tutorial_compute_metric.pig \
+      -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
+      -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
+      -param ENDPOINT=$(hostname):8080 \
+      -param NUMBER_RECOMMENDATIONS=100 \
+      -param RANKING_NAME=nn_tensor \
+      -param OUTPUT=blog-job/cf-metric
\ No newline at end of file
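
The export step in the patch above writes each TensorFlow variable into the application package's `constants` folder as a Vespa tensor constant, which is a JSON file listing one cell per tensor element. The sketch below illustrates that target format only; it is not the repo's actual `serializeVespaModel` (which reads protobuf checkpoints), and `serialize_tensor` is a hypothetical helper name.

```python
import json

def serialize_tensor(values, dimension_names):
    """Serialize a (possibly nested) list of floats into Vespa's JSON
    tensor-constant format: {"cells": [{"address": {...}, "value": v}, ...]}.
    Each nesting level corresponds to one dimension, labels are indices."""
    cells = []

    def walk(node, address):
        if isinstance(node, list):
            for i, child in enumerate(node):
                walk(child, address + [str(i)])
        else:
            cells.append({
                "address": dict(zip(dimension_names, address)),
                "value": node,
            })

    walk(values, [])
    return json.dumps({"cells": cells})

# A 2x2 weight matrix over dimensions 'input' and 'hidden',
# mirroring the W_hidden call in the README.
doc = serialize_tensor([[0.1, 0.2], [0.3, 0.4]], ["input", "hidden"])
```

The nested-list depth must match the number of dimension names, just as `W_hidden` gets two names and `b_hidden` gets one in the patch.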