# Vespa tutorial utility scripts

## From raw JSON to Vespa Feeding format

    $ python parse.py trainPosts.json > somefile.json

Parses the JSON in `trainPosts.json`, the file downloaded from Kaggle during the [blog search tutorial](https://git.corp.yahoo.com/pages/vespa/documentation/documentation/tutorials/blog-search.html), and formats it according to the Vespa Document JSON format.
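As a rough illustration of this kind of conversion (not the actual contents of `parse.py`), the sketch below wraps each raw post in a Vespa `put` operation. It assumes the raw file holds one JSON object per line. The namespace `blog-search`, the document type `blog_post`, and the field names `post_id`, `title` and `content` are hypothetical placeholders.

```python
import json


def to_vespa_put(post, namespace="blog-search", doctype="blog_post"):
    """Wrap one raw post in a Vespa 'put' operation.

    The namespace, document type and field names are assumptions made
    for this sketch; adjust them to match your own schema.
    """
    doc_id = f"id:{namespace}:{doctype}::{post['post_id']}"
    fields = {
        "post_id": post["post_id"],
        "title": post.get("title", ""),
        "content": post.get("content", ""),
    }
    return {"put": doc_id, "fields": fields}


def convert_lines(lines):
    """Yield one put operation per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield to_vespa_put(json.loads(line))


# Small in-memory sample standing in for the raw input file.
sample = ['{"post_id": "1", "title": "Hello", "content": "First post"}']
for op in convert_lines(sample):
    print(json.dumps(op))
```

Processing the input line by line keeps memory use flat, which helps when the raw file is large.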

    $ python parse.py -p trainPosts.json > somefile.json
    
With the `-p` or `--popularity` flag, the script also calculates and adds the `popularity` field, as introduced [in the tutorial](https://git.corp.yahoo.com/pages/vespa/documentation/documentation/tutorials/blog-search.html#blog-popularity-signal).
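This README does not state how `popularity` is computed, so the following is only one plausible sketch: it assumes each post carries a `likes` list and scores a post by its like count divided by the largest like count in the data set. Both the `likes` field and the formula are assumptions for illustration; check `parse.py` and the tutorial for the definition actually used.

```python
def add_popularity(posts):
    """Return copies of the posts with a 'popularity' field in [0, 1].

    Assumed definition: like count divided by the maximum like count
    across all posts. This is illustrative, not the script's formula.
    """
    like_counts = [len(p.get("likes", [])) for p in posts]
    max_likes = max(like_counts, default=0)
    result = []
    for post, count in zip(posts, like_counts):
        enriched = dict(post)  # shallow copy so the input is not mutated
        enriched["popularity"] = count / max_likes if max_likes else 0.0
        result.append(enriched)
    return result


# Hypothetical sample data.
posts = [
    {"post_id": "1", "likes": [{"uid": "a"}, {"uid": "b"}]},
    {"post_id": "2", "likes": [{"uid": "a"}]},
    {"post_id": "3", "likes": []},
]
for p in add_popularity(posts):
    print(p["post_id"], p["popularity"])
```

A score normalized this way depends on the maximum over the whole data set, so it needs all like counts before any document can be written out.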

## Building and running the Spark script for calculating latent factors

1. Install the latest version of [Apache Spark](http://spark.apache.org/) and [sbt](http://www.scala-sbt.org/download.html).

2. Clone this repository and build the Spark script by running `sbt package` in the repository's root directory.

3. Use the resulting jar file when running the Spark jobs included in the tutorials.