Diffstat (limited to 'sample-apps/blog-tutorial-shared/README.md')
-rw-r--r-- sample-apps/blog-tutorial-shared/README.md 19
1 files changed, 19 insertions, 0 deletions
diff --git a/sample-apps/blog-tutorial-shared/README.md b/sample-apps/blog-tutorial-shared/README.md
new file mode 100644
index 00000000000..09ac61e6b56
--- /dev/null
+++ b/sample-apps/blog-tutorial-shared/README.md
@@ -0,0 +1,19 @@
+# Vespa tutorial utility scripts
+
+## From raw JSON to Vespa Feeding format
+
+    $ python parse.py trainPosts.json > somefile.json
+
+Parses the JSON in `trainPosts.json`, the file downloaded from Kaggle during the [blog search tutorial](https://git.corp.yahoo.com/pages/vespa/documentation/documentation/tutorials/blog-search.html), and formats it according to the Vespa Document JSON format.
+
+    $ python parse.py -p trainPosts.json > somefile.json
+
+With the `-p` or `--popularity` flag, the script also calculates and adds the `popularity` field, as introduced [in the tutorial](https://git.corp.yahoo.com/pages/vespa/documentation/documentation/tutorials/blog-search.html#blog-popularity-signal).
+
+## Building and running the Spark script for calculating latent factors
+
+1. Install the latest version of [Apache Spark](http://spark.apache.org/) and [sbt](http://www.scala-sbt.org/download.html).
+
+2. Clone this repository and build the Spark script with `sbt package` (in the root directory of this repo).
+
+3. Use the resulting jar file when running the Spark jobs included in the tutorials.
\ No newline at end of file
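
For readers unfamiliar with the Vespa Document JSON format that the README refers to, the following is a minimal sketch of the kind of conversion involved. It is not the actual `parse.py`. It assumes the input holds one JSON object per line with a `post_id` key; the namespace `blog-search`, the document type `blog_post`, and the helper names are hypothetical and chosen only for illustration.

```python
import json


def to_vespa_put(post, namespace="blog-search", doc_type="blog_post"):
    """Wrap one raw post dict in a Vespa 'put' operation.

    The namespace and doc_type defaults are illustrative assumptions,
    as is the presence of a 'post_id' key in the raw record.
    """
    doc_id = "id:{}:{}::{}".format(namespace, doc_type, post["post_id"])
    return {"put": doc_id, "fields": post}


def convert_lines(lines):
    """Yield one Vespa put operation per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield to_vespa_put(json.loads(line))


if __name__ == "__main__":
    # Hypothetical sample record; real records carry more fields.
    sample = ['{"post_id": "1", "title": "Hello"}']
    for op in convert_lines(sample):
        print(json.dumps(op))
```

Each emitted object pairs a document ID of the form `id:<namespace>:<document type>::<user-specified ID>` with the document's fields, which is the shape Vespa expects for a put operation.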