Notes on Elastic

We want to encode the ASR metadata for each word (start, end, conf) into the elastic index with the delimited_payload filter.

Create an index with a delimited payload filter:

{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "analyzer": "whitespace_plus_delimited"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "|",
          "encoding": "float"
        }
      }
    }
  }
}
'

top level "transcript" field,

token|mediaNum,start,end,conf