Tagging behaviour

To ensure that Starmind's tagging endpoints give optimal results for different input texts, it is possible to specify a tagging behaviour that is optimised for a specific type of text.

Text-processing types

The type of behaviour can be passed as a text_processing parameter to the tags/extract, search/tags/suggestions and learning/entities-with-action endpoints. The parameter can take one of several pre-defined values, i.e. "text_processing": <value> with <value> replaced by one of the following:

- "default"
- "chat_message"
- "cv"
- "job_description"
- "calendar"
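As a minimal sketch of how a client might assemble a request body with this parameter, the Python snippet below validates the value against the list above before serialising it. Only the text_processing key and its allowed values come from this page; the "text" field name and the build_tagging_payload helper are assumptions made for illustration, not part of the documented API.

```python
import json

# Allowed values, as listed in this section.
TEXT_PROCESSING_VALUES = ("default", "chat_message", "cv", "job_description", "calendar")


def build_tagging_payload(text, text_processing="default"):
    """Return a JSON request body as a string.

    The "text" field name is hypothetical; check the endpoint reference
    for the actual request schema.
    """
    if text_processing not in TEXT_PROCESSING_VALUES:
        raise ValueError(f"unsupported text_processing value: {text_processing!r}")
    return json.dumps({"text": text, "text_processing": text_processing})


payload = build_tagging_payload("Anyone know how to tune Kafka consumers?", "chat_message")
print(payload)
```

Rejecting unknown values on the client side is a design choice of this sketch; the endpoints themselves may handle unrecognised values differently.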

Default

Pass the parameter value "text_processing": "default" to the endpoint. This is the fallback behaviour if no value is passed.

Chat message

Pass the parameter value "text_processing": "chat_message" to the endpoint. This tag behaviour should be used for any chat-type content, e.g. Slack messages. Such content tends to be informal, and requires stronger filters and penalties to remove irrelevant content and extract meaningful expertise.

CV

Pass the parameter value "text_processing": "cv" to the endpoint. This tag behaviour should be used for CV content, e.g. descriptions of past jobs, education experience. The behaviour was developed with longer-form input in mind, where relevant expertise can be found throughout the whole text.

Job description

Pass the parameter value "text_processing": "job_description" to the endpoint. This tag behaviour should be used for job description content, e.g. relevant work and education experience, necessary skills. The behaviour was developed with longer-form input in mind, where relevant expertise can be found throughout the whole text.

Calendar

Pass the parameter value "text_processing": "calendar" to the endpoint. This tag behaviour should be used for calendar content, covering both event titles or summaries and longer description texts. The behaviour is similar to the default behaviour, but with specific tags in the excludedLabels list.

Behaviour structure

Each behaviour is defined by the values described in the following table.

| Value | Type | Description |
| --- | --- | --- |
| commonWordThreshold | Double | Must be in range [0, 1]. A value of 0 means that the most common word has a score of 0; a value of 1 means that there is no common word penalty. The lower the threshold, the higher the common word penalty factor. |
| positionFactor | Int | Must be strictly greater than 1. The lower the factor, the more importance is given to tags extracted earlier from the text. |
| excludedLabels | Seq[String] | A list of labels that are not to be suggested. |
| minTextLength | Option[Int] | If given, no tags are extracted from texts whose length in words is less than minTextLength. |
| limit | Int | Defines the maximum number of tags allowed for tag suggestion and extraction. |
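The constraints in the table can be illustrated with a small Python model. This is a sketch under stated assumptions, not Starmind's implementation: the class mirrors the field names from the table in snake_case, the example values are invented, and counting words by splitting on whitespace is an assumption about how text length is measured.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class TagBehaviour:
    """Illustrative model of the behaviour values described in the table."""

    common_word_threshold: float
    position_factor: int
    limit: int
    excluded_labels: List[str] = field(default_factory=list)
    min_text_length: Optional[int] = None

    def __post_init__(self):
        # Range constraints as stated in the table.
        if not 0.0 <= self.common_word_threshold <= 1.0:
            raise ValueError("commonWordThreshold must be in range [0, 1]")
        if self.position_factor <= 1:
            raise ValueError("positionFactor must be strictly greater than 1")

    def should_extract(self, text):
        """Return False for texts with fewer words than min_text_length.

        Word counting by whitespace split is an assumption of this sketch.
        """
        if self.min_text_length is None:
            return True
        return len(text.split()) >= self.min_text_length


# Hypothetical values, chosen only to exercise the checks.
example = TagBehaviour(common_word_threshold=0.5, position_factor=10, limit=5, min_text_length=3)
print(example.should_extract("too short"))
print(example.should_extract("this text is long enough"))
```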

For the endpoints mentioned above, some parameters that are defined in the TagBehaviour can also be passed directly to the endpoint. For example, position_factor can be passed to all three endpoints; if a positionFactor is passed directly, it overrides the value from the TagBehaviour. Another example is the list of excluded labels, which can be passed to the search/tags/suggestions and learning/entities-with-action endpoints. If given, the list is merged with the one from the TagBehaviour.
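The override and merge rules can be sketched in Python as follows. The resolve_behaviour helper, the dictionary representation and the example values are all hypothetical. Dropping duplicate labels while preserving order is also an assumption, since this page only states that the two lists are merged.

```python
def resolve_behaviour(behaviour, position_factor=None, excluded_labels=None):
    """Combine a behaviour with parameters passed directly to an endpoint.

    - position_factor, if given, overrides the behaviour's positionFactor.
    - excluded_labels, if given, are merged with the behaviour's excludedLabels
      (de-duplicated here, which is an assumption of this sketch).
    """
    effective = dict(behaviour)
    merged = list(behaviour.get("excludedLabels", []))
    if position_factor is not None:
        effective["positionFactor"] = position_factor
    for label in excluded_labels or []:
        if label not in merged:
            merged.append(label)
    effective["excludedLabels"] = merged
    return effective


# Hypothetical behaviour values, for illustration only.
calendar_behaviour = {"positionFactor": 10, "excludedLabels": ["meeting", "sync"]}
resolved = resolve_behaviour(calendar_behaviour, position_factor=3, excluded_labels=["sync", "lunch"])
print(resolved)
```

The helper returns a new dictionary and leaves the original behaviour untouched, so one pre-defined behaviour can be reused across requests with different overrides.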