Tagging behaviour
To ensure that Starmind's tagging endpoints give the best results for different kinds of input text, you can specify a tagging behaviour that is optimised for a specific type of text.
Text-processing types
The type of behaviour can be passed as a text_processing parameter to the tags/extract, search/tags/suggestions and learning/entities-with-action endpoints. The parameter takes one of several pre-defined values, i.e. "text_processing": <value>, with <value> replaced by one of the following:
- "default"
- "chat_message"
- "cv"
- "job_description"
- "calendar"
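As a minimal sketch of how a client might attach the parameter, the Python helper below checks the value against the list above before building a JSON request body. Only text_processing and its values come from this page; the helper name build_request_body and the "text" field are assumptions made for illustration, so check the endpoint reference for the actual request schema.

```python
import json

# Values listed on this page for the text_processing parameter.
TEXT_PROCESSING_TYPES = ("default", "chat_message", "cv", "job_description", "calendar")


def build_request_body(text, text_processing="default"):
    """Return a JSON string carrying the text and the chosen behaviour.

    The "text" key is a hypothetical field name, not taken from the
    Starmind API reference.
    """
    if text_processing not in TEXT_PROCESSING_TYPES:
        raise ValueError(f"unknown text_processing value: {text_processing!r}")
    return json.dumps({"text": text, "text_processing": text_processing})


# Example: a Slack-style message tagged with the chat_message behaviour.
body = build_request_body("Can someone help with our Kubernetes rollout?", "chat_message")
print(body)
```

Validating on the client side is a design choice of this sketch; it simply surfaces a misspelled value before a request is sent.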
Default
Pass the parameter value "text_processing": "default" to the endpoint. This is the fallback behaviour if no value is passed.
Chat message
Pass the parameter value "text_processing": "chat_message" to the endpoint. This tag behaviour should be used for any chat-type content, e.g. Slack messages. Such content tends towards an informal tone and requires stronger filters and penalties to remove irrelevant content and extract meaningful expertise.
CV
Pass the parameter value "text_processing": "cv" to the endpoint. This tag behaviour should be used for CV content, e.g. descriptions of past jobs or education experience. The behaviour was developed with longer-form input in mind, where relevant expertise can be found throughout the whole text.
Job description
Pass the parameter value "text_processing": "job_description" to the endpoint. This tag behaviour should be used for job description content, e.g. relevant work and education experience or necessary skills. The behaviour was developed with longer-form input in mind, where relevant expertise can be found throughout the whole text.
Calendar
Pass the parameter value "text_processing": "calendar" to the endpoint. This tag behaviour should be used for calendar content, i.e. both calendar event titles or summaries and longer description texts. The behaviour is similar to the default behaviour, with specific tags added to the excludedLabels list.
Behaviour structure
Each behaviour is defined by the values described in the following table.
Value | Type | Description |
---|---|---|
commonWordThreshold | Double | Must be in the range [0, 1]. A value of 0 means that the most common word gets a score of 0; a value of 1 means that no common-word penalty is applied. The lower the threshold, the higher the common-word penalty factor. |
positionFactor | Int | Must be strictly greater than 1. The lower the factor, the more importance is given to tags extracted earlier from the text. |
excludedLabels | Seq[String] | A list of labels that are not to be suggested. |
minTextLength | Option[Int] | If given, no tags are extracted from texts with a word length below minTextLength. |
limit | Int | Defines the maximum number of tags allowed for tag suggestion and extraction. |
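The constraints in the table can be sketched as a small Python data class. This is an illustrative model only, not Starmind's implementation: the snake_case field names, the accepts helper, counting words by splitting on whitespace, and the example values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class TagBehaviour:
    common_word_threshold: float  # Double, must be in [0, 1]
    position_factor: int          # Int, must be strictly greater than 1
    limit: int                    # Int, maximum number of tags
    excluded_labels: List[str] = field(default_factory=list)  # Seq[String]
    min_text_length: Optional[int] = None                     # Option[Int]

    def __post_init__(self):
        # Enforce the two range constraints stated in the table.
        if not 0.0 <= self.common_word_threshold <= 1.0:
            raise ValueError("commonWordThreshold must be in [0, 1]")
        if self.position_factor <= 1:
            raise ValueError("positionFactor must be strictly greater than 1")

    def accepts(self, text: str) -> bool:
        """Return False when the text has fewer words than min_text_length.

        Word counting by whitespace split is an assumption of this sketch.
        """
        if self.min_text_length is None:
            return True
        return len(text.split()) >= self.min_text_length


# Illustrative values only; the real presets are not documented on this page.
example = TagBehaviour(common_word_threshold=0.5, position_factor=10,
                       limit=5, min_text_length=3)
print(example.accepts("too short"))          # two words, below the minimum
print(example.accepts("long enough text"))   # three words, meets the minimum
```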
For the endpoints mentioned above, some parameters defined in the TagBehaviour can also be passed directly to the endpoint. For example, position_factor can be passed to all three endpoints; if it is passed directly, it overrides the positionFactor value from the TagBehaviour. Another example is the list of excluded labels, which can be passed to the tag suggestions and entities-with-action endpoints; if given, the list is merged with the one from the TagBehaviour.
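The override and merge rules can be sketched in Python as follows. The resolve function, the reduced TagBehaviour class and the example labels are hypothetical, and treating "merged" as an order-preserving union without duplicates is an assumption; the page only states that the two lists are merged.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class TagBehaviour:
    # Only the two fields relevant to overriding and merging are modelled.
    position_factor: int
    excluded_labels: Tuple[str, ...] = ()


def resolve(behaviour: TagBehaviour,
            position_factor: Optional[int] = None,
            excluded_labels: Optional[Sequence[str]] = None) -> TagBehaviour:
    """Combine a preset behaviour with parameters passed directly."""
    # Excluded labels are merged: preset labels first, then any new ones.
    merged = list(behaviour.excluded_labels)
    for label in excluded_labels or ():
        if label not in merged:
            merged.append(label)
    # A directly passed position factor overrides the preset value.
    factor = behaviour.position_factor if position_factor is None else position_factor
    return replace(behaviour, position_factor=factor, excluded_labels=tuple(merged))


# Made-up preset and request values, for illustration only.
preset = TagBehaviour(position_factor=10, excluded_labels=("meeting", "call"))
effective = resolve(preset, position_factor=3, excluded_labels=["call", "lunch"])
print(effective)
```

With these inputs the effective behaviour uses position factor 3 and excludes "meeting", "call" and "lunch", while the preset itself is left unchanged.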