Search Syntax

Use the ZyLAB ONE search language to search for one or more keywords within a data set. You can search not only the keywords in text, image or audio files, but also the keywords in the information about these files (the metadata).

Search Language Techniques

View the ZyLAB search language techniques in the table below.

Operators

AND
OR
NOT
TO
IN fieldname{query}
BOD, EOS, EOP, EOG, EOD
Within                W/n
                      W/n/term
                      /n,m/
Precedes              P/n
                      P/n/term
Field Filter          fieldname=query
Quorum                n of {term, term, ..}
Exclude List of Terms from Fuzzy/Wild Card Query
                      fuzzy/wild card query - {exclude_term_1, ..., exclude_term_n}

Term Operators

Fuzzy                 ~n
Wild Cards            ?
                      *
                      [character(s)]
                      [character-range]
                      [^]
                      +
                      {m,n}
                      {m}
                      {m,}
Number Range          <
                      <=
                      =
                      <>
                      >
                      >=


Note that some operators are written in capital letters only for clarity. The search engine does not differentiate between capital and lowercase letters.

There should always be a space between an operator and a keyword. Otherwise, the operator and the keyword are read as one term. These are correct: "NOT term", "not term". These are not correct: "NOTterm", "notterm".

However, when using parentheses to surround the keyword, no spaces are necessary: "NOT(term)". For more information, see the definition of Parentheses.
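The spacing and capitalization rules above can be illustrated with a minimal Python sketch. It is not the ZyLAB parser: the function name classify and the small operator set are assumptions made for illustration only. It shows why "NOTterm" ends up as a single term, while "NOT term", "not term" and "NOT(term)" are all read as the operator NOT followed by a keyword.

```python
import re

# Hypothetical subset of operators, for illustration only.
OPERATORS = {"and", "or", "not", "to"}

def classify(query):
    """Split a query on whitespace and parentheses and label each part."""
    # Parentheses are delimiters of their own; everything else is split on spaces.
    parts = re.findall(r"[()]|[^\s()]+", query)
    labeled = []
    for part in parts:
        if part in ("(", ")"):
            labeled.append(("paren", part))
        elif part.lower() in OPERATORS:
            # Case-insensitive match: "not" and "NOT" are the same operator.
            labeled.append(("operator", part.upper()))
        else:
            labeled.append(("term", part))
    return labeled

print(classify("NOT term"))
print(classify("NOTterm"))
print(classify("NOT(term)"))
```

Because the parenthesis acts as a delimiter in this sketch, "NOT(term)" separates into an operator and a term without needing a space.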

Search Results Explained

Once a search query has been executed, a result list appears. Retrieved terms (occurrences) are highlighted in the files. To be found, a term must be present in the file. However, whether a term is retrieved also depends on the settings in the character map, the indexing structure and the tokenizer.

The tokenizer processes all files based on the character map. This section explains how.


The building blocks of a text file are characters, (hyphenated) terms and phrases.

Characters are letters, numbers or symbols such as %, @, &, ^ and *.

Terms are characters or words; they are unique entries in the dictionary with a separator on either side.

Phrases are two or more terms with no intervening operator.

Hyphenated terms (such as sugar-free) are two or more separate terms, connected with a hyphen. Each part of a hyphenated term has the same token id, given by the tokenizer.

When searching for "sugar-free", you will only find instances of "sugar-free". When searching for "sugar free", you will get more results, including "sugar", "free", "sugarfree" and "sugar-free".

Token       I     like     sugar-     free     food     EOS     EOD
Token id    1     2        3          3        4        x       x

A token id is the position of a token, expressed as a natural number and assigned by the tokenizer. Token ids are used to determine the distance between terms. Separators do not have token ids.

If a term or combination of terms you are searching for contains a hyphen, that term will be found even if you did not include a hyphen in your search query. For example, when you search for 'email' or 'e mail', the search also finds 'e-mail'. However, 'e-mail' only retrieves 'e-mail'. In addition, 'e mail' does not find 'email', and 'email' does not find 'e mail'.

The tokenizer extracts text from a file and produces tokens, based on the settings defined in the character map. A token can be anything between two separators. Tokens are the small identified parts that make up a file.

Tokens are not terms. For example, the parts of a hyphenated term all have the same token id, but are separate terms. A separator (for example, EOD) can be a token, but not a term.
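The token and token id example above can be reproduced with a minimal Python sketch. This is a toy tokenizer, not the ZyLAB implementation: the function names tokenize and distance, the use of whitespace as the only separator, the sentence-ending characters, and the idea that distance is the difference between two token ids are all assumptions made for illustration. Markers without a token id (shown as x in the table) are represented by None.

```python
SENTENCE_END = {".", "!", "?"}  # assumed sentence-ending characters

def tokenize(text):
    """Return (token, token_id) pairs; EOS and EOD markers get no id (None)."""
    tokens = []
    next_id = 1
    for word in text.split():
        ends_sentence = word[-1] in SENTENCE_END
        # Drop sentence punctuation, then split a hyphenated term into its parts.
        parts = [p for p in word.strip(".!?").split("-") if p]
        for i, part in enumerate(parts):
            # Every part of a hyphenated term shares one token id.
            label = part + "-" if i < len(parts) - 1 else part
            tokens.append((label, next_id))
        if parts:
            next_id += 1
        if ends_sentence:
            tokens.append(("EOS", None))
    tokens.append(("EOD", None))
    return tokens

def distance(tokens, first, second):
    """Assumed distance measure: difference between the token ids of two terms."""
    # Iterating in reverse lets the first occurrence of each term win.
    ids = {label.rstrip("-"): tid for label, tid in reversed(tokens) if tid is not None}
    return abs(ids[second] - ids[first])

tokens = tokenize("I like sugar-free food.")
for label, token_id in tokens:
    print(label, token_id)
```

In this sketch "sugar-" and "free" both receive token id 3, so the distance between them is 0, while "like" and "food" are 2 positions apart.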

The character map determines which characters are used to separate terms, which characters are indexed, which ones are used for punctuation, etc. All characters that can be recognized and searched on are listed in the character map. By default, some characters are not indexed and will not be found unless the default character map is adjusted. How characters are defined in the character map influences the outcome of a search. For example, when brackets are set to be separators, the text 'most definite(ly)' is identified as 3 terms: 'most', 'definite' and 'ly'.
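The effect of the separator settings can be sketched in a few lines of Python. This is only an illustration of the principle: the function split_terms and the way separators are passed as a string are assumptions, not the actual character map format.

```python
import re

def split_terms(text, separators):
    """Split text into terms on any run of the given separator characters."""
    pattern = "[" + re.escape(separators) + "]+"
    # Discard the empty strings that appear when the text ends with a separator.
    return [term for term in re.split(pattern, text) if term]

# Brackets defined as separators: three terms.
print(split_terms("most definite(ly)", " ()"))

# Only the space defined as a separator: two terms.
print(split_terms("most definite(ly)", " "))
```

With brackets as separators, a search for 'definite' can match this text; with the space as the only separator, the indexed term would be 'definite(ly)' instead.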

For more information on the character map and how to configure it, please contact support (http://help.zylab.com).

In addition to the characters that the character map defines as separators, the tokenizer creates separators to mark the beginning of a document (BOD), the end of a sentence (EOS), the end of a paragraph (EOP), the end of a page (EOG) and the end of a document (EOD). You can search for the operators BOD, EOS, EOP, EOG and EOD.

Tip: When searching for EOD, the query returns all files with nothing highlighted. Since each file has an EOD token, it is an easy query to find all files in a data set.