Appendix B: Definitions

Character

A character can be a letter, a number or a symbol like %, @, &, ^, *, etc.

Please be aware that by default many symbols are not indexed and will not be found unless the character set is adjusted.

See also: Reserved Characters

Character Map (or Set)

The character map (or set) determines which characters separate terms, which characters are indexed, which ones are treated as punctuation, etc. All characters that can be recognized and searched on are listed in the character map. By default some characters are not indexed and will not be found unless the default character map is adjusted. How characters are defined in the character map influences the outcome of a search. For example, when brackets are set to be separators, the following text will be identified as 3 terms: 'most definite(ly)'.
For more information on the character map and how to configure it, please contact support (http://help.zylab.com).
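
The effect of separators on term splitting can be illustrated with a minimal sketch. The function split_terms and its separators parameter are hypothetical names chosen for this illustration; they are not part of the product, and the real tokenizer is configured through the character map rather than through code like this.

```python
def split_terms(text, separators):
    """Split text into terms.

    Whitespace and every character listed in `separators` act as a boundary
    and are dropped. This is an illustrative model, not the real tokenizer.
    """
    terms = []
    current = []
    for ch in text:
        if ch.isspace() or ch in separators:
            if current:                      # close the term in progress
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # flush the last term
        terms.append("".join(current))
    return terms


# Brackets configured as separators: three terms.
print(split_terms("most definite(ly)", separators="()"))  # ['most', 'definite', 'ly']
# Brackets not configured as separators: two terms.
print(split_terms("most definite(ly)", separators=""))    # ['most', 'definite(ly)']
```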

Element

An element is a character or a range of characters.

Fuzzy

Fuzzy searches are used to find all variations of a word, including the ones that were not recognized correctly during the conversion from paper files to digital files. For more information, see Fuzzy Searches.

Hyphenated terms

Hyphenated terms are not uncommon (for example, sugar-free). Each part of a hyphenated term gets the same token id. However, this does not mean that the hyphenated term is seen as one term. This matters because a sequence query, which may well contain the hyphenated term you are looking for, is processed from back to front.

The following example queries use the NOT operator to show how this affects your search.

Consider the text "fresh apple-banana pie", where "fresh" has position 1, "apple" and "banana" both have position 2 (since they are combined with a hyphen), and "pie" has position 3. Because processing starts at the back, the search results can be very different depending on your query.

Example of query: NOT(apple) banana pie
Results: fresh apple-banana pie
Results explained: First the term "pie" is matched at position 3, then "banana" is matched at position 2, and next NOT(apple) is matched with "fresh" at position 1. In this example, "apple" is skipped.

Example of query: apple NOT(banana) pie
Results: no results
Results explained: First the term "pie" is matched at position 3, then at position 2 we first find "banana", which does not match the query NOT(banana). Therefore, no results are returned.
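
The back-to-front processing in the two example queries can be modelled with a short sketch. It rests on one loud assumption that is not confirmed by this Guide: at a position shared by the parts of a hyphenated term, only the part met first when reading backwards is tested. The names TOKENS, first_seen_backwards and match_sequence are hypothetical and exist only for this illustration.

```python
# Tokens of "fresh apple-banana pie" with the positions given in the text.
# Both parts of the hyphenated term share position 2.
TOKENS = [("fresh", 1), ("apple", 2), ("banana", 2), ("pie", 3)]


def first_seen_backwards(tokens):
    """Map each position to the term met first when reading back to front.

    ASSUMPTION: only this term is tested at a shared position.
    """
    seen = {}
    for term, position in reversed(tokens):
        seen.setdefault(position, term)
    return seen


def match_sequence(query, tokens):
    """Return the start positions where `query` matches, or [] for no results.

    `query` is a list of (negated, term) pairs, for example
    (True, "apple") stands for NOT(apple).
    """
    seen = first_seen_backwards(tokens)
    hits = []
    for end in sorted(seen):
        position = end
        matched = True
        for negated, term in reversed(query):    # query is processed back to front
            found = seen.get(position)
            is_same = found == term
            # A plain term must equal the token; a negated term must differ.
            if found is None or is_same == negated:
                matched = False
                break
            position -= 1
        if matched:
            hits.append(position + 1)
    return hits


# NOT(apple) banana pie: one match, starting at position 1
print(match_sequence([(True, "apple"), (False, "banana"), (False, "pie")], TOKENS))
# apple NOT(banana) pie: no results
print(match_sequence([(False, "apple"), (True, "banana"), (False, "pie")], TOKENS))
```

The sketch reproduces the two outcomes listed above, but it should not be read as a description of how the search engine is implemented.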

Note 1: Since you may not know upfront whether a combination of terms is hyphenated, it is advised to try different combinations of search queries when you suspect a hyphenated term might be part of the results.

Note 2: If a term or combination of terms you are searching for contains a hyphen, that term will often be found even if you did not include a hyphen in your search query. For example, a search for "email" or "e mail" will also find "e-mail". However, "e mail" will not find "email", and "email" will not find "e mail".

Keyword

A keyword is a term used in a search query.

Occurrence

An occurrence is a single instance of a given term in the collection; the number of occurrences is the number of times the term occurs. An occurrence is defined by a combination of Document id, Field id and Token id. Occurrences will be highlighted in the files.

Operators

Operators connect terms in a search query, making the search query more effective. Operators can be used to broaden or narrow your search. They can also be used to define your search more precisely.

For clarity, operators are written in capital letters in this Guide.

Parentheses

Group words or phrases with round brackets when combining operators in your search query to show the order in which connections should be interpreted. For example, "(cow OR goat) AND (farm OR dairy)". The queries placed between brackets will be processed first.

Brackets aren't always required; they are mostly used for your own clarity. However, please be aware that using brackets can influence the outcome of a search. For example, searching for "cars OR NOT used cars" will return different results than searching for "cars OR NOT (used cars)". The first query will return "cars" and all words in front of "cars", except "used". The second query will only find the word "cars".

When using brackets, you do not need to leave a space between the operator/query and the first bracket. Both of the following forms are accepted:

 

NOT(query)

NOT (query)        

Period

A period (".") is treated like a separator when defined as such in the tokenizer/character map, except when:

  • the period is preceded and followed by a number ("0.1" is one term)
  • the period is preceded by a space and followed by a number (" .1" is one term)
  • the period is preceded and followed by a single alphabetic character; this pattern can be repeated ("A.B.C" is one term)
    If the last character is followed by a period, that final period will not be recognized as part of the term, because a period followed by a space is recognized as a separator.
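
The three exceptions can be written down as a small check. This is a sketch of the rules as worded above, not the tokenizer's actual code; the function name period_is_part_of_term is hypothetical, and the reading of "one alphabetic character" as "exactly one letter on each side" is an assumption.

```python
def period_is_part_of_term(text, i):
    """Return True if the period at index i belongs to a term.

    Assumes the period has been configured as a separator, so that it is
    only kept inside a term in the exceptional cases.
    """
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    prev = text[i - 1] if i > 0 else ""
    nxt = text[i + 1] if i + 1 < len(text) else ""

    # Rules 1 and 2: a number follows, and a number or a space precedes.
    if nxt.isdigit() and (prev.isdigit() or prev == " "):
        return True

    # Rule 3: single letters on both sides, as in "A.B.C".
    if prev.isalpha() and nxt.isalpha():
        before = text[i - 2] if i > 1 else ""
        after = text[i + 2] if i + 2 < len(text) else ""
        return not before.isalpha() and not after.isalpha()

    # Anything else, including a period followed by a space, is a separator.
    return False


print(period_is_part_of_term("0.1", 1))          # True
print(period_is_part_of_term(" .1", 1))          # True
print(period_is_part_of_term("A.B.C", 3))        # True
print(period_is_part_of_term("A.B.C. next", 5))  # False
```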

Phrase

A phrase is identified as two or more words.     

Precedence

When no brackets are used to define the order of precedence (see Parentheses), the following search order is applied:

1. NOT
2. OR        
3. W/n, P/n (these operators are of equal precedence)        
4. AND        
5. TO        
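
To make the order above concrete, the following sketch adds explicit brackets to an unbracketed query. It assumes that "applied first" means "binds most tightly", that operators of equal precedence group from left to right, and that every pair of terms is joined by an explicit operator; none of these assumptions is confirmed by this Guide. The names BINARY_LEVEL, operator_key and add_brackets are hypothetical, and the sketch does not accept brackets in its input.

```python
import re

# Higher level = applied earlier, following the numbered list above.
# NOT is handled separately because it is a prefix operator.
BINARY_LEVEL = {"OR": 4, "W": 3, "P": 3, "AND": 2, "TO": 1}


def operator_key(token):
    """Return the BINARY_LEVEL key for an operator token, or None for a term."""
    if re.fullmatch(r"[WP]/\d+", token):
        return token[0]
    return token if token in BINARY_LEVEL else None


def add_brackets(query):
    """Return `query` with brackets that make the assumed precedence explicit."""
    tokens = query.split()
    pos = 0

    def parse_operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("query ends where a term was expected")
        token = tokens[pos]
        pos += 1
        if token == "NOT":                       # NOT is applied first
            return f"NOT({parse_operand()})"
        if operator_key(token) is not None:
            raise ValueError(f"unexpected operator {token!r}")
        return token

    def parse_expression(min_level):
        nonlocal pos
        left = parse_operand()
        while pos < len(tokens):
            key = operator_key(tokens[pos])
            if key is None:
                raise ValueError(f"expected an operator before {tokens[pos]!r}")
            level = BINARY_LEVEL[key]
            if level < min_level:
                break
            operator = tokens[pos]
            pos += 1
            right = parse_expression(level + 1)  # equal levels group left to right
            left = f"({left} {operator} {right})"
        return left

    return parse_expression(1)


print(add_brackets("a AND b OR c"))            # (a AND (b OR c))
print(add_brackets("NOT a OR b TO c W/5 d"))   # ((NOT(a) OR b) TO (c W/5 d))
```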

Quotes

Quotes are used to search for separators and for text that would otherwise be read as an operator or reserved character. Examples:

"and"
"http://localhost/?id=10"        

Regular Expression (or Search Expression)

A regular expression (abbreviated as regex) is a subset of a Search Expression. A regular expression is a sequence of characters that forms a (codified) search pattern. This pattern is used to find matching text.

Reserved Characters

All printable ASCII characters are directly searchable, except for those designated as reserved for a special purpose within the Search Engine. The following characters are reserved:        

?     single character wildcard
*     multiple character wildcard
+     used to match the preceding element one or more times
,     used as a decimal separator in different operators
.     treated like a separator when defined as such in the tokenizer/character map, with the exceptions listed under Period
:     numeric range operator
()    used to nest sub-expressions in a search expression
[]    used for character class specification
{}    used in search statements with field definitions and quorum searches, and used in regular expressions
<>    used for numeric and file date comparisons
=     used for numeric comparisons, and for searching by file name and file date
/     used in proximity range operator searches
-     used in defining a range in a character class, and in proximity range searches to indicate negative values

     

You can use ? to search on reserved characters, but only if the reserved character is not configured as a separator. For example, the query "t??st" will retrieve "t++st".
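
The behaviour of the single character wildcard can be imitated with Python's standard re module. This is only a sketch: the helper names wildcard_to_regex and matches are hypothetical, only the ? wildcard is modelled, and the check that a character is not configured as a separator is left out.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern containing ? wildcards into a compiled regex.

    Each ? stands for exactly one character; every other character is
    matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")            # any single character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))


def matches(pattern, term):
    """Return True if the whole term fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


print(matches("t??st", "t++st"))  # True
print(matches("t??st", "toast"))  # True
print(matches("t??st", "test"))   # False
```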

    

Search Query

A search query consists of one or more terms or keywords. Terms/keywords can be enhanced with Term Operators (Fuzzy/Wild Cards) and connected with Boolean or Proximity Operators. When using Boolean or Proximity Operators in your search query, group terms or phrases with round brackets to show the search order in which connections should be interpreted. For more information, see Parentheses.    

Separator

A separator is used by the tokenizer to mark the beginning of a document (BOD), the end of a sentence (EOS), end of a paragraph (EOP), end of a line (EOL), end of a page (EOG) or the end of a document (EOD).

The ZyLAB tokenizer can be configured to have some characters behave as separators, for example ".", "(" and ")". These separators act as boundaries between tokens. Once the tokenizer has recognized a character as a separator, it stops processing the current token, removes the separator and continues with the next token.

  • You cannot search for a character if it has been configured as a separator.
  • You can search for operators like EOS and EOD.
  • Separators do not have token ids.

The following separators are supported:

BOD   supported
EOS   supported, but disabled by default
EOP   supported
EOL   supported
EOG   supported
EOD   supported

The search engine looks at the context in which EOD is used. If it is part of a proximity, sequence or TO query, the content of documents will be searched. Otherwise, the EOD query will enumerate all documents.

      

Tip: When searching for "EOD", the query returns all documents with nothing highlighted. Since each document has an EOD token, it is an easy query to find all documents in a data set.

Search Examples in a Paragraph          

  • In the first paragraph: BOD TO EOP { query }
  • In any paragraph: EOP TO EOP { query }
  • In the last paragraph: EOP TO EOD { query }

Term

A term is a type of query: the word query. It is also a unique entry in the dictionary. A term can be a character, a word (for example, sandwich) or a number. A term has a separator on either side. When, for example, brackets are set to be separators, the following text will be identified as 3 terms: "most definite(ly)".

Token

A token is often a term (word, number or separator), but a token can also be anything between two separators. Tokens are the identified small parts that form or define a file.        

Token id

A token id is the natural number or position of a token, given by the tokenizer. Token ids are used to determine the distance between the words/numbers. Separators do not have token ids.

Token      There   are   5   files   EOS   EOD
Position   1       2     3   4       x     x
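
The numbering in the example can be reproduced with a few lines of code. The sketch assumes that token ids start at 1 and increase by one per token, as in the example, and it ignores hyphenated terms, whose parts share a token id. The names SEPARATORS and assign_token_ids are hypothetical.

```python
# Separator tokens named in this appendix; they do not receive a token id.
SEPARATORS = {"BOD", "EOS", "EOP", "EOL", "EOG", "EOD"}


def assign_token_ids(tokens):
    """Pair each token with its position; separators are paired with None."""
    positions = []
    next_id = 1
    for token in tokens:
        if token in SEPARATORS:
            positions.append((token, None))
        else:
            positions.append((token, next_id))
            next_id += 1
    return positions


for token, position in assign_token_ids(["There", "are", "5", "files", "EOS", "EOD"]):
    print(token, position if position is not None else "x")
```

In this model the distance between "There" and "files" is 4 - 1 = 3 positions.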

       

Tokenizer

A tokenizer breaks a stream of text up into words, numbers or other meaningful elements called tokens.     

Wildcards

Wildcards are used to replace or represent one or more characters in a term, making the search query more flexible and efficient.        

Word

A word is identified as one or more characters.     

Word Query

A word query consists of one term.