Glossary

File

Definition: A single code and/or text container.


Query

Definition: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
Components:
- Encompasses all files within the Target dataset.
- Searches through candidate Source files and their associated project metadata using Candidate Queries.
- Identifies source files that contain token subsequences that match any subsequence within target files.
Types:
- Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
- Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
Result:
- Pair of Source File and Target File: Links the Source file with the corresponding Target file.
- Pair of Subsequence Indices: The start and end indices of the matching subsequence in the Source file and in the Target file.
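
The Result structure above can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: the names `Match` and `exact_query`, the half-open index convention (start inclusive, end exclusive), and the `min_len` parameter are all assumptions introduced for this example. It reports every maximal run of identical tokens shared by one Source file and one Target file.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Match:
    """One hypothetical Query result: a file pair plus index ranges."""
    source_file: str
    target_file: str
    source_start: int  # inclusive token index in the Source file
    source_end: int    # exclusive token index in the Source file
    target_start: int  # inclusive token index in the Target file
    target_end: int    # exclusive token index in the Target file


def exact_query(
    source_file: str,
    source_tokens: Sequence[str],
    target_file: str,
    target_tokens: Sequence[str],
    min_len: int = 3,
) -> List[Match]:
    """Return maximal identical token runs of at least min_len tokens."""
    matches: List[Match] = []
    n_src, n_tgt = len(source_tokens), len(target_tokens)
    # prev[j] is the length of the common run ending at
    # source_tokens[i - 2] and target_tokens[j - 1].
    prev = [0] * (n_tgt + 1)
    for i in range(1, n_src + 1):
        cur = [0] * (n_tgt + 1)
        for j in range(1, n_tgt + 1):
            if source_tokens[i - 1] != target_tokens[j - 1]:
                continue
            cur[j] = prev[j - 1] + 1
            # The run is maximal if it cannot be extended to the right.
            at_end = i == n_src or j == n_tgt
            if (at_end or source_tokens[i] != target_tokens[j]) and cur[j] >= min_len:
                run = cur[j]
                matches.append(
                    Match(source_file, target_file, i - run, i, j - run, j)
                )
        prev = cur
    return matches
```

The quadratic scan is chosen for readability only; a real system would presumably run it on the small set of files returned by a Candidate Query rather than on the whole Source.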

Candidate Query

Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
Characteristics:
- Operates over the entire Source dataset.
- Allows false positives but not false negatives.
- Does not directly support Similar Queries; comparable behavior can be achieved by using a smaller Minimum Hashing Unit Size.
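
One way to obtain the "false positives but no false negatives" property is an inverted index over hashed token windows. The Python sketch below assumes that approach; the function names, the use of CRC32 as the hash, and the `unit_size` parameter (standing in for the Minimum Hashing Unit Size) are illustrative choices, not details taken from this glossary. Any exact match of at least `unit_size` tokens necessarily shares a window with the Target, so its file is always returned; hash collisions or short coincidental overlaps may add extra files.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterator, Mapping, Sequence, Set


def window_hashes(tokens: Sequence[str], unit_size: int) -> Iterator[int]:
    """Yield a hash for every window of unit_size consecutive tokens."""
    for i in range(len(tokens) - unit_size + 1):
        # Join with a separator that is unlikely to occur inside a token.
        window = "\x1f".join(tokens[i:i + unit_size])
        yield zlib.crc32(window.encode("utf-8"))


def build_candidate_index(
    source: Mapping[str, Sequence[str]], unit_size: int
) -> Dict[int, Set[str]]:
    """Map each window hash to the Source files that contain it."""
    index: Dict[int, Set[str]] = defaultdict(set)
    for file_name, tokens in source.items():
        for h in window_hashes(tokens, unit_size):
            index[h].add(file_name)
    return index


def candidate_query(
    index: Mapping[int, Set[str]],
    target_tokens: Sequence[str],
    unit_size: int,
) -> Set[str]:
    """Return every Source file sharing at least one hashed window."""
    candidates: Set[str] = set()
    for h in window_hashes(target_tokens, unit_size):
        candidates |= index.get(h, set())
    return candidates


# Usage with two tiny, made-up Source files.
source_files = {
    "a.py": "x = foo ( y )".split(),
    "b.py": "print ( 1 )".split(),
}
index = build_candidate_index(source_files, unit_size=3)
found = candidate_query(index, "z = foo ( y ) + 1".split(), unit_size=3)
```

Lowering `unit_size` makes more windows collide with the Target, which widens the candidate list; this is the trade-off behind using a smaller hashing unit to approximate similar matching.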

Definition: A data structure built from the list of files returned by a Candidate Query.
Characteristics:
- Optimized for detailed search operations based on Exact and Similar Queries.

Candidate Indexing

Definition: The initial indexing for the entire Source.
Characteristics:
- Supports Candidate Queries by providing a file-level listing.

Target

Definition: A user-provided dataset intended for analysis.
Characteristics:
- Consists of files supplied by the user.
- Undergoes immediate tokenization and/or indexing upon receipt.
- Remains separate and is never incorporated into the Source.

Source

Definition: A repository of datasets maintained by the service provider.
Characteristics:
- Acts as the reference pool for matching against the Target.
- Receives updates infrequently, typically on a weekly or monthly schedule.
- Pre-indexed and readily available when accessed by the user.
- Because indexing is completed before the user accesses the Source, the time taken by Candidate Indexing does not constrain data structure design.

Dataset

Definition: An aggregate collection of files.
Types:
- Source Dataset: The comprehensive collection of files maintained as the Source.
  - Interchangeably referred to as "Source."
- Target Dataset: The complete collection of files provided by the user.
  - Interchangeably referred to as "Target."

Token

Definition: The smallest unit of data derived from the content of a file.
Generation Methods:
- Code Data: Tokens are produced by a tokenizer based on the syntax of the code.
- Text Data (Natural Language):
  - Western Languages: Tokens are separated by spaces.
  - Chinese: Each character is treated as an individual token.
  - Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
  - Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context using tools like MeCab or Kuromoji.
  - Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
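
The simpler generation methods above can be sketched in Python. This is only an illustration under stated assumptions: the function names and the language codes (`"zh"`, `"ko"`, `"ja"`) are hypothetical, code tokenization is shown for Python source only (using the standard-library `tokenize` module), and Korean and Japanese are deliberately left unimplemented because they need a morphological analyzer.

```python
import io
import tokenize
from typing import List


def tokenize_code(source: str) -> List[str]:
    """Tokenize Python source by syntax, dropping whitespace-only tokens."""
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.string.strip()  # skips NEWLINE, INDENT, DEDENT, ENDMARKER
    ]


def tokenize_text(text: str, language: str) -> List[str]:
    """Tokenize natural-language text according to the rules listed above."""
    if language == "zh":
        # Chinese: every non-whitespace character is its own token.
        return [ch for ch in text if not ch.isspace()]
    if language in {"ko", "ja"}:
        # Korean and Japanese need a specialized tokenizer; not sketched here.
        raise NotImplementedError("requires a morphological analyzer")
    # Space-separated languages: split on whitespace.
    return text.split()
```

Note that this sketch keeps comments as tokens in code and keeps punctuation attached to words in space-separated text; whether the real tokenizers do so is not specified here.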

Definition:
- Serves as the fundamental unit in search operations within the system.
- Does not directly support Similar Queries; however, similar results can be achieved by using a smaller Minimum Hashing Unit Size.

Sequence:
- Definition: An ordered list of tokens derived from a file or segment.
Subsequence:
- Definition: A contiguous run of tokens within a sequence.

Definition: User-defined criteria that refine and tailor query operations.
Types:
- Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
- Search Unit (by token):
  - Minimum Hashing Unit Size:
    - Definition: The smallest unit size used for hashing tokens during the search process.
  - Minimum Search Unit Size:
    - Definition: The smallest unit size considered when performing searches.
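
The parameters above could be grouped into a single validated object, as in the Python sketch below. The class name, field names, default values, and the validation rules are assumptions made for illustration. In particular, the rule that the hashing unit must not exceed the search unit is an inference: if every reportable match is at least as long as one hashing unit, it always contains a hashed window and so cannot be missed by a window-based candidate search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    """Hypothetical container for the user-defined query parameters."""
    similarity_rate: float = 1.0      # 1.0 is assumed to mean an Exact Query
    min_hashing_unit_size: int = 5    # in tokens; default chosen arbitrarily
    min_search_unit_size: int = 10    # in tokens; default chosen arbitrarily

    def __post_init__(self) -> None:
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
        if self.min_hashing_unit_size < 1:
            raise ValueError("min_hashing_unit_size must be at least 1")
        if self.min_hashing_unit_size > self.min_search_unit_size:
            # Assumed constraint: a match shorter than one hashing unit
            # could never be found through hashed windows.
            raise ValueError(
                "min_hashing_unit_size must not exceed min_search_unit_size"
            )
```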