Glossary
File
-
Definition: A single container of code and/or text.
-
Characteristics:
- Serves as the fundamental unit in search operations within the system.
- Does not directly support Similar Queries; comparable results can be achieved by using a smaller Minimum Hashing Unit Size.
Segment
-
Definition: A subdivision of a file.
-
Characteristics:
- Segments are uniformly sized, except for the final segment, which may be smaller.
- Each segment is individually indexed for efficient searching (Candidate Query).
- Adjacent segments should share overlapping tokens so that minimum search/hashing units spanning a segment boundary stay coherent.
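The segmentation described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function name `split_into_segments` and the parameters `segment_size` and `overlap` are hypothetical, and the glossary does not state how large the overlap should be.

```python
from typing import List


def split_into_segments(tokens: List[str], segment_size: int, overlap: int) -> List[List[str]]:
    """Split a token list into uniformly sized segments that share `overlap` tokens.

    Hypothetical helper: the last segment may be shorter than `segment_size`.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_size")

    step = segment_size - overlap  # distance between consecutive segment starts
    segments: List[List[str]] = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + segment_size])
        # Stop once a segment has reached the end of the token list.
        if start + segment_size >= len(tokens):
            break
    return segments
```

One plausible choice, again an assumption, is to set the overlap to the Minimum Hashing Unit Size minus one, so that every window of that size falls entirely inside at least one segment.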
Query
-
Definition: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
-
Components:
- Encompasses all files within the Target dataset.
- Searches through candidate Source files and their associated project metadata using Candidate Queries.
- Identifies source files that contain token subsequences that match any subsequence within target files.
-
Types:
-
Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
-
Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
-
Result:
-
Pair of Source File and Target File: Links the Source file with the corresponding Target file.
-
Pair of Subsequence Indices: Gives the start and end indices of the matching subsequence in both the Source file and the Target file.
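To make the shape of a Query result concrete, here is a hedged sketch of an Exact Query over two token sequences. The `Match` class, the `exact_matches` function, and the `min_len` parameter are hypothetical names, and the end-exclusive index convention is an assumption; the glossary does not specify either. The matching itself uses a plain dynamic-programming scan, chosen for readability rather than speed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Match:
    source_file: str
    target_file: str
    source_span: Tuple[int, int]  # (start, end) token indices, end exclusive (assumed)
    target_span: Tuple[int, int]  # (start, end) token indices, end exclusive (assumed)


def exact_matches(source_file: str, source: List[str],
                  target_file: str, target: List[str],
                  min_len: int) -> List[Match]:
    """Return every maximal identical token run of at least `min_len` tokens."""
    matches: List[Match] = []
    prev = [0] * (len(target) + 1)  # common-run lengths for the previous source token
    for i in range(1, len(source) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if source[i - 1] != target[j - 1]:
                continue
            curr[j] = prev[j - 1] + 1
            # The run is maximal if it cannot be extended by one more token.
            at_end = i == len(source) or j == len(target)
            if curr[j] >= min_len and (at_end or source[i] != target[j]):
                length = curr[j]
                matches.append(Match(source_file, target_file,
                                     (i - length, i), (j - length, j)))
        prev = curr
    return matches
```

For example, matching `a b c d e` against `x b c d y` with `min_len=2` yields a single result whose spans are `(1, 4)` in both files.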
Candidate Query
-
Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
-
Characteristics:
- Operates over the entire Source dataset.
- Allows false positives but not false negatives.
- Does not directly support Similar Queries; similar functionality can be achieved by using a smaller Minimum Hashing Unit Size.
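The "false positives but not false negatives" property can be illustrated with a small inverted index over hashed token windows. This is a sketch under stated assumptions, not the service's actual design: the function names, the use of BLAKE2b, and the 8-byte digest are all illustrative choices. Any exact shared run of at least `unit_size` tokens contains at least one shared window, so the file holding it is always returned; a file that shares only a single window (or collides on a hash) may be returned even though it has no longer match.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterator, List, Set


def unit_hashes(tokens: List[str], unit_size: int) -> Iterator[bytes]:
    """Yield one hash per window of `unit_size` consecutive tokens."""
    for start in range(len(tokens) - unit_size + 1):
        window = "\x00".join(tokens[start:start + unit_size])
        yield hashlib.blake2b(window.encode("utf-8"), digest_size=8).digest()


def build_candidate_index(source: Dict[str, List[str]], unit_size: int) -> Dict[bytes, Set[str]]:
    """Map each window hash to the set of Source files containing that window."""
    index: Dict[bytes, Set[str]] = defaultdict(set)
    for file_name, tokens in source.items():
        for digest in unit_hashes(tokens, unit_size):
            index[digest].add(file_name)
    return index


def candidate_query(index: Dict[bytes, Set[str]], target_tokens: List[str], unit_size: int) -> Set[str]:
    """Return Source files sharing at least one hashed window with the Target tokens."""
    candidates: Set[str] = set()
    for digest in unit_hashes(target_tokens, unit_size):
        candidates |= index.get(digest, set())
    return candidates
```

Lowering `unit_size` makes more windows coincide, which is one way to read the note that a smaller Minimum Hashing Unit Size approximates similar matching at the cost of more candidates.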
Index
-
Definition: A data structure built from the list of files returned by a Candidate Query.
-
Characteristics:
- Optimized for detailed search operations based on Exact and Similar Queries.
Candidate Index
-
Definition: The initial indexing for the entire Source.
-
Characteristics:
- Supports Candidate Queries by providing a file-level listing.
Target
-
Definition: A user-provided dataset intended for analysis.
-
Characteristics:
- Consists of files supplied by the user.
- Undergoes immediate tokenization and/or indexing upon receipt.
- Remains separate and is never incorporated into the Source.
Source
-
Definition: A repository of datasets maintained by the service provider.
-
Characteristics:
- Acts as the reference pool for matching against the Target.
- Receives updates infrequently, typically on a weekly or monthly schedule.
- Pre-indexed and readily available when accessed by the user.
- Because the Source is indexed ahead of time, the duration of Candidate Indexing does not influence data structure design.
Dataset
-
Definition: An aggregate collection of files.
-
Types:
-
Source Dataset: The comprehensive collection of files maintained as the Source.
- Interchangeably referred to as "Source."
-
Target Dataset: The complete collection of files provided by the user.
- Interchangeably referred to as "Target."
Token
-
Definition: The smallest unit of data derived from the content of a file.
-
Generation Methods:
-
Code Data: Tokens are produced by a tokenizer based on the syntax of the code.
-
Text Data (Natural Language):
-
Western Languages: Tokens are separated by spaces.
-
Chinese: Each character is treated as an individual token.
-
Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
-
Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context using tools like MeCab or Kuromoji.
-
Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
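The simpler generation methods above can be sketched in a few lines. The function names are hypothetical, and two assumptions are made: whitespace in Chinese text is dropped rather than emitted as a token, and "code data" is represented here by Python source, tokenized with the standard-library `tokenize` module. Korean and Japanese are deliberately omitted because, as noted, they need specialized tokenizers.

```python
import io
import tokenize
from typing import List


def tokenize_whitespace(text: str) -> List[str]:
    """Space-separated tokenization, as described for Western languages."""
    return text.split()


def tokenize_chinese(text: str) -> List[str]:
    """One token per character; whitespace is skipped (an assumption)."""
    return [ch for ch in text if not ch.isspace()]


def tokenize_python_code(source: str) -> List[str]:
    """Syntax-based tokenization of Python source, keeping only meaningful tokens."""
    keep = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING, tokenize.OP}
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader) if tok.type in keep]
```

For instance, `tokenize_python_code("x = 1 + y\n")` produces the tokens `x`, `=`, `1`, `+`, `y`, with layout tokens such as newlines filtered out.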
Sequence and Subsequence
-
Sequence:
-
Definition: An ordered list of tokens derived from a file or segment.
-
Subsequence:
-
Definition: A contiguous run of tokens within a sequence.
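Because "subsequence" here means a contiguous run (unlike the looser mathematical sense that allows gaps), a short sketch may help. The function name `find_subsequence` is hypothetical, and returning `-1` for "not found" is just a convention chosen for the example.

```python
from typing import List


def find_subsequence(sequence: List[str], candidate: List[str]) -> int:
    """Return the start index of `candidate` as a contiguous run in `sequence`, or -1."""
    width = len(candidate)
    for start in range(len(sequence) - width + 1):
        if sequence[start:start + width] == candidate:
            return start
    return -1
```

With the sequence `a b c d`, the tokens `b c` form a subsequence starting at index 1, while `a c` does not qualify because the tokens are not adjacent.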
Parameter
-
Definition: User-defined criteria that refine and tailor query operations.
-
Types:
-
Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
-
Search Unit (by token):
-
Minimum Hashing Unit Size:
-
Definition: The smallest number of consecutive tokens hashed together as a single unit during the search process.
-
Minimum Search Unit Size:
-
Definition: The smallest number of consecutive tokens considered as a match when performing searches.
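As a rough illustration of how these parameters might fit together, the sketch below bundles them into one object and applies them to a similarity check. Everything here is an assumption made for the example: the class and field names, the reading of Similarity Rate as a ratio between 0 and 1, the rule that the hashing unit should not exceed the search unit, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure. The glossary does not say how similarity is actually computed.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class QueryParameters:
    similarity_rate: float       # assumed to be a ratio in [0.0, 1.0]
    min_hashing_unit_size: int   # measured in tokens
    min_search_unit_size: int    # measured in tokens

    def __post_init__(self) -> None:
        if not 0.0 <= self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be between 0.0 and 1.0")
        if self.min_hashing_unit_size < 1:
            raise ValueError("min_hashing_unit_size must be at least 1")
        # Assumption: a reportable match must contain at least one hashed unit.
        if self.min_search_unit_size < self.min_hashing_unit_size:
            raise ValueError("min_search_unit_size must be >= min_hashing_unit_size")


def is_similar(a: List[str], b: List[str], params: QueryParameters) -> bool:
    """Stand-in similarity test over two token runs using difflib's ratio."""
    if min(len(a), len(b)) < params.min_search_unit_size:
        return False
    return SequenceMatcher(None, a, b).ratio() >= params.similarity_rate
```

Under these assumptions, an Exact Query corresponds to a Similarity Rate of 1.0, and lowering the rate admits runs that differ in a few tokens.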