Glossary
File
-
Definition: A single container of code and/or text.
-
Characteristics:
Serves as the fundamental unit in search operations within the system.
It does not directly support Similar Queries; however, similar results can be achieved by using smaller minimum hashing unit sizes.
Segment
Definition: A subdivision of a file.
Characteristics:
Segments are uniformly sized, except for the final segment, which may be smaller.
Each segment is individually indexed for efficient searching (Candidate Query).
Adjacent segments should share overlapping tokens so that minimum search/hashing units spanning a segment boundary stay intact.
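The segmentation rule above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function name `split_into_segments` and the parameters `segment_size` and `overlap` are hypothetical. One reasonable choice, assumed here, is an overlap of the minimum hashing unit size minus one, which guarantees that every window of that size lies entirely inside at least one segment.

```python
from typing import List, Sequence


def split_into_segments(
    tokens: Sequence[str], segment_size: int, overlap: int
) -> List[List[str]]:
    """Split a token sequence into uniformly sized, overlapping segments.

    Hypothetical helper: every segment has `segment_size` tokens except
    the final one, which may be smaller. Consecutive segments share
    `overlap` tokens.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_size")
    if not tokens:
        return []

    step = segment_size - overlap  # distance between segment starts
    segments: List[List[str]] = []
    start = 0
    while True:
        segments.append(list(tokens[start:start + segment_size]))
        if start + segment_size >= len(tokens):
            break  # this segment reached the end of the file
        start += step
    return segments
```

For example, with 11 tokens, a segment size of 4, and an overlap of 1, the segments start at positions 0, 3, 6, and 9, and the last segment holds only 2 tokens.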
Query
-
Definition: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
-
Components:
- Encompasses all files within the Target dataset.
- Searches through candidate Source files and their associated project metadata using Candidate Queries.
- Identifies source files that contain token subsequences that match any subsequence within target files.
-
Types:
-
Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
-
Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
-
Result:
-
Pair of Source File and Target File: Links the Source file with the corresponding Target file.
-
Pair of Subsequence Indices: The start and end indices of the matching subsequence in both the Source file and the Target file.
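To make the shape of an Exact Query result concrete, the sketch below finds maximal matching token runs between one Source file and one Target file and reports them as index pairs. The names `Match` and `exact_matches`, the exclusive end indices, and the `min_len` parameter (standing in for the Minimum Search Unit Size) are assumptions for illustration. The quadratic dynamic-programming scan is a simple stand-in, not a claim about how the system actually matches.

```python
from typing import List, NamedTuple, Sequence


class Match(NamedTuple):
    """Hypothetical result record; end indices are exclusive."""
    source_start: int
    source_end: int
    target_start: int
    target_end: int


def exact_matches(
    source: Sequence[str], target: Sequence[str], min_len: int
) -> List[Match]:
    """Return maximal identical token runs of at least `min_len` tokens."""
    matches: List[Match] = []
    # prev[j] is the length of the common run ending at
    # source[i - 2] and target[j - 1] (the previous row of the table).
    prev = [0] * (len(target) + 1)
    for i in range(1, len(source) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if source[i - 1] != target[j - 1]:
                continue
            cur[j] = prev[j - 1] + 1
            # The run is maximal when it cannot be extended to the right.
            run_ends = (
                i == len(source)
                or j == len(target)
                or source[i] != target[j]
            )
            if run_ends and cur[j] >= min_len:
                n = cur[j]
                matches.append(Match(i - n, i, j - n, j))
        prev = cur
    return matches
```

With Source tokens `a b c d e`, Target tokens `x b c d y`, and `min_len=2`, the only match covers positions 1 to 4 (end exclusive) in both files.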
Candidate Query
-
Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
-
Characteristics:
- Operates over the entire Source dataset.
- Allows false positives but not false negatives.
- Does not directly support Similar Queries; similar functionality can be achieved by using a smaller minimum hashing unit size.
Index
-
Definition: A data structure built from the list of files returned by a Candidate Query.
-
Characteristics:
- Optimized for detailed search operations based on Exact and Similar Queries.
Candidate Index
-
Definition: The initial indexing for the entire Source.
-
Characteristics:
- Supports Candidate Queries by providing a file-level listing.
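One way to picture how a Candidate Index can serve Candidate Queries is an inverted index from hashed token windows to file names. The sketch below is an assumption-laden illustration: the function names, the use of BLAKE2b as the hash, and the `unit_size` parameter (standing in for the Minimum Hashing Unit Size) are all hypothetical choices, not the system's documented design.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterator, Sequence, Set


def window_hashes(tokens: Sequence[str], unit_size: int) -> Iterator[str]:
    """Yield a short hash for every window of `unit_size` consecutive tokens."""
    for i in range(len(tokens) - unit_size + 1):
        # Join with a separator unlikely to appear inside a token.
        window = "\x1f".join(tokens[i:i + unit_size])
        yield hashlib.blake2b(window.encode("utf-8"), digest_size=8).hexdigest()


def build_candidate_index(
    source_files: Dict[str, Sequence[str]], unit_size: int
) -> Dict[str, Set[str]]:
    """Map each window hash to the Source files that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, tokens in source_files.items():
        for h in window_hashes(tokens, unit_size):
            index[h].add(name)
    return index


def candidate_query(
    index: Dict[str, Set[str]],
    target_files: Dict[str, Sequence[str]],
    unit_size: int,
) -> Set[str]:
    """Return Source files sharing at least one hashed window with the Target."""
    candidates: Set[str] = set()
    for tokens in target_files.values():
        for h in window_hashes(tokens, unit_size):
            candidates |= index.get(h, set())
    return candidates
```

Under these assumptions, any exact match of at least `unit_size` tokens shares a window hash, so it cannot be missed (no false negatives), while a hash collision or a short incidental overlap can still list a file that the later Query rejects (false positives). A smaller `unit_size` lets files with small edits between matching runs still surface as candidates, which is the sense in which smaller hashing units approximate similar matching.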
Target
-
Definition: A user-provided dataset intended for analysis.
-
Characteristics:
- Consists of files supplied by the user.
- Undergoes immediate tokenization and/or indexing upon receipt.
- Remains separate and is never incorporated into the Source.
Source
-
Definition: A repository of datasets maintained by the service provider.
-
Characteristics:
- Acts as the reference pool for matching against the Target.
- Receives updates infrequently, typically on a weekly or monthly schedule.
- Pre-indexed and readily available when accessed by the user.
- Because the Source is pre-indexed, the time Candidate Indexing takes is not a constraint on data structure design.
Dataset
-
Definition: An aggregate collection of files.
-
Types:
-
Source Dataset: The comprehensive collection of files maintained as the Source.
- Interchangeably referred to as "Source."
-
Target Dataset: The complete collection of files provided by the user.
- Interchangeably referred to as "Target."
Token
-
Definition: The smallest unit of data derived from the content of a file.
-
Generation Methods:
-
Code Data: Tokens are produced by a tokenizer based on the syntax of the code.
-
Text Data (Natural Language):
-
Western Languages: Tokens are separated by spaces.
-
Chinese: Each character is treated as an individual token.
-
Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
-
Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context using tools like MeCab or Kuromoji.
-
Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
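The simpler generation methods above can be sketched in a few lines. The function names and the `language` labels are hypothetical, and Python's standard `tokenize` module is used only as one example of a syntax-based code tokenizer. Korean and Japanese are deliberately left unimplemented, since the glossary states they need specialized tokenizers.

```python
import io
import tokenize
from typing import List


def tokenize_text(text: str, language: str) -> List[str]:
    """Tokenize natural-language text using the simple rules above."""
    if language == "western":
        # Tokens are separated by whitespace.
        return text.split()
    if language == "chinese":
        # Each non-whitespace character is its own token.
        return [ch for ch in text if not ch.isspace()]
    # Korean and Japanese need a morphological analyzer (not sketched here).
    raise NotImplementedError(f"no built-in tokenizer for {language!r}")


def tokenize_python_code(code: str) -> List[str]:
    """Tokenize Python source with the standard library's lexer.

    Layout-only tokens and comments are dropped here; that choice is an
    assumption of this sketch, not a rule from the glossary.
    """
    skipped = {
        tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
    }
    reader = io.StringIO(code).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in skipped
    ]
```

A real deployment would likely need one tokenizer per programming language; the Python lexer here only illustrates the idea of syntax-driven tokens.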
Minimum Token Sequence
-
Definition:
- Serves as the fundamental unit in search operations within the system.
- It does not directly support Similar Queries; however, similar results can be achieved by using smaller minimum hashing unit sizes.
Sequence and Subsequence
-
Sequence:
-
Definition: An ordered list of tokens derived from a file or segment.
-
Subsequence:
-
Definition: A contiguous subset of tokens within an entire sequence.
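Because a Subsequence is contiguous, checking for one amounts to looking for an unbroken run of tokens rather than tokens scattered through the Sequence. A minimal sketch, with the hypothetical helper name `find_subsequence`:

```python
from typing import Sequence


def find_subsequence(sequence: Sequence[str], candidate: Sequence[str]) -> int:
    """Return the start index of `candidate` as a contiguous run, or -1."""
    n, m = len(sequence), len(candidate)
    for start in range(n - m + 1):
        if list(sequence[start:start + m]) == list(candidate):
            return start
    return -1
```

For the Sequence `a b c d`, the tokens `b c` form a Subsequence starting at index 1, while `a c` do not, because they are not adjacent in the Sequence.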
Parameter
-
Definition: User-defined criteria that refine and tailor query operations.
-
Types:
-
Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
-
Search Unit (by token):
-
Minimum Hashing Unit Size:
-
Definition: The smallest unit size used for hashing tokens during the search process.
-
Minimum Search Unit Size:
-
Definition: The smallest unit size considered when performing searches.
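The parameters above could be grouped into a single validated object, as in the sketch below. Everything here is hypothetical: the class name, field names, and default values are illustrative, and the rule that the hashing unit must not exceed the search unit is an assumption (it keeps every reportable match long enough to contain at least one hashed window). The glossary does not define how similarity is measured, so `difflib.SequenceMatcher.ratio()` from the standard library is used purely as a stand-in.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Sequence


@dataclass(frozen=True)
class QueryParameters:
    """Hypothetical container for user-defined query parameters."""
    similarity_rate: float = 1.0      # 1.0 behaves like an Exact Query
    min_hashing_unit_size: int = 5    # tokens per hashed window (illustrative)
    min_search_unit_size: int = 10    # shortest match reported (illustrative)

    def __post_init__(self) -> None:
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
        if self.min_hashing_unit_size < 1:
            raise ValueError("min_hashing_unit_size must be at least 1")
        # Assumption of this sketch, not stated in the glossary.
        if self.min_hashing_unit_size > self.min_search_unit_size:
            raise ValueError("hashing unit must not exceed search unit")


def is_similar(
    a: Sequence[str], b: Sequence[str], params: QueryParameters
) -> bool:
    """Compare two token runs against the Similarity Rate (stand-in measure)."""
    ratio = SequenceMatcher(None, list(a), list(b)).ratio()
    return ratio >= params.similarity_rate
```

With a Similarity Rate of 0.7, the runs `a b c d` and `a b x d` count as similar under this stand-in measure, since three of four tokens align; raising the rate to 0.9 rejects the same pair.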