Glossary
File
- Definition: A single code and/or text container.
-
Characteristics:
- Consists of a sequence of original tokens (raw source code).
- The Indexing, Candidating, and Comparing phases do not read the file directly.
- Only the viewer interface reads the file directly.
Target
- Definition: A user-provided dataset intended for analysis.
-
Characteristics:
- Consists of files supplied by the user.
- Undergoes immediate tokenization and/or indexing upon receipt.
- Remains separate and is never incorporated into the Source.
Source
- Definition: A repository of datasets maintained by the service provider.
-
Characteristics:
- Acts as the reference pool for matching against the Target.
- Receives updates infrequently, typically on a weekly or monthly schedule.
- Pre-indexed and readily available when accessed by the user.
- The duration of Candidate Indexing does not influence data structure design.
Dataset
- Definition: An aggregate collection of files.
-
Types:
-
Source Dataset: The comprehensive collection of files maintained as the Source.
- Interchangeably referred to as "Source."
-
Target Dataset: The complete collection of files provided by the user.
- Interchangeably referred to as "Target."
Token
- Definition: The smallest unit of data derived from the content of a file.
-
Tokenizing Methods:
-
Code Data:
- Produced by a tokenizer based on the syntax of the code.
-
Text Data (Natural Language):
- Western Languages: Tokens are separated by spaces.
- Chinese: Each character is treated as an individual token.
- Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
- Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context using tools like MeCab or Kuromoji.
- Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
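The two simplest text rules above (space-separated Western languages and per-character Chinese) could be sketched as follows. The function names are hypothetical and not part of the system. Korean and Japanese are left out on purpose, because they need specialized tokenizers.

```python
def tokenize_western(text: str) -> list[str]:
    """Split on whitespace, following the 'Western Languages' rule."""
    return text.split()


def tokenize_chinese(text: str) -> list[str]:
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]
```

Skipping whitespace in the Chinese case is an assumption of this sketch; the glossary only states that each character is a token.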
Sequence and Subsequence
-
Sequence:
- Definition: An ordered list of tokens derived from a file or segment.
-
Subsequence:
- Definition: A contiguous subset of tokens within an entire sequence.
Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)
-
Definition:
- Minimum Token Sequence (MTS): A fixed-size subsequence of tokens taken from the original sequence.
- Minimum Token Sequence Unit (MTSU): The fundamental unit of indexing, candidating, and comparing within the system.
-
Characteristics:
- An MTSU is calculated by hashing an MTS.
- Because an MTS consists of several tokens, a given MTS is less likely to be duplicated than a single token.
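As an illustration of the relationship between MTS and MTSU, the sketch below slides a fixed-size window over a token sequence and hashes each window. The stride of one token, the use of SHA-256 truncated to 8 bytes, and both function names are assumptions made for this example; the glossary does not specify them.

```python
import hashlib


def minimum_token_sequences(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (stride 1 is an assumption)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...]) -> bytes:
    """Hash one MTS into a fixed-size MTSU (truncated SHA-256 is an assumption)."""
    # Join with a NUL separator, assumed never to occur inside a token,
    # so that ("ab", "c") and ("a", "bc") produce different inputs.
    joined = "\x00".join(mts).encode("utf-8")
    return hashlib.sha256(joined).digest()[:8]


tokens = ["int", "x", "=", "0", ";"]
units = [mtsu(m) for m in minimum_token_sequences(tokens, size=3)]
```

A sequence shorter than `size` yields no MTS at all, so such files would produce no MTSU under this sketch.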
Tokenized File
- Definition: A tokenized version of a file.
-
Characteristics:
- Generated by a tokenizer.
- Consists of a sequence of pairs, each containing:
- The index of the token
- The original characters (string)
Query, Query Request, and Query Result
!!TODO: Clearly define Candidate Query and Query
-
Definition:
- Query: The process of searching the Source for candidate files that share a common subsequence with the Target.
- Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
- Query Result: The output of a Query, which pairs each Target File with a list of candidate Source files.
-
Components:
- Encompasses all files within the Target dataset.
- Searches through candidate Source files and their associated project metadata using Candidate Queries.
- Identifies Source files that contain token subsequences matching any subsequence within Target files.
-
Types:
- Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
- Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
-
Result:
- Pair of Source File and Target File: Links the Source file with the corresponding Target file.
- Pair of Subsequence Indices: Gives the start and end indices of the matching subsequence in both the Source and Target files.
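The result shape described above (a file pair plus a pair of index ranges) might look like the following for an Exact Query. The `Match` type, the `exact_matches` function, the fixed window size, and the end-exclusive index convention are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    source_file: str
    target_file: str
    source_span: tuple[int, int]  # (start, end) token indices, end exclusive
    target_span: tuple[int, int]  # same convention, in the Target file


def exact_matches(source_file: str, source_tokens: list[str],
                  target_file: str, target_tokens: list[str],
                  size: int) -> list[Match]:
    """Report every pair of identical windows of `size` tokens."""
    # Map each Source window to all positions where it starts.
    starts: dict[tuple[str, ...], list[int]] = {}
    for i in range(len(source_tokens) - size + 1):
        starts.setdefault(tuple(source_tokens[i:i + size]), []).append(i)

    matches = []
    for j in range(len(target_tokens) - size + 1):
        window = tuple(target_tokens[j:j + size])
        for i in starts.get(window, []):
            matches.append(Match(source_file, target_file,
                                 (i, i + size), (j, j + size)))
    return matches
```

This sketch reports each matching window separately and does not merge overlapping windows into longer matches.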
Candidate Query
!!TODO: Clearly define Candidate Query and Query
- Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
-
Characteristics:
- Operates over the entire Source dataset.
- Allows false positives but does not allow false negatives.
- Does not directly support Similar Queries; however, similar functionality can be achieved by using a smaller MTS size.
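One way to see why a Candidate Query can admit false positives yet never false negatives is a hash-bucketed file listing: a file that really contains a window always lands in that window's bucket, while hash collisions can only add extra files. The sketch below assumes CRC32 buckets and hypothetical function names; it is not the system's actual index layout.

```python
import zlib
from collections import defaultdict


def windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def bucket(mts: tuple[str, ...], buckets: int) -> int:
    # A small bucket count makes collisions (false positives) more likely.
    return zlib.crc32("\x00".join(mts).encode("utf-8")) % buckets


def build_candidate_index(source: dict[str, list[str]], size: int, buckets: int):
    """Map each bucket to the set of Source file IDs that hit it."""
    index = defaultdict(set)
    for file_id, tokens in source.items():
        for w in windows(tokens, size):
            index[bucket(w, buckets)].add(file_id)
    return index


def candidate_query(index, target_tokens: list[str], size: int, buckets: int) -> set[str]:
    """Return every Source file sharing at least one bucket with the Target."""
    found: set[str] = set()
    for w in windows(target_tokens, size):
        found |= index.get(bucket(w, buckets), set())
    return found
```

With a single bucket every file becomes a candidate, which is the extreme false-positive case; the true match is still always included.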
Batch Query
- Definition: A method of processing multiple queries simultaneously to improve efficiency.
-
Characteristics:
- .
Index
!!TODO: Clearly define Index and Candidate Index
- Definition: A data structure built from the list of files returned by a Candidate Query.
-
Characteristics:
- Optimized for detailed search operations based on Exact and Similar Queries.
Candidate Index
- Definition: The initial indexing for the entire Source.
-
Characteristics:
- Supports Candidate Queries by providing a file-level listing.
Partitioning
- Definition: The process of dividing the MTSUs in a Query Request to optimize Finding.
-
Characteristics:
- The Index is sorted by MTSU.
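Because the Index is sorted by MTSU, one plausible way to partition a Query Request is to group its MTSUs by the contiguous key range they fall into. The sketch below assumes integer MTSUs and a hypothetical `boundaries` list holding the first MTSU of every partition after the first; neither detail comes from the glossary.

```python
from bisect import bisect_right


def partition_mtsus(mtsus: list[int], boundaries: list[int]) -> dict[int, list[int]]:
    """Group query MTSUs by the index partition whose key range contains them."""
    groups: dict[int, list[int]] = {}
    # Deduplicate and sort so each partition receives its MTSUs in index order.
    for unit in sorted(set(mtsus)):
        part = bisect_right(boundaries, unit)  # partition number, starting at 0
        groups.setdefault(part, []).append(unit)
    return groups
```

Each group can then be looked up against a single partition, so the lookups for one partition touch one sorted range of the Index.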
File Archiver
- Definition: Returns a Source File given the file ID of that Source File.
- Characteristics:
Indexer
- Definition: Creates the Index from Source datasets.
-
Characteristics:
- Processes Source Files:
- Generates Tokenized Files from Source Files.
- Generates the Index from the MTSUs of the Tokenized Files and the metadata of the Source Files.
Querier
- Definition: Creates a Query Request from Target Files.
-
Characteristics:
- Processes Target Files:
- Generates Tokenized Files from Target Files.
- Generates a Query Request from the MTSUs of the Tokenized Files, containing:
- The MTSU list and relevant metadata of the Target Files.
- Grouped/deduplicated MTSUs, optimized for the query operation.
- Open question: should grouping happen in the Querier or in the Indexer?
Finder
- Definition: Retrieves candidate files during query operations.
-
Characteristics:
- For each MTSU in a Query Request, gets the candidate file list (or candidate file list ID) from the Index.
Merger
- Definition: Merges the candidate file lists based on the query results.
-
Characteristics:
- Generates the final candidate file list for the Query Request.
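The hand-off from Finder to Merger could be sketched as below: the Finder looks up one candidate file list per MTSU, and the Merger combines them into a single list. Ranking files by how many MTSUs they matched is an assumption of this example, as are the function names and the plain dictionary standing in for the Index.

```python
from collections import Counter


def find(index: dict[int, list[str]], mtsus: list[int]) -> dict[int, list[str]]:
    """Finder: fetch the candidate file list for each MTSU in the request."""
    return {unit: index.get(unit, []) for unit in mtsus}


def merge(per_mtsu: dict[int, list[str]]) -> list[str]:
    """Merger: combine the per-MTSU lists into one final candidate list."""
    hits = Counter()
    for file_ids in per_mtsu.values():
        hits.update(set(file_ids))  # count each file once per MTSU
    # Most-matched files first; ties are broken by file ID for determinism.
    return [f for f, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]
```

A plain set union would also satisfy the definition; the hit count is kept here only because it gives a natural ordering for the final list.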
Hash Function
- Definition: A function that converts input data (an MTS) into a fixed-size string of bytes (an MTSU).
Language Family
- Definition: A grouping of programming languages based on the similarity of their actual representation.
-
Characteristics:
- Languages within the same family share common grammatical structures, enabling identical token sequences across different languages.
- A language can belong to multiple language families if it shares patterns with multiple groups.
- Facilitates partitioning of Source data to optimize search and indexing based on language-specific features.
Parameter
- Definition: User-defined criteria that refine and tailor query operations.
-
Types:
- Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
-
Search Unit (by token):
-
MTSU Size:
- Definition: The smallest unit size used for hashing subsequences during the search process.
-
Minimum Search Unit Size:
- Definition: The smallest unit size considered when performing searches.