Glossary
File
- Definition: A single code and/or text container.
- Characteristics:
- Consists of a sequence of original tokens (raw source code).
- The Indexing, Candidating, and Comparing phases do not read the file directly.
- Only the viewer interface reads the file directly.
Target
- Definition: A user-provided dataset intended for analysis.
- Characteristics:
- Consists of files supplied by the user.
- Undergoes immediate tokenization and/or indexing upon receipt.
- Remains separate and is never incorporated into the Source.
Source
- Definition: A repository of datasets maintained by the service provider.
- Characteristics:
- Acts as the reference pool for matching against the Target.
- Receives updates infrequently, typically on a weekly or monthly schedule.
- Pre-indexed and readily available when accessed by the user.
- The duration of Candidate Indexing does not influence data structure design.
Dataset
- Definition: An aggregate collection of files.
- Types:
- Source Dataset: The comprehensive collection of files maintained as the Source.
- Interchangeably referred to as "Source."
- Target Dataset: The complete collection of files provided by the user.
- Interchangeably referred to as "Target."
Token
- Definition: The smallest unit of data derived from the content of a file.
- Tokenization Methods:
- Code Data: Produced by a tokenizer based on the syntax of the code.
- Text Data (Natural Language):
- Western Languages: Tokens are separated by spaces.
- Chinese: Each character is treated as an individual token.
- Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
- Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context.
- Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
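The two simplest text rules above (space-separated Western languages and per-character Chinese) can be sketched as follows. This is only an illustration: the function names are hypothetical, punctuation handling is not specified by this glossary and is left untouched, and Korean and Japanese are deliberately omitted because they need a specialized morphological tokenizer.

```python
def tokenize_whitespace(text: str) -> list[str]:
    """Western-language rule: split on runs of whitespace."""
    return text.split()


def tokenize_chinese(text: str) -> list[str]:
    """Chinese rule: every non-whitespace character is its own token."""
    return [ch for ch in text if not ch.isspace()]


print(tokenize_whitespace("the quick  brown fox"))  # ['the', 'quick', 'brown', 'fox']
print(tokenize_chinese("你好 世界"))                 # ['你', '好', '世', '界']
```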
Sequence and Subsequence
- Definition:
- Sequence: An ordered list of tokens derived from a file or segment.
- Subsequence: A contiguous subset of tokens within an entire sequence.
Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)
- Definition:
- Minimum Token Sequence (MTS): A fixed-size subsequence of tokens taken from the original sequence; the fundamental subsequence unit.
- Minimum Token Sequence Unit (MTSU): The fundamental unit used in indexing, candidating, and comparing within the system.
- Characteristics:
- An MTSU is calculated by hashing an MTS.
- Because an MTS consists of several tokens, a given MTS is less likely to be duplicated than a single token.
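A minimal sketch of how MTSs and MTSUs might be derived from a token sequence. Several points are assumptions rather than part of this glossary: windows overlap with a stride of one token, the hash is BLAKE2b truncated to 8 bytes, and the names `mts_windows` and `mtsu` are hypothetical.

```python
import hashlib


def mts_windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (assumed stride of 1)."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...], digest_size: int = 8) -> bytes:
    """Hash one MTS into a fixed-size MTSU (BLAKE2b is an assumed choice)."""
    h = hashlib.blake2b(digest_size=digest_size)
    for tok in mts:
        data = tok.encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()


tokens = ["int", "x", "=", "0", ";"]
units = [mtsu(w) for w in mts_windows(tokens, size=3)]
print(len(units), len(units[0]))  # 3 windows, each hashed to 8 bytes
```

A sequence shorter than the window size yields no MTS at all, so such files would be invisible to any MTSU-based lookup.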
Tokenized File
- Definition: A tokenized version of a file.
- Characteristics:
- Generated by a tokenizer.
- Consists of a sequence of pairs, each made up of:
- Index of the token
- Original characters (string)
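One possible in-memory shape for a Tokenized File is sketched below. It assumes that "index of token" means the token's ordinal position in the sequence, which this glossary does not state explicitly; `TokenEntry` and `to_tokenized_file` are hypothetical names.

```python
from typing import NamedTuple


class TokenEntry(NamedTuple):
    index: int  # assumed: ordinal position of the token in the sequence
    text: str   # original characters of the token


def to_tokenized_file(tokens: list[str]) -> list[TokenEntry]:
    """Pair each token with its position to form a Tokenized File."""
    return [TokenEntry(i, t) for i, t in enumerate(tokens)]


tokenized = to_tokenized_file(["return", "x", ";"])
print(tokenized[1])  # TokenEntry(index=1, text='x')
```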
Query, Query Request, and Query Result
!!TODO: Clearly define Candidate Query and Query
- Definition:
- Query: The process of searching the Source for candidate files that share a common subsequence with the Target.
- Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
- Query Result: The output of a Query, which pairs each Target file with a list of candidate Source files.
- Components:
- Encompasses all files within the Target dataset.
- Searches through candidate Source files and their associated project metadata using Candidate Queries.
- Identifies source files that contain token subsequences that match any subsequence within target files.
- Types:
- Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
- Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
- Result:
- Pair of Source File and Target File: Links the Source file with the corresponding Target file.
- Pair of Subsequence Indices: Indicates the start and end indices of the matching subsequences in both the Source and Target files.
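The result shape described above (a file pair plus a pair of index ranges) can be illustrated with a naive Exact Query over two token sequences. This is a sketch only: the `Match` record, the half-open index convention, and the dynamic-programming search for the single longest common run are all assumptions, not the system's actual comparison method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Match:
    source_file: str
    target_file: str
    source_span: tuple[int, int]  # (start, end) token indices, end exclusive (assumed)
    target_span: tuple[int, int]


def longest_exact_match(source_file: str, source: list[str],
                        target_file: str, target: list[str]) -> Optional[Match]:
    """Find the longest contiguous token run shared by both sequences."""
    best_len = best_s_end = best_t_end = 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(source) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if source[i - 1] == target[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_s_end, best_t_end = cur[j], i, j
        prev = cur
    if best_len == 0:
        return None
    return Match(source_file, target_file,
                 (best_s_end - best_len, best_s_end),
                 (best_t_end - best_len, best_t_end))


m = longest_exact_match("src.c", ["x", "a", "b", "c", "y"],
                        "tgt.c", ["a", "b", "c", "z"])
print(m.source_span, m.target_span)  # (1, 4) (0, 3)
```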
Candidate Query
!!TODO: Clearly define Candidate Query and Query
- Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
- Characteristics:
- Operates over the entire Source dataset.
- Allows false positives but does not allow false negatives.
- Does not directly support Similar Queries.
- However, similar functionality can be achieved using a smaller MTS size.
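The effect of MTS size on near-matches can be seen in a small experiment. The sketch below assumes overlapping windows with a stride of one and compares raw token windows instead of hashed MTSUs to keep it short; `windows` and `is_candidate` are hypothetical helpers.

```python
def windows(tokens: list[str], size: int) -> set[tuple[str, ...]]:
    """All contiguous token windows of the given size."""
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def is_candidate(source: list[str], target: list[str], size: int) -> bool:
    """A Source file is a candidate if it shares at least one window with the Target."""
    return not windows(source, size).isdisjoint(windows(target, size))


source = "a b c d e f".split()
target = "a b X d e f".split()  # one token differs from the source

print(is_candidate(source, target, 4))  # False: every 4-token window contains the edit
print(is_candidate(source, target, 3))  # True: the window ('d', 'e', 'f') survives
```

With the larger size the edited file is missed entirely, while the smaller size still surfaces it, at the cost of more false positives across the Source.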
Batch Query
- Definition: A method of processing multiple queries simultaneously to improve efficiency.
- Characteristics:
- .
Index
!!TODO: Clearly define Index and Candidate Index
- Definition: A data structure built from the list of files returned by a Candidate Query.
- Characteristics:
- Optimized for detailed search operations based on Exact and Similar Queries.
Candidate Index
- Definition: The initial indexing for the entire Source.
- Characteristics:
- Supports Candidate Queries by providing a file-level listing.
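A file-level listing of this kind is commonly built as an inverted index. The sketch below assumes a plain in-memory mapping from each MTSU to the set of Source file IDs that contain it; the glossary does not prescribe this layout, and `build_candidate_index` is a hypothetical name.

```python
from collections import defaultdict


def build_candidate_index(files: dict[str, list[bytes]]) -> dict[bytes, set[str]]:
    """Map every MTSU to the IDs of the Source files that contain it."""
    index: dict[bytes, set[str]] = defaultdict(set)
    for file_id, mtsus in files.items():
        for unit in mtsus:
            index[unit].add(file_id)
    return dict(index)


# Toy one-byte MTSUs stand in for real hash values.
source_files = {"f1": [b"\x01", b"\x02"], "f2": [b"\x02", b"\x03"]}
candidate_index = build_candidate_index(source_files)
print(sorted(candidate_index[b"\x02"]))  # ['f1', 'f2']
```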
Partitioning
- Definition: The process of dividing the MTSUs in a Query Request to optimize Finding.
- Characteristics:
- The Index is sorted by MTSU.
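Because the Index is sorted by MTSU, one plausible way to divide a Query Request is by value range, so that each group touches only one contiguous slice of the Index. The sketch below assumes range partitioning with caller-supplied boundary keys; the glossary does not say how the division is actually done, and `partition_mtsus` is a hypothetical name.

```python
from bisect import bisect_right


def partition_mtsus(mtsus: list[bytes], boundaries: list[bytes]) -> list[list[bytes]]:
    """Split MTSUs into len(boundaries) + 1 sorted groups.

    Group k holds values v with boundaries[k-1] <= v < boundaries[k];
    `boundaries` must be sorted in ascending order.
    """
    groups: list[list[bytes]] = [[] for _ in range(len(boundaries) + 1)]
    for unit in sorted(set(mtsus)):  # deduplicate, then keep each group sorted
        groups[bisect_right(boundaries, unit)].append(unit)
    return groups


parts = partition_mtsus([b"\x90", b"\x10", b"\x40", b"\x7f", b"\x10"],
                        boundaries=[b"\x40", b"\x80"])
print(parts)  # [[b'\x10'], [b'@', b'\x7f'], [b'\x90']]
```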
File Archiver
- Definition: Returns the Source File for a given Source File ID.
- Characteristics:
Indexer
- Definition: Creates the Index from Source datasets.
- Characteristics:
- Processes Source files
- Generates Tokenized Files from Source Files.
- Generates the Index from the MTSUs of the Tokenized Files and the metadata of the Source Files.
Querier
- Definition: Creates a Query Request from Target Files.
- Characteristics:
- Processes Target Files
- Generates Tokenized Files from Target Files.
- Generates a Query Request from the MTSUs of the Tokenized Files.
- MTSU list and relevant metadata of Target Files.
- Grouped/deduplicated MTSUs, optimized for the query operation.
- Grouping on Querier or Indexer?
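The grouping and deduplication step could look like the sketch below, which assumes it happens on the Querier side (the question above is still open). Each distinct MTSU appears once in the request and remembers every Target file and position it came from; `build_query_request` and the request layout are hypothetical.

```python
from collections import defaultdict


def build_query_request(
    target_mtsus: dict[str, list[bytes]],
) -> dict[bytes, list[tuple[str, int]]]:
    """Group Target MTSUs so that each distinct value is looked up only once."""
    grouped: dict[bytes, list[tuple[str, int]]] = defaultdict(list)
    for file_id, units in target_mtsus.items():
        for position, unit in enumerate(units):
            grouped[unit].append((file_id, position))
    # Sorting by MTSU lines the request up with an Index sorted the same way.
    return dict(sorted(grouped.items()))


request = build_query_request({
    "t1": [b"\x02", b"\x01", b"\x02"],
    "t2": [b"\x01"],
})
print(list(request))  # [b'\x01', b'\x02']
```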
Finder
- Definition: Identifies candidate files during query operations.
- Characteristics:
- From a Query Request, gets the candidate file list (or candidate file list ID) for each MTSU from the Index.
Merger
- Definition: Merges the candidate file lists based on the query results.
- Characteristics:
- Generates the final candidate file list for the Query Request.
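How the Finder and Merger fit together can be sketched over a toy Index. The per-MTSU lookup follows the Finder description; ranking the merged list by the number of matching MTSUs is an assumption added here for illustration, and `find` and `merge` are hypothetical names.

```python
from collections import Counter


def find(index: dict[bytes, list[str]], request: list[bytes]) -> dict[bytes, list[str]]:
    """Finder: look up the candidate file list for each MTSU in the request."""
    return {unit: index.get(unit, []) for unit in request}


def merge(found: dict[bytes, list[str]]) -> list[tuple[str, int]]:
    """Merger: combine per-MTSU lists into one list of (file ID, hit count)."""
    counts: Counter[str] = Counter()
    for file_ids in found.values():
        counts.update(set(file_ids))  # count each file once per MTSU
    # Most hits first; ties broken by file ID for a stable order (assumed ranking).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


toy_index = {b"A": ["f1", "f2"], b"B": ["f2"], b"C": ["f3"]}
final_candidates = merge(find(toy_index, [b"A", b"B", b"Z"]))
print(final_candidates)  # [('f2', 2), ('f1', 1)]
```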
Hash Function
- Definition: A function that converts input data (MTS) into a fixed-size string of bytes (MTSU).
Language Family
- Definition: A grouping of programming languages based on the similarity of their actual representation.
- Characteristics:
- Languages within the same family share common grammatical structures, enabling identical token sequences across different languages.
- A language can belong to multiple language families if it shares patterns with multiple groups.
- Facilitates partitioning of Source data to optimize search and indexing based on language-specific features.
Parameter
- Definition: User-defined criteria that refine and tailor query operations.
- Types:
- Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
- Search Unit (by token):
- MTSU Size:
- Definition: The smallest unit size used for hashing subsequences during the search process.
- Minimum Search Unit Size:
- Definition: The smallest unit size considered when performing searches.
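The parameters above could be carried in a small validated record, as in the sketch below. The field names, the representation of the Similarity Rate as a fraction in (0, 1], and the rule that the Minimum Search Unit Size cannot be smaller than the MTSU Size are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    similarity_rate: float      # assumed to be a fraction in (0.0, 1.0]
    mtsu_size: int              # tokens per hashed unit
    min_search_unit_size: int   # shortest match, in tokens, worth reporting

    def __post_init__(self) -> None:
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0.0, 1.0]")
        if self.mtsu_size < 1:
            raise ValueError("mtsu_size must be at least 1")
        # Assumed constraint: a match shorter than one MTSU cannot be found by hashing.
        if self.min_search_unit_size < self.mtsu_size:
            raise ValueError("min_search_unit_size must be >= mtsu_size")


params = QueryParameters(similarity_rate=0.8, mtsu_size=5, min_search_unit_size=20)
print(params.mtsu_size)  # 5
```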