Glossary

File

Definition: A single code and/or text container.
Characteristics:
- Consists of a sequence of original tokens (raw source code).
- The Indexing, Candidating, and Comparing phases do not read the file directly.
  - Only the viewer interface reads the file directly.

Target

Definition: A user-provided dataset intended for analysis.
Characteristics:
- Consists of files supplied by the user.
- Undergoes immediate tokenization and/or indexing upon receipt.
- Remains separate and is never incorporated into the Source.

Source

Definition: A repository of datasets maintained by the service provider.
Characteristics:
- Acts as the reference pool for matching against the Target.
- Receives updates infrequently, typically on a weekly or monthly schedule.
- Pre-indexed and readily available when accessed by the user.
- The duration of Candidate Indexing does not influence data structure design.

Dataset

Definition: An aggregate collection of files.
Types:
- Source Dataset: The comprehensive collection of files maintained as the Source.
  - Interchangeably referred to as "Source."
- Target Dataset: The complete collection of files provided by the user.
  - Interchangeably referred to as "Target."

Token

Definition: The smallest unit of data derived from the content of a file.
Tokenization Methods:
- Code Data: Produced by a tokenizer based on the syntax of the code.
- Text Data (Natural Language):
  - Western Languages: Tokens are separated by spaces.
  - Chinese: Each character is treated as an individual token.
  - Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
  - Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context.
  - Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
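The two simplest rules above can be sketched as follows. This is only an illustration: the function names are hypothetical, and Korean and Japanese are deliberately left out because they need a dedicated morphological tokenizer rather than a few lines of code.

```python
def tokenize_whitespace(text: str) -> list[str]:
    """Western-language rule: tokens are separated by whitespace."""
    return text.split()


def tokenize_per_character(text: str) -> list[str]:
    """Chinese rule: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]


western_tokens = tokenize_whitespace("the quick brown fox")
chinese_tokens = tokenize_per_character("我喜欢编程")
```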

Definition:
- Sequence: An ordered list of tokens derived from a file or segment.
- Subsequence: A contiguous subset of tokens within an entire sequence.

Definition:
- Minimum Token Sequence (MTS): A fixed-size subsequence of tokens taken from the original sequence; the fundamental subsequence unit.
- Minimum Token Sequence Unit (MTSU): The fundamental unit used for indexing, candidating, and comparing within the system.
Characteristics:
- An MTSU is calculated by hashing an MTS.
- Because an MTS consists of several tokens, it is less likely to be duplicated than a single token.
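A minimal sketch of the relationship between MTS and MTSU, under several assumptions not stated in this glossary: MTS windows overlap and advance one token at a time, tokens are joined with a separator byte before hashing, and the hash is BLAKE2b truncated to 8 bytes. The function names are hypothetical.

```python
import hashlib


def mts_windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (assumed sliding by one)."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...], digest_size: int = 8) -> bytes:
    """Hash one MTS into a fixed-size byte string (the MTSU)."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(mts).encode("utf-8")
    return hashlib.blake2b(joined, digest_size=digest_size).digest()


tokens = ["a", "b", "c", "d"]
windows = mts_windows(tokens, size=3)
units = [mtsu(w) for w in windows]
```

A sequence shorter than the window size yields no MTS at all, which is one reason the window size matters for short files.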

Tokenized File

Definition: A tokenized version of a file.
Characteristics:
- Generated by a tokenizer.
- Consists of a sequence of pairs, each containing:
  - Index of the token
  - Original characters (string)
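One possible in-memory shape for this structure is sketched below. It assumes that "index" means the token's position in the sequence (it could equally be a character offset into the file); the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenEntry:
    index: int  # assumed: position of the token within the sequence
    text: str   # original characters of the token


def to_tokenized_file(tokens: list[str]) -> list[TokenEntry]:
    """Pair every token with its position."""
    return [TokenEntry(i, t) for i, t in enumerate(tokens)]


tokenized = to_tokenized_file(["int", "x", ";"])
```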

!!TODO: Clearly define Candidate Query and Query

Definition:
- Query: The process of searching the Source for candidate files that share a common subsequence with the Target.
- Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
- Query Result: The output of a Query, which pairs each Target file with a list of candidate Source files.
Components:
- Encompasses all files within the Target dataset.
- Searches through candidate Source files and their associated project metadata using Candidate Queries.
- Identifies Source files that contain token subsequences matching any subsequence within the Target files.
Types:
- Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
- Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
Result:
- Pair of Source File and Target File: Links the Source file with the corresponding Target file.
- Pair of Subsequence Indices: The start and end indices of the matching subsequences in both the Source and Target files.
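To make the result shape concrete, the sketch below compares one Source token sequence with one Target token sequence and reports index pairs for exact matches. It uses Python's `difflib.SequenceMatcher` as a stand-in comparer, which is not the system's comparing method and only reports non-overlapping matches in order. The function name and the `min_len` parameter are hypothetical, and end indices are exclusive here by assumption.

```python
from difflib import SequenceMatcher


def exact_matches(source_tokens, target_tokens, min_len):
    """Return ((src_start, src_end), (tgt_start, tgt_end)) for each match."""
    matcher = SequenceMatcher(None, source_tokens, target_tokens, autojunk=False)
    results = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:  # also drops the zero-length sentinel block
            results.append((
                (block.a, block.a + block.size),
                (block.b, block.b + block.size),
            ))
    return results


source = "int a = b + c ; return a ;".split()
target = "x = 1 ; int a = b + c ; y".split()
matches = exact_matches(source, target, min_len=3)
```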

Candidate Query

Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
Characteristics:
- Operates over the entire Source dataset.
- Allows false positives but does not allow false negatives.
- Does not directly support Similar Queries.
  - However, similar functionality can be achieved using a smaller MTS size.
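The file-level lookup described above can be sketched as a union over an MTSU-to-files mapping. The mapping layout and the function name are assumptions for illustration. Every Source file that shares at least one MTSU with the Target is returned, so no true match is missed; a file may still be returned because of a hash collision or an overlap too short to matter, which is where false positives come from.

```python
def candidate_query(index, target_mtsus):
    """Return the set of Source files sharing at least one MTSU with the Target."""
    candidates = set()
    for unit in target_mtsus:
        candidates.update(index.get(unit, ()))
    return candidates


# Toy index: MTSU bytes -> names of Source files containing that MTSU.
toy_index = {
    b"\x01": {"a.c", "b.c"},
    b"\x02": {"a.c"},
    b"\x03": {"c.c"},
}
found = candidate_query(toy_index, [b"\x02", b"\x03", b"\x09"])
```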

Definition: A method of processing multiple queries simultaneously to improve efficiency.
Characteristics:
- .

!!TODO: Clearly define Index and Candidate Index

Definition: A data structure built from the list of files returned by a Candidate Query.
Characteristics:
- Optimized for detailed search operations based on Exact and Similar Queries.

Definition: The initial indexing for the entire Source.
Characteristics:
- Supports Candidate Queries by providing a file-level listing.

Definition: The process of dividing the MTSUs in a Query Request to optimize Finding.
Characteristics:
- The Index is sorted by MTSU.

Definition: Creates the Index from Source datasets.
Characteristics:
- Processes Source files:
  - Generates Tokenized Files from the Source files.
  - Generates the Index from the MTSUs of the Tokenized Files and the metadata of the Source files.
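The two steps above can be sketched end to end as follows. Everything concrete here is an assumption: whitespace splitting stands in for the real tokenizer, the window size is 3 tokens, the hash is BLAKE2b truncated to 8 bytes, and the only metadata kept is the file name.

```python
import hashlib
from collections import defaultdict

MTS_SIZE = 3  # assumed window size in tokens


def mtsu_list(tokens, size=MTS_SIZE):
    """Hash every contiguous window of `size` tokens into an 8-byte MTSU."""
    units = []
    for i in range(len(tokens) - size + 1):
        window = "\x1f".join(tokens[i:i + size]).encode("utf-8")
        units.append(hashlib.blake2b(window, digest_size=8).digest())
    return units


def build_index(source_files):
    """Return (MTSU, sorted file names) pairs, ordered by MTSU."""
    postings = defaultdict(set)
    for name, content in source_files.items():
        tokens = content.split()  # stand-in tokenizer
        for unit in mtsu_list(tokens):
            postings[unit].add(name)
    return sorted((unit, sorted(names)) for unit, names in postings.items())


index = build_index({
    "a.c": "int x = y + z ;",
    "b.c": "return x = y + z ;",
})
```

The two toy files differ only in their first token, so four of the six distinct MTSUs point at both files.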

Definition: Creates a query from Target files.
Characteristics:
- Processes Target files:
  - Generates Tokenized Files from the Target files.
  - Generates a query request from the MTSUs of the Tokenized Files.
    - Contains the MTSU list and relevant metadata of the Target files.
    - MTSUs are grouped and deduplicated to optimize the query operation.
      - Grouping on Querier or Indexer?
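One way the grouping and deduplication step might look is sketched below. It assumes the MTSUs of each Target file are already computed, keeps file name and window position as the only metadata, and sorts the distinct MTSUs so they can be walked against an MTSU-ordered index. The request layout and function name are hypothetical.

```python
from collections import defaultdict


def build_query_request(target_units):
    """Group identical MTSUs and remember where each one occurred."""
    grouped = defaultdict(list)
    for file_name, units in target_units.items():
        for position, unit in enumerate(units):
            grouped[unit].append((file_name, position))
    return {
        "mtsus": sorted(grouped),        # each distinct MTSU appears once
        "occurrences": dict(grouped),    # MTSU -> [(file, position), ...]
    }


request = build_query_request({
    "t1.c": [b"\x01", b"\x02", b"\x01"],
    "t2.c": [b"\x02", b"\x03"],
})
```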

Definition: Selects candidate files during query operations.
Characteristics:
- For a query request, retrieves the candidate file list (or candidate file list ID) for each MTSU from the Index.

Definition: Merges the candidate file lists based on the query results.
Characteristics:
- Combines the candidate file lists returned for the individual MTSUs.
- Generates the final candidate file list for the query request.
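A minimal sketch of the merge step, assuming the per-MTSU results arrive as a mapping from MTSU to file names. Ranking by the number of distinct matching MTSUs and the `min_hits` threshold are illustrative choices, not something this glossary specifies.

```python
from collections import Counter


def merge_candidates(per_mtsu_hits, min_hits=1):
    """Merge per-MTSU file lists into one list, most-matched files first."""
    counts = Counter()
    for files in per_mtsu_hits.values():
        counts.update(set(files))  # count each file once per MTSU
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, hits in ranked if hits >= min_hits]


hits = {
    b"\x01": ["a.c", "b.c"],
    b"\x02": ["a.c"],
    b"\x03": ["c.c", "a.c"],
}
final_candidates = merge_candidates(hits)
strong_candidates = merge_candidates(hits, min_hits=2)
```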

Definition: A function that converts input data (MTS) into a fixed-size string of bytes (MTSU).

Language Family

Definition: A grouping of programming languages based on the similarity of their actual representation.
Characteristics:
- Languages within the same family share common grammatical structures, enabling identical token sequences across different languages.
- A language can belong to multiple language families if it shares patterns with multiple groups.
- Facilitates partitioning of Source data to optimize search and indexing based on language-specific features.
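The partitioning idea can be sketched as below. The family names and the language-to-family table are invented for illustration only; the point is that a language listed under several families places its files in several partitions.

```python
from collections import defaultdict

# Hypothetical table: language -> families it belongs to.
LANGUAGE_FAMILIES = {
    "c": {"c-like"},
    "cpp": {"c-like"},
    "java": {"c-like", "jvm"},
    "kotlin": {"jvm"},
}


def partition_by_family(files):
    """Group (file name, language) pairs into one partition per family."""
    partitions = defaultdict(list)
    for name, language in files:
        for family in sorted(LANGUAGE_FAMILIES.get(language, ())):
            partitions[family].append(name)
    return dict(partitions)


parts = partition_by_family([
    ("a.c", "c"),
    ("B.java", "java"),
    ("C.kt", "kotlin"),
])
```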

Definition: User-defined criteria that refine and tailor query operations.
Types:
- Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
- Search Unit (by token):
  - MTSU Size:
    - Definition: The smallest unit size used for hashing a subsequence during the search process.
  - Minimum Search Unit Size:
    - Definition: The smallest unit size considered when performing searches.
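These parameters could be carried in a small validated record such as the sketch below. The field names, default values, and the assumption that the similarity rate is a fraction in (0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    similarity_rate: float = 1.0     # assumed: 1.0 means an exact match
    mtsu_size: int = 5               # tokens hashed per unit (placeholder)
    min_search_unit_size: int = 5    # smallest match length, in tokens (placeholder)

    def __post_init__(self):
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
        if self.mtsu_size < 1 or self.min_search_unit_size < 1:
            raise ValueError("unit sizes must be at least 1 token")


params = QueryParameters(similarity_rate=0.8, mtsu_size=4, min_search_unit_size=8)
```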