Glossary

File

Definition: A single code and/or text container.
Characteristics:
- Consists of a sequence of original tokens (raw source code).
- The Indexing, Candidating, and Comparing phases do not read the file directly.
  - Only the viewer interface reads the file directly.
  Target
  - Definition: A user-provided dataset intended for analysis.
  - Characteristics:
    - Consists of files supplied by the user.
    - Undergoes immediate tokenization and/or indexing upon receipt.
    - Remains separate and is never incorporated into the Source.
  Source
  - Definition: A repository of datasets maintained by the service provider.
  - Characteristics:
    - Acts as the reference pool for matching against the Target.
    - Receives updates infrequently, typically on a weekly or monthly schedule.
    - Pre-indexed and readily available when accessed by the user.
    - The duration of Candidate Indexing does not influence data structure design.
  Dataset
  - Definition: An aggregate collection of files.
  - Types:
    - Source Dataset: The comprehensive collection of files maintained as the Source.
      - Interchangeably referred to as "Source."
    - Target Dataset: The complete collection of files provided by the user.
      - Interchangeably referred to as "Target."
  Token
  - Definition: The smallest unit of data derived from the content of a file.
  - Tokenizing Methods:
    - Code Data: Produced by a tokenizer based on the syntax of the code.
    - Text Data (Natural Language):
      - Western Languages: Tokens are separated by spaces.
      - Chinese: Each character is treated as an individual token.
      - Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
      - Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context.
      - Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
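
The two simplest text rules above (space-separated tokens, and one token per character for Chinese) can be sketched as follows. The function names are hypothetical and only illustrate the rules as stated; Korean and Japanese are left out because, as noted, they need specialized morphological tokenizers.

```python
def tokenize_space_separated(text: str) -> list[str]:
    """Split on runs of whitespace (the rule given for Western languages)."""
    return text.split()


def tokenize_per_character(text: str) -> list[str]:
    """Treat each non-whitespace character as its own token (the rule given for Chinese)."""
    return [ch for ch in text if not ch.isspace()]


print(tokenize_space_separated("the quick  brown fox"))
print(tokenize_per_character("你好 世界"))
```
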
  Sequence and Subsequence
  - Sequence: An ordered list of tokens derived from a file or segment.
  Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)
  
  Definition:
  
  Minimum Token Sequence (MTS): A fixed-size subsequence of tokens taken from the original sequence; the fundamental subsequence unit.
  
  Minimum Token Sequence Unit (MTSU): The fundamental unit used for indexing, candidating, and comparing within the system.
  
  Characteristics:
  
  MTSU is calculated by hashing MTS.
  
  Because an MTS consists of several tokens, it is less likely to be duplicated than a single token.
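
The relationship between a token sequence, its MTSs, and the resulting MTSUs can be sketched as below. This is a minimal illustration under several assumptions that the glossary does not state: MTSs are taken as overlapping windows with a stride of one token, the hash is BLAKE2b, and the MTSU is 8 bytes. The function names are hypothetical.

```python
import hashlib


def minimum_token_sequences(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (stride 1 is an assumption)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...], digest_size: int = 8) -> bytes:
    """Hash one MTS into a fixed-size MTSU (BLAKE2b and 8 bytes are illustrative choices)."""
    h = hashlib.blake2b(digest_size=digest_size)
    for token in mts:
        data = token.encode("utf-8")
        # Length-prefix each token so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()


tokens = ["int", "x", "=", "1", ";"]
windows = minimum_token_sequences(tokens, size=3)
units = [mtsu(w) for w in windows]
print(windows)
print([u.hex() for u in units])
```

With five tokens and a window size of three, this yields three MTSs and therefore three MTSUs, each exactly 8 bytes long regardless of how long the underlying tokens are.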
  
  Tokenized File
  
  Definition: A tokenized version of a file.
  
  Characteristics:
  
  Generated by a tokenizer.
  
  Consists of a sequence of pairs, each containing:
  
  The index of the token.
  
  The original characters of the token (string).
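
One possible in-memory shape for a Tokenized File is sketched below. The names `TokenEntry` and `to_tokenized_file` are hypothetical, and the sketch assumes that "index of token" means the token's position in the sequence; if it instead refers to an ID in a token vocabulary, the `index` field would be filled differently.

```python
from typing import NamedTuple


class TokenEntry(NamedTuple):
    index: int  # assumed here to be the token's position in the sequence
    text: str   # the original characters of the token


def to_tokenized_file(tokens: list[str]) -> list[TokenEntry]:
    """Pair each token with its index, preserving the original order."""
    return [TokenEntry(i, t) for i, t in enumerate(tokens)]


tokenized = to_tokenized_file(["return", "x", ";"])
print(tokenized[1])
```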
  
  Query, Query Request, and Query Result
  
  !!TODO: Clearly define Candidate Query and Query
  
  Definition:
  
  Query: The process of searching the Source for candidate files that share a common subsequence with the Target.
  
  Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
  
  Query Result: The output of a Query, which pairs each Target File with a list of candidate Source files.
  
  Components:
  
  Encompasses all files within the Target dataset.
  
  Searches through candidate Source files and their associated project metadata using Candidate Queries.
  
  Identifies source files that contain token subsequences that match any subsequence within target files.
  
  Types:
  
  Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
  
  Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
  
  Result:
  
  Pair of Source File and Target File: Links the Source file with the corresponding Target file.
  
  Pair of Subsequence Indices: This indicates the start and end indices of the matching subsequences in both the Source and Target files.
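
The result shape for an Exact Query (a file pair plus start and end indices on both sides) can be illustrated with a small sketch. It uses the standard library's `difflib.SequenceMatcher` purely as a stand-in for the comparison step, which the glossary does not specify. The names `SubsequenceMatch` and `longest_exact_match` are hypothetical, and treating the end index as exclusive is an assumption.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional


class SubsequenceMatch(NamedTuple):
    source_start: int
    source_end: int  # exclusive (an assumption)
    target_start: int
    target_end: int  # exclusive (an assumption)


def longest_exact_match(source_tokens: list[str],
                        target_tokens: list[str]) -> Optional[SubsequenceMatch]:
    """Return index ranges of the longest identical token run, or None if there is none."""
    matcher = SequenceMatcher(None, source_tokens, target_tokens, autojunk=False)
    m = matcher.find_longest_match(0, len(source_tokens), 0, len(target_tokens))
    if m.size == 0:
        return None
    return SubsequenceMatch(m.a, m.a + m.size, m.b, m.b + m.size)


source = ["int", "x", "=", "1", ";", "return", "x", ";"]
target = ["y", "=", "2", ";", "return", "x", ";", "}"]
print(longest_exact_match(source, target))
```

Here the shared run is `; return x ;`, which occupies positions 4 to 8 in the Source tokens and 3 to 7 in the Target tokens.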
  
  Candidate Query
  
  !!TODO: Clearly define Candidate Query and Query
  
  Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
  
  Characteristics:
  
  Operates over the entire Source dataset.
  
  Allows false positives but does not allow false negatives.
  
  Does not directly support Similar Queries; however, similar functionality can be achieved by using a smaller MTS size.
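
The "false positives but no false negatives" property can be illustrated with a file-level lookup table. This is a sketch under assumptions: the Candidate Index is modeled as a plain mapping from MTSU to the set of Source file IDs containing it, and the function names are hypothetical. Any Source file that shares at least one MTSU with the Target is returned, so a true match cannot be missed; a returned file may still fail the later detailed comparison, for example because of a hash collision.

```python
from collections import defaultdict


def build_candidate_index(source_units: dict[str, set[bytes]]) -> dict[bytes, set[str]]:
    """Invert file -> MTSUs into MTSU -> files (a hypothetical index layout)."""
    index: dict[bytes, set[str]] = defaultdict(set)
    for file_id, units in source_units.items():
        for unit in units:
            index[unit].add(file_id)
    return dict(index)


def candidate_query(index: dict[bytes, set[str]], target_units: set[bytes]) -> set[str]:
    """Return every Source file that shares at least one MTSU with the Target."""
    candidates: set[str] = set()
    for unit in target_units:
        candidates |= index.get(unit, set())
    return candidates


source_units = {
    "a.c": {b"\x01", b"\x02"},
    "b.c": {b"\x02", b"\x03"},
    "c.c": {b"\x04"},
}
index = build_candidate_index(source_units)
print(sorted(candidate_query(index, {b"\x02", b"\x09"})))
```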
  
  Batch Query
  
  Definition: A method of processing multiple queries simultaneously to improve efficiency.
  
  Characteristics:
  
  .
  
  Index
  
  !!TODO: Clearly define Index and Candidate Index
  
  Definition: A data structure built from the list of files returned by a Candidate Query.
  
  Characteristics:
  
  Optimized for detailed search operations based on Exact and Similar Queries.
  
  Candidate Index
  
  Definition: The initial indexing for the entire Source.
  
  Characteristics:
  
  Supports Candidate Queries by providing a file-level listing.
  
  Partitioning
  
  Definition: The process of dividing the MTSUs in a Query Request into groups to optimize finding.
  
  Characteristics:
  
  The Index is sorted by MTSU.
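
Because the Index is sorted by MTSU, the MTSUs of a Query Request can be grouped by the index range they fall into, so that each range is visited once. The sketch below assumes a hypothetical layout in which the Index is split at a sorted list of boundary values; the glossary does not define how partitions are actually chosen.

```python
from bisect import bisect_right


def partition_units(units: list[bytes], boundaries: list[bytes]) -> dict[int, list[bytes]]:
    """Group MTSUs by partition number.

    `boundaries` holds the sorted lower bounds of partitions 1..n (hypothetical layout);
    partition 0 covers everything below boundaries[0].
    """
    parts: dict[int, list[bytes]] = {}
    for unit in sorted(set(units)):
        parts.setdefault(bisect_right(boundaries, unit), []).append(unit)
    return parts


boundaries = [b"\x40", b"\x80", b"\xc0"]
units = [b"\x90\x01", b"\x10\xff", b"\x85\x00", b"\xc0\x00"]
partitions = partition_units(units, boundaries)
print(partitions)
```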
  
  File Archiver
  
  Definition: Returns a Source File given its file ID.
  
  Characteristics:
  
  Indexer
  
  Definition: Creates the Index from Source datasets.
  
  Characteristics:
  
  Processes Source Files.
  
  Generates Tokenized Files from Source Files.
  
  Generates the Index from the Tokenized Files, using their MTSUs and the metadata of the Source Files.
  
  Querier
  
  Definition: Creates a Query Request from Target Files.
  
  Characteristics:
  
  Processes Target Files.
  
  Generates Tokenized Files from Target Files.
  
  Generates a Query Request from the MTSUs of the Tokenized Files.
  
  The MTSU list and relevant metadata of the Target Files.
  
  Grouped and deduplicated MTSUs, optimized for query operations.
  
  Open question: should grouping be done in the Querier or the Indexer?
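
The grouping and deduplication step can be sketched as follows, assuming it happens in the Querier (the question above is still open). Each distinct MTSU appears once in the request, together with the Target files that contain it. The function name and the request layout are hypothetical.

```python
from collections import defaultdict


def build_query_request(target_units: dict[str, list[bytes]]) -> dict[bytes, list[str]]:
    """Map each distinct MTSU to the Target files containing it, in sorted MTSU order."""
    grouped: dict[bytes, set[str]] = defaultdict(set)
    for file_id, units in target_units.items():
        for unit in units:
            grouped[unit].add(file_id)
    return {unit: sorted(grouped[unit]) for unit in sorted(grouped)}


target_units = {
    "t1.py": [b"\x02", b"\x01", b"\x02"],  # b"\x02" occurs twice in this file
    "t2.py": [b"\x02"],
}
request = build_query_request(target_units)
print(request)
```

Emitting the MTSUs in sorted order is a design choice that matches an Index sorted by MTSU, since lookups can then proceed in a single forward pass.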
  
  Finder
  
  Definition: Selects candidate files during query operations.
  
  Characteristics:
  
  For each MTSU in a Query Request, retrieves the candidate file list (or candidate file list ID) from the Index.
  
  Merger
  
  Definition: Merges the candidate file lists based on the query results.
  
  Characteristics:
  
  Generates the final candidate file list for the Query Request.
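
A minimal sketch of the merge step is shown below. It assumes the input is one candidate file list per MTSU and combines them into a single list. Ranking files by how many MTSUs they share with the request is an illustrative choice, not something the glossary specifies, and the function name is hypothetical.

```python
from collections import Counter


def merge_candidates(per_unit_results: dict[bytes, list[str]]) -> list[tuple[str, int]]:
    """Combine per-MTSU candidate lists into one list of (file_id, shared MTSU count)."""
    hits: Counter = Counter()
    for files in per_unit_results.values():
        hits.update(set(files))  # count each file at most once per MTSU
    # Most shared MTSUs first; ties broken by file ID for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


per_unit_results = {
    b"\x01": ["a.c", "b.c"],
    b"\x02": ["b.c"],
    b"\x03": ["b.c", "c.c"],
}
merged = merge_candidates(per_unit_results)
print(merged)
```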
  
  Hash Function
  
  Definition: A function that converts input data (an MTS) into a fixed-size string of bytes (an MTSU).
  
  Language Family
  
  Definition: A grouping of programming languages based on the similarity of their actual representation.
  
  Characteristics:
  
  Languages within the same family share common grammatical structures, enabling identical token sequences across different languages.
  
  A language can belong to multiple language families if it shares patterns with multiple groups.
  
  Facilitates partitioning of Source data to optimize search and indexing based on language-specific features.
  
  Parameter
  
  Definition: User-defined criteria that refine and tailor query operations.
  
  Types:
  
  Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
  
  Search Unit (by token):
  
  MTSU Size:
  
  Definition: The smallest unit size used for hashing subsequences during the search process.
  
  Minimum Search Unit Size:
  
  Definition: The smallest unit size considered when performing searches.
