Glossary

File

  • Definition: A single code and/or text container.
  • Characteristics:
    • Consists of a sequence of original tokens (raw source code).
    • The Indexing, Candidating, and Comparing phases do not read the file directly.
      • Only the viewer interface reads the file directly.

Target

  • Definition: A user-provided dataset intended for analysis.
  • Characteristics:
    • Consists of files supplied by the user.
    • Undergoes immediate tokenization and/or indexing upon receipt.
    • Remains separate and is never incorporated into the Source.

Source

  • Definition: A repository of datasets maintained by the service provider.
  • Characteristics:
    • Acts as the reference pool for matching against the Target.
    • Receives updates infrequently, typically on a weekly or monthly schedule.
    • Pre-indexed and readily available when accessed by the user.
    • The duration of Candidate Indexing does not influence data structure design.

Dataset

  • Definition: An aggregate collection of files.
  • Types:
    • Source Dataset: The comprehensive collection of files maintained as the Source.
      • Interchangeably referred to as "Source."
    • Target Dataset: The complete collection of files provided by the user.
      • Interchangeably referred to as "Target."

Token

  • Definition: The smallest unit of data derived from the content of a file.
  • Tokenization Methods:
    • Code Data: Produced by a tokenizer based on the syntax of the code.
    • Text Data (Natural Language):
      • Western Languages: Tokens are separated by spaces.
      • Chinese: Each character is treated as an individual token.
      • Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
      • Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context.
      • Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
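
The two simplest rules above can be sketched in Python as follows. This is a minimal illustration, not the system's tokenizer: the function names are hypothetical, and Korean and Japanese are left out on purpose because, as noted, they need specialized tokenizers.

```python
def tokenize_by_spaces(text: str) -> list[str]:
    """Western-language rule: tokens are separated by whitespace."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops empty strings.
    return text.split()


def tokenize_by_character(text: str) -> list[str]:
    """Chinese rule: each character is treated as an individual token.

    Assumption: whitespace characters are skipped rather than emitted.
    """
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    print(tokenize_by_spaces("the quick  brown fox"))
    print(tokenize_by_character("你好 世界"))
```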

Sequence and Subsequence

  • Definition:
    • Sequence: An ordered list of tokens derived from a file or segment.
    • Subsequence: A contiguous subset of tokens within an entire sequence.

Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)

  • Definition:
    • Minimum Token Sequence (MTS): A fixed-size subsequence of tokens taken from the original sequence; the fundamental subsequence unit.
    • Minimum Token Sequence Unit (MTSU): The fundamental unit used for indexing, candidating, and comparing within the system.
  • Characteristics:
    • An MTSU is calculated by hashing an MTS.
    • Because an MTS consists of several tokens, it is less likely to be duplicated than a single token.
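
As a rough sketch of how an MTS and its MTSU relate, the snippet below cuts a token sequence into fixed-size windows and hashes each one. Several details are assumptions made only for illustration: the windows overlap and advance one token at a time, the hash is BLAKE2b truncated to 8 bytes, and the names `split_into_mts` and `hash_mts` are hypothetical.

```python
import hashlib


def split_into_mts(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens.

    Assumption: windows overlap and advance one token at a time.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # If the sequence is shorter than `size`, the range is empty.
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def hash_mts(mts: tuple[str, ...]) -> bytes:
    """Hash one MTS into a fixed-size MTSU (8 bytes, chosen arbitrarily)."""
    # Join with a separator unlikely to occur inside a token so that
    # ("ab", "c") and ("a", "bc") do not yield the same byte string.
    data = "\x1f".join(mts).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).digest()


if __name__ == "__main__":
    tokens = ["int", "x", "=", "1", ";"]
    for mts in split_into_mts(tokens, size=3):
        print(mts, hash_mts(mts).hex())
```

Identical windows always hash to the same MTSU, which is what lets the hash stand in for the window during lookup.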

Tokenized File

  • Definition: A tokenized version of a file.
  • Characteristics:
    • Generated by a tokenizer.
    • Consists of a sequence of pairs, each containing:
      • Index of the token
      • Original characters (string)
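
One possible in-memory shape for a Tokenized File is shown below. It assumes that "index" means the ordinal position of the token within the sequence, which the glossary does not state; `TokenEntry` and `build_tokenized_file` are hypothetical names.

```python
from typing import NamedTuple


class TokenEntry(NamedTuple):
    index: int  # assumed: ordinal position of the token in the sequence
    text: str   # original characters of the token


def build_tokenized_file(tokens: list[str]) -> list[TokenEntry]:
    """Pair each token with its position to form a Tokenized File."""
    return [TokenEntry(i, tok) for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    for entry in build_tokenized_file(["a", "+", "b"]):
        print(entry.index, repr(entry.text))
```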

Query, Query Request, and Query Result

!!TODO: Clearly define Candidate Query and Query

  • Definition:
    • Query: The process of searching the Source for candidate files that share a common subsequence with the Target.
    • Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
    • Query Result: The output of a Query, which pairs each Target file with a list of candidate Source files.
  • Components:
    • Encompasses all files within the Target dataset.
    • Searches through candidate Source files and their associated project metadata using Candidate Queries.
    • Identifies Source files that contain token subsequences matching any subsequence within the Target files.
  • Types:
    • Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
    • Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
  • Result:
    • Pair of Source File and Target File: Links the Source file with the corresponding Target file.
    • Pair of Subsequence Indices: Indicates the start and end indices of the matching subsequences in both the Source and Target files.
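
The two result components can be pictured as a single record, sketched below. The field names are hypothetical, and the end index is assumed to be exclusive because the glossary does not say whether it is inclusive. The helper checks the Exact Query condition: the two spans must contain identical tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsequenceMatch:
    source_file: str
    target_file: str
    source_span: tuple[int, int]  # (start, end), end exclusive (assumed)
    target_span: tuple[int, int]  # (start, end), end exclusive (assumed)


def is_exact_match(match: SubsequenceMatch,
                   source_tokens: list[str],
                   target_tokens: list[str]) -> bool:
    """Return True when both spans hold precisely the same tokens."""
    s0, s1 = match.source_span
    t0, t1 = match.target_span
    return source_tokens[s0:s1] == target_tokens[t0:t1]


if __name__ == "__main__":
    source = ["a", "b", "c", "d"]
    target = ["x", "b", "c", "y"]
    m = SubsequenceMatch("s1.c", "t1.c", (1, 3), (1, 3))
    print(is_exact_match(m, source, target))
```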

Candidate Query

!!TODO: Clearly define Candidate Query and Query

  • Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
  • Characteristics:
    • Operates over the entire Source dataset.
    • Allows false positives but does not allow false negatives.
    • Does not directly support Similar Queries.
      • However, similar functionality can be achieved using a smaller MTS size.
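
The "false positives but no false negatives" property follows naturally from a hash-based lookup, as the toy sketch below tries to show. It is not the system's implementation: `weak_hash` is a deliberately collision-prone stand-in for the real hash function, and the index layout (hash value to set of file names) is an assumption.

```python
def weak_hash(mts: tuple[str, ...]) -> int:
    """Toy hash that collides easily: total token length modulo 8."""
    return sum(len(tok) for tok in mts) % 8


def build_index(source_files: dict[str, list[tuple[str, ...]]]) -> dict[int, set[str]]:
    """Map each hash value to the Source files containing an MTS with that hash."""
    index: dict[int, set[str]] = {}
    for name, mts_list in source_files.items():
        for mts in mts_list:
            index.setdefault(weak_hash(mts), set()).add(name)
    return index


def candidate_query(index: dict[int, set[str]],
                    target_mts: list[tuple[str, ...]]) -> set[str]:
    """Return every Source file sharing at least one hash value with the Target."""
    candidates: set[str] = set()
    for mts in target_mts:
        candidates |= index.get(weak_hash(mts), set())
    return candidates


if __name__ == "__main__":
    index = build_index({
        "s1.c": [("a", "b")],         # really contains the Target's MTS
        "s2.c": [("c", "d")],         # different MTS, same toy hash
        "s3.c": [("long", "token")],  # different MTS, different hash
    })
    print(sorted(candidate_query(index, [("a", "b")])))
```

Here `s2.c` is returned although it does not contain the Target's MTS (a false positive from a hash collision), while `s1.c` can never be missed because equal inputs always hash to the same value.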

Batch Query

  • Definition: A method of processing multiple queries simultaneously to improve efficiency.
  • Characteristics:
    • .

Index

!!TODO: Clearly define Index and Candidate Index

  • Definition: A data structure built from the list of files returned by a Candidate Query.
  • Characteristics:
    • Optimized for detailed search operations based on Exact and Similar Queries.

Candidate Index

  • Definition: The initial index built over the entire Source.
  • Characteristics:
    • Supports Candidate Queries by providing a file-level listing.

Partitioning

  • Definition: The process of dividing the MTSUs in a Query Request to optimize the Finder's lookups.
  • Characteristics:
    • The Index is sorted by MTSU.
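
Because the Index is sorted by MTSU, one way to partition a Query Request is by key range, so that each group of MTSUs touches a single contiguous slice of the Index. The sketch below assumes MTSUs are integers and that the split points (`boundaries`) are chosen elsewhere; both the function name and the boundary scheme are hypothetical.

```python
from bisect import bisect_right


def partition_mtsus(mtsus: list[int], boundaries: list[int]) -> list[list[int]]:
    """Split MTSUs into len(boundaries) + 1 sorted, deduplicated buckets.

    `boundaries` must be sorted. Bucket i holds values v with
    boundaries[i - 1] <= v < boundaries[i] (open-ended at both extremes).
    """
    buckets: list[list[int]] = [[] for _ in range(len(boundaries) + 1)]
    for unit in sorted(set(mtsus)):
        # bisect_right counts how many boundaries are <= unit,
        # which is exactly the bucket number.
        buckets[bisect_right(boundaries, unit)].append(unit)
    return buckets


if __name__ == "__main__":
    print(partition_mtsus([250, 50, 100, 150, 50], boundaries=[100, 200]))
```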

File Archiver

  • Definition: A component that returns a Source file given its file ID.
  • Characteristics:

Indexer

  • Definition: A component that creates the Index from Source datasets.
  • Characteristics:
    • Processes Source files:
      • Generates Tokenized Files from Source files.
      • Generates the Index from the Tokenized Files, using their MTSUs and the metadata of the Source files.

Querier

  • Definition: A component that builds a Query Request from Target files.
  • Characteristics:
    • Processes Target files:
      • Generates Tokenized Files from Target files.
      • Generates a Query Request from the MTSUs of the Tokenized Files, containing:
        • The MTSU list and relevant metadata of the Target files.
        • Grouped/deduplicated MTSUs, optimized for the query operation.
          • Open question: should grouping happen in the Querier or in the Indexer?
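
The grouping/deduplication step might look like the sketch below, which assumes it runs in the Querier (the glossary leaves this open) and that MTSUs have already been computed as integers. The request shape, a mapping from each distinct MTSU to the Target files that contain it, is a hypothetical choice.

```python
from collections import defaultdict


def build_query_request(target_mtsus: dict[str, list[int]]) -> dict[int, list[str]]:
    """Group Target files by MTSU so each distinct MTSU is looked up once."""
    grouped: dict[int, set[str]] = defaultdict(set)
    for file_name, units in target_mtsus.items():
        for unit in units:
            grouped[unit].add(file_name)
    # Sort keys and file names so the request is deterministic.
    return {unit: sorted(grouped[unit]) for unit in sorted(grouped)}


if __name__ == "__main__":
    request = build_query_request({"a.c": [3, 1, 3], "b.c": [1]})
    print(request)
```

Four MTSU occurrences collapse into two lookups here, which is the saving that deduplication is meant to provide.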

Finder

  • Definition: A component that finds candidate files during query operations.
  • Characteristics:
    • For a Query Request, retrieves the candidate file list (or candidate file list ID) for each MTSU from the Index.

Merger

  • Definition: A component that merges candidate file lists based on the query results.
  • Characteristics:
    • Merges the candidate file lists returned for each MTSU.
    • Generates the final candidate file list for the Query Request.
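
The hand-off from Finder to Merger can be sketched as two small functions. Everything here is an assumption for illustration: the Index maps an integer MTSU to a list of Source file names, the Query Request maps each MTSU to the Target files containing it, and the final result is one sorted candidate list per Target file.

```python
def find_candidates(index: dict[int, list[str]],
                    request: dict[int, list[str]]) -> dict[int, list[str]]:
    """Finder step: look up the candidate Source files for each requested MTSU."""
    return {unit: index.get(unit, []) for unit in request}


def merge_candidates(request: dict[int, list[str]],
                     found: dict[int, list[str]]) -> dict[str, list[str]]:
    """Merger step: combine per-MTSU lists into one list per Target file."""
    merged: dict[str, set[str]] = {}
    for unit, target_files in request.items():
        for target in target_files:
            merged.setdefault(target, set()).update(found.get(unit, []))
    return {target: sorted(sources) for target, sources in merged.items()}


if __name__ == "__main__":
    index = {1: ["s1.c"], 2: ["s2.c"]}
    request = {1: ["t1.c"], 2: ["t1.c", "t2.c"], 9: ["t2.c"]}
    found = find_candidates(index, request)
    print(merge_candidates(request, found))
```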

Hash Function

  • Definition: A function that converts input data (an MTS) into a fixed-size string of bytes (an MTSU).

Language Family

  • Definition: A grouping of programming languages based on the similarity of their actual representation.
  • Characteristics:
    • Languages within the same family share common grammatical structures, enabling identical token sequences across different languages.
    • A language can belong to multiple language families if it shares patterns with multiple groups.
    • Facilitates partitioning of Source data to optimize search and indexing based on language-specific features.

Parameter

  • Definition: User-defined criteria that refine and tailor query operations.
  • Types:
    • Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
    • Search Unit (by token):
      • MTSU Size:
        • Definition: The smallest unit size used for hashing a subsequence during the search process.
      • Minimum Search Unit Size:
        • Definition: The smallest unit size considered when performing searches.