Glossary

File

  • Definition: A single code and/or text container.
  • Characteristics:
    • Consists of a sequence of original tokens (raw source code).
    • The Indexing, Candidating, and Comparing phases do not read the file directly.
      • Only the viewer interface reads the file directly.

      Target

      • Definition: A user-provided dataset intended for analysis.
      • Characteristics:
        • Consists of files supplied by the user.
        • Undergoes immediate tokenization and/or indexing upon receipt.
        • Remains separate and is never incorporated into the Source.

      Source

      • Definition: A repository of datasets maintained by the service provider.
      • Characteristics:
        • Acts as the reference pool for matching against the Target.
        • Receives updates infrequently, typically on a weekly or monthly schedule.
        • Pre-indexed and readily available when accessed by the user.
        • The duration of Candidate Indexing does not influence data structure design.

      Dataset

      • Definition: An aggregate collection of files.
      • Types:
        • Source Dataset: The comprehensive collection of files maintained as the Source.
          • Interchangeably referred to as "Source."
        • Target Dataset: The complete collection of files provided by the user.
          • Interchangeably referred to as "Target."

      Token

      • Definition: The smallest unit of data derived from the content of a file.
      • Tokenizing Methods:
        • Code Data: Tokens are produced by a tokenizer based on the syntax of the code.
        • Text Data (Natural Language):
          • Western Languages: Tokens are separated by spaces.
          • Chinese: Each character is treated as an individual token.
          • Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
          • Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context, using tools like MeCab or Kuromoji.
          • Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
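
The two simplest rules above (space-separated tokens and one token per character) can be sketched as follows. This is only an illustration: the function names are hypothetical, and code, Korean, and Japanese input would need a real tokenizer rather than either of these functions.

```python
def tokenize_whitespace(text: str) -> list[str]:
    """Split on runs of whitespace (the rule given for Western languages)."""
    return text.split()


def tokenize_per_character(text: str) -> list[str]:
    """Treat each non-whitespace character as one token (the rule given for Chinese)."""
    return [ch for ch in text if not ch.isspace()]
```

For example, `tokenize_whitespace("the quick  brown fox")` yields four tokens regardless of the doubled space, and `tokenize_per_character("你好 世界")` yields four single-character tokens.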

      Sequence and Subsequence

      • Sequence:
        • Definition: An ordered list of tokens derived from a file or segment.
      • Subsequence:
        • Definition: A contiguous subset of tokens within an entire sequence.

      Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)

      • Definition:
        • Minimum Token Sequence (MTS): The fundamental subsequence unit, a fixed-size run of tokens taken from the original sequence.
        • Minimum Token Sequence Unit (MTSU): The fundamental unit in indexing, candidating, and comparing within the system.
      • Characteristics:
        • An MTSU is calculated by hashing an MTS.
        • Because an MTS consists of several tokens, a given MTS is less likely to be duplicated than a single token.
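
A minimal sketch of the relationship between MTS and MTSU, under several assumptions not stated in this glossary: windows advance one token at a time, the hash is BLAKE2b truncated to 8 bytes, and each token is length-prefixed before hashing. The function names are hypothetical.

```python
import hashlib


def minimum_token_sequences(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens, in order (assumed stride of 1)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...], digest_bytes: int = 8) -> bytes:
    """Hash one MTS into a fixed-size byte string (assumed hash: BLAKE2b)."""
    h = hashlib.blake2b(digest_size=digest_bytes)
    for token in mts:
        data = token.encode("utf-8")
        # Length-prefix each token so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()
```

With `size=3`, the five tokens of `a = b + c` produce three MTS windows, and each window maps to an 8-byte MTSU. A file shorter than `size` produces no windows at all.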

      Tokenized File

      • Definition: A tokenized version of a file.
      • Characteristics:
        • Generated by a tokenizer.
        • Consists of a sequence of pairs, each containing:
          • The index of the token
          • The original characters (string)
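
One possible in-memory shape for a Tokenized File is sketched below. It assumes that "index of the token" means the token's position in the sequence (it could instead be a character offset or a vocabulary ID), and it uses whitespace splitting as a stand-in tokenizer. `TokenEntry` and `tokenize_file` are hypothetical names.

```python
from typing import NamedTuple


class TokenEntry(NamedTuple):
    index: int  # assumed meaning: position of the token in the file's sequence
    text: str   # original characters of the token


def tokenize_file(content: str) -> list[TokenEntry]:
    """Build the (index, original characters) pairs with a stand-in tokenizer."""
    return [TokenEntry(i, tok) for i, tok in enumerate(content.split())]
```

Keeping the original characters next to each index lets a viewer display matches without re-tokenizing the file.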

      Query, Query Request, and Query Result

      !!TODO: Clearly define Candidate Query and Query

      • Definition:
        • Query: The process of searching the Source for candidate files that share a common subsequence with the Target.
        • Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
        • Query Result: The output of a Query, which pairs each Target File with a list of candidate Source files.
      • Components:
        • Encompasses all files within the Target dataset.
        • Searches through candidate Source files and their associated project metadata using Candidate Queries.
        • Identifies Source files that contain token subsequences matching any subsequence within Target files.
      • Types:
        • Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
        • Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
      • Result:
        • Pair of Source File and Target File: Links the Source file with the corresponding Target file.
        • Pair of Subsequence Indices: This indicates the start and end indices of the matching subsequences in both the Source and Target files.
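
The result shape described above (a file pair plus a pair of index ranges) can be sketched for the Exact Query case. This is an illustration only: it reports just the single longest identical run, uses half-open `(start, end)` spans, and the names `Match` and `longest_exact_match` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Match:
    source_file: str
    target_file: str
    source_span: tuple[int, int]  # (start, end) token indices, end exclusive
    target_span: tuple[int, int]


def longest_exact_match(source_file: str, source_tokens: list[str],
                        target_file: str, target_tokens: list[str]) -> Optional[Match]:
    """Find the longest identical contiguous token run using row-by-row DP."""
    best_len = best_src_end = best_tgt_end = 0
    prev = [0] * (len(target_tokens) + 1)
    for i, s_tok in enumerate(source_tokens, start=1):
        curr = [0] * (len(target_tokens) + 1)
        for j, t_tok in enumerate(target_tokens, start=1):
            if s_tok == t_tok:
                # Extend the run that ended at the previous token of both files.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_src_end, best_tgt_end = curr[j], i, j
        prev = curr
    if best_len == 0:
        return None
    return Match(source_file, target_file,
                 (best_src_end - best_len, best_src_end),
                 (best_tgt_end - best_len, best_tgt_end))
```

A production comparison would report every run longer than the minimum search unit, not only the longest one.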

      Candidate Query

      !!TODO: Clearly define Candidate Query and Query

      • Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
      • Characteristics:
        • Operates over the entire Source dataset.
        • Allows false positives but does not allow false negatives.
        • Does not directly support Similar Queries.
          • However, similar functionality can be achieved using a smaller MTS size.
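
The "false positives but no false negatives" property can be sketched with a deliberately weak hash. Everything here is an assumption made for illustration: `toy_hash` is not the real Hash Function, and the index layout (hash value to set of file IDs) is hypothetical. Two different MTS that collide produce a false positive, but a Source file that really contains a Target MTS is always returned, because equal MTS always hash to the same value.

```python
def windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """All contiguous runs of `size` tokens."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def toy_hash(mts: tuple[str, ...]) -> int:
    # Deliberately weak stand-in hash so that collisions are easy to see.
    return sum(len(tok) for tok in mts) % 7


def build_candidate_index(source_files: dict[str, list[str]], size: int) -> dict[int, set[str]]:
    """Map each hash value to the set of Source file IDs that contain it."""
    index: dict[int, set[str]] = {}
    for file_id, tokens in source_files.items():
        for mts in windows(tokens, size):
            index.setdefault(toy_hash(mts), set()).add(file_id)
    return index


def candidate_query(index: dict[int, set[str]], target_tokens: list[str], size: int) -> set[str]:
    """Return every Source file sharing at least one hash value with the Target."""
    result: set[str] = set()
    for mts in windows(target_tokens, size):
        result |= index.get(toy_hash(mts), set())
    return result
```

With Source files `a.c = int x = 1` and `b.c = foo y + 2`, a Target containing `int x` returns both files: `a.c` is a true match, while `b.c` appears only because `foo y` collides with `int x` under the toy hash. The later detailed comparison is expected to discard it.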

      Batch Query

      • Definition: A method of processing multiple queries simultaneously to improve efficiency.
      • Characteristics:
        • .

      Index

      !!TODO: Clearly define Index and Candidate Index

      • Definition: A data structure built from the list of files returned by a Candidate Query.
      • Characteristics:
        • Optimized for detailed search operations based on Exact and Similar Queries.

      Candidate Index

      • Definition: The initial indexing for the entire Source.
      • Characteristics:
        • Supports Candidate Queries by providing a file-level listing.

      Partitioning

      • Definition: The process of dividing the MTSUs in a Query Request to optimize Finding.
      • Characteristics:
        • The Index is sorted in MTSU order.
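
Because the Index is sorted by MTSU, one way to partition a Query Request is to route each MTSU to the key range that holds it, so that each partition touches a contiguous slice of the Index. The sketch below assumes MTSUs are comparable integers and that the partition boundaries are known in advance. Both are assumptions, and `partition_request` is a hypothetical name.

```python
from bisect import bisect_right


def partition_request(mtsus: list[int], boundaries: list[int]) -> list[list[int]]:
    """Split MTSUs into len(boundaries) + 1 sorted buckets.

    `boundaries` must be sorted; bucket k holds values v with
    boundaries[k - 1] <= v < boundaries[k].
    """
    buckets: list[list[int]] = [[] for _ in range(len(boundaries) + 1)]
    for value in sorted(set(mtsus)):  # deduplicate, then keep Index order
        buckets[bisect_right(boundaries, value)].append(value)
    return buckets
```

For boundaries `[100, 200]`, the request `[250, 5, 100, 150, 5]` is split into `[[5], [100, 150], [250]]`.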

      File Archiver

      • Definition: Returns a Source File given its file ID.
      • Characteristics:

      Indexer

      • Definition: Creates the Index from Source datasets.
      • Characteristics:
        • Processes Source Files:
          • Generates Tokenized Files from Source Files.
          • Generates the Index, keyed by MTSU and carrying Source File metadata, from the Tokenized Files.

      Querier

      • Definition: Creates a Query Request from Target Files.
      • Characteristics:
        • Processes Target Files:
          • Generates Tokenized Files from Target Files.
          • Generates a Query Request from the MTSUs of the Tokenized Files, containing:
            • The MTSU list and relevant metadata of the Target Files.
            • Grouped and deduplicated MTSUs, optimized for the query operation.
              • Open question: should grouping happen in the Querier or in the Indexer?
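
The grouping and deduplication step can be sketched as a map from each distinct MTSU to the Target locations where it occurs, so that the Index is consulted once per MTSU rather than once per occurrence. This sketch places grouping in the Querier, which the glossary leaves open. The hash choice, the separator, and the names are assumptions.

```python
import hashlib
from collections import defaultdict


def mtsu_of(window: list[str]) -> bytes:
    """Assumed hash: 8-byte BLAKE2b over tokens joined by a unit separator.

    Assumes tokens never contain the separator character themselves.
    """
    return hashlib.blake2b("\x1f".join(window).encode("utf-8"), digest_size=8).digest()


def build_query_request(target_files: dict[str, list[str]],
                        size: int) -> dict[bytes, list[tuple[str, int]]]:
    """Group occurrences by MTSU: MTSU -> [(target file ID, start index), ...]."""
    request: dict[bytes, list[tuple[str, int]]] = defaultdict(list)
    for file_id, tokens in target_files.items():
        for start in range(len(tokens) - size + 1):
            request[mtsu_of(tokens[start:start + size])].append((file_id, start))
    return dict(request)
```

For Target files `t1 = a b a b` and `t2 = a b` with `size=2`, four occurrences collapse to two distinct MTSUs, and the entry for `a b` records all three of its locations.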

      Finder

      • Definition: Retrieves candidate files during query operations.
      • Characteristics:
        • For each MTSU in a Query Request, gets the candidate file list (or candidate file list ID) from the Index.

      Merger

      • Definition: Merges the candidate file lists based on the query results.
      • Characteristics:
        • Generates the final candidate file list for the Query Request.
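
A minimal sketch of how the Finder and the Merger might fit together, assuming the Index is a plain mapping from an integer MTSU to a list of Source file IDs. Ordering the final list by number of matching MTSUs is an assumption added for illustration, since the glossary only requires a merged list. `find` and `merge` are hypothetical names.

```python
from collections import Counter


def find(index: dict[int, list[str]], request_mtsus: list[int]) -> dict[int, list[str]]:
    """Finder: look up the candidate file list for each MTSU in the request."""
    return {m: index.get(m, []) for m in request_mtsus}


def merge(per_mtsu: dict[int, list[str]]) -> list[str]:
    """Merger: combine per-MTSU lists into one deduplicated candidate list.

    Files matching more MTSUs come first; ties are broken by file ID.
    """
    hits: Counter = Counter()
    for files in per_mtsu.values():
        hits.update(set(files))  # count each file once per MTSU
    return [f for f, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]
```

With an index `{1: ["a.c", "b.c"], 2: ["b.c"], 3: ["c.c"]}` and a request for MTSUs `[1, 2, 9]`, the merged candidate list is `["b.c", "a.c"]`.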

      Hash Function

      • Definition: A function that converts input data (MTS) into a fixed-size string of bytes (MTSU).

      Language Family

      • Definition: A grouping of programming languages based on the similarity of their actual representation.
      • Characteristics:
        • Languages within the same family share common grammatical structures, enabling identical token sequences across different languages.
        • A language can belong to multiple language families if it shares patterns with multiple groups.
        • Facilitates partitioning of Source data to optimize search and indexing based on language-specific features.
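
The many-to-many relationship between languages and families, and its use for partitioning Source data, can be sketched as below. The family names and memberships are invented for illustration and are not taken from this glossary.

```python
from collections import defaultdict

# Hypothetical family assignments; a language may belong to several families.
FAMILIES_BY_LANGUAGE: dict[str, set[str]] = {
    "c": {"c-like"},
    "cpp": {"c-like"},
    "javascript": {"c-like", "script"},
    "python": {"script"},
}


def partition_by_family(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each family to the sorted IDs of files whose language belongs to it.

    `files` maps file ID -> language name; unknown languages go to "unknown".
    """
    partitions: dict[str, list[str]] = defaultdict(list)
    for file_id, language in files.items():
        for family in FAMILIES_BY_LANGUAGE.get(language, {"unknown"}):
            partitions[family].append(file_id)
    return {family: sorted(ids) for family, ids in partitions.items()}
```

A file in a multi-family language appears in every matching partition, so a query restricted to one family still sees it.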

      Parameter

      • Definition: User-defined criteria that refine and tailor query operations.
      • Types:
        • Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
        • Search Unit (by token):
          • MTSU Size:
            • Definition: The smallest unit size used for hashing a subsequence during the search process.
          • Minimum Search Unit Size:
            • Definition: The smallest unit size considered when performing searches.
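
The parameters above could be carried in a small validated record, as sketched below. The field names, default values, the 0 to 1 range for the similarity rate, and the rule that the search unit cannot be smaller than the hashing unit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    similarity_rate: float = 1.0    # assumed scale: 1.0 means exact match
    mtsu_size: int = 5              # tokens per hashed unit (hypothetical default)
    min_search_unit_size: int = 10  # smallest match length in tokens (hypothetical default)

    def __post_init__(self) -> None:
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
        if self.mtsu_size < 1:
            raise ValueError("mtsu_size must be at least 1")
        if self.min_search_unit_size < self.mtsu_size:
            # Assumed constraint: a match shorter than one hashed unit cannot be found.
            raise ValueError("min_search_unit_size must be >= mtsu_size")
```

Validating at construction time keeps malformed parameters from reaching the Querier.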