Glossary

File

  • Definition: A single container of code and/or text.
  • Characteristics:
    • Serves as the fundamental unit in search operations within the system.
    • It does not directly support Similar Queries; however, similar results can be achieved by using smaller minimum hashing unit sizes.

Segment

  • Definition: A subdivision of a file.
  • Characteristics:
    • Segments are uniformly sized, except for the final segment, which may be smaller.
    • Each segment is individually indexed for efficient searching (Candidate Query).
    • Adjacent segments should share overlapping tokens so that minimum search/hashing units spanning a segment boundary remain intact.
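
The segmentation rules above can be sketched as follows. This is a minimal illustration, not the system's actual implementation; the function name `split_into_segments` and the parameters `segment_size` and `overlap` are hypothetical.

```python
def split_into_segments(tokens, segment_size, overlap):
    """Split a token list into fixed-size segments sharing `overlap` tokens.

    All segments have `segment_size` tokens except possibly the last one.
    """
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_size")
    if not tokens:
        return []
    step = segment_size - overlap  # distance between consecutive segment starts
    segments = []
    start = 0
    while True:
        segments.append(tokens[start:start + segment_size])
        if start + segment_size >= len(tokens):
            break  # this segment already reaches the end of the file
        start += step
    return segments


segments = split_into_segments(list("abcdefgh"), segment_size=4, overlap=1)
# segments == [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h']]
```

Under this sketch, choosing `overlap` as the minimum unit size minus one means that every contiguous window of that unit size falls entirely inside at least one segment.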

Query

  • Definition: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
  • Components:
    • Encompasses all files within the Target dataset.
    • Searches through candidate Source files and their associated project metadata using Candidate Queries.
    • Identifies source files that contain token subsequences that match any subsequence within target files.
  • Types:
    • Exact Query: Retrieves subsequences that are precisely identical between Source and Target.
    • Similar Query: Retrieves subsequences that are similar based on predefined similarity parameters.
  • Result:
    • Pair of Source File and Target File: Links the Source file with the corresponding Target file.
    • Pair of Subsequence Indices: Gives the start and end indices of the matching subsequence in both the Source file and the Target file.
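
As a rough sketch of what an Exact Query result could look like, the example below finds maximal identical token runs between one Source file and one Target file and reports them in the shape described under Result. The `Match` class, the function `exact_matches`, the `min_len` parameter, and the end-exclusive index convention are assumptions made for illustration. The quadratic scan is a simple stand-in, not the indexed search the glossary describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    source_file: str
    target_file: str
    source_span: tuple  # (start, end) token indices, end exclusive
    target_span: tuple  # (start, end) token indices, end exclusive


def exact_matches(source_file, source_tokens, target_file, target_tokens, min_len):
    """Return maximal identical token runs of at least `min_len` tokens."""
    s, t = source_tokens, target_tokens
    matches = []
    prev = [0] * (len(t) + 1)  # prev[j]: run length ending at s[i-2], t[j-1]
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] != t[j - 1]:
                continue
            curr[j] = prev[j - 1] + 1
            # The run is maximal if it cannot be extended by one more token.
            at_end = i == len(s) or j == len(t) or s[i] != t[j]
            if at_end and curr[j] >= min_len:
                run = curr[j]
                matches.append(
                    Match(source_file, target_file, (i - run, i), (j - run, j))
                )
        prev = curr
    return matches


result = exact_matches("src.txt", "a b c d e".split(),
                       "tgt.txt", "x b c d y".split(), min_len=2)
# One match: tokens 1..3 ("b c d") in both files, reported as spans (1, 4).
```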

Candidate Query

  • Definition: A preliminary search mechanism, similar to a Query, that returns only a list of files from the Source.
  • Characteristics:
    • Operates over the entire Source dataset.
    • Allows false positives but not false negatives.
    • It does not directly support Similar Queries; similar functionality can be achieved using a smaller minimum hashing unit size.
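
One way to obtain "false positives but not false negatives" is an inverted index from hashed token windows to file names. The sketch below assumes that design; the names `kgram_hashes`, `build_candidate_index`, `candidate_query`, and the `buckets` parameter are hypothetical. Hash collisions can add unrelated files to the result (false positives), but any Source file that shares an identical window of `k` tokens with the Target is always returned.

```python
import zlib
from collections import defaultdict


def kgram_hashes(tokens, k, buckets=1 << 16):
    """Hash every window of k consecutive tokens into a bounded bucket range."""
    return {
        zlib.crc32("\x1f".join(tokens[i:i + k]).encode("utf-8")) % buckets
        for i in range(len(tokens) - k + 1)
    }


def build_candidate_index(source_files, k):
    """Map each window hash to the set of Source files containing it."""
    index = defaultdict(set)
    for name, tokens in source_files.items():
        for h in kgram_hashes(tokens, k):
            index[h].add(name)
    return index


def candidate_query(index, target_tokens, k):
    """Return Source files sharing at least one window hash with the Target."""
    candidates = set()
    for h in kgram_hashes(target_tokens, k):
        candidates |= index.get(h, set())
    return candidates


source = {
    "a.py": "x = foo ( 1 )".split(),
    "b.py": "print ( hello )".split(),
}
index = build_candidate_index(source, k=3)
candidates = candidate_query(index, "y = foo ( 2 )".split(), k=3)
# "a.py" is guaranteed to be in `candidates` because it shares "= foo (".
```

A smaller `k` makes the filter more permissive: files that differ from the Target by small edits still share some short windows, which is how this sketch approximates similar matching without supporting Similar Queries directly.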

Index

  • Definition: A data structure built from the list of files returned by a Candidate Query.
  • Characteristics:
    • Optimized for detailed search operations based on Exact and Similar Queries.

Candidate Index

  • Definition: The initial indexing for the entire Source.
  • Characteristics:
    • Supports Candidate Queries by providing a file-level listing.

Target

  • Definition: A user-provided dataset intended for analysis.
  • Characteristics:
    • Consists of files supplied by the user.
    • Undergoes immediate tokenization and/or indexing upon receipt.
    • Remains separate and is never incorporated into the Source.

Source

  • Definition: A repository of datasets maintained by the service provider.
  • Characteristics:
    • Acts as the reference pool for matching against the Target.
    • Receives updates infrequently, typically on a weekly or monthly schedule.
    • Pre-indexed and readily available when accessed by the user.
    • Because the Source is pre-indexed, the time taken by Candidate Indexing is not a constraint on data structure design.

Dataset

  • Definition: An aggregate collection of files.
  • Types:
    • Source Dataset: The comprehensive collection of files maintained as the Source.
      • Interchangeably referred to as "Source."
    • Target Dataset: The complete collection of files provided by the user.
      • Interchangeably referred to as "Target."

Token

  • Definition: The smallest unit of data derived from the content of a file.
  • Generation Methods:
    • Code Data: Tokens are produced by a tokenizer based on the syntax of the code.
    • Text Data (Natural Language):
      • Western Languages: Tokens are separated by spaces.
      • Chinese: Each character is treated as an individual token.
      • Korean: Tokens are split by spaces; postpositional particles and suffixes are tokenized separately.
      • Japanese: Tokens are determined through morphological analysis, segmenting words based on grammar and context using tools like MeCab or Kuromoji.
      • Special Considerations: Both Korean and Japanese require specialized tokenizers to accurately handle the complexity of their morphology and syntax.
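
The simpler generation methods above can be sketched in a few lines. The function names are hypothetical, and Python's standard `tokenize` module stands in for "a tokenizer based on the syntax of the code" only for Python input. Korean and Japanese are omitted because, as noted, they need specialized tokenizers.

```python
import io
import tokenize

# Token types that carry layout or comments rather than code content.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.COMMENT, tokenize.ENDMARKER,
}


def tokenize_python_code(code):
    """Code data: split Python source into syntax-level tokens."""
    reader = io.StringIO(code).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIPPED]


def tokenize_by_spaces(text):
    """Western languages: tokens are separated by whitespace."""
    return text.split()


def tokenize_by_character(text):
    """Chinese: each non-whitespace character is one token."""
    return [ch for ch in text if not ch.isspace()]


tokenize_python_code("x = foo(1)\n")   # ['x', '=', 'foo', '(', '1', ')']
tokenize_by_spaces("the quick  fox")   # ['the', 'quick', 'fox']
tokenize_by_character("你好 世界")      # ['你', '好', '世', '界']
```

Note that in this sketch punctuation stays attached to words in `tokenize_by_spaces` and counts as its own token in `tokenize_by_character`; whether the real system strips or normalizes punctuation is not stated here.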

Minimum Token Sequence

  • Definition:
    • Serves as the fundamental unit in search operations within the system.
    • It does not directly support Similar Queries; however, similar results can be achieved by using smaller minimum hashing unit sizes.

Sequence and Subsequence

  • Sequence:
    • Definition: An ordered list of tokens derived from a file or segment.
  • Subsequence:
    • Definition: A contiguous run of tokens within a sequence.
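
Because a subsequence is contiguous in this glossary, tokens that appear in the right order but with gaps between them do not count. A short check, with the hypothetical helper name `contains_subsequence`, illustrates the distinction:

```python
def contains_subsequence(sequence, candidate):
    """Return True if `candidate` occurs as a contiguous run in `sequence`."""
    n, m = len(sequence), len(candidate)
    return any(sequence[i:i + m] == candidate for i in range(n - m + 1))


sequence = ["a", "b", "c", "d"]
contains_subsequence(sequence, ["b", "c"])  # True: adjacent tokens
contains_subsequence(sequence, ["a", "c"])  # False: "b" lies between them
```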

Parameter

  • Definition: User-defined criteria that refine and tailor query operations.
  • Types:
    • Similarity Rate: Specifies the required degree of similarity for matches in Similar Queries.
    • Search Unit (by token):
      • Minimum Hashing Unit Size:
        • Definition: The smallest unit size used for hashing tokens during the search process.
      • Minimum Search Unit Size:
        • Definition: The smallest unit size considered when performing searches.
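
The glossary does not say how the Similarity Rate is computed. Purely as an illustration, the sketch below assumes a Jaccard score over sets of token windows whose length equals the Minimum Hashing Unit Size; the function names and the choice of Jaccard are assumptions, not the system's documented method.

```python
def token_windows(tokens, unit_size):
    """Collect every contiguous window of `unit_size` tokens as a set."""
    return {tuple(tokens[i:i + unit_size])
            for i in range(len(tokens) - unit_size + 1)}


def jaccard_similarity(a_tokens, b_tokens, unit_size):
    """Share of distinct windows that the two token sequences have in common."""
    a = token_windows(a_tokens, unit_size)
    b = token_windows(b_tokens, unit_size)
    if not a or not b:
        return 0.0  # sequences shorter than the unit size cannot be compared
    return len(a & b) / len(a | b)


def is_similar(a_tokens, b_tokens, unit_size, similarity_rate):
    """Apply the Similarity Rate as a threshold on the score."""
    return jaccard_similarity(a_tokens, b_tokens, unit_size) >= similarity_rate


a = "a b c d e".split()
b = "a b c d x".split()
jaccard_similarity(a, b, unit_size=2)  # 3 shared of 5 distinct windows: 0.6
jaccard_similarity(a, b, unit_size=4)  # 1 shared of 3 distinct windows
is_similar(a, b, unit_size=2, similarity_rate=0.5)  # True
```

In this example the same pair of sequences scores higher with the smaller unit size, which matches the glossary's remark that smaller minimum hashing units bring results closer to similar matching.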