Skip to main content

Glossary

Glossary

Basic Descriptors

Object

  • Definition: Any unit that can be processed or analyzed by the QOSI system.
  • Types:
    • File, File ID,
    • Offset, Offset Index
    • Tokenized File, Tokenized File ID
    • Candidate Object List, Candidate Object Bitmap
    • MTS, MTSU
    • Bitmap
    • TODO:
    • etc.

Target

  • Definition: User-provided objects intended for analysis.

  • Descriptions:

    • Consists of objects submitted by the user.
    • Always remains separate and is never merged into the Source.
  • Characteristics:

    • Relatively smaller than the Source.
    • Never incorporated into the Source.
  • Workflow:

    • Is subject to immediate tokenization and/or indexing once received.

Source

  • Definition: Objects to match against the Target.

  • Description:

    • Objects maintained by the service provider.
    • The main data collection (objects) for queries and comparisons.
  • Characteristics:

    • Acts as the reference pool for matching against the Target.
    • Mainly OSS(Open Source Software) or other public datasets.
    • It is readily available. Pre-indexed.
    • It is updated infrequently
      • e.g., weekly or monthly
    • TODO: Move to Indexing Candidate Indexing duration has no significant impact on data structure design.

Candidate

  • Definition: Potential objects with common subsequence with the Target.

  • Description:

    • Identified during the Candidate Query phase.
    • Subset of the Source.
  • Characteristics:

    • May include false positives but never false negatives.
    • Used to quickly identify potential matches in the Source.
    • In current, every candidate includes MTS, but may not include MSs.

Identified

  • Definition: The subset of Source/Target objects that contain a common subsequence with Target/Source.

  • Description:

    • Identified during the Comparing phase.
    • Identified Source: Source objects that contain a common subsequence with the Target.
    • Identified Target: Target objects that contain a common subsequence with the Source.

Index

  • Definition: A metadata to identify the order of the data. Similar to ID.

  • Description:

    • Many objects (ex: File, Tokenized File, Token, etc.) have their own Index.

Immutable

  • Definition: Would not be changed during the process.

  • Description:

    • The Index is immutable after the indexing phase(when the index is created).

Dataset

Dataset

  • Definition: An aggregated collection of objects.

  • Characteristics:

    • Each Dataset may contain numerous objects
    • A dataset may contain only the same type or different types of objects.
      • source code, text, etc.
      • File, Tokenized File, Candidate Object List

Source Dataset

  • Definition: The comprehensive set of objects maintained as the Source.
  • Usage:
    • Interchangeably referred to simply as "Source."

Target Dataset

  • Definition: The complete set of objects the user provides.
  • Usage:
    • Interchangeably referred to simply as "Target."

Token and Sequence

Token

  • Definition: The smallest unit of data derived from the content of a file.

  • Description:

    • Generated through tokenization—splitting code or text into discrete elements.
    • Varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.
  • Characteristics:

    • Enables fine-grained comparison between Source and Target files.
    • Significantly reduces duplication compared to entire lines or blocks of text.
  • Details:

    • Code Data: Tokenized according to code syntax rules.
    • Text Data (Natural Language):
      • Western Languages: Split by spaces (plus punctuation considerations).
      • Chinese: Every character is treated as a token.
      • Korean: Split by spaces; postpositional particles and suffixes are separately tokenized.
      • Japanese: Uses morphological analysis to segment words based on grammar and context.
      • Special Cases: Korean/Japanese require specialized tokenizers due to morphological complexity.

Syntax Token

  • Definition: A token that does not reflect its original string but its syntax role.

String Token

  • Definition: A token that reflects its original string.

Sequence and Subsequence

  • Definition:

    • Sequence: An ordered list of tokens derived from a file or segment.
    • Subsequence: A contiguous subset of tokens within a sequence.
  • Description:

    • Used for matching and comparison processes (e.g., detecting shared code fragments).

Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)

  • Definition:

    • MTS: A fixed-size subsequence of tokens from the original sequence.
    • MTSU: The hashed version of an MTS, serving as a fundamental unit for indexing, candidating, and comparing.
  • Characteristics:

    • MTSU is generated by hashing an MTS.
    • Because an MTS is multiple tokens combined, it reduces the chance of random duplication compared to single-token matching.

Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))

  • Definition: The subsequence that is requested by minimum size(length) by the user.

  • Details:

    • MSs should be equal or greater(longer) than MTS.
    • When comparing on Comparator, connect MTS(MTSU) to construct MSs.
  • Example:

    • MTS: 3 tokens
    • MSs: 5 tokens
    • When finding a MSs A B C D E in a file, the Comparator will find MTS A B C, B C D and C D E by MTSU and match it to MSs A B C D E.

Hash Function

  • Definition: A function that converts an MTS into a fixed-size value (an MTSU).

  • Description:

    • Ensures consistent, quick comparisons.
    • Takes not only a Token, but Token Subsequence to get more identification power.

File and Tokenized File

File

  • Definition: A code and/or text container.

  • Description:

    • Consists of a sequence of original tokens (raw source code, natural language text).
    • Only Reporter(viewer) access the file directly
      • In Indexing, and Comparing phases, the system uses Tokenized Files.
  • Characteristics:

    • Typically stored in a filesystem or code repository.
      • Source Files are stored in Archiver.
    • Can be part of either Source Dataset or a Target Dataset.

Tokenized File

  • Definition: A file after it has been converted into a sequence of tokens.

  • Description:

    • Produced by a tokenizer from the raw file.
    • Consists of pairs: (**Token Index**, **MTSU**, separated token #1, separated token #2, ...).

Token Index

  • Definition: The token's position in the token sequence of the file.

Partitioning

  • Definition: Dividing something by MTSU to optimize searching.
    • Candidate Index, File Archive, and etc.

Query

Query, Query Request, and Query Result

  • Definition:

    • Query: The act of searching the Source for common subsequences found in the Target.
    • Query Request: Specific search parameters used to locate relevant subsequences in Source files.
    • Query Result: The outcome of a Query, typically pairing Target files with matching Source files.
  • Description:

    • Involves scanning across the entire Target dataset to find potential matches in the Source.
    • Candidate Queries help identify which Source files might contain matching subsequences.
  • Types:

    • Exact Query: Finds subsequences that match exactly between Source and Target.
    • Similar Query: Finds similar subsequences, given a user-defined similarity threshold.
  • Result:

    • Pairs each Source file with the relevant Target file.
    • Includes index ranges (start/end) of matching subsequences for both files.
      • By Token Index
      • By Line Number

Candidate Query

Definition:** A preliminary, broad search returns only a list of candidate files from the Source.

  • Description:
    • Operates over the entire Source dataset.
    • Allows false positives but no false negatives (i.e., it might over-include but never miss a true match).
    • It does not directly support similar queries (though smaller MTS sizes can approximate similarity).

Batch Query

  • Definition: A method for processing multiple queries to improve efficiency.

  • Characteristics:

    • Reduces repeated work by handling similar queries in a single pass.
    • Constructed by MTS list
    • Link between Source ID and MTS is not sent to Nominator, but stored in Querier.
  • Example:

    • Target Object A's Token Sequence: A B C D E F G
      • A's MTS:
        • A B C
        • B C D
        • C D E
        • D E F
        • E F G
    • Target Object B's Token Sequence: A B C E F G
      • B's MTS:
        • A B C
        • B C E
        • C E F
        • E F G
    • Batch Query: A B C, B C D, C D E, B C E, C E F, E F G
      • Common MSss among Target Object (ex: A B C, E F G)are not duplicated in Batch Query

Preprocess Phase

Preprocessor

  • Definition: A component that converts a File to a Tokenized File, and extract metadata.

  • Description:

    • Extracts metadata from the file.
    • Include Tokenizing(Tokenizer)

Tokenizer

  • Definition: A component that converts a File to a sequence of tokens and make a Tokenized File.

File Archiver

  • Definition: Retrieves a Source File and/or Source Tokenized File’s content by the file ID.

  • Description:

    • Archive Source Files and Source Tokenized Files, and its metadata.

Index Phase

Candidate Index (The Index)

  • Definition: A data structure built from candidate Source files to facilitate queries.

  • Description:

    • Used to respond to Candidate Queries quickly.
    • It acts as a high-level map of Source Files.
    • Is immutable after the indexing phase.
      • Nominator does not modify The Index.
    • Is organized (sorted) by MTSU to speed up lookups.
    • Partitioning is applied to The Index for efficient searching.
  • Types:

    • Raw Candidate Index
      • Is not optimized or compressed.
      • May use RocksDB or LevelDB.
    • Candidate Index, The Index
      • Is optimized and compressed.
      • Nominator uses this index.

Indexer (Source File Indexer)

  • Definition: Builds the Index from the Source datasets.

  • Description:

    • Converts Source Files into Tokenized Files.
    • Generates MTSU entries and relevant metadata for the Index.
  • Workflow

    • From Tokenized File, record the file ID by MTSU to Candidate Index.

Index Compressor

  • Definition: Extract Key-Value and Compress to reduce the size of the Index to improve performance.

  • Description:

  • Workflow:

    • Extract Key-Value pairs from Candidate Index.
    • Compress Key-Value pairs by Elias-Fano Encoding.

Query Phase

Querier

  • Definition: Generates and executes queries based on Target Files.

  • Description:

    • Before Querier, it should pass Preprocessing for Target.
    • Produces Query Requests that lists MTSU and metadata from the Target Object.
    • Generate Batch Query
      • May group or deduplicate MTSU to optimize further searching.
    • Stores MTSU and Source ID map.

Nominator

  • Definition: Identifies candidate files during query operations.

  • Description:

    • Takes a Query Request and looks up possible file matches from the Index.
    • Returns a list of Source file IDs that contain matching MTSU.
      • Key: MTSU
      • Value: Source file ID list (by Bitmap)

Comparison Phase

Merger

  • Definition: Consolidates and refines the final list of candidate files after a candidate query from Nominator.

  • Description:

    • Produces the final candidate file set for the Query.
    • Combines results across multiple MTSU lookups for each Target File.
    • Use Bitmap to merge(OR operation) the candidate file list.

File Extractor

Comparator

Evaluator

Reporter


Partitioning

Language Family

  • Definition: A grouping of languages based on similar syntax or representation.

  • Description:

    • Languages in the same family may share identical tokenization patterns.
    • A single language can belong to multiple families if it exhibits shared features with different groups.
    • Helps partition Source data for more efficient indexing and searching.

Parameter

Parameter

  • Definition: A user-defined setting that customizes query behavior.

  • Types:

    • Similarity Rate: Sets the degree of overlap needed for matches in a Similar Query.
    • Search Unit (by token):
      • MTSU Size: The smallest hashable subsequence size.
      • Minimum Search Unit Size: The smallest token sequence considered for matching.
  • Description:

    • Affects how queries are performed and how results are filtered.
    • Allows for tuning accuracy versus performance.