Glossary

Basic Descriptors

Object

  • Definition: Any unit that can be processed or analyzed by the QOSI system.
  • Types:
    • File, File ID
    • Offset, Offset Index
    • Tokenized File, Tokenized File ID
    • Candidate Object List, Candidate Object Bitmap
    • MTS, MTSU
    • Bitmap
    • TODO:
    • etc.

Target

  • Definition: User-provided objects intended for analysis.
  • Description:
    • Consists of files submitted by the user.
    • The Indexing, Candidating, and Comparing phases do not read the file directly.
      • Only the viewer interface reads the file directly.
  • Characteristics:
    • Relatively smaller than the Source.
    • Always remains separate and is never merged into the Source.
  • Workflow:
    • Undergoes tokenization and/or indexing immediately upon receipt.

Source

  • Definition: Objects to match against the Target.

  • Description:

    • Objects maintained by the service provider.
    • The main data collection (objects) for queries and comparisons.
  • Characteristics:

    • Acts as the reference pool for matching against the Target.
    • Mainly OSS (Open Source Software) or other public datasets.
    • Readily available and pre-indexed.
    • Updated infrequently.
      • e.g., on a weekly or monthly schedule
    • TODO: Move to ...
    • Indexing duration has no significant impact on data structure design.

Candidate

  • Definition: Source objects that may share a common subsequence with the Target.

  • Description:

    • Identified during the Candidate Query phase.
    • Subset of the Source.
  • Characteristics:

    • May include false positives but never false negatives.
    • Used to quickly identify potential matches in the Source.
    • Currently, every candidate contains a matching MTS, but may not contain a matching MSs.

Identified

  • Definition: The subset of Source/Target objects that contain a common subsequence with Target/Source.

  • Description:

    • Identified during the Comparing phase.
    • Identified Source: Source objects that contain a common subsequence with the Target.
    • Identified Target: Target objects that contain a common subsequence with the Source.

Index

  • Definition: Metadata that identifies the order (position) of a piece of data. Similar to an ID.

  • Description:

    • Many objects (e.g., File, Tokenized File, Token) have their own Index.

Immutable

  • Definition: Not changed during processing.

  • Description:

    • The Index is immutable after the indexing phase (when the index is created).

Dataset

Dataset

  • Definition: An aggregated collection of objects.

  • Characteristics:

    • Each Dataset may contain numerous objects.
    • A Dataset may contain objects of a single type or of different types.
      • e.g., source code, text, etc.
      • e.g., File, Tokenized File, Candidate Object List

Source Dataset

  • Definition: The comprehensive set of objects maintained as the Source.
  • Usage:
    • Interchangeably referred to simply as "Source."

Target Dataset

  • Definition: The complete set of objects the user provides.
  • Usage:
    • Interchangeably referred to simply as "Target."

Token and Sequence

Token

  • Definition: The smallest unit of data derived from the content of a file.

  • Description:

    • Generated through tokenization, i.e., splitting code or text into discrete elements.
    • The method varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.
  • Characteristics:

    • Enables fine-grained comparison between Source and Target files.
    • Significantly reduces duplication compared to entire lines or blocks of text.
  • Details:

    • Code Data: Tokenized according to code syntax rules.
    • Text Data (Natural Language):
      • Western Languages: Split by spaces (plus punctuation considerations).
      • Chinese: Every character is treated as a token.
      • Korean: Split by spaces; postpositional particles and suffixes are tokenized separately.
      • Japanese: Uses morphological analysis to segment words based on grammar and context.
      • Special Cases: Korean and Japanese require specialized tokenizers due to their morphological complexity.
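
The text rules above can be illustrated with a minimal sketch. The function name `tokenize_text` and the language labels are hypothetical, and the regular expression is only one possible reading of "split by spaces plus punctuation considerations". Korean and Japanese are deliberately left out because they need a morphological analyzer.

```python
import re


def tokenize_text(text: str, language: str) -> list[str]:
    """Split natural-language text into tokens (illustrative rules only)."""
    if language == "western":
        # Runs of word characters are tokens; each punctuation mark is its own token.
        return re.findall(r"\w+|[^\w\s]", text)
    if language == "chinese":
        # Every non-whitespace character is treated as a token.
        return [ch for ch in text if not ch.isspace()]
    # Korean and Japanese require a specialized (morphological) tokenizer.
    raise NotImplementedError(f"no tokenizer sketch for {language!r}")
```

For example, `tokenize_text("Hello, world!", "western")` yields the four tokens `Hello`, `,`, `world`, `!`.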

Syntax Token

  • Definition: A token that does not reflect its original string but its syntax role.

String Token

  • Definition: A token that reflects its original string.

Sequence and Subsequence

  • Definition:

    • Sequence: An ordered list of tokens derived from a file or segment.
    • Subsequence: A contiguous subset of tokens within a sequence.
  • Description:

    • Used for matching and comparison processes (e.g., detecting shared code fragments).

Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)

  • Definition:

    • MTS: A fixed-size subsequence of tokens from the original sequence.
    • MTSU: The hashed version of an MTS, serving as a fundamental unit for indexing, candidating, and comparing.
  • Characteristics:

    • An MTSU is generated by hashing an MTS.
    • Because an MTS combines multiple tokens, it reduces the chance of random duplication compared to single-token matching.
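
As a minimal sketch, MTS windows and their MTSU values could be produced as follows. The names `mts_windows` and `mtsu`, the 64-bit BLAKE2b digest, and the NUL separator are assumptions for illustration; the glossary does not specify which hash function the system uses.

```python
import hashlib


def mts_windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (one MTS per position)."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...]) -> int:
    """Hash one MTS into a fixed-size 64-bit integer (assumed MTSU format)."""
    # Join with NUL so ("ab", "c") and ("a", "bc") produce different inputs.
    data = "\x00".join(mts).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
```

With an MTS size of 3, the sequence `A B C D E` yields the windows `A B C`, `B C D`, and `C D E`, each of which is then hashed independently.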

Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))

  • Definition: A subsequence of the minimum size (length) requested by the user.

  • Details:

    • An MSs must be equal to or longer than an MTS.
    • During comparison, the Comparator connects consecutive MTS (via their MTSU) to construct an MSs.
  • Example:

    • MTS: 3 tokens
    • MSs: 5 tokens
    • To find the MSs A B C D E in a file, the Comparator finds the MTS A B C, B C D, and C D E by their MTSU and chains them into the MSs A B C D E.
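
The chaining in this example can be sketched as follows. This is an illustration under stated assumptions, not the Comparator's actual algorithm: the function name `find_mss` is hypothetical, and the token tuple itself stands in for the MTSU so that hash collisions need no separate handling.

```python
def find_mss(file_tokens: list[str], mss_tokens: list[str], mts_size: int) -> list[int]:
    """Return start positions where `mss_tokens` occurs, found by chaining MTS lookups."""
    # Map each MTS (stand-in for its MTSU) to the positions where it starts in the file.
    positions: dict[tuple[str, ...], set[int]] = {}
    for i in range(len(file_tokens) - mts_size + 1):
        positions.setdefault(tuple(file_tokens[i:i + mts_size]), set()).add(i)

    # The MSs is covered by overlapping MTS, each shifted by one token.
    needed = [tuple(mss_tokens[k:k + mts_size])
              for k in range(len(mss_tokens) - mts_size + 1)]
    if not needed:
        return []  # the MSs is shorter than one MTS

    starts = []
    for p in sorted(positions.get(needed[0], ())):
        # The k-th MTS of the MSs must start exactly k tokens after the first one.
        if all(p + k in positions.get(mts, ()) for k, mts in enumerate(needed)):
            starts.append(p)
    return starts
```

In the file `X A B C D E Y`, the three MTS are found at positions 1, 2, and 3, so the MSs is reported at start position 1. A file that contains `A B C` and `C D E` but not `B C D` produces no match.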

Hash Function

  • Definition: A function that converts an MTS into a fixed-size value (an MTSU).

  • Description:

    • Ensures consistent, quick comparisons.
    • Takes a Token Subsequence rather than a single Token as input, which gives it more identification power.

File and Tokenized File

File

  • Definition: A code and/or text container.

  • Description:

    • Consists of a sequence of original tokens (raw source code, natural language text).
    • Only the Reporter (viewer) accesses the file directly.
      • In the Indexing and Comparing phases, the system uses Tokenized Files.
  • Characteristics:

    • Typically stored in a filesystem or code repository.
      • Source Files are stored in the File Archiver.

    • Can be part of either a Source Dataset or a Target Dataset.

Tokenized File

  • Definition: A file after it has been converted into a sequence of tokens.

  • Description:

    • Produced by a tokenizer from the raw file.
    • Consists of tuples: (Token Index, MTSU, separated token #1, separated token #2, ...).
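
One way to picture these tuples is the sketch below. The record type `TokenizedEntry`, the function `build_tokenized_file`, and the 64-bit BLAKE2b hash are assumptions for illustration; the real on-disk layout is not described in this glossary.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizedEntry:
    token_index: int          # position of the first token of the window
    mtsu: int                 # hash of the window (assumed 64-bit)
    tokens: tuple[str, ...]   # the separated tokens themselves


def build_tokenized_file(tokens: list[str], mts_size: int) -> list[TokenizedEntry]:
    """Produce one (Token Index, MTSU, tokens...) record per MTS window."""
    entries = []
    for i in range(len(tokens) - mts_size + 1):
        window = tuple(tokens[i:i + mts_size])
        digest = hashlib.blake2b("\x00".join(window).encode("utf-8"),
                                 digest_size=8).digest()
        entries.append(TokenizedEntry(i, int.from_bytes(digest, "big"), window))
    return entries
```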

Token Index

  • Definition: The token's position in the token sequence of the file.

Partitioning

  • Definition: Dividing data by MTSU to optimize searching.
    • Applies to the Candidate Index, the File Archive, etc.

Query

Query, Query Result

!!TODO: Clearly define Query Request and Query Result

  • Definition:
    • Query: The act of searching the Source for subsequences in common with the Target.
    • Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
    • Query Result: The outcome of a Query, typically pairing Target files with matching Source files.
  • Components:
    • Encompasses all files within the Target dataset.
    • Searches through candidate Source files and their associated project metadata using Candidate Queries.
    • Identifies Source files that contain token subsequences matching any subsequence within Target files.
  • Description:
    • Involves scanning across the entire Target dataset to find potential matches in the Source.
    • Candidate Queries help identify which Source files might contain matching subsequences.
  • Types:
    • Exact Query: Finds subsequences that match exactly between Source and Target.
    • Similar Query: Finds similar subsequences, given a user-defined similarity threshold.
  • Result:
    • Pair of Source File and Target File: Links each Source file with the relevant Target file.
    • Includes index ranges (start/end) of the matching subsequences for both files.
      • By Token Index
      • By Line Number

Candidate Query

!!TODO: Clearly define Candidate Query and Query

  • Definition: A preliminary, broad search, similar to a Query, that returns only a list of candidate files from the Source.
  • Description:
    • Operates over the entire Source dataset.
    • Allows false positives but no false negatives (i.e., it might over-include but never misses a true match).
    • Does not directly support Similar Queries (though smaller MTS sizes can approximate similarity).
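
The "false positives but no false negatives" property can be seen in a small sketch. It assumes that a Source file becomes a candidate as soon as it shares at least one MTS with the Target; the function name `candidate_files` and the string file IDs are hypothetical.

```python
def candidate_files(source_files: dict[str, list[str]],
                    target_tokens: list[str],
                    mts_size: int) -> set[str]:
    """Return IDs of Source files that share at least one MTS with the Target."""
    def windows(tokens: list[str]) -> set[tuple[str, ...]]:
        return {tuple(tokens[i:i + mts_size])
                for i in range(len(tokens) - mts_size + 1)}

    target_mts = windows(target_tokens)
    # Any shared MTS is enough to nominate a file, so the result over-includes.
    return {file_id for file_id, tokens in source_files.items()
            if windows(tokens) & target_mts}
```

With an MTS size of 3 and a Target of `A B C D E`, a Source file `X A B C Y` is returned as a candidate because it shares `A B C`, even though it cannot contain a 5-token MSs. That is a false positive, which the Comparing phase later discards. A file that truly contains `A B C D E` necessarily shares every one of its MTS, so it is never missed.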

Batch Query

  • Definition: A method for processing multiple queries together to improve efficiency.

  • Characteristics:

    • Reduces repeated work by handling similar queries in a single pass.
    • Constructed from a list of MTS.
    • The link between Source ID and MTS is not sent to the Nominator, but is stored in the Querier.
  • Example:

    • Target Object A's Token Sequence: A B C D E F G
      • A's MTS:
        • A B C
        • B C D
        • C D E
        • D E F
        • E F G
    • Target Object B's Token Sequence: A B C E F G
      • B's MTS:
        • A B C
        • B C E
        • C E F
        • E F G
    • Batch Query: A B C, B C D, C D E, D E F, E F G, B C E, C E F
      • MTS common to several Target Objects (e.g., A B C, E F G) are not duplicated in the Batch Query.
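
The example can be reproduced with a short sketch. The function name `build_batch_query` is hypothetical, and the `owners` map is one interpretation of the link that stays in the Querier: it records which Target Objects each MTS came from, while only the deduplicated list would be sent on.

```python
def build_batch_query(target_sequences: dict[str, list[str]], mts_size: int):
    """Return (unique MTS in first-seen order, MTS -> IDs of the objects containing it)."""
    batch: list[tuple[str, ...]] = []
    owners: dict[tuple[str, ...], list[str]] = {}
    for object_id, tokens in target_sequences.items():
        for i in range(len(tokens) - mts_size + 1):
            mts = tuple(tokens[i:i + mts_size])
            if mts not in owners:
                # First time this MTS is seen: add it to the Batch Query once.
                owners[mts] = []
                batch.append(mts)
            if object_id not in owners[mts]:
                owners[mts].append(object_id)
    return batch, owners
```

For the two Target Objects above, this produces seven unique MTS from nine windows, and `A B C` and `E F G` each map back to both objects.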

Preprocess Phase

Preprocessor

  • Definition: A component that converts a File to a Tokenized File and extracts metadata.

  • Description:

    • Extracts metadata from the file.
    • Includes tokenizing (Tokenizer).

Tokenizer

  • Definition: A component that converts a File into a sequence of tokens and produces a Tokenized File.

File Archiver

  • Definition: Retrieves the content of a Source File and/or Source Tokenized File by file ID.

  • Description:

    • Archives Source Files and Source Tokenized Files, along with their metadata.

Index Phase

Candidate Index (The Index)

  • Definition: A data structure built from Source files to facilitate queries.

  • Description:

    • Used to respond to Candidate Queries quickly.
    • Acts as a high-level map from MTSU to Source files.
    • Is immutable after the indexing phase.
      • The Nominator does not modify The Index.
    • Is organized (sorted) by MTSU to speed up lookups.
    • Partitioning is applied to The Index for efficient searching.
  • Types:

    • Raw Candidate Index
      • Is not optimized or compressed.
      • May use RocksDB or LevelDB.
    • Candidate Index, The Index
      • Is optimized and compressed.
      • The Nominator uses this index.

Indexer (Source File Indexer)

  • Definition: Builds the Index from the Source datasets.

  • Description:

    • Converts Source Files into Tokenized Files.
    • Generates MTSU entries and relevant metadata for the Index.
  • Workflow:

    • For each Tokenized File, record its file ID under each of its MTSU in the Candidate Index.
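
The workflow step amounts to building an inverted index keyed by MTSU. The sketch below uses an in-memory dictionary for illustration. The function name `build_raw_candidate_index` and the integer file IDs are assumptions, and the real Raw Candidate Index may live in a key-value store instead.

```python
from collections import defaultdict


def build_raw_candidate_index(tokenized_files: dict[int, list[int]]) -> dict[int, list[int]]:
    """Map each MTSU to the sorted list of file IDs that contain it.

    `tokenized_files` maps a file ID to the MTSU values found in that file.
    """
    index: dict[int, set[int]] = defaultdict(set)
    for file_id, mtsu_values in tokenized_files.items():
        for value in mtsu_values:
            index[value].add(file_id)
    # Emit keys in MTSU order, matching the "sorted by MTSU" property of The Index.
    return {value: sorted(ids) for value, ids in sorted(index.items())}
```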

Index Compressor

  • Definition: Extracts Key-Value pairs and compresses them to reduce the size of the Index and improve performance.

  • Workflow:

    • Extract Key-Value pairs from the Candidate Index.
    • Compress the Key-Value pairs with Elias-Fano encoding.
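
For readers unfamiliar with Elias-Fano encoding, the sketch below encodes and decodes one sorted list of integers. It assumes that the encoded values are the sorted file IDs stored under one MTSU, which the glossary does not state. Bits are kept in plain Python lists for readability rather than packed, so this shows the layout only and saves no space.

```python
def ef_encode(values: list[int], universe: int):
    """Encode a non-decreasing list of ints in [0, universe) as (low_bits, lows, highs)."""
    n = len(values)
    if n == 0:
        return 0, [], []
    # Number of low bits per element: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1)
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]            # low parts stored verbatim
    highs = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        # High parts stored in unary: element i sets bit (high part + i).
        highs[(v >> low_bits) + i] = 1
    return low_bits, lows, highs


def ef_decode(low_bits: int, lows: list[int], highs: list[int]) -> list[int]:
    """Rebuild the original list from its Elias-Fano representation."""
    out = []
    i = 0  # number of set bits seen so far
    for pos, bit in enumerate(highs):
        if bit:
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
    return out
```

Because each element's high part is offset by its rank, the set bits are strictly increasing even when the list contains repeated values, which is what makes decoding unambiguous.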

Query Phase

Querier

  • Definition: Generates and executes queries based on Target Files.

  • Description:

    • Before the Querier runs, the Target must pass through Preprocessing.
    • Produces Query Requests that list MTSU and metadata from the Target Object.
      • Generates a Batch Query request.
        • May group or deduplicate MTSU to optimize further searching.
      • Stores the map between MTSU and Source ID.

Nominator

  • Definition: Identifies candidate files during query operations.

  • Description:

    • Takes a Query Request and looks up possible file matches from the Index.
    • Returns a list of Source file IDs that contain matching MTSU.
      • Key: MTSU
      • Value: Source file ID list (by Bitmap)
    • Uses grouped/deduplicated MTSU, which are optimized for query operations.
      • Open question: does the grouping happen in the Querier or the Indexer?

Comparison Phase

Merger

  • Definition: Consolidates and refines the final list of candidate files after the Nominator.

  • Description:

    • Merges the candidate file lists based on the query results.
    • Produces the final candidate set for the Query.
    • Combines results across multiple MTSU lookups for each Target File.
    • Uses a Bitmap to merge (OR operation) the candidate file lists.
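
The OR-merge can be sketched with Python integers acting as bitmaps, where bit `n` is set when file ID `n` is a candidate. The helper names are hypothetical, and a production system would more likely use a compressed bitmap library.

```python
def to_bitmap(file_ids: list[int]) -> int:
    """Set one bit per candidate file ID."""
    bitmap = 0
    for file_id in file_ids:
        bitmap |= 1 << file_id
    return bitmap


def merge_candidates(bitmaps: list[int]) -> int:
    """OR together the per-MTSU candidate bitmaps into one candidate set."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return merged


def from_bitmap(bitmap: int) -> list[int]:
    """List the file IDs whose bits are set, in ascending order."""
    file_ids = []
    file_id = 0
    while bitmap:
        if bitmap & 1:
            file_ids.append(file_id)
        bitmap >>= 1
        file_id += 1
    return file_ids
```

If one MTSU lookup returns files 1 and 4 and another returns files 4 and 7, the merged candidate set is files 1, 4, and 7, with file 4 counted once.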

File Extractor

Comparator

Evaluator

Reporter

Partitioning

Language Family

Language Family

  • Definition: A grouping of programming languages based on similar syntax or representation.

  • Description:

    • Languages in the same family may share common grammatical structures, enabling identical tokenization patterns.
    • A single language can belong to multiple language families if it exhibits shared features with different groups.
    • Helps partition Source data for more efficient indexing and searching.

Parameter

Parameter

  • Definition: A user-defined setting that customizes query behavior.

  • Types:

    • Similarity Rate: Sets the degree of overlap needed for matches in a Similar Query.
    • Search Unit (by token):
      • MTSU Size:
        • Definition: The smallest hashable subsequence size.
      • Minimum Search Unit Size:
        • Definition: The smallest token sequence considered for matching.
  • Description:

    • Affects how queries are performed and how results are filtered.
    • Allows for tuning accuracy versus performance.
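
These parameters could be grouped into a small validated structure, as sketched below. The class name `QueryParameters`, the field names, and the allowed range of the similarity rate are assumptions. The default sizes (3 and 5) are borrowed from the MSs example in this glossary, and the size check reflects the rule that an MSs must be equal to or longer than an MTS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    mts_size: int = 3                # MTSU Size, in tokens
    min_search_unit_size: int = 5    # Minimum Search Unit Size (MSs), in tokens
    similarity_rate: float = 1.0     # assumed scale: 1.0 means exact match

    def __post_init__(self) -> None:
        if self.mts_size < 1:
            raise ValueError("mts_size must be at least 1 token")
        if self.min_search_unit_size < self.mts_size:
            raise ValueError("min_search_unit_size must not be smaller than mts_size")
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
```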