Glossary

Basic Descriptors

Object

    Definition: Any unit that can be processed or analyzed by the QOSI system.

    Types:

      • File, File ID, Offset, Offset Index
      • Tokenized File, Tokenized File ID
      • Candidate Object List, Candidate Object Bitmap
      • MTS, MTSU
      • Bitmap
      • TODO: etc.

      Target

        Definition: User-provided objects intended for analysis.

        Description:

          Consists of objects (e.g., files) submitted by the user. Always remains separate and is never merged into the Source.

        Characteristics:

          Relatively smaller than the Source.

        Workflow:

          Is subject to immediate tokenization and/or indexing once received.

                Source

                • Definition: Objects to match against the Target.

                Description:

                  Objects maintained by the service provider. The main data collection (objects) for queries and comparisons.

                  Characteristics:

                  • Acts as the reference pool for matching against the Target.
                  • Mainly OSS (Open Source Software) or other public datasets.
                  • Is readily available and pre-indexed.
                  • Is updated infrequently, e.g., on a weekly or monthly schedule.
                  • TODO: Move to Indexing. Indexing duration has no significant impact on data structure design.

                    Candidate

                    • Definition: Source objects that potentially share a common subsequence with the Target.

                    Description:

                      Identified during the Candidate Query phase. Subset of the Source.

                      Characteristics:

                        • May include false positives but never false negatives.
                        • Used to quickly identify potential matches in the Source.
                        • Currently, every Candidate includes an MTS but may not include an MSs.

                        Identified

                          Definition: The subset of Source/Target objects that contain a common subsequence with Target/Source.

                          Description:

                            • Identified during the Comparing phase.
                            • Identified Source: Source objects that contain a common subsequence with the Target.
                            • Identified Target: Target objects that contain a common subsequence with the Source.

                            Index

                              Definition: Metadata that identifies the position of an item within an ordering. Similar to an ID.

                              Description:

                                Many objects (e.g., File, Tokenized File, Token) have their own Index.

                                Immutable

                                  Definition: Not changed during the process.

                                  Description:

                                    The Index is immutable after the indexing phase (when the index is created).

                                    Dataset

                                    Dataset

                                      Definition: An aggregated collection of objects.

                                      Characteristics:

                                      • A Dataset may contain numerous objects.
                                      • A Dataset may contain objects of a single type or of different types (e.g., File, Tokenized File, Candidate Object List).

                                        Source Dataset

                                          Definition: The comprehensive set of objects maintained as the Source.

                                          Usage:
                                          • Interchangeably referred to simply as "Source."

                                        Target Dataset

                                          Definition: The complete set of objects the user provides.

                                          Usage:
                                          • Interchangeably referred to simply as "Target."

                                            Token and Sequence

                                            Token

                                            • Definition: The smallest unit of data derived from the content of a file.

                                            Description:

                                            • Generated through tokenization, i.e., splitting code or text into discrete elements.
                                            • Varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.

                                            Characteristics:

                                              Enables fine-grained comparison between Source and Target files. Significantly reduces duplication compared to entire lines or blocks of text.

                                              Details:

                                                Code Data: Tokenized according to code syntax rules.

                                                Text Data (Natural Language):
                                                • Western Languages: Split by spaces (plus punctuation considerations).
                                                • Chinese: Every character is treated as a token.
                                                • Korean: Split by spaces; postpositional particles and suffixes are tokenized separately.
                                                • Japanese: Uses morphological analysis to segment words based on grammar and context.
                                                • Special Cases: Korean and Japanese require specialized tokenizers due to their morphological complexity.
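The two simplest text rules above can be sketched in a few lines. This is only an illustration, not the QOSI tokenizer: the function names are hypothetical, the punctuation handling is an assumption, and Korean and Japanese are omitted because they need a morphological analyzer.

```python
import re


def tokenize_western(text: str) -> list[str]:
    """Split on whitespace and emit each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_chinese(text: str) -> list[str]:
    """Treat every non-whitespace character as one token."""
    return [ch for ch in text if not ch.isspace()]


print(tokenize_western("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(tokenize_chinese("你好 世界"))        # ['你', '好', '世', '界']
```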

                                                Syntax Token

                                                  Definition: A token that does not reflect its original string but its syntax role.

                                                  String Token

                                                    Definition: A token that reflects its original string.
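To make the distinction concrete, the sketch below derives both kinds of token from the same line of Python code using the standard `tokenize` module. The abstraction rule (identifiers become `NAME`, literals become `NUMBER` or `STRING`, keywords and operators are kept) is an assumption chosen for illustration; the glossary does not specify how QOSI assigns syntax roles.

```python
import io
import keyword
import token
import tokenize

# Layout-only tokens that carry no content for this illustration.
SKIP = {token.NEWLINE, token.NL, token.INDENT, token.DEDENT,
        token.ENDMARKER, token.COMMENT}


def _tokens(code: str):
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type not in SKIP:
            yield tok


def string_tokens(code: str) -> list[str]:
    """String Tokens: each token keeps its original text."""
    return [tok.string for tok in _tokens(code)]


def syntax_tokens(code: str) -> list[str]:
    """Syntax Tokens: identifiers and literals are replaced by their role."""
    out = []
    for tok in _tokens(code):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            out.append("NAME")
        elif tok.type in (token.NUMBER, token.STRING):
            out.append(token.tok_name[tok.type])
        else:
            out.append(tok.string)  # keywords and operators stay as written
    return out


print(string_tokens("total = price + 1"))  # ['total', '=', 'price', '+', '1']
print(syntax_tokens("total = price + 1"))  # ['NAME', '=', 'NAME', '+', 'NUMBER']
```

Under this rule, `total = price + 1` and `sum_ = cost + 2` produce different String Token sequences but the same Syntax Token sequence, so renamed identifiers no longer hide a structural match.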

                                                    Sequence and Subsequence

                                                    • Definition:

                                                      • Sequence: An ordered list of tokens derived from a file or segment.
                                                      • Subsequence: A contiguous run of tokens within a sequence.

                                                    Description:

                                                      Used for matching and comparison processes (e.g., detecting shared code fragments).

                                                      Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)

                                                      • Definition:

                                                        • MTS: A fixed-size subsequence of tokens from the original sequence.
                                                        • MTSU: The hashed version of an MTS, serving as a fundamental unit for indexing, candidating, and comparing.

                                                      Characteristics:

                                                      • An MTSU is generated by hashing an MTS.
                                                      • Because an MTS combines multiple tokens, it reduces the chance of random duplication compared to single-token matching.
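As a rough illustration of the relationship between the two terms, the sketch below slides a fixed-size window over a token sequence to obtain each MTS and hashes each window to an MTSU. The window size, the 64-bit BLAKE2b digest, and the separator byte are assumptions; the glossary does not name the actual hash function.

```python
import hashlib


def mts_windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (one MTS per window)."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...]) -> int:
    """Hash one MTS to a fixed-size 64-bit integer (assumed MTSU width)."""
    # Join with a separator that is unlikely to occur inside a token, so that
    # ("ab", "c") and ("a", "bc") do not collide trivially.
    data = "\x1f".join(mts).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


tokens = "A B C D E".split()
for window in mts_windows(tokens, 3):
    print(window, hex(mtsu(window)))
```

A sequence shorter than the window size yields no MTS at all, which is one reason an MSs can never be shorter than an MTS.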

                                                      Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))

                                                      • Definition: A subsequence of the minimum size (length) requested by the user.

                                                      Details:

                                                        • An MSs must be equal to or longer than an MTS.
                                                        • During comparison, the Comparator connects consecutive MTS (MTSU) matches to construct an MSs.

                                                        Example:

                                                          MTS: 3 tokens
                                                          MSs: 5 tokens
                                                          To find the MSs A B C D E in a file, the Comparator finds the MTS A B C, B C D, and C D E by their MTSU and connects them into the MSs A B C D E.
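The example can be traced with a small sketch that chains consecutive MTS matches into longer runs and keeps only runs that reach the MSs length. It compares token windows directly rather than MTSU hashes, and the function name and return shape are hypothetical; it is meant to show the chaining idea, not the Comparator's actual algorithm.

```python
def find_mss(source: list[str], target: list[str],
             mts_size: int = 3, mss_size: int = 5) -> list[tuple[int, int, int]]:
    """Return (target_start, source_start, length) for each maximal match
    whose length in tokens is at least `mss_size`."""
    def windows(tokens):
        return [tuple(tokens[i:i + mts_size])
                for i in range(len(tokens) - mts_size + 1)]

    src_w, tgt_w = windows(source), windows(target)

    # Where does each MTS occur in the source?
    positions: dict[tuple[str, ...], list[int]] = {}
    for j, w in enumerate(src_w):
        positions.setdefault(w, []).append(j)

    seen: set[tuple[int, int]] = set()   # window pairs already inside a run
    matches = []
    for i, w in enumerate(tgt_w):
        for j in positions.get(w, []):
            if (i, j) in seen:
                continue
            n = 0  # number of consecutive matching windows
            while (i + n < len(tgt_w) and j + n < len(src_w)
                   and tgt_w[i + n] == src_w[j + n]):
                seen.add((i + n, j + n))
                n += 1
            length = n + mts_size - 1    # n chained windows cover this many tokens
            if length >= mss_size:
                matches.append((i, j, length))
    return matches


source = "X A B C D E Y".split()
target = "A B C D E".split()
print(find_mss(source, target))  # [(0, 1, 5)]
```

Three chained 3-token windows cover 3 + 3 - 1 = 5 tokens, which is exactly the MSs length in the example.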

                                                          Hash Function

                                                            Definition: A function that converts an MTS into a fixed-size value (an MTSU).

                                                            Description:

                                                            • Ensures consistent, quick comparisons.
                                                            • Takes a Token Subsequence rather than a single Token as input, which gives it more identifying power.

                                                            File and Tokenized File

                                                            File

                                                              Definition: A code and/or text container.

                                                              Description:

                                                              • Consists of a sequence of original tokens (raw source code, natural language text).
                                                              • Only the Reporter (viewer) accesses the file directly.
                                                              • In the Indexing and Comparing phases, the system uses Tokenized Files.

                                                              Characteristics:

                                                                • Typically stored in a filesystem or code repository.
                                                                • Source Files are stored in the Archiver.
                                                                • Can be part of either a Source Dataset or a Target Dataset.

                                                                  Tokenized File

                                                                    Definition: A file after it has been converted into a sequence of tokens.

                                                                    Description:

                                                                      Produced by a tokenizer from the raw file. Consists of tuples: (Token Index, MTSU, separated token #1, separated token #2, ...).

                                                                      Token Index

                                                                        Definition: The token's position in the token sequence of the file.

                                                                        Partitioning

                                                                          Definition: Dividing a data structure by MTSU to optimize searching. Applies to the Candidate Index, File Archive, etc.
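One way such a split could work is to use the leading bits of each MTSU as the partition number, so that a lookup only has to open one partition and entries stay in MTSU order across partitions. This is a sketch under that assumption; the glossary does not say how partition boundaries are chosen, and all names and bit widths here are hypothetical.

```python
def partition_of(mtsu: int, partition_bits: int = 4, mtsu_bits: int = 64) -> int:
    """Partition number = the top `partition_bits` bits of the MTSU."""
    return mtsu >> (mtsu_bits - partition_bits)


def partition_index(index: dict[int, list[int]],
                    partition_bits: int = 4,
                    mtsu_bits: int = 64) -> dict[int, dict[int, list[int]]]:
    """Split an MTSU -> file-ID-list map into per-partition maps."""
    parts: dict[int, dict[int, list[int]]] = {}
    for mtsu in sorted(index):
        part = partition_of(mtsu, partition_bits, mtsu_bits)
        parts.setdefault(part, {})[mtsu] = index[mtsu]
    return parts


# Tiny 8-bit MTSU values keep the example readable.
index = {0x10: [1], 0x50: [2, 3], 0xF0: [4]}
parts = partition_index(index, partition_bits=2, mtsu_bits=8)
print(parts)  # {0: {16: [1]}, 1: {80: [2, 3]}, 3: {240: [4]}}
```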

                                                                            Query

                                                                            Query, Query Result

                                                                            !!TODO: Clearly define Query Request and Query Result

                                                                              Definition:

                                                                                • Query: The act of searching the Source for subsequences it has in common with the Target.
                                                                                • Query Request: A set of search parameters used to locate Source files containing specific subsequences found in the Target.
                                                                                • Query Result: The outcome of a Query, typically pairing Target files with matching Source files.

                                                                              Description:

                                                                                • Encompasses all files within the Target dataset: the entire Target dataset is scanned to find potential matches in the Source.
                                                                                • Candidate Queries help identify which Source files, along with their associated project metadata, might contain matching subsequences.
                                                                                • Identifies Source files that contain token subsequences matching any subsequence within Target files.

                                                                                  Types:

                                                                                    • Exact Query: Finds subsequences that match exactly between Source and Target.
                                                                                    • Similar Query: Finds similar subsequences, given a user-defined similarity threshold.

                                                                                    Result:

                                                                                    • Pairs each Source File with the relevant Target File.
                                                                                    • Includes index ranges (start/end) of the matching subsequences for both files:
                                                                                      • By Token Index
                                                                                      • By Line Number

                                                                                    Candidate Query

                                                                                    !!TODO: Clearly define Candidate Query and Query

                                                                                      Definition: A preliminary, broad search, similar to a Query, that returns only a list of candidate files from the Source.

                                                                                      Description:
                                                                                        • Operates over the entire Source dataset.
                                                                                        • Allows false positives but no false negatives (i.e., it might over-include but never miss a true match).
                                                                                        • Does not directly support Similar Queries (though smaller MTS sizes can approximate similarity).

                                                                                        Batch Query

                                                                                          Definition: A method for processing multiple queries to improve efficiency.

                                                                                          Characteristics:

                                                                                            • Reduces repeated work by handling similar queries in a single pass.
                                                                                            • Constructed from an MTS list.
                                                                                            • The link between Source ID and MTS is not sent to the Nominator; it is stored in the Querier.

                                                                                            Example:

                                                                                              Target Object A's Token Sequence: A B C D E F G
                                                                                                A's MTS: A B C, B C D, C D E, D E F, E F G
                                                                                              Target Object B's Token Sequence: A B C E F G
                                                                                                B's MTS: A B C, B C E, C E F, E F G
                                                                                              Batch Query: A B C, B C D, C D E, D E F, E F G, B C E, C E F
                                                                                              MTS shared by several Target Objects (e.g., A B C, E F G) are not duplicated in the Batch Query.
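The deduplication in this example can be reproduced with a short sketch. It builds the Batch Query as a first-seen-order list of distinct MTS and keeps a separate map from each MTS to the objects that contain it, mirroring the note that the link is kept in the Querier rather than sent to the Nominator. The function name, the use of raw MTS tuples instead of MTSU values, and the string object IDs are assumptions.

```python
def build_batch_query(targets: dict[str, list[str]], mts_size: int = 3):
    """Return (batch, links): the deduplicated MTS list and, for each MTS,
    the set of Target Object IDs it came from."""
    batch: list[tuple[str, ...]] = []
    links: dict[tuple[str, ...], set[str]] = {}
    for object_id, tokens in targets.items():
        for i in range(len(tokens) - mts_size + 1):
            mts = tuple(tokens[i:i + mts_size])
            if mts not in links:        # first time this MTS is seen
                links[mts] = set()
                batch.append(mts)
            links[mts].add(object_id)   # stays with the Querier
    return batch, links


targets = {
    "A": "A B C D E F G".split(),
    "B": "A B C E F G".split(),
}
batch, links = build_batch_query(targets)
print([" ".join(m) for m in batch])
# ['A B C', 'B C D', 'C D E', 'D E F', 'E F G', 'B C E', 'C E F']
print(sorted(links[("A", "B", "C")]))  # ['A', 'B']
```

Nine windows across the two objects collapse to seven distinct MTS, because A B C and E F G occur in both.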

                                                                                                        Preprocess Phase

                                                                                                        Preprocessor

                                                                                                        • Definition: A component that converts a File to a Tokenized File and extracts metadata.

                                                                                                        Description:

                                                                                                        • Extracts metadata from the file.
                                                                                                        • Includes tokenizing (Tokenizer).

                                                                                                        Tokenizer

                                                                                                          Definition: A component that converts a File to a sequence of tokens and produces a Tokenized File.

                                                                                                          File Archiver

                                                                                                            Definition: Retrieves a Source File and/or Source Tokenized File’s content by the file ID.

                                                                                                            Description:

                                                                                                            • Archives Source Files and Source Tokenized Files, along with their metadata.

                                                                                                            Index Phase

                                                                                                            Candidate Index (The Index)

                                                                                                              Definition: A data structure built from Source files to facilitate Candidate Queries.

                                                                                                              Description:

                                                                                                              • Used to respond to Candidate Queries quickly. It acts as a high-level listing.
                                                                                                              • Is immutable after the indexing phase. The Nominator does not modify The Index.
                                                                                                              • Is organized (sorted) by MTSU to speed up lookups.
                                                                                                              • Partitioning is applied to The Index for efficient searching.

                                                                                                              Types:

                                                                                                              • Raw Candidate Index: Is not optimized or compressed. May use RocksDB or LevelDB.
                                                                                                              • Candidate Index (The Index): Is optimized and compressed. The Nominator uses this index.

                                                                                                                                      Indexer (Source File Indexer)

                                                                                                                                        Definition: Builds the Index from the Source datasets.

                                                                                                                                        Description:

                                                                                                                                          Converts Source Files into Tokenized Files. Generates MTSU entries and relevant metadata for the Index.

                                                                                                                                          Workflow

                                                                                                                                            For each Tokenized File, record its file ID under each of its MTSU in the Candidate Index.
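The workflow step amounts to building an inverted index from MTSU to file IDs. The sketch below uses an in-memory dictionary and small integers for both MTSU values and file IDs; these are assumptions for illustration, and a production index would live in a store such as the ones named for the Raw Candidate Index.

```python
from collections import defaultdict


def build_candidate_index(tokenized_files: dict[int, list[int]]) -> dict[int, list[int]]:
    """Map each MTSU to the sorted list of Source file IDs that contain it.

    `tokenized_files` maps a file ID to the MTSU values of that file.
    """
    postings: dict[int, set[int]] = defaultdict(set)
    for file_id, mtsus in tokenized_files.items():
        for mtsu in mtsus:
            postings[mtsu].add(file_id)   # record the file ID by MTSU
    # Emit keys in MTSU order so the result is sorted for lookups.
    return {mtsu: sorted(postings[mtsu]) for mtsu in sorted(postings)}


index = build_candidate_index({1: [20, 10], 2: [30, 20]})
print(index)  # {10: [1], 20: [1, 2], 30: [2]}
```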

                                                                                                                                            Index Compressor

                                                                                                                                            • Definition: Extracts Key-Value pairs and compresses them to reduce the size of the Index and improve performance.

                                                                                                                                            Workflow:

                                                                                                                                              • Extract Key-Value pairs from the Candidate Index.
                                                                                                                                              • Compress the Key-Value pairs with Elias-Fano encoding.
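For readers unfamiliar with Elias-Fano encoding, here is a minimal sketch of the standard scheme applied to one sorted list of file IDs. Each value is split into low bits, stored verbatim, and high bits, stored in unary in a bit vector. The sketch keeps both parts in plain Python lists for readability, so it shows the layout rather than the space savings; the function names and the `universe` bound are assumptions, and the glossary does not describe how QOSI lays out its compressed index.

```python
def ef_encode(values: list[int], universe: int):
    """Encode a sorted list of integers in [0, universe)."""
    n = len(values)
    # Number of low bits per value: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # Unary-coded high parts: element i sets bit (high_i + i).
    high = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        high[(v >> low_bits) + i] = 1
    return low_bits, lows, high


def ef_decode(low_bits: int, lows: list[int], high: list[int]) -> list[int]:
    """Rebuild the original sorted list."""
    out = []
    i = 0  # index of the next element to decode
    for pos, bit in enumerate(high):
        if bit:
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
    return out


file_ids = [2, 3, 5, 7, 11, 13, 24]
encoded = ef_encode(file_ids, universe=32)
print(encoded[0])           # 2 low bits per value
print(ef_decode(*encoded))  # [2, 3, 5, 7, 11, 13, 24]
```

With the low parts packed at `low_bits` bits each, the encoding takes roughly `n * low_bits + 2n` bits in total, which is why it suits long sorted ID lists.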

                                                                                                                                              Query Phase

                                                                                                                                              Querier

                                                                                                                                                Definition: Generates and executes queries based on Target Files.

                                                                                                                                                Description:

                                                                                                                                                • Before the Querier runs, the Target Object must pass through Preprocessing.
                                                                                                                                                • Produces Query Requests that list MTSU and metadata from the Target Files.
                                                                                                                                                • Generates a Batch Query request.
                                                                                                                                                • May group or deduplicate MTSU to optimize further searching.
                                                                                                                                                • Stores the map between MTSU and Source ID.

                                                                                                                                                  Nominator

                                                                                                                                                    Definition: Identifies candidate files during query operations.

                                                                                                                                                    Description:

                                                                                                                                                      Takes a Query Request and looks up possible file matches in the Index. Returns a list of Source file IDs that contain matching MTSU.
                                                                                                                                                      • Key: MTSU
                                                                                                                                                      • Value: Source file ID list (as a Bitmap)

                                                                                                                                                        Comparison Phase

                                                                                                                                                        Merger

                                                                                                                                                        • Definition: Consolidates and refines the final list of candidate files after the Nominator has answered a Candidate Query.

                                                                                                                                                          Description:

                                                                                                                                                          • Produces the final candidate file set for the Query.
                                                                                                                                                          • Combines results across multiple MTSU lookups for each Target File.
                                                                                                                                                          • Uses a Bitmap to merge (OR operation) the candidate file lists.
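The OR-merge can be sketched with Python integers standing in for bitmaps, where bit `n` set means Source file `n` is a candidate. The function names and the choice of an `int` as the bitmap type are assumptions; a real deployment would more likely use a compressed bitmap library.

```python
def to_bitmap(file_ids: list[int]) -> int:
    """Build a bitmap with one bit set per file ID."""
    bitmap = 0
    for fid in file_ids:
        bitmap |= 1 << fid
    return bitmap


def merge_candidates(index: dict[int, int], query_mtsus: list[int]) -> int:
    """OR together the bitmaps of every MTSU in the query.

    An MTSU missing from the index contributes no candidates.
    """
    merged = 0
    for mtsu in query_mtsus:
        merged |= index.get(mtsu, 0)
    return merged


def bitmap_to_ids(bitmap: int) -> list[int]:
    """List the file IDs whose bits are set, in ascending order."""
    ids, fid = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(fid)
        bitmap >>= 1
        fid += 1
    return ids


# Key: MTSU, Value: bitmap of Source file IDs
index = {10: to_bitmap([1, 4]), 20: to_bitmap([4, 7])}
merged = merge_candidates(index, [10, 20, 99])
print(bitmap_to_ids(merged))  # [1, 4, 7]
```

Because OR only ever adds files, a file that contains any queried MTSU survives the merge, which is consistent with candidates allowing false positives but no false negatives.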

                                                                                                                                                            File Extractor

                                                                                                                                                            Comparator

                                                                                                                                                            Evaluator

                                                                                                                                                            Reporter


                                                                                                                                                            Partitioning

                                                                                                                                                            Hash Function

                                                                                                                                                            • Definition: A function that converts input data (MTS) into a fixed-size string of bytes (MTSU).
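A function of the kind defined above, mapping an MTS to a fixed-size byte string, could look like the sketch below. The glossary does not name a hash algorithm, so the choice of BLAKE2b, the 8-byte digest, the representation of an MTS as a sequence of token strings, and the name `hash_mts` are all assumptions made for illustration.

```python
import hashlib
from typing import Sequence


def hash_mts(tokens: Sequence[str], digest_size: int = 8) -> bytes:
    """Hash a token sequence (MTS) into a fixed-size byte string (MTSU).

    Assumptions: tokens are strings; BLAKE2b with an 8-byte digest stands in
    for whatever hash the real system uses.
    """
    # Join with the ASCII unit separator so ["ab"] and ["a", "b"] differ.
    payload = "\x1f".join(tokens).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=digest_size).digest()


mtsu = hash_mts(["if", "(", "IDENT", ")", "{"])
```

The output length depends only on `digest_size`, not on the number of tokens, which is what makes the result usable as a fixed-width index key.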

                                                                                                                                                            Language Family

                                                                                                                                                              Definition: A grouping of programming languages based on similar syntax or representation.

                                                                                                                                                              Description:

                                                                                                                                                              • Languages in the same family may share common grammatical structures, enabling identical tokenization patterns.
                                                                                                                                                              • A single language can belong to multiple language families if it exhibits shared features with different groups.
                                                                                                                                                              • Helps partition Source data for more efficient indexing and searching.
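Partitioning by Language Family, including the case where one language belongs to several families, can be sketched as below. The family names, the language-to-family mapping, and the function `partition_by_family` are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical mapping; family names and memberships are illustrative only.
LANGUAGE_FAMILIES: Dict[str, Set[str]] = {
    "c": {"c_like"},
    "cpp": {"c_like"},
    "java": {"c_like", "jvm"},
    "kotlin": {"jvm"},
    "python": {"indentation_based"},
}


def partition_by_family(files: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (file_id, language) pairs into one partition per Language Family.

    A file whose language belongs to several families is placed in each of
    them; languages missing from the mapping go to an "unknown" partition.
    """
    partitions: Dict[str, List[str]] = defaultdict(list)
    for file_id, language in files:
        for family in sorted(LANGUAGE_FAMILIES.get(language, {"unknown"})):
            partitions[family].append(file_id)
    return dict(partitions)


source_files = [("f1", "c"), ("f2", "java"), ("f3", "kotlin"), ("f4", "cobol")]
partitions = partition_by_family(source_files)
```

With partitions like these, a query for a Target File only needs to consult the partitions of the families its language belongs to.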

                                                                                                                                                              Parameter

                                                                                                                                                              • Definition: A user-defined setting that customizes query behavior.

                                                                                                                                                              • Types:

                                                                                                                                                                • Similarity Rate: Sets the degree of overlap needed for matches in a Similar Query.
                                                                                                                                                                • Search Unit (by token):
                                                                                                                                                                  • MTSU Size:
                                                                                                                                                                    • Definition: The smallest hashable subsequence size.
                                                                                                                                                                • Minimum Search Unit Size:
                                                                                                                                                                  • Definition: The smallest token sequence considered for matching.

                                                                                                                                                                  Description:

                                                                                                                                                                    Affects how queries are performed and how results are filtered. Allows for tuning accuracy versus performance.
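The parameters above could be grouped into a single settings object, as in this sketch. The field names, default values, and the `is_match` helper are assumptions. In particular, the glossary does not define how the Similarity Rate is computed, so it is treated here as the fraction of matched tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    """Hypothetical container for user-defined query parameters."""

    similarity_rate: float = 0.8     # assumed to be a fraction in [0, 1]
    mtsu_size: int = 8               # tokens per hashable subsequence
    min_search_unit_size: int = 16   # shortest token sequence worth matching

    def __post_init__(self) -> None:
        if not 0.0 <= self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be between 0 and 1")
        if self.mtsu_size < 1 or self.min_search_unit_size < 1:
            raise ValueError("sizes must be positive token counts")


def is_match(matched_tokens: int, total_tokens: int, params: QueryParameters) -> bool:
    """Return True when a candidate passes both the size and overlap filters."""
    if total_tokens < params.min_search_unit_size:
        return False  # sequence too short to be considered
    return matched_tokens / total_tokens >= params.similarity_rate


params = QueryParameters(similarity_rate=0.5, mtsu_size=4, min_search_unit_size=10)
```

Raising `similarity_rate` or `min_search_unit_size` filters out more candidates, which is one way such settings trade accuracy against performance.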