Glossary
Basic Descriptors
Object
- Definition: Any unit that can be processed or analyzed by the QOSI system.
- Types:
- File, File ID,
- Offset, Offset Index
- Tokenized File, Tokenized File ID
- Candidate Object List, Candidate Object Bitmap
- MTS, MTSU
- Bitmap
- TODO:
- etc.
Target
-
Definition: User-provided objects intended for analysis.
-
Descriptions:
- Consists of objects submitted by the user.
- Always remains separate and is never merged into the Source.
-
Characteristics:
- Relatively smaller than the Source.
- Never incorporated into the Source.
-
Workflow:
- Is subject to immediate tokenization and/or indexing once received.
Source
-
Definition: Objects to match against the Target.
-
Description:
- Objects maintained by the service provider.
- The main data collection (objects) for queries and comparisons.
-
Characteristics:
- Acts as the reference pool for matching against the Target.
- Mainly OSS(Open Source Software) or other public datasets.
It is readily available. Pre-indexed.- It is updated infrequently
- e.g., weekly or monthly
- TODO: Move to Indexing Candidate Indexing duration has no significant impact on data structure design.
Candidate
-
Definition: Potential objects with common subsequence with the Target.
-
Description:
- Identified during the Candidate Query phase.
- Subset of the Source.
-
Characteristics:
- May include false positives but never false negatives.
- Used to quickly identify potential matches in the Source.
- In current, every candidate includes MTS, but may not include MSs.
Identified
-
Definition: The subset of Source/Target objects that contain a common subsequence with Target/Source.
-
Description:
- Identified during the Comparing phase.
- Identified Source: Source objects that contain a common subsequence with the Target.
- Identified Target: Target objects that contain a common subsequence with the Source.
Index
-
Definition: A metadata to identify the order of the data. Similar to ID.
-
Description:
- Many objects (ex: File, Tokenized File, Token, etc.) have their own Index.
Immutable
-
Definition: Would not be changed during the process.
-
Description:
- The Index is immutable after the indexing phase(when the index is created).
Dataset
Dataset
-
Definition: An aggregated collection of objects.
-
Characteristics:
- Each Dataset may contain numerous objects
- A dataset may contain only the same type or different types of objects.
- source code, text, etc.
- File, Tokenized File, Candidate Object List
Source Dataset
- Definition: The comprehensive set of objects maintained as the Source.
- Usage:
- Interchangeably referred to simply as "Source."
Target Dataset
- Definition: The complete set of objects the user provides.
- Usage:
- Interchangeably referred to simply as "Target."
Token and Sequence
Token
-
Definition: The smallest unit of data derived from the content of a file.
-
Description:
- Generated through tokenization—splitting code or text into discrete elements.
- Varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.
-
Characteristics:
- Enables fine-grained comparison between Source and Target files.
- Significantly reduces duplication compared to entire lines or blocks of text.
-
Details:
- Code Data: Tokenized according to code syntax rules.
- Text Data (Natural Language):
- Western Languages: Split by spaces (plus punctuation considerations).
- Chinese: Every character is treated as a token.
- Korean: Split by spaces; postpositional particles and suffixes are separately tokenized.
- Japanese: Uses morphological analysis to segment words based on grammar and context.
- Special Cases: Korean/Japanese require specialized tokenizers due to morphological complexity.
Syntax Token
- Definition: A token that does not reflect its original string but its syntax role.
String Token
- Definition: A token that reflects its original string.
Sequence and Subsequence
-
Definition:
- Sequence: An ordered list of tokens derived from a file or segment.
- Subsequence: A contiguous subset of tokens within a sequence.
-
Description:
- Used for matching and comparison processes (e.g., detecting shared code fragments).
Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)
-
Definition:
- MTS: A fixed-size subsequence of tokens from the original sequence.
- MTSU: The hashed version of an MTS, serving as a fundamental unit for indexing, candidating, and comparing.
-
Characteristics:
- MTSU is generated by hashing an MTS.
- Because an MTS is multiple tokens combined, it reduces the chance of random duplication compared to single-token matching.
Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))
-
Definition: The subsequence that is requested by minimum size(length) by the user.
-
Details:
- MSs should be equal or greater(longer) than MTS.
- When comparing on Comparator, connect MTS(MTSU) to construct MSs.
-
Example:
- MTS: 3 tokens
- MSs: 5 tokens
- When finding a MSs
A B C D Ein a file, the Comparator will find MTSA B C,B C DandC D Eby MTSU and match it to MSsA B C D E.
Hash Function
-
Definition: A function that converts an MTS into a fixed-size value (an MTSU).
-
Description:
- Ensures consistent, quick comparisons.
- Takes not only a Token, but Token Subsequence to get more identification power.
File and Tokenized File
File
-
Definition: A code and/or text container.
-
Description:
- Consists of a sequence of original tokens (raw source code, natural language text).
- Only Reporter(viewer) access the file directly
- In Indexing, and Comparing phases, the system uses Tokenized Files.
-
Characteristics:
- Typically stored in a filesystem or code repository.
- Source Files are stored in Archiver.
- Can be part of either Source Dataset or a Target Dataset.
- Typically stored in a filesystem or code repository.
Tokenized File
-
Definition: A file after it has been converted into a sequence of tokens.
-
Description:
- Produced by a tokenizer from the raw file.
- Consists of pairs:
(**Token Index**, **MTSU**, separated token #1, separated token #2, ...).
Token Index
- Definition: The token's position in the token sequence of the file.
Partitioning
- Definition: Dividing something by MTSU to optimize searching.
- Candidate Index, File Archive, and etc.
Query
Query, Query Request, and Query Result
-
Definition:
- Query: The act of searching the Source for common subsequences found in the Target.
- Query Request: Specific search parameters used to locate relevant subsequences in Source files.
- Query Result: The outcome of a Query, typically pairing Target files with matching Source files.
-
Description:
- Involves scanning across the entire Target dataset to find potential matches in the Source.
- Candidate Queries help identify which Source files might contain matching subsequences.
-
Types:
- Exact Query: Finds subsequences that match exactly between Source and Target.
- Similar Query: Finds similar subsequences, given a user-defined similarity threshold.
-
Result:
- Pairs each Source file with the relevant Target file.
- Includes index ranges (start/end) of matching subsequences for both files.
- By Token Index
- By Line Number
Candidate Query
Definition:** A preliminary, broad search returns only a list of candidate files from the Source.
- Description:
- Operates over the entire Source dataset.
- Allows false positives but no false negatives (i.e., it might over-include but never miss a true match).
- It does not directly support similar queries (though smaller MTS sizes can approximate similarity).
Batch Query
-
Definition: A method for processing multiple queries to improve efficiency.
-
Characteristics:
- Reduces repeated work by handling similar queries in a single pass.
- Constructed by MTS list
- Link between Source ID and MTS is not sent to Nominator, but stored in Querier.
-
Example:
- Target Object A's Token Sequence:
A B C D E F G- A's MTS:
A B CB C DC D ED E FE F G
- A's MTS:
- Target Object B's Token Sequence:
A B C E F G- B's MTS:
A B CB C EC E FE F G
- B's MTS:
- Batch Query:
A B C,B C D,C D E,B C E,C E F,E F G- Common MSss among Target Object (ex:
A B C,E F G)are not duplicated in Batch Query
- Common MSss among Target Object (ex:
- Target Object A's Token Sequence:
Preprocess Phase
Preprocessor
-
Definition: A component that converts a File to a Tokenized File, and extract metadata.
-
Description:
- Extracts metadata from the file.
- Include Tokenizing(Tokenizer)
Tokenizer
- Definition: A component that converts a File to a sequence of tokens and make a Tokenized File.
File Archiver
-
Definition: Retrieves a Source File and/or Source Tokenized File’s content by the file ID.
-
Description:
- Archive Source Files and Source Tokenized Files, and its metadata.
Index Phase
Candidate Index (The Index)
-
Definition: A data structure built from candidate Source files to facilitate queries.
-
Description:
- Used to respond to Candidate Queries quickly.
- It acts as a high-level map of Source Files.
- Is immutable after the indexing phase.
- Nominator does not modify The Index.
- Is organized (sorted) by MTSU to speed up lookups.
- Partitioning is applied to The Index for efficient searching.
-
Types:
- Raw Candidate Index
- Is not optimized or compressed.
- May use RocksDB or LevelDB.
- Candidate Index, The Index
- Is optimized and compressed.
- Nominator uses this index.
- Raw Candidate Index
Indexer (Source File Indexer)
-
Definition: Builds the Index from the Source datasets.
-
Description:
- Converts Source Files into Tokenized Files.
- Generates MTSU entries and relevant metadata for the Index.
-
Workflow
- From Tokenized File, record the file ID by MTSU to Candidate Index.
Index Compressor
-
Definition: Extract Key-Value and Compress to reduce the size of the Index to improve performance.
-
Description:
-
Workflow:
- Extract Key-Value pairs from Candidate Index.
- Compress Key-Value pairs by Elias-Fano Encoding.
Query Phase
Querier
-
Definition: Generates and executes queries based on Target Files.
-
Description:
- Before Querier, it should pass Preprocessing for Target.
- Produces Query Requests that lists MTSU and metadata from the Target Object.
- Generate Batch Query
- May group or deduplicate MTSU to optimize further searching.
- Stores MTSU and Source ID map.
Nominator
-
Definition: Identifies candidate files during query operations.
-
Description:
- Takes a Query Request and looks up possible file matches from the Index.
- Returns a list of Source file IDs that contain matching MTSU.
- Key: MTSU
- Value: Source file ID list (by Bitmap)
Comparison Phase
Merger
-
Definition: Consolidates and refines the final list of candidate files after a candidate query from Nominator.
-
Description:
- Produces the final candidate file set for the Query.
- Combines results across multiple MTSU lookups for each Target File.
- Use Bitmap to merge(OR operation) the candidate file list.
File Extractor
Comparator
Evaluator
Reporter
Partitioning
Language Family
-
Definition: A grouping of languages based on similar syntax or representation.
-
Description:
- Languages in the same family may share identical tokenization patterns.
- A single language can belong to multiple families if it exhibits shared features with different groups.
- Helps partition Source data for more efficient indexing and searching.
Parameter
Parameter
-
Definition: A user-defined setting that customizes query behavior.
-
Types:
- Similarity Rate: Sets the degree of overlap needed for matches in a Similar Query.
- Search Unit (by token):
- MTSU Size: The smallest hashable subsequence size.
- Minimum Search Unit Size: The smallest token sequence considered for matching.
-
Description:
- Affects how queries are performed and how results are filtered.
- Allows for tuning accuracy versus performance.