Glossary
Basic Descriptors
Object
- Definition: Any unit that can be processed or analyzed by the QOSI system.
- Types:
- File, File ID
- Offset, Offset Index
- Tokenized File, Tokenized File ID
- Candidate Object List, Candidate Object Bitmap
- MTS, MTSU
- Bitmap
- TODO:
- etc.
Target
-
Definition: User-provided objects intended for analysis.
-
Description:
- Consists of objects submitted by the user.
-
Characteristics:
- Relatively smaller than the Source.
- Never incorporated into the Source.
-
Workflow:
- Is subject to immediate tokenization and/or indexing once received.
Source
-
Definition: Objects to match against the Target.
-
Description:
- Objects maintained by the service provider.
- The main data collection (objects) for queries and comparisons.
-
Characteristics:
- Acts as the reference pool for matching against the Target.
- Mainly OSS (Open Source Software) or other public datasets.
- It is readily available and pre-indexed.
- It is updated infrequently
- e.g., weekly or monthly
- TODO: Move to Indexing: Candidate Indexing duration has no significant impact on data structure design.
Candidate
-
Definition: Source objects that potentially share a common subsequence with the Target.
-
Description:
- Identified during the Candidate Query phase.
- Subset of the Source.
-
Characteristics:
- May include false positives but never false negatives.
- Used to quickly identify potential matches in the Source.
- Currently, every candidate includes a matching MTS, but may not include a matching MSs.
Identified
-
Definition: The subset of Source/Target objects that contain a common subsequence with Target/Source.
-
Description:
- Identified during the Comparing phase.
- Identified Source: Source objects that contain a common subsequence with the Target.
- Identified Target: Target objects that contain a common subsequence with the Source.
Index
-
Definition: Metadata that identifies the order of the data. Similar to an ID.
-
Description:
- Many objects (e.g., File, Tokenized File, Token) have their own Index.
Immutable
-
Definition: Not changed during processing.
-
Description:
- The Index is immutable after the indexing phase (when the index is created).
Dataset
Dataset
-
Definition: An aggregated collection of objects.
-
Characteristics:
- Each Dataset may contain numerous objects.
- A Dataset may contain objects of a single type or of different types.
- source code, text, etc.
- File, Tokenized File, Candidate Object List
Source Dataset
- Definition: The comprehensive set of objects maintained as the Source.
- Usage:
- Interchangeably referred to simply as "Source."
Target Dataset
- Definition: The complete set of objects the user provides.
- Usage:
- Interchangeably referred to simply as "Target."
Token and Sequence
Token
-
Definition: The smallest unit of data derived from the content of a file.
-
Description:
- Generated through tokenization, i.e., splitting code or text into discrete elements.
- Varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.
-
Characteristics:
- Enables fine-grained comparison between Source and Target files.
- Significantly reduces duplication compared to entire lines or blocks of text.
-
Details:
- Code Data: Tokenized according to code syntax rules.
- Text Data (Natural Language):
- Western Languages: Split by spaces (plus punctuation considerations).
- Chinese: Every character is treated as a token.
- Korean: Split by spaces; postpositional particles and suffixes are tokenized separately.
- Japanese: Uses morphological analysis to segment words based on grammar and context.
- Special Cases: Korean/Japanese require specialized tokenizers due to morphological complexity.
Syntax Token
- Definition: A token that represents its syntactic role rather than its original string.
String Token
- Definition: A token that reflects its original string.
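The difference between String Tokens and Syntax Tokens can be illustrated with a minimal sketch. The token pattern, the keyword set, and the role names `ID` and `NUM` are assumptions made for this example only; they are not QOSI's actual tokenizer rules.

```python
import re

# Hypothetical token pattern for a tiny C-like language (assumption for illustration):
# identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "return", "while", "for"}


def string_tokens(code):
    """String Tokens: every token keeps its original string."""
    return TOKEN_RE.findall(code)


def syntax_tokens(code):
    """Syntax Tokens: identifiers and numbers are replaced by their syntax role."""
    out = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            out.append(tok)          # keywords are kept as-is
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")         # any identifier
        elif tok.isdigit():
            out.append("NUM")        # any integer literal
        else:
            out.append(tok)          # operators and punctuation
    return out
```

With this sketch, `x = y + 1;` and `a = b + 2;` produce different String Token sequences but the same Syntax Token sequence, which is why Syntax Tokens can match code whose names were changed.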
Sequence and Subsequence
-
Definition:
- Sequence: An ordered list of tokens derived from a file or segment.
- Subsequence: A contiguous subset of tokens within a sequence.
-
Description:
- Used for matching and comparison processes (e.g., detecting shared code fragments).
Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)
-
Definition:
- MTS: A fixed-size subsequence of tokens from the original sequence.
- MTSU: The hashed version of an MTS, serving as a fundamental unit for indexing, candidating, and comparing.
-
Characteristics:
- MTSU is generated by hashing an MTS.
- Because an MTS is multiple tokens combined, it reduces the chance of random duplication compared to single-token matching.
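As a minimal sketch of the relationship between MTS and MTSU: an MTS is a fixed-size sliding window over the token sequence, and an MTSU is a hash of that window. The choice of BLAKE2b, the 64-bit digest size, and the separator byte are assumptions for illustration; the glossary does not specify the actual hash function.

```python
import hashlib


def mts_windows(tokens, size):
    """Return every fixed-size MTS (as a tuple) in order of appearance."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts):
    """Hash one MTS into a 64-bit integer MTSU (hash choice is an assumption)."""
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    data = "\x1f".join(mts).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


tokens = "A B C D E".split()
units = [mtsu(m) for m in mts_windows(tokens, 3)]  # one MTSU per MTS
```

A sequence shorter than the MTS size yields no MTS at all, so such a file contributes nothing to the index in this sketch.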
Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))
-
Definition: A subsequence of the minimum size (length) requested by the user.
-
Details:
- An MSs must be equal to or longer than an MTS.
- During comparison, the Comparator connects consecutive MTS (via their MTSU) to construct an MSs.
-
Example:
- MTS: 3 tokens
- MSs: 5 tokens
- When finding the MSs "A B C D E" in a file, the Comparator finds the MTS "A B C", "B C D", and "C D E" by MTSU and matches them to the MSs "A B C D E".
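The example above can be sketched as follows: an MSs is located by finding a run of consecutive MTSU in the file that equals the MTSU sequence of the MSs. The function names and the hash are hypothetical, and a real Comparator would presumably also verify the underlying tokens, since equal hashes do not strictly guarantee equal tokens.

```python
import hashlib


def mtsu(mts):
    """Hash one MTS into an MTSU (BLAKE2b is an assumption for illustration)."""
    data = "\x1f".join(mts).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).digest()


def mtsu_sequence(tokens, mts_size):
    """MTSU of every sliding window, in order."""
    return [mtsu(tokens[i:i + mts_size]) for i in range(len(tokens) - mts_size + 1)]


def find_ms(file_tokens, ms_tokens, mts_size):
    """Return the token indices where the MSs starts, found by chaining MTSU."""
    if len(ms_tokens) < mts_size:
        raise ValueError("an MSs must be at least as long as an MTS")
    file_units = mtsu_sequence(file_tokens, mts_size)
    ms_units = mtsu_sequence(ms_tokens, mts_size)
    hits = []
    for start in range(len(file_units) - len(ms_units) + 1):
        # A run of consecutive matching MTSU reconstructs the whole MSs.
        if file_units[start:start + len(ms_units)] == ms_units:
            hits.append(start)
    return hits
```

With an MTS size of 3, a 5-token MSs corresponds to 3 consecutive MTSU, matching the example in the text.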
Hash Function
-
Definition: A function that converts an MTS into a fixed-size value (an MTSU).
-
Description:
- Ensures consistent, quick comparisons.
- Takes a token subsequence rather than a single token as input, which gives more identification power.
File and Tokenized File
File
-
Definition: A code and/or text container.
-
Description:
- Consists of a sequence of original tokens (raw source code, natural language text).
- Only the Reporter (viewer) accesses the file directly.
- In the Indexing and Comparing phases, the system uses Tokenized Files.
-
Characteristics:
- Typically stored in a filesystem or code repository.
- Source Files are stored in Archiver.
- Can be part of either a Source Dataset or a Target Dataset.
Tokenized File
-
Definition: A file after it has been converted into a sequence of tokens.
-
Description:
- Produced by a tokenizer from the raw file.
- Consists of entries of the form:
(**Token Index**, **MTSU**, separated token #1, separated token #2, ...).
Token Index
- Definition: The token's position in the token sequence of the file.
Partitioning
- Definition: Dividing a data structure by MTSU to optimize searching.
- Applies to the Candidate Index, File Archive, etc.
Query
Query, Query Result
!!TODO: Clearly define Candidate Query Request, and Query Result
-
Definition:
- Query: The act of searching the Source for subsequences in common with the Target.
- Query Request: Specific search parameters used to locate relevant subsequences in Source files.
- Query Result: The outcome of a Query, typically pairing Target files with matching Source files.
-
Description:
- Involves scanning across the entire Target dataset to find potential matches in the Source.
- Candidate Queries help identify which Source files might contain matching subsequences.
-
Types:
- Exact Query: Finds subsequences that match exactly between Source and Target.
- Similar Query: Finds similar subsequences, given a user-defined similarity threshold.
-
Result:
- Pairs each Source file with the relevant Target file.
- Includes index ranges (start/end) of matching subsequences for both files.
- By Token Index
- By Line Number
Candidate Query
!!TODO: Clearly define Candidate Query and Query
-
Description:
- Operates over the entire Source dataset.
- Allows false positives but no false negatives (i.e., it might over-include but never miss a true match).
- It does not directly support similar queries (though smaller MTS sizes can approximate similarity).
Batch Query
-
Definition: A method for processing multiple queries to improve efficiency.
-
Characteristics:
- Reduces repeated work by handling similar queries in a single pass.
- Constructed from a list of MTS.
- The link between Source ID and MTS is not sent to the Nominator, but is stored in the Querier.
-
Example:
- Target Object A's Token Sequence: A B C D E F G
- A's MTS: A B C, B C D, C D E, D E F, E F G
- Target Object B's Token Sequence: A B C E F G
- B's MTS: A B C, B C E, C E F, E F G
- Batch Query: A B C, B C D, C D E, D E F, E F G, B C E, C E F
- MTS common to several Target Objects (e.g., A B C and E F G) are not duplicated in the Batch Query.
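The Batch Query example can be reproduced with a short sketch: collect the MTS of every Target Object, keep each distinct MTS once, and remember locally which objects it came from. The function names and the `owners` map are hypothetical; the map stands in for the link that the text says is kept in the Querier rather than sent to the Nominator.

```python
def mts_windows(tokens, size):
    """Return every fixed-size MTS (as a tuple) in order of appearance."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def build_batch_query(targets, size):
    """targets: dict of object name -> token list. Returns (batch, owners)."""
    batch = []   # distinct MTS in first-seen order; this is what gets sent
    owners = {}  # MTS -> names of the Target Objects containing it (kept locally)
    for name, tokens in targets.items():
        for mts in mts_windows(tokens, size):
            if mts not in owners:
                owners[mts] = []
                batch.append(mts)
            if name not in owners[mts]:
                owners[mts].append(name)
    return batch, owners


targets = {
    "A": "A B C D E F G".split(),
    "B": "A B C E F G".split(),
}
batch, owners = build_batch_query(targets, 3)
```

Here the two objects contain 5 and 4 MTS respectively, but the batch holds only 7 entries because `A B C` and `E F G` are shared.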
Preprocess Phase
Preprocessor
-
Definition: A component that converts a File to a Tokenized File and extracts metadata.
-
Description:
- Extracts metadata from the file.
- Includes tokenizing (Tokenizer).
Tokenizer
-
Definition: A component that converts a File into a sequence of tokens and produces a Tokenized File.
File Archiver
-
Definition: Retrieves the content of a Source File and/or Source Tokenized File by file ID.
-
Description:
- Archives Source Files, Source Tokenized Files, and their metadata.
Index Phase
Candidate Index (The Index)
-
Definition: A data structure holding the candidate lists of Source files that a Candidate Query returns.
-
Description:
- Used to respond to Candidate Queries quickly.
- It acts as a high-level map from MTSU to Source files.
- Is immutable after the indexing phase.
- Nominator does not modify The Index.
- Is organized (sorted) by MTSU to speed up lookups.
- Partitioning is applied to The Index for efficient searching.
-
Types:
- Raw Candidate Index
- Is not optimized or compressed.
- May use RocksDB or LevelDB.
- Candidate Index, The Index
- Is optimized and compressed.
- Nominator uses this index.
Indexer (Source File Indexer)
-
Definition: Builds the Index from the Source datasets.
-
Description:
- Converts Source Files into Tokenized Files.
- Generates MTSU entries and relevant metadata for the Index.
-
Workflow:
- For each MTSU in a Tokenized File, record the file ID under that MTSU in the Candidate Index.
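The Indexer workflow can be sketched as building an inverted index from MTSU to Source file IDs. This is a minimal in-memory illustration: the function names, the hash, and the use of a plain dictionary are assumptions, whereas the text mentions that the Raw Candidate Index may use RocksDB or LevelDB.

```python
import hashlib
from collections import defaultdict


def mtsu(mts):
    """Hash one MTS into a 64-bit integer MTSU (hash choice is an assumption)."""
    data = "\x1f".join(mts).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def build_raw_index(tokenized_files, mts_size):
    """tokenized_files: dict of file ID -> token list.

    Returns a dict of MTSU -> sorted list of file IDs, with keys in MTSU order.
    """
    postings = defaultdict(set)
    for file_id, tokens in tokenized_files.items():
        for i in range(len(tokens) - mts_size + 1):
            postings[mtsu(tokens[i:i + mts_size])].add(file_id)
    # Sort by MTSU so lookups and partitioning by key range are straightforward.
    return {key: sorted(ids) for key, ids in sorted(postings.items())}


source = {
    1: "A B C D".split(),
    2: "B C D E".split(),
    3: "X Y Z".split(),
}
index = build_raw_index(source, 3)
```

Each value is a sorted list of file IDs, which is the shape a compression step over the Key-Value pairs would typically expect.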
Index Compressor
-
Definition: Extracts Key-Value pairs and compresses them to reduce the size of the Index and improve performance.
-
Description:
-
Workflow:
- Extract Key-Value pairs from Candidate Index.
- Compress Key-Value pairs by Elias-Fano Encoding.
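For readers unfamiliar with Elias-Fano encoding, the following is a minimal sketch of the technique applied to one sorted list of integers, such as a list of file IDs. Which part of the Key-Value pairs QOSI actually encodes is not stated here, so that application is an assumption, and bits are held in Python lists for readability rather than packed.

```python
import math


def ef_encode(values, universe):
    """Encode a sorted list of integers in [0, universe) with Elias-Fano.

    Returns (low_bits, lows, upper): the low part of each value stored
    verbatim, and the high parts stored as a unary-coded bit list.
    """
    n = len(values)
    if n == 0:
        return 0, [], []
    low_bits = math.floor(math.log2(universe / n)) if universe > n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    upper = [0] * (n + (values[-1] >> low_bits) + 1)
    for i, v in enumerate(values):
        # The i-th value sets the bit at position (high part + i).
        upper[(v >> low_bits) + i] = 1
    return low_bits, lows, upper


def ef_decode(low_bits, lows, upper):
    """Rebuild the original sorted list from an Elias-Fano encoding."""
    out = []
    high = 0  # number of zero bits seen so far equals the current high part
    for bit in upper:
        if bit:
            out.append((high << low_bits) | lows[len(out)])
        else:
            high += 1
    return out


ids = [2, 3, 5, 7, 11, 13, 24]
encoded = ef_encode(ids, 32)
```

The encoding needs roughly `2 + log2(universe / n)` bits per element, which is why it suits long, sorted ID lists.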
Query Phase
Querier
-
Definition: Generates and executes queries based on Target Files.
-
Description:
- The Target must pass through Preprocessing before reaching the Querier.
- Produces Query Requests that list MTSU and metadata from the Target Object.
- Generates a Batch Query.
- May group or deduplicate MTSU to optimize further searching.
- Stores the MTSU and Source ID map.
Nominator
-
Definition: Identifies candidate files during query operations.
-
Description:
- Takes a Query Request and looks up possible file matches from the Index.
- Returns a list of Source file IDs that contain matching MTSU.
- Key: MTSU
- Value: Source file ID list (by Bitmap)
Comparison Phase
Merger
-
Definition: Consolidates and refines the final list of candidate files after a candidate query from the Nominator.
-
Description:
- Produces the final candidate file set for the Query.
- Combines results across multiple MTSU lookups for each Target File.
- Uses a Bitmap to merge (OR operation) the candidate file lists.
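The Nominator lookup and the Merger's OR operation can be sketched together. Python integers stand in for Bitmaps here (bit `i` set means file ID `i` is a candidate); the function names and this representation are assumptions for illustration, not the system's actual Bitmap format.

```python
def to_bitmap(file_ids):
    """Pack file IDs into an integer bitmap (bit i set means file i is present)."""
    bitmap = 0
    for fid in file_ids:
        bitmap |= 1 << fid
    return bitmap


def from_bitmap(bitmap):
    """Unpack an integer bitmap into a sorted list of file IDs."""
    ids = []
    fid = 0
    while bitmap:
        if bitmap & 1:
            ids.append(fid)
        bitmap >>= 1
        fid += 1
    return ids


def nominate(index, query_units):
    """Nominator: look up each MTSU; an unknown MTSU yields no candidates."""
    return {u: index[u] for u in query_units if u in index}


def merge(nominated):
    """Merger: OR all per-MTSU bitmaps into one candidate set."""
    merged = 0
    for bitmap in nominated.values():
        merged |= bitmap
    return merged


index = {10: to_bitmap([1, 2]), 20: to_bitmap([2, 5]), 30: to_bitmap([7])}
candidates = from_bitmap(merge(nominate(index, [10, 20, 99])))
```

Because the merge is a union, any file sharing at least one MTSU with the Target stays in the candidate set, which is consistent with candidates allowing false positives but no false negatives.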
File Extractor
Comparator
Evaluator
Reporter
Partitioning
Language Family
-
Definition: A grouping of languages based on the similarity of their actual syntax or representation.
-
Description:
- Languages in the same family may share tokenization patterns.
- A single language can belong to multiple families if it exhibits shared features with different groups.
- Helps partition Source data for more efficient indexing and searching.
Parameter
Parameter
-
Definition: A user-defined setting that customizes query behavior.
-
Types:
- Similarity Rate: Sets the degree of overlap needed for matches in a Similar Query.
- MTSU Size: The smallest hashable subsequence size.
- Minimum Search Unit Size: The smallest token sequence considered for matching.
-
Description:
- Affects how queries are performed and how results are filtered.
- Allows for tuning accuracy versus performance.
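As an illustration of how these parameters might be grouped and validated, here is a minimal sketch. The class name, field names, defaults, and the choice of a 0 to 1 scale for the Similarity Rate are all assumptions; the only rule taken from this glossary is that an MSs must be at least as long as an MTS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    """Hypothetical container for user-defined query settings."""

    similarity_rate: float = 1.0      # assumed scale: 1.0 means an exact match
    mts_size: int = 3                 # tokens per MTS (the hashed unit)
    min_search_unit_size: int = 5     # shortest token sequence reported as a match

    def __post_init__(self):
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
        if self.mts_size < 1:
            raise ValueError("mts_size must be at least 1")
        if self.min_search_unit_size < self.mts_size:
            # An MSs is built by connecting MTS, so it cannot be shorter.
            raise ValueError("min_search_unit_size must be >= mts_size")


params = QueryParameters(similarity_rate=0.8, mts_size=3, min_search_unit_size=5)
```

A smaller `mts_size` makes more subsequences hashable and so finds more candidates at a higher cost, which is the accuracy versus performance trade-off described above.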