Glossary
Basic Descriptors
Object
Definition: A single code and/or text container.
Description:
- Consists of a sequence of original tokens (raw source code).
- The Indexing, Candidating, and Comparing phases do not read the file directly. Only the viewer interface reads the file directly.
Target
Definition: User-provided objects intended for analysis.
Characteristics:
Workflow:
Source
Definition: Objects to match against the Target.
Description:
- Acts as the reference pool for matching against the Target.
- Mainly OSS (Open Source Software) or other public datasets.
Candidate
Definition: Potential objects that share a common subsequence with the Target.
Description:
Characteristics:
Identified
Definition: The subset of Source (or Target) objects that contain a common subsequence with the Target (or Source).
Description:
Index
Definition: Metadata that identifies the position of an item within ordered data, similar to an ID.
Description:
Immutable
Definition: Not changed during the process.
Description:
Dataset
Dataset
Definition: An aggregated collection of objects.
Characteristics:
- Each Dataset may contain numerous objects.
Source Dataset
- Interchangeably referred to simply as "Source."
Target Dataset
- Interchangeably referred to simply as "Target."
Token and Sequence
Token
Definition: The smallest unit of data derived from the content of a file.
Description:
- Generated through tokenization: splitting code or text into discrete elements.
- Varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.
Details:
- Western Languages: Split by spaces (plus punctuation considerations).
- Chinese: Every character is treated as a token.
- Korean: Split by spaces; postpositional particles and suffixes are tokenized separately.
- Japanese: Uses morphological analysis to segment words based on grammar and context.
- Special Cases: Korean and Japanese require specialized tokenizers due to their morphological complexity.
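The two simplest tokenizing rules above can be sketched as follows. This is a minimal illustration, not the project's tokenizer: the function names are hypothetical, and "punctuation considerations" is assumed here to mean that each punctuation mark becomes its own token. Korean and Japanese are left out because they need a dedicated morphological analyzer.

```python
import re

# Hypothetical helpers for illustration; the glossary does not prescribe an implementation.
# Assumption: a run of word characters is one token, each punctuation mark is one token.
_WESTERN_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_western(text: str) -> list[str]:
    """Split on whitespace and emit every punctuation mark as a separate token."""
    return _WESTERN_TOKEN.findall(text)


def tokenize_chinese(text: str) -> list[str]:
    """Treat every non-whitespace character as a token."""
    return [ch for ch in text if not ch.isspace()]
```

For example, `tokenize_western("int x = 1;")` yields the five tokens `int`, `x`, `=`, `1`, and `;`.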
Syntax Token
String Token
Sequence and Subsequence
Definition:
- Sequence: An ordered list of tokens derived from a file or segment.
- Subsequence: A contiguous subset of tokens within a sequence.
Description:
Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)
Definition:
- MTS: A fixed-size subsequence of tokens from the original sequence.
- MTSU: The hashed version of an MTS.
Characteristics:
- An MTSU is generated by hashing an MTS.
- Because an MTS combines multiple tokens, it reduces the chance of random duplication compared to single-token matching.
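The relationship between a token sequence, its MTSs, and their MTSUs can be sketched as below. The function names, the separator byte, and the choice of BLAKE2b with an 8-byte digest are assumptions made for this illustration; the glossary does not name a hash function or digest size.

```python
import hashlib


def mts_windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens (one MTS per window)."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts: tuple[str, ...]) -> str:
    """Hash an MTS into a fixed-size value (assumed: 8-byte BLAKE2b, hex-encoded)."""
    # Join with a separator that is unlikely to occur in tokens, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input bytes.
    data = "\x1f".join(mts).encode("utf-8")
    return hashlib.blake2b(data, digest_size=8).hexdigest()
```

With an MTS size of 3, the sequence `A B C D E` produces the three windows `A B C`, `B C D`, and `C D E`, each of which maps to one 16-character hexadecimal MTSU.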
Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))
Definition: A subsequence whose minimum size (length) is requested by the user.
Details:
Example:
For the tokens A B C D E in a file, the Comparator finds the MTS A B C, B C D, and C D E by MTSU and matches them to the MSs A B C D E.
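One way to picture the example is as merging consecutive MTS hits into longer token ranges and then discarding ranges shorter than the user's minimum. The sketch below is a simplification under stated assumptions: the function name is hypothetical, the hits are start indices within a single file, and only hits at directly adjacent positions are merged. Alignment against the matching file is ignored.

```python
def merge_mts_hits(starts: list[int], mts_size: int, min_len: int) -> list[tuple[int, int]]:
    """Merge consecutive MTS start indices into (start, end) token ranges.

    `end` is exclusive. Ranges shorter than `min_len` tokens are dropped.
    """
    ranges = []
    run_start = prev = None
    for s in sorted(set(starts)):
        if prev is not None and s == prev + 1:
            prev = s  # the run continues with the next adjacent window
            continue
        if prev is not None:
            ranges.append((run_start, prev + mts_size))
        run_start = prev = s  # a new run begins here
    if prev is not None:
        ranges.append((run_start, prev + mts_size))
    return [(a, b) for a, b in ranges if b - a >= min_len]
```

With an MTS size of 3, hits at positions 0, 1, and 2 merge into the range (0, 5), which covers the five tokens A B C D E and therefore satisfies a minimum length of 5.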
Hash Function
Definition: A function that converts an MTS into a fixed-size value (an MTSU).
Description:
- Ensures consistent, quick comparisons.
File and Tokenized File
File
Definition: A code and/or text container.
Description:
- Consists of a sequence of original tokens (raw source code, natural language text).
Characteristics:
Tokenized File
Definition: A file after it has been converted into a sequence of tokens.
Description:
- Each entry has the form (**Token Index**, **MTSU**, separated token #1, separated token #2, ...).
Token Index
Partitioning
Query
Query, Query Result
!!TODO: Clearly define Candidate Query Request and Query Result
Definition:
- Query: The act of searching the Source for subsequences in common with the Target.
Description:
- Involves scanning across the entire Target dataset to find potential matches in the Source.
Result:
- Pairs each Source file with the relevant Target file.
- Includes index ranges (start/end) of matching subsequences for both files.
- Ranges are expressed by Token Index.
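A Query Result as described above could be modeled with two small record types. This is only a sketch: the class and field names are hypothetical, and treating `end` as exclusive is an assumption, since the glossary does not say whether the end index is inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    """A start/end pair of Token Index values (assumed: end is exclusive)."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


@dataclass(frozen=True)
class QueryResult:
    """One match: a Source file paired with a Target file, plus both ranges."""
    source_file_id: str
    target_file_id: str
    source_range: TokenRange
    target_range: TokenRange
```

A match of five tokens would then be written as `QueryResult("s1", "t1", TokenRange(10, 15), TokenRange(0, 5))`, where the two file IDs are placeholders.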
Candidate Query
!!TODO: Clearly define Candidate Query and Query
Description:
- Operates over the entire Source dataset.
- Allows false positives but no false negatives (i.e., it may over-include but never misses a true match).
- Does not directly support similar queries (though smaller MTS sizes can approximate similarity).
Batch Query
Definition: A method for processing multiple queries to improve efficiency.
Characteristics:
Example:
Target Object A's Token Sequence: A B C D E F G
A B C
B C D
C D E
D E F
E F G
Target Object B's Token Sequence: A B C E F G
A B C
B C E
C E F
E F G
Batch Query: A B C, B C D, C D E, D E F, E F G, B C E, C E F
(The MTSs shared by both objects, A B C and E F G, are not duplicated in the Batch Query.)
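The deduplication in this example can be sketched as collecting the MTS windows of every Target object and keeping each distinct window once. The function names are hypothetical, and keeping windows in first-seen order is a choice made for readability, not something the glossary requires.

```python
def mts_windows(tokens: list[str], size: int) -> list[tuple[str, ...]]:
    """Return every contiguous window of `size` tokens."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def batch_query(sequences: list[list[str]], size: int) -> list[tuple[str, ...]]:
    """Collect the MTS windows of all sequences, keeping each distinct window once."""
    unique = {}  # dict preserves insertion order, so first-seen order is kept
    for tokens in sequences:
        for window in mts_windows(tokens, size):
            unique.setdefault(window, None)
    return list(unique)
```

For the two token sequences in the example and an MTS size of 3, the nine windows collapse to seven distinct queries, because `A B C` and `E F G` occur in both objects.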
Preprocess Phase
Preprocessor
Definition: A component that converts a File to a Tokenized File and extracts metadata.
Description:
- Extracts metadata from the file.
- Includes tokenizing (Tokenizer).
Tokenizer
Definition: A component that converts a File into a sequence of tokens and produces a Tokenized File.
File Archiver
Definition: Retrieves a Source File and/or Source Tokenized File’s content by the file ID.
Description:
- Archives Source Files and Source Tokenized Files, along with their metadata.
Index Phase
Candidate Index (The Index)
Definition: A data structure built from Source files to facilitate queries.
Description:
- Used to respond to queries.
Partitioning
File Archiver
Indexer
Types:
Indexer (Source File Indexer)
Definition: Builds the Index from the Source datasets.
Description:
Workflow
Index Compressor
Definition: Extracts key-value pairs and compresses them to reduce the size of the Index and improve performance.
Description:
Workflow:
Query Phase
Querier
Definition: Generates and executes queries based on Target Files.
Description:
- Before the Querier runs, the Target must pass through Preprocessing.
Nominator
Definition: Identifies candidate files during query operations.
Description:
- Key: MTSU
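A lookup keyed by MTSU can be sketched as an inverted index from each MTSU to the Source files that contain it. Everything below is an assumption for illustration: the function names, the in-memory dictionary, and the BLAKE2b hash are not taken from the glossary. Equal MTSs always hash to the same MTSU, so no true match is missed, while a hash collision can nominate a file that does not actually match.

```python
import hashlib
from collections import defaultdict


def mts_windows(tokens, size):
    """Return every contiguous window of `size` tokens."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def mtsu(mts):
    """Hash an MTS into a fixed-size key (assumed: 8-byte BLAKE2b, hex-encoded)."""
    return hashlib.blake2b("\x1f".join(mts).encode("utf-8"), digest_size=8).hexdigest()


def build_index(source_files, size):
    """Map each MTSU to the set of Source file IDs that contain it."""
    index = defaultdict(set)
    for file_id, tokens in source_files.items():
        for window in mts_windows(tokens, size):
            index[mtsu(window)].add(file_id)
    return index


def nominate(index, target_tokens, size):
    """Return the Source file IDs that share at least one MTSU with the Target."""
    candidates = set()
    for window in mts_windows(target_tokens, size):
        candidates |= index.get(mtsu(window), set())
    return candidates
```

Given Source files `{"s1": ["A", "B", "C", "D"], "s2": ["X", "Y", "Z"]}` and an MTS size of 3, a Target with tokens `Q A B C` nominates only `s1`.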
Comparison Phase
Merger
Definition: Consolidates and refines the final list of candidate files.
Description:
- Merges the candidate file list based on the query results.
File Extractor
Comparator
Evaluator
Reporter
Partitioning
Language Family
Language Family
Definition: A grouping of programming languages based on similar syntax or representation.
Description:
- Languages in the same family may share identical tokenization patterns.
- A single language can belong to multiple families if it exhibits shared features with different groups.
- Helps partition Source data for more efficient indexing and searching.
Parameter
Parameter
Definition: A user-defined setting that customizes query behavior.
Types:
- Similarity Rate: Sets the degree of overlap needed for matches in a Similar Query.
- MTSU Size: The smallest hashable subsequence size.
- Minimum Search Unit Size: The smallest token sequence considered for matching.
Description:
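The three parameters could be grouped into one validated settings object, as in the sketch below. The class name, field names, default values, and validation rules are all assumptions: the defaults of 3 and 5 simply mirror the glossary's own A B C D E example, and the rule that the Minimum Search Unit Size must not be smaller than the MTSU Size is inferred from that example rather than stated anywhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParameters:
    """Hypothetical container for the user-defined query settings."""
    similarity_rate: float = 1.0      # assumed range: (0, 1], where 1.0 means exact
    mtsu_size: int = 3                # tokens per hashed MTS
    min_search_unit_size: int = 5     # shortest token sequence worth reporting

    def __post_init__(self):
        if not 0.0 < self.similarity_rate <= 1.0:
            raise ValueError("similarity_rate must be in (0, 1]")
        if self.mtsu_size < 1:
            raise ValueError("mtsu_size must be at least 1")
        # Assumption: a match shorter than one MTS could never be found by MTSU lookup.
        if self.min_search_unit_size < self.mtsu_size:
            raise ValueError("min_search_unit_size must be >= mtsu_size")
```

Constructing `QueryParameters(mtsu_size=6)` with the other defaults raises `ValueError`, because a 5-token minimum cannot be detected with 6-token windows.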