Glossary

Basic Descriptors

Object

Definition: Any unit that can be processed or analyzed by the QOSI system.

Types:
- File, File ID
- Offset, Offset Index
- Tokenized File, Tokenized File ID
- Candidate Object List, Candidate Object Bitmap
- MTS, MTSU
- Bitmap
- TODO:
- etc.

Target

Definition: User-provided objects intended for analysis.

Description:
- Consists of objects submitted by the user.
- Always remains separate and is never merged into the Source.

Characteristics:
- Relatively smaller than the Source.

Workflow:
- Is subject to immediate tokenization and/or indexing once received.

Source

Definition: Objects to match against the Target.

Description:

Objects maintained by the service provider.

The main data collection (objects) for queries and comparisons.

Characteristics:

Acts as the reference pool for matching against the Target.

Mainly OSS (Open Source Software) or other public datasets.

It is updated infrequently, e.g., weekly or monthly.

TODO: Move to Indexing.

Candidate Indexing duration has no significant impact on data structure design.

Candidate

Definition: Source objects that potentially share a common subsequence with the Target.

Description:

Identified during the Candidate Query phase.

Subset of the Source.

Characteristics:

May include false positives but never false negatives.

Used to quickly identify potential matches in the Source.

Currently, every candidate includes an MTS match, but may not include an MSs match.
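To illustrate why candidates can include false positives but never false negatives, here is a minimal sketch. It assumes the Candidate Index is an inverted map from MTSU to Source file IDs; the function names and the data layout are hypothetical and not taken from the QOSI implementation.

```python
from collections import defaultdict


def build_candidate_index(source_files):
    """Map each MTSU to the set of Source file IDs that contain it.

    `source_files` is a hypothetical dict: file ID -> iterable of MTSU values.
    """
    index = defaultdict(set)
    for file_id, mtsus in source_files.items():
        for mtsu in mtsus:
            index[mtsu].add(file_id)
    return index


def candidate_query(index, target_mtsus):
    """Return every Source file that shares at least one MTSU with the Target."""
    candidates = set()
    for mtsu in target_mtsus:
        candidates |= index.get(mtsu, set())
    return candidates


source = {"s1": {1, 2, 3}, "s2": {4, 5}, "s3": {9}}
index = build_candidate_index(source)
candidates = candidate_query(index, {3, 4, 7})  # s1 and s2 share an MTSU
```

Under this assumption, any Source file that shares an MSs with the Target also shares every MTS inside that MSs, so it is always returned (no false negatives). A file that shares only a single MTS, or whose MTSU merely collides, is returned as well (a false positive) and is filtered out later in the Comparing phase.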

Identified

Definition: The subset of Source/Target objects that contain a common subsequence with Target/Source.

Description:

Identified during the Comparing phase.

Identified Source: Source objects that contain a common subsequence with the Target.

Identified Target: Target objects that contain a common subsequence with the Source.

Index

Definition: Metadata that identifies the order of the data. Similar to an ID.

Description:

Many objects (e.g., File, Tokenized File, Token) have their own Index.

Immutable

Definition: Does not change during the process.

Description:

The Index is immutable after the indexing phase (when the index is created).

Dataset

Dataset

Definition: An aggregated collection of objects.

Characteristics:

Each Dataset may contain numerous objects.

A Dataset may contain objects of a single type or of different types.

Content examples: source code, text, etc.

Object type examples: File, Tokenized File, Candidate Object List.

Source Dataset

Definition: The comprehensive set of objects maintained as the Source.

Usage:

Interchangeably referred to simply as "Source."

Target Dataset

Definition: The complete set of objects the user provides.

Usage:

Interchangeably referred to simply as "Target."

Token and Sequence

Token

Definition: The smallest unit of data derived from the content of a file.

Description:

Generated through tokenization—splitting code or text into discrete elements.

Varies by language; e.g., whitespace-based for Western languages, morphological analysis for Japanese.

Characteristics:

Enables fine-grained comparison between Source and Target files.

Significantly reduces duplication compared to entire lines or blocks of text.

Details:

Code Data: Tokenized according to code syntax rules.

Text Data (Natural Language):

Western Languages: Split by spaces (plus punctuation considerations).

Chinese: Every character is treated as a token.

Korean: Split by spaces; postpositional particles and suffixes are tokenized separately.

Japanese: Uses morphological analysis to segment words based on grammar and context.

Special Cases: Korean/Japanese require specialized tokenizers due to morphological complexity.
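The two simplest tokenizing rules above can be sketched as follows. This is only an illustration: the function names are hypothetical, the punctuation handling is one possible reading of "punctuation considerations", and Korean and Japanese are omitted because they need a dedicated morphological analyzer.

```python
import re


def tokenize_whitespace(text: str) -> list[str]:
    """Split on whitespace; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_per_character(text: str) -> list[str]:
    """Treat every non-whitespace character as one token (Chinese-style rule)."""
    return [ch for ch in text if not ch.isspace()]


western_tokens = tokenize_whitespace("Hello, world!")
chinese_tokens = tokenize_per_character("你好 世界")
```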

Syntax Token

Definition: A token that represents its syntactic role rather than its original string.

String Token

Definition: A token that reflects its original string.

Sequence and Subsequence

Definition:

Sequence: An ordered list of tokens derived from a file or segment.

Subsequence: A contiguous subset of tokens within a sequence.

Description:

Used for matching and comparison processes (e.g., detecting shared code fragments).

Minimum Token Sequence (MTS) and Minimum Token Sequence Unit (MTSU)

Definition:

MTS: A fixed-size subsequence of tokens from the original sequence.

MTSU: The hashed version of an MTS, serving as a fundamental unit for indexing, candidating, and comparing.

Characteristics:
- MTSU is generated by hashing an MTS.
- Because an MTS is multiple tokens combined, it reduces the chance of random duplication compared to single-token matching.
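The relationship between a token sequence, its MTS windows, and their MTSUs can be sketched as below. The window size of 3, the separator byte, and the choice of BLAKE2b with an 8-byte digest are assumptions made for illustration; the glossary does not specify the actual hash function or MTS size.

```python
import hashlib

MTS_SIZE = 3  # hypothetical fixed MTS length


def iter_mts(tokens, size=MTS_SIZE):
    """Yield (token index, MTS) for every fixed-size window of the sequence."""
    for i in range(len(tokens) - size + 1):
        yield i, tuple(tokens[i:i + size])


def mtsu(mts) -> int:
    """Hash an MTS into a fixed-size 64-bit integer (the MTSU)."""
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    data = "\x1f".join(mts).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")


tokens = ["A", "B", "C", "D", "E"]
windows = list(iter_mts(tokens))
units = [(index, mtsu(window)) for index, window in windows]
```

A sequence shorter than `MTS_SIZE` yields no windows at all, so it produces no MTSU under this sketch.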

Minimum Subsequence (MSs) (or Minimum Token Subsequence (MTSs))

Definition: A subsequence of the minimum size (length) requested by the user.

Details:
- An MSs must be equal to or longer than an MTS.
- During comparison, the Comparator connects overlapping MTS (MTSU) matches to construct an MSs.

Example:
- MTS: 3 tokens
- MSs: 5 tokens
- To find the MSs A B C D E in a file, the Comparator finds the MTS A B C, B C D, and C D E by their MTSUs and connects them into the MSs A B C D E.
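The example above can be sketched as a small merging step. This is a simplified illustration, not the Comparator's actual algorithm: it assumes the matched MTS windows are given as start positions (Token Index) within one file and that they are already aligned with the other file. The function name is hypothetical.

```python
def connect_mts_matches(match_starts, mts_size, ms_size):
    """Merge consecutive MTS match positions into token ranges.

    Returns (start, end) pairs with an exclusive end, keeping only ranges
    that are at least `ms_size` tokens long.
    """
    ranges = []
    run_start = prev = None
    for start in sorted(set(match_starts)):
        if prev is not None and start == prev + 1:
            prev = start  # window overlaps the previous one: extend the run
            continue
        if run_start is not None:
            ranges.append((run_start, prev + mts_size))
        run_start = prev = start
    if run_start is not None:
        ranges.append((run_start, prev + mts_size))
    return [(s, e) for s, e in ranges if e - s >= ms_size]


# MTS A B C, B C D, C D E start at token indices 0, 1, 2.
found = connect_mts_matches([0, 1, 2], mts_size=3, ms_size=5)
```

Three consecutive 3-token windows cover 5 tokens, which meets the requested MSs length; two consecutive windows would cover only 4 tokens and would be dropped.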

Hash Function

Definition: A function that converts an MTS into a fixed-size value (an MTSU).
Description:
- Ensures consistent, quick comparisons.
- Takes a token subsequence rather than a single token as input, which gives greater identification power.

File and Tokenized File

File

Definition: A code and/or text container.

Description:
- Consists of a sequence of original tokens (raw source code, natural language text).
- Only the Reporter (viewer) accesses the file directly.
  - In the Indexing and Comparing phases, the system uses Tokenized Files.
Characteristics:
- Typically stored in a filesystem or code repository.
  - Source Files are stored in Archiver.
- Can be part of either a Source Dataset or a Target Dataset.

Tokenized File

Definition: A file after it has been converted into a sequence of tokens.

Description:
- Produced by a tokenizer from the raw file.
- Consists of records of the form (**Token Index**, **MTSU**, separated token #1, separated token #2, ...).

Token Index

Definition: The token's position in the token sequence of the file.

Partitioning

Definition: Dividing data by MTSU to optimize searching.
- Applies to the Candidate Index, File Archive, etc.
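One way to picture partitioning by MTSU is shown below. The modulo rule, the partition count, and all names are assumptions chosen for illustration; the glossary does not state how QOSI actually maps an MTSU to a partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_id(mtsu: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a partition from the MTSU value (assumed rule: modulo)."""
    return mtsu % num_partitions


def build_partitioned_index(postings):
    """Group (MTSU, file ID) postings into per-partition inverted maps."""
    partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]
    for mtsu, file_id in postings:
        partitions[partition_id(mtsu)][mtsu].add(file_id)
    return partitions


def lookup(partitions, mtsu):
    """Search only the one partition that can contain this MTSU."""
    return partitions[partition_id(mtsu)].get(mtsu, set())


postings = [(10, "a"), (14, "b"), (10, "c"), (7, "d")]
partitions = build_partitioned_index(postings)
```

With such a scheme, a lookup touches a single partition instead of the whole index, which is the search optimization the definition refers to.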

Query

Query, Query Request, and Query Result

Definition:
- Query: The act of searching the Source for common subsequences found in the Target.
- Query Request: Specific search parameters used to locate relevant subsequences in Source files.
- Query Result: The outcome of a Query, typically pairing Target files with matching Source files.

Description:
- Involves scanning across the entire Target dataset to find potential matches in the Source.
- Candidate Queries help identify which Source files might contain matching subsequences.

Types:
- Exact Query: Finds subsequences that match exactly between Source and Target.
- Similar Query: Finds similar subsequences, given a user-defined similarity threshold.

Result:
- Pairs each Source file with the relevant Target file.
- Includes index ranges (start/end) of matching subsequences for both files.
  - By Token Index
  - By Line Number

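The shape of a Query Result described above could be modeled roughly as follows. All class and field names are hypothetical, and treating the end index as inclusive is an assumption; the glossary only states that start/end ranges are reported by Token Index and by Line Number for both files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    start: int
    end: int  # assumed inclusive


@dataclass(frozen=True)
class MatchRange:
    """One matching subsequence, located in both files."""
    source_tokens: Span  # by Token Index
    target_tokens: Span
    source_lines: Span   # by Line Number
    target_lines: Span


@dataclass
class QueryResult:
    """Pairs one Source file with one Target file and their matching ranges."""
    source_file_id: str
    target_file_id: str
    matches: list[MatchRange] = field(default_factory=list)


result = QueryResult("src-001", "tgt-042")
result.matches.append(
    MatchRange(Span(120, 168), Span(15, 63), Span(10, 14), Span(2, 6))
)
```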
Candidate Query

