API Reference ¶

class py_pdf_term.PyPDFTermSingleDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: SingleDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: SingleDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶

Bases: object

Top-level class of py-pdf-term. This class extracts technical terms from a PDF file without cross-domain information.

Parameters:

xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.bin_opener to a function to open an input PDF file in binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split overly long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, sub-terms of it could also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling such as color, fontsize and so on. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The xml cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string specifying where cache files are stored, for example a path to a local directory, a URL, or a bucket name of a cloud storage service.

extract(pdf_path: str, domain_pdfs: DomainPDFList) → PDFTechnicalTermList¶

Extract technical terms from a PDF file.

Parameters:

pdf_path – Path-like string to the input PDF file from which terminology is extracted. The file MUST belong to the domain.
domain_pdfs – List of path-like strings to the PDF files which belong to a specific domain.

Returns:

Terminology list per page from the input PDF file.

Return type:

PDFTechnicalTermList

py_pdf_term.configs subpackage¶

class py_pdf_term.configs.CandidateLayerConfig(lang_tokenizers: list[str] = <factory>, token_classifiers: list[str] = <factory>, token_filters: list[str] = <factory>, term_filters: list[str] = <factory>, splitters: list[str] = <factory>, augmenters: list[str] = <factory>, cache: str = 'py_pdf_term.CandidateLayerFileCache')¶

Bases: BaseLayerConfig

Configuration for candidate layer.

Parameters:

lang_tokenizers (list[str]) – List of language tokenizer class names. The default tokenizers are “py_pdf_term.JapaneseTokenizer” and “py_pdf_term.EnglishTokenizer”.
token_classifiers (list[str]) – List of token classifier class names. The default classifiers are “py_pdf_term.JapaneseTokenClassifier” and “py_pdf_term.EnglishTokenClassifier”.
token_filters (list[str]) – List of token filter class names. The default filters are “py_pdf_term.JapaneseTokenFilter” and “py_pdf_term.EnglishTokenFilter”.
term_filters (list[str]) – List of term filter class names. The default filters are “py_pdf_term.JapaneseConcatenationFilter”, “py_pdf_term.EnglishConcatenationFilter”, “py_pdf_term.JapaneseSymbolLikeFilter”, “py_pdf_term.EnglishSymbolLikeFilter”, “py_pdf_term.JapaneseProperNounFilter”, “py_pdf_term.EnglishProperNounFilter”, “py_pdf_term.JapaneseNumericFilter”, and “py_pdf_term.EnglishNumericFilter”.
splitters (list[str]) – List of splitter class names. The default splitters are “py_pdf_term.SymbolNameSplitter” and “py_pdf_term.RepeatSplitter”.
augmenters (list[str]) – List of augmenter class names. The default augmenters are “py_pdf_term.JapaneseAugmenter” and “py_pdf_term.EnglishAugmenter”.
cache (str) – Cache class name. The default cache is “py_pdf_term.CandidateLayerFileCache”.

cache: str = 'py_pdf_term.CandidateLayerFileCache'¶

lang_tokenizers: list[str]¶

token_classifiers: list[str]¶

token_filters: list[str]¶

term_filters: list[str]¶

splitters: list[str]¶

augmenters: list[str]¶

class py_pdf_term.configs.MultiDomainMethodLayerConfig(method: str = 'py_pdf_term.TFIDFMethod', hyper_params: dict[str, ~typing.Any] = <factory>, ranking_cache: str = 'py_pdf_term.MethodLayerRankingFileCache', data_cache: str = 'py_pdf_term.MethodLayerDataFileCache')¶

Bases: BaseMethodLayerConfig

Configuration for a multi-domain method layer.

Parameters:

method (str) – Multi-domain method class name. The default method is “py_pdf_term.TFIDFMethod”.
hyper_params (dict[str, Any]) – Hyper parameters for the method. The default hyper parameters are empty.
ranking_cache (str) – Ranking cache class name. The default cache is “py_pdf_term.MethodLayerRankingFileCache”.
data_cache (str) – Data cache class name. The default cache is “py_pdf_term.MethodLayerDataFileCache”.

method: str = 'py_pdf_term.TFIDFMethod'¶

class py_pdf_term.configs.SingleDomainMethodLayerConfig(method: str = 'py_pdf_term.FLRHMethod', hyper_params: dict[str, ~typing.Any] = <factory>, ranking_cache: str = 'py_pdf_term.MethodLayerRankingFileCache', data_cache: str = 'py_pdf_term.MethodLayerDataFileCache')¶

Bases: BaseMethodLayerConfig

Configuration for a single-domain method layer.

Parameters:

method – Single-domain method class name. The default method is “py_pdf_term.FLRHMethod”.
hyper_params – Hyper parameters for the method. The default hyper parameters are empty.
ranking_cache – Ranking cache class name. The default cache is “py_pdf_term.MethodLayerRankingFileCache”.
data_cache – Data cache class name. The default cache is “py_pdf_term.MethodLayerDataFileCache”.

method: str = 'py_pdf_term.FLRHMethod'¶

class py_pdf_term.configs.StylingLayerConfig(styling_scores: list[str] = <factory>, cache: str = 'py_pdf_term.StylingLayerFileCache')¶

Bases: BaseLayerConfig

Configuration for a styling layer.

Parameters:

styling_scores (list[str]) – List of styling score class names. The default scores are “py_pdf_term.FontsizeScore” and “py_pdf_term.ColorScore”.
cache (str) – Cache class name. The default cache is “py_pdf_term.StylingLayerFileCache”.

cache: str = 'py_pdf_term.StylingLayerFileCache'¶

styling_scores: list[str]¶

class py_pdf_term.configs.TechnicalTermLayerConfig(max_num_terms: int = 10, acceptance_rate: float = 0.75)¶

Bases: BaseLayerConfig

Configuration for a technical term layer.

Parameters:

max_num_terms (int) – Maximum number of terms in a page of a PDF file to be extracted. The N-best candidates are extracted as technical terms. The default value is 10.
acceptance_rate (float) – Acceptance rate of the ranking method scores. The candidates whose ranking method scores are lower than the acceptance rate are filtered out even if they are in the N-best candidates. The default value is 0.75.

acceptance_rate: float = 0.75¶

max_num_terms: int = 10¶
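
The interplay of the two settings can be sketched as follows. This is a minimal illustration, not the library's implementation: `select_terms` is a hypothetical helper, and it assumes that `acceptance_rate` is compared directly against a score normalized to [0, 1], which the reference above does not confirm.

```python
def select_terms(scored_terms, max_num_terms=10, acceptance_rate=0.75):
    """Keep the N-best terms, then drop those scoring below the threshold.

    scored_terms: dict mapping a term string to a score
    (assumed here to lie in [0, 1]).
    """
    # Sort by score, highest first; ties keep insertion order (sorted is stable).
    ranked = sorted(scored_terms.items(), key=lambda item: item[1], reverse=True)
    n_best = ranked[:max_num_terms]
    # Even an N-best term is dropped when its score is below the threshold.
    return [term for term, score in n_best if score >= acceptance_rate]


scores = {"heap sort": 0.92, "merge sort": 0.81, "page": 0.40, "quick sort": 0.77}
print(select_terms(scores, max_num_terms=3))
```

Under these assumptions, raising `max_num_terms` alone never admits a low-scoring candidate such as "page", because the threshold is applied after the N-best cut.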

class py_pdf_term.configs.XMLLayerConfig(bin_opener: str = 'py_pdf_term.StandardBinaryOpener', include_pattern: str | None = None, exclude_pattern: str | None = None, nfc_norm: bool = True, cache: str = 'py_pdf_term.XMLLayerFileCache')¶

Bases: BaseLayerConfig

Configuration for an XML layer.

Parameters:

bin_opener (str) – Binary opener class name. The default opener is “py_pdf_term.StandardBinaryOpener”.
include_pattern (str | None) – Regular expression pattern of text to include in the output.
exclude_pattern (str | None) – Regular expression pattern of text to exclude from the output (overrides include_pattern).
nfc_norm (bool) – If True, normalize text to NFC, otherwise keep original.
cache (str) – Cache class name. The default cache is “py_pdf_term.XMLLayerFileCache”.

bin_opener: str = 'py_pdf_term.StandardBinaryOpener'¶

cache: str = 'py_pdf_term.XMLLayerFileCache'¶

exclude_pattern: str | None = None¶

include_pattern: str | None = None¶

nfc_norm: bool = True¶
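
The three text options above can be pictured with a small standard-library sketch. `filter_text` is a hypothetical helper, not part of py-pdf-term, and the use of `re.search` (match anywhere in the text) is an assumption; the reference does not say how the patterns are applied.

```python
import re
import unicodedata


def filter_text(text, include_pattern=None, exclude_pattern=None, nfc_norm=True):
    """Return the (optionally normalized) text, or None if it is filtered out."""
    if nfc_norm:
        # NFC composes sequences such as "e" + combining acute into one code point.
        text = unicodedata.normalize("NFC", text)
    # Exclusion is checked first so that it overrides inclusion.
    if exclude_pattern is not None and re.search(exclude_pattern, text):
        return None
    if include_pattern is not None and not re.search(include_pattern, text):
        return None
    return text


print(filter_text("Chapter one", include_pattern="Chapter"))
print(filter_text("Chapter 1", include_pattern="Chapter", exclude_pattern=r"\d"))
```

The second call shows the override: the text matches `include_pattern` but is still dropped because it also matches `exclude_pattern`.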

py_pdf_term.mappers subpackage¶

class py_pdf_term.mappers.AugmenterMapper¶

Bases: BaseMapper[type[BaseAugmenter]]

Mapper to find augmenter classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.BinaryOpenerMapper¶

Bases: BaseMapper[type[BaseBinaryOpener]]

Mapper to find binary opener classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.CandidateLayerCacheMapper¶

Bases: BaseMapper[type[BaseCandidateLayerCache]]

Mapper to find candidate layer cache classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.CandidateTermFilterMapper¶

Bases: BaseMapper[type[BaseCandidateTermFilter]]

Mapper to find candidate term filter classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.CandidateTokenFilterMapper¶

Bases: BaseMapper[type[BaseCandidateTokenFilter]]

Mapper to find candidate token filter classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.LanguageTokenizerMapper¶

Bases: BaseMapper[type[BaseLanguageTokenizer]]

Mapper to find language tokenizer classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.MethodLayerDataCacheMapper¶

Bases: BaseMapper[type[BaseMethodLayerDataCache[Any]]]

Mapper to find method layer data cache classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.MethodLayerRankingCacheMapper¶

Bases: BaseMapper[type[BaseMethodLayerRankingCache]]

Mapper to find method layer ranking cache classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.MultiDomainRankingMethodMapper¶

Bases: BaseMapper[type[BaseMultiDomainRankingMethod[Any]]]

Mapper to find multi-domain ranking method classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.SingleDomainRankingMethodMapper¶

Bases: BaseMapper[type[BaseSingleDomainRankingMethod[Any]]]

Mapper to find single-domain ranking method classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.SplitterMapper¶

Bases: BaseMapper[type[BaseSplitter]]

Mapper to find splitter classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.StylingLayerCacheMapper¶

Bases: BaseMapper[type[BaseStylingLayerCache]]

Mapper to find styling layer cache classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.StylingScoreMapper¶

Bases: BaseMapper[type[BaseStylingScore]]

Mapper to find styling score classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.TokenClassifierMapper¶

Bases: BaseMapper[type[BaseTokenClassifier]]

Mapper to find token classifier classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.

class py_pdf_term.mappers.XMLLayerCacheMapper¶

Bases: BaseMapper[type[BaseXMLLayerCache]]

Mapper to find XML layer cache classes.

classmethod default_mapper() → Self¶: Return a default mapper for this class.
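
Conceptually, each mapper above resolves a class name given in a config (a string) to the class itself. The sketch below shows that idea with a plain dictionary. `NameToClassMapper`, its `add` and `find` methods, and `UpperCaseAugmenter` are all hypothetical stand-ins; the actual `BaseMapper` interface is not documented in this section, and only `default_mapper()` is taken from the reference.

```python
class UpperCaseAugmenter:
    """Toy class standing in for something a config entry might name."""

    def augment(self, text):
        return [text.upper()]


class NameToClassMapper:
    """Hypothetical stand-in for a mapper: a registry from names to classes."""

    def __init__(self):
        self._classes = {}

    def add(self, name, cls):
        self._classes[name] = cls

    def find(self, name):
        # Raises KeyError for names that were never registered.
        return self._classes[name]

    @classmethod
    def default_mapper(cls):
        # A default mapper comes pre-populated with the built-in classes.
        mapper = cls()
        mapper.add("example.UpperCaseAugmenter", UpperCaseAugmenter)
        return mapper


mapper = NameToClassMapper.default_mapper()
augmenter_cls = mapper.find("example.UpperCaseAugmenter")
print(augmenter_cls().augment("term"))
```

Keeping names in configs and classes in mappers means a config stays a plain, serializable value, while custom classes can be registered without changing the config format.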

py_pdf_term.pdftoxml package¶

class py_pdf_term.pdftoxml.PDFnXMLElement(pdf_path: str, xml_root: Element)¶

Bases: object

Pair of path to a PDF file and XML element tree.

Parameters:

pdf_path (str) – Path to a PDF file.
xml_root (xml.etree.ElementTree.Element) – Root element of an XML element tree.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

pdf_path: str¶

xml_root: Element¶

class py_pdf_term.pdftoxml.PDFnXMLPath(pdf_path: str, xml_path: str)¶

Bases: object

Pair of a path to a PDF file and a path to an XML file.

Parameters:

pdf_path (str) – Path to a PDF file.
xml_path (str) – Path to an XML file.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

pdf_path: str¶

xml_path: str¶

class py_pdf_term.pdftoxml.PDFtoXMLConverter(bin_opener: BaseBinaryOpener | None = None)¶

Bases: object

Converter from PDF to textful XML format.

Parameters:: bin_opener – Binary opener to open PDF and XML files. If None, StandardBinaryOpener is used, which opens files with the standard open function in Python.

convert_as_element(pdf_path: str, nfc_norm: bool = True, include_pattern: str | None = None, exclude_pattern: str | None = None) → PDFnXMLElement¶

Convert a PDF file to a textful XML element.

Parameters:

pdf_path – Path to a PDF file.
nfc_norm – If True, normalize text to NFC, otherwise keep original.
include_pattern – Regular expression pattern of text to include in the output.
exclude_pattern – Regular expression pattern of text to exclude from the output (overrides include_pattern).

Returns:

Pair of the path to the PDF file and the XML element tree of the output.

Return type:

PDFnXMLElement

convert_as_file(pdf_path: str, xml_path: str, nfc_norm: bool = True, include_pattern: str | None = None, exclude_pattern: str | None = None) → PDFnXMLPath¶

Convert a PDF file to a textful XML file.

Parameters:

pdf_path – Path to a PDF file.
xml_path – Path to the XML file to output.
nfc_norm – If True, normalize text to NFC, otherwise keep original.
include_pattern – Regular expression pattern of text to include in the output.
exclude_pattern – Regular expression pattern of text to exclude from the output (overrides include_pattern).

Returns:

Pair of the path to the PDF file and the path to the output XML file.

Return type:

PDFnXMLPath

py_pdf_term.tokenizers package¶

class py_pdf_term.tokenizers.BaseLanguageTokenizer¶

Bases: object

Base class for language tokenizers. A language tokenizer is expected to tokenize a text into a list of tokens using spaCy.

abstractmethod classmethod class_init() → None¶: Initialize the language tokenizer class. This method is expected to be called before using the language tokenizer.

abstractmethod inscope(text: str) → bool¶

Test whether the text is in the scope of the language tokenizer.

Parameters:: text – Text to test.
Returns:: True if the text is in the scope of the language tokenizer, otherwise False.
Return type:: bool

abstractmethod tokenize(scoped_text: str) → list[Token]¶

Tokenize a scoped text into a list of tokens.

Parameters:: scoped_text – Text to tokenize. This text is expected to be in the scope of the language tokenizer.
Returns:: List of tokens.
Return type:: list[Token]

class py_pdf_term.tokenizers.EnglishTokenizer¶

Bases: BaseLanguageTokenizer

Tokenizer for English. This tokenizer uses spaCy’s en_core_web_sm model.

classmethod class_init() → None¶: Initialize the language tokenizer class. This method is expected to be called before using the language tokenizer.

inscope(text: str) → bool¶

Test whether the text is in the scope of the language tokenizer.

Parameters:: text – Text to test.
Returns:: True if the text is in the scope of the language tokenizer, otherwise False.
Return type:: bool

tokenize(scoped_text: str) → list[Token]¶

Tokenize a scoped text into a list of tokens.

Parameters:: scoped_text – Text to tokenize. This text is expected to be in the scope of the language tokenizer.
Returns:: List of tokens.
Return type:: list[Token]

class py_pdf_term.tokenizers.JapaneseTokenizer¶

Bases: BaseLanguageTokenizer

Tokenizer for Japanese. This tokenizer uses spaCy’s ja_core_news_sm model.

classmethod class_init() → None¶: Initialize the language tokenizer class. This method is expected to be called before using the language tokenizer.

inscope(text: str) → bool¶

Test whether the text is in the scope of the language tokenizer.

Parameters:: text – Text to test.
Returns:: True if the text is in the scope of the language tokenizer, otherwise False.
Return type:: bool

tokenize(scoped_text: str) → list[Token]¶

Tokenize a scoped text into a list of tokens.

Parameters:: scoped_text – Text to tokenize. This text is expected to be in the scope of the language tokenizer.
Returns:: List of tokens.
Return type:: list[Token]

class py_pdf_term.tokenizers.Term(tokens: list[Token], fontsize: float = 0.0, ncolor: str = '', augmented: bool = False)¶

Bases: object

augmented: bool = False¶

fontsize: float = 0.0¶

classmethod from_dict(obj: dict[str, Any]) → Self¶

property lang: str | None¶

lemma() → str¶

ncolor: str = ''¶

surface_form() → str¶

to_dict() → dict[str, Any]¶

tokens: list[Token]¶

class py_pdf_term.tokenizers.Token(lang: str, surface_form: str, pos: str, category: str, subcategory: str, lemma: str, is_meaningless: bool = False)¶

Bases: object

Token in a text.

Parameters:

lang (str) – Language of the token. (e.g., “en”, “ja”)
surface_form (str) – Surface form of the token.
pos (str) – Part-of-speech tag of the token.
category (str) – Category of the token.
subcategory (str) – Subcategory of the token.
lemma (str) – Lemmatized form of the token.
is_meaningless (bool) – Whether the token is meaningless or not. This is calculated by MeaninglessMarker.

NUM_ATTR: ClassVar[int] = 6¶

classmethod from_dict(obj: dict[str, Any]) → Self¶

is_meaningless: bool = False¶

to_dict() → dict[str, str]¶

lang: str¶

surface_form: str¶

pos: str¶

category: str¶

subcategory: str¶

lemma: str¶

class py_pdf_term.tokenizers.Tokenizer(lang_tokenizers: list[BaseLanguageTokenizer] | None = None)¶

Bases: object

Tokenizer for multiple languages. This tokenizer uses spaCy.

Parameters:: lang_tokenizers – List of language tokenizers. The order of the language tokenizers is important. The first language tokenizer that returns True in inscope() is used. If None, this tokenizer uses the default language tokenizers. The default language tokenizers are JapaneseTokenizer and EnglishTokenizer.

tokenize(text: str) → list[Token]¶

Tokenize text into tokens.

Parameters:: text – Text to tokenize.
Returns:: List of tokens.
Return type:: list[Token]
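
The first-match rule described for lang_tokenizers can be sketched without spaCy. Both tokenizer classes and the `tokenize` function below are toy stand-ins that return plain strings rather than Token objects; any splitting of mixed-language text that the real Tokenizer may perform before dispatching is not modelled here.

```python
import re


class AsciiWordTokenizer:
    """Toy tokenizer: claims text that is pure ASCII."""

    def inscope(self, text):
        return text.isascii()

    def tokenize(self, scoped_text):
        return re.findall(r"[A-Za-z0-9]+", scoped_text)


class CharTokenizer:
    """Toy fallback: claims any text and splits it into non-space characters."""

    def inscope(self, text):
        return True

    def tokenize(self, scoped_text):
        return [ch for ch in scoped_text if not ch.isspace()]


def tokenize(text, lang_tokenizers):
    # The first tokenizer whose inscope() returns True handles the text.
    for tokenizer in lang_tokenizers:
        if tokenizer.inscope(text):
            return tokenizer.tokenize(text)
    return []


tokenizers = [AsciiWordTokenizer(), CharTokenizer()]
print(tokenize("quick sort", tokenizers))
print(tokenize("整列", tokenizers))
```

Reversing the list would send every text to `CharTokenizer`, which is why the order of the language tokenizers matters.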

py_pdf_term.candidates package¶

class py_pdf_term.candidates.CandidateTermExtractor(lang_tokenizer_clses: list[type[BaseLanguageTokenizer]] | None = None, token_classifier_clses: list[type[BaseTokenClassifier]] | None = None, token_filter_clses: list[type[BaseCandidateTokenFilter]] | None = None, term_filter_clses: list[type[BaseCandidateTermFilter]] | None = None, splitter_clses: list[type[BaseSplitter]] | None = None, augmenter_clses: list[type[BaseAugmenter]] | None = None)¶

Bases: object

Term extractor which extracts candidate terms from an XML file.

Parameters:

lang_tokenizer_clses – List of language tokenizer classes to tokenize texts. If None, the default language tokenizers are used.
token_classifier_clses – List of token classifier classes to classify tokens. If None, the default token classifiers are used.
token_filter_clses – List of token filter classes to filter tokens. If None, the default token filters are used.
term_filter_clses – List of term filter classes to filter candidate terms. If None, the default term filters are used.
splitter_clses – List of splitter classes to split candidate terms. If None, the default splitters are used.
augmenter_clses – List of augmenter classes to augment candidate terms. If None, the default augmenters are used.

extract_from_domain_elements(domain: str, pdfnxmls: list[PDFnXMLElement]) → DomainCandidateTermList¶

Extract candidate terms from pairs of PDF and XML elements in a domain.

Parameters:

domain – Domain name of PDF files.
pdfnxmls – List of pairs of a PDF path and an XML element in a domain.

Returns:

List of candidate terms in a domain.

Return type:

DomainCandidateTermList

extract_from_domain_files(domain: str, pdfnxmls: list[PDFnXMLPath]) → DomainCandidateTermList¶

Extract candidate terms from pairs of PDF and XML files in a domain.

Parameters:

domain – Domain name of PDF files.
pdfnxmls – List of pairs of paths to PDF and XML files in a domain.

Returns:

List of candidate terms in a domain.

Return type:

DomainCandidateTermList

extract_from_text(text: str, fontsize: float = 0.0, ncolor: str = '') → list[Term]¶

Extract candidate terms from a text. This method is mainly used for testing.

Parameters:

text – Text to extract candidate terms.
fontsize – Font size of output terms.
ncolor – Color of output terms.

Returns:

List of candidate terms in a text.

Return type:

list[Term]

extract_from_xml_element(pdfnxml: PDFnXMLElement) → PDFCandidateTermList¶

Extract candidate terms from a pair of PDF and XML elements.

Parameters:: pdfnxml – Pair of a path to a PDF file and an XML element.
Returns:: List of candidate terms in a PDF file.
Return type:: PDFCandidateTermList

extract_from_xml_file(pdfnxml: PDFnXMLPath) → PDFCandidateTermList¶

Extract candidate terms from a pair of PDF and XML files.

Parameters:: pdfnxml – Pair of paths to a PDF and XML file.
Returns:: List of candidate terms in a PDF file.
Return type:: PDFCandidateTermList

class py_pdf_term.candidates.DomainCandidateTermList(domain: str, pdfs: list[PDFCandidateTermList])¶

Bases: object

Domain name of PDF files and candidate terms of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
pdfs (list[PDFCandidateTermList]) – Candidate terms of each PDF file of the domain.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_candidates_str_set(to_str: Callable[[Term], str] = str) → set[str]¶

to_dict() → dict[str, Any]¶

to_nostyle_candidates_dict(to_str: Callable[[Term], str] = str) → dict[str, Term]¶

domain: str¶

pdfs: list[PDFCandidateTermList]¶

class py_pdf_term.candidates.PDFCandidateTermList(pdf_path: str, pages: list[PageCandidateTermList])¶

Bases: object

Path of a PDF file and candidate terms of the PDF file.

Parameters:

pdf_path (str) – Path of a PDF file.
pages (list[PageCandidateTermList]) – Candidate terms of each page of the PDF file.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_candidates_str_set(to_str: Callable[[Term], str] = str) → set[str]¶

to_dict() → dict[str, Any]¶

to_nostyle_candidates_dict(to_str: Callable[[Term], str] = str) → dict[str, Term]¶

pdf_path: str¶

pages: list[PageCandidateTermList]¶

class py_pdf_term.candidates.PageCandidateTermList(page_num: int, candidates: list[Term])¶

Bases: object

Page number and candidate terms of the page.

Parameters:

page_num (int) – Page number of a PDF file.
candidates (list[Term]) – Candidate terms of the page.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_candidates_str_set(to_str: Callable[[Term], str] = str) → set[str]¶

to_dict() → dict[str, Any]¶

to_nostyle_candidates_dict(to_str: Callable[[Term], str] = str) → dict[str, Term]¶

page_num: int¶

candidates: list[Term]¶

py_pdf_term.candidates.augmenters subpackage¶

class py_pdf_term.candidates.augmenters.AugmenterCombiner(augmenters: list[BaseAugmenter] | None = None)¶

Bases: object

Combiner of augmenters of a candidate term.

Parameters:: augmenters – List of augmenters to be combined. The augmenters are applied in order. If None, the default augmenters are used. The default augmenters are JapaneseConnectorTermAugmenter and EnglishConnectorTermAugmenter.

augment(term: Term) → list[Term]¶

Augment a candidate term.

Parameters:: term – Candidate term to be augmented.
Returns:: List of augmented terms. The original term is not included in the list.
Return type:: list[Term]

class py_pdf_term.candidates.augmenters.BaseAugmenter¶

Bases: object

Base class for augmenters of a candidate term.

When a long term is a candidate, subterms of the long term may also be candidates. For example, if “semantic analysis of programming language” is a candidate, “semantic analysis” and “programming language” may also be candidates.

This class is used to augment a candidate term to its subterms.

abstractmethod augment(term: Term) → list[Term]¶

Augment a candidate term.

Parameters:: term – Candidate term to be augmented.
Returns:: List of augmented terms. The first term is the original term.
Return type:: list[Term]
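
The “semantic analysis of programming language” example can be sketched as a split on connector words. This is only an illustration of the idea: `CONNECTORS` and `split_on_connectors` are hypothetical, tokens are plain strings rather than Token objects, and the concrete augmenters may generate more or different sub-terms.

```python
CONNECTORS = {"of", "for", "in"}  # hypothetical list of connector words


def split_on_connectors(tokens):
    """Return the connector-free runs of a term as sub-term token lists."""
    subterms, current = [], []
    for token in tokens:
        if token.lower() in CONNECTORS:
            # A connector closes the current run and is itself discarded.
            if current:
                subterms.append(current)
            current = []
        else:
            current.append(token)
    if current:
        subterms.append(current)
    # A term without connectors yields nothing new.
    return subterms if len(subterms) > 1 else []


term = "semantic analysis of programming language".split()
print([" ".join(sub) for sub in split_on_connectors(term)])
```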

class py_pdf_term.candidates.augmenters.EnglishConnectorTermAugmenter¶

Bases: BaseSeparationAugmenter

An augmenter of a candidate term by separating tokens based on English connector terms.

augment(term: Term) → list[Term]¶

Augment a candidate term by separating tokens based on English connector terms.

Parameters:: term – Candidate term to be augmented.
Returns:: List of augmented terms. If a given term is not an English term, the list is empty.
Return type:: list[Term]

class py_pdf_term.candidates.augmenters.JapaneseConnectorTermAugmenter¶

Bases: BaseSeparationAugmenter

An augmenter of a candidate term by separating tokens based on Japanese connector terms.

augment(term: Term) → list[Term]¶

Augment a candidate term by separating tokens based on Japanese connector terms.

Parameters:: term – Candidate term to be augmented.
Returns:: List of augmented terms. If a given term is not a Japanese term, the list is empty.
Return type:: list[Term]

py_pdf_term.candidates.splitters subpackage¶

class py_pdf_term.candidates.splitters.BaseSplitter(classifiers: list[BaseTokenClassifier] | None = None)¶

Bases: object

Base class for splitters of a wrongly concatenated term.

Since text extraction from PDF is not perfect, especially in a table or a figure, a term may be wrongly concatenated. For example, when a PDF file contains a table which shows the difference between quick sort, merge sort, and heap sort, the extracted text may be something like “quick sort merge sort heap sort”. In this case, “quick sort”, “merge sort”, and “heap sort” are wrongly concatenated.

This class is used to split a wrongly concatenated term into subterms.

Parameters:: classifiers – List of token classifiers to classify tokens into specific categories. If None, the default classifiers are used. The default classifiers are JapaneseTokenClassifier and EnglishTokenClassifier.

abstractmethod split(term: Term) → list[Term]¶

Split a wrongly concatenated term.

Parameters:: term – Wrongly concatenated term to be split.
Returns:: List of split terms.
Return type:: list[Term]

class py_pdf_term.candidates.splitters.RepeatSplitter(classifiers: list[BaseTokenClassifier] | None = None)¶

Bases: BaseSplitter

Splitter to split a term by repeated tokens. For example, “quick sort merge sort heap sort” is split into “quick sort”, “merge sort”, and “heap sort”.

Parameters:: classifiers – List of token classifiers to classify tokens into specific categories. If None, the default classifiers are used. The default classifiers are JapaneseTokenClassifier and EnglishTokenClassifier.

split(term: Term) → list[Term]¶

Split a wrongly concatenated term.

Parameters:: term – Wrongly concatenated term to be split.
Returns:: List of split terms.
Return type:: list[Term]
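
One way to picture splitting by repeated tokens is to cut the token sequence after every occurrence of a token that appears more than once. The helper `split_by_repeats` below is a hypothetical sketch on plain strings; the actual RepeatSplitter works on Term objects and consults token classifiers, so its rules may differ.

```python
from collections import Counter


def split_by_repeats(tokens):
    """Cut the token list after each occurrence of a repeated token."""
    counts = Counter(tokens)
    pieces, current = [], []
    for token in tokens:
        current.append(token)
        if counts[token] > 1:
            # A repeated token is treated as the end of one sub-term.
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces


tokens = "quick sort merge sort heap sort".split()
print([" ".join(piece) for piece in split_by_repeats(tokens)])
```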

class py_pdf_term.candidates.splitters.SplitterCombiner(splitters: list[BaseSplitter] | None = None)¶

Bases: object

Combiner of splitters.

Parameters:: splitters – List of splitters to split terms. The splitters are applied in order. If None, the default splitters are used. The default splitters are SymbolNameSplitter and RepeatSplitter.

split(term: Term) → list[Term]¶

Split a wrongly concatenated term.

Parameters:: term – Wrongly concatenated term to be split.
Returns:: List of split terms.
Return type:: list[Term]

class py_pdf_term.candidates.splitters.SymbolNameSplitter(classifiers: list[BaseTokenClassifier] | None = None)¶

Bases: BaseSplitter

Splitter to split off a symbol at the end of a term. For example, given “Programming Language 2”, this splitter splits it into “Programming Language” and “2”, and then “2” is ignored as a meaningless term.

Parameters:: classifiers – List of token classifiers to classify tokens into specific categories. If None, the default classifiers are used. The default classifiers are JapaneseTokenClassifier and EnglishTokenClassifier.

split(term: Term) → list[Term]¶

Split a wrongly concatenated term.

Parameters:: term – Wrongly concatenated term to be split.
Returns:: List of split terms.
Return type:: list[Term]
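
The “Programming Language 2” example can be sketched as follows. What counts as a symbol is an assumption here (a run of digits or a single Latin letter); `SYMBOL_LIKE` and `split_trailing_symbol` are hypothetical names, and the real splitter decides with token classifiers rather than a regular expression.

```python
import re

# Assumption: a "symbol" is a run of digits or a single Latin letter.
SYMBOL_LIKE = re.compile(r"[0-9]+|[A-Za-z]")


def split_trailing_symbol(tokens):
    """Return (remaining tokens, trailing symbol or None)."""
    if len(tokens) > 1 and SYMBOL_LIKE.fullmatch(tokens[-1]):
        return tokens[:-1], tokens[-1]
    return tokens, None


name, symbol = split_trailing_symbol("Programming Language 2".split())
print(" ".join(name), "|", symbol)
```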

py_pdf_term.candidates.filters subpackage¶

class py_pdf_term.candidates.filters.BaseCandidateTermFilter¶

Bases: object

Base class for filters of candidate terms.

abstractmethod inscope(term: Term) → bool¶

Test if a term is in scope of this filter.

Parameters:: term – Term to be tested.
Returns:: True if the term is in scope of this filter, False otherwise.
Return type:: bool

abstractmethod is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool
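
The two methods suggest a two-step protocol: `inscope` decides whether a filter applies to a term at all, and `is_candidate` is consulted only for terms in scope. The sketch below assumes that a term outside a filter's scope is left untouched by that filter; this combination rule, `DigitOnlyFilter`, and `passes_all` are illustrative assumptions, and terms are plain strings rather than Term objects.

```python
class DigitOnlyFilter:
    """Toy term filter: in scope for ASCII terms, rejects purely numeric ones."""

    def inscope(self, term):
        return term.isascii()

    def is_candidate(self, scoped_term):
        return not scoped_term.replace(" ", "").isdigit()


def passes_all(term, filters):
    # A filter only gets a vote on terms within its scope.
    return all(f.is_candidate(term) for f in filters if f.inscope(term))


filters = [DigitOnlyFilter()]
for term in ["heap sort", "2024", "整列"]:
    print(term, passes_all(term, filters))
```

Separating scope from judgement lets language-specific filters coexist in one list: an English filter simply abstains on Japanese terms instead of rejecting them.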

class py_pdf_term.candidates.filters.BaseCandidateTokenFilter¶

Bases: object

Base class for filters of tokens which can be part of a candidate term.

abstractmethod inscope(token: Token) → bool¶

Test if a token is in scope of this filter.

Parameters:: token – Token to be tested.
Returns:: True if the token is in scope of this filter, False otherwise.
Return type:: bool

abstractmethod is_partof_candidate(tokens: list[Token], idx: int) → bool¶

Test if a token can be part of a candidate term.

Parameters:

tokens – List of tokens.
idx – An index of the token to be tested.

Returns:

True if the token can be part of a candidate term, False otherwise.

Return type:

bool

class py_pdf_term.candidates.filters.BaseEnglishCandidateTermFilter¶

Bases: BaseCandidateTermFilter

Base class for filters of English candidate terms.

inscope(term: Term) → bool¶

Test if a term is in scope of this filter.

Parameters:: term – Term to be tested.
Returns:: True if the term is in scope of this filter, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.BaseJapaneseCandidateTermFilter¶

Bases: BaseCandidateTermFilter

Base class for filters of Japanese candidate terms.

inscope(term: Term) → bool¶

Test if a term is in scope of this filter.

Parameters:: term – Term to be tested.
Returns:: True if the term is in scope of this filter, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.EnglishConcatenationFilter¶

Candidate term filter to filter out invalidly concatenated English terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.EnglishNumericFilter¶

Term filter to remove English numeric phrases from candidate terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.EnglishProperNounFilter¶

Term filter to remove English proper nouns from candidate terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.EnglishSymbolLikeFilter¶

Candidate term filter to filter out symbol-like English terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.EnglishTokenFilter¶

Bases: BaseCandidateTokenFilter

Candidate token filter to filter out English tokens which cannot be part of candidate terms.

inscope(token: Token) → bool¶

Test if a token is in scope of this filter.

Parameters:: token – Token to be tested.
Returns:: True if the token is in scope of this filter, False otherwise.
Return type:: bool

is_partof_candidate(tokens: list[Token], idx: int) → bool¶

Test if a token can be part of a candidate term.

Parameters:

tokens – List of tokens.
idx – An index of the token to be tested.

Returns:

True if the token can be part of a candidate term, False otherwise.

Return type:

bool

class py_pdf_term.candidates.filters.FilterCombiner(token_filters: list[BaseCandidateTokenFilter] | None = None, term_filters: list[BaseCandidateTermFilter] | None = None)¶

Bases: object

Combiner of token filters and term filters.

Parameters:

token_filters – List of token filters to filter tokens. If None, the default token filters are used. The default token filters are JapaneseTokenFilter and EnglishTokenFilter.
term_filters – List of term filters to filter candidate terms. If None, the default term filters are used. The default term filters are JapaneseConcatenationFilter, EnglishConcatenationFilter, JapaneseSymbolLikeFilter, EnglishSymbolLikeFilter, JapaneseProperNounFilter, EnglishProperNounFilter, JapaneseNumericFilter, and EnglishNumericFilter.

is_candidate(term: Term) → bool¶

Test if a term is a candidate term using term filters.

Parameters:: term – Term to be tested.
Returns:: True if the term is a candidate term, False otherwise.
Return type:: bool

is_partof_candidate(tokens: list[Token], idx: int) → bool¶

Test if a token can be part of a candidate term using token filters.

Parameters:

tokens – List of tokens.
idx – Index of the token to be tested.

Returns:

True if the token can be part of a candidate term, False otherwise.

Return type:

bool
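
One plausible way to combine several scoped filters is sketched below. The combination rule (an item passes only if at least one filter is in scope and every in-scope filter accepts it) is an assumption, not a statement about FilterCombiner's actual implementation. `ScopedRule` and `combined_accepts` are hypothetical names, and plain strings stand in for tokens and terms.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScopedRule:
    """Hypothetical filter: a scope test plus an acceptance test."""

    inscope: Callable[[str], bool]
    accepts: Callable[[str], bool]


def combined_accepts(rules: list[ScopedRule], item: str) -> bool:
    """Accept item only if some rule is in scope and all in-scope rules accept."""
    scoped = [rule for rule in rules if rule.inscope(item)]
    # Assumption: an item that no rule claims is rejected.
    return bool(scoped) and all(rule.accepts(item) for rule in scoped)


rules = [
    # ASCII items: reject digit-only strings.
    ScopedRule(inscope=str.isascii, accepts=lambda s: not s.isdigit()),
    # Non-ASCII items: reject single-character strings.
    ScopedRule(inscope=lambda s: not s.isascii(), accepts=lambda s: len(s) > 1),
]

for item in ["parser", "42", "の", "構文解析"]:
    print(item, combined_accepts(rules, item))
```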

class py_pdf_term.candidates.filters.JapaneseConcatenationFilter¶

Candidate term filter to filter out invalidly concatenated Japanese terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.JapaneseNumericFilter¶

Term filter to remove Japanese numeric phrases from candidate terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.JapaneseProperNounFilter¶

Term filter to remove Japanese proper nouns from candidate terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.JapaneseSymbolLikeFilter¶

Candidate term filter to filter out symbol-like Japanese terms.

is_candidate(scoped_term: Term) → bool¶

Test if a scoped term is a candidate term.

Parameters:: scoped_term – Scoped term to be tested.
Returns:: True if the scoped term is a candidate term, False otherwise.
Return type:: bool

class py_pdf_term.candidates.filters.JapaneseTokenFilter¶

Bases: BaseCandidateTokenFilter

Candidate token filter to filter out Japanese tokens which cannot be part of candidate terms.

inscope(token: Token) → bool¶

Test if a token is in scope of this filter.

Parameters:: token – Token to be tested.
Returns:: True if the token is in scope of this filter, False otherwise.
Return type:: bool

is_partof_candidate(tokens: list[Token], idx: int) → bool¶

Test if a token can be part of a candidate term.

Parameters:

tokens – List of tokens.
idx – An index of the token to be tested.

Returns:

True if the token can be part of a candidate term, False otherwise.

Return type:

bool

py_pdf_term.candidates.classifiers subpackage¶

class py_pdf_term.candidates.classifiers.BaseTokenClassifier¶

Bases: object

Base class for token classifiers. A token classifier is used to classify a token into a specific category.

abstractmethod inscope(token: Token) → bool¶

Test whether a token is in the scope of this classifier or not.

Parameters:: token – Token to be tested.
Returns:: True if the token is in the scope of this classifier, False otherwise.
Return type:: bool

is_connector(token: Token) → bool¶

Test whether a token is a connector or not. A connector is a token that is used to connect two terms, that is, either a connector symbol or a connector term.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector, False otherwise.
Return type:: bool

abstractmethod is_connector_symbol(token: Token) → bool¶

Test whether a token is a connector symbol or not. A connector symbol is a symbol that is used to connect two terms such as - and ・. If this method returns True, is_symbol() must also return True.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector symbol, False otherwise.
Return type:: bool

abstractmethod is_connector_term(token: Token) → bool¶

Test whether a token is a connector term or not. A connector term is a term that is used to connect two terms such as “of” and “in” in English, and “の” in Japanese.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector term, False otherwise.
Return type:: bool

is_meaningless(token: Token) → bool¶

Test whether a token is meaningless or not. A meaningless token is a token that carries no meaning of its own, such as a symbol or a connector term.

Parameters:: token – Token to be tested.
Returns:: True if the token is meaningless, False otherwise.
Return type:: bool

abstractmethod is_symbol(token: Token) → bool¶

Test whether a token is a symbol or not.

Parameters:: token – Token to be tested.
Returns:: True if the token is a symbol, False otherwise.
Return type:: bool
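
The relationship between the abstract and derived methods can be sketched as follows. This is an assumption based on the descriptions above (a connector is a connector symbol or a connector term; a meaningless token is a symbol or a connector term), not the library's source. `ClassifierSketch` and `EnglishSketch` are hypothetical, and plain strings stand in for Token objects.

```python
from abc import ABC, abstractmethod


class ClassifierSketch(ABC):
    """Hypothetical mirror of the classifier interface, on plain strings."""

    @abstractmethod
    def inscope(self, token: str) -> bool: ...

    @abstractmethod
    def is_symbol(self, token: str) -> bool: ...

    @abstractmethod
    def is_connector_symbol(self, token: str) -> bool: ...

    @abstractmethod
    def is_connector_term(self, token: str) -> bool: ...

    def is_connector(self, token: str) -> bool:
        # Assumed derivation: either kind of connector counts.
        return self.is_connector_symbol(token) or self.is_connector_term(token)

    def is_meaningless(self, token: str) -> bool:
        # Assumed derivation: symbols and connector terms carry no meaning.
        return self.is_symbol(token) or self.is_connector_term(token)


class EnglishSketch(ClassifierSketch):
    CONNECTOR_SYMBOLS = {"-"}
    CONNECTOR_TERMS = {"of", "in"}

    def inscope(self, token: str) -> bool:
        return token.isascii()

    def is_symbol(self, token: str) -> bool:
        return bool(token) and not any(ch.isalnum() for ch in token)

    def is_connector_symbol(self, token: str) -> bool:
        return token in self.CONNECTOR_SYMBOLS

    def is_connector_term(self, token: str) -> bool:
        return token.lower() in self.CONNECTOR_TERMS


classifier = EnglishSketch()
for tok in ["-", "of", "tree"]:
    print(tok, classifier.is_connector(tok), classifier.is_meaningless(tok))
```

Note that the sketch keeps the documented invariant: every token accepted by `is_connector_symbol` is also accepted by `is_symbol`.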

class py_pdf_term.candidates.classifiers.EnglishTokenClassifier¶

Bases: BaseTokenClassifier

Token classifier for English tokens.

inscope(token: Token) → bool¶

Test whether a token is in the scope of this classifier or not.

Parameters:: token – Token to be tested.
Returns:: True if the token is in the scope of this classifier, False otherwise.
Return type:: bool

is_connector_symbol(token: Token) → bool¶

Test whether a token is a connector symbol or not. A connector symbol is a symbol that is used to connect two terms such as - and ・. If this method returns True, is_symbol() must also return True.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector symbol, False otherwise.
Return type:: bool

is_connector_term(token: Token) → bool¶

Test whether a token is a connector term or not. A connector term is a term that is used to connect two terms such as “of” and “in” in English, and “の” in Japanese.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector term, False otherwise.
Return type:: bool

is_symbol(token: Token) → bool¶

Test whether a token is a symbol or not.

Parameters:: token – Token to be tested.
Returns:: True if the token is a symbol, False otherwise.
Return type:: bool

class py_pdf_term.candidates.classifiers.JapaneseTokenClassifier¶

Bases: BaseTokenClassifier

Token classifier for Japanese tokens.

inscope(token: Token) → bool¶

Test whether a token is in the scope of this classifier or not.

Parameters:: token – Token to be tested.
Returns:: True if the token is in the scope of this classifier, False otherwise.
Return type:: bool

is_connector_symbol(token: Token) → bool¶

Test whether a token is a connector symbol or not. A connector symbol is a symbol that is used to connect two terms such as - and ・. If this method returns True, is_symbol() must also return True.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector symbol, False otherwise.
Return type:: bool

is_connector_term(token: Token) → bool¶

Test whether a token is a connector term or not. A connector term is a term that is used to connect two terms such as “of” and “in” in English, and “の” in Japanese.

Parameters:: token – Token to be tested.
Returns:: True if the token is a connector term, False otherwise.
Return type:: bool

is_symbol(token: Token) → bool¶

Test whether a token is a symbol or not.

Parameters:: token – Token to be tested.
Returns:: True if the token is a symbol, False otherwise.
Return type:: bool

class py_pdf_term.candidates.classifiers.MeaninglessMarker(classifiers: list[BaseTokenClassifier] | None = None)¶

Bases: object

Marker class to mark meaningless tokens in a term.

Parameters:: classifiers – List of token classifiers to mark meaningless tokens. If None, JapaneseTokenClassifier and EnglishTokenClassifier are used.

mark(term: Term) → Term¶

Mark meaningless tokens in a term. The original term is modified in-place.

Parameters:: term – Term to be marked.
Returns:: Term with meaningless tokens marked.
Return type:: Term

py_pdf_term.analysis package¶

class py_pdf_term.analysis.ContainerTermsAnalyzer(ignore_augmented: bool = True)¶

Bases: object

Analyze container terms of the domain.

Parameters:: ignore_augmented – If True, ignore augmented terms. The default is True.

analyze(domain_candidates: DomainCandidateTermList) → DomainContainerTerms¶

Analyze container terms of the domain.

Parameters:: domain_candidates – List of candidate terms in a domain. The target of analysis.
Returns:: Domain name and container terms of candidate terms in the domain.
Return type:: DomainContainerTerms

class py_pdf_term.analysis.DomainContainerTerms(domain: str, container_terms: dict[str, set[str]])¶

Bases: object

Domain name and container terms of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
container_terms (dict[str, set[str]]) – Mapping from a lemmatized term in the domain to the set of its lemmatized containers. (term, container) is valid if and only if the container contains the term as a proper subsequence.

domain: str¶

container_terms: dict[str, set[str]]¶
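
To make the container relation concrete, here is a minimal sketch that builds such a mapping from whitespace-tokenized strings. It assumes that "proper subsequence" means a contiguous run of tokens inside a strictly longer term; that reading, and the function names, are assumptions rather than the library's definition.

```python
def is_proper_subsequence(term: tuple[str, ...], container: tuple[str, ...]) -> bool:
    """True if term occurs contiguously inside a strictly longer container."""
    n, m = len(term), len(container)
    if n == 0 or n >= m:
        return False
    return any(container[i:i + n] == term for i in range(m - n + 1))


def container_terms(terms: list[str]) -> dict[str, set[str]]:
    """Map each term to the set of other terms that contain it."""
    tokenized = {term: tuple(term.split()) for term in terms}
    return {
        term: {
            other
            for other, other_tokens in tokenized.items()
            if is_proper_subsequence(tokens, other_tokens)
        }
        for term, tokens in tokenized.items()
    }


result = container_terms(
    ["language", "programming language", "natural language processing"]
)
print(result["language"])
```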

class py_pdf_term.analysis.DomainLeftRightFrequency(domain: str, left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶

Bases: object

Domain name and left/right frequency of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.

domain: str¶

left_freq: dict[str, dict[str, int]]¶

right_freq: dict[str, dict[str, int]]¶
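
The following sketch shows how left/right adjacency counts of this shape could be computed from tokenized terms. The nesting order (`left_freq[token][left]` and `right_freq[token][right]`) is an assumption, as are the function name and the `meaningless` parameter. Pairs involving a meaningless token are skipped, so their count stays at zero by absence.

```python
from collections import defaultdict


def left_right_frequency(
    terms: list[list[str]], meaningless: frozenset[str] = frozenset()
) -> tuple[dict[str, dict[str, int]], dict[str, dict[str, int]]]:
    """Count how often each token is directly preceded/followed by another."""
    left_freq: dict = defaultdict(lambda: defaultdict(int))
    right_freq: dict = defaultdict(lambda: defaultdict(int))
    for tokens in terms:
        for left, right in zip(tokens, tokens[1:]):
            if left in meaningless or right in meaningless:
                continue  # such pairs are never counted
            left_freq[right][left] += 1   # `left` appears to the left of `right`
            right_freq[left][right] += 1  # `right` appears to the right of `left`
    return (
        {tok: dict(counts) for tok, counts in left_freq.items()},
        {tok: dict(counts) for tok, counts in right_freq.items()},
    )


left, right = left_right_frequency(
    [
        ["natural", "language", "processing"],
        ["programming", "language"],
        ["theory", "of", "computation"],
    ],
    meaningless=frozenset({"of"}),
)
print(left["language"], right["language"])
```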

class py_pdf_term.analysis.DomainTermOccurrence(domain: str, term_freq: dict[str, int], doc_term_freq: dict[str, int])¶

Bases: object

Domain name and term occurrence of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
doc_term_freq (dict[str, int]) – Number of documents in the domain that contain the lemmatized term. Count even if the lemmatized term occurs as a part of a lemmatized phrase.

domain: str¶

term_freq: dict[str, int]¶

doc_term_freq: dict[str, int]¶
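
The difference between the two counters can be sketched as below. Documents are given as lists of tokenized terms, and every contiguous sub-phrase of a term is counted. Whether the library counts all sub-phrases or only those that are themselves candidate terms is not stated here, so this is an assumption, and `term_occurrence` is a hypothetical name.

```python
from collections import Counter


def term_occurrence(
    docs: list[list[list[str]]],
) -> tuple[dict[str, int], dict[str, int]]:
    """Return (term_freq, doc_term_freq) over all contiguous sub-phrases."""
    term_freq: Counter = Counter()
    doc_term_freq: Counter = Counter()
    for doc in docs:
        seen_in_doc: set[str] = set()
        for term in doc:
            for start in range(len(term)):
                for end in range(start + 1, len(term) + 1):
                    phrase = " ".join(term[start:end])
                    term_freq[phrase] += 1   # every occurrence counts
                    seen_in_doc.add(phrase)
        doc_term_freq.update(seen_in_doc)    # at most once per document
    return dict(term_freq), dict(doc_term_freq)


term_freq, doc_term_freq = term_occurrence(
    [
        [["natural", "language"], ["language"]],  # document 1
        [["language", "model"]],                  # document 2
    ]
)
print(term_freq["language"], doc_term_freq["language"])
```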

class py_pdf_term.analysis.TermLeftRightFrequencyAnalyzer(ignore_augmented: bool = True)¶

Bases: object

Analyze left/right frequency of terms in a domain.

Parameters:: ignore_augmented – If True, ignore augmented terms. The default is True.

analyze(domain_candidates: DomainCandidateTermList) → DomainLeftRightFrequency¶

Analyze left/right frequency of terms in a domain.

Parameters:: domain_candidates – List of candidate terms in a domain. The target of analysis.
Returns:: Domain name and left/right frequency of candidate terms in the domain.
Return type:: DomainLeftRightFrequency

class py_pdf_term.analysis.TermOccurrenceAnalyzer(ignore_augmented: bool = True)¶

Bases: object

Analyze term occurrences in a domain.

Parameters:: ignore_augmented – If True, ignore augmented terms. The default is True.

analyze(domain_candidates: DomainCandidateTermList) → DomainTermOccurrence¶

Analyze term occurrences in a domain.

Parameters:: domain_candidates – List of candidate terms in a domain. The target of analysis.
Returns:: Domain name and term occurrence of candidate terms in the domain.
Return type:: DomainTermOccurrence

py_pdf_term.methods package¶

class py_pdf_term.methods.BaseMultiDomainRankingMethod(data_collector: BaseRankingDataCollector, ranker: BaseMultiDomainRanker)¶

Bases: Generic

Base class for ranking methods with an algorithm which requires cross-domain information.

Parameters:

data_collector – Collector of metadata to rank candidate terms in domain-specific PDF documents.
ranker – Ranker of candidate terms in PDF documents by an algorithm which requires cross-domain information.

collect_data(domain_candidates: DomainCandidateTermList) → RankingData¶

Collect metadata to rank candidate terms in PDF documents. This method is used to collect metadata before ranking candidate terms in PDF documents. The following two code snippets are equivalent:

`ranking_data_list = list(map(method.collect_data, domain_candidates_list))`
`term_ranking = method.rank_terms(domain_candidates_list, ranking_data_list)`

and

`term_ranking = method.rank_terms(domain_candidates_list)`

This method is useful when you want to utilize cached metadata to rank candidate terms in PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

abstractmethod classmethod collect_data_from_dict(obj: dict[str, Any]) → RankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

rank_domain_terms(domain: str, domain_candidates_list: list[DomainCandidateTermList], ranking_data_list: list[RankingData] | None = None) → MethodTermRanking¶

Rank candidate terms in PDF documents in a domain.

Parameters:

domain – Domain to rank candidate terms in PDF documents.
domain_candidates_list – List of candidate terms in domain-specific PDF documents.
ranking_data_list – Metadata to rank candidate terms in PDF documents. If this argument is not None, this method skips collecting metadata and uses this argument instead. The default is None.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

rank_terms(domain_candidates_list: list[DomainCandidateTermList], ranking_data_list: list[RankingData] | None = None) → Iterator[MethodTermRanking]¶

Rank candidate terms in PDF documents in multiple domains.

Parameters:

domain_candidates_list – List of candidate terms in domain-specific PDF documents.
ranking_data_list – Metadata to rank candidate terms in PDF documents. If this argument is not None, this method skips collecting metadata and uses this argument instead. The default is None.

Yields:

MethodTermRanking – Ranking result of candidate terms in PDF documents.

class py_pdf_term.methods.BaseSingleDomainRankingMethod(data_collector: BaseRankingDataCollector, ranker: BaseSingleDomainRanker)¶

Bases: Generic

Base class for ranking methods with an algorithm which does not require cross-domain information.

Parameters:

data_collector – Collector of metadata to rank candidate terms in domain-specific PDF documents.
ranker – Ranker of candidate terms in PDF documents by an algorithm which does not require cross-domain information.

collect_data(domain_candidates: DomainCandidateTermList) → RankingData¶

Collect metadata to rank candidate terms in PDF documents. This method is used to collect metadata before ranking candidate terms in PDF documents. The following two code snippets are equivalent:

`ranking_data = method.collect_data(domain_candidates)`
`term_ranking = method.rank_terms(domain_candidates, ranking_data)`

and

` term_ranking = method.rank_terms(domain_candidates) `

This method is useful when you want to utilize cached metadata to rank candidate terms in PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

abstractmethod classmethod collect_data_from_dict(obj: dict[str, Any]) → RankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: RankingData | None = None) → MethodTermRanking¶

Rank candidate terms in PDF documents in a domain.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents. If this argument is not None, this method skips collecting metadata and uses this argument instead. The default is None.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.FLRHMethod(threshold: float = 1e-8, max_loop: int = 1000)¶

Bases: BaseSingleDomainRankingMethod[FLRHRankingData]

Ranking method by FLRH algorithm. This algorithm is a combination of FLR and HITS.

Parameters:

threshold – Convergence threshold for the HITS part of the FLRH algorithm. The default is 1e-8.
max_loop – Maximum number of loops for the HITS part of the FLRH algorithm. The default is 1000.

classmethod collect_data_from_dict(obj: dict[str, Any]) → FLRHRankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.FLRMethod¶

Bases: BaseSingleDomainRankingMethod[FLRRankingData]

Ranking method by FLR algorithm.

classmethod collect_data_from_dict(obj: dict[str, Any]) → FLRRankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.HITSMethod(threshold: float = 1e-8, max_loop: int = 1000)¶

Bases: BaseSingleDomainRankingMethod[HITSRankingData]

Ranking method by HITS algorithm.

Parameters:

threshold – Threshold to determine convergence. If the difference between original auth/hub values and new auth/hub values is less than this threshold, the algorithm is considered to be converged. The default is 1e-8.
max_loop – Maximum number of loops to run the algorithm. If the algorithm does not converge within this number of loops, it is forced to stop. The default is 1000.
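
The roles of `threshold` and `max_loop` can be seen in a generic HITS power iteration, sketched below on a small directed graph. This is textbook HITS with L2 normalization and a maximum-absolute-difference convergence test. How the library builds its graph from candidate terms and which norm it uses are not covered, and the function names are hypothetical.

```python
import math


def _normalize(values: dict[str, float]) -> dict[str, float]:
    """Scale values to unit L2 norm; an all-zero vector stays all zero."""
    norm = math.sqrt(sum(v * v for v in values.values()))
    return {k: (v / norm if norm > 0 else 0.0) for k, v in values.items()}


def hits(
    edges: list[tuple[str, str]], threshold: float = 1e-8, max_loop: int = 1000
) -> tuple[dict[str, float], dict[str, float]]:
    """Return (auth, hub) scores for the directed edges (src, dst)."""
    nodes = sorted({node for edge in edges for node in edge})
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(max_loop):  # forced stop if convergence is not reached
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += hub[src]
        new_auth = _normalize(new_auth)

        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        new_hub = _normalize(new_hub)

        # Largest change of any auth or hub value in this iteration.
        diff = max(
            (max(abs(new_auth[n] - auth[n]), abs(new_hub[n] - hub[n])) for n in nodes),
            default=0.0,
        )
        auth, hub = new_auth, new_hub
        if diff < threshold:  # considered converged
            break
    return auth, hub


auth, hub = hits([("a", "c"), ("b", "c")])
print(auth, hub)
```

A smaller `threshold` demands a tighter fixed point at the cost of more iterations, while `max_loop` bounds the running time on graphs where the scores oscillate or converge slowly.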

classmethod collect_data_from_dict(obj: dict[str, Any]) → HITSRankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.MCValueMethod¶

Bases: BaseSingleDomainRankingMethod[MCValueRankingData]

Ranking method by MC-Value algorithm.

classmethod collect_data_from_dict(obj: dict[str, Any]) → MCValueRankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.MDPMethod¶

Bases: BaseMultiDomainRankingMethod[MDPRankingData]

Ranking method by MDP algorithm.

classmethod collect_data_from_dict(obj: dict[str, Any]) → MDPRankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.MethodTermRanking(domain: str, ranking: list[ScoredTerm])¶

Bases: object

Domain name and ranking of technical terms of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
ranking (list[ScoredTerm]) – List of pairs of lemmatized term and method score. The list is sorted by the score in descending order.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

domain: str¶

ranking: list[ScoredTerm]¶
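
A dictionary round trip of the kind `to_dict()` and `from_dict()` provide can be sketched with dataclasses. The field names of `ScoredTermSketch` and the resulting dictionary layout are assumptions for illustration and may not match the library's ScoredTerm or its cache format.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ScoredTermSketch:
    """Stand-in for ScoredTerm; the field names are assumptions."""

    term: str
    score: float


@dataclass(frozen=True)
class TermRankingSketch:
    """Stand-in for MethodTermRanking with a dict round trip."""

    domain: str
    ranking: list[ScoredTermSketch]

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)  # nested dataclasses become plain dicts

    @classmethod
    def from_dict(cls, obj: dict[str, Any]) -> "TermRankingSketch":
        return cls(
            domain=obj["domain"],
            ranking=[ScoredTermSketch(**item) for item in obj["ranking"]],
        )


scored = [ScoredTermSketch("lexer", 0.4), ScoredTermSketch("parser", 0.9)]
original = TermRankingSketch(
    domain="compilers",
    # Keep the ranking sorted by score in descending order.
    ranking=sorted(scored, key=lambda st: st.score, reverse=True),
)
restored = TermRankingSketch.from_dict(original.to_dict())
print(restored == original)
```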

class py_pdf_term.methods.TFIDFMethod¶

Bases: BaseMultiDomainRankingMethod[TFIDFRankingData]

Ranking method by TF-IDF algorithm.

classmethod collect_data_from_dict(obj: dict[str, Any]) → TFIDFRankingData¶

Collect metadata to rank candidate terms in PDF documents from a dictionary. This method is used to load cached metadata.

Parameters:: obj – Dictionary which contains metadata to rank candidate terms in PDF documents in a domain.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

py_pdf_term.methods.collectors subpackage¶

class py_pdf_term.methods.collectors.BaseRankingDataCollector¶

Bases: Generic

Base class for ranking data collectors. This class is used to collect metadata to rank candidate terms in domain-specific PDF documents.

abstractmethod collect(domain_candidates: DomainCandidateTermList) → RankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.collectors.FLRHRankingDataCollector¶

Bases: BaseRankingDataCollector[FLRHRankingData]

Collector of metadata to rank candidate terms in domain-specific PDF documents by FLRH algorithm.

collect(domain_candidates: DomainCandidateTermList) → FLRHRankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.collectors.FLRRankingDataCollector¶

Bases: BaseRankingDataCollector[FLRRankingData]

Collector of metadata to rank candidate terms in domain-specific PDF documents by FLR algorithm.

collect(domain_candidates: DomainCandidateTermList) → FLRRankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.collectors.HITSRankingDataCollector¶

Bases: BaseRankingDataCollector[HITSRankingData]

Collector of metadata to rank candidate terms in domain-specific PDF documents by HITS algorithm.

collect(domain_candidates: DomainCandidateTermList) → HITSRankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.collectors.MCValueRankingDataCollector¶

Bases: BaseRankingDataCollector[MCValueRankingData]

Collector of metadata to rank candidate terms in domain-specific PDF documents by MC-Value algorithm.

collect(domain_candidates: DomainCandidateTermList) → MCValueRankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.collectors.MDPRankingDataCollector¶

Bases: BaseRankingDataCollector[MDPRankingData]

Collector of metadata to rank candidate terms in domain-specific PDF documents by MDP algorithm.

collect(domain_candidates: DomainCandidateTermList) → MDPRankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

class py_pdf_term.methods.collectors.TFIDFRankingDataCollector¶

Bases: BaseRankingDataCollector[TFIDFRankingData]

Collector of metadata to rank candidate terms in domain-specific PDF documents by TF-IDF algorithm.

collect(domain_candidates: DomainCandidateTermList) → TFIDFRankingData¶

Collect metadata to rank candidate terms in domain-specific PDF documents.

Parameters:: domain_candidates – List of candidate terms in domain-specific PDF documents.
Returns:: Metadata to rank candidate terms in PDF documents.
Return type:: RankingData

py_pdf_term.methods.rankers subpackage¶

class py_pdf_term.methods.rankers.BaseMultiDomainRanker¶

Bases: Generic

Base class for term rankers with an algorithm which requires cross-domain information.

abstractmethod rank_terms(domain_candidates: DomainCandidateTermList, ranking_data_list: list[RankingData]) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data_list – List of metadata to rank candidate terms in PDF documents for each domain.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.BaseSingleDomainRanker¶

Bases: Generic

Base class for term rankers with an algorithm which does not require cross-domain information.

abstractmethod rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: RankingData) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.FLRHRanker(threshold: float = 1e-8, max_loop: int = 1000)¶

Bases: BaseSingleDomainRanker[FLRHRankingData]

Term ranker by FLRH algorithm. This algorithm is a combination of FLR and HITS.

Parameters:

threshold – Threshold value for HITS algorithm. The default is 1e-8.
max_loop – Maximum number of loops for HITS algorithm. The default is 1000.

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: FLRHRankingData) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.FLRRanker¶

Bases: BaseSingleDomainRanker[FLRRankingData]

Term ranker by FLR algorithm.

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: FLRRankingData) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.HITSRanker(threshold: float = 1e-8, max_loop: int = 1000)¶

Bases: BaseSingleDomainRanker[HITSRankingData]

Term ranker by HITS algorithm.

Parameters:

threshold – Threshold to determine convergence. If the difference between original auth/hub values and new auth/hub values is less than this threshold, the algorithm is considered to be converged. The default is 1e-8.
max_loop – Maximum number of loops to run the algorithm. If the algorithm does not converge within this number of loops, it is forced to stop. The default is 1000.

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: HITSRankingData) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.MCValueRanker¶

Bases: BaseSingleDomainRanker[MCValueRankingData]

Term ranker by MC-Value algorithm.

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data: MCValueRankingData) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data – Metadata to rank candidate terms in PDF documents.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.MDPRanker¶

Bases: BaseMultiDomainRanker[MDPRankingData]

Term ranker by MDP algorithm.

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data_list: list[MDPRankingData]) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data_list – List of metadata to rank candidate terms in PDF documents for each domain.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

class py_pdf_term.methods.rankers.TFIDFRanker¶

Bases: BaseMultiDomainRanker[TFIDFRankingData]

Term ranker by TF-IDF algorithm.

rank_terms(domain_candidates: DomainCandidateTermList, ranking_data_list: list[TFIDFRankingData]) → MethodTermRanking¶

Rank candidate terms in domain-specific PDF documents.

Parameters:

domain_candidates – List of candidate terms in domain-specific PDF documents.
ranking_data_list – List of metadata to rank candidate terms in PDF documents for each domain.

Returns:

Ranking result of candidate terms in PDF documents.

Return type:

MethodTermRanking

py_pdf_term.methods.rankingdata subpackage¶

class py_pdf_term.methods.rankingdata.BaseRankingData(domain: str)¶

Bases: object

Base class for ranking data of technical terms of a domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

domain: str¶

class py_pdf_term.methods.rankingdata.FLRHRankingData(domain: str, term_freq: dict[str, int], left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶

Data of technical terms of a domain for FLRH algorithm.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.

domain: str¶

term_freq: dict[str, int]¶

left_freq: dict[str, dict[str, int]]¶

right_freq: dict[str, dict[str, int]]¶
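As an illustration of what these three tables hold, the following sketch counts them from a list of lemmatized phrases. It is not the library's collector: the count_frequencies helper, the input format (one list of lemma strings per phrase) and the nesting order left_freq[token][left] / right_freq[token][right] are assumptions, and the "meaningless token" filter is omitted.

```python
from collections import defaultdict


def count_frequencies(phrases):
    """phrases: lemmatized candidate phrases, each a list of lemma strings."""
    candidates = {" ".join(phrase) for phrase in phrases}
    term_freq = defaultdict(int)
    left_freq = defaultdict(lambda: defaultdict(int))
    right_freq = defaultdict(lambda: defaultdict(int))

    for phrase in phrases:
        n = len(phrase)
        # Brute force: a candidate also counts when it occurs inside a longer phrase.
        for i in range(n):
            for j in range(i + 1, n + 1):
                sub = " ".join(phrase[i:j])
                if sub in candidates:
                    term_freq[sub] += 1
        # Each adjacent pair (left, right) updates both tables.
        for left, right in zip(phrase, phrase[1:]):
            left_freq[right][left] += 1   # assumed layout: left_freq[token][left]
            right_freq[left][right] += 1  # assumed layout: right_freq[token][right]

    return (
        dict(term_freq),
        {token: dict(counts) for token, counts in left_freq.items()},
        {token: dict(counts) for token, counts in right_freq.items()},
    )


term_freq, left_freq, right_freq = count_frequencies(
    [["language", "model"], ["model"], ["neural", "language", "model"]]
)
```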

class py_pdf_term.methods.rankingdata.FLRRankingData(domain: str, term_freq: dict[str, int], left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶

Data of technical terms of a domain for FLR algorithm.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.

domain: str¶

term_freq: dict[str, int]¶

left_freq: dict[str, dict[str, int]]¶

right_freq: dict[str, dict[str, int]]¶
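For orientation, the commonly cited FLR formulation multiplies the term frequency by the geometric mean of the left and right connection counts of the term's tokens. The sketch below follows that textbook form; whether FLRRanker computes exactly this is not stated in this reference, and the flr_score helper, the space-separated term keys and the left_freq[token][left] nesting are assumptions.

```python
import math


def flr_score(term, term_freq, left_freq, right_freq):
    """Textbook FLR: frequency times the geometric mean of (left+1)(right+1)."""
    tokens = term.split()
    if not tokens:
        return 0.0
    log_lr = 0.0
    for token in tokens:
        left = sum(left_freq.get(token, {}).values())    # pairs (left, token)
        right = sum(right_freq.get(token, {}).values())  # pairs (token, right)
        log_lr += math.log(left + 1) + math.log(right + 1)
    # Geometric mean over the 2 * len(tokens) factors.
    lr = math.exp(log_lr / (2 * len(tokens)))
    return term_freq.get(term, 0) * lr


score = flr_score(
    "language model",
    term_freq={"language model": 2},
    left_freq={"model": {"language": 2}},
    right_freq={"language": {"model": 2}},
)
```

Adding 1 to every count keeps tokens without any observed neighbor from zeroing the whole product.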

class py_pdf_term.methods.rankingdata.HITSRankingData(domain: str, term_freq: dict[str, int], left_freq: dict[str, dict[str, int]], right_freq: dict[str, dict[str, int]])¶

Data of technical terms of a domain for HITS algorithm.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
left_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (left, token) in the domain. If token or left is meaningless, this is fixed at zero.
right_freq (dict[str, dict[str, int]]) – Number of occurrences of lemmatized (token, right) in the domain. If token or right is meaningless, this is fixed at zero.

domain: str¶

term_freq: dict[str, int]¶

left_freq: dict[str, dict[str, int]]¶

right_freq: dict[str, dict[str, int]]¶

class py_pdf_term.methods.rankingdata.MCValueRankingData(domain: str, term_freq: dict[str, int], container_terms: dict[str, set[str]])¶

Data of technical terms of a domain for MC-Value algorithm.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
container_terms (dict[str, set[str]]) – Set of containers of the lemmatized term in the domain. (term, container) is valid iff the container contains the term as a proper subsequence.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

domain: str¶

term_freq: dict[str, int]¶

container_terms: dict[str, set[str]]¶
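The container relation can be sketched as follows. This is an illustration rather than the library's code: build_container_terms and is_proper_subsequence are hypothetical helpers, terms are assumed to be space-separated lemma strings, and "proper subsequence" is read here as a contiguous run of tokens that is strictly shorter than the container.

```python
def is_proper_subsequence(term_tokens, container_tokens):
    """True if term_tokens occurs contiguously inside a strictly longer container."""
    n, m = len(term_tokens), len(container_tokens)
    if n == 0 or n >= m:
        return False
    return any(container_tokens[i:i + n] == term_tokens for i in range(m - n + 1))


def build_container_terms(terms):
    """Map each term to the set of other terms that contain it."""
    tokenized = {term: term.split() for term in terms}
    return {
        term: {
            other
            for other, other_tokens in tokenized.items()
            if is_proper_subsequence(tokens, other_tokens)
        }
        for term, tokens in tokenized.items()
    }


containers = build_container_terms(
    ["model", "language model", "neural language model"]
)
```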

class py_pdf_term.methods.rankingdata.MDPRankingData(domain: str, term_freq: dict[str, int])¶

Data of technical terms of a domain for MDP algorithm.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
num_terms (int) – Brute force counting of all lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

domain: str¶

term_freq: dict[str, int]¶

num_terms: int¶

class py_pdf_term.methods.rankingdata.TFIDFRankingData(domain: str, term_freq: dict[str, int], doc_freq: dict[str, int], num_docs: int)¶

Data of technical terms of a domain for TF-IDF algorithm.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
term_freq (dict[str, int]) – Brute force counting of lemmatized term occurrences in the domain. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
doc_freq (dict[str, int]) – Number of documents in the domain that contain the lemmatized term. Count even if the lemmatized term occurs as a part of a lemmatized phrase.
num_docs (int) – Number of documents in the domain.

domain: str¶

term_freq: dict[str, int]¶

doc_freq: dict[str, int]¶

num_docs: int¶
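The three statistics above are the usual inputs of a TF-IDF score. The sketch below uses the textbook form tf * log(num_docs / doc_freq) for a single domain; the exact weighting used by TFIDFRanker is not given in this reference, so the tfidf helper and its formula are assumptions.

```python
import math


def tfidf(term, term_freq, doc_freq, num_docs):
    """Textbook TF-IDF from per-domain counts; 0.0 for unseen terms."""
    tf = term_freq.get(term, 0)
    df = doc_freq.get(term, 0)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(num_docs / df)


term_freq = {"attention": 6, "result": 6}
doc_freq = {"attention": 1, "result": 4}
rare = tfidf("attention", term_freq, doc_freq, num_docs=4)
common = tfidf("result", term_freq, doc_freq, num_docs=4)
```

With equal term frequencies, the term concentrated in one document scores higher than the term that appears in every document.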

py_pdf_term.stylings package¶

class py_pdf_term.stylings.DomainStylingScoreList(domain: str, pdfs: list[PDFStylingScoreList])¶

Bases: object

Domain name of PDF files and styling scores of technical terms of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
pdfs (list[PDFStylingScoreList]) – Styling scores of each PDF file of the domain.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

domain: str¶

pdfs: list[PDFStylingScoreList]¶

class py_pdf_term.stylings.PDFStylingScoreList(pdf_path: str, pages: list[PageStylingScoreList])¶

Bases: object

Path of a PDF file and styling scores of technical terms of the PDF file.

Parameters:

pdf_path (str) – Path of a PDF file.
pages (list[PageStylingScoreList]) – Styling scores of each page of the PDF file.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

pdf_path: str¶

pages: list[PageStylingScoreList]¶

class py_pdf_term.stylings.PageStylingScoreList(page_num: int, ranking: list[ScoredTerm])¶

Bases: object

Page number and styling scores of technical terms of the page.

Parameters:

page_num (int) – Page number of a PDF file.
ranking (list[ScoredTerm]) – List of pairs of lemmatized term and styling score. The list is sorted by the score in descending order.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

page_num: int¶

ranking: list[ScoredTerm]¶

class py_pdf_term.stylings.StylingScorer(styling_score_clses: list[type[BaseStylingScore]] | None = None)¶

Bases: object

Scorer for styling scores. The styling scores are combined by multiplication of each score.

Parameters:

styling_score_clses – Styling scorers to be combined. If None, the default scorers are used. The default scorers are FontsizeScore and ColorScore.
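Combination by multiplication can be sketched as below. ConstantScore and combined_styling_score are hypothetical stand-ins that only mimic the calculate_score interface documented for styling scores; they are not part of the library.

```python
from math import prod


class ConstantScore:
    """Hypothetical stand-in for a styling score class."""

    def __init__(self, value):
        self.value = value

    def calculate_score(self, candidate):
        return self.value


def combined_styling_score(candidate, scores):
    """Multiply the per-feature scores of one candidate term."""
    return prod(score.calculate_score(candidate) for score in scores)


combined = combined_styling_score("language model", [ConstantScore(2.0), ConstantScore(0.5)])
```

Because the scores are multiplied, a single feature that scores close to zero suppresses the combined score regardless of the other features.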

score_domain_candidates(domain_candidates: DomainCandidateTermList) → DomainStylingScoreList¶

Calculate styling scores for each candidate term in a domain.

Parameters:

domain_candidates – List of candidate terms in a domain. The target of analysis.

Returns:

List of styling scores for each candidate term in a domain. The scores are sorted in descending order.

Return type:

DomainStylingScoreList

score_pdf_candidates(pdf_candidates: PDFCandidateTermList) → PDFStylingScoreList¶

Calculate styling scores for each candidate term in a PDF file.

Parameters:

pdf_candidates – List of candidate terms in a PDF file. The target of analysis.

Returns:

List of styling scores for each candidate term in a PDF file. The scores are sorted in descending order.

Return type:

PDFStylingScoreList

py_pdf_term.stylings.scores subpackage¶

class py_pdf_term.stylings.scores.BaseStylingScore(page_candidates: PageCandidateTermList)¶

Bases: object

Base class for styling scores. A styling score is expected to focus on a single styling feature, such as font size, font family, or font color. The score is calculated per page of a PDF file, not per domain of PDF files.

Parameters:

page_candidates – List of candidate terms in a page of a PDF file. The target of analysis.

abstractmethod calculate_score(candidate: Term) → float¶

Calculate the styling score of a candidate term.

Parameters:

candidate – Candidate term to calculate the styling score. This term is expected to be included in the list of candidate terms passed to the constructor.

Returns:

The styling score of the candidate term.

Return type:

float

class py_pdf_term.stylings.scores.ColorScore(page_candidates: PageCandidateTermList)¶

Bases: BaseStylingScore

Styling score for font color. The more rarely the color appears in the page, the higher the score is.

Parameters:

page_candidates – List of candidate terms in a page of a PDF file. The target of analysis.

calculate_score(candidate: Term) → float¶

Calculate the styling score of a candidate term.

Parameters:

candidate – Candidate term to calculate the styling score. This term is expected to be included in the list of candidate terms passed to the constructor.

Returns:

The styling score of the candidate term.

Return type:

float
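One way to picture "the more rarely the color appears, the higher the score" is a score of one minus the color's share of the page. This is only an illustrative formula, not the one ColorScore uses; color_rarity_scores and the color strings are assumptions.

```python
from collections import Counter


def color_rarity_scores(colors):
    """colors: one font color per candidate term on the page."""
    counts = Counter(colors)
    total = len(colors)
    # Share of the page drawn in each color, inverted so rare colors score high.
    return {color: 1.0 - count / total for color, count in counts.items()}


scores = color_rarity_scores(["black", "black", "black", "red"])
```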

class py_pdf_term.stylings.scores.FontsizeScore(page_candidates: PageCandidateTermList)¶

Bases: BaseStylingScore

Styling score for font size. The larger the font size is, the higher the score is. The score is normalized by the mean and the standard deviation of font sizes in the page.

Parameters:

page_candidates – List of candidate terms in a page of a PDF file. The target of analysis.

calculate_score(candidate: Term) → float¶

Calculate the styling score of a candidate term.

Parameters:

candidate – Candidate term to calculate the styling score. This term is expected to be included in the list of candidate terms passed to the constructor.

Returns:

The styling score of the candidate term.

Return type:

float
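Normalization by the page mean and standard deviation can be sketched with a plain z-score. The reference does not say which function of the normalized value FontsizeScore returns, so fontsize_z_scores is a hypothetical helper that shows the normalization step only.

```python
from statistics import mean, pstdev


def fontsize_z_scores(sizes):
    """sizes: font size of each candidate term on one page."""
    if not sizes:
        return []
    mu = mean(sizes)
    sigma = pstdev(sizes)
    if sigma == 0:
        # All terms share one font size, so none stands out.
        return [0.0 for _ in sizes]
    return [(size - mu) / sigma for size in sizes]


z = fontsize_z_scores([10, 10, 10, 18])
```

Normalizing per page means that an 18 pt term stands out on a page of 10 pt text but not on a page set entirely in 18 pt.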

py_pdf_term.techterms package¶

class py_pdf_term.techterms.DomainTechnicalTermList(domain: str, pdfs: list[PDFTechnicalTermList])¶

Bases: object

Domain name of PDF files and technical terms of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
pdfs (list[PDFTechnicalTermList]) – Technical terms of each PDF file of the domain.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

domain: str¶

pdfs: list[PDFTechnicalTermList]¶

class py_pdf_term.techterms.PDFTechnicalTermList(pdf_path: str, pages: list[PageTechnicalTermList])¶

Bases: object

Path of a PDF file and technical terms of the PDF file.

Parameters:

pdf_path (str) – Path of a PDF file.
pages (list[PageTechnicalTermList]) – Technical terms of each page of the PDF file.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

pdf_path: str¶

pages: list[PageTechnicalTermList]¶

class py_pdf_term.techterms.PageTechnicalTermList(page_num: int, terms: list[ScoredTerm])¶

Bases: object

Page number and technical terms of the page.

Parameters:

page_num (int) – Page number of a PDF file.
terms (list[ScoredTerm]) – Technical terms of the page.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

page_num: int¶

terms: list[ScoredTerm]¶
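The domain, PDF and page classes above nest inside one another, and each exposes to_dict and from_dict. The sketch below shows how such a nested round trip through JSON could look. Page and PDF are hypothetical stand-ins, and the dictionary keys and the (term, score) pair layout are assumptions rather than the library's actual serialization format.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Page:  # stand-in for a page-level list
    page_num: int
    terms: list[tuple[str, float]]

    def to_dict(self) -> dict[str, Any]:
        return {"page_num": self.page_num, "terms": [list(t) for t in self.terms]}

    @classmethod
    def from_dict(cls, obj: dict[str, Any]) -> "Page":
        return cls(obj["page_num"], [(term, score) for term, score in obj["terms"]])


@dataclass
class PDF:  # stand-in for a PDF-level list
    pdf_path: str
    pages: list[Page]

    def to_dict(self) -> dict[str, Any]:
        # Delegate to the child objects so the result is plain JSON data.
        return {"pdf_path": self.pdf_path, "pages": [p.to_dict() for p in self.pages]}

    @classmethod
    def from_dict(cls, obj: dict[str, Any]) -> "PDF":
        return cls(obj["pdf_path"], [Page.from_dict(p) for p in obj["pages"]])


original = PDF("slides.pdf", [Page(1, [("language model", 0.9)])])
restored = PDF.from_dict(json.loads(json.dumps(original.to_dict())))
```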

class py_pdf_term.techterms.TechnicalTermExtractor(max_num_terms: int = 10, acceptance_rate: float = 0.75)¶

Bases: object

Technical term extractor based on ranking method scores and styling scores.

Parameters:

max_num_terms – Maximum number of terms in a page of a PDF file to be extracted. The N-best candidates are extracted as technical terms. The default value is 10.
acceptance_rate – Acceptance rate of the ranking method scores. The candidates whose ranking method scores are lower than the acceptance rate are filtered out even if they are in the N-best candidates. The default value is 0.75.
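The interplay of the two parameters can be sketched as follows, under stated assumptions: acceptance_rate is read as the fraction of the domain-wide ranking that remains eligible, and the N-best choice uses the product of method score and styling score. Neither reading is confirmed by this reference, and select_terms is a hypothetical helper.

```python
import math


def select_terms(page_terms, method_scores, styling_scores,
                 max_num_terms=10, acceptance_rate=0.75):
    """page_terms: candidate terms of one page, in appearance order."""
    # Assumed reading: only the top acceptance_rate share of the ranking is eligible.
    by_method = sorted(method_scores, key=method_scores.get, reverse=True)
    accepted = set(by_method[: math.ceil(len(by_method) * acceptance_rate)])
    candidates = [t for t in dict.fromkeys(page_terms) if t in accepted]

    # Assumed combination: method score times styling score.
    def combined(term):
        return method_scores[term] * styling_scores.get(term, 1.0)

    best = set(sorted(candidates, key=combined, reverse=True)[:max_num_terms])
    # Keep appearance order, not score order.
    return [t for t in candidates if t in best]


selected = select_terms(
    ["d", "b", "c", "a"],
    method_scores={"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0},
    styling_scores={},
    max_num_terms=2,
)
```

In this example "d" is rejected by the acceptance filter, "c" falls outside the 2-best, and the remaining terms keep their order of appearance.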

extract_from_domain(domain_candidates: DomainCandidateTermList, term_ranking: MethodTermRanking, domain_styling_scores: DomainStylingScoreList) → DomainTechnicalTermList¶

Extract technical terms in PDF files in a domain. The terms are sorted in appearance order, not in score order.

Parameters:

domain_candidates – List of candidate terms in a domain. The target of extraction.
term_ranking – Ranking method scores for each candidate term in a domain.
domain_styling_scores – Styling scores for each candidate term in a domain.

Returns:

List of technical terms in PDF files in a domain. The terms are sorted in appearance order, not in score order.

Return type:

DomainTechnicalTermList

extract_from_pdf(pdf_candidates: PDFCandidateTermList, term_ranking: MethodTermRanking, pdf_styling_scores: PDFStylingScoreList) → PDFTechnicalTermList¶

Extract technical terms in a PDF file. The terms are sorted in appearance order, not in score order.

Parameters:

pdf_candidates – List of candidate terms in a PDF file. The target of extraction.
term_ranking – Ranking method scores for each candidate term in a domain.
pdf_styling_scores – Styling scores for each candidate term in a PDF file.

Returns:

List of technical terms in a PDF file. The terms are sorted in appearance order, not in score order.

Return type:

PDFTechnicalTermList

py_pdf_term.endtoend package¶

class py_pdf_term.endtoend.DomainPDFList(domain: str, pdf_paths: list[str])¶

Bases: object

Domain name and PDF file paths of the domain.

Parameters:

domain (str) – Domain name. (e.g., “natural language processing”)
pdf_paths (list[str]) – PDF file paths of the domain.

classmethod validate(domain_pdfs: DomainPDFList) → None¶

domain: str¶

pdf_paths: list[str]¶

class py_pdf_term.endtoend.PDFTechnicalTermList(pdf_path: str, pages: list[PageTechnicalTermList])¶

Bases: object

Path of a PDF file and technical terms of the PDF file.

Parameters:

pdf_path (str) – Path of a PDF file.
pages (list[PageTechnicalTermList]) – Technical terms of each page of the PDF file.

classmethod from_dict(obj: dict[str, Any]) → Self¶

to_dict() → dict[str, Any]¶

pdf_path: str¶

pages: list[PageTechnicalTermList]¶

class py_pdf_term.endtoend.PyPDFTermMultiDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: MultiDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: MultiDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶

Bases: object

Top level class of py-pdf-term. This class extracts technical terms from a PDF file with cross-domain information.

Parameters:

xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.open_bin to a function to open an input PDF file in binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split too long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, sub-terms of it could also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling such as color, fontsize and so on. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The xml cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string where cache files are to be stored. For example, a path to a local directory, a URL, or a bucket name of a cloud storage service.

extract(domain: str, pdf_path: str, multi_domain_pdfs: list[DomainPDFList]) → PDFTechnicalTermList¶

Extract technical terms from a PDF file.

Parameters:

domain – Domain name which the input PDF file belongs to. This may be, for example, the name of a course or of a technical field.
pdf_path – Path-like string to the input PDF file from which terminologies are to be extracted. The file MUST belong to domain.
multi_domain_pdfs – List of path-like strings to the PDF files, classified by domain. There MUST be an element in multi_domain_pdfs whose domain equals domain.

Returns:

Terminology list per page from the input PDF file.

Return type:

PDFTechnicalTermList
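The two MUST conditions on the arguments of extract can be checked before calling it. The sketch below is a hedged illustration: DomainPDFs is a hypothetical stand-in for DomainPDFList with the same two fields, check_extract_args is not a library function, and "belongs to domain" is read as "is listed in the pdf_paths of the matching element".

```python
from dataclasses import dataclass


@dataclass
class DomainPDFs:  # stand-in mirroring the domain and pdf_paths fields
    domain: str
    pdf_paths: list[str]


def check_extract_args(domain, pdf_path, multi_domain_pdfs):
    """Raise ValueError if the documented preconditions do not hold."""
    matching = [d for d in multi_domain_pdfs if d.domain == domain]
    if not matching:
        raise ValueError(f"no element of multi_domain_pdfs has domain {domain!r}")
    if all(pdf_path not in d.pdf_paths for d in matching):
        raise ValueError(f"{pdf_path!r} is not listed under domain {domain!r}")


corpus = [
    DomainPDFs("nlp", ["nlp/lecture1.pdf", "nlp/lecture2.pdf"]),
    DomainPDFs("databases", ["db/lecture1.pdf"]),
]
check_extract_args("nlp", "nlp/lecture1.pdf", corpus)  # passes silently
```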

class py_pdf_term.endtoend.PyPDFTermSingleDomainExtractor(xml_config: XMLLayerConfig | None = None, candidate_config: CandidateLayerConfig | None = None, method_config: SingleDomainMethodLayerConfig | None = None, styling_config: StylingLayerConfig | None = None, techterm_config: TechnicalTermLayerConfig | None = None, bin_opener_mapper: BinaryOpenerMapper | None = None, lang_tokenizer_mapper: LanguageTokenizerMapper | None = None, token_classifier_mapper: TokenClassifierMapper | None = None, token_filter_mapper: CandidateTokenFilterMapper | None = None, term_filter_mapper: CandidateTermFilterMapper | None = None, splitter_mapper: SplitterMapper | None = None, augmenter_mapper: AugmenterMapper | None = None, method_mapper: SingleDomainRankingMethodMapper | None = None, styling_score_mapper: StylingScoreMapper | None = None, xml_cache_mapper: XMLLayerCacheMapper | None = None, candidate_cache_mapper: CandidateLayerCacheMapper | None = None, method_ranking_cache_mapper: MethodLayerRankingCacheMapper | None = None, method_data_cache_mapper: MethodLayerDataCacheMapper | None = None, styling_cache_mapper: StylingLayerCacheMapper | None = None, cache_dir: str = DEFAULT_CACHE_DIR)¶

Bases: object

Top level class of py-pdf-term. This class extracts technical terms from a PDF file without cross-domain information.

Parameters:

xml_config – Config of XML Layer.
candidate_config – Config of Candidate Term Layer.
method_config – Config of Method Layer.
styling_config – Config of Styling Layer.
techterm_config – Config of Technical Term Layer.
bin_opener_mapper – Mapper from xml_config.open_bin to a function to open an input PDF file in binary mode. This is used in XML Layer.
lang_tokenizer_mapper – Mapper from an element in candidate_config.lang_tokenizers to a class to tokenize texts in a specific language with spaCy. This is used in Candidate Term Layer.
token_classifier_mapper – Mapper from an element in candidate_config.token_classifiers to a class to classify tokens into True/False by several functions. This is used in Candidate Term Layer.
token_filter_mapper – Mapper from an element in candidate_config.token_filters to a class to filter tokens which are likely to be parts of candidates. This is used in Candidate Term Layer.
term_filter_mapper – Mapper from an element in candidate_config.term_filters to a class to filter terms which are likely to be candidates. This is used in Candidate Term Layer.
splitter_mapper – Mapper from an element in candidate_config.splitters to a class to split too long terms or wrongly concatenated terms. This is used in Candidate Term Layer.
augmenter_mapper – Mapper from an element in candidate_config.augmenters to a class to augment candidates. Augmentation means that if a long candidate is found, sub-terms of it could also be candidates. This is used in Candidate Term Layer.
method_mapper – Mapper from method_config.method to a class to calculate method scores of candidate terms. This is used in Method Layer.
styling_score_mapper – Mapper from an element in styling_config.styling_scores to a class to calculate scores of candidate terms based on their styling such as color, fontsize and so on. This is used in Styling Layer.
xml_cache_mapper – Mapper from xml_config.cache to a class to provide XML Layer with the cache mechanism. The xml cache manages XML files converted from input PDF files.
candidate_cache_mapper – Mapper from candidate_config.cache to a class to provide Candidate Term Layer with the cache mechanism. The candidate cache manages lists of candidate terms.
method_ranking_cache_mapper – Mapper from method_config.ranking_cache to a class to provide Method Layer with the cache mechanism. The method ranking cache manages candidate terms ordered by the method scores.
method_data_cache_mapper – Mapper from method_config.data_cache to a class to provide Method Layer with the cache mechanism. The method data cache manages analysis data of the candidate terms such as frequency or likelihood.
styling_cache_mapper – Mapper from styling_config.cache to a class to provide Styling Layer with the cache mechanism. The styling cache manages candidate terms ordered by the styling scores.
cache_dir – Path-like string where cache files are to be stored. For example, a path to a local directory, a URL, or a bucket name of a cloud storage service.

extract(pdf_path: str, domain_pdfs: DomainPDFList) → PDFTechnicalTermList¶

Extract technical terms from a PDF file.

Parameters:

pdf_path – Path-like string to the input PDF file from which terminologies are to be extracted. The file MUST belong to the domain of domain_pdfs.
domain_pdfs – List of path-like strings to the PDF files that belong to a specific domain.

Returns:

Terminology list per page from the input PDF file.

Return type:

PDFTechnicalTermList