paperweight.document

Object-oriented access to LaTeX documents.

The paperweight.document module provides object-oriented interfaces for manipulating or mining LaTeX documents. Much of the functionality of the paperweight.texutils, paperweight.gitio and paperweight.nlputils modules can be accessed through this interface.

Which of the two document classes to use depends on how the LaTeX document is stored. Use paperweight.document.FilesystemTexDocument for regular documents on the filesystem. To operate on a document as stored in a particular commit of a checked-out Git repository, use paperweight.document.GitTexDocument. Both classes share a consistent interface because they inherit from paperweight.document.TexDocument.
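As a minimal sketch of that choice, the helper below dispatches on whether a commit is given. The name open_tex_document is hypothetical and not part of paperweight; the sketch assumes paperweight is installed and relies only on the constructor signatures documented in this section. The import is deferred so that the helper can be defined without the package present.

```python
def open_tex_document(path, git_hash=None, repo_dir="."):
    """Return a document object for `path` (hypothetical helper)."""
    # Deferred import: paperweight is only needed when the helper is called.
    from paperweight.document import FilesystemTexDocument, GitTexDocument

    if git_hash is None:
        # Regular '.tex' file on the filesystem.
        return FilesystemTexDocument(path, recursive=True)
    # File as stored at a given commit; `path` is then relative to the
    # root of the repository found at `repo_dir`.
    return GitTexDocument(path, git_hash, repo_dir=repo_dir, recursive=True)
```

Either branch returns an object with the shared TexDocument interface, so calling code does not need to know where the document came from.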

class paperweight.document.FilesystemTexDocument(path, recursive=True)

Bases: paperweight.document.TexDocument

A TeX document derived from a file in the filesystem.

Parameters:

path : unicode

Path to the ‘.tex’ file on the filesystem.

recursive : bool

If True (default), TeX documents input by this root document are also opened.

Attributes

bib_keys List of all bib keys in the document (and input documents).
bib_name Name of the BibTeX bibliography file (e.g., 'mybibliography.bib').
bib_path Absolute file path to the .bib bibliography document.
bibitems List of bibitem strings appearing in the document.
sections List with tuples of section names and positions.

Methods

extract_citation_context([n_words]) Generate a dictionary of all bib keys in the document (and input documents), with rich metadata about the context of each citation in the document.
find_input_documents() Find all tex documents input by this root document.
inline_bbl() Inline a compiled bibliography (.bbl) in place of a bibliography environment.
inline_inputs() Inline all input LaTeX files referenced by this document.
remove_comments([recursive]) Remove latex comments from document (modifies document in place).
write(path) Write the document’s text to a path on the filesystem.

bib_keys

List of all bib keys in the document (and input documents).

bib_name

Name of the BibTeX bibliography file (e.g., 'mybibliography.bib').

bib_path

Absolute file path to the .bib bibliography document.

bibitems

List of bibitem strings appearing in the document.

extract_citation_context(n_words=20)

Generate a dictionary of all bib keys in the document (and input documents), with rich metadata about the context of each citation in the document.

For example, suppose 'Sick:2014' is cited twice within a document. Then the dictionary returned by this method will have a length-2 list under the 'Sick:2014' key. Each item in this list will be a dictionary providing metadata of the context for that citation. Fields of this dictionary are:

  • position: (int) the cumulative word count at which the citation occurs.
  • wordsbefore: (unicode) text occurring before the citation.
  • wordsafter: (unicode) text occurring after the citation.
  • section: (unicode) name of the section in which the citation occurs.

Parameters:

n_words : int

Number of words before and after the citation to extract for context.

Returns:

bib_keys : dict

Dictionary, keyed by BibTeX cite key, where each entry is a list of citation instances. See above for the format of the instance metadata.
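To make the shape of the returned dictionary concrete, here is a small standalone sketch that builds a structure with the same four fields. It is not paperweight's implementation: the function name and regular expression are assumptions, only \section and \cite-style commands are recognised, and a "word" is simply any whitespace-separated token outside those commands.

```python
import re
from collections import defaultdict

# One pattern for both commands, so matches arrive in document order.
COMMAND = re.compile(
    r"\\section\*?\{(?P<section>[^}]*)\}"
    r"|\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{(?P<keys>[^}]*)\}"
)


def citation_context_sketch(text, n_words=20):
    """Map each cite key to a list of context dictionaries (illustrative)."""
    words = []      # plain words seen so far, commands excluded
    citations = []  # (keys, position, section) for each citation command
    section = ""
    cursor = 0
    for match in COMMAND.finditer(text):
        words.extend(text[cursor:match.start()].split())
        cursor = match.end()
        if match.group("section") is not None:
            section = match.group("section")
        else:
            keys = [k.strip() for k in match.group("keys").split(",")]
            citations.append((keys, len(words), section))
    words.extend(text[cursor:].split())

    contexts = defaultdict(list)
    for keys, position, section_name in citations:
        for key in keys:
            contexts[key].append({
                "position": position,
                "wordsbefore": " ".join(words[max(0, position - n_words):position]),
                "wordsafter": " ".join(words[position:position + n_words]),
                "section": section_name,
            })
    return dict(contexts)
```

A key cited twice ends up with a length-2 list, matching the 'Sick:2014' example above. A single command that cites several keys contributes one instance to each of them.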

find_input_documents()

Find all tex documents input by this root document.

Returns:

paths : list

List of file paths for input documents. Paths are relative to the document (i.e., as written in the LaTeX document).
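The idea of returning paths "as written" can be illustrated with a short standalone sketch. The function name and pattern are assumptions rather than paperweight's own code; only \input{...} commands are matched, and commented-out commands are not filtered.

```python
import re

# Matches \input{some/path}; the captured group is the path as written.
INPUT = re.compile(r"\\input\{([^}]+)\}")


def find_input_paths_sketch(text):
    """Return the paths named by \\input commands, in document order."""
    return [match.group(1).strip() for match in INPUT.finditer(text)]
```

The paths are not resolved or normalised here, so a name written without an extension is returned without one.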

inline_bbl()

Inline a compiled bibliography (.bbl) in place of a bibliography environment. The document is modified in place.
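A rough standalone sketch of the substitution follows. It assumes the bibliography is declared with a \bibliography{...} command and that the .bbl contents have already been read into a string; the function name is hypothetical and this is not paperweight's implementation.

```python
import re

# \bibliography{refs} only; \bibliographystyle{...} is left untouched because
# the pattern requires the opening brace directly after "bibliography".
BIBLIOGRAPHY = re.compile(r"\\bibliography\{[^}]*\}")


def inline_bbl_sketch(tex, bbl):
    """Return `tex` with the bibliography command replaced by `bbl`."""
    # A callable replacement keeps backslashes in `bbl` from being
    # interpreted as regex escape sequences.
    return BIBLIOGRAPHY.sub(lambda match: bbl, tex)
```

Unlike inline_bbl(), which modifies the document in place, the sketch returns a new string.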

inline_inputs()

Inline all input LaTeX files referenced by this document. The inlining is accomplished recursively. The document is modified in place.
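Recursive inlining can be sketched without touching the filesystem by passing the input files as a dictionary. Everything here is an assumption made for illustration (the function name, the `sources` mapping, the choice to leave unresolved inputs as they are); it is not how paperweight resolves files.

```python
import re

INPUT = re.compile(r"\\input\{([^}]+)\}")


def inline_inputs_sketch(text, sources):
    """Replace each \\input{name} with the text of `sources[name]`, recursively.

    `sources` maps input names, as written in the document, to their text.
    """
    def replace(match):
        name = match.group(1).strip()
        if name not in sources:
            # Unknown input: keep the original command.
            return match.group(0)
        # Inline the inputs of the input file before substituting it.
        return inline_inputs_sketch(sources[name], sources)

    return INPUT.sub(replace, text)
```

The sketch does not guard against files that input each other in a cycle, which would recurse without end.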

remove_comments(recursive=True)

Remove latex comments from document (modifies document in place).

Parameters:

recursive : bool

Remove comments from all input LaTeX documents (default True).
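For a single string, comment removal can be approximated with one regular expression. This is a simplified sketch, not paperweight's implementation: the function name is hypothetical, escaped percent signs (\%) are preserved, and a percent sign that directly follows a line-break command (\\%) is wrongly treated as escaped.

```python
import re

# An unescaped % and everything after it on the same line.
COMMENT = re.compile(r"(?<!\\)%.*")


def remove_comments_sketch(text):
    """Return `text` with LaTeX comments stripped (newlines are kept)."""
    return COMMENT.sub("", text)
```

The sketch returns a new string and does not descend into input documents, so it corresponds loosely to calling remove_comments(recursive=False).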

sections

List with tuples of section names and positions. Positions of section names are measured by cumulative word count.
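The (name, position) tuples can be pictured with the standalone sketch below. How paperweight tokenises words is not specified here, so the sketch assumes whitespace-separated tokens and top-level \section commands only; the function name is hypothetical.

```python
import re

SECTION = re.compile(r"\\section\*?\{([^}]*)\}")


def sections_sketch(text):
    """Return (section name, cumulative word count) tuples in document order."""
    sections = []
    n_words = 0
    cursor = 0
    for match in SECTION.finditer(text):
        # Count the words between the previous command and this one.
        n_words += len(text[cursor:match.start()].split())
        cursor = match.end()
        sections.append((match.group(1), n_words))
    return sections
```

Under these assumptions, a section's position is the number of words that precede its heading.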

write(path)

Write the document’s text to a path on the filesystem.

class paperweight.document.GitTexDocument(git_path, git_hash, repo_dir='.', recursive=True)

Bases: paperweight.document.TexDocument

A TeX document derived from a file in a Git repository.

Parameters:

git_path : str

Path to the document in the git repository, relative to the root of the repository.

git_hash : str

Any SHA or Git tag that resolves to a commit in the Git repository.

repo_dir : str

Path from the current working directory to the root of the Git repository.
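How the three parameters fit together can be sketched with the git command-line tool, where `git show <commit>:<path>` prints a file as stored at a commit. This only illustrates the roles of git_path, git_hash and repo_dir; both helper names are hypothetical, and nothing here is claimed about how paperweight reads from the repository.

```python
import subprocess


def blob_spec(git_path, git_hash):
    """Build the `<commit>:<path>` argument understood by `git show`."""
    return "{0}:{1}".format(git_hash, git_path)


def read_file_at_commit(git_path, git_hash, repo_dir="."):
    """Return the text of `git_path` as stored at `git_hash` (hypothetical helper)."""
    # Requires the git executable; raises CalledProcessError if the commit
    # or the path cannot be resolved.
    data = subprocess.check_output(
        ["git", "show", blob_spec(git_path, git_hash)], cwd=repo_dir
    )
    return data.decode("utf-8")
```

Because the command runs with repo_dir as its working directory, git_path is interpreted relative to the repository root, as described above.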

Attributes

bib_keys List of all bib keys in the document (and input documents).
bib_name Name of the BibTeX bibliography file (e.g., 'mybibliography.bib').
bib_path Absolute file path to the .bib bibliography document.
bibitems List of bibitem strings appearing in the document.
sections List with tuples of section names and positions.

Methods

extract_citation_context([n_words]) Generate a dictionary of all bib keys in the document (and input documents), with rich metadata about the context of each citation in the document.
find_input_documents() Find all tex documents input by this root document.
remove_comments([recursive]) Remove latex comments from document (modifies document in place).
write(path) Write the document’s text to a path on the filesystem.

bib_keys

List of all bib keys in the document (and input documents).

bib_name

Name of the BibTeX bibliography file (e.g., 'mybibliography.bib').

bib_path

Absolute file path to the .bib bibliography document.

bibitems

List of bibitem strings appearing in the document.

extract_citation_context(n_words=20)

Generate a dictionary of all bib keys in the document (and input documents), with rich metadata about the context of each citation in the document.

For example, suppose 'Sick:2014' is cited twice within a document. Then the dictionary returned by this method will have a length-2 list under the 'Sick:2014' key. Each item in this list will be a dictionary providing metadata of the context for that citation. Fields of this dictionary are:

  • position: (int) the cumulative word count at which the citation occurs.
  • wordsbefore: (unicode) text occurring before the citation.
  • wordsafter: (unicode) text occurring after the citation.
  • section: (unicode) name of the section in which the citation occurs.

Parameters:

n_words : int

Number of words before and after the citation to extract for context.

Returns:

bib_keys : dict

Dictionary, keyed by BibTeX cite key, where each entry is a list of citation instances. See above for the format of the instance metadata.
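Once such a dictionary is in hand, mining it is ordinary dictionary traversal. The sketch below tallies citations per section; the function name is hypothetical and the input is assumed to follow the field layout described above.

```python
from collections import Counter


def citations_per_section(contexts):
    """Count citation instances by the section in which they occur."""
    counts = Counter()
    for instances in contexts.values():
        for instance in instances:
            counts[instance["section"]] += 1
    return dict(counts)
```

The same loop structure works for other summaries, such as collecting the wordsbefore and wordsafter strings for a single cite key.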

find_input_documents()

Find all tex documents input by this root document.

Returns:

paths : list

List of filepaths for input documents. Paths are relative to the document (i.e., as written in the latex document).

remove_comments(recursive=True)

Remove latex comments from document (modifies document in place).

Parameters:

recursive : bool

Remove comments from all input LaTeX documents (default True).

sections

List with tuples of section names and positions. Positions of section names are measured by cumulative word count.

write(path)

Write the document’s text to a path on the filesystem.

class paperweight.document.TexDocument(text)

Bases: object

Base class for a TeX document.

Parameters:

text : unicode

Unicode text of the LaTeX document.

Attributes

text (unicode) Text of the document as a unicode string.

Methods

extract_citation_context([n_words]) Generate a dictionary of all bib keys in the document (and input documents), with rich metadata about the context of each citation in the document.
find_input_documents() Find all tex documents input by this root document.
remove_comments([recursive]) Remove latex comments from document (modifies document in place).
write(path) Write the document’s text to a path on the filesystem.

bib_keys

List of all bib keys in the document (and input documents).

bib_name

Name of the BibTeX bibliography file (e.g., 'mybibliography.bib').

bib_path

Absolute file path to the .bib bibliography document.

bibitems

List of bibitem strings appearing in the document.

extract_citation_context(n_words=20)

Generate a dictionary of all bib keys in the document (and input documents), with rich metadata about the context of each citation in the document.

For example, suppose 'Sick:2014' is cited twice within a document. Then the dictionary returned by this method will have a length-2 list under the 'Sick:2014' key. Each item in this list will be a dictionary providing metadata of the context for that citation. Fields of this dictionary are:

  • position: (int) the cumulative word count at which the citation occurs.
  • wordsbefore: (unicode) text occurring before the citation.
  • wordsafter: (unicode) text occurring after the citation.
  • section: (unicode) name of the section in which the citation occurs.

Parameters:

n_words : int

Number of words before and after the citation to extract for context.

Returns:

bib_keys : dict

Dictionary, keyed by BibTeX cite key, where each entry is a list of citation instances. See above for the format of the instance metadata.

find_input_documents()

Find all tex documents input by this root document.

Returns:

paths : list

List of filepaths for input documents. Paths are relative to the document (i.e., as written in the latex document).

remove_comments(recursive=True)

Remove latex comments from document (modifies document in place).

Parameters:

recursive : bool

Remove comments from all input LaTeX documents (default True).

sections

List with tuples of section names and positions. Positions of section names are measured by cumulative word count.

write(path)

Write the document’s text to a path on the filesystem.