API Documentation¶

python Mlst Local Search Tool

Whole Genome MLST¶

Core classes and functions to work with Whole Genome MLST data.

pymlst.wg.core.open_wg(file=None, ref='ref')[source]¶

A context manager function that wraps the creation of a WholeGenomeMLST object.

Context managers allow you to instantiate objects using the with keyword, eliminating the need to manage exceptions and commit/close processes yourself.

Parameters

file – The path to the database file to work with.
ref – The name that will be given to the reference strain in the database.

Yields

A WholeGenomeMLST object.
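
The context manager pattern described above can be sketched with contextlib. This is a minimal illustration, not the library's implementation: FakeDatabase and open_db are hypothetical stand-ins, and rolling back on error is an assumption.

```python
from contextlib import contextmanager


class FakeDatabase:
    """Hypothetical stand-in for a database wrapper; not part of PyMLST."""

    def __init__(self, path):
        self.path = path
        self.events = []

    def commit(self):
        self.events.append("commit")

    def rollback(self):
        self.events.append("rollback")

    def close(self):
        self.events.append("close")


@contextmanager
def open_db(path):
    """Yield a database, commit on success, and always close it."""
    db = FakeDatabase(path)
    try:
        yield db
    except Exception:
        # Assumption: pending changes are discarded when the block fails.
        db.rollback()
        raise
    else:
        db.commit()
    finally:
        db.close()


with open_db("database.db") as db:
    db.events.append("work")

print(db.events)  # ['work', 'commit', 'close']
```

The caller only writes the body of the with block; commit and close happen on exit.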

class pymlst.wg.core.DuplicationHandling(value)[source]¶: An enumeration.

class pymlst.wg.core.DatabaseWG(file, ref)[source]¶

A core level class to manipulate the genomic database.

Warning

Shouldn’t be instantiated directly, see WholeGenomeMLST instead.

__init__(file, ref)[source]¶

Parameters: file – The path to the database file to work with.

add_infos(repository, species, version)[source]¶: Adds information about the cgMLST schema used in this database.

remove_gene(gene)[source]¶: Removes a specific gene and its sequences.

remove_strain(strain)[source]¶: Removes a specific strain.

contains_souche(souche)[source]¶: Returns whether the strain exists in the database.

get_infos()[source]¶: Returns the information values stored in the database.

get_gene_sequences(gene)[source]¶: Return all the sequences for a specific gene and lists the strains that are referencing them.

get_duplicated_genes()[source]¶: Return the genes that are duplicated.

get_all_strains()[source]¶: Return all distinct strains.

get_core_genes()[source]¶: Return all distinct genes.

count_sequences_per_gene()[source]¶: Return the number of distinct sequences per gene.

count_souches_per_gene()[source]¶: Return the number of distinct strains per gene.

count_genes_per_souche(valid_shema)[source]¶

Return the number of distinct genes per strain.

The counted genes are restricted to the ones given in the valid_schema.
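
The restriction to a schema can be sketched as follows. This is an illustration only: the (strain, gene, seqid) row layout and the function name are assumptions, not the library's storage format.

```python
def count_genes_per_strain(rows, valid_schema):
    """Count distinct genes per strain, keeping only genes in valid_schema.

    rows is an iterable of (strain, gene, seqid) tuples (assumed layout).
    """
    allowed = set(valid_schema)
    genes_by_strain = {}
    for strain, gene, _seqid in rows:
        if gene in allowed:
            genes_by_strain.setdefault(strain, set()).add(gene)
    return {strain: len(genes) for strain, genes in genes_by_strain.items()}


rows = [
    ("s1", "g1", 1), ("s1", "g2", 5), ("s1", "g3", 9),
    ("s2", "g1", 1), ("s2", "g3", 9),
]
counts = count_genes_per_strain(rows, ["g1", "g2"])
print(counts)  # {'s1': 2, 's2': 1}
```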

count_sequences()[source]¶: Return the number of distinct sequences.

get_strains_distances(valid_schema)[source]¶

Return the distances between strains.

For all the possible pairs of strains, counts how many of their genes are different (different seqids so different sequences). The compared genes are restricted to the ones given in the valid_schema.
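
The pairwise comparison described above can be sketched as follows. The strain-to-profile mapping is an assumed layout, and skipping genes that are missing from either strain is a choice of this sketch, not a documented behaviour.

```python
from itertools import combinations


def strain_distances(profiles, valid_schema):
    """Count, for each pair of strains, the genes whose sequence IDs differ.

    profiles maps strain name -> {gene: seqid} (assumed layout).
    Genes missing from either strain are skipped in this sketch.
    """
    distances = {}
    for a, b in combinations(sorted(profiles), 2):
        diff = 0
        for gene in valid_schema:
            seq_a = profiles[a].get(gene)
            seq_b = profiles[b].get(gene)
            if seq_a is not None and seq_b is not None and seq_a != seq_b:
                diff += 1
        distances[(a, b)] = diff
    return distances


profiles = {
    "s1": {"g1": 1, "g2": 5, "g3": 9},
    "s2": {"g1": 1, "g2": 6, "g3": 9},
    "s3": {"g1": 2, "g2": 6},
}
distances = strain_distances(profiles, ["g1", "g2", "g3"])
print(distances)
# {('s1', 's2'): 1, ('s1', 's3'): 2, ('s2', 's3'): 1}
```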

get_mlst(valid_schema)[source]¶

Return the MLST of the genes.

Returns a dictionary that associates each gene with all the strains that reference it and their sequence IDs. The genes returned are restricted to those given in the valid_schema.

class pymlst.wg.core.WholeGenomeMLST(file, ref)[source]¶

Whole Genome MLST python representation.

Example of usage:

from pymlst.wg.core import open_wg

with open_wg('database.db') as db:
    db.create(open('genome.fasta'))
    db.add_strain(open('strain_1.fasta'))
    db.add_strain(open('strain_2.fasta'))

__init__(file, ref)[source]¶

Parameters

file – The path to the database file to work with.
ref – The name that will be given to the reference strain in the database.

create(coregene, concatenate=False, remove=False)[source]¶

Creates a whole genome MLST database from a core genome fasta file.

Parameters

coregene – The fasta Path containing the reference core genome.
concatenate – Whether we should concatenate genes with identical sequences.
remove – Whether we should remove genes with identical sequences.

For instance, if concatenate is set to True, 2 genes g1 and g2 having the same sequence will be stored as a single gene named g1;g2.
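
The effect of the two flags can be sketched on a plain dictionary. This is a hedged illustration: merge_identical is a hypothetical helper, and leaving duplicates untouched when neither flag is set is an assumption.

```python
def merge_identical(genes, concatenate=False, remove=False):
    """Handle genes that share the same sequence.

    genes maps gene name -> sequence. With concatenate, duplicates are stored
    once under a joined name such as "g1;g2"; with remove, they are dropped.
    Assumption: with neither flag set, every gene is kept unchanged.
    """
    by_sequence = {}
    for name, seq in genes.items():
        by_sequence.setdefault(seq, []).append(name)

    result = {}
    for seq, names in by_sequence.items():
        if len(names) == 1:
            result[names[0]] = seq
        elif concatenate:
            result[";".join(names)] = seq
        elif remove:
            continue
        else:
            for name in names:
                result[name] = seq
    return result


genes = {"g1": "ATGAAA", "g2": "ATGAAA", "g3": "ATGCCC"}
print(merge_identical(genes, concatenate=True))
# {'g1;g2': 'ATGAAA', 'g3': 'ATGCCC'}
print(merge_identical(genes, remove=True))
# {'g3': 'ATGCCC'}
```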

add_infos(repository, species, version)[source]¶

Adds information about the cgMLST schema stored in the database.

Parameters

repository – Source of the cgMLST data
species – Name of the species
version – Version of the database

get_infos(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶: Gets information about the cgMLST schema stored in the database.

add_strain(genome, strain=None, identity=0.95, coverage=0.9)[source]¶

Adds a genome strain to the database.

How it works:

A BLAT search is performed on each contig of the strain to find sub-sequences matching the core genes.
The identified sub-sequences are extracted and added to the database, where each is associated with a sequence ID.
An MLST entry is created, referencing the sequence, the gene it belongs to, and the strain it was found in.

Parameters

genome – The strain genome we want to add as a fasta Path.
strain – The name that will be given to the new strain in the database.
identity – Sets the minimum identity used by BLAT for the sequence search (as a fraction between 0 and 1).
coverage – Sets the minimum accepted coverage for found sequences.
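
How the two thresholds filter candidate matches can be sketched as below. The Hit class, its fields, and the way identity and coverage are computed are all assumptions made for illustration; they do not describe BLAT output or the library's internal logic.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical alignment hit; field names are illustrative only."""
    gene: str
    gene_length: int
    aligned_length: int
    matches: int

    @property
    def identity(self):
        # Assumed definition: matching bases over aligned bases.
        return self.matches / self.aligned_length

    @property
    def coverage(self):
        # Assumed definition: aligned bases over the gene length.
        return self.aligned_length / self.gene_length


def keep_hits(hits, identity=0.95, coverage=0.9):
    """Keep hits that reach both the identity and coverage thresholds."""
    return [h for h in hits if h.identity >= identity and h.coverage >= coverage]


hits = [
    Hit("g1", 1000, 1000, 990),  # identity 0.99, coverage 1.0
    Hit("g2", 1000, 800, 800),   # identity 1.0, coverage 0.8
    Hit("g3", 1000, 950, 855),   # identity 0.9, coverage 0.95
]
kept = [h.gene for h in keep_hits(hits)]
print(kept)  # ['g1']
```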

add_reads(fastqs, strain=None, identity=0.95, coverage=0.9, reads=10)[source]¶

Adds raw reads of a strain to the database.

How it works:

A KMA search is performed on the reads (fastq) of the strain to find sub-sequences matching the core genes.
The identified sub-sequences are extracted and added to the database, where each is associated with a sequence ID.
An MLST entry is created, referencing the sequence, the gene it belongs to, and the strain it was found in.

Parameters

fastqs – The reads we want to add, as a list of fastq files.
strain – The name that will be given to the new strain in the database.
identity – Sets the minimum identity used by KMA for the sequence search (as a fraction between 0 and 1).
coverage – Sets the minimum accepted coverage for found sequences.
reads – Sets the minimum read coverage required to keep a result.

remove_gene(genes, file=None)[source]¶

Removes genes from the database.

Parameters

genes – Names of the genes to remove.
file – A file containing a gene name per line.

remove_strain(strains, file=None)[source]¶

Removes entire strains from the database.

Parameters

strains – Names of the strains to remove.
file – A file containing a strain name per line.

extract(extractor, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Takes an extractor object and writes the extraction result on the given output.

Parameters

extractor – An Extractor object describing the way data should be extracted.
output – The output that will receive extracted data.

class pymlst.wg.core.Extractor[source]¶

A simple interface to ease the process of creating new extractors.

abstract extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.
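
A custom extractor built on this interface might look like the sketch below. To keep it runnable without PyMLST, it defines a local stand-in for the Extractor base class; StrainListExtractor and FakeBase are hypothetical, and the assumption that get_all_strains() returns plain strain names is unverified.

```python
import io
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Local stand-in mirroring the documented interface."""

    @abstractmethod
    def extract(self, base, output):
        """Write data taken from base to output."""


class StrainListExtractor(Extractor):
    """Hypothetical extractor that writes one strain name per line."""

    def extract(self, base, output):
        for strain in sorted(base.get_all_strains()):
            output.write(strain + "\n")


class FakeBase:
    """Hypothetical database stub exposing get_all_strains()."""

    def get_all_strains(self):
        return ["strain_2", "strain_1"]


buffer = io.StringIO()
StrainListExtractor().extract(FakeBase(), buffer)
print(buffer.getvalue())
```

Because the output is any writable text handle, the same extractor can write to a file, to sys.stdout, or to an in-memory buffer as here.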

pymlst.wg.core.find_recombination(genes, alignment, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Counts the number of versions of each gene.

Parameters

genes – List of genes (output of TableExtractor using export='gene').
alignment – fasta file alignment (output of SequenceExtractor using align=True).
output – The output where to write the results.

pymlst.wg.core.find_subgraph(distance, threshold=50, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, export='list')[source]¶

Searches groups of strains separated by a distance threshold.

Parameters

threshold – Minimum distance to maintain for groups extraction.
distance – Distance matrix file (output of TableExtractor with export='distance').
output – The output where to write the results.
export – Sets the export type.
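
One way to read the grouping is as connected components of a graph in which two strains are linked when their distance does not exceed the threshold. The sketch below follows that reading with a small union-find; it is an interpretation under stated assumptions, not the library's algorithm, and group_strains is a hypothetical name.

```python
def group_strains(strains, distances, threshold=50):
    """Group strains linked by a pairwise distance at or below threshold.

    distances maps (strain_a, strain_b) -> distance. Treating the threshold
    as an inclusive upper bound is an assumption of this sketch.
    """
    parent = {s: s for s in strains}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in strains:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


strains = ["s1", "s2", "s3", "s4"]
pair_distances = {
    ("s1", "s2"): 10, ("s2", "s3"): 40, ("s1", "s3"): 60,
    ("s1", "s4"): 210, ("s2", "s4"): 190, ("s3", "s4"): 200,
}
groups = group_strains(strains, pair_distances)
print(groups)  # [['s1', 's2', 's3'], ['s4']]
```

Note that s1 and s3 end up in the same group even though their direct distance is above the threshold, because both are close to s2.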

Set of methods to extract different types of results from wgMLST

class pymlst.wg.extractors.SequenceExtractor(file=None, reference=False)[source]¶

Extracts core gene sequences into a fasta file.

__init__(file=None, reference=False)[source]¶

Parameters: file – Path of the file containing the core genes to extract

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

class pymlst.wg.extractors.MsaExtractor(file=None, realign=False)[source]¶

Computes a multiple sequence alignment (MSA) and extracts the aligned sequences.

__init__(file=None, realign=False)[source]¶

Parameters

file – Path of the file containing the core genes to extract
realign – Realign genes with same length

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

add_sequence_strain(seqs, strains, sequences)[source]¶: Adds a sequence to the multiple alignment, taking the first gene in case of repetition.

class pymlst.wg.extractors.TableExtractor(mincover=0, keep=False, duplicate=False, inverse=False)[source]¶

Extraction of cgMLST distance matrix, MLST profiles, Genes and Strains list from a wgMLST database.

__init__(mincover=0, keep=False, duplicate=False, inverse=False)[source]¶: Initialize self. See help(type(self)) for accurate signature.

abstract extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

class pymlst.wg.extractors.TableExtractorCommand(*args, **kwargs)[source]¶

Options supported by TableExtractor.

__init__(*args, **kwargs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class pymlst.wg.extractors.GeneExtractor(**kwargs)[source]¶

Extracts a list of genes from a wgMLST database.

__init__(**kwargs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

class pymlst.wg.extractors.StatsExtractor[source]¶

Extracts stats, number of strains, coregenes and sequences from a wgMLST database.

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

class pymlst.wg.extractors.StrainExtractor(count=False, **kwargs)[source]¶

Extracts a list of strains from a wgMLST database.

__init__(count=False, **kwargs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

class pymlst.wg.extractors.DistanceExtractor(mincover=0, keep=False, duplicate=False, inverse=False)[source]¶

Extracts a distance matrix from a wgMLST database.

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

class pymlst.wg.extractors.MlstExtractor(form='default', **kwargs)[source]¶

Extracts an MLST table from a wgMLST database.

__init__(form='default', **kwargs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

extract(base, output)[source]¶

Parameters

base – The database to extract data from.
output – The output where to write the extraction results.

Classical MLST¶

Core classes and functions to work with Classical MLST data.

pymlst.cla.core.open_cla(file=None, ref=1)[source]¶

A context manager function that wraps the creation of a ClassicalMLST object.

Context managers allow you to instantiate objects using the with keyword, eliminating the need to manage exceptions and commit/close processes yourself.

Parameters

file – The path to the database file to work with.
ref – The name that will be given to the reference strain in the database.

Yields

A ClassicalMLST object.

class pymlst.cla.core.DatabaseCLA(file, ref)[source]¶

A core level class to manipulate the genomic database.

Warning

Shouldn’t be instantiated directly, see ClassicalMLST instead.

__init__(file, ref)[source]¶

Parameters: file – The path to the database file to work with.

add_infos(repository, species, mlst, version)[source]¶: Adds information about the MLST schema used in this database.

add_sequence(sequence, gene, allele)[source]¶: Adds a new sequence associated to a gene and an allele.

add_mlst(sequence_typing, gene, allele)[source]¶: Adds a new sequence typing, associated to a gene and an allele.

get_infos()[source]¶: Returns the information values stored in the database.

get_genes_by_allele(allele)[source]¶: Returns all the distinct genes in the database and their sequences for a given allele.

get_allele_by_sequence_and_gene(sequence, gene)[source]¶: Gets an allele by sequence and gene.

get_st()[source]¶: Gets all the STs present in the database.

get_st_by_gene_and_allele(gene, allele)[source]¶: Gets all the STs of a gene/allele pair.

get_sequence_by_gene_and_allele(gene, allele)[source]¶: Gets a sequence by gene and allele.

get_all_sequences_by_gene(gene)[source]¶: Gets all the sequences of a particular gene.

get_all_sequences()[source]¶: Gets all allele sequences.

remove_allele(gene, allele)[source]¶: Removes a particular allele of the gene.

class pymlst.cla.core.ClassicalMLST(file, ref)[source]¶

Classical MLST python representation.

Example of usage:

from pymlst.cla.core import open_cla

with open_cla('database.db') as db:
    db.create(open('profile.txt'), [open('gene1.fasta'),
                                   open('gene2.fasta'),
                                   open('gene3.fasta')])
    db.multi_search(open('genome.fasta'))

__init__(file, ref)[source]¶

Parameters

file – The path to the database file to work with.
ref – The name that will be given to the reference strain in the database.

create(profile, alleles)[source]¶

Creates a classical MLST database from an MLST profile and a list of alleles.

Parameters

profile – The MLST profile
alleles – A list of alleles files.

The MLST profile should be a TXT file respecting the following format:

MLST Profile TXT¶
ST	gene1	gene2	gene3	…
1	1	1	1	…
2	3	3	2	…
3	1	2	1	…
4	1	1	3	…
…	…	…	…	…
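
A profile in this tab-separated layout can be read with the standard csv module. The sketch below is illustrative only: read_profile is a hypothetical helper, not a PyMLST function, and it assumes integer ST and allele numbers.

```python
import csv
import io


def read_profile(handle):
    """Read a tab-separated MLST profile into {ST: {gene: allele}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genes = header[1:]  # the first column is the ST identifier
    profile = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        st = int(row[0])
        alleles = [int(a) for a in row[1:]]
        profile[st] = dict(zip(genes, alleles))
    return profile


text = "ST\tgene1\tgene2\tgene3\n1\t1\t1\t1\n2\t3\t3\t2\n"
profile = read_profile(io.StringIO(text))
print(profile[2])  # {'gene1': 3, 'gene2': 3, 'gene3': 2}
```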

add_infos(repository, species, mlst, version)[source]¶

Adds information about the MLST schema stored in the database.

Parameters

repository – Source of the MLST data
species – Name of the species
mlst – Name of the MLST schema
version – Version of the database

get_infos(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶: Gets information about the MLST schema stored in the database.

search_st(genome, identity=0.9, coverage=0.9, fasta=None)[source]¶

Searches the Sequence Type number of a strain.

Parameters

genome – The strain genome to search, given as a fasta path.
identity – Sets the minimum identity used by BLAT for the sequence search (as a fraction between 0 and 1).
coverage – Sets the minimum accepted gene coverage for found sequences.
fasta – A file to which the gene allele results are exported in fasta format.
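
Once the allele of each gene has been identified, finding the Sequence Type amounts to looking up the allele combination in the profile. The sketch below shows that lookup on the example profile from create(); find_st and the dictionary layout are assumptions for illustration, not the library's code.

```python
def find_st(profile, alleles):
    """Return the STs whose allele numbers all match the observed alleles.

    profile maps ST -> {gene: allele}; alleles maps gene -> observed allele.
    """
    return sorted(st for st, expected in profile.items() if expected == alleles)


profile = {
    1: {"gene1": 1, "gene2": 1, "gene3": 1},
    2: {"gene1": 3, "gene2": 3, "gene3": 2},
    3: {"gene1": 1, "gene2": 2, "gene3": 1},
    4: {"gene1": 1, "gene2": 1, "gene3": 3},
}
observed = {"gene1": 1, "gene2": 2, "gene3": 1}
print(find_st(profile, observed))  # [3]
```

An allele combination that is absent from the profile yields an empty list in this sketch.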

multi_search(genomes, identity=0.9, coverage=0.9, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Searches the Sequence Type number of one or more strains from assembled genomes.

Parameters

genomes – Tuple of one or more strain genomes, given as input paths.
output – An output for the sequence type search results.
identity – Sets the minimum identity used by BLAT for the sequence search (as a fraction between 0 and 1).
fasta – A file to which the gene allele results are exported in fasta format.
coverage – Sets the minimum accepted coverage for found sequences.

search_read(fastqs, identity=0.9, coverage=0.95, reads=10, fasta=None)[source]¶

Searches the Sequence Type from raw reads of one strain.

Parameters

fastqs – List of fastq files containing raw reads.
identity – Sets the minimum identity used by KMA for the sequence search (as a fraction between 0 and 1).
coverage – Sets the minimum accepted gene coverage for found sequences.
reads – Sets the minimum read coverage required to keep a mapping.
fasta – A file to which the gene allele results are exported in fasta format.

multi_read(fastqs, identity=0.9, coverage=0.95, reads=10, paired=True, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Searches the Sequence Type number of one or more strains from raw reads.

Parameters

fastqs – Tuple of one or more strain raw reads given as input.
output – An output for the sequence type search results.
identity – Sets the minimum identity used by KMA for the sequence search (as a fraction between 0 and 1).
reads – Sets the minimum read coverage required to keep a mapping.
paired – Defines whether the raw reads are paired or single.
fasta – A file to which the gene allele results are exported in fasta format.
coverage – Sets the minimum accepted coverage for found sequences.

remove_allele(gene, allele)[source]¶

Removes a problematic allele from the claMLST database.

Parameters

gene – Gene name on the database, ignoring case
allele – Integer allele number of this gene

class pymlst.cla.core.ST_result(genome_name, st_val, alleles)[source]¶

Writes the results of the ST search.

__init__(genome_name, st_val, alleles)[source]¶

Parameters

genome_name – Name of the genome retrieved from the path provided by the user
st_val – ST values identified for each genome by search_st
alleles – Alleles of the strain recovered in the core genome

write(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, header=True)[source]¶: Writes the results to the output file.

Other Typing¶

Core classes and functions to work with alternative typing methods.

pymlst.pytyper.core.open_typer(method)[source]¶

Parameters: method – Defines the typing method to apply. Possible values: fim, spa, or clmt.
Yields: A PyTyper object.

class pymlst.pytyper.core.PyTyper(fi, typing)[source]¶

Base class for all the PyTyper typing methods.

__init__(fi, typing)[source]¶

Parameters

fi – Path to the database file
typing – Typing type

abstract search_genome(genome, identity=0.9, coverage=0.9, fasta=None)[source]¶

Abstract method for searching alleles against a genome.

Parameters

genome – Path to the fasta genome
identity – Minimum identity threshold (0.9)
coverage – Minimum coverage threshold (0.9)
fasta – Path to a file to write alleles in fasta format (None)

abstract create()[source]¶

Initializes the database for a specific typing method: FimH, Spa, or Clermont.

Uses a scheme created automatically that is specific to the typing

abstract multi_search(genomes, identity, coverage, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Performs a batch search on a list of genomes.

Parameters

genomes – List of paths to the fasta genomes
identity – Minimum identity threshold
coverage – Minimum coverage threshold
fasta – Handle to a file to write alleles in fasta format (None)
output – Write result on this output (stdout)

close()[source]¶: Close database

check_input(identity, coverage)[source]¶

Verifies that the input identity and coverage are between 0 and 1.

Parameters

identity – Minimum identity threshold
coverage – Minimum coverage threshold
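
The range check can be sketched as a small standalone function. This is an illustration rather than the method's source: the use of ValueError and the message text are assumptions, and the library may raise its own exception types.

```python
def check_input(identity, coverage):
    """Raise ValueError unless both thresholds lie within [0, 1].

    Assumption: ValueError stands in for whatever the library raises.
    """
    for name, value in (("identity", identity), ("coverage", coverage)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")


check_input(0.9, 0.9)  # passes silently

try:
    check_input(90, 0.9)  # a percentage passed instead of a fraction
except ValueError as err:
    message = str(err)
    print(message)  # identity must be between 0 and 1, got 90
```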

write_fasta_allele(genome, fasta, psl)[source]¶

Exports alleles to the fasta output for a list of psl results.

Parameters

genome – Path of genome fasta file
fasta – handle of fasta output
psl – Blat alignment results {gene: [alignments]}

class pymlst.pytyper.core.FimH(fi)[source]¶

fimH typing for Escherichia coli.

__init__(fi)[source]¶

Parameters

fi – Path to the database file
typing – Typing type

search_genome(genome, identity, coverage, fasta)[source]¶

Abstract method for searching alleles against a genome.

Parameters

genome – Path to the fasta genome
identity – Minimum identity threshold (0.9)
coverage – Minimum coverage threshold (0.9)
fasta – Path to a file to write alleles in fasta format (None)

multi_search(genomes, identity=0.9, coverage=0.9, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Performs a batch search on a list of genomes.

Parameters

genomes – List of paths to the fasta genomes
identity – Minimum identity threshold
coverage – Minimum coverage threshold
fasta – Handle to a file to write alleles in fasta format (None)
output – Write result on this output (stdout)

create()[source]¶

Initializes the database for a specific typing method: FimH, Spa, or Clermont.

Uses a scheme created automatically that is specific to the typing

class pymlst.pytyper.core.Spa(fi)[source]¶

Spa typing for Staphylococcus aureus.

__init__(fi)[source]¶

Parameters

fi – Path to the database file
typing – Typing type

search_genome(genome, identity, coverage, fasta)[source]¶

Abstract method for searching alleles against a genome.

Parameters

genome – Path to the fasta genome
identity – Minimum identity threshold (0.9)
coverage – Minimum coverage threshold (0.9)
fasta – Path to a file to write alleles in fasta format (None)

multi_search(genomes, identity=0.9, coverage=0.9, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Performs a batch search on a list of genomes.

Parameters

genomes – List of paths to the fasta genomes
identity – Minimum identity threshold
coverage – Minimum coverage threshold
fasta – Handle to a file to write alleles in fasta format (None)
output – Write result on this output (stdout)

create()[source]¶

Initializes the database for a specific typing method: FimH, Spa, or Clermont.

Uses a scheme created automatically that is specific to the typing

check_repetition(spans)[source]¶: Checks that the repetitions are successive.

class pymlst.pytyper.core.Clmt(fi)[source]¶

Phylogroup determination using the ClermontTyping method for Escherichia coli.

__init__(fi)[source]¶

Parameters

fi – Path to the database file
typing – Typing type

search_genome(genome, identity, coverage, fasta)[source]¶

Abstract method for searching alleles against a genome.

Parameters

genome – Path to the fasta genome
identity – Minimum identity threshold (0.9)
coverage – Minimum coverage threshold (0.9)
fasta – Path to a file to write alleles in fasta format (None)

multi_search(genomes, identity=0.9, coverage=0.99, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶

Performs a batch search on a list of genomes.

Parameters

genomes – List of paths to the fasta genomes
identity – Minimum identity threshold
coverage – Minimum coverage threshold
fasta – Handle to a file to write alleles in fasta format (None)
output – Write result on this output (stdout)

create()[source]¶

Initializes the database for a specific typing method: FimH, Spa, or Clermont.

Uses a scheme created automatically that is specific to the typing

class pymlst.pytyper.core.TypingResult(genome_name, method)[source]¶

Writes the results of the typing search.

__init__(genome_name, method)[source]¶

Parameters

genome_name – Name of the genome retrieved from the path provided by the user
method – Typing method used for the analysis

write(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, header=True)[source]¶: Writes the results to the output file.

set_allele(allele)[source]¶

Parameters: allele – Allele of the strain recovered in the core genome

set_st(st_val)[source]¶

Parameters: st_val – ST values identified for each genome by search

set_notes(notes)[source]¶

Parameters: notes – Add notes to search results