API Documentation

python Mlst Local Search Tool

Whole Genome MLST

Core classes and functions to work with Whole Genome MLST data.

pymlst.wg.core.open_wg(file=None, ref='ref')[source]
A context manager function to wrap the creation of a WholeGenomeMLST object.

Context managers allow you to instantiate objects using the with keyword, eliminating the need to manage exceptions and commit/close processes yourself.

Parameters
  • file – The path to the database file to work with.

  • ref – The name that will be given to the reference strain in the database.

Yields

A WholeGenomeMLST object.
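The commit/close handling described above can be pictured with a minimal sketch. It does not use PyMLST: FakeDatabase and open_fake are hypothetical stand-ins, and the real open_wg may handle errors differently.

```python
from contextlib import contextmanager


class FakeDatabase:
    """Hypothetical stand-in for a database-backed object (not a PyMLST class)."""

    def __init__(self, path):
        self.path = path
        self.events = []

    def commit(self):
        self.events.append("commit")

    def rollback(self):
        self.events.append("rollback")

    def close(self):
        self.events.append("close")


@contextmanager
def open_fake(path):
    """Yield a FakeDatabase, commit on success, roll back on error, always close."""
    db = FakeDatabase(path)
    try:
        yield db
        db.commit()
    except Exception:
        db.rollback()
        raise
    finally:
        db.close()


with open_fake("database.db") as db:
    db.events.append("work")

print(db.events)  # ['work', 'commit', 'close']
```

The caller only writes the body of the with block; committing, rolling back and closing happen in one place.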

class pymlst.wg.core.DuplicationHandling(value)[source]

An enumeration.

class pymlst.wg.core.DatabaseWG(file, ref)[source]

A core level class to manipulate the genomic database.

Warning

Shouldn’t be instantiated directly, see WholeGenomeMLST instead.

__init__(file, ref)[source]
Parameters

file – The path to the database file to work with.

add_infos(repository, species, version)[source]

Adds information about the cgMLST schema used in this database.

remove_gene(gene)[source]

Removes a specific gene and its sequences.

remove_strain(strain)[source]

Removes a specific strain.

contains_souche(souche)[source]

Returns whether the strain exists in the database.

get_infos()[source]

Returns the information values stored in the database.

get_gene_sequences(gene)[source]

Return all the sequences for a specific gene and lists the strains that are referencing them.

get_duplicated_genes()[source]

Return the genes that are duplicated.

get_all_strains()[source]

Return all distinct strains.

get_core_genes()[source]

Return all distinct genes.

count_sequences_per_gene()[source]

Return the number of distinct sequences per gene.

count_souches_per_gene()[source]

Return the number of distinct strains per gene.

count_genes_per_souche(valid_shema)[source]

Return the number of distinct genes per strain.

The counted genes are restricted to the ones given in the valid_schema.

count_sequences()[source]

Return the number of distinct sequences.

get_strains_distances(valid_schema)[source]

Return the distances between strains.

For all the possible pairs of strains, counts how many of their genes are different (different seqids so different sequences). The compared genes are restricted to the ones given in the valid_schema.
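The pairwise comparison described above can be sketched in plain Python. This is an illustration only: strain_distances and the profiles layout (strain name mapped to gene/sequence-id pairs) are hypothetical, and ignoring genes that are missing from one of the two strains is an assumption, not documented behaviour.

```python
from itertools import combinations


def strain_distances(profiles, valid_schema):
    """Count, for every pair of strains, the genes whose sequence ids differ."""
    distances = {}
    for a, b in combinations(sorted(profiles), 2):
        diff = 0
        for gene in valid_schema:
            seq_a = profiles[a].get(gene)
            seq_b = profiles[b].get(gene)
            # Assumption: a gene missing from either strain is not counted.
            if seq_a is not None and seq_b is not None and seq_a != seq_b:
                diff += 1
        distances[(a, b)] = diff
    return distances


# Hypothetical data: strain -> {gene: sequence id}
profiles = {
    "strain_1": {"g1": 1, "g2": 5, "g3": 7},
    "strain_2": {"g1": 1, "g2": 6, "g3": 8},
    "strain_3": {"g1": 2, "g2": 6},
}

distances = strain_distances(profiles, valid_schema=["g1", "g2", "g3"])
for pair, value in distances.items():
    print(pair, value)
```

Restricting valid_schema to fewer genes lowers the counts accordingly, since only the listed genes are compared.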

get_mlst(valid_schema)[source]

Return the MLST genes.

Returns a dictionary that associates each gene with all the strains that reference it and their sequence ids. The genes returned are restricted to those given in the valid_schema.
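One possible shape for such a dictionary is sketched below. The nesting gene -> strain -> sequence id, the mlst_by_gene helper and the input tuples are all assumptions made for illustration; the structure actually returned by get_mlst may differ.

```python
from collections import defaultdict


def mlst_by_gene(entries, valid_schema):
    """Group (gene, strain, seqid) entries by gene, keeping only genes in valid_schema."""
    allowed = set(valid_schema)
    result = defaultdict(dict)
    for gene, strain, seqid in entries:
        if gene in allowed:
            result[gene][strain] = seqid
    return dict(result)


# Hypothetical MLST entries: (gene, strain, sequence id)
entries = [
    ("g1", "strain_1", 1),
    ("g1", "strain_2", 1),
    ("g2", "strain_1", 5),
    ("g2", "strain_2", 6),
    ("g3", "strain_1", 7),
]

mlst = mlst_by_gene(entries, valid_schema=["g1", "g2"])
print(mlst)
```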

class pymlst.wg.core.WholeGenomeMLST(file, ref)[source]

Whole Genome MLST python representation.

Example of usage:

from pymlst.wg.core import open_wg

with open_wg('database.db') as db:
    db.create(open('genome.fasta'))
    db.add_strain(open('strain_1.fasta'))
    db.add_strain(open('strain_2.fasta'))

__init__(file, ref)[source]
Parameters
  • file – The path to the database file to work with.

  • ref – The name that will be given to the reference strain in the database.

create(coregene, concatenate=False, remove=False)[source]

Creates a whole genome MLST database from a core genome fasta file.

Parameters
  • coregene – The fasta Path containing the reference core genome.

  • concatenate – Whether we should concatenate genes with identical sequences.

  • remove – Whether we should remove genes with identical sequences.

For instance, if concatenate is set to True, 2 genes g1 and g2 having the same sequence will be stored as a single gene named g1;g2.
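The concatenate behaviour in the example above can be sketched as follows. concatenate_identical is a hypothetical helper written for this illustration, not the PyMLST implementation, and the remove option is not shown.

```python
def concatenate_identical(genes):
    """Merge genes that share a sequence into one entry named like 'g1;g2'."""
    by_sequence = {}
    for name, sequence in genes.items():
        # Collect every gene name under the sequence it carries.
        by_sequence.setdefault(sequence, []).append(name)
    return {";".join(names): sequence for sequence, names in by_sequence.items()}


# Hypothetical core genome: g1 and g2 have the same sequence.
genes = {"g1": "ATGAAA", "g2": "ATGAAA", "g3": "ATGCCC"}
merged = concatenate_identical(genes)
print(merged)  # {'g1;g2': 'ATGAAA', 'g3': 'ATGCCC'}
```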

add_infos(repository, species, version)[source]

Adds information about the cgMLST schema stored in the database.

Parameters
  • repository – Source of the cgMLST data

  • species – Name of the species

  • version – Version of the database

get_infos(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Gets information about the cgMLST schema stored in the database.

add_strain(genome, strain=None, identity=0.95, coverage=0.9)[source]

Adds a genome strain to the database.

How it works:

  1. A BLAT search is performed on each given contig of the strain to find sub-sequences matching the core genes.

  2. The identified sub-sequences are extracted and added to our database where they are associated with a sequence ID.

  3. An MLST entry is created, referencing the sequence, the gene it belongs to, and the strain it was found in.

Parameters
  • genome – The strain genome we want to add as a fasta Path.

  • strain – The name that will be given to the new strain in the database.

  • identity – Sets the minimum identity used by BLAT for the sequence search (as a fraction between 0 and 1).

  • coverage – Sets the minimum accepted coverage for found sequences.
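How the identity and coverage thresholds could act on search hits is sketched below. BLAT is not called here: the Hit class, keep_hits and the formulas used for identity (matches over aligned length) and coverage (aligned length over gene length) are assumptions for illustration, with thresholds treated as fractions to match the defaults in the signature.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical alignment hit of a core gene on a contig."""
    gene: str
    gene_length: int     # length of the core gene
    aligned_length: int  # number of gene bases covered by the alignment
    matches: int         # number of identical bases in the alignment

    @property
    def identity(self):
        return self.matches / self.aligned_length

    @property
    def coverage(self):
        return self.aligned_length / self.gene_length


def keep_hits(hits, identity=0.95, coverage=0.9):
    """Keep hits that reach both thresholds (fractions between 0 and 1)."""
    return [h for h in hits if h.identity >= identity and h.coverage >= coverage]


hits = [
    Hit("g1", gene_length=1000, aligned_length=1000, matches=990),  # passes both
    Hit("g2", gene_length=1000, aligned_length=950, matches=855),   # identity too low
    Hit("g3", gene_length=1000, aligned_length=800, matches=800),   # coverage too low
]

kept = [h.gene for h in keep_hits(hits)]
print(kept)  # ['g1']
```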

add_reads(fastqs, strain=None, identity=0.95, coverage=0.9, reads=10)[source]

Adds raw reads of a strain to the database.

How it works:

  1. A KMA search is performed on reads (fastq) of the strain to find sub-sequences matching the core genes.

  2. The identified sub-sequences are extracted and added to our database where they are associated with a sequence ID.

  3. An MLST entry is created, referencing the sequence, the gene it belongs to, and the strain it was found in.

Parameters
  • fastqs – The reads we want to add as a list of fastq files.

  • strain – The name that will be given to the new strain in the database.

  • identity – Sets the minimum identity used by KMA for the sequence search (as a fraction between 0 and 1).

  • coverage – Sets the minimum accepted coverage for found sequences.

  • reads – Sets the minimum read coverage required to keep a result.

remove_gene(genes, file=None)[source]

Removes genes from the database.

Parameters
  • genes – Names of the genes to remove.

  • file – A file containing a gene name per line.

remove_strain(strains, file=None)[source]

Removes entire strains from the database.

Parameters
  • strains – Names of the strains to remove.

  • file – A file containing a strain name per line.

extract(extractor, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Takes an extractor object and writes the extraction result on the given output.

Parameters
  • extractor – An Extractor object describing the way data should be extracted.

  • output – The output that will receive extracted data.

class pymlst.wg.core.Extractor[source]

A simple interface to ease the process of creating new extractors.

abstract extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.
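A new extractor only has to implement extract(base, output). The sketch below mirrors that interface without importing PyMLST: the Extractor class here is a local stand-in, FakeBase is a hypothetical database exposing get_all_strains(), and StrainNameExtractor is an invented example.

```python
import io
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Local stand-in mirroring the documented interface (not imported from PyMLST)."""

    @abstractmethod
    def extract(self, base, output):
        """Write the extraction result for base to output."""


class FakeBase:
    """Hypothetical database object exposing get_all_strains()."""

    def get_all_strains(self):
        return ["strain_1", "strain_2"]


class StrainNameExtractor(Extractor):
    """Invented example: writes one strain name per line."""

    def extract(self, base, output):
        for strain in base.get_all_strains():
            output.write(strain + "\n")


buffer = io.StringIO()
StrainNameExtractor().extract(FakeBase(), buffer)
print(buffer.getvalue(), end="")
```

Because output is any writable text stream, the same extractor can target sys.stdout, an open file or an in-memory buffer.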

pymlst.wg.core.find_recombination(genes, alignment, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Counts the number of versions of each gene.

Parameters
  • genes – List of genes (output of TableExtractor using export='gene').

  • alignment – Fasta alignment file (output of SequenceExtractor using align=True).

  • output – The output where to write the results.

pymlst.wg.core.find_subgraph(distance, threshold=50, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, export='list')[source]

Searches groups of strains separated by a distance threshold.

Parameters
  • threshold – Minimum distance to maintain for groups extraction.

  • distance – Distance matrix file (output of TableExtractor with export='distance').

  • output – The output where to write the results.

  • export – Sets the export type.
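One way to read "groups of strains separated by a distance threshold" is as connected components of a graph in which two strains are linked when their distance does not exceed the threshold. The sketch below follows that reading, which is an assumption; find_groups and the input dictionary are hypothetical and do not reproduce the file formats used by find_subgraph.

```python
def find_groups(distances, threshold=50):
    """Group strains linked, directly or transitively, by distances <= threshold."""
    strains = sorted({strain for pair in distances for strain in pair})
    neighbours = {strain: set() for strain in strains}
    for (a, b), distance in distances.items():
        if distance <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in strains:
        if start in seen:
            continue
        # Depth-first walk over every strain reachable from start.
        stack, group = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            group.append(current)
            for other in neighbours[current] - seen:
                seen.add(other)
                stack.append(other)
        groups.append(sorted(group))
    return groups


# Hypothetical pairwise distances between four strains.
distances = {
    ("A", "B"): 10, ("A", "C"): 120, ("A", "D"): 200,
    ("B", "C"): 90, ("B", "D"): 150, ("C", "D"): 30,
}

groups = find_groups(distances, threshold=50)
print(groups)  # [['A', 'B'], ['C', 'D']]
```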

Set of methods to extract different types of results from wgMLST

class pymlst.wg.extractors.SequenceExtractor(file=None, reference=False)[source]

Extracts coregene sequences into a fasta file.

__init__(file=None, reference=False)[source]
Parameters

file – Path of the file containing the core genes to extract

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

class pymlst.wg.extractors.MsaExtractor(file=None, realign=False)[source]

Computes a Multiple Sequence Alignment (MSA) and extracts the aligned sequences.

__init__(file=None, realign=False)[source]
Parameters
  • file – Path of the file containing the core genes to extract

  • realign – Whether to realign genes with the same length

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

add_sequence_strain(seqs, strains, sequences)[source]

Adds a sequence to the multiple alignment; in case of repetition, the first gene is taken.

class pymlst.wg.extractors.TableExtractor(mincover=0, keep=False, duplicate=False, inverse=False)[source]

Extraction of cgMLST distance matrix, MLST profiles, Genes and Strains list from a wgMLST database.

__init__(mincover=0, keep=False, duplicate=False, inverse=False)[source]

Initialize self. See help(type(self)) for accurate signature.

abstract extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

class pymlst.wg.extractors.TableExtractorCommand(*args, **kwargs)[source]

Options supported by TableExtractor.

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class pymlst.wg.extractors.GeneExtractor(**kwargs)[source]

Extracts a list of genes from a wgMLST database.

__init__(**kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

class pymlst.wg.extractors.StatsExtractor[source]

Extracts stats, number of strains, coregenes and sequences from a wgMLST database.

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

class pymlst.wg.extractors.StrainExtractor(count=False, **kwargs)[source]

Extracts a list of strains from a wgMLST database.

__init__(count=False, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

class pymlst.wg.extractors.DistanceExtractor(mincover=0, keep=False, duplicate=False, inverse=False)[source]

Extracts a distance matrix from a wgMLST database.

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

class pymlst.wg.extractors.MlstExtractor(form='default', **kwargs)[source]

Extracts an MLST table from a wgMLST database.

__init__(form='default', **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

extract(base, output)[source]
Parameters
  • base – The database to extract data from.

  • output – The output where to write the extraction results.

Classical MLST

Core classes and functions to work with Classical MLST data.

pymlst.cla.core.open_cla(file=None, ref=1)[source]
A context manager function to wrap the creation of a ClassicalMLST object.

Context managers allow you to instantiate objects using the with keyword, eliminating the need to manage exceptions and commit/close processes yourself.

Parameters
  • file – The path to the database file to work with.

  • ref – The name that will be given to the reference strain in the database.

Yields

A ClassicalMLST object.

class pymlst.cla.core.DatabaseCLA(file, ref)[source]

A core level class to manipulate the genomic database.

Warning

Shouldn’t be instantiated directly, see ClassicalMLST instead.

__init__(file, ref)[source]
Parameters

file – The path to the database file to work with.

add_infos(repository, species, mlst, version)[source]

Adds information about the MLST schema used in this database.

add_sequence(sequence, gene, allele)[source]

Adds a new sequence associated to a gene and an allele.

add_mlst(sequence_typing, gene, allele)[source]

Adds a new sequence typing, associated to a gene and an allele.

get_infos()[source]

Returns the information values stored in the database.

get_genes_by_allele(allele)[source]

Returns all the distinct genes in the database and their sequences for a given allele.

get_allele_by_sequence_and_gene(sequence, gene)[source]

Gets an allele by sequence and gene.

get_st()[source]

Gets all the STs present in the database.

get_st_by_gene_and_allele(gene, allele)[source]

Gets all the STs of a gene/allele pair.

get_sequence_by_gene_and_allele(gene, allele)[source]

Gets a sequence by gene and allele.

get_all_sequences_by_gene(gene)[source]

Gets all the sequences of a particular gene.

get_all_sequences()[source]

Gets all the allele sequences.

remove_allele(gene, allele)[source]

Removes a particular allele of the gene.

class pymlst.cla.core.ClassicalMLST(file, ref)[source]

Classical MLST python representation.

Example of usage:

from pymlst.cla.core import open_cla

with open_cla('database.db') as db:
    db.create(open('profile.txt'), [open('gene1.fasta'),
                                   open('gene2.fasta'),
                                   open('gene3.fasta')])
    db.multi_search(open('genome.fasta'))

__init__(file, ref)[source]
Parameters
  • file – The path to the database file to work with.

  • ref – The name that will be given to the reference strain in the database.

create(profile, alleles)[source]

Creates a classical MLST database from an MLST profile and a list of alleles.

Parameters
  • profile – The MLST profile

  • alleles – A list of alleles files.

The MLST profile should be a TXT file respecting the following format:

MLST Profile TXT

ST    gene1    gene2    gene3
1     1        1        1
2     3        3        2
3     1        2        1
4     1        1        3
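To make the profile layout above concrete, the sketch below parses it and looks up an ST from a combination of alleles. It assumes a tab-separated file; read_profile and the PROFILE string are written for this illustration and are not part of PyMLST.

```python
import csv
import io

# The example profile above, assumed here to be tab-separated.
PROFILE = (
    "ST\tgene1\tgene2\tgene3\n"
    "1\t1\t1\t1\n"
    "2\t3\t3\t2\n"
    "3\t1\t2\t1\n"
    "4\t1\t1\t3\n"
)


def read_profile(handle):
    """Return the gene names and a mapping from allele combination to ST."""
    reader = csv.DictReader(handle, delimiter="\t")
    genes = [name for name in reader.fieldnames if name != "ST"]
    table = {}
    for row in reader:
        alleles = tuple(int(row[gene]) for gene in genes)
        table[alleles] = int(row["ST"])
    return genes, table


genes, table = read_profile(io.StringIO(PROFILE))
print(genes)             # ['gene1', 'gene2', 'gene3']
print(table[(3, 3, 2)])  # 2
```

A combination of alleles that is absent from the table corresponds to no known ST in this profile.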

add_infos(repository, species, mlst, version)[source]

Adds information about the MLST schema stored in the database.

Parameters
  • repository – Source of the MLST data

  • species – Name of the species

  • mlst – Name of the MLST schema

  • version – Version of the database

get_infos(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Gets information about the MLST schema stored in the database.

search_st(genome, identity=0.9, coverage=0.9, fasta=None)[source]

Searches the Sequence Type number of a strain.

Parameters
  • genome – The strain genome to search, given as a fasta path.

  • identity – Sets the minimum identity used by BLAT for the sequence search (as a fraction between 0 and 1).

  • coverage – Sets the minimum accepted gene coverage for found sequences.

  • fasta – A file where to export genes alleles results in a fasta format.

Searches the Sequence Type number of one or more strains from assembled genomes.

Parameters
  • genomes – Tuple of one or more strain genomes given as input Path

  • output – An output for the sequence type search results.

  • identity – Sets the minimum identity used by BLAT for the sequence search (as a fraction between 0 and 1).

  • fasta – A file where to export genes alleles results in a fasta format.

  • coverage – Sets the minimum accepted coverage for found sequences.

search_read(fastqs, identity=0.9, coverage=0.95, reads=10, fasta=None)[source]

Searches the Sequence Type from raw reads of one strain.

Parameters
  • fastqs – List of fastq files containing raw reads

  • identity – Sets the minimum identity used by KMA for the sequence search (as a fraction between 0 and 1).

  • coverage – Sets the minimum accepted gene coverage for found sequences.

  • reads – Sets the minimum read coverage required to keep a mapping

  • fasta – A file where to export genes alleles results in a fasta format.

multi_read(fastqs, identity=0.9, coverage=0.95, reads=10, paired=True, fasta=None, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Searches the Sequence Type number of one or more strains from raw reads.

Parameters
  • fastqs – Tuple of one or more strain raw reads given as input

  • output – An output for the sequence type search results.

  • identity – Sets the minimum identity used by KMA for the sequence search (as a fraction between 0 and 1).

  • reads – Sets the minimum read coverage required to keep a mapping

  • paired – Defines whether the raw reads are paired or single

  • fasta – A file where to export genes alleles results in a fasta format.

  • coverage – Sets the minimum accepted coverage for found sequences.

remove_allele(gene, allele)[source]

Removes a problematic allele from the claMLST database.

Parameters
  • gene – Gene name in the database, ignoring case

  • allele – Integer allele number of this gene

class pymlst.cla.core.ST_result(genome_name, st_val, alleles)[source]

Writes the results of the ST search.

__init__(genome_name, st_val, alleles)[source]
Parameters
  • genome_name – Name of the genome retrieved from the path provided by the user

  • st_val – ST values identified for each genome by search_st

  • alleles – Alleles of the strain recovered in the core genome

write(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, header=True)[source]

Writes the results to the output file.

Other Typing

Core classes and functions to work with alternative typing methods.

pymlst.pytyper.core.open_typer(method)[source]
Parameters

method – Defines the typing method to apply. Possible values: fim, spa, clmt.

Yields

A PyTyper object.

class pymlst.pytyper.core.PyTyper(fi, typing)[source]

Primary class for all pyTyper objects listed on method

__init__(fi, typing)[source]
Parameters
  • fi – Path to the database file

  • typing – Typing type

abstract search_genome(genome, identity=0.9, coverage=0.9, fasta=None)[source]

Abstract method for searching alleles against a genome.

Parameters
  • genome – Path to the fasta genome

  • identity – Minimum identity threshold (0.9)

  • coverage – Minimum coverage threshold (0.9)

  • fasta – Path to a file to write alleles in fasta format (None)

abstract create()[source]
Initializes the database for a specific typing method: FimH, Spa or Clermont.

Uses a scheme created automatically that is specific to the typing method.

Performs a batch search analysis on a list of genomes.

Parameters
  • genomes – List of paths to the fasta genomes

  • identity – Minimum identity threshold

  • coverage – Minimum coverage threshold

  • fasta – Handle to a file to write alleles in fasta format (None)

  • output – Write result on this output (stdout)

close()[source]

Close database

check_input(identity, coverage)[source]

Verifies that the input identity and coverage are between 0 and 1.

Parameters
  • identity – Minimum identity threshold

  • coverage – Minimum coverage threshold
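A range check of this kind can be sketched as below. check_thresholds is a hypothetical stand-in for illustration; in particular, raising ValueError is an assumption, and the real check_input may report invalid values differently.

```python
def check_thresholds(identity, coverage):
    """Raise ValueError unless both thresholds lie in the closed interval [0, 1]."""
    for name, value in (("identity", identity), ("coverage", coverage)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1, got {value!r}")


check_thresholds(0.9, 0.9)  # accepted silently

try:
    check_thresholds(95, 0.9)  # a percentage passed instead of a fraction
except ValueError as error:
    message = str(error)
    print(message)
```

Checking early catches the common mistake of passing 95 where 0.95 is expected.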

write_fasta_allele(genome, fasta, psl)[source]

Exports alleles to the fasta output for a list of psl results.

Parameters
  • genome – Path of genome fasta file

  • fasta – Handle of the fasta output

  • psl – BLAT alignment results {gene: [alignments]}

class pymlst.pytyper.core.FimH(fi)[source]

fimH typing for Escherichia coli.

__init__(fi)[source]
Parameters
  • fi – Path to the database file

  • typing – Typing type

search_genome(genome, identity, coverage, fasta)[source]

Abstract method for searching alleles against a genome.

Parameters
  • genome – Path to the fasta genome

  • identity – Minimum identity threshold (0.9)

  • coverage – Minimum coverage threshold (0.9)

  • fasta – Path to a file to write alleles in fasta format (None)

Performs a batch search analysis on a list of genomes.

Parameters
  • genomes – List of paths to the fasta genomes

  • identity – Minimum identity threshold

  • coverage – Minimum coverage threshold

  • fasta – Handle to a file to write alleles in fasta format (None)

  • output – Write result on this output (stdout)

create()[source]
Initializes the database for a specific typing method: FimH, Spa or Clermont.

Uses a scheme created automatically that is specific to the typing method.

class pymlst.pytyper.core.Spa(fi)[source]

Spa typing for Staphylococcus aureus.

__init__(fi)[source]
Parameters
  • fi – Path to the database file

  • typing – Typing type

search_genome(genome, identity, coverage, fasta)[source]

Abstract method for searching alleles against a genome.

Parameters
  • genome – Path to the fasta genome

  • identity – Minimum identity threshold (0.9)

  • coverage – Minimum coverage threshold (0.9)

  • fasta – Path to a file to write alleles in fasta format (None)

Performs a batch search analysis on a list of genomes.

Parameters
  • genomes – List of paths to the fasta genomes

  • identity – Minimum identity threshold

  • coverage – Minimum coverage threshold

  • fasta – Handle to a file to write alleles in fasta format (None)

  • output – Write result on this output (stdout)

create()[source]
Initializes the database for a specific typing method: FimH, Spa or Clermont.

Uses a scheme created automatically that is specific to the typing method.

check_repetition(spans)[source]

Checks that repetitions are successive.
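What "successive" could mean is sketched below under a strong assumption: spans is taken to be a list of (start, end) positions, and repetitions count as successive when each one starts exactly where the previous one ends. are_successive is hypothetical, and the real check_repetition may use a different span format or rule.

```python
def are_successive(spans):
    """Return True when every span starts where the previous one ends."""
    ordered = sorted(spans)
    return all(prev[1] == cur[0] for prev, cur in zip(ordered, ordered[1:]))


# Hypothetical (start, end) positions of repeats on a contig.
print(are_successive([(0, 24), (24, 48), (48, 72)]))  # True
print(are_successive([(0, 24), (30, 54)]))            # False
```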

class pymlst.pytyper.core.Clmt(fi)[source]

Phylogroup determination using the ClermontTyping method for Escherichia coli.

__init__(fi)[source]
Parameters
  • fi – Path to the database file

  • typing – Typing type

search_genome(genome, identity, coverage, fasta)[source]

Abstract method for searching alleles against a genome.

Parameters
  • genome – Path to the fasta genome

  • identity – Minimum identity threshold (0.9)

  • coverage – Minimum coverage threshold (0.9)

  • fasta – Path to a file to write alleles in fasta format (None)

Performs a batch search analysis on a list of genomes.

Parameters
  • genomes – List of paths to the fasta genomes

  • identity – Minimum identity threshold

  • coverage – Minimum coverage threshold

  • fasta – Handle to a file to write alleles in fasta format (None)

  • output – Write result on this output (stdout)

create()[source]
Initializes the database for a specific typing method: FimH, Spa or Clermont.

Uses a scheme created automatically that is specific to the typing method.

class pymlst.pytyper.core.TypingResult(genome_name, method)[source]

Writes the results of the typing search.

__init__(genome_name, method)[source]
Parameters
  • genome_name – Name of the genome retrieved from the path provided by the user

  • method – Typing method used for the analysis

write(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, header=True)[source]

Writes the results to the output file.

set_allele(allele)[source]
Parameters

allele – Allele of the strain recovered in the core genome

set_st(st_val)[source]
Parameters

st_val – ST values identified for each genome by search

set_notes(notes)[source]
Parameters

notes – Add notes to search results