Overview of the filtering process within the amr.watch workflow, applied to public genomes of priority bacterial pathogens available in the International Nucleotide Sequence Database Collaboration (INSDC) databases. The genome data are filtered in a series of steps, shown from left to right in the table; the entries counted in each column are a subset of those counted in the previous column.

For pathogens that are grouped together, we initially accept genomes annotated in the European Nucleotide Archive (ENA) with any of the taxonomy IDs belonging to the group, and we use the Speciator assignments in subsequent processing.

Pathogen	ENA entries (run accessions)	Illumina paired-end entries	Filtered entries¹	Entries with geotemporal data²	Entries available for download in SRA	Entries associated with unique samples³	Assembled genomes	Genomes with correct species	Genomes that passed QC	Genomes with collection date post-2010
All Pathogens	329,397	301,215	298,141	136,600	136,580	128,774	123,180	122,111	117,071	100,486
A. baumannii	42,475	35,119	33,951	21,802	21,784	21,090	20,705	20,302	19,785	18,496
Klebsiella	115,375	101,598	100,229	61,198	61,196	56,446	55,006	54,582	51,239	49,072
S. pneumoniae	171,547	164,498	163,961	53,600	53,600	51,238	47,469	47,227	46,047	32,918
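
The nesting described in the caption can be checked directly from the counts. The following is a minimal sketch, not part of the amr.watch workflow: the counts are transcribed from the table above, and the names `rows`, `is_non_increasing`, `totals_match`, `all_nested` and `retained` are illustrative only.

```python
# Counts transcribed from the table, in column order (left to right).
rows = {
    "All Pathogens": [329397, 301215, 298141, 136600, 136580,
                      128774, 123180, 122111, 117071, 100486],
    "A. baumannii": [42475, 35119, 33951, 21802, 21784,
                     21090, 20705, 20302, 19785, 18496],
    "Klebsiella": [115375, 101598, 100229, 61198, 61196,
                   56446, 55006, 54582, 51239, 49072],
    "S. pneumoniae": [171547, 164498, 163961, 53600, 53600,
                      51238, 47469, 47227, 46047, 32918],
}


def is_non_increasing(counts):
    """True if no column holds more entries than the column before it."""
    return all(later <= earlier for earlier, later in zip(counts, counts[1:]))


# Each filtering step should only ever remove entries.
all_nested = all(is_non_increasing(counts) for counts in rows.values())

# The per-pathogen rows should add up to the "All Pathogens" row in every column.
pathogens = [name for name in rows if name != "All Pathogens"]
n_columns = len(rows["All Pathogens"])
column_sums = [sum(rows[name][i] for name in pathogens) for i in range(n_columns)]
totals_match = column_sums == rows["All Pathogens"]

# Percentage of the initial ENA entries that survive to the final column.
retained = {name: round(100 * counts[-1] / counts[0], 1)
            for name, counts in rows.items()}
```

Both checks hold for the counts shown, and `retained` gives the share of initial run accessions that reach the final column for each row.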

¹ Entries (runs) are filtered to include only those with two FASTQ files, ≥20x mean coverage (via assessment of the "base_count" field) and a single associated sample accession.
² Entries (runs) are filtered to include only those with a collection date that is decodable to at least the year and a sampling location that is decodable to at least the country level.
³ Entries (runs) are filtered to ensure that only one run per sample accession is included (selecting the run with the highest number of bases via assessment of the "base_count" field).
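
The three run-level filters described in the notes above could be sketched as follows. This is an illustration under stated assumptions, not the amr.watch implementation. Each run is assumed to be a dictionary with ENA-style fields (`fastq_ftp` and `sample_accession` as semicolon-separated strings, plus `base_count`, `collection_date` and `country`). Coverage is assumed to be estimated as `base_count` divided by an expected genome size. The genome size value, the `pathogen` key, the `MISSING` terms and all function names are hypothetical.

```python
import re

# Hypothetical expected genome size in bases; the source does not state the values used.
EXPECTED_GENOME_SIZE = {"A. baumannii": 4_000_000}
# Hypothetical placeholder terms treated as "no location given".
MISSING = {"", "missing", "not provided", "not collected", "not applicable"}


def passes_run_filter(run, min_coverage=20):
    """Note 1: two FASTQ files, >=20x estimated coverage, one sample accession."""
    fastq_files = [f for f in run["fastq_ftp"].split(";") if f]
    samples = [s for s in run["sample_accession"].split(";") if s]
    coverage = int(run["base_count"]) / EXPECTED_GENOME_SIZE[run["pathogen"]]
    return len(fastq_files) == 2 and len(samples) == 1 and coverage >= min_coverage


def has_geotemporal_data(run):
    """Note 2: date decodable to the year, location decodable to the country."""
    year = re.match(r"(\d{4})(?!\d)", run["collection_date"].strip())
    country = run["country"].split(":")[0].strip().lower()
    return year is not None and country not in MISSING


def one_run_per_sample(runs):
    """Note 3: for each sample accession, keep the run with the most bases."""
    best = {}
    for run in runs:
        key = run["sample_accession"]
        if key not in best or int(run["base_count"]) > int(best[key]["base_count"]):
            best[key] = run
    return list(best.values())


def make_run(acc, sample, fastq, bases, date, country):
    """Build an illustrative run record (all values are made up)."""
    return {"run_accession": acc, "sample_accession": sample,
            "pathogen": "A. baumannii", "fastq_ftp": fastq,
            "base_count": bases, "collection_date": date, "country": country}


runs = [
    make_run("RUN_A", "SAMPLE_1", "a_1.fastq.gz;a_2.fastq.gz", "100000000", "2015-03", "Kenya: Nairobi"),
    make_run("RUN_B", "SAMPLE_1", "b_1.fastq.gz;b_2.fastq.gz", "200000000", "2016", "Kenya"),
    make_run("RUN_C", "SAMPLE_2", "c.fastq.gz", "100000000", "2017", "Peru"),        # one FASTQ file
    make_run("RUN_D", "SAMPLE_3", "d_1.fastq.gz;d_2.fastq.gz", "100000000", "missing", "Peru"),  # no date
    make_run("RUN_E", "SAMPLE_4", "e_1.fastq.gz;e_2.fastq.gz", "40000000", "2018", "India"),     # 10x only
]

filtered = [r for r in runs if passes_run_filter(r)]
with_geotemporal = [r for r in filtered if has_geotemporal_data(r)]
kept = one_run_per_sample(with_geotemporal)
```

In this toy input only `RUN_B` survives: `RUN_C` and `RUN_E` fail the first filter, `RUN_D` lacks a usable collection date, and `RUN_A` shares a sample accession with `RUN_B` but has fewer bases. In the table, the check for availability in the SRA sits between the second and third filters and is not modelled here.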