ST.26: The New Sequence Listing Standard – Part II

This article is a follow-up to our introductory article on WIPO’s ST.26 sequence listing standard, which can be found here.

Below, we continue our discussion of how the presentation of sequence data under the ST.26 standard differs from that under the older ST.25 sequence listing format.

ST.26 Sequence Data

The sequence data section comprises data elements for describing a single sequence. Each individual sequence is assigned its own sequence identification number.

The most noticeable change when comparing ST.26 sequences with ST.25 sequences is likely the file format: ST.26 sequence listings use the .xml (eXtensible Markup Language) file format, rather than the ASCII .txt file format. This change is significant because it allows more information to accompany the sequence data, as each item of information in .xml format is digitally tagged with descriptive elements and various other attributes, as will be apparent from later discussions within this article. Advantageously, this means that the data contained in an ST.26 sequence listing is more readily accessible and easier to interpret, rather than being presented in a more code-like fashion.
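To illustrate what "tagged" data looks like in practice, the following is a minimal sketch that builds the XML for a single sequence using Python's standard library. The element and attribute names (SequenceData, sequenceIDNumber, INSDSeq and its children) reflect our understanding of the ST.26 schema and should be checked against the standard itself. The function name and example sequence are hypothetical, and the mandatory feature table is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_sequence_data(seq_id: int, residues: str, moltype: str) -> ET.Element:
    """Build a simplified SequenceData element for a single sequence.

    Assumption: element names follow our reading of the ST.26 schema.
    The feature table that a real listing requires is left out here.
    """
    data = ET.Element("SequenceData", {"sequenceIDNumber": str(seq_id)})
    insd = ET.SubElement(data, "INSDSeq")
    ET.SubElement(insd, "INSDSeq_length").text = str(len(residues))
    ET.SubElement(insd, "INSDSeq_moltype").text = moltype
    ET.SubElement(insd, "INSDSeq_division").text = "PAT"
    ET.SubElement(insd, "INSDSeq_sequence").text = residues
    return data


# Hypothetical 12-nucleotide DNA sequence assigned SEQ ID number 1.
element = build_sequence_data(1, "atgcatgcatgc", "DNA")
print(ET.tostring(element, encoding="unicode"))
```

Each value sits inside an element that names what it is, which is what allows software to validate and search the listing without relying on the position of a line in a text file.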

As briefly mentioned in our first article, the ST.26 format makes it possible to include D-amino acids, linear portions of branched sequences, and nucleotide analogs, none of which could be included in the ST.25 format.

Under the ST.26 standard, nucleotide sequences with fewer than 10 specifically defined nucleotides and amino acid sequences with fewer than 4 specifically defined residues are no longer permitted. This means that a nucleotide sequence of 10 nucleotides that contains one or more undefined “n” residues is not permissible. Likewise, a 4-residue amino acid sequence having one or more “X” residues is not permitted.
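The minimum-length rule can be expressed as a simple count of defined residues. The sketch below is illustrative only: the function names are hypothetical, and it treats only "n" (nucleotides) and "X" (amino acids) as undefined, which is the reading given in this article.

```python
def count_defined(residues: str, molecule: str) -> int:
    """Count residues other than the undefined symbol for this molecule type."""
    if molecule == "nucleotide":
        return sum(1 for r in residues.lower() if r != "n")
    return sum(1 for r in residues.upper() if r != "X")


def is_permitted(residues: str, molecule: str) -> bool:
    """Apply the thresholds described above: 10 nucleotides or 4 amino acids."""
    minimum = 10 if molecule == "nucleotide" else 4
    return count_defined(residues, molecule) >= minimum


# A 10-mer with one "n" has only 9 defined nucleotides.
print(is_permitted("acgtacgtan", "nucleotide"))
# A fully defined 10-mer meets the threshold.
print(is_permitted("acgtacgtac", "nucleotide"))
# A 4-residue peptide with one "X" has only 3 defined residues.
print(is_permitted("MKXL", "amino acid"))
```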

Crucially, the letter “t” as used in ST.26 sequences denotes uracil in RNA sequences and thymine in DNA sequences, which contrasts with the use of “u” residues in ST.25 sequences. For RNA sequences imported from an ST.25 sequence listing, this change is performed automatically, with all “u” residues being replaced by “t” residues. However, for DNA sequences imported from an ST.25 sequence listing, “u” residues will not be automatically converted to “t” residues, since the WIPO Sequence software is unable to determine whether a “u” residue in a DNA sequence is a modified residue, for example a uracil residue on a DNA backbone, or part of an RNA segment of a hybrid DNA/RNA molecule. In such cases, the user would have to manually amend “u” residues to “t” residues if required, and include the necessary feature keys and qualifiers to explain the positioning of the uracil in the DNA sequence.
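The import behaviour described above can be summarised in a few lines. This is our own sketch of the logic, not the WIPO Sequence implementation; the function name and the "needs review" flag are hypothetical.

```python
def import_st25_residues(residues: str, moltype: str) -> tuple[str, bool]:
    """Return (residues, needs_manual_review) for an imported ST.25 sequence.

    RNA: every "u" is replaced by "t" automatically.
    DNA: "u" is left untouched and flagged, because it may be a modified
    residue or part of an RNA segment in a hybrid molecule.
    """
    residues = residues.lower()
    if moltype == "RNA":
        return residues.replace("u", "t"), False
    return residues, "u" in residues


print(import_st25_residues("augcaugcaugc", "RNA"))
print(import_st25_residues("acgtacgtacgu", "DNA"))
```

A flagged DNA sequence is the case where the user would review each "u" by hand and add the appropriate feature keys and qualifiers.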

ST.26 amino acid sequences are also simpler in that they now use one-letter amino acid codes, rather than the three-letter amino acid codes used in ST.25 amino acid sequences.
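As a simple illustration of the change in amino acid notation, the sketch below converts a space-separated three-letter sequence into one-letter codes. The function name is hypothetical, and the table covers only the 20 standard amino acids plus "Xaa"; other symbols would need to be added for real use.

```python
# Three-letter to one-letter codes for the 20 standard amino acids, plus Xaa.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Xaa": "X",
}


def to_one_letter(three_letter_sequence: str) -> str:
    """Convert e.g. "Met Lys Thr" to "MKT"; raises KeyError on unknown codes."""
    return "".join(THREE_TO_ONE[code] for code in three_letter_sequence.split())


print(to_one_letter("Met Lys Thr Leu Xaa"))
```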

There is also better separation of the ST.26 nucleotide sequences from their translated amino acid sequences, since the ST.26 standard no longer permits having the amino acid translation presented below a nucleotide sequence. Instead, translated sequences are now represented elsewhere.

The ST.26 format further provides for feature location, which helps to characterise the sequence by requiring at least one location descriptor defining a site or region that corresponds to a feature present in the sequence. For example, the symbols “<” and/or “>” can be used to indicate that a feature extends beyond the specified location, and the location operator “join” can be used to specify that the indicated locations are joined in an end-to-end configuration to form one contiguous sequence. It is also possible to specify the “order” of the elements and, using “complement”, whether the feature is located on the strand complementary to the sequence region specified, when read in the 5’ to 3’ direction. Feature location was not previously given this level of consideration in the older ST.25 format.
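The location syntax is easiest to follow with a few concrete strings. The helper functions below are hypothetical and simply assemble location descriptors in the form we understand ST.26 to use; the positions are invented for illustration.

```python
def span(start: int, end: int, partial_start: bool = False,
         partial_end: bool = False) -> str:
    """Build "start..end", with "<" or ">" marking a feature that extends
    beyond the stated start or end position."""
    left = f"<{start}" if partial_start else str(start)
    right = f">{end}" if partial_end else str(end)
    return f"{left}..{right}"


def join(*locations: str) -> str:
    """Locations joined end-to-end to form one contiguous sequence."""
    return "join(" + ",".join(locations) + ")"


def order(*locations: str) -> str:
    """Locations found in the stated order, with no joining implied."""
    return "order(" + ",".join(locations) + ")"


def complement(location: str) -> str:
    """Feature located on the strand complementary to the stated region."""
    return f"complement({location})"


print(span(1, 50, partial_start=True))
print(join(span(1, 10), span(20, 30)))
print(complement(join(span(1, 10), span(20, 30))))
```

When such a string is written into an .xml file, the "<" and ">" symbols must be escaped (as &lt; and &gt;), which XML libraries generally handle automatically.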

Variable or undefined residues, denoted by the letter “n” for nucleotides or by “X” for amino acids (“Xaa” in ST.25), no longer need to have a definition provided in the sequence listing under the ST.26 format. Instead, a default value is now assigned to “n” and “X” residues. In other words, residue “n” will now be construed to mean any one of “a”, “c”, “g” or “t/u” except where it is used with a further description in the feature table. Despite this change, applicants are still encouraged to use the most restrictive symbol that applies, for example the letter “r” to represent either an “a” or a “g” residue.
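The idea of choosing the most restrictive symbol can be sketched with the standard IUPAC nucleotide ambiguity codes. The function names are hypothetical; "t" here stands for "t/u", following the ST.26 convention described earlier.

```python
# IUPAC nucleotide ambiguity codes, with "t" standing for t/u.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "m": "ac", "r": "ag", "w": "at", "s": "cg", "y": "ct", "k": "gt",
    "v": "acg", "h": "act", "d": "agt", "b": "cgt",
    "n": "acgt",
}

# Reverse lookup: set of permitted bases -> the symbol meaning exactly that set.
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}


def expand(symbol: str) -> set[str]:
    """Return the set of bases a symbol may represent ("n" is any base)."""
    return set(IUPAC[symbol.lower()])


def most_restrictive_symbol(bases: str) -> str:
    """Return the symbol that covers the given bases and nothing more."""
    return SYMBOL_FOR[frozenset(bases.lower())]


print(sorted(expand("n")))
print(most_restrictive_symbol("ag"))
```

If a position can only be "a" or "g", using "r" instead of "n" tells the reader more about the sequence without any extra annotation.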

With respect to nomenclature, there are minimal differences between the two format standards for describing the organism. For example, one could still use the genus or species names, the only minor difference being that “artificial sequences” are now called “synthetic construct” and those that are “unknown” are to be referred to as “unidentified”.

Finally, and as briefly mentioned in our introductory article, following the updates that took effect on 1 July 2024, it is now possible to better present DNA sequences whose directionality changes within a single DNA strand as a result of a 3’ to 3’ reversed linkage. Briefly, this involves first disclosing the portion of the DNA sequence that is in the 5’ to 3’ direction and allocating this portion a SEQ ID number. The nucleotide at the position of the 3’ to 3’ reversed linkage should then be described in a feature table using the feature key “misc_feature” and the qualifier “note”, with a value indicating that the residue is connected to an inverted nucleotide sequence through a 3’-3’ phosphodiester bond. The second portion of the DNA sequence, which is in the reversed orientation, would then be represented under a separate SEQ ID number, in the 5’ to 3’ orientation. In this second portion of the sequence, the nucleotide at the other end of the 3’ to 3’ reversed linkage would be described in the same manner as above, in a feature table using the feature key “misc_feature” and the qualifier “note”.
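The steps for a 3’ to 3’ reversed linkage can be sketched as a small data model. This is an illustration under several assumptions: the class and function names are hypothetical, the wording of the note is our own, the linkage is assumed to sit at the 3’ terminus of each portion, and the second portion is assumed to be supplied as it reads when continuing along the strand from the linkage (that is, 3’ to 5’).

```python
from dataclasses import dataclass, field

# Illustrative wording only; not text prescribed by the standard.
NOTE = "connected to an inverted nucleotide sequence through a 3'-3' phosphodiester bond"


@dataclass
class Feature:
    key: str
    location: str
    qualifiers: dict


@dataclass
class SequenceEntry:
    seq_id: int
    residues: str
    features: list = field(default_factory=list)


def split_reversed_linkage(first_5to3: str, second_as_read: str,
                           first_seq_id: int = 1) -> list[SequenceEntry]:
    """Represent a strand with a 3'-3' linkage as two separate entries."""
    # Reverse (not complement) the second portion so it reads 5' to 3'.
    second_5to3 = second_as_read[::-1]
    entries = []
    for offset, residues in enumerate((first_5to3, second_5to3)):
        # Assumption: the linked nucleotide is the last (3'-terminal) residue.
        linked_position = str(len(residues))
        feature = Feature("misc_feature", linked_position, {"note": NOTE})
        entries.append(SequenceEntry(first_seq_id + offset, residues, [feature]))
    return entries


# Hypothetical sequences for illustration.
for entry in split_reversed_linkage("acgtacgtacgg", "ttgcatgcatgcaa"):
    print(entry.seq_id, entry.residues, entry.features[0].location)
```

Each portion receives its own SEQ ID number, and each carries a “misc_feature” annotation at the linked residue, mirroring the two-entry approach described above.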

Up Next

Our final instalment in this series will discuss sequence annotations, including how this feature is particularly beneficial when preparing sequence listings under the ST.26 standard. We will also briefly discuss our recommendations to Applicants considering filing divisional applications in Australia and/or overseas.

Need advice on preparing sequence listings? MBIP is happy to help. Submit a “book a meeting” form on our website and an attorney will be in contact with you.