In biotechnology, biological sequences are at the core of innovation, and keyword search alone can overlook crucial information, which increases risk. Sequence-based search is therefore widely used in patent work for freedom-to-operate (FTO) and novelty searches.
Current search methods mostly rely on homology-based sequence alignment algorithms, which look for similar sequences in sequence databases to ensure comprehensive results. However, patents contain a peculiar kind of sequence, known as a degenerate sequence (referred to below also as a generic or general-formula sequence).
Degenerate sequences explained: Patent drafters use an approach similar to the one used for describing chemical structures. They introduce degenerate symbols, wildcards, and operators into the sequence, and define the specific parameters of these symbols in accompanying interpretative text. A generic sequence has no biological significance in itself; it is mainly used to extend the scope of patent protection and to set up search barriers. Traditional homology alignment algorithms do not account for generic sequences, so searches that rely on them risk omissions and cannot find all potential target sequences.
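To make the idea concrete, here is a minimal Python sketch of how one generic sequence stands for many concrete ones. The template "ACXGX", the substitution table, and the function name `expand` are hypothetical illustrations, not taken from any patent or from Patsnap's implementation; the table plays the role of the interpretative text that says which residues each wildcard may represent.

```python
from itertools import product

def expand(template, substitutions, wildcard="X"):
    """Yield every concrete sequence described by a generic sequence.

    template      -- sequence string containing wildcard symbols
    substitutions -- dict: 0-based position -> string of allowed residues
                     (hypothetical stand-in for the patent's interpretative text)
    """
    options = []
    for pos, symbol in enumerate(template):
        if symbol == wildcard:
            options.append(substitutions[pos])  # several choices here
        else:
            options.append(symbol)              # a single fixed residue
    for combo in product(*options):
        yield "".join(combo)

variants = list(expand("ACXGX", {2: "DE", 4: "KR"}))
print(variants)  # ['ACDGK', 'ACDGR', 'ACEGK', 'ACEGR']
```

Two wildcards with two options each already yield four sequences, and a search that compares only the literal string "ACXGX" would treat all four as partial matches at best.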
According to statistics from the Patsnap Bio Sequence Database, such generic sequences are far from rare in global patent literature: about 7.4 million nucleic acid sequences (7.12% of all nucleic acid sequences) and 1.31 million protein sequences (7.55% of all protein sequences). Because of the special symbols they contain, this large body of generic sequences affects search results and poses a high risk for sequence FTO.
For example, the query sequence
"EVGSYPAPSDACPSDYFYCDASGRSAGGGGTENLYFQGSGGS"
should match the target sequence
"EVGSYXXXXX XCXXXXXXCX XSGRSAGGGG TENLYFQGSG GS".
With a traditional sequence search, the BLAST algorithm reports a similarity of only 67%, even though the true similarity is 100%. Conventional algorithms therefore fail on this type of sequence in one of two ways: either the sequence is not found at all, or it is excluded from the results because its similarity falls below the threshold. Either way, searchers cannot easily compare homology against the patent claims and may miss key sequence information.
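The gap between the two figures can be reproduced by simple position counting. The Python sketch below is not BLAST; it is an ungapped, position-by-position comparison of the query and target quoted above, and it assumes that X may stand for any residue, since the substitution table for this example is not given. The function name `identity` is illustrative.

```python
QUERY = "EVGSYPAPSDACPSDYFYCDASGRSAGGGGTENLYFQGSGGS"
# Target as printed in the text, with the grouping spaces removed.
TARGET = "EVGSYXXXXX XCXXXXXXCX XSGRSAGGGG TENLYFQGSG GS".replace(" ", "")

def identity(query, target, wildcard=None):
    """Fraction of aligned positions that match (ungapped, equal lengths).

    If wildcard is given, that symbol in the target matches any residue.
    """
    if len(query) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    hits = sum(
        1 for q, t in zip(query, target)
        if q == t or (wildcard is not None and t == wildcard)
    )
    return hits / len(target)

literal = identity(QUERY, TARGET)
aware = identity(QUERY, TARGET, wildcard="X")
print(f"literal: {literal:.0%}, wildcard-aware: {aware:.0%}")
# literal: 67%, wildcard-aware: 100%
```

Of the 42 positions, 28 are literal residues that agree with the query and 14 are X symbols, so a comparison that treats X as a mismatch arrives at 28/42, or about 67%, while a wildcard-aware comparison reaches 100%.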
To address the risk of missed hits caused by degenerate sequences, the Patsnap engineering team built a deep learning model on its in-house NLP, computer vision (CV), entity recognition, anaphora resolution, and related technologies. The model identifies and parses generic sequences and their substitution information in sequence listings and patent full text, and the results are used to build a degenerate sequence search database.
Using a dedicated sequence comparison algorithm, degenerate sequence search not only finds such generic sequences but also returns their true similarity. Patsnap's generic sequence retrieval solution thus further reduces the risk of missed hits in patent FTO and novelty searches.
Because the number of possible variants of a single general-formula sequence can reach the order of one hundred billion, traditional sequence alignment algorithms cannot meet the demand for real-time search. Patsnap uses a deeply customized sequence alignment algorithm that dynamically loads the substitution information of the general-formula sequence during the search, which keeps the search precise while holding retrieval time within a reasonable range. In the scanning phase, Patsnap applies a compression algorithm to build a seed word list for heuristic search, which greatly reduces unnecessary comparisons and improves search efficiency. When comparing the query sequence with the target sequence, Patsnap's proprietary algorithm takes the general-formula substitution information into account, so alignments and results are more precise and easier to read: the output directly shows the best alignment between the query and the target across the different variants.
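The scale problem and the seeding idea can both be illustrated with a short Python sketch. This is not Patsnap's proprietary algorithm, whose details are not public here; it only shows, under stated assumptions, why enumerating variants is impractical and how literal seed words might be drawn from a generic sequence. The function names, the word length k=3, and the default of 20 amino acids per unrestricted X are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def count_variants(template, allowed, wildcard="X"):
    """Number of concrete sequences a generic template stands for.

    allowed -- dict: 0-based position -> string of allowed residues;
               wildcard positions missing from the dict are assumed
               to accept any of the 20 standard amino acids.
    """
    return math.prod(
        len(allowed.get(i, AMINO_ACIDS))
        for i, symbol in enumerate(template)
        if symbol == wildcard
    )

def seed_words(template, k=3, wildcard="X"):
    """Literal k-mers with their offsets, taken from wildcard-free windows."""
    return [
        (i, template[i:i + k])
        for i in range(len(template) - k + 1)
        if wildcard not in template[i:i + k]
    ]

TEMPLATE = "EVGSYXXXXX XCXXXXXXCX XSGRSAGGGG TENLYFQGSG GS".replace(" ", "")
print(count_variants(TEMPLATE, {}))  # 20**14, about 1.6e18
print(len(seed_words(TEMPLATE)))     # 22
```

With 14 unrestricted wildcards, the example sequence alone would expand to 20^14 concrete sequences, far too many to align one by one. The 22 literal seed words, by contrast, can be looked up directly, and only candidates that share a seed need the more expensive substitution-aware comparison.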
In June 2023, Patsnap’s Bio sequence database introduced a degenerate sequence search feature, a significant change for sequence searching in the patent domain. The feature gives researchers access to an extensive collection of degenerate sequences and helps them retrieve accurate, relevant results without extra effort.
To schedule a demo or learn more, visit patsnap.com/solutions/bio.