Article
· May 15, 2024 · 9 min read

DNA Similarity and Classification: An Analysis

DNA Similarity and Classification was developed as a REST API that uses InterSystems Vector Search technology to investigate genetic similarities and efficiently classify DNA sequences. The application combines machine-learning techniques with vector search to classify gene families and to identify known DNA sequences that are similar to an unknown input sequence.

K-mer Analysis: Fundamentals in DNA Sequence Analysis

Fragmentation of a DNA sequence into k-mers is a fundamental technique in genetic data processing. This approach involves breaking down the DNA sequence into smaller subsequences of fixed size, known as k-mers. For example, if we take the sequence "ATCGTAGCTA" and define k as 3, we would obtain the following k-mers: "ATC", "TCG", "CGT", "GTA", "TAG", "AGC", "GCT", and "CTA".

This process is carried out by sliding a window of size k along the sequence, extracting each subsequence of size k, and recording it as a k-mer. The window then moves to the next set of bases, allowing overlap between the k-mers. This overlap is crucial to ensure that no information is lost during the fragmentation process.
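To make the sliding window concrete, here is a minimal standalone sketch (plain Python, mirroring the getKmers helper used later in the project):

def kmers(sequence, k=3):
    # Slide a window of size k one base at a time; consecutive windows overlap by k-1 bases
    return [sequence[i:i+k] for i in range(len(sequence) - k + 1)]

print(kmers("ATCGTAGCTA"))
# ['ATC', 'TCG', 'CGT', 'GTA', 'TAG', 'AGC', 'GCT', 'CTA']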

The resulting k-mers retain crucial local information about the structure of the DNA sequence. This is critical because many biologically relevant pieces of information are contained within specific regions of DNA. Subsequent analysis of these k-mers can reveal patterns, identify conserved or similar regions between different DNA sequences, and assist in the investigation of genetic functions and interactions.

In summary, DNA Similarity and Classification offers an approach for identifying genetic similarities and classifying DNA sequences.

Application

The following code initializes the project, performing the necessary setup before building:

ClassMethod init()
{
    write "START INIT"
    // Train and save the classification model
    Do ##class(dc.data.DNAClassification).CreateTheModelML()
    // Fragment, vectorize, and store the DNA dataset
    Do ##class(dc.data.HumanDNA).UploadData()
}

The UploadData method is responsible for fragmenting DNA sequences into k-mers, converting those k-mers into vectors using a sentence-encoding model, and then storing the vectors in the database.

print("Reading dataset")
human_df = pd.read_table('/opt/irisbuild/data/human_data.txt')
human_df = human_df.sample(n=1500, random_state=42)

To enhance performance and reduce processing time during the project's build phase, the read is limited to a random sample of 1,500 records (sample(n=1500, random_state=42)). While this strategy speeds up the build, analyzing the complete file would give a more thorough picture of the data; the sampling approach was adopted for the sake of build speed.

def getKmers(sequence, size=6, max_length=5000):
    # Slide a window of length `size` along the sequence; consecutive k-mers overlap
    kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
    kmers = [kmer for kmer in kmers if len(kmer) > 0]
    # Fall back to the raw sequence when it is shorter than `size`
    if len(kmers) == 0:
        return [sequence]
    # Cap the number of k-mers to keep encoding time bounded
    return kmers[:max_length]

The max_length parameter in getKmers caps the number of k-mers processed per sequence in order to optimize runtime. However, it's important to note that truncating long sequences this way may not fully represent the DNA data.

 print("Creating K-mers groups")
    human_df['K_mers'] = human_df['sequence'].apply(getKmers)

    print("Combining K-mers into strings")
    human_df['K_mers_str'] = human_df['K_mers'].apply(lambda x: ' '.join(x))

    print("Download stsb-roberta-base-v2 model")
    model = SentenceTransformer('stsb-roberta-base-v2')

    print("Encode K_mers")
    embeddings = model.encode(human_df['K_mers_str'].tolist(), normalize_embeddings=True)

    print("Creating column sequence_vectorized")
    human_df['sequence_vectorized'] = embeddings.tolist()

The code vectorizes the DNA sequences using the stsb-roberta-base-v2 encoding model from the SentenceTransformer library. This step converts each DNA sequence into a numerical vector, enabling vector operations and computational analysis. After vectorization, the data is stored in the database for future use, ensuring the sequence vectors are available for queries and additional analyses.
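The storage step itself is not shown in the snippet above. A minimal sketch of how the vectors could be persisted with embedded Python (assuming the dataset's label column is named class, and using the dc_data.HumanDNA table and kMersVector column that the similarity query below relies on) might look like this:

import iris

# Hypothetical sketch: insert each sequence, its class, and its embedding
stmt = iris.sql.prepare(
    "INSERT INTO dc_data.HumanDNA (sequence, dnaClass, kMersVector) "
    "VALUES (?, ?, TO_VECTOR(?))"
)
for _, row in human_df.iterrows():
    stmt.execute(row['sequence'], int(row['class']), str(row['sequence_vectorized']))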

Subsequently, the findSimilarity method finds the most similar DNA sequences for a given input sequence: it fragments the input into k-mers, encodes them, and queries the database for the five closest stored sequences.

ClassMethod findSimilarity(pSequence As %String) As %String [ Language = python ]
{
    import iris
    from sentence_transformers import SentenceTransformer

    results = []

    def getKmers(sequence, size=6, max_length=5000):
        kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
        kmers = [kmer for kmer in kmers if len(kmer) > 0]
        if len(kmers) == 0:
            return [sequence]
        return kmers[:max_length]
        

    # Encode the query the same way the stored sequences were encoded at load time
    model = SentenceTransformer('stsb-roberta-base-v2')
    kmers = getKmers(pSequence)
    kmers_str = ' '.join(kmers)
    search_vector = model.encode(kmers_str, normalize_embeddings=True).tolist()

    # With normalized embeddings, a larger dot product means higher cosine similarity
    stmt = iris.sql.prepare("SELECT TOP 5 ID FROM dc_data.HumanDNA ORDER BY VECTOR_DOT_PRODUCT(kMersVector, TO_VECTOR(?)) DESC")
    rs = stmt.execute(str(search_vector))

    class_mapping = {
            0: 'G protein coupled receptors',
            1: 'tyrosine kinase',
            2: 'tyrosine phosphatase',
            3: 'synthetase',
            4: 'synthase',
            5: 'ion channel',
            6: 'transcription factor',
    }

    for row in rs:
        humanDNA = iris.cls("dc.data.HumanDNA")._OpenId(row[0])

        results.append({
            "sequence": humanDNA.sequence,
            "dnaClass": class_mapping[humanDNA.dnaClass]
        })
    return results
}
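As a quick illustration, a call could look like the sketch below (run from an embedded Python session; it assumes findSimilarity is defined on dc.data.HumanDNA alongside the other data methods, and the input sequence is a made-up example):

import iris

# Hypothetical input sequence
matches = iris.cls("dc.data.HumanDNA").findSimilarity("ATGGCTAAGTTCGATCCGGAA")
for match in matches:
    print(match["dnaClass"], match["sequence"][:30])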

These processes are essential for the project as they enable the comparative analysis of DNA sequences and the identification of patterns or similarities among them.

Applied Machine Learning

Machine learning has been utilized to perform DNA sequence classification. The DNAClassification class contains methods for training a multinomial Naive Bayes classification model and for classifying DNA sequences based on this model.

The CreateTheModelML method is responsible for training the classification model. First, the DNA sequences are preprocessed and transformed into k-mer groups. These k-mer groups are then converted into a numerical representation using a bag-of-words vectorization (CountVectorizer over 4-grams). The multinomial Naive Bayes model is trained on the vectorized data, and after training its performance is evaluated with accuracy, precision, recall, and F1 score.

ClassMethod CreateTheModelML()
{
    import pandas as pd

    def getKmers(sequence, size=6):
        kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
        kmers = [kmer for kmer in kmers if len(kmer) > 0]
        if len(kmers) == 0:
            return [sequence]
        return kmers

    print("Reading dataset")
    human_df = pd.read_table('/opt/irisbuild/data/human_data.txt')

    print("Creating K-mers groups")
    human_df['K_mers'] = human_df['sequence'].apply(getKmers)
    
    human_df['words'] = human_df.apply(lambda x: getKmers(x['sequence']), axis=1)
    human_df = human_df.drop('sequence', axis=1)

    # Join each sequence's k-mers into a single space-separated "sentence"
    human_texts = [' '.join(words) for words in human_df['words']]
    # After dropping 'sequence', the first remaining column is the class label
    y_data = human_df.iloc[:, 0].values
    
    from sklearn.feature_extraction.text import CountVectorizer
    # Bag-of-words counts over 4-grams of the k-mer "sentences"
    cv = CountVectorizer(ngram_range=(4,4))
    X_human_dna = cv.fit_transform(human_texts)

    print(X_human_dna.shape)

    print("PREPARING FIT DATA")
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_human_dna, 
                                                        y_data, 
                                                        test_size = 0.20, 
                                                        random_state=42)
    
    print("FIT THE MODEL")
    from sklearn.naive_bayes import MultinomialNB
    classifier = MultinomialNB(alpha=0.1)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    print("VALIDATING THE MODEL")

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    print("Confusion matrix\n")
    print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y_pred, name='Predicted')))
    def get_metrics(y_test, y_predicted):
        accuracy = accuracy_score(y_test, y_predicted)
        precision = precision_score(y_test, y_predicted, average='weighted')
        recall = recall_score(y_test, y_predicted, average='weighted')
        f1 = f1_score(y_test, y_predicted, average='weighted')
        return accuracy, precision, recall, f1
    accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
    print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))


    print("SAVE THE MODEL")
    import joblib
    joblib.dump(classifier, '/opt/irisbuild/data/multinomial_nb_model.pkl')
    joblib.dump(cv, '/opt/irisbuild/data/cv_to_multinomial_nb_model.pkl')
}

The ClassifyDnaSequence method, on the other hand, classifies DNA sequences received as input. First, the sequence is preprocessed and transformed into k-mers. These k-mers are then vectorized with the same CountVectorizer used during training. The trained model is loaded and used to predict the class of the DNA sequence; the result includes the predicted class and the probabilities associated with each class.

ClassMethod ClassifyDnaSequence(sequence As %String) As %String [ Language = python ]
{
    import joblib

    # Load the trained model and the fitted CountVectorizer saved by CreateTheModelML
    classifier = joblib.load('/opt/irisbuild/data/multinomial_nb_model.pkl')
    cv = joblib.load('/opt/irisbuild/data/cv_to_multinomial_nb_model.pkl')

    def getKmers(sequence, size=6):
        kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
        kmers = [kmer for kmer in kmers if len(kmer) > 0]
        if len(kmers) == 0:
            return [sequence]
        return kmers
    k_mers = getKmers(sequence)

    k_mers_vec = cv.transform([' '.join(k_mers)])

    predicted_class = classifier.predict(k_mers_vec)[0]
    probabilities = classifier.predict_proba(k_mers_vec)[0]

    class_mapping = {
            0: 'G protein coupled receptors',
            1: 'tyrosine kinase',
            2: 'tyrosine phosphatase',
            3: 'synthetase',
            4: 'synthase',
            5: 'ion channel',
            6: 'transcription factor',
    }
    

    result = {
        "Classification": class_mapping[predicted_class],
        "Probabilities": {
            class_mapping[class_index]: float(probability)
            for class_index, probability in enumerate(probabilities)
        }
    }
    
    return result
}
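As with the similarity search, a call can be sketched as follows (hypothetical example sequence):

import iris

result = iris.cls("dc.data.DNAClassification").ClassifyDnaSequence("ATGGCTAAGTTCGATCCGGAA")
print(result["Classification"])  # predicted class name
print(result["Probabilities"])   # probability for each class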

Usage of the Application

Users interact with the system by sending DNA sequences as query parameters and receive a JSON-formatted response containing information about the similar sequences discovered and their classifications, simplifying the process of genetic analysis.

Below is the API to be used: http://localhost:52773/api/dna/find?&dna=<YOUR_DNA>

  • dna: Enter the DNA sequence you want to search

Example: http://localhost:52773/api/dna/find?&dna=ATGAACTGTCCAGCCCCTGTGGAGATCTCCT...

Postman can be used to visualize the API response (a browser or other applications such as Insomnia also work).
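A scripted alternative is sketched below, using the Python requests library with a hypothetical truncated sequence (it assumes the application is listening on localhost:52773 as above):

import requests

# Replace with a full DNA sequence
dna = "ATGAACTGTCCAGCC"
response = requests.get("http://localhost:52773/api/dna/find", params={"dna": dna})
print(response.json())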

Project: GitHub

Voting: DNA-similarity-and-classify

InterSystems Official
· May 15, 2024

InterSystems announces InterSystems IRIS 2024.1.0.267.2, includes support for Ubuntu 24.04

InterSystems is pleased to announce the general availability of:

  • InterSystems IRIS Data Platform 2024.1.0.267.2
  • InterSystems IRIS for Health 2024.1.0.267.2
  • HealthShare Health Connect 2024.1.0.267.2

This release adds support for the Ubuntu 24.04 operating system. Ubuntu 24.04 includes Linux kernel 6.8 along with security, installer, and user-interface improvements. InterSystems IRIS IntegratedML is not yet available on Ubuntu 24.04.

Additionally, this release addresses two defects for all platforms:

  • A fix for some SQL queries using “NOT %INLIST” returning incorrect results. We previously issued an alert on this error.
  • A fix for incomplete stack traces in certain circumstances. 

How to get the software

As usual, Extended Maintenance (EM) releases come with classic installation packages for all supported platforms, as well as container images in Docker format. For a complete list, refer to the Supported Platforms page.

Classic installation packages

Installation packages are available from the WRC's Extended Maintenance Releases page. Kits can also be found on the Evaluation Services website.

Containers

Container images for both Enterprise and Community Editions of InterSystems IRIS and IRIS for Health and all corresponding components are available from the InterSystems Container Registry web interface.

Containers are tagged as both "2024.1" and "latest-em".

Question
· May 15, 2024

End of File handling in Cache and IRIS

In Caché, end of file throws an error, but in IRIS there is no indication of end of file; I have to do an explicit $ZEOF check. How are you handling/detecting end of file in IRIS?

In Caché this line will throw an end-of-file error:  F PREC=1:1 U FILE R REC  D SOMETHING

But in IRIS this loops forever. Has anyone noticed this behaviour in IRIS?

Question
· May 15, 2024

Modifying a dynamic object coming from JSON

Hello,

I would like to create a generic method to modify the properties of a dynamic class. When I do a JSONImport() it works very well for some objects, but in the case of an object containing a list it adds an extra element instead of modifying it. I tried checking the type while iterating over the JSON in order to do an Insert, but I cannot manage to use $METHOD on the Insert or even on %Set.

Here is the class:

Class Epc.conf.poste Extends (%Persistent, %JSON.Adaptor)
{

Property NomPoste As %String(%JSONFIELDNAME = "NomPoste", MAXLEN = "");
Property tPrinter As list Of EpErp.conf.printer(%JSONFIELDNAME = "tPrinter", SQLPROJECTION = "table/column", STORAGEDEFAULT = "array");
}
Class EpErp.conf.printer Extends (%SerialObject, %JSON.Adaptor)
{

Property Service As %String(%JSONFIELDNAME = "NomService", MAXLEN = "");
Property Print As %String(%JSONFIELDNAME = "NomImprimante", MAXLEN = "");
}

My current data:

The file I send in order to modify my list (I want to modify one of the elements of the list):

{
	"NomPoste":"Poste",
	"tPrinter":
	[
		{
			"NomService":"Petite",
			"NomImprimante":"EPFR_12"
		},
		{
			"NomService":"Grande",
			"NomImprimante":"Modif"
		}
	]
}

My generic update method:

ClassMethod update()
{
    set payload = {}.%FromJSON(%request.Content)
    set itr = payload.%GetIterator()

    if ##class(%Dictionary.ClassDefinition).%ExistsId(className) {
        set class = $CLASSMETHOD(className, "%OpenId", id)
        if $ISOBJECT(class) {
            if payload '= "" {
                while itr.%GetNext(.prop, .val) {
                    if prop.%IsA("%DynamicArray") {
                        ?
                    } else {
                        Set $PROPERTY(class, prop) = val
                    }
                    do class.%Save()
                }
            }
        }
    }
}

Thank you in advance for your help.

Article
· May 15, 2024 · 3 min read

Simple low-code demo that transforms HL7 SIU messages to Kafka, then consumes the Kafka messages to store them in IRIS via SQL

Production Configuration

This demo features an interoperability production with 16 items.

HL7 + Kafka Producer Production Configuration

The first part of this demo consists of sending an HL7 SIU file, which is forwarded to the two other HL7 flows (HTTP and TCP) and transformed and sent to the Kafka server. The HTTP and TCP flows transform the HL7 messages in the same way before also sending them to Kafka.

  • 3 HL7 services
  • 1 HL7 router
  • 2 HL7 operations
  • 1 business operation that sends the transformed messages to Kafka

Business Rules

The production has a business process with an HL7 router, which transforms HL7 messages and sends them to Kafka.

Data Transformation

The Data Transformation Builder lets you edit the definition of a transformation from HL7v2 SIU sources to Kafka messages.

 

Visual Trace

After an HL7 message has been processed (i.e., after copying some messages from /data/HL7/test to the /data/HL7/in directory), you can see its visual trace.

Here you can see the message with its I/O and the HL7 ACK.

Kafka Manager

You can then check the messages in Kafka by using the Kafka Manager interface and fetching data from the different topics.

 

And the contents of a topic:

Kafka Consumer + IRIS SQL Production Configuration

 

The second part of this demo consists of consuming Kafka messages and routing them to IRIS tables through SQL components.

  • 3 Kafka services consuming 3 Kafka topics
  • 1 router
  • 3 SQL operations inserting data into the IRIS database

Business Rules

The production has a business process with a Kafka router, which sends Kafka messages to the IRIS SQL components.

Visual Trace

Each time a Kafka topic is consumed, the message is sent to the Kafka router process, which performs content-based routing of the Kafka messages to the appropriate SQL tables in IRIS. If you look closely at the messages, you will notice that each message is sent directly to IRIS without being transformed (same message ID).

  

 

Here you can see the message with its I/O and the result of the SQL insert.

SQL

You can then see the results inside the IRIS database through SQL queries.

  • TrakCare table

  • Surg table

  • And thanks to inheritance, you can also query all the data simply by querying the root table, here data.kafka

ClassExplorer

 

ClassExplorer lets you view the data model of the IRIS classes.

Default Settings

To simplify copying a production definition from one environment to another and to guarantee a clean separation between the settings of different environments, it is recommended to define settings outside the production class, in the System Default Settings.

You will then see those settings in blue in the production configuration.

Prerequisites

 

Make sure you have git and Docker Desktop installed.

Installation: ZPM

Open an IRIS namespace with Interoperability enabled. Open the Terminal and call: USER>zpm "install hl7v2-to-kafka"

Installation: Docker

1. Clone/git pull the repo into any local directory

$ git clone https://github.com/SylvainGuilbaud/hl7v2-to-kafka.git

2. Open a terminal in this directory and run:

$ docker-compose build

3. Run the IRIS container with your project:

$ docker-compose up -d

How to Run the Sample

  1. Copy some HL7 messages from /data/HL7/test to /data/HL7/in
  2. Check the Visual Trace
  3. See a full trace
  4. Go to Kafka Manager and fetch data from the different topics.