Announcement
· May 15, 2024

iris-medicopilot video announcement

Hello developers, 

Our project was designed to optimize patient clinical outcomes by reducing hospitalization time and supporting the development of resident and novice physicians. Additionally, it contributes to lowering financial waste in the healthcare system by improving the monitoring of pregnant patients, thereby decreasing risks and enhancing their safety.

Using the most accessible tool, the smartphone, was the obvious choice to make patients' lives easier.

Acknowledgment

Once again, we would like to thank the community for its support of each of our applications.

If you found our app interesting and it gave you some insight, please vote for iris-medicopilot and help us on this journey!

Question
· May 15, 2024

FHIR Profile validation Java issues

Hello Community,

I'm trying to run FHIR R4 profile validation with the request below and got an error. How can I fix it?

http://localhost:52773/csp/healthshare/learn1/fhir/r4/Patient/$validate?profile=http://examples.com/fhir/StructureDefinition/aktestpatient|4.0.1

Article
· May 15, 2024 · 9 min read

DNA Similarity and Classification: An Analysis

DNA Similarity and Classification was developed as a REST API utilizing InterSystems Vector Search technology to investigate genetic similarities and efficiently classify DNA sequences. This is an application that utilizes artificial intelligence techniques, such as machine learning, enhanced by vector search capabilities, to classify genetic families and identify known similar DNAs from an unknown input DNA.

K-mer Analysis: Fundamentals in DNA Sequence Analysis

Fragmentation of a DNA sequence into k-mers is a fundamental technique in genetic data processing. This approach involves breaking down the DNA sequence into smaller subsequences of fixed size, known as k-mers. For example, if we take the sequence "ATCGTAGCTA" and define k as 3, we would obtain the following k-mers: "ATC", "TCG", "CGT", "GTA", "TAG", "AGC", "GCT", and "CTA".

This process is carried out by sliding a window of size k along the sequence, extracting each subsequence of size k, and recording it as a k-mer. The window then moves to the next set of bases, allowing overlap between the k-mers. This overlap is crucial to ensure that no information is lost during the fragmentation process.
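The sliding-window step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own helper: the function name sliding_kmers is hypothetical, and the input is the example sequence from the text.

```python
def sliding_kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 windows (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = sliding_kmers("ATCGTAGCTA", 3)
print(kmers)
# ['ATC', 'TCG', 'CGT', 'GTA', 'TAG', 'AGC', 'GCT', 'CTA']
```

Because the window advances one base at a time, consecutive k-mers share k - 1 bases, which is the overlap the paragraph refers to.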

The resulting k-mers retain crucial local information about the structure of the DNA sequence. This is critical because many biologically relevant pieces of information are contained within specific regions of DNA. Subsequent analysis of these k-mers can reveal patterns, identify conserved or similar regions between different DNA sequences, and assist in the investigation of genetic functions and interactions.

In summary, DNA Similarity and Classification offers an approach to identify genetic similarities and classify DNA sequences.

Application

The following code initializes the project, performing the necessary setup before building:

ClassMethod init()
{
    write "START INIT"
    Do ##class(dc.data.DNAClassification).CreateTheModelML()
    Do ##class(dc.data.HumanDNA).UploadData()
}

The UploadData method implemented in the code is responsible for fragmenting DNA sequences into k-mers, converting these k-mers into vectors using a specific encoding model, and then storing these vectors in the database.

print("Reading dataset")
human_df = pd.read_table('/opt/irisbuild/data/human_data.txt')
human_df = human_df.sample(n=1500, random_state=42)

To enhance performance and reduce processing time during the project's build phase, the data is sampled: only a random sample of 1,500 records is kept, drawn with a fixed seed (random_state=42) so that the build is reproducible. While this strategy speeds up the process, analyzing the whole file could provide a more thorough understanding of the data. This trade-off was accepted to keep the build fast.

def getKmers(sequence, size=6, max_length=5000):
    kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
    kmers = [kmer for kmer in kmers if len(kmer) > 0]
    if len(kmers) == 0:
        return [sequence]
    return kmers[:max_length]

The max_length parameter in getKmers caps the number of k-mers kept per sequence (the first 5,000 by default) to limit the runtime of the process; it does not change the size of each k-mer, which is set by size. Note that this truncation means long sequences are only partially represented.
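To see what the cap actually bounds, the sketch below reuses the same windowing logic as getKmers on a synthetic sequence. The name get_kmers and the 8,000-base input are assumptions made for illustration only.

```python
def get_kmers(sequence, size=6, max_length=5000):
    # Same logic as the article's getKmers: overlapping windows, lowercased,
    # then truncated to at most max_length k-mers.
    kmers = [sequence[x:x + size].lower() for x in range(len(sequence) - size + 1)]
    if not kmers:
        return [sequence]
    return kmers[:max_length]


long_sequence = "ACGT" * 2000        # 8,000 bases, synthetic
kmers = get_kmers(long_sequence)

print(len(kmers))                    # 5000: the cap applies to the k-mer count
print({len(k) for k in kmers})       # {6}: each k-mer keeps its full size
# Only the first 5000 + 6 - 1 = 5005 bases contribute; the rest are dropped.
```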

    print("Creating K-mers groups")
    human_df['K_mers'] = human_df['sequence'].apply(getKmers)

    print("Combining K-mers into strings")
    human_df['K_mers_str'] = human_df['K_mers'].apply(lambda x: ' '.join(x))

    print("Download stsb-roberta-base-v2 model")
    model = SentenceTransformer('stsb-roberta-base-v2')

    print("Encode K_mers")
    embeddings = model.encode(human_df['K_mers_str'].tolist(), normalize_embeddings=True)

    print("Creating column sequence_vectorized")
    human_df['sequence_vectorized'] = embeddings.tolist()

The code vectorizes DNA sequences using the encoding model stsb-roberta-base-v2 from the SentenceTransformer library. This step converts DNA sequences into numerical vectors, facilitating their vector manipulation and computational analysis. After vectorization, the data is stored in the database for future use. This step is crucial to ensure that DNA sequence vectors are available for queries and additional analyses.

Subsequently, the method findSimilarity is responsible for finding the most similar DNA sequences based on a received DNA sequence input. It fragments the DNA sequence into k-mers and queries the database to find the five closest DNA sequences.
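The ranking idea behind this query can be sketched without a database. Because the embeddings are created with normalize_embeddings=True, every vector has unit length, so the dot product equals the cosine similarity and a higher value means a closer match. The toy 3-dimensional vectors and the helper names below are assumptions for illustration; they are not the project's data or the IRIS implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, stored, k=5):
    """Return the ids of the k stored vectors with the highest dot product."""
    scored = sorted(stored.items(), key=lambda item: dot(query, item[1]), reverse=True)
    return [row_id for row_id, _ in scored[:k]]


# Toy "embeddings"; real sentence embeddings have hundreds of dimensions.
stored = {
    1: normalize([1.0, 0.0, 0.0]),
    2: normalize([0.9, 0.1, 0.0]),
    3: normalize([0.0, 1.0, 0.0]),
    4: normalize([-1.0, 0.0, 0.0]),
}
query = normalize([1.0, 0.02, 0.0])

print(top_k(query, stored, k=2))   # [1, 2]
```

Sorting in descending order mirrors the ORDER BY ... DESC clause of a dot-product query: the best matches come first, and a vector pointing the opposite way (id 4 here) ranks last.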

ClassMethod findSimilarity(pSequence As %String) As %String [ Language = python ]
{
    import iris
    from sentence_transformers import SentenceTransformer

    results = []

    def getKmers(sequence, size=6, max_length=5000):
        kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
        kmers = [kmer for kmer in kmers if len(kmer) > 0]
        if len(kmers) == 0:
            return [sequence]
        return kmers[:max_length]
        

    model = SentenceTransformer('stsb-roberta-base-v2') 
    kmers = getKmers(pSequence)
    kmers_str = ' '.join(kmers)
    search_vector = model.encode(kmers_str, normalize_embeddings=True).tolist()

    stmt = iris.sql.prepare("SELECT TOP 5 ID FROM dc_data.HumanDNA ORDER BY VECTOR_DOT_PRODUCT(kMersVector, TO_VECTOR(?)) DESC ")
    rs = stmt.execute(str(search_vector))

    class_mapping = {
            0: 'G protein coupled receptors',
            1: 'tyrosine kinase',
            2: 'tyrosine phosphatase',
            3: 'synthetase',
            4: 'synthase',
            5: 'ion channel',
            6: 'transcription factor',
    }

    for idx, row in enumerate(rs):
        humanDNA = iris.cls("dc.data.HumanDNA")._OpenId(row[0])    

        results.append({
            "sequence": humanDNA.sequence,
            "dnaClass": class_mapping[humanDNA.dnaClass]
        })
    return results
}

These processes are essential for the project as they enable the comparative analysis of DNA sequences and the identification of patterns or similarities among them.

Applied Machine Learning

Machine learning has been utilized to perform DNA sequence classification. The DNAClassification class contains methods for training a multinomial Naive Bayes classification model and for classifying DNA sequences based on this model.

The CreateTheModelML method is responsible for training the classification model. First, DNA sequences are preprocessed, transformed into k-mers, and joined into one space-separated string per sequence. Then, these strings are converted into a numerical representation with a CountVectorizer that counts 4-grams of consecutive k-mers. The multinomial Naive Bayes model is trained on 80% of the vectorized data, and its performance is evaluated on the remaining 20% using accuracy, precision, recall, and F1 score.
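As a rough picture of the vectorization step, a CountVectorizer configured with ngram_range=(4,4) counts runs of four consecutive words, where each word here is one k-mer. The plain-Python sketch below approximates that counting on a short made-up sequence; it ignores scikit-learn's tokenization details and vocabulary indexing, and word_ngrams is a hypothetical helper.

```python
from collections import Counter


def word_ngrams(text: str, n: int = 4) -> Counter:
    """Count runs of n consecutive space-separated words in text."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


# K-mers (k=6) of a short made-up sequence, joined into one "sentence".
sequence = "atgcatgcatgc"
kmers = [sequence[i:i + 6] for i in range(len(sequence) - 6 + 1)]
counts = word_ngrams(" ".join(kmers), n=4)

for feature, count in counts.items():
    print(count, feature)
```

Each distinct 4-gram becomes one feature, and its count is the value the multinomial Naive Bayes model sees for that sequence.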

ClassMethod CreateTheModelML() [ Language = python ]
{
    import pandas as pd

    def getKmers(sequence, size=6):
        kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
        kmers = [kmer for kmer in kmers if len(kmer) > 0]
        if len(kmers) == 0:
            return [sequence]
        return kmers

    print("Reading dataset")
    human_df = pd.read_table('/opt/irisbuild/data/human_data.txt')

    print("Creating K-mers groups")
    human_df['K_mers'] = human_df['sequence'].apply(getKmers)
    
    human_df['words'] = human_df.apply(lambda x: getKmers(x['sequence']), axis=1)
    human_df = human_df.drop('sequence', axis=1)

    human_texts = list(human_df['words'])
    for item in range(len(human_texts)):
        human_texts[item] = ' '.join(human_texts[item])
    y_data = human_df.iloc[:, 0].values        
    
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(ngram_range=(4,4))
    X_human_dna = cv.fit_transform(human_texts)

    print(X_human_dna.shape)

    print("PREPARING FIT DATA")
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_human_dna, 
                                                        y_data, 
                                                        test_size = 0.20, 
                                                        random_state=42)
    
    print("FIT THE MODEL")
    from sklearn.naive_bayes import MultinomialNB
    classifier = MultinomialNB(alpha=0.1)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    print("VALIDATING THE MODEL")

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    print("Confusion matrix\n")
    print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y_pred, name='Predicted')))
    def get_metrics(y_test, y_predicted):
        accuracy = accuracy_score(y_test, y_predicted)
        precision = precision_score(y_test, y_predicted, average='weighted')
        recall = recall_score(y_test, y_predicted, average='weighted')
        f1 = f1_score(y_test, y_predicted, average='weighted')
        return accuracy, precision, recall, f1
    accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
    print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))


    print("SAVE THE MODEL")
    import joblib
    joblib.dump(classifier, '/opt/irisbuild/data/multinomial_nb_model.pkl')
    joblib.dump(cv, '/opt/irisbuild/data/cv_to_multinomial_nb_model.pkl')
}

The ClassifyDnaSequence method, in turn, classifies a DNA sequence received as input. First, the sequence is preprocessed and transformed into k-mers. Then, these k-mers are vectorized with the same fitted vectorizer that was used during model training. The trained model is loaded and used to predict the class of the DNA sequence. The classification result includes the predicted class and the probabilities associated with each class.

ClassMethod ClassifyDnaSequence(sequence As %String) As %Status [ Language = python ]
{
    import joblib
    from sklearn.feature_extraction.text import CountVectorizer

    classifier = joblib.load('/opt/irisbuild/data/multinomial_nb_model.pkl')
    cv = joblib.load('/opt/irisbuild/data/cv_to_multinomial_nb_model.pkl')

    def getKmers(sequence, size=6):
        kmers = [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
        kmers = [kmer for kmer in kmers if len(kmer) > 0]
        if len(kmers) == 0:
            return [sequence]
        return kmers
    k_mers = getKmers(sequence)

    k_mers_vec = cv.transform([' '.join(k_mers)])

    predicted_class = classifier.predict(k_mers_vec)[0]
    probabilities = classifier.predict_proba(k_mers_vec)[0]

    class_mapping = {
            0: 'G protein coupled receptors',
            1: 'tyrosine kinase',
            2: 'tyrosine phosphatase',
            3: 'synthetase',
            4: 'synthase',
            5: 'ion channel',
            6: 'transcription factor',
    }
    

    result = {
        "Classification": class_mapping[predicted_class],
        "Probabilities": {
            class_mapping[class_index]: float(probability)
            for class_index, probability in enumerate(probabilities)
        }
    }
    
    return result
}

Usage of the Application

Users interact with the system by sending DNA sequences through queries and can request a JSON-formatted response containing information about the discovered similar sequences and their classifications, simplifying the process of genetic analysis.

Below is the API to be used: http://localhost:52773/api/dna/find?&dna=<YOUR_DNA>

  • dna: Enter the DNA sequence you want to search

Example: http://localhost:52773/api/dna/find?&dna=ATGAACTGTCCAGCCCCTGTGGAGATCTCCT...
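A request URL like the one above can also be built from code. The sketch below uses only the Python standard library; build_find_url is a hypothetical helper, the sequence is a short placeholder, and the generated URL omits the stray "&" after "?", which is assumed to make no difference to the server. The shape of the JSON response is not specified here, so the request itself is left commented out.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:52773/api/dna/find"  # endpoint from the article


def build_find_url(dna: str, base_url: str = BASE_URL) -> str:
    """Return the request URL with the sequence passed as the dna parameter."""
    return f"{base_url}?{urlencode({'dna': dna})}"


url = build_find_url("ATGC")  # placeholder sequence
print(url)  # http://localhost:52773/api/dna/find?dna=ATGC

# Sending the request needs a running instance, so it is left commented out:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     print(json.load(response))
```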

The API response can be viewed in Postman; a browser or another client such as Insomnia works as well.

Project: GitHub

Voting: DNA-similarity-and-classify

InterSystems Official
· May 15, 2024

InterSystems announces InterSystems IRIS 2024.1.0.267.2, includes support for Ubuntu 24.04

InterSystems is pleased to announce the general availability of:

  • InterSystems IRIS Data Platform 2024.1.0.267.2
  • InterSystems IRIS for Health 2024.1.0.267.2
  • HealthShare Health Connect 2024.1.0.267.2

This release adds support for the Ubuntu 24.04 operating system. Ubuntu 24.04 includes Linux kernel 6.8 and security improvements, along with installer and user interface improvements. InterSystems IRIS IntegratedML is not yet available on Ubuntu 24.04.

Additionally, this release addresses two defects for all platforms:

  • A fix for some SQL queries using “NOT %INLIST” returning incorrect results. We previously issued an alert on this error.
  • A fix for incomplete stack traces in certain circumstances. 

How to get the software

As usual, Extended Maintenance (EM) releases come with classic installation packages for all supported platforms, as well as container images in Docker container format. For a complete list, refer to the Supported Platforms page.

Classic installation packages

Installation packages are available from the WRC's Extended Maintenance Releases page. Kits can also be found on the Evaluation Services website.

Containers

Container images for both Enterprise and Community Editions of InterSystems IRIS and IRIS for Health and all corresponding components are available from the InterSystems Container Registry web interface.

Containers are tagged as both "2024.1" and "latest-em".

Question
· May 15, 2024

End of File handling in Cache and IRIS

In Cache, reaching end of file throws an error, but in IRIS there is no indication of end of file, so I have to check $ZEOF explicitly. How are you handling or detecting end of file in IRIS?

In Cache, this line throws an end-of-file error: F PREC=1:1 U FILE R REC  D SOMETHING

But in IRIS it loops forever. Has anyone else noticed this behaviour in IRIS?
