Molecular analysis of 16s-rRNA and associated gene segments for identification of probiotic phenotypes

© The Author 2023. This work is licensed under a Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) that allows others to share and adapt the material for any purpose (even commercially), in any medium with an acknowledgement of the work's authorship and initial publication in this journal.


Introduction
The role of beneficial bacteria in various diseases and conditions, such as alcoholic liver injury, asthma, allergic rhinitis, hepatocellular carcinoma, and multidrug resistance, has been investigated previously [1]-[4]. This study focuses on five genera of bacteria: Bacillus, Bifidobacterium, Lactobacillus, Lactococcus, and Streptococcus. These bacteria were classified according to their genome and their metabolism, specifically whether they are homofermentative or heterofermentative. Bacteria that produce a single fermentation product were classified as homofermentative, such as L. acidophilus, which produces lactic acid. Bacteria that produce more than one fermentation product were classified as heterofermentative, such as L. brevis, which produces lactic acid and ethanol.
A large number of studies have investigated the safety of probiotics, particularly those containing Lactobacillus species, with most showing either a positive or no influence on human health [5]-[7]. Potential negative effects of probiotics are still being explored, and while they are generally considered to be beneficial to health, they may be detrimental to individuals with compromised immunity. In such cases, the introduction of probiotics to the human gut microbiome may result in sepsis, bacteraemia and even death [8], [9]. Some reports have suggested an association between lactobacillemia in three AIDS patients and the intake of Lactobacillus rhamnosus GG, although further research is required to fully ascertain the impact of probiotics on human health [10], [11].
Analysis of 16s-rRNA genes remains a central method in microbiology, serving both to explore microbial diversity and as a day-to-day tool for bacterial identification. Such identification techniques are generally easier to interpret than molecular phylogenetic analyses and are often preferred when the groups are well understood. Since the seminal work of Carl R. Woese et al. in 1977 [12], research on 16s-rRNA has continued to grow in popularity, with 35577 publications appearing in PubMed in the five years preceding this project [13]. Recent studies utilizing 16s-rRNA have included the molecular identification of clinical Nocardia isolates, novel identification, sample screening, and other applications [14]-[17].

Method
Details of the materials used in this project are available on request.

Bacterial growth and enumeration
Bacteria were grown in an incubator (Innova 42 Incubator Shaker Series, Eppendorf North America, USA) set to 37°C for 24 h. Three media were employed for probiotic growth: the standard DeMan, Rogosa and Sharpe (MRS) medium for Lactobacillus; lysogeny broth (LB) as a control medium for verifying the sterility of the environment; and HHD agar for separating and enumerating bacteria according to their fermentation type. HHD agar contains bromcresol green, a pH indicator that responds to the pH changes caused by bacterial fermentation. A heterofermentative bacterium does not alter the colour of the medium, which remains blue, whereas a homofermentative bacterium turns the medium green if the pH drops below 3.8; the medium remains blue at a pH of about 5.4.
A four-step protocol was used to carry out the growth, separation, and enumeration of bacteria. First, 0.5 g of the bacterial mix from a capsule was dissolved in 20 ml of the medium and incubated for 24 h at 37°C. This yielded a mix of different probiotics, which was then transferred to agar plates (LB, MRS, HHD). The plates were incubated for 24 h at 37°C, after which one colony was picked and transferred to fresh medium for a further 24 h. This enabled enumeration of a single colony. Additionally, three known bacteria (two different strains of L. plantarum and B. subtilis) were grown, and four random colonies were taken from the different plates.

DNA isolation, primer design and polymerase chain reaction (PCR)
Genomic bacterial DNA was isolated using a commercial bacterial kit (details available on request). Quality control was performed with a Multiskan™ GO Microplate Spectrophotometer (Thermo Scientific™, United States), and the isolated DNA was subsequently used for PCR. Primers were designed on the basis of previous research [18]-[21]. These primers (details available on request) were tested with the SILVA TestPrime tool, and those with the highest coverage and specificity were chosen [19]. Coverage measures the proportion of sequences within one taxonomic unit that the primer matches, relative to all matched and mismatched sequences in that unit, with higher values indicating more favorable primers for that taxonomic unit. Similarly, specificity indicates how accurately the primer fits the overall sequence in the database, with higher values indicating better precision. A gradient PCR, which allows a different temperature in each well, was then used to determine the optimal annealing temperature, which was 56 °C. PCR was run on the StepOnePlus™ Real-Time PCR System (Germany); the reaction setup and thermal cycler settings are available on request.
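The coverage value described above can be illustrated with a short calculation. The following Python sketch assumes that coverage is the percentage of matched sequences among all matched and mismatched sequences in a taxonomic unit; the function name and the example counts are hypothetical and are not taken from the primer evaluation performed in this study.

```python
def primer_coverage(matched: int, mismatched: int) -> float:
    """Return coverage as a percentage of sequences matched by the primer.

    Assumption: coverage = matched / (matched + mismatched) * 100
    within a single taxonomic unit.
    """
    total = matched + mismatched
    if total == 0:
        raise ValueError("at least one sequence is required")
    return 100.0 * matched / total


if __name__ == "__main__":
    # Hypothetical counts for one taxonomic unit (illustration only).
    print(primer_coverage(matched=90, mismatched=10))
```

Under this reading, a primer pair that matches more of the sequences in the target group receives a higher coverage and is preferred for that group.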

Sequencing and sequence alignment tools
Sequencing was performed by MedSankTek (Turkey) using the Sanger method with a single read. The samples were purified before sequencing. Bioinformatic analysis was conducted to assess sequencing quality and nucleotide identification using the Phred score. This score is a numerical representation of the probability of a nucleotide being read incorrectly and is calculated on a logarithmic scale using Formula 1, where Q is the Phred quality score and P is the probability of an incorrect base call. Quality values were further characterized from the Phred website. Two software programs, Phred and CodonCode Aligner, were utilized to facilitate the analysis.
$$Q = -10 \log_{10} P \quad \text{or} \quad P = 10^{-Q/10}$$
Formula 1: Phred score calculation
After performing sequence alignment, the next step was to identify the bacterial sequence.
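Formula 1 can be checked numerically. The following Python sketch converts between a Phred score Q and the error probability P; the function names are illustrative and are not part of the Phred or CodonCode Aligner software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability from a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score from an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability for a few common quality scores.
    for q in (10, 20, 30, 40):
        print(f"Q = {q}: P = {phred_to_error_prob(q):.4f}")
```

Each increase of 10 in Q corresponds to a tenfold decrease in the probability of an incorrect base call, so Q = 20 means one expected error in 100 bases and Q = 30 one in 1000.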
The efficacy of four sequence identification algorithms was tested for this project. The algorithms, BLASTn, RDP, USEARCH, and VSEARCH, were divided into two categories: those with a graphical interface and those without. They were then graded according to ease and speed of use. BLASTn and RDP feature graphical interfaces, while USEARCH and VSEARCH have to be run through batch commands or other command-line scripting.
BLASTn is a widely used program for sequence comparison and has a user-friendly interface and access to a large database. It is suitable for short sequences and evaluates alignments through parameters like maximum score, total score, e-value, percentage identity, and accession number. The speed of BLAST is affected by the size of the job and time of day, with completion time ranging from seconds to days. For Europeans, the optimal time to run a BLAST job is 6-12 am.
The Ribosomal Database Project (RDP) contains 2.8 million annotated sequences from bacteria, archaea, and fungi. It provides a Hierarchical Browser, Classifier, Probe Match, FunGene, Library Compare, Sequence Match, RDPipeline, Aligner, and Tree Builder. It provides a seqmatch score to reflect the number of oligomers shared between two sequences. RDP suffers from the same problems as BLAST, with time to execution dependent on server load. USEARCH and VSEARCH are alternatives with faster execution and wider database use.
USEARCH and VSEARCH are efficient because they combine multiple algorithms. Both can be driven from Bash and Perl scripts. USEARCH is open source only in its 32-bit version, whereas VSEARCH is open source with no memory limit. VSEARCH is based on the USEARCH method, which compares "words" to the query to find similar sequences. Six databases were used for comparison: HOMD, NCBI, SILVA, GreenGenes, RDP, and prokMSA, of which prokMSA is the largest. Both programs were run in parallel on a Linux OS with 4 GB of RAM.
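The idea of comparing "words" shared between a query and database sequences can be illustrated with a simple k-mer count. The following Python sketch is a simplified illustration only and is not the actual USEARCH or VSEARCH implementation; the function names, the word length k, and the toy sequences are assumptions made for the example.

```python
def kmers(seq: str, k: int = 8) -> set[str]:
    """Return the set of distinct words (k-mers) of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_word_count(query: str, target: str, k: int = 8) -> int:
    """Count the distinct k-mers that occur in both sequences."""
    return len(kmers(query, k) & kmers(target, k))


def rank_by_shared_words(query: str, database: dict[str, str],
                         k: int = 8) -> list[tuple[str, int]]:
    """Rank database entries by the number of words shared with the query."""
    scores = [(name, shared_word_count(query, seq, k))
              for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences (hypothetical, for illustration only).
    toy_db = {"seq_a": "ACGTACGTTGCA", "seq_b": "TTTTGGGGCCCC"}
    print(rank_by_shared_words("ACGTACGT", toy_db, k=4))
```

Database sequences that share many words with the query are likely to be similar to it, which is why word counting is a fast way to select candidates before a more expensive alignment is computed.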

Cultivation of bacteria
As described in the Method section, the bacteria were cultivated in three separate media. These media supported the cultivation of probiotics, with the best growth observed in MRS medium.

DNA isolation and PCR
DNA isolation with the provided kit took 60 minutes. The isolated DNA was visualized by gel electrophoresis. Table 1 reports the purity and overall concentration of the isolated DNA. After quality assurance, PCR was performed in a volume of 30 µl, with an approximate total reaction time of 1.1 h. To visualize the amplicons, 6x loading dye was added to the PCR product at a ratio of 1:5, that is, 1 µl of dye for every 5 µl of PCR product.
Amplification of the target regions and the suitability of the annealing temperature were confirmed, after which the PCR product was purified with the gel extraction kit, as outlined in Table 1. The purified DNA was then sent for sequencing.

Sequencing and bioinformatic analysis
Seven days after the samples were sent, the full sequences were received and processed with CodonCode Aligner (available on request). Table 2 provides a synopsis of the acquired sequences. Sequence quality was confirmed by evaluating read quantity and quality. The majority of nucleotide reads were of satisfactory quality, with the exception of the sample labelled PL. Sequence alignment was then performed, and the results are grouped according to the tool used.

BLASTn and ribosomal database project
BLASTn was the first tool tested in this study. Twenty-one sequences were obtained, of which twenty were successfully identified; the only exception was PL, which could not be identified because of its short length. As BLASTn queries the NCBI database, only a single result was returned. If the e-value is lower than 5e-120, BLAST reports it as 0. The average time required was 00:04:43,37, which is longer than for the other algorithms tested in this study. The average time required to complete each sequence on RDP was 00:00:35,92. Obtaining results from the online RDP tool therefore took very little time compared with NCBI BLASTn, which was substantially slower.

USEARCH
USEARCH was the first tool tested that lacks a graphical user interface. The results were stratified by database; of the six available databases, results were obtained for GreenGenes, HOMD, NCBI, and SILVA. The other two databases, RDP and prokMSA, could not be processed because of the constraints of the 32-bit version of the software.

VSEARCH
VSEARCH was the second tool without a graphical interface. The results were segmented by source database, and access to the 64-bit version allowed all six databases to be analysed, including prokMSA. The species of each sequence was identified by determining which organism had the highest overall number of hits. This score was divided by the maximum score (12) and multiplied by 100%. If the resulting value was higher than 50%, it was considered a positive confirmation. Based on the 21 results obtained, the most common results are provided in Table 6.
The primary challenge in this project was the accurate separation of bacterial colonies. Colonies were obtained, but they were selected at random for further characterization, which made it difficult to ascertain whether the chosen colonies came from the same or from different bacteria, even when selective characterization media were used. This difficulty is shared by laboratories worldwide. After isolation, the DNA was sent to Turkey for sequencing. Transporting the DNA, which must be kept on ice, posed a challenge because temperature fluctuations and vibrations during transport could degrade the sample. Fortunately, this did not occur, and 21 sequences from amplicons of either 350 or 500 bp were obtained. Initially, the whole 1500 bp gene was isolated and amplified, but owing to a lack of resources it was not sequenced. Consequently, the decision was made to focus only on the V3-V4 region.
The obtained sequences were used for a comprehensive bioinformatic analysis, as detailed in the results above. This analysis revealed that the V3 and V4 hyper-variable regions of the 16s-rRNA can be employed to predict bacterial identity; however, the accuracy is highly dependent on the sequence of the region, which varies among bacteria. For instance, B. subtilis has a high chance of being identified by the V3-V4 region, whereas L. rhamnosus is often confused with L. plantarum and L. paraplantarum because the V3 and V4 regions of all three bacteria are more than 90% identical in sequence, which leaves the software unable to tell them apart. When this project was initially designed, the available research suggested that the V3-V4 regions were better for comparison than the V1-V2 regions [22]-[24]. The results obtained from this analysis indicate that the use of these regions is not recommended for the identification of bacteria.
The precision of these algorithms and the speed at which they can be run are also worth noting. Algorithms with graphical interfaces are generally easier to use and are employed by biologists with limited programming knowledge. For these online tools, the time required to complete a task depends on the number of other users running jobs at the same time. Local software can be used if there are any issues with the online software, as it does not require an internet connection so long as the relevant databases and sequences are stored locally. The disadvantage of local programs is that they are limited by the capabilities of the computer they run on and will run more quickly on better machines.
Based on the results obtained and discussed, several recommendations can be made for future studies. Firstly, the overall size of the analysed region should be increased, either by taking three or more hypervariable regions of 16s-rRNA or by sequencing the entire gene, which was the original plan for this project. However, because of the issues encountered when performing PCR and the need to send samples abroad for sequencing, multiple repetitions of the experiment were not possible. For this type of study, a sequencing device should be available in the institution or country, which enables overnight sequencing and allows mistakes to be avoided or corrected. Secondly, it is recommended to use more software and algorithms with a higher percentage of sequence available, combined with increased working memory. At least 32 GB of RAM should be available to start this type of analysis, while 64 GB is preferable for smooth functioning. Additionally, it is suggested to use software such as FASTCAR, GASSST and Genoogle.

Conclusion
This study assessed the growth and DNA extraction from Lactobacillus, Bacillus, Lactococcus, Bifidobacterium, and Streptococcus colonies in MRS medium. The mean DNA concentration was 17.17 ng/µl (standard deviation of 0.416) with a DNA purity of 1.75 (standard deviation of 0.026). Furthermore, BLASTn, RDP, USEARCH, and VSEARCH were tested for the analysis of DNA sequences. BLASTn yielded a single result, with six organisms identified and ten incorrect predictions. RDP provided a match with an average time of 00:00:35,92. USEARCH, used in its 32-bit version, had an average time of 00:01:05.83 per sequence. VSEARCH was the only tool able to access all six databases; however, the average time for each database varied from 00:51.56 to 06:43.12. The findings demonstrated that VSEARCH identified B. subtilis with the highest accuracy rate of 91.67%, followed by L. plantarum with an accuracy rate of 75%. The lowest accuracy rate was observed for Bacillus sp. at 41.67%. The potential of the V3 and V4 hyper-variable regions of the 16s-rRNA gene for bacterial identification was also investigated. The results suggested that the performance of the V3-V4 regions in bacterial identification is highly dependent on the sequence of the region, which varies among bacteria. While B. subtilis is highly likely to be identified successfully from its V3-V4 region, other bacteria such as L. rhamnosus may be confused with L. plantarum and L. paraplantarum because the V3 and V4 regions of all three bacteria are more than 90% identical in sequence. As a consequence, the software was unable to distinguish between them. The results of this analysis therefore provide evidence that the use of the V3-V4 regions is not recommended for bacterial identification.
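The accuracy rates quoted above are consistent with the scoring rule used for VSEARCH, in which the number of hits is divided by the maximum score of 12, multiplied by 100%, and treated as a positive confirmation when it exceeds 50%. The following Python sketch illustrates this rule; the hit counts of 11, 9 and 5 are back-calculated from the reported percentages as an assumption and are not values stated in this study, and the function name is illustrative.

```python
MAX_SCORE = 12       # maximum score used in the study
THRESHOLD = 50.0     # percentage above which a result counts as confirmed


def confirmation(hits: int, max_score: int = MAX_SCORE) -> tuple[float, bool]:
    """Return the score as a percentage and whether it exceeds the threshold."""
    if not 0 <= hits <= max_score:
        raise ValueError("hits must lie between 0 and max_score")
    percentage = round(100.0 * hits / max_score, 2)
    return percentage, percentage > THRESHOLD


if __name__ == "__main__":
    # Hit counts are assumptions inferred from the reported percentages.
    assumed_hits = {"B. subtilis": 11, "L. plantarum": 9, "Bacillus sp.": 5}
    for organism, hits in assumed_hits.items():
        percentage, confirmed = confirmation(hits)
        print(f"{organism}: {percentage}% confirmed={confirmed}")
```

Under these assumed hit counts, B. subtilis and L. plantarum pass the 50% threshold, whereas Bacillus sp. does not.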

Declaration of competing interest
The authors declare that they have no known financial or non-financial competing interests in any material discussed in this paper.

Funding information
This research was funded by the International University of Sarajevo, Bosnia and Herzegovina.