Monica Kay Royal

Applications of AI and AI Infrastructure

Hosted by AI Educators Organization and Yujian Tang


An event for AI/ML/Data professionals and aspiring AI/ML/Data professionals about the most cutting-edge technology!


First, an introduction to the creator of this event:

Yujian Tang, Ex Amazon AutoML Team | AI/ML Enthusiast | Top 100 r/Python Poster of All Time | LinkedIn ML/AWS/Python Top 5% Skill Badge | DEI Advocate


This was his first ever web conference all about the Applications of AI and AI Infrastructure and he did an amazing job at gathering the best group of speakers from the top tech companies in AI.


Additionally, we must give a huge shout out to Bill Liu and AICamp for working with Yujian and helping spread information about the most innovative uses for AI and AI Infrastructure!


The conference started precisely at 9am PT, just as scheduled. The room filled with individuals who I could only imagine were sitting on the edge of their seats like me, waiting with anticipation for more attendees to join before the opening statements. And let me tell ya, it was worth the wait!



Redefining State-of-the-Art with YOLOv5 & Vision AI


I love when a presentation starts off with a backstory, rather than making assumptions that everyone is on the same page and just jumping right in. Ayush did a fantastic job at this by first defining what he means by ‘State of the Art’. Because you can’t really redefine something that you don’t have a solid understanding of in the first place. He goes on to mention that many companies have technical metrics that cover a variety of aspects. However, at Ultralytics, they take the holistic approach and measure metrics that matter to its customers such as the ease of use, ease of deployment, and ease of solving real world problems. 👏🏻


He then jumps in (jumping is appropriate now) and asks the audience ‘What would you change in our world?’. Many people oftentimes respond by wanting to build something that doesn’t exist. This is cool and all, but Ultralytics focuses on building things which can and will solve the problems that already exist.


Solutions they offer:
  • Classification

  • Detection

  • Segmentation


Amazing Examples:

…removing plastic from the oceans?

With 93% precision from YOLOv5s, the tool was aimed at detecting plastic in the ocean, which then helped predict the locations of other pieces and thus assisted in their removal.


…stopping the spread of forest fires

Fires are especially difficult to contain because, in the time it takes to alert the authorities and for them to arrive at the scene, the fire has already spread and damage has occurred.

With 80.6% precision from YOLOv5s, detecting the early signs of a fire can help save both trees and wildlife.



YOLOv5 can be used in many other use cases (vehicle tracking & mask-wearing detection) in various industries (Healthcare, Aerospace, Manufacturing, Electronics, etc.)



Resources:

YOLOv5 is a completely free, open-source tool that can be accessed on their GitHub.
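To give a sense of how approachable it is, here is a minimal detection sketch using the torch.hub interface from the YOLOv5 README (the image filename below is just a placeholder):

```python
import torch

# Load the small pretrained YOLOv5 model (yolov5s) straight from the Ultralytics GitHub repo
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Run inference on a local image or URL (placeholder filename here)
results = model("ocean_debris.jpg")

# Summarize and inspect detections as a DataFrame (xmin, ymin, xmax, ymax, confidence, class, name)
results.print()
print(results.pandas().xyxy[0])
```

Swapping "yolov5s" for a larger variant such as "yolov5x" trades speed for accuracy.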



Data Quality in Healthcare

Miriam Seoane Santos, Developer Advocate at YData

This was my favorite session, hands down!

Miriam has a background in Biomedical Engineering and has worked in the medical industry which made her recognize the complexity of the collection and use of data in this domain. Miriam and I share an understanding that data quality is crucial to producing anything impactful, and it’s not an easy thing to achieve.

Many Healthcare AI articles only focus on the outcomes, but a lot of us data professionals know there is so much more!

I won’t go into the iceberg analogy here but >>




How complex is healthcare data, you may ask...

Miriam describes healthcare data as:

‘A perfect storm!’
  • Large amounts of data (?) sometimes produced within milliseconds

  • Handled by different people within institutions

  • Collected from multiple sources (heterogeneous data, unstructured data)

  • Recorded at different frequencies and with different formats: audio, video, image, text, sensor

  • Stored in decentralized databases

  • Data Quality Issues: True Imperfections versus Data Intrinsic Characteristics?

  • Data Acquisition, Transmission, and Storage versus Nature of Domains!

  • Special Concerns: Interpretability/Explainability, high-stakes domains (mistakes cost lives), privacy concerns and data availability, fairness and ethics concerns!

  • However: Great opportunity for Data-Centric AI and one that will perhaps offer the greatest advantages!


Miriam goes on to discuss some of the specific Data Quality Issues in Healthcare, with a disclaimer: these relate to structured/tabular data, are viewed from a development perspective, and may apply to other domains as well. I thought this was exceptionally interesting because she referenced DAMA and mentioned that they have upwards of 60 Data Quality Dimensions. I was aware of the core 6 and figured more existed, but 60! 😲


Miriam goes into four main issues she has experienced, but shared that others are worth exploring as well (dataset shift, noisy data, lack of data, irrelevant or redundant data, inconsistent data).


Data Quality Issues

I’m sharing her slides because the visuals and explanations are fabulous!

The last topic Miriam shared was Data Profiling. Pandas Profiling is an open-source Python module with which we can quickly do an exploratory data analysis with just a few lines of code. It auto-generates Data Quality Alerts and supports both Tabular and Time-Series data. More information can be found here.
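For a rough sense of how little code that takes, here is a minimal sketch (assuming a pandas DataFrame and the package installed under its current name, ydata-profiling; older releases import from pandas_profiling instead):

```python
import pandas as pd
from ydata_profiling import ProfileReport  # older versions: from pandas_profiling import ProfileReport

# Tiny placeholder dataset; any tabular DataFrame works
df = pd.DataFrame({
    "age": [34, 51, None, 29],
    "blood_pressure": [120, 140, 135, 118],
})

# Build the report; data quality alerts (missing values, high correlation, etc.) are generated automatically
profile = ProfileReport(df, title="Healthcare Data Profile")
profile.to_file("healthcare_profile.html")  # open the HTML file in a browser to explore
```

For time-series data, ydata-profiling also accepts a tsmode=True flag so the report treats rows as ordered observations.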


Resources

A really excellent skeptic puts the term ‘science’ into the term ‘data science’.

~ Cathy O’Neil, On Being a Data Skeptic




Applications of NLP on Text Data


Yujian has always been into AI, hence the creation of this conference, and considers himself to be an AI nerd! He also loves Natural Language Processing and it shows. This was a fantastic walkthrough of how NLP works and how you can use it in real life scenarios. He also gets into some neural network architectures, which honestly I could not do justice in a summary here, but will provide resources to his work at the end.


Introduction to NLP

Natural Language Processing is just that, the processing of natural language by a machine. But what exactly is Natural Language? There are two types:

  • Natural Languages (English, Spanish, etc.) are formed naturally over time and used to communicate. These languages do not have a specific set of rules; therefore machines have a hard time dealing with them

  • Formal Languages (Python, C, JavaScript, etc.) can be thought of as programming languages that follow a set of rules and are therefore easier for machines to handle


While it is most common for machines to process natural language, machines can also understand and generate it.

  • Natural Language Understanding is like Alexa. If you command your Alexa to turn on your kitchen lights, Alexa needs to understand what ‘turn on’ and ‘kitchen lights’ mean.

  • Natural Language Generation has recently gained popularity with ChatGPT, which takes a prompt and generates a written response. Advantages: it can generate anything such as a story or a summary. Disadvantages: it does not understand what language really means. It basically just generates words it has seen before, based on math (kind of like an imitation of natural language).


Applications of NLP

ChatGPT is just the most recent and popular application, but this could really be any chatbot from an ecommerce website. Also, fun fact: ELIZA was one of the original conversational AIs, a chatbot that simulates a Rogerian psychotherapist.


Areas of Application

Finance

  • Risk Assessment

  • Accounting/Auditing

  • Portfolio Optimization

  • Financial Document Analysis

Automation

  • Supply Chain Documentation

  • Insight Discovery

  • Entity Discovery

  • Sentiment Analysis

Reputation

  • News Outlet Sentiment

  • Social Media Sentiment

  • Linking Mentions


Challenges with NLP
  • Context: there are a lot of issues that come from context, mainly because machines do not understand context like us humans (I would argue that we humans are not the best at it sometimes either)

  • Processing: machines can do this faster than humans but there is still a challenge in processing things correctly

  • Tone of Voice: a lot of language comes from tone and body language which machines cannot identify from text data

  • Domain Specificity: this is especially important in Finance since there is a lot of jargon used and in Reputation like social media because of slang


NLP can be used in a lot of real-world scenarios. One example Yujian shared was using sentiment analysis to help gauge which sports team is more likely to win; he looked at the NFL and the World Cup.
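Yujian's own posts walk through his setup in detail; purely as an illustration of the sentiment-analysis step, here is a minimal sketch using the Hugging Face transformers pipeline (not necessarily the tooling he used, and the example texts are made up):

```python
from transformers import pipeline

# Default sentiment model (a DistilBERT fine-tuned on SST-2); weights download on first run
classifier = pipeline("sentiment-analysis")

posts = [
    "Our defense looked unstoppable tonight!",
    "Another injury... this season is falling apart.",
]

# Each result is a dict with a POSITIVE/NEGATIVE label and a confidence score
for text, result in zip(posts, classifier(posts)):
    print(f"{result['label']:>8} ({result['score']:.2f})  {text}")
```

Aggregating scores like these over many fan posts per team is one simple way to turn raw text into a "who's favored" signal.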


Resources

Blog articles


Forging Your Tech Career

Hailey Yoon, Co-Founder & CTO of IO21

Hailey has a unique background, which led to her being named one of 50 technology leaders in the Middle East by Engati and to Google’s Women Tech Founders MENA 2021. She is the Co-Founder of IO21, which was selected among the 101 top Arab-world startups for website development. In college, she never thought she would become a software engineer in industry, but she enjoyed her experiences working on large-scale projects and loves sharing her journey.


Tips to Forging Your Own Tech Career:

Explore options and collect helpful information
  • Googling is good, but should only be a start.

  • Use credible resources such as papers and talking with colleagues. This is where you really get to learn more about which career, industry, and role you would like to take.

  • Be careful of people sharing exaggerations on social media

Plan, plan, plan
  • It is ok if you do not complete 100% of your plan

  • The plan can be viewed as guidance, completing 70-80% is still progress

Make your goals manageable (you don’t need to have everything figured out)
  • You may want a drastic/immediate change, but you cannot change overnight

  • Choose small milestones at first and then build on them

Don't be scared to ask for help
  • Many women in tech struggle with this and can spend a lot of time trying to figure things out independently

  • Changing this mindset can change your lifestyle

Learn how to handle difficult situations and learn from the experience
  • Practice soft skills and interacting with people

  • Some interactions can be uncomfortable, but it can help if you educate others

Avoid distraction
  • Focus on your career and keep track of your goals

  • Think about using a habit tracker (UpWork is Hailey’s favorite)

  • Enjoy your life but keep track of your progress

Sharpen your skills
  • These are very important to develop so you do not become outdated, especially in the tech field as things change quickly

  • Soft skills: clear communication, positive attitude, empathy, conflict resolution, taking responsibility, sense of humor, etc.

  • Tech skills: learn new technologies, stay on top of new trending tools, work on hands-on projects, etc.



Avoid an MLOops with ML Monitoring

Sage Elliott, Technical Evangelist at WhyLabs

Sage loves machine learning and MLOps, but does not enjoy the occasional MLOops. What is an ‘MLOops’? You might have experienced one when you try to buy flowers from an online shop but the app recommends a bell pepper substitution. I would rather receive the bell pepper, but that is a personal preference 😂 and it does not excuse the oops made by the recommendation model.


This actually happens quite often in the wild with other models, such as QA analysis, price predictions, and credit approvals, and these types of mistakes can come from many origins.


Reasons for the oops:
  • Data Quality: external changes, schema changes, pipeline bugs

  • Data Drift: seasonality changes, new group of users, consumer preferences have changed

  • Concept Drift: housing market, COVID-19

  • Data & Model Bias


ML Monitoring can help with these situations, but it is important to know that if you are not monitoring, someone else is!


Best case: dedicated monitoring set-up, alerts on-call engineer on undesirable behavior (nice!)

Average case: downstream systems that use the ML model’s outputs alert the on-call engineer (ok…)

Worst case: the end user, end customer, or ‘bystander’ in the physical world alerts you (yikes!)

Disaster case: the quarterly-business-review maps $ losses to the undesirable behavior (someone is in trouble!)


You can create the best case by setting up an Observability System / AI Observability Platform, which should include ML pipeline triggers and a data viz layer for team workflows. The best time to start monitoring is as soon as the model is in production, and possibly sooner if your environment allows.


WhyLabs offers data observability options to monitor, improve, and trust your AI applications, data streams, data pipelines, and ML models. It also offers an open source library for logging any kind of data, called whylogs.
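For a feel of what that looks like in code, here is a minimal sketch based on the whylogs quickstart (the column names and values are made up):

```python
import pandas as pd
import whylogs as why

# A batch of recent model inputs/outputs (placeholder data)
df = pd.DataFrame({
    "item": ["flowers", "flowers", "bell pepper"],
    "predicted_rating": [4.8, 4.6, 1.2],
})

# Log the batch; the resulting profile is a compact statistical summary, not the raw data
results = why.log(df)
profile_view = results.view()

# Inspect per-column statistics (counts, types, distribution summaries)
print(profile_view.to_pandas())
```

Profiles like this can be generated on a schedule and compared over time, which is where drift and data quality alerts come from.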


Resources



DataPrepOps: Making Data ML-Ready

Jennifer Prendki, Founder and CEO at Alectio

Jennifer started us off by giving us a 2022 Retrospective of the good, the bad, and the ugly!


The Good:

This was particularly interesting to me because I hadn’t heard of Imagen before today, but apparently this outperforms DALL-E 2 on the COCO benchmark (COCO is a large-scale object detection, segmentation, and captioning dataset).


The Bad:

Hacker News Bot reports Azure has run out of compute


The Ugly:

Uniprocessor performance (single core) is starting to plateau, which might be the end to Moore’s law


Data-centric AI became popular a few years ago and is defined as the discipline of systematically engineering the data used to build an AI system with the main premise that not all data is created equal.


It was typical that when a model was not working as intended, the resolution was to re-tune the model. With the recent shift to low-code/no-code, models are viewed as more of a static element. Therefore, the new approach to improving a model's performance is to go back to the data. First, make sure the data is still where you expect it and can be accessed in the same manner. Then, think of ways you can alter the data for improvements. This can include selecting different records, generating data, modifying labels, and/or removing records.
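As a toy illustration of what "going back to the data" can look like in practice (a hypothetical pandas DataFrame with made-up columns and labels):

```python
import pandas as pd

df = pd.DataFrame({
    "text":  ["great product", "great product", "terrible", "meh"],
    "label": ["positive",      "positive",      "positive", "neutral"],
})

# Remove exact duplicate records
curated = df.drop_duplicates()

# Fix a suspected labeling error found during review
curated.loc[curated["text"] == "terrible", "label"] = "negative"

print(curated)
```

The same idea scales up: the fixes become programmatic (deduplication, relabeling rules, sampling strategies, synthetic generation) rather than hand edits.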


Some groups out there are still not fully on board with using generated (synthetic) data. However, through the analysis of model accuracies, experiments have supported the idea that data creation has a real benefit. In fact, benefits of data curation include saving on labeling, computational resources, operational costs, reducing training time, boosting model performance, diagnosing model issues, and explaining the model.


One topic that Jennifer did a deep dive into was Data Labeling. This was a topic near and dear to me as I have been on projects which require manual data labeling, so I understand the pains and struggles.


Look at these pictures and imagine how you would label each.

One of them made me literally laugh out loud 🤣 Just when we thought we successfully defined hot dogs as sandwiches, now we have to fight the good fight again and prove they can be cars too? 🌭



As you can see, manual data labeling is complex and time intensive. However, there are other approaches which include self-supervised learning, data augmentation, and our gold star: synthetic data generation. But no matter the approach, it is important that the process is quick and accurate so the company can achieve the best ROI.




Semantic Models


Way to end with a bang! Michelangiolo wrapped up this conference with one of the most educational sessions I have attended in a while. This is another one that I could not do justice with a summary, he definitely knows his stuff. With a background in quantitative finance, specialization in cloud and data science, and experience with some top AI companies, he created Goliath AI Consulting to help companies adopt AI technologies. Michelangiolo talked about the progress of NLP and gave us a thorough walkthrough of language models, including Embeddings vs Language Models Output, Static Embeddings, and Dynamic Embeddings.


NLP first started gaining traction around 10 years ago, with word2vec, developed by Google. At this point in time the technology could only encode individual words and had only about 300 dimensions. After the creation of BERT in 2018 (another Google development), NLP models now leverage semantic models which Michelangiolo refers to as the heart of NLP.
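To make the static-embedding idea concrete, here is a small sketch using gensim's pretrained 300-dimensional word2vec vectors (the query words are just examples); every word gets one fixed vector regardless of context:

```python
import gensim.downloader as api

# Download Google's pretrained word2vec vectors on first use (large download); 300 dimensions per word
wv = api.load("word2vec-google-news-300")

print(wv["doctor"].shape)                 # (300,) -- one static vector per word
print(wv.most_similar("doctor", topn=3))  # nearest neighbours in the embedding space
print(wv.similarity("doctor", "nurse"))   # cosine similarity between two words
```

Contextual models like BERT instead produce a different vector for the same word depending on the sentence it appears in, which is the usual meaning of the "dynamic embeddings" mentioned above.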


This is when I felt like I teleported into Tony Stark's lab while Michelangiolo explained how these models work: weighting different types of words and plotting them on what looked to be a 3D plane to visually represent words that are related to one another.


I left with mixed feelings. On one hand, I now realize the complexity of the inner workings of NLP models, they are fascinating! On the other hand, I feel a bit less impressed with ChatGPT (hot take, I know, that’s what I’m here for!) 😂


Common NLP Software

Resources



Closing Statements

I want to congratulate Yujian for hosting his first conference! There was a great turnout, the presentations were enlightening, and the attendees seemed to all have a great time! Hope to see more from you in the future!



Thank you for reading, Thank you for supporting


and as always,
Happy Learning!!
